BigQuery Public Datasets: Practical Application in Real World Circumstances

Hannah
5 min readApr 11, 2020

--

Unentitled by ev from Unsplash

Access to data is becoming more accessible, applicable and relevant to all of us, and as someone who finds data and its applications fascinating, what could be better than data, but big data! However, being able to access, store and maintain this data and then find and use the right tools correctly to analyse it can prove tricky, so I want to introduce one option to access high-quality data via BigQuery Public Datasets.

First things first…how can we access all this big data?

The BigQuery public datasets page, currently hosts 141 projects, just two years ago that number was around 40, therefore we can see how data tracking is exponential and estimated to double each year. A collection of tables is called a dataset and a project is a collection of datasets. Within each dataset you will find a series of tables which are accessed by thousands of queries from hundreds of projects from all over the world — you can do some pretty inspiring things with this stuff.

BigQuery Public Datasets offers the first terabyte(TB) of data for free, which is great to allow anyone to get set up and start testing to check functionality and relevance to your application . Datasets are hosted, maintained and updated by Google BigQuery at their expense, however, should you wish to go over the 1 TB limit you will have to add billing details and review the specific cost to your dataset.

BigQuery Public Datasets in Application

An interesting smaller dataset to kick off with is the Sunroof Solar project . It contains 8.66 MB of data in nearly 12,000 rows regarding the feasibility of the rooftops of US households for solar panel installation based on the quality of their kilowatt(kW) energy yield. The data is currently only available for the US and appears to be based off Google maps meaning the feasibility to scale-out and calculate internationally using Google’s datasets is high, therefore should such datasets be rolled out in future it could be highly impactful in countries with low access to electricity in addition to showcasing countries such as Costa Rica which are already 100% powered by renewable sustainable energy. Going into the dataset we can see the tables divided into latitude, longitude, yearly sunlight kW threshold average, number of median panels, percent covered and percent qualified so we can start asking questions like “What is the average of households qualified nationally?”, “What is the yearly sunlight kW threshold average in town/city/state X?”, “What is the number of total panels when yearly sunlight kWh threshold average is greater than X and percent currently covered is less than Y?”. BigQuery makes those answers accessible, click on the project you are interested in, select the dataset and then select the tables and “preview” option of your records to understand how the tables are named, ordered and organised and whether their values are integers, strings, etc. Once you have decided what questions you would like to ask you can use a standard SQL query, (further info on standard SQL commands), to answer the specific questions you have about the dataset:

What is the average of households qualified nationally?

What is the yearly sunlight kW threshold average in North Carolina?

What is the number of total panels when yearly sunlight kWh threshold average is greater than 1000 and percent currently covered is less than 50%?

Larger Data Sets:

Now let’s take a look at a larger BigQuery Public Dataset by looking at the World Development Indicators from the World Bank. This table contains 743.35 MB and approximately 7.4M rows of global development data collected from the early 1960’s to present offering indicators such as agriculture and food security, climate change, population growth, economic growth, education, energy, natural resources and many more. So again, after taking a look at the preview inside the table to see what sort of information is contained, let’s outline a few questions we are interested in getting accurate data to elaborate on.

Which countries have a population access to electricity of less than 50%(arbitrary example figure)?

“Which countries have net energy imports of greater than 50% and electricity production from natural gas and hydroelectric sources of less than 10%?”

Which countries have greater than 30% primary aged female students out of school and compulsory education duration of less than or equal to 10 years?

Further Application in the Current Context:

Given current circumstances surrounding the outbreak of COVID-19 I would also like to point out the applications of a repository of aggregated coronavirus COVID-19 cases by John Hopkins University to demonstrate the extensive aggregation potential and process of potentially applying similar data to your project. Unfortunately, for undisclosed reasons, as per 03/23/2020 this BigQuery dataset itself is no longer being maintained, therefore I cannot use it in a live query examples, however, going into the dataset we can see the tables divided into confirmed cases, deaths and recoveries and a continuing working model tracking this data which demonstrates the incredible application and the great potential of all of the third parties accessing this dataset to provide crucial lifesaving information in their relevant circumstances. Reference link here for further details:

To sum up, BigQuery Open Datasets is an incredibly valuable resource if you are interested in exploring public data and practicing your data analysis skills or directly utilising for practical applications. There is a wealth of additional information with specific case studies which can be found on BigQuery’s public datasets documentation. So if this peaks your interest, check out some data via public datasets and get some queries on the go!

--

--