How I Solved The Challenge Lab: Insights from Data With BigQuery
BigQuery is a data warehouse on Google Cloud that you can use to store and analyze data. Imagine that you have a huge amount of data (petabytes) with many columns and hundreds of thousands of rows, yet you can still run queries and get insights from the data in just a couple of seconds. I don't know about you, but that impressed me the first time I ran an example SQL query. You can also use BigQuery to train machine learning models using BigQuery Machine Learning (BQML). BigQuery is useful for marketing professionals, managers, decision-makers, data analysts, machine learning engineers, and other data professionals.
You read the title, How I Solved the Challenge Lab, and you might be wondering what that means. Let me explain briefly. A challenge lab is part of a quest in Qwiklabs, and it usually comes at the end of the quest. If you solve the challenge lab successfully, you earn a special skill badge.
A quest in Qwiklabs consists of 5–7 labs, and each lab has a timer and step-by-step guidance from Google that is easy to follow. In the challenge lab, however, there is no step-by-step guidance: you must find the solution yourself. I love learning GCP with Qwiklabs, and it has impressed me ever since I started learning GCP in 2019.
As I wrote in a Medium article sharing my experience of using GCP with Qwiklabs, the labs are very interactive and a great way to explore a subject you want to know much more about. Every time you finish a quest, you get a badge to show you completed it.
The following is a sample Quest with some labs and the challenge lab.
Quest Insights From Data With BigQuery
Introduction to SQL for BigQuery and Cloud SQL
BigQuery: Qwik Start — Console
Exploring Your Ecommerce Dataset with SQL in Google BigQuery
Troubleshooting Common SQL Errors with BigQuery
Explore and Create Reports with Data Studio
Insights from Data with BigQuery: Challenge Lab
The following is a sample badge earned without the challenge lab: the BigQuery Basics for Data Analysts badge is part of the Insights From Data With BigQuery quest, minus the challenge lab.
Qwiklabs was founded in Feb 2012 by Nidhi Aggarwal, a tech entrepreneur from a traditional Indian family. Google acquired the company on Nov 21, 2016. The platform enables developers and IT professionals to gain hands-on experience using cloud platforms and software.
Prerequisites:
BigQuery supports standard SQL for querying data. SQL stands for Structured Query Language. Don't worry if that doesn't make sense right now; I'll briefly review the basics of SQL syntax and point you to additional resources so you can dive in deeper.
I will cover these topics in this article:
- Why is BigQuery an important tool to use in analysis?
- What is BigQuery?
- How can we use BigQuery?
- Learning SQL Syntax basics
- Loading Data Into BigQuery using Public Dataset
- Let’s get started Analyzing The Data Using BigQuery.
- Summary.
- Additional References
- About the Author
Okay, you might be very curious about these topics. Let’s discuss.
Why is BigQuery an important tool?
BigQuery is a data warehouse system that manages large amounts of data without you having to manage any hardware.
Users can also share datasets with users outside of their project.
According to Spotify engineer Paul Lamere, BigQuery can handle a large amount of data quickly and cheaply.
(Lakshmanan, V., 2019)
What is BigQuery?
The definition of BigQuery in the introduction above was a little short. BigQuery is a data warehouse on Google Cloud that you can use to store and analyze data using standard SQL.
BigQuery is fully managed and serverless.
BigQuery is fully managed. To get started, you don’t need to deploy any resources, such as disks and virtual machines.
What does "serverless" mean? BigQuery uses a serverless compute architecture that separates compute from storage, so you can run queries without needing to manage any infrastructure; Google provides it for you.
It enables you to carry out analyses that process aggregations over the entire dataset in seconds to minutes.
You can ingest data in various formats, such as CSV, JSON, and Avro. BigQuery also supports native machine learning using BigQuery Machine Learning (BQML) for predictive analytics, as well as geospatial analysis.
(Lakshmanan, V., 2019)
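To give you a flavor of BQML, here is a minimal sketch of training a model entirely in SQL against the public natality sample table that we will also query later in this article; the dataset and model name mydataset.natality_model are hypothetical placeholders, not from any lab:
#standardSQL
-- Train a linear regression model that predicts birth weight
CREATE OR REPLACE MODEL `mydataset.natality_model`
OPTIONS(model_type = 'linear_reg', input_label_cols = ['weight_pounds']) AS
SELECT weight_pounds, gestation_weeks, state
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds IS NOT NULL AND gestation_weeks IS NOT NULL;
Once trained, you can score new rows with ML.PREDICT, all without leaving SQL.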
How can we use BigQuery?
There are four ways to use BigQuery: through the BigQuery web UI in the Google Cloud Console, through the REST API, with the bq command-line tool, and with third-party tools.
From the BigQuery web UI
(console.cloud.google.com)
Using the REST API with a variety of client libraries, such as Java, .NET, or Python. BigQuery provides several available APIs.
Using the bq command-line tool.
- bq ls lists the datasets in your default project; to list the datasets in a specific project, add the project ID followed by a colon (:), for example bq ls publicdata:
- bq mk creates a new dataset.
- bq load creates or updates a table and loads data in a single step.
- bq rm removes a dataset.
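For example, here is a minimal, hypothetical sketch that creates a dataset, loads a CSV file from Cloud Storage into a table (using the babynames dataset and schema that appear later in this article, with the bucket name as a placeholder), and then lists the dataset's tables:
bq mk babynames
bq load babynames.names_2014 gs://<your_bucket>/yob2014.txt name:string,gender:string,count:integer
bq ls babynames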
The following is another bq command-line example:
bq query --nouse_legacy_sql \
'SELECT
COUNT(*)
FROM
`bigquery-public-data`.samples.shakespeare'
(source: https://cloud.google.com/bigquery/docs/bq-command-line-tool)
Third-party tools
You can also use third-party tools to load data into BigQuery.
(Source: https://cloud.google.com/bigquery)
Learning SQL Syntax basics
In this section, we cover why you should learn SQL and the basics of its syntax.
We focus on standard SQL. The BigQuery user interface (UI) on the Google Cloud Platform (GCP) Cloud Console defaults to standard SQL.
I’ll give you some example queries. This is how you can select a row from a table. But SQL can do more, and here is a list of the resources I found helpful when I was learning SQL.
If you are not familiar with SQL yet, you can learn it from the following references:
“SQL Cookbook” By Anthony Molinaro, O’Reilly Publisher.
“Learning SQL” By Alan Beaulieu, O’Reilly Publisher.
You can also practice SQL with O'Reilly's interactive learning environment.
The relational approach to database management is built on a mathematical model that includes intimidating-sounding terms like relational algebra and relational calculus.
E. F. Codd, the inventor of the relational model, has developed a detailed list of criteria that implementations of the model must meet. (Practical SQL Handbook, The: Using SQL Variants, Fourth Edition. By Judith S. Bowman, Marcy Darnovsky, and Sandra L. Emerson)
“The SQL language includes two distinct sets of commands: Data Definition Language (DDL) is the subset of SQL used to define and modify various data structures, while Data Manipulation Language (DML) is the subset of SQL used to access and manipulate data contained within the data structures previously defined via DDL. DDL includes numerous commands for handling such tasks as creating tables, indexes, views, and constraints, while DML is comprised of just four statements:
INSERT
Adds data to a database.
UPDATE
Modifies data in a database.
DELETE
Removes data from a database.
SELECT
Retrieves data from a database.” (Mishra, S., & Beaulieu, A., 2004)
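To make those four statements concrete, here is what each might look like against a small hypothetical customers table (the table and column names are mine, not from the book):
-- DML examples against a hypothetical customers(id, name) table
INSERT INTO customers (id, name) VALUES (1, 'Alice');
UPDATE customers SET name = 'Alicia' WHERE id = 1;
SELECT id, name FROM customers;
DELETE FROM customers WHERE id = 1;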
According to Google BigQuery: The Definitive Guide by Valliappa Lakshmanan and Jordan Tigani:
“BigQuery is primarily a data warehouse into which you will typically load or stream data and not modify it. However, BigQuery does provide the ability to modify data: DML lets you insert, update, delete, and merge data into tables.
Tip
BigQuery is an analytical database, not an online transaction processing (OLTP) database. As such, it is not designed for very-high-frequency DML updates. If you find that you have a lot of data that you want to update, consider performing the updates in batches. One approach is to create a staging table with the updates that you want to apply, and then you can run a single UPDATE or MERGE statement to perform all of those changes in a single operation.”
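As a sketch of that batch-update approach (the table and column names here are hypothetical, not from the book), a staging table plus a single MERGE statement in BigQuery standard SQL might look like this:
-- Apply all staged changes to the target table in one DML operation
MERGE `mydataset.customers` AS t
USING `mydataset.customer_updates` AS s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN
INSERT (id, name) VALUES (s.id, s.name);
This applies every staged change in a single operation instead of many small, high-frequency updates.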
The majority of database performance issues are caused by inefficient DML statements. So why should you care about SQL? In this age of Internet computing, does anyone even care about data access anymore? The following section will look at the origins of the SQL language as well as the SQL standard’s support. (Mastering Oracle SQL By Sanjay Mishra and Alan Beaulieu).
In general, data for analysis and statistics is stored in databases, and SQL is the language used to query that data. Databases manage data in tables, which consist of columns and rows.
You will become familiar with SQL syntax like this:
SELECT * FROM table_name;
You use SELECT and FROM to specify the table holding the data you need to analyze.
Another SQL example:
SELECT customerName
FROM tableCustomer;
You use SELECT to choose which columns to retrieve and FROM to specify which table they come from.
Another SELECT statement:
SELECT column
FROM table
WHERE condition
ORDER BY column [ASC | DESC];
Or another SELECT statement:
SELECT * FROM table LIMIT [number];
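Putting those templates together, here is a concrete, runnable query against the public Shakespeare sample table used in the bq example earlier; it combines SELECT, FROM, WHERE, ORDER BY, and LIMIT:
#standardSQL
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = 'hamlet'
ORDER BY word_count DESC
LIMIT 5;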
Loading Data Into BigQuery using Public Dataset
You may be wondering: what exactly is a public dataset? What is it used for?
The Google Cloud documentation explains public datasets as follows:
“A public dataset is any dataset that is stored in BigQuery and made available to the general public through the Google Cloud Public Dataset Program. The public datasets are datasets that BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via a project. You pay only for the queries that you perform on the data. The first 1 TB per month is free, subject to query pricing details.
Public datasets are available for you to analyze using either legacy SQL or standard SQL queries. Use a fully qualified table name when querying public datasets, for example, bigquery-public-data.bbc_news.fulltext.
You can access BigQuery public datasets by using the Cloud Console, by using the bq command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.
You can find more details about each dataset by clicking the dataset’s name in the Datasets section of Cloud Marketplace.”
(Source: https://cloud.google.com/bigquery/public-data).
You can also find other free public datasets on other websites:
UCI Machine Learning Repository
Why should you use that public dataset?
You can use public datasets for practice, analyzing the data with either legacy or standard SQL.
You can access and integrate it into your application.
In the examples here, you'll use BigQuery for data collection and feature engineering, with data from Google Cloud Public Datasets, a set of free, publicly available data.
(See the Google explanation about the public dataset above)
Learn more about Google Dataset
According to cloud.google.com, there are several ways to ingest data into BigQuery:
- Batch load a set of data records.
- Stream individual records or batches of records.
- Use queries to generate new data and append or overwrite the results to a table (see the sketch after this list).
- Use a third-party application or service.
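As a sketch of the third method, a query result can be written straight into a new table with CREATE TABLE ... AS SELECT; the destination dataset name mydataset below is a hypothetical placeholder, while the source is the public natality sample table queried later in this article:
-- Materialize the 100 heaviest recorded births into a new table
CREATE TABLE `mydataset.heaviest_births` AS
SELECT weight_pounds, state, year
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds IS NOT NULL
ORDER BY weight_pounds DESC
LIMIT 100;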
Let’s get started Analyzing The Data Using BigQuery.
In this section, we discuss analyzing data using BigQuery. I'll walk you through solving a Qwiklab, and I even made videos about it on my YouTube channel. We will look at how to use and implement standard SQL in BigQuery through two videos: BigQuery Qwik Start Console (with Google's step-by-step guidance) and Insights from Data with BigQuery Challenge Lab (where we must find the solution ourselves).
VIDEO TUTORIAL BigQuery Qwik Start Console.
This tutorial uses Qwiklabs. Google gives us guidance step by step so we can follow it easily.
In this lab, you will learn:
- How to compose queries in BigQuery
- How to load data into BigQuery
Composing a query in BigQuery
Please look at this SQL:
#standardSQL
SELECT
weight_pounds, state, year, gestation_weeks
FROM
`bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC LIMIT 10;
This is a standard SQL. Can you see the SQL syntax basics?
weight_pounds, state, year, and gestation_weeks are the column names in the samples.natality table. Note that they are separated by commas; you use commas whenever you select multiple columns.
bigquery-public-data shows that this query uses sample data from a public dataset that is available in BigQuery.
You can browse the schemas of other public datasets in BigQuery by clicking + ADD DATA > Explore public datasets, then searching for "bigquery public data" in the Search field.
Then, look at this part of the syntax above:
ORDER BY weight_pounds DESC LIMIT 10;
This sorts the results in descending order by the weight_pounds column, with LIMIT 10 restricting how many rows appear in the result.
You can use the ASC or DESC keyword.
ASC: to sort the data in ascending order.
DESC: to sort the data in descending order.
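For instance, flipping DESC to ASC in the natality query above returns the lowest birth weights instead of the highest (a WHERE filter is added here so that NULL weights do not sort first):
#standardSQL
SELECT weight_pounds, state, year, gestation_weeks
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds IS NOT NULL
ORDER BY weight_pounds ASC LIMIT 10;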
The next steps are to create a dataset and table, add custom data to your project (in a storage bucket), and load the data from the bucket into the table. Creating a dataset in BigQuery is very easy; you can see how in the video. First, download and unzip the source file:
gsutil cp gs://spls/gsp072/baby-names.zip .
unzip baby-names.zip
Then copy the data file into a bucket you have created:
gsutil cp yob2014.txt gs://<your_bucket>
Both commands are run on the command line in Cloud Shell. The gsutil tool provides commands such as mb and cp to perform operations. You can use gsutil for the following (see the example after this list):
Creating and deleting buckets.
Uploading, downloading, and deleting objects.
Listing buckets and objects.
Moving, copying, and renaming objects.
Editing objects and bucket ACLs.
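For example, the mb (make bucket) command creates a new bucket; the bucket name below is a placeholder that you must replace with a globally unique name:
gsutil mb gs://<your_bucket>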
In this step, Load the data into a new table, you create a new table inside the dataset. You just need to fill in the following fields in the Create table dialog:
- Create table from: Google Cloud Storage
- Select file from GCS bucket: <bucket_name>/yob2014.txt (replace <bucket_name> with the name of the bucket you created earlier)
- File format: CSV
- Table name: names_2014
- Schema > Edit as text: slide on, then add the following in the textbox: name:string,gender:string,count:integer
Then, you query the custom dataset with this standard SQL:
#standardSQL
SELECT
name, count
FROM
`babynames.names_2014`
WHERE
gender = 'M'
ORDER BY count DESC LIMIT 5;
SELECT name, count. You can see that there are two column names here, name and count. Remember, you use the SELECT statement to choose which columns to return.
FROM `babynames.names_2014`. This indicates the dataset you created earlier (babynames) and the new table you created inside it (names_2014).
WHERE gender = 'M'. The WHERE clause filters records on a specified condition.
ORDER BY count DESC LIMIT 5; The ORDER BY keyword sorts the results in ascending or descending order, and LIMIT 5 restricts the result to 5 rows.
VIDEO TUTORIAL Insights from Data with BigQuery Challenge Lab
Overview
You must complete a series of tasks within the allocated time period. Instead of following step-by-step instructions, you’ll be given a scenario and a set of tasks — you figure out how to complete it on your own! An automated scoring system (shown on this page) will provide feedback on whether you have completed your tasks correctly.
To score 100% you must complete all tasks within the time period!
When you take a Challenge Lab, you will not be taught Google Cloud concepts. To build the solution to the challenge presented, use skills learned from the labs in the quest this challenge lab is part of. You will be expected to extend your learned skills; you will be expected to change broken queries.
This lab is recommended for students who have enrolled in the Insights from Data with BigQuery quest. Are you ready for the challenge?
Scenario
You’re part of a public health organization that is tasked with identifying answers to queries related to the Covid-19 pandemic. Obtaining the right answers will help the organization in planning and focusing healthcare efforts and awareness programs appropriately.
The dataset and table that will be used for this analysis will be bigquery-public-data.covid19_open_data.covid19_open_data. This repository contains country-level datasets of daily time-series data related to COVID-19 globally. It includes data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, and weather.
Query 1: Total Confirmed Cases
Build a query that will answer “What was the total count of confirmed cases on Apr 15, 2020?” The query needs to return a single row containing the sum of confirmed cases across all countries. The name of the column should be total_cases_worldwide.
Columns to reference:
- cumulative_confirmed
- date
Task 1: Total Confirmed Cases
SELECT sum(cumulative_confirmed) as total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date='2020-04-15'
Query 2: Worst Affected Areas
Build a query for answering “How many states in the US had more than 100 deaths on Apr 10, 2020?” The query needs to list the output in the field count_of_states. Hint: Don’t include NULL values.
Columns to reference:
- country_name
- subregion1_name (for state information)
- cumulative_deceased
Task 2: Worst Affected Areas
with deaths_by_states as (
SELECT subregion1_name as state, sum(cumulative_deceased) as death_count
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="United States of America" and date='2020-04-10' and subregion1_name is NOT NULL
group by subregion1_name
)
select count(*) as count_of_states
from deaths_by_states
where death_count > 100
Query 3: Identifying Hotspots
Build a query that will answer “List all the states in the United States of America that had more than 1000 confirmed cases on Apr 10, 2020?” The query needs to return the State Name and the corresponding confirmed cases arranged in descending order. Name of the fields to return state and total_confirmed_cases.
Columns to reference:
- country_code
- subregion1_name (for state information)
- cumulative_confirmed
Task 3: Identifying Hotspots
SELECT * FROM (
SELECT subregion1_name as state, sum(cumulative_confirmed) as total_confirmed_cases
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_code="US" AND date='2020-04-10' AND subregion1_name is NOT NULL
GROUP BY subregion1_name
ORDER BY total_confirmed_cases DESC
)
WHERE total_confirmed_cases > 1000
Query 4: Fatality Ratio
Build a query that will answer “What was the case-fatality ratio in Italy for the month of April 2020?” Case-fatality ratio here is defined as (total deaths / total confirmed cases) * 100. Write a query to return the ratio for the month of April 2020 and containing the following fields in the output: total_confirmed_cases, total_deaths, case_fatality_ratio.
Columns to reference:
- country_name
- cumulative_confirmed
- cumulative_deceased
Task 4: Fatality Ratio
SELECT sum(cumulative_confirmed) as total_confirmed_cases, sum(cumulative_deceased) as total_deaths, (sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" AND date BETWEEN '2020-04-01' AND '2020-04-30'
Query 5: Identifying specific day
Build a query that will answer: “On what day did the total number of deaths cross 10000 in Italy?” The query should return the date in the format yyyy-mm-dd.
Columns to reference:
- country_name
- cumulative_deceased
Task 5: Identifying specific day
SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" and cumulative_deceased>10000
order by date asc
limit 1
Query 6: Finding days with zero net new cases
The following query was written to identify the number of days in India, between 21 Feb 2020 and 15 March 2020, on which there was zero increase in the number of confirmed cases. However, it is not executing properly. You need to update the query to complete it and obtain the result:
WITH india_cases_by_date AS (
SELECT
date,
SUM(cumulative_confirmed) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="India"
AND date between '2020-02-21' and '2020-03-15'
GROUP BY
date
ORDER BY
date ASC
)
, india_previous_day_comparison AS
(SELECT
date,
cases,
LAG(cases) OVER(ORDER BY date) AS previous_day,
cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)
Task 6: Finding days with zero net new cases
WITH india_cases_by_date AS (
SELECT
date,
SUM( cumulative_confirmed ) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="India"
AND date between '2020-02-21' and '2020-03-15'
GROUP BY
date
ORDER BY
date ASC
)
, india_previous_day_comparison AS
(SELECT
date,
cases,
LAG(cases) OVER(ORDER BY date) AS previous_day,
cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)
select count(*)
from india_previous_day_comparison
where net_new_cases=0
Query 7: Doubling rate
Using the previous query as a template, write a query to find out the dates on which the confirmed cases increased by more than 10% compared to the previous day (indicating doubling rate of ~ 7 days) in the US between the dates March 22, 2020 and April 20, 2020. The query needs to return the list of dates, the confirmed cases on that day, the confirmed cases the previous day, and the percentage increase in cases between the days. Use the following names for the returned fields: Date, Confirmed_Cases_On_Day, Confirmed_Cases_Previous_Day and Percentage_Increase_In_Cases.
Task 7: Doubling rate
WITH us_cases_by_date AS (
SELECT
date,
SUM(cumulative_confirmed) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="United States of America"
AND date between '2020-03-22' and '2020-04-20'
GROUP BY
date
ORDER BY
date ASC
)
, us_previous_day_comparison AS
(SELECT
date,
cases,
LAG(cases) OVER(ORDER BY date) AS previous_day,
cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases,
(cases - LAG(cases) OVER(ORDER BY date))*100/LAG(cases) OVER(ORDER BY date) AS percentage_increase
FROM us_cases_by_date
)
select Date, cases as Confirmed_Cases_On_Day, previous_day as Confirmed_Cases_Previous_Day, percentage_increase as Percentage_Increase_In_Cases
from us_previous_day_comparison
where percentage_increase > 10
Query 8: Recovery rate
Build a query to list the recovery rates of countries, arranged in descending order (limit to 10), up to May 10, 2020. Restrict the query to countries having more than 50K confirmed cases. The query needs to return the following fields: country, recovered_cases, confirmed_cases, recovery_rate.
Columns to reference:
- country_name
- cumulative_confirmed
- cumulative_recovered
Task 8: Recovery rate
WITH cases_by_country AS (
SELECT
country_name AS country,
sum(cumulative_confirmed) AS cases,
sum(cumulative_recovered) AS recovered_cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
date = '2020-05-10'
GROUP BY
country_name
)
, recovered_rate AS
(SELECT
country, cases, recovered_cases,
(recovered_cases * 100)/cases AS recovery_rate
FROM cases_by_country
)
SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate
FROM recovered_rate
WHERE cases > 50000
ORDER BY recovery_rate desc
LIMIT 10
Query 9: CDGR — Cumulative Daily Growth Rate
The following query is trying to calculate the CDGR (Cumulative Daily Growth Rate) on May 10, 2020 for France since the day the first case was reported. The first case was reported on Jan 24, 2020. The CDGR is calculated as:
((last_day_cases / first_day_cases)^(1 / days_diff)) - 1
Where :
- last_day_cases is the number of confirmed cases on May 10, 2020
- first_day_cases is the number of confirmed cases on Feb 02, 2020
- days_diff is the number of days between Feb 02 and May 10, 2020
The query isn’t executing properly. Can you fix the error to make the query execute successfully?
WITH
france_cases AS (
SELECT
date,
SUM(cumulative_confirmed) AS total_cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="France"
AND date IN ('2020-01-24',
'2020-05-10')
GROUP BY
date
ORDER BY
date)
, summary as (
SELECT
total_cases AS first_day_cases,
LEAD(total_cases) AS last_day_cases,
DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
france_cases
LIMIT 1
)
select first_day_cases, last_day_cases, days_diff, SQRT((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary
Note: refer to the BigQuery documentation on navigation functions to learn more about LEAD(), which is used here.
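If LEAD() is unfamiliar, this tiny self-contained query shows what it does: for each row, it returns a value from the next row in the window's ORDER BY order (NULL when there is no next row):
SELECT x, LEAD(x) OVER (ORDER BY x) AS next_x
FROM UNNEST([10, 20, 30]) AS x;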
Task 9: CDGR (Cumulative Daily Growth Rate)
WITH
france_cases AS (
SELECT
date,
SUM(cumulative_confirmed) AS total_cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="France"
AND date IN ('2020-01-24',
'2020-05-10')
GROUP BY
date
ORDER BY
date)
, summary as (
SELECT
total_cases AS first_day_cases,
LEAD(total_cases) OVER(ORDER BY date) AS last_day_cases,
DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
france_cases
LIMIT 1
)
select first_day_cases, last_day_cases, days_diff, POW((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary
Create a Data Studio report
Create a Data Studio report that plots the following for the United States:
- Number of Confirmed Cases
- Number of Deaths
- Date range: 2020-03-15 to 2020-04-30
Task 10: Create a Data Studio report
SELECT
date, SUM(cumulative_confirmed) AS country_cases,
SUM(cumulative_deceased) AS country_deaths
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
date BETWEEN '2020-03-15'
AND '2020-04-30'
AND country_name="United States of America"
GROUP BY date
MORE VIDEO TUTORIALS ABOUT BIGQUERY:
VIDEO TUTORIAL Introduction to SQL for BigQuery and Cloud SQL
VIDEO TUTORIAL Exploring Your Ecommerce Dataset with SQL in Google BigQuery
VIDEO TUTORIAL Exploring the Public Cryptocurrency Datasets Available in BigQuery
VIDEO TUTORIAL Weather Data in BigQuery
VIDEO TUTORIAL Using BigQuery and Cloud Logging to Analyze BigQuery Usage
VIDEO TUTORIAL LAB ETL Processing on Google Cloud Using Dataflow and BigQuery
SUMMARY:
You have learned many things in this article, and I hope all of them are useful for your data analytics skills. Let's wrap up what you've learned. In the first section, you learned what SQL is, why you should learn it, the basics of SQL syntax, and how to query data using standard SQL. In the second section, you learned how to load data into BigQuery using a public dataset: what BigQuery is, why you should learn it, what a public dataset is, where you can find one, and why you should use one. In the last section, I covered how I solved the Insights from Data with BigQuery challenge lab. You can also visit the video tutorials on my YouTube channel. Please don't forget to subscribe, like, and comment.
Keep Spirit To Learn. Never Give Up.
Happy Learning.
References:
Beaulieu, A. Learning SQL. O'Reilly Media.
Beighley, L. Head First SQL. O'Reilly Media.
Bowman, J. S., Darnovsky, M., & Emerson, S. L. The Practical SQL Handbook: Using SQL Variants (4th ed.). Addison-Wesley.
Cassel, P., Eddy, C., & Price, J. (2001). Sams Teach Yourself Microsoft Access 2002 in 21 Days. Sams.
cloud.google.com
Feiler, J. (2011). Data-Driven iOS Apps for iPad and iPhone with FileMaker Pro, FileMaker Bento, and FileMaker Go. Que.
Lakshmanan, V., & Tigani, J. (2019). Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale. O'Reilly Media.
Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps. O'Reilly Media.
Manoochehri, M. Data Just Right: Introduction to Large-Scale Data & Analytics. Addison-Wesley.
Mishra, S., & Beaulieu, A. (2004). Mastering Oracle SQL (2nd ed.). O'Reilly Media.
Molinaro, A. SQL Cookbook. O'Reilly Media.
Rischert, A. (2009). Oracle SQL by Example (4th ed.). Pearson.
Wilder-James, E. (2012). Planning for Big Data. O'Reilly Media.
Qwiklabs hands-on labs (Google):
Insights From Data With BigQuery: Challenge Lab: https://google.qwiklabs.com/focuses/1145?parent=catalog
About The Author:
Hello, my name is Rachmawati Ari Taurisia. Most Indonesian people call me Rachma, while some international friends call me Ari; my first name is difficult for them to pronounce or spell, so they use my simpler middle name. In Bangkit Academy Indonesia, they call me Taurisia, because many people there share the names Rachma or Ari.
As a lifelong learner, I love learning new things, and I teach to inspire, motivate, and help others gain new knowledge and skills in technology.
I have an educational background in Informatics Management (Diploma 3) and Computer Science (currently pursuing a Bachelor's degree).
I have some accomplishments in cloud engineering with Google Cloud and in AI and ML learning. A brief version of my GCP learning journey: I love joining events and communities. I joined the Next '19 event with the GDG Cloud Bandung community, then Cloud Study Jam and JuaraGCP seasons 1–4 during 2019–2020, and the JuaraGCP Crash Course in 2021. I was featured on the Google Blog Indonesia.
Working experience: freelance online class trainer teaching Web Development Basics for Beginners, course creator for Web Apps on GCP, and advisor for a Cloud Computing learning path.
Speaker (webinar events, conferences, GDG Cloud Bandung, and GDG Jakarta). Nominee in the Upskill and Reskill category in 2020, recognized by the WomenTech Network community; third winner of the IWD2020 article-writing contest; scholarship recipient (Linux Foundation 2020, O'Reilly 2021, Women Who Code 2021, UoPeople 2021).