How I Solved The Challenge Lab: Insights from Data With BigQuery

Rachmawati Ari Taurisia
29 min read · Jul 24, 2021


BigQuery is a data warehouse on Google Cloud that you can use to store and analyze data. Imagine that you have a huge amount of data (petabytes, with many columns and a vast number of rows), yet you can still run queries and get insights from it in just a couple of seconds. I don't know about you, but this impressed me the first time I ran an example SQL query. You can also use BigQuery to train machine learning models using BigQuery Machine Learning (BQML). BigQuery is useful for marketing professionals, managers, decision-makers, data analysts, machine learning engineers, and other data professionals.

You read the title, "How I Solved the Challenge Lab," and you might be wondering what that means. Let me explain briefly. A challenge lab is part of a quest in Qwiklabs, and it usually comes at the end. If you successfully solve the challenge lab, you earn the special skill badge.

A quest in Qwiklabs consists of 5-7 labs, and each lab has a timer and Google's guidance, which is easy to follow. In the challenge lab, however, there is no step-by-step guidance from Google; you must find the solution yourself. I love learning GCP with Qwiklabs, and it has impressed me since I first used it in 2019.

As I wrote in a Medium article sharing my experience using GCP with Qwiklabs, the labs are very interactive and an interesting way to explore a subject you want to know more about. Every time you finish a quest, you get a badge to show it is completed.

The following is a sample Quest with some labs and the challenge lab.

My Public Qwiklabs Profile

Quest Insights From Data With BigQuery

Introduction to SQL for BigQuery and Cloud SQL

BigQuery: Qwik Start — Console

Exploring Your Ecommerce Dataset with SQL in Google BigQuery

Troubleshooting Common SQL Errors with BigQuery

Explore and Create Reports with Data Studio

Insights from Data with BigQuery: Challenge Lab

The following is a sample badge earned without the challenge lab: the BigQuery Basics for Data Analysts badge covers the Insights from Data with BigQuery quest minus the challenge lab.

My Public Qwiklabs Profile

Qwiklabs was founded in Feb 2012 by Nidhi Aggarwal, a tech entrepreneur from a traditional Indian family. Google acquired the company on Nov 21, 2016. The platform enables developers and IT professionals to gain hands-on experience using cloud platforms and software.

Prerequisites:

BigQuery supports standard SQL to query data. SQL stands for Structured Query Language. Don’t worry if that doesn’t make sense right now, I’ll briefly review the SQL Syntax basics, and point you to additional resources so you can dive in deeper.

I will cover these topics in this article:

  1. Why is BigQuery an important tool to use in analysis?
  2. What is BigQuery?
  3. How can we use BigQuery?
  4. Learning SQL Syntax basics
  5. Loading Data Into BigQuery using Public Dataset
  6. Let’s get started Analyzing The Data Using BigQuery.
  7. Summary.
  8. Additional References
  9. About the Author

Okay, you might be very curious about these topics. Let’s discuss.

Why is BigQuery an important tool to use in analysis?

It is a data warehouse system that manages large amounts of data, with Google managing the hardware for you.

Users can also share datasets with users outside of the project using the service.

According to Spotify Engineer Paul Lamere, BigQuery can handle a large amount of data quickly and cheaply.

(Lakshmanan, V., 2019)

What is BigQuery?

The definition of BigQuery in the introduction above is a little short. BigQuery is a data warehouse on Google Cloud that you can use to store and analyze data using standard SQL.

BigQuery is fully managed and serverless.

Because it is fully managed, you don't need to deploy any resources, such as disks and virtual machines, to get started.

What does "serverless" mean? It means a compute architecture that separates computation from storage: you can run queries without managing any infrastructure, because Google provides and operates it for you.

It enables you to carry out analyses that process aggregations over the entire dataset in seconds to minutes.

Google allows you to ingest data in various formats such as CSV, JSON, AVRO, and so on.

It supports native machine learning using BigQuery Machine Learning (BQML) for predictive analytics, as well as geospatial analysis.

(Lakshmanan, V., 2019)

How can we use BigQuery?

There are four ways to use BigQuery: the BigQuery web UI in the Google Cloud console, the REST API, the bq command-line tool, and third-party tools.

From the BigQuery web UI

(console.cloud.google.com)

Using the REST API, with a variety of client libraries such as Java, .NET, or Python. BigQuery provides several available APIs.

Using the bq command-line tool.

  • bq ls: see whether your default project has any existing datasets.
  • bq ls with the project ID followed by a colon (:): list the datasets in a specific project, for example bq ls publicdata:
  • bq mk: create a new dataset.
  • bq load: create or update a table and load data in a single step.
  • bq rm: remove a dataset.

The following is a bq command-line example:

bq query --nouse_legacy_sql \
'SELECT
   COUNT(*)
 FROM
   `bigquery-public-data`.samples.shakespeare'

(source: https://cloud.google.com/bigquery/docs/bq-command-line-tool)

Third-party tools

You can also use third-party tools to load data into BigQuery.

(Source: https://cloud.google.com/bigquery)

Learning SQL Syntax basics

In this section, we look at why you should learn SQL and the basics of its syntax.

We focus on standard SQL. The BigQuery user interface (UI) on the Google Cloud Platform (GCP) Cloud Console defaults to standard SQL.

I’ll give you some example queries. This is how you can select a row from a table. But SQL can do more, and here is a list of the resources I found helpful when I was learning SQL.

If you are not familiar with SQL yet, you can learn SQL from the following references :

“SQL Cookbook” By Anthony Molinaro, O’Reilly Publisher.

“Learning SQL” By Alan Beaulieu, O’Reilly Publisher.

You can practice SQL with the interactive learning environment on O'Reilly.

The relational approach to database management is built on a mathematical model that includes intimidating-sounding terms like relational algebra and relational calculus.

E. F. Codd, the inventor of the relational model, has developed a detailed list of criteria that implementations of the model must meet. (Practical SQL Handbook, The: Using SQL Variants, Fourth Edition. By Judith S. Bowman, Marcy Darnovsky, and Sandra L. Emerson)

“The SQL language includes two distinct sets of commands: Data Definition Language (DDL) is the subset of SQL used to define and modify various data structures, while Data Manipulation Language (DML) is the subset of SQL used to access and manipulate data contained within the data structures previously defined via DDL. DDL includes numerous commands for handling such tasks as creating tables, indexes, views, and constraints, while DML is comprised of just four statements:

INSERT

Adds data to a database.

UPDATE

Modifies data in a database.

DELETE

Removes data from a database.

SELECT

Retrieves data from a database." (Mishra, S., et al.)
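Since the four DML statements are standard SQL, you can try them without BigQuery at all. Here is a minimal sketch using Python's built-in sqlite3 module; the customers table and the names in it are made up purely for illustration:

```python
import sqlite3

# In-memory SQLite database to illustrate the four DML statements.
# (BigQuery's DML is similar in spirit, though it runs at warehouse scale.)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure first.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# INSERT: add data.
cur.execute("INSERT INTO customers (name, city) VALUES ('Rachma', 'Bandung')")
cur.execute("INSERT INTO customers (name, city) VALUES ('Ari', 'Jakarta')")

# UPDATE: modify data.
cur.execute("UPDATE customers SET city = 'Surabaya' WHERE name = 'Ari'")

# DELETE: remove data.
cur.execute("DELETE FROM customers WHERE name = 'Rachma'")

# SELECT: retrieve data.
rows = cur.execute("SELECT name, city FROM customers").fetchall()
print(rows)  # [('Ari', 'Surabaya')]
```

Running the four statements in sequence like this makes the DDL/DML split concrete: CREATE TABLE defines the structure, and everything after it only touches the data.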

According to Google BigQuery: The Definitive Guide by Valliappa Lakshmanan and Jordan Tigani:

“BigQuery is primarily a data warehouse into which you will typically load or stream data and not modify it. However, BigQuery does provide the ability to modify data: DML lets you insert, update, delete, and merge data into tables.

Tip

BigQuery is an analytical database, not an online transaction processing (OLTP) database. As such, it is not designed for very-high-frequency DML updates. If you find that you have a lot of data that you want to update, consider performing the updates in batches. One approach is to create a staging table with the updates that you want to apply, and then you can run a single UPDATE or MERGE statement to perform all of those changes in a single operation.”

The majority of database performance issues are caused by inefficient DML statements. So why should you care about SQL? In this age of Internet computing, does anyone even care about data access anymore? The following section will look at the origins of the SQL language as well as the SQL standard’s support. (Mastering Oracle SQL By Sanjay Mishra and Alan Beaulieu).

In general, data for analysis and statistics is stored in databases. The language for querying that data is SQL. Databases organize data in tables, which consist of columns and rows.

(source: slideteam)

You will become familiar with SQL syntax like this:

SELECT * FROM table_name;

You use SELECT and FROM to choose the table from which to get the data you need to analyze; the * selects every column.

Another SQL example:

SELECT customerName

FROM tableCustomer;

You use SELECT to choose which columns to retrieve and FROM to choose which table they come from.

Another SELECT statement:

SELECT column

FROM table

WHERE condition

ORDER BY column [ASC | DESC]

Or another SELECT statement:

SELECT * FROM table LIMIT [number];
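You can try the SELECT / WHERE / ORDER BY / LIMIT pattern above locally before touching BigQuery. A small sketch using Python's built-in sqlite3 module, with a hypothetical tableCustomer (the purchases column and the names are invented for the example):

```python
import sqlite3

# Tiny in-memory table to exercise SELECT / WHERE / ORDER BY / LIMIT.
# The same clauses work in BigQuery standard SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableCustomer (customerName TEXT, purchases INTEGER)")
conn.executemany(
    "INSERT INTO tableCustomer VALUES (?, ?)",
    [("Ana", 12), ("Budi", 30), ("Citra", 7), ("Dewi", 30)],
)

# WHERE filters rows, ORDER BY sorts them, LIMIT caps the result size.
rows = conn.execute(
    """SELECT customerName
       FROM tableCustomer
       WHERE purchases >= 10
       ORDER BY purchases DESC, customerName ASC
       LIMIT 2"""
).fetchall()
print(rows)  # [('Budi',), ('Dewi',)]
```

Note how the tie between Budi and Dewi (both 30 purchases) is broken by the second ORDER BY column; you will use exactly this combination of clauses in the lab queries later.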

Loading Data Into BigQuery using Public Dataset

You may be wondering: what exactly is a public dataset? What is it used for?

The Google Cloud Documentation explains the Public Dataset as the following:

“A public dataset is any dataset that is stored in BigQuery and made available to the general public through the Google Cloud Public Dataset Program. The public datasets are datasets that BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via a project. You pay only for the queries that you perform on the data. The first 1 TB per month is free, subject to query pricing details.

Public datasets are available for you to analyze using either legacy SQL or standard SQL queries. Use a fully qualified table name when querying public datasets, for example, bigquery-public-data.bbc_news.fulltext.

You can access BigQuery public datasets by using the Cloud Console, by using the bq command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

You can find more details about each dataset by clicking the dataset’s name in the Datasets section of Cloud Marketplace.”

(Source: https://cloud.google.com/bigquery/public-data).

You can also find other free public datasets on these websites:

BuzzFeedNews

Kaggle

Socrata

Awesome-public-dataset

FiveThirtyEight

UCI Machine Learning Repository

Data.gov

Academic Torrents

Quandl

Data-Is-Plural

Jakarta Open Data

Google Scholar

Google Finance

Google Public Data

Google Data Search

Google Trends

Climate Data Online

The Global Health Observatory

Singapore Residents

Earth Data

1000 Genomes Project

Pew Internet

Visual Data

CMU Libraries

NLP Database

The Boston Housing Dataset

Mall Customers

The MNIST Dataset

Iris Dataset

Why should you use a public dataset?

You can use a public dataset for practice, analyzing the data with legacy or standard SQL.

You can access and integrate it into your application.

You’ll use BigQuery in our examples for data collection and feature engineering. You’ll use data from Google Cloud Public Datasets, a set of free, publicly available data.

(See the Google explanation about the public dataset above)

Learn more about Google Dataset

According to cloud.google.com, there are several ways to ingest data into BigQuery:

  • Batch loads a set of data records.
  • Stream individual records or batches of records.
  • Queries can be used to generate new data and append or overwrite the results to a table.
  • Use a third-party application or service.
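For the batch-load path, the bq load command seen earlier does the work in one step. As a sketch, here is how such an invocation could be assembled; the bucket name is a placeholder, while the dataset, table, and schema mirror the babynames example used in this article:

```python
def bq_load_command(dataset, table, gcs_uri, schema):
    """Assemble a `bq load` invocation for a CSV batch load.

    The bucket in gcs_uri below is hypothetical; replace it with your own.
    """
    parts = [
        "bq", "load",
        "--source_format=CSV",
        f"{dataset}.{table}",  # destination table
        gcs_uri,               # source file in Cloud Storage
        schema,                # inline schema definition
    ]
    return " ".join(parts)

cmd = bq_load_command(
    "babynames",
    "names_2014",
    "gs://my-bucket/yob2014.txt",
    "name:string,gender:string,count:integer",
)
print(cmd)
```

Running the printed command in Cloud Shell (with a real bucket) performs the same batch load that the console's Create table dialog does later in this article.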

Let’s get started Analyzing The Data Using BigQuery.

In this section, we will analyze data using BigQuery. I'll walk you through solving a Qwiklab, and I even made a video about it on my YouTube channel. We will discuss how to use standard SQL in BigQuery through two videos: BigQuery: Qwik Start Console (with Google's guidance) and Insights from Data with BigQuery: Challenge Lab (where we must find the solution ourselves).

VIDEO TUTORIAL BigQuery Qwik Start Console.

This tutorial uses Qwiklabs. Google gives us guidance step by step so we can follow it easily.

In this lab you will learn:

  1. How to compose queries in BigQuery
  2. How to load data into BigQuery

Composing a query in BigQuery.

Please look at this SQL :

#standardSQL

SELECT

weight_pounds, state, year, gestation_weeks

FROM

`bigquery-public-data.samples.natality`

ORDER BY weight_pounds DESC LIMIT 10;

This is standard SQL. Can you spot the syntax basics?

weight_pounds, state, year, and gestation_weeks are the column names in the samples.natality table. You can see that they are separated by commas; you use this form whenever you select multiple columns.

bigquery-public-data indicates that this query uses sample data from a public dataset available in BigQuery.

You can browse the schemas of other public datasets in BigQuery by clicking + ADD DATA > Explore public datasets, then searching for "bigquery public data" in the Search field.

(BigQuery Qwik Start Console)

Then, look at this clause from the query above:

ORDER BY weight_pounds DESC LIMIT 10;

This sorts the result in descending order by the weight_pounds column, with LIMIT 10 restricting how many rows appear in the result.

You can use the ASC or DESC keyword.

ASC: to sort the data in ascending order.

DESC: to sort the data in descending order.

The next step is to add custom data and create a bucket. First, copy the sample data in Cloud Shell:

gsutil cp gs://spls/gsp072/baby-names.zip .

Then, after creating a bucket, copy the extracted file into it:

gsutil cp yob2014.txt gs://<your_bucket>

Both commands are run in Cloud Shell. The gsutil tool has commands such as mb and cp to perform operations. You can use the gsutil tool for:

Creating and deleting buckets.

Uploading, downloading, and deleting objects.

Listing buckets and objects.

Moving, copying, and renaming objects.

Editing objects and bucket ACLs.

In the step Load the data into a new table, you create a new table inside the dataset. You just need to fill in the following fields in the Create table dialog:

  • Create table from: Google Cloud Storage
  • Select file from GCS bucket: <bucket_name>/yob2014.txt (replace <bucket_name> with the name of the bucket you created earlier)
  • File format: CSV
  • Table name: names_2014
  • Schema > Edit as text: slide it on, then add the following in the textbox: name:string,gender:string,count:integer

(BigQuery Qwik Start Console)

Then, you query the custom dataset with this standard SQL:

#standardSQL
SELECT
  name, count
FROM
  `babynames.names_2014`
WHERE
  gender = 'M'
ORDER BY count DESC LIMIT 5;

SELECT name, count: there are two column names here, name and count. Remember, you use the SELECT statement to choose which columns to return.

FROM `babynames.names_2014`: this points to the names_2014 table you created inside the babynames dataset in the previous step.

WHERE gender = 'M': the WHERE clause filters records on a specified condition.

ORDER BY count DESC LIMIT 5;: the ORDER BY keyword sorts in ascending or descending order, and LIMIT 5 restricts the result to 5 rows.

VIDEO TUTORIAL Insights from Data with BigQuery Challenge Lab

Overview

You must complete a series of tasks within the allocated time period. Instead of following step-by-step instructions, you’ll be given a scenario and a set of tasks — you figure out how to complete it on your own! An automated scoring system (shown on this page) will provide feedback on whether you have completed your tasks correctly.

To score 100% you must complete all tasks within the time period!

When you take a Challenge Lab, you will not be taught Google Cloud concepts. To build the solution to the challenge presented, use skills learned from the labs in the quest this challenge lab is part of. You will be expected to extend your learned skills; you will be expected to change broken queries.

This lab is recommended for students who have enrolled in the Insights from Data with BigQuery quest. Are you ready for the challenge?

Scenario

You’re part of a public health organization that is tasked with identifying answers to queries related to the Covid-19 pandemic. Obtaining the right answers will help the organization in planning and focusing healthcare efforts and awareness programs appropriately.

The dataset and table that will be used for this analysis will be bigquery-public-data.covid19_open_data.covid19_open_data. This repository contains country-level datasets of daily time-series data related to COVID-19 globally. It includes data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, and weather.

Query 1: Total Confirmed Cases

Build a query that will answer “What was the total count of confirmed cases on Apr 15, 2020?” The query needs to return a single row containing the sum of confirmed cases across all countries. The name of the column should be total_cases_worldwide.

Columns to reference:

  • cumulative_confirmed
  • date

Task 1: Total Confirmed Cases

SELECT SUM(cumulative_confirmed) AS total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date = '2020-04-15'

Query 2: Worst Affected Areas

Build a query for answering “How many states in the US had more than 100 deaths on Apr 10, 2020?” The query needs to list the output in the field count_of_states. Hint: Don’t include NULL values.

Columns to reference:

  • country_name
  • subregion1_name (for state information)
  • cumulative_deceased

Task 2: Worst Affected Areas

WITH deaths_by_states AS (
  SELECT subregion1_name AS state, SUM(cumulative_deceased) AS death_count
  FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE country_name = "United States of America"
    AND date = '2020-04-10'
    AND subregion1_name IS NOT NULL
  GROUP BY subregion1_name
)
SELECT COUNT(*) AS count_of_states
FROM deaths_by_states
WHERE death_count > 100
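The shape of this solution (a WITH clause that aggregates per group, then an outer query that filters and counts the aggregates) works the same in any SQL engine. A toy reproduction with Python's sqlite3 and invented numbers:

```python
import sqlite3

# Aggregate deaths per state in a CTE, then count states above a threshold,
# mirroring the structure of the Task 2 query. Data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (state TEXT, deceased INTEGER)")
conn.executemany(
    "INSERT INTO covid VALUES (?, ?)",
    [("NY", 120), ("NY", 40), ("CA", 90), ("WA", 150), ("TX", 30)],
)

count = conn.execute(
    """WITH deaths_by_states AS (
           SELECT state, SUM(deceased) AS death_count
           FROM covid
           GROUP BY state
       )
       SELECT COUNT(*) FROM deaths_by_states WHERE death_count > 100"""
).fetchone()[0]
print(count)  # 2 (NY with 160 and WA with 150)
```

The outer WHERE has to live outside the CTE because death_count only exists after the GROUP BY aggregation has run.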

Query 3: Identifying Hotspots

Build a query that will answer “List all the states in the United States of America that had more than 1000 confirmed cases on Apr 10, 2020?” The query needs to return the State Name and the corresponding confirmed cases arranged in descending order. Name of the fields to return state and total_confirmed_cases.

Columns to reference:

  • country_code
  • subregion1_name (for state information)
  • cumulative_confirmed

Task 3: Identifying Hotspots

SELECT * FROM (
  SELECT subregion1_name AS state, SUM(cumulative_confirmed) AS total_confirmed_cases
  FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE country_code = "US" AND date = '2020-04-10' AND subregion1_name IS NOT NULL
  GROUP BY subregion1_name
  ORDER BY total_confirmed_cases DESC
)
WHERE total_confirmed_cases > 1000

Query 4: Fatality Ratio

Build a query that will answer “What was the case-fatality ratio in Italy for the month of April 2020?” Case-fatality ratio here is defined as (total deaths / total confirmed cases) * 100. Write a query to return the ratio for the month of April 2020 and containing the following fields in the output: total_confirmed_cases, total_deaths, case_fatality_ratio.

Columns to reference:

  • country_name
  • cumulative_confirmed
  • cumulative_deceased

Task 4: Fatality Ratio

SELECT SUM(cumulative_confirmed) AS total_confirmed_cases,
       SUM(cumulative_deceased) AS total_deaths,
       (SUM(cumulative_deceased) / SUM(cumulative_confirmed)) * 100 AS case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name = "Italy" AND date BETWEEN '2020-04-01' AND '2020-04-30'

Query 5: Identifying specific day

Build a query that will answer: “On what day did the total number of deaths cross 10000 in Italy?” The query should return the date in the format yyyy-mm-dd.

Columns to reference:

  • country_name
  • cumulative_deceased

Task 5: Identifying specific day

SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name = "Italy" AND cumulative_deceased > 10000
ORDER BY date ASC
LIMIT 1

Query 6: Finding days with zero net new cases

The following query was written to identify the number of days in India, between 21 Feb 2020 and 15 March 2020, when there were zero increases in the number of confirmed cases. However, it does not execute properly. You need to update the query to complete it and obtain the result:

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "India"
    AND date BETWEEN '2020-02-21' AND '2020-03-15'
  GROUP BY date
  ORDER BY date ASC
)
, india_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases
  FROM india_cases_by_date
)

Task 6: Finding days with zero net new cases

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "India"
    AND date BETWEEN '2020-02-21' AND '2020-03-15'
  GROUP BY date
  ORDER BY date ASC
)
, india_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases
  FROM india_cases_by_date
)
SELECT COUNT(*)
FROM india_previous_day_comparison
WHERE net_new_cases = 0
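The key piece in this fix is LAG(), which reads the previous row in the ORDER BY sequence; subtracting it from the current row turns cumulative totals into daily net new cases. You can reproduce the idea locally with Python's sqlite3 (window functions need SQLite 3.25 or newer) and invented cumulative counts:

```python
import sqlite3

# Toy cumulative case counts; two of the four days show no increase.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (day TEXT, cumulative INTEGER)")
conn.executemany(
    "INSERT INTO cases VALUES (?, ?)",
    [("2020-02-21", 3), ("2020-02-22", 3), ("2020-02-23", 5), ("2020-02-24", 5)],
)

zero_days = conn.execute(
    """WITH comparison AS (
           SELECT day,
                  cumulative - LAG(cumulative) OVER (ORDER BY day) AS net_new
           FROM cases
       )
       SELECT COUNT(*) FROM comparison WHERE net_new = 0"""
).fetchone()[0]
print(zero_days)  # 2
```

Note that the first day's net_new is NULL (there is no previous row for LAG to read), so it is excluded by the net_new = 0 filter, exactly as in the BigQuery solution.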

Query 7: Doubling rate

Using the previous query as a template, write a query to find out the dates on which the confirmed cases increased by more than 10% compared to the previous day (indicating doubling rate of ~ 7 days) in the US between the dates March 22, 2020 and April 20, 2020. The query needs to return the list of dates, the confirmed cases on that day, the confirmed cases the previous day, and the percentage increase in cases between the days. Use the following names for the returned fields: Date, Confirmed_Cases_On_Day, Confirmed_Cases_Previous_Day and Percentage_Increase_In_Cases.

Task 7: Doubling rate

WITH us_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "United States of America"
    AND date BETWEEN '2020-03-22' AND '2020-04-20'
  GROUP BY date
  ORDER BY date ASC
)
, us_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases,
    (cases - LAG(cases) OVER (ORDER BY date)) * 100 / LAG(cases) OVER (ORDER BY date) AS percentage_increase
  FROM us_cases_by_date
)
SELECT date AS Date,
       cases AS Confirmed_Cases_On_Day,
       previous_day AS Confirmed_Cases_Previous_Day,
       percentage_increase AS Percentage_Increase_In_Cases
FROM us_previous_day_comparison
WHERE percentage_increase > 10

Query 8: Recovery rate

Build a query to list the recovery rates of countries, arranged in descending order (limited to 10), up to May 10, 2020. Restrict the query to countries having more than 50K confirmed cases. The query needs to return the following fields: country, recovered_cases, confirmed_cases, recovery_rate.

Columns to reference:

  • country_name
  • cumulative_confirmed
  • cumulative_recovered

Task 8: Recovery rate

WITH cases_by_country AS (
  SELECT
    country_name AS country,
    SUM(cumulative_confirmed) AS cases,
    SUM(cumulative_recovered) AS recovered_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    date = '2020-05-10'
  GROUP BY country_name
)
, recovered_rate AS (
  SELECT
    country, cases, recovered_cases,
    (recovered_cases * 100) / cases AS recovery_rate
  FROM cases_by_country
)
SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate
FROM recovered_rate
WHERE cases > 50000
ORDER BY recovery_rate DESC
LIMIT 10

Query 9: CDGR (Cumulative Daily Growth Rate)

The following query tries to calculate the CDGR (Cumulative Daily Growth Rate) for France on May 10, 2020, counted from the day the first case was reported. The first case was reported on Jan 24, 2020. The CDGR is calculated as:

((last_day_cases / first_day_cases)^(1 / days_diff)) - 1

Where :

  • last_day_cases is the number of confirmed cases on May 10, 2020
  • first_day_cases is the number of confirmed cases on Jan 24, 2020
  • days_diff is the number of days between Jan 24 and May 10, 2020
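You can sanity-check the formula with plain arithmetic before writing any SQL. A quick sketch in Python, using illustrative numbers rather than the real French case counts:

```python
# Hypothetical numbers, not the real French case counts.
first_day_cases = 10       # confirmed cases on the first day
last_day_cases = 130000    # confirmed cases on the last day
days_diff = 107            # days between 2020-01-24 and 2020-05-10

# CDGR = (last/first)^(1/days) - 1, the same exponentiation POW() performs in SQL.
cdgr = (last_day_cases / first_day_cases) ** (1 / days_diff) - 1
print(round(cdgr, 4))  # roughly 0.0926, i.e. about 9.3% growth per day
```

This also shows why the broken query below cannot work: SQRT() only takes a square root, while the formula needs an arbitrary exponent of 1/days_diff, which is what POW() provides.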

The query isn’t executing properly. Can you fix the error to make the query execute successfully?

WITH france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "France"
    AND date IN ('2020-01-24', '2020-05-10')
  GROUP BY date
  ORDER BY date
)
, summary AS (
  SELECT
    total_cases AS first_day_cases,
    LEAD(total_cases) AS last_day_cases,
    DATE_DIFF(LEAD(date) OVER (ORDER BY date), date, day) AS days_diff
  FROM france_cases
  LIMIT 1
)
SELECT first_day_cases, last_day_cases, days_diff,
       SQRT((last_day_cases / first_day_cases), (1 / days_diff)) - 1 AS cdgr
FROM summary

Note: Refer to the BigQuery documentation to learn more about the LEAD() function referenced here.

Task 9: CDGR (Cumulative Daily Growth Rate)

WITH france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "France"
    AND date IN ('2020-01-24', '2020-05-10')
  GROUP BY date
  ORDER BY date
)
, summary AS (
  SELECT
    total_cases AS first_day_cases,
    LEAD(total_cases) OVER (ORDER BY date) AS last_day_cases,
    DATE_DIFF(LEAD(date) OVER (ORDER BY date), date, day) AS days_diff
  FROM france_cases
  LIMIT 1
)
SELECT first_day_cases, last_day_cases, days_diff,
       POW((last_day_cases / first_day_cases), (1 / days_diff)) - 1 AS cdgr
FROM summary

Create a Data Studio report

Create a Data Studio report that plots the following for the United States:

  • Number of Confirmed Cases
  • Number of Deaths
  • Date range: 2020-03-15 to 2020-04-30

Task 10: Create a Data Studio report

SELECT
  date,
  SUM(cumulative_confirmed) AS country_cases,
  SUM(cumulative_deceased) AS country_deaths
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date BETWEEN '2020-03-15' AND '2020-04-30'
  AND country_name = "United States of America"
GROUP BY date

MORE VIDEO TUTORIALS ABOUT BIGQUERY:

VIDEO TUTORIAL Introduction to SQL for BigQuery and Cloud SQL

VIDEO TUTORIAL Exploring Your Ecommerce Dataset with SQL in Google BigQuery

VIDEO TUTORIAL Exploring the Public Cryptocurrency Datasets Available in BigQuery

VIDEO TUTORIAL Weather Data in BigQuery

VIDEO TUTORIAL Using BigQuery and Cloud Logging to Analyze BigQuery Usage

VIDEO TUTORIAL LAB ETL Processing on Google Cloud Using Dataflow and BigQuery

SUMMARY:

You have learned many things in this article, and I hope they all prove useful for your data analytics skills. Let's wrap up. In the first section, you learned what SQL is, why you should learn it, the basics of SQL syntax, and how to query data using standard SQL. In the second section, you learned how to load data into BigQuery using a public dataset: what BigQuery is, why you should learn it, what a public dataset is, where to find one, and why to use it. In the last section, I covered how I solved the Insights from Data with BigQuery challenge lab. You can also visit the video tutorials on my YouTube channel; please don't forget to subscribe, like, and comment.

Keep Spirit To Learn. Never Give Up.

Happy Learning.

References:

Beighley, L. Head First SQL. O'Reilly Media.

Beaulieu, Alan. Learning SQL. O’Reilly Media.

Bowman, J. S., Darnovsky, M., & Emerson, S. L. The Practical SQL Handbook: Using SQL Variants (4th ed.).

Cassel, P., Eddy, C., & Price, J. (2001). Sams Teach Yourself Microsoft Access 2002 in 21 Days. Sams.

Cloud.google.com

Feiler, J. (2011). Data-driven iOS apps for iPad and iPhone with FileMaker Pro, FileMaker Bento, and FileMaker Go. Que.

Lakshmanan, V. (2019). Google BigQuery: the Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale. O’Reilly Media, Incorporated.

Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine learning design patterns: solutions to common challenges in data preparation, model building, and MLOps. O’Reilly Media.

Molinaro, A. SQL Cookbook. O'Reilly Media.

Manoochehri, Michael. Data Just Right: Introduction to Large-Scale Data & Analytics. O’Reilly Media.

Mishra, S., & Beaulieu, A. (2004). Mastering Oracle SQL, 2nd Edition (Second ed.). O’Reilly Media.

Rischert, A. (2009). Oracle SQL By Example (4th ed.). Pearson.

Wilder-James, Edd. (2012). Planning for Big Data. O'Reilly Media.

Qwiklabs Hands-On Google

https://google.qwiklabs.com/focuses/1145?parent=catalog

Insights From Data With BigQuery Challenge Lab

Inno YouTube Channel

About The Author :

Hello, my name is Rachmawati Ari Taurisia. Most Indonesian people call me Rachma, and some international friends call me Ari; they find my first name difficult to pronounce or spell, so they use my simpler middle name. In Bangkit Academy Indonesia, they call me Taurisia because many people there have similar names like Rachma or Ari.

As a lifelong learner, I love to learn new things, and I teach to inspire, motivate, and help others gain new knowledge and skills in technology.

I have an educational background in Informatics Management (Diploma 3) and Computer Science (currently pursuing a Bachelor's degree).

I have earned some accomplishments in cloud engineering with Google Cloud and in AI and ML learning; in brief, that is my GCP learning journey. I love to join events and communities: I attended the Next '19 event with the GDG Cloud Bandung community, then Cloud Study Jam and JuaraGCP seasons 1-4 during 2019-2020, as well as the JuaraGCP Crash Course 2021. I have been featured on the Google Blog Indonesia.

Working experience: freelance online class trainer teaching Web Development Basics for Beginners, course creator for Web Apps on GCP, and advisor for a cloud computing learning path.

Speaker (webinar events, conferences, the GDG Cloud Bandung community, and GDG Jakarta); nominee in the Upskill and Reskill category for 2020, recognized by the WomenTech Network community; third-place winner of the IWD2020 article-writing contest; and scholarship recipient (Linux Foundation 2020, O'Reilly 2021, Women Who Code 2021, UoPeople 2021).


The next step is to add custom data then create a bucket.

gsutil cp gs://spls/gsp072/baby-names.zip .

To create a bucket

gsutil cp yob2014.txt gs://<your_bucket>

Both of them, the command line in Cloud Shell. The gsutil tool has some commands such as mb and cp to perform operations. You can use the gsutil tool for :

Creating and deleting buckets.

Uploading, downloading, and deleting objects.

Listing buckets and objects.

Moving, copying, and renaming objects.

Editing objects and bucket ACLs.

The next step is to load the data into a new table inside the dataset. Fill in the following fields in the Create table dialog:

  • Create table from: Google Cloud Storage
  • Select file from GCS bucket: <bucket_name>/yob2014.txt (replace <bucket_name> with the name of the bucket you created earlier)
  • File format: CSV
  • Table name: names_2014
  • Schema > Edit as text: slide the toggle on, then add the following in the textbox: name:string,gender:string,count:integer

(BigQuery Qwik Start Console)

Then, query the custom dataset with this standard SQL:

#standardSQL
SELECT
  name, count
FROM
  `babynames.names_2014`
WHERE
  gender = 'M'
ORDER BY count DESC LIMIT 5;

SELECT name, count: the SELECT statement chooses which columns appear in the result; here, the two columns name and count.

FROM `babynames.names_2014`: this points to the names_2014 table you just created inside the babynames dataset.

WHERE gender = 'M': the WHERE clause filters records, keeping only the rows that satisfy the specified condition.

ORDER BY count DESC LIMIT 5;: ORDER BY sorts the rows in ascending or descending order (descending here, because of DESC), and LIMIT 5 restricts the result to 5 rows.
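To build intuition for what these clauses do, here is a small Python sketch that mimics the same filter, sort, and limit on a few made-up rows (the names and counts below are hypothetical, not the real 2014 data):

```python
# Hypothetical rows shaped like the names_2014 table: (name, gender, count)
rows = [
    ("Emma", "F", 20799),
    ("Noah", "M", 19144),
    ("Olivia", "F", 19674),
    ("Liam", "M", 18342),
    ("Mason", "M", 17092),
]

# WHERE gender = 'M': keep only matching rows
males = [r for r in rows if r[1] == "M"]

# ORDER BY count DESC LIMIT 5: sort by count, largest first, keep at most 5
top5 = sorted(males, key=lambda r: r[2], reverse=True)[:5]

# SELECT name, count: project just the two columns
print([(name, count) for name, _, count in top5])
```

BigQuery does the same thing, just over billions of rows instead of five.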

VIDEO TUTORIAL Insights from Data with BigQuery Challenge Lab

Overview

You must complete a series of tasks within the allocated time period. Instead of following step-by-step instructions, you’ll be given a scenario and a set of tasks — you figure out how to complete it on your own! An automated scoring system (shown on this page) will provide feedback on whether you have completed your tasks correctly.

To score 100% you must complete all tasks within the time period!

When you take a Challenge Lab, you will not be taught Google Cloud concepts. To build the solution to the challenge presented, use skills learned from the labs in the quest this challenge lab is part of. You will be expected to extend your learned skills and to fix broken queries.

This lab is recommended for students who have enrolled in the Insights from Data with BigQuery quest. Are you ready for the challenge?

Scenario

You’re part of a public health organization that is tasked with identifying answers to queries related to the Covid-19 pandemic. Obtaining the right answers will help the organization in planning and focusing healthcare efforts and awareness programs appropriately.

The dataset and table that will be used for this analysis will be bigquery-public-data.covid19_open_data.covid19_open_data. This repository contains country-level datasets of daily time-series data related to COVID-19 globally. It includes data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, and weather.

Query 1: Total Confirmed Cases

Build a query that will answer “What was the total count of confirmed cases on Apr 15, 2020?” The query needs to return a single row containing the sum of confirmed cases across all countries. The name of the column should be total_cases_worldwide.

Columns to reference:

  • cumulative_confirmed
  • date

Task — 1 : Total Confirmed Cases

SELECT SUM(cumulative_confirmed) AS total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date = '2020-04-15'

Query 2: Worst Affected Areas

Build a query for answering “How many states in the US had more than 100 deaths on Apr 10, 2020?” The query needs to list the output in the field count_of_states. Hint: Don’t include NULL values.

Columns to reference:

  • country_name
  • subregion1_name (for state information)
  • cumulative_deceased

Task — 2: Worst Affected Areas

WITH deaths_by_states AS (
  SELECT subregion1_name AS state, SUM(cumulative_deceased) AS death_count
  FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE country_name = "United States of America"
    AND date = '2020-04-10'
    AND subregion1_name IS NOT NULL
  GROUP BY subregion1_name
)
SELECT COUNT(*) AS count_of_states
FROM deaths_by_states
WHERE death_count > 100
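The shape of this query, aggregate per state first and then filter on the aggregate, is the same two-step pattern you would write in any language. A Python sketch with invented death counts:

```python
# Hypothetical cumulative deaths per state on 2020-04-10
# (rows with NULL state already excluded)
deaths_by_state = {"New York": 7844, "New Jersey": 1932, "Washington": 507, "Montana": 6}

# Equivalent of: SELECT COUNT(*) FROM deaths_by_states WHERE death_count > 100
count_of_states = sum(1 for deaths in deaths_by_state.values() if deaths > 100)
print(count_of_states)  # 3
```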

Query 3: Identifying Hotspots

Build a query that will answer “List all the states in the United States of America that had more than 1000 confirmed cases on Apr 10, 2020?” The query needs to return the State Name and the corresponding confirmed cases arranged in descending order. Name of the fields to return state and total_confirmed_cases.

Columns to reference:

  • country_code
  • subregion1_name (for state information)
  • cumulative_confirmed

Task — 3: Identifying Hotspots

SELECT * FROM (
  SELECT subregion1_name AS state, SUM(cumulative_confirmed) AS total_confirmed_cases
  FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE country_code = "US"
    AND date = '2020-04-10'
    AND subregion1_name IS NOT NULL
  GROUP BY subregion1_name
  ORDER BY total_confirmed_cases DESC
)
WHERE total_confirmed_cases > 1000

Query 4: Fatality Ratio

Build a query that will answer “What was the case-fatality ratio in Italy for the month of April 2020?” Case-fatality ratio here is defined as (total deaths / total confirmed cases) * 100. Write a query to return the ratio for the month of April 2020 and containing the following fields in the output: total_confirmed_cases, total_deaths, case_fatality_ratio.

Columns to reference:

  • country_name
  • cumulative_confirmed
  • cumulative_deceased

Task — 4 : Fatality Ratio

SELECT
  SUM(cumulative_confirmed) AS total_confirmed_cases,
  SUM(cumulative_deceased) AS total_deaths,
  (SUM(cumulative_deceased) / SUM(cumulative_confirmed)) * 100 AS case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name = "Italy"
  AND date BETWEEN '2020-04-01' AND '2020-04-30'
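The arithmetic the query performs is simple to check in plain Python; the totals below are made up for illustration, not the lab's actual result:

```python
# Hypothetical April 2020 totals for one country
total_confirmed_cases = 200_000
total_deaths = 25_000

# case_fatality_ratio = (total deaths / total confirmed cases) * 100
case_fatality_ratio = (total_deaths / total_confirmed_cases) * 100
print(case_fatality_ratio)  # 12.5
```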

Query 5: Identifying specific day

Build a query that will answer: “On what day did the total number of deaths cross 10000 in Italy?” The query should return the date in the format yyyy-mm-dd.

Columns to reference:

  • country_name
  • cumulative_deceased

Task — 5 : Identifying specific day

SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name = "Italy" AND cumulative_deceased > 10000
ORDER BY date ASC
LIMIT 1
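ORDER BY date ASC LIMIT 1 picks the earliest day that satisfies the filter. The same logic in Python, on an invented series:

```python
# Hypothetical (date, cumulative_deceased) rows; ISO dates so string order = date order
series = [
    ("2020-04-29", 10400),
    ("2020-04-26", 9800),
    ("2020-04-28", 10120),
    ("2020-04-27", 9950),
]

# WHERE cumulative_deceased > 10000, then earliest date (ORDER BY date ASC LIMIT 1)
crossing_day = min(d for d, deceased in series if deceased > 10000)
print(crossing_day)  # 2020-04-28
```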

Query 6: Finding days with zero net new cases

The following query is written to identify the number of days in India between 21 Feb 2020 and 15 March 2020 when there were zero increases in the number of confirmed cases. However, it is not executing properly. You need to update the query to complete it and obtain the result:

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "India"
    AND date BETWEEN '2020-02-21' AND '2020-03-15'
  GROUP BY date
  ORDER BY date ASC
)
, india_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases
  FROM india_cases_by_date
)

Task — 6 : Finding days with zero net new cases

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "India"
    AND date BETWEEN '2020-02-21' AND '2020-03-15'
  GROUP BY date
  ORDER BY date ASC
)
, india_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases
  FROM india_cases_by_date
)
SELECT COUNT(*)
FROM india_previous_day_comparison
WHERE net_new_cases = 0
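The key piece is LAG(cases) OVER (ORDER BY date), which pairs each day with the previous day's value. A minimal Python analogue of the whole fixed query, on made-up counts:

```python
# Hypothetical daily cumulative confirmed cases, already ordered by date
cases = [3, 3, 5, 5, 5, 8]

# LAG: pair each value with its predecessor; like LAG, the first day
# has no previous_day and produces no comparison row
net_new_cases = [curr - prev for prev, curr in zip(cases, cases[1:])]

# COUNT(*) WHERE net_new_cases = 0
zero_days = sum(1 for n in net_new_cases if n == 0)
print(zero_days)  # 3
```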

Query 7: Doubling rate

Using the previous query as a template, write a query to find out the dates on which the confirmed cases increased by more than 10% compared to the previous day (indicating doubling rate of ~ 7 days) in the US between the dates March 22, 2020 and April 20, 2020. The query needs to return the list of dates, the confirmed cases on that day, the confirmed cases the previous day, and the percentage increase in cases between the days. Use the following names for the returned fields: Date, Confirmed_Cases_On_Day, Confirmed_Cases_Previous_Day and Percentage_Increase_In_Cases.

Task — 7 : Doubling rate

WITH us_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "United States of America"
    AND date BETWEEN '2020-03-22' AND '2020-04-20'
  GROUP BY date
  ORDER BY date ASC
)
, us_previous_day_comparison AS (
  SELECT
    date,
    cases,
    LAG(cases) OVER (ORDER BY date) AS previous_day,
    cases - LAG(cases) OVER (ORDER BY date) AS net_new_cases,
    (cases - LAG(cases) OVER (ORDER BY date)) * 100 / LAG(cases) OVER (ORDER BY date) AS percentage_increase
  FROM us_cases_by_date
)
SELECT
  date AS Date,
  cases AS Confirmed_Cases_On_Day,
  previous_day AS Confirmed_Cases_Previous_Day,
  percentage_increase AS Percentage_Increase_In_Cases
FROM us_previous_day_comparison
WHERE percentage_increase > 10
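The percentage_increase expression is just day-over-day growth relative to the previous day. In Python, with invented numbers:

```python
# Hypothetical daily cumulative confirmed cases, ordered by date
cases = [100, 105, 120, 125, 150]

# percentage_increase = (cases - previous_day) * 100 / previous_day,
# keeping only days where the increase exceeds 10%
flagged = [
    round((curr - prev) * 100 / prev, 2)
    for prev, curr in zip(cases, cases[1:])
    if (curr - prev) * 100 / prev > 10
]
print(flagged)  # [14.29, 20.0]
```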

Query 8: Recovery rate

Build a query to list the recovery rates of countries, arranged in descending order (limited to 10), up to the date May 10, 2020. Restrict the query to only those countries having more than 50K confirmed cases. The query needs to return the following fields: country, recovered_cases, confirmed_cases, recovery_rate.

Columns to reference:

  • country_name
  • cumulative_confirmed
  • cumulative_recovered

Task — 8 : Recovery rate

WITH cases_by_country AS (
  SELECT
    country_name AS country,
    SUM(cumulative_confirmed) AS cases,
    SUM(cumulative_recovered) AS recovered_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    date = '2020-05-10'
  GROUP BY country_name
)
, recovered_rate AS (
  SELECT
    country, cases, recovered_cases,
    (recovered_cases * 100) / cases AS recovery_rate
  FROM cases_by_country
)
SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate
FROM recovered_rate
WHERE cases > 50000
ORDER BY recovery_rate DESC
LIMIT 10
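This query follows the same pattern again: aggregate per country, compute a rate, filter to countries above 50K cases, and sort descending. A Python sketch on invented totals:

```python
# Hypothetical (country, confirmed, recovered) totals on 2020-05-10
totals = [
    ("Alphaland", 60_000, 45_000),
    ("Betastan", 40_000, 39_000),   # under 50K confirmed: filtered out
    ("Gammaria", 120_000, 30_000),
]

# recovery_rate = recovered * 100 / confirmed, for countries with > 50K cases
rates = [
    (country, recovered, confirmed, recovered * 100 / confirmed)
    for country, confirmed, recovered in totals
    if confirmed > 50_000
]
rates.sort(key=lambda r: r[3], reverse=True)  # ORDER BY recovery_rate DESC
print(rates)
```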

Query 9: CDGR — Cumulative Daily Growth Rate

The following query is trying to calculate the CDGR (Cumulative Daily Growth Rate) for France on May 10, 2020, counting from the day the first case was reported. The first case was reported on Jan 24, 2020. The CDGR is calculated as:

((last_day_cases / first_day_cases)^(1 / days_diff)) - 1

Where :

  • last_day_cases is the number of confirmed cases on May 10, 2020
  • first_day_cases is the number of confirmed cases on Feb 02, 2020
  • days_diff is the number of days between Feb 02 — May 10, 2020

The query isn’t executing properly. Can you fix the error to make the query execute successfully?

WITH france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "France"
    AND date IN ('2020-01-24', '2020-05-10')
  GROUP BY date
  ORDER BY date
)
, summary AS (
  SELECT
    total_cases AS first_day_cases,
    LEAD(total_cases) AS last_day_cases,
    DATE_DIFF(LEAD(date) OVER (ORDER BY date), date, day) AS days_diff
  FROM france_cases
  LIMIT 1
)
SELECT first_day_cases, last_day_cases, days_diff, SQRT((last_day_cases / first_day_cases), (1 / days_diff)) - 1 AS cdgr
FROM summary

Note: Refer to the BigQuery documentation to learn more about the LEAD() function referenced here.

Task — 9 : CDGR — Cumulative Daily Growth Rate

WITH france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name = "France"
    AND date IN ('2020-01-24', '2020-05-10')
  GROUP BY date
  ORDER BY date
)
, summary AS (
  SELECT
    total_cases AS first_day_cases,
    LEAD(total_cases) OVER (ORDER BY date) AS last_day_cases,
    DATE_DIFF(LEAD(date) OVER (ORDER BY date), date, day) AS days_diff
  FROM france_cases
  LIMIT 1
)
SELECT first_day_cases, last_day_cases, days_diff, POW((last_day_cases / first_day_cases), (1 / days_diff)) - 1 AS cdgr
FROM summary
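POW(last_day_cases / first_day_cases, 1 / days_diff) - 1 is ordinary compound growth. You can check the arithmetic in Python; the values below are made up so the answer comes out exact:

```python
# Hypothetical values (not the lab's actual numbers)
first_day_cases = 3
last_day_cases = 96
days_diff = 5

# CDGR = (last / first)^(1 / days) - 1
cdgr = (last_day_cases / first_day_cases) ** (1 / days_diff) - 1
print(cdgr)  # ~1.0: a 32x increase over 5 days means cases doubled daily
```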

Create a Data Studio report

Create a Data Studio report that plots the following for the United States:

Number of Confirmed Cases

Number of Deaths

Date range : 2020–03–15 to 2020–04–30

Task — 10 : Create a Data Studio report

SELECT
  date,
  SUM(cumulative_confirmed) AS country_cases,
  SUM(cumulative_deceased) AS country_deaths
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date BETWEEN '2020-03-15' AND '2020-04-30'
  AND country_name = "United States of America"
GROUP BY date

MORE VIDEO TUTORIALS ABOUT BIGQUERY:

VIDEO TUTORIAL Introduction to SQL for BigQuery and Cloud SQL

VIDEO TUTORIAL Exploring Your Ecommerce Dataset with SQL in Google BigQuery

VIDEO TUTORIAL Exploring the Public Cryptocurrency Datasets Available in BigQuery

VIDEO TUTORIAL Weather Data in BigQuery

VIDEO TUTORIAL Using BigQuery and Cloud Logging to Analyze BigQuery Usage

VIDEO TUTORIAL LAB ETL Processing on Google Cloud Using Dataflow and BigQuery

SUMMARY:

You have learned many things in this article; I hope they are useful for building your data-analysis skills. Let's wrap up. In the first section, you learned what SQL is, why you should learn it, the basics of SQL syntax, and how to query data using standard SQL. In the second section, you learned how to load data into BigQuery using a public dataset: what BigQuery is, why you should learn it, what a public dataset is, where to find one, and why you should use one. In the last section, I walked through how I solved the Insights from Data with BigQuery challenge lab. You can also find the video tutorials on my YouTube channel; please don't forget to subscribe, like, and comment.

Keep Spirit To Learn. Never Give Up.

Happy Learning.

References:


Qwiklabs Hands-On Google

https://google.qwiklabs.com/focuses/1145?parent=catalog

Insights From Data With BigQuery Challenge Lab

Inno YouTube Channel

About The Author :

Hello, my name is Rachmawati Ari Taurisia. Most Indonesian people call me Rachma, and some international friends call me Ari: my first name is difficult for them to pronounce or spell, so my middle name is simpler. At Bangkit Academy Indonesia, they call me Taurisia because many people there have similar names like Rachma or Ari.

As a lifelong learner, I love learning new things, and I teach to inspire, motivate, and help others gain new knowledge and skills in technology.

My educational background is in Informatics Management (Diploma 3) and Computer Science (currently pursuing a Bachelor's degree).

I have some accomplishments in cloud engineering with Google Cloud, AI, and ML learning. A short version of my GCP learning journey: I love joining events and communities. I joined the Next '19 event with the GDG Cloud Bandung community, then Cloud Study Jam and JuaraGCP seasons 1–4 during 2019–2020, and the JuaraGCP Crash Course 2021. I was featured on the Google Indonesia blog.

Work experience: freelance online class trainer teaching Web Development Basics for Beginners, course creator for Web Apps on GCP, and advisor for a cloud computing learning path.

Speaker (webinar events, conferences, GDG Cloud Bandung, and GDG Jakarta). Nominee in the Upskill and Reskill category (2020), recognized by the WomenTech Network community; third-place winner of the IWD2020 article-writing contest; scholarship recipient (Linux Foundation 2020, O'Reilly 2021, Women Who Code 2021, UoPeople 2021).

Public Qwiklabs Profile

LinkedIn
