
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
This lab is an in-depth walkthrough of working with semi-structured data (ingesting JSON, array data types) inside of BigQuery. Denormalizing your schema into a single table with nested and repeated fields can yield performance improvements, but the SQL syntax for working with array data can be tricky. You will practice loading, querying, troubleshooting, and unnesting various semi-structured datasets.
In this lab you learn how to:
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00
), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
Name the new dataset fruit_store
. Leave the other options at their default values (Data location, Default table expiration).
Click CREATE DATASET.
Normally in SQL you will have a single value for each row like this list of fruits below:
Row | Fruit |
---|---|
1 | raspberry |
2 | blackberry |
3 | strawberry |
4 | cherry |
What if you wanted a list of fruit items for each person at the store? It could look something like this:
Row | Fruit | Person |
---|---|---|
1 | raspberry | sally |
2 | blackberry | sally |
3 | strawberry | sally |
4 | cherry | sally |
5 | orange | frederick |
6 | apple | frederick |
In traditional relational database SQL, you would look at the repetition of names and immediately think to split the above table into two separate tables: Fruit Items and People.
In BigQuery, you're going to take a faster approach which involves the use of the array data type to achieve the below:
Row | Fruit (array) | Person |
---|---|---|
1 | raspberry | sally |
blackberry | ||
strawberry | ||
cherry | ||
2 | orange | frederick |
apple |
Look at the differences in this table compared to other tables you've seen before.
An easier way to interpret the Fruit array:
Row | Fruit (array) | Person |
---|---|---|
1 | [raspberry, blackberry, strawberry, cherry] | sally |
2 | [orange, apple] | frederick |
Both of these tables are exactly the same. There are two key learnings here:
Click Run.
Now try executing this one:
You should get an error that looks like the following:
Error: Array elements of types {INT64, STRING} do not have a common supertype at [3:1]
Arrays can only share one data type (all strings, all numbers). You might ask at this point, can you have an array of arrays? Yes, they can! This will be covered later.
Click Run.
After viewing the results, click the JSON tab to view the nested structure of the results.
What if you had a JSON file that you needed to ingest into BigQuery? You'll try this now.
Create a new table in the fruit_store
dataset.
To create a table, click on the View actions icon next to the fruit_store dataset and select Open.
Then click Create table from the right panel.
cloud-training/data-insights-course/labs/optimizing-for-performance/shopping_cart.json
.Call the new Table name fruit_details
.
Under Schema, click on the checkbox of Auto detect.
Click Create table.
Click on the table fruit_details
.
In the schema, note that fruit_array is marked as REPEATED
, which means it is an array.
As you saw earlier, arrays can only have one data type. There is, however, a data type which supports having both multiple field names and types – the STRUCT data type.
The next dataset will be lap times of runners around the track. Each lap will be called a "split".
Row | runner.name | runner.split |
---|---|---|
1 | Rudisha | 23.4 |
What do you notice about the field aliases? Since there are fields nested within the struct (name and split are a subset of runner) you end up with a dot notation.
What if the runner has multiple splits in one record? How could you have multiple split times within a single record? Hint: The splits all have the same numeric datatype.
Answer: With an array of course!
Row | runner.name | runner.split |
---|---|---|
1 | Rudisha | 23.4 26.3 26.4 26.1 |
To recap:
Create a new dataset titled racing.
Create a new table titled race_results.
Ingest this Google Cloud Storage JSON file:
cloud-training/data-insights-course/labs/optimizing-for-performance/race_results.json
.
cloud-training/data-insights-course/labs/optimizing-for-performance/race_results.json
.race_results
.Click Create table.
After the load job is successful, preview the schema for the newly created table.
Which field is the STRUCT? How do you know?
Answer: The participants field is the STRUCT because it is of type RECORD.
Which field is the ARRAY?
Answer: The participants.splits
field is an array of floats inside the parent participants
struct. It has a REPEATED Mode which indicates an array. Values of that array are called nested values since they are multiple values inside of a single field.
What if you wanted to list the name of each runner and the type of race?
Error: Cannot access field name on a value with type ARRAY<STRUCT<name STRING, splits ARRAY<FLOAT64>>>> at [2:27]
Much like forgetting to GROUP BY when you use aggregation functions, here there are two different levels of granularity. One row for the race and three rows for the participants names. So how do you change this:
Row | race | participants.name |
---|---|---|
1 | 800M | Rudisha |
2 | ??? | Makhloufi |
3 | ??? | Murphy |
...to this:
Row | race | participants.name |
---|---|---|
1 | 800M | Rudisha |
2 | 800M | Makhloufi |
3 | 800M | Murphy |
In traditional relational SQL, if you had a races table and a participants table what would you do to get information from both tables? You would JOIN them together. Here the participant STRUCT (which is conceptually very similar to a table) is already part of your races table but is not yet correlated correctly with your non-STRUCT field "race".
Can you think of what two word SQL command you would use to correlate the 800M race with each of the racers in the first table?
Answer: CROSS JOIN
Great!
Error: Table name "participants" missing dataset while no default dataset is set in the request
.
Even though the participants STRUCT is like a table, it is still technically a field in the racing.race_results
table.
Wow! You've successfully listed all of the racers for each race!
Row | race | name |
---|---|---|
1 | 800M | Rudisha |
2 | 800M | Makhloufi |
3 | 800M | Murphy |
4 | 800M | Bosse |
5 | 800M | Rotich |
6 | 800M | Lewandowski |
7 | 800M | Kipketer |
8 | 800M | Berian |
You can simplify the last query by:
If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible race like a cartesian product?
Answer: No. This is a correlated cross join which only unpacks the elements associated with a single row. For a greater discussion, refer to the working with arrays and structs reference documentation.
Recap of STRUCTs:
STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner
.Answer the below questions using the racing.race_results
table you created previously.
Task: Write a query to COUNT how many racers were there in total.
To start, use the below partially written query:
Hint: Remember you will need to cross join in your struct name as an additional data source after the FROM.
Possible solution:
Row | racer_count |
---|---|
1 | 8 |
Answer: There were 8 racers who ran the race.
Now that you are familiar working with structs, it's time to apply that same knowledge of unpacking arrays to some traditional arrays.
Recall that the UNNEST operator takes an array and returns a table, with one row for each element in the array.
This will allow you to perform normal SQL operations like:
As a reminder, an array is an ordered list of elements that share a data type.
Here is a string array of the 8 racer names:
You can create arrays in BigQuery by adding brackets [ ] and comma separating values.
Answer: It is a single row with 8 array elements.
Row | normal_array |
---|---|
1 | Rudisha |
Makhloufi | |
Murphy | |
Bosse | |
Rotich | |
Lewandowski | |
Kipketer | |
Berian |
Tip: If you already have a field that isn't in an array format you can aggregate those values into an array by using ARRAY_AGG().
In order to find the racers whose names begin with the letter M, you need to unpack the above array into individual rows so you can use a WHERE
clause.
Unpacking the array is done by wrapping the array (or the name of the array) with UNNEST()
as shown below.
And you should see:
Row | unnested_array_of_names |
---|---|
1 | Rudisha |
2 | Makhloufi |
3 | Murphy |
4 | Bosse |
5 | Rotich |
6 | Lewandowski |
7 | Kipketer |
8 | Berian |
You have successfully unnested the array. This is also called flattening the array.
Row | unnested_array_of_names |
---|---|
1 | Makhloufi |
2 | Murphy |
Write a query that will list the total race time for racers whose names begin with R. Order the results with the fastest total time first. Use the UNNEST()
operator and start with the partially written query below.
Hint:
Possible solution:
Row | name | total_race_time |
---|---|---|
1 | Rudisha | 102.19999999999999 |
2 | Rotich | 103.6 |
You happened to see that the fastest lap time recorded for the 800 M race was 23.2 seconds, but you did not see which runner ran that particular lap. Create a query that returns that result.
Possible solution:
Row | name | split_time |
---|---|---|
1 | Kipketer | 23.2 |
You've successfully ingested JSON datasets, created arrays and structs, and unnested semi-structured data for insights.
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
This content is not currently available
We will notify you via email when it becomes available
Great!
We will contact you via email if it becomes available
One lab at a time
Confirm to end all existing labs and start this one