
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Create a new dataset and load JSON data into the table
/ 5
Creating arrays with ARRAY_AGG()
/ 5
Querying datasets that already have ARRAYs
/ 5
Explore a dataset with STRUCTs
/ 5
Practice with STRUCTs and ARRAYs
/ 5
BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
This lab is an in-depth walkthrough of working with semi-structured data (ingesting JSON, Array data types) inside of BigQuery. Denormalizing your schema into a single table with nested and repeated fields can yield performance improvements, but the SQL syntax for working with array data can be tricky. You will practice loading, querying, troubleshooting, and unnesting various semi-structured datasets.
In this lab, you learn about the following:
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00
), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
Name the new dataset fruit_store. Leave the other options at their default values (Data Location, Default Table Expiration).
Click Create dataset.
Normally in SQL you will have a single value for each row like this list of fruits below:
Row |
Fruit |
1 |
raspberry |
2 |
blackberry |
3 |
strawberry |
4 |
cherry |
What if you wanted a list of fruit items for each person at the store? It could look something like this:
Row |
Fruit |
Person |
1 |
raspberry |
sally |
2 |
blackberry |
sally |
3 |
strawberry |
sally |
4 |
cherry |
sally |
5 |
orange |
frederick |
6 |
apple |
frederick |
In traditional relational database SQL, you would look at the repetition of names and immediately think to split the above table into two separate tables: Fruit Items and People. That process is called normalization (going from one table to many). This is a common approach for transactional databases like mySQL.
For data warehousing, data analysts often go the reverse direction (denormalization) and bring many separate tables into one large reporting table.
Now, you're going to learn a different approach that stores data at different levels of granularity all in one table using repeated fields:
Row |
Fruit (array) |
Person |
1 |
raspberry |
sally |
blackberry | ||
strawberry | ||
cherry | ||
2 |
orange |
frederick |
apple |
What looks strange about the previous table?
What's the key insight? The array
data type!
An easier way to interpret the Fruit array:
Row |
Fruit (array) |
Person |
1 |
[raspberry, blackberry, strawberry, cherry] |
sally |
2 |
[orange, apple] |
frederick |
Both of these tables are exactly the same. There are two key learnings here:
Try it yourself.
Click Run.
Now try executing this one:
You should get an error that looks like the following:
Error: Array elements of types {INT64, STRING} do not have a common supertype at [3:1]
Arrays can only share one data type (all strings, all numbers).
Click Run.
After viewing the results, click the JSON tab to view the nested structure of the results.
What if you had a JSON file that you needed to ingest into BigQuery?
fruit_store
dataset.cloud-training/data-insights-course/labs/optimizing-for-performance/shopping_cart.json
Call the new table "fruit_details".
Click Create table.
In the schema, note that fruit_array
is marked as REPEATED which means it's an array.
Recap
Click Check my progress to verify the objective.
Don't have arrays in your tables already? You can create them!
ARRAY_AGG()
function to aggregate our string values into an array. Copy and paste the below query to explore this public dataset:ARRAY_LENGTH()
function to count the number of pages and products that were viewed:DISTINCT
to our ARRAY_AGG()
:Recap
You can do some pretty useful things with arrays like:
ARRAY_LENGTH(<array>)
ARRAY_AGG(DISTINCT <field>)
ARRAY_AGG(<field> ORDER BY <field>)
ARRAY_AGG(<field> LIMIT 5)
Click Check my progress to verify the objective.
The BigQuery Public Dataset for Google Analytics bigquery-public-data.google_analytics_sample
has many more fields and rows than our course dataset data-to-insights.ecommerce.all_sessions
. More importantly, it already stores field values like products, pages, and transactions natively as ARRAYs.
Run the query.
Scroll right in the results until you see the hits.product.v2ProductName
field (we will discuss the multiple field aliases shortly).
You will get an error:
Cannot access field product on a value with type ARRAY> at [5:8]
Before we can query REPEATED fields (arrays) normally, you must first break the arrays back into rows.
For example, the array for hits.page.pageTitle
is stored currently as a single row like:
and we need it to be:
How do we do that with SQL?
We'll cover UNNEST() more in detail later but for now just know that:
Click Check my progress to verify the objective.
You may have wondered why the field alias hit.page.pageTitle
looks like three fields in one separated by periods. Just as ARRAY values give you the flexibility to go deep into the granularity of your fields, another data type allows you to go wide in your schema by grouping related fields together. That SQL data type is the STRUCT data type.
The easiest way to think about a STRUCT is to consider it conceptually like a separate table that is already pre-joined into your main table.
A STRUCT can have:
Sounds just like a table right?
Under Explorer find the bigquery-public-data dataset.
If it's not present already, click ADD > Star a project by name.
Click Enter project name.
Enter bigquery-public-data
and click Star.
Click bigquery-public-data
in the pinned project list to expand it.
Find and open google_analytics_sample.
Click the ga_sessions table.
Start scrolling through the schema and answer the following question by using the find feature of your browser (i.e. CTRL + F). Tip: Expand all columns before you begin counting.
Storing your large reporting tables as STRUCTs (pre-joined "tables") and ARRAYs (deep granularity) allows you to:
Click Check my progress to verify the objective.
The next dataset will be lap times of runners around the track. Each lap will be called a "split".
Row |
runner.name |
runner.split |
1 |
Rudisha |
23.4 |
What do you notice about the field aliases? Since there are fields nested within the struct (name and split are a subset of runner) you end up with a dot notation.
What if the runner has multiple split times for a single race (like time per lap)?
Row |
runner.name |
runner.splits |
1 |
Rudisha |
23.4 |
26.3 | ||
26.4 | ||
26.1 |
To recap:
Create a new dataset titled racing.
Create a new table titled race_results.
Ingest this Google Cloud Storage JSON file:
cloud-training/data-insights-course/labs/optimizing-for-performance/race_results.json
Click Create table.
After the load job is successful, preview the schema for the newly created table:
Which field is the STRUCT? How do you know?
The participants field is the STRUCT because it is of type RECORD.
Which field is the ARRAY?
The participants.splits
field is an array of floats inside of the parent participants
struct. It has a REPEATED Mode which indicates an array. Values of that array are called nested values since they are multiple values inside of a single field.
How many rows were returned?
Answer: 1
What if you wanted to list the name of each runner and the type of race?
Error: Cannot access field name on a value with type ARRAY\<STRUCT\<name STRING, splits ARRAY\<FLOAT64\>>>> at [1:21]
Much like forgetting to GROUP BY when you use aggregation functions, here there are two different levels of granularity. One row for the race and three rows for the participants names. So how do you change this:
Row |
race |
participants.name |
1 |
800M |
Rudisha |
2 |
??? |
Makhloufi |
3 |
??? |
Murphy |
...to this:
Row |
race |
participants.name |
1 |
800M |
Rudisha |
2 |
800M |
Makhloufi |
3 |
800M |
Murphy |
In traditional relational SQL, if you had a races table and a participants table what would you do to get information from both tables? You would JOIN them together. Here the participant STRUCT (which is conceptually very similar to a table) is already part of your races table but is not yet correlated correctly with your non-STRUCT field "race".
Can you think of what two word SQL command you would use to correlate the 800M race with each of the racers in the first table?
Answer: CROSS JOIN
Great!
Error: Table name "participants" cannot be resolved: dataset name is missing
.
Even though the participants STRUCT is like a table, it is still technically a field in the racing.race_results
table.
Wow! You've successfully listed all of the racers for each race!
Row |
race |
name |
1 |
800M |
Rudisha |
2 |
800M |
Makhloufi |
3 |
800M |
Murphy |
4 |
800M |
Bosse |
5 |
800M |
Rotich |
6 |
800M |
Lewandowski |
7 |
800M |
Kipketer |
8 |
800M |
Berian |
You can simplify the last query by:
This will give you the same query result:
If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible race like a cartesian product?
Answer: No. This is a correlated cross join which only unpacks the elements associated with a single row. For a greater discussion, see working with ARRAYs and STRUCTs.
Click Check my progress to verify the objective.
Recap of STRUCTs:
STRUCT(
"Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits
)
AS runner
Answer the below questions using the racing.race_results
table you created previously.
Task: Write a query to COUNT how many racers were there in total.
Possible solution:
Row |
racer_count |
1 |
8 |
Answer: There were 8 racers who ran the race.
Write a query that will list the total race time for racers whose names begin with R. Order the results with the fastest total time first. Use the UNNEST() operator and start with the partially written query below.
Possible solution:
Row |
name |
total_race_time |
1 |
Rudisha |
102.19999999999999 |
2 |
Rotich |
103.6 |
You happened to see that the fastest lap time recorded for the 800 M race was 23.2 seconds, but you did not see which runner ran that particular lap. Create a query that returns that result.
Possible solution:
Row |
name |
split_time |
1 |
Kipketer |
23.2 |
You've successfully ingested JSON datasets, created ARRAYs and STRUCTs, and unnested semi-structured data for insights.
For additional reading, refer to Working with Arrays.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
This content is not currently available
We will notify you via email when it becomes available
Great!
We will contact you via email if it becomes available
One lab at a time
Confirm to end all existing labs and start this one