arrow_back

Working with JSON, Arrays, and Structs in BigQuery v1.5

Sign in Join
Get access to 700+ labs and courses

Working with JSON, Arrays, and Structs in BigQuery v1.5

Lab 1 hour 15 minutes universal_currency_alt 5 Credits show_chart Introductory
info This lab may incorporate AI tools to support your learning.
Get access to 700+ labs and courses

Overview

BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.

This lab is an in-depth walkthrough of working with semi-structured data (ingesting JSON, array data types) inside of BigQuery. Denormalizing your schema into a single table with nested and repeated fields can yield performance improvements, but the SQL syntax for working with array data can be tricky. You will practice loading, querying, troubleshooting, and unnesting various semi-structured datasets.

Objectives

In this lab you learn how to:

  • Ingest JSON datasets.
  • Create ARRAYs and STRUCTs.
  • Unnest semi-structured data for insights.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Open BigQuery Console

  1. In the Google Cloud Console, select Navigation menu > BigQuery.

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.

  1. Click Done.

Task 1. Create a new dataset

  1. To create a dataset, click on the View actions icon next to your Project ID and then select Create dataset.

  1. Name the new dataset fruit_store. Leave the other options at their default values (Data location, Default table expiration).

  2. Click CREATE DATASET.

Task 2. Practice working with arrays in SQL

Normally in SQL you will have a single value for each row like this list of fruits below:

Row Fruit
1 raspberry
2 blackberry
3 strawberry
4 cherry

What if you wanted a list of fruit items for each person at the store? It could look something like this:

Row Fruit Person
1 raspberry sally
2 blackberry sally
3 strawberry sally
4 cherry sally
5 orange frederick
6 apple frederick

In traditional relational database SQL, you would look at the repetition of names and immediately think to split the above table into two separate tables: Fruit Items and People.

In BigQuery, you're going to take a faster approach which involves the use of the array data type to achieve the below:

Row Fruit (array) Person
1 raspberry sally
blackberry
strawberry
cherry
2 orange frederick
apple

Look at the differences in this table compared to other tables you've seen before.

  • It's only two rows.
  • There are multiple field values for Fruit in a single row.
  • The people are associated with all of the field values.

An easier way to interpret the Fruit array:

Row Fruit (array) Person
1 [raspberry, blackberry, strawberry, cherry] sally
2 [orange, apple] frederick

Both of these tables are exactly the same. There are two key learnings here:

  • An array is simply a list of items in brackets [ ].
  • BigQuery (in Standard SQL mode) displays arrays as flattened. It simply lists the value in the array vertically (note that all of those values still belong to a single row).
  1. Try it yourself. Enter the following in the BigQuery Query Editor:
#standardSQL SELECT ['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruit_array
  1. Click Run.

  2. Now try executing this one:

#standardSQL SELECT ['raspberry', 'blackberry', 'strawberry', 'cherry', 33] AS fruit_array

You should get an error that looks like the following:

Error: Array elements of types {INT64, STRING} do not have a common supertype at [3:1]

Arrays can only share one data type (all strings, all numbers). You might ask at this point, can you have an array of arrays? Yes, they can! This will be covered later.

  1. Here's the final table to query against:
#standardSQL SELECT person, fruit_array, total_cost FROM `data-to-insights.advanced.fruit_store`;
  1. Click Run.

  2. After viewing the results, click the JSON tab to view the nested structure of the results.

Uploading JSON files

What if you had a JSON file that you needed to ingest into BigQuery? You'll try this now.

Create a new table in the fruit_store dataset.

  1. To create a table, click on the View actions icon next to the fruit_store dataset and select Open.

  2. Then click Create table from the right panel.

Note: You may have to widen your browser window to see the Create table option.
  1. Add the following details for the table:
  • Source: Choose Google Cloud Storage in the Create table from dropdown.
  • Select file from GCS bucket: cloud-training/data-insights-course/labs/optimizing-for-performance/shopping_cart.json.
  • File format: JSONL (Newline delimited JSON).
  1. Call the new Table name fruit_details.

  2. Under Schema, click on the checkbox of Auto detect.

  3. Click Create table.

  4. Click on the table fruit_details.

In the schema, note that fruit_array is marked as REPEATED, which means it is an array.

Task 3. Store data about an entity that has different data types

As you saw earlier, arrays can only have one data type. There is, however, a data type which supports having both multiple field names and types – the STRUCT data type.

The next dataset will be lap times of runners around the track. Each lap will be called a "split".

  1. With this query, try out the STRUCT syntax and note the different field types within the struct container:
#standardSQL SELECT STRUCT("Rudisha" as name, 23.4 as split) as runner
Row runner.name runner.split
1 Rudisha 23.4

What do you notice about the field aliases? Since there are fields nested within the struct (name and split are a subset of runner) you end up with a dot notation.

What if the runner has multiple splits in one record? How could you have multiple split times within a single record? Hint: The splits all have the same numeric datatype.

Answer: With an array of course!

  1. Run the below query to confirm:
#standardSQL SELECT STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner
Row runner.name runner.split
1 Rudisha 23.4
26.3
26.4
26.1

To recap:

  • Structs are containers that can have multiple field names and data types nested inside.
  • Arrays can be one of the field types inside of a Struct (as shown above with the splits field).

Load in race results from other racers into a new table

  1. Create a new dataset titled racing.

  2. Create a new table titled race_results.

  3. Ingest this Google Cloud Storage JSON file: cloud-training/data-insights-course/labs/optimizing-for-performance/race_results.json.

  • Source: select Google Cloud Storage under Create table from dropdown.
  • Select file from GCS bucket: cloud-training/data-insights-course/labs/optimizing-for-performance/race_results.json.
  • File format: JSONL (Newline delimited JSON) and set Table name as race_results.
  1. Move the Edit as text slider and add the following:
[ { "name": "race", "type": "STRING", "mode": "NULLABLE" }, { "name": "participants", "type": "RECORD", "mode": "REPEATED", "fields": [ { "name": "name", "type": "STRING", "mode": "NULLABLE" }, { "name": "splits", "type": "FLOAT", "mode": "REPEATED" } ] } ]
  1. Click Create table.

  2. After the load job is successful, preview the schema for the newly created table.

Which field is the STRUCT? How do you know?

Answer: The participants field is the STRUCT because it is of type RECORD.

Which field is the ARRAY?

Answer: The participants.splits field is an array of floats inside the parent participants struct. It has a REPEATED Mode which indicates an array. Values of that array are called nested values since they are multiple values inside of a single field.

Practice querying nested and repeated fields

  1. Let's see all of our racers for the 800 meter race:
#standardSQL SELECT * FROM racing.race_results

What if you wanted to list the name of each runner and the type of race?

  1. Run the below schema and see what happens:
#standardSQL SELECT race, participants.name FROM racing.race_results

Error: Cannot access field name on a value with type ARRAY<STRUCT<name STRING, splits ARRAY<FLOAT64>>>> at [2:27]

Much like forgetting to GROUP BY when you use aggregation functions, here there are two different levels of granularity. One row for the race and three rows for the participants names. So how do you change this:

Row race participants.name
1 800M Rudisha
2 ??? Makhloufi
3 ??? Murphy

...to this:

Row race participants.name
1 800M Rudisha
2 800M Makhloufi
3 800M Murphy

In traditional relational SQL, if you had a races table and a participants table what would you do to get information from both tables? You would JOIN them together. Here the participant STRUCT (which is conceptually very similar to a table) is already part of your races table but is not yet correlated correctly with your non-STRUCT field "race".

Can you think of what two word SQL command you would use to correlate the 800M race with each of the racers in the first table?

Answer: CROSS JOIN

Great!

  1. Now try running this:
#standardSQL SELECT race, participants.name FROM racing.race_results CROSS JOIN participants # this is the STRUCT (it's like a table within a table)

Error: Table name "participants" missing dataset while no default dataset is set in the request.

Even though the participants STRUCT is like a table, it is still technically a field in the racing.race_results table.

  1. Add the dataset name to the query:
#standardSQL SELECT race, participants.name FROM racing.race_results CROSS JOIN race_results.participants # full STRUCT name
  1. And Run.

Wow! You've successfully listed all of the racers for each race!

Row race name
1 800M Rudisha
2 800M Makhloufi
3 800M Murphy
4 800M Bosse
5 800M Rotich
6 800M Lewandowski
7 800M Kipketer
8 800M Berian

You can simplify the last query by:

  • Adding an alias for the original table
  • Replacing the words "CROSS JOIN" with a comma (a comma implicitly cross joins)
  1. This will give you the same query result:
#standardSQL SELECT race, participants.name FROM racing.race_results AS r, r.participants

If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible race like a cartesian product?

Answer: No. This is a correlated cross join which only unpacks the elements associated with a single row. For a greater discussion, refer to the working with arrays and structs reference documentation.

Recap of STRUCTs:

  • A SQL STRUCT is simply a container of other data fields which can be of different data types. The word struct means data structure. Recall the example from earlier: STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner.
  • STRUCTs are given an alias (like runner above) and can conceptually be thought of as a table inside of your main table.
  • STRUCTs (and arrays) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the struct itself or the struct field that is an array in order to unpack and flatten it.

Lab question: STRUCT()

Answer the below questions using the racing.race_results table you created previously.

Task: Write a query to COUNT how many racers were there in total.

To start, use the below partially written query:

#standardSQL SELECT COUNT(participants.name) AS racer_count FROM racing.race_results

Hint: Remember you will need to cross join in your struct name as an additional data source after the FROM.

Possible solution:

#standardSQL SELECT COUNT(p.name) AS racer_count FROM racing.race_results AS r, UNNEST(r.participants) AS p
Row racer_count
1 8

Answer: There were 8 racers who ran the race.

Task 4. Unpack arrays with UNNEST( )

Now that you are familiar working with structs, it's time to apply that same knowledge of unpacking arrays to some traditional arrays.

Recall that the UNNEST operator takes an array and returns a table, with one row for each element in the array.

This will allow you to perform normal SQL operations like:

  • Aggregating values within an array.
  • Filtering arrays for particular values.
  • Ordering and sorting arrays.

As a reminder, an array is an ordered list of elements that share a data type.

Here is a string array of the 8 racer names:

['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']

You can create arrays in BigQuery by adding brackets [ ] and comma separating values.

  1. Try the below query and be sure to note how many rows are output. Will it be 8 rows?
#standardSQL SELECT ['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian'] AS normal_array

Answer: It is a single row with 8 array elements.

Row normal_array
1 Rudisha
Makhloufi
Murphy
Bosse
Rotich
Lewandowski
Kipketer
Berian

Tip: If you already have a field that isn't in an array format you can aggregate those values into an array by using ARRAY_AGG().

In order to find the racers whose names begin with the letter M, you need to unpack the above array into individual rows so you can use a WHERE clause.

Unpacking the array is done by wrapping the array (or the name of the array) with UNNEST() as shown below.

  1. Run the below query and note how many rows are returned:
#standardSQL SELECT * FROM UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names

And you should see:

Row unnested_array_of_names
1 Rudisha
2 Makhloufi
3 Murphy
4 Bosse
5 Rotich
6 Lewandowski
7 Kipketer
8 Berian

You have successfully unnested the array. This is also called flattening the array.

  1. Now add a normal WHERE clause to filter these rows, and run the query:
#standardSQL SELECT * FROM UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names WHERE unnested_array_of_names LIKE 'M%'
Row unnested_array_of_names
1 Makhloufi
2 Murphy

Lab question: Unpacking arrays with UNNEST( )

Write a query that will list the total race time for racers whose names begin with R. Order the results with the fastest total time first. Use the UNNEST() operator and start with the partially written query below.

  • Complete the query:
#standardSQL SELECT p.name, SUM(split_times) as total_race_time FROM racing.race_results AS r , r.participants AS p , p.splits AS split_times WHERE GROUP BY ORDER BY ;

Hint:

  • You will need to unpack both the struct and the array within the struct as data sources after your FROM clause
  • Be sure to use aliases where appropriate

Possible solution:

#standardSQL SELECT p.name, SUM(split_times) as total_race_time FROM racing.race_results AS r , UNNEST(r.participants) AS p , UNNEST(p.splits) AS split_times WHERE p.name LIKE 'R%' GROUP BY p.name ORDER BY total_race_time ASC;
Row name total_race_time
1 Rudisha 102.19999999999999
2 Rotich 103.6

Task 5. Filter within array values

You happened to see that the fastest lap time recorded for the 800 M race was 23.2 seconds, but you did not see which runner ran that particular lap. Create a query that returns that result.

  • Complete the partially written query:
#standardSQL SELECT p.name, split_time FROM racing.race_results AS r , r.participants AS p , p.splits AS split_time WHERE split_time = ;

Possible solution:

#standardSQL SELECT p.name, split_time FROM racing.race_results AS r , UNNEST(r.participants) AS p , UNNEST(p.splits) AS split_time WHERE split_time = 23.2;
Row name split_time
1 Kipketer 23.2

Congratulations!

You've successfully ingested JSON datasets, created arrays and structs, and unnested semi-structured data for insights.

End your lab

When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.

Before you begin

  1. Labs create a Google Cloud project and resources for a fixed time
  2. Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
  3. On the top left of your screen, click Start lab to begin

This content is not currently available

We will notify you via email when it becomes available

Great!

We will contact you via email if it becomes available

One lab at a time

Confirm to end all existing labs and start this one

Use private browsing to run the lab

Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which may cause extra charges incurred to your personal account.