
Designing BigQuery Table Schemas for Teradata Professionals


Lab: 1 hour 30 minutes, 5 Credits, Introductory

Note: This lab may incorporate AI tools to support your learning.

Overview

In BigQuery, you organize data into datasets and define the schema (or structure) of each table using column names and data types. A table's schema can affect query performance and cost, because it determines how quickly and efficiently BigQuery can access and process the table's data. BigQuery supports flexible schemas, and it is possible to make many schema changes without rewriting the data.

The objective of this lab is to provide Teradata professionals with the knowledge and skills they need to start designing and implementing effective BigQuery table schemas. Upon completion of this lab, you will have a deeper understanding of how to design, optimize, and query table schemas in BigQuery.

In this lab, you create BigQuery datasets and tables to store data, create nested and repeated fields to maintain relationships in denormalized data, and create partitioned and clustered tables to optimize query performance.

For more information on translating SQL from Teradata to BigQuery, review the Teradata to BigQuery SQL translation guide: https://cloud.google.com/bigquery/docs/migration/teradata-sql.

Objectives

In this lab, you learn how to:

  • Create datasets and tables in BigQuery.
  • Create and query nested and repeated fields in BigQuery.
  • Create and query partitioned tables in BigQuery.
  • Create and query clustered tables in BigQuery.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

How to start your lab and sign in to the Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is a panel populated with the temporary credentials that you must use for this lab.

  2. Copy the username, and then click Open Google Console. The lab spins up resources, and then opens another tab that shows the Choose an account page.

    Note: Open the tabs in separate windows, side-by-side.
  3. On the Choose an account page, click Use Another Account. The Sign in page opens.

  4. Paste the username that you copied from the Connection Details panel. Then copy and paste the password.

Note: You must use the credentials from the Connection Details panel. Do not use your Google Cloud Skills Boost credentials. If you have your own Google Cloud account, do not use it for this lab (to avoid incurring charges).

  5. Click through the subsequent pages:
  • Accept the terms and conditions.
  • Do not add recovery options or two-factor authentication (because this is a temporary account).
  • Do not sign up for free trials.

After a few moments, the Cloud console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.

Task 1. Create BigQuery dataset and tables

In BigQuery, you can use data definition language (DDL) to create datasets and tables. You can also use the SQL statement LOAD DATA to load data from one or more files into a new or existing table.

To learn more about using DDL statements to create BigQuery datasets and tables and using the LOAD DATA SQL statement to load data, review the documentation titled CREATE SCHEMA statement, CREATE TABLE statement, and LOAD DATA statement.

In this task, you use DDL to create a dataset and tables in BigQuery, and then load data into the new tables using the LOAD DATA statement.

  1. In the Google Cloud console, in the Navigation menu, under Analytics, click BigQuery.

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.

  2. Click Done.

  3. In the SQL Workspace toolbar, click the Editor icon to open the SQL query editor.

  4. In the query editor, copy and paste the following query, and click Run:
CREATE SCHEMA IF NOT EXISTS ticket_sales
OPTIONS(location = "us");

This query creates a new BigQuery dataset named ticket_sales. Note that the DDL statement uses the term SCHEMA to refer to a logical collection of tables, views, and other resources, which in BigQuery is known as a dataset.
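The CREATE SCHEMA statement also accepts other dataset-level options. As an illustrative sketch (not a lab step; the dataset name ticket_sales_tmp and the 7-day value are hypothetical), you could create a dataset whose tables expire automatically:

```sql
-- Hypothetical example: a dataset whose tables expire after 7 days.
CREATE SCHEMA IF NOT EXISTS ticket_sales_tmp
OPTIONS(
  location = "us",
  default_table_expiration_days = 7
);
```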

  5. Expand the Explorer pane (left side), which lists the dataset, and then click the dataset name ticket_sales to confirm that it was created successfully.

  6. In the query editor, run the following query:
CREATE OR REPLACE TABLE ticket_sales.sales (
  salesid INT64,
  listid INT64,
  sellerid INT64,
  buyerid INT64,
  eventid INT64,
  dateid INT64,
  qtysold INT64,
  pricepaid INT64,
  commission FLOAT64,
  saletime STRING
);

This query creates a new table called sales in the ticket_sales dataset.

  7. Expand the Explorer pane, which lists the dataset and table, and then click the table name sales.

  8. Click the Details and Preview tabs to see more information about the table.

Note that there is no data in the table yet.

  9. In the query editor, run the following query:
LOAD DATA INTO ticket_sales.sales (
  salesid INT64,
  listid INT64,
  sellerid INT64,
  buyerid INT64,
  eventid INT64,
  dateid INT64,
  qtysold INT64,
  pricepaid INT64,
  commission FLOAT64,
  saletime STRING
)
FROM FILES (
  skip_leading_rows = 1,
  format = 'CSV',
  field_delimiter = ',',
  max_bad_records = 10,
  uris = ['gs://tcd_repo/data/entertainment_media/ticket-sales/sales.csv']
);

This query uses an explicit table schema definition to load data into the sales table from a CSV file in Cloud Storage.

The Results pane displays a message that the LOAD statement ran successfully.

  10. Click the Details and Preview tabs to confirm that the data was loaded into the sales table.

You can click Refresh (top right) to refresh the data in the Preview tab.

Click Check my progress to verify the objective. Create BigQuery dataset and tables

You have now used SQL statements in BigQuery to create a dataset and a table, and load data into it. Practice creating another table and loading data for event information.

  1. In the query editor, run the following query to create a new table named events:
CREATE OR REPLACE TABLE ticket_sales.events (
  eventid INT64,
  venueid INT64,
  catid INT64,
  dateid INT64,
  eventname STRING,
  starttime TIMESTAMP
);

  2. Review the Explorer pane, and confirm that there are now two tables: one named sales and one named events.

  3. In the query editor, run the following query to load data into the events table:
LOAD DATA INTO ticket_sales.events
FROM FILES (
  skip_leading_rows = 1,
  format = 'CSV',
  field_delimiter = ',',
  max_bad_records = 10,
  uris = ['gs://tcd_repo/data/entertainment_media/ticket-sales/events.csv']
);

This query uses schema auto-detection to load data into the events table from a CSV file in Cloud Storage.

The Results pane displays a message that the LOAD statement ran successfully.

  4. Click the Details and Preview tabs to confirm the data was loaded into the events table.

You can click Refresh (top right) to refresh the data in the Preview tab.

Click Check my progress to verify the objective. Load data into event table

Task 2. Create and query nested and repeated fields

Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. In BigQuery, the recommended way to denormalize data is to use nested and repeated fields. You can use nested and repeated fields to maintain relationships in denormalized data, instead of completely flattening your data.

To learn more about nested and repeated fields in BigQuery, review the documentation titled Use nested and repeated fields.

In this task, you learn how to create and query nested and repeated fields in BigQuery.
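Before aggregating existing tables, it may help to see that nested and repeated fields can also be declared directly in DDL using ARRAY and STRUCT types. The following sketch (not a lab step; the table name event_sales_demo is hypothetical) mirrors the structure you build later in this task:

```sql
-- Hypothetical example: declaring a nested, repeated field in DDL.
CREATE OR REPLACE TABLE ticket_sales.event_sales_demo (
  eventid INT64,
  eventname STRING,
  -- 'sales' is a repeated (ARRAY) field of STRUCT records.
  sales ARRAY<STRUCT<
    saletime STRING,
    qtysold INT64,
    pricepaid INT64,
    commission FLOAT64>>
);
```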

  1. In the query editor, run the following query:
SELECT
  e.eventid,
  e.eventname,
  s.saletime,
  s.qtysold,
  s.pricepaid,
  s.commission
FROM ticket_sales.events e
JOIN ticket_sales.sales s
  ON e.eventid = s.eventid
ORDER BY eventid, eventname;

Note: There is a one-to-many relationship between the events table and the sales table. When you run this query, you see repetition on the "one" side of the one-to-many relationship. For every sale, the event is repeated. To remove the repetition, you can aggregate the sales data into an array.

  2. In the query editor, run the following query:
SELECT
  e.eventid,
  e.eventname,
  ARRAY_AGG(STRUCT(
    s.saletime,
    s.qtysold,
    s.pricepaid,
    s.commission)) AS sales
FROM ticket_sales.events e
JOIN ticket_sales.sales s
  ON e.eventid = s.eventid
GROUP BY eventid, eventname
ORDER BY eventid, eventname;

The SQL in step 2 is similar to the SQL in step 1, but notice the addition of the ARRAY_AGG and STRUCT functions and the GROUP BY clause.

  3. Review the query results.

Rather than repeating the data on the "one" side of the one-to-many relationship, the data on the "many" side is now in an array of structs.

You can also wrap the previous query in a CREATE TABLE statement to create a nested hierarchical table.

  4. In the query editor, run the following query:
CREATE OR REPLACE TABLE ticket_sales.event_sales AS (
  SELECT
    e.eventid,
    e.eventname,
    ARRAY_AGG(STRUCT(
      s.saletime,
      s.qtysold,
      s.pricepaid,
      s.commission)) AS sales
  FROM ticket_sales.events e
  JOIN ticket_sales.sales s
    ON e.eventid = s.eventid
  GROUP BY eventid, eventname
);

  5. Click Go to Table, and examine the table schema.

The schema includes a nested and repeated field named sales, which contains the sale time, quantity sold, price paid, and commission for each event sale.

This new nested and repeated structure changes the way that you write queries.

  6. To count the number of sales by event, run the following query:
SELECT
  eventid,
  eventname,
  ARRAY_LENGTH(sales) AS sale_count
FROM ticket_sales.event_sales
ORDER BY sale_count DESC;

What if you wanted to see the top commissions per event? That would require querying inside the array. To do that, you need to unnest (or flatten) the array.

  7. To unnest the array and identify the top two commissions per event, run the following query:
SELECT
  eventid,
  eventname,
  ARRAY(
    SELECT AS STRUCT saletime, commission
    FROM UNNEST(sales)
    ORDER BY commission DESC
    LIMIT 2) AS top_2
FROM ticket_sales.event_sales
ORDER BY eventid;

The UNNEST operator flattens the sales array so that it can be queried; the subquery's results are then packed back into an array by the ARRAY function.

To learn more about using UNNEST to flatten arrays, review the documentation titled UNNEST operator.
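As an aside, a comma cross join with UNNEST is another common way to flatten an array: it produces one output row per array element. This sketch (not a lab step) is assumed to run against the event_sales table created above:

```sql
-- Flatten the sales array: one row per sale, paired with its event.
SELECT
  e.eventid,
  e.eventname,
  s.commission
FROM ticket_sales.event_sales AS e,
  UNNEST(e.sales) AS s
ORDER BY s.commission DESC;
```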

Click Check my progress to verify the objective. Create and query nested and repeated fields

Task 3. Create and query partitioned tables

In BigQuery, one method of reducing the number of bytes processed by a query is to divide a large table into smaller segments called partitions, and then include a filter in your queries to select only data from the appropriate partition. This process is known as partition pruning and can be used to reduce query costs.

To learn more about partitioned tables and partition pruning, review the documentation titled Introduction to partitioned tables and Query partitioned tables.

In this task, you learn how to create and query time-unit partitioned tables (on a DATETIME column) to minimize the number of bytes processed by queries.
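As a related sketch (not a lab step; the table name sales_guarded is hypothetical), a partitioned table can be created with the require_partition_filter option, which rejects any query that does not filter on the partitioning column and thereby guarantees partition pruning:

```sql
-- Hypothetical example: force queries to prune partitions.
CREATE TABLE IF NOT EXISTS ticket_sales.sales_guarded (
  salesid INT64,
  saletime DATETIME
)
PARTITION BY DATETIME_TRUNC(saletime, DAY)
OPTIONS(require_partition_filter = TRUE);
```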

  1. In the query editor, paste the following code, but do not click Run:
SELECT * FROM ticket_sales.sales;

Before you run a query, the BigQuery query validator provides an estimate of the number of bytes it will process. Note the estimate for this query (15.12 MB).

  2. Paste the following query, but do not click Run:
SELECT * FROM ticket_sales.sales WHERE saletime = '12/14/2008 09:13:17';

Notice that the estimated bytes processed is the same as for the query in step 1 (15.12 MB), even though the new query asks only for sales from a specific date and time.

  3. To create a new sales table that is partitioned daily by the saletime column, run the following query:
CREATE OR REPLACE TABLE ticket_sales.sales_partitioned_by_date
PARTITION BY DATETIME_TRUNC(saletime, DAY) AS (
  SELECT
    * EXCEPT (saletime),
    PARSE_DATETIME("%m/%d/%Y %H:%M:%S", saletime) AS saletime
  FROM ticket_sales.sales
);

  4. Click the Details tab to confirm that the table is partitioned by DAY on the column named saletime.

  5. Paste the following query, and notice the lower estimate of data to be processed (18.98 KB):
SELECT *
FROM ticket_sales.sales_partitioned_by_date
WHERE saletime = PARSE_DATETIME("%m/%d/%Y %H:%M:%S", '12/14/2008 09:13:17');

  6. Click Run to retrieve the query results.

The query processes less data (18.98 KB) because it runs against the partitioned table. BigQuery uses partition pruning to process less data, which can result in lower cost and faster queries.
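If you want to see the partitions themselves, one option (not a lab step) is to query the INFORMATION_SCHEMA.PARTITIONS view for the dataset:

```sql
-- List the daily partitions of the partitioned table, with row counts.
SELECT partition_id, total_rows, total_logical_bytes
FROM ticket_sales.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'sales_partitioned_by_date'
ORDER BY partition_id;
```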

Click Check my progress to verify the objective. Create and query partitioned tables

Task 4. Create and query clustered tables

Another method of optimizing query performance in BigQuery is to cluster values within a table to sort and group data into logical storage blocks. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns, instead of the entire table or table partition. This process is known as block pruning and can make joins, searches, grouping, and sorting faster.

To learn more about clustered tables and block pruning, review the documentation titled Introduction to clustered tables and Querying clustered tables.

In this task, you learn how to create and query clustered tables to optimize query performance.
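Note that partitioning and clustering are not mutually exclusive. As a sketch (not a lab step; the table name sales_part_clustered is hypothetical), you could combine both techniques on the sales data from the earlier task:

```sql
-- Hypothetical example: partition by day, then cluster within each
-- partition by eventid.
CREATE OR REPLACE TABLE ticket_sales.sales_part_clustered
PARTITION BY DATETIME_TRUNC(saletime, DAY)
CLUSTER BY eventid
AS (
  SELECT
    * EXCEPT (saletime),
    PARSE_DATETIME("%m/%d/%Y %H:%M:%S", saletime) AS saletime
  FROM ticket_sales.sales
);
```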

  1. In the query editor, run the following query:
SELECT
  language,
  COUNT(views) AS views
FROM `cloud-training-demos.wikipedia_benchmark.Wiki1B`
GROUP BY language
ORDER BY views DESC;

This query uses one of Google's public datasets that contains a lot of data (in this case, 1 billion rows). The query counts views by language in a table of Wikipedia data.

  2. In the Results pane, click the Execution details tab.

Note the number of bytes shuffled (619.68 KB).

  3. To create a new dataset for the Wikipedia data, run the following query:
CREATE SCHEMA IF NOT EXISTS wiki_clustered
OPTIONS(location = "us");

  4. To create a table clustered on the language column, run the following query:
CREATE OR REPLACE TABLE wiki_clustered.Wikipedia_by_language
CLUSTER BY language AS (
  SELECT *
  FROM `cloud-training-demos.wikipedia_benchmark.Wiki1B`
);

This command can take several minutes to run.

  5. Review the Details tab for the table to confirm that it is clustered on the column named language.

  6. To query the clustered table, run the following query:
SELECT
  language,
  COUNT(views) AS views
FROM wiki_clustered.Wikipedia_by_language
GROUP BY language
ORDER BY views DESC;

  7. In the Results pane, click the Execution details tab.

Notice that the number of bytes shuffled is lower (47.98 KB) when the same query is run on the clustered table. Fewer bytes shuffled means faster execution time in BigQuery.

Click Check my progress to verify the objective. Create and query clustered tables

End your lab

When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
