Loading...
No results found.

    Achieving Advanced Insights with BigQuery

    Quick tip: Review the prerequisites before you run the lab
    Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the student account, which may cause extra charges incurred to your personal account.
    Test and share your knowledge with our community!
    done
    Get access to over 700 hands-on labs, skill badges, and courses

    Optimizing BigQuery for Cost and Performance v1.5

    Lab 50 minutes universal_currency_alt 5 Credits show_chart Introductory
    info This lab may incorporate AI tools to support your learning.
    Test and share your knowledge with our community!
    done
    Get access to over 700 hands-on labs, skill badges, and courses

    Overview

    BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.

    This lab focuses on how to architect a data warehouse for query performance. In this lab, you will compare a traditional relational schema with joins against a denormalized schema. You will also use BigQuery's Query Execution Plan to quantifiably assess performance trade-offs.

    What you'll do

    In this lab, you will learn how to perform these tasks:

    • Load a comma-separated value (CSV) file into a BigQuery table using the web UI.
    • Load a JavaScript® Object Notation (JSON) file into a BigQuery table using the command-line interface (CLI).
    • Transform data and join tables using the web UI.
    • Store query results in a destination table.
    • Query a destination table using the web UI to confirm your data was transformed and loaded correctly.

    Setup and requirements

    For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

    1. Sign in to Qwiklabs using an incognito window.

    2. Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
      There is no pause feature. You can restart if needed, but you have to start at the beginning.

    3. When ready, click Start lab.

    4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

    5. Click Open Google Console.

    6. Click Use another account and copy/paste credentials for this lab into the prompts.
      If you use other credentials, you'll receive errors or incur charges.

    7. Accept the terms and skip the recovery resource page.

    Activate Google Cloud Shell

    Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud.

    Google Cloud Shell provides command-line access to your Google Cloud resources.

    1. In Cloud console, on the top right toolbar, click the Open Cloud Shell button.

      Highlighted Cloud Shell icon

    2. Click Continue.

    It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:

    Project ID highlighted in the Cloud Shell Terminal

    gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

    • You can list the active account name with this command:
    gcloud auth list

    Output:

    Credentialed accounts: - @.com (active)

    Example output:

    Credentialed accounts: - google1623327_student@qwiklabs.net
    • You can list the project ID with this command:
    gcloud config list project

    Output:

    [core] project =

    Example output:

    [core] project = qwiklabs-gcp-44776a13dea667a6 Note: Full documentation of gcloud is available in the gcloud CLI overview guide .

    Open BigQuery Console

    1. In the Google Cloud Console, select Navigation menu > BigQuery.

    The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.

    1. Click Done.

    Task 1. Create a new dataset to store your tables

    In your BigQuery project, create a new dataset titled liquor_sales.

    1. In the Explorer section, click on the View actions icon next to your project ID and select Create dataset.

    The navigation path to Create dataset.

    The Create dataset dialog opens.

    1. Set the dataset ID to liquor_sales. Leave the other options at their default values, and click Create dataset.

    In the left pane, you see a liquor_sales table listed under your project.

    Task 2. Load and query relational data

    In this section, you measure query performance for relational data in BigQuery.

    BigQuery supports large JOINs, and JOIN performance is good. However, BigQuery is a columnar datastore, and maximum performance is achieved on denormalized datasets. Because BigQuery storage is inexpensive and scalable, it's a good practice to denormalize and pre-JOIN datasets into homogeneous tables. In other words, you exchange compute resources for storage resources (the latter being more performant and cost-effective).

    In this section, you will be doing the following:

    • Upload a set of tables from a relational schema (in 3rd normal form).
    • Run queries against the relational tables.
    • Note the performance of the queries to compare to the performance of the same queries against a table from a denormalized schema containing the same information.

    You upload tables that have a relational schema. The relational schema consists of the following tables:

    Table name Description
    sales Contains the date and sales metrics.
    item The description of the item sold.
    vendor The producer of the item.
    category The grouping to which the item belongs.
    store The store that sold th item.
    county The county where the item was sold.
    convenience_store The list of stores that are considered convenience stores.

    Here's a diagram of the relational schema.

    Relational schema diagram of data table

    Create the sales table

    1. In the Explorer section, click on the View actions icon next to the liquor_sales dataset, select Open, and then click Create Table.

    Create table button

    1. On the Create Table page, in the Source section, do the following:
    • For Create table from, choose Google Cloud Storage.
    • Enter the path to the Google Cloud Storage bucket name:
    cloud-training/data-insights-course/labs/optimizing-for-performance/sales.csv
    • For File format, choose CSV.
    Note: When you've created a table previously, the Select Previous Job option allows you to quickly use your settings to create similar tables.
    1. In the Destination section, configure the following:
    • For Table name, enter sales.
    • Leave the remaining destination fields at their defaults.

    Set schema table name to sales

    1. In the Schema section, you configure the following:
    • Click Edit as text.
    • Copy and paste the below schema:
    [ { "name": "date", "type": "STRING" }, { "name": "store", "type": "STRING" }, { "name": "category", "type": "STRING" }, { "name": "vendor_no", "type": "STRING" }, { "name": "item", "type": "STRING" }, { "name": "state_btl_cost", "type": "FLOAT" }, { "name": "btl_price", "type": "FLOAT" }, { "name": "bottle_qty", "type": "INTEGER" }, { "name": "total", "type": "FLOAT" } ]
    1. Click Advanced options to display and configure these options:
    • For Field delimiter, verify that Comma is selected.
    • Because sales.csv contains a single header row, set Header rows to skip, to 1.
    • Check Quoted newlines.
    • Accept the remaining default values and click Create Table.

    Configure advanced table options dialog

    BigQuery creates a load job to create the table and upload data into the table (this may take a few seconds).

    1. Click Personal History to track job progress.

    Create the remaining tables

    Create the remaining tables in the relational schema using Cloud Shell command line.

    1. Create the category table:
    bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.category gs://cloud-training/data-insights-course/labs/optimizing-for-performance/category.csv category:STRING,category_name:STRING
    1. Create the convenience_store table:
    bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.convenience_store gs://cloud-training/data-insights-course/labs/optimizing-for-performance/convenience_store.csv store:STRING
    1. Create the county table:
    bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.county gs://cloud-training/data-insights-course/labs/optimizing-for-performance/county.csv county_number:STRING,county:STRING
    1. Create the item table:
    bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.item gs://cloud-training/data-insights-course/labs/optimizing-for-performance/item.csv item:STRING,description:string,pack:INTEGER,liter_size:INTEGER
    1. Create the store table:
    bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.store gs://cloud-training/data-insights-course/labs/optimizing-for-performance/store.csv store:STRING,name:STRING,address:STRING,city:STRING,zipcode:STRING,store_location:STRING,county_number:STRING
    1. Create the vendor table:
    bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.vendor gs://cloud-training/data-insights-course/labs/optimizing-for-performance/vendor.csv vendor_no:STRING,vendor:STRING
    1. Go back to the BigQuery web UI. Confirm you see the new tables loaded into your liquor_sales dataset. Refresh the browser if needed.

    Liquor_sales dataset expanded

    Query relational data

    Next, you use the Query editor to query your data.

    1. Under the Query editor code box, click More > Query Settings.

    2. In the Resource management, Cache preference section, uncheck the Use Cached Results checkbox and click Save. If you have to run the query more than once, you don't want to use cached results.

    3. In the Query editor window, enter the following query against the relational tables and click Run:

    #standardSQL SELECT gstore.county AS county, ROUND(cstore_total/gstore_total * 100,1) AS cstore_percentage FROM ( SELECT cy.county AS county, SUM(total) AS gstore_total FROM `liquor_sales.sales` AS s JOIN `liquor_sales.store` AS st ON s.store = st.store JOIN `liquor_sales.county` AS cy ON st.county_number = cy.county_number LEFT OUTER JOIN `liquor_sales.convenience_store` AS c ON s.store = c.store WHERE c.store IS NULL GROUP BY county) AS gstore JOIN ( SELECT cy.county AS county, SUM(total) AS cstore_total FROM `liquor_sales.sales` AS s JOIN `liquor_sales.store` AS st ON s.store = st.store JOIN `liquor_sales.county` AS cy ON st.county_number = cy.county_number LEFT OUTER JOIN `liquor_sales.convenience_store` AS c ON s.store = c.store WHERE c.store IS NOT NULL GROUP BY county) AS hstore ON gstore.county = hstore.county
    1. At the bottom, in the Query results section, click on the Results tab. Note the Query complete time. An example is shown below (your execution time may vary).

    Query results tab

    This will be compared to the time to query a flattened dataset in later sections.

    Task 3. Load and query flattened data

    In this section, you denormalize the schemas and analyze liquor sales for the state of Iowa using the flattened data. Running this query on the flattened data should be faster than running on the relational data. You will note the time to compare and confirm.

    A denormalized schema flattens all relational data into a single row. For example, in the denormalized schema, county_number, county, store, name, address, city, zipcode, store_location, county_number, and cstore are fields containing all fields from the County, Store, and Convenience_store tables.

    Note: The cstore field (in the denormalized schema) represents the convenience_store.store field in the relational schema above. It has a value of Y if a store is a convenience store, and null otherwise.

    The following diagram shows the denormalized schema.

    Denormalized schema diagram

    Create the iowa_sales_denorm table

    1. In the left pane, select liquor_sales dataset and click Create Table on the right.

    The Create table dialog opens.

    1. In the Source section, configure the following:
    • For Create table from, choose Google Cloud Storage.
    • Enter the path to the Google Cloud Storage bucket name:
    cloud-training/data-insights-course/labs/optimizing-for-performance/iowa_sales_denorm.csv
    • For File format, choose CSV.
    1. In the Destination section, configure as follows:
    • For Table name, enter iowa_sales_denorm.
    • Leave the remaining destination fields at their defaults.
    1. In the Schema section, configure the following:
    • Click Edit as text.
    • Copy and paste the below schema:
    [ { "name": "date", "type": "STRING" }, { "name": "cstore", "type": "STRING" }, { "name": "store", "type": "STRING" }, { "name": "name", "type": "STRING" }, { "name": "address", "type": "STRING" }, { "name": "city", "type": "STRING" }, { "name": "zipcode", "type": "STRING" }, { "name": "store_location", "type": "STRING" }, { "name": "county_number", "type": "STRING" }, { "name": "county", "type": "STRING" }, { "name": "category", "type": "STRING" }, { "name": "category_name", "type": "STRING" }, { "name": "vendor_no", "type": "STRING" }, { "name": "vendor", "type": "STRING" }, { "name": "item", "type": "STRING" }, { "name": "description", "type": "STRING" }, { "name": "pack", "type": "INTEGER" }, { "name": "liter_size", "type": "INTEGER" }, { "name": "state_btl_cost", "type": "FLOAT" }, { "name": "btl_price", "type": "FLOAT" }, { "name": "bottle_qty", "type": "INTEGER" }, { "name": "total", "type": "FLOAT" } ]
    1. In the Advanced options section, configure as follows:
    • For Field delimiter, verify that Comma is selected.
    • Because iowa_sales_denorm.csv contains a single header row, for Header rows to skip, enter 1.
    • Check Quoted newlines.
    • Accept the remaining default values and click Create Table.

    BigQuery creates a load job to create the table and upload data into the table (this may take a few seconds).

    1. Click Personal History to track job progress.

    2. Enter and run the following query against the table with a denormalized schema (this query produces the same results as the query in the previous section):

    #standardSQL SELECT gstore.county AS county, ROUND(cstore_total/gstore_total * 100,1) AS cstore_percentage FROM ( SELECT county, sum(total) AS gstore_total FROM `liquor_sales.iowa_sales_denorm` WHERE cstore is null GROUP BY county) AS gstore JOIN ( SELECT county, sum(total) AS cstore_total FROM `liquor_sales.iowa_sales_denorm` WHERE cstore is not null GROUP BY county) AS cstore ON gstore.county = cstore.county ORDER BY county
    1. At the bottom, in the Query results section, click on the Results tab, note the Query complete time. This will be compared to the time to query a flattened dataset in later sections.

    2. Note the time the query takes to run by subtracting the Start Time from the End Time.

    Notice that the query corresponding to the table with denormalized schema runs slightly faster, and has simpler syntax. Wherever possible, pre-JOIN datasets into homogeneous tables to optimize performance in BigQuery.

    Compare query performance with execution details

    1. Select PROJECT HISTORY.

    2. Click the first query job you ran against the normalized relational schema, then click OPEN AS NEW QUERY.

    3. Select Execution details.

    The execution plan has two major sections:

    • Average and max worker timing by type of work per stage

    • High level perforomance benchmarks

      • Elapsed time: Total time for the query to process.
      • Slot time consumed: If the query were not processed in parallel on multiple machines, how long would it take to process.
      • Bytes shuffled: Automatic in-memory data shuffling for massive parallel processing.
      • Bytes spilled to disk: If data cannot be processed in memory, how much was spilled to persistent disk (usually data skew is to blame).
    1. First, compare the benchmark timings between each of the queries we ran.

    1. Next, compare the type of work where the workers spent the most time.

    Query 1. Relational schema execution details

    Relational schema execution details

    Relational schema execution details query

    Query 2. Denormalized schema execution details

    Denormalized schema execution details

    Denormalized schema execution details query

    Observations:

    • The denormalized query (#2) is faster and uses less slot time to achieve the same result.
    • The relational query (#1) has many more input stages and spends the most worker time in joining the datasets together.
    • The denormalized query (#2) spends the most time reading data input and outputting the results. There is minimal time spent in aggregations and joins.
    • Neither query resulted in bytes spilled to disk which suggests out datasets are likely not skewed (or significantly large enough to spill out from memory from an individual worker).
    Note: The queries used in this lab are for demonstration purposes only. The time difference between the 2 queries becomes more significant as the dataset sizes increase and the complexity of the JOIN clauses increases.

    To learn more about execution details and query plan optimization, you can refer to the query plan explanation reference guide.

    Avoiding performance anti-patterns

    Now that you are familiar with effective database schema design, it is time to practice optimizing some poorly written queries.

    The below query is running slowly, what can you do to correct it?

    1. Copy and paste the below query into the Query editor and run the query to get a benchmark.

    Goal: Count all the U.S. non-profit organizations that have filed taxes using paper (non-electronic) in 2015.

    #standardSQL # count all paper filings for 2015 SELECT * FROM `bigquery-public-data.irs_990.irs_990_2015` WHERE UPPER(elf) LIKE '%P%' #Paper Filers in 2015 ORDER BY ein # 86,831 as per pagination count, 23s

    What can you do to improve the performance?

    1. Compare against the below solution:
    #standardSQL SELECT COUNT(*) AS paper_filers FROM `bigquery-public-data.irs_990.irs_990_2015` WHERE elf = 'P' #Paper Filers in 2015 # 86,831 at 2s /* Remove ORDER BY when there is no limit Use Aggregation Functions Examine data and confirmed P always uppercase */
    1. Run your updated version and track the time.

    2. Clear the Query editor.

    This new below query is running slowly. (Run the query to get a benchmark; stop after 30 seconds if it does not complete).

    Goal: Using the Employer Identification Number (ein) as a linking field, join together the tax return filings table with the organizational names table and return the names of all the organizations that filed in 2015.

    1. Add this query in the Query editor, then click Run:
    #standardSQL # get all Organization names who filed in 2015 SELECT tax.ein, name FROM `bigquery-public-data.irs_990.irs_990_2015` tax JOIN `bigquery-public-data.irs_990.irs_990_ein` org ON tax.tax_pd = org.tax_period
    1. Correct the above query. (Hint: Remember the correct JOIN field condition for our schema).

    Compare against the below solution.

    1. Add this query in the Query editor, then click Run:
    #standardSQL # get all Organization names who filed in 2015 SELECT tax.ein, name FROM `bigquery-public-data.irs_990.irs_990_2015` tax JOIN `bigquery-public-data.irs_990.irs_990_ein` org ON tax.ein = org.ein # 86,831 as per pagination count, 23s /* Incorrect JOIN key resulted in CROSS JOIN Correct result: 294,374 at 13s */
    1. Run your updated version and track the time.

    Do you see an improvement? How quickly does the query run?

    Lessons learned

    Congratulations!

    This concludes this hands-on lab looking at Effective BigQuery Schema Design and Query Performance. You have loaded CVS and JSON files into BigQuery tables, transformed data and join tables, stored query results, and then confirmed that data was transformed and loaded correctly.

    End your lab

    When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

    You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

    The number of stars indicates the following:

    • 1 star = Very dissatisfied
    • 2 stars = Dissatisfied
    • 3 stars = Neutral
    • 4 stars = Satisfied
    • 5 stars = Very satisfied

    You can close the dialog box if you don't want to provide feedback.

    For feedback, suggestions, or corrections, please use the Support tab.

    Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.

    Previous Next

    Before you begin

    1. Labs create a Google Cloud project and resources for a fixed time
    2. Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
    3. On the top left of your screen, click Start lab to begin

    This content is not currently available

    We will notify you via email when it becomes available

    Great!

    We will contact you via email if it becomes available

    One lab at a time

    Confirm to end all existing labs and start this one

    Use private browsing to run the lab

    Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which may cause extra charges incurred to your personal account.
    Preview