
Performance and Cost Optimization with BigQuery


Lab · 1 hour 30 minutes · 1 credit · Introductory
Note: This lab may include AI tools that support the learning process.

GSP266

Google Cloud Self-Paced Labs

Overview

BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without any infrastructure to manage and without needing a database administrator. BigQuery uses SQL and takes advantage of a pay-as-you-go model, letting you focus on analyzing data to find meaningful insights.

In this lab, you will learn to improve your database from a performance and cost perspective.

Objectives

In this lab, you will learn to:

  • Create BigQuery tables and load data into them.
  • Reduce the cost of a query by using partitioning.
  • Optimize a query by using denormalization.
  • Improve row filtering and join performance by using clustering.
  • Improve dashboard loading speed and cost by using a materialized view.
  • Reduce the elapsed time and cost by pruning the partitions with filters.
  • Save cost by moving data out of unused partitions in BigQuery and onto cheaper Google Cloud Storage.

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This Qwiklabs hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

What you need

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended)
  • Time to complete the lab
Note: If you already have your own personal Google Cloud account or project, do not use the same account/project for this lab to avoid incurring charges.
Note: If you are using a Pixelbook, open an Incognito window to run this lab.

How to start your lab and sign in to the Google Cloud console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab details pane, which is populated with the temporary credentials that are needed for this lab.

    Lab details pane

  2. Copy the Password and then click Open Google Cloud console. The lab spins up resources, then opens another tab that shows the Sign in page.

    Tip: Open the tabs in separate windows, side by side. Note: If you see the Choose an account page, click Use another account.
  3. On the Sign in page, verify that the username from the Lab details pane is auto-filled. Click Next.

  4. Paste the password in the Enter your password field. Click Next.

    Important: Use the credentials from the Lab details pane. Using your personal Google Cloud account may incur charges to your account.
  5. Click through the subsequent pages:

    • Understand your account management.

    • Accept the terms and conditions.

After a few moments, the console opens.

Note: You can view the menu with a list of Google Cloud products and services by clicking the Navigation menu in the top left. Google Cloud console menu with the Navigation menu icon highlighted

Activate Cloud Shell

Cloud Shell is a virtual machine (VM) that is loaded with development tools. It offers a persistent 5GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

  1. In the Cloud Console, in the top-right toolbar, click the Activate Cloud Shell button.

Cloud Shell icon

  2. Click Continue.

cloudshell_continue.png

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:

Cloud Shell Terminal

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  3. You can list the active account name with this command:
gcloud auth list

Output:

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

Example output:

Credentialed accounts:
 - google1623327_student@qwiklabs.net

  4. You can list the project ID with this command:
gcloud config list project

Output:

[core]
project = <project_ID>

Example output:

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: For full documentation of gcloud, see the gcloud command-line tool overview.

Importing data into your project

Now you will add the Instacart data warehouse files to your Google Cloud project. This export is publicly available on Kaggle. To save you from creating a Kaggle account, the CSV files have been uploaded to a publicly accessible Google Cloud Storage bucket for use with this lab.

  1. Execute the following to create an environment variable for your Google Cloud Project ID:
export GCP_PROJECT_ID=$(gcloud config list core/project --format="value(core.project)")
  2. Enter the following snippet into the command-line prompt to create a new bucket:
gcloud storage buckets create gs://${GCP_PROJECT_ID}-cymbal-bq-opt --project=${GCP_PROJECT_ID} --location=EU --uniform-bucket-level-access
  3. Click Authorize in the Authorize Cloud Shell dialog box that pops up. authorize.png

  4. Enter the following snippets into the command-line prompt to copy the Instacart data into your new bucket:

gsutil cp gs://spls/gsp266/departments.csv gs://${GCP_PROJECT_ID}-cymbal-bq-opt/departments.csv
gsutil cp gs://spls/gsp266/aisles.csv gs://${GCP_PROJECT_ID}-cymbal-bq-opt/aisles.csv
gsutil cp gs://spls/gsp266/products.csv gs://${GCP_PROJECT_ID}-cymbal-bq-opt/products.csv
gsutil cp gs://spls/gsp266/orders_with_timestamps.csv gs://${GCP_PROJECT_ID}-cymbal-bq-opt/orders_with_timestamps.csv
gsutil cp gs://spls/gsp266/order_products.csv gs://${GCP_PROJECT_ID}-cymbal-bq-opt/order_products.csv
  5. Now confirm that all five CSV files have been uploaded to the bucket in your project:
gsutil ls -l gs://${GCP_PROJECT_ID}-cymbal-bq-opt/
  6. To lay the groundwork for this lab, two BigQuery datasets need to be created using the following commands:
bq --location=EU mk \
  --dataset \
  ${GCP_PROJECT_ID}:cymbal_bq_opt_1

bq --location=EU mk \
  --dataset \
  ${GCP_PROJECT_ID}:cymbal_bq_opt_2

Check BigQuery Datasets creation

Note: Please wait for about 10 seconds after creating your resources and before checking your progress on the Activity Tracking. If you click it sooner, it might show that the resources are not created.

Task 1: Create BigQuery tables and load data into the tables

  1. In Cloud Shell, run the following commands to create five tables and load data into these tables:
bq load \
  --autodetect \
  --source_format=CSV \
  cymbal_bq_opt_1.departments \
  gs://${GCP_PROJECT_ID}-cymbal-bq-opt/departments.csv

bq load \
  --autodetect \
  --source_format=CSV \
  cymbal_bq_opt_1.aisles \
  gs://${GCP_PROJECT_ID}-cymbal-bq-opt/aisles.csv

bq load \
  --autodetect \
  --source_format=CSV \
  cymbal_bq_opt_1.products \
  gs://${GCP_PROJECT_ID}-cymbal-bq-opt/products.csv

bq load \
  --autodetect \
  --source_format=CSV \
  cymbal_bq_opt_1.orders_with_timestamps \
  gs://${GCP_PROJECT_ID}-cymbal-bq-opt/orders_with_timestamps.csv

bq load \
  --autodetect \
  --source_format=CSV \
  cymbal_bq_opt_1.order_products \
  gs://${GCP_PROJECT_ID}-cymbal-bq-opt/order_products.csv

Check BigQuery tables creation

  2. Once those commands have finished, execute the following command to check that all five tables were created:
bq ls cymbal_bq_opt_1

The output should be a table with contents similar to the output below:

all_tables.png

Task 2: Write SQL queries

Below are a few questions regarding the data that you loaded into BigQuery tables earlier. Write SQL queries to solve the following problem statements in the BigQuery Editor and answer the questions. Also, note the bytes billed from the job details page for each of those queries.
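If you prefer to check costs from SQL rather than from the job details page, you can also query the INFORMATION_SCHEMA jobs view. A minimal sketch, assuming the EU location used for this lab's datasets (the rows returned depend on your own query history):

-- List recent query jobs and their billed bytes, most recent first
SELECT
  creation_time,
  job_id,
  total_bytes_billed,
  query
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 10;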

The entity relationship diagram may be helpful and is presented below:

relationship_between_tables.png

Open BigQuery Console

In the Google Cloud Console, select Navigation menu > BigQuery:

BigQuery_menu.png

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.

Click Done.

The BigQuery console opens.

bq-console.png

Click on + Compose new query.

Use this query to answer the question:

SELECT
  order_ts,
  order_hour_of_day,
  COUNT(order_id)
FROM `cymbal_bq_opt_1.orders_with_timestamps`
WHERE DATE(order_ts) = "2022-08-01"
GROUP BY order_ts, order_hour_of_day
ORDER BY order_hour_of_day ASC;

Note that the bytes billed was approximately 53 MB.

Use this query to answer the question:

WITH target_orders AS (
  SELECT order_id
  FROM `cymbal_bq_opt_1.orders_with_timestamps`
  WHERE DATE(order_ts) = "2022-08-01"
)
SELECT
  p.product_name,
  COUNT(*) AS volume_of_product_purchased
FROM target_orders o
INNER JOIN `cymbal_bq_opt_1.order_products` map
  ON o.order_id = map.order_id
INNER JOIN `cymbal_bq_opt_1.products` p
  ON map.product_id = p.product_id
GROUP BY p.product_name
ORDER BY volume_of_product_purchased DESC;

The bytes billed was approximately 550 MB.

Use this query to answer the question:

SELECT
  o.product_id,
  p.product_name
FROM `cymbal_bq_opt_1.order_products` o
JOIN `cymbal_bq_opt_1.products` p
  ON o.product_id = p.product_id
WHERE order_id = 1564244;

The bytes billed was approximately 745 MB.

Use this query to answer the question:

SELECT
  TIMESTAMP_TRUNC(order_ts, HOUR) AS order_hour,
  COUNT(*) AS volume_of_orders
FROM `cymbal_bq_opt_1.orders_with_timestamps`
GROUP BY order_hour
ORDER BY order_hour;

The bytes billed was approximately 10 MB.

Task 3: Reduce the cost by making use of partitioning

In this task, you will learn to optimize the cost measured in bytes billed for a query by partitioning a table. A partitioned table is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You partition tables by specifying a partition column which is used to segment the table.

If a query uses a qualifying filter on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining partitions. This process is called pruning.

Challenge A: Write a query to find which hour was the busiest in terms of order volume on August 1, 2022.

Original query

Your boss has written a query to solve this challenge.

  1. In the SQL Workspace toolbar, click the + Compose new query icon to open the SQL code editor.

  2. Copy and paste the following query into the BigQuery editor, then click Run.

SELECT
  order_ts,
  order_hour_of_day,
  COUNT(order_id)
FROM `cymbal_bq_opt_1.orders_with_timestamps`
WHERE DATE(order_ts) = "2022-08-01"
GROUP BY order_ts, order_hour_of_day
ORDER BY order_hour_of_day ASC;

This query outputs the busiest hour in terms of order volume for 1st August 2022.

Job information details

Your boss noted that the cost of this query, measured in bytes billed, was 79 MB.

  1. In the Query results pane, under the job information tab, note the number of bytes billed. Bytes billed: 79 MB

Your boss is unhappy with this query and has asked you to review its performance and optimize it.

Optimize this query using the techniques that you have learned.

Execution Details

  1. In the Query results pane, examine the details in the execution details tab.

execution details query results

One observation is that in stage "S00: Input" you are doing a full table scan: "Records read: 3421083" equals the total number of rows in cymbal_bq_opt_1.orders_with_timestamps. For each record in this table, BigQuery slots spend compute to convert a TIMESTAMP to a DATE and compare it to the value "2022-08-01". This is done 3.4 million times.

One way to avoid a full table scan, and to reduce the cost of the query, is to make use of table partitioning.

Optimize the query using table partitioning

  1. In the BigQuery editor, copy and paste the following query and click the Run button to create a new partitioned table.
CREATE TABLE cymbal_bq_opt_2.orders_with_timestamps
PARTITION BY DATE(order_ts)
OPTIONS (
  require_partition_filter = TRUE
)
AS
SELECT *
FROM `cymbal_bq_opt_1.orders_with_timestamps`;

Check partitioned table creation
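Optionally, you can confirm how the new table was split into daily partitions by querying the dataset's INFORMATION_SCHEMA.PARTITIONS view. A minimal sketch (the partition IDs and row counts you see depend on the data):

-- List the daily partitions of the new table and their sizes
SELECT
  partition_id,
  total_rows,
  total_logical_bytes
FROM `cymbal_bq_opt_2.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'orders_with_timestamps'
ORDER BY partition_id;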

  2. Run a query on this partitioned table to calculate the total order volume by date and hour:
SELECT
  order_ts,
  order_hour_of_day,
  COUNT(order_id)
FROM `cymbal_bq_opt_2.orders_with_timestamps`
WHERE DATE(order_ts) = "2022-08-01"
GROUP BY order_ts, order_hour_of_day
ORDER BY order_hour_of_day ASC;

Comparison

As shown in the table below, partitioning greatly reduces bytes billed. This is driven by a massive reduction in records read in stage "S00: Input", achieved by avoiding a full table scan.

Metric | Challenge A (Before) | Challenge A (After) | Improvement
Bytes billed | 79 MB | 10 MB | 87% reduction
No. of records read in "S00: Input" | 3,421,083 | 114,036 | 96% reduction
When getting your final estimates of bytes billed, make sure to clear "Use cached results" for the most accurate measurement. See Disabling retrieval of cached results for more information on how to do this.

Task 4: Optimize a query by using denormalization

Challenge B: Your boss has written the following query to find the most popular product sold on August 1, 2022.

  1. In the BigQuery editor, copy and paste the following query, then click Run:
WITH target_orders AS (
  SELECT order_id
  FROM `cymbal_bq_opt_1.orders_with_timestamps`
  WHERE DATE(order_ts) = "2022-08-01"
)
SELECT
  p.product_name,
  COUNT(*) AS volume_of_product_purchased
FROM target_orders o
INNER JOIN `cymbal_bq_opt_1.order_products` map
  ON o.order_id = map.order_id
INNER JOIN `cymbal_bq_opt_1.products` p
  ON map.product_id = p.product_id
GROUP BY p.product_name
ORDER BY volume_of_product_purchased DESC;

This query outputs the most popular product sold on August 1, 2022.

Job information details

  1. In the Query Results pane, under the job information tab, note the number of bytes billed.

Your boss sees that the cost of this query, measured in bytes billed, was 550 MB.

Your boss is unhappy with this query and has asked you to review its performance and optimize it.

Optimize this query using the techniques that you have learned.

Execution details

  1. In the Query results pane, examine the details in the execution details tab. execution details query results

From the job information page, you see that the bytes billed is 550 MB.

One observation is that the most expensive stage is "S03: Join+" as slots are spending time on Compute to join the three input tables. The dataset seems to be in a normalized form. There is a many-to-many relationship between orders and products. One order has many products, and one product may be purchased across many orders. To deal with this, a bridging table called order_products is introduced. This way, there is a 1-to-many relationship on either side of the bridging table. One order has many order_products and one product has many order_products. Because of this normalized structure, three tables need to be joined. This join is very expensive.

For this analytics use case, this normalized Online Transaction Processing (OLTP) approach is not necessary. By denormalizing the structure of the dataset, you can greatly improve the performance of analytical queries.

Optimize the query using denormalization

  1. Run the following query to achieve this denormalization by using nested and repeated fields:
CREATE TABLE cymbal_bq_opt_2.orders_nested
PARTITION BY DATE(order_ts)
OPTIONS (
  require_partition_filter = TRUE
)
AS
WITH flat_orders AS (
  SELECT
    o.order_id,
    o.order_ts,
    o.order_dow,
    o.order_hour_of_day,
    o.user_id,
    o.days_since_prior_order,
    o.order_number,
    STRUCT(
      p.product_id,
      p.product_name,
      p.aisle_id,
      p.department_id,
      map.add_to_cart_order,
      map.reordered
    ) AS product_purchased
  FROM `cymbal_bq_opt_1.orders_with_timestamps` o
  INNER JOIN `cymbal_bq_opt_1.order_products` map
    ON o.order_id = map.order_id
  INNER JOIN `cymbal_bq_opt_1.products` p
    ON map.product_id = p.product_id
)
SELECT
  order_id,
  order_ts,
  order_dow,
  order_hour_of_day,
  user_id,
  days_since_prior_order,
  order_number,
  ARRAY_AGG(product_purchased) AS order_details
FROM flat_orders
GROUP BY
  order_id,
  order_ts,
  order_dow,
  order_hour_of_day,
  user_id,
  days_since_prior_order,
  order_number;

Check denormalized table creation

  2. With the new denormalized table created, the query can be rewritten using the following code. Note that the query makes use of the array-flattening technique.
WITH target_orders AS (
  SELECT order_details
  FROM `cymbal_bq_opt_2.orders_nested`
  WHERE DATE(order_ts) = "2022-08-01"
)
SELECT
  product.product_name,
  COUNT(*) volume_purchased
FROM
  -- https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays
  target_orders,
  target_orders.order_details product
GROUP BY product_name
ORDER BY volume_purchased DESC;
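The comma between target_orders and target_orders.order_details in the FROM clause is an implicit correlated CROSS JOIN with UNNEST. If you prefer the explicit form, an equivalent sketch looks like this and should return the same result:

WITH target_orders AS (
  SELECT order_details
  FROM `cymbal_bq_opt_2.orders_nested`
  WHERE DATE(order_ts) = "2022-08-01"
)
SELECT
  product.product_name,
  COUNT(*) AS volume_purchased
FROM target_orders
-- Explicitly flatten the repeated STRUCT column into one row per product
CROSS JOIN UNNEST(target_orders.order_details) AS product
GROUP BY product_name
ORDER BY volume_purchased DESC;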

Comparison

As per the table below, denormalization has a big impact on the cost of the query (measured in bytes billed). This is because, through denormalization, you avoid the three table join that was seen in the original query.

Metric | Challenge B (Before) | Challenge B (After) | Improvement
Bytes billed | 550 MB | 29 MB | 94% reduction

Task 5: Improve row filtering and join performance by using clustering

Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.

In BigQuery, a clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns. The storage blocks are adaptively sized based on the size of the table. A clustered table maintains the sort properties in the context of each operation that modifies it. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition. As a result, BigQuery might not be able to accurately estimate the bytes to be processed by the query or the query costs, but it attempts to reduce the total bytes at execution.
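Partitioning and clustering are complementary and can be combined on one table. As an illustration only (this is not one of the lab's steps; the table name orders_part_clustered and the user_id clustering column are chosen just for this sketch), you could partition the orders by day and cluster each partition by customer:

-- Illustrative only: combine daily partitioning with clustering by user_id
CREATE TABLE cymbal_bq_opt_2.orders_part_clustered
PARTITION BY DATE(order_ts)
CLUSTER BY user_id
AS
SELECT *
FROM `cymbal_bq_opt_1.orders_with_timestamps`;

A query that filters on both DATE(order_ts) and user_id can then prune partitions and skip storage blocks within the remaining partitions.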

Challenge C: The Fraud Team has developed a new ML model, and one of its inputs is the number of products per order. The ML model has predicted that order 1564244 is suspicious. Write a query to report which products were purchased in that order. Report on the sequence in which they were added to the basket.

Write a query to find which product was added to the basket first in this suspicious order.

Original query

  1. In the BigQuery editor, copy and paste the following query and click Run:
SELECT
  o.product_id,
  p.product_name
FROM `cymbal_bq_opt_1.order_products` o
JOIN `cymbal_bq_opt_1.products` p
  ON o.product_id = p.product_id
WHERE order_id = 1564244;

This query finds which product was added to the basket first in this suspicious order.

Job information details

The Fraud Team noticed that the cost of this query, measured in bytes billed, was 497 MB. They are unhappy with this query and have asked you to review its performance and optimize it.

Execution details

  1. In the query results pane, examine the details in the execution details tab. execution details query results

Optimize this query using the techniques that you have learned.

Test your skills! Don't peek at the hints until you've had a try!

From the job information page, you see that the bytes billed is 497 MB.

Examine the execution details tab to see how this could be reduced:

  • Stage "S00: Input" is simply loading all 49K rows from <project_id>.cymbal_bq_opt_1.products.
  • Stage "S02: Output" is doing two things: scanning all 32 M rows of <project_id>.cymbal_bq_opt_1.
  • order_products to compare each row's order id with input 1564244 Joining rows from #1 with project_id.
  • cymbal_bq_opt_1.products on product_id.

Clustering the order_products table on order_id can help optimize stage "S02: Output", because it sorts the rows in the table and makes it easier to look up the rows with order_id = 1564244.

Clustering the products table on product_id can also help by improving join performance.
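The following steps cluster each table on a single column. As an aside, BigQuery also allows up to four clustering columns on one table; a hedged sketch of that alternative (the table name order_products_multi_clustered is hypothetical and not used elsewhere in this lab):

-- Illustrative only: one table clustered on both the lookup and the join column
CREATE TABLE cymbal_bq_opt_2.order_products_multi_clustered
CLUSTER BY order_id, product_id
AS
SELECT *
FROM `cymbal_bq_opt_1.order_products`;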

Optimize the query using clustering

  1. In the BigQuery editor, run the following query to cluster by order_id.
CREATE TABLE cymbal_bq_opt_2.order_products
CLUSTER BY order_id
AS
SELECT *
FROM `cymbal_bq_opt_1.order_products`;

Check cluster by order_id creation

  2. Run the following query to cluster by product_id.
CREATE TABLE cymbal_bq_opt_2.products
CLUSTER BY product_id
AS
SELECT *
FROM `cymbal_bq_opt_1.products`;

Check cluster by product_id creation

  3. Run the query on these clustered tables to get the desired query result.
SELECT
  o.product_id,
  p.product_name
FROM `cymbal_bq_opt_2.order_products` o
JOIN `cymbal_bq_opt_2.products` p
  ON o.product_id = p.product_id
WHERE order_id = 1564244;

Comparison

As shown in the table below, using clustering greatly reduces bytes billed.

Metric | Challenge C (Before) | Challenge C (After) | Improvement
Bytes billed | 497 MB | 20 MB | 95% reduction

Task 6: Improve dashboard loading speed and cost by using materialized view

In this task, you will create a materialized view. In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and, whenever possible, reads only delta changes from the base tables to compute up-to-date results. Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
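By default, BigQuery keeps materialized views up to date automatically as the base table changes. If you want to see the refresh controls spelled out, here is a hedged sketch; the view name orders_by_hour_demo is hypothetical (the view actually used in this task is created in a later step), and enable_refresh defaults to true:

-- Illustrative only: a materialized view with explicit refresh options
CREATE MATERIALIZED VIEW `cymbal_bq_opt_1.orders_by_hour_demo`
OPTIONS (
  enable_refresh = true,           -- keep the view automatically up to date
  refresh_interval_minutes = 60    -- refresh at most once per hour
)
AS
SELECT
  TIMESTAMP_TRUNC(order_ts, HOUR) AS order_hour,
  COUNT(*) AS volume_of_orders
FROM `cymbal_bq_opt_1.orders_with_timestamps`
GROUP BY order_hour;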

Challenge D: It's the end of the month and your boss wants a chart showing order volume by date and hour for August. Write a SQL query to calculate the total order volume by date and hour. On which date and at which hour were the most orders placed in August (e.g., 2022-08-11 13H)?

Original query

SELECT
  TIMESTAMP_TRUNC(order_ts, HOUR) AS order_hour,
  COUNT(*) AS volume_of_orders
FROM `cymbal_bq_opt_1.orders_with_timestamps`
GROUP BY order_hour
ORDER BY order_hour;

Your boss was really happy with your hourly order volume table. She discovered, using her amazing BigQuery skills, that this was the query you wrote to create this table.

She was so happy that she added your query to her dashboard. She made a time series chart of hourly sales in August, powered by your query and shared the dashboard with the CEO!

Excited by the email from your boss, the CEO tries to open the dashboard. He is furious; it is very slow to load. The CEO vents his anger by replying to your boss' email. Unfortunately, your boss has just gone on vacation. Her out-of-office auto reply mentions you as point-of-contact for the dashboard. To make matters worse, you do not have edit access to the dashboard, so you cannot edit the query!

Make the query run faster for the CEO and improve dashboard loading speed and cost.

The original loading speed, measured in query Elapsed Time, was around 1,000 ms. What is your new and improved speed?

The original cost, measured in bytes billed, was around 27 MB. What is your new and reduced cost?

The answer here is to use a materialized view.

In particular, you need to rely on smart tuning so that the hidden, read-only query in the dashboard makes use of your new materialized view.

Execution details

  1. In the Query results pane, examine the details in the execution details tab. execution details query results

As per stage "S00: Input" you're doing a full table scan of cymbal_bq_opt_1.orders_with_timestamps which is likely the reason for the slow performance and high cost.

You need to create a materialized view with the same logic in the same dataset, cymbal_bq_opt_1.

Optimize the query using materialized view and smart tuning

  1. In the BigQuery editor, run the following query to create a materialized view:
CREATE MATERIALIZED VIEW `cymbal_bq_opt_1.orders_by_hour` AS (
  SELECT
    TIMESTAMP_TRUNC(order_ts, HOUR) AS order_hour,
    COUNT(*) AS volume_of_orders
  FROM `cymbal_bq_opt_1.orders_with_timestamps`
  GROUP BY order_hour
);

Check materialized view creation

  2. Run the original query again; smart tuning now rewrites it to use the materialized view and returns the desired result:
SELECT
  TIMESTAMP_TRUNC(order_ts, HOUR) AS order_hour,
  COUNT(*) AS volume_of_orders
FROM `cymbal_bq_opt_1.orders_with_timestamps`
GROUP BY order_hour
ORDER BY order_hour;

Execution details after using materialized views

  1. In the Query results pane, examine the details in the execution details tab. execution details query results

As per stage "S00: Input", you are no longer reading from the base table, but are now reading from the materialized view (via the "smart tuning" feature).

Comparison

The overall impact of using a materialized view is that costs are reduced and query speed improves.

Metric | Challenge D (Before) | Challenge D (After) | Improvement
Bytes billed | 27 MB | 10 MB | 62% reduction
Elapsed Time | 1,000 ms | 389 ms | 61% reduction

The CEO is impressed and your boss gives you a raise!

Please note that the elapsed time shown above may not exactly match what you will see. Small variations are expected owing to the way that BigQuery has been set up in this lab.

Task 7: Reduce the elapsed time and cost by pruning partitions with a filter

Challenge E: Your boss has asked for a report of the order IDs on the day with the most customers. Your colleague has written the following SQL to pull the report.

Original query

  1. In the BigQuery editor, copy and paste the following query and click Run.
-- orders on day with most unique customers
SELECT
  user_id,
  order_id,
  DATE(order_ts)
FROM `cymbal_bq_opt_1.orders_with_timestamps`
WHERE DATE(order_ts) = (
  SELECT target_date
  FROM (
    -- find the date with the most unique customers
    SELECT
      DATE(order_ts) AS target_date,
      COUNT(user_id) AS customers
    FROM `cymbal_bq_opt_1.orders_with_timestamps`
    GROUP BY order_ts
    ORDER BY customers DESC
    LIMIT 1
  )
);

Your boss has told your colleague that the query is too expensive, with bytes billed = 79 MB.

Your colleague comes to you, the resident BigQuery expert, for help and advice on reducing this cost.

The original cost, measured in bytes billed, was around 79 MB. What is your new and reduced cost?

Execution details

  1. In the Query results pane, examine the execution details of the original query. execution details query results

The Execution Details page shows that you are doing a full table scan. Stage "S00: Input" is reading all 3.4 million rows of the table. To help avoid this, partition the table.

Optimize the query using partitioning

  1. Run the following query in the BigQuery editor to partition the table by DATE(order_ts):
CREATE TABLE cymbal_bq_opt_2.orders_with_timestamps_partition
PARTITION BY DATE(order_ts)
AS
SELECT *
FROM `cymbal_bq_opt_1.orders_with_timestamps`;

Check table partition by order_ts creation

  2. Run the query against the newly partitioned table:
-- query 2
-- orders on day with most unique customers
SELECT
  user_id,
  order_id,
  DATE(order_ts)
FROM `cymbal_bq_opt_2.orders_with_timestamps_partition`
WHERE DATE(order_ts) = (
  SELECT target_date
  FROM (
    -- find the date with the most unique customers
    SELECT
      DATE(order_ts) AS target_date,
      COUNT(user_id) AS customers
    FROM `cymbal_bq_opt_2.orders_with_timestamps_partition`
    GROUP BY order_ts
    ORDER BY customers DESC
    LIMIT 1
  )
);

Execution details after partitioning

  1. In the query results pane, revisit the execution details page now. execution details query results

The Execution Details page shows that you are still doing a full table scan. The subquery is not pruning partitions, but the query is executing faster than before (from 2 sec to 1 sec). The number of bytes billed is not reduced from the original query (seen in the Job Information tab).

Optimize by pruning partitions with a filter

Inspecting the results table of query 2 shows that the target date is "2022-08-17". What happens if you do not write a subquery and instead provide this date directly, as in the query below?

-- query 3
-- orders on day with most unique customers
SELECT
  user_id,
  order_id,
  DATE(order_ts)
FROM `cymbal_bq_opt_2.orders_with_timestamps`
WHERE DATE(order_ts) = "2022-08-17";

Execution details after pruning partitions with a filter

  1. In the Query results pane, revisit the execution details page now. execution details query results

Now you are no longer doing a full table scan. You are making use of the partitioning and pruning the partitions with the filter on the provided date string. This is seen in stage "S00: Input", where you are no longer reading the full 3.4M rows of the table. You have again reduced the elapsed time (2 sec reduced to 1 sec). As per the job information page, the bytes billed have been reduced from 79 MB to 10 MB.

Query | Bytes billed | Bytes saved | Elapsed time | Time reduction
Original query | 79 MB | - | 2 seconds | -
Query 2 | 79 MB | 0% | 1 second | 50%
Query 3 | 10 MB | 87% | 1 second | 50%

As per the table above, adding partitioning alone reduces the elapsed time by 50%, but does not reduce the bytes billed. This is because of the way BigQuery handles pruning of partitions through subqueries. The same behavior occurs when pruning partitions through joins, although that was not explored in this challenge.

For customers using on-demand pricing, this is important to remember. Adding partitioning alone does not reduce the financial costs if a query is using subqueries or joins to prune partitions.
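One hedged workaround, if the date must be computed dynamically, is to use BigQuery scripting: resolve the target date into a script variable first, then filter the partitioned table with that variable so the filter is a constant by the time the main statement runs. This is a sketch rather than a graded lab step; confirm the effect by checking bytes billed for the final statement in the job details.

-- Sketch: compute the target date first, then filter with the resolved value
DECLARE target_date DATE;

SET target_date = (
  SELECT DATE(order_ts) AS target_date
  FROM `cymbal_bq_opt_2.orders_with_timestamps_partition`
  GROUP BY order_ts
  ORDER BY COUNT(user_id) DESC
  LIMIT 1
);

SELECT
  user_id,
  order_id,
  DATE(order_ts)
FROM `cymbal_bq_opt_2.orders_with_timestamps_partition`
WHERE DATE(order_ts) = target_date;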

Congratulations

By working through several scenarios, you've learned how to reduce bytes billed through partitioning and pruning partitions with filters, optimize a query using denormalization, and improve row filtering and join performance with clustering. You have also improved dashboard loading speed and cost by using a materialized view.

Google Cloud Training & Certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated November 01, 2023

Lab Last Tested November 01, 2023

Copyright 2023 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
