
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Performance tuning of BigQuery is usually carried out because we wish to reduce query execution times or cost. In this lab, we will look at a number of performance optimizations that might work for your use case. Performance tuning should be carried out only at the end of the development stage, and only if it is observed that typical queries take too long.
It is far better to have flexible table schemas and elegant, readable, and maintainable queries than to obfuscate table layouts and queries in search of a tiny bit of performance. However, there will be instances where you do need to improve the performance of your queries, perhaps because they are carried out so often that small improvements are meaningful. Another aspect is that knowledge of performance tradeoffs can help you in deciding between alternative designs.
In this lab, you learn about the following techniques for reducing BigQuery execution times and costs:
- Minimize I/O
- Cache results of previous queries
- Efficient joins
- Denormalization
- Avoid overwhelming a worker
- Approximate aggregation functions
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
A query that computes the sum of three columns will be slower than a query that computes the sum of two columns, but most of the performance difference is due to reading more data, not the extra addition. Similarly, a query that computes the mean of a column will be nearly as fast as one that computes its variance (even though computing variance requires BigQuery to keep track of both the sum and the sum of the squares), because most of the overhead of simple queries is caused by I/O, not by computation.
Because BigQuery uses columnar file formats, the fewer columns that are read in a SELECT, the less data needs to be read. In particular, a SELECT * reads every column of every row in the table, making it quite slow and expensive.
The exception is when you use a SELECT * in a subquery and then reference only a few fields in an outer query; the BigQuery optimizer is smart enough to read only the columns that are actually required.
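The lab's exact query is not reproduced in this extract. A sketch of the idea, reading only the two columns actually needed from the public london_bicycles dataset (table and column names assumed from that public dataset), looks like this:

-- Find the longest rental, reading only the columns required to answer the question.
SELECT
  bike_id,
  duration
FROM `bigquery-public-data.london_bicycles.cycle_hire`
ORDER BY duration DESC
LIMIT 1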
In the Query results window, notice that the query completed in ~1.2s and processed ~372 MB of data.
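Compare that to reading every column. A sketch of the wasteful form (again illustrative, not necessarily the lab's literal query):

-- Same question, but SELECT * forces every column of every row to be read.
SELECT *
FROM `bigquery-public-data.london_bicycles.cycle_hire`
ORDER BY duration DESC
LIMIT 1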
In the Query results window, notice that this query completed in ~4.5s and consumed ~2.6 GB of data. Much longer!
If you require nearly all the columns in a table, consider using SELECT * EXCEPT so as to not read the ones you don't require.
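For example, a sketch of this pattern on the same table (the excluded column names here are just an illustration):

-- Read all columns except the two station id columns.
SELECT * EXCEPT (start_station_id, end_station_id)
FROM `bigquery-public-data.london_bicycles.cycle_hire`
LIMIT 10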
When tuning a query, it is important to start with the data that is being read and consider whether it is possible to reduce this. Suppose we wish to find the typical duration of the most common one-way rentals.
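The lab's query is not included in this extract; a sketch of a natural first attempt, grouping rentals by the station id columns and reporting an approximate median duration per station pair, is:

-- Typical duration of the most common one-way rentals, grouped by station ids.
SELECT
  MIN(start_station_name) AS start_station_name,
  MIN(end_station_name) AS end_station_name,
  APPROX_QUANTILES(duration, 10)[OFFSET(5)] AS typical_duration,
  COUNT(duration) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE start_station_id != end_station_id
GROUP BY start_station_id, end_station_id
ORDER BY num_trips DESC
LIMIT 10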
The details of the query indicate that the sorting (for the approximate quantiles for every station pair) required a repartition of the outputs of the input stage but most of the time is spent during computation.
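The improved query is likewise not reproduced in this extract; a sketch of the approach, grouping directly by the station name columns so that the id columns never need to be read, is:

-- Same question, but grouped by station names instead of station ids.
SELECT
  start_station_name,
  end_station_name,
  APPROX_QUANTILES(duration, 10)[OFFSET(5)] AS typical_duration,
  COUNT(duration) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE start_station_name != end_station_name
GROUP BY start_station_name, end_station_name
ORDER BY num_trips DESC
LIMIT 10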
The above query avoids the need to read the two id columns and finishes in around 10.8 seconds. This speedup is caused by the downstream effects of reading less data.
The query result remains the same since there is a 1:1 relationship between the station name and the station id.
Suppose we wish to find the total distance traveled by each bicycle in our dataset.
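The lab's query is not shown in this extract; a sketch of the straightforward approach, assuming the cycle_stations table provides latitude and longitude keyed by station id, is:

-- Total distance per bicycle: join each trip to its start and end stations
-- and compute the geodesic distance between them.
WITH trip_distance AS (
  SELECT
    h.bike_id,
    ST_DISTANCE(
      ST_GEOGPOINT(s.longitude, s.latitude),
      ST_GEOGPOINT(e.longitude, e.latitude)) AS distance
  FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s
    ON h.start_station_id = s.id
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS e
    ON h.end_station_id = e.id
)
SELECT
  bike_id,
  SUM(distance) / 1000 AS total_distance_km
FROM trip_distance
GROUP BY bike_id
ORDER BY total_distance_km DESC
LIMIT 5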
The above query takes around 9.8 seconds (approximately 55 seconds of slot time) and shuffles around 1.22 MB. The result is that some bicycles have been ridden nearly 6000 kilometers.
We can avoid joining the cycle_stations table against the cycle_hire table if we precompute the distances between all pairs of stations:
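(A sketch of that precomputation, under the same schema assumptions as above; the lab's exact query is not shown in this extract.)

-- Precompute the distance for every pair of stations (a few hundred thousand pairs),
-- then join trips to that small result instead of recomputing the distance per trip.
WITH station_pairs AS (
  SELECT
    s.id AS start_id,
    e.id AS end_id,
    ST_DISTANCE(
      ST_GEOGPOINT(s.longitude, s.latitude),
      ST_GEOGPOINT(e.longitude, e.latitude)) AS distance
  FROM `bigquery-public-data.london_bicycles.cycle_stations` AS s,
       `bigquery-public-data.london_bicycles.cycle_stations` AS e
),
trip_distance AS (
  SELECT
    h.bike_id,
    p.distance
  FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN station_pairs AS p
    ON h.start_station_id = p.start_id AND h.end_station_id = p.end_id
)
SELECT
  bike_id,
  SUM(distance) / 1000 AS total_distance_km
FROM trip_distance
GROUP BY bike_id
ORDER BY total_distance_km DESC
LIMIT 5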
This query makes only about 600k geo-distance calculations vs. 24M previously. Now it takes about 31.5 seconds of slot time (a 30% speedup), despite shuffling approximately 33.05 MB of data.
Click Check my progress to verify the objective.
The BigQuery service automatically caches query results in a temporary table. If the identical query is submitted within approximately 24 hours, the results are served from this temporary table without any recomputation. Cached results are extremely fast and do not incur charges.
There are, however, a few caveats to be aware of. Query caching is based on exact string comparison, so even whitespace differences can cause a cache miss. Queries are never cached if they exhibit non-deterministic behavior (for example, they use CURRENT_TIMESTAMP or RAND), if the table or view being queried has changed (even if the columns or rows of interest to the query are unchanged), if the table is associated with a streaming buffer (even if there are no new rows), if the query uses DML statements, or if it queries an external data source.
It is possible to improve overall performance at the expense of increased I/O by taking advantage of temporary tables and materialized views.
For example, suppose you have a number of queries that start out by finding the typical duration of trips between a pair of stations. The WITH clause (also called a Common Table Expression) improves readability but does not improve query speed or cost since results are not cached. The same holds for views and subqueries as well. If you find yourself using a WITH clause, view, or a subquery often, one way to potentially improve performance is to store the result into a table (or materialized view).
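The lab's query is not reproduced in this extract; the following sketch shows the kind of WITH-clause query being described, comparing each day's trip durations against the typical duration for the station pair (same london_bicycles schema assumptions as before):

-- Days on which trips run much longer than is typical for the station pair.
WITH typical_trip AS (
  SELECT
    start_station_name,
    end_station_name,
    APPROX_QUANTILES(duration, 10)[OFFSET(5)] AS typical_duration,
    COUNT(duration) AS num_trips
  FROM `bigquery-public-data.london_bicycles.cycle_hire`
  GROUP BY start_station_name, end_station_name
)
SELECT
  EXTRACT(DATE FROM start_date) AS trip_date,
  APPROX_QUANTILES(duration / typical_duration, 10)[OFFSET(9)] AS ratio,
  COUNT(*) AS num_trips_on_day
FROM `bigquery-public-data.london_bicycles.cycle_hire` AS hire
JOIN typical_trip AS trip
  ON hire.start_station_name = trip.start_station_name
  AND hire.end_station_name = trip.end_station_name
WHERE trip.num_trips > 10
GROUP BY trip_date
HAVING num_trips_on_day > 10
ORDER BY ratio DESC
LIMIT 10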
To materialize results like these, first create a dataset named mydataset in the eu (multiple regions in European Union) region (where the bicycle data resides) under your project in BigQuery. In the left pane, next to your project (qwiklabs-gcp-xxxx), open the actions menu and select Create dataset. In the Create dataset dialog:
Set the Dataset ID to mydataset.
Set the Location type to eu (multiple regions in European Union).
Leave all other options at their default values.
To finish, click the blue Create Dataset button.
Now you may execute the following query:
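(The lab's exact statements are not included in this extract; a plausible sketch materializes the typical-trip statistics into the new dataset and then rewrites the earlier query to read from that table instead of recomputing it.)

-- Materialize the per-station-pair statistics once...
CREATE OR REPLACE TABLE mydataset.typical_trip AS
SELECT
  start_station_name,
  end_station_name,
  APPROX_QUANTILES(duration, 10)[OFFSET(5)] AS typical_duration,
  COUNT(duration) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY start_station_name, end_station_name;

-- ...then reuse them instead of recomputing them in a WITH clause.
SELECT
  EXTRACT(DATE FROM start_date) AS trip_date,
  APPROX_QUANTILES(duration / typical_duration, 10)[OFFSET(9)] AS ratio,
  COUNT(*) AS num_trips_on_day
FROM `bigquery-public-data.london_bicycles.cycle_hire` AS hire
JOIN mydataset.typical_trip AS trip
  ON hire.start_station_name = trip.start_station_name
  AND hire.end_station_name = trip.end_station_name
WHERE trip.num_trips > 10
GROUP BY trip_date
HAVING num_trips_on_day > 10
ORDER BY ratio DESC
LIMIT 10;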
Notice the ~50% speedup since the average trip duration computation is avoided. Both queries return the same result, that trips on Christmas take longer than usual. Note that the table mydataset.typical_trip is not refreshed when new data is added to the cycle_hire table.
One way to solve this problem of stale data is to use a materialized view or to schedule queries that update the table periodically. You should measure the cost of such updates to see whether the improvement in query performance makes up for the extra cost of keeping the table or materialized view up to date.
If there are tables that you access frequently in Business Intelligence (BI) settings such as dashboards with aggregations and filters, one way to speed up your queries is to employ BI Engine. It will automatically store relevant pieces of data in memory (either actual columns from the table or derived results), and will use a specialized query processor tuned for working with mostly in-memory data. You can reserve the amount of memory (up to a current maximum of 10 GB) that BigQuery should use for its cache from the BigQuery Admin Console, under BI Engine.
Make sure to reserve this memory in the same region as the dataset you are querying. Then, BigQuery will start to cache tables, parts of tables, and aggregations in memory and serve results faster.
A primary use case for BI Engine is for tables that are accessed from dashboard tools such as Google Data Studio. By providing memory allocation for a BI Engine reservation, we can make dashboards that rely on a BigQuery backend much more responsive.
Click Check my progress to verify the objective.
Joining two tables requires data coordination and is subject to limitations imposed by the communication bandwidth between slots. If it is possible to avoid a join, or reduce the amount of data being joined, do so.
One way to improve the read performance and avoid joins is to give up on storing data efficiently, and instead add redundant copies of data. This is called denormalization.
Thus, instead of storing the bicycle station latitudes and longitudes separately from the cycle hire information, we could create a denormalized table:
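A sketch of such a table (the lab's exact statement is not shown in this extract; the column choices here are illustrative):

-- Store start/end station coordinates alongside each trip so later queries
-- do not need to join against cycle_stations.
CREATE OR REPLACE TABLE mydataset.london_bicycles_denorm AS
SELECT
  h.start_station_id,
  s.latitude AS start_latitude,
  s.longitude AS start_longitude,
  h.end_station_id,
  e.latitude AS end_latitude,
  e.longitude AS end_longitude
FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON h.start_station_id = s.id
JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS e
  ON h.end_station_id = e.id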
Then, all subsequent queries will not need to carry out the join because the table will contain the necessary location information for all trips.
In this case, you are trading off storage and reading more data against the computational expense of a join. It is quite possible that the cost of reading more data will outweigh the savings from avoiding the join, so you should measure whether denormalization actually brings performance benefits.
Click Check my progress to verify the objective.
Self-joins happen when a table is joined with itself. While BigQuery supports self-joins, they can lead to performance degradation if the table being joined with itself is very large. In many cases, you can avoid the self-join by taking advantage of SQL features such as aggregation and window functions.
Let’s look at an example. One of the BigQuery public datasets is the dataset of baby names published by the US Social Security Administration.
This took around 2.4 seconds, an improvement of approximately 30 times.
It is possible to carry out the query above with an efficient join as long as we reduce the amount of data being joined by grouping the data by name and gender early on:
Try the following query:
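(The lab's exact query is not included in this extract; the sketch below follows the pattern described, under the assumption that the goal is to find names commonly given to both boys and girls.)

-- Group early so the join operates on one row per (name, gender).
WITH all_names AS (
  SELECT name, gender, SUM(number) AS num_babies
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  GROUP BY name, gender
),
male_names AS (
  SELECT name, num_babies FROM all_names WHERE gender = 'M'
),
female_names AS (
  SELECT name, num_babies FROM all_names WHERE gender = 'F'
)
SELECT
  m.name,
  m.num_babies + f.num_babies AS num_babies,
  m.num_babies / (m.num_babies + f.num_babies) AS frac_male
FROM male_names AS m
JOIN female_names AS f
  ON m.name = f.name
WHERE m.num_babies / (m.num_babies + f.num_babies) BETWEEN 0.3 AND 0.7
ORDER BY num_babies DESC
LIMIT 5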
The early grouping served to trim the data early in the query, before the query performs a JOIN. That way, shuffling and other complex operations are only executed on the much smaller data and remain quite efficient. The query above finished in about 2 seconds and returned the correct result.
Suppose you wish to find the duration between a bike being dropped off and it being rented again, i.e., the duration that a bicycle stays at the station. This is an example of a dependent relationship between rows. It might appear that the only way to solve this is to join the table with itself, matching the end_date of one trip against the start_date of the next. (Make sure your query is running in the eu (multiple regions in European Union) region: click +Create SQL Query, then select More > Query settings > Additional Options and verify that Automatic location selection is checked.)
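A window function avoids the self-join entirely: it can look at the previous trip for the same bike directly. A minimal sketch (assuming start_date and end_date are TIMESTAMP columns in cycle_hire):

-- For each trip, how long the bike sat at a station since its previous drop-off.
SELECT
  bike_id,
  start_date,
  end_date,
  TIMESTAMP_DIFF(
    start_date,
    LAG(end_date) OVER (PARTITION BY bike_id ORDER BY start_date),
    SECOND) AS time_at_station_seconds
FROM `bigquery-public-data.london_bicycles.cycle_hire`
LIMIT 5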
Sometimes, it can be helpful to precompute functions on smaller tables, and then join with the precomputed values rather than repeat an expensive calculation each time.
For example, suppose we wish to find the pair of stations between which our customers ride bicycles at the fastest pace. To compute the pace (minutes per kilometer) at which they ride, we need to divide the duration of the ride by the distance between stations.
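The lab's query is not reproduced in this extract; a sketch of the straightforward approach, which recomputes the station-to-station distance for every single trip, is:

-- Pace (minutes per kilometer) per station pair; ST_DISTANCE is evaluated once per trip.
WITH trip_time AS (
  SELECT
    h.start_station_id,
    h.end_station_id,
    h.duration,
    ST_DISTANCE(
      ST_GEOGPOINT(s.longitude, s.latitude),
      ST_GEOGPOINT(e.longitude, e.latitude)) AS distance
  FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s
    ON h.start_station_id = s.id
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS e
    ON h.end_station_id = e.id
)
SELECT
  start_station_id,
  end_station_id,
  -- duration is in seconds, ST_DISTANCE returns meters; convert to min/km.
  SAFE_DIVIDE(SUM(duration) / 60, SUM(distance) / 1000) AS pace_min_per_km
FROM trip_time
GROUP BY start_station_id, end_station_id
HAVING SUM(distance) > 0
ORDER BY pace_min_per_km ASC
LIMIT 5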
The above query invokes the geospatial function ST_DISTANCE once for each row in the cycle_hire table (24 million times), takes about 14.7 seconds, and processes about 1.9 GB.
We can instead use the cycle_stations table to precompute the distance between every pair of stations (this is a self-join) and then join it with the reduced-size table of average duration between stations:
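(A sketch of the recast query, again assuming the public london_bicycles schema.)

-- Average the durations per station pair first (small), precompute pair distances
-- via a self-join of the small stations table, then join the two small results.
WITH typical_trip AS (
  SELECT
    start_station_id,
    end_station_id,
    AVG(duration) AS avg_duration
  FROM `bigquery-public-data.london_bicycles.cycle_hire`
  GROUP BY start_station_id, end_station_id
),
station_distance AS (
  SELECT
    s.id AS start_station_id,
    e.id AS end_station_id,
    ST_DISTANCE(
      ST_GEOGPOINT(s.longitude, s.latitude),
      ST_GEOGPOINT(e.longitude, e.latitude)) AS distance
  FROM `bigquery-public-data.london_bicycles.cycle_stations` AS s,
       `bigquery-public-data.london_bicycles.cycle_stations` AS e
)
SELECT
  t.start_station_id,
  t.end_station_id,
  SAFE_DIVIDE(t.avg_duration / 60, d.distance / 1000) AS pace_min_per_km
FROM typical_trip AS t
JOIN station_distance AS d
  ON t.start_station_id = d.start_station_id
  AND t.end_station_id = d.end_station_id
WHERE d.distance > 0
ORDER BY pace_min_per_km ASC
LIMIT 5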
The recast query with the more efficient joins takes only about 8.2 seconds, an approximate 1.8x speedup, and processes about 554 MB, a nearly 4x reduction in cost.
Click Check my progress to verify the objective.
Some operations (e.g. ordering) have to be carried out on a single worker. Having to sort too much data can overwhelm a worker’s memory and result in a “resources exceeded” error. Avoid overwhelming the worker with too much data. As the hardware in Google data centers is upgraded, what “too much” means in this context expands over time. Currently, this is on the order of one GB.
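For example, a query that numbers every rental in the dataset by end time forces a single global sort. The lab's exact query is not shown in this extract; a sketch of that shape is:

-- Number every rental in order of end_date: the ORDER BY inside the window
-- runs over the whole table on a single worker.
SELECT
  rental_id,
  ROW_NUMBER() OVER (ORDER BY end_date) AS rental_number
FROM `bigquery-public-data.london_bicycles.cycle_hire`
ORDER BY rental_number ASC
LIMIT 5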
It takes 34.5 seconds to process just 372 MB because it needs to sort the entirety of the London bicycles dataset on a single worker. Had we processed a larger dataset, it would have overwhelmed that worker.
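Distributing the sort, for example by numbering rentals within each day, keeps each partition small. A sketch (assuming end_date is a TIMESTAMP column):

-- Number rentals within each day instead of across the whole dataset.
WITH rentals_on_day AS (
  SELECT
    rental_id,
    end_date,
    EXTRACT(DATE FROM end_date) AS rental_date
  FROM `bigquery-public-data.london_bicycles.cycle_hire`
)
SELECT
  rental_id,
  rental_date,
  ROW_NUMBER() OVER (PARTITION BY rental_date ORDER BY end_date) AS rental_number_on_day
FROM rentals_on_day
ORDER BY rental_date ASC, rental_number_on_day ASC
LIMIT 5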
This takes about 15.1 seconds (a 2x speedup) because the sorting can be done on just a single day of data at a time.
Click Check my progress to verify the objective.
The same problem of overwhelming a worker (in this case, overwhelming the memory of the worker) can happen during an ARRAY_AGG with GROUP BY if one of the keys is much more common than the others.
(Make sure your query is running in the us (multiple regions in United States) processing center.) Note: while this query will succeed, it can take upwards of 30 minutes to do so. If you understand the query, move on in the lab.
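The query itself is not reproduced in this extract. The sketch below is illustrative of the kind of skewed aggregation being described, and is deliberately expensive, so there is no need to run it. It assumes the github_repos.commits schema exposes author.tz_offset, author.time_sec, author.name, and subject fields:

-- Almost all commits fall into a handful of timezones, so a few GROUP BY keys
-- receive most of the rows and the ordered ARRAY_AGG for those keys is built
-- in the memory of a single worker.
SELECT
  author.tz_offset AS tz_offset,
  ARRAY_AGG(STRUCT(author.name, subject) ORDER BY author.time_sec) AS commits
FROM `bigquery-public-data.github_repos.commits`
GROUP BY tz_offset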
One fix is to group by both timezone and repo_name, and then aggregate across repos to get the actual answer for each timezone. Note: while this query will also succeed, it can take more than 15 minutes to do so. If you understand the query, move on in the lab.
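Again, the lab's query is not reproduced here; the sketch below shows the two-level pattern, building the per-repo arrays first and then combining them per timezone (same schema assumptions as above, with repo_name assumed to be an ARRAY column):

-- Stage 1: the expensive ordered ARRAY_AGG is keyed by (timezone, repo), so no
-- single GROUP BY key carries most of the rows.
WITH per_repo AS (
  SELECT
    author.tz_offset AS tz_offset,
    repo,
    ARRAY_AGG(STRUCT(author.name, subject) ORDER BY author.time_sec) AS commits
  FROM `bigquery-public-data.github_repos.commits`,
       UNNEST(repo_name) AS repo
  GROUP BY tz_offset, repo
)
-- Stage 2: combine the per-repo arrays for each timezone.
SELECT
  tz_offset,
  ARRAY_CONCAT_AGG(commits) AS commits
FROM per_repo
GROUP BY tz_offset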
BigQuery provides fast, low-memory approximations of aggregate functions. Instead of using COUNT(DISTINCT …), we can use APPROX_COUNT_DISTINCT on large data streams when a small statistical uncertainty in the result is tolerable.
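The lab's exact query is not shown in this extract; a common example is counting the distinct GitHub repositories mentioned in the commits table (schema assumptions as above):

-- Exact count of distinct repositories referenced by commits.
SELECT COUNT(DISTINCT repo) AS num_repos
FROM `bigquery-public-data.github_repos.commits`,
     UNNEST(repo_name) AS repo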
The above query takes 8.3 seconds to compute the correct result of 3347770.
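The approximate form simply swaps in APPROX_COUNT_DISTINCT (same assumptions as the sketch above):

-- Approximate distinct count: small statistical error, much less memory and time.
SELECT APPROX_COUNT_DISTINCT(repo) AS num_repos
FROM `bigquery-public-data.github_repos.commits`,
     UNNEST(repo_name) AS repo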
The above query takes about 3.9 seconds (a 2x speedup) and returns an approximate result of 3399473, which overestimates the correct answer by about 1.5%.
The approximate algorithm is much more efficient than the exact algorithm only on large datasets, and it is recommended for use cases where an error of approximately 1% is tolerable. Before using an approximate function, measure your use case!
Other available approximate functions include APPROX_QUANTILES to compute percentiles, APPROX_TOP_COUNT to find the top elements and APPROX_TOP_SUM to compute top elements based on the sum of an element.
Click Check my progress to verify the objective.
You've learned about a number of techniques to potentially improve your query performance. While considering some of these techniques, remember the legendary computer scientist Donald Knuth's quote, "Premature optimization is the root of all evil."
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
- 1 star = Very dissatisfied
- 2 stars = Dissatisfied
- 3 stars = Neutral
- 4 stars = Satisfied
- 5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.