
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Explore data in a table using data limiting
/ 10
Identify duplicate records using COUNT(DISTINCT)
/ 10
Use GROUP BY to identify number of items in category
/ 20
Filter the data using GROUP BY and HAVING
/ 20
Sample a BigQuery table using TABLESAMPLE
/ 20
Explore the order_items table
/ 20
As a cloud data analyst, you'll use data transformations to change the format, structure, or content of data to prepare for storage and analysis.
In general, data transformation techniques help data professionals better understand data’s distribution, main characteristics, and overall quality; for this reason, data transformation techniques are often the first step that analysts take during exploratory data analysis.
Some common transformation techniques used for exploratory data analysis are limiting, sampling, and aggregation.
Data limiting is a technique that restricts the number of rows returned in a query. This is useful when you want to limit the amount of data that is displayed; and, in some cases, data limiting can improve the query’s speed and performance.
Data sampling is a technique of selecting a segment of a dataset that is representative of the entire dataset in order to better understand its characteristics.
Data aggregation is a technique used to summarize the data in a more manageable format.
In this lab activity, you’ll explore different ways that you can use these techniques with SQL in BigQuery to explore the data and identify potential data quality issues.
As a cloud data analyst for TheLook eCommerce, you’ve been asked to collaborate with a cross-functional team that includes merchandising, logistics, and marketing experts. This team is tasked with finding ways to improve delivery times and increase customer satisfaction across TheLook eCommerce’s entire product line.
You have prepared a report analyzing the number of returns for the team. But Meredith, the head merchandiser, has raised a concern that the number of products returned may not be correct.
To figure out the problem, you have been asked to explore the thelook_ecommerce dataset, which contains various tables related to product information, orders, and order items. Your job is to identify potential issues, such as duplicate data, that may be impacting the results Meredith has noted.To do this you'll use SQL to limit, sample, and aggregate the data.
Here’s how you'll do this task: First, you'll explore the products table. Next, you'll retrieve the total number of rows, and the number of products with distinct names. Third, you’ll determine the number of items per category. Then, you’ll filter the data to remove the categories with a small number of items. Fifth, you’ll sample the products table. Finally, you’ll explore the data contained in the order_items table.
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.
This practical lab lets you do the activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
Click the Start Lab button. On the left is the Lab Details panel with the following:
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window) if you are running the Chrome browser. The Sign in page opens in a new browser tab.
Tip: You can arrange the tabs in separate, side-by-side windows to easily switch between them.
If necessary, copy the Google Cloud username below and paste it into the Sign in dialog. Click Next.
You can also find the Google Cloud username in the Lab Details panel.
You can also find the Google Cloud password in the Lab Details panel.
After a few moments, the Console opens in this tab.
In this task, you’ll explore data in a table using data limiting.
The Navigation menu, Explorer pane, and Query Editor display.
Now, explore the data by executing a query that returns ten rows from the products table. This can assist you to gain some insight into the contents of the table.
This query limits the results to the first 10 rows of the products table in the thelook_ecommerce dataset.
Click Check my progress to verify that you have completed this task correctly.
In this task, you’ll determine the total number of rows, and the number of products with distinct names, that are contained in the products table.
This query returns the total number of entries in the products table, and the number of unique product names.
In BigQuery, the COUNT(DISTINCT name)
SQL aggregation function is used to calculate the number of unique values in the name column of a dataset. It returns the count of distinct (unique) values present in that column.
This can help you to identify any duplicate products present in the table. Identifying and fixing duplicate data is an important step in data analysis because duplicate data can lead to skewed results and errors during analysis.
Click Check my progress to verify that you have completed this task correctly.
In this task, you’ll determine the number of items per category in the products table, by aggregating the number of products by category using the GROUP BY
SQL keyword.
This query groups the products in the products table by category and counts the number of products in each category.
This query again to identify which segment each item has been assigned. Note that category
has been replaced with segment
in both the SELECT
and GROUP BY
clauses.
Click Check my progress to verify that you have completed this task correctly.
In this task, you’ll filter data to remove the categories with a small number of items before sampling the dataset using the GROUP BY
and HAVING
SQL keywords.
This query returns the categories with a large number of items. Currently, the threshold is set at 1000
, so only categories with more than 1000 items will be returned. You can adjust this threshold to as large or small of a number as you want to get the results you need.
Click Check my progress to verify that you have completed this task correctly.
In this task, you'll use data sampling to retrieve a random subset of rows from the products table.
Table sampling is useful for a variety of purposes, such as exploring data, testing queries, or getting a quick overview of a large dataset.
Sampling returns a random selection of rows while avoiding the costs associated with scanning and processing an entire table. This is because sampling only reads a subset of the data, which can significantly reduce the amount of time and resources required to run the query.
Unlike the LIMIT
clause, which you used in a previous task, TABLESAMPLE
returns a random subset of data from a table. This means that the results of a TABLESAMPLE
query may vary each time it is run.
BigQuery does not cache the results of queries that include a TABLESAMPLE
clause. This is because the results of a sampling query are always random, so caching them would not be helpful.
Click Check my progress to verify that you have completed this task correctly.
In this task, you'll explore the data contained in the order_items table.
This query returns the first 10 rows of the order_items
table.
Click Run.
Create a new query; copy the following query into the Query Editor:
This query returns the aggregate count of orders across various statuses.
Click Run.
Create a new query; copy the following query into the Query Editor:
This query returns the user ID associated with the greatest total order value.
Click Check my progress to verify that you have completed this task correctly.
Great work!
You’ve successfully explored the data and identified data quality issues in the sales data. This is a great first step in making sure that the sales data used for decision-making is optimized.
First, you explored the data using limiting to return a limited number of results.
Second, you identified duplicate rows using the COUNT(DISTINCT name)
aggregation technique.
Third, you identified the number of items per category in the products table using GROUP BY.
Fourth, you filtered the data using GROUP BY
and HAVING
.
Fifth, you sampled a table to return a random subset of the data.
Finally, you explored the data contained in the order_items table.
You’re well on your way to understanding how to use data limiting, sampling, and aggregation to better understand your data and your transformation needs.
Before you end the lab, make sure you’re satisfied that you’ve completed all the tasks. When you're ready, click End Lab and then click Submit.
Ending the lab will remove your access to the lab environment, and you won’t be able to access the work you've completed in it again.
Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
This content is not currently available
We will notify you via email when it becomes available
Great!
We will contact you via email if it becomes available
One lab at a time
Confirm to end all existing labs and start this one