
Before you begin
- Labs create a Google Cloud project and resources for a fixed time.
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin.
Run Cloud Dataprep jobs to BigQuery
This lab was developed with our partner, Alteryx. Your personal information may be shared with Alteryx, the lab sponsor, if you have opted-in to receive product updates, announcements, and offers in your Account Profile.
Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. In this lab, you explore the Dataprep user interface (UI) to build a data transformation pipeline that outputs results into BigQuery.
The dataset for this lab is an ecommerce dataset that has millions of Google Analytics session records for the Google Merchandise Store loaded into BigQuery. In the lab, you explore the available fields and rows and prepare the data for analysis.
In this lab, you learn how to perform these tasks:
- Connect BigQuery datasets to Cloud Dataprep.
- Explore dataset quality with Cloud Dataprep.
- Create a data transformation pipeline with Cloud Dataprep.
- Run transformation jobs and send the output to BigQuery.
It is recommended that you take the Working with Cloud Dataprep on Google Cloud lab before attempting this lab.
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources are made available to you.
This hands-on lab lets you do the lab activities in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
- Access to a standard internet browser (Chrome browser recommended).
- Time to complete the lab. Remember, once you start, you cannot pause a lab.
Click the Start Lab button. If you need to pay for the lab, a dialog opens for you to select your payment method. On the left is the Lab Details pane with the following:
- The Open Google Cloud console button
- Time remaining
- The temporary credentials that you must use for this lab
- Other information, if needed, to step through this lab
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).
The lab spins up resources, and then opens another tab that shows the Sign in page.
Tip: Arrange the tabs in separate windows, side-by-side.
If necessary, copy the Username below and paste it into the Sign in dialog.
You can also find the Username in the Lab Details pane.
Click Next.
Copy the Password below and paste it into the Welcome dialog.
You can also find the Password in the Lab Details pane.
Click Next.
Click through the subsequent pages:
- Accept the terms and conditions.
- Do not add recovery options or two-factor authentication (because this is a temporary account).
- Do not sign up for free trials.
After a few moments, the Google Cloud console opens in this tab.
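If the Cloud Dataprep service identity has not been created for your project yet, you can create it from Cloud Shell. A command along these lines is typical (an assumption here, since the exact step is not shown on this page):

gcloud beta services identity create --service=dataprep.googleapis.com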
You should see a message saying the service identity was created.
In the Cloud console, go to the Navigation menu, click View All Products and under Analytics select Dataprep.
To get into Dataprep, check the box to agree to the Google Dataprep Terms of Service, and then click Accept.
Click the checkbox and then click Agree and Continue when prompted to share account information with Alteryx.
Click Allow to give Alteryx access to your project.
Select your Qwiklabs credentials to sign in and click Allow.
Check the box and click Accept to agree to Alteryx Terms of Service.
If prompted to use the default location for the storage bucket, click Continue.
Although this lab is largely focused on Cloud Dataprep, you need BigQuery as the source for ingesting the dataset into the pipeline and as the destination for the output when the pipeline is completed.
In the Cloud Console, select Navigation menu > BigQuery.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
Click Done.
In the Explorer pane, select your project name, then click CREATE DATASET. Name the dataset ecommerce and create it. You will now see your dataset under your project in the left pane.
Copy and paste the following SQL query into the Query Editor:
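A query along the following lines does this (a sketch consistent with the description below; the public source table path data-to-insights.ecommerce.all_sessions_raw is an assumption):

#standardSQL
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_dataprep
OPTIONS(
  description = "Raw data from analyst team to ingest into Cloud Dataprep"
) AS
SELECT * FROM `data-to-insights.ecommerce.all_sessions_raw`  -- source path is an assumption
WHERE date = '20170801';  -- one day of session data, about 56 thousand records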
Click RUN. This query copies over a subset of the public raw ecommerce dataset (one day's worth of session data, or about 56 thousand records) into a new table named all_sessions_raw_dataprep, which has been added to your ecommerce dataset for you to explore and clean in Cloud Dataprep.
Confirm that the new table exists in your ecommerce dataset:
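A quick sanity check from the query editor (any query against the new table works; this is just a sketch):

SELECT COUNT(*) AS row_count
FROM ecommerce.all_sessions_raw_dataprep;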
In this task, you will connect Cloud Dataprep to your BigQuery data source. On the Cloud Dataprep page:
Click Create a new flow in the right corner.
Rename the Untitled Flow and specify these details:
- Flow Name: Ecommerce Analytics Pipeline
- Flow Description: Revenue reporting table
Click OK.
If prompted with a What's a flow? popup, select Don't show me any helpers.
Click the Add Icon in the Dataset box.
In the Add Dataset to Flow dialog box, select Import Datasets.
In the left pane, click BigQuery.
When your ecommerce dataset is loaded, click on it.
Click on the Create dataset icon (+ sign) on the left of the all_sessions_raw_dataprep table.
Click Import & Add to Flow in the bottom right corner.
The data source automatically updates. You are ready to go to the next task.
In this task, you will load and explore a sample of the dataset within Cloud Dataprep.
Cloud Dataprep loads a sample of your dataset into the Transformer view. This process might take a few seconds. You are now ready to start exploring the data!
Answer the following questions:
How many columns are there in the dataset? Answer: 32 columns.
How many rows does the data sample contain? Answer: About 12 thousand rows.
What is the most common value in the channelGrouping column? (Hint: Hover your mouse cursor over the colored bar chart under the channelGrouping column title to find the most common value.)
Answer: Referral. A referring site is typically any other website that has a link to your content. An example here is a different website that reviewed a product on our ecommerce website and linked to it. This is considered a different acquisition channel than if the visitor came from a search engine.
What are the top three countries from which sessions originate? Answer: United States, India, United Kingdom.
What does the grey bar under totalTransactionRevenue represent? Answer: Missing values for the totalTransactionRevenue field. This means that a lot of sessions in this sample did not generate revenue. Later, we will filter out these values so our final table only has customer transactions and associated revenue.
What are the maximum timeOnSite in seconds, maximum pageviews, and maximum sessionQualityDim for the data sample? (Hint: Open the menu to the right of the timeOnSite column and select Column Details. To close the details window, click the Close Column Details (X) button in the top right corner. Then repeat the process to view details for the pageviews and sessionQualityDim columns.)
Answers: Read each maximum from the Column Details panel; your numbers may vary with the sample. Note that these columns are not yet at the session level, so metrics like timeOnSite are only counted once per session. We'll explore the uniqueness of visitor and session data in a later lab.
Looking at the histogram for sessionQualityDim, are the data values evenly distributed? Answer: No, they are skewed toward lower values (low quality sessions), which is expected.
What is the date range for the dataset? Answer: 8/1/2017 (one day of data).
Do you see a red bar under the productSKU column? If so, what might that mean?
Answer: A red bar indicates mismatched values. While sampling data, Cloud Dataprep attempts to automatically identify the type of each column. If you do not see a red bar for the productSKU column, then this means that Cloud Dataprep correctly identified the type for the column (i.e. the String type). If you do see a red bar, then this means that Cloud Dataprep found enough number values in its sampling to determine (incorrectly) that the type should be Integer. Cloud Dataprep also detected some non-integer values and therefore flagged those values as mismatched. In fact, productSKU is not always an integer (for example, a correct value might be "GGOEGOCD078399"). So in this case, Cloud Dataprep incorrectly identified the column type: it should be a string, not an integer. You will fix that later in this lab.
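If you want to see why Integer is the wrong type, an illustrative query (not a lab step) lists SKUs that do not parse as integers:

SELECT DISTINCT productSKU
FROM ecommerce.all_sessions_raw_dataprep
WHERE SAFE_CAST(productSKU AS INT64) IS NULL  -- non-numeric SKUs like "GGOEGOCD078399"
LIMIT 10;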
Looking at the v2ProductName column, what are the most popular products? Answer: Nest products.
Looking at the v2ProductCategory column, what are some of the most popular product categories?
Answer: The most popular product categories are:
- Nest
- Bags
- (not set) (which means that some sessions are not associated with a category)
True or False? The most common productVariant is COLOR.
Answer: False. It's (not set), because most products (80%+) do not have variants.
What are the two values found in the type column? Answer: PAGE and EVENT.
A user can have many different interaction types when browsing your website. Types include recording session data when viewing a PAGE or a special EVENT (like "clicking on a product"), among others. Multiple hit types can be triggered at the exact same time, so you will often filter on type to avoid double counting. We'll explore this more in a later analytics lab.
What is the maximum productQuantity? Answer: 100 (your answer may vary).
productQuantity indicates how many units of that product were added to cart. 100 means 100 units of a single product were added.
What is the dominant currencyCode for transactions? Answer: USD (United States Dollar).
Are there valid values for itemQuantity or itemRevenue? Answer: No, they are all NULL (or missing) values.
What percentage of transactionId values are valid? What does this represent for our ecommerce dataset?
Answer: Only a small fraction of sessions have a valid transactionId, which reflects the site's conversion rate: most sessions do not result in a purchase.
How many eCommerceAction_type values are there, and what is the most common value?
Answers: There are seven values found in our sample. The most common value is 0, which indicates that the type is unknown. This makes sense, as the majority of web sessions on our website will not perform any ecommerce actions; the visitors are just browsing.
What does eCommerceAction_type = 6 represent? (Hint: Search the schema documentation for eCommerceAction type and read the description for the mapping.)
Answer: 6 maps to "Completed purchase". Later in this lab we will ingest this mapping as part of our data pipeline.
In this task, you will clean the data by deleting unused columns, eliminating duplicates, creating calculated fields, and filtering out unwanted rows.
As we mentioned earlier, we will be deleting the itemQuantity and itemRevenue columns as they only contain NULL values and are not useful for the purpose of this lab.
Your team has informed you there may be duplicate session values included in the source dataset. Let's remove these with a new deduplicate step.
Click Add in the right-hand panel.
Review the recipe that you created so far; it should resemble the following:
- Delete columns: itemQuantity, itemRevenue
- Remove duplicate rows
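For reference, these two recipe steps roughly correspond to the following BigQuery SQL (an illustrative sketch, not part of the lab):

SELECT DISTINCT * EXCEPT (itemQuantity, itemRevenue)  -- drop the NULL-only columns, dedupe rows
FROM ecommerce.all_sessions_raw_dataprep;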
Your team has asked you to create a table of all user sessions that bought at least one item from the website. Filter out user sessions with NULL revenue.
This step filters your dataset to only include transactions with revenue (where totalTransactionRevenue is not NULL).
The dataset contains sessions of different types, for example PAGE (for page views) or EVENT (for triggered events like "viewed product categories" or "added to cart"). To avoid double counting session pageviews, add a filter to only include page view related hits.
In the histogram below the type column, click the bar for PAGE. All rows with the type PAGE are now highlighted in green.
In the Suggestions panel, on the Keep rows card, click Add.
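Taken together, the two filter steps behave like the following SQL (illustrative only):

SELECT *
FROM ecommerce.all_sessions_raw_dataprep
WHERE totalTransactionRevenue IS NOT NULL  -- keep sessions that generated revenue
  AND type = 'PAGE';                       -- keep page view hits to avoid double counting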
Search your schema documentation for visitId and read the description to determine if it is unique across all user sessions or just the user.
visitId: an identifier for this session. This is part of the value usually stored as the utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
As we see, visitId is not unique across all users. We will need to create a unique identifier.
As you discovered, the dataset has no single column for a unique visitor session. Create a unique ID for each session by concatenating the fullVisitorId and visitId fields.
For Columns, select fullVisitorId and visitId.
For Separator, type a single hyphen character: -
For the New column name, type unique_session_id.
The unique_session_id is now a combination of the fullVisitorId and visitId. We will explore in a later lab whether each row in this dataset is at the unique session level (one row per user session) or something even more granular.
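In BigQuery SQL, the merge step is equivalent to something like the following (a sketch; visitId is cast to a string because it is numeric in the export schema):

SELECT
  CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)) AS unique_session_id,
  *
FROM ecommerce.all_sessions_raw_dataprep;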
As you saw earlier, values in the eCommerceAction_type column are integers that map to actual ecommerce actions performed in that session, for example 3 = "Add to Cart" or 5 = "Check out". This mapping will not be immediately apparent to our end users, so let's create a calculated field that brings in the value name.
For Column to evaluate, specify eCommerceAction_type.
Next to Cases (1), click Add 8 times for a total of 9 cases.
Specify these values for each case:
Comparison | New value
0 | Unknown
1 | Click through of product lists
2 | Product detail views
3 | Add product(s) to cart
4 | Remove product(s) from cart
5 | Check out
6 | Completed purchase
7 | Refund of purchase
8 | Checkout options
For New column name, type eCommerceAction_label. Leave the other fields at their default values.
Click Add.
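The Case transformation behaves like a SQL CASE expression; an illustrative equivalent follows (comparing as strings, which matches the export schema; adjust if your column is numeric):

SELECT
  eCommerceAction_type,
  CASE eCommerceAction_type
    WHEN '0' THEN 'Unknown'
    WHEN '1' THEN 'Click through of product lists'
    WHEN '2' THEN 'Product detail views'
    WHEN '3' THEN 'Add product(s) to cart'
    WHEN '4' THEN 'Remove product(s) from cart'
    WHEN '5' THEN 'Check out'
    WHEN '6' THEN 'Completed purchase'
    WHEN '7' THEN 'Refund of purchase'
    WHEN '8' THEN 'Checkout options'
  END AS eCommerceAction_label
FROM ecommerce.all_sessions_raw_dataprep;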
As mentioned in the schema, the totalTransactionRevenue column contains values passed to Analytics multiplied by 10^6 (e.g., 2.40 would be given as 2400000). You now divide the contents of that column by 10^6 to get the original values.
In the Formula field, type: DIVIDE(totalTransactionRevenue,1000000) and for New column name, type: totalTransactionRevenue1. Notice the preview for the transformation.
Click Add.
Click the icon to the right of the totalTransactionRevenue1 column, then select Change type and choose Decimal.
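In SQL, the division and the type change together look like this (illustrative; NUMERIC is BigQuery's decimal type):

SELECT
  CAST(totalTransactionRevenue / 1000000 AS NUMERIC) AS totalTransactionRevenue1
FROM ecommerce.all_sessions_raw_dataprep;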
Review the full list of steps in your recipe; it should now include:
- Delete columns: itemQuantity, itemRevenue
- Remove duplicate rows
- Filter out rows with missing totalTransactionRevenue
- Keep rows where type is PAGE
- Merge fullVisitorId and visitId into unique_session_id
- Create eCommerceAction_label from eCommerceAction_type (Case transformation)
- Create totalTransactionRevenue1 with DIVIDE(totalTransactionRevenue,1000000) and change its type to Decimal
In the Run Job page, select Dataflow + BigQuery for your Running Environment.
Under Publishing Actions, click Edit to the right of Create-CSV.
On the following page, select BigQuery from the left-hand menu.
Select your ecommerce dataset.
Click Create a New Table from the panel on the right.
Name your table revenue_reporting.
Select Drop the Table every run.
Click on Update.
Click RUN.
Once your Cloud Dataprep job is completed, refresh your BigQuery page and confirm that the output table revenue_reporting exists.
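You can also confirm from the BigQuery editor with a quick query (a sketch):

SELECT COUNT(*) AS session_count
FROM ecommerce.revenue_reporting;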
Click Check my progress to verify the objective.
You've successfully explored your ecommerce dataset and created a data transformation pipeline with Cloud Dataprep.
Manual Last Updated December 09, 2024
Lab Last Tested December 09, 2024
Copyright 2025 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.