Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you restart it, you'll have to start from the beginning.
- Click Start lab to begin
Retrieve dataset files
/ 50
Publish the results to BigQuery
/ 50
This lab was developed with our partner, Alteryx. Your personal information may be shared with Alteryx, the lab sponsor, if you have opted-in to receive product updates, announcements, and offers in your Account Profile.
Cloud Dataprep is Google's self-service data preparation tool built in collaboration with Alteryx. In this lab you will learn how to clean and enrich multiple datasets using Cloud Dataprep. The lab exercises are based on a mock use case scenario.
You work for a technical services company that sells three monthly subscription products:
The company occasionally offers promotional discounts, so some product prices may be slightly lower than those listed above. Your overall goal is to provide an analysis of sales activity by zip code over the course of three years.
To do this you'll need to join your customer contact datasource (where the zip code information resides) with sales data from your purchases datasource. Once you've joined the data, you'll aggregate the results.
In this lab, you will learn how to perform the following tasks:
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources are made available to you.
This hands-on lab lets you do the lab activities in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
Click the Start Lab button. If you need to pay for the lab, a dialog opens for you to select your payment method. On the left is the Lab Details pane with the following:
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).
The lab spins up resources, and then opens another tab that shows the Sign in page.
Tip: Arrange the tabs in separate windows, side-by-side.
If necessary, copy the Username below and paste it into the Sign in dialog.
You can also find the Username in the Lab Details pane.
Click Next.
Copy the Password below and paste it into the Welcome dialog.
You can also find the Password in the Lab Details pane.
Click Next.
Click through the subsequent pages:
After a few moments, the Google Cloud console opens in this tab.
Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.
Click Activate Cloud Shell at the top of the Google Cloud console.
Click through the following windows:
When you are connected, you are already authenticated, and the project is set to your Project_ID,
gcloud
is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
Output:
Output:
gcloud
, in Google Cloud, refer to the gcloud CLI overview guide.
You should see a message saying the service identity was created.
In the Cloud Console go to the Navigation menu, and under ANALYTICS select Dataprep.
Check that you agree to Google Dataprep Terms of Service, and then click Accept.
Click the checkbox and then click Agree and Continue when prompted to share account information with Alteryx.
Click Allow to give Alteryx access to your project.
Select your lab credentials to sign in and click Allow.
Check the box and click Accept to agree to Alteryx Terms of Service.
If prompted to use the default location for the storage bucket, click Continue.
For new users, a tutorial will launch, asking you to select datasets. Quit out of this screen by clicking Cancel or exiting out.
Click on the Dataprep icon on the top left corner to go to the home screen.
In this section you will add the sales activity files to a storage bucket that Dataprep created for you.
Get your bucket name. From the Navigation menu, select Cloud Storage > Bucket.
Note the Dataprep bucket name to use in the next step.
In the Cloud Shell command line, execute the following command, substituting [YOUR-BUCKET-NAME]
with the Dataprep bucket name:
You should receive a similar output:
Click Check my progress to verify the objective.
Go back to the Cloud Dataprep tab. To wrangle your data, you need to create a Flow. A Flow is a set of related datasets and the connections between them.
Qwiklab1
, leave the flow description blank, then click Ok.At this point your Flow is created. To guide you, Dataprep creates some placeholders in the flow to help you get started. The first step is to import and add data to Dataprep and the Flow.
Click on the + under Dataset to add a new data source and click the Import Datasets link.
In the left navigation menu, click Cloud Storage
> dataprep-staging-xxx...
> gsp050
to access the sample data you stored in the previous section.
Click the + next to each file listed. When you click on a file, it will move to the right side of the screen. Click Import & Add to Flow to add the datasets to your Flow:
Cloud Dataprep brings you back to the Flow View page, which now contains the datasets you added. Additional placeholders for the recipe and output are created.
Now that you have the data, the next step is to design a data preparation recipe to clean the customers dataset. From the template, there already is a recipe added using the lab_2013_transactions.csv
data. Let’s skip this recipe for now and create our own.
lab_customers.csv
.lab_customers
and click OK.A new Recipe node is created and a panel will open on the right side, displaying information about the recipe, including the data and any existing transformation steps.
Hit the blue Edit Recipe button. (Alternatively, you can double click the recipe node itself.)
Cloud Dataprep opens the "Transformer Grid". This is a worksheet-like interface where you can design the steps in your data preparation recipe. The Transformer page is where you build your transformation recipe and see the results applied to the sample. When you are satisfied with what you see, execute the job against your dataset.
Each of the columns have a Name and an icon that specify the inferred data type. Possible data types are shown when you click the icon to the left of the column name:
When you click on a column option, a Details panel opens on the right.
This Details panel is dynamic and contains information about whatever you have selected, including column information or suggested transformations. Click X in the top right of the Details panel to close the Details panel for now.
In the following steps, you explore data in the grid view and apply transformation steps to your recipe.
When you open the Transformer Grid, Cloud Dataprep automatically profiles the contents of your dataset and generates column-level histograms and data quality indicators. This profile information can be used to guide your data preparation process.
This is the data quality bar. The green part represents valid values, the gray represents missing or null values. A red bar indicates data that does not match the data type. Clicking on the sections of the data quality bar will generate suggestions that contain data quality conditionals. These conditionals test whether each record is valid, empty, or invalid, depending on the section of the bar that you clicked.
Using start_date
and end_date
as a filter, you will add a transformation to remove contacts where the start_date column is empty.
start_date
column.Cloud Dataprep generates a list of suggested transformations on the right, based on your selection. You can hover your mouse over any of the suggestion cards and Dataprep will show you a preview of how your data will look if the suggestion is applied. If you select a card, Cloud Dataprep updates the grid to show you a preview of this transformation.
start_date
" suggestion card on the right.The rows that were highlighted in red have been removed from your dataset.
Look at the end_date column. Based on the data quality bar, there are a large number of rows with missing values. To easily work with this column, you'll insert an empty value–January 01, 2050–in those empty rows.
end_date
column.This will generate another set of suggested transformations. There should be a suggestion to Set missing values to NULL(). In this case, Dataprep does not know what exact value you may want to fill with, so it creates a template for you to modify.
This opens the Add Step builder. Cloud Dataprep's suggested transformation has already been populated, but you can make adjustments to the code.
NULL()
with '2050/1/1'
(with the quotes). The full formula should look like:Now the data quality problems in the lab_customers
dataset have been addressed and the gray part of the data quality bar is gone.
Now switch gears and work on the transactions datasets.
QWIKLAB1
flow name at the top of the screen:This brings you back to the Flow view.
Create a single dataset that unions the transactions datasets from 2013, 2014, and 2015.
Click on the lab_2013_transactions
dataset.
Click the plus sign (+), and then click Add new Recipe.
Cloud Dataprep creates a new recipe and wrangled dataset named Untitled recipe
.
Right-click on this new wrangled dataset. From the drop-down menu, choose Edit name and description.
Change the name to Combined Transactions
and click OK.
lab_2013_transactions.csv
dataset.The metadata shows a quick summary of what data is loaded in the grid. Remember that the data loaded is a sample of the entire dataset, up to 10MB.
OR
The sample shown is the initial data from the source. For small datasets (under 10MB), Dataprep loads the entire dataset into the initial data sample.
Click Add New Step.
Type in "Union" in the search field then click on the result to get to the Union tool.
The Union Output field displays the output schema for your dataset. Each box represents a column. Cloud Dataprep bases the output schema on the schema of the dataset from which you initiated the union transform. In this case, the columns in the "Combined Transactions" dataset determine the columns that will appear in the combined output.
Click Add Data.
Check the lab_2014_transactions
. In the drop down at the bottom left, select Align By Name
and click Apply.
Click Add to Recipe to combine the datasets. After adding the union to your script, look at the transaction_date
column.
This dataset now includes records from January 2013 through December 2014.
Over the course of working with data, you may often find yourself tweaking or removing certain transformations. Dataprep makes it very easy to edit your work. In this case, you have unioned 2 of the 3 datasets, but instead of adding a brand new step to union the remaining dataset, you can simply edit what you've already done.
Notice that your recipe is now empty. The grid and metadata are also updated to reflect the original state.
Click the Redo icon to bring back the Union step.
In the Recipe panel, right click the Union step and choose Edit.
This will open up the Union tool again. This time, click Add data again and check the lab_2015_transactions
. In the drop down at the bottom left, select Align By Name
and click Apply.
Examine the column-to-column mappings. Click Add to Recipe to combine all three datasets.
After adding the union to your script, look at the transaction_date
column.
This dataset now includes records from January 2013 through December 2015. How many rows now show in the metadata?
QWIKLAB1
flow name to return to the Flow View.The flow visualization is updated to show how the three transactions datasets combine to form the Combined Transactions dataset:
Now that the datasets are combined, you will enrich the transactions data with information about where each purchase was made. To do this, join the customer data to the transactions data. When performing a join, treat the larger dataset as the master dataset, or the "left side" of the join. The smaller dataset should be the detail dataset, or the "right side" of the join. In Cloud Dataprep, the dataset from which you initiate a join automatically becomes the master dataset.
Double-click Combined Transactions to edit the recipe again.
Click on the Join icon in the Transformer toolbar to open the Join tool.
Click on the lab_customers
dataset to bring in the other dataset, then click Accept.
On the next screen, edit the Join keys and conditions. On the left is a preview of the join key matches. On the right are options to edit the join type, join keys and the preview statistics of the join. Dataprep will try to automatically infer the correct join keys based on common values between the datasets.
Optionally, you can edit the join keys. Hover in the Join keys section, then click on the pencil (Edit icon) to modify the join key or Add to add additional join keys.
For these datasets, Cloud Dataprep chose an inside join on column customer_id
. This means the output dataset will be those records that have the same customer_id.
Click Next.
On the next screen, you can choose which columns to keep or drop after the join. In the Output Columns panel, put a check next to the following fields to add those columns to the Join:
customer_id (current)
transaction_date
ticket_price
product
address_state
address_zip
region
start_date
end_date
All unchecked columns will be dropped. Your results will look like this:
As a final step, you will want to do some additional cleanup of the data for your report. You will need to create some columns with the necessary values that you want to visualize by.
Let’s see another way of authoring Transformations in Dataprep - from the column menus.
A new formula builder opens, pre-populated with the selected actions. A preview is also generated in the grid.
Notice that a new column is created, called year_transaction_date
. In the previous step, you could have named the new column while editing the transformation. If you leave out the name, Dataprep will generate a new column based on the transformation step you took, or as column# if no source columns were selected.
You will manually rename this column. Click the drop-down arrow next to year_transaction_date > Rename.
Enter activity_year
in the field. Click Add to accept the changes.
Note that you can rename multiple columns with this transformation by hitting Add to add more mappings.
You just finished preparing your data and you're ready to produce a results file in Cloud Storage. Cloud Dataprep executes your data transformation recipe to produce your output file using the BigQuery engine.
Click Run in the top right of the Transformer Grid.
In the Run Job dialog, you can configure your job execution settings and output destination. By default, Cloud Dataprep will create a CSV file on Cloud Storage.
Hover your mouse over the existing Publishing Action and hit Edit on the right.
Click on the BigQuery tab on the left.
Select the Dataprep
database, and click the Create a new table button on the right.
Enter transactions_by_customer
as the new table name and select Append to this table every run as the write option.
Click Update on the bottom to update the output settings.
Now click Run to kick off your BigQuery job. This will take a few minutes. You can see the job processing on the Dataprep "Jobs" page. When it completes, you'll see a success message that resembles the following, and your data will be loaded in the new BigQuery table.
Your results are visible by querying BigQuery directly. From the Google Cloud Console, navigate to ANALYTICS > BigQuery. Click on the Dataprep
dataset.
Enter select * from Dataprep.transactions_by_customer;
into the Query Editor. Click Run to see the data that was published.
Cloud Dataprep is that simple! It's easy to cleanse and enrich multiple data sources using an intuitive, visual interface.
Click Check my progress to verify the objective.
This concludes the Working With Google Cloud Dataprep lab. In this lab, you started off by creating new flows and transforming data. You then learned how to use the Dataprep UI to filter messy data, union and join multiple files, and create and rename columns. Lastly, you produced a results file and exported it to Google Cloud Storage.
You can continue exploring Dataprep Professional Edition with a free 30-day trial available here. Please make sure to sign out from your temporary lab account and re-sign with your Google Cloud valid email. Advanced features, such as additional connectivity, pipeline orchestration, and adaptive data quality are also available in the Premium edition that you can explore in the Google Cloud Marketplace.
Read the how-to guides to learn how to discover, cleanse, and enhance data with Google Dataprep.
...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.
Manual Last Updated December 27, 2023
Lab Last Tested September 27, 2023
Copyright 2025 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.