
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Create Vertex AI Platform Notebooks instance and clone course repo / 10
Setup the data environment / 15
Create a custom Dataflow Flex Template container image / 15
Create and stage the flex template / 10
Execute the template from the UI and using gcloud commands / 20
In this lab, you:
- Create a Vertex AI Platform Notebooks instance and clone the course repo.
- Set up the data environment.
- Create a custom Dataflow Flex Template container image.
- Create and stage the flex template.
- Execute the template from the UI and using gcloud commands.
Prerequisites:
A pipeline that accepts command-line parameters is vastly more useful than one with those parameters hard-coded. However, running it requires creating a development environment. An even better option for pipelines that are expected to be rerun by a variety of different users or in a variety of different contexts would be to use a Dataflow template.
Many Dataflow templates have already been created as part of Google Cloud Platform; to learn more, explore the Get started with Google-provided templates guide. However, none of them perform the same function as the pipeline in this lab. Instead, in this part of the lab, you convert the pipeline into a newer custom Dataflow Flex Template (as opposed to a custom traditional template).
Converting a pipeline into a custom Dataflow Flex Template requires the use of a Docker container to package up your code and the dependencies, a Dockerfile to describe what code to build, Cloud Build to build the underlying container that will be executed at runtime to create the actual job, and a metadata file to describe the job parameters.
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).
In the Google Cloud console, on the Navigation menu, select IAM & Admin > IAM.
Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role:
a. On the Navigation menu, click Cloud Overview > Dashboard and copy the project number (for example, 729328892908).
b. On the Navigation menu, select IAM & Admin > IAM, then click Grant Access.
c. For New principals, type {project-number}-compute@developer.gserviceaccount.com, replacing {project-number} with your project number.
d. For Role, select Project (or Basic) > Editor, then click Save.

For this lab, you will be running all commands in a terminal from your notebook.
In the Google Cloud Console, on the Navigation Menu, click Vertex AI > Workbench.
Click Enable Notebooks API.
On the Workbench page, select USER-MANAGED NOTEBOOKS and click CREATE NEW.
In the New instance dialog box that appears, set the region to
For Environment, select Apache Beam.
Click CREATE at the bottom of the dialog box.
Next you will download a code repository for use in this lab.
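The download step itself is not captured on this page. Typically you open a terminal from the notebook launcher and clone the public course repository:

# Clone the course repository used throughout these labs
git clone https://github.com/GoogleCloudPlatform/training-data-analyst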
On the left panel of your notebook environment, in the file browser, you will notice the training-data-analyst repo added.
Navigate into the cloned repo /training-data-analyst/quests/dataflow_python/. You will see a folder for each lab, which is further divided into a lab sub-folder with code to be completed by you, and a solution sub-folder with a fully workable example to reference if you get stuck.
Click Check my progress to verify the objective.
For this lab, we will leverage the existing pipeline code from the Branching Pipelines lab (solutions folder).
Before you can begin editing the actual pipeline code, you need to ensure that you have installed the necessary dependencies.
Open the my_pipeline.py file in your IDE by using the solution file, which can be found in training-data-analyst/quests/dataflow_python/2_Branching_Pipelines/solution/:
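The setup commands for this step are not reproduced on this page. The sketch below shows a typical setup, assuming a Python virtual environment and the helper scripts that ship in the course repo for creating the sinks and sample data (verify the script names in the file browser):

# Work from the lab folder of the Branching Pipelines exercise
cd ~/training-data-analyst/quests/dataflow_python/2_Branching_Pipelines/lab
export BASE_DIR=$(pwd)

# Install the dependencies in an isolated environment
sudo apt-get update && sudo apt-get install -y python3-venv
python3 -m venv df-env
source df-env/bin/activate
python3 -m pip install -q --upgrade pip setuptools wheel
python3 -m pip install 'apache-beam[gcp]'

# Copy the finished pipeline from the solution folder
cp $BASE_DIR/../solution/my_pipeline.py $BASE_DIR/

# Set up the data environment: buckets, BigQuery dataset, and sample events
cd $BASE_DIR/../..
source create_batch_sinks.sh
bash generate_batch_events.sh
cd $BASE_DIR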
Click Check my progress to verify the objective.
Use pip3 freeze to record the packages and their versions being used in our environment:
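A minimal way to capture that list, assuming the output is written to a requirements.txt file alongside the pipeline so the Dockerfile can copy it, is:

# Record the packages installed in the active environment
pip3 freeze > requirements.txt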
Next, we will create our Dockerfile. This will specify the code and the dependencies we need to use.
a. To complete this task, create a new file in the dataflow_python/2_Branching_Pipelines/lab folder in the file explorer of your IDE.
b. To create the new file, click File >> New >> Text File.
c. Rename the file to Dockerfile (right-click the file to rename it).
d. Open the Dockerfile in the editor panel by clicking on the file.
e. Copy the code below into the Dockerfile and save it:
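The Dockerfile contents were not captured on this page. The sketch below shows what such a Flex Template Dockerfile typically contains; it uses the standard Python Flex Template launcher base image, but the copied file names and the install step are assumptions, so compare against the solution folder before relying on it.

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

# Copy the pipeline code and the requirements.txt recorded earlier (names assumed)
COPY requirements.txt .
COPY my_pipeline.py .

# Tell the Flex Template launcher which entry point and requirements file to use
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/my_pipeline.py"

# Pre-install dependencies so the launcher does not have to at job start
RUN pip install --no-cache-dir -r requirements.txt

Once the Dockerfile is saved, the container is built and pushed with Cloud Build from the lab folder; the image name and environment variables below are illustrative:

# Build the container image and push it to the project's Container Registry
export PROJECT_ID=$(gcloud config get-value project)
export TEMPLATE_IMAGE="gcr.io/${PROJECT_ID}/branching_pipeline:latest"
gcloud builds submit --tag $TEMPLATE_IMAGE .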
This will take a few minutes to build and push the container.
Click Check my progress to verify the objective.
To run a template, you need to create a template spec file in Cloud Storage containing all of the necessary information to run the job, such as the SDK information and metadata.
a. Create a new file in the dataflow_python/2_Branching_Pipelines/lab folder in the file explorer of your IDE.
b. To create the new file, click File >> New >> Text File.
c. Rename the file to metadata.json (right-click the file to rename it).
d. Open the metadata.json file in the editor panel. To open the file, right-click the metadata.json file, then select Open With >> Editor.
e. To complete this task, create a metadata.json file in the format shown below, accounting for all of the input parameters your pipeline expects. Refer to the solution if needed. This does require you to write your own parameter regex checking; while not best practice, ".*" will match on any input.
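The metadata format itself was not captured on this page. A sketch of its typical shape is below; the template name, descriptions, and parameter names (inputPath, outputPath, tableName) are assumptions and must match the command-line options your pipeline actually defines.

{
  "name": "Branching pipeline flex template",
  "description": "Writes raw events to Cloud Storage and filtered events to BigQuery",
  "parameters": [
    {
      "name": "inputPath",
      "label": "Input file path",
      "helpText": "Cloud Storage path to the input events file",
      "regexes": [".*"]
    },
    {
      "name": "outputPath",
      "label": "Output file location",
      "helpText": "Cloud Storage location for the raw output",
      "regexes": [".*"]
    },
    {
      "name": "tableName",
      "label": "BigQuery output table",
      "helpText": "BigQuery table to write to, in the form project:dataset.table",
      "regexes": [".*"]
    }
  ]
}

With the metadata file saved, the template is staged to Cloud Storage with gcloud dataflow flex-template build. The bucket path below is illustrative, and $TEMPLATE_IMAGE is the image variable exported when the container was built:

# Build the template spec file and stage it in Cloud Storage
export TEMPLATE_PATH="gs://${PROJECT_ID}/templates/my_template.json"
gcloud dataflow flex-template build $TEMPLATE_PATH \
  --image "$TEMPLATE_IMAGE" \
  --sdk-language "PYTHON" \
  --metadata-file "metadata.json"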
Click Check my progress to verify the objective.
To complete this task, follow the instructions below:
Go to the Dataflow page in the Google Cloud console.
Click CREATE JOB FROM TEMPLATE.
Enter a valid job name in the Job name field.
Set the Regional endpoint to
Select Custom template from the Dataflow template drop-down menu.
Enter the Cloud Storage path to your template file in the template Cloud Storage path field.
Input the appropriate items under Required parameters:
a. For Input file path, enter
b. For Output file location, enter
c. For BigQuery output table, enter
Click RUN JOB.
One of the benefits of using Dataflow templates is the ability to execute them from a wider variety of contexts, other than a development environment. To demonstrate this, use gcloud to execute a Dataflow template from the command line.
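The exact command is not reproduced on this page; a typical invocation from the terminal looks like the following, where the job name, region, and parameter values are placeholders to replace with your own (the parameter names must match those declared in metadata.json):

# Launch a Dataflow job from the staged Flex Template
gcloud dataflow flex-template run "branching-pipeline-$(date +%Y%m%d-%H%M%S)" \
  --template-file-gcs-location "$TEMPLATE_PATH" \
  --region "us-central1" \
  --parameters inputPath="$INPUT_PATH",outputPath="$OUTPUT_PATH",tableName="$TABLE_NAME"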
Click Check my progress to verify the objective.
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
- 1 star = Very dissatisfied
- 2 stars = Dissatisfied
- 3 stars = Neutral
- 4 stars = Satisfied
- 5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.