
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
In this lab, you will use Dataflow to collect traffic events from simulated traffic sensor data made available through Google Cloud Pub/Sub, process them into an actionable average, and store the raw data in BigQuery for later analysis. You will learn how to start a Dataflow pipeline, monitor it, and, lastly, optimize it.
In this lab, you will perform the following tasks:
- Download a code repository
- Create a BigQuery dataset
- Simulate traffic sensor data into Pub/Sub
- Launch a Dataflow pipeline
- Create an alert
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).
In the Google Cloud console, on the Navigation menu, select IAM & Admin > IAM.
Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.
Note: If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.
- In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
- Copy the project number (e.g. 729328892908).
- On the Navigation menu, select IAM & Admin > IAM.
- At the top of the roles table, click Grant Access.
- For New principals, type {project-number}-compute@developer.gserviceaccount.com, replacing {project-number} with your project number.
- For Role, select Project (or Basic) > Editor and click Save.
You will be running a sensor simulator from the training VM. In Lab 1 you manually set up the Pub/Sub components. In this lab, several of those processes are automated.
In the Console, on the Navigation menu, click Compute Engine > VM instances.
Locate the line with the instance called training-vm.
On the far right, under Connect, click on SSH to open a terminal window.
In this lab, you will enter CLI commands on the training-vm.
The training-vm is installing some software in the background.
The setup is complete when the result of your list (ls) command output appears as in the image below. If the full listing does not appear, wait a few minutes and try again.
This script sets the DEVSHELL_PROJECT_ID and BUCKET environment variables.
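The commands for this preparation step are not reproduced on this page. A minimal sketch of the sequence on the training-vm, assuming the standard training setup (the /training path, the training-data-analyst repository, and the project_env.sh helper script are assumptions):

# Verify that the background installation has finished (path assumed)
ls /training

# Download the code repository used in this lab (repository assumed)
git clone https://github.com/GoogleCloudPlatform/training-data-analyst

# Set the DEVSHELL_PROJECT_ID and BUCKET environment variables (script path assumed)
source /training/project_env.sh
echo $DEVSHELL_PROJECT_ID $BUCKET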
Click Check my progress to verify the objective.
The Dataflow pipeline that you create later in this lab will write into a table in this BigQuery dataset.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
To create a dataset, click on the View actions icon next to your project ID and select Create dataset.
Next, name your Dataset ID demos, leave all other options at their default values, and then click Create dataset.
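If you prefer the command line, a roughly equivalent command is shown below; this is a sketch that assumes the bq CLI is available in your training-vm session and that DEVSHELL_PROJECT_ID is set:

# Create the demos dataset in the lab project (CLI alternative to the console steps above)
bq mk --dataset ${DEVSHELL_PROJECT_ID}:demos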
A bucket should already exist that has the same name as the Project ID.
In the Console, on the Navigation menu, click Cloud Storage > Buckets.
Observe the following values:
Property | Value (type value or select option as specified) |
---|---|
Name | |
Default storage class | Regional |
Location | |
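You can also inspect these properties from the training-vm terminal. A sketch, assuming the bucket name matches the project ID:

# Show bucket metadata, including Location and Default storage class
gsutil ls -L -b gs://${DEVSHELL_PROJECT_ID}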
Click Check my progress to verify the objective.
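The next step starts the sensor simulator on the training-vm. The command itself is not reproduced on this page; in the standard training setup it is a helper script along the lines of the following (the script path is an assumption):

# Start the traffic sensor simulator, which publishes simulated sensor messages to Pub/Sub
/training/sensor_magic.sh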
This command will send 1 hour of data in 1 minute. Let the script continue to run in the current terminal.
In the upper right corner of the training-vm SSH terminal, click on the gear-shaped button and select New Connection to training-vm from the drop-down menu. A new terminal window will open.
The new terminal session will not have the required environment variables. Run the following command to set them.
In the new training-vm SSH terminal enter the following:
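The command referred to above is not shown on this page; assuming the same project_env.sh helper used earlier:

# Re-create the DEVSHELL_PROJECT_ID and BUCKET variables in the new session
source /training/project_env.sh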
Click Check my progress to verify the objective.
A fourth, optional argument is options. The options argument is discussed later in this lab.
Argument | Value |
---|---|
project id | |
bucket name | |
classname | |
options | |
There are four Java files that you can choose from for classname. Each reads the traffic data from Pub/Sub and runs different aggregations/computations.
What does the script do?
Close the file to continue. You will want to refer to this source code while running the application, so for easy access, open a new browser tab and view the file AverageSpeeds.java on GitHub.
Leave this browser tab open. You will be referring back to the source code in a later step in this lab.
This script uses Maven to build a Dataflow streaming pipeline in Java.
Example of successful completion:
This Dataflow pipeline reads messages from a Pub/Sub topic, parses the JSON of the input message, produces one main output and writes to BigQuery.
Example:
./run_oncloud.sh $DEVSHELL_PROJECT_ID $BUCKET AverageSpeeds
After the pipeline is running, on the Navigation menu, click Pub/Sub > Topics.
Examine the line for the topic named sandiego.
Return to the Navigation menu, click Dataflow and click on your job.
Compare the code in the Github browser tab, AverageSpeeds.java and the pipeline graph on the page for your Dataflow job.
Find the GetMessages pipeline step in the graph, and then find the corresponding code in the AverageSpeeds.java file. This is the pipeline step that reads from the Pub/Sub topic. It creates a collection of Strings - which corresponds to Pub/Sub messages that have been read.
Find the BySensor and AvgBySensor pipeline steps in the graph, and then find the corresponding code snippet in the AverageSpeeds.java file. BySensor groups all events in the window by sensor id, while AvgBySensor then computes the mean speed for each grouping.
Find the ToBQRow pipeline step in the graph and in code. This step simply creates a "row" with the average computed from the previous step together with the lane information.
Find the BigQueryIO.Write in both the pipeline graph and in the source code. This step writes the row out of the pipeline into a BigQuery table. Because we chose the WriteDisposition.WRITE_APPEND write disposition, new records will be appended to the table.
Return to the BigQuery web UI tab. Refresh your browser.
Find your project name and the demos dataset you created. The small arrow to the left of the dataset name demos should now be active and clicking on it will reveal the average_speeds table.
It will take several minutes before the average_speeds table appears in BigQuery.
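To confirm that rows are arriving, you can run a query along the following lines from the training-vm terminal (a sketch: the bq CLI, the demos dataset created earlier, and a timestamp column written by the pipeline are assumed):

# Show the most recently written rows in the average_speeds table
bq query --use_legacy_sql=false \
  'SELECT * FROM `demos.average_speeds` ORDER BY timestamp DESC LIMIT 100'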
Click Check my progress to verify the objective.
One common activity when monitoring and improving Dataflow pipelines is figuring out how many elements the pipeline processes per second, what the system lag is, and how many data elements have been processed so far. In this activity you will learn where in the Cloud Console one can find information about processed elements and time.
Return to the browser tab for Console. On the Navigation menu, click Dataflow and click on your job to monitor progress (it will have your username in the pipeline name).
Select the GetMessages pipeline node in the graph and look at the step metrics on the right.
The query below will return a subset of rows from the average_speeds table as it existed 10 minutes ago.
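The query itself is not reproduced on this page; a sketch using BigQuery time travel (FOR SYSTEM_TIME AS OF), with the same assumptions as the query above:

# Read the table as it existed 10 minutes ago
bq query --use_legacy_sql=false \
  'SELECT * FROM `demos.average_speeds`
   FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
   ORDER BY timestamp DESC LIMIT 100'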
If your query requests rows but the table did not exist at the reference point in time, you will receive the following error message:
Invalid snapshot time 1633691170651 for Table PROJECT:DATASET.TABLE
If you encounter this error, reduce the scope of your time travel by lowering the minute value:
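For example, the same sketch with a smaller interval:

# Travel back only 1 minute instead of 10
bq query --use_legacy_sql=false \
  'SELECT * FROM `demos.average_speeds`
   FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MINUTE)
   ORDER BY timestamp DESC LIMIT 100'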
Observe how Dataflow scales the number of workers to process the backlog of incoming Pub/Sub messages.
Return to the browser tab for Console. On the Navigation menu, click Dataflow and click on your pipeline job.
Examine the Job metrics panel on the right, and review the Autoscaling section. How many workers are currently being used to process messages in the Pub/Sub topic?
Click on More history and review how many workers were used at different points in time during pipeline execution.
The data from a traffic sensor simulator started at the beginning of the lab creates hundreds of messages per second in the Pub/Sub topic. This will cause Dataflow to increase the number of workers to keep the system lag of the pipeline at optimal levels.
Click on More history. In the Worker pool, you can see how Dataflow changed the number of workers. Notice the Status column that explains the reason for the change.
Return to the training-vm SSH terminal where the sensor data script is running.
If you see messages that say INFO: Publishing then the script is still running. Press CTRL+C to stop it. Then issue the command to start the script again:
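The restart command is not reproduced here; assuming the same simulator helper script as before:

# Restart the sensor simulator (script path assumed from the standard setup)
/training/sensor_magic.sh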
Open a new SSH terminal. The new session will have a fresh quota.
In the Console, on the Navigation menu, click Compute Engine > VM instances.
Locate the line with the instance called training-vm.
On the far right, under Connect, click on SSH to open a new terminal window.
In the training-vm SSH terminal, enter the following to create environment variables:
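A sketch of the commands for the fresh session, assuming the same helper scripts as earlier:

# Re-create the DEVSHELL_PROJECT_ID and BUCKET environment variables
source /training/project_env.sh
# If the simulator is no longer running, start it again in this session
/training/sensor_magic.sh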
Cloud Monitoring integration with Dataflow allows users to access Dataflow job metrics such as System Lag (for streaming jobs), Job Status (Failed, Successful), Element Counts, and User Counters from within Cloud Monitoring.
Some common Dataflow metrics.
Metrics | Features |
---|---|
Job status | Job status (Failed, Successful), reported as an enum every 30 secs and on update. |
Elapsed time | Job elapsed time (measured in seconds), reported every 30 secs. |
System lag | Max lag across the entire pipeline, reported in seconds. |
Current vCPU count | Current # of virtual CPUs used by the job and updated on value change. |
Estimated byte count | Number of bytes processed per PCollection. |
Cloud Monitoring is a separate service in Google Cloud, so you will need to go through some setup steps to initialize the service for your lab account.
You will now set up a Monitoring workspace that's tied to your Google Cloud project. The following steps create a new account that has a free trial of Monitoring.
In the Cloud Console, click on Navigation menu > Monitoring.
Wait for your workspace to be provisioned.
When the Monitoring dashboard opens, your workspace is ready.
In the panel to the left click on Metrics explorer.
In the Metrics Explorer, click on Select a metric.
Select Dataflow Job > Job
You should see a list of available Dataflow-related metrics. Select Data watermark lag and click Apply.
Cloud Monitoring will draw a graph on the right side of the page.
Under Metric, click Reset to remove the Data watermark lag metric, then select the new Dataflow metric System lag.
Note: The metrics that Dataflow provides to Monitoring are listed in the Google Cloud metrics documentation. You can search on the page for Dataflow. The metrics you have viewed are useful indicators of pipeline performance.
Data watermark lag: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline.
System lag: The current maximum duration that an item of data has been awaiting processing, in seconds.
If you want to be notified when a certain metric crosses a specified threshold (for example, when the System Lag of our lab streaming pipeline increases above a predefined value), you could use the Alerting mechanisms of Monitoring to accomplish that.
In Cloud Monitoring, click Alerting.
Click + Create policy.
Click on the Select a metric dropdown and uncheck Active.
Type Dataflow Job in the filter by resource and metric name field and click on Dataflow Job > Job. Select System Lag and click Apply.
Click Configure Trigger.
Set the Threshold position to Above threshold, the Threshold value to 5, and Advanced options > Retest window to 1 min. Click Next.
A Notification channels page will open in a new tab.
Scroll down the page and click on Add new for Email.
In the Create email channel dialog box, enter the lab username in the Email address field and enter a Display name.
Click Save.
Go back to the previous Create alerting policy tab.
Click on Notification channels again, then click on the Refresh icon to see the display name you created in the previous step.
Now, select your Display name and click OK.
Set the Alert policy name as MyAlertPolicy.
Click Next.
Review the alert and click Create policy.
On the Cloud Monitoring tab, click on Alerting > Policies.
Every time an alert is triggered by a Metric Threshold condition, an Incident and a corresponding Event are created in Monitoring. If you specified a notification mechanism in the alert (email, SMS, pager, etc.), you will also receive a notification.
Click Check my progress to verify the objective.
You can easily build dashboards with the most relevant Dataflow-related charts with Cloud Monitoring Dashboards.
In the left pane, click Dashboards.
Click +Create dashboard.
For New dashboard name, type My Dashboard.
Click Add Widget, then Line chart.
Click on the dropdown box under Select a metric.
Select Dataflow Job > Job > System Lag and click Apply.
In the Filters panel, click + Add filter.
Select project_id in the Label field, then select or type your project ID in the Value field.
Click Apply.
If you would like, you can add more charts to the dashboard, for example, Pub/Sub publish rates on the topic, or subscription backlog (which is a signal to the Dataflow autoscaler).
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.