
Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow

Quick tip: Review the prerequisites before you run the lab
Use an Incognito or private browser window to run this lab. This prevents conflicts between your personal account and the student account, which could cause extra charges to be incurred on your personal account.

Lab · 1 hour · 5 Credits · Introductory
info This lab may incorporate AI tools to support your learning.

Overview

In this lab, you own a fleet of New York City taxi cabs and are looking to monitor how well your business is doing in real time. You build a streaming data pipeline to capture taxi revenue, passenger count, ride status, and much more, and then visualize the results in a management dashboard.

Objectives

In this lab you learn how to:

  • Create a Dataflow job from a template
  • Stream a Dataflow pipeline into BigQuery
  • Monitor a Dataflow pipeline in BigQuery
  • Analyze results with SQL
  • Visualize key metrics in Looker Studio

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:

    • The Open Google Cloud console button
    • Time remaining
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
  2. Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).

    The lab spins up resources, and then opens another tab that shows the Sign in page.

    Tip: Arrange the tabs in separate windows, side-by-side.

    Note: If you see the Choose an account dialog, click Use Another Account.
  3. If necessary, copy the Username below and paste it into the Sign in dialog.

    {{{user_0.username | "Username"}}}

    You can also find the Username in the Lab Details panel.

  4. Click Next.

  5. Copy the Password below and paste it into the Welcome dialog.

    {{{user_0.password | "Password"}}}

    You can also find the Password in the Lab Details panel.

  6. Click Next.

    Important: You must use the credentials the lab provides you. Do not use your Google Cloud account credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  7. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Google Cloud console opens in this tab.

Note: To view a menu with a list of Google Cloud products and services, click the Navigation menu at the top-left, or type the service or product name in the Search field. Navigation menu icon

Activate Google Cloud Shell

Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud.

Google Cloud Shell provides command-line access to your Google Cloud resources.

  1. In Cloud console, on the top right toolbar, click the Open Cloud Shell button.

    Highlighted Cloud Shell icon

  2. Click Continue.

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:

Project ID highlighted in the Cloud Shell Terminal

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  • You can list the active account name with this command:
gcloud auth list

Output:

Credentialed accounts:
- @.com (active)

Example output:

Credentialed accounts:
- google1623327_student@qwiklabs.net
  • You can list the project ID with this command:
gcloud config list project

Output:

[core] project =

Example output:

[core] project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Task 1. Create a BigQuery dataset

In this task, you create the taxirides dataset. You can create it in either of two ways: using Cloud Shell or the Google Cloud console.

In this lab you use an extract of the NYC Taxi & Limousine Commission's open dataset. A small comma-separated data file is used to simulate periodic updates of taxi data.

BigQuery is a serverless data warehouse. Tables in BigQuery are organized into datasets. In this lab, taxi data flows from the standalone file via Dataflow into BigQuery. With this configuration, any new data file deposited into the source Cloud Storage bucket is automatically processed for loading.
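To make the data concrete, here is what one row of such a file might look like and how its columns line up with the table fields you define below. This is a local illustration only: the column order is assumed from the table schema, and the values are invented, not taken from the real dataset.

```shell
# Hypothetical row in the assumed column order: ride_id, point_idx, latitude,
# longitude, timestamp, meter_reading, meter_increment, ride_status,
# passenger_count. Values are invented.
row='a1b2c3,42,40.7128,-74.0060,2024-01-01 12:00:00,12.50,0.25,enroute,2'

# Split the row on commas and label a few of the fields
IFS=',' read -r ride_id point_idx lat lon ts meter inc status pax <<EOF
$row
EOF
echo "ride_id=$ride_id status=$status passengers=$pax"
```

Because IFS is set to a comma for the read, the space inside the timestamp field stays intact.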

Use one of the following options to create a new BigQuery dataset:

Option 1: The command-line tool

  1. In Cloud Shell (Cloud Shell icon), run the following command to create the taxirides dataset.

bq --location={{{project_0.default_region|Region}}} mk taxirides

  2. Run this command to create the taxirides.realtime table (an empty table with a schema that you stream into later).

bq --location={{{project_0.default_region|Region}}} mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
passenger_count:integer -t taxirides.realtime

Option 2: The BigQuery Console UI

Note: Skip these steps if you created the tables using the command line.
  1. In the Google Cloud console, in the Navigation menu (Navigation Menu), click BigQuery.

  2. If you see the Welcome dialog, click Done.

  3. Click on View actions (View Actions) next to your Project ID, and then click Create dataset.

  4. In Dataset ID, type taxirides.

  5. In Data location, select:

{{{project_0.default_region|Region}}}

and then click Create Dataset.

  6. In the Explorer pane, click expand node (Expander) to reveal the new taxirides dataset.

  7. Click View actions (View Actions) next to the taxirides dataset, and then click Open.

  8. Click Create Table.

  9. In Table, type realtime.

  10. For the schema, click Edit as text and paste in the following:

ride_id:string, point_idx:integer, latitude:float, longitude:float, timestamp:timestamp, meter_reading:float, meter_increment:float, ride_status:string, passenger_count:integer

  11. In Partition and cluster settings, select timestamp.

  12. Click Create Table.

Task 2. Copy required lab artifacts

In this task, you move the required files to your Project.

Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.

A Cloud Storage bucket was created for you during lab start up.

  1. In Cloud Shell (Cloud Shell icon), run the following commands to move files needed for the Dataflow job.
gcloud storage cp gs://cloud-training/bdml/taxisrcdata/schema.json gs://{{{project_0.project_id|Project_ID}}}-bucket/tmp/schema.json
gcloud storage cp gs://cloud-training/bdml/taxisrcdata/transform.js gs://{{{project_0.project_id|Project_ID}}}-bucket/tmp/transform.js
gcloud storage cp gs://cloud-training/bdml/taxisrcdata/rt_taxidata.csv gs://{{{project_0.project_id|Project_ID}}}-bucket/tmp/rt_taxidata.csv
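The schema.json you just copied tells the Dataflow template how to shape the output table. Its exact contents are not shown in this lab, but schema files for the Cloud Storage Text to BigQuery template generally use the "BigQuery Schema" wrapper format sketched below. The field list here is inferred from the table schema in Task 1; the real file may differ in detail. This sketch just writes a local example file so you can see the shape.

```shell
# Local illustration of a template-style schema file; not the real schema.json.
cat > /tmp/schema_example.json <<'EOF'
{
  "BigQuery Schema": [
    {"name": "ride_id",         "type": "STRING"},
    {"name": "point_idx",       "type": "INTEGER"},
    {"name": "latitude",        "type": "FLOAT"},
    {"name": "longitude",       "type": "FLOAT"},
    {"name": "timestamp",       "type": "TIMESTAMP"},
    {"name": "meter_reading",   "type": "FLOAT"},
    {"name": "meter_increment", "type": "FLOAT"},
    {"name": "ride_status",     "type": "STRING"},
    {"name": "passenger_count", "type": "INTEGER"}
  ]
}
EOF
echo "wrote $(wc -c < /tmp/schema_example.json) bytes"
```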

Task 3. Set up a Dataflow Pipeline

In this task, you set up a streaming data pipeline to read files from your Cloud Storage bucket and write data to BigQuery.

Dataflow is a serverless way to carry out data analysis.

Restart the connection to the Dataflow API.

  1. In the Cloud Shell, run the following commands to ensure that the Dataflow API is enabled cleanly in your project.
gcloud services disable dataflow.googleapis.com
gcloud services enable dataflow.googleapis.com

Create a new streaming pipeline:

  1. In the Cloud console, in the Navigation menu (Navigation Menu), click View all Products > Analytics > Dataflow.

  2. In the top menu bar, click Create Job From Template.

  3. Type streaming-taxi-pipeline as the Job name for your Dataflow job.

  4. In Regional endpoint, select

{{{project_0.default_region|Region}}}
  5. In Dataflow template, select the Cloud Storage Text to BigQuery (Stream) template under Process Data Continuously (stream).

    Note: Make sure to select the template option that matches the parameters listed below.

  6. In Cloud Storage Input File(s), paste or type:

{{{project_0.project_id|Project_ID}}}-bucket/tmp/rt_taxidata.csv

  7. In Cloud Storage location of your BigQuery schema file, described as a JSON, paste or type:

{{{project_0.project_id|Project_ID}}}-bucket/tmp/schema.json

  8. In BigQuery Output table, paste or type:

{{{project_0.project_id|Project_ID}}}:taxirides.realtime

  9. In Temporary directory for BigQuery loading process, paste or type:

{{{project_0.project_id|Project_ID}}}-bucket/tmp

  10. Click Required Parameters.

  11. In Temporary location, used for writing temporary files, paste or type:

{{{project_0.project_id|Project_ID}}}-bucket/tmp

  12. In JavaScript UDF path in Cloud Storage, paste or type:

{{{project_0.project_id|Project_ID}}}-bucket/tmp/transform.js

  13. In JavaScript UDF name, paste or type:

transform

  14. In Max workers, type 2.

  15. In Number of workers, type 1.

  16. Uncheck Use default machine type.

  17. Under General purpose, choose the following:

    Series: E2
    Machine type: e2-medium (2 vCPU, 4 GB memory)

  18. Click Run Job.

Dataflow Template

A new streaming job has started! You can now see a visual representation of the data pipeline. It will take 3 to 5 minutes for data to begin moving into BigQuery.

Note: If the Dataflow job fails on the first attempt, re-create the job from the template with a new job name and run it again.
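The transform function named in the UDF steps above lives in the transform.js file you copied in Task 2; its job is to turn each CSV line into a JSON record matching the table schema. The actual file's contents are not shown in this lab, but conceptually it behaves like this local sketch (field positions assumed from the schema, values invented, and awk standing in for JavaScript):

```shell
# One CSV line in the assumed taxirides column order (invented values)
line='a1b2c3,42,40.7128,-74.0060,2024-01-01 12:00:00,12.50,0.25,enroute,2'

# Conceptual stand-in for the transform.js UDF: emit one JSON object per line
json=$(echo "$line" | awk -F',' '{
  printf "{\"ride_id\":\"%s\",\"timestamp\":\"%s\",\"meter_reading\":%s,\"ride_status\":\"%s\",\"passenger_count\":%s}",
    $1, $5, $6, $8, $9
}')
echo "$json"
```

Dataflow calls the real UDF once per input line and streams the resulting JSON rows into taxirides.realtime.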

Task 4. Analyze the taxi data using BigQuery

In this task, you analyze the data as it is streaming.

  1. In the Cloud console, in the Navigation menu (Navigation Menu), click BigQuery.

  2. If the Welcome dialog appears, click Done.

  3. In the Query Editor, type the following, and then click Run:

SELECT * FROM taxirides.realtime LIMIT 10

Note: If no records are returned, wait another minute and re-run the above query (Dataflow takes 3 to 5 minutes to set up the stream).

Your output will look similar to the following: Sample BigQuery output from query

Task 5. Perform aggregations on the stream for reporting

In this task, you calculate aggregations on the stream for reporting.

  1. In the Query Editor, clear the current query.

  2. Copy and paste the following query, and then click Run.

WITH streaming_data AS (
SELECT
  timestamp,
  TIMESTAMP_TRUNC(timestamp, HOUR, 'UTC') AS hour,
  TIMESTAMP_TRUNC(timestamp, MINUTE, 'UTC') AS minute,
  TIMESTAMP_TRUNC(timestamp, SECOND, 'UTC') AS second,
  ride_id,
  latitude,
  longitude,
  meter_reading,
  ride_status,
  passenger_count
FROM
  taxirides.realtime
ORDER BY timestamp DESC
LIMIT 1000
)

# calculate aggregations on stream for reporting:
SELECT
  ROW_NUMBER() OVER() AS dashboard_sort,
  minute,
  COUNT(DISTINCT ride_id) AS total_rides,
  SUM(meter_reading) AS total_revenue,
  SUM(passenger_count) AS total_passengers
FROM
  streaming_data
GROUP BY
  minute, timestamp

Note: Ensure Dataflow is registering data in BigQuery before proceeding to the next task.

The result shows key metrics by the minute for every taxi drop-off.
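To see what the per-minute rollup is doing, the same GROUP BY logic can be mimicked locally on a few toy rows. The rows below are invented, in minute,meter_reading,passenger_count form, with awk standing in for the SQL aggregation:

```shell
# Toy rows: minute,meter_reading,passenger_count (invented values)
agg=$(printf '%s\n' \
  '2024-01-01 12:01,7.50,2' \
  '2024-01-01 12:01,3.25,1' \
  '2024-01-01 12:02,9.00,4' |
awk -F',' '{ rides[$1]++; revenue[$1]+=$2; pax[$1]+=$3 }
  END { for (m in rides)
          printf "%s rides=%d revenue=%.2f passengers=%d\n", m, rides[m], revenue[m], pax[m] }' |
sort)
echo "$agg"
```

Each output line is one "minute" bucket, just as each row of the query result is one minute of taxi activity.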

  3. Click Save > Save query.

  4. In the Save query dialog, in the Name field, type My Saved Query.

  5. In Region, ensure that the region matches the Qwiklabs Lab Region.

  6. Click Save.

Task 6. Stop the Dataflow Job

In this task, you stop the Dataflow job to free up resources for your project.

  1. In the Cloud console, in the Navigation menu (Navigation Menu), click View all Products > Analytics > Dataflow.

  2. Click streaming-taxi-pipeline, or the name of your new job if you used a different one.

  3. Click Stop, and then select Cancel > Stop Job.

Task 7. Create a real-time dashboard

In this task, you create a real-time dashboard to visualize the data.

  1. In the Cloud console, in the Navigation menu (Navigation Menu), click BigQuery.

  2. In the Explorer Pane, expand your Project ID.

  3. Expand Queries, and then click My Saved Query.

Your query is loaded into the query editor.

  4. Click Run.

  5. In the Query results section, click Open in > Looker Studio.

    Looker Studio opens. Click Get started.

  6. In the Looker Studio window, click your bar chart.

    Bar chart

The Chart pane appears.

  7. Click Add a chart, and then select Combo chart.

    Combo chart

  8. In the Setup pane, in Data Range Dimension, hover over minute (Date) and click X to remove it.

  9. In the Data pane, click dashboard_sort and drag it to Setup > Data Range Dimension > Add dimension.

  10. In Setup > Dimension, click minute, and then select dashboard_sort.

  11. In Setup > Metric, click dashboard_sort, and then select total_rides.

  12. In Setup > Metric, click Record Count, and then select total_passengers.

  13. In Setup > Metric, click Add metric, and then select total_revenue.

  14. In Setup > Sort, click total_rides, and then select dashboard_sort.

  15. In Setup > Sort, click Ascending.

Your chart should look similar to this:

Sample chart

Note: Visualizing data at a minute-level granularity is currently not supported in Looker Studio as a timestamp. This is why we created our own dashboard_sort dimension.
  16. When you're happy with your dashboard, click Save and share to save this data source.

  17. If prompted to complete your account setup, type your country and company details, agree to the terms and conditions, and then click Continue.

  18. If prompted which updates you want to receive, answer no to all, and then click Continue.

  19. If prompted with the Review data access before saving window, click Acknowledge and save.

  20. If prompted to choose an account, select your Student Account.

  21. Whenever anyone visits your dashboard, it is up-to-date with the latest transactions. You can try it yourself by clicking More options (More Options), and then Refresh data.

Task 8. Create a time series dashboard

In this task, you create a time series chart.

  1. Click this Looker Studio link to open Looker Studio in a new browser tab.

  2. On the Reports page, in the Start with a Template section, click the [+] Blank Report template.

  3. A new, empty report opens with the Add data to report window.

  4. From the list of Google Connectors, select the BigQuery tile.

  5. Click Custom Query, and then select your ProjectID. This should appear in the following format, qwiklabs-gcp-xxxxxxx.

  6. In Enter Custom Query, paste the following query:

SELECT * FROM taxirides.realtime WHERE ride_status='enroute'
  7. Click Add > Add To Report.

    A new untitled report appears. It may take up to a minute for the screen to finish refreshing.
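The WHERE ride_status='enroute' clause in the custom query above keeps only in-progress rides. As a toy local analogue of that filter, using invented rows in ride_id,ride_status,passenger_count form:

```shell
# Keep only rows whose second field is "enroute" (invented sample rows)
matches=$(printf '%s\n' 'r1,enroute,2' 'r2,dropoff,1' 'r3,enroute,3' |
  awk -F',' '$2 == "enroute"')
echo "$matches"
```

The dropoff row is filtered out, just as BigQuery drops non-matching rows before Looker Studio ever sees them.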

Create a time series chart

  1. In the Data pane, click Add a Field > Add calculated field.

  2. Click All Fields on the left corner.

  3. Change the timestamp field type to Date & Time > Date Hour Minute (YYYYMMDDhhmm).

  4. In the change timestamp dialog, click Continue, and then click Done.

  5. In the top menu, click Add a chart.

  6. Choose Time series chart.

    Time Series

  7. Position the chart in the bottom left corner - in the blank space.

  8. In Setup > Dimension, click timestamp (Date), and then select timestamp.

  9. In Setup > Dimension, click timestamp, and then select calendar (Calendar).

  10. In Data Type, select Date & Time > Date Hour Minute.

  11. Click outside the dialog to close it. You do not need to add a name.

  12. In Setup > Metric, click Record Count, and then select meter_reading.

Congratulations!

In this lab, you used Dataflow to stream data through a pipeline into BigQuery.

End your lab

When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.