
Serverless Data Processing with Dataflow - Custom Dataflow Flex Templates (Java)


Lab · 2 hours · 5 Credits · Advanced
Note: This lab may incorporate AI tools to support your learning.

Overview

A pipeline that accepts command-line parameters is vastly more useful than one with those parameters hard-coded. However, running it requires creating a development environment. An even better option for pipelines that are expected to be rerun by a variety of different users or in a variety of different contexts would be to use a Dataflow template.

There are many Dataflow templates that have already been created as part of Google Cloud, which you can explore in the Google-provided templates documentation. But none of them perform the same function as the pipeline in this lab. Instead, in this part of the lab, you convert the pipeline into a newer custom Dataflow Flex Template (as opposed to a custom traditional template).

Converting a pipeline into a custom Dataflow Flex Template requires the use of an Uber JAR to package up your code and the dependencies, a Dockerfile to describe what code to build, Cloud Build to build the underlying container that will be executed at runtime to create the actual job, and a metadata file to describe the job parameters.
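
At a high level, the conversion you will perform in the tasks below boils down to four commands. The following is only a sketch of that flow, using the image name, template path, and parameter names that appear later in this lab; treat it as an outline rather than something to run now:

# 1. Package the pipeline and its dependencies as an Uber JAR
mvn clean package
# 2. Use Cloud Build to build and push the launcher container image
gcloud builds submit --tag gcr.io/$PROJECT_ID/my-pipeline:latest .
# 3. Stage the Flex Template spec (image reference + metadata) in Cloud Storage
gcloud dataflow flex-template build gs://${PROJECT_ID}/templates/mytemplate.json \
  --image gcr.io/$PROJECT_ID/my-pipeline:latest \
  --sdk-language JAVA \
  --metadata-file metadata.json
# 4. Launch a job from the staged template
gcloud dataflow flex-template run my-job \
  --region $REGION \
  --template-file-gcs-location gs://${PROJECT_ID}/templates/mytemplate.json \
  --parameters "inputPath=...,outputPath=...,tableName=..."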

Prerequisites

Basic familiarity with Java.

What you learn

In this lab, you:

  • Convert a custom pipeline into a custom Dataflow Flex Template.
  • Run a Dataflow Flex Template.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Activate Google Cloud Shell

Google Cloud Shell is a virtual machine loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud.

Google Cloud Shell provides command-line access to your Google Cloud resources.

  1. In Cloud console, on the top right toolbar, click the Open Cloud Shell button.

  2. Click Continue.

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  • You can list the active account name with this command:
gcloud auth list

Output:

Credentialed accounts: - @.com (active)

Example output:

Credentialed accounts: - google1623327_student@qwiklabs.net
  • You can list the project ID with this command:
gcloud config list project

Output:

[core] project =

Example output:

[core] project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu, select IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.

Note: If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.
  1. In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
  2. Copy the project number (e.g. 729328892908).
  3. On the Navigation menu, select IAM & Admin > IAM.
  4. At the top of the roles table, below View by Principals, click Grant Access.
  5. For New principals, type:
{project-number}-compute@developer.gserviceaccount.com
  6. Replace {project-number} with your project number.
  7. For Role, select Project (or Basic) > Editor.
  8. Click Save.
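
If you prefer the command line, the following sketch performs an equivalent check and grant from Cloud Shell (it assumes the default compute service account described above; the console steps remain the supported path for this lab):

# Look up the project number for the default compute service account
export PROJECT_ID=$(gcloud config get-value project)
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
# Check which roles the account currently has
gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --format="table(bindings.role)"
# Grant the Editor role if it is missing
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/editor"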

Setting up your IDE

For the purposes of this lab, you will mainly be using a Theia Web IDE hosted on Google Compute Engine. It has the lab repo pre-cloned. There is Java language server support, as well as a terminal for programmatic access to Google Cloud APIs via the gcloud command-line tool, similar to Cloud Shell.

  1. To access your Theia IDE, copy and paste the link shown in Google Cloud Skills Boost to a new tab.
Note: You may need to wait 3-5 minutes for the environment to be fully provisioned, even after the URL appears. Until then you will see an error in the browser.

The lab repo has been cloned to your environment. Each lab is divided into a labs folder with code to be completed by you, and a solution folder with a fully workable example to reference if you get stuck.

  2. Click on the File Explorer button to view the lab files.

You can also create multiple terminals in this environment, just as you would with Cloud Shell.

You can see, by running gcloud auth list in the terminal, that you're logged in as a provided service account, which has the exact same permissions as your lab user account.

If at any point your environment stops working, you can try resetting the VM hosting your IDE from the Compute Engine console.

Task 1. Set up your pipeline

For this lab, we will leverage the existing pipeline code from the Branching Pipelines lab (solution folder).

Open the appropriate lab

  1. Create a new terminal in your IDE environment, if you haven't already, and copy and paste the following command:
# Change directory into the lab
cd 2_Branching_Pipelines/labs
# Download dependencies
mvn clean dependency:resolve
export BASE_DIR=$(pwd)
  2. Set up the data environment:
# Create GCS buckets and BQ dataset
cd $BASE_DIR/../..
source create_batch_sinks.sh
# Generate event data
source generate_batch_events.sh
# Change to the directory containing the practice version of the code
cd $BASE_DIR

Click Check my progress to verify the objective. Set up the data environment
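
If you'd like to confirm the environment before moving on, a quick check from the terminal might look like the following sketch (it assumes the scripts create the buckets, dataset, and events file referenced later in Task 5):

# List the Cloud Storage buckets created by create_batch_sinks.sh
gsutil ls
# List BigQuery datasets; a logs dataset is expected
bq ls
# Confirm the generated events file exists
gsutil ls gs://$(gcloud config get-value project)/events.json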

Update your pipeline code

  • Update the MyPipeline.java in your IDE by using the solution file, which can be found in 2_Branching_Pipelines/solution/src/main/java/com/mypackage/pipeline:
cp /home/project/training-data-analyst/quests/dataflow/2_Branching_Pipelines/solution/src/main/java/com/mypackage/pipeline/MyPipeline.java $BASE_DIR/src/main/java/com/mypackage/pipeline/

Task 2. Create a custom Dataflow Flex Template container image

  1. To complete this task, add the following plugin to your pom.xml file to enable building an Uber JAR. First, add this in the properties tag:
<maven-shade-plugin.version>3.2.3</maven-shade-plugin.version>
  2. Then add this in the build plugins tag:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>${maven-shade-plugin.version}</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
  3. Now you can build an Uber JAR file using this command:
cd $BASE_DIR
mvn clean package

This Uber JAR file has all of its dependencies embedded in it, so you can run it as a standalone application with no external dependencies on other libraries.

  4. Note the size of the Uber JAR file:
ls -lh target/*.jar
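
Because the shaded JAR is self-contained, you could, for example, launch the pipeline directly with java instead of through Maven. The following is only a sketch: the JAR file name is a placeholder, the Direct Runner is assumed to be on the classpath, and the parameter names mirror the inputPath, outputPath, and tableName options used in Task 5 (with PROJECT_ID exported as shown there):

# Run the pipeline class straight from the Uber JAR (replace the placeholder JAR name)
java -cp target/YOUR-JAR-HERE.jar com.mypackage.pipeline.MyPipeline \
  --runner=DirectRunner \
  --inputPath=gs://${PROJECT_ID}/events.json \
  --outputPath=gs://${PROJECT_ID}-coldline/ \
  --tableName=${PROJECT_ID}:logs.logs_filtered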
  5. In the same directory as your pom.xml file, create a file named Dockerfile with the following text. Be sure to replace YOUR-CLASS-HERE with your pipeline's full class name and YOUR-JAR-HERE with the name of the Uber JAR you've created.
FROM gcr.io/dataflow-templates-base/java11-template-launcher-base:latest

# Define the Java command options required by Dataflow Flex Templates.
ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="YOUR-CLASS-HERE"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/template/pipeline.jar"

# Make sure to package as an uber-jar including all dependencies.
COPY target/YOUR-JAR-HERE.jar ${FLEX_TEMPLATE_JAVA_CLASSPATH}
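
If you're unsure which values to use for the placeholders, one way to confirm them from the terminal (assuming the package layout used earlier in this lab) is:

# The full class name for this lab's pipeline is com.mypackage.pipeline.MyPipeline
grep -n "public class" src/main/java/com/mypackage/pipeline/MyPipeline.java
# The shaded (uber) JAR produced by mvn clean package; use its file name in the COPY line
ls -lh target/*.jar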
  6. You will then use Cloud Build to offload the building of this container for you, rather than building it locally. First, turn on Kaniko caching to speed up future builds:
gcloud config set builds/use_kaniko True
  7. Then execute the actual build. This will tar up the entire directory (including the Dockerfile, which specifies what to build), upload it to the service, build a container, and push that container to Artifact Registry in your project for future use.
export TEMPLATE_IMAGE="gcr.io/$PROJECT_ID/my-pipeline:latest"
gcloud builds submit --tag $TEMPLATE_IMAGE .

You can monitor the build status from the Cloud Build UI, and you can see that the resulting container has been uploaded to Artifact Registry.
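
To confirm the image from the terminal instead of the UI, the following checks should work (they assume the TEMPLATE_IMAGE value exported above):

# List images pushed to your project's gcr.io registry
gcloud container images list --repository=gcr.io/$PROJECT_ID
# Show details (including the digest) of the template image
gcloud container images describe $TEMPLATE_IMAGE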

Click Check my progress to verify the objective. Create a custom Dataflow Flex Template container image

Task 3. Create and stage the Flex Template

To run a template, you need to create a template spec file in a Cloud Storage bucket containing all of the necessary information to run the job, such as the SDK information and metadata.

  1. To complete this task, create a metadata.json file in the following format that accounts for all of the input parameters your pipeline expects.

Refer to the solution if needed. This does require you to write your own parameter regex checking. While not a best practice, ".*" will match on any input. (A sketch using this lab's actual parameter names follows the example below.)

{ "name": "Your pipeline name", "description": "Your pipeline description", "parameters": [ { "name": "inputSubscription", "label": "Pub/Sub input subscription.", "helpText": "Pub/Sub subscription to read from.", "regexes": [ "[-_.a-zA-Z0-9]+" ] }, { "name": "outputTable", "label": "BigQuery output table", "helpText": "BigQuery table spec to write to, in the form 'project:dataset.table'.", "is_optional": true, "regexes": [ "[^:]+:[^.]+[.].+" ] } ] }
  2. Then build and stage the actual template:
export TEMPLATE_PATH="gs://${PROJECT_ID}/templates/mytemplate.json"
# Will build and upload the template to GCS
gcloud dataflow flex-template build $TEMPLATE_PATH \
  --image "$TEMPLATE_IMAGE" \
  --sdk-language "JAVA" \
  --metadata-file "metadata.json"
  3. Verify that the file has been uploaded to the template location in Cloud Storage.
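
From the terminal, you could verify the staged template with gsutil (the path assumes the TEMPLATE_PATH value exported above):

# Confirm the template spec file is in the bucket
gsutil ls gs://${PROJECT_ID}/templates/
# Inspect the spec; it should reference your container image and metadata
gsutil cat $TEMPLATE_PATH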

Click Check my progress to verify the objective. Create and stage the Flex Template

Task 4. Execute the template from the UI

To complete this task, follow the instructions below:

  1. Go to the Dataflow page in the Google Cloud console.
  2. Click Create job from template.
  3. Enter a valid job name in the Job name field.
  4. Select Custom template from the Dataflow template drop-down menu.
  5. Enter the Cloud Storage path to your template file in the template Cloud Storage path field.
  6. Input the appropriate items under Required parameters.
  7. Click Run job.
Note: You don't need to specify a staging bucket; Dataflow will create a private one in your project using your project number, similar to gs://dataflow-staging--/staging.
  8. Examine the Compute Engine console; you will see a temporary launcher VM that is created to execute your container and initiate your pipeline with the provided parameters.
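
If you want to spot the launcher VM from the command line, the following sketch may help (the name filter is an assumption; launcher VMs typically have names starting with "launcher" and exist only while the job is starting):

gcloud compute instances list --filter="name~^launcher"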

Task 5. Execute the template using gcloud

One of the benefits of using Dataflow templates is the ability to execute them from a wider variety of contexts, other than a development environment. To demonstrate this, use gcloud to execute a Dataflow template from the command line.

  1. To complete this task, execute the following command in your terminal, modifying the parameters as appropriate:
export PROJECT_ID=$(gcloud config get-value project)
export REGION={{{project_0.default_region | Region}}}
export JOB_NAME=mytemplate-$(date +%Y%m%d-%H%M%S)
export TEMPLATE_LOC=gs://${PROJECT_ID}/templates/mytemplate.json
export INPUT_PATH=gs://${PROJECT_ID}/events.json
export OUTPUT_PATH=gs://${PROJECT_ID}-coldline/
export BQ_TABLE=${PROJECT_ID}:logs.logs_filtered
gcloud dataflow flex-template run ${JOB_NAME} \
  --region=$REGION \
  --template-file-gcs-location ${TEMPLATE_LOC} \
  --parameters "inputPath=${INPUT_PATH},outputPath=${OUTPUT_PATH},tableName=${BQ_TABLE}"
  2. Ensure that your pipeline completes successfully.
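
One way to watch the job from the same terminal is sketched below (job and table names match the exports above):

# List recent Dataflow jobs and their states in the region
gcloud dataflow jobs list --region=$REGION
# Once the job succeeds, preview a few rows written to the output table
bq head -n 5 ${PROJECT_ID}:logs.logs_filtered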

Click Check my progress to verify the objective. Execute the template from the UI and using gcloud

End your lab

When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.


