Punkty kontrolne
Generate synthetic data
/ 15
Run your pipeline
/ 15
Creating a JSON schema file
/ 10
Write a JavaScript User-Defined Function in Javascript file
/ 15
Running a Dataflow Template
/ 10
Serverless Data Processing with Dataflow - Writing an ETL pipeline using Apache Beam and Dataflow (Java)
- Overview
- Setup and requirements
- Apache Beam and Dataflow
- Lab part 1. Writing an ETL pipeline from scratch
- Task 1. Generate synthetic data
- Task 2. Read data from your source
- Task 3. Run your pipeline to verify that it works
- Task 4. Adding in a transformation
- Task 5. Writing to a sink
- Task 6. Run your pipeline
- Lab part 2. Parameterizing basic ETL
- Task 1. Creating a JSON schema file
- Task 2. Writing a JavaScript user-defined function
- Task 3. Running a Dataflow Template
- Task 4. Inspect the Dataflow Template code
- End your lab
Overview
In this lab, you:
- Build a batch Extract-Transform-Load pipeline in Apache Beam, which takes raw data from Google Cloud Storage and writes it to BigQuery.
- Run the Apache Beam pipeline on Dataflow.
- Parameterize the execution of the pipeline.
Prerequisites:
- Basic familiarity with Java.
Setup and requirements
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
-
Sign in to Qwiklabs using an incognito window.
-
Note the lab's access time (for example,
1:15:00
), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning. -
When ready, click Start lab.
-
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
-
Click Open Google Console.
-
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges. -
Accept the terms and skip the recovery resource page.
Activate Google Cloud Shell
Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.
-
In Cloud console, on the top right toolbar, click the Open Cloud Shell button.
-
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
- You can list the active account name with this command:
Output:
Example output:
- You can list the project ID with this command:
Output:
Example output:
Check project permissions
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).
-
In the Google Cloud console, on the Navigation menu (), select IAM & Admin > IAM.
-
Confirm that the default compute Service Account
{project-number}-compute@developer.gserviceaccount.com
is present and has theeditor
role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.
editor
role, follow the steps below to assign the required role.- In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
- Copy the project number (e.g.
729328892908
). - On the Navigation menu, select IAM & Admin > IAM.
- At the top of the roles table, below View by Principals, click Grant Access.
- For New principals, type:
- Replace
{project-number}
with your project number. - For Role, select Project (or Basic) > Editor.
- Click Save.
Setting up your IDE
For the purposes of this lab, you will mainly be using a Theia Web IDE hosted on Google Compute Engine. It has the lab repo pre-cloned. There is java langauge server support, as well as a terminal for programmatic access to Google Cloud APIs via the gcloud
command line tool, similar to Cloud Shell.
- To access your Theia IDE, copy and paste the link shown in Google Cloud Skills Boost to a new tab.
The lab repo has been cloned to your environment. Each lab is divided into a labs
folder with code to be completed by you, and a solution
folder with a fully workable example to reference if you get stuck.
- Click on the
File Explorer
button to look:
You can also create multiple terminals in this environment, just as you would with cloud shell:
You can see with by running gcloud auth list
on the terminal that you're logged in as a provided service account, which has the exact same permissions are your lab user account:
If at any point your environment stops working, you can try resetting the VM hosting your IDE from the GCE console like this:
Apache Beam and Dataflow
About 5 minutes
Dataflow is a fully-managed Google Cloud service for running batch and streaming Apache Beam data processing pipelines.
Apache Beam is an open source, advanced, unified and portable data processing programming model that allows end users to define both batch and streaming data-parallel processing pipelines using Java, Python, or Go. Apache Beam pipelines can be executed on your local development machine on small datasets, and at scale on Dataflow. However, because Apache Beam is open source, other runners exist — you can run Beam pipelines on Apache Flink and Apache Spark, among others.
Lab part 1. Writing an ETL pipeline from scratch
Introduction
In this section, you write an Apache Beam Extract-Transform-Load (ETL) pipeline from scratch.
Dataset and use case review
For each lab in this quest, the input data is intended to resemble web server logs in Common Log format along with other data that a web server might contain. For this first lab, the data is treated as a batch source; in later labs, the data will be treated as a streaming source. Your task is to read the data, parse it, and then write it to BigQuery, a serverless data warehouse, for later data analysis..
Open the appropriate lab
- Create a new terminal in your IDE environment, if you haven't already, and copy and paste the following command:
Modifying the pom.xml file
Before you can begin editing the actual pipeline code, you need to add in necessary dependencies.
- Add the following dependencies to your
pom.xml
file, which is located in1_Basic_ETL/labs
, inside the dependencies tag:
-
A
<beam.version>
tag has already been added in thepom.xml
to indicate which version of Beam to install. -
Save the file.
-
Lastly, download these dependencies for use in your pipeline:
Write your first pipeline
1 hour
Task 1. Generate synthetic data
- Run the following command in the shell to clone a repository containing scripts for generating synthetic web server logs:
The script creates a file called events.json
containing lines resembling the following:
It then automatically copies this file to your Google Cloud Storage bucket at gs://<YOUR-PROJECT-ID>/events.json
.
- Navigate to Google Cloud Storage and confirm that your storage bucket contains a file called
events.json
.
Click Check my progress to verify the objective.
Task 2. Read data from your source
If you get stuck in this or later sections, refer to the solution.
- Open up
MyPipeline.java
in your IDE, which can be found in1_Basic_ETL/labs/src/main/java/com/mypackage/pipeline
. Make sure the following packages are imported:
- Scroll down to the
run()
method. This method currently contains a pipeline that doesn’t do anything; note how a Pipeline object is created using a PipelineOptions object and the final line of the method runs the pipeline.
All data in Apache Beam pipelines reside in PCollections. To create your pipeline’s initial PCollection
, apply a root transform to your pipeline object. A root transform creates a PCollection
from either an external data source or some local data you specify.
There are two kinds of root transforms in the Beam SDKs: Read and Create. Read transforms read data from an external source, such as a text file or a database table. Create transforms create a PCollection
from an in-memory java.util.Collection
and are especially useful for testing.
The following example code shows how to apply a TextIO.Read
root transform to read data from a text file. The transform is applied to a Pipeline
object p, and returns a pipeline data set in the form of a PCollection<String>
. "ReadLines" is your name for the transform, which will be helpful later when working with larger pipelines:
-
Inside the
run()
method, create a string constant called “input” and set its value togs://<YOUR-PROJECT-ID>/events.json
. In a future lab, you will use command-line parameters to pass this information. -
Create a
PCollection
of strings of all the events inevents.json
by calling the TextIO.read() transform. -
Add any appropriate import statements to the top of
MyPipeline.java
, in this caseimport org.apache.beam.sdk.values.PCollection;
Task 3. Run your pipeline to verify that it works
- Return to the terminal and return to
$BASE_DIR
folder and execute themvn compile exec:java
command:
At the moment, your pipeline doesn’t actually do anything; it simply reads in data. However, running it demonstrates a useful workflow, in which you verify the pipeline locally and cheaply using DirectRunner running on your local machine before doing more expensive computations. To run the pipeline using Dataflow, you may change runner
to DataflowRunner.
Task 4. Adding in a transformation
If you get stuck, refer to the solution.
Transforms are what change your data. In Apache Beam, transforms are done by the PTransform class. At runtime, these operations will be performed on a number of independent workers. The input and output to every PTransform
is a PCollection
. In fact, though you may not have realized it, you have already used a PTransform
when you read in data from Google Cloud Storage. Whether or not you assigned it to a variable, this created a PCollection
of strings.
Because Beam uses a generic apply method for PCollection
s, you can chain transforms sequentially. For example, you can chain transforms to create a sequential pipeline, like this one:
For this task, use a new sort of transform: a ParDo. ParDo
is a Beam transform for generic parallel processing. The ParDo
processing paradigm is similar to the “Map” phase of a Map/Shuffle/Reduce-style algorithm: a ParDo
transform considers each element in the input PCollection
, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection
.
ParDo
is useful for a variety of common data processing operations, including:
- Filtering a data set. You can use
ParDo
to consider each element in aPCollection
and either output that element to a new collection, or discard it. - Formatting or type-converting each element in a data set. If your input
PCollection
contains elements that are of a different type or format than you want, you can useParDo
to perform a conversion on each element and output the result to a newPCollection
. - Extracting parts of each element in a data set. If you have a
PCollection
of records with multiple fields, for example, you can use aParDo
to parse out just the fields you want to consider into a newPCollection
. - Performing computations on each element in a data set. You can use
ParDo
to perform simple or complex computations on every element, or certain elements, of aPCollection
and output the results as a newPCollection
.
- To complete this task, write a
ParDo
transform that reads in a JSON string representing a single event, parses it using Gson, and outputs the custom object returned by Gson.
ParDo
functions can be implemented in two ways, either inline or as a static class. Write inline ParDo
functions like this:
Alternatively, they can be implemented as static classes that extend DoFn. This allows them to be more easily integrated with testing frameworks:
And then, within the pipeline itself:
- In order to use
Gson
, you will need to create an inner class inside ofMyPipeline
. To take advantage of Beam Schemas, add the @DefaultSchema annotation. More on that later. Here’s an example of how to useGson
:
-
Name your inner class
CommmonLog
. To construct this inner class with the right state variables, refer to the sample JSON above: the class should have one state variable for every key in the incoming JSON, and this variable should agree in type and name with the value and key. -
Use
String
for the "timestamp" for now,Long
for "INTEGER" (BigQuery Integer is INT64),Double
for "FLOAT" (BigQuery FLOAT is FLOAT64), and match the following BigQuery schema:
Remember, if you get stuck, refer to the solution.
Task 5. Writing to a sink
At this point, your pipeline reads a file from Google Cloud Storage, parses each line, and emits a CommonLog
for each element. The next step is to write these CommonLog
objects into a BigQuery table.
While you can instruct your pipeline to create a BigQuery table if needed, you will need to create the dataset ahead of time. This has already been done by the generate_batch_events.sh
script.
You can examine the dataset:
To output your pipeline’s final PCollection
s, you apply a Write transform to that PCollection
. Write transforms can output the elements of a PCollection
to an external data sink, such as a database table. You can use Write to output a PCollection
at any time in your pipeline, although you’ll typically write out data at the end of your pipeline.
The following example code shows how to apply a TextIO.Write
transform to write a PCollection
of String to a text file:
- In this case, instead of using
TextIO.write()
, use BigQueryIO.write().
This function requires a number of things to be specified, including the specific table to write to, the schema of this table. You can optionally specify whether to append to an existing table, recreate existing tables (helpful in early pipeline iteration), or create the table if it doesn't exist. By default, this transform will create tables that don't exist and won't write to a non-empty table.
Since the addition of Beam Schemas to the SDK, you can instruct the transform to infer the table schema from the object passed to it using .useBeamSchema()
and by marking the input type. Alternatively, you can explicitly provide the schema with .withSchema()
but would need to build a BigQuery TableSchema object to pass. Because you annotated the CommonLog
class with @DefaultSchema(JavaFieldSchema.class)
, each transform is aware of the names and types of the fields in the object, including BigQueryIO.write()
.
- Examine the various alternatives in the 'Writing' section of BigQueryIO. In this case, since you annotated your
CommonLog
object, utilize.useBeamSchema()
, and target<YOUR-PROJECT-ID>:logs.logs
table like this:
The set of all available types in Beam Schemas can be found in the Schema. FieldType documentation. All the possible BigQuery datatypes in Standard SQL that can be used can be found in the setType documentation and if you're curious, inspect the Beam Schema to BigQuery conversion.
Task 6. Run your pipeline
Return to the terminal, change the value of the RUNNER
environment variable to DataflowRunner
and run your pipeline using the same command as earlier. Once it has started, navigate to the Dataflow product page and note the arrangement of your pipeline. If you gave your transforms names, they will be displayed. Clicking on each one will reveal in real time the number of elements being processed each second.
The overall shape should be a single path, starting with the Read transform and ending with the Write transform. As your pipeline runs, workers will be added automatically, as the service determines the needs of your pipeline, and then disappear when they are no longer needed. You can observe this by navigating to Compute Engine, where you should see virtual machines created by the Dataflow service.
- Once your pipeline has finished, return to the BigQuery browser window and query your table.
If your code isn’t performing as expected and you don’t know what to do, check out the solution.
Click Check my progress to verify the objective.
Lab part 2. Parameterizing basic ETL
Approximately 20 minutes
Much of the work of data engineers is either predictable, like recurring jobs, or it’s similar to other work. However, the process for running pipelines requires engineering expertise. Think back to the steps that you just completed:
- You created a development environment and developed a pipeline. The environment included the Apache Beam SDK and other dependencies.
- You executed the pipeline from the development environment. The Apache Beam SDK staged files in Cloud Storage, created a job request file, and submitted the file to the Dataflow service.
It would be much better if were a way to initiate a job through an API call or without having to set up a development environment (which non-technical users would be unable to do). This would allow you also to pipelines to run.
Dataflow Templates seek to solve this problem by changing the representation that is created when a pipeline is compiled so that it is parameterizable. Unfortunately, it is not as simple as exposing command-line parameters, although that is something you do in a later lab. With Dataflow Templates, the workflow above becomes:
- Developers create a development environment and develop their pipeline. The environment includes the Apache Beam SDK and other dependencies.
- Developers execute the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to job request), and saves the template file in Cloud Storage.
- Non-developer users or other workflow tools like Airflow can easily execute jobs with the Google Cloud console, gcloud command-line tool, or the REST API to submit template file execution requests to the Dataflow service.
In this lab, you will practice using one of the many Google-created Dataflow Templates to accomplish the same task as the pipeline that you built in Part 1.
Task 1. Creating a JSON schema file
While you didn't have to pass a TableSchema
object to the BigQueryIO.writeTableRows()
transform since you utilized .usedBeamSchema()
, you must pass the Dataflow Template a JSON file representing the schema in this example.
- Open up the terminal and navigate back to the main directory. Run the following command to grab the schema from your existing
logs.logs
table:
- Now, capture this output in a file and upload to GCS. The extra sed commands are to build a full JSON object that DataFlow will expect.
Click Check my progress to verify the objective.
Task 2. Writing a JavaScript user-defined function
The Cloud Storage to BigQuery Dataflow Template requires a JavaScript function to convert the raw text into valid JSON. In this case, each line of text is valid JSON, so the function is somewhat trivial.
To complete this task, use the IDE to create a .js
file with the contents below and then copy it to Google Cloud Storage.
- Copy the function below to its own
transform.js
file in the 1_Basic_ETL/ folder:
- Then run the following to copy the file to Google Cloud Storage:
Click Check my progress to verify the objective.
Task 3. Running a Dataflow Template
-
Go to the Dataflow Web UI.
-
Click CREATE JOB FROM TEMPLATE.
-
Enter a job name for your Dataflow job.
-
For Regional endpoint select
region. -
Under Dataflow template, select the Text Files on Cloud Storage to BigQuery template under the Process Data in Bulk (batch) section - NOT the Streaming section.
-
For Cloud Storage input path, enter the path to
events.json
in the formgs://<YOUR-PROJECT-ID>/events.json
-
For Cloud Storage location of your BigQuery schema file, write the path to your schema.json file, in the form
gs://<YOUR-PROJECT-ID>/schema.json
-
For BigQuery output table, enter
<myprojectid>:logs.logs
-
For Temporary BigQuery directory, enter a new folder within this same bucket. The job will create it for you.
-
For Temporary location, enter a second new folder within this same bucket.
-
Leave Encryption at Google-managed key.
-
Click to open Optional Prameters.
-
For JavaScript UDF path in Cloud Storage, enter in the path to your
.js
, in the formgs://<YOUR-PROJECT-ID>/transform.js
-
For Javascript UDF name, enter
transform
-
Click the Run job button.
While your job is running, you may inspect it from within the Dataflow Web UI.
Click Check my progress to verify the objective.
Task 4. Inspect the Dataflow Template code
-
Recall the code for the Dataflow Template you just used.
-
Scroll down to the main method. The code should look familiar to the pipeline you authored!
- It begins with a
Pipeline
object, created using aPipelineOptions
object. - It consists of a chain of
PTransform
s, beginning with a TextIO.read() transform. - The PTransform after the read transform is a bit different; it allows one to use Javascript to transform the input strings if, for example, the source format doesn’t align well with the BigQuery table format; for documentation on how to use this feature, see this page.
- The PTransform after the Javascript UDF uses a library function to convert the JSON into a tablerow; you can inspect that code here.
- The write PTransform looks a bit different because instead of making use of a schema that is known at graph compile-time, the code is intended to accept parameters that will only be known at run-time. The NestedValueProvider class is what makes this possible. It also is using an explicitally set schema, vs one inferred from Beam schema using
.useBeamSchema()
as you did.
- Make sure to check out the next lab, which will cover making pipelines that are not simply chains of
PTransform
s, and how you can adapt a pipeline you’ve built to be a custom Dataflow Template.
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
- 1 star = Very dissatisfied
- 2 stars = Dissatisfied
- 3 stars = Neutral
- 4 stars = Satisfied
- 5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.