Use Dataproc Serverless for Spark to Load BigQuery

Lab · 45 hours · 1 credit · Introductory

Note: This lab may incorporate AI tools to support your learning.

Overview

Dataproc Serverless is a fully managed service that makes it easier to run open source data processing and analytics workloads without having to manage infrastructure or manually tune workloads.

Dataproc Serverless for Spark provides an optimized environment designed to easily move existing Spark workloads to Google Cloud.

In this lab, you run a batch workload in the Dataproc Serverless environment. The workload uses a Spark template to process an Avro file, then creates and loads a BigQuery table.
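
For orientation, the flow the template implements looks roughly like the following PySpark job. This is a minimal sketch, not the template's actual source: YOUR_PROJECT_ID is a placeholder, and it assumes the Avro and spark-bigquery connectors are available to the Spark session.

# Rough PySpark sketch of the GCS-to-BigQuery flow; the lab itself runs
# the packaged GCSTOBIGQUERY template rather than this code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-bigquery").getOrCreate()

# Avro files embed their own schema, so no explicit schema is needed.
df = spark.read.format("avro").load("gs://YOUR_PROJECT_ID/campaigns.avro")

# Write to BigQuery through the spark-bigquery connector, staging load
# files in the temporary bucket created for this purpose.
(df.write.format("bigquery")
    .option("table", "loadavro.campaigns")
    .option("temporaryGcsBucket", "YOUR_PROJECT_ID-bqtemp")
    .mode("overwrite")
    .save())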

What you'll do

  • Configure the environment
  • Download lab assets
  • Configure and execute the Spark code
  • View data in BigQuery

Setup

Before you click the Start Lab button

Note: Read these instructions.

Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This Qwiklabs hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

What you need

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
  • Time to complete the lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab.
Note: If you are using a Pixelbook, open an Incognito window to run this lab.

How to start your lab and sign in to the Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is a panel populated with the temporary credentials that you must use for this lab.


  2. Copy the username, and then click Open Google Console. The lab spins up resources, and then opens another tab that shows the Choose an account page.

    Note: Open the tabs in separate windows, side-by-side.
  3. On the Choose an account page, click Use Another Account. The Sign in page opens.


  4. Paste the username that you copied from the Connection Details panel. Then copy and paste the password.

Note: You must use the credentials from the Connection Details panel. Do not use your Google Cloud Skills Boost credentials. If you have your own Google Cloud account, do not use it for this lab; this avoids incurring charges to your account.
  5. Click through the subsequent pages:
  • Accept the terms and conditions.
  • Do not add recovery options or two-factor authentication (because this is a temporary account).
  • Do not sign up for free trials.

After a few moments, the Cloud console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.

Activate Google Cloud Shell

Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud.

Google Cloud Shell provides command-line access to your Google Cloud resources.

  1. In Cloud console, on the top right toolbar, click the Open Cloud Shell button.


  2. Click Continue.

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  • You can list the active account name with this command:
gcloud auth list

Output:

Credentialed accounts:
- @.com (active)

Example output:

Credentialed accounts:
- google1623327_student@qwiklabs.net
  • You can list the project ID with this command:
gcloud config list project

Output:

[core] project =

Example output:

[core] project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Task 1. Complete environment configuration tasks

First, you're going to perform a few environment configuration tasks to support the execution of a Dataproc Serverless workload.

  1. In Cloud Shell, run the following command to enable Private Google Access on the default subnet:
gcloud compute networks subnets update default --region={{{project_0.default_region |REGION}}} --enable-private-ip-google-access
  2. Use the following command to create a new Cloud Storage bucket as a staging location:
gsutil mb -p {{{project_0.project_id |PROJECT_ID}}} gs://{{{project_0.project_id |PROJECT_ID}}}
  3. Use the following command to create a new Cloud Storage bucket as a temporary location for BigQuery while it creates and loads a table:
gsutil mb -p {{{project_0.project_id |PROJECT_ID}}} gs://{{{project_0.project_id |PROJECT_ID}}}-bqtemp
  4. Create a BigQuery dataset to store the data:
bq mk -d loadavro
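
Before moving on, you can optionally sanity-check these resources. The following is a minimal Python sketch (for illustration, not a lab step) using the google-cloud-storage and google-cloud-bigquery client libraries, which come pre-installed in Cloud Shell; your-project-id is a placeholder for your lab project ID.

# Optional verification sketch: confirm the two buckets and the dataset exist.
from google.cloud import bigquery, storage

project_id = "your-project-id"  # placeholder: use your lab project ID

storage_client = storage.Client(project=project_id)
for name in (project_id, f"{project_id}-bqtemp"):
    print(name, "exists:", storage_client.bucket(name).exists())

bq_client = bigquery.Client(project=project_id)
print("dataset:", bq_client.get_dataset("loadavro").dataset_id)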

Task 2. Download lab assets

Next, you're going to download the assets necessary to complete the lab onto the lab-provided Compute Engine VM. You will perform the rest of the lab steps inside this VM.

  1. From the Navigation menu, click Compute Engine. Here you'll see a Linux VM provisioned for you. Click the SSH button next to the lab-vm instance.


  2. At the VM terminal prompt, download the Avro file that will be processed and stored in BigQuery:
wget https://storage.googleapis.com/cloud-training/dataengineering/lab_assets/idegc/campaigns.avro
  3. Move the Avro file to the staging Cloud Storage bucket you created earlier:
gcloud storage cp campaigns.avro gs://{{{project_0.project_id |PROJECT_ID}}}
  4. Download an archive containing the Spark code to be executed against the Serverless environment:
wget https://storage.googleapis.com/cloud-training/dataengineering/lab_assets/idegc/dataproc-templates.zip
  5. Extract the archive:
unzip dataproc-templates.zip
  6. Change to the Python directory:
cd dataproc-templates/python
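
If you're curious about the file before loading it, Avro embeds its schema in the file itself. Below is an optional sketch (not a lab step) that assumes the fastavro library is installed (pip install fastavro); run it in the VM while campaigns.avro is still in the home directory.

# Optional: print the embedded schema and the first few records.
from fastavro import reader

with open("campaigns.avro", "rb") as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)          # schema carried by the file
    for i, record in enumerate(avro_reader):
        print(record)
        if i == 2:                            # stop after three records
            break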

Task 3. Configure and execute the Spark code

Next, you're going to set a few environment variables in the VM instance terminal and then execute a Spark template to load data into BigQuery.

  1. Set the following environment variables for the Dataproc Serverless environment:
export GCP_PROJECT={{{project_0.project_id |PROJECT_ID}}}
export REGION={{{project_0.default_region |REGION}}}
export GCS_STAGING_LOCATION=gs://{{{project_0.project_id |PROJECT_ID}}}
export JARS=gs://cloud-training/dataengineering/lab_assets/idegc/spark-bigquery_2.12-20221021-2134.jar
  2. Run the following command to execute the Spark Cloud Storage to BigQuery template, which loads the Avro file into BigQuery:
./bin/start.sh \
-- --template=GCSTOBIGQUERY \
--gcs.bigquery.input.format="avro" \
--gcs.bigquery.input.location="gs://{{{project_0.project_id |PROJECT_ID}}}" \
--gcs.bigquery.input.inferschema="true" \
--gcs.bigquery.output.dataset="loadavro" \
--gcs.bigquery.output.table="campaigns" \
--gcs.bigquery.output.mode=overwrite \
--gcs.bigquery.temp.bucket.name="{{{project_0.project_id |PROJECT_ID}}}-bqtemp"

Note: You may safely ignore any warning stating: WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path... Because this is a small test, a metadata directory is not required.
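
Under the hood, start.sh submits the job as a Dataproc Serverless batch. If you ever want to submit a batch programmatically instead, the sketch below shows the rough shape using the google-cloud-dataproc Python client; the main_python_file_uri and args values are illustrative placeholders, not the template's actual entry point.

# Rough sketch of creating a Dataproc Serverless batch with the Python client.
# The file URI and args are placeholders, not the exact values start.sh uses.
from google.cloud import dataproc_v1

project_id = "your-project-id"   # placeholder: your lab project ID
region = "your-region"           # placeholder: your lab region

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://your-bucket/main.py",  # placeholder entry point
        args=["--template=GCSTOBIGQUERY"],                # plus the gcs.bigquery.* options
        jar_file_uris=["gs://cloud-training/dataengineering/lab_assets/idegc/spark-bigquery_2.12-20221021-2134.jar"],
    )
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}", batch=batch
)
print(operation.result())  # blocks until the batch completes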

Task 4. Confirm that the data was loaded into BigQuery

Now that you have successfully executed the Spark template, it is time to examine the results in BigQuery.

  1. View the data in the new table in BigQuery.
bq query \
--use_legacy_sql=false \
'SELECT * FROM `loadavro.campaigns`;'
  2. The query should return results similar to the following:

Example output:

+------------+--------+---------------------+--------+---------------------+----------+-----+
| created_at | period | campaign_name       | amount | advertising_channel | bid_type | id  |
+------------+--------+---------------------+--------+---------------------+----------+-----+
| 2020-09-17 |     90 | NA - Video - Other  |     41 | Video               | CPC      |  81 |
| 2021-01-19 |     30 | NA - Video - Promo  |    325 | Video               | CPC      | 137 |
| 2021-06-28 |     30 | NA - Video - Promo  |     78 | Video               | CPC      | 214 |
| 2021-03-15 |     30 | EU - Search - Brand |    465 | Search              | CPC      | 170 |
| 2022-01-01 |     30 | EU - Search - Brand |     83 | Search              | CPC      | 276 |
| 2020-02-18 |     30 | EU - Search - Brand |     30 | Search              | CPC      |  25 |
| 2021-06-08 |     30 | EU - Search - Brand |    172 | Search              | CPC      | 201 |
| 2020-11-29 |     60 | EU - Search - Other |     83 | Search              | CPC      | 115 |
| 2021-09-11 |     30 | EU - Search - Other |     86 | Search              | CPC      | 237 |
| 2022-02-17 |     30 | EU - Search - Other |     64 | Search              | CPC      | 296 |
+------------+--------+---------------------+--------+---------------------+----------+-----+
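
As an alternative to the bq CLI, the same check can be done with a short Python sketch using the google-cloud-bigquery client library (pre-installed in Cloud Shell and on the VM):

# Query the new table with the BigQuery Python client.
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `loadavro.campaigns` LIMIT 10").result()
for row in rows:
    print(dict(row))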

Congratulations!

You successfully executed a Batch workload using Dataproc Serverless for Spark to load an Avro file into a BigQuery table.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
