Loading Taxi Data into Google Cloud SQL 2.5
Overview
In this lab, you will learn how to import data from CSV text files into Cloud SQL and then carry out some basic data analysis using simple queries.
The dataset used in this lab is collected by the NYC Taxi and Limousine Commission and includes trip records from all trips completed in Yellow and Green taxis in NYC from 2009 to present, and all trips in for-hire vehicles (FHV) from 2015 to present. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
This dataset can be used to demonstrate a wide range of data science concepts and techniques and will be used in several of the labs in the Data Engineering curriculum.
Objectives
- Create a Cloud SQL instance
- Create a Cloud SQL database
- Import text data into Cloud SQL
- Check the data for integrity
Setup and requirements
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
- Sign in to Qwiklabs using an incognito window.
- Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time. There is no pause feature. You can restart if needed, but you have to start at the beginning.
- When ready, click Start lab.
- Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
- Click Open Google Console.
- Click Use another account and copy/paste credentials for this lab into the prompts. If you use other credentials, you'll receive errors or incur charges.
- Accept the terms and skip the recovery resource page.
Activate Google Cloud Shell
Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.
- In Cloud console, on the top right toolbar, click the Open Cloud Shell button.
- Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
- You can list the active account name with the command gcloud auth list.
- You can list the project ID with the command gcloud config list project.
Task 1. Preparing your environment
- Create environment variables that will be used later in the lab for your project ID and the storage bucket that will contain your data:
Task 2. Create a Cloud SQL instance
- Enter the following commands to create a Cloud SQL instance:
This will take a few minutes to complete.
Test completed task
Click Check my progress to verify the task. If you have completed it successfully, you will be granted an assessment score.
- Set a root password for the Cloud SQL instance:
- When prompted for the password, type Passw0rd and press Enter. This updates the root password. The password is masked as you type and is not visible in the terminal.
- Now create an environment variable with the IP address of the Cloud Shell:
- Whitelist the Cloud Shell instance for management access to your SQL instance:
- When prompted, press Y to accept the change.
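Authorized networks for a Cloud SQL instance are specified in CIDR notation, so a single client address such as the Cloud Shell IP is written with a /32 suffix. The following Python sketch shows how such a value can be formed and sanity-checked locally. The helper name is hypothetical and the address is a documentation placeholder, not a real Cloud Shell IP.

```python
import ipaddress


def to_single_host_cidr(ip: str) -> str:
    """Return the CIDR block that matches exactly one IPv4 address."""
    # IPv4Address raises ValueError if the string is not a valid IPv4 address.
    addr = ipaddress.IPv4Address(ip)
    # A /32 prefix covers all 32 bits, so the block contains one host only.
    return str(ipaddress.ip_network(f"{addr}/32"))


# 203.0.113.10 comes from a range reserved for documentation (placeholder).
cidr = to_single_host_cidr("203.0.113.10")
print(cidr)
```

Validating the address before using it helps catch an empty or malformed value before it is passed on to the instance configuration.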
Test completed task
Click Check my progress to verify the task. If you have completed it successfully, you will be granted an assessment score.
- Get the IP address of your Cloud SQL instance by running:
- Check the variable MYSQLIP:
You should get an IP address as an output.
- Create the taxi trips table by logging into the mysql command line interface:
- When prompted for a password, enter Passw0rd.
- Paste the following content into the command line to create the schema for the trips table:
Test completed task
Click Check my progress to verify the task. If you have completed it successfully, you will be granted an assessment score.
- In the mysql command line interface, check the import by entering the following commands:
- Query the trips table:
This will return an empty set as there is no data in the database yet.
- Exit the mysql interactive console:
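The create-then-query sequence in this task can be tried out locally before touching the Cloud SQL instance. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The reduced schema and the column names pickup_location_id and total_amount are assumptions made for illustration; they are not the lab's actual table definition.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for the Cloud SQL instance.
conn = sqlite3.connect(":memory:")

# Hypothetical, reduced schema: only a handful of illustrative columns.
conn.execute(
    """
    CREATE TABLE trips (
        pickup_location_id INTEGER,
        trip_distance      REAL,
        fare_amount        REAL,
        total_amount       REAL,
        payment_type       INTEGER
    )
    """
)

# List the column names to confirm the table was created as intended.
columns = [row[1] for row in conn.execute("PRAGMA table_info(trips)")]
print(columns)

# No rows have been loaded yet, so the query returns an empty result.
rows = conn.execute("SELECT DISTINCT pickup_location_id FROM trips").fetchall()
print(rows)
```

An empty result at this point confirms that the table exists and can be queried, which is all this step is meant to show.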
Task 3. Add data to Cloud SQL instance
Now you'll copy the New York City taxi trips CSV files from Cloud Storage to your Cloud Shell environment. To keep resource usage low, you'll work with only a subset of the data (~20,000 rows).
- Run the following in the command line:
- Connect to the mysql interactive console to load local infile data:
- When prompted for a password, enter Passw0rd.
- In the mysql interactive console, select the database:
- Load the local CSV file data using local-infile:
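The idea behind loading a local CSV file into a table can be sketched in a few lines of Python, again with sqlite3 standing in for Cloud SQL. The sample rows, the column names, and the presence of a header row are assumptions for illustration only; they are not taken from the lab's CSV file.

```python
import csv
import io
import sqlite3

# Made-up sample standing in for the downloaded CSV file (header plus 3 rows).
sample_csv = io.StringIO(
    "pickup_location_id,trip_distance,fare_amount,payment_type\n"
    "142,1.8,9.5,1\n"
    "236,0.0,52.0,2\n"
    "142,3.2,-4.5,4\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (pickup_location_id INTEGER, trip_distance REAL, "
    "fare_amount REAL, payment_type INTEGER)"
)

reader = csv.reader(sample_csv)
next(reader)  # skip the header row so it is not inserted as data
with conn:  # commits on success, rolls back if an insert fails
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", reader)

# Compare the loaded row count with what the source file should contain.
loaded = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
unique_pickups = conn.execute(
    "SELECT COUNT(DISTINCT pickup_location_id) FROM trips"
).fetchone()[0]
print(loaded, unique_pickups)
```

Comparing the loaded row count against the number of data lines in the file is a quick first check that nothing was dropped or duplicated during the import.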
Task 4. Checking for data integrity
Whenever data is imported from a source it's always important to check for data integrity. Roughly, this means making sure the data meets your expectations.
- Query the trips table for unique pickup location regions:
This should return 159 unique IDs.
- Let's start by digging into the trip_distance column. Enter the following query into the console:
One would expect the trip distance to be greater than 0 and less than, say, 1000 miles. The maximum returned trip distance of 85 miles seems reasonable, but the minimum trip distance of 0 seems buggy.
- How many trips in the dataset have a trip distance of 0?
There are 155 such trips in the database. These trips warrant further exploration. You'll find that these trips have non-zero payment amounts associated with them. Perhaps these are fraudulent transactions?
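The two distance checks described here, a range check and a count of zero-distance trips that still carry a charge, can be sketched as follows. The example runs against an in-memory SQLite table with made-up rows, and the total_amount column name is an assumption, so the numbers it prints are unrelated to the lab's results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_distance REAL, total_amount REAL)")
# Made-up rows for illustration; they are not taken from the lab dataset.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1.8, 12.3), (0.0, 58.8), (85.0, 250.0), (0.0, 0.0), (4.1, 19.8)],
)

# Range check: the extremes show whether values fall in a plausible interval.
max_dist, min_dist = conn.execute(
    "SELECT MAX(trip_distance), MIN(trip_distance) FROM trips"
).fetchone()

# Zero-distance trips, and how many of them were still charged an amount.
zero_total, zero_charged = conn.execute(
    """
    SELECT COUNT(*), COALESCE(SUM(total_amount > 0), 0)
    FROM trips
    WHERE trip_distance = 0
    """
).fetchone()
print(max_dist, min_dist, zero_total, zero_charged)
```

Separating zero-distance trips with a charge from those without one narrows down which rows deserve a closer look.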
- Let's see if we can find more data that doesn't meet our expectations. We expect the fare_amount column to be positive. Enter the following query to see if this is true in the database:
There should be 14 such trips returned. Again, these trips warrant further exploration. There may be a reasonable explanation for why the fares take on negative numbers. However, it's up to the data engineer to ensure there are no bugs in the data pipeline that would cause such a result.
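Counting the negative fares is only the first step; listing the offending rows makes the follow-up investigation easier. Here is a minimal sketch using an in-memory SQLite table, with a hypothetical trip_id column and made-up rows that do not come from the lab dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (trip_id INTEGER, fare_amount REAL, payment_type INTEGER)"
)
# Made-up rows for illustration; trip_id is a hypothetical identifier column.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, 9.5, 1), (2, -4.5, 4), (3, 52.0, 2), (4, -2.5, 3), (5, 0.0, 1)],
)

# How many rows violate the expectation that fares are not negative?
negative_count = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE fare_amount < 0"
).fetchone()[0]

# Pull the offending rows, most negative first, so each can be examined.
suspects = conn.execute(
    "SELECT trip_id, fare_amount, payment_type FROM trips "
    "WHERE fare_amount < 0 ORDER BY fare_amount"
).fetchall()
print(negative_count, suspects)
```

Looking at other columns of the suspect rows, such as the payment type, may hint at whether the negative values are refunds, disputes, or pipeline errors.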
- Finally, let's investigate the payment_type column.
The results of the query indicate that there are four different payment types, with:
- Payment type = 1 has 13863 rows
- Payment type = 2 has 6016 rows
- Payment type = 3 has 113 rows
- Payment type = 4 has 32 rows
Digging into the documentation, a payment type of 1 refers to credit card use, a payment type of 2 is cash, a payment type of 3 is no charge, and a payment type of 4 refers to a dispute. The figures make sense.
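A grouped count like the one above becomes easier to read when each code is mapped to its documented meaning and any code without a known meaning is flagged. The sketch below does this against an in-memory SQLite table with made-up rows; the label for code 3 follows the TLC data dictionary, and the code 9 is deliberately invented to show how an unexpected value surfaces.

```python
import sqlite3

# Documented payment type codes; anything else is treated as unexpected.
PAYMENT_LABELS = {1: "credit card", 2: "cash", 3: "no charge", 4: "dispute"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (payment_type INTEGER)")
# Made-up rows for illustration; 9 is an invented, undocumented code.
conn.executemany(
    "INSERT INTO trips VALUES (?)",
    [(1,), (1,), (2,), (1,), (4,), (2,), (3,), (9,)],
)

# Count rows per payment type, ordered by code for stable output.
counts = conn.execute(
    "SELECT payment_type, COUNT(*) FROM trips "
    "GROUP BY payment_type ORDER BY payment_type"
).fetchall()

for code, n in counts:
    label = PAYMENT_LABELS.get(code, "undocumented code")
    print(f"{code} ({label}): {n}")

# Codes present in the data but missing from the documentation.
unexpected = [code for code, _ in counts if code not in PAYMENT_LABELS]
print(unexpected)
```

An empty list of unexpected codes corresponds to the conclusion in the text that the figures make sense.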
- Exit the mysql interactive console:
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
- 1 star = Very dissatisfied
- 2 stars = Dissatisfied
- 3 stars = Neutral
- 4 stars = Satisfied
- 5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.