Gemini for Data Scientists and Analysts

Course · 2 hours

Gemini for Data Scientists

Lab · 1 hour 30 minutes · 5 Credits · Intermediate

This lab may incorporate AI tools to support your learning.

Overview

In this lab, you are working as a data scientist for Cymbal superstore and you have been asked to help the marketing team identify, categorize, and develop new customers. The leadership team has asked you to split customers into 5 different groups based on their order behaviors and generate descriptive statistics about each group. However, you also want to give them some actionable next steps to take with each of these groups.

To identify these new customers, you will use Gemini, Vertex AI, and BigQuery to create, visualize, and summarize a K-means clustering model with ecommerce data to generate useful next steps for a marketing campaign. This lab is intended for data scientists of any experience level.

The environment has already been configured for you. This includes enabling the Cloud AI Companion API for Gemini and granting the IAM roles needed to use Gemini. For more information, see Gemini for Google Cloud overview.

Note: Duet AI was renamed to Gemini, our next-generation model. This lab has been updated to reflect this change. Any references to Duet AI in the user interface or documentation should be treated as equivalent to Gemini while following the lab instructions.

Note: As an early-stage technology, Gemini can generate output that seems plausible but is factually incorrect. We recommend that you validate all output from Gemini before you use it. For more information, see Gemini for Google Cloud and responsible AI.

Objectives

In this lab, you learn how to perform the following tasks:

Use Colab Enterprise Python notebooks inside BigQuery Studio.
Use BigQuery DataFrames inside of BigQuery Studio.
Use Gemini to generate code from natural language prompts.
Build a K-means clustering model.
Generate a visualization of the clusters.
Use the Gemini Pro model to develop the next steps for a marketing campaign.
Clean up project resources.

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

Access to a standard internet browser (Chrome browser recommended).

Note: Use an Incognito or private browser window to run this lab. This prevents conflicts between your personal account and the Student account, which could otherwise cause extra charges to be incurred on your personal account.

Time to complete the lab. Remember, once you start, you cannot pause a lab.

Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud console

Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:
- The Open Google Cloud console button
- Time remaining
- The temporary credentials that you must use for this lab
- Other information, if needed, to step through this lab
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).

The lab spins up resources, and then opens another tab that shows the Sign in page.

Tip: Arrange the tabs in separate windows, side-by-side.
Note: If you see the Choose an account dialog, click Use Another Account.
If necessary, copy the Username below and paste it into the Sign in dialog.
{{{user_0.username | "Username"}}}
You can also find the Username in the Lab Details panel.
Click Next.
Copy the Password below and paste it into the Welcome dialog.
{{{user_0.password | "Password"}}}
You can also find the Password in the Lab Details panel.
Click Next.
Important: You must use the credentials the lab provides you. Do not use your Google Cloud account credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
Click through the subsequent pages:
- Accept the terms and conditions.
- Do not add recovery options or two-factor authentication (because this is a temporary account).
- Do not sign up for free trials.

After a few moments, the Google Cloud console opens in this tab.

Note: To view a menu with a list of Google Cloud products and services, click the Navigation menu at the top-left.

Task 1. Create a BigQuery dataset for your project

In this task, you will create the ecommerce dataset in BigQuery. The dataset will be used to store the ecommerce data you will categorize in this lab.

In the Google Cloud console, select the Navigation menu, and then select BigQuery.

The Welcome to BigQuery in the Cloud Console pop-up is displayed.
Click Done.
In the Explorer panel, next to your project ID, select View actions, and then select Create dataset.

You create a dataset to store database objects, including tables and models.
In the Create dataset pane, enter the following information:

Field	Value
Dataset ID	ecommerce
Location type	select Multi-Region
Multi-region	select US (multiple regions in United States)

Leave the other fields at their defaults.
Click Create Dataset.

To verify the objective, click Check my progress. Create a BigQuery dataset for your project.

Task 2. Create a new Python notebook

In this task, you create a new Python notebook in BigQuery, so that you may use Gemini in BigQuery. The Python notebook is needed in BigQuery so that you may use Python machine learning libraries to identify customers and categorize them into groups, based upon shopping data in the ecommerce dataset.

In the Google Cloud console, select the Navigation menu, and then select BigQuery.
At the top of the page, click the down arrow next to the plus sign.
Select Python Notebook.
Select the region to store your code assets from the dropdown and click Select.
In the Start with a template pane, click Close.

Task 3. Connect to the Colab Enterprise runtime in BigQuery

The next step is to connect to the Colab Enterprise runtime in BigQuery. Think of this runtime as a managed environment in BigQuery that allows you to access machine learning libraries that can help you identify your customers and categorize them into groups.

While remaining in the BigQuery Studio console, in the upper right corner of the notebook click on the down arrow next to Connect.
In the drop down select Connect to a runtime.
Select Create new runtime.
Select Create default runtime.
Click on the Qwiklabs student ID.

Note: Wait a few minutes while the runtime is allocated. You will then see the connection status update to Connected at the bottom of your browser window. You will also notice that the Python notebook is added in the notebooks section of the explorer under your project.

To verify the objective, click Check my progress. Connect to the Colab Enterprise runtime in BigQuery.

Task 4. Build the Python Notebook

In this task, you begin to build the Python notebook, by performing the following steps:

Import Python libraries
Define variables
Create and import a base table as a BigQuery DataFrame from a public dataset
Generate the K-means clustering model and visualization

Import Python libraries and define variables

The first step in building the Python notebook is to import Python libraries and define variables.

To import the libraries into your notebook, use the steps below:

To add a code cell to the notebook, click the + Code button at the top of the notebook window.
Paste the following code snippet into the cell:
from google.cloud import bigquery
from google.cloud import aiplatform
import bigframes.pandas as bpd
import pandas as pd
from vertexai.language_models import TextGenerationModel
from vertexai.generative_models import GenerativeModel
from bigframes.ml.cluster import KMeans
from bigframes.ml.model_selection import train_test_split
Run the cell.

The runtime will load the Python libraries, which will take approximately 1 minute. You can follow the progress by checking the status of the runtime at the bottom of the browser window.

When complete you will see a green checkmark next to the Run button on the cell.

The table below provides more information about the Python libraries that you just imported to your notebook, including a brief description about each one.

Library	Description
BigQuery	Python Client for Google BigQuery
AI Platform	Vertex AI SDK for Python
bigframes.pandas	BigQuery DataFrames
pandas	Open source data analysis and manipulation tool, built on top of the Python programming language.
TextGenerationModel	Creates a LanguageModel within Vertex AI.
GenerativeModel	Loads a Gemini generative model within Vertex AI.
KMeans	Used to create K-means clustering models within BigQuery DataFrames.
train_test_split	Used to split the source dataset into test and training subsets, and used with model tuning within BigQuery DataFrames.

Note: If you want to learn more about each library, use the link provided.
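
To make the idea of a train/test split concrete, here is a minimal standard-library sketch of a seeded 80/20 split. It is an illustration only and not how bigframes implements train_test_split; the helper name split_rows is hypothetical.

```python
import random


def split_rows(rows, test_size=0.2, random_state=42):
    """Shuffle rows with a fixed seed and split them into train and test lists."""
    shuffled = list(rows)
    # A dedicated Random instance makes the shuffle reproducible for a given seed.
    random.Random(random_state).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(10))
print(len(train), len(test))  # 8 2
```

Fixing the seed means that rerunning the cell yields the same split, which keeps results comparable between runs.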

Define variables and initiate the BigQuery and Vertex AI connection

Next you will define variables and initiate the BigQuery and Vertex AI connection.

Add another code cell at the end of the notebook.
Paste the following code snippet into the cell.
project_id = "<project_id>"
dataset_name = "ecommerce"
model_name = "customer_segmentation_model"
table_name = "customer_stats"
location = "<location>"

client = bigquery.Client(project=project_id)
aiplatform.init(project=project_id, location=location)
Replace <project_id> with the project ID for your lab.
Replace <location> with the location (region) assigned to your lab.
Run the cell.
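
Later cells refer to the table and model by a fully qualified ID built from these variables. The sketch below shows that pattern and adds a guard against placeholders that were never replaced. The helper qualified_id and the project ID "my-project" are hypothetical and not part of the lab.

```python
def qualified_id(project_id, dataset_name, resource_name):
    """Build a `project.dataset.resource` ID from its three parts."""
    parts = [project_id, dataset_name, resource_name]
    # Catch values that were never filled in, such as "<project_id>".
    for part in parts:
        if not part or part.startswith("<"):
            raise ValueError(f"Unset or placeholder value: {part!r}")
    return ".".join(parts)


# "my-project" is a made-up project ID used only for illustration.
print(qualified_id("my-project", "ecommerce", "customer_stats"))
```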

Create and import the ecommerce.customer_stats table

Next, you will store data from the thelook_ecommerce BigQuery public dataset in a new table named customer_stats in your ecommerce dataset.

Add another code cell at the end of the notebook.
Paste the following code snippet into the cell.
%%bigquery
CREATE OR REPLACE TABLE ecommerce.customer_stats AS
SELECT
  user_id,
  DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) AS DATE), day) AS days_since_last_order, -- RECENCY
  COUNT(order_id) AS count_orders, -- FREQUENCY
  AVG(sale_price) AS average_spend -- MONETARY
FROM (
  SELECT
    user_id,
    order_id,
    sale_price,
    created_at AS order_created_date
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at BETWEEN '2022-01-01' AND '2023-01-01'
)
GROUP BY user_id;
Run the cell.
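
The query above derives three per-customer features: recency, frequency, and monetary value. As a rough illustration of the same aggregation, here is a standard-library sketch over a few made-up order items. The rows, the fixed reference date, and the function name customer_stats are assumptions for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order items: (user_id, order_id, sale_price, order_date).
order_items = [
    (1, 101, 20.0, date(2022, 3, 1)),
    (1, 102, 40.0, date(2022, 6, 15)),
    (2, 201, 100.0, date(2022, 1, 10)),
]


def customer_stats(rows, today):
    """Aggregate order items into per-user recency, frequency, and monetary values."""
    by_user = defaultdict(list)
    for user_id, order_id, sale_price, order_date in rows:
        by_user[user_id].append((order_id, sale_price, order_date))

    stats = {}
    for user_id, items in by_user.items():
        last_order = max(order_date for _, _, order_date in items)
        stats[user_id] = {
            # Recency: days between the reference date and the latest order.
            "days_since_last_order": (today - last_order).days,
            # Frequency: counts item rows, as COUNT(order_id) does in the query.
            "count_orders": len(items),
            # Monetary: mean sale price across the user's items.
            "average_spend": sum(price for _, price, _ in items) / len(items),
        }
    return stats


stats = customer_stats(order_items, today=date(2022, 7, 1))
print(stats[1])
```

Note that, as in the SQL, the frequency value counts rows in order_items, so an order with several items contributes more than once.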

Create a BigQuery DataFrame and load the data using a Gemini prompt

In this step, you create a BigQuery DataFrame using a Gemini prompt and load the customer stats data into it, so that you may process the data later with the K-means clustering model.

Note: As mentioned at the start of the lab, you need to validate all output from Gemini before you use it. Use the code examples provided as a guide, but do not copy and paste them as is, because they may not work in every case. You may also need to regenerate the code from Gemini to get a more suitable result.

Add another code cell at the end of the notebook.
Within the cell, click generate. This allows you to generate code with Gemini, and you will see a prompt where you can add text.
Paste the following text into the prompt.
Convert the table ecommerce.customer_stats to a bigframes dataframe and show the top 10 records
Click generate. Gemini will generate code similar to the following:
bqdf = client.read_gbq(f"{project_id}.{dataset_name}.{table_name}")
df.head(10)
Note: Remember that in a previous step, you added code to the second cell in the notebook that saved the project ID, the dataset name, and the table name as variables. If you completed that step, the generated code can use these variables, and the DataFrame will be created with the first 10 rows displayed.
The output above is not usable as is: it calls read_gbq on the BigQuery client rather than on bpd, and it displays df instead of bqdf. Regenerate or edit the code so that it looks similar to the code shown here:
bqdf = bpd.read_gbq(f"{project_id}.{dataset_name}.{table_name}")
bqdf.head(10)
Run the cell.

You will see the BigQuery DataFrame output with the first 10 rows of the dataset displayed.
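
One lightweight way to validate generated code before running it is to check which names it reads that nothing has defined yet. The sketch below does this with Python's ast module. The helper undefined_names is hypothetical, and the check is deliberately rough: it ignores statement order and imports inside the snippet, and it cannot tell whether a method such as client.read_gbq actually exists.

```python
import ast
import builtins


def undefined_names(code, known_names):
    """Return names the snippet reads that are not known, built in, or assigned in it."""
    tree = ast.parse(code)  # raises SyntaxError if the snippet is not valid Python
    assigned, loaded = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            else:
                loaded.add(node.id)
    return loaded - assigned - set(known_names) - set(dir(builtins))


# Names defined by earlier notebook cells in this lab.
notebook_names = {"bpd", "client", "project_id", "dataset_name", "table_name"}

generated = 'bqdf = client.read_gbq(f"{project_id}.{dataset_name}.{table_name}")\ndf.head(10)'
print(undefined_names(generated, notebook_names))  # {'df'}
```

An empty result does not prove the code is correct; it only rules out one common class of mistakes before you run the cell.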

Generate the K-means clustering model

Now that you have your customer data in a BigQuery DataFrame, you will create a K-means clustering model to split the customer data into clusters based on fields like order recency, order count, and spend, and you will then visualize these as groups within a chart directly within the notebook.

Add another code cell at the end of the notebook.
Click generate in the cell. This will allow you to generate code with Gemini using a prompt.
Add the following prompt to the cell:
1. Split df (using random state and test size 0.2) into test and training data for a K-means clustering algorithm store these as df_test and df_train. 2. Create a K-means cluster model using bigframes.ml.cluster KMeans with 5 clusters. 3. Save the model using the to_gbq method where the model name is project_id.dataset_name.model_name.
Click generate. You will see output similar to the following:
# prompt: 1. Split df (using random state and test size 0.2) into test and training data for a K-means clustering algorithm store these as df_test and df_train. 2. Create a K-means cluster model using bigframes.ml.cluster KMeans with 5 clusters. 3. Save the model using the to_gbq method where the model name is project_id.dataset_name.model_name.

# bqdf is the BigQuery DataFrame created in the previous step
df_train, df_test = train_test_split(bqdf, test_size=0.2, random_state=42)

kmeans = KMeans(n_clusters=5)
kmeans.fit(df_train)

kmeans.to_gbq(f"{project_id}.{dataset_name}.{model_name}")
Run the cell.
Note: This step will take approximately 2 minutes to complete.
Your model is now created!
Refresh the contents of your Explorer panel by clicking the three dots next to your project name and selecting Refresh contents. The model should appear under your ecommerce dataset.
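
For intuition about what a K-means model does, here is a small standard-library sketch of the classic algorithm (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. This is not the BigQuery ML implementation used by bigframes, it does no feature scaling, and the customer values are made up.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=42):
    """Cluster points (tuples of floats) with Lloyd's algorithm; return (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Made-up (days_since_last_order, average_spend) pairs forming two obvious groups.
customers = [
    (10.0, 50.0), (12.0, 55.0), (11.0, 52.0),
    (300.0, 20.0), (310.0, 25.0), (305.0, 22.0),
]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

Which group receives label 0 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.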

Next, you define a new BigQuery DataFrame that joins the segment/cluster produced by the K-means model back to the original data.
Add another code cell at the end of the notebook.
Click generate in the cell.

This will allow you to generate code with Gemini using a prompt.
Add the following prompt to the cell:
1. Call the K-means prediction model on the df dataframe, and store the results as predictions_df and show the first 10 records.
Click generate. You will see output similar to the following:
# prompt: 1. Call the K-means prediction model on the df dataframe, and store the results as predictions_df and show the first 10 records.

predictions_df = kmeans.predict(df_test)
predictions_df.head(10)
Run the cell.

You see that the first 10 records are shown with the CENTROID_ID. CENTROID_ID is the cluster that the customer will be categorized as later in the lab. You also see the user_id, days_since_last_order, count_orders and average_spend fields.
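
Conceptually, prediction with a fitted K-means model assigns each row the ID of its nearest centroid. The following standard-library sketch shows that rule with hypothetical centroids and customers. It assumes plain Euclidean distance on unscaled values and is not the BigQuery ML implementation.

```python
import math


def predict(points, centroids):
    """Return, for each point, the 1-based ID of the nearest centroid."""
    return [
        1 + min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


# Hypothetical centroids and customers as (days_since_last_order, average_spend).
example_centroids = [(11.0, 52.0), (305.0, 22.0)]
new_customers = [(15.0, 48.0), (290.0, 30.0)]
print(predict(new_customers, example_centroids))  # [1, 2]
```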

To verify the objective, click Check my progress. Generate the K-means clustering model.

Create a visualization of the K-means clustering model results

In this next step, you will create a visualization of the K-means clustering model results. Specifically, you will generate a scatterplot from predictions_df that shows the relationship between days since last order and average spend, colored by cluster (the CENTROID_ID generated by the K-means model).

Add another code cell at the end of the notebook.
Click generate in the cell. This will allow you to generate code with Gemini using a prompt.
Add the following prompt to the cell:
1. Using predictions_df, and matplotlib, generate a scatterplot. 2. On the x-axis of the scatterplot, display days_since_last_order and on the y-axis, display average_spend from predictions_df. 3. Color by cluster. 4. The chart should be titled "Attribute grouped by K-means cluster."
Click generate. You will see output similar to the following:
# prompt: 1. Using predictions_df, and matplotlib, generate a scatterplot. 2. On the x-axis of the scatterplot, display days_since_last_order and on the y-axis, display average_spend from predictions_df. 3. Color by cluster. 4. The chart should be titled "Attribute grouped by K-means cluster."

import matplotlib.pyplot as plt

# Create the scatter plot
plt.figure(figsize=(10, 6))  # Adjust figure size as needed
plt.scatter(predictions_df['days_since_last_order'], predictions_df['average_spend'], c=predictions_df['cluster'], cmap='viridis')

# Customize the plot
plt.title('Attribute grouped by K-means cluster')
plt.xlabel('Days Since Last Order')
plt.ylabel('Average Spend')
plt.colorbar(label='Cluster ID')

# Display the plot
plt.show()
In the c=predictions_df[...] argument only, replace 'cluster' (or 'cluster_id') with 'CENTROID_ID'.
Run the cell.

You will see the visualization displayed.

Note: If you get a TypeError, replace the code with the example output, and then run the cell again.

To verify the objective, click Check my progress. Generate the visualization from the K-means clustering model results.

Task 5. Generate insights from the results of the model

In this task, you generate insights from the results of the model by performing the following steps:

Summarize each cluster generated from the K-means model
Define a prompt for the marketing campaign
Generate the marketing campaign using the Gemini model

Summarize each cluster generated from the K-means model

In this step, you will summarize each cluster generated from the K-means model.

Add another code cell at the end of the notebook.
Paste the following code snippet into the cell:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  average_spend,
  count_orders,
  days_since_last_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('average_spend', 'count_orders', 'days_since_last_order')
)
ORDER BY centroid_id
""".format(dataset_name, model_name)

df_centroid = client.query(query).to_dataframe()
df_centroid.head()
Run the cell.

You should see the clusters summarized in a table. Some insights you can get from this table are that some clusters have a higher average spend, and others have a higher count of orders.
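
The query reshapes the centroid data from one row per feature into one row per cluster. As an illustration of that pivot, here is a standard-library sketch. The input layout (centroid_id, feature, numerical_value) mirrors the columns the query selects, the numbers are taken from the sample output shown in this lab, and the function name pivot_centroids is hypothetical.

```python
from collections import defaultdict

# Rows in the long layout the query reads: (centroid_id, feature, numerical_value).
long_rows = [
    (1, "average_spend", 48.32),
    (1, "count_orders", 1.36),
    (1, "days_since_last_order", 384.37),
    (2, "average_spend", 202.34),
    (2, "count_orders", 1.3),
    (2, "days_since_last_order", 482.62),
]


def pivot_centroids(rows):
    """Turn one-row-per-feature records into one dict per centroid."""
    wide = defaultdict(dict)
    for centroid_id, feature, value in rows:
        wide[centroid_id][feature] = round(value, 2)
    # Sort by centroid ID, as the ORDER BY clause does.
    return [
        {"centroid": f"cluster {centroid_id}", **features}
        for centroid_id, features in sorted(wide.items())
    ]


for row in pivot_centroids(long_rows):
    print(row)
```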

Next, you will convert the DataFrame into a string so that you can pass it to your large language model call.
Add another code cell at the end of the notebook.
Paste the following code snippet into the cell:
# Run the centroid query and build one descriptive line per cluster
df_query = client.query(query).to_dataframe()

cluster_info = []
for _, row in df_query.iterrows():
    cluster_info.append(
        "{0}, average spend ${1}, count of orders per person {2}, days since last order {3}".format(
            row["centroid"], row["average_spend"], row["count_orders"], row["days_since_last_order"]
        )
    )

# Join the per-cluster lines into a single string for the prompt
cluster_info = "\n".join(cluster_info)
print(cluster_info)
Run the cell.

The output should be similar to:
cluster 1, average spend $48.32, count of orders per person 1.36, days since last order 384.37
cluster 2, average spend $202.34, count of orders per person 1.3, days since last order 482.62
cluster 3, average spend $45.68, count of orders per person 1.36, days since last order 585.4
cluster 4, average spend $44.71, count of orders per person 1.36, days since last order 466.26
cluster 5, average spend $58.08, count of orders per person 3.92, days since last order 427.36

Generating the marketing campaign using the Gemini model

You have created a K-means model, assigned each customer to a cluster from the model, and generated summary statistics for each cluster. In this step, you will prompt the Gemini model to create a marketing campaign, with customer insights and next steps for the marketing team.

For each cluster/segment defined by the K-means model, we'll generate three items that can be used by our marketing team:

Title
Persona
Next marketing step

Add another code cell at the end of the notebook.
Paste the following code snippet into the cell:
model = GenerativeModel("gemini-1.0-pro")

prompt = f"""
You're a creative brand strategist, given the following clusters, come up with \
creative brand persona, a catchy title, and next marketing action, \
explained step by step. Identify the cluster number, the title of the person, a persona for them and the next marketing step.

Clusters:
{cluster_info}

For each Cluster:
* Title:
* Persona:
* Next marketing step:
"""

responses = model.generate_content(
    prompt,
    generation_config={
        "temperature": 0.1,
        "max_output_tokens": 800,
        "top_p": 1.0,
        "top_k": 40,
    }
)

print(responses.text)
Run the cell.

You should see each cluster with the title, persona and next steps.
**Cluster 1:**
* **Title:** The Lapsed Loyalists
* **Persona:** These customers have made a purchase in the past but haven't returned for an extended period. They likely had a positive experience but haven't been engaged recently.
* **Next Marketing Step:**
  1. **Re-engagement campaign:** Send personalized emails or targeted ads reminding them of their previous purchase and highlighting new products or promotions that might interest them.
  2. **Offer exclusive discounts or incentives:** Motivate them to return with special offers or loyalty rewards.
  3. **Personalized product recommendations:** Leverage purchase history and browsing behavior to suggest relevant products they might be interested in.

**Cluster 2:**
* **Title:** The Occasional Treaters
* **Persona:** These customers make infrequent purchases but spend more when they do. They likely view the brand as a premium option for special occasions.
* **Next Marketing Step:**
  1. **Highlight exclusivity and premium value:** Emphasize the unique features and benefits of your products to justify the higher price point.
  2. **Offer limited-time promotions or bundles:** Encourage larger purchases with special deals on high-value items or curated product sets.
  3. **Create a sense of urgency and scarcity:** Promote limited-edition products or flash sales to encourage immediate action.

**Cluster 3:**
* **Title:** The One-and-Done Buyers
* **Persona:** These customers have only made a single purchase and haven't returned. They might have had a neutral experience or haven't found a reason to come back.
* **Next Marketing Step:**
  1. **Gather feedback:** Send post-purchase surveys to understand their experience and identify areas for improvement.
  2. **Offer personalized recommendations:** Based on their initial purchase, suggest complementary products or accessories to encourage further engagement.
  3. **Showcase customer testimonials and social proof:** Highlight positive reviews and user-generated content to build trust and encourage repeat purchases.

**Cluster 4:**
* **Title:** The Big Spenders
* **Persona:** These customers spend significantly more than others and are likely your most loyal and valuable segment. They appreciate high-quality products and personalized experiences.
* **Next Marketing Step:**
  1. **Develop a VIP program:** Offer exclusive benefits, early access to new products, and personalized customer service to show appreciation and encourage continued loyalty.
  2. **Personalized communication and offers:** Tailor your marketing messages and promotions to their specific interests and purchase history.
  3. **Host exclusive events or experiences:** Create opportunities for them to connect with the brand and other high-value customers.

**Cluster 5:**
* **Title:** The Price-Conscious Shoppers
* **Persona:** These customers are primarily driven by price and make infrequent, low-value purchases. They are likely to compare prices and seek the best deals.
* **Next Marketing Step:**
  1. **Promote competitive pricing and value-driven offers:** Highlight your competitive prices and bundle deals to attract price-sensitive customers.
  2. **Offer free shipping or other incentives:** Reduce purchase barriers by offering free shipping or other attractive incentives.
  3. **Focus on product benefits and value for money:** Emphasize the quality and functionality of your products to justify the price point.
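
If you want to hand the generated campaign to another tool, you may need to turn the response text into structured records. The sketch below parses a shortened response that uses the same markdown markers as the sample output. The function parse_clusters is hypothetical, and because the model's formatting is not guaranteed, treat this as a starting point and check that every field was found.

```python
import re

# A shortened response using the same markers as the sample output in this lab.
response_text = """**Cluster 1:**
* **Title:** The Lapsed Loyalists
* **Persona:** These customers have made a purchase in the past but haven't returned for an extended period.
**Cluster 2:**
* **Title:** The Occasional Treaters
* **Persona:** These customers make infrequent purchases but spend more when they do.
"""


def parse_clusters(text):
    """Extract {cluster number: {"title": ..., "persona": ...}} from the response text."""
    clusters = {}
    # Split on the "**Cluster N:**" headers, keeping the captured numbers.
    parts = re.split(r"\*\*Cluster (\d+):\*\*", text)
    for number, body in zip(parts[1::2], parts[2::2]):
        title = re.search(r"\*\*Title:\*\*\s*(.+)", body)
        persona = re.search(r"\*\*Persona:\*\*\s*(.+)", body)
        clusters[int(number)] = {
            "title": title.group(1).strip() if title else None,
            "persona": persona.group(1).strip() if persona else None,
        }
    return clusters


for number, fields in parse_clusters(response_text).items():
    print(number, fields["title"])
```

Fields that cannot be found are returned as None rather than raising an error, so a downstream check can flag incomplete responses.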

To verify the objective, click Check my progress. Generate the marketing campaign using the Gemini model.

Task 6. Clean up project resources (Optional)

In this lab, you created resources within the Google Cloud console. In a production environment, you would need to remove these resources from your account, because they are no longer needed once the insights from the model have been collected. To remove the resources from your account and avoid further charges for their use, you have two options:

Remove the project (see cautions below)
Remove the individual resources

Clean up resources by removing the project

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the Google Cloud project that you created for this tutorial.

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

In the Google Cloud console, go to the IAM & Admin > Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut Down Anyway to delete the project.

Clean up resources by deleting individual resources

To avoid incurring charges, you can delete the table and model used in this lab by running the following code in a new code cell within the notebook:

# Delete customer_stats table
client.delete_table(f"{project_id}.{dataset_name}.{table_name}", not_found_ok=True)
print(f"Deleted table: {project_id}.{dataset_name}.{table_name}")

# Delete K-means model
client.delete_model(f"{project_id}.{dataset_name}.{model_name}", not_found_ok=True)
print(f"Deleted model: {project_id}.{dataset_name}.{model_name}")

After you run the cell, you can refresh the contents of your project in BigQuery Studio to confirm that the table and the model have been deleted.

Congratulations!

In this lab, you learned how to:

Use Colab Enterprise Python notebooks inside BigQuery Studio.
Use BigQuery DataFrames inside of BigQuery Studio.
Use Gemini to generate code from natural language prompts.
Build a K-means clustering model.
Generate a visualization of the clusters.
Use the Gemini Pro model to develop the next steps for a marketing campaign.

Optional reading

You have now identified, categorized, and helped develop customers for Cymbal superstore using Gemini, Vertex AI, and BigQuery. To learn more about Gemini, refer to the links below:

End your lab

When you have completed your lab, click End Lab. Qwiklabs removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

1 star = Very dissatisfied
2 stars = Dissatisfied
3 stars = Neutral
4 stars = Satisfied
5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Manual Last Updated August 7, 2024

Lab Last Tested August 7, 2024

Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.