
Build an End-to-End Data Capture Pipeline using Document AI


Lab · 1 hour · 1 credit · Introductory

Note: This lab may incorporate AI tools to support your learning.

GSP927


Overview

The Document AI API is a document understanding solution that takes unstructured data, such as documents and emails, and makes the data easier to understand, analyze, and consume.

In this lab, you'll build a document processing pipeline to automatically analyze documents uploaded to Cloud Storage. The pipeline uses a Cloud Run function with a Document AI form processor to extract data and store it in BigQuery. If the form includes address fields, the address data is sent to a Pub/Sub topic. This triggers a second Cloud Run function, which uses the Geocoding API to add coordinates and writes the results to BigQuery.

This simple pipeline uses a general form processor to detect basic form data, like labeled address fields. For more complex documents, Document AI offers specialized parsers (beyond the scope of this lab) that extract detailed information even without explicit labels. For instance, the Invoice parser can identify address and supplier details from an unlabeled invoice by understanding common invoice layouts.

The overall architecture that you will create looks like the following:

The Document AI Asynchronous Solution Architecture

  1. Forms with address data are uploaded to Cloud Storage.
  2. The upload triggers a Cloud Run function call to process the forms.
  3. The Cloud Run function calls Document AI.
  4. The Document AI JSON output is saved back to Cloud Storage.
  5. The Cloud Run function writes the form data to BigQuery.
  6. The Cloud Run function sends the addresses to a Pub/Sub topic.
  7. The Pub/Sub message triggers a Cloud Run function for geocode processing.
  8. The Cloud Run function calls the Geocoding API.
  9. The Cloud Run function writes the geocoding data to BigQuery.

This example architecture uses Cloud Run functions to implement a simple pipeline, but Cloud Run functions are not recommended for production environments because Document AI API calls can exceed the timeouts that Cloud Run functions support. Cloud Tasks is recommended for a more robust serverless solution.
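To make the middle of the pipeline (steps 3 and 4) concrete, here is a minimal sketch of a Document AI call using the Python client library. It uses the synchronous API for brevity, whereas the lab's functions use the asynchronous batch API; the project, location, processor ID, and file name below are placeholders, not lab values.

from google.cloud import documentai_v1 as documentai

client = documentai.DocumentProcessorServiceClient()
# Placeholder values; in the lab these come from environment variables.
name = client.processor_path("your-project-id", "us", "your-processor-id")

with open("form.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)
print(result.document.text[:200])  # OCR'd text; form fields live on result.document.pages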

Objectives

In this lab, you learn how to:

  • Enable the Document AI API.
  • Deploy Cloud Run functions that use the Document AI, BigQuery, Cloud Storage, and Pub/Sub APIs.

You'll configure a Cloud Run function to:

  • Trigger when documents are uploaded to Cloud Storage.
  • Use the Document AI client library for Python.
  • Trigger when a Pub/Sub message is created.

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents conflicts between your personal account and the Student account, which could cause extra charges to be incurred on your personal account.
  • Time to complete the lab. Remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

  1. Click Activate Cloud Shell Activate Cloud Shell icon at the top of the Google Cloud console.

When you are connected, you are already authenticated, and the project is set to your Project ID. The output contains a line that declares the Project ID for this session:

Your Cloud Platform project in this session is set to {{{project_0.project_id | "PROJECT_ID"}}}

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  2. (Optional) You can list the active account name with this command:
gcloud auth list
  3. Click Authorize.

Output:

ACTIVE: *
ACCOUNT: {{{user_0.username | "ACCOUNT"}}}

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  4. (Optional) You can list the project ID with this command:
gcloud config list project

Output:

[core]
project = {{{project_0.project_id | "PROJECT_ID"}}}

Note: For full documentation of gcloud, in Google Cloud, refer to the gcloud CLI overview guide.

Task 1. Enable APIs and create an API key

You must enable the APIs for Document AI, Cloud Run functions, Cloud Build, and Geocoding for this lab, and then create the API key that is required by the geocoding Cloud Run function.

  1. Click Activate Cloud Shell Activate Cloud Shell icon at the top of the Google Cloud console.

  2. In Cloud Shell, enter the following commands to enable the APIs required by the lab:

gcloud services enable documentai.googleapis.com
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable geocoding-backend.googleapis.com
  3. In the console, in the Navigation menu (Navigation menu icon), click APIs & services > Credentials.

  4. Select Create credentials, then select API key from the dropdown menu.

The API key created dialog box displays your newly created key. An API key is a long string containing upper and lower case letters, numbers, and dashes. For example, a4db08b757294ea94c08f2df493465a1.

  5. Click Edit API key in the dialog box.

  6. Select Restrict key in the API restrictions section to add API restrictions for your new API key.

  7. Click in the filter box and type Geocoding API.

  8. Select Geocoding API and click OK.

  9. Click the Save button.

Note: If you cannot find the Geocoding API in the Restrict key dropdown list, refresh the page to refresh the list of available APIs. Check that all the required APIs have been enabled.

Task 2. Download the lab source code

In this task, you copy the source files into your Cloud Shell. These files include the source code for the Cloud Run functions and the schemas for the BigQuery tables that you will create in the lab.

  1. In Cloud Shell, enter the following command to download the source code for this lab:
mkdir ./documentai-pipeline-demo
gcloud storage cp -r \
  gs://spls/gsp927/documentai-pipeline-demo/* \
  ~/documentai-pipeline-demo/

Task 3. Create a form processor

Create an instance of the generic form processor in the Document AI Platform using the Document AI Form Parser specialized parser. The generic form processor will process any type of document and extract all the text content it can identify in the document. It is not limited to printed text: it can handle handwritten text and text in any orientation, supports a number of languages, and understands how form data elements relate to each other, so you can extract key:value pairs for form fields that have text labels.

  1. In the Google Cloud Console, in the search bar, type Document AI and click the product page result.

  2. Click Explore Processors and click Create Processor for Form Parser.

  3. Specify the processor name as form-processor and select the region US (United States) from the list.

  4. Click Create to create your processor.

You will configure a Cloud Run function later in this lab with the processor ID and location of this processor so that the Cloud Run function will use this specific processor to process sample invoices.
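To see what a form processor returns, here is a hedged Python sketch (the function names are mine, not the lab's code) that walks the form fields on each page and resolves their text anchors against the full document text. Each detected field stores offsets into Document.text rather than the text itself.

from google.cloud import documentai_v1 as documentai

def layout_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    # A text anchor stores offsets into Document.text rather than the text itself.
    return "".join(
        text[int(seg.start_index): int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )

def print_form_fields(document: documentai.Document) -> None:
    # Walk every detected form field and print its key:value pair.
    for page in document.pages:
        for field in page.form_fields:
            key = layout_text(field.field_name, document.text).strip()
            value = layout_text(field.field_value, document.text).strip()
            print(f"{key}: {value}")

Passing the document returned by the process_document sketch in the Overview would print the labeled key:value pairs the processor detected.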

Task 4. Create Cloud Storage buckets and a BigQuery dataset

In this section, you will prepare your environment by creating the Google Cloud resources that are required for your document processing pipeline.

Create input, output, and archive Cloud Storage buckets

Create input, output, and archive Cloud Storage buckets for your document processing pipeline.

  1. In Cloud Shell, enter the following command to create the Cloud Storage buckets for the lab:
export PROJECT_ID=$(gcloud config get-value core/project)
export BUCKET_LOCATION="{{{my_primary_project.default_region|REGION}}}"
gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
  gs://${PROJECT_ID}-input-invoices
gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
  gs://${PROJECT_ID}-output-invoices
gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
  gs://${PROJECT_ID}-archived-invoices

Create a BigQuery dataset and tables

Create a BigQuery dataset and the two output tables required for your data processing pipeline.

  1. In Cloud Shell, enter the following command to create the BigQuery tables for the lab:
bq --location="US" mk -d \
  --description "Form Parser Results" \
  ${PROJECT_ID}:invoice_parser_results
cd ~/documentai-pipeline-demo/scripts/table-schema/
bq mk --table \
  invoice_parser_results.doc_ai_extracted_entities \
  doc_ai_extracted_entities.json
bq mk --table \
  invoice_parser_results.geocode_details \
  geocode_details.json

You can navigate to BigQuery in the Cloud Console and inspect the schemas for the tables in the invoice_parser_results dataset using the BigQuery SQL workspace.
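If you prefer the Python client to the bq CLI, an equivalent sketch looks like the following. The schema here is a hypothetical, trimmed-down version for illustration; the lab's real schemas are in the table-schema/*.json files downloaded in Task 2.

from google.cloud import bigquery

client = bigquery.Client()
client.create_dataset("invoice_parser_results", exists_ok=True)

# Hypothetical minimal schema; the lab's full schema is in
# doc_ai_extracted_entities.json.
schema = [
    bigquery.SchemaField("input_file_name", "STRING"),
    bigquery.SchemaField("address", "STRING"),
    bigquery.SchemaField("supplier", "STRING"),
    bigquery.SchemaField("total", "STRING"),
]
table_id = f"{client.project}.invoice_parser_results.doc_ai_extracted_entities"
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)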

Create a Pub/Sub topic

Initialize the Pub/Sub topic used to trigger the Geocoding API data enrichment operations in the processing pipeline.

  1. In Cloud Shell, enter the following command to create the Pub/Sub topic for the lab:
export GEO_CODE_REQUEST_PUBSUB_TOPIC=geocode_request
gcloud pubsub topics \
  create ${GEO_CODE_REQUEST_PUBSUB_TOPIC}

Check that the BigQuery dataset, Cloud Storage buckets, and Pub/Sub topic have been created.
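This topic is the hand-off point between the two Cloud Run functions you create in Task 5. As a hedged sketch of the kind of publish call the pipeline makes (the address and the attribute name are placeholders, not the lab's exact payload):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "geocode_request")

# Publish an extracted address; Pub/Sub payloads are bytes.
future = publisher.publish(
    topic_path,
    data="1600 Amphitheatre Pkwy, Mountain View, CA".encode("utf-8"),
    input_file_name="invoice-001.pdf",  # hypothetical attribute
)
print(future.result())  # message ID once the publish succeeds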

Task 5. Create Cloud Run functions

Create the two Cloud Run functions that your data processing pipeline uses to process invoices uploaded to Cloud Storage. These functions use the Document AI API to extract form data from the raw documents, then use the Geocoding API to retrieve geolocation data for the address information extracted from the documents.

You can examine the source code for the two Cloud Run functions using the Code Editor or any other editor of your choice. The Cloud Run functions are stored in the following folders in Cloud Shell:

  • Process Invoices - scripts/cloud-functions/process-invoices
  • Geocode Addresses - scripts/cloud-functions/geocode-addresses

The main Cloud Run function, process-invoices, is triggered when files are uploaded to the input files storage bucket you created earlier.

The function folder scripts/cloud-functions/process-invoices contains the two files that are used to create the process-invoices Cloud Run function.

The requirements.txt file specifies the Python libraries required by the function. This includes the Document AI client library as well as the other Google Cloud libraries required by the Python code to read the files from Cloud Storage, save data to BigQuery, and write messages to Pub/Sub that will trigger the remaining functions in the solution pipeline.

The main.py Python file contains the Cloud Run function code that creates the Document AI, BigQuery, and Pub/Sub API clients and the following internal functions to process the documents:

  • write_to_bq - Writes a dictionary object to the BigQuery table (see the sketch after this list). Note that you must ensure the schema is valid before calling this function.
  • get_text - Maps form name and value text anchors to the scanned text in the document. This allows the function to identify specific form elements, such as the supplier name and address, and extract the relevant value. A specialized Document AI processor provides that contextual information directly in the entities property.
  • process_invoice - Uses the asynchronous Document AI client API to read and process files from Cloud Storage as follows:
    • Creates an asynchronous request to process the file(s) that triggered the Cloud Run function call.
    • Processes form data to extract invoice fields, storing only the specific fields that are part of the predefined schema in a dictionary.
    • Publishes Pub/Sub messages to trigger the geocoding Cloud Run function using address form data extracted from the document.
    • Writes form data to a BigQuery table.
    • Deletes the intermediate output files created by the asynchronous Document AI API call.
    • Copies input files to the archive bucket.
    • Deletes processed input files.
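As a hedged illustration of the write_to_bq step above (the lab's main.py is the source of truth), a minimal version using a BigQuery load job might look like this; the Task 7 logs show the real function finishing with a LoadJob as well:

from google.cloud import bigquery

bq_client = bigquery.Client()

def write_to_bq(dataset_name: str, table_name: str, entities: dict) -> None:
    # Load one row; the dictionary keys must already match the table schema.
    table_id = f"{bq_client.project}.{dataset_name}.{table_name}"
    job = bq_client.load_table_from_json([entities], table_id)
    job.result()  # blocks until the load job completes and surfaces any errors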

The process-invoices Cloud Run function only processes form data that has been detected with the following form field names:

  • input_file_name
  • address
  • supplier
  • invoice_number
  • purchase_order
  • date
  • due_date
  • subtotal
  • tax
  • total

The other Cloud Run function, geocode-addresses, is triggered when a new message arrives on the Pub/Sub topic; it extracts its parameter data from the Pub/Sub message.
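As a hedged sketch of how a gen2 Pub/Sub-triggered function typically unpacks such a message (illustrative only, not the lab's actual geocode-addresses code):

import base64
import functions_framework

@functions_framework.cloud_event
def process_address(cloud_event):
    # Pub/Sub delivers the payload base64-encoded inside the CloudEvent body.
    message = cloud_event.data["message"]
    address = base64.b64decode(message["data"]).decode("utf-8")
    attributes = message.get("attributes") or {}
    # ... call the Geocoding API and write the enriched row to BigQuery ...
    print(f"Geocoding request for: {address} (attributes: {attributes})")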

Create the Cloud Run function to process documents uploaded to Cloud Storage

Create a Cloud Run function that uses a Document AI form processor to parse form documents that have been uploaded to a Cloud Storage bucket.

  1. Run the command to get the email address of the project's Cloud Storage service agent:
gcloud storage service-agent --project=$PROJECT_ID
  2. Run the following commands to grant the required permissions to the Cloud Storage service account:
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
gcloud iam service-accounts create "service-$PROJECT_NUMBER" \
  --display-name "Cloud Storage Service Account" || true
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:service-$PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com" \
  --role="roles/pubsub.publisher"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:service-$PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountTokenCreator"

Note: If the Cloud Storage service account already exists, you can ignore the error.
  3. Create the Invoice Processor Cloud Run function:
cd ~/documentai-pipeline-demo/scripts
export CLOUD_FUNCTION_LOCATION="{{{my_primary_project.default_region|REGION}}}"
gcloud functions deploy process-invoices \
  --gen2 \
  --region=${CLOUD_FUNCTION_LOCATION} \
  --entry-point=process_invoice \
  --runtime=python39 \
  --source=cloud-functions/process-invoices \
  --timeout=400 \
  --env-vars-file=cloud-functions/process-invoices/.env.yaml \
  --trigger-resource=gs://${PROJECT_ID}-input-invoices \
  --trigger-event=google.storage.object.finalize

Note: If the command fails with a permission error, wait a minute and try it again.

Create the Cloud Run function to lookup geocode data from an address

Create the Cloud Run function that accepts address data from a Pub/Sub message and uses the Geocoding API to precisely locate the address.

  1. Create the Geocoding Cloud Run function:
cd ~/documentai-pipeline-demo/scripts
gcloud functions deploy geocode-addresses \
  --gen2 \
  --region=${CLOUD_FUNCTION_LOCATION} \
  --entry-point=process_address \
  --runtime=python39 \
  --source=cloud-functions/geocode-addresses \
  --timeout=60 \
  --env-vars-file=cloud-functions/geocode-addresses/.env.yaml \
  --trigger-topic=${GEO_CODE_REQUEST_PUBSUB_TOPIC}
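For reference, here is a hedged sketch of the geocoding lookup such a function might perform. The endpoint and response shape belong to the public Geocoding API; the function name and the use of the requests library are assumptions, not the lab's actual code.

import os
import requests  # assumed to be listed in the function's requirements.txt

def geocode(address: str) -> tuple[float, float, str]:
    # Call the Geocoding API with the restricted key created in Task 1,
    # read from the API_key environment variable set in Task 6.
    response = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": os.environ["API_key"]},
        timeout=10,
    )
    response.raise_for_status()
    result = response.json()["results"][0]
    location = result["geometry"]["location"]
    return location["lat"], location["lng"], result["formatted_address"]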

Task 6. Edit environment variables for Cloud Run functions

In this task, you finalize the configuration of the Cloud Run functions by editing the environment variables for each function in the Cloud Console to reflect your lab-specific parameters.

Edit environment variables for the process-invoices Cloud Run function

Set the Cloud Run function environment variables for the process-invoices function.

  1. In the Cloud Console, in the search bar, type Cloud Run functions and click the product page result.
  2. Click the Cloud Run function process-invoices to open its management page.
  3. Click Edit.
  4. Click Runtime, build, connections and security settings to expand that section.
  5. Under Runtime environment variables, add the GCP_PROJECT variable and set its value to your Project ID.
  6. Under Runtime environment variables, update the PROCESSOR_ID value to match the ID of the form processor you created earlier.
  7. Under Runtime environment variables, update the PARSER_LOCATION value to match the region of the form processor you created earlier. This will be us or eu. This parameter must be lowercase.
  8. Click Next, select .env.yaml, and then update the PROCESSOR_ID, PARSER_LOCATION, and GCP_PROJECT values again for your processor (a sample of this file is sketched at the end of this subsection).

Environment variables for the process-invoices Cloud Run function

  9. Click Deploy.
Deploy the Process Invoices Cloud Run function
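For reference, the .env.yaml for process-invoices might look like the following; all values are placeholders, so use your own Project ID, processor ID, and processor region:

GCP_PROJECT: your-project-id      # your lab Project ID
PROCESSOR_ID: 0123456789abcdef    # hypothetical; your form processor's ID
PARSER_LOCATION: us               # lowercase us or eu, matching the processor region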

Edit environment variables for the geocode-addresses Cloud Run function

Set the Cloud Run function environment variables for the geocode-addresses data enrichment function.

  1. Click the Cloud Run function geocode-addresses to open its management page.
  2. Click Edit.
  3. Click Runtime, build, connections and security settings to expand that section.
  4. Under Runtime environment variables, update the API_key value to match the API key you created in Task 1.
  5. Click Next, select .env.yaml, and then update the API_key value to match the API key you set in the previous step.
  6. Click Deploy.
Deploy the Geocode Addresses Cloud Run function

Task 7. Test and validate the end-to-end solution

Upload test data to Cloud Storage and monitor the progress of the pipeline as the documents are processed and the extracted data is enhanced.

  1. In Cloud Shell, enter the following command to upload sample forms to the Cloud Storage bucket that will trigger the process-invoices Cloud Run function:
export PROJECT_ID=$(gcloud config get-value core/project)
gsutil cp gs://spls/gsp927/documentai-pipeline-demo/sample-files/* \
  gs://${PROJECT_ID}-input-invoices/
  2. In the Cloud Console, in the search bar, type Cloud Run functions and click the product page result.
  3. Click the Cloud Run function process-invoices to open its management page.
  4. Click Logs.

You will see events related to the creation of the function and the updates made to configure the environment variables, followed by events showing details about the file being processed and the data detected by Document AI.

Document AI Cloud Run function Events in the Logs section of the Function details page

Watch the events until you see a final event indicating that the function execution finished with a LoadJob. If errors are reported, double-check that the parameters set in the .env.yaml file in the previous section are correct. In particular, make sure the processor ID, location, and Project ID are valid. The event list does not refresh automatically.

At the end of the processing, your BigQuery tables will be populated with the Document AI extracted entities as well as enriched data provided by the Geocoding API if the Document AI Processor has detected address data in the uploaded document.

  1. In the Cloud Console, on the Navigation menu (Navigation menu icon), click BigQuery.

  2. Expand your Project ID in the Explorer.

  3. Expand invoice_parser_results.

  4. Select doc_ai_extracted_entities and click Preview. You will see the form information extracted from the invoices by the form processor. You can see that the address information and the supplier name have been detected.

  5. Select geocode_details and click Preview. You will see the formatted address, latitude, and longitude for each processed invoice that contained address data Document AI was able to extract.

Check that the end-to-end pipeline has processed form and address data.
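You can also inspect the results from Cloud Shell with a hedged sketch like the one below; the column names in the query are assumptions, so check the table schema in the Explorer first.

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT input_file_name, formatted_address, latitude, longitude
    FROM `invoice_parser_results.geocode_details`
    LIMIT 10
"""
# Run the query and print each enriched row as a dictionary.
for row in client.query(query).result():
    print(dict(row))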

Congratulations!

You've successfully used the Document AI API and other Google Cloud services to build an end-to-end invoice processing pipeline. In this lab, you enabled the Document AI API, deployed Cloud Run functions that use the Document AI, BigQuery, Cloud Storage, and Pub/Sub APIs, and configured a Cloud Run function to trigger when documents are uploaded to Cloud Storage. You also configured a Cloud Run function to use the Document AI client library for Python and to trigger when a Pub/Sub message was created.

Next steps / Learn more

  • To read more about Document AI form processing, see the Document AI documentation.

Google Cloud training and certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated October 18, 2024

Lab Last Tested October 18, 2024

Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
