
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Automated Speech Recognition (ASR), also known as machine transcription or Speech-to-Text, uses machine learning to turn audio containing speech into text. ASR has many applications, from subtitling to virtual assistants, IVRs, dictation, and more. However, machine learning systems are rarely 100% accurate, and ASR is no exception. If you plan to rely on ASR for critical systems, it's very important to measure its accuracy, or overall quality, to understand how it will perform in your broader system.
Once you measure your accuracy, it's possible to tune the system to provide even greater accuracy for your specific situation. In Google's Cloud Speech-to-Text API, accuracy tuning is done using our Speech Adaptation API.
With this lab and Google’s tools you can learn how to get the best possible quality from Speech-to-Text systems on your specific data.
Speech accuracy can be measured in a variety of ways, and it may be useful for you to use multiple metrics depending on your needs. However, the industry standard method for comparison is word error rate, often abbreviated as WER. Word error rate measures the percentage of words that are transcribed incorrectly across the entire test set, so a lower WER means the system is more accurate. Google's open source simple_wer tool lets you measure quality on your own audio.
You may also see the term ground truth used in the context of ASR accuracy. Ground truth is the 100% accurate (typically human) transcription that you compare against to measure accuracy.
Word error rate is the combination of the three types of transcription errors that can occur:
- Insertions (I): words present in the hypothesis transcript that are not in the ground truth
- Substitutions (S): words that were transcribed as a different word than the one in the ground truth
- Deletions (D): words in the ground truth that are missing from the hypothesis transcript
You add up the number of each of these errors and divide by the total number of words (N) in the ground truth transcript to find the word error rate (WER): WER = (I + S + D) / N. Because insertions are counted, the WER can be greater than 100% in situations with very low accuracy.
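For example (the numbers here are made up purely for illustration): if the ground truth transcript contains 100 words and the machine transcript has 6 substitutions, 3 deletions, and 1 insertion, then WER = (6 + 3 + 1) / 100 = 10%.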
You may also see other metrics used at times, and we will demonstrate some later in the lab. These metrics can be useful for tracking things like readability or measuring how many of your most important terms were transcribed correctly.
Now that you know how to talk about accuracy and what to measure, these are the few simple steps you need to follow to get started determining accuracy on your own audio.
Step 1: Gather your audio
You should gather a representative sample of the audio files for which you wish to measure quality. This sample should be random and as close to the target environment as possible. For example, if you want to transcribe conversations from a call center to aid in quality assurance, you should randomly select a few actual calls recorded on the same equipment that your production audio will come through, not recordings made on your cell phone or computer microphone.
You will need at least 30 minutes of audio to get a statistically significant accuracy metric. We recommend using between 30 minutes and 3 hours of audio. In this lab, the audio is provided for you.
Step 2: Get ground truth transcriptions
Next you need to get accurate transcriptions of the audio. This usually involves a single- or double-pass human transcription of the target audio. Your goal is to have a 100% accurate transcription to measure the automated results against.
When doing this, it's important to match the transcription conventions of your target ASR system as closely as possible. For example, ensure that punctuation, numbers, and capitalization are consistent. In this lab, the ground truth is provided for you.
Step 3: Get the machine transcriptions
Send the audio to the Google Speech-to-Text API to get your hypothesis transcription. You can do this using one of Google Cloud's many client libraries or command line tools. In this lab, the code to do this is provided for you.
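Outside of the lab, such a request might look like the following minimal sketch using the google-cloud-speech Python client (the bucket URI, encoding, and sample rate below are placeholder assumptions; match them to how your audio was actually recorded):

```python
from google.cloud import speech

client = speech.SpeechClient()

# Placeholder values: point the URI at your own audio and match the
# encoding/sample rate to your recordings.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
audio = speech.RecognitionAudio(uri="gs://your-bucket/your-audio.wav")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result holds alternatives ranked by confidence; take the top one.
    print(result.alternatives[0].transcript)
```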
Step 4: Compute the WER
Now you must count the insertions, substitutions, and deletions, as well as the total number of words. You can do this using the ground truth from step 2 and the machine transcription from step 3. Google has created and open sourced the simple_wer code, which normalizes the output and calculates the WER for you. In this lab, this code is provided for you.
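The provided tooling handles this for you, but to illustrate the underlying calculation, here is a minimal word-level edit distance sketch (this is not Google's simple_wer tool, and it skips the output normalization a real comparison needs):

```python
def wer(ground_truth: str, hypothesis: str) -> float:
    """Word error rate via word-level minimum edit distance."""
    ref = ground_truth.split()
    hyp = hypothesis.split()

    # d[i][j] = minimum number of edits (insertions, deletions,
    # substitutions) to turn the first i reference words into the
    # first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```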
Now you will try this out for real. For this lab, a focused dataset has been curated from public domain books and audio from the Librispeech project. All the code you need to measure the accuracy of the Google Cloud Speech-to-Text API on this dataset is provided.
In the following Notebook you will learn how to set up and use this code. Once you have launched the notebook, follow the instructions inside to compute the WER on the provided dataset.
In the Google Cloud console, from the Navigation menu, select Vertex AI.
Click Enable All Recommended APIs.
On the left-hand side, click Workbench.
At the top of the Workbench page, ensure you are in the Instances view.
Click Create New.
Configure the Instance:
Your terminal window will open in a new tab. You can now run commands in the terminal to interact with your Workbench instance.
Perform the following tasks to play audio files in an incognito window:
It should be something like:
You can now continue to the notebook.
Click Check my progress to verify the objective.
Now that you have measured the accuracy of Google Cloud Speech-to-Text on the provided dataset, it's time to start thinking about how you can improve on the results you already have.
There are many ways to give the ASR system more signal to improve the accuracy and lower the WER. When evaluating quality, look at where the system makes errors, and think about whether additional context could give the system more signal and improve accuracy.
If you think you can provide this type of context and get an improvement, you can do it with the Speech Adaptation API available in the Cloud Speech-to-Text API.
Google Cloud Speech-to-Text has tools for providing contextual information that can help users increase accuracy on their data. The Speech Adaptation API allows users to pass phrases and associated weights directly to the speech API.
These phrases can be changed with every request, allowing for quick iteration as well as on-the-fly adaptation. All you do is include the terms in the request itself as part of the recognition config.
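For example, here is a minimal sketch using the google-cloud-speech Python client (the phrases, boost value, and bucket URI are placeholders for illustration):

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    language_code="en-US",
    # speech_contexts carries the biasing phrases; the optional boost
    # value increases how strongly the recognizer favors them.
    speech_contexts=[
        speech.SpeechContext(
            phrases=["word error rate", "Librispeech"],
            boost=10.0,
        )
    ],
)
# Encoding and sample rate are omitted here on the assumption that the
# file is a WAV whose header already supplies them.
audio = speech.RecognitionAudio(uri="gs://your-bucket/your-audio.wav")
response = client.recognize(config=config, audio=audio)
```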
This type of biasing is advantageous over methods such as custom language models or complex grammars. It is easier to set up, doesn't require any special training or deployment, and is included for free in your usage of the Cloud Speech-to-Text API.
As you learned above, Cloud Speech-to-Text makes it very easy to bias the system. However, you still have to figure out the right terms to send to the API. Thinking back to the earlier considerations about types of quality improvements can help you decide which terms to include when biasing.
Now that you have learned how to approach biasing, it's time to put it into practice on the dataset from before. However, performance on the original dataset was already quite good, so to make the problem a little harder for the ASR system, noise has been added to the audio files that were provided in the previous notebook.
You have now successfully measured and improved the accuracy of the Google Cloud Speech-to-Text API on a real dataset. You have learned how to talk about and compare accuracy metrics, and how to approach measuring accuracy. You have also set up Python tools for performing automated speech recognition and measuring accuracy.
You are now ready to try these tools on your own data and put what you have learned into practice.
Manual Last Updated January 27, 2025
Manual Last Tested January 23, 2025
Copyright 2025 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.