![]() |
VOOZH | about |
We use speeches to express ourselves. But sometimes it is crucial to store our speech in text format. One such technology is Automatic Speech Recognition which converts spoken language into written text. In this article, we will implement Automatic Speech Recognition using Connectionist Temporal Classification (CTC).
In various real-time applications like voice-activated virtual assistants and transcription services to dictation systems and language translation, ASR systems are everywhere being used to understand and transcribe spoken words accurately, making spoken information accessible in textual form. ASR involves a series of complex tasks, including feature extraction from audio signals, acoustic modelling to capture the characteristics of different sounds, language modelling to understand the context of words, and finally decoding to generate the final text output. The development of ASR systems has advanced significantly with the integration of deep learning techniques, leading to improved accuracy and performance.
The Connectionist Temporal Classification model is a neural network-based sequence-to-sequence model designed for tasks where the alignment between input and output sequences is not known a priori. In the context of ASR, CTC addresses the challenge of aligning variable-length audio sequences with their corresponding variable-length text transcriptions. The key idea behind CTC is to allow the model to learn the alignment during training without the need for explicit alignment information. CTC introduces a blank symbol and allows repetitions of both regular symbols and the blank symbol in the output sequence. This flexibility enables the model to handle sequences of different lengths and align them appropriately.
CTC model relies on some of the key-concepts which are listed below:
Before starting implementation, we need to install some required Python modules to our runtime.
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install jiwer
!pip install accelerateNow we will import all required Python modules like NumPy, transformers, Evaluate and Torch etc.
Now we will load minds dataset which is specially used to speech recognition model training. We will load a smaller subset of data as per our system capabilities. After that we will split it into training and testing sets in 70:30 ratio. Also, we will remove all useless columns like "english_transcription", "intent_class", "lang_id" and only keep the audio and transcription columns with a fixed sampling rate of 16000 (general sampling rate of model input) for all audio files.
For Automatic Speech Recognition model finetuning we need to define a processor from pre-trained models
We need to prepare our dataset in certain format which can be compatible with model's input format. We can't pass raw dataset as that will not in compatible format. We need to make all the letters in upper case of transcription so we will define a small mapping function (uppercase) to to this. After that, we will define another small function (prepare_dataset) which will prepare the dataset with fixed input lengths and sampling rate. You can lower the mapping rate (num_proc) for more accurate mapping but it will take more time to run.
In this article, we will employ CTC model for ASR but to do this we need to collect data as input feature and feature labelling in batches. Here we will utilize our pre-defined processor with padding techniques for both input and labels.
In Automatic Speech Recognition, the most common evaluation metrics is Word Error Rate or WER. This is basically a calculation of errors in words during generation of transcription. Less WER means better transcription. To calculate WER during our model training we will define a small function (compute_metrics) for it.
Now we will load a publicly available pre-trained CTC model for further fine-tuning. We will set the CTC loss reduction policy as mean for better loss ignorance.
Now we will train and fine-tune the pre-trained model with various hyper-parameters. The values for hyper-parameters are set for standard to low machine resources. If you have better machine resources, try to increase the values for better fine-tuning and less WER.
Output:
Step Training Loss Validation Loss Wer
1000 3.292600 3.293691 1.000000After finish running you can expand the model directory and copy the path of the model check points. This will be used in pipeline to use our finetuned model. See the below referral image for better understanding of which path needs to be used.
In our case, we will copy the path of 'checkpoint-1000'. Similarly, you need to use your own model's check point to test the model.
Output:
'I would like to set up a joint account with my partner'CTC model is widely used in various real-time applications which are listed below:
We can conclude that, finetuning an Automatic Speech Recognition model is a complex and time-consuming task but using CTC model we can perform this task comparatively faster.
Notebook: click here.