![]() |
VOOZH | about |
This article provides a comprehensive guide to implementing Speech Emotion Recognition (SER) using Transfer Learning, leveraging tools like Librosa for audio feature extraction and VGG16 for robust classification.
Prerequisites: VGG-16
Speech emotion recognition (SER) focuses on analyzing the pitch, tone, loudness, and frequency of sound to identify emotions in speech. This technique plays a crucial role in industries like entertainment, customer service, robotics, and security by providing insights into customer sentiment and human interactions.
Transfer Learning is a powerful technique where a pre-trained model is fine-tuned and reused for new datasets. It eliminates the need to train a model from scratch, significantly reducing training time and improving efficiency.
In this project, we use Python due to its robust library ecosystem. Speech data contains features such as pitch, loudness, and frequency that need to be accurately captured for analysis.
For this task, we will utilize the Toronto Emotional Speech Set (TESS), which includes 2,800 samples of seven emotions recorded by a 64-year-old woman and a young woman in her 20s.
The emotions are:
You can download the dataset from here.
Import the necessary libraries for data preprocessing, model creation, and training. Key libraries include:
The EmotionDataset class loads audio files, preprocesses them into Mel-Spectrograms, and prepares data for model training.
Use a pre-trained VGG16 model for transfer learning. Freeze the existing layers and replace the final layer with a custom classification layer for emotion recognition.
Output:
Epoch [1/10], Training Loss: 3.5698, Training Accuracy: 0.3829
Epoch [1/10], Validation Loss: 0.6287, Validation Accuracy: 0.7867
Epoch [2/10], Training Loss: 1.6390, Training Accuracy: 0.4850
Epoch [2/10], Validation Loss: 0.2506, Validation Accuracy: 0.8433
.
.
.
Epoch [10/10], Training Loss: 0.3281, Training Accuracy: 0.7450
Epoch [10/10], Validation Loss: 0.0285, Validation Accuracy: 0.9493
Final Training Accuracy: 0.7450
Final Validation Accuracy: 0.9493
Use the trained model to predict the emotion of a new audio file.
Output:
Predicted Emotion: fearSpeech Emotion Analysis is a useful technique as it helps to analyze the emotions of a person via speech. Combining the extraction power of Librosa and VGG 16 will be definitely useful in many industries as it will leverage the sentiment analysis.