
URL: https://www.geeksforgeeks.org/nlp/speech-emotion-recognition-using-transfer-learning/




Speech Emotion Recognition using Transfer Learning

Last Updated : 23 Jul, 2025

This article provides a comprehensive guide to implementing Speech Emotion Recognition (SER) using Transfer Learning, leveraging tools like Librosa for audio feature extraction and VGG16 for robust classification.

Prerequisites: VGG-16

Need for Speech Emotion Recognition

Speech emotion recognition (SER) focuses on analyzing the pitch, tone, loudness, and frequency of sound to identify emotions in speech. This technique plays a crucial role in industries like entertainment, customer service, robotics, and security by providing insights into customer sentiment and human interactions.

Transfer Learning is a technique in which a model pre-trained on a large dataset is reused and fine-tuned for a new task. Here, VGG16 pre-trained on ImageNet is adapted to classify emotions. This avoids training a model from scratch, which significantly reduces training time and the amount of labelled data needed.

Why Use CNN Based Model for Speech Emotion Recognition?

  • Mel-Spectrograms as Images: Speech features are converted into visual representations, making CNNs ideal for processing.
  • Feature Extraction: Early convolutional layers capture local time-frequency patterns, and deeper layers combine them into global characteristics of the utterance.
  • Transfer Learning: Pre-trained models like VGG16 reduce training time and improve accuracy by leveraging existing knowledge.

Techniques and Tools

In this project, we use Python due to its robust library ecosystem. Speech data contains features such as pitch, loudness, and frequency that need to be accurately captured for analysis.

  • Librosa: A popular library for audio analysis. It splits the audio into short frames, applies a bank of Mel-scale filters, and analyzes the frequencies in each frame. It can produce both Mel-Frequency Cepstral Coefficients (MFCC) and Mel-Spectrograms; this project uses Mel-Spectrograms, which are treated as images.
  • NumPy: Used to store feature values in arrays.
  • PyTorch: Chosen for implementing transfer learning due to its ease of debugging and flexibility.
  • VGG16: A pre-trained Convolutional Neural Network (CNN) that is fine-tuned for emotion classification.

For this task, we will utilize the Toronto Emotional Speech Set (TESS), which includes 2,800 samples of seven emotions recorded by a 64-year-old woman and a young woman in her 20s.

The emotions are:

  • Anger
  • Disgust
  • Fear
  • Happiness
  • Pleasant Surprise
  • Sadness
  • Neutral

The TESS dataset is publicly available for download.

Step 1: Import Required Libraries

Import the necessary libraries for data preprocessing, model creation, and training. Key libraries include:

  • librosa: For audio feature extraction.
  • torch and torchvision: For building and training the neural network.
  • numpy: For handling numerical data.
  • os: For file path manipulations.

Step 2: Define the Custom Dataset Class

The EmotionDataset class loads audio files, preprocesses them into Mel-Spectrograms, and prepares data for model training.

Step 3: Define the Emotion Recognition Model

Use a pre-trained VGG16 model for transfer learning. Freeze the existing layers and replace the final layer with a custom classification layer for emotion recognition.

Step 4: Initialize Dataset and DataLoader

  • Initialize the dataset with the path and emotion categories.
  • Split the dataset into training, validation, and test sets.
  • Create DataLoaders for batch processing.
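
The three bullets above can be sketched as follows. The article does not state its split ratios or batch size, so the 70/15/15 split, the batch size, and the helper name `make_loaders` are assumptions; a random `TensorDataset` stands in for the real dataset so the snippet runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split


def make_loaders(dataset, batch_size=32, val_frac=0.15, test_frac=0.15, seed=42):
    """Split a dataset into train/val/test subsets and wrap each in a DataLoader."""
    n_val = round(len(dataset) * val_frac)
    n_test = round(len(dataset) * test_frac)
    n_train = len(dataset) - n_val - n_test
    gen = torch.Generator().manual_seed(seed)  # reproducible split
    train_set, val_set, test_set = random_split(
        dataset, [n_train, n_val, n_test], generator=gen
    )
    return (
        DataLoader(train_set, batch_size=batch_size, shuffle=True),
        DataLoader(val_set, batch_size=batch_size),
        DataLoader(test_set, batch_size=batch_size),
    )


# Stand-in data: 100 small random "images" with labels from 7 classes.
dummy = TensorDataset(torch.randn(100, 3, 8, 8), torch.randint(0, 7, (100,)))
train_loader, val_loader, test_loader = make_loaders(dummy, batch_size=16)
print(len(train_loader.dataset), len(val_loader.dataset), len(test_loader.dataset))
```

Only the training loader is shuffled; validation and test order does not affect the metrics.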

Step 5: Training the Model

  • Define the loss function (CrossEntropyLoss) and optimizer (Adam).
  • Train the model for 10 epochs and calculate training and validation accuracy.
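
A training loop along these lines is sketched below. The helper names `run_epoch` and `fit` and the learning rate are assumptions; the loss, optimizer, and epoch count follow the bullets above. The optimizer is given only the parameters that still require gradients, which for a frozen VGG16 means the new final layer.

```python
import torch
import torch.nn as nn


def run_epoch(model, loader, criterion, device, optimizer=None):
    """Run one pass over loader; trains when an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * inputs.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            seen += inputs.size(0)
    return total_loss / seen, correct / seen


def fit(model, train_loader, val_loader, device, epochs=10, lr=1e-3):
    """Train with CrossEntropyLoss and Adam, logging loss and accuracy per epoch."""
    criterion = nn.CrossEntropyLoss()
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)  # lr is an assumed value
    model.to(device)
    history = []
    for epoch in range(1, epochs + 1):
        tr_loss, tr_acc = run_epoch(model, train_loader, criterion, device, optimizer)
        va_loss, va_acc = run_epoch(model, val_loader, criterion, device)
        print(f"Epoch [{epoch}/{epochs}], Training Loss: {tr_loss:.4f}, "
              f"Training Accuracy: {tr_acc:.4f}")
        print(f"Epoch [{epoch}/{epochs}], Validation Loss: {va_loss:.4f}, "
              f"Validation Accuracy: {va_acc:.4f}")
        history.append({"train_loss": tr_loss, "train_acc": tr_acc,
                        "val_loss": va_loss, "val_acc": va_acc})
    return history
```

Training metrics are measured with dropout active, while validation metrics are measured in evaluation mode; this is one plausible reason training accuracy can trail validation accuracy in a run like this.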

Output:

Epoch [1/10], Training Loss: 3.5698, Training Accuracy: 0.3829
Epoch [1/10], Validation Loss: 0.6287, Validation Accuracy: 0.7867
Epoch [2/10], Training Loss: 1.6390, Training Accuracy: 0.4850
Epoch [2/10], Validation Loss: 0.2506, Validation Accuracy: 0.8433
.
.
.
Epoch [10/10], Training Loss: 0.3281, Training Accuracy: 0.7450
Epoch [10/10], Validation Loss: 0.0285, Validation Accuracy: 0.9493
Final Training Accuracy: 0.7450
Final Validation Accuracy: 0.9493

Step 6: Predict an Emotion

Use the trained model to predict the emotion of a new audio file.
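
A prediction helper could look like the sketch below. It assumes the audio file has already been converted to a 3-channel spectrogram tensor using the same preprocessing as in training, and that `EMOTIONS` lists the class names in the order used for the training labels; both the list and the function name `predict_emotion` are hypothetical.

```python
import torch
import torch.nn as nn

# Assumed label order; it must match the order used when training.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "ps", "sad"]


def predict_emotion(model, image, class_names=EMOTIONS, device="cpu"):
    """Return (label, probability) for one preprocessed (3, H, W) spectrogram."""
    model.to(device).eval()  # evaluation mode disables dropout
    with torch.no_grad():
        logits = model(image.unsqueeze(0).to(device))  # add batch dimension
        probs = torch.softmax(logits, dim=1)[0]
    idx = int(probs.argmax())
    return class_names[idx], float(probs[idx])
```

A call such as `label, prob = predict_emotion(model, image)` followed by `print(f"Predicted Emotion: {label}")` produces a line in the same format as the output shown in this step.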

Output:

Predicted Emotion: fear

Complete Code


Speech emotion recognition makes it possible to infer a speaker's emotional state from audio alone. Combining Librosa's feature extraction with a pre-trained VGG16 classifier gives a practical starting point for applications such as customer service, entertainment, robotics, and security, where understanding sentiment in speech is valuable.
