Audio classification using spectrograms

Last Updated : 23 Jul, 2025

Our everyday lives are full of audio signals, and our brains distinguish one sound from another without conscious effort. Machines have no such built-in ability and must learn it from data. Audio classification is an important task behind applications such as speech recognition, music genre classification, environmental sound analysis, and audio forensics. Several approaches can be used for it, and one of them is classification using spectrograms. In this article, we walk through an implementation that classifies audio signals using spectrograms.

What is a spectrogram?

A spectrogram is a visual 2D representation of an audio signal that shows how the frequencies within a sound evolve over time. It is built by breaking the audio signal into small segments and computing the intensity of the different frequency components within each segment. This time-frequency representation reveals useful information about the audio content, such as the patterns and characteristics that distinguish one sound from another. Creating spectrograms efficiently is a key step in audio classification using spectrograms. The creation process involves the steps discussed below.

Segmentation: First, the raw audio signal is divided into short, overlapping time segments, or frames.
Frequency Analysis: For each time segment, the Fourier transform is applied to obtain a frequency domain representation of that segment, which reveals the frequency components present in that short duration.
Repeat for Each Segment: This process is repeated for each time segment to create a series of individual frequency domain representations.
Mel spectrogram generation: In this article, we use Mel spectrograms, a representation of an audio signal that is closer to how humans perceive sound. The process starts with the Fourier transform and then applies a series of additional transformations that model the nonlinear response of the human auditory system to different frequencies. It relies on the mel scale, a perceptual scale that gives finer resolution to lower frequencies and coarser resolution to higher frequencies, mimicking how the human ear perceives sound. This is very useful for audio classification using spectrograms.
Visualization: These frequency domain representations are then stacked side by side along the time axis to form the spectrogram. Brightness or color intensity represents the amplitude or energy of each frequency component in each frame.

The fourth step, Mel spectrogram generation, is an extra step performed here specifically for audio classification. It is applied in the 'Data pre-processing' sub-section.
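The steps above can be sketched with NumPy alone. This is a minimal illustration, not the code used in this article: the function names, the frame length of 1024 samples, the hop of 256 samples, and the Hann window are all assumptions chosen for the example, and the mel conversion uses the common HTK-style formula (libraries such as Librosa may default to a different variant).

```python
import numpy as np


def stft_magnitude(signal, frame_length=1024, hop_length=256):
    """Return a magnitude spectrogram of shape (frequency bins, frames)."""
    window = np.hanning(frame_length)
    # Pad very short signals so that at least one full frame exists.
    if len(signal) < frame_length:
        signal = np.pad(signal, (0, frame_length - len(signal)))
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    columns = []
    for i in range(n_frames):
        start = i * hop_length
        # Segmentation: take one overlapping frame and taper its edges.
        frame = signal[start:start + frame_length] * window
        # Frequency analysis: Fourier transform of this frame only.
        columns.append(np.abs(np.fft.rfft(frame)))
    # Stack the frames side by side: rows are frequencies, columns are time.
    return np.stack(columns, axis=1)


def hz_to_mel(frequency_hz):
    """Convert Hz to mel with the HTK-style formula (an assumed variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(frequency_hz, dtype=float) / 700.0)


# Demo on a synthetic 440 Hz tone (one second at an assumed 8 kHz sample rate).
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
spec = stft_magnitude(tone)

# The strongest frequency bin should sit close to 440 Hz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 1024
print(spec.shape, peak_hz)
```

The mel formula compresses high frequencies: equal steps in mel correspond to ever larger steps in Hz, which is why a mel spectrogram devotes more of its rows to the lower part of the spectrum.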

About the dataset

You can download the Barbie Vs Puppy dataset from here.

Step-by-step implementation

Importing required libraries

We will import all the necessary Python libraries, such as NumPy, scikit-learn, Matplotlib, and Librosa.

Un-zipping the dataset

Our dataset is a zip file that contains audio files (.wav) in two folders, one per class. So, our first task is to extract its contents to our runtime.
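A minimal sketch of the extraction step, using only the Python standard library. The helper name `extract_dataset` and the paths in the usage comment are hypothetical; the sketch assumes only what is stated above, namely that the archive holds .wav files grouped in one folder per class.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir):
    """Extract the archive and return the .wav paths grouped by folder name."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    files_by_class = {}
    # The parent folder of each .wav file is treated as its class label.
    for wav_path in sorted(target.rglob("*.wav")):
        files_by_class.setdefault(wav_path.parent.name, []).append(wav_path)
    return files_by_class


# Hypothetical usage (the file names are placeholders, not the real ones):
# files_by_class = extract_dataset("dataset.zip", "dataset")
# print({name: len(paths) for name, paths in files_by_class.items()})
```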

Data pre-processing

This is the most important step in audio classification using spectrograms. We load at most the first 3 seconds of each audio file for spectrogram generation, to suit the machine's capabilities. You can extend this if required. In our present dataset, most of the audio files are within 3 seconds. Here we generate Mel spectrograms for better classification.

Encoding targets and data-splitting

In this step, we use a label encoder to encode the target labels and then split the dataset into training and testing sets (80:20). After that, we scale all the spectrograms to a fixed length so that every spectrogram has the same shape. Otherwise, the classifier would not be able to accept them as input.
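A sketch of this step under stated assumptions: `pad_or_truncate` and `build_features` are hypothetical helpers, the target length of 32 frames and the `random_state` of 42 are arbitrary, and random arrays stand in for real spectrograms. Because padding works on one sample at a time, the sketch applies it before the split for brevity; doing it after the split, as described above, gives the same arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def pad_or_truncate(spectrogram, n_frames):
    """Force a (bands, time) array to exactly `n_frames` time columns."""
    current = spectrogram.shape[1]
    if current >= n_frames:
        return spectrogram[:, :n_frames]
    # Pad on the right with the quietest value found in this spectrogram.
    return np.pad(
        spectrogram,
        ((0, 0), (0, n_frames - current)),
        constant_values=spectrogram.min(),
    )


def build_features(spectrograms, n_frames):
    """Flatten equal-length spectrograms into one row per audio file."""
    return np.stack([pad_or_truncate(s, n_frames).ravel() for s in spectrograms])


# Synthetic stand-ins: 20 "spectrograms" with 16 bands and varying lengths.
rng = np.random.default_rng(0)
spectrograms = [rng.normal(size=(16, int(rng.integers(20, 40)))) for _ in range(20)]
labels = ["barbie"] * 10 + ["puppy"] * 10

encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # classes are sorted: barbie -> 0, puppy -> 1
X = build_features(spectrograms, n_frames=32)

# 80:20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```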

Exploratory data analysis

Now we will perform EDA to learn more about the dataset.

Target class distribution: The distribution of the target classes (here, barbie and puppy) helps us assess class balance and spot potential data biases.
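One way to count and plot the classes is sketched below. The function name and the example labels are made up for illustration and do not reflect the real class counts of the dataset.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt


def plot_class_distribution(labels):
    """Count the labels and draw one bar per class."""
    counts = Counter(labels)
    classes = sorted(counts)
    fig, ax = plt.subplots(figsize=(5, 4))
    ax.bar(classes, [counts[name] for name in classes])
    ax.set_xlabel("Class")
    ax.set_ylabel("Number of audio files")
    ax.set_title("Distribution of classes")
    return counts, fig


# Made-up labels purely to exercise the function.
example_labels = ["barbie", "puppy", "puppy", "barbie", "puppy"]
example_counts, example_fig = plot_class_distribution(example_labels)
print(dict(example_counts))
```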

Output:

Distribution of classes

Class-wise spectrogram comparison: Since we are performing audio classification using spectrograms, it is important to visualize the audio waveform and the spectrogram for each class. Both target classes contain multiple audio files, and all of them can be visualized if required. In this article, we visualize only one waveform and one spectrogram from each class.
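A plotting sketch for one sample is shown below. It assumes the waveform and its spectrogram (in dB) have already been computed; the function name is hypothetical, and plain Matplotlib `imshow` is used in place of any audio-specific display helper. The vertical axis is labelled by band index rather than by frequency in Hz.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt


def plot_waveform_and_spectrogram(waveform, sample_rate, spectrogram_db, class_name):
    """Draw the waveform on top and the spectrogram below it."""
    duration = len(waveform) / sample_rate
    times = np.arange(len(waveform)) / sample_rate
    fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 6))

    ax_wave.plot(times, waveform)
    ax_wave.set_title(f"Waveform ({class_name})")
    ax_wave.set_xlabel("Time (s)")
    ax_wave.set_ylabel("Amplitude")

    # origin="lower" puts the lowest frequency band at the bottom of the image.
    image = ax_spec.imshow(
        spectrogram_db,
        origin="lower",
        aspect="auto",
        extent=[0.0, duration, 0, spectrogram_db.shape[0]],
    )
    ax_spec.set_title(f"Spectrogram ({class_name})")
    ax_spec.set_xlabel("Time (s)")
    ax_spec.set_ylabel("Frequency band index")
    fig.colorbar(image, ax=ax_spec, label="dB")
    fig.tight_layout()
    return fig


# Placeholder data purely to exercise the function.
demo_waveform = np.sin(np.linspace(0.0, 20.0, 800))
demo_spectrogram = np.random.default_rng(0).normal(size=(16, 25))
demo_fig = plot_waveform_and_spectrogram(demo_waveform, 800, demo_spectrogram, "puppy")
```

Calling the function once per class, with one file from each folder, gives a side-by-side impression of how the two classes differ.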

Output:

Waveform and spectrogram comparison for 'barbie' class

Waveform and spectrogram comparison for 'puppy' class

Model fitting and evaluation

After EDA, we can see that this is a binary classification task, as the target has only two classes (barbie and puppy). A wide range of classification models could be used for it. Here, we implement the Gradient Boosting classifier, an ensemble learning technique. We leave all of its parameters at their default values and specify only 'random_state', which controls the randomness during model training and ensures that the model produces the same result on every execution. Finally, we evaluate the model's performance in terms of accuracy and F1-score.
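A sketch of the fitting and evaluation step with scikit-learn's `GradientBoostingClassifier`, `accuracy_score`, and `f1_score`. The helper name and the `random_state` value of 42 are assumptions. The demo data is synthetic and easily separable, so the scores it prints are unrelated to the scores this article reports for the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score


def fit_and_evaluate(X_train, y_train, X_test, y_test, random_state=42):
    """Fit a default Gradient Boosting model and score it on the test set."""
    # Only random_state is set; every other parameter keeps its default.
    model = GradientBoostingClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)  # binary F1 for the class encoded as 1
    return model, accuracy, f1


# Synthetic two-class data standing in for the flattened spectrograms.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(6.0, 1.0, (40, 5))])
y_demo = np.array([0] * 40 + [1] * 40)
order = rng.permutation(len(y_demo))
X_demo, y_demo = X_demo[order], y_demo[order]

model, accuracy, f1 = fit_and_evaluate(
    X_demo[:64], y_demo[:64], X_demo[64:], y_demo[64:]
)
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 score: {f1:.4f}")
```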

Output:

Accuracy: 0.7500
F1 score: 0.8000

Note: Using the same data pre-processing code, you can implement a different classifier of your choice. The Gradient Boosting classifier is implemented here only as an example; any other model can be fitted and evaluated in the same way.

Conclusion

We can conclude that audio classification using spectrograms is a long and computation-heavy technique, but it can be effective. Our model performed moderately well, with an accuracy of 75% and an F1-score of 80%. These results suggest that, although the process is lengthy, choosing a suitable model and tuning its hyperparameters can further improve audio classification results.

URL: https://www.geeksforgeeks.org/nlp/audio-classification-using-spectrograms/