Video Classification with a 3D Convolutional Neural Network

Last Updated : 23 Jul, 2025

Video classification is an important technique for analyzing digital content, with applications in security surveillance, media and entertainment, and automotive safety. 3D Convolutional Neural Networks (3D CNNs) are a common model choice for this task because they learn from the spatial content of each frame and from the motion between frames.

Unlike frame-by-frame methods that treat a video as a set of independent images, 3D CNNs convolve across the temporal dimension as well, so they can learn motion patterns that span several frames. This typically gives more reliable identification and categorization of video content, at the cost of more computation per clip.

This guide walks step by step through video classification with 3D CNNs. From setting up your environment to evaluating your model, you will learn the basics of 3D CNNs, data preparation, model building, training, and performance evaluation. Each section pairs the underlying concept with its practical application.

Understanding 3D Convolutional Neural Networks

A 3D Convolutional Neural Network (3D CNN) is a neural network architecture that learns hierarchical representations of volumetric data. Its stacked layers progressively learn more complex features for tasks such as classification, regression, or generation. Unlike traditional 2D Convolutional Neural Networks, which convolve over height and width only, 3D CNNs convolve over a third axis as well: depth for volumetric data such as medical images (CT or MRI scans), or time for video sequences. This makes them well suited to inputs where both the spatial context and the progression across slices or frames matter.

3D CNNs utilize 3D convolutional layers with a three-dimensional kernel that slides over the input volume to detect patterns across three dimensions. This approach is more effective in capturing the complexities of spatial patterns compared to 2D CNNs. The inclusion of the temporal dimension allows 3D CNNs to analyze frame-to-frame relationships in video data, enhancing the model's understanding of dynamic scenes.
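The sliding three-dimensional kernel described above can be sketched in plain NumPy. This is a deliberately naive, single-channel illustration (stride 1, no padding), not how a deep learning framework implements it. The function name `conv3d_valid` and the toy clip size are assumptions made for the example.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation with stride 1 and no padding."""
    t_len, height, width = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t_len - kt + 1, height - kh + 1, width - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # The kernel covers kt consecutive frames and a kh x kw window.
                patch = volume[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(patch * kernel)
    return out

clip = np.random.rand(8, 16, 16)     # toy grayscale clip: (frames, height, width)
kernel = np.ones((3, 3, 3)) / 27.0   # 3x3x3 averaging kernel
features = conv3d_valid(clip, kernel)
print(features.shape)  # (6, 14, 14)
```

Each output value mixes pixels from three consecutive frames, which is what lets a 3D kernel respond to motion rather than to a single still image.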

Key Differences between 2D and 3D CNNs

When analyzing video data, 3D CNNs adopt a different approach compared to 2D CNNs. Instead of considering only height and width, 3D CNNs also account for a third axis, which for video is time (the sequence of frames). The kernel is three-dimensional and moves across all three axes, which enables it to capture both spatial and temporal features in video sequences.

2D CNNs treat video frames as separate images, ignoring connections between them, while 3D CNNs learn the patterns that develop over time. They achieve this with convolution and pooling layers that work in three dimensions, processing sequences of frames and preserving temporal information. The extra computation is often justified by the more detailed representation of the data.

Implementing Video Classification with 3D CNNs

Step 1: Import Required Libraries

First, we need to import the necessary libraries for video processing, data manipulation, model creation, and evaluation.
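Before running the later steps, it can help to confirm that the libraries are installed. The sketch below uses only the standard library to check this. The module list is an assumption based on the tools this guide mentions (NumPy, scikit-learn, matplotlib, seaborn, a Keras model, DataFrames, tqdm-style progress bars); OpenCV (`cv2`) for video decoding is also an assumption.

```python
from importlib.util import find_spec

# Import names assumed for this guide; adjust the list to match your setup.
REQUIRED_MODULES = [
    "cv2",         # OpenCV, for reading video files (assumed)
    "numpy",       # array handling
    "pandas",      # label DataFrames
    "tensorflow",  # Keras model building and training
    "sklearn",     # data splitting and evaluation metrics
    "matplotlib",  # plotting
    "seaborn",     # confusion matrix heatmap
    "tqdm",        # progress bars while loading videos
]

def missing_modules(names=REQUIRED_MODULES):
    """Return the modules from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules()
print("Missing modules:", missing if missing else "none")
```

Any name printed as missing needs to be installed before the corresponding step will run.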

Step 2: Load and Visualize Data

You can download the dataset from here.

Load the paths to the directories containing the training and testing videos, along with the labels associated with each video. Visualize the distribution of classes in the training labels to gain insights into the dataset's balance and determine the number of unique classes.
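A minimal sketch of this step follows, assuming the labels are stored in a CSV file with one row per video. The file name `train.csv` and the column names `video_name` and `tag` are hypothetical; substitute whatever your copy of the dataset uses.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_label_distribution(labels_df, label_col="tag"):
    """Count the videos per class and draw the counts as a bar chart."""
    counts = labels_df[label_col].value_counts().sort_index()
    fig, ax = plt.subplots()
    counts.plot(kind="bar", ax=ax)
    ax.set_xlabel("Class")
    ax.set_ylabel("Number of videos")
    ax.set_title("Distribution of training labels")
    return counts, ax

# Usage with a hypothetical label file:
# train_df = pd.read_csv("train.csv")
# counts, ax = plot_label_distribution(train_df)
# print("Number of classes:", len(counts))
# plt.show()
```

A roughly even bar chart suggests that plain accuracy is a reasonable first metric; a strongly skewed one is a signal to look at per-class metrics as well.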

Output:

[Figure: Distribution of training labels]

Number of classes: 5

Step 3: Data Pre-Processing

Next, we define a function extract_frames that takes a path to a video file and extracts a specified number of frames from the video, resizing them to 112x112 pixels. The function returns an array of the extracted frames, ensuring that the number of frames extracted matches the specified num_frames parameter.

Step 4: Load Data and Create Numpy Arrays

Next, we define a function load_data that takes a DataFrame containing video labels, a directory containing video files, the number of classes in the dataset, and the number of frames to extract per video. The function preprocesses the video data by extracting frames and converting labels to one-hot encoded format, returning arrays of video frames and labels suitable for training a 3D CNN model.
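A sketch of `load_data` is shown below. To keep the example self-contained, the frame extractor is passed in as an extra `frame_loader` argument, which is not part of the signature described above. The column names `video_name` and `tag` are hypothetical, and the one-hot encoding uses NumPy rather than a Keras utility.

```python
import os

import numpy as np
import pandas as pd

def load_data(df, video_dir, num_classes, num_frames, frame_loader):
    """Build (videos, one-hot labels) arrays from a label DataFrame.

    frame_loader(path, num_frames) must return an array of frames for one
    video, for example the extract_frames function from Step 3.
    """
    # Map class names to integer indices in alphabetical order.
    class_names = sorted(df["tag"].unique())
    class_to_index = {name: i for i, name in enumerate(class_names)}

    videos, labels = [], []
    for name, tag in zip(df["video_name"], df["tag"]):
        path = os.path.join(video_dir, name)
        videos.append(frame_loader(path, num_frames))
        labels.append(class_to_index[tag])

    X = np.stack(videos)                            # (videos, frames, H, W, 3)
    y = np.eye(num_classes, dtype=np.float32)[labels]  # one-hot rows
    return X, y
```

This sketch derives the class-to-index mapping from the DataFrame it is given, so the training and test label files must contain the same set of classes for the encodings to agree. Wrapping the loop in a progress bar is a separate convenience and is omitted here.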

Output:

100%|██████████| 594/594 [01:09<00:00, 8.52it/s]
100%|██████████| 224/224 [00:25<00:00, 8.89it/s]

Step 5: Train-Test Split

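Next, we split the loaded training data into training and validation subsets. The test videos stay untouched until the final evaluation in Step 8. The split produces four arrays:

X_train: Training set features (video frames).
X_val: Validation set features (video frames).
y_train: Training set labels (one-hot encoded).
y_val: Validation set labels (one-hot encoded).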
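The split can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio and the fixed `random_state` are assumptions. The ratio is at least consistent with the training log in Step 7: 594 videos split 80/20 leaves 475 for training, and 475 samples at a batch size of 8 give 60 batches per epoch. Small placeholder arrays stand in for the real video data here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 tiny "videos" and one-hot labels for 5 classes.
X = np.zeros((10, 4, 8, 8, 3), dtype=np.float32)
y = np.eye(5, dtype=np.float32)[np.arange(10) % 5]

# Hold out 20% of the samples for validation (assumed ratio).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)  # (8, 4, 8, 8, 3) (2, 4, 8, 8, 3)
```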

Step 6: Create the 3D CNN Model

Next, we define a function create_advanced_3dcnn_model that constructs an advanced 3D CNN model for video classification. The function takes the shape of input frames (input_shape) and the number of classes (num_classes) as input and returns a compiled 3D CNN model.

The create_advanced_3dcnn_model function builds a deep 3D CNN. It stacks multiple 3D convolutional layers, each followed by max pooling and batch normalization. The model also includes a dropout layer for regularization and ends with a dense layer for classification. It is compiled with the Adam optimizer, categorical crossentropy loss, and the accuracy metric. This depth makes it suitable for video classification tasks that require capturing intricate patterns in the data.
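A sketch of a model matching this description is shown below, using the Keras API bundled with TensorFlow. The number of blocks, the filter counts, the dropout rate, the global average pooling layer, and the 16-frame input shape are all assumptions chosen for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def create_advanced_3dcnn_model(input_shape, num_classes):
    """Build and compile a 3D CNN; the layer sizes here are illustrative."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),  # (frames, height, width, channels)
        layers.Conv3D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.BatchNormalization(),
        layers.Conv3D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.BatchNormalization(),
        layers.Conv3D(128, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling3D(),  # collapse time and space to a vector
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),              # regularization
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# 16 frames of 112x112 RGB and 5 classes; the frame count is an assumption.
model = create_advanced_3dcnn_model((16, 112, 112, 3), 5)
model.summary()
```

Categorical crossentropy pairs with the softmax output and the one-hot labels produced in Step 4.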

Step 7: Train the Model

Next, we train the 3D CNN model (model) using the training data (X_train and y_train) and validate it on the validation data (X_val and y_val) for 10 epochs with a batch size of 8. The training process involves updating the model's weights based on the gradient of the loss function computed on the training data, with the goal of minimizing the loss and improving the model's accuracy.

X_train and y_train: Training set features (video frames) and labels (one-hot encoded).
X_val and y_val: Validation set features (video frames) and labels (one-hot encoded).
epochs=10: Number of epochs (iterations over the entire dataset) for training.
batch_size=8: Number of samples per gradient update.
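The training call described above can be sketched as a small wrapper around Keras `fit`. The wrapper name `train_model` is hypothetical; the arguments mirror the parameters listed above.

```python
import numpy as np
from tensorflow import keras

def train_model(
    model: keras.Model,
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    epochs: int = 10,
    batch_size: int = 8,
) -> keras.callbacks.History:
    """Fit the model and report validation metrics after every epoch."""
    return model.fit(
        X_train,
        y_train,
        validation_data=(X_val, y_val),  # evaluated at the end of each epoch
        epochs=epochs,
        batch_size=batch_size,
    )

# Usage, assuming the arrays from Step 5 and the model from Step 6:
# history = train_model(model, X_train, y_train, X_val, y_val)
```

The returned object keeps the per-epoch metrics in `history.history`, which is what the plots in Step 12 are drawn from.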

Output:

Epoch 1/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 150s 2s/step - accuracy: 0.4733 - loss: 41.4847 - val_accuracy: 0.5042 - val_loss: 314.1945
Epoch 2/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 164s 3s/step - accuracy: 0.8078 - loss: 23.4136 - val_accuracy: 0.5462 - val_loss: 249.8013
Epoch 3/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 225s 4s/step - accuracy: 0.8392 - loss: 13.7996 - val_accuracy: 0.3950 - val_loss: 465.2792
Epoch 4/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 179s 3s/step - accuracy: 0.9461 - loss: 4.9286 - val_accuracy: 0.8235 - val_loss: 28.2462
Epoch 5/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 165s 3s/step - accuracy: 0.9474 - loss: 4.8669 - val_accuracy: 0.9244 - val_loss: 5.1237
Epoch 6/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 170s 3s/step - accuracy: 0.9614 - loss: 3.1715 - val_accuracy: 0.8403 - val_loss: 39.0854
Epoch 7/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 168s 3s/step - accuracy: 0.9532 - loss: 3.4268 - val_accuracy: 0.8487 - val_loss: 27.8798
Epoch 8/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 163s 3s/step - accuracy: 0.9805 - loss: 1.1084 - val_accuracy: 0.9244 - val_loss: 6.3794
Epoch 9/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 163s 3s/step - accuracy: 0.9694 - loss: 3.3069 - val_accuracy: 0.8067 - val_loss: 51.3638
Epoch 10/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 164s 3s/step - accuracy: 0.9597 - loss: 3.3076 - val_accuracy: 0.8824 - val_loss: 10.6910

Step 8: Evaluate the Model

In this step, we evaluate the trained 3D CNN model (model) on the test data (X_test and y_test) to measure its performance on unseen data. The model's performance is evaluated by calculating the loss and accuracy on the test set. The test accuracy gives an indication of how well the model generalizes to new, unseen data. This evaluation step is crucial for assessing the model's effectiveness in real-world scenarios.
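A sketch of the evaluation call follows. The wrapper name `evaluate_model` is hypothetical. With a single compiled metric, Keras `evaluate` returns the loss followed by that metric.

```python
import numpy as np
from tensorflow import keras

def evaluate_model(model: keras.Model, X_test: np.ndarray, y_test: np.ndarray):
    """Return (loss, accuracy) on the test set and print the accuracy."""
    loss, accuracy = model.evaluate(X_test, y_test)
    print(f"Test Accuracy: {accuracy:.2f}")
    return loss, accuracy

# Usage, assuming X_test and y_test were built with load_data in Step 4:
# test_loss, test_accuracy = evaluate_model(model, X_test, y_test)
```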

Output:

7/7 ━━━━━━━━━━━━━━━━━━━━ 30s 3s/step - accuracy: 0.7098 - loss: 95.7806
Test Accuracy: 0.63

Step 9: Get Predictions

In this step, we use the trained 3D CNN model (model) to predict the class labels for the test data (X_test) and compare them with the true labels (y_test) to evaluate the model's performance.

The predict method is used to obtain predicted probabilities for each class for each sample in the test set.
The predicted probabilities are then converted to class labels using np.argmax along the axis representing the classes, resulting in y_pred_classes.
Similarly, the true labels (y_test) are converted to class labels, resulting in y_true_classes.
These class labels can be used to calculate metrics such as precision, recall, and the confusion matrix for further evaluation of the model.
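The conversion from probabilities to class labels can be illustrated with NumPy alone. The arrays below are made-up stand-ins for the output of `model.predict(X_test)` and for `y_test`; only the `np.argmax` step reflects the procedure described above.

```python
import numpy as np

# Stand-in for model.predict(X_test): one row of class probabilities per video.
y_pred = np.array([
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.20, 0.20, 0.50, 0.05, 0.05],
])
# Stand-in for y_test: one-hot rows for the true classes 1, 0 and 4.
y_test = np.eye(5)[[1, 0, 4]]

# axis=1 runs over the class dimension, giving one class index per video.
y_pred_classes = np.argmax(y_pred, axis=1)
y_true_classes = np.argmax(y_test, axis=1)

print(y_pred_classes)  # [1 0 2]
print(y_true_classes)  # [1 0 4]
print(np.mean(y_pred_classes == y_true_classes))  # fraction predicted correctly
```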

Output:

7/7 ━━━━━━━━━━━━━━━━━━━━ 14s 2s/step

Step 10: Plot Confusion Matrix

In this step, we import the necessary functions from scikit-learn (classification_report and confusion_matrix) to evaluate the performance of our 3D CNN model.

Next, we use the matplotlib and seaborn libraries to plot a heatmap of the confusion matrix, which provides a visual representation of the model's performance.
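A sketch of the confusion matrix plot is shown below. The function name `plot_confusion_matrix`, the colour map, and the figure size are assumptions; the class labels passed in are expected to be the integer indices from Step 9.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true_classes, y_pred_classes, class_names):
    """Compute the confusion matrix and draw it as an annotated heatmap."""
    labels = list(range(len(class_names)))
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_true_classes, y_pred_classes, labels=labels)
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(
        cm, annot=True, fmt="d", cmap="Blues",
        xticklabels=class_names, yticklabels=class_names, ax=ax,
    )
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_title("Confusion Matrix")
    return cm, ax

# Usage, assuming y_true_classes and y_pred_classes from Step 9:
# cm, ax = plot_confusion_matrix(y_true_classes, y_pred_classes, class_names)
# plt.show()
```

Off-diagonal cells show which pairs of classes the model confuses, which a single accuracy number hides.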

Output:

[Figure: Confusion Matrix]

Step 11: Classification Report

In this step, we use the classification_report function from scikit-learn to generate a comprehensive report on the classification performance of our 3D CNN model.
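A sketch of the report call follows. The wrapper name `build_report` is hypothetical, and `zero_division=0` is an added safeguard for classes that are never predicted. The class names are the five that appear in the report below, in alphabetical order to match the assumed index mapping.

```python
from sklearn.metrics import classification_report

CLASS_NAMES = ["CricketShot", "PlayingCello", "Punch", "ShavingBeard", "TennisSwing"]

def build_report(y_true_classes, y_pred_classes, class_names=CLASS_NAMES):
    """Return per-class precision, recall, F1-score and support as text."""
    return classification_report(
        y_true_classes,
        y_pred_classes,
        labels=list(range(len(class_names))),
        target_names=class_names,
        zero_division=0,  # report 0 instead of warning for unpredicted classes
    )

# Usage, assuming y_true_classes and y_pred_classes from Step 9:
# print(build_report(y_true_classes, y_pred_classes))
```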

Output:

              precision    recall  f1-score   support

 CricketShot       0.60      1.00      0.75        49
PlayingCello       0.37      0.25      0.30        44
       Punch       0.73      0.85      0.79        39
ShavingBeard       0.41      0.16      0.23        43
 TennisSwing       0.80      0.84      0.82        49

    accuracy                           0.63       224
   macro avg       0.58      0.62      0.58       224
weighted avg       0.59      0.63      0.58       224

Step 12: Plot Training History

In this step, we use matplotlib to plot the training and validation accuracy of the 3D CNN model over epochs, which helps visualize how the model's accuracy improves during training.
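A sketch of the plotting code is shown below. The function name `plot_metric` is hypothetical. It expects a dictionary shaped like the `history` attribute of the object returned by Keras `fit`. The example values are the first three epochs copied from the training log in Step 7.

```python
import matplotlib.pyplot as plt

def plot_metric(history, metric):
    """Plot a training metric and its validation counterpart per epoch."""
    epochs = range(1, len(history[metric]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history[metric], label="Train")
    ax.plot(epochs, history[f"val_{metric}"], label="Validation")
    ax.set_title(f"Model {metric.capitalize()}")
    ax.set_xlabel("Epoch")
    ax.set_ylabel(metric.capitalize())
    ax.legend()
    return ax

# First three epochs from the training log in Step 7.
example = {
    "accuracy": [0.4733, 0.8078, 0.8392],
    "val_accuracy": [0.5042, 0.5462, 0.3950],
}
ax = plot_metric(example, "accuracy")
# plt.show()
```

Calling the same function with `"loss"` on a history that contains `loss` and `val_loss` draws the loss curves. A validation curve that swings widely while the training curve improves steadily, as in the log above, is a common sign of overfitting or of a validation set that is too small.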

Output:

[Figure: Model Accuracy]

Next, we continue to visualize the training history of the 3D CNN model by plotting the training and validation loss over epochs, providing insights into the model's convergence and performance.

Output:

[Figure: Model Loss]

Applications of 3D CNNs in Video Classification

3D CNNs are well suited to video classification because they capture both the spatial and temporal aspects of footage. On action recognition datasets like UCF101, 3D CNNs can achieve higher accuracy than models that classify each frame independently, because they use spatio-temporal features. These features are essential for distinguishing activities that differ mainly in how they move.

Beyond action recognition, 3D CNNs are utilized in fields such as medical imaging and surveillance, where identifying patterns or anomalies over time is crucial for diagnosis and monitoring. Sometimes, 3D CNNs are combined with Recurrent Neural Networks (RNNs) to create hybrid models capable of handling the most complex video analysis tasks. These models integrate spatial and temporal cues, providing a comprehensive understanding of the data.

Conclusion

In this guide, we walked through using 3D Convolutional Neural Networks (3D CNNs) for video classification, from setting up the environment and preprocessing videos to building, training, and evaluating the model. In the run shown above, the model reached 0.63 accuracy on 224 test videos across 5 classes, with strong results for CricketShot, Punch, and TennisSwing and weaker results for PlayingCello and ShavingBeard.

Our exploration highlights the potential of 3D CNNs to revolutionize video analytics. These advanced models can interpret both the spatial and temporal dimensions of a video, giving us a deeper understanding of its content. Of course, using these models comes with its challenges, as we need to balance complexity and computational demands. But the possibilities they offer for advancements in automated surveillance and digital media are enormous. This guide serves as a solid foundation for mastering 3D CNNs in sophisticated video classification tasks.


URL: https://www.geeksforgeeks.org/deep-learning/video-classification-with-a-3d-convolutional-neural-network/