Spam messages are unsolicited or unwanted emails/messages sent in bulk to users. Detecting spam emails automatically helps prevent unnecessary clutter in users' inboxes.
In this article, we will build a spam email detection model that classifies emails as Spam or Ham (Not Spam) using TensorFlow, one of the most popular deep learning libraries.
Before we begin, let's import the necessary libraries: pandas, numpy, tensorflow, matplotlib, wordcloud, and nltk for data processing, model building, and visualization.
We'll use a dataset containing labeled emails (Spam or Ham). Let's load the dataset and inspect its structure. You can download the dataset from here:
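A minimal sketch of the loading step follows. The file name `emails.csv` and the column names `label` and `text` are assumptions, since the article's code and download link are not shown here; a small in-memory CSV stands in for the real file so the snippet runs as written.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. With the real dataset you would call
# pd.read_csv("emails.csv"), where "emails.csv" is a hypothetical file name.
sample_csv = io.StringIO(
    "label,text\n"
    "ham,Subject: meeting notes attached\n"
    "spam,Subject: claim your free prize now\n"
    "ham,Subject: lunch on friday?\n"
)

data = pd.read_csv(sample_csv)

# head() returns the first rows (five by default) for a quick inspection.
print(data.head())
```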
Output:
[Screenshot: first few rows of the dataset]
This gives us a glimpse of the first few rows of the dataset. You can also check the shape of the dataset:
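For reference, `DataFrame.shape` is a `(rows, columns)` tuple. The tiny frame here is only a stand-in with assumed column names, so its shape differs from the real dataset's.

```python
import pandas as pd

# Tiny stand-in frame; the column names are assumptions.
data = pd.DataFrame({"label": ["ham", "spam", "ham"], "text": ["a", "b", "c"]})

# shape is an attribute, not a method: (number of rows, number of columns).
rows, cols = data.shape
print(data.shape)
```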
Output:
(5171,4)
The data contains 5171 rows and four columns.
Now, let's visualize the label distribution to understand how the classes are balanced:
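One way to draw this plot is a bar chart of the per-label counts. This is a sketch with toy labels and an assumed column name `label`; the article's own plotting code may differ.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy labels standing in for the real label column (assumed name: "label").
data = pd.DataFrame({"label": ["ham", "ham", "ham", "spam"]})

# value_counts() returns the number of rows per label, largest first.
counts = data["label"].value_counts()

fig, ax = plt.subplots()
ax.bar(list(counts.index), list(counts.values))
ax.set_xlabel("label")
ax.set_ylabel("count")
ax.set_title("Label distribution")
# In an interactive session, drop the Agg line above and call plt.show().
```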
Output:
[Image: label distribution plot]
We can clearly see that there are far more Ham samples than Spam samples, which means the dataset is imbalanced. To address the imbalance, we'll downsample the majority class (Ham) to match the minority class (Spam).
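Downsampling can be sketched with `DataFrame.sample`. The column names, the toy data, and the `random_state` value are assumptions made for illustration.

```python
import pandas as pd

# Toy imbalanced data: 6 ham rows and 2 spam rows.
data = pd.DataFrame({
    "label": ["ham"] * 6 + ["spam"] * 2,
    "text": [f"message {i}" for i in range(8)],
})

ham = data[data["label"] == "ham"]
spam = data[data["label"] == "spam"]

# Randomly keep only as many ham rows as there are spam rows.
ham_downsampled = ham.sample(n=len(spam), random_state=42)

# Recombine the two classes into one balanced frame.
balanced_data = pd.concat([ham_downsampled, spam]).reset_index(drop=True)
print(balanced_data["label"].value_counts())
```

Downsampling discards majority-class rows, so it trades some training data for a balanced label distribution.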
Output:
[Image: label distribution after downsampling]
Textual data often requires preprocessing before it is fed into a machine learning model. Common steps include removing stopwords and punctuation, and performing stemming or lemmatization.
We'll apply cleaning steps of this kind to the email text. Although removing text means losing some information, it strips out noise and gives the model a cleaner input.
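One of the cleaning steps named above, punctuation removal, can be sketched with the standard library alone. The function name `remove_punctuation` is illustrative and not taken from the article.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Return `text` with ASCII punctuation characters stripped."""
    return text.translate(PUNCT_TABLE)


print(remove_punctuation("Hello, world!"))
```

On a pandas column, the same function could be applied with `Series.apply`.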
Output:
[Screenshot: preview of the processed text]
Next, we use a helper function to remove the stop words.
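A stop word filter might look like the sketch below. The small hard-coded word set is an assumption that keeps the example dependency-free; with NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` could supply the list instead. The function name is illustrative.

```python
# Small illustrative stop word set. With NLTK available, this could be
# replaced by set(nltk.corpus.stopwords.words("english")).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "you"}


def remove_stopwords(text: str, stop_words=STOP_WORDS) -> str:
    """Lowercase `text` and drop every word found in `stop_words`."""
    kept = [word for word in text.lower().split() if word not in stop_words]
    return " ".join(kept)


print(remove_stopwords("This is the offer of a lifetime for you"))
```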
Output:
[Screenshot: preview of the text after stop word removal]
A word cloud is a text visualization tool that helps us see which words occur most frequently in the corpus.
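A word cloud sizes each word by how often it occurs, so the underlying computation is a word count. The sketch below computes those counts with the standard library; the helper name and sample texts are made up for illustration, and the drawing step is left as a comment because it needs the `wordcloud` package.

```python
from collections import Counter


def word_frequencies(texts):
    """Count how often each whitespace-separated word occurs across `texts`."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


emails = ["free prize claim now", "claim your free gift", "meeting at noon"]
freqs = word_frequencies(emails)
print(freqs.most_common(3))

# With the wordcloud package installed, these counts could be drawn with
# WordCloud().generate_from_frequencies(freqs).
```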
Output:
[Image: word cloud of the email text]
Machine learning models work with numbers, so we need to convert the text data into numerical vectors using tokenization and padding.
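To show what these two steps do, here is a dependency-free sketch rather than the article's code. Tokenization maps each word to an integer id, and padding brings every sequence to the same length. Padding and truncating at the end, reserving id 0 for padding, and the helper names are all assumptions; in practice a Keras utility such as `tf.keras.layers.TextVectorization` could do the same job.

```python
def build_vocab(texts):
    """Map each distinct word to an integer id, reserving 0 for padding."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def texts_to_padded(texts, vocab, max_len):
    """Convert texts to id sequences, then truncate or zero-pad to `max_len`."""
    padded = []
    for text in texts:
        # Words missing from the vocabulary are skipped.
        ids = [vocab[w] for w in text.lower().split() if w in vocab]
        ids = ids[:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded


train_texts = ["free prize now", "meeting at noon today"]
vocab = build_vocab(train_texts)
print(texts_to_padded(train_texts, vocab, max_len=5))
# prints [[1, 2, 3, 0, 0], [4, 5, 6, 7, 0]]
```

The sketch uses `max_len=5` for readability; the model summary later in the article shows an input length of 100.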
We will build a deep learning model using a Sequential architecture. As the summary in the output shows, this model includes an Embedding layer, an LSTM layer, and two Dense layers.
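The sketch below reconstructs a model whose layer shapes and parameter counts match the printed summary. The vocabulary size is inferred from the summary (1,274,912 embedding parameters divided by 32 dimensions gives 39,841); the activations, loss, and optimizer are assumptions, since the summary does not show them.

```python
import tensorflow as tf

VOCAB_SIZE = 39841  # inferred: 1,274,912 embedding params / 32 dimensions
MAX_LEN = 100       # sequence length from the summary's output shape

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    # Learns a 32-dimensional vector for each token id.
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
    # Reads the sequence and returns a 16-dimensional final state.
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(32, activation="relu"),      # assumed activation
    # A single sigmoid unit yields a spam probability (assumed).
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Loss, optimizer, and metric are assumptions for a binary classifier.
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)
model.summary()
```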
Output:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding (Embedding)       (None, 100, 32)           1274912
 lstm (LSTM)                 (None, 16)                3136
 dense (Dense)               (None, 32)                544
 dense_1 (Dense)             (None, 1)                 33
=================================================================
Total params: 1,278,625
Trainable params: 1,278,625
Non-trainable params: 0
_________________________________________________________________
We train the model using the EarlyStopping and ReduceLROnPlateau callbacks. EarlyStopping halts training when the model's performance stops improving, and ReduceLROnPlateau lowers the learning rate when progress stalls so that the model can be fine-tuned.
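A minimal sketch of training with both callbacks is shown below. It uses random stand-in data and a deliberately tiny model so that it runs quickly; the monitored quantity, `patience`, `factor`, epoch count, and batch size are assumed values, not the article's settings.

```python
import numpy as np
import tensorflow as tf

# Random stand-in data: 64 padded sequences of length 20 with binary labels.
rng = np.random.default_rng(0)
X = rng.integers(1, 50, size=(64, 20))
y = rng.integers(0, 2, size=(64,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Embedding(input_dim=50, output_dim=8),
    tf.keras.layers.LSTM(4),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

callbacks = [
    # Stop once validation loss has not improved for 3 epochs (assumed value).
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    ),
    # Halve the learning rate after 2 epochs without improvement (assumed).
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=2
    ),
]

history = model.fit(
    X[:48], y[:48],
    validation_data=(X[48:], y[48:]),
    epochs=3,
    batch_size=16,
    callbacks=callbacks,
    verbose=0,
)
# history.history maps each tracked quantity to its per-epoch values.
print(sorted(history.history))
```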
After training, we evaluate the model on the test data to measure its performance.
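The evaluation call can be sketched as follows. The model here is an untrained toy and the test data is random, so the printed numbers will not resemble the article's results; only the `evaluate` call pattern is the point.

```python
import numpy as np
import tensorflow as tf

# Random stand-in for the held-out test split.
rng = np.random.default_rng(1)
X_test = rng.integers(1, 50, size=(32, 20))
y_test = rng.integers(0, 2, size=(32,))

# Untrained toy model; in the article this would be the trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Embedding(input_dim=50, output_dim=8),
    tf.keras.layers.LSTM(4),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# evaluate() returns the loss followed by each compiled metric.
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
```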
Output:
Test Loss: 0.1202
Test Accuracy: 0.9700
Thus, the test accuracy turns out to be 97%, which is quite satisfactory.
Having trained our model, we can plot how the training and validation accuracies vary with the number of epochs.
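A plotting helper along these lines would work with the dictionary Keras stores in `history.history` when the model is compiled with `metrics=["accuracy"]`. The function name is illustrative, and the numbers passed in are placeholders rather than results from the article's model.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_accuracy(history_dict):
    """Plot training and validation accuracy against the epoch number."""
    epochs = range(1, len(history_dict["accuracy"]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history_dict["accuracy"], label="Training accuracy")
    ax.plot(epochs, history_dict["val_accuracy"], label="Validation accuracy")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Accuracy")
    ax.legend()
    return ax


# Placeholder numbers only, not results from the article's model.
dummy_history = {"accuracy": [0.5, 0.6, 0.7], "val_accuracy": [0.5, 0.55, 0.6]}
ax = plot_accuracy(dummy_history)
```

A widening gap between the two curves would suggest overfitting.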
Output:
[Image: training and validation accuracy per epoch]
By following these steps, we have successfully built a machine learning model that can classify emails as spam or ham. With further optimization, this model can be fine-tuned to improve its performance even more.