Text Cleaning and Hyperparameters Optimization on an IMDB movie review dataset with an SVM model in…
Sentiment analysis on movie reviews with text processing techniques and hyperparameters optimization.
Introduction
This article deploys a machine learning model called a "Support Vector Machine" (SVM), with a particular focus on text cleaning and hyperparameters optimization, two techniques that will most likely increase the model's accuracy.
In a previous article, we saw how to perform sentiment analysis on an IMDB movie review data set by using two feature extraction methods, Bag-of-Words and TF-IDF, with a Naïve Bayes classifier. Although the results were promising, there is always room for improvement, which is the goal of the current project.
Before starting, though, I owe you some basic definitions that are useful for understanding the topic. First, it is fundamental to understand what text cleaning is, also referred to as text processing or text manipulation:
Sentences are usually presented as text (string of characters) and documents can be described as a collection of sentences. In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text¹.
The goal of text processing is to eliminate the superfluous information a text contains before training a machine learning model. This reduces the "noise", allowing the algorithm to identify patterns more easily and generalize better. As a consequence, properly applied text manipulation also lets the practitioner increase computing efficiency. This step will be helpful for the deployment of the Support Vector Machine model and its hyperparameters optimization.
The data
The data we are going to use consists of 50,000 rows of movie reviews, which you can find at this link. The goal is to produce a high-performing sentiment analyzer by training it on the available rows. If you want to review what sentiment analysis is, I suggest a quick read of this article, which covers all the basics. The structure of the csv file is quite simple: it has two columns, one containing the reviews and the other the sentiment. Once the model has classified the reviews in the testing portion, we’ll be able to count how many were classified correctly, which gives the overall accuracy of the SVM model. The following shows what a sample review looks like:
I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.
As of now, it is very hard for an algorithm to find common patterns across all reviews. This is due to three main factors: the number of characters, the range of words, and special characters. The first two are intuitive. Regarding the third one, if you look carefully, you can spot HTML line breaks ("<br /><br />"). These, along with parentheses, question marks, commas, and periods, need to be eliminated because they do not add any useful information about sentiment.
Text Processing
Tokenization
Tokenization breaks a full document or a sentence down into a list of smaller units called tokens, typically words and punctuation marks, to allow more effective manipulation. For example, an algorithm has no interest in blank spaces, blank lines, or line breaks. The result of the tokenization process contains only words and punctuation.
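As a minimal sketch of the idea, the snippet below splits a sentence into word and punctuation tokens with a regular expression. The tokenize helper is hypothetical and uses only Python's standard library; libraries such as NLTK ship more robust tokenizers.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens, dropping whitespace."""
    # \w+ matches runs of letters, digits, and underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("The plot is simplistic,\n but the dialogue is witty.")
print(tokens)
```

Spaces and the line break disappear, while the comma and the period survive as separate tokens.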
Lemmatization and Stopwords
Lemmatization reduces words to their canonical, basic form. For example, the verbs writing, writes, wrote, and written can all be represented by the word write, their associated lemma. Lemmatization simplifies the process because the algorithm can refer to a single word instead of all its forms. Stopword removal, on the other hand, eliminates words that add little to no value to the sentiment computation. The articles "a" and "an", for instance, do not indicate any positive or negative sentiment.
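To make the two ideas concrete, here is a toy sketch that uses hand-written lookup tables. LEMMAS, STOPWORDS, and normalize are hypothetical names made up for illustration; in practice NLTK's WordNetLemmatizer and its stopwords corpus would take their place.

```python
# Hypothetical lookup tables for illustration only.
LEMMAS = {"writing": "write", "writes": "write", "wrote": "write", "written": "write"}
STOPWORDS = {"a", "an", "the"}


def normalize(tokens):
    """Map each token to its lemma when one is known, then drop stopwords."""
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [token for token in lemmas if token not in STOPWORDS]


result = normalize(["she", "wrote", "an", "essay", "and", "writes", "a", "novel"])
print(result)
```

Both "wrote" and "writes" collapse to "write", and the articles are dropped.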
Support Vector Machines
The Support Vector Machine (SVM) algorithm is a supervised learning model, with associated learning algorithms, that analyzes data for classification and regression. It was developed by Vladimir Vapnik with the original idea going back to 1979, right before the end of a relatively short AI winter lasting since 1973. The SVM classifier is a non-probabilistic classifier based on statistical learning. As with every supervised learning model, SVM needs labeled data to identify relevant patterns; only then can it be deployed on new records. The concept at the basis of SVM is linearly separable binary classification². It means that the data in class 1 and class 2 can be divided by a straight line or, in higher dimensions, a hyperplane. On top of that, the distinguishing feature of support vector machines is that they maximize the space (the margin) between class 1 and class 2 examples. The margin is bounded by two more lines that pass through the examples closest to the boundary, known as support vectors.
The potential applications of SVMs include handwriting recognition, intrusion detection, face detection, email classification, gene classification, and web page classification. SVMs can handle classification and regression problems on both linear and non-linear data, which makes them a versatile ML algorithm. SVMs often perform well compared to other algorithms when dealing with small datasets that have a large number of features.
Hyperparameters
Before defining a hyperparameter, it is important to define a "standard" parameter. When a model converges, we can say it has found the best combination of parameters to describe the general behavior of the data it was trained on. In the case of the SVM model, we can classify a record as positive if the following formula is true:
The opposite happens in the case of a negative record:
The parameters w and b determine whether the new record X is classified as positive or negative. X is also known as the input vector, w as the vector of weights, and b as the bias. Hyperparameters, on the other hand, are an external configuration that the practitioner gives the model; their values cannot be estimated from the training data. Hyperparameters can therefore be modified with the goal of discovering which combinations result in higher performance.
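A common way to write the positive and negative conditions described above, using the same symbols w, X, and b, is the sign of the linear decision function. This is a sketch of the standard linear decision rule; the threshold of 0 (rather than a margin of ±1) is an assumption.

```latex
\begin{aligned}
\text{positive record:} \quad & w \cdot X + b \ge 0 \\
\text{negative record:} \quad & w \cdot X + b < 0
\end{aligned}
```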
The SVM model is no exception to the rule and has two major hyperparameters that we are going to use to optimize the prediction process:
- C: the regularization parameter. The strength of the regularization is inversely proportional to C: a small C applies stronger regularization, which helps reduce overfitting, while a large C penalizes misclassified training examples more heavily and fits the training data more closely.
- Kernel: specifies the kernel type to be used in the algorithm. There are five possible options: 'linear', 'poly', 'rbf', 'sigmoid', and 'precomputed'. Each has its own characteristics.
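To give a feel for what a kernel computes, the sketch below implements the linear and RBF kernels for two small vectors using only the standard library. The function names and the value gamma=0.5 are illustrative assumptions, not settings taken from this project.

```python
import math


def linear_kernel(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma=0.5 is an arbitrary illustrative value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, y))  # similarity as a plain dot product
print(rbf_kernel(x, x))     # identical vectors give the maximum similarity of 1.0
print(rbf_kernel(x, y))     # similarity decays as the vectors move apart
```

The RBF kernel measures similarity through distance, which is what lets the SVM draw non-linear boundaries in the original feature space.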
Code Deployment
Text cleaning and pre-processing
The first portion of the code focuses on a basic use of text cleaning; many other techniques could be applied. For this article, we aim for a broader understanding of the most effective ones, such as lemmatization and stopword elimination:
- Import libraries such as pandas, re, nltk, and bs4. Make sure all these packages are installed before importing them.
- Stopwords need to be downloaded before nltk.corpus can use them, so make sure you run the commands "nltk.download('stopwords')" and "nltk.download('wordnet')" before running the whole script.
- The csv file can be imported and assigned to the df variable with pandas. The encoding is set to "Latin-1" to avoid an encoding error.
- At this point, we can map the two values contained in the sentiment column to ones and zeros instead of positive and negative.
- Next, we can set the stopwords and assign the WordNetLemmatizer() object to the lemmatizer variable.
- The clean_text function can be defined at this stage. The re library provides regular expression matching operations. To summarize, the code eliminates HTML line breaks and every special character. Text is then transformed into lowercase letters only. The lemmatization process is applied to each token. As a final step, stopwords are eliminated.
- The clean_text function is finally applied to each row of the "review" column, and a new column called "Processed Reviews" is created.
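The steps above can be sketched roughly as follows. This is not the original script: it swaps bs4 for a simple regular expression, uses a tiny hypothetical stopword set instead of NLTK's list, and exposes lemmatization as an optional lemmatize hook so that the example runs with the standard library alone.

```python
import re

# Tiny hypothetical stopword set; the article relies on NLTK's English stopword list.
STOPWORDS = {"a", "an", "the", "this", "was", "is", "i", "and", "but", "of", "to"}


def clean_text(text, lemmatize=lambda token: token):
    """Clean one review: strip HTML tags and special characters, lowercase,
    lemmatize each token, then drop stopwords.

    `lemmatize` is a hypothetical hook; NLTK's WordNetLemmatizer().lemmatize
    could be passed here. The default leaves tokens unchanged.
    """
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)  # replace punctuation and symbols with spaces
    tokens = text.lower().split()               # lowercase and split on whitespace
    tokens = [lemmatize(token) for token in tokens]
    return " ".join(token for token in tokens if token not in STOPWORDS)


sample = "The plot is simplistic, but the dialogue is witty!<br /><br />A great comedy."
cleaned = clean_text(sample)
print(cleaned)
```

With a pandas DataFrame like the one described above, the function could then be applied with something like df["Processed Reviews"] = df["review"].apply(clean_text).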
Each review is finally ready to be processed by the SVM algorithm. The result of this first major section is the following:
think wa wonderful way spend time hot summer weekend sit air condition theater watch light hearted comedy plot simplistic dialogue witty character likable even well bread suspect serial killer may disappoint realize match point 2 risk addiction think wa proof woody allen still fully control style many u grow love wa laugh one woody comedy year dare say decade never impress scarlet johanson manage tone sexy image jump right average spirit young woman may crown jewel career wa wittier devil wear prada interest superman great comedy go see friend
As you might notice, every period, comma, question mark, and parenthesis has been filtered out. There are no pronouns, and the line breaks are gone. By completing the first major step of text processing, we have reduced the complexity of the data. The model now has far fewer symbols and distinct words to process, which is likely to increase its generalization capabilities while maintaining high accuracy.
Model Deployment
The following code shows a simple deployment of the SVM model to calculate its accuracy without hyperparameters optimization.
- After importing the required functions from the sklearn library, we can define the input and the target variables.
- As in every ML process, the train/test split follows. In this case, the test size is 20% of the records in the data set, and the training set is the remaining 80%.
- For the project, I decided to use a Bag-of-Words (BoW) feature extractor with its default parameters. The count vectorizer creates a dictionary (vocabulary) and transforms both the X_train and X_test subsets according to it.
- The SVC() object is assigned to the SVM variable to instantiate the model, which is then fit on the training portion of the data set.
- Finally, we can run the command SVM.predict(), which deploys the model on new records for the testing portion. The classification report gives the user information on the accuracy by comparing the actual sentiment with the predicted one.
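As a rough sketch of these steps, the snippet below runs the same sequence on a handful of made-up reviews so that it finishes instantly. The toy data and the random_state value are assumptions for illustration; the variable names count_vect and SVM follow the ones used later in the article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in for the "Processed Reviews" and sentiment columns (hypothetical data).
reviews = [
    "wonderful witty comedy great fun",
    "great film love character",
    "wonderful story great acting",
    "love witty dialogue fun plot",
    "fun film wonderful ending",
    "boring plot bad acting",
    "terrible film waste time",
    "bad story boring character",
    "awful dialogue terrible ending",
    "waste time awful boring",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80/20 split; stratify keeps both classes in the train and test portions.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels
)

# Bag-of-Words with default parameters: learn the vocabulary on the training set only.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_test_counts = count_vect.transform(X_test)

# Default SVC, fit on the training portion and evaluated on the held-out portion.
SVM = SVC()
SVM.fit(X_train_counts, y_train)
predictions = SVM.predict(X_test_counts)
print(classification_report(y_test, predictions, zero_division=0))
```

With so few rows the scores themselves mean nothing; the point is only the order of the calls.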
The initial set of results is quite encouraging: the overall accuracy stands at 87% when testing the algorithm on 5,035 negative reviews and 4,965 positive ones. Compared to the Naive Bayes approach we took in a previous article, the SVM algorithm performed much better.
              precision    recall  f1-score   support
           0       0.89      0.86      0.87      5035
           1       0.86      0.89      0.87      4965
    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000
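For readers who want to see where the columns of the report come from, here is a small standard-library sketch that computes precision, recall, and F1 for one class. The precision_recall_f1 helper and the label lists are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Applying the same F1 formula to the class 0 row of the report (precision 0.89, recall 0.86) gives roughly 0.87, which matches the reported f1-score.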
Hyperparameters optimization
We now know the overall accuracy with the default settings for both the vectorizer and the SVM model. The objective of the following code is to find the combination of hyperparameters that best increases the model performance:
- We can start by importing the different classes of the sklearn library, such as RepeatedStratifiedKFold, GridSearchCV, SVC, Pipeline, and CountVectorizer.
- Then we can create a pipeline. In computing, the concept of a pipeline most often refers to a data pipeline: a group of data processing elements where the output of one element is the input of the next. The first element of the pipeline is CountVectorizer(), which we named "vect", while the second element is SVC(), which we named "SVM". Informally speaking, we need a pipeline to allow the cross-validation process.
- The parameters’ list is built so that each step name of the pipeline is associated with the name of a parameter and its values. For example, the CountVectorizer class includes the parameters max_df and ngram_range. The name vect__max_df tells us that the parameter max_df belongs to the "vect" step previously defined in the pipeline section.
- The grid search combines the information found in the pipeline and the parameter grid to find the combination of hyperparameters that maximizes the SVM performance. Of course, the concept is much more complicated than that and will be covered in future articles. As of now, we are interested in the mere mechanics of the code. The grid search evaluates every single hyperparameter combination on our data. Of course, 50,000 rows would require a lot of computation each time, which is why I reduced the number of rows to only 5,000 for this particular scenario.
- The last section organizes and summarizes all the results by reporting the average accuracy and the hyperparameters it was achieved with.
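The mechanics described above can be sketched as follows on a few made-up reviews. The step names "vect" and "SVM" match the keys in the results reported below; the toy data, the grid values, and the cross-validation settings are assumptions chosen so that the search finishes quickly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for the cleaned reviews and their labels (hypothetical data).
reviews = [
    "wonderful witty comedy great fun",
    "great film love character",
    "wonderful story great acting",
    "love witty dialogue fun plot",
    "fun film wonderful ending",
    "brilliant cast love story",
    "boring plot bad acting",
    "terrible film waste time",
    "bad story boring character",
    "awful dialogue terrible ending",
    "waste time awful boring",
    "dull cast bad plot",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The output of the vectorizer step is the input of the classifier step.
pipeline = Pipeline([("vect", CountVectorizer()), ("SVM", SVC())])

# Illustrative grid; keys follow the <step name>__<parameter name> convention.
param_grid = {
    "vect__max_df": [0.5, 1.0],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "SVM__C": [1, 50],
    "SVM__kernel": ["linear", "rbf"],
}

# Small repeated stratified cross-validation; the article's exact settings are not shown.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=1)
grid_search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=cv)
grid_result = grid_search.fit(reviews, labels)

# Summarize: best combination first, then the mean and standard deviation of each one.
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_["mean_test_score"]
stds = grid_result.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, grid_result.cv_results_["params"]):
    print("%f (%f) with: %r" % (mean, std, params))
```

The grid has 2 × 2 × 2 × 2 = 16 combinations, and each one is fit once per fold, which is why the search on the real data takes so long.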
The code runs for more than an hour and the results are the following:
Best: 0.845663 using {'SVM__C': 50, 'SVM__kernel': 'rbf', 'vect__max_df': 0.5, 'vect__ngram_range': (1, 2)}
0.597360 (0.017354) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.1, 'vect__ngram_range': (1, 1)}
0.509796 (0.001329) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.1, 'vect__ngram_range': (1, 2)}
0.506197 (0.001325) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.1, 'vect__ngram_range': (1, 3)}
0.614556 (0.013541) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.2, 'vect__ngram_range': (1, 1)}
It seems that the best hyperparameters are:
- C=50
- the kernel is the rbf type
- max_df = 0.5
- the ngram range is (1,2)
The accuracy score calculated on only 5,000 records is about 84.6%. Of course, it is lower than the one we obtained at first. This time, though, the model was trained on far fewer records than the 40,000 we used previously, which explains the difference in performance. It is now time to run the model on the full data set with the adjusted hyperparameters and re-check the final score. To run the model with the new hyperparameters, two changes to the code are necessary:
- "count_vect = CountVectorizer()" will become "count_vect = CountVectorizer(ngram_range=(1,2), max_df=0.5)", as we are telling the tool to consider single words and pairs of consecutive words (n-grams) and to ignore terms that appear in more than 50% of the documents (max_df).
- "SVM = SVC()" will become "SVM = SVC(C=50, kernel='rbf')", as we are telling the SVM to use 50 as the C parameter and rbf as the kernel.
After running for about 1 hour and 30 minutes, the code produced the following results:
              precision    recall  f1-score   support
           0       0.90      0.87      0.88      5035
           1       0.87      0.90      0.89      4965
    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000
As you can see, the accuracy increased by only 0.01, going from 0.87 to 0.88. Is it worth the computational effort? The answer is: "it depends". In this case, the model is trying to predict the sentiment of movie reviews, and an 88% accuracy and an 87% one are extremely close. If we were predicting the probability of a patient having diabetes based on medical conditions, then an 88% accuracy would not be good enough. In a previous article, we reached a 65% accuracy with the same IMDB data but a different algorithm. We have witnessed a significant increase just by changing the classifier, performing text cleaning, and optimizing the hyperparameters.
One last test shows us that a potential review saying "The movie was really good, I could have not imagined a better ending" is correctly classified by the algorithm as a [1] (a positive review), whereas a negative one saying "The movie was generally bad, the plot was boring and the characters badly interpreted" scores a [0] (a negative review). In general, it is good practice to deploy ML models on data produced directly by the practitioner, because this check can quickly expose problems.
##Testing Algorithm on single sentences
#Defining test sentences
test = ['The movie was really good, I could have not imagined a better ending']
test_1 = ['The movie was generally bad, the plot was boring and the characters badly interpreted']
test = count_vect.transform(test).toarray()
test_1 = count_vect.transform(test_1).toarray()
#Printing prediction
print(SVM.predict(test))
print(SVM.predict(test_1))
output:
[1]
[0]
Just a quick reminder before we conclude. If you have read this far and you feel you understood the content, try to take this as a signal for you to start developing your own project. Kaggle is an astonishing platform full of great data sets waiting to be explored and deployed. I can’t recommend it enough.
Conclusion
Even though the accuracy score of the base feature extractor and base model configurations was already high, the hyperparameters optimization process has been able to increase it further. As always, machine learning amazes me in the way it can "learn" and classify records more rapidly and efficiently than humans. It is truly amazing that a "simple" statistical algorithm can potentially replace humans in terms of speed and accuracy. Of course, there are limitations but technology is finally available for virtually everyone to develop their own models and study this extraordinary field.
As a final note, if you liked the content please consider dropping a follow to be notified when new articles are published. If you have any considerations to make about the article, write them in the comments! I’d love to read them 🙂 Thank you for reading!
References
[1] Wikipedia Contributors. (2022, March 9). Text processing. Retrieved March 14, 2022, from Wikipedia website: https://en.wikipedia.org/wiki/Text_processing
[2] Wikipedia Contributors. (2022, March 12). Support-vector machine. Retrieved March 20, 2022, from Wikipedia website: https://en.wikipedia.org/wiki/Support-vector_machine
[3] Appel, Orestes, et al. "A hybrid approach to the sentiment analysis problem at the sentence level." Knowledge-Based Systems 108 (2016): 110–124.