VOOZH about

URL: https://www.analyticsvidhya.com/blog/2015/10/6-practices-enhance-performance-text-classification-model/

โ‡ฑ 6 Practices to enhance the performance of a Text Classification Model


India's Most Futuristic AI Conference Is Back โ€“ Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Reading list

6 Practices to enhance the performance of a Text Classification Model

Shivam Bansal Last Updated : 29 Oct, 2024
6 min read

Introduction

A few months back, I was working on creating a sentiment classifier for Twitter data. After trying the common approaches, I was still struggling to get good accuracy on the results.

Text classification problems and algorithms have been around for a while now. They are widely used for Email Spam Filtering by the likes of Google and Yahoo, for conducting sentiment analysis of twitter data and automatic news categorization in google alerts.

However, while dealing with enormous amount of text data, modelโ€™s performance and accuracy becomes a challenge. The performance of a text classification model is heavily dependent upon the type of words used in the corpus and type of features created for classification. I used several practices to improve the results of my model.

In this article, Iโ€™ve illustrated the six best practices to enhance the performance and accuracy of a text classification model which I had used:

1. Domain Specific Features in the Corpus

For a classification problem, it is important to choose the test and training corpus very carefully. For a variety of features to act in the classification algorithm, domain knowledge plays an integral part.

๐Ÿ‘ a1
๐Ÿ‘ a2

For example, if the problem is โ€œSentiment Classification on social media dataโ€, the training corpus should consist of the data from social sources like twitter and facebook.

On the other hand if the problem is โ€œSentiment Classification for news dataโ€, the corpus should consist of data from news sources. This is because the vocabulary of a corpus varies with domains. Social Media contains a lot of slangs and improper keywords like โ€œawsum, lol, goooodโ€ etc which are absent in any of the formal corpus such as news, blogs etc.

Letโ€™s take an example of a naive bayes classification problem where the task is to classify the statements into two Classes: โ€œClass A and Class Bโ€. Weโ€™ve training data corpus and a test data corpus.  As mentioned here, the training corpus should contain the features from relevant corpus. Therefore, training corpus should consist of data points such as:

training_data = [
 ('The apple iphone was gooooooood, ##$! http://www.apple.com', 'Class A'),
 ('I do not enjoy my job, holy shiiit', 'Class B'),
 ("I ain't feeling lol today crappp.", 'Class B'),
 ("I feel amazing!", 'Class A')
 ]

test_data = [
 ('I luv this phones.', 'Class A'),
 ('This is an amaaaazingg company!', 'Class A'),
 ('I am feeling very gooood about these features lol.', 'Class A'),
 ('This is my bestest phones.', 'Class A'),
 ("What an awesomee player", 'Class A'),
 ('I do not like this phone #apple ', 'Class B'),
 ('I am tired of this stuff.', 'Class B'),
 ('They are my worst fears! . check them out here: http://goo.gl/qdjk3rf ', 'Class B'),
 ('My boss lives in India, He is horrible.', 'Class B')
 ]

print(training_data)

2. Use An Exhaustive Stopword List

Stopwords are defined as the most commonly used words in a corpus. Most commonly used stopwords are โ€œa, the, of, on, โ€ฆ etcโ€. These words are used to define the structure of a sentence. But, are of no use in defining the context. Treating these type of words as feature words would result in poor performance in text classification. These words can be directly ignored from the corpus in order to obtain a better performance. Apart from language stopwords, There are some other supporting words as well which are of lesser importance than any other terms. These includes:

Language Stopwords โ€“ a, of, on, the โ€ฆ etc

Location Stopwords โ€“ Country names, Cities names etc

Time Stopwords โ€“ Name of the months and days (january, february, monday, tuesday, today, tomorrow โ€ฆ) etc

Numerals Stopwords โ€“ Words describing numerical terms ( hundred, thousand, โ€ฆ etc)

After removal of these entities from the test data, test data would be reformed to following:

test_data = [
    (' luv  phones.', 'Class A'),
    ('   amaaaazingg company!', 'Class A'),
    ('  feeling very gooood about  features lol.', 'Class A'),
    ('   bestest phones.', 'Class A'),
    ("  awesomee player", 'Class A'),
    ('  not like  phone #apple  ', 'Class B'),
    ('  tired   stuff.', 'Class B'),
    ('   worst fears! . check  out here: http://goo.gl/qdjk3rf ', 'Class B'),
    (' boss lives,   horrible.', 'Class B')
x]

3. Noise Free Corpus

In most of the data science problems, it is recommended to undertake a classification algorithm on a cleaned corpus rather than a noisy corpus. Noisy corpus refers to unimportant entities of the text such as punctuations marks, numerical values, links and urls etc. Removal of these entities from the text would increase the accuracy, because size of sample space of possible features set decreases.

Yet, it becomes essential to note that, these entities should only be removed if the classification problem does not use these entities. For example โ€“ emoticons such as  :), :P, ๐Ÿ™  are important for sentiment classification but may not be important for other text classifications.

After eliminating the noisy entities like hashtags, url text, the training corpus would look like:

test_data = [
    ('luv phones', 'Class A'),
    ('amaaaazingg company', 'Class A'),
    ('feeling very gooood about features lol', 'Class A'),
    ('bestest phones', 'Class A'),
    ("awesomee player", 'Class A'),
    ('not like phone apple  ', 'Class B'),
    ('tired stuff', 'Class B'),
    ('worst fears check out here', 'Class B'),
    ('boss lives horrible', 'Class B')
]


4. Eliminating features with extremely low frequency

Keywords which occur in lesser frequency in the corpus usually does not play a role in text classification. One can get rid of these low occurring features, resulting in better performance of the model. For example โ€“ if the frequency counts of words of a corpus looks like the following:It is clear that terms โ€œfigโ€ and โ€œdaleโ€ have occurred in low frequency as compared to other terms. Hence, if we choose a threshold of 10,  all keywords with less frequency can be ignored, resulting in good accuracy.


5. Normalized Corpus

Words are the integral part of any classification technique. However, these words are often used with different variations in the text depending on their grammar (verb, adjective, noun, etc.). It is always a good practice to normalize the terms to their root forms. This technique is known as Lemmatization. For example, the words:

  • Playing
  • Player
  • Plays
  • Play
  • Players
  • Played

All can be normalized down to the word โ€œPlayโ€ as far as the classifier is concerned. Performance will be good when there is single feature for ten variations rather than ten features for each variations.

test_data = [
    ('luv phone', 'Class A'),
    ('amazing company', 'Class A'),
    ('feeling very good about feature lol', 'Class A'),
    ('best phone', 'Class A'),
    ("awesome player", 'Class A'),
    ('not like phone apple', 'Class B'),
    ('tired stuff', 'Class B'),
    ('worst fear check out here', 'Class B'),
    ('boss live horrible', 'Class B')
]

6. Use Complex Features: n-grams and part of speech tags

In some cases, features as the combination of words provides better significance rather than considering single words as features. Combination of N words together are called N-grams. It is known that Bigrams are the most informative N-Gram combinations. Adding bigrams to feature set will improve the accuracy of text classification model.

Similarly considering Part of Speech tags combined with with words/n-grams will give an extra set of feature space. also increase the classifications. For example โ€“

itโ€™s better to train the model, such that word โ€œbookโ€ when used as NOUN means โ€œbook of pagesโ€, when used as VERB means to โ€œbook a ticket or something elseโ€.

 ๐Ÿ‘ 12

End Notes

In this article, we discussed few practices to improve the accuracy of a text classifier model. These gave me an improvement of ~10% โ€“ 20% in accuracy depending on the use case.

This is obviously not a complete list, but it provides a nice introduction for optimization of text classification algorithms. If you feel there are any other techniques which I have missed, feel free to share in comments section below.

If you like what you just read & want to continue your analytics learning, subscribe to our emailsfollow us on twitter or like our facebook page.

Shivam Bansal is a data scientist with exhaustive experience in Natural Language Processing and Machine Learning in several domains. He is passionate about learning and always looks forward to solving challenging analytical problems.

Login to continue reading and enjoy expert-curated content.

Free Courses

Building a Deep Research AI Agent

Build a Research & Report Agent with LangGraph & OpenAI for under $1!

Introduction to Transformers and Attention Mechanisms

Learn attention mechanisms, RNNs, Seq2Seq, BERT & NLP applications.

Getting Started with Large Language Models

Embark on an LLM journey: Master NLP and model training

Nano Course: Building Large Language Models for Code

Train Code LLMs from scratch: curate data, evaluate & build Starcoder (15B)

DeepSeek from Scratch; Architectural Components

DeepSeek from Scratch: Learn input, self-attention, RoPE & more.

Responses From Readers

Karthikeyan

Hi Shivam, Good article. I have two questions: 1. Is Lemmatization and stemming of words are same? 2. I would like to know your experience with stemming of the words. Does that really improves the performance? I was trying to use stemdocument function in R. For me having the function didnt reduce the accuracy of the model, however if you see the words like organization has been stemmed to organ. Which is wrong in my case. Hence i am forced to remove the stemming from my work. 3. How point number 4 discussed above( removing the less frequency words) and TF-IDF are related / unrelated? In my understanding IDF reduces the high frequency words. If this understanding correct, is the point no 4 and IDF are opposite in meaning? Kindly clarify. Thanks in advance.

Hi Karthikeyan 1. Stemming and Lemmatization are two different processes.: Stemming is much simpler process in which, suffixes are removed from a word. It involves various if - else rules and conditions, to convert the word to a root form. Lemmatization on the other hand, is more advanced approach, which converts the word into its lemma. It takes care of grammar, vocabulary and dictionary importance of a word while the conversion. for example: in Stemming: "driving" will be reduced to "driv" but in lemmatization it will be reduced to "drive" 2. As per my experience with text mining, stemming is not a good tool for enhancing the performance. As stemming results in loss of some information. 3. Point 4 and TF-IDF are somewhat similar in practical sense. IDF also does the same thing by giving less priority to low frequency words.

123 1
Karthikeyan

Thanks for the clarifications. Well explained !!!

123 456
Hemant Chaudhary

Hi Shivam The article is really good you put all the models in one place providing a really good insight of classification. I did the news classification but for that i have to perform a lot of research and their are little number of articles which provide such a good and point-to-point insight on this topic. Appreciate your work and time to write this article. Keep it up mate.

123 1

Hi Hemant, Thanks ! I will keep posting the useful articles.

123 456

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
๐Ÿ‘ Av Logo White

Continue your learning for FREE

Forgot your password?
๐Ÿ‘ Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

๐Ÿ‘ Popup Banner
๐Ÿ‘ AI Popup Banner