
URL: https://www.analyticsvidhya.com/blog/2021/04/a-guide-to-feature-engineering-in-nlp/

A Guide to Feature Engineering in NLP

Mohammad Last Updated : 16 Oct, 2024
7 min read
This article was published as a part of the Data Science Blogathon.

Overview

  • Feature engineering in NLP is about capturing the context of the text as usable features.
  • In this blog, we will look at some common feature engineering techniques in NLP.
  • We will compare the results of a classification task with and without feature engineering.

Table of Contents

  1. Introduction
  2. NLP task overview
  3. List of features with code
  4. Implementation
  5. Results comparison with and without doing feature engineering
  6. Conclusion

Introduction

“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” – Andrew Ng

Feature engineering is one of the most important steps in machine learning. It is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Think of a machine learning algorithm as a learning child: the more accurate the information you provide, the better it will be able to interpret that information. Focusing first on our data will give us better results than focusing only on models. Feature engineering helps us create better data, which helps the model understand it and produce reasonable results.

NLP is a subfield of artificial intelligence that deals with the interaction between humans and machines through natural language. To understand a natural language, you need to understand how we write a sentence and how we express our thoughts using different words, signs, special characters, etc. Basically, we should understand the context of a sentence to interpret its meaning.

If we can use these contexts as features and feed them to our model, then the model will be able to understand the sentence better. Some of the common features that we can extract from a sentence are the number of words, number of capital words, number of punctuation marks, number of unique words, number of stopwords, average sentence length, etc. We can define these features based on the data set we are using. In this blog, we will use a Twitter data set, so we can add some other features like the number of hashtags, number of mentions, etc. We will discuss them in detail in the coming sections.

NLP task overview

To understand the feature engineering task in NLP, we will implement it on a Twitter dataset. We will be using the COVID-19 Fake News Dataset. The task is to classify each tweet as Fake or Real. The dataset is divided into train, validation, and test sets. Below is the distribution:

Split        Real   Fake   Total
Train        3360   3060   6420
Validation   1120   1020   2140
Test         1120   1020   2140

List of features

I will list a total of 15 features that we can use for the above dataset. The number of features depends entirely on the type of dataset you are using.

1. Number of Characters

Count the number of characters present in a tweet.

def count_chars(text):
 return len(text)

2. Number of words

Count the number of words present in a tweet.

def count_words(text):
 return len(text.split())

3. Number of capital characters

Count the number of capital characters present in a tweet.

def count_capital_chars(text):
    count = 0
    for i in text:
        if i.isupper():
            count += 1
    return count

4. Number of capital words

Count the number of capital words present in a tweet.

def count_capital_words(text):
 return sum(map(str.isupper,text.split()))
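
A quick check of how `str.isupper` treats individual tokens may be useful here. It is true only when a token contains at least one cased character and all of its cased characters are uppercase, so a token such as `COVID-19` counts as a capital word while a bare number does not. The sample sentence below is made up for illustration.

```python
def count_capital_words(text):
    # str.isupper is True when the token has at least one cased character
    # and every cased character in it is uppercase.
    return sum(map(str.isupper, text.split()))

sample = "BREAKING: WHO says COVID-19 cases rose by 123 today"  # hypothetical tweet
capital_tokens = [w for w in sample.split() if w.isupper()]

print(capital_tokens)
print(count_capital_words(sample))
```

Note that punctuation attached to a token (the colon in `BREAKING:`) does not stop it from being counted, because punctuation is not a cased character.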

5. Count the number of punctuations

In this function, we return a dictionary that maps each of the 32 punctuation characters to its count in the tweet. Each entry can be used as a separate feature, which I will discuss in the next section.

def count_punctuations(text):
    # The single quote and the backslash must be escaped inside the string
    punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    d = dict()
    for i in punctuations:
        d[str(i) + ' count'] = text.count(i)
    return d
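
As an alternative sketch, the same 32 characters are available as `string.punctuation` in the Python standard library, which avoids escaping quotes and backslashes by hand. Swapping it in for the hand-typed string is my assumption, not something the original code does, and the sample tweet is hypothetical.

```python
import string

def count_punctuations(text):
    # string.punctuation holds the 32 ASCII punctuation characters.
    return {f"{ch} count": text.count(ch) for ch in string.punctuation}

sample = "Is this real?! #COVID19 @WHO says: wash hands, stay home."  # hypothetical tweet
counts = count_punctuations(sample)

# Keep only the punctuation marks that actually occur in the sample.
nonzero = {key: value for key, value in counts.items() if value > 0}

print(len(counts))
print(nonzero)
```

Every tweet yields the same 32 keys, so the dictionaries can later be expanded into 32 aligned feature columns.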

6. Number of words in quotes

Count the number of words enclosed in single or double quotation marks.

import re

def count_words_in_quotes(text):
    # Match text wrapped in single quotes or in double quotes
    x = re.findall(r"'[^']*'|\"[^\"]*\"", text)
    count = 0
    for i in x:
        t = i[1:-1]  # drop the surrounding quote characters
        count += count_words(t)
    return count
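
To see how quote matching behaves in practice, here is a small self-contained sketch. The pattern is one reasonable way to match quoted spans, not necessarily the exact expression the author used, and both sample strings are made up. The second sample shows a limitation worth keeping in mind: apostrophes in contractions are indistinguishable from single quotes.

```python
import re

def count_words_in_quotes(text):
    # Find spans wrapped in single quotes or in double quotes.
    quoted = re.findall(r"'[^']*'|\"[^\"]*\"", text)
    # Strip the surrounding quote characters, then count whitespace-separated words.
    return sum(len(span[1:-1].split()) for span in quoted)

quoted_sample = 'He said "stay at home" and called it \'the new normal\'.'
apostrophe_sample = "Don't panic, it's fine"

print(count_words_in_quotes(quoted_sample))      # two quoted spans, three words each
print(count_words_in_quotes(apostrophe_sample))  # apostrophes are mistaken for quotes
```

If contractions are common in your data, this feature will be noisy, and counting only double-quoted spans may be the safer choice.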

7. Number of sentences

Count the number of sentences in a tweet.

import nltk  # sent_tokenize needs the NLTK tokenizer data to be downloaded

def count_sent(text):
    return len(nltk.sent_tokenize(text))

8. Count the number of unique words

Count the number of unique words in a tweet.

def count_unique_words(text):
 return len(set(text.split()))

9. Count of hashtags

Since we are using the Twitter dataset we can count the number of times users used the hashtag.

def count_htags(text):
    x = re.findall(r'(#\w[A-Za-z0-9]*)', text)
    return len(x)

10. Count of mentions

On Twitter, people often reply to or mention someone in their tweet, so the number of mentions can also be treated as a feature.

def count_mentions(text):
    x = re.findall(r'(@\w[A-Za-z0-9]*)', text)
    return len(x)
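
The two counters can be checked on a sample tweet with the sketch below. It uses the slightly simpler pattern `\w+` (one or more letters, digits, or underscores) after the `#` or `@` symbol, which is my simplification rather than the article's exact expression. The tweet text is hypothetical.

```python
import re

def count_htags(text):
    # '#' followed by one or more word characters (letters, digits, underscore)
    return len(re.findall(r"#\w+", text))

def count_mentions(text):
    # '@' followed by one or more word characters
    return len(re.findall(r"@\w+", text))

sample = "Thanks @WHO and @CDCgov for the update #COVID19 #StaySafe"  # hypothetical tweet

print(count_htags(sample), count_mentions(sample))
```

Be aware that a pattern this simple also counts the `@domain` part of an email address as a mention.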

11. Count of stopwords

Here we will count the number of stopwords used in a tweet.

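# Requires the NLTK stopword list and tokenizer data (see nltk.download)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def count_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    stopwords_x = [w for w in word_tokens if w in stop_words]
    return len(stopwords_x)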

12. Calculating average word length

This can be calculated by dividing the counts of characters by counts of words.

df['avg_wordlength'] = df['char_count']/df['word_count']

13. Calculating average sentence length

This can be calculated by dividing the counts of words by the counts of sentences.

df['avg_sentlength'] = df['word_count']/df['sent_count']

14. Unique words vs word count feature

This feature is the ratio of unique words to the total number of words.

df['unique_vs_words'] = df['unique_word_count']/df['word_count']

15. Stopwords count vs words counts feature

This feature is the ratio of the stopword count to the total number of words.

df['stopwords_vs_words'] = df['stopword_count']/df['word_count']
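
One thing to watch with the four ratio features is an empty or whitespace-only tweet, where the word count is zero. In pandas, dividing by zero yields `inf` or `NaN` instead of raising an error, and those values can cause trouble later in training. The sketch below shows one way to guard against this in plain Python; the helper names `safe_ratio` and `ratio_features` are hypothetical and not part of the article's code.

```python
def safe_ratio(numerator, denominator):
    # Fall back to 0.0 when the denominator is zero (e.g. an empty tweet).
    return numerator / denominator if denominator else 0.0

def ratio_features(text):
    words = text.split()
    word_count = len(words)
    return {
        # Character count includes spaces, matching len(text) in count_chars.
        "avg_wordlength": safe_ratio(len(text), word_count),
        "unique_vs_words": safe_ratio(len(set(words)), word_count),
    }

print(ratio_features("stay home stay safe"))
print(ratio_features(""))
```

Whether 0.0 is the right fallback depends on the model. Dropping empty tweets before feature extraction is another option.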

Implementation

You can download the dataset from here. After downloading, we can start implementing all the features we defined above. Because the focus is on feature engineering, we will keep the approach simple by using TF-IDF and simple pre-processing. All the code is available in my GitHub repository: https://github.com/ahmadkhan242/Feature-Engineering-in-NLP.

  • Reading train, validation, and test set using pandas.

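    import pandas as pd

    train = pd.read_csv("train.csv")
    val = pd.read_csv("validation.csv")
    test = pd.read_csv("testWithLabel.csv")
    
    # For this task we will combine the train and validation dataset and then use
    # simple train test split from sklearn.
    # ignore_index=True gives the combined frame a fresh 0..n-1 index, which the
    # index-based merges later on rely on.
    df = pd.concat([train, val], ignore_index=True)
    df.head()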
  • Applying the feature extraction functions defined above to the train and test sets.

    df['char_count'] = df["tweet"].apply(lambda x:count_chars(x))
    df['word_count'] = df["tweet"].apply(lambda x:count_words(x))
    df['sent_count'] = df["tweet"].apply(lambda x:count_sent(x))
    df['capital_char_count'] = df["tweet"].apply(lambda x:count_capital_chars(x))
    df['capital_word_count'] = df["tweet"].apply(lambda x:count_capital_words(x))
    df['quoted_word_count'] = df["tweet"].apply(lambda x:count_words_in_quotes(x))
    df['stopword_count'] = df["tweet"].apply(lambda x:count_stopwords(x))
    df['unique_word_count'] = df["tweet"].apply(lambda x:count_unique_words(x))
    df['htag_count'] = df["tweet"].apply(lambda x:count_htags(x))
    df['mention_count'] = df["tweet"].apply(lambda x:count_mentions(x))
    df['punct_count'] = df["tweet"].apply(lambda x:count_punctuations(x))
    df['avg_wordlength'] = df['char_count']/df['word_count']
    df['avg_sentlength'] = df['word_count']/df['sent_count']
    df['unique_vs_words'] = df['unique_word_count']/df['word_count']
    df['stopwords_vs_words'] = df['stopword_count']/df['word_count']
    # SIMILARLY YOU CAN APPLY THEM ON TEST SET
  • Adding some extra features using punctuation count

    We will create a DataFrame from the dictionaries stored in the “punct_count” column (returned by count_punctuations) and then merge it with the main dataset.

    df_punct = pd.DataFrame(list(df.punct_count))
    test_punct = pd.DataFrame(list(test.punct_count))
    
    # Merging punctuation DataFrame with main DataFrame
    df = pd.merge(df, df_punct, left_index=True, right_index=True)
    test = pd.merge(test, test_punct,left_index=True, right_index=True)
    # We can drop "punct_count" column from both df and test DataFrame
    df.drop(columns=['punct_count'],inplace=True)
    test.drop(columns=['punct_count'],inplace=True)
    df.columns

Final columns list

  • Pre-processing

    We performed simple pre-processing steps: removing links, user names, numbers, double spaces, and punctuation, and lower-casing the text.

    import re

    def remove_links(tweet):
        '''Takes a string and removes web links from it'''
        tweet = re.sub(r'http\S+', '', tweet)  # remove http links
        tweet = re.sub(r'bit\.ly/\S+', '', tweet)  # remove bitly links
        tweet = tweet.replace('[link]', '')  # remove the literal [link] marker
        return tweet

    def remove_users(tweet):
        '''Takes a string and removes retweet and @user information'''
        tweet = re.sub(r'(RT\s@[A-Za-z]+[A-Za-z0-9_-]+)', '', tweet)  # remove retweet
        tweet = re.sub(r'(@[A-Za-z]+[A-Za-z0-9_-]+)', '', tweet)  # remove tweeted at
        return tweet

    my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

    def preprocess(sent):
        sent = remove_users(sent)
        sent = remove_links(sent)
        sent = sent.lower()  # lower case
        # re.escape keeps characters such as ] and ^ literal inside the character class
        sent = re.sub('[' + re.escape(my_punctuation) + ']+', ' ', sent)  # strip punctuation
        sent = re.sub(r'\s+', ' ', sent)  # remove double spacing
        sent = re.sub('([0-9]+)', '', sent)  # remove numbers
        sent_token_list = [word for word in sent.split(' ')]
        sent = ' '.join(sent_token_list)
        return sent

    df['tweet'] = df['tweet'].apply(lambda x: preprocess(x))
    test['tweet'] = test['tweet'].apply(lambda x: preprocess(x))
  • Encoding text

    We will encode our text data using TF-IDF. We fit the vectorizer on the training tweets, use it to transform both the train and test tweets, and then merge the result with all feature columns.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    train_tf_idf_features = vectorizer.fit_transform(df['tweet']).toarray()
    test_tf_idf_features = vectorizer.transform(test['tweet']).toarray()
    
    # Converting above list to DataFrame
    train_tf_idf = pd.DataFrame(train_tf_idf_features)
    test_tf_idf = pd.DataFrame(test_tf_idf_features)
    
    # Separating train and test labels from all features
    train_Y = df['label']
    test_Y = test['label']
    
    #Listing all features
    features = ['char_count', 'word_count', 'sent_count',
     'capital_char_count', 'capital_word_count', 'quoted_word_count',
     'stopword_count', 'unique_word_count', 'htag_count', 'mention_count',
     'avg_wordlength', 'avg_sentlength', 'unique_vs_words',
     'stopwords_vs_words', '! count', '" count', '# count', '$ count',
     '% count', '& count', "' count", '( count', ') count', '* count',
     '+ count', ', count', '- count', '. count', '/ count', ': count',
     '; count', '< count', '= count', '> count', '? count', '@ count',
     '[ count', '\\ count', '] count', '^ count', '_ count', '` count',
     '{ count', '| count', '} count', '~ count']
    
    # Finally merging all features with above TF-IDF. 
    train = pd.merge(train_tf_idf,df[features],left_index=True, right_index=True)
    test = pd.merge(test_tf_idf,test[features],left_index=True, right_index=True)
  • Training

    For training, we will be using the random forest algorithm from the scikit-learn library.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(train, train_Y, test_size=0.2, random_state = 42)
    # Random Forest Classifier
    clf_model = RandomForestClassifier(n_estimators = 1000, min_samples_split = 15, random_state = 42)
    clf_model.fit(X_train, y_train)
    _RandomForestClassifier_prediction = clf_model.predict(X_test)
    val_RandomForestClassifier_prediction = clf_model.predict(test)

Result comparison

For comparison, we first trained our model on the above dataset using feature engineering techniques and then without them. In both approaches, we pre-processed the dataset using the same method as described above, and TF-IDF was used to encode the text data. You can use whatever encoding technique you prefer, such as word2vec, GloVe, etc.
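
For readers who want to see what a TF-IDF encoding computes, here is a simplified standard-library sketch. It uses a smoothed inverse document frequency and L2 row normalization, which I understand to be the defaults of scikit-learn's `TfidfVectorizer`, but it tokenizes by plain whitespace splitting, so its output will not match the library exactly. The function name `tfidf` and the three documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency; rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        # L2-normalize each row so tweet length does not dominate.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["masks stop the virus", "the virus is a hoax", "wash hands stop the spread"]
vocab, rows = tfidf(docs)

for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(term, round(weight, 3))
```

In the first document, `masks` (which appears in one document) receives a larger weight than `the` (which appears in all three). That down-weighting of common terms is the property the encoding is meant to provide.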

1. Without using Feature Engineering techniques

[Image: results without using feature engineering techniques]
Here validation accuracy is test accuracy.

2. Using Feature Engineering techniques

[Image: results using feature engineering techniques]
Here validation accuracy is test accuracy.

From the above results, we can see that feature engineering techniques helped us increase our F1 score from 0.90 to 0.92 on the train set and from 0.90 to 0.94 on the test set.
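
As a reminder of what the F1 score measures, the sketch below computes it by hand for a binary task as the harmonic mean of precision and recall. The label values `"fake"` and `"real"`, the choice of `"fake"` as the positive class, and the tiny prediction lists are all assumptions for illustration. The article does not state which averaging was used for the reported numbers.

```python
def f1_score_binary(y_true, y_pred, positive="fake"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

y_true = ["fake", "fake", "real", "real", "fake"]  # hypothetical labels
y_pred = ["fake", "real", "real", "fake", "fake"]  # hypothetical predictions

print(round(f1_score_binary(y_true, y_pred), 3))
```

In practice, scikit-learn's `f1_score` and `classification_report` in `sklearn.metrics` provide the same metric along with per-class breakdowns.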

Conclusion

The above results show that with feature engineering we can achieve greater accuracy using classical machine learning algorithms. Transformer-based models are time-consuming and resource-expensive. If we do feature engineering in the right way, that is, after analyzing our dataset, we can get comparable results.

We can also do other feature engineering, such as counting the number of emojis used, the types of emojis used, the frequencies of unique words, etc. We can define our features by analyzing the dataset. I hope you have learned something from this blog; do share it with others. Check out my personal machine learning blog (https://code-ml.com/) for new and exciting content on different domains of ML and AI.

About the Author

Mohammad Ahmad (B.Tech)
LinkedIn - https://www.linkedin.com/in/mohammad-ahmad-ai/
Personal Blog - https://code-ml.com/
GitHub - https://github.com/ahmadkhan242
Twitter - https://twitter.com/ahmadkhan_242

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Responses From Readers

Madhav

Can't find the dataset on the given website. Please help with some link.

Marius Estrada

feature engineering is really important especially on creating ml on unseen dataset
