![]() |
VOOZH | about |
Natural Language Processing is referred to as NLP. It is a subset of artificial intelligence that enables machines to comprehend and analyze human languages. Text or audio can be used to represent human languages.
The natural language processing (NLP) pipeline refers to the sequence of processes involved in analyzing and understanding human language. The following is a typical NLP pipeline:
The basic processes for all the above tasks are the same. Here we have discussed some of the most common approaches which are used during the processing of text data.
In comparison to general machine learning pipelines, In NLP we need to perform some extra processing steps. The region is very simple that machines don't understand the text. Here our biggest problem is How to make the text understandable for machines. Some of the most common problems we face while performing NLP tasks are mentioned below.
As we know, For building the machine learning model we need data related to our problem statements, Sometimes we have our data and Sometimes we have to find it. Text data is available on websites, in emails, in social media, in form of pdf, and many more. But the challenge is. Is it in a machine-readable format? if in the machine-readable format then will it be relevant to our problem? So, First thing we need to understand our problem or task then we should search for data. Here we will see some of the ways of collecting data if it is not available in our local machine or database.
Sometimes our acquired data is not very clean. it may contain HTML tags, spelling mistakes, or special characters. So, let's see some techniques to clean our text data.
Output :
b'GeeksForGeeks \xf0\x9f\x98\x80' b'\xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 \xe0\xa4\xab\xe0\xa5\x89\xe0\xa4\xb0 \xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 ????'
Output:
gfg GFG Geeks Learning together url email
NLP software mainly works at the sentence level and it also expects words to be separated at the minimum level.
Our cleaned text data may contain a group of sentences. and each sentence is a group of words. So, first, we need to Tokenize our text data.
Output:
Original text: GeeksforGeeks is a very famous edutech company in the IT industry.
Preprocessed tokens: ['geeksforgeeks', 'famous', 'edutech', 'company', 'industry']
POS tags: [('geeksforgeeks', 'NNS'), ('famous', 'JJ'), ('edutech', 'JJ'),
('company', 'NN'), ('industry', 'NN')]
Named entities: (S geeksforgeeks/NNS famous/JJ edutech/JJ company/NN industry/NN)
Here, Stop word removal, Stemming and lemmatization, Removing digit/punctuation, and lowercasing are the most common steps used in most of the pipelines.
In Feature Engineering, our main agenda is to represent the text in the numeric vector in such a way that the ML algorithm can understand the text attribute. In NLP this process of feature engineering is known as Text Representation or Text Vectorization.
There are two most common approaches for Text Representation.
In the traditional approach, we create a vocabulary of unique words assign a unique id (integer value) for each word. and then replace each word of a sentence with its unique id. Here each word of vocabulary is treated as a feature. So, when the vocabulary is large then the feature size will become very large. So, this makes it tough for the ML model.
One Hot Encoding represents each token as a binary vector. First mapped each token to integer values. and then each integer value is represented as a binary vector where all values are 0 except the index of the integer. index of the integer is marked by 1.
Output:
Tokenized Sentences : ['geeks for geeks', 'geeks learning together',
'geeks for geeks is famous for dsa', 'learning dsa']
vocabulary : {'geeks': 1, 'for': 2, 'learning': 3, 'together': 4, 'is': 5, 'famous': 6, 'dsa': 7}
OneHotEncoded vector for sentence : " geeks for geeks "is
[[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]A bag of words only describes the occurrence of words within a document or not. It just keeps track of word counts and ignores the grammatical details and the word order.
Code block
Output:
Our Corpus: ['geeksforgeeks', 'geeks learning together',
'geeksforgeeks is famous for dsa', 'learning dsa']
Our vocabulary: {'geeksforgeeks': 4, 'geeks': 3, 'learning': 6,
'together': 7, 'is': 5, 'famous': 1, 'for': 2, 'dsa': 0}
BoW representation for geeksforgeeks [[0 0 0 0 1 0 0 0]]
BoW representation for geeks learning together [[0 0 0 1 0 0 1 1]]
BoW representation for geeksforgeeks is famous for dsa [[1 1 1 0 1 1 0 0]]
Bow representation for 'learning dsa from geeksforgeeks': [[1 0 0 0 1 0 1 0]]In Bag of Words, there is no consideration of the phrases or word order. Bag of n-gram tries to solve this problem by breaking text into chunks of n continuous words.
Output:
Our Corpus: ['geeksforgeeks', 'geeks learning together',
'geeksforgeeks is famous for dsa', 'learning dsa']
Our vocabulary:
{'geeksforgeeks': 9, 'geeks': 6, 'learning': 15, 'together': 18, 'geeks learning': 7,
'learning together': 17, 'geeks learning together': 8, 'is': 12, 'famous': 1, 'for': 4,
'dsa': 0, 'geeksforgeeks is': 10, 'is famous': 13, 'famous for': 2, 'for dsa': 5,
'geeksforgeeks is famous': 11, 'is famous for': 14, 'famous for dsa': 3, 'learning dsa': 16}
Ngram representation for "geeksforgeeks" is [[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]]
Ngram representation for "geeks learning together" is [[0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1]]
Ngram representation for "geeksforgeeks is famous for dsa" is
[[1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0]]
Ngram representation for 'learning dsa from geeksforgeeks together' is
[[1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1]]
The output shows that the input text has been tokenized into sentences and processed to remove any periods and convert to lowercase. The vectorizer then computes the Bag of n-grams representation of each sentence, and the vocabulary used by the vectorizer is printed. Finally, the n-gram representation of a new text is computed and printed. The n-gram representations are in the form of a sparse matrix, where each row represents a sentence and each column represents an n-gram in the vocabulary. The values in the matrix indicate the frequency of the corresponding n-gram in the sentence.
In all the above techniques, Each word is treated equally. TF-IDF tries to quantify the importance of a given word relative to the other word in the corpus. it is mainly used in Information retrieval.
Output:
Our Corpus: ['geeksforgeeks', 'geeks learning together', 'geeksforgeeks is famous for dsa', 'learning dsa'] vocabulary ['dsa', 'famous', 'for', 'geeks', 'geeksforgeeks', 'is', 'learning', 'together'] IDF for all words in the vocabulary : [1.51082562 1.91629073 1.91629073 1.91629073 1.51082562 1.91629073 1.51082562 1.91629073] TFIDF representation for "geeksforgeeks" is [[0. 0. 0. 0. 1. 0. 0. 0.]] TFIDF representation for "geeks learning together" is [[0. 0. 0. 0.61761437 0. 0. 0.48693426 0.61761437]] TFIDF representation for "geeksforgeeks is famous for dsa" is [[0.38274272 0.48546061 0.48546061 0. 0.38274272 0.48546061 0. 0. ]] TFIDF representation for 'learning dsa from geeksforgeeks' is [[0.57735027 0. 0. 0. 0.57735027 0. 0.57735027 0. ]]
The above technique is not very good for complex tasks like Text Generation, Text summarization, etc. and they can't understand the contextual meaning of words. But in the neural approach or word embedding, we try to incorporate the contextual meaning of the words. Here each word is represented by real values as the vector of fixed dimensions.
For example :
airplane =[0.7, 0.9, 0.9, 0.01, 0.35] kite =[0.7, 0.9, 0.2, 0.01, 0.2]
Here each value in the vector represents the measurements of some features or quality of the word which is decided by the model after training on text data. This is not interpretable for humans but Just for representation purposes. We can understand this with the help of the below table.
airplane | kite | |
|---|---|---|
Sky | 0.7 | 0.7 |
Fly | 0.9 | 0.9 |
Transport | 0.9 | 0.2 |
Animal | 0.01 | 0.01 |
Eat | 0.35 | 0.2 |
Now, The problem is how can we get these word embedding vectors.
There are following ways to deal with this.
There are two ways to train our own word embedding vector :
For example :
I am learning Natural Language Processing from GFG.
I am learning Natural _____?_____ Processing from GFG.
For example :
I am learning Natural Language Processing from GFG.
I am __?___ _____?_____ Language ___?___ ____?____ GFG.
These models are trained on a very large corpus. We import from Gensim or Hugging Face and used it according to our purposes.
Some of the most popular pre-trained embeddings are as follows :
Output:
Similarity between 'learn' and 'learning' using Word2Vec: 0.637 Similarity between 'india' and 'indian' using Word2Vec: 0.697 Similarity between 'fame' and 'famous' using Word2Vec: 0.326
Output:
Similarity between 'learn' and 'learning' using GloVe: 0.768 Similarity between 'india' and 'indian' using GloVe: 0.764 Similarity between 'fame' and 'famous' using GloVe: 0.507
Output:
Similarity between 'learn' and 'learning' using Word2Vec: 0.642 Similarity between 'india' and 'indian' using Word2Vec: 0.708 Similarity between 'fame' and 'famous' using Word2Vec: 0.519
At the start of any project. When we have no very fewer data, then we can use a heuristic approach. The heuristic-based approach is also used for the data-gathering tasks for ML/DL model. Regular expressions are largely used in this type of model.
Recurrent neural networks are a particular class of artificial neural networks that are created with the goal of processing sequential or time series data. It is primarily used for natural language processing activities including language translation, speech recognition, sentiment analysis, natural language production, summary writing, etc. Unlike feedforward neural networks, RNNs include a loop or cycle built into their architecture that acts as a "memory" to hold onto information over time. This distinguishes them from feedforward neural networks. This enables the RNN to process data from sources like natural languages, where context is crucial.
The basic concept of RNNs is that they analyze input sequences one element at a time while maintaining track in a hidden state that contains a summary of the sequenceβs previous elements. The hidden state is updated at each time step based on the current input and the previous hidden state. This allows RNNs to capture the temporal dependencies between elements of the sequence and use that information to make predictions.
Working: The fundamental component of an RNN is the recurrent neuron, which receives as inputs the current input vector and the previous hidden state and generates a new hidden state as output. And this output hidden state is then used as the input for the next recurrent neuron in the sequence. An RNN can be expressed mathematically as a sequence of equations that update the hidden state at each time step:
St= f(USt-1+Wxt+b)
Where,
And the output of the RNN at each time step will be:
yt = g(VSt+c)
Where,
Here, W, U, V, b, and c are the learnable parameters and it is optimized during the backpropagation.
Models have to process a large number of tokens. When it is processing a distant token from the first token, The significance of the first token starts decreasing, So, it fails to relate with starting token to the distant token. This can be avoided with explicit state management by using gates.
There are two architectures that try to solve this problem.
Long Short-Term Memory Networks are an advanced form of RNN model, and it handles the vanishing gradient problem of RNN. It only remembers the part of the context which has a meaningful role in predicting the output value. LSTMs function by selectively passing or retaining information from one-time step to the next using the combination of memory cells and gating mechanisms.
The LSTM cell is made up of a number of parts, such as:
Gated Recurrent Unit (GRU) is also the advanced form of RNN. which solves the vanishing gradient problem. Like LSTMs, GRUs also have gating mechanisms that allow them to selectively update or forget information from the previous time steps. However, GRUs have fewer parameters than LSTMs, which makes them faster to train and less prone to overfitting. The two gates in GRUs are the reset gate and the update gate, which control the flow of information in the network.
Evaluation matric depends on the type of NLP task or problem. Here I am listing some of the popular methods for evaluation according to the NLP tasks.
Making a trained NLP model usable in a production setting is known as deployment. The precise deployment process can vary based on the platform and use case, however, the following are some typical processes that may be involved: