Word embedding is an important part of the NLP process. It captures the semantic meaning of words, reduces dimensionality, adds contextual information, and promotes efficient learning by transferring linguistic knowledge through pre-trained embeddings. As a result, models perform better even with limited task-specific data. In this article, we will look at BERT and how it generates embeddings.
Word embeddings are typically learned without labeled data and are used in various Natural Language Processing (NLP) tasks such as text classification and sentiment analysis. Generating word embeddings from Bidirectional Encoder Representations from Transformers (BERT) is an effective technique. BERT is a pre-trained language model that can also be fine-tuned for specific NLP tasks.
Some well-known word-embedding techniques are discussed below:
In this article, we will generate word embeddings using the BERT model.
BERT is a commonly used state-of-the-art deep learning model for various NLP tasks. We will explore its architecture below:
Several reasons make BERT a common choice for NLP tasks. They are discussed below:
BERT and Word2vec are both widely used to generate word embeddings for NLP tasks, but BERT generally outperforms Word2vec. The reasons are discussed below:
The transformers module is unlikely to be pre-installed in your Python environment. To install it, run the following command in a notebook cell:
!pip install transformers
Now we will import the necessary Python libraries, such as PyTorch.
We will set a random seed for PyTorch to make the results reproducible. Seeding is good practice whenever the code involves randomness, such as dropout or newly initialized layers. If a GPU is available, its random number generator should be seeded as well.
Now we will load the BERT model along with its tokenizer. Here we use 'bert-base-uncased', one of the most commonly used variants for NLP tasks. Its tokenizer converts all upper-case characters in the input text to lower case. This variant is widely used for general NLP tasks such as text classification, sentiment analysis, and named entity recognition. If case matters for your task, you can use 'bert-base-cased' instead, which preserves the case of the characters.
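To make the effect of the uncased variant concrete, here is a minimal sketch of the text normalization an uncased tokenizer applies before splitting words: lower-casing and accent stripping. It uses only the Python standard library; the function name `uncased_normalize` is hypothetical and not part of the transformers library, and the real tokenizer performs additional cleanup steps.

```python
import unicodedata


def uncased_normalize(text):
    """Lower-case text and strip accent marks (a simplified illustration)."""
    lowered = text.lower()
    # Decompose characters so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("GeeksforGeeks is a Computer Science Portal"))
print(uncased_normalize("Caf\u00e9"))
```

A cased variant would skip this step, so 'GeeksforGeeks' and 'geeksforgeeks' would be tokenized differently.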
Now we will take an input text and tokenize it with the BERT tokenizer (batch_encode_plus). This uses WordPiece tokenization, which splits each word of the sentence into sub-word tokens. The tokens are then encoded into IDs. We also set the add_special_tokens parameter to True so that the special tokens SEP and CLS are added to the tokenized text. The SEP token separates different segments or sentences within a single input sequence, so it is inserted between two sentences or segments and at the end of the sequence. The CLS token is the first token of every input sequence. Its output vector is used to represent the entire input sequence and is called the "CLS" embedding. Adding special tokens is good practice when working with BERT embeddings, because the model was pre-trained with them.
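The sub-word splitting described above can be illustrated with a small, self-contained sketch of greedy longest-match WordPiece tokenization. This is not the transformers implementation: `TOY_VOCAB`, `wordpiece`, and `tokenize` are hypothetical names, the vocabulary is made up for this sentence only, and punctuation handling is omitted.

```python
# Hypothetical toy vocabulary, chosen only for this illustration.
TOY_VOCAB = {
    "[UNK]", "[CLS]", "[SEP]",
    "geek", "##sf", "##org", "##ee", "##ks",
    "is", "a", "computer", "science", "portal",
}


def wordpiece(word, vocab):
    """Split one lower-case word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(sentence, vocab):
    """Lower-case, split on whitespace, apply WordPiece, and add [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


print(tokenize("GeeksforGeeks is a computer science portal", TOY_VOCAB))
```

With this toy vocabulary the sentence yields 12 tokens: CLS, five pieces for 'geeksforgeeks', five whole words, and SEP.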
Output:
Input ID: tensor([[ 101, 29294, 22747, 21759, 4402, 5705, 2003, 1037, 3274, 2671,
9445, 102]])
Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Every token has an attention mask value of 1, which tells the BERT model to consider the entire input sequence when generating the embedding of each token. The attention mask is not a score: it is a binary flag, where 1 marks a real token that should be attended to and 0 marks a padding token that should be ignored. Zeros appear only when shorter sequences in a batch are padded to a common length. A long input does not lower these values; inputs beyond BERT's 512-token limit have to be truncated instead. Here we passed a single sentence without padding, so all tokens are fully attended.
Now we will pass the encoded input through the BERT model. The model generates an embedding for each token.
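To show where mask values of 0 would come from, here is a minimal sketch of padding a batch of token-ID sequences and building the matching attention masks. The helper `pad_batch` is hypothetical (the tokenizer does this internally when padding is enabled), and the second sequence is a made-up shorter input that reuses IDs from the output above.

```python
PAD_ID = 0  # ID of the [PAD] token in the bert-base-uncased vocabulary


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


batch = [
    [101, 29294, 22747, 21759, 4402, 5705, 2003, 1037, 3274, 2671, 9445, 102],
    [101, 3274, 2671, 102],  # hypothetical shorter sequence
]
padded, masks = pad_batch(batch)
for ids, mask in zip(padded, masks):
    print(ids)
    print(mask)
```

The first mask is all ones, as in the article's output, while the second ends in eight zeros covering the padded positions.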
Output:
Shape of Word Embeddings: torch.Size([1, 12, 768])
The shape of the word embeddings is [1, 12, 768]. Here 768 is the dimensionality (hidden size) of the embeddings generated by our BERT model (the bert-base-uncased variant), so each token is represented by a 768-dimensional vector. The 12 is the number of tokens in our input text after tokenization, including the CLS and SEP tokens. The 1 is the batch dimension, i.e. the number of sentences (input sequences) we passed, which is only one here.
Here we will decode the token IDs back to text with tokenizer.decode, then tokenize the text with tokenizer.tokenize, and finally encode it with tokenizer.encode.
Output:
Decoded Text: geeksforgeeks is a computer science portal
tokenized Text: ['geek', '##sf', '##org', '##ee', '##ks', 'is', 'a', 'computer', 'science', 'portal']
Encoded Text: tensor([[ 101, 29294, 22747, 21759, 4402, 5705, 2003, 1037, 3274, 2671,
9445, 102]])
The decoded text is the same as the input text, except that it is all lower case because we used the bert-base-uncased variant. The encoded text and the input IDs are identical because tokenizer.encode and tokenizer.batch_encode_plus produce the same sequence of token IDs for a given input text. As discussed previously, BERT can handle out-of-vocabulary words (words that are not in its vocabulary), such as 'GeeksforGeeks' here, by breaking them down into sub-word tokens.
Finally, we will extract the generated word embeddings and print them. The embeddings are contextual, so each vector reflects the meaning of its word within the sentence. We also print the shape of the embeddings. The tokens are not printed here; to see them, uncomment the token-printing line in the for loop.
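Because sub-word pieces carry a '##' prefix, they can be joined back into whole words. The following sketch shows the idea on the tokenized text above; `merge_wordpieces` is a hypothetical helper written for this illustration, not a transformers function.

```python
def merge_wordpieces(tokens):
    """Join '##' continuation pieces back onto the word they belong to."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # strip the '##' marker and append
        else:
            words.append(token)
    return words


tokens = ['geek', '##sf', '##org', '##ee', '##ks',
          'is', 'a', 'computer', 'science', 'portal']
print(merge_wordpieces(tokens))
# ['geeksforgeeks', 'is', 'a', 'computer', 'science', 'portal']
```

The same grouping can be used to combine the embeddings of the five pieces into a single vector for the word, for example by averaging them.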
Output:
Embedding: tensor([-2.4299e-01, -2.2849e-01, 5.8441e-02, 5.7861e-03, -4.3398e-01,
-3.4387e-01, 9.6974e-02, 3.6446e-01, -6.3829e-02, -2.3413e-01,
-3.2477e-01, -4.9730e-01, -3.0048e-01, 3.5098e-01, -4.8904e-01,
-1.2836e-01, -5.5042e-01, 4.0802e-02, -3.2041e-01, -1.6057e-01,
................................................
......
......
......
......
Embedding: tensor([-5.9422e-01, 3.0865e-01, -3.5836e-01, -1.6872e-02, 2.9080e-01,
-5.5942e-01, -2.2233e-01, 7.7186e-01, -8.0256e-01, 2.2205e-01,
-6.1288e-01, -6.0329e-01, -8.2418e-02, 2.8664e-01, -1.1168e+00,
1.1978e+00, 6.1283e-02, -3.9820e-01, 1.1269e-01, -7.9150e-01,
...................................................
The full output is very large, so only a small portion is shown here: the beginning of the embeddings of the first and last tokens.
We will also generate a sentence embedding by averaging the word embeddings (average pooling).
Output:
Sentence Embedding:
tensor([[-1.2731e-01, 2.3766e-01, 1.6280e-01, 1.7505e-01, 2.1393e-01,
-7.2085e-01, -1.1638e-01, 5.5303e-01, -2.4897e-01, -3.5929e-02,
-9.9867e-02, -5.9745e-01, -1.2873e-02, 4.0385e-01, -4.7625e-01,
9.3286e-02, -3.1485e-01, 1.4257e-02, -3.1248e-01, -1.5662e-01,
-1.8107e-01, -2.4591e-01, -9.8347e-02, 5.4759e-01, 1.2483e-01,
.......................................
.......................................
-1.1171e-01, 2.2538e-01, 5.8986e-02]])
Shape of Sentence Embedding: torch.Size([1, 768])
This also produces a large output, followed by the shape of the sentence embedding, which is [number of sentences, hidden size].
Now we will compute the similarity between an example sentence (GeeksforGeeks is a technology website) and the original sentence (GeeksforGeeks is a computer science portal) used so far. Measuring similarity is one of the most common uses of embeddings.
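Average pooling itself is simple enough to sketch in plain Python. The version below also takes an attention mask so that padding positions are left out of the average. This matters only for padded batches and is an addition of this sketch, not necessarily part of the article's code. The function name `mean_pool` and the 3-dimensional vectors are hypothetical stand-ins for 768-dimensional BERT outputs.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions whose mask value is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical 3-dimensional vectors standing in for BERT token embeddings.
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],  # padding position, excluded by the mask
]
attention_mask = [1, 1, 0]
print(mean_pool(token_vectors, attention_mask))  # [2.0, 3.0, 4.0]
```

For a single unpadded sentence, as in this article, the mask is all ones and the result is the plain mean over the token axis.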
Output:
Cosine Similarity Score: 0.9561723
The cosine similarity is about 0.956, which indicates that the two sentence embeddings are very close.
We can conclude that generating word embeddings is essential for many NLP tasks. The output may be large, but the embeddings capture the meaning of each token in the sentence, which plays an essential role in tasks such as sentiment analysis and text classification.
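For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on short made-up vectors; the function name `cosine_similarity` is our own, and libraries such as scikit-learn or PyTorch provide equivalent functions for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Parallel vectors give a value close to 1; orthogonal vectors give 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The score depends only on the direction of the vectors, not their magnitude, which is why it is a common choice for comparing sentence embeddings.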