Large Language Models: BERT – Bidirectional Encoder Representations from Transformers

Understand how BERT constructs state-of-the-art embeddings

Aug 30, 2023

Introduction

2017 was a historic year in machine learning: the Transformer model made its first appearance. It has since performed strongly on many benchmarks and proved suitable for a wide range of problems in data science. Thanks to its efficient architecture, many other Transformer-based models that specialise in particular tasks have been developed since.

One such model is BERT. It is primarily known for constructing embeddings that accurately represent text and capture the semantic meaning of long text sequences. As a result, BERT embeddings have become widely used in machine learning. Understanding how BERT builds text representations is crucial because it opens the door to tackling a large range of NLP tasks.

In this article, we will refer to the original BERT paper, look at the BERT architecture and examine the core mechanisms behind it. In the first sections, we will give a high-level overview of BERT. After that, we will gradually dive into its internal workflow and how information is passed through the model. Finally, we will learn how BERT can be fine-tuned to solve particular problems in NLP.

High level overview

The Transformer's architecture consists of two primary parts: encoders and decoders. The goal of the stacked encoders is to construct a meaningful embedding of the input that preserves its main context. The output of the last encoder is passed to every decoder, which uses it to generate new output.

BERT is a Transformer successor that inherits its stacked encoders and uses them bidirectionally; it has no decoder. Most of the architectural principles in BERT are the same as in the original Transformer.

Transformer architecture

BERT versions

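There are two main versions of BERT: base and large. Their architecture is identical except for its size: BERT base has 12 encoder layers, a hidden size of 768 and 12 attention heads (about 110M parameters), while BERT large has 24 layers, a hidden size of 1024 and 16 attention heads (about 340M parameters). Overall, BERT large has about 3.09 times more parameters to tune than BERT base.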
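The size gap can be checked with a rough parameter count. The sketch below assumes the configuration values reported in the original paper (12 layers with hidden size 768 for base, 24 layers with hidden size 1024 for large, a feed-forward size of four times the hidden size), plus a 30,522-token WordPiece vocabulary and 512 positions as in the released uncased checkpoints. The function name is illustrative and not part of any library.

```python
def count_bert_parameters(num_layers, hidden, vocab_size=30522,
                          max_positions=512, num_segments=2):
    """Approximate parameter count of a BERT encoder (with pooler, no task heads)."""
    layer_norm = 2 * hidden                      # scale and bias
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)   # query, key, value, output projections
    intermediate = 4 * hidden                    # feed-forward inner size
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    encoder_layer = attention + layer_norm + feed_forward + layer_norm

    pooler = hidden * hidden + hidden            # dense layer applied to the [CLS] output
    return embeddings + num_layers * encoder_layer + pooler


base = count_bert_parameters(num_layers=12, hidden=768)
large = count_bert_parameters(num_layers=24, hidden=1024)
print(f"BERT base:  {base / 1e6:.0f}M parameters")
print(f"BERT large: {large / 1e6:.0f}M parameters")
print(f"ratio: {large / base:.2f}")
```

The exact ratio depends on what is counted (vocabulary size, pooler, task heads), so it can differ slightly from the 3.09 obtained from the rounded figures of 110M and 340M. Note that the number of attention heads does not change the parameter count, since the heads split the same projection matrices.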

Comparison of BERT base and BERT large

Bidirectional representations

As the letter "B" in its name suggests, BERT is a bidirectional model: each token can use context from both its left and its right, which helps the model capture connections between words. This typically requires more training resources than unidirectional models, but it also leads to better prediction accuracy.

For a better understanding, we can visualise the BERT architecture in comparison with other popular NLP models.

Comparison of BERT, OpenAI GPT and ELMo architectures from the original paper. Adapted by the author.
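One way to picture the difference between a bidirectional and a left-to-right model is through the attention pattern, that is, which tokens each position is allowed to look at. The following is a simplified illustration with a hypothetical helper, not code from any of the models; ELMo, which combines two separately trained LSTMs, is not covered here.

```python
def attention_mask(num_tokens, bidirectional):
    """Return a matrix where mask[i][j] == 1 if token i may attend to token j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


tokens = ["the", "cat", "sat", "down"]
for name, flag in [("bidirectional (BERT-style)", True),
                   ("left-to-right (GPT-style)", False)]:
    print(name)
    for token, row in zip(tokens, attention_mask(len(tokens), flag)):
        print(f"  {token:>5}: {row}")
```

In the bidirectional case every row is filled with ones, so each token sees the whole sequence. In the left-to-right case the matrix is lower triangular, so a token only sees itself and the tokens before it.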

Input tokenisation

Note. The official paper uses the term "sentence" for the text passed to the input, which can actually consist of several sentences. Throughout this article series we will use the term "sequence" instead. This avoids confusion, since "sentence" usually means a single phrase ending with a full stop, and many other NLP research papers use "sequence" in similar circumstances.

Before diving into how BERT is trained, it is necessary to understand the format in which it accepts data. As input, BERT takes a single sequence or a pair of sequences. Each sequence is split into tokens. Additionally, two special tokens are added to the input:

[CLS] – placed before the first sequence to mark its beginning. [CLS] is also used for a classification objective during training (discussed in the sections below).
[SEP] – placed after each sequence, marking the end of the first sequence and, for a pair, separating it from the second.

Passing two sequences lets BERT handle a large variety of tasks whose input contains a pair of sequences (e.g. question and answer, hypothesis and premise).
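The input layout can be sketched in a few lines. This is a minimal illustration that follows the layout in the original paper, where [SEP] also closes the final sequence. The whitespace split is a stand-in assumption: real BERT uses a WordPiece sub-word tokeniser, and the function name is hypothetical.

```python
def build_bert_input(first, second=None):
    """Return (tokens, segment_ids) for one sequence or a pair of sequences."""
    # Whitespace splitting is a stand-in: real BERT uses WordPiece sub-word tokens.
    tokens = ["[CLS]"] + first.lower().split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if second is not None:
        second_tokens = second.lower().split() + ["[SEP]"]
        tokens += second_tokens
        segment_ids += [1] * len(second_tokens)
    return tokens, segment_ids


tokens, segment_ids = build_bert_input("Where is the cat", "It is on the sofa")
for token, segment in zip(tokens, segment_ids):
    print(f"{token:>6}  segment {segment}")
```

The segment ids returned here are what the segment embeddings described in the next section are looked up from.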

Input embedding

After tokenisation, an embedding is built for each token. To make input embeddings more representative, BERT constructs three types of embeddings for each token:

Token embeddings capture the semantic meaning of tokens.
Segment embeddings have one of two possible values and indicate to which sequence a token belongs.
Position embeddings contain information about the position of a token in the sequence.

Input processing

These embeddings are summed up and the result is passed to the first encoder of the BERT model.
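As a toy illustration of this sum, the sketch below uses small random tables in place of learned embedding matrices. The vocabulary, the table sizes and the helper names are assumptions made for the example; the layer normalisation and dropout that BERT applies after the sum are left out.

```python
import random

random.seed(0)
HIDDEN = 4  # toy size; BERT base uses 768


def random_table(num_rows, dim=HIDDEN):
    """Stand-in for a learned embedding table: one random vector per row."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]


vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2, "world": 3}
token_table = random_table(len(vocab))
segment_table = random_table(2)      # first or second sequence
position_table = random_table(8)     # one vector per position


def embed(tokens, segment_ids):
    """Sum token, segment and position embeddings element-wise for each token."""
    result = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        vectors = (token_table[vocab[token]],
                   segment_table[segment],
                   position_table[position])
        result.append([sum(values) for values in zip(*vectors)])
    return result


inputs = embed(["[CLS]", "hello", "[SEP]", "world", "[SEP]"], [0, 0, 0, 1, 1])
print(len(inputs), "vectors of size", len(inputs[0]))
```

Because the three vectors are added rather than concatenated, the input to the first encoder keeps the same dimensionality as a single embedding.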

Output

Each encoder takes n embeddings as input and then outputs the same number of processed embeddings of the same dimensionality. Ultimately, the whole BERT output also contains n embeddings each of which corresponds to its initial token.

Training

BERT training consists of two stages:

Pre-training. BERT is trained on unlabeled pairs of sequences over two prediction tasks: masked language modeling (MLM) and next sentence prediction (NSP). For each pair of sequences, the model makes predictions for both tasks and, based on the loss values, performs backpropagation to update its weights.
Fine-tuning. BERT is initialised with the pre-trained weights, which are then optimised for a particular problem on labeled data.

Pre-training

Compared to fine-tuning, pre-training usually takes far longer because the model is trained on a large corpus of data. That is why there are many online repositories of pre-trained models, which can then be fine-tuned relatively quickly to solve a particular task.

We are going to look in detail at both tasks solved by BERT during pre-training.

Masked Language Modeling

The authors propose training BERT by masking a certain proportion of tokens in the input text and predicting them. This teaches BERT to build robust embeddings that use the surrounding context to guess a hidden word, which in turn yields an appropriate embedding for that word. The process works as follows:

1. After tokenisation, 15% of the tokens are randomly chosen. These chosen tokens will be predicted at the end of the iteration.
2. Each chosen token is replaced in one of three ways:
   - 80% of the time, the token is replaced by the [MASK] token. Example: I bought a book → I bought a [MASK]
   - 10% of the time, the token is replaced by a random token. Example: He is eating a fruit → He is drawing a fruit
   - 10% of the time, the token remains unchanged. Example: A house is near me → A house is near me
3. All tokens are passed to the BERT model, which outputs an embedding for each input token.
4. The output embeddings corresponding to the tokens chosen at step 1 and processed at step 2 are independently used to predict the original tokens. Each prediction is a probability distribution over all tokens in the vocabulary.
5. The cross-entropy loss is calculated by comparing the predicted probability distributions with the true tokens.
6. The model weights are updated using backpropagation.
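The token selection and replacement steps described above can be sketched as follows. This is a simplified illustration, not the original implementation: the function name and the toy vocabulary are made up, and skipping [CLS] and [SEP] when choosing tokens is an assumption of this sketch.

```python
import random


def mask_tokens(tokens, vocabulary, mask_rate=0.15, rng=random):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    special = {"[CLS]", "[SEP]"}  # assumption: special tokens are never chosen
    candidates = [i for i, token in enumerate(tokens) if token not in special]
    num_to_predict = max(1, round(len(candidates) * mask_rate))
    chosen = set(rng.sample(candidates, num_to_predict))

    corrupted, labels = list(tokens), [None] * len(tokens)
    for i in chosen:
        labels[i] = tokens[i]                      # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocabulary)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels


rng = random.Random(42)
tokens = "[CLS] i bought a book in the old shop near the station [SEP]".split()
vocabulary = sorted(set(tokens) - {"[CLS]", "[SEP]"})
corrupted, labels = mask_tokens(tokens, vocabulary, rng=rng)
print(corrupted)
print(labels)
```

Only the positions with a label contribute to the cross-entropy loss. Keeping some chosen tokens unchanged or replacing them with random tokens means the model cannot rely on seeing [MASK] to know which positions it will be asked about.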

Next Sentence Prediction

For this classification task (NSP), BERT tries to predict whether the second sequence follows the first in the original text. The prediction is made using only the final hidden state of the [CLS] token, which is supposed to contain aggregated information from both sequences.

As with MLM, the resulting probability distribution (binary in this case) is used to calculate the model’s loss and update the weights through backpropagation.

For NSP, the authors build 50% of the pairs from sequences that follow each other in the corpus (positive pairs) and 50% from sequences paired at random from the corpus (negative pairs).
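A minimal sketch of how such positive and negative pairs might be built is shown below. It is a simplification under stated assumptions: the corpus is a single list of at least three distinct sequences, negatives are drawn from the same list rather than from another document, and the function name is hypothetical.

```python
import random


def make_nsp_pairs(sentences, rng=random):
    """Build (first, second, is_next) examples with roughly 50% positive pairs."""
    examples = []
    for index in range(len(sentences) - 1):
        first = sentences[index]
        if rng.random() < 0.5:
            # Positive pair: the sequence that really follows in the corpus.
            examples.append((first, sentences[index + 1], 1))
        else:
            # Negative pair: any sequence other than `first` and its true continuation.
            others = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
            examples.append((first, rng.choice(others), 0))
    return examples


sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live mostly in the southern hemisphere",
    "the weather was cold that day",
]
examples = make_nsp_pairs(sentences, rng=random.Random(0))
for first, second, is_next in examples:
    print(is_next, "|", first, "->", second)
```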

BERT pre-training

Training details

According to the paper, BERT is pre-trained on BooksCorpus (800M words) and English Wikipedia (2,500M words). To extract long contiguous texts, the authors took only text passages from Wikipedia, ignoring tables, headers and lists.

BERT is trained for one million steps with a batch size of 256 sequences, which is equivalent to about 40 epochs over the 3.3-billion-word corpus. Sequences are limited to 128 tokens for 90% of the steps and to 512 tokens for the remaining 10%.

According to the original paper, the training parameters are the following:

Optimiser: Adam (learning rate l = 1e-4, weight decay L₂ = 0.01, β₁ = 0.9, β₂ = 0.999, ε = 1e-6).
The learning rate is warmed up over the first 10,000 steps and then decayed linearly.
Dropout with probability 0.1 is applied on all layers.
Activation function: GELU.
The training loss is the sum of the mean MLM likelihood and the mean next sentence prediction likelihood.
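The warmup and decay of the learning rate can be written as a small function. This is a sketch of one common way to implement such a schedule using the values listed above; the exact shape used in the original training code may differ slightly, and the function name is illustrative.

```python
def learning_rate(step, peak=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to `peak`, then linear decay to zero at `total_steps`."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```

Starting from a small learning rate gives the Adam statistics time to stabilise before large updates are made, while the decay lets the model settle towards the end of training.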

Fine-tuning

Once pre-training is complete, BERT has learned to construct embeddings that capture much of the semantic meaning of words in context. The goal of fine-tuning is to gradually adjust the BERT weights to solve a particular downstream task.

Data format

Thanks to the flexibility of the self-attention mechanism, BERT can be easily fine-tuned for a particular downstream task. Another advantage of BERT is its ability to build bidirectional text representations, which gives a higher chance of discovering the correct relations between two sequences when working with pairs. Previous approaches encoded both sequences independently and then applied bidirectional cross-attention to them. BERT unifies these two stages by processing the concatenated pair with self-attention.

Depending on the problem, BERT accepts several input formats. The framework for solving all downstream tasks is the same: BERT takes a text sequence as input and outputs a set of token embeddings, which are then fed to a task-specific model. Most of the time, not all of the output embeddings are used.

Let us have a look at common problems and the ways they are solved by fine-tuning BERT.

Sentence pair classification

The goal of sentence pair classification is to understand the relationship between a given pair of sequences. The most common task types are:

Natural language inference: determining whether the second sequence (the hypothesis) is entailed by, contradicts or is neutral to the first (the premise).
Similarity analysis: finding the degree of similarity between sequences.

Sentence pair classification

For fine-tuning, both sequences are passed to BERT. As a rule of thumb, the output embedding of the [CLS] token is then used for the classification task. According to the researchers, the [CLS] token is supposed to contain the main information about sentence relationships.

Of course, other output embeddings can also be used but they are usually omitted in practice.
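The classification step on top of the [CLS] embedding can be sketched as a single linear layer followed by softmax. The values below are toy numbers chosen for the illustration: a real setup would use the 768-dimensional (or 1024-dimensional) [CLS] output and learn the weights during fine-tuning. The helper names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(cls_embedding, weights, biases):
    """Linear layer followed by softmax, applied to the [CLS] output embedding."""
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy values: a 4-dimensional [CLS] embedding and 3 classes (one weight row per class).
cls_embedding = [0.2, -1.0, 0.5, 0.3]
weights = [[0.1, 0.4, -0.2, 0.0],
           [-0.3, 0.2, 0.5, 0.1],
           [0.0, -0.6, 0.1, 0.2]]
biases = [0.0, 0.1, -0.1]
probabilities = classify_from_cls(cls_embedding, weights, biases)
print([round(p, 3) for p in probabilities])
```

During fine-tuning, the cross-entropy loss on these probabilities updates both the classification weights and the BERT weights that produced the [CLS] embedding.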

Question answering task

The objective of question answering is to find the answer to a particular question in a text paragraph. Most of the time, the answer is given in the form of two numbers: the start and end token positions of the answer span within the paragraph.

Question answering task

For the input, BERT takes the question and the paragraph and outputs a set of embeddings for them. Since the answer is contained within the paragraph, we are only interested in output embeddings corresponding to paragraph tokens.

To find the position of the first answer token in the paragraph, the scalar product between every paragraph output embedding and a special trainable vector Tₛₜₐᵣₜ is calculated. When the model and the vector Tₛₜₐᵣₜ are trained properly, the scalar product should be proportional to the likelihood that the corresponding token is the start of the answer. The scalar products are then normalised with the softmax function so that they can be interpreted as probabilities. The token with the highest probability is predicted as the start of the answer. The loss is calculated against the true start position and backpropagation is performed. An analogous process with the vector Tₑₙ𝒹 predicts the end token.
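The start and end prediction can be sketched with plain dot products and softmax. The embeddings and the two vectors below are toy values picked by hand, and the helper names are hypothetical; in a real model the vectors are learned during fine-tuning.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_span(paragraph_embeddings, start_vector, end_vector):
    """Return (start_index, end_index, start_probs, end_probs) over paragraph tokens."""
    start_probs = softmax([dot(e, start_vector) for e in paragraph_embeddings])
    end_probs = softmax([dot(e, end_vector) for e in paragraph_embeddings])
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    return start, end, start_probs, end_probs


def span_loss(start_probs, end_probs, true_start, true_end):
    """Sum of the cross-entropy losses for the start and end positions."""
    return -math.log(start_probs[true_start]) - math.log(end_probs[true_end])


# Toy 2-dimensional output embeddings for four paragraph tokens.
paragraph_embeddings = [[0.1, 0.0], [2.0, 0.1], [0.3, 1.5], [0.0, 0.2]]
start_vector, end_vector = [1.0, 0.0], [0.0, 1.0]

start, end, start_probs, end_probs = predict_span(paragraph_embeddings, start_vector, end_vector)
print("predicted span:", start, "to", end)
print("loss:", round(span_loss(start_probs, end_probs, true_start=1, true_end=2), 4))
```

Picking the two positions independently, as done here, does not guarantee that the end comes after the start. A practical system would typically score candidate spans jointly and keep only those where the end position is not before the start position.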

Single sentence classification

The difference from the previous downstream tasks is that here only a single sentence is passed to BERT. Typical problems solved by this configuration are the following:

Sentiment analysis: understanding whether a sentence has a positive or negative attitude.
Topic classification: classifying a sentence into one of several categories based on its contents.

Single sentence classification

The prediction workflow is the same as for sentence pair classification: the output embedding for the [CLS] token is used as the input for the classification model.

Single sentence tagging

Named entity recognition (NER) is a machine learning problem which aims to assign every token of a sequence to one of a set of predefined entity classes.

Single sentence tagging

For this objective, embeddings are computed for the tokens of an input sentence, as usual. Then every embedding (except for [CLS] and [SEP]) is passed independently to a model which maps it to one of the NER classes, or to a "no entity" class if the token does not belong to any entity.
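Per-token tagging can be illustrated with a tiny linear scorer applied to each output embedding separately. Everything below is a toy assumption: the label set, the hand-picked 3-dimensional embeddings and the identity weight matrix only serve to show the mechanics, with the label "O" standing for "no entity".

```python
def tag_tokens(tokens, embeddings, weights, labels):
    """Assign each token the label whose weight vector scores highest."""
    tagged = []
    for token, embedding in zip(tokens, embeddings):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens are not tagged
        scores = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
        tagged.append((token, labels[scores.index(max(scores))]))
    return tagged


labels = ["O", "PER", "LOC"]  # "O" marks tokens outside any entity
tokens = ["[CLS]", "anna", "visited", "paris", "[SEP]"]
# Toy 3-dimensional output embeddings, chosen by hand for the illustration.
embeddings = [[0, 0, 0], [0.1, 0.9, 0.2], [0.8, 0.1, 0.0], [0.2, 0.1, 0.7], [0, 0, 0]]
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # one row per label
print(tag_tokens(tokens, embeddings, weights, labels))
```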

Feature extraction

Using the output of the last BERT layer as embeddings is not the only way to extract features from the input text. In fact, the researchers ran several experiments that aggregate embeddings in different ways for solving a NER task on the CoNLL-2003 dataset. In these experiments, they used the extracted embeddings as input to a randomly initialised two-layer 768-dimensional BiLSTM before applying the classification layer.

The ways the embeddings were extracted (from BERT base) are demonstrated in the figure below. As shown, the best-performing way was to concatenate the last four BERT hidden layers.

These experiments show that aggregating hidden layers is a potential way of improving the embeddings and achieving better results on a variety of NLP tasks.
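Concatenating the last four hidden layers amounts to joining, for every token, the vectors that the last four encoder layers produced for it. The sketch below uses toy numbers in place of real hidden states, and the function name is illustrative.

```python
def concat_last_layers(hidden_states, num_layers=4):
    """hidden_states[layer][token] is a vector; join the last layers per token."""
    selected = hidden_states[-num_layers:]
    num_tokens = len(selected[0])
    return [
        [value for layer in selected for value in layer[token]]
        for token in range(num_tokens)
    ]


# Toy model: 12 layers, 3 tokens, hidden size 2 (BERT base uses hidden size 768).
hidden_states = [
    [[float(layer), float(token)] for token in range(3)]
    for layer in range(12)
]
features = concat_last_layers(hidden_states)
print(len(features), "tokens,", len(features[0]), "features per token")
```

With BERT base this would give 4 × 768 = 3072 features per token, so the downstream model has to accept a correspondingly larger input.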

The diagram on the left shows the expanded BERT structure with hidden layers. The table on the right illustrates the ways the embeddings were constructed and the corresponding scores that were achieved by applying respective strategies.

Combining BERT with other features

Sometimes we deal not only with text but also with other kinds of features, such as numerical ones. It is naturally desirable to build embeddings that incorporate information from both the text and the non-text features. Here are the recommended strategies to apply:

Concatenation of text with non-text features. For instance, if we work with profile descriptions of people in the form of text and there are other separate features like their name or age, then a new text description can be obtained in the form: "My name is [name]. I am [age] years old. [description]". Finally, such a text description can be fed into the BERT model.
Concatenation of embeddings with features. It is possible to build BERT embeddings, as discussed above, and then concatenate them with the other features. The only thing that changes in the configuration is that the classification model for the downstream task now has to accept input vectors of higher dimensionality.
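Both strategies can be sketched briefly. The profile fields, the placeholder embedding and the crude age scaling are assumptions made for the example; in practice the embedding would come from BERT and the numeric features would be normalised in a way that suits the data.

```python
def describe(profile):
    """Strategy 1: fold non-text features into the text passed to BERT."""
    return f"My name is {profile['name']}. I am {profile['age']} years old. {profile['bio']}"


def concat_features(text_embedding, numeric_features):
    """Strategy 2: append extra features to an already computed embedding."""
    return list(text_embedding) + [float(x) for x in numeric_features]


profile = {"name": "Alice", "age": 30, "bio": "I enjoy hiking."}
print(describe(profile))

text_embedding = [0.12, -0.40, 0.33]  # placeholder for a BERT embedding
combined = concat_features(text_embedding, [profile["age"] / 100])  # crude scaling
print(combined)
```

The first strategy leaves the model architecture untouched, while the second keeps the numeric values exact but requires a downstream model with a wider input.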

Conclusion

In this article, we have dived into the processes of BERT training and fine-tuning. This knowledge is enough to approach a wide range of NLP tasks, because BERT incorporates much of the information in text data into its embeddings.

In recent times, other BERT-like models have appeared (SBERT, RoBERTa, etc.). There is even a dedicated field of study called "BERTology", which analyses BERT’s capabilities in depth in order to derive new high-performing models. All of this reinforces the view that BERT marked a turning point in machine learning and made significant advances in NLP possible.

Resources

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

All images unless otherwise noted are by the author

Written By

Vyacheslav Efimov

URL: https://towardsdatascience.com/bert-3d1bf880386a/