BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing model developed by Google that understands the context of words in a sentence by analyzing text in both directions. It is widely used to improve language understanding tasks with high accuracy.
Uses a transformer-based encoder architecture
Processes text bidirectionally (left and right context)
Captures contextual relationships between words
Designed for language understanding tasks like classification, question answering, and Named Entity Recognition (NER)
BERT is trained on large amounts of unlabeled text to learn contextual representations of words based on their surrounding context.
Learns embeddings that capture meaning using both left and right context
Trained using unsupervised learning on large text datasets
Uses tasks like predicting masked words (MLM)
Learns relationships between sentences using Next Sentence Prediction (NSP)
Workflow of BERT
BERT uses a transformer-based encoder to process input text and generate contextualized representations for each token. Instead of predicting text sequentially like traditional models, it focuses on understanding context using its training strategies.
In BERTβs pre-training, some words in the input sequence are masked, and the model learns to predict these missing words using the surrounding context.
A classification layer is added on top of the encoder outputs to predict masked words
Output vectors are projected to the vocabulary space using the embedding matrix
Softmax is applied to generate probability distribution over all possible words
Loss is calculated only for masked positions, comparing predicted and actual words
Focus on masked tokens may slow convergence compared to directional models
However, it enables deeper contextual understanding by using both left and right context
2. Next Sentence Prediction (NSP)
Next Sentence Prediction trains BERT to understand the relationship between two sentences by predicting whether one sentence follows another.
Uses the [CLS] token representation, passed through a classification layer
Outputs probabilities (via Softmax) to determine if the second sentence is related
During training, 50% of sentence pairs are actual consecutive sentences, while 50% are randomly paired
Helps the model distinguish between logically connected and unrelated sentences
Improves performance in tasks requiring sentence-level understanding like question answering
Combined Training of MLM and NSP
BERT is trained using both Masked Language Model (MLM) and Next Sentence Prediction (NSP) simultaneously. The model minimizes a combined loss function from both tasks, enabling it to learn deeper language understanding.
MLM helps the model understand context within a sentence by predicting masked words
NSP helps capture relationships between pairs of sentences
Training both together improves understanding at both word-level and sentence-level
Results in a more comprehensive and context-aware language model
BERT Fine-Tuning
After pre-training, BERT is fine-tuned on labeled data to adapt it for specific NLP tasks. This step customizes the modelβs general language understanding for particular applications.
Requires minimal architectural changes due to its flexible design
Enhances performance by aligning the model with task specific requirements
BERT Architecture
BERT uses a multilayer bidirectional Transformer encoder to understand text by capturing context from both directions. Unlike the original Transformer, which has both encoder and decoder, BERT uses only the encoder for language understanding tasks.
BERTBASE has 12 layers in the Encoder stack while BERTLARGE has 24 layers in the Encoder stack.
BERT architectures (BASE and LARGE) also have larger feedforward networks (768 and 1024 hidden units respectively), and more attention heads (12 and 16 respectively) than the Transformer architecture suggested in the original paper. It contains 512 hidden units and 8 attention heads.
BERTBASE contains 110M parameters while BERTLARGE has 340M parameters.