Overview of Word Embedding using Embeddings from Language Models (ELMo)

Last Updated : 21 Jul, 2025

Word embeddings enable models to interpret text by converting words into numerical vectors. Traditional methods like Word2Vec and GloVe generate fixed embeddings, assigning the same vector to a word regardless of its context.

ELMo (Embeddings from Language Models) addresses this limitation by producing contextualized embeddings that vary based on surrounding words. This approach allows models to better capture the meaning of words in different contexts, improving performance in tasks like sentiment analysis, named entity recognition and question answering.

ELMo

ELMo (Embeddings from Language Models) generates word vectors by considering the entire sentence. Unlike traditional methods, ELMo derives word meanings from the internal states of a deep bidirectional LSTM network trained as a language model. Its key characteristics are:

Context-aware: Word meaning changes with context.
Deep Representations: Uses multiple layers from the language model.
Pre-trained + Task-specific: ELMo embeddings are integrated into downstream models and fine-tuned accordingly.

Working of ELMo

1. Pre-training Phase

A bidirectional language model (biLM) is trained on a large text corpus. The model uses two separate LSTMs:

The forward LSTM reads the sentence from left to right and predicts the next word.
The backward LSTM reads from right to left and predicts the previous word.

Figure: biLM architecture

For each word, the model captures contextual information from both directions. The hidden states from the forward and backward LSTMs are concatenated to form a contextualized representation, so the vector for a word varies with its role in the sentence. ELMo then combines the outputs of multiple LSTM layers through a weighted sum whose weights are learned for each task, capturing both syntactic and semantic patterns.
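As a rough illustration of how the layers are combined, the sketch below joins one token's forward and backward states and then takes a softmax-weighted sum over layers, following the ELMo formulation in which the two directional states are concatenated. The function names, the toy numbers and the equal layer weights are assumptions made for this example; a real biLM learns these values and works with much larger vectors.

```python
import math


def concat_directions(forward_state, backward_state):
    """Join one token's forward and backward hidden states end to end."""
    return list(forward_state) + list(backward_state)


def softmax(weights):
    """Turn raw layer weights into positive values that sum to 1."""
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_weights, gamma=1.0):
    """Weighted sum of one token's per-layer vectors, scaled by gamma."""
    s = softmax(layer_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s_j * vec[i] for s_j, vec in zip(s, layer_vectors))
        for i in range(dim)
    ]


# Toy values for a single token (made up for illustration).
token_layer = [0.0, 0.0, 0.0, 0.0]
layer1 = concat_directions([0.1, 0.2], [0.3, 0.4])
layer2 = concat_directions([0.5, 0.6], [0.7, 0.8])

# Equal raw weights mean every layer contributes one third.
elmo_vector = mix_layers([token_layer, layer1, layer2], [0.0, 0.0, 0.0])
print(elmo_vector)
```

With equal weights the result is simply the average of the three layers; during task training the weights shift toward whichever layers help most.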

2. Task-specific Integration

Once trained, the biLM is used to generate embeddings for specific NLP tasks.

ELMo embeddings are added to the input of a downstream model, such as a classifier.
The biLM can be either frozen to preserve general knowledge or fine-tuned on the specific task to improve performance.
The downstream model learns to use these embeddings for improved predictions.

This phase allows ELMo to be applied to tasks like named entity recognition, sentiment analysis and text classification where understanding context is crucial.
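One common way to feed ELMo vectors into a downstream model is to concatenate them with whatever token representation the model already uses. The sketch below shows only that joining step; the helper name build_downstream_inputs and the toy vectors are hypothetical, and the downstream classifier itself is left out.

```python
def build_downstream_inputs(base_vectors, elmo_vectors):
    """Concatenate each token's base vector with its ELMo vector."""
    if len(base_vectors) != len(elmo_vectors):
        raise ValueError("Both sequences must cover the same tokens.")
    return [
        list(base) + list(elmo)
        for base, elmo in zip(base_vectors, elmo_vectors)
    ]


# Toy two-token sentence (values made up for illustration).
base = [[1.0, 2.0], [3.0, 4.0]]              # e.g. static embeddings
elmo = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # e.g. ELMo vectors

inputs = build_downstream_inputs(base, elmo)
print(len(inputs[0]))  # 5: two base dimensions plus three ELMo dimensions
```

The downstream model then trains on these wider vectors, whether the biLM that produced the ELMo part is frozen or fine-tuned.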

Real-World Examples

Consider the word "bank" in two different contexts:

"She deposited money in the bank." (financial institution)
"He sat by the bank of the river." (river edge)

Static embeddings would assign the same vector to both, failing to capture the difference. ELMo generates context-dependent vectors that differentiate between these meanings. Because it adapts to sentence-level context, it provides more accurate representations.
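The difference can be made concrete with cosine similarity, a standard way to compare two embedding vectors. The sketch below uses small made-up vectors in place of real embeddings, so the numbers only illustrate the idea; the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 2-D vectors standing in for real embeddings.
static_bank = [0.6, 0.8]    # a static model reuses this in both sentences
bank_finance = [0.9, 0.1]   # contextual vector in the "money" sentence
bank_river = [0.1, 0.9]     # contextual vector in the "river" sentence

static_similarity = cosine_similarity(static_bank, static_bank)
contextual_similarity = cosine_similarity(bank_finance, bank_river)

print(round(static_similarity, 3))      # identical vectors: similarity of 1
print(round(contextual_similarity, 3))  # different senses: much lower
```

With real ELMo output, the same function could be applied to the two vectors at the position of "bank" in each sentence.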

Implementation of ELMo Embeddings

We can implement ELMo embeddings using TensorFlow and TensorFlow Hub. Here is a step-by-step guide with explanations at each stage.

Step 1: Install Required Libraries

TensorFlow is used for building and running deep learning models, and tensorflow_hub allows us to load pretrained models such as ELMo. You can install both using:

pip install tensorflow tensorflow_hub

Step 2: Import Libraries and Load ELMo

We import TensorFlow and TensorFlow Hub to access the model.
We then load the ELMo model from TensorFlow Hub using its URL.
The model outputs 1024-dimensional embeddings for each token.
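A minimal sketch of this step follows. It assumes the ELMo v3 module published at https://tfhub.dev/google/elmo/3 and loads it with hub.load; the URL, the constant ELMO_URL and the helper name load_elmo are assumptions for this example, not something the text above specifies.

```python
# Assumed module address; check TensorFlow Hub for the version you want.
ELMO_URL = "https://tfhub.dev/google/elmo/3"


def load_elmo(url=ELMO_URL):
    """Download (or reuse from the local cache) and return the ELMo module."""
    # Imported inside the function so that defining this helper does not
    # require TensorFlow Hub until the model is actually requested.
    import tensorflow_hub as hub

    return hub.load(url)
```

Calling elmo = load_elmo() downloads the module on first use, which can take a while because the model is large; later calls reuse the cached copy.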

Step 3: Define an Embedding Function

We define a function that:

Takes a list of input sentences.
Passes them to the ELMo model.
Returns a tensor of contextualized word embeddings.
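The function could look roughly like the sketch below. It assumes the loaded module exposes a "default" signature that accepts raw sentence strings and returns a dictionary whose "elmo" entry holds the per-token embeddings, as the TF Hub ELMo module does. Passing the loaded model as a second argument is a choice made here to keep the example self-contained.

```python
def get_elmo_embedding(sentences, model):
    """Return contextual embeddings for a list of sentence strings.

    `model` is a loaded ELMo module that exposes a "default" signature.
    """
    # Imported inside the function so the helper can be defined before
    # TensorFlow is needed.
    import tensorflow as tf

    # The default signature expects a 1-D string tensor of raw sentences.
    outputs = model.signatures["default"](tf.constant(sentences))

    # "elmo" is the weighted sum of the model's layers for every token.
    return outputs["elmo"]
```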

Step 4: Generate Embeddings from Sample Sentences

We create a list of sample sentences that include ambiguous words like "bank".
We call the get_elmo_embedding() function to generate the embeddings.
The result is a 3D tensor with shape (batch_size, max_seq_length, 1024).
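The sample input and the shape stated above can be sketched without running the model. The helper expected_elmo_shape is hypothetical and assumes, as the TF Hub module description states for the default signature, that sentences are tokenized by splitting on spaces and that shorter sentences are padded to the longest one.

```python
EMBEDDING_DIM = 1024  # size of each ELMo token vector

# Two sentences that use "bank" in different senses.
sentences = [
    "She deposited money in the bank.",
    "He sat by the bank of the river.",
]


def expected_elmo_shape(sentences):
    """Shape of the ELMo output, assuming tokens are split on whitespace."""
    token_counts = [len(sentence.split()) for sentence in sentences]
    return (len(sentences), max(token_counts), EMBEDDING_DIM)


print(expected_elmo_shape(sentences))  # (2, 8, 1024)
```

Passing the same sentences list to get_elmo_embedding() should yield a tensor of this shape, with the six-token first sentence padded to eight positions.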

Output:

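Figure: embeddings matrix printed by the code

The output is a matrix of floating-point values with one 1024-dimensional vector per token, which indicates that the model loaded and ran as expected.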

Limitations

Ambiguous Contexts: Some sentences may not provide enough information for accurate disambiguation. For example, "The bank was full of fish." could still confuse the model.
Computational Overhead: ELMo requires more memory and processing due to biLSTM layers, which can be a constraint in real-time applications.
Pretraining Dependency: Performance heavily depends on the quality and size of the pretraining corpus.

Applications of ELMo Embeddings

ELMo significantly improves performance across a variety of NLP tasks:

Sentiment Analysis: Detects emotions in text with context-aware understanding.
Named Entity Recognition (NER): Identifies names of people, places and organizations more accurately.
Question Answering: Helps locate contextually relevant answers in large documents.
Text Classification: Enhances accuracy in spam detection, topic classification and intent analysis.
Semantic Similarity: Measures context-specific similarity between phrases or documents.

Comparison with Other Models

Feature	Word2Vec / GloVe	ELMo	BERT / RoBERTa
Contextual	Static word representation	Contextualized based on sentence context	Contextualized based on full input
Architecture	Shallow neural networks	Bidirectional LSTM language model	Transformer-based
Training Objective	Word co-occurrence prediction	Forward and backward language modeling	Masked language modeling
Model Complexity	Low	Moderate	High
Fine-tuning	Not designed for fine-tuning	Supports task-specific fine-tuning	Designed for fine-tuning

ELMo popularized deep contextualized word embeddings and still influences modern NLP, although transformer-based models such as BERT and RoBERTa have since surpassed it.


URL: https://www.geeksforgeeks.org/python/overview-of-word-embedding-using-embeddings-from-language-models-elmo/