In machine learning, embeddings are a way of representing data as numerical vectors in a continuous space. They capture the meaning or relationship between data points, so that similar items are placed closer together while dissimilar ones are farther apart. This makes it easier for algorithms to work with complex data such as words, images or audio.
They convert categorical or high-dimensional data into dense vectors.
They help machine learning models work with different types of data.
These vectors help show what the objects mean and how they relate to each other.
They are widely used in natural language processing, recommender systems and computer vision.
In the above graph, we observe distinct clusters of related words.
For instance "computer", "software" and "machine" are clustered together, indicating their semantic similarity.
Similarly "lion", "cow" ,"cat" and "dog" form another cluster, representing their shared attributes.
There exists a significant gap between these clusters highlighting their dissimilarity in meaning or context.
Important terms used for Embedding
These terms help understand how embeddings represent and organize data in machine learning.
1. Vector
A vector is a list of numbers representing features or characteristics of data, often showing magnitude and direction.
Example: In 2D, the vector points 3 steps along the x-axis and 4 steps along the y-axis. Its total length (magnitude) is 5.
2. Dense Vector
A vector in which most values are non-zero. In machine learning, it is commonly used to represent rich information such as words, images, or data points.
Example: [2000, 3, 5, 9.8] could describe a house, showing size, number of bedrooms, bathrooms and age.
3. Vector space
A mathematical structure where vectors can be added and scaled, forming the basis for representing data.
Example: The set of all 3D vectors with real number coordinates forms a vector space like the vectors [1, 0, 0], [0, 1, 0] and [0, 0, 1] constitute a basis for the 3D vector space.
4. Continuous Vector space
A vector space where values can take any real number, allowing smooth and precise representations.
Example: The color [0.9, 0.3, 0.1] in RGB shows a shade of red, where each number can be any value between 0 and 1.
Working
Embeddings convert data into numerical vectors that capture meaning and relationships, allowing models to compare and process different types of data effectively.
1. Define similarity signal
First, decide what we want the model to treat as “similar”.
Text: Words or sentences that appear in similar contexts.
Images: Pictures of the same object or scene.
Graphs: Nodes that are connected or related.
2. Choose dimensionality
Select how many numbers (dimensions) will describe each item, it could be 64, 384, 768 or more.
More dimensions: more detail but slower and uses more memory.
Fewer dimensions: faster but may lose detail.
3. Build the encoder
This is the model that turns our data into a list of numbers (vector):
Recommendations: suggesting similar products, content or users.
Monitoring: spotting unusual changes or patterns over time.
Importance
Embeddings are widely used because they represent data in a meaningful and efficient way, helping models understand relationships and perform better across tasks.
Capture semantic relationships by placing similar items closer in vector space.
Reduce dimensionality while preserving important patterns and features.
Support transfer learning by reusing embeddings across different tasks.
Provide interpretable representations through distances and directions between vectors.
Types of Data Represented with Embeddings
Embeddings can represent different types of data by converting them into dense vectors, making it easier for models to understand patterns, relationships and meaning.
1. Words
Word embeddings are numeric vectors which represent individual words as vectors where similar words are placed closer together, helping in tasks like sentiment analysis and translation.
Convert sound signals into vectors capturing acoustic features, enabling tasks like speech recognition and emotion detection. Some of the popular Audio embedding techniques may include Wav2Vec
4. Image Data
Represent images as vectors using CNN-based models, capturing visual features for tasks like classification and object detection.
Structured data such as feature vectors and tables can be embedded to help machine learning models capture underlying patterns. Common techniques include Autoencoders
Visualization using t-SNE
t-SNE is used to visualize high dimensional word embeddings by reducing them to 2D space, helping us understand how similar words are positioned relative to each other.
Step 1: Import Libraries
NumPy: Handles numerical data and array manipulation.
Here we can see snake, cow, birds, etc are grouped together nearby showing similarity (all animals) whereas computer and machines are far away from animal cluster showing dissimilarity.