UMAP Dimensionality Reduction โ An Incredibly Robust Machine Learning Algorithm
How does Uniform Manifold Approximation and Projection (UMAP) work, and how to use it in Python
Hands-on Tutorials, Machine Learning
Intro
Dimensionality reduction is not just for data visualization. It can also help you overcome the "curse of dimensionality" by identifying critical structures in the high-dimensional space and preserving them in the lower-dimensional embedding.
This article will take you through the inner workings of an increasingly popular dimensionality reduction technique called Uniform Manifold Approximation and Projection (UMAP) and provide you with a Python example that can be used as a guide when working on your Data Science projects.
Contents
- UMAPโs place in the universe of Machine Learning algorithms
- An intuitive explanation of how UMAP works
- An example of using UMAP in Python
Uniform Manifold Approximation and Projection (UMAP) in the universe of Machine Learning algorithms
The below sunburst chart is my attempt to categorize the most commonly used Machine Learning algorithms. I have created it to satisfy the need of Data Scientists like you and me to have a more structured and visual way of identifying how various algorithms fit together.
This particular view of the Machine Learning universe enables us to see similarities and differences between algorithms, serving as a quick guide when looking for a suitable solution for a specific project.
The graph is interactive so make sure to click๐ on different categories to enlarge and reveal more.
If you enjoy Data Science and Machine Learning, please subscribe to get an email whenever I publish a new story.
While UMAP is most commonly used for unsupervised learning, it can also perform supervised dimensionality reduction. You will find an example of that in the Python section at the end of this article.
How does Uniform Manifold Approximation and Projection (UMAP) work?
Analyzing the UMAP name
Letโs start by dissecting the UMAP name, which will give us a broad idea of what the algorithm is supposed to do.
Note, the below statements are not official definitions but rather a set of descriptions that will help us understand key ideas behind UMAP.
- Projection โ the process or technique of reproducing a spatial object upon a plane, a curved surface, or a line by projecting its points. You can also think of it as a mapping of an object from high-dimensional to low-dimensional space.
- Approximation โ the algorithm assumes that we only have a finite set of data samples (points), not the entire set that makes up the manifold. Hence, we need to approximate the manifold based on the data available.
- Manifold โ a manifold is a topological space that locally resembles Euclidean space near each point. One-dimensional manifolds include lines and circles, but not figure eights. Two-dimensional manifolds (a.k.a. surfaces) include planes, spheres, torus, and more.
- Uniform โ the uniformity assumption tells us that our data samples are uniformly (evenly) distributed across the manifold. In the real world, however, this is rarely the case. Hence, this assumption leads to the notion that the distance varies across the manifold. i.e., the space itself is warping: stretching or shrinking according to where the data appear sparser or denser.
Putting the above statements together, we can describe UMAP as:
A dimensionality reduction technique that assumes the available data samples are evenly (uniformly) distributed across a topological space (manifold), which can be approximated from these finite data samples and mapped (projected) to a lower-dimensional space.
The above description of the algorithm may help a little, but it is still vague on how UMAP does its magic. So, to answer the "how" question, letโs analyze the individual steps that UMAP performs.
High-level steps performed by UMAP
We can split UMAP into two major steps:
- learning the manifold structure in the high-dimensional space;
- finding a low-dimensional representation of said manifold.
We will, however, break this down into even smaller components to make our understanding of the algorithm deeper. The below map shows the order that we will go through in analyzing each piece.
Step 1 โ Learning the manifold structure
It will come as no surprise, but before we can map our data to lower dimensions, we first need to figure out what it looks like in the higher-dimensional space.
1.1. Finding nearest neighbors UMAP starts by finding the nearest neighbors using the Nearest-Neighbor-Descent algorithm of Dong et al. You will see in the Python section later on that we can specify how many nearest neighbors we want to use by adjusting UMAPโs _nneighbors hyperparameter.
It is important to experiment with the number of _nneighbors because it controls how UMAP balances local versus global structure in the data. It does it by constraining the size of the local neighborhood when attempting to learn the manifold structure.
Essentially, a small value for _n_neighbors means that we want a very local interpretation that accurately captures the fine detail of the structure. In contrast, a large n_neighbors_ value means that our estimates will be based on larger regions, thus more broadly accurate across the manifold as a whole.
1.2. Constructing a graph Next, UMAP needs to construct a graph by connecting the previously identified nearest neighbors. To understand this process, we need to look at a few sub-components that explain how the neighborhood graph comes to be.
1.2.1. Varying distance As outlined in the analysis of the UMAPโs name, we assume a uniform distribution of points across the manifold, suggesting that space between them is stretching or shrinking according to where the data appears to be sparser or denser.
It essentially means that the distance metric is not universal across the whole space, and instead, it varies between different regions. We can visualize it by drawing circles/spheres around each data point, which appear to be different in size because of the varying distance metric (see illustration below).
1.2.2. Local connectivity Next, we want to ensure that the manifold structure we are trying to learn does not result in many unconnected points. Luckily, we can use another hyperparameter called _local_connectivity_ (default value = 1) to solve this potential problem.
When we set _local_connectivity=1,_ we tell the algorithm that every point in the higher-dimensional space is connected to at least one other point. You can see in the above illustration how each solid circle touches at least one data point.
1.2.3. Fuzzy area You must have noticed that the illustration above also contains fuzzy circles extending beyond the closest neighbor. This tells us that the certainty of connection with other points decreases as we get farther away from the point of interest.
The easiest way to think about it is by viewing the two hyperparameters (_local_connectivity and n_neighbors_) as lower and upper bounds:
- local_connectivity (default=1) โ there is 100% certainty that each point is connected to at least one other point (lower limit for a number of connections).
- n_neighbors (default=15) โ there is a 0% chance that a point is directly connected to a 16th+ neighbor since it falls outside the local area used by UMAP when constructing a graph.
- neighbors 2 to 15 โ there is some level of certainty (>0% but <100%) that a point is connected to its 2nd to 15th neighbor.
1.2.4. Merging of edges Finally, we need to understand that the connection certainty discussed above is expressed through edge weights (w).
Since we have employed a varying distance approach, we will unavoidably have cases where edge weights do not align when viewed from the perspective of each point. E.g., the edge weight for points Aโ B will be different from the edge weight of Bโ A.
UMAP overcomes the problem of disagreeing edge weights we just described by taking a union of the two edges. Here is how UMAP documentation explains it:
If we want to merge together two disagreeing edges with weight a and b then we should have a single edge with combined weight ๐+๐โ๐โ ๐. The way to think of this is that the weights are effectively the probabilities that an edge (1-simplex) exists. The combined weight is then the probability that at least one of the edges exists.
In the end, we get a connected neighborhood graph that looks like this:
Step 2 โ Finding a low-dimensional representation
After learning the approximate manifold from the higher-dimensional space, the next step for UMAP is to project it (map it) to a lower-dimensional space.
2.1. Minimum distanceUnlike the first step, we do not want varying distances in the lower-dimensional space representation. Instead, we want the distance on the manifold to be standard Euclidean distance with respect to the global coordinate system.
The switch from varying to standard distances also impacts the proximity to nearest neighbors. Hence, we must pass another hyperparameter called _min_dist (default=0.1)_ to define the minimum distance between embedded points.
Essentially, we can control the minimum spread of points, avoiding scenarios with many points sitting on top of each other in the lower-dimensional embedding.
2.2. Minimizing the cost function (Cross-Entropy)With the minimum distance specified, the algorithm can start looking for a good low-dimensional manifold representation. UMAP does it by minimizing the following cost function, also known as Cross-Entropy (CE):
As you can see, the ultimate goal is to find the optimal weights of edges in the low-dimensional representation. These optimal weights emerge as the above Cross-Entropy cost function is minimized following an iterative stochastic gradient descent process.
And that is it! The UMAPโs job is now complete, and we are given an array containing the coordinates of each data point in a specified lower-dimensional space.
Using UMAP in Python
Finally, we can use our newly acquired knowledge of UMAP to perform dimensionality reduction in Python.
We will apply UMAP on the MNIST dataset (a collection of handwritten digits) to illustrate how we can successfully separate digits and display them in the low-dimensional space.
Setup
We will use the following data and libraries:
-
Scikit-learn library for 1) MNIST digit data (load_digits); 2) splitting data into train and test samples (train_test_split);
- UMAP library for performing dimensionality reduction;
- Plotly and Matplotlib for data visualizations;
- Pandas and Numpy for data manipulation.
The first step is to import the libraries that we have listed above.
# Data manipulation
import pandas as pd # for data manipulation
import numpy as np # for data manipulation
# Visualization
import plotly.express as px # for data visualization
import matplotlib.pyplot as plt # for showing handwritten digits
# Skleran
from sklearn.datasets import load_digits # for MNIST data
from sklearn.model_selection import train_test_split # for splitting data into train and test samples
# UMAP dimensionality reduction
from umap import UMAP
Next, we load the MNIST data and display the images of the first ten handwritten digits.
# Load digits data
digits = load_digits()
# Load arrays containing digit data (64 pixels per image) and their true labels
X, y = load_digits(return_X_y=True)
# Some stats
print('Shape of digit images: ', digits.images.shape)
print('Shape of X (main data): ', X.shape)
print('Shape of y (true labels): ', y.shape)
# Display images of the first 10 digits
fig, axs = plt.subplots(2, 5, sharey=False, tight_layout=True, figsize=(12,6), facecolor='white')
n=0
plt.gray()
for i in range(0,2):
for j in range(0,5):
axs[i,j].matshow(digits.images[n])
axs[i,j].set(title=y[n])
n=n+1
plt.show()
Next, we will create a function for drawing 3D scatter plots that we can reuse multiple times to display results from UMAP dimensionality reduction.
def chart(X, y):
#--------------------------------------------------------------------------#
# This section is not mandatory as its purpose is to sort the data by label
# so, we can maintain consistent colors for digits across multiple graphs
# Concatenate X and y arrays
arr_concat=np.concatenate((X, y.reshape(y.shape[0],1)), axis=1)
# Create a Pandas dataframe using the above array
df=pd.DataFrame(arr_concat, columns=['x', 'y', 'z', 'label'])
# Convert label data type from float to integer
df['label'] = df['label'].astype(int)
# Finally, sort the dataframe by label
df.sort_values(by='label', axis=0, ascending=True, inplace=True)
#--------------------------------------------------------------------------#
# Create a 3D graph
fig = px.scatter_3d(df, x='x', y='y', z='z', color=df['label'].astype(str), height=900, width=950)
# Update chart looks
fig.update_layout(title_text='UMAP',
showlegend=True,
legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
scene_camera=dict(up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-0.1),
eye=dict(x=1.5, y=-1.4, z=0.5)),
margin=dict(l=0, r=0, b=0, t=0),
scene = dict(xaxis=dict(backgroundcolor='white',
color='black',
gridcolor='#f0f0f0',
title_font=dict(size=10),
tickfont=dict(size=10),
),
yaxis=dict(backgroundcolor='white',
color='black',
gridcolor='#f0f0f0',
title_font=dict(size=10),
tickfont=dict(size=10),
),
zaxis=dict(backgroundcolor='lightgrey',
color='black',
gridcolor='#f0f0f0',
title_font=dict(size=10),
tickfont=dict(size=10),
)))
# Update marker size
fig.update_traces(marker=dict(size=3, line=dict(color='black', width=0.1)))
fig.show()
Applying UMAP to our data
Now, we take the MNIST digit data that we have previously loaded into an array X. The shape of X (1797,64) tells us that we have 1,797 digits, each made up of 64 dimensions.
We will use UMAP to reduce the dimensionality from 64 down to 3. Please note that I have listed every hyperparameter available in UMAP with a short explanation of what they do.
While, in this example, I am leaving most of the hyperparameters set to their default values, I encourage you to experiment with them to see how they affect the results.
# Configure UMAP hyperparameters
reducer = UMAP(n_neighbors=100, # default 15, The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation.
n_components=3, # default 2, The dimension of the space to embed into.
metric='euclidean', # default 'euclidean', The metric to use to compute distances in high dimensional space.
n_epochs=1000, # default None, The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings.
learning_rate=1.0, # default 1.0, The initial learning rate for the embedding optimization.
init='spectral', # default 'spectral', How to initialize the low dimensional embedding. Options are: {'spectral', 'random', A numpy array of initial embedding positions}.
min_dist=0.1, # default 0.1, The effective minimum distance between embedded points.
spread=1.0, # default 1.0, The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.
low_memory=False, # default False, For some datasets the nearest neighbor computation can consume a lot of memory. If you find that UMAP is failing due to memory constraints consider setting this option to True.
set_op_mix_ratio=1.0, # default 1.0, The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
local_connectivity=1, # default 1, The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level.
repulsion_strength=1.0, # default 1.0, Weighting applied to negative samples in low dimensional embedding optimization.
negative_sample_rate=5, # default 5, Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
transform_queue_size=4.0, # default 4.0, Larger values will result in slower performance but more accurate nearest neighbor evaluation.
a=None, # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
b=None, # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
random_state=42, # default: None, If int, random_state is the seed used by the random number generator;
metric_kwds=None, # default None) Arguments to pass on to the metric, such as the ``p`` value for Minkowski distance.
angular_rp_forest=False, # default False, Whether to use an angular random projection forest to initialise the approximate nearest neighbor search.
target_n_neighbors=-1, # default -1, The number of nearest neighbors to use to construct the target simplcial set. If set to -1 use the ``n_neighbors`` value.
#target_metric='categorical', # default 'categorical', The metric used to measure distance for a target array is using supervised dimension reduction. By default this is 'categorical' which will measure distance in terms of whether categories match or are different.
#target_metric_kwds=None, # dict, default None, Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
#target_weight=0.5, # default 0.5, weighting factor between data topology and target topology.
transform_seed=42, # default 42, Random seed used for the stochastic aspects of the transform operation.
verbose=False, # default False, Controls verbosity of logging.
unique=False, # default False, Controls if the rows of your data should be uniqued before being embedded.
)
# Fit and transform the data
X_trans = reducer.fit_transform(X)
# Check the shape of the new data
print('Shape of X_trans: ', X_trans.shape)
The above code applies UMAP to our MNIST data and prints the shape of the transformed array to confirm that we have successfully reduced dimensions from 64 to 3.
We can now use the chart plotting function we created earlier to visualize our 3-dimensional digits data. We call the function with a simple line of code passing the arrays we want to visualize.
chart(X_trans, y)
The results look pretty great, with a clear separation between the digit clusters. Interestingly digit 1 formed three distinct clusters, which can be explained by different variations of how people write digit 1:
Note how writing 1 with a base at the bottom makes it looks similar to digit 2. We can find these cases in a small red cluster of 1โs located very close to the green cluster of 2โs.
Supervised UMAP
As mentioned at the beginning of the article, we can also use UMAP in a supervised manner to help reduce the dimensions of our data.
When performing supervised dimensionality reduction, in addition to image data (X_train array), we also need to pass labels data (ytrain array) into a fit_transform_ method (see code below).
Also, I want to bring your attention to the fact that I have made a couple of other minor changes to hyperparameters, setting _min_dist=0.5 and local_connectivity=2_ for nicer visualization and better results on a test sample.
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)
# Configure UMAP hyperparameters
reducer2 = UMAP(n_neighbors=100, n_components=3, n_epochs=1000,
min_dist=0.5, local_connectivity=2, random_state=42,
)
# Training on MNIST digits data - this time we also pass the true labels to a fit_transform method
X_train_res = reducer2.fit_transform(X_train, y_train)
# Apply on a test set
X_test_res = reducer2.transform(X_test)
# Print the shape of new arrays
print('Shape of X_train_res: ', X_train_res.shape)
print('Shape of X_test_res: ', X_test_res.shape)
Now that we have successfully reduced dimensions using the supervised UMAP approach, we can plot 3D scatterplots to show the results.
chart(X_train_res, y_train)
We can see that UMAP has formed very tight clusters of each digit separated by a considerable distance.
We now create the same 3D graph for the test data to see if UMAP could successfully place new data points into these clusters.
chart(X_test_res, y_test)
As you can see, the results are pretty good, with only a few digits placed in the wrong clusters. In particular, it looks like the algorithm struggled with digit 3, with a few examples located next to 7โs, 8โs, and 5โs.
Conclusion
I appreciate you reading this long article, and I hope that every part of it has given you more insight into how this great algorithm operates.
In general, UMAP has a solid mathematical foundation, and it can often do a better job than similar dimensionality reduction algorithms such as t-SNE.
The secret lies in UMAPโs ability to infer both local and global structures while preserving relative global distances in the lower-dimensional space. These abilities enable us to unlock specific insights, such as finding similarities between the handwritten forms of digits 1 and 2.
Please feel free to share your thoughts and feedback, which will help me write better articles in the future.
Cheers ๐ Saul Dobilas
Other articles you may find interesting:
t-SNE Machine Learning Algorithm โ A Great Tool for Dimensionality Reduction in Python
LLE: Locally Linear Embedding โ A Nifty Way to Reduce Dimensionality in Python
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS