Data Science

t-SNE Machine Learning Algorithm – A Great Tool for Dimensionality Reduction in Python

How to use t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize high-dimensionality data?

Saul Dobilas

Sep 26, 2021

11 min read

Machine Learning

👁 t-SNE visualization with different perplexities. Gif image by author.

t-SNE visualization with different perplexities. Gif image by author.

Intro

A successful data scientist understands a wide range of Machine Learning algorithms and can explain the results to stakeholders. But, unfortunately, not every stakeholder has a sufficient amount of training to grasp the complexities of ML.

Luckily, we can aid our explanations by using dimensionality reduction techniques to create visual representations of high dimensional data. This article will take you through one such technique called t-Distributed Stochastic Neighbor Embedding (t-SNE).

t-SNE’s place in the universe of Machine Learning algorithms
An intuitive explanation of how t-SNE works
An example of using t-SNE in Python

t-Distributed Stochastic Neighbor Embedding (t-SNE) in the universe of Machine Learning algorithms

Perfect categorization of Machine Learning techniques is not always possible due to the flexibility demonstrated by specific algorithms, making them useful when solving different problems (e.g., one can use k-NN for regression and classification).

Nevertheless, it is still beneficial to bring some structure to the ML universe, and the below graph is my attempt to do just that. Make sure to explore this interactive graph by clicking👇 on various categories to enlarge and reveal more.

If you enjoy Data Science and Machine Learning, please subscribe to get an email whenever I publish a new story.

As you can see, t-SNE is a dimensionality reduction technique that belongs to the Unsupervised branch of Machine Learning algorithms. It means that we do not need to have labeled data to use it. That is in contrast to supervised techniques such as Linear Discriminant Analysis (LDA).

While not illustrated above, it is worth knowing that dimensionality reduction can be further split into two types:

Parametric – creates an explicit mapping function that we can use on test data, i.e., use it to get the location of new points within the lower-dimensional embedding (e.g., PCA);
Non-parametric – does not create an explicit mapping function; hence it cannot easily map new points to the lower-dimensional embedding. Although, in practice, it is possible to construct some workarounds to enable such mapping of new points.

t-SNE belongs to a non-parametric group of techniques. Hence, the primary use of it is typically for visualization purposes.

How does t-Distributed Stochastic Neighbor Embedding (t-SNE) work?

Analyzing the t-SNE name

Let’s start by dissecting the t-SNE name, which will give us a broad idea of what the algorithm is supposed to do. We will do this in reverse (i.e., from right to left).

Note, the below statements are not official definitions but rather a set of descriptions that help us understand key ideas behind t-SNE.

Embedding – typically high-dimensional data represented in a lower-dimensional space;
Neighbor – a data point that resides close to the data point of interest;
Stochastic -the use of randomness in the iterative process when searching for a representative embedding;
t-Distributed – probability distribution used by the algorithm to calculate similarity scores in the lower-dimensional embedding.

Putting the above statements together, we can describe t-SNE as a technique that utilizes a gradual iterative approach to find a lower-dimensional representation of the original data while preserving information about local neighborhoods.

Note that you do not need to pay much attention to the probability distribution element for the general understanding of this algorithm. t-distribution is used in steps 2 and 3 described below and is chosen instead of Normal distribution because it has a "flatter" shape with "higher" tails. Such shape helps to spread out the points in the lower-dimensional space. Otherwise, you may end up having a bunch of points stacked on top of each other.

The steps performed by t-SNE

Step 1t-SNE starts by determining the "similarity" of points based on distances between them. Nearby points are considered "similar," while distant ones are considered "dissimilar."

It achieves this by measuring distances between the point of interest and other points and then placing them on a Normal curve. It does this for every point, applying some scaling to account for variations in the density of different regions.

For example, the below illustration has a higher density in the region occupied by blue points and a lower density in the region occupied by yellow points.

👁 Similarity scores. Image by author.

Similarity scores. Image by author.

The result of these calculations is a matrix that contains similarity scores between each pair of points from the original multidimensional space.

👁 Similarity matrix. Image by author.

Similarity matrix. Image by author.

Step 2 Next, t-SNE randomly maps all the points onto a lower-dimensional space and calculates "similarities" between points as described in the process above. One difference, though, this time, the algorithm uses t-distribution instead of Normal distribution.

Unsurprisingly, though, the new "similarity" matrix differs significantly from the original one because of the random mapping. Here is an example of what it might look like.

👁 An example of what a new "similarity" matrix might look like. Image by author.

An example of what a new "similarity" matrix might look like. Image by author.

Step 3Now the goal of an algorithm is to make the new "similarity" matrix look like the original one by using an iterative approach. With each iteration, points move towards their "closest neighbors" from the original higher-dimensional space and away from the distant ones.

The new "similarity" matrix gradually begins to look more like the original one. The process continues until the maximum number of iterations is reached or no further improvement can be made.

In more scientific terms, the above explanation describes the process of an algorithm trying to minimize the Kullback–Leibler divergence (KL divergence) through gradient descent.

Perplexity

One important aspect that I haven’t mentioned yet is a hyper-parameter known as perplexity. It describes the expected density around each point or, in other words, relates to the target number of nearest neighbors from the point of interest.

Perplexity parameter plays a vital role in determining the final result of your embedding. Generally, you may want to choose perplexity to be somewhere between 5 and 50, but you should definitely experiment with different values.

At the start of this article, I have included a gif image to demonstrate how the final visualization changes when using different values for perplexity. A low value makes the algorithm "focus" on fewer neighbors, which results in many small groups. By contrast, high perplexity value "expands the neighborhood horizon," resulting in fewer, more tightly packed groups.

👁 Image

How to use t-SNE in Python?

Finally, we understand the algorithm well, so it is time to use it in Python.

We will apply t-SNE on the MNIST dataset (a collection of handwritten digits) to illustrate how we can successfully visualize high-dimensionality data.

Setup

We will use the following data and libraries:

Scikit-learn library for 1) MNIST digit data from sklearn’s datasets (load_digits); 2) performing dimensionality reduction(t-SNE);
Plotly and Matplotlib for data visualizations;
Pandas for data manipulation.

The first step is to import the libraries that we have listed above.

# Data manipulation
import pandas as pd # for data manipulation

# Visualization
import plotly.express as px # for data visualization
import matplotlib.pyplot as plt # for showing handwritten digits

# Skleran
from sklearn.datasets import load_digits # for MNIST data
from sklearn.manifold import TSNE # for t-SNE dimensionality reduction

Next, we load the MNIST data.

# Load digits data 
digits = load_digits()

# Load arrays containing digit data (64 pixels per image) and their true labels
X, y = load_digits(return_X_y=True)

# Some stats
print('Shape of digit images: ', digits.images.shape)
print('Shape of X (training data): ', X.shape)
print('Shape of y (true labels): ', y.shape)

👁 Shapes of the imported data. Image by author.

Shapes of the imported data. Image by author.

Now, let’s display the first ten handwritten digits to understand better what we are working with.

# Display images of the first 10 digits
fig, axs = plt.subplots(2, 5, sharey=False, tight_layout=True, figsize=(12,6), facecolor='white')
n=0
plt.gray() 
for i in range(0,2):
 for j in range(0,5):
 axs[i,j].matshow(digits.images[n])
 axs[i,j].set(title=y[n])
 n=n+1
plt.show()

👁 Images of the first ten handwritten digits (8x8=64 pixels, i.e., 64 dimensions). Image by author.

Images of the first ten handwritten digits (8×8=64 pixels, i.e., 64 dimensions). Image by author.

Applying t-SNE to our data

In the previous step, we loaded data into an array X of the shape (1797,64), which means that we have 1,797 digits, each made up of 64 dimensions.

We will now use t-SNE to reduce the dimensionality from 64 down to 2. Note, I am using default values for most hyper-parameters. However, I have also included short explanations of each parameter so that you can experiment by trying out different settings.

# Configure t-SNE function. 
embed = TSNE(
 n_components=2, # default=2, Dimension of the embedded space.
 perplexity=10, # default=30.0, The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms.
 early_exaggeration=12, # default=12.0, Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. 
 learning_rate=200, # default=200.0, The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a 'ball' with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers.
 n_iter=5000, # default=1000, Maximum number of iterations for the optimization. Should be at least 250.
 n_iter_without_progress=300, # default=300, Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration. 
 min_grad_norm=0.0000001, # default=1e-7, If the gradient norm is below this threshold, the optimization will be stopped.
 metric='euclidean', # default='euclidean', The metric to use when calculating distance between instances in a feature array.
 init='random', # {'random', 'pca'} or ndarray of shape (n_samples, n_components), default='random'. Initialization of embedding
 verbose=0, # default=0, Verbosity level.
 random_state=42, # RandomState instance or None, default=None. Determines the random number generator. Pass an int for reproducible results across multiple function calls.
 method='barnes_hut', # default='barnes_hut'. By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method='exact' will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. 
 angle=0.5, # default=0.5, Only used if method='barnes_hut' This is the trade-off between speed and accuracy for Barnes-Hut T-SNE.
 n_jobs=-1, # default=None, The number of parallel jobs to run for neighbors search. -1 means using all processors. 
)

# Transform X
X_embedded = embed.fit_transform(X)

# Print results
print('New Shape of X: ', X_embedded.shape)
print('Kullback-Leibler divergence after optimization: ', embed.kl_divergence_)
print('No. of iterations: ', embed.n_iter_)
#print('Embedding vectors: ', embed.embedding_)

The above code prints some basic stats:

👁 Basic stats from our t-SNE embedding. Image by author.

Basic stats from our t-SNE embedding. Image by author.

Finally, let’s plot the new X array in a 2-dimensional scatter plot. Note that we also use color to represent the actual labels of digits. This helps us to see how similar digits cluster together.

# Create a scatter plot
fig = px.scatter(None, x=X_embedded[:,0], y=X_embedded[:,1], 
 labels={
 "x": "Dimension 1",
 "y": "Dimension 2",
 },
 opacity=1, color=y.astype(str))

# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey', 
 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey', 
 showline=True, linewidth=1, linecolor='black')

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey', 
 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey', 
 showline=True, linewidth=1, linecolor='black')

# Set figure title
fig.update_layout(title_text="t-SNE")

# Update marker size
fig.update_traces(marker=dict(size=3))

fig.show()

👁 2D t-SNE embedding of 64-dimensional MNIST digit data. Image by author.

2D t-SNE embedding of 64-dimensional MNIST digit data. Image by author.

We can see that t-SNE has done a pretty good job identifying similar digits, with most of them forming tight groups. However, there are a few small clusters, namely 1 (red), 3 (purple), and 9 (yellow), that seem to be quite distant from their primary groups. Such results may indicate the existence of different ways of writing the same digit.

Conclusions

t-SNE is a great tool to visualize the similarities between different data points, which can aid your analysis in various ways.

E.g., it may help you spot different ways of writing the same digit or enable you to find word synonyms/phrases with similar meaning while performing NLP analysis. At the same time, you can use it as a visual aid when explaining your analysis to stakeholders.

However, as with every Machine Learning approach, t-SNE has some limitations that you need to keep in mind when using it:

It is a non-parametric approach to dimensionality reduction, meaning that it does not create an explicit mapping function for you to use on new data points;
You may need to experiment with different perplexity values to find the one that suits your data best;
You may get different results after each run due to its random initialization and stochastic nature. If you want reproducible results, make sure to specify a seed (random_state).

I sincerely hope that this article has given you some new insights into t-SNE and helped you expand your ML arsenal. If you have any questions or suggestions, I would be happy to hear from you.

Cheers 👏 Saul Dobilas

Other articles you may find interesting:

MDS: Multidimensional Scaling – Smart Way to Reduce Dimensionality in Python

Isomap Embedding – An Awesome Approach to Non-linear Dimensionality Reduction

Written By

Saul Dobilas

See all from Saul Dobilas

Data Science, Dimensionality Reduction, Editor’s Picks, Machine Learning, Programming

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/t-sne-machine-learning-algorithm-a-great-tool-for-dimensionality-reduction-in-python-ec01552f1a1e/