Semi-Supervised Learning – How to Assign Labels with Label Propagation Algorithm
How does Semi-Supervised Machine Learning work, and how to use it in Python?
Intro
Despite the abundance of data all around us, the vast majority of it is unstructured and unlabeled. At the same time, many of our Machine Learning applications, such as classification models, require us to have target labels.
Unfortunately, we may not always have the resources to go through tens of thousands of observations and assign labels manually. But what if we did not have to do that? What if we could automatically label large amounts of data using just a tiny fraction of examples?
Let me introduce you to Semi-Supervised Learning!
Contents
- What is Semi-Supervised Learning?
- Where does the Label Propagation algorithm sit within the Machine Learning universe?
- An intuitive explanation of how Label Propagation works
- An example of using Label Propagation in Python
What is Semi-Supervised Learning?
Typically, we would use data with a specific target variable (labeled data) to build supervised models (e.g., classification, regression). Alternatively, we would build unsupervised models (e.g., clustering, dimensionality reduction) when we do not have labeled data.
However, sometimes we may find ourselves in situations with a small amount of labeled data and a significant amount of unlabeled data. That’s where Semi-Supervised Learning can help since it incorporates elements from both supervised and unsupervised techniques.
Example
Let’s consider an example. Assume you have 10,000 sentences with user comments, and you want to classify them into positive and negative. Unfortunately, you only have 50 sentences to which you have previously manually assigned a label (positive, negative).
Unless you want to spend many more hours labeling the rest of the data, your options are:
- Build a supervised model using 50 labeled examples – this may result in a poor-performing model due to the small number of samples available.
- Build an unsupervised model with unlabeled data to group examples into two clusters. However, the data may naturally want to form multiple smaller clusters instead, and forcing them into just two groups may not necessarily split them amongst the intended target (positive/negative).
- Build a semi-supervised model using all labeled and unlabeled data – this will use 50 examples to label the rest of the data and give you a much larger dataset to work with when building a supervised sentiment prediction model.
I’m sure you are now curious to find out how this works. So, let’s take a closer look at one specific algorithm called Label Propagation.
Label Propagation in the universe of Machine Learning algorithms
You already know that we will be diving deeper into the Label Propagation algorithm under the Semi-Supervised branch of Machine Learning. However, it is always beneficial to take a step back for a minute and visualize the large universe of ML models before getting immersed in one corner of it.
Below is my attempt at categorizing some of the most popular Machine Learning algorithms.
An intuitive explanation of how Label Propagation works
Label Propagation is a relatively simple algorithm based on the assumption that closer data points have similar class labels. As a result, we can propagate these class labels through dense unlabeled data regions.
The algorithm follows an iterative approach, which we can describe as a collection of the following steps:
- Create a connected graph by drawing edges (links) between different nodes (data points). Note that creating a fully connected graph on a large dataset may demand a high amount of resources from your machine. Hence, it is often beneficial to limit the number of neighbors that you want to join together (see n_neighbors in the Python example section).
- Determine the weights for each edge, where edges for closer data points have larger weights (stronger connection), and edges for faraway points have smaller weights (weaker connection). Larger edge weights allow labels to travel through more easily, increasing the probability of propagating the particular label.
- Perform a random walk from each unlabeled point to find a probability distribution of reaching a labeled one. This random walk consists of many iterations and continues until convergence is reached, i.e., the probabilities no longer change from one iteration to the next (or until the maximum number of iterations is hit).
Unlabeled points get their new labels assigned based on the probabilities found by the process above. Note that original labeled points never change since their labels are clamped (fixed).
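To make the steps above more concrete, here is a minimal NumPy sketch of the idea. It is a toy illustration under my own assumptions, not scikit-learn's implementation: the function name `propagate_labels` is hypothetical, the graph is fully connected with RBF weights, and `gamma`, `max_iter`, and `tol` simply borrow the parameter names used later in this article.

```python
import numpy as np

def propagate_labels(X, y, gamma=20.0, max_iter=1000, tol=1e-3):
    """Toy label propagation on a fully connected RBF graph (illustrative only).

    X: (n_samples, n_features) array; y: integer labels, where -1 marks unlabeled.
    Returns (hard_labels, class_probabilities).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y[y != -1])
    labeled = y != -1

    # Steps 1-2: connect every pair of points; closer points get larger weights.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dist)

    # Row-normalize so each row is a probability distribution over neighbors.
    T = W / W.sum(axis=1, keepdims=True)

    # One-hot rows for known labels; unlabeled rows start at zero.
    Y0 = np.zeros((len(y), len(classes)))
    Y0[labeled] = (y[labeled][:, None] == classes[None, :]).astype(float)

    # Step 3: iterate until the label distributions stop changing.
    F = Y0.copy()
    for _ in range(max_iter):
        F_new = T @ F
        F_new[labeled] = Y0[labeled]  # clamp the known labels
        row_sums = F_new.sum(axis=1, keepdims=True)
        F_new = F_new / np.where(row_sums == 0, 1.0, row_sums)
        converged = np.abs(F_new - F).sum() < tol
        F = F_new
        if converged:
            break

    return classes[F.argmax(axis=1)], F

# Two tight groups on a line, with one known label in each group.
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([0, -1, -1, -1, -1, 1])
labels, probs = propagate_labels(X_toy, y_toy)
print(labels)  # [0 0 0 1 1 1]
```

Each unlabeled point ends up with the label of the known point it is most strongly connected to, while the two clamped points keep their original labels throughout.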
Here is a gif image to give you an intuitive view of how labels get propagated through the network.
I have designed the above example to show a scenario where a Semi-Supervised approach would have an advantage over using either a Supervised or Unsupervised one.
Note that we have three labeled samples available at the start (see below image). Based on this information, we can infer that red labels are likely to be centered in the middle, with blue ones around the outside (although it would always be beneficial to have more known labels to ensure our inference is correct).
A supervised model would struggle to draw a decision boundary in this scenario as it would not have the unlabeled data for context. Meanwhile, an unsupervised model would not do a great job either since there are no two clearly defined clusters that would separate red and blue points.
As seen in the gif image, the Label Propagation algorithm is able to propagate labels outwards through the network, making the best use of the entire data (labeled and unlabeled).
Another critical aspect of the Label Propagation algorithm is that we can view the corresponding probabilities in addition to hard labels after the algorithm has finished running. Hence, we could manually adjust the threshold and re-label some points if we were not happy with the boundary determined by the algorithm.
See the below interactive 3D graph, which shows a probability of belonging to the red label (label 1) alongside the two dimensions (Dim 1, Dim 2) that we had in the picture and gif image above.
As you can see, hard labels are assigned based on the probability of belonging to a particular class, with 0.5 being the threshold. However, the model is less sure about the points located closer to the boundary. Hence, if we wished to do so, we could move the threshold up or down and reclassify marginal cases.
An example of using Label Propagation in Python
Let’s now leave theory behind and use real data with the Label Propagation algorithm.
The data I’ve chosen has labels available for all observations. Therefore, we will mask many of those labels before sending the data through the Label Propagation algorithm and then use actual labels to evaluate how well the model has performed.
Setup
We will use the following data and libraries:
- Marketing campaign data from Kaggle
- Scikit-learn library for 1) feature scaling (MinMaxScaler); 2) performing Label Propagation (LabelPropagation); 3) model evaluation (classification_report, confusion_matrix, ConfusionMatrixDisplay)
- Plotly and Matplotlib for data visualizations
- Pandas and NumPy for data manipulation
The first step is to import the libraries that we have listed above.
# Data manipulation
import pandas as pd # for data manipulation
import numpy as np # for data manipulation
# Visualization
import plotly.express as px # for data visualization
import plotly.graph_objects as go # for data visualization
import matplotlib.pyplot as plt # for displaying confusion matrix
# Sklearn
from sklearn.metrics import classification_report # for model evaluation metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay # for showing confusion matrix
from sklearn.preprocessing import MinMaxScaler # for feature scaling
from sklearn.semi_supervised import LabelPropagation # for assigning labels to unlabeled data
Next, we download and ingest marketing campaign data (source: Kaggle). This time we will make use of just two features to build a connected graph and propagate labels. Hence, I have limited the ingestion to a few key columns instead of reading in the entire table.
Also, you will see that we have derived a few additional fields required for creating a target variable with masked labels.
# Read in data
df = pd.read_csv('marketing_campaign.csv',
                 encoding='utf-8', delimiter=';',
                 usecols=['ID', 'Year_Birth', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'MntWines', 'MntMeatProducts']
                 )
# Create a flag to denote whether the person has any dependents at home (either kids or teens)
df['Dependents_Flag'] = ((df['Kidhome'] + df['Teenhome']) > 0).astype(int)
# Randomly select 15% of observations to keep the label for. The rest are treated as unlabeled
df['Rand_Selection'] = False
df.loc[df.sample(frac=0.15, random_state=42).index, 'Rand_Selection'] = True
# Create a new target column with labels. The 1's and 0's are original labels and -1 represents unlabeled data
df['Dependents_Target'] = np.where(df['Rand_Selection'], df['Dependents_Flag'], -1)
# Show target value distribution
print('Target Value Distribution:')
print(df['Dependents_Target'].value_counts())
# Print dataframe
df
The below snippet shows the data and the distribution of a target variable.
Note that we have kept 15% of the actual labels (1’s and 0’s) and masked the remaining 85% (-1’s). Hence, our target contains information on whether the shopper has any dependents (1), does not have any dependents (0), or this information is masked (-1).
We will now attempt to assign a label to those 85% masked observations.
Performing Label Propagation
The next piece of code consists of a few steps that help us prepare the data, fit the model and print the results.
We will use the amount of money shoppers spend annually on wine and meat products as the two dimensions to create a connected graph and infer whether they have any dependents at home.
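Because the graph is built from distances between points, a feature with a larger numeric range would otherwise dominate the neighbor search, which is why the code below rescales both features in Step 2. Here is a small sketch of what Min-Max scaling does, using made-up spend values (`X_demo` is hypothetical and not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical spend values (not from the dataset): columns are meat, wine.
X_demo = np.array([[10.0, 100.0],
                   [50.0, 900.0],
                   [30.0, 500.0]])

# Manual Min-Max scaling, column by column: (x - min) / (max - min).
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation with scikit-learn.
X_sklearn = MinMaxScaler().fit_transform(X_demo)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))  # True
```

After scaling, both columns span the range 0 to 1, so a difference in wine spend no longer outweighs a comparable relative difference in meat spend.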
### Step 1 - Select data
X=df[['MntMeatProducts', 'MntWines']]
y=df['Dependents_Target'].values
### Step 2 - Perform Min-Max scaling
scaler=MinMaxScaler()
X_scaled=scaler.fit_transform(X)
### Step 3 - Configure model parameters
model_LP_knn = LabelPropagation(kernel='knn', # {'knn', 'rbf'} default='rbf'
                                #gamma=70, # default=20, Parameter for rbf kernel.
                                n_neighbors=20, # default=7, Parameter for knn kernel which is a strictly positive integer.
                                max_iter=1000, # default=1000, Maximum number of iterations allowed.
                                tol=0.001, # default=1e-3, Convergence tolerance: threshold to consider the system at steady state.
                                n_jobs=-1, # default=None, The number of parallel jobs to run. -1 means using all processors.
                                )
### Step 4 - Fit the model
LP_knn=model_LP_knn.fit(X_scaled, y)
### Step 5 - Exclude observations with known labels before evaluating model performance
df_eval=df[['Dependents_Flag', 'Dependents_Target']].copy() # Copy dataframe with dependents info
df_eval['Predicted_label']=LP_knn.transduction_ # Attach model predictions
df_eval=df_eval[df_eval['Dependents_Target']==-1] # Keep only records containing masked labels
### Step 6 - Print the summary of model results
print("Model Name: ", str(LP_knn))
print("Classes: ", LP_knn.classes_)
print("Label Distributions: ", LP_knn.label_distributions_)
print("Transduction Label: ", LP_knn.transduction_)
print("No. of features: ", LP_knn.n_features_in_)
print("No. of iterations: ", LP_knn.n_iter_)
print('')
print('*************** Evaluation of LP knn model ***************')
print(classification_report(df_eval['Dependents_Flag'], df_eval['Predicted_label']))
print('')
print('******************** Confusion Matrix ********************')
cm= confusion_matrix(df_eval['Dependents_Flag'], df_eval['Predicted_label'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=LP_knn.classes_)
disp.plot()
plt.show()
And here are the results:
As you can see, we have been relatively successful at inferring labels with a model accuracy of 83% (to make the assessment fair, we only used records with masked labels for model performance evaluation). It is a pretty good result, given that we only used labels from 15% of observations. However, we could improve it further by either increasing the number of known labels or exploring additional dimensions for a connected graph.
Let’s plot the results on a 3D graph for better visualization.
# Specify a size of the mesh to be used
mesh_size=10
margin=0
# Create a mesh grid for displaying a threshold plane
x_min, x_max = df['MntMeatProducts'].min() - margin, df['MntMeatProducts'].max() + margin
y_min, y_max = df['MntWines'].min() - margin, df['MntWines'].max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)
# Set Z values to 0.5 for the threshold plane
Z = np.ones_like(xx)*0.5
# Create a 3D scatter plot
fig = px.scatter_3d(df, x=df['MntMeatProducts'], y=df['MntWines'], z=LP_knn.label_distributions_[:,1],
                    color=df['Dependents_Flag'].astype('str'),
                    # Map each class to a fixed color rather than relying on the order of appearance
                    color_discrete_map={'0': 'black', '1': 'red'},
                    opacity=0.8,
                    hover_data=['Marital_Status', 'MntWines', 'Dependents_Target', 'Dependents_Flag'],
                    height=900, width=900
                    )
# Update chart looks
fig.update_layout(
    # title_text="Scatter 3D Plot",
    showlegend=False,
    scene_camera=dict(
        up=dict(x=0, y=0, z=1),
        center=dict(x=0, y=0, z=-0.2),
        eye=dict(x=-1.5, y=1.5, z=0.5),
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    scene=dict(
        xaxis=dict(
            backgroundcolor='white',
            color='black',
            gridcolor='#f0f0f0',
            title_font=dict(size=10),
            tickfont=dict(size=10),
        ),
        yaxis=dict(
            backgroundcolor='white',
            color='black',
            gridcolor='#f0f0f0',
            title_font=dict(size=10),
            tickfont=dict(size=10),
        ),
        zaxis=dict(
            backgroundcolor='lightgrey',
            color='black',
            gridcolor='#f0f0f0',
            title_font=dict(size=10),
            title_text='Probability',
            tickfont=dict(size=10),
            dtick=0.1,
        ),
    ),
)
# Update marker size
fig.update_traces(marker=dict(size=3))
# Add prediction plane
fig.add_traces(go.Surface(x=xrange, y=yrange, z=Z, name='Separator',
                          colorscale='Gray', opacity=0.5, showscale=False))
fig.show()
In this graph, color represents the true label:
- red=1, i.e., have dependents
- black=0, i.e., do not have dependents.
Hence, red points in the upper half of the graph and black points in the lower half represent correctly identified labels. Meanwhile, black points in the upper half and red ones in the lower half represent incorrectly identified labels. Note that the points with known labels used in model training sit at exactly probability=0 or probability=1 because their labels are clamped.
Conclusions
Semi-Supervised Learning and Label Propagation can be a massive help in situations where labeled data is scarce. However, it should be used with caution, and you should first test it with known labels to ensure that the approach suits your data.
It will generally work well where the data tends to form clusters, provided those clusters have consistent labels within them.
I hope you enjoyed reading this article and learned some new practical knowledge to help you with your Data Science journey! Please do not hesitate to reach out if you have any questions or suggestions.
Cheers 👏 Saul Dobilas