
URL: https://towardsdatascience.com/how-to-benefit-from-the-semi-supervised-learning-with-label-spreading-algorithm-2f373ae5de96/



How to Benefit from the Semi-Supervised Learning with Label Spreading Algorithm

A detailed explanation of how the Label Spreading algorithm works, with a Python example


Label Spreading. Image by author.

Intro

This is the second article covering Semi-Supervised Learning, where I explore ways of using labeled and unlabeled data together to build better models.

This time I focus on the Label Spreading algorithm, which attempts to construct a smooth classifying function based on the intrinsic structure revealed by known labeled and unlabeled points.

While similar to Label Propagation, Label Spreading does a few things differently, which will be explored later in the article.

Contents

  • The place of Label Spreading within the universe of Machine Learning algorithms
  • Main differences between Label Spreading and Label Propagation
  • A brief explanation of how Label Spreading works
  • How to use Label Spreading in Python?

Label Spreading within the universe of Machine Learning algorithms

Sometimes we find ourselves having a mix of labeled data (perfect for supervised learning like classification or regression) and unlabeled data (perfect for unsupervised learning like clustering or dimensionality reduction).

However, to get the best results, it is often beneficial to combine these two sets of data. Such a situation is an excellent example of where we would want to use a Semi-Supervised Learning approach, with the Label Spreading algorithm being one of our options.



Differences between Label Spreading and Label Propagation

If you are already familiar with the Label Propagation algorithm, you may want to know about the two ways that Label Spreading differs from it. If you are not familiar with Label Propagation, then feel free to skip to the next section.

Symmetric normalized Laplacian vs. random walk normalized Laplacian

The Label Spreading algorithm uses a symmetric normalized graph Laplacian matrix in its calculations, while Label Propagation employs a random walk normalized Laplacian.

However, note that the two matrices are similar and that one can be derived from the other. Hence, from the perspective of this article, it is not crucial for us to understand the nuances of these two matrices.
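To make the "one can be derived from the other" point concrete, here is a short sketch in standard graph notation. The symbols D (the diagonal degree matrix), S, and P are notation assumed here for illustration. The normalized matrices are written in the form that typically appears in the iterative updates, rather than as the identity minus that form.

```latex
D_{ii} = \sum_j W_{ij}, \qquad
S = D^{-1/2} \, W \, D^{-1/2}, \qquad
P = D^{-1} \, W
\\[4pt]
P = D^{-1/2} \, S \, D^{1/2}
```

Because P is obtained from S by a similarity transformation, the two matrices share the same eigenvalues, which is why the distinction matters little for the purposes of this article.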

Soft clamping vs. Hard clamping

Label Propagation uses hard clamping, which means that the labels of the originally labeled points never change.

Meanwhile, Label Spreading adopts soft clamping controlled through a hyperparameter α (alpha), which specifies the relative amount of information the point obtains from its neighbors vs. its initial label information.
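A tiny numeric sketch may help show what α does for a single point. The function name `soft_clamp` and the example vectors are hypothetical, chosen only for illustration. The blend itself is the same "neighbors vs. initial label" mix described above.

```python
import numpy as np

def soft_clamp(neighbor_info, initial_label, alpha):
    """Blend neighbor information with the initial label for one point."""
    neighbor_info = np.asarray(neighbor_info, dtype=float)
    initial_label = np.asarray(initial_label, dtype=float)
    return alpha * neighbor_info + (1 - alpha) * initial_label

initial_label = [1.0, 0.0]  # the point was originally labeled as class 0
neighbor_info = [0.2, 0.8]  # its neighbors mostly point to class 1

low_alpha = soft_clamp(neighbor_info, initial_label, alpha=0.1)
high_alpha = soft_clamp(neighbor_info, initial_label, alpha=0.9)

print("alpha=0.1:", low_alpha, "-> class", low_alpha.argmax())
print("alpha=0.9:", high_alpha, "-> class", high_alpha.argmax())
```

With a low alpha the point keeps its original class, while with a high alpha its neighbors can outvote the initial label. Hard clamping never allows the latter.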

A brief explanation of how Label Spreading works

Four steps describe how the Label Spreading algorithm operates.

1. Define a pairwise relationship between points, called affinity matrix W. The matrix is created with the help of a Radial Basis Function kernel (a.k.a. RBF kernel), which is used to determine edge weights. Note that matrix W contains 0’s in the diagonal since no edge connects a point to itself.

Weights calculation for edges that connect each pair of points. Image by author.
Note that sklearn's implementation of the RBF kernel looks slightly different, as it replaces 1/(2σ²) with a hyperparameter gamma. The effect is the same, as it allows you to control the smoothness of the function.
A low gamma extends the influence of each individual point widely, creating a smooth transition in label probabilities. Meanwhile, a high gamma makes the weights decay quickly with distance, so only the closest neighbors have influence over the label probabilities.
Sklearn’s implementation of the RBF kernel. Image by author.

And here is what the affinity matrix looks like:

Affinity matrix. Image by author.

2. Create a symmetric normalized graph Laplacian matrix. This step takes affinity matrix W and normalizes it symmetrically, which helps with the convergence in step 3.

Symmetric normalized graph Laplacian matrix S. Image by author.

3. The third step is an iterative process that uses matrix multiplication to spread information from labeled points to unlabeled points.

An iterative process to find the labels. Image by author.

Each point receives the information from its neighbors (first term) and also retains its initial information (second term). The parameter α (alpha) enables soft clamping by controlling the proportion of information received from neighbors vs. the initial label. Alpha close to 0 keeps all the initial label information (equivalent to hard clamping), while alpha close to 1 allows most of the initial label information to be replaced.

Note that F(0)=Y, so the iterative process starts with the initial label information.

4. After the process in step 3 converges or reaches the specified maximum number of iterations, we arrive at the final step of assigning the labels.

Matrix F contains label vectors, representing the probabilities of each point belonging to a specific class (i.e., having a particular label). The final label is then chosen using the argmax operation, meaning that the algorithm assigns the label with the highest probability.
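The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's actual implementation: the function name `label_spreading_sketch` and the toy data are hypothetical, `gamma` plays the role of 1/(2σ²), unlabeled points are marked with -1, and every point is assumed to have at least one neighbor with a non-zero weight.

```python
import numpy as np

def label_spreading_sketch(X, y, gamma=20.0, alpha=0.5, max_iter=100, tol=1e-3):
    """Illustrative NumPy version of the four steps; y uses -1 for unlabeled points."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y[y != -1])

    # Step 1: affinity matrix W from an RBF kernel, with zeros on the diagonal
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dists)
    np.fill_diagonal(W, 0.0)

    # Step 2: symmetric normalization S = D^(-1/2) W D^(-1/2)
    # (assumes no row of W sums to zero)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Initial label matrix Y: one-hot rows for labeled points, zero rows otherwise
    Y = (y[:, None] == classes[None, :]).astype(float)

    # Step 3: iterate F(t+1) = alpha * S F(t) + (1 - alpha) * Y, starting from F(0) = Y
    F = Y.copy()
    for _ in range(max_iter):
        F_next = alpha * (S @ F) + (1 - alpha) * Y
        converged = np.abs(F_next - F).sum() < tol
        F = F_next
        if converged:
            break

    # Step 4: assign each point the class with the largest score
    return classes[F.argmax(axis=1)], F

# Toy data: two small 1D clusters with one labeled point each
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([0, -1, -1, 1, -1, -1])

labels, F = label_spreading_sketch(X_toy, y_toy)
print(labels)
```

In this toy example, each unlabeled point picks up the label of the labeled point in its own cluster, because the RBF weights between the two clusters are negligible compared with the weights inside each cluster.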


How to use Label Spreading in Python?

It is finally time to use Label Spreading on real data.

Note, for this example, we have chosen marketing campaign data that have labels available, which will help us to evaluate the performance of our semi-supervised model.

Of course, before we fit the model, we will mask most of the labels to simulate a scenario of mainly having unlabeled data.

Setup

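We will use the following data and libraries:

  • Marketing campaign data from Kaggle
  • Pandas for data manipulation
  • Plotly and Matplotlib for data visualization
  • Scikit-learn for feature scaling (MinMaxScaler), label spreading (LabelSpreading), and model evaluation (classification_report, confusion_matrix, ConfusionMatrixDisplay)

The first step is to import the libraries that we have listed above.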

# Data manipulation
import pandas as pd # for data manipulation

# Visualization
import plotly.express as px # for data visualization
import plotly.graph_objects as go # for data visualization
import matplotlib.pyplot as plt # for showing confusion matrix

# Sklearn
from sklearn.metrics import classification_report # for model evaluation metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay # for showing confusion matrix
from sklearn.preprocessing import MinMaxScaler # for feature scaling
from sklearn.semi_supervised import LabelSpreading # for assigning labels to unlabeled data

Next, we download and ingest marketing campaign data (source: Kaggle). This time we will use only two features to spread the labels. Hence, I have limited the ingestion to a few key columns instead of reading in the entire table.

Also, you will see that we have derived a few additional fields required for creating a target variable with masked labels.

# Read in data
df = pd.read_csv('marketing_campaign.csv', 
 encoding='utf-8', delimiter=';',
 usecols=['ID', 'Year_Birth', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'MntWines', 'MntMeatProducts'] 
 )

# Create a flag to denote whether the person has any dependents at home (either kids or teens)
df['Dependents_Flag']=df.apply(lambda x: 1 if x['Kidhome']+x['Teenhome']>0 else 0, axis=1)

# Randomly select 2% of observations to keep the label for. The rest of obs will have their labels masked
df['Rand_Selection'] = False
df.loc[df.sample(frac=0.02, random_state=42).index, 'Rand_Selection'] = True

# Create a new target column with labels. The 1's and 0's are original labels and -1 represents unlabeled (masked) data
df['Dependents_Target']=df.apply(lambda x: x['Dependents_Flag'] if x['Rand_Selection']==True else -1, axis=1)

# Show target value distribution
print('Target Value Distribution:')
print(df['Dependents_Target'].value_counts())

# Print dataframe
df

The below snippet shows the data and the distribution of a target variable.

Marketing campaign data from Kaggle. Image by author.

Note that we have kept 2% of the actual labels (1's and 0's) and masked the remaining 98% (-1's). Hence, our target contains information on whether the shopper has any dependents (1), does not have any dependents (0), or this information is masked (-1).

We are being ambitious here as we aim to assign labels to 2,195 data points by using only 45 known labels.

The features that we use are MntMeatProducts (shopper's annual spend on meat products) and MntWines (shopper's annual spend on wine). Now, let's see what the data looks like when we plot it on a graph.

# Create a scatter plot
fig = px.scatter(df, x='MntMeatProducts', y='MntWines', opacity=1, color=df['Dependents_Target'].astype(str),
 color_discrete_sequence=['lightgrey', 'red', 'blue'],
 )

# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='white', 
 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
 showline=True, linewidth=1, linecolor='white')

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='white', 
 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
 showline=True, linewidth=1, linecolor='white')

# Set figure title
fig.update_layout(title_text="Marketing Campaign Data - Labeled vs. Unlabeled")

# Update marker size
fig.update_traces(marker=dict(size=5))

fig.show()
Combination of labeled and unlabeled data for Semi-Supervised Learning. Image by author.

Applying Label Spreading algorithm

The next piece of code consists of a few steps that help us prepare the data, fit the model and print the results.

### Step 1 - Select data
X=df[['MntMeatProducts', 'MntWines']]
y=df['Dependents_Target'].values

### Step 2 - Perform Min-Max scaling
scaler=MinMaxScaler()
X_scaled=scaler.fit_transform(X)

### Step 3 - Configure model parameters 
model_LS = LabelSpreading(kernel='rbf', # {'knn', 'rbf'} default='rbf'
 gamma=70, # default=20, Parameter for rbf kernel.
 #n_neighbors=7, # default=7, Parameter for knn kernel which is a strictly positive integer.
 alpha=0.5, # Clamping factor. A value in (0, 1) that specifies the relative amount that an instance should adopt the information from its neighbors as opposed to its initial label. alpha=0 means keeping the initial label information; alpha=1 means replacing all initial information.
 max_iter=100, # default=30, Maximum number of iterations allowed.
 tol=0.001, # default=1e-3, Convergence tolerance: threshold to consider the system at steady state.
 n_jobs=-1, # default=None, The number of parallel jobs to run. -1 means using all processors.
 )

### Step 4 - Fit the model
LS=model_LS.fit(X_scaled, y)

### Step 5 - Exclude observations with known labels before evaluating model performance
df_eval=df[['Dependents_Flag', 'Dependents_Target']].copy() # Copy dataframe with dependents info
df_eval['Predicted_label']=LS.transduction_ # Attach model predictions
df_eval=df_eval[df_eval['Dependents_Target']==-1] # Keep only records containing masked labels

### Step 6 - Print the summary of model results
print("Model Name: ", str(LS))
print("Classes: ", LS.classes_)
print("Label Distributions: ", LS.label_distributions_)
print("Transduction Label: ", LS.transduction_)
print("No. of features: ", LS.n_features_in_)
print("No. of iterations: ", LS.n_iter_)
print('')
print('*************** Evaluation of LS model ***************')
print(classification_report(df_eval['Dependents_Flag'], df_eval['Predicted_label']))
print('')
print('******************** Confusion Matrix ********************')
cm= confusion_matrix(df_eval['Dependents_Flag'], df_eval['Predicted_label'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=LS.classes_)
disp.plot()
plt.show()

And here are the results:

Label Spreading results. Image by author.

As you can see, despite being very ambitious, we have achieved a pretty good result with a model accuracy of 82% (to make the assessment fair, we only used records with masked labels for model performance evaluation).

Let’s plot a 2D graph again to see how the newly assigned labels are distributed.

# Create a scatter plot
fig = px.scatter(df, x='MntMeatProducts', y='MntWines', opacity=1, color=LS.transduction_.astype(str),
 color_discrete_sequence=['blue','red'],
 )

# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='white', 
 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
 showline=True, linewidth=1, linecolor='white')

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='white', 
 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
 showline=True, linewidth=1, linecolor='white')

# Set figure title
fig.update_layout(title_text="Label Spreading model results")

# Update marker size
fig.update_traces(marker=dict(size=5))

fig.show()
A 2D plot of the Label Spreading model results. Image by author.

We can see a clear separation of blue points (no dependents) and red points (with dependents), with a decision boundary located at around 400 spent on meat and ~1,000 spent on wine. So, according to this data, it looks like people without dependents at home tend to spend more on meat and wine.

Conclusions

Label Spreading is an excellent algorithm when you have only a small number of labeled examples and want to apply auto-labeling to a large amount of unlabeled data.

However, as with all Semi-Supervised Learning techniques, you need to approach it with caution. It is always worth evaluating the model by creating a test sample with known labels or manually checking a sub-sample of Label Spreading results.
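The evaluation advice above can be sketched on synthetic data. This is a minimal example under stated assumptions: the two-cluster dataset from `make_blobs`, the choice of five known labels per class, and the variable names are all hypothetical stand-ins for your own data. The key idea is to score the model only on points whose true labels were hidden during fitting.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in for real data: two well-separated clusters (illustrative assumption)
X, y_true = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                       cluster_std=0.5, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)

# Keep five known labels per class and mask the rest with -1
rng = np.random.default_rng(42)
keep_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_true == c), size=5, replace=False)
    for c in (0, 1)
])
y_masked = np.full_like(y_true, -1)
y_masked[keep_idx] = y_true[keep_idx]

# Fit Label Spreading on the mix of labeled and masked points
model = LabelSpreading(kernel='rbf', gamma=20, alpha=0.5, max_iter=100)
model.fit(X_scaled, y_masked)

# Evaluate only on points whose labels were hidden from the model
masked = y_masked == -1
masked_accuracy = accuracy_score(y_true[masked], model.transduction_[masked])
print(f"Accuracy on masked points: {masked_accuracy:.3f}")
```

On real data the clusters are rarely this clean, so a held-out set of known labels like this one is a useful sanity check before trusting the auto-assigned labels.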

I hope you enjoyed reading this article, and I encourage you to try out Semi-Supervised Learning in your next Data Science project! Please do not hesitate to reach out if you have any questions or suggestions.

Cheers 👏 Saul Dobilas



