URL: https://towardsdatascience.com/self-training-classifier-how-to-make-any-algorithm-behave-like-a-semi-supervised-one-2958e7b54ab7/

Self-Training Classifier: How to Make Any Algorithm Behave Like a Semi-Supervised One | Towards Data Science

An easy Python implementation of Self-Training using standard classification algorithms from the Sklearn library

Self Training Classifier: adding pseudo-labels with each iteration. Image by author.

Intro

Semi-Supervised Learning combines labeled and unlabeled examples to expand the available data pool for model training. As a result, we can improve model performance and save a lot of time and money by not having to label thousands of examples manually.

If you have your favorite Supervised Machine Learning algorithm, you will be happy to hear that you can quickly adapt it to use a Semi-Supervised approach through a technique called Self-Training.

Contents

  • Where does Self-Training sit within the universe of Machine Learning algorithms?
  • How does Self-Training work?
  • How to use Self-Training to build models in Python?

Self-Training within the universe of Machine Learning algorithms

There are more Machine Learning algorithms available than any of us can use throughout our Data Science careers. Nevertheless, it is beneficial to understand the most commonly used ones and how they are categorized.

As explained in the intro section, Self-Training belongs to the Semi-Supervised branch of Machine Learning algorithms since it uses a combination of labeled and unlabeled data to train models.

How does Self-Training work?

You may think that Self-Training involves some magic or uses a highly complex approach. In reality, though, the idea behind Self-Training is very straightforward and can be explained by the following steps:

  1. First, we gather all labeled and unlabeled data, but we only use labeled observations to train our first supervised model.
  2. Then we use this model to predict the class of unlabeled data.
  3. In the third step, we select observations that satisfy our predefined criteria (e.g., the prediction probability is >90%, or the observation is among the 10 with the highest prediction probabilities) and combine these pseudo-labels with labeled data.
  4. We repeat the process by training a new supervised model using observations with labels and pseudo-labels. Then we make predictions again and add newly selected observations into the pseudo-labeled pool.
  5. We iterate through these steps until we finish labeling all the data, no additional unlabeled observations satisfy our pseudo-labeling criteria, or we reach the specified max number of iterations.

Here is an illustration that summarizes all of the steps I have just described:

The iterative process of Self-Training. Image by author.
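
The five steps above can also be sketched as a short hand-rolled loop. This is only an illustration under stated assumptions, not how Sklearn implements it: the helper name `self_train`, the synthetic data, the Logistic Regression base model, and the 0.9 threshold are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def self_train(base_model, X, y, threshold=0.9, max_iter=10):
    """Hypothetical helper: y uses -1 to denote unlabeled observations."""
    y_work = np.asarray(y).copy()
    model = clone(base_model)
    for _ in range(max_iter):
        labeled = y_work != -1
        # Steps 1 and 4: fit on labeled plus pseudo-labeled observations
        model.fit(X[labeled], y_work[labeled])
        if labeled.all():
            return model, y_work  # Step 5: nothing left to label
        # Step 2: predict class probabilities for the unlabeled observations
        proba = model.predict_proba(X[~labeled])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            return model, y_work  # Step 5: no prediction meets the criterion
        # Step 3: move confident predictions into the pseudo-labeled pool
        idx = np.flatnonzero(~labeled)[confident]
        y_work[idx] = model.classes_[proba.argmax(axis=1)[confident]]
    # Step 5: max_iter reached, so refit once on the final pool
    labeled = y_work != -1
    model.fit(X[labeled], y_work[labeled])
    return model, y_work


# Synthetic two-feature data (an assumption) with roughly 90% of labels hidden
X, y_true = make_classification(n_samples=300, n_features=2, n_informative=2,
                                n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
y = y_true.copy()
y[rng.random(len(y)) > 0.1] = -1

model, y_work = self_train(LogisticRegression(), X, y)
print('Labeled before:', (y != -1).sum(), '| labeled after:', (y_work != -1).sum())
```

The original labels are never overwritten; only rows marked -1 can receive a pseudo-label, and once assigned, a pseudo-label stays in the pool for all later iterations.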

How to use Self-Training in Python?

Let’s now work through a Python example using Self-Training Classifier on real-life data.

Setup

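We will use the following data and libraries:

  • Marketing campaign data from Kaggle
  • Pandas for data manipulation
  • Plotly for data visualization
  • Scikit-learn for splitting the data (train_test_split), the baseline Support Vector Classifier (SVC), the Self-Training Classifier, and model evaluation (classification_report)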

First, let’s import the libraries that we have listed above.

# Data manipulation
import pandas as pd

# Visualization
import plotly.express as px

# Sklearn
from sklearn.model_selection import train_test_split # for splitting data into train and test samples
from sklearn.svm import SVC # for Support Vector Classification baseline model
from sklearn.semi_supervised import SelfTrainingClassifier # for Semi-Supervised learning
from sklearn.metrics import classification_report # for model evaluation metrics

Next, we download and ingest marketing campaign data (source: Kaggle). We limit file ingestion to a few critical columns since we will only use two features to train our example model.

# Read in data
df = pd.read_csv('marketing_campaign.csv', 
 encoding='utf-8', delimiter=';',
 usecols=['ID', 'Year_Birth', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'MntWines', 'MntMeatProducts']
 )

# Create a flag to denote whether the person has any dependants at home (either kids or teens)
df['Dependents_Flag']=df.apply(lambda x: 1 if x['Kidhome']+x['Teenhome']>0 else 0, axis=1)

# Print dataframe
df

As you can see, we have also derived a ‘Dependents_Flag,’ which we will use as a prediction target. In other words, we will aim to predict whether our supermarket shopper has any dependents (kids/teens) at home or not.

And this is what the data looks like:

A snippet of marketing campaign data from Kaggle. Image by author.
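
As a side note, the 'Dependents_Flag' derivation above applies a lambda row by row. A vectorized comparison should give the same result and tends to be faster on larger frames. The toy rows below are made up for illustration and are not taken from the Kaggle file.

```python
import pandas as pd

# Toy rows standing in for the Kaggle data (values are invented)
toy = pd.DataFrame({'Kidhome': [0, 1, 0, 2], 'Teenhome': [0, 0, 1, 1]})

# Vectorized flag: 1 if the household has any kids or teens, else 0
toy['Dependents_Flag'] = ((toy['Kidhome'] + toy['Teenhome']) > 0).astype(int)

# Row-wise version from the article, kept here only for comparison
toy['Flag_Apply'] = toy.apply(lambda x: 1 if x['Kidhome'] + x['Teenhome'] > 0 else 0, axis=1)

print(toy)
```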

We need to do a couple more things before we start training models. Since our goal is to evaluate the performance of a Self-Training Classifier, a Semi-Supervised technique, we will split the data as per the setup below.

Preparing data for Semi-Supervised Learning. Image by author.

Test data will be used to evaluate model performance, while labeled and unlabeled data will be used to train our models.

So, let’s split data into train and test samples and print shapes to check the size is correct:

df_train, df_test = train_test_split(df, test_size=0.25, random_state=0)
print('Size of train dataframe: ', df_train.shape[0])
print('Size of test dataframe: ', df_test.shape[0])
Train-test data size. Image by author.

Now let’s mask 95% of labels within the training data and create a target variable that uses ‘-1’ to denote unlabeled (masked) data:

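# Work on an explicit copy so that adding columns below does not trigger pandas' SettingWithCopyWarning
df_train = df_train.copy()

# Create a flag for label masking
df_train['Random_Mask'] = True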
df_train.loc[df_train.sample(frac=0.05, random_state=0).index, 'Random_Mask'] = False

# Create a new target column with labels. The 1's and 0's are original labels and -1 represents unlabeled (masked) data
df_train['Dependents_Target']=df_train.apply(lambda x: x['Dependents_Flag'] if x['Random_Mask']==False else -1, axis=1)

# Show target value distribution
print('Target Value Distribution:')
print(df_train['Dependents_Target'].value_counts())
Target value distribution. Image by author.
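
The masking logic can be checked on a tiny example. The sketch below uses ten invented labels and hides 80% of them (instead of 95%, so the toy frame keeps a couple of labels), with `np.where` as a vectorized alternative to the row-wise apply used above.

```python
import numpy as np
import pandas as pd

# Ten invented labels standing in for 'Dependents_Flag'
toy = pd.DataFrame({'Dependents_Flag': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]})

# Keep 20% of the labels (2 rows) and mask the rest
keep = toy.sample(frac=0.2, random_state=0).index
toy['Random_Mask'] = ~toy.index.isin(keep)

# -1 marks an unlabeled row, which is the convention SelfTrainingClassifier expects
toy['Dependents_Target'] = np.where(toy['Random_Mask'], -1, toy['Dependents_Flag'])

print(toy['Dependents_Target'].value_counts())
```

Rows where 'Random_Mask' is False keep their original 0/1 label, so the labeled subset can later be selected with `Dependents_Target != -1`.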

Finally, let’s plot training data on a 2D scatterplot to see how the observations are distributed.

# Create a scatter plot
fig = px.scatter(df_train, x='MntMeatProducts', y='MntWines', opacity=1, color=df_train['Dependents_Target'].astype(str),
 color_discrete_sequence=['lightgrey', 'red', 'blue'],
 )

# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='white', 
 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
 showline=True, linewidth=1, linecolor='white')

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='white', 
 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
 showline=True, linewidth=1, linecolor='white')

# Set figure title
fig.update_layout(title_text="Marketing Campaign Training Data - Labeled vs. Unlabeled")

# Update marker size
fig.update_traces(marker=dict(size=5))

fig.show()
Combination of labeled and unlabeled data for Semi-Supervised Learning. Image by author.

As you can see, we will use ‘MntMeatProducts’ (shopper’s annual spend on meat products) and ‘MntWines’ (shopper’s annual spend on wine) as two features to predict whether the shopper has any dependents at home.

Model training

Now that the data is ready, we will train a supervised Support Vector Machine classification model (SVC) on labeled data to establish a model performance benchmark. It will enable us to judge whether a Semi-Supervised approach from a later step is better or worse than a standard Supervised model.

########## Step 1 - Data Prep ########## 
# Select only records with known labels
df_train_labeled=df_train[df_train['Dependents_Target']!=-1]

# Select data for modeling 
X_baseline=df_train_labeled[['MntMeatProducts', 'MntWines']]
y_baseline=df_train_labeled['Dependents_Target'].values

# Put test data into an array
X_test=df_test[['MntMeatProducts', 'MntWines']]
y_test=df_test['Dependents_Flag'].values

########## Step 2 - Model Fitting ########## 
# Specify SVC model parameters
model = SVC(kernel='rbf', 
 probability=True, 
 C=1.0, # default = 1.0
 gamma='scale', # default = 'scale'
 random_state=0
 )

# Fit the model
clf = model.fit(X_baseline, y_baseline)

########## Step 3 - Model Evaluation ########## 
# Use score method to get accuracy of the model
print('---------- SVC Baseline Model - Evaluation on Test Data ----------')
accuracy_score_B = model.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_B)
# Look at classification report to evaluate the model
print(classification_report(y_test, model.predict(X_test)))
Support Vector Machines classification model performance. Image by author.

The results from a supervised SVC model are already pretty good, with an accuracy of 82.85%. Note that the F1 score is higher for label=1 (shopper with dependents) due to class imbalance.

Now let’s follow a Semi-Supervised approach with Sklearn’s Self-Training Classifier while using the same SVC model as a base estimator. Note that you can use pretty much any supervised classification algorithm inside Self-Training Classifier, as long as it implements predict_proba.

########## Step 1 - Data Prep ########## 
# Select data for modeling - we are including masked (-1) labels this time
X_train=df_train[['MntMeatProducts', 'MntWines']]
y_train=df_train['Dependents_Target'].values

########## Step 2 - Model Fitting ########## 
# Specify SVC model parameters
model_svc = SVC(kernel='rbf', 
 probability=True, # Need to enable to be able to use predict_proba
 C=1.0, # default = 1.0
 gamma='scale', # default = 'scale',
 random_state=0
 )

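# Specify Self-Training model parameters
# Note: newer scikit-learn releases deprecate `base_estimator` in favor of `estimator`; use whichever name your installed version accepts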
self_training_model = SelfTrainingClassifier(base_estimator=model_svc, # An estimator object implementing fit and predict_proba.
 threshold=0.7, # default=0.75, The decision threshold for use with criterion='threshold'. Should be in [0, 1).
 criterion='threshold', # {'threshold', 'k_best'}, default='threshold', The selection criterion used to select which labels to add to the training set. If 'threshold', pseudo-labels with prediction probabilities above threshold are added to the dataset. If 'k_best', the k_best pseudo-labels with highest prediction probabilities are added to the dataset.
 #k_best=50, # default=10, The amount of samples to add in each iteration. Only used when criterion='k_best'.
 max_iter=100, # default=10, Maximum number of iterations allowed. Should be greater than or equal to 0. If it is None, the classifier will continue to predict labels until no new pseudo-labels are added, or all unlabeled samples have been labeled.
 verbose=True # default=False, Verbosity prints some information after each iteration
 )

# Fit the model
clf_ST = self_training_model.fit(X_train, y_train)

########## Step 3 - Model Evaluation ########## 
print('')
print('---------- Self Training Model - Summary ----------')
print('Base Estimator: ', clf_ST.base_estimator_)
print('Classes: ', clf_ST.classes_)
print('Transduction Labels: ', clf_ST.transduction_)
#print('Iteration When Sample Was Labeled: ', clf_ST.labeled_iter_)
print('Number of Features: ', clf_ST.n_features_in_)
print('Feature Names: ', clf_ST.feature_names_in_)
print('Number of Iterations: ', clf_ST.n_iter_)
print('Termination Condition: ', clf_ST.termination_condition_)
print('')

print('---------- Self Training Model - Evaluation on Test Data ----------')
accuracy_score_ST = clf_ST.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_ST)
# Look at classification report to evaluate the model
print(classification_report(y_test, clf_ST.predict(X_test)))
Self-Training classification model results. Image by author.

And the results are in! We have improved model performance, although only slightly: accuracy rose from 82.85% to 83.57%. The F1 score is also marginally better for label=0, driven by improved precision.
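
To see what the classifier did during fitting, it can help to look at the fitted attributes `labeled_iter_` and `transduction_`. The sketch below runs on synthetic data (an assumption, not the Kaggle file) and passes the base estimator positionally, which should work regardless of how your Sklearn version names that parameter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic two-feature data with roughly 90% of labels hidden
X, y_true = make_classification(n_samples=400, n_features=2, n_informative=2,
                                n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
y = y_true.copy()
y[rng.random(len(y)) > 0.1] = -1

st = SelfTrainingClassifier(SVC(probability=True, random_state=0),
                            threshold=0.7, max_iter=100)
st.fit(X, y)

# labeled_iter_: 0 = labeled in the original data, -1 = never labeled,
# k > 0 = pseudo-labeled during iteration k
iterations, counts = np.unique(st.labeled_iter_, return_counts=True)
for iteration, count in zip(iterations, counts):
    print('Iteration', iteration, '->', count, 'observations')

print('Still unlabeled:', (st.transduction_ == -1).sum())
print('Termination condition:', st.termination_condition_)
```

Counting observations per iteration shows how quickly the pseudo-labeled pool grows and whether fitting stopped because of max_iter, because no prediction cleared the threshold, or because everything was labeled.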

As mentioned earlier in the article, we can choose how to select pseudo-labels for training. We can base it on the top k_best predictions or specify a specific probability threshold.

This time, we have used a probability threshold of 0.7. It means that any observation whose predicted class probability is above 0.7 will be added to the pool of pseudo-labeled data and used to train the model in the next iteration.

Remember, it is always worth exploring both approaches (threshold and k_best) with different hyperparameters to see which one yields the best results (which I haven’t done in this example).
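
A comparison of the two criteria could look something like the sketch below. It uses synthetic data and a handful of arbitrary settings (thresholds of 0.7 and 0.9, k_best of 20, max_iter of 20), all of which are assumptions for illustration rather than tuned values or results from the marketing data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic two-feature data, split into train and held-out samples
X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hide roughly 90% of the training labels
rng = np.random.default_rng(0)
y_masked = y_train.copy()
y_masked[rng.random(len(y_masked)) > 0.1] = -1

# Hypothetical candidate settings for the two selection criteria
candidates = {
    'threshold=0.7': dict(criterion='threshold', threshold=0.7),
    'threshold=0.9': dict(criterion='threshold', threshold=0.9),
    'k_best=20': dict(criterion='k_best', k_best=20),
}

scores = {}
for name, params in candidates.items():
    st = SelfTrainingClassifier(SVC(probability=True, random_state=0),
                                max_iter=20, **params)
    scores[name] = st.fit(X_train, y_masked).score(X_test, y_test)

for name, accuracy in scores.items():
    print(name, 'accuracy:', round(accuracy, 4))
```

If you use a comparison like this to pick hyperparameters, score the candidates on a separate validation split rather than on the final test data, so that the reported test accuracy stays unbiased.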

Conclusions

Now you know how to use any supervised classification algorithm in a Semi-Supervised manner. If you have lots of unlabeled data, I would recommend exploring the benefits of Semi-Supervised Learning before engaging in a costly data labeling exercise.

I sincerely hope you enjoyed reading this article. However, as I try to make my articles more useful for my readers, I would appreciate it if you could let me know what has driven you to read this piece and whether it has given you the answers you were looking for. If not, what was missing?

Cheers! 👏 Saul Dobilas


