Self-Training Classifier: How to Make Any Algorithm Behave Like a Semi-Supervised One
An easy Python implementation of Self-Training using standard classification algorithms from the Sklearn library
Intro
Semi-Supervised Learning combines labeled and unlabeled examples to expand the available data pool for model training. As a result, we can improve model performance and save a lot of time and money by not having to label thousands of examples manually.
If you have your favorite Supervised Machine Learning algorithm, you will be happy to hear that you can quickly adapt it to use a Semi-Supervised approach through a technique called Self-Training.
Contents
- Where does Self-Training sit within the universe of Machine Learning algorithms?
- How does Self-Training work?
- How to use Self-Training to build models in Python?
Self-Training within the universe of Machine Learning algorithms
There are more Machine Learning algorithms available than any of us can use throughout our Data Science careers. Nevertheless, it is beneficial to understand the most commonly used ones, which I have categorized below.
As explained in the intro section, Self-Training belongs to the Semi-Supervised branch of Machine Learning algorithms since it uses a combination of labeled and unlabeled data to train models.
How does Self-Training work?
You may think that Self-Training involves some magic or uses a highly complex approach. In reality, though, the idea behind Self-Training is very straightforward and can be explained by the following steps:
- First, we gather all labeled and unlabeled data, but we only use labeled observations to train our first supervised model.
- Then we use this model to predict the class of unlabeled data.
- In the third step, we select observations that satisfy our predefined criteria (e.g., the prediction probability exceeds 90%, or the observation is among the 10 with the highest prediction probabilities) and combine these pseudo-labeled observations with the labeled data.
- We repeat the process by training a new supervised model using observations with labels and pseudo-labels. Then we make predictions again and add newly selected observations into the pseudo-labeled pool.
- We iterate through these steps until we finish labeling all the data, no additional unlabeled observations satisfy our pseudo-labeling criteria, or we reach the specified max number of iterations.
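The steps above can be sketched as a short hand-written loop. This is only a minimal illustration, not how scikit-learn implements it: the `self_train` helper, the synthetic data from `make_classification`, the logistic regression base model, and the 0.9 threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def self_train(estimator, X, y, threshold=0.9, max_iter=10):
    """Hypothetical hand-rolled self-training loop; -1 in y marks unlabeled rows."""
    y_work = np.asarray(y).copy()
    for _ in range(max_iter):
        labeled = y_work != -1
        if labeled.all():
            break  # every observation now has a label or pseudo-label
        # Train on the original labels plus any pseudo-labels gathered so far
        estimator.fit(X[labeled], y_work[labeled])
        # Predict class probabilities for the still-unlabeled observations
        proba = estimator.predict_proba(X[~labeled])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break  # nothing satisfies the pseudo-labeling criterion
        # Promote confident predictions to pseudo-labels
        best_class = estimator.classes_[proba.argmax(axis=1)]
        unlabeled_idx = np.flatnonzero(~labeled)
        y_work[unlabeled_idx[confident]] = best_class[confident]
    # Final fit so the returned model has seen every pseudo-label
    labeled = y_work != -1
    estimator.fit(X[labeled], y_work[labeled])
    return estimator, y_work


# Synthetic data (an assumption for this sketch): keep 15 labels per class, mask the rest
X_demo, y_true = make_classification(n_samples=300, n_features=4, random_state=0)
y_demo = np.full_like(y_true, -1)
for cls in (0, 1):
    y_demo[np.flatnonzero(y_true == cls)[:15]] = cls

model, y_out = self_train(LogisticRegression(), X_demo, y_demo)
print('Labeled at start:', int((y_demo != -1).sum()))
print('Labeled or pseudo-labeled at the end:', int((y_out != -1).sum()))
```

The loop stops for the same three reasons listed above: everything is labeled, nothing new passes the criterion, or the iteration cap is reached.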
Here is an illustration that summarizes all of the steps I have just described:
How to use Self-Training in Python?
Let’s now work through a Python example using Self-Training Classifier on real-life data.
Setup
We will use the following data and libraries:
- Marketing campaign data from Kaggle
- Scikit-learn library for 1) splitting data into train and test samples (train_test_split); 2) performing Semi-Supervised Learning (SelfTrainingClassifier); 3) model evaluation (classification_report)
- Plotly for data visualizations
- Pandas for data manipulation
First, let’s import the libraries that we have listed above.
# Data manipulation
import pandas as pd
# Visualization
import plotly.express as px
# Sklearn
from sklearn.model_selection import train_test_split # for splitting data into train and test samples
from sklearn.svm import SVC # for Support Vector Classification baseline model
from sklearn.semi_supervised import SelfTrainingClassifier # for Semi-Supervised learning
from sklearn.metrics import classification_report # for model evaluation metrics
Next, we download and ingest marketing campaign data (source: Kaggle). We limit file ingestion to a few critical columns since we will only use two features to train our example model.
# Read in data
df = pd.read_csv('marketing_campaign.csv',
                 encoding='utf-8', delimiter=';',
                 usecols=['ID', 'Year_Birth', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'MntWines', 'MntMeatProducts']
                 )
# Create a flag to denote whether the person has any dependents at home (either kids or teens)
df['Dependents_Flag'] = ((df['Kidhome'] + df['Teenhome']) > 0).astype(int)
# Print dataframe
df
As you can see, we have also derived a ‘Dependents_Flag,’ which we will use as a prediction target. In other words, we will aim to predict whether our supermarket shopper has any dependents (kids/teens) at home or not.
And this is what the data looks like:
We need to do a couple more things before we start training models. Since our goal is to evaluate the performance of a Self-Training Classifier, a Semi-Supervised technique, we will split the data as per the setup below.
Test data will be used to evaluate model performance, while labeled and unlabeled data will be used to train our models.
So, let’s split data into train and test samples and print shapes to check the size is correct:
df_train, df_test = train_test_split(df, test_size=0.25, random_state=0)
print('Size of train dataframe: ', df_train.shape[0])
print('Size of test dataframe: ', df_test.shape[0])
Now let’s mask 95% of labels within the training data and create a target variable that uses ‘-1’ to denote unlabeled (masked) data:
# Work on an explicit copy so the column assignments below do not trigger pandas' SettingWithCopyWarning
df_train = df_train.copy()
# Create a flag for label masking
df_train['Random_Mask'] = True
df_train.loc[df_train.sample(frac=0.05, random_state=0).index, 'Random_Mask'] = False
# Create a new target column with labels. The 1's and 0's are original labels and -1 represents unlabeled (masked) data
df_train['Dependents_Target'] = df_train['Dependents_Flag'].where(~df_train['Random_Mask'], -1)
# Show target value distribution
print('Target Value Distribution:')
print(df_train['Dependents_Target'].value_counts())
Finally, let’s plot training data on a 2D scatterplot to see how the observations are distributed.
# Create a scatter plot
fig = px.scatter(df_train, x='MntMeatProducts', y='MntWines', opacity=1, color=df_train['Dependents_Target'].astype(str),
color_discrete_sequence=['lightgrey', 'red', 'blue'],
)
# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))
# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='white',
zeroline=True, zerolinewidth=1, zerolinecolor='white',
showline=True, linewidth=1, linecolor='white')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='white',
zeroline=True, zerolinewidth=1, zerolinecolor='white',
showline=True, linewidth=1, linecolor='white')
# Set figure title
fig.update_layout(title_text="Marketing Campaign Training Data - Labeled vs. Unlabeled")
# Update marker size
fig.update_traces(marker=dict(size=5))
fig.show()
As you can see, we will use ‘MntMeatProducts’ (shopper’s annual spend on meat products) and ‘MntWines’ (shopper’s annual spend on wine) as two features to predict whether the shopper has any dependents at home.
Model training
Now that the data is ready, we will train a supervised Support Vector Machine classification model (SVC) on labeled data to establish a model performance benchmark. It will enable us to judge whether a Semi-Supervised approach from a later step is better or worse than a standard Supervised model.
########## Step 1 - Data Prep ##########
# Select only records with known labels
df_train_labeled=df_train[df_train['Dependents_Target']!=-1]
# Select data for modeling
X_baseline=df_train_labeled[['MntMeatProducts', 'MntWines']]
y_baseline=df_train_labeled['Dependents_Target'].values
# Put test data into an array
X_test=df_test[['MntMeatProducts', 'MntWines']]
y_test=df_test['Dependents_Flag'].values
########## Step 2 - Model Fitting ##########
# Specify SVC model parameters
model = SVC(kernel='rbf',
probability=True,
C=1.0, # default = 1.0
gamma='scale', # default = 'scale'
random_state=0
)
# Fit the model
clf = model.fit(X_baseline, y_baseline)
########## Step 3 - Model Evaluation ##########
# Use score method to get accuracy of the model
print('---------- SVC Baseline Model - Evaluation on Test Data ----------')
accuracy_score_B = model.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_B)
# Look at classification report to evaluate the model
print(classification_report(y_test, model.predict(X_test)))
The results from a supervised SVC model are already pretty good, with an accuracy of 82.85%. Note that the f1 score is higher for label=1 (shopper with dependents) due to class imbalance.
Now let’s follow a Semi-Supervised approach with Sklearn’s Self-Training Classifier while using the same SVC model as a base estimator. Note that you can use pretty much any supervised classification algorithm inside the Self-Training Classifier, as long as it implements predict_proba.
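The Self-Training Classifier needs class probabilities from its base estimator (the parameter comment in the code below describes it as an estimator object implementing fit and predict_proba). A quick way to check a candidate model before wrapping it is sketched here; the choice of candidate models is just an assumption for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Candidate base estimators (illustrative picks, not from the article's data pipeline)
candidates = {
    'SVC(probability=True)': SVC(probability=True),
    'SVC()': SVC(),
    'LogisticRegression()': LogisticRegression(),
}

# hasattr is enough here: SVC only exposes predict_proba when probability=True
supports_proba = {name: hasattr(est, 'predict_proba') for name, est in candidates.items()}

for name, ok in supports_proba.items():
    print(f'{name}: predict_proba available = {ok}')
```

This is why the SVC in the next code block is created with probability=True.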
########## Step 1 - Data Prep ##########
# Select data for modeling - we are including masked (-1) labels this time
X_train=df_train[['MntMeatProducts', 'MntWines']]
y_train=df_train['Dependents_Target'].values
########## Step 2 - Model Fitting ##########
# Specify SVC model parameters
model_svc = SVC(kernel='rbf',
probability=True, # Need to enable to be able to use predict_proba
C=1.0, # default = 1.0
gamma='scale', # default = 'scale',
random_state=0
)
# Specify Self-Training model parameters
self_training_model = SelfTrainingClassifier(base_estimator=model_svc, # An estimator object implementing fit and predict_proba. Note: recent scikit-learn releases rename this parameter to `estimator`.
threshold=0.7, # default=0.75, The decision threshold for use with criterion='threshold'. Should be in [0, 1).
criterion='threshold', # {'threshold', 'k_best'}, default='threshold', The selection criterion used to select which labels to add to the training set. If 'threshold', pseudo-labels with prediction probabilities above threshold are added to the dataset. If 'k_best', the k_best pseudo-labels with highest prediction probabilities are added to the dataset.
#k_best=50, # default=10, The amount of samples to add in each iteration. Only used when criterion='k_best'.
max_iter=100, # default=10, Maximum number of iterations allowed. Should be greater than or equal to 0. If it is None, the classifier will continue to predict labels until no new pseudo-labels are added, or all unlabeled samples have been labeled.
verbose=True # default=False, Verbosity prints some information after each iteration
)
# Fit the model
clf_ST = self_training_model.fit(X_train, y_train)
########## Step 3 - Model Evaluation ##########
print('')
print('---------- Self Training Model - Summary ----------')
print('Base Estimator: ', clf_ST.base_estimator_)
print('Classes: ', clf_ST.classes_)
print('Transduction Labels: ', clf_ST.transduction_)
#print('Iteration When Sample Was Labeled: ', clf_ST.labeled_iter_)
print('Number of Features: ', clf_ST.n_features_in_)
print('Feature Names: ', clf_ST.feature_names_in_)
print('Number of Iterations: ', clf_ST.n_iter_)
print('Termination Condition: ', clf_ST.termination_condition_)
print('')
print('---------- Self Training Model - Evaluation on Test Data ----------')
accuracy_score_ST = clf_ST.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_ST)
# Look at classification report to evaluate the model
print(classification_report(y_test, clf_ST.predict(X_test)))
And the results are in! We have improved model performance, although only slightly: accuracy is now 83.57%, up from 82.85% for the baseline. F1 score is also marginally better for label=0, driven by improved precision.
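The summary code above leaves the `labeled_iter_` line commented out, but that attribute is a handy way to see how pseudo-labeling progressed. Below is a minimal sketch on synthetic data (the dataset, the 10 labels kept per class, and the variable names are assumptions for illustration, so the counts will differ from the marketing data). In `labeled_iter_`, 0 marks originally labeled samples, -1 marks samples that never received a pseudo-label, and positive values give the iteration in which a sample was pseudo-labeled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic two-feature data (assumption): keep 10 labels per class, mask the rest with -1
X_demo, y_true = make_classification(n_samples=400, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
y_demo = np.full_like(y_true, -1)
for cls in (0, 1):
    y_demo[np.flatnonzero(y_true == cls)[:10]] = cls

# The base estimator is passed positionally, so the parameter name does not matter here
st_demo = SelfTrainingClassifier(SVC(probability=True, random_state=0),
                                 threshold=0.7, max_iter=100)
st_demo.fit(X_demo, y_demo)

# Count how many samples were labeled in each iteration
iterations, counts = np.unique(st_demo.labeled_iter_, return_counts=True)
for it, n in zip(iterations, counts):
    print(f'iteration {it}: {n} samples')
print('Termination condition:', st_demo.termination_condition_)
```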
As mentioned earlier in the article, we can choose how to select pseudo-labels for training. We can base it on the top k_best predictions or specify a specific probability threshold.
This time, we have used a probability threshold of 0.7. It means that any observation with a predicted class probability above 0.7 will be added to the pool of pseudo-labeled data and used to train the model in the next iteration.
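To make the selection rule concrete, here is a tiny sketch with made-up probabilities (the `proba` array is invented for illustration). It uses a strict comparison, in line with the parameter description in the code above, which says that pseudo-labels with probabilities above the threshold are added.

```python
import numpy as np

# Hypothetical predict_proba output for four unlabeled observations (columns: class 0, class 1)
proba = np.array([
    [0.95, 0.05],
    [0.70, 0.30],
    [0.40, 0.60],
    [0.20, 0.80],
])
threshold = 0.7

# An observation is selected when its highest class probability is above the threshold
max_proba = proba.max(axis=1)
selected = max_proba > threshold

# The pseudo-label is the class with the highest probability
pseudo_labels = proba.argmax(axis=1)[selected]

print('Selected rows:', np.flatnonzero(selected))
print('Pseudo-labels:', pseudo_labels)
```

Under this strict comparison, the second observation, with a probability of exactly 0.70, is not selected.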
Remember, it is always worth exploring both approaches (threshold and k_best) with different hyperparameters to see which one yields the best results (which I haven’t done in this example).
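As a starting point for that exploration, the sketch below fits both criteria side by side on synthetic data. The dataset, the 10 labels kept per class, and the hyperparameter values (k_best=50, max_iter=5) are assumptions chosen for illustration, not tuned settings, and the scores say nothing about the marketing data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic two-feature data (assumption): keep 10 labels per class, mask the rest with -1
X_demo, y_true = make_classification(n_samples=400, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
y_demo = np.full_like(y_true, -1)
for cls in (0, 1):
    y_demo[np.flatnonzero(y_true == cls)[:10]] = cls

# Same base estimator settings, two different pseudo-label selection criteria
st_threshold = SelfTrainingClassifier(SVC(probability=True, random_state=0),
                                      criterion='threshold', threshold=0.7, max_iter=5)
st_kbest = SelfTrainingClassifier(SVC(probability=True, random_state=0),
                                  criterion='k_best', k_best=50, max_iter=5)

for name, clf in [('threshold', st_threshold), ('k_best', st_kbest)]:
    clf.fit(X_demo, y_demo)
    n_pseudo = int((clf.labeled_iter_ > 0).sum())
    # Accuracy against the true labels of the synthetic data, for a rough comparison only
    print(f'{name}: {n_pseudo} pseudo-labels, stopped due to {clf.termination_condition_}, '
          f'accuracy {clf.score(X_demo, y_true):.3f}')
```

With k_best, at most 50 observations are added per iteration regardless of how confident the model is, so the iteration cap matters more than it does with a threshold.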
Conclusions
Now you know how to use any supervised classification algorithm in a Semi-Supervised manner. If you have lots of unlabeled data, I would recommend exploring the benefits of Semi-Supervised Learning before engaging in a costly data labeling exercise.
I sincerely hope you enjoyed reading this article. However, as I try to make my articles more useful for my readers, I would appreciate it if you could let me know what has driven you to read this piece and whether it has given you the answers you were looking for. If not, what was missing?
Cheers! 👏 Saul Dobilas