Clustering is a technique in Machine Learning that is used to group similar data points. While the algorithm performs its job, helping uncover the patterns and structures in the data, it is important to judge how well it functions. Several metrics have been designed to evaluate the performance of these clustering algorithms.
In this article, we will explore these metrics and see the mathematical concepts that lie behind them. After that, we will demonstrate their practical implementation using scikit-learn.
Clustering is an unsupervised machine-learning approach that is used to group comparable data points based on specific traits or attributes. Clustering algorithms do not require labelled data, which makes them ideal for finding patterns in large datasets. It is a widely used technique in applications like customer segmentation, image recognition, anomaly detection, etc.
There are multiple clustering algorithms, and each has its own way of grouping data points. Clustering metrics give a common basis for evaluating all of them. Let us take a look at some of the most commonly used clustering metrics:
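The Silhouette Score measures how well each data point fits the cluster it was assigned to, compared with the nearest neighboring cluster. The score ranges from -1 to 1. Values near 1 mean a point sits well inside its own cluster, values near 0 mean it lies on the boundary between two clusters, and negative values suggest it may have been assigned to the wrong cluster.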
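Silhouette Score (S) for a data point i is calculated as:

$$ S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

where,
- a(i) is the average distance from point i to the other points in its own cluster.
- b(i) is the smallest average distance from point i to the points of any other cluster, i.e. the distance to the nearest neighboring cluster.

The Silhouette Score of a whole clustering is the mean of S(i) over all points.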
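To make the definition concrete, here is a minimal pure-Python sketch that computes the silhouette value of each point from two quantities: the mean distance to the other points in its own cluster (`a`) and the smallest mean distance to the points of any other cluster (`b`), combined as `(b - a) / max(a, b)`. The function name `silhouette_values`, the toy points, and the choice to score singleton clusters as 0 are assumptions made for illustration, not part of the original article.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value of every point (illustrative sketch)."""
    values = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Collect distances from point i to all other points, grouped by cluster.
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(label, []).append(math.dist(p, q))
        own = by_cluster.pop(own_label, [])
        if not own or not by_cluster:
            # Assumed convention: a singleton cluster, or a single cluster
            # overall, gets a silhouette value of 0.
            values.append(0.0)
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Two tight pairs of points, five units apart (made-up toy data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]

scores = silhouette_values(points, labels)
mean_score = sum(scores) / len(scores)
print(f"Mean silhouette: {mean_score:.3f}")
```

Because the two pairs are far apart relative to their internal spread, every point scores well above 0. Swapping the labels to `[0, 1, 0, 1]` would pull the values below 0.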
The Davies-Bouldin Index (DBI) helps us measure how good the clustering is in a dataset. It looks at how tight each cluster is (compactness), and how far apart the clusters are (separation).
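A lower score is better (the minimum is 0), because it means the clusters are compact, with points lying close to their own centroid, and well separated, with centroids lying far from one another.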
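Davies-Bouldin Index (DB) is calculated as:

$$ DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{s_i + s_j}{d(c_i, c_j)} $$

where,
- K is the number of clusters.
- s_i is the average distance from the points in cluster i to its centroid c_i.
- d(c_i, c_j) is the distance between the centroids of clusters i and j.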
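The calculation can be sketched in a few lines of pure Python. For each cluster the sketch takes the mean distance of its points to the centroid (its scatter), forms the ratio "sum of two scatters divided by the distance between the two centroids" against every other cluster, keeps the worst (largest) ratio, and averages those worst ratios. The helper names and the toy points are assumptions for illustration, and the function expects at least two clusters.

```python
import math


def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelling (illustrative sketch, >= 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    centers = {k: centroid(v) for k, v in clusters.items()}
    # Scatter: mean distance from the members of a cluster to its centroid.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }
    worst_ratios = []
    for i in clusters:
        # Keep the ratio against the most similar (worst-case) other cluster.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters if j != i
        ))
    return sum(worst_ratios) / len(worst_ratios)


# Made-up toy data: two tight pairs of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
tight = davies_bouldin(points, [0, 0, 1, 1])  # left pair vs right pair
loose = davies_bouldin(points, [0, 1, 0, 1])  # bottom row vs top row
print(f"sensible labelling: {tight:.2f}, poor labelling: {loose:.2f}")
```

The sensible labelling yields the smaller index, which matches the "lower is better" reading of the metric.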
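The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, measures how well separated and how compact the clusters in a dataset are.
It looks at how far the cluster centers lie from the overall center of the data (between-cluster dispersion) and how close the points of each cluster lie to their own center (within-cluster dispersion).
A higher score is better, as it means the clusters are tight and well-separated. Comparing the score across different numbers of clusters helps determine the ideal number of clusters.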
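Calinski-Harabasz Index (CH) is calculated as:

$$ CH = \frac{B}{W} \times \frac{N - K}{K - 1} $$

where,
- B is the between-group sum of squares.
- W is the within-group sum of squares.
- N is the total number of data points.
- K is the number of clusters.

Calculating between group sum of squares (B)

$$ B = \sum_{k=1}^{K} n_k \, \lVert c_k - c \rVert^2 $$

where,
- n_k is the number of points in cluster k.
- c_k is the centroid of cluster k.
- c is the centroid of the whole dataset.

Calculating within the group sum of squares (W)

$$ W = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - c_k \rVert^2 $$

where,
- C_k is the set of points assigned to cluster k.
- x is a data point in C_k, and c_k is the centroid of that cluster.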
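As a rough illustration, the sketch below computes the index in pure Python as the between-group sum of squares (cluster size times squared distance from the cluster centroid to the overall centroid) divided by the within-group sum of squares (squared distance from each point to its own centroid), scaled by `(N - K) / (K - 1)` for N points and K clusters. The helper names and toy points are assumptions for illustration. The function expects at least two clusters and a non-zero within-group sum.

```python
def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index of a labelling (illustrative sketch)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    n, k = len(points), len(clusters)
    overall = centroid(points)

    between = 0.0  # B: dispersion of cluster centroids around the overall centroid
    within = 0.0   # W: dispersion of points around their own centroid
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)

    return (between / within) * (n - k) / (k - 1)


# Made-up toy data: two tight pairs of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
tight = calinski_harabasz(points, [0, 0, 1, 1])  # left pair vs right pair
loose = calinski_harabasz(points, [0, 1, 0, 1])  # bottom row vs top row
print(f"sensible labelling: {tight:.2f}, poor labelling: {loose:.2f}")
```

Here the sensible labelling scores far higher than the poor one, which is the direction the index is meant to reward.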
The Adjusted Rand Index (ARI) helps us measure how accurate a clustering result is by comparing it to the true labels (ground truth).
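It checks how the pairs of points are grouped. A pair counts as an agreement when its two points are placed in the same cluster in both the prediction and the ground truth, or in different clusters in both.
The score is at most 1. A value of 1 means the clustering matches the true labels exactly, a value near 0 means the agreement is no better than random labelling, and a negative value means it is worse than random. The lower bound is often quoted as -1, although scikit-learn documents it as -0.5.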
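Adjusted Rand Index (ARI) is calculated as:

$$ ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} $$

where,
- RI is the Rand Index, the fraction of point pairs on which the two labellings agree.
- E[RI] is the expected Rand Index of a random labelling with the same cluster sizes.
- max(RI) is the maximum possible Rand Index, which is 1.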
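A common way to compute the ARI in practice is the pair-counting form, which is algebraically equivalent to "(index minus expected index) divided by (maximum index minus expected index)". The sketch below is a minimal pure-Python version. The function name `adjusted_rand_index`, the toy label lists, and the choice to return 1.0 in the degenerate case are assumptions for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index via pair counting (illustrative sketch)."""
    n = len(true_labels)
    # Pairs of points that share a cluster in BOTH labellings.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs that share a cluster in each labelling taken on its own.
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    expected = pairs_true * pairs_pred / comb(n, 2)  # chance-level agreement
    maximum = (pairs_true + pairs_pred) / 2          # best achievable agreement
    if maximum == expected:
        # Assumed convention for degenerate input, e.g. one cluster on both sides.
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


truth = [0, 0, 1, 1, 2, 2]
renamed = [1, 1, 0, 0, 2, 2]   # same grouping, different cluster ids
perfect = adjusted_rand_index(truth, renamed)
partial = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
opposed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(f"{perfect:.2f} {partial:.2f} {opposed:.2f}")
```

The first case shows that the score ignores how clusters are numbered, since only the grouping matters. The last case shows that a labelling which splits every true pair scores below zero.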
Mutual Information measures how much two variables are related or connected. In clustering, it compares how much the true cluster labels match with the predicted labels. It shows how much knowing about one variable helps us predict the other. The more agreement there is, the higher the score.
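MI between true labels Y and predicted labels Z is calculated as:

$$ MI(Y, Z) = \sum_{i} \sum_{j} p(y_i, z_j) \, \log \frac{p(y_i, z_j)}{p(y_i)\, p(z_j)} $$

where,
- p(y_i, z_j) is the fraction of points that have true label y_i and predicted label z_j.
- p(y_i) is the fraction of points with true label y_i.
- p(z_j) is the fraction of points with predicted label z_j.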
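The sum can be sketched directly from label counts: estimate the joint and marginal probabilities as fractions of points, then add up `p(y, z) * log(p(y, z) / (p(y) * p(z)))` over the label combinations that actually occur. The function name `mutual_information` and the toy labels are assumptions for illustration. The natural logarithm is used, so the result is in nats, which is also the unit scikit-learn's `mutual_info_score` reports.

```python
from collections import Counter
from math import log


def mutual_information(y, z):
    """Mutual information (in nats) between two label lists (illustrative sketch)."""
    n = len(y)
    joint = Counter(zip(y, z))   # co-occurrence counts of (true, predicted)
    count_y = Counter(y)
    count_z = Counter(z)
    mi = 0.0
    for (yi, zj), c in joint.items():
        p_joint = c / n
        p_y = count_y[yi] / n
        p_z = count_z[zj] / n
        mi += p_joint * log(p_joint / (p_y * p_z))
    return mi


y_true = [0, 0, 1, 1]
matching = mutual_information(y_true, [1, 1, 0, 0])   # same grouping, ids swapped
unrelated = mutual_information(y_true, [0, 1, 0, 1])  # independent of y_true
print(f"{matching:.4f} {unrelated:.4f}")
```

The matching labelling reaches the entropy of the true labels (`log(2)` for two balanced classes), while the unrelated labelling carries no information about them. Because the raw score depends on the number and balance of classes, it is easiest to interpret relative to that entropy.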
These clustering metrics help in evaluating the quality and performance of clustering algorithms, allowing for informed decisions when selecting the most suitable clustering solution for a given dataset.
Let's consider an example using the Iris dataset and the K-Means clustering algorithm. We will calculate the Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, Adjusted Rand Index, and Mutual Information to evaluate the clustering.
Import the necessary libraries, including scikit-learn (sklearn).
Load or generate your dataset for clustering. The Iris dataset consists of 150 samples of iris flowers from three species (setosa, versicolor, and virginica), each described by four features: sepal length, sepal width, petal length, and petal width.
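A minimal sketch of the import and loading steps, assuming scikit-learn is installed. The variable names `X` and `y_true` are choices made here for illustration.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset: a feature matrix plus the species labels.
iris = load_iris()
X = iris.data        # shape (150, 4): the four measurements per flower
y_true = iris.target # shape (150,): species encoded as 0, 1, 2

print(X.shape)
print(list(iris.target_names))
```

The species labels are not used for clustering itself. They are kept only so that the label-based metrics (Adjusted Rand Index, Mutual Information) can be computed afterwards.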
Choose a clustering algorithm, such as K-Means, and fit it to your data.
K-Means is an unsupervised technique for creating clusters based on similarity. It iteratively assigns each data point to the nearest cluster center and then recomputes the centroids, repeating until the assignments stop changing.
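The fitting step might look like the sketch below. The settings `n_clusters=3` (one cluster per species), `n_init=10`, and `random_state=42` are assumptions chosen for reproducibility; the original article does not state which values it used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Assumed settings: 3 clusters, 10 random restarts, fixed seed.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # cluster id (0, 1 or 2) for every sample

print(labels[:10])
print(kmeans.cluster_centers_.shape)
```

The cluster ids are arbitrary, so cluster 0 need not correspond to species 0. The label-based metrics used later are unaffected by this, since they only compare how points are grouped.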
Use the appropriate clustering metrics to evaluate the clustering results.
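A self-contained sketch of the evaluation step, using the metric functions from `sklearn.metrics`. The K-Means settings are the same assumptions as before (`n_clusters=3`, `n_init=10`, `random_state=42`), and the exact figures may vary slightly with the seed and the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    mutual_info_score,
    silhouette_score,
)

iris = load_iris()
X, y_true = iris.data, iris.target
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: need only the data and the predicted cluster labels.
sil = silhouette_score(X, labels)
dbi = davies_bouldin_score(X, labels)
ch = calinski_harabasz_score(X, labels)

# External metrics: compare the predicted labels with the true species.
ari = adjusted_rand_score(y_true, labels)
mi = mutual_info_score(y_true, labels)

print(f"Silhouette Score: {sil:.2f}")
print(f"Davies-Bouldin Index: {dbi:.2f}")
print(f"Calinski-Harabasz Index: {ch:.2f}")
print(f"Adjusted Rand Index: {ari:.2f}")
print(f"Mutual Information (MI): {mi:.2f}")
```

The first three metrics need no ground truth, so they remain usable on real unlabelled data. The last two are only available when reference labels exist, as they do for Iris.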
Output:
Silhouette Score: 0.55
Davies-Bouldin Index: 0.66
Calinski-Harabasz Index: 561.63
Adjusted Rand Index: 0.73
Mutual Information (MI): 0.83
Here's an interpretation of the metric scores obtained:
Silhouette Score (0.55): This score reveals how similar data points are to their own cluster compared with points in other clusters. A result of 0.55 indicates reasonable separation between the clusters, with room for improvement. Values closer to 1 suggest better-defined clusters.
Davies-Bouldin Index (0.66): This index averages, over all clusters, the similarity between each cluster and its most similar neighbor. A lower score is preferable, and 0.66 suggests fairly good separation between clusters.
Calinski-Harabasz Index (561.63): This index is the ratio of between-cluster dispersion to within-cluster dispersion. Higher values suggest more distinct groups. The score has no fixed upper bound, so 561.63 is most informative when compared with the scores of other cluster counts or algorithms on the same data.
Adjusted Rand Index (0.73): This index compares the true class labels with the predicted cluster labels, corrected for chance. A score of 0.73 shows that the clustering and the actual class labels correspond rather well.
Mutual Information (0.83): This metric measures the agreement between the true class labels and the predicted cluster labels. The score is not normalized: assuming natural logarithms, it can be at most ln 3 ≈ 1.10 for three equally sized classes, so 0.83 indicates a substantial amount of shared information between the true labels and the clusters assigned by the algorithm.