
URL: https://www.geeksforgeeks.org/machine-learning/hierarchical-clustering-with-scikit-learn/




Hierarchical Clustering with Scikit-Learn

Last Updated : 13 Feb, 2026

Hierarchical Clustering is an unsupervised learning technique that groups data into a hierarchy of clusters based on similarity. It builds a tree-like structure called a dendrogram, which helps visualise relationships and decide the optimal number of clusters.

  • Does not require pre-selecting the number of clusters
  • Uses agglomerative (bottom-up) or divisive (top-down) approaches
  • Commonly applied in data exploration and pattern discovery
  • Widely used in pattern recognition, customer segmentation and image grouping

Figure: Types of hierarchical clustering

Implementing Agglomerative Hierarchical Clustering

scikit-learn provides a straightforward implementation of agglomerative hierarchical clustering through the AgglomerativeClustering class in sklearn.cluster.

Step 1: Import Required Libraries

Import NumPy, pandas, Matplotlib and scikit-learn.

Step 2: Load the Dataset

  • Each row represents an image flattened into numerical features.
  • No labels are used during clustering; the process is purely unsupervised.
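
The article does not name its dataset. The final cluster sizes it reports for the divisive run add up to 1797, which is consistent with scikit-learn's built-in digits dataset (1797 flattened 8x8 images), so the sketch below assumes `load_digits`. Treat that choice as an assumption, not as the author's confirmed data source.

```python
from sklearn.datasets import load_digits

# Assumption: the digits dataset stands in for the article's image data.
digits = load_digits()
X = digits.data  # one flattened 8x8 image per row

# digits.target holds the true digit labels; it is deliberately left unused
# because clustering is unsupervised.
print(X.shape)  # (1797, 64)
```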

Step 3: Feature Scaling

Feature scaling matters in Hierarchical Clustering because the algorithm relies on distance calculations and is highly sensitive to feature magnitudes.
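
A minimal sketch of the scaling step, assuming `StandardScaler` is the scaler in use (the article does not say which one it applies). The toy array is hypothetical and only illustrates how a large-magnitude column would otherwise dominate the distance calculations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is in the thousands, column 1 is below 1.
X_toy = np.array([[1000.0, 0.1],
                  [2000.0, 0.9],
                  [3000.0, 0.5]])

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)

# After scaling, every column has mean 0 and unit variance, so both
# features contribute comparably to Euclidean distances.
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```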

Step 4: Visualizing the Dendrogram

Before selecting the number of clusters, we visualize the hierarchy using a dendrogram.
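
One way to draw the dendrogram is with SciPy's `linkage` and `dendrogram` functions. The sketch below assumes the digits dataset, standard scaling and Ward linkage, none of which the article states explicitly. It truncates the plot to the last 30 merges to keep it readable.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Assumption: digits data with standard scaling, as in the earlier steps.
X_scaled = StandardScaler().fit_transform(load_digits().data)

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance. Z has one row per merge: (n_samples - 1, 4).
Z = linkage(X_scaled, method="ward")

fig, ax = plt.subplots(figsize=(10, 5))
# Show only the last 30 merged clusters so the plot stays readable.
dendrogram(Z, truncate_mode="lastp", p=30, ax=ax)
ax.set_title("Dendrogram (truncated)")
ax.set_xlabel("Cluster (size in parentheses)")
ax.set_ylabel("Ward distance")
fig.savefig("dendrogram.png")
```

Replace `fig.savefig(...)` with `plt.show()` when working interactively. Long vertical lines in the plot indicate merges between well-separated groups, and cutting the tree across them suggests a cluster count.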

Output:

Figure: Dendrogram

Step 5: Building the Hierarchical Clustering Model

Based on dendrogram inspection, we choose a reasonable number of clusters.

Step 6: Fit the Model and Assign Clusters

This step fits the hierarchical clustering model to the scaled data and assigns a cluster label to each data point. Each label represents the cluster formed from the hierarchical structure defined by the dendrogram.
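
Steps 5 and 6 can be sketched together. The number of clusters (10) and the Ward linkage are assumptions chosen for illustration; the article does not state which values it uses.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Assumption: digits data with standard scaling, as in the earlier steps.
X_scaled = StandardScaler().fit_transform(load_digits().data)

# Assumption: 10 clusters and Ward linkage; the article states neither.
model = AgglomerativeClustering(n_clusters=10, linkage="ward")

# fit_predict builds the merge tree and returns one cluster label per row.
labels = model.fit_predict(X_scaled)
print(labels[:10])
```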

Step 7: Cluster Distribution Analysis

This step shows how data points are distributed across clusters. It gives quick insight into cluster balance, helps detect over-fragmentation and highlights dominant groupings, which makes it a useful checkpoint before using the clusters downstream.
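
Counting points per cluster takes a single NumPy call. The labels below are hypothetical; in the walkthrough they would be the array returned by the fitted model.

```python
import numpy as np

# Hypothetical labels; in the walkthrough these would come from fit_predict.
labels = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2])

# np.unique returns the sorted cluster ids and how often each one occurs.
cluster_ids, counts = np.unique(labels, return_counts=True)
for cid, n in zip(cluster_ids, counts):
    print(f"Cluster {cid}: {n} points")
```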

Output:

Figure: Points assigned to each cluster

Step 8: Evaluating Clustering Quality

  • The Silhouette Score evaluates how well clusters are formed by comparing cohesion within clusters to separation between clusters.
  • Scores closer to +1 indicate well-defined clusters, values near 0 suggest overlapping clusters and negative values suggest that many points sit closer to a neighbouring cluster than to their own.
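
A sketch of the evaluation using `silhouette_score` from `sklearn.metrics`. It assumes the digits dataset, standard scaling and 10 clusters, so the printed value need not match the score the article reports.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumptions: digits data, standard scaling, 10 Ward-linkage clusters.
X_scaled = StandardScaler().fit_transform(load_digits().data)
labels = AgglomerativeClustering(n_clusters=10).fit_predict(X_scaled)

# The score is the mean of (b - a) / max(a, b) over all points, where a is
# the mean distance to points in the same cluster and b is the mean
# distance to points in the nearest other cluster.
score = silhouette_score(X_scaled, labels)
print(f"Score : {score:.2f}")
```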

Output:

Score : 0.13

Implementing Divisive Hierarchical Clustering

scikit-learn has no general-purpose divisive counterpart to AgglomerativeClustering. Instead, the approach is implemented manually with a top-down recursive splitting strategy, most commonly using KMeans from scikit-learn to divide clusters step by step. (The library's BisectingKMeans class, available since version 1.1, follows a related bisecting idea but is not used in this walkthrough.)

Step 1: Import Required Libraries

Import NumPy and the required scikit-learn modules.

Step 2: Load the Dataset

  • Each row represents an image flattened into numerical features.
  • No labels are used during clustering; the process is purely unsupervised.

Step 3: Feature Scaling

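K-Means splits clusters using Euclidean distances, so features are scaled first to prevent any single large-magnitude feature from dominating the partitioning.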

Step 4: Defining Divisive Clustering Function

This function applies a top down divisive clustering strategy, where the dataset is repeatedly split into smaller clusters to reveal finer patterns.

  • Starts by treating the entire dataset as one cluster
  • Splits the cluster into two using K Means
  • Recursively repeats the same process on each sub cluster
  • Stops when the maximum depth or minimum cluster size is reached
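
The steps above can be sketched as a short recursive function. The nested-dict tree layout, the parameter names and the default values are all hypothetical choices made for illustration; the article's own function may differ. The demo data comes from `make_blobs` to keep the example fast.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def divisive_clustering(X, indices=None, depth=0, max_depth=3, min_size=20):
    """Recursively bisect X with K-Means; return a nested dict tree."""
    if indices is None:
        indices = np.arange(len(X))  # root: the whole dataset is one cluster
    node = {"indices": indices, "children": []}

    # Stop when the depth limit or the minimum cluster size is reached.
    if depth >= max_depth or len(indices) <= max(min_size, 1):
        return node

    # Split the current cluster into two with K-Means.
    km = KMeans(n_clusters=2, n_init=10, random_state=42)
    split = km.fit_predict(X[indices])
    left, right = indices[split == 0], indices[split == 1]

    # Guard against a degenerate split that leaves one side empty.
    if len(left) == 0 or len(right) == 0:
        return node

    # Repeat the same process on each sub-cluster.
    for child in (left, right):
        node["children"].append(
            divisive_clustering(X, child, depth + 1, max_depth, min_size)
        )
    return node


# Usage on hypothetical synthetic data.
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)
tree = divisive_clustering(X_demo, max_depth=2, min_size=10)
print(len(tree["indices"]), len(tree["children"]))
```

Storing row indices rather than copies of the data keeps each node small and makes it straightforward to map leaves back to rows of the original array later.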

Step 5: Execute Divisive Clustering

This step applies divisive clustering to the scaled data using the defined depth and minimum size.

Step 6: Analyze Cluster Sizes

This step calculates the number of data points in each final cluster to understand how the data has been split.
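
A minimal sketch of the size calculation, assuming the tree is stored as nested dicts with `indices` and `children` keys. That layout is a hypothetical choice, since the article does not show its data structure. The small hand-built tree keeps the example self-contained.

```python
def leaf_sizes(node):
    """Return the number of points in each leaf, left to right."""
    if not node["children"]:
        return [len(node["indices"])]
    sizes = []
    for child in node["children"]:
        sizes.extend(leaf_sizes(child))
    return sizes


# Hypothetical hand-built tree in the assumed {"indices", "children"} layout.
tree = {
    "indices": [0, 1, 2, 3, 4, 5],
    "children": [
        {"indices": [0, 1], "children": []},
        {
            "indices": [2, 3, 4, 5],
            "children": [
                {"indices": [2], "children": []},
                {"indices": [3, 4, 5], "children": []},
            ],
        },
    ],
}
print(leaf_sizes(tree))  # [2, 1, 3]
```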

Output:

[39, 568, 643, 369, 178]

Step 7: Assign Flat Cluster Labels

This step transforms the hierarchical, tree based clustering output into flat cluster labels, which are required by most machine learning pipelines to enable evaluation, visualization and deployment.
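
Flattening can be sketched by giving each leaf of the tree its own integer label. The nested-dict layout with `indices` and `children` keys is an assumption made for illustration, as is the function name `flat_labels`.

```python
import numpy as np


def flat_labels(tree, n_samples):
    """Give every leaf its own integer label and return one label per row."""
    labels = np.full(n_samples, -1, dtype=int)  # -1 marks "unassigned"
    next_label = 0
    stack = [tree]
    while stack:
        node = stack.pop()
        if node["children"]:
            # Push in reverse so that leaves are numbered left to right.
            stack.extend(reversed(node["children"]))
        else:
            labels[node["indices"]] = next_label
            next_label += 1
    return labels


# Hypothetical hand-built tree with three leaves.
tree = {
    "indices": [0, 1, 2, 3, 4, 5],
    "children": [
        {"indices": [0, 1], "children": []},
        {
            "indices": [2, 3, 4, 5],
            "children": [
                {"indices": [2], "children": []},
                {"indices": [3, 4, 5], "children": []},
            ],
        },
    ],
}
labels = flat_labels(tree, n_samples=6)
print(labels)  # [0 0 1 2 2 2]
```

The resulting array has the same shape as the output of `fit_predict`, so it can be passed directly to metrics such as `silhouette_score`.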

Step 8: Evaluate Cluster Quality

  • The Silhouette Score evaluates how well clusters are formed by comparing cohesion within clusters to separation between clusters.
  • Scores closer to +1 indicate well-defined clusters, values near 0 suggest overlapping clusters and negative values suggest that many points sit closer to a neighbouring cluster than to their own.

Output:

Score : -0.04

Step 9: Visualizing Divisive Hierarchical Clustering Tree

Now we will visualize the divisive clustering process as a tree.

  • Begins with the entire dataset as a single cluster (Root).
  • Divides it into two clusters using KMeans.
  • Recursively splits each resulting cluster into smaller groups.
  • Stops when the maximum depth is reached or the cluster size becomes too small.

Next, we assign X and Y positions to each node: depth controls the vertical placement, and child nodes are distributed horizontally to maintain proper spacing.

Now we extract the parent child relationships by recursively traversing the tree. For each node, we store its connection to its children so that these relationships can later be drawn as lines in the visualization. This step builds the structural backbone required to clearly represent the hierarchical clustering tree.
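
The layout and edge-extraction steps can be sketched as two small recursive functions. The nested-dict tree layout, the function names and the use of `id(node)` as a node key are hypothetical choices for illustration. Each node is centred within a horizontal span that is divided equally among its children.

```python
def assign_positions(node, depth=0, x_min=0.0, x_max=1.0, positions=None):
    """Place each node at the centre of its horizontal span; y = -depth."""
    if positions is None:
        positions = {}
    positions[id(node)] = ((x_min + x_max) / 2, -depth)
    children = node["children"]
    if children:
        width = (x_max - x_min) / len(children)
        for i, child in enumerate(children):
            assign_positions(child, depth + 1,
                             x_min + i * width, x_min + (i + 1) * width,
                             positions)
    return positions


def collect_edges(node, edges=None):
    """Return (parent_id, child_id) pairs for every parent-child link."""
    if edges is None:
        edges = []
    for child in node["children"]:
        edges.append((id(node), id(child)))
        collect_edges(child, edges)
    return edges


# Hypothetical hand-built tree: a root, one leaf and one internal child.
tree = {
    "indices": [0, 1, 2, 3, 4, 5],
    "children": [
        {"indices": [0, 1], "children": []},
        {
            "indices": [2, 3, 4, 5],
            "children": [
                {"indices": [2], "children": []},
                {"indices": [3, 4, 5], "children": []},
            ],
        },
    ],
}
positions = assign_positions(tree)
edges = collect_edges(tree)
print(positions[id(tree)], len(edges))  # (0.5, 0) 4
```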

Now we draw the tree by connecting each parent node to its children and displaying the cluster size inside every node. Colours are used to make the structure easier to read. This creates a clear top down view of how the data was split during divisive clustering.
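
A minimal drawing sketch with Matplotlib. The node positions, sizes, names and colours below are hypothetical values standing in for a computed layout; they are not taken from the article's figure.

```python
import matplotlib.pyplot as plt

# Hypothetical precomputed layout: node name -> (x, y, cluster size).
nodes = {
    "root": (0.5, 0, 6),
    "A": (0.25, -1, 2),
    "B": (0.75, -1, 4),
    "B1": (0.625, -2, 1),
    "B2": (0.875, -2, 3),
}
edges = [("root", "A"), ("root", "B"), ("B", "B1"), ("B", "B2")]

fig, ax = plt.subplots(figsize=(6, 4))

# Draw the parent-child links first so the nodes sit on top of them.
for parent, child in edges:
    (x0, y0, _), (x1, y1, _) = nodes[parent], nodes[child]
    ax.plot([x0, x1], [y0, y1], color="gray", zorder=1)

# Draw each node as a coloured marker with its cluster size as text.
for name, (x, y, size) in nodes.items():
    is_leaf = all(parent != name for parent, _ in edges)
    ax.scatter(x, y, s=900, color="lightgreen" if is_leaf else "skyblue",
               edgecolors="black", zorder=2)
    ax.text(x, y, str(size), ha="center", va="center", zorder=3)

ax.set_title("Divisive clustering tree")
ax.axis("off")
fig.savefig("divisive_tree.png")
```

Colouring leaves differently from internal nodes makes the final clusters easy to pick out at a glance.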

Output:

Figure: Divisive clustering tree
