In this article, we discuss hierarchical data and dendrograms, and show how to visualize hierarchical data with dendrograms in the R programming language.
Hierarchical data refers to data that is organized in a hierarchical or tree-like structure, where each data point or record has a defined relationship with one or more other data points, forming a parent-child relationship.
Dendrograms are a popular way to visualize hierarchical data, particularly in fields like biology, linguistics, and computer science. They represent relationships between data points in a hierarchical manner, typically in a tree-like structure.
First, generate some random data with the rnorm function and arrange the 100 random values into a matrix with 5 columns (and therefore 20 rows).
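The step described above can be sketched as follows. This is a minimal illustration, not the original listing: the seed, the variable names (`data_matrix`, `dist_matrix`, `hc`), and the plot labels are assumptions chosen for the example.

```r
# Reproducible random data: 100 standard-normal values in a 20 x 5 matrix
set.seed(123)  # arbitrary seed, assumed here only for reproducibility
data_matrix <- matrix(rnorm(100), ncol = 5)

# Pairwise Euclidean distances between the 20 rows
dist_matrix <- dist(data_matrix)

# Agglomerative clustering (hclust uses complete linkage by default)
hc <- hclust(dist_matrix)

# Draw the dendrogram
plot(hc, main = "Dendrogram of random data", xlab = "Observation", sub = "")
```

Each leaf of the plotted tree is one row of the matrix, and the height at which two branches join is the distance at which those clusters were merged.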
Hierarchical clustering is a popular technique in data analysis and machine learning used to group similar objects into clusters. It is called "hierarchical" because it creates a hierarchy of clusters. In its most common (agglomerative) form, it starts by treating each data point as its own cluster and then iteratively merges the closest clusters until all points belong to a single cluster, forming a tree-like structure known as a dendrogram.
There are two main types of hierarchical clustering: agglomerative and divisive.
Agglomerative clustering starts with each data point as its own cluster and then merges the closest pairs of clusters until only one cluster remains. The process involves calculating the distances between clusters and deciding which ones to merge based on a chosen linkage criterion, such as "complete-linkage," "single-linkage," or "average-linkage."
In contrast, divisive hierarchical clustering begins with all data points in one cluster and then recursively splits it into smaller clusters until each data point is in its own cluster. While conceptually straightforward, divisive clustering can be computationally intensive and is less commonly used in practice than agglomerative clustering.
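To make the role of the linkage criterion concrete, the sketch below runs agglomerative clustering on the same distances with three linkage methods through the `method` argument of `hclust`. The data, seed, and variable names are illustrative assumptions.

```r
set.seed(42)                        # arbitrary seed for reproducibility
x <- matrix(rnorm(40), ncol = 2)    # 20 illustrative points in 2-D
d <- dist(x)                        # Euclidean distances

# Same distances, three linkage criteria
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Height of the final merge under each criterion
final_heights <- c(
  single   = max(hc_single$height),
  complete = max(hc_complete$height),
  average  = max(hc_average$height)
)
print(final_heights)
```

Single linkage tends to report smaller merge heights than complete linkage because it measures the closest pair rather than the farthest pair. For the divisive approach, the `cluster` package provides `diana()`, which is not shown here.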
Single linkage clustering, also known as minimum linkage clustering, is a method of hierarchical clustering where the distance between two clusters is defined as the shortest distance between any two points in the two clusters. In other words, the distance between two clusters is determined by the closest pair of points from each cluster.
Suppose we have the following set of points in a two-dimensional space:
A(1, 1), B(2, 2), C(2, 4), D(3, 3), E(6, 5), F(7, 6)
Step 1: Calculate distances between points using Euclidean distance
- Distance(A, B) = √((2-1)² + (2-1)²) = √2
- Distance(A, C) = √((2-1)² + (4-1)²) = √10
- Distance(A, D) = √((3-1)² + (3-1)²) = √8
- Distance(A, E) = √((6-1)² + (5-1)²) = √41
- Distance(A, F) = √((7-1)² + (6-1)²) = √61
- Distance(B, C) = √((2-2)² + (4-2)²) = 2
- Distance(B, D) = √((3-2)² + (3-2)²) = √2
- Distance(B, E) = √((6-2)² + (5-2)²) = √25 = 5
- Distance(B, F) = √((7-2)² + (6-2)²) = √41
- Distance(C, D) = √((3-2)² + (3-4)²) = √2
- Distance(C, E) = √((6-2)² + (5-4)²) = √17
- Distance(C, F) = √((7-2)² + (6-4)²) = √29
- Distance(D, E) = √((6-3)² + (5-3)²) = √13
- Distance(D, F) = √((7-3)² + (6-3)²) = √25 = 5
- Distance(E, F) = √((7-6)² + (6-5)²) = √2
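The pairwise distances can be checked in R with `dist()`. The sketch below prints the squared distances, so each entry is the number under the square root in the list above. The object names `pts`, `d`, and `d_sq` are assumptions for this example.

```r
# The six points from the example, one row per point
pts <- rbind(A = c(1, 1), B = c(2, 2), C = c(2, 4),
             D = c(3, 3), E = c(6, 5), F = c(7, 6))
colnames(pts) <- c("x", "y")

# Pairwise Euclidean distances
d <- dist(pts, method = "euclidean")

# Squared distances as a full symmetric matrix
d_sq <- as.matrix(d)^2
print(round(d_sq))
```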
Step 2: Merge closest points
Start by merging the closest pair of points into clusters:
- Merge A and B: Distance(A, B) = √2
- Merge C and D: Distance(C, D) = √2
- Merge E and F: Distance(E, F) = √2
Step 3: Repeat until all points are in one cluster
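- Merge A-B cluster with C-D cluster: the closest pair across the two clusters is B and D, so Distance(A-B, C-D) = Distance(B, D) = √2
- Merge A-B-C-D cluster with E-F cluster: the closest pair across the two clusters is D and E, so Distance(A-B-C-D, E-F) = Distance(D, E) = √13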
Now, all points are in one cluster.
The resulting single linkage clustering can be visualized as a dendrogram in R.
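A minimal sketch of that visualization, assuming the six points from the example and base R's `hclust` with `method = "single"` (object names are illustrative):

```r
# The six points from the single linkage example
pts <- rbind(A = c(1, 1), B = c(2, 2), C = c(2, 4),
             D = c(3, 3), E = c(6, 5), F = c(7, 6))

# Single linkage: cluster distance = closest pair across clusters
hc_single <- hclust(dist(pts), method = "single")

# Squared merge heights, in merge order
print(round(hc_single$height^2))

# Cut the tree into two clusters to see the top-level split
groups <- cutree(hc_single, k = 2)
print(groups)

plot(hc_single, main = "Single linkage", xlab = "", sub = "",
     ylab = "Merge distance")
```

The last merge sits visibly higher than the others in the plot, which is the gap between the group containing A, B, C, D and the group containing E, F.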
Complete linkage clustering, also known as farthest neighbor clustering, is a method of hierarchical clustering where the distance between two clusters is defined as the maximum distance between any two points in the two clusters. In other words, the distance between two clusters is determined by the farthest pair of points from each cluster.
Suppose we have the following set of points in a two-dimensional space:
A(1, 1), B(2, 2), C(4, 4), D(5, 5), E(7, 7), F(8, 8)
Step 1: Calculate distances between points using Euclidean distance
- Distance(A, B) = √((2-1)² + (2-1)²) = √2
- Distance(A, C) = √((4-1)² + (4-1)²) = √18
- Distance(A, D) = √((5-1)² + (5-1)²) = √32
- Distance(A, E) = √((7-1)² + (7-1)²) = √72
- Distance(A, F) = √((8-1)² + (8-1)²) = √98
- Distance(B, C) = √((4-2)² + (4-2)²) = √8
- Distance(B, D) = √((5-2)² + (5-2)²) = √18
- Distance(B, E) = √((7-2)² + (7-2)²) = √50
- Distance(B, F) = √((8-2)² + (8-2)²) = √72
- Distance(C, D) = √((5-4)² + (5-4)²) = √2
- Distance(C, E) = √((7-4)² + (7-4)²) = √18
- Distance(C, F) = √((8-4)² + (8-4)²) = √32
- Distance(D, E) = √((7-5)² + (7-5)²) = √8
- Distance(D, F) = √((8-5)² + (8-5)²) = √18
- Distance(E, F) = √((8-7)² + (8-7)²) = √2
Step 2: Merge closest points
Start by merging the closest pair of points into clusters:
- Merge A and B: Distance(A, B) = √2
- Merge C and D: Distance(C, D) = √2
- Merge E and F: Distance(E, F) = √2
Step 3: Repeat until all points are in one cluster
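- Merge A-B cluster with C-D cluster: the farthest pair across the two clusters is A and D, so Distance(A-B, C-D) = Distance(A, D) = √32. (C-D and E-F are tied at the same distance, since Distance(C, F) = √32, so either merge could come first.)
- Merge A-B-C-D cluster with E-F cluster: the farthest pair across the two clusters is A and F, so Distance(A-B-C-D, E-F) = Distance(A, F) = √98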
Now, all points are in one cluster.
This example shows how complete linkage clustering merges clusters based on the maximum distance between any two points in the clusters. The result can be visualized as a dendrogram in R.
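A minimal sketch of the complete linkage version, assuming the six collinear points from this example and base R's `hclust` with `method = "complete"` (object names are illustrative):

```r
# The six points from the complete linkage example
pts2 <- rbind(A = c(1, 1), B = c(2, 2), C = c(4, 4),
              D = c(5, 5), E = c(7, 7), F = c(8, 8))

# Complete linkage: cluster distance = farthest pair across clusters
hc_complete <- hclust(dist(pts2), method = "complete")

# Squared merge heights, in merge order
print(round(hc_complete$height^2))

plot(hc_complete, main = "Complete linkage", xlab = "", sub = "",
     ylab = "Merge distance")
```

Because the final merge height under complete linkage is the largest pairwise distance in the data, the top of this tree corresponds to the distance between A and F.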
Here, we use a real dataset, "mtcars", which is built into R. It contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.
We load the "mtcars" dataset using the data() function. This dataset contains information on various attributes of different car models.
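A minimal sketch of this step is shown below. Scaling the columns before computing distances is an assumption added here (the variables are on very different scales, so unscaled distances would be dominated by columns such as `disp` and `hp`). The object names are illustrative.

```r
# Load the built-in dataset: 32 cars, 11 numeric variables
data(mtcars)
print(dim(mtcars))

# Standardize columns so no single variable dominates the distance
# (assumption: not stated in the text above)
mtcars_scaled <- scale(mtcars)

# Hierarchical clustering on Euclidean distances between cars
hc_cars <- hclust(dist(mtcars_scaled), method = "complete")

# Row names (car models) become the leaf labels
plot(hc_cars, cex = 0.7, main = "mtcars dendrogram", xlab = "", sub = "")
```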
We can customize dendrograms using "ggraph", which lets us adjust aspects such as node size, edge width, node labels, and colors.
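The sketch below shows one way such customization might look, assuming the ggraph package is installed; plotting is skipped if it is not. The sizes, colors, and object names are illustrative choices, and the node columns `leaf`, `label`, and `height` are assumed to be the ones exposed when an `hclust` object is converted for ggraph.

```r
data(mtcars)
hc_cars <- hclust(dist(scale(mtcars)))

# ggraph is an optional add-on package; skip the plot if it is missing
if (requireNamespace("ggraph", quietly = TRUE)) {
  library(ggraph)  # attaches ggplot2 as a dependency

  p <- ggraph(hc_cars, layout = "dendrogram", height = height) +
    # Edge width and colour
    geom_edge_elbow(edge_width = 0.4, edge_colour = "grey40") +
    # Node size and colour, drawn for leaves only
    geom_node_point(aes(filter = leaf), size = 2, colour = "steelblue") +
    # Leaf labels, rotated to avoid overlap
    geom_node_text(aes(filter = leaf, label = label),
                   angle = 90, hjust = 1, nudge_y = -0.2, size = 3) +
    expand_limits(y = -4) +  # leave room below the leaves for labels
    theme_void()

  print(p)
}
```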
We generate sample gene expression data and perform hierarchical clustering as before.
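A minimal sketch of that step follows. The data are simulated, not real measurements: the matrix size, the gene and sample names, the seed, and the choice of average linkage are all assumptions made for illustration.

```r
set.seed(2024)   # arbitrary seed for reproducibility
n_genes   <- 10  # hypothetical number of genes
n_samples <- 6   # hypothetical number of samples

# Simulated expression values: one row per gene, one column per sample
expr <- matrix(
  rnorm(n_genes * n_samples, mean = 8, sd = 2),
  nrow = n_genes,
  dimnames = list(paste0("Gene", seq_len(n_genes)),
                  paste0("Sample", seq_len(n_samples)))
)

# Cluster genes by the similarity of their expression profiles
hc_genes <- hclust(dist(expr), method = "average")

# Horizontal layout keeps the gene names readable
plot(as.dendrogram(hc_genes), horiz = TRUE,
     main = "Simulated gene expression")
```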
Now we take another example and create more attractive dendrograms.
Create a hierarchical structure of "Parent", "Manager", and "Supervisor" levels with corresponding employees and generate an edge list.
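An edge list of this kind can be built with plain data frames, as in the sketch below. The node names and the number of managers, supervisors, and employees are hypothetical, chosen only to illustrate the shape of the data.

```r
# Hypothetical organisation: all names and counts are invented
managers    <- paste0("Manager", 1:2)
supervisors <- paste0("Supervisor", 1:4)
employees   <- paste0("Employee", 1:12)

# One data frame per level, each row is a parent -> child edge
level1 <- data.frame(from = "Parent", to = managers,
                     stringsAsFactors = FALSE)
level2 <- data.frame(from = rep(managers, each = 2), to = supervisors,
                     stringsAsFactors = FALSE)
level3 <- data.frame(from = rep(supervisors, each = 3), to = employees,
                     stringsAsFactors = FALSE)

# Stack the levels into a single edge list
edges <- rbind(level1, level2, level3)
print(head(edges))
```

Because every node except the root appears exactly once in the `to` column, the edge list describes a tree. It can then be turned into a graph object, for example with `igraph::graph_from_data_frame(edges)`, and drawn with a dendrogram layout.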
Load the necessary libraries: ggraph, igraph, tidyverse, and viridis. These libraries provide functions and tools for data manipulation, plotting, and color generation.
In summary, dendrograms offer a clear and intuitive way to visualize hierarchical data, revealing how items or groups are related. Whether the goal is understanding evolutionary relationships in biology or segmenting customers in marketing, dendrograms help us see patterns and connections that might not be obvious otherwise. So, whether we are exploring data for research or making business decisions, dendrograms are a useful tool for uncovering hierarchical relationships in a straightforward, visual manner.