
URL: https://www.analyticsvidhya.com/blog/2021/11/visualize-data-using-sankey-diagram/

Visualization with Sankey Diagram - Analytics Vidhya



Here’s How to use Sankey Diagrams for Data Visualization

Sreedevi Last Updated : 16 Oct, 2024
8 min read

This article was published as a part of the Data Science Blogathon.

Introduction to Sankey Diagram for Data Visualization

Very often, we need to visualize how data flows between entities. For example, consider how residents have migrated from one country to another within the UK. It would be interesting to see how many residents have migrated from England to, say, Northern Ireland, Scotland, and Wales.

Image source

From this Sankey diagram visualization, it is apparent that more residents have migrated from England to Wales than to Scotland or to Northern Ireland.

What is a Sankey diagram?

Sankey diagrams typically depict the flow of data from one entity (or node) to another.

The entity that data flows from or to is referred to as a node – the node where the flow originates is the source node (e.g. England on the left-hand side) and the node where the flow ends is the target node (e.g. Wales on the right-hand side). The source and target nodes are often represented as rectangles with a label.

The flow itself, represented by a straight or curved path, is called the link. The width of the link is proportional to the quantity of flow. In the above example, the link (i.e. migration of residents) from England to Wales is wider than those from England to Scotland or Northern Ireland, indicating that more residents migrated to Wales than to the other countries.

Sankey diagrams can be used to represent the flow of energy, money, costs, or anything else that has a notion of flow.

Minard’s classic diagram of Napoleon’s invasion of Russia is perhaps the most famous example of the Sankey diagram. This visualization using the Sankey diagram displays very effectively how the French army progressed (or dwindled?) on its way to Russia and back.

Image source

Now, let’s see how we can use Python’s plotly library to plot a Sankey diagram.

How to plot a Sankey diagram?

For plotting a Sankey diagram, let’s use the Olympics 2021 dataset. This dataset has details about the medals tally – country, total medals, and the split across the gold, silver, and bronze medals. Let’s plot a Sankey diagram to understand how many of the medals a country won are Gold, Silver, and Bronze.

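import pandas as pd

# Reading .xlsx files with pandas needs an Excel engine such as openpyxl installed
df_medals = pd.read_excel("Medals.xlsx")
df_medals.info()  # info() prints its summary itself and returns None
# Give the columns more descriptive names
df_medals.rename(columns={'Team/NOC':'Country', 'Total': 'Total Medals', 'Gold':'Gold Medals', 'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)
print(df_medals)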

A basic plot

We will use plotly’s graph_objects interface (imported as go), whose Sankey trace takes 2 parameters – node and link.

Note that all the nodes – source and target should have unique identifiers.

In this case,

  • the Source would be the country. Let’s consider the top 3 countries (which are the USA, China and Japan) as the source nodes. Let’s mark these source nodes with the following (unique) identifiers, labels and colours
    • 0: United States of America: green
    • 1: People’s Republic of China: blue
    • 2: Japan: orange
  • the Target would be the Gold, Silver and Bronze medals. Let’s mark these target nodes with the following (unique) identifiers, labels and colours
    • 3: Gold: gold
    • 4: Silver: silver
    • 5: Bronze: brown
  • the Link (between the source and target nodes) would be the number of medals of each kind (Gold, Silver, Bronze). Each source has 3 links originating from it, one ending in each target – Gold, Silver and Bronze. So we will have a total of 9 links. The width of each link should be the corresponding number of Gold, Silver or Bronze medals. Let’s mark these links with the following source to target and values
    • 0 (USA) to 3,4,5 : 39, 41, 33
    • 1 (China) to 3, 4, 5 : 38, 32, 18
    • 2 (Japan) to 3,4,5 : 27, 14, 17
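
The identifiers and values listed above can also be derived rather than typed by hand. The following is a minimal sketch, assuming the medal counts are available as plain records; the names medal_rows, MEDAL_COLUMNS and build_links are hypothetical, and the counts are the ones quoted above.

```python
# Medal counts for the top 3 countries, as quoted in the list above.
# Assumption: with the renamed DataFrame, similar records could come from
# df_medals.head(3).to_dict("records") if the sheet is sorted by rank.
medal_rows = [
    {"Country": "United States of America", "Gold Medals": 39, "Silver Medals": 41, "Bronze Medals": 33},
    {"Country": "People's Republic of China", "Gold Medals": 38, "Silver Medals": 32, "Bronze Medals": 18},
    {"Country": "Japan", "Gold Medals": 27, "Silver Medals": 14, "Bronze Medals": 17},
]
MEDAL_COLUMNS = ["Gold Medals", "Silver Medals", "Bronze Medals"]


def build_links(rows, medal_columns):
    """Return (labels, source, target, value) with one link per country/medal pair."""
    countries = [row["Country"] for row in rows]
    # Source nodes get the first ids (0..n-1); target nodes follow (n..n+2).
    labels = countries + [col.replace(" Medals", "") for col in medal_columns]
    source, target, value = [], [], []
    for country_id, row in enumerate(rows):
        for offset, col in enumerate(medal_columns):
            source.append(country_id)
            target.append(len(countries) + offset)
            value.append(row[col])
    return labels, source, target, value


labels, source, target, value = build_links(medal_rows, MEDAL_COLUMNS)
print(labels)
print(source)
print(target)
print(value)
```

The three lists come out in the same order as the hand-written ones, so each position describes one link.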

We will need to instantiate 2 python dict objects to represent the

  • nodes (both source and target): with labels & colors as individual lists and
  • links: source node, target node, value (width), and the color of the links as individual lists

and pass these to plotly’s go interface Sankey.

Each index of the lists – label, source, target, value, and color – corresponds to one node or link respectively.
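
Because the lists are matched purely by position, a list that is one entry short or a node id without a label is easy to introduce. The helper below is a small sketch of a sanity check one could run before plotting; check_sankey_lists is a hypothetical name and not part of plotly.

```python
def check_sankey_lists(node, link):
    """Return (n_nodes, n_links); raise ValueError if the lists are misaligned."""
    n_nodes = len(node["label"])
    if len(node.get("color", node["label"])) != n_nodes:
        raise ValueError("node 'color' must have one entry per label")

    n_links = len(link["source"])
    for key in ("target", "value", "color"):
        if key in link and len(link[key]) != n_links:
            raise ValueError(f"link '{key}' must have {n_links} entries")

    # Every source/target id must point at an existing node label.
    for node_id in list(link["source"]) + list(link["target"]):
        if not 0 <= node_id < n_nodes:
            raise ValueError(f"node id {node_id} has no matching label")
    return n_nodes, n_links


# A deliberately tiny example: one source node and two target nodes.
node = dict(label=["United States of America", "Gold", "Silver"],
            color=["seagreen", "gold", "silver"])
link = dict(source=[0, 0], target=[1, 2], value=[39, 41])
print(check_sankey_lists(node, link))  # (3, 2)
```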

import plotly.graph_objects as go

NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],)
LINKS = dict(
    source = [  0,  0,  0,  1,  1,  1,  2,  2,  2], # The origin or the source nodes of the link
    target = [  3,  4,  5,  3,  4,  5,  3,  4,  5], # The destination or the target nodes of the link
    value  = [ 39, 41, 33, 38, 32, 18, 27, 14, 17], # The width (quantity) of the links
    # Color of the links
    # Target Node:    3-Gold          4-Silver        5-Bronze
    color  = [  "lightgreen",   "lightgreen",   "lightgreen",      # Source Node: 0 - United States of America
                "lightskyblue", "lightskyblue", "lightskyblue",    # Source Node: 1 - People's Republic of China
                "bisque",       "bisque",       "bisque"],)        # Source Node: 2 - Japan
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.show()

Sankey diagram – a basic plot

Here we have a very basic plot. But do you notice how the diagram is too wide and Silver appears before the Gold? Let’s adjust the position of the nodes and the width.

Adjust the position of nodes and width of the diagram

Let’s add the x and y positions for the nodes to explicitly specify the positions of the nodes. The values should be between 0 and 1.

NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],
            x     = [                         0,                            0,         0,    0.5,      0.5,      0.5],
            y     = [                         0,                          0.5,         1,    0.1,      0.5,        1],)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.show()

With this, we get a compact diagram:

Sankey diagram – node position adjusted

See below how the various parameters passed in the code map to the nodes and links in the diagram

Sankey diagram – how code maps to diagram
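
Typing the x and y values by hand becomes tedious as the number of nodes grows. As a rough sketch, evenly spaced positions for one column of nodes could be generated with a small helper; spread is a hypothetical name, and the 0.1 to 0.9 range is an arbitrary choice that keeps the nodes inside the 0 to 1 interval.

```python
def spread(count, lo=0.1, hi=0.9):
    """Return `count` evenly spaced positions between lo and hi (inclusive)."""
    if count == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (count - 1)
    # Round to avoid floating-point noise such as 0.9000000000000001
    return [round(lo + i * step, 3) for i in range(count)]


n_sources, n_targets = 3, 3
x = [0.0] * n_sources + [0.5] * n_targets    # one x value per column
y = spread(n_sources) + spread(n_targets)    # nodes spread down each column
print(x)
print(y)
```

The resulting lists have one entry per node, in the same order as the label list, so they can be used as the x and y entries of the node dict.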

Add meaningful hover labels

The plot is interactive. You could hover on the nodes and the links for more information.

Sankey diagram – with default hover labels

Currently, the information displayed in the hover labels is the default text. When you hover on the

  • nodes, the node name, the number of incoming flows, the number of outgoing flows and the total value are displayed. For instance,
    • node United States of America has a total of 113 medals (= 39 Gold + 41 Silver + 33 Bronze)
    • node Gold has a total of 104 medals (= 39 from the USA, 38 from China, 27 from Japan)
  • links, the source node name, the target node name and the value of the link are displayed. For instance, the link from the source node USA to the target node Silver has 41 medals.
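
The node totals shown on hover are simply the sums of the link values entering or leaving each node. A short sketch that reproduces them from the link lists used in this article (the variable names are illustrative):

```python
from collections import defaultdict

labels = ["United States of America", "People's Republic of China", "Japan",
          "Gold", "Silver", "Bronze"]
source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]
value = [39, 41, 33, 38, 32, 18, 27, 14, 17]

outgoing = defaultdict(int)  # total value leaving each node
incoming = defaultdict(int)  # total value arriving at each node
for s, t, v in zip(source, target, value):
    outgoing[s] += v
    incoming[t] += v

# In this diagram every node is either a pure source or a pure target,
# so one of the two sums is always zero.
node_totals = {label: outgoing[i] + incoming[i] for i, label in enumerate(labels)}
for label, total in node_totals.items():
    print(f"{label}: {total} medals")
```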

Don’t you think the labels are too verbose? All these can be improved.

Let’s improve the format of the hover labels using the hovertemplate parameter

  • For the nodes, since the hover labels do not give any information beyond what is already visible, let’s turn them off by passing a blank hovertemplate = " "
  • For the links, we can make the label concise in the format <country>-<medal type>
  • For both the nodes and links, let’s have the values displayed with the suffix “Medals”, e.g. 113 Medals instead of just 113. This can be achieved by using the update_traces function with appropriate valueformat and valuesuffix.

import plotly.graph_objects as go

NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],
            x     = [                         0,                            0,         0,    0.5,      0.5,      0.5],
            y     = [                         0,                          0.5,         1,    0.1,      0.5,        1],
            hovertemplate=" ",)  # blank template hides the default node hover text

# One label per link, in the same order as the source/target lists below
LINK_LABELS = []
for country in ["USA","China","Japan"]:
    for medal in ["Gold","Silver","Bronze"]:
        LINK_LABELS.append(f"{country}-{medal}")

LINKS = dict(
    source = [  0,  0,  0,  1,  1,  1,  2,  2,  2], # The origin or the source nodes of the link
    target = [  3,  4,  5,  3,  4,  5,  3,  4,  5], # The destination or the target nodes of the link
    value  = [ 39, 41, 33, 38, 32, 18, 27, 14, 17], # The width (quantity) of the links
    # Color of the links
    # Target Node:    3-Gold          4-Silver        5-Bronze
    color  = [  "lightgreen",   "lightgreen",   "lightgreen",      # Source Node: 0 - United States of America
                "lightskyblue", "lightskyblue", "lightskyblue",    # Source Node: 1 - People's Republic of China
                "bisque",       "bisque",       "bisque"],         # Source Node: 2 - Japan
    label = LINK_LABELS, hovertemplate="%{label}",)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.update_traces( valueformat='3d', valuesuffix=' Medals', selector=dict(type='sankey'))
fig.update_layout(hoverlabel=dict(bgcolor="lightgray",font_size=16,font_family="Rockwell"))
fig.show()

Sankey diagram – with improved hover labels

Generalize for multiple nodes and levels

Nodes are referred to as source and target with respect to a link. A node that is a target for one link can be a source for another.

  • The code can be generalized to handle all the countries in the dataset.
  • We can also extend the diagram to another level to visualize the total number of medals across the countries.
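
One possible way to do both is sketched below, under the assumption that the data is available as one record per country (for example via df_medals.to_dict("records")). The function name build_two_level_sankey and the root label "Total Medals" are hypothetical. Each country node is the target of a link from the root node and the source of its three medal links.

```python
def build_two_level_sankey(rows, medal_columns, root_label="Total Medals"):
    """Build node and link dicts for root -> country -> medal type."""
    countries = [row["Country"] for row in rows]
    medals = [col.replace(" Medals", "") for col in medal_columns]
    # Node ids: 0 is the root, then the countries, then the medal types.
    labels = [root_label] + countries + medals
    first_medal_id = 1 + len(countries)

    source, target, value = [], [], []
    for i, row in enumerate(rows):
        country_id = 1 + i
        # Level 1: root -> country, as wide as the country's total medals
        source.append(0)
        target.append(country_id)
        value.append(sum(row[col] for col in medal_columns))
        # Level 2: country -> Gold / Silver / Bronze
        for offset, col in enumerate(medal_columns):
            source.append(country_id)
            target.append(first_medal_id + offset)
            value.append(row[col])

    node = dict(label=labels)
    link = dict(source=source, target=target, value=value)
    return node, link


rows = [
    {"Country": "United States of America", "Gold Medals": 39, "Silver Medals": 41, "Bronze Medals": 33},
    {"Country": "People's Republic of China", "Gold Medals": 38, "Silver Medals": 32, "Bronze Medals": 18},
    {"Country": "Japan", "Gold Medals": 27, "Silver Medals": 14, "Bronze Medals": 17},
]
node, link = build_two_level_sankey(rows, ["Gold Medals", "Silver Medals", "Bronze Medals"])
print(node["label"])
print(list(zip(link["source"], link["target"], link["value"])))
```

The returned dicts have the same shape as NODES and LINKS, so they could be passed to go.Sankey(node=node, link=link). Colours, positions and hover labels are left out of this sketch.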

End Notes

We saw how Sankey diagrams can be used to represent flows effectively and how the plotly Python library can be used to generate Sankey diagrams for a sample dataset.

About the author

Sreedevi Gattu

A technical architect who also loves to break complex concepts into easily digestible capsules! Currently, finding my way around the fascinating world of data visualizations and data storytelling!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion
