URL: https://www.analyticsvidhya.com/blog/2022/04/exploratory-data-analysis-eda-in-python/

Exploratory Data Analysis (EDA) in Python

Subhi Last Updated : 06 Apr, 2022
7 min read

Introduction

Exploratory Data Analysis (EDA) is the process of examining data in order to understand its key characteristics and derive insights. EDA can be divided into two categories: graphical analysis and non-graphical analysis.

EDA is a critical component of any data science or machine learning process. You must explore the data and understand both the relationships between variables and the underlying structure of the data in order to build a reliable and valuable output based on it.

The EDA stages will be carried out in this tutorial using the Python programming language.

The Dataset

For this article, we will work with a customer churn dataset. When clients stop doing business with a company, this is known as customer churn or customer attrition.

Because the cost of getting a new customer is usually higher than keeping an existing one, understanding customer churn is critical to a company’s success. As a result, churn analysis is the first step in gaining a better understanding of your clients.

To gain a deeper grasp of our data, we will go deep into Exploratory Data Analysis (EDA). The dataset is available here.

Importing the Python Libraries

First of all, we need to import the libraries required for the analysis: Pandas for handling data, NumPy for numerical calculations, and Matplotlib and Seaborn for visualization.
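The original import statements were shown as an image. A minimal sketch of what they might look like is given here; the aliases (np, pd, plt, sns) are the usual community conventions and are an assumption, not taken from the article.

```python
# Conventional aliases; the article's exact import lines are not available.
import numpy as np               # numerical calculations
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # base plotting library
import seaborn as sns            # statistical plots built on Matplotlib
```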


Loading the Dataset in Python

Now, load the dataset into a pandas DataFrame.
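Loading typically comes down to one call to pd.read_csv. The sketch below reads a tiny in-memory CSV so that it runs on its own; the rows, the column names and the file name mentioned in the comment are hypothetical stand-ins for the real dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file. In practice you would pass a path,
# for example pd.read_csv("customer_churn.csv") (file name assumed).
csv_text = """customerID,tenure,MonthlyCharges,TotalCharges,Churn
0001-A,1,29.85,29.85,No
0002-B,34,56.95,1889.5,No
0003-C,2,53.85,108.15,Yes
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```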


Structure-Based Data Exploration

This is the first part of EDA where the data frame is evaluated for structure, columns and data types. The goal of this step is to get a general understanding of the dataset.

Display the first 5 Observations
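A minimal sketch of this step, assuming the data is in a DataFrame named df. The toy frame here is illustrative only and does not come from the real dataset.

```python
import pandas as pd

# Toy frame with 8 rows standing in for the real dataset
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "No"],
})

first_rows = df.head()  # first 5 rows by default; df.head(n) returns n rows
print(first_rows)
```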

We get the output as:

[Image: first five rows of the dataset]

Display the Last 5 Observations
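The last rows can be inspected in the same way with df.tail(). As before, the toy frame is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Toy frame with 8 rows standing in for the real dataset
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "No"],
})

last_rows = df.tail()  # last 5 rows by default; df.tail(n) returns n rows
print(last_rows)
```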

The output is:

[Image: last five rows of the dataset]

Display the Number of Variables and Observations

This can be done with df.shape, which returns a tuple of two values. The first value is the number of rows (observations) and the second is the number of columns (variables) in the dataset.
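As a small illustration, the tuple can be unpacked directly. The frame used here is a toy example, so its shape differs from that of the real dataset.

```python
import pandas as pd

# Toy frame: 4 observations, 3 variables
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Churn": ["No", "No", "Yes", "No"],
})

n_rows, n_cols = df.shape  # shape is an attribute, not a method
print(f"{n_rows} rows and {n_cols} columns")
```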


In this dataframe, there are 7043 rows and 21 columns.

Display the Variable Names and their Data Types
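One way to do this is with the df.dtypes attribute, sketched here on a toy frame. The column names are assumptions. The example also stores one charge as a blank string to show how a numeric-looking column can end up with a text data type.

```python
import pandas as pd

# Toy frame; "TotalCharges" is stored as text because one entry is blank
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", " ", "108.15"],
})

print(df.columns.tolist())  # variable names
print(df.dtypes)            # data type of each variable
```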

[Image: variable names and data types]

Count the number of Non-Missing Values for each variable

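df.count() returns the number of non-missing (non-NaN) values in each column. Comparing these counts with the total number of rows gives an idea of how many values are missing. Note that blank strings are not treated as missing, so they are still counted.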
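A short sketch on a toy frame with one NaN shows how the per-column counts differ from the row count. The data and column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in "TotalCharges"
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, np.nan, 108.15, 1840.75],
})

non_missing = df.count()  # non-NaN values per column
print(non_missing)
print("rows:", len(df))
```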


Descriptive Statistics

To learn more about the characteristics of the dataset, we will use df.describe(), which by default gives statistical information about all numerical features in the data frame.

[Image: descriptive statistics of the numerical features]

For numerical features, df.describe() gives basic statistical details: the count, mean and standard deviation, plus the five-number summary, which consists of the minimum, first quartile (25%), second quartile or median (50%), third quartile (75%) and maximum.
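The following sketch runs describe() on a toy frame (illustrative data, assumed column names). The text column is left out of the result because only numerical columns are summarized by default.

```python
import pandas as pd

# Toy frame with two numerical columns and one text column
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Churn": ["No", "No", "Yes", "No"],
})

summary = df.describe()  # numerical columns only by default
print(summary)
```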

What about the categorical features?

By passing the include argument with the value 'all', we get a summary of every column, including the categorical features.
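As a sketch on toy data (assumed column names), include='all' adds the rows unique, top and freq for categorical columns, while the numerical rows stay in place.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "Month-to-month"],
    "tenure": [1, 34, 2, 45],
})

full_summary = df.describe(include="all")
print(full_summary)
```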

[Image: summary including the categorical features]

Display the Complete Summary of the Dataset

df.info() prints a summary of the dataframe, including the number of entries, the column names with their non-null counts and data types, and the memory usage.
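A minimal sketch on a toy frame (illustrative data). Note that df.info() prints its report and returns None, so there is nothing to assign.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, np.nan, 108.15, 1840.75],
    "Churn": ["No", "No", "Yes", "No"],
})

df.info()  # prints index range, columns, non-null counts, dtypes, memory usage
```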

[Image: summary of the dataset]

Handling Missing Values

Missing values are the unknown values in the dataset. Understanding them is important in order to manage data successfully. The first step is to detect the missing values in the dataset; the second is to treat them using an appropriate method.

Detecting the Missing Values


  • Passing errors='coerce' when converting a column to numeric (for example with pd.to_numeric()) replaces every value that cannot be parsed as a number with NaN.

  • isnull().sum() returns the number of missing values in each column of the dataset.

We have 11 missing values in the 'Total Charges' column. Now, we will see different methods to deal with them.
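The detection step can be sketched as follows. The original code was shown as an image, so this is an assumption about its shape: the column name "TotalCharges" and the blank-string entry are hypothetical, and the toy frame has one unparseable value rather than 11.

```python
import pandas as pd

# Toy frame: one charge is a blank string, so the column is stored as text
df = pd.DataFrame({
    "tenure": [1, 0, 2, 45],
    "TotalCharges": ["29.85", " ", "108.15", "1840.75"],
})

# Values that cannot be parsed as numbers become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

missing_per_column = df.isnull().sum()  # NaN count per column
print(missing_per_column)
```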

Missing Value Treatment

To treat missing values we can use the following ways:

  • Drop the variable

  • Drop the observation(s)

  • Mean imputation or median imputation or mode imputation

For the variable 'Total Charges', only 11 values are missing. Since these records make up a very small share of the 7043 rows in the dataset, we can drop them.
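Dropping the affected observations might look like the sketch below. The toy data and the column name "TotalCharges" are assumptions, and resetting the index afterwards is optional.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing total charge
df = pd.DataFrame({
    "tenure": [1, 0, 2, 45],
    "TotalCharges": [29.85, np.nan, 108.15, 1840.75],
})

# Drop only the rows where "TotalCharges" is missing
df_clean = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(df_clean.isnull().sum())
```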


Done. Let’s check by counting the missing values again.

[Image: missing-value counts after treatment]

Analysis using Charts

Data Visualizations

Now, it’s time to visualize the data. With the help of data visualization, we can see how the data is distributed and what relationships exist between its features. It’s the quickest way to check whether the features are related to the target.

Let’s visualize the target variable, Churn. It has two categories: Yes and No.


Display a frequency distribution of churn
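A frequency plot of this kind can be sketched with a plain Matplotlib bar chart over value_counts(); Seaborn's countplot would give a similar figure. The toy data is illustrative, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Toy target column standing in for the real data
df = pd.DataFrame({"Churn": ["No", "No", "Yes", "No", "Yes", "No"]})

churn_counts = df["Churn"].value_counts()  # frequency of each category

fig, ax = plt.subplots()
ax.bar(churn_counts.index.tolist(), churn_counts.values)
ax.set_xlabel("Churn")
ax.set_ylabel("Number of customers")
ax.set_title("Frequency distribution of churn")
```

In an interactive session, calling plt.show() afterwards displays the figure.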

[Image: frequency distribution of churn]

The plot shows a class imbalance of the data between churners and non-churners. To address this, resampling would be a suitable approach.
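The article does not show a resampling step, so the following is only one possible sketch: random oversampling of the minority class with pandas. The toy data is illustrative. In a modelling workflow, resampling is normally applied to the training split only.

```python
import pandas as pd

# Toy imbalanced frame: 4 non-churners, 2 churners
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})

majority = df[df["Churn"] == "No"]
minority = df[df["Churn"] == "Yes"]

# Sample the minority class with replacement until it matches the majority
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
print(balanced["Churn"].value_counts())
```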

There are 17 categorical features in the dataset. Let’s see how churn varies across them.

Note: I have only shown the 5 graphs that I consider most important.

[Images: churn across selected categorical variables]

Total charges are the monthly charges accumulated over a customer’s time with the company. So, let’s visualize their relationship.

[Image: relationship between Monthly Charges and Total Charges]

  • Here we can see that Total Charges and Monthly Charges are highly correlated.
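The strength of such a relationship can also be checked numerically with the Pearson correlation. In this sketch the column names are assumptions, and the toy totals are built as tenure times monthly charge purely for illustration.

```python
import pandas as pd

# Toy frame; totals are constructed as tenure * monthly charge (assumption)
df = pd.DataFrame({
    "tenure": [1, 12, 24, 48, 72],
    "MonthlyCharges": [30.0, 50.0, 70.0, 90.0, 110.0],
})
df["TotalCharges"] = df["tenure"] * df["MonthlyCharges"]

# Pearson correlation between the two charge columns
r = df["MonthlyCharges"].corr(df["TotalCharges"])
print(f"correlation: {r:.2f}")

# The same value appears off the diagonal of the correlation matrix
print(df[["MonthlyCharges", "TotalCharges"]].corr())
```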

Here we are trying to visualize the churning rate with respect to Contract.

[Image: customer contract distribution]

  • About 75% of customers with month-to-month contracts opted to move out, compared with 13% of customers with one-year contracts and 3% with two-year contracts.
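Percentages of this kind can be computed with a row-normalized cross-tabulation. The sketch below uses toy data with assumed column names and category labels, so its numbers are illustrative and do not reproduce the figures quoted above.

```python
import pandas as pd

# Toy frame: contract type and churn flag per customer
df = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "No", "No",
              "No", "No", "No", "Yes",
              "No", "No"],
})

# normalize="index" makes each row (contract type) sum to 1
rates = pd.crosstab(df["Contract"], df["Churn"], normalize="index") * 100
print(rates.round(1))
```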

This is the visualization of the payment method. It has four categories.

[Image: payment method distribution]

  • Electronic check is the most widely used payment method.
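The counts behind such a chart can be obtained with value_counts(). The category labels and the column name below are assumptions, and the toy data is illustrative only.

```python
import pandas as pd

# Toy frame with four payment categories (labels assumed)
df = pd.DataFrame({"PaymentMethod": [
    "Electronic check", "Mailed check", "Electronic check",
    "Bank transfer (automatic)", "Credit card (automatic)", "Electronic check",
]})

counts = df["PaymentMethod"].value_counts()  # customers per method
shares = df["PaymentMethod"].value_counts(normalize=True).mul(100).round(1)
most_used = counts.idxmax()                  # label with the highest count

print(shares)
print("most used:", most_used)
```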

This graph shows the churning rate with respect to Dependents.

[Image: dependents distribution]

  • Customers without dependents are more likely to churn.

Churn distribution w.r.t Partners

This graph shows the churning rate with respect to Partners.

[Image: churn distribution by partner]

  • Customers without partners are more likely to churn.
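The same comparison can be repeated for any categorical feature with a small helper. The function churn_rate_by is a hypothetical name introduced for illustration, and the toy data and column names are assumptions.

```python
import pandas as pd

def churn_rate_by(df, column):
    """Percentage of churners within each category of `column` (hypothetical helper)."""
    churned = df["Churn"].eq("Yes")  # boolean flag per customer
    return churned.groupby(df[column]).mean().mul(100).round(1)

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    "Partner":    ["Yes", "No", "No", "Yes", "No", "Yes"],
    "Dependents": ["No", "No", "No", "Yes", "Yes", "No"],
    "Churn":      ["No", "Yes", "Yes", "No", "No", "No"],
})

partner_rates = churn_rate_by(df, "Partner")
dependent_rates = churn_rate_by(df, "Dependents")
print(partner_rates)
print(dependent_rates)
```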

Conclusion

In this article, we analyzed customer behaviour. First, we explored the dataset at a basic level. We then looked for missing values and treated them by dropping the affected rows. Finally, we used pandas to perform Exploratory Data Analysis by plotting different graphs such as count plots, pie charts, line plots and histplots. From this, we gained useful insights such as “Customers with month-to-month contracts churn the most” and “Total charges and monthly charges are highly correlated”. Performing EDA in this way lets us explore a dataset and extract insights that help in model building and in better decision making.

However, this was only a basic overview of how EDA works; you can go deeper into it and attempt the stages on larger datasets.

You can reach out to me on LinkedIn. 


Responses From Readers

Thanks for your article. It's very nice. A couple of years back I found this one, which is great too: https://towardsdatascience.com/exploring-your-data-with-just-1-line-of-python-4b35ce21a82d
