Data exploration is a critical initial step in the data analysis process, where analysts examine large datasets to uncover patterns, outliers, and relationships before formal modeling and analysis occur. This stage, called exploratory data analysis (EDA), utilizes various statistical techniques and powerful data visualization tools to understand the data’s key characteristics, quality, and structure. Popular open-source tools like Python and R and software like Tableau enable robust data visualization during exploration of data through methods like histograms, scatter plots, box plots, and more.
Effective data exploration allows early detection of data quality issues, identifies variables and relationships of interest, and guides the direction of subsequent predictive modeling and machine learning workflows. Data Analysts can make data-driven decisions by fully understanding the raw data, optimizing their analysis approach, and extracting maximum insight from the available information. Careful exploratory analysis is, therefore, a crucial foundational step for any successful data science or analytics project. This guide will explore the key stages, statistical methods, and data exploration techniques skilled data analysts and scientists use.
Overview:
Data exploration is a critical step in data analysis, where data scientists and analysts examine large datasets to understand their main characteristics before further analysis. This stage, often called Exploratory Data Analysis (EDA), involves using various statistical techniques and data visualization tools to uncover patterns, relationships, and outliers within the data. Tools like Python, R, and Tableau are commonly used for this purpose, enabling data visualization through graphs, histograms, scatter plots, and box plots.
Remember, the quality of your inputs decides the quality of your output. So, once your business hypothesis is ready, it makes sense to spend a lot of time and effort here. By my estimate, data exploration, cleaning, and preparation can take up to 70% of your total project time.
Below are the steps data analysis professionals typically follow to understand, clean, and prepare data for building predictive models:
Finally, we will need to iterate over steps 4 – 7 multiple times before we develop our refined model.
Let’s now study each stage of data exploration in detail.
First, identify the Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.
Let’s understand this step in data exploration more clearly by taking an example.
For example, suppose we want to predict whether students will play cricket (refer to the data set below). Here, you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable. Below, the variables have been defined in different categories:
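The variable identification step can be sketched in a few lines of Python. This is only an illustration: the records, the column names (`student_id`, `gender`, `prev_exam_marks`, `height_cm`, `plays_cricket`), and the rule "numeric means continuous" are assumptions made for the example, not the article's original data set.

```python
# Hypothetical student records; the article's own table is not reproduced here.
students = [
    {"student_id": "S001", "gender": "M", "prev_exam_marks": 65, "height_cm": 178.0, "plays_cricket": "Y"},
    {"student_id": "S002", "gender": "F", "prev_exam_marks": 75, "height_cm": 174.0, "plays_cricket": "N"},
    {"student_id": "S003", "gender": "M", "prev_exam_marks": 45, "height_cm": 163.0, "plays_cricket": "N"},
    {"student_id": "S004", "gender": "F", "prev_exam_marks": 57, "height_cm": 177.0, "plays_cricket": "Y"},
]

TARGET = "plays_cricket"  # assumed name of the target (output) variable


def identify_variables(rows, target):
    """Label each column with its role, data type, and variable category."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        # Simplifying assumption: numeric columns are continuous, the rest categorical.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        summary[name] = {
            "role": "target" if name == target else "predictor",
            "data_type": "numeric" if is_numeric else "character",
            "category": "continuous" if is_numeric else "categorical",
        }
    return summary


variable_summary = identify_variables(students, TARGET)
for name, info in variable_summary.items():
    print(name, info)
```

In real data the numeric rule is too coarse (an ID or a coded category can be numeric), so the automatic labels are a starting point to review by hand.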
At this stage, we explore variables one by one. The method to perform univariate analysis will depend on whether the variable type is categorical or continuous. Let’s look at these methods and statistical measures for categorical and continuous variables individually:
Continuous Variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods, as shown below:
Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will examine methods for handling missing and outlier values.
Categorical Variables: For categorical variables, we’ll use a frequency table to understand the distribution of each category. We can also read the percentage of values under each category. It can be measured against each category using two metrics: Count and Count%. A bar chart can be used as a visualization.
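As a rough sketch of univariate analysis, the snippet below computes central tendency and spread for a continuous variable, and a Count / Count% frequency table for a categorical one. It uses only Python's standard library; the `heights` and `genders` values are made up for illustration.

```python
import statistics
from collections import Counter

# Hypothetical sample values, for illustration only.
heights = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 200.0]
genders = ["M", "F", "F", "M", "M", "F", "M", "M"]


def summarize_continuous(values):
    """Central tendency and spread of a continuous variable."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }


def frequency_table(values):
    """Count and Count% for each level of a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {
        level: {"count": count, "count_pct": 100.0 * count / total}
        for level, count in counts.items()
    }


height_summary = summarize_continuous(heights)
gender_table = frequency_table(genders)
print(height_summary)
print(gender_table)
```

A histogram or box plot of `heights` and a bar chart of `gender_table` would be the matching visual checks.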
Bivariate analysis in data exploration refers to finding the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform a bivariate analysis for any combination of categorical and continuous variables: categorical and categorical, categorical and continuous, or continuous and continuous. Different methods are used to tackle these combinations during the analysis process.
Let’s understand the possible combinations in detail:
In a bivariate analysis of two continuous variables, we should look at a scatter plot. It is a nifty way to determine the relationship between two variables. The pattern of the scatter plot indicates the relationship between variables, which can be linear or non-linear.
A scatter plot shows the relationship between two variables but does not indicate its strength. To find the strength of the relationship, we use Correlation, which varies between -1 and +1.
This correlation can be derived using the following formula:
Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
Various tools have functions or functionality to identify correlations between variables in data exploration. In Excel, function CORREL() returns the correlation between two variables, and SAS uses procedure PROC CORR to identify the correlation. This function returns the Pearson Correlation value to identify the relationship between two variables:
In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
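The correlation formula above translates directly into code. The sketch below is a plain-Python version of Pearson correlation; the `x` and `y` values are hypothetical and are not the data behind the 0.65 figure.

```python
import math


def pearson_correlation(x, y):
    """Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_correlation(x, y)
print(round(r, 4))
```

The result always lies between -1 and +1. Values near 0 only mean there is no linear relationship; a strong non-linear pattern can still show up in the scatter plot.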
To find the relationship between two categorical variables, we can use the following methods:
The chi-square test statistic for a test of independence of two categorical variables is found by:
Chi-square = SUM( (O - E)^2 / E ), summed over all cells of the two-way table
Where O represents the observed frequency. E is the expected frequency under the null hypothesis.
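To make the calculation concrete, here is a minimal sketch that derives the expected frequency for every cell (row total times column total, divided by the sample size) and sums (O - E)^2 / E. The 2x2 `observed` table is hypothetical and is not the two-way table from this article.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom, expected counts).

    `observed` is a two-way table given as a list of rows of counts.
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    sample_size = sum(row_totals)

    chi_square = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            # Expected count under the null hypothesis of independence.
            e = row_totals[i] * col_totals[j] / sample_size
            expected_row.append(e)
            chi_square += (o - e) ** 2 / e
        expected.append(expected_row)

    degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, degrees_of_freedom, expected


# Hypothetical 2x2 table of observed counts.
observed = [
    [10, 20],
    [30, 40],
]
chi_square, dof, expected = chi_square_statistic(observed)
print(chi_square, dof, expected)
```

The statistic is then compared with the chi-square distribution for the given degrees of freedom to obtain a p-value. Statistical packages typically provide this lookup.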
From the previous two-way table, the expected count for product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for Product category (2) and then dividing by the sample size (81). This procedure is conducted for each cell. Statistical Measures used to analyze the power of relationship are:
To explore the relation between categorical and continuous variables, we can draw box plots for each level of categorical variables. The plots will not show statistical significance if the levels are small in number. We can perform a Z-test, T-test, or ANOVA to examine the statistical significance.
Example: Suppose we want to test the effect of five different exercises. For this, we recruited 20 men and assigned one type of exercise to 4 men (5 groups). Their weights are recorded after a few weeks. We need to determine whether these exercises’ effect on them is significantly different. This can be done by comparing the weights of the 5 groups of 4 men each.
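A one-way ANOVA for this kind of design can be sketched by hand as follows. The weights are invented for illustration (five groups of four men), and the code stops at the F statistic: turning it into a p-value needs the F distribution, which statistical packages provide.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical weights (kg) after a few weeks, one list per exercise type.
exercise_groups = [
    [82.0, 85.0, 88.0, 81.0],
    [78.0, 80.0, 79.0, 83.0],
    [90.0, 92.0, 88.0, 91.0],
    [84.0, 86.0, 85.0, 87.0],
    [76.0, 79.0, 77.0, 80.0],
]
f_stat, df_between, df_within = one_way_anova_f(exercise_groups)
print(f_stat, df_between, df_within)
```

A large F value relative to the critical value for (4, 15) degrees of freedom would suggest that at least one exercise group differs from the others.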
Also Read: Difference between Z-Test and T-Test
Now, we will examine the methods for treating Missing values. More importantly, we will also examine why missing values occur in our data and why treating them is necessary.
Missing data in the training data set can reduce the power/fit of a model or lead to a biased model because we have not analyzed the behavior and relationship with other variables correctly. Untreated missing values can also lead to wrong predictions or classifications.
Notice the missing values in the image above: In the left scenario, we have not treated missing values. The inference from this data set is that males’ chances of playing cricket are higher than females’. On the other hand, if you look at the second table, which shows data after treatment of missing values (based on gender), we can see that females have higher chances of playing cricket than males.
We looked at the importance of treating missing values in a dataset. Now, let’s explain the reasons for these missing values. They may occur in two stages:
Deletion is of two types: listwise deletion and pairwise deletion.
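The difference between the two deletion types can be illustrated with a small sketch. It assumes the usual definitions: listwise deletion drops every record that has any missing value, while pairwise deletion keeps, for each analysis, the records that are complete for the variables involved. The records and column names are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 50.0, "score": 7.0},
    {"age": 31, "income": None, "score": 6.0},
    {"age": None, "income": 62.0, "score": 8.0},
    {"age": 40, "income": 75.0, "score": None},
    {"age": 35, "income": 58.0, "score": 9.0},
]


def listwise_delete(records):
    """Keep only records with no missing value in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]


def pairwise_complete(records, var_a, var_b):
    """Keep the (var_a, var_b) pairs where both values are present."""
    return [
        (r[var_a], r[var_b])
        for r in records
        if r[var_a] is not None and r[var_b] is not None
    ]


complete_rows = listwise_delete(rows)
age_income_pairs = pairwise_complete(rows, "age", "income")
income_score_pairs = pairwise_complete(rows, "income", "score")

print(len(complete_rows), len(age_income_pairs), len(income_score_pairs))
```

Listwise deletion is simpler but shrinks the sample for every analysis. Pairwise deletion keeps more data, at the cost of each analysis running on a different subset of records.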
Imputation is a method of filling in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / Mode / Median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types:
In the first type, we calculate the mean or median of all non-missing values of that variable and then replace the missing value with the mean or median. In the above table, the variable “Manpower” has missing values, so we take the average of all non-missing values of “Manpower” (28.33) and then replace the missing value with it.
In the second type, we calculate the average of non-missing values for gender “Male” (29.75) and “Female” (25) individually and then replace the missing value based on gender. For “Male,” we will replace the missing values of manpower with 29.75 and for “Female,” with 25.
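Both flavours of mean imputation described above can be sketched as follows. The `records` list is hypothetical and does not reproduce the article's "Manpower" table, so the resulting means differ from 28.33, 29.75, and 25. The function names are chosen for this example only.

```python
from statistics import mean

# Hypothetical (gender, manpower) records; None marks a missing value.
records = [
    ("Male", 30),
    ("Male", None),
    ("Female", 20),
    ("Female", 30),
    ("Male", 40),
    ("Female", None),
]


def impute_overall_mean(rows):
    """Replace each missing value with the mean of all non-missing values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(g, overall if v is None else v) for g, v in rows]


def impute_group_mean(rows):
    """Replace each missing value with the mean of its own group (here, gender).

    Assumes every group has at least one non-missing value.
    """
    observed = {}
    for g, v in rows:
        if v is not None:
            observed.setdefault(g, []).append(v)
    group_means = {g: mean(values) for g, values in observed.items()}
    return [(g, group_means[g] if v is None else v) for g, v in rows]


overall_imputed = impute_overall_mean(records)
group_imputed = impute_group_mean(records)
print(overall_imputed)
print(group_imputed)
```

Swapping `mean` for `statistics.median` gives median imputation, and `statistics.mode` would be the analogous choice for a qualitative attribute.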
The prediction model is a sophisticated method for handling missing data. Here, we create a predictive model to estimate values that will substitute the missing data. In this case, we divide our data set into two sets: One set with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model. In contrast, the second data set with missing values is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on other attributes of the training data set and populate missing values of the test data set. We can use regression, ANOVA, Logistic regression, and various modeling techniques to perform this. There are two drawbacks to this approach:
In this imputation method (commonly known as k-nearest-neighbor, or KNN, imputation), the missing values of an attribute are imputed using a given number (k) of records that are most similar to the record whose value is missing. The similarity of two records is determined using a distance function. The method has certain advantages and disadvantages.
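A minimal sketch of this idea, read here as k-nearest-neighbor imputation, is shown below. It makes several simplifying assumptions: all other attributes are numeric and complete, similarity is plain Euclidean distance on unscaled values, and the fill value is the average of the neighbors' values. The data and the function names are hypothetical.

```python
import math


def euclidean_distance(row_a, row_b, skip_index):
    """Distance between two records, ignoring the attribute being imputed."""
    return math.sqrt(
        sum(
            (a - b) ** 2
            for i, (a, b) in enumerate(zip(row_a, row_b))
            if i != skip_index
        )
    )


def knn_impute(rows, target_index, k=2):
    """Fill missing values (None) in one column from the k most similar records."""
    complete = [r for r in rows if r[target_index] is not None]
    imputed = []
    for row in rows:
        new_row = list(row)
        if row[target_index] is None:
            neighbours = sorted(
                complete,
                key=lambda other: euclidean_distance(row, other, target_index),
            )[:k]
            new_row[target_index] = sum(n[target_index] for n in neighbours) / len(neighbours)
        imputed.append(new_row)
    return imputed


# Hypothetical records: two numeric attributes, then the attribute to impute.
rows = [
    [1.0, 1.0, 10.0],
    [1.2, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [5.1, 4.8, 54.0],
    [1.1, 1.0, None],
]
imputed_rows = knn_impute(rows, target_index=2, k=2)
print(imputed_rows[4])
```

In practice the attributes would usually be scaled first so that no single attribute dominates the distance, and the choice of k matters: it trades off noise against over-smoothing.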
After dealing with missing values, the next task is dealing with outliers. We often neglect outliers while building models, which is a practice to be discouraged. Outliers tend to skew the data and reduce accuracy. Let’s learn more about outlier treatment.
Let us now look at techniques of outlier detection and treatment for data exploration.
“Outlier” is a term commonly used by data analysts and data scientists. Outliers need close attention, or else they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away and diverges from the overall pattern in a sample.
For example, we do customer profiling and find out that the average annual income of customers is $0.8 million. However, two customers have yearly incomes of $4 million and $4.2 million. These two customers’ annual incomes are much higher than the rest of the population. These two observations will be seen as outliers.
Also Read: Detecting and Treating Outliers | Treating the odd one out!
Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier. Univariate outliers can be found when we look at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space. To find them, you have to look at distributions in multiple dimensions.
Let us understand this with an example. Let us say we are studying the relationship between height and weight. Below, we have univariate and bivariate distributions of height and weight. Take a look at the box plot. We do not have any outliers (values beyond 1.5*IQR from the quartiles, the most common rule). Now, look at the scatter plot. Here, we have two values below and one above the average in a specific segment of weight and height.
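The 1.5*IQR rule mentioned above can be sketched for a single variable as follows. The `weights` values are hypothetical, and the quartile method ("inclusive") is one of several conventions, so boundary cases may differ slightly between tools.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Flag values beyond `factor` * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Hypothetical weights (kg) with one suspicious value.
weights = [52, 55, 57, 58, 60, 61, 63, 65, 120]
outliers, (lower_bound, upper_bound) = iqr_outliers(weights)
print(outliers, lower_bound, upper_bound)
```

This check is univariate only. A point can sit comfortably inside the bounds for height and for weight separately and still be unusual as a combination, which is why the scatter plot is needed as well.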
Whenever we come across outliers, the ideal way to tackle them is to find out the reason for having these outliers. The method to deal with them would then depend on the reason for their occurrence. Causes of outliers can be classified into two broad categories:
Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavorable impacts of outliers in the data set:
To understand the impact more deeply, let’s take an example and check what happens to a data set with and without outliers.
Example:
As you can see, a data set with outliers has significantly different mean and standard deviation. In the first scenario, we will say that the average is 5.45. But with the outlier, the average soars to 30, which would completely change the estimate.
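The effect is easy to reproduce. The values below are an assumption: they were chosen so that the means match the figures quoted above (about 5.45 without the outlier and 30 with it), since the original table is not shown here.

```python
import statistics

# Assumed illustrative data; the last value added below acts as the outlier.
without_outlier = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7]
with_outlier = without_outlier + [300]


def summarize(values):
    """Mean, median, and sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),
    }


summary_without = summarize(without_outlier)
summary_with = summarize(with_outlier)
print(summary_without)
print(summary_with)
```

Note how the median barely moves while the mean and standard deviation change drastically, which is one reason the median is often preferred as a summary when outliers are suspected.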
The most commonly used method to detect outliers in data exploration is visualization. We use various visualization methods, like Box-plot, Histogram, and Scatter Plot (above, we have used box and scatter plots for visualization). Some analysts also use various thumb rules to detect outliers. Some of them are:
Most ways to deal with outliers in data exploration are similar to the methods for missing values, like deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the standard techniques used to deal with outliers:
We have learned about the steps of data exploration, missing value treatment, and outlier detection and treatment techniques. These three stages will improve your raw data in terms of information availability and accuracy. Let’s proceed to the final stage of data exploration: Feature Engineering.
Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any new data here, but you are making the data you already have more helpful.
For example, you are trying to predict footfall in a shopping mall based on dates. If you try to use the dates directly, you may be unable to extract meaningful insights from the data. This is because footfall is less affected by the day of the month than by the day of the week. This information about the day of the week is implicit in your data. You need to bring it out to improve your model.
This exercise of bringing out information from data is known as feature engineering.
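For the footfall example, bringing out the day of the week might look like the sketch below. The dates are hypothetical, and treating Saturday and Sunday as the weekend is an assumption that depends on the region.

```python
from datetime import date

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def add_weekday_features(visit_dates):
    """Derive day-of-month and day-of-week features from raw dates."""
    features = []
    for d in visit_dates:
        features.append(
            {
                "date": d,
                "day_of_month": d.day,
                "weekday": WEEKDAY_NAMES[d.weekday()],  # Monday is 0
                "is_weekend": d.weekday() >= 5,         # assumes a Sat/Sun weekend
            }
        )
    return features


# Hypothetical dates for which footfall was recorded.
visit_dates = [date(2024, 1, 5), date(2024, 1, 6), date(2024, 1, 7), date(2024, 1, 8)]
date_features = add_weekday_features(visit_dates)
for row in date_features:
    print(row)
```

No new data has been collected here; the weekday was already implied by the date, and the new columns simply make it visible to a model.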
You perform feature engineering once you have completed the first 5 steps in data exploration – Variable Identification, Univariate, Bivariate Analysis, Missing Values Imputation, and Outliers Treatment. Feature engineering itself can be divided into 2 steps:
These two techniques are vital in data exploration and have a marked impact on the power of prediction. Let’s look at each of these steps in detail.
In data modeling, transformation refers to replacing a variable with a function of that variable. For instance, replacing a variable x with its square root, cube root, or logarithm is a transformation. In other words, transformation is a process that changes the distribution of a variable or its relationship with others.
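As a small illustration of how a transformation changes a distribution, the sketch below applies a logarithm to a right-skewed variable and compares the skewness before and after. The `incomes` values are hypothetical, and the skewness measure used is the simple population moment coefficient.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of cubed standardized deviations."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical right-skewed values (a few very large observations).
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]

# The logarithm is only defined for positive values.
log_incomes = [math.log(v) for v in incomes]

skew_before = skewness(incomes)
skew_after = skewness(log_incomes)
print(round(skew_before, 2), round(skew_after, 2))
```

The log pulls in the long right tail, so the transformed variable is less skewed. Square and cube roots have a similar but milder effect.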
Let’s look at the situations when variable transformation is useful.
Below are the situations where variable transformation is a requisite:
Various methods are used to transform variables. As discussed, these include square root, cube root, logarithm, binning, reciprocal, and many others. Let’s examine these methods in detail and highlight their pros and cons.
Feature / Variable creation generates new variables/features based on an existing variable(s). For example, a date(dd-mm-yy) is an input variable in a data set. We can generate new variables like day, month, year, week, and weekday that may have a better relationship with the target variable. This step is used to highlight the hidden relationship in a variable:
There are various techniques to create new features. Let’s look at some of the commonly used methods:
Comprehensive data exploration is a critical initial step for any data science, machine learning, or analytics project involving large datasets. Data analysts and scientists deeply understand the raw data through exploratory data analysis (EDA) techniques like univariate analysis, bivariate analysis, data visualization with graphs and plots, and outlier detection. Popular open-source tools like Python and commercial options like Tableau enable robust EDA through histograms, scatter plots, box plots, and other visualizations.
Effective data exploration allows early identification of data quality issues like missing values and outliers, guides future analysis like regression modeling and predictive modeling, and facilitates data-driven decision-making for business intelligence. The data exploration phase lays the groundwork for accurate insights, optimal data mining, and reliable statistical analysis outputs by transforming variables, creating new features, and preparing high-quality datasets. Leveraging best practices in EDA is essential for data scientists to unlock maximum value from their data assets across formats and domains.
A. Data analysis interprets data to draw conclusions, often using statistical methods and algorithms. Data exploration is the preliminary phase of examining data to understand its structure, identify patterns, and spot anomalies through visualizations and summary statistics.
A. Data exploration tools are software or platforms that assist in exploring and analyzing data. These tools enable users to interact with and visualize data, identify patterns, and discover insights. Some popular data exploration tools include Tableau, Power BI, QlikView, and Google Analytics.
A. During data exploration, visualize data, check for missing values, assess data distributions, and identify correlations and patterns to understand the dataset’s characteristics and prepare for detailed analysis.
Sunil Ray is Chief Content Officer at Analytics Vidhya, India's largest Analytics community. I am deeply passionate about understanding and explaining concepts from first principles. In my current role, I am responsible for creating top notch content for Analytics Vidhya including its courses, conferences, blogs and Competitions.
I thrive in fast paced environment and love building and scaling products which unleash huge value for customers using data and technology. Over the last 6 years, I have built the content team and created multiple data products at Analytics Vidhya.
Prior to Analytics Vidhya, I have 7+ years of experience working with several insurance companies like Max Life, Max Bupa, Birla Sun Life & Aviva Life Insurance in different data roles.
Industry exposure: Insurance, and EdTech
Major capabilities: Content Development, Product Management, Analytics, Growth Strategy.