Data Analysis with Python

Last Updated : 18 Apr, 2026

Data Analysis involves collecting, transforming and organizing data to generate insights, support decision making and solve business problems.

Helps in making informed, data driven decisions
Identifies patterns and trends for better predictions
Supports solving real world business problems
Converts raw data into meaningful insights

Analyzing Numerical Data with NumPy

NumPy is a Python library used for fast and efficient numerical computations. It provides multidimensional arrays and built-in functions that simplify data analysis, mathematical operations and large-scale data processing.

Arrays in NumPy

NumPy arrays store elements of the same data type and support multiple dimensions. The number of dimensions is called the rank (ndim), and the shape is a tuple giving the size of the array along each dimension.
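
As a minimal sketch of rank and shape, the snippet below builds a small two-dimensional array and inspects it. The values are made up for illustration.

```python
import numpy as np

# Hypothetical 2 x 3 array of integers
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.ndim)   # number of dimensions (rank): 2
print(arr.shape)  # size along each dimension: (2, 3)
print(arr.dtype)  # the single data type shared by every element
```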

Creating NumPy Arrays

Arrays can be created from lists or tuples, or with built-in functions like zeros, ones, arange and empty.
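
The creation routes named above might look like this in practice. The variable names and values are illustrative only.

```python
import numpy as np

from_list = np.array([1, 2, 3])         # from a Python list
from_tuple = np.array((4.0, 5.0, 6.0))  # from a tuple
zeros = np.zeros((2, 3))                # 2 x 3 array filled with 0.0
ones = np.ones(4)                       # four 1.0 values
steps = np.arange(0, 10, 2)             # 0, 2, 4, 6, 8
scratch = np.empty((2, 2))              # uninitialized: contents are arbitrary

print(steps)
print(zeros.shape)
```

Note that empty only allocates memory, so its contents should be overwritten before use.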

Operations on NumPy Arrays

NumPy performs element-wise operations on whole arrays at once, which is typically much faster than looping over the elements in plain Python.

Addition: Adds corresponding elements of two arrays
Subtraction: Subtracts elements of one array from another
Multiplication: Performs element-wise multiplication
Division: Divides elements of one array by another
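
A short sketch of the four operations listed above, using two made-up arrays of equal length:

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 5])

total = a + b       # elements: 11, 22, 35
difference = a - b  # elements: 9, 18, 25
product = a * b     # elements: 10, 40, 150
quotient = a / b    # elements: 10.0, 10.0, 6.0 (division returns floats)

print(total, difference, product, quotient)
```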

NumPy Array Indexing

Indexing is used to access individual elements in an array using their position. It works similarly to Python lists but extends to multidimensional data, where one index is given per dimension.
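
For example, with a small made-up two-dimensional array, positions can be read like this:

```python
import numpy as np

grid = np.array([[10, 20, 30],
                 [40, 50, 60]])

first = grid[0, 0]    # row 0, column 0 -> 10
last = grid[-1, -1]   # last row, last column -> 60
second_row = grid[1]  # a single index selects a whole row: 40, 50, 60

print(first, last, second_row)
```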

NumPy Array Slicing

Slicing allows accessing a range of elements from an array. It is useful for working with subsets of data.
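
A brief sketch of slicing in one and two dimensions. The arrays are illustrative.

```python
import numpy as np

values = np.arange(10)     # 0 through 9

middle = values[2:5]       # start:stop, stop excluded -> 2, 3, 4
every_other = values[::2]  # step of 2 -> 0, 2, 4, 6, 8

grid = values.reshape(2, 5)
block = grid[:, 1:3]       # columns 1 and 2 of every row

print(middle, every_other)
print(block)
```

Basic slices are views onto the original array rather than copies, so writing to a slice changes the source array as well.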

NumPy Array Broadcasting

Broadcasting allows operations between arrays of different shapes without explicitly resizing them, improving efficiency and reducing code complexity.
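
As an illustration, a (2, 3) array can be combined with a (3,) row or a (2, 1) column without any manual reshaping or tiling. The values are made up.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])    # shape (3,)
col = np.array([[100], [200]])  # shape (2, 1)

plus_row = matrix + row  # row is stretched across both rows of matrix
plus_col = matrix + col  # col is stretched across all three columns

print(plus_row)
print(plus_col)
```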

Analyzing Data Using Pandas

Pandas is a Python library used for handling structured (relational or labeled) data. Built on top of NumPy, it provides flexible data structures and tools for data manipulation, analysis and time series operations.

Used for working with structured and tabular data
Built on top of NumPy for high performance
Supports data cleaning, transformation and analysis

Series in Pandas

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). Each element has an associated index.

Represents a single column of data
Supports indexing and labeling
Can store different data types
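
A minimal sketch of a Series with a custom index. The labels and scores are hypothetical.

```python
import pandas as pd

scores = pd.Series([85, 92, 78], index=["alice", "bob", "carol"])

by_label = scores["bob"]       # lookup by index label -> 92
by_position = scores.iloc[0]   # lookup by position -> 85

# Mixed Python types are allowed; the Series then uses the generic object dtype
mixed = pd.Series([1, "two", 3.0])

print(scores)
print(mixed.dtype)
```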

DataFrame in Pandas

A DataFrame is a two-dimensional labeled data structure with rows and columns, similar to a table or spreadsheet.

Represents tabular data (rows and columns)
Each column can have a different data type
Most commonly used Pandas structure
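
One common way to build a DataFrame is from a dictionary of columns. The column names and values below are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],  # text column
    "age": [30, 25, 35],                # integer column
    "score": [88.5, 92.0, 79.5],        # float column
})

print(df)
print(df.shape)   # (rows, columns) -> (3, 3)
print(df.dtypes)  # each column keeps its own data type
```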

Pandas CRUD Operations

Pandas makes it easy to Create, Read, Update and Delete data stored in CSV files, which makes it practical for real-world datasets. Together these are known as CRUD operations.

Create: Create and save a DataFrame as a CSV file
Read: Load data from a CSV file
Update: Modify values or add new columns
Delete: Remove rows or columns
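
The four steps above can be sketched as follows. The file name people.csv and the column names are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.csv")  # hypothetical file name

    # Create: build a DataFrame and save it as CSV
    source = pd.DataFrame({"name": ["Alice", "Bob", "Carol"],
                           "age": [30, 25, 35]})
    source.to_csv(path, index=False)

    # Read: load the CSV back into a DataFrame
    people = pd.read_csv(path)

# Update: change a value and add a new column
people.loc[people["name"] == "Bob", "age"] = 26
people["senior"] = people["age"] >= 30

# Delete: remove a column, then remove a row
people = people.drop(columns=["senior"])
people = people[people["name"] != "Carol"]

print(people)
```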

Exploratory Data Analysis (EDA)

1. Data Inspection

Pandas provides quick methods to understand the structure, summary and content of a dataset. These functions help in exploring data before analysis.

info(): Displays dataset structure, column names, data types and non null values
describe(): Shows statistical summary like mean, min, max and standard deviation
value_counts(): Counts frequency of unique values in a column
head(): Displays first few rows of the dataset
tail(): Displays last few rows of the dataset
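
Applied to a small made-up dataset, the inspection methods listed above could be called like this:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"],
    "sales": [120, 95, 130, 80, 110, 125],
})

df.info()                     # prints structure, dtypes and non-null counts
summary = df.describe()       # statistics for the numeric columns
counts = df["city"].value_counts()  # frequency of each city

print(summary)
print(counts)
print(df.head(2))             # first two rows
print(df.tail(2))             # last two rows
```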

2. Data Manipulation in Pandas

Pandas provides multiple operations to efficiently select, organize and transform data for analysis.

Indexing and Selection

Indexing and Selection are used to access specific rows, columns or subsets of data.
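
A short sketch of the usual selection tools: square brackets for columns, loc for labels and iloc for positions. The data is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [30, 25, 35], "city": ["Pune", "Delhi", "Mumbai"]},
    index=["alice", "bob", "carol"],
)

ages = df["age"]                # one column, returned as a Series
subset = df[["age", "city"]]    # list of columns, returned as a DataFrame
bob = df.loc["bob"]             # one row selected by index label
first = df.iloc[0]              # one row selected by position
cell = df.loc["carol", "city"]  # a single value -> "Mumbai"

print(bob)
print(cell)
```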

Grouping and Aggregation

Grouping splits the data into groups based on the values in a column, and aggregation then applies a function such as mean or sum to each group.
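
For instance, with a hypothetical table of team scores:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 20, 5, 15, 10],
})

# One aggregate per group
mean_points = df.groupby("team")["points"].mean()  # A: 15.0, B: 10.0

# Several aggregates at once
summary = df.groupby("team")["points"].agg(["sum", "mean", "count"])

print(mean_points)
print(summary)
```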

Merging and Joining

Merging and joining combine multiple DataFrames based on common columns.
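
A minimal sketch with two made-up tables that share an emp_id column. The how argument controls which keys are kept.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3],
                          "name": ["Alice", "Bob", "Carol"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4],
                         "salary": [50000, 60000, 55000]})

# Inner join (the default): only ids present in both tables
inner = pd.merge(employees, salaries, on="emp_id")

# Left join: every employee is kept; a missing salary becomes NaN
left = pd.merge(employees, salaries, on="emp_id", how="left")

print(inner)
print(left)
```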

Sort

Sorting orders the rows based on the values in one or more columns.
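
One way to do this is with sort_values. The data below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"],
                   "age": [30, 25, 35]})

by_age = df.sort_values("age")                        # ascending by default
by_age_desc = df.sort_values("age", ascending=False)  # largest first

print(by_age)
print(by_age_desc)
```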

Filter

Filter selects data based on conditions.
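
A common approach is boolean indexing, sketched here on made-up data. Conditions are combined with & (and) or | (or), and each one needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],
    "age": [30, 25, 35, 40],
    "city": ["Pune", "Delhi", "Mumbai", "Pune"],
})

older = df[df["age"] > 28]                                 # one condition
older_in_pune = df[(df["age"] > 28) & (df["city"] == "Pune")]  # two conditions

print(older)
print(older_in_pune)
```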

set_index

Sets a column as the index of the DataFrame.
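
For example, using a hypothetical name column as the index makes label-based lookups possible:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"],
                   "age": [30, 25, 35]})

indexed = df.set_index("name")     # returns a new DataFrame; df is unchanged
bob_age = indexed.loc["Bob", "age"]  # rows can now be looked up by name

print(indexed)
print(bob_age)
```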

reset_index

Resets the index back to default numeric indexing.
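
A short sketch on made-up data. By default the old index is moved back into a regular column, while drop=True discards it.

```python
import pandas as pd

indexed = pd.DataFrame(
    {"age": [30, 25, 35]},
    index=pd.Index(["Alice", "Bob", "Carol"], name="name"),
)

restored = indexed.reset_index()           # "name" becomes a column again
discarded = indexed.reset_index(drop=True)  # old index is thrown away

print(restored)
print(discarded)
```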

3. Working With Missing Data

Working with missing data is a key step in EDA to ensure data quality and accurate analysis. It involves identifying missing values and applying appropriate techniques to handle them without affecting results.

Checking Missing Data

Used to detect null values present in the dataset.
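
One way to check, assuming a small made-up dataset with gaps, is isnull() combined with sum():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30, np.nan, 35],
    "city": ["Pune", "Delhi", None],
})

mask = df.isnull()              # True wherever a value is missing
per_column = df.isnull().sum()  # number of missing values in each column

print(mask)
print(per_column)
```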

Dropping Missing Values

There are several ways to handle missing data, depending on requirements. Here we simply drop the missing values.
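
A minimal sketch using dropna() on made-up data. Dropping by row is the default, and axis=1 drops columns instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [30, np.nan, 35],
    "city": ["Pune", "Delhi", None],
})

complete_rows = df.dropna()        # keep only rows with no missing values
complete_cols = df.dropna(axis=1)  # keep only columns with no missing values

print(complete_rows)
print(complete_cols)
```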

4. Checking and Handling Duplicate Values

Duplicate values can lead to incorrect analysis and biased results. Identifying and removing duplicates is an important step in data cleaning during EDA.

Checking Duplicate Values

Used to detect duplicate rows in the dataset.
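
For instance, duplicated() flags every row that repeats an earlier one. The data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"],
                   "age": [30, 25, 30]})

dup_mask = df.duplicated()  # False, False, True: the third row repeats the first
dup_count = dup_mask.sum()  # number of duplicate rows -> 1

print(dup_mask)
print(dup_count)
```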

Handling Duplicate Values

Remove duplicate rows to clean the dataset.
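
A short sketch with drop_duplicates() on made-up data. The subset and keep arguments narrow which columns are compared and which copy survives.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Alice", "Bob"],
                   "age": [30, 25, 30, 26]})

# Drop rows that are identical across every column
deduped = df.drop_duplicates()

# Compare on "name" only and keep the last occurrence of each name
latest = df.drop_duplicates(subset="name", keep="last")

print(deduped)
print(latest)
```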

5. Outlier Detection and Handling

Outliers are extreme values that differ significantly from other data points. Detecting and handling them is important to improve data quality and model performance during EDA.

IQR (Interquartile Range) Method

Outliers are values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
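
The rule can be sketched as follows on a small made-up series in which 95 is the obvious extreme value:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers)
```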

Z-Score Method

Outliers are values with a Z-score greater than 3 or less than -3, where the Z-score is the number of standard deviations a value lies from the mean.
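
A minimal sketch of the Z-score rule on made-up data. The population standard deviation (ddof=0) is an assumption here; the threshold of 3 comes from the text above.

```python
import pandas as pd

# 20 ordinary readings plus one extreme value
values = pd.Series([48, 50, 52, 49, 51] * 4 + [200])

# Distance of each value from the mean, in standard deviations
z_scores = (values - values.mean()) / values.std(ddof=0)

outliers = values[z_scores.abs() > 3]
print(outliers)
```

In very small samples no value can reach a Z-score beyond 3, so this method may report no outliers even when one value looks extreme.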

Output: No outliers in the dataset

Handling Outliers

Outliers can be handled by removing or capping depending on the use case.
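
Both options can be sketched with the IQR bounds on a small made-up series. Removal filters the rows out, while capping pulls extreme values back to the nearest bound.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove values outside the bounds
removed = values[values.between(lower, upper)]

# Option 2: cap values at the bounds
capped = values.clip(lower=lower, upper=upper)

print(removed.tolist())
print(capped.tolist())
```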

6. Data Visualization Using Matplotlib

Matplotlib is a widely used Python library for creating visualizations and graphs. It helps in understanding patterns, trends, and relationships in data through visual representation during EDA.

Used to create plots like line charts, bar graphs, histograms and scatter plots
Helps in identifying trends, distributions and outliers
Works well with NumPy and Pandas data

Pyplot

Pyplot is a Matplotlib module that provides a simple interface to create and customize plots. It helps in generating figures, adding labels, and displaying visualizations.
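
A minimal sketch of a labeled line plot with pyplot. The data and the title are made up.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.figure()                # start a new figure
plt.plot(x, y, marker="o")  # line plot with a marker at each point
plt.title("Squares")
plt.xlabel("x")
plt.ylabel("x squared")
# Call plt.show() here to display the figure in an interactive session.
```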

Bar chart

A bar chart is used to compare values across different categories using rectangular bars. The height or length of each bar represents the value of that category.

Used for comparing discrete categories
Can be plotted vertically or horizontally
Created using bar() method
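
For example, with hypothetical category counts:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
counts = [5, 9, 3]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)  # vertical bars; ax.barh draws horizontal ones
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Items per category")
# Call plt.show() here to display the figure in an interactive session.
```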

Figure: Bar chart

Histograms

A histogram is used to show the distribution of data by grouping values into bins (ranges). The X-axis represents the bins, and the Y-axis shows the frequency of values in each bin.

Used to understand data distribution
Groups data into non-overlapping intervals (bins)
Created using hist() method
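
A short sketch with made-up ages and explicit bin edges. hist() returns the count in each bin along with the edges it used.

```python
import matplotlib.pyplot as plt

ages = [22, 25, 27, 31, 33, 35, 38, 41, 45, 52]

fig, ax = plt.subplots()
# Bins: 20-30, 30-40, 40-50, 50-60
counts, edges, patches = ax.hist(ages, bins=[20, 30, 40, 50, 60])
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")

print(counts)  # values per bin: 3, 4, 2, 1
# Call plt.show() here to display the figure in an interactive session.
```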

Figure: Histogram using matplotlib library

Scatter Plot

A scatter plot shows the relationship between two variables by drawing one dot per observation. The scatter() method in the matplotlib library is used to draw a scatter plot.
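
A minimal sketch, assuming a made-up relationship between study hours and test scores:

```python
import matplotlib.pyplot as plt

hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 78]

fig, ax = plt.subplots()
points = ax.scatter(hours, scores)  # one dot per (hours, score) pair
ax.set_xlabel("Hours studied")
ax.set_ylabel("Score")
# Call plt.show() here to display the figure in an interactive session.
```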

Figure: Scatter plot using matplotlib library

Box Plot

A boxplot (box-and-whisker plot) is used to visualize data distribution and identify outliers using quartiles. In a horizontal boxplot, the parts read from left to right as follows:

The minimum is shown at the end of the left whisker
The first quartile, Q1, is the left edge of the box
The median is shown as a line inside the box
The third quartile, Q3, is the right edge of the box
The maximum is shown at the end of the right whisker

With Matplotlib's default settings, the whiskers stop at the most extreme values within 1.5 × IQR of the box, and any points beyond them are drawn individually as outliers.
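
A short sketch with made-up values. Matplotlib draws boxes vertically by default, so the left-to-right description above corresponds to bottom-to-top here.

```python
import matplotlib.pyplot as plt

values = [10, 12, 11, 13, 12, 11, 95]

fig, ax = plt.subplots()
parts = ax.boxplot(values)  # returns a dict of the drawn line artists
ax.set_ylabel("Value")

# Points drawn beyond the whiskers (the outliers)
fliers = parts["fliers"][0].get_ydata()
print(fliers)
# Call plt.show() here to display the figure in an interactive session.
```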

Figure: Boxplot using matplotlib library

Correlation Heatmaps

A correlation heatmap is a visual tool that shows relationships between variables using colors. It is based on a correlation matrix, where each cell represents how strongly two variables are related.
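
One possible sketch computes the correlation matrix with Pandas and draws it with Matplotlib's imshow. The dataset, column names and color map are all illustrative choices, and other plotting libraries can produce the same chart.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 78],
    "errors": [9, 7, 6, 4, 2],
})

corr = df.corr()  # pairwise Pearson correlation, values between -1 and 1

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

# Label both axes with the column names
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)

fig.colorbar(image, ax=ax)  # legend mapping colors to correlation values
print(corr)
# Call plt.show() here to display the figure in an interactive session.
```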