VOOZH about

URL: https://www.analyticsvidhya.com/blog/2022/07/pandas-functions-you-should-know-for-data-analysis/

โ‡ฑ Pandas Functions you Should Know for Dataย Analysis -


India's Most Futuristic AI Conference Is Back โ€“ Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Pandas Functions you Should Know for Data Analysis

Suraj Last Updated : 24 Apr, 2023
9 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Any data science task starts with exploratory data analysis to learn more about the data, what is in the data and what is not. Having knowledge of different pandas functions certainly helps to complete the analysis in time.

Therefore, I have listed the top 21 pandas functions covering 80% of your data exploration tasks. This includes pandas functions and methods to โ€“

  1. Get an overview of the dataset
  2. Get a statistical summary of the dataset
  3. Get a subset of the entire dataset
  4. Identify and deal with missing values
  5. Modify the dataset

Note: Iโ€™m using a self created Dummy_Sales_Data which you can get on my Github repo for Free & use it for your other practice projects!

Source: www.freepik.com

Pandas in Python is a widely used library for data manipulation and analysis.

Letโ€™s first import the data into pandas DataFrame df

import pandas as pd
df = pd.read_csv("Dummy_Sales_Data_v1.csv")
It is a simple 9999 x 12 Dataset, which I created as a part of my article Faker: An Amazing and Insanely Useful Python Library. ๐Ÿ“š

You can download a complete notebook with all these examples at the end of this read!

Now, letโ€™s get started!

Pandas Functions to get an Overview of the Dataset

df.head()

This function helps you to get the first few rows of the dataset. By default, it returns the first 5 rows. However, you can change this number by simply mentioning the desired number of rows in df.head()

For example, suppose you want to retrieve the first 7 rows of the dataset. You can get it below.

๐Ÿ‘ dataset overview |pandas
df.head(7)

Display first 7 rows | Image by Author

df.tail()

This function helps you to get the last few rows of the dataset. By default, it returns the last 5 rows, and similar to .head(), you can simply mention the desired number of rows in df.tail()

For example, suppose you want to retrieve the last 3 rows of the dataset. You can get it below.

df.tail(3)
๐Ÿ‘ dataset rows | pandas

Image by Author

df.sample()

This function is used to get a randomly selected row, column, or both from a dataset. df.sample() takes 7 optional parameters, which means this function can be run without using any argument as below.

df.sample()
๐Ÿ‘ select a row randomly
Randomly select a row | Image by Author

As you can see, this returned only a single row output.

You can also specify the desired number of rows, and those many rows will be randomly selected.

df.info()

This function returns a quick summary of the DataFrame. This includes information about column names and their respective data types, missing values, and memory consumption by DataFrame, as shown below.

df.info()
๐Ÿ‘ pandas

Display information about DataFrame | Image by Author

So ideally, this function covers all the information which can be retrieved by individual methods such as df.dtypes, df.columns, df.shape.

Pandas Function to get the Statistical Summary of the Dataset

This function returns descriptive statistics about the data. This includes minimum, maximum, mean (central tendency), standard deviation (dispersion) of the values in numerical columns, and the count of all non-null values in the data, as shown below.

df.describe()
๐Ÿ‘ descriptive statistics | pandas
df.describe() | Image by Author

By default, descriptive statistics are generated for all the numerical columns.

Pandas Functions to Select a Subset of the Dataset

df.query()

This function is used to query the DataFrame based on an expression. An expression can be as simple as a single condition and as complex as a combination of multiple conditions.

For instance, suppose you want to retrieve all the rows  df where the Quantity is more than 95. You simply need to pass this condition as an expression, as shown below.

df.query("Quantity > 95")
๐Ÿ‘ dataframe |pandas

df.query() | Image by Author

It returns a large subset of df with 521 rows.

df.loc

This function is a property of DataFrame that returns the group of rows and columns identified by their labels or names.

So you can access a row and column combination as below,

df.loc[100,'Sales_Manager']
## Output
'John'

In this case, the row label and row number are the sameโ€Šโ€”โ€Š100

df.iloc

This function is again the DataFrame property which returns the same output as df.loc, but uses row and column numbers instead of their labels.

df.iloc[[100, 200],[6,3]]
๐Ÿ‘ Image

df.iloc | Image by AuthorThe 6 and 3 are the column numbers for columns Sales_Manager and Status, respectively which you can see in df.info() function.

To learn more about df.loc and df.iloc, I highly recommend reading Select Data in Python Pandas Easily with loc & iloc

Pandas Functions to get Deeper Insights into the Dataset

df.unique()

This function returns the list of unique values in a column or series. Instead of applying on complete DataFrame, it works only on the selected single column.

For example, you can see which different sales managers are present in the dataset using this function as below,

df["Sales_Manager"].unique()
๐Ÿ‘ pandas

df.unique() | Image by Author

df.nunique()

This method returns the number of unique records in each column. Similar to the previous function, df.nunique() can be used on single column as,

df["Sales_Manager"].nunique()
## Output
10

Additionally, it can take the entire DataFrame as input and return the number of unique elements in each column. And you literally do not need to pass any arguments to this function.

df.nunique()
๐Ÿ‘ dataframe

df.nunique() | Image by Author

df.isnull()

This function helps you to check if there in which row and which column your data has missing values.

From df.info() you already know which columns have missing values. df.isnull() returns output in Boolean formโ€Šโ€”โ€Šin terms of True and Falseโ€Šโ€”โ€Šfor all the rows in all columns.

So, when the value at a position is missing, df.isnull() returns True for that position and returns for all other positions.

df.isnull()
๐Ÿ‘ dataframe rows | pandas

df.isnull() | Image by Author

As you can see, it basically returned a table of True and False of the same size as df .

For example, suppose you want to select a subset of df, where the Product Category is missing. You can do it as below,

df2 = df[df["Product_Category"].isnull()]
df2
๐Ÿ‘ product category

 df.isnull() | Image by Author

In this way, you can understand if there are any trends associated with missing data.

df.fillna()

This function is used to replace missing values or NaN in the df with user-defined values. df.fillna() takes 1 required and 5 optional parameters.

The simplest useโ€Šโ€”โ€Šcase is filling a fixed value in the place of all missing values in a dataset. For example, suppose you want to replace all the NaN in the above DataFrame df2 with a string โ€˜MissingInfoโ€™ , which can be done as,

df2.fillna("MissingInfo")
๐Ÿ‘ missing info

df2.fillna() | Image by Author

However, to retain the filled values in the DataFrame, you need to assign True in parameter inplace in df.fillna()

df.sort_values()

This function helps to arrange the entire DataFrame in ascending or descending order based on a specified column. It takes exactly 1 required and 5 optional parameters.

For instance, suppose you want to sort a df in ascending order of Quantityโ€Šโ€”โ€Šwhich means records with the lowest quantity will appear on top. All you need to do is pass the column Quantity to this function as below.

df.sort_values("Quantity")
๐Ÿ‘ Ascending sorting

Ascending sorting | Image by Author

To sort the same DataFrame in descending order, you should use one of the optional parametersโ€Šโ€”โ€Šascendingโ€Šโ€”โ€Šby default, this parameter is set to True. You can arrange the DataFrame in descending order by making this parameter False.

df.value_counts()

This function returnsโ€Šโ€”โ€Šhow many times a value appeared in a column. So, you need to pass the specific column name to this function

For example, when you want to count how many orders every sales manager handled, all you need to do isโ€Šโ€”

df.value_counts("Sales_Manager")
๐Ÿ‘ sales manager

Image by Author

df.nlargest()

This function is useful in quickly getting several largest values from a specific column of the DataFrame and all the rows containing that.

for example, getting the top 10 largest Delivery times in the dataset is possible withโ€Šโ€”

df.nlargest(10, "Delivery_Time(Days)")
๐Ÿ‘ pandas | dataframe

Image by Author

df.nsmallest()

Similar to the previous function, df.nsmallest() helps you in getting several smallest values in the dataset.

For example, you can get the top 7 smallest shipping costs in the dataset usingโ€Šโ€”

df.nsmallest(7, "Shipping_Cost(USD)")
๐Ÿ‘ dataset

Image by Author

Both df.nlargest() and df.nsmallest() take exactly 2 required and 1 optional parameters. You can learn more about it here.

Pandas Functions to Modify the Dataset

df.copy()

This is useful in copying the entire DataFrame in one go. It contains only one optional parameter, which you probably never need to use.

For example, you can copy a subset of df into df1 as below,

df1 = df.iloc[0:10, :].copy()
df1
๐Ÿ‘ dataset

Image by Author

So, by default, this copy is a deep copy which means, after copying the original DataFrame into new one, whatever changes you make to any of these DataFrames will not be reflected in the other.

df.rename()

This is the simplest method to easily change the selected column name. all you need to do is pass a dictionary where the key is the old column name and the value is the new column name.

For instance, suppose you want to change the columns Shipping_Cost(USD) and Delivery_Time(Days) to Shipping_Cost and DeliveryTime_in_Days . This is quite simple using this function.

df1.rename(columns = {"Shipping_Cost(USD)": "Shipping_Cost",
 "Delivery_Time(Days)":"DeliveryTime_in_Days"},
 inplace=True)
df1
๐Ÿ‘ pandas

Image by Author

Donโ€™t forget the second parameter inplace to retain the changes you made in the column names.

df.where()

This function checks the DataFrame for a given condition and replaces values at all the locations with NaN where the condition is False.

Suppose you have conditions to checkโ€Šโ€”โ€ŠStatus = โ€˜Not Shippedโ€™ . You need to pass this condition in df.where() asโ€Šโ€”

condition = df1["Status"] == "Not Shipped"
df1.where(condition)
๐Ÿ‘ dataframe

Image by Author

Alternatively, you can also specify which value you would like to substitute when the condition is false, as below,

condition = df1["Status"] == "Not Shipped"
df1.where(condition, 'Condition_False')

df.drop()

This function is used to remove specified rows or columns from a DataFrame. The rows to be removed are identified by their labels or index, and columns are identified by their column names.

For example, you can remove a column OrderCode using the below code block.

df1.drop("OrderCode", axis=1)
๐Ÿ‘ order code

Image by Author

However, to remove it forever, you need to make inplace=True in df.drop()

Pandas Function to Understand the Relationship Between all Columns

df.corr()

This method is used to find out pairwise correlations between all the columns of the DataFrame. So, when you do not mention any specific column names, it returns Pearson correlation coefficients for all the column pairs in the datasets.

df1.corr()
๐Ÿ‘ Pearson correlation coefficients

Image by Author

However, as below, you can always apply this method on selected columns only.

df1[["Quantity","Shipping_Cost"]].corr()

Conclusion

I hope you find this article about Pandas useful. This covers most of the commonly used functions and methods used in Python pandas which you will use in your data analysis tasks.

Key takeaways from this read โ€“

  • Easily get the overview of the data using functions .head(), .tail() and .info()
  • Quickly get a statistical summary of the dataset using the function .describe()
  • A select subset of the data using pandas DataFrame properties such as .loc, .iloc
  • Get deeper insights into the data using functions .value_counts(), .query()

Feel free to download a complete notebook with all these examples from my Github

The media shown in this article is not owned by Analytics Vidhya and is used at the Authorโ€™s discretion.

Analytics Professional | I write about Data Science, Python and SQL

Follow me on Medium to get amazing tips on learning data science, changing job domains and winning job interviews. Join Medium Today with My referral link. https://medium.com/@17.rsuraj/membership

Also, Join my email list with 3K+ other readers https://medium.com/subscribe/@17.rsuraj

Login to continue reading and enjoy expert-curated content.

Free Courses

Exploratory Data Analysis with Python & GenAI

Learn EDA with Python: Transform data into insights using PandasAI & more.

Data Science Course

Build a powerful 2026-ready data science resume using AI tools.

No Code Predictive Analytics with Orange

No-code AI course for business pros with real-world ML use cases.

Adaptive Email Agents with DSPy

Build adaptive email agents with DSPy using context and smart learning.

Introduction to AI & ML

AI & ML are transforming industries. Learn their impacts in this course.

Responses From Readers

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
๐Ÿ‘ Av Logo White

Continue your learning for FREE

Forgot your password?
๐Ÿ‘ Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

๐Ÿ‘ Popup Banner
๐Ÿ‘ AI Popup Banner