URL: https://www.analyticsvidhya.com/blog/2021/03/pandas-functions-for-data-analysis-and-manipulation/

Pandas Functions for Data Analysis and Manipulation

Nikhil Last Updated : 17 Oct, 2024
8 min read
This article was published as a part of the Data Science Blogathon.

Introduction

Pandas is an open-source Python library for data manipulation and analysis. It provides many functions and methods that speed up the data analysis process. Pandas is built on top of the NumPy package and borrows many of its core ideas. Its two primary data structures are Series, which is one-dimensional, and DataFrame, which is two-dimensional.

It is one of the most important and useful tools in the arsenal of a Data Scientist and a Data Analyst.

Install Pandas (for example, with pip install pandas).

First, let’s import the Pandas module. We alias pandas as pd, which keeps the code easier to read and avoids namespace clashes.

Next, we import the os module, which helps us build the path to the input file.

Then, we create a function that takes a file name and returns its full path in the current working directory. We pass the result of this function to the Pandas function “read_csv()”, which reads the file from that location.

1. Read

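import os

import pandas as pd


def getFilePath(filename):
    # Build the full path to a file in the current working directory.
    currentDir = os.getcwd()
    fullPath = os.path.join(currentDir, filename)
    return fullPath


# Load the dataset and preview the first five rows.
df = pd.read_csv(getFilePath('police.csv'))
print(df.head())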

Note: Pandas provides similar readers for other file types, such as read_json(), read_html(), and read_excel(), which can be used as required.

Note: The dataset used in this article is the police.csv file loaded above.

2. head()

Next, we use the pandas “head()” function to display the first 5 rows of our dataset. Note: we can choose how many rows to display by passing the count as a parameter, e.g. df.head(10) displays the first 10 rows.

Note: There is a “tail()” method as well, which shows the last 5 rows of the dataset.
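As a minimal sketch of head() and tail(), the block below builds a small made-up frame instead of reading police.csv; the stop_id column and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the police dataset; the values are made up.
df = pd.DataFrame({
    "stop_id": list(range(1, 13)),
    "driver_gender": ["M", "F"] * 6,
})

first_five = df.head()    # default: first 5 rows
first_ten = df.head(10)   # explicit row count
last_five = df.tail()     # default: last 5 rows

print(first_five)
print(last_five)
```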

3. shape

To see the dimensions of our dataset, we can use the “shape” attribute, which returns them in (number of rows, number of columns) format.

4. info()

To learn more about our dataset, we can use the “info()” function of pandas. It displays information such as the column names, the number of non-null values in each column (feature), the type of each column, and the memory usage.
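The following sketch shows shape and info() side by side on a small made-up frame. The stop_date and driver_age columns are assumed names, not taken from the original dataset description.

```python
import pandas as pd

# Hypothetical sample; the column names are assumptions for illustration.
df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18", "2005-01-23", "2005-02-20"],
    "driver_gender": ["M", "M", None, "F"],
    "driver_age": [20.0, 40.0, 33.0, None],
})

print(df.shape)  # (rows, columns) tuple
df.info()        # column names, non-null counts, dtypes, memory usage
```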

5. to_datetime()

When a CSV file is read, the date and time values in it are loaded as strings, which makes datetime operations such as computing a time difference awkward. This is where the pandas “to_datetime()” method comes into play. You can specify the expected format as required.
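A small sketch of the conversion, assuming a stop_date column holding ISO-style date strings (both the column name and the format are assumptions here):

```python
import pandas as pd

# Hypothetical date strings, as they would look right after read_csv().
df = pd.DataFrame({"stop_date": ["2005-01-02", "2005-01-18", "2005-02-20"]})

# Parse the strings into datetime values; the format argument is optional.
df["stop_date"] = pd.to_datetime(df["stop_date"], format="%Y-%m-%d")

# Datetime arithmetic now works, e.g. days elapsed since the earliest stop.
df["days_since_first"] = (df["stop_date"] - df["stop_date"].min()).dt.days
print(df)
```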

6. isnull()

Using the “isnull()” and “sum()” functions together, we can find the number of null values in each feature of a DataFrame.
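For instance, on a made-up frame with an entirely empty county_name column (the values and the driver_age column are assumptions), the per-column null counts could be obtained like this:

```python
import pandas as pd

# Hypothetical sample with missing values in every column.
df = pd.DataFrame({
    "county_name": [None, None, None, None],
    "driver_gender": ["M", None, "F", "M"],
    "driver_age": [20.0, 40.0, None, 33.0],
})

# isnull() marks missing cells as True; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```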

7. drop()

As we can see, the “county_name” column is completely empty, so it carries no useful information. Hence, we drop that column using the pandas “drop()” function. Note: we pass “inplace=True” to modify the current DataFrame.
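A minimal sketch of dropping the empty column on a made-up frame; the exact call used in the original article is not preserved, so this is one plausible form.

```python
import pandas as pd

# Hypothetical sample with an empty county_name column.
df = pd.DataFrame({
    "county_name": [None, None, None],
    "driver_gender": ["M", "F", "M"],
})

# axis=1 targets columns; inplace=True modifies df instead of returning a copy.
df.drop("county_name", axis=1, inplace=True)
print(df.columns.tolist())
```

An equivalent form that avoids inplace is df = df.drop(columns="county_name").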

8. describe()

Using the “describe()” function, we can get summary statistics for the numerical columns in our DataFrame, such as the count of non-null values, the mean, the minimum and maximum values, and the spread of the values (standard deviation and quartiles) in each feature.
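As an illustration on made-up data (driver_age is an assumed column name), describe() summarizes only the numeric column by default:

```python
import pandas as pd

# Hypothetical sample with one numeric and one text column.
df = pd.DataFrame({
    "driver_age": [20, 30, 40, 50],
    "driver_gender": ["M", "F", "M", "F"],
})

# By default only numeric columns are summarized.
summary = df.describe()
print(summary)
```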

9. value_counts()

The “value_counts()” function is used to identify the distinct categories in a feature as well as the number of values in each category.
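A short sketch, assuming a hypothetical violation column with made-up categories:

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({
    "violation": ["Speeding", "Equipment", "Speeding", "Other", "Speeding", "Equipment"],
})

counts = df["violation"].value_counts()                # counts, most frequent first
shares = df["violation"].value_counts(normalize=True)  # relative frequencies
print(counts)
print(shares)
```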

10. fillna()

As we saw, the “isnull()” and “sum()” functions tell us whether our data has missing values. Here the “driver_gender” feature has 5335 NaN (missing) values. We fill them with the mode (the value that appears most frequently) of this feature using the “fillna()” function.

Note: This is not the most effective way to fill missing values, and there are better approaches to imputation. Handling missing values deserves careful thought; this example only illustrates how the fillna() function works.
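One way to fill with the mode is sketched below on made-up values. Note that mode() returns a Series (several values can tie), so the sketch takes its first entry before passing it to fillna().

```python
import pandas as pd

# Hypothetical sample with two missing genders.
df = pd.DataFrame({"driver_gender": ["M", "F", None, "M", None]})

# mode() returns a Series because ties are possible; take the first value.
most_common = df["driver_gender"].mode()[0]

# Assign the result back rather than relying on inplace modification.
df["driver_gender"] = df["driver_gender"].fillna(most_common)
print(df)
```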

11. sample()

We can use the “sample()” function to pick random rows from our data frame. We pass it the number of rows we want as a parameter.
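A minimal sketch on a made-up frame; random_state is added here only so the draw is repeatable and is not something the article mentions.

```python
import pandas as pd

# Hypothetical sample of ten rows.
df = pd.DataFrame({"stop_id": list(range(1, 11))})

# Draw 3 random rows; random_state (an assumption) makes the draw repeatable.
sampled = df.sample(3, random_state=42)
print(sampled)
```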

12. nunique()

We can use the “nunique()” function to find the number of unique values in a Series or data frame. It is generally used with categorical features to find the number of categories in a feature.
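As a sketch with made-up values (the violation column is an assumption), nunique() can be called on one column or on the whole frame; missing values are not counted by default.

```python
import pandas as pd

# Hypothetical sample; one gender value is missing.
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M", None],
    "violation": ["Speeding", "Other", "Speeding", "Equipment"],
})

gender_categories = df["driver_gender"].nunique()  # NaN is excluded by default
per_column = df.nunique()                          # one count per column
print(gender_categories)
print(per_column)
```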

13. columns

As the name suggests, the “columns” attribute returns the names of all the features/columns in our data frame.

14. nsmallest() & nlargest()

As the names suggest, the “nsmallest()” and “nlargest()” functions return the “n” rows of our dataset with the lowest or highest values, respectively, in a given column.
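A brief sketch, assuming a hypothetical driver_age column:

```python
import pandas as pd

# Hypothetical ages.
df = pd.DataFrame({"driver_age": [45, 19, 33, 62, 27]})

youngest = df.nsmallest(2, "driver_age")  # 2 rows with the lowest ages
oldest = df.nlargest(2, "driver_age")     # 2 rows with the highest ages
print(youngest)
print(oldest)
```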

15. groupby()

The “groupby()” function is very useful in data analysis because it lets us split the data into groups and uncover relationships among different variables. We can then apply aggregations to the groups with the “agg()” function, passing it operations such as mean, size, sum, and std.
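A minimal sketch of grouping and aggregating, using made-up ages; grouping driver_age by driver_gender is an assumed example, not necessarily the one in the original screenshot.

```python
import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M", "F", "M"],
    "driver_age": [20, 30, 40, 50, 60],
})

# Group by gender, then compute several aggregations of age per group.
stats = df.groupby("driver_gender")["driver_age"].agg(["mean", "size", "sum"])
print(stats)
```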

16. get_group()

We can use the “get_group()” function to select a specific group.

Note: We can combine various pandas methods as required to understand the data better.
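The sketch below selects one group and then chains another method onto it. The violation column and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M", "F", "F"],
    "violation": ["Speeding", "Speeding", "Equipment", "Speeding", "Equipment"],
})

grouped = df.groupby("driver_gender")
females = grouped.get_group("F")  # only the rows of the "F" group

# Chaining methods: the most frequent violation within that group.
top_violation = females["violation"].value_counts().idxmax()
print(females)
print(top_violation)
```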

17. loc and iloc

loc and iloc are indexers used to slice data from a pandas DataFrame; loc can also filter rows according to a given condition. Both are used with square brackets rather than call parentheses.

loc: select by labels

iloc: select by positions

iloc slices the data frame over the specified row and column position ranges.
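A short sketch contrasting the two on a made-up frame with string index labels (the labels and values are assumptions):

```python
import pandas as pd

# Hypothetical sample with explicit row labels.
df = pd.DataFrame(
    {"driver_gender": ["M", "F", "M", "F"], "driver_age": [20, 30, 40, 50]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["b":"c", ["driver_age"]]                     # label slices include both ends
by_condition = df.loc[df["driver_age"] > 25, "driver_gender"]  # boolean filter on rows
by_position = df.iloc[0:2, 0:1]                                # position slices exclude the end

print(by_label)
print(by_condition)
print(by_position)
```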

18. Sorting

We can sort our DataFrame by index or by values with the pandas “sort_index()” and “sort_values()” functions.
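Both kinds of sorting can be sketched on a made-up frame whose index is deliberately out of order:

```python
import pandas as pd

# Hypothetical sample with a shuffled index.
df = pd.DataFrame({"driver_age": [45, 19, 33]}, index=[2, 0, 1])

by_value = df.sort_values("driver_age", ascending=False)  # highest age first
by_index = df.sort_index()                                # index in ascending order
print(by_value)
print(by_index)
```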

19. Query

We can use the pandas query() function to filter our data frame according to our conditions or requirements.
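A minimal sketch of a query expression; the condition itself is an assumed example on made-up data.

```python
import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M", "F"],
    "driver_age": [20, 30, 40, 50],
})

# The filter is written as a string expression over column names.
result = df.query("driver_age > 30 and driver_gender == 'F'")
print(result)
```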

20. set_index()

We can use the pandas “set_index()” function to set any of our columns as the index.
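For example, assuming a hypothetical stop_date column should become the index:

```python
import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18"],
    "driver_age": [20, 40],
})

# set_index() returns a new frame; the chosen column moves into the index.
indexed = df.set_index("stop_date")
print(indexed)
```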

21. duplicated()

We can use the “duplicated()” function to find all the duplicate rows in our dataset. We can then remove them using the drop_duplicates() function, since too many duplicate rows can distort a model trained on the data at a later stage.
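A brief sketch on made-up rows, where the third row repeats the first:

```python
import pandas as pd

# Hypothetical sample; row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M"],
    "driver_age": [20, 30, 20],
})

mask = df.duplicated()          # True for rows that repeat an earlier row
deduped = df.drop_duplicates()  # keeps the first occurrence of each row
print(mask)
print(deduped)
```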

22. get_dummies()

The pandas “get_dummies()” method is used to convert the categorical features of the data into dummy variables or indicator variables.

We usually do this conversion because some machine learning models, such as Random Forest implementations that expect numeric input, don’t work well with categorical values. However, we shouldn’t use it when a feature has too many categories, as it creates that many new features in our data frame, which can hurt the performance of our model.
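A minimal sketch of encoding one categorical column on made-up data; passing columns= explicitly is a choice made here for clarity.

```python
import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M"],
    "driver_age": [20, 30, 40],
})

# Each category of driver_gender becomes its own indicator column.
dummies = pd.get_dummies(df, columns=["driver_gender"])
print(dummies)
```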

23. select_dtypes()

We can separate the numerical and categorical features of our data frame into new data frames by using the “select_dtypes()” function: include “np.number” to select numerical columns and “object” to select categorical (string) columns.
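The split can be sketched as follows. The text column is created with an explicit object dtype, which is an assumption made so the example behaves the same regardless of how a given pandas version stores strings by default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the text column is forced to object dtype.
df = pd.DataFrame({
    "driver_gender": pd.Series(["M", "F", "M"], dtype="object"),
    "driver_age": [20, 30, 40],
    "stop_minutes": [12.5, 7.0, 30.0],
})

numeric = df.select_dtypes(include=np.number)     # int and float columns
categorical = df.select_dtypes(include="object")  # object (string) columns
print(numeric.columns.tolist())
print(categorical.columns.tolist())
```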

24. concat()

Using the concat() method, we can concatenate pandas objects into a single DataFrame along a particular axis, with optional set logic (union or intersection) applied to the labels on the other axis.

By default, axis=0, i.e. the objects are stacked row-wise; if we set axis=1, column-wise concatenation is performed.
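A small sketch of both axes and of the intersection option, using tiny made-up frames whose column names x, y, and z are placeholders:

```python
import pandas as pd

# Hypothetical frames.
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
c = pd.DataFrame({"x": [5], "z": [9]})

rows = pd.concat([a, b], ignore_index=True)                  # axis=0: stack rows
cols = pd.concat([a, b.rename(columns={"x": "y"})], axis=1)  # axis=1: side by side
inner = pd.concat([a, c], join="inner", ignore_index=True)   # keep shared columns only

print(rows)
print(cols)
print(inner)
```

The default join="outer" would instead keep the union of columns and fill the gaps with NaN.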

25. apply()

Suppose we create our own custom function and want to use it on our data frame. This is where the pandas “apply()” function comes into play. It allows us to apply a custom function to every element of a particular Series.

Here we have created a custom function currentAge() which returns the current age of a person by subtracting their year of birth from the current year (2021). We can then pass this function to “apply()”.
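The original function body is not preserved, so the following is only a plausible reconstruction of currentAge(); the driver_birth_year column name is an assumption.

```python
import pandas as pd

# Hypothetical sample; driver_birth_year is an assumed column name.
df = pd.DataFrame({"driver_birth_year": [1985, 2000, 1970]})


def currentAge(birth_year):
    # Age relative to the fixed reference year used in the article (2021).
    return 2021 - birth_year


# apply() calls currentAge once for every element of the Series.
df["age"] = df["driver_birth_year"].apply(currentAge)
print(df)
```

For simple arithmetic like this, the vectorized form 2021 - df["driver_birth_year"] gives the same result without apply().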

26. qcut() and cut()

When we deal with continuous numeric data, it is often helpful to bin the values into multiple buckets before carrying on with further analysis. Pandas provides two methods, qcut() and cut(), which help us convert continuous data into a set of discrete buckets.

qcut() is quantile-based: it chooses the bin edges so that each bin holds roughly the same number of values. We just pass the number of bins, and pandas decides behind the scenes how wide to make each bin.

cut() bins by value: we define the bin edges explicitly, so the values are generally not distributed evenly across the bins. A bin may even end up with no items in it, so we should be careful about that.
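The difference can be sketched on a handful of made-up ages; the bin edges and labels below are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([18, 22, 25, 31, 40, 47, 58, 73])

# qcut: four quantile-based bins, each holding about the same number of values.
quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

# cut: explicit edges; counts per bin can differ and a bin can stay empty.
fixed = pd.cut(
    ages,
    bins=[0, 30, 60, 90, 120],
    labels=["<=30", "31-60", "61-90", "91-120"],
)

quartile_counts = quartiles.value_counts()
fixed_counts = fixed.value_counts()
print(quartile_counts)
print(fixed_counts)
```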

27. to_csv()

Now we can save our DataFrame to a CSV file using the pandas “to_csv()” function. Since we don’t want to store the row index in the file, we set index=False.
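A minimal round-trip sketch; the output file name police_clean.csv and the temporary directory are assumptions chosen so the example does not overwrite anything.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({
    "driver_gender": ["M", "F"],
    "driver_age": [20, 30],
})

# Write to a temporary directory; the file name is a placeholder.
out_path = os.path.join(tempfile.mkdtemp(), "police_clean.csv")
df.to_csv(out_path, index=False)  # index=False leaves the row index out of the file

# Reading the file back shows that no extra index column was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```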

We have covered a range of Pandas functions that help with data exploration and data manipulation, which in turn speeds up the data analysis process and helps surface valuable insights.

Thanks for Reading, and Keep Learning.

And if you found this article helpful, then please follow me on LinkedIn.

THE END

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Data Scientist with 6 years of experience in analysing large datasets and delivering valuable insights via advanced data-driven methods. Proficient in Time Series Forecasting, Natural Language Processing and with a demonstrated history of working in the Telecom, Healthcare and Retail Supply Chain industries.

Responses From Readers

For fillna() in paragraph 10, you referred to mode function, but you didn't actually invoke it. Correct line: df['driver_gender'].fillna(df['driver_gender'].mode(), inplace=True). Otherwise, you end up with 6 categories for gender feature instead of 5. Besides, nice and concise article.
