Pandas is a powerful and versatile library that allows you to work with data in Python. It offers a range of features and functions that make data analysis fast, easy, and efficient. Whether you are a data scientist, analyst, or engineer, Pandas can help you handle large datasets, perform complex operations, and visualize your results.
This Pandas Cheat Sheet is designed to help you master the basics of Pandas and boost your data skills. It covers the most common and useful commands and methods that you need to know when working with data in Python. You will learn how to create, manipulate, and explore data frames, how to apply various functions and calculations, how to deal with missing values and duplicates, how to merge and reshape data, and much more.
If you are new to Data Science using Python and Pandas, or if you want to refresh your memory, this cheat sheet is a handy reference that you can use anytime. It will save you time and effort by providing you with clear and concise examples of how to use Pandas effectively.
This Pandas Cheat Sheet will help you enhance your understanding of the Pandas library and gain proficiency in working with DataFrames, importing/exporting data, performing functions and operations, and utilizing visualization methods to explore DataFrame information effectively.
Pandas is an open-source Python package for data analysis and manipulation. It was developed by Wes McKinney and is used in various fields, including data science, finance, and social sciences. Its key features include the DataFrame and Series objects, efficient indexing, automatic data alignment, and built-in handling of missing data.
If you have Python installed, you can use the following command to install Pandas:
pip install pandas
Once Pandas is installed, you can import it into your Python script or Jupyter Notebook using the following import statement:
import pandas as pd
Pandas provides two main data structures: Series and DataFrame.
Command | Execution |
|---|---|
import pandas as pd | Load the Pandas library under the conventional alias pd |
pd.__version__ | Check the Pandas version |
Command | Execution Tasks |
|---|---|
pd.read_csv('xyz.csv') | Read the .csv file |
df.to_csv('xyz.csv') | Save the data frame df as 'xyz.csv' in the current folder |
pd.read_excel('xyz.xls', sheet_name='Sheet1') | Read Sheet1 of the Excel file 'xyz.xls' into a data frame (pd.ExcelFile('xyz.xls') only opens the workbook) |
df.to_excel('xyz.xlsx', sheet_name='Sheet1') | Save the dataset to xyz.xlsx as Sheet1 |
pd.read_json('xyz.json') | Read the xyz.json file |
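pd.read_sql(query, con) | Run the SQL query (a string) on the database connection con and return the result as a data frame; read_sql does not take a .sql file path |
pd.read_html('xyz.html') | Read every table found in xyz.html and return them as a list of data frames |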
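As a hedged illustration of the import/export calls above, the sketch below round-trips a small frame through CSV text and through an in-memory SQLite database. The sample data, the table name fruits, and the use of io.StringIO and sqlite3 in place of real files are assumptions made only to keep the example self-contained.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data (not read from any real file).
df = pd.DataFrame({"Fruits": ["Mango", "Apple"], "Price": [80, 100]})

# CSV round trip: to_csv() without a path returns the CSV text as a string.
csv_text = df.to_csv(index=False)
df_csv = pd.read_csv(io.StringIO(csv_text))

# SQL round trip: read_sql() takes a SQL query and a database connection,
# not a file name. "fruits" is an assumed table name.
con = sqlite3.connect(":memory:")
df.to_sql("fruits", con, index=False)
df_sql = pd.read_sql("SELECT Fruits, Price FROM fruits WHERE Price > 90", con)
con.close()

print(df_csv)
print(df_sql)
```

With real files you would pass a path such as 'xyz.csv' instead of the in-memory buffer.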
Command | Execution Tasks |
|---|---|
pd.Series(data=Data) | Create a Pandas Series with Data like {10: 'DSA', 20: 'ML', 30: 'DS'} |
pd.Series(data=['Geeks', 'for', 'geeks'], index=['A', 'B', 'C']) | Create a Pandas Series with a custom index |
pd.DataFrame(data) | Create Pandas Data frame with Data like {'Fruits': ['Mango', 'Apple', 'Banana', 'Orange'], 'Quantity': [40, 20, 25, 10], 'Price': [80, 100, 50, 70] } |
df.dtypes | Return the data type of each column |
df.shape | Return the shape of the data as (rows, columns) |
df['Column_Name'].astype('int32') | Convert the column to 32-bit integers |
df['Column_Name'].astype('str') | Convert the column to strings |
df['Column_Name'].astype('float') | Convert the column to floats |
df.info() | Print a summary of the data frame: index, columns, non-null counts, dtypes, and memory usage |
df.values | Return the data as a NumPy array |
| | Fruits | Quantity | Price |
|---|---|---|---|
| 0 | Mango | 40 | 80 |
| 1 | Apple | 20 | 100 |
| 2 | Banana | 25 | 50 |
| 3 | Orange | 10 | 70 |
Sorting by values | |
df.sort_values('Price', ascending=True) | Sort the values of 'Price' of data frame df in Ascending order |
df.sort_values('Price', ascending=False) | Sort the values of 'Price' of data frame df in Descending order |
Sorting by Index | |
df.sort_index(ascending=False) | Sort the index of data frame df in Descending order |
Reindexing | |
df.reset_index(drop=True, inplace=True) | Reset the index to the default integer index |
Renaming | |
df = df.rename(columns={'Fruits': 'FRUITS', 'Quantity': 'QUANTITY', 'Price': 'PRICE'}) | Rename columns using an old-name to new-name mapping: 'Fruits' becomes 'FRUITS', 'Quantity' becomes 'QUANTITY', and 'Price' becomes 'PRICE' |
Reshaping | |
pd.melt(df) | Gather columns into rows |
pivot = df.pivot(columns='FRUITS', values=['PRICE', 'QUANTITY']) | Pivot the data frame so that each unique 'FRUITS' value becomes a column |
Dropping | |
df1 = df.drop(columns=['QUANTITY']) | Drop the 'QUANTITY' column |
df2 = df.drop([1, 3], axis=0) | Drop the rows with index labels 1 and 3 |
Observation | |
|---|---|
df.head() | Print the first 5 rows |
df.tail() | Print the last 5 rows |
df.sample(n) | Return n randomly selected rows of the data frame df |
df.nlargest(2, 'QUANTITY') | Select the 2 rows with the largest values in the numerical column 'QUANTITY' |
df.nsmallest(2, 'QUANTITY') | Select the 2 rows with the smallest values in the numerical column 'QUANTITY' |
df[df.PRICE > 50] | Select the rows having 'PRICE' values > 50 |
Selection Column data | |
df['FRUITS'] | Select a single column by its name, i.e. 'FRUITS' |
df[['FRUITS', 'PRICE']] | Select more than one column by name |
df.filter(regex='F|Q') | Select the columns whose names match the regular expression, i.e. 'FRUITS' and 'QUANTITY' |
Getting Subsets of rows or columns | |
df.loc[:, 'FRUITS':'PRICE'] | Select all columns from 'FRUITS' through 'PRICE' (label slices include both ends) |
df.loc[df['PRICE'] < 70, ['FRUITS', 'PRICE']] | Select the 'FRUITS' and 'PRICE' columns of the rows where PRICE < 70 |
df.iloc[2:5] | Select the rows at positions 2, 3 and 4 (the end position 5 is excluded) |
df.iloc[:, [0, 2]] | Select the columns at positions 0 and 2 |
df.at[1, 'PRICE'] | Select the single value in the 'PRICE' column at row label 1 (the 2nd row here) |
df.iat[1, 2] | Select a single value by position, i.e. at the 2nd row and 3rd column |
Filter | |
df.filter(items=['FRUITS', 'PRICE']) | Filter by column name |
df.filter(items=[3], axis=0) | Filter by row index |
df['PRICE'].where(df['PRICE'] > 50) | Return a new Series of the same length as the 'PRICE' column in which values that fail the condition are replaced with NaN (missing value) or another specified value |
df.query('PRICE>70') | Filter a DataFrame based on a specified condition |
Merge two data frame | |
pd.merge(df1, df2, how='left', on='Fruits') | Left Join |
pd.merge(df1, df2, how='right', on='Fruits') | Right Join |
pd.merge(df1, df2, how='inner', on='Fruits') | Inner Join |
pd.merge(df1, df2, how='outer', on='Fruits') | Outer Join |
Concatenation | |
concat_df = pd.concat([df, df1], axis=0, ignore_index=True) | Row-Wise Concatenation |
concat_df = pd.concat([df, df2], axis=1) | Column-Wise Concatenation |
Describe dataset | |
df.describe() | Descriptive statistics of a data frame |
df.describe(include=['O']) | Descriptive statistics of the object-dtype columns of the data frame |
df.FRUITS.unique() | Return the unique values in the 'FRUITS' column |
df.FRUITS.value_counts() | Return the frequency of each unique value in the 'FRUITS' column |
df['PRICE'].sum() | Return the sum of 'PRICE' |
df['PRICE'].cumsum() | Return the cumulative sum of 'PRICE' values |
df['PRICE'].min() | Return the minimum value of 'PRICE' column |
df['PRICE'].max() | Return the maximum value of 'PRICE' column |
df['PRICE'].mean() | Return the mean value of 'PRICE' column |
df['PRICE'].median() | Return the median value of 'PRICE' column |
df['PRICE'].var() | Return the variance value of 'PRICE' column |
df['PRICE'].std() | Return the standard deviation value of 'PRICE' column |
df['PRICE'].quantile([0.25, 0.75]) | Return the 25th and 75th percentile values of 'PRICE' column |
df.apply(summation) | Apply a custom function to each column, here a user-defined summation(col) |
df.cov(numeric_only=True) | Compute the Covariance for numerical columns |
df.corr(numeric_only=True) | Compute the Correlation for numerical columns |
Missing Values | |
df.isnull() | Check for null values |
df.isnull().sum() | Return the count of null values in each column |
df['DISCOUNT'] = df['DISCOUNT'].fillna(value=VALUE) | Fill the null values with the specified value VALUE, which can be the mean, median, mode, or any other chosen value |
df1 = df.dropna() | Drop the rows that contain null values |
Add a new column to the Data frame | |
df['COL_NAME'] = COL_DATA | Add a column to the existing dataset. Note: the length of COL_DATA should equal the number of rows in the dataset |
df = df.assign(Paid_Price=lambda df: | Add a column using the existing columns values |
Group By | |
grouped = df.groupby(by='COL_NAME') | Group the data frame by the unique values of the specified column, i.e. 'COL_NAME' |
grouped.agg(['count','sum', 'mean']) | Return the count, sum and mean of each column for every group of 'COL_NAME' |
Graph with Pandas | |
grouped = df.groupby(['Origin']) | Pie Chart |
df.plot.scatter(x='PRICE', y='DISCOUNT') | Scatter Plot |
df.plot.bar(x='FRUITS', y=['QUANTITY', 'PRICE', 'DISCOUNT']) | Bar Chart |
df['QUANTITY'].plot.hist(bins=3) | Histogram Plot |
df.boxplot(column='PRICE', grid=False) | Box Plot |
Output:
1.5.2
Creating Pandas Series.
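A minimal sketch of how such a Series could be built. The values and index labels are taken from the output that follows; the variable names s and courses are placeholders chosen for this example.

```python
import pandas as pd

# Values and index labels mirror the output shown in this article.
s = pd.Series(data=["Geeks", "for", "geeks"], index=["A", "B", "C"])
print(s)

# A Series can also be built from a dict; the keys become the index.
courses = pd.Series(data={10: "DSA", 20: "ML", 30: "DS"})
print(courses.loc[20])  # label-based lookup with .loc
```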
Output:
A Geeks
B for
C geeks
dtype: object
Creating Pandas Dataframe.
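One way to build the fruit data frame used throughout this cheat sheet is from a dictionary of equal-length lists, as sketched here. The labelled variant with index a to d is an assumption suggested by some of the later outputs.

```python
import pandas as pd

# Dictionary keys become column names; each list becomes one column.
data = {
    "Fruits": ["Mango", "Apple", "Banana", "Orange"],
    "Quantity": [40, 20, 25, 10],
    "Price": [80, 100, 50, 70],
}
df = pd.DataFrame(data)
print(df)

# Assumed variant: the same data with custom row labels instead of 0..3.
df_labelled = pd.DataFrame(data, index=["a", "b", "c", "d"])
print(df_labelled.loc["c", "Fruits"])
```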
Output:
Fruits Quantity Price
0 Mango 40 80
1 Apple 20 100
2 Banana 25 50
3 Orange 10 70
Check the data type of each column with the dtypes attribute.
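The sketch below inspects the column types and shape of the fruit data frame and then converts two columns with astype(). The target types int32 and float follow the cheat-sheet rows above; rebuilding df inline is only to keep the example runnable on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruits": ["Mango", "Apple", "Banana", "Orange"],
    "Quantity": [40, 20, 25, 10],
    "Price": [80, 100, 50, 70],
})

print(df.dtypes)  # dtypes is an attribute, so no parentheses
print(df.shape)   # (rows, columns)

# astype() returns a new Series; assign it back to change the column.
df["Quantity"] = df["Quantity"].astype("int32")
df["Price"] = df["Price"].astype("float")
print(df.dtypes)
```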
Output:
Fruits object
Quantity int64
Price int64
dtype: object
Check the number of rows and columns with the shape attribute.
Output:
(4, 3)
The df.info() method prints a concise summary of the dataset: index range, column names, non-null counts, dtypes, and memory usage.
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Fruits 4 non-null object
1 Quantity 4 non-null int64
2 Price 4 non-null int64
dtypes: int64(2), object(1)
memory usage: 128.0+ bytes
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Fruits 4 non-null object
1 Quantity 4 non-null int32
2 Price 4 non-null float64
dtypes: float64(1), int32(1), object(1)
memory usage: 112.0+ bytes
Output:
array([['Mango', 40, 80],
['Apple', 20, 100],
['Banana', 25, 50],
['Orange', 10, 70]], dtype=object)
Output:
Fruits Quantity Price
c Banana 25 50
d Orange 10 70
a Mango 40 80
b Apple 20 100
Output:
Fruits Quantity Price
b Apple 20 100
a Mango 40 80
d Orange 10 70
c Banana 25 50
Output:
Fruits Quantity Price
d Orange 10 70
c Banana 25 50
b Apple 20 100
a Mango 40 80
Output:
Fruits Quantity Price
0 Mango 40 80
1 Apple 20 100
2 Banana 25 50
3 Orange 10 70
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
1 Apple 20 100
2 Banana 25 50
3 Orange 10 70
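The reshaping calls can be sketched as follows on the renamed frame. Note that the values= list passed to pivot() is inferred from the pivot output shown further down, not from the article's own code, so treat it as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange"],
    "QUANTITY": [40, 20, 25, 10],
    "PRICE": [80, 100, 50, 70],
})

# melt() with no arguments stacks every column into (variable, value) pairs.
long_df = pd.melt(df)
print(long_df)

# pivot() spreads the unique 'FRUITS' values across the columns.
# Cells with no matching row are filled with NaN.
wide_df = df.pivot(columns="FRUITS", values=["PRICE", "QUANTITY"])
print(wide_df)
```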
Output:
variable value
0 FRUITS Mango
1 FRUITS Apple
2 FRUITS Banana
3 FRUITS Orange
4 QUANTITY 40
5 QUANTITY 20
6 QUANTITY 25
7 QUANTITY 10
8 PRICE 80
9 PRICE 100
10 PRICE 50
11 PRICE 70
Output:
PRICE QUANTITY
FRUITS Apple Banana Mango Orange Apple Banana Mango Orange
0 NaN NaN 80.0 NaN NaN NaN 40.0 NaN
1 100.0 NaN NaN NaN 20.0 NaN NaN NaN
2 NaN 50.0 NaN NaN NaN 25.0 NaN NaN
3 NaN NaN NaN 70.0 NaN NaN NaN 10.0
Output:
FRUITS PRICE
0 Mango 80
1 Apple 100
2 Banana 50
3 Orange 70
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
2 Banana 25 50
We can view the first 5 rows with the head() method.
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
1 Apple 20 100
2 Banana 25 50
3 Orange 10 70
We can view the last 5 rows with the tail() method.
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
1 Apple 20 100
2 Banana 25 50
3 Orange 10 70
The sample() method returns n randomly selected rows.
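A short sketch of the three inspection methods. The random_state seed is an optional addition used here so the draw is repeatable; which rows come back, and in what order, depends on that seed.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange"],
    "QUANTITY": [40, 20, 25, 10],
    "PRICE": [80, 100, 50, 70],
})

first_two = df.head(2)  # head() defaults to 5 rows when n is omitted
last_two = df.tail(2)   # tail() also defaults to 5

# Draw 3 distinct rows at random; random_state fixes the seed.
picked = df.sample(n=3, random_state=0)
print(picked)
```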
Output:
FRUITS QUANTITY PRICE
2 Banana 25 50
0 Mango 40 80
1 Apple 20 100
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
2 Banana 25 50
Output:
FRUITS QUANTITY PRICE
3 Orange 10 70
1 Apple 20 100
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
1 Apple 20 100
3 Orange 10 70
Output:
0 Mango
1 Apple
2 Banana
3 Orange
Name: FRUITS, dtype: object
Output:
FRUITS PRICE
0 Mango 80
1 Apple 100
2 Banana 50
3 Orange 70
Output:
FRUITS QUANTITY
0 Mango 40
1 Apple 20
2 Banana 25
3 Orange 10
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
1 Apple 20 100
2 Banana 25 50
3 Orange 10 70
Output:
FRUITS PRICE
2 Banana 50
Output:
FRUITS QUANTITY PRICE
2 Banana 25 50
3 Orange 10 70
Output:
FRUITS PRICE
0 Mango 80
1 Apple 100
2 Banana 50
3 Orange 70
For more please refer to this article Indexing and Selecting data
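To make the difference between label-based and position-based selection concrete, here is a small sketch on the same fruit frame. It only uses the selectors listed in the cheat sheet; the variable names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange"],
    "QUANTITY": [40, 20, 25, 10],
    "PRICE": [80, 100, 50, 70],
})

# .loc works with labels and boolean masks; label slices include both ends.
all_cols = df.loc[:, "FRUITS":"PRICE"]
cheap = df.loc[df["PRICE"] < 70, ["FRUITS", "PRICE"]]

# .iloc works with integer positions; the end of a slice is excluded,
# and slicing past the last row is allowed.
tail_rows = df.iloc[2:5]

# .at and .iat fetch one scalar by label and by position respectively.
print(df.at[1, "PRICE"])
print(df.iat[1, 2])
```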
| | FRUITS | QUANTITY | PRICE |
|---|---|---|---|
| 0 | Mango | 40 | 80 |
| 1 | Apple | 20 | 100 |
| 2 | Banana | 25 | 50 |
| 3 | Orange | 10 | 70 |
Output:
100
Output:
100
Output:
FRUITS PRICE
0 Mango 80
1 Apple 100
2 Banana 50
3 Orange 70
Output:
FRUITS QUANTITY PRICE
3 Orange 10 70
Output:
0 80.0
1 100.0
2 NaN
3 70.0
4 60.0
5 NaN
Name: PRICE, dtype: float64
The query() method returns the rows of the data frame that satisfy a condition given as a string.
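As a sketch, the query below is equivalent to a boolean mask, while where() keeps the original length and blanks out the values that fail the condition. The four-row frame is the one used earlier in this cheat sheet.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange"],
    "QUANTITY": [40, 20, 25, 10],
    "PRICE": [80, 100, 50, 70],
})

# query() takes the condition as a string.
expensive = df.query("PRICE > 70")

# The same filter written as a boolean mask.
same_rows = df[df["PRICE"] > 70]

# where() keeps every row and replaces failing values with NaN.
masked = df["PRICE"].where(df["PRICE"] > 50)

print(expensive)
print(masked)
```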
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
1 Apple 20 100
Output:
FRUITS QUANTITY PRICE
1 Apple 20 100
3 Orange 10 70
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
Output:
Fruits Price
0 Mango 60
1 Banana 40
2 Grapes 75
3 Apple 100
4 Orange 65
Output:
Fruits Price
0 Apple 120
1 Orange 60
2 Papaya 30
3 Pineapple 70
4 Mango 50
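The two frames printed above share a 'Fruits' column, so they can be joined on it. The sketch below rebuilds them from those outputs and runs three of the four joins; the names left, inner and outer are placeholders.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Fruits": ["Mango", "Banana", "Grapes", "Apple", "Orange"],
    "Price": [60, 40, 75, 100, 65],
})
df2 = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Papaya", "Pineapple", "Mango"],
    "Price": [120, 60, 30, 70, 50],
})

# Both frames have a 'Price' column, so pandas adds the _x and _y suffixes.
left = pd.merge(df1, df2, how="left", on="Fruits")    # keep every row of df1
inner = pd.merge(df1, df2, how="inner", on="Fruits")  # keep only shared fruits
outer = pd.merge(df1, df2, how="outer", on="Fruits")  # keep fruits from both

print(left)
print(inner)
print(outer)
```

A right join is the mirror image of the left join: it keeps every row of df2 instead.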
Output:
Fruits Price_x Price_y
0 Mango 60 50.0
1 Banana 40 NaN
2 Grapes 75 NaN
3 Apple 100 120.0
4 Orange 65 60.0
Output:
Fruits Price_x Price_y
0 Apple 100.0 120
1 Orange 65.0 60
2 Papaya NaN 30
3 Pineapple NaN 70
4 Mango 60.0 50
Output:
Fruits Price_x Price_y
0 Mango 60 50
1 Apple 100 120
2 Orange 65 60
Output:
Fruits Price_x Price_y
0 Mango 60.0 50.0
1 Banana 40.0 NaN
2 Grapes 75.0 NaN
3 Apple 100.0 120.0
4 Orange 65.0 60.0
5 Papaya NaN 30.0
6 Pineapple NaN 70.0
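The next two outputs come from concatenation. A possible reconstruction is sketched here; the contents of the extra frames (two new fruit rows and a five-value DISCOUNT column) are inferred from those outputs and should be read as assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange"],
    "QUANTITY": [40, 20, 25, 10],
    "PRICE": [80, 100, 50, 70],
})

# Assumed extra rows with the same columns as df.
new_rows = pd.DataFrame({
    "FRUITS": ["Grapes", "Pineapple"],
    "QUANTITY": [23, 17],
    "PRICE": [60, 30],
})
# axis=0 stacks rows; ignore_index=True renumbers them 0..5.
rows = pd.concat([df, new_rows], axis=0, ignore_index=True)

# Assumed extra column with only five values.
discount = pd.DataFrame({"DISCOUNT": [5.0, 7.0, 10.0, 8.0, 6.0]})
# axis=1 places frames side by side, aligned on the index;
# the row without a match gets NaN.
cols = pd.concat([rows, discount], axis=1)

print(rows)
print(cols)
```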
Output:
FRUITS QUANTITY PRICE
0 Mango 40 80
1 Apple 20 100
2 Banana 25 50
3 Orange 10 70
4 Grapes 23 60
5 Pineapple 17 30
Output:
FRUITS QUANTITY PRICE DISCOUNT
0 Mango 40 80 5.0
1 Apple 20 100 7.0
2 Banana 25 50 10.0
3 Orange 10 70 8.0
4 Grapes 23 60 6.0
5 Pineapple 17 30 NaN
A. For numerical datatype
Output:
QUANTITY PRICE DISCOUNT
count 6.00000 6.000000 5.000000
mean 22.50000 65.000000 7.200000
std 10.05485 24.289916 1.923538
min 10.00000 30.000000 5.000000
25% 17.75000 52.500000 6.000000
50% 21.50000 65.000000 7.000000
75% 24.50000 77.500000 8.000000
max 40.00000 100.000000 10.000000
B. For object datatype
Output:
FRUITS
count 6
unique 6
top Mango
freq 1
Output:
array(['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
dtype=object)
Output:
Mango 1
Apple 1
Banana 1
Orange 1
Grapes 1
Pineapple 1
Name: FRUITS, dtype: int64
Output:
390
Output:
0 80
1 180
2 230
3 300
4 360
5 390
Name: PRICE, dtype: int64
Output:
30
Output:
100
Output:
65.0
Output:
65.0
Output:
590.0
Output:
24.289915602982237
Output:
0.00 30.0
0.25 52.5
0.75 77.5
1.00 100.0
Name: PRICE, dtype: float64
Output:
FRUITS 6
QUANTITY 135
PRICE 390
DISCOUNT 5
dtype: int64
Output:
QUANTITY PRICE DISCOUNT
QUANTITY 101.1 53.0 -10.4
PRICE 53.0 590.0 -18.0
DISCOUNT -10.4 -18.0 3.7
Output:
QUANTITY PRICE DISCOUNT
QUANTITY 1.000000 0.217007 -0.499210
PRICE 0.217007 1.000000 -0.486486
DISCOUNT -0.499210 -0.486486 1.000000
Check for null values using the isnull() method.
Output:
FRUITS QUANTITY PRICE DISCOUNT
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False True
Column-wise null values count
Output:
FRUITS 0
QUANTITY 0
PRICE 0
DISCOUNT 1
dtype: int64
Fill the null values with mean()
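A minimal sketch of counting and filling missing values, using the six-row frame with one missing DISCOUNT shown in the outputs above. The names null_counts and mean_discount are placeholders for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    "QUANTITY": [40, 20, 25, 10, 23, 17],
    "PRICE": [80, 100, 50, 70, 60, 30],
    "DISCOUNT": [5.0, 7.0, 10.0, 8.0, 6.0, float("nan")],
})

# isnull().sum() counts the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# mean() skips NaN by default, so it averages the five known discounts.
mean_discount = df["DISCOUNT"].mean()

# fillna() returns a new Series; assign it back to update the column.
df["DISCOUNT"] = df["DISCOUNT"].fillna(value=mean_discount)
print(df)
```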
Output:
FRUITS QUANTITY PRICE DISCOUNT
0 Mango 40 80 5.0
1 Apple 20 100 7.0
2 Banana 25 50 10.0
3 Orange 10 70 8.0
4 Grapes 23 60 6.0
5 Pineapple 17 30 7.2
Output:
FRUITS QUANTITY PRICE DISCOUNT Origin
0 Mango 40 80 5.0 BH
1 Apple 20 100 7.0 J&K
2 Banana 25 50 10.0 BH
3 Orange 10 70 8.0 MP
4 Grapes 23 60 6.0 WB
5 Pineapple 17 30 NaN WB
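The next output adds a Paid_Price column with assign(). The article's own expression is cut off, so the formula below (quantity times price, reduced by the discount percentage) is a guess that happens to reproduce the printed values, not necessarily the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    "QUANTITY": [40, 20, 25, 10, 23, 17],
    "PRICE": [80, 100, 50, 70, 60, 30],
    "DISCOUNT": [5.0, 7.0, 10.0, 8.0, 6.0, float("nan")],
})

# A plain assignment adds a column; its length must match the row count.
df["Origin"] = ["BH", "J&K", "BH", "MP", "WB", "WB"]

# assign() returns a new frame; the lambda receives the frame itself.
# Assumed formula: total price minus the percentage discount.
df = df.assign(
    Paid_Price=lambda d: d["QUANTITY"] * d["PRICE"] * (1 - d["DISCOUNT"] / 100)
)
print(df)
```

Because the Pineapple row has no discount, its Paid_Price is NaN as well.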
Output:
FRUITS QUANTITY PRICE DISCOUNT Origin Paid_Price
0 Mango 40 80 5.0 BH 3040.0
1 Apple 20 100 7.0 J&K 1860.0
2 Banana 25 50 10.0 BH 1125.0
3 Orange 10 70 8.0 MP 644.0
4 Grapes 23 60 6.0 WB 1297.2
5 Pineapple 17 30 NaN WB NaN
Group the DataFrame by the 'Origin' column using the groupby() method.
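A sketch of a grouped aggregation on the frame with the Origin column. Selecting the numeric columns before agg() is an assumption on my part; asking for the mean of the text column 'FRUITS' may raise an error in recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    "QUANTITY": [40, 20, 25, 10, 23, 17],
    "PRICE": [80, 100, 50, 70, 60, 30],
    "DISCOUNT": [5.0, 7.0, 10.0, 8.0, 6.0, float("nan")],
    "Origin": ["BH", "J&K", "BH", "MP", "WB", "WB"],
})

grouped = df.groupby("Origin")

# Aggregate only the numeric columns; the result has one row per Origin
# and a two-level column index of (column, statistic).
summary = grouped[["QUANTITY", "PRICE", "DISCOUNT"]].agg(["sum", "mean"])
print(summary)
```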
Output:
QUANTITY PRICE DISCOUNT Paid_Price
sum mean sum mean sum mean sum mean
Origin
BH 65 32.5 130 65.0 15.0 7.5 4165.0 2082.5
J&K 20 20.0 100 100.0 7.0 7.0 1860.0 1860.0
MP 10 10.0 70 70.0 8.0 8.0 644.0 644.0
WB 40 20.0 90 45.0 6.0 6.0 1297.2 1297.2
We can use a box plot to detect outliers.
Output:
[Figure: Outlier detection using a box plot]
The plot.bar() method draws a bar chart.
The plot.hist() method creates a histogram.
The plot.scatter() method creates a scatter plot.
The plot.pie() method creates a pie chart.
In conclusion, the Pandas Cheat Sheet serves as an invaluable resource for data scientists and Python users. Its concise format and practical examples provide quick access to essential Pandas functions and methods. By leveraging this pandas cheat sheet, users can streamline their data manipulation tasks, gain insights from complex datasets, and make informed decisions. Overall, the Pandas Cheat Sheet is a must-have tool for enhancing productivity and efficiency in data science projects.