Pandas is a popular open-source Python library used for data manipulation and analysis. It provides powerful tools for working with structured data and integrates seamlessly with libraries such as NumPy and Matplotlib. Pandas is built around two core data structures: the Series and the DataFrame.
To install Pandas, run the following command in your terminal or command prompt:
pip install pandas
After installation, import Pandas into your Python script or notebook using the standard alias pd.
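A minimal sketch of the import; printing the version is only an optional check that the installation worked.

```python
import pandas as pd

# Optional check: show which Pandas version is installed
print(pd.__version__)
```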
A DataFrame is a two-dimensional table with labeled rows and columns, similar to a spreadsheet or SQL table. Each column can store different data types, such as numbers, text, or dates. DataFrames can be created from lists, dictionaries, or other data sources.
Example: Creating a DataFrame using a Dictionary
In this example, a dictionary is converted into a DataFrame using the pd.DataFrame() function.
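The conversion could look like the sketch below. The names, ages, and cities are hypothetical sample data, not values from the original article.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column, each list its values
data = {
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25, 30, 35],
    "City": ["Delhi", "Mumbai", "Pune"],
}

df = pd.DataFrame(data)
print(df)
```

All lists in the dictionary must have the same length, since each one fills a full column.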
A Series is a one-dimensional labeled array that can store any data type, such as integers, strings, or floats. Each value is associated with an index, which can be default (0, 1, 2, ...) or custom labels. Series can be created from lists, NumPy arrays, dictionaries, or scalar values.
Example: Creating a Series using a List
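A short sketch with made-up values, showing both the default integer index and a custom label index:

```python
import pandas as pd

values = [10, 20, 30, 40]  # illustrative values only

# Default index: 0, 1, 2, 3
s = pd.Series(values)

# Custom labels supplied through the index parameter
s_labeled = pd.Series(values, index=["a", "b", "c", "d"])

print(s)
print(s_labeled["b"])  # access a value by its label
```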
CSV (Comma-Separated Values) files are commonly used to store tabular data. Pandas provides the read_csv() function to load a CSV file into a DataFrame. By default, Pandas displays only a preview of large datasets. To display all rows, you can use to_string() or adjust the pd.options.display.max_rows setting.
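The sketch below reads CSV data with read_csv(). To stay self-contained it uses an in-memory string in place of a real file; in practice you would pass a path such as "data.csv" (a hypothetical file name).

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical contents)
csv_text = "Name,Age\nAlice,25\nBob,30\nCarol,35\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df)              # default display; long frames are shown as a preview
print(df.to_string())  # renders every row as text

# Temporarily lift the row limit instead of changing it globally
with pd.option_context("display.max_rows", None):
    print(df)
```

Using option_context keeps the display change local to the with block, which avoids surprising output later in a notebook.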
Pandas supports multiple file formats such as CSV and JSON (JavaScript Object Notation). You can load a JSON file into a DataFrame using the read_json() function.
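A minimal sketch of read_json(), again using an in-memory string as a stand-in for a file such as "data.json" (hypothetical). It assumes the JSON is a list of records, which is only one of several layouts read_json() accepts.

```python
import io
import pandas as pd

# Hypothetical JSON content: a list of records, one object per row
json_text = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'

df = pd.read_json(io.StringIO(json_text), orient="records")
print(df)
```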
After creating or loading a DataFrame, inspecting and summarizing the data is an important step in understanding the dataset. Pandas provides various functions to help you view and analyze the data.
Let's see an example demonstrating the use of .head(), .tail(), .info() and .describe() methods:
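One way these four methods might be used together, on a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank"],
    "Age": [25, 30, 35, 40, 45, 50],
    "Salary": [40000, 50000, 60000, 70000, 80000, 90000],
})

print(df.head(3))   # first 3 rows (default is 5)
print(df.tail(2))   # last 2 rows (default is 5)
df.info()           # prints column names, dtypes, and non-null counts

summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
print(summary)
```

Note that info() prints its report directly, while describe() returns a DataFrame that can be inspected further.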
Pandas provides versatile tools to inspect and understand data quickly. These functions help in summarizing datasets and exploring key attributes efficiently.
Indexing in Pandas refers to accessing and selecting data from a DataFrame or Series. There are multiple ways to do this, including selecting columns, slicing data, and filtering using conditions.
Example: Basic indexing (selecting a single column) using the [] operator.
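A brief sketch with hypothetical data. Selecting one column by name returns a Series:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25, 30, 35],
})

ages = df["Age"]   # single column selected by name
print(type(ages))
print(ages)
```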
We can also select multiple columns by passing a list of column names, such as ['Name', 'Age'], which returns a new DataFrame. Pandas provides two main methods for selecting rows: .loc[], which selects by label, and .iloc[], which selects by integer position.
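The sketch below shows column-list selection alongside the label-based .loc[] and position-based .iloc[] accessors. The data and the custom index labels are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with custom row labels
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol"],
        "Age": [25, 30, 35],
        "City": ["Delhi", "Mumbai", "Pune"],
    },
    index=["a", "b", "c"],
)

subset = df[["Name", "Age"]]     # list of names -> DataFrame

row_by_label = df.loc["b"]       # row whose index label is "b"
row_by_position = df.iloc[0]     # first row, regardless of its label

label_slice = df.loc["a":"b"]    # label slices include the end label
position_slice = df.iloc[0:2]    # position slices exclude the end position

print(subset)
print(row_by_label)
print(label_slice)
```

The difference in slice endpoints is a common source of off-by-one mistakes, so it is worth checking which accessor a slice is passed to.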
For more advanced selection techniques, you can explore topics like selecting multiple columns, slicing, and Boolean indexing.
Filtering data means selecting only those rows or columns that meet specific conditions instead of working with the entire dataset. This helps in extracting relevant data for analysis using logical conditions.
In the example below, we filter rows where the Age column is greater than 28, returning only matching records.
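A minimal sketch of this filter on hypothetical data. The comparison produces a Boolean Series, which is then used to select the matching rows:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [25, 30, 35, 28],
})

mask = df["Age"] > 28   # Boolean Series: True where the condition holds
filtered = df[mask]     # keep only rows where the mask is True

print(filtered)
```

Several conditions can be combined with & (and) or | (or), with each condition wrapped in parentheses.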
Data manipulation in Pandas covers techniques such as adding and deleting columns, truncating data, iterating over DataFrames, and sorting data. For more detailed explanations of each concept and step, you can refer to Dealing with Rows and Columns in Pandas DataFrame.
1. Adding a New Column to DataFrame: We can easily add new columns by assigning values to them by direct assignment.
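Direct assignment might look like the following sketch. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25, 30, 35],
})

# Assign a list with one value per row
df["Salary"] = [50000, 60000, 70000]

# Derive a new column from an existing one
df["Age_in_5_years"] = df["Age"] + 5

print(df)
```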
There are several other ways to add a column; see Adding New Column to Existing DataFrame in Pandas. To add columns from one DataFrame to another, refer to Adding Columns from Another DataFrame.
2. Renaming Columns in a DataFrame: We can use the rename() method for selective renaming of specific columns or directly modify the columns attribute when renaming all columns at once.
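Both approaches are sketched below on hypothetical data. The new column names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
})

# Selective renaming: only the listed columns change; returns a new DataFrame
renamed = df.rename(columns={"Name": "Full Name"})

# Renaming all columns at once: the list must match the number of columns
all_renamed = df.copy()
all_renamed.columns = ["name", "age"]

print(renamed)
print(all_renamed)
```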
Columns can also be renamed by position; see Rename Column by Index in Pandas.
3. Reindexing Data with Pandas: Reindexing in Pandas is used to conform a DataFrame or Series to a new set of row or column labels. It helps align data with new labels, expose missing values, and restructure datasets. The reindex() method keeps the data for labels that already exist and reorders it to match the new labels; it does not rename existing labels.
Note: If the new index includes labels not present in the original DataFrame, the corresponding values will be set to NaN by default.
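A small sketch of reindex() on hypothetical data. The label "d" does not exist in the original index, so its row is filled with NaN:

```python
import pandas as pd

# Hypothetical sample data with custom row labels
df = pd.DataFrame({"Age": [25, 30, 35]}, index=["a", "b", "c"])

# "a" and "b" keep their data; "c" is dropped; "d" is new and gets NaN
reindexed = df.reindex(["a", "b", "d"])

print(reindexed)
```

Because NaN is a floating-point value, an integer column is converted to float when reindexing introduces missing entries.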
To reset the index instead, refer to Convert Index to Column in Pandas DataFrame.
Working with missing data is one of the most common tasks in data manipulation. Pandas provides several functions to identify, fill, and remove missing values efficiently. This helps ensure data quality before performing analysis.
1. Identifying Missing Data With Pandas using isnull() and notnull():
- isnull(): Returns True where values are missing (NaN) and False where values are present.
- notnull(): Returns True where values are present and False where they are missing.
Example:
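A minimal sketch of both methods on hypothetical data containing missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column
df = pd.DataFrame({
    "Name": ["Alice", None, "Carol"],
    "Age": [25, np.nan, 35],
})

missing = df.isnull()    # True where a value is missing
present = df.notnull()   # exact opposite of isnull()

print(missing)
print(missing.sum())     # number of missing values per column
```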
When working with real-world datasets, missing values are common. Broadly, we have two main ways to deal with missing data: filling the missing values or dropping them.
2. Filling Missing Data: Missing values can be replaced using the fillna() method.
For example, fillna(0) replaces all missing values with 0. By default the method returns a new object; the original dataset is modified directly only when inplace=True is used.
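A short sketch of fillna() on hypothetical data, using the default behavior of returning a new DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 5.0, 6.0],
})

filled = df.fillna(0)   # new DataFrame; df itself is left unchanged

print(filled)

# df.fillna(0, inplace=True) would instead modify df and return None
```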
Missing values can also be filled in other ways, for example by propagating neighboring values with ffill() or bfill(), or by estimating them with interpolate().
3. Dropping Missing Data With Pandas: We can remove missing values using the dropna() method. Depending on the requirement, Pandas allows different ways to drop data.
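Common approaches include dropping rows that contain any missing value (the default), dropping columns that contain missing values with axis=1, and dropping only rows in which every value is missing with how='all'.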
To explore more variations of dropping data, you can refer to topics like Dropping Rows from a Pandas DataFrame with Missing Values and Dropping One or Multiple Columns.
For example:
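The sketch below drops rows and then columns containing missing values. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: columns A and B each have one missing value
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [4.0, 5.0, np.nan],
    "C": [7.0, 8.0, 9.0],
})

rows_dropped = df.dropna()          # drop rows with any missing value
cols_dropped = df.dropna(axis=1)    # drop columns with any missing value

print(rows_dropped)
print(cols_dropped)
```

Like fillna(), dropna() returns a new DataFrame by default and leaves the original untouched.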
Aggregation and grouping in Pandas are used to analyze and summarize data. Grouping divides data into categories, while aggregation performs operations like sum, mean, or count to extract insights.
The groupby() function is commonly used for grouping data, followed by aggregation functions like sum(), mean(), size(), or custom operations.
Example:
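A minimal sketch of grouping followed by aggregation. The departments and salaries are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Department": ["HR", "IT", "HR", "IT", "Sales"],
    "Salary": [40000, 60000, 45000, 65000, 50000],
})

# Total salary per department
total = df.groupby("Department")["Salary"].sum()

# Several aggregations at once with agg()
summary = df.groupby("Department")["Salary"].agg(["sum", "mean", "count"])

print(total)
print(summary)
```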
- For more detailed information and practical examples, refer to Grouping and Aggregating with Pandas.
- For more advanced operations, you can use .agg() to apply custom aggregation functions.
- Grouping Rows in pandas
This section explains how to combine multiple DataFrames or Series into a single DataFrame. These operations are commonly used to integrate datasets and organize data for analysis.
1. Merging DataFrames: Merging combines data based on a common column or index using functions like merge() or join(). There are four main types of joins: inner, left, right, and outer.
Example:
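The sketch below merges two hypothetical DataFrames on a shared ID column and shows how the how parameter of merge() changes which rows are kept:

```python
import pandas as pd

# Hypothetical sample data: IDs 2 and 3 appear in both tables
employees = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Carol"]})
salaries = pd.DataFrame({"ID": [2, 3, 4], "Salary": [60000, 70000, 80000]})

inner = pd.merge(employees, salaries, on="ID", how="inner")  # IDs in both
left = pd.merge(employees, salaries, on="ID", how="left")    # all employees
right = pd.merge(employees, salaries, on="ID", how="right")  # all salaries
outer = pd.merge(employees, salaries, on="ID", how="outer")  # IDs in either

print(inner)
print(outer)
```

Rows without a match on the other side are kept by left, right, and outer joins, with NaN in the columns that have no data.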
2. Concatenating Data: Concatenation is used to stack DataFrames either vertically (row-wise) or horizontally (column-wise) using pd.concat(). Let's create DataFrames and concatenate them:
Example:
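A short sketch of both directions of pd.concat(), using hypothetical DataFrames:

```python
import pandas as pd

# Hypothetical sample data
df1 = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
df2 = pd.DataFrame({"Name": ["Carol"], "Age": [35]})
cities = pd.DataFrame({"City": ["Delhi", "Mumbai"]})

# Row-wise: stack df2 under df1 and renumber the index
stacked_rows = pd.concat([df1, df2], ignore_index=True)

# Column-wise: place cities beside df1, aligned on the index
side_by_side = pd.concat([df1, cities], axis=1)

print(stacked_rows)
print(side_by_side)
```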
- See Concatenate Two or More Pandas DataFrames for all operations with multiple examples. For more, go through the articles below:
- Concatenate values in different columns to one column
- Merge two dataFrames on certain columns
- Merge multiple CSV Files into single dataframe
Reshaping data means changing the structure of rows and columns to better organize or analyze data. Common operations include pivoting, melting, stacking, and unstacking.
1. Pivot Tables in Pandas: Pivot tables reshape data based on column values and are useful for creating aggregated views using the pivot_table() method.
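A minimal sketch of pivot_table() on hypothetical data, with departments as rows, years as columns, and the mean salary in each cell:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT"],
    "Year": [2023, 2024, 2023, 2024],
    "Salary": [40000, 42000, 60000, 63000],
})

pivot = df.pivot_table(
    index="Department",   # values that become row labels
    columns="Year",       # values that become column labels
    values="Salary",      # column to aggregate
    aggfunc="mean",       # aggregation applied to each group
)

print(pivot)
```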
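2. Melting: Melting converts a wide table into a long one by collapsing multiple columns into a pair of key-value columns.
When you melt data, the names of the melted columns move into one column (called "variable" by default) and their values move into another (called "value" by default).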
Note: To melt data, you must specify columns that act as "identifiers" (id_vars) and others that need to be melted (value_vars).
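A short sketch of pd.melt() on hypothetical data. The var_name and value_name arguments are optional and are used here only to give the output columns readable names.

```python
import pandas as pd

# Hypothetical wide-format data: one column per subject
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Math": [90, 80],
    "Science": [85, 75],
})

long_df = pd.melt(
    df,
    id_vars=["Name"],                # identifier columns kept as-is
    value_vars=["Math", "Science"],  # columns to melt into rows
    var_name="Subject",              # name for the column of former headers
    value_name="Score",              # name for the column of values
)

print(long_df)
```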
3. Stacking and Unstacking With Pandas: Reshapes data by changing rows and columns, especially useful with MultiIndex DataFrames. These methods help reorganize the structure of data for better analysis.
Example: When columns are stacked into rows using stack(), the DataFrame structure changes. The data now has a MultiIndex in the rows:
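The sketch below stacks a small hypothetical DataFrame and then reverses the operation with unstack():

```python
import pandas as pd

# Hypothetical sample data with student names as row labels
df = pd.DataFrame(
    {"Math": [90, 80], "Science": [85, 75]},
    index=["Alice", "Bob"],
)

stacked = df.stack()          # columns move into an inner row index level
unstacked = stacked.unstack() # inner row index level moves back to columns

print(stacked)
print(stacked.index)
print(unstacked)
```

Here stack() returns a Series whose MultiIndex pairs each student with each subject.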
For a deeper comparison with more examples, see the stack(), unstack(), and melt() methods. We can also use the transpose() method to swap rows and columns when needed.
When working with large datasets or heavy computations, performance optimization becomes important for faster processing and better memory usage. Pandas provides several techniques to improve efficiency.
By default, Pandas often assigns 64-bit data types (such as int64) even when they are not needed. Using .astype() to convert columns to smaller or more appropriate types helps reduce memory usage, especially in large datasets.
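A minimal sketch of downcasting with .astype(), assuming a hypothetical column whose values are known to fit in a 16-bit integer. Converting to a type that is too small for the data would corrupt values, so the range should be checked first.

```python
import pandas as pd

# Hypothetical column of small integers stored as 64-bit values
df = pd.DataFrame({"count": pd.Series(range(1000), dtype="int64")})
before = df.memory_usage(deep=True).sum()

# Values 0..999 fit comfortably in int16 (range -32768 to 32767)
df_small = df.astype({"count": "int16"})
after = df_small.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```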