Pandas is an open-source Python library used for data manipulation and analysis. In interviews, questions on Pandas are often asked to assess your ability to work with structured data effectively. Below are some of the most frequently asked interview questions and answers covering key Pandas topics.
Pandas is used for efficient data analysis. The key features of Pandas are as follows:
The two data structures that are supported by Pandas are Series and DataFrames.
A Series in Pandas is a one-dimensional labelled array. It is like a single column in an Excel sheet and can hold any type of data, such as integers, strings or Python objects. Its axis labels are known as the index. A Series contains homogeneous data; its values can be changed, but its size is immutable. A Series can be created from a Python tuple, list or dictionary. The general syntax for creating a series is pandas.Series(data, index=index), where data holds the values and index optionally supplies the labels.
In Pandas, a series can be created in many ways. They are as follows:
1. Creating a Series from a List
We can create a series using a Python list and pass it to the Series() constructor.
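A minimal sketch of this approach; the sample letters and variable names are assumptions chosen to be consistent with the output that follows.

```python
import pandas as pd

# Hypothetical sample data: a list of single-character strings
letters = ["g", "e", "e", "k", "s"]

# Passing the list to Series() assigns a default integer index (0, 1, 2, ...)
ser = pd.Series(letters)
print(ser)
```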
Output:
0 g
1 e
2 e
3 k
4 s
dtype: object
2. Creating a Series from Dictionary
A Series can also be created from a Python dictionary. The keys of the dictionary are used to construct the index of the series.
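A short sketch, assuming a small dictionary whose keys and values are illustrative only:

```python
import pandas as pd

# Hypothetical dictionary: keys become index labels, values become the data
scores = {"Geeks": 10, "for": 20, "geeks": 30}

ser = pd.Series(scores)
print(ser)
```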
Output:
Geeks 10
for 20
geeks 30
dtype: int64
3. Creating a Series from Scalar Value
To create a series from a Scalar value, we must provide an index. The Series constructor will take two arguments, one will be the scalar value and the other will be a list of indexes. The value will repeat until all the index values are filled.
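As a sketch of the idea, the scalar 10 and the six index labels below are assumed sample values:

```python
import pandas as pd

# The scalar is broadcast to every label in the supplied index
ser = pd.Series(10, index=[0, 1, 2, 3, 4, 5])
print(ser)
```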
Output:
0 10
1 10
2 10
3 10
4 10
5 10
dtype: int64
4. Creating a Series using NumPy Functions
NumPy functions such as numpy.linspace() and numpy.random.randn() can also be used to create a Pandas series.
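A possible sketch; the arguments to linspace() are assumptions, and the random values will differ on every run because no seed is set.

```python
import numpy as np
import pandas as pd

# Three evenly spaced values between 3 and 33 (inclusive)
ser_lin = pd.Series(np.linspace(3, 33, 3))
print(ser_lin)

# Three samples from the standard normal distribution (different each run)
ser_rand = pd.Series(np.random.randn(3))
print(ser_rand)
```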
Output:
0 3.0
1 18.0
2 33.0
dtype: float64
0 -0.341027
1 -1.700664
2 0.364409
dtype: float64
5. Creating a Series using List Comprehension
Here, we will use the Python list comprehension technique to create a series in Pandas. We will use the range function to define the values and a for loop for indexes.
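One way this might look, with the range bounds and the label string assumed for illustration:

```python
import pandas as pd

# range() supplies the values; a list comprehension builds the index labels
ser = pd.Series(range(1, 20, 3), index=[ch for ch in "abcdefg"])
print(ser)
```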
Output:
a 1
b 4
c 7
d 10
e 13
f 16
g 19
dtype: int64
A DataFrame in Pandas is a data structure used to store data in tabular form, that is, in rows and columns. It is two-dimensional, size-mutable and heterogeneous in nature. The main components of a dataframe are data, rows and columns. A dataframe can be created by loading a dataset from existing storage such as a SQL database, CSV file or Excel file. The general syntax for creating a dataframe is pandas.DataFrame(data, index=index, columns=columns).
In Pandas, a dataframe can be created in many ways. They are as follows:
1. Creating a DataFrame using a List
In order to create a DataFrame from a Python list, just pass the list to the DataFrame() constructor.
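A minimal sketch, assuming a list of sample words:

```python
import pandas as pd

# Hypothetical sample list; each element becomes one row of a single column
words = ["Geeks", "For", "Geeks", "is", "portal", "for", "Geeks"]

df = pd.DataFrame(words)
print(df)
```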
Output:
0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
2. Creating a DataFrame using a Dictionary
A DataFrame can be created by passing a Python dictionary to the DataFrame() constructor. The keys of the dictionary become the column names and the values of the dictionary become the data of the DataFrame.
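A short sketch with assumed sample names and ages:

```python
import pandas as pd

# Keys are column names; each value is the list of entries for that column
data = {"Name": ["Tom", "nick", "krish", "jack"], "Age": [20, 21, 19, 18]}

df = pd.DataFrame(data)
print(df)
```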
Output:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
3. Creating a DataFrame using a List of Dictionaries
Another way to create a DataFrame is by using a Python list of dictionaries. The list is passed to the DataFrame() constructor. The keys of each dictionary element become the column names.
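A sketch under the assumption that every dictionary uses the same integer keys:

```python
import pandas as pd

# Each dictionary is one row; its keys (1, 2, 3) become the column labels
rows = [
    {1: "Geeks", 2: "For", 3: "Geeks"},
    {1: "Portal", 2: "for", 3: "Geeks"},
]

df = pd.DataFrame(rows)
print(df)
```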
Output:
1 2 3
0 Geeks For Geeks
1 Portal for Geeks
4. Creating a DataFrame from Pandas Series
A DataFrame in Pandas can also be created by using the Pandas series.
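For instance, assuming a small Series of strings:

```python
import pandas as pd

# Hypothetical Series; it becomes a single-column DataFrame
ser = pd.Series(["Geeks", "For", "Geeks"])

df = pd.DataFrame(ser)
print(df)
```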
Output:
0
0 Geeks
1 For
2 Geeks
We can create a data frame from a CSV (Comma Separated Values) file. This can be done by using the read_csv() method which takes the csv file as the parameter.
Another way to do this is by using the read_table() method which takes the CSV file and a delimiter value as the parameter.
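Both readers can be sketched as follows. To keep the example self-contained it reads from an in-memory buffer (an assumption for illustration); in practice a file path would be passed instead.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = "Name,Age\nTom,20\nnick,21\n"

# read_csv() assumes a comma delimiter by default
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_table() defaults to tabs, so the delimiter is given explicitly
df_table = pd.read_table(io.StringIO(csv_text), delimiter=",")

print(df_csv)
```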
A Pandas dataframe can be converted to an Excel file by using the to_excel() function which takes the file name as the parameter. We can also specify the sheet name in this function.
NumPy is a third-party Python package, and a dependency of Pandas, that is used to perform large numerical computations. It is used for processing multidimensional array elements to perform complicated mathematical operations.
Pandas dataframe can be converted to a NumPy array by using the to_numpy() method. We can also provide the datatype as an optional argument.
We can also use .values to convert dataframe values to NumPy array
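A brief sketch of both conversions, using an assumed two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

arr = df.to_numpy()                        # dtype inferred from the columns
arr_float = df.to_numpy(dtype="float64")   # dtype requested explicitly
arr_values = df.values                     # attribute-based alternative

print(arr_float)
```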
The first few records of a dataframe can be accessed by using the pandas head() method. It takes one optional argument n, which is the number of rows. By default, it returns the first 5 rows of the dataframe. The head() method has the following syntax: DataFrame.head(n=5)
Another way to do it is by using the iloc[] indexer, which works like Python list slicing. It has the following syntax: df.iloc[:n]
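The two approaches can be compared in a small sketch; the ten-row frame is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_five = df.head()      # default n=5
first_three = df.head(3)    # explicit n
sliced = df.iloc[:3]        # positional slice, like list slicing

print(first_three)
```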
There are many ways to select a single column of a dataframe. They are as follows:
By using the Dot operator, we can access any column of a dataframe.
Another way to select a column is by using the square brackets [].
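A small sketch with assumed column names. Dot access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, so square brackets are the more general form.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "nick"], "Age": [20, 21]})

ages_dot = df.Age          # dot operator
ages_bracket = df["Age"]   # square brackets

print(ages_bracket)
```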
A column of the dataframe can be renamed by using the rename() function. We can rename a single column or multiple columns at the same time using this method.
Another way is by using the set_axis() function, which takes the list of new labels and the axis whose labels are to be replaced.
In case we want to add a prefix or suffix to the column names, we can use the add_prefix() or add_suffix() methods.
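The renaming options above might be sketched like this; all column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# rename() maps old names to new names; unlisted columns are left alone
renamed = df.rename(columns={"A": "alpha", "B": "beta"})

# set_axis() replaces all labels on the chosen axis (axis=1 means columns)
relabelled = df.set_axis(["x", "y"], axis=1)

# add_prefix()/add_suffix() modify every column name
prefixed = df.add_prefix("col_")
suffixed = df.add_suffix("_raw")

print(list(prefixed.columns), list(suffixed.columns))
```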
1. Adding Rows
df.loc[] is used to access a group of rows or columns by label, and assigning to a label that does not yet exist adds a new row to the dataframe.
We can also add multiple rows in a dataframe by using pandas.concat() function which takes a list of dataframes to be added together.
2. Adding Columns
We can add a column to an existing dataframe by assigning a list or Series of values to a new column name.
Another way to add a column is by using the df.insert() method, which takes the position at which the column should be inserted, the column name and the column values as parameters.
We can also add a column to a dataframe by using the df.assign() function, which returns a new dataframe containing the extra column.
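The row and column additions described above can be sketched together. The names, ages, cities and IDs are made-up sample values.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "nick"], "Age": [20, 21]})

# Add one row: assign to a label that does not exist yet
df.loc[len(df)] = ["krish", 19]

# Add several rows: concatenate with another dataframe
more = pd.DataFrame({"Name": ["jack", "amy"], "Age": [18, 22]})
df = pd.concat([df, more], ignore_index=True)

# Add a column by plain assignment (appended at the end)
df["City"] = ["Delhi", "Pune", "Goa", "Agra", "Kochi"]

# Add a column at a chosen position (here position 1)
df.insert(1, "ID", [101, 102, 103, 104, 105])

# assign() returns a new dataframe with the extra column
df = df.assign(AgeNextYear=df["Age"] + 1)

print(df)
```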
We can delete a row or a column from a dataframe by using the df.drop() method and providing the row or column name as the parameter.
1. To delete a column
2. To delete a row
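Both deletions might look as follows; the frame and its labels are assumptions. Note that drop() returns a new dataframe and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

without_column = df.drop(columns=["B"])   # delete a column by name
without_row = df.drop(index=["y"])        # delete a row by index label

print(without_column)
print(without_row)
```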
In pandas, we can combine two dataframes using the pandas.merge() method which takes 2 dataframes as the parameters.
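A sketch of an index-based merge. The two frames are assumptions chosen to be consistent with the output that follows, where only the index labels present in both frames survive.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[10, 20, 30])
df2 = pd.DataFrame({"C": [7, 8, 9], "D": [10, 11, 12]}, index=[20, 30, 40])

# Inner join (the default) on the index of both dataframes
result = pd.merge(df1, df2, left_index=True, right_index=True)
print(result)
```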
Output:
A B C D
20 2 5 7 10
30 3 6 8 11
A dataframe in pandas can be sorted in ascending or descending order according to a particular column. We can do so by using the sort_values() method and providing the column name according to which we want to sort the dataframe. We can also sort it by multiple columns.
To sort it in descending order, we pass an additional parameter 'ascending' and set it to False.
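The sorting variants can be sketched with an assumed sample frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Tom", "nick", "krish", "jack"], "Age": [20, 21, 19, 18]}
)

by_age = df.sort_values("Age")                        # ascending (default)
by_age_desc = df.sort_values("Age", ascending=False)  # descending
by_multiple = df.sort_values(["Age", "Name"])         # several sort keys

print(by_age_desc)
```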
The mean, median, mode, Variance, Standard Deviation and Quantile range can be computed using the following commands in Python.
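A minimal sketch of these statistics on an assumed numeric column. Note that var() and std() compute the sample statistics (ddof=1) by default.

```python
import pandas as pd

scores = pd.Series([2, 4, 4, 6, 9])  # hypothetical sample values

mean = scores.mean()
median = scores.median()
mode = scores.mode()          # returns a Series, since ties are possible
variance = scores.var()       # sample variance
std_dev = scores.std()        # sample standard deviation
q25 = scores.quantile(0.25)
q75 = scores.quantile(0.75)

print(mean, median, mode.tolist(), variance, std_dev, q75 - q25)
```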
In Pandas, there are two ways to create a copy of the Series. They are as follows:
1. Shallow Copy is a copy of the series object in which the index and the data of the original object are not copied; only references to them are. This means changes made to the data of one object are reflected in the other. When Copy-on-Write is enabled (the default from pandas 3.0), a write triggers a real copy, so such changes no longer propagate. A shallow copy of the series can be created with ser.copy(deep=False).
2. Deep Copy is a copy of the series object that has its own index and data. This means changes made to the copy are not reflected in the original series object. A deep copy of the series can be created with ser.copy(deep=True).
The default value of the deep parameter of the copy() function is set to True.
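A small sketch of the deep-copy guarantee, using assumed values. The effect of writing through a shallow copy depends on the pandas version and the Copy-on-Write setting, so the example does not rely on it.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

shallow = ser.copy(deep=False)  # shares data with the original at creation
deep = ser.copy()               # deep=True is the default

# Modifying the deep copy never affects the original
deep.iloc[0] = 100

print(ser.iloc[0], deep.iloc[0])
```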
In pandas, duplicate values can be checked by using the duplicated() method.
To remove the duplicated values we can use the drop_duplicates() method.
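Both methods can be sketched on an assumed frame whose third row repeats the first:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "nick", "Tom"], "Age": [20, 21, 20]})

# True marks a row that repeats an earlier row
mask = df.duplicated()
print(mask)

# Keeps the first occurrence of each duplicated row
unique_rows = df.drop_duplicates()
print(unique_rows)
```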
Datasets generally have some missing values, which can arise for a variety of reasons such as data collection issues, data entry errors or data not being available for certain observations. Missing values can distort an analysis if they are left untreated. To handle them, Pandas provides various functions.
These functions are used for detecting, removing and replacing null values in Pandas DataFrame:
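A sketch of some commonly used functions of this kind, namely isnull(), notnull(), dropna() and fillna(); the sample frame and the fill values are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", None, "z"]})

missing_mask = df.isnull()     # True where a value is missing
present_mask = df.notnull()    # the inverse of isnull()
dropped = df.dropna()          # drop rows containing any missing value
filled = df.fillna({"A": 0, "B": "unknown"})  # per-column replacements

print(missing_mask.sum())
print(filled)
```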
The interpolate() and fillna() methods in pandas are used to handle missing or NaN (Not a Number) values in a DataFrame or Series. The following table shows the difference between interpolate() and fillna():
| Feature | interpolate() | fillna() |
|---|---|---|
| Purpose | Estimates and fills missing values using interpolation techniques | Fills missing values using a specified constant or computed value |
| How it works | Calculates missing values based on existing surrounding data | Directly replaces NaN with given value(s) or strategies |
| Common methods supported | linear, polynomial, time, spline, etc. | 0, mean(), median(), mode(), forward-fill (ffill), back-fill (bfill) etc. |
| Data types supported | Mainly numeric and datetime data (where logical continuity exists) | Numeric, categorical and datetime data |
| Use case | Used when missing values depend on surrounding trends or time sequence | Used when missing values can be filled with a fixed or computed known value |
| Return type | Returns a DataFrame/Series with estimated values replacing NaN | Returns a DataFrame/Series with specified values replacing NaN |
| Example | df['col'].interpolate(method='linear') | df['col'].fillna(df['col'].mean()) |
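
The difference summarised in the table can be seen in a short sketch with assumed values: interpolate() estimates each gap from its neighbours, while fillna() writes the same supplied value into every gap.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

interpolated = s.interpolate(method="linear")  # gaps estimated from neighbours
mean_filled = s.fillna(s.mean())               # every gap gets the same value

print(interpolated.tolist())
print(mean_filled.tolist())
```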
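The map(), applymap() and apply() methods are used in pandas for applying functions or transformations to elements in a DataFrame or Series. Note that applymap() is deprecated since pandas 2.1 in favour of DataFrame.map(), which does the same element-wise job. The following table shows the difference between map(), applymap() and apply():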
| Feature | map() | applymap() | apply() |
|---|---|---|---|
| Defined on | Series only | DataFrame only | Both Series and DataFrame |
| Works on | Each element of a Series | Each element of a DataFrame | Entire row/column (or whole Series) |
| Axis support | No axis parameter | No axis parameter | Has axis parameter (axis=0 for columns, axis=1 for rows) |
| Function application level | Element-wise | Element-wise | Row-wise or Column-wise (can also be element-wise on Series) |
| Typical use | Apply a function/dict to each element of a Series | Apply a function to each element of a DataFrame | Apply a function across rows/columns or to a whole Series |
| Example use case | Convert each name in a Series to uppercase | Square each element in a numeric DataFrame | Calculate sum/mean of each row or column |
| Return type | Series | DataFrame | Series (if applied on DataFrame rows/columns) or scalar (if aggregated) |
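
A sketch of the three methods on assumed data. Because applymap() is deprecated in recent releases, the example picks DataFrame.map() when it is available and falls back to applymap() otherwise.

```python
import pandas as pd

names = pd.Series(["tom", "nick"])
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# map(): element-wise on a Series
upper = names.map(str.upper)

# Element-wise on a DataFrame: DataFrame.map (pandas >= 2.1) or applymap
elementwise = df.map if hasattr(df, "map") else df.applymap
squared = elementwise(lambda x: x ** 2)

# apply(): one call per column (axis=0) or per row (axis=1)
column_sums = df.apply(lambda col: col.sum(), axis=0)
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(upper.tolist(), column_sums.tolist(), row_sums.tolist())
```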
1. Set Index: We can set the index of a Pandas dataframe by using the set_index() method, which makes one or more existing columns (or an array, Series or Index of matching length) the index of the dataframe.
2. Reset Index: The index of Pandas dataframes can be reset by using the reset_index() method. It can be used to simply reset the index to the default integer index beginning at 0.
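Setting and resetting the index might look like this, with an assumed sample frame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "nick", "krish"], "Age": [20, 21, 19]})

# Use the Name column as the row index
indexed = df.set_index("Name")
print(indexed)

# Move the index back into a column and restore the default 0..n-1 index
restored = indexed.reset_index()
print(restored)
```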
Reindexing in Pandas, as the name suggests, means changing the index of the rows and columns of a dataframe so that it conforms to a new set of labels. It can be done by using the Pandas reindex() method. For labels that are not present in the original dataframe, the reindex() method fills the new entries with NaN.
Multi-indexing refers to using two or more levels of labels on the rows or columns of a pandas object. A MultiIndex is a multi-level or hierarchical index that makes it possible to store and analyse higher-dimensional data in a Series or DataFrame. Multi-indexing in Pandas can be achieved by using a number of functions such as:
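A short sketch of reindexing rows and columns; the labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])

# "d" is not in the original index, so its row is filled with NaN
rows_reindexed = df.reindex(["a", "c", "d"])

# "B" is not an existing column, so the whole column is NaN
cols_reindexed = df.reindex(columns=["A", "B"])

print(rows_reindexed)
print(cols_reindexed)
```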
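One such constructor is MultiIndex.from_tuples(). The sketch below builds a two-level row index from assumed department and employee names, then selects by the outer level and by a full key.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("HR", "krish"), ("IT", "Tom"), ("IT", "nick")],
    names=["Dept", "Name"],
)
df = pd.DataFrame({"Salary": [45000, 50000, 60000]}, index=index)

it_rows = df.loc["IT"]                           # all rows under the IT level
krish_salary = df.loc[("HR", "krish"), "Salary"]  # one fully specified key

print(df)
print(it_rows)
```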
1. loc: It is label-based, i.e., you access rows and columns using their labels (row and column names).
2. iloc: It is integer-position based and here you access rows and columns using their numeric index positions (row and column numbers).
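The contrast can be sketched on an assumed frame with string row labels. One practical difference worth noting: label slices with loc include the end label, while positional slices with iloc exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({"Age": [20, 21, 19]}, index=["Tom", "nick", "krish"])

by_label = df.loc["nick", "Age"]    # row label, column label
by_position = df.iloc[1, 0]         # row position, column position

label_slice = df.loc["Tom":"nick"]  # end label included
position_slice = df.iloc[0:1]       # end position excluded

print(by_label, by_position, len(label_slice), len(position_slice))
```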
Pandas describe() is used to view some basic statistical details of a data frame or a series of numeric values. It can give a different output when it is applied to a series of strings. It can get details like percentile, mean, standard deviation, etc.
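Pandas dataframe.corr() method is used to find the pairwise correlation of the columns of a dataframe. Missing values are excluded automatically. Non-numeric columns were silently dropped in older releases, but from pandas 2.0 they must be excluded explicitly, for example by passing numeric_only=True.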
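A sketch of describe() and corr() on an assumed frame that mixes numeric and string columns:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "label": ["a", "b", "c", "d"]}
)

# describe() summarises only the numeric columns by default
summary = df.describe()
print(summary)

# numeric_only=True leaves the string column out of the correlation
correlation = df.corr(numeric_only=True)
print(correlation)
```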
The groupby() function in Pandas is used to split the data into groups based on one or more columns, then apply an operation (like aggregation, transformation or filtering) on each group separately.
For example:
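The following is a sketch with assumed departments and salaries, chosen to be consistent with the output shown next:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Dept": ["IT", "HR", "IT", "HR"],
        "Salary": [50000, 45000, 60000, 55000],
    }
)

# Split by department, then take the mean salary of each group
avg_salary = df.groupby("Dept")["Salary"].mean()
print(avg_salary)
```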
Output:
Dept
HR 50000.0
IT 55000.0
Name: Salary, dtype: float64
In Pandas, pivot_table() is used to summarize and reshape data into a tabular format. It allows you to aggregate values like sum, mean, count, etc by specifying which columns become rows (index), which become columns and which contain the values to aggregate.
We can pivot the dataframe in Pandas by using the pivot_table() method. To unpivot the dataframe back to its long form, we can melt it by using the melt() method.
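A pivot followed by a melt might be sketched as below; the column names and figures are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Dept": ["IT", "IT", "HR", "HR"],
        "Year": [2022, 2023, 2022, 2023],
        "Salary": [50000, 60000, 45000, 55000],
    }
)

# Long to wide: one row per department, one column per year
wide = df.pivot_table(index="Dept", columns="Year", values="Salary", aggfunc="mean")
print(wide)

# Wide back to long: year columns are folded into Year/Salary pairs
long = wide.reset_index().melt(id_vars="Dept", var_name="Year", value_name="Salary")
print(long)
```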
Both pivot_table() and groupby() are useful methods in pandas used for aggregating and summarizing data. The following table shows the difference between pivot_table() and groupby():
| Feature | pivot_table() | groupby() |
|---|---|---|
| Purpose | Summarizes and aggregates data in a tabular (pivoted) format | Performs aggregation on grouped data of one or more columns |
| Reshaping | Used to reshape data based on column values | Used to group data based on categorical variables |
| Output structure | Returns a new reshaped DataFrame | Returns a GroupBy object which must be followed by aggregation functions |
| Multi-level grouping | Can handle multiple levels of grouping using index and columns parameters | Can handle multiple levels of grouping using multiple column names in groupby() |
| Comparison across dimensions | Used when we want to compare data across multiple dimensions | Used to summarize data within groups |
| Typical use case | Summarizing data with one axis as rows and another as columns | Grouping by one or more columns and then applying aggregation |
In Pandas, data aggregation refers to summarizing or reducing data in order to produce a statistical summary of one or more columns in a dataset. To calculate statistical measures like sum, mean, minimum, maximum or count, aggregation functions are applied to groups or subsets of the data.
The agg() function in Pandas is frequently used to aggregate data. Applying one or more aggregation functions to one or more columns in a DataFrame or Series is possible using this approach. Pandas' built-in functions or specially created user-defined functions can be used as aggregation functions.
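A sketch of agg() with built-in and user-defined functions. The data and the helper name spread are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Dept": ["IT", "IT", "HR", "HR"],
        "Salary": [50000, 60000, 45000, 55000],
    }
)

def spread(values):
    """User-defined aggregation: difference between max and min."""
    return values.max() - values.min()

# Several built-in aggregations on one column
overall = df["Salary"].agg(["min", "max", "mean"])

# Built-in and user-defined aggregations per group
per_dept = df.groupby("Dept")["Salary"].agg(["mean", spread])

print(overall)
print(per_dept)
```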
The following table shows the difference between join(), merge() and concat():
| Feature | join() | merge() | concat() |
|---|---|---|---|
| Purpose | Combines two DataFrames on their index or on a key column. | Combines DataFrames using common columns or indices (like SQL joins). | Combines DataFrames along rows or columns. |
| Default on | Joins on index by default. | Joins on common columns by default. | Just stacks DataFrames without joining keys. |
| Join types | left, right, inner, outer | left, right, inner, outer | Not applicable (simply concatenates). |
| Axis support | Always horizontal (columns) | Always horizontal (columns) | Can be vertical (rows) or horizontal (columns) using axis. |
| Typical use | Combine DataFrames by their index labels. | Combine DataFrames based on matching column values. | Stack multiple DataFrames into one. |
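
The three methods from the table can be compared in one sketch; the frames and the key column are assumed sample data.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# merge(): match rows on a common column, like a SQL inner join
merged = pd.merge(left, right, on="key", how="inner")

# join(): match rows on the index (left join by default)
joined = left.set_index("key").join(right.set_index("key"))

# concat(): stack along rows (axis=0) or place side by side (axis=1)
stacked = pd.concat([left, left], ignore_index=True)
side_by_side = pd.concat([left, right], axis=1)

print(merged)
print(joined)
```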
A time series is a collection of data points with timestamps. It depicts the evolution of a quantity over time. Pandas provides various functions to handle time series data efficiently, such as working with timestamps, resampling a time series to different time periods, handling missing data and slicing the data by timestamp.
Pandas has various time-series functions, such as:
| Pandas Built-in Function | Operation |
|---|---|
| pandas.to_datetime(DataFrame['Date']) | Convert 'Date' column of DataFrame to datetime dtype |
| DataFrame.set_index('Date', inplace=True) | Set 'Date' as the index |
| DataFrame.resample('H').sum() | Resample time series to a different frequency (e.g., hourly, daily, weekly, monthly etc) |
| DataFrame.interpolate() | Fill missing values using linear interpolation |
| DataFrame.loc[start_date:end_date] | Slice the data based on timestamps |
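
The operations in the table can be chained in a small sketch. The dates and values are assumptions, and the example resamples to daily frequency ('D') rather than hourly.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": [
            "2023-07-17 00:00",
            "2023-07-17 12:00",
            "2023-07-18 06:00",
            "2023-07-19 18:00",
        ],
        "Value": [1.0, None, 3.0, 4.0],
    }
)

df["Date"] = pd.to_datetime(df["Date"])   # strings to datetime dtype
df = df.set_index("Date")                 # timestamps become the index

daily = df.resample("D").sum()            # one row per calendar day
filled = df.interpolate()                 # linear fill of the missing value
window = df.loc["2023-07-17":"2023-07-18"]  # slice by timestamp

print(daily)
print(filled)
```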
A Python string can be converted to a DateTime object by using:
1. pandas.to_datetime()
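A minimal sketch, assuming an ISO-formatted date string consistent with the output below:

```python
import pandas as pd

date_string = "2023-07-17"  # hypothetical input

# to_datetime() parses the string into a pandas Timestamp
timestamp = pd.to_datetime(date_string)
print(timestamp)
```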
Output:
2023-07-17 00:00:00
2. datetime.strptime()
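A comparable sketch with the standard library, where the format string must describe the input exactly; the date string is an assumed example.

```python
from datetime import datetime

date_string = "2023-07-17"  # hypothetical input

# %Y-%m-%d matches a four-digit year, two-digit month and two-digit day
parsed = datetime.strptime(date_string, "%Y-%m-%d")
print(parsed)
```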
Output:
2023-07-17 00:00:00
The time delta is the difference in dates and time. It indicates the duration or difference in time. The time delta object can be created by using the timedelta() method and providing the number of weeks, days, seconds, milliseconds, etc as the parameter.
With the help of the Timedelta data type, you can easily perform arithmetic operations, comparisons and other time-related manipulations in units such as days, hours, minutes, seconds, milliseconds and microseconds.
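A sketch of both the standard-library timedelta and the pandas Timedelta, using assumed durations:

```python
from datetime import timedelta

import pandas as pd

# Standard-library duration: 1 week + 2 days + 3 hours
delta = timedelta(weeks=1, days=2, hours=3)

# pandas equivalent, usable in arithmetic with Timestamps
pd_delta = pd.Timedelta(days=1, hours=12)
start = pd.Timestamp("2023-07-17")
end = start + pd_delta

# Durations can be compared and converted between units
is_longer_than_a_day = pd_delta > pd.Timedelta(hours=24)

print(delta, end, pd_delta.total_seconds(), is_longer_than_a_day)
```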
Label encoding is used to convert categorical data into numerical data so that a machine-learning model can fit it. To apply label encoding using pandas we can use:
1. pandas.Categorical().codes: It only gives codes.
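A sketch with assumed colour labels. Codes are assigned by the sorted order of the categories, so "blue" is 0, "green" is 1 and "red" is 2.

```python
import pandas as pd

colors = ["red", "blue", "green", "blue"]  # hypothetical labels

# Categories are sorted alphabetically before codes are assigned
codes = pd.Categorical(colors).codes
print(codes)
```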
Output:
[2 0 1 0]
2. pandas.factorize(): It gives both codes and unique labels.
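A sketch on the same assumed labels. Here codes follow the order of first appearance rather than sorted order.

```python
import pandas as pd

colors = ["red", "blue", "green", "blue"]  # hypothetical labels

# codes index into uniques, which lists labels in order of first appearance
codes, uniques = pd.factorize(colors)
print(codes)
```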
Output:
[0 1 2 1]
One-hot encoding is a technique for representing categorical data as numerical values in a machine-learning model. It works by creating a separate binary variable for each category in the data. The value of the binary variable is 1 if the observation belongs to that category and 0 otherwise. It can improve the performance of the model.
To apply one-hot encoding, we create dummy columns for our dataframe by using the get_dummies() function.
For example:
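The following sketch assumes a single Color column. Passing dtype=int gives 0/1 columns, as in the output below; recent pandas versions otherwise return True/False.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# One new 0/1 column per category, named <column>_<category>
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
```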
Output:
Color_Blue Color_Green Color_Red
0 0 0 1
1 1 0 0
2 0 1 0