Powerful One-liners in Pandas Every Data Scientist Should Know
Things you can do in one line using Pandas
Training data-driven machine learning models has never been as easy as today. For instance, assume you are training a vanilla neural network. Here, adjusting the architecture for the number of hidden layers and their dimension, tweaking the hyperparameters, or changing the loss function can all be done with a slight modification in the model definition or its optimizer.
While on one hand, this is advantageous as it reduces the heavy lifting of spending time designing architectures from scratch. However, this has often led machine learning practitioners/researchers to neglect the importance of data visualizations and analysis โ leading them to train deep models directly without establishing a clear understanding of their data.
Therefore, in this post, I would like to introduce you to a handful of essential and powerful one-liners specifically for tabular data using Pandas that will help you better understand your data and consequently (and hopefully) help you design and build better machine learning models.
Dataset
For this post, I will experiment with a dummy dataset of one thousand Employees which I created myself in Python. The image below gives an overview of the dataset we are experimenting with.
The code block below demonstrates my implementation:
One-liners in Pandas
Next, letโs discuss some popular functions available in Pandas to make a meaningful understanding of the available data.
#1 n-largest values in a series
Say we want to start off by finding the top-n paid roles in this dataset. You can do this using the [nlargest()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html) method in Pandas. This method returns the first n rows with the largest values in column(s), ordered in descending order.
Note that nlargest() returns the entire DataFrame, i.e., the function also returns the columns not specified for ordering. However, they are not used to order the DataFrame. The code snippet below depicts the use of nlargest() method on our DataFrame.
When duplicate values exist, we need to specify which particular row(s) we want in the final output. This is done using the keep argument that can take the following values:
keep = "first": prioritizes the first occurrence.keep = "last": prioritizes the last occurrence.keep = "all": Does not drop any duplicates, even if it means selecting more than n items (like in the image above).
It is often mistaken that the nlargest()is precisely equivalent to using the sort_values()method as follows:
However, the keepargument used in nlargest() makes all the difference. Considering the example above, nlargest() with keep="all"returns potential duplicates as well. This, on the other hand, can not be done in the case of sort_values() method.
2 n-smallest values in a series
Similar to the nlargest() method discussed above, you can find the rows corresponding to the lowest-n values using the nsmallest() method in Pandas. This method returns the first n rows with the smallest values in column(s), arranged in ascending order. The arguments passed here are the same as those specified in the nlargest() method. The code snippet below depicts the use of nsmallest() method on our DataFrame.
3 CrossTabs
Crosstab allows you to compute a cross-tabulation of two (or more) columns/series and returns a frequency of each combination by default. In other words, [crosstab()](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) takes one column/list, displays its unique values as indexes, and then takes another column/list and displays its unique values as the column headers. The values in the individual cells are computed using an aggregation function. By default, they indicate the co-occurrence frequency.
Say, for instance, we wish to compute the number of employees working from each location within every company. This can be done as follows:
As it can be hard to interpret numerical values in a crosstab (and to make it more visually appealing), we can generate a heatmap from a crosstab shown below as follows:
If you wish to compute aggregation on some column other than the ones that make up the indexes and the column headers, you can do so by passing the aggregation column to values argument of crosstab()as shown below:
4 Pivot Table
Pivot tables are a commonly used data analysis tool in Excel. Similar to crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate your data.
Although they both share numerous similarities and are conceptually the same in the context of Pandas, there are a few implementational differences that make them different (further reading here). The code snippet below demonstrates the use of [pivot_table()](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) method to find the frequency of co-occurrence between the "Company Name" and "Location":
Similar to what we did in Crosstab, we can create a heatmap to make it visually appealing as well as more interpretable. This can be done as shown in the code snippet to generate the following heatmap:
5 Handling Duplicated Data
In addition to the regular data analysis, appropriately handling duplicate values in your data also plays a vital role in building your data pipeline. One major caveat of having duplicates in your data is that they take up unnecessary storage space and slow down the computation by acquiring resources. Furthermore, duplicate data can skew analysis results, leading us to draw wrong insights. Therefore, removing or handling duplicates in your data is extremely important.
First, letโs look at how you can mark duplicate values in your DataFrame. For this, weโll use the [duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)method in Pandas. This returns a boolean Series that indicates duplicate rows. For demonstration purposes, Iโll only use a random sample of 10 rows of the original salary dataset, of which the last two rows have been intentionally duplicated. The sampled rows are shown in the image below.
- Mark duplicated rows
Pandas allows you to assign boolean labels to rows based on all columns (or a subset of columns) which are duplicates. This can be done using the [duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) method of Pandas as shown below:
When there are duplicate values, keep is used to indicate which specific duplicates to mark.
keep = "first": (Default) Marks all duplicates asTrueexcept for the first occurrence.keep = "last": Marks all duplicates asTrueexcept for the last occurrence.keep = False: Marks all duplicates asTrue.
You can filter all the rows which appear only once by passing the boolean series as flags for filtering a Pandas DataFrame as follows:
To check duplicates on a subset of columns, pass the list of columns as the subset argument of duplicated() method as shown below:
Filtering the DataFrame using the above boolean series as shown below outputs the DataFrame following the code:
- Remove duplicates
In addition to marking potential duplicates using boolean labels discussed above, one might also need to get rid of duplicates. To reiterate, the data I am referring to specifically for the "Handling Duplicated Data" section comprises just ten rows. This is shown below:
You can remove the duplicate rows either based on values in all columns or a subset of columns using the drop_duplicates() method as shown below:
Similar to duplicated(), the keep argument is used to indicate which specific duplicates you want to keep.
keep = "first": (Default) Drops all duplicates except for the first occurrence.keep = "last": Drops all duplicates except for the last occurrence.keep = False: Drops all duplicates.
To drop duplicates based on the values in a subset of columns, pass the list of columns as the subset argument to the drop_duplicates() method:
To conclude, in this post, I presented a few popular methods available in Pandas for effective data analysis in Tabular Data. Though this post will be helpful for you to make you comfortable with the syntax of these methods, I would highly recommend downloading a dataset on your own and experimenting with it in a jupyter notebook.
Further, there is no better place than referencing the official Pandas documentation available here to acquire fundamental and practical knowledge of various effective methods in Pandas. Pandasโ official documentation provides a detailed explanation of each of the arguments accepted by a function along with a practical example, which in my opinion, is an excellent way to acquire both beginner-level and advanced Pandas expertise.
P.S. I have been able to cover only five methods in the post. Iโll release the next set of Pandas methods for effective data analysis in another post soon :). Meanwhile, if you enjoyed reading this article, I am sure you would enjoy the following articles too:
20% of Pandas Functions that Data Scientists Use 80% of the Time
Thanks for reading.
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS