Pandas can transform even the messiest data into pristine machine learning datasets. The process itself, though, can be quite messy.
Pandas code can be hard to read for a number of reasons. For one thing, there are many different ways of accomplishing the same basic tasks in Pandas. Subsetting data, adding new columns, dropping columns, removing null values and many other processes can be done in a number of different ways, which leads to inconsistent and messy code.
Managing the order of data cleaning steps can also be challenging in Pandas. Most of my data cleaning code earlier in my career looked like this:
# Import data
import pandas as pd
df_raw = pd.read_csv("path/to/data/file.csv")
# Clean data
df_raw["id"] = df_raw["id"].astype(str)
df_merged = df_raw.merge(df2, on="id")  # df2 is a second data frame loaded elsewhere
df_final = df_merged.drop(columns=["col_5", "col_6", "col_7"])
# Investigate data
df_agg = df_final.groupby("id").size()
"Spaghetti code," like the example above, is difficult to interpret and debug. Furthermore, having so many data frames in the namespace uses up memory and can be especially problematic when working with large datasets.
Finally, Pandas code can get messy because it is often written in a hurry. Whether you're itching to build your model and need to quickly clean your dataset beforehand, or you have a fresh batch of output data that you want to analyze, Pandas is generally a means to an end for a data scientist.
So what’s the secret to writing clean Pandas code once and for all? Two words: method chaining.
In this article, I share a curated collection of clean Pandas methods that I use to preprocess, investigate, aggregate, and analyze Twitter data. In a separate project, the same data trains a Transformer model. Drawing on these examples will expand your understanding of method chaining and serve as a reference guide for writing your own clean Pandas code.
The basics of clean Pandas
The Pandas library comes with loads of built-in methods. Recall that in Python, methods are functions that belong to an object of a specific class and are called on the object itself, like df.to_csv(). Methods can also be chained: because most Pandas methods return a new object, you can call several of them one after another in a single expression, each operating on the result of the previous call.
new_df = (  # Wrap everything in ()'s
    original_df  # Name of data frame to modify
    .query("text_length > 140")  # Subset based on text length
    .sort_values(by="text_length")  # Sort entire df by text length
    .reset_index()  # Reset index of subsetted df
)
Forcing yourself to use Pandas methods instead of operators can be frustrating at first because, for the most part, you’re relearning things you already know. But here’s why you should stick with method chaining:
- It makes code much more readable.
- It removes the need to name intermediary data frames, so each one can be garbage-collected as soon as the chain moves on, which can save memory.
- It’s easier to debug. Simply comment out data frame manipulations line by line to see which method is giving you problems.
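To make the operators-versus-methods contrast concrete, here is a minimal sketch on a toy data frame. The data and the column names (text, likes, text_length) are assumptions made up for illustration, not the article's dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
tweets = pd.DataFrame(
    {"text": ["short", "a much longer tweet", None], "likes": [3, 10, 1]}
)

# Operator style: each step reassigns or mutates a named data frame.
df_ops = tweets[tweets["text"].notna()].copy()
df_ops["text_length"] = df_ops["text"].str.len()
df_ops = df_ops[df_ops["text_length"] > 5]

# Method style: the same steps as one chain, with no intermediate names.
df_chain = (
    tweets
    .dropna(subset=["text"])                              # Drop rows with null text
    .assign(text_length=lambda df: df["text"].str.len())  # Add a length column
    .query("text_length > 5")                             # Keep only longer tweets
)
```

Both versions produce the same result. In the chained version, commenting out any single line (for example the .query() call) leaves a valid expression, which is what makes step-by-step debugging convenient.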
Data cleaning
I have a data frame of raw tweets made by US Senators that I fetched via the Twitter API v2 with elevated access credentials. Here’s a peek at the data:
Now let’s do some chaining to clean this up. In just this single call, we will select and drop columns, format the date column, clean the raw text of the tweets, count text length, merge two data frames, drop duplicate rows, rename columns, reorder column names, sort by date, and drop all rows where tweet length is zero.
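The original chain is not reproduced here, so the following is only a sketch of what such a chain could look like. The two toy data frames, every column name, and the clean_text helper are hypothetical stand-ins, not the author's actual data or code.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the raw tweets and a senator lookup table.
raw_tweets = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "author_id": [10, 11, 11, 10],
    "created_at": ["2022-03-01T12:00:00Z", "2022-03-02T08:30:00Z",
                   "2022-03-02T08:30:00Z", "2022-03-03T09:15:00Z"],
    "text": ["Proud to vote yes! https://t.co/abc", "Town hall tonight",
             "Town hall tonight", "https://t.co/xyz"],
    "lang": ["en", "en", "en", "en"],
})
senators = pd.DataFrame({
    "author_id": [10, 11],
    "username": ["SenA", "SenB"],
    "party": ["D", "R"],
})

def clean_text(text: str) -> str:
    """Illustrative cleaner: strip URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

clean_tweets = (
    raw_tweets
    .drop(columns=["lang"])                                    # Drop unused columns
    .assign(
        id=lambda df: df["id"].astype(str),                    # Change a data type
        created_at=lambda df: pd.to_datetime(df["created_at"]),  # Format the date
        text=lambda df: df["text"].apply(clean_text),          # Clean the raw text
    )
    .assign(text_length=lambda df: df["text"].str.len())       # Count text length
    .merge(senators, on="author_id", how="left")               # Merge two data frames
    .drop_duplicates()                                         # Drop duplicate rows
    .rename(columns={"created_at": "date"})                    # Rename columns
    .loc[:, ["id", "username", "party", "date", "text", "text_length"]]  # Reorder
    .sort_values(by="date")                                    # Sort by date
    .query("text_length > 0")                                  # Drop empty tweets
    .reset_index(drop=True)
)
```

In this toy run, the duplicated tweet is collapsed to one row and the tweet that contained only a URL is dropped because its cleaned text is empty.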
You’re probably familiar with most of these methods. Perhaps the most important method here is .assign(), which allows you to create new columns or overwrite old ones. I primarily use the assign() method for two purposes.
- Changing the data type of an existing column:
.assign(column_name=original_df["column_name"].astype(str))
- Applying functions to entire columns:
.assign(new_column=original_df["column_name"].apply(function_name))
Note that you can also apply multiple functions to a column sequentially.
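As a small sketch of both uses of .assign(), including two functions applied to one column in sequence: the data frame and the normalize_spaces helper below are hypothetical. Passing a lambda instead of referencing the original data frame by name is an optional variation, not something the article prescribes; it lets each new column be computed from the data frame as it exists at that point in the chain.

```python
import pandas as pd

# Hypothetical toy data.
tweets = pd.DataFrame(
    {"id": [101, 102], "text": ["  Hello WORLD ", "Good  Morning"]}
)

def normalize_spaces(text: str) -> str:
    """Illustrative helper: trim and collapse runs of whitespace."""
    return " ".join(text.split())

cleaned = tweets.assign(
    # Change the data type of an existing column.
    id=lambda df: df["id"].astype(str),
    # Apply two functions to the same column, one after the other.
    text=lambda df: df["text"].apply(normalize_spaces).apply(str.lower),
)
```

Note that .assign() returns a new data frame, so tweets itself is left unchanged.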
Data investigation
After applying that monster chain to the raw Twitter data, we have a tidy, readable data frame that we'd like to inspect.
The simple method .info() gives you a remarkable amount of information about your data frame, including:
- Number of rows (and the index range)
- Number of columns
- Names of columns
- Data types of columns
- Number of non-null values per column
- Memory usage
This handy one-liner gets the number of null values per column:
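The one-liner itself is missing from the text here. A common way to count nulls per column, which may or may not be the exact call the author used, is to chain .isna() and .sum(). The toy data frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a few missing values.
tweets = pd.DataFrame({
    "username": ["SenA", None, "SenC"],
    "text": ["hi", "hello", None],
    "party": ["D", "R", "I"],
})

# Boolean mask of nulls, summed column by column.
null_counts = tweets.isna().sum()
```

The result is a Series indexed by column name, so it can itself be chained further, for example with .sort_values(ascending=False).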
The .describe() method gives you an overview of the actual values and distributions of the data in each of your columns. Apply .describe() to columns based on their data type for cleaner output, as shown below:
The results from calling .describe() on dtype="object" are not especially insightful, as the columns id, username, and text contain string values rather than categorical data. However, the row values for the party column could show a potential pattern.
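One way to call .describe() per data type is through its include and exclude parameters. The sketch below uses a hypothetical toy data frame, and it selects the non-numeric columns with exclude=["number"] rather than by naming the object dtype, which is a choice made here for illustration.

```python
import pandas as pd

# Hypothetical toy data mixing string and numeric columns.
tweets = pd.DataFrame({
    "username": ["SenA", "SenB", "SenA"],
    "party": ["D", "R", "D"],
    "text_length": [120, 80, 200],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = tweets.describe(include=["number"])

# Non-numeric columns: count, unique, top (most frequent value), freq.
text_summary = tweets.describe(exclude=["number"])
```

For a column like party, the top and freq rows already hint at which category dominates the dataset.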
Data aggregation & analysis
Data aggregation on categorical variables is usually the first part of any analysis I perform for NLP projects. The most obvious variable to aggregate on in the tweet dataset is party.
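A minimal sketch of aggregating on party, assuming a party column like the one described above. The values are made up, and .value_counts() is one common choice for this kind of count rather than necessarily the author's.

```python
import pandas as pd

# Hypothetical toy data.
tweets = pd.DataFrame({"party": ["D", "R", "D", "I", "R", "D"]})

# Number of tweets per party, most frequent first.
party_counts = tweets["party"].value_counts()

# The same counts expressed as a share of all tweets.
party_share = tweets["party"].value_counts(normalize=True)
```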
Next, let's move on to some more advanced aggregations. Chaining the .groupby() and .agg() methods in order like this makes it easier to understand the aggregation as a whole:
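As an illustration of chaining .groupby() and .agg(), here is a sketch on hypothetical data. The columns and the chosen statistics are assumptions, not the article's actual aggregation.

```python
import pandas as pd

# Hypothetical toy data.
tweets = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "party": ["D", "R", "D", "R"],
    "text_length": [120, 80, 200, 100],
})

party_stats = (
    tweets
    .groupby("party")          # One group per party
    .agg({                     # Several statistics per column
        "text_length": ["mean", "max"],
        "id": ["count"],
    })
)
```

Passing lists of statistics produces two-level column labels such as ("text_length", "mean"), which is compact to write but awkward to read and to select from.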
After applying the aggregations, the resulting index is hard to read. The .pipe() method is the clean Pandas way to apply a function to an entire data frame.
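A sketch of how .pipe() can tidy the labels inside the chain, assuming the hard-to-read part is a two-level column index produced by .agg(). The flatten_columns helper and the data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
tweets = pd.DataFrame({
    "party": ["D", "R", "D", "R"],
    "text_length": [120, 80, 200, 100],
})

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative helper: join two-level column labels with an underscore."""
    out = df.copy()
    out.columns = ["_".join(col) for col in out.columns]
    return out

party_stats = (
    tweets
    .groupby("party")
    .agg({"text_length": ["mean", "max"]})
    .pipe(flatten_columns)     # Apply a function to the whole data frame
    .reset_index()             # Turn the party index back into a column
)
```

Because .pipe() passes the data frame as the first argument to the function and returns whatever the function returns, any custom step can sit inside the chain without breaking it.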
Conclusion
The key to writing clean Pandas code is to force yourself to use method chaining. Doing so ultimately makes your code more readable and interpretable, easier to debug, and even saves memory. As this article has demonstrated, you can use method chaining in every part of the data lifecycle, including cleaning, investigating, aggregating, and analyzing data. For more information on method chaining, check out the resources below.
Resources
- Supporting code for this article
- Effective Pandas by Matt Harrison
- Modern Pandas
- Pandas Documentation