Make working with large DataFrames easier, at least for your memory

Reduce the size of your DataFrames by up to 90%!

Aug 28, 2018

6 min read

Nowadays, in the big-data world, we are oftentimes working with large datasets. And while there are certain tools facilitating work on massive volumes of data (such as Spark), we still do a lot of work locally and, for example, would like to build a prototype model (an MVP, minimum viable product). So I think it might be worthwhile to spend a bit of time optimizing the way we handle the data – and by doing so – reduce its size by up to 90%. Yes, it can be that much…

What is happening under the hood?

There won’t be a lot of theory, but still, a few words are necessary. Under the hood, pandas stores DataFrame’s columns of the same variable type (such as integers, floats, objects) in blocks. In the case of blocks representing numeric values (integers and floats), those columns are combined and stored in the form of a NumPy’s ndarray, which makes accessing them very fast.

So let’s look at the data types available at pandas and what they actually represent. The following table presents a great overview of the available types and what they are used for.

👁 Source: http://pbpython.com/pandas_dtypes.html

Source: http://pbpython.com/pandas_dtypes.html

It is also worth mentioning what are the subtypes visible in the table. As an example, we have int8, int16, int32 and int64. The number indicates how many bits this subtype uses to represent a particular value and these use 1, 2, 4 and 8 bytes of memory respectively. The following table might be more informative with regards to what values we can actually store in these numeric data types.

👁 Source: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html

Source: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html

It might come in handy to know the difference between uint (unsigned integers) and int (signed integers). Both have equal capacity for storage, but unsigned integers only allow for storing positive values. So we can definitely leverage that in some cases!

Practical Example

For this article, I will generate a sample DataFrame, with a few different data types to illustrate how it is possible to optimize storage. Pandas usually does a good job when detecting the type of the data, however, there are times, for example, while using read_csv() or read_sql()the datatypes are not assigned in an optimal way. I create the DataFrame as follows:

I artificially set the type of variables such as user_id and order_id to string, because this is what often happens in practice. You can also encounter ids of the type id123456789 where removing the string part id does not make any difference but can lead to a significant reduction in required memory.

As the initial step, I examine the types of the variables stored in the sample DataFrame. Everything is exactly as expected.

>>> df.dtypes

user_id object
order_id object
order_date object
number_of_products int64
total_amount float64
day_of_week object
dtype: object

To see how much memory each of the columns take we can use the memory_usage as follows:

>>> df.memory_usage(deep=True) / 1024 ** 2

Index 0.000076
user_id 61.988831
order_id 61.035156
order_date 38.146973
number_of_products 7.629395
total_amount 7.629395
day_of_week 61.171283
dtype: float64

By setting deep=True we obtain the exact memory size (Mb), leaving the option to default False would present an approximation. It is clear at first sight that the majority of the memory is used for storing strings and that is also where we can gain most from optimization. For convenience, I define a helper function for evaluating the size (in Mb) of the entire DataFrame.

def memory_usage(df):
 return(round(df.memory_usage(deep=True).sum() / 1024 ** 2, 2))

And let’s see how much in total the DataFrame takes:

>>> print('Memory used:', memory_usage(df), 'Mb')

Memory used: 237.6 Mb

Optimizing numeric variables

At first, it is good to inspect the variables and find out what we are working with.

>>> df.describe()

👁 Summary statistics of numeric variables

Summary statistics of numeric variables

The results come as no surprise, as this is exactly how the variables were defined when initializing the random DataFrame. After inspecting the table with the subtypes, we can manually change the type of number_of_products to uint8 as this is the smallest data subtype that would work for this variable (only positive values with max < 20). I also downcast the total_amount as we deal with a float here. As the variable still has a high degree of precision, which in this case I do not need (rounded to 2 decimals before), it does not make sense to keep it this way, so I further change the type to float16.

>>> df.number_of_products = df.number_of_products.astype('uint8')
>>> df.total_amount = df.total_amount.apply(pd.to_numeric, downcast='float')
>>> print('Memory used:', memory_usage(df_copy), 'Mb')

Memory used: 230.93 Mb

>>> type(df.total_amount[0])

numpy.float64

>>> df.total_amount = df.total_amount.astype('float16')
>>> print('Memory used:', memory_usage(df), 'Mb')

Memory used: 225.21 Mb

With all these operations we managed to reduce the size of the DataFrame by a little over 10Mb, which is not exactly what I promised. But let’s continue!

Optimizing object variables

👁 Overview of 'object' type variables

Overview of ‘object’ type variables

By inspecting the variables with the type object we see a few things. First, the Ids are unique, as expected. Second, the dates take not so many distinct values. however, we want to keep them in datetime format as this facilitates easier EDA. So we will not transform them (which of course is possible). Finally, the day_of_week , which logically can take only a few distinct values. What we can do here is transform it into a different variable type – category. For anyone familiar with R, this will be the equivalent of afactor. The idea is pretty straightforward, the string variables are coded as integers and by using a special mapping dictionary can be decoded back into their original form. This is useful when we are dealing with a limited number of distinct string values (such as repeating days of the week, months, etc.). A useful rule of thumb is to consider using a categorical data type for variables with the ratio of unique values to the total number of observations lower than 50%. Otherwise, we might end up using more memory than in the first place.

>>> df.day_of_week = df.day_of_week.astype('category')
>>> df.user_id = df.user_id.astype('uint32')
>>> df.order_id = df.order_id.astype('uint32')

>>> print('Memory used:', memory_usage(df_copy), 'Mb')

Memory used: 49.59 Mb

>>> df_copy.memory_usage(deep=True) / 1024 ** 2

Index 0.000076
user_id 3.814697
order_id 3.814697
order_date 38.146973
number_of_products 0.953674
total_amount 1.907349
day_of_week 0.954408
dtype: float64

Well, we reduce the size of the DataFrame from 237.6 Mb to 49.59 Mb, which is approximately 80%. But when we only consider variables that were modified, then the result is actually 94%! I would say this is a job well done 🙂

One more tip would be to consider the types when loading data into Python via pandas.read_csv(). Supposing the sample DataFrame I created was in a CSV file, we could create a dictionary containing information about data types and pass it to the function.

column_dtypes = {'day_of_week': 'category',
 'user_id': 'uint32',
 'order_id': 'uint32',
 'total_amount': 'float16',
 'number_of_products': 'uint8'}

df = pd.read_csv('sample_dataframe.csv', dtype = column_dtypes)

As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments. I hope you enjoyed this article and will find these tricks handy in your projects!

Written By

Eryk Lewinson

See all from Eryk Lewinson

Data Science, Optimization, Pandas, Programming, Python

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/make-working-with-large-dataframes-easier-at-least-for-your-memory-6f52b5f4b5c4/