URL: https://towardsdatascience.com/explaining-the-settingwithcopywarning-in-pandas-ebc19d799d25/

Explaining the SettingWithCopyWarning in pandas

If you are wondering what causes the SettingWithCopyWarning, this is the place for you!

Photo by NeONBRAND on Unsplash

Regardless of how long you have worked with pandas, be it a day or a year, sooner or later you are likely to run into the infamous SettingWithCopyWarning. In this article, I explain what causes the problem and how to properly address it.

Warning not an error

Before I dive into the technicalities, I want to highlight that SettingWithCopyWarning is, as the name suggests, a warning and not an error. So the code we are executing will most likely not break and will still produce a result. However, that result might not be the one we actually intended to obtain.

The reason why I wanted to highlight the distinction is that we might be tempted to ignore the warning when we see that the code actually succeeds in returning a result. And as a matter of fact, the result might be correct! The best practice is to be extra careful and actually understand the underlying principles. This way, we can often save a lot of time trying to identify an obscure bug, which we could have avoided in the first place.

Views versus copies

The key concepts connected to the SettingWithCopyWarning are views and copies. Some operations in pandas (and numpy as well) return views of the original data, while others return copies.

To put it very simply, a view is a subset of the original object (DataFrame or Series) linked to the original source, while a copy is an entirely new object. A temporary copy that is never bound to a variable is thrown away as soon as the operation on it completes. A consequence of the distinction is that when we modify a view, we modify the original object as well. That does not happen with copies, as they are not connected to the original objects.
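
The distinction is easiest to see in numpy, where the rules are simple. The following is a minimal sketch (the array and variable names are made up for illustration): basic slicing returns a view, while calling copy() returns an independent array.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

# Basic slicing returns a view that shares memory with `original`.
view = original[0:3]
view[0] = 100          # this write is visible through `original` as well

# An explicit copy is an independent array.
copy = original[0:3].copy()
copy[1] = -1           # `original` is not affected by this write

print(original)                           # first element changed via the view
print(np.shares_memory(original, view))   # True
print(np.shares_memory(original, copy))   # False
```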

Having described the difference, SettingWithCopyWarning is actually letting us know that the code we have written might have done one thing, when in fact we wanted to do another. I will illustrate this with a real-life example. Imagine having a large DataFrame. For some analysis, you filter (slice) the DataFrame to only contain a subset of the full data, for example, users from a certain country. Then, you might want to modify some values in the extracted DataFrame, let’s say cap the maximum value of a feature at 100. This is the typical case in which you could encounter the infamous warning: you wanted to modify only the extracted frame, but pandas cannot tell whether the change should also reach the source data, and you might end up modifying it as well. You can easily imagine that this is not something you wanted to do and that it can lead to problems later on.
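
As a rough sketch of that scenario, assuming a hypothetical `users` frame with made-up `country` and `score` columns (none of these names come from the article's data), an explicit copy makes the intent unambiguous: only the extracted frame is capped.

```python
import pandas as pd

# Hypothetical data, for illustration only.
users = pd.DataFrame({
    "country": ["PL", "US", "PL", "DE"],
    "score": [50, 250, 180, 90],
})

# Explicit copy: `pl_users` is independent of `users`.
pl_users = users[users["country"] == "PL"].copy()

# Cap the feature at 100 in the extracted frame only.
pl_users["score"] = pl_users["score"].clip(upper=100)

print(pl_users)
print(users)  # the source data keeps its original scores
```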

Note: To check whether a frame is a view or a copy, you can inspect the internal _is_view and _is_copy attributes of a pandas.DataFrame. The first is a boolean, while the second holds either a weakref to the original DataFrame or None. Both are private, so their behavior may differ between pandas versions.

Common occurrences of the warning

In this section, I go over the most common cases when the SettingWithCopyWarning occurs in practice. I will illustrate the cases using a small custom DataFrame, as it is more than enough to understand the logic.

To prepare the data I run the following code:
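
A minimal sketch of such a helper, reconstructed from the values quoted later in the article (two rows with B above 12, and a row labeled 2 whose C value is 102). The exact numbers and the index starting at 1 are assumptions, not the author's original listing.

```python
import pandas as pd

def get_data():
    """Return a fresh copy of the toy DataFrame (values are assumed)."""
    return pd.DataFrame(
        {
            "A": [1, 2, 3, 4],
            "B": [11, 12, 13, 14],
            "C": [101, 102, 103, 104],
        },
        index=range(1, 5),  # assumed: row labels start at 1
    )

X = get_data()
print(X)
```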

Running the code prints out a small DataFrame with three columns: A, B, and C.

In order to clearly understand what is happening, for each of the cases below we will be starting with a clean slate – the result of running the get_data function.

For future reference, this article was written using pandas version 1.0.3.

1. Chained Assignment

To explain the concept of chained assignment, we will go over the building blocks one by one. The assignment operation (also known as the set operation) simply sets the value of an object. We can illustrate this by creating a list:

x_list = [1, 2, 3, 4, 5]

Even though the first example is based on lists, the same principles apply to arrays, Series, and DataFrames (as we will see in a minute). The second type of operation is called the get operation and is used for accessing and returning the value of an object. Indexing is a type of get operation; we can index the list by running x_list[0:3], which returns

[1, 2, 3]

The last building block is called chaining and essentially refers to chaining multiple indexing operations, such as x_list[0:3][1], which returns 2.
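
The three building blocks can be put together in a short sketch on plain lists. One caveat: slicing a list always returns a new list, so the analogy with pandas is only partial, since pandas may return either a view or a copy.

```python
x_list = [1, 2, 3, 4, 5]      # set: bind a new list to the name

first_three = x_list[0:3]     # get: slicing a list returns a new list
second = x_list[0:3][1]       # chaining: index the result of another get

# Chained assignment: the set acts on the temporary list created by
# x_list[0:3], which is discarded right away, so x_list is unchanged.
x_list[0:3][1] = 99
after_chained = list(x_list)

# A single set operation modifies x_list directly.
x_list[1] = 99

print(after_chained)
print(x_list)
```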

Having described all the individual pieces, by chained assignment we mean a combination of chaining and assignment. It is time to refer to our toy DataFrame. First, we slice the DataFrame to display observations with the value of the B feature higher than 12.

X = get_data()
X[X['B'] > 12]

There are only 2 rows fulfilling that criterion. Let’s replace the values of the C feature with 999.

X[X['B'] > 12]['C'] = 999
X[X['B'] > 12]['C']

Running the lines above results in the infamous warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

In the generated output, we see that the values were not replaced!


We saw the warning because we chained two indexing operations. In the background, these chained operations are executed independently. The first part is the get operation, which returns a DataFrame containing all the rows in which the value of B is higher than 12. The second part is the set operation, and is carried out on the new DataFrame created by the get operation. So we are not modifying the original DataFrame!
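
To make the two steps visible, the chained statement can be spelled out by hand. This is a sketch, not what pandas literally executes: `get_data` repeats the assumed toy data, `intermediate` is a name introduced here, and the warning is silenced only to keep the output clean. Depending on the pandas version, the set step may or may not emit a warning.

```python
import warnings

import pandas as pd

def get_data():
    # Assumed toy data; the exact values are not given in the article.
    return pd.DataFrame(
        {"A": [1, 2, 3, 4], "B": [11, 12, 13, 14], "C": [101, 102, 103, 104]},
        index=range(1, 5),
    )

X = get_data()
mask = X["B"] > 12

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence warnings for this demo only
    intermediate = X[mask]            # get: boolean indexing builds a new DataFrame
    intermediate["C"] = 999           # set: acts on `intermediate`, not on X

print(X["C"].tolist())             # X still holds the original values
print(intermediate["C"].tolist())  # only the intermediate frame changed
```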

This is quite obvious when we use two square brackets in a row. However, the same happens when we use loc, iloc, or the dot notation to access the column. For example, running X.loc[X['B'] > 12]['C'] = 999 would give the same incorrect result.

To properly replace the values within the DataFrame, we need to use loc in the following way:

X.loc[X['B'] > 12, 'C'] = 999
X[X['B'] > 12]['C']

We can see that the values were successfully replaced in the original DataFrame.

2. Hidden Chaining

Hidden chaining can be quite a tricky problem to debug, as it is often not obvious where exactly the problem lies. We will go over an example. First, let’s load the data and, using the knowledge from the previous case, create a DataFrame that is a subset of the original one. We keep only the rows with the value of the C feature larger than 101.
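
The selection step presumably looked something like the sketch below. It is a reconstruction based on the names used later in this section (`X`, `temp`), with `get_data` again standing in for the assumed toy data.

```python
import pandas as pd

def get_data():
    # Assumed toy data; the exact values are not given in the article.
    return pd.DataFrame(
        {"A": [1, 2, 3, 4], "B": [11, 12, 13, 14], "C": [101, 102, 103, 104]},
        index=range(1, 5),
    )

X = get_data()

# Output of a get operation bound to a new name: the first half of the
# hidden chain.
temp = X[X["C"] > 101]
print(temp)
```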


It often happens that we then explore and further process the new DataFrame. Let’s imagine running a few lines of code to further inspect the temp object, for example using the shape attribute or the describe and plot methods of a pandas.DataFrame.

We are not actually printing the output here, as this is not the important part. Now, after running a few more lines of code, let’s replace the value of the C feature in the first row of temp with 999:

temp.loc[2, 'C'] = 999

While doing so, we meet our old friend:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Let’s inspect the values of the C feature in both the original and extracted DataFrames:

print(f"New DataFrame: {temp.loc[2, 'C']}")
print(f"Original DataFrame: {X.loc[2, 'C']}")
# New DataFrame: 999
# Original DataFrame: 102

So what actually happened? The reason for the warning lies in the fact that chained indexing can occur across two lines, not only within one. When we created the new DataFrame, we used the output of the get operation. That might have been a copy of the original DataFrame, or it might not have been. There was no way to know until we checked. Quoting the pandas documentation on chained indexing:

Outside of simple cases, it’s very hard to predict whether it [chained indexing] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees),…

So when we indexed temp to assign the new value, we actually used chained indexing. As a result, we might have also modified X, while modifying temp.

The tricky part is that in an actual codebase, the two lines of code that are responsible for the hidden chained assignment can be separated by dozens of lines, which makes identifying the potential problem quite difficult.

To solve that problem, we can directly instruct pandas to create a copy of the original DataFrame by using the copy method. Below, we use that approach to avoid the hidden chaining:
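
A minimal sketch of that fix, using the assumed toy data from earlier (the `get_data` body is a reconstruction, not the author's listing):

```python
import pandas as pd

def get_data():
    # Assumed toy data; the exact values are not given in the article.
    return pd.DataFrame(
        {"A": [1, 2, 3, 4], "B": [11, 12, 13, 14], "C": [101, 102, 103, 104]},
        index=range(1, 5),
    )

X = get_data()

# Explicit copy: `temp` is now guaranteed to be independent of `X`.
temp = X[X["C"] > 101].copy()

# Setting a value on the copy leaves the original frame untouched.
temp.loc[2, "C"] = 999

print(f"New DataFrame: {temp.loc[2, 'C']}")
print(f"Original DataFrame: {X.loc[2, 'C']}")
```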

We see that running the code generates the correct output.

3. An example of a false negative

Lastly, we go over the case of a false negative (pandas not informing us about SettingWithCopyWarning, when it actually should have) mentioned in [3]. It happens in cases when we are using chained indexing while slicing multiple columns of the DataFrame. Let’s look at two simple cases.

X = get_data()
X.loc[X['A'] > 2, ['A', 'B']]['A'] = 999
X

and

X = get_data()
X[['A', 'B', 'C']]['A'] = 999
X

both leave X unchanged, without actually showing the SettingWithCopyWarning.

The values were not modified as we wanted. Because the indexing operation included multiple columns, the warning was not displayed. We can easily verify that this does not happen in the case of a single column: running X[['C']]['C'] = 999 produces the warning and does not modify X.

Some deeper context on the origin of the warning

We can say that pandas inherited the concept of views and copies from numpy. Under the hood, pandas uses numpy for efficient data storage and manipulation. In numpy, views and copies follow a certain set of rules and are returned in a predictable manner (see [5] for more information). So why is that not the case with pandas? The problem lies in the fact that numpy arrays are limited to a single data type. And as we know, that is not the case for pandas.
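
The single-dtype limitation is easy to observe. In this small sketch (the frame and column names are invented for illustration), each DataFrame column keeps its own dtype, but converting the whole frame to one numpy array forces a common dtype.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
df = pd.DataFrame({"ints": [1, 2, 3], "floats": [0.5, 1.5, 2.5]})

# pandas stores a separate dtype per column.
print(df.dtypes)

# A single numpy array can hold only one dtype, so the integers are
# upcast to floats when the whole frame is converted.
as_array = df.to_numpy()
print(as_array.dtype)
```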

In practice, indexing (get operation) on a multi-dtype DataFrame will always return a copy of the frame. The same operation on a single-dtype frame will almost always return a view based on a single numpy array, which is the most efficient way of approaching the problem. However, as we already saw in the quote from the documentation, returning a view depends on the memory layout of the object and is unfortunately not guaranteed.

Summing up, pandas does its best to combine its versatile approach to indexing (thanks to which it is vastly popular and basically a prerequisite to doing data science in Python) with the efficiency of the underlying numpy arrays. This results in some small nuisances. However, the trade-off is definitely worth it, and the problems can be overcome with a proper understanding of how pandas works under the hood.

Conclusions

In this article, I explained the difference between copies and views in pandas and how they are related to the infamous SettingWithCopyWarning. The main idea boils down to being aware of what chained indexing is and how to successfully avoid it. The general rules are:

  • if you want to change the original DataFrame, use a single assignment (one indexing call, such as loc with both the row and the column indexer).
  • if you want to make a copy of the DataFrame, do so explicitly using the copy method.

Following these two rules can save you a lot of time on debugging some weird cases, especially in a lengthy codebase.

It is also worth mentioning that the SettingWithCopyWarning occurs only when we are using the set operations (assigning). However, it is best to avoid chained indexing for the get operations as well. That is because chained operations are generally slower and could cause issues in case you later decide to add the set operations to the code.
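
For get operations, the chained form and the single loc call return the same values, so the switch is safe to make. A small sketch using the assumed toy data (the `get_data` body is a reconstruction, and `chained` and `single` are names introduced here):

```python
import pandas as pd

def get_data():
    # Assumed toy data; the exact values are not given in the article.
    return pd.DataFrame(
        {"A": [1, 2, 3, 4], "B": [11, 12, 13, 14], "C": [101, 102, 103, 104]},
        index=range(1, 5),
    )

X = get_data()

# Chained get: two indexing calls, with an intermediate DataFrame in between.
chained = X[X["B"] > 12]["C"]

# Single get: one loc call with both the row and the column indexer.
single = X.loc[X["B"] > 12, "C"]

print(chained.equals(single))
```

Writing reads this way also means that adding an assignment later only requires appending `= value` to the same loc expression.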

You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.

References

[1] https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

[3] https://github.com/pandas-dev/pandas/issues/9767

[4] https://www.practicaldatascience.org/html/views_and_copies_in_pandas.html

[5] https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html


Written by Eryk Lewinson
