In pandas, is inplace = True considered harmful, or not?

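This question has been discussed before, but the answers I found contradict each other.

What I'd like to know is:

  • Why is inplace = False the default behavior?
  • When is it good to change it? (Well, I'm allowed to change it, so I guess there's a reason.)
  • Is this a safety issue? That is, can an operation fail or misbehave because of inplace = True?
  • Can I know in advance whether a given inplace = True operation will "really" be carried out in place?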

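My take so far:

  • Many pandas operations have an inplace parameter that always defaults to False, meaning the original DataFrame is left unchanged and the operation returns a new DF.
  • When inplace = True is set, the operation might work on the original DF, but it might still work on a copy behind the scenes and simply reassign the reference when done.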
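
A minimal sketch of the two calling conventions described above, assuming a small throwaway frame (the names new_df and result are only illustrative). It shows what each call returns and what happens to the original object; it says nothing about whether pandas copies data internally.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Default (inplace=False): a new DataFrame comes back, df is untouched.
new_df = df.sort_values("a")
print(new_df is df)       # False
print(df["a"].tolist())   # [3, 2, 1]

# inplace=True: the method returns None and df itself is modified.
result = df.sort_values("a", inplace=True)
print(result)             # None
print(df["a"].tolist())   # [1, 2, 3]
```

Because the in-place call returns None, writing df = df.sort_values("a", inplace=True) would replace df with None, which is a common slip.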

Pros of inplace = True:

  • Can be both faster and less memory hogging (the first link shows reset_index() runs twice as fast and uses half the peak memory!).

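Pros of inplace = False:

  • Supports chained/functional syntax: df.dropna().rename().sum()... which is nice, and offers an opportunity for lazy evaluation or more efficient reordering (though I don't think pandas does this).
  • When inplace = True is used on an object that may be a slice/view of an underlying DF, pandas has to perform a SettingWithCopy check, which is expensive. inplace = False avoids this.
  • Consistent and predictable behavior behind the scenes.

So, setting aside the copy-vs-view issue, it seems more performant to always use inplace = True unless you are specifically writing a chained statement. But that's not the default pandas chose, so what am I missing?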

If inplace was the default then the DataFrame would be mutated for all names that currently reference it.

A simple example, say I have a df:

import pandas as pd
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

Now it's very important that the DataFrame retains that row order - let's say it's from a data source where insertion order is key, for instance.

However, I now need to do some operations which require a different sort order:

def f(frame):
    df = frame.sort_values('a')
    # if we did frame.sort_values('a', inplace=True) here without
    # making it explicit - our caller is going to wonder what happened
    # do something
    return df

That's fine - my original df remains the same. However, if inplace=True were the default, my original df would now be sorted as a side effect of f(). I'd then have to trust whoever wrote f() to remember not to modify my data in place by accident, rather than having them opt in to in-place modification deliberately. So it's better that anything that can mutate an object in place does so explicitly, to at least make it more obvious what's happened and why.
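
To make the side effect concrete, here is a small sketch that contrasts the two styles. The helper names f_copy and f_mutating and the extra name alias are hypothetical, introduced only for this illustration.

```python
import pandas as pd


def f_copy(frame):
    # Out-of-place: the caller's frame keeps its row order.
    return frame.sort_values("a")


def f_mutating(frame):
    # In-place: sorts the very object the caller passed in.
    frame.sort_values("a", inplace=True)
    return frame


df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})
alias = df  # a second name bound to the same object

f_copy(df)
order_after_copy = df["a"].tolist()         # still [3, 2, 1]

f_mutating(df)
order_after_mutation = alias["a"].tolist()  # [1, 2, 3], visible through every name
```

The second call changes what both df and alias show, even though the caller never asked for a reordering.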

Even with basic Python builtin mutables, you can observe this:

data = [3, 2, 1]


def f(lst):
    lst.sort()
    # I meant lst = sorted(lst)
    for item in lst:
        print(item)


f(data)


for item in data:
    print(item)


# huh!? What happened to my data - why's it not 3, 2, 1?

In pandas, is inplace = True considered harmful, or not?

Yes, it is. Not just harmful, quite harmful. There is a GitHub issue proposing that the inplace argument be deprecated API-wide sometime in the near future. In a nutshell, here's everything wrong with the inplace argument:

  • inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
  • inplace does not work with method chaining
  • inplace can lead to the dreaded SettingWithCopyWarning when called on a DataFrame column, and may sometimes fail to update the column in-place

The pain points above are all common pitfalls for beginners, so removing this option would simplify the API greatly.


Let's take a look at the points above in more depth.

Performance
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In general, there are no performance benefits to using inplace=True (there are rare exceptions, but these are mostly a result of implementation details in the library and should not be used as a crutch to advocate for this argument's usage). Most in-place and out-of-place versions of a method create a copy of the data anyway, with the in-place version automatically assigning the copy back. The copy cannot be avoided.

Method Chaining
inplace=True also hinders method chaining. Contrast the chained form

result = df.some_function1().reset_index().some_function2()

As opposed to

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
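
The same contrast with real pandas methods, as a sketch: the frame and the names chained and temp are made up for this illustration. Both routes are expected to produce the same result; the difference is only in how the code reads.

```python
import pandas as pd

df = pd.DataFrame({"a": [3.0, None, 1.0], "b": ["x", "y", "z"]})

# Chained: each call returns a new frame, so the steps compose.
chained = (
    df.dropna()
      .sort_values("a")
      .reset_index(drop=True)
)

# In-place: each call returns None, so every step needs its own statement.
temp = df.copy()
temp.dropna(inplace=True)
temp.sort_values("a", inplace=True)
temp.reset_index(drop=True, inplace=True)

print(chained.equals(temp))  # True
```

Trying to chain off an in-place call, as in df.dropna(inplace=True).sort_values("a"), raises AttributeError because the first call returns None.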

Unintended Pitfalls
One final caveat to keep in mind is that calling a method with inplace=True can trigger the SettingWithCopyWarning:

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})


df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame

This can cause unexpected behavior: the replacement may not actually update the column of df2.
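
One common way to sidestep this, sketched under the assumption that df2 is meant to be an independent frame: take an explicit copy of the filtered rows and assign the replaced column back instead of passing inplace=True. This is one possible fix, not the only one.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"]})

# Take an explicit copy so df2 is unambiguously its own object.
df2 = df[df["a"] > 1].copy()

# Assign the result back instead of using inplace=True on the column.
df2["b"] = df2["b"].replace({"x": "abc"})

print(df2["b"].tolist())  # ['abc', 'y']
print(df["b"].tolist())   # ['x', 'y', 'z']
```

The original df is left untouched, and the update to df2 no longer depends on whether the column selection was a view or a copy.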