Python: Pandas Series - Why use loc?

小开

Best answer

Explicit is better than implicit.

df[boolean_mask] selects the rows where boolean_mask is True, but there is a corner case where you might not want it to: when df has boolean-valued column labels:
```
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
   False  True
0      3      1
1      4      2
2      5      3
```
You might want df[[True]] to select the True column. Instead, it raises a ValueError:
```
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
```
Compare that with using loc:
```
In [231]: df.loc[[True]]
Out[231]:
   False  True
0      3      1
```
In contrast, the following does not raise a ValueError, even though the structure of df2 is almost the same as that of df above:
```
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
   A  B
0  1  3
1  2  4
2  3  5


In [259]: df2[['B']]
Out[259]:
   B
0  3
1  4
2  5
```
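Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask], because the meaning of the df.loc syntax is explicit. With df.loc[indexer] you know immediately that df.loc is selecting rows. In contrast, it is not clear whether df[indexer] will select rows or columns (or raise a ValueError) without knowing the details of indexer and df.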
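To make the ambiguity concrete, here is a minimal sketch. The frame `demo` is a made-up stand-in for df2, not part of the original answer. The same `[]` syntax selects columns when given a list of labels and rows when given a list of booleans, whereas `.loc` with a single indexer addresses rows.

```python
import pandas as pd

# Hypothetical frame with the same shape as df2 in the answer.
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

by_label = demo[['B']]                  # list of labels   -> selects columns
by_mask = demo[[True, False, True]]     # list of booleans -> selects rows
by_loc = demo.loc[[True, False, True]]  # .loc, one indexer -> selects rows

print(by_label.shape)           # (3, 1)
print(by_mask.shape)            # (2, 2)
print(by_mask.equals(by_loc))   # True
```

With ordinary string column labels the two boolean selections agree. The difference only shows up in corner cases such as boolean column labels.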
df.loc[row_indexer, column_index] can select rows and columns. df[indexer] can only select rows or columns depending on the type of values in indexer and the type of column values df has (again, are they boolean?).
```
In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0    3
2    5
Name: B, dtype: int64
```
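The same idea extends to several columns at once. The sketch below assumes a hypothetical frame `demo` shaped like df2 and a mask built from one of its columns. It picks rows by mask and columns by label in a single `.loc` call.

```python
import pandas as pd

# Hypothetical frame with the same shape as df2 in the answer.
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

mask = demo['A'] != 2                 # boolean Series aligned on the row index
subset = demo.loc[mask, ['B', 'A']]   # rows by mask, columns by label (in the order given)

print(subset)
```

The column order in the result follows the order of the label list, so this is also a way to reorder columns while filtering rows.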
When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
```
In [239]: df2.loc[1:2]
Out[239]:
   A  B
1  2  4
2  3  5


In [271]: df2[1:2]
Out[271]:
   A  B
1  2  4
```
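The slicing difference is easier to see with non-integer labels. This sketch uses a hypothetical frame `demo` with a string index (not from the original answer) and compares a label slice through `.loc` with positional slices through `.iloc` and plain `[]`.

```python
import pandas as pd

# Hypothetical frame; the string index makes label vs. position explicit.
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]}, index=['x', 'y', 'z'])

inclusive = demo.loc['x':'y']   # label slice: both end-points are included
half_open = demo.iloc[0:1]      # positional slice: the stop position is excluded
bracket = demo[0:1]             # integer slice in []: positional and half-open as well

print(inclusive.index.tolist())  # ['x', 'y']
print(half_open.index.tolist())  # ['x']
print(bracket.index.tolist())    # ['x']
```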

小开

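In addition to what has already been said (the problem with True and False as column names when not using loc, the ability to select both rows and columns with loc, and the ability to slice for row and column selection), another big difference is that you can use loc to assign values to specific rows and columns. If you select a subset of the dataframe with a boolean series and then try to change a value in that selection, you will likely get the SettingWithCopy warning.

Let's say you are trying to change the "upper management" column for all rows whose salary is greater than 60000.

This: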

mask = df["salary"] > 60000
df[mask]["upper management"] = True

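raises the warning "A value is trying to be set on a copy of a slice from a DataFrame" and does not work, because df[mask] creates a copy, and updating "upper management" on that copy has no effect on the original df.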

But this succeeds:

mask = df["salary"] > 60000
df.loc[mask,"upper management"] = True

Note that in both cases you could write df[df["salary"] > 60000] or df.loc[df["salary"] > 60000] directly, but I think storing the boolean condition in a variable first is cleaner.
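For a self-contained version of the working assignment, here is a minimal sketch. The frame `staff` and its salary figures are invented for illustration; only the column names follow the example above.

```python
import pandas as pd

# Hypothetical data; column names follow the answer's example.
staff = pd.DataFrame({
    'salary': [45000, 72000, 61000, 58000],
    'upper management': [False, False, False, False],
})

mask = staff['salary'] > 60000
# A single .loc call indexes rows and column together, so the
# assignment is applied to the original frame rather than a copy.
staff.loc[mask, 'upper management'] = True

print(staff['upper management'].tolist())  # [False, True, True, False]
```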

小开

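Performance considerations for multi-column "chained assignment" with and without .loc

Let me supplement the already very good answers with a look at system performance.

The question itself includes a comparison of the execution time of two pieces of code, with and without .loc. For the code samples quoted there, the execution times are roughly the same. For some other code samples, however, the execution time with and without .loc can differ considerably: by a factor of several, or more!

A common case in pandas dataframe manipulation is creating a new column derived from the values of an existing column. We might use code like the following to filter on a condition (based on the existing column) and set different values for the new column: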

df[df['mark'] >= 50]['text_rating'] = 'Pass'

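However, this kind of "chained assignment" does not work, because it can create a "copy" instead of a "view", and assigning to the new column on that "copy" does not update the original dataframe.

Two options are available:

1. We can use .loc, or
2. We can code it another way without using .loc

For the second case, for example: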

df['text_rating'][df['mark'] >= 50] = 'Pass'

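By placing the filter last (after specifying the new column name), the assignment works, and the original dataframe is updated. Note that this is still chained indexing: it relies on df['text_rating'] returning a view, which no longer holds once Copy-on-Write is enabled in pandas.

The solution using .loc is as follows: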

df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
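As a runnable illustration of the .loc approach, here is a minimal sketch. The frame `marks`, its values, and the complementary 'Fail' label are assumptions added for the example; the answer itself only shows the 'Pass' assignment.

```python
import pandas as pd

# Hypothetical marks; only the column names come from the answer.
marks = pd.DataFrame({'mark': [35, 50, 88, 49]})

passed = marks['mark'] >= 50
# The first assignment creates 'text_rating'; rows outside the mask stay missing.
marks.loc[passed, 'text_rating'] = 'Pass'
# Assumed complementary label for the remaining rows.
marks.loc[~passed, 'text_rating'] = 'Fail'

print(marks['text_rating'].tolist())  # ['Fail', 'Pass', 'Pass', 'Fail']
```

Storing the condition in `passed` avoids evaluating the comparison twice and keeps the two assignments visibly complementary.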

Now, let's see their execution time:

Without .loc:

%%timeit
df['text_rating'][df['mark'] >= 50] = 'Pass'


2.01 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With .loc:

%%timeit
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'


577 µs ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

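As we can see, with .loc the execution is more than 3 times faster!

For a more detailed explanation of "chained assignment", see the related post How to deal with SettingWithCopyWarning in pandas?, and in particular the answer by cs95. That post explains the functional differences of using .loc very well. I am only adding the difference in system performance (execution time) here.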