pandas .at versus .loc

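I've been exploring how to optimize my code and ran across the pandas .at method. The documentation describes it as:

Fast label-based scalar accessor

Similarly to loc, at provides label based scalar lookups.

So I ran some samples:

Setup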

import pandas as pd
import numpy as np
from string import ascii_letters, ascii_lowercase, ascii_uppercase


lt = list(ascii_letters)
lc = list(ascii_lowercase)
uc = list(ascii_uppercase)


def gdf(rows, cols, seed=None):
    """rows and cols are what you'd pass
    to pd.MultiIndex.from_product()"""
    gmi = pd.MultiIndex.from_product
    df = pd.DataFrame(index=gmi(rows), columns=gmi(cols))
    np.random.seed(seed)
    df.iloc[:, :] = np.random.rand(*df.shape)
    return df


seed = [3, 1415]
df = gdf([lc, uc], [lc, uc], seed)


# show only the top-left 5x5 corner
print(df.head().T.head().T)

df looks like:

            a
            A         B         C         D         E
a A  0.444939  0.407554  0.460148  0.465239  0.462691
  B  0.032746  0.485650  0.503892  0.351520  0.061569
  C  0.777350  0.047677  0.250667  0.602878  0.570528
  D  0.927783  0.653868  0.381103  0.959544  0.033253
  E  0.191985  0.304597  0.195106  0.370921  0.631576

Let's use .at and .loc and make sure I get the same thing:

print("using .loc", df.loc[('a', 'A'), ('c', 'C')])
print("using .at ", df.at[('a', 'A'), ('c', 'C')])


using .loc 0.37374090276
using .at  0.37374090276

Test speed using .loc

%%timeit
df.loc[('a', 'A'), ('c', 'C')]


10000 loops, best of 3: 180 µs per loop

Test speed using .at

%%timeit
df.at[('a', 'A'), ('c', 'C')]


The slowest run took 6.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8 µs per loop

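This looks like a huge speed increase. Even allowing for the caching caveat, the slowest run (6.11 * 8 µs, about 49 µs) is still much faster than 180 µs.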
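For readers without IPython's %%timeit, the same comparison can be sketched with the standard timeit module. This is a rough sketch, not a rigorous benchmark: the names bench_df and n are made up for illustration, the data comes from a different random generator than the setup above (so the values differ), and absolute timings will depend on the machine and the pandas version.

```python
import timeit
from string import ascii_lowercase, ascii_uppercase

import numpy as np
import pandas as pd

# Illustrative frame with the same MultiIndex shape as the question's df.
rows = pd.MultiIndex.from_product([list(ascii_lowercase), list(ascii_uppercase)])
cols = pd.MultiIndex.from_product([list(ascii_lowercase), list(ascii_uppercase)])
rng = np.random.default_rng(0)
bench_df = pd.DataFrame(rng.random((len(rows), len(cols))), index=rows, columns=cols)

row_key, col_key = ("a", "A"), ("c", "C")

# Both accessors should return the same scalar.
loc_value = bench_df.loc[row_key, col_key]
at_value = bench_df.at[row_key, col_key]

# Time n repeated lookups with each accessor.
n = 1000
loc_seconds = timeit.timeit(lambda: bench_df.loc[row_key, col_key], number=n)
at_seconds = timeit.timeit(lambda: bench_df.at[row_key, col_key], number=n)

print(f".loc: {loc_seconds / n * 1e6:.1f} µs per call")
print(f".at:  {at_seconds / n * 1e6:.1f} µs per call")
```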

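Question

What are the limitations of .at? I'm motivated to use it. The documentation says it's similar to .loc, but it doesn't behave similarly. For example: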

# small df
sdf = gdf([lc[:2]], [uc[:2]], seed)


print(sdf.loc[:, :])


          A         B
a  0.444939  0.407554
b  0.460148  0.465239

whereas print(sdf.at[:, :]) results in TypeError: unhashable type

So it's obviously not the same, even if the intent is to be similar.
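The slice behaviour can be reproduced on a tiny frame. The sketch below uses a made-up frame called small rather than gdf, and it catches a broad Exception on purpose: the exception type reported here is TypeError, but I would assume the exact type can differ between pandas and Python versions.

```python
import numpy as np
import pandas as pd

# Illustrative 2x2 frame, not the one built by gdf.
small = pd.DataFrame(
    np.arange(4.0).reshape(2, 2), index=["a", "b"], columns=["A", "B"]
)

# .loc accepts slices and returns the whole frame.
whole = small.loc[:, :]

# .at is scalar-only, so slices are rejected.
try:
    small.at[:, :]
    at_error = None
except Exception as exc:  # exact exception type may vary by version
    at_error = exc

print(type(whole).__name__)
print(type(at_error).__name__)
```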

That said, can anyone provide guidance on what can and can't be done with the .at method?


Update: df.get_value is deprecated as of version 0.21.0 and was removed in pandas 1.0. Using df.at or df.iat is the recommended method going forward.


df.at can only access a single value at a time.

df.loc can select multiple rows and/or columns.

Note that there is also df.get_value, which may be even quicker at accessing single values:

In [25]: %timeit df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 187 µs per loop


In [26]: %timeit df.at[('a', 'A'), ('c', 'C')]
100000 loops, best of 3: 8.33 µs per loop


In [35]: %timeit df.get_value(('a', 'A'), ('c', 'C'))
100000 loops, best of 3: 3.62 µs per loop

Under the hood, df.at[...] calls df.get_value, but it also does some type checking on the keys.
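Since df.get_value is deprecated, positional scalar access would go through df.iat instead. Here is a minimal sketch on a small made-up frame (the name frame and its values are illustrative, not the benchmark data); it only shows that .at (labels) and .iat (integer positions) reach the same cell.

```python
import pandas as pd

# Illustrative data, not taken from the timings above.
frame = pd.DataFrame(
    [[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    index=[4, 5, 6],
    columns=["A", "B", "C"],
)

by_label = frame.at[5, "B"]      # row label 5, column label "B"
by_position = frame.iat[1, 1]    # second row, second column

print(by_label, by_position)
```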

As you asked about the limitations of .at, here is one thing I recently ran into (using pandas 0.22). Let's use the example from the documentation:

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
df2 = df.copy()


    A   B   C
4   0   2   3
5   0   4   1
6  10  20  30

If I now do

df.at[4, 'B'] = 100

the result looks as expected

    A    B   C
4   0  100   3
5   0    4   1
6  10   20  30

However, when I try to do

df.at[4, 'C'] = 10.05

it seems that .at tries to conserve the datatype (here: int):

    A    B   C
4   0  100  10
5   0    4   1
6  10   20  30

That seems to differ from .loc:

df2.loc[4, 'C'] = 10.05

yields the desired

    A   B      C
4   0   2  10.05
5   0   4   1.00
6  10  20  30.00

The risky part of the example above is that the conversion from float to int happens silently. Trying the same with a string raises an error:

df.at[5, 'A'] = 'a_string'

ValueError: invalid literal for int() with base 10: 'a_string'

It does work, however, with a string that int() can parse, as noted by @n1k31t4 in the comments, e.g.

df.at[5, 'A'] = '123'


     A   B   C
4    0   2   3
5  123   4   1
6   10  20  30
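One way to sidestep the dtype question entirely is to make the column's dtype explicit before writing with .at. This is only a sketch of that workaround, not something the answer above prescribes; how .at itself handles a float written into an int column is, as far as I can tell, version-dependent, so the cast removes the ambiguity.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    index=[4, 5, 6],
    columns=["A", "B", "C"],
)

# Cast the target column to float first, so the float value fits its dtype.
df["C"] = df["C"].astype("float64")
df.at[4, "C"] = 10.05

print(df["C"].dtype)   # float64
print(df.at[4, "C"])   # 10.05
```

Checking df.dtypes before a scalar write is a cheap habit when the value's type might not match the column.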

.at is an optimized data access method compared to .loc.

.loc selects all the elements of a DataFrame identified by the row labels and column labels given as its arguments. .at, by contrast, selects the single element at the given row label and column label.

Also, .at takes exactly one row and one column as input, whereas .loc may take multiple rows and columns. The output of .at is a single element, while the output of .loc may be a Series or a DataFrame.
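The difference in return types can be seen on a small frame with unique labels. The frame and variable names below are made up for illustration.

```python
import pandas as pd

# Illustrative frame with unique row and column labels.
df = pd.DataFrame({"A": [1, 2, 3], "B": [11, 22, 33]}, index=["x", "y", "z"])

scalar = df.at["y", "B"]                  # one row, one column -> single value
column_part = df.loc[["x", "y"], "B"]     # several rows, one column -> Series
block = df.loc[["x", "y"], ["A", "B"]]    # several rows and columns -> DataFrame

print(scalar)
print(type(column_part).__name__)
print(type(block).__name__)
```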

Adding to the above, the pandas documentation for at states:

Access a single value for a row/column label pair.

Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series.

For setting data, loc and at are similar. For example:

df = pd.DataFrame({'A': [1,2,3], 'B': [11,22,33]}, index=[0,0,1])

Both loc and at will produce the same result:

df.at[0, 'A'] = [101,102]
df.loc[0, 'A'] = [101,102]


     A   B
0  101  11
0  102  22
1    3  33


df.at[0, 'A'] = 103
df.loc[0, 'A'] = 103


     A   B
0  103  11
0  103  22
1    3  33

Also, for accessing a single value, both behave the same:

df.loc[1, 'A']   # returns a single value (<class 'numpy.int64'>)
df.at[1, 'A']    # returns a single value (<class 'numpy.int64'>)


3

However, when the label matches multiple rows, loc will return a group of rows/cols from the DataFrame, while at will return an array of values:

df.loc[0, 'A']  # returns a Series (<class 'pandas.core.series.Series'>)


0    103
0    103
Name: A, dtype: int64


df.at[0, 'A']   # returns array of values (<class 'numpy.ndarray'>)


array([103, 103])

Moreover, loc can match a group of rows/columns and can be given only an index label, while at must also receive the column:

df.loc[0]  # returns a DataFrame view (<class 'pandas.core.frame.DataFrame'>)


     A   B
0  103  11
0  103  22

# df.at[0]  # ERROR: must receive column
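Because duplicated labels are where at stops returning a plain scalar (and, I would assume, where its return type may vary across pandas versions), one option is to check label uniqueness before relying on it. The helper at_unique below is hypothetical, written only for this sketch; it is not part of pandas.

```python
import pandas as pd


def at_unique(frame, row, col):
    """Hypothetical helper: scalar lookup that refuses non-unique labels."""
    if not (frame.index.is_unique and frame.columns.is_unique):
        raise ValueError("labels are not unique; .at may not return a scalar")
    return frame.at[row, col]


dup = pd.DataFrame({"A": [1, 2, 3], "B": [11, 22, 33]}, index=[0, 0, 1])
uniq = dup.reset_index(drop=True)  # index becomes 0, 1, 2

print(at_unique(uniq, 2, "A"))  # 3

try:
    at_unique(dup, 0, "A")
except ValueError as exc:
    print("refused:", exc)
```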