GroupBy value counts on a pandas DataFrame

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

I want to group it by id and group and count the number of each term for that id-group pair.

So in the end I would get something like this:

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

I was able to achieve what I want by looping over all the rows with df.iterrows() and building a new DataFrame (a rough sketch of that loop is below), but this is obviously inefficient. (If it helps, I know the list of all terms beforehand, and there are about 10 of them.)
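For reference, here is a simplified sketch of the iterrows() loop I am using now (variable names are illustrative, df as constructed above):

terms = ['term1', 'term2', 'term3']  # in reality ~10 known terms

# count each term per (id, group) pair by looping over every row -- works, but slow
counts = {}
for _, row in df.iterrows():
    key = (row['id'], row['group'])
    if key not in counts:
        counts[key] = {t: 0 for t in terms}
    counts[key][row['term']] += 1

result = pd.DataFrame.from_dict(counts, orient='index')
result.index = pd.MultiIndex.from_tuples(result.index, names=['id', 'group'])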

It looks like I have to group and then count values, so I tried df.groupby(['id', 'group']).value_counts(), which does not work because value_counts operates on the groupby Series, not on the DataFrame.

Anyway, can I achieve this without looping?


You can use crosstab:

print (pd.crosstab([df.id, df.group], df.term))
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
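As a small optional extra (not part of the answer above), crosstab can also append row and column totals via its margins flag:

# adds an 'All' row and column with the totals
pd.crosstab([df.id, df.group], df.term, margins=True)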

Another solution is groupby with aggregating by size, then reshaping by unstack:

df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)


term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
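If a flat DataFrame is preferred over the MultiIndex layout, the result can be flattened afterwards (an optional extra step using standard pandas, not part of the answer above):

out = df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
out.columns.name = None    # drop the 'term' label on the columns axis
out = out.reset_index()    # turn the id/group index levels back into columns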

Timings:

df = pd.concat([df]*10000).reset_index(drop=True)


In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))
100 loops, best of 3: 12.4 ms per loop


In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0))
100 loops, best of 3: 12.2 ms per loop

Using the pivot_table() method:

In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
Out[22]:
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Timing against 700K rows DF:

In [24]: df = pd.concat([df] * 10**5, ignore_index=True)


In [25]: df.shape
Out[25]: (700000, 3)


In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 226 ms per loop


In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 236 ms per loop


In [5]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 355 ms per loop


In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 232 ms per loop


In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 231 ms per loop

Timing against 7M rows DF:

In [9]: df = pd.concat([df] * 10, ignore_index=True)


In [10]: df.shape
Out[10]: (7000000, 3)


In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 2.27 s per loop


In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 2.3 s per loop


In [13]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 3.37 s per loop


In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 2.28 s per loop


In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 1.89 s per loop

I use groupby and size:

df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0


Timing


1,000,000 rows

import numpy as np

df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))

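The timing plots from the original answer are not reproduced here; a rough way to re-run the comparison on the random data above in IPython (results will vary by machine and pandas version):

%timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
%timeit df.pivot_table(index=['id', 'group'], columns='term', aggfunc='size', fill_value=0)
%timeit pd.crosstab([df.id, df.group], df.term)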

Instead of remembering lengthy solutions, how about the one that pandas has built in for you:

df.groupby(['id', 'group', 'term']).count()
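One caveat (my note, not part of the answer): when every column is used as a grouping key, as with the sample df here, count() has no remaining column to count and returns an empty frame, so size() is the more robust spelling:

# size() counts the rows in each group regardless of which columns remain
df.groupby(['id', 'group', 'term']).size()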

If you want to use value_counts, you can use it on a given Series and resort to the following:

df.groupby(["id", "group"])["term"].value_counts().unstack(fill_value=0)

or in an equivalent fashion, using the .agg method:

df.groupby(["id", "group"]).agg({"term": "value_counts"}).unstack(fill_value=0)

Another option is to directly use value_counts on the DataFrame itself without resorting to groupby:

df.value_counts().unstack(fill_value=0)
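Note that DataFrame.value_counts was added in pandas 1.1, so this last one-liner needs a reasonably recent version; it counts unique (id, group, term) rows, and unstack then moves the last index level (term) into the columns:

# requires pandas >= 1.1; equivalent, but explicit about which level becomes the columns
df.value_counts().unstack(level='term', fill_value=0)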