Finding common rows (intersection) in two pandas DataFrames

Suppose I have two DataFrames (called df1 and df2) in this format:

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

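I want to get a DataFrame of all rows whose user_id appears in both df1 and df2 (i.e., if a user_id is in both df1 and df2, include the two rows in the output DataFrame).

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each DataFrame, create a set of each, find their intersection, filter the two DataFrames with the resulting set, and concatenate the two filtered DataFrames.

Maybe that's the best approach, but I know pandas is clever. Is there a simpler way to do this? I've looked at merge, but I don't think that's what I need.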

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL


In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2


In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)


In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool


In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

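In fact, it won't give the expected output if their row indices are not equal, because & aligns the two boolean Series on row index before the combined mask is applied to both frames. Even with equal indices, the combined mask drops a row from one frame whenever the same-labelled row in the other frame has no match (rows 1, 5 and 9 of df1 above); computing a separate mask for each frame avoids this.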
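
One way around the index-alignment problem is to compute a separate mask for each frame and combine the filtered frames with pd.concat (DataFrame.append was removed in pandas 2.0). This is only a minimal sketch: the sample data is made up, and the names common1, common2 and result are my own, not from the session above.

```python
import pandas as pd

# Illustrative data, not the frames from the session above.
df1 = pd.DataFrame({"user_id": ["a", "b", "a", "c"], "rating": [2, 1, 0, 4]})
df2 = pd.DataFrame({"user_id": ["c", "d", "a"], "rating": [3, 1, 5]})

# Each mask is computed against its own frame, so the row indices
# of df1 and df2 never need to line up (or even have the same length).
common1 = df1[df1["user_id"].isin(df2["user_id"])]
common2 = df2[df2["user_id"].isin(df1["user_id"])]

# ignore_index=True renumbers the rows, avoiding duplicate row labels.
result = pd.concat([common1, common2], ignore_index=True)
print(result)
```

Drop ignore_index=True if you would rather keep the original row labels of each frame.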

In SQL, this problem could be solved by several methods:

select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)

Or join and then unpivot (possible in SQL Server):

select
    df1.user_id,
    c.rating
from df1
    inner join df2 on df2.user_id = df1.user_id
    outer apply (
        select df1.rating union all
        select df2.rating
    ) as c

The second one could be written in pandas with something like:

>>> import pandas as pd
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
   user_id  rating
0        3      20
0        3      30
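
The "unpivot" step can also be expressed with DataFrame.melt instead of two renames and a concat. The following is a sketch under the same sample data; the names merged, unpivoted and the temporary "source" column are my own choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"user_id": [1, 2, 3], "rating": [10, 15, 20]})
df2 = pd.DataFrame({"user_id": [3, 4, 5], "rating": [30, 35, 40]})

# Inner join keeps only the user_ids present in both frames.
merged = pd.merge(df1, df2, on="user_id", suffixes=("_1", "_2"))

# melt stacks rating_1 and rating_2 into a single "rating" column;
# the helper column recording which frame a value came from is dropped.
unpivoted = merged.melt(
    id_vars="user_id",
    value_vars=["rating_1", "rating_2"],
    var_name="source",
    value_name="rating",
).drop(columns="source")
print(unpivoted)
```

Keep in mind that when a user_id occurs several times in either frame, the join produces one row per pairing, so ratings get repeated in the output.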

My understanding is that this question is better answered over in this post.

But briefly, the answer to the OP with this method is simply:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

This gives s1 with 5 columns: user_id, plus the other two columns from each of df1 and df2 (suffixed _x and _y by default).

You can do this for n DataFrames and k columns by using pd.Index.intersection:
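
To make the resulting layout concrete, here is a small sketch with made-up data in the question's three-column format (the values are placeholders, not from the original post):

```python
import pandas as pd

# Hypothetical sample frames in the question's format.
df1 = pd.DataFrame({"user_id": ["u1", "u2"],
                    "business_id": ["b1", "b2"],
                    "rating": [4, 5]})
df2 = pd.DataFrame({"user_id": ["u2", "u3"],
                    "business_id": ["b3", "b1"],
                    "rating": [3, 1]})

# Inner merge: one output row per matching (df1 row, df2 row) pair.
s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

# Non-key columns that exist in both frames get the default suffixes.
print(s1.columns.tolist())
```

Note that this returns one wide row per matching pair rather than the original rows of each frame stacked on top of each other.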

import pandas as pd
from functools import reduce
from typing import Union


def dataframe_intersection(
    dataframes: list[pd.DataFrame], by: Union[list, str]
) -> list[pd.DataFrame]:
    set_index = [d.set_index(by) for d in dataframes]
    index_intersection = reduce(pd.Index.intersection, [d.index for d in set_index])
    intersected = [df.loc[index_intersection].reset_index() for df in set_index]

    return intersected


df1 = pd.DataFrame({"user_id":[1,2,3], "business_id": ['a', 'b', 'c'], "rating":[10, 15, 20]})
df2 = pd.DataFrame({"user_id":[3,4,5], "business_id": ['c', 'd', 'e'], "rating":[30, 35, 40]})
df3 = pd.DataFrame({"user_id":[3,3,3], "business_id": ['f', 'c', 'f'], "rating":[50, 70, 80]})


df_list = [df1, df2, df3]

This gives

>>> pd.concat(dataframe_intersection(df_list, by='user_id'))
   user_id business_id  rating
0        3           c      20
0        3           c      30
0        3           f      50
1        3           c      70
2        3           f      80


And

>>> pd.concat(dataframe_intersection(df_list, by=['user_id', 'business_id']))
   user_id business_id  rating
0        3           c      20
0        3           c      30
0        3           c      70