Pandas left outer join results in table larger than left table

From my understanding of a left outer join, the resulting table should never have more rows than the left table... Please let me know if this is wrong...

My left table is 192572 rows and 8 columns.

My right table is 42160 rows and 5 columns.

My left table has a field called 'id' which matches a column in my right table called 'key'.

Therefore I merge them as such:

combined = pd.merge(a,b,how='left',left_on='id',right_on='key')

But then the combined shape is 236569.

What am I misunderstanding?

You can expect this increase whenever a key matches more than one row in the other DataFrame:

In [11]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])


In [12]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])


In [13]: df.merge(df2, how='left')  # merges on columns A
Out[13]:
   A  B   C
0  1  3   5
1  1  3   6
2  2  4 NaN

To avoid this behavior, drop the duplicates in df2:

In [21]: df2.drop_duplicates(subset=['A'])  # you can use take_last=True
Out[21]:
   A  C
0  1  5


In [22]: df.merge(df2.drop_duplicates(subset=['A']), how='left')
Out[22]:
   A  B   C
0  1  3   5
1  2  4 NaN
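In current pandas the old `take_last=True` option has been replaced by `keep='last'`; a self-contained sketch of the deduplicate-then-merge approach:

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

# Keep only one row per key in the right table before merging;
# keep='first' (the default) or keep='last' chooses which duplicate survives.
deduped = df2.drop_duplicates(subset=['A'], keep='last')
combined = df.merge(deduped, how='left')

# The result now has exactly as many rows as the left table.
assert len(combined) == len(df)
```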

There are also strategies you can use to avoid this behavior that don't involve losing the duplicated data if, for example, not all columns are duplicated. If you have

In [1]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])


In [2]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

One approach is to take the mean of the duplicates (you could also take the sum, etc.):

In [3]: df3 = df2.groupby('A').mean().reset_index()


In [4]: df3
Out[4]:
   A    C
0  1  5.5


In [5]: merged = pd.merge(df,df3,on=['A'], how='outer')


In [6]: merged
Out[6]:
   A  B    C
0  1  3  5.5
1  2  4  NaN
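The session above as one self-contained script (a sketch: `as_index=False` replaces the `reset_index()` step, and `how='left'` gives the same result here as the outer join):

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

# Collapse duplicate keys in df2 by averaging C per value of A
# (sum(), max(), etc. work the same way).
df3 = df2.groupby('A', as_index=False).mean()

merged = pd.merge(df, df3, on='A', how='left')
# Row count matches the left table; C holds the averaged value.
```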

Alternatively, if you have non-numeric data that cannot be converted with pd.to_numeric(), or if you simply don't want to take the mean, you can disambiguate the merge key by enumerating the duplicates. This strategy also applies when duplicates exist in both datasets (which causes the same problematic behavior and is a common problem as well):

In [7]: df = pd.DataFrame([['a', 3], ['b', 4],['b',0]], columns=['A', 'B'])


In [8]: df2 = pd.DataFrame([['a', 3], ['b', 8],['b',5]], columns=['A', 'C'])


In [9]: df['count'] = df.groupby('A')['B'].cumcount()


In [10]: df['A'] = np.where(df['count']>0,df['A']+df['count'].astype(str),df['A'].astype(str))


In [11]: df
Out[11]:
    A  B  count
0   a  3      0
1   b  4      0
2  b1  0      1

Do the same for df2, drop the count variables in df and df2 and merge on 'A':

In [16]: merged
Out[16]:
    A  B  C
0   a  3  3
1   b  4  8
2  b1  0  5
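Filling in the elided df2 step, the whole enumeration strategy as one self-contained script (a sketch; the helper name `enumerate_key` is mine, not from the answer, but the body is exactly the cumcount/np.where trick above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 3], ['b', 4], ['b', 0]], columns=['A', 'B'])
df2 = pd.DataFrame([['a', 3], ['b', 8], ['b', 5]], columns=['A', 'C'])

def enumerate_key(frame, key, value_col):
    # Suffix the 2nd, 3rd, ... occurrence of each key with its running
    # count, so 'b', 'b' becomes 'b', 'b1' and the key is unique.
    count = frame.groupby(key)[value_col].cumcount()
    frame = frame.copy()
    frame[key] = np.where(count > 0,
                          frame[key] + count.astype(str),
                          frame[key].astype(str))
    return frame

# Enumerate duplicates in both tables, then merge on the unique key.
merged = pd.merge(enumerate_key(df, 'A', 'B'),
                  enumerate_key(df2, 'A', 'C'),
                  on='A', how='left')
```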

A couple of notes. In this last case I use .cumcount() instead of .duplicated() because you may have more than one duplicate for a given observation. I also use .astype(str) to convert the count values to strings because of the np.where() command, but using pd.concat() or something else might allow for different applications.

Finally, if only one dataset has the duplicates but you still want to keep them, you can use the first half of the latter strategy to differentiate the duplicates in the resulting merge.

As a small addition to the answers given: merge has a parameter called validate which can be used to raise an error if there are duplicate key matches in the right table:

combined = pd.merge(a,b,how='left',left_on='id',right_on='key', validate = 'm:1')
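A minimal sketch showing validate in action: with duplicate keys in the right table, an 'm:1' validation raises pandas.errors.MergeError instead of silently growing the result (the small left/right frames here are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2]})
right = pd.DataFrame({'key': [1, 1], 'val': ['x', 'y']})  # duplicate key 1

try:
    pd.merge(left, right, how='left', left_on='id', right_on='key',
             validate='m:1')
    raised = False
except pd.errors.MergeError:
    # Raised because the right table is not unique on 'key'.
    raised = True
```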

Using drop_duplicates in your case would be:

merged = pd.merge(df,df3,on=['A'], how='outer').drop_duplicates()

There could be multiple entries with the same key value(s). Make sure there are no duplicates with respect to the key in the right table.

# One workaround could be to remove duplicates from the right table w.r.t. the key.


combined = pd.merge(a.reset_index(),b.drop_duplicates(['key']),how='left',left_on='id',right_on='key')