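Finding the intersection between two Series in pandas

I have two Series, s1 and s2, in pandas and want to compute their intersection, i.e. the values that both Series have in common.

How would I use the concat function to do this? I have been trying to work it out but have been unable to. (I don't want the intersection of the indices of s1 and s2, but of the values.)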

Place both Series in Python's set container, then use the set intersection method:

set(s1).intersection(set(s2))

and then convert the result back to a list if needed.

I just noticed the pandas tag. The result can be converted back to a Series:

pd.Series(list(set(s1).intersection(set(s2))))

Following the comments, I have changed this to a more Pythonic expression, which is shorter and easier to read:

pd.Series(list(set(s1) & set(s2)))

This should do the trick, unless the index data is also important to you.

I added the list(...) call to convert the set before passing it to pd.Series, because pandas does not accept a set as direct input for a Series.
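As a minimal sketch of the set approach, using sample data chosen only for illustration: a set has no defined order, so this version sorts the result before building the Series. The sorting step is an addition here, not part of the answer above.

```python
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# The set intersection drops the index and any duplicate values.
# sorted() is added here so that the output order is deterministic.
common = pd.Series(sorted(set(s1) & set(s2)))
print(common.tolist())  # [5, 42]
```

The original index of s1 and s2 is lost; the new Series gets a default 0..n-1 index.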

If you are using pandas, I assume you are also using NumPy. NumPy has a function, intersect1d, that works with a pandas Series.

Example:

pd.Series(np.intersect1d(pd.Series([1,2,3,5,42]), pd.Series([4,5,6,20,42])))

will return a Series with the values 5 and 42.
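It may help to see how np.intersect1d treats unsorted input with repeated values. The sketch below uses made-up data; np.intersect1d returns the sorted, unique values common to both inputs.

```python
import numpy as np
import pandas as pd

a = pd.Series([42, 5, 5, 3])
b = pd.Series([5, 42, 42, 7])

# intersect1d de-duplicates both inputs and returns the common values sorted.
result = pd.Series(np.intersect1d(a.values, b.values))
print(result.tolist())  # [5, 42]
```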

Setup:

s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])

Timings:

%%timeit
pd.Series(list(set(s1).intersection(set(s2))))
10000 loops, best of 3: 57.7 µs per loop


%%timeit
pd.Series(np.intersect1d(s1,s2))
1000 loops, best of 3: 659 µs per loop


%%timeit
pd.Series(np.intersect1d(s1.values,s2.values))
10000 loops, best of 3: 64.7 µs per loop

So the NumPy solution can be comparable to the set solution even for small Series, provided the underlying values are passed explicitly.
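Outside IPython, where %%timeit is not available, a comparison along the same lines could be sketched with the standard timeit module. The candidates dictionary and the loop counts are assumptions made for this example, and absolute numbers will vary with machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# The three variants timed above, wrapped as zero-argument callables.
candidates = {
    "set": lambda: pd.Series(list(set(s1).intersection(set(s2)))),
    "intersect1d(Series)": lambda: pd.Series(np.intersect1d(s1, s2)),
    "intersect1d(values)": lambda: pd.Series(np.intersect1d(s1.values, s2.values)),
}

loops = 200
timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, converted to seconds per single call.
    timings[name] = min(timeit.repeat(func, number=loops, repeat=3)) / loops
    print(f"{name}: {timings[name] * 1e6:.1f} µs per loop")
```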

Python

s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])


s1[s1.isin(s2)]

R

s1  <- c(4,5,6,20,42)
s2 <- c(1,2,3,5,42)


s1[s1 %in% s2]

Edit: This doesn't handle duplicates: a value repeated in s1 is repeated in the result.
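The duplicate caveat can be seen with a small example. One possible workaround, sketched here and not part of the original answer, is to chain drop_duplicates() onto the isin filter.

```python
import pandas as pd

s1 = pd.Series([4, 5, 5, 42, 42])
s2 = pd.Series([5, 42, 1])

# isin keeps every matching element of s1, including repeats.
kept = s1[s1.isin(s2)]
print(kept.tolist())  # [5, 5, 42, 42]

# drop_duplicates keeps the first occurrence of each value and its index label.
deduped = kept.drop_duplicates()
print(deduped.tolist())  # [5, 42]
```

Unlike the set-based approaches, this keeps the index labels from s1.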

pd.merge can be used:

pd.merge(series1, series2, how='inner').drop_duplicates()

Note that the result is a DataFrame, and that pd.merge only accepts Series that have a name.
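A minimal sketch of the merge approach, assuming both Series are given the same name so that pd.merge can join on it. The name "value" is arbitrary and chosen for this example.

```python
import pandas as pd

# Both Series share a name, so the merge joins on that single common column.
series1 = pd.Series([4, 5, 6, 20, 42], name="value")
series2 = pd.Series([1, 2, 3, 5, 42], name="value")

merged = pd.merge(series1, series2, how="inner").drop_duplicates()

# Pull the column back out to get a Series again.
common = merged["value"]
print(common.tolist())  # [5, 42]
```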

Here's another solution that checks inclusion in both directions, left in right and right in left:

import pandas as pd


def intersect(left, right):
    left, right = pd.Series(pd.unique(left)), pd.Series(pd.unique(right))
    right = right.loc[right.isin(left)]
    left = left.loc[left.isin(right)]
    return pd.Series(pd.unique(left))


left = pd.Series([1,2,pd.NA, pd.NA, pd.NA], index=[*"abcde"], dtype="Int32")
right = pd.Series([pd.NA, pd.NA, 1, 3], index=[*"efgh"], dtype="Int32")
intersect(left, right)

This has 2 major advantages:

  • It works with pandas Int32 and other nullable data types. If your columns contain pd.NA then np.intersect1d throws an error!

  • It keeps the pandas dtype intact
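The dtype claim can be checked with a short sketch that repeats the function from this answer so the block runs on its own. How pd.NA itself is matched by isin may depend on the pandas version, so the example only inspects the dtype and the non-missing values.

```python
import pandas as pd


def intersect(left, right):
    # Same logic as the function above: de-duplicate, then filter both ways.
    left, right = pd.Series(pd.unique(left)), pd.Series(pd.unique(right))
    right = right.loc[right.isin(left)]
    left = left.loc[left.isin(right)]
    return pd.Series(pd.unique(left))


left = pd.Series([1, 2, pd.NA, pd.NA, pd.NA], index=[*"abcde"], dtype="Int32")
right = pd.Series([pd.NA, pd.NA, 1, 3], index=[*"efgh"], dtype="Int32")

result = intersect(left, right)
print(result.dtype)              # Int32
print(result.dropna().tolist())  # [1]
```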