How do I count duplicate rows in a pandas DataFrame?

I am trying to count the number of duplicates of each distinct row in my DataFrame. For example, suppose I have a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

which gives me a df like this:

    one two
0   1   1
1   1   2
2   1   1

I think the first step is to find all the distinct unique rows, which I do with:

df.drop_duplicates()

This gives me the following df:

    one two
0   1   1
1   1   2

Now I want to take each row from the df above ([1 1] and [1 2]) and count how many times it occurs in the initial df. My result would look like this:

Row     Count
[1 1]     2
[1 2]     1

How should I do this last step?

Edit:

Here is a larger example to illustrate:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

which gives me:

    one three   two
0   True    True    True
1   True    False   False
2   True    False   False
3   False   False   True

I would like a result that tells me:

       Row           Count
[True True True]       1
[True False False]     2
[False False True]     1

You can groupby on all the columns and call size; the index indicates the duplicate values:

In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()


Out[28]:
one    three  two
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values,
                   index=df['str_list'].value_counts().index,
                   columns=['Count'])

Produces:

>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1

If the index values must be a list, you could take the above code a step further with:

df1.index = df1.index.str.split()

Produces:

           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1
df.groupby(df.columns.tolist()).size().reset_index().\
rename(columns={0:'records'})


   one  two  records
0    1    1        2
1    1    2        1

Specific to your question, as the others mentioned, a fast and easy way would be:

df.groupby(df.columns.tolist(),as_index=False).size()

If you would like to count duplicates on particular column(s):

len(df['one'])-len(df['one'].drop_duplicates())

If you want to count duplicates on the entire dataframe:

len(df)-len(df.drop_duplicates())

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

df.duplicated(subset='one', keep='first').sum()

where

subset : column label or sequence of labels (by default all of the columns are used)

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
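
For example, here is a small sketch of how the three keep options differ, using the two-column frame from the question:

import pandas as pd

df = pd.DataFrame({'one': [1., 1., 1.], 'two': [1., 2., 1.]})

print(df.duplicated(keep='first'))  # False, False, True  -> only the later repeat is flagged
print(df.duplicated(keep='last'))   # True, False, False  -> the earlier occurrence is flagged instead
print(df.duplicated(keep=False))    # True, False, True   -> every member of a duplicate group is flagged

# keep=False combined with sum() therefore counts all rows that belong to any duplicate group
print(df.duplicated(keep=False).sum())  # 2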

None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:

# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})


# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
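
As a quick sanity check (a small sketch reusing the frame from the question), the same number can also be obtained with df.duplicated().sum(), since both count every row beyond the first occurrence of its group:

import pandas as pd

df = pd.DataFrame({'one': [1., 1., 1.], 'two': [1., 2., 1.]})

dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
print(dups['count'].sum() - dups.shape[0])  # 1 (total rows minus distinct rows)
print(df.duplicated().sum())                # 1, the same answer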

I use:

used_features = [
    "one",
    "two",
    "three",
]


df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()

which gives the count of duplicated rows, and then you can analyse them via the new column, as sketched below. I didn't see such a solution here.
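
As a small follow-up sketch (reusing the numeric example data from earlier answers), the new column can be used to pull out or summarise the flagged rows:

import pandas as pd

df = pd.DataFrame({'one': [1., 1., 1., 3.],
                   'two': [1., 2., 1., 3.],
                   'three': [1., 2., 1., 2.]})
used_features = ["one", "two", "three"]

df['is_duplicated'] = df.duplicated(used_features)

# keep only the rows that were flagged as duplicates
print(df[df['is_duplicated']])

# or count unique vs. duplicated rows in one go
print(df['is_duplicated'].value_counts())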

I ran into this problem today and wanted to include NaNs, so I replace them temporarily with "" (empty string). Please comment if you do not understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, though not extensively), since pandas will infer the data type again after replacing "" with np.nan.

import pandas as pd
import numpy as np


# create test data
df = pd.DataFrame({'test': ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})


# fill null values with '' to not lose them during groupby
# groupby all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort DataFrame by "group_count" descending
df = (df.fillna('')
        .groupby(df.columns.tolist()).apply(len)
        .rename('group_count')
        .reset_index()
        .replace('', np.nan)
        .sort_values(by=['group_count'], ascending=False))
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1

To count rows in a DataFrame you can use the method value_counts (available since Pandas 1.1.0):

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 10, 20, 20, 30]})


df.value_counts().reset_index(name='counts').query('counts > 1')

Output:

   A   B  counts
0  1  10       2
1  2  20       2

If you find some counts missing, or you get the error ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements, read on:

1. Count duplicate rows with NaN entries:

The accepted solution is great and believed to have been helpful to many members. In a recent task, I found it can be further fine-tuned to support complete counting of a dataframe with NaN entries. Pandas supports missing entries or null values as NaN values. Let's see what the output is for this use case when our dataframe contains NaN entries:

  Col1  Col2 Col3 Col4
0  ABC   123  XYZ  NaN       # group #1 of 3
1  ABC   123  XYZ  NaN       # group #1 of 3
2  ABC   678  PQR  def           # group #2 of 1
3  MNO   890  EFG  abc               # group #3 of 4
4  MNO   890  EFG  abc               # group #3 of 4
5  CDE   234  567  xyz                   # group #4 of 2
6  ABC   123  XYZ  NaN       # group #1 of 3
7  CDE   234  567  xyz                   # group #4 of 2
8  MNO   890  EFG  abc               # group #3 of 4
9  MNO   890  EFG  abc               # group #3 of 4

Applying the code:

df.groupby(df.columns.tolist(),as_index=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   678  PQR  def     1
1  CDE   234  567  xyz     2
2  MNO   890  EFG  abc     4

Oh, how come the count of Group #1 with 3 duplicate rows is missing?!

For some Pandas versions, you may get an error instead: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements

Solution:

Use the parameter dropna= for the .groupby() function, as follows:

df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   123  XYZ  NaN     3          # <===  count of rows with `NaN`
1  ABC   678  PQR  def     1
2  CDE   234  567  xyz     2
3  MNO   890  EFG  abc     4

The count of duplicate rows with NaN can be successfully output with dropna=False. This parameter has been supported since Pandas version 1.1.0.


2. Alternative Solution

Another way to count duplicate rows with NaN entries is as follows:

df.value_counts(dropna=False).reset_index(name='count')

gives:

  Col1  Col2 Col3 Col4  count
0  MNO   890  EFG  abc      4
1  ABC   123  XYZ  NaN      3
2  CDE   234  567  xyz      2
3  ABC   678  PQR  def      1

Here, we use the .value_counts() function, also with the parameter dropna=False. However, this parameter has only been supported since Pandas version 1.3.0. If your version is older than this, you'll need to use the .groupby() solution if you want to get complete counts for rows with NaN entries.

You will see that the output is in a different order than the previous result. The counts are sorted in descending order. If you want an unsorted result, you can specify sort=False:

df.value_counts(dropna=False, sort=False).reset_index(name='count')

This gives the same result as the df.groupby(df.columns.tolist(), as_index=False, dropna=False).size() solution:

  Col1  Col2 Col3 Col4  count
0  ABC   123  XYZ  NaN      3
1  ABC   678  PQR  def      1
2  CDE   234  567  xyz      2
3  MNO   890  EFG  abc      4

Note that this .value_counts() solution supports dataframes both with and without NaN entries and can be used as a general solution.

In fact, in the underlying implementation, .value_counts() calls GroupBy.size to get the counts: counts = self.groupby(subset, dropna=dropna).grouper.size()

Hence, for this use case, .value_counts() and the .groupby() approach in the accepted answer are actually doing the same thing. We should be able to use the .value_counts() function to get the desired counts of duplicate rows equally well.

Using the .value_counts() function to get counts of duplicate rows has the additional benefit that its syntax is simpler. You can simply use df.value_counts() or df.value_counts(dropna=False), depending on whether your dataframe contains NaN or not. Chain with .reset_index() if you want the result as a dataframe instead of a Series.
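
For instance, a minimal sketch of that chain, reusing the small A/B frame from above:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 10, 20, 20, 30]})

counts = df.value_counts()               # a Series indexed by the row values
print(counts.reset_index(name='count'))  # back to a DataFrame with a 'count' column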

If you just need to find the count of unique and duplicate rows (whole rows duplicated), this could work:

df.duplicated().value_counts()

Output:

False    11398
True       154
dtype: int64