Anti-Join Pandas

I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are unique in table A and B however in some cases a Key will occur in both table A and B).

I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.

I am familiar with R and this is the code I would use to do this in R.

library("dplyr")


## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")


## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)

How would I achieve this in python?

63040 次浏览

Consider the following dataframes

TableA = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('abcd'), name='Key'),
['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('aecf'), name='Key'),
['A', 'B', 'C']).reset_index()

TableA

enter image description here


TableB

enter image description here

This is one way to do what you want

Method 1

# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)


# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)

enter image description here

Method 2

rows = []
for i, row in TableB.iterrows():
if row.Key not in TableA.Key.values:
rows.append(row)


pd.concat([TableA.T] + rows, axis=1).T

Timing

4 rows with 2 overlap

Method 1 is much quicker

enter image description here

10,000 rows 5,000 overlap

loops are bad

enter image description here

You'll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.

Then, we want to merge the rows in TableA with the rows in TableB that don't match any in TableA for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other sB if sB's values don't match sA's. The following code solves this exercise:

import pandas as pd


TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])


removeTheseIndexes = []
keyColumnA = TableA.iloc[:,1] # your 'Key' column here
keyColumnB = TableB.iloc[:,1] # same


for i in range(0, len(keyColumnA)):
firstValue = keyColumnA[i]
for j in range(0, len(keyColumnB)):
copycat = keyColumnB[j]
if firstValue == copycat:
removeTheseIndexes.append(j)


TableB.drop(removeTheseIndexes, inplace = True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)

Note this affects TableB's data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.

# Table A
0  1   2
0  2  3   4
1  5  6   7
2  8  9  10


# Table B
0   1  2
0  1   3  4
1  5   7  8
2  9  10  0


# Set 'Key' column = 1
# Run the script after the loop


# Table A
0   1   2
0  2   3   4
1  5   6   7
2  8   9  10
3  5   7   8
4  9  10   0


# Table B
0   1  2
1  5   7  8
2  9  10  0

I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

import pandas as pd
import numpy as np


TableA = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('abcd'), name='Key'),
['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
pd.Index(list('aecf'), name='Key'),
['A', 'B', 'C']).reset_index()


print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')


TableB_only = pd.merge(
TableA, TableB,
how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
'_merge == "right_only"')


print('TableB_only', TableB_only, sep='\n')


Table_concatenated = pd.concat((TableA, TableB_only), join='inner')


print('Table_concatenated', Table_concatenated, sep='\n')

Which prints this output:

TableA
Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
TableB
Key         A         B         C
0   a  0.754538  0.692902  0.537704
1   e  0.499092  0.864145  0.004559
2   c  0.082087  0.682573  0.421654
3   f  0.768914  0.281617  0.924693
TableB_only
Key  A_foo  B_foo  C_foo         A         B         C      _merge
4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only
5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_only
Table_concatenated
Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
4   e  0.499092  0.864145  0.004559
5   f  0.768914  0.281617  0.924693

Easiest answer imaginable:

tableB = pd.concat([tableB, pd.Series(1)], axis=1)
mergedTable = tableA.merge(tableB, how="left" on="key")


answer = mergedTable[mergedTable.iloc[:,-1].isnull()][tableA.columns.tolist()]

Should be the fastest proposed as well.

Based on one of the other suggestions, here's a function that should do it. Using only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()] to output = merged.loc[~merged.dummy_col.isna(),tableA.columns.tolist()] you have a semi_join.

def anti_join(tableA,tableB,on):


#if joining on index, make it into a column
if tableB.index.name is not None:
dummy = tableB.reset_index()[on]
else:
dummy = tableB[on]


#create a dummy columns of 1s
if isinstance(dummy, pd.Series):
dummy = dummy.to_frame()


dummy.loc[:,'dummy_col'] = 1


#preserve the index of tableA if it has one
if tableA.index.name is not None:
idx_name = tableA.index.name
tableA = tableA.reset_index(drop = False)
else:
idx_name = None


#do a left-join
merged = tableA.merge(dummy,on=on,how='left')


#keep only the non-matches
output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()]


#reset the index (if applicable)
if idx_name is not None:
output = output.set_index(idx_name)


return(output)

indicator = True in merge command will tell you which join was applied by creating new column _merge with three possible values:

  • left_only
  • right_only
  • both

Keep right_only and left_only. That is it.

outer_join = TableA.merge(TableB, how = 'outer', indicator = True)


anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)




easy!

Here is a comparison with a solution from piRSquared:

1) When run on this example matching based on one column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns - my solution works just as fine as with one column.

So it's up for you to decide.

enter image description here

One liner

TableA.append(TableB.loc[~TableB.Key.isin(TableA.Key)], ignore_index=True)

%%timeit gives about the same timing as the accepted answer.