是否可以做模糊匹配与蟒熊合并?

我有两个数据框架,我想合并基于一个列。然而,由于不同的拼写,不同的空格数量,不存在/发音符号的存在,我希望能够合并,只要他们彼此相似。

任何相似性算法都可以(soundex,Levenshtein,afflib’s)。

假设一个数据框架包含以下数据:

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])


number
one         1
two         2
three       3
four        4
five        5


df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])


letter
one        a
too        b
three      c
fours      d
five       e

Then I want to get the resulting DataFrame

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
104628 次浏览

Http://pandas.pydata.org/pandas-docs/dev/merging.html 没有一个钩子函数来动态执行这个操作。

我只需要执行一个单独的步骤,并在两个数据框中的一个中使用 dillib getclose _ match 来创建一个新列,并在模糊匹配列上执行 merge/join 操作

与@locojay 建议类似,你可以将 difflibget_close_matches应用于 df2的索引,然后应用 join:

In [23]: import difflib


In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>


In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])


In [26]: df2
Out[26]:
letter
one        a
two        b
three      c
four       d
five       e


In [31]: df1.join(df2)
Out[31]:
number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

.

如果这些是柱子,那么你可以用同样的方法来处理这个柱子,然后是 merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])


df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

提醒一下,这基本上是可行的,除非没有找到匹配项,或者两列中都有 NaNs。与直接应用 get_close_matches不同,我发现应用以下函数更容易。NaN 替代品的选择将在很大程度上取决于您的数据集。

def fuzzy_match(a, b):
left = '1' if pd.isnull(a) else a
right = b.fillna('2')
out = difflib.get_close_matches(left, right)
return out[0] if out else np.NaN

I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [科恩等人。], [温克勒].

这就是我对 水母套餐中的 Jaro-Winkler 的做法:

def get_closest_match(x, list_strings):


best_match = None
highest_jw = 0


for current_string in list_strings:
current_score = jellyfish.jaro_winkler(x, current_string)


if(current_score > highest_jw):
highest_jw = current_score
best_match = current_string


return best_match


df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])


df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))


df1.join(df2)

产出:

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e

我编写了一个 Python 包来解决这个问题:

pip install fuzzymatcher

You can find the repo here and docs 给你.

基本用法:

给定两个数据帧 df_leftdf_right,您想要模糊连接,您可以编写以下代码:

from fuzzymatcher import link_table, fuzzy_left_join


# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]


# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]


# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

或者,如果你只是想链接到最接近的匹配:

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

你可以使用 D6tjoin

import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']

索引号索引号右字母 0 one 1 one a 122 B 2333 C 344 D 455 E

它还有各种其他功能,例如:

  • check join quality, pre and post join
  • 自定义相似度函数,例如编辑距离与汉明距离
  • 指定最大距离
  • 多核计算机多核计算机

详情请参阅

使用 fuzzywuzzy

由于没有 fuzzywuzzy包的示例,这里有一个我编写的函数,它将根据您可以作为用户设置的阈值返回所有匹配:


示例数据框架

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})


# df1
Key
0       Apple
1      Banana
2      Orange
3  Strawberry


# df2
Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry

模糊匹配函数

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
:param df_1: the left table to join
:param df_2: the right table to join
:param key1: key column of the left table
:param key2: key column of the right table
:param threshold: how close the matches should be to return a match, based on Levenshtein distance
:param limit: the amount of matches that will get returned, these are sorted high to low
:return: dataframe with boths keys and matches
"""
s = df_2[key2].tolist()
    

m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m
    

m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
df_1['matches'] = m2
    

return df_1

在数据框架上使用我们的函数: < em > # 1

from fuzzywuzzy import fuzz
from fuzzywuzzy import process


fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)


Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry

在数据框架上使用我们的函数: < em > # 2

df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})


fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)


Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM

安装:

皮普

pip install fuzzywuzzy

Anaconda

conda install -c conda-forge fuzzywuzzy

我使用了 Fuzzymatcher 软件包,这对我来说效果很好。

use the below command to install

pip install fuzzymatcher

Below is the sample Code (already submitted by RobinL above)

from fuzzymatcher import link_table, fuzzy_left_join


# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]


# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]


# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

你可能会得到错误

  1. 错误: 浮点除以零—— > 请参阅 链接 来解决这个问题
  2. OperationalError: 没有这样的模块: ft4—— > 下载 sqlite3.dll 并替换 Python 或蟒蛇中的 DLL 文件 DLL 文件夹。

优点:

  1. 工作速度更快。 在我的例子中,我将一个数据帧与3000行数据帧与另一个数据帧与170,000条记录进行了比较。这也使用了跨文本的 SQLite3搜索。所以比许多
  2. Can check across multiple columns and 2 dataframes. In my case, I was looking for closest match based on address and company name. Sometimes, company name might be same but address is the good thing to check too.
  3. 给出同一记录中所有最接近的匹配的得分。你选择截止得分。

缺点:

  1. 原来的软件包安装是错误的
  2. 需要安装 C + + 和可视化工作室
  3. 对于64位水蟒/Python 不起作用

There is a package called fuzzy_pandas that can use levenshtein, jaro, metaphone and bilenco methods. With some great examples 给你

import pandas as pd
import fuzzy_pandas as fpd


df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})


results = fpd.fuzzy_merge(df1, df2,
left_on='Key',
right_on='Key',
method='levenshtein',
threshold=0.6)


results.head()


  Key    Key
0 Apple  Aple
1 Banana Bannanna
2 Orange Orag

一般方法: fuzzy_merge

对于一个更一般的场景,我们希望合并来自两个数据框架的列,其中包含略有不同的字符串,下面的函数使用 difflib.get_close_matchesmerge,以模仿熊猫的 merge的功能,但模糊匹配:

import difflib


def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
df_other= df2.copy()
df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
for x in df_other[right_on]]
return df1.merge(df_other, on=left_on, how=how)


def get_closest_match(x, other, cutoff):
matches = difflib.get_close_matches(x, other, cutoff=cutoff)
return matches[0] if matches else None

Here are some use cases with two sample dataframes:

print(df1)


key   number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5


print(df2)


key_close  letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e

通过上面的例子,我们可以得到:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close')


key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d

我们可以做一个左连接:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')


key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN

对于右连接,我们将在左侧数据帧中的所有非匹配键放到 None:

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')


key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e

还要注意,如果在截止值内没有匹配的项目,则 difflib.get_close_matches 将返回一个空列表。在共享示例中,如果我们将 df2中的最后一个索引更改为:

print(df2)


letter
one                          a
too                          b
three                        c
fours                        d
a very different string      e

我们会得到一个 index out of range错误:

df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

IndexError: 列表索引超出范围

为了解决这个问题,以上函数 get_closest_match将返回最接近的匹配索引的列表返回的 difflib.get_close_matches 除非它实际上包含任何匹配。

对于更复杂的用例来匹配具有许多列的行,您可以使用 recordlinkage包。recordlinkage提供了所有的工具,以模糊匹配行之间的 pandas数据帧,这有助于减少重复您的数据合并时。我写了一篇关于 给你包的详细文章

if the join axis is numeric this could also be used to match indexes with a specified tolerance:

def fuzzy_left_join(df1, df2, tol=None):
index1 = df1.index.values
index2 = df2.index.values


diff = np.abs(index1.reshape((-1, 1)) - index2)
mask_j = np.argmin(diff, axis=1)  # min. of each column
mask_i = np.arange(mask_j.shape[0])


df1_ = df1.iloc[mask_i]
df2_ = df2.iloc[mask_j]


if tol is not None:
mask = np.abs(df2_.index.values - df1_.index.values) <= tol
df1_ = df1_.loc[mask]
df2_ = df2_.loc[mask]


df2_.index = df1_.index


out = pd.concat([df1_, df2_], axis=1)
return out

我已经使用 fuzzywuzz在一个非常小的方式,同时匹配现有的行为和关键字的 mergepandas

Just specify your accepted threshold for matching (between 0 and 100):

from fuzzywuzzy import process


def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):
    

def fuzzy_apply(x, df, column, threshold=threshold):
if type(x)!=str:
return None
        

match, score, *_ = process.extract(x, df[column], limit=1)[0]
            

if score >= threshold:
return match


else:
return None
    

if on is not None:
left_on = on
right_on = on


# create temp column as the best fuzzy match (or None!)
df2['tmp'] = df2[right_on].apply(
fuzzy_apply,
df=df,
column=left_on,
threshold=threshold
)


merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
    

del merged_df['tmp']
    

return merged_df


使用示例数据进行尝试:

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})


df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})


fuzzy_merge(df, df2, on='Key', threshold=80)

Using thefuzz

使用 SeatGeek 的优秀软件包 thefuzz,这个软件包充分利用了莱文斯坦距离。这适用于列中保存的数据。它以行而不是列的形式添加匹配,以保持整洁的数据集,并允许将额外的列轻松地拉到输出数据框。


样本数据

df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})


col_a   col_b
0   one     1
1   two     2
2   three   3
3   four    4
4   five    5


df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})


col_a   col_b
0   one     a
1   too     b
2   three   c
3   fours   d
4   five    e

函数进行匹配

def fuzzy_match(
df_left, df_right, column_left, column_right, threshold=90, limit=1
):
# Create a series
series_matches = df_left[column_left].apply(
lambda x: process.extract(x, df_right[column_right], limit=limit)            # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
)


# Convert matches to a tidy dataframe
df_matches = series_matches.to_frame()
df_matches = df_matches.explode(column_left)     # Convert list of matches to rows
df_matches[
['match_string', 'match_score', 'df_right_id']
] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index)       # Convert match tuple to columns
df_matches.drop(column_left, axis=1, inplace=True)      # Drop column of match tuples


# Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
if df_matches.index.name:
index_name = df_matches.index.name     # Stash index name
else:
index_name = 'index'        # Default used by pandas
df_matches.reset_index(inplace=True)
df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True)       # The previous index has now become a column: rename for ease of reference


# Drop matches below threshold
df_matches.drop(
df_matches.loc[df_matches['match_score'] < threshold].index,
inplace=True
)


return df_matches

Use function and merge data

import pandas as pd
from thefuzz import process


df_matches = fuzzy_match(
df1,
df2,
'col_a',
'col_a',
threshold=60,
limit=1
)


df_output = df1.merge(
df_matches,
how='left',
left_index=True,
right_on='df_left_id'
).merge(
df2,
how='left',
left_on='df_right_id',
right_index=True,
suffixes=['_df1', '_df2']
)


df_output.set_index('df_left_id', inplace=True)       # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table


df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']]      # Drop columns used in the matching
df_output.index.name = 'id'


id  col_a_df1   col_b_df1   col_b_df2
0   one         1           a
1   two         2           b
2   three       3           c
3   four        4           d
4   five        5           e

提示 : 如果您也可以选择安装 python-Levenshtein包,那么使用 thefuzz进行模糊匹配会快得多。

TheFuzz is the new version of a fuzzywuzzy

为了在两个大表中模糊连接字符串元素,您可以这样做:

  1. 使用应用程序逐行前进
  2. 使用更快的并行,加速和可视化默认应用功能(与彩色进度条)
  3. 使用来自集合的 OrderedDect 消除 merge 输出中的重复项并保持初始顺序
  4. thefuzz.process.extract中增加 limit以查看更多合并选项(存储在相似度为% 的元组列表中)

“ *”您可以使用 thefuzz.process.extractOne而不是 thefuzz.process.extract仅返回一个最佳匹配项(不需要指定任何 极限)。但是,请注意,几个结果可能具有相同的相似度,您将只能得到其中的一个。

不知怎么的,快递员在开始真正的申请之前,要花上一两分钟。如果需要处理小型表,可以跳过这一步,而只使用 进度 _ 申请

from thefuzz import process
from collections import OrderedDict
import swifter




def match(x):
matches = process.extract(x, df1, limit=6)
matches = list(OrderedDict((x, True) for x in matches).keys())
print(f'{x:20} : {matches}')


return str(matches)




df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))