Python: 查找与另一个字符串最接近的字符串(从列表中)

假设我有一个 string "Hello"和一个列表

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']

我怎样才能找到最接近 "Hello"n words并出现在 words列表中?

在这种情况下,我们会有 ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

因此,策略是从最接近的单词到最远的单词对列表单词进行排序。

我想过这样的事

word = 'Hello'
for i, item in enumerate(words):
if lower(item) > lower(word):
...

但是在大的列表中它是非常慢的。

更新 difflib可以工作,但是速度也很慢。(words list里面有630000多个单词(已排序,每行一个))。因此,每次搜索最接近的单词,检查列表需要5到7秒钟!

109495 次浏览

Use difflib.get_close_matches.

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

Please look at the documentation, because the function returns 3 or less closest matches by default.

Create a sorted list of your words and use the bisect module to identify the point in the sorted list where your word would fit according to the sorting order. Based on that position you can give the k nearest neighbours above and below to find the 2k closest words.

There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.

http://norvig.com/spell-correct.html

The idea is to build all possible edits of your word,

hello - helo   - deletes
hello - helol  - transpose
hello - hallo  - replaces
hello - heallo - inserts




def edits1(word):
splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes    = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts    = [a + c + b     for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)

Now, look up each of these edits in your list.

Peter's article is a great read and worth reading.

maybe heap can help you .

you have a heap named Heap that until it's size is less than n , you insert words into the Heap using function close [shows this string is closer than another string or not] .

this method can help you when n is small :)

Heap = []
for word in words:
if len(Heap)<n:
Heap.insert(word)
else
if close(word,Heap[0]): # it means Heap[0] is the nth farthest word until now
Heap.pop():
Heap.insert(word)

I was looking at this answer for getting a closest match from a list or possible alternatives of

difflib.get_close_matches

I found Amjith's answer based on Peter Norwig's post and thought it might be a good replacement. Before I realised it might not be quite much suited for my use case, I had made a class out of it. So this is a version of spell where you can use it for different set of words. Maybe best use can be for population name matching.

import re
from collections import Counter


def words(text): return re.findall(r'\w+', text.lower())


# WORDS = Counter(words(open('big.txt').read()))




class WordMatcher:


def __init__(self, big_text):
self.WORDS=Counter(words(big_text))
self.N = sum(self.WORDS.values())


def P(self, word):
"Probability of `word`."
return self.WORDS[word] / self.N


def correction(self, word):
"Most probable spelling correction for word."
return max(self.candidates(word), key=self.P)


def candidates(self, word):
"Generate possible spelling corrections for word."
return (self.known([word]) or self.known(self.edits1(word)) or self.known(self.edits2(word)) or [word])


def known(self, words):
"The subset of `words` that appear in the dictionary of WORDS."
return set(w for w in words if w in self.WORDS)


def edits1(self, word):
"All edits that are one edit away from `word`."
letters    = 'abcdefghijklmnopqrstuvwxyz'
splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
deletes    = [L + R[1:]               for L, R in splits if R]
transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
inserts    = [L + c + R               for L, R in splits for c in letters]
return set(deletes + transposes + replaces + inserts)


def edits2(self, word):
"All edits that are two edits away from `word`."
return (e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))