Algorithm to classify words for hangman difficulty levels as "Easy", "Medium", or "Hard"


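You're being downvoted because you're asking us to build a very complex algorithm for you.

Why don't you just create three arrays (easy, medium, and hard) and populate each with a hundred or so words? It would take about 20 minutes.

I guarantee your kids will get bored of hangman long before they've played a few hundred games.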


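First, of course, you'd generate a list of the unique letters in the word. Then sort them by frequency (in English or whatever language; there are lists for this), with less frequent letters having a higher difficulty.

Then you need to decide whether to combine the scores by adding, multiplying, or using some other scheme.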
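A minimal sketch of this idea, showing both an additive and a multiplicative combination. The frequency ordering and the function names are assumptions made for illustration (published orderings differ slightly between corpora), and the code expects lowercase words made of the letters a to z.

```python
import math

# Letters from most to least frequent in English. This ordering is an
# assumption; any frequency list for the target language could be used.
FREQUENCY_ORDER = 'etaoinshrdlcumwfgypbvkjxqz'


def letter_scores(word):
    """Score each unique letter: 1 for the most common letter, 26 for the rarest."""
    return [FREQUENCY_ORDER.index(c) + 1 for c in sorted(set(word.lower()))]


def additive_difficulty(word):
    """Combine the per-letter scores by adding them."""
    return sum(letter_scores(word))


def multiplicative_difficulty(word):
    """Combine the per-letter scores by multiplying them."""
    return math.prod(letter_scores(word))


for word in ['tea', 'jazz']:
    print(word, additive_difficulty(word), multiplicative_difficulty(word))
```

The multiplicative version grows much faster with the number of unique letters, so it would need wider difficulty bands than the additive one.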


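A really simple way would be to compute a score based on the lack of vowels in the word, the number of unique letters, and the commonness of each letter: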

letters = 'etaoinshrdlcumwfgypbvkjxqz'
vowels = set('aeiou')


def difficulty(word):
    # Longer words, more unique letters, fewer distinct vowels and rarer
    # letters all raise the score.
    unique = set(word)
    positions = sum(letters.index(c) for c in word)

    return len(word) * len(unique) * (7 - len(unique & vowels)) * positions


words = ['the', 'potato', 'school', 'egypt', 'floccinaucinihilipilification']

for word in words:
    print(difficulty(word), word)

The result is:

432 the
3360 potato
7200 school
7800 egypt
194271 floccinaucinihilipilification

You could then score the words with:

        score < 2000   # Easy
2000 < score < 10000  # Medium
10000 < score          # Hard
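The thresholds above can be wrapped in a small function. This sketch repeats the scoring function so that it runs on its own; `classify` is a hypothetical name, and putting the boundary values 2000 and 10000 into the lower category is an assumption, since the ranges above leave them unspecified.

```python
letters = 'etaoinshrdlcumwfgypbvkjxqz'
vowels = set('aeiou')


def difficulty(word):
    unique = set(word)
    positions = sum(letters.index(c) for c in word)
    return len(word) * len(unique) * (7 - len(unique & vowels)) * positions


def classify(word):
    """Map a word to 'easy', 'medium' or 'hard' using the thresholds above."""
    score = difficulty(word)
    if score <= 2000:       # boundary handling is an assumption
        return 'easy'
    if score <= 10000:
        return 'medium'
    return 'hard'


for word in ['the', 'potato', 'school', 'egypt', 'floccinaucinihilipilification']:
    print(word, classify(word))
```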


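Just do it! Play hangman against the word and count how many forfeits (i.e. incorrect guesses) it takes to win.

You'll need a strategy to play. Here's a human-like strategy: from the dictionary, strike out all the words that don't fit the letters revealed so far, then guess the letter that is most frequent among the remaining words.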
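A rough sketch of that strategy follows. The function name `count_forfeits` is made up for illustration, "most frequent" is interpreted as "contained in the most remaining words" (one possible reading), ties are broken alphabetically, and the secret word is assumed to be in the dictionary.

```python
from collections import Counter


def count_forfeits(secret, dictionary):
    """Play hangman against `secret` and return the number of wrong guesses."""
    candidates = [w for w in dictionary if len(w) == len(secret)]
    guessed = set()
    forfeits = 0
    while not set(secret) <= guessed:
        # For each unguessed letter, count the candidate words containing it.
        counts = Counter(c for w in candidates for c in set(w) if c not in guessed)
        # Pick the most common letter; break ties alphabetically.
        guess = min(counts, key=lambda c: (-counts[c], c))
        guessed.add(guess)
        if guess not in secret:
            forfeits += 1
        # Strike out words whose occurrences of `guess` don't match the reveal.
        candidates = [w for w in candidates
                      if all((a == guess) == (b == guess) for a, b in zip(w, secret))]
    return forfeits


words = ['deed', 'dead', 'dare', 'star']
for word in words:
    print(word, count_forfeits(word, words))
```

The forfeit count depends heavily on the dictionary, so the same word can rate as easy against a small wordlist and hard against a large one.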

If your strategy is randomised, you can define your measure as the expected number of forfeits, and estimate that empirically.

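Another deterministic strategy, from a hangman bot I wrote a few years ago: guess the letter that minimises the number of words remaining in the case where the guess is incorrect (i.e. optimise the worst case). Today I dislike this strategy for being too mechanical; I prefer the one above.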


You can use the Monte Carlo method to estimate the difficulty of a word:

Simulate a game by guessing a random letter each time, weighted by the letter's frequency in your target language, and count how many guesses it takes your randomized player to arrive at a solution. Since each guess eliminates a letter, this process is finite and returns a number from 1 to 26, inclusive.
Repeat this process 2*N times, where N is the number of unique letters in the word.
Calculate the score by averaging the results of the 2*N runs.
Determine the difficulty level: a score below 10 indicates an easy word and a score above 16 indicates a hard word; everything else is medium.
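The steps above could be sketched as follows. The letter weights are rough illustrative percentages for English (an assumption, not an authoritative table), the function names are hypothetical, and words are assumed to consist of lowercase a to z only.

```python
import random

# Approximate relative letter frequencies in English text, in percent.
# Illustrative values only; substitute a table for your target language.
LETTER_WEIGHTS = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.1, 'z': 0.07,
}


def simulate_game(word, rng):
    """Guess weighted random letters without repetition; return the guess count."""
    remaining = dict(LETTER_WEIGHTS)
    needed = set(word)
    guesses = 0
    while needed:
        candidates = list(remaining)
        weights = [remaining[c] for c in candidates]
        guess = rng.choices(candidates, weights=weights)[0]
        del remaining[guess]          # each guess eliminates a letter
        needed.discard(guess)
        guesses += 1
    return guesses


def monte_carlo_difficulty(word, rng=None):
    """Average the guess count over 2*N games, N = number of unique letters."""
    rng = rng or random.Random()
    runs = 2 * len(set(word))
    return sum(simulate_game(word, rng) for _ in range(runs)) / runs


def classify_score(score):
    """Apply the thresholds from the last step."""
    if score < 10:
        return 'easy'
    if score > 16:
        return 'hard'
    return 'medium'


rng = random.Random(42)  # fixed seed so that runs are repeatable
for word in ['tea', 'jazz']:
    score = monte_carlo_difficulty(word, rng)
    print(word, round(score, 1), classify_score(score))
```

With only 2*N runs the estimate is noisy, so repeated calls can put the same word in different bands; more runs would give a steadier score.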


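There could be a lot of things involved:

As everyone said, the frequency of each letter;
The length of a word should definitely count, but not in a linear way: with a long word even random guesses tend to hit letters, while a short word can be hard to get;
Also, the words themselves should be considered: "bipartite" might be a familiar word for people on SO, but maybe not for a non-technical population.

Actually, you could try to co-evolve several strategies, half of them for deciding the worth of a word and the other half for trying to win the game. The latter group tries to maximize the score while the former tries to minimize it. After a while a pattern may emerge, and the half that decides the worth of a word may then give you some benchmarks.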


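A previous, similar discussion on the same topic: Determine the difficulty of an English word

I like the answer at the end of the link ^. For a kids' hangman game, just apply an approach like the one Scrabble uses.

Assign a point value to each letter, then add up the values of the letters.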


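Best answer

1. Introduction

Here's a way to approach this problem systematically: if you have an algorithm that plays hangman well, then you can take the difficulty of each word to be the number of wrong guesses that your program would make when guessing that word.

2. Aside on hangman strategy

There's an idea implicit in some of the other answers and comments: that the optimal strategy for the solver is to base its decisions on the frequency of letters in English, or on the frequency of words in some corpus. This is a seductive idea, but it's not quite right. The solver does best if it accurately models the distribution of words chosen by the setter, and a human setter may well choose words based on their rarity or their avoidance of frequently used letters. For example, although E is the most frequently used letter in English, if the setter always chooses from the words JUGFUL, RHYTHM, SYZYGY, and ZYTHUM, then a perfect solver does not start by guessing E!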
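The four-word example can be checked in a few lines. This is only an illustration of the point, assuming the setter picks uniformly from those four words; the variable names are arbitrary.

```python
from collections import Counter

setter_words = ['jugful', 'rhythm', 'syzygy', 'zythum']

# For each letter, count how many of the setter's words contain it.
containing = Counter(c for w in setter_words for c in set(w))

# The letter contained in the most words (ties broken alphabetically).
best = max(sorted(containing), key=containing.get)

print('best first guess:', best, 'in', containing[best], 'words')
print('words containing e:', containing['e'])
```

Against this setter, Y is in three of the four words while E is in none of them, so a first guess of E is always a wasted guess.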

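The best approach to modelling the setter depends on the context, but I guess that some kind of Bayesian inductive inference would work well in a context where the solver plays many games against the same setter, or against a group of similar setters.

3. A hangman algorithm

Here I'll outline a solver that is pretty good (but far from perfect). It models the setter as choosing words uniformly from a fixed dictionary. It's a greedy algorithm: at each stage it guesses the letter that minimizes the number of misses, that is, the number of words that do not contain the guess. For example, if no guesses have been made so far, and the possible words are DEED, DEAD and DARE, then:

if you guess D or E, there are no misses;
if you guess A, there's one miss (DEED);
if you guess R, there are two misses (DEED and DEAD);
if you guess any other letter, there are three misses.

So either D or E is a good guess in this situation.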
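The miss counts in this example can be reproduced with a short script. It uses the same cost function as the implementation in section 4, repeated here so that the snippet runs on its own; the `costs` and `best` names are only for illustration.

```python
from string import ascii_lowercase


def guess_cost(guess, words):
    """Number of words that do not contain the guessed letter."""
    return sum(guess not in word for word in words)


words = ['deed', 'dead', 'dare']
costs = {g: guess_cost(g, words) for g in ascii_lowercase}

# All letters that achieve the minimum number of misses.
lowest = min(costs.values())
best = [g for g in ascii_lowercase if costs[g] == lowest]

print('d:', costs['d'], 'e:', costs['e'], 'a:', costs['a'], 'r:', costs['r'])
print('best guesses:', best)
```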

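(Thanks to Colonel Panic for pointing out in the comments that correct guesses are free in hangman; I totally forgot this in my first attempt!)

4. Implementation

Here's an implementation of this algorithm in Python: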

from collections import defaultdict
from string import ascii_lowercase


def partition(guess, words):
    """Apply the single letter 'guess' to the sequence 'words' and return
    a dictionary mapping the pattern of occurrences of 'guess' in a
    word to the list of words with that pattern.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> sorted(list(partition('e', words).items()))
    [(0, ['star']), (2, ['mews']), (5, ['even', 'eyes']), (6, ['deed', 'peep'])]

    """
    result = defaultdict(list)
    for word in words:
        # Encode the positions of 'guess' in 'word' as a bitmask.
        key = sum(1 << i for i, letter in enumerate(word) if letter == guess)
        result[key].append(word)
    return result


def guess_cost(guess, words):
    """Return the cost of a guess, namely the number of words that don't
    contain the guess.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> guess_cost('e', words)
    1
    >>> guess_cost('s', words)
    3

    """
    return sum(guess not in word for word in words)


def word_guesses(words, wrong=0, letters=''):
    """Given the collection 'words' that match all letters guessed so far,
    generate tuples (wrong, nguesses, word, guesses) where
    'word' is the word that was guessed;
    'guesses' is the sequence of letters guessed;
    'wrong' is the number of these guesses that were wrong;
    'nguesses' is len(guesses).

    >>> words = 'deed even eyes heel mere peep star'.split()
    >>> from pprint import pprint
    >>> pprint(sorted(word_guesses(words)))
    [(0, 1, 'mere', 'e'),
     (0, 2, 'deed', 'ed'),
     (0, 2, 'even', 'en'),
     (1, 1, 'star', 'e'),
     (1, 2, 'eyes', 'en'),
     (1, 3, 'heel', 'edh'),
     (2, 3, 'peep', 'edh')]

    """
    if len(words) == 1:
        yield wrong, len(letters), words[0], letters
        return
    # Greedy step: pick the unguessed letter that misses the fewest words.
    best_guess = min((g for g in ascii_lowercase if g not in letters),
                     key=lambda g: guess_cost(g, words))
    best_partition = partition(best_guess, words)
    letters += best_guess
    for pattern, words in best_partition.items():
        # Pattern 0 means the letter is absent, so the guess was wrong.
        for guess in word_guesses(words, wrong + (pattern == 0), letters):
            yield guess

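5. Example results

Using this strategy it's possible to evaluate the difficulty of guessing each word in a collection. Here I consider the six-letter words in my system dictionary: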

>>> words = [w.strip() for w in open('/usr/share/dict/words') if w.lower() == w]
>>> six_letter_words = set(w for w in words if len(w) == 6)
>>> len(six_letter_words)
15066
>>> results = sorted(word_guesses(six_letter_words))

The easiest words to guess in this dictionary (together with the sequence of guesses the solver needs to guess them) are as follows:

>>> from pprint import pprint
>>> pprint(results[:10])
[(0, 1, 'eelery', 'e'),
 (0, 2, 'coneen', 'en'),
 (0, 2, 'earlet', 'er'),
 (0, 2, 'earner', 'er'),
 (0, 2, 'edgrew', 'er'),
 (0, 2, 'eerily', 'el'),
 (0, 2, 'egence', 'eg'),
 (0, 2, 'eleven', 'el'),
 (0, 2, 'enaena', 'en'),
 (0, 2, 'ennead', 'en')]

and the hardest words are these:

>>> pprint(results[-10:])
[(12, 16, 'buzzer', 'eraoiutlnsmdbcfg'),
 (12, 16, 'cuffer', 'eraoiutlnsmdbpgc'),
 (12, 16, 'jugger', 'eraoiutlnsmdbpgh'),
 (12, 16, 'pugger', 'eraoiutlnsmdbpcf'),
 (12, 16, 'suddle', 'eaioulbrdcfghmnp'),
 (12, 16, 'yucker', 'eraoiutlnsmdbpgc'),
 (12, 16, 'zipper', 'eraoinltsdgcbpjk'),
 (12, 17, 'tuzzle', 'eaioulbrdcgszmnpt'),
 (13, 16, 'wuzzer', 'eraoiutlnsmdbpgc'),
 (13, 17, 'wuzzle', 'eaioulbrdcgszmnpt')]

The reason these are so hard is that after you've guessed -UZZLE, you still have seven possibilities left:

>>> ' '.join(sorted(w for w in six_letter_words if w.endswith('uzzle')))
'buzzle guzzle muzzle nuzzle puzzle tuzzle wuzzle'

6. Choice of wordlist

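Of course, when preparing wordlists for your children you wouldn't start with your computer's system dictionary; you'd start with a list of words that you think they are likely to know. For example, you might have a look at Wiktionary's lists of the most frequently used words in various English corpora.

For example, among the 1,700 six-letter words in the 10,000 most common words in Project Gutenberg as of 2006, the ten most difficult are these:

[(6, 10, 'losing', 'eaoignvwch'),
 (6, 10, 'monkey', 'erdstaoync'),
 (6, 10, 'pulled', 'erdaioupfh'),
 (6, 10, 'slaves', 'erdsacthkl'),
 (6, 10, 'supper', 'eriaoubsfm'),
 (6, 11, 'hunter', 'eriaoubshng'),
 (6, 11, 'nought', 'eaoiustghbf'),
 (6, 11, 'wounds', 'eaoiusdnhpr'),
 (6, 11, 'wright', 'eaoithglrbf'),
 (7, 10, 'soames', 'erdsacthkl')]

(Soames Forsyte is a character in the Forsyte Saga by John Galsworthy; the wordlist had been converted to lower case, so it wasn't possible for me to quickly remove proper names.)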
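To get back to the three levels asked for in the question, tuples of the form (wrong, nguesses, word, guesses) can be bucketed by their number of wrong guesses. This is only a sketch: `classify_results` and both cut-off values are hypothetical and would need tuning against the actual wordlist.

```python
def classify_results(results, easy_max=2, medium_max=8):
    """Group words into levels by the number of wrong guesses the solver made.

    The default cut-offs are arbitrary placeholders, not values from the
    analysis above.
    """
    levels = {'easy': [], 'medium': [], 'hard': []}
    for wrong, _nguesses, word, _guesses in results:
        if wrong <= easy_max:
            levels['easy'].append(word)
        elif wrong <= medium_max:
            levels['medium'].append(word)
        else:
            levels['hard'].append(word)
    return levels


# A few result tuples copied from the listings above.
sample = [
    (0, 2, 'eleven', 'el'),
    (6, 10, 'monkey', 'erdstaoync'),
    (13, 17, 'wuzzle', 'eaioulbrdcgszmnpt'),
]
print(classify_results(sample))
```

Cut-offs taken from percentiles of the `wrong` counts would give levels of roughly equal size, if that matters for the game.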


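I wrote a hangman solver a while back using the obvious algorithm: given an initial dictionary of all possible words, at each turn we choose the letter that occurs in the most words remaining in the dictionary, then remove the non-matching words (depending on the response) from the dictionary.

The algorithm isn't quite as straightforward as this, since there are often several letters that each occur in the same number of words in the dictionary. In that case, the choice of letter can make a significant difference to how many guesses a word requires. Among the tied letters, we pick the one for which the resulting information about the placement of that letter (if it is indeed in the word) tells us the most about the system, i.e. the letter with the maximum information entropy. For example, if the two remaining possible words are "encyclopedia" and "encyclopedic", the letter "c" has the same probability of appearing as e, n, y, l, o, p, d and i (that is, it is guaranteed to be in the word), but we should ask about "c" first since it has a non-zero information entropy.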
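The tie-break can be illustrated in Python (the solver itself is written in C++; this is an independent sketch, not a port of it). The helper names are invented, and entropy is computed over the position patterns a letter produces across the remaining words.

```python
import math
from collections import Counter


def pattern(letter, word):
    """Positions at which `letter` occurs in `word`."""
    return tuple(i for i, c in enumerate(word) if c == letter)


def placement_entropy(letter, words):
    """Shannon entropy (in bits) of the position patterns of `letter`."""
    counts = Counter(pattern(letter, w) for w in words)
    total = len(words)
    return sum(n / total * math.log2(total / n) for n in counts.values())


def choose_letter(words, guessed=''):
    """Letter in the most words; ties broken by the highest placement entropy."""
    letters = {c for w in words for c in w} - set(guessed)
    return max(sorted(letters),
               key=lambda c: (sum(c in w for w in words),
                              placement_entropy(c, words)))


words = ['encyclopedia', 'encyclopedic']
print(placement_entropy('c', words), placement_entropy('e', words))
print(choose_letter(words))
```

Here "c" occurs at different positions in the two words, so revealing it separates them, whereas revealing "e" gives the same pattern for both and distinguishes nothing.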

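The source code (C++, GPL) is here.

The result of all this is a list of words, with the number of guesses required for each one: difficulty (630KB). The hardest word for this algorithm to find is "will" (with 14 failed guesses); the i and the double l are guessed pretty quickly, but then the options include bill, dill, fill, gill, hill, kill, mill, pill, rill, till and will, and from then on the only option is to guess each letter in turn. Somewhat counterintuitively, longer words are guessed much more quickly (there just aren't that many of them to choose from).

Of course, in a human game of hangman, psychology (and breadth of vocabulary) plays a greater role than this algorithm accounts for.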


Start with a list of words and run a Google search for each one. Let the number of hits serve as a (coarse) proxy for the term's difficulty.

In a refined version, you'd group words by synonym relation using a thesaurus and determine the most difficult word of a category by counting the results of the Google searches.

Taking the notion of n-grams one step further, the difficulty of a word could be rated by the frequency of its syllables in prose. That depends on the quality of the syllable statistics, of course. You'd probably have to differentiate between lexemes and function words (determiners, conjunctions etc.) and normalize by the number of syllables in the word (feels like overkill as I write ...).


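I like the idea of building an algorithm that learns and changes depending on the users. At the beginning, you can implement any of the suggested algorithms to come up with the list; then, as more people play the game, you assign a weight to each word depending on the number of guesses it takes (which is continually tracked and recalculated). This avoids the problem of complex but popular words being rated as difficult even though people know them well.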
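One way this could look is a running average that starts from the algorithmic estimate and drifts towards what players actually do. The class name, the `prior_weight` parameter and the choice to measure difficulty in wrong guesses are all assumptions made for this sketch.

```python
from collections import defaultdict


class AdaptiveDifficulty:
    """Blend an initial difficulty estimate with observed wrong-guess counts."""

    def __init__(self, initial_scores, prior_weight=5):
        # initial_scores: word -> estimated wrong guesses from any algorithm.
        self.initial = dict(initial_scores)
        # prior_weight: how many real games the initial estimate is worth.
        self.prior_weight = prior_weight
        self.total_wrong = defaultdict(int)
        self.games = defaultdict(int)

    def record_game(self, word, wrong_guesses):
        """Store the outcome of one finished game."""
        self.total_wrong[word] += wrong_guesses
        self.games[word] += 1

    def difficulty(self, word):
        """Weighted average of the initial estimate and the observed games."""
        weighted_prior = self.initial[word] * self.prior_weight
        return ((weighted_prior + self.total_wrong[word])
                / (self.prior_weight + self.games[word]))


tracker = AdaptiveDifficulty({'rhythm': 8.0}, prior_weight=2)
print(tracker.difficulty('rhythm'))
tracker.record_game('rhythm', 2)
tracker.record_game('rhythm', 2)
print(tracker.difficulty('rhythm'))
```

A larger `prior_weight` makes the rating change more slowly, which helps when only a handful of games have been played for a word.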


Compute the value of each letter of a word in Scrabble points: E=1, D=2, V=4, X=8 and so on. Add them up and divide by the number of letters to get an average letter value, and use that to score the word. Compute the average for each word in a large dictionary, and determine the break points between quartiles. Call words in the lowest quartile "easy", words in the two middle quartiles "medium", and words in the highest quartile "hard".
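A sketch of this scheme, using the standard English Scrabble letter values. The function names are hypothetical, the quartile break points come from Python's `statistics.quantiles` with its default method, and a real run would use a large dictionary rather than the short demo list here.

```python
import statistics

# Standard English Scrabble letter values.
SCRABBLE_POINTS = {
    **dict.fromkeys('aeilnorstu', 1),
    **dict.fromkeys('dg', 2),
    **dict.fromkeys('bcmp', 3),
    **dict.fromkeys('fhvwy', 4),
    'k': 5,
    **dict.fromkeys('jx', 8),
    **dict.fromkeys('qz', 10),
}


def average_letter_value(word):
    """Total Scrabble points of the word divided by its length."""
    return sum(SCRABBLE_POINTS[c] for c in word.lower()) / len(word)


def classify_words(words):
    """Label each word easy, medium or hard by the quartile of its average."""
    scores = {w: average_letter_value(w) for w in words}
    q1, _median, q3 = statistics.quantiles(scores.values(), n=4)
    labels = {}
    for word, score in scores.items():
        if score <= q1:
            labels[word] = 'easy'
        elif score >= q3:
            labels[word] = 'hard'
        else:
            labels[word] = 'medium'
    return labels


demo = ['tea', 'cat', 'dog', 'jazz', 'quiz', 'lynx', 'rate', 'zebra']
for word, label in classify_words(demo).items():
    print(word, label)
```

Because the break points are taken from the wordlist itself, the same word can land in a different band when the dictionary changes.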