Detecting syllables in a word

I need to find a fairly efficient way to detect syllables in a word, e.g.:

invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCVC CVCC

where V is a vowel and C is a consonant. For example:

pronunciation (5 syllables: pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)
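As a quick illustration (my own sketch, not part of the original question), a word's C/V skeleton can be derived in a couple of lines of Python; note that treating "y" as a consonant here is itself a debatable assumption:

```python
def cv_pattern(word, vowels="aeiou"):
    """Map each letter of a word to 'V' (vowel) or 'C' (consonant)."""
    return "".join("V" if ch in vowels else "C" for ch in word.lower())

print(cv_pattern("pronunciation"))  # CCVCVCCVVCVVC
```

The hard part, of course, is not producing this skeleton but deciding where the syllable boundaries fall within it.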

I have tried several approaches, among which: using a regex (which helps only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite state automaton (which did not lead to anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.

I would appreciate it if someone could give me tips on an alternate way to solve this problem besides my previous approaches.

I work in Java, but any tips in C/C++, C#, Python, or Perl would work for me too.


For hyphenation purposes, read about the way TeX approaches this problem. In particular, see Frank Liang's thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and he then includes a small exception dictionary for the cases where the algorithm does not work.

Perl has the Lingua::Phonology::Syllable module. You could try that, or look into its algorithm. I saw a few other older modules there, too.

I don't understand why a regular expression would give you only a count of syllables. You should be able to get the syllables themselves using capture groups. Assuming you can construct a regular expression that works, that is.
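For instance (my own sketch, not the answerer's code), a single pattern that emits one chunk per vowel group gets you rough syllable chunks rather than just a count. Treat it as a baseline only: silent "e" and vowel teams will still trip it up.

```python
import re

def rough_syllables(word):
    # One chunk per vowel group; any trailing consonants attach to the last chunk.
    return re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$)?", word.lower())

print(rough_syllables("invisible"))  # ['i', 'nvi', 'si', 'ble']
```

`len(rough_syllables(word))` then doubles as a crude syllable counter.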

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: https://github.com/mnater/hyphenator, or its successor: https://github.com/mnater/hyphenopoly

That is, unless you enjoy reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)

Why calculate it? Every online dictionary has this info. http://dictionary.reference.com/browse/invisible → in·vis·i·ble

Here is a solution using NLTK:

from nltk.corpus import cmudict
d = cmudict.dict()

def nsyl(word):
    return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]

This is a particularly difficult problem which is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).

I am trying to tackle this problem for a program that will compute the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html, and it gets reasonably close. It still has trouble with complicated words like invisible and hyphenation, but I've found it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found that "es" can be syllabic or not. It's a gamble, but I decided to remove the "es" in my algorithm.

private int CountSyllables(string word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    string currentWord = word;
    int numVowels = 0;
    bool lastWasVowel = false;
    foreach (char wc in currentWord)
    {
        bool foundVowel = false;
        foreach (char v in vowels)
        {
            // don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }

        // if full cycle and no vowel found, set lastWasVowel to false
        if (!foundVowel)
            lastWasVowel = false;
    }
    // remove "es": it's usually (?) silent
    if (currentWord.Length > 2 &&
        currentWord.Substring(currentWord.Length - 2) == "es")
        numVowels--;
    // remove silent "e"
    else if (currentWord.Length > 1 &&
        currentWord.Substring(currentWord.Length - 1) == "e")
        numVowels--;

    return numVowels;
}

Thank you, Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they are usually a bit slow, and for quick projects your method works fine.

Here is your code in Java, along with test cases:

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            // don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove "es": it's usually (?) silent
    // (note: string contents must be compared with equals(), not ==, in Java)
    if (word.length() > 2 &&
        word.substring(word.length() - 2).equals("es"))
        numVowels--;
    // remove silent "e"
    else if (word.length() > 1 &&
        word.substring(word.length() - 1).equals("e"))
        numVowels--;
    return numVowels;
}


public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt=" + txt + " countSyllables=" + countSyllables(txt));
    txt = "super";
    System.out.println("txt=" + txt + " countSyllables=" + countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt=" + txt + " countSyllables=" + countSyllables(txt));
    txt = "American";
    System.out.println("txt=" + txt + " countSyllables=" + countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt=" + txt + " countSyllables=" + countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt=" + txt + " countSyllables=" + countSyllables(txt));
}

The result comes out as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2

Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (it will most likely run on other versions of Lua as well):

countsyllables.lua

function CountSyllables(word)
    local vowels = { 'a','e','i','o','u','y' }
    local numVowels = 0
    local lastWasVowel = false

    for i = 1, #word do
        local wc = string.sub(word, i, i)
        local foundVowel = false
        for _, v in pairs(vowels) do
            if (v == string.lower(wc) and lastWasVowel) then
                foundVowel = true
                lastWasVowel = true
            elseif (v == string.lower(wc) and not lastWasVowel) then
                numVowels = numVowels + 1
                foundVowel = true
                lastWasVowel = true
            end
        end

        if not foundVowel then
            lastWasVowel = false
        end
    end

    if string.len(word) > 2 and
        string.sub(word, string.len(word) - 1) == "es" then
        numVowels = numVowels - 1
    elseif string.len(word) > 1 and
        string.sub(word, string.len(word)) == "e" then
        numVowels = numVowels - 1
    end

    return numVowels
end

And some fun tests to confirm it works (as much as it's supposed to):

countsyllables.tests.lua

require "countsyllables"


tests = {
{ word = "what", syll = 1 },
{ word = "super", syll = 2 },
{ word = "Maryland", syll = 3},
{ word = "American", syll = 4},
{ word = "disenfranchized", syll = 5},
{ word = "Sophia", syll = 2},
{ word = "End", syll = 1},
{ word = "I", syll = 1},
{ word = "release", syll = 2},
{ word = "same", syll = 1},
}


for _,test in pairs(tests) do
local resultSyll = CountSyllables(test.word)
assert(resultSyll == test.syll,
"Word: "..test.word.."\n"..
"Expected: "..test.syll.."\n"..
"Result: "..resultSyll)
end


print("Tests passed.")

I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here: https://stackoverflow.com/a/32784041/2734752

I use a combination of a dictionary and an algorithm to count syllables.

You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker

I just tested my algorithm and it had a 99.4% hit rate!

Lawrence lawrence = new Lawrence();


System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));

Output:

4
3

Bumping @Tihamer and @joe-basirico. Very useful function, not perfect, but good for most small-to-medium projects. Joe, I have rewritten an implementation of your code in Python:

def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel:
                    numVowels += 1  # don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:
            # full cycle and no vowel found: set lastWasVowel to False
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es":  # remove "es": it's "usually" silent (?)
        numVowels -= 1
    elif len(word) > 1 and word[-1:] == "e":  # remove silent "e"
        numVowels -= 1
    return numVowels

Hope someone finds this useful!

Today I found this Java implementation of Frank Liang's hyphenation algorithm, with patterns for English and German, which works quite well and is available on Maven Central.

Caveat: it's important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.

To load and use the Hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.

private Hyphenator createHyphenator(String texTable) {
    Hyphenator hyphenator = new Hyphenator();
    hyphenator.setErrorHandler(new ErrorHandler() {
        public void debug(String guard, String s) {
            logger.debug("{},{}", guard, s);
        }

        public void info(String s) {
            logger.info(s);
        }

        public void warning(String s) {
            logger.warn("WARNING: " + s);
        }

        public void error(String s) {
            logger.error("ERROR: " + s);
        }

        public void exception(String s, Exception e) {
            logger.error("EXCEPTION: " + s, e);
        }

        public boolean isDebugged(String guard) {
            return false;
        }
    });

    BufferedReader table = null;
    try {
        table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                .getResourceAsStream(texTable), Charset.forName("UTF-8")));
        hyphenator.loadTable(table);
    } catch (Utf8TexParser.TexParserException e) {
        logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
        throw new RuntimeException("Failed to load hyphenation table", e);
    } finally {
        if (table != null) {
            try {
                table.close();
            } catch (IOException e) {
                logger.error("Closing hyphenation table failed", e);
            }
        }
    }

    return hyphenator;
}

Afterwards, the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.

String hyphenedTerm = hyphenator.hyphenate(term);
String[] hyphens = hyphenedTerm.split("\u00AD");
int syllables = hyphens.length;

You need to split on "\u00AD", since the API does not return a normal "-".
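The same soft-hyphen split, shown here in Python for reference (U+00AD is the Unicode soft hyphen; the hyphenated string below is just an assumed example of what a hyphenator might return):

```python
hyphenated = "in\u00advi\u00adsi\u00adble"  # hypothetical hyphenator output
syllables = hyphenated.split("\u00ad")      # split on the soft hyphen, not "-"
print(syllables)       # ['in', 'vi', 'si', 'ble']
print(len(syllables))  # 4
```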

This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.

I once used jsoup to do this. Here's a sample syllable parser:

public String[] syllables(String text) {
    String url = "https://www.merriam-webster.com/dictionary/" + text;
    String relHref;
    try {
        Document doc = Jsoup.connect(url).get();
        Element link = doc.getElementsByClass("word-syllables").first();
        if (link == null) { return new String[]{ text }; }
        relHref = link.html();
    } catch (IOException e) {
        relHref = text;
    }
    String[] syl = relHref.split("·");
    return syl;
}

I ran into this exact same issue a while ago.

I ended up using the CMU Pronouncing Dictionary for fast and accurate lookups of most words. For words not in the dictionary, I fell back on a machine-learning model that is ~98% accurate at predicting syllable counts.

I wrapped the whole thing up in an easy-to-use Python module here: https://github.com/repp/big-phoney

Install: pip install big-phoney

Count syllables:

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4

If you're not using Python and want to try the ML-model-based approach, I did a pretty detailed write-up on how the syllable counting model works on Kaggle.

After much testing, and trying out the hyphenation packages as well, I wrote my own based on a number of examples. I also tried the pyhyphen and pyphen packages, which interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.

My implementation in Python is part of a class I wrote, and the syllable counting routine is pasted below. It overestimates the number of syllables a bit, since I still haven't found a good way to account for silent word endings.

The function returns the ratio of syllables per word as used for a Flesch-Kincaid readability score. The number doesn't have to be exact, just close enough for an estimate.

On my 7th-generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759-word sample text.

def _countSyllablesEN(self, theText):

    cleanText = ""
    for ch in theText:
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "

    asVow    = "aeiouy'’"
    dExep    = ("ei", "ie", "ua", "ia", "eo")
    theWords = cleanText.lower().split()
    allSylls = 0
    for inWord in theWords:
        nChar  = len(inWord)
        nSyll  = 0
        wasVow = False
        wasY   = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY   = inWord[0] == "y"
        for c in range(1, nChar):
            isVow = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY   = inWord[c] == "y"
        if inWord.endswith("e"):
            nSyll -= 1
        if inWord.endswith(("le", "ea", "io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord, nSyll))
        allSylls += nSyll

    return allSylls / len(theWords)
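For reference, the two Flesch formulas that such a syllables-per-word ratio feeds into can be sketched as follows (these are the standard published coefficients; the sample numbers at the bottom are made up for illustration):

```python
def flesch_scores(n_sentences, n_words, n_syllables):
    """Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL)."""
    wps = n_words / n_sentences   # average words per sentence
    spw = n_syllables / n_words   # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl

# e.g. 100 words in 5 sentences, 150 syllables total
fre, fkgl = flesch_scores(n_sentences=5, n_words=100, n_syllables=150)
print(round(fre, 3), round(fkgl, 2))  # 59.635 9.91
```

Since the counter above overestimates syllables slightly, the resulting FRE will skew low and the FKGL slightly high.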

I am including a solution that works "well enough" in R. Far from perfect.

countSyllablesInWord = function(words)
{
    # word = "super";
    n.words = length(words);
    result = list();
    for(j in 1:n.words)
    {
        word = words[j];
        vowels = c("a","e","i","o","u","y");

        word.vec = strsplit(word, "")[[1]];
        n.char = length(word.vec);

        is.vowel = is.element(tolower(word.vec), vowels);
        n.vowels = sum(is.vowel);

        # nontrivial problem
        if(n.vowels <= 1)
        {
            syllables = 1;
            str = word;
        } else {
            # syllables = 0;
            previous = "C";
            # on average ?
            str = "";
            n.hyphen = 0;

            for(i in 1:n.char)
            {
                my.char = word.vec[i];
                my.vowel = is.vowel[i];
                if(my.vowel)
                {
                    if(previous == "C")
                    {
                        if(i == 1)
                        {
                            str = paste0(my.char, "-");
                            n.hyphen = 1 + n.hyphen;
                        } else {
                            if(i < n.char)
                            {
                                if(n.vowels > (n.hyphen + 1))
                                {
                                    str = paste0(str, my.char, "-");
                                    n.hyphen = 1 + n.hyphen;
                                } else {
                                    str = paste0(str, my.char);
                                }
                            } else {
                                str = paste0(str, my.char);
                            }
                        }
                        # syllables = 1 + syllables;
                        previous = "V";
                    } else {  # "VV"
                        # assume what? vowel team?
                        str = paste0(str, my.char);
                    }
                } else {
                    str = paste0(str, my.char);
                    previous = "C";
                }
            }

            syllables = 1 + n.hyphen;
        }

        result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
    }

    if(n.words == 1) { result[[1]]; } else { result; }
}

Here are some results:

my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));


my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);


my.count.df;


#    syllables vowels         word
# 1          4      4   A-me-ri-ca
# 2          4      5 be-auti-fu-l
# 3          3      4   spa-ci-ous
# 4          2      2       ski-es
# 5          2      2       a-mber
# 6          2      2       wa-ves
# 7          2      2       gra-in
# 8          2      2      pu-rple
# 9          3      4  mo-unta-ins
# 10         3      3    ma-je-sty

I didn't realize how big a "rabbit hole" this is; it seems so easy at first.


################ hackathon #######




# https://en.wikipedia.org/wiki/Gunning_fog_index
# THIS is a CLASSIFIER PROBLEM ...
# https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word






# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
# http://www.syllablecount.com/syllables/




# https://enchantedlearning.com/consonantblends/index.shtml
# start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr",
#                   "fl", "fr", "gl", "gr", "pl", "pr",
#                   "sc", "sh", "sk", "sl", "sm", "sn",
#                   "sp", "st", "sw", "th", "tr", "tw",
#                   "wh", "wr");
# start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
#                     "spr", "squ", "str", "thr");
#
#
#
# end.digraphs = c("ch","sh","th","ng","dge","tch");
#
# ile
#
# farmer
# ar er
#
# vowel teams ... beaver1
#
#
# # "able"
# # http://www.abcfastphonics.com/letter-blends/blend-cial.html
# blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian",
#             "ck", "ct", "dge", "dis", "ed", "ex", "ful",
#             "gh", "ng", "ous", "kn", "ment", "mis", );
#
# glue = c("ld", "st", "nd", "ld", "ng", "nk",
#           "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch",
#           "nse", "nt", "ph", "psy", "pt", "re", )
#
#
# start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
#
# # https://mantra4changeblog.wordpress.com/2017/05/01/consonant-digraphs/
# digraphs.start = c("ch","sh","th","wh","ph","qu");
# digraphs.end = c("ch","sh","th","ng","dge","tch");
# # https://www.education.com/worksheet/article/beginning-consonant-blends/
# blends.start = c("pl", "gr", "gl", "pr",
#
# blends.end = c("lk","nk","nt",
#
#
# # https://sarahsnippets.com/wp-content/uploads/2019/07/ScreenShot2019-07-08at8.24.51PM-817x1024.png
# # Monte     Mon-te
# # Sophia    So-phi-a
# # American  A-mer-i-can
#
# n.vowels = 0;
# for(i in 1:n.char)
#   {
#   my.char = word.vec[i];
#
#
#
#
#
# n.syll = 0;
# str = "";
#
# previous = "C"; # consonant vs "V" vowel
#
# for(i in 1:n.char)
#   {
#   my.char = word.vec[i];
#
#   my.vowel = is.element(tolower(my.char), vowels);
#   if(my.vowel)
#     {
#     n.vowels = 1 + n.vowels;
#     if(previous == "C")
#       {
#       if(i == 1)
#         {
#         str = paste0(my.char, "-");
#         } else {
#                 if(n.syll > 1)
#                   {
#                   str = paste0(str, "-", my.char);
#                   } else {
#                          str = paste0(str, my.char);
#                         }
#                 }
#        n.syll = 1 + n.syll;
#        previous = "V";
#       }
#
#   } else {
#               str = paste0(str, my.char);
#               previous = "C";
#               }
#   #
#   }
#
#
#
#
## https://jzimba.blogspot.com/2017/07/an-algorithm-for-counting-syllables.html
# AIDE   1
# IDEA   3
# IDEAS  2
# IDEE   2
# IDE   1
# AIDA   2
# PROUSTIAN 3
# CHRISTIAN 3
# CLICHE  1
# HALIDE  2
# TELEPHONE 3
# TELEPHONY 4
# DUE   1
# IDEAL  2
# DEE   1
# UREA  3
# VACUO  3
# SEANCE  1
# SAILED  1
# RIBBED  1
# MOPED  1
# BLESSED  1
# AGED  1
# TOTED  2
# WARRED  1
# UNDERFED 2
# JADED  2
# INBRED  2
# BRED  1
# RED   1
# STATES  1
# TASTES  1
# TESTES  1
# UTILIZES  4

And here is a simple Kincaid readability function; syllables is the list of counts returned from the first function.

Since my function is biased toward more syllables, that will give an inflated readability score, which is fine for now... if the goal is to make the text more readable, this is not the worst thing.

computeReadability = function(n.sentences, n.words, syllables=NULL)
{
    n = length(syllables);
    n.syllables = 0;
    for(i in 1:n)
    {
        my.syllable = syllables[[i]];
        n.syllables = my.syllable$syllables + n.syllables;
    }
    # Flesch Reading Ease (FRE):
    FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
    # Flesch-Kincaid Grade Level (FKGL):
    FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59;
    # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
    # FKGL = -0.13948  * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;

    list("FRE" = FRE, "FKGL" = FKGL);
}

You can try spacy_syllables; this works on Python 3.9:

Setup:

pip install spacy
pip install spacy_syllables
python -m spacy download en_core_web_md

Code:

import spacy
from spacy_syllables import SpacySyllables

nlp = spacy.load('en_core_web_md')
syllables = SpacySyllables(nlp)
nlp.add_pipe('syllables', after='tagger')

def spacy_syllablize(word):
    token = nlp(word)[0]
    return token._.syllables

for test_word in ["trampoline", "margaret", "invisible", "thought", "Pronunciation", "couldn't"]:
    print(f"{test_word} -> {spacy_syllablize(test_word)}")

Output:

trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']