Lucene 分析仪的比较

有人能解释一下 Lucene 中不同分析程序之间的区别吗?我得到了一个 maxClauseCount 异常,我知道我可以通过使用 KeywordAnalyzer 来避免这种情况,但是如果不了解分析器周围的问题,我不想从 StandardAnalyzer 进行更改。非常感谢。

44530 次浏览

In general, any analyzer in Lucene is tokenizer + stemmer + stop-words filter.

Tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, KeywordAnalyzer you mentioned doesn't split the text at all and takes all the field as a single token. At the same time, StandardAnalyzer (and most other analyzers) use spaces and punctuation as a split points. For example, for phrase "I am very happy" it will produce list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers see its Java Docs.

Stemmers are used to get the base of a word in question. It heavily depends on the language used. For example, for previous phrase in English there will be something like ["i", "be", "veri", "happi"] produced, and for French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer, initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you will use analyzer of one language to stem text in another, rules from the other language will be used and stemmer may produce incorrect results. It isn't fail of all the system, but search results then may be less accurate.

KeywordAnalyzer doesn't use any stemmers, it passes all the field unmodified. So, if you are going to search some words in English text, it isn't a good idea to use this analyzer.

Stop words are the most frequent and almost useless words. Again, it heavily depends on language. For English these words are "a", "the", "I", "be", "have", etc. Stop-words filters remove them from the token stream to lower noise in search results, so finally our phrase "I'm very happy" with StandardAnalyzer will be transformed to list ["veri", "happi"].

And KeywordAnalyzer again does nothing. So, KeywordAnalyzer is used for things like ID or phone numbers, but not for usual text.

And as for your maxClauseCount exception, I believe you get it on searching. In this case most probably it is because of too complex search query. Try to split it to several queries or use more low level functions.

In my perspective, I have used StandAnalyzer and SmartCNAnalyzer. As I have to search text in Chinese. Obviously, SmartCnAnalyzer is better at handling Chinese. For diiferent purposes, you have to choose properest analyzer.