自动补全算法?

我指的是当用户在 Google 中键入搜索词时,用于提供查询建议的算法。

我主要感兴趣的是: 最重要的结果(最有可能是查询,而不是任何匹配的结果) 2. 匹配子字符串 3. 模糊匹配

我知道你可以使用 Trie 或者广义的 try 来寻找匹配,但是它不能满足上述要求..。

类似的问题问早些时候 给你

80230 次浏览

有像 Soundex莱文斯坦距离这样的工具可以用来查找在一定范围内的模糊匹配。

Soundex 会找到发音相似的单词,而莱文斯坦距离则会找到与另一个单词编辑距离在一定范围内的单词。

我认为构建一个专门的试验可能比追求一个完全不同的数据结构更好。

我可以看到这种功能体现在一个试验中,其中每片叶子都有一个能够反映对应词搜索频率的字段。

搜索查询方法将显示具有最大值的子代叶节点,这些值是通过将到每个子代叶节点的距离乘以与每个子代叶节点相关的搜索频率计算得到的。

谷歌使用的数据结构(因此算法)可能要复杂得多,可能要考虑大量其他因素,比如你自己账户的搜索频率(一天中的时间... ... 天气... ... 季节... ... 月相... ... 和... ...)。 然而,我相信基本的 trie 数据结构可以通过在每个节点上包含额外的字段并在搜索查询方法中使用这些字段来扩展到任何一种专门的搜索偏好。

看看 Firefox 的 Awesome bar 算法

谷歌建议是有用的,因为它考虑到数百万流行的查询 + 你过去的相关查询。

尽管它没有一个好的完成算法/用户界面:

  1. 不做子串
  2. 看起来是个相对简单的词边界前缀算法。
    例如: 尝试 tomcat tut-> 正确地建议“ tomcat 教程”,现在尝试 tomcat rial-> 无建议)-:
  3. 不支持“你的意思是?”-如在谷歌搜索结果。

对于(呵)令人敬畏的模糊/部分字符串匹配算法,请查看该死的酷算法:

它们不会替换尝试,而是防止在尝试中进行强力查找——这仍然是一个巨大的胜利。接下来,您可能需要一种方法来约束尝试的大小:

  • 尝试使用全球范围内使用的最近/最多的 N 个单词;
  • 对于每个用户,为该用户保留最近的/最高的 N 个单词。

最后,您希望尽可能防止查找..。

  • 缓存查找结果: 如果用户点击任何搜索结果,您可以非常快速地提供这些结果,然后异步获取完整的部分/模糊查找。
  • 预计算查找结果: 如果用户输入了“ appl”,他们可能会继续输入“ apple”、“ application”。
  • 预取数据: 例如,一个 web 应用程序可以向浏览器发送一组较小的结果,这些结果小到足以使用 JS 进行强力搜索成为可能。

我只想说..。 解决这个问题的一个好办法就是不仅仅使用三元搜索树。 Ngram 和带状疱疹(短语)是必需的。还需要检测单词边界错误。“ hell o”应该是“ hello”... “ whitesocks”应该是“ white ocks”——这些是预处理步骤。如果不对数据进行适当的预处理,就不会得到有价值的搜索结果。 三元搜索树是一个非常有用的组件,它可以帮助我们判断什么是单词,也可以帮助我们在索引中输入的单词不是有效单词的情况下实现相关单词的猜测。

谷歌算法执行短语建议和更正。 谷歌算法也有一些上下文的概念... 如果你搜索的第一个单词是天气相关的,你把它们结合起来“ weatherforcst”vs“ monsoonfrcst”vs“ desk frcst”-我的猜测是幕后排名正在根据遇到的第一个单词改变建议-预测和天气是相关的单词,因此预测得到的是一个高排名在你的意思猜测。

单词片段(ngram)、短语词汇(shingles)、单词接近度(word 聚类索引)、三元搜索树(word lookup)。

谷歌的确切算法是未知的,但 据说的工作由用户输入统计分析。这种方法不适用于大多数情况。更常见的自动完成是使用下列方法之一实现的:

  • 树木 。通过索引可搜索的文本在一个树结构(前缀树,后缀树,党等。)可以以牺牲内存存储为代价执行非常快速的搜索。树遍历可以用于近似匹配。
  • 模式分区 。通过将文本划分为标记(ngram) ,可以使用一个简单的散列方案执行对模式出现的搜索。
  • 筛选 . 查找一组潜在的匹配项,然后应用顺序算法检查每个候选项。

看一下 彻底的,它是一个 Java 自动完成库,实现了后面的一些概念。

对于子串和模糊匹配,莱文斯坦距离算法对我来说相当有效。虽然我承认它看起来不像自动补全/建议的行业实现那么完美。Google 和微软的 Intellisense 都做得更好,我认为这是因为他们改进了这个基本算法来衡量匹配不同字符串所需的编辑操作。例如,转换两个字符可能只能算作1个操作,而不是2个(插入和删除)。

但即便如此,我发现这已经足够接近了,这是它在 C # 中的实现..。

// This is the traditional Levenshtein Distance algorithem, though I've tweaked it to make
// it more like Google's autocomplete/suggest.  It returns the number of operations
// (insert/delete/substitute) required to change one string into another, with the
// expectation that userTyped is only a partial version of fullEntry.
// Gives us a measurement of how similar the two strings are.
public static int EditDistance(string userTyped, string fullEntry)
{
if (userTyped.Length == 0) // all entries are assumed to be fully legit possibilities
return 0; // at this point, because the user hasn't typed anything.


var inx = fullEntry.IndexOf(userTyped[0]);
if (inx < 0) // If the 1st character doesn't exist anywhere in the entry, it's not
return Int32.MaxValue; // a possible match.


var lastInx = inx;
var lastMatchCount = 0;
TryAgain:
// Is there a better starting point?
var len = fullEntry.Length - inx;
var matchCount = 1;
var k = 1;
for (; k < len; k++)
{
if (k == userTyped.Length || userTyped[k] != fullEntry[k + inx])
{
if (matchCount > lastMatchCount)
{
lastMatchCount = matchCount;
lastInx = inx;
}
inx = fullEntry.IndexOf(userTyped[0], inx + 1);
matchCount = 0;
if (inx > 0)
goto TryAgain;
else
break;
}
else
matchCount++;
}
if (k == len && matchCount > lastMatchCount)
lastInx = inx;


if (lastInx > 0)
fullEntry = fullEntry.Substring(lastInx); // Jump to 1st character match, ignoring previous values


// The start of the Levenshtein Distance algorithem.
var m = userTyped.Length;
var n = Math.Min(m, fullEntry.Length);


int[,] d = new int[m + 1, n + 1]; // "distance" - meaning number of operations.


for (var i = 0; i <= m; i++)
d[i, 0] = i; // the distance of any first string to an empty second string
for (var j = 0; j <= n; j++)
d[0, j] = j; // the distance of any second string to an empty first string


for (var j = 1; j <= n; j++)
for (var i = 1; i <= m; i++)
if (userTyped[i - 1] == fullEntry[j - 1])
d[i, j] = d[i - 1, j - 1];       // no operation required
else
d[i, j] = Math.Min
(
d[i - 1, j] + 1,  // a deletion
Math.Min(
d[i, j - 1] + 1,  // an insertion
d[i - 1, j - 1] + 1 // a substitution
)
);


return d[m, n];
}

如果你正在寻找一个问题的整体设计,尝试阅读在 https://www.interviewbit.com/problems/search-typeahead/的内容。

他们首先通过使用一个 try 的简单方法构建自动完成,然后在此基础上进行构建。他们还解释了优化技术,如采样和离线更新,以迎合特定的用例。

为了保持解决方案的可伸缩性,您必须智能地分割 try 数据。

我不知道这是否会回答你的问题,但我做了一个非常简单的输入-自动完成代码使用 C 语言当时。我还没有实现机器学习和神经网络在这方面,所以它不会作出概率计算和诸如此类的东西。它所做的是使用子字符串检查算法检查与输入匹配的第一个索引。

您可以将匹配数据提供到“ dict.txt”文件中。

/* Auto-complete input function in c
@authors: James Vausch
@date: 2018-5-23


- This is a bona-fide self-created program which aims to
stimulate an input auto-suggest or auto-complete function
in C language. This is open source so you can use the code
freely. However if you will use this, just acknowledge the
creator as a sign of respect.
- I'd also like to acknowledge Code with C team whom where I
I got an answer how to have a colored output instead of
using system("color #"). Link down below


https://www.codewithc.com/change-text-color-in-codeblocks-console-window/




- THE GENERAL IDEA IS; WE READ A FILE WITH DICTIONARY WORDS
OR SHALL WE SAY ALL WORDS. WE RUN A WHILE LOOP THAT WILL
GET CHARACTER FROM THE USER USING "getch()" FUNCTION THEN
STORE IT IN A CHARACTER ARRAY THEN IS PASSED ON A FUNCTION
THAT CHECKS IF THE ANY DICTIONARY WORDS HAS A SUBSTRING
THAT IS THE USER INPUT. IF YES(0), THE FUNCTION WILL COPY
THE FOLLOWING STRING FROM THE DICTIONARY ARRAY THEN STORED
IN A TEMP CHAR ARRAY AND PROCESSED. THE PROCESSING SHOULD
BE SIMPLE. WE RUN A LOOP IN WHICH WILL CHECK THE AMOUNT OF
CHARACTERS IN THE MATCHED STRING, THEN WE'LL RUN A LOOP
THAT WILL SORT THE WORDS DECREMENTALLY BASED ON THE AMOUNT
OF CHARACTERS OF THE INPUT SUBSTRING. THEN PRINT THE
PROCESSED STRING ON THE FRONT OF THE INPUT STRING THEN RUN
A LOOP BASED ON THE AMOUNT OF CHARACTERS PRESENT OR STRING
LENGTH OF THE PROCESSED STRING PLUS 10 EXTRA CHARACTERS
ALONG WITH PRINTING SOME BACKWARD TRAVERSE CARET FUNCTION
TO MAKE THE CARET STAY WHERE IT SHOULD BE ALONG WITH
INPUTTING. SIMPLE.


- <EXAMPLE>
INPUT: COM
AFTER LOOP RUN: MATCHED WITH WORD "COMMAND"
AFTER LOOP RUN: INPUT HAS 3 CHARACTERS
LOOP SEQUENCE:
LOOP 0: OMMAND
LOOP 1: MMAND
LOOP 2: MAND
AFTER LOOP: MAND
PRINT: "MAND" AFTER INPUT BUT KEEP CARET ON THE INPUT "COM"


NOTE:
- You'll need the "dict.txt" file or you can create one and
put some stuff there
- Since C Programs run on.. say a terminal, I have not much of a way to
efficiently make a way to use arrow keys for the job.
- you should type your INPUT in LOWERCASE since pressing "Shift_Key + M"
is equivalent to pressing the VK_Right(right arrow key) as well as
the other arrow keys
- the right arrow key has an ascii equivalent of <-32><77>, 77 = M
- to complete the input, you'll need to press right arrow key
- the left arrow key has an ascii equivalent of <-32><75>, 75 = K
- to remove auto-complete suggestion, press left arrow key


TO ADD:
- UP arrow key and DOWN arrow key to cycle through suggestions
*/


//#include <headers.h> //My personal header file
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <windows.h>


void main(){
SetColor(6);
start();
}


void start(){
int rep = 0;
char dictFile[] = "dict.txt";
loadDictionaryEntries(dictFile);


char inp[50];
printf("\nAuto Complete Program : C");
while(rep == 0){
printf("\nInput: ");
autoCompleteInput(inp);
if(strcasecmp(inp, "exit") == 0){
break;
}
printf("\nOutput: %s", inp);
}
printf("\n");
system("pause");
}


int dictEntryCount = 0;
struct allWords{
char entry[100];
}dictionary[60000];


//============================================================================//
void loadDictionaryEntries(char directory[]){
FILE *file;
int dex = 0;
char str[100];
if(file = fopen(directory, "r")){
printf("File accessed.\n");
while(!feof(file)){
fscanf(file, "%s", str);
//UN-COMMENT line 109 to check if the program is reading from file
//printf("Adding entry %d: \"%s\" to dictionary\n",dex + 1, str);
strcpy(dictionary[dex].entry, str);
dex++;
dictEntryCount++;
}
fclose(file);
printf("[ADDED %d WORDS TO DICTIONARY]\n", dictEntryCount);
}else{
printf(" File cannot be accessed.");
fclose(file);
}
}


void printArray(){
for(int i = 0; i < dictEntryCount; i++){
printf("Index %d: %s\n", i + 1, dictionary[i].entry);
}
}


//============================================================================//
void autoCompleteInput(char input[]){
char matchedWord[100]; //STORAGE FOR THE WORD THAT MATCHES INPUT
char ch; //STORAGE FOR EACH CHARACTER THAT THE USER INPUTS
int i = 0; //COUNTER
int words;
while(i != 200){ //LOOP TO GET EACH CHARACTER FROM KEYBOARD PRESS
SetColor(6);
ch = getch();
clsx(strlen(matchedWord));
if(ch == 13){ //CONDITION TO CHECK IF INPUT IS "ENTER" KEY
break; //BREAKS LOOP IF "ENTER IS PRESSED"
}else if(ch == 8){ //CONDITION TO CHECK IF INPUT IS "BACKSPACE"
if(i == 0){ //IF INPUT IS NULL, DO NOTHING, DONT ERASE ANYTHING
//DO NOTHING
}else{ //IF INPUT IS NOT NULL, ENABLE ERASING
clsx(strlen(matchedWord));
bksp();
i--;
input[i] = '\0';
if(i > 2){
if(matchToDictionary(input, matchedWord) == 0){
words = 0;
processMatchedWord(i, matchedWord);
SetColor(8);
printf("%s", matchedWord);
words = getArrSizeChar(matchedWord);
for(int x = 0; x < words; x++){
printf("\b");
}
}
}
}
}else if(ch == 77){ //CONDITION TO CHECK IF INPUT IS RIGHT ARROW KEY
printf("%s", matchedWord); //PRINT SUGESTED WORD WITH CARET AT FRONT
strcat(input, matchedWord); //CONCATENATE SUGGESTION TO INPUT
i = i + words - 1; //SETS INDEX AT THE END OF INPUT
words = 0; //
}else if(ch == 75){ //CONDITION TO CHECK IS INPUT IS LEFT ARROW KEY
clsx(strlen(matchedWord)); //ERASE SUGGESTION
i--; //DECREMENT INDEX
}else{ //IF CONDITIONS ABOVE ARE NOT MET, DO THIS
input[i] = ch; //INSERT CH AT THE INDEX OF INPUT
printf("%c", ch); //PRINT CHARACTER
input[i + 1] = '\0'; //SET END OF CURRENT INPUT TO NULL
i++;
if(i >= 2){
if(matchToDictionary(input, matchedWord) == 0){
words = 0;
processMatchedWord(i, matchedWord);
SetColor(8);
printf("%s", matchedWord);
words = getArrSizeChar(matchedWord);
for(int x = 0; x < words; x++){
printf("\b");
}
}else{
clsx(strlen(matchedWord));
}
}
}


}
input[i] = '\0'; //NULL ENDING VALUE TO PREVENT UNNECESSARY CHARACTERS
}


int getArrSizeChar(char array[]){
int size = 0;
while(array[size] != '\0'){size++;}
return size;
}


void clsx(int maxVal){
for(int i = 0; i < maxVal + 10; i++){
printf(" ");
}
for(int i = 0; i < maxVal + 10; i++){
printf("\b");
}
}


int matchToDictionary(char input[], char matchedWord[]){
int found = 0;
int dex = dictEntryCount; //LIMIT OF ARRAY / ARRAY BOUND/S
//while(dictionary[dex] != '\0'){ //LOOP TO DETERMINE ARRAY BOUND
//printf("%d", dex);
//dex++; //INCREMENT IF INDEX OF ARRAY IS NOT NULL
//}
//printf("%d", dex);
for(int i = 0; i < dex; i++){ //LOOP TROUGH ALL INDEXES OF DICTIONARY
//CHECKS IF THE INDEX OF DICTIONARY HAS A SUBSTRING INPUT
//printf(" Matching %s and %s\n", dictionary[i], input);
if(containsIgnoreCase(dictionary[i].entry, input) == 0){
//CHECKS IF THE INDEX OF DICTIONARY TOTALLY MATCHES THE INPUT
//IT IS TO PREVENT ERRORS IN AUTO-COMPLETING PROCESS
if(strcasecmp(dictionary[i].entry, input) == 1){
//IF NOT, STORE INDEX OF DICTIONARY TO MATCHED WORD
strcpy(matchedWord, dictionary[i].entry);
found++;
break; //BREAK LOOP
}
}
}


if(found == 1){
return 0;
}else{
return 1;
}
}


void processMatchedWord(int rep, char str[]){
int lim = 0;
int i;
char temp[50];
while(str[lim] != '\0'){
lim++;
}


while(rep != 0){
for(i = 0; i < lim; i++){
str[i] = str[i + 1];
}
rep--;
}
}


//===================================================================//
void bksp(){
printf("\b "); //first backsapce to print an emtpy character
printf("\b"); //second backspace to erase printed character
}


int containsIgnoreCase(char str1[], char str2[]){
char tmp1[100];
char tmp2[100];
toLowerCase(tmp1, str1);
toLowerCase(tmp2, str2);
int i, j = 0, k;


for(i = 0; tmp1[i]; i++){
if(tmp1[i] == tmp2[j]){
for(k = i, j = 0; tmp1[k] && tmp2[j]; j++, k++){
if(tmp1[k] != tmp2[j]){
break;
}
}
if(!tmp2[j]){
return 0;
}
}
}


return 1;
}


void toLowerCase(char destination[], char source[]){
int lim = 0;
int i;


while(source[lim] != '\0'){
lim++;
}


for(i = 0; i < lim; i++){
destination[i] = tolower(source[i]);
}
destination[i] = '\0';
}




/*Console Colors: Windows


0 = Black        8 = Gray
1 = Blue         9 = LBlue
2 = Green       10 = LGreen
3 = Aqua        11 = LAqua
4 = Red         12 = LRed
5 = Purple      13 = LPurple
6 = Yellow      14 = LYellow
7 = White       15 = Bright White
}*/


void SetColor(int ForgC){ //CODE SNIPPET FROM WWW.CODEWITHC.COM
WORD wColor;
//This handle is needed to get the current background attribute


HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
CONSOLE_SCREEN_BUFFER_INFO csbi;
//csbi is used for wAttributes word


if(GetConsoleScreenBufferInfo(hStdOut, &csbi)){
//To mask out all but the background attribute, and to add the color
wColor = (csbi.wAttributes & 0xF0) + (ForgC & 0x0F);
SetConsoleTextAttribute(hStdOut, wColor);
}
return;
}

如果你指的是一个初始化为空匹配的程序然后把用户的输入保存到数组或文件中然后当用户输入与之前输入匹配的同一个单词时,也许我可以解决这个问题。