How can Google be so fast?

What technologies and programming decisions make Google able to serve queries so quickly?

Every time I search for something (which happens several times a day), it always amazes me how they serve the results in around a second or less. What sort of configuration and algorithms make this possible?

Side note: it's kind of an overwhelming thought that even a desktop application built and run on my own machine would probably not be half as fast as Google. Keep learning, I say.


Here are some of the great answers and pointers provided:


Hardware.

Lots and lots of hardware. They use massive clusters of commodity PCs as their server farm.

It's a bit too much to cover in one answer: http://en.wikipedia.org/wiki/Google_platform

TraumaPony is right. Tons of servers and a smart architecture for load balancing/caching, and voilà, you can run a query in under 1 second. There were a lot of articles on the net describing Google's service architecture. I'm sure you can find them via Google :)

And algorithms that can harness that hardware power. Like MapReduce, for instance.

They have implemented good, distributed algorithms running on a vast amount of hardware.

They pretty much have a local copy of the internet cached on thousands of PCs on custom filesystems.

If you are interested in more details about how the Google cluster works, I'd suggest looking at HDFS, the open-source implementation of their distributed file system.

It's based on Google's MapReduce.
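
To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word-count job in Python. It only imitates the map/shuffle/reduce phases in one process; it is not Google's or Hadoop's actual API, and the real systems distribute these phases across thousands of machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Reducer: sum all the counts emitted for one word."""
    return word, sum(counts)

def run_job(documents):
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_phase(doc_id, text):
            grouped[word].append(count)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

docs = {1: "the quick brown fox", 2: "the fox jumps over the lazy dog"}
print(run_job(docs))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```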

Google hires the best of the best. Some of the smartest people in IT work at Google. They have virtually infinite money to throw at hardware and engineers.

They use highly optimized storage mechanisms for the tasks that they are performing.

They have geographically distributed server farms.

On the Google Research homepage you can find pointers to the research papers written by some of the Google guys. You should start with the explanation of the Google File System and the MapReduce algorithm to try and understand what's going on behind the Google pages.

One of the most important delays for web servers is getting your query to the server and the response back. This latency is bound by the speed of light, which even Google has to obey. However, they have datacenters all over the world. As a result, the average distance to any one of them is lower, which keeps the latency down. Sure, the difference is measured in milliseconds, but it matters if the response has to arrive within 1000 milliseconds.
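
A quick back-of-the-envelope calculation shows why geography matters. The only assumption is that light in optical fibre travels at roughly two thirds of c, about 200 km per millisecond (a rule of thumb, not a figure for any specific link):

```python
SPEED_IN_FIBRE_KM_PER_MS = 200.0  # ~2/3 of c; a rough rule of thumb

def min_rtt_ms(distance_km):
    """Lower bound on round-trip time over fibre: there and back again."""
    return 2 * distance_km / SPEED_IN_FIBRE_KM_PER_MS

for km in (100, 1_000, 10_000):  # nearby datacenter, same continent, other side of the world
    print(f"{km:>6} km away -> at least {min_rtt_ms(km):5.1f} ms round trip")
# 100 km -> 1.0 ms, 1000 km -> 10.0 ms, 10000 km -> 100.0 ms
```

So a datacenter on the other side of the planet costs on the order of 100 ms of the 1000 ms budget before any processing happens at all.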

Latency is killed by disk accesses, so it's reasonable to believe that all the data used to answer queries is kept in memory. This implies thousands of servers, each replicating one of many shards. Therefore the critical path for search is unlikely to touch any of their flagship distributed-systems technologies (GFS, MapReduce or BigTable); those are used to process crawler results, crudely speaking.

The handy thing about search is that there's no need for either strongly consistent results or completely up-to-date data, so Google is not prevented from responding to a query just because a slightly more up-to-date search result has become available.

So a possible architecture is quite simple: front-end servers process the query, normalising it (possibly by stripping out stop words etc.) and then distributing it to whatever subset of replicas owns that part of the query space (an alternative architecture is to split the data up by web pages, so that one member of every replica set needs to be contacted for every query). Many, many replicas are probably queried, and the quickest responses win. Each replica has an index mapping queries (or individual query terms) to documents, which they can use to look up results in memory very quickly. If different results come back from different sources, the front-end server can rank them as it spits out the HTML.
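
Here is a toy sketch of that scatter-gather idea, with invented names and data: a front end normalises the query, asks each shard's in-memory inverted index for matches, and merges whatever comes back. A real system would fan out in parallel, race replicas of each shard against each other and apply actual ranking.

```python
STOP_WORDS = {"the", "a", "an", "of", "in"}   # invented list for the example

class Shard:
    """One replica's slice of the index, held entirely in memory."""
    def __init__(self, docs):
        self.index = {}                        # term -> set of doc ids
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index.setdefault(term, set()).add(doc_id)

    def lookup(self, terms):
        hits = [self.index.get(t, set()) for t in terms]
        return set.intersection(*hits) if hits else set()

def normalise(query):
    return [t for t in query.lower().split() if t not in STOP_WORDS]

def front_end(query, shards):
    terms = normalise(query)
    results = set()
    for shard in shards:      # in reality: parallel RPCs, fastest replica of each shard wins
        results |= shard.lookup(terms)
    return sorted(results)    # stand-in for real ranking

shards = [
    Shard({1: "google serves a query fast", 2: "the google file system"}),
    Shard({3: "mapreduce simplified data processing", 4: "google query latency"}),
]
print(front_end("the google query", shards))  # [1, 4]
```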

Note that this is probably a long way different from what Google actually do - they will have engineered the life out of this system so there may be more caches in strange areas, weird indexes and some kind of funky load-balancing scheme amongst other possible differences.

One fact that I've always found funny is that Google is in fact run by bioinformatics ('kay, I find that funny because I'm a bioinf…thingy). Let me explain.

Early on, bioinformatics had the challenge of searching for small texts in gigantic strings very fast. For us, the "gigantic string" is of course DNA. Often not a single DNA but a database of several DNAs from different species/individuals. The small texts are proteins or their genetic counterpart, a gene. Much of the first work of computational biologists was restricted to finding homologies between genes. This is done to establish the function of newly found genes by noting similarities to genes that are already known.

Now, these DNA strings get very big indeed and (lossy!) search has to be done extremely efficiently. Most of the modern theory of string lookup was thus developed in the context of computational biology.

However, quite some time ago, conventional text search was exhausted. A new approach was needed that allowed searching large strings in sublinear time, that is, without looking at every single character. It was discovered that this can be solved by pre-processing the large string and building a special index data structure over it. Many different such data structures have been proposed. Each has its strengths and weaknesses, but there's one that is especially remarkable because it allows a lookup in constant time. Now, at the orders of magnitude at which Google operates this isn't strictly true anymore, because load balancing across servers, preprocessing and some other sophisticated stuff have to be taken into account.

But in essence, the so-called q-gram index allows a lookup in constant time. The only disadvantage: the data structure gets ridiculously big. Essentially, to allow a lookup of strings with up to q characters (hence the name), it requires a table that has one field for each possible combination of q letters (that is, S^q entries, where S is the size of the alphabet, say 36 (= 26 + 10)). Additionally, there has to be one field for each letter position in the string that was indexed (or, in the case of Google, for each web site).
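
For illustration, here is a tiny q-gram index over a handful of strings standing in for web pages. The documents and q = 3 are arbitrary choices for the example; the point is that a query costs one hash-table access per q-gram of the pattern, independent of how much text was indexed.

```python
from collections import defaultdict

Q = 3  # the "q" in q-gram, chosen arbitrarily for the example

def build_qgram_index(documents):
    """Map every substring of length Q to the (doc id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for pos in range(len(text) - Q + 1):
            index[text[pos:pos + Q]].add((doc_id, pos))
    return index

def candidate_docs(index, pattern):
    """Documents sharing at least one q-gram with the pattern; each step is one hash lookup."""
    docs = set()
    for pos in range(len(pattern) - Q + 1):
        docs |= {doc_id for doc_id, _ in index.get(pattern[pos:pos + Q], ())}
    return docs

docs = {1: "google search", 2: "genome search", 3: "page rank"}
index = build_qgram_index(docs)
print(candidate_docs(index, "search"))  # {1, 2}
```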

To mitigate the sheer size, Google will probably use multiple indices (in fact, they do, to offer services like spelling correction). The topmost ones won't work on the character level but on the word level instead. This reduces q, but it makes S vastly bigger, so they will have to use hashing and collision tables to cope with the practically unbounded number of different words.

On the next level, these hashed words will point to other index data structures which, in turn, will hash characters pointing to websites.

Long story short, these q-gram index data structures are arguably the most central part of Google's search algorithm. Unfortunately, there are no good non-technical papers explaining how q-gram indices work. The only publication that I know that contains a description of how such an index works is … alas, my bachelor thesis.

  1. Multi-staged data storage, processing and retrieval

  2. EFFICIENT distribution (hundreds of thousands of machines) of the above tasks

  3. Good framework to store the raw data and the processed results

  4. Good framework to retrieve the results

How exactly all these are done is summarized by the links in the question summary.

HenryR is probably correct.

MapReduce does not play a role in the search itself; it is only used for indexing. Check this video interview with the MapReduce inventors.

An attempt at a generalized list (one that does not depend on you having access to Google's internal tools); see the sketch after this list:

  1. Parallelize requests (e.g. break up a single request into smaller sets)
  2. Async (make as much asynchronous as possible, e.g. don't block the user's request)
  3. Memory/cache (disk I/O is slow; keep as much as possible in memory)
  4. Pre-compute (do as much work as possible beforehand; don't wait for a user to ask for data/processing)
  5. Care about your front-end HTML (see YSlow and friends)
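
As a toy sketch of items 1–3 (parallelize, async, cache), with made-up names and timings:

```python
import asyncio

CACHE = {}  # 3. memory/cache: keep earlier answers in RAM

async def fetch_shard(term, shard_id):
    await asyncio.sleep(0.05)  # stand-in for a slow backend or disk access
    return f"results for {term!r} from shard {shard_id}"

async def search(term, num_shards=4):
    if term in CACHE:
        return CACHE[term]
    # 1. parallelize + 2. async: fan the query out to every shard at once
    parts = await asyncio.gather(*(fetch_shard(term, s) for s in range(num_shards)))
    CACHE[term] = list(parts)
    return CACHE[term]

async def main():
    print(await search("latency"))  # first call pays the backend cost once
    print(await search("latency"))  # repeat is an in-memory cache hit

asyncio.run(main())
```

Pre-computation (item 4) would push even that first, slow call offline, so users only ever see the cache-hit path.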

Everyone knows it's because they use pigeons, of course!

Oh yeah, that and Mapreduce.

This link is also very informative: Behind the scenes of a Google query


An additional reason appears to be that they cheat on the TCP slow start algorithm.

http://blog.benstrong.com/2010/11/google-and-microsoft-cheat-on-slow.html
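
Rough arithmetic behind why a larger initial congestion window helps (the figures below, 1460-byte segments, a 100 kB response and windows of 3 vs. 10 segments, are illustrative only): classic slow start roughly doubles the window every round trip, so a bigger starting window shaves whole round trips off a small response.

```python
import math

MSS = 1460  # typical TCP segment payload in bytes (illustrative)

def round_trips(response_bytes, initial_cwnd_segments):
    """How many RTTs slow start needs to push a response, ignoring loss and the receive window."""
    segments_left = math.ceil(response_bytes / MSS)
    cwnd, rtts = initial_cwnd_segments, 0
    while segments_left > 0:
        segments_left -= cwnd
        cwnd *= 2   # slow start roughly doubles the window every round trip
        rtts += 1
    return rtts

for init_cwnd in (3, 10):
    print(f"initcwnd={init_cwnd}: ~{round_trips(100_000, init_cwnd)} RTTs for a 100 kB response")
# initcwnd=3 -> ~5 RTTs, initcwnd=10 -> ~3 RTTs
```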