Which is faster, a hash lookup or a binary search?

Given a static set of objects (static in the sense that once loaded it seldom, if ever, changes) into which repeated concurrent lookups are needed with optimal performance, which is better: a HashMap or an array with a binary search using some custom comparator?

Is the answer a function of object or struct type? Of hash and/or Equals function performance? Of hash uniqueness? Of list size? Of hashset size / set size?

The size of the set I'm looking at can be anywhere from 500k to 10 million, in case that information is useful.

While I'm looking for a C# answer, I figure the true mathematical answer doesn't lie in the language, so I'm not including that tag. However, if there are C#-specific things to be aware of, that information is wanted.


The only reasonable answer to this question is: It depends. It depends on the size of your data, the shape of your data, your hash implementation, your binary search implementation, and where your data lives (even though it's not mentioned in the question). A couple other answers say as much, so I could just delete this. However, it might be nice to share what I've learned from feedback to my original answer.

  1. I wrote, "Hash algorithms are O(1) while binary search is O(log n)." - As noted in the comments, Big O notation estimates complexity, not speed. This is absolutely true. It's worth noting that we usually use complexity to get a sense of an algorithm's time and space requirements. So, while it's foolish to assume complexity is strictly the same as speed, estimating complexity without time or space in the back of your mind is unusual. My recommendation: avoid Big O notation.
  2. I wrote, "So as n approaches infinity..." - This is about the dumbest thing I could have included in an answer. Infinity has nothing to do with your problem. You mention an upper bound of 10 million. Ignore infinity. As the commenters point out, very large numbers will create all sorts of problems with a hash. (Very large numbers don't make binary search a walk in the park either.) My recommendation: don't mention infinity unless you mean infinity.
  3. Also from the comments: beware default string hashes (Are you hashing strings? You don't mention.), and database indexes are often b-trees (food for thought). My recommendation: consider all your options. Consider other data structures and approaches... like an old-fashioned trie (for storing and retrieving strings; see the sketch after this list), or an R-tree (for spatial data), or a MA-FSA (Minimal Acyclic Finite State Automaton, which has a small storage footprint).
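
Since a trie may be the least familiar of those alternatives, here is a minimal sketch of one in C# (illustrative only; the type names are mine, and a production trie would use a more compact node representation):

using System.Collections.Generic;

// A minimal trie for storing and retrieving strings: one node per character,
// so a lookup costs O(key length) regardless of how many keys are stored.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out TrieNode next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                return false;
            }
        }
        return node.IsWord;
    }
}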

Given the comments, you might assume that people who use hash tables are deranged. Are hash tables reckless and dangerous? Are these people insane?

Turns out they're not. Just as binary trees are good at certain things (in-order data traversal, storage efficiency), hash tables have their moment to shine as well. In particular, they can be very good at reducing the number of reads required to fetch your data. A hash algorithm can generate a location and jump straight to it in memory or on disk while binary search reads data during each comparison to decide what to read next. Each read has the potential for a cache miss which is an order of magnitude (or more) slower than a CPU instruction.

That's not to say hash tables are better than binary search. They're not. It's also not to suggest that all hash and binary search implementations are the same. They're not. If I have a point, it's this: both approaches exist for a reason. It's up to you to decide which is best for your needs.

Original answer:


Hash algorithms are O(1) while binary search is O(log n). So as n approaches infinity, hash performance improves relative to binary search. Your mileage will vary depending on n, your hash implementation, and your binary search implementation.

Interesting discussion on O(1). Paraphrased:

O(1) doesn't mean instantaneous. It means that the performance doesn't change as the size of n grows. You can design a hashing algorithm that's so slow no one would ever use it and it would still be O(1). I'm fairly sure .NET/C# doesn't suffer from cost-prohibitive hashing, however ;)
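
To make that concrete, here is a contrived sketch (the class is made up for illustration): the hash below is still O(1) with respect to collection size, because its absurd cost is a constant that doesn't grow with n.

class SlowKey
{
    public string Value = "";

    // Perfectly valid O(1): the cost per call is a (huge) constant that does not
    // depend on how many items are in the collection; it's just useless.
    public override int GetHashCode()
    {
        System.Threading.Thread.Sleep(1000); // one full second per hash
        return Value.GetHashCode();
    }

    public override bool Equals(object obj) => obj is SlowKey other && other.Value == Value;
}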

For very small collections the difference is going to be negligible. At the low end of your range (500k items) you will start to see a difference if you're doing lots of lookups. A binary search is going to be O(log n), whereas a hash lookup will be O(1), amortized. That's not the same as truly constant, but you would still have to have a pretty terrible hash function to get worse performance than a binary search.

(When I say "terrible hash", I mean something like:

public override int GetHashCode()
{
    return 0; // every key collides into the same bucket
}

Yeah, it's blazing fast itself, but causes your hash map to become a linked list.)

ialiashkevich wrote some C# code using an array and a Dictionary to compare the two methods, but it used Long values for keys. I wanted to test something that would actually execute a hash function during the lookup, so I modified that code. I changed it to use String values, and I refactored the populate and lookup sections into their own methods so it's easier to see in a profiler. I also left in the code that used Long values, just as a point of comparison. Finally, I got rid of the custom binary search function and used the one in the Array class.

Here's that code:

class Program
{
    private const long capacity = 10_000_000;

    private static void Main(string[] args)
    {
        testLongValues();
        Console.WriteLine();
        testStringValues();

        Console.ReadLine();
    }

    private static void testStringValues()
    {
        Dictionary<String, String> dict = new Dictionary<String, String>();
        String[] arr = new String[capacity];
        Stopwatch stopwatch = new Stopwatch();

        Console.WriteLine("" + capacity + " String values...");

        stopwatch.Start();

        populateStringArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Populate String Array:      " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        populateStringDictionary(dict, arr);

        stopwatch.Stop();
        Console.WriteLine("Populate String Dictionary: " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        Array.Sort(arr);

        stopwatch.Stop();
        Console.WriteLine("Sort String Array:          " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchStringDictionary(dict, arr);

        stopwatch.Stop();
        Console.WriteLine("Search String Dictionary:   " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchStringArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Search String Array:        " + stopwatch.ElapsedMilliseconds);
    }

    /* Populate an array with random values. */
    private static void populateStringArray(String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            arr[i] = generateRandomString(20) + i; // concatenate i to guarantee uniqueness
        }
    }

    /* Populate a dictionary with values from an array. */
    private static void populateStringDictionary(Dictionary<String, String> dict, String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            dict.Add(arr[i], arr[i]);
        }
    }

    /* Search a Dictionary for each value in an array. */
    private static void searchStringDictionary(Dictionary<String, String> dict, String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            String value = dict[arr[i]];
        }
    }

    /* Do a binary search for each value in an array. */
    private static void searchStringArray(String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            int index = Array.BinarySearch(arr, arr[i]);
        }
    }

    private static void testLongValues()
    {
        Dictionary<long, long> dict = new Dictionary<long, long>(Int16.MaxValue);
        long[] arr = new long[capacity];
        Stopwatch stopwatch = new Stopwatch();

        Console.WriteLine("" + capacity + " Long values...");

        stopwatch.Start();

        populateLongDictionary(dict);

        stopwatch.Stop();
        Console.WriteLine("Populate Long Dictionary: " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        populateLongArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Populate Long Array:      " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchLongDictionary(dict);

        stopwatch.Stop();
        Console.WriteLine("Search Long Dictionary:   " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchLongArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Search Long Array:        " + stopwatch.ElapsedMilliseconds);
    }

    /* Populate an array with long values. */
    private static void populateLongArray(long[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            arr[i] = i;
        }
    }

    /* Populate a dictionary with long key/value pairs. */
    private static void populateLongDictionary(Dictionary<long, long> dict)
    {
        for (long i = 0; i < capacity; i++)
        {
            dict.Add(i, i);
        }
    }

    /* Search a Dictionary for each value in a range. */
    private static void searchLongDictionary(Dictionary<long, long> dict)
    {
        for (long i = 0; i < capacity; i++)
        {
            long value = dict[i];
        }
    }

    /* Do a binary search for each value in an array. */
    private static void searchLongArray(long[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            int index = Array.BinarySearch(arr, arr[i]);
        }
    }

    /**
     * Generate a random string of a given length.
     * Implementation from https://stackoverflow.com/a/1344258/1288
     */
    private static String generateRandomString(int length)
    {
        var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        var stringChars = new char[length];
        var random = new Random();

        for (int i = 0; i < stringChars.Length; i++)
        {
            stringChars[i] = chars[random.Next(chars.Length)];
        }

        return new String(stringChars);
    }
}

Here are the results with several different sizes of collections. (Times are in milliseconds.)

500000 Long values...
Populate Long Dictionary: 26
Populate Long Array: 2
Search Long Dictionary: 9
Search Long Array: 80

500000 String values...
Populate String Array: 1237
Populate String Dictionary: 46
Sort String Array: 1755
Search String Dictionary: 27
Search String Array: 1569

1000000 Long values...
Populate Long Dictionary: 58
Populate Long Array: 5
Search Long Dictionary: 23
Search Long Array: 136

1000000 String values...
Populate String Array: 2070
Populate String Dictionary: 121
Sort String Array: 3579
Search String Dictionary: 58
Search String Array: 3267

3000000 Long values...
Populate Long Dictionary: 207
Populate Long Array: 14
Search Long Dictionary: 75
Search Long Array: 435

3000000 String values...
Populate String Array: 5553
Populate String Dictionary: 449
Sort String Array: 11695
Search String Dictionary: 194
Search String Array: 10594

10000000 Long values...
Populate Long Dictionary: 521
Populate Long Array: 47
Search Long Dictionary: 202
Search Long Array: 1181

10000000 String values...
Populate String Array: 18119
Populate String Dictionary: 1088
Sort String Array: 28174
Search String Dictionary: 747
Search String Array: 26503

And for comparison, here's the profiler output for the last run of the program (10 million records and lookups). I highlighted the relevant functions. They pretty closely agree with the Stopwatch timing metrics above.

Profiler output for 10 million records and lookups

You can see that the Dictionary lookups are much faster than binary search, and (as expected) the difference is more pronounced the larger the collection. So, if you have a reasonable hashing function (fairly quick with few collisions), a hash lookup should beat binary search for collections in this range.

It depends on how you handle duplicates for hash tables (if at all). If you do want to allow hash-key duplicates (no hash function is perfect), lookup by the primary key remains O(1), but the search behind it for the "right" value may be costly. The answer is then that, theoretically, hashes are faster most of the time. YMMV depending on which data you put in there...

The answers by Bobby, Bill and Corbin are wrong. O(1) is not necessarily faster than O(log n) for a fixed/bounded n:

log(n) is constant, so it depends on the constant time.

And for a slow hash function, ever heard of md5?

The default string hashing algorithm probably touches all characters, and can be easily 100 times slower than the average compare for long string keys. Been there, done that.

You might be able to (partially) use a radix. If you can split your keys into 256 blocks of approximately the same size, you're looking at a binary search over 2k to 40k entries. That is likely to provide much better performance.
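
For instance, here is a rough sketch of that radix idea, assuming non-empty string keys bucketed on their first (ASCII) character; the class name is made up for illustration:

using System;
using System.Collections.Generic;
using System.Linq;

// Partition the keys into up to 256 buckets by leading character, sort each
// bucket once up front, then binary-search only the matching bucket on lookup.
class RadixBucketedSearch
{
    private readonly string[][] buckets = new string[256][];

    public RadixBucketedSearch(IEnumerable<string> keys)
    {
        foreach (var group in keys.GroupBy(k => (int)k[0]))
        {
            var bucket = group.ToArray();
            Array.Sort(bucket, StringComparer.Ordinal);
            buckets[group.Key] = bucket;
        }
    }

    public bool Contains(string key)
    {
        var bucket = buckets[key[0]];
        return bucket != null && Array.BinarySearch(bucket, key, StringComparer.Ordinal) >= 0;
    }
}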

[Edit] Too many people voting down what they do not understand.

String compares for binary searching sorted sets have a very interesting property: they get slower the closer they get to the target. First they will break on the first character, in the end only on the last. Assuming a constant time for them is incorrect.

I'd say it depends mainly on the performance of the hash and compare methods. For example, when using string keys that are very long but random, a compare will always yield a very quick result, but a default hash function will process the entire string.

But in most cases the hash map should be faster.

Hashes are typically faster, although binary searches have better worst-case characteristics. A hash access is typically a calculation to get a hash value to determine which "bucket" a record will be in, and so the performance will generally depend on how evenly the records are distributed, and the method used to search the bucket. A bad hash function (leaving a few buckets with a whole lot of records) with a linear search through the buckets will result in a slow search. (On the third hand, if you're reading a disk rather than memory, the hash buckets are likely to be contiguous while the binary tree pretty much guarantees non-local access.)

If you want generally fast, use the hash. If you really want guaranteed bounded performance, you might go with the binary tree.

Of course, hash is fastest for such a big dataset.

One way to speed it up even more, since the data seldom changes, is to programmatically generate ad-hoc code to do the first layer of search as a giant switch statement (if your compiler can handle it), and then branch off to search the resulting bucket.
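
The generated code might look roughly like this (hand-written here as an illustration; a real generator would emit one case per leading character or hash prefix, and the bucket names are made up):

// First-layer dispatch as a switch on the leading character; each case falls
// through to a binary search of that character's pre-sorted bucket.
static int Lookup(string key, string[] bucketA, string[] bucketB /* , ... */)
{
    switch (key[0])
    {
        case 'A': return Array.BinarySearch(bucketA, key);
        case 'B': return Array.BinarySearch(bucketB, key);
        // ... one case per leading character, emitted by the code generator
        default: return -1;
    }
}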

I strongly suspect that in a problem set of size ~1M, hashing would be faster.

Just for the numbers:

a binary search would require ~ 20 compares (2^20 == 1M)

a hash lookup would require 1 hash calculation on the search key, and possibly a handful of compares afterwards to resolve possible collisions

Edit: the numbers:

string c = "abcde", d = "rwerij";

// One million hash computations of the search key
for (int i = 0; i < 1000 * 1000; i++) {
    c.GetHashCode();
}

// One million simulated binary searches of ~20 string compares each
for (int i = 0; i < 1000 * 1000; i++) {
    for (int j = 0; j < 20; j++)
        c.CompareTo(d);
}

Times (with c = "abcde" and d = "rwerij"): hashcode: 0.0012 seconds; compare: 2.4 seconds.

Disclaimer: actually benchmarking a hash lookup versus a binary lookup might be better than this not-entirely-relevant test. I'm not even sure whether GetHashCode gets memoized under the hood.

Ok, I'll try to be short.

C# short answer:

Test the two different approaches.

.NET gives you the tools to change your approach with a line of code. Otherwise, use System.Collections.Generic.Dictionary and be sure to initialize it with a large number as the initial capacity, or you'll spend the rest of your life inserting items because of the work the GC has to do collecting old bucket arrays.
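
For example, a one-line sketch (the capacity value here is just the upper bound from the question):

// Pre-sizing avoids repeated internal resizes (and the garbage they generate)
// while loading a large, mostly-static set.
var lookup = new Dictionary<string, string>(capacity: 10_000_000);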

Longer answer:

A hashtable has ALMOST constant lookup times, and getting to an item in a hash table in the real world requires more than just computing a hash.

To get to an item, your hashtable will do something like this:

  • Get the hash of the key
  • Get the bucket number for that hash (usually the map function looks like this bucket = hash % bucketsCount)
  • Traverse the chain of items that starts at that bucket (basically a list of items that share the same bucket; most hashtables use this method of handling bucket/hash collisions) and compare each key with that of the item you are trying to add/delete/update/check for.

Lookup times depend on how "good" (how evenly it spreads its output) and how fast your hash function is, on the number of buckets you are using, and on how fast the key comparer is. A hashtable is not always the best solution.
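
As a rough sketch of those steps (a simplified separate-chaining table for illustration only, not how .NET's Dictionary is actually implemented):

using System.Collections.Generic;

// Lookup: hash -> bucket -> walk the collision chain, comparing keys.
class ChainedTable<TKey, TValue>
{
    private readonly LinkedList<KeyValuePair<TKey, TValue>>[] buckets;

    public ChainedTable(int bucketCount)
    {
        buckets = new LinkedList<KeyValuePair<TKey, TValue>>[bucketCount];
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        int hash = key.GetHashCode() & 0x7FFFFFFF;   // 1. get the hash of the key
        var chain = buckets[hash % buckets.Length];  // 2. bucket = hash % bucketsCount
        if (chain != null)
        {
            foreach (var pair in chain)              // 3. traverse the items chain
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }
        value = default(TValue);
        return false;
    }
}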

A better and deeper explanation: http://en.wikipedia.org/wiki/Hash_table

It describes how hash tables are built, and because the universe of keys is reasonably big and hash functions are built to be "very injective" so that collisions rarely happen, the access time for a hash table is not actually O(1)... it's something based on probabilities. But it is reasonable to say that the access time of a hash is almost always less than O(log2(n)).

If your set of objects is truly static and unchanging, you can use a perfect hash to get O(1) performance guaranteed. I've seen gperf mentioned a few times, though I've never had occasion to use it myself.

I wonder why no one mentioned perfect hashing.

It's only relevant if your dataset is fixed for a long time, but what it does is analyze the data and construct a perfect hash function that ensures no collisions.

Pretty neat, if your data set is constant and the time to calculate the function is small compared to the application run time.
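
As a toy illustration of the idea (feasible only for small key sets; real generators such as gperf use far smarter constructions): since the key set is fixed, you can search for a seed that makes a simple seeded hash collision-free over exactly n slots.

// Brute-force a seed so that every key maps to a distinct slot in [0, n).
// The seeded hash and the brute-force search are placeholders for illustration.
static int FindPerfectSeed(string[] keys)
{
    for (int seed = 1; ; seed++)
    {
        var seen = new bool[keys.Length];
        bool collision = false;
        foreach (var key in keys)
        {
            int slot = SeededHash(seed, key) % keys.Length;
            if (seen[slot]) { collision = true; break; }
            seen[slot] = true;
        }
        if (!collision) return seed; // lookups can now index slots directly, no probing
    }
}

static int SeededHash(int seed, string key)
{
    unchecked
    {
        int h = seed;
        foreach (char c in key) h = h * 31 + c;
        return h & 0x7FFFFFFF;
    }
}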

Surprised nobody mentioned Cuckoo hashing, which provides guaranteed O(1) and, unlike perfect hashing, is capable of using all of the memory it allocates, whereas perfect hashing can end up with guaranteed O(1) but waste the greater portion of its allocation. The caveat? Insertion time can be very slow, especially as the number of elements increases, since all of the optimization is performed during the insertion phase.

I believe some version of this is used in router hardware for IP lookups.

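Here is a minimal sketch of the lookup side only (insertion, which does the displacement work, is omitted, and the two hash functions are arbitrary placeholders):

// Why cuckoo hashing gives a guaranteed O(1) lookup: each key can live in only
// one of two candidate slots, so a lookup probes at most two positions.
static int Hash1(int key) => key & 0x7FFFFFFF;                      // placeholder hash #1
static int Hash2(int key) => (int)(((uint)key * 2654435761u) >> 1); // placeholder hash #2

static bool Contains(int[] table1, int[] table2, int key)
{
    return table1[Hash1(key) % table1.Length] == key
        || table2[Hash2(key) % table2.Length] == key;
}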

The answer depends. Let's assume the number of elements 'n' is very large. If you are good at writing a hash function with fewer collisions, then hashing is best. Note that the hash function is executed only once per search and directs you to the corresponding bucket, so it is not a big overhead even when n is high.

The problem with hash tables: if the hash function is not good (more collisions happen), then the search is no longer O(1). It tends toward O(n), because searching within a bucket is a linear search. That can be worse than a binary tree.

The problem with binary trees: if the tree isn't balanced, it also tends toward O(n). For example, inserting 1, 2, 3, 4, 5 into a binary tree will more likely end up as a list (see the sketch below).

So, if you can come up with a good hashing methodology, use a hashtable. If not, you are better off using a binary tree.
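
To illustrate the unbalanced-tree case (a minimal sketch, not a production BST; the names are mine):

// Inserting already-ordered keys (1, 2, 3, 4, 5) into a plain, unbalanced BST
// produces a right-leaning chain (effectively a linked list), so lookups
// degrade from O(log n) to O(n).
class Node
{
    public int Key;
    public Node Left, Right;
}

static Node Insert(Node root, int key)
{
    if (root == null) return new Node { Key = key };
    if (key < root.Key) root.Left = Insert(root.Left, key);
    else root.Right = Insert(root.Right, key);
    return root;
}

// Node root = null;
// foreach (var k in new[] { 1, 2, 3, 4, 5 }) root = Insert(root, k);
// root.Right.Right.Right.Right now holds 5: a four-link walk for five items.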

A Dictionary/Hashtable uses more memory and takes more time to populate compared to an array, but the search is faster with the Dictionary than with a binary search within the array.

Here are the numbers for populating and searching 10 million Int64 items, plus sample code you can run yourself.

Dictionary Memory: 462,836
Array Memory: 88,376
Populate Dictionary: 402
Populate Array: 23
Search Dictionary: 176
Search Array: 680

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace BinaryVsDictionary
{
    internal class Program
    {
        private const long Capacity = 10000000;

        private static readonly Dictionary<long, long> Dict = new Dictionary<long, long>(Int16.MaxValue);
        private static readonly long[] Arr = new long[Capacity];

        private static void Main(string[] args)
        {
            Stopwatch stopwatch = new Stopwatch();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                Dict.Add(i, i);
            }

            stopwatch.Stop();

            Console.WriteLine("Populate Dictionary: " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();
            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                Arr[i] = i;
            }

            stopwatch.Stop();

            Console.WriteLine("Populate Array:      " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();
            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                long value = Dict[i];
                // Console.WriteLine(value + " : " + RandomNumbers[i]);
            }

            stopwatch.Stop();

            Console.WriteLine("Search Dictionary:   " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();
            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                long value = BinarySearch(Arr, 0, Capacity - 1, i); // the search range is inclusive, so the last index is Capacity - 1
                // Console.WriteLine(value + " : " + RandomNumbers[i]);
            }

            stopwatch.Stop();

            Console.WriteLine("Search Array:        " + stopwatch.ElapsedMilliseconds);

            Console.ReadLine();
        }

        /* Binary search over the inclusive range [low, hi]; returns ~low if not found. */
        private static long BinarySearch(long[] arr, long low, long hi, long value)
        {
            while (low <= hi)
            {
                long median = low + ((hi - low) >> 1);

                if (arr[median] == value)
                {
                    return median;
                }

                if (arr[median] < value)
                {
                    low = median + 1;
                }
                else
                {
                    hi = median - 1;
                }
            }

            return ~low;
        }
    }
}

This is more of a comment on Bill's answer, because his answer has so many upvotes even though it's wrong. So I had to post this.

I see lots of discussion about what is the worst case complexity of a lookup in hashtable, and what is considered amortized analysis / what is not. Please check the link below

Hash table runtime complexity (insert, search and delete)

The worst-case complexity is O(n), not O(1), contrary to what Bill says. And thus his O(1) complexity is not amortized, since that analysis can only be applied to worst cases (his own Wikipedia link says so as well).

https://en.wikipedia.org/wiki/Hash_table

https://en.wikipedia.org/wiki/Amortized_analysis

This question is more complicated than a matter of pure algorithm performance. If we remove the factor that the binary search algorithm is more cache friendly, the hash lookup is faster in the general sense. The best way to figure it out is to build a program and disable the compiler optimization options; then we would find that the hash lookup is faster, given that its time efficiency is O(1) in the general sense.

But when you enable compiler optimization and try the same test with a smaller sample count, say fewer than 10,000 items, the binary search outperforms the hash lookup by taking advantage of its cache-friendly data structure.