列表 < T > 。包含()很慢吗?

有人能解释一下为什么泛型 List.Contains()函数这么慢吗?

我有一个大约有一百万个数字的 List<long>,代码不断地检查这些数字中是否有一个特定的数字。

我尝试使用 Dictionary<long, byte>Dictionary.ContainsKey()函数做同样的事情,这比使用 List 快10-20倍。

当然,我真的不想为此目的使用 Dictionary,因为它本来就不应该被这样使用。

所以,这里真正的问题是,有没有任何替代 List<T>.Contains(),但没有怪异的 Dictionary<K,V>.ContainsKey()

59003 次浏览

Dictionary isn't that bad, because the keys in a dictionary are designed to be found fast. To find a number in a list it needs to iterate through the whole list.

Of course the dictionary only works if your numbers are unique and not ordered.

I think there is also a HashSet<T> class in .NET 3.5, it also allows only unique elements.

List.Contains is a O(n) operation.

Dictionary.ContainsKey is a O(1) operation, since it uses the hashcode of the objects as a key, which gives you a quicker search ability.

I don't think that it 's a good idea to scan through a List which contains a million entries to find a few entries.

Isn't it possible to save those millon entities into a RDBMS for instance, and perform queries on that database ?

If it is not possible, then I would use a Dictionary anyway if you want to do key-lookups.

A SortedList will be faster to search (but slower to insert items)

Why is a dictionary inappropriate?

To see if a particular value is in the list you need to walk the entire list. With a dictionary (or other hash based container) it's much quicker to narrow down the number of objects you need to compare against. The key (in your case, the number) is hashed and that gives the dictionary the fractional subset of objects to compare against.

If you are just checking for existence, HashSet<T> in .NET 3.5 is your best option - dictionary-like performance, but no key/value pair - just the values:

    HashSet<int> data = new HashSet<int>();
for (int i = 0; i < 1000000; i++)
{
data.Add(rand.Next(50000000));
}
bool contains = data.Contains(1234567); // etc

I'm using this in the Compact Framework where there is no support for HashSet, I have opted for a Dictionary where both strings are the value I am looking for.

It means I get list<> functionality with dictionary performance. It's a bit hacky, but it works.

This is not exactly an answer to your question, but I have a class that increases the performance of Contains() on a collection. I subclassed a Queue and added a Dictionary that maps hashcodes to lists of objects. The Dictionary.Contains() function is O(1) whereas List.Contains(), Queue.Contains(), and Stack.Contains() are O(n).

The value-type of the dictionary is a queue holding objects with the same hashcode. The caller can supply a custom class object that implements IEqualityComparer. You could use this pattern for Stacks or Lists. The code would need just a few changes.

/// <summary>
/// This is a class that mimics a queue, except the Contains() operation is O(1) rather     than O(n) thanks to an internal dictionary.
/// The dictionary remembers the hashcodes of the items that have been enqueued and dequeued.
/// Hashcode collisions are stored in a queue to maintain FIFO order.
/// </summary>
/// <typeparam name="T"></typeparam>
private class HashQueue<T> : Queue<T>
{
private readonly IEqualityComparer<T> _comp;
public readonly Dictionary<int, Queue<T>> _hashes; //_hashes.Count doesn't always equal base.Count (due to collisions)


public HashQueue(IEqualityComparer<T> comp = null) : base()
{
this._comp = comp;
this._hashes = new Dictionary<int, Queue<T>>();
}


public HashQueue(int capacity, IEqualityComparer<T> comp = null) : base(capacity)
{
this._comp = comp;
this._hashes = new Dictionary<int, Queue<T>>(capacity);
}


public HashQueue(IEnumerable<T> collection, IEqualityComparer<T> comp = null) :     base(collection)
{
this._comp = comp;


this._hashes = new Dictionary<int, Queue<T>>(base.Count);
foreach (var item in collection)
{
this.EnqueueDictionary(item);
}
}


public new void Enqueue(T item)
{
base.Enqueue(item); //add to queue
this.EnqueueDictionary(item);
}


private void EnqueueDictionary(T item)
{
int hash = this._comp == null ? item.GetHashCode() :     this._comp.GetHashCode(item);
Queue<T> temp;
if (!this._hashes.TryGetValue(hash, out temp))
{
temp = new Queue<T>();
this._hashes.Add(hash, temp);
}
temp.Enqueue(item);
}


public new T Dequeue()
{
T result = base.Dequeue(); //remove from queue


int hash = this._comp == null ? result.GetHashCode() : this._comp.GetHashCode(result);
Queue<T> temp;
if (this._hashes.TryGetValue(hash, out temp))
{
temp.Dequeue();
if (temp.Count == 0)
this._hashes.Remove(hash);
}


return result;
}


public new bool Contains(T item)
{ //This is O(1), whereas Queue.Contains is (n)
int hash = this._comp == null ? item.GetHashCode() : this._comp.GetHashCode(item);
return this._hashes.ContainsKey(hash);
}


public new void Clear()
{
foreach (var item in this._hashes.Values)
item.Clear(); //clear collision lists


this._hashes.Clear(); //clear dictionary


base.Clear(); //clear queue
}
}

My simple testing shows that my HashQueue.Contains() runs much faster than Queue.Contains(). Running the test code with count set to 10,000 takes 0.00045 seconds for the HashQueue version and 0.37 seconds for the Queue version. With a count of 100,000, the HashQueue version takes 0.0031 seconds whereas the Queue takes 36.38 seconds!

Here's my testing code:

static void Main(string[] args)
{
int count = 10000;


{ //HashQueue
var q = new HashQueue<int>(count);


for (int i = 0; i < count; i++) //load queue (not timed)
q.Enqueue(i);


System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
bool contains = q.Contains(i);
}
sw.Stop();
Console.WriteLine(string.Format("HashQueue, {0}", sw.Elapsed));
}


{ //Queue
var q = new Queue<int>(count);


for (int i = 0; i < count; i++) //load queue (not timed)
q.Enqueue(i);


System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
bool contains = q.Contains(i);
}
sw.Stop();
Console.WriteLine(string.Format("Queue,     {0}", sw.Elapsed));
}


Console.ReadLine();
}

I think I have the answer! Yes, it's true that Contains() on a list (array) is O(n), but if the array is short and you are using value types, it still should be quite fast. But using the CLR Profiler [free download from Microsoft], I discovered that Contains() is boxing values in order to compare them, which requires heap allocation, which is VERY expensive (slow). [Note: This is .Net 2.0; other .Net versions not tested.]

Here's the full story and solution. We have an enumeration called "VI" and made a class called "ValueIdList" which is an abstract type for a list (array) of VI objects. The original implementation was in the ancient .Net 1.1 days, and it used an encapsulated ArrayList. We discovered recently in http://blogs.msdn.com/b/joshwil/archive/2004/04/13/112598.aspx that a generic list (List<VI>) performs much better than ArrayList on value types (like our enum VI) because the values don't have to be boxed. It's true and it worked... almost.

The CLR Profiler revealed a surprise. Here's a portion of the Allocation Graph:

  • ValueIdList::Contains bool(VI) 5.5MB (34.81%)
  • Generic.List::Contains bool(<UNKNOWN>) 5.5MB (34.81%)
  • Generic.ObjectEqualityComparer<T>::Equals bool (<UNKNOWN> <UNKNOWN>) 5.5MB (34.88%)
  • Values.VI 7.7MB (49.03%)

As you can see, Contains() surprisingly calls Generic.ObjectEqualityComparer.Equals(), which apparently requires the boxing of a VI value, which requires expensive heap allocation. It's odd that Microsoft would eliminate boxing on the list, only to require it again for a simple operation such as this.

Our solution was to re-write the Contains() implementation, which in our case was easy to do since we were already encapsulating the generic list object (_items). Here's the simple code:

public bool Contains(VI id)
{
return IndexOf(id) >= 0;
}


public int IndexOf(VI id)
{
int i, count;


count = _items.Count;
for (i = 0; i < count; i++)
if (_items[i] == id)
return i;
return -1;
}


public bool Remove(VI id)
{
int i;


i = IndexOf(id);
if (i < 0)
return false;
_items.RemoveAt(i);


return true;
}

The comparison of VI values is now being done in our own version of IndexOf() which requires no boxing, and it's very fast. Our particular program sped up by 20% after this simple re-write. O(n)... no problem! Just avoid the wasted memory usage!