Most efficient property to hash for a numpy array

I need to be able to store a numpy array in a dict for caching purposes. Hashing speed is important.

The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, since I'm only interested in the current value.

What should I hash in order to store it in a dict?

My current approach is to use str(arr.data), which is faster than md5 in my testing.


To give a sense of the relative times, here are a few examples quoted from the answers:

In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop


In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop


In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop


In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop


In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop

It appears that for this particular use case (small arrays of indices), arr.tostring offers the best performance.

While hashing the read-only buffer on its own is fast, the overhead of setting the writeable flag actually makes it slower.
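For reference, this is the caching pattern in question as a minimal sketch; `expensive` is a hypothetical stand-in for the computation being cached (note that in newer NumPy releases, tostring() is a deprecated alias of the equivalent tobytes()):

import numpy as np

_cache = {}

def cached(arr, expensive):
    key = arr.tostring()            # serialized raw contents; the dict hashes the bytes
    if key not in _cache:
        _cache[key] = expensive(arr)
    return _cache[key]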


What kind of data do you have?

  • array size
  • does an index occur several times in the array?

If your array consists only of a permutation of indices, you can use a base conversion:

(1, 0, 2) -> 1 * 3**0 + 0 * 3**1 + 2 * 3**2 = 19

and use 19 as the hash key via

import numpy as np

base_size = 3
base = base_size ** np.arange(base_size)        # digit weights [1, 3, 9]
max_base = (base * np.arange(base_size)).sum()  # largest possible key for a permutation: 21

arr = np.array([1, 0, 2])                       # the example permutation
hashed_array = (base * arr).sum()               # 1*1 + 0*3 + 2*9 = 19

Now you can use an array (shape=(max_base + 1,)) instead of a dict in order to access the values.
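A short usage sketch of that lookup-array idea (the stored value is just a placeholder):

values = np.empty(max_base + 1)   # dense lookup table in place of a dict
values[hashed_array] = 3.14       # store a value for the permutation (1, 0, 2)
result = values[hashed_array]     # O(1) retrieval with no hashing at all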

You can simply hash the underlying buffer, if you make it read-only:

>>> a = np.random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop

For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.

>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
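To make that caveat concrete, two different arrays can collide under hash(str(a)) because the printed form keeps only the edge elements (a sketch):

>>> a = np.zeros(100000)
>>> b = np.zeros(100000)
>>> b[50000] = 1.0                 # differs only in the middle
>>> str(a) == str(b)               # the truncated reprs are identical
True
>>> hash(str(a)) == hash(str(b))
True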

Coming late to the party, but for large arrays, I think a decent way to do it is to randomly subsample the matrix and hash that sample:

import numpy as np

def subsample_hash(a):
    # Fixed seed, so the same indices are sampled on every call.
    rng = np.random.RandomState(89)
    inds = rng.randint(low=0, high=a.size, size=1000)
    b = a.flat[inds]                 # fancy indexing copies the sampled values
    b.flags.writeable = False        # the buffer must be read-only to be hashable
    return hash(b.data)              # on Python 3, memoryview hashing is limited to
                                     # byte formats; hash(b.tobytes()) is the equivalent

I think this is better than doing hash(str(a)), because the latter could confuse arrays that have unique data in the middle but zeros around the edges.
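As a quick sanity check (assuming buffer hashing works on your Python/NumPy combination, as in the timings above): arrays with identical contents sample identically, and therefore hash identically, because the seed is fixed:

a = np.arange(100000)
b = a.copy()                                    # same values, different object
assert subsample_hash(a) == subsample_hash(b)   # fixed seed -> same sample -> same hash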

You can try xxhash via its Python binding. For large arrays this is much faster than hash(x.tostring()).

Example IPython session:

>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop

And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.

Nevertheless, hash collisions happen all the time. And if all you need is implementing __hash__ for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__ itself and let Python handle the hash collision[1].

[1] You may need to override __eq__ too, to help Python manage hash collision. You would want __eq__ to return a boolean, rather than an array of booleans as is done by numpy.
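A minimal sketch of such a wrapper, assuming the xxhash binding from above (the class name HashableArray is made up for illustration):

import numpy as np
import xxhash

class HashableArray:
    """Wrap a numpy array so it can serve as a dict or set key."""

    def __init__(self, arr):
        self.arr = arr

    def __hash__(self):
        # Fast, non-cryptographic hash of the raw bytes.
        return xxhash.xxh64(self.arr.tobytes()).intdigest()

    def __eq__(self, other):
        # Return a single bool, not numpy's elementwise array of booleans.
        return isinstance(other, HashableArray) and np.array_equal(self.arr, other.arr)

cache = {HashableArray(np.arange(5)): 'cached value'}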

If your np.array() is small and in a tight loop, then one option is to skip hash() completely and just use np.array().data.tobytes() directly as your dict key:

grid  = np.array([[True, False, True], [False, False, True]])
key   = grid.data.tobytes()       # the raw bytes themselves serve as the dict key
cache = cache or {}
if key not in cache:
    cache[key] = function(grid)
return cache[key]