为什么一个 Python 命令可以有多个具有相同散列的键?

我试图理解底层的 Python hash函数。我创建了一个自定义类,其中所有实例返回相同的散列值。

class C:
def __hash__(self):
return 42

我只是假设上述类的一个实例在任何时候都可以位于 dict中,但实际上 dict可以有多个具有相同散列的元素。

c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'}
# note that the dict has 2 elements

我进行了更多的实验,发现如果我重写 __eq__方法,使类的所有实例比较相等,那么 dict只允许一个实例。

class D:
def __hash__(self):
return 42
def __eq__(self, other):
return True


p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x7f0823a9af40>: 'q'}
# note that the dict only has 1 element

因此,我很好奇 dict如何能够有多个具有相同散列的元素。

41416 次浏览

编辑 : 下面的答案是处理 hash 冲突的一种可能的方法,但是它是如何使用 Python 的 没有。下面引用的 Python 的 wiki 也是不正确的。下面@Duncan 给出的最好的来源是实现本身: https://github.com/python/cpython/blob/master/Objects/dictobject.c我为混淆道歉。


It stores a list (or bucket) of elements at the hash then iterates through that list until it finds the actual key in that list. A picture says more than a thousand words:

Hash table

这里您可以看到 John SmithSandra Dee都散列到 152。Bucket 152包含了这两个。当查找 Sandra Dee时,它首先在 bucket 152中找到该列表,然后循环遍历该列表,直到找到 Sandra Dee并返回 521-6955

下面是错误的,它只是在这里的上下文: 巨蟒的 wiki上你可以找到(伪?)编写 Python 如何执行查找的代码。

这个问题实际上有几种可能的解决方案,请查看维基百科的文章以获得一个很好的概述: http://en.wikipedia.org/wiki/Hash_table#Collision_resolution

哈希表,一般都要允许哈希冲突!你会不走运,两件事情最终会散列到同一件事情上。在下面,有一组具有相同散列键的项列表中的对象。通常,这个列表里只有一样东西,但是在这种情况下,它会把它们堆叠到同一个列表里。它知道它们不同的唯一方法是通过等于运算符。

当这种情况发生时,您的性能将随着时间的推移而降低,这就是为什么您希望散列函数尽可能“随机”的原因。

有关 Python 散列工作原理的详细描述,请参阅我对 为什么早回来比别人慢?的回答

基本上,它使用散列在表中选择一个槽。如果槽中有一个值并且散列匹配,它将比较项目以查看它们是否相等。

如果散列匹配但项不相等,则尝试另一个槽。有一个公式可以选择它(我在引用的答案中描述过) ,它会逐渐引入散列值中未使用的部分; 但是一旦它用完了所有这些部分,它最终会在散列表中的所有槽中工作。这保证了我们最终要么找到匹配的项目,要么找到一个空槽。当搜索找到一个空槽,它插入值或放弃(取决于我们是否添加或得到一个值)。

需要注意的重要一点是没有列表或桶: 只有一个具有特定数量槽的哈希表,每个哈希用于生成一系列候选槽。

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to 邓肯 for pointing out that Python dicts use slots and leading me down this rabbit hole.

  • Python 字典实现为 哈希桌
  • 散列表必须支持 散列冲突,也就是说,即使两个键具有相同的散列值,表的实现也必须有一个明确地插入和检索键和值对的策略。
  • Python dict 使用 公开发言解决散列冲突(如下所述)(参见 Dictoject.c: 296-297)。
  • Python 散列表只是一个连续的内存块(类似于数组,因此可以通过索引进行 O(1)查找)。
  • 表中的每个槽可以存储一个且只能存储一个条目。 这很重要
  • 表中的每个 进入实际上是三个值 -的组合。这是作为 C 结构实现的(参见 字典对象 h: 51-56)
  • 下图是一个 python 散列表的逻辑表示形式。在下面的图中,左边的0,1,... ,i,... 是散列表中 老虎机的索引(它们只是用于说明目的,显然不与表一起存储!).

    # Logical model of Python Hash table
    -+-----------------+
    0| <hash|key|value>|
    -+-----------------+
    1|      ...        |
    -+-----------------+
    .|      ...        |
    -+-----------------+
    i|      ...        |
    -+-----------------+
    .|      ...        |
    -+-----------------+
    n|      ...        |
    -+-----------------+
    
  • When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)

  • When adding entries to the table, we start with some slot, i that is based on the hash of the key. CPython uses initial i = hash(key) & mask. Where mask = PyDictMINSIZE - 1, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.
  • If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
  • If the slot is occupied, CPython (and even PyPy) compares the the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the key of the current entry to be inserted (dictobject.c:337,344-345). If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.
  • Probing just means it searches the slots by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.
  • The same thing happens for lookups, just starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don't match the entry in the slot, it starts probing, until it finds a slot with a match. If all slots are exhausted, it reports a fail.
  • BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)

There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (==) of the keys when inserting items. So in summary, if there are two keys, a and b and hash(a)==hash(b), but a!=b, then both can exist harmoniously in a Python dict. But if hash(a)==hash(b) and a==b, then they cannot both be in the same dict.

Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).

I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"

While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal, have same hash?

在线程中,当我们将 python 作为键放入 dictionary 中时,我没有看到 python 对用户定义类的实例具体做了什么。让我们阅读一些文档: 它声明只有散列对象可以用作键。哈希化是所有不可变的内置类和所有用户定义的类。

User-defined classes have __cmp__() and 默认情况下散列(hash)方法; 有了它们,所有对象 比较不平等(除了与自己)和 x.__hash__() returns a result derived from id(x).

因此,如果你有一个不断散列在您的类,但没有提供任何 cmp & # 95; 或 & # 95; & # 95; 等式 & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; & # 95; 那么你的所有。 另一方面,如果您提供任何 cmp 或 eq,但没有提供 hash,那么您的实例在字典方面仍然是不平等的。

class A(object):
def __hash__(self):
return 42




class B(object):
def __eq__(self, other):
return True




class C(A, B):
pass




dict_a = {A(): 1, A(): 2, A(): 3}
dict_b = {B(): 1, B(): 2, B(): 3}
dict_c = {C(): 1, C(): 2, C(): 3}


print(dict_a)
print(dict_b)
print(dict_c)

Output

{<__main__.A object at 0x7f9672f04850>: 1, <__main__.A object at 0x7f9672f04910>: 3, <__main__.A object at 0x7f9672f048d0>: 2}
{<__main__.B object at 0x7f9672f04990>: 2, <__main__.B object at 0x7f9672f04950>: 1, <__main__.B object at 0x7f9672f049d0>: 3}
{<__main__.C object at 0x7f9672f04a10>: 3}