Python 散列字典

作为一个练习,主要是为了我自己的娱乐,我正在实现一个回溯 Packrat 解析器。其灵感来自于我想更好地了解在类算法语言(与你通常能找到的语法自由的 lisp 方言相对应)中,生态宏是如何工作的。因此,不同的输入传递可能会看到不同的语法,所以缓存的解析结果是无效的,除非我还存储当前版本的语法和缓存的解析结果。(剪辑: 这种使用键值集合的结果是它们应该是不可变的,但我不打算公开接口以允许它们被更改,因此可变或不可变集合都可以)

The problem is that python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyways) doesn't help.

>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>

我想一直都是元组。现在 python 标准库提供了我所需要的大致内容,collections.namedtuple有一个非常不同的语法,但是 可以被用作一个键。继续上节课:

>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}

好吧。但是我必须为我想要使用的规则中的每个可能的键组合创建一个类,这也没有那么糟糕,因为每个解析规则都确切地知道它所使用的参数,这样就可以在定义解析规则的函数的同时定义类。

编辑: namedtuple的另一个问题是它们是严格位置的。看起来应该不同的两个元组实际上可以是相同的:

>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False

Tl‘ dr: 如何获得可以用作其他 dict的键的 dict

在对答案进行了一些修改之后,下面是我正在使用的更完整的解决方案。请注意,这会做一些额外的工作,使得出于实际目的,产生的字符模糊地不可变。当然,通过调用 dict.__setitem__(instance, key, value)来解决这个问题还是很容易的,但是我们都是成年人了。

class hashdict(dict):
"""
hashable dict implementation, suitable for use as a key into
other dicts.


>>> h1 = hashdict({"apples": 1, "bananas":2})
>>> h2 = hashdict({"bananas": 3, "mangoes": 5})
>>> h1+h2
hashdict(apples=1, bananas=3, mangoes=5)
>>> d1 = {}
>>> d1[h1] = "salad"
>>> d1[h1]
'salad'
>>> d1[h2]
Traceback (most recent call last):
...
KeyError: hashdict(bananas=3, mangoes=5)


based on answers from
http://stackoverflow.com/questions/1151658/python-hashable-dicts


"""
def __key(self):
return tuple(sorted(self.items()))
def __repr__(self):
return "{0}({1})".format(self.__class__.__name__,
", ".join("{0}={1}".format(
str(i[0]),repr(i[1])) for i in self.__key()))


def __hash__(self):
return hash(self.__key())
def __setitem__(self, key, value):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def __delitem__(self, key):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def clear(self):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def pop(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def popitem(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def setdefault(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def update(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
# update is not ok because it mutates the object
# __add__ is ok because it creates a new object
# while the new object is under construction, it's ok to mutate it
def __add__(self, right):
result = hashdict(self)
dict.update(result, right)
return result


if __name__ == "__main__":
import doctest
doctest.testmod()
85619 次浏览

Hashables should be immutable -- not enforcing this but TRUSTING you not to mutate a dict after its first use as a key, the following approach would work:

class hashabledict(dict):
def __key(self):
return tuple((k,self[k]) for k in sorted(self))
def __hash__(self):
return hash(self.__key())
def __eq__(self, other):
return self.__key() == other.__key()

如果你确实需要改变你的发音,并且仍然想用它们作为键,复杂性会爆炸成百倍——并不是说这不可能做到,但是我会等到一个非常具体的指示,然后再进入那个令人难以置信的沼泽!-)

Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding in another dictionary for obvious reasons.

class hashabledict(dict):
def __hash__(self):
return hash(tuple(sorted(self.items())))

一个相当干净、简单的实现是

import collections


class FrozenDict(collections.Mapping):
"""Don't forget the docstrings!!"""


def __init__(self, *args, **kwargs):
self._d = dict(*args, **kwargs)


def __iter__(self):
return iter(self._d)


def __len__(self):
return len(self._d)


def __getitem__(self, key):
return self._d[key]


def __hash__(self):
return hash(tuple(sorted(self._d.iteritems())))

您可能还需要添加这两个方法,以使 v2 pickling 协议能够与 hashdict 实例一起工作。否则 cPickle 将尝试使用 hashdict。_ _ _ _ setitem _ _ _ _ 导致 TypeError。有趣的是,对于协议的其他两个版本,您的代码工作得很好。

def __setstate__(self, objstate):
for k,v in objstate.items():
dict.__setitem__(self,k,v)
def __reduce__(self):
return (hashdict, (), dict(self),)

给定的答案是可以的,但是可以通过使用 frozenset(...)而不是 tuple(sorted(...))来生成散列来改进:

>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023

The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it does not need to sort).

如果你不把数字放进字典,而且你永远不会丢失包含字典的变量,你可以这样做:

cache[id(rule)] = "whatever"

因为 id ()对于每个字典都是唯一的

编辑:

哦,对不起,那样的话,其他人说的会更好。我认为你也可以把你的字典串行化成字符串,比如

cache[ 'foo:bar' ] = 'baz'

但是,如果你需要从键盘中恢复你的字典,那么你就必须做一些更丑陋的事情,比如

cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )

我想这样做的好处是您不必编写那么多的代码。

我一直在回到这个话题... 这里有另一个变化。我对于将 dict子类化以添加 __hash__方法感到不安; 实际上,我们无法逃避这样一个问题,即 dict 是可变的,并且相信它们不会改变似乎是一个薄弱的想法。因此,我转而考虑基于一种本身不可变的内置类型构建映射。虽然 tuple是一个显而易见的选择,但是访问其中的值意味着排序和对分; 这不是问题,但是它似乎并没有利用构建它所基于的类型的多大功能。

如果你把键,值对塞进一个 frozenset里会怎么样? 这需要什么,它是如何工作的?

第1部分,您需要一种方法来编码“ item’s”,这种方法使得 Frozenset 主要通过键来对待它们; 我将为此创建一个小的子类。

import collections
class pair(collections.namedtuple('pair_base', 'key value')):
def __hash__(self):
return hash((self.key, None))
def __eq__(self, other):
if type(self) != type(other):
return NotImplemented
return self.key == other.key
def __repr__(self):
return repr((self.key, self.value))

仅此一点就让你与一个不可变的映射相去甚远:

>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])

D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object; which one ends up in the return value is 未定义, we'll have to go through some more gyrations.

>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'

然而,以这种方式查找值是很麻烦的,更糟糕的是,会创建大量的中间集; 这是不行的!我们将创建一个“假”键值对来绕过它:

class Thief(object):
def __init__(self, key):
self.key = key
def __hash__(self):
return hash(pair(self.key, None))
def __eq__(self, other):
self.value = other.value
return pair(self.key, None) == other

其结果是问题较少:

>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'

这就是所有的深奥的魔法; 其余的就是把它包装成一个具有 接口的东西,就像一个游戏词典一样。因为我们是从 frozenset子类化,它有一个非常不同的接口,有相当多的方法; 我们从 collections.Mapping得到一点帮助,但大部分工作是重写 frozenset方法的版本,像 dicts,而不是:

class FrozenDict(frozenset, collections.Mapping):
def __new__(cls, seq=()):
return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
def __getitem__(self, key):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
raise KeyError(key)
def __eq__(self, other):
if not isinstance(other, FrozenDict):
return dict(self.iteritems()) == other
if len(self) != len(other):
return False
for key, value in self.iteritems():
try:
if value != other[key]:
return False
except KeyError:
return False
return True
def __hash__(self):
return hash(frozenset(self.iteritems()))
def get(self, key, default=None):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
return default
def __iter__(self):
for item in frozenset.__iter__(self):
yield item.key
def iteritems(self):
for item in frozenset.__iter__(self):
yield (item.key, item.value)
def iterkeys(self):
for item in frozenset.__iter__(self):
yield item.key
def itervalues(self):
for item in frozenset.__iter__(self):
yield item.value
def __contains__(self, key):
return frozenset.__contains__(self, pair(key, None))
has_key = __contains__
def __repr__(self):
return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
@classmethod
def fromkeys(cls, keys, value=None):
return cls((key, value) for key in keys)

这最终回答了我自己的问题:

>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5

要使字典可用,只需添加一个 _ _ hash _ _ 方法:

class Hashabledict(dict):
def __hash__(self):
return hash(frozenset(self))

Note, the 冻结 conversion will work for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.

If there are many dictionaries with identical keys but with distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:

class Hashabledict(dict):
def __hash__(self):
return hash((frozenset(self), frozenset(self.itervalues())))

这比 frozenset(self.iteritems())快有两个原因。首先,frozenset(self)步骤重用字典中存储的散列值,保存对 hash(key)的不必要调用。其次,使用 itervalues将直接访问这些值,并避免每次查找时 物品使用许多内存分配器调用在内存中形成新的许多键/值元组。

“未知”和“亚历克斯 · 马特利”给出的答案都是可接受的,但都受到以下限制:

  1. 字典的值必须是散列的。例如,hash(hashabledict({'a':[1,2]}))将引发 TypeError
  2. 键必须支持比较操作。例如,hash(hashabledict({'a':'a', 1:1}))将引发 TypeError
  3. 键上的比较运算符强制执行总排序。例如,如果字典中的两个键是 frozenset((1,2,3))frozenset((4,5,6)),它们在两个方向上比较不等。因此,使用这样的键对字典的项进行排序可能导致任意顺序,因此将违反相等对象必须具有相同散列值的规则。

@ ObenSonne 的快得多的答案解除了约束2和3,但仍然受约束1的约束(值必须是散列的)。

由@RaymondHettinger 提供的更快的答案解除了所有3个约束,因为它在散列计算中不包括 .values()。然而,只有在下列情况下,它的表现才算不错:

  1. 大多数(不等式)需要散列的字典都没有相同的 .keys()

如果不满足此条件,散列函数仍然有效,但可能会导致太多冲突。例如,在极端情况下,所有字典都是从网站模板(字段名称作为键,用户输入作为值)生成的,键总是相同的,散列函数将为所有输入返回相同的值。因此,依赖于这种散列函数的散列表在检索项时会变得和列表一样慢(O(N)而不是 O(1))。

我认为即使违反了上面列出的所有4个约束,下面的解决方案仍然可以很好地工作。它还有一个额外的优点,即它不仅可以散列字典,而且可以散列任何容器,即使它们具有嵌套的可变容器。

我非常感谢任何关于这个的反馈,因为到目前为止我只是简单地测试了一下。

# python 3.4
import collections
import operator
import sys
import itertools
import reprlib


# a wrapper to make an object hashable, while preserving equality
class AutoHash:
# for each known container type, we can optionally provide a tuple
# specifying: type, transform, aggregator
# even immutable types need to be included, since their items
# may make them unhashable


# transformation may be used to enforce the desired iteration
# the result of a transformation must be an iterable
# default: no change; for dictionaries, we use .items() to see values


# usually transformation choice only affects efficiency, not correctness


# aggregator is the function that combines all items into one object
# default: frozenset; for ordered containers, we can use tuple


# aggregator choice affects both efficiency and correctness
# e.g., using a tuple aggregator for a set is incorrect,
# since identical sets may end up with different hash values
# frozenset is safe since at worst it just causes more collisions
# unfortunately, no collections.ABC class is available that helps
# distinguish ordered from unordered containers
# so we need to just list them out manually as needed


type_info = collections.namedtuple(
'type_info',
'type transformation aggregator')


ident = lambda x: x
# order matters; first match is used to handle a datatype
known_types = (
# dict also handles defaultdict
type_info(dict, lambda d: d.items(), frozenset),
# no need to include set and frozenset, since they are fine with defaults
type_info(collections.OrderedDict, ident, tuple),
type_info(list, ident, tuple),
type_info(tuple, ident, tuple),
type_info(collections.deque, ident, tuple),
type_info(collections.Iterable, ident, frozenset) # other iterables
)


# hash_func can be set to replace the built-in hash function
# cache can be turned on; if it is, cycles will be detected,
# otherwise cycles in a data structure will cause failure
def __init__(self, data, hash_func=hash, cache=False, verbose=False):
self._data=data
self.hash_func=hash_func
self.verbose=verbose
self.cache=cache
# cache objects' hashes for performance and to deal with cycles
if self.cache:
self.seen={}


def hash_ex(self, o):
# note: isinstance(o, Hashable) won't check inner types
try:
if self.verbose:
print(type(o),
reprlib.repr(o),
self.hash_func(o),
file=sys.stderr)
return self.hash_func(o)
except TypeError:
pass


# we let built-in hash decide if the hash value is worth caching
# so we don't cache the built-in hash results
if self.cache and id(o) in self.seen:
return self.seen[id(o)][0] # found in cache


# check if o can be handled by decomposing it into components
for typ, transformation, aggregator in AutoHash.known_types:
if isinstance(o, typ):
# another option is:
# result = reduce(operator.xor, map(_hash_ex, handler(o)))
# but collisions are more likely with xor than with frozenset
# e.g. hash_ex([1,2,3,4])==0 with xor


try:
# try to frozenset the actual components, it's faster
h = self.hash_func(aggregator(transformation(o)))
except TypeError:
# components not hashable with built-in;
# apply our extended hash function to them
h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
if self.cache:
# storing the object too, otherwise memory location will be reused
self.seen[id(o)] = (h, o)
if self.verbose:
print(type(o), reprlib.repr(o), h, file=sys.stderr)
return h


raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))


def __hash__(self):
return self.hash_ex(self._data)


def __eq__(self, other):
# short circuit to save time
if self is other:
return True


# 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
# 2) any other situation => lhs.__eq__ will be called first


# case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
# => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
# case 2. neither side is a subclass of the other; self is lhs
# => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
# case 3. neither side is a subclass of the other; self is rhs
# => we can't compare to another type, and the other side already tried and failed;
# we should return False, but NotImplemented will have the same effect
# any other case: we won't reach the __eq__ code in this class, no need to worry about it


if isinstance(self, type(other)): # identifies case 1
return self._data == other._data
else: # identifies cases 2 and 3
return NotImplemented


d1 = {'a':[1,2], 2:{3:4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))


d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'),cache=True, verbose=True)
print(hash(d))

使用 json 包将 dict 序列化为字符串:

d = {'a': 1, 'b': 2}
s = json.dumps(d)

当你需要的时候恢复判决:

d2 = json.loads(s)