两本字典相交

我正在做一个反向索引的搜索程序。索引本身是一个字典,其键为术语,其值本身是短文档的字典,ID 号为键,其文本内容为值。

为了对两个术语执行“ AND”搜索,我需要交叉它们的发布列表(字典)。在 Python 中实现这一点的清晰(不一定过于聪明)方法是什么?我开始尝试与 iter漫长的道路:

p1 = index[term1]
p2 = index[term2]
i1 = iter(p1)
i2 = iter(p2)
while ...  # not sure of the 'iter != end 'syntax in this case
...
126661 次浏览

In general, to construct the intersection of dictionaries in Python, you can first use the & operator to calculate the intersection of sets of the dictionary keys (dictionary keys are set-like objects in Python 3):

dict_a = {"a": 1, "b": 2}
dict_b = {"a": 2, "c": 3}


intersection = dict_a.keys() & dict_b.keys()  # {'a'}

On Python 2 you have to convert the dictionary keys to sets yourself:

keys_a = set(dict_a.keys())
keys_b = set(dict_b.keys())
intersection = keys_a & keys_b

Then given the intersection of the keys, you can then build the intersection of your values however is desired. You have to make a choice here, since the concept of set intersection doesn't tell you what to do if the associated values differ. (This is presumably why the & intersection operator is not defined directly for dictionaries in Python).

In this case it sounds like your values for the same key would be equal, so you can just choose the value from one of the dictionaries:

dict_of_dicts_a = {"a": {"x":1}, "b": {"y":3}}
dict_of_dicts_b = {"a": {"x":1}, "c": {"z":4}}


shared_keys = dict_of_dicts_a.keys() & dict_of_dicts_b.keys()


# values equal so choose values from a:
dict_intersection = {k: dict_of_dicts_a[k] for k in shared_keys }  # {"a":{"x":1}}

Other reasonable methods of combining values would depend on the types of the values in your dictionaries, and what they represent. For example you might also want the union of values for shared keys of dictionaries of dictionaries. Since the union of dictionaries doesn't depend on the values, it is well defined, and in python you can get it using the | operator:

# union of values for each key in the intersection:
dict_intersection_2 = { k: dict_of_dicts_a[k] | dict_of_dicts_b[k] for k in shared_keys }

Which in this case, with identical dictionary values for key "a" in both, would be the same result.

Just wrap the dictionary instances with a simple class that gets both of the values you want

class DictionaryIntersection(object):
def __init__(self,dictA,dictB):
self.dictA = dictA
self.dictB = dictB


def __getitem__(self,attr):
if attr not in self.dictA or attr not in self.dictB:
raise KeyError('Not in both dictionaries,key: %s' % attr)


return self.dictA[attr],self.dictB[attr]


x = {'foo' : 5, 'bar' :6}
y = {'bar' : 'meow' , 'qux' : 8}


z = DictionaryIntersection(x,y)


print z['bar']

A little known fact is that you don't need to construct sets to do this:

In Python 2:

In [78]: d1 = {'a': 1, 'b': 2}


In [79]: d2 = {'b': 2, 'c': 3}


In [80]: d1.viewkeys() & d2.viewkeys()
Out[80]: {'b'}

In Python 3 replace viewkeys with keys; the same applies to viewvalues and viewitems.

From the documentation of viewitems:

In [113]: d1.viewitems??
Type:       builtin_function_or_method
String Form:<built-in method viewitems of dict object at 0x64a61b0>
Docstring:  D.viewitems() -> a set-like object providing a view on D's items

For larger dicts this also slightly faster than constructing sets and then intersecting them:

In [122]: d1 = {i: rand() for i in range(10000)}


In [123]: d2 = {i: rand() for i in range(10000)}


In [124]: timeit d1.viewkeys() & d2.viewkeys()
1000 loops, best of 3: 714 µs per loop


In [125]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2


1000 loops, best of 3: 805 µs per loop


For smaller `dict`s `set` construction is faster:


In [126]: d1 = {'a': 1, 'b': 2}


In [127]: d2 = {'b': 2, 'c': 3}


In [128]: timeit d1.viewkeys() & d2.viewkeys()
1000000 loops, best of 3: 591 ns per loop


In [129]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2


1000000 loops, best of 3: 477 ns per loop

We're comparing nanoseconds here, which may or may not matter to you. In any case, you get back a set, so using viewkeys/keys eliminates a bit of clutter.

In [1]: d1 = {'a':1, 'b':4, 'f':3}


In [2]: d2 = {'a':1, 'b':4, 'd':2}


In [3]: d = {x:d1[x] for x in d1 if x in d2}


In [4]: d
Out[4]: {'a': 1, 'b': 4}

Okay, here is a generalized version of code above in Python3. It is optimized to use comprehensions and set-like dict views which are fast enough.

Function intersects arbitrary many dicts and returns a dict with common keys and a set of common values for each common key:

def dict_intersect(*dicts):
comm_keys = dicts[0].keys()
for d in dicts[1:]:
# intersect keys first
comm_keys &= d.keys()
# then build a result dict with nested comprehension
result = {key:{d[key] for d in dicts} for key in comm_keys}
return result

Usage example:

a = {1: 'ba', 2: 'boon', 3: 'spam', 4:'eggs'}
b = {1: 'ham', 2:'baboon', 3: 'sausages'}
c = {1: 'more eggs', 3: 'cabbage'}


res = dict_intersect(a, b, c)
# Here is res (the order of values may vary) :
# {1: {'ham', 'more eggs', 'ba'}, 3: {'spam', 'sausages', 'cabbage'}}

Here the dict values must be hashable, if they aren't you could simply change set parentheses { } to list [ ]:

result = {key:[d[key] for d in dicts] for key in comm_keys}

In Python 3, you can use

intersection = dict(dict1.items() & dict2.items())
union = dict(dict1.items() | dict2.items())
difference = dict(dict1.items() ^ dict2.items())

Your question isn't precise enough to give single answer.

1. Key Intersection

If you want to intersect IDs from posts (credits to James) do:

common_ids = p1.keys() & p2.keys()

However if you want to iterate documents you have to consider which post has a priority, I assume it's p1. To iterate documents for common_ids, collections.ChainMap will be most useful:

from collections import ChainMap
intersection = {id: document
for id, document in ChainMap(p1, p2)
if id in common_ids}
for id, document in intersection:
...

Or if you don't want to create separate intersection dictionary:

from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
document = posts[id]

2. Items Intersection

If you want to intersect items of both posts, which means to match IDs and documents, use code below (credits to DCPY). However this is only useful if you're looking for duplicates in terms.

duplicates = dict(p1.items() & p2.items())
for id, document in duplicates:
...

3. Iterate over p1 'AND' p2.

In case when by "'AND' search" and using iter you meant to search both posts then again collections.ChainMap is the best to iterate over (almost) all items in multiple posts:

from collections import ChainMap
for id, document in ChainMap(p1, p2):
...
def two_keys(term_a, term_b, index):
doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
doc_store = index[term_a] # index[term_b] would work also
return {doc_id: doc_store[doc_id] for doc_id in doc_ids}


def n_keys(terms, index):
doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
doc_store = index[term[0]]
return {doc_id: doc_store[doc_id] for doc_id in doc_ids}


In [0]: index = {'a': {1: 'a b'},
'b': {1: 'a b'}}


In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}


In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}

I would recommend changing your index from

index = {term: {doc_id: doc}}

to two indexes one for the terms and then a separate index to hold the values

term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}

that way you don't store multiple copies of the same data

To find full intersection by keys and values

d1 = {'a':1}
d2 = {'b':2, 'a':1}
{x:d1[x] for x in d1 if x in d2 and d1[x] == d2[x]}


>> {'a':1}

None of the solutions so far solve the general case of intersecting N dictionaries.

So, if you want to handle the intersection of N arbitrary dictionaries:

from functools import reduce


def dict_intersection(*dict_list):
return reduce(lambda a,b: dict(a.items() & b.items()), dict_list)


a = {k:k for k in range(0,5)} # {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
b = {k:k for k in range(2,7)} # {2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
c = {k:k for k in range(3,8)} # {3: 3, 4: 4, 5: 5, 6: 6, 7: 7}


dict_intersection(a,b,c)  # {3:3, 4:4}
# or if you have a list of dicts
dicts = [{k:k for k in range(0+n,5+n)} for n in (0,2,3)] # == [a,b,c]
dict_intersection(*dicts) # {3:3, 4:4}

Using functools.reduce allows for completing the operation within a single iteration over the list of dictionaries instead of the multiple loops in some solutions. It also doesn't perform any additional conditional statements.

Trade-offs

Changing dict_intersection_v1 to dict_intersection_v2 we can see it performs faster for a larger lists of dictionaries and/or dictionaries (designing a proper experiment to test out which is a larger factor is outside the scope of this solution). This performance gain is due to reducing the amount of dictionary instantiations.

def dict_intersection_v1(*dict_list):
return reduce(lambda a,b: dict(a.items() & b.items()),  dict_list)


def dict_intersection_v2(*dict_list):
return dict(reduce(lambda a,b: a & b, (d.items() for d in dict_list)))


dict_lst1 = [{k:k for k in range(0+n,5+n)} for n in (0,2,3)] # = [a,b,c]
dict_lst2 = [{k:k for k in range(0,50,n)} for n in range(1,5)]]
dict_lst3 = [{k:k for k in range(0,500,n)} for n in range(40)]
dict_lst4 = [{k:k for k in range(0+n,500+n)} for n in range(400)]
dict list kv pair count dict_intersection_v1 dict_intersection_v2 relative difference
1 15 808 ns ± 4.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 821 ns ± 0.785 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) + 1.6%
2 105 3.14 µs ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 2.38 µs ± 5.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) -24.2%
3 2155 36.9 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 25.1 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) -32.0%
4 200_000 9.08 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 4.88 ms ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) -46.3%

The regression for result dict_lst1 is mainly due to difference in overhead between creating a dictionary after every intersection and the overhead due to dict.items() calls within the generator (and python's general function call overhead).

NB: I did test using the pre-calculated list of dict.items() for the for a dictionary instead of v2's building the generator on the fly.

I tested both passing in the pre-calculated list outside of the timings and within the timings, and, while it's statistically significant, it's less than 30 μs and 10 μs respectively. If you are trying to get these gains looking at a different language or Cython might be better.