Merge and sum of two dictionaries

I have the dictionary below, and I want to add it to another dictionary whose elements are not necessarily distinct from it, merging the results. Is there any built-in function for this, or will I need to make my own?

{
    '6d6e7bf221ae24e07ab90bba4452267b05db7824cd3fd1ea94b2c9a8': 6,
    '7c4a462a6ed4a3070b6d78d97c90ac230330603d24a58cafa79caf42': 7,
    '9c37bdc9f4750dd7ee2b558d6c06400c921f4d74aabd02ed5b4ddb38': 9,
    'd3abb28d5776aef6b728920b5d7ff86fa3a71521a06538d2ad59375a': 15,
    '2ca9e1f9cbcd76a5ce1772f9b59995fd32cbcffa8a3b01b5c9c8afc2': 11
}

The number of elements in the dictionary is also unknown.

If the merge encounters two identical keys, the values for those keys should be summed rather than overwritten.


You didn't say how exactly you want to merge, so take your pick:

x = {'both1': 1, 'both2': 2, 'only_x': 100}
y = {'both1': 10, 'both2': 20, 'only_y': 200}


print({k: x.get(k, 0) + y.get(k, 0) for k in set(x)})
print({k: x.get(k, 0) + y.get(k, 0) for k in set(x) & set(y)})
print({k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)})

Results:

{'both2': 22, 'only_x': 100, 'both1': 11}
{'both2': 22, 'both1': 11}
{'only_y': 200, 'both2': 22, 'both1': 11, 'only_x': 100}

You could use defaultdict for this:

from collections import defaultdict


def dsum(*dicts):
    ret = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)


x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }


print(dsum(x, y))

This produces

{'both1': 11, 'both2': 22, 'only_x': 100, 'only_y': 200}

from functools import reduce

d1 = {'apples': 2, 'banana': 1}
d2 = {'apples': 3, 'banana': 2}

# Fold d2's items into a copy of d1, summing values for shared keys
merged = reduce(
    lambda d, i: (
        d.update(((i[0], d.get(i[0], 0) + i[1]),)) or d
    ),
    d2.items(),
    d1.copy(),
)

There is also a pretty simple replacement for dict.update():

merged = dict(d1, **d2)
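
Note that this overwrites the values of duplicate keys rather than summing them. A minimal sketch, using the d1 and d2 from above:

d1 = {'apples': 2, 'banana': 1}
d2 = {'apples': 3, 'banana': 2}

# d2's values win for shared keys; nothing is summed
merged = dict(d1, **d2)
print(merged)
#: {'apples': 3, 'banana': 2}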

You can perform +, -, &, and | (intersection and union) with collections.Counter().

We can do the following (Note: only positive count values will remain in the dictionary):

from collections import Counter


x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }


z = dict(Counter(x)+Counter(y))


print(z)
[out]:
{'both2': 22, 'only_x': 100, 'both1': 11, 'only_y': 200}
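
For reference, a minimal sketch of the & (minimum of counts) and | (maximum of counts) operations with the same x and y:

print(dict(Counter(x) & Counter(y)))
#: {'both1': 1, 'both2': 2}
print(dict(Counter(x) | Counter(y)))
#: {'both1': 10, 'both2': 20, 'only_x': 100, 'only_y': 200}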

To address adding values where the result may be zero or negative, use Counter.update() for addition, and Counter.subtract() for subtraction:

x = {'both1':0, 'both2':2, 'only_x': 100 }
y = {'both1':0, 'both2': -20, 'only_y':200 }
xx = Counter(x)
yy = Counter(y)
xx.update(yy)
dict(xx)
[out]:
{'both2': -18, 'only_x': 100, 'both1': 0, 'only_y': 200}
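
And a minimal sketch of the subtraction case with Counter.subtract(), which likewise keeps zero and negative results:

xx = Counter({'both1': 0, 'both2': 2, 'only_x': 100})
yy = Counter({'both1': 0, 'both2': -20, 'only_y': 200})
xx.subtract(yy)
print(dict(xx))
#: {'both1': 0, 'both2': 22, 'only_x': 100, 'only_y': -200}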

If you want to create a new dict as a union where the second dict's values win (analogous to the | operator), use:

>>> dict({'a': 1,'c': 2}, **{'c': 1})
{'a': 1, 'c': 1}

Additional notes based on the answers of georg, NPE, Scott and Havok.

I was trying to perform this action on collections of 2 or more dictionaries and was interested in seeing the time it took for each. Because I wanted to do this on any number of dictionaries, I had to change some of the answers a bit. If anyone has better suggestions for them, feel free to edit.

Here's my test method. I've updated it recently to include tests with MUCH larger dictionaries, and again to include Havok's and Scott's newer methods:

Firstly I used the following data:

import random


x = {'xy1': 1, 'xy2': 2, 'xyz': 3, 'only_x': 100}
y = {'xy1': 10, 'xy2': 20, 'xyz': 30, 'only_y': 200}
z = {'xyz': 300, 'only_z': 300}


small_tests = [x, y, z]


# 200,000 random 8 letter keys
keys = [''.join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8)) for _ in range(200000)]


a, b, c = {}, {}, {}


# 50/50 chance of a value being assigned to each dictionary, some keys will be missed but meh
for key in keys:
    if random.getrandbits(1):
        a[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        b[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        c[key] = random.randint(0, 1000)


large_tests = [a, b, c]


print("a:", len(a), "b:", len(b), "c:", len(c))
#: a: 100069 b: 100385 c: 99989

Now each of the methods:

from collections import defaultdict, Counter
from functools import reduce


def georg_method(tests):
    return {k: sum(t.get(k, 0) for t in tests) for k in set.union(*[set(t) for t in tests])}


def georg_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return {k: tests[0].get(k, 0) + tests[1].get(k, 0) + tests[2].get(k, 0) for k in set.union(*[set(t) for t in tests])}


def npe_method(tests):
    ret = defaultdict(int)
    for d in tests:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)


# Note: There is a bug with Scott's method. See below for details.
# Scott included a similar version using counters that is fixed;
# see scott_update_method below.
def scott_method(tests):
    return dict(sum((Counter(t) for t in tests), Counter()))


def scott_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return dict(Counter(tests[0]) + Counter(tests[1]) + Counter(tests[2]))


def scott_update_method(tests):
    ret = Counter()
    for test in tests:
        ret.update(test)
    return dict(ret)


def scott_update_method_static(tests):
    # If you know you will have exactly 3 dicts
    xx = Counter(tests[0])
    yy = Counter(tests[1])
    zz = Counter(tests[2])
    xx.update(yy)
    xx.update(zz)
    return dict(xx)


def havok_method(tests):
    def reducer(accumulator, element):
        for key, value in element.items():
            accumulator[key] = accumulator.get(key, 0) + value
        return accumulator
    return reduce(reducer, tests, {})


methods = {
    "georg_method": georg_method, "georg_method_nosum": georg_method_nosum,
    "npe_method": npe_method,
    "scott_method": scott_method, "scott_method_nosum": scott_method_nosum,
    "scott_update_method": scott_update_method, "scott_update_method_static": scott_update_method_static,
    "havok_method": havok_method
}

I also wrote a quick function to find whatever differences there were between the results. Unfortunately, that's when I found the problem with Scott's method: if any key's values total to 0 (or less), that key won't be included in the result at all, because of how Counter() behaves when adding.
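
A minimal sketch of the issue:

from collections import Counter

# Keys whose values sum to zero (or below) are silently dropped by +
print(dict(Counter({'a': 1, 'b': 5}) + Counter({'a': -1, 'b': 5})))
#: {'b': 10}

# Counter.update() keeps them
c = Counter({'a': 1, 'b': 5})
c.update({'a': -1, 'b': 5})
print(dict(c))
#: {'a': 0, 'b': 10}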

Test setup:

  • MacBook Pro (15-inch, Late 2016), 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3 RAM, running macOS Mojave Version 10.14.5
  • Python 3.6.5 via IPython 6.1.0

Finally, the results:

Results: Small Tests

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10000 method(small_tests)
#: Method: georg_method
#: 7.81 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: georg_method_nosum
#: 4.6 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: npe_method
#: 3.2 µs ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method
#: 24.9 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method_nosum
#: 18.9 µs ± 64.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method
#: 9.1 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method_static
#: 14.4 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: havok_method
#: 3.09 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Results: Large Tests

Naturally, I couldn't run anywhere near as many loops:

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10 method(large_tests)
#: Method: georg_method
#: 347 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: georg_method_nosum
#: 280 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: npe_method
#: 119 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method
#: 324 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method_nosum
#: 289 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method
#: 123 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method_static
#: 136 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: havok_method
#: 103 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion

╔═══════════════════════════╦═══════╦═════════════════════════════╗
║                           ║       ║    Best of Time Per Loop    ║
║         Algorithm         ║  By   ╠══════════════╦══════════════╣
║                           ║       ║  small_tests ║  large_tests ║
╠═══════════════════════════╬═══════╬══════════════╬══════════════╣
║ functools reduce          ║ Havok ║       3.1 µs ║   103,000 µs ║
║ defaultdict sum           ║ NPE   ║       3.2 µs ║   119,000 µs ║
║ Counter().update loop     ║ Scott ║       9.1 µs ║   123,000 µs ║
║ Counter().update static   ║ Scott ║      14.4 µs ║   136,000 µs ║
║ set unions without sum()  ║ georg ║       4.6 µs ║   280,000 µs ║
║ set unions with sum()     ║ georg ║       7.8 µs ║   347,000 µs ║
║ Counter() without sum()   ║ Scott ║      18.9 µs ║   289,000 µs ║
║ Counter() with sum()      ║ Scott ║      24.9 µs ║   324,000 µs ║
╚═══════════════════════════╩═══════╩══════════════╩══════════════╝

Important. YMMV.

class dict_merge(dict):
    def __add__(self, other):
        result = dict_merge({})
        for key in self.keys():
            if key in other.keys():
                result[key] = self[key] + other[key]
            else:
                result[key] = self[key]
        for key in other.keys():
            if key not in self.keys():
                result[key] = other[key]
        return result




a = dict_merge({"a":2, "b":3, "d":4})
b = dict_merge({"a":1, "b":2})
c = dict_merge({"a":5, "b":6, "c":5})
d = dict_merge({"a":8, "b":6, "e":5})


print(a + b + c + d)
#: {'a': 16, 'b': 17, 'd': 4, 'c': 5, 'e': 5}

That is operator overloading. Using __add__, we have defined how the + operator works for dict_merge, which inherits from the built-in Python dict. You can go ahead and make it more flexible by defining other operators in this same class in a similar way, e.g. * with __mul__ for multiplying, / with __truediv__ for dividing, or even % with __mod__ for modulo, replacing the + in result[key] = self[key] + other[key] with the corresponding operator, if you ever find yourself needing such merging. I have only tested the + version, but I don't foresee a problem with other operators. Just learn by trying.
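
For instance, a minimal sketch of the multiplicative variant suggested above (dict_mul is a hypothetical name; keys present on only one side keep their value unchanged):

class dict_mul(dict):
    def __mul__(self, other):
        result = dict_mul({})
        for key in self:
            # Multiply shared keys, keep keys that only appear in self
            result[key] = self[key] * other[key] if key in other else self[key]
        for key in other:
            if key not in self:
                result[key] = other[key]
        return result


print(dict_mul({"a": 2, "b": 3}) * dict_mul({"a": 4, "c": 5}))
#: {'a': 8, 'b': 3, 'c': 5}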

Another option is using a reduce function. This allows you to sum-merge an arbitrary collection of dictionaries:

from functools import reduce


collection = [
    {'a': 1, 'b': 1},
    {'a': 2, 'b': 2},
    {'a': 3, 'b': 3},
    {'a': 4, 'b': 4, 'c': 1},
    {'a': 5, 'b': 5, 'c': 1},
    {'a': 6, 'b': 6, 'c': 1},
    {'a': 7, 'b': 7},
    {'a': 8, 'b': 8},
    {'a': 9, 'b': 9},
]




def reducer(accumulator, element):
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator




total = reduce(reducer, collection, {})




assert total['a'] == sum(d.get('a', 0) for d in collection)
assert total['b'] == sum(d.get('b', 0) for d in collection)
assert total['c'] == sum(d.get('c', 0) for d in collection)


print(total)

Execution:

{'a': 45, 'b': 45, 'c': 3}

Advantages:

  • Simple, clear, Pythonic.
  • Schema-less, as long as all values are "summable".
  • O(n) time complexity, with no intermediate copies: memory beyond the single accumulator dict is O(1).

The approach of Scott using collections.Counter is nice, but it has the disadvantage of not working with the built-in sum(); also, the need to take care of negative or zero values is a bit counterintuitive to me when you simply want to add values component-wise.
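
A minimal sketch of that sum() limitation (sum() starts from the int 0, which cannot be added to a Counter, so an explicit Counter() start value is required):

from collections import Counter

counters = [Counter({'a': 1}), Counter({'a': 10, 'b': 5})]

# sum(counters) raises TypeError: unsupported operand type(s) for +: 'int' and 'Counter'
print(dict(sum(counters, Counter())))
#: {'a': 11, 'b': 5}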

So I think it might be a good idea to write a custom class for this. This was also the idea of John Mutuma. However, I want to add my solution:

I create a class which behaves very much like a dict, passing basically all member calls to the underlying _data via the __getattr__ method. The only two differences are:

  1. it has a DEFAULT_VALUE (similar to collections.defaultdict) which is used as the value for non-existing keys.
  2. it implements an __add__() method which (together with the __radd__() method) takes care of adding the dicts component-wise.
from typing import Union, Any


class AddableDict:
    DEFAULT_VALUE = 0

    def __init__(self, data: dict) -> None:
        self._data = data

    def __getattr__(self, attr: str) -> Any:
        # Delegate everything else (keys(), items(), ...) to the wrapped dict
        return getattr(self._data, attr)

    def __getitem__(self, item) -> Any:
        try:
            return self._data[item]
        except KeyError:
            return self.DEFAULT_VALUE

    def __repr__(self):
        return self._data.__repr__()

    def __add__(self, other) -> "AddableDict":
        return AddableDict({
            key: self[key] + other[key]
            for key in set(self.keys()) | set(other.keys())
        })

    def __radd__(
        self, other: Union[int, "AddableDict"]
    ) -> "AddableDict":
        # sum() starts from 0, so 0 + AddableDict must return the dict unchanged
        if other == 0:
            return self
        return self.__add__(other)

That way we can add two of those objects, and also sum iterables of them:

>>> alpha = AddableDict({"a": 1})
>>> beta = AddableDict({"a": 10, "b": 5})
>>> alpha + beta
{'a': 11, 'b': 5}


>>> sum([beta]*10)
{'a': 100, 'b': 50}

To my eye, this solution has the advantage of providing a simple and understandable interface for the developer to use. Of course, you can also inherit from dict instead of using composition.
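
A minimal sketch of that inheritance alternative, assuming a hypothetical AddableDict2 name (dict.__missing__ plays the role of DEFAULT_VALUE):

class AddableDict2(dict):
    def __missing__(self, key):
        # Missing keys read as 0, so addition is total over both key sets
        return 0

    def __add__(self, other):
        return AddableDict2({k: self[k] + other[k] for k in set(self) | set(other)})

    def __radd__(self, other):
        # Lets sum() work, since sum() starts from the int 0
        return self if other == 0 else self.__add__(other)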

A fairly simple approach:

from collections import Counter
from functools import reduce


data = [
    {'x': 10, 'y': 1, 'z': 100},
    {'x': 20, 'y': 2, 'z': 200},
    {'a': 10, 'z': 300}
]


result = dict(reduce(lambda x, y: Counter(x) + Counter(y), data))
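
For reference, with the data above this yields (subject to the same Counter caveat that non-positive totals are dropped):

print(result)
#: {'x': 30, 'y': 3, 'z': 600, 'a': 10}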

TL;DR

This code works for both a list of dicts and a pandas Series whose row items are dicts. It is super fast.


@Havok's method is by far the best according to my tests, and since some other tests also confirm this, I am not going to put test results here. Instead, I share my code built on top of Havok's method. The following works for a list of dictionaries and also for a pandas Series where each row holds a dictionary.

from functools import reduce


def reducer(accumulator, element):
    """Set-unions two dictionaries' keys and sums their values when keys match,
    see explanation here https://stackoverflow.com/a/46128481/2234161"""
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator


def sum_dicts(dicts_collection, init_dict=None):
    """
    For a given collection of dictionaries, sums the values of identical keys.
    :param dicts_collection: list of dictionaries; can also be a pandas Series where each row holds a dictionary
    :param init_dict: optional initial dictionary that dicts_collection will be added onto, defaults to dict()
    """
    if init_dict is None:
        init_dict = dict()
    try:
        res = reduce(reducer, dicts_collection, init_dict)
    except Exception as ex:
        print(f"Error while reducing dicts: {dicts_collection}", ex)
        raise
    return res






result_dict = sum_dicts(list_of_dicts_or_pandasSeries)
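
For instance, a minimal usage sketch (the list_of_dicts_or_pandasSeries name above is a placeholder):

dicts = [{'a': 1, 'b': 1}, {'a': 2}, {'b': 3, 'c': 4}]
print(sum_dicts(dicts))
#: {'a': 3, 'b': 4, 'c': 4}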