Merge and sum of two dictionaries

Each Answer to this Q is separated by one/two green lines.

I have a dictionary below, and I want to add to another dictionary with not necessarily distinct elements and merge it’s results. Is there any built-in function for this, or will I need to make my own?

{
  '6d6e7bf221ae24e07ab90bba4452267b05db7824cd3fd1ea94b2c9a8': 6,
  '7c4a462a6ed4a3070b6d78d97c90ac230330603d24a58cafa79caf42': 7,
  '9c37bdc9f4750dd7ee2b558d6c06400c921f4d74aabd02ed5b4ddb38': 9,
  'd3abb28d5776aef6b728920b5d7ff86fa3a71521a06538d2ad59375a': 15,
  '2ca9e1f9cbcd76a5ce1772f9b59995fd32cbcffa8a3b01b5c9c8afc2': 11
}

The number of elements in the dictionary is also unknown.

Where the merge considers two identical keys, the values of these keys should be summed instead of overwritten.

You didn’t say how exactly you want to merge, so take your pick:

x = {'both1': 1, 'both2': 2, 'only_x': 100}
y = {'both1': 10, 'both2': 20, 'only_y': 200}

print {k: x.get(k, 0) + y.get(k, 0) for k in set(x)}
print {k: x.get(k, 0) + y.get(k, 0) for k in set(x) & set(y)}
print {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}

Results:

{'both2': 22, 'only_x': 100, 'both1': 11}
{'both2': 22, 'both1': 11}
{'only_y': 200, 'both2': 22, 'both1': 11, 'only_x': 100}

You can perform +, -, &, and | (intersection and union) with collections.Counter().

We can do the following (Note: only positive count values will remain in the dictionary):

from collections import Counter

x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }

z = dict(Counter(x)+Counter(y))

print(z)
[out]:
{'both2': 22, 'only_x': 100, 'both1': 11, 'only_y': 200}

To address adding values where the result may be zero or negative, use Counter.update() for addition, and Counter.subtract() for subtraction:

x = {'both1':0, 'both2':2, 'only_x': 100 }
y = {'both1':0, 'both2': -20, 'only_y':200 }
xx = Counter(x)
yy = Counter(y)
xx.update(yy)
dict(xx)
[out]:
{'both2': -18, 'only_x': 100, 'both1': 0, 'only_y': 200}

Additional notes based on the answers of georg, NPE, Scott and Havok.

I was trying to perform this action on collections of 2 or more dictionaries and was interested in seeing the time it took for each. Because I wanted to do this on any number of dictionaries, I had to change some of the answers a bit. If anyone has better suggestions for them, feel free to edit.

Here’s my test method. I’ve updated it recently to include tests with MUCH larger dictionaries, and again to include Havok’s and Scott’s newer methods:

Firstly I used the following data:

import random

x = {'xy1': 1, 'xy2': 2, 'xyz': 3, 'only_x': 100}
y = {'xy1': 10, 'xy2': 20, 'xyz': 30, 'only_y': 200}
z = {'xyz': 300, 'only_z': 300}

small_tests = [x, y, z]

# 200,000 random 8 letter keys
keys = [''.join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8)) for _ in range(200000)]

a, b, c = {}, {}, {}

# 50/50 chance of a value being assigned to each dictionary, some keys will be missed but meh
for key in keys:
    if random.getrandbits(1):
        a[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        b[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        c[key] = random.randint(0, 1000)

large_tests = [a, b, c]

print("a:", len(a), "b:", len(b), "c:", len(c))
#: a: 100069 b: 100385 c: 99989

Now each of the methods:

from collections import defaultdict, Counter
from functools import reduce

def georg_method(tests):
    return {k: sum(t.get(k, 0) for t in tests) for k in set.union(*[set(t) for t in tests])}

def georg_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return {k: tests[0].get(k, 0) + tests[1].get(k, 0) + tests[2].get(k, 0) for k in set.union(*[set(t) for t in tests])}

def npe_method(tests):
    ret = defaultdict(int)
    for d in tests:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)

# Note: There is a bug with scott's method. See below for details.
# Scott included a similar version using counters that is fixed
# See the scott_update_method below
def scott_method(tests):
    return dict(sum((Counter(t) for t in tests), Counter()))

def scott_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return dict(Counter(tests[0]) + Counter(tests[1]) + Counter(tests[2]))

def scott_update_method(tests):
    ret = Counter()
    for test in tests:
        ret.update(test)
    return dict(ret)

def scott_update_method_static(tests):
    # If you know you will have exactly 3 dicts
    xx = Counter(tests[0])
    yy = Counter(tests[1])
    zz = Counter(tests[2])
    xx.update(yy)
    xx.update(zz)
    return dict(xx)

def havok_method(tests):
    def reducer(accumulator, element):
        for key, value in element.items():
            accumulator[key] = accumulator.get(key, 0) + value
        return accumulator
    return reduce(reducer, tests, {})

methods = {
    "georg_method": georg_method, "georg_method_nosum": georg_method_nosum,
    "npe_method": npe_method,
    "scott_method": scott_method, "scott_method_nosum": scott_method_nosum,
    "scott_update_method": scott_update_method, "scott_update_method_static": scott_update_method_static,
    "havok_method": havok_method
}

I also wrote a quick function find whatever differences there were between the lists. Unfortunately, that’s when I found the problem in Scott’s method, namely, if you have dictionaries that total to 0, the dictionary won’t be included at all because of how Counter() behaves when adding.

Test setup:

  • MacBook Pro (15-inch, Late 2016), 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3 RAM, running macOS Mojave Version 10.14.5
  • Python 3.6.5 via IPython 6.1.0

Finally, the results:

Results: Small Tests

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10000 method(small_tests)
#: Method: georg_method
#: 7.81 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: georg_method_nosum
#: 4.6 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: npe_method
#: 3.2 µs ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method
#: 24.9 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method_nosum
#: 18.9 µs ± 64.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method
#: 9.1 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method_static
#: 14.4 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: havok_method
#: 3.09 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Results: Large Tests

Naturally, couldn’t run anywhere near as many loops

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10 method(large_tests)
#: Method: georg_method
#: 347 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: georg_method_nosum
#: 280 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: npe_method
#: 119 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method
#: 324 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method_nosum
#: 289 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method
#: 123 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method_static
#: 136 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: havok_method
#: 103 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion

???????????????????????????????????????????????????????????????????
?                           ?       ?    Best of Time Per Loop    ?
?         Algorithm         ?  By   ???????????????????????????????
?                           ?       ?  small_tests ?  large_tests ?
???????????????????????????????????????????????????????????????????
? functools reduce          ? Havok ?       3.1 µs ?   103,000 µs ?
? defaultdict sum           ? NPE   ?       3.2 µs ?   119,000 µs ?
? Counter().update loop     ? Scott ?       9.1 µs ?   123,000 µs ?
? Counter().update static   ? Scott ?      14.4 µs ?   136,000 µs ?
? set unions without sum()  ? georg ?       4.6 µs ?   280,000 µs ?
? set unions with sum()     ? georg ?       7.8 µs ?   347,000 µs ?
? Counter() without sum()   ? Scott ?      18.9 µs ?   289,000 µs ?
? Counter() with sum()      ? Scott ?      24.9 µs ?   324,000 µs ?
???????????????????????????????????????????????????????????????????

Important. YMMV.

You could use defaultdict for this:

from collections import defaultdict

def dsum(*dicts):
    ret = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)

x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }

print(dsum(x, y))

This produces

{'both1': 11, 'both2': 22, 'only_x': 100, 'only_y': 200}

Another options using a reduce function. This allows to sum-merge an arbitrary collection of dictionaries:

from functools import reduce

collection = [
    {'a': 1, 'b': 1},
    {'a': 2, 'b': 2},
    {'a': 3, 'b': 3},
    {'a': 4, 'b': 4, 'c': 1},
    {'a': 5, 'b': 5, 'c': 1},
    {'a': 6, 'b': 6, 'c': 1},
    {'a': 7, 'b': 7},
    {'a': 8, 'b': 8},
    {'a': 9, 'b': 9},
]


def reducer(accumulator, element):
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator


total = reduce(reducer, collection, {})


assert total['a'] == sum(d.get('a', 0) for d in collection)
assert total['b'] == sum(d.get('b', 0) for d in collection)
assert total['c'] == sum(d.get('c', 0) for d in collection)

print(total)

Execution:

{'a': 45, 'b': 45, 'c': 3}

Advantages:

  • Simple, clear, Pythonic.
  • Schema-less, as long all keys are “sumable”.
  • O(n) temporal complexity and O(1) memory complexity.

d1 = {'apples': 2, 'banana': 1}
d2 = {'apples': 3, 'banana': 2}
merged = reduce(
    lambda d, i: (
        d.update(((i[0], d.get(i[0], 0) + i[1]),)) or d
    ),
    d2.iteritems(),
    d1.copy(),
)

There is also pretty simple replacement of dict.update():

merged = dict(d1, **d2)

class dict_merge(dict):
def __add__(self, other):
    result = dict_merge({})
    for key in self.keys():
        if key in other.keys():
            result[key] = self[key] + other[key]
        else:
            result[key] = self[key]
    for key in other.keys():
        if key in self.keys():
            pass
        else:
            result[key] = other[key]
    return result


a = dict_merge({"a":2, "b":3, "d":4})
b = dict_merge({"a":1, "b":2})
c = dict_merge({"a":5, "b":6, "c":5})
d = dict_merge({"a":8, "b":6, "e":5})

print((a + b + c +d))


>>> {'a': 16, 'b': 17, 'd': 4, 'c': 5, 'e': 5}

That is operator overloading. Using __add__, we have defined how to use the operator + for our dict_merge which inherits from the inbuilt python dict. You can go ahead and make it more flexible using a similar way to define other operators in this same class e.g. * with __mul__ for multiplying, or / with __div__ for dividing, or even % with __mod__ for modulo, and replacing the + in self[key] + other[key] with the corresponding operator, if you ever find yourself needing such merging.
I have only tested this as it is without other operators but I don’t foresee a problem with other operators. Just learn by trying.

A fairly simple approach:

from collections import Counter
from functools import reduce

data = [
  {'x': 10, 'y': 1, 'z': 100},
  {'x': 20, 'y': 2, 'z': 200},
  {'a': 10, 'z': 300}
]

result = dict(reduce(lambda x, y: Counter(x) + Counter(y), data))

TL;DR;

This code works for both the list of dicts and pandas series (when dicts are row items). It is super fast.


@Havok method is far the best method according to my tests, since some other tests also confirm this, I am not going to put test results here, but instead, I share my code in addition to Havok’s method.
So, the following code works for a list of dictionaries and also for the pandas series where each row has a dictionary.

from functools import reduce
def reducer(accumulator, element):
    """Set unions two dictionary keys, and sums their values if keys are same,
    see explanation here https://stackoverflow.com/a/46128481/2234161"""
    for key, value in element.items():
        if accumulator.get(key, 0)!=0 and not accumulator.get(key, 0):
            print("why not", accumulator.get(key, 0))
        elif not value:
            print("why not value",value)
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator

def sum_dicts(dicts_collection, init_dict = None):
    """
    For a given a collection of dictionaries, it sums values of the same keys
    :param dicts_collection: [list of dictonaries, it can be a pandas series where each column has a dictionary]
    :param init_dict: [if there is a initial dictionary where the dicts_collection will be added on], defaults to dict()
    """
    res=None
    if not init_dict:
        init_dict = dict()
    try:
        res = reduce(reducer, dicts_collection, init_dict)
    except Exception as ex:
        print(f"Error while reducing dict: {dicts_collection}", ex)
        raise ex
    return res



result_dict = sum_dicts(list_of_dicts_or_pandasSeries)

If you want to create a new dict as | use:

>>> dict({'a': 1,'c': 2}, **{'c': 1})
{'a': 1, 'c': 1}

The approach of Scott using collections.Counter is nice, but it has the disadvantage of not being usable with sum; also the need of taking care of negative or zero values is a bit counterintuitive to me, when you simply want to add values component-wise.

So I think, it might be a good idea to write a custom class for this. This has also been the idea of John Mutuma. However, I want to add my solution:

I create a class which behaves very much like a dict, passing basically all member calls to the underlying _data in the getatrr method. The only two things which are different, is:

  1. it has a DEFAULT_VALUE (similar to collections.defaultdict) which is used as value for non-existing keys.
  2. it implements an __add__() method which (together the __radd__() method) takes care of adding the dicts component-wise.
from typing import Union, Any


class AddableDict:
    DEFAULT_VALUE = 0

    def __init__(self, data: dict) -> None:
        self._data = data

    def __getattr__(self, attr: str) -> Any:
        return getattr(self._data, attr)

    def __getitem__(self, item) -> Any:
        try:
            return self._data[item]
        except KeyError:
            return self.DEFAULT_VALUE

    def __repr__(self):
        return self._data.__repr__()

    def __add__(self, other) -> "AddableDict":
        return AddableDict({
            key: self[key] + other[key]
            for key in set(self.keys()) | set(other.keys())
        })

    def __radd__(
        self, other: Union[int, "AddableDict"]
    ) -> "AddableDict":
        if other == 0:
            return self

That way we can add two those objects and sum iterables of those objects as well:

>>> alpha = AddableDict({"a": 1})
>>> beta = AddableDict({"a": 10, "b": 5})
>>> alpha + beta
{'a': 11, 'b': 5}

>>> sum([beta]*10)
{'a': 100, 'b': 50}

To my eye, this solution has the advantage of providing a simple and understandable interface for the developer to use. Of course, you can also inherit from dict instead of using composition.


The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .