Assuming I have a numpy array like:
[1,2,3,4,5,6]
and another array:
[0,0,1,2,2,1]
I want to sum the items in the first array by group (the second array) and obtain n-groups results in group number order (in this case the result would be [3, 9, 9]). How do I do this in numpy?

The numpy function `bincount` was made exactly for this purpose and I’m sure it will be much faster than the other methods for all sizes of inputs:

``````data = [1,2,3,4,5,6]
ids  = [0,0,1,2,2,1]

np.bincount(ids, weights=data) #returns [3,9,9] as a float64 array
``````

The i-th element of the output is the sum of all the `data` elements corresponding to “id” `i`.

Hope that helps.

This is a vectorized method of doing this sum based on the implementation of numpy.unique. According to my timings it is up to 500 times faster than the loop method and up to 100 times faster than the histogram method.

``````def sum_by_group(values, groups):
order = np.argsort(groups)
groups = groups[order]
values = values[order]
values.cumsum(out=values)
index = np.ones(len(groups), 'bool')
index[:-1] = groups[1:] != groups[:-1]
values = values[index]
groups = groups[index]
values[1:] = values[1:] - values[:-1]
return values, groups
``````

There’s more than one way to do this, but here’s one way:

``````import numpy as np
data = np.arange(1, 7)
groups = np.array([0,0,1,2,2,1])

unique_groups = np.unique(groups)
sums = []
for group in unique_groups:
sums.append(data[groups == group].sum())
``````

You can vectorize things so that there’s no for loop at all, but I’d recommend against it. It becomes unreadable, and will require a couple of 2D temporary arrays, which could require large amounts of memory if you have a lot of data.

Edit: Here’s one way you could entirely vectorize. Keep in mind that this may (and likely will) be slower than the version above. (And there may be a better way to vectorize this, but it’s late and I’m tired, so this is just the first thing to pop into my head…)

However, keep in mind that this is a bad example… You’re really better off (both in terms of speed and readability) with the loop above…

``````import numpy as np
data = np.arange(1, 7)
groups = np.array([0,0,1,2,2,1])

unique_groups = np.unique(groups)

# Forgive the bad naming here...
# I can't think of more descriptive variable names at the moment...
x, y = np.meshgrid(groups, unique_groups)
data_stack = np.tile(data, (unique_groups.size, 1))

data_in_group = np.zeros_like(data_stack)
data_in_group[x==y] = data_stack[x==y]

sums = data_in_group.sum(axis=1)
``````

If the groups are indexed by consecutive integers, you can abuse the `numpy.histogram()` function to get the result:

``````data = numpy.arange(1, 7)
groups = numpy.array([0,0,1,2,2,1])
sums = numpy.histogram(groups,
bins=numpy.arange(groups.min(), groups.max()+2),
weights=data)
# array([3, 9, 9])
``````

This will avoid any Python loops.

I tried scripts from everyone and my considerations are:

User Comment
Joe Will only work if you have few groups.
kevpie Too slow because of loops (this is not pythonic way).
Bi_Rico and Sven Nice performance, but will only work for Int32 (if the sum goes over 2^32/2 it will fail
Alex Is the fastest one, the best solution for sum.

But if you want more flexibility and the possibility to group by other statistics use SciPy:

``````import numpy as np
from scipy import ndimage

data = np.arange(10000000)
unique_groups = np.arange(1000)
groups = unique_groups.repeat(10000)

ndimage.sum(data, groups, unique_groups)
``````

This is good because you have many statistics to group (sum, mean, variance, …).

You’re all wrong! The best way to do it is:

``````a = [1,2,3,4,5,6]
ix = [0,0,1,2,2,1]
accum = np.zeros(np.max(ix)+1)
print accum
> array([ 3.,  9.,  9.])
``````

I noticed the `numpy` tag but in case you don’t mind using `pandas`, this task becomes an one-liner:

``````import pandas as pd
import numpy as np

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])

df = pd.DataFrame({'data': data, 'groups': groups})
``````

So `df` then looks like this:

``````   data  groups
0     1       0
1     2       0
2     3       1
3     4       2
4     5       2
5     6       1
``````

Now you can use the functions `groupby()` and `sum()`

``````print(df.groupby(['groups'], sort=False).sum())
``````

which gives you the desired output

``````        data
groups
0          3
1          9
2          9
``````

By default, the dataframe would be sorted, therefore I use the flag `sort=False` which might improve speed for huge dataframes.

I tried different methods to do this and I found that indeed using `np.bincount` is the fastest. See Alex’s answer

``````    import numpy as np
import random
import time

size = 10000
ngroups = 10

groups = np.random.randint(low=0,high=ngroups,size=size)
values = np.random.rand(size)

# Test 1
beg = time.time()
result = np.zeros(ngroups)
for i in range(size):
result[groups[i]] += values[i]
print('Test 1 took:',time.time()-beg)

# Test 2
beg = time.time()
result = np.zeros(ngroups)
for g,v in zip(groups,values):
result[g] += v
print('Test 2 took:',time.time()-beg)

# Test 3
beg = time.time()
result = np.zeros(ngroups)
for g in np.unique(groups):
wh = np.where(groups == g)
result[g] = np.sum(values[wh])
print('Test 3 took:',time.time()-beg)

# Test 4
beg = time.time()
result = np.zeros(ngroups)
for g in np.unique(groups):
wh = groups == g
result[g] = np.sum(values, where = wh)
print('Test 4 took:',time.time()-beg)

# Test 5
beg = time.time()
result = np.array([np.sum(values[np.where(groups == g)]) for g in np.unique(groups) ])
print('Test 5 took:',time.time()-beg)

# Test 6
beg = time.time()
result = np.array([np.sum(values, where = groups == g) for g in np.unique(groups) ])
print('Test 6 took:',time.time()-beg)

# Test 7
beg = time.time()
result = np.bincount(groups, weights = values)
print('Test 7 took:',time.time()-beg)
``````

Results:

``````    Test 1 took: 0.005615234375
Test 2 took: 0.004812002182006836
Test 3 took: 0.0006084442138671875
Test 4 took: 0.0005099773406982422
Test 5 took: 0.000499725341796875
Test 6 took: 0.0004980564117431641
Test 7 took: 1.9073486328125e-05
``````

A pure python implementation:

``````l = [1,2,3,4,5,6]
g = [0,0,1,2,2,1]

from itertools import izip
from operator import itemgetter
from collections import defaultdict

def group_sum(l, g):
groups = defaultdict(int)
for li, gi in izip(l, g):
groups[gi] += li
return map(itemgetter(1), sorted(groups.iteritems()))

print group_sum(l, g)

[3, 9, 9]
``````

Also, note for Alex’s answer:

``````data = [1,2,3,4,5,6]
ids  = [0,0,1,2,2,1]
np.bincount(ids, weights=data) #returns [3,9,9] as a float64 array
``````

In case your indexes are not consecutive you might get stuck thinking why you keep getting a lot of zeros.

For instance:

``````data = [1,2,3,4,5,6]
ids  = [1,1,3,5,5,3]
np.bincount(ids, weights=data)
``````

will give you:

``````array([0, 3, 0, 9, 0, 9])
``````

which obviously means it builds all unique bins from 0 to max id in the list. And then return sums for each bin.

Here’s a method that works for summing objects of any dimension, grouped by values of any type (not only int):

``````grouping = np.array([1.1, 10, 1.1, 15])
to_sum = np.array([
[1, 0],
[0, 1],
[0.5, 0.3],
[2, 5],
])

groups, element_group_ixs = np.unique(grouping, return_inverse=True)
accum = np.zeros((groups.shape, *to_sum.shape[1:]))
``````groups = array([ 1.1, 10. , 15. ])