Enhancement: groupby function #7265
Comments
Agreed. Have you found or developed an unofficial solution?
It's worth taking a look at numpy.add.reduceat, which provides a very efficient loop for calculating sums, though you do need every group to be in order.
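For reference, a tiny illustration of that suggestion (my own example, assuming the values are already arranged in contiguous groups):

```python
import numpy as np

a = np.array([1., 2., 3., 10., 20., 30.])   # two contiguous groups
starts = np.array([0, 3])                   # start index of each group
np.add.reduceat(a, starts)                  # -> array([ 6., 60.])
```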
I thought Pandas prided itself on speedy groupby; perhaps you are doing something wrong. @jreback Thoughts?
I don't see any code, so it's impossible to tell what the user is doing wrong.
@shoyer, thanks, I will look into reduceat. Python 2.7, 64-bit. I was working with 2 GB of data (I have 16 GB of RAM) and the groupby consumed all OS resources. In other words, working memory shot up and all 12 cores were used. And runtime vs. my benchmark lagged by a factor of 3.
Pandas has developers who have more or less built a career on groupby; numpy doesn't. I think you'll have more luck working with them to fix your problem than with switching to some naive new implementation in numpy. :-)
I recurringly come back to the idea that adding a … I think it would be a neat feature, but I'm not sure it is worth the development and maintenance effort, given how several of the applications … I'm secretly hoping @shoyer has it figured out already by twisting a gufunc's arm in some incredibly smart way! ;-)
Someday hopefully we'll switch those operations to make them … Someday...
s/discussion/special/ # thanks autocorrect
Here's a pure NumPy version of a grouped sum. It doesn't yet support arbitrary axes, but that would be very easy to add. As you can see from the benchmark, it's significantly slower than pandas (~5x) for summing 1e7 elements in 10 groups. OTOH, if the groups are already sorted, it's significantly faster.
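As a sketch of what a sort-based pure-NumPy grouped sum along these lines could look like (my own reconstruction, not necessarily the code referred to above):

```python
import numpy as np

def groupby_sum(values, labels):
    # the sort is the expensive part; if the labels are pre-sorted it can be skipped
    order = np.argsort(labels, kind='mergesort')
    labels_sorted = labels[order]
    values_sorted = values[order]
    # first occurrence of each label in the sorted order marks a group boundary
    unique_labels, starts = np.unique(labels_sorted, return_index=True)
    sums = np.add.reduceat(values_sorted, starts)
    return unique_labels, sums

labels = np.random.randint(0, 10, size=10_000_000)
values = np.random.randn(labels.size)
keys, sums = groupby_sum(values, labels)
```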
ufunc.at also does something like it, but it is horribly slow until someone fixes that. reduceat unfortunately has some quirks as well; your code likely needs a check for empty groups.
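A small demonstration of the reduceat quirk mentioned above: for an "empty" group (where indices[i] >= indices[i+1]) the result is simply a[indices[i]], not an identity value, so such groups need special handling:

```python
import numpy as np

a = np.arange(8)
np.add.reduceat(a, [0, 4, 4, 6])   # -> array([ 6,  4,  9, 13]); the repeated 4 marks an empty group
```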
You have to be very cognizant that a groupby is two operations: a group indexer calculation, and then a reduction with those bins. Using @shoyer's example, both ops:

…

This is doing the factorizing & computing the group indexers:

…

This is a tuple of the …

So the actual bin reduction is oftentimes pretty cheap. Side note that the pandas impl is more like … Giving a sorted indexer is just avoiding most of the actual work in doing a groupby. Pandas doesn't support passing this from a user (it could easily, though), but generally you don't have this (in fact, quite a lot of the user API is about turning what you want to group by into the actual indexer).
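The two-step split described above can be sketched with public APIs (pd.factorize plus np.bincount here; an illustration only, not the pandas internals):

```python
import numpy as np
import pandas as pd

x = np.random.randint(0, 10, size=10_000_000)
v = np.random.randn(x.size)

# step 1: factorize -- turn the grouping values into integer codes (usually the expensive part)
codes, uniques = pd.factorize(x)

# step 2: bin reduction with those codes (often comparatively cheap)
sums = np.bincount(codes, weights=v)
means = sums / np.bincount(codes)
```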
@jreback you are absolutely right. In my pure numpy example, the main expense is the sort, which serves an equivalent purpose to the pandas factorize. This is pretty similar to what we see for most set operations in numpy compared to pandas (which uses a hash table, of course). So this is not a terribly attractive option for 1d arrays. But for grouping higher dimensional arrays along an axis, the expense of the computation dominates and this could make sense. I might actually switch to this approach in xarray, which currently uses very slow loops. It does make an additional copy of the array to sum with fancy indexing so it can assign the data to groups in order (which means extra memory consumption), but without benchmarking I'm not even sure if that's slower than the pandas approach of assigning in the order of the data.

EDIT: I did some benchmarking and it is indeed always slower to copy the full input array and then assign to groups in order, although the gap does start to get close (~50% slower) when there are a large number of distinct groups. So the bottom line is that the pandas approach looks strictly superior.
In pandas we have thought about lazily tracking whether a dimension is sorted (you could also track whether it has nulls, among other things); this would allow some additional perf on some operations. Might make sense in xarray. Postgres does things like this.
Hi, I started to make a sort of comparison with different types of grouping found in different places. So the solution was to not sort on matlab-time (even rounded to seconds), but it worked if I made a new decimal series to group by on, and then pandas did win by far!! (If I remember right I also tried to sort on a column with homemade integers representing the wanted grouping, and that one was slow as well.)

```python
import numpy as np
from scipy.ndimage import labeled_comprehension
import pandas as pd
import timeit
import sys

steps = 1
try:
    steps = int(sys.argv[1])
except:
    pass

t = np.linspace(0, 10 * steps, 51 * steps)
a = np.random.random(t.size)
a = np.array(range(t.size)) * 1.0
print('size: %s' % t.size)
print('best out of 2 repeats with 3 runs')


def timer(gb, t, a):
    tim = timeit.Timer('%s(t, a)' % gb, setup='from __main__ import %s, t, a' % gb)
    return min(tim.repeat(2, 3))


def groupby1_1d(t_, a_):
    # scipy labeled_comprehension
    tr = np.round(t_)
    unir = np.unique(tr)
    return unir, labeled_comprehension(a_, tr, unir, np.mean, np.float64, -1)


def groupby2_1d(t_, a_):
    # alternative that splits the array, but uses lists etc
    tr = np.round(t_)
    tr = np.sort(tr)
    unir = np.unique(tr)  # new time signal..., can't escape doing it
    idx = np.where(np.diff(tr) != 0)[0] + 1
    an = map(np.mean, np.array_split(a_, idx))
    return unir, np.array(tuple(an))
    # # small alternative, but not faster....
    # unir, idx = np.unique(tr, return_index=True)
    # an = map(np.mean, np.array_split(a_, idx[1:]))
    # return unir, np.array(tuple(an))


def groupby3_1d(t_, a_):
    # alternative with pandas
    tr = np.round(t_)
    df = pd.DataFrame(dict(idx=tr, v=a_))
    res = df.groupby('idx').mean()
    return res.index.values, res['v'].values


def groupby4_1d(t_, a_):
    # new home hacked version with loop over indexes
    tr = np.round(t_)
    tr = np.sort(tr)
    unir, idx = np.unique(tr, return_index=True)
    res = np.full_like(unir, np.nan)
    for i in range(idx.size - 1):
        res[i] = np.mean(a_[idx[i]:idx[i + 1]])
    res[-1] = np.mean(a_[idx[-1]:])
    return unir, res


result = []
opt = []
opa = []
failed = False
variants = ('groupby1_1d', 'groupby2_1d', 'groupby3_1d', 'groupby4_1d')

# take results to compare validity
for gb in variants:
    _t, _a = eval('%s(t, a)' % gb)
    opt.append(_t)
    opa.append(_a)

for i in range(len(opa) - 1):
    # print('compare output version, %s to %s, time & array' % (i + 1, i + 2))
    # print(np.allclose(opt[i], opt[i + 1]), '&', np.allclose(opa[i], opa[i + 1]))
    if not failed:
        failed = not np.allclose(opt[i], opt[i + 1]) or not np.allclose(opa[i], opa[i + 1])
if not failed:
    print('comparison of output gave %s' % (not failed))

for gb in variants:
    print('timing %s' % gb)
    result.append(timer(gb, t, a))

mi = min(result)
for gb, r in zip(variants, result):
    print('%s: %s sec, %0.f%%' % (gb, r, r / mi * 100))
```

Output (on my slow computer): …
I opened this issue before I had found the way forward with pandas, which seems to be sensitive to the actual data it groups on. However, the great pydata stack once again stepped up to the task and there was a quick way in the end, although it took me a while to figure it out. I will leave it up to @shoyer to close this issue (w/o action is fine by me). @jreback, to make pandas slow just do...

```python
def groupby3_1d(t_, a_):
    # alternative with pandas
    tr = np.round(t_) / 24 / 60 / 60 + 736046
    df = pd.DataFrame(dict(idx=tr, v=a_))
    res = df.groupby('idx').mean()
    return res.index.values, res['v'].values
```

At least on my computer it's so slow that I don't even have the patience to wait for a timing once the arrays are a bit bigger.
@ahed87 looks like you are trying to do some sort of time grouping or resampling. I cannot reproduce any slowness. Maybe a complete example w/o all of the boilerplate would help.
@jreback The script below gives the following output on my computer (steps = 3100): …

```python
import numpy as np
import pandas as pd
import time

steps = 3100
t = np.linspace(0, 10 * steps, 51 * steps)
a = np.random.random(t.size)
a = np.array(range(t.size)) * 1.0

ts = time.clock()
print(time.clock() - ts)

# grouping on rounded seconds
tr = np.round(t)
df = pd.DataFrame(dict(idx=tr, v=a))
res = df.groupby('idx').mean()
print(time.clock() - ts)

# grouping on matlab-time (rounded seconds as day fractions)
tr = np.round(t) / 24 / 60 / 60 + 736046
df = pd.DataFrame(dict(idx=tr, v=a))
res = df.groupby('idx').mean()
print(time.clock() - ts)
```
Is this your beginning time index: …?
Does this represent what you are trying to do? (note the …)
Beginning of time index: yepp (although symbolic here). The example you gave is very close; the only difference is the matlab time, but I suppose that's an easy fix.
Hm, I have looked at resample earlier, but shied away partly because one has to have a pandas time index, which I so far have had bad experiences with getting a quick conversion to, and partly because, at least up until now, I was not fully aware of the ms option on the resampling. Give me a day or two to see if I can get it fully working where I earlier was not able to.
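A sketch of the kind of conversion being discussed (my own assumptions: 719529 is the MATLAB datenum of 1970-01-01, and second resolution is enough for this data):

```python
import numpy as np
import pandas as pd

steps = 100
t = np.linspace(0, 10 * steps, 51 * steps)             # seconds
a = np.random.random(t.size)

matlab_datenum = np.round(t) / 24 / 60 / 60 + 736046   # same construction as in the snippet above
idx = pd.to_datetime(matlab_datenum - 719529, unit='D')
res = pd.Series(a, index=idx).resample('1s').mean()    # finer frequencies such as '1ms' also work
```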
:-) got caught up in the heat of the moment.
Because I stumbled over it while looking for an efficient, numpythonic way to do groupby, I want to drop some findings about performance considerations: https://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/
For everyone needing fast grouped calculations, please have a look at numpy_groupies. I was annoyed by all the slow grouping solutions a while ago myself, and wrote fast implementations with weave and numba to fix this issue. They perform about 30x faster than the ufuncs, and about 20x faster than pandas. Also, a couple of grouped operations can be done reasonably fast with pure numpy using bincount.

Important features: …
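As a quick pure-NumPy illustration of the bincount route mentioned above, a grouped mean (my own example, not taken from numpy_groupies):

```python
import numpy as np

labels = np.random.randint(0, 100, size=1_000_000)
values = np.random.randn(labels.size)

uniques, codes = np.unique(labels, return_inverse=True)
sums = np.bincount(codes, weights=values)
counts = np.bincount(codes)
means = sums / counts          # per-group mean, aligned with `uniques`
```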
Since np.unique sorts its output, here is a version that keeps the order of first appearance, plus a simple generator-based groupby built on it:

```python
def unique_nosort(ar):
    _, idx = np.unique(ar, return_index=True)
    u = ar[np.sort(idx)]
    return u


def groupby(x, groups, axis=0):
    for g in unique_nosort(groups):
        yield x.compress(g == groups, axis=axis)
```
And it would be great to have a version able to split multiple arrays by the same groups (not sure about the best way to pass the axis argument):

```python
def split_by(groups, *args):
    uniq_groups = unique_nosort(groups)
    for g in uniq_groups:
        idx = (g == groups)
        yield (g, *(a.compress(idx, axis=0) for a in args))
```
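A hypothetical usage example, assuming the unique_nosort and split_by sketches above are defined:

```python
import numpy as np

groups = np.array([2, 2, 1, 1, 3])
a = np.arange(5)
b = np.arange(5) * 10.0

for g, a_g, b_g in split_by(groups, a, b):
    # prints each group key (in order of first appearance) with the matching slices of a and b
    print(g, a_g, b_g)
```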
I wonder if there is a way to achieve the above but return views into the original array. Also, I am not sure if …
Hi,
I think it would be great to have some basic groupby functionality in numpy, on ints and floats.
Using it would be something like the one-liners below (just thinking of how a call could look, not considering whether it would actually work with the current input/output of, for example, the mean function).
result = np.groupby(x, y).mean() # x.shape = (n, ) y.shape = (n, m)
or maybe,
result = np.groupby(x, y, how='mean') # reasonable options would be mean, min, max, var or std
or even,
result = np.mean(np.groupby(x, y))
result would have result.shape == (np.unique(x).size, m)
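A rough emulation of the proposed call with existing NumPy ops, to pin down the intended semantics (illustrative names only; note that np.add.at is currently slow, as noted above in the thread):

```python
import numpy as np

n, m = 1000, 4
x = np.random.randint(0, 10, size=n).astype(float)   # grouping signal, shape (n,)
y = np.random.randn(n, m)                            # data, shape (n, m)

keys, codes = np.unique(x, return_inverse=True)
sums = np.zeros((keys.size, m))
np.add.at(sums, codes, y)                            # scatter-add rows into their groups
counts = np.bincount(codes, minlength=keys.size)
result = sums / counts[:, None]                      # shape (keys.size, m), i.e. np.groupby(x, y).mean()
```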
Background:
I recently did some cleaning of measurement signals (~5e6 samples), all of them floats (also the groupby series).
One of the things to do with the signals was to reduce the length a bit with, for example, averaging. However, the signal was not evenly spaced and had larger gaps in time, so a simple rolling average and dropping every second sample would not work. Of course the signal had some NaNs scattered around as well, which does not matter until one has to use nanmean instead of mean (maybe mean with a keyword like bubblenan=True could be an idea for API simplification instead of nanmean?).
I started out with pandas since it has some great functionality for grouping; however, that turned out to be like hitting a brick wall in speed (I don't really understand why, but after minutes of waiting it was quite obvious, and yes, I did try some different ways), so in the end it was not usable for this case.
The next step was to turn to numpy, and after a bit of searching on the web I did not really find anything that I did not already know: python loops are slow, python has itertools, etc. There were some examples of grouping with numpy, but they seemed to be based on integers, and it was only grouping, not including aggregating with mean in most cases. Quite some time ago I did some stuff with ufuncs and found them pretty slow compared to vectorized operations (I probably did something wrong), so I never tried that for this case.
In the end I wrote a small piece of code (<20 lines) based on np.diff, np.where, np.mean and np.delete in a while statement, breaking out when all elements were unique. The whole aggregation was done only on consecutive pairs, which was ensured by an internal loop over the mask dropping the multiple duplicates, leaving only consecutive pairs.
The numpy version I did was doing it in around 5 seconds per series, which was good enough for me. I do not know the actual speed of pandas since I never had the patience to wait for it to finish.
Maybe if the thing were really implemented in numpy it would take less than half the time, and it would be a simple one-liner.