Description
Hi,
I think it would be great to have some basic groupby functionality in numpy, for ints and floats.
Using it could be a one-liner like the ones below (just sketching what a call could look like, without considering whether it would actually work with the current input/output of, for example, the mean function).
result = np.groupby(x, y).mean()  # x.shape = (n,), y.shape = (n, m)
or maybe,
result = np.groupby(x, y, how='mean') # reasonable options would be mean, min, max, var or std
or even,
result = np.mean(np.groupby(x, y))
result would have result.shape = (len(np.unique(x)), m)
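
For comparison, something close to the proposed behaviour can already be pieced together from existing numpy building blocks (np.unique plus np.add.at); this is only a sketch of the intended semantics, not a proposed implementation:

```python
import numpy as np

def groupby_mean(x, y):
    # x: (n,) group keys, y: (n, m) values; returns per-group means.
    keys, inv = np.unique(x, return_inverse=True)   # group label for each row
    sums = np.zeros((keys.size, y.shape[1]))
    counts = np.zeros(keys.size)
    np.add.at(sums, inv, y)                         # accumulate rows per group
    np.add.at(counts, inv, 1)
    return keys, sums / counts[:, None]             # shape (len(keys), m)

x = np.array([0.5, 0.5, 1.2, 1.2, 1.2])
y = np.arange(10.0).reshape(5, 2)
keys, means = groupby_mean(x, y)                    # means.shape == (2, 2)
```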
Background:
I recently did some cleaning of measurement signals (~5e6 samples), all of them floats (including the groupby series).
One of the tasks was to reduce the length of the signals somewhat by averaging. However, the signal was not evenly spaced and had larger gaps in time, so a simple rolling average followed by dropping every second sample would not work. Of course the signal also had some NaNs scattered around, which does not matter until one has to use nanmean instead of mean (maybe mean with a keyword like bubblenan=True could be an idea for API simplification instead of nanmean?).
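
As an aside, the NaN handling that such a hypothetical bubblenan keyword would give can already be approximated today, either with np.nanmean or with the where argument that np.mean accepts (in reasonably recent numpy versions):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])

np.nanmean(a)                      # 2.0, ignores the NaN
np.mean(a, where=~np.isnan(a))     # 2.0, same idea via the where argument
```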
I started out with pandas since it has some great functionality for grouping, but that turned out to be like hitting a brick wall in terms of speed (I don't really understand why, but after minutes of waiting it was quite obvious; and yes, I did try a few different approaches), so in the end it was not usable for this case.
Next I turned to numpy. After a bit of searching on the web I did not really find anything I did not already know: Python loops are slow, Python has itertools, etc. There were some examples of grouping with numpy, but they seemed to be based on integers, and in most cases it was only grouping, without aggregating with mean. Quite some time ago I did some work with ufuncs and found them pretty slow compared to vectorized operations (I probably did something wrong), so I never tried them for this case.
In the end I wrote a small piece of code (<20 lines) based on np.diff, np.where, np.mean and np.delete inside a while loop, breaking out when all elements were unique. The aggregation was done only on consecutive pairs, which was ensured by an internal loop over the mask that dropped the multiple duplicates, leaving only consecutive pairs.
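
Roughly, the idea looks like the sketch below (not the exact code I used; it assumes equal keys are adjacent, as in a time-ordered series, and carries an explicit weight per row so that runs longer than two still end up with the true group mean):

```python
import numpy as np

def merge_consecutive_duplicates(x, y):
    # Repeatedly average consecutive rows that share the same key in x,
    # until every key is unique. A per-row weight tracks how many original
    # samples each row represents, so long runs still get the true group mean.
    x = np.asarray(x).copy()
    y = np.asarray(y, dtype=float).copy()      # y.shape == (n, m)
    w = np.ones(len(x))
    while True:
        dup = np.where(np.diff(x) == 0)[0]     # i where x[i] == x[i + 1]
        if dup.size == 0:
            break                              # all keys unique, done
        keep = np.ones(dup.size, dtype=bool)
        keep[1:] = np.diff(dup) > 1            # avoid overlapping pairs in a run
        dup = dup[keep]
        wa, wb = w[dup], w[dup + 1]
        # weighted average of each consecutive pair into the first row
        y[dup] = (wa[:, None] * y[dup] + wb[:, None] * y[dup + 1]) / (wa + wb)[:, None]
        w[dup] = wa + wb
        x = np.delete(x, dup + 1)              # drop the merged-away rows
        y = np.delete(y, dup + 1, axis=0)
        w = np.delete(w, dup + 1)
    return x, y
```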
The numpy version did it in around 5 seconds per series, which was good enough for me. I do not know the actual speed of the pandas version since I never had the patience to wait for it to finish.
Maybe, if this were actually implemented in numpy, it would take less than half that time,
and it would be a simple one-liner.