
Enhancement: groupby function #7265

Open
ahed87 opened this issue Feb 16, 2016 · 31 comments

Comments

@ahed87

ahed87 commented Feb 16, 2016

Hi,

I think it would be great to have some basic groupby functionality in numpy, on ints and floats.
Using it would be a one-liner like one of the calls below (just sketching how a call could look, without considering whether it would actually work with the current input/output of, for example, the mean function).

result = np.groupby(x, y).mean() # x.shape = (n, ) y.shape = (n, m)
or maybe,
result = np.groupby(x, y, how='mean') # reasonable options would be mean, min, max, var or std
or even,
result = np.mean(np.groupby(x, y))

result would have result.shape = (np.unique(x).size, m)
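
For reference, a minimal sketch of what such a call could compute with existing NumPy primitives (groupby_mean is a hypothetical illustration, not an existing or proposed API; np.add.at is known to be slow, as noted further down the thread):

import numpy as np

def groupby_mean(x, y):
    # group labels and, for each row of y, the index of its group
    keys, inv = np.unique(x, return_inverse=True)
    counts = np.bincount(inv)                        # rows per group
    sums = np.zeros((keys.size,) + y.shape[1:])
    np.add.at(sums, inv, y)                          # unbuffered scatter-add into group slots
    return keys, sums / counts.reshape((-1,) + (1,) * (y.ndim - 1))

x = np.array([0.5, 0.5, 1.0, 1.0, 1.0])
y = np.arange(10.0).reshape(5, 2)
keys, means = groupby_mean(x, y)   # keys.shape == (2,), means.shape == (2, 2)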

Background:
I recently did some cleaning of measurement signals (~5e6 samples), all of them floats (including the groupby series).

One of the things to do with the signals was to reduce their length a bit, for example by averaging. However, the signal was not evenly spaced and had larger gaps in time, so a simple rolling average followed by dropping every second sample would not work. Of course the signal had some NaNs scattered around as well, which does not matter until one has to use nanmean instead of mean (maybe mean with a keyword like bubblenan=True could be an idea for API simplification instead of nanmean?).

I started out with pandas, since it has some great functionality for grouping. However, that turned out to be like hitting a brick wall in speed (I don't really understand why, but after minutes of waiting it was quite obvious, and yes, I did try some different approaches), so in the end it was not usable for this case.

Next step was to turn to numpy. After a bit of searching on the web I did not really find anything that I did not already know: Python loops are slow, Python has itertools, etc. There were some examples of grouping with numpy, but they seemed to be based on integers, and in most cases it was only grouping, not aggregating with mean. Quite some time ago I did some work with ufuncs and found them pretty slow compared to vectorized operations (I probably did something wrong), so I never tried that for this case.

In the end I wrote a small piece of code (<20 lines) based on np.diff, np.where, np.mean and np.delete in a while statement, breaking out when all elements were unique. The whole aggregation was done only on consecutive pairs, which was ensured by an internal loop over the mask dropping the multiple duplicates, leaving only consecutive pairs.

The numpy version did it in around 5 seconds per series, which was good enough for me. I do not know the actual speed of pandas, since I never had the patience to wait for it to finish.

Maybe if this were really implemented in numpy it would take less than half the time, and it would be a simple one-liner.

@ghost

ghost commented Mar 19, 2016

Agreed. Have you found/developed an unofficial solution?

@shoyer
Member

shoyer commented Mar 19, 2016

It's worth taking a look at numpy.add.reduceat, which provides a very efficient loop for calculating sums, though you do need every group to be in order.
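
A minimal sketch of that reduceat approach, assuming the group labels are already sorted so that each group forms a contiguous block:

import numpy as np

groups = np.array([0, 0, 0, 1, 1, 2])    # already sorted: each group is contiguous
values = np.array([1., 2., 3., 4., 5., 6.])

# start index of each contiguous block of equal labels
starts = np.concatenate(([0], np.where(np.diff(groups) != 0)[0] + 1))
sums = np.add.reduceat(values, starts)   # array([6., 9., 6.])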

@charris
Member

charris commented Mar 19, 2016

I thought pandas prided itself on speedy groupby; perhaps you are doing something wrong. @jreback Thoughts?

@jreback

jreback commented Mar 19, 2016

I don't see any code, so it's impossible to tell what the user is doing wrong.

@ghost

ghost commented Mar 19, 2016

@shoyer, thanks I will look into reduceat.

python 2.7, 64 bit.
pandas 0.16.0
windows 7

I was working with 2 GB of data (I have 16 GB of RAM) and the groupby consumed all OS resources: working memory shot up, all 12 cores were used, and runtime lagged my benchmark by a factor of 3.
Wish I had a Python tool to more accurately detail/log the issue.
When my 20 MB test dataset was used, everything was quick and worked beautifully; with the test set I thought the groupby Cython implementation was perfect. Not sure why scaling up caused a hiccup.
Now I've begun to look for alternative solutions.

@njsmith
Member

njsmith commented Mar 19, 2016

Pandas has developers who have more or less built a career on groupby; numpy doesn't. I think you'll have more luck working with them to fix your problem than with switching to some naive new implementation in numpy. :-)

@jaimefrio
Member

I keep coming back to the idea that adding a .groupby method to ufuncs may be something worth trying. It would take an array of values and an array of groups, and do a reduction of the ufunc over the groups. Basically, np.add.groupby(values, groups) would be equivalent to np.bincount(groups, weights=values). Ideally it would also support an axis argument, something I tried to implement for bincount in #4330 with limited success.

I think it would be a neat feature, but not sure if it is worth the development and maintenance effort, given how several of the applications that groupby suggests (mean, variance, median) cannot be expressed directly as a reduction of a binary ufunc. If anyone can think of a way of fitting those in, then I would be much more enthusiastic about this.

I'm secretly hoping @shoyer has it figured out already by twisting a gufunc's arm in some incredibly smart way! ;-)
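
A small sketch of the bincount equivalence mentioned above, for non-negative integer group labels (no sorting required):

import numpy as np

groups = np.array([0, 2, 0, 1, 2, 2])
values = np.array([1., 2., 3., 4., 5., 6.])

sums = np.bincount(groups, weights=values)   # array([ 4.,  4., 13.])
counts = np.bincount(groups)                 # array([2, 1, 3])
means = sums / counts                        # a grouped mean built from two reductions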

@njsmith
Member

njsmith commented Mar 19, 2016

given how several of the applications that groupby suggests (mean, variance, median) cannot be expressed directly as a reduction of a binary ufunc.

Someday hopefully we'll switch those operations to make them (k)->() gufuncs, and also make reductions use the gufunc machinery as (k)->() gufuncs, and then .groupby would make sense as a special feature of (k)->() gufuncs (just like .reduce is a discussion feature of (),()->() gufuncs).

Someday...

@njsmith
Member

njsmith commented Mar 19, 2016

s/discussion/special/ # thanks autocorrect

@shoyer
Member

shoyer commented Mar 19, 2016

Here's a pure NumPy version of grouped_sum that uses sorting (adapted from np.unique) and np.add.reduceat:
https://gist.github.com/shoyer/f538ac78ae904c936844

It doesn't yet support arbitrary axes, but that would be very easy to add.

As you can see from the benchmark, it's significantly slower than pandas (~5x) for summing 1e7 elements in 10 groups. OTOH, if the groups are already sorted, it's significantly faster.
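
A sketch along those lines (not the linked gist itself): stable-sort by group, find the block boundaries the way np.unique does, then reduce with np.add.reduceat:

import numpy as np

def grouped_sum(values, groups):
    order = np.argsort(groups, kind='mergesort')          # stable sort by group label
    g, v = groups[order], values[order]
    starts = np.concatenate(([0], np.where(g[1:] != g[:-1])[0] + 1))
    return g[starts], np.add.reduceat(v, starts)

labels, sums = grouped_sum(np.array([1., 2., 3., 4.]),
                           np.array([1, 0, 1, 0]))
# labels == array([0, 1]), sums == array([6., 4.])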

@seberg
Member

seberg commented Mar 19, 2016

ufunc.at also does something like it, but it is horribly slow until someone fixes that. reduceat unfortunately has some quirks as well; your code likely needs a check for empty groups.
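
For comparison, a minimal sketch of the ufunc.at route, which needs no sorting but currently carries the speed penalty noted above:

import numpy as np

groups = np.array([2, 0, 1, 0, 2, 2])        # unsorted integer labels
values = np.array([1., 2., 3., 4., 5., 6.])

sums = np.zeros(groups.max() + 1)
np.add.at(sums, groups, values)              # unbuffered scatter-add: array([ 6.,  3., 12.])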

@ahaldane
Member

Also note that groupby was recently discussed on the list in this thread. There's discussion there about how pandas or xarray may be better for this task.

I also pointed out there that there is an old NEP for groupby that was never implemented (here), which you might want to take a look at.

@jreback

jreback commented Mar 20, 2016

You have to be very cognizant that a groupby is two operations: a group indexer calculation, and then a reduction over those bins.

Using @shoyer's example

Both ops

In [6]: %timeit df.groupby('y').x.sum()
1 loop, best of 3: 150 ms per loop

This is doing the factorizing & computing the group indexers

In [7]: %timeit df.groupby('y').x.grouper.group_info
10 loops, best of 3: 116 ms per loop

This is a tuple of the (indexer, uniques, len(uniques))

In [11]: df.groupby('y').x.grouper.group_info
Out[11]: (array([8, 8, 6, ..., 3, 8, 3]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 10)

In [9]: df.groupby('y').x.grouper.group_info[0].shape
Out[9]: (10000000,)

So the actual bin reduction is oftentimes pretty cheap.

Side note: the pandas impl is more like nansum than sum, so the comparison is slightly different here.

Giving a sorted indexer just avoids most of the actual work in doing a groupby. Pandas doesn't support passing this in from the user (it easily could, though), but generally you don't have it (in fact, turning what you want to group by into the actual indexer is quite a lot of the user API).
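
To make the two steps concrete, a rough sketch of the same split using pd.factorize for the indexer and np.bincount for the bin reduction (an illustration, not the pandas internals):

import numpy as np
import pandas as pd

n = 10**7
y = np.random.randint(0, 10, n)          # group keys
x = np.random.randn(n)                   # values

indexer, uniques = pd.factorize(y)       # step 1: compute the group indexer (the expensive part)
sums = np.bincount(indexer, weights=x)   # step 2: cheap bin reduction
means = sums / np.bincount(indexer)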

@shoyer
Member

shoyer commented Mar 20, 2016

@jreback you are absolutely right. In my pure numpy example, the main expense is the sort which serves an equivalent purpose to the pandas factorize. This is pretty similar to what we see for most set operations in numpy compared to pandas (which uses a hash table, of course).

So this is not a terribly attractive option for 1d arrays. But for grouping higher dimensional arrays along an axis, the expense of the computation dominates and this could make sense. I might actually switch to this approach in xarray, which currently uses very slow loops.

It does make an additional copy of the array to sum with fancy indexing so it can assign the data to groups in order (which means extra memory consumption), but without benchmarking I'm not even sure if that's slower than the pandas approach of assigning in the order of the data.

EDIT: I did some benchmarking and it is indeed always slower to copy the full input array and then assign to groups in order, although the gap does start to get close (~50% slower) when there are a large number of distinct groups. So the bottom line is that the pandas approach looks strictly superior.

@jreback

jreback commented Mar 21, 2016

In pandas we have thought about lazily tracking whether a dimension is sorted (you could also track whether it has nulls, among other things); this would allow some additional performance on some operations. It might make sense in xarray. Postgres does things like this.

@ahed87
Author

ahed87 commented Mar 23, 2016

Hi,
Well, I kind of found a solution, and to my surprise I found it in pandas.
I still don't understand what's going on, but maybe the pandas people could figure it out; to me it seems like pandas can somehow revert to Python looping, or at least get into a mode where it appears to have the speed of large Python loops.

I started to make a comparison of different types of grouping found in different places.
I wrote a small script to do the comparison (script and output below) and used a small bat file to repeat the runs.
They were all about the same.
Once that was done I threw in a pandas version to double-check that the slow result still stood.
To my surprise it didn't, so I went back to recheck whether pandas was slow under the conditions where I had found it slow earlier, and it was.

So the solution was to not group on matlab time (even rounded to seconds); it worked if I made a new decimal series to group by, and then pandas won by far! (If I remember right, I also tried grouping on a column of homemade integers representing the wanted grouping, and that one was slow as well.)
This comparison is without NaNs; it's been a while, so it's not completely fresh in memory, but I do remember that pandas got a bit slowed down by NaNs, though I guess the others did as well.

import numpy as np
from scipy.ndimage import labeled_comprehension
import pandas as pd
import timeit
import sys

steps = 1

try: steps = int(sys.argv[1])
except: pass

t = np.linspace(0, 10*steps, 51*steps)
a = np.random.random(t.size)
a = np.array(range(t.size)) * 1.0
print('size: %s'%t.size)
print('best out of 2 repeats with 3 runs')

def timer(gb, t, a):
    tim = timeit.Timer('%s(t, a)'%gb, setup='from __main__ import %s, t, a'%gb)
    return min(tim.repeat(2,3))

def groupby1_1d(t_, a_):
    # scipy labeled_comprehension

    tr = np.round(t_)
    unir = np.unique(tr)

    return unir, labeled_comprehension(a_, tr, unir, np.mean, np.float64, -1)


def groupby2_1d(t_, a_):
    # alternative that splits the array, but uses lists etc

    tr = np.round(t_)
    tr = np.sort(tr)
    unir = np.unique(tr) # new time signal..., can't escape doing it

    idx = np.where(np.diff(tr) != 0)[0] + 1
    an = map(np.mean, np.array_split(a_, idx))
    return unir, np.array(tuple(an))

#    # small alternative, but not faster....
#    unir, idx = np.unique(tr, return_index=True)
#    an = map(np.mean, np.array_split(a_, idx[1:]))
#    return unir, np.array(tuple(an))


def groupby3_1d(t_, a_):
    # alternative with pandas
    tr = np.round(t_)
    df = pd.DataFrame(dict(idx=tr, v=a_))
    res = df.groupby('idx').mean()
    return  res.index.values, res['v'].values


def groupby4_1d(t_, a_):
    # new home hacked version with loop over indexes
    tr = np.round(t_)
    tr = np.sort(tr)
    unir, idx = np.unique(tr, return_index=True)
    res = np.full_like(unir, np.nan)
    for i in range(idx.size-1):
        res[i] = np.mean( a_[idx[i]:idx[i+1]] )
    res[-1] = np.mean(a_[idx[-1]:])
    return unir, res


result = []
opt = []
opa = []; failed = False
variants = ('groupby1_1d', 'groupby2_1d', 'groupby3_1d', 'groupby4_1d')

# take results to compare validity
for gb in variants:
    _t, _a = eval('%s(t, a)'%gb)
    opt.append(_t)
    opa.append(_a)
for i in range(len(opa)-1):
#    print('compare output version, %s to %s, time & array'%(i+1, i+2))
#    print(np.allclose(opt[i], opt[i+1]), '&',np.allclose(opa[i], opa[i+1]))
    if not failed:
        failed = not np.allclose(opt[i], opt[i+1]) or not np.allclose(opa[i], opa[i+1])

if not failed:
    print('comparison of output gave %s'%(not(failed)))

for gb in variants:
    print('timing %s'%gb)
    result.append(timer(gb, t, a))

mi = min(result)    
for gb, r in zip(variants, result):
    print( '%s: %s sec, %0.f%%'%(gb, r, r/mi*100) )

Output (on my slow computer)
starting comparison of groupby function, will take a while...
3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
np 1.10.4
pd 0.17.1
scipy 0.17.0
size: 51
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.002091338718389469 sec, 137%
groupby2_1d: 0.001801516207746828 sec, 118%
groupby3_1d: 0.014042285360782857 sec, 923%
groupby4_1d: 0.0015214399408249105 sec, 100%
size: 153
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.005020341436565547 sec, 132%
groupby2_1d: 0.003902601169857021 sec, 103%
groupby3_1d: 0.014501384736048638 sec, 381%
groupby4_1d: 0.0038056516928455936 sec, 100%
size: 255
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.009073239943800082 sec, 112%
groupby2_1d: 0.010583907720512611 sec, 131%
groupby3_1d: 0.020973916487002364 sec, 260%
groupby4_1d: 0.008072454601740262 sec, 100%
size: 357
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.011551863610039452 sec, 112%
groupby2_1d: 0.010279722324386793 sec, 100%
groupby3_1d: 0.01682406850275639 sec, 164%
groupby4_1d: 0.013710913074278172 sec, 133%
size: 459
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.01352573444358438 sec, 128%
groupby2_1d: 0.015626819405694575 sec, 147%
groupby3_1d: 0.01575352057406401 sec, 149%
groupby4_1d: 0.010598270605995774 sec, 100%
size: 663
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.01794180876946492 sec, 118%
groupby2_1d: 0.016135162959759815 sec, 106%
groupby3_1d: 0.015857651493817043 sec, 105%
groupby4_1d: 0.01516156450807886 sec, 100%
size: 867
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.026399496478271774 sec, 171%
groupby2_1d: 0.0212432205898119 sec, 137%
groupby3_1d: 0.01546574990420467 sec, 100%
groupby4_1d: 0.020869785567249333 sec, 135%
size: 1071
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.028026093259241363 sec, 169%
groupby2_1d: 0.0253310003903627 sec, 152%
groupby3_1d: 0.01662657882736275 sec, 100%
groupby4_1d: 0.024746225767119157 sec, 149%
size: 1581
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.04363957569984442 sec, 250%
groupby2_1d: 0.03628629129265454 sec, 208%
groupby3_1d: 0.017422693051287297 sec, 100%
groupby4_1d: 0.03587489721560075 sec, 206%
size: 15810
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 0.408309647396291 sec, 1071%
groupby2_1d: 0.3504379910632074 sec, 919%
groupby3_1d: 0.03812884431606767 sec, 100%
groupby4_1d: 0.35363783676478033 sec, 927%
size: 158100
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 4.259647114162882 sec, 1937%
groupby2_1d: 3.700931894787759 sec, 1683%
groupby3_1d: 0.21995989677188987 sec, 100%
groupby4_1d: 3.675658345939329 sec, 1671%
size: 1581000
best out of 2 repeats with 3 runs
comparison of output gave True
groupby1_1d: 42.987091869632195 sec, 1966%
groupby2_1d: 37.7301880665966 sec, 1725%
groupby3_1d: 2.187069401975606 sec, 100%
groupby4_1d: 37.265364055545376 sec, 1704%

@ahed87
Author

ahed87 commented Mar 23, 2016

I opened this issue before I had found the way forward with pandas, which seems to be sensitive to the actual data it groups on.

However, the great pydata stack once again stepped up to the task, and there was a quick way in the end, although it took me a while to figure it out.

I will leave it up to @shoyer to close this issue (w/o action is fine by me).

@jreback, to make pandas slow just do...

def groupby3_1d(t_, a_):
    # alternative with pandas
    tr = np.round(t_)/24/60/60 + 736046
    df = pd.DataFrame(dict(idx=tr, v=a_))
    res = df.groupby('idx').mean()
    return  res.index.values, res['v'].values

At least on my computer it's so slow that I don't even have the patience to wait for a timing once the arrays are a bit bigger.

@jreback

jreback commented Mar 23, 2016

@ahed87 looks like you are trying to do some sort of time grouping or resampling. I cannot reproduce any slowness. Maybe a complete example w/o all of the boilerplate would help.

@ahed87
Author

ahed87 commented Mar 23, 2016

@jreback
yepp, it's time grouping.

The script below gives the following output on my computer.
step=310
3.077758017431395e-06
0.01371756748369173
0.3080050947154382

step=3100
3.5907176870032946e-06
0.07356611101165393
48.66983453459938

import numpy as np
import pandas as pd
import time

steps = 3100

t = np.linspace(0, 10*steps, 51*steps)
a = np.random.random(t.size)
a = np.array(range(t.size)) * 1.0

ts = time.clock()
print(time.clock()-ts)

tr = np.round(t)
df = pd.DataFrame(dict(idx=tr, v=a))
res = df.groupby('idx').mean()

print(time.clock()-ts)

tr = np.round(t)/24/60/60 + 736046
df = pd.DataFrame(dict(idx=tr, v=a))
res = df.groupby('idx').mean()

print(time.clock()-ts)

@jreback

jreback commented Mar 23, 2016

Is this your beginning time index: 736046? Is this supposed to be 1000*epoch seconds?

@jreback

jreback commented Mar 23, 2016

In [61]: index = pd.to_datetime(736046*1000,unit='s') + pd.timedelta_range(0,periods=51*3100,freq='31000ms')

In [62]: df = DataFrame({'value' : np.random.randn(len(index))},index=index)

In [63]: df
Out[63]: 
                        value
1993-04-29 01:13:20 -0.421618
1993-04-29 01:13:51  0.211800
1993-04-29 01:14:22  0.891728
1993-04-29 01:14:53  0.499494
1993-04-29 01:15:24  0.010122
...                       ...
1993-06-24 18:35:45  1.373097
1993-06-24 18:36:16  0.707244
1993-06-24 18:36:47  1.813460
1993-06-24 18:37:18 -1.745865
1993-06-24 18:37:49  0.381538

[158100 rows x 1 columns]

In [64]: df.resample('D').mean()
Out[64]: 
               value
1993-04-29 -0.003777
1993-04-30 -0.030092
1993-05-01  0.007131
1993-05-02  0.002670
1993-05-03  0.004681
...              ...
1993-06-20 -0.020928
1993-06-21  0.015627
1993-06-22  0.038849
1993-06-23 -0.025305
1993-06-24 -0.003742

[57 rows x 1 columns]

In [65]: %timeit df.resample('D').mean()
100 loops, best of 3: 2.19 ms per loop

Does this represent what you are trying to do? (Note that the .resample syntax changed a bit in 0.18.0.)

@ahed87
Author

ahed87 commented Mar 23, 2016

beginning of time index - yepp (although symbolic here)
1000*epoch - not really, it's supposed to be matlab time, hence the /24/60/60.
dt.datetime.toordinal(dt.datetime.today())
Out[6]: 736046

The example you gave is very close; the only difference is the matlab time, but I suppose that's an easy fix.
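
One possible conversion, assuming the matlab time here is built from Python toordinal days plus a fractional day, as in the script above (a true MATLAB datenum is offset from Python's ordinal by 366 days, so the constant would need adjusting):

import numpy as np
import pandas as pd

EPOCH_ORDINAL = 719163               # datetime.date(1970, 1, 1).toordinal()

def ordinal_days_to_index(mt):
    # fractional ordinal days -> seconds since the Unix epoch -> DatetimeIndex
    seconds = (np.asarray(mt, dtype=float) - EPOCH_ORDINAL) * 86400.0
    return pd.to_datetime(seconds, unit='s')

# mt = np.round(t) / 24 / 60 / 60 + 736046
# df = pd.DataFrame({'v': a}, index=ordinal_days_to_index(mt))
# res = df.resample('1s').mean()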

@jreback
Copy link

jreback commented Mar 23, 2016

In [72]: Timestamp.now().toordinal()
Out[72]: 736046

In [73]: Timestamp.now().normalize()
Out[73]: Timestamp('2016-03-23 00:00:00')

@ahed87
Author

ahed87 commented Mar 23, 2016

Hm, I have looked at resample earlier, but shied away partly because one has to have a pandas time index, which I have so far had bad experiences getting a quick conversion to, and partly because, at least until now, I was not fully aware of the ms option on resampling.

Give me a day or two to see if I can get it fully working where I earlier was not able to.
Thanks for all the hints.

@shoyer
Member

shoyer commented Mar 23, 2016

@ahed87 @jreback maybe time to move your pandas performance testing over to the pandas issue tracker? :)

@ahed87
Author

ahed87 commented Mar 23, 2016

:-) Got caught up in the heat of the moment.
I will open a new issue on pandas with a reference to this one, and close both once I know whether I can get it working.

@srkunze

srkunze commented Sep 13, 2018

Because I stumbled over this issue while looking for an efficient numpythonic way to do groupby, I want to drop some findings about performance considerations.

https://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/

@ml31415

ml31415 commented Dec 22, 2019

For everyone needing fast grouped calculations, please have a look at numpy_groupies (a usage sketch follows the benchmark table below). I was annoyed by all the slow grouping solutions a while ago myself, and wrote fast implementations with weave and numba to fix this issue. They perform about 30x faster than the ufuncs, and about 20x faster than pandas. Also, a couple of grouped operations can be done reasonably fast with pure numpy using bincount.

function         ufunc         numpy         numba         weave        pandas
-------------------------------------------------------------------------------
sum              35.165         1.894         1.271         1.506        19.282
prod             39.227        37.932         1.237         1.542        19.406
amin             39.705        38.720         1.251         1.478        19.039
amax             40.431        39.334         1.285         1.485        20.276
len              33.257         1.746         1.027         1.238        19.855
all              39.607         4.102         1.533         1.677        15.832
any              38.586         6.223         1.543         1.738        15.579
anynan           33.892         2.810         1.214         1.485       258.491
allnan           36.571         5.193         1.218         1.562       247.796
mean               ----         2.903         1.202         1.532        19.697
std                ----         4.975         1.291         1.663        72.955
var                ----         5.223         1.216         1.596        75.325
first              ----         2.637         1.253         1.380        19.257
last               ----         1.718         1.026         1.074        18.632
argmax             ----        39.302         1.127          ----       250.525
argmin             ----        53.162         1.570          ----       294.759
nansum             ----         8.096         2.452         1.954        13.041
nanprod            ----        40.267         2.608         2.177        25.360
nanmin             ----        45.185         2.883         2.217        24.716
nanmax             ----        43.925         2.960         2.353        24.048
nanlen             ----         7.488         2.621         1.857        22.759
nanall             ----         9.868         2.564         1.958        24.640
nanany             ----        11.569         2.472         2.075        25.832
nanmean            ----         8.177         2.456         2.086        22.481
nanvar             ----        10.499         2.609         2.384        93.734
nanstd             ----        10.944         2.599         2.331        92.303
nanfirst           ----         8.703         3.060         1.721        23.279
nanlast            ----         7.503         2.351         1.678        23.895
cumsum             ----       218.935         2.156          ----        13.827
cumprod            ----          ----         1.796          ----        13.430
cummax             ----          ----         1.781          ----        13.576
cummin             ----          ----         1.780          ----        13.712
arbitrary          ----       437.689       122.783          ----       199.312
sort               ----       435.738          ----          ----          ----
Linux(x86_64), Python 2.7.12, Numpy 1.16.5, Numba 0.46.0, Weave 0.17.0, Pandas 0.24.2
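
A usage sketch for numpy_groupies, assuming the aggregate(group_idx, a, func=...) entry point; check the package documentation for the exact options:

import numpy as np
import numpy_groupies as npg

group_idx = np.array([0, 2, 0, 1, 2, 2])
a = np.array([1., 2., 3., 4., 5., 6.])

npg.aggregate(group_idx, a, func='sum')    # array([ 4.,  4., 13.])
npg.aggregate(group_idx, a, func='mean')   # array([2.        , 4.        , 4.33333333])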

@MLopez-Ibanez

MLopez-Ibanez commented Nov 10, 2021

Important features:

  • groups do not need to be consecutive (otherwise we could use split)
  • groups do not need to be numeric
  • it does not change the order between the groups
  • it does not change the order within each group

Since np.unique does not preserve order (bug #8621), we cannot use it directly. There must be a faster way than:

def uniq_nosort(ar):
    _, idx = np.unique(ar, return_index=True)
    u = ar[np.sort(idx)]
    return u

def groupby(x, groups, axis = 0):
    for g in uniq_nosort(groups):
        yield x.compress((g == groups), axis = axis)

@MLopez-Ibanez

MLopez-Ibanez commented Nov 10, 2021

And it would be great to have a version able to split multiple arrays with the same groups (not sure about the best way to pass the axis argument):

def split_by(groups, *args):
    uniq_groups = uniq_nosort(groups)
    for g in uniq_groups:
        idx = (g == groups)
        yield g, *(a.compress(idx, axis = 0) for a in args)
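
One possible answer to the axis question: a single keyword-only axis shared by all the arrays (a sketch that relies on uniq_nosort as defined above):

def split_by(groups, *args, axis=0):
    uniq_groups = uniq_nosort(groups)
    for g in uniq_groups:
        idx = (g == groups)
        yield (g, *(a.compress(idx, axis=axis) for a in args))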

@MLopez-Ibanez

MLopez-Ibanez commented Oct 25, 2024

def uniq_nosort(ar):
    _, idx = np.unique(ar, return_index=True)
    u = ar[np.sort(idx)]
    return u

def groupby(x, groups, axis = 0):
    for g in uniq_nosort(groups):
        yield x.compress((g == groups), axis = axis)

I wonder if there is a way to achieve the above but return views into x instead of copies.

Also, I am not sure if x.compress() is the most efficient way to do this operation. There must be a way in a single pass to create a list of indexes for each unique value and then directly use those for selection.
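
A sketch of that idea, treating a stable argsort as the single pass: build an index array per group once, then select with take. The results are still copies; a view is only possible when a group happens to occupy a contiguous slice.

import numpy as np

def group_indices(groups):
    order = np.argsort(groups, kind='stable')        # within-group order is preserved
    sorted_groups = groups[order]
    uniq, starts = np.unique(sorted_groups, return_index=True)
    # dict keyed by sorted label; iterate via uniq_nosort(groups) from above
    # if first-appearance order of the groups matters
    return dict(zip(uniq.tolist(), np.split(order, starts[1:])))

groups = np.array(['b', 'a', 'b', 'c', 'a'])
x = np.arange(10).reshape(5, 2)
idx = group_indices(groups)          # {'a': array([1, 4]), 'b': array([0, 2]), 'c': array([3])}
a_rows = x.take(idx['a'], axis=0)    # rows of x in group 'a', original order preserved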
