BUG: Allow any integer type in bincount (fixes #823) #4366
Conversation
This is a simpler, slower alternative to the type-specific functions I had implemented in #4330. Since solving this issue is orthogonal to whether an …
Wouldn't it be simpler to just try to cast the non-intp array to NPY_UINTP, then get the maximum, and if it fits cast back to intp? Probably also faster than the function-pointer compare loop. Actually, you could probably save the cast if the rest of the code works on npy_uintp as documented, and you just create a npy_uintp view of the npy_intp array after the minimum check.
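For illustration, a minimal Python sketch of that suggestion, assuming a one-dimensional integer input; the helper name is made up and the real code would live in C:

    import numpy as np

    def to_intp_via_uintp(arr):
        # Hypothetical helper: one forced cast to the unsigned
        # pointer-sized type; negative inputs wrap to huge unsigned
        # values, so the same maximum check that catches genuinely
        # too-large values catches them as well.
        u = arr.astype(np.uintp)
        if u.size and u.max() > np.iinfo(np.intp).max:
            raise ValueError("values do not fit in np.intp")
        # uintp and intp have the same itemsize, so a view re-reads
        # the buffer as signed without a second cast.
        return u.view(np.intp)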
On a 32 bit system, a potentially faster alternative without multiple functions for different types is to try to create a …
Maybe it's better to add type-specific functions; even on 64 bit you might not want to cast your large int16 array to 64-bit integers.
I may have gotten a little carried away there, but type-specific functions were frowned upon by @charris in #4330, which is what prompted this more flexible, less efficient approach here, as a test of a simplified approach for #4330. I think he was concerned that the additional complexity did not justify speeding up a function whose speed nobody is complaining about. Hence this approach that leaves the original path unchanged, and provides a slower alternative path for inputs that previously did not work. I am happy to rewrite it either way, or in some middle-ground approach, but need your guidance as to what approach best fits NumPy.
I'd say just chunk the conversions.
Chunking would probably be a good idea for any integer type bigger than uint16 anyway. Not sure of the best chunk size though, maybe 64K.
By "chunking" you mean something similar to what |
I was thinking that you could leave the type unspecified to begin with, then convert the resulting array to npy_intp in chunks. But there is a problem with the minmax function, as it only works with npy_intp. You could call the max/min methods of the array, however; that would probably be slower, especially for small arrays. What you have now doesn't bother me that much, but as Julian says, it does waste space. But no more than it did before.
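A hedged Python sketch of that chunked conversion (chunk size taken from the 64K suggestion above; the helper name and one-dimensional input are assumptions, and the real implementation would be C):

    import numpy as np

    CHUNK = 1 << 16  # 64K elements per chunk, as floated above

    def cast_to_intp_chunked(arr):
        # Hypothetical helper: convert a 1-D integer array to np.intp
        # in fixed-size chunks, bounds-checking each chunk, so the
        # temporary buffer stays small instead of duplicating the array.
        out = np.empty(arr.shape, dtype=np.intp)
        top = np.iinfo(np.intp).max
        for start in range(0, arr.size, CHUNK):
            chunk = arr[start:start + CHUNK]
            if chunk.min() < 0 or chunk.max() > top:
                raise ValueError("values outside the np.intp range")
            out[start:start + CHUNK] = chunk  # in range, cast is safe
        return out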
If you find a cool solution for this, put it into the indexing machinery ;). It currently just casts to …
That's a nasty shortcut you are taking there! ;-)
As you say, probably the proper solution would be to have a set of int-to-int conversion functions that check for inboundness, and that have a return value that could be used to signal overflows in the conversion. I think I have followed the path of how casting works all the way down, and it would mean duplicating some of those functions for ints. That's definitely a bigger bite than I can chew with my current knowledge of the code base, so I won't even dream of attempting it. Quite frankly, if something as ubiquitous as indexing can live with this behavior, the hack in this PR is unwarranted: until an elegant solution is devised for that, perhaps the best solution for this particular bug (which is everywhere, see #4384) is to follow the leader and do as indexing does, i.e. something equivalent to a forced cast. On a 64 bit system, the worst that could happen is that you get an error about a negative number when the values are too large, and we can always change the error to "incorrect values in array" so as not to mislead the user.
On 32 bit systems messier bugs can creep in. That is not good: you should have gotten a "Maximum allowed dimension exceeded" error, but at least it is wrong in a way consistent with indexing…
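A hedged illustration of the forced-cast behavior being described (the original sessions are not reproduced here, and the exact output depends on the platform and NumPy version):

    >>> import numpy as np
    >>> # On a 64 bit build, an out-of-range uint64 wraps to a negative
    >>> # intp under a forced cast, hence a misleading "negative" error.
    >>> np.array([2**63], dtype=np.uint64).astype(np.intp)
    array([-9223372036854775808])
    >>> # On a 32 bit build the same cast keeps only the low 32 bits, so
    >>> # a too-large value can alias a small, valid-looking index.
    >>> np.array([2**32 + 5], dtype=np.uint64).astype(np.int32)
    array([5], dtype=int32)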
It is kinda impressive that nobody ever noticed and bothered to open an issue about it ;). But I didn't feel like specializing a load of inner loops for this, and the old code did a forced cast too... One advantage indexing has over bincount is that you can't even fathom the idea of indexing with such a large integer ;).
True, true. My 64 bit machine only uses 48 bits for memory addressing. That's what, 256 terabytes or so? I've been robbed.
Actually, we (that would be me) make …
We just ran into this (kind of) problem: dipy/dipy#789. Not bincount, but repeat.
min = (const void *)PyArray_DATA(arr_cmp2);  /* lower bound, first element of arr_cmp2 */
max = min + stride;                          /* upper bound, stored one stride later */
while (size--) {
Is it me, or is there no iteration happening here?
        return 0;
    }
    else if (cmp(a, max, arr) > 0) {
        PyErr_SetString(PyExc_ValueError,
I think these errors would make more sense emitted by the caller
@jaimefrio this PR has been stale for a while; do you still want to continue working on it?
I'm going to close this; it is stalled and needs a rebase. @jaimefrio Please make a new PR if you want to pursue this.
Gives array-likes that cannot be cast to np.intp a second chance, by
comparing their entries one by one to NPY_INTP_MAX and forcing the cast
if none exceeds it. Fixes #823.
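For reference, a hedged Python rendering of that second-chance logic, assuming integer input; the helper name is hypothetical, and the actual patch implements this in C inside bincount:

    import numpy as np

    def second_chance_intp(arr):
        # Hypothetical sketch of the commit message above: try the normal
        # safe cast first, and only on failure check the entries against
        # the np.intp bounds before forcing the cast. The minimum check
        # mirrors the discussion earlier in the thread.
        try:
            return arr.astype(np.intp, casting='safe')
        except TypeError:
            top = np.iinfo(np.intp).max
            if arr.size and (arr.min() < 0 or arr.max() > top):
                raise ValueError("array values do not fit in np.intp")
            return arr.astype(np.intp)  # forced, but known to be in range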