Skip to content

BUG: ptp fails on bool #15057

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bashtage opened this issue Dec 5, 2019 · 16 comments
Closed

BUG: ptp fails on bool #15057

bashtage opened this issue Dec 5, 2019 · 16 comments

Comments

@bashtage
Copy link
Contributor

bashtage commented Dec 5, 2019

Reproducing code example:

import numpy as np
rg = np.random.default_rng()
np.ptp(rg.integers(0,2,size=100).astype("bool"))

Error message:

TypeError: numpy boolean subtract, the - operator, is deprecated, use the bitwise_xor, the ^ operator, or the logical_xor function instead.

Numpy/Python version information:

import sys, numpy; print(numpy.__version__, sys.version)
1.17.4 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)]
bashtage added a commit to bashtage/statsmodels that referenced this issue Dec 5, 2019
Workaround for bool data when using ptp
Also silences pandas ptp warning in this location

xref numpy/numpy#15057
@eric-wieser
Copy link
Member

Would you expect the result to be a float or bool?

bashtage added a commit to bashtage/statsmodels that referenced this issue Dec 5, 2019
Workaround for bool data when using ptp
Also silences pandas ptp warning in this location

xref numpy/numpy#15057
@bashtage
Copy link
Contributor Author

bashtage commented Dec 5, 2019

I'm happy either way as long as I can test using ptp(x, axis=0) == 0. Probably a bool, and so True/False. I assume ptp(x.astype(np.float32), axis=0) is a float32.

@bashtage
Copy link
Contributor Author

bashtage commented Dec 5, 2019

Type seems to be preserved by ptp except where it spills over (int types).

@eric-wieser
Copy link
Member

I'd argue that the current behavior is correct - ptp is not defined over things on which subtraction isn't defined.

Having said that, the operation all_values_equal(x, axis=...), which you implement as ptp(x, axis=....) == 0 is well-defined on booleans, it's just your implementation that isn't.

I'd argue that ptp(x) == 0 is a bad way to implement this anyway - why not max(x) == min(x), which is fewer comparisons?

@WarrenWeckesser
Copy link
Member

FWIW: When I was testing various ways to check that an array is constant earlier this year, the fastest turned out to be (x == x[0]).all(). In my case, x was a 1-d aray and I knew the length was not 0. To check for constant values along the first dimension of an n-d array, you could use (x == x[0]).all(axis=0).

@eric-wieser
Copy link
Member

eric-wieser commented Dec 5, 2019

Perhaps it would be worth adding to numpy somewhere (or perhaps a C-optimized version of the following):

def all_same(arr, *, axis):
    axis = normalize_axis_tuple(axis, arr.ndim)
    sl = [slice(None)]*arr.ndim
    for ax in axis:
         sl[axis] = slice(None, 1)
    return (x == x[tuple(sl)]).all(axis=axis)

Note by using [:1] instead of [0] it works for size 0 arrays too

@bashtage
Copy link
Contributor Author

bashtage commented Dec 5, 2019

@eric-wieser The original justification for ptp was that it should be about 2x as fast as using max and min. I think the implementation in NumPy isn't and is actually doing two separate passes.

@WarrenWeckesser The potential issue there is that it requires creating an intermediate array that has the same dimension as x. If x is large, then this array will also be large. The ptp method uses a reduction first, and then does creates an intermediate with the same size as the reduction.

@eric-wieser
Copy link
Member

eric-wieser commented Dec 5, 2019

The original justification for ptp was that it should be about 2x as fast as using max and min. I think the implementation in NumPy isn't and is actually doing two separate passes.

ptp is implemented as max() - min() inside numpy

that it requires creating an intermediate array

That's a fair point - although at least the array is a boolean so typically 4-8x smaller than x

@bashtage
Copy link
Contributor Author

bashtage commented Dec 5, 2019

For me, the reduction dimension is negligible relative to the shape of the axis being reduced (1000-10000000x smaller), so it is pretty negligible.

A loopy implementation of all_same would avoid intermediates.

@bashtage
Copy link
Contributor Author

bashtage commented Dec 5, 2019

In terms of thinking about the choice here, it is hard to understand by why some ops, e.g., sum, happily cast bools to integers, while other, e.g., ptp, do not.

Put another way, I don't really understand how + is well defined by - is not

@bashtage
Copy link
Contributor Author

bashtage commented Dec 5, 2019

This seems to be a rabbit hole where it is hard to be fully consistent. For example

>>> x = np.bool_(True)
>>> x + x
True
>>> np.array([x, x]).sum()
2
>>> np.diff(np.array([x, x]))
array([False])

bashtage added a commit to bashtage/statsmodels that referenced this issue Dec 5, 2019
Workaround for bool data when using ptp
Also silences pandas ptp warning in this location

xref numpy/numpy#15057
@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Dec 5, 2019

I don't know the details, but there is a lot of relevant history here, including some churn back in version 1.13 when the deprecation of boolean subtraction that was added in 1.9 (I think) was converted to an error (e.g. #9251, #9255); see the links below. Digging into that history is probably necessary to explain how we ended with the current behavior.

Can someone familiar with the details and history comment here? Is there more to do, or is the current behavior the best compromise at the moment (and therefore this issue should be closed)?

Edit:

@bashtage
Copy link
Contributor Author

bashtage commented Dec 5, 2019

After this thread, I went with something like

def all_same(arr, *, axis):
    _max = arr.max(axis=axis)
    _min = arr.min(axis=axis)
    return _max == _min

which is what was originally suggested. Seems to work correctly for nan as well.

@seberg
Copy link
Member

seberg commented Dec 6, 2019

@bashtage nice. Sounds like we can close this then. I personally think it is OK if ptp does not like booleans much (although others mileage may vary). I will close this, if anyone disagrees just comment!

@seberg seberg closed this as completed Dec 6, 2019
@bashtage
Copy link
Contributor Author

It would be nice if there was some more consistency for booleans w.r.t when they are treated as 1 and normal int math appears or when adding is treated as logical_or.

@seberg
Copy link
Member

seberg commented Dec 10, 2019

Yes, I agree, but we are in a tricky position here. Python traditionally treats booleans exactly like 0 and 1 (they are just singletons of a subclass of int). We have been stricter, because it is even stranger in NumPy world (e.g. with limited dtype precision and what happens if you index with a boolean).

But we cannot decide that np.True_ + np.True_ stops working, it is too "normal" in the python world. The subtract is a case where we push back. Python is moving slowly in that direction, but until they change math as well...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants