boolean vs. scalar comparisons very recently broke in master #7284

Closed
jreback opened this issue Feb 19, 2016 · 26 comments

Comments

@jreback

jreback commented Feb 19, 2016

between these 2 commits:

1.12.0.dev0+f4cc58c (good)

1.12.0.dev0+a2f5392 (bad)

pandas now started failing here:
https://travis-ci.org/pydata/pandas/jobs/110022540 (bad)

xref: pandas-dev/pandas#12390

@jreback
Author

jreback commented Feb 19, 2016

@cpcloud would know more about what is actually tested here. But the point is that somehow np.int64 is now being passed rather than int. Very odd.

@ahaldane
Member

Looks like it is related to #7254, which was just merged. We changed it so np.random.randint now returns numpy scalars rather than Python ints when returning a single value.

Maybe we will have to revert that, but maybe we can investigate a little first.

@ahaldane
Member

Thanks for the report also, it's nice to know pandas catches bugs so quickly!

@jreback
Author

jreback commented Feb 19, 2016

hah, we are using wheels from your master, so thank you!

@ahaldane
Member

So pandas is literally expecting that type(randint(10)) == int in order to determine which case to test, which is exactly what #7254 changed.

Even though it's a rare case, we should still try to make randint 100% backward compatible. I think we can do so while keeping #7254's nice behavior that np.random.randint(True, dtype=bool) returns bool instead of an int.

@gfyoung, what do you think of this: we could make the default value of the dtype keyword be int, and randint would return a Python int in that case when size is None (a single value). That way we are backward compatible, but we can still get the behavior of #7254 if dtype is supplied, e.g. when writing np.random.randint(10, dtype='l').
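A minimal sketch of that dispatch (a hypothetical wrapper for illustration only, not numpy's actual implementation; the inner np.random.randint call stands in for the C-level generator):

```python
import numpy as np

def randint_compat(low, high=None, size=None, dtype=int):
    # Hypothetical wrapper sketching the proposed behavior.
    ret = np.random.randint(low, high, size)  # stand-in for the C generator
    if size is not None:
        return ret.astype(dtype)      # array case: honor the requested dtype
    if dtype is int:
        return int(ret)               # default: plain Python int, backward compatible
    return np.dtype(dtype).type(ret)  # explicit dtype: numpy scalar, as in #7254
```

With dtype left at its default, a single draw comes back as a plain Python int; with an explicit dtype such as 'l', it comes back as a numpy integer scalar.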

@gfyoung
Contributor

gfyoung commented Feb 19, 2016

This is indeed a pity, but backwards compatibility must be respected. Here is what I suggest:

I think it is still correct to return numpy integers, to maintain consistency with the cases where shape specifies an array to be returned. However, in light of these failing tests, I propose that the dtype NOT be enforced for np.int32 and np.int64, since np.dtype(int) or np.dtype('l') will be cast to those values. Instead, I propose adding a FutureWarning so that downstream libraries like pandas have time to adjust to this change.

@ahaldane
Member

I'm not sure it's worth forcing people to adjust to a change, since there is so little benefit (Python integers can represent any numpy integer).

If we make the signature be randint(low, high=None, size=None, dtype=int), and then at the end of randint replace

return randfunc(low, high - 1, size, self.state_address)

by

ret = randfunc(low, high - 1, size, self.state_address)
if dtype in [bool, int, long]:
    return dtype(ret)
return ret

then if the user requests a Python type, they get a Python type, and if they request a numpy type, they get a numpy type, and we get to keep the old default. (Remove long on Python 3, though.)
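The "so little benefit" point rests on Python ints being arbitrary precision, so the dtype(ret) cast above is lossless; a quick check:

```python
import numpy as np

# Python ints are arbitrary precision, so casting any numpy integer
# scalar to int cannot lose information, even at the type's extremes.
assert int(np.int64(np.iinfo(np.int64).max)) == 2**63 - 1
assert int(np.int64(np.iinfo(np.int64).min)) == -2**63
assert int(np.uint64(np.iinfo(np.uint64).max)) == 2**64 - 1
```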

@gfyoung
Contributor

gfyoung commented Feb 19, 2016

@ahaldane: Fair enough. np.long as the default would be a little more accurate, because that goes to int in Python 3 and long in Python 2. I'll put up a PR for this shortly.

@jreback
Author

jreback commented Feb 21, 2016

OK, this looks good in pandas master.

Have some new breakages though: pandas-dev/pandas#12406

@seberg
Member

seberg commented Feb 21, 2016

@jreback it is due to gh-7215. Does this bother you or create incompatibilities with older pandas versions, or is it just a note seen in tests? Numpy used to patch the error to be an IndexError in this case, but Nathaniel removed that to simplify the code. It would be nice not to have to replace the error, but if it creates compat problems we have to think about it. Can't say I really expected problems there ;).

@jreback
Author

jreback commented Feb 21, 2016

@seberg no, the IndexError -> TypeError change looks reasonable to me. We will just update the tests for that, no biggie. It's the second one, where something is being returned as inf rather than NaN when doing a floordiv, that is more troubling. But I only had a quick look.

@seberg
Member

seberg commented Feb 21, 2016

Ah, OK, I did not realize it was two things when glancing over. Chuck is still busy with some floordiv stuff; he might have a quick idea what is going on.

@charris
Member

charris commented Feb 21, 2016

@jreback Could you specify what the arguments were?

@jreback
Author

jreback commented Feb 21, 2016

I'll have to see if I can give you a self-contained example in a bit

@charris
Member

charris commented Feb 21, 2016

@jreback I think it is 3. // 0. Note two things:

  • in Python this raises ZeroDivisionError, so compatibility does not compel numpy behavior
  • // is different from the floor function, which I suspect is used in sparse.

Currently numpy computes the %, // pair based on the result of fmod, which in this case is nan, and that is returned for both operators. However, in this case it would be possible to return the result of floor(a/b) for the integer part when b is zero, if that would help backwards compatibility.
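The difference between the two candidate results can be seen directly (illustrative; which one floordiv actually returns depends on the numpy version, as this thread shows):

```python
import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    # fmod with a zero divisor is nan, so the fmod-based %, // pair
    # yields nan for both operators:
    print(np.fmod(3.0, 0.0))                            # nan
    # floor(a/b) instead gives inf, the 1.10-style floordiv result:
    print(np.floor(np.float64(3.0) / np.float64(0.0)))  # inf
```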

@charris
Member

charris commented Feb 21, 2016

Another case that probably differs and is specified by Python is

In [2]: 1. % inf
Out[2]: 1.0

@njsmith
Member

njsmith commented Feb 21, 2016

Wow, those are brain-breaking examples. But I think 3.0 // 0.0 should be nan, on the principle that (1) it isn't well defined in its own right, and (2) it can't be extended by continuity arguments, because lim 3.0 // x can diverge in different directions depending on the choice of the x sequence (in particular, positive versus negative, just like 3 / 0).

The 1.0 % inf == 1.0 case sounds correct, though, because 1.0 % x == 1.0 for all x > 1.0, so the limit is well defined. But surely IEEE 754 and/or Annex F have something more definitive to say about this, specifically in their comments on fmod?
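Both limit arguments are easy to check numerically (plain Python, for illustration):

```python
import math

# 3.0 // x diverges in opposite directions as x -> 0 from either side,
# so 3.0 // 0.0 has no limit to extend by continuity:
print(3.0 // 1e-300)    # large positive
print(3.0 // -1e-300)   # large negative

# 1.0 % x == 1.0 for all finite x > 1.0, and the limit carries over to inf:
print(1.0 % 1e300)               # 1.0
print(1.0 % math.inf)            # 1.0
print(math.fmod(1.0, math.inf))  # 1.0
```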

@charris
Member

charris commented Feb 21, 2016

@njsmith I think it probably is specified, as MSVC 2008 got it wrong and it is corrected in later versions.

@pv
Member

pv commented Feb 21, 2016

Annex F:

fmod(±0, y) returns ±0 for y not zero.
fmod(x, y) returns a NaN and raises the ‘‘invalid’’ floating-point exception for x infinite or y zero.
fmod(x, ±∞) returns x for x not infinite.
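Those three rules can be checked against np.fmod, which wraps the C library's fmod (results assume a conforming libm):

```python
import math
import numpy as np

# fmod(±0, y) returns ±0 for y not zero (the sign of x is preserved):
assert math.copysign(1.0, np.fmod(-0.0, 5.0)) == -1.0
assert math.copysign(1.0, np.fmod(0.0, 5.0)) == 1.0

# fmod(x, y) is nan for x infinite or y zero:
with np.errstate(invalid='ignore'):
    assert np.isnan(np.fmod(np.inf, 2.0))
    assert np.isnan(np.fmod(3.0, 0.0))

# fmod(x, ±inf) returns x for x not infinite:
assert np.fmod(3.0, np.inf) == 3.0
assert np.fmod(-3.0, -np.inf) == -3.0
```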

@charris
Member

charris commented Feb 21, 2016

Which makes me think I should explicitly return nan rather than the result of fmod in the case of zero division so that things work properly on MSVC 2008 also.

@jreback
Author

jreback commented Feb 21, 2016

So here's a simple repro.

In [1]: np.__version__
Out[1]: '1.12.0.dev0+7d4d26a'

In [3]: np.array([3])//np.array([0],dtype='float64')
Out[3]: array([ nan])

In [2]: np.float64(3)//np.float64(0)
Out[2]: nan

In [3]: np.__version__
Out[3]: '1.10.4'

In [4]: np.float64(3)//np.float64(0)
Out[4]: inf

In [7]: np.array([3])//np.array([0],dtype='float64')
Out[7]: array([ inf])

@jreback
Author

jreback commented Feb 21, 2016

So my test fails because we are expecting the infs (and the result is now NaN in numpy >= 1.11).

This is 1.10.4

In [15]: b = array([ nan,  nan,  nan,   0.,   1.,   2.,   3.,   4.,   5.,   6.])

In [16]: a = array([[ nan,  nan,  nan,   0.,   1.,   2.,   3.,   4.,   5.,   6.],
       [  0.,   1.,   2.,  nan,  nan,  nan,   3.,   4.,   5.,   6.],
       [  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [  0.,   1.,   2.,   3.,   4.,   5.,  nan,  nan,  nan,  nan]])

In [17]: a//b
Out[17]: 
array([[ nan,  nan,  nan,  nan,   1.,   1.,   1.,   1.,   1.,   1.],
       [ nan,  nan,  nan,  nan,  nan,  nan,   1.,   1.,   1.,   1.],
       [ nan,  nan,  nan,  inf,   4.,   2.,   2.,   1.,   1.,   1.],
       [ nan,  nan,  nan,  inf,   4.,   2.,  nan,  nan,  nan,  nan]])

@charris
Member

charris commented Feb 21, 2016

@jreback Yes, but what should we do about it? NaN is probably the most correct.

@jreback
Author

jreback commented Feb 21, 2016

@charris yes, agreed, NaN is the most logical here. OK, as long as you announce this as a bug fix I think you are fine. I will adjust my tests.

@charris
Member

charris commented Feb 21, 2016

@jreback OK, I will make sure the release notes mention it.

@charris
Member

charris commented Feb 21, 2016

Actually, MSVC 2008 gets the 1.0 % inf case wrong, not the zero division.

jaimefrio pushed a commit to jaimefrio/numpy that referenced this issue Mar 22, 2016
The 'pandas' library expects Python integers to be
returned, so this commit changes the API so that
the default is 'np.int' which converts to native
Python integer types when a singleton is being
generated with this function.

Closes numpygh-7284.