Skip to content

near-infinite loop on ppc64 in np.random.hypergeometric #14457

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ahaldane opened this issue Sep 8, 2019 · 9 comments · Fixed by #14458
Closed

near-infinite loop on ppc64 in np.random.hypergeometric #14457

ahaldane opened this issue Sep 8, 2019 · 9 comments · Fixed by #14458
Labels

Comments

@ahaldane
Copy link
Member

ahaldane commented Sep 8, 2019

Running the following command on ppc64(be) hangs due to a near-infinite loop:

>>> np.random.hypergeometric([1,2,2], [2], [2])

Appears to be related to new code in #13761, perhaps @WarrenWeckesser has an idea. See in particular frame #2 below. It appears a value is negative but the code expects it not to be.

gdb stack trace after breaking during hang:

#0  __log (x=-25811282) at ../math/w_log.c:27
#1  0x0ea14754 in loggam (x=-160848451)
    at numpy/random/src/distributions/distributions.c:503
#2  0x0ea1232c in random_hypergeometric_hrua (bitgen_state=0xf76272c8, good=1, 
    bad=2, sample=268080756)
    at numpy/random/src/legacy/legacy-distributions.c:266
#3  0x0ea12798 in random_hypergeometric_original (bitgen_state=0xf76272c8, 
    good=1, bad=2, sample=268080756)
    at numpy/random/src/legacy/legacy-distributions.c:315
#4  0x0ea12844 in legacy_random_hypergeometric (bitgen_state=0xf76272c8, 
    good=1, bad=2, sample=8858015348)
    at numpy/random/src/legacy/legacy-distributions.c:334
#5  0x0e96db10 in __pyx_f_5numpy_6random_6common_discrete_broadcast_iii (
    __pyx_v_func=0xea127e8 <legacy_random_hypergeometric>, 
    __pyx_v_state=0xf76272c8, __pyx_v_size=0x104520e0 <_Py_NoneStruct>, 
    __pyx_v_lock=0xf7647140, __pyx_v_a_arr=0xf754c638, 
    __pyx_v_a_name=0xf7643fa0, 
    __pyx_v_a_constraint=__pyx_e_5numpy_6random_6common_CONS_NON_NEGATIVE, 
    __pyx_v_b_arr=0xf754c570, __pyx_v_b_name=0xf7643f80, 
    __pyx_v_b_constraint=__pyx_e_5numpy_6random_6common_CONS_NON_NEGATIVE, 
    __pyx_v_c_arr=0xf754c548, __pyx_v_c_name=0xf7643fe0, 
    __pyx_v_c_constraint=__pyx_e_5numpy_6random_6common_CONS_GTE_1)
    at numpy/random/common.c:13924
---Type <return> to continue, or q <return> to quit---frame 1
#6  0x0e9e7c5c in __pyx_pf_5numpy_6random_6mtrand_11RandomState_96hypergeometric (__pyx_v_self=0xf76272b8, __pyx_v_ngood=0xf7551058, __pyx_v_nbad=0xf754ffa8, 
    __pyx_v_nsample=0xf754fe68, __pyx_v_size=0x104520e0 <_Py_NoneStruct>)
    at numpy/random/mtrand.c:15834
#7  0x0e9e536c in __pyx_pw_5numpy_6random_6mtrand_11RandomState_97hypergeometric (__pyx_v_self=0xf76272b8, __pyx_args=0xf7535b48, __pyx_kwds=0x0)
    at numpy/random/mtrand.c:15312
#8  0x1006ca2c in PyCFunction_Call ()

A little bit of gdb exploration so far:

(gdb) frame 1
#1  0x0ea14754 in loggam (x=-160848451)
    at numpy/random/src/distributions/distributions.c:503
(gdb) p n
$2 = 160848458
(gdb) p x
$3 = -160848451
(gdb) frame 2
#2  0x0ea1232c in random_hypergeometric_hrua (bitgen_state=0xf76272c8, good=1, 
    bad=2, sample=268080756)
    at numpy/random/src/legacy/legacy-distributions.c:266
266	  d10 = (loggam(d9 + 1) + loggam(mingoodbad - d9 + 1) + loggam(m - d9 + 1) +
(gdb) l
261	  d5 = 1.0 - d4;
262	  d6 = m * d4 + 0.5;
263	  d7 = sqrt((double)(popsize - m) * sample * d4 * d5 / (popsize - 1) + 0.5);
264	  d8 = D1 * d7 + D2;
265	  d9 = (RAND_INT_TYPE)floor((double)(m + 1) * (mingoodbad + 1) / (popsize + 2));
266	  d10 = (loggam(d9 + 1) + loggam(mingoodbad - d9 + 1) + loggam(m - d9 + 1) +
267	         loggam(maxgoodbad - m + d9 + 1));
268	  d11 = MIN(MIN(m, mingoodbad) + 1.0, floor(d6 + 16 * d7));
269	  /* 16 for 16-decimal-digit precision in D1 and D2 */
270	
(gdb) p d9
$4 = -107232301

The failing line 266 in the last bit of code is from #13761, but I haven't read it well enough yet to know what it should do.

@charris charris added 00 - Bug 09 - Backport-Candidate PRs tagged should be backported labels Sep 8, 2019
@charris charris added this to the 1.17.3 release milestone Sep 8, 2019
@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Sep 9, 2019

I'll take a look, but I don't know if I'll be able to reproduce the problem. The call np.random.hypergeometric([1,2,2], [2], [2]) ultimately runs the legacy version, not the new version implemented in #13761. To see if the new code also causes a problem, try:

>>> rng = np.random.generator.Generator(np.random.PCG64(123))
>>> rng.hypergeometric([1, 2, 2], [2], [2])

The platform being ppc64(be) and those huge values of sample visible in the stack trace suggests there might be an endianess issue somewhere.

@ahaldane
Copy link
Member Author

ahaldane commented Sep 9, 2019

The new code runs no problem.

@ahaldane
Copy link
Member Author

ahaldane commented Sep 9, 2019

It does seem to be some kind of cython/multiiter endianness issue. Here's some simple cython code that seems to be failing (I don't write much cython so I haven't figured out the exact problem yet).

This is debug code I added in discrete_broadcast_iii in numpy/random/common.pyx:

    it = np.PyArray_MultiIterNew4(randoms, a_arr, b_arr, c_arr)
    with lock:
        for i in range(n):
            a_val = (<int64_t*>np.PyArray_MultiIter_DATA(it, 1))[0]
            b_val = (<int64_t*>np.PyArray_MultiIter_DATA(it, 2))[0]
            c_val = (<int64_t*>np.PyArray_MultiIter_DATA(it, 3))[0]
            print(c_val, c_arr)

The last line prints 8866477368 [2], so something seems to go wrong with PyArray_MultiIter_DATA here.

@ahaldane
Copy link
Member Author

ahaldane commented Sep 9, 2019

All right, I think I see the bug. The hypergeometric code assumes that npy_long is int64, but on ppc64(be) it is int32. I think that's because (confusingly) npy_long maps to python-int, which maps to c-int, which is int32 on ppc64, even though c-long is an int64. (edit: wrong, see comment below)

So the following code in hypergeometric is converting the inputs to int32, but later code assumes it is int64.

        # This cast to long is required to ensure that the values are inbounds
        ongood = <np.ndarray>np.PyArray_FROM_OTF(ngood, np.NPY_LONG, np.NPY_ALIGNED)
        onbad = <np.ndarray>np.PyArray_FROM_OTF(nbad, np.NPY_LONG, np.NPY_ALIGNED)
        onsample = <np.ndarray>np.PyArray_FROM_OTF(nsample, np.NPY_LONG, np.NPY_ALIGNED)

The fix might be as simple as replacing NPY_LONG by NPY_INT64, but the comment about being "inbounds" is throwing me off since I'm not sure what it means. Would that fix make sense @WarrenWeckesser ?

@charris
Copy link
Member

charris commented Sep 9, 2019

Hmm, npy_long is used all over the legacy random code. Is this perhaps a cython bug?

@ahaldane
Copy link
Member Author

ahaldane commented Sep 9, 2019

I think my description of the size of npy_long above wasn't quite right.

For reasons I don't really remember, my gcc on my ppc64 system is in 32-bit ELF mode for executables/libraries, which is why long is 4 bytes. I think if it was in 64-bit mode long should be 8 bytes. In any case we should support 32-bit executables on ppc64.

@ahaldane
Copy link
Member Author

ahaldane commented Sep 9, 2019

I don't think it's a cython bug since long is well documented to be 4 bytes on 32-bit systems.

@eric-wieser
Copy link
Member

npy_long maps to python-int, which maps to c-int, which is int32 on ppc64, even though c-long is an int64.

This sounds surprising to me. I was under the impression that python2's int mapped to C long, meaning that npy_long really is just long. C long is unfortunately unspecified as either int32 or int64, but I thought the former was only true in practice on windows.

@ahaldane
Copy link
Member Author

ahaldane commented Sep 9, 2019

C long is unfortunately unspecified as either int32 or int64, but I thought the former was only true in practice on windows.

I think it is also 4 bytes in 32-bit executables on linux 64-bit systems, and is probably controlled by compiler flags too.

I think the reason my ppc64(be) system is using 32-bit mode has to do with ubuntu support. I remember spending a while investigating ubuntu versions available on this system (a VM provided by OSUOSL, who did the ubuntu install for us), and ubuntu only provided 32-bit binaries for big-endian ppc when I asked for it in 2016.

I think ubuntu used to support "powerpc" (which was 32-bit big-endian) but dropped support 2 years ago. It currently supports ppc64el (little endian, 64-bit) used by new IBM computers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants