
Add dtypes for random number generation #6869


Closed · wants to merge 1 commit into from

Conversation

@gfyoung (Contributor) commented Dec 20, 2015

Addresses the request in #6790 to allow methods like rand and randint to generate numbers with a specified dtype, motivated by memory considerations.

(Optional, But Recommended) Depends on #6824

If #6824 is merged in, the PR will be revised to make any newly created methods from that PR take in a dtype parameter.

@gfyoung changed the title from "PR IN PROGRESS: AUTHOR NEEDS HELP THOUGH" to "Add dtypes for random number generation" on Dec 20, 2015
@gfyoung (Contributor, Author) commented Dec 20, 2015

A couple of questions:

  1. I made changes to randint that allow me to generate random integers with specific dtypes, but my solution seems too good to be true, especially the recasting of the PyArray_DATA pointer to long long *. Sanity check?

  2. While I can think of simple tests to verify that the dtype argument is respected, I'm not sure how to test the request in the email attached to the issue, which was for a giant array of uint8 numbers. Any ideas?

@jaimefrio (Member):

It is too good to be true. If you call e.g. randint(256, size=(1000,), dtype=np.uint8), your array_data pointer will point at the beginning of 1000 bytes of allocated memory. But because the pointer is of type long long, which takes up 8 bytes per element, writing to array_data[125] already targets an address just past the allocated memory (125 * 8 = 1000), which means that you end up writing 875 of the values outside the allocation: you should be getting a big segmentation fault.
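
The size mismatch is easy to see from Python with a view (a sketch, not the PR's code): reinterpreting a 1000-byte buffer as 8-byte integers leaves room for only 125 elements, so a loop over 1000 long long slots would run far past the allocation.

```python
import numpy as np

buf = np.empty(1000, dtype=np.uint8)   # 1000 bytes of allocated memory
as_i64 = buf.view(np.int64)            # same buffer reinterpreted as 8-byte ints
assert as_i64.size == 125              # only indices 0..124 are in bounds
```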

@gfyoung (Contributor, Author) commented Dec 21, 2015

I actually didn't get a segmentation fault with that exact method call, but I definitely broke something behind the scenes with my changes using a method call similar to the one you proposed. Back to the drawing board.

As an aside, I had a lot of issues using fused types because I'm trying to recast the return value (long long) from rk_interval as different data types depending on what is passed in for dtype. This context seems different from the general context that they are used in, at least from what I could see from documentation.

@jaimefrio (Member):

I think the idea would be to make a function that did something like:

ctypedef fused integer_type:
    char
    short
    int
    long
    long long
    unsigned char
    unsigned short
    unsigned int
    unsigned long
    unsigned long long

cdef fill_randint_array(integer_type[:] array_data, npy_intp length,
                        long long lo, long long diff,
                        rk_state *state, object lock):
    cdef npy_intp i
    with lock, nogil:
        for i from 0 <= i < length:
            array_data[i] = lo + <long long>rk_interval(diff, state)

You would then probably have to call it with something like:

fill_randint_array(&array.reshape(-1)[0], ...)

to trigger the right function call. Has something akin to this failed for you?

@gfyoung (Contributor, Author) commented Dec 21, 2015

I tried something akin to that, but the problem is that Cython considers the elements of array to be Python objects rather than C objects. Thus, it unfortunately cannot take their addresses.

@pv (Member) commented Dec 21, 2015

The long -> long long change in the rng is probably wrong. It's also not required, as you can draw more bytes instead. You can take addresses of numpy arrays in Cython if they are declared appropriately.
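
The "draw more bytes" idea can be sketched from Python (illustrative only; the real fix would live at the C level): fill a uint8 array from the generator's byte stream instead of widening the core integer type.

```python
import numpy as np

# Illustrative sketch: satisfy a request for many uint8 values by drawing
# raw bytes from the generator rather than changing its integer width.
rs = np.random.RandomState(0)
raw = rs.bytes(1000)                      # 1000 random bytes from the stream
arr = np.frombuffer(raw, dtype=np.uint8)  # reinterpret the bytes, no copy
assert arr.shape == (1000,)
```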

@gfyoung (Contributor, Author) commented Dec 21, 2015

I'm not sure I quite follow the long to long long change. Could you elaborate? I think I made that change earlier (see the dependent PR here) because numbers weren't being generated properly.

@gfyoung (Contributor, Author) commented Dec 21, 2015

Also, I'm now not sure I follow your comment about properly declaring array objects in Cython in order to get their addresses. cdef is used for initializing the array in randint. Doesn't that make array a C object?

@pv (Member) commented Dec 21, 2015

Yes; what I meant is that array.reshape(-1) mentioned above is not a C object as far as Cython knows, so you'd need to declare a new variable to hold it.

@gfyoung (Contributor, Author) commented Dec 21, 2015

It just occurred to me: how can we guarantee that reshape won't return a copy of array? (I believe the method may return only a copy.) If so, the pointer passed in would be pointing to the wrong array, wouldn't it?

@jaimefrio (Member):

You have just allocated the array, so it is contiguous, and the reshape will not trigger a copy. It's unnecessary, but if you want to be extra careful, rather than calling .reshape(), assign to the shape attribute, which will trigger an error if a copy is needed.
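
The shape-assignment safeguard can be demonstrated with a quick sketch (independent of the PR's code): assignment reshapes in place or raises, whereas .reshape() would silently return a copy.

```python
import numpy as np

# Freshly allocated arrays are contiguous, so an in-place reshape is safe:
b = np.empty((3, 4))
b.shape = (12,)          # succeeds without copying
assert b.shape == (12,)

# A non-contiguous view cannot be reshaped in place; shape assignment
# raises instead of silently copying (unlike .reshape()):
t = np.arange(12).reshape(3, 4).T
try:
    t.shape = (12,)
    copied_silently = True
except AttributeError:
    copied_silently = False
assert not copied_silently
```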

@gfyoung (Contributor, Author) commented Dec 21, 2015

Here is what I currently have:

cdef ndarray array "arrayObject"
cdef ndarray flat_array "arrayObject"

...

array = <ndarray>np.empty(size, dtype)
length = PyArray_SIZE(array)

flat_array = <ndarray>array.reshape(-1)
randint_array(self.internal_state, &flat_array[0], length, lo, diff, self.lock)

Cython (FYI: version 0.23.2) keeps complaining that it cannot take the address of flat_array because it is a Python variable. However, I used cdef to declare it. What is going on here?

@gfyoung (Contributor, Author) commented Dec 22, 2015

Accidentally pushed my current and incomplete work to my fork. Luckily it is all in that new commit "FIXUP", so not all is ruined. However, I'm puzzled by the fact that AppVeyor said that my code passed, even though both builds failed to compile. Is there a bug somewhere in the AppVeyor script?

@pv (Member) commented Dec 22, 2015

AppVeyor failures are currently ignored, because there are other failing tests present in master.

@gfyoung (Contributor, Author) commented Dec 23, 2015

In spite of the AppVeyor issue, Travis is unhappy because Cython is unhappy with that cdef flat_array for some reason (see my comment above). Any ideas about what to do?

@jaimefrio (Member):

When things with Cython get to this point of figuring out these arcane details, I'm always strongly tempted to send it to hell and write things in C directly. That would be an entirely different kind of PITA, so let's give it one more chance...

You can also get Cython to efficiently index your array, by typing the contents of the ndarray with a special buffer syntax. This is tricky, because to properly do that, you need to know the C type, and if you already knew it, you could do away with the extra function call and do the casting directly. Here and here are a couple of SO questions dealing with similar issues.

I think, but you will need to figure out if it really works, that in your case the easiest would be to move the flat_array declaration into the randint_array function. Something like this:

cimport cython
@cython.boundscheck(False)
cdef object randint_array(rk_state *state, ndarray[npy_integer, ndim=1] flat_array,
                          long long lo, long long diff, object lock):
    cdef npy_intp i
    cdef npy_intp length = flat_array.shape[0]
    with lock, nogil:
        for i from 0 <= i < length:
            flat_array[i] = lo + <long long>rk_interval(diff, state)

You would then simply call this function as:

return randint_array(self.internal_state, array.reshape(-1), lo, diff, self.lock)

Good luck!

@gfyoung (Contributor, Author) commented Dec 23, 2015

I just tried that and no luck unfortunately. The problem is that the array object is ambiguous, since it has no data type associated with it in its declaration (cdef ndarray array "arrayObject"), so Cython cannot compile the function call at the end of your comment.

I thought about what you said about writing it in C. I was wondering, how does numpy translate the dtype argument into C data types when you call np.array([1, 2, 3, 4, 5], dtype=np.uint8) for example? Surely it must not do this giant if-else if-else block, but I haven't been able to find the implementation.

@jaimefrio (Member):

Does it also fail if array is not typed, i.e. is not explicitly declared as an ndarray? What we are dealing with here has to be a syntax issue: I cannot believe there isn't a way to do what we want...

As for the C version of this, it uses a gigantic switch statement, not pretty either...

@gfyoung (Contributor, Author) commented Dec 23, 2015

I think I tried something similar to that, and then Cython said it could not find a suitable method. I believe this is what you mean (see below), correct?

cdef randint_array(rk_state *state, ndarray[npy_integer, ndim=1] flat_array,
                   long long lo, unsigned long long diff, object lock):
...

def randint(self, lo, hi=None, dtype=None):

    # Instead of cdef ndarray array, do this
    array = np.empty(size, dtype)

@gfyoung (Contributor, Author) commented Dec 24, 2015

@jaimefrio : I came up with an entirely different implementation for the array just now (see commit history). It seems to work (I did some random randint calls with different types), and Travis is happy. How does it look? My concern right now is that I had to remove the nogil part (I can't add elements to the array with nogil), which could affect the behavior of randint under multi-threaded conditions.

@jaimefrio (Member):

You would need to test it, but I think that is going to be Python slow...

I remembered that there is an example in scipy.ndimage.label of the contortions you have to go through to get a C type specialized function based on the dtype of a numpy array. The following compiles, but I haven't tested it any further:

ctypedef fused npy_integers:
    char
    short
    int
    long
    long long
    unsigned char
    unsigned short
    unsigned int
    unsigned long
    unsigned long long

cdef void fill_array(rk_state *internal_state, object lock, long lo,
                unsigned long diff,
                npy_integers[::1] flat_array,
                npy_intp length):
    cdef npy_intp i
    with lock, nogil:
        for i from 0 <= i < length:
            flat_array[i] = lo + <long>rk_interval(diff, internal_state)

def get_fill_function(ndarray[npy_integers] array):
    return <npy_intp><void *>fill_array[npy_integers]

ctypedef void (*fill_array_t)(rk_state*, object, long, unsigned long,
                              void*, npy_intp)

And then inside the randint function you call this as:

cdef fill_array_t fill_function
...
array = <ndarray>np.empty(size, dtype)
fill_function = <fill_array_t><void *><npy_intp>get_fill_function(array.take([0]))
length = PyArray_SIZE(array)
fill_function(self.internal_state, self.lock, lo, diff,
              PyArray_DATA(array), length)
return array

I personally think this approach is ludicrous, especially those magic castings to get the function pointer, and would not like to see something like this creeping into numpy, but would like to hear from others.

I think the proper way of doing this would be to have a small C module, probably using NumPy's templating syntax (look for a *.c.src file in the numpy codebase for an example) defining all the specialized array filler functions, plus a getter function that dispatches on the dtype's num attribute.

@gfyoung (Contributor, Author) commented Dec 24, 2015

Ah, yes, speed is also an issue here (the slow-down is noticeable already for a 4096 * 4096 array). It seems like "simple" solutions just won't cut it here, will they? 😄

I think both options are worth considering (especially if they give flexibility without compromising speed), though I am not entirely sure how to appropriately set up the *.c.src file and where to place it. I can attempt the scipy implementation first as a separate commit and then do the same with the *.c.src implementation.

@gfyoung (Contributor, Author) commented Dec 25, 2015

I gave your scipy implementation a shot, and it worked quite well (I had to make several small modifications, but that's it). There was no noticeable performance decrease like I noticed with my own implementation, which is a good sign.

I gave the other implementation some thought, and if I'm not mistaken from your description, it seems like we are disguising the giant if...else as a getter function that returns the appropriate randint_array function (using nomenclature from my commit) depending on the num attribute of dtype. As convoluted as the scipy implementation is, I think I would side with it. First, it is quite compact, relatively speaking. Also, I don't see a reason to dispute something that has worked for a close cousin! 😄

From the PR description, attaching a dtype to randint is only part of the change. Thus, I agree that we should wait to see if anyone has anything to add. Otherwise, I will go ahead and try to do the same thing for rand as well as tomaxint.

@gfyoung (Contributor, Author) commented Dec 25, 2015

I believe I added it because my C compiler was raising a warning telling me to put it there to "speed things up".

@jaimefrio (Member):

The stated reason to use Cython in a situation like this is that it improves maintainability, because many people are not comfortable with C, but don't mind Cython. I don't think that really applies here:

  • there is a confusing extra level of indirection, with the getter function that takes a typed array to select the function that gets called.
  • you end up having to deal with function pointers anyway: if you understand that, then you shouldn't have problems making sense of C.
  • what kills me is that the function pointer has to be passed around as an integer value for things to work. It's probably safe, because you wisely use npy_intp. But having to cast a number to a void pointer to a function pointer just seems wrong to me.

If I did not make it clear enough, I don't like this approach. Not at all. ;-) But if others think that it is not that bad, I won't fight back too hard.

What do you think, @charris, @njsmith?

@charris (Member) commented Dec 25, 2015

I think this is the wrong place to start. Note that the Mersenne twister used by numpy produces 32-bit random integers stored in long. Hence in randomkit.c there is an rk_long function to fill out the full long by concatenating two 32-bit random samples whenever long is larger than 32 bits. One consequence is that the random integer stream is not reproducible; it differs between architectures with 32-bit longs and 64-bit longs. I think a better place to start is having defined 32-bit and 64-bit random integer streams using different functions. The current randint can then be adjusted to use the appropriate one to preserve backward compatibility. We could then provide new functions for the 32-bit and 64-bit streams; we just need to decide on the names and whether to specify half-open or closed intervals. Whether or not to have a specialized npy_intp type is an open question.
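
The 32-bit chaining described above can be sketched in plain Python (an illustration, not randomkit's actual code): a 64-bit value is built by concatenating two 32-bit samples from the generator.

```python
import random

# Build a 64-bit integer from two 32-bit generator outputs, in the spirit
# of rk_long-style code on platforms where long is wider than the
# twister's native 32-bit output.
rng = random.Random(0)

def draw32():
    return rng.getrandbits(32)

def draw64():
    hi = draw32()            # upper 32 bits
    lo = draw32()            # lower 32 bits
    return (hi << 32) | lo

x = draw64()
assert 0 <= x < 2 ** 64
```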

@rkern Thoughts?

@charris (Member) commented Dec 25, 2015

Small correction: the bit stream depends on the difference between lo and hi. For differences <= 0xffffffff, the random streams are the same, and greater differences are only supported on platforms with long > 32 bits. So there is no compatibility problem between platforms.

@gfyoung (Contributor, Author) commented Dec 25, 2015

@charris : Indeed, see #6824

@charris (Member) commented Dec 25, 2015

@gfyoung I think adding a dtype may be more than needed. But in any case, I think you do want efficient integer streams from mtrand, perhaps in different ranges. Note that pulling off anything but longs from the state can lead to alignment problems, so everything needs to be constructed from longs. But these things should all be new, just leave the current functions as they are so there are no compatibility questions.

@charris (Member) commented Dec 25, 2015

Note, I think mtrand.c is the place to start.

@gfyoung (Contributor, Author) commented Dec 25, 2015

@charris : Sorry, I really don't follow what you just said. Could you clarify? How does this relate to dtype and the issue I am trying to address here?

@charris (Member) commented Dec 25, 2015

I think you should forget about dtypes for a while and study the functions in mtrand.c.

@rkern (Member) commented Dec 25, 2015

@charris mtrand.c is generated from the Cython mtrand.pyx. No one should look at that.

@charris (Member) commented Dec 25, 2015

@rkern, oops, meant randomkit.c.

@gfyoung (Contributor, Author) commented Dec 25, 2015

@charris : Have you looked at #6824? randomkit.c is pretty much all I have been looking at there. Also, to restate, I'm still not seeing how this discussion fits within this PR. Unless someone can explain, the discussion about bit streams seems to me to be beyond the scope of this PR and would fit better in #6824.

@rkern (Member) commented Dec 26, 2015

You are trying to solve what is essentially the same missing feature with three separate PRs. @charris's comments are relevant to the underlying feature if not the one part you are implementing in this PR. In particular, he is suggesting that this particular prong of your attack on this problem is probably not going to be fruitful.

@rkern (Member) commented Dec 26, 2015

@charris Yes, I think adding rk_uint32 and rk_uint64 would be a good place to start.

@gfyoung (Contributor, Author) commented Dec 26, 2015

Alright: I think there is a discrepancy between what I understand about random integer generation and what you guys understand about it. The way I see these PRs, they are three separate issues:

  1. Overflow with 32-bit long
  2. Generating integers of different integer types
  3. Refactoring random_integers to accept a slightly broader range of arguments

Now, you guys are of the view that there is some fundamental issue with random number generation that all of my PRs are somehow not addressing. Let me restate again: I don't really understand what you guys are talking about. Thus, I would appreciate it (for the sake of my own understanding, and so I can work on these PRs properly instead of barking up the wrong tree for another week) if someone could lay out exactly what is going on and address these questions:

  1. What exactly is the issue underlying random integer generation?
  2. How are these seemingly disparate issues that I am addressing related to it, and why is each of these PRs not addressing the core issue?

I would love to help improve random number generation here, but being told something doesn't work without much explanation makes it difficult. Thanks.

@njsmith (Member) commented Dec 26, 2015

I haven't had the bandwidth to follow this in detail, was just trying to respond quickly since I was pinged, but I think my comment above is in the correct thread, since it's suggesting one possible strategy for handling smaller-width dtypes in randint etc.

@gfyoung (Contributor, Author) commented Dec 26, 2015

@njsmith: See my comment above. Also, judging from the numbers you suggested, isn't that suggestion tuned to the request in the email?

@charris (Member) commented Dec 26, 2015

@njsmith You need to fetch multiples of 32 bits in any case and discard the extra in order to keep the state of the rng sane. If you have internal buffers that are not cleared when the rng is reseeded, the stream is not reproducible.

@njsmith (Member) commented Dec 26, 2015

I'm sorry, no idea what you folks are saying :-) My suggestion is that one crude but nonetheless possibly attractive strategy for implementing dtype= support is:

def randint(low, high=None, size=(), dtype=int):
    if dtype == np.dtype(int):
        return _randint_core(low, high, size)
    total = np.prod(size)
    out = np.empty(total, dtype=dtype)
    n_full = total // BUFSIZE
    for i in range(n_full):
        out[i * BUFSIZE : (i + 1) * BUFSIZE] = _randint_core(low, high, BUFSIZE)
    out[n_full * BUFSIZE :] = _randint_core(low, high, total % BUFSIZE)
    return out.reshape(size)

It's not quite as efficient as all the fused-type style approaches, but with appropriate choice of BUFSIZE it might do pretty well, and is massively simpler. (It comes even closer to performance parity if we add support for _randint_core to write to an appropriately dtyped out= buffer, so as to allow the loop here to allocate a single bounce buffer and re-use it instead of allocating a new one on each call.)

@charris (Member) commented Dec 26, 2015

@njsmith Of course you can cast using a buffer; that is a detail. But from my point of view we need to be able to produce 64-bit random integers on platforms with 32-bit longs. Hence, first we need rk_interval32 and rk_interval64, which might as well be based on rk_uint32 and rk_uint64. That is simple enough and provides a basis for whatever we want to do afterwards. We will probably want to produce byte types, etc., using 32-bit chunked conversions. I suggest separate functions for random_int8, random_int16, random_int32, and random_int64, as they should each be implemented somewhat differently and, due to backward compatibility concerns, we only get one chance. I chose those functions instead of the rand*** versions as it would be nice to get the full range (closed intervals). The unsigned versions can be gotten with a view and proper choice of bounds. Other versions, say the current long (int), can be gotten by assigning the appropriate specific version depending on platform.
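
The "unsigned via a view" trick mentioned above can be sketched as follows (illustrative only): draw signed integers over the full bit width, then reinterpret the same bytes as unsigned.

```python
import numpy as np

# Pretend these came from a signed full-width generator; the view
# reinterprets the bit patterns as unsigned without copying.
signed = np.array([-1, -2**63, 2**63 - 1], dtype=np.int64)
unsigned = signed.view(np.uint64)

assert unsigned[0] == 2**64 - 1   # all-ones bit pattern of -1
assert unsigned[1] == 2**63       # bit pattern of INT64_MIN
assert unsigned[2] == 2**63 - 1   # non-negative values are unchanged
```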

I think the important thing at this point is to decide what we want and lay out the project.

@njsmith (Member) commented Dec 26, 2015

Oh sure, we need to sort out the basic int32/int64 generation routines too (I thought that was more #6824 territory but like I said I haven't followed this whole discussion closely). I was narrowly commenting on the discussion in this thread that was struggling to find a good strategy for handling narrower dtypes without adding a bunch of nasty templated code.

@homu (Contributor) commented Dec 28, 2015

☔ The latest upstream changes (presumably #6885) made this pull request unmergeable. Please resolve the merge conflicts.

@gfyoung (Contributor, Author) commented Dec 28, 2015

@homu : I'm actually going to refrain from rebasing until we can figure out exactly what's going on with this random integer generation thing. It may require just restarting from scratch given the refactoring I did in #6885.

@jaimefrio (Member):

@homu is a bot, but I'm sure he appreciates being replied to. ;-)

@gfyoung (Contributor, Author) commented Dec 29, 2015

Whoops! Did not know that. 😄

Though the comment does seem relevant nonetheless given the discussion going on.

@charris (Member) commented Jan 1, 2016

@gfyoung Closing this in favor of #6910. Thanks for getting the ball rolling.

@charris charris closed this Jan 1, 2016
@gfyoung gfyoung deleted the rand_dtype branch January 1, 2016 21:29