
Add dtypes for random number generation #6869


Closed · wants to merge 1 commit into from

Conversation

@gfyoung (Contributor) commented Dec 20, 2015

Addresses the request in #6790 to allow methods like rand and randint to generate numbers with a specified dtype, motivated by memory considerations.

(Optional, But Recommended) Depends on #6824

If #6824 is merged in, the PR will be revised to make any newly created methods from that PR take in a dtype parameter.

@gfyoung changed the title from "PR IN PROGRESS: AUTHOR NEEDS HELP THOUGH" to "Add dtypes for random number generation" on Dec 20, 2015
@gfyoung (Contributor, Author) commented Dec 20, 2015

A couple of questions:

  1. I made changes to randint that allow me to generate random integers with specific dtypes, but my solution seems too good to be true, especially the recasting of the PyArray_DATA pointer to long long *. Sanity check?

  2. While I can think of simple tests to verify that the dtype argument is respected, I'm not sure how to test the request in the email attached to the issue, which was for a giant array of uint8 numbers. Any ideas?

@jaimefrio (Member):

It is too good to be true. If you call e.g. randint(256, size=(1000,), dtype=np.uint8), your array_data pointer will point at the beginning of 1000 bytes of allocated memory. But because the pointer is of type long long, which takes up 8 bytes per element, writing to array_data[125] already targets an address just past the allocated memory (125 * 8 = 1000), which means that you end up writing 875 of the values outside the allocation: you should be getting a big segmentation fault.
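
The size mismatch is easy to see from Python with a view (a sketch, not the PR's code): reinterpreting a 1000-byte buffer as 8-byte integers leaves room for only 125 elements, so a loop over 1000 long long slots would run far past the allocation.

```python
import numpy as np

buf = np.empty(1000, dtype=np.uint8)   # 1000 bytes of allocated memory
as_i64 = buf.view(np.int64)            # same buffer reinterpreted as 8-byte ints
assert as_i64.size == 125              # only indices 0..124 are in bounds
```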

@gfyoung (Contributor, Author) commented Dec 21, 2015

I actually didn't get a segmentation fault with that exact method call, but I definitely broke something behind the scenes with my changes using a method call similar to the one you proposed. Back to the drawing board.

As an aside, I had a lot of issues using fused types because I'm trying to recast the return value (long long) from rk_interval as different data types depending on what is passed in for dtype. This context seems different from the general context that they are used in, at least from what I could see from documentation.

@jaimefrio (Member):

I think the idea would be to make a function that did something like:

ctypedef fused integer_type:
    char
    short
    int
    long
    long long
    unsigned char
    unsigned short
    unsigned int
    unsigned long
    unsigned long long

cdef fill_randint_array(integer_type[:] array_data, npy_intp length,
                        long long lo, long long diff,
                        rk_state *state, object lock):
    cdef npy_intp i
    with lock, nogil:
        for i from 0 <= i < length:
            array_data[i] = lo + <long long>rk_interval(diff, state)

You would then probably have to call it with something like:

fill_randint_array(&array.reshape(-1)[0], ...)

to trigger the right function call. Has something akin to this failed for you?

@gfyoung (Contributor, Author) commented Dec 21, 2015

I tried something akin to that, but the problem is that Cython considers the elements of array to be Python objects rather than C objects. Thus, it unfortunately cannot take their addresses.

@pv (Member) commented Dec 21, 2015

The long -> long long change in the rng is probably wrong. It's also not required, as you can draw more bytes instead. You can take addresses of numpy arrays in Cython if they are declared appropriately.
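
The "draw more bytes" idea can be sketched from Python (illustrative only; the real fix would live at the C level): fill a uint8 array from the generator's byte stream instead of widening the core integer type.

```python
import numpy as np

# Illustrative sketch: satisfy a request for many uint8 values by drawing
# raw bytes from the generator rather than changing its integer width.
rs = np.random.RandomState(0)
raw = rs.bytes(1000)                      # 1000 random bytes from the stream
arr = np.frombuffer(raw, dtype=np.uint8)  # reinterpret the bytes, no copy
assert arr.shape == (1000,)
```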

@gfyoung (Contributor, Author) commented Dec 21, 2015

I'm not sure I quite follow the long to long long change. Could you elaborate? I think I made that change earlier (see the dependent PR here) because numbers weren't being generated properly.

@gfyoung (Contributor, Author) commented Dec 21, 2015

Also, I'm now not sure I follow your comment about properly declaring array objects in Cython in order to get their addresses. cdef is used for initializing the array in randint. Doesn't that make array a C object?

@pv (Member) commented Dec 21, 2015

Yes; what I meant is that array.reshape(-1) mentioned above is not a C object as far as Cython knows, so you'd need to declare a new variable to hold it.

@gfyoung (Contributor, Author) commented Dec 21, 2015

It just occurred to me: how can we guarantee that reshape won't return a copy of array? (I believe the method may return only a copy.) If so, the pointer passed in would be pointing to the wrong array, wouldn't it?

@jaimefrio (Member):

You have just allocated the array, so it is contiguous, and the reshape will not trigger a copy. It's unnecessary, but if you want to be extra careful, rather than calling .reshape(), assign to the shape attribute, which will trigger an error if a copy is needed.
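
The shape-assignment safeguard can be demonstrated with a quick sketch (independent of the PR's code): assignment reshapes in place or raises, whereas .reshape() would silently return a copy.

```python
import numpy as np

# Freshly allocated arrays are contiguous, so an in-place reshape is safe:
b = np.empty((3, 4))
b.shape = (12,)          # succeeds without copying
assert b.shape == (12,)

# A non-contiguous view cannot be reshaped in place; shape assignment
# raises instead of silently copying (unlike .reshape()):
t = np.arange(12).reshape(3, 4).T
try:
    t.shape = (12,)
    copied_silently = True
except AttributeError:
    copied_silently = False
assert not copied_silently
```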

@gfyoung (Contributor, Author) commented Dec 21, 2015

Here is what I currently have:

cdef ndarray array "arrayObject"
cdef ndarray flat_array "arrayObject"

...

array = <ndarray>np.empty(size, dtype)
length = PyArray_SIZE(array)

flat_array = <ndarray>array.reshape(-1)
randint_array(self.internal_state, &flat_array[0], length, lo, diff, self.lock)

Cython (FYI: version 0.23.2) keeps complaining that it cannot take the address of flat_array because it is a Python variable. However, I used cdef to declare it. What is going on here?

@gfyoung (Contributor, Author) commented Dec 22, 2015

Accidentally pushed my current and incomplete work to my fork. Luckily it is all in that new commit "FIXUP", so not all is ruined. However, I'm puzzled by the fact that AppVeyor said that my code passed, even though both builds failed to compile. Is there a bug somewhere in the AppVeyor script?

@pv (Member) commented Dec 22, 2015

AppVeyor failures are currently ignored, because there are other failing tests present in master.

@gfyoung (Contributor, Author) commented Dec 23, 2015

In spite of the AppVeyor issue, Travis is unhappy because Cython is unhappy with that cdef flat_array for some reason (see my comment above). Any ideas about what to do?

@jaimefrio (Member):

When things with Cython get to this point of figuring out these arcane details, I'm always strongly tempted to send it to hell and write things in C directly. That would be an entirely different kind of PITA, so let's give it one more chance...

You can also get Cython to efficiently index your array, by typing the contents of the ndarray with a special buffer syntax. This is tricky, because to properly do that, you need to know the C type, and if you already knew it, you could do away with the extra function call and do the casting directly. Here and here are a couple of SO questions dealing with similar issues.

I think, but you will need to figure out if it really works, that in your case the easiest would be to move the flat_array declaration into the randint_array function. Something like this:

cimport cython
@cython.boundscheck(False)
cdef object randint_array(rk_state *state, ndarray[npy_integer, ndim=1] flat_array,
                          long long lo, long long diff, object lock):
    cdef npy_intp i
    cdef npy_intp length = flat_array.shape[0]
    with lock, nogil:
        for i from 0 <= i < length:
            flat_array[i] = lo + <long long>rk_interval(diff, state)

You would then simply call this function as:

return randint_array(self.internal_state, array.reshape(-1), lo, diff, self.lock)

Good luck!

@gfyoung (Contributor, Author) commented Dec 23, 2015

I just tried that and no luck unfortunately. The problem is that the array object is ambiguous, since it has no data type associated with it in its declaration (cdef ndarray array "arrayObject"), so Cython cannot compile the function call at the end of your comment.

I thought about what you said about writing it in C. I was wondering, how does numpy translate the dtype argument into C data types when you call np.array([1, 2, 3, 4, 5], dtype=np.uint8) for example? Surely it must not do this giant if-else if-else block, but I haven't been able to find the implementation.

@jaimefrio (Member):

Does it also fail if array is not typed, i.e. is not explicitly declared as an ndarray? What we are dealing with here has to be a syntax issue: I cannot believe there isn't a way to do what we want...

As for the C version of this, it uses a gigantic switch statement, not pretty either...

@gfyoung (Contributor, Author) commented Dec 23, 2015

I think I tried something similar to that, and then Cython said it could not find a suitable method. I believe this is what you mean (see below), correct?

cdef randint_array(rk_state *state, ndarray[npy_integer, ndim=1] flat_array,
                   long long lo, unsigned long long diff, object lock):
...

def randint(self, lo, hi=None, dtype=None):

    # Instead of cdef ndarray array, do this
    array = np.empty(size, dtype)

@gfyoung (Contributor, Author) commented Dec 24, 2015

@jaimefrio : I came up with an entirely different implementation for the array just now (see commit history). It seems to work (I did some random randint calls with different types), and Travis is happy. How does it look? My concern right now is that I had to remove the nogil part (I can't add elements to the array with nogil), which could affect the behavior of randint under multi-threaded conditions.

@jaimefrio (Member):

You would need to test it, but I think that is going to be Python slow...

I remembered that there is an example in scipy.ndimage.label of the contortions you have to go through to get a C type specialized function based on the dtype of a numpy array. The following compiles, but I haven't tested it any further:

ctypedef fused npy_integers:
    char
    short
    int
    long
    long long
    unsigned char
    unsigned short
    unsigned int
    unsigned long
    unsigned long long

cdef void fill_array(rk_state *internal_state, object lock, long lo,
                unsigned long diff,
                npy_integers[::1] flat_array,
                npy_intp length):
    cdef npy_intp i
    with lock, nogil:
        for i from 0 <= i < length:
            flat_array[i] = lo + <long>rk_interval(diff, internal_state)

def get_fill_function(ndarray[npy_integers] array):
    return <npy_intp><void *>fill_array[npy_integers]

ctypedef void (*fill_array_t)(rk_state*, object, long, unsigned long,
                              void*, npy_intp)

And then inside the randint function you call this as:

cdef fill_array_t fill_function
...
array = <ndarray>np.empty(size, dtype)
fill_function = <fill_array_t><void *><npy_intp>get_fill_function(array.take([0]))
length = PyArray_SIZE(array)
fill_function(self.internal_state, self.lock, lo, diff,
              PyArray_DATA(array), length)
return array

I personally think this approach is ludicrous, especially those magic castings to get the function pointer, and would not like to see something like this creeping into numpy, but would like to hear from others.

I think the proper way of doing this would be to have a small C module, probably using NumPy's templating syntax (look for a *.c.src file in the numpy codebase for an example) defining all the specialized array filler functions, plus a getter function that dispatches on the dtype's num attribute.

@gfyoung (Contributor, Author) commented Dec 24, 2015

Ah, yes, speed is also an issue here (the slow-down is noticeable already for a 4096 * 4096 array). It seems like "simple" solutions just won't cut it here, will they? 😄

I think both options are worth considering (especially if they give flexibility without compromising speed), though I am not entirely sure how to appropriately set up the *.c.src file and where to place it. I can attempt the scipy implementation first as a separate commit and then do the same with the *.c.src implementation.

@gfyoung (Contributor, Author) commented Dec 25, 2015

I gave your scipy implementation a shot, and it worked quite well (I had to make several small modifications, but that's it). There was no noticeable performance decrease like I noticed with my own implementation, which is a good sign.

I gave the other implementation some thought, and if I'm not mistaken from your description, it seems like we are disguising the giant if...else as a getter function that returns the appropriate randint_array function (using nomenclature from my commit) depending on the num attribute of dtype. As convoluted as the scipy implementation is, I think I would side with it. First, it is quite compact, relatively speaking. Also, I don't see a reason to dispute something that has worked for a close cousin! 😄

From the PR description, attaching a dtype to randint is only part of the change. Thus, I agree that we should wait to see if anyone has anything to add. Otherwise, I will go ahead and try to do the same thing for rand as well as tomaxint.

@gfyoung (Contributor, Author) commented Dec 25, 2015

I believe I added it because my C compiler was raising a warning telling me to put it there to "speed things up".

@jaimefrio (Member):

The stated reason to use Cython in a situation like this is that it improves maintainability, because many people are not comfortable with C, but don't mind Cython. I don't think that really applies here:

  • there is a confusing extra level of indirection, with the getter function that takes a typed array to select the function that gets called.
  • you end up having to deal with function pointers anyway: if you understand that, then you shouldn't have problems making sense of C.
  • what kills me is that the function pointer has to be passed around as an integer value for things to work. It's probably safe, because you wisely use npy_intp. But having to cast a number to a void pointer to a function pointer just seems wrong to me.

If I did not make it clear enough, I don't like this approach. Not at all. ;-) But if others think that it is not that bad, I won't fight back too hard.

What do you think, @charris, @njsmith?

@charris (Member) commented Dec 25, 2015

I think this is the wrong place to start. Note that the Mersenne twister used by numpy produces 32-bit random integers stored in long. Hence in randomkit.c there is an rk_long function to fill out the full long by concatenating two 32-bit random samples whenever long is larger than 32 bits. One consequence is that the random integer stream is not reproducible; it differs between architectures with 32-bit longs and 64-bit longs. I think a better place to start is having defined 32-bit and 64-bit random integer streams using different functions. The current randint can then be adjusted to use the appropriate one to preserve backward compatibility. We could then provide new functions for the 32-bit and 64-bit streams; we just need to decide on the names and whether to specify half-open or closed intervals. Whether or not to have a specialized npy_intp type is an open question.
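
The 32-bit chaining described above can be sketched in plain Python (an illustration, not randomkit's actual code): a 64-bit value is built by concatenating two 32-bit samples from the generator.

```python
import random

# Build a 64-bit integer from two 32-bit generator outputs, in the spirit
# of rk_long-style code on platforms where long is wider than the
# twister's native 32-bit output.
rng = random.Random(0)

def draw32():
    return rng.getrandbits(32)

def draw64():
    hi = draw32()            # upper 32 bits
    lo = draw32()            # lower 32 bits
    return (hi << 32) | lo

x = draw64()
assert 0 <= x < 2 ** 64
```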

@rkern Thoughts?

@charris (Member) commented Dec 25, 2015

Small correction: the bit stream depends on the difference between lo and hi. For differences <= 0xffffffff, the random streams are the same, and greater differences are only supported on platforms with long > 32 bits. So there is no compatibility problem between platforms.

@gfyoung (Contributor, Author) commented Dec 25, 2015

@charris : Indeed, see #6824

@charris (Member) commented Dec 25, 2015

@gfyoung I think adding a dtype may be more than needed. But in any case, I think you do want efficient integer streams from mtrand, perhaps in different ranges. Note that pulling off anything but longs from the state can lead to alignment problems, so everything needs to be constructed from longs. But these things should all be new, just leave the current functions as they are so there are no compatibility questions.

@charris (Member) commented Dec 25, 2015

Note, I think mtrand.c is the place to start.

@gfyoung (Contributor, Author) commented Dec 25, 2015

@charris : Sorry, I really don't follow what you just said. Could you clarify? How does this relate to dtype and the issue I am trying to address here?

@charris (Member) commented Dec 25, 2015

I think you should forget about dtypes for a while and study the functions in mtrand.c.

@rkern (Member) commented Dec 25, 2015

@charris mtrand.c is generated from the Cython mtrand.pyx. No one should look at that.

@charris (Member) commented Dec 25, 2015

@rkern, oops, meant randomkit.c.

@gfyoung (Contributor, Author) commented Dec 25, 2015

@charris : Have you looked at #6824? randomkit.c is pretty much all I have been looking at there. Also, to restate, I'm still not seeing how this discussion fits within this PR. Unless someone can explain, the discussion about bit streams seems to me to be beyond the scope of this PR and would fit better in #6824.

@rkern (Member) commented Dec 26, 2015

You are trying to solve what is essentially the same missing feature with three separate PRs. @charris's comments are relevant to the underlying feature if not the one part you are implementing in this PR. In particular, he is suggesting that this particular prong of your attack on this problem is probably not going to be fruitful.

@rkern (Member) commented Dec 26, 2015

@charris Yes, I think adding rk_uint32 and rk_uint64 would be a good place to start.

@gfyoung (Contributor, Author) commented Dec 26, 2015

Alright: I think there is a discrepancy between what I understand about random integer generation and what you guys understand about it. The way I see these PRs, they are three separate issues:

  1. Overflow with 32-bit long
  2. Generating integers of different integer types
  3. Refactoring random_integers to accept a slightly broader range of arguments

Now, you guys are of the view that there is some fundamental issue with random number generation that all of my PRs are somehow not addressing. Let me restate again: I don't really understand what you guys are talking about. Thus, I would appreciate it (for the sake of my own understanding, and so I can work on these PRs properly instead of barking up the wrong tree for another week) if someone could lay out exactly what is going on and address these questions:

  1. What exactly is the issue underlying random integer generation?
  2. How are these seemingly disparate issues that I am addressing related to it, and why is each of these PRs not addressing the core issue?

I would love to help improve random number generation here, but being told something doesn't work without much explanation makes it difficult. Thanks.

@njsmith (Member) commented Dec 26, 2015

I haven't had the bandwidth to follow this in detail, was just trying to respond quickly since I was pinged, but I think my comment above is in the correct thread, since it's suggesting one possible strategy for handling smaller-width dtypes in randint etc.

@gfyoung (Contributor, Author) commented Dec 26, 2015

@njsmith: See my comment above. Also, judging from the numbers you suggested, isn't that suggestion tuned to the request in the email?

@charris (Member) commented Dec 26, 2015

@njsmith You need to fetch multiples of 32 bits in any case and discard the extra in order to keep the state of the rng sane. If you have internal buffers that are not cleared when the rng is reseeded, the stream is not reproducible.

@njsmith (Member) commented Dec 26, 2015

I'm sorry, no idea what you folks are saying :-) My suggestion is that one crude but nonetheless possibly attractive strategy for implementing dtype= support is:

def randint(low, high=None, size=(), dtype=int):
    if dtype == np.dtype(int):
        return _randint_core(low, high, size)
    total = np.prod(size)
    out = np.empty(total, dtype=dtype)
    n_full = total // BUFSIZE
    for i in range(n_full):
        out[i * BUFSIZE : (i + 1) * BUFSIZE] = _randint_core(low, high, BUFSIZE)
    out[n_full * BUFSIZE :] = _randint_core(low, high, total % BUFSIZE)
    return out.reshape(size)

It's not quite as efficient as all the fused-type style approaches, but with appropriate choice of BUFSIZE it might do pretty well, and is massively simpler. (It comes even closer to performance parity if we add support for _randint_core to write to an appropriately dtyped out= buffer, so as to allow the loop here to allocate a single bounce buffer and re-use it instead of allocating a new one on each call.)

@charris (Member) commented Dec 26, 2015

@njsmith Of course you can cast using a buffer; that is a detail. But from my point of view we need to be able to produce 64-bit random integers on platforms with 32-bit longs. Hence, first we need rk_interval32 and rk_interval64, which might as well be based on rk_uint32 and rk_uint64. That is simple enough and provides a basis for whatever we want to do afterwards. We will probably want to produce byte types, etc., using 32-bit chunked conversions. I suggest separate functions for random_int8, random_int16, random_int32, and random_int64, as they should each be implemented somewhat differently and, due to backward compatibility concerns, we only get one chance. I chose those functions instead of the rand*** versions as it would be nice to get the full range (closed intervals). The unsigned versions can be gotten with a view and proper choice of bounds. Other versions, say the current long (int), can be gotten by assigning the appropriate specific version depending on platform.
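
The "unsigned via a view" trick mentioned above can be sketched as follows (illustrative only): draw signed integers over the full bit width, then reinterpret the same bytes as unsigned.

```python
import numpy as np

# Pretend these came from a signed full-width generator; the view
# reinterprets the bit patterns as unsigned without copying.
signed = np.array([-1, -2**63, 2**63 - 1], dtype=np.int64)
unsigned = signed.view(np.uint64)

assert unsigned[0] == 2**64 - 1   # all-ones bit pattern of -1
assert unsigned[1] == 2**63       # bit pattern of INT64_MIN
assert unsigned[2] == 2**63 - 1   # non-negative values are unchanged
```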

I think the important thing at this point is to decide what we want and lay out the project.

@njsmith (Member) commented Dec 26, 2015

Oh sure, we need to sort out the basic int32/int64 generation routines too (I thought that was more #6824 territory but like I said I haven't followed this whole discussion closely). I was narrowly commenting on the discussion in this thread that was struggling to find a good strategy for handling narrower dtypes without adding a bunch of nasty templated code.

@homu (Contributor) commented Dec 28, 2015

☔ The latest upstream changes (presumably #6885) made this pull request unmergeable. Please resolve the merge conflicts.

@gfyoung (Contributor, Author) commented Dec 28, 2015

@homu : I'm actually going to refrain from rebasing until we can figure out exactly what's going on with this random integer generation thing. It may require just restarting from scratch given the refactoring I did in #6885.

@jaimefrio (Member):

@homu is a bot, but I'm sure he appreciates being replied to. ;-)

@gfyoung (Contributor, Author) commented Dec 29, 2015

Whoops! Did not know that. 😄

Though the comment does seem relevant nonetheless given the discussion going on.

@charris (Member) commented Jan 1, 2016

@gfyoung Closing this in favor of #6910. Thanks for getting the ball rolling.

@charris charris closed this Jan 1, 2016
@gfyoung gfyoung deleted the rand_dtype branch January 1, 2016 21:29