Skip to content

API: provide examples of use of numpy random API via user stories #14778

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 of 8 tasks
mattip opened this issue Oct 25, 2019 · 33 comments
Open
3 of 8 tasks

API: provide examples of use of numpy random API via user stories #14778

mattip opened this issue Oct 25, 2019 · 33 comments

Comments

@mattip
Copy link
Member

mattip commented Oct 25, 2019

Gathered from gh-14604, gh-14517 and the discussions.

  • "To summarize I need to draw random ints of a given C type from continually changing ranges, either one-by-one or small batch-by-small batch."

  • Someone asked how to use random in a ufunc.

  • "An ideal API would allow projects like https://github.com/deepmind/torch-randomkit/tree/master/randomkit or numba to consume the code in NumPy without vendoring it."

  • "There are c++ applications which use boost::random, would be nice to be able to swap it for numpy.random."

  • "Using the existing distributions from Cython was a requested feature and an explicit goal, yes. There are users waiting for this." (mattip: but isn't this supported via Generator?)

  • "Numba would definitely appreciate C functions to access the random distribution implementations, and

  • have a side-project (numba-scipy) that is making the Cython wrapped functions in SciPy visible to Numba" (mattip: I think we do this after API: rearrange the cython files in numpy.random #14608 from _generator.so via these cdef extern declarations, but that requires the H file.

  • Someone who wants to write a new BitGenerator, i.e., the numpy/bitgenerator repo without needing to have the numpy code as a git submodule

Some of these are already be handled by gh-14604, but I am putting them here for completeness

Edit: turned into a checklist

@rkern
Copy link
Member

rkern commented Oct 25, 2019

"Using the existing distributions from Cython was a requested feature and an explicit goal, yes. There are users waiting for this." (mattip: but isn't this supported via Generator?)

No. In the context of the discussion this statement was drawn from, this was referring to the functions in distributions.c: efficient C-level functions for drawing numbers one-by-one without all of the overhead that the Generator methods have for cramming these numbers into numpy arrays. That's what people have been waiting for.

@mattip
Copy link
Member Author

mattip commented Nov 15, 2019

@rkern the way the code is currently structured in #14604 we do not ship distributions.c directly, rather we expose all the funcitons in _generator.pyx from the shared object _generator.*.so. If we wish to officially support them from C we should either add a _generator.pxd which will expose them via cimport _generator or ship a _generator.h header file and a link library for MSVC. I think a pxd file would be better, what do you think?

@bashtage
Copy link
Contributor

Coming from #14951, I think the C functions represent are a lot of the value add of a low-level interface. While some are pretty trival, many are not.

I don't think it adds much to maintenance since these functions are all used by Generator. IMO the simple thing to do is to always have the set of random_ functions exactly match the set required in Generator. Functions that are not in Generator should probably not be exposed, and if not used in Generator or RandomState, then they should be removed (don't think there are many if any like this).

Did you change the export of the distributions in #14604? I couldn't figure out where I might find them.

@mattip
Copy link
Member Author

mattip commented Nov 21, 2019

I removed unused functions and renamed some others more consistently (mainly gauss_zig -> normal) in #14604 which also

  • moved distributions.h to numpy/core/include/numpy/random/distributions.h
  • moved (only the needed) function declarations from distributions.pxd to mtrand.pyx and _generator.pyx
  • removed distributions.pxd

The functions are exported from the _generator c-extension shared object, so the issue is how best to access them from C/Cython/CFFI. xref #14954 (the first commit there shows there is a problem on windows)

@bashtage
Copy link
Contributor

The Windows .pyd files only export a single symbol, PyInit_generator. If this is fixable, then the numba examples will work well.

So that leaves Cython. I've always compiled the C source directly into Cython modules. Is there a reasonable alternative? Most examples that I found searching for Cython dll assume that you have the source of the DLL code, and don't actually rely on the DLL when the Cython extension module is built.

@mattip
Copy link
Member Author

mattip commented Nov 22, 2019

you have the source of the DLL code

The numpy code is only a click away. This has the added advantage of compiling the code directly, rather than calling into a DLL which not only adds runtime overhead but allows for compile-time optimizations.

@rkern
Copy link
Member

rkern commented Nov 22, 2019

We need to expose the functions in the DLL/SO so that numba can use them. They have made an unsupported hack into the mtrand DLL/SO, and we need to move them off onto something supported. Copying C source code from the numpy source dist doesn't help them.

@mattip
Copy link
Member Author

mattip commented Nov 22, 2019

The functions in distributions.h are now exposed from the _generator DLL/SO. Can you point to a numba issue or PR so we can help them work out how to do this moving forward?

@rkern
Copy link
Member

rkern commented Nov 22, 2019

I'm only reacting to @bashtage's comment: "The Windows .pyd files only export a single symbol, PyInit_generator. If this is fixable, then the numba examples will work well."

@mattip
Copy link
Member Author

mattip commented Nov 22, 2019

The export on windows was a bug that is fixed in gh-14954

@bashtage
Copy link
Contributor

bashtage commented Nov 22, 2019 via email

@rgommers
Copy link
Member

Since the last open PR was merged, this is probably the best issue for questions/comments. So here are some:

  • In new-basic-rngs there's some text to explain the bitgen struct. From that it's unclear what the difference is between next_uint64 and next_raw. It just refers to "raw value" but is that a different uint64 number than returned by next_uint64?
  • Naming in the now public header is still not optimal probably: s_binomial_t, some _zig names, some other cryptic abbreviations (_mvhg, _btpe)
  • The random C API docs still use private imports: cimport numpy.random._bit_generator and cimport numpy.random._generator
  • The See extending for examples link on that same C API page is broken
  • Does it make sense to expose duplicate functions with and without _standard? It only saves the user adding , 0.0, 1.0) to a function call (or outside of the function call) as far as I can tell. It's also unclear if the _standard functions are made available consistently (e.g. normal has with/without _standard, lognormal does not).
  • Formatting of functions in https://numpy.org/devdocs/reference/random/c-api.html is off, should look like https://numpy.org/devdocs/reference/c-api/array.html
  • The title is Cython API, still TBD if it should say C API I think?
  • Not quite sure about the "massaging" in https://numpy.org/devdocs/reference/random/extending.html#cffi. Should that be provided as functionality somewhere, or if not, should CFFI be dropped? It doesn't look very usable as is.
  • Also, more use of private _generator in that CFFI section.

@bashtage
Copy link
Contributor

Next raw is the next number using the natural number of bits the bg uses. It can be 32, 53, 63 or 64 in the range of generators I've considered. It is used to directly test the correctness against a reference implementation.

@bashtage
Copy link
Contributor

Most distributions don't have _standard versions. For example, there isn't a standard lognormal, or a standard chisquare, etc. Some of the standard functions are misnomer imo. For example, standard_t should really be just t, or perhaps students t. It isn't actually standard in any sense that is different from the definition of a t. Ditto for standard cauchy.

@bashtage
Copy link
Contributor

Cffi is pretty easy to use if you want a since function. The examples extending to numba are pretty simple. Mattis example is pretty complete which requires parsing the header file and expanding or replacing macros.

@rgommers
Copy link
Member

Most distributions don't have _standard versions

Yes, that's what I was commenting on. Whether something "is standard" is in the eye of the beholder I'd say. This is basically a default loc=0, scale=1 like in the scipy.stats distributions. It could be defined for all distributions here. It just looks odd to me now.

Some of the standard functions are misnomer imo. For example, standard_t should really be just t, or perhaps students t. It isn't actually standard in any sense that is different from the definition of a t. Ditto for standard cauchy.

Now would be the time to rename them. Once we ship 1.18.0 it becomes a lot harder.

@rgommers
Copy link
Member

Cffi is pretty easy to use if you want a since function. The examples extending to numba are pretty simple. Mattis example is pretty complete which requires parsing the header file and expanding or replacing macros.

I'm not sure I understand this comment. At the moment there's ~50 lines of boilerplate. Why couldn't we simply put that in a function so not every user has to copy-paste that every time she wants to use cffi?

@rkern
Copy link
Member

rkern commented Nov 29, 2019

Most of the standard_ functions are there because those special forms are used inside of other distributions where the efficiency does matter. The fact that they get used like is what picks them out as "standard". It's not arbitrary.

I think I named standard_t that because I thought t was too short, and since it didn't expose the mean or scale as parameters. Should have called it student_t, but we're way past that. That's the name at the Python level. 1.18.0 is irrelevant.

@rkern
Copy link
Member

rkern commented Nov 29, 2019

And I have no explanation for standard_cauchy except that I think, having done standard_t, I had mentally established a pattern with the other bell-shaped distributions that if they didn't expose the loc and scale that I'd call it standard_ to avoid misleading people.

@rgommers
Copy link
Member

Right indeed. For the names that match Python it is what it is, I missed that. 1.18.0 only matter for the new names of functions that don't have a Python equivalent.

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

The random C API docs still use private imports: cimport numpy.random._bit_generator and cimport numpy.random._generator
Also, more use of private _generator in that CFFI section.

For Cython, we could somehow extend the __init__.pxd so the pattern would be cimport numpy.random which would expose all the gory details. I tried to do this and it was messy, but maybe doable. However CFFI can only access the shared object/DLL, which is named _generator.*.so

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

The random C API docs still use private imports
The title is Cython API, still TBD if it should say C API I think

The top half of that document is Cython-specific, the bottom half builds on it to document the available C functions. Perhaps it could be divided into two pages

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

We should decide which of the points @rgommers raised are 1.18 blockers and break them out into separate issues to be resolved in the next couple of days.

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

Not quite sure about the "massaging" ... CFFI

This is standard stuff for CFFI. All the C wrapper libraries have a boilerplate problem: ctypes must define args and retype, Cython need pxd files, SWIG needs interface files, CFFI needs to pre-parse headers.

@bashtage
Copy link
Contributor

I think I named standard_t that because I thought t was too short, and since it didn't expose the mean or scale as parameters. Should have called it student_t, but we're way past that. That's the name at the Python level. 1.18.0 is irrelevant.

Should the C functions correct this? They could become random_student_t (or students_t) and random_cauchy.

@bashtage
Copy link
Contributor

I'm not sure I understand this comment. At the moment there's ~50 lines of boilerplate. Why couldn't we simply put that in a function so not every user has to copy-paste that every time she wants to use cffi?

Sorry, on the phone.

It should have said: CFFI is pretty easy if you only want to use one or a few functions.

See

https://github.com/numpy/numpy/blob/7548073e58e89b414ca999d666c19189291dc1e5/numpy/random/_examples/numba/extending_distributions.py

which is no longer in master (and isn't quite right, but easy to fix). There is no need for tons of boiler plate. TBH, I'm not sure why @mattip got rid of this simple example which is basic but follows the style pretty much every CFFI example on the internet.

Of course, NumPy could provide a components needed for CFFI interfacing, but this seems to me like committing support something that may not be worth the overhead since downstream might want to do things differently.

@bashtage
Copy link
Contributor

As for 1.18, I think only naming if essential, since changing it later is harder. I don't think it really is essential to rename some of the functions, but my vote would be

standard_t -> student_t
standard_cauchy -> cauchy
Remvoe all _zig since there are no other versions of these distributions

The one feature that I would really like, and I think is needed to complete the story, is to have something like npymath.lib for the C functions, perhaps npyrandom.lib. This would allow C/Cython to link to these functions without needing to get the C source form somewhere. For my, this would allow what will be a slimmed-down version of randomgen to become a dependent on the random code in NumPy instead of having an out-of-sync copy. This is a bigger ask though, and can definitely wait.

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

... numba/extending_distributions.py which is no longer in master

Must be a glitch in the way you are looking at the repo. The file is there, it is referenced in the docs but not tested since it is complicated to run.

CFFI is pretty easy if you only want to use one or a few functions

The example I provided shows how to use all the functions available in distributions.h, and is closer to how I have seen CFFI used in industry. I agree that many CFFI examples and tutorials "cheat" by redefining only one or two functions without parsing a header, but I wanted to provide a more complete and powerful example. It seems I missed the mark, the attention was drawn to the boilerplate and not the actual code. I will refactor the parsing into a separate file.

Remove all _zig since there are no other versions of these distributions

Then we would add _legacy to the "bare" versions needed for mtrand.pyx?

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

Then we would add _legacy to the "bare" versions needed for mtrand.pyx?

nvermind, that is wrong. Changing the names as suggested in gh-15007

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

The Generator.standard_exponential method has a zig keyword. This uses the random_standard_exponential_zig_fill and random_standard_exponential_fill functions from distributions.c. I will rename them to random_standard_exponential_fill and random_standard_exponential_inv_fill respectively, and drop random_standard_exponential_inv_fill from the documentation.

@mattip
Copy link
Member Author

mattip commented Nov 29, 2019

Done in #15007 (pending review and passing tests)

@jamesthomasgriffin
Copy link

The reference page: Cython API for random implies that it is possible to use a long list of functions "random_..." from Cython and points to Extending for examples of using these functions. However those examples only use the methods associated to bitgen_t.

I am unable to cimport any of those functions.

from numpy.random._generator cimport random_interval
^

test_cython.pyx:13:0: 'numpy\random_generator.pxd' not found

Perhaps "_generator.pxy" is required but is not present. I am able to generate random numbers from bitgen_t.next_double, following the Extending example above.

Would it be possible to clarify the documentation by including an example? Or have I misunderstood and those functions should be private, in which case could this be clarified?

Numpy/Python version information:

1.18.1 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)]

@mattip
Copy link
Member Author

mattip commented Jan 29, 2020

It seems the cython use case was indeed overlooked. There is an example using CFFI, but none for cython.

seberg pushed a commit that referenced this issue Mar 16, 2020
…API (gh-15463)

xref gh-14778

As pointed out in the comment by @jamesthomasgriffin, we did not include a pxd file to expose the distribution functions documented in the random c-api. This PR adds a c_distributions.pxd file that exposes them.

Squashed commits:

* BUG: add missing c_distributions.pxd to enable cython use of random C-API

* ENH, TST: add npyrandom library like npymath, test cython use of it

* BUG: actually prefix f-string with f

* MAINT: fixes from review, add _bit_generato_bit_generator.pxd

* STY: fixes from review

* BLD: don't use nprandom library for mtrand legacy build

* TST: WindowsPath cannot be used in subprocess's list2cmdline

* MAINT, API: move _bit_generator to bit_generator

* DOC: add release note about moving bit_generator

* DOC, MAINT: fixes from review

* MAINT: redo dtype determination from review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants