Cythonize _update_dict #11874
Conversation
```python
        sys.stdout.flush()
    elif verbose:
        print("Adding new random atom")
    dictionary[:, k] = random_state.randn(n_features)
```
You're no longer doing normally distributed sampling, are you?
You're right. We need some sort of transform. Any that you would recommend?
Was looking at that previously. Seems they don't really expose that. Raised issue ( numpy/numpy#11764 ) on this point. Interestingly this has come up before. Guess that's how the uniform random distribution code came to be in cd_fast.pyx. Perhaps we could vendor randomkit for now?
As a "simple" thought, we could use the inverse of the normal distribution's CDF to map uniformly distributed numbers to normally distributed ones. This would require some implementation of erfi
(or better yet ndtri
). There is a Cython interface to these in SciPy, but it requires 0.19.0 at least. Perhaps we could vendor the relevant code (from Netlib Cephes)?
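A minimal sketch of that idea at the Python level (the PR itself would need the Cython interface, but the transform is the same): `scipy.special.ndtri` is the inverse of the standard normal CDF, so feeding it uniform draws on (0, 1) yields approximately standard-normal draws.

```python
import numpy as np
from scipy.special import ndtri  # inverse of the standard normal CDF

rng = np.random.RandomState(0)
# Avoid exactly 0 and 1, where the inverse CDF diverges to +/- infinity.
u = rng.uniform(low=1e-12, high=1.0 - 1e-12, size=100_000)
z = ndtri(u)  # approximately standard-normal samples
```
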
Have included an implementation of the Marsaglia polar method (based on the Box–Muller transform), which appears to be the same thing NumPy's vendored randomkit is doing. Interestingly, the implementation included here appears to be ~2x faster than NumPy's.
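For reference, a pure-Python sketch of the Marsaglia polar method (the PR's actual implementation is in Cython; the names here are illustrative):

```python
import math
import random

def polar_normal_pair(rng=random):
    """Return two independent standard-normal samples via the
    Marsaglia polar method (a rejection variant of Box-Muller)."""
    while True:
        # Draw a point uniformly in the square [-1, 1) x [-1, 1).
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        # Accept only points strictly inside the unit circle (not the origin);
        # roughly pi/4 of candidate points are accepted.
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor

random.seed(0)
samples = [x for _ in range(50_000) for x in polar_normal_pair()]
```

Unlike plain Box–Muller, the polar method needs no trigonometric calls, which is one reason it tends to benchmark well in tight loops.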
Speed improvements on NumPy's random library aren't that interesting: their version-compatibility assurances force them to keep sub-optimal versions.
If you don't need numerical binary compatibility with standard NumPy's random module, do you want to try our mkl_random instead? It's available on all the channels, and you can `import mkl_random as rnd` to make it an optional drop-in replacement for numpy.random.
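The optional drop-in pattern being suggested might look like the following (hedged: `mkl_random` is a third-party package and may not be installed, hence the fallback):

```python
try:
    import mkl_random as rnd  # optional MKL-accelerated drop-in
except ImportError:
    import numpy.random as rnd  # fall back to stock NumPy

# Either module exposes the same sampling API, e.g.:
x = rnd.randn(1000)
```
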
@oleksandr-pavlyk
Is there a Cython or C/C++ header that could be used? The main goal in this PR is to move as much as possible of this function into non-GIL holding Cython code to maximize performance.
Force-pushed from f0729a8 to 846e32d.
@jakirkham how much speed-up do you get here? Isn't the sparse coding the bottleneck more than the dict update?
This is an issue that generally affects a lot of scientific Python code; I'm not sure that rewriting everything in low-level Cython is the right solution. The same point was mentioned in #10744 for K-Means, for instance. Are you sure BLAS threading actually improves performance here? As you saw, loky (used in joblib) will have a mechanism for temporarily disabling BLAS threading at runtime in the future (https://github.com/tomMoral/loky/pull/135). Otherwise, reusing the same thread pool, including the BLAS one, might be possible with SMP or TBB. If I understood @anton-malakhov's SciPy talk correctly, both might only require loading the corresponding Python module at startup. I would be curious to know if it would improve performance in this case.
Have answered comments inline below. Sorry for the length.
Will update once we have addressed Joel's point about the random distribution drawn from.
Not for our workload. They both took roughly half the time originally. Thanks to various PRs submitted here to remove unneeded checks and make other improvements, I'm happy to say it got to ~33% sparse coding and ~67% dictionary learning, so thanks everyone for helping with this work. A fair bit of sparse coding's remaining time is just sanity checks and a copy, which are unfortunately difficult to remove. We have since changed our workload to handle some newer/different data. The time is now split ~10% sparse coding and ~90% dictionary learning, so any speed-up we can get here helps. Below is a flame graph showing this; the line of code shown varies, but the percentage remains roughly the same. By comparison the aqua blue block on the top right is
That's fair. Though I'm not proposing we rewrite everything. Only the slow part. 🙂
A few weeks back I played with different configurations and determined the best performer was BLAS with parallel threads and sequential mode for Joblib (used for sparse coding). This was based on the runtime of batches, which was in 10s of minutes. So it was pretty easy to see what was helpful and what wasn't. Unfortunately I didn't keep good notes on this.
Yeah, that's a good idea in some cases. It would be good if a user could set a flag to disable it if needed.
Not sure how much it would matter actually (assuming it could be trivially configured). The update dictionary step is fundamentally sequential. So outside of BLAS there isn't much parallelism that we could exploit. Sparse coding could be parallelized (already has Joblib support), but that's not the slow step here and it seemed to hurt when I last tried. There are probably different algorithms that could parallelize the problem approached here. Suggestions on this front would certainly be welcome.
Force-pushed from 5e0c6be to e2dd7b8.
Force-pushed from e70157a to d386011.
Force-pushed from d386011 to b244442.
Lower `_update_dict` into Cython (as close to C as possible). As the big performance pieces are in the BLAS and not in this code, the main objective is to cut down the time spent outside those BLAS functions: gaps between calls can cause the BLAS to spin down its threadpool only to spin it up again fractions of a second later, which is quite costly. This avoids that by calling directly into the BLAS C API with contiguous arrays back to back, with only little bits of Cython used to string these together.

The GIL is released for the bulk of the computation. No-GIL operations are employed where possible, including random number generation and all array operations. The GIL is released once and reacquired once; everything else that requires the GIL is moved outside that section, and information needed on the GIL side is passed out from the no-GIL side. This minimizes contention over the GIL with other threads.

Threading is maximized by using BLAS operations that make the most use of threads possible. Smaller BLAS operations are merged into larger ones where possible, and `for`-loops are converted into some form of BLAS operation wherever possible, pushing the loop into BLAS. This includes even simple things like computing the sum of the squared residuals or zeroing out arrays.

Arrays are coerced into contiguous form, either by verifying that they match the expected stride or by copying them via NumPy if necessary. Striding is chosen so that movement along arrays in BLAS takes the contiguous path as often as possible. This means arrays may be C or Fortran order depending on what is most efficient for fast traversal and good use of the cache.

For convenience, fused types are employed liberally, including to bind BLAS functions of different precisions together. This makes it easy for the same code to leverage single or double precision as needed, with minor specialization as required.
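The direct-BLAS-call approach described above can be sketched from Python via SciPy's BLAS wrappers (the PR itself uses the Cython-level `scipy.linalg.cython_blas` bindings; this is just the Python-level analogue):

```python
import numpy as np
from scipy.linalg.blas import dgemm, dnrm2

# Fortran-ordered (column-contiguous) arrays let BLAS take its fast path
# without making internal copies.
A = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(4, 3))
B = np.asfortranarray(np.ones((3, 2), dtype=np.float64))

C = dgemm(alpha=1.0, a=A, b=B)             # C = A @ B, computed in BLAS
residual_norm = dnrm2(C.ravel(order="F"))  # Euclidean norm in one BLAS call
```

Keeping successive calls like these back to back, with no Python-level work in between, is what lets a threaded BLAS keep its threadpool warm.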
Overall this makes for a nice blend of static typing that remains relatively flexible. Memoryviews are used heavily within functions and in the definitions of all internal functions. Construction of memoryviews or their slices is avoided as much as possible for performance reasons; instead, direct indexing to a specific value or address location is preferred wherever possible.

Function inlining is used heavily. Many of the functions have only a couple of lines or fewer. While it is convenient to have these as separate functions for readability and comprehension, there is no point taking a performance hit on a jump to functions this simple.

To allow random numbers sampled from the normal distribution to be generated quickly without the GIL, the Marsaglia polar method (an optimization of the Box–Muller transform) is used. This is the same method that NumPy's vendored copy of RandomKit uses. Our implementation ends up notably faster than NumPy's; while not entirely surprising, it is a nice performance bump to have.
To simplify the code a bit and cut down on repeated pointer arithmetic, frequently used addresses into memoryviews are stored as pointers. This is helpful for working with BLAS functions, which expect pointers as input, and it ensures each needed address is computed exactly once instead of once or more per BLAS call. It also improves readability: one can see that a particular atom or sparse code is being used without having to decode the indexing in each function. Finally, it simplifies the C code Cython generates, since the pointer arithmetic is done once and the pointer reused everywhere else, which should make things easier both for the compiler and for developers reasoning about the generated C.
Borrow a trick from C++ to convert runtime checks into compile-time checks. A check that would otherwise run inside a function's tight loop is instead made beforehand, by dispatching a specialization of the function that bakes the result of the check into its definition. Each check thus occurs only once at runtime, before the loops, allowing the loops to stay tight and focused on only the relevant steps. The code still reads much the same, which keeps it in a nice form for developers to work with.
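In Python terms, the specialization trick described above amounts to hoisting a per-iteration check out of the loop by selecting a specialized implementation once up front (the names here are illustrative, not from the PR):

```python
def _scale_checked(alpha, x):
    # Naive version: the alpha != 1.0 check runs on every iteration.
    for i in range(len(x)):
        if alpha != 1.0:
            x[i] *= alpha

def _scale_general(alpha, x):
    # Specialization for alpha != 1.0: the loop stays tight.
    for i in range(len(x)):
        x[i] *= alpha

def _scale_noop(alpha, x):
    pass  # Specialization for alpha == 1.0: nothing to do.

def scale(alpha, x):
    # Dispatch once: the check happens a single time, outside the loop.
    impl = _scale_noop if alpha == 1.0 else _scale_general
    impl(alpha, x)

data = [1.0, 2.0, 3.0]
scale(2.0, data)
```

In Cython, fused types and compile-time constants let the compiler generate each specialization automatically, so the dispatch cost disappears entirely.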
Force-pushed from b244442 to f663846.
@jakirkham did you end up doing the benchmarks?
Not yet. Have tried to update the PR, but the tests are still failing. Probably need some guidance on how to fix them.
Going to go ahead and close. I think there have been other changes to these files in the interim, so this approach likely needs to be reworked. I probably can't pick this up myself currently, but others are welcome to borrow from this work if it is useful.
Have noticed OpenBLAS's threadpool spin down and spin up in seconds or less when running `_update_dict` from `dict_learning`, which is mostly BLAS function calls. As creating and destroying threads is expensive, it's important to make sure that BLAS implementations (like OpenBLAS) maintain their threadpool between BLAS calls. Thus this rewrites `_update_dict` in straight Cython (getting as close to C as possible). This should help make sure the glue code stays out of the way while BLAS does the heavy lifting.