Cythonize _update_dict #11874

Closed
jakirkham wants to merge 5 commits from the cythonize_update_dict branch

Conversation

jakirkham
Contributor

Have noticed OpenBLAS's threadpool spin down and spin up in seconds or less when running _update_dict from dict_learning, which is mostly BLAS function calls. As creating and destroying threads is expensive, it's important to make sure that BLAS implementations (like OpenBLAS) maintain their threadpool between BLAS calls.

Thus this rewrites _update_dict in straight Cython (getting as close to C as possible). This should help make sure the glue code stays out of the way while BLAS does the heavy lifting.

sys.stdout.flush()
elif verbose:
print("Adding new random atom")
dictionary[:, k] = random_state.randn(n_features)
Member

You're no longer doing normally distributed sampling, are you?

Contributor Author

You're right. We need some sort of transform. Any that you would recommend?

Contributor Author

Was looking at that previously. Seems they don't really expose that. Raised issue ( numpy/numpy#11764 ) on this point. Interestingly this has come up before. Guess that's how the uniform random distribution code came to be in cd_fast.pyx. Perhaps we could vendor randomkit for now?

As a "simple" thought, we could use the inverse of the normal distribution's CDF to map uniformly distributed numbers to normally distributed ones. This would require some implementation of erfi (or better yet ndtri). There is a Cython interface to these in SciPy, but it requires 0.19.0 at least. Perhaps we could vendor the relevant code (from Netlib Cephes)?

Contributor Author

Have included an implementation of the Marsaglia polar method (a variant of the Box–Muller transform), which appears to be the same thing that NumPy's vendored randomkit is doing. Interestingly the implementation included here appears to be ~2x faster than NumPy's.
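
Roughly, the method looks like the sketch below (a self-contained illustration rather than the code in this PR; libc's rand() stands in for a proper reentrant generator):

```cython
from libc.math cimport log, sqrt
from libc.stdlib cimport rand, RAND_MAX


cdef inline double _uniform_open() nogil:
    # Uniform on (0, 1]; rand() is only a placeholder generator here.
    return (rand() + 1.0) / (<double>RAND_MAX + 1.0)


cdef double polar_normal() nogil:
    # Marsaglia polar method: draw points uniformly in the square
    # [-1, 1] x [-1, 1] and accept those strictly inside the unit circle.
    cdef double u, v, s
    while True:
        u = 2.0 * _uniform_open() - 1.0
        v = 2.0 * _uniform_open() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:
            break
    # The paired value v * sqrt(-2 * log(s) / s) is also standard normal
    # and could be cached to halve the number of iterations.
    return u * sqrt(-2.0 * log(s) / s)
```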

Member

Speed improvements on numpy's random library aren't that interesting: their version compatibility assurances force them to keep sub-optimal versions

@anton-malakhov commented Aug 22, 2018

if you guys don't need this numerical binary compatibility with standard NumPy's random module, do you want to try our mkl_random instead? It's available on all the channels and you can try `import mkl_random as rnd` to make it an optional drop-in replacement for numpy.random.
@oleksandr-pavlyk

Contributor Author

Is there a Cython or C/C++ header that could be used? The main goal in this PR is to move as much of this function as possible into non-GIL-holding Cython code to maximize performance.

@jakirkham force-pushed the cythonize_update_dict branch from f0729a8 to 846e32d on August 21, 2018 06:44
@agramfort
Member

@jakirkham how much speed up do you get here? Isn't the sparse coding the bottleneck more than the dict update?

@rth
Member

rth commented Aug 21, 2018

As creating and destroying threads is expensive, it's important to make sure that BLAS implementations (like OpenBLAS) maintain their threadpool between BLAS calls.

This is an issue that generally affects a lot of scientific python code, I'm not sure that rewriting everything in low level Cython is the right solution. The same point was for instance mentioned in #10744 for K-Means.

Are you sure BLAS threading actually improves performance here? As you saw, loky (used in joblib) will have some mechanism to temporarily disable BLAS threading at runtime in the future (https://github.com/tomMoral/loky/pull/135). Otherwise, reusing the same thread pool including the BLAS one might be possible with smp or tbb. If I understood @anton-malakhov's scipy talk correctly, both might only require loading the corresponding Python module at startup. I would be curious to know if it would improve performance in this case.

@jakirkham
Contributor Author

Have answered comments inline below. Sorry for the length.

how much speed up do you get here?

Will update once we have addressed Joel's point about the random distribution drawn from.

Using the sparse coding is the bottleneck more than the dict update no?

Not for our workload.

They both took roughly half the time originally. Due to various PRs submitted here to remove unneeded checks and make other improvements, am happy to say it got to ~33% sparse coding and ~66% dictionary learning. So thanks everyone for helping with this work. A fair bit of sparse coding's time at this point goes to some sanity checks and a copy, which are unfortunately difficult to remove.

We have since made changes to our workload to handle some newer/different data. The time is now split ~10% sparse coding and ~90% dictionary learning. So any speed up we can get here helps. Below is a flame graph showing this. The line of code shown varies, but the percentage remains roughly the same. By comparison the aqua blue block on the top right is enet_path, which is responsible for coordinate descent in sparse coding and is taking ~2.5%.

[flame graph screenshot, 2018-08-21]

This is an issue that generally affects a lot of scientific python code, I'm not sure that rewriting everything in low level Cython is the right solution.

That's fair. Though I'm not proposing we rewrite everything. Only the slow part. 🙂

Are you sure BLAS threading actually improves performance here?

A few weeks back I played with different configurations and determined the best performer was BLAS with parallel threads and sequential mode for Joblib (used for sparse coding). This was based on the runtime of batches, which was in the tens of minutes. So it was pretty easy to see what was helpful and what wasn't. Unfortunately I didn't keep good notes on this.

As you saw, loky (used in joblib) will have some mechanism to temporarily disable BLAS threading at runtime in the future.

Yeah, that's a good idea in some cases. It would be good if a user could set a flag to disable it if needed.

Otherwise, reusing the same thread pool including the BLAS one might be possible with smp or tbb. ... I would be curious to know if it would improve performance in this case.

Not sure how much it would matter actually (assuming it could be trivially configured). The update dictionary step is fundamentally sequential. So outside of BLAS there isn't much parallelism that we could exploit. Sparse coding could be parallelized (already has Joblib support), but that's not the slow step here and it seemed to hurt when I last tried.

There are probably different algorithms that could parallelize the problem approached here. Suggestions on this front would certainly be welcome.

@jakirkham force-pushed the cythonize_update_dict branch 25 times, most recently from 5e0c6be to e2dd7b8 on August 23, 2018 19:18
@jakirkham force-pushed the cythonize_update_dict branch 11 times, most recently from e70157a to d386011 on August 27, 2018 16:44
@jakirkham force-pushed the cythonize_update_dict branch from d386011 to b244442 on September 6, 2018 04:10
Lower `_update_dict` into Cython (as close to C as possible). As the
big performance pieces are in the BLAS and not in this code, the main
objective here is to cut down the amount of time spent outside those
BLAS functions, since such gaps can cause the BLAS to spin down its
threadpool only to spin it up again fractions of a second later, which
is quite costly. This avoids that by calling directly into the BLAS C
API with contiguous arrays back to back. Only little bits of Cython are
used to string these together.
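
For illustration, the general shape of such a call (a sketch with
assumed names and shapes, not the actual code in this commit):

```cython
# R <- R + alpha * d_k * c_k^T computed entirely inside BLAS via SciPy's
# Cython bindings; R is Fortran-ordered, d_k has length R.shape[0] and
# c_k has length R.shape[1].
from scipy.linalg.cython_blas cimport dger

cdef void _rank_one_update(double[::1, :] R, double[::1] d_k,
                           double[::1] c_k, double alpha) nogil:
    cdef int m = <int>R.shape[0]
    cdef int n = <int>R.shape[1]
    cdef int lda = m
    cdef int inc = 1
    dger(&m, &n, &alpha, &d_k[0], &inc, &c_k[0], &inc, &R[0, 0], &lda)
```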

The GIL is released for the bulk of the computation. No-GIL operations
are employed where possible. This includes random number generation and
any array operations. The GIL is only released once and reacquired once.
Everything else that requires the GIL is moved outside that section and
information needed on the GIL side is passed from the No-GIL side. This
minimizes contention over the GIL with other threads.
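
A toy example of the release-once/reacquire-once shape (names invented
for illustration):

```cython
def count_zero_atoms(double[::1, :] dictionary):
    # All Python-level setup happens here, on the GIL side.
    cdef Py_ssize_t n_features = dictionary.shape[0]
    cdef Py_ssize_t n_components = dictionary.shape[1]
    cdef Py_ssize_t i, k, n_zero = 0
    with nogil:  # the GIL is released exactly once...
        for k in range(n_components):
            for i in range(n_features):
                if dictionary[i, k] != 0.0:
                    break
            else:
                n_zero += 1
    # ...and reacquired exactly once; anything needing Python (printing,
    # falling back to NumPy's RNG, etc.) uses what was collected above.
    return n_zero
```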

Threading is maximized by using BLAS operations that will make the most
use of threads possible. Smaller BLAS operations are merged into larger
ones where possible. Wherever possible, `for`-loops are converted into
some form of BLAS operation, with the loop being pushed into BLAS. This
includes even simple things like finding the sum of the squared
residuals or zeroing out arrays.
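
Two tiny examples of the kind of loop that gets pushed into BLAS
(sketches with assumed names, not the code in this commit):

```cython
from scipy.linalg.cython_blas cimport ddot, dscal

cdef double _sum_of_squares(double[::1] residual) nogil:
    # sum_i r_i ** 2 as a dot product of the residual with itself.
    cdef int n = <int>residual.shape[0]
    cdef int inc = 1
    return ddot(&n, &residual[0], &inc, &residual[0], &inc)

cdef void _zero_out(double[::1] x) nogil:
    # Zero the array inside BLAS by scaling it by 0 (assumes finite values).
    cdef int n = <int>x.shape[0]
    cdef int inc = 1
    cdef double zero = 0.0
    dscal(&n, &zero, &x[0], &inc)
```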

Arrays are coerced into contiguous form, either by verifying that they
match the expected stride or by copying them via NumPy if necessary.
Striding is chosen so that movement along arrays in BLAS takes the
contiguous path as often as possible. This means arrays may be C or
Fortran ordered depending on what is most efficient for fast traversal
and good use of the cache.
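
For instance, the coercion step on the GIL side might look like this
sketch (illustrative names only):

```cython
import numpy as np


def _as_fortran_f64(arr):
    cdef double[::1, :] view
    # np.asfortranarray copies only when the layout or dtype does not
    # already match.
    arr = np.asfortranarray(arr, dtype=np.float64)
    # Assigning to a Fortran-contiguous typed memoryview verifies the
    # stride; a mismatched layout would raise ValueError here.
    view = arr
    return arr
```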

For convenience, fused types are employed liberally, including to bind
BLAS functions of different precisions together. This makes it easy for
the same code to work on single or double precision with only minor
specialization as required. Overall this makes for a nice blend of
static typing that remains relatively flexible.
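
A sketch of what binding the two precisions together looks like (the
helper name is invented; `floating` is Cython's built-in float/double
fused type):

```cython
from cython cimport floating
from scipy.linalg.cython_blas cimport dnrm2, snrm2

cdef floating _nrm2(floating[::1] x) nogil:
    # The `floating is float` test is resolved at compile time, so each
    # specialization calls the right-precision BLAS routine directly.
    cdef int n = <int>x.shape[0]
    cdef int inc = 1
    if floating is float:
        return snrm2(&n, &x[0], &inc)
    else:
        return dnrm2(&n, &x[0], &inc)
```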

Memoryviews are used heavily within functions and in the signatures of
all internal functions. Construction of memoryviews or their slices is
avoided as much as possible for performance reasons. Instead, direct
indexing to a specific value or address is preferred wherever possible
to maximize performance.

Function inlining is used heavily. Many of the functions have but a
couple of lines or less. While it is convenient to keep these as
separate functions for readability/comprehension, there is no point in
taking a performance hit from a jump to reach functions this simple.
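
For example (illustrative only):

```cython
# A helper this small is marked `inline` so the C compiler can fold it
# into its callers instead of emitting a call.
cdef inline double _squared(double x) nogil:
    return x * x
```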

To allow random numbers sampled from the normal distribution to be
generated without the GIL quickly, the Marsaglia polar method (an
optimization on the Box–Muller transform) is used. This is the same
method that NumPy's vendored copy of RandomKit uses. The performance of
our implementation ends up being notably faster than NumPy's. Though
this doesn't appear to be surprising, it is a nice performance bump to
have.

To simplify the code a bit and cut down on repeated pointer arithmetic,
store frequently used addresses in memoryviews as pointers. This is
helpful for working with BLAS functions, which expect pointers as input.
Plus it ensures that we compute each needed address exactly once, as
opposed to once or more per BLAS function that uses it. This also
improves readability for developers, as one can see that a particular
atom or sparse code is being used without having to think about what
the indexing means in each function. It also simplifies the C code
generated by Cython, as the pointer arithmetic is handled once and the
pointer is simply reused everywhere else, which should make it easier
for the compiler to do its work and easier for developers to reason
about what is happening in the C code.
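
Schematically (a sketch with invented names, not the code in this
commit):

```cython
from scipy.linalg.cython_blas cimport dnrm2, dscal

cdef void _rescale_atom(double[::1, :] dictionary, int k) nogil:
    cdef int n_features = <int>dictionary.shape[0]
    cdef int inc = 1
    cdef double norm, inv_norm
    # Compute the address of atom k exactly once; every BLAS call below
    # reuses the same pointer.
    cdef double* atom = &dictionary[0, k]
    norm = dnrm2(&n_features, atom, &inc)
    if norm != 0.0:
        inv_norm = 1.0 / norm
        dscal(&n_features, &inv_norm, atom, &inc)
```
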
Borrow a trick from C++ to convert runtime checks into compile-time
checks. The result is that a check that would otherwise run inside a
function's tight loop is made beforehand by dispatching to a
specialization of the function that bakes the result of the check into
its definition. Thus each check occurs only once at runtime, before the
loops run, allowing the loops to stay tight and focused on only the
relevant steps. Programmatically the code reads much the same, which
keeps it in a nice form for developers to work with.
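
A rough illustration of the dispatch (names invented for the example):

```cython
cdef void _accumulate_plain(double[::1] out, double[::1] x) nogil:
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        out[i] += x[i]

cdef void _accumulate_clipped(double[::1] out, double[::1] x) nogil:
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        if x[i] > 0.0:
            out[i] += x[i]

cdef void _accumulate(double[::1] out, double[::1] x, bint positive) nogil:
    # The `positive` check runs once here rather than on every iteration
    # of the tight loops above.
    if positive:
        _accumulate_clipped(out, x)
    else:
        _accumulate_plain(out, x)
```
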
@jakirkham force-pushed the cythonize_update_dict branch from b244442 to f663846 on September 6, 2018 04:11
@agramfort
Member

@jakirkham did you end up doing the benchmarks?

@jakirkham
Contributor Author

Not yet. Have tried to update the PR, but the tests are still failing. Probably need some guidance on how to fix them.

@amueller added the Needs Benchmarks, Needs work, and Performance labels on Aug 6, 2019
Base automatically changed from master to main January 22, 2021 10:50
@jakirkham
Contributor Author

Going to go ahead and close. Think there have been other changes to these files in the interim. So this approach likely needs to be reworked. Currently probably can't pick this up myself, but others are welcome to borrow from this work if it is useful.

@jakirkham closed this Aug 23, 2021
@jakirkham deleted the cythonize_update_dict branch August 23, 2021 04:16
Labels
cython, module:decomposition, Needs Benchmarks, Needs work, Performance