Cythonize _update_dict #11874

Closed
jakirkham wants to merge 5 commits from the cythonize_update_dict branch

Conversation

jakirkham
Contributor

Have noticed OpenBLAS's threadpool spin down and spin up in seconds or less when running _update_dict from dict_learning, which is mostly BLAS function calls. As creating and destroying threads is expensive, it's important to make sure that BLAS implementations (like OpenBLAS) maintain their threadpool between BLAS calls.

Thus this rewrites _update_dict in straight Cython (getting as close to C as possible). This should help make sure the glue code stays out of the way while BLAS does the heavy lifting.

sys.stdout.flush()
elif verbose:
print("Adding new random atom")
dictionary[:, k] = random_state.randn(n_features)
Member

You're no longer doing normally distributed sampling, are you?

Contributor Author

You're right. We need some sort of transform. Any that you would recommend?

Contributor Author

Was looking at that previously. Seems they don't really expose that. Raised issue ( numpy/numpy#11764 ) on this point. Interestingly this has come up before. Guess that's how the uniform random distribution code came to be in cd_fast.pyx. Perhaps we could vendor randomkit for now?

As a "simple" thought, we could use the inverse of the normal distribution's CDF to map uniformly distributed numbers to normally distributed ones. This would require some implementation of erfi (or better yet ndtri). There is a Cython interface to these in SciPy, but it requires 0.19.0 at least. Perhaps we could vendor the relevant code (from Netlib Cephes)?

Contributor Author

Have included an implementation of the Marsaglia polar method (a variant of the Box–Muller transform), which appears to be the same thing that NumPy's vendored randomkit is doing. Interestingly the implementation included here appears to be ~2x faster than NumPy's.
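
Roughly, the method looks like the sketch below (a self-contained illustration rather than the code in this PR; libc's rand() stands in for a proper reentrant generator):

```cython
from libc.math cimport log, sqrt
from libc.stdlib cimport rand, RAND_MAX


cdef inline double _uniform_open() nogil:
    # Uniform on (0, 1]; rand() is only a placeholder generator here.
    return (rand() + 1.0) / (<double>RAND_MAX + 1.0)


cdef double polar_normal() nogil:
    # Marsaglia polar method: draw points uniformly in the square
    # [-1, 1] x [-1, 1] and accept those strictly inside the unit circle.
    cdef double u, v, s
    while True:
        u = 2.0 * _uniform_open() - 1.0
        v = 2.0 * _uniform_open() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:
            break
    # The paired value v * sqrt(-2 * log(s) / s) is also standard normal
    # and could be cached to halve the number of iterations.
    return u * sqrt(-2.0 * log(s) / s)
```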

Member

Speed improvements on numpy's random library aren't that interesting: their version compatibility assurances force them to keep sub-optimal versions

@anton-malakhov commented Aug 22, 2018

if you guys don't need this numerical binary compatibility with standard NumPy's random module, do you want to try our mkl_random instead? It's available on all the channels and you can try `import mkl_random as rnd` to make it an optional drop-in replacement for numpy.random.
@oleksandr-pavlyk

Contributor Author

Is there a Cython or C/C++ header that could be used? The main goal in this PR is to move as much of this function as possible into non-GIL-holding Cython code to maximize performance.

@jakirkham force-pushed the cythonize_update_dict branch from f0729a8 to 846e32d on August 21, 2018 06:44
@agramfort
Member

@jakirkham how much speed up do you get here? Isn't the sparse coding the bottleneck more than the dict update?

@rth
Member

rth commented Aug 21, 2018

As creating and destroying threads is expensive, it's important to make sure that BLAS implementations (like OpenBLAS) maintain their threadpool between BLAS calls.

This is an issue that generally affects a lot of scientific python code, I'm not sure that rewriting everything in low level Cython is the right solution. The same point was for instance mentioned in #10744 for K-Means.

Are you sure BLAS threading actually improves performance here? As you saw, loky (used in joblib) will have some mechanism to temporarily disable BLAS threading at runtime in the future (https://github.com/tomMoral/loky/pull/135). Otherwise, reusing the same thread pool including the BLAS one might be possible with smp or tbb. If I understood @anton-malakhov's scipy talk correctly, both might only require loading the corresponding Python module at startup. I would be curious to know if it would improve performance in this case.

@jakirkham
Contributor Author

Have answered comments inline below. Sorry for the length.

how much speed up do you get here?

Will update once we have addressed Joel's point about the random distribution drawn from.

Using the sparse coding is the bottleneck more than the dict update no?

Not for our workload.

They both took roughly half the time originally. Due to various PRs submitted here to remove unneeded checks and make other improvements, am happy to say it got to ~33% sparse coding and ~66% dictionary learning. So thanks everyone for helping with this work. A fair bit of sparse coding's time at this point goes to some sanity checks and a copy, which are unfortunately difficult to remove.

We have since made changes to our workload to handle some newer/different data. The time is now split ~10% sparse coding and ~90% dictionary learning. So any speed up we can get here helps. Below is a flame graph showing this. The line of code shown varies, but the percentage remains roughly the same. By comparison the aqua blue block on the top right is enet_path, which is responsible for coordinate descent in sparse coding and is taking ~2.5%.

[flame graph screenshot, 2018-08-21]

This is an issue that generally affects a lot of scientific python code, I'm not sure that rewriting everything in low level Cython is the right solution.

That's fair. Though I'm not proposing we rewrite everything. Only the slow part. 🙂

Are you sure BLAS threading actually improves performance here?

A few weeks back I played with different configurations and determined the best performer was BLAS with parallel threads and sequential mode for Joblib (used for sparse coding). This was based on the runtime of batches, which was in the tens of minutes. So it was pretty easy to see what was helpful and what wasn't. Unfortunately I didn't keep good notes on this.

As you saw, loky (used in joblib) will have some mechanism to temporarily disable BLAS threading at runtime in the future.

Yeah, that's a good idea in some cases. It would be good if a user could set a flag to disable it if needed.

Otherwise, reusing the same thread pool including the BLAS one might be possible with smp or tbb. ... I would be curious to know if it would improve performance in this case.

Not sure how much it would matter actually (assuming it could be trivially configured). The update dictionary step is fundamentally sequential. So outside of BLAS there isn't much parallelism that we could exploit. Sparse coding could be parallelized (already has Joblib support), but that's not the slow step here and it seemed to hurt when I last tried.

There are probably different algorithms that could parallelize the problem approached here. Suggestions on this front would certainly be welcome.

@jakirkham force-pushed the cythonize_update_dict branch 25 times, most recently from 5e0c6be to e2dd7b8 on August 23, 2018 19:18
@jakirkham force-pushed the cythonize_update_dict branch 11 times, most recently from e70157a to d386011 on August 27, 2018 16:44
@jakirkham force-pushed the cythonize_update_dict branch from d386011 to b244442 on September 6, 2018 04:10
Lower `_update_dict` into Cython (as close to C as possible). As the
big performance pieces are in the BLAS and not in this code, the main
objective here is to cut down the amount of time spent outside those
BLAS functions, since such gaps can cause the BLAS to spin down its
threadpool only to spin it up again fractions of a second later, which
is quite costly. This avoids that by calling directly into the BLAS C
API with contiguous arrays back to back. Only little bits of Cython are
used to string these together.
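
For illustration, the general shape of such a call (a sketch with
assumed names and shapes, not the actual code in this commit):

```cython
# R <- R + alpha * d_k * c_k^T computed entirely inside BLAS via SciPy's
# Cython bindings; R is Fortran-ordered, d_k has length R.shape[0] and
# c_k has length R.shape[1].
from scipy.linalg.cython_blas cimport dger

cdef void _rank_one_update(double[::1, :] R, double[::1] d_k,
                           double[::1] c_k, double alpha) nogil:
    cdef int m = <int>R.shape[0]
    cdef int n = <int>R.shape[1]
    cdef int lda = m
    cdef int inc = 1
    dger(&m, &n, &alpha, &d_k[0], &inc, &c_k[0], &inc, &R[0, 0], &lda)
```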

The GIL is released for the bulk of the computation. No-GIL operations
are employed where possible. This includes random number generation and
any array operations. The GIL is only released once and reacquired once.
Everything else that requires the GIL is moved outside that section and
information needed on the GIL side is passed from the No-GIL side. This
minimizes contention over the GIL with other threads.
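
A toy example of the release-once/reacquire-once shape (names invented
for illustration):

```cython
def count_zero_atoms(double[::1, :] dictionary):
    # All Python-level setup happens here, on the GIL side.
    cdef Py_ssize_t n_features = dictionary.shape[0]
    cdef Py_ssize_t n_components = dictionary.shape[1]
    cdef Py_ssize_t i, k, n_zero = 0
    with nogil:  # the GIL is released exactly once...
        for k in range(n_components):
            for i in range(n_features):
                if dictionary[i, k] != 0.0:
                    break
            else:
                n_zero += 1
    # ...and reacquired exactly once; anything needing Python (printing,
    # falling back to NumPy's RNG, etc.) uses what was collected above.
    return n_zero
```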

Threading is maximized by using BLAS operations that will make the most
use of threads possible. Smaller BLAS operations are merged into larger
ones where possible. Wherever possible, `for`-loops are converted into
some form of BLAS operation, with the loop being pushed into BLAS. This
includes even simple things like finding the sum of the squared
residuals or zeroing out arrays.
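
Two tiny examples of the kind of loop that gets pushed into BLAS
(sketches with assumed names, not the code in this commit):

```cython
from scipy.linalg.cython_blas cimport ddot, dscal

cdef double _sum_of_squares(double[::1] residual) nogil:
    # sum_i r_i ** 2 as a dot product of the residual with itself.
    cdef int n = <int>residual.shape[0]
    cdef int inc = 1
    return ddot(&n, &residual[0], &inc, &residual[0], &inc)

cdef void _zero_out(double[::1] x) nogil:
    # Zero the array inside BLAS by scaling it by 0 (assumes finite values).
    cdef int n = <int>x.shape[0]
    cdef int inc = 1
    cdef double zero = 0.0
    dscal(&n, &zero, &x[0], &inc)
```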

Arrays are coerced into contiguous form, either by verifying that they
match the expected stride or by copying them via NumPy if necessary.
Striding is chosen so that movement along arrays in BLAS takes the
contiguous path as often as possible. This means arrays may be C or
Fortran ordered depending on what is most efficient for fast traversal
and good use of the cache.
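
For instance, the coercion step on the GIL side might look like this
sketch (illustrative names only):

```cython
import numpy as np


def _as_fortran_f64(arr):
    cdef double[::1, :] view
    # np.asfortranarray copies only when the layout or dtype does not
    # already match.
    arr = np.asfortranarray(arr, dtype=np.float64)
    # Assigning to a Fortran-contiguous typed memoryview verifies the
    # stride; a mismatched layout would raise ValueError here.
    view = arr
    return arr
```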

For convenience, fused types are employed liberally, including to bind
BLAS functions of different precisions together. This makes it easy for
the same code to work on single or double precision with only minor
specialization as required. Overall this makes for a nice blend of
static typing that remains relatively flexible.
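
A sketch of what binding the two precisions together looks like (the
helper name is invented; `floating` is Cython's built-in float/double
fused type):

```cython
from cython cimport floating
from scipy.linalg.cython_blas cimport dnrm2, snrm2

cdef floating _nrm2(floating[::1] x) nogil:
    # The `floating is float` test is resolved at compile time, so each
    # specialization calls the right-precision BLAS routine directly.
    cdef int n = <int>x.shape[0]
    cdef int inc = 1
    if floating is float:
        return snrm2(&n, &x[0], &inc)
    else:
        return dnrm2(&n, &x[0], &inc)
```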

Memoryviews are used heavily within functions and in the signatures of
all internal functions. Construction of memoryviews or their slices is
avoided as much as possible for performance reasons. Instead, direct
indexing to a specific value or address is preferred wherever possible
to maximize performance.

Function inlining is used heavily. Many of the functions have but a
couple of lines or less. While it is convenient to keep these as
separate functions for readability/comprehension, there is no point in
taking a performance hit from a jump to reach functions this simple.
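
For example (illustrative only):

```cython
# A helper this small is marked `inline` so the C compiler can fold it
# into its callers instead of emitting a call.
cdef inline double _squared(double x) nogil:
    return x * x
```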

To allow random numbers sampled from the normal distribution to be
generated without the GIL quickly, the Marsaglia polar method (an
optimization on the Box–Muller transform) is used. This is the same
method that NumPy's vendored copy of RandomKit uses. The performance of
our implementation ends up being notably faster than NumPy's. Though
this doesn't appear to be surprising, it is a nice performance bump to
have.

To simplify the code a bit and cut down on repeated pointer arithmetic,
store frequently used addresses in memoryviews as pointers. This is
helpful for working with BLAS functions, which expect pointers as input.
Plus it ensures that we compute each needed address exactly once, as
opposed to once or more per BLAS function that uses it. This also
improves readability for developers, as one can see that a particular
atom or sparse code is being used without having to think about what
the indexing means in each function. It also simplifies the C code
generated by Cython, as the pointer arithmetic is handled once and the
pointer is simply reused everywhere else, which should make it easier
for the compiler to do its work and easier for developers to reason
about what is happening in the C code.
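
Schematically (a sketch with invented names, not the code in this
commit):

```cython
from scipy.linalg.cython_blas cimport dnrm2, dscal

cdef void _rescale_atom(double[::1, :] dictionary, int k) nogil:
    cdef int n_features = <int>dictionary.shape[0]
    cdef int inc = 1
    cdef double norm, inv_norm
    # Compute the address of atom k exactly once; every BLAS call below
    # reuses the same pointer.
    cdef double* atom = &dictionary[0, k]
    norm = dnrm2(&n_features, atom, &inc)
    if norm != 0.0:
        inv_norm = 1.0 / norm
        dscal(&n_features, &inv_norm, atom, &inc)
```
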
Borrow a trick from C++ to convert runtime checks into compile-time
checks. The result is that a check that would otherwise run inside a
function's tight loop is made beforehand by dispatching to a
specialization of the function that bakes the result of the check into
its definition. Thus each check occurs only once at runtime, before the
loops run, allowing the loops to stay tight and focused on only the
relevant steps. Programmatically the code reads much the same, which
keeps it in a nice form for developers to work with.
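
A rough illustration of the dispatch (names invented for the example):

```cython
cdef void _accumulate_plain(double[::1] out, double[::1] x) nogil:
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        out[i] += x[i]

cdef void _accumulate_clipped(double[::1] out, double[::1] x) nogil:
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        if x[i] > 0.0:
            out[i] += x[i]

cdef void _accumulate(double[::1] out, double[::1] x, bint positive) nogil:
    # The `positive` check runs once here rather than on every iteration
    # of the tight loops above.
    if positive:
        _accumulate_clipped(out, x)
    else:
        _accumulate_plain(out, x)
```
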
@jakirkham force-pushed the cythonize_update_dict branch from b244442 to f663846 on September 6, 2018 04:11
@agramfort
Member

@jakirkham did you end up doing the benchmarks?

@jakirkham
Contributor Author

Not yet. Have tried to update the PR, but the tests are still failing. Probably need some guidance on how to fix them.

@amueller added the Needs Benchmarks, Needs work, and Performance labels on Aug 6, 2019
Base automatically changed from master to main January 22, 2021 10:50
@jakirkham
Contributor Author

Going to go ahead and close. Think there have been other changes to these files in the interim. So this approach likely needs to be reworked. Currently probably can't pick this up myself, but others are welcome to borrow from this work if it is useful.

@jakirkham closed this Aug 23, 2021
@jakirkham deleted the cythonize_update_dict branch August 23, 2021 04:16
Labels
cython, module:decomposition, Needs Benchmarks, Needs work, Performance