Skip to content

[MRG] Use Scipy cython BLAS API instead of bundled CBLAS #12732

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 1, 2019

Conversation

jeremiedbb
Copy link
Member

@jeremiedbb jeremiedbb commented Dec 6, 2018

First step to tackle #11638

This PR add a new module in utils, _cython_blas, which contains helpers for the BLAS functions shipped with scipy in scipy.cython_blas.

  • The scipy functions expect fortran aligned arrays. These helpers allow to specify the memory layout as it is in CBLAS, and can be used as drop in replacement in sklearn.
  • BLAS functions are not fused (e.g. sgemm and dgemm for float and double matrix matrix multiplication). These helpers are fused, sgemv and dgemv are replaced by _xgemv, which avoids the redundant
if floating is float:
    dot = sdot
else:
    dot = ddot

I still have to fill the docstrings.

For now the module only contains helpers for the BLAS functions which were already used in sklearn (+gemm because I'm using it in another PR :) ). We can add other BLAS functions as we need them.

This module comes with a test suite, which tests the helpers for all type/layout/transpose configurations.

I also added an example of use in sklearn pairwise_fast with the asum function, see sklearn/metrics/pairwise_fast.pyx and sklearn/metrics/setup.py.

I don't think I'll replace all occurences of CBLAS use in sklearn in this PR. I think it would be easier to do it in separate PRs. Removing the bundled CBLAS will only be possible once all replacement are done.

Finally, this PR requires an upgrade of scipy minimal version > 0.16, which seems to be on it's way (#12184)

@jeremiedbb jeremiedbb force-pushed the scipy-blas branch 2 times, most recently from be33b3c to 486cf22 Compare December 6, 2018 16:29
@jeremiedbb
Copy link
Member Author

Note that all the _xxxx_memview functions in the module are just wrappers to be able to tests the module with pytest.

@jeremiedbb
Copy link
Member Author

CI is failing because of cython version :(
cython does not support the const keyword for memoryview before version 0.28. Would it be considered to upgrade cython requirement ?

@rth
Copy link
Member

rth commented Dec 6, 2018

cython does not support the const keyword for memoryview before version 0.28. Would it be considered to upgrade cython requirement ?

Yes, we did discuss this in #10624 and I think it should be possible. Are you sure this is not affected by the issues from #10624 (comment) though?

@jeremiedbb
Copy link
Member Author

Not I don't use const with fused typed memview. But after all it's not necessary, I found a workaround :)

@rth
Copy link
Member

rth commented Dec 6, 2018

Maybe it could be worth checking if some discussion about such wrappers did take place at scipy and if not open an issue about it? From a user perspective, we might indeed want 1 cython function for matrix-matrix multiplication with BLAS irrespective of the C/Fortran alignment or dtype, not 4...

@jeremiedbb
Copy link
Member Author

There is an old open issue in scipy scipy/scipy#4516 for that. It's been active again this summer, so there may be some progress in that way.

@jeremiedbb jeremiedbb force-pushed the scipy-blas branch 2 times, most recently from fa78d95 to 2aa4bcb Compare December 14, 2018 14:08
@jeremiedbb
Copy link
Member Author

jeremiedbb commented Dec 17, 2018

I had to increase rtol up to 1e-3 in tests with float32 to make CI green ! (MKL or OpenBLAS show similar behavior) whereas rtol = 1e-12 is enough for float64.

This is a really high rtol...

@jeremiedbb
Copy link
Member Author

So I tested this rtol thing with the bundled CBLAS and it turns out that the issue is already here.
gemv(A,x,y) and A.dot(x) + y are close only up to 1e-3 on float32 data.

However, to mitigate this, I used data such that the result of gemv (or others) was often close to zero, meaning that it was involving differences between close numbers, which can easily lead to numerical precision issues.

Using better conditioned data allows to retrieve a 1e-6 rtol which is fine for float32.

@jeremiedbb jeremiedbb changed the title [WIP] Use Scipy cython BLAS API instead of bundled CBLAS [MRG] Use Scipy cython BLAS API instead of bundled CBLAS Dec 20, 2018
@jeremiedbb
Copy link
Member Author

Since minimal versions requirements have been upgraded in #12746, it's now safe to use scipy cython blas (scipy >= 0.16).

Does that need a what's new entry ?

@ogrisel
Copy link
Member

ogrisel commented Dec 22, 2018

Does that need a what's new entry?

I think we should do what's new entries for the individual estimators when there implementation are updated as a consequence of this infrastructure change.

For this PR you can already add an entry for the pairwise_metrics sparse Manhattan distance computation.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @jeremiedbb.

return ddot(&n, x, &incx, y, &incy)


cpdef _dot_memview(floating[::1] x, floating[::1] y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using const memoryviews to allow read-only input arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fused typed const memoryviews does not work yet, see #10624
However, all the _xxx_memview functions are just python wrappers to be able to test the C functions with pytest. They are not meant to be used in the python code base (if we want to multiply matrices in python we just do numpy dot), we don't want to expose blas functions at the python level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah of course. But is there a reason we should not be using memview interfaces when that would simplify the call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be the small overhead of an additional function call (and I don't really know how memview behave versus pointers performance wise) but I agree it would simplify it.

Currently all blas functions are called within functions where we have access to the pointers, so I'm not sure it's worth making interfaces for what we don't need currently. Maybe we should reconsider doing it when the need comes ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any function call overhead. Certainly not python functions. There will be cost in accessing members of the memoryview struct, but minimal.

No hurry, you're right, but I still find it strange that we are passed order rather than determining it from the memoryview strides.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any function call overhead. Certainly not python functions. There will be cost in accessing members of the memoryview struct, but minimal.

No hurry, you're right, but I still find it strange that we are passed order rather than determining it from the memoryview strides.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You convinced me :) I updated the functions to infer the memory layout.

dgemv(&ta_, &m, &n, &alpha, A, &lda, x, &incx, &beta, y, &incy)


cpdef _gemv_memview(BLAS_Order order, BLAS_Trans ta, floating alpha,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we determine order from A.strides[0] == A.itemsize rather than having it passed in?

@ogrisel ogrisel merged commit d0f63a7 into scikit-learn:master Feb 1, 2019
@ogrisel
Copy link
Member

ogrisel commented Feb 1, 2019

Merged! Thanks @jeremiedbb!

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 6, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
jackmitch pushed a commit to jackmitch/scikit-learn that referenced this pull request Jul 2, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
@jeremiedbb jeremiedbb deleted the scipy-blas branch July 20, 2020 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants