lasso_cd with multiprocessing fails on large dataset #4772
From sklearn.decomposition.dict_learning, using sparse_encode triggers an error when using a large input X with algorithm 'lasso_cd'. I was not able to reproduce this bug using simple examples, as it seems to come from a memory-mapping error owing to X being large.
Call line:
Output:
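The actual call line and output were not preserved in this thread. Below is a minimal sketch of the kind of call assumed to trigger the failure, with inputs large enough that joblib memory-maps them when dispatching to workers; the sizes are illustrative, not taken from the original report.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.RandomState(0)
X = rng.randn(10000, 300)  # large enough for joblib to memmap the input
D = rng.randn(100, 300)    # dictionary with 100 atoms

# With n_jobs > 1, joblib hands each worker a read-only memmap slice of X,
# which the Cython coordinate descent code then rejects with
# "ValueError: buffer source array is read-only".
code = sparse_encode(X, D, algorithm='lasso_cd', n_jobs=2)
```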
Do you have the problem with n_jobs=1?
No, only with n_jobs > 1. A simple (though ugly) fix is to copy X in _sparse_encode (where X is a slice of the design matrix). See pull request #4773.
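A minimal sketch of that copy workaround, assuming the failing input is the read-only memmap slice joblib passes to the worker; `ensure_writeable` is a hypothetical helper name, not the actual code in #4773:

```python
import numpy as np

def ensure_writeable(X):
    # joblib can pass workers a read-only memory-mapped slice of the
    # design matrix; Cython typed memoryviews reject read-only buffers,
    # so force a writeable in-memory copy before entering cd_fast.
    if not X.flags.writeable:
        X = np.array(X, copy=True)
    return X
```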
The bug can be reproduced using this example:
I tried to design a test that would currently fail: using synthetic data of the same size as the data in the previous example, I am not able to reproduce the error.
This is a limitation of Cython typed memory views, which do not work with read-only numpy arrays. I fixed a similar issue in pandas with a workaround here: pandas-dev/pandas#10070
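A small demonstration of the limitation, using a read-only array to simulate the memmaps joblib creates; the Cython side is shown in comments since it needs compilation:

```python
import numpy as np

X = np.zeros(10)
X.setflags(write=False)  # simulate joblib's read-only memmap

# Plain Python buffers accept read-only data:
mv = memoryview(X)  # fine

# But a compiled Cython function declared as
#     def f(double[:] x): ...
# raises on X with
#     ValueError: buffer source array is read-only
# because typed memoryviews acquire the underlying buffer with the
# writable flag set.
```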
Should we get rid of typed memory views then, and use raw pointers in Cython?
We have to investigate which array is in read-only mode; from the traceback this is ambiguous. If this is
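One way to investigate, as a sketch: temporarily log each candidate array's flags inside the worker function (`report_writeability` is a hypothetical debugging helper, not part of scikit-learn):

```python
import numpy as np

def report_writeability(**arrays):
    # Print which inputs arrive read-only (and whether they are memmaps),
    # e.g. called as report_writeability(X=X, dictionary=dictionary)
    # at the top of _sparse_encode.
    for name, arr in arrays.items():
        print(f"{name}: writeable={arr.flags.writeable}, "
              f"memmap={isinstance(arr, np.memmap)}")
```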
I had a quick chat IRL with @arthurmensch. He is going to do a benchmark and, if successful, will write a non-regression test with non-writeable inputs plus the fix itself in a PR.
I adapted bench_lasso to benchmark cd_fast using double[:] and ndarray in cd_fast.pyx. Here are the results: it looks like we do not lose performance (surprisingly enough, we even seem to gain for large numbers of samples). Source code (has to be run on both versions of cd_fast.pyx):
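The results and the attached script are not preserved in this thread. A minimal sketch of that kind of benchmark, assuming it times Lasso fits (which dispatch into cd_fast) for growing n_samples and is run once against each build of cd_fast.pyx; sizes and alpha are illustrative:

```python
from time import time
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
for n_samples in (1000, 10000, 100000):
    X = rng.randn(n_samples, 200)
    y = rng.randn(n_samples)
    t0 = time()
    # Lasso.fit dispatches into the coordinate descent loop in cd_fast
    Lasso(alpha=0.1, max_iter=1000).fit(X, y)
    print("n_samples=%d: %.3fs" % (n_samples, time() - t0))
```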
Thanks @arthurmensch, this looks good. Please open a PR with the fix and non-regression test on non-writeable arrays.
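A sketch of the kind of non-regression test requested here; the actual test landed with the fix (see #4775 below) and may differ in naming and scope:

```python
import numpy as np
from sklearn.linear_model import Lasso

def test_lasso_readonly_data():
    # Read-only inputs simulate the memmap slices joblib hands to workers;
    # fitting must not raise "ValueError: buffer source array is read-only".
    rng = np.random.RandomState(0)
    X = rng.randn(50, 10)
    y = rng.randn(50)
    X.setflags(write=False)
    y.setflags(write=False)
    Lasso(alpha=0.1).fit(X, y)
```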
Did anyone try to submit a bugfix to Cython, by the way?
I am very surprised by the benchmark. Where is the time spent?
Never mind, the typed memory views are unpacked anyhow. |
That would be nice but it seems quite complicated: https://mail.python.org/pipermail/cython-devel/2015-February/004316.html
Closed by #4775. |
scikit-learn==0.19 works well |
@ShichengChen there is no need to comment on closed issues; find a tutorial about GitHub if you are not sure how this works.