-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX Better support large or read-only datasets in decomposition.DictionaryLearning
#25172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX Better support large or read-only datasets in decomposition.DictionaryLearning
#25172
Conversation
Thanks for the fix. We also need a non-regression test for this. I don't really understand why the error is triggered in the first place because the data loaded by keras in the reproducer is actually no backed by a readonly buffer according to Furthermore, I thought we had a common test that checked that we could fit on readonly-buffers using a numpy array generated via It would be great to really understand what's going on in the reproducer before going forward. |
Update: this common estimator check should have tested that all scikit-learn estimators can be fitted on readonly data buffers:
|
I have posted a snippet (without keras) to reproduce in #25165 (comment) I think the issue is not that the fitted array is a read-only memmap, but that joblib creates read-only memmaps and that call some cython code with these read-only memmaps, and that cython chokes on read-only arrays when you create memoryviews. Honestly I don't know how to better test this, you would need:
|
I do not like it, but we can monkeypatch As for refactoring code to using memoryviews, if our tests do not have enough coverage, then I think we should pause accepting this type of refactoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, moving forward, I think it is safer to keep the existing cnp.ndarray
for fused types until Cython 3.0 is out.
As for this PR, I think we still need to revert:
scikit-learn/sklearn/linear_model/_coordinate_descent.py
Lines 597 to 599 in 49ff2d9
X_data=ReadonlyArrayWrapper( | |
X.data | |
), # TODO: Remove after release of Cython 3 (#23147) |
This test is built based on Loïc's reproducer: scikit-learn#25165 (comment) Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>
Reverting to use cnp.ndarray does not solve the problem. This is a tentative workaround using ReadonlyArrayWrapper. This solves the problem, but there is now a segmentation fault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is changing too much. I am okay with a reverting #23147 and adding a new non-regression test.
It looks like there's something weird at play between NumPy and memory view creation as using The
Is this a bug of Cython? See 6c30a48, out of this branch. |
Just in case it gets lost, I responded to the out of branch commit: 6c30a48#r94961757 |
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@@ -273,16 +273,17 @@ def enet_coordinate_descent( | |||
return np.asarray(w), gap, tol, n_iter + 1 | |||
|
|||
|
|||
# TODO: use const fused typed memoryview where possible when Cython 0.29.33 is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this TODO take effect now since const fused typed memoryviews are available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but once #25342 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#25342 is now merged. Shall we try that now or is it safer not to experiment with const typed memory views in a bugfix (1.2.1) and instead do that only in main
(for 1.3.0)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provided that Cython has good coverage and test cases for const
-qualified memoryviews, I think it is safe and the best to use them to resolve such issues (as it is conceptually the best solution). Yet it do not think we need to use them for this fix with respect to the next bug fix release (I do not want to postpone this release).
Note that using such construct might not be sufficient to fix the problem tackled by this PR as cython implementations need writable buffers.
I will start investigating const
-qualified fused typed memoryviews tomorrow for this fix and for other long standing issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a big enough fix to add to the 1.2.1 changelog.
…e-nd_array-for-read-only-support
cd_fast
use cnp.ndarray
to support readonly buffers
I am not entirely satisfied with this fix: I think the root of the problem might better addressed closer to the various coordinate descent algorithms. Yet, I think performing this fix would better come after removing some complexity. Removing this complexity necessitates #25322 to be fixed. Edit: |
decomposition.DictionaryLearning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as it is. I think I am fine with experimenting with const typed memory views in a follow-up PR but keep this bugfix as it is for 1.2.1.
LGTM, merging this one! |
…ionaryLearning` (scikit-learn#25172) Co-authored-by: Loïc Esteve <loic.esteve@ymail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…ionaryLearning` (scikit-learn#25172) Co-authored-by: Loïc Esteve <loic.esteve@ymail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…ionaryLearning` (scikit-learn#25172) Co-authored-by: Loïc Esteve <loic.esteve@ymail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…ionaryLearning` (#25172) Co-authored-by: Loïc Esteve <loic.esteve@ymail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Reference Issues/PRs
Follow-up of #23147.
Fix tentative for #25165
What does this implement/fix? Explain your changes.
In some workflows using coordinate descent algorithms:
joblib
might memmap arrays making their buffer read-only.Yet the implementation of those algorithms need those buffers to be writable.
This introduces a small copy of the slices of the dataset to make them writable in
_sparse_encode
(thejoblib.Parallel
ed function used insklearn.decomposition.sparse_encode
) prior to the call to aLasso
instance relying on coordinate descent.Moreover,
cnp.ndarray
is temporarily used instead of memoryviews to allow for a larger support of the variety of NumPy arrays sinceconst
-qualified memoryviews aren't yet supported. See #25322 for more details in this regards.