Skip to content

FIX Better support large or read-only datasets in decomposition.DictionaryLearning #25172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Dec 12, 2022

Reference Issues/PRs

Follow-up of #23147.
Fix tentative for #25165

What does this implement/fix? Explain your changes.

In some workflows using coordinate descent algorithms:

  • users might provide NumPy arrays with read-only buffers
  • joblib might memmap arrays making their buffer read-only.

Yet the implementation of those algorithms need those buffers to be writable.

This introduces a small copy of the slices of the dataset to make them writable in _sparse_encode (the joblib.Paralleled function used in sklearn.decomposition.sparse_encode) prior to the call to a Lasso instance relying on coordinate descent.

Moreover, cnp.ndarray is temporarily used instead of memoryviews to allow for a larger support of the variety of NumPy arrays since const-qualified memoryviews aren't yet supported. See #25322 for more details in this regards.

@ogrisel
Copy link
Member

ogrisel commented Dec 12, 2022

Thanks for the fix. We also need a non-regression test for this. I don't really understand why the error is triggered in the first place because the data loaded by keras in the reproducer is actually no backed by a readonly buffer according to x_train.flags.

Furthermore, I thought we had a common test that checked that we could fit on readonly-buffers using a numpy array generated via mmap_mode="r".

It would be great to really understand what's going on in the reproducer before going forward.

@ogrisel
Copy link
Member

ogrisel commented Dec 12, 2022

Update: this common estimator check should have tested that all scikit-learn estimators can be fitted on readonly data buffers:

@lesteve
Copy link
Member

lesteve commented Dec 13, 2022

I have posted a snippet (without keras) to reproduce in #25165 (comment)

I think the issue is not that the fitted array is a read-only memmap, but that joblib creates read-only memmaps and that call some cython code with these read-only memmaps, and that cython chokes on read-only arrays when you create memoryviews.

Honestly I don't know how to better test this, you would need:

  • arrays big enough (>1MB) to trigger joblib read-only memmap creation
  • test all the combination of parameters that could end up calling some cython code which does not support read-only array. In this case there is no issue with the default fit_algorithm parameter value

@thomasjpfan
Copy link
Member

arrays big enough (>1MB) to trigger joblib read-only memmap creation

I do not like it, but we can monkeypatch Parallel and adjust max_nbytes for testing. In general, there has been issues such as #19608 where having a public API to adjust max_nbytes would be useful.

As for refactoring code to using memoryviews, if our tests do not have enough coverage, then I think we should pause accepting this type of refactoring.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, moving forward, I think it is safer to keep the existing cnp.ndarray for fused types until Cython 3.0 is out.

As for this PR, I think we still need to revert:

X_data=ReadonlyArrayWrapper(
X.data
), # TODO: Remove after release of Cython 3 (#23147)

@jjerphan jjerphan marked this pull request as ready for review December 15, 2022 13:46
@jjerphan jjerphan added Quick Review For PRs that are quick to review Waiting for Second Reviewer First reviewer is done, need a second one! labels Dec 15, 2022
This test is built based on Loïc's reproducer:
scikit-learn#25165 (comment)

Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>
Reverting to use cnp.ndarray does not solve the problem.
This is a tentative workaround using ReadonlyArrayWrapper.

This solves the problem, but there is now a segmentation fault.
@jjerphan jjerphan removed Quick Review For PRs that are quick to review Waiting for Second Reviewer First reviewer is done, need a second one! labels Dec 19, 2022
@jjerphan jjerphan marked this pull request as draft December 19, 2022 14:22
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is changing too much. I am okay with a reverting #23147 and adding a new non-regression test.

@jjerphan
Copy link
Member Author

jjerphan commented Jan 3, 2023

It looks like there's something weird at play between NumPy and memory view creation as using cnp.ndarray does not resolve the problem for read-only arrays.

The ValueError comes from NumPy's array_getbuffer (the ValueError message is the concatenation of "buffer source" (see here) and "array is read-only" (see there)).

array_getbuffer is called (indirectly via Cython macro) via View.MemoryView.memoryview.__cinit__ I think when memoryviews are created.

Is this a bug of Cython?

See 6c30a48, out of this branch.

@thomasjpfan
Copy link
Member

Just in case it gets lost, I responded to the out of branch commit: 6c30a48#r94961757

@jjerphan jjerphan marked this pull request as ready for review January 5, 2023 18:19
@jjerphan jjerphan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jan 7, 2023
@jjerphan jjerphan added the Quick Review For PRs that are quick to review label Jan 10, 2023
@@ -273,16 +273,17 @@ def enet_coordinate_descent(
return np.asarray(w), gap, tol, n_iter + 1


# TODO: use const fused typed memoryview where possible when Cython 0.29.33 is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this TODO take effect now since const fused typed memoryviews are available?

Copy link
Member Author

@jjerphan jjerphan Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, but once #25342 is merged.

Copy link
Member

@ogrisel ogrisel Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#25342 is now merged. Shall we try that now or is it safer not to experiment with const typed memory views in a bugfix (1.2.1) and instead do that only in main (for 1.3.0)?

Copy link
Member Author

@jjerphan jjerphan Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided that Cython has good coverage and test cases for const-qualified memoryviews, I think it is safe and the best to use them to resolve such issues (as it is conceptually the best solution). Yet it do not think we need to use them for this fix with respect to the next bug fix release (I do not want to postpone this release).

Note that using such construct might not be sufficient to fix the problem tackled by this PR as cython implementations need writable buffers.

I will start investigating const-qualified fused typed memoryviews tomorrow for this fix and for other long standing issues.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a big enough fix to add to the 1.2.1 changelog.

@jjerphan jjerphan changed the title FIX Make cd_fast use cnp.ndarray to support readonly buffers FIX Better support for read-only datasets in coordinate descent algorithms Jan 13, 2023
@jjerphan jjerphan removed the Quick Review For PRs that are quick to review label Jan 13, 2023
@jjerphan
Copy link
Member Author

jjerphan commented Jan 13, 2023

I am not entirely satisfied with this fix: I think the root of the problem might better addressed closer to the various coordinate descent algorithms. Yet, I think performing this fix would better come after removing some complexity. Removing this complexity necessitates #25322 to be fixed.

Edit: const-qualified fused typed memoryviews are necessary but aren't sufficient to solve this issue.

@jjerphan jjerphan changed the title FIX Better support for read-only datasets in coordinate descent algorithms FIX Better support large or read-only datasets in decomposition.DictionaryLearning Jan 13, 2023
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as it is. I think I am fine with experimenting with const typed memory views in a follow-up PR but keep this bugfix as it is for 1.2.1.

@lesteve
Copy link
Member

lesteve commented Jan 19, 2023

LGTM, merging this one!

@lesteve lesteve merged commit d431d7e into scikit-learn:main Jan 19, 2023
@jjerphan jjerphan deleted the fix/make-cd_fast-use-nd_array-for-read-only-support branch January 19, 2023 09:19
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…ionaryLearning` (scikit-learn#25172)

Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…ionaryLearning` (scikit-learn#25172)

Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023
…ionaryLearning` (scikit-learn#25172)

Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
adrinjalali pushed a commit that referenced this pull request Jan 24, 2023
…ionaryLearning` (#25172)

Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants