FIX Better support large or read-only datasets in `decomposition.DictionaryLearning` #25172

jjerphan · 2022-12-12T09:57:33Z

Reference Issues/PRs

Follow-up to #23147.
Tentative fix for #25165.

What does this implement/fix? Explain your changes.

In some workflows using coordinate descent algorithms:

- users might provide NumPy arrays with read-only buffers;
- joblib might memmap arrays, making their buffers read-only.

Yet the implementation of those algorithms needs those buffers to be writable.

This PR introduces a small copy of the slices of the dataset to make them writable in _sparse_encode (the function dispatched through joblib.Parallel by sklearn.decomposition.sparse_encode), prior to the call to a Lasso instance relying on coordinate descent.
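The copy described above can be sketched in plain NumPy. This is a minimal illustration under assumptions, not the code of this PR: `as_writable` is a hypothetical helper name, and the small array merely stands in for a slice of the dataset handed to a worker.

```python
import numpy as np


def as_writable(arr):
    """Return `arr` unchanged if its buffer is writable, else a writable copy.

    Hypothetical helper, for illustration only.
    """
    if arr.flags.writeable:
        return arr
    # np.array copies by default; the copy owns a fresh, writable buffer.
    return np.array(arr)


chunk = np.arange(6, dtype=np.float64).reshape(2, 3)
chunk.setflags(write=False)  # simulate a read-only buffer

safe = as_writable(chunk)
safe[0, 0] = 42.0  # allowed: only the copy is modified, `chunk` is untouched
```

The copy is only paid for when the input is actually read-only, so already-writable inputs are passed through as-is.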

Moreover, cnp.ndarray is temporarily used instead of memoryviews to support a larger variety of NumPy arrays, since const-qualified memoryviews aren't yet supported. See #25322 for more details in this regard.

ogrisel · 2022-12-12T15:03:41Z

Thanks for the fix. We also need a non-regression test for this. I don't really understand why the error is triggered in the first place, because the data loaded by keras in the reproducer is actually not backed by a read-only buffer according to x_train.flags.

Furthermore, I thought we had a common test that checked that we could fit on readonly-buffers using a numpy array generated via mmap_mode="r".

It would be great to really understand what's going on in the reproducer before going forward.
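For reference, a read-only array of the kind mentioned above can be produced with NumPy alone. The snippet below is a small sketch, assuming a temporary `.npy` file as backing storage; scikit-learn's own test helpers may build their memmaps differently.

```python
import os
import tempfile

import numpy as np

X = np.random.RandomState(0).randn(20, 5)

# Persist the array, then reload it as a read-only memory map.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "X.npy")
np.save(path, X)
X_readonly = np.load(path, mmap_mode="r")

# The flags report whether the underlying buffer can be written to.
print(type(X_readonly).__name__, X_readonly.flags.writeable)

# In-place writes are rejected by NumPy on such an array.
try:
    X_readonly[0, 0] = 1.0
    write_failed = False
except ValueError:
    write_failed = True
```

Checking `flags.writeable` on the array that actually reaches the compiled code, rather than on the array the user passed in, is what distinguishes the two situations discussed in this thread.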

ogrisel · 2022-12-12T15:09:22Z

Update: this common estimator check should have tested that all scikit-learn estimators can be fitted on readonly data buffers:

scikit-learn/sklearn/utils/estimator_checks.py

Line 103 in a576bcc

yield partial(check_estimators_fit_returns_self, readonly_memmap=True)

lesteve · 2022-12-13T13:57:31Z

I have posted a snippet (without keras) to reproduce in #25165 (comment)

I think the issue is not that the fitted array is a read-only memmap, but that joblib creates read-only memmaps, that some Cython code is then called with these read-only memmaps, and that Cython chokes on read-only arrays when creating memoryviews.

Honestly I don't know how to better test this, you would need:

- arrays big enough (>1MB) to trigger joblib read-only memmap creation
- to test all the combinations of parameters that could end up calling some Cython code which does not support read-only arrays. In this case there is no issue with the default fit_algorithm parameter value

thomasjpfan · 2022-12-13T17:40:00Z

> arrays big enough (>1MB) to trigger joblib read-only memmap creation

I do not like it, but we can monkeypatch Parallel and adjust max_nbytes for testing. In general, there have been issues such as #19608 where having a public API to adjust max_nbytes would be useful.

As for refactoring code to use memoryviews, if our tests do not have enough coverage, then I think we should pause accepting this type of refactoring.
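The effect of max_nbytes discussed above can be observed directly with joblib. The following is a hedged sketch rather than the test added in this PR: `buffer_is_writable` is a hypothetical function, and the threshold of 100 bytes is an arbitrary small value chosen so that a tiny array is memmapped.

```python
import numpy as np
from joblib import Parallel, delayed


def buffer_is_writable(arr):
    """Report whether the array received by the worker is writable."""
    return bool(arr.flags.writeable)


# 50 * 20 float64 values = 8,000 bytes, well below joblib's default threshold.
X = np.ones((50, 20))

# With a tiny max_nbytes, joblib dumps the array to a temporary folder and
# hands the workers a memory map opened in read-only mode.
memmap_flags = Parallel(n_jobs=2, max_nbytes=100)(
    delayed(buffer_is_writable)(X) for _ in range(2)
)
print(memmap_flags)
```

Lowering the threshold this way avoids allocating arrays larger than 1MB in a test, at the cost of depending on how Parallel is constructed inside the estimator.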

thomasjpfan

Overall, moving forward, I think it is safer to keep the existing cnp.ndarray for fused types until Cython 3.0 is out.

As for this PR, I think we still need to revert:

scikit-learn/sklearn/linear_model/_coordinate_descent.py

Lines 597 to 599 in 49ff2d9

    
    X_data=ReadonlyArrayWrapper(
        X.data
    ),  # TODO: Remove after release of Cython 3 (#23147)

This test is built based on Loïc's reproducer: scikit-learn#25165 (comment) Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>

Reverting to use cnp.ndarray does not solve the problem. This is a tentative workaround using ReadonlyArrayWrapper. This solves the problem, but there is now a segmentation fault.

sklearn/decomposition/tests/test_dict_learning.py

thomasjpfan

I think this PR is changing too much. I am okay with reverting #23147 and adding a new non-regression test.

sklearn/linear_model/_cd_fast.pyx

jjerphan · 2023-01-03T09:06:09Z

It looks like there's something weird at play between NumPy and memory view creation as using cnp.ndarray does not resolve the problem for read-only arrays.

The ValueError comes from NumPy's array_getbuffer (the ValueError message is the concatenation of "buffer source" (see here) and "array is read-only" (see there)).

I think array_getbuffer is called (indirectly, via a Cython macro) from View.MemoryView.memoryview.__cinit__ when memoryviews are created.

Is this a bug of Cython?

See 6c30a48, out of this branch.
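The buffer-protocol behaviour behind this can be illustrated at the Python level. This is only an analogue, under the assumption that a non-const Cython typed memoryview requests a writable buffer: Python's built-in memoryview accepts a read-only export, so it does not reproduce the ValueError itself, but it shows that the read-only status travels with the buffer.

```python
import numpy as np

w = np.zeros(4)
w.setflags(write=False)  # same state as a read-only memmapped array

# The built-in memoryview accepts a read-only export from NumPy.
view = memoryview(w)
print(view.readonly)

# Writing through the read-only view is refused.
try:
    view[0] = 1.0
    write_failed = False
except TypeError:
    write_failed = True
```

A consumer that insists on a writable buffer has no such fallback, which is consistent with the error being raised at memoryview creation rather than at the first write.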

thomasjpfan · 2023-01-03T23:15:18Z

Just in case it gets lost, I responded to the out of branch commit: 6c30a48#r94961757

Vincent-Maladiere · 2023-01-10T16:48:02Z

sklearn/linear_model/_cd_fast.pyx

@@ -273,16 +273,17 @@ def enet_coordinate_descent(
    return np.asarray(w), gap, tol, n_iter + 1


+# TODO: use const fused typed memoryview where possible when Cython 0.29.33 is used.


Does this TODO take effect now since const fused typed memoryviews are available?

Not yet, but once #25342 is merged.

#25342 is now merged. Shall we try that now or is it safer not to experiment with const typed memory views in a bugfix (1.2.1) and instead do that only in main (for 1.3.0)?

Provided that Cython has good coverage and test cases for const-qualified memoryviews, I think it is safe and best to use them to resolve such issues (as it is conceptually the best solution). Yet I do not think we need to use them for this fix with respect to the next bug fix release (I do not want to postpone this release).

Note that using such construct might not be sufficient to fix the problem tackled by this PR as cython implementations need writable buffers.

I will start investigating const-qualified fused typed memoryviews tomorrow for this fix and for other long standing issues.

thomasjpfan

I think this is a big enough fix to add to the 1.2.1 changelog.

jjerphan · 2023-01-13T09:56:50Z

I am not entirely satisfied with this fix: I think the root of the problem might be better addressed closer to the various coordinate descent algorithms. Yet I think such a fix would better come after removing some complexity, and removing this complexity requires #25322 to be fixed.

Edit: const-qualified fused typed memoryviews are necessary but aren't sufficient to solve this issue.

ogrisel

LGTM as it is. I think I am fine with experimenting with const typed memory views in a follow-up PR but keep this bugfix as it is for 1.2.1.

lesteve · 2023-01-19T09:18:02Z

LGTM, merging this one!

FIX Make cd_fast use cnp.ndarray to support readonly buffers

96c6870

Follow-up of scikit-learn#23147.

github-actions bot added cython module:linear_model labels Dec 12, 2022

jjerphan mentioned this pull request Dec 12, 2022

ValueError: buffer source array is read-only In DictionaryLearning using coordinate decent, numworkers = 15 #25165

Closed

glemaitre added this to the 1.2.1 milestone Dec 12, 2022

thomasjpfan reviewed Dec 14, 2022

MAINT Remove ReadonlyArrayWrapper

73fa243

jjerphan marked this pull request as ready for review December 15, 2022 13:46

jjerphan added the Quick Review and Waiting for Second Reviewer labels Dec 15, 2022

TST Add non-regression test for scikit-learn#25165

8d1ce0e

This test is built based on Loïc's reproducer: scikit-learn#25165 (comment) Co-authored-by: Loïc Esteve <loic.esteve@ymail.com>

jjerphan added the No Changelog Needed label Dec 16, 2022

DEBUG Use ReadonlyArrayWrapper to support const qualification

c223481

Reverting to use cnp.ndarray does not solve the problem. This is a tentative workaround using ReadonlyArrayWrapper. This solves the problem, but there is now a segmentation fault.

jjerphan removed the Quick Review and Waiting for Second Reviewer labels Dec 19, 2022

jjerphan marked this pull request as draft December 19, 2022 14:22

thomasjpfan reviewed Dec 19, 2022

sklearn/decomposition/tests/test_dict_learning.py (outdated, resolved)

thomasjpfan reviewed Dec 19, 2022

sklearn/linear_model/_cd_fast.pyx (outdated, resolved)

jjerphan and others added 5 commits January 5, 2023 14:15

MAINT Simplify and make copy of chunks of w

c1a4d96

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

TST Make test_cd_work_on_joblib_memmapped_data faster to run

ad6058d

fixup! MAINT Simplify and make copy of chunks of w

160d6d5

[scipy-dev] Trigger CI

1200183

[scipy-dev] TST cnp.ndarray all the way!

5ad7565

jjerphan marked this pull request as ready for review January 5, 2023 18:19

jjerphan added 2 commits January 6, 2023 16:45

Merge branch 'main' into fix/make-cd_fast-use-nd_array-for-read-only-support

eee8314

CI Trigger CI

bb4246f

jjerphan added the Waiting for Second Reviewer label Jan 7, 2023

Merge branch 'main' into fix/make-cd_fast-use-nd_array-for-read-only-support

d77042d

jjerphan added the Quick Review label Jan 10, 2023

Vincent-Maladiere reviewed Jan 10, 2023

thomasjpfan reviewed Jan 12, 2023

Merge remote-tracking branch 'upstream/main' into fix/make-cd_fast-use-nd_array-for-read-only-support

7687a18

jjerphan changed the title ~~FIX Make cd_fast use cnp.ndarray to support readonly buffers~~ FIX Better support for read-only datasets in coordinate descent algorithms Jan 13, 2023

jjerphan removed the Quick Review label Jan 13, 2023

jjerphan added 2 commits January 13, 2023 10:39

DOC Document workaround

4697e74

DOC Add a whats_new entry for 1.2.1

5595f7f

jjerphan changed the title ~~FIX Better support for read-only datasets in coordinate descent algorithms~~ FIX Better support large or read-only datasets in decomposition.DictionaryLearning Jan 13, 2023

Merge branch 'main' into fix/make-cd_fast-use-nd_array-for-read-only-support

120a489

ogrisel mentioned this pull request Jan 18, 2023

RFC Consider making auto-memmaping a manual operation joblib/joblib#1376

Open

ogrisel approved these changes Jan 18, 2023

lesteve merged commit d431d7e into scikit-learn:main Jan 19, 2023

jjerphan deleted the fix/make-cd_fast-use-nd_array-for-read-only-support branch January 19, 2023 09:19

adrinjalali pushed a commit that referenced this pull request Jan 24, 2023

FIX Better support large or read-only datasets in `decomposition.DictionaryLearning` (#25172)

27736ac

Co-authored-by: Loïc Esteve <loic.esteve@ymail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
