Estimators should not try to modify X and y inplace in order to handle readonly memory maps #5481


Closed
arthurmensch opened this issue Oct 20, 2015 · 28 comments · Fixed by #10663

Comments

@arthurmensch
Contributor

PR #4807 reveals a variety of estimators that fail on memory-mapped input once we allow check_array to process a memory map without copying its content into memory.

Estimators failing on read-only memory maps:

  • the whole PLS family
  • Factor analysis
  • Incremental PCA
  • NuSVC

Transformers:

  • KernelCenterer
  • MaxAbsScaler
  • MinMaxScaler
  • RobustScaler
  • StandardScaler

Most of these should be easy to fix (e.g. in-place operations like X -= X.mean(axis=0) should be replaced with out-of-place equivalents, etc.).
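
To make the failure mode and the intended kind of fix concrete, here is a minimal sketch (a frozen in-memory array stands in for a read-only memmap; the centering example is illustrative):

import numpy as np

# A frozen array behaves like np.memmap(..., mode='r'): writes raise.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
X.setflags(write=False)

try:
    X -= X.mean(axis=0)  # in-place centering fails on read-only input
except ValueError as err:
    print(err)  # "output array is read-only"

X = X - X.mean(axis=0)  # out-of-place centering allocates a new array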

@ogrisel ogrisel added the Bug, Easy (Well-defined and straightforward way to resolve), and Need Contributor labels Oct 20, 2015
@ogrisel ogrisel changed the title Estimators should not try to modify X and y inplace in order to handle memory maps Estimators should not try to modify X and y inplace in order to handle readonly memory maps Oct 20, 2015
@ogrisel
Member

ogrisel commented Oct 20, 2015

For people interested in this issue, please feel free to submit an individual pull request per estimator; this makes the review process faster by focusing each review discussion on a specific estimator.

@pletelli

I'll work on this! It may take some time to understand memory maps & transformers better.

@sygi

sygi commented Oct 20, 2015

As there are quite a few estimators/transformers to work on, I'll join in. I'll start with FactorAnalysis, for instance ;)

@kastnerkyle
Member

Is PLS under a refactor right now? @arthurmensch

@kastnerkyle
Member

Sorry. Misclick.

@kastnerkyle
Member

Incremental PCA could fix this at the same time as #5173. I see the issue, but it probably means copying data per minibatch, which may slow things down. And I am pretty sure @lesteve was using read-only memmaps for his processing.

@SumedhArani

I'm new to scikit-learn and would like to contribute to this issue. I'll have to go through memory maps, estimators, and transformers. I'll stick to NuSVC and will get to the bottom of the problem as soon as possible.

@ogrisel
Member

ogrisel commented Oct 21, 2015

FYI, the list of estimators that are expected to do something in place can be found via:

>>> import inspect
>>> from pprint import pprint
>>> from sklearn.utils.testing import all_estimators
>>> pprint([e for e in all_estimators()
...         if 'copy' in inspect.signature(e[1].__init__).parameters])
[('AffinityPropagation',
  <class 'sklearn.cluster.affinity_propagation_.AffinityPropagation'>),
 ('Binarizer', <class 'sklearn.preprocessing.data.Binarizer'>),
 ('Birch', <class 'sklearn.cluster.birch.Birch'>),
 ('CCA', <class 'sklearn.cross_decomposition.cca_.CCA'>),
 ('FactorAnalysis',
  <class 'sklearn.decomposition.factor_analysis.FactorAnalysis'>),
 ('Imputer', <class 'sklearn.preprocessing.imputation.Imputer'>),
 ('IncrementalPCA',
  <class 'sklearn.decomposition.incremental_pca.IncrementalPCA'>),
 ('MaxAbsScaler', <class 'sklearn.preprocessing.data.MaxAbsScaler'>),
 ('MinMaxScaler', <class 'sklearn.preprocessing.data.MinMaxScaler'>),
 ('Normalizer', <class 'sklearn.preprocessing.data.Normalizer'>),
 ('OrthogonalMatchingPursuitCV',
  <class 'sklearn.linear_model.omp.OrthogonalMatchingPursuitCV'>),
 ('PCA', <class 'sklearn.decomposition.pca.PCA'>),
 ('PLSCanonical', <class 'sklearn.cross_decomposition.pls_.PLSCanonical'>),
 ('PLSRegression', <class 'sklearn.cross_decomposition.pls_.PLSRegression'>),
 ('PLSSVD', <class 'sklearn.cross_decomposition.pls_.PLSSVD'>),
 ('RandomizedPCA', <class 'sklearn.decomposition.pca.RandomizedPCA'>),
 ('RobustScaler', <class 'sklearn.preprocessing.data.RobustScaler'>),
 ('StandardScaler', <class 'sklearn.preprocessing.data.StandardScaler'>)]

@pletelli

TODOs are:

  • check that for all estimators, fit doesn't raise when given a read-only array, even if copy=False in the init (see the sketch after this list)
    -> if a copy is eventually needed and the user specified copy=False, do we raise a warning?
  • remove explicit ValueError raises
  • check that check_array(copy=self.copy) is never called when the copy is not useful (i.e. when no in-place transformation is executed)
  • (optionally) check that transform with read-only data and copy=False raises for estimators that do in-place modifications -> we may not be able to check this?
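
A minimal sketch of the first check, with StandardScaler as an illustrative stand-in (any estimator from the lists above would do):

import numpy as np
from sklearn.preprocessing import StandardScaler

# A frozen array triggers the same failures as a read-only memmap.
X = np.random.RandomState(0).rand(20, 3)
X.setflags(write=False)

# Desired behaviour: fit must not raise on read-only data,
# even when the user asked for copy=False.
StandardScaler(copy=False).fit(X)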

@SumedhArani

I understood what needs to be done and was working on RobustScaler, but when I tried to import it, I got an ImportError, even though the RobustScaler class does exist in data.py in sklearn.preprocessing. I even tried to run the plot_robust_scaling.py given in the examples folder. Could you let me know what to do?
I'm a newcomer to scikit-learn!

@vighneshbirodkar
Contributor

Shouldn't SparseCoder also be in the list to be considered? See #5956

@arthurmensch
Contributor Author

I am working on this again

@amueller
Member

Is anyone working on this? This seems pretty bad.

@amueller
Member

MiniBatchDictionaryLearning has the same issue.

@amueller
Member

We also need to test on sparse arrays, probably of multiple types (CSR, CSC should be enough?)
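
For such tests, one way to build read-only CSR/CSC inputs is to freeze the arrays backing the sparse matrix; a minimal sketch (the helper name is illustrative, and the freezing only approximates what joblib's memmapping produces):

import numpy as np
import scipy.sparse as sp

def make_readonly_sparse(X_dense, fmt='csr'):
    # Return a CSR/CSC matrix whose data/indices/indptr buffers are
    # read-only, approximating a memmapped sparse input.
    X = sp.csr_matrix(X_dense) if fmt == 'csr' else sp.csc_matrix(X_dense)
    for arr in (X.data, X.indices, X.indptr):
        arr.setflags(write=False)
    return X

X_csr = make_readonly_sparse(np.eye(4), fmt='csr')
X_csc = make_readonly_sparse(np.eye(4), fmt='csc')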

@rousseau

I confirm that MiniBatchDictionaryLearning has the same issue.

Is the resolution of this bug can change the behavior/results of the algorithm?

@amueller
Member

Is the resolution of this bug can change the behavior/results of the algorithm?

Sorry I can't parse this sentence.

@rousseau

rousseau commented Aug 1, 2016

Sorry if it was not clear.

What I meant is: will fixing this bug change the results obtained from the algorithm?

@amueller amueller modified the milestones: 0.18, 0.19 Sep 22, 2016
@amueller
Member

@rousseau no.

@jamestwebber
Contributor

I just encountered this bug in sklearn.metrics.pairwise, so it's not restricted to estimators.

In my case I was trying to compute pairwise manhattan distances on a big sparse matrix. If I give it n_jobs > 1, it crashes with a ValueError: buffer source array is read-only message.

I did the hacky thing of changing the lines in manhattan_distances to say copy=True, and now it's chugging along, but that's not a general solution.
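
For anyone hitting the same crash, a user-level workaround (rather than patching the library) is to hand the function a writeable copy; a minimal sketch, with a dense frozen array standing in for the read-only input:

import numpy as np
from sklearn.metrics import pairwise_distances

# Stand-in for read-only input, e.g. np.load(..., mmap_mode='r').
X = np.random.RandomState(0).rand(50, 4)
X.setflags(write=False)

# Copying makes the buffer writeable, at the cost of extra memory.
D = pairwise_distances(X.copy(), metric='manhattan', n_jobs=2)
print(D.shape)  # (50, 50)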

@jnothman
Member

jnothman commented Dec 5, 2016 via email

@lesteve
Member

lesteve commented Dec 5, 2016

joblib is behind all this of course, but AFAIR the idea was to modify the scikit-learn code to avoid doing in-place modification by default (copy=True by default, and expose a copy parameter where not already present); see #6614 (comment).

@jamestwebber I created a separate issue #7981.

@Morikko
Contributor

Morikko commented Mar 12, 2017

I am interested in working on this. However, I need some clarifications.
The goal is to:

  1. Add a copy parameter where one doesn't exist already
  2. Default the copy parameter to True where it previously defaulted to False

A 3rd point was done in PR #5507:
3. Force the copy parameter to True (even if set to False) when the array is read-only.
It was done by adding, before each check_array call:

copy = not X.flags.writeable or self.copy

Is this necessary?

Finally, is PR #4807 important to get in before doing these modifications?
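
For context, the pattern from #5507 amounts to something like the following helper (the name _force_copy_if_readonly is hypothetical, not from the PR):

import numpy as np

def _force_copy_if_readonly(X, copy):
    # Force a copy when the input buffer is read-only (e.g. a memmap
    # opened with mode='r'), regardless of the user-supplied copy flag.
    readonly = hasattr(X, 'flags') and not X.flags.writeable
    return copy or readonly

X = np.zeros((3, 2))
X.setflags(write=False)
print(_force_copy_if_readonly(X, copy=False))  # True: a copy is forced

# Sketched usage inside an estimator:
#   X = check_array(X, copy=_force_copy_if_readonly(X, self.copy))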

@lesteve lesteve removed the Easy (Well-defined and straightforward way to resolve) label Mar 13, 2017
@lesteve
Member

lesteve commented Mar 13, 2017

I removed the easy tag. @Morikko my personal advice: maybe try to tackle simpler PRs first before coming back to this one.

@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 14, 2017
@ghost

ghost commented Aug 22, 2017

ValueError: output array is read-only when using n_jobs > 1 with RandomizedLasso.

@iarroyof

Hi, the same problem occurs with TruncatedSVD.

@jnothman
Member

jnothman commented Feb 4, 2018

I think we need to call this one a blocker and make a point of fixing it... I'm not sure I'm as optimistic as @arthurmensch about how easy they are to fix... I think it would be good to see an example of one fixed, so that contributors can follow suit on others.

@jnothman
Member

jnothman commented Feb 4, 2018

I also think we should try to merge #4807 and make solving this a matter of removing estimators from a list of those failing...
