Estimators should not try to modify X and y inplace in order to handle readonly memory maps #5481


Closed
arthurmensch opened this issue Oct 20, 2015 · 28 comments · Fixed by #10663

Comments

@arthurmensch
Contributor

PR #4807 reveals a variety of estimators that fail on memory-mapped input once we allow check_array to process a memory map without copying its content into memory.

Estimators failing on read-only memory maps:

  • the whole PLS family
  • Factor analysis
  • Incremental PCA
  • NuSVC

Transformers:

  • KernelCenterer
  • MaxAbsScaler
  • MinMaxScaler
  • RobustScaler
  • StandardScaler

Most of these should be easy to fix (e.g. in-place operations like X -= X.mean(axis=0) should be replaced with out-of-place equivalents, etc.).
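
To make the failure mode and the intended kind of fix concrete, here is a minimal sketch (a frozen in-memory array stands in for a read-only memmap; the centering example is illustrative):

import numpy as np

# A frozen array behaves like np.memmap(..., mode='r'): writes raise.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
X.setflags(write=False)

try:
    X -= X.mean(axis=0)  # in-place centering fails on read-only input
except ValueError as err:
    print(err)  # "output array is read-only"

X = X - X.mean(axis=0)  # out-of-place centering allocates a new array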

@ogrisel ogrisel added the Bug, Easy (Well-defined and straightforward way to resolve), and Need Contributor labels Oct 20, 2015
@ogrisel ogrisel changed the title Estimators should not try to modify X and y inplace in order to handle memory maps Estimators should not try to modify X and y inplace in order to handle readonly memory maps Oct 20, 2015
@ogrisel
Member

ogrisel commented Oct 20, 2015

For people interested in this issue, please feel free to submit an individual pull request per estimator; this makes the review process faster by focusing each review discussion on a specific estimator.

@pletelli

I'll work on this! It may take some time to understand memory maps & transformers better.

@sygi

sygi commented Oct 20, 2015

As there are quite a few estimators/transformers to work on, I'll join in. I'll start with FactorAnalysis, for instance ;)

@kastnerkyle
Member

Is PLS under a refactor right now? @arthurmensch

@kastnerkyle
Member

Sorry. Misclick.

@kastnerkyle
Member

Incremental PCA could fix this at the same time as #5173. I see the issue, but it probably means copying data per minibatch, which may slow things down. And I am pretty sure @lesteve was using read-only memmaps for his processing.

@SumedhArani

I'm new to scikit-learn and would like to contribute to this issue. I'll have to go through memory maps, estimators, and transformers. I'll stick to NuSVC and will get to the bottom of the problem as soon as possible.

@ogrisel
Member

ogrisel commented Oct 21, 2015

FYI, the list of estimators that are expected to do something in place can be found via:

>>> import inspect
>>> from pprint import pprint
>>> from sklearn.utils.testing import all_estimators
>>> pprint([e for e in all_estimators()
...         if 'copy' in inspect.signature(e[1].__init__).parameters])
[('AffinityPropagation',
  <class 'sklearn.cluster.affinity_propagation_.AffinityPropagation'>),
 ('Binarizer', <class 'sklearn.preprocessing.data.Binarizer'>),
 ('Birch', <class 'sklearn.cluster.birch.Birch'>),
 ('CCA', <class 'sklearn.cross_decomposition.cca_.CCA'>),
 ('FactorAnalysis',
  <class 'sklearn.decomposition.factor_analysis.FactorAnalysis'>),
 ('Imputer', <class 'sklearn.preprocessing.imputation.Imputer'>),
 ('IncrementalPCA',
  <class 'sklearn.decomposition.incremental_pca.IncrementalPCA'>),
 ('MaxAbsScaler', <class 'sklearn.preprocessing.data.MaxAbsScaler'>),
 ('MinMaxScaler', <class 'sklearn.preprocessing.data.MinMaxScaler'>),
 ('Normalizer', <class 'sklearn.preprocessing.data.Normalizer'>),
 ('OrthogonalMatchingPursuitCV',
  <class 'sklearn.linear_model.omp.OrthogonalMatchingPursuitCV'>),
 ('PCA', <class 'sklearn.decomposition.pca.PCA'>),
 ('PLSCanonical', <class 'sklearn.cross_decomposition.pls_.PLSCanonical'>),
 ('PLSRegression', <class 'sklearn.cross_decomposition.pls_.PLSRegression'>),
 ('PLSSVD', <class 'sklearn.cross_decomposition.pls_.PLSSVD'>),
 ('RandomizedPCA', <class 'sklearn.decomposition.pca.RandomizedPCA'>),
 ('RobustScaler', <class 'sklearn.preprocessing.data.RobustScaler'>),
 ('StandardScaler', <class 'sklearn.preprocessing.data.StandardScaler'>)]

@pletelli

TODOs are:

  • check that for all estimators, fit doesn't raise when given a read-only array, even if copy=False in the init (see the sketch after this list)
    -> if a copy is eventually needed and the user specified copy=False, do we raise a warning?
  • remove explicit ValueError raises
  • check that check_array(copy=self.copy) is never called when the copy is not useful (i.e. when no in-place transformation is executed)
  • (optionally) check that transform with read-only data and copy=False raises for estimators that do in-place modifications -> we may not be able to check this?
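
A minimal sketch of the first check, with StandardScaler as an illustrative stand-in (any estimator from the lists above would do):

import numpy as np
from sklearn.preprocessing import StandardScaler

# A frozen array triggers the same failures as a read-only memmap.
X = np.random.RandomState(0).rand(20, 3)
X.setflags(write=False)

# Desired behaviour: fit must not raise on read-only data,
# even when the user asked for copy=False.
StandardScaler(copy=False).fit(X)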

@SumedhArani

I understood what needs to be done and was working on RobustScaler, but when I tried to import it, I got an ImportError, even though the RobustScaler class does exist in data.py in sklearn.preprocessing. I even tried to run the plot_robust_scaling.py given in the examples folder. Could you let me know what to do?
I'm a newcomer to scikit-learn!

@vighneshbirodkar
Contributor

Shouldn't SparseCoder also be in the list to be considered? See #5956

@arthurmensch
Contributor Author

I am working on this again

@amueller
Member

Is anyone working on this? This seems pretty bad.

@amueller
Member

MiniBatchDictionaryLearning has the same issue.

@amueller
Member

We also need to test on sparse arrays, probably of multiple types (CSR, CSC should be enough?)
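
For such tests, one way to build read-only CSR/CSC inputs is to freeze the arrays backing the sparse matrix; a minimal sketch (the helper name is illustrative, and the freezing only approximates what joblib's memmapping produces):

import numpy as np
import scipy.sparse as sp

def make_readonly_sparse(X_dense, fmt='csr'):
    # Return a CSR/CSC matrix whose data/indices/indptr buffers are
    # read-only, approximating a memmapped sparse input.
    X = sp.csr_matrix(X_dense) if fmt == 'csr' else sp.csc_matrix(X_dense)
    for arr in (X.data, X.indices, X.indptr):
        arr.setflags(write=False)
    return X

X_csr = make_readonly_sparse(np.eye(4), fmt='csr')
X_csc = make_readonly_sparse(np.eye(4), fmt='csc')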

@rousseau

I confirm that MiniBatchDictionaryLearning has the same issue.

Is the resolution of this bug can change the behavior/results of the algorithm?

@amueller
Member

Is the resolution of this bug can change the behavior/results of the algorithm?

Sorry I can't parse this sentence.

@rousseau

rousseau commented Aug 1, 2016

Sorry if it was not clear.

What I meant is: will fixing this bug change the results obtained from the algorithm?

@amueller amueller modified the milestones: 0.18, 0.19 Sep 22, 2016
@amueller
Member

@rousseau no.

@jamestwebber
Contributor

I just encountered this bug in sklearn.metrics.pairwise, so it's not restricted to estimators.

In my case I was trying to compute pairwise manhattan distances on a big sparse matrix. If I give it n_jobs > 1, it crashes with a ValueError: buffer source array is read-only message.

I did the hacky thing of changing the lines in manhattan_distances to say copy=True, and now it's chugging along, but that's not a general solution.
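
For anyone hitting the same crash, a user-level workaround (rather than patching the library) is to hand the function a writeable copy; a minimal sketch, with a dense frozen array standing in for the read-only input:

import numpy as np
from sklearn.metrics import pairwise_distances

# Stand-in for read-only input, e.g. np.load(..., mmap_mode='r').
X = np.random.RandomState(0).rand(50, 4)
X.setflags(write=False)

# Copying makes the buffer writeable, at the cost of extra memory.
D = pairwise_distances(X.copy(), metric='manhattan', n_jobs=2)
print(D.shape)  # (50, 50)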

@jnothman
Member

jnothman commented Dec 5, 2016 via email

@lesteve
Member

lesteve commented Dec 5, 2016

joblib is behind all this of course, but AFAIR the idea was to modify the scikit-learn code to avoid doing in-place modification by default (copy=True by default, and expose a copy parameter where not already present); see #6614 (comment).

@jamestwebber I created a separate issue #7981.

@Morikko
Contributor

Morikko commented Mar 12, 2017

I am interested in working on this. However, I need some clarifications.
The goal is to:

  1. Add a copy parameter where one doesn't exist already
  2. Default the copy parameter to True where it previously defaulted to False

A 3rd point was done in PR #5507:
3. Force the copy parameter to True (even if set to False) when the array is read-only.
It was done by adding, before each check_array call:

copy = not X.flags.writeable or self.copy

Is this necessary?

Finally, is PR #4807 important to get in before doing these modifications?
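
For context, the pattern from #5507 amounts to something like the following helper (the name _force_copy_if_readonly is hypothetical, not from the PR):

import numpy as np

def _force_copy_if_readonly(X, copy):
    # Force a copy when the input buffer is read-only (e.g. a memmap
    # opened with mode='r'), regardless of the user-supplied copy flag.
    readonly = hasattr(X, 'flags') and not X.flags.writeable
    return copy or readonly

X = np.zeros((3, 2))
X.setflags(write=False)
print(_force_copy_if_readonly(X, copy=False))  # True: a copy is forced

# Sketched usage inside an estimator:
#   X = check_array(X, copy=_force_copy_if_readonly(X, self.copy))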

@lesteve lesteve removed the Easy (Well-defined and straightforward way to resolve) label Mar 13, 2017
@lesteve
Member

lesteve commented Mar 13, 2017

I removed the easy tag. @Morikko my personal advice: maybe try to tackle simpler PRs first before coming back to this one.

@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 14, 2017
@ghost

ghost commented Aug 22, 2017

ValueError: output array is read-only when using n_jobs > 1 with RandomizedLasso.

@iarroyof

Hi, the same problem occurs with TruncatedSVD.

@jnothman
Member

jnothman commented Feb 4, 2018

I think we need to call this one a blocker and make a point of fixing it... I'm not sure I'm as optimistic as @arthurmensch about how easy they are to fix... I think it would be good to see an example of one fixed, so that contributors can follow suit on others.

@jnothman
Member

jnothman commented Feb 4, 2018

I also think we should try to merge #4807 and make solving this a matter of removing estimators from a list of those failing...
