Estimators should not try to modify X and y inplace in order to handle readonly memory maps #5481
For the people interested in this issue, please feel free to submit an individual pull request per estimator: it makes the review process faster by focusing review discussions on specific estimators.
I'll work on this! It may take some time to understand memory maps & transformers better.
As there are quite a few estimators/transformers to work on, I'll join in. I'll start with FactorAnalysis, for instance ;)
Is PLS under a refactor right now? @arthurmensch
Sorry. Misclick.
I'm new to scikit-learn and would like to contribute to this issue. I'll have to go through memory maps, estimators and transformers. I'll stick to NuSVC and will get to the bottom of the problem as soon as possible.
FYI the list of estimators that are expected to do something inplace can be found via:
>>> import inspect
>>> from pprint import pprint
>>> from sklearn.utils.testing import all_estimators
>>> pprint([e for e in all_estimators()
...         if 'copy' in inspect.signature(e[1].__init__).parameters])
[('AffinityPropagation',
<class 'sklearn.cluster.affinity_propagation_.AffinityPropagation'>),
('Binarizer', <class 'sklearn.preprocessing.data.Binarizer'>),
('Birch', <class 'sklearn.cluster.birch.Birch'>),
('CCA', <class 'sklearn.cross_decomposition.cca_.CCA'>),
('FactorAnalysis',
<class 'sklearn.decomposition.factor_analysis.FactorAnalysis'>),
('Imputer', <class 'sklearn.preprocessing.imputation.Imputer'>),
('IncrementalPCA',
<class 'sklearn.decomposition.incremental_pca.IncrementalPCA'>),
('MaxAbsScaler', <class 'sklearn.preprocessing.data.MaxAbsScaler'>),
('MinMaxScaler', <class 'sklearn.preprocessing.data.MinMaxScaler'>),
('Normalizer', <class 'sklearn.preprocessing.data.Normalizer'>),
('OrthogonalMatchingPursuitCV',
<class 'sklearn.linear_model.omp.OrthogonalMatchingPursuitCV'>),
('PCA', <class 'sklearn.decomposition.pca.PCA'>),
('PLSCanonical', <class 'sklearn.cross_decomposition.pls_.PLSCanonical'>),
('PLSRegression', <class 'sklearn.cross_decomposition.pls_.PLSRegression'>),
('PLSSVD', <class 'sklearn.cross_decomposition.pls_.PLSSVD'>),
('RandomizedPCA', <class 'sklearn.decomposition.pca.RandomizedPCA'>),
('RobustScaler', <class 'sklearn.preprocessing.data.RobustScaler'>),
 ('StandardScaler', <class 'sklearn.preprocessing.data.StandardScaler'>)]
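To see why these copy-taking estimators fail, here is a minimal sketch of the failure mode, using plain NumPy only (a `np.memmap` opened with `mode="r"` behaves the same as an array whose `writeable` flag is off):

```python
import numpy as np

# Simulate a readonly memory map with a read-only ndarray.
X = np.arange(6, dtype=np.float64).reshape(3, 2)
X.setflags(write=False)

try:
    X -= X.mean(axis=0)  # in-place centering, as some estimators do
    failed = False
except ValueError:
    failed = True  # numpy refuses to write into a readonly buffer

# The out-of-place version allocates a fresh, writable result instead:
X_centered = X - X.mean(axis=0)
```

The in-place augmented assignment raises `ValueError` on the read-only buffer, while the out-of-place expression succeeds because it writes into a newly allocated array.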
TODOs are:
I understood what needs to be done and was working on RobustScaler. When I tried to import it, I got an ImportError, but the class RobustScaler does exist in data.py in sklearn.preprocessing. I even tried to run the plot_robust_scaling.py example from the examples folder. Could you let me know what to do?
Shouldn't
I am working on this again.
Is anyone working on this? This seems pretty bad.
We also need to test on sparse arrays, probably of multiple types (CSR and CSC should be enough?).
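For the sparse case, one way to build such test inputs is a sketch like the following (assuming SciPy only): CSR/CSC matrices whose underlying `data`/`indices`/`indptr` buffers are marked read-only, mimicking memory-mapped sparse data.

```python
import numpy as np
import scipy.sparse as sp

# Build CSR and CSC inputs whose buffers are read-only; any estimator
# that mutates X.data in place would then raise ValueError.
dense = np.array([[0.0, 1.0], [2.0, 0.0]])
readonly = []
for fmt in (sp.csr_matrix, sp.csc_matrix):
    X = fmt(dense)
    for buf in (X.data, X.indices, X.indptr):
        buf.setflags(write=False)
    readonly.append(X)
```

Feeding these matrices to an estimator under test should surface the same `ValueError` as the dense memmap case whenever the estimator writes into `X.data`.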
I confirm that MiniBatchDictionaryLearning has the same issue. Can the resolution of this bug change the behavior/results of the algorithm?
Sorry, I can't parse this sentence.
Sorry if it was not clear. What I meant is: will a fix of this bug change the results obtained from the algorithm?
@rousseau no.
I just encountered this bug in sklearn.metrics.pairwise, so it's not restricted to estimators. In my case I was trying to compute pairwise manhattan distances on a big sparse matrix. If I give it n_jobs>1 it crashes with a ValueError: buffer source array is read-only message. I did the hacky thing of changing the lines in manhattan_distances to say copy=True, and now it's chugging along, but that's not a general solution.
Sounds like a @joblib issue...
…On 5 December 2016 at 16:23, James Webber wrote the comment quoted above.
joblib is behind all this of course, but AFAIR the idea was to modify scikit-learn code to avoid doing inplace modifications by default. @jamestwebber I created a separate issue #7981.
I am interested in working on it. However, I need some clarifications:
A 3rd point was done in PR #5507: copy = not X.flags.writeable or self.copy. Is it necessary? Finally, is it important to merge PR #4807 before making these modifications?
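The pattern quoted from PR #5507 can be sketched as follows (the `center` helper is hypothetical, not scikit-learn's actual code; only the copy decision mirrors the quoted line):

```python
import numpy as np

def center(X, copy=False):
    # Force a copy whenever the input buffer is read-only,
    # otherwise honour the caller's copy flag.
    copy = (not X.flags.writeable) or copy
    if copy:
        X = X.copy()
    X -= X.mean(axis=0)  # safe now: X is guaranteed writable here
    return X

X = np.arange(6, dtype=np.float64).reshape(3, 2)
X.setflags(write=False)  # simulate a readonly memmap
Xc = center(X)           # no ValueError: a defensive copy was made
```

The point of the `or` is that writable inputs keep the old in-place behavior when `copy=False`, while read-only inputs silently get a defensive copy instead of crashing.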
I removed the easy tag. @Morikko my personal advice: maybe try to tackle simpler PRs first before coming back to this one.
Hi, the same problem occurs with TruncatedSVD.
I think we need to call this one a blocker and make a point of fixing it... I'm not sure I'm as optimistic as @arthurmensch about how easy they are to fix... I think it would be good to see an example of one fixed, so that contributors can follow suit on others.
I also think we should try to merge #4807 and make solving this a matter of removing estimators from a list of those failing...
PR #4807 reveals a variety of estimators that fail on memory-mapped input, once we allow check_array to process a memory map without copying its content into memory.
Estimators failing on read-only memory maps:
Transformers:
Most of these should be easy to fix (e.g. in-place statements like X -= X.mean(...) should be replaced with out-of-place equivalents, etc.)
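As an illustration of the fix direction, here is a hedged sketch using check_array (a real scikit-learn utility with a `copy` parameter; the `fit_center` helper itself is hypothetical):

```python
import numpy as np
from sklearn.utils import check_array

def fit_center(X):
    # Hypothetical fit step: request a copy only when the input
    # buffer is read-only, so writable inputs are still centered
    # in place while memmapped inputs no longer crash.
    X = check_array(X, copy=not X.flags.writeable)
    X -= X.mean(axis=0)
    return X

X = np.linspace(0.0, 1.0, 8).reshape(4, 2)
X.setflags(write=False)  # readonly, like a joblib-produced memmap
Xc = fit_center(X)
```

With this pattern, an estimator's per-estimator fix reduces to routing its input through a writeable-aware copy before any in-place arithmetic.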