Skip to content

Audit our usage of numpy.astype to remove unecessary memory copies #4573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ogrisel opened this issue Apr 11, 2015 · 11 comments
Closed

Audit our usage of numpy.astype to remove unecessary memory copies #4573

ogrisel opened this issue Apr 11, 2015 · 11 comments
Labels
Easy Well-defined and straightforward way to resolve

Comments

@ogrisel
Copy link
Member

ogrisel commented Apr 11, 2015

The following pattern will always trigger a memory copy even when X as the right type:

X = X.astype(dtype)

In recent numpy, it's possible to avoid this problem with:

X = X.astype(dtype, copy=False)

However we need to keep the backward compatibility with numpy 1.6.1 that does not provide this option. Therefore we should indeed use:

from sklearn.utils.fixes import asype

X = astype(X, dtype, copy=False)

The following search reveals that we might have several places in our code were we do unecessary memory copy:

$ find sklearn -name "*.py" | xargs grep "astype" | grep -v "copy=" | grep -v "test_"

It would be great if someone could scan those and contribute one or several pull requests to use as astype(X, dtype, copy=False) each time we find a pattern that could potentially do large, unwanted memory copies. This issue can be tackled by new contributors to scikit-learn.

Slightly related to #4555.

@ogrisel ogrisel added the Easy Well-defined and straightforward way to resolve label Apr 11, 2015
@jrings
Copy link

jrings commented Apr 11, 2015

This sounds like a good place to start contributing, I'll take a look!

@jrings
Copy link

jrings commented Apr 11, 2015

Ok I put in a PR which still failed on Travis for two things, so I added a commit to fix them, but it didn't take, so I tried a different PR and it just has a merge with recent master commits but not my changes O_o
Hang on, I'll try again :)

@ogrisel
Copy link
Member Author

ogrisel commented Apr 13, 2015

Please never merge the master branch to a feature branch. Start you feature branch from a recent version of master, then later you can rebase it from time to time on top of master.

@TomDLT
Copy link
Member

TomDLT commented May 19, 2015

#4645 is merged, this issue can be closed

@amueller
Copy link
Member

I think there are still a couple left-over. How did you check?
Doing git grep -n astype | grep -v test reveals some suspicious lines, like

cluster/k_means_.py:773
linear_model/stochastic_gradient.py:870
naive_bayes.py:476
preprocessing/data.py:351
preprocessing/data.py:384
preprocessing/imputation.py:359

well and a couple more...

@TomDLT
Copy link
Member

TomDLT commented May 20, 2015

cluster/k_means_.py:773
linear_model/stochastic_gradient.py:870
naive_bayes.py:476
preprocessing/data.py:351
preprocessing/data.py:384

have been solved in #4645 :)

preprocessing/imputation.py:359

I am not sure about the memory/copy behavior of this line. Is it possible not to copy?

Otherwise, I have checked with git grep -n astype | grep -v test, looking for lines like X = X.astype(dtype), and I see no obvious situation where a copy may not be needed.

I am also not sure about the following:

plot_gradient_boosting_quantile.py
plot_gradient_boosting_regression.py
plot_gradient_boosting_regularization.py

where the data type changes from float64 to float32. Is it useful?

@ogrisel
Copy link
Member Author

ogrisel commented May 20, 2015

There is no need to optimize the code in examples or tests functions. Just the code in the library itself.

For imputation.py:359:

X.data[mask] = valid_statistics[indexes].astype(X.dtype)

could indeed be optimized as:

X.data[mask] = astype(valid_statistics[indexes], X.dtype, copy=False)

as valid_statistics is always a numpy array.

@amueller
Copy link
Member

whoops looks like I was on an outdated branch when checking. sorry.

@ogrisel
Copy link
Member Author

ogrisel commented May 21, 2015

@TomDLT would you like to open a new PR to fix the remaining case in imputation.py so that we can close this issue?

@TomDLT
Copy link
Member

TomDLT commented May 21, 2015

ok I will

@ogrisel
Copy link
Member Author

ogrisel commented May 21, 2015

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Easy Well-defined and straightforward way to resolve
Projects
None yet
4 participants