-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+1] Use astype(.., copy=False) when possible #11973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Use astype(.., copy=False) when possible #11973
Conversation
Circle CI python 2 build failed due to https://github.com/tomMoral/loky/issues/151 as it did in #11899 (comment) in the Circle CI running on my fork. Interestingly it didn't fail there when merged into master. |
sklearn/utils/fixes.py
Outdated
|
||
# TODO: replace by copy=False, when only scipy > 1.1 is supported. | ||
def _astype_copy_false(X): | ||
"""Returns the copy=False parameter for {ndarray,csr_matrix,csc_matrix}.astype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! You are right, it's too long (although technically it passed flake8) -- Fixed.
sklearn/impute.py
Outdated
@@ -338,7 +338,7 @@ def _dense_fit(self, X, strategy, missing_values, fill_value): | |||
most_frequent = np.empty(X.shape[0]) | |||
|
|||
for i, (row, row_mask) in enumerate(zip(X[:], mask[:])): | |||
row_mask = np.logical_not(row_mask).astype(np.bool) | |||
row_mask = np.logical_not(row_mask).astype(np.bool, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code review capability of your editor must be adding extra spaces :)
This is 79 character long and valid according to PEP8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ups got trapped by my own editor :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we don't need astype
here since row_mask
will always be bool since mask
is bool. However, there is a astype
in fixes which might changed:
scikit-learn/sklearn/utils/fixes.py
Line 321 in 73a5592
return np.frompyfunc(lambda x: x != x, 1, 1)(X).astype(bool) |
sklearn/preprocessing/imputation.py
Outdated
@@ -289,7 +289,7 @@ def _dense_fit(self, X, strategy, missing_values, axis): | |||
most_frequent = np.empty(X.shape[0]) | |||
|
|||
for i, (row, row_mask) in enumerate(zip(X[:], mask[:])): | |||
row_mask = np.logical_not(row_mask).astype(np.bool) | |||
row_mask = np.logical_not(row_mask).astype(np.bool, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we change a deprecate file?
sklearn/cluster/hierarchical.py
Outdated
@@ -487,7 +489,8 @@ def linkage_tree(X, connectivity=None, n_components='deprecated', | |||
del diag_mask | |||
|
|||
if affinity == 'precomputed': | |||
distances = X[connectivity.row, connectivity.col].astype('float64') | |||
distances = X[connectivity.row, connectivity.col].astype( | |||
'float64', **_astype_copy_false(X)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flake8 tell me 4 spaces less ;)
LGTM. if you could have a pass on the following occurences. I am not 100% sure that they should change.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard to review this very thoroughly, so I would be unsure if there are cases where a copy was intended. I would be happy to merge it into master (but not 0.20.X) and see if any users complain of breakage in 0.21dev.
I think that @jnothman is right. Actually, we could merge it now without backport in 0.20.X. |
Do you want to add a what's new? |
1fd1ed5
to
faa16a1
Compare
Thanks for the review! Addressed some of @glemaitre 's comments, rebased (to include the v0.21 what's new template) and added a what's new. |
As I implied, I can't claim to thoroughly review this with confidence, but I'm okay to merge it on trust and the test of time, once we've dropped Py2 support. |
faa16a1
to
40f4e35
Compare
CI failing, @rth |
b2db080
to
3d2cf40
Compare
Rebased and fixed CI. CI is in progress but it should pass (last commit only linting failed). |
sklearn/cluster/hierarchical.py
Outdated
@@ -109,7 +111,7 @@ def _single_linkage_tree(connectivity, n_samples, n_nodes, n_clusters, | |||
|
|||
# Convert edge list into standard hierarchical clustering format | |||
single_linkage_tree = _hierarchical._single_linkage_label(mst_array) | |||
children_ = single_linkage_tree[:, :2].astype(np.int) | |||
children_ = single_linkage_tree[:, :2].astype(np.int, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, here I think that it's beneficial to have a copy. It's a somewhat subtle reason:
We are taking a view of the data that is smaller than the original data, via the slicing. This view references the original data. If no copy is made, the original data cannot be garbage collected, and memory is wasted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to fix or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix tomorrow...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, removed the copy=False
here.
sklearn/cluster/hierarchical.py
Outdated
@@ -229,7 +231,7 @@ def ward_tree(X, connectivity=None, n_clusters=None, return_distance=False): | |||
stacklevel=2) | |||
X = np.require(X, requirements="W") | |||
out = hierarchy.ward(X) | |||
children_ = out[:, :2].astype(np.intp) | |||
children_ = out[:, :2].astype(np.intp, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether or not the argument about view not being released applies here or not depends on whether we have return_distance or not. I am not sure whether or not the copy should be kept or not.
This comment does not ask for a change in the diff. It's just a reflection as I pass by.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed copy=False
here to use the defaults when in doubt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I did make a comment that I think would be an improvement. But if y'all think that it's a nitpick, I don't mind merging without taking in account my comment.
Awesome. Merging! |
* Use astype(.., copy=False) when possible * Use Use astype(.., copy=False) also in tests * Fix CI * Guillaume's comments * Address review comments * Add what's new * Fix CI * Lint * Fix merge issues * More merge conflict fixes * Fix failing test * Lint * Use copy=True in cluster/hierarchical * More fixes
* Use astype(.., copy=False) when possible * Use Use astype(.., copy=False) also in tests * Fix CI * Guillaume's comments * Address review comments * Add what's new * Fix CI * Lint * Fix merge issues * More merge conflict fixes * Fix failing test * Lint * Use copy=True in cluster/hierarchical * More fixes
This uses
X.astype(..., copy=False)
when possible to avoid memory copies.Closes #11970
I have manually reviewed all cases of
.astype
and manually addedcopy=False
when this makes sense (i.e. when the array can be of the same dtype as the target dtype)To support the copy argument for both sparse and dense arrays, a new
utils.fixes._astype_copy_false
function is added. It can be used as,This is somewhat awkward, however it makes it explicit that this is a temporary fix that can be dropped once we only support scipy > 1.1 and IMO is better in this sense than,
as the latter essentially permanently re-implements some checks from numpy/scipy.