[MRG] Downcast large matrix indices where possible in sparsefuncs._minor_reduce (fix #13737) #13741
Conversation
We are working towards a release, so please ping after the release is out in a week or two.
@jnothman pinging
# reduceat tries to cast X.indptr to intp, which errors
# if it is int64 on a 32 bit system.
# Reinitializing prevents this where possible, see #13737
X = type(X)((X.data, X.indices, X.indptr), shape=X.shape)
so what does this reinitialisation actually do? Will it downcast indptr if X.shape is less than int32.max???
Yes, precisely. The logic is in scipy's get_index_dtype.
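For illustration, a minimal sketch (not part of the PR) of the behaviour being discussed: passing the raw arrays back through the csc_matrix constructor lets scipy choose the index dtype via get_index_dtype, downcasting int64 indices to int32 when every value fits. The use of scipy.sparse.random and the specific shape are just assumptions for the example.

```python
import numpy as np
import scipy.sparse as sp

# Small csc matrix whose index arrays we artificially widen to int64.
X = sp.random(10, 10, density=0.3, format="csc", random_state=0)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)

# Reinitialising hands index-dtype selection back to scipy (get_index_dtype),
# which downcasts to int32 here because every index fits in 32 bits.
Y = type(X)((X.data, X.indices, X.indptr), shape=X.shape)
print(X.indptr.dtype, "->", Y.indptr.dtype)  # int64 -> int32
```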
And in other cases is this a non-copying operation?
Yes (see __init__ for _cs_matrix). That could be made explicit by adding copy=False.
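A companion sketch (again not from the PR) of what making copy=False explicit would look like; copy=False is already the scipy default, so when no dtype conversion is needed the constructor reuses the existing arrays rather than copying them:

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(5, 5, density=0.4, format="csc", random_state=0)  # int32 indices

# Explicit copy=False: with matching index dtypes, the new matrix shares
# memory with the original arrays instead of copying them.
Y = type(X)((X.data, X.indices, X.indptr), shape=X.shape, copy=False)
print(np.shares_memory(X.data, Y.data), np.shares_memory(X.indptr, Y.indptr))
# expected: True True
```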
Is there any reason it's still marked WIP?
If it should be MRG, I'm happy to try to include this in release 0.21.2 (due imminently) if another reviewer approves.
Please add an entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:. We might end up moving this entry if the release ships sooner.
The what's new entry should ideally mention the public API (e.g. estimators) that this affects.
@@ -26,6 +26,14 @@ Changelog
  (regression introduced in 0.21) :issue:`13910`
  by :user:`Jérémie du Boisberranger <jeremiedbb>`.

:mod:`sklearn.utils.sparsefuncs`
Better off putting it under preprocessing?
Could do, although the code change is in sparsefuncs and will also affect feature_selection.VarianceThreshold once PR #13704 goes through.
+1, I believe that in this case the two classes that are potentially impacted by this bug are LabelBinarizer and MaxAbsScaler.
I had not read the previous comment when writing mine. In the end I am fine with the current PR.
Besides the what's new section issue, LGTM.
@jnothman merge?
Thanks @rlms
Reference Issues/PRs
Fixes #13737.
What does this implement/fix? Explain your changes.
Currently sparsefuncs.min_max_axis gives a TypeError whenever the input is a csc matrix (and axis is 0) with int64 indices on a 32-bit machine (due to a cast to intp in _minor_reduce). This does not (necessarily) occur for csr matrices, since the conversion to a csc matrix in _min_or_max_axis will downcast the indices when it is safe to do so (i.e. when all values fit in an int32). Reinitialising the input to _minor_reduce should avoid this (and also the corresponding problem that would occur for csr inputs and an axis of 1).

A test for min_max_axis has been changed to cover the large matrices this affects. As expected, it now fails on 32-bit Windows for the master code base.
Any other comments?