Skip to content

[MRG] Downcast large matrix indices where possible in sparsefuncs._minor_reduce (fix #13737) #13741

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
May 23, 2019

Conversation

rlms
Copy link
Contributor

@rlms rlms commented Apr 28, 2019

Reference Issues/PRs

Fixes #13737.

What does this implement/fix? Explain your changes.

Currently sparsefuncs.min_max_axis gives a TypeError whenever the input is a csc matrix (and axis is 0) with int64 indices on a 32-bit machine (due to a cast to intp in _minor_reduce). This does not occur (necessarily) for csr matrices, since the conversion to a csc matrix in _min_or_max_axis will downcast the indices when safe to do so (i.e. when all values fit in an int32). Reinitialising the input to _minor_reduce should avoid this (and also the corresponding problem that would occur for csr inputs and axis of 1).

A test for min_max_axis has been changed to cover the large matrices this affects. As expected, it now fails on 32-bit Windows for the master code base.

Any other comments?

@jnothman
Copy link
Member

jnothman commented Apr 28, 2019 via email

@rlms
Copy link
Contributor Author

rlms commented May 13, 2019

@jnothman pinging

# reduceat tries casts X.indptr to intp, which errors
# if it is int64 on a 32 bit system.
# Reinitializing prevents this where possible, see #13737
X = type(X)((X.data, X.indices, X.indptr), shape=X.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what does this reinitialisation actually do? Will it downcast indptr if X.shape is less than int32.max???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, precisely. The logic is in get_index_dtype here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in other cases is this a non-copying operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (see __init__ for _cs_matrix here). That could be made explicit by adding copy=False.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason it's still marked WIP?

If it should be MRC, I'm happy to try include this in release 0.21.2 (due imminently) if another reviewer approves.

Please add an entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:. We might land up moving this entry if the release ships sooner.

@jnothman
Copy link
Member

The what's new entry should ideally mention public API (e.g. estimators) that this affects

@rlms rlms changed the title [WIP] Downcast large matrix indices where possible in sparsefuncs._minor_reduce (fix #13737) [MRG] Downcast large matrix indices where possible in sparsefuncs._minor_reduce (fix #13737) May 23, 2019
@@ -26,6 +26,14 @@ Changelog
(regression introduced in 0.21) :issue:`13910`
by :user:`Jérémie du Boisberranger <jeremiedbb>`.

:mod:`sklearn.utils.sparsefuncs`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better off putting it under preprocessing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do, although the code change is in sparsefuncs and will also affect feature_selection.VarianceThreshold once pull #13704 goes through.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I believe that in this case the two classes that are potentially impacted by this bug are: LabelBinarizer and MaxAbsScaler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not read the previous comment when writing mine. In the end I am fine with the current PR.

@jnothman jnothman mentioned this pull request May 23, 2019
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the whats new section issue, LGTM.

@ogrisel
Copy link
Member

ogrisel commented May 23, 2019

@jnothman merge?

@jnothman jnothman merged commit dc5e4d8 into scikit-learn:master May 23, 2019
@jnothman
Copy link
Member

Thanks @rlms

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request May 23, 2019
@rlms rlms deleted the min-max-axis-csc branch May 23, 2019 17:27
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

utils.sparsefuncs.min_max_axis gives TypeError when input is large csc matrix when OS is 32 bit Windows
5 participants