Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions doc/whats_new/v0.21.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,14 @@ Changelog
(regression introduced in 0.21) :issue:`13910`
by :user:`Jérémie du Boisberranger <jeremiedbb>`.

:mod:`sklearn.utils.sparsefuncs`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better off putting it under preprocessing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do, although the code change is in sparsefuncs and will also affect feature_selection.VarianceThreshold once pull #13704 goes through.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I believe that in this case the two classes that are potentially impacted by this bug are: LabelBinarizer and MaxAbsScaler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not read the previous comment when writing mine. In the end I am fine with the current PR.

................................

- |Fix| Fixed a bug where :func:`min_max_axis` would fail on 32-bit systems
for certain large inputs. This affects :class:`preprocessing.MaxAbsScaler`,
:func:`preprocessing.normalize` and :class:`preprocessing.LabelBinarizer`.
:pr:`13741` by :user:`Roddy MacSween <rlms>`.

.. _changes_0_21_1:

Version 0.21.1
Expand Down
5 changes: 5 additions & 0 deletions sklearn/utils/sparsefuncs.py
Original file line number Diff line number Diff line change
Expand Up @@ -341,6 +341,11 @@ def inplace_swap_column(X, m, n):

def _minor_reduce(X, ufunc):
major_index = np.flatnonzero(np.diff(X.indptr))

# reduceat tries casts X.indptr to intp, which errors
# if it is int64 on a 32 bit system.
# Reinitializing prevents this where possible, see #13737
X = type(X)((X.data, X.indices, X.indptr), shape=X.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what does this reinitialisation actually do? Will it downcast indptr if X.shape is less than int32.max???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, precisely. The logic is in get_index_dtype here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in other cases is this a non-copying operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (see __init__ for _cs_matrix here). That could be made explicit by adding copy=False.

value = ufunc.reduceat(X.data, X.indptr[major_index])
return major_index, value

Expand Down
6 changes: 5 additions & 1 deletion sklearn/utils/tests/test_sparsefuncs.py
Original file line number Diff line number Diff line change
Expand Up @@ -393,14 +393,18 @@ def test_inplace_swap_column():
[(0, np.min, np.max, False),
(np.nan, np.nanmin, np.nanmax, True)]
)
@pytest.mark.parametrize("large_indices", [True, False])
def test_min_max(dtype, axis, sparse_format, missing_values, min_func,
max_func, ignore_nan):
max_func, ignore_nan, large_indices):
X = np.array([[0, 3, 0],
[2, -1, missing_values],
[0, 0, 0],
[9, missing_values, 7],
[4, 0, 5]], dtype=dtype)
X_sparse = sparse_format(X)
if large_indices:
X_sparse.indices = X_sparse.indices.astype('int64')
X_sparse.indptr = X_sparse.indptr.astype('int64')

mins_sparse, maxs_sparse = min_max_axis(X_sparse, axis=axis,
ignore_nan=ignore_nan)
Expand Down