Skip to content

MNT Adjust code after NEP 51 numpy scalar formatting changes #27042

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 22, 2023

Conversation

lesteve
Copy link
Member

@lesteve lesteve commented Aug 9, 2023

There are some failures in scipy-dev build due to numpy/numpy#22449 that implements NEP 51 as mentioned in #26814 (comment).

I tried to adjust the scikit-learn error code when it makes sense e.g. when in my opinion adding the full numpy type make it less easier to read for example I find:

ValueError: The classes, [np.int64(0), np.int64(1), np.int64(2), np.int64(3)], are not in class_weight

a lot less easier to read than:

ValueError: The classes, [0, 1, 2, 3], are not in class_weight

When that was not possible, I adjusted the test to be less strict.

Note this may well be the case that there are other instances of this issue, that are not caught by our tests, not sure if there is an easy way to find them ...

@github-actions
Copy link

github-actions bot commented Aug 9, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c9472a7. Link to the linter CI: here

@@ -2270,9 +2270,9 @@ def _check_pos_label_consistency(pos_label, y_true):
or np.array_equal(classes, [1])
)
):
classes_repr = ", ".join(repr(c) for c in classes)
classes_repr = str(classes.tolist()).replace('[', '{').replace(']', '}')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the same error message as previously, but honestly I would be fine with using square brackets rather than curly brackets to simplify the code, i.e.:

y_true takes value in ['a', 'b'] ...

rather than

y_true takes value in {'a', 'b'} ...

@lesteve
Copy link
Member Author

lesteve commented Aug 9, 2023

Oh well it seems like the scipy dev wheel in https://anaconda.org/scientific-python-nightly-wheels/scipy/files is not recent enough to have scipy/scipy#19015 hence the error:

ImportError: cannot import name 'ComplexWarning' from 'numpy' (/usr/share/miniconda/envs/testvenv/lib/python3.11/site-packages/numpy/__init__.py)

@@ -774,8 +774,8 @@ def _map_drop_idx_to_infrequent(self, feature_idx, drop_idx):
if infrequent_indices is not None and drop_idx in infrequent_indices:
categories = self.categories_[feature_idx]
raise ValueError(
f"Unable to drop category {categories[drop_idx]!r} from feature"
f" {feature_idx} because it is infrequent"
f"Unable to drop category {categories[drop_idx].tolist()!r} from"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the NEP correctly then repr(np.int64(1)) will now look like np.int64(1) but str(np.int64(1)) will (continue) to be 1. So I think we can just replace the !r with !s no?

But then isn't categories[drop_idx] an array ... so why does the formatting change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for strings, we generally want a quoting which is why we use !r

In other words you want:

Unable to drop category 'a' from feature 0

instead of

Unable to drop category a from feature 0

@betatim
Copy link
Member

betatim commented Aug 10, 2023

I tried to install the numpy and scipy nightlies to run the tests locally to see what things look like (I find it too hard to do in my head), but right now scipy and numpy are incompatible with each other (some exception got moved but scipy hasn't been adjusted yet). So I propose we wait a bit to resolve this.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I am fine with the rendering.

@lesteve
Copy link
Member Author

lesteve commented Aug 18, 2023

The scipy-dev test passes

OK then there are some doctests in rst that are broken, left to do on another PR ...

@@ -259,7 +259,7 @@ def test_imputation_median_special_cases():
@pytest.mark.parametrize("dtype", [None, object, str])
def test_imputation_mean_median_error_invalid_type(strategy, dtype):
X = np.array([["a", "b", 3], [4, "e", 6], ["g", "h", 9]], dtype=dtype)
msg = "non-numeric data:\ncould not convert string to float: '"
msg = "non-numeric data:\ncould not convert string to float:"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember here why I removed the quote here. We reraise an error from numpy ... we could also rewrite the error message if we think this is really important.

raise new_ve from None

In [1]: import numpy as np

In [2]: np.array(['a']).astype(np.float64)
--------------------------------------------------------------------------
ValueError                               Traceback (most recent call last)
Cell In[2], line 1
----> 1 np.array(['a']).astype(np.float64)

ValueError: could not convert string to float: np.str_('a')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rewriting the error message will be confusing for the user. Not sure, but it somehow feels wrong to modify an exception that we don't generate.

@@ -259,7 +259,7 @@ def test_imputation_median_special_cases():
@pytest.mark.parametrize("dtype", [None, object, str])
def test_imputation_mean_median_error_invalid_type(strategy, dtype):
X = np.array([["a", "b", 3], [4, "e", 6], ["g", "h", 9]], dtype=dtype)
msg = "non-numeric data:\ncould not convert string to float: '"
msg = "non-numeric data:\ncould not convert string to float:"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember here why I removed the quote here. We reraise an error from numpy ... we could also rewrite the error message if we think this is really important.

raise new_ve from None

In [1]: import numpy as np

In [2]: np.array(['a']).astype(np.float64)
--------------------------------------------------------------------------
ValueError                               Traceback (most recent call last)
Cell In[2], line 1
----> 1 np.array(['a']).astype(np.float64)

ValueError: could not convert string to float: np.str_('a')

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it!

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well.

@ogrisel ogrisel merged commit f55da62 into scikit-learn:main Aug 22, 2023
@ogrisel
Copy link
Member

ogrisel commented Aug 22, 2023

Thanks for the fixes @lesteve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants