FIX Param validation Interval error for large integers #26648

Merged: 7 commits into scikit-learn:main on Sep 27, 2023

Conversation

Contributor

@naoise-h naoise-h commented Jun 21, 2023

At present, parameter validation with Interval raises a TypeError when given a large integer (wider than 64 bits). The cause is the np.isnan call, which tries to convert the value to a NumPy integer type, and no such type is wider than 64 bits. The issue came to light when trying to use numpy.random.SeedSequence().entropy (a 128-bit integer) as a random_state seed.
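The failure described above can be reproduced with NumPy alone, without scikit-learn. The snippet below is a minimal sketch; `isnan_raises_type_error` is a hypothetical helper written for this illustration and is not part of NumPy or scikit-learn.

```python
import numpy as np


def isnan_raises_type_error(val):
    """Return True if np.isnan(val) raises TypeError (hypothetical helper)."""
    try:
        np.isnan(val)
    except TypeError:
        # Python ints that fit no fixed-width NumPy integer type end up here.
        return True
    return False


# A float is handled normally; a 128-bit Python int is rejected by the ufunc.
print(isnan_raises_type_error(1.5))
print(isnan_raises_type_error(2**128))
```

The second call reports the same TypeError that appears in the traceback further down.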

What does this implement/fix? Explain your changes.

A TypeError is raised unexpectedly:

>>> from sklearn.utils._param_validation import Interval
>>> from numbers import Integral
>>> interval = Interval(Integral, 0, 2, closed="neither")
>>> 2 ** 128 in interval
Traceback (most recent call last):
  File ".../.local/share/JetBrains/Toolbox/apps/PyCharm-C/ch-0/231.9011.38/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File ".../scikit-learn/sklearn/utils/_param_validation.py", line 481, in __contains__
    if not isinstance(val, Integral) and np.isnan(val):
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Instead of calling np.isnan() on the value directly, the check now first tests whether the value is an integer (integers cannot be NaN) and only calls np.isnan() otherwise. Tests have been added to confirm it works with large integers, including integers too large to be represented as a float (2**1024 and larger).
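The integer-first check can be illustrated with a simplified, standard-library-only membership test. This is a sketch under assumptions, not scikit-learn's implementation: the function `in_open_interval` is hypothetical, and `math.isnan` stands in for `np.isnan` so the example has no third-party dependency.

```python
import math
from numbers import Integral


def in_open_interval(val, left, right):
    """Hypothetical open-interval test mirroring the integer-first NaN check."""
    # Integers can never be NaN, so the NaN test is skipped for them.
    # The short-circuit means isnan is never called on a huge int.
    if not isinstance(val, Integral) and math.isnan(val):
        return False
    # Python compares ints of any size exactly, so no overflow can occur here.
    return left < val < right


print(in_open_interval(1, 0, 2))             # inside the interval
print(in_open_interval(2**128, 0, 2))        # 128-bit integer
print(in_open_interval(2**2000, 0, 2))       # larger than any float
print(in_open_interval(float("nan"), 0, 2))  # NaN is never a member
```

Because the `isinstance` test comes first, the order of the two conditions matters: swapping them would reintroduce the error for large integers.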

After these changes, we have the following output, as expected:

>>> from sklearn.utils._param_validation import Interval
>>> from numbers import Integral
>>> interval = Interval(Integral, 0, 2, closed="neither")
>>> 2**128 in interval
False

Any other comments?

  • If we don't want to support integers larger than those representable by a float, we can use math.isnan instead, which raises an OverflowError when the conversion to float fails.
  • The _NanConstraint will also raise an OverflowError if given a large integer (>= 2**1024), but its only use is within MissingValues, which already has an Integral class check.
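The `math.isnan` behaviour mentioned in the first bullet can be checked with the standard library. The helper `math_isnan_overflows` below is hypothetical and exists only for this illustration.

```python
import math
import sys


def math_isnan_overflows(val):
    """Return True if math.isnan(val) raises OverflowError (hypothetical helper)."""
    try:
        math.isnan(val)
    except OverflowError:
        # Raised when the int is too large to convert to a float.
        return True
    return False


# 2**128 still fits in a double; 2**1024 exceeds the largest finite double.
print(math_isnan_overflows(2**128))
print(math_isnan_overflows(2**1024))
print(sys.float_info.max < 2**1024)
```

So `math.isnan` would accept the 128-bit seed case but reject integers at or beyond 2**1024, which is the trade-off the bullet describes.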

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR @naoise-h !

Member

@jeremiedbb jeremiedbb left a comment

Thanks for the PR @naoise-h. This is the appropriate fix imo.

@jeremiedbb
Member

Out of curiosity, what is the parameter that you want to set to such a big int?

@naoise-h
Contributor Author

Out of curiosity, what is the parameter that you want to set to such a big int?

I was trying to seed a random_state as per NumPy's guidelines, which default to a 128-bit seed. I know scikit-learn still uses the legacy NumPy RandomState objects, but perhaps larger seeds can be supported once the relevant NumPy version is a minimum requirement (I can't remember offhand when NumPy started using SeedSequence for seeding the legacy RNG).
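For context, the seeding pattern referred to here can be sketched as follows. It uses NumPy's `SeedSequence` and `default_rng` only; whether a given scikit-learn estimator accepts such a seed for `random_state` is not shown and is not assumed.

```python
import numpy as np

# Draw fresh entropy from the OS; with no arguments this is a large Python int.
seed_sequence = np.random.SeedSequence()
seed = seed_sequence.entropy
print(seed.bit_length())  # at most 128 bits

# The saved integer can be reused later to reproduce the same stream.
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)
sample_a = rng_a.integers(0, 10, size=5)
sample_b = rng_b.integers(0, 10, size=5)
print(np.array_equal(sample_a, sample_b))
```

A seed obtained this way usually exceeds 64 bits, which is how it ran into the Interval check discussed in this PR.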

Member

@ogrisel ogrisel left a comment

LGTM as well. @jeremiedbb why is this marked "no changelog needed"? Wasn't this bug already there in 1.2?

@github-actions

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2c27f4a.

Member

@jeremiedbb jeremiedbb left a comment

Let's not wait for the refactoring of utils to merge this one. Thanks @naoise-h.

@jeremiedbb jeremiedbb enabled auto-merge (squash) September 27, 2023 15:55
@jeremiedbb jeremiedbb merged commit 38a1e64 into scikit-learn:main Sep 27, 2023
lesteve pushed a commit to lesteve/scikit-learn that referenced this pull request Sep 28, 2023
…26648)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…26648)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
@naoise-h naoise-h deleted the large_integers branch January 23, 2024 16:59
Labels: module:utils, Validation (related to input validation)