-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Use check_array
to validate y
#25089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Would this hack still work after this change? |
I've not tried to run the code from the SO answer, so technically "I don't know". From a quick read of the code it seems that it relies on the From a super high level view, this change replaces a |
sklearn/utils/validation.py
Outdated
y = check_array( | ||
y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI jobs fail because some tests have y
be one-element arrays.
One solution is to adapt check_array
regarding
scikit-learn/sklearn/utils/validation.py
Lines 926 to 933 in cbfb6ab
if ensure_min_samples > 0: | |
n_samples = _num_samples(array) | |
if n_samples < ensure_min_samples: | |
raise ValueError( | |
"Found array with %d sample(s) (shape=%s) while a" | |
" minimum of %d is required%s." | |
% (n_samples, array.shape, ensure_min_samples, context) | |
) |
probably by:
- changing the default value of
ensure_min_samples
to0
- changing the check for
ensure_min_samples > 1
Alternatively, we can change this call to:
y = check_array( | |
y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False | |
) | |
y = check_array( | |
y, | |
ensure_2d=False, | |
ensure_min_samples=0, | |
dtype=dtype, | |
input_name="y", | |
force_all_finite=False, | |
) |
The first proposal is more sensible to me while probably necessitating more changes and adaptation in cascade whilst the second proposal is relatively simple but probably semantically not always correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example one test test_classification_report_output_dict_empty_input
checks that you can call classification_report
with []
for both y_true
and y_pred
. So I think we need to allow ensure_min_samples=0
.
bfa11e7
to
b19979b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM given that tests pass.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes #25073
Reference Issues/PRs
closes #25073 (more precisely this PR combined with #25080 closes it)
What does this implement/fix? Explain your changes.
Uses
check_array
in_check_y
so that we get the same behaviour for converting pandas special data types (that can represent missing values) likeInt64
as forX
. This is done in the part of the code aroundpandas_requires_conversion
.Is this what you had in mind?
cc @glemaitre