Skip to content

MAINT validate parameter in sklearn.preprocessing._encoders #23579

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Jun 24, 2022

Conversation

Diadochokinetic
Copy link
Contributor

@Diadochokinetic Diadochokinetic commented Jun 10, 2022

Reference Issues/PRs

Make all estimators use _validate_params #23462

What does this implement/fix? Explain your changes.

Implements _validate_params for sklear.preprocessing._encoders. Shared parameters will be implemented in the _BaseEncoder.

Any other comments?

This is my first contribution. Tips and Feedback are highly appreciated.

@Diadochokinetic
Copy link
Contributor Author

I'm currently stuck, because _InstancesOf doesn't support numpy data types, see #23599

@thomasjpfan
Copy link
Member

thomasjpfan commented Jun 12, 2022

I think we leave off validation dtype for now. In NumPy, the dtype can be specified in many ways. For example, this currently works:

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(dtype="i4")
X = [['dog'], ['cat'], ['snake']]
enc.fit_transform(X)
# array([[1],
#        [0],
#        [2]], dtype=int32)

We can let NumPy handle bad dtypes. For example, this will raise already:

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(dtype="i423e12")
X = [['dog'], ['cat'], ['snake']]
enc.fit_transform(X)
# TypeError: data type 'i423e12' not understood

For reference, what NumPy considers a valid "dtype" is quite complex according to NumPy's typing system

@Diadochokinetic
Copy link
Contributor Author

Since the parameter validation requires at least one constraint for each parameter. What is the best way to approach the dtype parameter? Is there some "meta" class/type that always returns True?

@thomasjpfan
Copy link
Member

thomasjpfan commented Jun 12, 2022

What is the best way to approach the dtype parameter?

I think we can use object. For future contributors, include a comment near the validate_params for dtype that states that we are allowing NumPy to do the validation.

@Diadochokinetic
Copy link
Contributor Author

I think we can use object.

Unfortunately this collides with sklearn.utils.estimator_checks.check_param_validation. This function includes a test, whether an artificial "bad" parameter raises an appropriate error message. The artificial bad parameter yields True for isinstance.

>>> param_with_bad_type = type("BadType", (), {})()
>>> isinstance(param_with_bad_type, object)
True

Maybe this check should be disabled for object as parameter_constraint.

@thomasjpfan
Copy link
Member

thomasjpfan commented Jun 12, 2022

The current design of _validate_params is not flexible enough to turn off validation. Let's see what @jeremiedbb thinks about adding such functionality. For context, I think it's better to delegate validation of dtype to NumPy because the dtype parameter can accept many inputs.

@jeremiedbb jeremiedbb added the Validation related to input validation label Jun 13, 2022
@jeremiedbb
Copy link
Member

The current design of _validate_params is not flexible enough to turn off validation. Let's see what @jeremiedbb thinks about adding such functionality. For context, I think it's better to delegate validation of dtype to NumPy because the dtype parameter can accept many inputs.

@thomasjpfan I guess we can add a special case "no validation" or "any" or "delegate validation". This is a case were going for using typing would have make it easier 😄. I'll make a PR to implement that.

@Diadochokinetic
Copy link
Contributor Author

I'll make a PR to implement that.

Okay, then I'll resume working on this, when the PR is done :)

@jeremiedbb
Copy link
Member

@Diadochokinetic I directly pushed some changes to take into account the recent improvements in the validation mechanism.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@glemaitre glemaitre changed the title towards #23462 [WIP] implement _validate_params for sklear.preprocessing._encoders MAINT validate parameter in sklearn.preprocessing._encoders Jun 24, 2022
@glemaitre glemaitre self-requested a review June 24, 2022 14:31
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

@glemaitre glemaitre merged commit 8a8d068 into scikit-learn:main Jun 24, 2022
@glemaitre
Copy link
Member

Thanks @Diadochokinetic

@Diadochokinetic Diadochokinetic deleted the ohe_validate_params branch June 25, 2022 07:15
ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
…learn#23579)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants