
DOC Improve wording in Categorical Feature support example #31864


Merged — 4 commits, Aug 6, 2025

Conversation

ArturoAmorQ
Member

Reference Issues/PRs

Follow up from #31062.

What does this implement/fix? Explain your changes.

In #31062 (comment) it was suggested to add TargetEncoder to the benchmark, but I realized there is already an example comparing that strategy in the high-cardinality scenario, where it is most useful.

Instead, this PR links to that example and takes the opportunity to:

  • remove the no-longer-needed verbose_feature_names_out=False in the ordinal_encoder pipeline (introduced in ENH Specify categorical features with feature names in HGBDT #24889);
  • make a general pass on the wording to:
    • remove the corresponding mention of OrdinalEncoder in the "Native support" pipeline;
    • prefer verbs in the present tense;
    • remove redundancies in favor of more informative text;
    • improve the conclusions.

Any other comments?

Maybe we can also rework the above-mentioned TargetEncoder example? Or even merge both examples?


github-actions bot commented Aug 1, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 69f33db.

@adrinjalali
Member

cc @ogrisel since he had the original comment.

@ogrisel ogrisel self-assigned this Aug 4, 2025
@ogrisel
Member

ogrisel commented Aug 4, 2025

In #31062 (comment) it was suggested to add TargetEncoder to the benchmark, but I realized there is already an example comparing that strategy in the high-cardinality scenario, where it is most useful.

I think it's worth adding it to this example too. I would expect this method to be among the best even when the cardinality of the categorical features is not that large.

Maybe we can also rework the above-mentioned TargetEncoder example?

How would you rework it?

Even merge both examples?

That sounds challenging to do. The example on the wines dataset is already quite slow to run, so expanding it and using cv=5 instead of cv=3 would make it even slower. At the same time, it's good to have a categorical feature engineering example that runs on real data with a mix of both high- and low-cardinality features.

The underfitting analysis of this faster example is nice and complementary to the other one. I think I would keep both.

ArturoAmorQ and others added 3 commits August 6, 2025 15:34
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@ArturoAmorQ
Member Author

ArturoAmorQ commented Aug 6, 2025

I think it's worth adding it to this example too. I would expect this method to be among the best even when the cardinality of the categorical features is not that large.

@ogrisel these are the results for both plots when adding the TargetEncoder. They seem to be too noisy with the default TargetEncoder(cv=5):

[plots: target_encoding_1, target_encoding_2]

How would you rework it?

To keep both examples, I would cross-reference them, use the same plotting function, and possibly add the native support to the benchmark.

@lorentzenchr
Member

@ArturoAmorQ can I merge?

@ArturoAmorQ
Member Author

@lorentzenchr sure, thanks!

@lorentzenchr lorentzenchr merged commit b824c72 into scikit-learn:main Aug 6, 2025
36 checks passed
@ArturoAmorQ ArturoAmorQ deleted the wording_categorical branch August 7, 2025 09:34
@ogrisel
Member

ogrisel commented Aug 7, 2025

@ogrisel these are the results for both plots when adding the TargetEncoder. They seem to be too noisy with the default TargetEncoder(cv=5):

@ArturoAmorQ I don't think the error bars of the TargetEncoder pipeline are significantly larger than those of the other alternatives. I still think it would be interesting to feature TargetEncoder in this example, as it's usually a competitive alternative to the others (in terms of Pareto optimality) and can furthermore naturally handle both low- and high-cardinality categories. I think the examples are a useful way to make good alternatives discoverable and explain their pros and cons.
