DOC Update plots in Categorical Feature Support in GBDT example #31062


Merged (13 commits, Jul 17, 2025)

Conversation

ArturoAmorQ
Member

@ArturoAmorQ ArturoAmorQ commented Mar 24, 2025

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Inspired by this example's way of showing the test_score/fit_time trade-off, I find a scatter plot easier to read and interpret than bar plots.

This PR also introduces a log scale for fit times.

Current plot:

sphx_glr_plot_gradient_boosting_categorical_main

This PR (most recent render):

sphx_glr_plot_gradient_boosting_categorical_001

Any other comments?

I took the opportunity to show the intermediate HTML diagrams and introduce a wording tweak.


github-actions bot commented Mar 24, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 96b7c7c.

@ogrisel
Member

ogrisel commented Apr 30, 2025

As discussed during the bi-weekly meeting, it would be great to make it explicit that the best models are in the bottom left corner, maybe using a matplotlib arrow annotation there with the text "best models" pointing towards the point at coordinates (0, 0) (or (0.1, 0.1), because of the log scale on the x axis).

Also, could you please add more ticks on the x axis, e.g.: 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4.

Also, please add the TargetEncoder to this plot.
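The arrow annotation and the extra x-axis ticks suggested above could be sketched roughly as follows; the coordinates, axis limits, and tick values here are illustrative placeholders, not the example's actual numbers:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from matplotlib import ticker

fig, ax = plt.subplots()
ax.set_xscale("log")
ax.set_xlim(0.09, 0.5)
ax.set_ylim(0.05, 0.3)

# Arrow pointing towards the lower-left (fast fit, low error) corner.
ax.annotate(
    "best models",
    xy=(0.1, 0.06),      # arrow head near the corner
    xytext=(0.2, 0.12),  # text placed further up and to the right
    arrowprops=dict(arrowstyle="->"),
)

# Force explicit ticks instead of matplotlib's sparse log-scale defaults.
ticks = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]
ax.xaxis.set_major_locator(ticker.FixedLocator(ticks))
ax.xaxis.set_major_formatter(ticker.FixedFormatter([str(t) for t in ticks]))
ax.xaxis.set_minor_locator(ticker.NullLocator())
```

With a log-scaled axis, matplotlib would otherwise only label powers of ten, so the `FixedLocator`/`FixedFormatter` pair is one way to get the denser labeling requested here.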

@ArturoAmorQ
Member Author

Current result:

image

Also, please add the TargetEncoder to this plot.

As that also requires adding narrative on the interpretation, I would rather leave that to a follow-up PR.

Member

@ogrisel ogrisel left a comment


LGTM, thanks!

Member

@lucyleeow lucyleeow left a comment


Sorry this took me so long to get to!
This is a much nicer graph, thank you!

I couldn't comment on the line but maybe we could add the name of the category handling strategy in the bullet list at the top, i.e.:

- "Dropped": dropping the categorical features
- "One Hot": using a :class:`~preprocessing.OneHotEncoder`
- "Ordinal": using an :class:`~preprocessing.OrdinalEncoder` and treating categories
  as ordered, equidistant quantities
- "Native": using an :class:`~preprocessing.OrdinalEncoder` and relying on the
  :ref:`native category support <categorical_support_gbdt>` of the
  :class:`~ensemble.HistGradientBoostingRegressor` estimator.

Do you think it's worth mentioning somewhere on the graph that the error bars are 1 standard deviation or is it obvious what the error bars are?

I saw the comment on adding TargetEncoder but maybe that could be left to another PR...?

It's nice that we make it clear to the user where the best models are (with the arrow and 'best models' text), but it does confuse me at first:

  • I expect it to be pointing to something, e.g., a scatter point. We could add a circle that it points to, but this is not necessarily clear either
  • The graph axes do not start at (0, 0), so it's possibly misleading?

I can't think how to improve this. Maybe it is clear enough, if people read the axis labels, that smaller error/faster fitting is better? Maybe we could add arrows to the x and y axis labels, e.g., "← faster fitting" and "↓ better model/lower error". Or just explaining it in the text may be enough?

Edit: Maybe we could just have the text 'Best models' in the bottom left corner and no arrow?



class CustomLogFormatter(ticker.Formatter):
    def __call__(self, x, pos=None):
Member


Maybe a docstring here?

Member


Also I'm surprised it's so hard to format superscript in mpl?!

Member


Maybe scientific notation is good enough? e.g., "%1.1e" ?
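The `"%1.1e"` suggestion could be wired up with matplotlib's printf-style formatter, which sidesteps the custom `Formatter` subclass entirely; a minimal sketch:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from matplotlib import ticker

fig, ax = plt.subplots()
ax.set_xscale("log")

# printf-style scientific notation for the tick labels, e.g. 0.15 -> "1.5e-01".
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter("%1.1e"))
```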

Member Author


You are right, that simplifies code a lot.


def plot_performance_tradeoff(results, title):
    fig, ax = plt.subplots()
    markers = ["s", "o", "^", "x"]
Member


Nitpick: in the last graph it is a bit tricky to quickly tell apart the native and ordinal markers/error bars.
Would it be 'easy' to change the colour of the marker to be the same colour as the scatter points?

image

Member Author


I think that risks leading the reader to think that the marker is an actual scatter point.

Member


That's a good point.

Stupid question, for 'Ordinal', why is the black error bar (?) marker not centered between the horizontal and vertical error bars?

Member


Nevermind, I see that it is in the new rendered doc, for some reason the horizontal error bar was not rendering properly in the first one above

image

coeff_str = f"{coeff:.1f}x"

# Format exponent using Unicode superscripts
superscripts = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")
Member


Would math notation work? e.g., "$10^{%d}$"
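The mathtext alternative suggested here could look roughly like the following; the function name and the `coeff`/`exponent` decomposition are mine, mirroring the Unicode-superscript snippet above rather than the PR's final code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from matplotlib import ticker
import numpy as np

def log_tick_formatter(x, pos=None):
    """Render a positive tick value as mathtext, e.g. 0.15 -> $1.5 \\times 10^{-1}$."""
    exponent = int(np.floor(np.log10(x)))
    coeff = x / 10**exponent
    return rf"${coeff:.1f} \times 10^{{{exponent}}}$"

fig, ax = plt.subplots()
ax.set_xscale("log")
ax.xaxis.set_major_formatter(ticker.FuncFormatter(log_tick_formatter))
```

Mathtext lets matplotlib typeset the superscript itself, so no `str.maketrans` table of Unicode superscript characters is needed.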

@ArturoAmorQ
Member Author

I saw the comment in #31062 on adding TargetEncoder but maybe that could be left to another PR...?

Yes. I prefer reducing the scope of this PR.

Member

@lucyleeow lucyleeow left a comment


LGTM thanks!

Sorry I meant the words 'faster fitting'/'lower error' may be added to the x and y axis labels.
They do also make sense here, but maybe they are too long inside the graph and 'Best models' was better?

image

@betatim
Member

betatim commented Jul 17, 2025

I'll merge this for now, as it has two approvals and I am looking for something to merge to update the banner on scikit-learn.org.

@betatim betatim merged commit 588f396 into scikit-learn:main Jul 17, 2025
36 checks passed