WIP Adds Generalized Additive Models with Bagged Hist Gradient Boosting Trees #19914
Conversation
Follow up from the meeting: @lorentzenchr From reading @GaelVaroquaux's comments, I see GAMs as "interpretable" along the same lines as logistic regression's coefficients, which comes with the same pitfalls. The big difference is that GAMs (tree based or spline based) use a function to represent the conditional dependence. From the meeting, I mentioned that I went with this gradient boosted version of GAM because the other versions do not scale. There are newer versions of GAMs that scale, implemented by R's bam. The paper it is based on has been cited 209 times. I think moving forward, we can get the same type of behavior by adding …
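For reference, the generic GAM form under discussion (standard textbook definition, not taken from this thread):

$$ g\big(\mathbb{E}[y \mid x]\big) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p) $$

where $g$ is a link function and each $f_j$ is a univariate shape function, fitted either with splines or, as in this PR, with boosted trees restricted to a single feature.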
From the meeting, I mentioned that I went with this gradient boosted version of GAM because the other versions do not scale. There are newer versions of GAMs that scale, implemented by R's bam. The paper it is based on has been cited 209 times.

My view (echoing the FAQ on the inclusion criteria, https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) is that an algorithm implementing a classic model has more lenient inclusion criteria. So, I would say that if the algorithm of bam solves the exact same problem as classic GAMs, it is definitely eligible for inclusion. Do you know a few details about the algorithm?
I think adding support for interaction constraints to our histogram gradient boosting tree implementation would be interesting. E.g. each tree would be allowed to consider only 1 feature at a time, or 2 or 3 and so on, as an inductive bias to make the decision function easier to grasp: the partial dependence plots would more faithfully reflect the actual behavior of the decision function, even though they would still be problematic to interpret in the presence of correlated features (as is the case for linear models). We could also consider allowing users to pass specifications of groups of features that are allowed to interact with one another, but the API would be more complex to use.
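As a concrete illustration of the single-feature constraint described above, here is a minimal sketch using the interaction_cst parameter that was added to the histogram gradient boosting estimators in a later release (1.2), i.e. not available at the time of this comment:

```python
# Each constraint group holds a single feature index, so every tree can only
# split on one feature and the resulting ensemble is purely additive.
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=5, random_state=0)

additive_gbrt = HistGradientBoostingRegressor(
    interaction_cst=[[i] for i in range(X.shape[1])],
    random_state=0,
).fit(X, y)
```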
I wonder if fitting an interaction-limited GBRT, followed by fitting a 1d cubic spline approximation of each feature-wise decision function, followed by refitting a linear model on the spline features would not get the best of both worlds: smooth models with a scalable fitting procedure. Probably not for scikit-learn, but I wonder if the GAM community has explored this 2-stage fitting strategy.
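A rough sketch of that two-stage idea (not from this thread): fit an additive HGBT, read each feature-wise function off via partial dependence, smooth it with a cubic spline, then refit a linear model on the spline-evaluated features. It assumes a recent scikit-learn (interaction_cst, and the grid_values key returned by partial_dependence) plus SciPy:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=2000, n_features=4, noise=10.0, random_state=0)

# Stage 1: additive (interaction-limited) gradient boosting.
gbrt = HistGradientBoostingRegressor(
    interaction_cst=[[i] for i in range(X.shape[1])], random_state=0
).fit(X, y)

# Stage 2a: approximate each feature-wise decision function with a cubic spline.
splines = []
for i in range(X.shape[1]):
    pd = partial_dependence(gbrt, X, features=[i], grid_resolution=50)
    grid, avg = pd["grid_values"][0], pd["average"][0]
    splines.append(UnivariateSpline(grid, avg, k=3))

# Stage 2b: refit a linear model on the smoothed, spline-evaluated features.
S = np.column_stack([spl(X[:, i]) for i, spl in enumerate(splines)])
smooth_gam = RidgeCV().fit(S, y)
print(smooth_gam.score(S, y))
```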
Just as a cross-reference, the issue for HGBT interaction constraints: #19148.
Thanks, I was looking for this PR in the GitHub search but missed it for some reason (bad keywords). Edit: it's only an issue, I thought there was already a draft PR ;)
For reference, in a private exchange with Yannig Goude, he mentioned that the bam function (in R's mgcv package) is described in Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data. The main paper for the original implementation of the non-discretized bam function is Generalized additive models for large data sets. So I think this would be a good reference implementation to compare to if we ever decide to implement spline-based GAMs in scikit-learn.
Somewhat related is the "GA^2M" of Caruana et al. and the subsequent "Explainable Boosting Machine" in Microsoft's InterpretML, see Explainable Boosting in https://github.com/interpretml/interpret#citations. GBMs train on two-feature combinations (and single features), and those become the basis for the GAM. (And there's a ton of weirder stuff in that package's implementation, IIRC bagging the overall GAM, including an iterative fitting of the bases, using RFs as the gradient boosting base model, etc.)
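For anyone who wants to try the model class being referenced, a small usage sketch of InterpretML's EBM (import path and the interactions parameter are as documented for recent versions of that package; worth double-checking against its docs):

```python
# Tree-based GAM via InterpretML's Explainable Boosting Machine.
# interactions=0 keeps the model purely additive (no pairwise terms).
from interpret.glassbox import ExplainableBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=5, random_state=0)

ebm = ExplainableBoostingRegressor(interactions=0).fit(X, y)
print(ebm.predict(X[:5]))
```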
@bmreiniger Rich renamed the model from GA^2M to Explainable Boosting Machine; this is all the same work, I'd say. Also, I don't get the issue with the constant in the example, shouldn't the mean predictor be the first step?
This is sooo awesome!! I'd love to have it. I pinged Harsha, who works on the interpret-ml code; I hope he can have a look.
Hi everyone, I work on interpret, which has the most recent implementation of the boosted tree based GAMs referenced in those papers -- as Andreas mentioned, we just call them Explainable Boosting Machines now in this codebase. Rich and everyone else on our side were excited to see this PR, and we thought we might be able to help clarify some details or contribute to the discussion. For example, many of the algorithmic choices were made to make models identifiable or more explainable (e.g. enhancing smoothness of learned graphs or contributing uncertainty intervals to each shape function). We'd all be happy to talk through these details on a call sometime if @thomasjpfan or anyone else is interested!
I think the inclusion of spline-based GAMs clearly meets our inclusion criteria. Using bam as a reference, as suggested by @ogrisel in #19914 (comment), will be the starting point for that. There is still the question of inclusion for tree-based GAMs. The paper Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission, published in 2015, has 1k citations according to Google Scholar. Is this enough for inclusion? Would using benchmarks to show that tree-based GAMs perform better than spline-based GAMs be enough for inclusion? Moving forward, I plan to work on spline-based GAMs. I think it will be strange to include the tree-based ones without the spline-based ones.
Moving forward, I plan to work on spline-based GAMs. I think it will be strange to include the tree-based ones without the spline-based ones.
I think that this is a great resolution. For the question above, I cannot answer; such a question is always challenging. Thanks Thomas!!
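For context on the spline-based GAMs mentioned above, here is a minimal sketch built only from pieces that already exist in scikit-learn (SplineTransformer plus a penalized linear model). It only illustrates the model class being discussed, not the planned estimator or R's bam algorithm:

```python
# Minimal spline-based GAM sketch: expand each feature into a cubic B-spline
# basis, then fit a ridge-penalized linear model on the expanded features.
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

X, y = make_regression(n_samples=2000, n_features=5, noise=5.0, random_state=0)

spline_gam = make_pipeline(
    SplineTransformer(degree=3, n_knots=10),  # univariate basis per feature
    RidgeCV(alphas=[0.1, 1.0, 10.0]),
).fit(X, y)
print(spline_gam.score(X, y))
```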
I generally agree with @thomasjpfan, but I also think it's strange to include #21020 without including tree-based GAMs first; maybe we can have them all in the same release lol? Deciding to include #21020 while deciding tree-based GAMs are out of scope makes no sense to me, since #21020 is a generalization of the published and relatively well-understood tree-based GAMs of Rich.
I dare to close. It has been inactive for (rounded) 3 years.
This is very WIP!
There are many things left to be done here! Please do not review yet! Here is an example of the estimator in action!
There is still much to be done for this PR, but I want to see if this fits our inclusion criteria. Intelligible Models for Classification and Regression has 259 citations.
Reference Issues/PRs
Closes #3482
What does this implement/fix? Explain your changes.
This implementation builds on parts of HistGradientBoosting* by restricting the splitter to only split on one feature at a time. Intelligible Models for Classification and Regression shows that bagged gradient boosted trees outperform the other GAM fitting approaches (backfitting or splines). The tree-based approach was shown to train faster for bigger datasets.
Using the histograms is based on a follow-up paper, Accurate Intelligible Models with Pairwise Interactions. In this paper, they also describe a way to quickly obtain pairwise interactions. (This can be a follow-up PR.)
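This is not the fast pair-selection procedure from that paper, but as a sketch of how pairwise-limited trees can be expressed today with the interaction_cst parameter (added to scikit-learn later, in 1.2): every branch of a tree is then restricted to at most two features, giving univariate and bivariate terms only, in the spirit of GA^2M:

```python
# Allow every pair of features as a constraint group; single features are
# covered implicitly because each one appears in some pair.
from itertools import combinations
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=4, random_state=0)

pairs = [list(p) for p in combinations(range(X.shape[1]), 2)]
ga2m_like = HistGradientBoostingRegressor(
    interaction_cst=pairs, random_state=0
).fit(X, y)
```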
CC @amueller