
Generalized additive models (GAMs)? #3482

Open · olgabot opened this issue Jul 24, 2014 · 51 comments

@olgabot

olgabot commented Jul 24, 2014

Hello there,
Thanks for making this fantastic library. I use it every day in my bioinformatics research. We're developing a toolkit for single-cell RNA-seq analysis (http://github.com/yeolab/flotilla) and want to add all current state-of-the-art analyses. Unfortunately, most of these are in R. I can reimplement some of them, but they rely on certain R packages, in particular VGAM, aka Vector Generalized Linear and Additive Models. I've found a few mentions of GAMs here:

Has there been any update on creating these libraries?

@amueller
Member

There is pyearth: https://github.com/jcrudy/py-earth
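
For anyone landing here: py-earth follows the scikit-learn estimator API, so fitting a MARS model looks roughly like the sketch below (a minimal example assuming the package installs as `pyearth` and exposes the `Earth` estimator, as in its README; parameter names may differ between versions):

```python
# Minimal MARS sketch with py-earth (https://github.com/jcrudy/py-earth).
# Assumes the package exposes the scikit-learn-style `Earth` estimator;
# parameter names may differ across versions.
import numpy as np
from pyearth import Earth

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.randn(200)

model = Earth(max_degree=1)  # max_degree=1 keeps terms univariate, i.e. additive
model.fit(X, y)
print(model.summary())
y_pred = model.predict(X)
```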

@raghavrv
Member

Is there anything else related to GAMs that could potentially be added?

@amueller
Member

Not sure which of these are actually commonly used and scale to larger datasets.... Have you checked?

@raghavrv
Member

The original GAM paper by Hastie has around 1300 citations and seems to be of much interest to people, and the SpAM paper has around 200-odd... (not sure if citation count is the only important criterion for inclusion here, based on your recent addition to the FAQ)... I don't think LISO is that popular (I may be mistaken)...

And for SpAM, there is a paper from Yahoo Labs (AAAI 2015) by Wei Sun. From the application described in that paper, I think it may scale well to larger datasets...

@raghavrv
Member

raghavrv commented Feb 2, 2015

@amueller I am interested in working on SpAM... Given the above references, do you feel it will be of considerable interest to the devs? If so, I'll finish off my other PRs and start with SpAM.

@agramfort
Member

@fabianp and @eickenberg have an implementation of SpAM somewhere. Can you share it?

@eickenberg
Contributor

Yup, it's been a while since I've looked at this, and it is only a proof of concept; we provide it """as is""" :)

https://gist.github.com/eickenberg/fe7010b63a4196f849fa
https://gist.github.com/eickenberg/b54b5defead3df1fc761

@raghavrv
Member

raghavrv commented Feb 2, 2015

Thanks for sharing :)

@agramfort can I proceed with this at hand?

I am thinking of making a directory for all GAM-based models... so we can have something like:
/gam/spam.py for SpAM
/gam/gamlss.py for GAMLSS
/gam/earth.py for MARS (perhaps?)

@agramfort
Member

I guess. Just keep it small so you maximize your chances to get feedback quickly.

@GaelVaroquaux
Member

/gam/spam.py for SpAM
/gam/gamlss.py for GAMLSS
/gam/earth.py for MARS (perhaps?)

gam is a horrible acronym. It's impossible to google or to guess what this is.

@raghavrv
Member

raghavrv commented Feb 2, 2015

How about simply additive_models? (generalized_additive_models would be too long a directory name, I think... or let me know if that is not an issue.)

@GaelVaroquaux
Member

How about simply additive_models?

I like it. Just like 'linear_models' are really 'generalized_linear_models'.

@jnothman
Member

jnothman commented Feb 2, 2015

I think linear_models is already a bit long, and should probably have been linear. Should we just call this sklearn.additive?

@sreenivasraghavan71

I saw the inclusion of GAMs in GSoC 2015. I just want to know whether the project is assigned to anyone; otherwise, I would start developing the GAM models. Please reply.

@agramfort
Member

agramfort commented Jul 28, 2015 via email

@datnamer

@agramfort and @sreenivasraghavan71 there is a PR for statsmodels GAM here: statsmodels/statsmodels#2435

They are sorely in need of more code reviewers and maintainers! Help would probably be appreciated.

@smrtslckr

Curious about this myself. Noticed this wiki: https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2015 Is this currently in flight?

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 9, 2016 via email

@kingroryg

Does anyone know of an efficient workaround for using GAMs in Python?

@eickenberg
Contributor

Around what do you want to work?

Maybe it is better to ask this on the mailing list with reference to this issue.

@amueller
Member

@saru95 there's pyearth, if that helps.

@dswah

dswah commented May 10, 2017

@saru95, @amueller

I've written a Python implementation of GAMs using penalized B-splines, inspired heavily by Simon Wood and his mgcv CRAN package.

Check it out here: https://github.com/dswah/pyGAM
Feedback is most welcome.

I've included several model families, as well as link functions and distributions that you can mix and match to make very custom GAMs.
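
For readers finding this thread, a minimal sketch of that mix-and-match API (term and class names as in the pyGAM README at the time; check the current docs, as details may have changed):

```python
# Minimal pyGAM sketch: one penalized spline term per feature, terms combined with `+`.
# Names follow the pyGAM README; check the current docs for details.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.randn(300)

gam = LinearGAM(s(0, n_splines=10) + s(1, n_splines=10)).fit(X, y)
gam.summary()            # per-term effective degrees of freedom, significance, etc.
y_pred = gam.predict(X)
```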

@banilo
Contributor

banilo commented Oct 7, 2017

to me there is still a need for a good and standard GAM implementation in python like the gam package in R.

I completely agree. Any news on incorporating GAMs into the sklearn arsenal?

@juyanmei

juyanmei commented Nov 7, 2017

How can the SAM package be used for feature selection?

@mizukasai

Any updates on adding GAMs to sklearn?
They're pretty neat; I'm using them because they are very interpretable (with pyGAM in Python and gam in R).
I would be interested in GA²Ms too (GAMs with pairwise interactions).
Here's a paper by Lou et al. explaining GA²Ms:
http://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf

@dsaxton

dsaxton commented Aug 2, 2018

I'd like this feature a lot as well; it seems like including additive models could make sklearn more of a "one-stop shop."

@jnothman
Member

jnothman commented Aug 2, 2018 via email

@agramfort
Member

@dswah you could consider migrating your project to scikit-learn-contrib https://github.com/scikit-learn-contrib to get more visibility and expand the developer base.

Note that you used the GPL license, so this code cannot enter the sklearn code base.

I'd also really love to have a state-of-the-art GAM package in Python.

@jnothman
Member

jnothman commented Aug 2, 2018

There are some more general GLMs in #9405 too.

@amueller
Member

There's been a lot of interest in GAMs recently because of interpretability.
There's a cool variant, called EBM, proposed in https://www.cs.cornell.edu/~yinlou/papers/lou-kdd12.pdf that uses gradient-boosted trees on single variables (as opposed to splines).

The EBM paper has 205 citations, which is not that great but narrowly passes our threshold. An application paper has over 600 citations, though:
http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf

There's been a more recent evaluation here:
https://arxiv.org/abs/2006.06466

disclaimer: this work partially comes out of MS. I had put @thomasjpfan on GAMs previously, though we had looked more at the classical spline stuff.

The gradient boosting variant might be more straightforward to implement given what we already have (they are using binning).
There's also a C++ implementation of this in interpret.ml: https://github.com/interpretml/interpret

https://github.com/interpretml/interpret/blob/master/python/interpret-core/interpret/glassbox/ebm/ebm.py
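
For reference, fitting an EBM through the interpret package linked above looks roughly like this (class and parameter names as in the interpretml docs; treat it as a sketch and check the current API):

```python
# Rough EBM sketch via the interpret package referenced above.
# Class and parameter names follow the interpretml docs and may change.
import numpy as np
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(1000, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.randn(1000)

# interactions=0 keeps the model purely additive (a GAM); the default also
# learns a small number of pairwise interaction terms (GA²M-style).
ebm = ExplainableBoostingRegressor(interactions=0)
ebm.fit(X, y)
global_explanation = ebm.explain_global()  # per-feature shape functions
```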

@amueller
Member

@NicolasHug might also be interested insofar as this is a small extension of HistGradientBoosting ;)

@rth
Member

rth commented Jul 17, 2020

we had looked more at the classical spline stuff.

On that, there was also the more recent #17027 proposal.

@eickenberg
Contributor

eickenberg commented Jul 17, 2020 via email

@amueller
Member

@eickenberg I assume you mean splines, not additive gradient boosting?

@eickenberg
Contributor

Indeed, but the gradient boosting variant looks interesting and not that hard to add.

@eickenberg
Contributor

(Looks like it's a hot topic again: https://arxiv.org/abs/2004.13912; @eugenium pointed me to this.)

@agramfort
Member

agramfort commented Jul 23, 2020 via email

@eickenberg
Contributor

Here are mine:

https://gist.github.com/eickenberg/fe7010b63a4196f849fa
https://gist.github.com/eickenberg/b54b5defead3df1fc761

This one does a round-robin coordinate descent.

@banilo
Contributor

banilo commented Jul 23, 2020 via email

@thomasjpfan
Member

  • Looking at R's mgcv and the related book "Generalized Additive Models" by Wood (2017, second edition), the "hard" part is smoothing parameter selection, which can be done using REML (restricted maximum likelihood) or GCV (generalized cross-validation). The Wood book recommends REML because it "tends to be more resistant to occasional severe over-fitting". Here is a reference on using REML for smoothing parameter selection by Wood. (REML is the default in mgcv.)

  • For fitting, mgcv uses penalized iteratively re-weighted least squares (PIRLS) instead of backfitting, because backfitting does not integrate the smoothness estimates efficiently.

  • The boosted-trees approach looks very promising compared to splines.

Who would be interested in reviewing a PR for GAMs? :)
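
For readers unfamiliar with the term, classical backfitting just cycles over features and refits each smoother to the partial residuals. Below is a minimal illustrative sketch using scipy's univariate smoothing splines; it is not the PIRLS + REML scheme mgcv uses, and the helper name `backfit` is made up for this example:

```python
# Minimal classical backfitting sketch for an additive model y = alpha + sum_j f_j(x_j).
# Uses a fixed-smoothing univariate spline per feature (scipy). This is NOT the
# PIRLS + REML scheme discussed above; it only illustrates the term "backfitting".
# The helper name `backfit` is made up for this example.
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit(X, y, n_iter=20, smooth=None):
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                      # current estimates of each f_j at the data points
    order = [np.argsort(X[:, j]) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove everything except the j-th smooth
            r = y - alpha - f.sum(axis=1) + f[:, j]
            idx = order[j]
            spline_j = UnivariateSpline(X[idx, j], r[idx], s=smooth)
            f[:, j] = spline_j(X[:, j])
            f[:, j] -= f[:, j].mean()         # center each component for identifiability
    return alpha, f

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.randn(400)
alpha, f = backfit(X, y)
```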

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 26, 2020 via email

@amueller
Member

@GaelVaroquaux did you see the comment about the gradient boosting version?

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 27, 2020 via email

@NicolasHug
Member

Who would be interested in reviewing a PR for GAMs?

I would, but I want to review the categorical features first :p

@lorentzenchr
Member

I'm also in favor of GAMs, though I see them more in the context of GLMs: "GLM + spline + penalty = GAM". The tricky part, in my opinion, is not the solvers but the handling of the features, i.e. the API. In R's mgcv you can specify the spline type and penalty for each feature by name.
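
A rough sketch of that "GLM + spline + penalty = GAM" equation with existing scikit-learn pieces, assuming a version that provides SplineTransformer; the plain ridge penalty is only a stand-in for a proper smoothness penalty, and the per-column setup is just one hypothetical answer to the per-feature specification question:

```python
# "GLM + spline + penalty = GAM" with existing scikit-learn pieces.
# Assumes a scikit-learn version that provides SplineTransformer; Ridge's plain
# L2 penalty is only a stand-in for a real second-derivative smoothness penalty,
# and the per-column setup is a hypothetical answer to the API question above.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + 0.2 * rng.randn(500)

# Each feature gets its own spline basis (here with different numbers of knots).
expand = ColumnTransformer([
    ("f0", SplineTransformer(n_knots=10, degree=3), [0]),
    ("f1", SplineTransformer(n_knots=5, degree=3), [1]),
])
gam_like = make_pipeline(expand, Ridge(alpha=1.0))
gam_like.fit(X, y)
```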

@lorentzenchr
Member

Currently, there are two proposals on the table:

  1. Spline-based GAMs similar to R's mgcv, see also WIP Adds Generalized Additive Models with Bagged Hist Gradient Boosting Trees #19914 (comment).
    According to WIP Adds Generalized Additive Models with Bagged Hist Gradient Boosting Trees #19914 (comment), @thomasjpfan might give it a shot.
    Also note that [MRG] Common Private Loss Module with tempita #20567 might come in handy for new solvers of such GLMs, e.g. coordinate descent.
  2. GB trees on single variables. I think this can be achieved with interaction constraints for HGBT in ENH FEA add interaction constraints to HGBT #21020.

I wonder how great a combination of the two would be: being able to specify, for each feature, whether to use smooth splines or trees (and which to treat as categorical)?
This leads me again to the, in my opinion, more important but still open point of API design.
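
To make proposal 2 concrete: with interaction constraints available, a tree-based GAM is essentially one feature per interaction group. A sketch assuming a scikit-learn version whose HistGradientBoostingRegressor exposes an `interaction_cst` parameter:

```python
# Tree-based GAM sketch: each feature gets its own interaction group, so every
# tree may only split on a single feature and the model stays additive in link space.
# Assumes a scikit-learn version whose HistGradientBoostingRegressor exposes
# the `interaction_cst` parameter.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(1000, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.randn(1000)

tree_gam = HistGradientBoostingRegressor(
    interaction_cst=[{0}, {1}, {2}],  # one group per feature => no interactions
    max_depth=3,
)
tree_gam.fit(X, y)
```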

@amueller
Member

Re 2, I would rather have #19914 first, in particular because it does cyclic boosting and is built on well-cited work. I think doing #21020 is also nice, but I think it's less well understood in terms of interpretability.
I'd be OK with doing something as general as #21020 as long as the solution makes it easy to get something equivalent to the explainable boosting machine algorithm.

Re API:
Sorry I haven't been in the loop on this much, but for HistGradientBoosting we have a boolean mask/indices for categorical variables in the constructor, so that is the current band-aid solution for categorical variables.
I think we should go with that for now, as a proper solution seems pretty hard. The only way is passing feature-aligned metadata, either in an extra data structure (feature_props?) or more likely using DataFrames (and then there's the question of whether we want to use custom metadata or the pandas dtypes).

@lorentzenchr
Member

lorentzenchr commented Oct 16, 2021

For tree-based GAMs, the point is that interaction constraints let us specify feature-wise additivity in link space (or allow for pairwise interactions…). That's the big step for interpretability.
The cyclic boosting is then something on top that might improve certain aspects of the model (or not).
Therefore, I consider it a logical implementation plan to start with #21020.

@amueller
Member

@lorentzenchr from an API perspective, having general interaction constraints seems more complex than what's standard in GAMs, and I guess it's a question of whether we want separate classes or not.
I'm fine with having options for interaction constraints and cyclic boosting, but it will make the approach much less discoverable, and also harder to use the method from the literature, because it requires setting multiple parameters to specific values.

It seemed a bit weird to me in terms of inclusion criteria (what's the reference for general interaction constraints?). I guess we're now weighing what the implementations do more heavily than what the literature does, which is an option but not something we have decided on.

In the end, as long as we have easy ways for users to discover how to do gradient-boosting GAMs that are in accordance with what's empirically and academically validated, then that's great.

@lorentzenchr
Member

If we have all the pieces together, I'm not opposed to the idea of a new class that sets some default parameters so that boosted tree-based GAMs become more easily available.

@UnixJunkie
Contributor

The best R package these days is apparently 'mgcv'.
There is pyGAM in Python, but I am not sure there is significant maintenance power behind it:
https://github.com/dswah/pyGAM

Look here for a nice introduction to generalized additive models and some incentive:
https://multithreaded.stitchfix.com/assets/files/gam.pdf

Seems quite yummy as a regression model (i.e. non-linear but still explainable).
Maybe only two hyperparameters need to be optimized explicitly (the number of underlying splines and the regularization parameter).
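
For those two hyperparameters, pyGAM exposes a `gridsearch` helper; a rough sketch (method and argument names as in the pyGAM docs, so verify against the current API):

```python
# Sketch of tuning the two hyperparameters mentioned above with pyGAM:
# the number of splines per term and the smoothing penalty `lam`.
# Method and argument names follow the pyGAM docs and may have changed.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.randn(300)

gam = LinearGAM(s(0, n_splines=15) + s(1, n_splines=15))
# gridsearch selects `lam` over the supplied grid (by GCV/UBRE by default)
gam = gam.gridsearch(X, y, lam=np.logspace(-3, 3, 11))
```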
