
Generalized additive models (GAMs)? #3482

Open · olgabot opened this issue Jul 24, 2014 · 51 comments

@olgabot

olgabot commented Jul 24, 2014

Hello there,
Thanks for making this fantastic library. I use it every day in my bioinformatics research. We're developing a toolkit for single-cell RNA-seq analysis (http://github.com/yeolab/flotilla) and want to add all current state-of-the-art analyses. Unfortunately, most of these are in R. I can reimplement some of them, but they rely on certain R packages, in particular VGAM, aka Vector Generalized Linear and Additive Models. I've found a few mentions of GAMs here:

Has there been any update on creating these libraries?

@amueller
Member

There is pyearth: https://github.com/jcrudy/py-earth
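
For anyone landing here: py-earth follows the scikit-learn estimator API, so fitting a MARS model looks roughly like the sketch below (a minimal example assuming the package installs as `pyearth` and exposes the `Earth` estimator, as in its README; parameter names may differ between versions):

```python
# Minimal MARS sketch with py-earth (https://github.com/jcrudy/py-earth).
# Assumes the package exposes the scikit-learn-style `Earth` estimator;
# parameter names may differ across versions.
import numpy as np
from pyearth import Earth

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.randn(200)

model = Earth(max_degree=1)  # max_degree=1 keeps terms univariate, i.e. additive
model.fit(X, y)
print(model.summary())
y_pred = model.predict(X)
```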

@raghavrv
Member

Is there anything else related to GAMs that could potentially be added?

@amueller
Member

Not sure which of these are actually commonly used and scale to larger datasets.... Have you checked?

@raghavrv
Member

The original GAM paper by Hastie has around 1300 citations and seems to be of much interest to people, and the SpAM paper has around 200-odd... (not sure if citation count is the only important criterion for inclusion here, based on your recent addition to the FAQ)... I don't think LISO is that popular (I may be mistaken)...

And for SpAM, there is a paper from Yahoo Labs (AAAI 2015) by Wei Sun. From the application described in that paper, I think it may scale well to larger datasets...

@raghavrv
Member

raghavrv commented Feb 2, 2015

@amueller I am interested in working on SpAM... Given the above references, do you feel it will be of considerable interest to the devs? If so, I'll finish off my other PRs and start with SpAM.

@agramfort
Member

@fabianp and @eickenberg have an implementation of SpAM somewhere. Can you share it?

@eickenberg
Contributor

Yup, it's been a while since I've looked at this, and it is only a proof of concept; we provide it """as is""" :)

https://gist.github.com/eickenberg/fe7010b63a4196f849fa
https://gist.github.com/eickenberg/b54b5defead3df1fc761

@raghavrv
Member

raghavrv commented Feb 2, 2015

Thanks for sharing :)

@agramfort can I proceed with this at hand?

I am thinking of making a directory for all GAM-based models... so we can have something like:
/gam/spam.py for SpAM
/gam/gamlss.py for GAMLSS
/gam/earth.py for MARS (perhaps?)

@agramfort
Member

I guess. Just keep it small so you maximize your chances to get feedback quickly.

@GaelVaroquaux
Member

/gam/spam.py for SpAM
/gam/gamlss.py for GAMLSS
/gam/earth.py for MARS (perhaps?)

gam is a horrible acronym. It's impossible to google or to guess what this is.

@raghavrv
Member

raghavrv commented Feb 2, 2015

How about simply additive_models? (generalized_additive_models would be too long a directory name, I think... or let me know if that is not an issue.)

@GaelVaroquaux
Member

How about simply additive_models?

I like it. Just like 'linear_models' are really 'generalized_linear_models'.

@jnothman
Member

jnothman commented Feb 2, 2015

I think linear_models is already a bit long, and should probably have been linear. Should we just call this sklearn.additive?

@sreenivasraghavan71

I saw the inclusion of GAMs in GSoC 2015. I just want to know whether the project is assigned to anyone; otherwise, I would start developing the GAM models. Please reply.

@agramfort
Member

agramfort commented Jul 28, 2015 via email

@datnamer

@agramfort and @sreenivasraghavan71 there is a PR for statsmodels GAM here: statsmodels/statsmodels#2435

They are sorely in need of more code reviewers and maintainers! Help would probably be appreciated.

@smrtslckr

Curious about this myself. Noticed this wiki: https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2015 Is this currently in flight?

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 9, 2016 via email

@kingroryg

Does anyone know of an efficient workaround for using GAMs in Python?

@eickenberg
Contributor

Around what do you want to work?

Maybe it is better to ask this on the mailing list with reference to this issue.

@amueller
Member

@saru95 there's pyearth, if that helps.

@dswah

dswah commented May 10, 2017

@saru95, @amueller

I've written a Python implementation of GAMs using penalized B-splines, inspired heavily by Simon Wood and his mgcv CRAN package.

Check it out here: https://github.com/dswah/pyGAM
Feedback is most welcome.

I've included several model families, as well as link functions and distributions that you can mix and match to make very custom GAMs.
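
For readers finding this thread, a minimal sketch of that mix-and-match API (term and class names as in the pyGAM README at the time; check the current docs, as details may have changed):

```python
# Minimal pyGAM sketch: one penalized spline term per feature, terms combined with `+`.
# Names follow the pyGAM README; check the current docs for details.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.randn(300)

gam = LinearGAM(s(0, n_splines=10) + s(1, n_splines=10)).fit(X, y)
gam.summary()            # per-term effective degrees of freedom, significance, etc.
y_pred = gam.predict(X)
```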

@banilo
Contributor

banilo commented Oct 7, 2017

to me there is still a need for a good and standard GAM implementation in python like the gam package in R.

I completely agree. Any news on incorporating GAMs into the sklearn arsenal?

@juyanmei

juyanmei commented Nov 7, 2017

How can the SAM package be used for feature selection?

@mizukasai

Any updates on adding GAMs to sklearn?
They're pretty neat; I'm using them because they are very interpretable (with pyGAM in Python and gam in R).
I would be interested in GA²Ms too (GAMs with pairwise interactions).
Here's a paper by Lou et al. explaining GA²Ms:
http://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf

@dsaxton

dsaxton commented Aug 2, 2018

I'd like this feature a lot as well; it seems like including additive models could make sklearn more of a "one-stop shop."

@jnothman
Member

jnothman commented Aug 2, 2018 via email

@agramfort
Member

@dswah you could consider migrating your project to scikit-learn-contrib https://github.com/scikit-learn-contrib to get more visibility and expand the developer base.

Note that you used the GPL license, so this code cannot enter the sklearn code base.

I'd also really love to have a state-of-the-art GAM package in Python.

@jnothman
Member

jnothman commented Aug 2, 2018

There are some more general GLMs in #9405 too.

@amueller
Member

There's been a lot of interest in GAMs recently because of interpretability.
There's a cool variant, called EBM, proposed in https://www.cs.cornell.edu/~yinlou/papers/lou-kdd12.pdf that uses gradient-boosted trees on single variables (as opposed to splines).

The EBM paper has 205 citations, which is not that great but narrowly passes our threshold. An application paper has over 600 citations, though:
http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf

There's been a more recent evaluation here:
https://arxiv.org/abs/2006.06466

disclaimer: this work partially comes out of MS. I had put @thomasjpfan on GAMs previously, though we had looked more at the classical spline stuff.

The gradient boosting variant might be more straightforward to implement given what we already have (they are using binning).
There's also a C++ implementation of this in interpret.ml: https://github.com/interpretml/interpret

https://github.com/interpretml/interpret/blob/master/python/interpret-core/interpret/glassbox/ebm/ebm.py
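
For reference, fitting an EBM through the interpret package linked above looks roughly like this (class and parameter names as in the interpretml docs; treat it as a sketch and check the current API):

```python
# Rough EBM sketch via the interpret package referenced above.
# Class and parameter names follow the interpretml docs and may change.
import numpy as np
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(1000, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.randn(1000)

# interactions=0 keeps the model purely additive (a GAM); the default also
# learns a small number of pairwise interaction terms (GA²M-style).
ebm = ExplainableBoostingRegressor(interactions=0)
ebm.fit(X, y)
global_explanation = ebm.explain_global()  # per-feature shape functions
```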

@amueller
Member

@NicolasHug might also be interested insofar as this is a small extension of HistGradientBoosting ;)

@rth
Member

rth commented Jul 17, 2020

we had looked more at the classical spline stuff.

On that, there was also the more recent #17027 proposal.

@eickenberg
Contributor

eickenberg commented Jul 17, 2020 via email

@amueller
Member

@eickenberg I assume you mean splines, not additive gradient boosting?

@eickenberg
Contributor

Indeed, but the gradient boosting variant looks interesting and not that hard to add.

@eickenberg
Contributor

(Looks like it's a hot topic again: https://arxiv.org/abs/2004.13912; @eugenium pointed me to this.)

@agramfort
Member

agramfort commented Jul 23, 2020 via email

@eickenberg
Contributor

Here are mine:

https://gist.github.com/eickenberg/fe7010b63a4196f849fa
https://gist.github.com/eickenberg/b54b5defead3df1fc761

This one does a round-robin coordinate descent.

@banilo
Contributor

banilo commented Jul 23, 2020 via email

@thomasjpfan
Member

  • Looking at R's mgcv and the related book "Generalized Additive Models" by Wood (2017, second edition), the "hard" part is smoothing parameter selection, which can be done using REML (restricted maximum likelihood) or GCV (generalized cross-validation). The Wood book recommends REML because it "tends to be more resistant to occasional severe over-fitting". Here is a reference on using REML for smoothing parameter selection by Wood. (REML is the default in mgcv.)

  • For fitting, mgcv uses penalized iteratively re-weighted least squares (PIRLS) instead of backfitting, because backfitting does not integrate the smoothness estimates efficiently.

  • The boosted-trees approach looks very promising compared to splines.

Who would be interested in reviewing a PR for GAMs? :)
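
For readers unfamiliar with the term, classical backfitting just cycles over features and refits each smoother to the partial residuals. Below is a minimal illustrative sketch using scipy's univariate smoothing splines; it is not the PIRLS + REML scheme mgcv uses, and the helper name `backfit` is made up for this example:

```python
# Minimal classical backfitting sketch for an additive model y = alpha + sum_j f_j(x_j).
# Uses a fixed-smoothing univariate spline per feature (scipy). This is NOT the
# PIRLS + REML scheme discussed above; it only illustrates the term "backfitting".
# The helper name `backfit` is made up for this example.
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit(X, y, n_iter=20, smooth=None):
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                      # current estimates of each f_j at the data points
    order = [np.argsort(X[:, j]) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove everything except the j-th smooth
            r = y - alpha - f.sum(axis=1) + f[:, j]
            idx = order[j]
            spline_j = UnivariateSpline(X[idx, j], r[idx], s=smooth)
            f[:, j] = spline_j(X[:, j])
            f[:, j] -= f[:, j].mean()         # center each component for identifiability
    return alpha, f

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.randn(400)
alpha, f = backfit(X, y)
```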

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 26, 2020 via email

@amueller
Member

@GaelVaroquaux did you see the comment about the gradient boosting version?

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 27, 2020 via email

@NicolasHug
Member

Who would be interested in reviewing a PR for GAMs?

I would, but I want to review the categorical features first :p

@lorentzenchr
Member

I'm also in favor of GAMs, though I see them more in the context of GLMs: "GLM + spline + penalty = GAM". The tricky part, in my opinion, is not the solvers but the handling of the features, i.e. the API. In R's mgcv you can specify the spline type and penalty for each feature by name.
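
A rough sketch of that "GLM + spline + penalty = GAM" equation with existing scikit-learn pieces, assuming a version that provides SplineTransformer; the plain ridge penalty is only a stand-in for a proper smoothness penalty, and the per-column setup is just one hypothetical answer to the per-feature specification question:

```python
# "GLM + spline + penalty = GAM" with existing scikit-learn pieces.
# Assumes a scikit-learn version that provides SplineTransformer; Ridge's plain
# L2 penalty is only a stand-in for a real second-derivative smoothness penalty,
# and the per-column setup is a hypothetical answer to the API question above.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + 0.2 * rng.randn(500)

# Each feature gets its own spline basis (here with different numbers of knots).
expand = ColumnTransformer([
    ("f0", SplineTransformer(n_knots=10, degree=3), [0]),
    ("f1", SplineTransformer(n_knots=5, degree=3), [1]),
])
gam_like = make_pipeline(expand, Ridge(alpha=1.0))
gam_like.fit(X, y)
```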

@lorentzenchr
Member

Currently, there are two proposals on the table:

  1. Spline-based GAMs similar to R's mgcv, see also WIP Adds Generalized Additive Models with Bagged Hist Gradient Boosting Trees #19914 (comment).
    According to WIP Adds Generalized Additive Models with Bagged Hist Gradient Boosting Trees #19914 (comment), @thomasjpfan might give it a shot.
    Also note that [MRG] Common Private Loss Module with tempita #20567 might come in handy for new solvers of such GLMs, e.g. coordinate descent.
  2. GB trees on single variables. I think this can be achieved with interaction constraints for HGBT in ENH FEA add interaction constraints to HGBT #21020.

I wonder how great a combination of the two would be: being able to specify, for each feature, whether to use smooth splines or trees (and which to treat as categorical)?
This leads me again to the, in my opinion, more important but still open point of API design.
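
To make proposal 2 concrete: with interaction constraints available, a tree-based GAM is essentially one feature per interaction group. A sketch assuming a scikit-learn version whose HistGradientBoostingRegressor exposes an `interaction_cst` parameter:

```python
# Tree-based GAM sketch: each feature gets its own interaction group, so every
# tree may only split on a single feature and the model stays additive in link space.
# Assumes a scikit-learn version whose HistGradientBoostingRegressor exposes
# the `interaction_cst` parameter.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(1000, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.randn(1000)

tree_gam = HistGradientBoostingRegressor(
    interaction_cst=[{0}, {1}, {2}],  # one group per feature => no interactions
    max_depth=3,
)
tree_gam.fit(X, y)
```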

@amueller
Member

Re 2, I would rather have #19914 first, in particular because it does cyclic boosting and is built on well-cited work. I think doing #21020 is also nice, but I think it's less well understood in terms of interpretability.
I'd be OK with doing something as general as #21020 as long as the solution makes it easy to get something equivalent to the explainable boosting machine algorithm.

Re API:
Sorry I haven't been in the loop on this much, but for HistGradientBoosting we have a boolean mask/indices for categorical variables in the constructor, so that is the current band-aid solution for categorical variables.
I think we should go with that for now, as a proper solution seems pretty hard. The only way is passing feature-aligned metadata, either in an extra data structure (feature_props?) or more likely using DataFrames (and then there's the question of whether we want to use custom metadata or the pandas dtypes).

@lorentzenchr
Member

lorentzenchr commented Oct 16, 2021

For tree-based GAMs, the point is that interaction constraints let us specify feature-wise additivity in link space (or allow for pairwise interactions…). That's the big step for interpretability.
The cyclic boosting is then something on top that might improve certain aspects of the model (or not).
Therefore, I consider it a logical implementation plan to start with #21020.

@amueller
Member

@lorentzenchr from an API perspective, having general interaction constraints seems more complex than what's standard in GAMs, and I guess it's a question of whether we want separate classes or not.
I'm fine with having options for interaction constraints and cyclic boosting, but it will make the approach much less discoverable, and also harder to use the method from the literature, because it requires setting multiple parameters to specific values.

It seemed a bit weird to me in terms of inclusion criteria (what's the reference for general interaction constraints?). I guess we're now weighing what the implementations do more heavily than what the literature does, which is an option but not something we have decided on.

In the end, as long as we have easy ways for users to discover how to do gradient-boosting GAMs that are in accordance with what's empirically and academically validated, then that's great.

@lorentzenchr
Member

If we have all the pieces together, I'm not opposed to the idea of a new class that sets some default parameters so that boosted tree-based GAMs become more easily available.

@UnixJunkie
Contributor

The best R package these days is apparently 'mgcv'.
There is pyGAM in Python, but I am not sure there is significant maintenance power behind it:
https://github.com/dswah/pyGAM

Look here for a nice introduction to generalized additive models and some incentive:
https://multithreaded.stitchfix.com/assets/files/gam.pdf

Seems quite yummy as a regression model (i.e. non-linear but still explainable).
Maybe only two hyperparameters need to be optimized explicitly (the number of underlying splines and the regularization parameter).
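
For those two hyperparameters, pyGAM exposes a `gridsearch` helper; a rough sketch (method and argument names as in the pyGAM docs, so verify against the current API):

```python
# Sketch of tuning the two hyperparameters mentioned above with pyGAM:
# the number of splines per term and the smoothing penalty `lam`.
# Method and argument names follow the pyGAM docs and may have changed.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.randn(300)

gam = LinearGAM(s(0, n_splines=15) + s(1, n_splines=15))
# gridsearch selects `lam` over the supplied grid (by GCV/UBRE by default)
gam = gam.gridsearch(X, y, lam=np.logspace(-3, 3, 11))
```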
