Poisson, gamma and tweedie family of loss functions #5975
I think we should at least add a Poisson regression, though I'm not super familiar with it. Can you elaborate on the offset? These would all be separate models in |
The Poisson distribution is widely used for modeling count data. It can be shown to be the limiting distribution of a binomial where the number of trials n goes to infinity and the probability p goes to zero, both at such a rate that np stays equal to some mean frequency for your process. The gamma distribution can be shown theoretically to be the waiting time until a Poisson event occurs. So, for example, the number of accidents you'll have this year can be shown theoretically to be Poisson, and the time until your next accident, or your third, etc., follows a gamma distribution. Tweedie is a generalized parent of these distributions that allows for additional weight on zero. Think of Tweedie as modeling loss dollars where 99 percent of all customers have zero loss and the rest have a long-tailed positive loss (gamma). In practice these distributions are widely used for regression problems in insurance, hazard modeling, disaster models, finance, economics and social sciences. Feel free to reference Wikipedia.
I'd like to have these loss functions as choices in glmnet, GBM and random forest. This means that in GBM, for example, Friedman's boosting algorithm would use this loss instead of Gaussian or quantile loss. Gamma and Poisson (Tweedie in beta) are already in R's gbm and glm packages, and xgboost has some support.
Offsets are used by practitioners to weight their data by exposure. Typically a Poisson model has a link function, e.g. yhat = exposure × exp(regression model output), which is called the log link and is the most common; the offset is log(exposure), added to the regression model output before exponentiating. Here the offset allows exposure to be captured differently for different units of observation. Poisson processes are additive, but different examples may have been taken over unequal space, time or customer counts, and hence an offset is needed for each observation.
I'm willing to tackle programming this, but I'm not super familiar with the API, so I'd appreciate suggestions so that I do this right and get it rolled into the release. |
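To make the log link with an offset concrete, here is a minimal sketch in plain Python. The function names `predict_counts` and `poisson_nll` and the example numbers are illustrative assumptions, not part of any library. It shows that identical model output gives expected counts that scale with exposure, and that adding `log(exposure)` as an offset inside the exponential is the same thing.

```python
import math


def predict_counts(linear_predictor, exposure):
    """Expected counts under a log link: mu = exposure * exp(eta)."""
    return [e * math.exp(eta) for eta, e in zip(linear_predictor, exposure)]


def poisson_nll(y, mu):
    """Poisson negative log-likelihood, including the log(y!) term."""
    return sum(
        m - yi * math.log(m) + math.lgamma(yi + 1) for yi, m in zip(y, mu)
    )


# Same model output for every unit, but observed over different exposures
# (for example 1, 2 and 12 months). These numbers are made up.
eta = [-2.0, -2.0, -2.0]
exposure = [1.0, 2.0, 12.0]

mu = predict_counts(eta, exposure)

# Equivalent formulation: offset = log(exposure) added to the linear predictor.
offset = [math.log(e) for e in exposure]
mu_via_offset = [math.exp(o + h) for o, h in zip(offset, eta)]

nll = poisson_nll([0, 1, 2], mu)
```

The two formulations agree because exp(log(exposure) + eta) = exposure × exp(eta); the offset is simply a term in the linear predictor whose coefficient is fixed at 1.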
Okay, I'm working on implementing this. I'm adding the three distributions noted above, plus offsets. I'd appreciate feedback from the general sklearn audience on how to implement the offsets. I'm planning to add a new argument, 'offset=None', to GradientBoostingRegressor, where offset would be a vector-like object of length n (the number of samples). My main question is whether I should add offsets to all loss functions (Gaussian, Huber, Quantile), as is done in R's GBM implementation, or whether I should just enable offsets for the Tweedie family and throw a warning if you try to use an offset with an unsupported loss function? |
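As a rough illustration of what an offset-aware boosting loss would need to provide, here is a sketch in plain Python. The class `PoissonLossWithOffset` and its method names are hypothetical and do not exist in scikit-learn; the sketch only assumes the Poisson negative log-likelihood with a log link, where the offset is added to the raw prediction.

```python
import math


class PoissonLossWithOffset:
    """Hypothetical loss object for boosting; not part of scikit-learn."""

    def init_raw_prediction(self, y, offset):
        # Constant c minimizing sum(exp(offset_i + c) - y_i * (offset_i + c)),
        # which gives exp(c) = sum(y) / sum(exp(offset)).
        return math.log(sum(y) / sum(math.exp(o) for o in offset))

    def loss(self, y, raw, offset):
        # Negative log-likelihood up to the constant log(y!) term.
        return sum(
            math.exp(o + r) - yi * (o + r) for yi, r, o in zip(y, raw, offset)
        )

    def negative_gradient(self, y, raw, offset):
        # Pseudo-residuals a base learner would be fitted to: y - mu,
        # with mu = exp(offset + raw).
        return [yi - math.exp(o + r) for yi, r, o in zip(y, raw, offset)]


# Made-up counts and exposures for illustration.
y = [0, 3, 1, 7]
exposure = [0.5, 2.0, 1.0, 4.0]
offset = [math.log(e) for e in exposure]

poisson_loss = PoissonLossWithOffset()
c = poisson_loss.init_raw_prediction(y, offset)
raw = [c] * len(y)
residuals = poisson_loss.negative_gradient(y, raw, offset)
```

At the initial constant the pseudo-residuals sum to zero, which is a quick sanity check that the offset enters the initialization and the gradient consistently.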
I was more asking for practical use cases, as in datasets or publications. I know what the distributions do ;) It would probably be a good addition, though I can't guarantee you that your contribution will be merged. It would probably be good to discuss it more broadly before you jump in. Unless you just want to implement it for yourself and don't care if we merge it ;) So I take it you are mostly interested in gradient boosting, not linear models? |
I'm interested in integrating it all around but mostly tree ensembles
|
It would be nice to have some input from our GBM experts that I pinged above, but you can also write to the mailing list asking if people would be interested in the addition. |
Do you also plan to support coordinate descent solvers with L1/L2 penalties? |
Any updates on this issue? I'd like to see a Poisson regression added into |
no one AFAIK.
don't hesitate to give it a try and share a WIP implementation.
|
I was going to work on this and am still going to. If I do it though I
|
@thenomemac Do you have a WIP implementation that I could take a look at? For offsets, couldn't you just require the user to divide their counts through by the offset/exposure to make the "y" value a rate instead of a count (https://en.wikipedia.org/wiki/Poisson_regression#.22Exposure.22_and_offset)? R's GLM package has a great interface for specifying models (including specifying offsets) but not sure how that would fit into the existing linear models API. |
@bjlkeng I don't have a complete WIP implementation yet. I started a while back, then got distracted. I don't think dividing through by exposures to get a Poisson rate is a correct derivation of the GBM algorithm for Poisson loss. The offset = log(exposure) in the gradient is an additive factor, so you're effectively giving more weight to "areas" with a higher exposure. I'm not 100% sure you can get back to the correct gradient at each iteration of fitting the base learner (tree), because the current scheme for passing weights to the tree learner is multiplicative, not additive. I'll try to type up a more rigorous mathematical derivation of what I'm referring to. I can say that in practice it does matter. In real-world data sets I've modeled where you'd expect the count data to be Poisson, R's gbm converges faster and to a better outcome because it handles offsets the "mathematically" correct way. Other gbm implementations, such as xgboost with a Poisson loss function, can't model the data as well using a Poisson rate as suggested. |
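The difference between these formulations can be checked numerically. The sketch below, in plain Python with made-up numbers and hypothetical function names, compares the gradient of the Poisson loss with respect to the raw prediction f under three setups: counts with offset log(exposure), rates y/exposure weighted by exposure, and rates with no weights. It concerns only the loss itself; whether a particular boosting implementation passes weights through to the tree learner in a way that preserves this is a separate question.

```python
import math


def grad_offset(y, f, exposure):
    # d/df of exp(log(e) + f) - y * (log(e) + f)  =  e * exp(f) - y
    return [e * math.exp(fi) - yi for yi, fi, e in zip(y, f, exposure)]


def grad_rate(y, f, exposure, weighted):
    # d/df of w * (exp(f) - (y / e) * f)  =  w * (exp(f) - y / e),
    # with w = e when weighted by exposure and w = 1 otherwise.
    grads = []
    for yi, fi, e in zip(y, f, exposure):
        w = e if weighted else 1.0
        grads.append(w * (math.exp(fi) - yi / e))
    return grads


# Made-up counts, raw predictions and exposures.
y = [2, 0, 5]
f = [0.1, -0.3, 0.7]
exposure = [1.0, 3.0, 0.5]

g_offset = grad_offset(y, f, exposure)
g_weighted = grad_rate(y, f, exposure, weighted=True)
g_unweighted = grad_rate(y, f, exposure, weighted=False)
```

Under these assumptions the exposure-weighted rate gradient matches the offset gradient term by term, while the unweighted rate gradient differs wherever exposure is not 1, which is consistent with the observation that simply dividing counts by exposure does not give the correct weighting.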
(Aside: I found the link to this issue on stats.stackexchange.)
statsmodels GLM has offset and exposure (exposure only for the log link).
In master there is now an elastic net option for GLM and a few other models, implemented via a Python loop for coordinate descent (it uses generic maximum likelihood with offset).
In master there is now also a Tweedie family for GLM. However, it doesn't use the full log-likelihood loss function, because that is an infinite sum of terms; calculating a truncated version is slow, and commonly used approximations are not very accurate over some range of the parameter space.
So, if you need a reference implementation, statsmodels has these parts. I have never heard of or looked at GBM for GLM. I also don't know enough about the scikit-learn code to tell how it would fit in. |
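One reason a Tweedie model can be fitted without the infinite-sum log-likelihood is that its unit deviance has a closed form. The following is a sketch of that deviance for a power parameter strictly between 1 and 2 (the compound Poisson-gamma case); the function name `tweedie_unit_deviance` is an illustrative assumption and this is not statsmodels code.

```python
def tweedie_unit_deviance(y, mu, power):
    """Closed-form Tweedie unit deviance for 1 < power < 2.

    Requires y >= 0 (exact zeros are allowed) and mu > 0.
    """
    if not 1.0 < power < 2.0:
        raise ValueError("this sketch only covers 1 < power < 2")
    if y < 0 or mu <= 0:
        raise ValueError("need y >= 0 and mu > 0")
    p = power
    return 2.0 * (
        y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * mu ** (1.0 - p) / (1.0 - p)
        + mu ** (2.0 - p) / (2.0 - p)
    )


# Made-up values: a perfect prediction, a zero observation, and a miss.
d_perfect = tweedie_unit_deviance(3.0, 3.0, power=1.5)
d_zero = tweedie_unit_deviance(0.0, 3.0, power=1.5)
d_miss = tweedie_unit_deviance(10.0, 3.0, power=1.5)
```

The deviance is zero when the prediction equals the observation and stays finite for y = 0, which is what makes it usable as a loss for data with many exact zeros.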
@thenomemac You're absolutely right about the loss function changing due to exposure, I was mistaken. In fact, I believe I worked it out. I have a very early WIP (more of just playing around): https://github.com/bjlkeng/scikit-learn/blob/poisson_regression/sklearn/linear_model/poisson.py (check out the
@josef-pkt Thanks, I looked at the statsmodels implementation, it's actually quite good (except for the API, which I'm not a fan of). It's actually a bit more general, in that their "count" model supports other count-based regressions like negative binomial. The statsmodels implementation already takes into account exposure and regularization (which is what I was also looking for). Given that statsmodels has an implementation, do you think it's still valuable to have something like this in |
I think this would still be valuable. |
@bjlkeng Thanks for the comment! Are you interested in taking it up and making a Pull Request? If not I can try to take this issue and attempt making a PR... First for poisson and subsequently for gamma... @agramfort Is that fine by you? :) |
Hey @raghavrv, sorry for the late reply. Work got pretty busy and I also realized it would take longer than I had time for. So please feel free to continue on. I would take a look at the statsmodels implementation because they have most of the functionality that I think you would want here. |
@raghavrv Did you start working on this? I may also be able to contribute so that we have at least Poisson regression in sklearn. |
@btabibian @raghavrv What is the status of this? I need a Poisson regression implementation for a project and would be willing to contribute. |
Please go ahead :) I haven't had time for this sorry... |
Same here, I haven't had time. Also, I've been struggling with the API and how to
integrate offsets. I could show the math or a code example in statsmodels.
But TL;DR: if you want to use Poisson regression with unequal exposures
(area or time), you need offsets. The model doesn't
give the correct weighting if you just divide your counts by exposures.
|
I will start looking into it then. @thenomemac thanks for the tip! I will check out the statsmodels implementation to see how they handle exposure. |
Hello, are there any updates? Would it also be possible to add a negative binomial likelihood? This shouldn't make much of a difference to Poisson. Otherwise I could look into this.. Best, |
Hi @dirmeier, unfortunately not. I've switched to a hierarchical Bayesian model and never got around to implementing the Poisson regression. |
@dirmeier, @jakobworldpeace is there some work in progress that you can point us to? I can also jump on taking a look at this? |
Hi @NickHoernle, |
@NickHoernle There is no WIP but the statsmodels Poisson regression implementation should get you started. |
Excellent. I'll start taking a look at this and see where we get. |
I've got a validated, general GLM implementation in my py-glm library, but have no plans to attempt to merge it into sklearn (I made some design decisions that make it incompatible with sklearn's). It's set up so that it should be very easy to add other exponential families. I also have a full, pure Python glmnet implementation that supports the same exponential families in the same library, but I got stuck on a bug and I've put it down. I would love some help reasoning through the bug, I just need some motivation to pick it back up. |
@madrury Happy to take a shot at helping you out with that bug. |
Hello, anything built for these distributions? Curious about any updates. Thanks. |
I'd be personally fine with closing this issue to help the contributors
focus. Reasons:
- The Python landscape has changed
- statsmodels is now much more mature and includes these distributions with
proper exposure weighting
- JIT-based implementations via PyTorch or TensorFlow make it easier to
implement any esoteric loss with no performance penalty or package
recompilation
Thoughts?
|
we are currently allocating resources to help with
#9405
and make it (at least some parts) land in master. It should be tackled over
the next months.
|
Awesome work!
|
It would be great to have GLMs in scikit-learn. This will reduce the need to go to other languages. Looking forward to it. |
Agreed. Coming from the R world, I was surprised that sklearn didn't already have GLM functionality. I hope it will soon. |
* new estimator GeneralizedLinearRegressor
* loss functions for Tweedie family and Binomial
* elastic net penalties
* control of penalties by matrix P2 and vector P1
* new solvers: coordinate descent, irls
* tests
* documentation
* example for Poisson regression
I'll add another vote to include GLMs in sklearn. It's a core class of models which is taught in undergraduate stats programs. Also, the fact that there is a "Generalized Linear Models" section in the user manual that doesn't include any discussion of link functions or error distributions is surprising to me. |
@patrickspry Statsmodels has a good implementation of most GLMs an undergrad would learn. |
@patrickspry There is a fairly complete PR in #9405 and work is in progress to merge that functionality. |
Oh, fantastic! Thanks for the heads up!
|
Any estimated timeline to merge the PR? Thanks. |
For linear models #14300 is now merged, although additional features could still be added #9405 (comment)
That could indeed be the next step (e.g. #15123 (comment)) |
It is impressive to see the great work in sklearn 0.23, which includes Poisson, gamma and Tweedie regression. Hope to see more enhancements in the future. |
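For readers landing here, a minimal usage sketch of the Poisson GLM added in scikit-learn 0.23 follows. It assumes scikit-learn 0.23 or later and NumPy are installed; the synthetic data, the coefficients 0.3, 0.5 and -0.2, and the variable names are made up for illustration. In this sketch, exposure is handled by fitting the rate (counts divided by exposure) with exposure as `sample_weight`.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n_samples = 5000

# Synthetic features, exposures and counts (all made up for illustration).
X = rng.normal(size=(n_samples, 2))
exposure = rng.uniform(0.5, 2.0, size=n_samples)
true_rate = np.exp(0.3 + 0.5 * X[:, 0] - 0.2 * X[:, 1])
counts = rng.poisson(exposure * true_rate)

# Fit the rate with exposure as sample weight; alpha=0.0 disables the
# L2 penalty so the coefficients are directly comparable to the truth.
model = PoissonRegressor(alpha=0.0)
model.fit(X, counts / exposure, sample_weight=exposure)

# Expected counts are the predicted rate times the exposure.
expected_counts = exposure * model.predict(X)
```

`GammaRegressor` and `TweedieRegressor` in the same module follow the same fit/predict pattern.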
Looks like we can close the issue now that #14300 is merged? |
Thanks for the feedback @magicmathmandarin! Yes, absolutely. The original PR #9405 actually included the deviance of BinomialDistribution for binary logistic regression. The reason we didn't include that in the first merged PR is that, even though they are indeed part of the same theoretical framework, the specialized
Sure. There are some follow-up discussions in #16668, #16692 and #15123. |
I would like sklearn to support Poisson, gamma and other Tweedie family loss functions. These loss distributions are widely used in industry for count and other long-tailed data. Additionally, they are implemented in other libraries such as R's GLM, GLMNET, GBM, etc. Part of implementing these distributions would be to include a way for offsets to be passed to the loss functions. This is a common way to handle exposure when using a log link function with these distributions.
Would the sklearn community be open to adding these loss functions? If so, I (or hopefully others) would be willing to research the feasibility of implementing these loss functions and offsets in the sklearn API. Thanks