[MRG] META Add Generalized Linear Models #9405
Conversation
Overall, I think this would be something good to have, and the code and docstrings are high quality.
Yes, I think we want to focus on regularised models, so this probably won't be merged without elastic net.
However, if I were you, I'd start by getting the API right (strings preferred to custom objects), and by starting to sell the usefulness of GLMs, ideally by composing an example that shows off their predictive and inferential capabilities on some real-world dataset. Maybe another example could generate datasets, each of which requires a different family to achieve minimum loss.
And you'd do well with more tests. You might also want to fix up pep8 and keep it fixed by adding a checker to your editor.
Thanks!
Please also add one or more entries to doc/modules/classes.rst so that the API reference docs are generated.
sklearn/linear_model/glm.py (outdated):
# So far, coefficients=beta and weight=w (as standard literature)
# TODO: Add l2-penalty
# TODO: Add l1-penalty (elastic net)
# TODO: Add cross validation
What do you intend to implement cross validation for?
Like LogisticRegressionCV, once penalties are added ...
* Fixed pep8
* Fixed flake8
* Renamed GeneralizedLinearModel to GeneralizedLinearRegressor
* Use of six.with_metaclass
* PEP257: summary should be on same line as quotes
* Docstring of class GeneralizedLinearRegressor: \ before mu
* Arguments family and link accept strings
* Use of ConvergenceWarning
* GeneralizedLinearRegressor added to doc/modules/classes.rst
@jnothman Thank you very much for your fast and encouraging feedback. I tried my best to make the suggested changes. For sure, I'd like to include more tests, better documentation, examples and penalties (like glmnet). But I will be away for the next month.
my concern here is the overlap of scope with what we already have (LogisticRegression, LinearRegression, Ridge, etc.). If you provide a generic GLM object then we'll have to decide what to do with the existing classes, which basically means you'd have to be at least as fast as them.
@agramfort, I think the point here is to provide Poisson regression etc, and to leave more specialised tools alone. Is there an issue with that? @lorentzenchr, rather than creating CV variants, I would focus on supporting warm_start. From there a generic CV adaptation is fairly straightforward. And a month's break is fine. Plenty to go on with here.
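To illustrate the warm_start idea: once an estimator keeps its coefficients between `fit` calls, a regularization path for CV is just a loop over penalties. A minimal sketch using ElasticNet (which already supports `warm_start`) as a stand-in for a future GLM estimator:

```python
# Sketch: generic CV over a penalty path via warm_start, using ElasticNet
# as a stand-in; a warm-startable GLM estimator would slot in the same way.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + 0.1 * rng.randn(100)

alphas = np.logspace(1, -3, 10)  # strong -> weak penalty, so warm starts help
cv_score = {}
for train, test in KFold(n_splits=3).split(X):
    est = ElasticNet(warm_start=True, max_iter=10_000)
    for alpha in alphas:
        est.set_params(alpha=alpha)
        est.fit(X[train], y[train])  # reuses previous coef_ as starting point
        cv_score.setdefault(alpha, []).append(est.score(X[test], y[test]))

best_alpha = max(cv_score, key=lambda a: np.mean(cv_score[a]))
```

This is exactly the pattern LogisticRegressionCV uses internally; nothing here is specific to the squared-error family.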
> @agramfort, I think the point here is to provide Poisson regression etc, and to leave more specialised tools alone. Is there an issue with that?

no, but then let's call stuff PoissonRegression to be consistent and avoid the generic GLM/Family naming.
Isn't there a great deal of similarity between the backends of these models that would be captured with a single GLM object? I agree that since we already have some GLMs we shouldn't re-invent the wheel, but it seems like there would be an explosion of linear models if we name them all separately.
True, we could maybe factorize some code, but having different estimators was an initial decision to be as explicit as possible. Having one single object is a big refactoring / API change.
we're only really talking about logistic and linear regression, right? and this doesn't currently support logistic? and SGD overlaps with others already. and penalised linear and logistic are much more common in ML than other GLMs. I don't see the problem with offering specialists alongside a generalist, even if we then decide to make the specialists just instantiations of the generalist.
how are we going to document this? choose between: LogisticRegression(solver='newton-cg')? see also https://github.com/glm-tools/pyglmnet; lightning also has a lot of super efficient code. @lorentzenchr do you have the energy to unify all this?
@agramfort the same issue exists already with SGDClassifier, as @jnothman points out. It's not ideal, but I think we should not expose the same solver in different places if we can avoid it. So maybe not add log here?
I am fine having a private base class for the new GLMs; it's a sound internal design choice. Now, I think I am still in favor of having models with names in the public API: PoissonRegressor etc. I would therefore not implement log in this new code unless it can be used to benchmark this code against models we already have, which could be nice to spot computation issues or benefits of the new implementation.
Let me try to list some thoughts I had in mind with this PR:
@agramfort I certainly do not have the energy to unify all this, but I hope I'm not alone ;-)
* fixed bug: init parameter max_iter
* fixed API for family and link: default parameter changed to string; non-public variables self._family_instance and self._link_instance
* fixed bug in score: minus sign forgotten
* added check_is_fitted to estimate_phi and score
* added check_array(X) in predict
* replaced lambda functions in TweedieDistribution
* some documentation
* made raw docstrings where appropriate
* made ExponentialDispersionModel (i.e. TweedieDistribution) picklable: ExponentialDispersionModel has a new property include_lower_bound; method in_y_range is not abstract anymore
* set self.intercept_ = 0 if fit_intercept=False, such that it is always defined
* set score to D2, a generalized R2 with deviance instead of squared error, as does glmnet. This also solves issues with check_regressors_train(GeneralizedLinearRegressor), which assumes an R2 score
* change of names: weight to weights in ExponentialDispersionModel and to sample_weight in GeneralizedLinearRegressor
* added class method linear_predictor
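For readers unfamiliar with the D² score mentioned above: it replaces the squared error in R² with the family's deviance, D² = 1 − D(y, ŷ) / D(y, ȳ), so it reduces to R² for the Gaussian family. A minimal numpy sketch for the Poisson case (the function names here are illustrative, not the PR's actual helpers):

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance 2*sum(y*log(y/mu) - y + mu), with y*log(y/mu)=0 when y=0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - y + mu)

def d2_score(y, mu):
    """D^2 = 1 - deviance(fitted) / deviance(null model predicting mean(y))."""
    dev = poisson_deviance(y, mu)
    dev_null = poisson_deviance(y, np.full_like(y, np.mean(y), dtype=float))
    return 1.0 - dev / dev_null

y = np.array([0.0, 1.0, 3.0, 6.0])
print(d2_score(y, y))                       # perfect fit -> 1.0
print(d2_score(y, np.full_like(y, y.mean())))  # null model -> 0.0
```

Like R², D² is 1 for a perfect fit and 0 for the intercept-only (null) model.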
* added L2 penalty
* API change: alpha, l1_ratio, P1, P2, warm_start, check_input, copy_X
* added entry in user guide
* improved docstrings
* helper function _irls_step
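As background on the `_irls_step` helper named above: IRLS solves a weighted least-squares problem per iteration, β ← (XᵀWX + αI)⁻¹XᵀWz, with W and the working response z determined by the family and link. A self-contained sketch for Poisson with log link (my illustration of the textbook algorithm, not the PR's code):

```python
import numpy as np

def irls_step(X, y, coef, alpha=0.0):
    """One IRLS update for a Poisson GLM with log link and L2 penalty alpha."""
    eta = X @ coef
    mu = np.exp(eta)                 # inverse log link
    W = mu                           # Poisson: Var(y) = mu, canonical link weights
    z = eta + (y - mu) / mu          # linearized (working) response
    A = X.T @ (W[:, None] * X) + alpha * np.eye(X.shape[1])
    b = X.T @ (W * z)
    return np.linalg.solve(A, b)

rng = np.random.RandomState(42)
X = np.column_stack([np.ones(200), rng.randn(200)])
true_coef = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_coef)).astype(float)

coef = np.zeros(2)
for _ in range(25):                  # iterate the step until it is a fixed point
    coef = irls_step(X, y, coef)
```

With a canonical link this is exactly Newton's method on the deviance, which is why convergence is fast.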
* fix some bugs in user guide linear_model.rst
* fix some pep8 issues in test_glm.py
Thanks @lorentzenchr, I addressed your comments. Also synced changes in #14300.
It's a real bummer that the milestone has to be pushed back. The addition of these GLM models is REALLY exciting news, so I'm saddened to have to tell the team that we can't expect to use this for at least 6(?) months. Anyhow, I'm super appreciative of the tutorial that @lorentzenchr has put together. VERY nice work! :D
@jboarman you could use the code now by using the branch, and you could use it once it's merged by using the sklearn dev version ;) but yes, it will only be in a release in about 6 months.
@jboarman, indeed it will be great to have people trying this out with nightly builds once we've got it merged to master.
Thanks guys ... Obviously a lot of ppl are interested in this and working hard to merge in this PR. I shouldn't complain! 😊
The tutorial https://github.com/lorentzenchr/Tutorial_freMTPL2 now has spatial smoothing and thus shows a good use case for, and the intention of, the penalty matrix.
A minimal implementation of this PR with L2 penalty and LBFGS solver is now merged in #14300 and should be available in the next release. I have opened follow-up issues for some of the other features included here. The easiest might be to keep this PR open for visibility, but address them in separate smaller PRs.
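For anyone landing here later: the minimal GLMs merged in #14300 shipped in scikit-learn 0.23 as PoissonRegressor, GammaRegressor, and TweedieRegressor. A quick usage sketch (synthetic data; the `score` method returns the D² discussed earlier in this thread):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = rng.poisson(np.exp(1.0 + X @ np.array([0.5, -0.25]))).astype(float)

# Poisson GLM with L2 penalty, fit by LBFGS (the solver merged in #14300)
poisson = PoissonRegressor(alpha=1e-3, max_iter=300).fit(X, y)

# TweedieRegressor generalizes this: power=1 is Poisson, power=2 is Gamma
tweedie = TweedieRegressor(power=1, alpha=1e-3, max_iter=300).fit(X, y)

print(poisson.score(X, y))  # D^2, the deviance-based generalization of R^2
```

Note the L1/elastic-net penalty and the P1/P2 penalty matrices from this PR are among the features tracked in the follow-up issues, not in the merged minimal version.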
Removing from the milestone, since parts of this are already in.
@lorentzenchr: just for confirmation before closing this PR; has all the content of this PR been integrated in scikit-learn, or are there still some pieces that have not been integrated yet?
@jjerphan With the current release, we only have the minimal GLM implementation and a lot is still missing. I think we have issues for all those additional features, so we could close this PR (which is old, with lots of details that I would do differently today).
This PR adds Generalized Linear Models (full Tweedie family) with an elastic net penalty comparable to R's glmnet, see issue #5975. Further distributions of the exponential dispersion family can be added easily.
It also solves #11566.
Sorry for not posting this earlier.
I'm excited about your feedback.
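For orientation, the glmnet-style objective this PR targets, with the feature-wise penalty terms P1 and P2 from the proposed API, can be written roughly as follows (my paraphrase of the standard elastic-net GLM formulation, not lifted from the code):

```
min_w  1/(2·Σ s_i) · D(y, ŷ; s)  +  α · l1_ratio · ‖P1 ∘ w‖₁  +  (α/2) · (1 − l1_ratio) · wᵀ P2 w
```

where D is the chosen family's deviance, s are sample weights, P1 scales the L1 penalty per coefficient, and P2 is a positive semidefinite matrix generalizing the L2 term (enabling e.g. the spatial smoothing shown in the tutorial).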
TODO
GeneralizedLinearRegressor