Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Minimal Generalized linear models implementation (L2 + lbfgs) #14300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minimal Generalized linear models implementation (L2 + lbfgs) #14300
Changes from all commits
919912c
01033e3
4071a8a
757bc3c
ed8e74f
fe876da
2993e03
a6f9f13
c9a7a95
a7755de
9aa1fc4
ca3eae2
61bc6b8
b24a7ca
f95b390
09176b4
def12ae
9b574bd
90299fd
e3a5a9a
54b80b8
e962859
7db0320
dcfe9ed
4879bb6
56069e5
53f3c5f
ac1fef3
be5a3c4
e67fecb
62f4448
d25042e
07ee495
d0eb285
1e4b538
3265148
1862ab6
4154074
c5d77d7
9ab5ac2
e4d0be1
6ff4d58
13102d5
af89e52
3802420
ddc4b71
426ae1d
a404384
65796a3
e44afe7
5927379
a6df2a7
5af89a7
cd347d4
0d7f9cd
dbffad8
d914ab2
3187204
cc03c1a
4d433d1
aa52b4a
816aa8f
6500c81
59a6d9d
03a8a2d
bbf7f38
49a3a8e
228e8c8
09a57c9
4b485ca
aa0adf1
abd47d7
c65ac12
7a9d067
18b4503
29658d6
1ea70d3
efdcb5b
ef0d063
0125e1c
6a8a600
73f3bd1
11b178f
3806fbe
ae1c672
f07c831
e34fb57
ebbbe9c
918e257
5236cd8
37d0f47
9c337f2
5e05935
4a68213
8ee5c85
b106c25
a1f8aab
c0999ea
e09e336
6601d30
81eabe3
5174dae
61dc13f
44524ca
58d2409
ee351e1
33fe9be
94272e7
2457039
6396d2c
8be0387
da66fd5
b9bc170
ab6c5d8
b424a07
95a9058
59eceb4
12a5067
04f30f4
3630b52
516eadb
5e14928
6cc1df5
752d6aa
38a4ad4
c15a1cc
37de07b
adbf997
47dbc84
3b526e9
b1eb611
d964c01
a1844b8
f513392
84229a6
dd22699
059aeb7
8c6c255
0a23313
29964af
976b436
7c850d1
f64dc4a
b1f5bde
a9ab4e4
be7bb67
4125c20
0070d52
9d6bb52
b353b2d
88757fd
6d119d4
b735eb7
2d91114
89103bc
4f28a44
d4dfd0b
64d6fbd
82ace9f
5288a0f
499e8d2
f4aa839
48fcbe6
d71fb9f
fa90272
4d16f31
15eb1d3
3d097c6
6372287
b117856
31f5b3d
a498ff5
3fae28a
a2b6841
a47798a
83391dd
3bfb54e
d325fe2
661cf56
560c180
0ea2dce
4ca2e95
89b429d
2d0b195
ea6a3e8
ddae396
21572d9
d7ff6f4
939d240
da0d2a6
162fb3b
cafc92f
d3235db
9401230
5bc48e5
e572c31
8d70042
6dc5e34
c479889
3628cff
88d150e
0d31f47
3ab2877
87de01b
072f4fb
fbc22d8
27ae4a2
51be7e1
25f0e53
5eddf9c
81068ce
c700acf
a126f4a
82c4483
d3083db
f63f795
45de110
b8459b0
581b4d7
bb75435
94dfc00
0810bf3
a90a0aa
f74ab96
a349be7
497a76c
c8c902f
bda7ad6
83c2ba6
7498f3e
90b1213
697bda2
578408c
2668910
d1c3dc9
a9686f6
04e7aca
21a739c
79ada1e
39eeb44
e3cf69d
56aa0d7
27a344c
e817b2c
0fdc518
6d4ecb2
b96e021
293214c
987239a
edba3b8
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is an inverse function, would it makes sense to denote it
It would also be nice to suggest what this link function is doing. From what I saw this is a link between the linear predictor and the mean of the distribution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, this is standard literature nomenclature: link function g: g(E[Y]) = Xw and inverse link function h: h(Xw) = g^{-1}(Xw) = E[Y].
Second, we had this info/equation in the user guide, but some reviewer wanted it gone 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See here: https://78243-843222-gh.circle-artifacts.com/0/doc/modules/linear_model.html#id34
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the link function
g
could be introduced e.g. below in the usage part.I don't like using E[Y] because:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should go with it then :) (my remarks are from somebody which is not familiar with GLMs)
I was not thinking to go until this level of detail but more about defining with a couple of words what is the link function. Regarding the mathematical section, I'm fine with the majority of reviewers thought it was an overkill. I will not be pushy on this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NicolasHug First excuse - it was late 😊 Second excuse, GLM literature is lazy and skips the conditional most of the time, e.g. E[Y] really means E[Y|X].
More seriously: Maybe, I missed that piece in the actual user guide, but I can't find a reference as to what regressors do actually estimate/predict (most of the time E[Y|X]). That's why I bring this up again (last time, promise). Should we add that info somewhere in the UG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a possibility, yes. Though I would prefer doing that in another PR to make sure the UG is overall consistent, since this would be a non-trivial update.
In the mean time, I still think we can have a reasonable UG for the GLMs that doesn't mention E[Y/X], we can just talk about predictions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to add the Tweedie distribution in the list as well?
The first regressor below is the
TweedieRegressor
and when reading the section, this is the first time that the term occurs. So I was wondering if it makes sense to mention something about the distribution earlier?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say so. Only objection is to have this info 2 times. Maybe a link to the
mean_tweedie_deviance
metric here suffices?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we always apply standard-scaling or one should also consider robust-scaling and so on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we can have L-BFGS running, shall we have a similar concern than in #15556 and #15583
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and we should apply the same method. It will remove the need for scaling from an optimization perspective. Unscaled data means that the penalization is weirdly weighted but if we don't scale we should make sure the algorithm still converges if the data is not scaled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I have added item to the todo list of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's standard practice to say that the data should be standardized or scaled. Which scaler to use and whether it should be robust to outliers is I think off topic in this section of the user guide. Users can read the preprocessing docs if interested.
Copying what I responded in #14300 (comment), since I'm doing another pass on comments, I don't think it belongs in this PR. This PR is a minimal implementation, and while preconditioning may improve convergence it is not essential, we have been using LBFGS without for years. Also as the above linked PR showed the results were not as clear cut as expected.