DOC update and improve the `sample_weight` entry in the glossary (#30564)
Conversation
❌ Linting issues: this PR is introducing linting issues. You can see the details under the CI linting check.
For information, for the examples of sample weight usage, we could be more specific, for instance:
But linking to the papers does not necessarily make it explicit that this can be implemented via scikit-learn's `sample_weight`.
Just wanted to offer a few alternative word choices for consideration.
Could you wait for my review before merging? Even if it takes 2 weeks?
Just a high-level review: this PR makes sample weights the longest entry in the glossary. I would prefer a more concise entry.
I could move some of the details to a dedicated section of the user guide (maybe part of the estimator API) and keep a short version in the glossary with a link to the user guide for more details, if people prefer.
FYI: in supervised learning, one of the main reasons for using weights, e.g. in insurance frequency models (Poisson loss), is to account for the different size/time/exposure/... of each observation. The assumptions are usually:
In the end, the weights enter the loss. The idea is that the "correct" weights improve the efficiency of the estimation of the **expected loss** (= empirical mean of per-observation losses). However, to my knowledge, there is no direct connection between **expected loss** and **statistical risk**.
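For concreteness, here is a minimal sketch of the exposure-weighting pattern described in this comment. It is an editorial illustration, not part of the PR: the data is synthetic, and using `PoissonRegressor` with the rate as target and the exposure as `sample_weight` is one common way to set this up.

```python
# Hedged sketch of exposure weighting in a frequency model: the target is
# claims per unit of exposure, and sample_weight carries each observation's
# exposure, so longer-observed policies contribute more to the loss.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(42)
n = 200
X = rng.normal(size=(n, 2))
exposure = rng.uniform(0.1, 2.0, size=n)  # e.g. policy duration in years
counts = rng.poisson(exposure * np.exp(0.3 * X[:, 0]))  # observed claim counts

# Model the frequency (counts per unit of exposure), weighting by exposure.
model = PoissonRegressor().fit(X, counts / exposure, sample_weight=exposure)
```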
I would also mention reweighting for transfer learning, and define in one sentence what it means.
Also, I find that this paragraph is very useful, but either put it in a fold or in a section of the docs.
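As a rough, editorial sketch of what "reweighting for transfer learning" could look like in practice, assuming the common density-ratio trick with a domain classifier (everything below is illustrative and not part of the PR):

```python
# Hedged sketch: weight source-domain samples by an estimated density ratio
# p_target(x) / p_source(x) so that training on source data mimics the target
# distribution. The ratio is estimated with a domain classifier; synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_source = rng.normal(0.0, 1.0, size=(300, 2))
X_target = rng.normal(1.0, 1.0, size=(300, 2))  # shifted covariates

# Train a classifier to distinguish the two domains.
X_dom = np.vstack([X_source, X_target])
y_dom = np.r_[np.zeros(300), np.ones(300)]
clf = LogisticRegression().fit(X_dom, y_dom)

# Density-ratio estimate p(target | x) / p(source | x) on the source data.
proba = clf.predict_proba(X_source)
w = proba[:, 1] / proba[:, 0]
# `w` can now be passed as sample_weight when fitting on source-domain labels.
```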
Thanks @ogrisel. I have added a few comments.
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
Thanks for the reviews @StefanieSenger @GaelVaroquaux @lorentzenchr. I addressed all the specific feedback. Personally, I am fine with keeping all this info centralized in a largish glossary entry, but I can also split it into a dedicated section in the user guide (where?) and cross-reference it if people prefer.
I think this is a good entry, even though it is quite long.
It would be nicer to have a short glossary entry and a dedicated user guide section. But I think we'd need to put in more work to make the user guide section, for example expanding on the third-party paragraph from this PR with some links to examples of third-party estimators (I'm wondering if the text as it is is too brief for people who aren't experts to understand and too long for those who are experts/third-party implementers).
The final paragraph on KMeans and sampling is another one that I think we could expand in the user guide. At least I am not sure I fully understand what it is trying to tell me without an example.
However, maybe we should merge this now already and make an issue to improve the user guide and reduce the glossary entry, mostly because making the user guide entry would be quite a bit of additional work.
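Regarding the KMeans point above, here is a hedged, editorial sketch (not from the PR) of one plausible reading: an integer `sample_weight` in KMeans acts like repeating each point that many times in the objective, up to initialization randomness. Data and weights are synthetic.

```python
# Compare fitting KMeans with integer weights vs. on physically repeated rows:
# both optimize the same weighted inertia objective.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 2))
w = rng.randint(1, 4, size=30)  # integer weights

km_weighted = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X, sample_weight=w)
km_repeated = KMeans(n_clusters=3, n_init=10, random_state=0).fit(np.repeat(X, w, axis=0))

# The centers typically coincide (up to cluster permutation and local optima
# caused by different initializations).
print(np.sort(km_weighted.cluster_centers_, axis=0))
print(np.sort(km_repeated.cluster_centers_, axis=0))
```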
floats, so that sample weights are usually equivalent up to a constant
positive scaling factor.
equivalent to what? I don't understand this sentence.
Invariance to the scale of weights is actually not guaranteed for all estimators or hyperparameter choices. So let's be more generic.
Suggested change:
- floats, so that sample weights are usually equivalent up to a constant
- positive scaling factor.
+ floats to express fractional relative importance of data points with respect
+ to one another.
Do you mean to say this?
Suggested change:
- floats, so that sample weights are usually equivalent up to a constant
- positive scaling factor.
+ floats, since for many estimators, only the relative values of weights matter,
+ not their absolute scale.
I did not want to explicitly state that because we are not yet sure to which extent this is true or not: we haven't started testing this aspect systematically, but we know that this is not the case for many important estimators such as LogisticRegression and Ridge. Both of those estimators have a regularization parameter whose effect depends on the scale of the weights.
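To make this point concrete, here is a hedged sketch (not part of the PR, synthetic data) showing that rescaling all sample weights by a constant changes the fit of a regularized estimator such as Ridge, because the penalty term `alpha` is not rescaled along with the weighted data term.

```python
# Multiplying all sample weights by 10 grows the data-fit term tenfold while
# alpha stays fixed, so the effective regularization weakens and the
# coefficients change.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=3, noise=10.0, random_state=0)
w = np.random.RandomState(0).uniform(0.5, 2.0, size=50)

coef_w = Ridge(alpha=1.0).fit(X, y, sample_weight=w).coef_
coef_10w = Ridge(alpha=1.0).fit(X, y, sample_weight=10 * w).coef_

print(np.allclose(coef_w, coef_10w))  # typically False
```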
doc/glossary.rst (outdated)
Third-party libraries can also use `sample_weight`-compatible
estimators as building blocks to reduce a specific statistical task
into a weighted regression or classification task. For instance sample
weights can be constructed to adjust a time-to-event model for
censoring in a predictive survival analysis setting. In causal
inference, it is possible to reduce a conditional average treatment
effect estimation task to a weighted regression task under some
assumptions. Sample weights can also be used to mitigate
fairness-related harms based on a given quantitative definition of
fairness.
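As a concrete, editorial illustration of the causal-inference reduction mentioned in this paragraph, here is a hedged sketch of inverse propensity weighting (IPW). All data and variable names below are illustrative assumptions, not part of the PR.

```python
# IPW reduction: estimate propensity scores, build inverse propensity weights,
# then fit a weighted regression of the outcome on the treatment indicator.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(0)
n = 500
X = rng.normal(size=(n, 3))
treatment = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # confounded assignment
y = 2.0 * treatment + X[:, 0] + rng.normal(size=n)       # true effect is 2

# Step 1: estimate propensity scores P(treatment=1 | X).
propensity = LogisticRegression().fit(X, treatment).predict_proba(X)[:, 1]

# Step 2: inverse propensity weights make the reweighted sample mimic a
# randomized experiment.
w = np.where(treatment == 1, 1 / propensity, 1 / (1 - propensity))

# Step 3: weighted regression recovers (approximately) the treatment effect.
ate = LinearRegression().fit(treatment.reshape(-1, 1), y, sample_weight=w).coef_[0]
print(ate)
```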
This paragraph doesn't feel like it belongs here.
Personally, I find this paragraph very important: it grounds an abstract concept in practical application cases. Having those use cases in mind is helpful both for users and contributors.
I agree with @adrinjalali that this paragraph doesn't operate in the same scope as the rest of this glossary entry. A glossary is meant to define terms clearly and serve as a quick lookup.
In the user guide, highlighting a specific use case like this would be great in a toggle or visually marked in a different style. Here, however, it breaks the format and usability of the glossary.
Point taken: this means we will need a dedicated section in the docs to speak about the semantics and the practical use of sample weights in more detail.
As discussed in #29907 (comment).
[...] `min_samples` in `DBSCAN` and contrasted it with the `min_samples_leaf`/`min_weight_fraction_leaf` parameter pair. [...] the `class_weight` parameter.