
DOC update and improve the sample_weight entry in the glossary #30564

Open · wants to merge 7 commits into main

Conversation

@ogrisel (Member) commented Dec 31, 2024

As discussed in #29907 (comment).

  • Link to practical usage examples.
  • Remove some FIXMEs and TODOs.
  • Fix the description of `min_samples` in DBSCAN and contrast it with the `min_samples_leaf` / `min_weight_fraction_leaf` parameter pair.
  • Refine the description of the interplay with the `class_weight` parameter.
  • Reference ongoing work and the tracking meta-issue.

github-actions bot commented Dec 31, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5e2d5f6. Link to the linter CI: here

@ogrisel changed the title to "DOC update and improve the sample_weight entry in the glossary" Jan 2, 2025
@ogrisel (Member, Author) commented Jan 2, 2025

For information, for the examples of sample weight usage we could be more specific, for instance:

But linking to the papers does not necessarily make it explicit that this can be implemented via scikit-learn's sample_weight, and the code of those libraries is not always easy to understand without reading the papers first. Furthermore, I am not sure how to link to a source code snippet while making it easy to check that the link stays relevant over time.

@virchan (Member) left a comment

Just wanted to offer a few alternative word choices for consideration.

bagging models randomly resample from the training data, etc. In this
case, the sample weight-repetition equivalence property described above
does not hold exactly. However, it should hold at least in expectation
over the randomness of the fitting procedure.
@ogrisel (Member, Author) replied:

Note to reviewers: I think the equivalence should even hold in distribution rather than just in expectation, but I am not 100% sure this can always be enforced in scikit-learn (yet).
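
For a deterministic estimator, the weight-repetition equivalence in question can be checked directly. A minimal sketch on toy data, assuming a deterministic solver:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1, 2, 3, 1])  # integer weights read as repetition counts

# Fit once with weights, once on the explicitly repeated dataset.
clf_weighted = LogisticRegression().fit(X, y, sample_weight=w)
clf_repeated = LogisticRegression().fit(np.repeat(X, w, axis=0), np.repeat(y, w))

# The two objectives are identical, so a deterministic solver yields the
# same coefficients; for randomized procedures such as bagging, the
# equivalence is only expected to hold in expectation (or in distribution).
np.testing.assert_allclose(clf_weighted.coef_, clf_repeated.coef_, rtol=1e-4)
```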

@lorentzenchr (Member) commented

Could you wait for my review before merging? Even if it takes 2 weeks?

@lorentzenchr (Member) left a comment

Just a high-level review: This PR makes sample weights the longest entry in the glossary. I would prefer a more concise entry.

@ogrisel (Member, Author) commented Jan 8, 2025

> Just a high-level review: This PR makes sample weights the longest entry in the glossary. I would prefer a more concise entry.

I could move some of the details to a dedicated section of the user guide (maybe part of the estimator API) and keep a short version in the glossary with a link to the user guide for more details if people prefer.

@lorentzenchr (Member) commented Jan 10, 2025

FYI: In supervised learning, one of the main reasons for using weights, e.g. in insurance frequency models (Poisson loss), is to account for the different size/time/exposure/... of each observation. The assumptions are usually:

  • $Y=\frac{C}{w}$ is the ratio of interest, where $C$ is an amount (counts, EUR, kilograms, ...).
  • $\mathrm{Var}[Y] \sim \frac{1}{w}$

In the end, the weights enter the loss. The idea is that the "correct" weights improve the efficiency of the estimation of the expected loss** (= empirical mean of the per-observation losses). However, to my knowledge, there is no direct connection between $\mathrm{Var}[Y_i]$ and $\mathrm{Var}[\mathrm{loss}_i]$; see Section 5.2.2 in https://arxiv.org/abs/2202.12780.

**expected loss = statistical risk
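
A minimal sketch of this exposure-weighting convention with scikit-learn's PoissonRegressor, on synthetic data (the coefficients and distributions below are illustrative assumptions): the target is the frequency $Y = \frac{C}{w}$ and the exposure $w$ is passed as sample_weight.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 3))
exposure = rng.uniform(0.1, 1.0, size=n)       # w: e.g. time at risk
rate = np.exp(0.3 * X[:, 0] - 0.2 * X[:, 1])   # true per-unit-exposure rate
counts = rng.poisson(rate * exposure)          # C: observed counts
y = counts / exposure                          # Y = C / w

# Weighting by the exposure makes each per-observation Poisson deviance
# term enter the loss with weight w, matching Var[Y] ~ 1/w.
model = PoissonRegressor(alpha=1e-6).fit(X, y, sample_weight=exposure)
print(model.coef_)  # approximately [0.3, -0.2, 0.0]
```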

@GaelVaroquaux (Member) left a comment

I would also mention reweighting for transfer learning, and define in one sentence what it means.

Also, I find that this paragraph is very useful, but either put it in a fold or in a section of the docs.
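
One common way to make that concrete is importance weighting for covariate shift: reweight the training (source) samples by an estimate of the density ratio $\frac{p_{target}(x)}{p_{source}(x)}$ so that the fit targets the distribution the model will be deployed on. A hedged sketch, assuming a logistic-regression domain classifier to estimate the ratio:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(500, 1))   # labeled training domain
X_target = rng.normal(1.0, 1.0, size=(500, 1))   # shifted deployment domain
y_source = np.sin(X_source[:, 0]) + rng.normal(scale=0.1, size=500)

# Domain classifier: 0 = source, 1 = target; its predicted odds estimate
# p_target(x) / p_source(x) for equally sized domain samples.
X_dom = np.vstack([X_source, X_target])
y_dom = np.concatenate([np.zeros(500), np.ones(500)])
proba = LogisticRegression().fit(X_dom, y_dom).predict_proba(X_source)[:, 1]
w = proba / (1.0 - proba)

# Fit on source data, reweighted to mimic the target distribution.
model = Ridge().fit(X_source, y_source, sample_weight=w)
```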

@StefanieSenger (Contributor) left a comment

Thanks @ogrisel. I have added a few comments.

ogrisel and others added 4 commits February 24, 2025 14:32
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
@ogrisel (Member, Author) commented Feb 24, 2025

Thanks for the reviews @StefanieSenger @GaelVaroquaux @lorentzenchr. I addressed all the specific feedback. Personally, I am fine with keeping all this info centralized in a largish glossary entry, but I can also split it into a dedicated section of the user guide (where?) and cross-reference it if people prefer.

@betatim (Member) left a comment

I think this is a good entry, even though it is quite long.

It would be nicer to have a short glossary entry and a dedicated user guide section, but I think we'd need to put in more work to make the user guide section. For example, expanding on the third-party paragraph from this PR with links to examples of third-party estimators. (I wonder whether the text as it stands is too brief for non-experts to understand, and too long for those who are experts or third-party implementers.)

The final paragraph on KMeans and sampling is another one that I think we could expand in the user guide. At least I am not sure I fully understand what it is trying to tell me without an example.

However, maybe we should merge this now and open an issue to improve the user guide and shorten the glossary entry, mostly because making the user guide entry would be quite a bit of additional work.

@ogrisel added the "Waiting for Second Reviewer" label Mar 11, 2025
Comment on lines +1854 to +1855
floats, so that sample weights are usually equivalent up to a constant
positive scaling factor.
Member:

equivalent to what? I don't understand this sentence.
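
The invariance those lines seem to describe can be demonstrated directly: for many estimators, multiplying all sample weights by the same positive constant leaves the fit unchanged. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)
w = rng.uniform(0.5, 2.0, size=50)

# Rescaling the weights by a positive constant does not change the fit.
reg_a = LinearRegression().fit(X, y, sample_weight=w)
reg_b = LinearRegression().fit(X, y, sample_weight=10.0 * w)
np.testing.assert_allclose(reg_a.coef_, reg_b.coef_, rtol=1e-6)
```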

Comment on lines +1888 to +1897
Some model hyper-parameters are expressed in terms of a discrete number
of samples in a region of the feature space. When fitting with sample
weights, a count of samples is often automatically converted to a sum
of their weights as is the case for `min_samples` in
:class:`cluster.DBSCAN`, for instance. However, this is not always the
case. In particular, the ``min_samples_leaf`` parameter in
:class:`ensemble.RandomForestClassifier` does not take weights into
account. One should instead pass `min_weight_fraction_leaf` to
:class:`ensemble.RandomForestClassifier` to specify the minimum sum of
weights of samples in a leaf.
Member:

I think this information is best placed in the docstrings of those classes, not here; or, if kept here, keep only the first 3-4 lines.
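
For reference, the DBSCAN behavior the quoted lines describe can be seen on a toy example (made-up data): the core-point test compares the sum of neighbor weights against `min_samples`, so a heavily weighted point can count as several unweighted ones.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three tight points plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])

# Unweighted: the isolated point cannot reach min_samples=3 -> noise (-1).
labels_unweighted = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

# Weighted: giving the isolated point weight 3 makes it a core point on
# its own, since the weight sum in its neighborhood reaches min_samples.
w = np.array([1.0, 1.0, 1.0, 3.0])
labels_weighted = DBSCAN(eps=0.5, min_samples=3).fit(X, sample_weight=w).labels_

print(labels_unweighted)  # e.g. [0, 0, 0, -1]
print(labels_weighted)    # the isolated point now forms its own cluster
```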

Comment on lines +1877 to +1886
Third-party libraries can also use `sample_weight`-compatible
estimators as building blocks to reduce a specific statistical task
into a weighted regression or classification task. For instance sample
weights can be constructed to adjust a time-to-event model for
censoring in a predictive survival analysis setting. In causal
inference, it is possible to reduce a conditional average treatment
effect estimation task to a weighted regression task under some
assumptions. Sample weights can also be used to mitigate
fairness-related harms based on a given quantitative definition of
fairness.
Member:

This paragraph doesn't feel like it belongs here.
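
A simplified (unconditional) instance of the causal reduction mentioned in that paragraph can be sketched with inverse propensity weighting; the synthetic setup below is an illustrative assumption, not from the PR:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 2))

# Confounded treatment assignment and outcome with a true effect of 2.0.
p_treat = 1.0 / (1.0 + np.exp(-X[:, 0]))
t = rng.binomial(1, p_treat)
y = 2.0 * t + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Estimate propensity scores, then weight each sample by 1 / P(T = t | X).
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
w = np.where(t == 1, 1.0 / ps, 1.0 / (1.0 - ps))

# A weighted regression of y on t recovers the average treatment effect.
ate_model = LinearRegression().fit(t.reshape(-1, 1), y, sample_weight=w)
print(ate_model.coef_[0])  # close to the true effect of 2.0
```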

Labels: Documentation, Waiting for Second Reviewer
Project status: In Progress
7 participants