DOC update and improve the ``sample_weight`` entry in the glossary (#30564)
===========================================================================
``sample_weight``
    A relative weight for each sample. Intuitively, if all weights are
    integers, using them in a model or scorer is equivalent to duplicating
    each sample as many times as its weight value. Weights may be specified
    as floats, so sample weights are usually equivalent up to a constant
    positive scaling factor.
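For a deterministic estimator, this repetition equivalence can be checked directly. A minimal sketch on synthetic data, comparing a weighted fit of :class:`linear_model.Ridge` with a fit on explicitly repeated rows:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data with integer sample weights.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
w = rng.randint(1, 4, size=50)

# Fit once with sample weights...
weighted = Ridge(alpha=1.0).fit(X, y, sample_weight=w)

# ...and once on a dataset where each row is duplicated w[i] times.
repeated = Ridge(alpha=1.0).fit(np.repeat(X, w, axis=0), np.repeat(y, w))
```

Both fits solve the same penalized least-squares problem, so their coefficients coincide up to numerical precision.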

    `sample_weight` can be passed either as an argument to the estimator's
    `fit` method for model training or as a parameter of a :term:`scorer`
    for model evaluation. These callables are said to *consume* the sample
    weights, while other components of scikit-learn can *route* the weights
    to the underlying estimators or scorers (see
    :ref:`glossary_metadata_routing`).
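As an illustration (a minimal sketch on synthetic data, using the standard `fit` and metric signatures):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)
w = rng.uniform(0.5, 2.0, size=100)

# The estimator's `fit` method consumes the weights during training...
clf = LogisticRegression().fit(X, y, sample_weight=w)

# ...and a metric can consume the same (or different) weights at evaluation.
weighted_acc = accuracy_score(y, clf.predict(X), sample_weight=w)
```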

    Weighting samples can be useful in several contexts. For instance, if
    the training data is not uniformly sampled from the target population,
    this can be corrected by weighting the training data points by the
    `inverse probability
    <https://en.wikipedia.org/wiki/Inverse_probability_weighting>`_ of
    their selection for training (e.g. inverse propensity weighting).
    Weights are also useful to model the frequency of an event of interest
    per unit of time on a dataset of observations with different exposure
    durations per individual (see
    :ref:`sphx_glr_auto_examples_linear_model_plot_poisson_regression_non_normal_loss.py`
    and
    :ref:`sphx_glr_auto_examples_linear_model_plot_tweedie_regression_insurance_claims.py`).
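A schematic sketch of the inverse probability idea on synthetic data (the selection probabilities `p_select` here are hypothetical and known by construction; in practice they would have to be estimated, and they are not a scikit-learn API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(size=500)

# Hypothetical selection probabilities: samples with large X are
# over-represented in the training set relative to the population.
p_select = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
selected = rng.uniform(size=500) < p_select

# Inverse probability weighting: weight each selected sample by 1 / p.
ipw = 1.0 / p_select[selected]
model = LinearRegression().fit(X[selected], y[selected], sample_weight=ipw)
```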

    Third-party libraries can also use `sample_weight`-compatible
    estimators as building blocks to reduce a specific statistical task to
    a weighted regression or classification task. For instance, sample
    weights can be constructed to adjust a time-to-event model for
    censoring in a predictive survival analysis setting. In causal
    inference, a conditional average treatment effect estimation task can,
    under some assumptions, be reduced to a weighted regression task.
    Sample weights can also be used to mitigate fairness-related harms
    based on a given quantitative definition of fairness.

    Some model hyper-parameters are expressed in terms of a discrete number
    of samples in a region of the feature space. When fitting with sample
    weights, a count of samples is often automatically converted to a sum
    of their weights, as is the case for `min_samples` in
    :class:`cluster.DBSCAN`, for instance. However, this is not always the
    case. In particular, the ``min_samples_leaf`` parameter of
    :class:`ensemble.RandomForestClassifier` does not take weights into
    account. One should instead pass `min_weight_fraction_leaf` to
    :class:`ensemble.RandomForestClassifier` to specify the minimum sum of
    weights of samples in a leaf.
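A minimal sketch on synthetic data, shown here with a single :class:`tree.DecisionTreeClassifier` (the same parameter exists on the forest), checking the constraint through the fitted tree's `tree_.weighted_n_node_samples` attribute:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w = rng.uniform(0.1, 3.0, size=300)

# Each leaf must hold at least 10% of the total sample weight.
tree = DecisionTreeClassifier(
    min_weight_fraction_leaf=0.1, random_state=0
).fit(X, y, sample_weight=w)

# Leaves are the nodes without children; check their weighted sizes.
leaves = tree.tree_.children_left == -1
min_leaf_weight = tree.tree_.weighted_n_node_samples[leaves].min()
```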

    In classification, weights can also be specified for all samples
    belonging to a given target class with the :term:`class_weight`
    estimator :term:`parameter`. If both ``sample_weight`` and
    ``class_weight`` are provided, the final weight assigned to a sample is
    the product of the two.
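The multiplicative combination can be verified directly (a minimal sketch on synthetic data, using :func:`utils.class_weight.compute_sample_weight` to expand the class weights into per-sample weights):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
w = rng.uniform(0.5, 1.5, size=200)
cw = {0: 2.0, 1: 1.0}

# Fit with the product of sample weights and expanded class weights...
clf_a = LogisticRegression().fit(
    X, y, sample_weight=w * compute_sample_weight(cw, y)
)

# ...which matches passing class_weight and sample_weight separately.
clf_b = LogisticRegression(class_weight=cw).fit(X, y, sample_weight=w)
```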

    At the time of writing (version 1.7), not all scikit-learn estimators
    correctly implement the weight-repetition equivalence property. The
    `#16298 meta-issue
    <https://github.com/scikit-learn/scikit-learn/issues/16298>`_ tracks
    ongoing work to detect and fix the remaining discrepancies.

    Furthermore, some estimators have a stochastic `fit` method. For
    instance, :class:`cluster.KMeans` depends on a random initialization,
    bagging models randomly resample from the training data, and so on. In
    these cases, the weight-repetition equivalence property described above
    does not hold exactly. However, it should hold at least in expectation
    over the randomness of the fitting procedure.

``X``
    Denotes data that is observed at training and prediction time, used as