DOC update and improve the sample_weight entry in the glossary #30564


Open · wants to merge 7 commits into base: main

86 changes: 66 additions & 20 deletions doc/glossary.rst
@@ -1848,26 +1848,72 @@ See concept :term:`sample property`.
See :ref:`group_cv`.

``sample_weight``
A relative weight for each sample. Intuitively, if all weights are
integers, a weighted model or score should be equivalent to that
calculated when repeating the sample the number of times specified in
the weight. Weights may be specified as floats, so that sample weights
are usually equivalent up to a constant positive scaling factor.

FIXME Is this interpretation always the case in practice? We have no
common tests.

Some estimators, such as decision trees, support negative weights.
FIXME: This feature or its absence may not be tested or documented in
many estimators.

This is not entirely the case where other parameters of the model
consider the number of samples in a region, as with ``min_samples`` in
:class:`cluster.DBSCAN`. In this case, a count of samples becomes a
sum of their weights.

In classification, sample weights can also be specified as a function
of class with the :term:`class_weight` estimator :term:`parameter`.
A relative weight for each sample. Intuitively, if all weights are
integers, using them in a model or scorer is like duplicating each
sample as many times as the weight value. Weights may also be
specified as floats; in that case, the fitted model or computed score
is usually unchanged when all weights are multiplied by the same
positive scaling factor.
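
For instance, with an estimator for which this equivalence is expected
to hold, such as :class:`linear_model.LinearRegression`, a minimal
sketch (on made-up data) looks like::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 1.0, 2.0, 2.5])
    w = np.array([1, 2, 1, 3])  # integer sample weights

    # Fitting with integer sample weights...
    weighted = LinearRegression().fit(X, y, sample_weight=w)
    # ...matches fitting on data where each sample is repeated w[i] times.
    repeated = LinearRegression().fit(np.repeat(X, w, axis=0), np.repeat(y, w))

    assert np.allclose(weighted.coef_, repeated.coef_)
    assert np.allclose(weighted.intercept_, repeated.intercept_)
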
Comment on lines +1854 to +1855
Member: equivalent to what? I don't understand this sentence.

`sample_weight` can be passed both as an argument of the estimator's
`fit` method for model training and as a parameter of a :term:`scorer`
for model evaluation. These callables are said to *consume* the sample
weights, while other components of scikit-learn can *route* the weights
to the underlying estimators or scorers (see
:ref:`glossary_metadata_routing`).
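
Both kinds of consumers can be sketched as follows (made-up data;
:class:`linear_model.LogisticRegression` and
:func:`metrics.accuracy_score` are just example consumers)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] > 0).astype(int)
    w = rng.uniform(0.5, 2.0, size=100)

    # The estimator's fit method consumes the weights during training...
    clf = LogisticRegression().fit(X, y, sample_weight=w)
    # ...and a metric can consume them again during evaluation.
    score = accuracy_score(y, clf.predict(X), sample_weight=w)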

Weighting samples can be useful in several contexts. For instance, if
the training data is not uniformly sampled from the target population,
this bias can be corrected by weighting each training data point by the
`inverse probability
<https://en.wikipedia.org/wiki/Inverse_probability_weighting>`_ of
its selection for training (e.g. inverse propensity weighting). Sample
weights are also useful to model the frequency of an event of interest
per unit of time on a dataset of observations with different exposure
durations per individual (see
:ref:`sphx_glr_auto_examples_linear_model_plot_poisson_regression_non_normal_loss.py`
and
:ref:`sphx_glr_auto_examples_linear_model_plot_tweedie_regression_insurance_claims.py`).
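
The inverse probability weighting idea can be sketched as follows,
assuming for simplicity that the selection probabilities are known
(in practice they usually have to be estimated)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(42)
    X = rng.normal(size=(500, 2))
    y = (X[:, 0] > 0).astype(int)

    # Suppose each point entered the training set with a known,
    # non-uniform selection probability depending on X[:, 1].
    p_select = 1.0 / (1.0 + np.exp(-X[:, 1]))
    selected = rng.uniform(size=500) < p_select

    # Weighting each selected point by 1 / p_select corrects for the
    # non-uniform sampling when fitting on the selected subset.
    model = LogisticRegression().fit(
        X[selected], y[selected], sample_weight=1.0 / p_select[selected]
    )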

Third-party libraries can also use `sample_weight`-compatible
estimators as building blocks to reduce a specific statistical task to
a weighted regression or classification task. For instance, sample
weights can be constructed to adjust a time-to-event model for
censoring in a predictive survival analysis setting. In causal
inference, it is possible to reduce a conditional average treatment
effect estimation task to a weighted regression task under some
assumptions. Sample weights can also be used to mitigate
fairness-related harms based on a given quantitative definition of
fairness.
Comment on lines +1877 to +1886
Member: This paragraph doesn't feel like it belongs here.

Some model hyper-parameters are expressed in terms of a discrete number
of samples in a region of the feature space. When fitting with sample
weights, a count of samples is often automatically converted to a sum
of their weights, as is the case for `min_samples` in
:class:`cluster.DBSCAN`, for instance. However, this is not always the
case. In particular, the ``min_samples_leaf`` parameter in
:class:`ensemble.RandomForestClassifier` does not take weights into
account. One should instead pass `min_weight_fraction_leaf` to
:class:`ensemble.RandomForestClassifier` to specify the minimum
fraction of the total sample weight that must be present in a leaf.
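
The difference between the two parameters can be sketched as follows
(illustrative values on made-up data)::

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    w = np.random.RandomState(0).uniform(0.1, 10.0, size=200)

    # min_samples_leaf counts samples and ignores their weights...
    rf_counts = RandomForestClassifier(min_samples_leaf=5, random_state=0)
    rf_counts.fit(X, y, sample_weight=w)

    # ...while min_weight_fraction_leaf requires each leaf to hold at
    # least the given fraction of the total sample weight.
    rf_weighted = RandomForestClassifier(
        min_weight_fraction_leaf=0.02, random_state=0
    )
    rf_weighted.fit(X, y, sample_weight=w)
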
Comment on lines +1888 to +1897
Member: I think this information is best placed in the docstrings of those classes, not here, or here keep only the first 3-4 lines.

In classification, weights can also be specified for all samples
belonging to a given target class with the :term:`class_weight`
estimator :term:`parameter`. If both ``sample_weight`` and
``class_weight`` are provided, the final weight assigned to a sample is
the product of the two.
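
This product rule can be sketched with
:func:`utils.class_weight.compute_sample_weight`, which expands
``class_weight`` into one weight per sample::

    import numpy as np
    from sklearn.utils.class_weight import compute_sample_weight

    y = np.array([0, 0, 0, 1])
    sample_weight = np.array([1.0, 1.0, 2.0, 1.0])

    # class_weight is expanded into a per-sample weight...
    cw = compute_sample_weight(class_weight={0: 1.0, 1: 3.0}, y=y)

    # ...and the effective weight is the element-wise product.
    effective_weight = sample_weight * cw  # [1., 1., 2., 3.]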

At the time of writing (version 1.7), not all scikit-learn estimators
correctly implement the weight-repetition equivalence property. The
`#16298 meta issue
<https://github.com/scikit-learn/scikit-learn/issues/16298>`_ tracks
ongoing work to detect and fix remaining discrepancies.

Furthermore, some estimators have a stochastic fit method: for
instance, :class:`cluster.KMeans` depends on a random initialization,
and bagging models randomly resample from the training data. In such
cases, the sample weight-repetition equivalence property described
above does not hold exactly. However, it should hold at least in
expectation over the randomness of the fitting procedure.
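
Checking the equivalence in expectation can be sketched by averaging
over many random seeds (a rough illustration on made-up data)::

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[0.0], [0.2], [5.0], [5.1], [5.2]])
    w = np.array([3, 1, 1, 1, 1])
    X_repeated = np.repeat(X, w, axis=0)

    # With a single fixed seed the two fits need not match exactly,
    # but averaged over many seeds they should roughly agree.
    centers_w, centers_r = [], []
    for seed in range(20):
        km_w = KMeans(n_clusters=2, n_init=1, random_state=seed)
        km_r = KMeans(n_clusters=2, n_init=1, random_state=seed)
        centers_w.append(np.sort(km_w.fit(X, sample_weight=w).cluster_centers_, axis=0))
        centers_r.append(np.sort(km_r.fit(X_repeated).cluster_centers_, axis=0))

    print(np.mean(centers_w, axis=0))  # close to...
    print(np.mean(centers_r, axis=0))  # ...this average.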
Member (Author): Note to reviewers: I think the equivalence should even hold in distribution rather than just in expectation, but I am not 100% sure this can always be enforced in scikit-learn (yet).

``X``
Denotes data that is observed at training and prediction time, used as