Commit 5b74193

Merge branch 'main' into obliquepr
2 parents 4a4b4bb + fb4dbfd commit 5b74193

76 files changed: +2257 −648 lines changed


.github/ISSUE_TEMPLATE/bug_report.yml (2 additions, 2 deletions)

@@ -19,7 +19,7 @@ body:
 attributes:
 label: Steps/Code to Reproduce
 description: |
-Please add a [minimal code example](https://scikit-learn.org/stable/developers/minimal_reproducer.html) that can reproduce the error when running it. Be as succinct as possible, **do not depend on external data files**: instead you can generate synthetic data using `numpy.random`, [sklearn.datasets.make_regression](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html), [sklearn.datasets.make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) or a few lines of Python code. Example:
+Please add a [minimal code example](https://scikit-learn.org/dev/developers/minimal_reproducer.html) that can reproduce the error when running it. Be as succinct as possible, **do not depend on external data files**: instead you can generate synthetic data using `numpy.random`, [sklearn.datasets.make_regression](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html), [sklearn.datasets.make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) or a few lines of Python code. Example:
 
 ```python
 from sklearn.feature_extraction.text import CountVectorizer
@@ -40,7 +40,7 @@ body:
 
 In short, **we are going to copy-paste your code** to run it and we expect to get the same result as you.
 
-We acknowledge that crafting a [minimal reproducible code example](https://scikit-learn.org/stable/developers/minimal_reproducer.html) requires some effort on your side but it really helps the maintainers quickly reproduce the problem and analyze its cause without any ambiguity. Ambiguous bug reports tend to be slower to fix because they will require more effort and back and forth discussion between the maintainers and the reporter to pin-point the precise conditions necessary to reproduce the problem.
+We acknowledge that crafting a [minimal reproducible code example](https://scikit-learn.org/dev/developers/minimal_reproducer.html) requires some effort on your side but it really helps the maintainers quickly reproduce the problem and analyze its cause without any ambiguity. Ambiguous bug reports tend to be slower to fix because they will require more effort and back and forth discussion between the maintainers and the reporter to pin-point the precise conditions necessary to reproduce the problem.
 placeholder: |
 ```
 Sample code to reproduce the problem

azure-pipelines.yml (10 additions, 0 deletions)

@@ -147,6 +147,7 @@ jobs:
 BLAS: 'mkl'
 COVERAGE: 'true'
 SHOW_SHORT_SUMMARY: 'true'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '42'  # default global random seed
 
 # Check compilation with Ubuntu bionic 18.04 LTS and scipy from conda-forge
 - template: build_tools/azure/posix.yml
@@ -168,6 +169,7 @@ jobs:
 BLAS: 'openblas'
 COVERAGE: 'false'
 BUILD_WITH_ICC: 'false'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '0'  # non-default seed
 
 - template: build_tools/azure/posix.yml
 parameters:
@@ -190,6 +192,7 @@ jobs:
 PANDAS_VERSION: 'none'
 THREADPOOLCTL_VERSION: 'min'
 COVERAGE: 'false'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '1'  # non-default seed
 # Linux + Python 3.8 build with OpenBLAS
 py38_conda_defaults_openblas:
 DISTRIB: 'conda'
@@ -201,6 +204,7 @@ jobs:
 MATPLOTLIB_VERSION: 'min'
 THREADPOOLCTL_VERSION: '2.2.0'
 SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES: '1'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '2'  # non-default seed
 # Linux environment to test the latest available dependencies.
 # It runs tests requiring lightgbm, pandas and PyAMG.
 pylatest_pip_openblas_pandas:
@@ -210,6 +214,7 @@ jobs:
 CHECK_PYTEST_SOFT_DEPENDENCY: 'true'
 TEST_DOCSTRINGS: 'true'
 CHECK_WARNINGS: 'true'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '3'  # non-default seed
 
 - template: build_tools/azure/posix-docker.yml
 parameters:
@@ -231,6 +236,7 @@ jobs:
 PYTEST_XDIST_VERSION: 'none'
 PYTEST_VERSION: 'min'
 THREADPOOLCTL_VERSION: '2.2.0'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '4'  # non-default seed
 
 - template: build_tools/azure/posix.yml
 parameters:
@@ -249,12 +255,14 @@ jobs:
 BLAS: 'mkl'
 CONDA_CHANNEL: 'conda-forge'
 CPU_COUNT: '3'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '5'  # non-default seed
 pylatest_conda_mkl_no_openmp:
 DISTRIB: 'conda'
 BLAS: 'mkl'
 SKLEARN_TEST_NO_OPENMP: 'true'
 SKLEARN_SKIP_OPENMP_TEST: 'true'
 CPU_COUNT: '3'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '6'  # non-default seed
 
 - template: build_tools/azure/windows.yml
 parameters:
@@ -280,6 +288,8 @@ jobs:
 # Temporary fix for setuptools to use disutils from standard lib
 # https://github.com/numpy/numpy/issues/17216
 SETUPTOOLS_USE_DISTUTILS: 'stdlib'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '7'  # non-default seed
 py38_pip_openblas_32bit:
 PYTHON_VERSION: '3.8'
 PYTHON_ARCH: '32'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '8'  # non-default seed

build_tools/azure/test_script.sh (7 additions, 0 deletions)

@@ -15,6 +15,13 @@ if [[ "$BUILD_WITH_ICC" == "true" ]]; then
     source /opt/intel/oneapi/setvars.sh
 fi
 
+if [[ "$BUILD_REASON" == "Schedule" ]]; then
+    # Enable global random seed randomization to discover seed-sensitive tests
+    # only on nightly builds.
+    # https://scikit-learn.org/stable/computing/parallelism.html#environment-variables
+    export SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any"
+fi
+
 mkdir -p $TEST_DIR
 cp setup.cfg $TEST_DIR
 cd $TEST_DIR

doc/computing/parallelism.rst (54 additions, 0 deletions)

@@ -194,6 +194,60 @@ These environment variables should be set before importing scikit-learn.
     Sets the seed of the global random generator when running the tests,
     for reproducibility.
 
+    Note that scikit-learn tests are expected to run deterministically with
+    explicit seeding of their own independent RNG instances instead of relying
+    on the numpy or Python standard library RNG singletons to make sure that
+    test results are independent of the test execution order. However some
+    tests might forget to use explicit seeding and this variable is a way to
+    control the initial state of the aforementioned singletons.
+
+:SKLEARN_TESTS_GLOBAL_RANDOM_SEED:
+
+    Controls the seeding of the random number generator used in tests that
+    rely on the `global_random_seed` fixture.
+
+    All tests that use this fixture accept the contract that they should
+    deterministically pass for any seed value from 0 to 99 included.
+
+    If the SKLEARN_TESTS_GLOBAL_RANDOM_SEED environment variable is set to
+    "any" (which should be the case on nightly builds on the CI), the fixture
+    will choose an arbitrary seed in the above range (based on the BUILD_NUMBER
+    or the current day) and all fixtured tests will run for that specific seed.
+    The goal is to ensure that, over time, our CI will run all tests with
+    different seeds while keeping the test duration of a single run of the
+    full test suite limited. This will check that the assertions of tests
+    written to use this fixture are not dependent on a specific seed value.
+
+    The range of admissible seed values is limited to [0, 99] because it is
+    often not possible to write a test that can work for any possible seed and
+    we want to avoid having tests that randomly fail on the CI.
+
+    Valid values for SKLEARN_TESTS_GLOBAL_RANDOM_SEED:
+
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="42": run tests with a fixed seed of 42
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="40-42": run the tests with all seeds
+      between 40 and 42 included
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any": run the tests with an arbitrary
+      seed selected between 0 and 99 included
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all": run the tests with all seeds
+      between 0 and 99 included
+
+    If the variable is not set, then 42 is used as the global seed in a
+    deterministic manner. This ensures that, by default, the scikit-learn test
+    suite is as deterministic as possible to avoid disrupting our friendly
+    third-party package maintainers. Similarly, this variable should not be set
+    in the CI config of pull-requests to make sure that our friendly
+    contributors are not the first people to encounter a seed-sensitivity
+    regression in a test unrelated to the changes of their own PR. Only the
+    scikit-learn maintainers who watch the results of the nightly builds are
+    expected to be annoyed by this.
+
+    When writing a new test function that uses this fixture, please use the
+    following command to make sure that it passes deterministically for all
+    admissible seeds on your local machine:
+
+        SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -v -k test_your_test_name
+
 :SKLEARN_SKIP_NETWORK_TESTS:
 
     When this environment variable is set to a non zero value, the tests
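The seed-selection rules documented in the parallelism.rst addition above can be sketched in a few lines of Python. This is an illustrative sketch of the documented contract only, not scikit-learn's actual `global_random_seed` fixture code, and the helper name `seeds_from_env` is hypothetical:

```python
import datetime
import os


def seeds_from_env(value=None, default=42):
    """Map a SKLEARN_TESTS_GLOBAL_RANDOM_SEED value to the list of seeds
    a global_random_seed-style fixture would iterate over (sketch only)."""
    if value is None:
        return [default]                # unset: deterministic default of 42
    if value == "all":
        return list(range(100))         # every admissible seed, 0..99
    if value == "any":
        # nightly CI: an arbitrary but admissible seed; the docs mention
        # deriving it from BUILD_NUMBER or the current day
        entropy = os.environ.get("BUILD_NUMBER") or datetime.date.today().toordinal()
        return [int(entropy) % 100]
    if "-" in value:                    # inclusive range such as "40-42"
        low, high = map(int, value.split("-"))
        return list(range(low, high + 1))
    return [int(value)]                 # a single fixed seed such as "42"
```

For example, `seeds_from_env("40-42")` yields `[40, 41, 42]`, matching the inclusive-range semantics described for the environment variable.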

doc/conf.py (0 additions, 2 deletions)

@@ -1,5 +1,3 @@
-# -*- coding: utf-8 -*-
-#
 # scikit-learn documentation build configuration file, created by
 # sphinx-quickstart on Fri Jan 8 09:13:42 2010.
 #

doc/developers/contributing.rst (3 additions, 3 deletions)

@@ -254,7 +254,7 @@ how to set up your git repository:
 
 .. prompt:: bash $
 
-    pip install pytest pytest-cov flake8 mypy black==22.1.0
+    pip install pytest pytest-cov flake8 mypy numpydoc black==22.1.0
 
 .. _upstream:
 
@@ -391,10 +391,10 @@ complies with the following rules before marking a PR as ``[MRG]``. The
 with `pytest`, but it is usually not recommended since it takes a long
 time. It is often enough to only run the test related to your changes:
 for example, if you changed something in
-`sklearn/linear_model/logistic.py`, running the following commands will
+`sklearn/linear_model/_logistic.py`, running the following commands will
 usually be enough:
 
-- `pytest sklearn/linear_model/logistic.py` to make sure the doctest
+- `pytest sklearn/linear_model/_logistic.py` to make sure the doctest
   examples are correct
 - `pytest sklearn/linear_model/tests/test_logistic.py` to run the tests
   specific to the file

doc/install.rst (2 additions, 2 deletions)

@@ -80,8 +80,8 @@ Then run:
 ><span class="sk-expandable" data-packager="pip" data-os="mac" data-venv="no">pip install -U scikit-learn</span
 ><span class="sk-expandable" data-packager="pip" data-os="windows" data-venv="no">pip install -U scikit-learn</span
 ><span class="sk-expandable" data-packager="pip" data-os="linux" data-venv="no">pip3 install -U scikit-learn</span
-><span class="sk-expandable" data-packager="conda" data-venv="">conda create -n sklearn-env -c conda-forge scikit-learn</span
-><span class="sk-expandable" data-packager="conda" data-venv="">conda activate sklearn-env</span
+><span class="sk-expandable" data-packager="conda">conda create -n sklearn-env -c conda-forge scikit-learn</span
+><span class="sk-expandable" data-packager="conda">conda activate sklearn-env</span
 ></code></pre></div>
 
 In order to check your installation you can use

doc/modules/preprocessing.rst (117 additions, 12 deletions)

@@ -594,17 +594,19 @@ dataset::
     array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
 
 If there is a possibility that the training data might have missing categorical
-features, it can often be better to specify ``handle_unknown='ignore'`` instead
-of setting the ``categories`` manually as above. When
-``handle_unknown='ignore'`` is specified and unknown categories are encountered
-during transform, no error will be raised but the resulting one-hot encoded
-columns for this feature will be all zeros
-(``handle_unknown='ignore'`` is only supported for one-hot encoding)::
-
-    >>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
+features, it can often be better to specify
+`handle_unknown='infrequent_if_exist'` instead of setting the `categories`
+manually as above. When `handle_unknown='infrequent_if_exist'` is specified
+and unknown categories are encountered during transform, no error will be
+raised but the resulting one-hot encoded columns for this feature will be all
+zeros or considered as an infrequent category if enabled
+(`handle_unknown='infrequent_if_exist'` is only supported for one-hot
+encoding)::
+
+    >>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X)
-    OneHotEncoder(handle_unknown='ignore')
+    OneHotEncoder(handle_unknown='infrequent_if_exist')
     >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
    array([[1., 0., 0., 0., 0., 0.]])
 
@@ -621,7 +623,8 @@ since co-linearity would cause the covariance matrix to be non-invertible::
     ... ['female', 'from Europe', 'uses Firefox']]
     >>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
     >>> drop_enc.categories_
-    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
+    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object),
+    array(['uses Firefox', 'uses Safari'], dtype=object)]
     >>> drop_enc.transform(X).toarray()
     array([[1., 1., 1.],
            [0., 0., 0.]])
@@ -634,7 +637,8 @@ categories. In this case, you can set the parameter `drop='if_binary'`.
     ... ['female', 'Asia', 'Chrome']]
     >>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary').fit(X)
     >>> drop_enc.categories_
-    [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object), array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
+    [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object),
+    array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
     >>> drop_enc.transform(X).toarray()
     array([[1., 0., 0., 1., 0., 0., 1.],
            [0., 0., 1., 0., 0., 1., 0.],
@@ -699,6 +703,107 @@ separate categories::
 See :ref:`dict_feature_extraction` for categorical features that are
 represented as a dict, not as scalars.
 
+.. _one_hot_encoder_infrequent_categories:
+
+Infrequent categories
+---------------------
+
+:class:`OneHotEncoder` supports aggregating infrequent categories into a single
+output for each feature. The parameters to enable the gathering of infrequent
+categories are `min_frequency` and `max_categories`.
+
+1. `min_frequency` is either an integer greater or equal to 1, or a float in
+   the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories with
+   a cardinality smaller than `min_frequency` will be considered infrequent.
+   If `min_frequency` is a float, categories with a cardinality smaller than
+   this fraction of the total number of samples will be considered infrequent.
+   The default value is 1, which means every category is encoded separately.
+
+2. `max_categories` is either `None` or any integer greater than 1. This
+   parameter sets an upper limit to the number of output features for each
+   input feature. `max_categories` includes the feature that combines
+   infrequent categories.
+
+In the following example, the categories `'dog'` and `'snake'` are considered
+infrequent::
+
+    >>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
+    ...               ['snake'] * 3], dtype=object).T
+    >>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse=False).fit(X)
+    >>> enc.infrequent_categories_
+    [array(['dog', 'snake'], dtype=object)]
+    >>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
+    array([[0., 0., 1.],
+           [1., 0., 0.],
+           [0., 1., 0.],
+           [0., 0., 1.]])
+
+By setting `handle_unknown` to `'infrequent_if_exist'`, unknown categories will
+be considered infrequent::
+
+    >>> enc = preprocessing.OneHotEncoder(
+    ...     handle_unknown='infrequent_if_exist', sparse=False, min_frequency=6)
+    >>> enc = enc.fit(X)
+    >>> enc.transform(np.array([['dragon']]))
+    array([[0., 0., 1.]])
+
+:meth:`OneHotEncoder.get_feature_names_out` uses 'infrequent' as the infrequent
+feature name::
+
+    >>> enc.get_feature_names_out()
+    array(['x0_cat', 'x0_rabbit', 'x0_infrequent_sklearn'], dtype=object)
+
+When `handle_unknown` is set to `'infrequent_if_exist'` and an unknown
+category is encountered in transform:
+
+1. If infrequent category support was not configured or there was no
+   infrequent category during training, the resulting one-hot encoded columns
+   for this feature will be all zeros. In the inverse transform, an unknown
+   category will be denoted as `None`.
+
+2. If there is an infrequent category during training, the unknown category
+   will be considered infrequent. In the inverse transform, 'infrequent_sklearn'
+   will be used to represent the infrequent category.
+
+Infrequent categories can also be configured using `max_categories`. In the
+following example, we set `max_categories=2` to limit the number of features in
+the output. This will result in all but the `'cat'` category being considered
+infrequent, leading to two features, one for `'cat'` and one for infrequent
+categories, which are all the others::
+
+    >>> enc = preprocessing.OneHotEncoder(max_categories=2, sparse=False)
+    >>> enc = enc.fit(X)
+    >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+    array([[0., 1.],
+           [1., 0.],
+           [0., 1.],
+           [0., 1.]])
+
+If both `max_categories` and `min_frequency` are non-default values, then
+categories are selected based on `min_frequency` first and `max_categories`
+categories are kept. In the following example, `min_frequency=4` considers
+only `snake` to be infrequent, but `max_categories=3` forces `dog` to also be
+infrequent::
+
+    >>> enc = preprocessing.OneHotEncoder(min_frequency=4, max_categories=3, sparse=False)
+    >>> enc = enc.fit(X)
+    >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+    array([[0., 0., 1.],
+           [1., 0., 0.],
+           [0., 1., 0.],
+           [0., 0., 1.]])
+
+If there are infrequent categories with the same cardinality at the cutoff of
+`max_categories`, then the first `max_categories` are taken based on lexicon
+ordering. In the following example, "b", "c", and "d" have the same cardinality
+and with `max_categories=3`, "b" and "c" are infrequent because they have a
+higher lexicon order.
+
+    >>> X = np.asarray([["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10], dtype=object).T
+    >>> enc = preprocessing.OneHotEncoder(max_categories=3).fit(X)
+    >>> enc.infrequent_categories_
+    [array(['b', 'c'], dtype=object)]
+
 .. _preprocessing_discretization:
 
 Discretization
@@ -981,7 +1086,7 @@ Interestingly, a :class:`SplineTransformer` of ``degree=0`` is the same as
   Penalties <10.1214/ss/1038425655>`. Statist. Sci. 11 (1996), no. 2, 89--121.
 
 * Perperoglou, A., Sauerbrei, W., Abrahamowicz, M. et al. :doi:`A review of
-  spline function procedures in R <10.1186/s12874-019-0666-3>`.
+  spline function procedures in R <10.1186/s12874-019-0666-3>`.
   BMC Med Res Methodol 19, 46 (2019).
 
 .. _function_transformer:
