Commit 5b74193

Merge branch 'main' into obliquepr
2 parents 4a4b4bb + fb4dbfd commit 5b74193

76 files changed: +2257 −648 lines changed


.github/ISSUE_TEMPLATE/bug_report.yml (2 additions, 2 deletions)

@@ -19,7 +19,7 @@ body:
 attributes:
 label: Steps/Code to Reproduce
 description: |
-Please add a [minimal code example](https://scikit-learn.org/stable/developers/minimal_reproducer.html) that can reproduce the error when running it. Be as succinct as possible, **do not depend on external data files**: instead you can generate synthetic data using `numpy.random`, [sklearn.datasets.make_regression](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html), [sklearn.datasets.make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) or a few lines of Python code. Example:
+Please add a [minimal code example](https://scikit-learn.org/dev/developers/minimal_reproducer.html) that can reproduce the error when running it. Be as succinct as possible, **do not depend on external data files**: instead you can generate synthetic data using `numpy.random`, [sklearn.datasets.make_regression](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html), [sklearn.datasets.make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) or a few lines of Python code. Example:
 
 ```python
 from sklearn.feature_extraction.text import CountVectorizer
@@ -40,7 +40,7 @@ body:
 
 In short, **we are going to copy-paste your code** to run it and we expect to get the same result as you.
 
-We acknowledge that crafting a [minimal reproducible code example](https://scikit-learn.org/stable/developers/minimal_reproducer.html) requires some effort on your side but it really helps the maintainers quickly reproduce the problem and analyze its cause without any ambiguity. Ambiguous bug reports tend to be slower to fix because they will require more effort and back and forth discussion between the maintainers and the reporter to pin-point the precise conditions necessary to reproduce the problem.
+We acknowledge that crafting a [minimal reproducible code example](https://scikit-learn.org/dev/developers/minimal_reproducer.html) requires some effort on your side but it really helps the maintainers quickly reproduce the problem and analyze its cause without any ambiguity. Ambiguous bug reports tend to be slower to fix because they will require more effort and back and forth discussion between the maintainers and the reporter to pin-point the precise conditions necessary to reproduce the problem.
 placeholder: |
 ```
 Sample code to reproduce the problem

azure-pipelines.yml (10 additions, 0 deletions)

@@ -147,6 +147,7 @@ jobs:
 BLAS: 'mkl'
 COVERAGE: 'true'
 SHOW_SHORT_SUMMARY: 'true'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '42'  # default global random seed
 
 # Check compilation with Ubuntu bionic 18.04 LTS and scipy from conda-forge
 - template: build_tools/azure/posix.yml
@@ -168,6 +169,7 @@ jobs:
 BLAS: 'openblas'
 COVERAGE: 'false'
 BUILD_WITH_ICC: 'false'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '0'  # non-default seed
 
 - template: build_tools/azure/posix.yml
 parameters:
@@ -190,6 +192,7 @@ jobs:
 PANDAS_VERSION: 'none'
 THREADPOOLCTL_VERSION: 'min'
 COVERAGE: 'false'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '1'  # non-default seed
 # Linux + Python 3.8 build with OpenBLAS
 py38_conda_defaults_openblas:
 DISTRIB: 'conda'
@@ -201,6 +204,7 @@ jobs:
 MATPLOTLIB_VERSION: 'min'
 THREADPOOLCTL_VERSION: '2.2.0'
 SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES: '1'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '2'  # non-default seed
 # Linux environment to test the latest available dependencies.
 # It runs tests requiring lightgbm, pandas and PyAMG.
 pylatest_pip_openblas_pandas:
@@ -210,6 +214,7 @@ jobs:
 CHECK_PYTEST_SOFT_DEPENDENCY: 'true'
 TEST_DOCSTRINGS: 'true'
 CHECK_WARNINGS: 'true'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '3'  # non-default seed
 
 - template: build_tools/azure/posix-docker.yml
 parameters:
@@ -231,6 +236,7 @@ jobs:
 PYTEST_XDIST_VERSION: 'none'
 PYTEST_VERSION: 'min'
 THREADPOOLCTL_VERSION: '2.2.0'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '4'  # non-default seed
 
 - template: build_tools/azure/posix.yml
 parameters:
@@ -249,12 +255,14 @@ jobs:
 BLAS: 'mkl'
 CONDA_CHANNEL: 'conda-forge'
 CPU_COUNT: '3'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '5'  # non-default seed
 pylatest_conda_mkl_no_openmp:
 DISTRIB: 'conda'
 BLAS: 'mkl'
 SKLEARN_TEST_NO_OPENMP: 'true'
 SKLEARN_SKIP_OPENMP_TEST: 'true'
 CPU_COUNT: '3'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '6'  # non-default seed
 
 - template: build_tools/azure/windows.yml
 parameters:
@@ -280,6 +288,8 @@ jobs:
 # Temporary fix for setuptools to use disutils from standard lib
 # https://github.com/numpy/numpy/issues/17216
 SETUPTOOLS_USE_DISTUTILS: 'stdlib'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '7'  # non-default seed
 py38_pip_openblas_32bit:
 PYTHON_VERSION: '3.8'
 PYTHON_ARCH: '32'
+SKLEARN_TESTS_GLOBAL_RANDOM_SEED: '8'  # non-default seed

build_tools/azure/test_script.sh (7 additions, 0 deletions)

@@ -15,6 +15,13 @@ if [[ "$BUILD_WITH_ICC" == "true" ]]; then
     source /opt/intel/oneapi/setvars.sh
 fi
 
+if [[ "$BUILD_REASON" == "Schedule" ]]; then
+    # Enable global random seed randomization to discover seed-sensitive tests
+    # only on nightly builds.
+    # https://scikit-learn.org/stable/computing/parallelism.html#environment-variables
+    export SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any"
+fi
+
 mkdir -p $TEST_DIR
 cp setup.cfg $TEST_DIR
 cd $TEST_DIR

doc/computing/parallelism.rst (54 additions, 0 deletions)

@@ -194,6 +194,60 @@ These environment variables should be set before importing scikit-learn.
     Sets the seed of the global random generator when running the tests,
     for reproducibility.
 
+    Note that scikit-learn tests are expected to run deterministically with
+    explicit seeding of their own independent RNG instances instead of relying
+    on the numpy or Python standard library RNG singletons to make sure that
+    test results are independent of the test execution order. However some
+    tests might forget to use explicit seeding and this variable is a way to
+    control the initial state of the aforementioned singletons.
+
+:SKLEARN_TESTS_GLOBAL_RANDOM_SEED:
+
+    Controls the seeding of the random number generator used in tests that
+    rely on the `global_random_seed` fixture.
+
+    All tests that use this fixture accept the contract that they should
+    deterministically pass for any seed value from 0 to 99 included.
+
+    If the SKLEARN_TESTS_GLOBAL_RANDOM_SEED environment variable is set to
+    "any" (which should be the case on nightly builds on the CI), the fixture
+    will choose an arbitrary seed in the above range (based on the BUILD_NUMBER
+    or the current day) and all fixtured tests will run for that specific seed.
+    The goal is to ensure that, over time, our CI will run all tests with
+    different seeds while keeping the test duration of a single run of the
+    full test suite limited. This will check that the assertions of tests
+    written to use this fixture are not dependent on a specific seed value.
+
+    The range of admissible seed values is limited to [0, 99] because it is
+    often not possible to write a test that can work for any possible seed and
+    we want to avoid having tests that randomly fail on the CI.
+
+    Valid values for SKLEARN_TESTS_GLOBAL_RANDOM_SEED:
+
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="42": run tests with a fixed seed of 42
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="40-42": run the tests with all seeds
+      between 40 and 42 included
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any": run the tests with an arbitrary
+      seed selected between 0 and 99 included
+    - SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all": run the tests with all seeds
+      between 0 and 99 included
+
+    If the variable is not set, then 42 is used as the global seed in a
+    deterministic manner. This ensures that, by default, the scikit-learn test
+    suite is as deterministic as possible to avoid disrupting our friendly
+    third-party package maintainers. Similarly, this variable should not be set
+    in the CI config of pull-requests to make sure that our friendly
+    contributors are not the first people to encounter a seed-sensitivity
+    regression in a test unrelated to the changes of their own PR. Only the
+    scikit-learn maintainers who watch the results of the nightly builds are
+    expected to be annoyed by this.
+
+    When writing a new test function that uses this fixture, please use the
+    following command to make sure that it passes deterministically for all
+    admissible seeds on your local machine:
+
+        SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -v -k test_your_test_name
+
 :SKLEARN_SKIP_NETWORK_TESTS:
 
     When this environment variable is set to a non zero value, the tests
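The seed-selection rules documented in the parallelism.rst addition above can be sketched in a few lines of Python. This is an illustrative sketch of the documented contract only, not scikit-learn's actual `global_random_seed` fixture code, and the helper name `seeds_from_env` is hypothetical:

```python
import datetime
import os


def seeds_from_env(value=None, default=42):
    """Map a SKLEARN_TESTS_GLOBAL_RANDOM_SEED value to the list of seeds
    a global_random_seed-style fixture would iterate over (sketch only)."""
    if value is None:
        return [default]                # unset: deterministic default of 42
    if value == "all":
        return list(range(100))         # every admissible seed, 0..99
    if value == "any":
        # nightly CI: an arbitrary but admissible seed; the docs mention
        # deriving it from BUILD_NUMBER or the current day
        entropy = os.environ.get("BUILD_NUMBER") or datetime.date.today().toordinal()
        return [int(entropy) % 100]
    if "-" in value:                    # inclusive range such as "40-42"
        low, high = map(int, value.split("-"))
        return list(range(low, high + 1))
    return [int(value)]                 # a single fixed seed such as "42"
```

For example, `seeds_from_env("40-42")` yields `[40, 41, 42]`, matching the inclusive-range semantics described for the environment variable.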

doc/conf.py (0 additions, 2 deletions)

@@ -1,5 +1,3 @@
-# -*- coding: utf-8 -*-
-#
 # scikit-learn documentation build configuration file, created by
 # sphinx-quickstart on Fri Jan 8 09:13:42 2010.
 #

doc/developers/contributing.rst (3 additions, 3 deletions)

@@ -254,7 +254,7 @@ how to set up your git repository:
 
 .. prompt:: bash $
 
-    pip install pytest pytest-cov flake8 mypy black==22.1.0
+    pip install pytest pytest-cov flake8 mypy numpydoc black==22.1.0
 
 .. _upstream:
 
@@ -391,10 +391,10 @@ complies with the following rules before marking a PR as ``[MRG]``. The
 with `pytest`, but it is usually not recommended since it takes a long
 time. It is often enough to only run the test related to your changes:
 for example, if you changed something in
-`sklearn/linear_model/logistic.py`, running the following commands will
+`sklearn/linear_model/_logistic.py`, running the following commands will
 usually be enough:
 
-- `pytest sklearn/linear_model/logistic.py` to make sure the doctest
+- `pytest sklearn/linear_model/_logistic.py` to make sure the doctest
   examples are correct
 - `pytest sklearn/linear_model/tests/test_logistic.py` to run the tests
   specific to the file

doc/install.rst (2 additions, 2 deletions)

@@ -80,8 +80,8 @@ Then run:
 ><span class="sk-expandable" data-packager="pip" data-os="mac" data-venv="no">pip install -U scikit-learn</span
 ><span class="sk-expandable" data-packager="pip" data-os="windows" data-venv="no">pip install -U scikit-learn</span
 ><span class="sk-expandable" data-packager="pip" data-os="linux" data-venv="no">pip3 install -U scikit-learn</span
-><span class="sk-expandable" data-packager="conda" data-venv="">conda create -n sklearn-env -c conda-forge scikit-learn</span
-><span class="sk-expandable" data-packager="conda" data-venv="">conda activate sklearn-env</span
+><span class="sk-expandable" data-packager="conda">conda create -n sklearn-env -c conda-forge scikit-learn</span
+><span class="sk-expandable" data-packager="conda">conda activate sklearn-env</span
 ></code></pre></div>
 
 In order to check your installation you can use

doc/modules/preprocessing.rst (117 additions, 12 deletions)

@@ -594,17 +594,19 @@ dataset::
     array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
 
 If there is a possibility that the training data might have missing categorical
-features, it can often be better to specify ``handle_unknown='ignore'`` instead
-of setting the ``categories`` manually as above. When
-``handle_unknown='ignore'`` is specified and unknown categories are encountered
-during transform, no error will be raised but the resulting one-hot encoded
-columns for this feature will be all zeros
-(``handle_unknown='ignore'`` is only supported for one-hot encoding)::
-
-    >>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
+features, it can often be better to specify
+`handle_unknown='infrequent_if_exist'` instead of setting the `categories`
+manually as above. When `handle_unknown='infrequent_if_exist'` is specified
+and unknown categories are encountered during transform, no error will be
+raised but the resulting one-hot encoded columns for this feature will be all
+zeros or considered as an infrequent category if enabled
+(`handle_unknown='infrequent_if_exist'` is only supported for one-hot
+encoding)::
+
+    >>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X)
-    OneHotEncoder(handle_unknown='ignore')
+    OneHotEncoder(handle_unknown='infrequent_if_exist')
     >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
    array([[1., 0., 0., 0., 0., 0.]])
 
@@ -621,7 +623,8 @@ since co-linearity would cause the covariance matrix to be non-invertible::
     ... ['female', 'from Europe', 'uses Firefox']]
     >>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
     >>> drop_enc.categories_
-    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
+    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object),
+    array(['uses Firefox', 'uses Safari'], dtype=object)]
     >>> drop_enc.transform(X).toarray()
     array([[1., 1., 1.],
            [0., 0., 0.]])
@@ -634,7 +637,8 @@ categories. In this case, you can set the parameter `drop='if_binary'`.
     ... ['female', 'Asia', 'Chrome']]
     >>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary').fit(X)
     >>> drop_enc.categories_
-    [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object), array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
+    [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object),
+    array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
     >>> drop_enc.transform(X).toarray()
     array([[1., 0., 0., 1., 0., 0., 1.],
            [0., 0., 1., 0., 0., 1., 0.],
@@ -699,6 +703,107 @@ separate categories::
 See :ref:`dict_feature_extraction` for categorical features that are
 represented as a dict, not as scalars.
 
+.. _one_hot_encoder_infrequent_categories:
+
+Infrequent categories
+---------------------
+
+:class:`OneHotEncoder` supports aggregating infrequent categories into a single
+output for each feature. The parameters to enable the gathering of infrequent
+categories are `min_frequency` and `max_categories`.
+
+1. `min_frequency` is either an integer greater or equal to 1, or a float in
+   the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories with
+   a cardinality smaller than `min_frequency` will be considered infrequent.
+   If `min_frequency` is a float, categories with a cardinality smaller than
+   this fraction of the total number of samples will be considered infrequent.
+   The default value is 1, which means every category is encoded separately.
+
+2. `max_categories` is either `None` or any integer greater than 1. This
+   parameter sets an upper limit to the number of output features for each
+   input feature. `max_categories` includes the feature that combines
+   infrequent categories.
+
+In the following example, the categories `'dog'` and `'snake'` are considered
+infrequent::
+
+    >>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
+    ...               ['snake'] * 3], dtype=object).T
+    >>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse=False).fit(X)
+    >>> enc.infrequent_categories_
+    [array(['dog', 'snake'], dtype=object)]
+    >>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
+    array([[0., 0., 1.],
+           [1., 0., 0.],
+           [0., 1., 0.],
+           [0., 0., 1.]])
+
+By setting `handle_unknown` to `'infrequent_if_exist'`, unknown categories will
+be considered infrequent::
+
+    >>> enc = preprocessing.OneHotEncoder(
+    ...     handle_unknown='infrequent_if_exist', sparse=False, min_frequency=6)
+    >>> enc = enc.fit(X)
+    >>> enc.transform(np.array([['dragon']]))
+    array([[0., 0., 1.]])
+
+:meth:`OneHotEncoder.get_feature_names_out` uses 'infrequent' as the infrequent
+feature name::
+
+    >>> enc.get_feature_names_out()
+    array(['x0_cat', 'x0_rabbit', 'x0_infrequent_sklearn'], dtype=object)
+
+When `handle_unknown` is set to `'infrequent_if_exist'` and an unknown
+category is encountered in transform:
+
+1. If infrequent category support was not configured or there was no
+   infrequent category during training, the resulting one-hot encoded columns
+   for this feature will be all zeros. In the inverse transform, an unknown
+   category will be denoted as `None`.
+
+2. If there is an infrequent category during training, the unknown category
+   will be considered infrequent. In the inverse transform, 'infrequent_sklearn'
+   will be used to represent the infrequent category.
+
+Infrequent categories can also be configured using `max_categories`. In the
+following example, we set `max_categories=2` to limit the number of features in
+the output. This will result in all but the `'cat'` category being considered
+infrequent, leading to two features, one for `'cat'` and one for infrequent
+categories, which are all the others::
+
+    >>> enc = preprocessing.OneHotEncoder(max_categories=2, sparse=False)
+    >>> enc = enc.fit(X)
+    >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+    array([[0., 1.],
+           [1., 0.],
+           [0., 1.],
+           [0., 1.]])
+
+If both `max_categories` and `min_frequency` are non-default values, then
+categories are selected based on `min_frequency` first and `max_categories`
+categories are kept. In the following example, `min_frequency=4` considers
+only `snake` to be infrequent, but `max_categories=3` forces `dog` to also be
+infrequent::
+
+    >>> enc = preprocessing.OneHotEncoder(min_frequency=4, max_categories=3, sparse=False)
+    >>> enc = enc.fit(X)
+    >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+    array([[0., 0., 1.],
+           [1., 0., 0.],
+           [0., 1., 0.],
+           [0., 0., 1.]])
+
+If there are infrequent categories with the same cardinality at the cutoff of
+`max_categories`, then the first `max_categories` are taken based on lexicon
+ordering. In the following example, "b", "c", and "d" have the same cardinality
+and with `max_categories=3`, "b" and "c" are infrequent because they have a
+higher lexicon order.
+
+    >>> X = np.asarray([["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10], dtype=object).T
+    >>> enc = preprocessing.OneHotEncoder(max_categories=3).fit(X)
+    >>> enc.infrequent_categories_
+    [array(['b', 'c'], dtype=object)]
+
 .. _preprocessing_discretization:
 
 Discretization
@@ -981,7 +1086,7 @@ Interestingly, a :class:`SplineTransformer` of ``degree=0`` is the same as
   Penalties <10.1214/ss/1038425655>`. Statist. Sci. 11 (1996), no. 2, 89--121.
 
 * Perperoglou, A., Sauerbrei, W., Abrahamowicz, M. et al. :doi:`A review of
-  spline function procedures in R <10.1186/s12874-019-0666-3>`.
+  spline function procedures in R <10.1186/s12874-019-0666-3>`.
   BMC Med Res Methodol 19, 46 (2019).
 
 .. _function_transformer:
