Extend SequentialFeatureSelector example to demonstrate how to use negative tol #25525

jnsofini · 2023-01-31T22:20:43Z

Describe the bug

I utilized the SequentialFeatureSelector for feature selection in my code, with the direction set to "backward." The tolerance value is negative and the selection process stops when the decrease in the metric, AUC in this case, is less than the specified tolerance. Generally, increasing the number of features results in a higher AUC, but sacrificing some features, especially correlated ones that offer little contribution, can produce a pessimistic model with a lower AUC. The code worked as expected in sklearn 1.1.1, but when I updated to sklearn 1.2.1, I encountered the following error.

Steps/Code to Reproduce

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

TOL = -0.001
feature_selector = SequentialFeatureSelector(
                    LogisticRegression(max_iter=1000),
                    n_features_to_select="auto",
                    direction="backward",
                    scoring="roc_auc",
                    tol=TOL
                )


pipe = Pipeline(
    [('scaler', StandardScaler()), 
    ('feature_selector', feature_selector), 
    ('log_reg', LogisticRegression(max_iter=1000))]
    )



if __name__ == "__main__":
    pipe.fit(X, y)
    print(pipe['log_reg'].coef_[0])

Expected Results

$ python sfs_tol.py 
[-2.0429818   0.5364346  -1.35765488 -2.85009904 -2.84603016]

Actual Results

$ python sfs_tol.py 
Traceback (most recent call last):
  File "/home/modelling/users-workspace/nsofinij/lab/open-source/sfs_tol.py", line 28, in <module>
    pipe.fit(X, y)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/pipeline.py", line 401, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/pipeline.py", line 359, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/base.py", line 862, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/feature_selection/_sequential.py", line 201, in fit
    self._validate_params()
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/base.py", line 581, in _validate_params
    validate_parameter_constraints(
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'tol' parameter of SequentialFeatureSelector must be None or a float in the range (0, inf). Got -0.001 instead.

Versions

System:
    python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
executable: /home/modelling/opt/anaconda3/envs/py310/bin/python
   machine: Linux-4.14.301-224.520.amzn2.x86_64-x86_64-with-glibc2.26

Python dependencies:
      sklearn: 1.2.1
          pip: 23.0
   setuptools: 66.1.1
        numpy: 1.24.1
        scipy: 1.10.0
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.6.3
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 64

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64

The text was updated successfully, but these errors were encountered:

thomasjpfan · 2023-02-09T17:02:25Z

During the triaging meeting, we think that allowing for tol to be negative makes sense. It would be first tol parameter in scikit-learn to accept negative values, so we also need to update the docstring to be clear about what negative tolerance's mean and when it is useful.

ogrisel · 2023-02-10T18:35:45Z

For the record, we also discussed the alternative to deprecate the tol parameter in favor of a parameter with a different name but we judged that maybe this is not needed and just updating the docstring to discuss the meaning and purpose of negative tol (and positive tol values) would be less disruptive and probably enough, at least in a first iteration.

ogrisel · 2023-02-10T18:37:29Z

Maybe we could also extend the existing example or craft a new one to demonstrate how to use negative tols. The breast cancer dataset seems to be an interesting use case.

Anybody interested in doing a PR for some or all of the above (the example can come in a distinct PR not to delay the merge of the review and merge of the fix).

jnsofini · 2023-02-10T21:11:29Z

@ogrisel which existing example is referred to in your post? I could look into improving it. I can expand the breast cancer case and show how a different values of tol will generate different models and the pessimistic behaviours. Also showing how the behaviour of a positive tol differ from that of a negative tol will be nice to have.

ogrisel · 2023-02-11T08:54:47Z

I checked and there is only one example for SequentialFeatureSelector:

https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-diabetes-py

The body of this example uses the diabetes dataset that has probably too few features to be able to demonstrate your use of negative tolerance values.

I think it's fine to keep the existing contents as it is but then append a new section that demonstrate backward selection with a negative tolerance to select a smaller number of features on the Brest cancer dataset.

jeremiedbb · 2023-02-22T15:52:18Z

I opened #25664 to fix the regression but leave the example part for another PR to not delay the 1.2.2 release.

rprkh · 2023-04-17T17:57:20Z

I think it's fine to keep the existing contents as it is but then append a new section that demonstrate backward selection with a negative tolerance to select a smaller number of features on the Brest cancer dataset.

Hey @ogrisel, I created a Colab Notebook for feature selection on the breast cancer dataset using SequentialFeatureSelector, that makes use of negative values of tol. I will work on creating a PR for this based on the code in the notebook.

jnsofini added Bug Needs Triage Issue requires triage labels Jan 31, 2023

thomasjpfan added module:feature_extraction and removed Needs Triage Issue requires triage labels Feb 9, 2023

thomasjpfan added this to the 1.2.2 milestone Feb 9, 2023

jeremiedbb mentioned this issue Feb 22, 2023

FIX Allow negative tol in SequentialFeatureSelector #25664

Merged

jeremiedbb changed the title ~~SequentialFeatureSelector Failing to accept negative values of tol in scikit-learn version 1.2.1, but works for 1.1.1~~ Extend SequentialFeatureSelector example to demonstrate how to use negative tol Feb 27, 2023

jeremiedbb removed this from the 1.2.2 milestone Feb 28, 2023

rprkh mentioned this issue Apr 18, 2023

DOC example on feature selection using negative tol values #26205

Merged

ArturoAmorQ mentioned this issue Apr 24, 2023

DOC Improve docstring for tol in SequentialFeatureSelector #26271

Merged

Micky774 closed this as completed in #26205 Jul 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend SequentialFeatureSelector example to demonstrate how to use negative tol #25525

Extend SequentialFeatureSelector example to demonstrate how to use negative tol #25525

jnsofini commented Jan 31, 2023 •

edited by ogrisel

Loading

thomasjpfan commented Feb 9, 2023

ogrisel commented Feb 10, 2023

ogrisel commented Feb 10, 2023 •

edited

Loading

jnsofini commented Feb 10, 2023

ogrisel commented Feb 11, 2023

jeremiedbb commented Feb 22, 2023

rprkh commented Apr 17, 2023 •

edited

Loading

Extend SequentialFeatureSelector example to demonstrate how to use negative tol #25525

Extend SequentialFeatureSelector example to demonstrate how to use negative tol #25525

Comments

jnsofini commented Jan 31, 2023 • edited by ogrisel Loading

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

thomasjpfan commented Feb 9, 2023

ogrisel commented Feb 10, 2023

ogrisel commented Feb 10, 2023 • edited Loading

jnsofini commented Feb 10, 2023

ogrisel commented Feb 11, 2023

jeremiedbb commented Feb 22, 2023

rprkh commented Apr 17, 2023 • edited Loading

jnsofini commented Jan 31, 2023 •

edited by ogrisel

Loading

ogrisel commented Feb 10, 2023 •

edited

Loading

rprkh commented Apr 17, 2023 •

edited

Loading