Skip to content

Extend SequentialFeatureSelector example to demonstrate how to use negative tol #25525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jnsofini opened this issue Jan 31, 2023 · 7 comments · Fixed by #26205
Closed

Extend SequentialFeatureSelector example to demonstrate how to use negative tol #25525

jnsofini opened this issue Jan 31, 2023 · 7 comments · Fixed by #26205

Comments

@jnsofini
Copy link

jnsofini commented Jan 31, 2023

Describe the bug

I utilized the SequentialFeatureSelector for feature selection in my code, with the direction set to "backward." The tolerance value is negative and the selection process stops when the decrease in the metric, AUC in this case, is less than the specified tolerance. Generally, increasing the number of features results in a higher AUC, but sacrificing some features, especially correlated ones that offer little contribution, can produce a pessimistic model with a lower AUC. The code worked as expected in sklearn 1.1.1, but when I updated to sklearn 1.2.1, I encountered the following error.

Steps/Code to Reproduce

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

TOL = -0.001
feature_selector = SequentialFeatureSelector(
                    LogisticRegression(max_iter=1000),
                    n_features_to_select="auto",
                    direction="backward",
                    scoring="roc_auc",
                    tol=TOL
                )


pipe = Pipeline(
    [('scaler', StandardScaler()), 
    ('feature_selector', feature_selector), 
    ('log_reg', LogisticRegression(max_iter=1000))]
    )



if __name__ == "__main__":
    pipe.fit(X, y)
    print(pipe['log_reg'].coef_[0])

Expected Results

$ python sfs_tol.py 
[-2.0429818   0.5364346  -1.35765488 -2.85009904 -2.84603016]

Actual Results

$ python sfs_tol.py 
Traceback (most recent call last):
  File "/home/modelling/users-workspace/nsofinij/lab/open-source/sfs_tol.py", line 28, in <module>
    pipe.fit(X, y)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/pipeline.py", line 401, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/pipeline.py", line 359, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/base.py", line 862, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/feature_selection/_sequential.py", line 201, in fit
    self._validate_params()
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/base.py", line 581, in _validate_params
    validate_parameter_constraints(
  File "/home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'tol' parameter of SequentialFeatureSelector must be None or a float in the range (0, inf). Got -0.001 instead.

Versions

System:
    python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
executable: /home/modelling/opt/anaconda3/envs/py310/bin/python
   machine: Linux-4.14.301-224.520.amzn2.x86_64-x86_64-with-glibc2.26

Python dependencies:
      sklearn: 1.2.1
          pip: 23.0
   setuptools: 66.1.1
        numpy: 1.24.1
        scipy: 1.10.0
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.6.3
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 64

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/modelling/opt/anaconda3/envs/py310/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64
@jnsofini jnsofini added Bug Needs Triage Issue requires triage labels Jan 31, 2023
@thomasjpfan
Copy link
Member

During the triaging meeting, we think that allowing for tol to be negative makes sense. It would be first tol parameter in scikit-learn to accept negative values, so we also need to update the docstring to be clear about what negative tolerance's mean and when it is useful.

@thomasjpfan thomasjpfan added module:feature_extraction and removed Needs Triage Issue requires triage labels Feb 9, 2023
@thomasjpfan thomasjpfan added this to the 1.2.2 milestone Feb 9, 2023
@ogrisel
Copy link
Member

ogrisel commented Feb 10, 2023

For the record, we also discussed the alternative to deprecate the tol parameter in favor of a parameter with a different name but we judged that maybe this is not needed and just updating the docstring to discuss the meaning and purpose of negative tol (and positive tol values) would be less disruptive and probably enough, at least in a first iteration.

@ogrisel
Copy link
Member

ogrisel commented Feb 10, 2023

Maybe we could also extend the existing example or craft a new one to demonstrate how to use negative tols. The breast cancer dataset seems to be an interesting use case.

Anybody interested in doing a PR for some or all of the above (the example can come in a distinct PR not to delay the merge of the review and merge of the fix).

@jnsofini
Copy link
Author

@ogrisel which existing example is referred to in your post? I could look into improving it. I can expand the breast cancer case and show how a different values of tol will generate different models and the pessimistic behaviours. Also showing how the behaviour of a positive tol differ from that of a negative tol will be nice to have.

@ogrisel
Copy link
Member

ogrisel commented Feb 11, 2023

I checked and there is only one example for SequentialFeatureSelector:

https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-diabetes-py

The body of this example uses the diabetes dataset that has probably too few features to be able to demonstrate your use of negative tolerance values.

I think it's fine to keep the existing contents as it is but then append a new section that demonstrate backward selection with a negative tolerance to select a smaller number of features on the Brest cancer dataset.

@jeremiedbb
Copy link
Member

I opened #25664 to fix the regression but leave the example part for another PR to not delay the 1.2.2 release.

@jeremiedbb jeremiedbb changed the title SequentialFeatureSelector Failing to accept negative values of tol in scikit-learn version 1.2.1, but works for 1.1.1 Extend SequentialFeatureSelector example to demonstrate how to use negative tol Feb 27, 2023
@jeremiedbb jeremiedbb removed this from the 1.2.2 milestone Feb 28, 2023
@rprkh
Copy link
Contributor

rprkh commented Apr 17, 2023

I think it's fine to keep the existing contents as it is but then append a new section that demonstrate backward selection with a negative tolerance to select a smaller number of features on the Brest cancer dataset.

Hey @ogrisel, I created a Colab Notebook for feature selection on the breast cancer dataset using SequentialFeatureSelector, that makes use of negative values of tol. I will work on creating a PR for this based on the code in the notebook.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants