DOC example on feature selection using negative tol values #26205

Merged · 13 commits · Jul 26, 2023
53 changes: 50 additions & 3 deletions examples/feature_selection/plot_select_from_model_diabetes.py
@@ -120,9 +120,6 @@
print(f"Done in {toc_bwd - tic_bwd:.3f}s")

# %%
# Discussion
# ----------
#
# Interestingly, forward and backward selection have selected the same set of
# features. In general, this isn't the case and the two methods would lead to
# different results.
@@ -143,3 +140,53 @@
# attribute. The forward SFS is faster than the backward SFS because it only
# needs to perform `n_features_to_select = 2` iterations, while the backward
# SFS needs to perform `n_features - n_features_to_select = 8` iterations.
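#
# A back-of-the-envelope sketch of the difference (an illustration with the
# counts quoted above, not a measurement): each iteration scores every
# remaining candidate feature, so the two directions also differ in the total
# number of candidate evaluations, each of which is a cross-validated fit.
n_features, n_features_to_select = 10, 2
# forward: 10 candidates in the first iteration, 9 in the second
forward_evals = sum(range(n_features, n_features - n_features_to_select, -1))
# backward: 10 candidates, then 9, ... down to 3, over 8 iterations
backward_evals = sum(range(n_features, n_features_to_select, -1))
print(f"forward candidate evaluations: {forward_evals}")
print(f"backward candidate evaluations: {backward_evals}")

# %%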
#
# Using negative tolerance values
# -------------------------------
#
# :class:`~sklearn.feature_selection.SequentialFeatureSelector` with
# `direction="backward"` and a negative value of `tol` can be used to remove
# features from the dataset and return a smaller subset of the original
# features: features keep being dropped as long as each removal decreases
# the score by no more than `abs(tol)` (see the short sketch after the
# data-loading code below).
#
# We begin by loading the Breast Cancer dataset, consisting of 30 different
# features and 569 samples.
from sklearn.datasets import load_breast_cancer
import numpy as np

breast_cancer_data = load_breast_cancer()
X, y = breast_cancer_data.data, breast_cancer_data.target
feature_names = np.array(breast_cancer_data.feature_names)
print(breast_cancer_data.DESCR)
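
# %%
# A minimal sketch of the stopping rule behind a negative `tol` (made-up
# scores for illustration, not scikit-learn's internal code): a backward step
# that removes a feature is accepted as long as the score drops by no more
# than `abs(tol)`.
score_before = 0.955  # hypothetical score with the candidate feature kept
score_after = 0.950  # hypothetical score after removing it
tol = -1e-2
# the drop (0.005) is within abs(tol) (0.01), so the removal is accepted
print(f"removal accepted: {score_after - score_before >= tol}")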

# %%
# We will make use of the :class:`~sklearn.linear_model.LogisticRegression`
# estimator with :class:`~sklearn.feature_selection.SequentialFeatureSelector`
# to perform the feature selection.
from time import time

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

for tol in [-1e-2, -1e-3, -1e-4]:
    start = time()
    feature_selector = SequentialFeatureSelector(
        LogisticRegression(),
        n_features_to_select="auto",
        direction="backward",
        scoring="roc_auc",
        tol=tol,
        n_jobs=2,
    )
    model = make_pipeline(StandardScaler(), feature_selector, LogisticRegression())
    model.fit(X, y)
    end = time()
    print(f"\ntol: {tol}")
    # model[1] is the fitted SequentialFeatureSelector step of the pipeline
    print(f"Features selected: {feature_names[model[1].get_support()]}")
    print(f"ROC AUC score: {roc_auc_score(y, model.predict_proba(X)[:, 1]):.3f}")
    print(f"Done in {end - start:.3f}s")

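# %%
# A quick usage sketch (an addition, using the `model` fitted in the last
# loop iteration above): slicing the pipeline yields the reduced design
# matrix that the final classifier was trained on.
X_reduced = model[:-1].transform(X)
print(f"Reduced design matrix shape: {X_reduced.shape}")
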
# %%
# We can see that the number of selected features tends to increase as
# negative values of `tol` approach zero. The time taken for feature
# selection also decreases as `tol` gets closer to zero.