DOC Notebook style and enhanced descriptions and add example links for feature_selection.RFE #26950
Conversation
Thanks for the PR. That example is really not in good shape. It needs to be reworked before it really deserves to be linked like this; an example of the improvements is here: #26805. You can do that in the same PR.
This looks quite nice now.
```python
# %%
from sklearn.feature_selection import RFE

# Arbitrarily chosen; can be adjusted based on domain knowledge or iterative testing
```
In reality it should be set using a GridSearchCV, which you could also do here: start with an arbitrary number, then optimize it with GridSearchCV by putting it in a pipeline with SVC.
```python
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)
```
users should almost always use a pipeline instead of doing this.
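The reviewer's suggestion can be sketched as follows. This is a minimal illustration, assuming the digits dataset and the same linear SVC used elsewhere in the example; the choice of `n_features_to_select=10` is arbitrary:

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# RFE and the final classifier chained in one estimator: no manual
# rfe.transform(...) on train and test sets.
pipe = Pipeline(
    [
        ("rfe", RFE(estimator=SVC(kernel="linear", C=1), n_features_to_select=10)),
        ("svc", SVC(kernel="linear", C=1)),
    ]
)
pipe.fit(X_train, y_train)
print(f"test accuracy: {pipe.score(X_test, y_test):.3f}")
```

With the pipeline, the test set is transformed implicitly inside `predict`/`score`, so the feature selector is only ever fitted on training data.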
otherwise LGTM.
cc @scikit-learn/documentation-team maybe?
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Hi @Shreesha3112, thanks for your time and effort. Here is a batch of comments that I hope you will take as constructive criticism, which is my intention.
My overall suggestion is to revert all the changes made in this PR except for the introduction paragraph and the narrative on the dataset description. For the latter, you can alternatively link to the documentation of said dataset.
```python
# Display the first digit
plt.imshow(digits.images[0], cmap="gray")
plt.title(f"Label: {digits.target[0]}")
plt.axis("off")
plt.show()
```
This part is redundant with the Digit Dataset example.
```python
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define the parameters for the grid search
param_grid = {"rfe__n_features_to_select": [1, 5, 10, 20, 30, 40, 50, 64]}

# Create a pipeline with feature selection followed by SVM
pipe = Pipeline(
    [
        ("rfe", RFE(estimator=SVC(kernel="linear", C=1))),
        ("svc", SVC(kernel="linear", C=1)),
    ]
)

# Create the grid search object
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)

# Fit to the data and get the best estimator
grid_search.fit(X_train, y_train)
best_pipeline = grid_search.best_estimator_

# Extract the optimal number of features from the best estimator
optimal_num_features = best_pipeline.named_steps["rfe"].n_features_

print(f"Optimal number of features: {optimal_num_features}")
```
This whole part could be more easily done using the class RFECV, which is an optimized version of a grid search.
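As a hedged sketch of that suggestion (parameter choices such as `step=2` and `cv=3` are illustrative, chosen only to keep the run short), `RFECV` selects the number of features by cross-validation in a single fit:

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# RFECV cross-validates every candidate number of features internally,
# replacing the manual GridSearchCV over n_features_to_select.
rfecv = RFECV(
    estimator=SVC(kernel="linear", C=1),
    step=2,  # features removed per iteration; 1 is finer but slower
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
)
rfecv.fit(X, y)
print(f"Optimal number of features: {rfecv.n_features_}")
```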
```python
# Feature Selection Impact on Model Accuracy
# ------------------------------------------
#
# To understand the relationship between the number of features selected and model
# performance, let's train the :class:`~sklearn.svm.SVC` on various subsets of
# features ranked by :class:`~sklearn.feature_selection.RFE`. We'll then plot the
# accuracy of the model as a function of the number of features used. This will help
# us visualize any trade-offs between feature selection and model accuracy.

# %%
import numpy as np

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train with RFE to get the rankings (as done earlier in the code)
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X_train, y_train)
ranking = rfe.ranking_

# Store accuracies; adjust the step for finer granularity
num_features_list = [1, 5, 10, 20, 30, 40, 50, 64]
accuracies = []

for num_features in num_features_list:
    # Select top 'num_features' important features
    top_features_idx = np.where(ranking <= num_features)[0]
    X_train_selected = X_train[:, top_features_idx]
    X_test_selected = X_test[:, top_features_idx]

    # Train SVM and get accuracy
    svc_selected = SVC(kernel="linear", C=1)
    svc_selected.fit(X_train_selected, y_train)
    y_pred = svc_selected.predict(X_test_selected)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Plot the accuracies
plt.plot(num_features_list, accuracies, marker="o", linestyle="-")
plt.xlabel("Number of Selected Features")
plt.ylabel("Accuracy")
plt.title("Feature Selection Impact on Model Accuracy")
plt.grid(True)
plt.show()
```
This part is redundant with the RFECV example, where an interpretation and error bars are given.
```rst
For an example on usage, see
:ref:`sphx_glr_auto_examples_feature_selection_plot_rfe_digits.py`.
```
I'm not sure if we always want to link examples from the docstrings (unless really relevant to the introductory paragraph), as they already appear in the lowest part of the page.
I would say that we don't need it when we have a single example, which is the case here.
```python
# Visualizing Feature Importance after RFE
# ----------------------------------------
#
# :class:`~sklearn.feature_selection.RFE` provides a ranking of the features based on
# their importance. We can visualize this ranking to gain insights into which pixels
# (or features) are deemed most significant by :class:`~sklearn.feature_selection.RFE`
# in the digit classification task.

# %%
ranking = best_pipeline.named_steps["rfe"].ranking_.reshape(digits.images[0].shape)
plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
```
According to the documentation, `ranking_` is an array where the most important features are assigned rank 1, and the higher the rank, the less important the feature. Notice that in the current version of the example the shades of blue go from 1 to 64 (we have 8 x 8 pixels), whereas your code uses a model that has already truncated the feature space, keeping only the 5 x 5 most relevant pixels and degenerating them all to a value of `1`. I am afraid this was not the spirit of the example.
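The reviewer's point about `ranking_` can be checked directly. This is an illustrative sketch, with the 25-feature selector standing in for the truncated model criticized above:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# A selector truncated to 25 (5 x 5) features: all kept features share rank 1,
# so the ranking image cannot distinguish among the selected pixels.
rfe_25 = RFE(SVC(kernel="linear", C=1), n_features_to_select=25).fit(X, y)
print(np.sum(rfe_25.ranking_ == 1))  # 25

# Selecting down to a single feature yields a full ordering: ranks 1..64.
rfe_1 = RFE(SVC(kernel="linear", C=1), n_features_to_select=1).fit(X, y)
print(rfe_1.ranking_.min(), rfe_1.ranking_.max())  # 1 64
```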
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Taking into account the reviews of @ArturoAmorQ, I think that we could improve this PR by making an example that presents both RFE and RFECV.
In this case, we could also remove the other RFECV example, since it would be redundant, and redirect toward this newly improved example.
@Shreesha3112 Would you be able to address these issues?
I'm going to tag this PR as stalled for the time being.
I won't be able to contribute for the next 2-3 weeks. If anyone else wants to pick it up, they can go ahead.
Hi, I would love to work on this. Could someone please assign me?
Hi @adrinjalali and @glemaitre. Sorry to bother you, but is there any way I can help with this PR?
@raj-pulapakura you can open a new PR and continue the work from this PR. You can base your work on this PR's branch to keep the history.
On it!
Hi! I'm new to this repo (and open-source in general). Could someone help me get started?
Hey Sagnik, I've been doing some research on the same and here are some tips I found for an open-source beginner:
Please let me know if this helps and how it goes!
@adrinjalali , can I work on this issue? I am new to contributing to this repo and open source projects in general. How should I go about resolving this issue? |
@plon-Susk7 yeah go for it. |
Closing this PR as superseded by #28862. |
Reference Issues/PRs
Issue #26927
What does this implement/fix? Explain your changes.
This PR adds example links for `feature_selection.RFE`.
Class: `feature_selection.RFE`
Related example files:
Any other comments?