-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
DOC Combined examples for feature_selection.RFE and feature_selection.RFECV #28065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC Combined examples for feature_selection.RFE and feature_selection.RFECV #28065
Conversation
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove the file doc/sg_execution_times.rst
. We should add this file in the .gitignore
file indeed but this will be done in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are removing a file, we need to handle the redirection. Could you add the following in the redirects
dictionary in doc/conf.py
around l.300:
"auto_examples/feature_selection/plot_rfe_breast_cancer"
),
I wouldn't change the digits dataset to cancer as it is not about accuracy, but the spirit of this example is to show a visualization of the importance (or relevance if you may) of each pixel when predicting a digit. I would rather merge the two examples (RFE and RFECV) as they are but adding a bit of narrative. See my comment here on what the pixel color means and you can keep the introduction paragraph and the narrative on the dataset description from the original PR, as mentioned here. |
@ArturoAmorQ I see, thanks for the clarification. I'll revert back to the digits dataset. Given your suggestion to merge the RFE and RFECV examples, do we want to:
|
@glemaitre has suggested in this comment that RFE and RFECV should be shown, so I'm going to go with that. |
I'll make sure to do this after I have finished updating the code 👍 |
…raj-pulapakura/scikit-learn into add_links_feature_selection_RFE
plot_rfe_digits.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this file
I am getting a bit confused and my comment for redirection is not anymore useful. @ArturoAmorQ do you think it could be a good idea to remove |
ping @ArturoAmorQ |
If we merge the two examples, yes, we can remove it. But the current version of RFECV displays error bars (which cannot be done with the heatmap using the digits dataset) and discusses in a simple dataset what happens with correlated features. Notice that if we merge the two examples, that means using digits in a first section, and the synthetic dataset in a further section. |
# %% | ||
# Visualizing Feature Importance after RFE | ||
# ---------------------------------------- | ||
# | ||
# RFECV and RFE provide a ranking of the features based on their importance. | ||
# We can visualize this ranking to gain insights into which pixels | ||
# (or features) are deemed most significant by RFECV in the digit | ||
# classification task. | ||
|
||
# Plot pixel ranking | ||
# %% | ||
ranking = rfecv.ranking_.reshape(digits.images[0].shape) | ||
plt.matshow(ranking, cmap=plt.cm.Blues) | ||
plt.colorbar() | ||
plt.title("Ranking of pixels with RFE") | ||
plt.title("Ranking of pixels with RFECV") | ||
plt.show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the documentation ranking_
is an array where the most important features are assigned rank 1 and the higher ranking the less important. Notice that in the current version of the example you have shades of blue going from 1 to 64 (we have 8 x 8 pixels) whereas your code uses a model which already truncated the feature space to keep only the 42 most relevant pixels and degenerating the rest to a value of 1. I am afraid this was not the spirit of the example. You can alternatively show the feature importance of the whole 64 pixels using different seeds of the train_test_split
to check for stability.
I think there are 2 slightly contradicting objectives here: The first objective is to merge both examples because there's no point having 2 examples which both demonstrate the use of RFE/RFECV. The second objective is to give the user:
|
We can use a single dataset for both aspect. Another thing that we should do is to use an other estimator than SVM here since it does not scale with the number of samples. It would be best to use a |
@raj-pulapakura would you like to continue working on this PR? |
Closing this PR as superseded by #28862. |
Reference Issues/PRs
Follow up to #26950. Issue #26927
What does this implement/fix? Explain your changes.
This PR picks up from the work of @Shreesha3112 , whom I am very grateful to for providing the starting code.
I've followed the advice from @glemaitre in regards to this review: #26950 (review) . In particular, I've combined the RFE and RFECV examples into a single document. I've also swapped out the handwritten digits dataset for the breast cancer dataset, as the model performance for this particular dataset actually benefits from RFE.
I haven't deleted the redundant RFECV example, just in case.