Skip to content

DOC Combined examples for feature_selection.RFE and feature_selection.RFECV #28065

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

raj-pulapakura
Copy link
Contributor

@raj-pulapakura raj-pulapakura commented Jan 5, 2024

Reference Issues/PRs

Follow up to #26950. Issue #26927

What does this implement/fix? Explain your changes.

This PR picks up from the work of @Shreesha3112 , whom I am very grateful to for providing the starting code.

I've followed the advice from @glemaitre in regards to this review: #26950 (review) . In particular, I've combined the RFE and RFECV examples into a single document. I've also swapped out the handwritten digits dataset for the breast cancer dataset, as the model performance for this particular dataset actually benefits from RFE.

I haven't deleted the redundant RFECV example, just in case.

@raj-pulapakura raj-pulapakura changed the title Add links feature selection rfe DOC Combined examples for feature_selection.RFE and feature_selection.RFECV Jan 5, 2024
Copy link

github-actions bot commented Jan 5, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f1e642f. Link to the linter CI: here

@raj-pulapakura
Copy link
Contributor Author

@adrinjalali

@glemaitre glemaitre self-requested a review January 9, 2024 15:10
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove the file doc/sg_execution_times.rst. We should add this file in the .gitignore file indeed but this will be done in another PR.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are removing a file, we need to handle the redirection. Could you add the following in the redirects dictionary in doc/conf.py around l.300:

        "auto_examples/feature_selection/plot_rfe_breast_cancer"
    ),

@ArturoAmorQ
Copy link
Member

I wouldn't change the digits dataset to cancer as it is not about accuracy, but the spirit of this example is to show a visualization of the importance (or relevance if you may) of each pixel when predicting a digit. I would rather merge the two examples (RFE and RFECV) as they are but adding a bit of narrative.

See my comment here on what the pixel color means and you can keep the introduction paragraph and the narrative on the dataset description from the original PR, as mentioned here.

@raj-pulapakura
Copy link
Contributor Author

raj-pulapakura commented Jan 10, 2024

@ArturoAmorQ I see, thanks for the clarification. I'll revert back to the digits dataset.

Given your suggestion to merge the RFE and RFECV examples, do we want to:

  1. Show usage of RFE and RFECV, or
  2. Show usage of just RFECV

@raj-pulapakura
Copy link
Contributor Author

@glemaitre has suggested in this comment that RFE and RFECV should be shown, so I'm going to go with that.

@raj-pulapakura
Copy link
Contributor Author

Since we are removing a file, we need to handle the redirection. Could you add the following in the redirects dictionary in doc/conf.py around l.300:

        "auto_examples/feature_selection/plot_rfe_breast_cancer"
    ),

I'll make sure to do this after I have finished updating the code 👍

@glemaitre glemaitre self-requested a review January 10, 2024 13:10
@glemaitre glemaitre removed their request for review January 10, 2024 13:11
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this file

@glemaitre
Copy link
Member

I would rather merge the two examples (RFE and RFECV) as they are but adding a bit of narrative.

I am getting a bit confused and my comment for redirection is not anymore useful. @ArturoAmorQ do you think it could be a good idea to remove plot_rfe_with_cross_validation.py since it does not bring more value than the discussion that we will add here.

@raj-pulapakura
Copy link
Contributor Author

ping @ArturoAmorQ

@ArturoAmorQ
Copy link
Member

do you think it could be a good idea to remove plot_rfe_with_cross_validation.py since it does not bring more value than the discussion that we will add here.

If we merge the two examples, yes, we can remove it. But the current version of RFECV displays error bars (which cannot be done with the heatmap using the digits dataset) and discusses in a simple dataset what happens with correlated features.

Notice that if we merge the two examples, that means using digits in a first section, and the synthetic dataset in a further section.

Comment on lines +141 to 155
# %%
# Visualizing Feature Importance after RFE
# ----------------------------------------
#
# RFECV and RFE provide a ranking of the features based on their importance.
# We can visualize this ranking to gain insights into which pixels
# (or features) are deemed most significant by RFECV in the digit
# classification task.

# Plot pixel ranking
# %%
ranking = rfecv.ranking_.reshape(digits.images[0].shape)
plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.title("Ranking of pixels with RFECV")
plt.show()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the documentation ranking_ is an array where the most important features are assigned rank 1 and the higher ranking the less important. Notice that in the current version of the example you have shades of blue going from 1 to 64 (we have 8 x 8 pixels) whereas your code uses a model which already truncated the feature space to keep only the 42 most relevant pixels and degenerating the rest to a value of 1. I am afraid this was not the spirit of the example. You can alternatively show the feature importance of the whole 64 pixels using different seeds of the train_test_split to check for stability.

@raj-pulapakura
Copy link
Contributor Author

I think there are 2 slightly contradicting objectives here:

The first objective is to merge both examples because there's no point having 2 examples which both demonstrate the use of RFE/RFECV.

The second objective is to give the user:

  • A visual example which shows how the RFE algorithm identifies important features, using the digits dataset
  • A practical example which has a detailed narrative and analysis, using a synthetic dataset

@glemaitre
Copy link
Member

A visual example which shows how the RFE algorithm identifies important features, using the digits dataset

A practical example which has a detailed narrative and analysis, using a synthetic dataset

We can use a single dataset for both aspect.

Another thing that we should do is to use an other estimator than SVM here since it does not scale with the number of samples. It would be best to use a LogisticRegression if we use a linear kernel.

@marenwestermann
Copy link
Member

@raj-pulapakura would you like to continue working on this PR?

@ArturoAmorQ
Copy link
Member

Closing this PR as superseded by #28862.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants