DOC Notebook style and enhanced descriptions and add example links for feature_selection.RFE #26950


Conversation

Shreesha3112
Contributor

@Shreesha3112 Shreesha3112 commented Jul 31, 2023

Reference Issues/PRs

issue #26927

What does this implement/fix? Explain your changes.

This PR adds example links for feature_selection.RFE

Class: feature_selection.RFE

Related Example files:

Any other comments?

@github-actions

github-actions bot commented Jul 31, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: cc0aa51. Link to the linter CI: here

@adrinjalali
Member

Thanks for the PR.

That example is really not in good shape. It needs more work before it deserves to be linked like this; an example of the kind of improvements needed is here: #26805

You can do that in the same PR.

@Shreesha3112 Shreesha3112 changed the title from "DOC Add example links for feature_selection.RFE" to "DOC Notebook style and enhanced descriptions and add example links for feature_selection.RFE" on Aug 15, 2023
Member

@adrinjalali adrinjalali left a comment

This looks quite nice now.

# %%
from sklearn.feature_selection import RFE

# Arbitrarily chosen; can be adjusted based on domain knowledge or iterative testing

Member

In reality it should be set using a GridSearchCV, which you could also do here. You can start with an arbitrary number, then optimize it with GridSearchCV by putting the RFE in a pipeline with the SVC.

Comment on lines 101 to 102
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

Member

users should almost always use a pipeline instead of doing this.
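
For illustration, a minimal sketch of the pipeline alternative, assuming the same X_train/y_train/X_test/y_test split used in the example (the n_features_to_select value is arbitrary):

# Sketch only: chain RFE and the classifier so the same fitted RFE is applied
# consistently to training and test data, instead of transforming them by hand.
from sklearn.feature_selection import RFE
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

clf = make_pipeline(
    RFE(estimator=SVC(kernel="linear", C=1), n_features_to_select=10),
    SVC(kernel="linear", C=1),
)
clf.fit(X_train, y_train)         # RFE is fitted and applied internally
print(clf.score(X_test, y_test))  # the test set goes through the same fitted RFE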

Member

@adrinjalali adrinjalali left a comment

otherwise LGTM.

cc @scikit-learn/documentation-team maybe?

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@Shreesha3112
Contributor Author

cc @ArturoAmorQ @lucyleeow

Member

@ArturoAmorQ ArturoAmorQ left a comment

Hi @Shreesha3112, thanks for your time and effort. Here is a batch of comments that I hope you will take as constructive criticism, which is my intention.

My overall suggestion is to revert all the changes made in this PR except for the introduction paragraph and the narrative on the dataset description. For the latter you can alternatively link to the documentation of said dataset.

Comment on lines +35 to +39
# Display the first digit
plt.imshow(digits.images[0], cmap="gray")
plt.title(f"Label: {digits.target[0]}")
plt.axis("off")
plt.show()

Member

This part is redundant with the Digit Dataset example.

Comment on lines +85 to +110
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define the parameters for the grid search
param_grid = {"rfe__n_features_to_select": [1, 5, 10, 20, 30, 40, 50, 64]}

# Create a pipeline with feature selection followed by SVM
pipe = Pipeline(
    [
        ("rfe", RFE(estimator=SVC(kernel="linear", C=1))),
        ("svc", SVC(kernel="linear", C=1)),
    ]
)

# Create the grid search object
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)

# Fit to the data and get the best estimator
grid_search.fit(X_train, y_train)
best_pipeline = grid_search.best_estimator_

# Plot pixel ranking
# Extract the optimal number of features from the best estimator
optimal_num_features = best_pipeline.named_steps["rfe"].n_features_

print(f"Optimal number of features: {optimal_num_features}")

Member

This whole part could be more easily done using the class RFECV, which is an optimized version of a grid search.
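
For illustration, a minimal sketch of what the RFECV alternative could look like, assuming the same X_train/y_train split used in the example (the cv and scoring values are arbitrary):

# Sketch only: RFECV selects the number of features by cross-validation in a single
# fit, instead of grid-searching over n_features_to_select.
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

rfecv = RFECV(estimator=SVC(kernel="linear", C=1), step=1, cv=5, scoring="accuracy")
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")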

Comment on lines +153 to +200
# Feature Selection Impact on Model Accuracy
# ---------------------------------------------------
#
# To understand the relationship between the number of features selected and model
# performance, let's train the :class:`~sklearn.svm.SVC` on various subsets of
# features ranked by :class:`~sklearn.feature_selection.RFE`. We'll then plot the
# accuracy of the model as a function of the number of features used. This will help
# us visualize any trade-offs between feature selection and model accuracy.

# %%
import numpy as np

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train with RFE to get the rankings (as done earlier in the code)
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X_train, y_train)
ranking = rfe.ranking_

# Store accuracies
# Adjust the step for finer granularity
num_features_list = [1, 5, 10, 20, 30, 40, 50, 64]
accuracies = []

for num_features in num_features_list:
    # Select top 'num_features' important features
    top_features_idx = np.where(ranking <= num_features)[0]
    X_train_selected = X_train[:, top_features_idx]
    X_test_selected = X_test[:, top_features_idx]

    # Train SVM and get accuracy
    svc_selected = SVC(kernel="linear", C=1)
    svc_selected.fit(X_train_selected, y_train)
    y_pred = svc_selected.predict(X_test_selected)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Plot the accuracies
plt.plot(num_features_list, accuracies, marker="o", linestyle="-")
plt.xlabel("Number of Selected Features")
plt.ylabel("Accuracy")
plt.title("Feature Selection Impact on Model Accuracy")
plt.grid(True)
plt.show()

Member

@ArturoAmorQ ArturoAmorQ Sep 28, 2023

This part is redundant with the RFECV example, where an interpretation and error bars are given.

Comment on lines +76 to +77
For an example on usage, see
:ref:`sphx_glr_auto_examples_feature_selection_plot_rfe_digits.py`.

Member

I'm not sure if we always want to link examples from the docstrings (unless really relevant to the introductory paragraph), as they already appear at the bottom of the page.

Member

I would say that we don't need it when we have a single example, which is the case here.

Comment on lines +137 to 150
# Visualizing Feature Importance after RFE
# ----------------------------------------
#
# :class:`~sklearn.feature_selection.RFE` provides a ranking of the features based on
# their importance. We can visualize this ranking to gain insights into which pixels
# (or features) are deemed most significant by :class:`~sklearn.feature_selection.RFE`
# in the digit classification task.

# %%
ranking = best_pipeline.named_steps["rfe"].ranking_.reshape(digits.images[0].shape)
plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()

Member

According to the documentation, ranking_ is an array where the most important features are assigned rank 1, and the higher the rank, the less important the feature. Notice that in the current version of the example the shades of blue go from 1 to 64 (we have 8 x 8 pixels), whereas your code uses a model that already truncated the feature space to keep only the ~5 x 5 most relevant pixels, collapsing all of the selected ones to a shared rank of 1. I am afraid this was not the spirit of the example.
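
For illustration, a minimal sketch of how the full 1-to-64 ranking can be recovered for the plot, assuming the digits X/y and digits.images are loaded as in the example (n_features_to_select=1 is what forces a distinct rank per pixel):

# Sketch only: with n_features_to_select=1, ranking_ spans 1..64, so the heatmap shows
# a full ordering instead of a block of selected pixels all sharing rank 1.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rfe = RFE(estimator=SVC(kernel="linear", C=1), n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)  # 8 x 8 grid of pixel ranks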

Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
@glemaitre glemaitre self-requested a review November 2, 2023 20:45
Member

@glemaitre glemaitre left a comment

Taking into account the reviews of @ArturoAmorQ, I think that we could improve this PR by making an example that presents both RFE and RFECV.

In this case, we could also remove the other RFECV example, since it would be redundant, and redirect it toward this newly improved example.

@Shreesha3112 Would you be able to address these issues?

@glemaitre
Member

I'm going to tag this PR as stalled for the time being.

@Shreesha3112
Contributor Author

Taking into account the reviews of @ArturoAmorQ, I think that we could improve this PR by making an example that presents both RFE and RFECV.

In this case, we could also remove the other RFECV example, since it would be redundant, and redirect it toward this newly improved example.

@Shreesha3112 Would you be able to address these issues?

I won't be able to contribute for the next 2-3 weeks. If anyone else wants to pick it up, please go ahead.

@adrinjalali adrinjalali added the help wanted and good first issue (Easy with clear instructions to resolve) labels on Dec 5, 2023
@raj-pulapakura
Contributor

raj-pulapakura commented Dec 13, 2023

Hi, I would love to work on this. Could someone please assign me?

@raj-pulapakura
Contributor

raj-pulapakura commented Dec 19, 2023

Hi @adrinjalali and @glemaitre. Sorry to bother you, but is there any way I can help with this PR?

@adrinjalali
Member

@raj-pulapakura you can open a new PR and continue the work from this PR. You can base your work on this PR's branch to keep the history.

@raj-pulapakura
Contributor

On it!

@sagnik-t

Hi! I'm new to this repo (and open-source in general). Could someone help me get started?

@muchemicarol

Hi! I'm new to this repo (and open-source in general). Could someone help me get started?

Hey Sagnik,
Good start! Not sure if you received support for this, but here goes:

I've been doing some research on the same and here are some tips I found for an open-source beginner:

  1. Look through issues labelled beginner friendly / good first issue, or pick an issue that you feel you can work on.
  2. Join the community; it's linked in the README file on the home page of the repo.
  3. Follow the contribution guidelines, which include steps on how to fork the repo, push your code, and the contribution ethics.
  4. Pace yourself.

Please let me know if this helps and how it goes!

@plon-Susk7
Contributor

@adrinjalali, can I work on this issue? I am new to contributing to this repo and to open source projects in general. How should I go about resolving it?

@adrinjalali
Member

@plon-Susk7 yeah go for it.

@ArturoAmorQ
Member

Closing this PR as superseded by #28862.
