
RFECV to provide the average ranking_ and support_ #17782


Closed
apptimise opened this issue Jun 29, 2020 · 10 comments · Fixed by #30179

Comments

@apptimise

Describe the workflow you want to enable

Currently, RFECV searches for the optimal number of features to provide the best cross-validated score. It then fits an RFE on the whole training set (see L566 in the implementation).

So the score is a cross-validated score, but the ranking_ is computed on the whole set of data. Isn't this misleading? For example, with RFE and 3-fold cross-validation, I get these per-fold ranking_ values:

[1 1 4 1 1 2 3 5 7 6]
[1 1 2 1 1 7 4 3 5 6]
[1 1 2 1 1 7 5 6 3 4]

while RFECV returns something like:

[1 1 3 1 1 5 2 4 7 6]

which is the ranking_ RFE would provide if fit over the whole set. So, the score is a cross-validated score, while the ranking_ and support_ are not.

Describe your proposed solution

Imagine a situation where you get the best performance with k features, but the k features selected in some folds are different from the k features you would get by fitting the model on the whole training set. Wouldn't it be a better idea to return some form of averaged/voting-based ranking_ and support_?
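
For concreteness, a minimal sketch of the kind of voting-based aggregation I mean (illustrative only; the rank-summing scheme and names like rank_sum are not an existing scikit-learn API):

import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
k = 5  # in practice, the number of features chosen by RFECV (rfecv.n_features_)

# Sum the per-fold RFE rankings (lower is better)
rank_sum = np.zeros(X.shape[1])
for train_idx, _ in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    rank_sum += RFE(estimator, n_features_to_select=k).fit(X[train_idx], y[train_idx]).ranking_

# "Voted" support: keep the k features with the smallest summed rank
aggregated_support = np.zeros(X.shape[1], dtype=bool)
aggregated_support[np.argsort(rank_sum)[:k]] = True
# Aggregated ranking, using dense ranks so tied sums share a rank
aggregated_ranking = rankdata(rank_sum, method="dense").astype(int)
print(aggregated_ranking)
print(aggregated_support)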

@jnothman
Member

jnothman commented Jun 30, 2020 via email

@apptimise
Author

apptimise commented Jun 30, 2020

Thanks.

I understand that RFECV uses CV only to determine the best k, not the set of features.

That's why I think providing the selected set of features, or their rankings, computed on the whole set can be confusing (because that is not what the cross-validation part of the algorithm does or is intended for).

Also, let's consider a voting-based approach. Why would it be wrong (even if collinearity exists)? Consider 3 folds with the following ranking_s:

[1 2 3 4]
[2 1 4 3]
[1 3 2 4]

The sums are [4 6 9 11]; therefore, the aggregated ranking_ can be considered [1 2 3 4].
Now suppose that you fit on the whole training set and get something like [2 1 4 3] (similar to fold 2). All the ranks will differ from what can be considered an "average"/voting-based ranking.

@jnothman
Member

What if the first two features are identical but for random noise, and they are also the most important feature? Because the two are redundant, my splits might rank them:

[1, 4, 2, 3]
[4, 1, 2, 3]
[1, 4, 2, 3]
[4, 1, 2, 3]

Sum these...

[10, 10, 8, 12]

Not really useful.
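
To make the failure mode concrete, here is the same rank-summing aggregation applied to those hypothetical splits (a toy sketch, not anything RFECV actually computes):

import numpy as np
from scipy.stats import rankdata

# Hypothetical per-split rankings; features 1 and 2 are redundant copies of the most important feature
split_rankings = np.array([[1, 4, 2, 3],
                           [4, 1, 2, 3],
                           [1, 4, 2, 3],
                           [4, 1, 2, 3]])
rank_sum = split_rankings.sum(axis=0)                   # [10 10  8 12]
print(rankdata(rank_sum, method="dense").astype(int))   # [2 2 1 3]: the redundant pair, one of which
                                                        # tops every split, ties behind a weaker feature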

RFE is very explicitly an alternative to univariate selection, one that takes this kind of conditional dependency into account.

And grid search, which is essentially what is happening here, usually works by learning hyperparameters under CV, but final model parameters on the full training set.

@apptimise
Author

  • @jnothman I'm still not clear on why it's not useful. The final rank translates to [2, 2, 1, 3], which shows how the features behaved across the different folds. That said, all of these are hypothetical examples, so I'm trying to see if I can generate some data to prove my point.

  • BTW, do you have a reference on why/how RFE would help with situations such as multicollinearity?

@apptimise
Author

Ok, here is an example where the features selected by a voting-based approach lead to a better r2 score than the features selected by the current RFECV approach. I still use RFECV to decide how many features to choose, but then, instead of fitting it on the whole set, I reduce X using the voting vector and then compute the score:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

n_folds = 5
k_fold = KFold(n_splits=n_folds, shuffle=True, random_state=42)

rng = np.random.RandomState(seed=0)
X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)

estimator = SVR(kernel="linear")
rfecv = RFECV(estimator, step=1, scoring="r2", cv=k_fold)
rfecv = rfecv.fit(X, y)
print("RFECV support_", rfecv.support_)
print("RFECV ranking_", rfecv.ranking_)
print("RFECV grid_scores_", rfecv.grid_scores_)
print("RFECV score", "%.4f" % rfecv.score(X, y), "\n\n")

# voting based
from sklearn.feature_selection import RFE
from sklearn import metrics


X_train, y_train = X.copy(), y.copy()

n_select = rfecv.n_features_
print("features to select: ", n_select)
rfe_pipe = Pipeline(
    [
        ("rfe_select", RFE(estimator=estimator, n_features_to_select=n_select, step=1)),
        ("eval", estimator),
    ]
)

manual_cv_scores = []
votes = np.zeros(X.shape[1])
fold_count = 0
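# Re-run RFE on each training split and accumulate the per-fold rankings in `votes`
# (a lower summed rank means a feature was kept more consistently across folds)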
for train_ind, validation_ind in k_fold.split(X_train):
    fold_count += 1
    Xtr, Xval = X_train[train_ind, :], X_train[validation_ind, :]
    ytr, yval = y_train[train_ind], y_train[validation_ind]
    rfe_pipe.fit(Xtr, ytr)
    print(
        "Fold ",
        fold_count,
        "RFE ranking_:              ",
        rfe_pipe["rfe_select"].ranking_,
    )
    votes = votes + rfe_pipe["rfe_select"].ranking_
    y_pred = rfe_pipe.predict(Xval)
    manual_cv_scores.append(metrics.r2_score(yval, y_pred))

print("Votes:", votes)
print("Manual CV, rfe_pipe, scores:    ", ["%.6f" % r2 for r2 in manual_cv_scores])
print(
    "Mean CV (equals to",
    rfecv.n_features_,
    "th item in RFECV grid_scores_)",
    "%.6f" % np.mean(manual_cv_scores),
)
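# Keep the n_select features with the smallest summed rank, then refit and score on the full set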
estimator.fit(X[:, votes.argsort()[:n_select]], y)
ypred = estimator.predict(X[:, votes.argsort()[:n_select]])
print("voting_based score:", "%.4f" % metrics.r2_score(y, ypred))

RFECV score 0.6748
voting_based score: 0.6947

@jnothman
Member

jnothman commented Jun 30, 2020 via email

@apptimise
Author

apptimise commented Jul 10, 2020

Ok, this keeps coming back to me. Now, imagine we would like to use grid search across folds for model selection (hyperparameter tuning) or for checking overfitting.
I guess we can agree that the differences across folds need to be taken into account. I just want to know how you would address the problem when performing a grid search.

I guess using RFE in a Pipeline and then performing a grid search on the pipe will not result in an apples-to-apples comparison, because, again, the selected features might be different in each fold. So what would you do?

What I can think of is to run RFECV first, get the features selected on the whole training set, and then create a custom feature selector in my pipeline which selects exactly those features. This way, the selected features in the grid search will be the same across folds. To summarize:

Aim: model selection using grid search or comparing train-test scores using cross validation

RFECV -> select the best features for the whole training set
Create custom_feature_selector to select the above features
pipe: preprocessor - custom_feature_selector - predictor
GridSearchCV(pipe,...)

What do you think?
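
For concreteness, a rough sketch of that workflow, with the custom feature selector approximated by a FunctionTransformer that applies the frozen mask (the preprocessor, names, and parameter grid are just illustrative placeholders):

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# 1) RFECV on the whole training set -> a fixed feature mask
support = RFECV(SVR(kernel="linear"), cv=cv, scoring="r2").fit(X, y).support_

# 2) custom_feature_selector: always apply that mask
custom_feature_selector = FunctionTransformer(lambda Z: Z[:, support])

# 3) pipe: preprocessor - custom_feature_selector - predictor
pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("custom_feature_selector", custom_feature_selector),
    ("predictor", SVR(kernel="linear")),
])

# 4) GridSearchCV(pipe, ...): the selected features are now identical across folds
grid = GridSearchCV(pipe, {"predictor__C": [0.1, 1, 10]}, cv=cv, scoring="r2").fit(X, y)
print(grid.best_params_, "%.4f" % grid.best_score_)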

@glemaitre
Member

While answering the discussion in #20976, I came across this issue.

I think I am in line with @jnothman and this particular comment: #17782 (comment)

Now that we have a cv_results_ dictionary, we could store the support for each split. It might serve to understand whether the selection is more or less stable, with all the limitations that implies.

Regarding the behaviour of the current estimator, I think it is fine to refit a model with the optimal k.

@MarieSacksick
Contributor

> While answering the discussion in #20976, I came across this issue.
>
> I think I am in line with @jnothman and this particular comment: #17782 (comment)
>
> Now that we have a cv_results_ dictionary, we could store the support for each split. It might serve to understand whether the selection is more or less stable, with all the limitations that implies.
>
> Regarding the behaviour of the current estimator, I think it is fine to refit a model with the optimal k.

It would be awesome to have the support for each split! Is it planned anytime in the future?

@glemaitre
Member

> It would be awesome to have the support for each split! Is it planned anytime in the future?

I'm not aware of anyone working on it, but it looks like a good enhancement. @MarieS-WiMLDS, do you want to implement this feature?
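
In the meantime, a quick way to collect the per-split support_ manually and eyeball stability (a sketch that re-runs RFE on each training split with the size chosen by RFECV; it mirrors, but is not, RFECV's internal refits):

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
est = SVR(kernel="linear")

rfecv = RFECV(est, cv=cv, scoring="r2").fit(X, y)

# One support_ mask per training split, using the feature count chosen by RFECV
split_support = np.array([
    RFE(est, n_features_to_select=rfecv.n_features_).fit(X[tr], y[tr]).support_
    for tr, _ in cv.split(X, y)
])
print(split_support.astype(int))    # rows = splits, columns = features
print(split_support.mean(axis=0))   # fraction of splits that selected each feature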
