
RFECV to provide the average ranking_ and support_ #17782


Closed
apptimise opened this issue Jun 29, 2020 · 10 comments · Fixed by #30179

Comments

@apptimise

Describe the workflow you want to enable

Currently, RFECV searches for the optimal number of features to provide the best cross-validated score. It then fits an RFE on the whole training set (see L566 in the implementation).

So the score is a cross-validated score, but the ranking_ is computed on the whole set of data. Isn't this misleading? For example, with RFE and 3-fold cross-validation, I get these per-fold ranking_ values:

[1 1 4 1 1 2 3 5 7 6]
[1 1 2 1 1 7 4 3 5 6]
[1 1 2 1 1 7 5 6 3 4]

while RFECV returns something like:

[1 1 3 1 1 5 2 4 7 6]

which is the ranking_ RFE would provide if fit over the whole set. So, the score is a cross-validated score, while the ranking_ and support_ are not.

Describe your proposed solution

Imagine a situation where you get the best performance with k features, but the k features selected in some folds are different from the k features you would get by fitting the model on the whole training set. Wouldn't it be a better idea to return some form of averaged/voting-based ranking_ and support_?
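
For concreteness, a minimal sketch of the kind of voting-based aggregation I mean (illustrative only; the rank-summing scheme and names like rank_sum are not an existing scikit-learn API):

import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
k = 5  # in practice, the number of features chosen by RFECV (rfecv.n_features_)

# Sum the per-fold RFE rankings (lower is better)
rank_sum = np.zeros(X.shape[1])
for train_idx, _ in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    rank_sum += RFE(estimator, n_features_to_select=k).fit(X[train_idx], y[train_idx]).ranking_

# "Voted" support: keep the k features with the smallest summed rank
aggregated_support = np.zeros(X.shape[1], dtype=bool)
aggregated_support[np.argsort(rank_sum)[:k]] = True
# Aggregated ranking, using dense ranks so tied sums share a rank
aggregated_ranking = rankdata(rank_sum, method="dense").astype(int)
print(aggregated_ranking)
print(aggregated_support)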

@jnothman
Member

jnothman commented Jun 30, 2020 via email

@apptimise
Author

apptimise commented Jun 30, 2020

Thanks.

I understand that RFECV uses CV only to determine the best k, not the set of features.

That's why I think providing the selected set of features, or their rankings, computed on the whole set can be confusing (because that is not what the cross-validation part of the algorithm does or is intended for).

Also, let's consider a voting-based approach. Why would it be wrong (even if collinearity exists)? Consider 3 folds with the following ranking_s:

[1 2 3 4]
[2 1 4 3]
[1 3 2 4]

The sums are [4 6 9 11]; therefore, the aggregated ranking_ can be considered [1 2 3 4].
Now suppose that you fit on the whole training set and get something like [2 1 4 3] (similar to fold 2). All the ranks will differ from what can be considered an "average"/voting-based ranking.

@jnothman
Member

What if the first two features are identical but for random noise, and they are also the most important feature? Because the two are redundant, my splits might rank them:

[1, 4, 2, 3]
[4, 1, 2, 3]
[1, 4, 2, 3]
[4, 1, 2, 3]

Sum these...

[10, 10, 8, 12]

Not really useful.
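
To make the failure mode concrete, here is the same rank-summing aggregation applied to those hypothetical splits (a toy sketch, not anything RFECV actually computes):

import numpy as np
from scipy.stats import rankdata

# Hypothetical per-split rankings; features 1 and 2 are redundant copies of the most important feature
split_rankings = np.array([[1, 4, 2, 3],
                           [4, 1, 2, 3],
                           [1, 4, 2, 3],
                           [4, 1, 2, 3]])
rank_sum = split_rankings.sum(axis=0)                   # [10 10  8 12]
print(rankdata(rank_sum, method="dense").astype(int))   # [2 2 1 3]: the redundant pair, one of which
                                                        # tops every split, ties behind a weaker feature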

RFE is very explicitly an alternative to univariate selection, one that takes this kind of conditional dependency into account.

And grid search, which is essentially what is happening here, usually works by learning hyperparameters under CV, but final model parameters on the full training set.

@apptimise
Author

  • @jnothman I'm still not clear on why it's not useful. The final rank translates to [2, 2, 1, 3], which shows how the features behaved across the different folds. That said, all of these are hypothetical examples, so I'm trying to see if I can generate some data to prove my point.

  • BTW, do you have a reference on why/how RFE would help with situations such as multicollinearity?

@apptimise
Author

Ok, here is an example where the features selected by a voting-based approach lead to a better r2 score than the features selected by the current RFECV approach. I still use RFECV to decide how many features to choose, but then, instead of fitting it on the whole set, I reduce X using the voting vector and then compute the score:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

n_folds = 5
k_fold = KFold(n_splits=n_folds, shuffle=True, random_state=42)

rng = np.random.RandomState(seed=0)
X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)

estimator = SVR(kernel="linear")
rfecv = RFECV(estimator, step=1, scoring="r2", cv=k_fold)
rfecv = rfecv.fit(X, y)
print("RFECV support_", rfecv.support_)
print("RFECV ranking_", rfecv.ranking_)
print("RFECV grid_scores_", rfecv.grid_scores_)
print("RFECV score", "%.4f" % rfecv.score(X, y), "\n\n")

# voting based
from sklearn.feature_selection import RFE
from sklearn import metrics


X_train, y_train = X.copy(), y.copy()

n_select = rfecv.n_features_
print("features to select: ", n_select)
rfe_pipe = Pipeline(
    [
        ("rfe_select", RFE(estimator=estimator, n_features_to_select=n_select, step=1)),
        ("eval", estimator),
    ]
)

manual_cv_scores = []
votes = np.zeros(X.shape[1])
fold_count = 0
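# Re-run RFE on each training split and accumulate the per-fold rankings in `votes`
# (a lower summed rank means a feature was kept more consistently across folds)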
for train_ind, validation_ind in k_fold.split(X_train):
    fold_count += 1
    Xtr, Xval = X_train[train_ind, :], X_train[validation_ind, :]
    ytr, yval = y_train[train_ind], y_train[validation_ind]
    rfe_pipe.fit(Xtr, ytr)
    print(
        "Fold ",
        fold_count,
        "RFE ranking_:              ",
        rfe_pipe["rfe_select"].ranking_,
    )
    votes = votes + rfe_pipe["rfe_select"].ranking_
    y_pred = rfe_pipe.predict(Xval)
    manual_cv_scores.append(metrics.r2_score(yval, y_pred))

print("Votes:", votes)
print("Manual CV, rfe_pipe, scores:    ", ["%.6f" % r2 for r2 in manual_cv_scores])
print(
    "Mean CV (equals to",
    rfecv.n_features_,
    "th item in RFECV grid_scores_)",
    "%.6f" % np.mean(manual_cv_scores),
)
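# Keep the n_select features with the smallest summed rank, then refit and score on the full set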
estimator.fit(X[:, votes.argsort()[:n_select]], y)
ypred = estimator.predict(X[:, votes.argsort()[:n_select]])
print("voting_based score:", "%.4f" % metrics.r2_score(y, ypred))

RFECV score 0.6748
voting_based score: 0.6947

@jnothman
Member

jnothman commented Jun 30, 2020 via email

@apptimise
Author

apptimise commented Jul 10, 2020

Ok, this keeps coming back to me. Now, imagine we would like to use grid search across folds for model selection (hyperparameter tuning) or for checking overfitting.
I guess we can agree that the differences across folds need to be taken into account. I just want to know how you would address the problem when performing a grid search.

I guess using RFE in a Pipeline and then performing a grid search on the pipe will not result in an apples-to-apples comparison, because, again, the selected features might be different in each fold. So what would you do?

What I can think of is to run RFECV first, get the features selected on the whole training set, and then create a custom feature selector in my pipeline which selects exactly those features. This way, the selected features in the grid search will be the same across folds. To summarize:

Aim: model selection using grid search or comparing train-test scores using cross validation

RFECV -> select the best features for the whole training set
Create custom_feature_selector to select the above features
pipe: preprocessor - custom_feature_selector - predictor
GridSearchCV(pipe,...)

What do you think?
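
For concreteness, a rough sketch of that workflow, with the custom feature selector approximated by a FunctionTransformer that applies the frozen mask (the preprocessor, names, and parameter grid are just illustrative placeholders):

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# 1) RFECV on the whole training set -> a fixed feature mask
support = RFECV(SVR(kernel="linear"), cv=cv, scoring="r2").fit(X, y).support_

# 2) custom_feature_selector: always apply that mask
custom_feature_selector = FunctionTransformer(lambda Z: Z[:, support])

# 3) pipe: preprocessor - custom_feature_selector - predictor
pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("custom_feature_selector", custom_feature_selector),
    ("predictor", SVR(kernel="linear")),
])

# 4) GridSearchCV(pipe, ...): the selected features are now identical across folds
grid = GridSearchCV(pipe, {"predictor__C": [0.1, 1, 10]}, cv=cv, scoring="r2").fit(X, y)
print(grid.best_params_, "%.4f" % grid.best_score_)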

@glemaitre
Member

While answering the discussion in #20976, I came across this issue.

I think I am in line with @jnothman and this particular comment: #17782 (comment)

Now that we have a cv_results_ dictionary, we could store the support for each split. It might serve to understand whether the selection is more or less stable, with all the limitations that implies.

Regarding the behaviour of the current estimator, I think it is fine to refit a model with the optimal k.

@MarieSacksick
Contributor

> While answering the discussion in #20976, I came across this issue.
>
> I think I am in line with @jnothman and this particular comment: #17782 (comment)
>
> Now that we have a cv_results_ dictionary, we could store the support for each split. It might serve to understand whether the selection is more or less stable, with all the limitations that implies.
>
> Regarding the behaviour of the current estimator, I think it is fine to refit a model with the optimal k.

It would be awesome to have the support for each split! Is it planned anytime in the future?

@glemaitre
Member

> It would be awesome to have the support for each split! Is it planned anytime in the future?

I'm not aware of anyone working on it, but it looks like a good enhancement. @MarieS-WiMLDS, do you want to implement this feature?
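
In the meantime, a quick way to collect the per-split support_ manually and eyeball stability (a sketch that re-runs RFE on each training split with the size chosen by RFECV; it mirrors, but is not, RFECV's internal refits):

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
est = SVR(kernel="linear")

rfecv = RFECV(est, cv=cv, scoring="r2").fit(X, y)

# One support_ mask per training split, using the feature count chosen by RFECV
split_support = np.array([
    RFE(est, n_features_to_select=rfecv.n_features_).fit(X[tr], y[tr]).support_
    for tr, _ in cv.split(X, y)
])
print(split_support.astype(int))    # rows = splits, columns = features
print(split_support.mean(axis=0))   # fraction of splits that selected each feature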
