RFECV to provide the average ranking_ and support_ #17782
I understand that the RFECV uses CV only to determine the best k, not the
set of features. The attributes returned are therefore correct, and
reporting averages would be inappropriate when features have interactions
(e.g. collinearity). Providing the rankings produced by each split's RFE
would provide additional information about the stability of the model, but
not a better model.
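As a quick check of that point (the dataset and estimator below are arbitrary choices for illustration, not from this thread), RFECV's attributes can be compared against a plain RFE fitted on the full data with the cross-validated number of features:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
est = SVR(kernel="linear")

rfecv = RFECV(est, step=1, cv=5).fit(X, y)
rfe = RFE(est, n_features_to_select=rfecv.n_features_, step=1).fit(X, y)

# Expected to be True: cross-validation only chose the number of features,
# while support_ and ranking_ come from a refit on the whole dataset.
print(np.array_equal(rfecv.support_, rfe.support_))
print(np.array_equal(rfecv.ranking_, rfe.ranking_))
```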
Thanks.
That's why I think providing the selected set of features or their rankings computed on the whole set can be confusing (because that's not what the algorithm does, or is intended for). Also, let's consider a voting-based approach. Why would it be wrong (even if collinearity exists)? Consider 3 folds with the following:
The sum is:
What if the first two features are identical, but for random noise? They are also the most important feature. Because the two are redundant, my splits might rank:
Sum these...
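To make that concrete, here is a small illustration with made-up rankings (the numbers are assumptions, not taken from the thread):

```python
import numpy as np

# Hypothetical rankings from 3 splits over 5 features. Features 0 and 1 are
# near-identical copies of the most important feature, so each split arbitrarily
# keeps one of them to the end (rank 1) and eliminates the other early (rank 5).
split_rankings = np.array([
    [1, 5, 2, 3, 4],
    [5, 1, 2, 3, 4],
    [1, 5, 2, 3, 4],
])

# Summing (voting) the ranks gives [7, 11, 6, 9, 12]: neither copy of the most
# important feature ends up with the best total, so the vote can demote exactly
# the feature that should have been selected.
print(split_rankings.sum(axis=0))
```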
Not really useful. RFE is very explicitly an alternative to univariate selection, one that takes this conditional dependency into account. And grid search, which is effectively what is happening here, usually works by learning hyperparameters under CV, but final model parameters on the full training set.
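For comparison, a minimal sketch of that grid-search behaviour (dataset and parameter grid are chosen arbitrarily here): the hyperparameter is selected by cross-validation, but the returned best_estimator_ is refit on the full training set, which mirrors how RFECV selects n_features_ by CV and then refits a single RFE on all of the training data.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)

# C is chosen by cross-validation, but the refit (refit=True by default) trains
# best_estimator_ on the full training set, analogous to RFECV refitting RFE.
search = GridSearchCV(SVR(kernel="linear"), {"C": [0.1, 1, 10]}, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_estimator_.coef_.shape)  # coefficients come from the full refit
```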
Ok, here is an example where the features selected by a voting-based approach can lead to a better score:

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
n_folds = 5
k_fold = KFold(n_splits=n_folds, shuffle=True, random_state=42)
rng = np.random.RandomState(seed=0)
X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)
estimator = SVR(kernel="linear")
rfecv = RFECV(estimator, step=1, scoring="r2", cv=k_fold)
rfecv = rfecv.fit(X, y)
print("RFECV support_", rfecv.support_)
print("RFECV ranking_", rfecv.ranking_)
print("RFECV grid_scores_", rfecv.grid_scores_)
print("RFECV score", "%.4f" % rfecv.score(X, y), "\n\n")
# Voting-based alternative: run RFE on each fold and sum the per-fold rankings.
from sklearn.feature_selection import RFE
from sklearn import metrics
X_train, y_train = X.copy(), y.copy()
n_select = rfecv.n_features_
print("features to select: ", n_select)
rfe_pipe = Pipeline(
    [
        ("rfe_select", RFE(estimator=estimator, n_features_to_select=n_select, step=1)),
        ("eval", estimator),
    ]
)
manual_cv_scores = []
votes = np.zeros(X.shape[1])
fold_count = 0
for train_ind, validation_ind in k_fold.split(X_train):
    fold_count += 1
    Xtr, Xval = X_train[train_ind, :], X_train[validation_ind, :]
    ytr, yval = y_train[train_ind], y_train[validation_ind]
    rfe_pipe.fit(Xtr, ytr)
    print("Fold ", fold_count, "RFE ranking_: ", rfe_pipe["rfe_select"].ranking_)
    # Accumulate each feature's rank from this fold (lower rank = kept longer by RFE).
    votes = votes + rfe_pipe["rfe_select"].ranking_
    y_pred = rfe_pipe.predict(Xval)
    manual_cv_scores.append(metrics.r2_score(yval, y_pred))
print("Votes:", votes)
print("Manual CV, rfe_pipe, scores: ", ["%.6f" % r2 for r2 in manual_cv_scores])
print(
    "Mean CV (equals the", rfecv.n_features_, "th item in RFECV grid_scores_)",
    "%.6f" % np.mean(manual_cv_scores),
)
# Refit on the n_select features with the smallest summed rank, then score on the training data.
selected = votes.argsort()[:n_select]
estimator.fit(X[:, selected], y)
ypred = estimator.predict(X[:, selected])
print("voting_based score:", "%.4f" % metrics.r2_score(y, ypred)) RFECV score 0.6748 |
It doesn't matter how features behave across different folds. It matters
that the final model has the most informative features and minimum model
size.
Ok, this keeps coming back to me. Now, imagine we would like to use grid search across folds for model selection (hyperparameter tuning) or for checking overfitting. I guess using [...]. What I can think of is to do [...].
Aim: model selection using grid search, or comparing train-test scores using cross-validation.
What do you think?
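One possible way to do that (a sketch with arbitrary names and parameter values, not an API from this thread): put the feature selection inside a Pipeline so it is re-run on every training split, tune the number of features with GridSearchCV, and evaluate the whole procedure with cross_validate.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=20, random_state=0)

pipe = Pipeline([
    ("select", RFE(SVR(kernel="linear"), step=1)),
    ("model", SVR(kernel="linear")),
])

# Treat the number of selected features as a pipeline hyperparameter, so selection
# is repeated on each training split and never sees the corresponding test split.
search = GridSearchCV(pipe, {"select__n_features_to_select": [3, 5, 10]}, cv=5, scoring="r2")

# Nested CV: compare train and test scores of the whole tuning procedure.
results = cross_validate(search, X, y, cv=5, return_train_score=True, scoring="r2")
print(results["train_score"].mean(), results["test_score"].mean())
```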
Answering the discussion in #20976, I came to this issue. I think that I am in line with @jnothman and this particular comment: #17782 (comment). Now that we have a [...]. Regarding the behaviour of the current estimator, I think that it is fine to refit a model on the optimal number of features.
It would be awesome to have the support for each split! Is it planned anytime in the future?
I'm not aware of anyone working on it, but it looks like a good enhancement. @MarieS-WiMLDS do you want to implement this feature?
Describe the workflow you want to enable
Currently, RFECV searches for the optimal number of features to provide the best score. It then fits an RFE on the whole training set (L566). So the score is the cross-validated score, but the ranking_ is for the whole set of data. Isn't this misleading? E.g. with RFE and 3-fold cross-validation, I get these ranking_s:

while RFECV returns something like:

which is the ranking_ that RFE would provide if fit over the whole set. So the score is a cross-validated score, while the ranking_ and support_ are not.

Describe your proposed solution
Imagine a situation where you get the best performance with k features, but the k features in some folds are different from the k features you would get by fitting the model on the whole training set. Wouldn't it be a better idea to return a form of averaged/voting-based ranking_ and support_?