Description
This suggestion concerns random shuffling in the new model_selection module.
I ran into a problem with the following setup. I run a grid search with CV, and I want the CV shuffling to be identical for every parameter setting I loop over. This currently seems impossible with model_selection.KFold, because both copy.copy() and copy.deepcopy() raise an error when called in the following sequence:
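import copy
import numpy as np
import sklearn.model_selection

y = np.random.randn(100)
n_folds = 10
kf_ = sklearn.model_selection.KFold(n_folds=n_folds, shuffle=True)
kf = kf_.split(y)  # split() returns a generator
# copy.copy() raises here because generator objects cannot be copied
for tr, ts in copy.copy(kf):
    print(ts)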
Copying it earlier does not make sense, as the RNG is initialized only during the kf_.split(y) call.
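A possible workaround for the copy error, sketched under assumptions: instead of copying the generator, materialize it into a list once and iterate over that list for every parameter setting. The sketch assumes a released scikit-learn, where the KFold argument is named n_splits rather than n_folds; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.random.randn(100)

# Assumption: released scikit-learn, where the argument is n_splits.
kf = KFold(n_splits=10, shuffle=True)

# Materialize the generator once; unlike a generator, a list can be
# iterated (or copied) as many times as needed.
splits = list(kf.split(y))

# Two passes over the same list see exactly the same folds.
first_pass = [test.tolist() for _, test in splits]
second_pass = [test.tolist() for _, test in splits]
```

This keeps the shuffling fixed for the lifetime of the list, at the cost of holding all index arrays in memory.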
One solution is to specify the seed for each shuffled split. Another, more fundamental solution is to refactor model_selection and move the check_random_state(self.random_state) call into the __init__ of _BaseKFold, so that each kf_.split(y) call gives consistently shuffled indices.
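The first solution (a fixed seed) could look roughly like the sketch below. It assumes a released scikit-learn, where the argument is n_splits rather than n_folds, and the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.random.randn(100)

# Assumption: released scikit-learn (n_splits); the seed 0 is arbitrary.
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# With an integer seed, every call to split() starts from the same RNG
# state, so repeated calls produce the same shuffled folds.
folds_a = [test.tolist() for _, test in kf.split(y)]
folds_b = [test.tolist() for _, test in kf.split(y)]
```

With this setup, each parameter setting in the grid search can call kf.split(y) again and still see the same folds.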
Versions
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 5 2015, 21:12:44)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
NumPy 1.11.0
SciPy 0.17.0
Scikit-Learn 0.18.dev0