[architectural suggestion] Move random number generator to initializer in model_selection._split #6726

Closed
@DSLituiev

This suggestion concerns random shuffling in the new model_selection module.

I ran into a problem with the following setup. I run a grid search with CV, and I want the CV shuffling to be the same for every parameter setting I loop over. This currently seems impossible with model_selection.KFold, because copy.copy() and copy.deepcopy() raise an error when called in the following sequence:

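import copy
import numpy as np
import sklearn.model_selection

y = np.random.randn(100)
n_folds = 10
kf_ = sklearn.model_selection.KFold(n_folds=n_folds, shuffle=True)
kf = kf_.split(y)  # split() returns a generator
# copy.copy() (and copy.deepcopy()) fail here: generators cannot be copied
for tr, ts in copy.copy(kf):
    print(ts)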

Copying it earlier does not help, because the RNG is initialized only during the kf_.split(y) call.

One solution is to specify the seed for each shuffled split. A more fundamental solution is to refactor model_selection and move check_random_state(self.random_state) (as here) into the __init__ of _BaseKFold, so that every kf_.split(y) call gives consistently shuffled indices.
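
To illustrate the first option, here is a minimal sketch. It assumes a released scikit-learn in which the argument is called n_splits (the snippet above uses the 0.18.dev0 name n_folds). The seed value and the parameter grid are hypothetical placeholders, not part of the original report.

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.random.randn(100)

# Assumption: released scikit-learn API (n_splits instead of n_folds).
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# With an integer seed, each split() call builds its RNG from that seed,
# so repeated calls yield the same shuffled folds.
first = [test for _, test in kf.split(y)]
second = [test for _, test in kf.split(y)]
same = all(np.array_equal(a, b) for a, b in zip(first, second))

# Alternatively, materialize the folds once and reuse the list,
# which avoids copying the generator altogether.
folds = list(kf.split(y))
for params in ({"alpha": 0.1}, {"alpha": 1.0}):  # hypothetical parameter grid
    for train_idx, test_idx in folds:
        pass  # fit and score a model for `params` on these indices
```

Materializing the folds with list() also works when random_state is left unset, since the shuffle then happens only once.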

Versions

Python 3.5.1 (v3.5.1:37a07cee5969, Dec 5 2015, 21:12:44)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
NumPy 1.11.0
SciPy 0.17.0
Scikit-Learn 0.18.dev0
