[architectural suggestion] Move random number generator to initializer in model_selection._split #6726
Note that the functionality I describe is already available in `sklearn.cross_validation`:

```python
import numpy as np
import sklearn.cross_validation

y = np.random.randn(100)  # setup added so the snippet runs stand-alone
n_folds = 9
# In the old API the RNG is drawn at construction time, so iterating the
# splitter twice yields identical folds even without a fixed random_state.
kf = sklearn.cross_validation.KFold(len(y), n_folds=n_folds, shuffle=True)
for tr, ts in kf:
    print(ts)
# is the same as:
for tr, ts in kf:
    print(ts)
```
What about the new `model_selection` API:

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.random.randn(100)
# Note: the model_selection version takes n_splits (not n_folds).
kf = KFold(n_splits=9, shuffle=True, random_state=0)
for tr, ts in kf.split(y):
    print(ts)
# is the same as:
print('')
for tr, ts in kf.split(y):
    print(ts)
```
I do use `random_state`.
So you want the splits to be random, but consistent across all parameters? If only I had read this before the release :-/ (also, we merged …)
Yes, it is a problem that …. We should call …. Whether or not we also move the random_state initialisation to ….

(And yes, you can create the list and should not need to copy or deepcopy in most cases.)

Maybe we needed the new CV splitters to be an experimental release before deprecating the old ones.

@amueller and @ogrisel, given that we've found some substantial issues (this and pickling at least), can we set some goals for 0.18.1 in terms of timescale? Aim to RC at the beginning of November, for instance? And then focus almost exclusively on release-critical issues? I also think the 0.18.1 release plan should be noted on the mailing list.
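A minimal sketch of the "create the list" workaround mentioned above, assuming a hypothetical grid-search loop (the parameter grid and the `fit_and_score` name in the comment are placeholder stand-ins, not scikit-learn API):

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.random.randn(100)
kf = KFold(n_splits=9, shuffle=True)

# Materialise the splits once; the resulting list of (train, test) index
# arrays can be reused for every parameter, with no copy/deepcopy needed.
splits = list(kf.split(y))

for param in [0.1, 1.0, 10.0]:      # placeholder parameter grid
    for train, test in splits:      # identical folds for every param
        pass                        # fit_and_score(param, train, test) would go here
```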
I have to think about this issue again; I think I didn't really understand it. Not sure how the behavior was before.
@jnothman Unless the user is using a cv which will produce different train/test sets for each …:

```python
from sklearn.model_selection import ShuffleSplit
from sklearn.cross_validation import ShuffleSplit as OldShuffleSplit
import numpy as np

y = np.random.RandomState(0).randn(100)
n_iter = 10
ss = ShuffleSplit(n_splits=n_iter, random_state=0)
old_ss_iter = OldShuffleSplit(len(y), n_iter=n_iter, random_state=0)
new_ss_iter_1 = list(ss.split(y))
new_ss_iter_2 = list(ss.split(y))
# Let's change the random state and make a 3rd split(y) call.
ss.random_state = 42
new_ss_iter_3 = list(ss.split(y))
for trte1, trte2, trte3, trte4 in zip(
        new_ss_iter_1, new_ss_iter_2, new_ss_iter_3, old_ss_iter):
    np.testing.assert_equal(trte1, trte2)
    np.testing.assert_equal(trte1, trte4)
    # This will fail as the random states are not the same
    np.testing.assert_equal(trte2, trte3)
```

And you will observe the same behavior with `KFold`:

```python
from sklearn.model_selection import KFold
from sklearn.cross_validation import KFold as OldKFold
import numpy as np

y = np.random.RandomState(0).randn(100)
n_splits = 10
kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
old_kf_iter = OldKFold(len(y), n_folds=n_splits, shuffle=True, random_state=0)
new_kf_iter_1 = list(kf.split(y))
new_kf_iter_2 = list(kf.split(y))
# Let's change the random state and make a 3rd split(y) call.
kf.random_state = 42
new_kf_iter_3 = list(kf.split(y))
for trte1, trte2, trte3, trte4 in zip(
        new_kf_iter_1, new_kf_iter_2, new_kf_iter_3, old_kf_iter):
    np.testing.assert_equal(trte1, trte2)
    np.testing.assert_equal(trte1, trte4)
    # This will fail as the random states are not the same
    np.testing.assert_equal(trte2, trte3)
```
It does give consistently shuffled indices (see the above code snippets for proof) now, but only because it is not at ….

@DSLituiev Could you share a sample code that works for you in …?
I don't think that you should assume most users set ….

@DSLituiev's particular issue can be solved by documenting that if someone wants consistent splits they should use ….

I now think there is also an issue in ….
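A minimal sketch of that documentation fix, assuming the elided advice refers to passing an integer `random_state` (shown here only to make the workaround concrete):

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.random.randn(100)
# With an integer random_state, the RNG is re-seeded on every split() call,
# so repeated calls produce identical folds.
kf = KFold(n_splits=9, shuffle=True, random_state=0)

first = [test.tolist() for _, test in kf.split(y)]
second = [test.tolist() for _, test in kf.split(y)]
assert first == second  # same folds on every call
```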
Being practical about timescale, @amueller: doing 0.18.1 in October means releasing in November. There are two weeks of October left.
Ah, I understand the issue... Thanks for the clarification.

So IIUC, this + documenting that without ….
Well, it's clear there are a few ways to define the problem. I think that ….
PR at #7660
I'm closing this as we decided that the current behavior after #7660 is sufficient.
Personally I think making multiple calls to `split` return identical results is still a much better design. But I have no idea how to fix it.
This suggestion concerns random shuffling in the new `model_selection` module.

I faced a challenge in the following set-up: I do a grid search with CV, and I want the CV reshuffling to be consistent for each parameter I am looping through. It now seems impossible to do with `model_selection.KFold`, as `copy.copy()` and `copy.deepcopy()` lead to an error when called after `kf_.split(y)`; copying the splitter earlier does not make sense, as the RNG is initialized only during the `kf_.split(y)` call.

One solution is to specify the seed for each shuffling fold. Another, more fundamental, solution is to refactor `model_selection` and move `check_random_state(self.random_state)` to the `__init__` of `_BaseKFold`; then each `kf_.split(y)` call will give consistently shuffled indices, as in the sketch below.
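A minimal sketch of what the proposed refactor could look like (an illustration only, not scikit-learn's actual `_BaseKFold` code; the class name and fold logic are simplified stand-ins):

```python
import numpy as np
from sklearn.utils import check_random_state

class SeededKFold:
    """Toy KFold-like splitter that resolves its seed in __init__."""

    def __init__(self, n_splits=5, shuffle=True, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        # Proposed change: resolve the random state here, once, so that
        # every subsequent split() call reuses the same effective seed,
        # even when random_state is None.
        self._seed = check_random_state(random_state).randint(np.iinfo(np.int32).max)

    def split(self, X):
        indices = np.arange(len(X))
        if self.shuffle:
            np.random.RandomState(self._seed).shuffle(indices)
        # Distribute samples across folds, giving the remainder to the
        # first len(X) % n_splits folds (same convention as KFold).
        fold_sizes = np.full(self.n_splits, len(X) // self.n_splits, dtype=int)
        fold_sizes[: len(X) % self.n_splits] += 1
        current = 0
        for size in fold_sizes:
            test = indices[current:current + size]
            train = np.concatenate([indices[:current], indices[current + size:]])
            yield train, test
            current += size

# Repeated split() calls now agree even without an explicit seed:
cv = SeededKFold(n_splits=3)
a = [t.tolist() for _, t in cv.split(np.arange(10))]
b = [t.tolist() for _, t in cv.split(np.arange(10))]
assert a == b
```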
Versions
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 5 2015, 21:12:44)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
NumPy 1.11.0
SciPy 0.17.0
Scikit-Learn 0.18.dev0