[MRG+1] ENH: added max_train_size to TimeSeriesSplit #8282
Conversation
The build is successful. Travis is failing due to unknown reasons, also encountered in #8040. Restarting the build a few times worked there.
sklearn/model_selection/_split.py
Outdated
```diff
@@ -687,7 +690,8 @@ class TimeSeriesSplit(_BaseKFold):
     with a test set of size ``n_samples//(n_splits + 1)``,
     where ``n_samples`` is the number of samples.
     """
-    def __init__(self, n_splits=3):
+    def __init__(self, n_splits=3, max_train_size=0):
+        self.max_train_size = max_train_size
```
This should be below the super call.
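The ordering the reviewer asks for can be sketched on a simplified stand-in class (`_BaseKFoldSketch` here is a hypothetical substitute for sklearn's internal `_BaseKFold`, which is not part of the public API):

```python
class _BaseKFoldSketch:
    """Hypothetical stand-in for sklearn's internal _BaseKFold."""
    def __init__(self, n_splits):
        self.n_splits = n_splits


class TimeSeriesSplitSketch(_BaseKFoldSketch):
    def __init__(self, n_splits=3, max_train_size=None):
        # Run the parent initializer first, then set the
        # subclass-specific attribute below the super call.
        super().__init__(n_splits)
        self.max_train_size = max_train_size


s = TimeSeriesSplitSketch(max_train_size=2)
print(s.n_splits, s.max_train_size)  # -> 3 2
```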
```python
    assert_array_equal(test, [4])
```
flake8 error.
Codecov Report
@@ Coverage Diff @@
## master #8282 +/- ##
==========================================
+ Coverage 94.75% 94.75% +<.01%
==========================================
Files 342 342
Lines 60809 60920 +111
==========================================
+ Hits 57617 57726 +109
- Misses 3192 3194 +2
Continue to review full report at Codecov.
sklearn/model_selection/_split.py
Outdated
```diff
@@ -687,10 +690,11 @@ class TimeSeriesSplit(_BaseKFold):
     with a test set of size ``n_samples//(n_splits + 1)``,
     where ``n_samples`` is the number of samples.
     """
-    def __init__(self, n_splits=3):
+    def __init__(self, n_splits=3, max_train_size=0):
```
conventionally we use None to mean "feature disabled"
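The convention can be sketched in isolation: with `None` as the "feature disabled" sentinel, the guard becomes an explicit identity check instead of a magic `> 0` comparison. `make_train_slice` is a hypothetical helper for illustration, not code from the PR:

```python
def make_train_slice(test_start, max_train_size=None):
    # None means "no limit": yield the full history before the test set.
    # A set limit is applied only when it is smaller than the available
    # training range.
    if max_train_size is not None and max_train_size < test_start:
        return slice(test_start - max_train_size, test_start)
    return slice(0, test_start)


print(make_train_slice(5))                    # slice(0, 5, None) -- disabled
print(make_train_slice(5, max_train_size=2))  # slice(3, 5, None) -- capped
```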
sklearn/model_selection/_split.py
Outdated
```python
            yield (indices[:test_start],
                   indices[test_start:test_start + test_size])
            if self.max_train_size > 0 and self.max_train_size < test_start:
                yield (indices[test_start - self.max_train_size:test_start],
```
if this becomes negative it means something else. I think you need a max
The check ensures that the capped window is yielded only when the current fold's training range exceeds max_train_size; otherwise, it falls back to the default behaviour. I have updated the tests to capture what I felt you were mentioning.
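The behaviour under discussion can be sketched with a minimal, self-contained re-implementation of the splitter's index arithmetic (for illustration only; the real logic lives in `TimeSeriesSplit.split`):

```python
import numpy as np


def time_series_split(n_samples, n_splits=3, max_train_size=None):
    # Minimal sketch of the patch's logic: expanding-window splits,
    # optionally capped to the most recent max_train_size samples.
    indices = np.arange(n_samples)
    test_size = n_samples // (n_splits + 1)
    test_starts = range(test_size + n_samples % (n_splits + 1),
                        n_samples, test_size)
    for test_start in test_starts:
        if max_train_size is not None and max_train_size < test_start:
            # Keep only the last max_train_size samples before the test set.
            yield (indices[test_start - max_train_size:test_start],
                   indices[test_start:test_start + test_size])
        else:
            yield (indices[:test_start],
                   indices[test_start:test_start + test_size])


for train, test in time_series_split(6, n_splits=3, max_train_size=2):
    print(train, test)
# -> [1 2] [3]
#    [2 3] [4]
#    [3 4] [5]
```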
Not sure why codecov is red; try to rebase on master and see what happens.
I would have considered writing tests that avoid writing out each split, and instead check invariants like the following: the test set is the same as without `max_train_size`, and the train set with `max_train_size` is a suffix of the train set without it, limited to that length.
But this looks fine to me apart from those nitpicks.
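The reviewer's proposed invariants could be written as a small helper like the following (a sketch; `check_max_train_size_invariants` is a hypothetical name, not a test from the PR):

```python
import numpy as np


def check_max_train_size_invariants(train_full, train_capped, max_train_size):
    # Invariant 1: the capped train set never exceeds max_train_size.
    assert len(train_capped) <= max_train_size
    # Invariant 2: the capped train set is a suffix of the uncapped one.
    suffix = train_full[-len(train_capped):]
    assert np.array_equal(train_capped, suffix)
    # (A full test would also assert the test sets are identical in both.)


# Hypothetical split pair: uncapped train [0..3], capped train [2, 3].
check_max_train_size_invariants(np.arange(4), np.array([2, 3]),
                                max_train_size=2)
print("invariants hold")
```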
sklearn/model_selection/_split.py
Outdated
```diff
@@ -664,14 +664,17 @@ class TimeSeriesSplit(_BaseKFold):
     n_splits : int, default=3
         Number of splits. Must be at least 1.

+    max_train_size : int, optional
+        Maximum size for a single training fold.
```
I'm never sure about the correct use of "fold". Let's call it a "training set" or a "training sample".
```python
    train, test = next(splits)
    assert_array_equal(train, [0, 1, 2, 3, 4])
    assert_array_equal(test, [5])
```
Should really do `assert_raises(StopIteration, next(splits))`.
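The exhaustion check the reviewer is after can be sketched with only the standard library; note that sklearn's `assert_raises` helper takes a callable, so the callable form would be `assert_raises(StopIteration, next, splits)`:

```python
# Toy stand-in for the split generator, with one split remaining.
splits = iter([([0, 1], [2])])
next(splits)  # consume the last split

# Verify the generator is now exhausted.
try:
    next(splits)
except StopIteration:
    exhausted = True
else:
    exhausted = False
print(exhausted)  # -> True
```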
```diff
@@ -1176,6 +1176,46 @@ def test_time_series_cv():
     assert_equal(n_splits_actual, 2)


+def test_time_series_max_train_size():
+    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
```
How about `X = np.zeros((6, 1))`; then it is clear how many samples are in X.
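The suggestion in miniature: since the splitter only looks at indices, an all-zeros array makes the sample count obvious at a glance.

```python
import numpy as np

# The values in X never matter to the splitter, only the number of rows,
# so zeros make the test data self-documenting.
X = np.zeros((6, 1))
print(X.shape[0])  # -> 6 samples
```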
```python
    train, test = next(splits)
    assert_array_equal(train, [2, 3])
    assert_array_equal(test, [4])
```
Another split?
Yes, that seems much cleaner; writing out the splits should have been avoided in the first place. Thank you for the suggestion 👍
CircleCI needs a rebuild.
LGTM. Please add a what's new entry.
Merging. I'll add the what's new entry in master directly.
Thanks @dalmia!
Done in 305ed51.
* ENH: added max_train_size to TimeSeriesSplit
* FIX: update doctest
* FIX: correct error in the previous update
* FIX: added doctest fix for cross_validation.rst
* FIX: remove errors
* TST: tests updated and default value changed to None
* TST: improve split tests
* FIX: reduce code length
Reference Issue
Fixes #8249
What does this implement/fix? Explain your changes.
This adds a parameter max_train_size to TimeSeriesSplit that puts an upper limit on the size of each training fold.
Any other comments?
There is one corner case where the size of the first training fold is smaller than max_train_size. In my implementation, I have taken the last of those. Please check the tests for a more elaborate description.
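For reference, here is a usage sketch of the feature as merged. This assumes a scikit-learn version that includes this PR (0.19 or later); the split boundaries shown follow the behaviour described above:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six samples; values are irrelevant to the splitter, only the row count.
X = np.zeros((6, 1))

# Cap every training set at the 2 most recent samples.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=2)
for train, test in tscv.split(X):
    print(train, test)
# -> [1 2] [3]
#    [2 3] [4]
#    [3 4] [5]
```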