
[MRG+1] ENH: added max_train_size to TimeSeriesSplit #8282


Merged: 8 commits into scikit-learn:master on Jun 8, 2017

Conversation

@dalmia (Contributor) commented on Feb 3, 2017

Reference Issue

Fixes #8249

What does this implement/fix? Explain your changes.

This adds a parameter max_train_size to TimeSeriesSplit that puts an upper limit on the size of each training fold.

Any other comments?

There is one corner case where the size of the first training fold is smaller than max_train_size. In my implementation, the training window keeps the last samples; please check the tests for a more elaborate description.
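As a rough usage sketch of the new parameter (this assumes six samples and the merged behaviour, where max_train_size defaults to None and the training window keeps only the most recent samples):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((6, 1))  # six samples; only the indices matter for splitting
tscv = TimeSeriesSplit(n_splits=3, max_train_size=2)
for train_index, test_index in tscv.split(X):
    # each training window is capped at the 2 samples immediately before the test set
    print("train:", train_index, "test:", test_index)
# train: [1 2] test: [3]
# train: [2 3] test: [4]
# train: [3 4] test: [5]
```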

@dalmia (Contributor, Author) commented on Feb 3, 2017

The build is successful; Travis is failing for unknown reasons that were also encountered in #8040.
Please review.

@dalmia (Contributor, Author) commented on Feb 3, 2017

Restarting the build a few times worked there.

@@ -687,7 +690,8 @@ class TimeSeriesSplit(_BaseKFold):
with a test set of size ``n_samples//(n_splits + 1)``,
where ``n_samples`` is the number of samples.
"""
-    def __init__(self, n_splits=3):
+    def __init__(self, n_splits=3, max_train_size=0):
+        self.max_train_size = max_train_size
@dalmia (Contributor, Author) commented on this diff:
This should be below the super call.

assert_array_equal(test, [4])



@dalmia (Contributor, Author) commented on this diff:
flake8 error.

@codecov bot commented on Feb 16, 2017

Codecov Report

Merging #8282 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8282      +/-   ##
==========================================
+ Coverage   94.75%   94.75%   +<.01%     
==========================================
  Files         342      342              
  Lines       60809    60920     +111     
==========================================
+ Hits        57617    57726     +109     
- Misses       3192     3194       +2
Impacted Files                                   | Coverage Δ
sklearn/model_selection/_split.py                | 98.77% <100%> (ø)
sklearn/model_selection/tests/test_split.py      | 95.65% <100%> (+0.09%)
sklearn/utils/tests/test_validation.py           | 97.38% <ø> (-0.94%)
sklearn/utils/tests/test_class_weight.py         | 100% <ø> (ø)
sklearn/neighbors/tests/test_ball_tree.py        | 98% <ø> (ø)
sklearn/ensemble/tests/test_weight_boosting.py   | 100% <ø> (ø)
sklearn/linear_model/tests/test_randomized_l1.py | 100% <ø> (ø)
sklearn/neighbors/tests/test_kde.py              | 98.85% <ø> (ø)
sklearn/neighbors/tests/test_kd_tree.py          | 97.45% <ø> (ø)
sklearn/gaussian_process/tests/test_kernels.py   | 98.85% <ø> (+0.02%)
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 676e863...ac4b88a.

@@ -687,10 +690,11 @@ class TimeSeriesSplit(_BaseKFold):
with a test set of size ``n_samples//(n_splits + 1)``,
where ``n_samples`` is the number of samples.
"""
-    def __init__(self, n_splits=3):
+    def __init__(self, n_splits=3, max_train_size=0):
A Member commented on this diff:
conventionally we use None to mean "feature disabled"
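A sketch of the constructor following that convention (the super call is assumed to mirror the existing _BaseKFold API; the None default matches the commit "default value changed to None" listed below):

```python
def __init__(self, n_splits=3, max_train_size=None):
    # None means the feature is disabled and training windows grow without limit
    super(TimeSeriesSplit, self).__init__(n_splits, shuffle=False, random_state=None)
    self.max_train_size = max_train_size
```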

yield (indices[:test_start],
indices[test_start:test_start + test_size])
if self.max_train_size > 0 and self.max_train_size < test_start:
yield (indices[test_start - self.max_train_size:test_start],
A Member commented on this diff:
if this becomes negative it means something else. I think you need a max

@dalmia (Contributor, Author) replied:
The check ensures that the capped window is yielded only when the current training set size exceeds max_train_size; otherwise it reverts to the default behaviour. I have updated the tests to capture what I think you were describing.
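For context, a sketch of roughly how the split logic looks once the default becomes None, so a simple truthiness check replaces the `> 0` comparison (a sketch, not a verbatim quote of the merged code):

```python
for test_start in test_starts:
    if self.max_train_size and self.max_train_size < test_start:
        # keep only the last max_train_size samples before the test window
        yield (indices[test_start - self.max_train_size:test_start],
               indices[test_start:test_start + test_size])
    else:
        # training window smaller than the cap: fall back to the default behaviour
        yield (indices[:test_start],
               indices[test_start:test_start + test_size])
```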

@lesteve (Member) commented on Feb 17, 2017

Not sure why codecov is red, try to rebase on master and see what happens.

@jnothman (Member) left a comment:

I would have considered writing tests that avoided writing out each split, and instead checked something like the following invariant: the test set is the same as without max_train_size, and the train set with max_train_size is a suffix of the train set without max_train_size, limited to that length.

But this looks fine to me apart from those nitpicks.
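A hypothetical helper sketching that invariant (the function name and the chosen n_splits/max_train_size values are illustrative, not from the PR):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def _check_max_train_size_invariant(X, n_splits=3, max_train_size=2):
    base = TimeSeriesSplit(n_splits=n_splits).split(X)
    capped = TimeSeriesSplit(n_splits=n_splits, max_train_size=max_train_size).split(X)
    for (train, test), (train_c, test_c) in zip(base, capped):
        # test sets are identical with and without the cap
        np.testing.assert_array_equal(test, test_c)
        # the capped train set is a suffix of the uncapped one...
        np.testing.assert_array_equal(train[-len(train_c):], train_c)
        # ...and never longer than max_train_size
        assert len(train_c) <= max_train_size


_check_max_train_size_invariant(np.zeros((6, 1)))
```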

@@ -664,14 +664,17 @@ class TimeSeriesSplit(_BaseKFold):
n_splits : int, default=3
Number of splits. Must be at least 1.

max_train_size : int, optional
Maximum size for a single training fold.
A Member commented on this diff:
I'm never sure about the correct use of "fold". Let's call it a "training set" or a "training sample".

train, test = next(splits)
assert_array_equal(train, [0, 1, 2, 3, 4])
assert_array_equal(test, [5])

A Member commented on this diff:
Should really do assert_raises(StopIteration, next(splits)).
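As a side note, assert_raises takes the callable and its arguments separately, so the exhaustion check would look something like this sketch (import shown from numpy.testing; the test module's own assert_raises helper would work the same way):

```python
from numpy.testing import assert_raises

# after consuming all expected splits, the iterator should be exhausted
assert_raises(StopIteration, next, splits)
```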

@@ -1176,6 +1176,46 @@ def test_time_series_cv():
assert_equal(n_splits_actual, 2)


def test_time_series_max_train_size():
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
A Member commented on this diff:
How about X = np.zeros((6, 1)); then it is clear how many samples are in X.
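A sketch of the test along those lines (n_splits=3, max_train_size=3, and the expected indices are illustrative, assuming six samples):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.model_selection import TimeSeriesSplit


def test_time_series_max_train_size():
    X = np.zeros((6, 1))  # six samples, one feature
    splits = TimeSeriesSplit(n_splits=3, max_train_size=3).split(X)

    train, test = next(splits)
    assert_array_equal(train, [0, 1, 2])  # first window already fits within the cap
    assert_array_equal(test, [3])

    train, test = next(splits)
    assert_array_equal(train, [1, 2, 3])  # now capped to the last 3 samples
    assert_array_equal(test, [4])
```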

train, test = next(splits)
assert_array_equal(train, [2, 3])
assert_array_equal(test, [4])

A Member commented on this diff:
Another split?

@dalmia (Contributor, Author) commented on Feb 22, 2017

Yes, that seems much cleaner; writing out the splits should have been avoided in the first place. Thank you for the suggestion 👍

@dalmia (Contributor, Author) commented on Feb 22, 2017

CircleCI needs a rebuild.

@jnothman (Member) commented on Jun 8, 2017

LGTM. Please add a what's new entry.

@jnothman changed the title from "[MRG] ENH: added max_train_size to TimeSeriesSplit" to "[MRG+1] ENH: added max_train_size to TimeSeriesSplit" on Jun 8, 2017
@agramfort (Member) commented:
merging. I'll add the what's new entry in master directly

@agramfort agramfort merged commit 511c9a8 into scikit-learn:master Jun 8, 2017
@jnothman (Member) commented on Jun 8, 2017

Thanks @dalmia!

@agramfort (Member) commented:
done in 305ed51

Commits (8):

* ENH: added max_train_size to TimeSeriesSplit
* FIX: update doctest
* FIX: correct error in the previous update
* FIX: added doctest fix for cross_validation.rst
* FIX: remove errors
* TST: tests updated and default value changed to None
* TST: improve split tests
* FIX: reduce code length

Commits referencing this pull request were later pushed to Sundrique/scikit-learn (Jun 14, 2017), dmohns/scikit-learn (Aug 7, 2017, twice), NelleV/scikit-learn (Aug 11, 2017), paulha/scikit-learn (Aug 19, 2017), AishwaryaRK/scikit-learn (Aug 29, 2017), maskani-moh/scikit-learn (Nov 15, 2017), and jwjohnson314/scikit-learn (Dec 18, 2017), each carrying the same commit list as above.
Linked issue that may be closed by this pull request: TimeSeriesSplit with anchour option (#8249)

4 participants