[MRG+1] ENH: added max_train_size to TimeSeriesSplit #8282
Conversation
The build is successful. Travis is failing due to unknown reasons, also encountered in #8040. Restarting the build a few times worked there.
sklearn/model_selection/_split.py
Outdated
```diff
@@ -687,7 +690,8 @@ class TimeSeriesSplit(_BaseKFold):
     with a test set of size ``n_samples//(n_splits + 1)``,
     where ``n_samples`` is the number of samples.
     """
-    def __init__(self, n_splits=3):
+    def __init__(self, n_splits=3, max_train_size=0):
+        self.max_train_size = max_train_size
```
This should be below the super call.
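The ordering the reviewer asks for can be sketched on a simplified stand-in class (`_BaseKFoldSketch` here is a hypothetical substitute for sklearn's internal `_BaseKFold`, which is not part of the public API):

```python
class _BaseKFoldSketch:
    """Hypothetical stand-in for sklearn's internal _BaseKFold."""
    def __init__(self, n_splits):
        self.n_splits = n_splits


class TimeSeriesSplitSketch(_BaseKFoldSketch):
    def __init__(self, n_splits=3, max_train_size=None):
        # Run the parent initializer first, then set the
        # subclass-specific attribute below the super call.
        super().__init__(n_splits)
        self.max_train_size = max_train_size


s = TimeSeriesSplitSketch(max_train_size=2)
print(s.n_splits, s.max_train_size)  # -> 3 2
```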
```python
    assert_array_equal(test, [4])
```
flake8 error.
Codecov Report
@@ Coverage Diff @@
## master #8282 +/- ##
==========================================
+ Coverage 94.75% 94.75% +<.01%
==========================================
Files 342 342
Lines 60809 60920 +111
==========================================
+ Hits 57617 57726 +109
- Misses 3192 3194 +2
Continue to review full report at Codecov.
sklearn/model_selection/_split.py
Outdated
```diff
@@ -687,10 +690,11 @@ class TimeSeriesSplit(_BaseKFold):
     with a test set of size ``n_samples//(n_splits + 1)``,
     where ``n_samples`` is the number of samples.
     """
-    def __init__(self, n_splits=3):
+    def __init__(self, n_splits=3, max_train_size=0):
```
conventionally we use None to mean "feature disabled"
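The convention can be sketched in isolation: with `None` as the "feature disabled" sentinel, the guard becomes an explicit identity check instead of a magic `> 0` comparison. `make_train_slice` is a hypothetical helper for illustration, not code from the PR:

```python
def make_train_slice(test_start, max_train_size=None):
    # None means "no limit": yield the full history before the test set.
    # A set limit is applied only when it is smaller than the available
    # training range.
    if max_train_size is not None and max_train_size < test_start:
        return slice(test_start - max_train_size, test_start)
    return slice(0, test_start)


print(make_train_slice(5))                    # slice(0, 5, None) -- disabled
print(make_train_slice(5, max_train_size=2))  # slice(3, 5, None) -- capped
```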
sklearn/model_selection/_split.py
Outdated
```python
            yield (indices[:test_start],
                   indices[test_start:test_start + test_size])
            if self.max_train_size > 0 and self.max_train_size < test_start:
                yield (indices[test_start - self.max_train_size:test_start],
```
if this becomes negative it means something else. I think you need a max
The check ensures that the capped window is yielded only when the current fold's training range exceeds max_train_size; otherwise, it falls back to the default behaviour. I have updated the tests to capture what I felt you were mentioning.
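The behaviour under discussion can be sketched with a minimal, self-contained re-implementation of the splitter's index arithmetic (for illustration only; the real logic lives in `TimeSeriesSplit.split`):

```python
import numpy as np


def time_series_split(n_samples, n_splits=3, max_train_size=None):
    # Minimal sketch of the patch's logic: expanding-window splits,
    # optionally capped to the most recent max_train_size samples.
    indices = np.arange(n_samples)
    test_size = n_samples // (n_splits + 1)
    test_starts = range(test_size + n_samples % (n_splits + 1),
                        n_samples, test_size)
    for test_start in test_starts:
        if max_train_size is not None and max_train_size < test_start:
            # Keep only the last max_train_size samples before the test set.
            yield (indices[test_start - max_train_size:test_start],
                   indices[test_start:test_start + test_size])
        else:
            yield (indices[:test_start],
                   indices[test_start:test_start + test_size])


for train, test in time_series_split(6, n_splits=3, max_train_size=2):
    print(train, test)
# -> [1 2] [3]
#    [2 3] [4]
#    [3 4] [5]
```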
Not sure why codecov is red; try to rebase on master and see what happens.
I would have considered writing tests that avoid writing out each split, and instead check invariants like the following: the test set is the same as without `max_train_size`, and the train set with `max_train_size` is a suffix of the train set without it, limited to that length.
But this looks fine to me apart from those nitpicks.
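The reviewer's proposed invariants could be written as a small helper like the following (a sketch; `check_max_train_size_invariants` is a hypothetical name, not a test from the PR):

```python
import numpy as np


def check_max_train_size_invariants(train_full, train_capped, max_train_size):
    # Invariant 1: the capped train set never exceeds max_train_size.
    assert len(train_capped) <= max_train_size
    # Invariant 2: the capped train set is a suffix of the uncapped one.
    suffix = train_full[-len(train_capped):]
    assert np.array_equal(train_capped, suffix)
    # (A full test would also assert the test sets are identical in both.)


# Hypothetical split pair: uncapped train [0..3], capped train [2, 3].
check_max_train_size_invariants(np.arange(4), np.array([2, 3]),
                                max_train_size=2)
print("invariants hold")
```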
sklearn/model_selection/_split.py
Outdated
```diff
@@ -664,14 +664,17 @@ class TimeSeriesSplit(_BaseKFold):
     n_splits : int, default=3
         Number of splits. Must be at least 1.

+    max_train_size : int, optional
+        Maximum size for a single training fold.
```
I'm never sure about the correct use of "fold". Let's call it a "training set" or a "training sample".
```python
    train, test = next(splits)
    assert_array_equal(train, [0, 1, 2, 3, 4])
    assert_array_equal(test, [5])
```
Should really do `assert_raises(StopIteration, next(splits))`.
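The exhaustion check the reviewer is after can be sketched with only the standard library; note that sklearn's `assert_raises` helper takes a callable, so the callable form would be `assert_raises(StopIteration, next, splits)`:

```python
# Toy stand-in for the split generator, with one split remaining.
splits = iter([([0, 1], [2])])
next(splits)  # consume the last split

# Verify the generator is now exhausted.
try:
    next(splits)
except StopIteration:
    exhausted = True
else:
    exhausted = False
print(exhausted)  # -> True
```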
```diff
@@ -1176,6 +1176,46 @@ def test_time_series_cv():
     assert_equal(n_splits_actual, 2)


+def test_time_series_max_train_size():
+    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
```
How about `X = np.zeros((6, 1))`; then it is clear how many samples are in X.
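The suggestion in miniature: since the splitter only looks at indices, an all-zeros array makes the sample count obvious at a glance.

```python
import numpy as np

# The values in X never matter to the splitter, only the number of rows,
# so zeros make the test data self-documenting.
X = np.zeros((6, 1))
print(X.shape[0])  # -> 6 samples
```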
```python
    train, test = next(splits)
    assert_array_equal(train, [2, 3])
    assert_array_equal(test, [4])
```
Another split?
Yes, that seems much cleaner; writing out the splits should have been avoided in the first place. Thank you for the suggestion 👍
CircleCI needs a rebuild.
LGTM. Please add a what's new entry.
Merging. I'll add the what's new entry in master directly.
Thanks @dalmia!
Done in 305ed51.
* ENH: added max_train_size to TimeSeriesSplit
* FIX: update doctest
* FIX: correct error in the previous update
* FIX: added doctest fix for cross_validation.rst
* FIX: remove errors
* TST: tests updated and default value changed to None
* TST: improve split tests
* FIX: reduce code length
Reference Issue
Fixes #8249
What does this implement/fix? Explain your changes.
This adds a parameter max_train_size to TimeSeriesSplit that puts an upper limit on the size of each training fold.
Any other comments?
There is one corner case where the size of the first training fold is smaller than max_train_size. In my implementation, I have taken the last of those. Please check the tests for a more elaborate description.
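For reference, here is a usage sketch of the feature as merged. This assumes a scikit-learn version that includes this PR (0.19 or later); the split boundaries shown follow the behaviour described above:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six samples; values are irrelevant to the splitter, only the row count.
X = np.zeros((6, 1))

# Cap every training set at the 2 most recent samples.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=2)
for train, test in tscv.split(X):
    print(train, test)
# -> [1 2] [3]
#    [2 3] [4]
#    [3 4] [5]
```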