Unclear behavior of max_train_size argument in TimeSeriesSplit #13666

Open
@mitar

Description

I am trying to understand the behavior of TimeSeriesSplit, especially the max_train_size parameter. I was initially surprised that it is an absolute number and not a ratio, as it is in other splitting operations.

I traced this parameter to issue #8249 and PR #8282 and realized that it was added to support window-based splitting, as described here. This surprised me, because the documentation does not make it clear that this is what the parameter is for. Moreover, I found the parameters initialWindow, horizon, and fixedWindow much easier to understand, especially with that image.
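
To make the current behavior concrete, here is a minimal sketch. The toy array of 12 samples, `n_splits=3`, and the helper `train_sizes` are my own choices for illustration and not taken from the documentation. It compares the training-set sizes with and without `max_train_size`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 samples, one feature (an assumption for illustration only).
X = np.arange(12).reshape(-1, 1)


def train_sizes(splitter, X):
    """Hypothetical helper: training-set size of every split."""
    return [len(train) for train, _ in splitter.split(X)]


# Default: the training set keeps growing (expanding window).
expanding = train_sizes(TimeSeriesSplit(n_splits=3), X)

# With max_train_size: the training set is capped at an absolute
# number of samples, which gives a sliding window once the cap is hit.
capped = train_sizes(TimeSeriesSplit(n_splits=3, max_train_size=4), X)

print(expanding)
print(capped)
```

Note that the cap is counted in samples, so a sensible value depends on the dataset size, unlike `n_splits`.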

I would suggest that:

If splitting is done by number of folds (which I prefer, because it adapts to different dataset sizes automatically), then the window size should also be expressed in folds. The parameters could then be:

  • How many folds to do.
  • Number of folds used in the horizon, i.e., as test data. This currently appears to be fixed to 1 in this splitting operation and cannot be configured. I suggest we allow it to be configured.
  • Number of folds used in the window, i.e., as training data. The default could be None, meaning a non-fixed window that uses all folds before the test data. Alternatively, you could fix it to get a sliding window.
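
A rough sketch of what I mean, in plain Python. This is not an existing scikit-learn API: the function `fold_window_split` and its parameters `n_folds`, `horizon_folds`, and `window_folds` are hypothetical names, and I am assuming the data is cut into `n_folds` contiguous chunks of near-equal size.

```python
def fold_window_split(n_samples, n_folds, horizon_folds=1, window_folds=None):
    """Yield (train, test) index lists with window and horizon given in folds.

    Hypothetical sketch of the proposal, not scikit-learn code.
    window_folds=None means an expanding window (all earlier folds).
    """
    # Boundaries of n_folds contiguous chunks covering range(n_samples).
    bounds = [i * n_samples // n_folds for i in range(n_folds + 1)]

    # The first fold can only be training data, so test folds start at 1.
    for test_start in range(1, n_folds - horizon_folds + 1):
        if window_folds is None:
            train_start = 0
        else:
            # Sliding window: at most window_folds folds before the test data.
            train_start = max(0, test_start - window_folds)
        train = list(range(bounds[train_start], bounds[test_start]))
        test = list(range(bounds[test_start], bounds[test_start + horizon_folds]))
        yield train, test


# Usage: 10 samples in 5 folds, sliding window of 2 folds, horizon of 1 fold.
for train, test in fold_window_split(10, 5, horizon_folds=1, window_folds=2):
    print(train, test)
```

Because everything is expressed in folds, the same settings would scale with the dataset size without having to recompute an absolute sample count.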

Versions

This relates to the behavior in sklearn v0.20.3.
