Commit 89be37b

Refac rolling (blue-yonder#173)
* we should track the requirements of the notebook files
* added github ISSUE_template
* updated question about rolling in FAQ
* refactored page about rolling mechanism
* refactored docstring of rolling utility
* it is reproduce instead of produce
* linked to data formats in faq
* fixed two typos in industry 4.0 examples
* clarified that rolling=0 does nothing
* polished first paragraph of rolling page
* now the medical example should be more clear
* verb was missing in roll_time_series method
* parameter should be checked directly
* we do not have a shift parameter for the rolling
* rolling direction of 0 should just return ts container
* Revert "rolling direction of 0 should just return ts container"
  This reverts commit 7278060.
* a maximum length of time series of NaN should break rolling
* adapted unit tests to value error on single row rolling
* added possible reason for nan maximum length
* removed parenthesis in DataFrame constructor
1 parent 16d950d commit 89be37b

6 files changed: +80 -46 lines changed

.github/ISSUE_TEMPLATE.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+Oh no, you encountered a problem while using *tsfresh*.
+
+We, the maintainers, are happy to help you. When opening an issue, please provide the following information to us:
+
+1. Your operating system
+2. The version of *tsfresh* that you are using
+3. The data on which the problem occurred (please do not upload 1000s of time series but try to boil the problem down to a small group or even a single one)
+4. A minimal code snippet which reproduces the problem/bug
+
+For questions, you can also use our [gitter chatroom](https://gitter.im/tsfresh/)

docs/text/faq.rst

Lines changed: 8 additions & 6 deletions
@@ -2,13 +2,15 @@ FAQ
 ===


-1. *Does tsfresh support different time series lengths?*
+1. **Does tsfresh support different time series lengths?**
 Yes, it supports different time series lengths. However, some feature calculators can demand a minimal length
 of the time series. If a shorter time series is passed to the calculator, normally a NaN is returned.


-2. *Is it possible to extract features from rolling/shifted time series?*
-Yes, there is the option `rolling` for the :func:`tsfresh.feature_extraction.extract_features` function.
-Set it to a non-zero value to enable rolling. In the moment, this just rolls the input data into
-as many time series as there are time steps - so there is no internal optimization for rolling calculations.
-Please see :ref:`rolling-label` for more information.
+2. **Is it possible to extract features from rolling/shifted time series?**
+Yes, the :func:`tsfresh.utilities.dataframe_functions.roll_time_series` function allows you to conveniently create a rolled
+time series DataFrame from your data. You just have to transform your data into one of the supported tsfresh
+:ref:`data-formats-label`.
+Then, :func:`tsfresh.utilities.dataframe_functions.roll_time_series` gives you a DataFrame with the rolled time series,
+which you can pass to tsfresh.
+On the following page you can find a detailed description: :ref:`rolling-label`.

docs/text/rolling.rst

Lines changed: 33 additions & 23 deletions
@@ -3,26 +3,32 @@
 How to handle rolling time series
 =================================

-In many application with time series on real-world problems, the "time" column
-(we will call it time in the following, although it can be anything)
-gives a certain sequential order to the data. We can exploit this sequence to generate
-more input data out of single time series, by *rolling* over the data.
-
-Imagine the following situation: you have the data of EEG measurements, that
-you want to use to classify patients into healthy and not healthy (we oversimplify the problem here).
-You have e.g. 100 time steps of data, so you can extract features that may forecast the healthiness
-of the patients. But what would happen if you had only the recorded measurement for 50 time steps?
-The patients would be as healthy as with 100 time steps. So you can easily increase the amount of
-training data by reusing time series cut into smaller pieces.
-
-Another example is streaming data, e.g. in Industry 4.0 applications. Here you typically get one
-new data row at a time and use this to predict machine failures for example. To train you model,
+Let's assume that we have a DataFrame in one of the tsfresh :ref:`data-formats-label`.
+The "sort" column of such a container gives a sequential order to the individual measurements.
+In the case of time series this can be the *time* dimension while in the case of spectra the order is given by the
+*wavelength* or *frequency* dimensions.
+We can exploit this sequence to generate more input data out of single time series, by *rolling* over the data.
+
+Imagine the following situation:
+You have the data of certain sensors (e.g. EEG measurements) as the base to classify patients into a healthy and not
+healthy group (we oversimplify the problem here).
+Let's say you have sensor data of 100 time steps, so you may extract features for the forecasting of the patient's
+healthiness by a classification algorithm.
+If you also have measurements of the healthiness for those 100 time steps (this is the target vector), then you could
+predict the healthiness of the patient in every time step, which essentially poses a time series forecasting problem.
+So, to do that, you want to extract features in every time step of the original time series while, for example, looking at
+the last 10 steps.
+A rolling mechanism creates such time series for every time step by creating sub time series of the sensor data of the
+last 10 time steps.
+
+Another example can be found in streaming data, e.g. in Industry 4.0 applications.
+Here you typically get one new data row at a time and use this to predict, for example, machine failures. To train your model,
 you could act as if you would stream the data, by feeding your classifier the data after one time step,
 the data after the first two time steps etc.

 Both examples imply, that you extract the features not only on the full data set, but also
-on all temporal coherent subsets of data, which is the process of *rolling*. You can do this easily,
-by calling the function :func:`tsfresh.utilities.dataframe_functions.roll_time_series`.
+on all temporally coherent subsets of data, which is the process of *rolling*. In tsfresh, this is implemented in the
+function :func:`tsfresh.utilities.dataframe_functions.roll_time_series`.

 The rolling mechanism takes a time series :math:`x` with its data rows :math:`[x_1, x_2, x_3, ..., x_n]`
 and creates :math:`n` new time series :math:`\hat x^k`, each of them with a different consecutive part
@@ -31,8 +37,7 @@ of :math:`x`:
 .. math::
     \hat x^k = [x_k, x_{k-1}, x_{k-2}, ..., x_1]

-To see what this does in real-world applications, we look into the following example data frame (we show only one possible data format,
-but rolling works on all 3 data formats :ref:`data-formats-label`):
+To see what this does in real-world applications, we look into the following example flat DataFrame in tsfresh format:

 +----+------+----+----+
 | id | time | x  | y  |
@@ -50,9 +55,13 @@ but rolling works on all 3 data formats :ref:`data-formats-label`):
 | 2  | t9   | 11 | 13 |
 +----+------+----+----+

-where you have measured two values (x and y) for two different entities (1 and 2) in 4 or 2 time steps.
+where you have measured the values from two sensors x and y for two different entities (id 1 and 2) in 4 or 2 time
+steps (t1 to t9).

-If you set `rolling` to 0, the feature extraction works on
+Now, we can use :func:`tsfresh.utilities.dataframe_functions.roll_time_series` to get consecutive sub-time series.
+E.g. if you set `rolling` to 0, the feature extraction works on the original time series without any rolling.
+
+So it extracts 2 sets of features,

 +----+------+----+----+
 | id | time | x  | y  |
@@ -76,8 +85,6 @@ and
 | 2  | t9   | 11 | 13 |
 +----+------+----+----+

-So it extracts 2 set of features.
-
 If you set rolling to 1, the feature extraction works with all of the following time series:

 +----+------+----+----+
@@ -164,4 +171,7 @@ If you set rolling to -1, you end up with features for the time series, rolled i
 | 2  | t8   | 10 | 12 |
 +----+------+----+----+
 | 2  | t9   | 11 | 13 |
-+----+------+----+----+
++----+------+----+----+
+
+We only gave an example for the flat DataFrame format, but rolling actually works on all 3 :ref:`data-formats-label`
+that are supported by tsfresh.
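
The prefix construction :math:`\hat x^k` described in this page can be sketched in plain pandas. This is only an illustration of the idea, not the tsfresh implementation: the helper name `roll_prefixes`, the "id=..., step=..." naming of the new ids, and the sensor values are all assumptions made for the example; only the shape (one entity with 4 time steps, one with 2) follows the table above.

```python
import pandas as pd


def roll_prefixes(df, column_id, column_sort):
    """Illustrative only: build one sub-series per time step, each holding
    all rows of one id up to and including that step."""
    pieces = []
    for entity, group in df.sort_values(column_sort).groupby(column_id):
        for k in range(1, len(group) + 1):
            piece = group.iloc[:k].copy()
            # Hypothetical naming scheme for the new ids, chosen for this sketch.
            piece[column_id] = "id={}, step={}".format(entity, k)
            pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


# Made-up values; only the number of time steps per id mirrors the example.
example = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "time": [0, 1, 2, 3, 8, 9],
    "x": [1, 2, 3, 4, 10, 11],
    "y": [5, 6, 7, 8, 12, 13],
})
rolled = roll_prefixes(example, column_id="id", column_sort="time")
```

Each original id with n rows contributes n new ids here, so the rolled frame grows quadratically with the series length. That growth is worth keeping in mind before rolling long time series.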

notebooks-requirements.txt

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+matplotlib==2.0.0
+seaborn==0.7.1
+ipython==5.3.0
+notebook==4.4.1

tests/utilities/test_dataframe_functions.py

Lines changed: 7 additions & 7 deletions
@@ -164,13 +164,13 @@ def test_with_wrong_input(self):

 class RollingTestCase(TestCase):
     def test_with_wrong_input(self):
-        test_df = pd.DataFrame([{"id": 0, "kind": "a", "value": 3, "sort": np.NaN}])
+        test_df = pd.DataFrame({"id": [0, 0], "kind": ["a", "b"], "value": [3, 3], "sort": [np.NaN, np.NaN]})
         self.assertRaises(ValueError, dataframe_functions.roll_time_series,
                           df_or_dict=test_df, column_id="id",
                           column_sort="sort", column_kind="kind",
                           rolling_direction=1)

-        test_df = pd.DataFrame([{"id": 0, "kind": "a", "value": 3, "sort": 1}])
+        test_df = pd.DataFrame({"id": [0, 0], "kind": ["a", "b"], "value": [3, 3], "sort": [1, 1]})
         self.assertRaises(AttributeError, dataframe_functions.roll_time_series,
                           df_or_dict=test_df, column_id="strange_id",
                           column_sort="sort", column_kind="kind",
@@ -197,12 +197,12 @@ def test_with_wrong_input(self):
                           column_sort=None, column_kind=None,
                           rolling_direction=0)

-    def test_single_row(self):
+    def test_assert_single_row(self):
         test_df = pd.DataFrame([{"id": np.NaN, "kind": "a", "value": 3, "sort": 1}])
-        dataframe_functions.roll_time_series(
-            df_or_dict=test_df, column_id="id",
-            column_sort="sort", column_kind="kind",
-            rolling_direction=1)
+        self.assertRaises(ValueError, dataframe_functions.roll_time_series,
+                          df_or_dict=test_df, column_id="id",
+                          column_sort="sort", column_kind="kind",
+                          rolling_direction=1)

     def test_positive_rolling(self):
         first_class = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "time": range(4)})

tsfresh/utilities/dataframe_functions.py

Lines changed: 18 additions & 10 deletions
@@ -330,12 +330,19 @@ def normalize_input_to_internal_representation(df_or_dict, column_id, column_sor

 def roll_time_series(df_or_dict, column_id, column_sort, column_kind, rolling_direction):
     """
-    Roll the (sorted) data frames for each kind and each id separately in "time"
-    (time is here the abstract sort order defined by the sort column). For each rolling step a new id will be
-    created, with the name "id={id}, shift={shift}" where the id is the former id of the column and shift is the
-    amount of "time" shifts. ATTENTION: This will (obviously) create new IDs! The sign of rolling defines the
-    direction of time rolling.
-    For more information, please see :ref:`rolling-label`.
+    Roll the (sorted) data frames for each kind and each id separately in the "time" domain
+    (which is represented by the sort order of the sort column given by `column_sort`).
+
+    For each rolling step, a new id is created by the scheme "id={id}, shift={shift}", where id is the former id of the
+    column and shift is the amount of "time" shifts.
+
+    A few remarks:
+
+    * This method will create new IDs!
+    * The sign of rolling defines the direction of time rolling; a positive value means we are going back in time
+    * It is possible to shift time series of different lengths, but
+      we assume that the time series are uniformly sampled
+    * For more information, please see :ref:`rolling-label`.

     :param df_or_dict: a pandas DataFrame or a dictionary. The required shape/form of the object depends on the rest of
         the passed arguments.
@@ -358,6 +365,9 @@ def roll_time_series(df_or_dict, column_id, column_sort, column_kind, rolling_di
     :rtype: the one from df_or_dict
     """

+    if rolling_direction == 0:
+        raise ValueError("Rolling direction of 0 is not possible")
+
     if isinstance(df_or_dict, dict):
         if column_kind is not None:
             raise ValueError("You passed in a dictionary and gave a column name for the kind. Both are not possible.")
@@ -405,14 +415,12 @@ def roll_time_series(df_or_dict, column_id, column_sort, column_kind, rolling_di
     # Roll the data frames if requested
     rolling_direction = np.sign(rolling_direction)

-    if rolling_direction == 0:
-        raise ValueError("Rolling direction of 0 is not possible")
-
     grouped_data = df.groupby(grouper)
     maximum_number_of_timeshifts = grouped_data.count().max().max()

     if np.isnan(maximum_number_of_timeshifts):
-        maximum_number_of_timeshifts = 0
+        raise ValueError("Somehow the maximum length of your time series is NaN (Does your time series container have "
+                         "only one row?). Can not perform rolling.")

     if rolling_direction > 0:
         range_of_shifts = range(maximum_number_of_timeshifts, -1, -1)
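
The two behavioural changes in this file (rejecting a rolling direction of 0 before any other work, and raising instead of silently continuing when no valid time series length can be determined) can be sketched in isolation. The helper `check_rolling_preconditions` below is hypothetical and not part of tsfresh. It uses `groupby(...).size()` rather than the `count().max().max()` expression from the diff, and its second error message is made up for the sketch.

```python
import numpy as np
import pandas as pd


def check_rolling_preconditions(df, column_id, rolling_direction):
    """Hypothetical helper, not part of tsfresh: validate inputs before rolling.

    Returns the sign of the rolling direction and the length of the longest
    time series, or raises ValueError if rolling cannot be performed.
    """
    # Fail early, before any other work is done on the input.
    if rolling_direction == 0:
        raise ValueError("Rolling direction of 0 is not possible")

    # groupby drops rows whose id is NaN, so this can come back empty.
    lengths = df.groupby(column_id).size()
    if lengths.empty:
        raise ValueError("No time series with a valid id found. Can not perform rolling.")

    return int(np.sign(rolling_direction)), int(lengths.max())


valid = pd.DataFrame({"id": [0, 0, 1], "sort": [1, 2, 1], "value": [3, 4, 5]})
direction, longest = check_rolling_preconditions(valid, "id", rolling_direction=-3)
```

A frame whose only row has a NaN id, like the one in `test_assert_single_row`, leaves no groups after `groupby`, which is one way the maximum length can end up undefined.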
