Commit 6cb1fb2

committed
Rework some formulations of the data_formats help page
1 parent c5e4cc7 commit 6cb1fb2

1 file changed

+46
-25
lines changed

docs/text/data_formats.rst

Lines changed: 46 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -3,48 +3,63 @@
33
Data Formats
44
============
55

6-
tsfresh offers three different options to specify the time series data to be used in the :func:`tsfresh.extract_features`
7-
function. Irrespective of the input format, tsfresh will always return the calculated features in the same output format.
6+
tsfresh offers three different options to specify the time series data to be used in the
7+
:func:`tsfresh.extract_features` function (and all utility functions that expect a time series, e.g. the
8+
:func:`tsfresh.utilities.dataframe_functions.roll_time_series` function).
9+
10+
Irrespective of the input format, tsfresh will always return the calculated features in the same output format
11+
described below.
812

913
All three input format options consist of :class:`pandas.DataFrame` objects. There are four important column types that
10-
make up those DataFrames:
14+
make up those DataFrames. Each will be described with an example from the robot failures dataset
15+
(see :ref:`quick-start-label`).
1116

12-
Mandatory
17+
Mandatory:
1318

14-
:`column_id`: This column indicates which entities the time series belong to. Features will be extracted individually for each
15-
entity. The resulting feature matrix will contain one row per entity.
19+
:`column_id`: This column indicates which entities the time series belong to. Features will be extracted individually
20+
for each entity. The resulting feature matrix will contain one row per entity.
21+
Each robot is a different entity, so each of them has a different id.
1622
:`column_value`: This column contains the actual values of the time series.
23+
This corresponds to the measured values for the different sensors on the robots.
1724

18-
Optional (but strongly recommended to specify)
25+
Optional (but strongly recommended to specify if you have this column):
1926

20-
:`column_sort`: This column contains values which allow to sort the time series (e.g. time stamps). It is not required to
21-
have equidistant time steps or the same time scale for the different ids and/or kinds.
27+
:`column_sort`: This column contains values that allow the time series to be sorted (e.g. time stamps). It is not required
28+
to have equidistant time steps or the same time scale for the different ids and/or kinds.
2229
If you omit this column, the DataFrame is assumed to be already sorted in increasing order.
30+
The robot sensor measurements each have a time stamp which is used in this column.
31+
32+
Please note that none of the algorithms in tsfresh uses the actual values in this time column; they use only their
33+
sorting order.
2334

24-
Optional
35+
Optional:
2536

2637
:`column_kind`: This column indicates the names of the different time series types (e.g. different sensors in an
27-
industrial application). For each kind of time series the features are calculated individually.
38+
industrial application as in the robot dataset).
39+
For each kind of time series the features are calculated individually.
2840

2941

3042
Important: None of these columns is allowed to contain any ``NaN``, ``Inf`` or ``-Inf`` values.
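
A quick check before extraction can catch such values early. The following is a minimal sketch using only pandas and NumPy; the helper name ``find_non_finite_columns`` and the sample values are made up for illustration and are not part of tsfresh.

```python
import numpy as np
import pandas as pd


def find_non_finite_columns(df):
    """Return the names of columns that contain NaN, Inf or -Inf.

    Numeric columns are tested with np.isfinite; all other columns
    (e.g. string ids) are only tested for missing values.
    """
    bad = []
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            has_bad = not np.isfinite(col.to_numpy(dtype=float)).all()
        else:
            has_bad = bool(col.isna().any())
        if has_bad:
            bad.append(name)
    return bad


# Hypothetical data: "x" holds an Inf and "y" holds a NaN.
df = pd.DataFrame({
    "id": ["A", "A", "B"],
    "time": [1, 2, 1],
    "x": [0.5, np.inf, 1.5],
    "y": [1.0, 2.0, np.nan],
})
bad_columns = find_non_finite_columns(df)
```

How to handle the offending rows (dropping or imputing them) depends on the data and is left open here.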
3143

32-
Now there are three slightly different input formats for the time series data:
44+
In the following, we describe the different input formats that are built on those columns:
3345
* A flat DataFrame
3446
* A stacked DataFrame
3547
* A dictionary of flat DataFrames
3648

3749
The difference between a flat and a stacked DataFrame is indicated by specifying or not specifying the parameters
38-
`column_value` and `column_kind` in the `extract_features` function.
50+
`column_value` and `column_kind` in the :func:`tsfresh.extract_features` function.
51+
52+
If you do not know which one to choose, you probably want to try out the flat or stacked DataFrame.
3953

4054
Input Option 1. Flat DataFrame
4155
------------------------------
4256

43-
If both `column_value` and `column_kind` are set to ``None``, the time series data is assumed to be in a flat
44-
DataFrame. This means that each different time series is saved as its own column.
57+
If both `column_value` and `column_kind` are set to ``None``, the time series data is assumed to be in a flat
58+
DataFrame. This means that each different time series must be saved as its own column.
4559

46-
Example: Imagine you record the values of time series x and y for different objects A and B for three different times t1, t2 and
47-
t3. Now you want to calculate some feature with tsfresh. Your resulting DataFrame has to look like this:
60+
Example: Imagine you record the values of time series x and y for different objects A and B for three different
61+
times t1, t2 and t3. Now you want to calculate some features with tsfresh. Your resulting DataFrame may look
62+
like this:
4863

4964
+----+------+----------+----------+
5065
| id | time | x | y |
@@ -68,18 +83,18 @@ and you would pass
6883
6984
column_id="id", column_sort="time", column_kind=None, column_value=None
7085
71-
to the extraction functions.
86+
to the extraction functions, to extract features separately for all ids and separately for the x and y values.
7287

7388
Input Option 2. Stacked DataFrame
7489
---------------------------------
7590

7691
If both `column_value` and `column_kind` are set, the time series data is assumed to be a stacked DataFrame.
77-
This means that there are no different columns for the different type of time series.
92+
This means that there are no different columns for the different types of time series.
7893
This representation has several advantages over the flat DataFrame.
7994
For example, the time stamps of the different time series do not have to align.
8095

8196
It does not contain different columns for the different types of time series but only one
82-
value column and a kind column:
97+
value column and a kind column. The example from above would look like this:
8398

8499
+----+------+------+----------+
85100
| id | time | kind | value |
@@ -115,11 +130,14 @@ Then you would set
115130
116131
column_id="id", column_sort="time", column_kind="kind", column_value="value"
117132
133+
to end up with the same extracted features as above.
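
If your data already exists as a flat DataFrame, one way to obtain the stacked layout is ``pandas.DataFrame.melt``. This is a sketch with made-up numbers; the column names ``id``, ``time``, ``kind`` and ``value`` follow the example in this document, and the conversion step itself is plain pandas rather than part of tsfresh.

```python
import pandas as pd

# Hypothetical flat DataFrame: one column per time series (x and y).
flat = pd.DataFrame({
    "id": ["A", "A", "B", "B"],
    "time": [1, 2, 1, 2],
    "x": [0.1, 0.2, 0.3, 0.4],
    "y": [1.1, 1.2, 1.3, 1.4],
})

# Stack the x and y columns into a single value column and record
# the name of the originating column in a kind column.
stacked = flat.melt(
    id_vars=["id", "time"],
    value_vars=["x", "y"],
    var_name="kind",
    value_name="value",
)
```

The result has one row per (id, time, kind) combination, which matches the ``column_kind="kind"`` and ``column_value="value"`` arguments shown above.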
134+
135+
118136
Input Option 3. Dictionary of flat DataFrames
119137
---------------------------------------------
120138

121-
Instead of passing a DataFrame which must be split up by its different kinds, you can also give a dictionary mapping
122-
from the kind as string to a DataFrame containing only the time series data of that kind.
139+
Instead of passing a DataFrame which must be split up by its different kinds by tsfresh, you can also give a
140+
dictionary mapping from the kind as string to a DataFrame containing only the time series data of that kind.
123141
So essentially you are using a separate DataFrame for each kind of time series.
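
As a rough illustration of that mapping, a stacked DataFrame can be split by kind with a pandas ``groupby``. The values below are invented and the variable name ``kind_to_df`` is hypothetical; this only sketches how such a dictionary could be built and is not how tsfresh does the split internally.

```python
import pandas as pd

# Hypothetical stacked DataFrame with two kinds, x and y.
stacked = pd.DataFrame({
    "id": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "time": [1, 2, 1, 2, 1, 2, 1, 2],
    "kind": ["x", "x", "y", "y", "x", "x", "y", "y"],
    "value": [0.1, 0.2, 1.1, 1.2, 0.3, 0.4, 1.3, 1.4],
})

# Map each kind (as a string) to a DataFrame holding only the rows
# of that kind; the kind column is then redundant and is dropped.
kind_to_df = {
    kind: group.drop(columns="kind").reset_index(drop=True)
    for kind, group in stacked.groupby("kind")
}
```

Each DataFrame in the dictionary keeps the ``id``, ``time`` and ``value`` columns.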
124142

125143
The data from the example can be split into two DataFrames resulting in the following dictionary
@@ -163,7 +181,7 @@ The data from the example can be split into two DataFrames resulting in the foll
163181

164182
}
165183

166-
tsfresh would be passed this dictionary and the following arguments
184+
You would pass this dictionary to tsfresh together with the following arguments:
167185

168186
.. code:: python
169187
@@ -186,5 +204,8 @@ It will always be a :class:`pandas.DataFrame` with the following layout
186204
| B | ... | ... | ... | ... | ... | ... |
187205
+----+-------------+-----+-------------+-------------+-----+-------------+
188206

189-
where the x features are calculated using all x values (independently for A and B), y features using all y values and so
190-
on.
207+
where the x features are calculated using all x values (independently for A and B), y features using all y values
208+
and so on.
209+
210+
This form of DataFrame is also the expected input format to the feature selection algorithms (e.g. the
211+
:func:`tsfresh.select_features` function).
