DOC new example on feature engineering for cyclic time features #20281
4f69c63
to
e620c5f
Compare
A very nice and clearly written example tutorial!
# %%
# We observe that this model performance can almost rival the performance of
# the gradient boosted trees with an average error around 6% of the maximum
# demand.
Not asking to add it here, since it's already quite long, but how do polynomial features without cyclic_spline_transformer perform?
Also, I wonder if using KBinsDiscretizer would produce mostly equivalent results in terms of score, though with fewer artifacts in the prediction.
Intuitively I think I would agree with you. Let me try on my local copy what you suggest out of curiosity.
- Polynomial features or a polynomial kernel approximation on the raw time features do not work any better than the linear model on the raw time features.
- Binning, on the other hand, is only slightly worse than spline features (one or two percent worse than the matching model, with or without the polynomial kernel approximation). Here are the plots when binning:
Interesting, thanks for checking it!
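For readers who want to reproduce the binning variant discussed above, here is a minimal sketch. It is not the code that was run in this thread: the `hours` array is a hypothetical stand-in for the dataset's hour column, and the choice of 12 uniform bins is an assumption made only for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical stand-in for the "hour" column of the dataset.
hours = np.arange(24, dtype=float).reshape(-1, 1)

# 12 equal-width bins over the observed range, one-hot encoded as a dense array.
# The bin count and the "uniform" strategy are illustrative assumptions.
binner = KBinsDiscretizer(n_bins=12, encode="onehot-dense", strategy="uniform")
hour_bins = binner.fit_transform(hours)

print(hour_bins.shape)
```

Each row of `hour_bins` has a single non-zero entry, so the model sees a piecewise-constant encoding of the hour rather than the smooth one that splines provide.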
It's probably too late to change it now in sphinx-gallery but
It's not too late at all. Where should we put this? What name for the file and the title do you suggest?
I added one-hot encoding of the time features because it's a natural strong baseline in this case and it makes for interesting analysis. I also started to reorganize the order a bit. I want to take this further (move the plot for the features + linear model before introducing the Nystroem kernel models).
That this example requires only ~15 seconds is very impressive given the scope.
# %%
# We visualize those predictions by zooming on the last 96 hours (4 days) of
# the test set to get some qualitative insights:
plt.figure(figsize=(12, 4))
May we use the OO interface of matplotlib? (especially now that we have so many plots)
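As a rough illustration of what the object-oriented matplotlib interface suggested here could look like, the sketch below uses `plt.subplots` and calls methods on the returned `Axes`. The data, labels, and title are hypothetical placeholders, not the example's actual predictions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data standing in for the last 96 hours of the test set.
rng = np.random.default_rng(0)
hours = np.arange(96)
actual = 0.5 + 0.3 * np.sin(2 * np.pi * hours / 24)
predicted = actual + rng.normal(scale=0.05, size=hours.shape)

# OO interface: create the figure and axes explicitly, then call Axes methods.
fig, ax = plt.subplots(figsize=(12, 4))
fig.suptitle("Predictions on the last 96 hours (illustrative data)")
ax.plot(hours, actual, "x-", alpha=0.5, label="Actual demand")
ax.plot(hours, predicted, "x-", label="Model predictions")
ax.set_xlabel("Hour index")
ax.legend()
```

Keeping an explicit `ax` per figure avoids relying on matplotlib's implicit "current figure" state, which matters more as the number of plots in the example grows.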
# Again we zoom on the last 4 days of the test set:

last_hours = slice(-96, None)
plt.figure(figsize=(12, 4))
mpl OO interface here as well?
Awesome example!
np.linspace(0, 26, 1000).reshape(-1, 1),
columns=["hour"],
)
splines = periodic_spline_transformer(24, n_knots=12).fit_transform(hour_df)
I find it non-intuitive to have 11 splines in the figure. I know this number is arbitrary, but to relate splines with bins, wouldn't it make more sense to have 12 splines (and 13 knots)?
Then in periodic_spline_transformer, the default would be n_knots = period + 1.
I'll see what I can do. It makes me think that maybe the SplineTransformer should allow for an n_splines and a period argument...
But that could make the parameters docstring very complex to understand. /cc @lorentzenchr.
When writing the SplineTransformer, I thought the number of knots was more intuitive than the number of splines/dof. But I documented the numbers very clearly (and the number of output features is accessible via n_features_out_). I did not, however, think of periodic splines or a period. That came only a little later with @mlondschien.
A period argument, however, would make sense in my opinion.
I was only suggesting to use periodic_spline_transformer(24, n_knots=13) in this example, to get 12 splines instead of 11. I agree the numbers are well documented in SplineTransformer.
Nice writeup! Contrary to what you wrote above, @ogrisel, the numbers of knots you chose are not natural; they are arbitrary. I would vary the period but keep the number of knots fixed for month / weekday / hour (e.g. 5?). If you use period + 1 knots, resulting in period splines, the resulting splines are equivalent to one-hot-encoded features (assuming integer-valued features). This is why the performance of splines is so similar to that of one-hot-encoded features. To benefit from the additional "smoothness" of splines, you will need to reduce the number of splines. Note that you could use non-evenly spaced knots, e.g. via quantiles.
If you want to display the strengths of periodic splines, I would suggest including interactions between periodic transformations of the time variables. With e.g. 4 knots this could be manageable, whereas it would explode for one-hot-encoded features.
I updated the example to control for the number of splines and made the number of knots a technical detail.
Just excellent and a lot of fun to read!
Already now: the best tutorial/example of the year. And we still have a year to go🥳 🍻
Maybe something like
Agreed, but I would rather not change the sphinx gallery as part of this PR but instead coordinate via #18257.
I think I addressed all the comments. Thanks for the reviews!
#
# Here, we do minimal ordinal encoding for the categorical variables and then
# let the model know that it should treat those as categorical variables by
# using a dedicated tree splitting rule.
It might be nice to mention that we explicitly provide the order of the categories to avoid automatic ordering based on lexicography.
done.
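The point about lexicographic ordering can be illustrated with a small sketch. The season names below are hypothetical values chosen for illustration; the example's actual category lists may differ.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical season column; the category names are for illustration only.
seasons = np.array([["spring"], ["summer"], ["fall"], ["winter"]], dtype=object)

# Default: categories are sorted, so the codes follow alphabetical order.
default_codes = OrdinalEncoder().fit_transform(seasons).ravel()

# Explicit: the codes follow the order of the list we pass in.
explicit = OrdinalEncoder(categories=[["spring", "summer", "fall", "winter"]])
explicit_codes = explicit.fit_transform(seasons).ravel()

print(default_codes, explicit_codes)
```

With the default behaviour, "fall" receives the smallest code because it sorts first alphabetically; passing `categories` explicitly keeps the codes aligned with the intended order.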
# %%
# This model has an average error around 4 to 5% of the maximum demand. This is
# quite good for a first trial without any hyper-parameter tuning! We just had
# to make the categorical variables explicit. Note that the time related
I see that you mentioned this point now. I was expecting to see it a bit earlier :)
I find it more lightweight to do it this way.
"""
==========================
Cyclic feature engineering
I wonder if we should have something more related to "date-time encoding". I think it might be easier to find than "cyclic", even if the title is correct.
I changed the title. The title and the filename no longer match though. I think it's ok but not 100% sure.
This is a very nice tutorial! Since I added the periodic feature to the
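from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

# periodic_spline_transformer is the helper defined in the example under review.
hour_workday_interaction = make_pipeline(
    ColumnTransformer(
        [
            ("cyclic_hour", periodic_spline_transformer(24, n_splines=8), ["hour"]),
            ("workingday", FunctionTransformer(lambda x: x == "True"), ["workingday"]),
        ]
    ),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
)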
That's a good point. But the execution speed won't be the same. Maybe I will add a note. I will think a bit about how to take your other insightful remarks into account.
26c7b6f
to
a51a667
Compare
some nitpicks
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@mlondschien I have updated the notebook to take your remarks into account in the last few commits.
@mlondschien we have a problem with the periodic splines: there are 12 splines for a period of 24 in the figure, as expected. The periodic signal seems to start again at the right location to ensure continuity. But it seems that we have a missing spline near the end of the period, even though the number of splines (a.k.a. output features) was correct. This seems to be caused by using See the commit below:
What is the use case / expected behaviour here? Are you interested in producing pretty plots or features for modelling? I see three (non-compatible) outcomes here (i.e. for a periodicity of 24):
- If you want 12 splines with knots that are spaced two hours apart, you need to pass
- If you want splines with knots that are spaced two hours apart without an intercept, you need to pass
- If you want 12 splines without an intercept, you need to pass
As we always use an L2 penalty (Ridge regression), we can set
BTW, I reeeeeeeally like the addition of interactions in 8d066f3!
I agree. Since we use regularization, I think this is what makes most sense for this example: we want a symmetric handling of all the hours. I prefer to not use
I merged this. Thank you everyone for the detailed reviews.
This looks awesome! Thank you for this.
Yeah indeed 🥇
…it-learn#20281) Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Here is a prototype example to explore some cyclic date-related feature engineering strategies being discussed in #20259.
This is not meant to be reviewed or merged in its current state. In particular, I did not put any narrative yet, but once we have converged on which models we want to highlight, we can turn this into a full-fledged tutorial.
Update: I think this example is interesting to consider for merging irrespective of the outcome of the discussion in #20259.