[WIP] New Feature - CyclicalEncorder (cosine/sine) in preprocessing #20259
Conversation
I think this would be a very nice addition to scikit-learn. Thanks for working on this.

This PR still needs a new paragraph in the user guide (in the preprocessing section). It would be interesting to mention [...] there.

I think this PR also needs a practical usage example where this transformer is included in a column transformer & pipeline to preprocess the date feature on a regression problem with either weather/season-related cyclic effects (e.g. time progression in the current year) or business-related cyclic effects (e.g. time progression in the current week or current month, time progression in the current day). This could be achieved by chaining [...].

Such an example could use a dataset such as https://www.openml.org/d/1168 to try to predict the load or the price.
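[Editor's note: to make the suggestion above concrete, here is a minimal sketch of such a column transformer & pipeline. Since the `CyclicalEncoder` API is still under discussion, the cyclic features are built with a plain `FunctionTransformer`; the column names and the `Ridge` estimator are illustrative assumptions, not part of the PR.]

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def sin_cos(period):
    # Map a feature with the given period onto the unit circle,
    # producing one sine and one cosine column.
    return FunctionTransformer(
        lambda x: np.hstack(
            [np.sin(x * 2 * np.pi / period), np.cos(x * 2 * np.pi / period)]
        )
    )

# Hypothetical date-derived columns of the input dataframe.
preprocessor = ColumnTransformer(
    transformers=[
        ("hour", sin_cos(24), ["hour"]),
        ("weekday", sin_cos(7), ["weekday"]),
        ("month", sin_cos(12), ["month"]),
    ],
    remainder="drop",
)
model = make_pipeline(preprocessor, Ridge())
# model.fit(X_train, y_train) on a frame that has these columns.
```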
Also, would it be possible to accept non-discretized continuous inputs with a known periodicity?
Alternatively one could use the Bike Sharing Demand forecasting dataset:

```python
from sklearn.datasets import fetch_openml

X, y = fetch_openml("Bike_Sharing_Demand", as_frame=True, version=2, return_X_y=True)
```

The datetime info is already expanded.
@ogrisel Thank you for your feedback and inputs.

Currently, it is not possible to accept continuous inputs, as the current implementation uses OrdinalEncoder under the hood. Continuous inputs could be handled by something like a CyclicalScaler, but that is out of scope for this PR for now. I will work on the rest of your comments later.
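[Editor's note: for reference, a minimal sketch of what such a hypothetical `CyclicalScaler` could compute for continuous inputs with a known period; this is a plain trigonometric mapping with no discretization, and is not part of this PR.]

```python
import numpy as np

def cyclical_scale(X, period):
    # No ordinal encoding needed: continuous values (e.g. fractional
    # hours) are mapped directly onto the unit circle.
    angle = 2 * np.pi * np.asarray(X) / period
    return np.hstack([np.sin(angle), np.cos(angle)])

hours = np.array([[0.0], [6.5], [13.25], [23.9]])
cyclical_scale(hours, period=24.0)  # shape (4, 2)
```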
Thinking out loud: maybe we could even have a [...]. There could be 2 strategies for encoding: either [...].

WDYT @tsuga, @lorentzenchr and @mlondschien?

The bi-feature cos-sin encoding might be enough to get a smooth cyclic representation, but it's not necessarily optimal for a downstream linear model. I have the feeling that the spline strategy can be more expressive by choosing more (evenly spaced) knots. For instance, a day-in-week feature with 7 or 14 knots should represent weekends more effectively (possibly at the cost of overfitting) while keeping a well-conditioned design matrix.

As an alternative to many-knot spline-based periodic features, it would be possible to use a cosine encoding with more phased features of the same period. However, I have the feeling that the conditioning of the resulting optimization problem would be less favorable. Furthermore, the locality of the spline features is more likely to yield more interpretable linear model coefficients.
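[Editor's note: for concreteness, a minimal sketch of the spline strategy described above, assuming a scikit-learn version where `SplineTransformer` supports `extrapolation="periodic"`; the helper name and knot placement are illustrative.]

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

def periodic_spline_transformer(period, n_splines, degree=3):
    # With periodic extrapolation and include_bias=True, the number of
    # output spline features equals n_knots - 1, hence n_splines + 1
    # evenly spaced knots spanning one full period.
    n_knots = n_splines + 1
    return SplineTransformer(
        degree=degree,
        knots=np.linspace(0, period, n_knots).reshape(n_knots, 1),
        extrapolation="periodic",
        include_bias=True,
    )

# A day-in-week feature with 7 evenly spaced knots, as discussed above;
# 14 would give a more expressive (but more overfitting-prone) basis.
day_in_week = np.arange(7).reshape(-1, 1)
periodic_spline_transformer(period=7, n_splines=7).fit_transform(day_in_week).shape
# (7, 7)
```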
@lorentzenchr Though my position is that we can leave the choice of encoder to users, CyclicEncoder has an advantage over [...].
No: [...]
Interesting! I too would like to see a use case where using trigonometric functions as basis functions yields a benefit over periodic splines. Here we can see that for degree 3 with four uniformly spaced knots the [...].

Given that we believe using trigonometric functions makes sense, what does this bring over using a `FunctionTransformer`?

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

X = np.array([[0, 12], [1, 3], [2, 6], [0, 5]])
transformer = ColumnTransformer(
    transformers=[
        ("week_sin", FunctionTransformer(lambda x: np.sin(x * 2 * np.pi / 7)), [0]),
        ("week_cos", FunctionTransformer(lambda x: np.cos(x * 2 * np.pi / 7)), [0]),
        ("month_sin", FunctionTransformer(lambda x: np.sin(x * 2 * np.pi / 12)), [1]),
        ("month_cos", FunctionTransformer(lambda x: np.cos(x * 2 * np.pi / 12)), [1]),
    ]
)
transformer.fit_transform(X)
```

```
array([[ 0.00000000e+00,  1.00000000e+00, -2.44929360e-16,  1.00000000e+00],
       [ 7.81831482e-01,  6.23489802e-01,  1.00000000e+00,  6.12323400e-17],
       [ 9.74927912e-01, -2.22520934e-01,  1.22464680e-16, -1.00000000e+00],
       [ 0.00000000e+00,  1.00000000e+00,  5.00000000e-01, -8.66025404e-01]])
```

This creates trigonometric features of the day of the week and the month of an input.

Concerning the categoricals: why are we limiting ourselves to (ordered) categorical features here and basing this off the OrdinalEncoder? We are treating the categoricals as numeric (using the codes of the categories) anyway. Also, using categoricals we implicitly assume that e.g. [...].
It would give discoverability of good practices. But maybe this can be achieved with an example that shows how to use the `FunctionTransformer` [...].

I am also not convinced we should restrict ourselves to categorical inputs. We should treat those ordinal inputs as numerical data and not inherit from `OrdinalEncoder`.

If the input feature has a datetime type, then maybe it would be helpful to provide a tool that does everything together, as I described above, since in this case the code would be slightly more complex.
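[Editor's note: as a rough illustration of such a datetime tool, here is a sketch that extracts the cyclic components of a datetime column and maps each onto the unit circle; the function name, the chosen components, and their periods are illustrative assumptions.]

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def datetime_to_cyclic(df):
    # Expand the first column (assumed datetime-like) into sin/cos
    # pairs for hour of day, day of week and month of year.
    dt = pd.to_datetime(df.iloc[:, 0])
    out = {}
    for name, values, period in [
        ("hour", dt.dt.hour, 24),
        ("dayofweek", dt.dt.dayofweek, 7),
        ("month", dt.dt.month - 1, 12),
    ]:
        out[f"{name}_sin"] = np.sin(values * 2 * np.pi / period)
        out[f"{name}_cos"] = np.cos(values * 2 * np.pi / period)
    return pd.DataFrame(out)

encoder = FunctionTransformer(datetime_to_cyclic)
X = pd.DataFrame({"timestamp": ["2021-06-13 08:30", "2021-06-14 22:00"]})
encoder.fit_transform(X)  # 6 cyclic feature columns
```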
I started playing with cyclic spline features for date encoding and it seems to work well enough: #20281. I want to also experiment with manual [...].
Thank you all for the comments and active discussion here!
The point on [...]. Your point above about the empirical number of knots needed to effectively use [...].

In any case, I think it would make sense to have a generic [...].
I tried the trigonometric encoding on the Bike Sharing Demand regression problem and it does not help simple linear models compared to raw ordinal features. See code in #20281. On the contrary, the periodic spline features work quite well, especially when followed by a polynomial kernel approximation to model interactions between features.

Also note that the best model is still gradient boosting with default hyper-parameters, without any preprocessing beyond declaring categorical features as such :)
Because there is no unique inverse transform for B-splines. Consider a spline of degree 0, i.e. piecewise constant (like a tree). You can't infer the original value if you know the constant. (This argument, however, might not hold for higher degrees. Even if possible, it would be a very complicated inverse transform.)
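[Editor's note: a small demonstration of the degree-0 argument, using `KBinsDiscretizer` as a stand-in for a piecewise-constant spline basis; the equivalence is the assumption here.]

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Degree-0 splines are piecewise constant, i.e. equivalent to one-hot
# binning: distinct inputs in the same bin get identical encodings,
# so the original values cannot be recovered.
X = np.array([[0.1], [0.2], [0.9]])
binner = KBinsDiscretizer(n_bins=2, encode="onehot-dense", strategy="uniform")
binner.fit_transform(X)
# array([[1., 0.],
#        [1., 0.],   # 0.1 and 0.2 are now indistinguishable
#        [0., 1.]])
```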
I'm not yet sure if I'm convinced. So far, I haven't seen a convincing use case nor literature/references. Popularity per se isn't a striking argument. But I'm open to being convinced otherwise.
Yes, I agree about the non-invertibility for small numbers of degrees and knots. However, in most practical cases the spline features are probably over-complete and bijective. Inverting them probably requires running some iterative optimization procedure. It could be both complex and slow to execute, but maybe it's worth it? For model interpretability one typically calls `inverse_transform`.
Easy invertibility is another one. Simplicity is another. The fact that it's data-independent (assuming you know the period) is also nice. I would argue that popularity is still a valid argument: it's a baseline to compare against, and one of the values of scikit-learn is about implementing popular, maintainable baselines.
BTW, I made more progress on my study in #20281 and wrote the narrative analysis. The main takeaway is that the cosine + sine encoding is not enough for this problem and over-complete splines lead to better performance. However, by adding 2 additional cosine / sine features with a twice higher frequency for the hour feature (not represented in the example for the sake of simplicity), I get similar performance as the periodic splines (at least for the linear regression model without polynomial kernel expansion). So an over-complete [...].
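[Editor's note: for the record, an over-complete trigonometric encoding along those lines could look like the following sketch, where the helper name and the number of harmonics are illustrative.]

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def trig_features(x, period=24, n_harmonics=2):
    # Base frequency plus higher harmonics: harmonic k completes k
    # cycles per period, e.g. k=2 doubles the frequency as described
    # above for the hour feature.
    cols = []
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(x * 2 * np.pi * k / period))
        cols.append(np.cos(x * 2 * np.pi * k / period))
    return np.hstack(cols)

hour = np.arange(24).reshape(-1, 1)
FunctionTransformer(trig_features).fit_transform(hour).shape  # (24, 4)
```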
FYI I updated #20281 to add one-hot features, which are a strong baseline for this dataset, and reordered the introduction and analysis of the models: I moved the discussion of the use of approximate kernels to the end to avoid back-and-forth references.
@ogrisel Thank you for your great documentation in #20281. I was away from this for a bit, but will work on [...]. After successfully developing [...].
I accidentally closed the PR... will open a new one |
@tsuga Your last force push closed this PR as it removed all commits. I think you can still work on this PR, as the discussion is already here. Just (force?) push commits here again and let's see if we can reopen it.
Reference Issues/PRs
n/a
What does this implement/fix? Explain your changes.
Recent machine learning projects often use the cosine/sine encoding of cyclical (often temporal) categorical variables:
https://www.google.com/search?q=cyclical+category+variables+sin+cos
This PR adds a CyclicalEncoder to the scikit-learn preprocessing module.
Plenty of documentation and tests are also included in the PR.
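[Editor's note: for readers unfamiliar with the technique, a minimal sketch of the core idea; the actual PR implementation, which builds on `OrdinalEncoder`, may differ in details.]

```python
import numpy as np

def cyclical_encode(codes, period):
    # Map ordinal category codes onto the unit circle so that the
    # first and last categories (e.g. December and January) end up
    # adjacent rather than maximally distant.
    angle = 2 * np.pi * np.asarray(codes) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

cyclical_encode([0, 1, 2, 11], period=12)  # shape (4, 2)
```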
Any other comments?