[WIP] New Feature - CyclicalEncorder (cosine/sine) in preprocessing #20259


Closed

Conversation

@tsuga (Contributor) commented Jun 14, 2021

Reference Issues/PRs

n/a

What does this implement/fix? Explain your changes.

Recent machine learning projects use cosine/sine encoding for cyclical (often temporal) categorical variables.
https://www.google.com/search?q=cyclical+category+variables+sin+cos

This PR adds CyclicalEncorder to the scikit-learn preprocessing module.

Documentation and tests are also included in the PR.

Any other comments?

@tsuga changed the title from "New Feature - CyclicalEncorder (cosine/sine) in preprocessing" to "[WIP] New Feature - CyclicalEncorder (cosine/sine) in preprocessing" on Jun 14, 2021
@ogrisel (Member) commented Jun 14, 2021

I think this would be a very nice addition to scikit-learn. Thanks for working on this.

This PR still needs a new paragraph in the user guide (in the preprocessing section). It would also be interesting to mention SplineTransformer (in particular with extrapolation="periodic") as an alternative in a "See also" section at the end of the docstring.

I think this PR also needs a practical usage example where this transformer is included in a column transformer & pipeline to preprocess the date feature on a regression problem with either weather/season-related cyclic effects (e.g. time progression in the current year) or business-related cyclic effects (e.g. time progression in the current week or month, time progression in the current day). This could be achieved by chaining a FunctionTransformer with a function that turns a datetime column of a pandas dataframe into at least 4 columns such as day in year, day in week, day in month, and time in day. Those features could then be smoothed using the CyclicEncoder to avoid jumps in feature values at the end of each period.

Such an example could use a dataset such as https://www.openml.org/d/1168 to try to predict the load or the price?
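A minimal sketch of the datetime expansion step described above, assuming a pandas dataframe with a hypothetical "date" column (the CyclicEncoder smoothing step proposed in this PR is omitted):

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def expand_datetime(df):
    # Expand one datetime column into several cyclic ordinal features.
    dt = pd.to_datetime(df["date"])
    return pd.DataFrame({
        "day_in_year": dt.dt.dayofyear,
        "day_in_week": dt.dt.dayofweek,
        "day_in_month": dt.dt.day,
        "time_in_day": dt.dt.hour,
    })

datetime_expander = FunctionTransformer(expand_datetime)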

@ogrisel (Member) commented Jun 14, 2021

Also, would it be possible to accept non-discretized continuous inputs with a known periodicity?

@ogrisel (Member) commented Jun 14, 2021

Alternatively, one could use the Bike Sharing Demand forecasting dataset.

from sklearn.datasets import fetch_openml

X, y = fetch_openml("Bike_Sharing_Demand", as_frame=True, version=2, return_X_y=True)

The datetime info is already expanded.

@tsuga (Contributor, Author) commented Jun 14, 2021

@ogrisel Thank you for your feedback and inputs.

Also, would it be possible to accept non-discretized continuous inputs with a known periodicity?

Currently it won't be possible to accept continuous inputs, as the current implementation uses OrdinalEncoder under the hood. Continuous inputs could be handled by something like a CyclicalScaler, but that is out of scope for this PR for now.

I will work on the rest of your comments later.

@ogrisel (Member) commented Jun 14, 2021

Thinking out loud: maybe we could even have a CyclicalDateTimeEncoder with different cycles to be configured, e.g. cycles=["day", "week", "month", "year"]. This encoder would only accept numpy arrays or pandas dataframes with date or datetime dtypes.

There could be 2 strategies for encoding: either cos_sin with 2 features, or spline-based features with a controllable number of knots (and periodic extrapolation properly configured for the matching type of cycle).

WDYT @tsuga, @lorentzenchr and @mlondschien?

The bi-feature cos-sin encoding might be enough to get a smooth cyclic representation, but it's not necessarily optimal for a downstream linear model. I have the feeling that the spline strategy can be more expressive by choosing more (evenly spaced) knots. For instance, a day-in-week feature with 7 or 14 knots should represent weekends more effectively (possibly at the cost of overfitting) while keeping a well-conditioned design matrix.

As an alternative to many-knot spline-based periodic features, it would be possible to use a cosine encoding with more phased features of the same period. However, I have the feeling that the conditioning of the resulting optimization problem would be less favorable. Furthermore, the locality of the spline features is more likely to yield interpretable linear model coefficients.
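For concreteness, a minimal sketch of the bi-feature cos_sin strategy mentioned above (the helper name is illustrative, plain NumPy):

import numpy as np

def cos_sin_encode(x, period):
    # Map a value of known periodicity onto the unit circle: exactly 2 features.
    angle = 2 * np.pi * x / period
    return np.hstack([np.cos(angle), np.sin(angle)])

day_in_week = np.arange(7).reshape(-1, 1)
cos_sin_encode(day_in_week, period=7).shape  # (7, 2)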

@lorentzenchr (Member) commented

Interesting PR! I'd be keen to see an actual use case for this encoder/transformer. At first glance, and in line with what @ogrisel already said above, I think there's no advantage for tree-based estimators, and linear models are maybe better served by a SplineTransformer.

@tsuga (Contributor, Author) commented Jun 15, 2021

@lorentzenchr
The pitfall for tree-based estimators is documented in the docs (Note section).

Though my position is that we can leave the choice of encoder to users, CyclicEncoder has an advantage over SplineTransformer: for one (original) feature, CyclicEncoder creates only two columns regardless of the number of categories. On the other hand, SplineTransformer creates n_categories columns, which mathematically puts more weight on the variable (please correct me if I'm wrong). For example, for a 'day of week' feature, we put 7x weight on it with SplineTransformer, but as little as 2x with CyclicEncoder.
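To illustrate the two-column half of this claim, a sketch (illustrative names, not code from the PR) showing that the encoding also removes the jump between the last and first category:

import numpy as np

days = np.arange(7).reshape(-1, 1)  # Mon=0 ... Sun=6
enc = np.hstack([np.sin(2 * np.pi * days / 7), np.cos(2 * np.pi * days / 7)])
print(enc.shape)  # (7, 2): two columns regardless of the number of categories
# Consecutive days are equidistant, including Sun -> Mon across the period boundary:
print(np.linalg.norm(enc[1:] - enc[:-1], axis=1))  # all equal
print(np.linalg.norm(enc[0] - enc[-1]))            # same distance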

@ogrisel (Member) commented Jun 15, 2021

On the other hand, SplineTransformer creates n_categories columns, which mathematically puts more weight on the variable (please correct me if I'm wrong).

No: SplineTransformer would be used to transform a single (continuous or ordinal) time feature into a set of derived features of controllable size (depending on the degree and number of knots) to represent, for instance, periodic progress in the current year. Same for progress in the current month, the current week and so on. The number of output features does not depend on the cardinality of the input feature. In fact, the input feature can be continuous, with times encoded up to the nanosecond instead of 365 days in a year, for instance.
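For illustration, a minimal sketch of this with the actual SplineTransformer API (the parameter values here are arbitrary):

import numpy as np
from sklearn.preprocessing import SplineTransformer

# A single continuous "hour of day" feature; could be sub-second resolution.
hour = np.linspace(0, 24, num=49).reshape(-1, 1)

spline = SplineTransformer(degree=3, n_knots=12, extrapolation="periodic")
# Output width is set by n_knots/degree, not by the feature's cardinality:
print(spline.fit_transform(hour).shape)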

@mlondschien (Contributor) commented

Interesting! I too would like to see a use case where using trigonometric functions as basis functions yields a benefit over periodic splines. Here we can see that for degree 3 with four uniformly spaced knots, the SplineTransformer creates basis functions that, at least visually, are very similar to sin(x), sin(x + 2pi/3), sin(x + 4pi/3).

To me, the restriction to using only sin(x) and cos(x) = sin(x + pi/2) seems a bit arbitrary.

Given that we believe using trigonometric functions makes sense, what does this bring over using a ColumnTransformer + FunctionTransformer? E.g.

import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

# Column 0: day of week (period 7); column 1: month (period 12).
X = np.array([[0, 12], [1, 3], [2, 6], [0, 5]])

transformer = ColumnTransformer(
    transformers=[
        ("week_sin", FunctionTransformer(lambda x: np.sin(x * 2 * np.pi / 7)), [0]),
        ("week_cos", FunctionTransformer(lambda x: np.cos(x * 2 * np.pi / 7)), [0]),
        ("month_sin", FunctionTransformer(lambda x: np.sin(x * 2 * np.pi / 12)), [1]),
        ("month_cos", FunctionTransformer(lambda x: np.cos(x * 2 * np.pi / 12)), [1]),
    ]
)

transformer.fit_transform(X)

array([[ 0.00000000e+00,  1.00000000e+00, -2.44929360e-16,
         1.00000000e+00],
       [ 7.81831482e-01,  6.23489802e-01,  1.00000000e+00,
         6.12323400e-17],
       [ 9.74927912e-01, -2.22520934e-01,  1.22464680e-16,
        -1.00000000e+00],
       [ 0.00000000e+00,  1.00000000e+00,  5.00000000e-01,
        -8.66025404e-01]])

creates trigonometric features of the day of the week and month of an input.

Concerning the categoricals: why are we limiting ourselves to (ordered) categorical features here and basing this on OrdinalEncoder? We are treating the categoricals as numeric (using the codes of the categories) anyway. Also, using categoricals we implicitly assume that e.g. Morning, Afternoon, Evening and Night are evenly spaced. The transformer above works for numeric columns.

@ogrisel (Member) commented Jun 16, 2021

Given that we believe using trigonometric functions makes sense, what does this bring over using a ColumnTransformer + FunctionTransformer?

It would give discoverability of good practices. But maybe this can be achieved with an example that shows how to use the FunctionTransformer for this use case instead.

I am also not convinced we should restrict ourselves to categorical inputs. We should treat those ordinal inputs as numerical data and not inherit from OrdinalEncoder. If the original data is encoded with string labels (e.g. ["Mon", "Tue", "Wed", ...]), then the user can always use a pipeline with OrdinalEncoder as the first step.

If the input feature has a datetime type, then maybe it would be helpful to provide a tool that does everything together, as I described above, since in this case the code would be slightly more complex.
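A sketch of that pipeline idea, assuming string-labeled weekdays (the lambda encoding is illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

weekdays = [["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]]
pipe = make_pipeline(
    OrdinalEncoder(categories=weekdays),  # "Mon" -> 0.0, ..., "Sun" -> 6.0
    FunctionTransformer(
        lambda x: np.hstack([np.sin(2 * np.pi * x / 7), np.cos(2 * np.pi * x / 7)])
    ),
)
pipe.fit_transform(np.array([["Mon"], ["Sat"], ["Sun"]]))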

@ogrisel (Member) commented Jun 16, 2021

I started playing with cyclic spline features for date encoding and it seems to work well enough: #20281. I also want to experiment with a manual FunctionTransformer with sin/cos bi-feature encoding.

@tsuga (Contributor, Author) commented Jun 17, 2021

On the other hand, SplineTransformer creates n_categories columns, which mathematically puts more weight on the variable (please correct me if I'm wrong).

No: SplineTransformer would be used to transform a single (continuous or ordinal) time feature into a set of derived features of controllable size (depending on the degree and number of knots) to represent, for instance, periodic progress in the current year. Same for progress in the current month, the current week and so on. The number of output features does not depend on the cardinality of the input feature. In fact, the input feature can be continuous, with times encoded up to the nanosecond instead of 365 days in a year, for instance.

Gotcha! Now I understand that.

However, my point remains: sin/cos encoding generates just 2 columns in the output.
On the other hand, SplineTransformer may need more. For example, a 7-category feature (typically Mon-Sun) requires, not theoretically but practically, n_knots=4 or more, which generates at least 3 columns. If n_knots=3, under 0-based numbering, cat 1 and cat 3 get very similar encodings, and so do cat 5 and cat 6 (too little expressivity!). So n_knots=3 is too small for a 7-category feature. (Again, correct me if I'm wrong.)
[figure: spline basis values for a 7-category feature with n_knots=3]

This discussion is raising more issues and aspects. Give me some time to summarize the points (for the sake of my own understanding...).

@tsuga (Contributor, Author) commented Jun 17, 2021

Thank you all for the comments and active discussion here!
I'm summarizing the points of discussion so far. Please let me know if anything is missing.

  1. What is the appropriate mathematical backend?

    1. B-spline
    2. Cos/Sin
      1. Advantages:
        1. has only 2 output columns (B-spline may need more to retain expressivity)
        2. can inverse_transform
        3. is straightforward (for some people)
      2. Disadvantages:
        1. makes less sense for tree-based estimators (though I have seen people use sin/cos encoding for NN-based estimators and others on Kaggle and elsewhere)
  2. How should we name this transformer?

    1. As cos/sin is not the only way to handle cyclical features, it would be fairer to name this cos/sin encoder CosSinEncoder or CosSinTransformer. We may use Trigonometric instead of CosSin for formality.
    2. Question: What is the difference between encoder, scaler, and transformer in the language used in scikit-learn? I feel a difference in nuance but don't have a clear separation.
  3. How can we support not only discretized categorical variables but also (non-discretized) continuous variables and datetime variables?

    1. I'm happy to abandon the current implementation (inheriting from OrdinalEncoder) to accommodate (non-discretized) continuous variables.
    2. We may create another transformer specialized for datetime variables, like:
      enc = CosSinDatetimeTransformer(freq=["day_of_week", "day_of_month"])
      
  4. Do we need a dedicated class? Or is ColumnTransformer + FunctionTransformer enough?

    1. I agree with @ogrisel on discoverability of good practices.
    2. In addition, cos/sin encoding is invertible. Though FunctionTransformer can take an inverse_func, it is cleaner to put both directions in one dedicated class (see the sketch after this list).
  5. How shall we enrich the docs?

    • TBD - Depends on the discussion above.
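A minimal sketch of the invertibility point in item 4.2 (the helper names are illustrative): np.arctan2 recovers the phase from the sin/cos pair, so the encoding round-trips exactly:

import numpy as np

def encode(x, period=7):
    return np.hstack([np.sin(2 * np.pi * x / period), np.cos(2 * np.pi * x / period)])

def decode(sc, period=7):
    # arctan2(sin, cos) recovers the angle; the modulo maps it back to [0, period)
    return (np.arctan2(sc[:, :1], sc[:, 1:]) * period / (2 * np.pi)) % period

x = np.arange(7, dtype=float).reshape(-1, 1)
np.allclose(decode(encode(x)), x)  # True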

@ogrisel (Member) commented Jun 17, 2021

Question: What is the difference between encoder, scaler, and transformer in the language used in scikit-learn? I feel some difference in nuances but don't have a clear separation.

  • transformer: anything with a .transform method. Encoders and scalers are particular cases.
  • scaler: preprocessing transformers for numerical features that scale (and often also shift) the features one by one (standard scaler, minmax scaler, robust scaler...). There exist other marginal feature transformers that are not simple scalers (e.g. QuantileTransformer, PowerTransformer that also output 1 feature for 1 input feature but do a more complex transformation that cannot be reduced to shift and scale) and others that expand one input feature into several (KBinsDiscretizer, SplineTransformer).
  • encoder: typically used to describe transformers for discrete inputs (categorical input features or target labels)

@ogrisel (Member) commented Jun 17, 2021

The point on inverse_transform is interesting. This is something we overlooked when we reviewed the SplineTransformer PRs. I wonder how hard it would be to implement it.

Your point above about the empirical number of knots needed to effectively use SplineTransformer is also interesting. However, I am not sure in which situations using only 2 cos/sin features would be competitive:

  • for tree-based models, I have the impression that the raw ordinal representation is enough, as they can deal with monotonicity (at least when n_samples is large)
  • for linear models, I think that the cos/sin expansion is not expressive enough while the spline transformer makes it possible for linear models to make meaningful and smooth local predictions
  • for kernel-based models, I am not so sure.

In any case I think it would make sense to have a generic CosSinTransformer in scikit-learn given the popularity of this approach. It's still an interesting baseline to compare periodic splines against, even if it's just for the sake of education.

@ogrisel (Member) commented Jun 17, 2021

I tried the trigonometric encoding on the Bike Sharing Demand regression problem and it does not help a simple linear model compared to raw ordinal features. See the code in #20281.

On the contrary, the periodic spline features work quite well, especially when followed by a polynomial kernel approximation to model interactions between features:

https://141562-843222-gh.circle-artifacts.com/0/doc/auto_examples/applications/plot_cyclical_feature_engineering.html#trigonometric-features

Also note that the best model is still gradient boosting with default hyper-parameters without any preprocessing beyond declaring categorical features as categorical features :)

@lorentzenchr (Member) commented

The point on inverse_transform is interesting. This is something we overlooked when we reviewed the SplineTransformer PRs. I wonder how hard it would be to implement it.

Because there is no unique inverse transform for B-splines. Consider a spline of degree 0, i.e. piecewise constant (like a tree): you can't infer the original value if you only know the constant. (This argument, however, might not hold for higher degrees. Even if possible, it would be a very complicated inverse transform.)
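A tiny sketch of the degree-0 case described above (the values are illustrative):

import numpy as np
from sklearn.preprocessing import SplineTransformer

# Degree-0 splines are piecewise constant, hence not injective:
st = SplineTransformer(degree=0, n_knots=3)
st.fit(np.linspace(0, 1, 10).reshape(-1, 1))

x = np.array([[0.1], [0.4]])  # two distinct inputs in the same knot interval
print(st.transform(x))        # identical rows: the originals cannot be recovered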

@lorentzenchr (Member) commented

In any case I think it would make sense to have a generic CosSinTransformer in scikit-learn given the popularity of this approach. It's still an interesting baseline to compare periodic splines against, even if it's just for the sake of education.

I'm not yet sure if I'm convinced. So far, I haven't seen a convincing use case nor literature/references. Popularity per se isn't a striking argument.

But I'm open to be convinced otherwise.

@ogrisel (Member) commented Jun 17, 2021

Because there is no unique inverse transform for B-splines. Consider a spline of degree 0, i.e. piecewise constant (like a tree): you can't infer the original value if you only know the constant. (This argument, however, might not hold for higher degrees. Even if possible, it would be a very complicated inverse transform.)

Yes, I agree about the non-invertibility for small numbers of degrees and knots. However, in most practical cases the spline features are probably over-complete and injective. Inverting them probably requires running some iterative optimization procedure. It could be both complex and slow to execute, but maybe it's worth it? For model interpretability one typically calls inverse_transform only on a few samples for visual inspection / debugging, so computational performance is probably not so much of a problem.

@ogrisel (Member) commented Jun 17, 2021

I'm not yet sure if I'm convinced. So far, I haven't seen a convincing use case nor literature/references. Popularity per se isn't a striking argument.

Easy invertibility is one argument. Simplicity is another. The fact that it's data-independent (assuming you know the period) is also nice.

I would argue that popularity is still a valid argument: it's a baseline to compare against, and one of the values of scikit-learn is implementing popular, maintainable baselines.

@ogrisel (Member) commented Jun 17, 2021

BTW, I made more progress on my study in #20281 and wrote the narrative analysis. The CircleCI workers were overloaded with pending builds, so the rendered HTML took a while to appear. Here it is:

https://141704-843222-gh.circle-artifacts.com/0/doc/auto_examples/applications/plot_cyclical_feature_engineering.html

The main takeaway is that the cosine + sine encoding is not enough for this problem, and over-complete splines lead to better performance. However, by adding 2 additional cosine/sine features with twice the frequency for the hour feature (not represented in the example for the sake of simplicity), I get similar performance to the periodic splines (at least for the linear regression model without polynomial kernel expansion). So an over-complete CosSinTransformer would still be a competitive yet simple and invertible feature engineering strategy for this specific example.
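A sketch of that over-complete variant (the fourier_features helper is hypothetical; the k=2 pair is the "twice the frequency" addition):

import numpy as np

def fourier_features(x, period, n_harmonics=2):
    # k=1 is the plain sin/cos pair; k>=2 adds pairs at multiples of the base frequency.
    cols = []
    for k in range(1, n_harmonics + 1):
        angle = 2 * np.pi * k * x / period
        cols += [np.sin(angle), np.cos(angle)]
    return np.hstack(cols)

hours = np.arange(24).reshape(-1, 1)
fourier_features(hours, period=24, n_harmonics=2).shape  # (24, 4)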

@ogrisel (Member) commented Jun 21, 2021

FYI I updated #20281 to add one-hot features, which are a strong baseline for this dataset, and reordered the introduction and analysis of the models: I moved the discussion of approximate kernels to the end to avoid back-and-forth references.

https://141935-843222-gh.circle-artifacts.com/0/doc/auto_examples/applications/plot_cyclical_feature_engineering.html

@tsuga (Contributor, Author) commented Jun 29, 2021

@ogrisel Thank you for your great documentation in #20281

I was away from this for a bit, but I will work on CosSinTransformer this week.
I hope to wrap up this PR with CosSinTransformer.

After successfully developing TrigonometricTransformer, I'm thinking of adding a DateTimeTransformer backed by CosSinTransformer or SplineTransformer.

@tsuga closed this on Jun 29, 2021
@tsuga (Contributor, Author) commented Jun 29, 2021

I accidentally closed the PR... will open a new one

@lorentzenchr (Member) commented

@tsuga Your last force push closed this PR because it removed all commits. I think you can still work on this PR, as the discussion is already here. Just (force?) push your commits here again and let's see if we can reopen it.
