Support orthogonal polynomial features (via QR decomposition) in PolynomialFeatures #31223


Closed
cottnich opened this issue Apr 18, 2025 · 7 comments
Labels: Needs Decision - Include Feature · New Feature


@cottnich

Describe the workflow you want to enable

I want to introduce support for orthogonal polynomial features via QR decomposition in PolynomialFeatures, closely mirroring the behavior of R's poly() function.

In regression modeling, using orthogonal polynomials often improves numerical stability and reduces multicollinearity among the polynomial terms.

As an example of what the difference looks like in R:

# fits a raw degree-3 polynomial without an orthogonal basis
model_raw <- lm(y ~ I(x) + I(x^2) + I(x^3), data = data)
# equivalently: model_raw <- lm(y ~ poly(x, 3, raw = TRUE), data = data)

# fits the same degree-3 polynomial using an orthogonal basis
model_poly <- lm(y ~ poly(x, 3), data = data)

This behavior cannot currently be replicated with scikit-learn's PolynomialFeatures, which only produces the raw monomial terms. As a result, transitioning from R to Python often leads to discrepancies in model behavior and performance.

Describe your proposed solution

I propose extending PolynomialFeatures with a new parameter:

PolynomialFeatures(..., method="raw")

Accepted values:

  • "raw" (default): retains existing behavior, returning standard raw terms
  • "qr": applies QR decomposition to each feature to generate orthogonal polynomial features.

Because R's poly() only operates on 1D input vectors, my thought was to apply QR decomposition feature by feature when the input is multi-dimensional. Each column is processed independently, mirroring R's approach.
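
For concreteness, here is a rough 1D sketch of the transform I have in mind, mirroring R's poly() (the ortho_poly helper is illustrative only, not the drafted implementation):

import numpy as np

def ortho_poly(x, degree):
    # Raw basis [1, x, x^2, ..., x^degree] on the centered feature.
    x = np.asarray(x, dtype=float)
    X = np.vander(x - x.mean(), degree + 1, increasing=True)
    # Thin QR: Q has orthonormal columns spanning the same space as X.
    Q, R = np.linalg.qr(X)
    # Flip signs where R's diagonal is negative (matching R's convention),
    # then drop the constant first column.
    Q = Q * np.sign(np.diag(R))
    return Q[:, 1:]

# Feature-wise application to multi-dimensional input, as proposed:
X = np.random.default_rng(0).uniform(size=(100, 2))
X_orth = np.hstack([ortho_poly(X[:, j], 3) for j in range(X.shape[1])])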

This feature would interact with other parameters as follows:

  • include_bias: When method="qr", the orthogonal polynomial basis inherently includes a transformed first column, but this column is not a plain column of ones, so include_bias=True (which appends a column of ones) becomes redundant or misleading. One option is to force include_bias=False when method="qr" and always return only the orthogonal columns; another is to raise a warning.

  • interaction_only: This would be incompatible with method="qr" since the QR-based transformation does not naturally support selective inclusion of interaction terms.

Describe alternatives you've considered, if relevant

Currently, users must implement the QR decomposition manually when orthogonal polynomials are needed. This is a common pattern in statistical workflows but lacks "off the shelf" support in any major Python library. This feature would eliminate the need to do the decomposition manually and would improve workflows for researchers who are used to R's statistical tools.

Additional context

This idea stemmed from a broader effort to convert statistical modeling pipelines from R to Python, where discrepancies in regression results were traced to the lack of orthogonal polynomial support in PolynomialFeatures.

I have drafted and tested a 1D implementation of this feature but wanted feedback on whether this idea aligns with scikit-learn's scope before moving on. In particular, I'd appreciate input on:

  • Acceptability of feature-wise orthogonalization for multi-feature input.
  • Preferred parameter naming (e.g., method="qr" vs. orthogonal=True).
  • Compatibility decisions around parameters like include_bias and interaction_only.
@cottnich cottnich added Needs Triage Issue requires triage New Feature labels Apr 18, 2025
@ogrisel
Member

ogrisel commented Apr 25, 2025

Thanks for your proposal.

Because R's poly() only operates on 1D input vectors, my thought was to apply QR decomposition feature by feature when the input is multi-dimensional. Each column is processed independently, mirroring R's approach.

That sounds quite different from the current behavior of PolynomialFeatures, which always considers interactions between input features. I wonder if implementing the two approaches in the same class makes sense. Maybe we could implement this as a new transformer, but before doing so, I would rather make sure that people actually need this feature, since it would mean maintaining a new estimator and adding to the cognitive overhead of choosing from too many estimators and options.

Could you please publish prototype code to a gist.github.com or another small repo or notebook? It would be great if you could highlight a simple regression task for which this kind of feature engineering is a game changer (in terms of predictive performance, model size, fitting speed, or numerical stability of the fit) compared to what is already available in scikit-learn.

It would be great to compare to:

  • SplineTransformer()
  • PolynomialFeatures()
  • make_pipeline(SplineTransformer(), PolynomialFeatures(interaction_only=True))

and all of the above followed by a PCA step.
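
Something like the following skeleton could serve as a starting point (the toy target and the downstream Ridge estimator are placeholders):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + rng.normal(scale=0.1, size=500)

candidates = {
    "splines": make_pipeline(SplineTransformer(), Ridge()),
    "polynomials": make_pipeline(PolynomialFeatures(degree=3), Ridge()),
    "splines + interactions": make_pipeline(
        SplineTransformer(),
        PolynomialFeatures(degree=2, interaction_only=True),
        Ridge(),
    ),
}
# ...and each of the above with a PCA step before the estimator, e.g.
# make_pipeline(SplineTransformer(), PCA(n_components=0.99), Ridge()).
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")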

BTW: maybe @lorentzenchr would like to share his views on such a new feature in scikit-learn.

@ogrisel
Member

ogrisel commented Apr 25, 2025

I also noticed that formulaic provides an implementation of the poly function that seems to follow the R-style orthogonalization convention by default.

Note that it is possible to wrap formulaic as a scikit-learn transformer, but it requires copy-pasting some boilerplate code.

Rather than replicating such features in scikit-learn, it might be more fruitful to see if the formulaic authors would be open to the idea of adding first-class integration with the scikit-learn API/pipelines directly into formulaic, or via a new extension package. If so, we could extend one of scikit-learn's examples on polynomial / spline features to show how to use formulaic with scikit-learn and go beyond the options implemented by default in scikit-learn.
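
If I read formulaic's documentation correctly, usage would look roughly like this (untested sketch):

import pandas as pd
from formulaic import model_matrix

df = pd.DataFrame({"x": [0.1, 0.5, 0.9, 1.3, 1.7]})

# Degree-3 orthogonal polynomial basis; "- 1" drops the intercept column.
train_basis = model_matrix("poly(x, 3) - 1", df)

# The fitted spec remembers the orthogonalization, so new data is projected
# onto the training basis instead of being re-orthogonalized from scratch.
new_df = pd.DataFrame({"x": [0.2, 1.0]})
new_basis = train_basis.model_spec.get_model_matrix(new_df)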

@ogrisel ogrisel added Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Triage Issue requires triage labels Apr 25, 2025
@lorentzenchr
Member

@cottnich What is your motivation?

As a result transitioning from R to Python often leads to discrepancies in model behavior and performance.

Without penalty, the predicted values are the same for all design matrices related to orthogonal transformations.
B-Splines are almost always preferred to pure polynomials of features.
If you want to stay as close as possible to R's lm and glm, I recommend using https://www.statsmodels.org .
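
A quick numerical check of the first point, assuming plain unpenalized least squares:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=200)

# Raw monomials vs. an orthonormal basis of the same column space.
X_raw = np.vander(x, 4, increasing=True)[:, 1:]
Q, _ = np.linalg.qr(np.vander(x - x.mean(), 4, increasing=True))
X_orth = Q[:, 1:]

pred_raw = LinearRegression().fit(X_raw, y).predict(X_raw)
pred_orth = LinearRegression().fit(X_orth, y).predict(X_orth)
np.testing.assert_allclose(pred_raw, pred_orth)  # same fitted values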

@alexshtf

There are existing orthogonal polynomial bases, such as the Legendre basis, which are already orthogonal and don't need any transformation at all. We can also support interactions using tensor-product bases.
Do we really need the heavy computational overhead of "orthogonalizing" the features at fit time?
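
For example, with numpy's built-ins (sketch; the Legendre basis is orthogonal with respect to the uniform measure on [-1, 1], so the sampled columns are only approximately orthogonal, and the feature must be rescaled first):

import numpy as np

x = np.linspace(0.0, 10.0, 100)
# Rescale the feature to [-1, 1], where the Legendre basis lives.
x_scaled = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
basis = np.polynomial.legendre.legvander(x_scaled, deg=3)  # columns P_0..P_3

# Tensor-product interactions between two features:
y_scaled = x_scaled[::-1]  # placeholder second feature
basis_2d = np.polynomial.legendre.legvander2d(x_scaled, y_scaled, [3, 3])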

@lorentzenchr
Member

Again, what is your motivation? Why do you want orthogonal polynomials?

@alexshtf

alexshtf commented May 21, 2025

I don't. The OP does. But I can guess why.

You can fit high-degree polynomials without the well-known "bad effects" of overfitting. Moreover, you can easily prune models by simply removing the tail of the coefficients and keeping a low-degree polynomial. This works because the basis behaves like a spectrum, where higher-degree polynomials act like frequency components, so this pruning is just denoising.
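
Roughly (sketch with a Legendre expansion; the degrees are picked arbitrarily):

import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=300)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=300)

# Fit a deliberately high-degree Legendre expansion...
coef = np.polynomial.legendre.legfit(x, y, deg=20)
# ...then prune by zeroing the high-frequency tail of the spectrum.
pruned = coef.copy()
pruned[6:] = 0.0

y_full = np.polynomial.legendre.legval(x, coef)
y_pruned = np.polynomial.legendre.legval(x, pruned)  # denoised low-degree fit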

What I don't understand is why this belongs in scikit-learn rather than the OP's own research repository, where they do their research on polynomials. This library is easily extensible: just inherit from the right transformer base class and you can put your new super-duper polynomial basis into your pipeline.

@cottnich
Author

Thank you for all the responses, and my apologies for not checking on this thread sooner. I actually wasn't aware that the formulaic library handles this; I assumed it was like statsmodels.formula, which doesn't do any additional computation.
