Support orthogonal polynomial features (via QR decomposition) in PolynomialFeatures #31223
Comments
Thanks for your proposal.
That sounds quite different from the current behavior of `PolynomialFeatures`. Could you please publish prototype code to a gist.github.com or another small repo or notebook? It would be great if you could highlight a simple regression task for which this kind of feature engineering is a game changer (either in terms of predictive performance, model size, fitting speed, or numerical stability of the fit) compared to what is already available in scikit-learn, with each alternative followed by a linear model.

BTW: maybe @lorentzenchr would like to share his views on such a new feature in scikit-learn.
I also noticed that the formulaic library already implements an R-style `poly()` transform. Note that it is possible to wrap formulaic in a scikit-learn compatible transformer. Rather than replicating such features in scikit-learn, it might be more fruitful to see if formulaic already covers this use case.
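For reference, a minimal sketch of that route. This is hedged: I am assuming formulaic's `poly()` transform and `model_matrix` API below, so please verify against the formulaic documentation.

```python
# Sketch assuming formulaic's R-style poly() transform; verify against the
# formulaic docs before relying on this.
import pandas as pd
from formulaic import model_matrix

df = pd.DataFrame({"x": [0.1, 0.4, 0.7, 1.0, 1.3]})

# "poly(x, 3)" should yield an intercept plus 3 orthogonal polynomial columns.
X = model_matrix("poly(x, 3)", df)
print(X)
```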
@cottnich What is your motivation?
Without penalty, the predicted values are the same for all design matrices related by orthogonal transformations.
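A quick numerical check of this point (my own illustration, not from the thread): rotating the design matrix by an orthogonal matrix `Q` leaves unpenalized OLS predictions unchanged, because the column space is unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

pred_raw = LinearRegression().fit(X, y).predict(X)
pred_rot = LinearRegression().fit(X @ Q, y).predict(X @ Q)

print(np.allclose(pred_raw, pred_rot))  # True: identical predictions
```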
There are classical orthogonal polynomial bases, such as the Legendre basis, which are orthogonal by construction and don't need any data-dependent transformation at all. We can also support interactions using tensor-product bases.
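For context, NumPy already exposes such a fixed basis. A small example (note that these polynomials are orthogonal with respect to the continuous inner product on [-1, 1], not the empirical inner product of an arbitrary sample):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 100)

# Evaluate Legendre polynomials P_0..P_3 at the sample points.
basis = np.polynomial.legendre.legvander(x, 3)  # shape (100, 4)
```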
Again, what is your motivation? Why do you want orthogonal polynomials?
I don't. The OP does. But I can guess why. You can fit high-degree polynomials without the well-known "bad effects" of overfitting. Moreover, you can easily prune models by simply removing the tail of the coefficients, staying with low-degree polynomials. This is because they form a sort of spectrum, where higher-degree polynomials are like frequency components, so this pruning is just denoising. What I don't understand is why it belongs in scikit-learn, and not in their own research repository, where the OP does their research on polynomials. This library is easily extensible: just inherit the right transformer base class, and you can put your new super-duper polynomial basis into your pipeline.
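To make the "just inherit the right base class" point concrete, here is a minimal sketch of such a transformer (my own names and simplifications, not the OP's draft; a real implementation would store the training-time basis so that `transform` projects new data consistently, as R's `poly()` does via its `coefs` attribute):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OrthogonalPolynomialFeatures(BaseEstimator, TransformerMixin):
    """Per-column R-style orthogonal polynomial features via QR (sketch)."""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        blocks = []
        for j in range(X.shape[1]):
            # Centered Vandermonde matrix [1, x, x^2, ...] for this column.
            xc = X[:, j] - X[:, j].mean()
            V = np.vander(xc, N=self.degree + 1, increasing=True)
            # Q has orthonormal columns; drop the (scaled) constant column.
            Q, _ = np.linalg.qr(V)
            blocks.append(Q[:, 1:])
        return np.hstack(blocks)
```

It then drops into a pipeline like any other transformer, e.g. `make_pipeline(OrthogonalPolynomialFeatures(3), LinearRegression())`.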
Thank you for all the responses, and my apologies for not checking on this thread sooner. I actually wasn't aware that the formulaic library handles this. I assumed it was like statsmodels.formula, which doesn't do any additional computations.
Describe the workflow you want to enable
I want to introduce support for orthogonal polynomial features via QR decomposition in `PolynomialFeatures`, closely mirroring the behavior of R's `poly()` function. In regression modeling, using orthogonal polynomials often leads to improved numerical stability and reduced multicollinearity among polynomial terms.
As an example of what the difference looks like in R, `poly(x, 2)` returns centered, orthonormal columns, while `poly(x, 2, raw = TRUE)` returns the plain monomials x and x^2.
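A rough NumPy equivalent of that comparison (my own illustration): the orthogonalized columns are perfectly conditioned, while the raw monomials are strongly collinear.

```python
import numpy as np

x = np.linspace(1.0, 10.0, 50)

# Raw monomial features x, x^2, x^3, as PolynomialFeatures produces today.
raw = np.vander(x, N=4, increasing=True)[:, 1:]

# R-style orthogonal features: QR of the centered Vandermonde matrix.
Q, _ = np.linalg.qr(np.vander(x - x.mean(), N=4, increasing=True))
orth = Q[:, 1:]

print(np.linalg.cond(raw))   # large: the monomials are nearly collinear
print(np.linalg.cond(orth))  # 1.0: orthonormal columns
```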
This behavior cannot currently be replicated with scikit-learn's `PolynomialFeatures`, which only produces the raw monomial terms. As a result, transitioning from R to Python often leads to discrepancies in model behavior and performance.

Describe your proposed solution
I propose extending `PolynomialFeatures` with a new parameter, `method`. Accepted values:

- `"raw"` (default): retains the existing behavior, returning the standard raw terms.
- `"qr"`: applies QR decomposition to each feature to generate orthogonal polynomial features.

Because R's `poly()` only operates on 1D input vectors, my thought was to apply QR decomposition feature by feature when the input is multi-dimensional. Each column is processed independently, mirroring R's approach.

This feature would interact with other parameters as follows:
- `include_bias`: When `method="qr"`, the orthogonal polynomial basis inherently includes a transformed first column. However, this column is not a plain column of ones (see the sketch after this list), so `include_bias=True` (which appends a column of ones) becomes redundant or misleading in this context. One option is to always set `include_bias=False` when `method="qr"` and return only the orthogonal columns, or to raise a warning.
- `interaction_only`: This would be incompatible with `method="qr"`, since the QR-based transformation does not naturally support selective inclusion of interaction terms.
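To illustrate the `include_bias` point, a small check (my own sketch, not the drafted implementation): the first column of Q is the constant column rescaled to unit norm, not a plain column of ones.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
V = np.vander(x - x.mean(), N=3, increasing=True)  # columns [1, x, x^2], centered
Q, _ = np.linalg.qr(V)

# Every entry is +/- 1/sqrt(n) (the sign depends on the LAPACK convention),
# i.e. a rescaled constant column rather than a column of ones.
print(Q[:, 0])
```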
Describe alternatives you've considered, if relevant

Currently, users must implement the QR decomposition manually when orthogonal polynomials are needed. This is a common pattern in statistical workflows but lacks off-the-shelf support in any major Python library. This feature would eliminate the need to do the decomposition manually and would improve workflows for researchers who are used to R's statistical tools.
Additional context
This idea stemmed from a broader effort to convert statistical modeling pipelines from R to Python, where discrepancies in regression results were traced to the lack of orthogonal polynomial support in `PolynomialFeatures`.

I have drafted and tested a 1D implementation of this feature but wanted feedback on whether this idea aligns with scikit-learn's scope before moving on. In particular, I'd appreciate input on:

- the parameter design (e.g., `method="qr"` vs. `orthogonal=True`);
- the intended interaction with `include_bias` and `interaction_only`.