linear regression predictions unstable #7378
Comments
What happens if you use numpy arrays?
what's in a, whats in |
That's a good catch - if I do the same with numpy arrays I get consistent results (a and b are both test inputs that go into model.predict):
Does that mean the issue is with the way predict handles data frames?
probably. Or the way |
I think you hit the crux of it - they're not the same. Seems like The only thing is that it means that even model.predict(a) is wrong (because it gives the same results as concat, which is sorted incorrectly). Which might mean some issue with Pandas again. Not sure I understand it now, but at least I know to build my model using np arrays exclusively. Thanks! |
Can you explain what you mean by "Order"?
Sure thing - I'm not showing the output because it's 20+ columns wide, but the order of features (columns) in each row is not the same as in the pandas DF. For example, if the DF is ordered [par1, par2, par3], in the concatenated array they might appear as [par3, par1, par2], so the model predicts on the wrong features (and thus gives inconsistent results). Does that make sense? Not sure what the bug would be in this case (I have not looked at the source), but it might be with the way pandas keeps track of column order instead.
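To make the point above concrete, here is a minimal sketch of why a column permutation changes a linear prediction: the coefficients are applied by position, not by name. The coefficients, intercept, and feature values are made up for illustration and stand in for a fitted model; only the names par1, par2, par3 come from the comment.

```python
import numpy as np

# Hypothetical coefficients for features in training order [par1, par2, par3].
coef = np.array([1.0, 10.0, 100.0])
intercept = 0.5


def predict(X):
    """Linear prediction that only sees column positions, not column names."""
    return X @ coef + intercept


row_train_order = np.array([[1.0, 2.0, 3.0]])   # [par1, par2, par3]
row_shuffled = row_train_order[:, [2, 0, 1]]    # [par3, par1, par2]

pred_ok = predict(row_train_order)   # coefficients line up with the right features
pred_bad = predict(row_shuffled)     # same values, wrong positions

print(pred_ok, pred_bad)
```

The two predictions differ even though both rows hold the same feature values, which would be consistent with the behavior described in this thread.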
Ok, so the (temporary) solution is to explicitly sort the train sample and input DF prior to modeling and predicting:
yields the correct result. This way you essentially force both train and test data sets to have the same column order.
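A minimal sketch of that workaround, not the original snippet from this comment. The frame contents are hypothetical, and it assumes both frames contain exactly the same set of column names.

```python
import pandas as pd

# Hypothetical frames; the column names follow the example in the thread.
X_train = pd.DataFrame({"par1": [1.0, 2.0], "par2": [3.0, 4.0], "par3": [5.0, 6.0]})
a = pd.DataFrame({"par3": [7.0], "par1": [8.0], "par2": [9.0]})  # same features, different order

# Option 1: sort the columns of both frames by name before fitting and predicting.
X_train_sorted = X_train.sort_index(axis=1)
a_sorted = a.sort_index(axis=1)

# Option 2: reorder the test frame so its columns match the training columns exactly.
a_aligned = a[X_train.columns]

print(list(a_sorted.columns), list(a_aligned.columns))
```

Either option gives the model the features in the same positions at fit and predict time. Option 2 has the advantage of raising a KeyError if a training column is missing from the test frame.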
So concatenating might change column order? Or is the column order in a different from the training set to begin with?
I agree - likely a duplicate.
Closing as a duplicate of #7242
When I make predictions on two samples, separately and appended to one another, I get different results. For example, this is as simple as it gets (both X_train and y_train are pandas DataFrames):
But when I predict, I get inconsistent results. Predicting for a and b separately:
Predicting for the two together:
Which is completely different, except for b, which gives the same result. This also seems to be model-agnostic as far as I can tell (although all within linear regression).