linear regression predictions unstable #7378

Closed
ttal opened this issue Sep 9, 2016 · 11 comments

ttal commented Sep 9, 2016

When I make predictions on two samples, first separately and then appended to one another, I get different results. For example, this is as simple as it gets (both X_train and y_train are pandas DataFrames):

model = linear_model.LinearRegression(normalize=True)
model.fit(X_train, y_train)

But when I predict, I get inconsistent results. Predicting for a and b separately:

model.predict(a)
--> array([[ -4.91992866e+19],
           [ -9.44975142e+18],
           [ -6.28885430e+19],
           [ -5.99072700e+19],
           [ -9.97222845e+19]])

model.predict(b)
--> array([[ 273408.]])

Predicting for the two together:

model.predict(a.append(b))
--> array([[  1.82526373e+17],
           [  1.82526373e+17],
           [  4.86912000e+05],
           [  2.41920000e+05],
           [ -1.45384048e+18],
           [  2.73408000e+05]])

The predictions for the rows of a are completely different; only the prediction for b is the same. This also seems to be model-agnostic as far as I can tell (although I have only tried models within linear regression).

Member

amueller commented Sep 9, 2016

What happens if you use numpy arrays?
Which version of scikit-learn are you using?

Member

amueller commented Sep 9, 2016

What's in a, and what's in a.append(b)?

Author

ttal commented Sep 9, 2016

That's a good catch - if I do the same with numpy arrays I get consistent results (a and b are both test inputs that go into model.predict):

model.predict(np.concatenate((np.array(a), np.array(b)), axis=0))
--> array([[ -4.91992866e+19],
           [ -9.44975142e+18],
           [ -6.28885430e+19],
           [ -5.99072700e+19],
           [ -9.97222845e+19],
           [  2.73408000e+05]])

Does that mean the issue is with the way predict handles data frames?

Member

amueller commented Sep 9, 2016

Probably. Or with the way pandas.append works.
Which version of scikit-learn are you using?
Can you give the result of
np.array(a.append(b))
and compare that to
np.concatenate((np.array(a), np.array(b)), axis=0)
?
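For readers who want to try this comparison themselves, here is a minimal sketch. The frames `a` and `b` and the column names `par1`..`par3` are made up for illustration and are not the reporter's data. It assumes the two frames hold the same features in a different column order, and it uses `pd.concat` because `DataFrame.append` was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same features, but stored in a different column order.
a = pd.DataFrame({"par1": [1.0, 2.0], "par2": [10.0, 20.0], "par3": [100.0, 200.0]})
b = pd.DataFrame({"par3": [300.0], "par1": [3.0], "par2": [30.0]})

# pandas aligns the rows of b to the columns of a by *name*.
by_name = pd.concat([a, b], ignore_index=True).to_numpy()

# numpy knows nothing about column names and stacks the values by *position*.
by_position = np.concatenate((a.to_numpy(), b.to_numpy()), axis=0)

# The last row (the one coming from b) differs between the two results.
print(by_name[-1])
print(by_position[-1])
print(np.array_equal(by_name, by_position))
```

If the two arrays differ, the frames did not share one column order to begin with, which is worth checking before suspecting `predict`.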

Author

ttal commented Sep 9, 2016

I think you hit the crux of it - they're not the same. Seems like np.array(a.append(b)) gives the correct order, while np.concatenate uses different sorting (based on the order I expect to get, and get for b).

The only thing is that it means that even model.predict(a) is wrong (because it gives the same results as concat, which is sorted incorrectly). Which might mean some issue with Pandas again.

Not sure I understand it now, but at least I know to build my model using np arrays exclusively.

Thanks!

Member

amueller commented Sep 9, 2016

Can you explain what you mean by "Order"?
We want to make sure other people don't run into the same issue, and pandas DataFrames should be supported as input.

Author

ttal commented Sep 9, 2016

Sure thing - I'm not showing the output because it's 20+ columns wide, but the order of features (columns) in each row is not the same as in the pandas DataFrame. For example, if the DataFrame is ordered [par1, par2, par3], in the concatenated array they might appear as [par3, par1, par2], so that the model predicts on the wrong features (and thus gives inconsistent results). Does that make sense?

Not sure what the bug would be in this case (I have not looked at the source), but it might instead be with the way pandas keeps track of column order.
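To illustrate why column order matters, here is a small sketch under stated assumptions: the data is synthetic, and the linear model is fitted with `np.linalg.lstsq` instead of scikit-learn so that it only depends on numpy and pandas. The point is only that fitted coefficients are applied by position, so a reordered frame produces different predictions.

```python
import numpy as np
import pandas as pd

# Synthetic training data (hypothetical; column names follow the example above).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(50, 3)), columns=["par1", "par2", "par3"])
true_coef = np.array([1.0, -2.0, 0.5])
y_train = X_train.to_numpy() @ true_coef

# Ordinary least squares without an intercept; coef[i] belongs to column i.
coef, *_ = np.linalg.lstsq(X_train.to_numpy(), y_train, rcond=None)

X_test = X_train.iloc[:5]
shuffled = X_test[["par3", "par1", "par2"]]  # same values, different column order

pred_ok = X_test.to_numpy() @ coef     # columns match the training order
pred_bad = shuffled.to_numpy() @ coef  # coefficients hit the wrong features

print(np.allclose(pred_ok, pred_bad))
```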

Author

ttal commented Sep 9, 2016

Ok, so the (temporary) solution is to explicitly sort the columns of both the training sample and the input DataFrame prior to modeling and predicting:

model.predict(np.array(a.sort_index(axis=1)))

yields the correct result. This way you essentially force both train and test data sets to have the same column order.
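An alternative to sorting both frames is to select the test columns in the training order before converting to an array. The sketch below is only one way to do that; `align_columns` is a hypothetical helper, not part of pandas or scikit-learn, and the example frame is made up.

```python
import pandas as pd

def align_columns(frame, train_columns):
    """Return `frame` with its columns in the order used for training.

    Hypothetical helper: raises KeyError if a training column is missing.
    """
    train_columns = list(train_columns)
    missing = [c for c in train_columns if c not in frame.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return frame.loc[:, train_columns]

# Made-up example: the test frame stores the features in a different order.
train_cols = ["par1", "par2", "par3"]
a = pd.DataFrame({"par3": [100.0], "par1": [1.0], "par2": [10.0]})

aligned = align_columns(a, train_cols)
print(list(aligned.columns))
```

With a fitted model this would be used as `model.predict(align_columns(a, X_train.columns).to_numpy())`, assuming `X_train` is the DataFrame passed to `fit`.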

Member

amueller commented Sep 9, 2016

So concatenating might change column order? Or is the column order in a different from the training set to begin with?
If the column order changes, that seems like a pandas issue. If the column order in the test set is different from the training set to begin with, this is a duplicate of #7242, which I hope we will be able to address at some point in the future.

Author

ttal commented Sep 9, 2016

I agree - likely a duplicate.

Member

rth commented Jun 20, 2019

> I agree - likely a duplicate.

Closing as a duplicate of #7242

rth closed this as completed Jun 20, 2019