linear regression predictions unstable #7378

ttal · 2016-09-09T16:54:11Z

When I make predictions on two samples, separately and appended to one another, I get different results. For example, this is as simple as it gets (both X_train and y_train are pandas DataFrames):

from sklearn import linear_model
model = linear_model.LinearRegression(normalize=True)
model.fit(X_train, y_train)

But when I predict, I get inconsistent results. Predicting for a and b separately:

model.predict(a)
--> array([[ -4.91992866e+19],
           [ -9.44975142e+18],
           [ -6.28885430e+19],
           [ -5.99072700e+19],
           [ -9.97222845e+19]])

model.predict(b)
--> array([[ 273408.]])

Predicting for the two together:

model.predict(a.append(b))
--> array([[  1.82526373e+17],
           [  1.82526373e+17],
           [  4.86912000e+05],
           [  2.41920000e+05],
           [ -1.45384048e+18],
           [  2.73408000e+05]])

These results are completely different, except for the prediction for b, which is unchanged. This also seems to be model-agnostic as far as I can tell (although I have only tried models within linear regression).


amueller · 2016-09-09T16:55:32Z

What happens if you use numpy arrays?
Which version of scikit-learn are you using?

amueller · 2016-09-09T16:55:55Z

What's in a, and what's in a.append(b)?

ttal · 2016-09-09T17:14:04Z

That's a good catch - if I do the same with numpy arrays I get consistent results (a and b are both test inputs that go into model.predict):

model.predict(np.concatenate((np.array(a), np.array(b)), axis=0))
--> array([[ -4.91992866e+19],
           [ -9.44975142e+18],
           [ -6.28885430e+19],
           [ -5.99072700e+19],
           [ -9.97222845e+19],
           [  2.73408000e+05]])

Does that mean the issue is with the way predict handles data frames?

amueller · 2016-09-09T17:16:20Z

Probably. Or the way pandas.append works.
Which version of scikit-learn are you using?
Can you give the result of
np.array(a.append(b))
and compare that to
np.concatenate((np.array(a), np.array(b)), axis=0)
?
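
For anyone reproducing the comparison requested above, here is a minimal sketch. The frames a and b and their values are made up for illustration, and pd.concat stands in for a.append(b) because DataFrame.append was removed in pandas 2.0. The assumption being tested is that b carries the same column labels as a but in a different order.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same column labels, different column order.
a = pd.DataFrame({"par1": [1.0, 2.0], "par2": [10.0, 20.0], "par3": [100.0, 200.0]})
b = pd.DataFrame({"par3": [300.0], "par1": [3.0], "par2": [30.0]})

# pandas aligns rows by column label before stacking them.
by_label = np.array(pd.concat([a, b]))

# numpy has no labels, so it stacks purely by column position.
by_position = np.concatenate((np.array(a), np.array(b)), axis=0)

same = np.array_equal(by_label, by_position)
print(same)
print(by_label[-1], by_position[-1])
```

If the two arrays differ only in the rows that came from b, that points to a column-order mismatch between the frames rather than to a problem in predict.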

ttal · 2016-09-09T17:29:40Z

I think you hit the crux of it - they're not the same. Seems like np.array(a.append(b)) gives the correct order, while np.concatenate uses different sorting (based on the order I expect to get, and get for b).

The only thing is that it means that even model.predict(a) is wrong (because it gives the same results as concat, which is sorted incorrectly). Which might mean some issue with Pandas again.

Not sure I understand it now, but at least I know to build my model using np arrays exclusively.

Thanks!

amueller · 2016-09-09T17:37:07Z

Can you explain what you mean by "Order"?
We want to make sure other people don't run into the same issue, and pandas DataFrames should be supported as input.

ttal · 2016-09-09T17:48:39Z

Sure thing - I'm not showing the output because it's 20+ columns wide but the order of features (columns) in each row is not the same as in the pandas DF. For example, if the DF is ordered [par1, par2, par3], in the concatenated array they might appear [par3, par1, par2] so that the model predicts on the wrong features (and thus gives inconsistent results). Does that make sense?
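
A small sketch of the effect described above, under stated assumptions: the coefficient vector and the predict helper are hypothetical stand-ins for a fitted linear model, not scikit-learn code. The point is only that once a DataFrame is converted to an array, the weights are applied by position, so a permuted column order silently changes the prediction.

```python
import numpy as np
import pandas as pd

# Hypothetical weights for par1, par2, par3, in the training column order.
coef = np.array([1.0, 10.0, 100.0])

def predict(X):
    """Positional linear prediction: column labels are discarded."""
    return np.asarray(X) @ coef

row = pd.DataFrame({"par1": [1.0], "par2": [2.0], "par3": [3.0]})
shuffled = row[["par3", "par1", "par2"]]  # same data, different column order

expected = predict(row)    # 1*1 + 2*10 + 3*100
wrong = predict(shuffled)  # 3*1 + 1*10 + 2*100
print(expected, wrong)
```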

Not sure what the bug would be in this case (I have not looked at the source), but it might rather lie in the way pandas keeps track of column order.

ttal · 2016-09-09T17:58:00Z

Ok, so the (temporary) solution is to explicitly sort the train sample and input DF prior to modeling and predicting:

model.predict(np.array(a.sort_index(axis=1)))

yields the correct result. This way you essentially force both train and test data sets to have the same column order.
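
A sketch of two ways to force a shared column order, assuming the training and test frames contain the same column labels. X_train and a here are small made-up frames. Sorting both frames by label is the workaround described above; selecting the test columns in the training order is an alternative that leaves the training frame untouched and raises a KeyError if a column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test frames with the same labels in different orders.
X_train = pd.DataFrame({"par2": [10.0, 20.0], "par1": [1.0, 2.0], "par3": [100.0, 200.0]})
a = pd.DataFrame({"par3": [300.0], "par1": [3.0], "par2": [30.0]})

# Option 1: sort both frames by column label before fitting and predicting.
train_sorted = X_train.sort_index(axis=1)
a_sorted = a.sort_index(axis=1)

# Option 2: reorder the test columns to match the training frame.
a_aligned = a[X_train.columns]

print(list(a_sorted.columns), list(a_aligned.columns))
```

Either way, the array handed to predict then has its features in the same positions as the array used for fit.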

amueller · 2016-09-09T18:24:34Z

So concatenating might change column order? Or is the column order in a different from the training set to begin with?
If the column order changes, that seems like a pandas issue. If the column order in the test set is different from the training set to begin with, this is a duplicate of #7242, which I hope we will be able to address at some point in the future.

ttal · 2016-09-09T18:35:03Z

I agree - likely a duplicate.

rth · 2019-06-20T21:02:42Z

I agree - likely a duplicate.

Closing as a duplicate of #7242

rth closed this as completed Jun 20, 2019
