
Potential error caused by different column order #7242


Closed
SauceCat opened this issue Aug 25, 2016 · 46 comments

Comments

@SauceCat

Description

Sometimes it is convenient to build a model on a recent dataset, save it as a .pkl file, and then apply the model to new data. However, in our last project, my friends and I found that the results turned quite weird after applying the pickled model to a new dataset. We had implemented a binary classifier, and the predicted probability distribution changed from unimodal to bimodal. Eventually we found that the column order of the new dataset was different from the old one, so the predictions were totally wrong.
I checked the source code and discovered that sklearn's fit function does not save the column names during model training, so there is no way to check whether the columns are consistent at prediction time. We thought it would be better if the column names could be saved during training and then used to validate the columns during prediction.

Steps/Code to Reproduce

# For simplicity, consider a very simple case
from sklearn.datasets import load_iris
import pandas as pd

# make a dataframe
iris = load_iris()
X, y = iris.data[:-1,:], iris.target[:-1]
iris_pd = pd.DataFrame(X)
iris_pd.columns = iris.feature_names
iris_pd['target'] = y

from sklearn.cross_validation import train_test_split
train, test = train_test_split(iris_pd, test_size=0.3)

feature_columns_train = ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
feature_columns_test = ['sepal length (cm)','sepal width (cm)','petal width (cm)','petal length (cm)']

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
lg.fit(train[feature_columns_train], train['target'])

prob1 = lg.predict_proba(test[feature_columns_train])
prob2 = lg.predict_proba(test[feature_columns_test])

Expected Results

Because feature_columns_test differs from feature_columns_train, it is not surprising that prob1 and prob2 are totally different; prob1 should be the correct result.

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Actual Results

prob2[:5] =
array([[  4.36321678e-01,   2.25057553e-04,   5.63453265e-01],
       [  4.92513658e-01,   1.76391882e-05,   5.07468703e-01],
       [  9.92946715e-01,   7.05167151e-03,   1.61346947e-06],
       [  9.83726756e-01,   1.62387090e-02,   3.45348884e-05],
       [  5.01392274e-01,   5.37144591e-04,   4.98070581e-01]])

Versions

Linux-2.6.32-642.1.1.el6.x86_64-x86_64-with-redhat-6.7-Santiago
('Python', '2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec  6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.1')
('SciPy', '0.16.0')
('Scikit-Learn', '0.17')

A possible solution

I also implemented a very simple solution. Hope this helps. :)

class SafeLogisticRegression(LogisticRegression):
    def fit(self, X, y, sample_weight=None):
        self.columns = X.columns
        LogisticRegression.fit(self, X, y, sample_weight=sample_weight)
        return self

    def predict_proba(self, X):
        new_columns = list(X.columns)
        old_columns = list(self.columns)
        if new_columns != old_columns:
            if len(new_columns) == len(old_columns):
                try:
                    X = X[old_columns]
                    print "The order of columns has changed. Fixed."
                except KeyError:
                    raise ValueError('The columns have changed. Please check.')
            else:
                raise ValueError('The number of columns has changed.')
        return LogisticRegression.predict_proba(self, X)

Then apply this new class:

slg = SafeLogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
slg.fit(train[feature_columns_train], train['target']) 

Test one: if the column order is changed

prob1 = slg.predict_proba(test[feature_columns_train])
prob2 = slg.predict_proba(test[feature_columns_test])

#The order of columns has changed. Fixed.

Result for test one:

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

prob2[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Test two: if the column names are different

Simulate by changing one of the column names

prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-84cea68536fe> in <module>()
----> 1 prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     14                     print "The order of columns has changed. Fixed."
     15                 except KeyError:
---> 16                     raise ValueError('The columns have changed. Please check.')
     17             else:
     18                 raise ValueError('The number of columns has changed.')

ValueError: The columns have changed. Please check.

Test three: if the number of columns changes

Simulate by dropping one column

prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-47c63ae1ac22> in <module>()
----> 1 prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     16                     raise ValueError('The columns have changed. Please check.')
     17             else:
---> 18                 raise ValueError('The number of columns has changed.')
     19         return LogisticRegression.predict_proba(self, X)

ValueError: The number of columns has changed.
@amueller
Member

Scikit-learn currently has no concept of column labels. I agree this would be a nice to have feature. Don't hold your breath for it. Storing the column names at all would be an important first step, which I hope to achieve soonish. This will not be available in 0.18, and the consistency check probably not in 0.19 either. Scikit-learn operates on numpy arrays, which don't (currently) have column names. This might change in the future, and we might add better integration with pandas.

@amueller amueller added this to the 1.0 milestone Aug 25, 2016
@SauceCat
Author

Thanks for replying! Looking forward to the new version! 💯

@pjcpjc

pjcpjc commented Aug 7, 2017

Thanks for the edifying conversation guys!

@jnothman
Member

jnothman commented Aug 8, 2017

I think a wrapper to ensure column order is maintained between fit and predict/transform would be a good idea.
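
A minimal sketch of such a wrapper, assuming a generic meta-estimator (the class name and behaviour here are illustrative, not an existing API): it records the column order at fit time and reindexes before predict/transform.

import pandas as pd
from sklearn.base import BaseEstimator

class ColumnOrderWrapper(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        # remember the fit-time column order, if X has one
        self.columns_ = list(X.columns) if isinstance(X, pd.DataFrame) else None
        self.estimator.fit(X, y, **fit_params)
        return self

    def _reorder(self, X):
        if self.columns_ is not None and isinstance(X, pd.DataFrame):
            if list(X.columns) != self.columns_:
                X = X[self.columns_]  # KeyError if a fitted column is missing
        return X

    def predict(self, X):
        return self.estimator.predict(self._reorder(X))

    def transform(self, X):
        return self.estimator.transform(self._reorder(X))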

@pjcpjc

pjcpjc commented Aug 8, 2017

I would guess the way to do it would be to respect column names if they are present both for fit and predict, and to respect column ordering if they are missing in either (or both). Feature selection would also need to abide by this. I'm sure it's not as easy as it might first appear.

@jnothman
Member

jnothman commented Aug 8, 2017 via email

@pjcpjc

pjcpjc commented Aug 11, 2017

In my opinion this is an sklearn issue, not a pandas issue. As I said, not complaining, not expecting it to be terribly easy.

@amueller
Member

I would like to solve this inside scikit-learn, but I don't think there is consensus to do that (yet).
I would argue that we should protect our users against silent wrong results; other voices might say that users shouldn't pass dataframes?

@pjcpjc

pjcpjc commented Aug 11, 2017

other voices might say that users shouldn't pass dataframes?

You could print a warning that says "Beware when using Dataframes. sklearn uses column ordering not column labels. Downcast to numpy arrays to avoid this warning".

That would have saved my bacon here.

Pandas prints out a warning with the view vs. copy "gotcha", and that was enough to motivate me to study and pre-empt any problem. This issue seems similar. http://bit.ly/2qR1fDH , http://bit.ly/2pRbR7Q

@amueller
Member

@pjcpjc A warning is an option, but it would be very, very loud. Many people are passing dataframes.
I guess it depends a bit on whether we want them to pass dataframes or not. One problem is that if we warn, that basically tells users to use .values everywhere to convert to numpy arrays so they don't get a warning. But then this error becomes silent again, so we gained nothing. We just made users change their code so that we can never hope to catch the error.

Also, I think having the estimator know the feature names is helpful for model inspection.

@pjcpjc

pjcpjc commented Aug 11, 2017

I'd prefer for sklearn to respect column names if they are present both for fit and predict, and to respect column ordering if they are missing in either (or both). But I also think a warning would be better than how it works right now, where it's easy to get a false sense of security and end up with buggy code, which was my experience with this issue.

You can see my bug fix here. The original version didn't have cell 18. A warning would have flagged me that cell 18 was needed. The additional code changes to pass numpy arrays would be minor, if needed.

Just one opinion.

@amueller
Member

I'm not sure you got the problem with the warning. If there's always a warning, people will change their code so they don't get a warning, namely by passing in df.values, i.e. a numpy array without column names. Then later they change their code and the columns mismatch again, but they get no warning because they only passed a numpy array. So in the end, adding the warning didn't save them at all; it just created more work.

I totally agree that we should do better than we're doing now; I just think a warning whenever a dataframe is passed will not help.

@pjcpjc

pjcpjc commented Aug 11, 2017

I think I understand your point. My point is that if the warning links to a "see here for details" documentation page, and that page takes great pains to point out that you need to manage the column positions outside of sklearn, then my guess is it will save quite a few people.

I also suspect that the people who ignore the warning, and/or don't take it seriously and just do the simplest possible thing to make the warning go away, are probably un-saveable in any event. They will find some other way to muck up their data science practice.

That said, agree to disagree. It would have helped me, but it is certainly possible that I'm too unusual a case to be a useful guide.

@amueller
Member

Ok, I see where you're coming from. It might be a good way to make people aware that this issue exists at all. I still think it's a bit too loud and unspecific, though it might indeed save some people. I hope for a near-ish term better solution ;)

@jnothman
Member

I think warning for all pandas DataFrame input is overkill. Lots of people are loading DataFrames from static sources (as we recommend) and not reordering columns.

The options, as far as I'm concerned, are:

  • add some documentation on Scikit-learn with Pandas and pitfalls therein.
  • add a common test which ensures column name equality, otherwise raising a warning (and perhaps in a later version an error) when mismatched.

Obviously the latter is a substantial effort, and adds O(n_features) space to the pickle representation of an estimator unless we use a hash, but ... dataframes are important to support well.
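
For the hash variant, a constant-size digest of the column names would do; a sketch (column_hash here is a hypothetical helper, not an existing utility):

import hashlib

def column_hash(columns):
    # hash the names with a separator so ['ab', 'c'] != ['a', 'bc']
    h = hashlib.sha256()
    for name in columns:
        h.update(str(name).encode('utf-8'))
        h.update(b'\x00')
    return h.hexdigest()

# in fit:     self._column_hash = column_hash(X.columns)
# in predict: if column_hash(X.columns) != self._column_hash: warn or raise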

@jnothman
Member

If we do this, I'd suggest we extend check_array, at least to do the checking, and perhaps even to help store the estimator attribute. In terms of syntax, this could be as magical as check_X_y(X, y, ..., estimator=self) in fit and check_array(X, ..., estimator=self) in predict/transform, but maybe we'd prefer check_X_y(X, y, ..., store_column_hash=self) and check_array(X, ..., column_hash=self._column_hash). (We could make it even more magical with decorators or a metaclass, but we don't do that 'round here.)

There will be cases where this is hard to do because check_array is called by a function called by the estimator, rather than directly within it. We'd also need to do the same for warm_start and partial_fit.

In short, this is possible, but it'll be a big project and will touch everything.

@amueller
Member

I would rather not store hashes, because this information is very valuable for model inspection.
I'm not sure I like passing self to check_array, I guess the only alternative would be to return column_names and then store it, and pass column_names=self._column_names in predict / transform.
(if we do this, we probably want to really look into get_feature_names and the pipeline processing of the same, because it'd be AWESOME!)

@jnothman
Member

jnothman commented Aug 13, 2017 via email

@thebeancounter

So, for now, if I wish to use fit and then use transform on another df, what is the proper way to ensure the correct column order?
How can I guarantee that the columns of a pandas dataframe are in a specific order?

@jnothman
Member

jnothman commented Jan 5, 2018 via email

@jnothman
Member

Currently, estimators have ad-hoc checks that n_features matches between fit and predict.

Maybe we should implement something like this in BaseEstimator (for use after check_array):

def _fit_consistency(self, X):
    self.__n_features = X.shape[1]
    if isinstance(X, pd.DataFrame):
        self.__columns = X.columns

def _check_consistency(self, X):
    if self.__n_features != X.shape[1]:
        raise ...
    if isinstance(X, pd.DataFrame):
        # note the name mangling: __columns is stored as _BaseEstimator__columns
        if not hasattr(self, '_BaseEstimator__columns'):
            warn ...
        elif not self.__columns.equals(X.columns):
            raise ??
    elif hasattr(self, '_BaseEstimator__columns'):
        warn ...

@kienerj

kienerj commented Feb 1, 2018

I just wanted to add that this feature is extremely important out in the real world, where the model building process and the prediction process (exposed to users) are often completely separate.

The prediction process should not need to know about feature selection, feature order and so forth. It should just take all the features for the specific problem, and the estimator knows which ones it needs and in what order. This is how it works in other tools; it's extremely convenient, and the reason I prefer that tool over sklearn. Sklearn, however, offers more estimator types and better performance (in speed and in prediction quality, even for the same model types). So it would be great to have this here too.

I would imagine this happening either through pandas column names or, if you pass in numpy arrays, through an additional optional column_names parameter on fit and predict (not that it's great to have to create this list). If no column names are present, it works as it does now. (A rough sketch follows.)
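
A self-contained sketch of what that could look like (a hypothetical subclass; fit and predict_proba in sklearn accept no column_names argument today):

import numpy as np
from sklearn.linear_model import LogisticRegression

class NamedColumnsLogisticRegression(LogisticRegression):
    def fit(self, X, y, column_names=None, **kwargs):
        # remember the fit-time column order, if provided
        self.column_names_ = list(column_names) if column_names is not None else None
        return LogisticRegression.fit(self, X, y, **kwargs)

    def predict_proba(self, X, column_names=None):
        if self.column_names_ is not None and column_names is not None:
            # reorder the incoming columns to match the fit-time order
            order = [list(column_names).index(c) for c in self.column_names_]
            X = np.asarray(X)[:, order]
        return LogisticRegression.predict_proba(self, X)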

@jnothman
Member

jnothman commented Feb 1, 2018

I don't think your suggestion is going to happen.

Well, here are the options (not all mutually exclusive):

  1. Deprecate implicit conversion of DataFrames to NumPy arrays; in the future, the user would have to call .values explicitly
  2. Document it: Pandas Interoperability section in User Guide #10453 is up for grabs and your writing on the subject is fluent, @kienerj!
  3. Call df.sort_index(axis=1) in the future (with a warning during a deprecation period, as it is not backwards compatible), with the risk that this (a) increases runtime/memory; (b) does not identify changes in column names where the number of columns is preserved
  4. Put some PandasProtect wrapper or mixin in sklearn-pandas and tell people to use it if they want to be safe
  5. Invent a PandasCheck transformer (either here or in sklearn-pandas) and tell Pandas users to always use pipelines and put this at the beginning
  6. Store column names in fit and raise a warning (or perhaps error) in transform etc if not matched.

Number 5 is by far the easiest, although it requires users to be aware, so it's coupled with 2. It looks like this (with documentation and an awareness campaign):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class PandasCheck(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.columns_ = getattr(X, 'columns', None)
        return self

    def transform(self, X):
        check_is_fitted(self, 'columns_')
        if not np.all(self.columns_ == X.columns):
            raise ValueError('Mismatched columns in transform. Expected %r, got %r'
                             % (self.columns_, X.columns))
        return X
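
Usage would then look like this (a sketch reusing train, test and the feature column lists from the reproduction code at the top of this issue):

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(PandasCheck(), LogisticRegression())
pipe.fit(train[feature_columns_train], train['target'])
pipe.predict_proba(test[feature_columns_test])  # raises ValueError: mismatched columns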

@qdeffense
Contributor

I've found these functions useful:

import collections.abc
from typing import List

import pandas as pd
from sklearn.utils import check_array

def check_array_pandas(X: pd.DataFrame,
                       feature_names: List,
                       *args, **kwargs):
    ''' A wrapper for sklearn's check_array

    Used to check and enforce column names if pandas dataframes are used.

    Parameters
    --------
    X: pd.DataFrame
        The input you want to check.
    feature_names: list of column names
        The column names you'd like to check, in the order you want to enforce.
    *args
        *args for check_array
    **kwargs
        **kwargs for check_array

    Returns
    --------
    X: array-like
        With columns ordered like feature_names and all other columns removed,
        then put through check_array.

    Raises
    --------
    KeyError:
        If any feature_names aren't in X.columns.
    TypeError:
        If X is not a pd.DataFrame, or feature_names is not iterable.

    '''
    if feature_names is not None:
        if isinstance(X, pd.DataFrame):
            if isinstance(feature_names, collections.abc.Iterable):
                # selecting by name also reorders the columns
                X = X[feature_names]
            else:
                raise TypeError('feature_names should be iterable.')
        else:
            raise TypeError('X should be a DataFrame for safety purposes.')

    return check_array(X, *args, **kwargs)

Do this in fit:

self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None

And this in __init__:

self.feature_names_ = None

Then just replace all calls to check_array with check_array_pandas.

Wouldn't that work?

@ryanpeach

Please see the test cases in #12936

@qdeffense
Contributor

@jnothman what do you think?

@jnothman
Member

jnothman commented Jan 8, 2019

  • I suspect we should raise an error when the order is different from in fit, rather than reordering, in part to alert users who thought their code was working in previous scikit-learn versions but actually were mixing their features. We can always change it to reordering later.
  • feature_names_ is ambiguous. It needs a better name.
  • I am a little hesitant about putting self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None in each estimator. I would rather this be abstracted away into a function or method, e.g. set_X_signature(self, X), with check_array supporting X_signature to check integrity (a rough sketch follows).
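
A rough sketch of that abstraction (all names hypothetical; nothing like this exists in scikit-learn today):

import pandas as pd

def set_X_signature(estimator, X):
    # record the input signature during fit
    estimator._X_signature = (
        X.shape[1],
        list(X.columns) if isinstance(X, pd.DataFrame) else None,
    )

def check_X_signature(estimator, X):
    # verify X against the signature recorded during fit
    n_features, columns = estimator._X_signature
    if X.shape[1] != n_features:
        raise ValueError('X has %d features; expected %d' % (X.shape[1], n_features))
    if columns is not None and isinstance(X, pd.DataFrame):
        if list(X.columns) != columns:
            raise ValueError('Column mismatch: expected %r, got %r'
                             % (columns, list(X.columns)))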

@jnothman
Member

jnothman commented Jan 8, 2019

We could potentially make setting (not just checking) the signature a job for check_array, e.g. check_array(X, set_signature=self)

Getting someone to implement these changes (preferably someone who knows how to edit files en masse) is the bigger challenge. In practice, where an estimator delegates checking of X to some other function, it's not going to be trivial to implement everywhere.

@ryanpeach

It's difficult because things in the codebase don't call super() on BaseEstimator much.

@ryanpeach

ryanpeach commented Jan 9, 2019

I disagree on raising an error. Data frames are philosophically different from arrays. People are likely passing data frames in the wrong order at the moment, but they aren't wrong to do so; they just aren't aware that the sklearn package hasn't handled their edge case. Pandas dataframe column ordering has always existed, but it's just not talked about, because anyone using pandas would never iloc a column when indexing... you index columns by name.

Maybe a warning as we transition, but I wouldn't recommend an error.

@jnothman
Member

jnothman commented Jan 9, 2019 via email

@ryanpeach

I absolutely agree that a warning is a must. I agree with your reply 100%.

A compromise may be to issue an update that creates a warning ASAP, by collecting column names where appropriate and creating the architecture of the solution. Then in the next breaking update, implement automatic reordering.

@ryanpeach

As for extra columns: in my own code I have implemented a lot of what we are talking about in the context of cross-validated feature selectors, such as RFECV. By accepting dataframes of any size, you allow these learners to always be correct. Users can give them just the selected columns at predict time, the full data they fit on, or any arbitrary dataframe. The goal in my mind is to enable the agnostic use of estimators, because these estimators are going to be pickled and long forgotten in a repo somewhere. I don't want to use them mistakenly when I dig them out of the filing cabinet.

@jnothman
Member

jnothman commented Jan 13, 2019 via email

@SreejeshPM

Is there any workaround for this problem, other than reordering the columns?

@aishwariya96

Hello,
I ran into the same issue and tried this:

df2 = df2[df1.columns]

where df2 is the data frame whose columns need to be ordered like those in df1.

@amueller
Member

amueller commented May 27, 2020

Hm, we still don't have a SLEP for this, right? I think we should try writing one. @NicolasHug did you want to do that?
This is related to feature names, but the current feature name ideas might not actually solve this issue.
@adrinjalali do you have thoughts on whether we want to do this before doing feature names?

@thomasjpfan
Member

I think the options are mostly listed in #7242 (comment)

@amueller I know you do not like this, but I tend to prefer PandasCheck,
which also enables #12627 (comment). This approach only does the check at the beginning of the pipeline. The major argument against it is that "pandas users need to use a pipeline to have column name consistency".

The alternative is to store the names everywhere. I can't get over the fact that to have this consistency, users would need to pay a price in memory. Do we want users to buy more RAM to get column name consistency?

@kienerj

kienerj commented May 27, 2020

I'm with @ryanpeach

By accepting dataframes of any size, you allow these learners to always be correct.

and my initial comment on this issue. This is from a naive end-user point of view, not knowing the fundamental architecture of sklearn and just being concerned with making it easy to get correct predictions.

Do we want users to buy more ram to get column name consistency?

How much RAM does storing even thousands of column names really use? Even if you go all-in and have thousands of very long column names, it adds up to a couple hundred MB. As long as it is optional, I don't see the issue, given the huge convenience gain. Getting 64 GB of RAM doesn't cost all that much compared to developer time, or, even worse, wrong results.

@amueller
Member

amueller commented May 27, 2020

I want 6). I think 1, 2 and 3 don't solve the issue. 4 and 5 solve it if the user is aware of the issue and actively takes steps to prevent a problem from occurring. They sort of solve the issue, but not the core issue of user expectations and silent errors. Also, they are really verbose.

I guess in that I disagree with @ryanpeach in #7242 (comment) but his solution is not even on the list ;)

However, if the first step is a ColumnTransformer, then we actually do the "right" thing, in that we rearrange the columns. That makes me think: is inserting ColumnTransformer(remainder='passthrough') at the beginning of a pipeline actually implementing the feature @ryanpeach wants? (A sketch follows.)
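
A sketch of that idea, reusing the frames from the reproduction code; this variant selects the training columns by name explicitly, since selection by name is what performs the rearranging:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    ColumnTransformer([('features', 'passthrough', feature_columns_train)]),
    LogisticRegression(),
)
pipe.fit(train[feature_columns_train], train['target'])
# columns are picked by name, so a reordered test frame is handled correctly
pipe.predict_proba(test[feature_columns_test])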

@adrinjalali
Member

@adrinjalali do you have thoughts on whether we want to do this before doing the feature names?

I think this is independent of output_feature_names_, and I'd be in favor of the same kind of validation we have in ColumnTransformer everywhere. I think implementing input_feature_names_ only if the given input has feature names makes sense, and it can be implemented even if it's not propagated through the pipeline.

This should also be really easy to implement now that we have @NicolasHug 's work on _validate_data merged, and I don't think we need a SLEP for it, for the same reason that we didn't need a SLEP to implement the validation in ColumnTransformer.

@amueller
Member

@adrinjalali Cool, I totally agree (and encouraged @NicolasHug to draft a PR) ;)
Would love to hear @jnothman's thoughts as well

@amueller
Member

amueller commented Jun 3, 2020

Also see #17407

@adrinjalali
Member

Now that we validate column names, this is partially fixed if the user uses a ColumnTransformer, right?

Removing the milestone.

@thomasjpfan
Member

We now validate feature names by default and raise a warning if the column names are not in the same order. In 1.2, we will start raising an error.
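
A minimal illustration of the behaviour described above (assuming scikit-learn >= 1.0):

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [1, 0, 1, 0]})
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
clf.predict(X[['b', 'a']])
# 1.0/1.1: warns that the feature names do not match those seen during fit
# 1.2+:    raises a ValueError instead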
