Description
Sometimes it is convenient to first build a model on a recent dataset, save it as a .pkl file, and then apply the model to a new dataset. However, in a recent project, my friends and I found that the results turned quite weird after applying the .pkl file to the new dataset. We had implemented a binary classifier, and the predicted probability distribution changed from unimodal to bimodal. Eventually we found the problem: the column order of the new dataset was different from that of the old one, so the predictions were totally wrong.
I have checked the source code and discovered that sklearn's fit function does not save the column names during model training, so there is no way to check whether the columns are consistent at prediction time. We think it would be better if the column names were saved during training and then used to validate the input columns during prediction.
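The proposed check can be sketched independently of any particular estimator. Below is a hypothetical helper (`check_columns` is my own name, not an sklearn API) that compares new input against the columns recorded at training time, reorders when only the order differs, and refuses otherwise:

```python
import pandas as pd

def check_columns(train_columns, X):
    # Identical names and order: nothing to do
    if list(X.columns) == list(train_columns):
        return X
    # Same names, different order: reorder to match training
    if set(X.columns) == set(train_columns):
        return X[list(train_columns)]
    # Names differ: refuse to predict
    raise ValueError("Prediction columns do not match training columns")

train_cols = ['a', 'b', 'c']
X_new = pd.DataFrame([[3, 1, 2]], columns=['c', 'a', 'b'])
print(list(check_columns(train_cols, X_new).columns))  # ['a', 'b', 'c']
```

An estimator could call such a helper at the top of every predict-style method, using a column list stored as a fitted attribute.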
Steps/Code to Reproduce
# For simplicity, consider a very simple case
from sklearn.datasets import load_iris
import pandas as pd

# Build a DataFrame from the iris data
iris = load_iris()
X, y = iris.data[:-1, :], iris.target[:-1]
iris_pd = pd.DataFrame(X)
iris_pd.columns = iris.feature_names
iris_pd['target'] = y

from sklearn.cross_validation import train_test_split  # sklearn 0.17; moved to model_selection in 0.18+
train, test = train_test_split(iris_pd, test_size=0.3)

# Same column names, but the last two are swapped in the test list
feature_columns_train = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
feature_columns_test = ['sepal length (cm)', 'sepal width (cm)', 'petal width (cm)', 'petal length (cm)']

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
lg.fit(train[feature_columns_train], train['target'])
prob1 = lg.predict_proba(test[feature_columns_train])
prob2 = lg.predict_proba(test[feature_columns_test])
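The reason the estimator cannot detect the swap is, as far as I can tell, that input validation coerces the DataFrame to a plain NumPy array, which keeps whatever column order the DataFrame currently has and discards the names entirely. A minimal illustration:

```python
import numpy as np
import pandas as pd

# Row with b=1.0, a=2.0; the DataFrame knows the names
df = pd.DataFrame([[1.0, 2.0]], columns=['b', 'a'])

# After conversion the names are gone; only positions remain
arr = np.asarray(df)
print(arr)        # [[1. 2.]]
print(arr.shape)  # (1, 2)
```

Once the data is an ndarray, column 0 is simply "the first feature", whatever it was named before, so a reordered DataFrame silently feeds the wrong feature into each coefficient.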
Expected Results
Because feature_columns_test differs from feature_columns_train, it is not surprising that prob1 and prob2 are totally different; prob1 should be the correct result.
prob1[:5] =
array([[ 3.89507414e-04, 3.20099743e-01, 6.79510750e-01],
[ 4.63256526e-04, 4.65385156e-01, 5.34151587e-01],
[ 8.79704318e-01, 1.20295572e-01, 1.10268420e-07],
[ 7.80611983e-01, 2.19385827e-01, 2.19046022e-06],
[ 2.78898454e-02, 7.77243988e-01, 1.94866167e-01]])
Actual Results
prob2[:5] =
array([[ 4.36321678e-01, 2.25057553e-04, 5.63453265e-01],
[ 4.92513658e-01, 1.76391882e-05, 5.07468703e-01],
[ 9.92946715e-01, 7.05167151e-03, 1.61346947e-06],
[ 9.83726756e-01, 1.62387090e-02, 3.45348884e-05],
[ 5.01392274e-01, 5.37144591e-04, 4.98070581e-01]])
Versions
Linux-2.6.32-642.1.1.el6.x86_64-x86_64-with-redhat-6.7-Santiago
('Python', '2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec 6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.1')
('SciPy', '0.16.0')
('Scikit-Learn', '0.17')
A possible solution
I have also implemented a very simple solution. Hope this helps. :)
class SafeLogisticRegression(LogisticRegression):
    def fit(self, X, y, sample_weight=None):
        self.columns = X.columns  # remember the training column names
        LogisticRegression.fit(self, X, y, sample_weight=sample_weight)
    def predict_proba(self, X):
        new_columns = list(X.columns)
        old_columns = list(self.columns)
        if new_columns != old_columns:
            if len(new_columns) == len(old_columns):
                try:
                    X = X[old_columns]  # same names, different order: reorder
                    print "The order of columns has changed. Fixed."
                except:
                    raise ValueError('The columns has changed. Please check.')
            else:
                raise ValueError('The number of columns has changed.')
        return LogisticRegression.predict_proba(self, X)
Then apply this new class:
slg = SafeLogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
slg.fit(train[feature_columns_train], train['target'])
Test one: if the column order is changed
prob1 = slg.predict_proba(test[feature_columns_train])
prob2 = slg.predict_proba(test[feature_columns_test])
#The order of columns has changed. Fixed.
Result for test one:
prob1[:5] =
array([[ 3.89507414e-04, 3.20099743e-01, 6.79510750e-01],
[ 4.63256526e-04, 4.65385156e-01, 5.34151587e-01],
[ 8.79704318e-01, 1.20295572e-01, 1.10268420e-07],
[ 7.80611983e-01, 2.19385827e-01, 2.19046022e-06],
[ 2.78898454e-02, 7.77243988e-01, 1.94866167e-01]])
prob2[:5] =
array([[ 3.89507414e-04, 3.20099743e-01, 6.79510750e-01],
[ 4.63256526e-04, 4.65385156e-01, 5.34151587e-01],
[ 8.79704318e-01, 1.20295572e-01, 1.10268420e-07],
[ 7.80611983e-01, 2.19385827e-01, 2.19046022e-06],
[ 2.78898454e-02, 7.77243988e-01, 1.94866167e-01]])
Test two: if the column names are different
Simulate by changing one of the column names
prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))
error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-47-84cea68536fe> in <module>()
----> 1 prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))
<ipython-input-21-c3000b030a21> in predict_proba(self, X)
12 print "The order of columns has changed. Fixed."
13 except:
---> 14 raise ValueError('The columns has changed. Please check.')
15 else:
16 raise ValueError('The number of columns has changed.')
ValueError: The columns has changed. Please check.
Test three: if the number of columns changes
Simulate by dropping one column
prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))
error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-48-47c63ae1ac22> in <module>()
----> 1 prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))
<ipython-input-21-c3000b030a21> in predict_proba(self, X)
14 raise ValueError('The columns has changed. Please check.')
15 else:
---> 16 raise ValueError('The number of columns has changed.')
17 return LogisticRegression.predict_proba(self, X)
ValueError: The number of columns has changed.
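Since the original motivation was .pkl files, it is worth noting that a column list stored as an attribute in fit survives pickling like any other fitted attribute, so the check still works after saving and reloading the model. A toy stand-in (`ColumnAwareModel` is hypothetical, used here only to avoid retraining a real estimator):

```python
import io
import pickle

import pandas as pd

class ColumnAwareModel(object):
    """Toy stand-in for an estimator that records its training columns."""
    def fit(self, X):
        self.columns = list(X.columns)
        return self

model = ColumnAwareModel().fit(pd.DataFrame(columns=['a', 'b']))

# Round-trip through pickle, as one would with a .pkl file on disk
buf = io.BytesIO()
pickle.dump(model, buf)  # self.columns is serialized with the model
buf.seek(0)
restored = pickle.load(buf)
print(restored.columns)  # ['a', 'b']
```

So a pickled SafeLogisticRegression would carry its training columns along and could validate any new dataset it is applied to.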