
Potential error caused by different column order #7242


Closed
SauceCat opened this issue Aug 25, 2016 · 46 comments

Comments

@SauceCat

Description

Sometimes it is convenient to build a model on a recent dataset, save it as a .pkl file, and then apply the model to new data. However, in our last project, my friends and I found that the results turned quite weird after applying the pickled model to a new dataset. We had implemented a binary classifier, and the predicted probability distribution changed from unimodal to bimodal. Eventually we found that the column order of the new dataset was different from the old one, so the predictions were totally wrong.
I checked the source code and discovered that sklearn's fit function does not save the column names during model training, so there is no way to check whether the columns are consistent at prediction time. We thought it would be better if the column names could be saved during training and then used to validate the columns during prediction.

Steps/Code to Reproduce

# For simplicity, consider a very simple case
from sklearn.datasets import load_iris
import pandas as pd

# make a dataframe
iris = load_iris()
X, y = iris.data[:-1,:], iris.target[:-1]
iris_pd = pd.DataFrame(X)
iris_pd.columns = iris.feature_names
iris_pd['target'] = y

from sklearn.cross_validation import train_test_split
train, test = train_test_split(iris_pd, test_size=0.3)

feature_columns_train = ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
feature_columns_test = ['sepal length (cm)','sepal width (cm)','petal width (cm)','petal length (cm)']

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
lg.fit(train[feature_columns_train], train['target'])

prob1 = lg.predict_proba(test[feature_columns_train])
prob2 = lg.predict_proba(test[feature_columns_test])

Expected Results

Because feature_columns_test differs from feature_columns_train, it is not surprising that prob1 and prob2 are totally different; prob1 should be the correct result.

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Actual Results

prob2[:5] =
array([[  4.36321678e-01,   2.25057553e-04,   5.63453265e-01],
       [  4.92513658e-01,   1.76391882e-05,   5.07468703e-01],
       [  9.92946715e-01,   7.05167151e-03,   1.61346947e-06],
       [  9.83726756e-01,   1.62387090e-02,   3.45348884e-05],
       [  5.01392274e-01,   5.37144591e-04,   4.98070581e-01]])

Versions

Linux-2.6.32-642.1.1.el6.x86_64-x86_64-with-redhat-6.7-Santiago
('Python', '2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec  6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.1')
('SciPy', '0.16.0')
('Scikit-Learn', '0.17')

A possible solution

I also implemented a very simple solution. Hope this helps. :)

class SafeLogisticRegression(LogisticRegression):
    def fit(self, X, y, sample_weight=None):
        self.columns = X.columns
        LogisticRegression.fit(self, X, y, sample_weight=sample_weight)
        return self

    def predict_proba(self, X):
        new_columns = list(X.columns)
        old_columns = list(self.columns)
        if new_columns != old_columns:
            if len(new_columns) == len(old_columns):
                try:
                    X = X[old_columns]
                    print "The order of columns has changed. Fixed."
                except KeyError:
                    raise ValueError('The columns have changed. Please check.')
            else:
                raise ValueError('The number of columns has changed.')
        return LogisticRegression.predict_proba(self, X)

Then apply this new class:

slg = SafeLogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
slg.fit(train[feature_columns_train], train['target']) 

Test one: if the column order is changed

prob1 = slg.predict_proba(test[feature_columns_train])
prob2 = slg.predict_proba(test[feature_columns_test])

#The order of columns has changed. Fixed.

Result for test one:

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

prob2[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Test two: if the column names are different

Simulate by changing one of the column names

prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-84cea68536fe> in <module>()
----> 1 prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     14                     print "The order of columns has changed. Fixed."
     15                 except KeyError:
---> 16                     raise ValueError('The columns have changed. Please check.')
     17             else:
     18                 raise ValueError('The number of columns has changed.')

ValueError: The columns have changed. Please check.

Test three: if the number of columns changes

Simulate by dropping one column

prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-47c63ae1ac22> in <module>()
----> 1 prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     16                     raise ValueError('The columns have changed. Please check.')
     17             else:
---> 18                 raise ValueError('The number of columns has changed.')
     19         return LogisticRegression.predict_proba(self, X)

ValueError: The number of columns has changed.
@amueller
Member

Scikit-learn currently has no concept of column labels. I agree this would be a nice to have feature. Don't hold your breath for it. Storing the column names at all would be an important first step, which I hope to achieve soonish. This will not be available in 0.18, and the consistency check probably not in 0.19 either. Scikit-learn operates on numpy arrays, which don't (currently) have column names. This might change in the future, and we might add better integration with pandas.

@amueller amueller added this to the 1.0 milestone Aug 25, 2016
@SauceCat
Author

Thanks for replying! Looking forward to the new version! 💯

@pjcpjc

pjcpjc commented Aug 7, 2017

Thanks for the edifying conversation guys!

@jnothman
Member

jnothman commented Aug 8, 2017

I think a wrapper to ensure column order is maintained between fit and predict/transform would be a good idea.
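
A minimal sketch of such a wrapper, assuming a generic meta-estimator (the class name and behaviour here are illustrative, not an existing API): it records the column order at fit time and reindexes before predict/transform.

import pandas as pd
from sklearn.base import BaseEstimator

class ColumnOrderWrapper(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        # remember the fit-time column order, if X has one
        self.columns_ = list(X.columns) if isinstance(X, pd.DataFrame) else None
        self.estimator.fit(X, y, **fit_params)
        return self

    def _reorder(self, X):
        if self.columns_ is not None and isinstance(X, pd.DataFrame):
            if list(X.columns) != self.columns_:
                X = X[self.columns_]  # KeyError if a fitted column is missing
        return X

    def predict(self, X):
        return self.estimator.predict(self._reorder(X))

    def transform(self, X):
        return self.estimator.transform(self._reorder(X))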

@pjcpjc

pjcpjc commented Aug 8, 2017

I would guess the way to do it would be to respect column names if they are present both for fit and predict, and to respect column ordering if they are missing in either (or both). Feature selection would also need to abide by this. I'm sure it's not as easy as it might first appear.

@jnothman
Member

jnothman commented Aug 8, 2017 via email

@pjcpjc

pjcpjc commented Aug 11, 2017

In my opinion this is an sklearn issue, not a pandas issue. As I said, not complaining, not expecting it to be terribly easy.

@amueller
Member

I would like to solve this inside scikit-learn, but I don't think there is consensus to do that (yet).
I would argue that we should protect our users against silent wrong results; other voices might say that users shouldn't pass dataframes?

@pjcpjc

pjcpjc commented Aug 11, 2017

other voices might say that users shouldn't pass dataframes?

You could print a warning that says "Beware when using Dataframes. sklearn uses column ordering not column labels. Downcast to numpy arrays to avoid this warning".

That would have saved my bacon here.

Pandas prints out a warning with the view vs. copy "gotcha", and that was enough to motivate me to study and pre-empt any problem. This issue seems similar. http://bit.ly/2qR1fDH , http://bit.ly/2pRbR7Q

@amueller
Member

@pjcpjc A warning is an option, but it would be very, very loud. Many people are passing dataframes.
I guess it depends a bit on whether we want them to pass dataframes or not. One problem is that if we warn, that basically tells users to use .values everywhere to convert to numpy arrays so they don't get a warning. But then this error becomes silent again, so we gained nothing. We just made users change their code so that we can never hope to catch the error.

Also, I think having the estimator know the feature names is helpful for model inspection.

@pjcpjc

pjcpjc commented Aug 11, 2017

I'd prefer for sklearn to respect column names if they are present both for fit and predict, and to respect column ordering if they are missing in either (or both). But I also think a warning would be better than how it works right now, where it's easy to get a false sense of security and end up with buggy code, which was my experience with this issue.

You can see my bug fix here. The original version didn't have cell 18. A warning would have flagged me that cell 18 was needed. The additional code changes to pass numpy arrays would be minor, if needed.

Just one opinion.

@amueller
Member

I'm not sure you got the problem with the warning. If there's always a warning, people will change their code so they don't get a warning, namely by passing in df.values, i.e. a numpy array without column names. Then later they change their code and the columns mismatch again, but they get no warning because they only passed a numpy array. So in the end, adding the warning didn't save them at all; it just created more work.

I totally agree that we should do better than we're doing now; I just think a warning whenever a dataframe is passed will not help.

@pjcpjc

pjcpjc commented Aug 11, 2017

I think I understand your point. My point is that if the warning links to a "see here for details" documentation page, and that page takes great pains to point out that you need to manage the column positions outside of sklearn, then my guess is it will save quite a few people.

I also suspect that the people who ignore the warning, and/or don't take it seriously and just do the simplest possible thing to make the warning go away, are probably un-saveable in any event. They will find some other way to muck up their data science practice.

That said, agree to disagree. It would have helped me, but it is certainly possible that I'm too unusual a case to be a useful guide.

@amueller
Member

Ok, I see where you're coming from. It might be a good way to make people aware that this issue exists at all. I still think it's a bit too loud and unspecific, though it might indeed save some people. I hope for a near-ish term better solution ;)

@jnothman
Member

I think warning for all pandas DataFrame input is overkill. Lots of people are loading DataFrames from static sources (as we recommend) and not reordering columns.

The options, as far as I'm concerned, are:

  • add some documentation on Scikit-learn with Pandas and pitfalls therein.
  • add a common test which ensures column name equality, otherwise raising a warning (and perhaps in a later version an error) when mismatched.

Obviously the latter is a substantial effort, and adds O(n_features) space to the pickle representation of an estimator unless we use a hash, but ... dataframes are important to support well.
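
For the hash variant, a constant-size digest of the column names would do; a sketch (column_hash here is a hypothetical helper, not an existing utility):

import hashlib

def column_hash(columns):
    # hash the names with a separator so ['ab', 'c'] != ['a', 'bc']
    h = hashlib.sha256()
    for name in columns:
        h.update(str(name).encode('utf-8'))
        h.update(b'\x00')
    return h.hexdigest()

# in fit:     self._column_hash = column_hash(X.columns)
# in predict: if column_hash(X.columns) != self._column_hash: warn or raise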

@jnothman
Member

If we do this, I'd suggest we extend check_array, at least to do the checking, and perhaps even to help store the estimator attribute. In terms of syntax, this could be as magical as check_X_y(X, y, ..., estimator=self) in fit and check_array(X, ..., estimator=self) in predict/transform, but maybe we'd prefer check_X_y(X, y, ..., store_column_hash=self) and check_array(X, ..., column_hash=self._column_hash). (We could make it even more magical with decorators or a metaclass, but we don't do that 'round here.)

There will be cases where this is hard to do because check_array is called by a function called by the estimator, rather than directly within it. We'd also need to do the same for warm_start and partial_fit.

In short, this is possible, but it'll be a big project and will touch everything.

@amueller
Member

I would rather not store hashes, because this information is very valuable for model inspection.
I'm not sure I like passing self to check_array, I guess the only alternative would be to return column_names and then store it, and pass column_names=self._column_names in predict / transform.
(if we do this, we probably want to really look into get_feature_names and the pipeline processing of the same, because it'd be AWESOME!)

@jnothman
Member

jnothman commented Aug 13, 2017 via email

@thebeancounter

So, for now, if I wish to use fit and then use transform on another df, what is the proper way to ensure the correct column order?
How can I guarantee that the columns of a pandas dataframe are in a specific order?

@jnothman
Member

jnothman commented Jan 5, 2018 via email

@jnothman
Member

Currently, estimators have ad-hoc checks that n_features matches between fit and predict.

Maybe we should implement something like this in BaseEstimator (for use after check_array):

def _fit_consistency(self, X):
    self.__n_features = X.shape[1]
    if isinstance(X, pd.DataFrame):
        self.__columns = X.columns

def _check_consistency(self, X):
    if self.__n_features != X.shape[1]:
        raise ...
    if isinstance(X, pd.DataFrame):
        # note the name mangling: __columns is stored as _BaseEstimator__columns
        if not hasattr(self, '_BaseEstimator__columns'):
            warn ...
        elif not self.__columns.equals(X.columns):
            raise ??
    elif hasattr(self, '_BaseEstimator__columns'):
        warn ...

@kienerj

kienerj commented Feb 1, 2018

I just wanted to add that this feature is extremely important out in the real world, where the model building process and the prediction process (exposed to users) are often completely separate.

The prediction process should not need to know about feature selection, feature order and so forth. It should just take all the features for the specific problem, and the estimator knows which ones it needs and in what order. This is how it works in other tools; it's extremely convenient, and the reason I prefer that tool over sklearn. Sklearn, however, offers more estimator types and better performance (in speed and in prediction quality, even for the same model types). So it would be great to have this here too.

I would imagine this happening either through pandas column names or, if you pass in numpy arrays, through an additional optional column_names parameter on fit and predict (not that it's great to have to create this list). If no column names are present, it works as it does now. (A rough sketch follows.)
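
A self-contained sketch of what that could look like (a hypothetical subclass; fit and predict_proba in sklearn accept no column_names argument today):

import numpy as np
from sklearn.linear_model import LogisticRegression

class NamedColumnsLogisticRegression(LogisticRegression):
    def fit(self, X, y, column_names=None, **kwargs):
        # remember the fit-time column order, if provided
        self.column_names_ = list(column_names) if column_names is not None else None
        return LogisticRegression.fit(self, X, y, **kwargs)

    def predict_proba(self, X, column_names=None):
        if self.column_names_ is not None and column_names is not None:
            # reorder the incoming columns to match the fit-time order
            order = [list(column_names).index(c) for c in self.column_names_]
            X = np.asarray(X)[:, order]
        return LogisticRegression.predict_proba(self, X)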

@jnothman
Member

jnothman commented Feb 1, 2018

I don't think your suggestion is going to happen.

Well, here are the options (not all mutually exclusive):

  1. Deprecate implicit conversion of DataFrames to NumPy arrays; in the future, the user would have to call .values explicitly
  2. Document it: Pandas Interoperability section in User Guide #10453 is up for grabs and your writing on the subject is fluent, @kienerj!
  3. Call df.sort_index(axis=1) in the future (with a warning during a deprecation period, as it is not backwards compatible), with the risk that this (a) increases runtime/memory; (b) does not identify changes in column names where the number of columns is preserved
  4. Put some PandasProtect wrapper or mixin in sklearn-pandas and tell people to use it if they want to be safe
  5. Invent a PandasCheck transformer (either here or in sklearn-pandas) and tell Pandas users to always use pipelines and put this at the beginning
  6. Store column names in fit and raise a warning (or perhaps error) in transform etc if not matched.

Number 5 is by far the easiest, although it requires users to be aware, so it's coupled with 2. It looks like this (with documentation and an awareness campaign):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class PandasCheck(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.columns_ = getattr(X, 'columns', None)
        return self

    def transform(self, X):
        check_is_fitted(self, 'columns_')
        if not np.all(self.columns_ == X.columns):
            raise ValueError('Mismatched columns in transform. Expected %r, got %r'
                             % (self.columns_, X.columns))
        return X
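
Usage would then look like this (a sketch reusing train, test and the feature column lists from the reproduction code at the top of this issue):

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(PandasCheck(), LogisticRegression())
pipe.fit(train[feature_columns_train], train['target'])
pipe.predict_proba(test[feature_columns_test])  # raises ValueError: mismatched columns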

@qdeffense
Contributor

I've found these functions useful:

import collections.abc
from typing import List

import pandas as pd
from sklearn.utils import check_array

def check_array_pandas(X: pd.DataFrame,
                       feature_names: List,
                       *args, **kwargs):
    ''' A wrapper for sklearn's check_array

    Used to check and enforce column names if pandas dataframes are used.

    Parameters
    --------
    X: pd.DataFrame
        The input you want to check.
    feature_names: list of column names
        The column names you'd like to check, in the order you want to enforce.
    *args
        *args for check_array
    **kwargs
        **kwargs for check_array

    Returns
    --------
    X: array-like
        With columns ordered like feature_names and all other columns removed,
        then put through check_array.

    Raises
    --------
    KeyError:
        If any feature_names aren't in X.columns.
    TypeError:
        If X is not a pd.DataFrame, or feature_names is not iterable.

    '''
    if feature_names is not None:
        if isinstance(X, pd.DataFrame):
            if isinstance(feature_names, collections.abc.Iterable):
                # selecting by name also reorders the columns
                X = X[feature_names]
            else:
                raise TypeError('feature_names should be iterable.')
        else:
            raise TypeError('X should be a DataFrame for safety purposes.')

    return check_array(X, *args, **kwargs)

Do this in fit:

self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None

And this in __init__:

self.feature_names_ = None

Then just replace all calls to check_array with check_array_pandas.

Wouldn't that work?

@ryanpeach

Please see the test cases in #12936

@qdeffense
Contributor

@jnothman what do you think?

@jnothman
Member

jnothman commented Jan 8, 2019

  • I suspect we should raise an error when the order is different from in fit, rather than reordering, in part to alert users who thought their code was working in previous scikit-learn versions but actually were mixing their features. We can always change it to reordering later.
  • feature_names_ is ambiguous. It needs a better name.
  • I am a little hesitant about putting self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None in each estimator. I would rather this be abstracted away into a function or method, e.g. set_X_signature(self, X), with check_array supporting X_signature to check integrity (a rough sketch follows).
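
A rough sketch of that abstraction (all names hypothetical; nothing like this exists in scikit-learn today):

import pandas as pd

def set_X_signature(estimator, X):
    # record the input signature during fit
    estimator._X_signature = (
        X.shape[1],
        list(X.columns) if isinstance(X, pd.DataFrame) else None,
    )

def check_X_signature(estimator, X):
    # verify X against the signature recorded during fit
    n_features, columns = estimator._X_signature
    if X.shape[1] != n_features:
        raise ValueError('X has %d features; expected %d' % (X.shape[1], n_features))
    if columns is not None and isinstance(X, pd.DataFrame):
        if list(X.columns) != columns:
            raise ValueError('Column mismatch: expected %r, got %r'
                             % (columns, list(X.columns)))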

@jnothman
Member

jnothman commented Jan 8, 2019

We could potentially make setting (not just checking) the signature a job for check_array, e.g. check_array(X, set_signature=self)

Getting someone to implement these changes (preferably someone who knows how to edit files en masse) is the bigger challenge. In practice, where an estimator delegates checking of X to some other function, it's not going to be trivial to implement everywhere.

@ryanpeach

It's difficult because things in the codebase don't call super() on BaseEstimator much.

@ryanpeach

ryanpeach commented Jan 9, 2019

I disagree on raising an error. Data frames are philosophically different from arrays. People are likely passing data frames in the wrong order at the moment, but they aren't wrong to do so; they just aren't aware that the sklearn package hasn't handled their edge case. Pandas dataframe column ordering has always existed, but it's just not talked about, because anyone using pandas would never iloc a column when indexing... you index columns by name.

Maybe a warning as we transition, but I wouldn't recommend an error.

@jnothman
Member

jnothman commented Jan 9, 2019 via email

@ryanpeach

I absolutely agree that a warning is a must. I agree with your reply 100%.

A compromise may be to issue an update that creates a warning ASAP, by collecting column names where appropriate and creating the architecture of the solution. Then in the next breaking update, implement automatic reordering.

@ryanpeach

As for extra columns: in my own code I have implemented a lot of what we are talking about in the context of cross-validated feature selectors, such as RFECV. By accepting dataframes of any size, you allow these learners to always be correct. Users can give them just the selected columns at predict time, the full data they fit on, or any arbitrary dataframe. The goal in my mind is to enable the agnostic use of estimators, because these estimators are going to be pickled and long forgotten in a repo somewhere. I don't want to use them mistakenly when I dig them out of the filing cabinet.

@jnothman
Member

jnothman commented Jan 13, 2019 via email

@SreejeshPM

Is there any workaround for this problem, other than reordering the columns?

@aishwariya96

Hello,
I ran into the same issue and tried this:

df2 = df2[df1.columns]

where df2 is the data frame whose columns need to be ordered like those in df1.

@amueller
Member

amueller commented May 27, 2020

Hm, we still don't have a SLEP for this, right? I think we should try writing one. @NicolasHug did you want to do that?
This is related to feature names, but the current feature name ideas might not actually solve this issue.
@adrinjalali do you have thoughts on whether we want to do this before doing feature names?

@thomasjpfan
Member

I think the options are mostly listed in #7242 (comment)

@amueller I know you do not like this, but I tend to prefer PandasCheck,
which also enables #12627 (comment). This approach only does the check at the beginning of the pipeline. The major argument against it is that "pandas users need to use a pipeline to have column name consistency".

The alternative is to store the names everywhere. I can't get over the fact that to have this consistency, users would need to pay a price in memory. Do we want users to buy more RAM to get column name consistency?

@kienerj

kienerj commented May 27, 2020

I'm with @ryanpeach

By accepting dataframes of any size, you allow these learners to always be correct.

and my initial comment on this issue. This is from a naive end-user point of view, not knowing the fundamental architecture of sklearn and just being concerned with making it easy to get correct predictions.

Do we want users to buy more ram to get column name consistency?

How much RAM does storing even thousands of column names really use? Even if you go all-in and have thousands of very long column names, it adds up to a couple hundred MB. As long as it is optional, I don't see the issue, given the huge convenience gain. Getting 64 GB of RAM doesn't cost all that much compared to developer time, or, even worse, wrong results.

@amueller
Member

amueller commented May 27, 2020

I want 6). I think 1, 2 and 3 don't solve the issue. 4 and 5 solve it if the user is aware of the issue and actively takes steps to prevent a problem from occurring. They sort of solve the issue, but not the core issue of user expectations and silent errors. Also, they are really verbose.

I guess in that I disagree with @ryanpeach in #7242 (comment) but his solution is not even on the list ;)

However, if the first step is a ColumnTransformer, then we actually do the "right" thing, in that we rearrange the columns. That makes me think: is inserting ColumnTransformer(remainder='passthrough') at the beginning of a pipeline actually implementing the feature @ryanpeach wants? (A sketch follows.)
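
A sketch of that idea, reusing the frames from the reproduction code; this variant selects the training columns by name explicitly, since selection by name is what performs the rearranging:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    ColumnTransformer([('features', 'passthrough', feature_columns_train)]),
    LogisticRegression(),
)
pipe.fit(train[feature_columns_train], train['target'])
# columns are picked by name, so a reordered test frame is handled correctly
pipe.predict_proba(test[feature_columns_test])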

@adrinjalali
Member

@adrinjalali do you have thoughts on whether we want to do this before doing the feature names?

I think this is independent of output_feature_names_, and I'd be in favor of the same kind of validation we have in ColumnTransformer everywhere. I think implementing input_feature_names_ only if the given input has feature names makes sense, and it can be implemented even if it's not propagated through the pipeline.

This should also be really easy to implement now that we have @NicolasHug 's work on _validate_data merged, and I don't think we need a SLEP for it, for the same reason that we didn't need a SLEP to implement the validation in ColumnTransformer.

@amueller
Member

@adrinjalali Cool, I totally agree (and encouraged @NicolasHug to draft a PR) ;)
Would love to hear @jnothman's thoughts as well

@amueller
Member

amueller commented Jun 3, 2020

Also see #17407

@adrinjalali
Member

Now that we validate column names, this is partially fixed if the user uses a ColumnTransformer, right?

Removing the milestone.

@thomasjpfan
Member

We now validate feature names by default and raise a warning if the column names are not in the same order. In 1.2, we will start raising an error.
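
A minimal illustration of the behaviour described above (assuming scikit-learn >= 1.0):

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [1, 0, 1, 0]})
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
clf.predict(X[['b', 'a']])
# 1.0/1.1: warns that the feature names do not match those seen during fit
# 1.2+:    raises a ValueError instead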
