Potential error caused by different column order #7242

SauceCat · 2016-08-25T07:28:34Z

Description

Sometimes it is convenient to first build a model on a recent dataset, save it as a .pkl file and then apply the model to the new dataset. However, in the last project, my friends and I found that the results turned quite wired after applying the .pkl file on the new dataset. Actually, we implemented a binary classifier. We found the probability distribution turned from unimodal distribution to bimodal distribution. Finally, we found out the problem was that the column order of the new dataset was different from the old one. Thus the predictions were totally wrong.
I have checked the source code and discovered that the fit function of sklean didn't save the column values during the process of model training. Thus there was no mean to check whether the column values were consistent during the processing of prediction. We thought it would be better if the column values could be saved during training and then be used to check the column values during predicting.

Steps/Code to Reproduce

#for simplification, consider a very simple case
from sklearn.datasets import load_iris
import pandas as pd

#make a dataframe
iris = load_iris()
X, y = iris.data[:-1,:], iris.target[:-1]
iris_pd = pd.DataFrame(X)
iris_pd.columns = iris.feature_names
iris_pd['target'] = y

from sklearn.cross_validation import train_test_split
train, test = train_test_split(iris_pd, test_size= 0.3)

feature_columns_train = ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
feature_columns_test = ['sepal length (cm)','sepal width (cm)','petal width (cm)','petal length (cm)']

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
lg.fit(train[feature_columns_train], train['target'])

prob1 = lg.predict_proba(test[feature_columns_train])
prob2 = lg.predict_proba(test[feature_columns_test])

Expected Results

Because feature_columns_test is different from feature_columns_train, it is not surprised that prob1 is totally different from prob2 and prob1 should be the right result.

prob1[:5] = 
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
         [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
         [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
         [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
         [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Actual Results

prob2[:5] = 
array([[  4.36321678e-01,   2.25057553e-04,   5.63453265e-01],
         [  4.92513658e-01,   1.76391882e-05,   5.07468703e-01],
         [  9.92946715e-01,   7.05167151e-03,   1.61346947e-06],
         [  9.83726756e-01,   1.62387090e-02,   3.45348884e-05],
         [  5.01392274e-01,   5.37144591e-04,   4.98070581e-01]])

Versions

Linux-2.6.32-642.1.1.el6.x86_64-x86_64-with-redhat-6.7-Santiago
('Python', '2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec  6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.1')
('SciPy', '0.16.0')
('Scikit-Learn', '0.17')

The probable solution

I also implement a very simple solution. Hope this would help. :)

class SafeLogisticRegression(LogisticRegression):
    def fit(self, X, y, sample_weight=None):
        self.columns = X.columns
        LogisticRegression.fit(self, X, y, sample_weight=None)
    def predict_proba(self, X):
        new_columns = list(X.columns)
        old_columns = list(self.columns)
        if new_columns != old_columns:
            if len(new_columns) == len(old_columns):
                try:
                    X = X[old_columns]
                    print "The order of columns has changed. Fixed."
                except:
                    raise ValueError('The columns has changed. Please check.')
            else:
                raise ValueError('The number of columns has changed.')
        return LogisticRegression.predict_proba(self, X)

Then apply this new class:

slg = SafeLogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
slg.fit(train[feature_columns_train], train['target'])

Test one: if the column order is changed

prob1 = slg.predict_proba(test[feature_columns_train])
prob2 = slg.predict_proba(test[feature_columns_test])

#The order of columns has changed. Fixed.

Result for test one:

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

prob2[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Test two: if the columns are different (different columns)

Simulate by changing one of the column names

prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-84cea68536fe> in <module>()
----> 1 prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     12                     print "The order of columns has changed. Fixed."
     13                 except:
---> 14                     raise ValueError('The columns has changed. Please check.')
     15             else:
     16                 raise ValueError('The number of columns has changed.')

ValueError: The columns has changed. Please check.

Test three: if the number of columns changes

Simulate by dropping one column

prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-47c63ae1ac22> in <module>()
----> 1 prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     14                     raise ValueError('The columns has changed. Please check.')
     15             else:
---> 16                 raise ValueError('The number of columns has changed.')
     17         return LogisticRegression.predict_proba(self, X)

ValueError: The number of columns has changed.

amueller · 2016-08-25T17:32:58Z

Scikit-learn currently has no concept of column labels. I agree this would be a nice to have feature. Don't hold your breath for it. Storing the column names at all would be an important first step, which I hope to achieve soonish. This will not be available in 0.18, and the consistency check probably not in 0.19 either. Scikit-learn operates on numpy arrays, which don't (currently) have column names. This might change in the future, and we might add better integration with pandas.

SauceCat · 2016-08-26T05:39:10Z

Thanks for replying! Looking forward to the new version! 💯

pjcpjc · 2017-08-07T20:59:03Z

Thanks for the edifying conversation guys!

jnothman · 2017-08-08T00:39:06Z

I think a wrapper to ensure column order is maintained between fit and predict/transform would be a good idea.

pjcpjc · 2017-08-08T00:51:51Z

I would guess the way to do it would be respect column names if they were present both for fit and predict, and respect column ordering if they were missing in either (or both). There is also feature selection that would need to abide. I'm sure its not as easy as it might first appear.

jnothman · 2017-08-08T01:01:05Z

I don't think it would be a big deal to develop a wrapper, but I would consider it the place of https://github.com/pandas-dev/sklearn-pandas to do so (if it does not already). Raise an issue there?

…

On 8 August 2017 at 10:51, Pete Cacioppi ***@***.***> wrote: I would guess the way to do it would be respect column names if they were present both for fit and predict, and respect column ordering if they were missing in either (or both). There is also feature selection that would need to abide. I'm sure its not as easy as it might first appear. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#7242 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz60fB1VaFhzaJwZ9wt9RAfMRsQjIyks5sV7EpgaJpZM4JsxVp> .

pjcpjc · 2017-08-11T18:52:59Z

In my opinion this is an sklearn issue not a pandas issue. As I said, not complaining, not expecting it to be terribly easy.

amueller · 2017-08-11T19:55:36Z

I would like to solve this inside scikit-learn, but I don't think there is consensus to do that (yet).
I would argue that we should protect our users against silent wrong results, other voices might say that users shouldn't pass dataframes?

pjcpjc · 2017-08-11T20:07:35Z

other voices might say that users shouldn't pass dataframes?

You could print a warning that says "Beware when using Dataframes. sklearn uses column ordering not column labels. Downcast to numpy arrays to avoid this warning".

That would have saved my bacon here.

Pandas prints out a warning with the view vs. copy "gotcha", and that was enough to motivate me to study and pre-empt any problem. This issue seems similar. http://bit.ly/2qR1fDH , http://bit.ly/2pRbR7Q

amueller · 2017-08-11T20:12:58Z

@pjcpjc warning is an option, but this would be very very loud. Many people are passing dataframes.
I guess it depends a bit on whether we want them to pass dataframes or not. One problem is that if we warn, that basically tells users to use .values everywhere to convert to numpy arrays, so they don't get a warning. But then this error becomes silent again, so we won nothing. We just made the users change their code so that we can never hope to catch the error.

Also, I think having the estimator know the feature names is helpful for model inspection.

pjcpjc · 2017-08-11T20:22:53Z

I'd prefer for sklearn to respect column names if they are present both for fit and predict, and respect column ordering if they were missing in either (or both). But I also think a warning would be better than how it works right now, where its easy to get a false sense of security and end up with buggy code, which was my experience with this issue.

You can see my bug fix here. The original version didn't have cell 18. A warning would have flagged me that cell 18 was needed. The additional code changes to pass numpy arrays would be minor, if needed.

Just one opinion.

amueller · 2017-08-11T22:12:58Z

I'm not sure you got the problem with the warning. If there's always a warning, people will change their code so they don't get a warning, which is by passing in df.values, i.e. a numpy array without column names. Then later they change their code and columns mismatch again but they get no warning because they only passed a numpy array. So in the end, adding the warning didn't safe them at all, just created more work.

I totally agree that we should do better then we're doing now, I just think a warning whenever a dataframe is passed will not help.

pjcpjc · 2017-08-11T22:48:03Z

I think I understand your point. My point is if the warning links to a "see here for details" documentation page, and that page takes great pains to point out that you need to manage the column positions outside of sklearn, then my guess it will save quite a few people.

I also suspect that the people who ignore the warning, and/or don't take it seriously and just do the simplest possible thing to make the warning go away, are probably un-saveable in any event. They will find some other way to muck up their data science practice.

That said - agree to disagree. It would have helped me - but it certainly is possible I'm too unusual to be of guidance.

amueller · 2017-08-11T22:56:36Z

Ok, I see where you're coming from. It might be a good way to make people aware that this issue potentially exists at all. I still thinks it's a bit too loud and unspecific, though it might indeed save some people. I hope for a near-ish term better solution ;)

jnothman · 2017-08-13T13:19:01Z

I think warning for all pandas DataFrame input is overkill. Lots of people are loading DataFrames from static sources (as we recommend) and not reordering columns.

The options, as far as I'm concerned are:

add some documentation on Scikit-learn with Pandas and pitfalls therein.
add a common test which ensures column name equality, otherwise raising a warning (and perhaps in a later version an error) when mismatched.

Obviously the latter is a substantial effort, and adds O(n_features) space to the pickle representation of an estimator unless we use a hash, but ... dataframes are important to support well.

jnothman · 2017-08-13T13:26:47Z

If we do this, I'd suggest we extend check_array, at least to do the checking, and perhaps even to help store the estimator attribute. In terms of syntax, this could be as magical as check_X_y(X, y, ..., estimator=self) in fit and check_array(X, ..., estimator=self) in predict/transform, but maybe we'd prefer check_X_y(X, y, ..., store_column_hash=self) and check_array(X, ..., column_hash=self._column_hash). (We could make it even more magical with decorators or a metaclass, but we don't do that 'round here.)

There will be cases where this is hard to do because check_array is called by a function called by the estimator, rather than directly within it. We'd also need to do the same for warm_start and partial_fit.

In short, this is possible, but it'll be a big project and will touch everything.

amueller · 2017-08-13T15:27:50Z

I would rather not store hashes, because this information is very valuable for model inspection.
I'm not sure I like passing self to check_array, I guess the only alternative would be to return column_names and then store it, and pass column_names=self._column_names in predict / transform.
(if we do this, we probably want to really look into get_feature_names and the pipeline processing of the same, because it'd be AWESOME!)

jnothman · 2017-08-13T21:46:07Z

I'm okay with storing names but it might be an expensive addition to pickling. In the numpy case, we should use the same method to store number of features. Do we need to be checking consistency between fit and predict in any other way? (The thing with get_feature_names is that I only think it's helpful to automatically name features in case of: * Feature selection * Feature union * Scaling (don't change name) * One Hot Where output features combine inputs, it becomes too custom.)

…

On 14 Aug 2017 1:27 am, "Andreas Mueller" ***@***.***> wrote: I would rather not store hashes, because this information is very valuable for model inspection. I'm not sure I like passing self to check_array, I guess the only alternative would be to return column_names and then store it, and pass column_names=self._column_names in predict / transform. (if we do this, we probably want to really look into get_feature_names and the pipeline processing of the same, because it'd be AWESOME!) — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#7242 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6xfcMWcHelqiI09zJlcgIchrrU_Kks5sXxX4gaJpZM4JsxVp> .

thebeancounter · 2018-01-04T16:06:16Z

So, for now if I wish to use fit, and then use transform on another df, what is the proper way to be able to use the proper column order?
How can i guaranty pandas dataframe columns to be in a specific order?

jnothman · 2018-01-05T03:03:51Z

`df.loc[:, list of column names]` before both fit and transform should work fine

jnothman · 2018-01-11T06:53:30Z

Currently, estimators have ad-hoc checks that n_features match from fit to predict.

Maybe we should implement in BaseEstimator (for use after check_array):

def _fit_consistency(self, X):
    self.__n_features = X.shape[1]
    if isinstance(X, pd.DataFrame):
        self.__columns = X.columns

def _check_consistency(self, X):
    if self.__n_features != X.shape[1]: raise ...
    if isinstance(X, pd.DataFrame):
        if not hasattr(self, 'BaseEstimator__columns'): warn ...
        if self.__columns != X.columns: raise ??
    elif hasattr(self, 'BaseEstimator__columns'): warn ...

kienerj · 2018-02-01T07:17:13Z

I just wanted to add that this feature is extremely important out in the real-world where the model building process and the prediction process (exposed to users) often are completely separate.

The prediction process should not need to know about feature selection, feature order and so forth. It should just take all the features for the specific problem and the estimator knows which ones it needs in what order. This is how it works in other tools. It's extremely convenient and the fact I prefer that tool over sklearn. Sklearn however offers more estimator types and better performance (as in speed and prediction quality even for same model types). So it would be great to have this here too.

I would imagine this happens either through pandas column names or if you pass in numpy arrays fit and predict could have an additional optional parameter column_names (not that great to create this list). If no column names are present, it works like it does now.

jnothman · 2018-02-01T10:53:20Z

I don't think your suggestion is going to happen.

Well, here are the options (not all mutually exclusive):

Deprecate implicit conversion of DataFrames to NumPy arrays; in the future, the user would have to call .values explicitly in the future
Document it: Pandas Interoperability section in User Guide #10453 is up for grabs and your writing on the subject is fluent, @kienerj!
Call df.sort_index(axis=1) in the future (with a warning for a deprecation period as it is not backwards compatible), with the risk that this (a) increases runtime/memory; (b) does not identify changes in column names where the number of columns is preserved
Put some PandasProtect wrapper or mixin in sklearn-pandas and tell people to use it if they want to be safe
Invent a PandasCheck transformer (either here or in sklearn-pandas) and tell Pandas users to always use pipelines and put this at the beginning
Store column names in fit and raise a warning (or perhaps error) in transform etc if not matched.

Number 5. is by far the easiest, although it requires users to be aware, so it's coupled with 2. It looks like this (with documentation and an awareness campaign):

class PandasCheck(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.columns_ = getattr(X, 'columns', None)
        return self
    def transform(self, X):
        check_is_fitted(self, 'columns_')
        if not np.all(self.columns_ == X.columns):
            raise ValueError('Mismatched columns in transform. Expected %r, got %r' % (self.columns_, X.columns))
        return X

qdeffense · 2019-01-08T12:53:00Z

I've found these functions useful:

from sklearn.utils import check_array
def check_array_pandas(X: pd.DataFrame,
                       feature_names: List,
                       *args, **kwargs):
    ''' A wrapper for sklearn check_array

    Used to check and enforce column names if pandas dataframes are used.

    Parameters
    --------
    X: pd.DataFrame
        The input you want to check.
    feature_names: list of column names
        The column names you'd like to check and their ordering you want to enforce.
    *args
        *args for check_array
    **kwargs
        **kwargs for check_array

    Returns
    --------
    X: array like
        With columns ordered like feature_names and all other columns removed.
        Also then put through check_array.

    Raises
    --------
    KeyError:
        If any feature_names aren't in X.columns.
    TypeError:
        If X is not a pd.DataFrame.
    ValueError:
        If feature_names is None.

    '''
    if feature_names is not None:
        if isinstance(X, pd.DataFrame):
            if isinstance(feature_names, (collections.Iterable)):
                X = X[feature_names]
            else:
                raise TypeError('feature_names should be iterable.')
        else:
            raise TypeError('X should be a DataFrame for safety purposes.')

    return check_array(X, *args, **kwargs)

Do this in fit:

self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None

And this in init

self.feature_names_ = None

Then just replace all calls to check_array with check_array_pandas

wouldn't that work then ?

ryanpeach · 2019-01-08T13:25:25Z

Please see the test cases in #12936

qdeffense · 2019-01-08T17:13:37Z

@jnothman what do you think ?

jnothman · 2019-01-08T22:40:13Z

I suspect we should raise an error when the order is different from in fit, rather than reordering, in part to alert users who thought their code was working in previous scikit-learn versions but actually were mixing their features. We can always change it to reordering later.
feature_names_ is ambiguous. It needs a better name.
I am a little hesitant about putting self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None in each estimator. I would rather this abstracted away into a function or method, e.g. set_X_signature(self, X) with check_array supporting X_signature to check integrity.

jnothman · 2019-01-08T22:44:50Z

We could potentially make setting (not just checking) the signature a job for check_array, e.g. check_array(X, set_signature=self)

Getting someone to implement these changes (preferably someone who knows how to edit files en-masse) is more the challenge. In practice where an estimator delegates checking of X to some other function, it's not going to be trivial to implement everywhere.

ryanpeach · 2019-01-09T14:15:52Z

Its difficult because things don't super baseestimator much in the codebase.

ryanpeach · 2019-01-09T14:19:00Z

I disagree on raising an error. Data frames are different philosophically than arrays. People are likely passing data frames in the wrong order at the moment, but they aren't wrong to do so they just aren't aware the sklearn package hasnt handled their edge case. Pandas dataframe column ordering has always existed, but it's just not talked about, because anyone using pandas would never iloc a column when indexing... You index columns by name.

Maybe a warning as we transition, but I wouldnt recommend an error.

jnothman · 2019-01-09T22:33:26Z

I appreciate the difference in philosophy, and that being order-invariant for dataframes is something we might like in the future. Something we would need to also decide there is whether to accept data at predict time that includes extra columns. I think you will find some core devs strongly opposed to accepting dataframes with extra columns at predict time. In any case, I think we should behave responsibly and notify the users that they may have not previously been getting what they expected. If not an exception, we can go for a warning, e.g. "X.columns is different from those in fitting. In releases of scikit-learn before 0.xx this would have silently resulted in incorrect output. This warning will be removed in version 0.xx+1."

ryanpeach · 2019-01-11T16:40:34Z

I absolutely agree that a warning is a must. I agree with your reply 100%.

A compromise may be to issue an update that creates a warning ASAP, by collecting column names where appropriate and creating the architecture of the solution. Then in the next breaking update, implement automatic reordering.

ryanpeach · 2019-01-11T16:44:01Z

As for extra columns, in my own code I have implemented a lot of what we are talking about in the context of cross validated feature selectors, such as RFECV. By accepting dataframes of any size, you allow these learners to always be correct. User can give them just the columns they selected on predict, the data they fit on on predict, or any arbitrary dataframe on predict. The goal in my mind is to enable the agnostic use of estimators, because these estimators are going to be pickled and long forgotten in a repo somewhere. I dont want to use them mistakenly when I dig them out of the filing cabinet.

jnothman · 2019-01-13T22:26:34Z

I would be happy to review an implementation PR

SreejeshPM · 2019-01-21T13:27:32Z

Is there any workaround for this problem. Other than reordering of the columns.

aishwariya96 · 2019-05-06T08:36:22Z

Hello,
I went through the same issue and tried this

df2 = df2[df1.columns]
df2 is the data frame where the columns need to be ordered like in the df1.

amueller · 2020-05-27T14:55:48Z

Hm we still don't have a slep for this, right? I think we should try doing one. @NicolasHug did you want to do that?
This is related to feature names, but actually the current feature name ideas might not solve this issue.
@adrinjalali do you have thoughts on whether we want to do this before doing the feature names?

thomasjpfan · 2020-05-27T15:17:42Z

I think the options are mostly listed in #7242 (comment)

@amueller I know you do not like this, but I tend to prefer PandasCheck
which also enables #12627 (comment) . This approach only does the check in the beginning. The major argument against storing is "pandas users need to use a pipeline to hav column name consistency".

The alternative is to store the names everywhere. I can't get over the fact that to have this consistency, users would need to pay a price in memory. Do we want users to buy more ram to get column name consistency?

kienerj · 2020-05-27T15:41:50Z

I'm with @ryanpeach

By accepting dataframes of any size, you allow these learners to always be correct.

and my initial comment to this issue. This is from a naive end-user point of view not knowing the fundamental architecture of sklearn and just being concerned about making it easy for me to get correct predictions.

Do we want users to buy more ram to get column name consistency?

How much RAM does storing even thousands of column names really use up? I mean if you really go all-in and have thousand very long column names, it adds up to couple 100s of MB. As long as it is optional, I don't see the issue given the huge convenience gain. I mean getting 64 GB of RAM doesn't cost all that much, compared to developer time or even worse wrong results.

amueller · 2020-05-27T15:52:09Z

I want 6). I think 1, 2 and 3 don't solve the issue. 4 and 5 solve the issue if the user is aware of the issue and actively takes steps to prevent a problem from occurring. They sort-of solve the issue but not really the core issue of user expectations and silent errors. Also they are really verbose.

I guess in that I disagree with @ryanpeach in #7242 (comment) but his solution is not even on the list ;)

However, if the first step is a ColumnTransformer, then we actual do the "right" thing in that we rearrange the columns - makes me think, is inserting ColumnTransformer(remainder='passthrough') to the beginning of a pipeline actually implementing the feature @ryanpeach wants?

adrinjalali · 2020-05-27T20:43:15Z

@adrinjalali do you have thoughts on whether we want to do this before doing the feature names?

I think this is independent of output_feature_names_, and I'd be in favor of the same kind of validation we have in ColumnTransformer everywhere. I think implementing the input_feature_names_ only if the given input has feature names makes kinda sense, and it can be implemented even if it's not propagated through the pipeline.

This should also be really easy to implement now that we have @NicolasHug 's work on _validate_data merged, and I don't think we need a SLEP for it, for the same reason that we didn't need a SLEP to implement the validation in ColumnTransformer.

amueller · 2020-05-28T01:54:48Z

@adrinjalali Cool, I totally agree (and encouraged @NicolasHug to draft a PR) ;)
Would love to hear @jnothman's thoughts as well

amueller · 2020-06-03T19:33:31Z

Also see #17407

adrinjalali · 2021-08-22T09:12:02Z

Now that we validate column names, this is partially fixed if the user uses a ColumnTransformer, right?

Removing the milestone.

thomasjpfan · 2022-02-22T21:51:46Z

We now validate feature names by default and raise a warning if the column names are not in the same order. In 1.2, we will start raising an error.

amueller added this to the 1.0 milestone Aug 25, 2016

amueller added the New Feature label Aug 25, 2016

amueller mentioned this issue Sep 9, 2016

linear regression predictions unstable #7378

Closed

pomber mentioned this issue Jul 19, 2017

Provide an array endpoint hexacta/docker-sklearn-predict-http-api#2

Open

jnothman added API Bug labels Aug 13, 2017

jnothman mentioned this issue Jan 11, 2018

Pandas Interoperability section in User Guide #10453

Closed

jnothman mentioned this issue Jan 8, 2019

sklearn pandas dataframe support and column name checking. #12936

Closed

amueller mentioned this issue Jul 3, 2019

RFC ColumnTransformer input validation and requirements #14251

Closed

amueller mentioned this issue Jul 17, 2019

Version 1.0 of scikit-learn #14386

Closed

amueller mentioned this issue Mar 5, 2020

OneHotEncoder.get_feature_names doesn't work with integer column names #16593

Open

fberanizo mentioned this issue Jun 6, 2020

Add Filter and Pre Selection Components platiagro/projects#65

Merged

adrinjalali removed this from the 1.0 milestone Aug 22, 2021

AlexOlza mentioned this issue Dec 7, 2021

HP1: Ver por qué los nuevos datos de ingresos cambian tanto los modeloslos resultados (24 individuos diferentes a los datos viejos) AlexOlza/estratificacion#1

Closed

thomasjpfan closed this as completed Feb 22, 2022

Potential error caused by different column order #7242

Potential error caused by different column order #7242

Comments

SauceCat commented Aug 25, 2016

Description

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

The probable solution

Test one: if the column order is changed

Test two: if the columns are different (different columns)

Test three: if the number of columns changes

amueller commented Aug 25, 2016

SauceCat commented Aug 26, 2016

pjcpjc commented Aug 7, 2017

jnothman commented Aug 8, 2017

pjcpjc commented Aug 8, 2017

jnothman commented Aug 8, 2017 via email

pjcpjc commented Aug 11, 2017

amueller commented Aug 11, 2017

pjcpjc commented Aug 11, 2017

amueller commented Aug 11, 2017

pjcpjc commented Aug 11, 2017 • edited Loading

amueller commented Aug 11, 2017

pjcpjc commented Aug 11, 2017 • edited Loading

amueller commented Aug 11, 2017

jnothman commented Aug 13, 2017

jnothman commented Aug 13, 2017

amueller commented Aug 13, 2017

jnothman commented Aug 13, 2017 via email

thebeancounter commented Jan 4, 2018

jnothman commented Jan 5, 2018 via email

jnothman commented Jan 11, 2018

kienerj commented Feb 1, 2018

jnothman commented Feb 1, 2018 • edited Loading

qdeffense commented Jan 8, 2019

ryanpeach commented Jan 8, 2019

qdeffense commented Jan 8, 2019

jnothman commented Jan 8, 2019

jnothman commented Jan 8, 2019

ryanpeach commented Jan 9, 2019

ryanpeach commented Jan 9, 2019 • edited Loading

jnothman commented Jan 9, 2019 via email

ryanpeach commented Jan 11, 2019

ryanpeach commented Jan 11, 2019

jnothman commented Jan 13, 2019 via email

SreejeshPM commented Jan 21, 2019

aishwariya96 commented May 6, 2019

amueller commented May 27, 2020 • edited Loading

thomasjpfan commented May 27, 2020

kienerj commented May 27, 2020

amueller commented May 27, 2020 • edited Loading

adrinjalali commented May 27, 2020

amueller commented May 28, 2020

amueller commented Jun 3, 2020

adrinjalali commented Aug 22, 2021

thomasjpfan commented Feb 22, 2022

pjcpjc commented Aug 11, 2017 •

edited

Loading

pjcpjc commented Aug 11, 2017 •

edited

Loading

jnothman commented Feb 1, 2018 •

edited

Loading

ryanpeach commented Jan 9, 2019 •

edited

Loading

amueller commented May 27, 2020 •

edited

Loading

amueller commented May 27, 2020 •

edited

Loading