Potential error caused by different column order #7242
Scikit-learn currently has no concept of column labels. I agree this would be a nice-to-have feature, but don't hold your breath for it. Storing the column names at all would be an important first step, which I hope to achieve soonish. This will not be available in 0.18, and the consistency check probably not in 0.19 either. Scikit-learn operates on numpy arrays, which don't (currently) have column names. This might change in the future, and we might add better integration with pandas.
Thanks for replying! Looking forward to the new version! 💯
Thanks for the edifying conversation, guys!
I think a wrapper to ensure column order is maintained between fit and predict/transform would be a good idea.
I would guess the way to do it would be to respect column names if they were present both for fit and predict, and to respect column ordering if they were missing in either (or both). There is also feature selection that would need to abide. I'm sure it's not as easy as it might first appear.
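A minimal sketch of that policy (align by name when both sides carry names, fall back to positional order otherwise); `align_features` and its behaviour are hypothetical, not scikit-learn API:

```python
import numpy as np


def align_features(X, fitted_columns=None):
    """Reorder X to match fit-time columns by name, else trust position."""
    if fitted_columns is not None and hasattr(X, "columns"):
        missing = set(fitted_columns) - set(X.columns)
        if missing:
            raise ValueError("Columns missing at predict time: %r" % missing)
        # Select by name: extra columns are dropped, fit-time order restored.
        return X.loc[:, list(fitted_columns)]
    # Names absent on one side (or both): fall back to positional order.
    return np.asarray(X)
```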
I don't think it would be a big deal to develop a wrapper, but I would
consider it the place of https://github.com/pandas-dev/sklearn-pandas to do
so (if it does not already). Raise an issue there?
In my opinion this is an sklearn issue, not a pandas issue. As I said, I'm not complaining, and not expecting it to be terribly easy.
I would like to solve this inside scikit-learn, but I don't think there is consensus to do that (yet).
You could print a warning that says "Beware when using Dataframes. sklearn uses column ordering not column labels. Downcast to numpy arrays to avoid this warning". That would have saved my bacon here. Pandas prints out a warning with the view vs. copy "gotcha", and that was enough to motivate me to study and pre-empt any problem. This issue seems similar. http://bit.ly/2qR1fDH , http://bit.ly/2pRbR7Q
@pjcpjc A warning is an option, but it would be very, very loud. Many people are passing dataframes. Also, I think having the estimator know the feature names is helpful for model inspection.
I'd prefer for sklearn to respect column names if they are present both for fit and predict, and respect column ordering if they were missing in either (or both). But I also think a warning would be better than how it works right now, where it's easy to get a false sense of security and end up with buggy code, which was my experience with this issue. You can see my bug fix here. The original version didn't have cell 18. A warning would have flagged that cell 18 was needed. The additional code changes to pass numpy arrays would be minor, if needed. Just one opinion.
I'm not sure you got the problem with the warning. If there's always a warning, people will change their code so they don't get a warning, which is by passing in plain numpy arrays. I totally agree that we should do better than we're doing now; I just think a warning whenever a dataframe is passed will not help.
I think I understand your point. My point is that if the warning links to a "see here for details" documentation page, and that page takes great pains to point out that you need to manage the column positions outside of sklearn, then my guess is it will save quite a few people. I also suspect that the people who ignore the warning, and/or don't take it seriously and just do the simplest possible thing to make the warning go away, are probably un-saveable in any event. They will find some other way to muck up their data science practice. That said, agree to disagree. It would have helped me, but it certainly is possible I'm too unusual to be of guidance.
Ok, I see where you're coming from. It might be a good way to make people aware that this issue potentially exists at all. I still think it's a bit too loud and unspecific, though it might indeed save some people. I hope for a near-ish term better solution ;)
I think warning for all pandas DataFrame input is overkill. Lots of people are loading DataFrames from static sources (as we recommend) and not reordering columns. The options, as far as I'm concerned, are:
Obviously the latter is a substantial effort, and it adds O(n_features) space to the pickle representation of an estimator unless we use a hash, but ... dataframes are important to support well.
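A sketch of the hash variant mentioned here: store a fixed-size digest of the fit-time column names instead of the names themselves, trading inspectability for constant pickle overhead (hypothetical helper, not sklearn API):

```python
import hashlib


def column_signature(columns):
    # Join with NUL so ("ab", "c") and ("a", "bc") hash differently.
    joined = "\x00".join(map(str, columns))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# At fit time:     self._column_sig_ = column_signature(X.columns)
# At predict time: compare column_signature(X.columns) and raise on mismatch
```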
If we do this, I'd suggest we extend check_array. There will be cases where this is hard to do, because estimators often delegate input checking elsewhere. In short, this is possible, but it'll be a big project and will touch everything.
I would rather not store hashes, because this information is very valuable for model inspection.
I'm okay with storing names, but it might be an expensive addition to pickling. In the numpy case, we should use the same method to store the number of features. Do we need to be checking consistency between fit and predict in any other way?
(The thing with get_feature_names is that I only think it's helpful to automatically name features in case of:
* Feature selection
* Feature union
* Scaling (don't change name)
* One Hot
Where output features combine inputs, it becomes too custom.)
…On 14 Aug 2017 1:27 am, "Andreas Mueller" wrote:
I would rather not store hashes, because this information is very valuable for model inspection. I'm not sure I like passing self to check_array; I guess the only alternative would be to return column_names and then store it, and pass column_names=self._column_names in predict / transform. (If we do this, we probably want to really look into get_feature_names and the pipeline processing of the same, because it'd be AWESOME!)
So, for now, if I wish to use fit and then use transform on another df, what is the proper way to ensure the proper column order?
`df.loc[:, list of column names]` before both fit and transform should work fine.
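For example (the column names, data, and estimator here are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_cols = ["a", "b", "c"]  # fix the feature list once, reuse everywhere

train_df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 0, 0, 1], "c": [1, 1, 0, 0]})
test_df = pd.DataFrame({"c": [1, 0], "a": [0, 1], "b": [1, 1]})  # shuffled order

est = LogisticRegression().fit(train_df.loc[:, feature_cols], [0, 1, 0, 1])
pred = est.predict(test_df.loc[:, feature_cols])  # columns realigned by name
```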
Currently, estimators have ad-hoc checks that the number of features matches. Maybe we should implement something in BaseEstimator (for use after check_array):
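A hypothetical sketch of the kind of BaseEstimator helper being proposed (the method and attribute names are assumptions, not scikit-learn API):

```python
class ColumnConsistencyMixin:
    """Hypothetical mixin: record the input signature in fit, check it later."""

    def _check_fit_signature(self, X, fitting=False):
        names = getattr(X, "columns", None)
        if fitting:
            self._n_features = X.shape[1]
            self._feature_names = None if names is None else list(names)
            return
        if X.shape[1] != self._n_features:
            raise ValueError("X has %d features; expected %d"
                             % (X.shape[1], self._n_features))
        if (self._feature_names is not None and names is not None
                and list(names) != self._feature_names):
            raise ValueError("Column names differ from those seen during fit")
```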
I just wanted to add that this feature is extremely important out in the real world, where the model-building process and the prediction process (exposed to users) are often completely separate. The prediction process should not need to know about feature selection, feature order and so forth. It should just take all the features for the specific problem, and the estimator knows which ones it needs in what order. This is how it works in other tools; it's extremely convenient, and it's the reason I prefer that tool over sklearn. Sklearn, however, offers more estimator types and better performance (as in speed and prediction quality, even for the same model types), so it would be great to have this here too. I would imagine this happens either through pandas column names or, if you pass in numpy arrays, fit and predict could have an additional optional parameter column_names (not that great to create this list). If no column names are present, it works like it does now.
I don't think your suggestion is going to happen. Well, here are the options (not all mutually exclusive):
Number 5 is by far the easiest, although it requires users to be aware, so it's coupled with 2. It looks like this (with documentation and an awareness campaign):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class PandasCheck(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the fit-time column names (None for plain arrays).
        self.columns_ = getattr(X, 'columns', None)
        return self

    def transform(self, X):
        check_is_fitted(self, 'columns_')
        # array_equal is False for reordered, renamed or missing columns.
        if self.columns_ is not None and not np.array_equal(self.columns_, X.columns):
            raise ValueError('Mismatched columns in transform. Expected %r, got %r'
                             % (self.columns_, X.columns))
        return X
```
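Possible usage of the sketch above, prepended to a pipeline so the check runs on every transform or predict call (the data here is made up):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 1, 0, 0]})

pipe = make_pipeline(PandasCheck(), LogisticRegression())
pipe.fit(train_df, [0, 1, 0, 1])    # records train_df.columns
pipe.predict(train_df[["b", "a"]])  # raises ValueError: mismatched columns
```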
Wouldn't that work then?
Please see the test cases in #12936
@jnothman what do you think?
We could potentially make setting (not just checking) the signature a job for check_array. Getting someone to implement these changes (preferably someone who knows how to edit files en masse) is more the challenge. In practice, where an estimator delegates checking of X to some other function, it's not going to be trivial to implement everywhere.
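One possible shape of that idea, written here as a wrapper around the real check_array (the estimator/fitting parameters are hypothetical; sklearn's check_array has no such arguments):

```python
from sklearn.utils import check_array


def checked_array(X, estimator=None, fitting=False, **kwargs):
    """Record column names on the estimator at fit time; verify afterwards."""
    names = getattr(X, "columns", None)
    if estimator is not None:
        if fitting:
            estimator._feature_signature = None if names is None else list(names)
        else:
            expected = getattr(estimator, "_feature_signature", None)
            if expected is not None and names is not None and list(names) != expected:
                raise ValueError("Column names differ from those seen during fit")
    return check_array(X, **kwargs)
```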
It's difficult because things don't call super() on BaseEstimator much in the codebase.
I disagree on raising an error. Data frames are philosophically different from arrays. People are likely passing data frames in the wrong order at the moment, but they aren't wrong to do so; they just aren't aware that the sklearn package hasn't handled their edge case. Pandas dataframe column ordering has always existed, but it's just not talked about, because anyone using pandas would never iloc a column when indexing: you index columns by name. Maybe a warning as we transition, but I wouldn't recommend an error.
I appreciate the difference in philosophy, and that being order-invariant for dataframes is something we might like in the future. Something we would also need to decide there is whether to accept data at predict time that includes extra columns. I think you will find some core devs strongly opposed to accepting dataframes with extra columns at predict time.
In any case, I think we should behave responsibly and notify users that they may not previously have been getting what they expected. If not an exception, we can go for a warning, e.g. "X.columns is different from those in fitting. In releases of scikit-learn before 0.xx this would have silently resulted in incorrect output. This warning will be removed in version 0.xx+1."
I absolutely agree that a warning is a must, and I agree with your reply 100%. A compromise may be to issue an update that creates a warning ASAP, by collecting column names where appropriate and creating the architecture of the solution. Then, in the next breaking update, implement automatic reordering.
As for extra columns: in my own code I have implemented a lot of what we are talking about in the context of cross-validated feature selectors, such as RFECV. By accepting dataframes of any size, you allow these learners to always be correct. Users can give them just the columns they selected, the full data they fit on, or any arbitrary dataframe at predict time. The goal in my mind is to enable the agnostic use of estimators, because these estimators are going to be pickled and long forgotten in a repo somewhere. I don't want to use them mistakenly when I dig them out of the filing cabinet.
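A sketch of that "accept any dataframe at predict time" behaviour: a thin user-side wrapper that subsets incoming frames to the fit-time columns by name (class and attribute names are made up):

```python
from sklearn.base import BaseEstimator


class NameAwareWrapper(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.columns_ = list(X.columns)        # remember fit-time names
        self.estimator.fit(X[self.columns_], y)
        return self

    def predict(self, X):
        # Extra columns are ignored; missing ones raise a KeyError.
        return self.estimator.predict(X[self.columns_])
```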
I would be happy to review an implementation PR.
Is there any workaround for this problem, other than reordering the columns?
Hello, `df2 = df2[df1.columns]`
Hm, we still don't have a SLEP for this, right? I think we should try doing one. @NicolasHug, did you want to do that?
I think the options are mostly listed in #7242 (comment). @amueller, I know you do not like this, but I tend to prefer storing a hash. The alternative is to store the names everywhere. I can't get over the fact that to have this consistency, users would need to pay a price in memory. Do we want users to buy more RAM to get column name consistency?
I'm with @ryanpeach and my initial comment on this issue. This is from a naive end-user point of view, not knowing the fundamental architecture of sklearn and just being concerned about making it easy for me to get correct predictions.
How much RAM does storing even thousands of column names really use up? Even if you go all-in and have thousands of very long column names, it adds up to a couple hundred MB. As long as it is optional, I don't see the issue, given the huge convenience gain. Getting 64 GB of RAM doesn't cost all that much compared to developer time or, even worse, wrong results.
I want 6). I think 1, 2 and 3 don't solve the issue. 4 and 5 solve the issue if the user is aware of it and actively takes steps to prevent a problem from occurring; they sort of solve the issue, but not really the core issue of user expectations and silent errors. Also, they are really verbose. I guess in that I disagree with @ryanpeach in #7242 (comment), but his solution is not even on the list ;) However, if the first step is a ColumnTransformer, then we actually do the "right" thing in that we rearrange the columns. That makes me think: is inserting a ColumnTransformer the answer?
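For illustration, a passthrough ColumnTransformer as the first step selects columns by name at transform time (with the default remainder='drop'), which is the rearranging behaviour referenced above; exact version behaviour may vary, and the data here is made up:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ct = ColumnTransformer([("keep", "passthrough", ["a", "b"])])
pipe = make_pipeline(ct, LogisticRegression())

train = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 1, 0, 0]})
pipe.fit(train, [0, 1, 0, 1])
pipe.predict(train[["b", "a"]])  # columns are looked up by name
```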
I think this is independent of that. This should also be really easy to implement now that we have @NicolasHug's work on n_features_in_.
@adrinjalali Cool, I totally agree (and encouraged @NicolasHug to draft a PR) ;)
Also see #17407
Now that we validate column names, this is partially fixed if the user uses a pandas DataFrame. Removing the milestone.
We now validate feature names by default and raise a warning if the column names are not in the same order. In 1.2, we will start raising an error.
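For illustration, with a recent release (>= 1.0) the recorded names are exposed as feature_names_in_ and the check fires automatically (data made up):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 1, 0, 0]})
clf = LogisticRegression().fit(X, [0, 1, 0, 1])
print(clf.feature_names_in_)   # ['a' 'b']

# Reordered columns now trigger the check described above
# (a warning in 1.0/1.1, an error from 1.2 on):
clf.predict(X[["b", "a"]])
```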
Description
Sometimes it is convenient to first build a model on a recent dataset, save it as a .pkl file, and then apply the model to a new dataset. However, in our last project, my friends and I found that the results turned quite weird after applying the .pkl file to the new dataset. We had implemented a binary classifier, and we found the probability distribution turned from unimodal to bimodal. Finally, we found that the problem was that the column order of the new dataset differed from that of the old one, so the predictions were totally wrong.
I have checked the source code and discovered that the fit function of sklearn doesn't save the column names during model training, so there is no way to check whether the columns are consistent at prediction time. We thought it would be better if the column names could be saved during training and then used to check the columns during prediction.
Steps/Code to Reproduce
Expected Results
Because feature_columns_test is different from feature_columns_train, it is not surprising that prob1 is totally different from prob2; prob1 should be the right result.
Actual Results
Versions
The probable solution
I also implemented a very simple solution. Hope this helps. :)
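The original snippet was not captured above; a minimal sketch of the kind of wrapper described, using LogisticRegression as a stand-in for the reporter's classifier (class and attribute names are made up):

```python
from sklearn.linear_model import LogisticRegression


class CheckedModel(LogisticRegression):
    """Remember column names at fit time; verify them before predicting."""

    def fit(self, X, y, **fit_params):
        self.fit_columns_ = list(X.columns)
        return super().fit(X, y, **fit_params)

    def predict_proba(self, X):
        if list(X.columns) != self.fit_columns_:
            raise ValueError("Columns differ from training: expected %r, got %r"
                             % (self.fit_columns_, list(X.columns)))
        return super().predict_proba(X)
```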
Then apply this new class:
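For example (data made up; the three tests below then exercise the check):

```python
import pandas as pd

train = pd.DataFrame({"f1": [0, 1, 0, 1], "f2": [1, 1, 0, 0]})
model = CheckedModel().fit(train, [0, 1, 0, 1])
model.predict_proba(train)  # same columns in the same order: fine
```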
Test one: if the column order is changed
Result for test one:
Test two: if the columns are different (different columns)
Simulate by changing one of the column names
error message:
Test three: if the number of columns changes
Simulate by dropping one column
error message: