-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] MNT enforce column names consistency #17407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'd call this a FIX rather than a MNT! Certainly, it will require a change log entry. |
I think overall I am +0.3 on having this feature in general. I understand this is nice to have for single estimators. Once this gets into a pipeline and we have pandas output support, storing list of strings everywhere seems very wasteful. I would say we need a way to turn it off in a configuration flag, Furthermore, I would say this will be one of the "strict" common tests. I can see libraries not wanting a pandas dependency and choosing to not store the names. |
I agree that might be good for strict mode only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NicolasHug
I'd be more comfortable if we force string column names and raise otherwise. I think we've talked about this in some of the SLEPs and we seem happy with that solution.
We probably should also support xarray
at the same time since we're close to supporting it, WDYT?
The issue with that is that we need to agree on a canonical name for the feature names in an xarray.DataArray
object.
|
||
def _is_dataframe(X): | ||
# Return True if X is a pandas dataframe (or a Series) | ||
return hasattr(X, 'iloc') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how we've done in other places as well, but I vaguely remember some upcoming API changes on pandas side which would affect this. How would you test for a DataFrame @TomAugspurger ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'm not sure. There aren't any plans to remove DataFrame.iloc
or Series.iloc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you don't just test for type(X) == type(pd.DataFrame()) or type(X) == type(pd.Series())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would require having pandas as a dependency and we don't want that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do a similarly direct test without adding a dependency on pandas... But generally duck typing is recommended in python to allow for new players with compatible APIs?
@adrinjalali I think it would be ok to have an initial version without xarray so we dont' have to solve that problem ;) Is this blocked by the strict estimator checks? I would say no but I'm not sure what y'all think. |
elif hasattr(self, '_feature_names_in'): | ||
feature_names = df.columns.values | ||
if np.any(feature_names != self._feature_names_in): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for backward compatibility we need to warn first, right?
The time and memory overhead comparison between pandas and xarray doesn't really justify not supporting xarray if we're going to support only one of them IMO. |
@adrinjalali sorry, which comparison do you mean? This change is assuming the input data is already a dataframe, right? |
@amueller I'm referring to the comparison done by @thomasjpfan in #16772 (comment) My point is that according to that benchmark, it seems like we should be supporting xarray at least at the same time as we would support pandas, since it has significantly less overhead. |
I'm gonna close this PR as Thomas as taken over in #18010 |
No description provided.