ENH Adds Column name consistency #18010
Conversation
thanks for piloting this. It's looking good. Ideally we release this on all estimators at once, but yes, there's little benefit in doing it all in one mammoth PR.
This PR is ready to review. Currently, this PR only adds column name consistency to `dummy` and `impute`, and has a test to check for it. There is an ignore list of all the modules that are skipped by the new test. As we add the consistency check to each module, we can remove it from the list.

The codecov error message is strange. I am pretty sure my tests cover the patch, and I can verify it on my local system (with `--cov sklearn`).
Looks good apart from minor nitpicks. Requires some dev docs, I think.
sklearn/impute/_base.py (outdated)

```python
@@ -793,7 +793,7 @@ def transform(self, X):
        # Need not validate X again as it would have already been validated
        # in the Imputer calling MissingIndicator
        if not self._precomputed:
            X = self._validate_input(X, in_fit=True)
```
Was that a bug? Can you add a regression test?
sklearn/tests/test_array_out.py (outdated)

```python
@pytest.mark.parametrize("array_type", ["dataframe", "dataarray"])
def test_pandas_get_feature_names(array_type):
```
Can you test with integer column names as well? And maybe a mix of integer, string, and object column names? You know, for fun?
I remember we had a conversation about what kind of column names we'd accept. And we kinda concluded that we'd restrict column names to be `str`, and I think we should stick to that?
I also intuitively feel that feature names implies `str`. It would be a good idea to add a test to check the error message we raise for invalid feature names.
I would be okay with this if all our `get_feature_names` output strings, but for `DictVectorizer` we do not:

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
X = [{2: 1, 4: 2}, {3: 3, 5: 1}]
v.fit(X)
v.get_feature_names()
# [2, 3, 4, 5]
```
On a second note, I think enforcing strings will break backward compatibility:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, _ = load_iris(as_frame=True, return_X_y=True)
X.columns = ['a', 'b', 2, 3]
scaler = StandardScaler().fit(X)
_ = scaler.transform(X)  # currently works
```
I know integers as columns currently works, but I think we should deprecate the usage and start returning strings as column names ourselves. Having integers as column names (or row indices for that matter) creates a ton of silent bugs.
Becoming consistent would be great, and since `ColumnTransformer` also does not accept mixed data types (and I am not sure that we can get around that without adding some ambiguities), I would prefer to deprecate the mixed-type support where it currently exists.
Another option would be to accept integer columns in `_validate_data` and automatically convert them to be stored as strings in `feature_names_in_`.

I am not sure this is a good idea though.
To summarize, here are the options:

1. Only store `feature_names_in_` when column names are all strings. If a dataframe is passed with mixed or integer column names, show a warning stating that `feature_names_in_` is not set.
2. Support integer and/or mixed columns and deprecate.
3. Convert to strings, store them in `feature_names_in_`, and warn that this is happening.
4. Support any type of column names.

I am +0.5 for Option 2, as long as we update `DictVectorizer` to always output strings. Option 3 will lead to weird edge cases with column names like `['4', 4]` (string and integer).

Option 4 is not consistent with `ColumnTransformer`. `ColumnTransformer` uses the names to actually slice the DataFrame, so it's reasonable to enforce strings there. All the other estimators only require an equality check, so I do not think we need to be restrictive on the column names.
To move this along I'll update the PR to option 2. (Support integer and/or mixed columns and deprecate.)
Do we need a SLEP or just a vote on how to handle column names? I think some of us would be happy with whatever pandas accepts, and some of us [really] don't want any column names other than simple strings.
How we handle names is fairly public, so having a short SLEP to formalize the behavior is useful.
I'm generally happy with this PR, we just need to figure out some of the cases:
sklearn/base.py (outdated)

```python
        return

        fitted_feature_names = getattr(self, "feature_names_in_", None)
        if fitted_feature_names is None:
```
do we want to raise if either `(fitted_feature_names is None xor feature_names_in is None) == True`? I think we should
This means we would start warning when training on a numpy array and predicting on a pandas dataframe (or vice versa). What would be a good message?
- train on numpy and predict on dataframe: Convert your dataframe into a numpy array before predicting?
- train on dataframe and predict on numpy array: Convert your numpy array into a dataframe before predicting?
Maybe it would be enough to mention that the type of the array in `fit` is different than the one at `predict`, and that we expect a consistent type between `fit` and `predict`.
Actually I don't know if we should raise an error or only a warning mentioning that no information about the columns would be available?
I'd be happy with a warning (maybe), but I think a warning when the types differ between `fit` and `predict` is a nice thing to have.
I'm ambivalent about a warning. On the one hand it seems dangerous because it means you can't check consistency, but on the other hand people will ignore warnings if there are too many. I think +0.5 for the warning?
> Maybe it would be enough to mention that the type of the array in fit is different than the one at predict and we expect a consistent type between fit and predict.

If we were to support xarray in the future, I don't think we can strictly work on type. It would have to be "are there feature names in fit" and "are there feature names in predict".
Updated the PR with the warnings.
sklearn/utils/_array_out.py (outdated)

```python
    Supports:
    - pandas DataFrame
    - xarray DataArray
```
Maybe elaborate on the requirements for the xarray coords to be considered as feature names.
sklearn/utils/_array_out.py (outdated)

```python
def _get_feature_names(X):
    """Get feature names from X.
```
Should this function enforce string feature names?
I believe so, we can always relax that requirement later if we decide to.
Besides the existing comment and the additional comment on the check for `no_validation`-tagged estimators, this LGTM.
> I believe so, we can always relax that requirement later if we decide to.
It's more of a question about mixed columns or datetime column names. If we start raising errors in 1.2 because of scikit-learn's restriction on types in feature names, then we no longer support "array-likes".
I think the reasons were:

I'm still on the side of supporting ints.
Sounds good! I like the PR as it is now.
@ogrisel @glemaitre is your approval still valid? ;)
yep |
LGTM. Thank you very much for the follow-up. Merging!
OMG!!!! 🥳
Side note: using parquet rather than CSV seems to force column names to be strings by default: "0" (that just broke my code, but that's another story). Maybe in the long term, a way forward is to push for more parquet?
That would be an option to explore, in particular, what memory copies are involved when converting between pandas and Arrow data tables and other heterogeneous column containers.
## August 31st, 2021

### Gael

* TODO: Jeremy's renewal, Chiara's replacement, Mathis's consulting gig

### Olivier

- input feature names: main PR [#18010](scikit-learn/scikit-learn#18010) that links into sub PRs
- remaining (need review): [#20853](scikit-learn/scikit-learn#20853) (found a bug in `OvOClassifier.n_features_in_`)
- reviewing `get_feature_names_out`: [#18444](scikit-learn/scikit-learn#18444)
- next: give feedback to Chiara on ARM wheel building [#20711](scikit-learn/scikit-learn#20711) (needed for the release)
- next: assist Adrin for the release process
- next: investigate regression in loky that blocks the cloudpickle release [#432](cloudpipe/cloudpickle#432)
- next: come back to Intel to write a technical roadmap for a possible collaboration

### Julien

- Was on holidays
- Planned week @ Nexedi, Lille, from September 13th to 17th
- Reviewed PRs
  - [`#20567`](scikit-learn/scikit-learn#20567) Common Private Loss module
  - [`#18310`](scikit-learn/scikit-learn#18310) ENH Add option to centered ICE plots (cICE)
- Other PRs prior to holidays
  - [`#20254`](scikit-learn/scikit-learn#20254)
  - Adapted benchmarks on `pdist_aggregation` to test #20254 against sklearnex
  - Adapting PR for `fast_euclidean` and `fast_sqeuclidean` on user-facing APIs
  - Next: comparing against scipy's
  - Next: having feedback on [#20254](scikit-learn/scikit-learn#20254) would also help
  - Next: I need to block time to study Cython code.

### Mathis

- `sklearn_benchmarks`
  - Adapting benchmark script to run on Margaret
  - Fix issue with profiling files too big to be deployed on GitHub Pages
  - Ensure deterministic benchmark results
  - Working on declarative pipeline specification
- Next: run long HPO benchmarks on Margaret

### Arturo

- Finished MOOC!
- Finished filling [Loïc's notes](https://notes.inria.fr/rgSzYtubR6uSOQIfY9Fpvw#) to find questions with score under 60% (Issue [#432](INRIA/scikit-learn-mooc#432))
  - started addressing easy-to-fix questions, resulting in GitLab MRs [#21](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/21) and [#22](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/22)
  - currently working on expanding the notes up to 70%
- Continued cross-linking forum posts with issues in GitHub, resulting in [#444](INRIA/scikit-learn-mooc#444), [#445](INRIA/scikit-learn-mooc#445), [#446](INRIA/scikit-learn-mooc#446), [#447](INRIA/scikit-learn-mooc#447) and [#448](INRIA/scikit-learn-mooc#448)

### Jérémie

- back from holidays, catching up
- Mathis' benchmarks
- trying to find what's going on with ASV benchmarks (asv should display the versions of all build and runtime dependencies for each run)

### Guillaume

- back from holidays
- Next:
  - release with Adrin
  - check the PR and issue trackers

### TODO / Next

- Expand Loïc's notes up to 70% (Arturo)
- Create presentation to discuss my experience doing the MOOC (Arturo)
- Help with the scikit-learn release (Olivier, Guillaume)
- HR: Jeremy's renewal, Chiara's replacement (Gael)
- Mathis's consulting gig (Olivier, Gael, Mathis)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@thomasjpfan I think I found something.
```python
    if not missing_names and not missing_names:
        message += (
            "Feature names must be in the same order as they were in fit.\n"
        )
```
Is this a typo? I guess the intended line 478 is:

```python
if not missing_names and not unexpected_names:
```
Yea, it's a typo. Are you interested in opening a PR to fix?
Sure! (It is difficult to say no at this point!)
Pull request created: #23091
Reference Issues/PRs
Towards #17407