-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Convert ColumnTransformer input list to numpy array #12104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Convert ColumnTransformer input list to numpy array #12104
Conversation
@@ -419,6 +419,9 @@ def fit_transform(self, X, y=None): | |||
sparse matrices. | |||
|
|||
""" | |||
if isinstance(X, list): | |||
X = np.array(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using check_array
from utils/validation
would be safer/cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip! Added.
@@ -467,6 +469,8 @@ def transform(self, X): | |||
""" | |||
check_is_fitted(self, 'transformers_') | |||
|
|||
X = check_array(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A naive use of check_array will break pandas support. I think we should do something like:
if not hasattr(X, 'loc'):
X = check_array(X)
We don't want to do an explicit isinstance
to avoid introducing an explicit dependency on the pandas library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SShouldn't such a check be done in check_array
? Or else we should have this check almost everywhere if we want to keep the dataframe, don't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, check_array
should also not convert to numeric or raise on sparse in this case (it should basically do no check at all). Therefore, in this case, I would maybe not use check_array?
I think in the end we want to improve check_array
to also handle pandas dataframes, but that is a larger endeavour than fixing this bug.
@adrinjalali the column transformer is currently the only sklearn object where we don't convert the passed dataframe to an array, therefore this issue does not come up everywhere else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possibility would be to only call check_array
if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).
Further, it should also not enforce all finite values in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).
So this is similar to ogrisel's solution above? I'm +1 to that solution (Maybe also add a comment to note down the limitation of current check_array).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).
Yes, with the difference that we also should not call check_array for arrays (Olivier only checks for dataframe), or otherwise specify a lot of flags to check_array to accept sparse, not error on nan data, ..
So for this bug fix, we should just check if X is a list
using isinstance
and then use check_array
to convert it to an array, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps
if not hasattr(X, '__array__'):
X = np.asarray(X)
?
(or do we want dtype='object' to be safe?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want dtype='object' arrays to not be converted I think, but ndarrays do have a __array__
, so your proposal is fine for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sparse matrices apparently have no __array__
, so an additional check should include that as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't such a check be done in check_array?
Opened a new issue about this here: #12148
Okay, I'll try to finish this as @vinayak-mehta is silent. |
I'm sorry @jnothman there's no excuse for being silent :| I should've done this sooner. Travis failed on Python 3.4, conda but it doesn't show any details, just says "stalled build or something wrong with the build itself.", any idea? EDIT: just saw "No output has been received in the last 10m0s", it timed out. :? |
No worries. We just need this merged ASAP.
Probably just needs a rerun.
|
"""Use check_array only on lists and other non-array-likes / sparse""" | ||
if hasattr(X, '__array__') or sparse.issparse(X): | ||
return X | ||
return check_array(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to allow missing values here? (force_all_finite='allow-nan'
)
@ogrisel is the |
I pushed a new commit to address @jorisvandenbossche's comment (#12104 (review)). In the process, I also made sure that the column transformer does not implicitly change the dtype of the list content especially when the content is heterogeneously typed. In turn that made me change |
Although, that leaves it to the actual transformer to which the data is passed to coerce the X data to numeric if needed or not, so probably this is a good idea. |
I think we have to do it: CT is the only estimator to naturally accept heterogeneously typed collections. It should not try to convert everything to a numpy string dtype because one column is Python string valued. |
Indeed that's my thought: CT should stay as neutral as possible w.r.t. to dtypes. A Python list is a list of Python objects. Here we just need arrays to be able to index on columns. |
How was this triggered by this PR? As lists should not have a |
OK, I suppose this is then for the actual transformers (eg StandardScaler) that gets the object dtyped subarray? That would make the use case of a list indeed rather annoying and impossible to do without a warning. |
Yes, the numerical scalers (StandardScaler, MinMaxScaler, ...) all have |
@ogrisel changing |
From live discussion with @ogrisel: for this PR and to get the basic lists fix merged, let's leave the Of course, since it seems Andy actually tagged, we might have a little bit more time again for a bugfix release. |
CI is passing here, I think this should be good to go now |
<!-- Thanks for contributing a pull request! Please ensure you have taken a look at the contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist --> #### Reference Issues/PRs <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See https://github.com/blog/1506-closing-issues-via-pull-requests --> Fixes #12096. #### What does this implement/fix? Explain your changes. Converts the input list for ColumnTransformer to a numpy array. Added a check inside `transform` and `fit_transform` to check if the input `X` is a list, if it is then it gets converted to a numpy array. #### Any other comments? Should this conversion be documented in the docstrings for ColumnTransfomer's `fit`, `transform` and `fit_transform`? <!-- Please be aware that we are a loose team of volunteers so patience is necessary; assistance handling other issues is very welcome. We value all user contributions, no matter how minor they are. If we are slow to review, either the pull request needs some benchmarking, tinkering, convincing, etc. or more likely the reviewers are simply busy. In either case, we ask for your understanding during the review process. For more information, see our FAQ on this topic: http://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention. Thanks for contributing! -->
* tag '0.20.0': (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
* releases: (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
* dfsg: (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
Reference Issues/PRs
Fixes #12096.
What does this implement/fix? Explain your changes.
Converts the input list for ColumnTransformer to a numpy array.
Added a check inside
transform
andfit_transform
to check if the inputX
is a list, if it is then it gets converted to a numpy array.Any other comments?
Should this conversion be documented in the docstrings for ColumnTransfomer's
fit
,transform
andfit_transform
?