-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ColumnTransformer generalization to work on empty lists #12084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this PR!
Could you please change the title to a more meaningful description of what this PR does? That would help PR triage.
@@ -111,6 +111,20 @@ def test_column_transformer(): | |||
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both) | |||
assert len(ct.transformers_) == 2 | |||
|
|||
# test case that ensures that the column transformer does also work when | |||
# a given transformer doesn't have any columns to work on | |||
ct = ColumnTransformer([('trans1', Trans(), [0, 1]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it could be a separate test? Also it might be worth checking that
ColumnTransformer([('trans1', Trans(), [])])
is an identity operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what you mean?
ct = ColumnTransformer([('trans', Trans(), [])],
remainder='passthrough')
assert_array_equal(ct.fit_transform(X_array), X_res_both)
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update documentation (of transformers_
at least).
I've not looked at tests.
@@ -335,6 +337,8 @@ def _update_fitted_transformers(self, transformers): | |||
# so get next transformer, but save original string | |||
next(transformers) | |||
trans = 'passthrough' | |||
elif hasattr(column, '__len__') and len(column) == 0: | |||
trans = old |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather trans = 'drop'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or trans = 'empty'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like this merged before 0.20rc2...
@@ -335,6 +337,8 @@ def _update_fitted_transformers(self, transformers): | |||
# so get next transformer, but save original string | |||
next(transformers) | |||
trans = 'passthrough' | |||
elif hasattr(column, '__len__') and len(column) == 0: | |||
trans = old |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or trans = 'empty'
def test_column_transformer_empty_columns(): | ||
# test case that ensures that the column transformer does also work when | ||
# a given transformer doesn't have any columns to work on | ||
X_array = np.array([[0, 1, 2], [2, 4, 6]]).T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also test with a DataFrame if pandas is available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is in the fn test_column_transformer_dataframe
(needs to be skipped if no pandas)
Sorry for my confusion about |
1 similar comment
Sorry for my confusion about |
remainder='passthrough') | ||
assert_array_equal(ct.fit_transform(X_array), X_res_both) | ||
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both) | ||
assert len(ct.transformers_) == 2 # including remainder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take care with the pep8
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both) | ||
assert len(ct.transformers_) == 2 # including remainder | ||
|
||
fixture = np.array([[],[],[]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem here
remainder='drop') | ||
assert_array_equal(ct.fit_transform(X_array), fixture) | ||
assert_array_equal(ct.fit(X_array).transform(X_array), fixture) | ||
assert len(ct.transformers_) == 2 # including remainder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pep8
Travis CI is failing.. |
I (presumably) fixed the pep8 problems. Should this issue be resolved first / as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check more explicitly the values in transformers_
and the presence of 'empty'
assert_array_equal(ct.fit_transform(X_df), X_res_both) | ||
assert_array_equal(ct.fit(X_df).transform(X_df), X_res_both) | ||
assert len(ct.transformers_) == 2 | ||
assert ct.transformers_[-1][0] != 'remainder' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you check that this is 'empty'?
estimator, 'drop', or 'passthrough'. If there are remaining columns, | ||
the final element is a tuple of the form: | ||
estimator, 'drop', or 'passthrough'. Note that the column list is | ||
allowed to be empty. In that case the transformers will not be fitted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of date. "Where there were no columns selected, this string 'empty' will stand in place of the transformer."
@jnotham I am still not fully convinced of the 'empty' string, see #12071 (comment) |
I find 'empty' more explicit, while an estimator without its fitted attributes might be confusing. But I don't mind very much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to an empty list, I think we should also check for an all-False boolean array
Okay, let's do the unfitted transformer. Shrug. |
@janvanrijn do you have time to update this today? If not, I can also push some updates (just asking, as we want to get this in as fast as possible, it is one of the remaining blockers for the release) |
I am extremely busy with some other task atm, but i can make some time for this. If you have more time, and as you guys have a better overview of what exactly should happen, feel free to alter/update/change/copy from/remove/bypass this pr. |
@janvanrijn OK, I pushed some updates to this PR. |
@jnothman after updating this PR to also check for all-False boolean arrays, I realized that also slices can be empty .. But to fully properly determine of they are empty, you need the X data, which I actually just removed from |
I think we should determine this only during fit. I'm not sure how you would consistently determine this during How about we release and do this in 0.20.1? There's always a .1 any way... |
Well, in most cases this will not be a problem in practice. Because if the structure of your data does not change, this determination of empty selection will be consistent, and if the structure of your data does change between fit and transform, your transform will be fucked up anyway.
If you want to release 0.20 the coming hours, I would propose to merge this as is (it already fixes the main use case, just not yet empty slices), and then we can expand the fix for 0.20.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as well. Merging, thanks!
I am able to run the random bot now without problems. Thanks for merging, this is really great |
Opened a follow-up issue here: #12162 |
* tag '0.20.0': (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
* releases: (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
* dfsg: (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...
Reference Issues/PRs
Fixes #12071
What does this implement/fix? Explain your changes.
Column Transformer can handle 'empty' columns.
Any other comments?
main code by @jorisvandenbossche
test cases by me