ColumnTransformer generalization to work on empty lists #12084

janvanrijn · 2018-09-14T19:23:10Z

Reference Issues/PRs

Fixes #12071

What does this implement/fix? Explain your changes.

ColumnTransformer can now handle transformers whose column list is empty.

Any other comments?

main code by @jorisvandenbossche
test cases by me

rth

Thanks for making this PR!

Could you please change the title to a more meaningful description of what this PR does? That would help PR triage.

rth · 2018-09-14T21:33:55Z

sklearn/compose/tests/test_column_transformer.py

@@ -111,6 +111,20 @@ def test_column_transformer():
    assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
    assert len(ct.transformers_) == 2

+    # test case that ensures that the column transformer does also work when
+    # a given transformer doesn't have any columns to work on
+    ct = ColumnTransformer([('trans1', Trans(), [0, 1]),


Maybe it could be a separate test? Also it might be worth checking that

ColumnTransformer([('trans1', Trans(), [])])

is an identity operation?

Is this what you mean?

ct = ColumnTransformer([('trans', Trans(), [])], remainder='passthrough')
assert_array_equal(ct.fit_transform(X_array), X_res_both)
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2

jnothman

Please update documentation (of transformers_ at least).

I've not looked at tests.

jnothman · 2018-09-15T11:49:22Z

sklearn/compose/_column_transformer.py

@@ -335,6 +337,8 @@ def _update_fitted_transformers(self, transformers):
                # so get next transformer, but save original string
                next(transformers)
                trans = 'passthrough'
+            elif hasattr(column, '__len__') and len(column) == 0:
+                trans = old


I'd rather trans = 'drop'

Or trans = 'empty'
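To make the options in this thread concrete, here is a minimal pure-Python sketch of the bookkeeping loop touched by the hunk above. It is not the scikit-learn implementation: the function name `update_fitted_transformers`, the `placeholder` argument, and the string stand-ins for estimators are all hypothetical, and the `'drop'` branch is assumed rather than shown in the diff.

```python
def update_fitted_transformers(transformers, fitted, placeholder=None):
    """Rebuild the fitted-transformer list in the user's original order.

    transformers: list of (name, transformer, columns) as given by the user.
    fitted: fitted objects, only for entries that actually ran.
    placeholder: what to store for entries with no columns; None keeps the
        original (unfitted) object, mirroring `trans = old` in the diff.
    """
    fitted_iter = iter(fitted)
    result = []
    for name, old, columns in transformers:
        if old == "drop":
            # Assumed branch: dropped entries were never fitted.
            trans = "drop"
        elif old == "passthrough":
            # A stand-in was fitted for passthrough; consume it, keep the string.
            next(fitted_iter)
            trans = "passthrough"
        elif hasattr(columns, "__len__") and len(columns) == 0:
            # Nothing was fitted for an empty selection, so do not consume.
            trans = old if placeholder is None else placeholder
        else:
            trans = next(fitted_iter)
        result.append((name, trans, columns))
    return result


# Strings stand in for estimator objects in this illustration.
spec = [
    ("scale", "unfitted_scaler", [0, 1]),
    ("noop", "unfitted_encoder", []),
    ("rest", "passthrough", [2]),
]
fitted = ["fitted_scaler", "fitted_identity"]

kept = update_fitted_transformers(spec, fitted)
marked = update_fitted_transformers(spec, fitted, placeholder="empty")
print(kept)
print(marked)
```

Passing `placeholder="drop"` or `placeholder="empty"` reproduces the two alternatives suggested above; in every variant the important point is that the empty entry does not consume a fitted transformer from the iterator.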

jnothman

I'd like this merged before 0.20rc2...

jnothman · 2018-09-17T08:07:36Z

sklearn/compose/_column_transformer.py

@@ -335,6 +337,8 @@ def _update_fitted_transformers(self, transformers):
                # so get next transformer, but save original string
                next(transformers)
                trans = 'passthrough'
+            elif hasattr(column, '__len__') and len(column) == 0:
+                trans = old


Or trans = 'empty'

janvanrijn · 2018-09-17T10:09:52Z

I incorporated the comments by @rth and @jnothman. Let me know if I can do anything more; I will handle this PR with priority.

sklearn/compose/_column_transformer.py

jnothman · 2018-09-17T10:43:13Z

sklearn/compose/tests/test_column_transformer.py

+def test_column_transformer_empty_columns():
+    # test case that ensures that the column transformer does also work when
+    # a given transformer doesn't have any columns to work on
+    X_array = np.array([[0, 1, 2], [2, 4, 6]]).T


please also test with a DataFrame if pandas is available

This one is in the fn test_column_transformer_dataframe (needs to be skipped if no pandas)

jnothman · 2018-09-17T21:48:12Z

Sorry for my confusion about columns. I thought it was an extract from X

eamanu · 2018-09-18T15:23:07Z

sklearn/compose/tests/test_column_transformer.py

+                           remainder='passthrough')
+    assert_array_equal(ct.fit_transform(X_array), X_res_both)
+    assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
+    assert len(ct.transformers_) == 2 # including remainder


take care with the pep8

eamanu · 2018-09-18T15:23:22Z

sklearn/compose/tests/test_column_transformer.py

+    assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
+    assert len(ct.transformers_) == 2 # including remainder
+
+    fixture = np.array([[],[],[]])


eamanu · 2018-09-18T15:24:03Z

sklearn/compose/tests/test_column_transformer.py

+                           remainder='drop')
+    assert_array_equal(ct.fit_transform(X_array), fixture)
+    assert_array_equal(ct.fit(X_array).transform(X_array), fixture)
+    assert len(ct.transformers_) == 2 # including remainder


rth · 2018-09-19T09:17:33Z

Travis CI is failing..

janvanrijn · 2018-09-19T11:07:55Z

I (presumably) fixed the pep8 problems. Should this issue be resolved first / as well?

jnothman

Please check more explicitly the values in transformers_ and the presence of 'empty'

jnothman · 2018-09-20T08:22:34Z

sklearn/compose/tests/test_column_transformer.py

+    assert_array_equal(ct.fit_transform(X_df), X_res_both)
+    assert_array_equal(ct.fit(X_df).transform(X_df), X_res_both)
+    assert len(ct.transformers_) == 2
+    assert ct.transformers_[-1][0] != 'remainder'


Shouldn't you check that this is 'empty'?

jnothman · 2018-09-20T08:23:42Z

sklearn/compose/_column_transformer.py

-        estimator, 'drop', or 'passthrough'. If there are remaining columns,
-        the final element is a tuple of the form:
+        estimator, 'drop', or 'passthrough'. Note that the column list is
+        allowed to be empty. In that case the transformers will not be fitted.


Out of date. "Where there were no columns selected, this string 'empty' will stand in place of the transformer."

jorisvandenbossche · 2018-09-20T09:51:19Z

@jnothman I am still not fully convinced of the 'empty' string, see #12071 (comment)

jnothman · 2018-09-20T11:02:43Z

I find 'empty' more explicit, while an estimator without its fitted attributes might be confusing. But I don't mind very much.

jorisvandenbossche

In addition to an empty list, I think we should also check for an all-False boolean array
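A rough sketch of what such a combined check could look like, written in plain Python so it runs without dependencies. The helper name `is_empty_selection` is hypothetical, and a list of `bool` values stands in for a NumPy boolean array; the real code would presumably inspect the array's dtype instead.

```python
def is_empty_selection(column):
    """Return True when `column` selects no columns at all (illustrative only)."""
    # Boolean mask: empty when no entry is True.
    # Assumption: a non-empty list/tuple of bools stands in for a NumPy bool array.
    if (
        isinstance(column, (list, tuple))
        and len(column) > 0
        and all(isinstance(v, bool) for v in column)
    ):
        return not any(column)
    # Sized selectors (lists, tuples): empty when they have length zero.
    if hasattr(column, "__len__"):
        return len(column) == 0
    # Scalars and slices have no length, so this check cannot call them empty.
    return False


print(is_empty_selection([]))              # empty list
print(is_empty_selection([False, False]))  # all-False mask
print(is_empty_selection([False, True]))   # mask selecting one column
print(is_empty_selection(slice(0, 0)))     # slice: not detected by this check
```

Note that a slice falls through to the final `return False`, even when it selects nothing.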

jnothman · 2018-09-22T11:01:43Z

Okay, let's do the unfitted transformer. Shrug.

jorisvandenbossche · 2018-09-24T09:48:11Z

@janvanrijn do you have time to update this today? If not, I can also push some updates (just asking, as we want to get this in as fast as possible, it is one of the remaining blockers for the release)

janvanrijn · 2018-09-24T10:37:40Z

I am extremely busy with another task at the moment, but I can make some time for this. If you have more time, and since you have a better overview of what exactly should happen, feel free to alter/update/change/copy from/remove/bypass this PR.

jorisvandenbossche · 2018-09-24T12:30:02Z

@janvanrijn OK, I pushed some updates to this PR.

jorisvandenbossche · 2018-09-24T15:40:30Z

@jnothman after updating this PR to also check for all-False boolean arrays, I realized that slices can also be empty. But to properly determine whether they are empty, you need the X data, which I actually just removed from _iter in the previous PR (#12107).
Not difficult to add this back, but it also raises the question of whether this should only be done during fit, no longer relying on the data during transform (similar to how we decided not to evaluate the function any more during transform). This is certainly all possible, but will complicate the code a bit further. Thoughts?
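The dependence on the data can be illustrated with a small sketch. It assumes positional integer slices only (label-based pandas slices would need different handling), and `slice_is_empty` is a hypothetical helper name, not something in scikit-learn.

```python
def slice_is_empty(selection, n_columns):
    """Return True if a positional slice selects no columns out of `n_columns`."""
    # slice.indices clips start/stop/step to the given length.
    start, stop, step = selection.indices(n_columns)
    return len(range(start, stop, step)) == 0


# The same slice can be empty or not, depending on the width of X.
print(slice_is_empty(slice(3, None), n_columns=3))
print(slice_is_empty(slice(3, None), n_columns=10))
# Some slices are empty regardless of the data.
print(slice_is_empty(slice(2, 2), n_columns=5))
```

Because `slice(3, None)` is empty for a 3-column input but not for a 10-column one, the answer cannot be derived from the selector alone.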

amueller · 2018-09-24T17:18:19Z

I think we should determine this only during fit. I'm not sure how you would consistently determine this during transform.

How about we release and do this in 0.20.1? There's always a .1 any way...

jorisvandenbossche · 2018-09-24T22:03:45Z

I'm not sure how you would consistently determine this during transform.

Well, in most cases this will not be a problem in practice. Because if the structure of your data does not change, this determination of empty selection will be consistent, and if the structure of your data does change between fit and transform, your transform will be fucked up anyway.
It is more out of principle, and to avoid redundant checks, that we should maybe only do this at fit time.

How about we release and do this in 0.20.1? There's always a .1 any way...

If you want to release 0.20 the coming hours, I would propose to merge this as is (it already fixes the main use case, just not yet empty slices), and then we can expand the fix for 0.20.1
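One way to picture the "decide at fit time only" idea from this exchange is the toy sketch below. The class `ResolvedSelections` and its methods are hypothetical and operate on a single row given as a plain list; it only illustrates storing the resolved column indices during fit and reusing them during transform.

```python
class ResolvedSelections:
    """Toy helper: resolve column selectors once, at fit time."""

    def fit(self, selections, n_columns):
        # selections: dict mapping a name to a list of ints or a positional slice.
        self.columns_ = {}
        for name, sel in selections.items():
            if isinstance(sel, slice):
                idx = list(range(*sel.indices(n_columns)))
            else:
                idx = list(sel)
            self.columns_[name] = idx
        return self

    def transform(self, row):
        # Only the indices stored at fit time are used; empty entries are skipped.
        return {
            name: [row[i] for i in idx]
            for name, idx in self.columns_.items()
            if idx
        }


resolver = ResolvedSelections().fit(
    {"a": slice(0, 2), "b": slice(5, None), "c": []}, n_columns=3
)
print(resolver.columns_)
print(resolver.transform([10, 20, 30]))
# A wider row at transform time does not change what was decided during fit.
print(resolver.transform([1, 2, 3, 4, 5, 6, 7]))
```

With this design, "b" stays empty at transform time even if the input later has more columns, so the emptiness check is never repeated against new data.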

amueller

looks good

rth

LGTM as well. Merging, thanks!

janvanrijn · 2018-09-25T20:42:20Z

I am able to run the random bot now without problems. Thanks for merging, this is really great

jorisvandenbossche · 2018-09-25T22:03:48Z

Opened a follow-up issue here: #12162

* tag '0.20.0': (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...

added solution by Joris and testcases

ddacaed

rth reviewed Sep 14, 2018

janvanrijn changed the title ~~added solution by Joris and testcases~~ ColumnTransformer generalization to work on empty lists Sep 14, 2018

jnothman reviewed Sep 15, 2018

jnothman reviewed Sep 17, 2018

incorporated comments by Roman and Joel

c98a6c1

jnothman reviewed Sep 17, 2018

jnothman reviewed Sep 17, 2018

eamanu reviewed Sep 18, 2018

fix pep8 problems

afab865

jnothman reviewed Sep 20, 2018

janvanrijn added 2 commits September 21, 2018 15:16

Merge pull request #1 from janvanrijn/master

d3c5d19

master

improved doc string

faaa9a3

jorisvandenbossche reviewed Sep 21, 2018

jorisvandenbossche added this to the 0.20 milestone Sep 24, 2018

jorisvandenbossche added 3 commits September 24, 2018 13:43

change 'empty' to unfitted transformer

3b7f01a

also support all-False boolean arrays

7e70d16

combine basic and pandas test with parametrization

3c281ea

fix pep8

7a5ccaf

amueller approved these changes Sep 25, 2018

rth approved these changes Sep 25, 2018

rth merged commit e58f366 into scikit-learn:master Sep 25, 2018

amueller pushed a commit that referenced this pull request Sep 25, 2018

ColumnTransformer generalization to work on empty lists (#12084)

f659f55

janvanrijn deleted the fix_#12071 branch September 25, 2018 20:42

jorisvandenbossche mentioned this pull request Sep 25, 2018

Ensure ColumnTransformer also works with empty slices as column selector #12162

Open
