Skip to content

ColumnTransformer generalization to work on empty lists #12084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 25, 2018

Conversation

janvanrijn
Copy link
Contributor

@janvanrijn janvanrijn commented Sep 14, 2018

Reference Issues/PRs

Fixes #12071

What does this implement/fix? Explain your changes.

Column Transformer can handle 'empty' columns.

Any other comments?

main code by @jorisvandenbossche
test cases by me

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this PR!

Could you please change the title to a more meaningful description of what this PR does? That would help PR triage.

@@ -111,6 +111,20 @@ def test_column_transformer():
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2

# test case that ensures that the column transformer does also work when
# a given transformer doesn't have any columns to work on
ct = ColumnTransformer([('trans1', Trans(), [0, 1]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be a separate test? Also it might be worth checking that

ColumnTransformer([('trans1', Trans(), [])])

is an identity operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you mean?

ct = ColumnTransformer([('trans', Trans(), [])], 
                           remainder='passthrough')
assert_array_equal(ct.fit_transform(X_array), X_res_both)
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2

@janvanrijn janvanrijn changed the title added solution by Joris and testcases ColumnTransformer generalization to work on empty lists Sep 14, 2018
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update documentation (of transformers_ at least).

I've not looked at tests.

@@ -335,6 +337,8 @@ def _update_fitted_transformers(self, transformers):
# so get next transformer, but save original string
next(transformers)
trans = 'passthrough'
elif hasattr(column, '__len__') and len(column) == 0:
trans = old
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather trans = 'drop'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or trans = 'empty'

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like this merged before 0.20rc2...

@@ -335,6 +337,8 @@ def _update_fitted_transformers(self, transformers):
# so get next transformer, but save original string
next(transformers)
trans = 'passthrough'
elif hasattr(column, '__len__') and len(column) == 0:
trans = old
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or trans = 'empty'

@janvanrijn
Copy link
Contributor Author

I incorporated the comments by @rth and @jnothman Let me know if I can do anything more, I will handle this PR with priority

def test_column_transformer_empty_columns():
# test case that ensures that the column transformer does also work when
# a given transformer doesn't have any columns to work on
X_array = np.array([[0, 1, 2], [2, 4, 6]]).T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also test with a DataFrame if pandas is available

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is in the fn test_column_transformer_dataframe (needs to be skipped if no pandas)

@jnothman
Copy link
Member

Sorry for my confusion about columns. I thought it was an extract from X

1 similar comment
@jnothman
Copy link
Member

Sorry for my confusion about columns. I thought it was an extract from X

remainder='passthrough')
assert_array_equal(ct.fit_transform(X_array), X_res_both)
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2 # including remainder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take care with the pep8

assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2 # including remainder

fixture = np.array([[],[],[]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem here

remainder='drop')
assert_array_equal(ct.fit_transform(X_array), fixture)
assert_array_equal(ct.fit(X_array).transform(X_array), fixture)
assert len(ct.transformers_) == 2 # including remainder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8

@rth
Copy link
Member

rth commented Sep 19, 2018

Travis CI is failing..

@janvanrijn
Copy link
Contributor Author

I (presumably) fixed the pep8 problems. Should this issue be resolved first / as well?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check more explicitly the values in transformers_ and the presence of 'empty'

assert_array_equal(ct.fit_transform(X_df), X_res_both)
assert_array_equal(ct.fit(X_df).transform(X_df), X_res_both)
assert len(ct.transformers_) == 2
assert ct.transformers_[-1][0] != 'remainder'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you check that this is 'empty'?

estimator, 'drop', or 'passthrough'. If there are remaining columns,
the final element is a tuple of the form:
estimator, 'drop', or 'passthrough'. Note that the column list is
allowed to be empty. In that case the transformers will not be fitted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of date. "Where there were no columns selected, this string 'empty' will stand in place of the transformer."

@jorisvandenbossche
Copy link
Member

@jnotham I am still not fully convinced of the 'empty' string, see #12071 (comment)

@jnothman
Copy link
Member

I find 'empty' more explicit, while an estimator without its fitted attributes might be confusing. But I don't mind very much.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to an empty list, I think we should also check for an all-False boolean array

@jnothman
Copy link
Member

Okay, let's do the unfitted transformer. Shrug.

@jorisvandenbossche
Copy link
Member

@janvanrijn do you have time to update this today? If not, I can also push some updates (just asking, as we want to get this in as fast as possible, it is one of the remaining blockers for the release)

@jorisvandenbossche jorisvandenbossche added this to the 0.20 milestone Sep 24, 2018
@janvanrijn
Copy link
Contributor Author

I am extremely busy with some other task atm, but i can make some time for this. If you have more time, and as you guys have a better overview of what exactly should happen, feel free to alter/update/change/copy from/remove/bypass this pr.

@jorisvandenbossche
Copy link
Member

@janvanrijn OK, I pushed some updates to this PR.

@jorisvandenbossche
Copy link
Member

@jnothman after updating this PR to also check for all-False boolean arrays, I realized that also slices can be empty .. But to fully properly determine of they are empty, you need the X data, which I actually just removed from _iter in the previous PR (#12107).
Not difficult to add this back, but it also raises the question if this should only be done during fit and not anymore rely on the data during transform (similarly as we decided to not evaluate the function any more during transform). This is certainly all possible, but will complicate the code a bit further .. Thoughts ?

@amueller
Copy link
Member

I think we should determine this only during fit. I'm not sure how you would consistently determine this during transform.

How about we release and do this in 0.20.1? There's always a .1 any way...

@jorisvandenbossche
Copy link
Member

I'm not sure how you would consistently determine this during transform.

Well, in most cases this will not be a problem in practice. Because if the structure of your data does not change, this determination of empty selection will be consistent, and if the structure of your data does change between fit and transform, your transform will be fucked up anyway.
It is more out of principle, and to avoid redundant checks, that we should maybe only do this at fit time.

How about we release and do this in 0.20.1? There's always a .1 any way...

If you want to release 0.20 the coming hours, I would propose to merge this as is (it already fixes the main use case, just not yet empty slices), and then we can expand the fix for 0.20.1

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well. Merging, thanks!

@rth rth merged commit e58f366 into scikit-learn:master Sep 25, 2018
@janvanrijn
Copy link
Contributor Author

I am able to run the random bot now without problems. Thanks for merging, this is really great

@jorisvandenbossche
Copy link
Member

Opened a follow-up issue here: #12162

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 29, 2018
* tag '0.20.0': (77 commits)
  ColumnTransformer generalization to work on empty lists (scikit-learn#12084)
  add sparse_threshold to make_columns_transformer (scikit-learn#12152)
  [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104)
  Change version to 0.20.0
  BUG: check equality instead of identity in check_cv (scikit-learn#12155)
  [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114)
  [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089)
  DOC Removed duplicated doc in tree.rst (scikit-learn#11922)
  [MRG] DOC covariance doctest examples (scikit-learn#12124)
  typo and formatting fixes in 0.20 doc (scikit-learn#11963)
  DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133)
  [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107)
  Fix typo (scikit-learn#12126)
  DOC Typo in OneHotEncoder
  DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117)
  DOC Removing quotes from variant names. (scikit-learn#12113)
  DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108)
  [MRG] MNT Re-enable PyPy CI (scikit-learn#12039)
  MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048)
  TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093)
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 29, 2018
* releases: (77 commits)
  ColumnTransformer generalization to work on empty lists (scikit-learn#12084)
  add sparse_threshold to make_columns_transformer (scikit-learn#12152)
  [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104)
  Change version to 0.20.0
  BUG: check equality instead of identity in check_cv (scikit-learn#12155)
  [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114)
  [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089)
  DOC Removed duplicated doc in tree.rst (scikit-learn#11922)
  [MRG] DOC covariance doctest examples (scikit-learn#12124)
  typo and formatting fixes in 0.20 doc (scikit-learn#11963)
  DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133)
  [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107)
  Fix typo (scikit-learn#12126)
  DOC Typo in OneHotEncoder
  DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117)
  DOC Removing quotes from variant names. (scikit-learn#12113)
  DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108)
  [MRG] MNT Re-enable PyPy CI (scikit-learn#12039)
  MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048)
  TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093)
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 29, 2018
* dfsg: (77 commits)
  ColumnTransformer generalization to work on empty lists (scikit-learn#12084)
  add sparse_threshold to make_columns_transformer (scikit-learn#12152)
  [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104)
  Change version to 0.20.0
  BUG: check equality instead of identity in check_cv (scikit-learn#12155)
  [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114)
  [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089)
  DOC Removed duplicated doc in tree.rst (scikit-learn#11922)
  [MRG] DOC covariance doctest examples (scikit-learn#12124)
  typo and formatting fixes in 0.20 doc (scikit-learn#11963)
  DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133)
  [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107)
  Fix typo (scikit-learn#12126)
  DOC Typo in OneHotEncoder
  DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117)
  DOC Removing quotes from variant names. (scikit-learn#12113)
  DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108)
  [MRG] MNT Re-enable PyPy CI (scikit-learn#12039)
  MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048)
  TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ColumnTransformer requires at least one column for each part it transforms
6 participants