Skip to content

ColumnTransformer requires at least one column for each part it transforms #12071

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
janvanrijn opened this issue Sep 13, 2018 · 33 comments · Fixed by #12084
Closed

ColumnTransformer requires at least one column for each part it transforms #12071

janvanrijn opened this issue Sep 13, 2018 · 33 comments · Fixed by #12084
Milestone

Comments

@janvanrijn
Copy link
Contributor

janvanrijn commented Sep 13, 2018

Description

ColumnTransformer requires at least one column for each part it transforms. This sounds logical, but makes automatic experimentation across datasets with mixed input types hard to apply with a single sklearn model. I would need three separate models for:

  • datasets that have only numeric inputs
  • datasets that have only categorical inputs
  • datasets that have mixed (numeric / categorical) inputs

Of course, this is doable, but it would be extremely convenient to be able to do all this with one sklearn model.

Steps/Code to Reproduce

import sklearn
import sklearn.datasets
import sklearn.compose
import sklearn.tree
import sklearn.impute

X, y = sklearn.datasets.fetch_openml('iris', 1, return_X_y=True)

numeric_transformer = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.Imputer(),
    sklearn.preprocessing.StandardScaler())

categorical_transformer = sklearn.pipeline.make_pipeline(
    sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing'),
    sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')
)

transformer = sklearn.compose.ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, []),
        ('nominal', categorical_transformer, [0,1,2,3])],
    remainder='passthrough')

clf = sklearn.pipeline.make_pipeline(transformer, sklearn.tree.DecisionTreeClassifier())

clf.fit(X, y)

Expected Results

a fitted model :)

Actual Results

Traceback (most recent call last):
  File "/home/janvanrijn/projects/sklearn-bot/testjan.py", line 25, in <module>
    clf.fit(X, y)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py", line 425, in fit_transform
    result = self._fit_transform(X, y, _fit_transform_one)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/compose/_column_transformer.py", line 371, in _fit_transform
    X=X, fitted=fitted, replace_strings=True))
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 298, in fit_transform
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/base.py", line 462, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 158, in fit
    force_all_finite=False)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/utils/validation.py", line 585, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(150, 0)) while a minimum of 1 is required.

Versions

I just installed the git branch 0.20.X

Proposed solution

I can author a PR that checks the column count, or passes through a constant dummy column

@jorisvandenbossche
Copy link
Member

@janvanrijn thanks for the feedback!
Yes, I think this is a case that we would want to support. The question is maybe whether it are the other transformers that should support 0 features (it is the imputer that is raising the error), or whether the ColumnTransformer should be smart enough to not pass the data through in this case.

That second possibility (what you also mention), is certainly the easiest to do.

I am trying to think of cases where this could give a problem. The fitted transformers_ would then hold a transformer that is actually not fitted. This introduces some inconsistency, but not sure if that would give problems in practice.

cc @jnothman

@Ryanglambert
Copy link

Ryanglambert commented Sep 14, 2018 via email

@amueller
Copy link
Member

I'm in favor of ignoring empty feature lists. I think many transformers will require >0 features.

do we need to store the transformer in transformers_?

@jorisvandenbossche
Copy link
Member

do we need to store the transformer in transformers_?

I would do that, otherwise the length and index into the original passed transformers and the fitted transformers_ attribute gets out of sync, which doesn't seem ideal?

@amueller
Copy link
Member

hmmm good point.

@janvanrijn
Copy link
Contributor Author

Yes, I think this is a case that we would want to support.

Awesome.

Is there somewhere an extensive list of all transformers available? I could contribute a set of unit tests this afternoon to start either solution.

@amueller
Copy link
Member

@janvanrijn you can get one from all_estimators and if you want to add unit tests you could do that in the common tests that already have all the infrastructure to test something with all estimators. Not sure that's necessary here though.

@janvanrijn
Copy link
Contributor Author

@amueller just out of curiosity, why won't that be necessary?

Whichever of the proposed solutions is going to take only several lines of code, whereas the real trick is to properly test it (IMO).

@jorisvandenbossche
Copy link
Member

whereas the real trick is to properly test it (IMO).

Yes, but when doing this in the ColumnTransformer, we don't need to test it with every possible transformer, only testing that the transformer is not called in case there are no features is enough.

@jorisvandenbossche
Copy link
Member

Don't have further time this until next week, but this small patch seems to work (at least for the basics):

--- a/sklearn/compose/_column_transformer.py
+++ b/sklearn/compose/_column_transformer.py
@@ -234,6 +234,8 @@ boolean mask array or callable
                         check_inverse=False)
                 elif trans == 'drop':
                     continue
+                elif hasattr(column, '__len__') and len(column) == 0:
+                    continue
 
             yield (name, trans, sub, get_weight(name))
 
@@ -335,6 +337,8 @@ boolean mask array or callable
                 # so get next transformer, but save original string
                 next(transformers)
                 trans = 'passthrough'
+            elif hasattr(column, '__len__') and len(column) == 0:
+                trans = old
             else:
                 trans = next(transformers)
             transformers_.append((name, trans, column))

if we are OK with transformers_ containing unfitted transformers

@janvanrijn
Copy link
Contributor Author

Is there something I can do to facilitate this PR?

@janvanrijn
Copy link
Contributor Author

Adding the following line to 114 of test_column_transformer makes it fail.

ct = ColumnTransformer([('trans1', Trans(), [0, 1]),
                            ('trans2', Trans(), [])])
    assert_array_equal(ct.fit_transform(X_array), X_res_both)
    assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
    assert len(ct.transformers_) == 2

    ct = ColumnTransformer([('trans1', Trans(), []),
                            ('trans2', Trans(), [0, 1])])
    assert_array_equal(ct.fit_transform(X_array), X_res_both)
    assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
    assert len(ct.transformers_) == 2

the following error:

E       AssertionError: assert 1 == 2
E        +  where 1 = len([('trans1', Trans(), [0, 1])])
E        +    where [('trans1', Trans(), [0, 1])] = ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,\n         transformer_weights=None,\n         transformers=[('trans1', Trans(), [0, 1]), ('trans2', Trans(), [])]).transformers_

According the following message, I assume that you still want the transformers to be listed, right? I will look into it.

I would do that, otherwise the length and index into the original passed transformers and the fitted transformers_ attribute gets out of sync, which doesn't seem ideal?

@janvanrijn
Copy link
Contributor Author

I can't copy paste properly, please ignore my previous comment. Your code seems to survive my testcase properly.

@jnothman
Copy link
Member

jnothman commented Sep 15, 2018 via email

@jnothman
Copy link
Member

jnothman commented Sep 15, 2018 via email

@jorisvandenbossche
Copy link
Member

@jnothman I would find it a bit strange to change the transformer the user passed to a string (to indicate that you "dropping nothing" ..). I would personally rather have the inconsistency of an unfitted transformer in the fitted attribute, than losing the information of which transformer was passed (of course it is still available in the non-fitted attribute transformer, so maybe that is not really a problem)

@jnothman
Copy link
Member

jnothman commented Sep 17, 2018 via email

@jnothman jnothman added this to the 0.20 milestone Sep 17, 2018
@jorisvandenbossche
Copy link
Member

But it is the column selection that is (already) empty, not the transformer. So that's maybe what I find 'strange' about it to change the transformer.

I am trying to think about a situation where it might be useful to keep: eg where the column selection is done dynamically (with a function), and you don't know in advance that there will be such an 'empty' column selection. In such a case I think it might still be useful to be able to check in the transformers_ attribute which transformer received an empty set.
Although for this use case, I was assuming that we convert the function to the list of indices in the fitted transformer_ attribute, so you could easily see empty lists, but this is apparently not the case (we keep the function object as the column selector). But that's another issue we could consider.

@jnothman
Copy link
Member

jnothman commented Sep 17, 2018 via email

@janvanrijn
Copy link
Contributor Author

So at what time will the list be formed?

@jorisvandenbossche
Copy link
Member

@jnothman do you still prefer changing the transformer to 'empty' over keeping an unfitted transformer in transformers_ (as we need to decide on this for the PR)?
I think I prefer the unfitted transformer, although that also giving an inconsistency.

Currently the strings we use ('passthrough', 'drop') are about an action on what to do with the selected columns, while 'empty' is not an action; it are the selected columns that are an empty list.

@ogrisel
Copy link
Member

ogrisel commented Sep 19, 2018

I think I am fine with what is proposed by @jorisvandenbossche : skip transformers that get 0 columns after selection but keep them unfitted in transformers_. They should not be used at transform time anyway.

@jnothman
Copy link
Member

I find 'empty' more explicit... For the most part I'm ambivalent

@jorisvandenbossche
Copy link
Member

If you don't really mind, then I would personally go to keeping the unfitted one.
Mainly because I find 'empty' a bit inconsistent with out other strings, and it is the column selection that is already empty.

@janvanrijn
Copy link
Contributor Author

Am I correct by inferring that updating the documentation was sufficient for making this consist with the requirements?

@jorisvandenbossche
Copy link
Member

I am a bit biased as this is my preference (so @jnothman can speak up), but I think the conclusion is to use the unfitted transformer and not the 'empty' string as you did now in the PR.

@jnothman
Copy link
Member

jnothman commented Sep 22, 2018 via email

@samborba
Copy link

samborba commented May 24, 2020

Hello everyone, everything good?

I would like to ask a question: it was said that it is now possible to work with empty lists, but if I have a list with all the values equals to "False" it still generates the same error mentioned in here. The scenario of me having an empty list and a list with all their values equal to "False" are not the same? I mean, ColumnTransformer will have no action to take...

Thank you for your help!!!

@jnothman
Copy link
Member

@samborba please open a new issue including code to reproduce your error. Thanks

@zachmayer
Copy link
Contributor

I've having the same issue: i have a selector function that's returning all False, which should be equivalent to an empty list.

But "all false" from the selector function is not being treated equivalent to "empty list from explicit column names".

I'll post code to reproduce later.

@zachmayer
Copy link
Contributor

@jnothman

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

def noop_selector(X):
    return [False for col in X]

working_pipeline = ColumnTransformer(
    transformers=[
        ('noop', SimpleImputer(), []),
    ]
)

broken_pipeline = ColumnTransformer(
    transformers=[
        ('noop', SimpleImputer(), noop_selector),
    ]
)

data = pd.DataFrame({'col': ['a', 'b', 'c']})
working_pipeline.fit(data)
broken_pipeline.fit(data)

The working pipeline skips the simple imputer. The broken pipeline tries to impute with no data, resulting in ValueError: at least one array or dtype is required.

@jnothman
Copy link
Member

@zachmayer thanks for the clear report, but could you please open a new issue for better visibility and tracking?

@zachmayer
Copy link
Contributor

@jnothman for sure! Here you go: #17586

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants