-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ColumnTransformer requires at least one column for each part it transforms #12071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@janvanrijn thanks for the feedback! That second possibility (what you also mention), is certainly the easiest to do. I am trying to think of cases where this could give a problem. The fitted cc @jnothman |
My 2 cents
When I read this it feels like ColumnTransformer should see the empty list
and not even attempt to use the numeric transformer. Kind of like when you
do "`for i in []`.
transformer = sklearn.compose.ColumnTransformer(
transformers=[
('numeric', numeric_transformer, []),
('nominal', categorical_transformer, [1,2,3,4])],
remainder='passthrough')
…On Fri, Sep 14, 2018, 3:15 AM Joris Van den Bossche < ***@***.***> wrote:
@janvanrijn <https://github.com/janvanrijn> thanks for the feedback!
Yes, I think this is a case that we would want to support. The question is
maybe whether it are the other transformers that should support 0 features
(it is the imputer that is raising the error), or whether the
ColumnTransformer should be smart enough to not pass the data through in
this case.
That second possibility (what you also mention), is certainly the easiest
to do.
I am trying to think of cases where this could give a problem. The fitted
transformers_ would then hold a transformer that is actually not fitted.
This introduces some inconsistency, but not sure if that would give
problems in practice.
cc @jnothman <https://github.com/jnothman>
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#12071 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ADmCPMmz9AKuDyCJMpfjwqKxaXthBGJ5ks5ua1eugaJpZM4WoRWh>
.
|
I'm in favor of ignoring empty feature lists. I think many transformers will require >0 features. do we need to store the transformer in |
I would do that, otherwise the length and index into the original passed |
hmmm good point. |
Awesome. Is there somewhere an extensive list of all transformers available? I could contribute a set of unit tests this afternoon to start either solution. |
@janvanrijn you can get one from |
@amueller just out of curiosity, why won't that be necessary? Whichever of the proposed solutions is going to take only several lines of code, whereas the real trick is to properly test it (IMO). |
Yes, but when doing this in the ColumnTransformer, we don't need to test it with every possible transformer, only testing that the transformer is not called in case there are no features is enough. |
Don't have further time this until next week, but this small patch seems to work (at least for the basics): --- a/sklearn/compose/_column_transformer.py
+++ b/sklearn/compose/_column_transformer.py
@@ -234,6 +234,8 @@ boolean mask array or callable
check_inverse=False)
elif trans == 'drop':
continue
+ elif hasattr(column, '__len__') and len(column) == 0:
+ continue
yield (name, trans, sub, get_weight(name))
@@ -335,6 +337,8 @@ boolean mask array or callable
# so get next transformer, but save original string
next(transformers)
trans = 'passthrough'
+ elif hasattr(column, '__len__') and len(column) == 0:
+ trans = old
else:
trans = next(transformers)
transformers_.append((name, trans, column)) if we are OK with |
Is there something I can do to facilitate this PR? |
Adding the following line to 114 of test_column_transformer makes it fail.
the following error:
According the following message, I assume that you still want the transformers to be listed, right? I will look into it.
|
I can't copy paste properly, please ignore my previous comment. Your code seems to survive my testcase properly. |
Can we consider an empty set of column indices to imply 'drop'?
|
Shouldn't that fix the problem in transformers_?
|
@jnothman I would find it a bit strange to change the transformer the user passed to a string (to indicate that you "dropping nothing" ..). I would personally rather have the inconsistency of an unfitted transformer in the fitted attribute, than losing the information of which transformer was passed (of course it is still available in the non-fitted attribute |
I don't see why an unfitted transformer is better than an indicator that it
was empty. Maybe 'empty'?
…On Mon, 17 Sep 2018 at 17:53, Joris Van den Bossche < ***@***.***> wrote:
@jnothman <https://github.com/jnothman> I would find it a bit strange to
change the transformer the user passed to a string (to indicate that you
"dropping nothing" ..). I would personally rather have the inconsistency of
an unfitted transformer in the fitted attribute, than losing the
information of which transformer was passed (of course it is still
available in the non-fitted attribute transformer, so maybe that is not
really a problem)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12071 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz62FbcuHGzI7V9EV1aeSxLTFRdTSwks5ub1TugaJpZM4WoRWh>
.
|
But it is the column selection that is (already) empty, not the transformer. So that's maybe what I find 'strange' about it to change the transformer. I am trying to think about a situation where it might be useful to keep: eg where the column selection is done dynamically (with a function), and you don't know in advance that there will be such an 'empty' column selection. In such a case I think it might still be useful to be able to check in the |
Yes, we need to realised those lists of indices. We can't leave them as a
function to apply at transform time! Good catch!
|
So at what time will the list be formed? |
@jnothman do you still prefer changing the transformer to 'empty' over keeping an unfitted transformer in Currently the strings we use ('passthrough', 'drop') are about an action on what to do with the selected columns, while 'empty' is not an action; it are the selected columns that are an empty list. |
I think I am fine with what is proposed by @jorisvandenbossche : skip transformers that get 0 columns after selection but keep them unfitted in |
I find 'empty' more explicit... For the most part I'm ambivalent |
If you don't really mind, then I would personally go to keeping the unfitted one. |
Am I correct by inferring that updating the documentation was sufficient for making this consist with the requirements? |
I am a bit biased as this is my preference (so @jnothman can speak up), but I think the conclusion is to use the unfitted transformer and not the 'empty' string as you did now in the PR. |
Go for it. Change it back to an unfitted estimator. But do note this
behaviour in the documentation of transformers_
…On Fri, 21 Sep 2018 at 23:36, Joris Van den Bossche < ***@***.***> wrote:
I am a bit biased as this is my preference (so @jnothman
<https://github.com/jnothman> can speak up), but I think the conclusion
is to use the unfitted transformer and not the 'empty' string as you did
now in the PR.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12071 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6w4Wmqxql-jyCNa1FlwxiYp3qsypks5udOtqgaJpZM4WoRWh>
.
|
Hello everyone, everything good? I would like to ask a question: it was said that it is now possible to work with empty lists, but if I have a list with all the values equals to "False" it still generates the same error mentioned in here. The scenario of me having an empty list and a list with all their values equal to "False" are not the same? I mean, ColumnTransformer will have no action to take... Thank you for your help!!! |
@samborba please open a new issue including code to reproduce your error. Thanks |
I've having the same issue: i have a selector function that's returning all False, which should be equivalent to an empty list. But "all false" from the selector function is not being treated equivalent to "empty list from explicit column names". I'll post code to reproduce later. |
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
def noop_selector(X):
return [False for col in X]
working_pipeline = ColumnTransformer(
transformers=[
('noop', SimpleImputer(), []),
]
)
broken_pipeline = ColumnTransformer(
transformers=[
('noop', SimpleImputer(), noop_selector),
]
)
data = pd.DataFrame({'col': ['a', 'b', 'c']})
working_pipeline.fit(data)
broken_pipeline.fit(data) The working pipeline skips the simple imputer. The broken pipeline tries to impute with no data, resulting in |
@zachmayer thanks for the clear report, but could you please open a new issue for better visibility and tracking? |
Description
ColumnTransformer requires at least one column for each part it transforms. This sounds logical, but makes automatic experimentation across datasets with mixed input types hard to apply with a single sklearn model. I would need three separate models for:
Of course, this is doable, but it would be extremely convenient to be able to do all this with one sklearn model.
Steps/Code to Reproduce
Expected Results
a fitted model :)
Actual Results
Versions
I just installed the git branch 0.20.X
Proposed solution
I can author a PR that checks the column count, or passes through a constant dummy column
The text was updated successfully, but these errors were encountered: