Named col indexing fails with ColumnTransformer remainder on changing DataFrame column ordering

#### Description
I am using ColumnTransformer to prepare (impute etc.) heterogenous data. I use a DataFrame to have more control on the different (types of) columns by their name.

I had some really cryptic problems when downstream transformers complained of data of the wrong type, while the ColumnTransformer should have divided them up properly.

I found out that ColumnTransformer silently passes the wrong columns along as `remainder` when
- specifying columns by name,
- using the `remainder` option, and using
- DataFrames where column ordering can differ between `fit` and `transform`

In this case, the wrong columns are passed on to the downstream transformers, as the example demonstrates:

#### Steps/Code to Reproduce
```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer
import pandas as pd

def msg(msg):
  def print_cols(X, y=None):
    print(msg, list(X.columns))
    return X
  return print_cols

ct = make_column_transformer(
  (FunctionTransformer(msg('col a'), validate=False), ['a']),
  remainder=FunctionTransformer(msg('remainder'), validate=False)
)

fit_df = pd.DataFrame({
  'a': [2,3], 
  'b': [4,5]})

ct.fit(fit_df)

# prints:
# cols a ['a']
# remainder ['b']

transform_df = pd.DataFrame({
  'b': [4,5],  # note that column ordering
  'a': [2,3]}) # is the only difference to fit_df

ct.transform(transform_df)

# prints:
# col a ['a']
# remainder ['a'] <-- Should be ['b']
```

#### Expected Results
When using ColumnTransformer with a DataFrame and specifying columns by name, `remainder` should reference the same columns when fitting and when transforming (['b'] in above example), regardless of the column positions in the data during fitting and transforming.

#### Actual Results
`remainder` appears to, during fitting, remember remaining named DataFrame columns by their numeric index (not by their names), which (silently) leads to the wrong columns being handled downstream if the transformed DataFrame's column ordering differs from that of the fitted DataFrame.

Position in module where the `remainder` index is determined:
https://github.com/scikit-learn/scikit-learn/blob/7813f7efb5b2012412888b69e73d76f2df2b50b6/sklearn/compose/_column_transformer.py#L309

My current workaround is to not use the `remainder` option but specify all columns by name explicitly.

#### Versions
System:
    python: 3.7.3 (default, Mar 30 2019, 03:44:34)  [Clang 9.1.0 (clang-902.0.39.2)]
executable: /Users/asschude/.local/share/virtualenvs/launchpad-mmWds3ry/bin/python
   machine: Darwin-17.7.0-x86_64-i386-64bit

BLAS:
    macros: NO_ATLAS_INFO=3, HAVE_CBLAS=None
  lib_dirs: 
cblas_libs: cblas

Python deps:
       pip: 19.1.1
setuptools: 41.0.1
   sklearn: 0.21.2
     numpy: 1.16.4
     scipy: 1.3.0
    Cython: None
    pandas: 0.24.2



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Named col indexing fails with ColumnTransformer remainder on changing DataFrame column ordering #14223

Description

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Named col indexing fails with ColumnTransformer remainder on changing DataFrame column ordering #14223

Description

Description

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions