[MRG] Convert ColumnTransformer input list to numpy array #12104

vinayak-mehta · 2018-09-18T11:50:33Z

Reference Issues/PRs

Fixes #12096.

What does this implement/fix? Explain your changes.

Converts the input list for ColumnTransformer to a numpy array.

Added a check inside transform and fit_transform: if the input X is a list, it is converted to a numpy array.
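For context, a minimal sketch of why the conversion matters, assuming the failure comes from column selection on a nested list. The variable names are illustrative and not taken from the PR.

```python
import numpy as np

# A plain list of rows, as a caller might pass to ColumnTransformer.
X_list = [[0.0, 1.0, 2.0],
          [3.0, 4.0, 5.0]]

# Nested lists do not support 2-D (row, column) indexing.
try:
    X_list[:, 0]
    list_indexing_works = True
except TypeError:
    list_indexing_works = False

# After conversion, selecting one or more columns is straightforward.
X_arr = np.array(X_list)
first_column = X_arr[:, [0]]   # 2-D slice holding only column 0
last_two = X_arr[:, [1, 2]]    # 2-D slice holding columns 1 and 2
```

The list form fails on the tuple index, while the array form supports selecting any subset of columns.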

Any other comments?

Should this conversion be documented in the docstrings for ColumnTransformer's fit, transform and fit_transform?

adrinjalali · 2018-09-18T11:55:40Z

sklearn/compose/_column_transformer.py

@@ -419,6 +419,9 @@ def fit_transform(self, X, y=None):
            sparse matrices.

        """
+        if isinstance(X, list):
+            X = np.array(X)


I think using check_array from utils/validation would be safer/cleaner.

Thanks for the tip! Added.

ogrisel · 2018-09-18T14:02:19Z

sklearn/compose/_column_transformer.py

@@ -467,6 +469,8 @@ def transform(self, X):
        """
        check_is_fitted(self, 'transformers_')

+        X = check_array(X)


A naive use of check_array will break pandas support. I think we should do something like:

if not hasattr(X, 'loc'): X = check_array(X)

We don't want to do an explicit isinstance to avoid introducing an explicit dependency on the pandas library.
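A rough sketch of that duck-typing guard, written so it runs without pandas installed. `FrameLike` and `maybe_convert` are hypothetical names, and `np.asarray` stands in for `check_array` here.

```python
import numpy as np


class FrameLike:
    """Hypothetical stand-in for a pandas DataFrame: it only exposes `.loc`."""

    def __init__(self, rows):
        self.rows = rows
        self.loc = self  # a real DataFrame exposes a label-based indexer here


def maybe_convert(X):
    # Objects that look like a dataframe (they have `.loc`) pass through
    # untouched; everything else is converted. np.asarray is an assumption
    # standing in for check_array in this sketch.
    if not hasattr(X, 'loc'):
        X = np.asarray(X)
    return X


frame = FrameLike([[1, 2], [3, 4]])
kept = maybe_convert(frame)
converted = maybe_convert([[1, 2], [3, 4]])
```

The attribute check never names a pandas type, so the module does not need to import pandas at all.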

Shouldn't such a check be done in check_array? Or else we should have this check almost everywhere if we want to keep the dataframe, don't we?

Further, check_array should also not convert to numeric or raise on sparse in this case (it should basically do no check at all). Therefore, in this case, I would maybe not use check_array?

I think in the end we want to improve check_array to also handle pandas dataframes, but that is a larger endeavour than fixing this bug.
@adrinjalali the column transformer is currently the only sklearn object where we don't convert the passed dataframe to an array, therefore this issue does not come up everywhere else

One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).

Further, it should also not enforce all finite values in this case.

One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).

So this is similar to ogrisel's solution above? I'm +1 to that solution (Maybe also add a comment to note down the limitation of current check_array).

One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).

Yes, with the difference that we also should not call check_array for arrays (Olivier only checks for dataframe), or otherwise specify a lot of flags to check_array to accept sparse, not error on nan data, ..

So for this bug fix, we should just check if X is a list using isinstance and then use check_array to convert it to an array, right?

Perhaps

if not hasattr(X, '__array__'): X = np.asarray(X)

?

(or do we want dtype='object' to be safe?)

We want dtype='object' arrays to not be converted I think, but ndarrays do have a __array__, so your proposal is fine for that?

Ah, sparse matrices apparently have no __array__, so an additional check should include that as well
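To make the `__array__` proposal concrete, a small sketch of which inputs the guard would convert. `needs_conversion` is a hypothetical helper name, and only NumPy inputs are exercised; the sparse case mentioned above is not covered.

```python
import numpy as np


def needs_conversion(X):
    # Mirrors the proposed guard: only objects without __array__ are converted.
    return not hasattr(X, '__array__')


plain_list = [[1, 2], [3, 4]]
float_array = np.zeros((2, 2))
object_array = np.array([[1, 'a'], [2, 'b']], dtype=object)

# Apply the guard to the object-dtype array: it already has __array__,
# so it is returned as-is and keeps its dtype.
result = np.asarray(object_array) if needs_conversion(object_array) else object_array
```

Because ndarrays of any dtype expose `__array__`, an object-dtype array is left alone, which matches the concern raised above.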

Shouldn't such a check be done in check_array?

Opened a new issue about this here: #12148

jnothman · 2018-09-23T03:46:27Z

Okay, I'll try to finish this as @vinayak-mehta is silent.

vinayak-mehta · 2018-09-23T04:20:00Z

I'm sorry @jnothman there's no excuse for being silent :| I should've done this sooner. Travis failed on Python 3.4, conda but it doesn't show any details, just says "stalled build or something wrong with the build itself.", any idea?

EDIT: just saw "No output has been received in the last 10m0s", it timed out. :?

jnothman · 2018-09-23T04:54:45Z

No worries. We just need this merged ASAP. Probably just needs a rerun.

jorisvandenbossche · 2018-09-24T06:05:49Z

sklearn/compose/_column_transformer.py

+    """Use check_array only on lists and other non-array-likes / sparse"""
+    if hasattr(X, '__array__') or sparse.issparse(X):
+        return X
+    return check_array(X)


I think we still need to allow missing values here? (force_all_finite='allow-nan')
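A sketch of the behaviour being asked for, assuming the intended semantics are "keep NaN, still reject infinity". `to_array_allow_nan` is a hypothetical NumPy-only helper, not the scikit-learn implementation, and it casts to float for simplicity.

```python
import numpy as np


def to_array_allow_nan(X):
    """Hypothetical helper: convert to a float array, reject inf, keep NaN."""
    arr = np.asarray(X, dtype=float)
    if np.isinf(arr).any():
        raise ValueError("Input contains infinity.")
    return arr


with_nan = to_array_allow_nan([[1.0, np.nan], [2.0, 3.0]])

try:
    to_array_allow_nan([[1.0, np.inf]])
    inf_rejected = False
except ValueError:
    inf_rejected = True
```

Missing values survive the conversion, so a downstream imputer in the column transformer can still see them.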


jorisvandenbossche · 2018-09-24T09:08:08Z

@ogrisel is the dtype=object the right thing to do? (this is typically not done for other estimators)

ogrisel · 2018-09-24T09:08:23Z

I pushed a new commit to address @jorisvandenbossche's comment (#12104 (review)).

In the process, I also made sure that the column transformer does not implicitly change the dtype of the list content especially when the content is heterogeneously typed.

In turn that made me change check_array(data, warn_on_dtype=True) to be silent if data has an object dtype. I think the warning is spurious in this case. One might disagree but otherwise, I do not see any way to make the valid case I put in the test not raise any warning.

jorisvandenbossche · 2018-09-24T09:09:03Z

Although, that leaves it to the actual transformer to which the data is passed to coerce the X data to numeric if needed or not, so probably this is a good idea.

ogrisel · 2018-09-24T09:09:50Z

@ogrisel is the dtype=object the right thing to do? (this is typically not done for other estimators)

I think we have to do it: CT is the only estimator to naturally accept heterogeneously typed collections. It should not try to convert everything to a numpy string dtype because one column is Python string valued.
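A short illustration of the dtype concern, using plain NumPy on a mixed list. The variable names are illustrative only.

```python
import numpy as np

rows = [[1, 'a'], [2, 'b']]

# Without an explicit dtype, NumPy picks one common dtype for all elements,
# which here is a string dtype: the integers become strings.
as_default = np.array(rows)

# With dtype=object, each element stays the original Python object.
as_object = np.array(rows, dtype=object)

numeric_column = as_object[:, 0]
```

With the object dtype, the numeric column can be handed to a numeric transformer unchanged, and the coercion decision is left to that transformer.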

ogrisel · 2018-09-24T09:12:03Z

Although, that leaves it to the actual transformer to which the data is passed to coerce the X data to numeric if needed or not, so probably this is a good idea.

Indeed that's my thought: CT should stay as neutral as possible w.r.t. dtypes. A Python list is a list of Python objects. Here we just need arrays to be able to index on columns.

jorisvandenbossche · 2018-09-24T09:13:08Z

In turn that made me change check_array(data, warn_on_dtype=True) to be silent if data has an object dtype. I think the warning is spurious in this case. One might disagree but I do not see any way to make the valid case I put in the test not raise any warning otherwise.

How was this triggered by this PR? As lists should not have a dtype_orig ?

jorisvandenbossche · 2018-09-24T09:16:11Z

OK, I suppose this is then for the actual transformers (e.g. StandardScaler) that get the object-dtyped subarray? That would indeed make the list use case rather annoying, and impossible to do without a warning.

ogrisel · 2018-09-24T09:19:16Z

Yes, the numerical scalers (StandardScaler, MinMaxScaler, ...) all have warn_on_dtype=True in their input checks.

jorisvandenbossche · 2018-09-24T15:29:05Z

@ogrisel changing warn_on_dtype not to warn in case of object data, raises the question if this should then also be changed in case of object DataFrames: #10949 (comment)

jorisvandenbossche · 2018-09-25T08:55:47Z

From live discussion with @ogrisel: for this PR and to get the basic lists fix merged, let's leave the warn_on_dtype change out. This means that actually using ColumnTransformer with a list of lists will for now give a not-very-useful warning in combination with some transformers, but that is less urgent to fix.
And we can then have a more general discussion about the usefulness of warn_on_dtype.

Of course, since it seems Andy actually tagged, we might have a little bit more time again for a bugfix release.

jorisvandenbossche · 2018-09-25T14:10:06Z

CI is passing here, I think this should be good to go now


* tag '0.20.0': (77 commits) ColumnTransformer generalization to work on empty lists (scikit-learn#12084) add sparse_threshold to make_columns_transformer (scikit-learn#12152) [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104) Change version to 0.20.0 BUG: check equality instead of identity in check_cv (scikit-learn#12155) [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114) [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089) DOC Removed duplicated doc in tree.rst (scikit-learn#11922) [MRG] DOC covariance doctest examples (scikit-learn#12124) typo and formatting fixes in 0.20 doc (scikit-learn#11963) DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133) [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107) Fix typo (scikit-learn#12126) DOC Typo in OneHotEncoder DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117) DOC Removing quotes from variant names. (scikit-learn#12113) DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108) [MRG] MNT Re-enable PyPy CI (scikit-learn#12039) MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048) TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093) ...


Convert ColumnTransformer input list to numpy array

67ddad5

vinayak-mehta changed the title ~~Convert ColumnTransformer input list to numpy array~~ [MRG] Convert ColumnTransformer input list to numpy array Sep 18, 2018

adrinjalali reviewed Sep 18, 2018

Add check_array

b073a68

ogrisel reviewed Sep 18, 2018

Do not cast dataframes, etc.

5cca32e

Merge branch 'master' into ctranslist

1fc61a5

jnothman added this to the 0.20 milestone Sep 23, 2018

jorisvandenbossche mentioned this pull request Sep 24, 2018

ColumnTransformer breaks where X is a list #12096

Closed

jorisvandenbossche reviewed Sep 24, 2018

Make CT accept list of mixed typed objects including nans without con…

864c2cc

…versions

This was referenced Sep 24, 2018

Improvements to check_array to handle heterogenous / object data #12148

Open

[MRG+2] warn_on_dtype for DataFrames #10949

Merged

undo warn_on_dtype change + catch warning in test

dc7f74d

amueller merged commit f15ebb9 into scikit-learn:master Sep 25, 2018

vinayak-mehta deleted the column-transformer-list branch September 25, 2018 15:01
