Skip to content

[MRG] Convert ColumnTransformer input list to numpy array #12104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Sep 25, 2018

Conversation

vinayak-mehta
Copy link
Contributor

@vinayak-mehta vinayak-mehta commented Sep 18, 2018

Reference Issues/PRs

Fixes #12096.

What does this implement/fix? Explain your changes.

Converts the input list for ColumnTransformer to a numpy array.

Added a check inside transform and fit_transform to check if the input X is a list, if it is then it gets converted to a numpy array.

Any other comments?

Should this conversion be documented in the docstrings for ColumnTransfomer's fit, transform and fit_transform?

@vinayak-mehta vinayak-mehta changed the title Convert ColumnTransformer input list to numpy array [MRG] Convert ColumnTransformer input list to numpy array Sep 18, 2018
@@ -419,6 +419,9 @@ def fit_transform(self, X, y=None):
sparse matrices.

"""
if isinstance(X, list):
X = np.array(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using check_array from utils/validation would be safer/cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip! Added.

@@ -467,6 +469,8 @@ def transform(self, X):
"""
check_is_fitted(self, 'transformers_')

X = check_array(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A naive use of check_array will break pandas support. I think we should do something like:

if not hasattr(X, 'loc'):
    X = check_array(X)

We don't want to do an explicit isinstance to avoid introducing an explicit dependency on the pandas library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SShouldn't such a check be done in check_array? Or else we should have this check almost everywhere if we want to keep the dataframe, don't we?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, check_array should also not convert to numeric or raise on sparse in this case (it should basically do no check at all). Therefore, in this case, I would maybe not use check_array?

I think in the end we want to improve check_array to also handle pandas dataframes, but that is a larger endeavour than fixing this bug.
@adrinjalali the column transformer is currently the only sklearn object where we don't convert the passed dataframe to an array, therefore this issue does not come up everywhere else

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).

Further, it should also not enforce all finite values in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).

So this is similar to ogrisel's solution above? I'm +1 to that solution (Maybe also add a comment to note down the limitation of current check_array).

Copy link
Contributor Author

@vinayak-mehta vinayak-mehta Sep 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility would be to only call check_array if we actually have a list (or the other way around if X is not yet a numpy ndarray or pandas dataframe).

Yes, with the difference that we also should not call check_array for arrays (Olivier only checks for dataframe), or otherwise specify a lot of flags to check_array to accept sparse, not error on nan data, ..

So for this bug fix, we should just check if X is a list using isinstance and then use check_array to convert it to an array, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps

if not hasattr(X, '__array__'):
    X = np.asarray(X)

?

(or do we want dtype='object' to be safe?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want dtype='object' arrays to not be converted I think, but ndarrays do have a __array__, so your proposal is fine for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sparse matrices apparently have no __array__, so an additional check should include that as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't such a check be done in check_array?

Opened a new issue about this here: #12148

@jnothman
Copy link
Member

Okay, I'll try to finish this as @vinayak-mehta is silent.

@vinayak-mehta
Copy link
Contributor Author

vinayak-mehta commented Sep 23, 2018

I'm sorry @jnothman there's no excuse for being silent :| I should've done this sooner. Travis failed on Python 3.4, conda but it doesn't show any details, just says "stalled build or something wrong with the build itself.", any idea?

EDIT: just saw "No output has been received in the last 10m0s", it timed out. :?

@jnothman
Copy link
Member

jnothman commented Sep 23, 2018 via email

"""Use check_array only on lists and other non-array-likes / sparse"""
if hasattr(X, '__array__') or sparse.issparse(X):
return X
return check_array(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to allow missing values here? (force_all_finite='allow-nan')

@jorisvandenbossche
Copy link
Member

@ogrisel is the dtype=object the right thing to do? (this is typically not done for other estimators)

@ogrisel
Copy link
Member

ogrisel commented Sep 24, 2018

I pushed a new commit to address @jorisvandenbossche's comment (#12104 (review)).

In the process, I also made sure that the column transformer does not implicitly change the dtype of the list content especially when the content is heterogeneously typed.

In turn that made me change check_array(data, warn_on_dtype=True) to be silent if data has an object dtype. I think the warning is spurious in this case. One might disagree but otherwise, I do not see any way to make the valid case I put in the test not raise any warning.

@jorisvandenbossche
Copy link
Member

Although, that leaves it to the actual transformer to which the data is passed to coerce the X data to numeric if needed or not, so probably this is a good idea.

@ogrisel
Copy link
Member

ogrisel commented Sep 24, 2018

@ogrisel is the dtype=object the right thing to do? (this is typically not done for other estimators)

I think we have to do it: CT is the only estimator to naturally accept heterogeneously typed collections. It should not try to convert everything to a numpy string dtype because one column is Python string valued.

@ogrisel
Copy link
Member

ogrisel commented Sep 24, 2018

Although, that leaves it to the actual transformer to which the data is passed to coerce the X data to numeric if needed or not, so probably this is a good idea.

Indeed that's my thought: CT should stay as neutral as possible w.r.t. to dtypes. A Python list is a list of Python objects. Here we just need arrays to be able to index on columns.

@jorisvandenbossche
Copy link
Member

In turned that maded me change check_array(data, warn_on_dtype=True) to be sildent if data has an object dtype. I thing the warning is spurious in this case. One might disagree but I do not see any way to make the valid case I put in the test no raise any warning otherwise.

How was this triggered by this PR? As lists should not have a dtype_orig ?

@jorisvandenbossche
Copy link
Member

OK, I suppose this is then for the actual transformers (eg StandardScaler) that gets the object dtyped subarray? That would make the use case of a list indeed rather annoying and impossible to do without a warning.

@ogrisel
Copy link
Member

ogrisel commented Sep 24, 2018

Yes, the numerical scalers (StandardScaler, MinMaxScaler, ...) all have warn_on_dtype=True in their input checks.

@jorisvandenbossche
Copy link
Member

@ogrisel changing warn_on_dtype not to warn in case of object data, raises the question if this should then also be changed in case of object DataFrames: #10949 (comment)

@jorisvandenbossche
Copy link
Member

From live discussion with @ogrisel: for this PR and to get the basic lists fix merged, let's leave the warn_on_dtype change out. This means that actually using ColumnTransformer with list of lists in practice will for now give a not-very-useful warning in combination with some transformers, but that is less urgent to fix.
And we can then have a more general discussion about the usefulness of warn_on_dtype.

Of course, since it seems Andy actually tagged, we might have a little bit more time again for a bugfix release.

@jorisvandenbossche
Copy link
Member

CI is passing here, I think this should be good to go now

@amueller amueller merged commit f15ebb9 into scikit-learn:master Sep 25, 2018
amueller pushed a commit that referenced this pull request Sep 25, 2018
<!--
Thanks for contributing a pull request! Please ensure you have taken a look at
the contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist
-->

#### Reference Issues/PRs
<!--
Example: Fixes #1234. See also #3456.
Please use keywords (e.g., Fixes) to create link to the issues or pull requests
you resolved, so that they will automatically be closed when your pull request
is merged. See https://github.com/blog/1506-closing-issues-via-pull-requests
-->
Fixes #12096.

#### What does this implement/fix? Explain your changes.
Converts the input list for ColumnTransformer to a numpy array.

Added a check inside `transform` and `fit_transform` to check if the input `X` is a list, if it is then it gets converted to a numpy array.

#### Any other comments?
Should this conversion be documented in the docstrings for ColumnTransfomer's `fit`, `transform` and `fit_transform`?

<!--
Please be aware that we are a loose team of volunteers so patience is
necessary; assistance handling other issues is very welcome. We value
all user contributions, no matter how minor they are. If we are slow to
review, either the pull request needs some benchmarking, tinkering,
convincing, etc. or more likely the reviewers are simply busy. In either
case, we ask for your understanding during the review process.
For more information, see our FAQ on this topic:
http://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.

Thanks for contributing!
-->
@vinayak-mehta vinayak-mehta deleted the column-transformer-list branch September 25, 2018 15:01
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 29, 2018
* tag '0.20.0': (77 commits)
  ColumnTransformer generalization to work on empty lists (scikit-learn#12084)
  add sparse_threshold to make_columns_transformer (scikit-learn#12152)
  [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104)
  Change version to 0.20.0
  BUG: check equality instead of identity in check_cv (scikit-learn#12155)
  [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114)
  [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089)
  DOC Removed duplicated doc in tree.rst (scikit-learn#11922)
  [MRG] DOC covariance doctest examples (scikit-learn#12124)
  typo and formatting fixes in 0.20 doc (scikit-learn#11963)
  DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133)
  [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107)
  Fix typo (scikit-learn#12126)
  DOC Typo in OneHotEncoder
  DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117)
  DOC Removing quotes from variant names. (scikit-learn#12113)
  DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108)
  [MRG] MNT Re-enable PyPy CI (scikit-learn#12039)
  MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048)
  TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093)
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 29, 2018
* releases: (77 commits)
  ColumnTransformer generalization to work on empty lists (scikit-learn#12084)
  add sparse_threshold to make_columns_transformer (scikit-learn#12152)
  [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104)
  Change version to 0.20.0
  BUG: check equality instead of identity in check_cv (scikit-learn#12155)
  [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114)
  [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089)
  DOC Removed duplicated doc in tree.rst (scikit-learn#11922)
  [MRG] DOC covariance doctest examples (scikit-learn#12124)
  typo and formatting fixes in 0.20 doc (scikit-learn#11963)
  DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133)
  [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107)
  Fix typo (scikit-learn#12126)
  DOC Typo in OneHotEncoder
  DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117)
  DOC Removing quotes from variant names. (scikit-learn#12113)
  DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108)
  [MRG] MNT Re-enable PyPy CI (scikit-learn#12039)
  MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048)
  TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093)
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 29, 2018
* dfsg: (77 commits)
  ColumnTransformer generalization to work on empty lists (scikit-learn#12084)
  add sparse_threshold to make_columns_transformer (scikit-learn#12152)
  [MRG] Convert ColumnTransformer input list to numpy array (scikit-learn#12104)
  Change version to 0.20.0
  BUG: check equality instead of identity in check_cv (scikit-learn#12155)
  [MRG] Fix FutureWarnings in logistic regression examples (scikit-learn#12114)
  [MRG] Update test_metaestimators to pass y parameter when calling score (scikit-learn#12089)
  DOC Removed duplicated doc in tree.rst (scikit-learn#11922)
  [MRG] DOC covariance doctest examples (scikit-learn#12124)
  typo and formatting fixes in 0.20 doc (scikit-learn#11963)
  DOC Replaced the deprecated early_stopping parameter with n_iter_no_change. (scikit-learn#12133)
  [MRG +1] ColumnTransformer: store evaluated function column specifier during fit (scikit-learn#12107)
  Fix typo (scikit-learn#12126)
  DOC Typo in OneHotEncoder
  DOC Update fit_transform docstring of OneHotEncoder (scikit-learn#12117)
  DOC Removing quotes from variant names. (scikit-learn#12113)
  DOC BaggingRegressor missing default value for oob_score in docstring (scikit-learn#12108)
  [MRG] MNT Re-enable PyPy CI (scikit-learn#12039)
  MNT Only checks warnings on latest depedendencies versions in CI (scikit-learn#12048)
  TST Ignore warnings in common test to avoid collection errors (scikit-learn#12093)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ColumnTransformer breaks where X is a list
7 participants