ColumnTransformer give "TypeError: invalid type promotion" #20090

Closed
princyok opened this issue May 13, 2021 · 5 comments

@princyok

princyok commented May 13, 2021

Versions

sklearn 0.23.2

Description

Calling fit_transform on a ColumnTransformer whose remainder argument is set to "passthrough" raises a TypeError under certain circumstances. I narrowed down one such circumstance.

The error occurs when exactly n - 1 columns are transformed (where n is the total number of columns) and the one column that gets passed through (i.e., not transformed) has a dtype that cannot be converted to that of the other columns. The root cause is that sklearn tries to combine the arrays with numpy.hstack and fails.
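The failure described above can be sketched with NumPy alone, without sklearn. This is a minimal illustration, not the code path sklearn actually runs: the array names and shapes are made up, and the exact error message depends on the NumPy version.

```python
import numpy as np

# Hypothetical stand-ins for the transformed block and the passed-through column
floats = np.zeros((3, 2))  # float64
times = np.array(["2021-05-13"] * 3, dtype="datetime64[s]").reshape(-1, 1)

# float64 and datetime64 have no common dtype, so hstack raises TypeError
try:
    np.hstack([floats, times])
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)

# Casting both blocks to object first lets the concatenation go through
stacked = np.hstack([floats.astype(object), times.astype(object)])
print(stacked.dtype, stacked.shape)
```

Casting to object is shown only to make the dtype conflict visible; whether an object array is acceptable downstream depends on the estimator that consumes it.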

Code to Reproduce

from sklearn import preprocessing, compose
import numpy as np
import pandas as pd
import datetime

prng = np.random.default_rng()
d = pd.DataFrame(prng.random((20,3)), columns = ["aaa", "bbb", "ccc"])
d["time"] = datetime.datetime.now()

columns = ["aaa", "bbb", "ccc"]
t = compose.ColumnTransformer(
    [("stnd", preprocessing.StandardScaler(), columns)],
    remainder="passthrough"
)
t.fit_transform(d)

The above code raises "TypeError: invalid type promotion". The "time" column has a datetime dtype, while the others are float. As described above, when only the "time" column is passed through, hstack fails when it tries to concatenate the arrays. If you change columns to columns = ["aaa", "bbb"], it works as expected. Changing the remainder argument to remainder="drop" also works.

@glemaitre
Member

I get a different traceback with the latest library version:

In [1]: from sklearn import preprocessing, compose
   ...: import numpy as np
   ...: import pandas as pd
   ...: import datetime
   ...: 
   ...: prng = np.random.default_rng()
   ...: d = pd.DataFrame(prng.random((20,3)), columns = ["aaa", "bbb", "ccc"])
   ...: d["time"] = datetime.datetime.now()
   ...: 
   ...: columns = ["aaa", "bbb", "ccc"]
   ...: t = compose.ColumnTransformer(
   ...:     [("stnd", preprocessing.StandardScaler(), columns)],
   ...:     remainder="passthrough"
   ...: )
   ...: t.fit_transform(d)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-321b39c9e11e> in <module>
     13     remainder="passthrough"
     14 )
---> 15 t.fit_transform(d)

~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    571         self._record_output_indices(Xs)
    572 
--> 573         return self._hstack(list(Xs))
    574 
    575     def transform(self, X):

~/Documents/packages/scikit-learn/sklearn/compose/_column_transformer.py in _hstack(self, Xs)
    657         else:
    658             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 659             return np.hstack(Xs)
    660 
    661     def _sk_visual_block_(self):

<__array_function__ internals> in hstack(*args, **kwargs)

~/miniconda3/envs/dev/lib/python3.9/site-packages/numpy/core/shape_base.py in hstack(tup)
    344         return _nx.concatenate(arrs, 0)
    345     else:
--> 346         return _nx.concatenate(arrs, 1)
    347 
    348 

<__array_function__ internals> in concatenate(*args, **kwargs)

TypeError: The DTypes <class 'numpy.dtype[datetime64]'> and <class 'numpy.dtype[float64]'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is `object`.

In [2]: import sklearn; sklearn.show_versions()

System:
    python: 3.9.1 (default, Dec 11 2020, 14:32:07)  [GCC 7.3.0]
executable: /home/glemaitre/miniconda3/envs/dev/bin/python
   machine: Linux-5.8.0-50-generic-x86_64-with-glibc2.32

Python dependencies:
          pip: 20.3.3
   setuptools: 52.0.0.post20210125
      sklearn: 1.0.dev0
        numpy: 1.20.0
        scipy: 1.6.0
       Cython: 0.29.21
       pandas: 1.2.1
   matplotlib: 3.3.4
       joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True

The traceback is different, but it still comes down to a NumPy concatenation that fails because of the datetime column. We should check why it works when we pass only a subset of the features.

@glemaitre glemaitre added Bug and removed Bug: triage labels May 17, 2021
@madhuracj
Contributor

I had a quick look at this.
When columns ["aaa", "bbb"] are passed, the remainder is a DataFrame containing ["ccc", "time"]. When np.asanyarray() is called on the remainder inside np.hstack(), it produces an ndarray of dtype 'object', because the two columns have different dtypes.
On the other hand, when columns ["aaa", "bbb", "ccc"] are passed, the remainder is a DataFrame containing just "time", and np.asanyarray() produces an ndarray of dtype 'datetime64', which is incompatible with 'float64'.
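The dtype difference described above can be checked directly. A small sketch, assuming a toy DataFrame with the same column names as in the original post (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: one float column plus a constant datetime column
df = pd.DataFrame({"ccc": [0.1, 0.2, 0.3]})
df["time"] = pd.Timestamp("2021-05-13")

# Two columns with different dtypes fall back to a generic object array
mixed = np.asanyarray(df[["ccc", "time"]])

# A lone datetime column keeps its datetime64 dtype
single = np.asanyarray(df[["time"]])

print(mixed.dtype.kind)   # "O" for object
print(single.dtype.kind)  # "M" for datetime64
```

An object array can be concatenated with a float64 array, while a datetime64 array cannot, which would explain why the two-column remainder happens to work.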

@MaxwellLZH
Contributor

take

@sfandres

sfandres commented Oct 25, 2021

Same problem here. The function does not work as expected even if I add another dummy column to the data frame so that my "Date" column is not the only one that gets passed through. It transforms both columns.

@adrinjalali
Member

With sklearn.set_config(transform_output="pandas") the code in the original post now runs nicely. Therefore closing.
