Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import OneHotEncoder
numerical_features = np.random.randint(10, size=(5,4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1,1)
X = np.hstack((numerical_features, categorical))
onehotencoder = OneHotEncoder(categorical_features=[4],
                              handle_unknown='ignore')
X_encoded = onehotencoder.fit_transform(X)
Expected Results
No error should be thrown. OneHotEncoder should work in legacy mode and encode only the supplied columns.
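For comparison, the expected result (column 4 one-hot encoded, the other columns left unchanged) can be sketched with ColumnTransformer, which the deprecation warning in this report recommends as the replacement for categorical_features. This is only a sketch: the fixed random seed, the transformer name "onehot", and the explicit categories='auto' are assumptions added for illustration and are not part of the original script.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same data layout as in the report; the seed is an assumption for repeatability.
rng = np.random.RandomState(0)
numerical_features = rng.randint(10, size=(5, 4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1, 1)
X = np.hstack((numerical_features, categorical))

# Encode only column 4 and pass the remaining columns through unchanged.
# categories='auto' derives the categories from the unique values (2 and 3).
column_transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(categories="auto", handle_unknown="ignore"), [4])],
    remainder="passthrough",
)
X_encoded = column_transformer.fit_transform(X)

# The result may be sparse depending on its density; densify for inspection.
if hasattr(X_encoded, "toarray"):
    X_encoded = X_encoded.toarray()

print(X_encoded.shape)
```

The encoded columns come first in the output, followed by the four passthrough columns.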
Actual Results
/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-15-c174bb78e628>", line 1, in <module>
runfile('/home/vivek/untitless.py', wdir='/home/vivek')
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/vivek/untitless.py", line 24, in <module>
X_encoded = onehotencoder.fit_transform(X)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 514, in fit_transform
self._categorical_features, copy=True)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/base.py", line 71, in _transform_selected
X_sel = transform(X[:, ind[sel]])
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 456, in _legacy_fit_transform
% type(X))
TypeError: Wrong type for parameter `n_values`. Expected 'auto', int or array of ints, got <class 'numpy.ndarray'>
Description
There is a difference between the actual default of the n_values parameter in OneHotEncoder and the default assumed by the documentation and some internal code. This leads to errors under specific conditions.
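This kind of default mismatch can be illustrated with a toy model. It is a hypothetical sketch, not scikit-learn's code: ToyEncoder, validate_n_values, fit_buggy and fit_fixed are invented names. It assumes only what this report describes, namely a constructor default of None and a validator that accepts 'auto', an int, or a collection of ints.

```python
def validate_n_values(n_values):
    """Toy validator (hypothetical): accepts 'auto', an int, or a list of ints."""
    if n_values == "auto" or isinstance(n_values, int):
        return n_values
    if isinstance(n_values, (list, tuple)) and all(
        isinstance(v, int) for v in n_values
    ):
        return list(n_values)
    raise TypeError(
        "Wrong type for parameter `n_values`. "
        "Expected 'auto', int or array of ints, got %r" % type(n_values)
    )


class ToyEncoder:
    """Hypothetical stand-in whose constructor default is None, not 'auto'."""

    def __init__(self, n_values=None):
        self.n_values = n_values

    def fit_buggy(self):
        # Written as if the default were 'auto': None reaches the validator.
        return validate_n_values(self.n_values)

    def fit_fixed(self):
        # Normalise the None default to 'auto' before validating.
        n_values = "auto" if self.n_values is None else self.n_values
        return validate_n_values(n_values)


try:
    ToyEncoder().fit_buggy()
    buggy_default_raised = False
except TypeError:
    buggy_default_raised = True

print(buggy_default_raised)
print(ToyEncoder().fit_fixed())
print(ToyEncoder(n_values="auto").fit_buggy())
```

In this toy model, passing n_values='auto' explicitly or translating None to 'auto' before validation both avoid the TypeError. Whether the real fix belongs in the constructor default or in the deprecation handling is left open here.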
- The documentation states that the default value is 'auto'.
- The code for _handle_deprecations assumes that the default value is 'auto'.
- But the actual __init__ method has n_values=None as the default.
- If I remove handle_unknown='ignore' or add n_values='auto', the code runs successfully, but the following warnings are shown:
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Versions
System:
python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
executable: /home/vivek/anaconda3/envs/my_env/bin/python
machine: Linux-4.15.0-43-generic-x86_64-with-debian-buster-sid
BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /home/vivek/anaconda3/envs/my_env/lib
cblas_libs: mkl_rt, pthread
Python deps:
pip: 18.1
setuptools: 40.2.0
sklearn: 0.20.1
numpy: 1.15.4
scipy: 1.1.0
Cython: 0.29
pandas: 0.23.4