[MRG+1] discrete branch: add encoding option to KBinsDiscretizer #9647
Conversation
Thanks for this. Tests are failing, though.
Oh, you mentioned that. A strange failure. Let me check what that branch is doing...
So the test failures relate to a recent Cython release. Merging master into discrete should fix it. I'll do this shortly.
@jnothman Thanks. Kindly inform me when you finish. :)
Try pushing an empty commit.
@jnothman The test failure is unrelated. I believe it's worth a review. Thanks :)
@jnothman Finally the CIs are green.
            and return a sparse matrix.
        onehot-dense:
            encode the transformed result with one-hot encoding
            and return a dense matrix.
dense matrix -> dense array
            encode the transformed result with one-hot encoding
            and return a dense matrix.
        ordinal:
            do not encode the transformed result.
Return the bin identifier encoded as an integer value.
@ogrisel Thanks. Comments addressed.
A superficial review, without looking into the heart of the code.
@@ -114,6 +128,12 @@ def fit(self, X, y=None):
        """
        X = check_array(X, dtype='numeric')

        valid_encode = ['onehot', 'onehot-dense', 'ordinal']
        if self.encode not in valid_encode:
            raise ValueError('Invalid encode value. '
Add the value provided by the user in the error message, i.e. something like this:
"Valid options for 'encode' are {}. Got 'encode={}' instead".format(sorted(valid_encode), encode)
                                 retain_order=True)

        # only one-hot encode discretized features
        mask = np.array([True] * X.shape[1])
Probably more readable to use np.repeat(True, X.shape[1])
        # don't support inverse_transform
        if self.encode != 'ordinal':
            raise ValueError("inverse_transform only support "
                             "encode='ordinal'.")
Add the value of self.encode in the error message, e.g. "Got {} instead".
@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
        Column indices of ignored features. (Example: Categorical features.)
        If ``None``, all features will be discretized.

    encode : string {'onehot', 'onehot-dense', 'ordinal'} (default='ordinal')
I don't think you need the type information when you list all possible values. Double-check with the numpy doc style.
@lesteve It seems that this is not the case in many functions (e.g. PCA, LinearSVC), and I have no idea how to check the doc style. Could you please help me? Thanks.
From https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#sections
When a parameter can only assume one of a fixed set of values, those values can be listed in braces, with the default appearing first:
order : {'C', 'F', 'A'}
    Description of `order`.
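Applied to this parameter, that convention gives roughly the form the review converges on below (a sketch only, folding in the comma and capitalization nits raised later in this thread):

encode : {'ordinal', 'onehot', 'onehot-dense'}, (default='ordinal')
    Method used to encode the transformed result.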
    assert_raises(ValueError, est.inverse_transform, X)
    est = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
                           encode='onehot').fit(X)
    expected3 = est.transform(X)
Should you not check that the output is sparse in the onehot case?
Also probably check that the output is not sparse in the onehot-dense case.
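A minimal sketch of those two checks, assuming the test module's X fixture and the scipy.sparse import used elsewhere in this thread:

import scipy.sparse as sp

est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='onehot').fit(X)
assert sp.issparse(est.transform(X))        # 'onehot' returns a sparse matrix

est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='onehot-dense').fit(X)
assert not sp.issparse(est.transform(X))    # 'onehot-dense' returns a dense array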
@lesteve Comments addressed, except for the first one. Thanks.
@lesteve Thanks. Comment addressed.
I find the naming of both 'onehot' and 'onehot-dense' a bit fuzzy, since it is not explicit from the naming that the default is sparse. Would it make sense to rename 'onehot' -> 'onehot-sparse'?
Oh, I see this is the same naming as in the CategoricalEncoder PR. So that might be fine.
A couple of suggestions. @jnothman, this code will probably change once the CategoricalEncoder is used, I suppose?
@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
        Column indices of ignored features. (Example: Categorical features.)
        If ``None``, all features will be discretized.

    encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
comma after }
@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
        Column indices of ignored features. (Example: Categorical features.)
        If ``None``, all features will be discretized.

    encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
        method used to encode the transformed result.
method -> Method
        method used to encode the transformed result.

        onehot:
            encode the transformed result with one-hot encoding
encode -> Encode
            encode the transformed result with one-hot encoding
            and return a sparse matrix.
        onehot-dense:
            encode the transformed result with one-hot encoding
encode -> Encode
            encode the transformed result with one-hot encoding
            and return a dense array.
        ordinal:
            return the bin identifier encoded as an integer value.
return -> Return
@@ -114,6 +128,12 @@ def fit(self, X, y=None):
        """
        X = check_array(X, dtype='numeric')

        valid_encode = ['onehot', 'onehot-dense', 'ordinal']
you might want to use a tuple instead of a list.
        if self.encode not in valid_encode:
            raise ValueError("Valid options for 'encode' are {}. "
                             "Got 'encode = {}' instead."
                             .format(sorted(valid_encode), self.encode))
It seems unnecessary to sort.
                                 retain_order=True)

        # only one-hot encode discretized features
        mask = np.repeat(True, X.shape[1])
It would be faster to use:
mask = np.ones(X.shape[1], dtype=bool)
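For context, the three mask constructions discussed in this thread produce identical boolean arrays; np.ones with dtype=bool simply allocates the array directly instead of going through a Python list or a repeat. A quick sketch:

import numpy as np

n = 8
m1 = np.array([True] * n)       # original: builds a Python list first
m2 = np.repeat(True, n)         # first suggestion
m3 = np.ones(n, dtype=bool)     # fastest: direct allocation
assert (m1 == m2).all() and (m2 == m3).all()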
@@ -174,3 +176,40 @@ def test_numeric_stability():
        X = X_init / 10**i
        Xt = KBinsDiscretizer(n_bins=2).fit_transform(X)
        assert_array_equal(Xt_expected, Xt)


def test_encode():
I would probably split this test into several tests.
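A hedged sketch of such a split (function names are hypothetical here, though the PR later settles on very similar ones, visible further down the thread; assumes the test module's X fixture and its existing assert_raises import):

def test_invalid_encode_option():
    # the invalid-option error path gets its own test
    est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode')
    assert_raises(ValueError, est.fit, X)


def test_encode_options():
    # the valid options are exercised separately
    for encode in ('ordinal', 'onehot', 'onehot-dense'):
        KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode=encode).fit(X).transform(X)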
Maybe this will end up using the categorical encoder... but it doesn't need to. For example, unary encoding doesn't make sense for categories in general, but it does for ordinals / ordered categories, such as those produced by a discretizer. That is, the discretizer isn't producing categorical features, it's producing ordinal features.
@glemaitre Thanks. Comments addressed.
doc/modules/preprocessing.rst
Outdated
@@ -459,6 +459,8 @@ K-bins discretization
  >>> est.bin_width_
  array([ 3., 1., 2.])

By default the output is one-hot encoded into a sparse matrix (See :class:`OneHotEncoder`)
I'd rather this referred to a section of the user guide which describes what this means, rather than to a (in this context irrelevant) tool to do it.
@jnothman The only place I can think of is http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features. Could you please provide a more specific suggestion? Thanks.
That seems the right place to point to.
@jnothman I changed the link to Encoding categorical features. Now it looks like this:
Overall looks great! Nice work.
@@ -3,8 +3,10 @@
import numpy as np
from six.moves import range
import warnings
import scipy.sparse as sp
Nit: Move this above six.moves, so it is alphabetical.
In fact, you should use sklearn.externals.six to avoid a dependency on six.
@@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
            Data in the original feature space.
        """
        check_is_fitted(self, ["offset_", "bin_width_"])

        # currently, preprocessing.OneHotEncoder
        # don't support inverse_transform
don't -> doesn't
        if self.ignored_features is not None:
            mask[self.ignored_features] = False

        if self.encode == 'onehot':
Nit: You can make this more concise.

if self.encode == 'ordinal':
    return Xt

# Only one-hot encode discretized features
mask = np.ones(X.shape[1], dtype=bool)
if self.ignored_features is not None:
    mask[self.ignored_features] = False

encode_sparse = (self.encode == 'onehot')
return OneHotEncoder(n_values=np.array(self.n_bins_)[mask],
                     categorical_features='all'
                     if self.ignored_features is None else mask,
                     sparse=encode_sparse).fit_transform(Xt)
@@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
            Data in the original feature space.
        """
        check_is_fitted(self, ["offset_", "bin_width_"])

        # currently, preprocessing.OneHotEncoder
Nit: The convention of this file would be to capitalize comments. ("currently" -> "Currently")
        # currently, preprocessing.OneHotEncoder
        # don't support inverse_transform
        if self.encode != 'ordinal':
            raise ValueError("inverse_transform only support "
support -> supports
    est = KBinsDiscretizer(n_bins=3, ignored_features=[1, 2],
                           encode='onehot-dense').fit(X)
    Xt_1 = est.transform(X)
    Xt_2 = np.array([[1, 0, 0, 1, 0, 0, 1.5, -4],
Nit: You don't need to have Xt_2 be an array; a nested list should suffice. I would also rename Xt_2 -> Xt_expected and Xt_1 -> Xt.
                     [0, 1, 0, 1, 0, 0, 2.5, -3],
                     [0, 0, 1, 0, 1, 0, 3.5, -2],
                     [0, 0, 1, 0, 0, 1, 4.5, -1]])
    assert_array_equal(Xt_1, Xt_2)
The convention is the expected parameter first, and then the actual.
Thanks for helping review, @hlin117!
Thanks @hlin117. Comments addressed.
Looks good!
@glemaitre Thanks. Comments addressed.
I made a small pass on it. I have to check the artifacts, but I think it is LGTM from me.
        if self.ignored_features is not None:
            mask[self.ignored_features] = False

        encode_sparse = (self.encode == 'onehot')
I would remove the parentheses.
            mask[self.ignored_features] = False

        encode_sparse = (self.encode == 'onehot')
        return OneHotEncoder(n_values=np.array(self.n_bins_)[mask],
self.n_bins_ is already an array, isn't it? I don't think you need to pass it inside np.array.
@glemaitre Thanks. According to the original author, we allow users to pass a list as n_bins_. If we don't do the conversion, we get an error (list indices must be integers).
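The error the author refers to is easy to reproduce: a plain Python list cannot be indexed with a boolean numpy mask, which is why a conversion to an array was thought necessary. A minimal illustration (the exact TypeError message varies across numpy versions):

import numpy as np

n_bins = [2, 3, 3]                    # a list, as a user might pass
mask = np.array([True, False, True])
# n_bins[mask]                        # raises TypeError
np.asarray(n_bins)[mask]              # works: array([2, 3])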
@@ -259,6 +295,15 @@ def inverse_transform(self, Xt):
            Data in the original feature space.
        """
        check_is_fitted(self, ["offset_", "bin_width_"])

        # Currently, preprocessing.OneHotEncoder
If you can make your lines up to 79 characters, that would be great.
def test_invalid_encode_option():
    est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode')
    assert_raises(ValueError, est.fit, X)
Could you match the error message (or part of it) using assert_raises_regex
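A sketch of the suggested assertion, using assert_raises_regex from sklearn.utils.testing (the matched fragment is taken from the error message in the fit diff above):

from sklearn.utils.testing import assert_raises_regex

def test_invalid_encode_option():
    est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode')
    # match part of the message, not just the exception type
    assert_raises_regex(ValueError, "Valid options for 'encode' are",
                        est.fit, X)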
    assert not sp.issparse(Xt_2)
    assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=False)
                       .fit_transform(Xt_1), Xt_2)
    assert_raises(ValueError, est.inverse_transform, Xt_2)
Could you match the error message (or part of it) using assert_raises_regex. In addition, I would group all the error checks inside a separate test.
def test_encode_options():
    # test valid encode options through comparison
remove this comment
    assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=True)
                       .fit_transform(Xt_1).toarray(),
                       Xt_3.toarray())
    assert_raises(ValueError, est.inverse_transform, Xt_3)
Could you match the error message (or part of it) using assert_raises_regex
I don't have a computer right now, but I thought I spotted that validate_n_bins was converting self.n_bins (which can be a list, int, or array) into self.n_bins_, which is a numpy array. Can you check that this is the case?
@glemaitre Thanks a lot. I removed np.array, and all the comments are addressed. Sorry @jnothman for not realizing earlier that your suggestions were right.
Any update on it? Thanks :)
I checked the artifact. Can you add an entry in the API documentation to link to the User Guide entry? For example:
Codecov Report

@@            Coverage Diff            @@
##           discrete   #9647    +/-  ##
============================================
- Coverage     96.18%   96.18%   -0.01%
============================================
  Files           336      338       +2
  Lines         63835    62408    -1427
============================================
- Hits          61402    60027    -1375
+ Misses         2433     2381      -52

Continue to review the full report at Codecov.
@glemaitre Sorry to disturb. It seems that the link to the user guide doesn't work, and I cannot figure out the reason.
I think that's sufficient consensus...
Thanks @qinhanmin2014
@jnothman It seems that the link to the user guide doesn't work, and I cannot find the reason. Could you please help me? Thanks a lot :)
I put the fix in #9705. I think that discretization was duplicated with the next section. It seems that this is not case sensitive. I built locally and it was fine after renaming it.
@glemaitre Thanks a lot for the fix :)
Reference Issue
Fixes #9336
What does this implement/fix? Explain your changes.
Add an encoding option to KBinsDiscretizer (see the usage sketch below):
(1) The encode option supports {'onehot', 'onehot-dense', 'ordinal'}; the default value is set to 'ordinal', mainly because of (3).
(2) Only discretized features are one-hot encoded when ignored_features is set (following OneHotEncoder, non-categorical features are always stacked to the right of the matrix).
(3) It seems hard to support inverse_transform for the one-hot options, because OneHotEncoder doesn't support inverse_transform.
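A hedged usage sketch of the API described by points (1)-(3), written against this PR branch (not a released scikit-learn API at the time; the data and n_bins values are illustrative):

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0, 1.5], [-1.0, 2.5], [0.0, 3.5], [1.0, 4.5]])

# (1) default is 'ordinal': each value becomes its integer bin identifier,
#     and inverse_transform is supported
ordinal = KBinsDiscretizer(n_bins=3)
Xt = ordinal.fit_transform(X)
X_back = ordinal.inverse_transform(Xt)

# one-hot variants: 'onehot' returns a sparse matrix,
# 'onehot-dense' a dense array
assert sp.issparse(KBinsDiscretizer(n_bins=3, encode='onehot').fit_transform(X))
assert not sp.issparse(
    KBinsDiscretizer(n_bins=3, encode='onehot-dense').fit_transform(X))

# (3) inverse_transform raises ValueError for the one-hot encodings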
Any other comments?