[MRG+1] discrete branch: add encoding option to KBinsDiscretizer #9647


Merged
merged 20 commits into scikit-learn:discrete from qinhanmin2014:my-feature-1 on Sep 7, 2017

Conversation

@qinhanmin2014
Member

commented Aug 30, 2017

Reference Issue

Fixes #9336

What does this implement/fix? Explain your changes.

Add an encoding option to KBinsDiscretizer:
(1) The encode option supports {'onehot', 'onehot-dense', 'ordinal'}; the default is set to 'ordinal', mainly because of (3).
(2) Only the discretized features are one-hot encoded when ignored_features is set
(as with OneHotEncoder, non-categorical features are always stacked to the right of the matrix).
(3) It seems hard to support inverse_transform for one-hot output, because OneHotEncoder doesn't support inverse_transform.
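
For illustration, a minimal sketch of how the three options are meant to behave (assuming the discrete-branch API described in this PR; the data is made up):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer  # discrete branch

X = np.array([[-2., 1.5, -4.],
              [-1., 2.5, -3.],
              [0., 3.5, -2.],
              [1., 4.5, -1.]])

# 'ordinal' (the proposed default): each column holds the integer bin id.
Xt = KBinsDiscretizer(n_bins=3, encode='ordinal').fit_transform(X)

# 'onehot': the bin ids are one-hot encoded into a sparse matrix.
Xt_sparse = KBinsDiscretizer(n_bins=3, encode='onehot').fit_transform(X)

# 'onehot-dense': the same encoding, returned as a dense array.
Xt_dense = KBinsDiscretizer(n_bins=3, encode='onehot-dense').fit_transform(X)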

Any other comments?

@jnothman
Member

Thanks for this. Tests are failing, though.

@jnothman
Member

Oh you mentioned that. A strange failure. Let me check what that branch is doing...

@jnothman
Member

So the test failures relate to a recent Cython release. Merging master into discrete should fix it. I'll do this shortly.

@qinhanmin2014
Member Author

@jnothman Thanks. Kindly inform me when you finish. :)

@jnothman
Member

Try pushing an empty commit

@qinhanmin2014 changed the title from "[WIP] discrete branch: add encoding option to KBinsDiscretizer" to "[MRG] discrete branch: add encoding option to KBinsDiscretizer" on Aug 31, 2017
@qinhanmin2014
Member Author

@jnothman The test failure is unrelated. I believe it's worth a review. Thanks :)

@qinhanmin2014
Member Author

@jnothman Finally, the CIs are green.

and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding
and return a dense matrix.
Member

dense array

encode the transformed result with one-hot encoding
and return a dense matrix.
ordinal:
do not encode the transformed result.
Member

Return the bin identifier encoded as an integer value.

@qinhanmin2014
Member Author

@ogrisel Thanks. Comments addressed.

@lesteve (Member) left a comment

A superficial review, without looking at the heart of the code.

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')

valid_encode = ['onehot', 'onehot-dense', 'ordinal']
if self.encode not in valid_encode:
    raise ValueError('Invalid encode value. '
Member

Add the value provided by the user in the error message, i.e. something like this:

"Valid options for 'encode' are {}. Got 'encode={}' instead".format(sorted(valid_encode), encode)

retain_order=True)

# only one-hot encode discretized features
mask = np.array([True] * X.shape[1])
Member

Probably more readable to use np.repeat(True, X.shape[1])

# don't support inverse_transform
if self.encode != 'ordinal':
    raise ValueError("inverse_transform only support "
                     "encode='ordinal'.")
Member

Add the value of self.encode in the error message, e.g. . "Got {} instead"

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : string {'onehot', 'onehot-dense', 'ordinal'} (default='ordinal')
Member

I don't think you need the type information when you list all possible values. Double-check with the numpy doc style.

Member Author

@lesteve It seems that this is not the case in many functions (e.g. PCA, LinearSVC), and I have no idea how to check the doc style. Could you please help me? Thanks.

Member

From https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#sections

When a parameter can only assume one of a fixed set of values, those values can be listed in braces, with the default appearing first:

order : {'C', 'F', 'A'}
    Description of `order`.
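
Applied to the parameter in this PR, that convention would give something like:

encode : {'ordinal', 'onehot', 'onehot-dense'}
    Method used to encode the transformed result.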

assert_raises(ValueError, est.inverse_transform, X)
est = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
                       encode='onehot').fit(X)
expected3 = est.transform(X)
Member

Should you not check that the output is sparse in the onehot case?

Also probably check that the output is not sparse in the onehot-dense case.

@qinhanmin2014
Member Author

@lesteve Comments addressed except for the first one. Thanks.

@qinhanmin2014
Member Author

@lesteve Thanks. Comment addressed.

@glemaitre
Member

I find both names 'onehot' and 'onehot-dense' a bit fuzzy, since the naming doesn't make it explicit that 'onehot' is sparse by default. Would it make sense to rename 'onehot' -> 'onehot-sparse'?

@glemaitre
Member

Oh I see this is the same naming in the CategoricalEncoder PR. So that might be fine.

@glemaitre (Member) left a comment

A couple of suggestions. @jnothman, this code will probably change to use the CategoricalEncoder, I suppose?

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
Member

comma after }

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
method used to encode the transformed result.
Member

method -> Method

method used to encode the transformed result.

onehot:
encode the transformed result with one-hot encoding
Member

encode -> Encode

encode the transformed result with one-hot encoding
and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding
Member

encode -> Encode

encode the transformed result with one-hot encoding
and return a dense array.
ordinal:
return the bin identifier encoded as an integer value.
Member

return -> Return

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')

valid_encode = ['onehot', 'onehot-dense', 'ordinal']
Member

You might want to use a tuple instead of a list.

if self.encode not in valid_encode:
    raise ValueError("Valid options for 'encode' are {}. "
                     "Got 'encode = {}' instead."
                     .format(sorted(valid_encode), self.encode))
Member

It seems unnecessary to sort here.

retain_order=True)

# only one-hot encode discretized features
mask = np.repeat(True, X.shape[1])
Member

It would be faster to use:

mask = np.ones(X.shape[1], dtype=bool)
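
For context, a small sketch of the alternatives discussed for building this mask (ignored_features=[1] is a made-up example):

import numpy as np

n_features = 4
ignored_features = [1]  # hypothetical example

# Three equivalent ways to build the boolean mask; np.ones is the fastest:
mask = np.array([True] * n_features)    # original
mask = np.repeat(True, n_features)      # earlier suggestion
mask = np.ones(n_features, dtype=bool)  # this suggestion

mask[ignored_features] = False  # only discretized features get one-hot encoded
print(mask)  # [ True False  True  True]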

@@ -174,3 +176,40 @@ def test_numeric_stability():
X = X_init / 10**i
Xt = KBinsDiscretizer(n_bins=2).fit_transform(X)
assert_array_equal(Xt_expected, Xt)


def test_encode():
Member

I would probably split this into several tests.

@jnothman
Member

jnothman commented Sep 3, 2017 via email

@qinhanmin2014
Member Author

@glemaitre Thanks. Comments addressed.

@@ -459,6 +459,8 @@ K-bins discretization
>>> est.bin_width_
array([ 3., 1., 2.])

By default the output is one-hot encoded into a sparse matrix (See :class:`OneHotEncoder`)
Member

I'd rather this referred to a section of the user guide that describes what this means, rather than to a (in this context irrelevant) tool that does it.

Member Author

@jnothman The only place I can think of is http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features. Could you please provide a more specific suggestion? Thanks.

Member

That seems the right place to point to.

@qinhanmin2014
Member Author

@jnothman I changed the link to "Encoding categorical features". Now it looks like this:
[screenshot: 2017-09-04_224719]
Further suggestions welcome :)

@hlin117 (Contributor) left a comment

Overall looks great! Nice work.

@@ -3,8 +3,10 @@
import numpy as np
from six.moves import range
import warnings
import scipy.sparse as sp
Contributor

Nit: Move this above six.moves, so it is alphabetical.

Member

In fact, you should use sklearn.externals.six so as not to depend on six directly.

@@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])

# currently, preprocessing.OneHotEncoder
# don't support inverse_transform
Contributor

don't -> doesn't

if self.ignored_features is not None:
    mask[self.ignored_features] = False

if self.encode == 'onehot':
Contributor

Nit: You can make this more concise.

if self.encode == 'ordinal':
    return Xt

# Only one-hot encode discretized features
mask = np.ones(X.shape[1], dtype=bool)
if self.ignored_features is not None:
    mask[self.ignored_features] = False

encode_sparse = (self.encode == 'onehot')
return OneHotEncoder(
    n_values=np.array(self.n_bins_)[mask],
    categorical_features='all' if self.ignored_features is None else mask,
    sparse=encode_sparse).fit_transform(Xt)

@@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])

# currently, preprocessing.OneHotEncoder
Contributor

Nit: The convention of this file would be to capitalize comments. ("currently" -> "Currently")

# currently, preprocessing.OneHotEncoder
# don't support inverse_transform
if self.encode != 'ordinal':
    raise ValueError("inverse_transform only support "
Contributor

support -> supports

est = KBinsDiscretizer(n_bins=3, ignored_features=[1, 2],
                       encode='onehot-dense').fit(X)
Xt_1 = est.transform(X)
Xt_2 = np.array([[1, 0, 0, 1, 0, 0, 1.5, -4],
Contributor

Nit: You don't need to have Xt_2 be an array. A nested list should suffice.

I would also rename Xt_2 -> Xt_expected and Xt_1 -> Xt.

                 [0, 1, 0, 1, 0, 0, 2.5, -3],
                 [0, 0, 1, 0, 1, 0, 3.5, -2],
                 [0, 0, 1, 0, 0, 1, 4.5, -1]])
assert_array_equal(Xt_1, Xt_2)
Contributor

The convention is the expected parameter first, and then the actual.

@jnothman
Member

jnothman commented Sep 5, 2017 via email

@qinhanmin2014
Member Author

Thanks @hlin117. Comments addressed.

@hlin117 (Contributor) left a comment

Looks good!

@qinhanmin2014
Member Author

@glemaitre Thanks. Comments addressed.

@glemaitre
Member

I made a small pass on it. I still have to check the artifacts, but I think it's LGTM from me.

if self.ignored_features is not None:
    mask[self.ignored_features] = False

encode_sparse = (self.encode == 'onehot')
Member

I would remove the parentheses.

mask[self.ignored_features] = False

encode_sparse = (self.encode == 'onehot')
return OneHotEncoder(n_values=np.array(self.n_bins_)[mask],
Member

self.n_bins_ is already an array, isn't it? I don't think you need to wrap it in np.array.

Member Author

@glemaitre Thanks. According to the original author, we allow users to pass a list as n_bins_. If we don't do the conversion, we get an error (list indices must be integers).
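
For illustration, a minimal sketch of the issue (made-up values):

import numpy as np

n_bins_ = [2, 3, 3, 3]  # users may pass a plain list
mask = np.array([True, False, True, True])

# n_bins_[mask]          # TypeError: a list cannot be indexed with a boolean array
np.array(n_bins_)[mask]  # array([2, 3, 3]) -- works after the conversion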

@@ -259,6 +295,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])

# Currently, preprocessing.OneHotEncoder
Member

If you can fill your lines up to 79 characters, that would be great.


def test_invalid_encode_option():
    est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode')
    assert_raises(ValueError, est.fit, X)
Member

Could you match the error message (or part of it) using assert_raises_regex
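
For illustration, a minimal sketch of what that could look like (the message fragment is taken from the error string added earlier in this PR; X stands for the module-level test data):

from sklearn.utils.testing import assert_raises_regex
from sklearn.preprocessing import KBinsDiscretizer

est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode')
# Match a fragment of the expected message, not just the exception type:
assert_raises_regex(ValueError, "Valid options for 'encode' are",
                    est.fit, X)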

assert not sp.issparse(Xt_2)
assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=False)
                   .fit_transform(Xt_1), Xt_2)
assert_raises(ValueError, est.inverse_transform, Xt_2)
Member

Could you match the error message (or part of it) using assert_raises_regex

In addition, I would group all the error cases inside a single test.



def test_encode_options():
# test valid encode options through comparison
Member

remove this comment

assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=True)
                   .fit_transform(Xt_1).toarray(),
                   Xt_3.toarray())
assert_raises(ValueError, est.inverse_transform, Xt_3)
Member

Could you match the error message (or part of it) using assert_raises_regex

@glemaitre
Member

glemaitre commented Sep 5, 2017 via email

@qinhanmin2014
Member Author

@glemaitre Thanks a lot. I removed np.array, and all the comments are addressed. Sorry @jnothman for not realizing earlier that your suggestions were right.

@qinhanmin2014
Member Author

Any update on it? Thanks :)

@glemaitre
Member

I checked the artifact. Can you add an entry in the API documentation to link to the User Guide entry?

For example:
https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/ensemble/forest.py#L751
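
For reference, such an entry is typically a one-line pointer in the class docstring, something like this (the exact anchor name below is an assumption):

Read more in the :ref:`User Guide <preprocessing_discretization>`.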

@glemaitre
Member

LGTM to be merged into discrete. I assume that an example will be added later on, since this PR is only for the encoding option.

cc: @lesteve @jnothman @ogrisel

@codecov

codecov bot commented Sep 7, 2017

Codecov Report

Merging #9647 into discrete will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##           discrete    #9647      +/-   ##
============================================
- Coverage     96.18%   96.18%   -0.01%     
============================================
  Files           336      338       +2     
  Lines         63835    62408    -1427     
============================================
- Hits          61402    60027    -1375     
+ Misses         2433     2381      -52
Impacted Files Coverage Δ
sklearn/preprocessing/discretization.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/tests/test_discretization.py 100% <100%> (ø) ⬆️
sklearn/datasets/olivetti_faces.py 33.33% <0%> (-8.89%) ⬇️
sklearn/datasets/base.py 85.5% <0%> (-6.58%) ⬇️
sklearn/datasets/california_housing.py 39.47% <0%> (-4.12%) ⬇️
sklearn/datasets/covtype.py 52.17% <0%> (-4.08%) ⬇️
sklearn/tests/test_docstring_parameters.py 37.5% <0%> (-3.84%) ⬇️
sklearn/utils/tests/test_testing.py 81% <0%> (-3.68%) ⬇️
sklearn/datasets/lfw.py 12.41% <0%> (-2.4%) ⬇️
sklearn/datasets/kddcup99.py 32.72% <0%> (-2.14%) ⬇️
... and 90 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ef342e...a4e9722. Read the comment docs.

@qinhanmin2014
Member Author

@glemaitre Sorry to disturb you. It seems that the link to the user guide doesn't work, and I cannot figure out why.

@jnothman
Member

jnothman commented Sep 7, 2017

I think that's sufficient consensus...

@jnothman merged commit eef7bdb into scikit-learn:discrete on Sep 7, 2017
@jnothman
Member

jnothman commented Sep 7, 2017

Thanks @qinhanmin2014

@qinhanmin2014
Member Author

@jnothman It seems that the link to the user guide doesn't work, and I cannot find the reason. Could you please help me? Thanks a lot :)

@glemaitre
Member

I put the fix in #9705. I think the 'discretization' label was duplicated by the next section, and it seems that labels are not case sensitive. I built locally, and it was fine after renaming it.

@qinhanmin2014
Member Author

@glemaitre Thanks a lot for the fix :)
The example is currently being discussed in #9339 as well as the mailing list.

@qinhanmin2014 deleted the my-feature-1 branch on September 7, 2017 15:38

6 participants