
[MRG] ENH Remove ignored_features in KBinsDiscretizer #11467


Merged
merged 9 commits into from
Jul 11, 2018

Conversation

qinhanmin2014
Member

Now that we have ColumnTransformer, we don't need to support ignored_features in KBinsDiscretizer (as was done for OneHotEncoder). See:
#9342 (comment)
#9342 (comment)

I think this solution will work, but there are still things to consider:
(1) Is it good to use _transform_selected? I think it's acceptable, since with the default selected="all" it simply validates the input with check_array and returns it directly.
(2) Should we support inverse_transform for encodings other than ordinal? One way is to store the fitted OneHotEncoder (or simply build a new one and set categories_, so that it can support inverse_transform without fitting). The other is to borrow some code from OneHotEncoder's inverse_transform.
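The ColumnTransformer approach can be sketched as follows (a minimal illustration; the column indices and data are made up):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0., 1., 10.],
              [1., 5., 20.],
              [2., 9., 30.]])

# Instead of ignored_features=[1], discretize only columns 0 and 2
# and pass column 1 through untouched.
ct = ColumnTransformer(
    [("kbd", KBinsDiscretizer(n_bins=3, encode="ordinal",
                              strategy="uniform"), [0, 2])],
    remainder="passthrough")
Xt = ct.fit_transform(X)
# Columns 0 and 2 become bin indices; column 1 is appended unchanged.
```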

@scikit-learn scikit-learn deleted a comment from sklearn-lgtm Jul 10, 2018
@qinhanmin2014
Member Author

ping @jnothman @TomDLT ready for review :)

@jnothman
Member

What is the benefit of using _transform_selected?

@jnothman
Member

I'd be fine with storing the fitted encoder to inverse_transform. But that's an enhancement to consider later

@qinhanmin2014
Member Author

What is the benefit of using _transform_selected?

No apparent benefit (maybe just avoiding some duplicated code)

Xt = _transform_selected(X, self._transform, self.dtype, copy=True, retain_order=True)

is the same as

X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES)
Xt = self._transform(X)
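A rough sketch of why the two are equivalent (the real private helper's body differs; this only shows the selected="all" fast path):

```python
import numpy as np
from sklearn.utils import check_array

FLOAT_DTYPES = (np.float64, np.float32, np.float16)

def _transform_selected(X, transform, dtype, selected="all", copy=True):
    # Sketch of the old private helper: dtype is accepted only to mirror
    # the original signature.  With the default selected="all" it just
    # validates X and applies `transform` to the whole array.
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
    if isinstance(selected, str) and selected == "all":
        return transform(X)
    raise NotImplementedError("column-subset path omitted in this sketch")

Xt = _transform_selected([[0, 1], [2, 3]], lambda A: A * 2, np.float64)
```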

@jnothman
Member

jnothman commented Jul 11, 2018 via email

@jnothman
Member

jnothman commented Jul 11, 2018 via email

Member Author

@qinhanmin2014 qinhanmin2014 left a comment


ping @jnothman ready for review

Xt = _transform_selected(X, self._transform, self.dtype,
                         self.transformed_features_, copy=True,
                         retain_order=True)
X = check_array(X, dtype=FLOAT_DTYPES)
Member Author


We use X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES) in _transform_selected.
Here, I removed accept_sparse because we don't support sparse input here, and I removed copy because I don't find it useful. I can't remove dtype because the result would change.
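A small illustration of why dtype can't be dropped (assuming integer input; FLOAT_DTYPES is sklearn's (np.float64, np.float32, np.float16) tuple):

```python
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES

X_int = np.array([[1, 2], [3, 4]])

# With dtype=FLOAT_DTYPES, integer input is cast to float64 ...
X_float = check_array(X_int, dtype=FLOAT_DTYPES)

# ... while the default dtype='numeric' keeps the integer dtype,
# which changes the result of any float values later stored inplace.
X_kept = check_array(X_int)
```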

Member


check_array is already called in _validate_X_post_fit.

Also, copy=True is necessary since X is modified inplace.
This should probably be tested.
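Such a test could look roughly like this (a sketch, not necessarily the test that was eventually added):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# transform must not mutate the caller's array, hence copy=True internally.
X = np.array([[0.0], [0.5], [1.0]])
X_orig = X.copy()
kbd = KBinsDiscretizer(n_bins=2, encode='ordinal',
                       strategy='uniform').fit(X)
kbd.transform(X)
assert np.array_equal(X, X_orig)
```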

Member


I also realize that self.dtype is not used anymore.
We can probably remove it completely.

Member

@jnothman jnothman left a comment


LGTM. Thank you

@@ -117,16 +109,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):

np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])

You can combine ``KBinsDiscretizer`` with ``ColumnTransformer`` if you
Member


Probably best to use a :class: reference here

@qinhanmin2014
Member Author

ping @TomDLT for a second review if you have time :)

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 11, 2018
Member

@TomDLT TomDLT left a comment


This is much cleaner, thanks!
Just a few changes are necessary.


Member

@TomDLT TomDLT left a comment


LGTM

@qinhanmin2014
Member Author

Thanks @TomDLT for the review. (I was waiting for CI and wasn't expecting to get a response so quickly :))

check_array is already called in _validate_X_post_fit

I chose to remove _validate_X_post_fit. Firstly, this avoids a duplicate check_array in transform. Secondly, _validate_X_post_fit blocks us from supporting inverse_transform for encodings other than ordinal.

Also, copy=True is necessary since X is modified inplace. This should probably be tested.

Thanks, updated with a test. (It seems there's no such test in the common tests? Not 100% sure though.)

I also realize that self.dtype is not used anymore.

Thanks, removed.

I also let KBinsDiscretizer go through the common tests :)

@qinhanmin2014
Member Author

CIs are green. ping @jnothman for a final check or @TomDLT if you're confident enough to merge directly. Thanks :)

if self.encode != 'ordinal':
    raise ValueError("inverse_transform only supports "
                     "'encode = ordinal'. Got encode={!r} instead."
                     .format(self.encode))

Xt = self._validate_X_post_fit(Xt)
trans = self.transformed_features_
Xt = check_array(Xt, dtype='numeric')
Member


We should probably use FLOAT_DTYPES here, since we modify Xt inplace and we may want to put floats into it.
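The concern, illustrated: with dtype='numeric', integer input stays integer, and writing float bin centers back into it inplace silently truncates (bin_centers here is a made-up example):

```python
import numpy as np

Xt = np.array([[0], [1], [2]])            # ordinal codes, integer dtype
bin_centers = np.array([0.25, 0.75, 1.5])

Xt_int = Xt.copy()
Xt_int[:, 0] = bin_centers[Xt_int[:, 0]]  # floats truncated to ints

Xt_float = Xt.astype(np.float64)          # what FLOAT_DTYPES would give
Xt_float[:, 0] = bin_centers[Xt[:, 0]]    # floats preserved
```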


bin_edges = self.bin_edges_[trans]
for jj in range(X.shape[1]):
Xt = X.copy()
Member


Why not use the copy parameter of check_array?
It would avoid a double copy if X is not a float array.

This also applies to inverse_transform.
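The double-copy point, sketched (assuming integer input, where check_array must cast anyway):

```python
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES

X_int = np.array([[1, 2], [3, 4]])

# Pattern under review: validate, then copy explicitly -> two
# allocations when X is not already float (the cast itself already
# produced a fresh array).
X1 = check_array(X_int, dtype=FLOAT_DTYPES).copy()

# Suggested pattern: let check_array handle the copy, so the cast and
# the copy collapse into a single allocation.
X2 = check_array(X_int, dtype=FLOAT_DTYPES, copy=True)
```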

@qinhanmin2014
Member Author

Thanks @TomDLT. Comments addressed. I need a more thorough understanding of our check_array :)

@TomDLT TomDLT merged commit e4089d1 into scikit-learn:discrete Jul 11, 2018
@qinhanmin2014 qinhanmin2014 deleted the remove-ignored-features branch July 12, 2018 01:07