-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] ENH Remove ignored_features in KBinsDiscretizer #11467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] ENH Remove ignored_features in KBinsDiscretizer #11467
Conversation
What is the benefit of using _transform_selected? |
I'd be fine with storing the fitted encoder to inverse_transform. But that's an enhancement to consider later |
No apparent benefits (maybe just avoid some duplicate code)
is the same as
|
The use the latter. Much simpler to read and maintain.
We can then, if we wish, revert all the changes to _transform_selected
currently in discrete (but they're also fairly harmless, and
_transform_selected might be removed in v0.22)
|
Also, it may be unnecessary to keep _transform as a separate function
…On 11 July 2018 at 11:04, Joel Nothman ***@***.***> wrote:
The use the latter. Much simpler to read and maintain.
We can then, if we wish, revert all the changes to _transform_selected
currently in discrete (but they're also fairly harmless, and
_transform_selected might be removed in v0.22)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @jnothman ready for review
Xt = _transform_selected(X, self._transform, self.dtype, | ||
self.transformed_features_, copy=True, | ||
retain_order=True) | ||
X = check_array(X, dtype=FLOAT_DTYPES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use X = check_array(X, accept_sparse='csc', copy=True, dtype=FLOAT_DTYPES)
in _transform_selected
.
Here, I remove accept_sparse
because we don't support sparse input here. I remove copy
because I don't find it useful. I can't remove dtype
because the result will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_array
is already called in _validate_X_post_fit
.
Also, copy=True
is necessary since X
is modified inplace.
This should probably be tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also realize that self.dtype
is not used anymore.
We can probably remove it completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you
@@ -117,16 +109,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin): | |||
|
|||
np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf]) | |||
|
|||
You can combine ``KBinsDiscretizer`` with ``ColumnTransformer`` if you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably best to use a :class: reference here
ping @TomDLT for a second review if you have time :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much cleaner, thanks!
Just a few changes necessary
Xt = _transform_selected(X, self._transform, self.dtype, | ||
self.transformed_features_, copy=True, | ||
retain_order=True) | ||
X = check_array(X, dtype=FLOAT_DTYPES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_array
is already called in _validate_X_post_fit
.
Also, copy=True
is necessary since X
is modified inplace.
This should probably be tested.
Xt = _transform_selected(X, self._transform, self.dtype, | ||
self.transformed_features_, copy=True, | ||
retain_order=True) | ||
X = check_array(X, dtype=FLOAT_DTYPES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also realize that self.dtype
is not used anymore.
We can probably remove it completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @TomDLT for the review. (I'm waiting for CI, not expecting to get a response so quickly :))
I choose to remove _validate_X_post_fit. Firstly, this can avoid duplicate check_array in transform. Secondly, _validate_X_post_fit block us from supporting inverse_transform for encoders other than ordinal.
Thanks, updated with a test. (Seems that there's not such test in the common test? not 100% sure though. )
Thanks, removed. I also let KBinsDiscretizer go through the common test :) |
if self.encode != 'ordinal': | ||
raise ValueError("inverse_transform only supports " | ||
"'encode = ordinal'. Got encode={!r} instead." | ||
.format(self.encode)) | ||
|
||
Xt = self._validate_X_post_fit(Xt) | ||
trans = self.transformed_features_ | ||
Xt = check_array(Xt, dtype='numeric') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably use FLOAT_DTYPES
here, since we modify Xt
inplace and we may want to put float in it.
|
||
bin_edges = self.bin_edges_[trans] | ||
for jj in range(X.shape[1]): | ||
Xt = X.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why no using the copy parameter of check_array
?
It would avoid a double copy if X
is not a float array.
This applies also for inverse_transform
.
Thanks @TomDLT. Comments addressed. I need to have a more thorough understanding of our check_array :) |
Now we have
ColumnTransformer
, so we don't need to supportignored_features
inKBinsDiscretizer
(as we've done inOneHotEncoder
). See:#9342 (comment)
#9342 (comment)
I think this solution will work, but we still have things to consider:
(1) Is it good to use
_transform_selected
? I think it's acceptable, since with defaultselected="all"
, it will simply validate the input withcheck_array
and return directly.(2) Should we support
inverse_transform
for encoders other thanordinal
? To support this, one way is to store the fitted OntHotEncoder (or simply build a new one and setcategories_
, so that it can supportinverse_transform
without fitting). The other way is to borrow some code from inverse_transform of OntHotEncoder.