[MRG+1] discrete branch: add an second example for KBinsDiscretizer #10195
Conversation
Force-pushed from 767d777 to ffd97a3
Looks cool. I'll have a look at it.
@@ -166,7 +166,7 @@ def _validate_n_bins(self, n_features, ignored):
         """
         orig_bins = self.n_bins
         if isinstance(orig_bins, numbers.Number):
-            if not isinstance(orig_bins, np.int):
+            if not isinstance(orig_bins, (np.int, np.integer)):
At a glance, this seems inconsistent with #10017. It seems that we are going to use `(numbers.Integral, np.integer)` there?
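The difference between the checks under discussion can be illustrated with a small sketch (the helper name `is_scalar_integer` is made up for illustration, not part of the PR):

```python
import numbers

import numpy as np


def is_scalar_integer(x):
    # Accept both Python ints and NumPy integer scalars. On some
    # NumPy versions a plain numbers.Integral check can miss NumPy
    # scalars, hence the explicit np.integer in the tuple.
    return isinstance(x, (numbers.Integral, np.integer))


print(is_scalar_integer(2))                 # Python int -> True
print(is_scalar_integer(np.array([2])[0]))  # NumPy integer scalar -> True
print(is_scalar_integer(2.0))               # float -> False
```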
learned a lot from this great example :)
name = estimator.__class__.__name__
if name == 'Pipeline':
    name = [get_name(est[1]) for est in estimator.steps]
    name = '\n'.join(name)
This is used both in the plot and in the output message, but the '\n' will make the output message strange. See the rendered page.
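One way around this is to make the separator configurable, joining with ' + ' for the printed message and '\n' only for the plot. A hypothetical sketch, with minimal stand-in classes since the real estimators come from scikit-learn:

```python
class KBinsDiscretizer:  # stand-ins for illustration only
    pass


class LinearSVC:
    pass


class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs


def get_name(estimator, sep='\n'):
    # Same logic as the snippet above, but with a configurable
    # separator for flattened Pipeline step names.
    name = estimator.__class__.__name__
    if name == 'Pipeline':
        name = sep.join(get_name(est[1], sep) for est in estimator.steps)
    return name


pipe = Pipeline([('kb', KBinsDiscretizer()), ('svc', LinearSVC())])
print(get_name(pipe, sep=' + '))  # KBinsDiscretizer + LinearSVC
```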
    KBinsDiscretizer(encode='onehot'), LinearSVC(random_state=0)), {
        'kbinsdiscretizer__n_bins': np.arange(2, 10),
        'linearsvc__C': np.logspace(-2, 7, 10),
    }),
Why are there no KBinsDiscretizer + GradientBoostingClassifier and KBinsDiscretizer + SVC combinations? Though it seems that KBinsDiscretizer + GradientBoostingClassifier somehow performs better than GradientBoostingClassifier on the first dataset.
Because those are non-linear classifiers, and the idea is to show that a linear classifier with this transformer can perform as well as a non-linear classifier, isn't it?
Thanks :) Then I suppose it might be better to explicitly state that in the example; it seems a bit confusing without your explanation, at least from my side.
clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(ds_cnt, name, score)
The message seems too simple from my side; maybe it's better to add something? E.g., add 'dataset' before ds_cnt (e.g., dataset 1) and 'score:' before the score (e.g., score: 0.88).
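The suggested formatting might look like this (the values are made up for illustration):

```python
ds_cnt, name, score = 1, 'LinearSVC', 0.88

# Prefix the dataset counter and label the score, as suggested above.
msg = 'dataset %d, %s: score %.2f' % (ds_cnt, name, score)
print(msg)  # dataset 1, LinearSVC: score 0.88
```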
Actually I like it as it is. My only small remark would be that I would find it easier if the scoring metric were specified in the title or in the legend.
# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
Can you avoid using `\` and instead break the line inside the parentheses?
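The suggested style, sketched here with a toy stand-in for `train_test_split` (the real one lives in sklearn.model_selection):

```python
def toy_train_test_split(X, y, test_size=0.5):
    # Toy stand-in: deterministic split, no shuffling.
    n = int(len(X) * (1 - test_size))
    return X[:n], X[n:], y[:n], y[n:]


X, y = list(range(10)), list(range(10))

# Break the line inside the call's parentheses (implicit line
# joining, PEP 8) instead of using a trailing backslash:
X_train, X_test, y_train, y_test = toy_train_test_split(
    X, y, test_size=0.4)
print(len(X_train), len(X_test))  # 6 4
```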
Even if I see that it was the same in the original example ;)
A demonstration of feature discretization on synthetic classification datasets.
Feature discretization decomposes each feature into a set of bins, here
equally distributed in width. The discrete values are then one-hot encoded,
and given to a linear classifier. On the two non-linearly separable datasets,
Hold the reader's hand a bit more: the first two rows represent linearly non-separable datasets (moons and concentric circles) while the third is approximately linearly separable.
linearly.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
You either need to describe here or title the plot to show the groupings: linear classifiers, non-linear classifiers, linear classifiers with discretized input.
Perhaps the discretized classifiers should come before the nonlinear ones to emphasise the difference between the nonlinear group and the linear group.
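The transformation described in the example's docstring — equal-width binning followed by one-hot encoding — can be sketched in plain NumPy. This is an illustrative sketch, not KBinsDiscretizer's implementation, and edge handling is simplified:

```python
import numpy as np


def equal_width_bin_onehot(x, n_bins):
    # Bin a 1-D feature into n_bins equal-width bins, then
    # one-hot encode the bin index.
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Use only the interior edges so the max value falls in the last bin.
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    return np.eye(n_bins)[idx]


x = np.array([0.0, 0.4, 1.0])
print(equal_width_bin_onehot(x, 2))
# [[1. 0.]
#  [1. 0.]
#  [0. 1.]]
```

A linear classifier fit on this one-hot representation can then carve the feature space along the bin boundaries, which is the effect the example demonstrates.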
Force-pushed from e6e5972 to 57e4e94
@@ -166,7 +166,7 @@ def _validate_n_bins(self, n_features, ignored):
         """
         orig_bins = self.n_bins
         if isinstance(orig_bins, numbers.Number):
-            if not isinstance(orig_bins, np.int):
+            if not isinstance(orig_bins, (numbers.Integral, np.integer)):
just note that it will eventually be changed to something like SCALAR_INTEGER_TYPES in #10017
@@ -29,6 +29,11 @@ def test_fit_transform():
     assert_array_equal(expected, est.transform(X))


+def test_valid_n_bins():
I might prefer to remove the test:
(1) KBinsDiscretizer is currently a PR (not in master), so this can be considered an improvement inside the PR.
(2) We do not add many tests in #10017; I don't think it deserves a test.
(3) It doesn't seem to be a regression test? For me:
isinstance(2, np.int) == True
isinstance(np.array([2])[0], np.int) == True
(1) This is a PR to the `discrete` branch, which will end up in the PR from `discrete` to `master`.
(2) I would be in favor of this test, even though we may not need something systematic in #10017.
(3) For me:
np.__version__ == '1.13.1'
isinstance(np.array([2])[0], np.int) == False
My previous script was run under Python 2.7.12 and numpy 1.13.3, and I can reproduce yours under Python 3.5.4 and numpy 1.13.1, so I think the reason for our difference is the numpy version.
Let's keep the test if you think it is appropriate :) LGTM for the great example.
features, which easily lead to overfitting when the number of samples is small.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
I think precede "test set" with "held out"
Okay. This is clear from context.
Merging. Thanks for the nice plot!
Reference Issues/PRs
Complementary to #10192, which also tackles issue #9339. It is not intended to replace that PR, but rather to supplement it.
What does this implement/fix? Explain your changes.
Add another example for KBinsDiscretizer.
Any other comments?
local result
