
[MRG+1] discrete branch: add a second example for KBinsDiscretizer #10195


Merged
3 commits merged from TomDLT:discretization_example into scikit-learn:discrete on Nov 27, 2017

Conversation

@TomDLT (Member) commented Nov 23, 2017

Reference Issues/PRs

Complementary to #10192, which also tackles issue #9339. It is not intended to replace it, but rather to supplement it.

What does this implement/fix? Explain your changes.

Add another example for KBinsDiscretizer.

Any other comments?

Local result: [figure_1: local rendering of the example plot]

@TomDLT force-pushed the discretization_example branch from 767d777 to ffd97a3, November 23, 2017 19:20
@glemaitre (Member) commented:

Looks cool. I'll have a look at it

@@ -166,7 +166,7 @@ def _validate_n_bins(self, n_features, ignored):
         """
         orig_bins = self.n_bins
         if isinstance(orig_bins, numbers.Number):
-            if not isinstance(orig_bins, np.int):
+            if not isinstance(orig_bins, (np.int, np.integer)):
Review comment (Member):

At a glance, this doesn't seem consistent with #10017. It seems we are going to use (numbers.Integral, np.integer)?
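
For context, a quick illustration of what the check catches (behaviour shown for Python 3; as the discussion further down in this thread shows, older Python/NumPy combinations can differ):

    import numbers
    import numpy as np

    x = np.array([5])[0]             # a NumPy integer scalar (np.int64 on most platforms)
    isinstance(x, int)               # False on Python 3: np.int is simply an alias for the
                                     # builtin int, which np.int64 does not subclass there
    isinstance(x, np.integer)        # True: np.integer is the base class of NumPy integer scalars
    isinstance(x, numbers.Integral)  # True: NumPy registers its integer scalars with the Integral ABC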

@scikit-learn deleted a comment from codecov bot, Nov 24, 2017
@qinhanmin2014 (Member) left a comment:

learned a lot from this great example :)

name = estimator.__class__.__name__
if name == 'Pipeline':
    name = [get_name(est[1]) for est in estimator.steps]
    name = '\n'.join(name)
Review comment (Member):

This is used both in the plot and the output message, but the '\n' will make the output message look strange. See the rendered page.
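
A possible shape for the helper along the lines of this remark (get_name is the example's own helper; the flat separator for the printed message is only a sketch of the suggestion, not necessarily what was merged):

    def get_name(estimator):
        # Recurse into pipelines so the label lists each step's class name.
        name = estimator.__class__.__name__
        if name == 'Pipeline':
            name = [get_name(est[1]) for est in estimator.steps]
            name = '\n'.join(name)
        return name

    # For console output, replace the newlines used in the plot labels:
    # print(ds_cnt, name.replace('\n', ' + '), score)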

KBinsDiscretizer(encode='onehot'), LinearSVC(random_state=0)), {
    'kbinsdiscretizer__n_bins': np.arange(2, 10),
    'linearsvc__C': np.logspace(-2, 7, 10),
}),
Review comment (Member):

Why aren't there KBinsDiscretizer + GradientBoostingClassifier and KBinsDiscretizer + SVC? It seems that KBinsDiscretizer + GradientBoostingClassifier somehow performs better than GradientBoostingClassifier alone on the first dataset.

Reply (Member):

Because they are non-linear classifiers, and the idea is to show that a linear classifier with this transformer can perform as well as a non-linear classifier, isn't it?

Reply (Member):

Thanks :) Then I suppose it might be better to state that explicitly in the example; it seems a bit confusing without your explanation, at least from my side.
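
For readers following along, a minimal sketch of the pipeline the excerpt above configures (imports assume KBinsDiscretizer's eventual home in sklearn.preprocessing; at the time of this PR it only existed on the discrete branch):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.svm import LinearSVC

    # Bin each feature, one-hot encode the bin ids, then fit a linear classifier on top.
    estimator = make_pipeline(KBinsDiscretizer(encode='onehot'),
                              LinearSVC(random_state=0))
    param_grid = {
        'kbinsdiscretizer__n_bins': np.arange(2, 10),
        'linearsvc__C': np.logspace(-2, 7, 10),
    }
    clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5)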

clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(ds_cnt, name, score)
Review comment (Member):

The message seems too simple from my side; maybe it's better to add something? E.g., add 'dataset' before ds_cnt (giving "dataset 1") and 'score:' before score (giving "score: 0.88").
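
One way to do that, purely as an illustration of the suggestion (exact wording up to the author):

    print('dataset %d: %s, test score: %.2f'
          % (ds_cnt, name.replace('\n', ' + '), score))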

@glemaitre (Member) left a comment:

Actually I like it as it is. My only small remark is that I would find it easier if the scoring metric were specified in the title or in the legend.

# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
Review comment (Member):

Can you avoid using \ and break the line inside the parentheses instead?

Reply (Member):

Even if I see that it was the same in the original example ;)
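
The same split written without the backslash, wrapping inside the call instead (the split parameters below are placeholders, not necessarily those used by the example):

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=.4, random_state=42)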

A demonstration of feature discretization on synthetic classification datasets.
Feature discretization decomposes each feature into a set of bins, here
equally distributed in width. The discrete values are then one-hot encoded,
and given to a linear classifier. On the two non-linearly separable datasets,
Review comment (Member):

Hold the reader's hand a bit more: the first two rows represent linearly non-separable datasets (moons and concentric circles) while the third is approximately linearly separable.

linearly.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
Review comment (Member):

You need to either describe the groupings here or title the plot to show them: linear classifiers, non-linear classifiers, linear classifiers with discretized input.

Reply (Member):

Perhaps the discretized classifiers should come before the nonlinear ones to emphasise the difference between the nonlinear group and the linear group.

@TomDLT force-pushed the discretization_example branch from e6e5972 to 57e4e94, November 27, 2017 11:14
@scikit-learn deleted a comment from codecov bot, Nov 27, 2017
@qinhanmin2014 (Member) left a comment:

LGTM
@TomDLT Could you please have a look at #10192 so that we can get the two examples in, merge discrete into master, and avoid repeatedly resolving conflicts? Thanks :)

@@ -166,7 +166,7 @@ def _validate_n_bins(self, n_features, ignored):
         """
         orig_bins = self.n_bins
         if isinstance(orig_bins, numbers.Number):
-            if not isinstance(orig_bins, np.int):
+            if not isinstance(orig_bins, (numbers.Integral, np.integer)):
Review comment (Member):

just note that it will eventually be changed to something like SCALAR_INTEGER_TYPES in #10017

@@ -29,6 +29,11 @@ def test_fit_transform():
     assert_array_equal(expected, est.transform(X))
 
 
+def test_valid_n_bins():
Review comment (Member):

I might prefer to remove the test:
(1) KBinsDiscretizer is currently only in a PR (not in master), so this can be considered an improvement inside that PR.
(2) We do not add many tests in #10017; I don't think it deserves a test.
(3) It does not seem to be a regression test?

isinstance(2, np.int) = True
isinstance(np.array([2])[0], np.int) = True

Reply from @TomDLT (Member, author):

(1) This is a PR to the discrete branch, which will end up in the PR from discrete to master.
(2) I would be in favor of this test, even though we may not need something systematic in #10017.
(3) For me:

np.__version__ == '1.13.1'
isinstance(np.array([2])[0], np.int) == False

Reply (Member):

My previous script was run under Python 2.7.12 and numpy 1.13.3, and I can reproduce yours under Python 3.5.4 and numpy 1.13.1. So I think the reason for our difference is the numpy version.
Let's keep the test if you think it is appropriate :) LGTM for the great example.
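
For reference, a sketch of the kind of test being discussed, assuming the KBinsDiscretizer API as it later landed in sklearn.preprocessing (the test actually added in this PR may differ):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    def test_valid_n_bins():
        # A NumPy integer scalar for n_bins should pass validation like a plain Python int.
        X = [[-2, 1.5], [-1, 2.5], [0, 3.5], [1, 4.5]]
        est = KBinsDiscretizer(n_bins=np.int64(3)).fit(X)
        assert list(est.n_bins_) == [3, 3]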

@TomDLT (Member, author) commented Nov 27, 2017

New rendered example

@qinhanmin2014 changed the title from "[MRG] discrete branch: add a second example for KBinsDiscretizer" to "[MRG+1] discrete branch: add a second example for KBinsDiscretizer" on Nov 27, 2017
features, which easily lead to overfitting when the number of samples is small.

The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test
Review comment (Member):

I think precede "test set" with "held out"

Reply (Member):

Okay. This is clear from context.

@jnothman (Member) commented:

Merging. Thanks for the nice plot!

@jnothman merged commit 430af30 into scikit-learn:discrete on Nov 27, 2017
@TomDLT deleted the discretization_example branch, November 28, 2017 10:20