[MRG+2] discrete branch: add an example for KBinsDiscretizer #10192

Merged: 10 commits into scikit-learn:discrete on Nov 27, 2017

Conversation

@qinhanmin2014 (Member) commented Nov 23, 2017

Reference Issues/PRs

Fixes #9339

What does this implement/fix? Explain your changes.

Add an example for KBinsDiscretizer.
Reference: "Introduction to Machine Learning with Python" (Chapter 4, Section 2).

Any other comments?

Local result: [figure: rendered example plot]
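
For orientation, a rough sketch of what the example does, written against the KBinsDiscretizer API as it later shipped (the encode and strategy parameters may differ from the discrete branch at the time; the data construction is illustrative, not the PR's code):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rnd.normal(size=100) / 3

# bin the single feature into 10 intervals, one indicator column per bin
enc = KBinsDiscretizer(n_bins=10, encode='onehot-dense', strategy='uniform')
X_binned = enc.fit_transform(X)

# the linear model gains flexibility on the binned features,
# while the tree can no longer split within a bin
for name, data in (("original", X), ("binned", X_binned)):
    lr = LinearRegression().fit(data, y)
    dt = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(data, y)
    print("%s: linear R^2=%.2f, tree R^2=%.2f"
          % (name, lr.score(data, y), dt.score(data, y)))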

@qinhanmin2014 (Member, Author)

ping @jnothman Could you help me diagnose the Circle failure? Do we need to merge master into discrete again? Thanks a lot.

@TomDLT (Member) left a comment

Do we need to merge master into discrete again?

Done, Circle is not failing anymore.

linestyle=':', label='decision tree')
plt.plot(X[:, 0], y, 'o', c='k')
bins = enc.offset_[0] + enc.bin_width_[0] * np.arange(1, enc.n_bins_[0])
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)

To have automatic ymin, ymax, you can use

plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
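
As a standalone illustration of that trick (made-up data), reading the current limits back avoids hard-coding ymin/ymax:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
plt.plot(x, np.sin(x), 'o', c='k')
# read back the y-limits the scatter established, then span them
plt.vlines(np.arange(-2, 3), *plt.gca().get_ylim(), linewidth=1, alpha=.2)
plt.show()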

# construct the dataset
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y_no_noise = (np.sin(4 * X) + X)
@TomDLT (Member) commented Nov 23, 2017

I would be in favour of something more non-linear, to better show the failure of the linear model.

e.g. y = np.sin(X) + rnd.normal(size=len(X)) / 3:

[figure: dataset with the suggested target]
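
Spelled out, the suggested construction would be (a sketch; rnd as in the snippet above):

import numpy as np

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
# stronger non-linearity plus noise; a straight line clearly underfits this
y = np.sin(X) + rnd.normal(size=len(X)) / 3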

plt.title("Result before discretization")

# predict with transformed dataset
plt.subplot(122)
@TomDLT (Member) commented Nov 23, 2017

The comparison is clearer if the subplots have the same ylim (sharey=True). You can do:

fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
...
plt.sca(axes[1])
...

@qinhanmin2014 (Member, Author)

@TomDLT Thanks a lot for your great help :)

@lesteve (Member) commented Nov 23, 2017

Do we need to merge master into discrete again?

Done, Circle is not failing anymore.

Hmmm, that is a bit weird; I would expect that checkout_merge_commit.sh would take care of that ...

@qinhanmin2014 (Member, Author)

ping @TomDLT Thanks for your instructions. Comments addressed :) Result from the Circle page: [figure: sphx_glr_plot_discretization_001]

ping @jnothman I think it's ready for review. Thanks :)

ping @amueller The example is based on your book "Introduction to Machine Learning with Python" (Chapter 4, Section 2, binning & discretization), using scikit-learn's latest API, KBinsDiscretizer. Would be grateful if you could take some time to have a look (I've put your name at the beginning). Thanks :)

@jnothman (Member) left a comment

I've not yet looked at the code...

================================================================

The example compares prediction result of linear regression (linear model)
and decision tree (tree based model) before and after discretization.

Before and after -> with and without


  • of real-valued features

before discretization, the linear model becomes much more flexible while the
decision tree gets much less flexible. Note that binning features generally
has no beneficial effect for tree-based models, as these models can learn to
split up the data anywhere.

Note that the linear model is fast to build and relatively straightforward to interpret.

@jnothman (Member)

And yes, I find this a much more compelling argument than what we had before with iris.

@jnothman (Member) left a comment

Please mention one-hot encoding.

Also note that if the bins are not reasonably wide, there would appear to be a substantially increased risk of overfitting, so the discretiser parameters need tuning under cv.
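
For reference, in the API as it later shipped the one-hot step can be requested directly on the discretizer; a minimal sketch (assuming the released encode parameter, with toy data):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.5], [-0.5], [0.5], [2.5]])
# each sample becomes an indicator vector over its bin
enc = KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform')
print(enc.fit_transform(X))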

@qinhanmin2014 (Member, Author)

@jnothman Comments addressed. Thanks a lot for the instant review :)

is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters need to be tuned

Need -> should usually

seeing as you don't do that tuning here


# predict with original dataset
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])

Do we use sca much in other examples? It seems a bit unconventional. Then again, I can see how using axes methods together with pyplot.subplots might be seen as inconsistent.

@qinhanmin2014 (Member, Author)

@jnothman Thanks :) Comments addressed.

seeing as you don't do that tuning here

In fact, the value of n_bins here (10) is among the best choices from cross-validation (if we grid search on a pipeline of KBinsDiscretizer + LinearRegression). So I think putting the value in directly is consistent with this statement and might make the example easier to go through, especially considering that we now have another example which shows the detailed tuning process.
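
A sketch of what that grid search could look like (not the code in this PR; written against the released API, with illustrative data):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rnd.normal(size=100) / 3

pipe = make_pipeline(KBinsDiscretizer(encode='onehot-dense'), LinearRegression())
# tune the number of bins under cross-validation
search = GridSearchCV(pipe, {'kbinsdiscretizer__n_bins': list(range(2, 20))}, cv=5)
search.fit(X, y)
print(search.best_params_)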

Do we use sca much in other examples? It seems a bit unconventional.

I don't find plt.sca under the examples folder. I have followed a matplotlib example and used a more common way.
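
i.e. driving each panel through its Axes object rather than plt.sca, roughly like this (illustrative data; the real example plots the fitted models):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
ax1.plot(x, np.sin(x))
ax1.set_title("Result before discretization")
ax2.plot(x, np.round(np.sin(x) * 2) / 2)  # a coarsened stand-in
ax2.set_title("Result after discretization")
plt.show()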

@TomDLT (Member) left a comment

LGTM

discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually
be tuned under cv.

cv -> cross-validation

is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually

discretiser -> discretizer

@TomDLT TomDLT changed the title [MRG] discrete branch: add an example for KBinsDiscretizer [MRG+2] discrete branch: add an example for KBinsDiscretizer Nov 27, 2017
@qinhanmin2014 (Member, Author)

Merging given the approvals from jnothman and TomDLT.
Rendered example from Circle.

@qinhanmin2014 qinhanmin2014 merged commit 2c9134e into scikit-learn:discrete Nov 27, 2017