[MRG+2] discrete branch: add an example for KBinsDiscretizer #10192
Conversation
ping @jnothman Could you help me diagnose the Circle failure? Do we need to merge master into discrete again? Thanks a lot.
> Do we need to merge master into discrete again?
Done, Circle is not failing anymore.
linestyle=':', label='decision tree')
plt.plot(X[:, 0], y, 'o', c='k')
bins = enc.offset_[0] + enc.bin_width_[0] * np.arange(1, enc.n_bins_[0])
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)
To have automatic ymin, ymax, you can use:
plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
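As a minimal, self-contained illustration of that suggestion (the data and bin positions below are placeholders, not the example's):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
plt.plot(x, np.sin(4 * x) + x, 'o', c='k')

bins = np.linspace(-3, 3, 11)[1:-1]  # placeholder bin edges
# *plt.gca().get_ylim() unpacks the current (ymin, ymax), so the vertical
# lines span the full height of whatever has already been plotted
plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
plt.show()
```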
# construct the dataset
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y_no_noise = (np.sin(4 * X) + X)
plt.title("Result before discretization") | ||
|
||
# predict with transformed dataset | ||
plt.subplot(122) |
The comparison is clearer if the subplots have the same ylim (sharey=True). You can do:
fig, axes = plt.subplots(nrows=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
...
plt.sca(axes[1])
...
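Filling in the ellipses, the suggested pattern looks roughly like this (dummy data; side-by-side ncols versus stacked nrows is just a layout choice):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)

# sharey=True keeps both panels on the same y-limits,
# which is what makes the before/after comparison readable
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))

plt.sca(axes[0])                 # make the left panel the current axes
plt.plot(x, np.sin(4 * x) + x)
plt.title("before")

plt.sca(axes[1])                 # switch to the right panel
plt.plot(x, np.round(np.sin(4 * x) + x))
plt.title("after")

plt.show()
```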
@TomDLT Thanks a lot for your great help :)
Hmmm, that is a bit weird; I would expect that checkout_merge_commit.sh would take care of that ...
ping @TomDLT Thanks for your instructions. Comments addressed :) Result from Circle page.
ping @jnothman I think it's ready for review. Thanks :)
ping @amueller The example is based on your book "Introduction to Machine Learning with Python" (Chapter 4, Section 2, binning & discretization), using scikit-learn's latest API, KBinsDiscretizer. I would be grateful if you could take some time to have a look (I've put your name at the beginning). Thanks :)
I've not yet looked at the code...
================================================================

The example compares prediction result of linear regression (linear model)
and decision tree (tree based model) before and after discretization.
Before and after -> with and without
- of real-valued features
before discretization, linear model become much more flexible while decision
tree gets much less flexible. Note that binning features generally has no
beneficial effect for tree-based models, as these models can learn to split
up the data anywhere.
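For readers without the diff open, here is a rough, self-contained sketch of the example being described. It is written against the KBinsDiscretizer API as later released (encode='onehot-dense', bin edges handled internally), which may differ slightly from the draft API on the discrete branch shown in the diff above:

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

# construct the same kind of noisy 1-D dataset as in the example
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y = np.sin(4 * X) + X + rnd.normal(size=len(X)) / 3
X = X.reshape(-1, 1)

# discretize the single feature and one-hot encode the bin membership
enc = KBinsDiscretizer(n_bins=10, encode='onehot-dense')
X_binned = enc.fit_transform(X)

line = np.linspace(-3, 3, 1000).reshape(-1, 1)
line_binned = enc.transform(line)

fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))

# left panel: models fitted on the raw feature
plt.sca(axes[0])
plt.plot(line, LinearRegression().fit(X, y).predict(line),
         linewidth=2, label='linear regression')
plt.plot(line, DecisionTreeRegressor(min_samples_split=3,
                                     random_state=0).fit(X, y).predict(line),
         linestyle=':', label='decision tree')
plt.plot(X[:, 0], y, 'o', c='k')
plt.legend(loc='best')
plt.title("Result before discretization")

# right panel: the same models fitted on the one-hot encoded bins
plt.sca(axes[1])
plt.plot(line, LinearRegression().fit(X_binned, y).predict(line_binned),
         linewidth=2, label='linear regression')
plt.plot(line, DecisionTreeRegressor(min_samples_split=3,
                                     random_state=0).fit(X_binned, y).predict(line_binned),
         linestyle=':', label='decision tree')
plt.plot(X[:, 0], y, 'o', c='k')
plt.legend(loc='best')
plt.title("Result after discretization")

plt.tight_layout()
plt.show()
```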
Note that the linear model is fast to build and relatively straightforward to interpret.
And yes, I find this a much more compelling argument than what we had before with iris.
Please mention one-hot encoding.
Also note that if the bins are not reasonably wide, there would appear to be a substantially increased risk of overfitting, so the discretiser parameters need tuning under cv.
@jnothman Comments addressed. Thanks a lot for the instant review :)
is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters need to be tuned
Need -> should usually
seeing as you don't do that tuning here
# predict with original dataset
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
Do we use sca much in other examples? It seems a bit unconventional. Then again, I can see how using axes methods together with pyplot.subplots might be seen as inconsistent
@jnothman Thanks :) Comments addressed.
In fact, the value of n_bins here (10) is among the best choices from cv (if we gridsearch on a pipeline KBinsDiscretizer + LinearRegression). So I think putting the value directly is consistent with this statement and might make the example easier to go through, especially considering that we now have another example which shows the detailed tuning process.
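A rough sketch of the kind of grid search being referred to; the pipeline, parameter grid, and n_bins range below are illustrative, not the exact code used in the PR:

```python
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# same toy data as in the example
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100).reshape(-1, 1)
y = np.sin(4 * X[:, 0]) + X[:, 0] + rnd.normal(size=100) / 3

# discretize + one-hot encode, then fit a linear model
pipe = Pipeline([
    ('discretize', KBinsDiscretizer(encode='onehot-dense')),
    ('regression', LinearRegression()),
])

# tune the number of bins under cross-validation
grid = GridSearchCV(pipe,
                    param_grid={'discretize__n_bins': np.arange(2, 21)},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```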
I don't find plt.sca under the examples folder. I have followed the matplotlib example and used a more common way.
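For reference, the more common pattern avoids plt.sca by calling methods on the Axes objects that plt.subplots returns; roughly (dummy data):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))

# object-oriented style: call plotting methods on each Axes directly
# instead of switching the "current" axes with plt.sca()
ax1.plot(x, np.sin(4 * x) + x)
ax1.set_title("before")

ax2.plot(x, np.round(np.sin(4 * x) + x))
ax2.set_title("after")

plt.show()
```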
LGTM
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually
be tuned under cv.
cv -> cross-validation
is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually
discretiser -> discretizer
Merging given the approvals from jnothman and TomDLT.
Reference Issues/PRs
Fixes #9339
What does this implement/fix? Explain your changes.
Add an example for KBinsDiscretizer
Reference: "Introduction to Machine Learning with Python" (Chapter 4, Section 2)
Any other comments?
Local result: [screenshot of the example's output plot]