discrete branch: add a compelling example of discretization's benefits #9339
Discretisation is especially helpful in the case of:
- High noise in the original continuous data distribution, since it benefits from trivial averaging inside each new category. It is a prime example of the trade-off of precision for accuracy.
- e.g. a linear trend with high noise is split into 5 monotonically increasing categories.
It is especially helpful in the case of high noise since there is a greater benefit from the increased accuracy than there is loss of information due to the reduced precision.
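A quick sketch (not from the original comment; the numbers are invented for illustration) of the noisy-linear-trend case described above: bin x into 5 categories, average y within each bin, and compare the error against the true trend.

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = x + rng.normal(scale=3.0, size=x.size)   # linear trend buried in high noise

# 5 monotonically increasing categories; average y inside each one
bins = np.digitize(x, np.linspace(0, 10, 6)[1:-1])
y_binned = np.array([y[bins == b].mean() for b in range(5)])[bins]

print("MSE of raw noisy y vs. the true trend      :", np.mean((y - x) ** 2))
print("MSE of per-bin averaged y vs. the true trend:", np.mean((y_binned - x) ** 2))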
But we would like to show this with data and plots, ideally, as in our example gallery.
@jnothman (Sorry for the repeated updates.)
DecisionTree score before discretization : 0.946666666667
DecisionTree score after discretization : 0.96
SVC score before discretization : 0.96
SVC score after discretization : 0.966666666667
Hmm, I'm a bit surprised by that result. Is it a matter of luck about where the bin edges are? If we shift the bin edges a little, is it still as good? (Of course we have no way to do this within the current discretiser API, but a range param like that of np.histogram would be sufficient.) If we take samples of the data, is it still as good? (Is this result averaged across folds?)
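Not part of the thread, but one rough way to probe the bin-edge question without a range-style parameter in the discretiser is to build the uniform edges by hand and shift them by a fraction of a bin width before digitizing. This sketch assumes the same two iris features as the snippet further down:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, [2, 3]]
n_bins = 10
width = (X.max(axis=0) - X.min(axis=0)) / n_bins

for frac in [0.0, 0.25, 0.5]:
    # uniform inner edges per feature, shifted right by a fraction of one bin width
    Xt = np.column_stack([
        np.digitize(X[:, j],
                    np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)[1:-1]
                    + frac * width[j])
        for j in range(X.shape[1])
    ])
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), Xt, y, cv=5)
    print("edge shift = {:.2f} bins: {:.3f} +/- {:.3f}".format(
        frac, scores.mean(), scores.std()))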
@jnothman Sorry, but are you looking at the latest version? I have just updated my result because I suddenly realized my mistake.
I wasn't. I still don't understand why the DTC should struggle to find good splits in the continuous space. Again, is this averaged over folds? What's the standard deviation in either case? Please plot with alpha=.3 to give a rudimentary sense of density. I don't think this exemplifies a typical use of discretizing.
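A minimal sketch of the kind of plot being asked for (alpha=0.3 to give a rough sense of density); the two petal features match the snippet below:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
plt.scatter(X[:, 2], X[:, 3], c=y, alpha=0.3)   # overlapping points appear darker
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.show()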
@jnothman
Forgive me if the example is not good, since I'm not an expert at machine learning :)
The score is averaged over folds.

DecisionTree score before discretization : 0.946666666667
DecisionTree score std before discretization : 0.04
DecisionTree score after discretization : 0.96
DecisionTree score std after discretization : 0.0326598632371
SVC score before discretization : 0.96
SVC score std before discretization : 0.0249443825785
SVC score after discretization : 0.966666666667
SVC score std after discretization : 0.0249443825785

Since our discretization is naive, we cannot expect a big improvement.
The experiment is designed mainly based on this paper <http://www.math.unipd.it/%7Edulli/corso04/disc.pdf> (citation count > 2000) and other materials.
Here is part of the main code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X = X[:, [2, 3]]
Xt = KBinsDiscretizer(n_bins=10, encode='ordinal').fit_transform(X)

clf1 = DecisionTreeClassifier(random_state=0)
print("DecisionTree score before discretization : {}"
      .format(np.mean(cross_val_score(clf1, X, y, cv=5))))
print("DecisionTree score std before discretization : {}"
      .format(np.std(cross_val_score(clf1, X, y, cv=5))))

clf2 = DecisionTreeClassifier(random_state=0)
print("DecisionTree score after discretization : {}"
      .format(np.mean(cross_val_score(clf2, Xt, y, cv=5))))
print("DecisionTree score std after discretization : {}"
      .format(np.std(cross_val_score(clf2, Xt, y, cv=5))))
Okay. Seeing the std is a little more persuasive, although the improvements
are still well within the margin of error.
If someone in the community has some ideas or materials, I'm willing to have a try.
Ask the mailing list?
@hlin117 Sorry to disturb you. Could you please spare some time to share your opinion on an example to illustrate the application of discretization? I have given my attempt above, but it does not seem convincing. Thanks a lot :)
It's good, but not great... If we can get something more compelling...
Hi @qinhanmin2014!
I'm not familiar with any datasets which benefit strongly from discretization, but I feel that you can produce a dataset which would. I think discretization would be useful for datasets whose features shouldn't be represented as continuous features. For example, consider a feature whose data follows a bimodal distribution. If all of the features in this dataset are like this, then some classifiers (like logistic regression, for example), which assume "stronger feature value -> stronger output value", would not perform well. So, to summarize, you can generate a dataset which would perform well under discretization by ensuring each feature is multimodal.
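A hedged sketch (not from the thread; the dataset construction and parameters are made up) of the suggestion above: a single bimodal feature where class membership is not a monotonic function of the feature value, so plain logistic regression struggles but does much better after one-hot binning with the KBinsDiscretizer used earlier in the thread.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
n = 1000
y = rng.randint(0, 2, n)
# class 1 sits in two outer modes, class 0 in a central mode,
# so "stronger feature value -> stronger output value" does not hold
X = np.where(y[:, None] == 1,
             rng.choice([-3.0, 3.0], size=(n, 1)) + rng.randn(n, 1),
             rng.randn(n, 1))

clf = LogisticRegression()
print("raw feature          :", cross_val_score(clf, X, y, cv=5).mean())

Xt = KBinsDiscretizer(n_bins=8, encode='onehot-dense').fit_transform(X)
print("discretized (one-hot):", cross_val_score(clf, Xt, y, cv=5).mean())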
One class of datasets which would benefit from discretization is biological datasets. Here's a quote from "Discretization of continuous features in clinical datasets":
@hlin117 Thanks a lot for the detailed reply :)
The true "reality" is that there is no single technique which will work on all datasets :)
Academic papers out there will not create datasets for themselves (usually). This is because when researchers need to compare one model after another, they need to use well studied datasets, and show that on that particular dataset, their model can perform better than another model. So it's quite a fallacy to claim in a paper that your model is performant by using a dataset which no one else has used.
Actually, I think using a generated dataset might be a great way to move forward with this.
If you look at the user docs for the Kernel PCA, they actually use a generated dataset instead of a well studied dataset for the purpose of demonstrating the concept of a Kernel PCA. |
The reason why people are interested in Iris and other datasets is because these are well studied datasets which most classifiers can perform very well in. With data being so plentiful now in 2017, it's difficult to declare a dataset as "canonical". |
@hlin117 Thanks a lot :) I agree that using a generated dataset might provide a better example. I'll first wait for the community's reply on my current example using the iris dataset, and at the same time try to think about generating a dataset. Feel free to ping me if you have further ideas.
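One possible generated-data direction (purely a sketch; the target function and parameters are invented for illustration): a linear model cannot fit a nonlinear target from the raw feature, but after one-hot binning it can fit a piecewise-constant approximation.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X).ravel() + rng.normal(scale=0.3, size=300)  # nonlinear target + noise

lr = LinearRegression()
print("R^2, raw feature        :", cross_val_score(lr, X, y, cv=5).mean())

# one-hot bins let the linear model learn one level per bin (piecewise constant)
Xt = KBinsDiscretizer(n_bins=10, encode='onehot-dense').fit_transform(X)
print("R^2, discretized feature:", cross_val_score(lr, Xt, y, cv=5).mean())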
We recently merged a discretizing transformer into the discrete branch (see the diff between that branch and master). Before merging it into master, we'd like a compelling example for our example gallery showing an application of machine learning where discretized features are particularly useful.

Dear contributor: make sure to submit a pull request to the discrete branch.