discrete branch: add a compelling example of discretization's benefits #9339


Closed
jnothman opened this issue Jul 12, 2017 · 20 comments
Labels
Documentation, help wanted, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@jnothman
Member

We recently merged a discretizing transformer into the discrete branch (see diff between that branch and master). Before merging it into master, we'd like a compelling example for our example gallery showing an application of machine learning where discretized features are particularly useful.

To dear contributor: Make sure to submit a pull request to the discrete branch.

@joshring

Discretisation is especially helpful in the case of:

  • High noise in the original continuous distribution: binning allows trivial averaging within each new category, a prime example of trading precision for accuracy.
  • e.g. a linear trend with high noise is split into 5 monotonically increasing categories.

In other words, it helps most under high noise because the gain in accuracy from averaging outweighs the loss of information from the reduced precision.
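A minimal sketch of this trade-off (my own synthetic data, illustrative only): a noisy linear trend is cut into 5 equal-width bins, and averaging y inside each bin recovers the monotonic trend.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0.0, 1.0, 500)
y = x + rng.normal(scale=0.3, size=x.shape)  # linear trend buried in noise

# Split x into 5 equal-width bins (what KBinsDiscretizer with
# strategy='uniform' would produce for a single feature).
edges = np.linspace(0.0, 1.0, 6)[1:-1]
bins = np.digitize(x, edges)  # ordinal codes 0..4

# Averaging y inside each bin trades precision for accuracy:
# the 5 bin means recover the underlying monotonic trend.
bin_means = [y[bins == b].mean() for b in range(5)]
print(bin_means)
```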

@jnothman
Member Author

jnothman commented Aug 10, 2017 via email

@qinhanmin2014
Member

qinhanmin2014 commented Sep 6, 2017

@jnothman (Sorry for the repeated updates)
Here is my plan for the example, please have a look. Thanks :)
Dataset: iris (only use two features)
(1) plot the data before and after discretization
(2) train a classifier on the data before and after discretization and compare the results

DecisionTree score before discretization : 0.946666666667
DecisionTree score after discretization : 0.96
SVC score before discretization : 0.96
SVC score after discretization : 0.966666666667

@jnothman
Member Author

jnothman commented Sep 6, 2017 via email

@qinhanmin2014
Member

@jnothman Sorry, but are you looking at the latest version? I just updated my results because I suddenly realized my mistake.

@jnothman
Member Author

jnothman commented Sep 6, 2017

I wasn't. I still don't understand why the DTC should struggle to find good splits in the continuous space.

Again, is this averaged over folds? What's the standard deviation in either case?

Please plot with alpha=.3 to give a rudimentary sense of density.

I don't think this exemplifies a typical use of discretizing.

@qinhanmin2014
Member

@jnothman
Forgive me if the example is not good, since I'm not an expert at machine learning :)
The score is averaged over folds.

DecisionTree score before discretization : 0.946666666667
DecisionTree score std before discretization : 0.04
DecisionTree score after discretization : 0.96
DecisionTree score std after discretization : 0.0326598632371
SVC score before discretization : 0.96
SVC score std before discretization : 0.0249443825785
SVC score after discretization : 0.966666666667
SVC score std after discretization : 0.0249443825785

Since our discretization is naive, we cannot expect a big improvement.
The experiment is designed mainly based on this paper (citations > 2000) and other materials.
Here is part of the main code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]  # use only petal length and petal width
y = iris.target
Xt = KBinsDiscretizer(n_bins=10, encode='ordinal').fit_transform(X)
clf1 = DecisionTreeClassifier(random_state=0)
print("DecisionTree score before discretization : {}"
      .format(np.mean(cross_val_score(clf1, X, y, cv=5))))
print("DecisionTree score std before discretization : {}"
      .format(np.std(cross_val_score(clf1, X, y, cv=5))))
clf2 = DecisionTreeClassifier(random_state=0)
print("DecisionTree score after discretization : {}"
      .format(np.mean(cross_val_score(clf2, Xt, y, cv=5))))
print("DecisionTree score std after discretization : {}"
      .format(np.std(cross_val_score(clf2, Xt, y, cv=5))))

@jnothman
Member Author

jnothman commented Sep 6, 2017 via email

@qinhanmin2014
Member

If someone in the community has some ideas or materials, I'm willing to give it a try.

@jnothman
Member Author

jnothman commented Sep 6, 2017 via email

@qinhanmin2014
Member

@hlin117 Sorry to disturb you. Could you spare some time to share your opinion on an example illustrating the application of discretization? I gave my thoughts above, but they don't seem good enough. Thanks a lot :)

@jnothman
Member Author

jnothman commented Sep 7, 2017 via email

@qinhanmin2014
Member

@jnothman I opened #9713 for review. Feel free to close it if it is too naive :)

@hlin117
Contributor

hlin117 commented Sep 9, 2017

Hi @qinhanmin2014!

@hlin117 Sorry to disturb you. Could you spare some time to share your opinion on an example illustrating the application of discretization? I gave my thoughts above, but they don't seem good enough. Thanks a lot :)

I'm not familiar with any datasets which benefit strongly from discretization, but I feel that you can produce a dataset which would.

I think discretization would be useful for datasets whose features shouldn't be represented as continuous values. For example, consider a feature whose data follows a bimodal distribution. If all of the features in a dataset are like this, then some classifiers (logistic regression, for example) which assume "stronger feature value -> stronger output value" would not perform well.

So, to summarize, you can generate a dataset which would perform well under discretization by ensuring each feature is multimodal.
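A hedged sketch of this idea (synthetic data of my own; the `KBinsDiscretizer` call assumes the transformer on the discrete branch): a target that depends non-monotonically on a single feature defeats plain logistic regression, while binning plus one-hot encoding gives each bin its own weight.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.uniform(-3.0, 3.0, size=(1000, 1))
y = (np.abs(X[:, 0]) < 1.0).astype(int)  # positive only in the middle band

# A single linear weight on x cannot express "high in the middle, low at
# both ends", so plain logistic regression is stuck near the majority rate.
raw_score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# After binning + one-hot encoding, each bin gets its own weight, and the
# non-monotonic shape becomes linearly representable.
Xt = KBinsDiscretizer(n_bins=6, encode='onehot-dense',
                      strategy='uniform').fit_transform(X)
binned_score = cross_val_score(LogisticRegression(), Xt, y, cv=5).mean()

print(raw_score, binned_score)  # roughly majority-class rate vs. near-perfect
```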

@hlin117
Contributor

hlin117 commented Sep 9, 2017

One class of datasets which would benefit from discretization are biological datasets. Here's a quote from "Discretization of continuous features in clinical datasets":

Our results confirm the findings of previous studies, which show that discretization in general improves the accuracy of naïve Bayes classifiers. This is thought to be due to the ability of discretization to approximate the distribution of the continuous attribute, which otherwise would be assumed to be Gaussian. We might therefore expect the greatest gains to occur for datasets in which the attributes are not normally distributed. In such cases, the assumption of normality within the continuous data would lead to a lower accuracy overall, which should be somewhat overcome by the discretization process.
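A synthetic illustration of that point (my own construction, not from the cited paper): each class-conditional distribution is bimodal, so GaussianNB's normality assumption fails badly, while discretizing and fitting a categorical naive Bayes approximates the true multimodal densities.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
n = 500  # samples per mode

# Interleaved bimodal class-conditionals: class 0 peaks at -2 and 2,
# class 1 peaks at 0 and 4, so neither class looks remotely Gaussian.
x0 = np.concatenate([rng.normal(-2, 0.5, n), rng.normal(2, 0.5, n)])
x1 = np.concatenate([rng.normal(0, 0.5, n), rng.normal(4, 0.5, n)])
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.array([0] * (2 * n) + [1] * (2 * n))

# Shuffle so cross-validation folds mix all four modes.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

# GaussianNB models each class as a single Gaussian (means ~0 vs. ~2,
# similar large variances), which is close to useless here.
gauss_score = cross_val_score(GaussianNB(), X, y, cv=5).mean()

# Binning approximates the multimodal densities by per-bin frequencies.
Xt = KBinsDiscretizer(n_bins=10, encode='ordinal',
                      strategy='uniform').fit_transform(X).astype(int)
binned_score = cross_val_score(CategoricalNB(min_categories=10),
                               Xt, y, cv=5).mean()

print(gauss_score, binned_score)
```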

@qinhanmin2014
Member

@hlin117 Thanks a lot for the detailed reply :)
If we use an existing dataset, then iris seems the best choice in scikit-learn, since it has been used in some references.
Indeed, we can try to create a dataset and get better results, but I'm not sure whether that would reflect reality, and I can't find references which manually create their own test data. Do you have some? (e.g., related to the bimodal distribution you mention)

@hlin117
Contributor

hlin117 commented Sep 11, 2017

.. Indeed, we can try to create a dataset and get better results, but I'm not sure whether that would reflect reality ...

The true "reality" is that there is no single technique which will work on all datasets :)

... and I can't find references which manually create their own test data. Do you have some? (e.g., related to the bimodal distribution you mention)

Academic papers out there will not (usually) create datasets for themselves. This is because when researchers compare one model against another, they need to use well-studied datasets and show that, on a particular dataset, their model performs better than another model. So it's quite a fallacy to claim in a paper that your model is performant on a dataset which no one else has used.

If we use an existing dataset, then iris seems the best choice in scikit-learn, since it has been used in some references.

Actually, I think using a generated dataset might be a great way to move forward with this.

  1. By creating a dataset whose features are generated from more than one distribution, you demonstrate that discretization has niche yet powerful use cases.
  2. You want your example to be as brief as possible. If you import an external dataset with the properties described above, your PR is going to be very lengthy, with much of the code being data.

If you look at the user docs for Kernel PCA, they actually use a generated dataset instead of a well-studied one to demonstrate the concept of Kernel PCA.

@hlin117
Contributor

hlin117 commented Sep 11, 2017

The reason why people are interested in Iris and other datasets is that these are well-studied datasets on which most classifiers perform very well. With data so plentiful now in 2017, it's difficult to declare any dataset "canonical".

@qinhanmin2014
Member

@hlin117 Thanks a lot :) I agree that using a generated dataset might provide a better example. I'll first wait for replies from the community on my current example using the iris dataset, and in the meantime think about generating a dataset. Feel free to ping me if you have further ideas.

@jnothman
Member Author

Fixed in #10192, #10195
