discrete branch: add a compelling example of discretization's benefits #9339


Closed
jnothman opened this issue Jul 12, 2017 · 20 comments
Labels
Documentation, help wanted, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@jnothman
Member

We recently merged a discretizing transformer into the discrete branch (see diff between that branch and master). Before merging it into master, we'd like a compelling example for our example gallery showing an application of machine learning where discretized features are particularly useful.

To dear contributor: Make sure to submit a pull request to the discrete branch.

@joshring

Discretisation is especially helpful in the case of:

  • High noise in the original continuous distribution: binning allows trivial averaging within each new category, a prime example of trading precision for accuracy.
  • e.g. a linear trend with high noise is split into 5 monotonically increasing categories.

In other words, it helps most under high noise because the gain in accuracy from averaging outweighs the loss of information from the reduced precision.
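A minimal sketch of this trade-off (my own synthetic data, illustrative only): a noisy linear trend is cut into 5 equal-width bins, and averaging y inside each bin recovers the monotonic trend.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0.0, 1.0, 500)
y = x + rng.normal(scale=0.3, size=x.shape)  # linear trend buried in noise

# Split x into 5 equal-width bins (what KBinsDiscretizer with
# strategy='uniform' would produce for a single feature).
edges = np.linspace(0.0, 1.0, 6)[1:-1]
bins = np.digitize(x, edges)  # ordinal codes 0..4

# Averaging y inside each bin trades precision for accuracy:
# the 5 bin means recover the underlying monotonic trend.
bin_means = [y[bins == b].mean() for b in range(5)]
print(bin_means)
```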

@jnothman
Member Author

jnothman commented Aug 10, 2017 via email

@qinhanmin2014
Member

qinhanmin2014 commented Sep 6, 2017

@jnothman (Sorry for the repeated updates)
Here is my plan for the example, please have a look. Thanks :)
Dataset: iris (only use two features)
(1) plot the data before and after discretization
(2) train a classifier on the data before and after discretization and compare the results

DecisionTree score before discretization : 0.946666666667
DecisionTree score after discretization : 0.96
SVC score before discretization : 0.96
SVC score after discretization : 0.966666666667

@jnothman
Member Author

jnothman commented Sep 6, 2017 via email

@qinhanmin2014
Member

@jnothman Sorry, but are you looking at the latest version? I just updated my results because I suddenly realized my mistake.

@jnothman
Member Author

jnothman commented Sep 6, 2017

I wasn't. I still don't understand why the DTC should struggle to find good splits in the continuous space.

Again, is this averaged over folds? What's the standard deviation in either case?

Please plot with alpha=.3 to give a rudimentary sense of density.

I don't think this exemplifies a typical use of discretizing.

@qinhanmin2014
Member

@jnothman
Forgive me if the example is not good, since I'm not an expert at machine learning :)
The score is averaged over folds.

DecisionTree score before discretization : 0.946666666667
DecisionTree score std before discretization : 0.04
DecisionTree score after discretization : 0.96
DecisionTree score std after discretization : 0.0326598632371
SVC score before discretization : 0.96
SVC score std before discretization : 0.0249443825785
SVC score after discretization : 0.966666666667
SVC score std after discretization : 0.0249443825785

Since our discretization is naive, we cannot expect a big improvement.
The experiment is designed mainly based on this paper (citations > 2000) and other materials.
Here is part of the main code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]  # use only petal length and petal width
y = iris.target
Xt = KBinsDiscretizer(n_bins=10, encode='ordinal').fit_transform(X)
clf1 = DecisionTreeClassifier(random_state=0)
print("DecisionTree score before discretization : {}"
      .format(np.mean(cross_val_score(clf1, X, y, cv=5))))
print("DecisionTree score std before discretization : {}"
      .format(np.std(cross_val_score(clf1, X, y, cv=5))))
clf2 = DecisionTreeClassifier(random_state=0)
print("DecisionTree score after discretization : {}"
      .format(np.mean(cross_val_score(clf2, Xt, y, cv=5))))
print("DecisionTree score std after discretization : {}"
      .format(np.std(cross_val_score(clf2, Xt, y, cv=5))))

@jnothman
Member Author

jnothman commented Sep 6, 2017 via email

@qinhanmin2014
Member

If someone in the community has some ideas or materials, I'm willing to give it a try.

@jnothman
Member Author

jnothman commented Sep 6, 2017 via email

@qinhanmin2014
Member

@hlin117 Sorry to disturb you. Could you spare some time to share your opinion on an example illustrating the application of discretization? I gave my thoughts above, but they don't seem good enough. Thanks a lot :)

@jnothman
Member Author

jnothman commented Sep 7, 2017 via email

@qinhanmin2014
Member

@jnothman I opened #9713 for review. Feel free to close it if it is too naive :)

@hlin117
Contributor

hlin117 commented Sep 9, 2017

Hi @qinhanmin2014!

@hlin117 Sorry to disturb you. Could you spare some time to share your opinion on an example illustrating the application of discretization? I gave my thoughts above, but they don't seem good enough. Thanks a lot :)

I'm not familiar with any datasets which benefit strongly from discretization, but I feel that you can produce a dataset which would.

I think discretization would be useful for datasets whose features shouldn't be represented as continuous values. For example, consider a feature whose data follows a bimodal distribution. If all of the features in a dataset are like this, then some classifiers (logistic regression, for example) which assume "stronger feature value -> stronger output value" would not perform well.

So, to summarize, you can generate a dataset which would perform well under discretization by ensuring each feature is multimodal.
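A hedged sketch of this idea (synthetic data of my own; the `KBinsDiscretizer` call assumes the transformer on the discrete branch): a target that depends non-monotonically on a single feature defeats plain logistic regression, while binning plus one-hot encoding gives each bin its own weight.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.uniform(-3.0, 3.0, size=(1000, 1))
y = (np.abs(X[:, 0]) < 1.0).astype(int)  # positive only in the middle band

# A single linear weight on x cannot express "high in the middle, low at
# both ends", so plain logistic regression is stuck near the majority rate.
raw_score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# After binning + one-hot encoding, each bin gets its own weight, and the
# non-monotonic shape becomes linearly representable.
Xt = KBinsDiscretizer(n_bins=6, encode='onehot-dense',
                      strategy='uniform').fit_transform(X)
binned_score = cross_val_score(LogisticRegression(), Xt, y, cv=5).mean()

print(raw_score, binned_score)  # roughly majority-class rate vs. near-perfect
```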

@hlin117
Contributor

hlin117 commented Sep 9, 2017

One class of datasets which would benefit from discretization are biological datasets. Here's a quote from "Discretization of continuous features in clinical datasets":

Our results confirm the findings of previous studies, which show that discretization in general improves the accuracy of naïve Bayes classifiers. This is thought to be due to the ability of discretization to approximate the distribution of the continuous attribute, which otherwise would be assumed to be Gaussian. We might therefore expect the greatest gains to occur for datasets in which the attributes are not normally distributed. In such cases, the assumption of normality within the continuous data would lead to a lower accuracy overall, which should be somewhat overcome by the discretization process.
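A synthetic illustration of that point (my own construction, not from the cited paper): each class-conditional distribution is bimodal, so GaussianNB's normality assumption fails badly, while discretizing and fitting a categorical naive Bayes approximates the true multimodal densities.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
n = 500  # samples per mode

# Interleaved bimodal class-conditionals: class 0 peaks at -2 and 2,
# class 1 peaks at 0 and 4, so neither class looks remotely Gaussian.
x0 = np.concatenate([rng.normal(-2, 0.5, n), rng.normal(2, 0.5, n)])
x1 = np.concatenate([rng.normal(0, 0.5, n), rng.normal(4, 0.5, n)])
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.array([0] * (2 * n) + [1] * (2 * n))

# Shuffle so cross-validation folds mix all four modes.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

# GaussianNB models each class as a single Gaussian (means ~0 vs. ~2,
# similar large variances), which is close to useless here.
gauss_score = cross_val_score(GaussianNB(), X, y, cv=5).mean()

# Binning approximates the multimodal densities by per-bin frequencies.
Xt = KBinsDiscretizer(n_bins=10, encode='ordinal',
                      strategy='uniform').fit_transform(X).astype(int)
binned_score = cross_val_score(CategoricalNB(min_categories=10),
                               Xt, y, cv=5).mean()

print(gauss_score, binned_score)
```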

@qinhanmin2014
Member

@hlin117 Thanks a lot for the detailed reply :)
If we use an existing dataset, then iris seems the best choice in scikit-learn, since it has been used in some references.
Indeed, we can try to create a dataset and get better results, but I'm not sure whether that would reflect reality, and I can't find references which manually create their own test data. Do you have some? (e.g., related to the bimodal distribution you mention)

@hlin117
Contributor

hlin117 commented Sep 11, 2017

.. Indeed, we can try to create a dataset and get better results, but I'm not sure whether that would reflect reality ...

The true "reality" is that there is no single technique which will work on all datasets :)

... and I can't find references which manually create their own test data. Do you have some? (e.g., related to the bimodal distribution you mention)

Academic papers out there will not (usually) create datasets for themselves. This is because when researchers compare one model against another, they need to use well-studied datasets and show that, on a particular dataset, their model performs better than another model. So it's quite a fallacy to claim in a paper that your model is performant on a dataset which no one else has used.

If we use an existing dataset, then iris seems the best choice in scikit-learn, since it has been used in some references.

Actually, I think using a generated dataset might be a great way to move forward with this.

  1. By creating a dataset whose features are generated from more than one distribution, you demonstrate that discretization has niche yet powerful use cases.
  2. You want your example to be as brief as possible. If you import an external dataset with the properties described above, your PR is going to be very lengthy, with much of the code being data.

If you look at the user docs for Kernel PCA, they actually use a generated dataset instead of a well-studied one to demonstrate the concept of Kernel PCA.

@hlin117
Contributor

hlin117 commented Sep 11, 2017

The reason why people are interested in Iris and other datasets is that these are well-studied datasets on which most classifiers perform very well. With data so plentiful now in 2017, it's difficult to declare any dataset "canonical".

@qinhanmin2014
Member

@hlin117 Thanks a lot :) I agree that using a generated dataset might provide a better example. I'll first wait for replies from the community on my current example using the iris dataset, and in the meantime think about generating a dataset. Feel free to ping me if you have further ideas.

@jnothman
Member Author

Fixed in #10192, #10195
