[MRG] EXA Improve example plot_svm_anova.py #11731
Conversation
what if you keep the old data and just scale the features?
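A minimal sketch of what that suggestion could look like (keeping the digits data and just rescaling the features before the chi2-based selection); MinMaxScaler, the number of noise features, and the fixed 10th percentile are assumptions for illustration, not code from this PR:

```python
import numpy as np

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Keep the original digits data, but rescale the pixel features to [0, 1]
# so they live on the same scale as the uniform noise features added below.
X, y = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

# Append non-informative features drawn uniformly from [0, 1).
rng = np.random.RandomState(0)
X = np.hstack((X, rng.random_sample((X.shape[0], 200))))

# Univariate chi2 selection (requires non-negative features) followed by an SVC.
clf = Pipeline([('anova', SelectPercentile(chi2, percentile=10)),
                ('svc', SVC(gamma="auto"))])
print(cross_val_score(clf, X, y, cv=5).mean())
```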
@agramfort
ok thanks for giving it a try.
I agree now that Iris is more adapted.
Wondering if someone can review it. The original example is wrong IMO.
thanks @adrinjalali, I agree that your version is better.
examples/svm/plot_svm_anova.py (Outdated)

transform = SelectPercentile(chi2)

clf = Pipeline([('anova', transform), ('svc', SVC(gamma="auto"))])
clf = Pipeline([('anova', transform),
The transform variable is only used here, I guess we can remove it and have SelectPercentile(chi2) directly in the pipeline. The comment above the pipeline also needs to change accordingly.
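A minimal sketch of the suggested change, assuming the rest of the pipeline stays as in the diff above:

```python
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Drop the intermediate `transform` variable and build the selector
# directly inside the pipeline, as suggested in this review comment.
clf = Pipeline([('anova', SelectPercentile(chi2)),
                ('svc', SVC(gamma="auto"))])
```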
otherwise LGTM, thanks @qinhanmin2014!
ping @jnothman @agramfort
This reverts commit d8214fe.
I think the example plot_svm_anova.py is not good. It claims that


> This example shows how to perform univariate feature selection to improve the classification scores.
However, with some non-informative features added, we actually get the worst result when the number of selected features equals the number of features in the original dataset. The reason is that the features in the digits dataset are either 0 or 1, so it does not seem reasonable to add non-informative features with np.random.random. Related comment: #11588 (review)
In the new example, I use the iris dataset (4 features) and add 36 non-informative features. We can see that the model achieves its best performance when we select around 10% of the features.
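A rough sketch of the approach described above (iris plus 36 uniform noise features, with a chi2-based SelectPercentile in front of the SVC); the percentile grid, noise scaling, and random seed are assumptions for illustration, not necessarily what the final example uses:

```python
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Iris has 4 informative features; append 36 non-informative ones.
X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X = np.hstack((X, 2 * rng.random_sample((X.shape[0], 36))))

clf = Pipeline([('anova', SelectPercentile(chi2)),
                ('svc', SVC(gamma="auto"))])

# Cross-validated accuracy as a function of the percentile of features kept.
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)
score_means, score_stds = [], []
for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    scores = cross_val_score(clf, X, y, cv=5)
    score_means.append(scores.mean())
    score_stds.append(scores.std())

plt.errorbar(percentiles, score_means, np.array(score_stds))
plt.xlabel('Percentile of features selected')
plt.ylabel('Cross-validation accuracy')
plt.title('Performance of the SVM-Anova pipeline vs. percentile of features')
plt.show()
```

With only 4 of the 40 features informative, the score should peak around the 10% percentile, which is the behaviour the new example is meant to illustrate.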
Before the PR:
After the PR: