[MRG] EXA Improve example plot_svm_anova.py #11731

Merged
merged 6 commits into from
Jan 28, 2019

Conversation

qinhanmin2014
Member

@qinhanmin2014 qinhanmin2014 commented Aug 1, 2018

I think the example plot_svm_anova.py is not good. It claims that "This example shows how to perform univariate feature selection to improve the classification scores." However, after the non-informative features are added, we actually get the worst result when the number of selected features equals the number of features in the original dataset. The reason is that the features in the digits dataset are either 0 or 1, so it does not seem reasonable to add non-informative features generated with np.random.random.
Related comment #11588 (review)
In the new example, I use the iris dataset (4 features) and add 36 non-informative features. The model achieves its best performance when we select around 10% of the features.
Before the PR:
sphx_glr_plot_svm_anova_001
After the PR:
sphx_glr_plot_svm_anova_001
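
For readers without the plots, the setup described in this comment can be sketched roughly as follows. This is a minimal illustration, not necessarily the code merged in this PR: the noise distribution (uniform in [0, 2)), the random seed, the percentile grid, and the helper name mean_cv_score are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Assumption: 36 uniform noise columns in [0, 2). They are non-negative,
# so chi2 accepts them alongside the 4 original iris features.
rng = np.random.RandomState(0)
X_noisy = np.hstack((X, rng.uniform(0, 2, size=(X.shape[0], 36))))


def mean_cv_score(percentile):
    """Hypothetical helper: mean 5-fold accuracy when keeping `percentile` % of features."""
    clf = Pipeline([
        ('anova', SelectPercentile(chi2, percentile=percentile)),
        ('svc', SVC(gamma="auto")),
    ])
    return cross_val_score(clf, X_noisy, y, cv=5).mean()


# Assumption: an arbitrary grid of percentiles to compare.
percentiles = (5, 10, 25, 50, 100)
scores = {p: mean_cv_score(p) for p in percentiles}

for p in percentiles:
    print("%3d%% of features kept: mean accuracy %.3f" % (p, scores[p]))
```

Because the selector sits inside the pipeline, it is refit on the training part of every fold, so the held-out samples never influence which features are kept.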

@agramfort
Member

what if you keep the old data and just scale the features?

@qinhanmin2014
Member Author

@agramfort
I tried scaling the digits dataset with scale, but then the problem is how to do feature selection:
(1) We can't use chi2 because the scaled features can be negative.
(2) We get warnings with f_classif because some features are constant.
(3) mutual_info_classif takes more than 100 seconds, which seems unacceptable to me.
What's more, the results are not very good (see below; the model is supposed to achieve its best performance when we select around 25% (64/264) of the features).
f_classif:
2018-08-02_200228
mutual_info_classif:
2018-08-02_200236
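
Points (1) and (2) above can be reproduced with a short sketch. It uses the full digits dataset without any added noise columns, which is an assumption (the subset used in the example may differ), and it does not attempt to reproduce the mutual_info_classif timing from point (3).

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.feature_selection import chi2, f_classif
from sklearn.preprocessing import scale

X, y = load_digits(return_X_y=True)

# Centering each feature produces negative values.
X_scaled = scale(X)

# (1) chi2 only accepts non-negative input, so it rejects the scaled data.
try:
    chi2(X_scaled, y)
    chi2_error = None
except ValueError as exc:
    chi2_error = str(exc)
print("chi2 on scaled data:", chi2_error)

# (2) Some digits pixels are constant; f_classif still runs but emits warnings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    f_scores, p_values = f_classif(X_scaled, y)
warning_messages = [str(w.message) for w in caught]
for message in warning_messages:
    print("f_classif warning:", message)
```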

@agramfort
Member

agramfort commented Aug 2, 2018 via email

@qinhanmin2014 qinhanmin2014 changed the title EXA Improve example plot_svm_anova.py [MRG] EXA Improve example plot_svm_anova.py Aug 3, 2018
@qinhanmin2014
Member Author

I'm not sure whether cv=5 is appropriate for the iris dataset, since we do not have much data.
I plotted different results here for comparison:
2018-08-04_093847
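
To make the data-size concern concrete, the following sketch counts how many samples land in each fold. It assumes the stratified, unshuffled 5-fold split that scikit-learn uses for a classifier when cv is given as an integer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# cv=5 with a classifier corresponds to StratifiedKFold(n_splits=5).
cv = StratifiedKFold(n_splits=5)

fold_sizes = []
test_class_counts = []
for train_idx, test_idx in cv.split(X, y):
    fold_sizes.append((len(train_idx), len(test_idx)))
    # Number of test samples per class in this fold.
    test_class_counts.append(np.bincount(y[test_idx]).tolist())

print(fold_sizes)
print(test_class_counts)
```

Each score is therefore estimated from 30 held-out samples, 10 per class, which is why individual fold scores on iris can vary noticeably.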

@qinhanmin2014
Member Author

Wondering if someone can review it. The original example is wrong IMO.
ping @jnothman @adrinjalali maybe (apologies if the ping makes you unhappy :))

@qinhanmin2014
Member Author

thanks @adrinjalali I agree that your version is better.

transform = SelectPercentile(chi2)

clf = Pipeline([('anova', transform), ('svc', SVC(gamma="auto"))])
clf = Pipeline([('anova', transform),
Member

The transform variable is only used here, so I guess we can remove it and put SelectPercentile(chi2) directly in the pipeline. The comment above the pipeline also needs to change accordingly.
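
A minimal sketch of the suggested form, assuming only the two steps visible in the diff above (the merged pipeline may contain additional steps). Dropping the separate variable loses nothing, because the selector remains reachable through its step name.

```python
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# The selector is created inline instead of through a one-off variable.
clf = Pipeline([('anova', SelectPercentile(chi2)),
                ('svc', SVC(gamma="auto"))])

# The step can still be configured and inspected by name.
# Assumption: 25 is an arbitrary percentile chosen for illustration.
clf.set_params(anova__percentile=25)
selector = clf.named_steps['anova']
print(selector.percentile)
```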

Member

@adrinjalali adrinjalali left a comment

otherwise LGTM, thanks @qinhanmin2014 !

@qinhanmin2014
Member Author

qinhanmin2014 commented Jan 27, 2019

ping @jnothman @agramfort
maybe we can hurry this into 0.20.3, since it's not good to include a wrong example
output from Circle:
sphx_glr_plot_svm_anova_001

@jnothman jnothman merged commit 1deb95a into scikit-learn:master Jan 28, 2019
@qinhanmin2014 qinhanmin2014 deleted the svm-anova-example branch January 28, 2019 02:16
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 30, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 6, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019
@qinhanmin2014 qinhanmin2014 mentioned this pull request Feb 19, 2019
17 tasks
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
4 participants