Skip to content

DOC rework plot_document_classification_20newsgroups.py example #22928

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
May 23, 2022

Conversation

ArturoAmorQ
Copy link
Member

@ArturoAmorQ ArturoAmorQ commented Mar 23, 2022

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The 20newsgroups classification example is our only document classification
example and it can/should be more didactic. Therefore, this PR reworks it by:

  • adding a loading function to easily remove text metadata when fetching dataset
  • adding a demo on a particular classifier before benchmarking the rest of them:
    • compare feature effects
    • show pollution in the dataset introduced by the metadata / motivate preprocessing dataset
    • use ConfusionMatrixDisplay to visualize results
  • adding text to provide interpretation of intermediate steps
  • removing similarly performing classifiers to avoid distractions and overlap in plots
  • removing the feature selection tools (they appear to have little to none effect)
  • making the benchmark function accept customized names

Any other comments?

The demo uses RidgeClassifier as it has a relatively low training time, but any classifier with a coef_ attribute may work.

This example uses TfidfVectorizer only. A comparison of its performance with respect to HashingVectorizer will be craft in another example.

@jeremiedbb
Copy link
Member

I triggered the CI again because the artifacts were not generated

@jjerphan jjerphan self-requested a review March 29, 2022 13:56
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, here is a first pass of feedback.

I also think you should append your name in the list of authors of this file in the header comment.

ArturoAmorQ and others added 3 commits April 4, 2022 16:41
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice refactoring. Here is a batch of feedback!

ArturoAmorQ and others added 3 commits May 10, 2022 10:45
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments.

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@ArturoAmorQ
Copy link
Member Author

Thanks for your time and comments @ogrisel and @jjerphan !

@jjerphan
Copy link
Member

Thank you for your contribution, @ArturoAmorQ. 🙂

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a final review and here are a few more nitpicks but otherwise LGTM.

This example is much more interesting to read that it used to be :) Thank you very much for this contribution @ArturoAmorQ !

@ogrisel ogrisel merged commit 7102832 into scikit-learn:main May 23, 2022
@ArturoAmorQ ArturoAmorQ deleted the doc_classification branch June 9, 2022 13:30
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
…it-learn#22928)



Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
glemaitre pushed a commit that referenced this pull request Aug 5, 2022

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants