add dataset parameter to fetch_20newsgroups() #4035

Closed
wants to merge 1 commit into from

walid-shalaby

Because papers in the literature use different versions of the 20 Newsgroups (20ng) dataset, this parameter lets users choose which version of the dataset to download:

  1. 'bydate' --> download the sorted by date version (18846 documents)
    @ http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
  2. 'original' --> download the original version (19997 documents)
    @ http://people.csail.mit.edu/jrennie/20Newsgroups/20news-19997.tar.gz
  3. 'preprocessed' --> download a preprocessed version where duplicates are removed and only "From" and "Subject" headers exist (18828 documents)
    @ http://people.csail.mit.edu/jrennie/20Newsgroups/20news-18828.tar.gz
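
The mapping described above could be sketched roughly as follows. This is an illustration only, not the code from this pull request: the names `DATASET_URLS`, `DATASET_SIZES`, and `resolve_20ng_url` are hypothetical, and defaulting to `'bydate'` is an assumption. The URLs and document counts are the ones listed in the description.

```python
# Hypothetical sketch; NOT the implementation proposed in this pull request.
# URLs are copied from the description above.
DATASET_URLS = {
    "bydate": "http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz",
    "original": "http://people.csail.mit.edu/jrennie/20Newsgroups/20news-19997.tar.gz",
    "preprocessed": "http://people.csail.mit.edu/jrennie/20Newsgroups/20news-18828.tar.gz",
}

# Document counts as stated in the description.
DATASET_SIZES = {"bydate": 18846, "original": 19997, "preprocessed": 18828}


def resolve_20ng_url(dataset="bydate"):
    """Return the archive URL for the requested 20ng version.

    The default value is an assumption made for this sketch.
    """
    try:
        return DATASET_URLS[dataset]
    except KeyError:
        # Reject unknown names early, before any download is attempted.
        raise ValueError(
            "dataset must be one of %s, got %r" % (sorted(DATASET_URLS), dataset)
        ) from None
```

Validating the name before downloading keeps a typo such as `'by-date'` from turning into a confusing network error.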

…hich version of the 20ng to download ('bydate' or 'original', 'preprocessed')
@agramfort
Member

Can you run the pep8 checker on your file?

Also, please check why Travis is failing.

@jnothman
Member

jnothman commented Jan 7, 2015

Please avoid using tabs for indentation. This seems to be the cause of the test failure in Py3k.
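
As a rough illustration of why tabs can matter under Python 3: a block that mixes tab and space indentation in a way that depends on tab width is rejected with `TabError`. The sketch below is not taken from this pull request; the helper names `tab_indented_lines` and `compiles_on_python3` and the sample source `MIXED` are hypothetical.

```python
def tab_indented_lines(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            hits.append(number)
    return hits


# Line 2 is indented with one tab, line 3 with eight spaces.
MIXED = "if True:\n\tx = 1\n        y = 2\n"


def compiles_on_python3(source):
    """Return False if Python 3 rejects the source with TabError."""
    try:
        compile(source, "<example>", "exec")
    except TabError:
        return False
    return True
```

Running `tab_indented_lines` over a changed file is a quick way to locate the offending lines before re-running the test suite.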

@rth
Member

rth commented Sep 26, 2018

I'm going to close this: with the addition of the OpenML fetcher (#11419), adding new datasets should consist of uploading them to OpenML. In this particular case, the sparse ARFF parser is pretty slow, but this could be an occasion to improve it. In any case, adding new functionality to existing dataset fetchers will be more difficult to maintain.

Thank you for contributing!

@rth rth closed this Sep 26, 2018
4 participants