[MRG] Openml data loader #11419

Merged 153 commits on Aug 15, 2018

Commits (153)
268f533
start on openml dataset loader
amueller Oct 10, 2017
4f3e93e
working on stuff
amueller Oct 11, 2017
fabaa90
first version working
amueller Oct 12, 2017
1804b14
docstrings, use version="active" as default
amueller Oct 12, 2017
ffd4335
add caching to openml loader
amueller Oct 12, 2017
fe0904b
pep8 annoyance
amueller Oct 12, 2017
eea026e
fix download url, allow datasets without target
amueller Oct 12, 2017
bca12e9
allow specifying the target column, starting on docs
amueller Oct 12, 2017
f59ce8b
add openml to the narrative docs
amueller Oct 12, 2017
d7dee6d
get more people to upload stuff to openml.
amueller Oct 12, 2017
4c19ad9
store metadata, convert to dtype object if there is nominal data.
amueller Oct 12, 2017
16b7fed
fix doctests, add fetch_openml to __init__
amueller Oct 12, 2017
b3f6c36
make arff reading work in python2.7
amueller Oct 17, 2017
dc401f2
ignore doctests for now because of unicode issues
amueller Oct 24, 2017
d8cfd37
add version filter.
amueller Oct 24, 2017
6f6bb57
some typos, addressing joel's comments, working on better errors
amueller Nov 14, 2017
b5c72d9
nicer error message on non-existing ID
amueller Nov 15, 2017
64483f8
minor improvements to data wrangling
amueller Nov 15, 2017
26aaff2
allow downloading inactive datasets if specified by name and version
amueller Nov 15, 2017
b3b9276
update mice version 4 dataset id
amueller Nov 15, 2017
b2c283a
Merge branch 'master' of github.com:scikit-learn/scikit-learn into op…
amueller Nov 15, 2017
7e91c71
add whatsnew entry
amueller Nov 15, 2017
11909d5
add unicode and normalize whitespace flags to pytest config
amueller Nov 15, 2017
126f406
Merge branch 'master' into openml_loader
amueller Nov 22, 2017
7e16203
add test for fetch_openml
amueller Nov 22, 2017
8dcb26b
test error messages
amueller Nov 22, 2017
0d562b6
fix command for make test-coverage
amueller Nov 22, 2017
c2266c5
Merge branch 'fix_test_coverage' into openml_loader
amueller Nov 22, 2017
e274ad3
make flake8 green
amueller Nov 22, 2017
eb39a01
py35 compatiility
amueller Nov 22, 2017
c56b549
Merge branch 'master' into openml_loader
amueller Dec 21, 2017
67825e8
trying to use CSV interface
amueller Dec 21, 2017
e4ab363
Merge branch 'openml_loader' of https://github.com/amueller/scikit-le…
janvanrijn Jul 2, 2018
7894622
Merge branch 'amueller-openml_loader'
janvanrijn Jul 2, 2018
f6e6c8e
packed liac-arff and added appropriate utility functions
janvanrijn Jul 2, 2018
64724d1
extended unit tests
janvanrijn Jul 3, 2018
4ac219c
added comment
janvanrijn Jul 3, 2018
73ff417
added sparse arff
janvanrijn Jul 6, 2018
a18dcff
added mock files
janvanrijn Jul 6, 2018
f8ad349
mocked data description and data features
janvanrijn Jul 6, 2018
720d4f6
changed mocking structure to not depend on unittest.mock (for Python …
janvanrijn Jul 9, 2018
8a4732b
removed panda dependency
janvanrijn Jul 12, 2018
593ed17
small updated incorporating several of the comments on the pr by @rth…
janvanrijn Jul 14, 2018
60ee9a2
improved parameterization of fetch openml (split id and name into two…
janvanrijn Jul 14, 2018
fe7cfc9
extended functionality and unit tests for sparse matrices
janvanrijn Jul 14, 2018
be9332b
added __init__.py for externals/liac-arff (neccessary for Python 2)
janvanrijn Jul 14, 2018
3a537b3
removed monkeypatch context manager from with block
janvanrijn Jul 14, 2018
757552a
adapted test paths
janvanrijn Jul 14, 2018
266e722
monkeypatched anneal test
janvanrijn Jul 14, 2018
c7c83a6
fixed travis doc test and line size
janvanrijn Jul 15, 2018
280292c
fix indent for unit test
janvanrijn Jul 15, 2018
63a6167
doctest
janvanrijn Jul 15, 2018
8856a91
mocked more unit tests
janvanrijn Jul 15, 2018
6a834e8
mocked ALL unit tests
janvanrijn Jul 15, 2018
d5f0441
fixed doc test, added unit test with ignore attributes and row id
janvanrijn Jul 15, 2018
f13ed63
added openml reference paper to documentation
janvanrijn Jul 15, 2018
579d10e
suggestions by @amueller
janvanrijn Jul 15, 2018
2dc3e2c
moved liac-arff to general externals directory (for appveyor)
janvanrijn Jul 16, 2018
2659dd9
improved docstring based on Joris' comments
janvanrijn Jul 16, 2018
403576b
changed order of arguments
janvanrijn Jul 16, 2018
5c65cb7
openml doc improvements
janvanrijn Jul 16, 2018
f8cf13f
improved doc+tests
janvanrijn Jul 16, 2018
bf10a86
code improvements requested by Joris
janvanrijn Jul 16, 2018
6d4bd90
doc fixes addressing review comments
janvanrijn Jul 16, 2018
3f8128a
incorporated many suggestions from the code review:
janvanrijn Jul 17, 2018
7ee5329
final touch on json load factorization
janvanrijn Jul 17, 2018
aedfefd
more replies to reviews
janvanrijn Jul 17, 2018
defb1a6
changed arff import
janvanrijn Jul 17, 2018
b731b0d
fixed line lengths
janvanrijn Jul 17, 2018
c7faee2
extended unit test by checking the output data arrays
janvanrijn Jul 17, 2018
a8dec37
removed unused variable
janvanrijn Jul 17, 2018
26b5c28
aligned whatsnew
janvanrijn Jul 17, 2018
440eb86
changed line size
janvanrijn Jul 17, 2018
68ffece
name lower case
janvanrijn Jul 17, 2018
1cec31b
renamed mock files (to lowercase)
janvanrijn Jul 17, 2018
b5cce8c
added support for multi-target classification
janvanrijn Jul 17, 2018
02108c4
fix visual indent error
janvanrijn Jul 17, 2018
a01e8ba
gzipped test data
janvanrijn Jul 17, 2018
b45fea6
Merge pull request #2 from scikit-learn/master
janvanrijn Jul 18, 2018
6ea8cfb
url open fix
janvanrijn Jul 19, 2018
6e68177
added more tests
janvanrijn Jul 19, 2018
29046e1
fix flake8 stuff
janvanrijn Jul 20, 2018
40b5bb1
added sanity check unit test
janvanrijn Jul 20, 2018
7c921c0
small fixes for travis
janvanrijn Jul 20, 2018
004b831
see if loading a non-binary file works in gzip
janvanrijn Jul 20, 2018
02ae09f
fix unit test?
janvanrijn Jul 20, 2018
09b0fd8
added more explicit error when unicode error arises
janvanrijn Jul 20, 2018
bc7fd91
changed back to 'rb'
janvanrijn Jul 20, 2018
26a5f51
solving all my unicode problems
janvanrijn Jul 20, 2018
e267f16
removed str hack, proper way to handle unicode across Python versions
janvanrijn Jul 20, 2018
931c93f
improved path.join handling, so Windows also likes the OpenML extension
janvanrijn Jul 20, 2018
cdca931
removed strict string check
janvanrijn Jul 20, 2018
be25e36
changed assertion in sanity check test
janvanrijn Jul 20, 2018
1582da5
replaced last occurance of str into string_types
janvanrijn Jul 20, 2018
0f4d22c
test to see if appveyor handles ungzipped files
janvanrijn Jul 20, 2018
817a3dc
changed abspath to realpath
janvanrijn Jul 20, 2018
25f9fce
appveyor: added debug info
janvanrijn Jul 20, 2018
3f32a93
added arff.gz and json.gz to manifest.in
janvanrijn Jul 21, 2018
381fd45
moved openml test files to data directory
janvanrijn Jul 21, 2018
73c7fdb
added additional unit test for multi-target
janvanrijn Jul 21, 2018
07b2a7e
changed joblib import
janvanrijn Jul 21, 2018
6393f0a
emotions unit test to gzip format
janvanrijn Jul 21, 2018
12a538d
review by Roman
janvanrijn Jul 21, 2018
f71548e
extended default feature check
janvanrijn Jul 21, 2018
2cf14ea
explicit naming for feature dicts
janvanrijn Jul 21, 2018
73544da
typo fix
janvanrijn Jul 21, 2018
887d328
removed sys import (for flake test)
janvanrijn Jul 21, 2018
56f478e
fix Travix fail
janvanrijn Jul 21, 2018
1df22eb
make travis run again
janvanrijn Jul 21, 2018
8e42e09
removed unused variable
janvanrijn Jul 21, 2018
96c627b
incorporated comments by Joel
janvanrijn Jul 23, 2018
e6fe403
incorporated last comment
janvanrijn Jul 23, 2018
c08065b
sparse data support
janvanrijn Jul 23, 2018
94f86c1
comments by Joel
janvanrijn Jul 24, 2018
312ecd9
Merge branch 'sparse_data' into master
janvanrijn Jul 24, 2018
546c4b0
resolved merge errors
janvanrijn Jul 24, 2018
30662b4
fixes flake8 error
janvanrijn Jul 24, 2018
8d02af2
comments by Joel
janvanrijn Jul 25, 2018
197685b
changed arff.loads to arff.load
janvanrijn Jul 25, 2018
8d9d7dc
reverted loads for PY3
janvanrijn Jul 25, 2018
5218c67
implemented caching as recommended by Joel
janvanrijn Jul 25, 2018
857042f
changed function signature, better testing for cache and small bugfixes
janvanrijn Jul 25, 2018
82eb0fd
Merge pull request #5 from janvanrijn/new_caching
janvanrijn Jul 25, 2018
bff90e5
changed hashes into file structure
janvanrijn Jul 25, 2018
4f655d3
Merge pull request #6 from janvanrijn/new_caching
janvanrijn Jul 25, 2018
5bec402
comments by Joel
janvanrijn Jul 26, 2018
6c4c920
comments by Joel
janvanrijn Jul 27, 2018
eae5f62
flake8 correction
janvanrijn Jul 27, 2018
7a01418
extended test files with expected missing values
janvanrijn Jul 30, 2018
426bea4
encode categoricals
janvanrijn Jul 31, 2018
15c2b48
renamed variable names
janvanrijn Aug 2, 2018
3f62bef
Merge pull request #8 from janvanrijn/encode
janvanrijn Aug 2, 2018
cea4f64
Targets not encoded; Towards metadata about nominals
jnothman Aug 6, 2018
cea54a4
handle no target
jnothman Aug 6, 2018
350a64f
Merge pull request #9 from jnothman/fetch_openml
janvanrijn Aug 6, 2018
9a4d1d8
Make target dtype an object in classification
jnothman Aug 7, 2018
2892be6
Test categories attribute, and some fixes
jnothman Aug 7, 2018
778fbd4
Merge pull request #10 from jnothman/fetch_openml
janvanrijn Aug 7, 2018
25bae94
added nominal values to openml mock files, renamed data list mock fil…
janvanrijn Aug 7, 2018
87b8bd7
Merge branch 'master' into master
janvanrijn Aug 7, 2018
2881936
upgraded liac arff,
janvanrijn Aug 8, 2018
d7ce597
tiny typo fix
janvanrijn Aug 8, 2018
261448d
flake8 fix
janvanrijn Aug 8, 2018
7624af7
incorporated additional raise from https://github.com/renatopp/liac-a…
janvanrijn Aug 8, 2018
40c11dd
DOC note return value is experimental
jnothman Aug 13, 2018
c91eef4
Merge branch 'master' into master
jnothman Aug 13, 2018
4f7003b
Fix flake8
lesteve Aug 13, 2018
a9f3a2c
Remove obsolete FIXME
jnothman Aug 15, 2018
bd1a189
Better error message with string attributes
jnothman Aug 15, 2018
143c19b
cosmit
jnothman Aug 15, 2018
1789421
DOC clean what's new merge mess
jnothman Aug 15, 2018
088ddeb
DOC clean what's new merge mess more
jnothman Aug 15, 2018
fd9fba0
Fix urlopen has no __exit__ in Python 2
jnothman Aug 15, 2018
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -2,7 +2,7 @@ include *.rst
recursive-include doc *
recursive-include examples *
recursive-include sklearn *.c *.h *.pyx *.pxd *.pxi
recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt
recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt *.arff.gz *.json.gz
include COPYING
include AUTHORS.rst
include README.rst
148 changes: 148 additions & 0 deletions doc/datasets/openml.rst
@@ -0,0 +1,148 @@
..
For doctests:

>>> import numpy as np
>>> import os


.. _openml:

Downloading datasets from the openml.org repository
===================================================

`openml.org <https://openml.org>`_ is a public repository for machine learning
data and experiments that allows everybody to upload open datasets.

The ``sklearn.datasets`` package can download datasets from the repository
using the function :func:`sklearn.datasets.fetch_openml`.

For example, to download a dataset of gene expressions in mice brains::

>>> from sklearn.datasets import fetch_openml
>>> mice = fetch_openml(name='miceprotein', version=4)
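
Downloads are cached locally (the commit log above adds caching to the
loader), so repeated calls for the same dataset do not hit the network again.
A minimal sketch, assuming ``data_home`` and ``cache`` keyword arguments that
follow the same conventions as the other ``fetch_*`` loaders (check the merged
signature to confirm)::

>>> # data_home and cache are assumed parameter names for the cache location
>>> # and for toggling the cache; not verified against the final API
>>> mice = fetch_openml(name='miceprotein', version=4,
...                     data_home='/tmp/scikit_learn_data')  # doctest: +SKIP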

To fully specify a dataset, you need to provide a name and a version, though
the version is optional; see :ref:`openml_versions` below.
The dataset contains a total of 1080 examples belonging to 8 different
classes::

>>> mice.data.shape
(1080, 77)
>>> mice.target.shape
(1080,)
>>> np.unique(mice.target) # doctest: +NORMALIZE_WHITESPACE
array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object)

You can get more information on the dataset by looking at the ``DESCR``
and ``details`` attributes::

>>> print(mice.DESCR) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
**Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
Syndrome. PLoS ONE 10(6): e0129126...

>>> mice.details # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP
{'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
'file_id': '17928620', 'default_target_attribute': 'class',
'row_id_attribute': 'MouseID',
'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
'visibility': 'public', 'status': 'active',
'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}


The ``DESCR`` contains a free-text description of the data, while ``details``
contains a dictionary of meta-data stored by OpenML, such as the dataset id.
For more details, see the `OpenML documentation
<https://docs.openml.org/#data>`_. The ``data_id`` of the mice protein dataset
is 40966, and you can use this (or the name) to get more information on the
dataset on the OpenML website::

>>> mice.url
'https://www.openml.org/d/40966'

The ``data_id`` also uniquely identifies a dataset from OpenML::

>>> mice = fetch_openml(data_id=40966)
>>> mice.details # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP
{'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
'creator': ...,
'upload_date': '2016-02-17T14:32:49', 'licence': 'Public', 'url':
'https://www.openml.org/data/v1/download/1804243/MiceProtein.ARFF', 'file_id':
'1804243', 'default_target_attribute': 'class', 'citation': 'Higuera C,
Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins
Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6):
e0129126. [Web Link] journal.pone.0129126', 'tag': ['OpenML100', 'study_14',
'study_34'], 'visibility': 'public', 'status': 'active', 'md5_checksum':
'3c479a6885bfa0438971388283a1ce32'}

.. _openml_versions:

Dataset Versions
----------------

A dataset is uniquely specified by its ``data_id``, but not necessarily by its
name. Several different "versions" of a dataset with the same name can exist,
and these versions can contain entirely different data.
If a particular version of a dataset has been found to contain significant
issues, it might be deactivated. Using a name to specify a dataset will yield
the earliest version of a dataset that is still active. That means that
``fetch_openml(name="miceprotein")`` can yield different results at different
times if earlier versions become inactive.
You can see that the dataset with ``data_id`` 40966 that we fetched above is
version 1 of the "miceprotein" dataset::

>>> mice.details['version'] #doctest: +SKIP
'1'

In fact, this dataset only has one version. The iris dataset, on the other
hand, has multiple versions::

>>> iris = fetch_openml(name="iris")
>>> iris.details['version'] #doctest: +SKIP
'1'
>>> iris.details['id'] #doctest: +SKIP
'61'

>>> iris_61 = fetch_openml(data_id=61)
>>> iris_61.details['version']
'1'
>>> iris_61.details['id']
'61'

>>> iris_969 = fetch_openml(data_id=969)
>>> iris_969.details['version']
'3'
>>> iris_969.details['id']
'969'

Specifying the dataset by the name "iris" yields the lowest version, version 1,
with the ``data_id`` 61. To make sure you always get this exact dataset, it is
safest to specify it by the dataset ``data_id``. The other dataset, with
``data_id`` 969, is version 3 (version 2 has become inactive), and contains a
binarized version of the data::

>>> np.unique(iris_969.target)
array(['N', 'P'], dtype=object)

You can also specify both the name and the version, which also uniquely
identifies the dataset::

>>> iris_version_3 = fetch_openml(name="iris", version=3)
>>> iris_version_3.details['version']
'3'
>>> iris_version_3.details['id']
'969'
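
The commit log above also adds support for choosing the target column and for
multi-target problems. A minimal sketch, assuming the merged signature exposes
``target_column`` and ``return_X_y`` keyword arguments (parameter names taken
from the pull request discussion, not verified here)::

>>> # 'class' is the default target of the miceprotein dataset according to
>>> # the metadata shown above; target_column and return_X_y are assumptions
>>> X, y = fetch_openml(name='miceprotein', version=4,
...                     target_column='class', return_X_y=True)  # doctest: +SKIP
>>> X.shape, y.shape  # doctest: +SKIP
((1080, 77), (1080,))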


.. topic:: References:

* Vanschoren, van Rijn, Bischl and Torgo
`"OpenML: networked science in machine learning"
<https://arxiv.org/pdf/1407.7722.pdf>`_,
ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.
1 change: 1 addition & 0 deletions doc/developers/contributing.rst
@@ -79,6 +79,7 @@ link to it from your website, or simply star to say "I use it":
* `joblib <https://github.com/joblib/joblib/issues>`__
* `sphinx-gallery <https://github.com/sphinx-gallery/sphinx-gallery/issues>`__
* `numpydoc <https://github.com/numpy/numpydoc/issues>`__
* `liac-arff <https://github.com/renatopp/liac-arff>`__

and larger projects:

1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -259,6 +259,7 @@ Loaders
datasets.fetch_lfw_people
datasets.fetch_mldata
datasets.fetch_olivetti_faces
datasets.fetch_openml
datasets.fetch_rcv1
datasets.fetch_species_distributions
datasets.get_data_home
6 changes: 5 additions & 1 deletion doc/whats_new/v0.20.rst
@@ -175,6 +175,11 @@ Support for Python 3.3 has been officially dropped.
:mod:`sklearn.datasets`
.......................

- |MajorFeature| Added :func:`datasets.fetch_openml` to fetch datasets from
`OpenML <http://openml.org>`_. OpenML is a free, open data-sharing platform
and will be used instead of mldata as it provides better service availability.
:issue:`9908` by `Andreas Müller`_ and :user:`Jan N. van Rijn <janvanrijn>`.

- |Feature| In :func:`datasets.make_blobs`, one can now pass a list to the
`n_samples` parameter to indicate the number of samples to generate per
cluster. :issue:`8617` by :user:`Maskani Filali Mohamed <maskani-moh>` and
@@ -198,7 +203,6 @@ Support for Python 3.3 has been officially dropped.
data points could be generated. :issue:`10037` by :user:`Christian Braune
<christianbraune79>`.


:mod:`sklearn.decomposition`
............................

2 changes: 2 additions & 0 deletions sklearn/datasets/__init__.py
@@ -23,6 +23,7 @@
from .twenty_newsgroups import fetch_20newsgroups
from .twenty_newsgroups import fetch_20newsgroups_vectorized
from .mldata import fetch_mldata, mldata_filename
from .openml import fetch_openml
from .samples_generator import make_classification
from .samples_generator import make_multilabel_classification
from .samples_generator import make_hastie_10_2
@@ -65,6 +66,7 @@
'fetch_covtype',
'fetch_rcv1',
'fetch_kddcup99',
'fetch_openml',
'get_data_home',
'load_boston',
'load_diabetes',
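
With these ``__init__`` changes the loader becomes part of the public
``sklearn.datasets`` namespace. A quick sanity check (a sketch, assuming a
scikit-learn build that includes this pull request)::

>>> # verifies only that the symbol is exported; nothing is downloaded
>>> import sklearn.datasets
>>> hasattr(sklearn.datasets, 'fetch_openml')
True
>>> 'fetch_openml' in sklearn.datasets.__all__
True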