Commit 348d9aa

DOC: improve datasets information
Add some links across the documentation and examples. Add each dataset's description to the docstring of its loader.
1 parent 6a65388 commit 348d9aa

7 files changed: 145 additions and 67 deletions

doc/modules/datasets.rst

Lines changed: 17 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _datasets:
+
 =========================
 Dataset loading utilities
 =========================
@@ -16,6 +18,21 @@ This package also features helpers to fetch larger datasets commonly
 used by the machine learning community to benchmark algorithms on data
 that comes from the 'real world'.

+Datasets shipped with scikit-learn
+==================================
+
+scikit-learn comes with a few standard datasets:
+
+.. autosummary::
+
+   :toctree: generated/
+   :template: function.rst
+
+   load_iris
+   load_diabetes
+   load_digits
+   load_linnerud
+

 Dataset generators
 ==================

doc/tutorial.rst

Lines changed: 5 additions & 1 deletion
@@ -1,3 +1,5 @@
+.. _getting_started:
+
 Getting started: an introduction to machine learning with scikits.learn
 =======================================================================

@@ -67,7 +69,9 @@ the `digits dataset
 A dataset is a dictionary-like object that holds all the data and some
 metadata about the data. This data is stored in the `.data` member, which
 is a `n_samples, n_features` array. In the case of a supervised problem,
-explanatory variables are stored in the `.target` member.
+explanatory variables are stored in the `.target` member. More details on
+the different datasets can be found in the
+:ref:`dedicated section <datasets>`.

 For instance, in the case of the digits dataset, `digits.data` gives
 access to the features that can be used to classify the digits samples::
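
Note: as a rough illustration of the dictionary-like access described in the tutorial text above, the sketch below defines a minimal stand-in container in Python 3. This `Bunch` is an assumption about how such a container might behave (only its name and its `dict` base class appear in this commit), and the `toy` values are made up. In the shipped datasets, `.data` is an array of shape `(n_samples, n_features)`, and the new `load_diabetes` docstring describes `.target` as the labels for each sample.

```python
class Bunch(dict):
    """Dictionary-like container whose keys are also readable as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Share storage between the dict and the instance namespace,
        # so bunch.data and bunch["data"] refer to the same object.
        self.__dict__ = self


# Hypothetical toy dataset: 3 samples with 2 features each, plus one label
# per sample.
toy = Bunch(data=[[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], target=[0, 1, 1])

n_samples = len(toy.data)
n_features = len(toy.data[0])
```

Attribute access and key access are interchangeable here, so `toy.target` and `toy["target"]` return the same list.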

examples/plot_digits_classification.py

Lines changed: 14 additions & 14 deletions
@@ -6,6 +6,9 @@
 An example showing how the scikit-learn can be used to recognize images of
 hand-written digits.

+This example is commented in the
+:ref:`tutorial section of the user manual <getting_started>`.
+
 """
 print __doc__

@@ -15,13 +18,17 @@
 # Standard scientific Python imports
 import pylab as pl

+# Import datasets, classifiers and performance metrics
+from scikits.learn import datasets, svm, metrics
+
 # The digits dataset
-from scikits.learn import datasets
 digits = datasets.load_digits()

 # The data that we are interested in is made of 8x8 images of digits,
-# let's have a look at the first 3 images. We know which digit they
-# represent: it is given in the 'target' of the dataset.
+# let's have a look at the first 3 images, stored in the `images`
+# attribute of the dataset. If we were working from image files, we
+# could load them using pylab.imread. For these images we know which
+# digit they represent: it is given in the 'target' of the dataset.
 for index, (image, label) in enumerate(zip(digits.images, digits.target)[:4]):
     pl.subplot(2, 4, index+1)
     pl.imshow(image, cmap=pl.cm.gray_r)
@@ -32,10 +39,7 @@
 n_samples = len(digits.images)
 data = digits.images.reshape((n_samples, -1))

-# Import a classifier:
-from scikits.learn import svm
-from scikits.learn.metrics import classification_report
-from scikits.learn.metrics import confusion_matrix
+# Create a classifier: a support vector classifier
 classifier = svm.SVC()

 # We learn the digits on the first half of the digits
@@ -45,13 +49,9 @@
 expected = digits.target[n_samples/2:]
 predicted = classifier.predict(data[n_samples/2:])

-print "Classification report for classifier:"
-print classifier
-print
-print classification_report(expected, predicted)
-print
-print "Confusion matrix:"
-print confusion_matrix(expected, predicted)
+print "Classification report for classifier %s:\n%s\n" % (
+    classifier, metrics.classification_report(expected, predicted))
+print "Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted)

 for index, (image, prediction) in enumerate(
     zip(digits.images[n_samples/2:], predicted)[:4]):
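
Note: the example script is written for Python 2, where `n_samples/2` yields an integer and `zip(...)` returns a list that can be sliced. As a hedged sketch of how the same first-half/second-half split might look on Python 3, the snippet below uses floor division and `itertools.islice`. The lists are made-up stand-ins for the flattened digit images and their labels, not the real dataset.

```python
from itertools import islice

# Toy stand-ins for the flattened images and their labels (made-up values).
data = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 2]]
target = [0, 1, 1, 0, 2]

n_samples = len(data)
half = n_samples // 2  # floor division keeps the slice index an integer

# First half for fitting, second half for evaluation.
train_data, train_target = data[:half], target[:half]
test_data, expected = data[half:], target[half:]

# zip() returns an iterator on Python 3, so take the first items with islice
# instead of slicing the zip result directly.
first_pairs = list(islice(zip(test_data, expected), 4))
```

With an odd number of samples, the extra sample ends up in the evaluation half.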

scikits/learn/datasets/base.py

Lines changed: 50 additions & 1 deletion
@@ -9,6 +9,7 @@

 import csv
 import shutil
+import textwrap
 from os import environ
 from os.path import dirname
 from os.path import join
@@ -20,6 +21,7 @@

 import numpy as np

+################################################################################

 class Bunch(dict):
     """ Container object for datasets: dictionary-like object that
@@ -208,7 +210,6 @@ def load_iris():
 def load_digits(n_class=10):
     """Load the digits dataset and return it.

-
     Parameters
     ----------
     n_class : integer, between 0 and 10
@@ -256,13 +257,35 @@ def load_digits(n_class=10):


 def load_diabetes():
+    """ Load the diabetes dataset and return it.
+
+    Returns
+    -------
+    data : Bunch
+        Dictionary-like object, the interesting attributes are:
+        'data', the data to learn and 'target', the labels for each
+        sample.
+
+
+    """
     base_dir = join(dirname(__file__), 'data')
     data = np.loadtxt(join(base_dir, 'diabetes_data.csv.gz'))
     target = np.loadtxt(join(base_dir, 'diabetes_target.csv.gz'))
     return Bunch(data=data, target=target)


 def load_linnerud():
+    """ Load the linnerud dataset and return it.
+
+    Returns
+    -------
+    data : Bunch
+        Dictionary-like object, the interesting attributes are:
+        'data_exercise' and 'data_physiological', the two multivariate
+        datasets, as well as 'header_exercise' and
+        'header_physiological', the corresponding headers.
+
+    """
     base_dir = join(dirname(__file__), 'data/')
     # Read data
     data_exercise = np.loadtxt(base_dir + 'linnerud_exercise.csv', skiprows=1)
@@ -280,3 +303,29 @@ def load_linnerud():
                  data_physiological=data_physiological,
                  header_physiological=header_physiological,
                  DESCR=fdescr.read())
+
+################################################################################
+# Add the description in the docstring
+
+def _add_notes(function, filename):
+    """Add a notes section to the docstring of a function reading it from a
+    file"""
+    fdescr = open(join(dirname(__file__), 'descr', filename), 'r')
+    # Dedent the docstring
+    doc = function.__doc__.split('\n')
+    doc = '%s\n%s' % (textwrap.dedent(doc[0]),
+                      textwrap.dedent('\n'.join(doc[1:])))
+    # Remove the first line of the description, which contains the
+    # dataset's name
+    descr = '\n'.join(fdescr.read().split('\n')[1:])
+    function.__doc__ = doc + descr
+
+
+for function, filename in ((load_iris, 'iris.rst'),
+                           (load_linnerud, 'linnerud.rst'),
+                           (load_digits, 'digits.rst')):
+    #try:
+    _add_notes(function, filename)
+    #except:
+    #    pass
+

scikits/learn/datasets/descr/digits.rst

Lines changed: 15 additions & 13 deletions
@@ -1,12 +1,22 @@
 Optical Recognition of Handwritten Digits Data Set

-Data Set Characteristics:
-
-Source
+
+Notes
 -------

-Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
-Date: July; 1998
+Data Set Characteristics:
+
+:Number of Instances: 5620
+
+:Number of Attributes: 64
+
+:Attribute Information: 8x8 image of integer pixels in the range 0..16.
+
+:Missing Attribute Values: None
+
+:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
+
+:Date: July; 1998

 This is a copy of the test set of the UCI ML hand-written digits datasets

@@ -44,12 +54,4 @@ References
 - ...


-Number of Instances: 5620
-
-Number of Attributes: 64
-
-Attribute Information: 8x8 image of integer pixels in the range 0..16.
-
-Missing Attribute Values: None
-


scikits/learn/datasets/descr/iris.rst

Lines changed: 34 additions & 28 deletions
@@ -1,10 +1,40 @@
 Iris Plants Database

-Source
+Notes
 ------
-Creator: R.A. Fisher
-Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
-Date: July, 1988
+Data Set Characteristics:
+
+:Number of Instances: 150 (50 in each of three classes)
+
+:Number of Attributes: 4 numeric, predictive attributes and the class
+
+:Attribute Information:
+    - sepal length in cm
+    - sepal width in cm
+    - petal length in cm
+    - petal width in cm
+    - class:
+        - Iris-Setosa
+        - Iris-Versicolour
+        - Iris-Virginica
+
+:Summary Statistics:
+    ============== ==== ==== ======= ===== ====================
+                    Min  Max   Mean    SD   Class Correlation
+    ============== ==== ==== ======= ===== ====================
+    sepal length:   4.3  7.9   5.84   0.83    0.7826
+    sepal width:    2.0  4.4   3.05   0.43   -0.4194
+    petal length:   1.0  6.9   3.76   1.76    0.9490 (high!)
+    petal width:    0.1  2.5   1.20   0.76    0.9565 (high!)
+    ============== ==== ==== ======= ===== ====================
+
+:Missing Attribute Values: None
+
+:Class Distribution: 33.3% for each of 3 classes.
+
+:Creator: R.A. Fisher
+:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
+:Date: July, 1988

 This is a copy of UCI ML iris datasets.

@@ -37,28 +67,4 @@ References
 - Many, many more ...


-Number of Instances: 150 (50 in each of three classes)
-
-Number of Attributes: 4 numeric, predictive attributes and the class
-
-Attribute Information:
-    - sepal length in cm
-    - sepal width in cm
-    - petal length in cm
-    - petal width in cm
-    - class:
-        - Iris-Setosa
-        - Iris-Versicolour
-        - Iris-Virginica
-
-Summary Statistics:
-                    Min  Max   Mean    SD   Class Correlation
-    sepal length:   4.3  7.9   5.84   0.83    0.7826
-    sepal width:    2.0  4.4   3.05   0.43   -0.4194
-    petal length:   1.0  6.9   3.76   1.76    0.9490 (high!)
-    petal width:    0.1  2.5   1.20   0.76    0.9565 (high!)
-
-Missing Attribute Values: None
-
-Class Distribution: 33.3% for each of 3 classes.


scikits/learn/datasets/descr/linnerud.rst

Lines changed: 10 additions & 10 deletions
@@ -1,3 +1,11 @@
+
+Notes
+------
+
+:Number of Instances: 20
+:Number of Attributes: 3
+:Missing Attribute Values: None
+
 The Linnerud dataset contains two small datasets:

 - *exercise* A list containing the following components: exercise data with
@@ -6,19 +14,11 @@ The Linnerud dataset contains two small datasets:
 - *physiological* data frame with 20 observations on 3 physiological variables:
   Chins, Situps and Jumps

-Source
-------
-
-Tenenhaus, M. (1998), Table 1, page 15.
+**Source:** Tenenhaus, M. (1998), Table 1, page 15.

 References
 ----------

-Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.
-
-Number of Instances: 20
-
-Number of Attributes: 3
+* Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.

-Missing Attribute Values: None

