
[MRG + 1] Added DisjointLabelKFold to perform K-Fold cv on sets with disjoint labels. #4444


Closed
wants to merge 27 commits into from

27 commits
53063ba
Added subject independent KFold
JeanKossaifi Mar 24, 2015
37ecdd7
Changed SubjectIndependentKFold to DisjointGroupKFold
JeanKossaifi Mar 25, 2015
79cea0c
Changed name to DisjointLabelKFold
JeanKossaifi Mar 26, 2015
de5d272
Added example of use
JeanKossaifi Mar 26, 2015
a809881
FIX: whitespace related doctest failure
JeanKossaifi Mar 26, 2015
c14f847
FIX: Python 2.6 requires the field numbers in print
JeanKossaifi Mar 26, 2015
b0fc204
FIX: change docstring to comment in test function
JeanKossaifi Mar 26, 2015
cd77b82
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
JeanKossaifi Mar 28, 2015
1920a75
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
JeanKossaifi Mar 30, 2015
1916b26
DOC: moved docstring from function to class
JeanKossaifi May 14, 2015
f320bc1
FIX: added call to parent class
JeanKossaifi May 14, 2015
f58dbe6
FIX: error in calling the parent
JeanKossaifi May 14, 2015
89358a3
DOC: fixed doctest
JeanKossaifi May 14, 2015
90cfc56
FIX: doctest
JeanKossaifi May 14, 2015
5630d34
Cosmetic changes (minor refactoring)
JeanKossaifi Jun 23, 2015
19da58a
Optimised code (use np.bincount)
JeanKossaifi Jun 23, 2015
4706289
Cosmetic: use samples instead of weight for clarity
JeanKossaifi Jun 23, 2015
b305539
Minor fix: removed shuffle parameter
JeanKossaifi Jun 23, 2015
b7fc3d8
Cosmetic
JeanKossaifi Jun 23, 2015
65d96b3
Use mergesort instead of quicksort for reproducibility.
JeanKossaifi Jun 25, 2015
fa03f1b
Changed variable name 'y' to 'label'.
JeanKossaifi Jun 25, 2015
3faa60a
Added test for degenerate case where n_folds > n_labels.
JeanKossaifi Jun 29, 2015
90275be
Documented the requirement n_labels > n_folds.
JeanKossaifi Jun 29, 2015
0b2dff6
DOC: improved description + added see also sections.
JeanKossaifi Jul 1, 2015
6b4b63f
Fixed dtype of temporary arrays.
JeanKossaifi Jul 1, 2015
c9a636c
Improved test: check that one label is not in both test and training.
JeanKossaifi Jul 1, 2015
ecb0ea0
Added documentation for DisjointLabelKFold.
JeanKossaifi Jul 1, 2015
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -168,6 +168,7 @@ Classes
cross_validation.LeavePOut
cross_validation.PredefinedSplit
cross_validation.StratifiedKFold
cross_validation.DisjointLabelKFold
cross_validation.ShuffleSplit
cross_validation.StratifiedShuffleSplit

27 changes: 27 additions & 0 deletions doc/modules/cross_validation.rst
@@ -261,6 +261,33 @@ two slightly unbalanced classes::
[0 1 2 4 5 6 7] [3 8 9]


Disjoint label KFold
--------------------

:class:`DisjointLabelKFold` is a variation of *k-fold* which ensures that the same
label does not appear in both the testing and training sets.
This is necessary, for example, if you obtained data from different subjects and you
want to avoid over-fitting (i.e. learning person-specific features) by training and
testing on different subjects.

Imagine you have three subjects, each with an associated number from 1 to 3::

>>> from sklearn.cross_validation import DisjointLabelKFold

>>> labels = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

>>> dlkf = DisjointLabelKFold(labels, 3)
>>> for train, test in dlkf:
... print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

Each subject is in a different testing fold, and the same subject is never in both
testing and training.
Notice that the folds do not have exactly the same size due to the imbalance in the data.
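The fold assignment above can be reproduced with a short standalone sketch (a simplified, illustrative re-implementation in pure NumPy; `disjoint_label_folds_sketch` is a hypothetical name, not the scikit-learn API): each label, taken from most to least frequent, is assigned to the fold that currently holds the fewest samples.

```python
import numpy as np

def disjoint_label_folds_sketch(labels, n_folds=3):
    """Greedy sketch: assign each label (largest first) to the lightest fold."""
    labels = np.asarray(labels)
    unique_labels, inverse = np.unique(labels, return_inverse=True)
    counts = np.bincount(inverse)  # number of samples per label
    # Distribute the most frequent labels first (stable sort for reproducibility)
    order = np.argsort(counts, kind='mergesort')[::-1]
    fold_sizes = np.zeros(n_folds, dtype=int)
    label_to_fold = np.zeros(len(unique_labels), dtype=int)
    for label_index in order:
        lightest = np.argmin(fold_sizes)
        fold_sizes[lightest] += counts[label_index]
        label_to_fold[label_index] = lightest
    # Map each sample to the fold of its label
    return label_to_fold[inverse]

labels = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
folds = disjoint_label_folds_sketch(labels, n_folds=3)
print(folds)  # -> [2 2 2 1 1 1 0 0 0 0]
```

Running the sketch on the three-subject example yields the same partition as the output above: subject 3 alone forms the test set of fold 0, subject 2 of fold 1, and subject 1 of fold 2.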


Leave-One-Out - LOO
-------------------

139 changes: 139 additions & 0 deletions sklearn/cross_validation.py
@@ -297,6 +297,8 @@ class KFold(_BaseKFold):
StratifiedKFold: take label information into account to avoid building
folds with imbalanced class distributions (for binary or multiclass
classification tasks).

DisjointLabelKFold: K-fold iterator variant with non-overlapping labels.
"""

def __init__(self, n, n_folds=3, shuffle=False,
@@ -332,6 +334,133 @@ def __len__(self):
return self.n_folds


def disjoint_label_folds(labels, n_folds=3):
"""Creates folds where a same label is not in two different folds.

Parameters
----------
labels: numpy array, shape (n_samples,)
Contains an id for each sample.
The folds are built so that the same id doesn't appear in two different folds.
Member:

id -> index

And maybe document the dtype; it should be np.intp.

Contributor Author:


@larsmans @ogrisel changed to mergesort, thanks.
@larsmans currently the implementation allows the user to pass an array of strings for y (for instance the names or ids of the subjects in an experiment), as I didn't see any reason to disallow that. Do you want to enforce an np.intp dtype?

Member:


I think strings should be welcome.
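The string-label point is easy to check directly: `np.unique` encodes string arrays into integer codes via `return_inverse`, so nothing forces an integer dtype on the input (a standalone check for illustration, not part of the patch):

```python
import numpy as np

# String labels are encoded to integer codes by np.unique
subjects = np.array(['anna', 'bob', 'anna', 'carl', 'bob', 'anna'])
unique_subjects, codes = np.unique(subjects, return_inverse=True)
print(unique_subjects)  # -> ['anna' 'bob' 'carl']
print(codes)            # -> [0 1 0 2 1 0]
# The integer codes can then be fed to np.bincount exactly as in the patch
print(np.bincount(codes))  # -> [3 2 1]
```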


n_folds: int, default=3
Number of folds to split the data into.

Returns
-------
folds: numpy array of shape (n_samples, )
Member:


folds : numpy array, shape (n_samples,)

Array of integers between 0 and (n_folds - 1).
folds[i] is the index of the fold to which sample i is assigned.

Notes
-----
The folds are built by distributing the labels by frequency of appearance.
The number of labels has to be at least equal to the number of folds.
"""
    labels = np.array(labels)
    unique_labels, labels = np.unique(labels, return_inverse=True)
    n_labels = len(unique_labels)
    if n_folds > n_labels:
        raise ValueError(
            ("Cannot have number of folds n_folds={0} greater"
             " than the number of labels: {1}.").format(n_folds, n_labels))

    # number of occurrences of each label (its "weight")
    samples_per_label = np.bincount(labels)
    # We want to distribute the most frequent labels first
    ind = np.argsort(samples_per_label, kind='mergesort')[::-1]
    samples_per_label = samples_per_label[ind]

    # Total weight of each fold
    samples_per_fold = np.zeros(n_folds, dtype=np.uint64)

    # Mapping from label index to fold index
    label_to_fold = np.zeros(len(unique_labels), dtype=np.uintp)

    # While there are weights, distribute them
    # Specifically, add the biggest weight to the lightest fold
    for label_index, w in enumerate(samples_per_label):
        ind_min = np.argmin(samples_per_fold)
        samples_per_fold[ind_min] += w
        label_to_fold[ind[label_index]] = ind_min

    folds = label_to_fold[labels]

    return folds


class DisjointLabelKFold(_BaseKFold):
"""K-fold iterator variant with non-overlapping labels.

The same label will not appear in two different folds (the number of
labels has to be at least equal to the number of folds).

The folds are approximately balanced in the sense that the number of
samples is approximately the same in each fold.

Parameters
----------
labels : array-like with shape (n_samples, )
Contains a label for each sample.
The folds are built so that the same label doesn't appear in two different folds.

n_folds : int, default is 3
Member:


int, default=3

Number of folds.

Examples
--------
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([1, 2, 3, 4])
>>> labels = np.array([0, 0, 2, 2])
>>> dl_kfold = cross_validation.DisjointLabelKFold(labels, n_folds=2)
>>> len(dl_kfold)
2
>>> print(dl_kfold)
sklearn.cross_validation.DisjointLabelKFold(n_labels=4, n_folds=2)
>>> for train_index, test_index in dl_kfold:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
... print(X_train, X_test, y_train, y_test)
...
TRAIN: [0 1] TEST: [2 3]
[[1 2]
[3 4]] [[5 6]
[7 8]] [1 2] [3 4]
TRAIN: [2 3] TEST: [0 1]
[[5 6]
[7 8]] [[1 2]
[3 4]] [3 4] [1 2]

See also
--------
LeaveOneLabelOut for splitting the data according to explicit,
domain-specific stratification of the dataset.
"""
    def __init__(self, labels, n_folds=3):
        # No shuffling implemented yet
        super(DisjointLabelKFold, self).__init__(len(labels), n_folds, False, None)
        self.n_folds = n_folds
        self.n = len(labels)
        self.idxs = disjoint_label_folds(labels=labels, n_folds=n_folds)

    def _iter_test_indices(self):
        for i in range(self.n_folds):
            yield (self.idxs == i)

    def __repr__(self):
        return '{0}.{1}(n_labels={2}, n_folds={3})'.format(
            self.__class__.__module__,
            self.__class__.__name__,
            self.n,
            self.n_folds,
        )

    def __len__(self):
        return self.n_folds


class StratifiedKFold(_BaseKFold):
"""Stratified K-Folds cross validation iterator

@@ -380,6 +509,9 @@ class StratifiedKFold(_BaseKFold):
All the folds have size trunc(n_samples / n_folds), the last one has the
complementary.

See also
--------
DisjointLabelKFold: K-fold iterator variant with non-overlapping labels.
"""

def __init__(self, y, n_folds=3, shuffle=False,
@@ -486,6 +618,9 @@ class LeaveOneLabelOut(_PartitionIterator):
[3 4]] [[5 6]
[7 8]] [1 2] [1 2]

See also
--------
DisjointLabelKFold: K-fold iterator variant with non-overlapping labels.
"""

def __init__(self, labels):
@@ -559,6 +694,10 @@ class LeavePLabelOut(_PartitionIterator):
TRAIN: [0] TEST: [1 2]
[[1 2]] [[3 4]
[5 6]] [1] [2 1]

See also
--------
DisjointLabelKFold: K-fold iterator variant with non-overlapping labels.
"""

def __init__(self, labels, p):
68 changes: 68 additions & 0 deletions sklearn/tests/test_cross_validation.py
@@ -12,6 +12,7 @@
from sklearn.utils.testing import assert_almost_equal
from sklearn.utils.testing import assert_raises
from sklearn.utils.testing import assert_greater
from sklearn.utils.testing import assert_greater_equal
from sklearn.utils.testing import assert_less
from sklearn.utils.testing import assert_not_equal
from sklearn.utils.testing import assert_array_almost_equal
@@ -1038,3 +1039,70 @@ def test_check_is_partition():

    p[0] = 23
    assert_false(cval._check_is_partition(p, 100))


def test_disjoint_label_folds():
    # Check that the function produces balanced folds
    # with no label appearing in two different folds

    # Fix the seed for reproducibility
    rng = np.random.RandomState(0)

    # Parameters of the test
    n_labels = 15
    n_samples = 1000
    n_folds = 5

    # Construct the test data
    tolerance = 0.05 * n_samples  # 5 percent error allowed
Member:


where does this number come from?

Contributor Author:


I just used the same default as for the other classes (KFold, etc).

Member:


You mean test_stratified_shuffle_split_even?

Member:


That is quite a different test, though, right?

Contributor Author:


Oh sorry, I misread your question; I thought you were referring to the default number of folds.
I just chose something seemingly reasonable (0.05 being a threshold commonly used in statistics).

Member:


It seems a bit magic to me, it is not clear why the test should pass. It depends on the distribution of labels, right? I guess with randint and many samples, they will be quite close.

Contributor Author:


Yes, this is true... I chose quite a large number of samples (1000) though and np.random.randint draws integers from a “discrete uniform” distribution in the “half-open” interval [low, high).
The seed is also fixed.

Should I leave that or do you suggest I simply manually write an array?

Member:


I think it's fine for now.

    labels = rng.randint(0, n_labels, n_samples)
    folds = cval.disjoint_label_folds(labels, n_folds)
    ideal_n_samples_per_fold = n_samples // n_folds

    # Check that folds have approximately the same size
    assert_equal(len(folds), len(labels))
    for i in np.unique(folds):
        assert_greater_equal(tolerance, abs(sum(folds == i) - ideal_n_samples_per_fold))

    # Check that each label appears only in 1 fold
    for label in np.unique(labels):
Member:

you could remove the for loop by constructing a coo_matrix instead, but I'm not sure it's worth it. this one might be cleaner.

Contributor Author:


I agree, to the reader it is probably easier to have the loopy version.

        assert_equal(len(np.unique(folds[labels == label])), 1)

    # Check that no label is on both sides of the split
    labels = np.asarray(labels, dtype=object)  # to allow fancy indexing on labels
    for train, test in cval.DisjointLabelKFold(labels, n_folds=n_folds):
        assert_equal(len(np.intersect1d(labels[train], labels[test])), 0)

    # Construct the test data
    labels = ['Albert', 'Jean', 'Bertrand', 'Michel', 'Jean',
              'Francis', 'Robert', 'Michel', 'Rachel', 'Lois',
              'Michelle', 'Bernard', 'Marion', 'Laura', 'Jean',
              'Rachel', 'Franck', 'John', 'Gael', 'Anna', 'Alix',
              'Robert', 'Marion', 'David', 'Tony', 'Abel', 'Becky',
              'Madmood', 'Cary', 'Mary', 'Alexandre', 'David', 'Francis',
              'Barack', 'Abdoul', 'Rasha', 'Xi', 'Silvia']

    n_labels = len(np.unique(labels))
    n_samples = len(labels)
    n_folds = 5
    tolerance = 0.05 * n_samples  # 5 percent error allowed
    folds = cval.disjoint_label_folds(labels, n_folds)
    ideal_n_samples_per_fold = n_samples // n_folds

    # Check that folds have approximately the same size
    assert_equal(len(folds), len(labels))
    for i in np.unique(folds):
        assert_greater_equal(tolerance, abs(sum(folds == i) - ideal_n_samples_per_fold))

    # Check that each label appears only in 1 fold
    for label in np.unique(labels):
        assert_equal(len(np.unique(folds[labels == label])), 1)
Member:

Please enhance this test by also checking that using the DisjointLabelKFold iterator you never have samples with the same label on both sides of the split, e.g.:

labels = np.asarray(labels, dtype=object)  # to allow fancy indexing on labels

for train, test in cval.DisjointLabelKFold(labels, n_folds=n_folds):
    assert_equal(len(np.intersect1d(labels[train], labels[test])), 0)


    # Check that no label is on both sides of the split
    labels = np.asarray(labels, dtype=object)  # to allow fancy indexing on labels
    for train, test in cval.DisjointLabelKFold(labels, n_folds=n_folds):
        assert_equal(len(np.intersect1d(labels[train], labels[test])), 0)

    # Should fail if there are more folds than labels
    labels = np.array([1, 1, 1, 2, 2])
    assert_raises(ValueError, cval.disjoint_label_folds, labels, n_folds=3)
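The 5% tolerance debated in the review thread above can be probed empirically with a standalone simulation (re-implementing the greedy assignment for illustration; this is not the patch's code) using the same parameters as the test:

```python
import numpy as np

rng = np.random.RandomState(0)  # same seed as the test
n_labels, n_samples, n_folds = 15, 1000, 5
labels = rng.randint(0, n_labels, n_samples)

# Greedy assignment: most frequent label goes to the currently lightest fold
counts = np.bincount(labels, minlength=n_labels)
order = np.argsort(counts, kind='mergesort')[::-1]
fold_sizes = np.zeros(n_folds, dtype=int)
label_to_fold = np.zeros(n_labels, dtype=int)
for label in order:
    lightest = np.argmin(fold_sizes)
    fold_sizes[lightest] += counts[label]
    label_to_fold[label] = lightest

# With near-uniform label counts, the folds stay close to n_samples / n_folds
deviation = np.abs(fold_sizes - n_samples // n_folds).max()
print(fold_sizes, deviation)
```

Because `randint` draws labels uniformly and the greedy step always tops up the lightest fold, the fold sizes cluster around 200 and the observed deviation is well under the 5% (50-sample) tolerance, which supports keeping the test as written.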