
Prediction label retrieval should be moved to after the first fit call #12267


Conversation

@oleksandr-pavlyk
Contributor

What does this implement/fix? Explain your changes.

In estimator_checks::check_clustering, a check is made that the labels produced by fit with a given random_state are the same as the result of fit_predict with the same random_state.

However, the labels were not actually retrieved after the call to fit(X), but after a subsequent call to fit(X.tolist()). That second call may consume from the random stream, which in some implementations of KMeans, such as the one based on Intel DAAL, can cause the cluster centers to be reordered. The labels then come out equivalent only up to a permutation, and the check fails.
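For illustration, here is a minimal sketch of the pattern described above (a hypothetical helper, not the actual estimator_checks code):

import numpy as np
from sklearn.base import clone

def check_clustering_sketch(clusterer, X):
    est = clone(clusterer)
    est.fit(X)
    est.fit(X.tolist())     # may consume from the random stream
    labels = est.labels_    # retrieved only after the *second* fit
    pred = clone(clusterer).fit_predict(X)
    assert np.array_equal(labels, pred), "fit labels != fit_predict"

Moving the labels = est.labels_ line to just after est.fit(X) is the change this PR proposes.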

Any other comments?

Hoping for no objections here :)

@GaelVaroquaux

@amueller
Member

amueller commented Oct 3, 2018

No: fit needs to be idempotent, and you're not allowed to consume from the random stream, according to scikit-learn conventions.
Unfortunately we don't have an explicit test for that. There's another PR, which I currently can't find, that tries to "fix" the same thing in a different common test.

We should indeed test more explicitly that fit is idempotent.
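For reference, a minimal sketch of what such an explicit common test could look like (a hypothetical helper, not an existing scikit-learn check): fit twice on the same data and compare all fitted attributes.

import numpy as np
from sklearn.base import clone

def check_fit_idempotent_sketch(estimator, X, y=None):
    est = clone(estimator)
    est.fit(X, y)
    # snapshot fitted attributes (trailing-underscore convention)
    first = {k: v for k, v in vars(est).items() if k.endswith('_')}
    est.fit(X, y)  # refit on the same data
    for key, value in first.items():
        assert np.array_equal(value, getattr(est, key)), key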

@amueller
Member

amueller commented Oct 3, 2018

Discussion was here: #10978

@amueller
Member

amueller commented Oct 3, 2018

Can you please instead check that, even with a single iteration, calling fit twice results in the same prediction, i.e. that the random initialization doesn't change with a fixed random seed?

@jnothman
Member

The point is that you need to run check_random_state only in fit, not in __init__.
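For context, a minimal sketch of that convention (a hypothetical toy estimator, not code from this PR): __init__ only stores the parameter, and check_random_state is invoked inside fit on every call, so fitting twice with the same random_state is deterministic.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils import check_random_state

class NoisyMean(BaseEstimator):
    """Toy estimator illustrating random_state handling."""

    def __init__(self, random_state=None):
        # store the parameter untouched; no validation here
        self.random_state = random_state

    def fit(self, X, y=None):
        # derive a fresh RNG on every fit call; self.random_state
        # itself is never modified, so fit has no side effects
        rng = check_random_state(self.random_state)
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0) + rng.normal(scale=1e-6, size=X.shape[1])
        return self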

@oleksandr-pavlyk
Contributor Author

@jnothman I checked the code, and we only call check_random_state once, in fit.

@amueller I am not sure I completely understood your request, but here is my attempt at verifying that calling fit twice on the KMeans class with the same random_state produces identical results:

# coding: utf-8
import numpy as np
import sklearn.cluster
import sklearn.datasets

X, lb = sklearn.datasets.make_blobs(centers=3, n_samples=10)

cl = sklearn.cluster.KMeans(n_clusters=3, random_state=100,
                            algorithm='full', n_init=1, max_iter=1)

# first fit
cl.fit(X)
c1_, l1_ = cl.cluster_centers_, cl.labels_

# reset random_state, then fit again on the same data
cl.set_params(random_state=100)
cl.fit(X)
c2_, l2_ = cl.cluster_centers_, cl.labels_

assert np.all(c1_ == c2_), "cluster centers differ"
assert np.all(l1_ == l2_), "labels differ"

I am also unsure whether you think the proposed change is not a good one and is indicative of an issue with our changes. Please clarify.

@amueller
Member

amueller commented Oct 4, 2018

Sorry for being unclear. Yes, I don't think your proposed change is good. Needing this change means your estimator violates the sklearn API.
You are not allowed to change the random_state member variable. So yes, your test is OK, but it shouldn't need the cl.set_params(random_state=100). If random_state is a fixed value, calling fit needs to be deterministic and have no side effects.

@oleksandr-pavlyk
Contributor Author

I checked that, with our changes, this script:

import numpy as np
import sklearn.cluster
import sklearn.datasets

X, lb = sklearn.datasets.make_blobs(centers=3, n_samples=10)

cl = sklearn.cluster.KMeans(n_clusters=3, random_state=0,
                            algorithm='full', n_init=1, max_iter=1)
cl.fit(X)
c1_, l1_ = cl.cluster_centers_, cl.labels_

# second fit on the same ndarray, without resetting random_state
cl.fit(X)
c2_, l2_ = cl.cluster_centers_, cl.labels_

assert np.all(c1_ == c2_), "cluster centers differ"
assert np.all(l1_ == l2_), "labels differ"

does not raise any assertion errors, but this one:

import numpy as np
import sklearn.cluster
import sklearn.datasets

X, lb = sklearn.datasets.make_blobs(centers=3, n_samples=10)

cl = sklearn.cluster.KMeans(n_clusters=3, random_state=0,
                            algorithm='full', n_init=1, max_iter=1)
cl.fit(X)
c1_, l1_ = cl.cluster_centers_, cl.labels_

# second fit on a plain list instead of an ndarray
cl.fit(X.tolist())
c2_, l2_ = cl.cluster_centers_, cl.labels_

assert np.all(c1_ == c2_), "cluster centers differ"
assert np.all(l1_ == l2_), "labels differ"

does.

It has nothing to do with random_state, but rather with the freedom DAAL takes to order the resulting centers differently:

In [15]: c1_
Out[15]:
array([[ 0.8573024 ,  9.74137512],
       [ 6.23751117, -3.08010944],
       [ 2.54114708, -7.84122869]])

In [16]: c2_
Out[16]:
array([[ 6.23751117, -3.08010944],
       [ 0.8573024 ,  9.74137512],
       [ 2.54114708, -7.84122869]])

In [17]: l1_
Out[17]: array([2, 1, 0, 1, 2, 1, 0, 0, 1, 2], dtype=int32)

In [18]: l2_
Out[18]: array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2], dtype=int32)
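As a sanity check, the two results above do agree up to a relabeling; here is a small sketch (reusing c1_, c2_, l1_, l2_ from the snippets above) that matches each center in c1_ to its counterpart in c2_:

# perm[i] is the index in c2_ of the center closest to c1_[i]
perm = np.array([np.argmin(np.linalg.norm(c2_ - c, axis=1)) for c in c1_])

assert np.allclose(c1_, c2_[perm]), "centers are not a reordering"
# relabeling l1_ through perm reproduces l2_ exactly
assert np.array_equal(perm[l1_], l2_), "labels differ beyond a permutation"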

@oleksandr-pavlyk
Contributor Author

Concurrently with the current discussion, I will try to understand what determines the order in which cluster centers are returned by DAAL.

@amueller
Member

amueller commented Oct 4, 2018

Why does X being a list make a difference? That's... interesting.

@oleksandr-pavlyk
Contributor Author

@amueller Would it be OK to change the check that pred is the same as pred2 into a check on the v_measure_score of the two predictions, which is invariant to label permutations?
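For what it's worth, v_measure_score is indeed invariant to label permutations; a quick sketch with the two labelings from the output above:

import numpy as np
from sklearn.metrics import v_measure_score

l1_ = np.array([2, 1, 0, 1, 2, 1, 0, 0, 1, 2])
l2_ = np.array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2])

# the raw arrays differ, yet the score treats them as the same clustering
assert v_measure_score(l1_, l2_) == 1.0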

@oleksandr-pavlyk
Contributor Author

Aha. It just so happens that for X.tolist(), sklearn's k_means_dense is called instead of DAAL's code, which is called for the array. That explains the discrepancy between fit(X.tolist()) and fit(X).

@amueller
Member

amueller commented Oct 4, 2018

I don't think using v_measure_score is appropriate. So you can just ensure DAAL is called and all is good, right? (See, the test actually found an issue ;)

@oleksandr-pavlyk
Contributor Author

Ok, fair enough. I resolved it on my side.

@oleksandr-pavlyk oleksandr-pavlyk deleted the estimator_checks-tweak branch October 5, 2018 17:17