
Prediction label retrieval should be moved to after the first fit call #12267


Conversation

@oleksandr-pavlyk
Contributor

What does this implement/fix? Explain your changes.

In estimator_checks::check_clustering, a check is made that the labels produced by fit with a given random_state are the same as the result of fit_predict with the same random_state.

However, the labels were not actually retrieved after the call to fit(X), but after a subsequent call to fit(X.tolist()). That second call may consume from the random stream, which in some implementations of KMeans, such as the one based on Intel DAAL, can cause the cluster centers to be reordered. The labels then come out equivalent only up to a permutation, and the check fails.
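For illustration, here is a minimal sketch of the pattern described above (a hypothetical helper, not the actual estimator_checks code):

import numpy as np
from sklearn.base import clone

def check_clustering_sketch(clusterer, X):
    est = clone(clusterer)
    est.fit(X)
    est.fit(X.tolist())     # may consume from the random stream
    labels = est.labels_    # retrieved only after the *second* fit
    pred = clone(clusterer).fit_predict(X)
    assert np.array_equal(labels, pred), "fit labels != fit_predict"

Moving the labels = est.labels_ line to just after est.fit(X) is the change this PR proposes.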

Any other comments?

Hoping for no objections here :)

@GaelVaroquaux

@amueller
Member

amueller commented Oct 3, 2018

No: fit needs to be idempotent, and you're not allowed to consume from the random stream, according to scikit-learn conventions.
Unfortunately we don't have an explicit test for that. There's another PR, which I currently can't find, that tries to "fix" the same thing in a different common test.

We should indeed test more explicitly that fit is idempotent.
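For reference, a minimal sketch of what such an explicit common test could look like (a hypothetical helper, not an existing scikit-learn check): fit twice on the same data and compare all fitted attributes.

import numpy as np
from sklearn.base import clone

def check_fit_idempotent_sketch(estimator, X, y=None):
    est = clone(estimator)
    est.fit(X, y)
    # snapshot fitted attributes (trailing-underscore convention)
    first = {k: v for k, v in vars(est).items() if k.endswith('_')}
    est.fit(X, y)  # refit on the same data
    for key, value in first.items():
        assert np.array_equal(value, getattr(est, key)), key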

@amueller
Member

amueller commented Oct 3, 2018

Discussion was here: #10978

@amueller
Member

amueller commented Oct 3, 2018

Can you please instead check that, even with a single iteration, calling fit twice results in the same prediction, i.e. that the random initialization doesn't change with a fixed random seed?

@jnothman
Member

The point is that you need to run check_random_state only in fit, not in __init__.
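For context, a minimal sketch of that convention (a hypothetical toy estimator, not code from this PR): __init__ only stores the parameter, and check_random_state is invoked inside fit on every call, so fitting twice with the same random_state is deterministic.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils import check_random_state

class NoisyMean(BaseEstimator):
    """Toy estimator illustrating random_state handling."""

    def __init__(self, random_state=None):
        # store the parameter untouched; no validation here
        self.random_state = random_state

    def fit(self, X, y=None):
        # derive a fresh RNG on every fit call; self.random_state
        # itself is never modified, so fit has no side effects
        rng = check_random_state(self.random_state)
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0) + rng.normal(scale=1e-6, size=X.shape[1])
        return self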

@oleksandr-pavlyk
Contributor Author

@jnothman I checked the code, and we only call check_random_state once, in fit.

@amueller I am not sure I completely understood your request, but here is my attempt at verifying that calling fit twice on the KMeans class with the same random_state produces identical results:

# coding: utf-8
import numpy as np
import sklearn.cluster
import sklearn.datasets

X, lb = sklearn.datasets.make_blobs(centers=3, n_samples=10)

cl = sklearn.cluster.KMeans(n_clusters=3, random_state=100,
                            algorithm='full', n_init=1, max_iter=1)

# first fit
cl.fit(X)
c1_, l1_ = cl.cluster_centers_, cl.labels_

# reset random_state, then fit again on the same data
cl.set_params(random_state=100)
cl.fit(X)
c2_, l2_ = cl.cluster_centers_, cl.labels_

assert np.all(c1_ == c2_), "cluster centers differ"
assert np.all(l1_ == l2_), "labels differ"

I am also unsure whether you think the proposed change is not a good one and is indicative of an issue with our changes. Please clarify.

@amueller
Member

amueller commented Oct 4, 2018

Sorry for being unclear. Yes, I don't think your proposed change is good. Needing this change means your estimator violates the sklearn API.
You are not allowed to change the random_state member variable. So yes, your test is OK, but it shouldn't need the cl.set_params(random_state=100). If random_state is a fixed value, calling fit needs to be deterministic and have no side effects.

@oleksandr-pavlyk
Contributor Author

I checked that, with our changes, this script:

import numpy as np
import sklearn.cluster
import sklearn.datasets

X, lb = sklearn.datasets.make_blobs(centers=3, n_samples=10)

cl = sklearn.cluster.KMeans(n_clusters=3, random_state=0,
                            algorithm='full', n_init=1, max_iter=1)
cl.fit(X)
c1_, l1_ = cl.cluster_centers_, cl.labels_

# second fit on the same ndarray, without resetting random_state
cl.fit(X)
c2_, l2_ = cl.cluster_centers_, cl.labels_

assert np.all(c1_ == c2_), "cluster centers differ"
assert np.all(l1_ == l2_), "labels differ"

does not raise any assertion errors, but this one:

import numpy as np
import sklearn.cluster
import sklearn.datasets

X, lb = sklearn.datasets.make_blobs(centers=3, n_samples=10)

cl = sklearn.cluster.KMeans(n_clusters=3, random_state=0,
                            algorithm='full', n_init=1, max_iter=1)
cl.fit(X)
c1_, l1_ = cl.cluster_centers_, cl.labels_

# second fit on a plain list instead of an ndarray
cl.fit(X.tolist())
c2_, l2_ = cl.cluster_centers_, cl.labels_

assert np.all(c1_ == c2_), "cluster centers differ"
assert np.all(l1_ == l2_), "labels differ"

does.

It has nothing to do with random_state, but rather with the freedom DAAL takes to order the resulting centers differently:

In [15]: c1_
Out[15]:
array([[ 0.8573024 ,  9.74137512],
       [ 6.23751117, -3.08010944],
       [ 2.54114708, -7.84122869]])

In [16]: c2_
Out[16]:
array([[ 6.23751117, -3.08010944],
       [ 0.8573024 ,  9.74137512],
       [ 2.54114708, -7.84122869]])

In [17]: l1_
Out[17]: array([2, 1, 0, 1, 2, 1, 0, 0, 1, 2], dtype=int32)

In [18]: l2_
Out[18]: array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2], dtype=int32)
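As a sanity check, the two results above do agree up to a relabeling; here is a small sketch (reusing c1_, c2_, l1_, l2_ from the snippets above) that matches each center in c1_ to its counterpart in c2_:

# perm[i] is the index in c2_ of the center closest to c1_[i]
perm = np.array([np.argmin(np.linalg.norm(c2_ - c, axis=1)) for c in c1_])

assert np.allclose(c1_, c2_[perm]), "centers are not a reordering"
# relabeling l1_ through perm reproduces l2_ exactly
assert np.array_equal(perm[l1_], l2_), "labels differ beyond a permutation"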

@oleksandr-pavlyk
Contributor Author

Concurrently with the current discussion, I will try to understand what determines the order in which cluster centers are returned by DAAL.

@amueller
Member

amueller commented Oct 4, 2018

Why does X being a list make a difference? That's... interesting.

@oleksandr-pavlyk
Contributor Author

@amueller Would it be OK to change the check that pred is the same as pred2 into a check on the v_measure_score of the two predictions, which is invariant to label permutations?
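For what it's worth, v_measure_score is indeed invariant to label permutations; a quick sketch with the two labelings from the output above:

import numpy as np
from sklearn.metrics import v_measure_score

l1_ = np.array([2, 1, 0, 1, 2, 1, 0, 0, 1, 2])
l2_ = np.array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2])

# the raw arrays differ, yet the score treats them as the same clustering
assert v_measure_score(l1_, l2_) == 1.0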

@oleksandr-pavlyk
Contributor Author

Aha. It just so happens that for X.tolist(), sklearn's k_means_dense is called instead of DAAL's code, which is called for the array. That explains the discrepancy between fit(X.tolist()) and fit(X).

@amueller
Member

amueller commented Oct 4, 2018

I don't think using v_measure_score is appropriate. So you can just ensure DAAL is called and all is good, right? (See, the test actually found an issue ;)

@oleksandr-pavlyk
Contributor Author

Ok, fair enough. I resolved it on my side.

@oleksandr-pavlyk oleksandr-pavlyk deleted the estimator_checks-tweak branch October 5, 2018 17:17