-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Prediction label retrieval should be moved to after the first fit call #12267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prediction label retrieval should be moved to after the first fit call #12267
Conversation
No, We indeed should test more explicitly that |
Discussion was here: #10978 |
can you please instead check that even with a single iteration calling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is that you need to run check_random_state
only in fit
not in __init__
.
@jnothman I checked the code and we only call @amueller I am not sure I completely understood your request, but here is my attempt at verifying that calling
I am also unsure whether you think the proposed change is not a good one, and is indicative of an issue with our changes. Please clarify |
Sorry for being unclear. Yes, I don't think your proposed change is good. Needing this change means your estimator violates sklearn api. |
I checked that with our changes
does not raise asserts, but
does. It has nothing to do
|
Concurrently with the current discussion I will try to understand what determines the order in which cluster centers are returned in DAAL. |
Why does |
@amueller Would it be OK if the check that |
Aha. It just so happened that for |
I don't think using |
Ok, fair enough. I resolved it on my side. |
What does this implement/fix? Explain your changes.
In
estimator_checks::check_clustering
a check is made that labels produced byfit
with the givenrandom_state
are the same as the result offit_predict
with the samerandom_state
.However, the actual retrieval of labels was not made after call
fit(X)
, but after subsequent callfit(X.tolist())
, which may have consumed from the random stream, which may, in some implementations of KMeans, like that based on Intel DAAL, cause cluster centers to be reordered, and labels to come out equivalent up to a permutation causing a failure.Any other comments?
Hoping for no objections here :)
@GaelVaroquaux