[MRG+2] Add a test for sample weights for estimators #11558
Conversation
Looks good, aside from 2 minor comments.
sklearn/utils/estimator_checks.py
X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]])
I would force the dtype to be float here.
Done!
sklearn/utils/estimator_checks.py
"sample_weight=ones" % name) | ||
if hasattr(estimator_orig, "transform"): | ||
X_pred1 = estimator1.transform(X) | ||
X_pred2 = estimator2.transform(X) |
nitpick: I would call these X_trans1 and X_trans2
Done!
Changed this back to X_pred1 and X_pred2 because I moved both methods inside a for loop.
sklearn/utils/estimator_checks.py
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.dtype('float'))
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))
Actually, I would have kept y as an integer. Only X as float.
Done!
I think it would be great to add an entry in the what's new as well.
sklearn/utils/estimator_checks.py
@@ -553,6 +554,45 @@ def check_sample_weights_list(name, estimator_orig):
        estimator.fit(X, y, sample_weight=sample_weight)


@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):
It could be worth adding a one-line comment regarding the purpose of the test.
Done!
sklearn/utils/estimator_checks.py
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
You can replace those using set_random_state from sklearn.utils.testing; it will check whether the estimator already has a random_state parameter.
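For reference, set_random_state behaves roughly like the sketch below; this is an approximation, not the exact helper in sklearn.utils.testing.

def set_random_state_sketch(estimator, random_state=0):
    # Only set random_state if the estimator actually exposes it.
    if "random_state" in estimator.get_params():
        estimator.set_params(random_state=random_state)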
Done!
sklearn/utils/estimator_checks.py
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
    estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)), random_state=0)
So basically:

from sklearn.utils.testing import set_random_state

set_random_state(estimator1, random_state=42)
set_random_state(estimator2, random_state=42)
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)
Thanks! Used it as is.
sklearn/utils/estimator_checks.py
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

if hasattr(estimator_orig, "predict"):
Make a loop here:

for method in ('predict', 'transform'):
    if hasattr(estimator_orig, method):
        ...
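One way the body could look, as a sketch (estimator1 and estimator2 are the two fitted clones from the snippets above):

for method in ('predict', 'transform'):
    if hasattr(estimator_orig, method):
        # Dispatch to predict or transform on both fitted clones.
        X_pred1 = getattr(estimator1, method)(X)
        X_pred2 = getattr(estimator2, method)(X)
        # ... compare X_pred1 and X_pred2 here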
Done!
sklearn/utils/estimator_checks.py
X_pred1 = estimator1.predict(X)
X_pred2 = estimator2.predict(X)
try:
    assert_allclose(X_pred1, X_pred2, rtol=0.5)
You might want to use assert_allclose_dense_sparse if the output could be sparse.
Pass err_msg to assert_allclose_dense_sparse to avoid the try ... except and raise the proper error.
I don't think we expect the output to be sparse.
err_msg is done!
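For illustration, the comparison without the try ... except could look like this sketch (numpy's assert_allclose accepts an err_msg keyword):

from numpy.testing import assert_allclose

assert_allclose(
    X_pred1, X_pred2,
    err_msg="For %s sample_weight=None is not equivalent"
            " to sample_weight=ones" % name)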
Tests are passing! Congratulations. But you had a PEP8 failure; I fixed it on your branch. We need to wait for the tests to run again.
@glemaitre: can you reassess your review? We think that this is good to go.
I can't understand why we need to skip KMeans and MiniBatchKMeans. The test passes locally on my PC with these two classes. Also, at least we have things like …
sklearn/utils/estimator_checks.py
# unit weights and no weights
if (has_fit_parameter(estimator_orig, "sample_weight") and
        not (hasattr(estimator_orig, "_pairwise")
             and estimator_orig._pairwise) and
Also, I'd prefer some comments here to show us what you're doing (why do you skip the test?).
Done! Basically, the data we are testing is not pairwise. Estimator tests with _pairwise fail otherwise.
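For context, a _pairwise estimator expects a square precomputed kernel or distance matrix rather than a plain feature matrix, so the rectangular X built in this test would not be valid input. A hypothetical illustration:

from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

K = rbf_kernel(X)                # square (n_samples, n_samples) kernel
svc = SVC(kernel='precomputed')  # this estimator has _pairwise == True
svc.fit(K, y)                    # fit takes the kernel matrix, not X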
Indeed, I think that it's good to add comments. I cancelled the corresponding Travis builds to save time in the queue.
sklearn/utils/estimator_checks.py
             and estimator_orig._pairwise) and
        name not in ["KMeans", "MiniBatchKMeans"]):
    # We skip pairwise because the data is not pairwise
    # KMeans and MiniBatchKMeans were unstable; hence skipped.
I'm confused about it. Are they still unstable under a fixed random_state? If so, is there a bug?
(Sorry if I made a mistake :) Go ahead if you have enough confidence.)
This is strange. When I initially ran this without using set_random_state, they were failing. After using set_random_state, they seem to be stable. I wonder if estimator.fit(..., random_state=0) misses something that set_random_state does not. I can include those estimators in the test. @GaelVaroquaux WDYT?
It is weird to pass random_state in fit. The state should be passed when instantiating the object, or by setting the parameter, which is the aim of set_random_state.
Regarding KMeans and MiniBatchKMeans, they rely on a random initialization, and it is possible that the random state is not actually propagated when it is passed to fit.
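To illustrate the point with KMeans (the random state is fixed at construction or via set_params, not in fit):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=42)  # state fixed at construction
km.fit(X)  # fit has no random_state parameter; set_random_state
           # relies on set_params instead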
@qinhanmin2014: KMeans and MiniBatchKMeans are skipped because they were unstable.
I would expect estimators in scikit-learn to be deterministic with the same input and a fixed random_state, or am I wrong?
Oh, you added back KMeans, cool, so this is fine then.
LGTM if CIs are green. Thanks @sergulaydore
OK, Travis is green. Merging. Hurray!
Why do we need an rtol as high as 0.5?
This doesn't assert that weighting makes any difference ever, does it?
@jnothman Actually we don't need rtol as high as 0.5. I just tested with the default value and the tests still passed. No, it only makes sure that no weighting is equivalent to unit weighting. If you recommend it, I can submit another PR with random weights.
+1 to use the default rtol, @sergulaydore please submit a PR, thanks.
Done! Please see #11621.
Reference Issues/PRs
Fixes the invariant test for sample weights as mentioned in issue #11316 (Refactor tests for sample weights).
What is new?
This is a generic test for estimators that makes sure the sample weights yield consistent results.
What does this implement/fix? Explain your changes.
The test checks whether the output of an estimator is the same for sample_weight=None and sample_weight=np.ones().
Any other comments?
Pairwise methods are skipped as they require pairwise data.
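For reference, a condensed sketch of the invariance check discussed in this thread (not the exact merged code; names follow the snippets above):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.base import clone
from sklearn.utils.testing import set_random_state

def check_sample_weight_invariance_sketch(name, estimator_orig, X, y):
    # Fit one clone with unit weights and one with no weights.
    estimator1 = clone(estimator_orig)
    estimator2 = clone(estimator_orig)
    set_random_state(estimator1, random_state=0)
    set_random_state(estimator2, random_state=0)

    estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
    estimator2.fit(X, y=y, sample_weight=None)

    # The two fits should agree for every prediction method.
    for method in ('predict', 'transform'):
        if hasattr(estimator_orig, method):
            X_pred1 = getattr(estimator1, method)(X)
            X_pred2 = getattr(estimator2, method)(X)
            assert_allclose(
                X_pred1, X_pred2,
                err_msg="For %s sample_weight=None is not equivalent"
                        " to sample_weight=ones" % name)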