
[MRG+2] Add a test for sample weights for estimators #11558


Merged

Conversation

sergulaydore
Contributor

@sergulaydore sergulaydore commented Jul 16, 2018

Reference Issues/PRs

Addresses issue #11316 (Refactor tests for sample weights) by adding an invariance test for sample weights.

What is new?

This is a generic test for estimators that makes sure the sample weights yield consistent results.

What does this implement/fix? Explain your changes.

The test checks whether the output of an estimator is the same for sample_weight=None and sample_weight=np.ones().

Any other comments?

Pairwise methods are skipped as they require pairwise data.
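
At its core, the check fits two copies of an estimator, one with sample_weight=np.ones(n_samples) and one with sample_weight=None, and asserts that they behave identically. Below is a minimal sketch of that idea, using DecisionTreeClassifier purely as an illustrative estimator (the actual check in estimator_checks.py runs over all estimators):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.tree import DecisionTreeClassifier

# Small fixed dataset, similar in spirit to the one used by the check.
X = np.repeat([[1, 3], [2, 1], [3, 3], [4, 1]], 4, axis=0).astype(float)
y = np.array([1, 1, 1, 1, 2, 2, 2, 2] * 2)

# One fit with explicit unit weights, one with no weights at all.
est_ones = DecisionTreeClassifier(random_state=0).fit(
    X, y, sample_weight=np.ones(len(y)))
est_none = DecisionTreeClassifier(random_state=0).fit(
    X, y, sample_weight=None)

# The two fitted models should give the same predictions.
assert_allclose(est_ones.predict(X), est_none.predict(X))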

Member

@GaelVaroquaux GaelVaroquaux left a comment


Looks good, aside from 2 minor comments.

X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
[2, 1], [2, 1], [2, 1], [2, 1],
[3, 3], [3, 3], [3, 3], [3, 3],
[4, 1], [4, 1], [4, 1], [4, 1]])
Member

I would force the dtype to be float here.

Contributor Author

Done!

"sample_weight=ones" % name)
if hasattr(estimator_orig, "transform"):
X_pred1 = estimator1.transform(X)
X_pred2 = estimator2.transform(X)
Member

nitpick: I would call these X_trans1 and X_trans2

Contributor Author

Done!

Contributor Author

Changed this back to X_pred1 and X_pred2 because I moved both methods inside a for loop.

@GaelVaroquaux GaelVaroquaux changed the title Add a test for sample weights for estimators [MRG+1] Add a test for sample weights for estimators Jul 16, 2018
@GaelVaroquaux GaelVaroquaux added this to the 0.20 milestone Jul 16, 2018
[3, 3], [3, 3], [3, 3], [3, 3],
[4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.dtype('float'))
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))
Member

Actually, I would have kept y as an integer. Only X as float.

Contributor Author

Done!

Member

@glemaitre glemaitre left a comment


I think it would be great to add an entry in the what's new as well.

@@ -553,6 +554,45 @@ def check_sample_weights_list(name, estimator_orig):
estimator.fit(X, y, sample_weight=sample_weight)


@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):
Member

It could be worth adding a one-line comment regarding the purpose of the test.

Contributor Author

Done!

y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
Member

You can replace those using set_random_state from sklearn.utils.testing; it will check whether the estimator already has a random_state parameter.

Contributor Author

Done!

1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)), random_state=0)
Member

so basically:

from sklearn.utils.testing import set_random_state

set_random_state(estimator1, random_state=42)
set_random_state(estimator2, random_state=42)

estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

Contributor Author

Thanks! Used it as it is.

estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

if hasattr(estimator_orig, "predict"):
Member

make a loop here

for method in ('predict', 'transform'):
    if hasattr(estimator_orig, method):
        ...

Contributor Author

Done!

X_pred1 = estimator1.predict(X)
X_pred2 = estimator2.predict(X)
try:
assert_allclose(X_pred1, X_pred2, rtol=0.5)
Member

You might want to use assert_allclose_dense_sparse in case the output could be sparse.

Member

Pass err_msg to assert_allclose_dense_sparse, avoiding the try ... except ... needed to raise the proper error.

Contributor Author

I don't think we expect the output to be sparse.
err_msg is done!
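
Putting the two review suggestions together (looping over the methods and passing err_msg instead of wrapping the assertion in try/except), the comparison could look roughly like the sketch below. KMeans is used here only because it exposes both predict and transform and accepts sample_weight; this is an illustration, not the exact merged code in estimator_checks.py.

import numpy as np
from numpy.testing import assert_allclose
from sklearn.cluster import KMeans

name = "KMeans"
X = np.repeat([[1, 3], [2, 1], [3, 3], [4, 1]], 4, axis=0).astype(float)

estimator1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    X, sample_weight=np.ones(len(X)))
estimator2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    X, sample_weight=None)

for method in ("predict", "transform"):
    if hasattr(estimator1, method):
        X_pred1 = getattr(estimator1, method)(X)
        X_pred2 = getattr(estimator2, method)(X)
        # err_msg makes a failure self-explanatory, so no try/except is needed.
        assert_allclose(
            X_pred1, X_pred2,
            err_msg="For %s sample_weight=None is not equivalent to "
                    "sample_weight=ones" % name)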

@TomDLT
Member

TomDLT commented Jul 16, 2018

I did not look into your code, but you might get inspired by the excellent test in #10803; see here.

@GaelVaroquaux
Member

Tests are passing! Congratulations.

But you had a PEP8 failure. I fixed it on your branch. We need to wait for the tests to run again.

@GaelVaroquaux
Member

@glemaitre: can you reassess your review? We think that this is good to go.

@qinhanmin2014
Member

I can't understand why we need to skip "KMeans" and "MiniBatchKMeans". The test passes locally on my PC with these two classes. Also, couldn't we at least use something like v_measure_score here?
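
For what it is worth, a permutation-invariant comparison for clusterers along the lines suggested above could look like the sketch below (the merged check compares predictions directly; v_measure_score is only shown as the alternative mentioned here).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

X = np.repeat([[1, 3], [2, 1], [3, 3], [4, 1]], 4, axis=0).astype(float)

km_ones = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    X, sample_weight=np.ones(len(X)))
km_none = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    X, sample_weight=None)

# v_measure_score is invariant to label permutation, so it checks that the two
# clusterings partition the data identically even if the label ids were swapped.
assert v_measure_score(km_ones.labels_, km_none.labels_) == 1.0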

# unit weights and no weights
if (has_fit_parameter(estimator_orig, "sample_weight") and
not (hasattr(estimator_orig, "_pairwise")
and estimator_orig._pairwise) and
Member

Also, I'd prefer some comments here to show us what you're doing (why do you skip the test?).

Contributor Author

Done! Basically, the data we are testing is not pairwise. Tests for estimators with _pairwise fail otherwise.

@GaelVaroquaux
Member

Indeed, I think that it's good to add comments.

I canceled the corresponding Travis builds to save time in the queue.

and estimator_orig._pairwise) and
name not in ["KMeans", "MiniBatchKMeans"]):
# We skip pairwise because the data is not pairwise
# KMeans and MiniBatchKMeans were unstable; hence skipped.
Member

I'm confused about this. Are they still unstable under a fixed random_state? If so, is there a bug?
(Sorry if I made a mistake :) Go ahead if you have enough confidence.)

Contributor Author

This is strange. When I initially ran this without using set_random_state, they were failing. After using set_random_state, they seem to be stable. I wonder whether estimator.fit(..., random_state=0) misses something that set_random_state handles. I can include those estimators in the test. @GaelVaroquaux WDYT?

Member

It is weird to pass random_state in fit. The state should be passed when instantiating the object or by setting the parameter, which is the aim of set_random_state.

Regarding KMeans and MiniBatchKMeans, they rely on a random initialization, and it is possible that the random state is not actually used when it is passed to fit.
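
As a small illustration of this point: set_random_state fixes the estimator's random_state constructor parameter via set_params (when the estimator has one) rather than passing anything to fit. The import path below is the one from the time of this PR; the helper later moved to sklearn.utils._testing.

from sklearn.cluster import KMeans
from sklearn.utils.testing import set_random_state  # sklearn.utils._testing in later releases

est = KMeans()
set_random_state(est, random_state=0)

# The helper went through set_params, not fit, so the parameter is now fixed.
assert est.get_params()["random_state"] == 0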

@sergulaydore
Contributor Author

@qinhanmin2014 : "KMeans" and "MiniBatchKMeans" are skipped because they were unstable.

@qinhanmin2014
Member

I would expect estimators in scikit-learn to be deterministic with the same input and a fixed random_state, or am I wrong?


@glemaitre
Member

Oh, you added KMeans back; cool, so this is fine then.
Waiting for the CI to be green.

Member

@qinhanmin2014 qinhanmin2014 left a comment


LGTM if CIs are green. Thanks @sergulaydore

@qinhanmin2014 qinhanmin2014 changed the title [MRG+1] Add a test for sample weights for estimators [MRG+2] Add a test for sample weights for estimators Jul 17, 2018
@GaelVaroquaux
Member

OK, Travis is green. Merging. Hurray!

@GaelVaroquaux GaelVaroquaux merged commit bcd6ff3 into scikit-learn:master Jul 17, 2018
@jnothman
Member

Why do we need an rtol as high as 0.5?

@jnothman
Member

This doesn't assert that weighting makes any difference ever, does it?

@sergulaydore
Contributor Author

@jnothman Actually we don't need rtol as high as 0.5. I just tested with the default value and the tests still passed. No, it only makes sure that no weighting is equivalent to unit weighting. If you recommend, I can submit another PR with random weights.
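
For context, numpy's assert_allclose defaults to rtol=1e-07, which is far stricter than 0.5, so switching to the default is a meaningful tightening. A tiny illustration:

import numpy as np
from numpy.testing import assert_allclose

a = np.array([1.0])

assert_allclose(a, a + 1e-9)            # passes: within the default rtol=1e-07
assert_allclose(a, a + 1e-3, rtol=0.5)  # passes only because the tolerance is so loose
# assert_allclose(a, a + 1e-3)          # would raise AssertionError under the default rtol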

@qinhanmin2014
Member

+1 to use default rtol, @sergulaydore please submit a PR, thanks.

@sergulaydore
Contributor Author

Done! Please see #11621.

@sergulaydore sergulaydore deleted the refactor_sample_weights branch July 18, 2018 11:10