
add common test that zero sample weight means samples are ignored #15015


Closed

Conversation

@amueller (Member) commented Sep 18, 2019

Adding a test following #10873.
This is a pretty simple sample_weight test that says that a weight of 0 is equivalent to not having the samples.
I think every failure here should be considered a bug. The failing estimators are:

  • SVR
  • SGDRegressor
  • RANSACRegressor
  • OneClassSVM
  • NuSVR
  • MiniBatchKMeans
  • LinearSVR
  • LinearSVC
  • LogisticRegressionCV
  • KernelDensity
  • KMeans
  • CalibratedClassifierCV
  • IsolationForest
  • RidgeClassifierCV

I wonder if, for the SGD-based algorithms, there's an issue where we shrink w even if we see a sample with sample weight 0.
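A minimal, self-contained sketch of the idea (LogisticRegression is just a convenient stand-in here; the common test in this PR loops over the estimators above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)

# Reference fit: original data, unit weights.
ref = LogisticRegression().fit(X, y, sample_weight=np.ones(len(y)))

# Same data duplicated, with the second copy given sample_weight=0.
X2 = np.vstack([X, X])
y2 = np.concatenate([y, y])
w2 = np.concatenate([np.ones(len(y)), np.zeros(len(y))])
dup = LogisticRegression().fit(X2, y2, sample_weight=w2)

# If zero-weight samples are really ignored, both models must agree on X.
np.testing.assert_allclose(ref.predict_proba(X), dup.predict_proba(X), atol=1e-6)
```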

@amueller (Member Author)

Added decision_function and predict_proba; now SGDClassifier, RidgeClassifierCV, LogisticRegressionCV, LinearSVC, NuSVC and SVC are failing as well.

For the CV ones, I guess the splits are different because the CV is influenced by the zero-weight samples. That's a bit weird, but I'm not sure what the expected behavior should be.

@glemaitre (Member)

These are the same issues that I found in #14246 and #14266.
I also found some issues in the ensembles (but those are only due to the bootstrapping).

@glemaitre (Member) commented Sep 19, 2019

The whole SVM family should be fixed by #14286. Waiting for a second review there.

Edit: actually, that only solves the issue with negative weights.

Inline review on this excerpt of the new test:

```python
y2 = _enforce_estimator_tags_y(estimator3, y2)
weights = np.ones(shape=len(y) * 2)
weights[len(y):] = 0
X2, y2, weights = shuffle(X2, y2, weights, random_state=0)
```
Member

Shuffling the data might have an impact, doesn't it?
I was checking, and OneClassSVM gives the same result if you do not shuffle the data.

Member Author

Yes, indeed. Your test is better with respect to that, I think. I didn't want things to be sorted by weight (the CV estimators would fail), but I guess I need to ensure that the order of the non-zero-weight samples stays the same (which is, I think, what you did?).
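For illustration, one way to ensure that (a hedged sketch of the idea only, not the code from either version of the test): place the zero-weight samples at random positions while keeping the relative order of the non-zero-weight samples unchanged.

```python
import numpy as np

def interleave_zero_weight(X, y, X_extra, y_extra, random_state=0):
    """Insert zero-weight samples at random positions while keeping the
    relative order of the original (weight 1) samples unchanged."""
    rng = np.random.RandomState(random_state)
    n, m = len(y), len(y_extra)
    X_all = np.vstack([X, X_extra])
    y_all = np.concatenate([y, y_extra])
    w_all = np.concatenate([np.ones(n), np.zeros(m)])
    # Sort keys: the original samples keep their integer positions 0..n-1,
    # the extra samples get random float positions, so the stable argsort
    # preserves the original relative order and scatters the zero-weight rows.
    keys = np.concatenate([np.arange(n, dtype=float), rng.uniform(0, n, size=m)])
    order = np.argsort(keys, kind="stable")
    return X_all[order], y_all[order], w_all[order]
```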

Member

Also, the default tol is too high to compare the decision function. I just tried SVC with tol=1e-12, and then the decision functions are really close.
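A sketch of that kind of comparison on synthetic data (purely illustrative; the data and the exact numbers are not from this PR, it only shows the effect of tol):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, random_state=0)
X2, y2 = np.vstack([X, X]), np.concatenate([y, y])
w2 = np.concatenate([np.ones(len(y)), np.zeros(len(y))])

for tol in (1e-3, 1e-12):  # 1e-3 is the SVC default
    ref = SVC(tol=tol).fit(X, y)
    dup = SVC(tol=tol).fit(X2, y2, sample_weight=w2)
    diff = np.max(np.abs(ref.decision_function(X) - dup.decision_function(X)))
    print(f"tol={tol:g}: max |decision_function difference| = {diff:.2e}")
```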

@glemaitre self-requested a review on September 19, 2019 16:07
@adrinjalali (Member)

> For the CV ones, I guess the splits are different because the CV is influenced by the zero-weight samples. That's a bit weird, but I'm not sure what the expected behavior should be.

I'd say splitters should take the weights into account for shuffling/splitting the data. If the data has a lot of samples with weight 0, not considering them would mean some splits may contain only samples with sample_weight=0. Also, for a shuffled k-fold split, it makes sense (to me) to have a similar total weight across the splits.

That said, I'm not sure it's a reasonable expectation to have the same set of non-zero-weight samples after introducing a bunch of zero-weighted samples into the data.

@amueller (Member Author) commented Oct 5, 2019

> I'd say splitters should take the weights into account for shuffling/splitting the data.

I agree with your reasoning. However, that's a behavior change and also an API change. Right now, the signature of split is (X, y=None, groups=None).

It might also be tricky to define what the desired behavior is. Say for KFold we want the sum of the weights to be the same for each fold (instead of the number of samples). Do we want this to be as stable as possible, or do we want to make sure the folds also have similar numbers of samples?
No matter what, this is probably NP-complete (it sounds equivalent to bin packing?), but we could maybe do a "relax and round" solution.

What does it mean for StratifiedKFold? Right now we want the best stratification and we also guarantee that the number of samples is off by at most one across folds. If we do this weighted, does that mean we ensure that the weighted count of samples per class is the same for each fold, and also the weighted total sum?
Is that an LP, or just a least-squares problem [relaxed; the integer version is NP-hard again, I would guess]?
And do we want to take the real sample count into account as well?
(And then, if we want to stratify and also have groups, what do we do?)

Now suppose we do that. I assume the case with all equal sample weights (all ones) will result in a degenerate relaxed solution, so the rounding will be tricky. In that case we can't recover the original unweighted solution from the all-ones case :-/

That possibly means that we need to come up with a greedy heuristic (as we did in other cases) that reduces to the unweighted case more easily.
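As a concrete, purely illustrative example of such a greedy heuristic for the KFold case discussed above (not a proposal for the actual splitter API): assign each sample to the fold with the smallest weight total so far, visiting samples from heaviest to lightest. With all-ones weights this reduces to plain round-robin assignment, i.e. the unweighted behavior.

```python
import numpy as np

def greedy_weight_balanced_folds(sample_weight, n_splits=5):
    """Greedy fold assignment that balances the total weight per fold.

    With equal weights (all ones) the stable sort keeps the original order
    and the argmin tie-breaking makes this plain round-robin assignment.
    """
    sample_weight = np.asarray(sample_weight, dtype=float)
    fold_of = np.empty(len(sample_weight), dtype=int)
    totals = np.zeros(n_splits)
    for i in np.argsort(-sample_weight, kind="stable"):
        fold = int(np.argmin(totals))  # fold with the smallest weight so far
        fold_of[i] = fold
        totals[fold] += sample_weight[i]
    return fold_of

# e.g. greedy_weight_balanced_folds([1, 1, 1, 1, 1, 1], n_splits=3)
# -> array([0, 1, 2, 0, 1, 2])
```

Note that this balances only the weight totals and says nothing about the number of samples per fold, which is exactly the trade-off raised above.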

And finally, as you said, we need to define our expectations. We already give different results after shuffling the data in some cases (should this be a tag? Hmm... for randomized algorithms this is a bit tricky to define).

Do we want to say that if we replace a sample with two identical samples with half the weight the results should be the same? That's probably not mathematically true for SGD, right?
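For an estimator with an exact solver that equivalence does hold, since the weighted objective is literally the same; a quick illustrative check with Ridge (synthetic data, not from this PR):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=30)

# Sample 0 with weight 1 ...
a = Ridge().fit(X, y, sample_weight=np.ones(30))

# ... versus sample 0 duplicated, each copy carrying weight 0.5.
X_dup = np.vstack([X, X[:1]])
y_dup = np.concatenate([y, y[:1]])
w_dup = np.concatenate([np.ones(30), [0.5]])
w_dup[0] = 0.5
b = Ridge().fit(X_dup, y_dup, sample_weight=w_dup)

# Differences only at floating-point precision: the two objectives are identical.
print(np.max(np.abs(a.coef_ - b.coef_)))
```

For SGD, by contrast, the two settings change the sequence of updates, so exact equality is not expected.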

Right now I think the easiest solution might be having a tag that says that an estimator depends on the order of the data.
This is a bit weird in the case of KMeans(init='random'), where reordering the data is equivalent to changing the random state, and I don't know how I would define the tag there.

@ogrisel (Member) commented Oct 7, 2019

I am not sure we want to go down the road of sample_weight-aware CV splitters. That sounds awfully complicated to get right, to me.

@ogrisel (Member) commented Oct 9, 2019

Merged master to trigger a CI run with the result of #14286. I also updated the description so that we can tick off estimators and track progress.

@ogrisel (Member) commented Oct 9, 2019

#14286 alone is not enough to fix the common test for models in the SVM family. Maybe a stricter tol would also be required.

@glemaitre (Member) left a comment

From my previous try, we needed to reduce the tolerance for the SVM classifiers.
#15038 should also be merged (it should not be an issue here since those are not the default parameters).

@rth (Member) commented Nov 7, 2019

How about marking the failing estimators with xfail, opening a follow-up issue, and merging this?

It would be helpful to have this in master (otherwise one has to check out this branch and locally merge master to do anything about it), whereas if it were in master, one could simply run it with pytest sklearn -k .. --runxfail.
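For illustration, a sketch of what marker-based xfail could look like for such a check (the estimator list and reason string are made up; the actual mechanism landed later with #16507):

```python
import pytest
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# Hypothetical set of estimators known to fail the zero-weight check.
KNOWN_FAILURES = {"SVR"}

ESTIMATORS = [("Ridge", Ridge()), ("SVR", SVR())]

def _cases():
    for name, est in ESTIMATORS:
        marks = (
            [pytest.mark.xfail(reason="zero sample_weight is not ignored yet")]
            if name in KNOWN_FAILURES
            else []
        )
        yield pytest.param(name, est, marks=marks, id=name)

@pytest.mark.parametrize("name, estimator", list(_cases()))
def test_zero_sample_weight_is_ignored(name, estimator):
    # ... the zero-weight equivalence check from this PR would run here ...
    pass
```

With the marker, running pytest with --runxfail reports the known failures as plain failures again, which is handy when working on fixes from master.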

@rth (Member) commented Feb 20, 2020

I continued this PR in #16507, as suggested above, which allows merging this test into master with some estimators marked as known failures.

@thomasjpfan modified the milestones: 0.23, 0.24 on Apr 20, 2020
@rth closed this in #16507 on May 10, 2020