Weighted scoring in cross validation (Closes #4632) #13432
Conversation
Use sample_weight in the metric calculations in model validations
Use sample_weight in the metric calculations in model validations and cross validation
Fix whitespace issues
Currently, all but four Azure Pipelines tests are passing. Based on the error logs, the root cause of the failures appears to be a pickling issue; I'm assuming this is because the offending functions are defined inside a closure.
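For background on that kind of failure: Python's pickle only serializes module-level functions by reference, so a function defined inside another function's body (a closure-local function) cannot be pickled, which matters because joblib-based parallelism in cross-validation pickles its workload. A minimal, self-contained illustration (not code from this PR):

```python
import pickle


def make_local_scorer():
    # Closure-local function: pickle cannot look it up at module level.
    def local_score(estimator, X, y):
        return estimator.score(X, y)
    return local_score


scorer = make_local_scorer()
try:
    pickle.dumps(scorer)
except (pickle.PicklingError, AttributeError) as exc:
    # e.g. "Can't pickle local object 'make_local_scorer.<locals>.local_score'"
    print(f"pickling failed: {exc}")
```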
We have been stuck on this problem for a while. We can't just pass sample_weight in all cases, as that would break backwards compatibility. We could consider a switch which turns this behaviour on, or we could consider (which we are trying) a more generic solution for passing around "sample properties" (#4497).
Hey @jnothman. I became aware of #4497 right after I started working on the patch and was surprised to see that this issue has been open / unresolved since April 2015. Do you have any sort of timeline?
When I started writing the patch for internal purposes, I asked a colleague whether this feature should be disabled by default (even though I didn't believe that was the correct solution) so that it had a better chance of being merged to master since there have been other unsuccessful attempts at merging a fix. I was also wondering how the scikit-learn community views #4632. Do they see it as a bug? It seems clear based on comments in #4362 by @ndawe that this behavior is a bug.
I also confirmed this in the article attached to the description of this PR (http://deaktator.github.io/2019/03/10/the-error-in-the-comparator/). Is there something that can be done in the interim while waiting for a solution? Would the maintainers be open to the idea of a PR that issues a warning whenever sample weights are used in training but not in scoring? The reason I ask is that there is no real indication that this problem is occurring. I only stumbled on it by happenstance when I noticed that results seemed incorrect. I'm not sure how widespread the tribal knowledge surrounding #4632 is. I want to try to help make others aware of the issue so that they don't make improper assumptions about how the current code works. (Thank you in advance for any response. I really do care very deeply about this issue.)
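To make the problem concrete, here is a minimal sketch (the data and weights below are arbitrary examples, not from the PR): weights passed through fit_params reach fit() only, and nothing in the output signals that the reported scores are unweighted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
w = np.random.RandomState(0).uniform(0.1, 10.0, size=len(y))  # arbitrary example weights

# fit_params forwards sample_weight to fit() (this was the API at the time of
# this PR; newer releases route such parameters differently), but each fold's
# score still treats every validation sample equally, and no warning is raised.
scores = cross_val_score(LogisticRegression(), X, y, cv=5,
                         fit_params={"sample_weight": w})
print(scores)
```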
I agree that this would sensibly be default behaviour, but I don't think we can really change the current behaviour and break backwards compatibility.
#1574 has been open since January 2013. Timelines are very hard to establish when problems are hard to solve, and when the development is mostly by volunteers, etc. But I don't think this specific problem is that hard to solve, and I have wondered if we were mistaken to try to solve the more general problem. I would be happy to see an interim solution merged. When you suggest just forwarding sample_weight, bear in mind that in a Pipeline the fit parameter is prefixed with the step name, so it is not necessarily called sample_weight. As an interim solution while sample props continues to be developed, I would consider merging a PR which adds a parameter naming the fit parameter whose weights should also be used in scoring.
@jnothman (@deaktator is the same person as @ryan-deak-zefr, BTW) I'll work on making changes to the PR with your suggestions if you can kindly confirm you'll honestly give this PR consideration so long as the code lives up to scikit-learn quality standards. I just don't want to work on something that is futile.

The only thing I ask is that if the default behavior is to maintain backward compatibility, then a warning should be emitted when training with sample weights but not validating with sample weights. I think an issue peripheral to the statistical issues is that people don't know this is happening. I want this warning to be emitted so that people know the issue exists and how to accommodate it. I don't think that API docs alone are good enough for an issue of this scope.

I do appreciate the pipeline problem. If the most elegant solution available is to name the fit parameter to use, then I'll do that. I think that maintaining consistency in the use of weights in training and scoring is important. I can't, off the top of my head, think of a counterexample where you would want to train on one distribution and test on another. Maybe there is one, but I can't think of it. I can't even think of a situation where you would want to change the weights at all, let alone set them all to 1.

Let me know what you think. Thanks for the quick response.
I certainly hope to review the PR in good faith. If, by some miracle, a fuller sample properties proposal is ready and looks like it's heading to consensus before this change is, then we might decide not to take this. And my review does not mean that there will be no opposing review, or that it will get the second review required to merge in a short timeframe. I appreciate that this is a problem of its own, and would like to see it possible for users to do weighted scoring, even if we can't fix the more general problem. As to the warning: I think I'm okay with that proposal, although you might need to identify the use of weights in fit_params.
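A rough sketch of the kind of warning being discussed (entirely hypothetical: the helper name and the check below are illustrative, not part of this PR or of scikit-learn):

```python
import warnings


def warn_if_unweighted_scoring(fit_params, scorer_uses_weights):
    """Hypothetical helper: warn when weights influence fit but not the metric."""
    if fit_params and "sample_weight" in fit_params and not scorer_uses_weights:
        warnings.warn(
            "sample_weight was passed to fit via fit_params but will not be "
            "used when scoring; reported validation metrics are unweighted.",
            UserWarning,
        )


# Example: weights used for training, while the default scorer ignores them.
warn_if_unweighted_scoring({"sample_weight": [1.0, 2.0, 0.5]}, scorer_uses_weights=False)
```

Note that detecting the weights is not entirely trivial: inside a Pipeline the key would be prefixed with the step name (e.g. clf__sample_weight), which a naive check like this would miss.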
If by "available" you mean one that is likely to get developer consensus this year, I'd say it is :) |
I'm not sure you're right about never wanting unweighted scoring with weighted fitting. A simple case: balancing your classes in training can lead to a better fit, even when you want to evaluate against the training data distribution without weighting. (Of course we also have class_weight to do this kind of rebalancing, but it could be achieved through weighting.)
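An illustration of that counterexample (a minimal sketch using compute_sample_weight, not code from this PR): rebalance the classes through weights during fitting, then evaluate unweighted against the natural class distribution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced two-class problem: roughly 90% / 10%.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Upweight the minority class during fitting only.
w_tr = compute_sample_weight("balanced", y_tr)
clf = LogisticRegression().fit(X_tr, y_tr, sample_weight=w_tr)

# Evaluation deliberately uses no weights: we care about performance on the
# data distribution as it actually occurs.
print(clf.score(X_te, y_te))
```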
@jnothman are you going to ICML '19? Maybe we can meet up and talk about this issue, if you are. Otherwise, do you know any other maintainers that are going? I saw Gael has a paper, but I don't know who else would be best to talk to about this. Thanks.
No, sorry, and I'm not sure who is. Gaël does have an interest in the sample properties functionality. I think we would all like to see weighted scoring, but how quickly we can make it the default while keeping things backwards compatible is a trickier question. In the meantime, I have to find some time to properly propose API solutions for the broader problem.
@ryan-deak-zefr @jnothman is this PR still active?
This will be solved by #22083. |
Reference Issues/PRs
Fixes #4632. See also #10806.
What does this implement/fix? Explain your changes.
This PR adds sample_weight to metrics in cross validation routines by propagating the sample_weight parameter passed via fit_params. It does not provide the ability to use different sample_weight values in validation.
Any other comments?
A dask version of this PR has been created at dask/dask-ml#481
For more detailed information on the motivation for this PR, see http://deaktator.github.io/2019/03/10/the-error-in-the-comparator/
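Roughly, the behaviour this PR is after can be sketched as the following simplified helper (an illustration of the intended semantics rather than the actual implementation; the function name is made up): the same sample_weight used to fit each training fold is also passed to the metric for the corresponding validation fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def weighted_cross_val_accuracy(estimator, X, y, sample_weight, n_splits=5):
    """Illustrative only: weight both the fit and the per-fold metric."""
    scores = []
    for train, test in KFold(n_splits=n_splits).split(X):
        est = clone(estimator).fit(X[train], y[train],
                                   sample_weight=sample_weight[train])
        scores.append(accuracy_score(y[test], est.predict(X[test]),
                                     sample_weight=sample_weight[test]))
    return np.asarray(scores)
```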