
Weighted scoring in cross validation (Closes #4632) #13432


Closed
wants to merge 29 commits

Conversation

@ryan-deak-zefr commented Mar 11, 2019

Reference Issues/PRs

Fixes #4632. See also #10806.

What does this implement/fix? Explain your changes.

This PR adds sample_weight to the metrics computed in the cross-validation routines by propagating the sample_weight parameter passed via fit_params. It does not provide the ability to use sample_weight values for validation that differ from those used for fitting.
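As a minimal illustration of the intended behaviour (a hand-rolled sketch assuming NumPy-array inputs, not the code in this PR), the same sample_weight used to fit each training fold is also passed to the metric on the corresponding validation fold:

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def weighted_cv_scores(estimator, X, y, sample_weight, cv=None):
    """Cross-validate, using sample_weight for both fitting and scoring."""
    cv = cv if cv is not None else KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        est = clone(estimator)
        # Fit with the training-fold weights ...
        est.fit(X[train_idx], y[train_idx],
                sample_weight=sample_weight[train_idx])
        # ... and score with the validation-fold weights, which is the
        # behaviour this PR adds to the built-in cross-validation routines.
        y_pred = est.predict(X[test_idx])
        scores.append(accuracy_score(y[test_idx], y_pred,
                                     sample_weight=sample_weight[test_idx]))
    return scores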

Any other comments?

A dask version of this PR has been created at dask/dask-ml#481

For more detailed information on the motivation for this PR, see http://deaktator.github.io/2019/03/10/the-error-in-the-comparator/

@ryan-deak-zefr (Author)

Currently, all but the following four Azure Pipelines tests are passing:

  • scikit-learn.scikit-learn
  • scikit-learn.scikit-learn (Linux py35_np_atlas)
  • scikit-learn.scikit-learn (Linux pylatest_conda)
  • scikit-learn.scikit-learn (Windows py37_64)

Based on the error logs, it seems that the root cause of the failures is a pickling issue. I'm assuming this is because the following functions are defined locally inside the fit method in _search.py: is_none, collapse_nones, weights_sums, fit_and_score_and_sw_sum. If these are moved outside fit, will this eliminate the pickling issue? Anyone know?

Error Logs

2019-03-11T21:36:45.6001234Z function = <function BaseSearchCV.fit.<locals>.fit_and_score_and_sw_sum at 0x7f811ed7a378>
2019-03-11T21:36:45.6001496Z check_pickle = True
2019-03-11T21:36:45.6001540Z 
2019-03-11T21:36:45.6001601Z     def delayed(function, check_pickle=True):
2019-03-11T21:36:45.6001696Z         """Decorator used to capture the arguments of a function.
2019-03-11T21:36:45.6002169Z     
2019-03-11T21:36:45.6002253Z         Pass `check_pickle=False` when:
2019-03-11T21:36:45.6002318Z     
2019-03-11T21:36:45.6002671Z         - performing a possibly repeated check is too costly and has been done
2019-03-11T21:36:45.6002863Z           already once outside of the call to delayed.
2019-03-11T21:36:45.6002950Z     
2019-03-11T21:36:45.6003229Z         - when used in conjunction `Parallel(backend='threading')`.
2019-03-11T21:36:45.6003321Z     
2019-03-11T21:36:45.6003379Z         """
2019-03-11T21:36:45.6003468Z         # Try to pickle the input function, to catch the problems early when
2019-03-11T21:36:45.6003544Z         # using with multiprocessing:
2019-03-11T21:36:45.6003627Z         if check_pickle:
2019-03-11T21:36:45.6003703Z >           pickle.dumps(function)
2019-03-11T21:36:45.6004034Z E           AttributeError: Can't pickle local object 'BaseSearchCV.fit.<locals>.fit_and_score_and_sw_sum'
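
The traceback pattern above is generic to locally defined functions: the standard pickle module cannot serialise them, so joblib cannot dispatch them to worker processes. A minimal illustration (not this PR's actual helpers) of why moving them to module scope avoids the error:

import pickle

def outer():
    def local_helper(x):          # closure-local, like fit_and_score_and_sw_sum
        return x + 1
    return local_helper

def module_level_helper(x):       # same logic, defined at module scope
    return x + 1

try:
    pickle.dumps(outer())         # AttributeError: Can't pickle local object ...
except AttributeError as exc:
    print(exc)

pickle.dumps(module_level_helper) # succeeds, so it can be sent to workers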

@jnothman (Member)

We have been stuck on this problem for a while. We can't just pass sample_weight in all cases as this breaks backwards compatibility. We could consider a switch which turns this behaviour on. Or we could consider (which we are trying) a more generic solution for passing around "sample properties" (#4497)

@ryan-deak-zefr (Author)

Hey @jnothman. I became aware of #4497 right after I started working on the patch and was surprised to see that this issue has been open / unresolved since April 2015. Do you have any sort of timeline?

We have been stuck on this problem for a while. We can't just pass sample_weight in all cases as this breaks backwards compatibility. We could consider a switch which turns this behaviour on. Or we could consider (which we are trying) a more generic solution for passing around "sample properties" (#4497)

When I started writing the patch for internal purposes, I asked a colleague whether this feature should be disabled by default (even though I didn't believe that was the correct solution) so that it would have a better chance of being merged to master, since there have been other unsuccessful attempts at merging a fix.

I was also wondering how the scikit-learn community views #4632. Do they see it as a bug? It seems clear based on comments in #4632 by @ndawe that this behavior is a bug.

Yes, weights should be accounted for. I suppose you already know I think that 😄

If they are not used, but your dataset is really supposed to be weighted, you can get a very different (suboptimal!) model.

I forgot I already have this implemented in my fork but it hasn't been merged yet. This was essential for my PhD research where sample weights could vary by orders of magnitude and just using a weight of 1 for all samples (anywhere in the learning process, including the GridSearchCV) would give poor results.

I also confirmed this in the article attached to the description of this PR (http://deaktator.github.io/2019/03/10/the-error-in-the-comparator/).

Is there something that can be done in the interim while waiting for a solution? Would the maintainers be open to the idea of a PR that issues a warning whenever sample_weight is passed in fit_params to a cross-validation fit method (e.g., in GridSearchCV or RandomizedSearchCV), indicating that importance weights are used during training but not in metric calculations during validation?

The reason I ask is that there is no real indication that this problem is occurring. I only stumbled on it by happenstance when I noticed that results seemed incorrect. I'm not sure how widespread the tribal knowledge surrounding #4632 is. I want to try to help make others aware of the issue so that they don't make improper assumptions about how the current code works.

(Thank you in advance for any response. I really do care very deeply about this issue.)

@jnothman changed the title from "Closes #4632" to "Weighted scoring in cross validation (Closes #4632)" Mar 12, 2019
@jnothman (Member)

I agree that this would sensibly be default behaviour, but I don't think we can really change the current behaviour and break backwards compatibility.

this issue has been open / unresolved since April 2015

#1574 has been open since January 2013. Timelines are very hard to establish when problems are hard to solve, and when the development is mostly by volunteers, etc. But I don't think this specific problem is that hard to solve, and I have wondered if we were mistaken to try to solve the more general problem. I would be happy to see an interim solution merged.

When you suggest just forwarding sample_weight to scoring, you're assuming the estimator's fit method also receives sample_weight. If the estimator is a Pipeline (in practice, it often should be), it cannot take sample_weight directly, as its fit parameters are of the form {step name}__sample_weight instead.
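
To make that concrete, with a Pipeline the weights only reach the final step under a prefixed name, so a search object cannot simply look for a parameter called sample_weight (step names and data below are arbitrary):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)
w = rng.rand(100)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]})

# The weights must be addressed to the pipeline step; there is no fit
# parameter literally named sample_weight for the search to recognise.
search.fit(X, y, clf__sample_weight=w)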

As an interim solution while sample props continues to be developed, I would consider merging a PR which adds a parameter such as weight_scoring (or weight_scoring_with) whose value is a string which specifies which fit parameter should be taken as sample weights for scoring. One thing I'm not sure about yet is whether that parameter should also be passed to the estimator's fit or whether it should be removed and used only for scoring. The latter is more flexible (it allows for weighted scoring despite unweighted fitting) but perhaps also more error prone (the user may not realise they are getting an unweighted fit). WDYT?
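
Continuing the snippet above, a purely hypothetical sketch of how such a parameter might read in user code (weight_scoring is only the proposed name, not an existing scikit-learn API):

search = GridSearchCV(
    pipe,
    {"clf__C": [0.1, 1.0, 10.0]},
    weight_scoring="clf__sample_weight",   # hypothetical: names the fit param
)                                          # to reuse as weights when scoring
search.fit(X, y, clf__sample_weight=w)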

@deaktator
@jnothman (@deaktator is the same person as @ryan-deak-zefr BTW)

I’ll work on making changes to the PR with your suggestions if you can kindly confirm you’ll honestly give this PR consideration so long as the code lives up to scikit-learn quality standards. I just don’t want to work on something that is futile. The only thing that I ask is that if the default behavior is to maintain backward compatibility, then a warning should be emitted when training with sample weights but not validating with sample weights. I think an issue peripheral to the statistical issues is that people don’t know this is happening. I want this warning to be emitted so that people know the issue exists and how to accommodate for it. I don’t think that API docs alone are really good enough for an issue of this scope.

I do appreciate the pipeline problem. If the most elegant solution available is to name the fit parameter to use, then I’ll do that.

I think that maintaining consistency in the use of weights in training and scoring is important. I can’t, off the top of my head, think of a counterexample where you would want to train on one distribution and test on another. Maybe there is one, but I can’t think of it. I can’t even think of a situation where you would want to change the weights at all, not just one where you would set them all to 1.

Let me know what you think. Thanks for the quick response.

@jnothman (Member)

I certainly hope to review the PR in good faith. If, by some miracle, a fuller sample properties proposal is ready and looks like it's heading to consensus before this change is, then we might decide not to take this. And my review does not mean that there will be no opposing review, or that it will get the second review required to merge in a short timeframe. I appreciate that this is a problem in its own right, and would like to see it possible for users to do weighted scoring, even if we can't fix the more general problem.

As to the warning: I think I'm okay with that proposal, although you might need to identify the use of weights by any(k.endswith('__sample_weight') or k == 'sample_weight' for k in fit_params). Do include that feature, but we may remove it or pull it into a separate PR to bypass conflict if it arises.
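
A minimal sketch of what that check and warning could look like (the helper name and message below are placeholders, not existing scikit-learn code):

import warnings

def _warn_if_scoring_unweighted(fit_params):
    """Warn when weights are passed for fitting but scoring will ignore them."""
    uses_weights = any(
        k.endswith("__sample_weight") or k == "sample_weight" for k in fit_params
    )
    if uses_weights:
        warnings.warn(
            "sample_weight was passed via fit_params and will be used for "
            "fitting, but validation metrics will be computed without it.",
            UserWarning,
        )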

@jnothman (Member)

I do appreciate the pipeline problem. If the most elegant solution available is to name the fit parameter to use, then I’ll do that.

If by "available" you mean one that is likely to get developer consensus this year, I'd say it is :)

@jnothman (Member)

jnothman commented Mar 12, 2019 via email

@ryan-deak-zefr (Author)

@jnothman are you going to ICML ‘19? Maybe we can meet up and talk about this issue, if you are. Otherwise, do you know any other maintainers that are going? I saw Gael has a paper, but I don’t know who else would be best to talk to about this. Thanks.

@jnothman (Member)

are you going to ICML ‘19?

No, sorry, and I'm not sure who is. Gaël does have an interest in the sample properties functionality. I think we would all like to see weighted scoring, but how quickly we can make it default when we want to make things backwards compatible is a trickier question. In the meantime, I have to find some time to properly propose API solutions for the broader problem.

@amueller mentioned this pull request Jul 14, 2019
@catapulta
@ryan-deak-zefr @jnothman is this PR still active?

@lorentzenchr (Member)

This will be solved by #22083.
