
Weighted scoring in cross validation (Closes #4632) #13432


Closed
wants to merge 29 commits

Conversation

@ryan-deak-zefr commented Mar 11, 2019

Reference Issues/PRs

Fixes #4632. See also #10806.

What does this implement/fix? Explain your changes.

This PR adds sample_weight to the metrics computed in the cross-validation routines by propagating the sample_weight parameter passed via fit_params. It does not provide the ability to use sample_weight values for validation that differ from those used for fitting.
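As a minimal illustration of the intended behaviour (a hand-rolled sketch assuming NumPy-array inputs, not the code in this PR), the same sample_weight used to fit each training fold is also passed to the metric on the corresponding validation fold:

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def weighted_cv_scores(estimator, X, y, sample_weight, cv=None):
    """Cross-validate, using sample_weight for both fitting and scoring."""
    cv = cv if cv is not None else KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        est = clone(estimator)
        # Fit with the training-fold weights ...
        est.fit(X[train_idx], y[train_idx],
                sample_weight=sample_weight[train_idx])
        # ... and score with the validation-fold weights, which is the
        # behaviour this PR adds to the built-in cross-validation routines.
        y_pred = est.predict(X[test_idx])
        scores.append(accuracy_score(y[test_idx], y_pred,
                                     sample_weight=sample_weight[test_idx]))
    return scores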

Any other comments?

A dask version of this PR has been created at dask/dask-ml#481

For more detailed information on the motivation for this PR, see http://deaktator.github.io/2019/03/10/the-error-in-the-comparator/

@ryan-deak-zefr (Author)

Currently, all but the following four Azure Pipelines tests are passing:

  • scikit-learn.scikit-learn
  • scikit-learn.scikit-learn (Linux py35_np_atlas)
  • scikit-learn.scikit-learn (Linux pylatest_conda)
  • scikit-learn.scikit-learn (Windows py37_64)

Based on the error logs, it seems that the root cause of the failures is a pickling issue. I'm assuming this is because the following functions are defined locally inside the fit method in _search.py: is_none, collapse_nones, weights_sums, fit_and_score_and_sw_sum. If these are moved outside fit, will this eliminate the pickling issue? Anyone know?

Error Logs

2019-03-11T21:36:45.6001234Z function = <function BaseSearchCV.fit.<locals>.fit_and_score_and_sw_sum at 0x7f811ed7a378>
2019-03-11T21:36:45.6001496Z check_pickle = True
2019-03-11T21:36:45.6001540Z 
2019-03-11T21:36:45.6001601Z     def delayed(function, check_pickle=True):
2019-03-11T21:36:45.6001696Z         """Decorator used to capture the arguments of a function.
2019-03-11T21:36:45.6002169Z     
2019-03-11T21:36:45.6002253Z         Pass `check_pickle=False` when:
2019-03-11T21:36:45.6002318Z     
2019-03-11T21:36:45.6002671Z         - performing a possibly repeated check is too costly and has been done
2019-03-11T21:36:45.6002863Z           already once outside of the call to delayed.
2019-03-11T21:36:45.6002950Z     
2019-03-11T21:36:45.6003229Z         - when used in conjunction `Parallel(backend='threading')`.
2019-03-11T21:36:45.6003321Z     
2019-03-11T21:36:45.6003379Z         """
2019-03-11T21:36:45.6003468Z         # Try to pickle the input function, to catch the problems early when
2019-03-11T21:36:45.6003544Z         # using with multiprocessing:
2019-03-11T21:36:45.6003627Z         if check_pickle:
2019-03-11T21:36:45.6003703Z >           pickle.dumps(function)
2019-03-11T21:36:45.6004034Z E           AttributeError: Can't pickle local object 'BaseSearchCV.fit.<locals>.fit_and_score_and_sw_sum'
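
The traceback pattern above is generic to locally defined functions: the standard pickle module cannot serialise them, so joblib cannot dispatch them to worker processes. A minimal illustration (not this PR's actual helpers) of why moving them to module scope avoids the error:

import pickle

def outer():
    def local_helper(x):          # closure-local, like fit_and_score_and_sw_sum
        return x + 1
    return local_helper

def module_level_helper(x):       # same logic, defined at module scope
    return x + 1

try:
    pickle.dumps(outer())         # AttributeError: Can't pickle local object ...
except AttributeError as exc:
    print(exc)

pickle.dumps(module_level_helper) # succeeds, so it can be sent to workers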

@jnothman (Member)

We have been stuck on this problem for a while. We can't just pass sample_weight in all cases as this breaks backwards compatibility. We could consider a switch which turns this behaviour on. Or we could consider (which we are trying) a more generic solution for passing around "sample properties" (#4497)

@ryan-deak-zefr (Author)

Hey @jnothman. I became aware of #4497 right after I started working on the patch and was surprised to see that this issue has been open / unresolved since April 2015. Do you have any sort of timeline?

We have been stuck on this problem for a while. We can't just pass sample_weight in all cases as this breaks backwards compatibility. We could consider a switch which turns this behaviour on. Or we could consider (which we are trying) a more generic solution for passing around "sample properties" (#4497)

When I started writing the patch for internal purposes, I asked a colleague whether this feature should be disabled by default (even though I didn't believe that was the correct solution) so that it would have a better chance of being merged to master, since there have been other unsuccessful attempts at merging a fix.

I was also wondering how the scikit-learn community views #4632. Do they see it as a bug? It seems clear based on comments in #4632 by @ndawe that this behavior is a bug.

Yes, weights should be accounted for. I suppose you already know I think that 😄

If they are not used, but your dataset is really supposed to be weighted, you can get a very different (suboptimal!) model.

I forgot I already have this implemented in my fork but it hasn't been merged yet. This was essential for my PhD research where sample weights could vary by orders of magnitude and just using a weight of 1 for all samples (anywhere in the learning process, including the GridSearchCV) would give poor results.

I also confirmed this in the article attached to the description of this PR (http://deaktator.github.io/2019/03/10/the-error-in-the-comparator/).

Is there something that can be done in the interim while waiting for a solution? Would the maintainers be open to the idea of a PR that issues a warning whenever sample_weight is passed in fit_params to a cross-validation fit method (e.g., in GridSearchCV or RandomizedSearchCV), indicating that importance weights are used during training but not in metric calculations during validation?

The reason I ask is that there is no real indication that this problem is occurring. I only stumbled on it by happenstance when I noticed that results seemed incorrect. I'm not sure how widespread the tribal knowledge surrounding #4632 is. I want to try to help make others aware of the issue so that they don't make improper assumptions about how the current code works.

(Thank you in advance for any response. I really do care very deeply about this issue.)

@jnothman changed the title from "Closes #4632" to "Weighted scoring in cross validation (Closes #4632)" Mar 12, 2019
@jnothman (Member)

I agree that this would sensibly be default behaviour, but I don't think we can really change the current behaviour and break backwards compatibility.

this issue has been open / unresolved since April 2015

#1574 has been open since January 2013. Timelines are very hard to establish when problems are hard to solve, and when the development is mostly by volunteers, etc. But I don't think this specific problem is that hard to solve, and I have wondered if we were mistaken to try to solve the more general problem. I would be happy to see an interim solution merged.

When you suggest just forwarding sample_weight to scoring, you're assuming the estimator's fit method also receives sample_weight. If the estimator is a Pipeline (in practice, it often should be), it cannot take sample_weight directly, as its fit parameters are of the form {step name}__sample_weight instead.
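
To make that concrete, with a Pipeline the weights only reach the final step under a prefixed name, so a search object cannot simply look for a parameter called sample_weight (step names and data below are arbitrary):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)
w = rng.rand(100)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]})

# The weights must be addressed to the pipeline step; there is no fit
# parameter literally named sample_weight for the search to recognise.
search.fit(X, y, clf__sample_weight=w)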

As an interim solution while sample props continues to be developed, I would consider merging a PR which adds a parameter such as weight_scoring (or weight_scoring_with) whose value is a string which specifies which fit parameter should be taken as sample weights for scoring. One thing I'm not sure about yet is whether that parameter should also be passed to the estimator's fit or whether it should be removed and used only for scoring. The latter is more flexible (it allows for weighted scoring despite unweighted fitting) but perhaps also more error prone (the user may not realise they are getting an unweighted fit). WDYT?
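
Continuing the snippet above, a purely hypothetical sketch of how such a parameter might read in user code (weight_scoring is only the proposed name, not an existing scikit-learn API):

search = GridSearchCV(
    pipe,
    {"clf__C": [0.1, 1.0, 10.0]},
    weight_scoring="clf__sample_weight",   # hypothetical: names the fit param
)                                          # to reuse as weights when scoring
search.fit(X, y, clf__sample_weight=w)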

@deaktator
@jnothman (@deaktator is the same person as @ryan-deak-zefr BTW)

I’ll work on making changes to the PR with your suggestions if you can kindly confirm you’ll honestly give this PR consideration so long as the code lives up to scikit-learn quality standards. I just don’t want to work on something that is futile. The only thing that I ask is that if the default behavior is to maintain backward compatibility, then a warning should be emitted when training with sample weights but not validating with sample weights. I think an issue peripheral to the statistical issues is that people don’t know this is happening. I want this warning to be emitted so that people know the issue exists and how to accommodate for it. I don’t think that API docs alone are really good enough for an issue of this scope.

I do appreciate the pipeline problem. If the most elegant solution available is to name the fit parameter to use, then I’ll do that.

I think that maintaining consistency in the use of weights in training and scoring is important. I can’t, off the top of my head, think of a counterexample where you would want to train on one distribution and test on another. Maybe there is one, but I can’t think of it. I can’t even think of a situation where you would want to change the weights at all, not just one where you would set them all to 1.

Let me know what you think. Thanks for the quick response.

@jnothman (Member)

I certainly hope to review the PR in good faith. If, by some miracle, a fuller sample properties proposal is ready and looks like it's heading to consensus before this change is, then we might decide not to take this. And my review does not mean that there will be no opposing review, or that it will get the second review required to merge in a short timeframe. I appreciate that this is a problem in its own right, and would like to see it possible for users to do weighted scoring, even if we can't fix the more general problem.

As to the warning: I think I'm okay with that proposal, although you might need to identify the use of weights by any(k.endswith('__sample_weight') or k == 'sample_weight' for k in fit_params). Do include that feature, but we may remove it or pull it into a separate PR to bypass conflict if it arises.
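
A minimal sketch of what that check and warning could look like (the helper name and message below are placeholders, not existing scikit-learn code):

import warnings

def _warn_if_scoring_unweighted(fit_params):
    """Warn when weights are passed for fitting but scoring will ignore them."""
    uses_weights = any(
        k.endswith("__sample_weight") or k == "sample_weight" for k in fit_params
    )
    if uses_weights:
        warnings.warn(
            "sample_weight was passed via fit_params and will be used for "
            "fitting, but validation metrics will be computed without it.",
            UserWarning,
        )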

@jnothman (Member)

I do appreciate the pipeline problem. If the most elegant solution available is to name the fit parameter to use, then I’ll do that.

If by "available" you mean one that is likely to get developer consensus this year, I'd say it is :)

@jnothman (Member)

jnothman commented Mar 12, 2019 via email

@ryan-deak-zefr (Author)

@jnothman are you going to ICML ‘19? Maybe we can meet up and talk about this issue, if you are. Otherwise, do you know any other maintainers that are going? I saw Gael has a paper, but I don’t know who else would be best to talk to about this. Thanks.

@jnothman (Member)

are you going to ICML ‘19?

No, sorry, and I'm not sure who is. Gaël does have an interest in the sample properties functionality. I think we would all like to see weighted scoring, but how quickly we can make it default when we want to make things backwards compatible is a trickier question. In the meantime, I have to find some time to properly propose API solutions for the broader problem.

@amueller mentioned this pull request Jul 14, 2019
@catapulta
@ryan-deak-zefr @jnothman is this PR still active?

@lorentzenchr (Member)

This will be solved by #22083.
