TargetEncoder should respect sample_weights #28881

Open
mayer79 opened this issue Apr 24, 2024 · 10 comments
@mayer79
Contributor

mayer79 commented Apr 24, 2024

Describe the workflow you want to enable

The current implementation of TargetEncoder seems to calculate (shrunk) averages of y. In cases with sample_weight, it would be more natural to work with (shrunk) weighted averages.

Describe your proposed solution

When sample_weight is given, the shrunk averages should be replaced by the corresponding shrunk weighted averages.

However, I am not 100% sure whether sample_weight is accessible to a transformer.
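For illustration, a shrunk weighted average for one categorical column could look like the following sketch (plain NumPy, not scikit-learn's actual implementation; the function name and the fixed `smooth` term are hypothetical):

```python
import numpy as np

def weighted_target_encode(categories, y, sample_weight, smooth=10.0):
    """Encode each category by the weighted mean of y, shrunk toward
    the global weighted mean by a fixed `smooth` term (hypothetical
    sketch, not scikit-learn's actual estimator)."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    w = np.asarray(sample_weight, dtype=float)
    global_mean = np.average(y, weights=w)
    encoding = {}
    for cat in np.unique(categories):
        mask = categories == cat
        w_sum = w[mask].sum()
        cat_mean = np.average(y[mask], weights=w[mask])
        # With frequency-weight semantics, the weight sum plays the
        # role of the category count in the shrinkage factor.
        lam = w_sum / (w_sum + smooth)
        encoding[cat] = lam * cat_mean + (1.0 - lam) * global_mean
    return encoding
```

With frequency-weight semantics, an integer weight of 2 then behaves exactly like duplicating the row.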

Describe alternatives you've considered, if relevant

The alternative is to continue ignoring sample weights.

Additional context

No response

@mayer79 mayer79 added Needs Triage Issue requires triage New Feature labels Apr 24, 2024
@betatim betatim removed the Needs Triage Issue requires triage label Apr 25, 2024
@betatim
Member

betatim commented Apr 25, 2024

In principle, sample_weight can be accessed when fitting TargetEncoder, so at first sight this can be considered a new feature. There also seems to be some support for it (👍 on the issue).

@ogrisel
Member

ogrisel commented May 6, 2024

Related to #11316 for testing.

@MiguelParece

I am working on this.

@DuarteSJ
Contributor

DuarteSJ commented May 21, 2024

Hi! I am working on this issue with @MiguelParece. We have already incorporated sample weights to compute the weighted averages, but we are currently uncertain about whether we should also use them to calculate the variance when smooth="auto". I believe using sample weights for variance calculation is the standard approach, but we would like to confirm if this is indeed the best practice before proceeding with the implementation and testing.

@mayer79 @betatim

@mayer79
Contributor Author

mayer79 commented May 22, 2024

> Hi! I am working on this issue with @MiguelParece. We have already incorporated sample weights to compute the weighted averages, but we are currently uncertain about whether we should also use them to calculate the variance when smooth="auto". I believe using sample weights for variance calculation is the standard approach, but we would like to confirm if this is indeed the best practice before proceeding with the implementation and testing.
>
> @mayer79 @betatim

Good point: I'd go for a weighted variance. Now, there are different ways to calculate weighted variances (see e.g. https://numpy.org/doc/stable/reference/generated/numpy.cov.html), depending on the meaning of the "weights":

* frequency weights, where a value of 2 means "this row represents two observations", or
* general weights, where a value of 2.5 could mean "this observation is very important".

Which type of weights should we stick to, @lorentzenchr?
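Under frequency-weight semantics, one natural estimator is the weighted analogue of the biased (maximum-likelihood) variance, where the total weight takes the role of the sample count. A minimal sketch (my own illustration, not a proposed API):

```python
import numpy as np

def weighted_variance(y, sample_weight):
    """Biased (maximum-likelihood) weighted variance under
    frequency-weight semantics: a weight of 2 behaves as if the
    row appeared twice."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(sample_weight, dtype=float)
    mean = np.average(y, weights=w)
    # The denominator is the total weight, i.e. the effective
    # number of observations.
    return np.average((y - mean) ** 2, weights=w)
```

For integer weights this coincides with repeating rows: `weighted_variance([0, 1], [1, 2])` equals `np.var([0, 1, 1])`.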

@DuarteSJ
Contributor

> > Hi! I am working on this issue with @MiguelParece. We have already incorporated sample weights to compute the weighted averages, but we are currently uncertain about whether we should also use them to calculate the variance when smooth="auto". I believe using sample weights for variance calculation is the standard approach, but we would like to confirm if this is indeed the best practice before proceeding with the implementation and testing.
> >
> > @mayer79 @betatim
>
> Good point - I'd go for a weighted variance. Now, there are different ways to calculate weighted variances, see e.g. https://numpy.org/doc/stable/reference/generated/numpy.cov.html, which is related to the meaning of the "weights":
>
> * frequency weights: a value 2 means "the row represents two observations" or
> * general weights, where a value 2.5 could mean: "the observation is very important"
>
> What type of weights should we stick to @lorentzenchr ?

From what I understand from this comment on another issue regarding sample_weight, we should use frequency weights. Would you agree, @mayer79?

@lorentzenchr
Member

In the end, it should not matter much. I prefer general weights, as those are encountered more often in ML; frequency weights are a more special case (as in a survey).

@mayer79
Contributor Author

mayer79 commented May 30, 2024

Agreed, @lorentzenchr. Maybe mention in the docstring of the new argument which formula is used to calculate the weighted variance.

@ogrisel
Member

ogrisel commented Mar 27, 2025

check_sample_weight_equivalence_on_dense_data, which is part of our common test suite, checks for frequency-weight semantics. I think this is what we should strive for in general, but if those semantics are not reachable for a particular estimator for some reason, that reason should be made explicit in the XFAIL message of that particular estimator:

`PER_ESTIMATOR_XFAIL_CHECKS = {`
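The property that this check enforces can be illustrated without scikit-learn internals: under frequency-weight semantics, fitting with an integer sample_weight must match fitting on data with the rows repeated accordingly. A small weighted least-squares sketch (function and variable names are my own, for illustration):

```python
import numpy as np

def weighted_lstsq(X, y, w):
    # Solve min_beta sum_i w_i * (y_i - X_i @ beta)**2 via the
    # normal equations X' W X beta = X' W y.
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, X.T @ (w * y))

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.rand(20)
w = rng.randint(1, 4, size=20).astype(float)  # integer frequency weights

beta_weighted = weighted_lstsq(X, y, w)

# Materializing the repetitions yields the same objective,
# hence the same fit (up to floating-point arithmetic).
X_rep = np.repeat(X, w.astype(int), axis=0)
y_rep = np.repeat(y, w.astype(int))
beta_repeated = weighted_lstsq(X_rep, y_rep, np.ones(len(y_rep)))

assert np.allclose(beta_weighted, beta_repeated)
```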

@ogrisel
Member

ogrisel commented Mar 27, 2025

Related PR / discussion: #30564

6 participants