TargetEncoder should respect sample_weights #28881
Comments
In principle.
Related to #11316 for testing.
I am working on this.
Hi! I am working on this issue with @MiguelParece. We have already incorporated sample weights to compute the weighted averages, but we are currently uncertain about whether we should also use them to calculate the variance when `smooth="auto"`.
Good point - I'd go for a weighted variance. Now, there are different ways to calculate weighted variances, see e.g. https://numpy.org/doc/stable/reference/generated/numpy.cov.html, which is related to the meaning of the "weights": `fweights` are integer frequency weights, where each observation counts as if it were repeated that many times, while `aweights` are general reliability (importance) weights, for which only the relative sizes matter.
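A minimal sketch contrasting the two conventions (the data and weights here are purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])

# Frequency weights: integer repetition counts. Giving the first
# observation a weight of 2 is identical to duplicating it.
var_fw = np.cov(x, fweights=[2, 1, 1])
var_dup = np.cov([1.0, 1.0, 2.0, 4.0])
assert np.isclose(var_fw, var_dup)  # both equal 2.0

# Analytic ("general") weights: relative importance of each point.
# The normalization differs, so the result is not the duplication one.
var_aw = np.cov(x, aweights=[2.0, 1.0, 1.0])
print(float(var_fw), float(var_aw))  # 2.0 vs. 2.4
```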
What type of weights should we stick to, @lorentzenchr?
From what I understand from this comment on another issue regarding …
In the end, it should not matter that much. I prefer the general weights, as those are encountered more often in ML, and frequency weights are just more special (like in a survey).
Agreed, @lorentzenchr. Maybe mention in the docstring of the new argument which formula is used to calculate the weighted variance.
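For concreteness, a sketch of the "general weights" formula such a docstring could describe, normalizing by the sum of weights rather than an observation count (the helper name is hypothetical):

```python
import numpy as np

def weighted_mean_and_var(y, sample_weight):
    """Weighted mean and (biased) weighted variance under general weights.

    Only the relative sizes of the weights matter: scaling all weights
    by a constant leaves both results unchanged.
    """
    y = np.asarray(y, dtype=float)
    mean = np.average(y, weights=sample_weight)
    var = np.average((y - mean) ** 2, weights=sample_weight)
    return mean, var

y = np.array([0.0, 1.0, 1.0, 3.0])
# With unit weights this reduces to the ordinary mean and np.var:
print(weighted_mean_and_var(y, np.ones_like(y)))  # (1.25, 1.1875)
```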
|
Related PR / discussion: #30564
Describe the workflow you want to enable
The current implementation of `TargetEncoder` seems to calculate (shrunk) averages of `y`. In cases with `sample_weights`, it would be more natural to work with (shrunk) weighted averages.

Describe your proposed solution
In case of `sample_weights`, shrunk averages should be replaced by corresponding shrunk weighted averages, as sketched below. However, I am not 100% sure if `sample_weights` are accessible by a transformer.

Describe alternatives you've considered, if relevant
The alternative is to continue ignoring sample weights.
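To make the proposed solution concrete, here is a minimal sketch of the weighted shrinkage, assuming the fixed-`smooth` formula documented for `TargetEncoder` (shrinkage factor n_i / (n_i + smooth)) with category counts replaced by sums of sample weights. The helper is hypothetical, not the actual implementation:

```python
import numpy as np

def weighted_shrunk_means(categories, y, sample_weight, smooth=10.0):
    """Per-category shrunk *weighted* target means (illustrative only)."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    global_mean = np.average(y, weights=sample_weight)
    encodings = {}
    for cat in np.unique(categories):
        mask = categories == cat
        w_sum = sample_weight[mask].sum()  # replaces the raw count n_i
        cat_mean = np.average(y[mask], weights=sample_weight[mask])
        lam = w_sum / (w_sum + smooth)  # shrinkage toward the global mean
        encodings[cat] = lam * cat_mean + (1.0 - lam) * global_mean
    return encodings

cats = ["a", "a", "b", "b", "b"]
y = [1.0, 0.0, 1.0, 1.0, 0.0]
# With unit weights this would match the unweighted shrunk means:
print(weighted_shrunk_means(cats, y, sample_weight=[1, 2, 1, 1, 1]))
```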
Additional context
No response