FEA TargetEncoder should respect sample_weights #29110
Conversation
This commit updates the TargetEncoder to respect sample_weights. This addresses issue #28881 and modifies the files: _target_encoder.py, _target_encoder_fast.pyx, and test_target_encoder.py. Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
I still have a few questions that I would appreciate if someone could clarify. @mayer79 @betatim Is our interpretation of the meaning of sample weights correct in cases where [...]?
Type was wrong for sample weights. Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
Can you measure the performance impact of setting them to 1s, as done in this PR, against the current state?
if sample_weight is not None:
    y_variance = np.sum(sample_weight * (y_numeric - y_mean) ** 2) / np.sum(
        sample_weight
    )
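For context, here is a self-contained sketch of the weighted statistics the snippet above computes. The helper name `weighted_mean_variance` is illustrative, not part of scikit-learn:

```python
import numpy as np

def weighted_mean_variance(y, sample_weight=None):
    """Weighted mean and (biased) variance, mirroring the diff above.

    Illustrative helper, not scikit-learn code. With frequency-weight
    semantics, an integer weight k behaves like repeating a sample k times.
    """
    y = np.asarray(y, dtype=np.float64)
    if sample_weight is None:
        # Same convention discussed in this PR: missing weights act as ones.
        sample_weight = np.ones_like(y)
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    y_mean = np.average(y, weights=sample_weight)
    # Weighted sum of squared deviations divided by the total weight,
    # exactly as in the snippet above.
    y_variance = np.sum(sample_weight * (y - y_mean) ** 2) / np.sum(sample_weight)
    return y_mean, y_variance
```

For example, a weight of 2 on a sample gives the same mean and variance as duplicating that sample with unit weights.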
I think we might want to reuse _incremental_mean_and_var instead of reimplementing weighted mean and variance computation in this function, although I agree that _incremental_mean_and_var is a bit cumbersome to use when the incremental part is not needed.
The mean always needs to be computed, but the variance is only computed when smooth="auto". Should I just use _incremental_mean_and_var and always compute both of them, even if the variance isn't going to be used?
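To illustrate why the variance is only needed for smooth="auto": with a numeric smooth, the encoding can be built from per-category weighted sums and the global weighted mean alone. A rough sketch of the weighted version for a numeric smooth (this is a hypothetical illustration, not the actual scikit-learn implementation):

```python
import numpy as np

def encode_categories(categories, y, sample_weight, smooth):
    """Shrink each weighted category mean toward the weighted global mean.

    Illustrative sketch for a *numeric* smooth; smooth="auto" would
    additionally require the target variance, which is why the variance
    computation discussed above can otherwise be skipped.
    """
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=np.float64)
    w = np.asarray(sample_weight, dtype=np.float64)
    y_mean = np.average(y, weights=w)
    encodings = {}
    for cat in np.unique(categories):
        mask = categories == cat
        # Weighted category sum and total category weight replace the
        # unweighted sum and count.
        cat_sum = np.sum(w[mask] * y[mask])
        cat_weight = np.sum(w[mask])
        encodings[cat] = (cat_sum + smooth * y_mean) / (cat_weight + smooth)
    return encodings
```

With smooth=0 this reduces to the plain weighted category mean; as smooth grows, every encoding shrinks toward the global mean.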
When sample_weight is None, initialize it as an array of ones for better code readability and consistency. Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
@ogrisel What do you think we should do?
Hey, any reason this was closed? Would love to have it implemented.
Hey there! I believe I was close to finishing it (if I recall correctly it was fully implemented and working, but I am not 100% sure since it was long ago), but the person who had started working on it with me closed the PR after 4 months of no response. At the time I opted for frequency weights because of this comment, but in the issue it seems that general weights would be preferable. If I get confirmation regarding the use of frequency weights vs. general weights, I can open a new PR that addresses this issue.
I believe we want the frequency semantics. One way to test for this would be to make this pass the [...] check. Note that preprocessing with [...]
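The frequency semantics mentioned above can be tested without any scikit-learn internals: fitting with an integer weight k on a row must match fitting with that row repeated k times at unit weight. A minimal sketch of that property, where `fit_stats` is a hypothetical stand-in for whatever statistics the encoder fits:

```python
import numpy as np

def fit_stats(y, sample_weight):
    # Hypothetical stand-in for the statistics TargetEncoder would fit
    # from (y, sample_weight): the weighted mean and biased variance.
    y = np.asarray(y, dtype=np.float64)
    w = np.asarray(sample_weight, dtype=np.float64)
    mean = np.average(y, weights=w)
    var = np.sum(w * (y - mean) ** 2) / np.sum(w)
    return mean, var

# Frequency semantics: integer weights are equivalent to row repetition.
y = np.array([1.0, 2.0, 4.0])
w = np.array([3, 1, 2])
y_repeated = np.repeat(y, w)
assert np.allclose(fit_stats(y, w), fit_stats(y_repeated, np.ones_like(y_repeated)))
```

An equivalence check of this shape would fail for "precision" (inverse-variance) weight semantics, which is what makes it a useful discriminator between the two interpretations.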
@DuarteSJ are you interested in reviving this PR, or would you prefer someone else to take over?
Yes, I'm interested in reviving this PR. I’ll take some time to review your previous message to ensure I fully understand it. If I have any questions, I’ll follow up. |