FEA TargetEncoder should respect sample_weights #29110


Closed
wants to merge 7 commits into from

Conversation

MiguelParece

This commit updates the TargetEncoder to respect sample_weights. It fixes #28881 and modifies the files _target_encoder.py, _target_encoder_fast.pyx, and test_target_encoder.py.


Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
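
For illustration, this is roughly how the feature would be used once merged. The exact call signature is an assumption based on the PR description; released versions of TargetEncoder do not accept sample_weight here.

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["a"], ["b"], ["a"], ["b"], ["a"], ["b"]], dtype=object)
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
w = np.array([2.0, 1.0, 1.0, 1.0, 0.5, 1.0])

# Hypothetical call once this PR lands: the per-category means and the global
# mean used for shrinkage would be computed as weighted averages.
X_encoded = TargetEncoder().fit_transform(X, y, sample_weight=w)
```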

github-actions bot commented May 25, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 0b04adc.

@DuarteSJ
Contributor

DuarteSJ commented May 25, 2024

I still have a few questions that I would appreciate someone clarifying. @mayer79 @betatim

1. Is our interpretation of the meaning of sample_weight correct?

2. In cases where sample_weight is not passed to the methods, should we just create an array of 1's? Or should we avoid the unnecessary multiplications, even if it makes the code a bit less readable? (A rough sketch of both options follows below.)
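
For concreteness, a minimal sketch of the two options on toy data (illustrative only, not the PR's code; `_check_sample_weight` is the existing helper in sklearn.utils.validation that implements the first option):

```python
import numpy as np
from sklearn.utils.validation import _check_sample_weight

X = np.array([[0], [1], [1], [0]])
y_numeric = np.array([1.0, 0.0, 1.0, 1.0])
sample_weight = None  # what the encoder receives when the user passes nothing

# Option 1: normalize once so a single weighted code path is used everywhere.
# _check_sample_weight returns an array of ones when sample_weight is None.
w = _check_sample_weight(sample_weight, X, dtype=np.float64)
y_mean = np.average(y_numeric, weights=w)

# Option 2: branch and skip the extra multiplications in the unweighted case.
if sample_weight is None:
    y_mean = np.mean(y_numeric)
else:
    y_mean = np.average(y_numeric, weights=sample_weight)
```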

DuarteSJ and others added 5 commits May 25, 2024 17:44
Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
Type was wrong for sample weights.

Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
@ogrisel
Member

ogrisel commented May 27, 2024

Should we just create an array of 1's?
Or should we avoid the unnecessary multiplications, even if it results in generating code that is a bit less readable?

Can you measure the performance impact of setting them to 1s, as done in this PR, against the current state of main? I suppose it should not matter too much. If it's less than 30%, I wouldn't worry too much, as I doubt that TargetEncoder can ever become a performance bottleneck in ML pipelines.
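
A rough way to measure this, assuming the PR branch and main are installed in two separate environments and the script is run once in each (data shape and cardinality purely illustrative):

```python
import time

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.default_rng(0)
n_samples = 1_000_000
X = rng.integers(0, 100, size=(n_samples, 2))  # two integer-coded categorical columns
y = rng.normal(size=n_samples)

tic = time.perf_counter()
TargetEncoder().fit_transform(X, y)
print(f"fit_transform took {time.perf_counter() - tic:.2f}s")
```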

if sample_weight is not None:
    y_variance = np.sum(sample_weight * (y_numeric - y_mean) ** 2) / np.sum(
        sample_weight
    )
Member

@ogrisel ogrisel May 27, 2024


I think we might want to reuse _incremental_mean_and_var instead of reimplementing weighted mean and variance computation in this function.

Although I agree that _incremental_mean_and_var is a bit cumbersome to use when the incremental part is not needed.
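
For reference, a rough sketch of what that reuse could look like. `_incremental_mean_and_var` is a private helper that computes column-wise statistics on 2D input; the argument order and initial values below are assumptions and may need adjusting:

```python
import numpy as np
from sklearn.utils.extmath import _incremental_mean_and_var

y_numeric = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
sample_weight = np.array([1.0, 2.0, 1.0, 1.0, 1.0])

# A single, non-incremental pass: start from empty running statistics and let
# the helper fold in the whole weighted batch at once.
mean, var, count = _incremental_mean_and_var(
    y_numeric.reshape(-1, 1),      # the helper expects 2D, column-wise input
    0.0,                           # last_mean: no previous batch
    0.0,                           # last_variance: passing None instead skips the variance update
    np.zeros(1, dtype=np.int64),   # last_sample_count: one entry per column
    sample_weight=sample_weight,
)
y_mean, y_variance = float(mean[0]), float(var[0])
```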

Contributor


Mean always needs to be computed. Variance is only computed when smooth="auto". Should I just use _incremental_mean_and_var and always compute both of them, even if the variance isn't going to be used?

When sample weights are None, initialize them as an array of 1's
for better code readability and consistency.

Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
@DuarteSJ
Contributor

@ogrisel What do you think we should do?

@MiguelParece closed this by deleting the head repository on Oct 29, 2024
@bhavek-jamnadas

Hey, any reason this was closed?

Would love to have it implemented.

@DuarteSJ
Contributor

DuarteSJ commented Mar 23, 2025

Hey, any reason this was closed?

Would love to have it implemented.

Hey there! I believe I was close to finishing it (if I recall correctly it was fully implemented and working, but I am not 100% sure since it was long ago), but the guy who had started working on it with me closed the PR after 4 months of no response. At the time I opted for using frequency weights because of this comment, but in the issue it seems that it would be preferable to use general ones. If I get confirmation regarding the use of frequency weights vs. general weights, I can open a new PR that addresses this issue.
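
For readers following along, "frequency" (repetition) semantics mean that an integer weight of k on a row should produce the same fitted encodings as repeating that row k times. A small illustration of the property, hypothetical since released versions of TargetEncoder do not accept sample_weight:

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["a"], ["a"], ["b"], ["b"]], dtype=object)
y = np.array([1.0, 0.0, 1.0, 0.0])

# Weighting the first row by 2 (hypothetical API from this PR) ...
enc_weighted = TargetEncoder().fit(X, y, sample_weight=[2, 1, 1, 1])

# ... should produce the same encodings_ as physically repeating that row.
enc_repeated = TargetEncoder().fit(np.vstack([X[:1], X]), np.concatenate([y[:1], y]))
```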

@ogrisel
Member

ogrisel commented Mar 27, 2025

I believe we want the frequency semantics. One way to test for this would be to make check_sample_weight_equivalence_on_dense_data pass when TargetEncoder is pipelined with a downstream supervised estimator that also supports the sample weight semantics properly: make_pipeline(KBinsDiscretizer(encode="ordinal"), TargetEncoder(), Ridge()).

Note that preprocessing with KBinsDiscretizer(encode="ordinal") is necessary to generate category-like features from the continuous features used internally in check_sample_weight_equivalence_on_dense_data.
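
Roughly, the setup being described; the check lives in sklearn.utils.estimator_checks in recent development versions, and its exact name and calling convention may differ, so treat this as a sketch:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, TargetEncoder
from sklearn.utils.estimator_checks import check_sample_weight_equivalence_on_dense_data

# KBinsDiscretizer(encode="ordinal") turns the continuous features generated by
# the check into category-like integer codes that TargetEncoder can consume.
est = make_pipeline(KBinsDiscretizer(encode="ordinal"), TargetEncoder(), Ridge())

# The check fits once with sample_weight and once on an equivalently
# repeated/trimmed dataset, then asserts the predictions match.
check_sample_weight_equivalence_on_dense_data("Pipeline", est)
```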

@ogrisel
Member

ogrisel commented Apr 3, 2025

@DuarteSJ are you interested in reviving this PR or would you prefer someone else to take over?

@DuarteSJ
Contributor

DuarteSJ commented Apr 4, 2025

@DuarteSJ are you interested in reviving this PR or would you prefer someone else to take over?

Yes, I'm interested in reviving this PR. I’ll take some time to review your previous message to ensure I fully understand it. If I have any questions, I’ll follow up.
Should I open a new PR with the same name? Or what would be the standard practice here?

@DuarteSJ
Contributor

DuarteSJ commented May 3, 2025

@ogrisel

Successfully merging this pull request may close these issues: TargetEncoder should respect sample_weights
4 participants