FEA TargetEncoder should respect sample_weights #29110


Closed
wants to merge 7 commits into from

Conversation

MiguelParece

This commit updates the TargetEncoder to respect sample_weights. It fixes #28881 and modifies the files _target_encoder.py, _target_encoder_fast.pyx, and test_target_encoder.py.


Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
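
For illustration, this is roughly how the feature would be used once merged. The exact call signature is an assumption based on the PR description; released versions of TargetEncoder do not accept sample_weight here.

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["a"], ["b"], ["a"], ["b"], ["a"], ["b"]], dtype=object)
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
w = np.array([2.0, 1.0, 1.0, 1.0, 0.5, 1.0])

# Hypothetical call once this PR lands: the per-category means and the global
# mean used for shrinkage would be computed as weighted averages.
X_encoded = TargetEncoder().fit_transform(X, y, sample_weight=w)
```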

github-actions bot commented May 25, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 0b04adc.

@DuarteSJ
Contributor

DuarteSJ commented May 25, 2024

I still have a few questions that I would appreciate someone clarifying. @mayer79 @betatim

1. Is our interpretation of the meaning of sample_weight correct?

2. In cases where sample_weight is not passed to the methods, should we just create an array of 1's? Or should we avoid the unnecessary multiplications, even if it makes the code a bit less readable? (A rough sketch of both options follows below.)
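
For concreteness, a minimal sketch of the two options on toy data (illustrative only, not the PR's code; `_check_sample_weight` is the existing helper in sklearn.utils.validation that implements the first option):

```python
import numpy as np
from sklearn.utils.validation import _check_sample_weight

X = np.array([[0], [1], [1], [0]])
y_numeric = np.array([1.0, 0.0, 1.0, 1.0])
sample_weight = None  # what the encoder receives when the user passes nothing

# Option 1: normalize once so a single weighted code path is used everywhere.
# _check_sample_weight returns an array of ones when sample_weight is None.
w = _check_sample_weight(sample_weight, X, dtype=np.float64)
y_mean = np.average(y_numeric, weights=w)

# Option 2: branch and skip the extra multiplications in the unweighted case.
if sample_weight is None:
    y_mean = np.mean(y_numeric)
else:
    y_mean = np.average(y_numeric, weights=sample_weight)
```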

DuarteSJ and others added 5 commits May 25, 2024 17:44
Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
Type was wrong for sample weights.

Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
@ogrisel
Member

ogrisel commented May 27, 2024

Should we just create an array of 1's?
Or should we avoid the unnecessary multiplications, even if it results in generating code that is a bit less readable?

Can you measure the performance impact of setting them to 1s, as done in this PR, against the current state of main? I suppose it should not matter too much. If it's less than 30%, I wouldn't worry too much, as I doubt that TargetEncoder can ever become a performance bottleneck in ML pipelines.
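
A rough way to measure this, assuming the PR branch and main are installed in two separate environments and the script is run once in each (data shape and cardinality purely illustrative):

```python
import time

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.default_rng(0)
n_samples = 1_000_000
X = rng.integers(0, 100, size=(n_samples, 2))  # two integer-coded categorical columns
y = rng.normal(size=n_samples)

tic = time.perf_counter()
TargetEncoder().fit_transform(X, y)
print(f"fit_transform took {time.perf_counter() - tic:.2f}s")
```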

if sample_weight is not None:
    y_variance = np.sum(sample_weight * (y_numeric - y_mean) ** 2) / np.sum(
        sample_weight
    )
Member

@ogrisel ogrisel May 27, 2024


I think we might want to reuse _incremental_mean_and_var instead of reimplementing weighted mean and variance computation in this function.

Although I agree that _incremental_mean_and_var is a bit cumbersome to use when the incremental part is not needed.
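
For reference, a rough sketch of what that reuse could look like. `_incremental_mean_and_var` is a private helper that computes column-wise statistics on 2D input; the argument order and initial values below are assumptions and may need adjusting:

```python
import numpy as np
from sklearn.utils.extmath import _incremental_mean_and_var

y_numeric = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
sample_weight = np.array([1.0, 2.0, 1.0, 1.0, 1.0])

# A single, non-incremental pass: start from empty running statistics and let
# the helper fold in the whole weighted batch at once.
mean, var, count = _incremental_mean_and_var(
    y_numeric.reshape(-1, 1),      # the helper expects 2D, column-wise input
    0.0,                           # last_mean: no previous batch
    0.0,                           # last_variance: passing None instead skips the variance update
    np.zeros(1, dtype=np.int64),   # last_sample_count: one entry per column
    sample_weight=sample_weight,
)
y_mean, y_variance = float(mean[0]), float(var[0])
```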

Contributor


Mean always needs to be computed. Variance is only computed when smooth="auto". Should I just use _incremental_mean_and_var and always compute both of them, even if the variance isn't going to be used?

When sample weights are None, initialize them as an array of 1's
for better code readability and consistency.

Co-authored-by: Miguel Cabral Parece <miguelparece@tecnico.ulisboa.pt>
@DuarteSJ
Contributor

@ogrisel What do you think we should do?

@MiguelParece closed this by deleting the head repository on Oct 29, 2024
@bhavek-jamnadas

Hey, any reason this was closed?

Would love to have it implemented.

@DuarteSJ
Contributor

DuarteSJ commented Mar 23, 2025

Hey, any reason this was closed?

Would love to have it implemented.

Hey there! I believe I was close to finishing it (if I recall correctly it was fully implemented and working, but I am not 100% sure since it was long ago), but the guy who had started working on it with me closed the PR after 4 months of no response. At the time I opted for using frequency weights because of this comment, but in the issue it seems that it would be preferable to use general ones. If I get confirmation regarding the use of frequency weights vs. general weights, I can open a new PR that addresses this issue.
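
For readers following along, "frequency" (repetition) semantics mean that an integer weight of k on a row should produce the same fitted encodings as repeating that row k times. A small illustration of the property, hypothetical since released versions of TargetEncoder do not accept sample_weight:

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["a"], ["a"], ["b"], ["b"]], dtype=object)
y = np.array([1.0, 0.0, 1.0, 0.0])

# Weighting the first row by 2 (hypothetical API from this PR) ...
enc_weighted = TargetEncoder().fit(X, y, sample_weight=[2, 1, 1, 1])

# ... should produce the same encodings_ as physically repeating that row.
enc_repeated = TargetEncoder().fit(np.vstack([X[:1], X]), np.concatenate([y[:1], y]))
```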

@ogrisel
Member

ogrisel commented Mar 27, 2025

I believe we want the frequency semantics. One way to test for this would be to make check_sample_weight_equivalence_on_dense_data pass when TargetEncoder is pipelined with a downstream supervised estimator that also supports the sample weight semantics properly: make_pipeline(KBinsDiscretizer(encode="ordinal"), TargetEncoder(), Ridge()).

Note that preprocessing with KBinsDiscretizer(encode="ordinal") is necessary to generate category-like features from the continuous features used internally in check_sample_weight_equivalence_on_dense_data.
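
Roughly, the setup being described; the check lives in sklearn.utils.estimator_checks in recent development versions, and its exact name and calling convention may differ, so treat this as a sketch:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, TargetEncoder
from sklearn.utils.estimator_checks import check_sample_weight_equivalence_on_dense_data

# KBinsDiscretizer(encode="ordinal") turns the continuous features generated by
# the check into category-like integer codes that TargetEncoder can consume.
est = make_pipeline(KBinsDiscretizer(encode="ordinal"), TargetEncoder(), Ridge())

# The check fits once with sample_weight and once on an equivalently
# repeated/trimmed dataset, then asserts the predictions match.
check_sample_weight_equivalence_on_dense_data("Pipeline", est)
```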

@ogrisel
Member

ogrisel commented Apr 3, 2025

@DuarteSJ are you interested in reviving this PR or would you prefer someone else to take over?

@DuarteSJ
Contributor

DuarteSJ commented Apr 4, 2025

@DuarteSJ are you interested in reviving this PR or would you prefer someone else to take over?

Yes, I'm interested in reviving this PR. I’ll take some time to review your previous message to ensure I fully understand it. If I have any questions, I’ll follow up.
Should I open a new PR with the same name? Or what would be the standard practice here?

@DuarteSJ
Contributor

DuarteSJ commented May 3, 2025

@ogrisel

Successfully merging this pull request may close these issues: TargetEncoder should respect sample_weights
4 participants