Added sample weight handling to BinMapper under HGBT #29641

snath-xoc · 2024-08-08T21:46:57Z

Calls sample weight within BinMapper and passes it on to _find_binning_thresholds where bin midpoints are calculated using weighted percentile. Sample weights also passed to rng.choice() subsampling in BinMapper for samples larger than 2e5

NOTE: when the n_bins<discrete_values the best workaround was to set the bins as the midpoints. In future, it may be worth getting rid of this altogether, however at the risk of getting inhomogeneous array from weighted_percentile. We will need to agree on the best methods of trimming.

Major changes proposed:

Subsampling with weights after which sample_weight is not propagated in _BinMapper (set to None)
Tests for sample_weight invariance for _BinMapper under deterministic case added
Some tests failed with the new changes and n_samples had to be increased e.g., test_interaction_cst_numerically
Some tests failed under special cases when sample weights are passed through but distinct values<max_bins. To be checked what to do
Statistical test added in comments below

NOTE: this also allows HGBT to pass weighted tests for more than 256 samples (but still less than 2e6)

TO DO:

Update tests under test_gradient_boosting so that sample weight is passed as positional argument
Fix test failures, something fishy changes such that e.g., test_interaction_cst_numerically is not working
Modified tests to have larger number of samples otherwise expected results are not attaines (i.e., r2 value threshold etc. is not reached).... was expecting this to be the other way so not sure if I set off another bug.
When distinct values are less than max_bins and sample weights are given we get a discrepancy between weighted and repeated under special cases (see test_zero_sample_weights_classification under test_gradient_boosting.py). Need to see what to do here.
Further test failure when max_bins are specified as > distinct values under weighted and repeated cases
Check that sample weight invariance holds under deterministic case (test added under test_binning)
For the case n_samples > subsample conduct a statistical analysis to check for that the repeated/reweighted equivalence holds for the binning procedure in expectation, similar to issue Incorrect sample weight handling in KBinsDiscretizer #29906. Conduct analysis for many different values random_state for match of the mean bin edges.

github-actions · 2024-08-08T21:48:23Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: c29afcb. Link to the linter CI: here}

ogrisel

I did not analyse the test failures yet but here is some early feedback.

For the case n_samples > subsample we would need to conduct a statistical analysis to check for that the repeated/reweighted equivalence holds for the binning procedure in expectation.

Once the tests pass for the deterministic case, we should conduct such an analysis (e.g. using a side notebook, not included in the repo, where we rerun the binning for many different values random_state and then check for match of the mean bin edges).

Please add a TODO item to the description of this PR not to forget about this.

sklearn/ensemble/_hist_gradient_boosting/binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

…oosting.py Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

snath-xoc · 2025-03-21T13:29:43Z

Statistical tests were conducted for n_samples>subsample (subsample=int(2e5)) using the following code:

import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
from scipy.stats import kstest
import matplotlib.pyplot as plt

BONFERRONI_CORRECTION = 1 #To adjust later according  to agreed test dim.

rng = np.random.RandomState(0)

n_samples = int(3e5)
X = rng.randint(0, 30, size=(n_samples,3))
# Use random integers (including zero) as weights.
sw = rng.randint(0, 5, size=n_samples)

X_repeated = np.repeat(X, sw,axis=0)
assert len(X_repeated)>int(2e5)

bin_thresholds_weighted=[]
bin_thresholds_repeated = []
for seed in np.arange(100):
    est_weighted = _BinMapper(n_bins=6, random_state=seed).fit(
                X, sample_weight=sw
            )

    est_repeated = _BinMapper(n_bins=6, random_state=seed+500).fit(
                X_repeated, sample_weight=None)

    bin_thresholds_weighted.append(est_weighted.bin_thresholds_)
    bin_thresholds_repeated.append(est_repeated.bin_thresholds_)

bin_thresholds_weighted = np.asarray(bin_thresholds_weighted)
bin_thresholds_repeated = np.asarray(bin_thresholds_repeated)

fig,axs = plt.subplots(3,4,figsize=(14,12))

j=0
for i,ax in enumerate(axs.flatten()):
    if i>0 and i%4==0:
        j+=1
    ax.hist(bin_thresholds_weighted[:,j,i%4].flatten())
    ax.hist(bin_thresholds_repeated[:,j,i%4].flatten(),alpha=0.5)
    pval = kstest(bin_thresholds_weighted[:,j,i%4].flatten(),
                                    bin_thresholds_repeated[:,j,i%4].flatten()).pvalue
    if pval<(0.05*BONFERRONI_CORRECTION):
        ax.set_title(f'p-value: {pval:.4f},failed')
    else:
        ax.set_title(f'p-value: {pval:.4f},passed')

The output is as follows:

snath-xoc · 2025-03-21T13:40:08Z

The ARM lint test is still failing due to the test_find_binning_thresholds_small_regular_data assertion error but I can't reproduce it locally.

ogrisel

Modified tests to have larger number of samples otherwise expected results are not attaines (i.e., r2 value threshold etc. is not reached).... was expecting this to be the other way so not sure if I set off another bug.

This is annoying, but it's probably caused by the fact that we switched from np.percentile(..., method="linear") to np.percentile(..., method="averaged_inverted_cdf") in the sample_weight=None case. Assuming this change is really the cause and since it's necessary to fix the weight/repetition equivalence, we might consider this behavior change as a bug fix. It should be clearly documented as such in the changelog, possibly with a dedicated note in the "Changed models" section of the release notes to better warn the users about this change.

Another concern is that _averaged_weighted_percentile implementation is currently very naive and will cause a performance regression whenever the users passes sample_weight != None. We might want to postpone the final merge of this PR to wait for the review and merge of #30945 first.

ogrisel · 2025-03-21T14:07:41Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        func = _averaged_weighted_percentile
+        if len(distinct_values) <= max_bins:
+            max_bins = len(distinct_values)
+            func = _weighted_percentile


Why would _averaged_weighted_percentile not work when len(distinct_values) <= max_bins?

>>> import numpy as np >>> from sklearn.utils.stats import _weighted_percentile, _averaged_weighted_percentile >>> a = np.random.randint(0, 10, size=10000).astype(np.float64) >>> max_bins = 30 >>> max_bins > np.unique(a).shape[0] True >>> np.array( ... [_averaged_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles] ... ) array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5., 5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.]) >>> np.array( ... [_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles] ... ) array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5., 5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])

When distinct values are less than max_bins and sample weights are given we get a discrepancy between weighted and repeated under special cases (see test_zero_sample_weights_classification under test_gradient_boosting.py). Need to see what to do here.

The manual midpoints between distinct values in the weighted case is probably responsible for this discrepancy.

We should probably explore if it's possible to post process the output of _averaged_weighted_percentile to trim duplicated thresholds. If the number of trimmed thresholds is lower than max_bins then it might be possible to further post-process to recover thresholds that match the unweighted case by shifting the thresholds to use 0.5 midpoints.

sklearn/ensemble/_hist_gradient_boosting/binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

sklearn/ensemble/_hist_gradient_boosting/binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

snath-xoc · 2025-06-13T10:28:02Z

@ogrisel reviving this, i added a changelog, let me know what you think!

ogrisel

Here is a new pass of review:

sklearn/ensemble/_hist_gradient_boosting/binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

sklearn/ensemble/_hist_gradient_boosting/binning.py

ogrisel · 2025-07-02T16:02:31Z

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

@@ -198,6 +198,27 @@ def test_bin_mapper_repeated_values_invariance(n_distinct):
    assert_array_equal(binned_1, binned_2)


+@pytest.mark.parametrize("n_bins", [50, None])
+def test_binmapper_weighted_vs_repeated_equivalence(global_random_seed, n_bins):


I found an occurrence of a test failure when global_random_seed=63 when running:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -vl sklearn/ensemble/_hist_gradient_boosting/tests -k test_binmapper_weighted_vs_repeated_equivalence

Can you reproduce?

Here is the error message I get:

=========================================================================== test session starts ============================================================================ platform darwin -- Python 3.13.3, pytest-8.4.0, pluggy-1.6.0 -- /Users/ogrisel/miniforge3/envs/dev/bin/python3.13 cachedir: .pytest_cache rootdir: /Users/ogrisel/code/scikit-learn configfile: pyproject.toml plugins: xdist-3.7.0, run-parallel-0.4.3, anyio-4.9.0 collected 451 items / 449 deselected / 2 selected Collected 0 items to run in parallel sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py::test_binmapper_weighted_vs_repeated_equivalence[63-50] FAILED [ 50%] sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py::test_binmapper_weighted_vs_repeated_equivalence[63-None] PASSED [100%] ================================================================================= FAILURES ================================================================================= __________________________________________________________ test_binmapper_weighted_vs_repeated_equivalence[63-50] __________________________________________________________ global_random_seed = 63, n_bins = 50 @pytest.mark.parametrize("n_bins", [50, None]) def test_binmapper_weighted_vs_repeated_equivalence(global_random_seed, n_bins): rng = np.random.RandomState(global_random_seed) n_samples = 200 X = rng.randn(n_samples, 3) if n_bins is None: n_bins = np.unique(X[:, rng.randint(3)]).shape[0] + rng.randint(5) + 1 sw = rng.randint(0, 5, size=n_samples) X_repeated = np.repeat(X, sw, axis=0) est_weighted = _BinMapper(n_bins=n_bins).fit(X, sample_weight=sw) est_repeated = _BinMapper(n_bins=n_bins).fit(X_repeated, sample_weight=None) > assert_allclose(est_weighted.bin_thresholds_, est_repeated.bin_thresholds_) E AssertionError: E Not equal to tolerance rtol=1e-07, atol=0 E E Mismatched elements: 6 / 144 (4.17%) E Max absolute difference among violations: 0.022745 E Max relative difference among violations: 0.02664642 E ACTUAL: array([[-1.712595, -1.285816, -1.226962, -1.009446, -0.960101, -0.882069, E -0.83596 , -0.69571 , -0.675828, -0.603487, -0.399731, -0.361047, E -0.340069, -0.237937, -0.207451, -0.087749, -0.075609, 0.032785,... E DESIRED: array([[-1.712595, -1.285816, -1.226962, -1.009446, -0.960101, -0.882069, E -0.83596 , -0.69571 , -0.675828, -0.593016, -0.399731, -0.361047, E -0.340069, -0.237937, -0.207451, -0.087749, -0.075609, 0.032785,... X = array([[-2.13897865e+00, 1.11206124e+00, 3.58015526e-02], [-6.30157742e-01, 3.54051160e-05, -1.20895742e+00]... [-3.99730932e-01, 9.48906177e-01, 2.40632339e-01], [-3.83554471e-02, -1.19674458e+00, -1.27746484e+00]]) X_repeated = array([[-2.13897865, 1.11206124, 0.03580155], [ 0.29162807, -1.65065515, -1.50712909], [ 0.29162807, -...3234], [-0.39973093, 0.94890618, 0.24063234], [-0.39973093, 0.94890618, 0.24063234]], shape=(392, 3)) est_repeated = _BinMapper(n_bins=50) est_weighted = _BinMapper(n_bins=50) global_random_seed = 63 n_bins = 50 n_samples = 200 rng = RandomState(MT19937) at 0x13D5D3240 sw = array([1, 0, 4, 2, 4, 0, 4, 0, 4, 0, 4, 0, 2, 1, 3, 4, 4, 3, 0, 0, 3, 1, 0, 4, 2, 2, 0, 1, 2, 2, 0, 4, 3, 4, 0,...4, 4, 1, 3, 2, 4, 2, 2, 0, 4, 0, 2, 1, 0, 2, 2, 0, 0, 4, 3, 1, 4, 1, 0, 0, 0, 2, 4, 4, 3, 3, 2, 2, 4, 0]) sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py:215: AssertionError -------------------------------------------------------------------------- Captured stdout setup --------------------------------------------------------------------------- I: Seeding RNGs with 819376607 ========================================================================= short test summary info ========================================================================== FAILED sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py::test_binmapper_weighted_vs_repeated_equivalence[63-50] - AssertionError: =============================================================== 1 failed, 1 passed, 449 deselected in 0.46s ================================================================

So the problem happens when max_bins=50 < np.unique(X[sw != 0], axis=0).shape[0]=155, which means that it's not related to the branch of the code that uses midpoints as bin edges but rather uses the branch of the code that uses the _averaged_weighted_percentile. I think we try to extract a minimal reproducer to understand the root cause of this failure case.

Maybe the root cause is a bug in _averaged_weighted_percentile that does not implement the expected weighted/repeated equivalence? If that the case we should try to extract a minimal reproducer for that function in isolation and debug this and fix this in a dedicated PR.

I can confirm that I get the same failure for seed 63, will investigate a bit more, strange :/

@snath-xoc For information, the test failure disappeared once I merged this PR with the refactoring of #31778 (and adapted the _BinMapper code to use _weighted_percentile(..., average=True). So this might be caused by a bug in _weighted_percentile or _averaged_weighted_percentile that is fixed by #31775.

I will put some efforts in reviewing #31775 first.

sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py

antoinebaker

Here a first round of review !

sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py

sklearn/ensemble/_hist_gradient_boosting/binning.py

antoinebaker · 2025-07-02T15:45:24Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+    # if sample weight is not None and null values exist
+    # we need to remove those before calculating the
+    # distinct points
+    if sample_weight is not None:
+        col_data_non_null = col_data[sample_weight != 0]
+        distinct_values = np.unique(col_data_non_null).astype(X_DTYPE)
+    else:
+        distinct_values = np.unique(col_data).astype(X_DTYPE)


Suggested change

# if sample weight is not None and null values exist

# we need to remove those before calculating the

# distinct points

if sample_weight is not None:

col_data_non_null = col_data[sample_weight != 0]

distinct_values = np.unique(col_data_non_null).astype(X_DTYPE)

else:

distinct_values = np.unique(col_data).astype(X_DTYPE)

if sample_weight is not None:

# A zero sample_weight should be equivalent to removing the sample.

# We discard sample_weight=0 when computing the distinct values.

distinct_values = np.unique(col_data[sample_weight != 0]).astype(X_DTYPE)

else:

distinct_values = np.unique(col_data).astype(X_DTYPE)

antoinebaker · 2025-07-02T15:51:44Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

        # We could compute approximate midpoint percentiles using the output of
        # np.unique(col_data, return_counts) instead but this is more
        # work and the performance benefit will be limited because we
        # work on a fixed-size subsample of the full data.


I think this comment belongs to the elif sample_weight is None: block above.

Besides, maybe the comment should be updated (midpoints are now replaced with the averaged_inverted_cdf method) ?

I updated this comment now but not sure if it is what you had in mind, feel free to check and provide suggestions!

I agree that speaking about midpoints is confusing for code that derives bin edges from percentiles.

Bin edges are midpoints between consecutive distinct values only when the number of distinct values is smaller than the requested number of bins.

Here is a suggestion to improve this: #29641 (comment)

sklearn/ensemble/_hist_gradient_boosting/binning.py

ogrisel · 2025-07-02T16:25:34Z

BTW, I re-ran #29641 (comment) on the current state of the PR and the results are good for the subsampling branch of the code:

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>

ogrisel · 2025-07-28T13:58:08Z

For the record, the failure in #29641 (comment) can still be reproduced so we need to investigate what's going before considering a final review/merge of this PR.

sklearn/ensemble/_hist_gradient_boosting/binning.py

lorentzenchr · 2025-07-28T17:20:52Z

sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py

@@ -1734,7 +1740,7 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
    >>> X, y = load_diabetes(return_X_y=True)
    >>> est = HistGradientBoostingRegressor().fit(X, y)
    >>> est.score(X, y)
-    0.92...
+    0.93...


Why is that?

We now compute the bin edges using np.percentile(..., method="averaged_inverted_cdf") instead of np.percentile(..., method="linear") (the default in numpy) when sample_weight=None.

This is necessary to get the weighted / repetition equivalence semantics for the sample_weight parameter.

lorentzenchr · 2025-07-28T17:31:20Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+            if sample_weight is not None:
+                subsampling_probabilities = sample_weight / np.sum(sample_weight)
+            subset = rng.choice(
+                X.shape[0], self.subsample, p=subsampling_probabilities, replace=True


Suggested change

X.shape[0], self.subsample, p=subsampling_probabilities, replace=True

X.shape[0], self.subsample, p=subsampling_probabilities, replace=False

Preserve current behavior with sample_weight=None.

Sampling with replacement is necessary to ensure the correct weight semantics: if you have a data point with a large weight, it might need to be resampled several times for the edges derived from computed on the subset to be identically distributed as what you would observe from an equivalent unweighted dataset but with repeated values.

But I agree we need a non-regression test for this: #29641 (comment)

Then we should weigh in the trade-offs between conserving the old behavior for sample_weight=None and maintainability.
I would favor to make a case distinction and sample without replacement when sw=None.

The problem is that keeping the linear interpolation makes it impossible to have test_binmapper_weighted_vs_repeated_equivalence pass which is the point of this PR. Without changing this, we cannot test that our sample weight correctly implement the semantics we look for.

We could/should keep replace=True if samples_weight is None.

I really disagree: this would break the statistical equivalence between sample_weight=None and sample_weight=np.ones(n_samples) (and again test_binmapper_weighted_vs_repeated_equivalence).

lorentzenchr · 2025-07-28T17:34:26Z

doc/whats_new/upcoming_changes/sklearn.ensemble/29641.fix.rst

+- :class:`BaseHistGradientBoosting` now uses sample weights when binning data
+  either by using sample weights in the `_averaged_weighted_percentile` or
+  by subsampling using normalised sample weights when `n_samples` is greater
+  than 2e5 within the `_find_binning_thresholds` function.
+  By :user:`Shruti Nath <snath-xoc>` and :user:`Olivier Grisel <ogrisel>`


Please provide a better changelog entry:

What exactly changes, in particular what will produce different estimators compared to the previous (now current) version of scikit-learn. I think we should even list this PR under the Changed models section.

Do not mention private functions like _averaged_weighted_percentile.

@snath-xoc we need to add a dedicated entry in the "Changed models section" to explain that the way the HistGradientBoostingClassifier/Regressor compute their bin edges is slightly changed when fit with sample_weight=None.

I suppose this would be in addition to the changelog PR?

ogrisel

Some more feedback:

ogrisel · 2025-07-30T10:03:35Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

-            subset = rng.choice(X.shape[0], self.subsample, replace=False)
+            subsampling_probabilities = None
+            if sample_weight is not None:
+                subsampling_probabilities = sample_weight / np.sum(sample_weight)


Could you please extend the tests to cover this line? Maybe we could a simple kstest on the distribution of the bin edges for a smallish value of subsample (e.g. 100 out of a dataset of 300 data points) and a few bins (e.g. 3 or 5) so that we ignore the non-independent multiple testing problem. We need the test to run fast enough (ideally under 1s) so we have to constrain ourselves in the number of rng repetitions, but since subsampled binning on a single column of data is quite cheap, I believe this is doable.

sklearn/ensemble/_hist_gradient_boosting/binning.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

snath-xoc · 2025-08-01T19:29:18Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        # Calculate midpoints if distinct values <= max_bins
+        from numpy.lib.stride_tricks import sliding_window_view
+
+        bin_thresholds = sliding_window_view(distinct_values, 2).mean(axis=1)


@lorentzenchr the sliding window view requires a window_shape (it may have been mistakenly missed before). I set it at 2, hope that's what you had in mind

This line is causing the check_fit2d_1sample test to fail.

We need to check how _BinMapper was returning on datasets with a single data point previously and reproduce the same behavior. I suspect the previous code was returning an empty array for bin_thresholds. I am not sure how this could work. Maybe the HGB estimator code would then raise an error with the expected message later.

ogrisel · 2025-08-05T08:54:40Z

For reference, I tried to run the common test for HistGradientBoosting models using:

pytest -vk "check_sample_weight_equivalence and HistGradientBoosting" sklearn/tests/test_common.py

After commenting out the matching PER_ESTIMATOR_XFAIL_CHECKS entries in sklearn/utils/_test_common/instance_generator.py and we still large AssertionError. So fixing the binning alone is not enough to fix the equivalence check at the HGBDT estimator level (even on small datasets where early stopping and subsampling are not active).

Maybe we need to fix the remaining sources of discrepancies (at least on small datasets) before deciding if the change of behavior is justified.

sklearn/ensemble/_hist_gradient_boosting/binning.py

ogrisel · 2025-08-05T09:08:21Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        # Calculate midpoints if distinct values <= max_bins
+        from numpy.lib.stride_tricks import sliding_window_view
+
+        bin_thresholds = sliding_window_view(distinct_values, 2).mean(axis=1)


This line is causing the check_fit2d_1sample test to fail.

We need to check how _BinMapper was returning on datasets with a single data point previously and reproduce the same behavior. I suspect the previous code was returning an empty array for bin_thresholds. I am not sure how this could work. Maybe the HGB estimator code would then raise an error with the expected message later.

snath-xoc added 4 commits August 2, 2024 15:52

add weighted percentile to binning threshold calculation in HGBT

43da2a6

add weighted percentile to binning threshold calculation in HGBT

b7b1ce2

add sample_weight pass through to binning

d21a19c

fix _find_binning_thresholds and BinMapper sample weight handling

6b3f8ea

github-actions bot added the module:ensemble label Aug 8, 2024

snath-xoc added 4 commits August 8, 2024 17:49

minor fix

cf1149b

updated sampe weight handling in test_binning

fbd0365

updated sampe weight handling in test_binning

22c4c11

[all random seeds]

90665f1

ogrisel reviewed Aug 12, 2024

View reviewed changes

ogrisel mentioned this pull request Oct 25, 2024

List of estimators with known incorrect handling of sample_weight #16298

Open

54 tasks

snath-xoc and others added 11 commits March 13, 2025 21:43

Merge branch 'main' into HGBT_bin_weights

da8344e

Update sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_b…

00ead80

…oosting.py Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

fix _find_binning_thresholds and BinMapper sample weight handling

2d74df6

[all random seeds]

398598e

comment

8d9544f

add None handling to _find_binning_threshold

3ffcbca

change sampling prob to None when no sample weights passed

0758886

fix distinct value handling and modify tests for convergence

4f58074

Merge branch 'main' into HGBT_bin_weights

b7ce7a9

update tests

664d9e4

Merge branch 'main' into HGBT_bin_weights

47e6b65

snath-xoc marked this pull request as ready for review March 21, 2025 13:32

ogrisel reviewed Mar 21, 2025

View reviewed changes

sklearn/ensemble/_hist_gradient_boosting/binning.py Show resolved Hide resolved

ogrisel reviewed Mar 21, 2025

View reviewed changes

sklearn/ensemble/_hist_gradient_boosting/binning.py Outdated Show resolved Hide resolved

ogrisel reviewed Mar 21, 2025

View reviewed changes

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py Outdated Show resolved Hide resolved

add changelog

955eb43

ogrisel reviewed Jul 2, 2025

View reviewed changes

antoinebaker reviewed Jul 2, 2025

View reviewed changes

snath-xoc and others added 5 commits July 4, 2025 18:16

Apply suggestions from code review

7064bde

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Apply suggestions from code review

b43232b

Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>

Merge branch 'main' into HGBT_bin_weights

13c5bd1

implement reviews

f7b1785

add test_unique_bins_repeated_weighted

39ed7f3

lorentzenchr reviewed Jul 28, 2025

View reviewed changes

ogrisel reviewed Jul 30, 2025

View reviewed changes

sklearn/ensemble/_hist_gradient_boosting/binning.py Outdated Show resolved Hide resolved

ogrisel mentioned this pull request Jul 31, 2025

MNT Refactor _average_weighted_percentile to avoid double sort #31775

Open

snath-xoc and others added 4 commits August 1, 2025 21:25

Apply suggestions from code review

6f83e7f

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

Merge branch 'main' into HGBT_bin_weights

06cf962

fix imports

42c333c

add subsample tests

9453103

snath-xoc commented Aug 1, 2025

View reviewed changes

snath-xoc added 3 commits August 1, 2025 21:57

update changelog

36a5253

update changelog

fea007f

update changelog

a561286

ogrisel reviewed Aug 5, 2025

View reviewed changes

snath-xoc and others added 4 commits August 7, 2025 12:24

Merge branch 'main' into HGBT_bin_weights

0b2ef69

add changed models log

9ed1e2f

fix one sample test to raise appropriate error

456d779

fix one sample test to raise appropriate error

c29afcb

	X.shape[0], self.subsample, p=subsampling_probabilities, replace=True
	X.shape[0], self.subsample, p=subsampling_probabilities, replace=False

Uh oh!

Added sample weight handling to BinMapper under HGBT #29641

Are you sure you want to change the base?

Added sample weight handling to BinMapper under HGBT #29641

Uh oh!

Conversation

snath-xoc commented Aug 8, 2024 • edited by ogrisel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

ogrisel left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

snath-xoc commented Mar 21, 2025

Uh oh!

snath-xoc commented Mar 21, 2025

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

snath-xoc commented Jun 13, 2025

Uh oh!

ogrisel left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ogrisel Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

antoinebaker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

snath-xoc commented Aug 8, 2024 •

edited by ogrisel

Loading

github-actions bot commented Aug 8, 2024 •

edited

Loading

ogrisel left a comment •

edited

Loading

ogrisel left a comment •

edited

Loading

ogrisel Jul 31, 2025 •

edited

Loading

ogrisel Jul 30, 2025 •

edited

Loading

ogrisel Aug 5, 2025 •

edited

Loading

ogrisel left a comment •

edited

Loading