Skip to content

DEBUG hgbt variable bins #28603 #28621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

lorentzenchr
Copy link
Member

@lorentzenchr lorentzenchr commented Mar 12, 2024

Reference Issues/PRs

This PR debugs #28603 which is needed for errors on 32bit platforms (CI).

Bug description

This PR confirms that there is bug already with the commits until 9785084 for Linux_Docker debian_atlas_32bit, see CI run https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=64959&view=logs&j=aabdcdc3-bb64-5414-b357-ed024fe8659e&t=b7b3ba55-d585-563b-a032-f235636c22b0 or https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=65028&view=logs&j=aabdcdc3-bb64-5414-b357-ed024fe8659e:

=================================== FAILURES ===================================
_ test_splitting_missing_values[X_binned8-all_gradients8-True-9-True-5-False] __
# LAST TEST CASE
>       assert split_info.bin_idx == expected_bin_idx
E       assert 8 == 5

/io/sklearn/ensemble/_hist_gradient_boosting/tests/test_splitting.py:542: AssertionError
=================================== FAILURES ===================================
_ test_splitting_missing_values[X_binned8-all_gradients8-True-9-True-5-False] __

X_binned = array([[9],
       [9],
       [9],
       [9],
       [0],
       [1],
       [2],
       [3],
       [4],
       [5]], dtype=uint8)
all_gradients = array([1., 1., 1., 1., 5., 5., 5., 5., 5., 5.], dtype=float32)
has_missing_values = array([1], dtype=uint8)
n_bins_non_missing = array([9], dtype=uint16),
expected_split_on_nan = True
expected_bin_idx = 5,
expected_go_to_left = False

    def test_splitting_missing_values(
        X_binned,
        all_gradients,
        has_missing_values,
        n_bins_non_missing,
        expected_split_on_nan,
        expected_bin_idx,
        expected_go_to_left,
    ):
...
>       assert split_info.bin_idx == expected_bin_idx
E       assert 8 == 5
E        +  where 8 = <sklearn.ensemble._hist_gradient_boosting.splitting.SplitInfo object at 0xe28c53b8>.bin_idx

/io/sklearn/ensemble/_hist_gradient_boosting/tests/test_splitting.py:542: AssertionError

Very oddly, with the DEBUG commit 490763a, that only adds print statements (also in Cython), 3 CI runs in a row are all green (https://github.com/scikit-learn/scikit-learn/runs/22647313519, https://github.com/scikit-learn/scikit-learn/runs/22657509017, https://github.com/scikit-learn/scikit-learn/pull/28621/checks?check_run_id=22662390708)

Copy link

github-actions bot commented Mar 12, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8823803. Link to the linter CI: here

@lorentzenchr lorentzenchr force-pushed the debug_hgbt_variable_bins branch 3 times, most recently from 2d34f19 to a7c5c52 Compare March 13, 2024 16:20
@lorentzenchr lorentzenchr force-pushed the debug_hgbt_variable_bins branch from a7c5c52 to 490763a Compare March 13, 2024 17:18
@lorentzenchr
Copy link
Member Author

Resolution: It is not 100% clear. In the _find_best_bin_to_split* functions, empty bins should have no effect as they do not change the gain. The comparison gain > best_gain should be number > same number and evaluate to False. It seems, however, that that on 32-bit Debian Linux, this can intermittently evaluate to True. The 1e-15 regularization in compute_node_value may play a part in this.
ENH handle count==0 in _find_best_bin_to_split resolves it by special casing empty bins: count == 0.

@lorentzenchr lorentzenchr deleted the debug_hgbt_variable_bins branch March 15, 2024 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant