ENH FEA add interaction constraints to HGBT #21020


Merged

Conversation

Member

@lorentzenchr lorentzenchr commented Sep 12, 2021

Reference Issues/PRs

Fixes #19148.

What does this implement/fix? Explain your changes.

This PR introduces the argument interaction_cst to the classes HistGradientBoostingRegressor and HistGradientBoostingClassifier.

Its main impact is in Splitter.find_node_split.

Any other comments?

Might be a good reason to wish for an early 1.1 release 😏

  • add new feature
  • validate interaction_cst
  • test new feature
  • optimize code
  • doc and whatsnew
  • test interaction delta of gradient boosting; idea: compare hgbt(X + X[:, 0]) - hgbt(X) with and without constraints, see ba78cb9
  • add to an example, e.g. partial dependence plot should be parallel with constraints and in link space

@lorentzenchr lorentzenchr self-assigned this Sep 12, 2021
@lorentzenchr lorentzenchr added this to the 1.1 milestone Sep 12, 2021
@lorentzenchr lorentzenchr force-pushed the hgbt_interaction_constraints branch from c09bdd9 to d9b273a Compare September 12, 2021 21:38
@lorentzenchr
Member Author

@thomasjpfan I have two questions right now:

  1. How to treat features not listed in the interaction constraints, e.g. with features 0..3 and interaction_cst = [{0, 1}], are features 2 and 3 allowed to enter the game at all?
  2. What would be a good test for test_splitting? I might need a little help there.

@lorentzenchr
Member Author

lorentzenchr commented Sep 15, 2021

Now, I could use some help as I'm stuck with the following error, which only happens for n_threads > 1 (this error message is produced by compiling with # cython: boundscheck=True in splitting.pyx):

sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py:247: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/ensemble/_hist_gradient_boosting/grower.py:395: in grow
    self.split_next()
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7fe5d39d55d0>
sklearn/ensemble/_hist_gradient_boosting/grower.py:500: in split_next
    ) = self.splitter.split_indices(node.split_info, node.sample_indices)
        node       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeNode object at 0x7fe5d39d5f50>
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7fe5d39d55d0>
        tic        = 1.545465422
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   &sample_indices[right_offset[thread_idx]]
E   IndexError: Out of bounds on buffer access (axis 0)

I don't see where this PR changes any of sample_indices, right_offset or thread_idx. My guess is that right_offset[-1] = len(sample_indices), so the last thread errors.

(@thomasjpfan You said I may ping you 😊)

@ogrisel
Member

ogrisel commented Sep 16, 2021

Now, I could use some help as I'm stuck with the following error, which only happens for n_threads > 1 (this error message is produced by compiling with # cython: boundscheck=True in splitting.pyx).

I don't understand how a build-time error can be impacted by a runtime value (n_threads > 1)...

@lorentzenchr
Member Author

lorentzenchr commented Sep 16, 2021

Now, I could use some help as I'm stuck with the following error, which only happens for n_threads > 1 (this error message is produced by compiling with # cython: boundscheck=True in splitting.pyx).

I don't understand how a build-time error can be impacted by a runtime value (n_threads > 1)...

Me neither. There is nothing about this error that I understand.

Running test_grower_interaction_constraints with TreeGrower(..., n_threads=1) is fine, while TreeGrower(..., n_threads=2) throws this error. It seems as if the calculation of right_offset is changed by this PR and is now wrong. Some printouts give me the following (print(..., flush=True) is helpful in this case):

thread_idx = 1
sample_indices = [5 5]
right_offset = [1 2]
right_indices_buffer = [         1          4 3215170148 1069022147          6          9
 1036367913 1068865250 1066764793 1064467143]
offset_in_buffers = [0 1]
right_counts = [1 0]
len(sample_indices) = 2
right_offset[thread_idx] = 2

Note the last two lines and the fact that sample_indices[right_offset[thread_idx]] = sample_indices[2] gets accessed. But why would this PR change those calculations in the first place? I'm deeply puzzled.

Note that this error also appears in test_min_samples_leaf.
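For what it's worth, the failing pattern can be mimicked in plain NumPy (a hypothetical analogue, not the actual Cython code): with boundscheck=True, taking &sample_indices[right_offset[thread_idx]] is checked like an element access, which fails as soon as the offset equals the array length, even though a thread with a zero right_count never dereferences that address:

```python
import numpy as np

sample_indices = np.array([5, 5], dtype=np.uint32)
right_offset = np.array([1, 2])  # last entry equals len(sample_indices)
thread_idx = 1

# A slice starting at the end is a legal empty view, which is all a thread
# with right_counts[thread_idx] == 0 conceptually needs:
empty_view = sample_indices[right_offset[thread_idx]:]

# But element access at that offset, which is what the bounds check on
# &sample_indices[right_offset[thread_idx]] amounts to, raises:
try:
    sample_indices[right_offset[thread_idx]]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True  # IndexError: index 2 is out of bounds ...
```

This would explain why the error only shows up with n_threads > 1: with a single thread, the offset taken is always 0.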

@lorentzenchr lorentzenchr force-pushed the hgbt_interaction_constraints branch from 015b30e to eb1e255 Compare September 16, 2021 19:27
@lorentzenchr
Member Author

lorentzenchr commented Sep 18, 2021

@thomasjpfan @ogrisel Here comes a detailed bug report based on commit 3ea3829.
The CI runs confirm it: this error only appears with OpenMP enabled.

pytest -xl sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py::test_grower_interaction_constraints 
======================================================================== test session starts ========================================================================
platform darwin -- Python 3.7.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: XXX/scikit-learn, configfile: setup.cfg
plugins: anyio-3.3.0, cov-2.10.1
collected 1 item                                                                                                                                                    

sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py F                                                                                               [100%]

============================================================================= FAILURES ==============================================================================
________________________________________________________________ test_grower_interaction_constraints ________________________________________________________________

    def test_grower_interaction_constraints():
        """Check that grower respects interaction constraints."""
        n_features = 6
        interaction_cst = [{0, 1}, {1, 2}, {3, 4, 5}]
        n_samples = 5
        n_bins = 6
        root_feature_splits = []
    
        def get_all_children(node):
            res = []
            if node.is_leaf:
                return res
            for n in [node.left_child, node.right_child]:
                res.append(n)
                if not n.is_leaf:
                    res.extend(get_all_children(n))
            return res
    
        for seed in range(20):
            rng = np.random.RandomState(seed)
    
            X_binned = rng.randint(
                0, n_bins - 1, size=(n_samples, n_features), dtype=X_BINNED_DTYPE
            )
            X_binned = np.asfortranarray(X_binned)
            gradients = rng.normal(size=n_samples).astype(G_H_DTYPE)
            hessians = np.ones(shape=1, dtype=G_H_DTYPE)
    
            grower = TreeGrower(
                X_binned,
                gradients,
                hessians,
                max_depth=3,
                n_bins=n_bins,
                shrinkage=1.0,
                max_leaf_nodes=None,
                min_samples_leaf=1,
                interaction_cst=interaction_cst,
                n_threads=n_threads,
            )
>           grower.grow()

X_binned   = array([[4, 2, 4, 2, 4, 0],
       [4, 1, 0, 3, 3, 2],
       [3, 2, 2, 3, 3, 3],
       [0, 1, 4, 4, 4, 3],
       [1, 1, 1, 2, 2, 4]], dtype=uint8)
get_all_children = <function test_grower_interaction_constraints.<locals>.get_all_children at 0x7ff910990950>
gradients  = array([-0.29139397, -0.13309029, -0.1730696 , -1.7616516 , -0.08767307],
      dtype=float32)
grower     = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7ff910945550>
hessians   = array([1.], dtype=float32)
interaction_cst = [{0, 1}, {1, 2}, {3, 4, 5}]
n_bins     = 6
n_features = 6
n_samples  = 5
rng        = RandomState(MT19937) at 0x7FF9108D76B0
root_feature_splits = []
seed       = 0

sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py:612: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/ensemble/_hist_gradient_boosting/grower.py:397: in grow
    self.split_next()
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7ff910945550>
sklearn/ensemble/_hist_gradient_boosting/grower.py:497: in split_next
    ) = self.splitter.split_indices(node.split_info, node.sample_indices)
        node       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeNode object at 0x7ff9109c0c10>
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7ff910945550>
        tic        = 1.203303851
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   &left_indices_buffer[offset_in_buffers[thread_idx]],
E   IndexError: Out of bounds on buffer access (axis 0)

HISTOGRAM_DTYPE = dtype([('sum_gradients', '<f8'), ('sum_hessians', '<f8'), ('count', '<u4')])
SplitInfo  = <class 'sklearn.ensemble._hist_gradient_boosting.splitting.SplitInfo'>
Splitter   = <class 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter'>
__builtins__ = <builtins>
__doc__    = 'This module contains routines and data structures to:\n\n- Find the best possible split of a node. For a given node, ... split to a node, i.e. split the indices of the samples at the node\n  into the newly created left and right childs.\n'
__file__   = 'XXX/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/splitting.cpython-37m-darwin.so'
__loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7ff91093bd10>
__name__   = 'sklearn.ensemble._hist_gradient_boosting.splitting'
__package__ = 'sklearn.ensemble._hist_gradient_boosting'
__pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
__pyx_unpickle_Splitter = <built-in function __pyx_unpickle_Splitter>
__spec__   = ModuleSpec(name='sklearn.ensemble._hist_gradient_boosting.splitting', loader=<_frozen_importlib_external.ExtensionFile...origin='XXX/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/splitting.cpython-37m-darwin.so')
__test__   = {}
compute_node_value = <built-in function compute_node_value>
np         = <module 'numpy' from 'XXX/python3_sklearn/lib/python3.7/site-packages/numpy/__init__.py'>

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx:394: IndexError
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=================================================================== 1 failed, 1 warning in 0.35s ====================================================================

System:
python: 3.7.9 (v3.7.9:13c94747c7, Aug 15 2020, 01:31:08) [Clang 6.0 (clang-600.0.57)]
executable: XXX/python3_sklearn/bin/python3.7
machine: Darwin-20.5.0-x86_64-i386-64bit

Python dependencies:
pip: 21.2.4
setuptools: 51.3.1
sklearn: 1.1.dev0
numpy: 1.20.3
scipy: 1.6.3
Cython: 0.29.21
pandas: 1.2.0
matplotlib: 3.3.3
joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True

@lorentzenchr
Member Author

@thomasjpfan Is there anything I can do (help, support, trade) to get a LGTM from you? You gave the most thorough review, and I think all of your comments are addressed, no loose ends.

# %%
# All 4 plots show parallel lines meaning there is no interaction in the model.
# (Note that to see the same with a
# :class:`~sklearn.ensemble.HistGradientBoostingClassifier`, we would need to
Member

I think partial dependence plots for HistGradientBoostingClassifier already plots in the link space:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import PartialDependenceDisplay
X, y = make_classification(random_state=0)
clf = HistGradientBoostingClassifier().fit(X, y)

_ = PartialDependenceDisplay.from_estimator(clf, X, [0])

[Screenshot of the resulting partial dependence plot]

My preference is to remove the statement, so that the narrative flows more naturally to the 2D plot and keeps the focus on the current problem.

Member Author


You're right. The statement would be valid for HistGradientBoostingRegressor(loss="poisson"), but I'll remove it from here. This subject is already discussed in #18309.

Member

@jjerphan jjerphan left a comment


My (hopefully) penultimate review.

I still need to go through _compute_interactions and some tests.

Comment on lines +642 to +644
Indices of the interaction sets that have to be applied on splits of
child nodes. The fewer sets the stronger the constraint as fewer sets
contain fewer features.
Member


I do not understand the last sentence. One can have two sets each containing half of the features, right? In that case, the tree structure won't be as constrained as it would be if the feature interactions were defined by a lot of singletons. Or is my understanding wrong?

Member Author


It means that the fewer entries interaction_cst_indices has, the smaller the choice of possible features to split on. "Stronger constraint" means having a larger impact by allowing only very few features to split on.
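To illustrate (a hypothetical helper, not the actual scikit-learn code): after a split on a feature, a child node keeps only the constraint sets that contain that feature, and its allowed split candidates are the union of those sets, so fewer surviving sets generally means fewer allowed features:

```python
# Hypothetical sketch of how allowed features shrink down a branch.
def child_allowed_features(interaction_cst, parent_set_indices, split_feature):
    """Keep the sets containing split_feature; allowed features are their union."""
    child_indices = [
        i for i in parent_set_indices if split_feature in interaction_cst[i]
    ]
    allowed = set().union(*(interaction_cst[i] for i in child_indices))
    return child_indices, allowed

# Constraint sets as in the test above; at the root all three sets apply.
interaction_cst = [{0, 1}, {1, 2}, {3, 4, 5}]
idx, allowed = child_allowed_features(interaction_cst, [0, 1, 2], 1)
print(idx, allowed)  # [0, 1] {0, 1, 2}
```

Splitting the root on feature 1 keeps sets {0, 1} and {1, 2}; a further split on feature 0 in that child would narrow the allowed features down to {0, 1}.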

Member

@jjerphan jjerphan left a comment


All in 🧈 (from the German "Alles in Butter": everything's fine)! Thank you for your patience, @lorentzenchr.

Here are some last comments.

@lorentzenchr
Member Author

@jjerphan Thank you so much for your thorough review. And now, shall we let our 🧈 do its work? (Can one say that???) 😄

@lorentzenchr
Member Author

Presumably one of my last remarks: we could cite

M. Mayer, S.C. Bourassa, M. Hoesli & D.F. Scognamiglio. (2022). "Machine Learning Applications to Land and Structure Valuation". Journal of Risk and Financial Management.
https://doi.org/10.3390/jrfm15050193

In Chapter 2.2, they explain interaction constraints nicely with some diagrams of trees, and refer to the paper that originally proposed interaction constraints:

Lee, Simon C. K., Sheldon Lin, and Katrien Antonio. 2015. Delta Boosting Machine and Its Application in Actuarial Modeling. Sydney: Institute of Actuaries of Australia.

@jjerphan
Member

@jjerphan Thank you so much for your thorough review.

You're welcome. Thank you for proposing the call, this gave me a reason to go through your PR again.

And now, shall we let our 🧈 do its work? (Can one say that???) 😄

I don't know about this phrase. If I had to pick one, I would say: « Emballé, c'est pesé ! 📪 » (roughly "wrapped and weighed", i.e. done and dusted).

- add reference Machine Learning Applications to Land and Structure Valuation
- add arxiv qualifier for G. Louppe
Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan changed the title [MRG] FEA add interaction constraints to HGBT ENH FEA add interaction constraints to HGBT Oct 11, 2022
@thomasjpfan thomasjpfan merged commit 5ceb8a6 into scikit-learn:main Oct 11, 2022
@lorentzenchr
Member Author

@jjerphan @thomasjpfan Thank you so much for your time, thoughts and effort while reviewing. It improved this PR a lot. This PR is an important one for me because it enables a lot of really cool use cases of high relevance in practice.

@lorentzenchr lorentzenchr deleted the hgbt_interaction_constraints branch October 11, 2022 21:51
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2022
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
@amueller
Member

amueller commented Nov 9, 2022

This is so awesome, thank you for all the hard work!

Labels
cython, High Priority, module:ensemble
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

HistGradientBoosting* interaction constraints
10 participants