ENH FEA add interaction constraints to HGBT #21020


Merged

Conversation

Member

@lorentzenchr lorentzenchr commented Sep 12, 2021

Reference Issues/PRs

Fixes #19148.

What does this implement/fix? Explain your changes.

This PR introduces the argument interaction_cst to the classes HistGradientBoostingRegressor and HistGradientBoostingClassifier.

Its main impact is in Splitter.find_node_split.

Any other comments?

Might be a good reason to wish for an early 1.1 release 😏

  • add new feature
  • validate interaction_cst
  • test new feature
  • optimize code
  • doc and whatsnew
  • test interaction delta of gradient boosting; idea: compare hgbt(X + X[:, 0]) - hgbt(X) with and without constraints, see ba78cb9
  • add to an example, e.g. partial dependence plot should be parallel with constraints and in link space

@lorentzenchr lorentzenchr self-assigned this Sep 12, 2021
@lorentzenchr lorentzenchr added this to the 1.1 milestone Sep 12, 2021
@lorentzenchr lorentzenchr force-pushed the hgbt_interaction_constraints branch from c09bdd9 to d9b273a Compare September 12, 2021 21:38
@lorentzenchr
Member Author

@thomasjpfan I have two questions right now:

  1. How to treat features not listed in the interaction constraints, e.g. with features 0..3 and interaction_cst = [{0, 1}], are features 2 and 3 allowed to enter the game at all?
  2. What would be a good test for test_splitting? I might need a little help there.

@lorentzenchr
Member Author

lorentzenchr commented Sep 15, 2021

Now, I could use some help as I'm stuck with the following error, which only happens for n_threads > 1 (this error message is produced by compiling with # cython: boundscheck=True in splitting.pyx):

sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py:247: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/ensemble/_hist_gradient_boosting/grower.py:395: in grow
    self.split_next()
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7fe5d39d55d0>
sklearn/ensemble/_hist_gradient_boosting/grower.py:500: in split_next
    ) = self.splitter.split_indices(node.split_info, node.sample_indices)
        node       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeNode object at 0x7fe5d39d5f50>
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7fe5d39d55d0>
        tic        = 1.545465422
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   &sample_indices[right_offset[thread_idx]]
E   IndexError: Out of bounds on buffer access (axis 0)

I don't see where this PR changes any of sample_indices, right_offset or thread_idx. My guess is that right_offset[-1] = len(sample_indices), so the last thread errors.

(@thomasjpfan You said I may ping you 😊)

@ogrisel
Member

ogrisel commented Sep 16, 2021

Now, I could use some help as I'm stuck with the following error, which only happens for n_threads > 1 (this error message is produced by compiling with # cython: boundscheck=True in splitting.pyx).

I don't understand how a build-time error can be impacted by a runtime value (n_threads > 1)...

@lorentzenchr
Member Author

lorentzenchr commented Sep 16, 2021

Now, I could use some help as I'm stuck with the following error, which only happens for n_threads > 1 (this error message is produced by compiling with # cython: boundscheck=True in splitting.pyx).

I don't understand how a build-time error can be impacted by a runtime value (n_threads > 1)...

Me neither. There is nothing about this error that I understand.

Running test_grower_interaction_constraints with TreeGrower(..., n_threads=1) is fine, while TreeGrower(..., n_threads=2) throws this error. It seems as if the calculation of right_offset is changed by this PR and is now wrong. Some printouts give me the following (print(..., flush=True) is helpful in this case):

thread_idx = 1
sample_indices = [5 5]
right_offset = [1 2]
right_indices_buffer = [         1          4 3215170148 1069022147          6          9
 1036367913 1068865250 1066764793 1064467143]
offset_in_buffers = [0 1]
right_counts = [1 0]
len(sample_indices) = 2
right_offset[thread_idx] = 2

Note the last two lines and the fact that sample_indices[right_offset[thread_idx]] = sample_indices[2] gets accessed. But why would this PR change those calculations in the first place? I'm deeply puzzled.

Note that this error also appears in test_min_samples_leaf.
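For what it's worth, the failing pattern can be mimicked in plain NumPy (a hypothetical analogue, not the actual Cython code): with boundscheck=True, taking &sample_indices[right_offset[thread_idx]] is checked like an element access, which fails as soon as the offset equals the array length, even though a thread with a zero right_count never dereferences that address:

```python
import numpy as np

sample_indices = np.array([5, 5], dtype=np.uint32)
right_offset = np.array([1, 2])  # last entry equals len(sample_indices)
thread_idx = 1

# A slice starting at the end is a legal empty view, which is all a thread
# with right_counts[thread_idx] == 0 conceptually needs:
empty_view = sample_indices[right_offset[thread_idx]:]

# But element access at that offset, which is what the bounds check on
# &sample_indices[right_offset[thread_idx]] amounts to, raises:
try:
    sample_indices[right_offset[thread_idx]]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True  # IndexError: index 2 is out of bounds ...
```

This would explain why the error only shows up with n_threads > 1: with a single thread, the offset taken is always 0.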

@lorentzenchr lorentzenchr force-pushed the hgbt_interaction_constraints branch from 015b30e to eb1e255 Compare September 16, 2021 19:27
@lorentzenchr
Member Author

lorentzenchr commented Sep 18, 2021

@thomasjpfan @ogrisel Here comes a detailed bug report based on commit 3ea3829.
The CI runs confirm it: this error only appears with OpenMP enabled.

pytest -xl sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py::test_grower_interaction_constraints 
======================================================================== test session starts ========================================================================
platform darwin -- Python 3.7.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: XXX/scikit-learn, configfile: setup.cfg
plugins: anyio-3.3.0, cov-2.10.1
collected 1 item                                                                                                                                                    

sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py F                                                                                               [100%]

============================================================================= FAILURES ==============================================================================
________________________________________________________________ test_grower_interaction_constraints ________________________________________________________________

    def test_grower_interaction_constraints():
        """Check that grower respects interaction constraints."""
        n_features = 6
        interaction_cst = [{0, 1}, {1, 2}, {3, 4, 5}]
        n_samples = 5
        n_bins = 6
        root_feature_splits = []
    
        def get_all_children(node):
            res = []
            if node.is_leaf:
                return res
            for n in [node.left_child, node.right_child]:
                res.append(n)
                if not n.is_leaf:
                    res.extend(get_all_children(n))
            return res
    
        for seed in range(20):
            rng = np.random.RandomState(seed)
    
            X_binned = rng.randint(
                0, n_bins - 1, size=(n_samples, n_features), dtype=X_BINNED_DTYPE
            )
            X_binned = np.asfortranarray(X_binned)
            gradients = rng.normal(size=n_samples).astype(G_H_DTYPE)
            hessians = np.ones(shape=1, dtype=G_H_DTYPE)
    
            grower = TreeGrower(
                X_binned,
                gradients,
                hessians,
                max_depth=3,
                n_bins=n_bins,
                shrinkage=1.0,
                max_leaf_nodes=None,
                min_samples_leaf=1,
                interaction_cst=interaction_cst,
                n_threads=n_threads,
            )
>           grower.grow()

X_binned   = array([[4, 2, 4, 2, 4, 0],
       [4, 1, 0, 3, 3, 2],
       [3, 2, 2, 3, 3, 3],
       [0, 1, 4, 4, 4, 3],
       [1, 1, 1, 2, 2, 4]], dtype=uint8)
get_all_children = <function test_grower_interaction_constraints.<locals>.get_all_children at 0x7ff910990950>
gradients  = array([-0.29139397, -0.13309029, -0.1730696 , -1.7616516 , -0.08767307],
      dtype=float32)
grower     = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7ff910945550>
hessians   = array([1.], dtype=float32)
interaction_cst = [{0, 1}, {1, 2}, {3, 4, 5}]
n_bins     = 6
n_features = 6
n_samples  = 5
rng        = RandomState(MT19937) at 0x7FF9108D76B0
root_feature_splits = []
seed       = 0

sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py:612: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/ensemble/_hist_gradient_boosting/grower.py:397: in grow
    self.split_next()
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7ff910945550>
sklearn/ensemble/_hist_gradient_boosting/grower.py:497: in split_next
    ) = self.splitter.split_indices(node.split_info, node.sample_indices)
        node       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeNode object at 0x7ff9109c0c10>
        self       = <sklearn.ensemble._hist_gradient_boosting.grower.TreeGrower object at 0x7ff910945550>
        tic        = 1.203303851
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   &left_indices_buffer[offset_in_buffers[thread_idx]],
E   IndexError: Out of bounds on buffer access (axis 0)

HISTOGRAM_DTYPE = dtype([('sum_gradients', '<f8'), ('sum_hessians', '<f8'), ('count', '<u4')])
SplitInfo  = <class 'sklearn.ensemble._hist_gradient_boosting.splitting.SplitInfo'>
Splitter   = <class 'sklearn.ensemble._hist_gradient_boosting.splitting.Splitter'>
__builtins__ = <builtins>
__doc__    = 'This module contains routines and data structures to:\n\n- Find the best possible split of a node. For a given node, ... split to a node, i.e. split the indices of the samples at the node\n  into the newly created left and right childs.\n'
__file__   = 'XXX/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/splitting.cpython-37m-darwin.so'
__loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7ff91093bd10>
__name__   = 'sklearn.ensemble._hist_gradient_boosting.splitting'
__package__ = 'sklearn.ensemble._hist_gradient_boosting'
__pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
__pyx_unpickle_Splitter = <built-in function __pyx_unpickle_Splitter>
__spec__   = ModuleSpec(name='sklearn.ensemble._hist_gradient_boosting.splitting', loader=<_frozen_importlib_external.ExtensionFile...origin='XXX/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/splitting.cpython-37m-darwin.so')
__test__   = {}
compute_node_value = <built-in function compute_node_value>
np         = <module 'numpy' from 'XXX/python3_sklearn/lib/python3.7/site-packages/numpy/__init__.py'>

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx:394: IndexError
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=================================================================== 1 failed, 1 warning in 0.35s ====================================================================

System:
python: 3.7.9 (v3.7.9:13c94747c7, Aug 15 2020, 01:31:08) [Clang 6.0 (clang-600.0.57)]
executable: XXX/python3_sklearn/bin/python3.7
machine: Darwin-20.5.0-x86_64-i386-64bit

Python dependencies:
pip: 21.2.4
setuptools: 51.3.1
sklearn: 1.1.dev0
numpy: 1.20.3
scipy: 1.6.3
Cython: 0.29.21
pandas: 1.2.0
matplotlib: 3.3.3
joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True

@lorentzenchr
Member Author

@thomasjpfan Is there anything I can do (help, support, trade) to get a LGTM from you? You gave the most thorough review, and I think all of your comments are addressed, no loose ends.

# %%
# All 4 plots show parallel lines meaning there is no interaction in the model.
# (Note that to see the same with a
# :class:`~sklearn.ensemble.HistGradientBoostingClassifier`, we would need to
Member

I think partial dependence plots for HistGradientBoostingClassifier already plots in the link space:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import PartialDependenceDisplay
X, y = make_classification(random_state=0)
clf = HistGradientBoostingClassifier().fit(X, y)

_ = PartialDependenceDisplay.from_estimator(clf, X, [0])

[Screenshot of the resulting partial dependence plot]

My preference is to remove the statement, so that the narrative flows more naturally to the 2D plot and keeps the focus on the current problem.

Member Author


You're right. The statement would be valid for HistGradientBoostingRegressor(loss="poisson"), but I'll remove it from here. This subject is already discussed in #18309.

Member

@jjerphan jjerphan left a comment


My (hopefully) penultimate review.

I still need to go through _compute_interactions and some tests.

Comment on lines +642 to +644
Indices of the interaction sets that have to be applied on splits of
child nodes. The fewer sets the stronger the constraint as fewer sets
contain fewer features.
Member


I do not understand the last sentence. One can have two sets each containing half of the features, right? In that case, the tree structure won't be as constrained as it would be if the feature interactions were defined by a lot of singletons. Or is my understanding wrong?

Member Author


It means that the fewer entries interaction_cst_indices has, the smaller the choice of possible features to split on. "Stronger constraint" means having a larger impact by allowing only very few features to split on.
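To illustrate (a hypothetical helper, not the actual scikit-learn code): after a split on a feature, a child node keeps only the constraint sets that contain that feature, and its allowed split candidates are the union of those sets, so fewer surviving sets generally means fewer allowed features:

```python
# Hypothetical sketch of how allowed features shrink down a branch.
def child_allowed_features(interaction_cst, parent_set_indices, split_feature):
    """Keep the sets containing split_feature; allowed features are their union."""
    child_indices = [
        i for i in parent_set_indices if split_feature in interaction_cst[i]
    ]
    allowed = set().union(*(interaction_cst[i] for i in child_indices))
    return child_indices, allowed

# Constraint sets as in the test above; at the root all three sets apply.
interaction_cst = [{0, 1}, {1, 2}, {3, 4, 5}]
idx, allowed = child_allowed_features(interaction_cst, [0, 1, 2], 1)
print(idx, allowed)  # [0, 1] {0, 1, 2}
```

Splitting the root on feature 1 keeps sets {0, 1} and {1, 2}; a further split on feature 0 in that child would narrow the allowed features down to {0, 1}.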

Member

@jjerphan jjerphan left a comment


All in 🧈 (from the German "Alles in Butter": everything's fine)! Thank you for your patience, @lorentzenchr.

Here are some last comments.

@lorentzenchr
Member Author

@jjerphan Thank you so much for your thorough review. And now, shall we let our 🧈 do its work? (Can one say that???) 😄

@lorentzenchr
Member Author

Presumably one of my last remarks: we could cite

M. Mayer, S.C. Bourassa, M. Hoesli & D.F. Scognamiglio. (2022). "Machine Learning Applications to Land and Structure Valuation". Journal of Risk and Financial Management.
https://doi.org/10.3390/jrfm15050193

In Chapter 2.2, they explain interaction constraints nicely with some diagrams of trees, and refer to the paper that originally proposed interaction constraints:

Lee, Simon C. K., Sheldon Lin, and Katrien Antonio. 2015. Delta Boosting Machine and Its Application in Actuarial Modeling. Sydney: Institute of Actuaries of Australia.

@jjerphan
Member

@jjerphan Thank you so much for your thorough review.

You're welcome. Thank you for proposing the call, this gave me a reason to go through your PR again.

And now, shall we let our 🧈 do its work? (Can one say that???) 😄

I don't know about this phrase. If I had to pick one, I would say: « Emballé, c'est pesé ! 📪 » (roughly "wrapped and weighed", i.e. done and dusted).

- add reference Machine Learning Applications to Land and Structure Valuation
- add arxiv qualifier for G. Louppe
Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan changed the title [MRG] FEA add interaction constraints to HGBT ENH FEA add interaction constraints to HGBT Oct 11, 2022
@thomasjpfan thomasjpfan merged commit 5ceb8a6 into scikit-learn:main Oct 11, 2022
@lorentzenchr
Member Author

@jjerphan @thomasjpfan Thank you so much for your time, thoughts and effort while reviewing. It improved this PR a lot. This PR is an important one for me because it enables a lot of really cool use cases of high relevance in practice.

@lorentzenchr lorentzenchr deleted the hgbt_interaction_constraints branch October 11, 2022 21:51
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2022
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
@amueller
Member

amueller commented Nov 9, 2022

This is so awesome, thank you for all the hard work!

Labels
cython, High Priority, module:ensemble
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

HistGradientBoosting* interaction constraints
10 participants