
[MRG] FIX max_leaf_node and max_depth interaction in GBDT #16183


Merged
merged 11 commits on Feb 7, 2020

Conversation

@NicolasHug NicolasHug commented Jan 23, 2020

Fixes #16179

The fix is simply to check for max_leaf_nodes before checking for max_depth.

This makes sure that _finalize_splittable_nodes is called, which was the cause of the bug.
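To see why the ordering matters, here is a self-contained toy sketch of best-first growth (simplified, with made-up names; not the actual grower code). With the depth check first, the iteration on which the leaf count reaches the cap can be swallowed by that branch, so the equality test against max_leaf_nodes never fires and the tree overshoots the cap:

```python
import heapq

def grow(max_depth, max_leaf_nodes, leaf_check_first):
    """Toy best-first grower: every node is splittable until a limit hits."""
    splittable = [(-1.0, 0)]  # (negative gain, depth); root at depth 0
    n_leaves = 1              # the root starts as the only leaf

    while splittable:
        _, depth = heapq.heappop(splittable)
        n_leaves += 1         # one leaf becomes two leaves: net +1

        if leaf_check_first and n_leaves == max_leaf_nodes:
            # fixed order: reaching the cap also finalizes every remaining
            # splittable node (the _finalize_splittable_nodes step), then stops
            return n_leaves

        if depth + 1 == max_depth:
            # children sit at max depth: finalize them and keep growing the
            # rest of the frontier -- note that this skips the leaf check
            continue

        if not leaf_check_first and n_leaves == max_leaf_nodes:
            # buggy order: if the cap is reached on an iteration where the
            # depth branch fired, this equality is never seen again
            return n_leaves

        heapq.heappush(splittable, (-1.0, depth + 1))
        heapq.heappush(splittable, (-1.0, depth + 1))

    return n_leaves

print(grow(max_depth=4, max_leaf_nodes=9, leaf_check_first=True))   # 9: cap respected
print(grow(max_depth=4, max_leaf_nodes=9, leaf_check_first=False))  # 16: cap overshot
```

With max_depth=4 and max_leaf_nodes=9 (the same values as the test discussed in this PR), the fixed ordering stops at exactly 9 leaves, while the old ordering grows the full depth-4 tree with 16 leaves.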

@NicolasHug NicolasHug changed the title [WIP] FIX max_leaf_node and max_depth interaction in GBDT [MRG] FIX max_leaf_node and max_depth interaction in GBDT Jan 23, 2020
@@ -280,6 +283,18 @@ def test_max_depth(max_depth):
     assert depth == max_depth


+def test_max_depth_max_leaf_nodes():
Member

Don't you think we should move this test into test_gradient_boosting.py instead of test_grower.py?

Member Author (@NicolasHug)

No strong opinion. It's really a test of the grower code more than the estimator, but yeah, it's not obvious how to come up with something that would only involve the grower.

Member

Okay this may seem a little crazy, but:

import numpy as np
from sklearn.datasets import make_classification
# private scikit-learn imports (module paths approximate, as of ~0.22)
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
from sklearn.ensemble._hist_gradient_boosting.common import Y_DTYPE
from sklearn.ensemble._hist_gradient_boosting.grower import TreeGrower
from sklearn.ensemble._hist_gradient_boosting.loss import LeastSquares
from sklearn.ensemble._hist_gradient_boosting.gradient_boosting import (
    _update_raw_predictions)


def test_max_depth_max_leaf_nodes():
    n_samples = 1000
    n_bins = 64
    max_depth = 4
    max_leaf_nodes = 9  # 2**(max_depth - 1) + 1

    X, y = make_classification(n_samples=n_samples, random_state=42)
    X_binned = _BinMapper(n_bins=n_bins, random_state=42).fit_transform(X)
    y = y.astype(Y_DTYPE)

    loss = LeastSquares()
    baseline = loss.get_baseline_prediction(y, 1)
    raw_predictions = np.full(shape=(1, n_samples), fill_value=baseline)

    gradients, hessians = loss.init_gradients_and_hessians(
        n_samples=n_samples, prediction_dim=1)

    for i in range(5):
        loss.update_gradients_and_hessians(gradients, hessians, y,
                                           raw_predictions)
        grower = TreeGrower(X_binned, gradients[0, :], hessians[0, :],
                            n_bins=n_bins, max_leaf_nodes=max_leaf_nodes,
                            max_depth=max_depth)
        grower.grow()
        _update_raw_predictions(raw_predictions[0, :], grower)
        assert len(grower.finalized_leaves) <= max_leaf_nodes

@NicolasHug (Member Author)

You might like this new solution better, @thomasjpfan.

@NicolasHug (Member Author)

The CI failure is unrelated.

@@ -355,16 +355,16 @@ def split_next(self):

         self.n_nodes += 2

-        if self.max_depth is not None and depth == self.max_depth:
+        if (self.max_leaf_nodes is not None
Member Author (@NicolasHug)

The diff might be confusing, but all I did was check for max_leaf_nodes before checking for max_depth.
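That reordering can be sketched as a tiny standalone helper (hypothetical names; the real logic lives inline in TreeGrower.split_next and calls _finalize_leaf / _finalize_splittable_nodes):

```python
def limit_actions(n_leaf_nodes, depth, max_leaf_nodes, max_depth):
    """Which finalization steps fire after a split, with the fixed ordering:
    the leaf-count check comes first, so finalizing the remaining splittable
    nodes is guaranteed to happen when the cap is hit, even on a split that
    also reaches max_depth."""
    if max_leaf_nodes is not None and n_leaf_nodes == max_leaf_nodes:
        return ["finalize_children", "finalize_splittable_nodes"]
    if max_depth is not None and depth == max_depth:
        return ["finalize_children"]
    return []

# both limits hit on the same split: the leaf-limit branch wins
print(limit_actions(9, 4, max_leaf_nodes=9, max_depth=4))
# ['finalize_children', 'finalize_splittable_nodes']
```

Under the old ordering the depth branch would win that tie, finalizing only the two children and leaving the rest of the splittable frontier free to keep growing past the leaf cap.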

Member

This reordering was noticed ;)

@NicolasHug (Member Author)

Thanks for the reviews so far. I have moved the test and simplified it to a 3-liner. I can confirm it's a non-regression test that fails on master.

@thomasjpfan thomasjpfan left a comment

LGTM

Can confirm that test fails on master.


@adrinjalali adrinjalali left a comment

love the final fix lol.

@adrinjalali adrinjalali merged commit 98b3c7c into scikit-learn:master Feb 7, 2020
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
* fix max_leaf_node max_depth interaction

* Added test

* comment

* what's new

* simpler solution

* moved and simplified test

* typo
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Successfully merging this pull request may close these issues.

max_leaf_nodes in HistGradientBoostingRegressor changes automatically based on values in max depth during training
4 participants