
[MRG] Monotonic constraints for GBDT #15582


Merged: 63 commits into scikit-learn:master on Mar 24, 2020

Conversation

@NicolasHug (Member) commented Nov 9, 2019

This PR adds support for monotonic constraints to the histogram-based GBDTs.

Addresses #6656

(see https://xgboost.readthedocs.io/en/latest/tutorials/monotonic.html)

The API is to pass e.g. HistGradientBoostingRegressor(monotonic_cst=[-1, 1])
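A minimal usage sketch of that API (toy data; at the time of this PR the estimator is still experimental, so it needs the enable_hist_gradient_boosting import):

```python
# Minimal sketch of the proposed API: one constraint per feature,
# +1 = monotonically increasing, -1 = decreasing, 0 = unconstrained.
import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(1000, 2)
y = 5 * X[:, 0] - 3 * X[:, 1] + 0.1 * rng.randn(1000)

# Predictions must be increasing in feature 0 and decreasing in feature 1.
gbdt = HistGradientBoostingRegressor(monotonic_cst=[1, -1])
gbdt.fit(X, y)
```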


For reviewers: the overall logic is pretty simple, but this still involved a lot of changes because I had to refactor the splitter: all nodes now need a value, whereas previously only leaves had one.

@ogrisel @adrinjalali @thomasjpfan @amueller @glemaitre might be interested in this (after the release of course)

@amueller (Member):

why not make it an init argument for now?

@NicolasHug (Member, Author) commented Nov 10, 2019

This is more of a fit parameter, right? Since the constraints are a property of X.

I'm open to anything, but ideally we wouldn't merge this PR if we knew for sure that the API would change in the future.

Comment on lines 562 to 589
    if gain > best_gain and gain > self.min_gain_to_split:
        found_better_split = True
        best_gain = gain
        best_bin_idx = bin_idx
        best_sum_gradient_left = sum_gradient_left
        best_sum_hessian_left = sum_hessian_left
        best_n_samples_left = n_samples_left

if found_better_split:
    split_info.gain = best_gain
    split_info.bin_idx = best_bin_idx
    # we scan from left to right so missing values go to the right
    split_info.missing_go_to_left = False
    split_info.sum_gradient_left = best_sum_gradient_left
    split_info.sum_gradient_right = sum_gradients - best_sum_gradient_left
    split_info.sum_hessian_left = best_sum_hessian_left
    split_info.sum_hessian_right = sum_hessians - best_sum_hessian_left
    split_info.n_samples_left = best_n_samples_left
    split_info.n_samples_right = n_samples - best_n_samples_left

    # We recompute best values here but it's cheap
    split_info.value_left = compute_value(
        split_info.sum_gradient_left, split_info.sum_hessian_left,
        lower_bound, upper_bound, self.l2_regularization)

    split_info.value_right = compute_value(
        split_info.sum_gradient_right, split_info.sum_hessian_right,
        lower_bound, upper_bound, self.l2_regularization)
@NicolasHug (Member, Author):

For reviewers: apart from the leaf value computation (which is new), the logic here is actually unchanged.

It's mostly just a small optimization: instead of setting all the attributes of the split_info at each iteration, we only do it once at the end.

Comment on lines -641 to -645
gain = negative_loss(sum_gradient_left, sum_hessian_left,
                     l2_regularization)
gain += negative_loss(sum_gradient_right, sum_hessian_right,
                      l2_regularization)
gain -= negative_loss_current_node
@NicolasHug (Member, Author):

For reviewers:

The gain computation has been updated to:

  • first compute the values of the left and right children
  • cap those values according to the bounds imposed by the monotonic constraints
  • discard any split that does not respect left < right (INC) or right < left (DEC)
  • compute the loss reduction from the previously computed (bounded) values

(A rough sketch of this logic is given right after this list.)
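In plain Python (compute_value and loss_from_value only loosely mirror the PR's helpers; the signatures and bound handling here are illustrative, not the actual Cython code):

```python
# Illustrative constrained split gain, following the steps listed above.
# monotonic_cst: +1 for an increasing constraint, -1 for decreasing, 0 for none.
def split_gain(sum_grad_left, sum_hess_left, sum_grad_right, sum_hess_right,
               loss_current_node, monotonic_cst, lower_bound, upper_bound,
               l2_regularization):

    def compute_value(sum_grad, sum_hess):
        # Newton-step value of a child, clipped to the node's allowed bounds.
        value = -sum_grad / (sum_hess + l2_regularization)
        return min(max(value, lower_bound), upper_bound)

    def loss_from_value(value, sum_grad):
        # Loss contribution of a node evaluated at its (possibly clipped) value.
        return sum_grad * value

    value_left = compute_value(sum_grad_left, sum_hess_left)
    value_right = compute_value(sum_grad_right, sum_hess_right)

    # Discard splits that violate the monotonic constraint on this feature.
    if monotonic_cst == +1 and value_left > value_right:
        return float('-inf')
    if monotonic_cst == -1 and value_left < value_right:
        return float('-inf')

    # Loss reduction: both children are evaluated at their bounded values.
    gain = loss_current_node
    gain -= loss_from_value(value_left, sum_grad_left)
    gain -= loss_from_value(value_right, sum_grad_right)
    return gain
```

Returning negative infinity here just means the candidate split is never selected.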

Comment on lines 142 to 143
# Considering the following tree with a monotonic INC constraint, we
# should have:
@NicolasHug (Member, Author):

Note for reviewers: start here

@NicolasHug (Member, Author) left a comment

Thanks a lot for the review!

I addressed most comments and will try a fix tomorrow for the gain computation.

# or all decreasing (or neither) depending on the monotonic constraint.
nodes = predictor.nodes

def get_leaves_values():
@NicolasHug (Member, Author):

For me that would suggest that there is only one leaf and that one leaf has multiple values.

Thoughts @jnothman ?

dfs(node.left_child)
dfs(node.right_child)

dfs(grower.root)
@NicolasHug (Member, Author):

I'm not sure we can do that here because, unlike the predictor object, the grower does not provide an array of the nodes. It only has the root and a finalized_leaves list, which isn't enough for us here.

gain -= _loss_from_value(value_right, sum_gradient_right) # with bounds
# Note that the losses for both children are computed with bounded values,
# while the loss of the current node isn't. It's OK since all the gain
# comparisons will be made with the same loss_current_node anyway.
@NicolasHug (Member, Author):

I'm reasonably confident that lightgbm does it similarly: see https://github.com/microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp#L160, where the gain of the current node to be split will not take constraints into account (follow BeforeNumercal and then GetLeafGain). But I could be missing something.

You're right that it doesn't work well with min_gain_to_split. I guess the correct way would be to pass the current node's value into find_node_split.

I'll try tomorrow and will report back!

NicolasHug and others added 2 commits March 20, 2020 07:23
@NicolasHug (Member, Author):

OK, I pushed something and added a test.

I'm still a little bit hesitant about what we should be doing. I think that in the end, it all boils down to how we define the loss at a given node:

  1. loss = -sum_grad**2 / sum_hessians

or

  2. loss = sum_grad * value, where value = clip(-sum_grad / sum_hessians)

They're both equivalent if no clipping happens (the value is then -sum_grad / sum_hessians, so sum_grad * value reduces to -sum_grad**2 / sum_hessians), as in the XGBoost paper.

In any case, I agree we should be consistent and use the same formula for all nodes. In the previous version (and in LightGBM, as far as I understand), we were using 1 for the current node and 2 for the children. Now we're using 2 for all nodes (see the sketch below).
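For concreteness, a small side-by-side sketch of the two definitions (plain Python, illustrative names only):

```python
def loss_1(sum_grad, sum_hess):
    # Definition 1: the usual (unclipped) XGBoost-style node loss.
    return -sum_grad ** 2 / sum_hess

def loss_2(sum_grad, sum_hess, lower_bound, upper_bound):
    # Definition 2: loss evaluated at the node value, which may be clipped
    # by the monotonic-constraint bounds.
    value = min(max(-sum_grad / sum_hess, lower_bound), upper_bound)
    return sum_grad * value

# Without clipping the two coincide: value = -g / h, so g * value = -g**2 / h.
assert loss_1(-4.0, 2.0) == loss_2(-4.0, 2.0, float('-inf'), float('inf'))
```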

WDYT @ogrisel, good to go?

@ogrisel (Member) left a comment

LGTM. Can you just do a quick benchmark to check that the latest change did not cause a significant performance regression w.r.t. master?

@NicolasHug (Member, Author):

Thanks for the reminder,

There actually was a slow-down due to compute_node_value having lots of Python interactions. I fixed it by moving it into splitting.pyx instead of common.pyx (some Cython magic at work again).

With python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 500 --subsample 500000 --n-leaf-nodes 255,

I get about 51 sec on the current branch and 47 sec on master now.

So there's still a slow-down, but as far as I can tell the Cython code is just about as fast. It seems we're spending more time in the Python part.

That might be due to the apply_shrinkage() method, which has to go through all the leaves at the end. I tried removing it, but the speed gain didn't seem significant (and it made the code quite a bit uglier).

@ogrisel (Member) commented Mar 22, 2020

Maybe you can use py-spy with native code profiling enabled both on master and on this branch and compare the 2 flamegraphs to help identify the discrepancy?

@NicolasHug (Member, Author):

Great suggestion!

It turns out the grower is spending a significant amount of time setting the bounds of the children. That's where the difference came from. I'm quite surprised because there's no constraint at all in the benchmark. But I guess it adds up in the end since this is done for every single node.
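For illustration, the kind of fast path this suggests (hypothetical names and structure, not the actual change in 83dba40):

```python
from dataclasses import dataclass

@dataclass
class TreeNode:
    # Illustrative stand-in for the grower's node, not the actual class.
    lower_bound: float = float('-inf')
    upper_bound: float = float('inf')

def set_children_bounds(has_monotonic_cst, parent, left, right):
    # Fast path: with no monotonic constraints anywhere in the model, every
    # node keeps (-inf, +inf) bounds, so there is nothing to update per node.
    if not has_monotonic_cst:
        return
    # Otherwise propagate the parent's bounds to both children; a constrained
    # split feature would further tighten them around the split's mid-value
    # (omitted in this sketch).
    for child in (left, right):
        child.lower_bound = parent.lower_bound
        child.upper_bound = parent.upper_bound
```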

I added a fast way in 83dba40. Now both times are similar:

Times in seconds, aggregated over 10 runs per branch:

| stage                 | branch        | count | mean    | std       | min    | 50%     | max    |
|-----------------------|---------------|-------|---------|-----------|--------|---------|--------|
| total                 | master        | 10    | 45.6105 | 0.723786  | 44.832 | 45.4225 | 47.25  |
| total                 | monotonic_cst | 10    | 45.3345 | 0.405736  | 44.625 | 45.32   | 46.034 |
| histogram computation | master        | 10    | 19.1552 | 0.572901  | 18.522 | 18.976  | 20.476 |
| histogram computation | monotonic_cst | 10    | 19.0646 | 0.426279  | 18.43  | 19.047  | 19.878 |
| finding best splits   | master        | 10    | 4.1208  | 0.117993  | 4.003  | 4.1025  | 4.398  |
| finding best splits   | monotonic_cst | 10    | 4.0572  | 0.0953832 | 3.879  | 4.076   | 4.162  |
| applying splits       | master        | 10    | 7.4773  | 0.159384  | 7.289  | 7.428   | 7.809  |
| applying splits       | monotonic_cst | 10    | 7.3875  | 0.0822398 | 7.225  | 7.3965  | 7.496  |
| predicting            | master        | 10    | 0.7342  | 0.0298619 | 0.687  | 0.7285  | 0.781  |
| predicting            | monotonic_cst | 10    | 0.7201  | 0.0145789 | 0.704  | 0.716   | 0.747  |

(I ran them on a different machine from #15582 (comment), in case you're wondering why it's faster in both cases)

@adrinjalali (Member) left a comment

Thanks @NicolasHug, other than the one concern, looks all good.

@ogrisel (Member) commented Mar 24, 2020

Alright, very nice work. Let's merge!

@ogrisel ogrisel merged commit 36ebf3e into scikit-learn:master Mar 24, 2020
@NicolasHug (Member, Author):

Thanks a lot @adrinjalali and @ogrisel for the reviews!
