[MRG] DOC User guide section for histogram-based GBDTs #14525

Merged: 17 commits, merged on Aug 5, 2019

200 changes: 158 additions & 42 deletions doc/modules/ensemble.rst
@@ -440,59 +440,37 @@ Gradient Tree Boosting
======================

`Gradient Tree Boosting <https://en.wikipedia.org/wiki/Gradient_boosting>`_
or Gradient Boosted Regression Trees (GBRT) is a generalization
or Gradient Boosted Decision Trees (GBDT) is a generalization
of boosting to arbitrary
differentiable loss functions. GBRT is an accurate and effective
differentiable loss functions. GBDT is an accurate and effective
off-the-shelf procedure that can be used for both regression and
classification problems. Gradient Tree Boosting models are used in a
classification problems in a
variety of areas including Web search ranking and ecology.

The advantages of GBRT are:

+ Natural handling of data of mixed type (= heterogeneous features)

+ Predictive power

+ Robustness to outliers in output space (via robust loss functions)

The disadvantages of GBRT are:

+ Scalability, due to the sequential nature of boosting it can
hardly be parallelized.

The module :mod:`sklearn.ensemble` provides methods
for both classification and regression via gradient boosted regression
for both classification and regression via gradient boosted decision
trees.


.. note::

Scikit-learn 0.21 introduces two new experimental implementation of
Scikit-learn 0.21 introduces two new experimental implementations of
gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor`, inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_. These fast estimators
first bin the input samples ``X`` into integer-valued bins (typically 256
bins) which tremendously reduces the number of splitting points to
consider, and allow the algorithm to leverage integer-based data
structures (histograms) instead of relying on sorted continuous values.

The new histogram-based estimators can be orders of magnitude faster than
their continuous counterparts when the number of samples is larger than
tens of thousands of samples. The API of these new estimators is slightly
different, and some of the features from :class:`GradientBoostingClassifier`
and :class:`GradientBoostingRegressor` are not yet supported.

These new estimators are still **experimental** for now: their predictions
and their API might change without any deprecation cycle. To use them, you
need to explicitly import ``enable_hist_gradient_boosting``::

>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier
`LightGBM <https://github.com/Microsoft/LightGBM>`_.

These histogram-based estimators can be **orders of magnitude faster**
than :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` when the number of samples is larger
than tens of thousands.

They also have built-in support for missing values, which avoids the need
for an imputer.

These estimators are described in more detail below in
:ref:`histogram_based_gradient_boosting`.

The following guide focuses on :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` only, which might be preferred for small
:class:`GradientBoostingRegressor`, which might be preferred for small
sample sizes since binning may lead to split points that are too approximate
in this setting.

@@ -526,7 +504,8 @@ The number of weak learners (i.e. regression trees) is controlled by the paramet
thus, the total number of induced trees equals
``n_classes * n_estimators``. For datasets with a large number
of classes we strongly recommend to use
:class:`RandomForestClassifier` as an alternative to :class:`GradientBoostingClassifier` .
:class:`HistGradientBoostingClassifier` as an alternative to
:class:`GradientBoostingClassifier`.

Regression
----------
@@ -838,7 +817,144 @@ accessed via the ``feature_importances_`` property::

* :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`

.. _voting_classifier:
.. _histogram_based_gradient_boosting:

Histogram-Based Gradient Boosting
=================================

Scikit-learn 0.21 introduces two new experimental implementations of
gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor`, inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.

These histogram-based estimators can be **orders of magnitude faster**
than :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` when the number of samples is larger
than tens of thousands.

They also have built-in support for missing values, which avoids the need
for an imputer.
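
For example, a minimal sketch that fits a regressor directly on data
containing a missing value (``min_samples_leaf=1`` is only passed because the
toy dataset below is tiny)::

>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> import numpy as np

>>> X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)  # one missing entry
>>> y = [0, 0, 1, 1]
>>> gbdt = HistGradientBoostingRegressor(min_samples_leaf=1).fit(X, y)
>>> gbdt.predict(X).shape  # NaN entries are handled without imputation
(4,)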

These fast estimators first bin the input samples ``X`` into
integer-valued bins (typically 256 bins), which tremendously reduces the
number of splitting points to consider and allows the algorithm to
leverage integer-based data structures (histograms) instead of relying on
sorted continuous values when building the trees. The API of these
estimators is slightly different, and some features of
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`
are not yet supported, in particular sample weights and some loss
functions.

These estimators are still **experimental**: their predictions
and their API might change without any deprecation cycle. To use them, you
need to explicitly import ``enable_hist_gradient_boosting``::

>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> # now you can import normally from ensemble

Member: This may be a bit redundant but fine.

>>> from sklearn.ensemble import HistGradientBoostingClassifier

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py`

Usage
-----

Most of the parameters are unchanged from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`.
One exception is the ``max_iter`` parameter, which replaces ``n_estimators`` and
controls the number of iterations of the boosting process:

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2

>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)

>>> clf.score(X_test, y_test)
0.8998

The size of the trees can be controlled through the ``max_leaf_nodes``,
``max_depth``, and ``min_samples_leaf`` parameters.

The number of bins used to bin the data is controlled with the ``max_bins``
parameter. Using fewer bins acts as a form of regularization. It is
generally recommended to use as many bins as possible, which is the default.

The ``l2_regularization`` parameter is a regularizer on the loss function and
corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.
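
For illustration, these parameters can be combined when constructing an
estimator; the values below are arbitrary and only meant as a sketch::

>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier

>>> clf = HistGradientBoostingClassifier(
...     max_leaf_nodes=15,        # smaller trees
...     max_bins=128,             # coarser binning, acts as regularization
...     l2_regularization=1.0)    # penalty on the leaf values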

Note that **early-stopping is enabled by default**. The early-stopping
behaviour is controlled via the ``scoring``, ``validation_fraction``,
``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop
using an arbitrary :term:`scorer`, or just the training or validation loss. By
default, early-stopping is performed using the default :term:`scorer` of
the estimator on a validation set.
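
As a sketch with arbitrary values, a classifier can for instance be set up to
stop early based on the validation loss instead of the default scorer::

>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2

>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = HistGradientBoostingClassifier(
...     max_iter=1000,
...     scoring='loss',            # monitor the loss rather than a scorer
...     validation_fraction=0.1,   # held-out fraction used for early stopping
...     n_iter_no_change=5,
...     tol=1e-7).fit(X, y)
>>> n_used = clf.n_iter_  # iterations actually run, typically below max_iter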

Low-level parallelism
---------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have implementations that use OpenMP
for parallelization through Cython. The number of threads used can
be changed via the ``OMP_NUM_THREADS`` environment variable. By default,
all available cores are used. Please refer to the OpenMP documentation for
details.

Member: Link to OpenMP documentation?

Member Author: (Weird that the previous comments got lost, we discussed this with Adrin already.)
I don't want to add it because the OpenMP doc is a 300-page PDF. Googling OMP_NUM_THREADS gives you the docs from Intel and IBM, but they're probably not official.


The following parts are parallelized:

- mapping samples from real values to integer-valued bins (finding the bin
thresholds, however, is sequential)
- building histograms is parallelized over features
- finding the best split point at a node is parallelized over features
- during fit, mapping samples into the left and right children is
parallelized over samples
- gradient and hessian computations are parallelized over samples
- predicting is parallelized over samples

Why it's faster
---------------

The bottleneck of a gradient boosting procedure is building the decision
trees. Building a traditional decision tree (as in the other GBDTs
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`)
requires sorting the samples at each node (for
each feature). Sorting is needed so that the potential gain of a split point
can be computed efficiently. Splitting a single node thus has a complexity
of :math:`\mathcal{O}(n_\text{features} \times n \log(n))`, where :math:`n` is the
number of samples at the node.

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting the
feature values and instead use a data structure called a histogram, where the
samples are implicitly ordered. Building a histogram has a
:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a
:math:`\mathcal{O}(n_\text{features} \times n)` complexity, much smaller than the
previous one. In addition, instead of considering :math:`n` split points,
only ``max_bins`` split points are considered here, which is much smaller.

In order to build histograms, the input data `X` needs to be binned into
integer-valued bins. This binning procedure does require sorting the feature
values, but it only happens once at the very beginning of the boosting process
(not at each node, like in :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor`).
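
Putting these pieces together, and treating the number of nodes per tree as a
constant factor, the overall training costs compare roughly as follows (a
back-of-the-envelope estimate rather than an exact bound):

.. math::

    T_{\mathrm{exact}} \approx \mathcal{O}(n_\text{iterations} \times n_\text{features} \times n \log(n))

    T_{\mathrm{hist}} \approx \mathcal{O}(n_\text{features} \times n \log(n) + n_\text{iterations} \times n_\text{features} \times n)

where the first term of :math:`T_{\mathrm{hist}}` is the one-off binning step and
the second covers the histogram-based splits of all boosting iterations.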

Finally, many parts of the implementation of
:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` are parallelized.

.. topic:: References

.. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree
Boosting System". https://arxiv.org/abs/1603.02754
.. [LightGBM] Ke et al. "LightGBM: A Highly Efficient Gradient
Boosting Decision Tree"

.. _voting_classifier:

Voting Classifier
========================
19 changes: 5 additions & 14 deletions sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
@@ -619,13 +619,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):

This estimator is much faster than
:class:`GradientBoostingRegressor<sklearn.ensemble.GradientBoostingRegressor>`
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
into integer-valued bins, which considerably reduces the number of
splitting points to consider, and allows the algorithm to leverage
integer-based data structures. For small sample sizes,
:class:`GradientBoostingRegressor<sklearn.ensemble.GradientBoostingRegressor>`
might be preferred since binning may lead to split points that are too
approximate in this setting.
for big datasets (n_samples >= 10 000).

This implementation is inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.
@@ -641,6 +635,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier

Read more in the :ref:`User Guide <histogram_based_gradient_boosting>`.

Parameters
----------
@@ -792,13 +787,7 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,

This estimator is much faster than
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
into integer-valued bins, which considerably reduces the number of
splitting points to consider, and allows the algorithm to leverage
integer-based data structures. For small sample sizes,
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
might be preferred since binning may lead to split points that are too
approximate in this setting.
for big datasets (n_samples >= 10 000).

This implementation is inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.
@@ -814,6 +803,8 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier

Read more in the :ref:`User Guide <histogram_based_gradient_boosting>`.

Parameters
----------
loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \