[MRG] DOC User guide section for histogram-based GBDTs #14525

Merged: 17 commits, merged on Aug 5, 2019

200 changes: 158 additions & 42 deletions doc/modules/ensemble.rst
@@ -440,59 +440,37 @@ Gradient Tree Boosting
======================

`Gradient Tree Boosting <https://en.wikipedia.org/wiki/Gradient_boosting>`_
or Gradient Boosted Regression Trees (GBRT) is a generalization
or Gradient Boosted Decision Trees (GBDT) is a generalization
of boosting to arbitrary
differentiable loss functions. GBRT is an accurate and effective
differentiable loss functions. GBDT is an accurate and effective
off-the-shelf procedure that can be used for both regression and
classification problems. Gradient Tree Boosting models are used in a
classification problems in a
variety of areas including Web search ranking and ecology.

The advantages of GBRT are:

+ Natural handling of data of mixed type (= heterogeneous features)

+ Predictive power

+ Robustness to outliers in output space (via robust loss functions)

The disadvantages of GBRT are:

+ Scalability, due to the sequential nature of boosting it can
hardly be parallelized.

The module :mod:`sklearn.ensemble` provides methods
for both classification and regression via gradient boosted regression
for both classification and regression via gradient boosted decision
trees.


.. note::

Scikit-learn 0.21 introduces two new experimental implementation of
Scikit-learn 0.21 introduces two new experimental implementations of
gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor`, inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_. These fast estimators
first bin the input samples ``X`` into integer-valued bins (typically 256
bins) which tremendously reduces the number of splitting points to
consider, and allow the algorithm to leverage integer-based data
structures (histograms) instead of relying on sorted continuous values.

The new histogram-based estimators can be orders of magnitude faster than
their continuous counterparts when the number of samples is larger than
tens of thousands of samples. The API of these new estimators is slightly
different, and some of the features from :class:`GradientBoostingClassifier`
and :class:`GradientBoostingRegressor` are not yet supported.

These new estimators are still **experimental** for now: their predictions
and their API might change without any deprecation cycle. To use them, you
need to explicitly import ``enable_hist_gradient_boosting``::

>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier
`LightGBM <https://github.com/Microsoft/LightGBM>`_.

These histogram-based estimators can be **orders of magnitude faster**
than :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` when the number of samples is larger
than tens of thousands.

They also have built-in support for missing values, which avoids the need
for an imputer.

These estimators are described in more detail below in
:ref:`histogram_based_gradient_boosting`.

The following guide focuses on :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` only, which might be preferred for small
:class:`GradientBoostingRegressor`, which might be preferred for small
sample sizes since binning may lead to split points that are too approximate
in this setting.

@@ -526,7 +504,8 @@ The number of weak learners (i.e. regression trees) is controlled by the paramet
thus, the total number of induced trees equals
``n_classes * n_estimators``. For datasets with a large number
of classes we strongly recommend to use
:class:`RandomForestClassifier` as an alternative to :class:`GradientBoostingClassifier` .
:class:`HistGradientBoostingClassifier` as an alternative to
:class:`GradientBoostingClassifier`.

Regression
----------
@@ -838,7 +817,144 @@ accessed via the ``feature_importances_`` property::

* :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regression.py`

.. _voting_classifier:
.. _histogram_based_gradient_boosting:

Histogram-Based Gradient Boosting
=================================

Scikit-learn 0.21 introduces two new experimental implementations of
gradient boosting trees, namely :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor`, inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.

These histogram-based estimators can be **orders of magnitude faster**
than :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor` when the number of samples is larger
than tens of thousands.

They also have built-in support for missing values, which avoids the need
for an imputer.
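
For example, a minimal sketch that fits a regressor directly on data
containing a missing value (``min_samples_leaf=1`` is only passed because the
toy dataset below is tiny)::

>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> import numpy as np

>>> X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)  # one missing entry
>>> y = [0, 0, 1, 1]
>>> gbdt = HistGradientBoostingRegressor(min_samples_leaf=1).fit(X, y)
>>> gbdt.predict(X).shape  # NaN entries are handled without imputation
(4,)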

These fast estimators first bin the input samples ``X`` into
integer-valued bins (typically 256 bins), which tremendously reduces the
number of splitting points to consider and allows the algorithm to
leverage integer-based data structures (histograms) instead of relying on
sorted continuous values when building the trees. The API of these
estimators is slightly different, and some features of
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`
are not yet supported, in particular sample weights and some loss
functions.

These estimators are still **experimental**: their predictions
and their API might change without any deprecation cycle. To use them, you
need to explicitly import ``enable_hist_gradient_boosting``::

>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> # now you can import normally from ensemble

Member: This may be a bit redundant but fine.

>>> from sklearn.ensemble import HistGradientBoostingClassifier

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py`

Usage
-----

Most of the parameters are unchanged from
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`.
One exception is the ``max_iter`` parameter, which replaces ``n_estimators`` and
controls the number of iterations of the boosting process:

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2

>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)

>>> clf.score(X_test, y_test)
0.8998

The size of the trees can be controlled through the ``max_leaf_nodes``,
``max_depth``, and ``min_samples_leaf`` parameters.

The number of bins used to bin the data is controlled with the ``max_bins``
parameter. Using fewer bins acts as a form of regularization. It is
generally recommended to use as many bins as possible, which is the default.

The ``l2_regularization`` parameter is a regularizer on the loss function and
corresponds to :math:`\lambda` in equation (2) of [XGBoost]_.
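
For illustration, these parameters can be combined when constructing an
estimator; the values below are arbitrary and only meant as a sketch::

>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier

>>> clf = HistGradientBoostingClassifier(
...     max_leaf_nodes=15,        # smaller trees
...     max_bins=128,             # coarser binning, acts as regularization
...     l2_regularization=1.0)    # penalty on the leaf values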

Note that **early-stopping is enabled by default**. The early-stopping
behaviour is controlled via the ``scoring``, ``validation_fraction``,
``n_iter_no_change``, and ``tol`` parameters. It is possible to early-stop
using an arbitrary :term:`scorer`, or just the training or validation loss. By
default, early-stopping is performed using the default :term:`scorer` of
the estimator on a validation set.
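
As a sketch with arbitrary values, a classifier can for instance be set up to
stop early based on the validation loss instead of the default scorer::

>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2

>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = HistGradientBoostingClassifier(
...     max_iter=1000,
...     scoring='loss',            # monitor the loss rather than a scorer
...     validation_fraction=0.1,   # held-out fraction used for early stopping
...     n_iter_no_change=5,
...     tol=1e-7).fit(X, y)
>>> n_used = clf.n_iter_  # iterations actually run, typically below max_iter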

Low-level parallelism
---------------------

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` have implementations that use OpenMP
for parallelization through Cython. The number of threads used can
be changed via the ``OMP_NUM_THREADS`` environment variable. By default,
all available cores are used. Please refer to the OpenMP documentation for
details.

Member: Link to OpenMP documentation?

Member Author: (Weird that the previous comments got lost, we discussed this with Adrin already.)
I don't want to add it because the OpenMP doc is a 300-page PDF. Googling OMP_NUM_THREADS gives you the docs from Intel and IBM, but they're probably not official.


The following parts are parallelized:

- mapping samples from real values to integer-valued bins (finding the bin
thresholds, however, is sequential)
- building histograms is parallelized over features
- finding the best split point at a node is parallelized over features
- during fit, mapping samples into the left and right children is
parallelized over samples
- gradient and hessian computations are parallelized over samples
- predicting is parallelized over samples

Why it's faster
---------------

The bottleneck of a gradient boosting procedure is building the decision
trees. Building a traditional decision tree (as in the other GBDTs
:class:`GradientBoostingClassifier` and :class:`GradientBoostingRegressor`)
requires sorting the samples at each node (for
each feature). Sorting is needed so that the potential gain of a split point
can be computed efficiently. Splitting a single node thus has a complexity
of :math:`\mathcal{O}(n_\text{features} \times n \log(n))`, where :math:`n` is the
number of samples at the node.

:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor`, in contrast, do not require sorting the
feature values and instead use a data structure called a histogram, where the
samples are implicitly ordered. Building a histogram has a
:math:`\mathcal{O}(n)` complexity, so the node splitting procedure has a
:math:`\mathcal{O}(n_\text{features} \times n)` complexity, much smaller than the
previous one. In addition, instead of considering :math:`n` split points,
only ``max_bins`` split points are considered here, which is much smaller.

In order to build histograms, the input data `X` needs to be binned into
integer-valued bins. This binning procedure does require sorting the feature
values, but it only happens once at the very beginning of the boosting process
(not at each node, like in :class:`GradientBoostingClassifier` and
:class:`GradientBoostingRegressor`).
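
Putting these pieces together, and treating the number of nodes per tree as a
constant factor, the overall training costs compare roughly as follows (a
back-of-the-envelope estimate rather than an exact bound):

.. math::

    T_{\mathrm{exact}} \approx \mathcal{O}(n_\text{iterations} \times n_\text{features} \times n \log(n))

    T_{\mathrm{hist}} \approx \mathcal{O}(n_\text{features} \times n \log(n) + n_\text{iterations} \times n_\text{features} \times n)

where the first term of :math:`T_{\mathrm{hist}}` is the one-off binning step and
the second covers the histogram-based splits of all boosting iterations.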

Finally, many parts of the implementation of
:class:`HistGradientBoostingClassifier` and
:class:`HistGradientBoostingRegressor` are parallelized.

.. topic:: References

.. [XGBoost] Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree
Boosting System". https://arxiv.org/abs/1603.02754
.. [LightGBM] Ke et al. "LightGBM: A Highly Efficient Gradient
Boosting Decision Tree"

.. _voting_classifier:

Voting Classifier
========================
19 changes: 5 additions & 14 deletions sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
@@ -619,13 +619,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):

This estimator is much faster than
:class:`GradientBoostingRegressor<sklearn.ensemble.GradientBoostingRegressor>`
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
into integer-valued bins, which considerably reduces the number of
splitting points to consider, and allows the algorithm to leverage
integer-based data structures. For small sample sizes,
:class:`GradientBoostingRegressor<sklearn.ensemble.GradientBoostingRegressor>`
might be preferred since binning may lead to split points that are too
approximate in this setting.
for big datasets (n_samples >= 10 000).

This implementation is inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.
@@ -641,6 +635,7 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier

Read more in the :ref:`User Guide <histogram_based_gradient_boosting>`.

Parameters
----------
@@ -792,13 +787,7 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,

This estimator is much faster than
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
into integer-valued bins, which considerably reduces the number of
splitting points to consider, and allows the algorithm to leverage
integer-based data structures. For small sample sizes,
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
might be preferred since binning may lead to split points that are too
approximate in this setting.
for big datasets (n_samples >= 10 000).

This implementation is inspired by
`LightGBM <https://github.com/Microsoft/LightGBM>`_.
@@ -814,6 +803,8 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier

Read more in the :ref:`User Guide <histogram_based_gradient_boosting>`.

Parameters
----------
loss : {'auto', 'binary_crossentropy', 'categorical_crossentropy'}, \