Random Forests
--------------

In random forests (see :class:`RandomForestClassifier` and
:class:`RandomForestRegressor` classes), each tree in the ensemble is built
from a sample drawn with replacement (i.e., a bootstrap sample) from the
training set.

Furthermore, when splitting each node during the construction of a tree, the
best split is found either from all input features or a random subset of size
``max_features`` (see the :ref:`parameter tuning guidelines
<random_forest_parameters>` for more details).
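
As a minimal sketch (the synthetic dataset is purely illustrative), both
sources of randomness are controlled through the estimator's constructor::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    # bootstrap=True draws each tree's training set with replacement;
    # max_features restricts each split to a random subset of the features.
    forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                    max_features="sqrt", random_state=0)
    forest.fit(X, y)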

The purpose of these two sources of randomness is to decrease the variance of
the forest estimator. Indeed, individual decision trees typically exhibit high
variance and tend to overfit. The injected randomness in forests yields decision
trees with somewhat decoupled prediction errors. By taking an average of those
predictions, some errors can cancel out. Random forests achieve a reduced
variance by combining diverse trees, sometimes at the cost of a slight increase
in bias. In practice, the variance reduction is often significant, hence yielding
an overall better model.
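
The variance-reduction argument can be sketched numerically. The simulation
below (an illustration, not library code) models the idealized case of fully
decoupled trees, each predicting the true value plus independent noise, and
compares the spread of a single prediction to that of the forest average::

    import numpy as np

    rng = np.random.default_rng(0)
    n_trees, n_repeats = 100, 10_000
    # Each "tree" predicts the true value 1.0 plus its own independent noise.
    predictions = 1.0 + rng.normal(scale=0.5, size=(n_repeats, n_trees))
    print(predictions[:, 0].var())         # ~0.25: variance of a single tree
    print(predictions.mean(axis=1).var())  # ~0.25 / n_trees after averaging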

In contrast to the original publication [B2001]_, the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single class.

in bias::
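
    # A sketch comparing a single decision tree, a random forest and an
    # extra-trees model via cross-validation; the dataset and settings are
    # illustrative, and exact scores (which depend on the data) are omitted.
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_blobs
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

    X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
                      random_state=0)

    for Model in (DecisionTreeClassifier, RandomForestClassifier,
                  ExtraTreesClassifier):
        scores = cross_val_score(Model(random_state=0), X, y, cv=5)
        print(Model.__name__, scores.mean())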

.. _random_forest_parameters:

Parameters
----------

The main parameters to adjust when using these methods are ``n_estimators`` and
``max_features``. The former is the number of trees in the forest. The larger
the better, but also the longer it will take to compute. In addition, note that
results will stop getting significantly better beyond a critical number of
trees. The latter is the size of the random subsets of features to consider
when splitting a node. The lower, the greater the reduction of variance, but
also the greater the increase in bias. Empirically good default values are
``max_features=None`` (always considering all features instead of a random
subset) for regression problems, and ``max_features="sqrt"`` (using a random
subset of size ``sqrt(n_features)``) for classification tasks (where
``n_features`` is the number of features in the data). Good results are often
achieved when setting ``max_depth=None`` in combination with
``min_samples_split=2`` (i.e., when fully developing the trees). Bear in mind
though that these values are usually not optimal, and might result in models
that consume a lot of RAM. The best parameter values should always be
cross-validated.

In addition, note that in random forests, bootstrap samples are used by default
(``bootstrap=True``) while the default strategy for extra-trees is to use the
whole dataset (``bootstrap=False``). When using bootstrap sampling, the
generalization accuracy can be estimated on the left-out or out-of-bag samples.
This can be enabled by setting ``oob_score=True``.
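
For instance, a minimal sketch of out-of-bag estimation (the synthetic dataset
below is purely illustrative)::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # bootstrap=True (the random forest default) is required for
    # out-of-bag estimation.
    forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                    random_state=0).fit(X, y)
    print(forest.oob_score_)  # accuracy estimated on the out-of-bag samples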

.. note::
