diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst
index fe829090012a0..42ba4c5438675 100644
--- a/doc/modules/ensemble.rst
+++ b/doc/modules/ensemble.rst
@@ -128,16 +128,23 @@ Random Forests
 --------------
 
 In random forests (see :class:`RandomForestClassifier` and
-:class:`RandomForestRegressor` classes), each tree in the ensemble is
-built from a sample drawn with replacement (i.e., a bootstrap sample)
-from the training set. In addition, when splitting a node during the
-construction of the tree, the split that is chosen is no longer the
-best split among all features. Instead, the split that is picked is the
-best split among a random subset of the features. As a result of this
-randomness, the bias of the forest usually slightly increases (with
-respect to the bias of a single non-random tree) but, due to averaging,
-its variance also decreases, usually more than compensating for the
-increase in bias, hence yielding an overall better model.
+:class:`RandomForestRegressor` classes), each tree in the ensemble is built
+from a sample drawn with replacement (i.e., a bootstrap sample) from the
+training set.
+
+Furthermore, when splitting each node during the construction of a tree, the
+best split is found either from all input features or from a random subset of
+size ``max_features``. (See the :ref:`parameter tuning guidelines
+<random_forest_parameters>` for more details).
+
+The purpose of these two sources of randomness is to decrease the variance of
+the forest estimator. Indeed, individual decision trees typically exhibit high
+variance and tend to overfit. The injected randomness in forests yields
+decision trees with somewhat decoupled prediction errors. By taking an average
+of those predictions, some errors can cancel out. Random forests achieve a
+reduced variance by combining diverse trees, sometimes at the cost of a slight
+increase in bias. In practice, the variance reduction is often significant,
+hence yielding an overall better model.
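+
+For instance, the following minimal sketch (the synthetic dataset and the
+parameter values are purely illustrative) compares a single, fully grown
+decision tree with a forest of 100 randomized trees::
+
+    from sklearn.datasets import make_classification
+    from sklearn.ensemble import RandomForestClassifier
+    from sklearn.model_selection import cross_val_score
+    from sklearn.tree import DecisionTreeClassifier
+
+    # Synthetic data, purely for illustration.
+    X, y = make_classification(n_samples=1000, n_features=20,
+                               random_state=0)
+
+    # A single, fully grown tree tends to overfit (high variance); averaging
+    # many randomized trees usually yields a more stable model.
+    tree = DecisionTreeClassifier(random_state=0)
+    forest = RandomForestClassifier(n_estimators=100, random_state=0)
+
+    print(cross_val_score(tree, X, y).mean())
+    print(cross_val_score(forest, X, y).mean())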
 
 In contrast to the original publication [B2001]_, the scikit-learn
 implementation combines classifiers by averaging their probabilistic
@@ -188,30 +195,31 @@ in bias::
     :align: center
     :scale: 75%
 
+.. _random_forest_parameters:
+
 Parameters
 ----------
 
-The main parameters to adjust when using these methods is ``n_estimators``
-and ``max_features``. The former is the number of trees in the forest. The
-larger the better, but also the longer it will take to compute. In
-addition, note that results will stop getting significantly better
-beyond a critical number of trees. The latter is the size of the random
-subsets of features to consider when splitting a node. The lower the
-greater the reduction of variance, but also the greater the increase in
-bias. Empirical good default values are ``max_features=n_features``
-for regression problems, and ``max_features=sqrt(n_features)`` for
-classification tasks (where ``n_features`` is the number of features
-in the data). Good results are often achieved when setting ``max_depth=None``
-in combination with ``min_samples_split=2`` (i.e., when fully developing the
-trees). Bear in mind though that these values are usually not optimal, and
-might result in models that consume a lot of RAM. The best parameter values
-should always be cross-validated. In addition, note that in random forests,
-bootstrap samples are used by default (``bootstrap=True``)
-while the default strategy for extra-trees is to use the whole dataset
-(``bootstrap=False``).
-
-When using bootstrap sampling the generalization accuracy can be estimated
-on the left out or out-of-bag samples. This can be enabled by
-setting ``oob_score=True``.
+The main parameters to adjust when using these methods are ``n_estimators`` and
+``max_features``. The former is the number of trees in the forest. The larger
+the better, but also the longer it will take to compute. In addition, note that
+results will stop getting significantly better beyond a critical number of
+trees. The latter is the size of the random subsets of features to consider
+when splitting a node. The lower the value, the greater the reduction of
+variance, but also the greater the increase in bias. Empirically good default
+values are ``max_features=None`` (always considering all features instead of a
+random subset) for regression problems, and ``max_features="sqrt"`` (using a
+random subset of size ``sqrt(n_features)``) for classification tasks (where
+``n_features`` is the number of features in the data). Good results are often
+achieved when setting ``max_depth=None`` in combination with
+``min_samples_split=2`` (i.e., when fully developing the trees). Bear in mind
+though that these values are usually not optimal, and might result in models
+that consume a lot of RAM. The best parameter values should always be
+cross-validated. In addition, note that in random forests, bootstrap samples
+are used by default (``bootstrap=True``) while the default strategy for
+extra-trees is to use the whole dataset (``bootstrap=False``). When using
+bootstrap sampling, the generalization accuracy can be estimated on the left
+out or out-of-bag samples. This can be enabled by setting ``oob_score=True``.
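+
+As an illustrative sketch (the synthetic dataset, the parameter grid and the
+chosen values are arbitrary), the main parameters can be tuned by
+cross-validation, and the out-of-bag estimate is obtained by setting
+``oob_score=True``::
+
+    from sklearn.datasets import make_classification
+    from sklearn.ensemble import RandomForestClassifier
+    from sklearn.model_selection import GridSearchCV
+
+    # Synthetic data, purely for illustration.
+    X, y = make_classification(n_samples=1000, n_features=25,
+                               random_state=0)
+
+    # Cross-validate the main parameters instead of relying on the defaults.
+    param_grid = {
+        "n_estimators": [50, 100, 200],
+        "max_features": ["sqrt", "log2", None],
+    }
+    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid)
+    search.fit(X, y)
+    print(search.best_params_)
+
+    # With bootstrap sampling, the out-of-bag samples provide an estimate of
+    # the generalization accuracy at no extra data cost.
+    forest = RandomForestClassifier(n_estimators=100, oob_score=True,
+                                    random_state=0)
+    forest.fit(X, y)
+    print(forest.oob_score_)
 
 .. note::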