Description
This is maybe for @arjoly, @jmschrei, @glouppe.

I'm trying to understand `min_samples_leaf` and `min_samples_split`. I thought these were pre-pruning options, but they are not. They are smoothing options. Is that clear from the docs and I'm just slow?
Is there a reference for the current behavior?
What I expected was: "if the best split results in fewer than `min_samples_leaf` samples in the leaf, don't split".
What it does instead is: "don't consider any split that leaves fewer than `min_samples_leaf` samples in a leaf".
These two behaviors are very, very different, and that was non-obvious to me (because I hadn't really thought about it before).
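To spell out the difference, here is a self-contained toy sketch of the two rules on a single 1-D feature with Gini impurity. This is my own illustration code, not scikit-learn internals; all the helper names are made up:

```python
import numpy as np

def gini(y):
    # Gini impurity of a binary label array
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def candidate_splits(x, y):
    # Yield (threshold, n_left, n_right, weighted child impurity) for
    # every midpoint between consecutive distinct feature values.
    order = np.argsort(x)
    x, y = x[order], y[order]
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        left, right = y[:i], y[i:]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        yield (x[i] + x[i - 1]) / 2, len(left), len(right), imp

def split_then_check(x, y, min_samples_leaf):
    # What I expected: take the globally best split, then refuse to
    # split at all if it leaves too few samples in a leaf.
    thr, n_left, n_right, _ = min(candidate_splits(x, y), key=lambda s: s[3])
    return thr if min(n_left, n_right) >= min_samples_leaf else None

def constrained_search(x, y, min_samples_leaf):
    # What the trees do instead: drop the violating candidates first,
    # then take the best remaining one -- so the threshold moves.
    valid = [s for s in candidate_splits(x, y)
             if min(s[1], s[2]) >= min_samples_leaf]
    return min(valid, key=lambda s: s[3])[0] if valid else None

x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0, 0, 0, 0, 1])
print(split_then_check(x, y, min_samples_leaf=2))    # None: best split isolates one sample
print(constrained_search(x, y, min_samples_leaf=2))  # 2.5: split kept, threshold shifted
```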
Example:
```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = np.zeros(X.shape[0], dtype=int)  # np.int is removed in recent numpy
y[X[:, 1] > 2] = 1

tree = DecisionTreeClassifier().fit(X, y)

# create a grid for plotting decision functions...
x_lin = np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 1000)
y_lin = np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 1000)
x_grid, y_grid = np.meshgrid(x_lin, y_lin)
X_grid = np.c_[x_grid.ravel(), y_grid.ravel()]

fig, axes = plt.subplots(1, 2)
axes[0].contourf(x_grid, y_grid,
                 tree.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[0].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))  # Vega10 was renamed tab10

tree2 = DecisionTreeClassifier(min_samples_leaf=2).fit(X, y)
axes[1].contourf(x_grid, y_grid,
                 tree2.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[1].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))
```
Basically, setting `min_samples_leaf` leads to exactly the same tree, only with the threshold moved so that there are enough samples in the leaf. Was that really the intent?