Description
This is maybe for @arjoly, @jmschrei, @glouppe.

I'm trying to understand `min_samples_leaf` and `min_samples_split`. I thought these were pre-pruning options, but they are not. They are smoothing options. Is that clear from the docs and I'm just slow?
Is there a reference for the current behavior?
What I expected was: "if the best split results in fewer than `min_samples_leaf` samples in the leaf, don't split".
What it does instead is: "don't consider any split that leaves fewer than `min_samples_leaf` samples in a leaf".
These two behaviors are very, very different, and that was non-obvious to me (because I hadn't really thought about it before).
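To spell out the difference, here is a self-contained toy sketch of the two rules on a single 1-D feature with Gini impurity. This is my own illustration code, not scikit-learn internals; all the helper names are made up:

```python
import numpy as np

def gini(y):
    # Gini impurity of a binary label array
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def candidate_splits(x, y):
    # Yield (threshold, n_left, n_right, weighted child impurity) for
    # every midpoint between consecutive distinct feature values.
    order = np.argsort(x)
    x, y = x[order], y[order]
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        left, right = y[:i], y[i:]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        yield (x[i] + x[i - 1]) / 2, len(left), len(right), imp

def split_then_check(x, y, min_samples_leaf):
    # What I expected: take the globally best split, then refuse to
    # split at all if it leaves too few samples in a leaf.
    thr, n_left, n_right, _ = min(candidate_splits(x, y), key=lambda s: s[3])
    return thr if min(n_left, n_right) >= min_samples_leaf else None

def constrained_search(x, y, min_samples_leaf):
    # What the trees do instead: drop the violating candidates first,
    # then take the best remaining one -- so the threshold moves.
    valid = [s for s in candidate_splits(x, y)
             if min(s[1], s[2]) >= min_samples_leaf]
    return min(valid, key=lambda s: s[3])[0] if valid else None

x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0, 0, 0, 0, 1])
print(split_then_check(x, y, min_samples_leaf=2))    # None: best split isolates one sample
print(constrained_search(x, y, min_samples_leaf=2))  # 2.5: split kept, threshold shifted
```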
Example:
```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = np.zeros(X.shape[0], dtype=int)  # np.int is removed in recent numpy
y[X[:, 1] > 2] = 1

tree = DecisionTreeClassifier().fit(X, y)

# create a grid for plotting decision functions...
x_lin = np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 1000)
y_lin = np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 1000)
x_grid, y_grid = np.meshgrid(x_lin, y_lin)
X_grid = np.c_[x_grid.ravel(), y_grid.ravel()]

fig, axes = plt.subplots(1, 2)
axes[0].contourf(x_grid, y_grid,
                 tree.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[0].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))  # Vega10 was renamed tab10

tree2 = DecisionTreeClassifier(min_samples_leaf=2).fit(X, y)
axes[1].contourf(x_grid, y_grid,
                 tree2.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[1].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))
```
Basically, setting `min_samples_leaf` leads to exactly the same tree, only with the threshold moved so that there are enough samples in the leaf. Was that really the intent?