Understanding min_samples_leaf / min_samples_split #8399

Closed
@amueller

This is maybe for @arjoly, @jmschrei, @glouppe.

I'm trying to understand min_samples_leaf and min_samples_split. I thought these were pre-pruning options, but they are not. They are smoothing options. Is that clear from the docs and I'm just slow?

Is there a reference for the current behavior?

What I expected was: "if the best split results in fewer than min_samples_leaf samples in a leaf, don't split".
What it is instead is: "don't consider a split that leaves fewer than min_samples_leaf samples in a leaf".

These are very very very different, and that was kind of non-obvious to me (because I didn't really think about it before).
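
To make the difference concrete, here is a minimal sketch on a made-up six-point 1D dataset (the data, the variable names and the fixed random_state are my own assumptions for illustration, not part of the original example). The impurity-optimal split isolates the single class-1 point, so under the "don't split" reading the second tree should be a single leaf. Under the current behavior it should still split, only at a threshold that keeps two samples on the right.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: five class-0 points and one class-1 outlier.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])
y = np.array([0, 0, 0, 0, 0, 1])

# random_state is fixed only to keep the sketch reproducible.
default_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
msl_tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=0).fit(X, y)

for name, clf in [("default", default_tree), ("min_samples_leaf=2", msl_tree)]:
    t = clf.tree_
    is_leaf = t.children_left == -1  # leaves have no left child
    print(name)
    print("  root threshold:", t.threshold[0])
    print("  leaf sizes:", t.n_node_samples[is_leaf])
    print("  predict_proba at x=10:", clf.predict_proba([[10.0]])[0])
```

If I read the splitter correctly, the outlier at x=10 ends up sharing a leaf with the class-0 point at x=4, so its predicted probability is no longer pure. That is the "smoothing" effect I mean.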

Example:

import numpy as np
import matplotlib.pyplot as plt
# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = np.zeros(X.shape[0], dtype=int)  # np.int was removed from NumPy; use the builtin int
y[X[:, 1] > 2] = 1

tree = DecisionTreeClassifier().fit(X, y)

# create a grid for plotting decision functions...
x_lin = np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 1000)
y_lin = np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 1000)
x_grid, y_grid = np.meshgrid(x_lin, y_lin)
X_grid = np.c_[x_grid.ravel(), y_grid.ravel()]

fig, axes = plt.subplots(1, 2)
axes[0].contourf(x_grid, y_grid, tree.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[0].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))  # Vega10 was renamed to tab10 in matplotlib


tree2 = DecisionTreeClassifier(min_samples_leaf=2).fit(X, y)
axes[1].contourf(x_grid, y_grid, tree2.predict_proba(X_grid)[:, 0].reshape(x_grid.shape), alpha=.3)
axes[1].scatter(X[:, 0], X[:, 1], c=plt.cm.tab10(y))  # Vega10 was renamed to tab10 in matplotlib

[image: decision regions of the default tree (left) and the min_samples_leaf=2 tree (right)]

Basically setting min_samples_leaf leads to exactly the same tree, only with the threshold moved so that there are enough samples in the leaf. Was that really the intent?
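
One way to check this without looking at the plot is to print both trees as text and compare their leaf sizes. This is only a sketch: it regenerates the same data as the example, but the fixed random_state, the feature names "x0"/"x1" and the leaf_sizes helper are my own additions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same data as in the example above.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = (X[:, 1] > 2).astype(int)

# random_state is fixed only to keep the sketch reproducible.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree2 = DecisionTreeClassifier(min_samples_leaf=2, random_state=0).fit(X, y)


def leaf_sizes(clf):
    """Number of training samples in each leaf (hypothetical helper)."""
    t = clf.tree_
    return t.n_node_samples[t.children_left == -1]


for name, clf in [("default", tree), ("min_samples_leaf=2", tree2)]:
    print(name, "leaf sizes:", leaf_sizes(clf))
    print(export_text(clf, feature_names=["x0", "x1"]))
```

Comparing the split features and thresholds in the two printouts should show whether the structure is the same and only the threshold has moved.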
