Setting min_samples_split=1 in DecisionTreeClassifier does not raise exception #25481

sebp · 2023-01-25T18:30:23Z

Describe the bug

If min_samples_split is set to 1, an exception should be raised according to the parameter's constraints:

Lines 100 to 103 in e2e7050

    
    "min_samples_split": [
        Interval(Integral, 2, None, closed="left"),
        Interval(Real, 0.0, 1.0, closed="right"),
    ],

However, DecisionTreeClassifier accepts min_samples_split=1 without complaining.

With scikit-survival 1.0, this raises an exception as expected:

ValueError: min_samples_split == 1, must be >= 2.

I suspect that this has to do with the Intervals of the constraints overlapping. min_samples_split=1 satisfies the Real constraint, whereas the Integral constraint should have precedence.
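The suspected overlap can be sketched without scikit-learn. The helpers below (in_integral_interval, in_real_interval, is_valid_min_samples_split) are hypothetical stand-ins, not the library's validation code; they assume a value is accepted as soon as any one constraint matches:

```python
import numbers


def in_integral_interval(value, low=2):
    # Hypothetical stand-in for Interval(Integral, 2, None, closed="left"): [2, inf)
    return isinstance(value, numbers.Integral) and value >= low


def in_real_interval(value, low=0.0, high=1.0):
    # Hypothetical stand-in for Interval(Real, 0.0, 1.0, closed="right"): (0.0, 1.0]
    return isinstance(value, numbers.Real) and low < value <= high


def is_valid_min_samples_split(value):
    # Assumption: the parameter passes if ANY constraint is satisfied.
    return in_integral_interval(value) or in_real_interval(value)


# The int 1 fails the Integral interval but is also a numbers.Real,
# so it slips through the Real interval.
print(in_integral_interval(1))        # False
print(in_real_interval(1))            # True
print(is_valid_min_samples_split(1))  # True
```

Because Python's int is registered as a numbers.Real, a type check against Real alone cannot tell 1 from 1.0.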

Steps/Code to Reproduce

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
t = DecisionTreeClassifier(min_samples_split=1)
t.fit(X, y)

Expected Results

sklearn.utils._param_validation.InvalidParameterError: The 'min_samples_split' parameter of DecisionTreeClassifier must be an int in the range [2, inf) or a float in the range (0.0, 1.0]. Got 1 instead.

Actual Results

No exception is raised.

Versions

System:
    python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
executable: /…/bin/python
   machine: Linux-6.1.6-100.fc36.x86_64-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.dev0
          pip: 22.2.2
   setuptools: 63.2.0
        numpy: 1.24.1
        scipy: 1.10.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /…/lib/libgomp.so.1.0.0
        version: None
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /…/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Zen
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /…/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Zen
    num_threads: 16


glemaitre · 2023-01-25T21:32:00Z

I think that this is on purpose. Otherwise, we would have used closed="neither" for the Real case, and 1 qualifies as a Real.

At least this is not a regression since the code in the past would have failed and now we allow it to be considered as 100% of the train set.

If we exclude 1 it means that we don't accept both 100% and 1. I don't know if this is something that we want.

sebp · 2023-01-26T08:33:19Z

Note that with sklearn 1.0, min_samples_split=1.0 does not raise an exception, only min_samples_split=1.

thomasjpfan · 2023-02-08T02:26:50Z

Reading the docstring, I agree it is strange to interpret the integer 1 as 100%:

scikit-learn/sklearn/tree/_classes.py

Lines 635 to 638 in baefe83

    
    - If int, then consider `min_samples_leaf` as the minimum number.
    - If float, then `min_samples_leaf` is a fraction and
      `ceil(min_samples_leaf * n_samples)` are the minimum
      number of samples for each node.

From the docstring, min_samples_split=1 is interpreted as 1 sample, which does not make any sense.

I think we should be able to specify "1.0" but not "1" in our parameter validation framework. @jeremiedbb What do you think of having a way to reject Integral, such as:

Interval(Real, 0.0, 1.0, closed="right", invalid_type=Integral),

If we have a way to specify an invalid_type, then I prefer to reject min_samples_split=1 as we did in previous versions.
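A minimal sketch of what such a check could look like, assuming a half-open interval (low, high]. The function in_interval and its invalid_type argument are hypothetical and only illustrate the proposal; they are not an existing scikit-learn API:

```python
from numbers import Integral, Real


def in_interval(value, type_, low, high, invalid_type=None):
    # Hypothetical check for an interval that is open on the left
    # and closed on the right: (low, high].
    if invalid_type is not None and isinstance(value, invalid_type):
        # Reject explicitly excluded types first, e.g. ints inside a Real interval.
        return False
    if not isinstance(value, type_):
        return False
    return low < value <= high


print(in_interval(1, Real, 0.0, 1.0, invalid_type=Integral))    # False
print(in_interval(1.0, Real, 0.0, 1.0, invalid_type=Integral))  # True
print(in_interval(1, Real, 0.0, 1.0))                           # True
```

With the exclusion in place, 1.0 still means 100% of the samples, while the integer 1 is left to the Integral constraint and rejected there.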


sebp · 2023-02-26T12:12:42Z

Also note that min_samples_split=1.0 and min_samples_split=1 do not result in the same behavior:

scikit-learn/sklearn/tree/_classes.py

Lines 257 to 263 in baefe83

    
    if isinstance(self.min_samples_split, numbers.Integral):
        min_samples_split = self.min_samples_split
    else:  # float
        min_samples_split = int(ceil(self.min_samples_split * n_samples))
        min_samples_split = max(2, min_samples_split)
    min_samples_split = max(min_samples_split, 2 * min_samples_leaf)

If min_samples_split=1, the actual min_samples_split is determined by min_samples_leaf:

min_samples_split = max(min_samples_split, 2 * min_samples_leaf)

If min_samples_split=1.0 and assuming there are more than 2 samples in the data, min_samples_split = n_samples:

min_samples_split = int(ceil(self.min_samples_split * n_samples))
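To make the difference concrete, here is a standalone sketch that mirrors the quoted logic. The function name effective_min_samples_split is hypothetical, and min_samples_leaf is assumed to be already resolved to an integer (1 by default); n_samples=150 matches the iris data from the reproducer:

```python
import numbers
from math import ceil


def effective_min_samples_split(min_samples_split, n_samples, min_samples_leaf=1):
    # Mirrors the branch quoted above; not the library function itself.
    if isinstance(min_samples_split, numbers.Integral):
        result = min_samples_split
    else:  # float: treated as a fraction of n_samples
        result = int(ceil(min_samples_split * n_samples))
        result = max(2, result)
    # The final clamp is what silently turns an int 1 into 2.
    return max(result, 2 * min_samples_leaf)


print(effective_min_samples_split(1, n_samples=150))    # 2
print(effective_min_samples_split(1.0, n_samples=150))  # 150
```

So the integer 1 ends up behaving like the default of 2, whereas 1.0 only allows a split at a node that contains every sample.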


sebp added the Bug and Needs Triage labels Jan 25, 2023

thomasjpfan added the module:tree and Needs Decision - API labels and removed the Needs Triage label Feb 8, 2023

sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023

Do not allow min_samples_leaf=1.0

93d40b3

See also scikit-learn/scikit-learn#25481

sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023

Do not allow min_samples_leaf=1.0

ea290f8

See also scikit-learn/scikit-learn#25481

sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023

Do not allow min_samples_leaf=1.0

e629f0a

See also scikit-learn/scikit-learn#25481

sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023

Do not allow min_samples_leaf=1.0

a139036

See also scikit-learn/scikit-learn#25481

sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023

Do not allow min_samples_leaf=1.0

0cf7924

See also scikit-learn/scikit-learn#25481

jeremiedbb mentioned this issue Mar 2, 2023

FIX Raise an error when min_samples_split=1 in trees #25744

Merged

thomasjpfan closed this as completed in #25744 Mar 6, 2023
