Skip to content

Setting min_samples_split=1 in DecisionTreeClassifier does not raise exception #25481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sebp opened this issue Jan 25, 2023 · 4 comments · Fixed by #25744
Closed

Setting min_samples_split=1 in DecisionTreeClassifier does not raise exception #25481

sebp opened this issue Jan 25, 2023 · 4 comments · Fixed by #25744

Comments

@sebp
Copy link
Contributor

sebp commented Jan 25, 2023

Describe the bug

If min_samples_split is set to 1, an exception should be raised according to the paramter's constraints:

"min_samples_split": [
Interval(Integral, 2, None, closed="left"),
Interval(Real, 0.0, 1.0, closed="right"),
],

However, DecisionTreeClassifier accepts min_samples_split=1 without complaining.

With scikit-survival 1.0, this raises an exception as expected:

ValueError: min_samples_split == 1, must be >= 2.

I suspect that this has to do with the Intervals of the constraints overlapping. min_samples_split=1 satisfies the Real constraint, whereas the Integral constraint should have precedence.

Steps/Code to Reproduce

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
t = DecisionTreeClassifier(min_samples_split=1)
t.fit(X, y)

Expected Results

sklearn.utils._param_validation.InvalidParameterError: The 'min_samples_split' parameter of DecisionTreeClassifier must be an int in the range [2, inf) or a float in the range (0.0, 1.0]. Got 1 instead.

Actual Results

No exception is raised.

Versions

System:
    python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
executable: /…/bin/python
   machine: Linux-6.1.6-100.fc36.x86_64-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.dev0
          pip: 22.2.2
   setuptools: 63.2.0
        numpy: 1.24.1
        scipy: 1.10.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /…/lib/libgomp.so.1.0.0
        version: None
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /…/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Zen
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /…/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Zen
    num_threads: 16
@sebp sebp added Bug Needs Triage Issue requires triage labels Jan 25, 2023
@glemaitre
Copy link
Member

glemaitre commented Jan 25, 2023

I think that this is on purpose. Otherwise, we would have used closed="neither" for the Real case and 1 is qualified as a Real.

At least this is not a regression since the code in the past would have failed and now we allow it to be considered as 100% of the train set.

If we exclude 1 it means that we don't accept both 100% and 1. I don't know if this is something that we want.

@sebp
Copy link
Contributor Author

sebp commented Jan 26, 2023

Note that with sklearn 1.0, min_samples_split=1.0 does not raise an exception, only min_samples_split=1.

@thomasjpfan
Copy link
Member

Reading the docstring, I agree it is strange to interpret the integer 1 as 100%:

- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a fraction and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.

From the docstring, min_samples_split=1 is interpreted as 1 sample, which does not make any sense.

I think we should be able to specify "1.0" but not "1" in our parameter validation framework. @jeremiedbb What do you think of having a way to reject Integral, such as:

Interval(Real, 0.0, 1.0, closed="right", invalid_type=Integral),

If we have a way to specify a invalid_type, then I prefer to reject min_samples_split=1 as we did in previous versions.

sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023
@sebp
Copy link
Contributor Author

sebp commented Feb 26, 2023

Also note that min_samples_split=1.0 and min_samples_split=1 do not result in the same behavior:

if isinstance(self.min_samples_split, numbers.Integral):
min_samples_split = self.min_samples_split
else: # float
min_samples_split = int(ceil(self.min_samples_split * n_samples))
min_samples_split = max(2, min_samples_split)
min_samples_split = max(min_samples_split, 2 * min_samples_leaf)

If min_samples_split=1, the actual min_samples_split is determine by min_samples_leaf:

min_samples_split = max(min_samples_split, 2 * min_samples_leaf)

If min_samples_split=1.0 and assuming there are more than 2 samples in the data, min_samples_split = n_samples:

min_samples_split = int(ceil(self.min_samples_split * n_samples))

sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023
sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023
sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023
sebp added a commit to sebp/scikit-survival that referenced this issue Feb 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants