[WIP] First step in enforcing the minimum of 2 required samples, fixing the MLPRegressor attribute error #24788


Merged

Conversation

mohitthakur13
Contributor

Reference Issues/PRs

Fixes #24713

What does this implement/fix? Explain your changes.

Added an if check to ensure that 10% of the input training data (X) amounts to at least 2 samples, based on the comment by @glemaitre, in the _validate_data function in the BaseEstimator class in sklearn/base.py.
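
A minimal sketch of what that check may have looked like (hypothetical reconstruction; the exact code is not shown in this thread):

# Hypothetical sketch: 0.1 is the hard-coded fraction discussed below.
if 0.1 * X.shape[0] < 2:
    raise ValueError("At least 2 samples are required for the validation split.")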

If this is the correct approach for designing the fix, here is what still needs to be done:

  1. manual selection in the case 2 <= len(X) < 10
  2. testing

Any other comments?

Any further comments or directions for designing a proper fix would be appreciated.

@betatim
Member

betatim commented Oct 31, 2022

Thanks for making this PR!

I think a better place to perform this check is in Base.fit() in sklearn/neural_network/_multilayer_perceptron.py. There you can use self.validation_fraction instead of having to hard-code 0.1. The check should only happen if early_stopping is being used.
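
For concreteness, a rough sketch of the check being described (placement and message are hypothetical; self.early_stopping and self.validation_fraction are the estimator's existing parameters):

# Hypothetical sketch of the suggested check inside fit():
if self.early_stopping and self.validation_fraction * X.shape[0] < 2:
    raise ValueError("validation_fraction leaves fewer than 2 validation samples.")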

@mohitthakur13
Contributor Author

mohitthakur13 commented Oct 31, 2022

Thanks @betatim, great tip. Will get back on this.


@mohitthakur13
Contributor Author

Hi @betatim, I updated Base.fit() in sklearn/neural_network/_multilayer_perceptron.py with an if check on self.validation_fraction, as you suggested. You should be able to see it now.

Of course, this implies that a ValueError will be raised for any training input where len(X) is less than 11. Is that the intention? Could there be some angle of the bigger picture that I am missing here?

Thank you.

@betatim
Member

betatim commented Nov 2, 2022

Apologies, I didn't read the code carefully enough; fit() isn't a good place. I've now read the code and think adding

if X_val.shape[0] < 2:
    raise ValueError("The validation set is too small. Increase the validation_fraction or the size of your dataset.")

after line 581 (where X_val is defined) in _multilayer_perceptron.py is the thing to do.

This way we can directly check that X_val contains enough samples (at least two).

I've also made a suggestion for the error message.

@mohitthakur13
Contributor Author

mohitthakur13 commented Nov 2, 2022

Thanks for the swift feedback @betatim

X_val is unfortunately not used in the fit or _fit functions of BaseMultilayerPerceptron. Also, fit does not check the early_stopping criterion. Could this be a possibility in _fit (at the cost of some repetition, though):

early_stopping = self.early_stopping
X, X_val, y, y_val = train_test_split(
    X,
    y,
    random_state=self._random_state,
    test_size=self.validation_fraction,
    stratify=stratify,
)
if early_stopping and X_val.shape[0] < 2:
    raise ValueError("The validation set is too small. Increase the validation_fraction or the size of your dataset.")

@betatim
Member

betatim commented Nov 3, 2022

X_val is created in _fit_stochastic(), so the check that it contains enough samples needs to be there as well. As far as I can see _fit_stochastic is the only place where a validation dataset is created, because it is the only place where early_stopping = True changes the behaviour of the estimator.

@mohitthakur13 force-pushed the MLPRegressor_attribute_error branch from b8ec6ec to 29f15d8 on November 3, 2022
@mohitthakur13
Contributor Author

mohitthakur13 commented Nov 3, 2022

I see. Then, if I have understood correctly, here is what could work:

  1. Early stopping matters only for the _fit_stochastic solver, so it is essentially independent of checking the input dataset size or validation_fraction in general.
  2. But _fit_stochastic is called by _fit in BaseMultilayerPerceptron.
  3. Therefore, putting the if check right after self._validate_input in the _fit function should solve the whole issue, and we only do it once. No extra check is needed in _fit_stochastic for early_stopping, since _fit already takes care of it.

You should already see it in the PR update. Please let me know if it makes sense or if I am missing something in the issue.

@mohitthakur13
Contributor Author

Hi @betatim, I wonder if you have had a chance to look at the solution above? Let me know if it addresses the issue. Thanks.

@betatim
Member

betatim commented Nov 8, 2022

Hi, unfortunately we misunderstood each other. To restate my earlier point:

X_val is created in _fit_stochastic(), so the check that it contains enough samples needs to be there as well.

@mohitthakur13
Contributor Author

mohitthakur13 commented Nov 8, 2022

Ok, so the recent commit contains the following two additions:

  1. an if check in the _fit function of the BaseMultilayerPerceptron class:

if self.validation_fraction * X.shape[0] < 2:
    raise ValueError("The validation set is too small. Increase the validation_fraction or the size of your dataset.")

  2. an additional if check within the _fit_stochastic function:

if X_val.shape[0] < 2:
    raise ValueError("The validation set is too small. Increase the validation_fraction or the size of your dataset.")

I suspect the reason for adding this additional check in _fit_stochastic is that _fit_stochastic is called from other places in the code than just _fit, right?

@betatim
Member

betatim commented Nov 8, 2022

I would remove your first addition and only check in _fit_stochastic. The reason being that there you can test what you really want to test: does the validation dataset contain enough samples? In _fit you'd have to replicate the exact behaviour of how train_test_split() (which is used to create the validation set) computes how many samples will be in the validation set. This is why I think checking in _fit_stochastic is the better thing to do.
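
To make the point concrete, a small standalone example (the numbers are arbitrary, but train_test_split and its test_size parameter are the actual scikit-learn API): with ten samples and a fraction of 0.1, the hold-out set ends up with a single row, which the direct X_val.shape[0] check catches without re-implementing any size arithmetic.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# train_test_split determines the hold-out size internally
# (ceil(test_size * n_samples) when test_size is a float).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)
print(X_val.shape[0])  # 1 -- a single validation sample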

@mohitthakur13
Contributor Author

I see, got it @betatim, thanks for the clarification. Done: the check now lives only in _fit_stochastic.
Would you know what the next steps for this issue should be?
Thanks.

@betatim
Member

betatim commented Nov 9, 2022

Two more things are needed:

  1. a non-regression test to make sure the bug doesn't come back, maybe based on the example reported in AttributeError: 'MLPRegressor' object has no attribute '_best_coefs' #24713 (a sketch follows after this list)
    • the test should fail on main and pass with the changes from this PR
  2. an entry in the "what's new" change log that reports that this bug has been fixed
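
For illustration, a hedged sketch of such a non-regression test (the test name, data, and matched error message are assumptions based on this thread, not necessarily the merged code):

import numpy as np
import pytest

from sklearn.neural_network import MLPRegressor


def test_mlp_regressor_small_validation_set_raises():
    # 10 samples with validation_fraction=0.1 leave a single validation
    # row; the fix should raise a clear ValueError instead of failing
    # later with AttributeError (issue #24713).
    rng = np.random.RandomState(0)
    X = rng.rand(10, 2)
    y = rng.rand(10)
    mlp = MLPRegressor(early_stopping=True, validation_fraction=0.1)
    with pytest.raises(ValueError, match="validation set is too small"):
        mlp.fit(X, y)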

@mohitthakur13
Contributor Author

@betatim - sorry for the slight delay in getting back to this. I have added a test (the last function) in sklearn/neural_network/tests/test_mlp.py, which now checks the minimum input sample size.

The test passes when running pytest sklearn/neural_network/tests/test_mlp.py, but when I run pytest over the whole test suite, the following two (seemingly unrelated) errors show up:

  1. E AttributeError: module 'sklearn.metrics._pairwise_distances_reduction._base' has no attribute 'BaseDistanceReducer64'
  2. E AttributeError: module 'sklearn.metrics._pairwise_distances_reduction._base' has no attribute 'BaseDistanceReducer64'

Could it be due to some deprecated code?
Thank you.

@mohitthakur13
Contributor Author

@betatim - could you take a look at the code (the test submitted above)? Any feedback is welcome.

@betatim
Member

betatim commented Nov 28, 2022

The linting tests that check the formatting of the code are failing. Take a look at step 9 of https://scikit-learn.org/stable/developers/contributing.html#how-to-contribute or the section after that for the tools you need to run to have the formatting fixed automatically. The easiest way is to use pre-commit.
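
For reference, the pre-commit workflow from that guide boils down to running pip install pre-commit once, then pre-commit install in the repository root; the formatters then run automatically on each commit, and pre-commit run --all-files applies them to the whole tree in one go.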

I'm not sure why the other two tests fail; they shouldn't be related to your changes. If you fix the linter errors, we can see whether the problems also happen in the CI itself.

@mohitthakur13
Contributor Author

You are right @betatim, they shouldn't come from the fix I suggested. Let me have a look. Thanks for the feedback.

@mohitthakur13
Contributor Author

Done!

@glemaitre self-requested a review September 2, 2024 09:49
@glemaitre removed their request for review November 6, 2024 17:47

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c3ca73b.

Member

@jeremiedbb left a comment


LGTM. Thanks @mohitthakur13

@jeremiedbb merged commit eec4449 into scikit-learn:main on Apr 13, 2025
36 checks passed
lucyleeow pushed a commit to EmilyXinyi/scikit-learn that referenced this pull request Apr 23, 2025
…gressor (scikit-learn#24788)

Co-authored-by: Mohit Singh Thakur <mohit.thakur@tum.de>
Co-authored-by: Mohit Singh Thakur <thakur.mohit@gmail.com>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>