Skip to content

FIX bagging with metadata routing and estimator implement __len__ #28734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 3, 2024

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Mar 31, 2024

I could catch a regression in imbalanced-learn when estimator in Bagging* implement __len__ (e.g. RandomForest*) where _get_estimator will trigger calling __len__ with the current pattern.

The problem is that __len__ relies on fitted attribute while _get_estimator is called before fit.

This fix is check for None instead to know when to create a default estimator.

ping @adrinjalali @OmarManzoor @adam2392 since it was introduced in #28432

No changelog needed since we did not yet release this bug ;)

Copy link

github-actions bot commented Mar 31, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f5252d5. Link to the linter CI: here

@glemaitre
Copy link
Member Author

I got another regression for which I wrote the test. I have to check what is the reason for raising an error.

@glemaitre
Copy link
Member Author

I got another regression for which I wrote the test. I have to check what is the reason for raising an error.

It looks like we did not have a safeguard on using the routing API in this case.

Copy link
Member

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix on the regression!

Wasn't 100% sure I follow your PR description. Sorry :p.

Do you mind describing what the issue was that occurred? Is it that __len__ errors out?

Co-authored-by: Adam Li <adam2392@gmail.com>
@glemaitre
Copy link
Member Author

Do you mind describing what the issue was that occurred? Is it that len errors out?

When calling self.estimator or DecisionTreeClassifier(), the or operator trigger to call self.estimator.__len__ for some reason. I should check the stack of the call to understand what the reason for it but for sure, the new pattern is exactly what we want.

@glemaitre
Copy link
Member Author

I should check the stack of the call to understand what the reason for it but for sure, the new pattern is exactly what we want.

Thinking about it, I assume that the first thing that Python do is to check if the object is None or "empty" container and thus it should trigger the len call.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

Comment on lines -838 to +846
return self.estimator or DecisionTreeClassifier()
if self.estimator is None:
return DecisionTreeClassifier()
return self.estimator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the diff here? I find the existing code clear and shorter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it fails when enabling metadata routing with a non-default estimator.
See this #28734 (comment)

Comment on lines -1340 to +1335
return self.estimator or DecisionTreeRegressor()
if self.estimator is None:
return DecisionTreeRegressor()
return self.estimator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Thanks for catching this!

@adam2392
Copy link
Member

adam2392 commented Apr 2, 2024

The CI failures do not look related to this.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like BaseEnsemble.__len__ should return 0 when it is not fitted. This way Python would consider it falsey.

@glemaitre
Copy link
Member Author

I feel like BaseEnsemble.len should return 0 when it is not fitted. This way Python would consider it falsey.

I'm fine implementing this behaviour.

However, I'm thinking that this is orthogonal because implementing the proposed behaviour and using if self.estimator or DecisionTreeClassifier() will return DecisionTreeClassifier() while we would expect getting self.estimator (actually it would be a silent bug).

@thomasjpfan did I miss something in your proposal.

@thomasjpfan
Copy link
Member

However, I'm thinking that this is orthogonal because implementing the proposed behaviour and using if self.estimator or DecisionTreeClassifier() will return DecisionTreeClassifier() while we would expect getting self.estimator (actually it would be a silent bug).

Ah yes, you are correct. We still need this PR.

@adrinjalali
Copy link
Member

I'm not sure if we should return 0 on an unfitted estimator though. I think __len__ is undefined when the estimator is not fitted, so I might prefer to raise UnfittedError in that case.

WDYT @thomasjpfan

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan merged commit cfd8091 into scikit-learn:main Apr 3, 2024
30 checks passed
@thomasjpfan
Copy link
Member

I'm not sure if we should return 0 on an unfitted estimator though. I think len is undefined when the estimator is not fitted, so I might prefer to raise UnfittedError in that case.

That is okay with me to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants