[MRG] Estimator tags #8022
Conversation
This starts to look good, though I have no idea what's happening with… There's an issue with…
So I'm allowing…
This should be of interest to @GaelVaroquaux @jnothman @jakevdp and @mblondel.
If you want, I can make the first thing its own PR, because this looks like it's gonna be a big one. Tests are currently failing because I'm still working on the third bullet point.
OneVsOneClassifier returns a…
hm so will…
I decided to make…
Ok, so @GaelVaroquaux @jnothman, I need some input. My current solution for… I'm confused by the method resolution order in our class hierarchy. I assumed I can call… I can see the following ways to fix this:
From a code perspective 4 is least intrusive (least number of lines changed), but it adds another layer to the inheritance hierarchy.
What is the motivation to have BaseEstimator to the left in the first place, and why don't we let the mixins inherit from it?
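To make the MRO question above concrete, here is a toy illustration (hypothetical classes, not scikit-learn's real hierarchy) of why mixins conventionally come before the base class in the bases list:

```python
# With ``class Est(Mixin, Base)``, the MRO is Est -> Mixin -> Base, so a
# cooperative super() chain visits every _get_tags implementation.
# With ``class Est(Base, Mixin)``, Base._get_tags shadows the mixin's.

class Base(object):
    def _get_tags(self):
        return {'input_validation': True}

class MultiOutputMixin(object):
    def _get_tags(self):
        # super() resolves to the next class in the MRO -- Base, when
        # the mixin is listed first among the bases.
        tags = super(MultiOutputMixin, self)._get_tags().copy()
        tags.update(multioutput=True)
        return tags

class GoodOrder(MultiOutputMixin, Base):   # mixin first: both tags merge
    pass

class BadOrder(Base, MultiOutputMixin):    # base first: mixin is shadowed
    pass

print(GoodOrder()._get_tags())  # {'input_validation': True, 'multioutput': True}
print(BadOrder()._get_tags())   # {'input_validation': True}
```

With `BadOrder`, `Base._get_tags` is found first and never calls `super()`, so the mixin's tag silently disappears; that is the brittleness being discussed.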
I think mixins are meant to come before base classes in the MRO, although I've not confirmed this intuition with online resources (I'm offline). From the perspective of purity, 3 is therefore my preferred solution, though I appreciate it is technically not backwards compatible, and is somewhat brittle. I think another partial solution is that _get_tags could be conventionally defined with a helper, like:

    def _get_tags(self):
        return base.extend_tags(self, tags={'a': 5, 'b': 6}, allow_overwrite=['b'])
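The `extend_tags` helper suggested there does not exist in scikit-learn; this is one hypothetical sketch of what it might do, merging an estimator's own tags into inherited ones and raising on a silent overwrite unless the key is whitelisted:

```python
# Stand-in for the tags collected from parent classes; purely illustrative.
_INHERITED_TAGS = {'b': 2}

def extend_tags(estimator, tags, allow_overwrite=()):
    # Merge ``tags`` into the inherited tags.  A key that would change an
    # inherited value must be listed in ``allow_overwrite``, otherwise we
    # fail loudly instead of silently clobbering it.
    merged = dict(_INHERITED_TAGS)
    for key, value in tags.items():
        if key in merged and merged[key] != value and key not in allow_overwrite:
            raise ValueError(
                "tag %r already set to %r on %r; add it to allow_overwrite "
                "to replace it" % (key, merged[key], estimator))
        merged[key] = value
    return merged

class MyEstimator(object):
    def _get_tags(self):
        return extend_tags(self, tags={'a': 5, 'b': 6}, allow_overwrite=['b'])

print(MyEstimator()._get_tags())  # {'b': 6, 'a': 5}
```

The point of the explicit whitelist is to turn accidental tag collisions in a deep hierarchy into errors rather than silent behaviour changes.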
A skim. I'm yet to look at estimator_checks. I'm not sure this interface is perfect, and I've been wondering about just having an attribute on the objects for each trait. But that violates everything :)
doc/developers/contributing.rst (outdated):

    The current set of estimator tags are:

    input_validation - whether the estimator does input-validation. This is only meant for stateless and dummy transformers!
what of ensembles/metaestimators that largely delegate their validation to some base/wrapped estimator?
I'm currently basing all this on the tests, which I think is OK as a first approach and given the issues that were motivating this. Metaestimators and ensembles are instantiated with well-behaved estimators for now, so they pass the tests.
I don't get it; what does this mean? Only stateless and dummy transformers should set this to True? Or to False? And why?
Might be clearer to document the default tag values here too
Can I check whether text feature extractors are considered to validate?
This list should be formatted as an RST definition list
doc/developers/contributing.rst (outdated):

    input_validation - whether the estimator does input-validation. This is only meant for stateless and dummy transformers!
    multioutput - whether a regressor supports multi-target outputs or a classifier supports multi-class multi-output.
    multilabel - whether the estimator supports multilabel output
another one for sparse multilabel?
So far I only added tags that were needed for the test to pass without special cases - actually I didn't need the sparse data one so far or the meta-estimator!
sklearn/base.py (outdated), @@ -12,8 +12,18 @@:

    from .utils.fixes import signature
    from . import __version__

    _DEFAULT_TAGS = {
        'input_types': ['2darray'],
by input you mean the first arg to Estimator().fit?
yes. Maybe "data_types"? We have 1d, 2d, 3d (patch extractor), sparse, dictionaries and text. Also tuples for dict-vectorizer (or was it hashing vectorizer?) I think.
Would X_types be better than data_types?
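Whatever name the tag ends up with, the override pattern under discussion could look like this sketch (standalone toy classes, not scikit-learn's actual API), for an estimator whose fit input is raw text rather than a 2-D array:

```python
# Toy defaults mirroring the _DEFAULT_TAGS dict shown in the diff above,
# using the ``X_types`` name floated in this thread.
_DEFAULT_TAGS = {'X_types': ['2darray']}

class BaseEstimator(object):
    def _get_tags(self):
        return dict(_DEFAULT_TAGS)

class MyTextVectorizer(BaseEstimator):
    def _get_tags(self):
        # Override the input-type tag: this estimator consumes raw strings.
        tags = super(MyTextVectorizer, self)._get_tags().copy()
        tags.update(X_types=['string'])
        return tags

tags = MyTextVectorizer()._get_tags()
if '2darray' not in tags['X_types']:
    print("skip common checks that pass in a numeric 2-D array")
```

A check runner can then consult the tag to decide which generic input checks apply, which is exactly how the failing text-vectorizer checks in this PR get skipped.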
sklearn/base.py
Outdated
"""Mixin class for all meta estimators in scikit-learn.""" | ||
# this is just a tag for the moment | ||
def _get_tags(self): | ||
tags = super(MetaEstimatorMixin, self)._get_tags().copy() | ||
tags.update(is_meta_estimator=True) |
Means what?
actually, this is unused right now and might be unnecessary.
sklearn/feature_extraction/text.py (outdated):

    n_samples, n_features = X.shape

    if self.sublinear_tf:
        np.log(X.data, X.data)
        X.data += 1
    if sp.issparse(X):
Does this need its own test?
Probably. I have not added any new tests; there are still enough existing tests failing. Still WIP, mostly wanted feedback on the mixins.
sklearn/multiclass.py (outdated), @@ -217,6 +218,7 @@ def fit(self, X, y):

        return self

    @if_delegate_has_method(['_first_estimator', 'estimator'])
needs testing? I think this should just be delegating to estimator. If the base estimator does not support partial_fit, nor can the fitted _first_estimator, no?
True
sklearn/multiclass.py (outdated), @@ -505,6 +511,7 @@ def fit(self, X, y):

        return self

    @if_delegate_has_method(delegate='estimator')
needs testing?
sklearn/multiclass.py (outdated), @@ -569,6 +576,8 @@ def predict(self, X):

        Predicted multi-class targets.
        """
        Y = self.decision_function(X)
        if self.n_classes_ == 2:
            return self.classes_[(Y > 0).astype(np.int)]
needs testing?
See #9100
sklearn/multiclass.py (outdated), @@ -601,7 +610,8 @@ def decision_function(self, X):

        for est, Xi in zip(self.estimators_, Xs)]).T
        Y = _ovr_decision_function(predictions,
                                   confidences, len(self.classes_))

        if self.n_classes_ == 2:
            return Y[:, 1]
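A small NumPy illustration of the binary special case in this diff: with two classes, decision_function returns a 1-D score for the positive class, and predict thresholds that score at zero. The class names and scores here are made up for the example.

```python
import numpy as np

classes_ = np.array(['neg', 'pos'])
# What Y[:, 1] from the diff would look like: one score per sample.
scores = np.array([-1.5, 0.3, 2.0])

# Threshold at zero, then fancy-index into classes_ with the 0/1 result.
# (np.int in the diff is a deprecated alias of the builtin int.)
pred = classes_[(scores > 0).astype(int)]
print(pred)  # ['neg' 'pos' 'pos']
```

This matches the scikit-learn convention that binary classifiers expose a 1-D decision_function rather than an (n_samples, 2) array.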
needs testing?
Yeah, I'll add more explicit testing for all of the changes. They are all covered in a sense, as that's obviously how I found the issues, but explicit regression tests are clearly good.
And I meant to say it's great that you're finding all these bugs!
@jnothman yeah, so far mostly meta-estimators and "weird" estimators that we didn't test :-/
I'm not sure I understand the solution. So with silent pass I guess this would work if…

What do you not like about the interface? Is it this exact inheritance issue? Having one attribute per tag would have the same issue, right?

Actually, another way to resolve the issue is to not give…
Here's a blog complaining about people doing mixins wrong (i.e. the way we do it): …

Another person running into this here: …

But "The Hitchhiker's Guide to Python"… and therefore…
CPython does "to the left" here: https://hg.python.org/cpython/file/3.5/Lib/socketserver.py#l639
David Beazley agrees with us: https://twitter.com/dabeaz/status/809084586487664641
Currently I made some breaking changes to see if it's possible to make things go smoothly.
thanks glemaitre

Co-Authored-By: amueller <t3kcit@gmail.com>

    SUPPORT_STRING = ['SimpleImputer', 'MissingIndicator']
I didn't add a tag to SimpleImputer but the tests are not failing? hm...
Should be good now.
Failures are related to the extract patches docstring (unrelated to this PR).
@glemaitre, if you have a moment before the sprint, please approve this and merge?
LGTM. Looking forward to using this in contrib. Thanks @amueller
yay, thank you all!! @glemaitre @jnothman @rth I'm so happy!
Looking forward to using this for my custom estimators! Thank you @amueller and everyone involved!
@albertcthomas let us know what's missing please!
This reverts commit b0532c4.
Fixes #6599, #7737, #6554, #6715
Also see #6510
Todo
For another PR:
I decided to leave the _estimator_type for later because this PR is already too big.

Fun issue I just thought about: can we even define the tags for a pipeline?
How do we know whether a pipeline can handle missing data? The scalers pass it through (and the tag says they "handle" missing data) while the imputers impute it. So either these need different tags, or we punt on defining tags for pipelines, and anyone who wants to run the tests on a pipeline must set its tags explicitly. That seems not the best, though.
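One possible (entirely hypothetical) rule for combining per-step tags into a pipeline-level answer to the missing-value question above: a pipeline tolerates NaNs as long as no step that rejects them ever sees them. The tag names here are illustrative, not scikit-learn's.

```python
def pipeline_allows_nan(step_tags):
    """step_tags: list of tag dicts, one per pipeline step, in order."""
    for tags in step_tags:
        if tags.get('imputes_missing', False):
            return True   # downstream steps never see NaNs
        if not tags.get('passes_through_missing', False):
            return False  # this step would reject NaN input
    return True           # NaNs flowed through every step unharmed

scaler = {'passes_through_missing': True}   # e.g. a scaler that ignores NaNs
imputer = {'imputes_missing': True}         # removes NaNs for later steps
classifier = {}                             # rejects NaN input

print(pipeline_allows_nan([scaler, imputer, classifier]))  # True
print(pipeline_allows_nan([scaler, classifier]))           # False
```

This suggests the two behaviours mentioned above ("passes NaNs through" vs "imputes NaNs") really do need distinct tags, since the order of the steps changes the answer.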