
[MRG] Allow scoring of dummies without test samples #11957


Merged: 13 commits into scikit-learn:master on Sep 13, 2018

Conversation

JarnoRFB
Contributor

As DummyClassifier and DummyRegressor operate solely on the targets,
they can now be used without passing test samples, instead passing None.
Also includes some minor renaming in the corresponding tests for more
consistency.

Reference Issues/PRs

Resolves #11951
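
For illustration, a minimal sketch of the new usage (the data and strategy here are made up for the example):

import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([0, 1, 1, 1])
X = np.zeros((len(y), 1))  # features are ignored by dummy estimators

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
clf.score(None, y)  # now allowed; gives the same result as clf.score(X, y)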

@JarnoRFB
Contributor Author

Not sure if the function ensure_consistent_length is in the right place there, or if it should go in some utils module.
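
For context, a sketch of roughly what the helper does (pieced together from this thread, not the exact diff):

import numpy as np

def _ensure_consistent_length(X, y):
    # Dummy estimators operate only on the targets, so when X is None,
    # fabricate placeholder samples of the same length as y.
    if X is None:
        X = np.zeros(shape=(len(y), 1))
    return X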

@JarnoRFB
Contributor Author

JarnoRFB commented Sep 2, 2018

Thanks for the feedback! I tried to incorporate the requested changes.

Member

@jnothman jnothman left a comment

Almost there, thanks!

sklearn/dummy.py Outdated
return super(DummyRegressor, self).score(X, y, sample_weight)


def _ensure_consistent_length(X, y):
Member

For readability, I think you should just inline this code, and not use a separate function.

Contributor Author

Well, I thought that using a function with a descriptive name would actually increase readability and avoid duplication, as the functionality is needed for both the classifier and the regressor. Maybe you can elaborate on how inlining the code would improve readability. I can of course change that if you want.

Member

A function name can improve readability, but I don't think that the meaning of _ensure_consistent_length is obvious enough to do that. Perhaps _handle_no_x would be clearer.

Contributor Author

I renamed the function; let me know what you think.

@JarnoRFB
Contributor Author

JarnoRFB commented Sep 5, 2018

I renamed the function; let me know what you think. If you still don't like it, I can inline the code.

Member

@jnothman jnothman left a comment

I still suspect it would be clearer inline.
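
Inlined, the check would sit directly in score, roughly like this (a sketch based on the DummyRegressor.score excerpt above, assuming numpy is imported as np):

def score(self, X, y, sample_weight=None):
    # X may be None because dummy estimators ignore the features;
    # substitute placeholder samples of matching length.
    if X is None:
        X = np.zeros(shape=(len(y), 1))
    return super(DummyRegressor, self).score(X, y, sample_weight)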

sklearn/dummy.py Outdated
samples used in the fitting for the estimator.
Passing None as test samples gives the same result
as passing real test samples, since DummyRegressor
operates solely on the targets.
Member

operates independent of the sampled observations.

@jnothman
Member

jnothman commented Sep 6, 2018

Please add an entry to the change log at doc/whats_new/v0.21.rst. Please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

Contributor

@albertcthomas albertcthomas left a comment

If it is assumed that dummy estimator predictions do not depend on X, do we need to add something (in the docs or via a test) to ensure that someone will not add a strategy that uses information from X?

@JarnoRFB
Contributor Author

@albertcthomas Thanks for the review! In the model evaluation docs, in the dummy-estimator section http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators, it says
"Note that with all these strategies, the predict method completely ignores the input data!".
So there is something in the docs. One could of course add tests for every dummy estimator and every strategy, so that the estimator, once fitted, produces the same prediction for different X.

The definition of a dummy estimator as predicting the label based only on the label distribution, independent of the features, makes a lot of sense to me, as this is, so to speak, the best you can get when your features have absolutely no predictive power.

I could add some tests if you think it makes sense.
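
Such a test could look roughly like this (a sketch; the test name and values are illustrative, not the exact tests added in this PR):

from numpy.testing import assert_array_equal
from sklearn.dummy import DummyClassifier

def test_classifier_prediction_independent_of_X():
    y = [0, 1, 1, 1]
    X1 = [[0]] * len(y)
    X2 = [[1]] * len(y)
    # Fitting on different X with the same y must give identical
    # predictions, since dummy estimators ignore the features.
    clf1 = DummyClassifier(strategy="most_frequent", random_state=1).fit(X1, y)
    clf2 = DummyClassifier(strategy="most_frequent", random_state=1).fit(X2, y)
    assert_array_equal(clf1.predict(X1), clf2.predict(X2))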

@JarnoRFB
Contributor Author

Ok, I added some tests checking that predictions are independent of X. If someone implemented a strategy that did not operate independently of X, they would of course have to add their strategy to the test to see it fail. But I guess that should make it a bit more obvious that this is not desired.

@albertcthomas
Contributor

The definition of a dummy estimator as predicting the label based only on the label distribution, independent of the features, makes a lot of sense to me, as this is, so to speak, the best you can get when your features have absolutely no predictive power.

I agree. My point is just that if a contributor or a user adds a dummy strategy that depends on X then returning the score independently of X would be wrong and this could fail silently.

@JarnoRFB
Contributor Author

I agree. My point is just that if a contributor or a user adds a dummy strategy that depends on X then returning the score independently of X would be wrong and this could fail silently.

One possibility would be to go from

X = np.zeros(shape=(len(y), 1))

back to

X = [[None]] * len(y)

so that if predict actually used the values from X, any computation on the made-up X would fail (see the example below). Additionally, one could add even more explicit documentation. What do you think?
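
For example (illustrative only):

import numpy as np

X = [[None]] * 4
np.asarray(X) + 1  # raises TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'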

Contributor

@albertcthomas albertcthomas left a comment

I think we can stick with what you already implemented and the tests you added. If a user wants to pass X=None to clf.score, it is clearly mentioned in the docstring that dummy estimators operate independently of X. Thanks @JarnoRFB!

@albertcthomas
Contributor

To be completely safe, another possibility would be to add a check in clf.score such as

if X is None and strategy not in {current strategies}:
    raise ValueError('informative message')

@JarnoRFB
Contributor Author

To be completely safe, another possibility would be to add a check in clf.score such as

if X is None and strategy not in {current strategies}:
    raise ValueError('informative message')

Not really sure about that. A similar check is already performed in clf.fit, so anyone implementing a new strategy would have to add it to that check anyway. Therefore, they would also add it to the check in clf.score if we put one in the code.

Adding a second strategy check to clf.score would of course make it possible to have some strategies that work independently of X and some that do not. However, I think it would make more sense to clearly define a DummyEstimator as an estimator that ignores X, and not even introduce the possibility of implementing something else.

@jnothman
Member

Let's sneak this into 0.20.
Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@JarnoRFB
Contributor Author

@jnothman I moved the entry to doc/whats_new/v0.20.rst

@jnothman jnothman merged commit 3ee1cfc into scikit-learn:master Sep 13, 2018
@jnothman
Member

Thanks @JarnoRFB!

@JarnoRFB
Contributor Author

Thanks for guiding me through this. Very happy to contribute!

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 17, 2018
Successfully merging this pull request may close these issues:

Scoring for dummy classifier does not work without test samples (#11951)