
[MRG+2] #7908 Addressed issue in first iteration of RANSAC regression #7914


Merged: 16 commits into scikit-learn:master, Jan 5, 2017

Conversation

mthorrell
Contributor

Reference Issue

Fixes #7908

What does this implement/fix? Explain your changes.

On the first iteration of RANSAC regression, if no inliers are found, an error is raised and execution stops. Ideally the procedure would simply skip that iteration and continue to the next one, where a different random sample could produce valid inliers.

Generally this error occurs when n_inliers_subset and n_inliers_best are both zero. My fix sets the initial value of n_inliers_best to 1. If n_inliers_subset >= 1, the code follows the normal path; if n_inliers_subset == 0, it moves on to the next iteration. This fixes the bug as described in the issue.

However, setting n_inliers_best = 1 reintroduces a problem in the first iteration when n_inliers_subset == 1: the comparison score_subset < score_best is then made, and since score_best is initialized to np.inf, the code incorrectly skips to the next iteration even though valid inliers were found in the first iteration. The fix for this is to initialize score_best to -np.inf instead. In general, I think this is the better practice for initializing these types of variables anyway.
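The control flow described above can be sketched as follows. This is a toy stand-in for illustration only, not scikit-learn's actual ransac.py: inlier_counts and scores abstract away the per-subsample model fitting and scoring.

```python
import numpy as np

def ransac_control_flow(inlier_counts, scores, max_trials=5):
    """Walk candidate subsamples, skipping those with zero inliers."""
    n_inliers_best = 1    # was 0; 1 makes zero-inlier subsets skip cleanly
    score_best = -np.inf  # was +inf; -inf lets the first valid subset win
    best = None
    for i in range(min(max_trials, len(inlier_counts))):
        n_inliers_subset = inlier_counts[i]
        score_subset = scores[i]
        # A subset with fewer inliers than the current best is skipped;
        # with n_inliers_best = 1, zero-inlier subsets always take this branch.
        if n_inliers_subset < n_inliers_best:
            continue
        # With score_best = -inf, a tied-inlier-count subset is only rejected
        # when its score is genuinely worse, never on the first valid iteration.
        if n_inliers_subset == n_inliers_best and score_subset < score_best:
            continue
        n_inliers_best = n_inliers_subset
        score_best = score_subset
        best = i
    return best

# First subsample has no inliers; the second is valid and is kept.
print(ransac_control_flow([0, 3, 2], [0.0, 0.9, 0.5]))  # -> 1
```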

Any other comments?

@mthorrell
Contributor Author

In this small update, I adjusted the offending test so that it does not require a specific output; however, it still requires that an error be raised when residual_threshold = 0.

@amueller
Member

Thanks for the PR. I think changing the tests was the way to go! (I don't have time for a review right now, unfortunately.)

@amueller amueller added this to the 0.19 milestone Nov 28, 2016
@amueller amueller added the Bug label Nov 28, 2016
@mthorrell
Contributor Author

Thanks.

I made a couple minor changes:

  • The test now more specifically checks the error message since I think this is better practice.
  • I also cleaned up the formatting of that error message.

msg = (
"RANSAC could not find valid consensus set, because"
" either the `residual_threshold` rejected all the samples or"
" `is_data_valid` and `is_model_valid` returned False for all"
Member

Referring to internal functions is not very helpful for the users, IMHO.

Contributor Author

I believe you are referring to is_data_valid and is_model_valid. These are user-defined functions, much like residual_threshold is a user-defined parameter. When a user does not define these functions, RANSAC does not call them; they are initialized to None and evaluation is skipped. Unless you disagree, I think it is still valid to refer to them.

Perhaps a better message might read:

RANSAC could not find a valid consensus set. All max_trials iterations were skipped because each randomly chosen sub-sample either: didn't produce a valid set of inliers due to a small residual_threshold or was invalidated by is_data_valid or is_model_valid if either of these functions have been defined by the user.

Member

Oh, sorry, my fault. The current message is ok, though your suggestion is even better!

@amueller
Member

maybe @ahojnnes wants to review?

Member

@jnothman jnothman left a comment

I am not certain this is the right fix. This still fails if a 0-inlier sample is drawn at any point after the first >0-inlier sample is drawn.

We could introduce a parameter to control how many 0-inlier samples are allowed through, or just get rid of this control (the main risk being lots of time wasted).

A couple of more cosmetic things we can do to improve this behaviour:

  • If we continue to raise an error when no inliers are found, we should report the minimum residual found (and perhaps other summary statistics of the residual distribution) as well as the current threshold, so that the user has a means to tune the parameter.
  • introduce a verbose option that reports things like the median residual, number of inliers, etc., at each iteration.

Ping @CSchoel for your opinion.
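The kind of diagnostics suggested above could be sketched roughly as below. The helper name and the exact statistics reported are illustrative assumptions, not an actual scikit-learn API; the idea is just to give the user numbers against which to tune residual_threshold.

```python
import numpy as np

def residual_diagnostics(y_true, y_pred, residual_threshold):
    """Summarise absolute residuals so a user can tune residual_threshold."""
    residuals = np.abs(y_true - y_pred)
    return {
        "residual_threshold": residual_threshold,
        "min_residual": float(residuals.min()),
        "median_residual": float(np.median(residuals)),
        "n_inliers": int((residuals <= residual_threshold).sum()),
    }

# One extreme outlier; a threshold of 0.05 rejects every sample.
y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 2.2, 2.9, 3.0])
print(residual_diagnostics(y_true, y_pred, residual_threshold=0.05))
```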

" iterations were skipped because each randomly chosen"
" sub-sample either: didn't produce a valid set of inliers"
" due to a small `residual_threshold` or was invalidated by"
" `is_data_valid` or `is_model_valid` if either of these"
Member

I think the "if either ..." can be removed

@jnothman
Member

jnothman commented Dec 7, 2016

Sorry, silly me. This patch does indeed fix the issue, because the check for n_inliers_subset == 0 only applies when it is >= the best subset (which defaults to 1 in this PR). But that seems to mean that the "No inliers found" error is unreachable; as a result, we may draw all trials with 0-inlier samples, and this is expensive.

If this is the right patch, you need to write a test. But we may want a more configurable parameter to control the number of 0-inlier samples. Certainly, we should be giving the user more information on how to set the threshold appropriately.

@mthorrell
Contributor Author

The if statement containing n_inliers_subset == 0 was unreachable and hence was removed, so there shouldn't be any unreachable code generated by this PR.

If this is the fix, a test already exists that generates an appropriate error. Unless I'm missing something, there would be nothing else to do here except potentially edit the error message--again assuming we go with this fix.

However, I do think it makes sense to put a limit on the number of iterations that are tried which end in skips. Also, is_data_valid or is_model_valid themselves could be computationally expensive, so I would think we would count skips due to data/model failure the same as skips due to a strict residual_threshold. I suppose one could explicitly track the causes of different skips separately (is_data_valid vs. is_model_valid vs. residual_threshold), but this seems excessive to me.

I'll submit a commit that allows a user to set a max_skips parameter. It will default to max_trials.
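A hypothetical sketch of how such a max_skips control could behave; the names here (run_with_max_skips, n_skips, subsample_results) are illustrative and not scikit-learn's actual implementation. Each entry of subsample_results stands for whether one random subsample passed the validity and inlier checks.

```python
import warnings

def run_with_max_skips(subsample_results, max_trials=10, max_skips=None):
    """Count skipped iterations and stop once max_skips is exceeded."""
    if max_skips is None:
        max_skips = max_trials  # proposed default: as many skips as trials
    n_skips = 0
    accepted = []
    for ok in subsample_results[:max_trials]:
        if n_skips > max_skips:
            warnings.warn("RANSAC: skipped more iterations than max_skips")
            break
        if not ok:  # failed is_data_valid/is_model_valid, or zero inliers
            n_skips += 1
            continue
        accepted.append(ok)
    return n_skips, len(accepted)

# Two skips exceed max_skips=1, so the loop stops before the valid subsample.
print(run_with_max_skips([False, False, True, False], max_skips=1))  # -> (2, 0)
```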

@jnothman
Member

jnothman commented Dec 8, 2016

(Sorry for my confusions. Was trying to grok the overall algorithm)

@jnothman
Member

jnothman commented Dec 8, 2016

If this is the fix, a test already exists that generates an appropriate error.

A non-regression test, which allows 0 inliers to be drawn in the first iteration, would be ideal.

@jnothman
Member

jnothman commented Dec 8, 2016

Yes, I don't mind max_skips. I also don't think there's any harm in reporting diagnostics of various kinds, summarising residuals or reason for skipping / not updating.

@mthorrell
Contributor Author

Err... I need to finish the tests, so this is still WIP.

@jnothman
Member

Still still WIP?

Member

@jnothman jnothman left a comment

It would be useful to set these diagnostics as attributes of the model, i.e. n_skips_no_inliers_, etc.

We may also then choose to remove them from the error message (and certainly from the warning, to help the warnings module ignore duplicates), pointing the user to these attributes instead.
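A toy sketch of the attribute idea: the attribute names follow the PR discussion (n_skips_no_inliers_, etc., with scikit-learn's trailing-underscore convention for fitted attributes), but the class itself is a stand-in for illustration, not RANSACRegressor.

```python
class ToyRansac:
    """Stand-in estimator exposing skip diagnostics as fitted attributes."""

    def fit(self, trial_outcomes):
        # trial_outcomes: per-iteration skip reason, or None if accepted
        self.n_skips_no_inliers_ = 0
        self.n_skips_invalid_data_ = 0
        self.n_skips_invalid_model_ = 0
        for reason in trial_outcomes:
            if reason == "no_inliers":
                self.n_skips_no_inliers_ += 1
            elif reason == "invalid_data":
                self.n_skips_invalid_data_ += 1
            elif reason == "invalid_model":
                self.n_skips_invalid_model_ += 1
        return self

est = ToyRansac().fit(["no_inliers", None, "invalid_data", "no_inliers"])
print(est.n_skips_no_inliers_, est.n_skips_invalid_data_)  # -> 2 1
```

An error or warning message can then simply point the user at these attributes instead of embedding the counts, which also keeps warning text constant so the warnings module can deduplicate it.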

@@ -111,6 +111,11 @@ class RANSACRegressor(BaseEstimator, MetaEstimatorMixin, RegressorMixin):
max_trials : int, optional
Maximum number of iterations for random sample selection.

max_skips : int, optional
Maximum number of iterations that can be skipped due to finding zero
inliers or invalid data defined by `is_data_valid` or invalid models
Member

need double-backticks

Member

@jnothman jnothman left a comment

I'd still consider including a diagnostic about the stringency of the threshold, either reporting some statistic of the residuals (e.g. average min) or some statistic of the inlier subset size.

@mthorrell
Contributor Author

I like the attributes idea. The error and warning messages seemed a bit clunky. Let me see if I can do something about the stringency of the threshold as well.

@mthorrell
Contributor Author

I added the n_skips_no_inliers etc. attributes and adjusted the error messages and tests.

I did not add anything indicating the stringency of the threshold. Should that be a different PR? This one is getting rather large, IMO.

@jnothman
Member

jnothman commented Dec 29, 2016 via email

Member

@jnothman jnothman left a comment

Otherwise LGTM!

max_skips : int, optional
Maximum number of iterations that can be skipped due to finding zero
inliers or invalid data defined by ``is_data_valid`` or invalid models
defined by ``is_data_valid``.
Member

should be is_model_valid

"RANSAC skipped more iterations than `max_skips` without"
" finding a valid consensus set. Iterations were skipped"
" because each randomly chosen sub-sample failed the"
" passing criteria. The object attributes"
Member

object -> estimator. But perhaps "See estimator attributes for diagnostics (n_skips*)." is sufficient

@jnothman jnothman changed the title [MRG] #7908 Addressed issue in first iteration of RANSAC regression [MRG+1] #7908 Addressed issue in first iteration of RANSAC regression Dec 29, 2016
@agramfort
Member

+1 for merge

@jnothman merge if you're happy

@agramfort agramfort changed the title [MRG+1] #7908 Addressed issue in first iteration of RANSAC regression [MRG+2] #7908 Addressed issue in first iteration of RANSAC regression Dec 31, 2016
Maximum number of iterations that can be skipped due to finding zero
inliers or invalid data defined by ``is_data_valid`` or invalid models
defined by ``is_model_valid``.

Member

I suppose nowadays we're meant to have .. versionadded annotations in docstrings.

Contributor Author

Is it accurate to say .. versionadded:: 0.19 or should it still be 0.18?

@jnothman
Member

jnothman commented Jan 2, 2017 via email

@mthorrell
Contributor Author

made the versionadded change. I believe this is finished, assuming the tests pass.

Member

@jnothman jnothman left a comment

Sorry, I'd forgotten to ask you to add an entry to whats_new.rst (under enhancements, I think? Bug fixes if you'd rather?)

@mthorrell
Contributor Author

I made the entry in whats_new.rst. Is there anything else needed?

@jnothman jnothman merged commit d0ce0d9 into scikit-learn:master Jan 5, 2017
@jnothman
Member

jnothman commented Jan 5, 2017

All good, thanks!

raghavrv pushed a commit to raghavrv/scikit-learn that referenced this pull request Jan 5, 2017
sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

Successfully merging this pull request may close these issues.

Finding no inliers in one iteration of RANSAC should not raise a ValueError
4 participants