-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Warning the user of bad default values, starting by dbscan.eps #14942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea. I'm mildly supportive. :) Not sure it's going to be too noisy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, I like the idea in general
sklearn/cluster/dbscan_.py
Outdated
@@ -208,6 +212,9 @@ class DBSCAN(ClusterMixin, BaseEstimator): | |||
important DBSCAN parameter to choose appropriately for your data set | |||
and distance function. | |||
|
|||
Note that there is no good default value for this parameter. An | |||
optimal value depends on the data at hand as well as the used metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also indicate that the default is 0.5 (which is generally bad) and that a warning will be raised unless the value is explicitly set.
sklearn/utils/validation.py
Outdated
param, obj._bad_defaults[param]) for param in bad_params]) | ||
warnings.warn(msg, UserWarning) | ||
for param in bad_params: | ||
setattr(obj, param, obj._bad_defaults[param]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't you setting in fit
a parameter passed to __init__
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good point. It can be fixed, but it may not be a bad idea not to fix it. Basically, if the user clones the object after fit, they won't have the warning again (which may be a good thing, if we don't want it to be too crowded). But if they clone before the fit, which is done in grid search cv for instance, they'll see the warning for every fit, if it's not setting the parameter.
I'm probably forgetting some of the reasons why we don't change anything given to __init__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That has to be fixed in some way, else the estimator can't pass check_estimator
.
Maybe one solution would be to call the validation in __init__
instead (we would have to relax some of the checks but that should be OK)
I changed it to not touch the estimator, and just return the set of parameters with the |
I worry about the noise that this will generate. People already tend not
to read warnings anyhow.
|
I agree with @GaelVaroquaux on this. I don't think this will be helpful for users. |
I am deeply thinking this is useful if users are reading. But I agree with the assessment of @GaelVaroquaux I would be -1 |
I agree, maybe update the doc (parameter description) instead? |
Having a badge like we do in what's new is helpful once the user has opened
the html docs...
I agree about the warning fatigue but I think that if we sincerely think a
parameter must be tuned, a warning or making it required is appropriate.
We don't lose much by posting a warning. But I wonder if changing default
warning settings from 'always' to 'once' would help manage fatigue.
|
If we warn maybe we should fix our warning catching mechanism first |
The warning should be printed only once with the "default" policy, but it doesn't, and I don't understand why. The doc says:
What am I missing? |
I see a lot of -1 in this discussion. I'm not in favor of adding a warning mechanism either. I'd rather explain in the eps field of the docstring that there's no good default for this. We could also make it a @adrinjalali would you be ok with just improving the parameter description ? |
Thanks for resurfacing this @jeremiedbb . I think independent of what we do with the warning, adding the description in the docstring is an improvement. I would however like to see some sort of a message to the user when they don't set these values. We could show some indicator in the html representation of the estimator, or in the |
+1 for an indicator in the HTML In the case of This is a strategy I used in this benchmark snippet to find a meaningful value of the radius that would not trigger a catastrophic memory use in a benchmark for |
In my opinion, the bad default values should be removed and these parameters should be made mandatory. Now for this particular patch, I'd apply the same warning to min_points for consistency (why is 5 a reasonable default? it's not what the DBSCAN paper recommended). |
In the second DBSCAN paper, section 4.2 "Determining the parameters"
the authors suggest the following heuristic for parameterization: It may also be worth adding warnings for the "red flags" mentioned in
in particular, warn of bad parameters if (1) almost no points are core, (2) almost every point is core, (3) everything is a single cluster, to help the user recognize bad parameterization. |
Hi @adrinjalali, Do you want to finish this PR, or shall we close it? |
Seems like most people are not in favor of this. I personally find it a big pity that we basically let users think they're fine while they're most probably absolutely not. But closing due to lack of consensus. I don't have the energy for this. |
Related to #13570, closes #13541.
This PR proposes adding a warning for parameters such as
dbscan.eps
which have no good default value.It also adds a utility function which can be used potentially for other classes/parameters we'd like the user to explicitly specify a value for.
Since it's a deviation from our usual no warning with defaults policy, pinging @scikit-learn/core-devs