
Warning the user of bad default values, starting by dbscan.eps #14942


Closed
wants to merge 7 commits into from

Conversation

adrinjalali
Member

Related to #13570, closes #13541.

This PR proposes adding a warning for parameters such as dbscan.eps which have no good default value.

It also adds a utility function which can be used potentially for other classes/parameters we'd like the user to explicitly specify a value for.

Since this is a deviation from our usual no-warnings-for-defaults policy, pinging @scikit-learn/core-devs.

Member

@jnothman jnothman left a comment

Interesting idea. I'm mildly supportive. :) Not sure it's going to be too noisy.

Member

@NicolasHug NicolasHug left a comment

A few comments; I like the idea in general.

@@ -208,6 +212,9 @@ class DBSCAN(ClusterMixin, BaseEstimator):
important DBSCAN parameter to choose appropriately for your data set
and distance function.

Note that there is no good default value for this parameter. An
optimal value depends on the data at hand as well as the used metric.
Member

We should also indicate that the default is 0.5 (which is generally bad) and that a warning will be raised unless the value is explicitly set.

param, obj._bad_defaults[param]) for param in bad_params])
warnings.warn(msg, UserWarning)
for param in bad_params:
setattr(obj, param, obj._bad_defaults[param])
Member

Aren't you setting in fit a parameter passed to __init__ here?

Member Author

That's a good point. It can be fixed, but it may not be a bad idea to leave it as is. Basically, if the user clones the object after fit, they won't see the warning again (which may be a good thing if we don't want it to be too noisy). But if they clone before the fit, as grid search CV does for instance, they'll see the warning on every fit if we don't set the parameter.

I'm probably forgetting some of the reasons why we don't change anything given to __init__.

Member

That has to be fixed in some way, else the estimator can't pass check_estimator.

Maybe one solution would be to call the validation in __init__ instead (we would have to relax some of the checks but that should be OK)

@adrinjalali
Member Author

I changed it to not touch the estimator and instead just return the set of parameters with the 'warn' ones replaced by their default values. It's much less intrusive and much easier to incorporate, I think.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 10, 2019 via email

@amueller
Member

I agree with @GaelVaroquaux on this. I don't think this will be helpful for users.

@glemaitre
Member

I do think this is useful if users are reading warnings. But I agree with the assessment of @GaelVaroquaux.

I would be -1

@qinhanmin2014
Member

I do think this is useful if users are reading warnings. But I agree with the assessment of @GaelVaroquaux.

I agree, maybe update the doc (parameter description) instead?

@jnothman
Member

jnothman commented Sep 15, 2019 via email

@NicolasHug
Member

If we warn, maybe we should fix our warning-catching mechanism first.

@adrinjalali
Member Author

The warning should be printed only once under the "default" policy, but it isn't, and I don't understand why.

The doc says:

print the first occurrence of matching warnings for each location (module + line number) where the warning is issued

What am I missing?
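For reference, the documented behavior can be reproduced in isolation with a small script (the function names here are illustrative):

```python
import warnings

def noisy():
    # The warn call always happens at this same (module, line) location.
    warnings.warn("no good default for eps; falling back to 0.5", UserWarning)

def count_emitted(n_calls=3):
    """Call ``noisy`` repeatedly under the "default" filter and return
    how many warnings actually get through."""
    with warnings.catch_warnings(record=True) as rec:
        warnings.simplefilter("default")
        for _ in range(n_calls):
            noisy()
        # Per the docs, only the first occurrence for this location
        # should be recorded; repeats hit the per-module
        # __warningregistry__ and are suppressed.
        return len(rec)
```

In isolation `count_emitted()` returns 1. Note, however, that CPython invalidates `__warningregistry__` whenever the filter list is mutated (e.g. by `catch_warnings`, `simplefilter`, or a test runner touching the filters), after which the "once per location" bookkeeping starts over; that is one plausible explanation for the repeated warnings observed here.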

Base automatically changed from master to main January 22, 2021 10:51
@jeremiedbb
Member

I see a lot of -1 in this discussion. I'm not in favor of adding a warning mechanism either. I'd rather explain in the eps field of the docstring that there's no good default for this. We could also add it to _required_parameters, but we would need to be careful not to do that too often (so as not to impact usability too much).

@adrinjalali would you be OK with just improving the parameter description?

@adrinjalali
Member Author

Thanks for resurfacing this, @jeremiedbb. I think independent of what we do with the warning, adding the description in the docstring is an improvement.

I would however like to see some sort of a message to the user when they don't set these values.

We could show some indicator in the HTML representation of the estimator, or in the __repr__, or use logging and logger.warn, which users can easily choose not to display.

@jjerphan jjerphan self-requested a review February 28, 2022 20:27
@ogrisel
Member

ogrisel commented Mar 3, 2022

+1 for an indicator in the HTML __repr__ for notebook users, for parameters where we cannot provide reasonable default values.

In the case of DBSCAN's eps or the radius parameter of a radius-neighbors classifier/regressor, we could provide an alternative parametrization relative to the distribution of pairwise distances observed on the training set, which would give a more sensible default than a data-blind absolute value for eps/radius. For example, the default eps could be set to the 5th percentile of observed pairwise distances between 1000 randomly selected samples of the training set. This would involve sorting up to 1e6 distance values, using a random_state hyperparameter by default (similarly to what we do for binning in HistGradientBoostingClassifier / KBinsDiscretizer or estimating the empirical CDF of the marginals in QuantileTransformer).

This is a strategy I used in this benchmark snippet to find a meaningful value of the radius that would not trigger a catastrophic memory use in a benchmark for RadiusNeighborsClassifier: #22320 (review)
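A minimal sketch of that percentile-based parametrization, in plain NumPy. The function name and the quantile/sample-size defaults are assumptions for illustration, not scikit-learn API:

```python
import numpy as np

def eps_from_pairwise_quantile(X, quantile=5.0, n_samples=1000, random_state=0):
    """Pick eps as a low percentile of pairwise distances on a subsample.

    Hypothetical helper following the idea above: subsample up to
    ``n_samples`` points, compute their pairwise Euclidean distances
    (up to ~1e6 values for the default subsample size), and return the
    requested percentile as a data-driven eps.
    """
    rng = np.random.default_rng(random_state)
    n = min(n_samples, X.shape[0])
    idx = rng.choice(X.shape[0], size=n, replace=False)
    sub = X[idx]
    # Pairwise Euclidean distances via broadcasting.
    diff = sub[:, None, :] - sub[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # Keep only the strict upper triangle: no self-distances, no duplicates.
    d = D[np.triu_indices_from(D, k=1)]
    return float(np.percentile(d, quantile))
```

A production version would use `sklearn.metrics.pairwise_distances` (respecting the estimator's metric) instead of the explicit broadcasting shown here.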

@cmarmo cmarmo added the Needs Decision Requires decision label Jul 10, 2022
@kno10
Contributor

kno10 commented Jul 11, 2022

In my opinion, the bad default values should be removed and these parameters should be made mandatory.
Backwards compatibility is a worthy goal; nevertheless, it should not prevent you from reconsidering earlier poor choices.
Instead, the usual deprecation cycle should be applied:
make the default values cause a warning for now, and indicate that for the next release these parameters will be mandatory.
This will be necessary anyway if later on some heuristic is added, e.g., choosing minpts for DBSCAN via the dimensionality-based heuristic, a sampling strategy for the radius of DBSCAN based on kNN distances, or in the case of k-means something like x-means or g-means that optimizes k, too.

On the other hand, it makes as much sense to treat such heuristics as separate algorithms (that may call DBSCAN/k-means as a function). And in the case of DBSCAN, I'd probably just use HDBSCAN* if you do not have a domain-specific requirement for a particular radius. The current default value is in my opinion the worst possible solution.

Now for this particular patch, I'd apply the same warning to min_points for consistency (why is 5 a reasonable default? It's not what the DBSCAN paper recommended).

@kno10
Contributor

kno10 commented Aug 8, 2022

In the second DBSCAN paper, section 4.2 "Determining the parameters"

Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (1998). Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data mining and knowledge discovery, 2(2), 169-194.

the authors suggest the following heuristic for parameterization:
Make a sorted k-distance plot for k=2*dimension - 1 for a sample of 1–10% of the database, then use minpts=k+1; hence minpts=2*dim should be the default. I personally would rather use max(10, dim*2) as the default, though.
For epsilon they suggest visual inspection of the sorted k-distance plot, but I believe it would be fine to use a quantile of the k-distances, e.g., the 25% quantile when sorted descending by k-distance.
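The heuristic described above can be sketched in a few lines of NumPy (illustrative only; the function name and the brute-force distance computation are assumptions, and a real implementation would use a nearest-neighbors index and a data sample):

```python
import numpy as np

def kdist_heuristic(X, quantile=0.25):
    """Suggest (minpts, eps) following Sander et al. (1998).

    minpts = 2 * dim, and eps is read off the k-distance distribution
    with k = minpts - 1: the 25% quantile of the k-distances when
    sorted descending (i.e. the 75% quantile ascending) stands in for
    the visual inspection of the sorted k-distance plot.
    """
    n, dim = X.shape
    minpts = 2 * dim  # or max(10, 2 * dim), as suggested above
    k = minpts - 1
    # Brute-force pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # k-distance: distance to the k-th nearest neighbor. Each sorted
    # row starts with the self-distance 0, so index k is the k-th NN.
    kdist = np.sort(D, axis=1)[:, k]
    eps = float(np.quantile(kdist, 1.0 - quantile))
    return minpts, eps
```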

It may also be worth adding warnings for the "red flags" mentioned in

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 1-21.

in particular, warn of bad parameters if (1) almost no points are core, (2) almost every point is core, (3) everything is a single cluster, to help the user recognize bad parameterization.
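Those three red-flag checks are straightforward to express as a post-fit diagnostic. A sketch, operating on the `labels_` and `core_sample_indices_` attributes a fitted DBSCAN exposes; the thresholds here are assumptions, not values from the paper:

```python
def dbscan_red_flags(labels, core_sample_indices, low=0.05, high=0.95):
    """Return warnings about a likely bad DBSCAN parameterization.

    Hypothetical helper implementing the three red flags listed above:
    (1) almost no points are core, (2) almost every point is core,
    (3) everything ends up in a single cluster. ``low`` and ``high``
    are illustrative cutoffs on the fraction of core points.
    """
    n = len(labels)
    core_frac = len(core_sample_indices) / n
    flags = []
    if core_frac < low:
        flags.append("almost no points are core (eps too small?)")
    if core_frac > high:
        flags.append("almost every point is core (eps too large?)")
    # Noise points carry the label -1 and do not count as a cluster.
    cluster_ids = set(labels) - {-1}
    if len(cluster_ids) == 1:
        flags.append("everything is a single cluster (eps too large?)")
    return flags
```

Such a check could run at the end of `fit` and emit a warning only when a flag fires, which would be far less noisy than warning on the default value itself.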

@jjerphan
Member

jjerphan commented Sep 8, 2024

Hi @adrinjalali,

Do you want to finish this PR, or shall we close it?

@adrinjalali
Member Author

Seems like most people are not in favor of this. I personally find it a big pity that we basically let users think they're fine while they're most probably absolutely not. But closing due to lack of consensus. I don't have the energy for this.
