
Warning the user of bad default values, starting by dbscan.eps #14942


Closed
wants to merge 7 commits into from

Conversation

adrinjalali
Member

Related to #13570, closes #13541.

This PR proposes adding a warning for parameters such as dbscan.eps which have no good default value.

It also adds a utility function which can be used potentially for other classes/parameters we'd like the user to explicitly specify a value for.

Since this is a deviation from our usual no-warnings-for-defaults policy, pinging @scikit-learn/core-devs.

Member

@jnothman jnothman left a comment

Interesting idea. I'm mildly supportive. :) Not sure it's going to be too noisy.

Member

@NicolasHug NicolasHug left a comment

A few comments; I like the idea in general.

@@ -208,6 +212,9 @@ class DBSCAN(ClusterMixin, BaseEstimator):
important DBSCAN parameter to choose appropriately for your data set
and distance function.

Note that there is no good default value for this parameter. An
optimal value depends on the data at hand as well as the used metric.
Member

We should also indicate that the default is 0.5 (which is generally bad) and that a warning will be raised unless the value is explicitly set.

param, obj._bad_defaults[param]) for param in bad_params])
warnings.warn(msg, UserWarning)
for param in bad_params:
setattr(obj, param, obj._bad_defaults[param])
Member

Aren't you setting in fit a parameter passed to __init__ here?

Member Author

That's a good point. It can be fixed, but it may not be a bad idea to leave it as is. Basically, if the user clones the object after fit, they won't see the warning again (which may be a good thing if we don't want it to be too noisy). But if they clone before the fit, as grid search CV does for instance, they'll see the warning on every fit if we don't set the parameter.

I'm probably forgetting some of the reasons why we don't change anything given to __init__.

Member

That has to be fixed in some way, else the estimator can't pass check_estimator.

Maybe one solution would be to call the validation in __init__ instead (we would have to relax some of the checks but that should be OK)

@adrinjalali
Member Author

I changed it to not touch the estimator and instead just return the set of parameters with the 'warn' ones replaced by their default values. It's much less intrusive and much easier to incorporate, I think.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 10, 2019 via email

@amueller
Member

I agree with @GaelVaroquaux on this. I don't think this will be helpful for users.

@glemaitre
Member

I do think this is useful if users are reading warnings. But I agree with the assessment of @GaelVaroquaux.

I would be -1

@qinhanmin2014
Member

I do think this is useful if users are reading warnings. But I agree with the assessment of @GaelVaroquaux.

I agree, maybe update the doc (parameter description) instead?

@jnothman
Member

jnothman commented Sep 15, 2019 via email

@NicolasHug
Member

If we warn, maybe we should fix our warning-catching mechanism first.

@adrinjalali
Member Author

The warning should be printed only once under the "default" policy, but it isn't, and I don't understand why.

The doc says:

print the first occurrence of matching warnings for each location (module + line number) where the warning is issued

What am I missing?
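For reference, the documented behavior can be reproduced in isolation with a small script (the function names here are illustrative):

```python
import warnings

def noisy():
    # The warn call always happens at this same (module, line) location.
    warnings.warn("no good default for eps; falling back to 0.5", UserWarning)

def count_emitted(n_calls=3):
    """Call ``noisy`` repeatedly under the "default" filter and return
    how many warnings actually get through."""
    with warnings.catch_warnings(record=True) as rec:
        warnings.simplefilter("default")
        for _ in range(n_calls):
            noisy()
        # Per the docs, only the first occurrence for this location
        # should be recorded; repeats hit the per-module
        # __warningregistry__ and are suppressed.
        return len(rec)
```

In isolation `count_emitted()` returns 1. Note, however, that CPython invalidates `__warningregistry__` whenever the filter list is mutated (e.g. by `catch_warnings`, `simplefilter`, or a test runner touching the filters), after which the "once per location" bookkeeping starts over; that is one plausible explanation for the repeated warnings observed here.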

Base automatically changed from master to main January 22, 2021 10:51
@jeremiedbb
Member

I see a lot of -1 in this discussion. I'm not in favor of adding a warning mechanism either. I'd rather explain in the eps field of the docstring that there's no good default for this. We could also add it to _required_parameters, but we would need to be careful not to do that too often (so as not to impact usability too much).

@adrinjalali would you be OK with just improving the parameter description?

@adrinjalali
Member Author

Thanks for resurfacing this, @jeremiedbb. I think independent of what we do with the warning, adding the description in the docstring is an improvement.

I would however like to see some sort of a message to the user when they don't set these values.

We could show some indicator in the HTML representation of the estimator, or in the __repr__, or use logging and logger.warn, which users can easily choose not to display.

@jjerphan jjerphan self-requested a review February 28, 2022 20:27
@ogrisel
Member

ogrisel commented Mar 3, 2022

+1 for an indicator in the HTML __repr__ for notebook users, for parameters where we cannot provide reasonable default values.

In the case of DBSCAN's eps or the radius parameter of a radius-neighbors classifier/regressor, we could provide an alternative parametrization relative to the distribution of pairwise distances observed on the training set, which would give a more sensible default than a data-blind absolute value for eps/radius. For example, the default eps could be set to the 5th percentile of observed pairwise distances between 1000 randomly selected samples of the training set. This would involve sorting up to 1e6 distance values, using a random_state hyperparameter by default (similarly to what we do for binning in HistGradientBoostingClassifier / KBinsDiscretizer or estimating the empirical CDF of the marginals in QuantileTransformer).

This is a strategy I used in this benchmark snippet to find a meaningful value of the radius that would not trigger a catastrophic memory use in a benchmark for RadiusNeighborsClassifier: #22320 (review)
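A minimal sketch of that percentile-based parametrization, in plain NumPy. The function name and the quantile/sample-size defaults are assumptions for illustration, not scikit-learn API:

```python
import numpy as np

def eps_from_pairwise_quantile(X, quantile=5.0, n_samples=1000, random_state=0):
    """Pick eps as a low percentile of pairwise distances on a subsample.

    Hypothetical helper following the idea above: subsample up to
    ``n_samples`` points, compute their pairwise Euclidean distances
    (up to ~1e6 values for the default subsample size), and return the
    requested percentile as a data-driven eps.
    """
    rng = np.random.default_rng(random_state)
    n = min(n_samples, X.shape[0])
    idx = rng.choice(X.shape[0], size=n, replace=False)
    sub = X[idx]
    # Pairwise Euclidean distances via broadcasting.
    diff = sub[:, None, :] - sub[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # Keep only the strict upper triangle: no self-distances, no duplicates.
    d = D[np.triu_indices_from(D, k=1)]
    return float(np.percentile(d, quantile))
```

A production version would use `sklearn.metrics.pairwise_distances` (respecting the estimator's metric) instead of the explicit broadcasting shown here.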

@cmarmo cmarmo added the Needs Decision Requires decision label Jul 10, 2022
@kno10
Contributor

kno10 commented Jul 11, 2022

In my opinion, the bad default values should be removed and these parameters should be made mandatory.
Backwards compatibility is a worthy goal; nevertheless, it should not prevent you from reconsidering earlier poor choices.
Instead, the usual deprecation cycle should be applied:
make the default values cause a warning for now, and indicate that for the next release these parameters will be mandatory.
This will be necessary anyway if later on some heuristic is added, e.g., choosing minpts for DBSCAN via the dimensionality-based heuristic, a sampling strategy for the radius of DBSCAN based on kNN distances, or in the case of k-means something like x-means or g-means that optimizes k, too.

On the other hand, it makes as much sense to treat such heuristics as separate algorithms (that may call DBSCAN/k-means as a function). And in the case of DBSCAN, I'd probably just use HDBSCAN* if you do not have a domain-specific requirement for a particular radius. The current default value is in my opinion the worst possible solution.

Now for this particular patch, I'd apply the same warning to min_points for consistency (why is 5 a reasonable default? It's not what the DBSCAN paper recommended).

@kno10
Contributor

kno10 commented Aug 8, 2022

In the second DBSCAN paper, section 4.2 "Determining the parameters"

Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (1998). Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data mining and knowledge discovery, 2(2), 169-194.

the authors suggest the following heuristic for parameterization:
Make a sorted k-distance plot for k=2*dimension - 1 for a sample of 1–10% of the database, then use minpts=k+1; hence minpts=2*dim should be the default. I personally would rather use max(10, dim*2) as the default, though.
For epsilon they suggest visual inspection of the sorted k-distance plot, but I believe it would be fine to use a quantile of the k-distances, e.g., the 25% quantile when sorted descending by k-distance.
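The heuristic described above can be sketched in a few lines of NumPy (illustrative only; the function name and the brute-force distance computation are assumptions, and a real implementation would use a nearest-neighbors index and a data sample):

```python
import numpy as np

def kdist_heuristic(X, quantile=0.25):
    """Suggest (minpts, eps) following Sander et al. (1998).

    minpts = 2 * dim, and eps is read off the k-distance distribution
    with k = minpts - 1: the 25% quantile of the k-distances when
    sorted descending (i.e. the 75% quantile ascending) stands in for
    the visual inspection of the sorted k-distance plot.
    """
    n, dim = X.shape
    minpts = 2 * dim  # or max(10, 2 * dim), as suggested above
    k = minpts - 1
    # Brute-force pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # k-distance: distance to the k-th nearest neighbor. Each sorted
    # row starts with the self-distance 0, so index k is the k-th NN.
    kdist = np.sort(D, axis=1)[:, k]
    eps = float(np.quantile(kdist, 1.0 - quantile))
    return minpts, eps
```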

It may also be worth adding warnings for the "red flags" mentioned in

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 1-21.

in particular, warn of bad parameters if (1) almost no points are core, (2) almost every point is core, (3) everything is a single cluster, to help the user recognize bad parameterization.
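Those three red-flag checks are straightforward to express as a post-fit diagnostic. A sketch, operating on the `labels_` and `core_sample_indices_` attributes a fitted DBSCAN exposes; the thresholds here are assumptions, not values from the paper:

```python
def dbscan_red_flags(labels, core_sample_indices, low=0.05, high=0.95):
    """Return warnings about a likely bad DBSCAN parameterization.

    Hypothetical helper implementing the three red flags listed above:
    (1) almost no points are core, (2) almost every point is core,
    (3) everything ends up in a single cluster. ``low`` and ``high``
    are illustrative cutoffs on the fraction of core points.
    """
    n = len(labels)
    core_frac = len(core_sample_indices) / n
    flags = []
    if core_frac < low:
        flags.append("almost no points are core (eps too small?)")
    if core_frac > high:
        flags.append("almost every point is core (eps too large?)")
    # Noise points carry the label -1 and do not count as a cluster.
    cluster_ids = set(labels) - {-1}
    if len(cluster_ids) == 1:
        flags.append("everything is a single cluster (eps too large?)")
    return flags
```

Such a check could run at the end of `fit` and emit a warning only when a flag fires, which would be far less noisy than warning on the default value itself.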

@jjerphan
Member

jjerphan commented Sep 8, 2024

Hi @adrinjalali,

Do you want to finish this PR, or shall we close it?

@adrinjalali
Member Author

Seems like most people are not in favor of this. I personally find it a big pity that we basically let users think they're fine while they're most probably absolutely not. But closing due to lack of consensus. I don't have the energy for this.
