Commit 760edca

DOC Enhance DBSCAN docstrings with clearer parameter guidance and descriptions (#31835)
1 parent fe08016 commit 760edca

1 file changed: 47 additions and 15 deletions

sklearn/cluster/_dbscan.py

Lines changed: 47 additions & 15 deletions
@@ -41,25 +41,38 @@ def dbscan(
 ):
     """Perform DBSCAN clustering from vector array or distance matrix.

+    This function is a wrapper around :class:`~cluster.DBSCAN`, suitable for
+    quick, standalone clustering tasks. For estimator-based workflows, where
+    estimator attributes or pipeline integration is required, prefer
+    :class:`~cluster.DBSCAN`.
+
+    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a
+    density-based clustering algorithm that groups together points that are
+    closely packed while marking points in low-density regions as outliers.
+
     Read more in the :ref:`User Guide <dbscan>`.

     Parameters
     ----------
-    X : {array-like, sparse (CSR) matrix} of shape (n_samples, n_features) or \
+    X : {array-like, scipy sparse matrix} of shape (n_samples, n_features) or \
             (n_samples, n_samples)
         A feature array, or array of distances between samples if
-        ``metric='precomputed'``.
+        ``metric='precomputed'``. When using precomputed distances, X must
+        be a square symmetric matrix.

     eps : float, default=0.5
         The maximum distance between two samples for one to be considered
         as in the neighborhood of the other. This is not a maximum bound
         on the distances of points within a cluster. This is the most
         important DBSCAN parameter to choose appropriately for your data set
-        and distance function.
+        and distance function. Smaller values result in more clusters,
+        while larger values result in fewer, larger clusters.

     min_samples : int, default=5
         The number of samples (or total weight) in a neighborhood for a point
         to be considered as a core point. This includes the point itself.
+        Higher values yield fewer, denser clusters, while lower values yield
+        more, sparser clusters.

     metric : str or callable, default='minkowski'
         The metric to use when calculating distance between instances in a
@@ -79,17 +92,23 @@ def dbscan(
     algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
         The algorithm to be used by the NearestNeighbors module
         to compute pointwise distances and find nearest neighbors.
-        See NearestNeighbors module documentation for details.
+        'auto' will attempt to decide the most appropriate algorithm
+        based on the values passed to :meth:`fit` method.
+        See :class:`~sklearn.neighbors.NearestNeighbors` documentation for
+        details.

     leaf_size : int, default=30
         Leaf size passed to BallTree or cKDTree. This can affect the speed
         of the construction and query, as well as the memory required
         to store the tree. The optimal value depends
-        on the nature of the problem.
+        on the nature of the problem. Generally, smaller leaf sizes
+        lead to faster queries but slower construction.

     p : float, default=2
-        The power of the Minkowski metric to be used to calculate distance
-        between points.
+        Power parameter for the Minkowski metric. When p = 1, this is equivalent
+        to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2.
+        For arbitrary p, minkowski_distance (l_p) is used. This parameter is expected
+        to be positive.

     sample_weight : array-like of shape (n_samples,), default=None
         Weight of each sample, such that a sample with a weight of at least
@@ -101,7 +120,7 @@ def dbscan(
         The number of parallel jobs to run for neighbors search. ``None`` means
         1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means
         using all processors. See :term:`Glossary <n_jobs>` for more details.
-        If precomputed distance are used, parallel execution is not available
+        If precomputed distances are used, parallel execution is not available
         and thus n_jobs will have no effect.

     Returns
@@ -110,7 +129,8 @@ def dbscan(
         Indices of core samples.

     labels : ndarray of shape (n_samples,)
-        Cluster labels for each point.  Noisy samples are given the label -1.
+        Cluster labels for each point. Noisy samples are given the label -1.
+        Non-negative integers indicate cluster membership.

     See Also
     --------
@@ -183,7 +203,11 @@ class DBSCAN(ClusterMixin, BaseEstimator):

     DBSCAN - Density-Based Spatial Clustering of Applications with Noise.
     Finds core samples of high density and expands clusters from them.
-    Good for data which contains clusters of similar density.
+    This algorithm is particularly good for data which contains clusters of
+    similar density and can find clusters of arbitrary shape.
+
+    Unlike K-means, DBSCAN does not require specifying the number of clusters
+    in advance and can identify outliers as noise points.

     This implementation has a worst case memory complexity of :math:`O({n}^2)`,
     which can occur when the `eps` param is large and `min_samples` is low,
@@ -199,7 +223,7 @@ class DBSCAN(ClusterMixin, BaseEstimator):
         as in the neighborhood of the other. This is not a maximum bound
         on the distances of points within a cluster. This is the most
         important DBSCAN parameter to choose appropriately for your data set
-        and distance function.
+        and distance function. Smaller values generally lead to more clusters.

     min_samples : int, default=5
         The number of samples (or total weight) in a neighborhood for a point to
@@ -228,7 +252,10 @@ class DBSCAN(ClusterMixin, BaseEstimator):
     algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
         The algorithm to be used by the NearestNeighbors module
         to compute pointwise distances and find nearest neighbors.
-        See NearestNeighbors module documentation for details.
+        'auto' will attempt to decide the most appropriate algorithm
+        based on the values passed to :meth:`fit` method.
+        See :class:`~sklearn.neighbors.NearestNeighbors` documentation for
+        details.

     leaf_size : int, default=30
         Leaf size passed to BallTree or cKDTree. This can affect the speed
@@ -239,7 +266,7 @@ class DBSCAN(ClusterMixin, BaseEstimator):
     p : float, default=None
         The power of the Minkowski metric to be used to calculate distance
         between points. If None, then ``p=2`` (equivalent to the Euclidean
-        distance).
+        distance). When p=1, this is equivalent to Manhattan distance.

     n_jobs : int, default=None
         The number of parallel jobs to run.
@@ -255,9 +282,10 @@ class DBSCAN(ClusterMixin, BaseEstimator):
     components_ : ndarray of shape (n_core_samples, n_features)
         Copy of each core sample found by training.

-    labels_ : ndarray of shape (n_samples)
+    labels_ : ndarray of shape (n_samples,)
         Cluster labels for each point in the dataset given to fit().
-        Noisy samples are given the label -1.
+        Noisy samples are given the label -1. Non-negative integers
+        indicate cluster membership.

     n_features_in_ : int
         Number of features seen during :term:`fit`.
@@ -448,6 +476,9 @@ def fit(self, X, y=None, sample_weight=None):
     def fit_predict(self, X, y=None, sample_weight=None):
         """Compute clusters from a data or distance matrix and predict labels.

+        This method fits the model and returns the cluster labels in a single step.
+        It is equivalent to calling fit(X).labels_.
+
         Parameters
         ----------
         X : {array-like, sparse matrix} of shape (n_samples, n_features), or \
@@ -469,6 +500,7 @@ def fit_predict(self, X, y=None, sample_weight=None):
         -------
         labels : ndarray of shape (n_samples,)
             Cluster labels. Noisy samples are given the label -1.
+            Non-negative integers indicate cluster membership.
         """
         self.fit(X, sample_weight=sample_weight)
         return self.labels_
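
The parameter semantics described in the docstrings above can be illustrated with a small brute-force sketch: an eps-neighborhood that counts the point itself toward min_samples, the label -1 for noise, and non-negative integers for cluster membership. This is not scikit-learn's implementation, which delegates the neighbor search to NearestNeighbors. The names dbscan_sketch and NOISE, the inclusive `<=` comparison, the Euclidean distance, and the sample points are all assumptions made for illustration.

```python
import math

NOISE = -1  # hypothetical constant; mirrors the documented noise label


def dbscan_sketch(points, eps=0.5, min_samples=5):
    """Toy DBSCAN over a list of coordinate tuples (Euclidean distance).

    Returns (core_indices, labels). Illustrative O(n^2) brute force only.
    """
    n = len(points)
    # Neighborhood of i: every j within eps of i, including i itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_samples points in its neighborhood.
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        # Grow a new cluster outward from an unlabeled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            current = stack.pop()
            if not is_core[current]:
                continue  # border points join a cluster but do not expand it
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1

    core_indices = [i for i in range(n) if is_core[i]]
    return core_indices, labels


# Two tight groups of three points and one isolated point.
points = [
    (0.0, 0.0), (0.0, 0.3), (0.3, 0.0),
    (5.0, 5.0), (5.0, 5.3), (5.3, 5.0),
    (10.0, 10.0),
]
core_indices, labels = dbscan_sketch(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

With min_samples=4 the same points are all labelled -1, because no neighborhood contains four samples. This shows how a higher min_samples demands denser regions before a cluster forms.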
