Commit 0ef3d40

qinhanmin2014 authored and jnothman committed
FIX Support float min_samples and min_cluster_size in OPTICS (scikit-learn#14496)
1 parent f7576ea commit 0ef3d40

3 files changed: +100 -13 lines changed

doc/whats_new/v0.21.rst

Lines changed: 85 additions & 0 deletions
@@ -26,6 +26,17 @@ random sampling procedures.
 Changelog
 ---------
 
+:mod:`sklearn.cluster`
+.....................
+
+- |Fix| Fixed a bug in :class:`cluster.KMeans` where computation was single
+  threaded when `n_jobs > 1` or `n_jobs = -1`.
+  :pr:`12955` by :user:`Prabakaran Kumaresshan <nixphix>`.
+
+- |Fix| Fixed a bug in :class:`cluster.OPTICS` where users were unable to pass
+  float `min_samples` and `min_cluster_size`. :pr:`14496` by
+  :user:`Fabian Klopfer <someusername1>`
+  and :user:`Hanmin Qin <qinhanmin2014>`.
 
 :mod:`sklearn.inspection`
 .....................
@@ -91,6 +102,80 @@ Changelog
   ``remainder`` transformer.
   :pr:`14237` by `Andreas Schuderer <schuderer>`.
 
+:mod:`sklearn.cluster`
+......................
+
+- |Fix| Fixed a bug in :class:`cluster.KMeans` where computation was single
+  threaded when `n_jobs > 1` or `n_jobs = -1`.
+  :pr:`12955` by :user:`Prabakaran Kumaresshan <nixphix>`.
+
+- |Fix| Fixed a bug in :class:`cluster.OPTICS` where users were unable to pass
+  float `min_samples` and `min_cluster_size`. :pr:`14496` by
+  :user:`Fabian Klopfer <someusername1>`
+  and :user:`Hanmin Qin <qinhanmin2014>`.
+
+:mod:`sklearn.compose`
+......................
+
+- |Fix| Fixed an issue in :class:`compose.ColumnTransformer` where using
+  DataFrames whose column order differs between :func:``fit`` and
+  :func:``transform`` could lead to silently passing incorrect columns to the
+  ``remainder`` transformer.
+  :pr:`14237` by `Andreas Schuderer <schuderer>`.
+
+:mod:`sklearn.datasets`
+.......................
+
+- |Fix| :func:`datasets.fetch_california_housing`,
+  :func:`datasets.fetch_covtype`,
+  :func:`datasets.fetch_kddcup99`, :func:`datasets.fetch_olivetti_faces`,
+  :func:`datasets.fetch_rcv1`, and :func:`datasets.fetch_species_distributions`
+  try to persist the previously cache using the new ``joblib`` if the cahced
+  data was persisted using the deprecated ``sklearn.externals.joblib``. This
+  behavior is set to be deprecated and removed in v0.23.
+  :pr:`14197` by `Adrin Jalali`_.
+
+:mod:`sklearn.impute`
+.....................
+
+- |Fix| Fixed a bug in :class:`impute.SimpleImputer` and
+  :class:`impute.IterativeImputer` so that no errors are thrown when there are
+  missing values in training data. :pr:`13974` by `Frank Hoang <fhoang7>`.
+
+:mod:`sklearn.inspection`
+.........................
+
+- |Fix| Fixed a bug in :func:`inspection.plot_partial_dependence` where
+  ``target`` parameter was not being taken into account for multiclass problems.
+  :pr:`14393` by :user:`Guillem G. Subies <guillemgsubies>`.
+
+:mod:`sklearn.linear_model`
+...........................
+
+- |Fix| Fixed a bug in :class:`linear_model.LogisticRegressionCV` where
+  ``refit=False`` would fail depending on the ``'multiclass'`` and
+  ``'penalty'`` parameters (regression introduced in 0.21). :pr:`14087` by
+  `Nicolas Hug`_.
+
+- |Fix| Compatibility fix for :class:`linear_model.ARDRegression` and
+  Scipy>=1.3.0. Adapts to upstream changes to the default `pinvh` cutoff
+  threshold which otherwise results in poor accuracy in some cases.
+  :pr:`14067` by :user:`Tim Staley <timstaley>`.
+
+:mod:`sklearn.neighbors`
+........................
+
+- |Fix| Fixed a bug in :class:`neighbors.NeighborhoodComponentsAnalysis` where
+  the validation of initial parameters ``n_components``, ``max_iter`` and
+  ``tol`` required too strict types. :pr:`14092` by
+  :user:`Jérémie du Boisberranger <jeremiedbb>`.
+
+:mod:`sklearn.tree`
+...................
+
+- |Fix| Fixed bug in :func:`tree.export_text` when the tree has one feature and
+  a single feature name is passed in. :pr:`14053` by `Thomas Fan`.
+
 .. _changes_0_21_2:
 
 Version 0.21.2

sklearn/cluster/optics_.py

Lines changed: 9 additions & 13 deletions
@@ -44,7 +44,7 @@ class OPTICS(BaseEstimator, ClusterMixin):
 
     Parameters
     ----------
-    min_samples : int > 1 or float between 0 and 1 (default=None)
+    min_samples : int > 1 or float between 0 and 1 (default=5)
         The number of samples in a neighborhood for a point to be considered as
         a core point. Also, up and down steep regions can't have more then
         ``min_samples`` consecutive non-steep points. Expressed as an absolute
@@ -341,7 +341,7 @@ def compute_optics_graph(X, min_samples, max_eps, metric, p, metric_params,
         A feature array, or array of distances between samples if
         metric='precomputed'
 
-    min_samples : int (default=5)
+    min_samples : int > 1 or float between 0 and 1
         The number of samples in a neighborhood for a point to be considered
         as a core point. Expressed as an absolute number or a fraction of the
         number of samples (rounded to be at least 2).
@@ -437,7 +437,7 @@ def compute_optics_graph(X, min_samples, max_eps, metric, p, metric_params,
     n_samples = X.shape[0]
     _validate_size(min_samples, n_samples, 'min_samples')
     if min_samples <= 1:
-        min_samples = max(2, min_samples * n_samples)
+        min_samples = max(2, int(min_samples * n_samples))
 
     # Start all points as 'unprocessed' ##
     reachability_ = np.empty(n_samples)
@@ -582,7 +582,7 @@ def cluster_optics_xi(reachability, predecessor, ordering, min_samples,
     ordering : array, shape (n_samples,)
         OPTICS ordered point indices (`ordering_`)
 
-    min_samples : int > 1 or float between 0 and 1 (default=None)
+    min_samples : int > 1 or float between 0 and 1
         The same as the min_samples given to OPTICS. Up and down steep regions
         can't have more then ``min_samples`` consecutive non-steep points.
         Expressed as an absolute number or a fraction of the number of samples
@@ -619,12 +619,12 @@ def cluster_optics_xi(reachability, predecessor, ordering, min_samples,
     n_samples = len(reachability)
     _validate_size(min_samples, n_samples, 'min_samples')
     if min_samples <= 1:
-        min_samples = max(2, min_samples * n_samples)
+        min_samples = max(2, int(min_samples * n_samples))
     if min_cluster_size is None:
         min_cluster_size = min_samples
     _validate_size(min_cluster_size, n_samples, 'min_cluster_size')
     if min_cluster_size <= 1:
-        min_cluster_size = max(2, min_cluster_size * n_samples)
+        min_cluster_size = max(2, int(min_cluster_size * n_samples))
 
     clusters = _xi_cluster(reachability[ordering], predecessor[ordering],
                            ordering, xi,
@@ -753,16 +753,12 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples,
         reachability plot is defined by the ratio from one point to its
         successor being at most 1-xi.
 
-    min_samples : int > 1 or float between 0 and 1 (default=None)
+    min_samples : int > 1
         The same as the min_samples given to OPTICS. Up and down steep regions
         can't have more then ``min_samples`` consecutive non-steep points.
-        Expressed as an absolute number or a fraction of the number of samples
-        (rounded to be at least 2).
 
-    min_cluster_size : int > 1 or float between 0 and 1
-        Minimum number of samples in an OPTICS cluster, expressed as an
-        absolute number or a fraction of the number of samples (rounded
-        to be at least 2).
+    min_cluster_size : int > 1
+        Minimum number of samples in an OPTICS cluster.
 
     predecessor_correction : bool
         Correct clusters based on the calculated predecessors.
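
The size handling changed in `compute_optics_graph` and `cluster_optics_xi` can be sketched in isolation. The helper name `resolve_size` and the dataset size of 31 are hypothetical, chosen only for illustration; the body mirrors the patched `max(2, int(size * n_samples))` expression and is not the library's own function.

```python
def resolve_size(size, n_samples):
    """Turn an absolute count or a fraction of n_samples into an int count.

    Hypothetical helper that mirrors the patched expression in the diff.
    """
    if size <= 1:
        # A fraction is scaled by the dataset size, truncated to an int,
        # and floored at 2 so a neighborhood never shrinks below two points.
        return max(2, int(size * n_samples))
    # Values above 1 are already absolute counts and pass through unchanged.
    return size


n_samples = 31  # hypothetical dataset size
print(resolve_size(0.1, n_samples))   # 0.1 * 31 = 3.1, truncated to 3
print(resolve_size(0.08, n_samples))  # 0.08 * 31 = 2.48, truncated to 2
print(resolve_size(5, n_samples))     # absolute count, stays 5
```

Without the `int(...)` call the first two results would stay floats (roughly 3.1 and 2.48), which is the value that used to be passed on to code expecting an integer count.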

sklearn/cluster/tests/test_optics.py

Lines changed: 6 additions & 0 deletions
@@ -102,6 +102,12 @@ def test_extract_xi():
                    xi=0.4).fit(X)
     assert_array_equal(clust.labels_, expected_labels)
 
+    # check float min_samples and min_cluster_size
+    clust = OPTICS(min_samples=0.1, min_cluster_size=0.08,
+                   max_eps=20, cluster_method='xi',
+                   xi=0.4).fit(X)
+    assert_array_equal(clust.labels_, expected_labels)
+
     X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
     expected_labels = np.r_[[1] * 5, [3] * 5, [2] * 5, [0] * 5, [2] * 5,
                             -1, -1, [4] * 5]
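
The same check can be tried outside the test suite. This is a minimal sketch, assuming a scikit-learn release that includes this fix; the `make_blobs` data and the `xi=0.05` setting are illustrative assumptions and are not the fixtures used in `test_extract_xi`.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Hypothetical toy data: 50 points in 3 blobs (not the data from the test).
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.5, random_state=0)

# 0.1 * 50 samples resolves to 5, so both fits should use the same settings.
labels_float = OPTICS(min_samples=0.1, cluster_method='xi',
                      xi=0.05).fit(X).labels_
labels_int = OPTICS(min_samples=5, cluster_method='xi',
                    xi=0.05).fit(X).labels_

# True when the fractional and the absolute setting give identical labels.
same = np.array_equal(labels_float, labels_int)
print(same)
```

Comparing the fractional call against its integer equivalent follows the pattern of the added test, which reuses `expected_labels` from the integer-valued call before it.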
