
Commit 546fd5f

[MRG] Fix LocalOutlierFactor's output for data with duplicated samples (#28773)
1 parent e5ed851 commit 546fd5f

3 files changed: +50 -1 lines changed

doc/whats_new/v1.6.rst

+8 -1
@@ -87,7 +87,7 @@ Changelog
 - |API| Deprecates `copy_X` in :class:`linear_model.TheilSenRegressor` as the parameter
   has no effect. `copy_X` will be removed in 1.8.
   :pr:`29105` by :user:`Adam Li <adam2392>`.
-
+
 :mod:`sklearn.metrics`
 ......................

@@ -103,6 +103,13 @@ Changelog
   estimator without re-fitting it.
   :pr:`29067` by :user:`Guillaume Lemaitre <glemaitre>`.

+:mod:`sklearn.neighbors`
+........................
+
+- |Fix| :class:`neighbors.LocalOutlierFactor` raises a warning in the `fit` method
+  when duplicate values in the training data lead to inaccurate outlier detection.
+  :pr:`28773` by :user:`Henrique Caroço <HenriqueProj>`.
+
 Thanks to everyone who has contributed to the maintenance and improvement of
 the project since version 1.5, including:

sklearn/neighbors/_lof.py

+8
@@ -317,6 +317,14 @@ def fit(self, X, y=None):
                 self.negative_outlier_factor_, 100.0 * self.contamination
             )

+        # Verify if negative_outlier_factor_ values are within acceptable range.
+        # Novelty must also be false to detect outliers
+        if np.min(self.negative_outlier_factor_) < -1e7 and not self.novelty:
+            warnings.warn(
+                "Duplicate values are leading to incorrect results. "
+                "Increase the number of neighbors for more accurate results."
+            )
+
         return self

     def _check_novelty_predict(self):
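
To see why a threshold as extreme as -1e7 can signal duplicated samples, the following is a minimal, brute-force sketch of the LOF computation on 1-D data. It is not the library's implementation: the helper `lof_1d`, the toy data, and the `1e-10` smoothing term are assumptions made for illustration. The sketch returns positive scores (larger means more outlying), so the sign is the opposite of `negative_outlier_factor_`.

```python
def lof_1d(values, k):
    """Brute-force LOF for 1-D data (hypothetical helper, illustration only)."""
    n = len(values)
    neighbors, k_dist = [], []
    for i in range(n):
        # Rank every other sample by distance and keep the k nearest.
        ranked = sorted(
            (abs(values[i] - values[j]), j) for j in range(n) if j != i
        )[:k]
        neighbors.append([j for _, j in ranked])
        k_dist.append(ranked[-1][0])  # distance to the k-th nearest neighbor

    # Local reachability density: inverse of the mean reachability distance.
    # The 1e-10 term is an assumed guard against division by zero.
    lrd = []
    for i in range(n):
        reach = [max(abs(values[i] - values[j]), k_dist[j]) for j in neighbors[i]]
        lrd.append(1.0 / (sum(reach) / k + 1e-10))

    # LOF: mean density of the neighbors divided by the sample's own density.
    return [sum(lrd[j] for j in neighbors[i]) / k / lrd[i] for i in range(n)]


# Ten identical samples plus three distinct ones (made-up data).
data = [0.0] * 10 + [1.0, 2.5, 5.0]
scores_k5 = lof_1d(data, k=5)
scores_k10 = lof_1d(data, k=10)
print(max(scores_k5), max(scores_k10))
```

With `k=5`, each duplicated sample has five neighbors at distance zero, so its density is bounded only by the smoothing term, and any nearby non-duplicate sample gets an enormous ratio. With `k=10`, at least one neighbor of every sample lies at a non-zero distance and the scores stay moderate, which matches the advice in the warning message.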

sklearn/neighbors/tests/test_lof.py

+34
@@ -359,3 +359,37 @@ def test_lof_dtype_equivalence(algorithm, novelty, contamination):
             y_pred_32 = getattr(lof_32, method)(X_32)
             y_pred_64 = getattr(lof_64, method)(X_64)
             assert_allclose(y_pred_32, y_pred_64, atol=0.0002)
+
+
+def test_lof_duplicate_samples():
+    """
+    Check that LocalOutlierFactor raises a warning when duplicate values
+    in the training data cause inaccurate results.
+
+    Non-regression test for:
+    https://github.com/scikit-learn/scikit-learn/issues/27839
+    """
+
+    rng = np.random.default_rng(0)
+
+    x = rng.permutation(
+        np.hstack(
+            [
+                [0.1] * 1000,  # constant values
+                np.linspace(0.1, 0.3, num=3000),
+                rng.random(500) * 100,  # the clear outliers
+            ]
+        )
+    )
+    X = x.reshape(-1, 1)
+
+    error_msg = (
+        "Duplicate values are leading to incorrect results. "
+        "Increase the number of neighbors for more accurate results."
+    )
+
+    lof = neighbors.LocalOutlierFactor(n_neighbors=5, contamination=0.1)
+
+    # Catch the warning
+    with pytest.warns(UserWarning, match=re.escape(error_msg)):
+        lof.fit_predict(X)
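
Outside the test suite, a user could inspect the same condition by hand. The sketch below is an illustration under assumptions, not part of the commit: the toy data and the helper `fit_and_report` are made up, and the warning is only expected on scikit-learn versions that include this change. It records any warnings raised during `fit` and compares the most negative value of `negative_outlier_factor_` for a small and a larger `n_neighbors`.

```python
import warnings

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Made-up data: ten identical samples plus three distinct ones.
X = np.array([0.0] * 10 + [1.0, 2.5, 5.0]).reshape(-1, 1)


def fit_and_report(n_neighbors):
    """Fit LOF; return (most negative factor, recorded warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning raised in fit
        lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    messages = [str(w.message) for w in caught]
    return float(np.min(lof.negative_outlier_factor_)), messages


# Fewer neighbors than duplicates: the factor is expected to blow up.
few_min, few_msgs = fit_and_report(n_neighbors=5)
# As many neighbors as duplicates: every sample has a non-zero k-distance.
many_min, many_msgs = fit_and_report(n_neighbors=10)

print(few_min, few_msgs)
print(many_min, many_msgs)
```

On versions that contain this commit, `few_msgs` should include the "Duplicate values are leading to incorrect results" message while `many_msgs` should not. On older versions both lists may be empty even though `few_min` is still far below -1e7.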
