ENH Add dtype preservation to `LocalOutlierFactor` #22665

jjerphan · 2022-03-03T13:19:02Z

Reference Issues/PRs

Partially addresses #22881
Precedes #22590

What does this implement/fix? Explain your changes.

This makes LocalOutlierFactor preserve the dtype of its inputs, in particular for np.float32 inputs.

This also parametrizes tests from test_lof.py to run on np.float32 datasets.

sklearn/neighbors/tests/test_lof.py

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

jeremiedbb

LGTM

glemaitre

It could be worth adding the global_dtype fixture to the test_predicted_outlier_number test, I think.

sklearn/neighbors/tests/test_lof.py

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

sklearn/neighbors/tests/test_lof.py

Tests currently fail because X can be a list of lists of numbers and thus have no `dtype` attribute. See this RFC for discussion: scikit-learn#24745
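
To make the failure mode concrete, here is a minimal sketch of a guard that only casts when the input is a float32 array. `cast_like_input` is a hypothetical helper written for illustration, not the code in `_lof.py`:

```python
import numpy as np


def cast_like_input(values, X):
    """Hypothetical helper: cast `values` to float32 only if `X` is float32."""
    values = np.asarray(values)
    # A list of lists has no `dtype` attribute, so fall back to None.
    if getattr(X, "dtype", None) == np.float32:
        return values.astype(np.float32, copy=False)
    return values


distances = np.array([0.5, 1.5, 2.5])  # float64 by default
X_list = [[0.0, 1.0], [2.0, 3.0]]
X_f32 = np.asarray(X_list, dtype=np.float32)

print(hasattr(X_list, "dtype"))
print(cast_like_input(distances, X_list).dtype)
print(cast_like_input(distances, X_f32).dtype)
```

Accessing `X.dtype` directly would raise an `AttributeError` for `X_list`, which is the kind of failure described above.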

sklearn/neighbors/_lof.py

glemaitre

I did not look at the tests yet, but I am a bit confused here. If I understand properly, to be able to use global_dtype in the tests, we have to make LocalOutlierFactor preserve the dtype, since we do some casting in some parts of the code.

sklearn/neighbors/_lof.py

glemaitre · 2022-11-03T18:41:17Z

We should add `_more_tags` and declare that we preserve the np.float32 and np.float64 dtypes. It will complement the tests in which we check the dtype preservation of some of the attributes.
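
A minimal sketch of such a tag declaration, using the `_more_tags` convention in place at the time of this PR. `TagSketch` is a hypothetical stand-in class, not `LocalOutlierFactor` itself, and it deliberately does not inherit from any scikit-learn base class:

```python
import numpy as np


class TagSketch:
    """Hypothetical stand-in for an estimator declaring the dtypes it preserves."""

    def _more_tags(self):
        # Declare that float32 and float64 inputs keep their dtype.
        return {"preserves_dtype": [np.float64, np.float32]}


tags = TagSketch()._more_tags()
print(tags["preserves_dtype"])
```

As the comment above suggests, the tag is what lets shared tests know which dtypes to check.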

glemaitre · 2022-11-03T18:43:12Z

sklearn/neighbors/tests/test_lof.py

@@ -71,32 +78,32 @@ def test_lof_performance():
    assert roc_auc_score(y_test, y_pred) > 0.99


We should also test somewhere that y_pred preserves the dtype.

We do it later for decision_function and score_samples.

I understand your last message as saying that we do not need an assertion for dtype preservation here because this is already done in test_score_samples. Is this what you meant?

glemaitre · 2022-11-03T18:46:43Z

Otherwise, I think that the other tests are fine.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

jjerphan · 2022-11-04T09:58:18Z

I've changed the scope of this PR from TST to ENH as it also makes LocalOutlierFactor preserve dtypes.

glemaitre

I am thinking that we could isolate the check of the attribute dtype in a separate test. It would be more readable. We could also add a test to check the 32/64-bit consistency.

In a way, this is what we try to do in other dtype preservation PRs: https://github.com/scikit-learn/scikit-learn/pull/24714/files#diff-1f2d3a6511fee4ecd34ef77089119895929c26b5f1193493eacdfd4e745cded7R253

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre · 2022-11-15T13:47:07Z

I would still propose adding a test for the 32/64-bit equivalence of the prediction/score functions.

I also find it cleaner to defer the dtype checks to the "preservation" test.

I would propose the following patch:

diff --git a/sklearn/neighbors/tests/test_lof.py b/sklearn/neighbors/tests/test_lof.py
index 6ef5e145d4..3ccb3bc3ea 100644
--- a/sklearn/neighbors/tests/test_lof.py
+++ b/sklearn/neighbors/tests/test_lof.py
@@ -149,12 +149,6 @@ def test_score_samples(global_dtype):
     clf2_scores = clf2.score_samples(X_test)
     clf2_decisions = clf2.decision_function(X_test)
 
-    assert clf1_scores.dtype == global_dtype
-    assert clf1_decisions.dtype == global_dtype
-
-    assert clf2_scores.dtype == global_dtype
-    assert clf2_decisions.dtype == global_dtype
-
     assert_allclose(
         clf1_scores,
         clf1_decisions + clf1.offset_,
@@ -201,7 +195,6 @@ def test_novelty_training_scores(global_dtype):
     scores_2 = clf_2.negative_outlier_factor_
 
     assert_allclose(scores_1, scores_2)
-    assert scores_1.dtype == scores_2.dtype == global_dtype
 
 
 def test_hasattr_prediction():
@@ -278,3 +271,42 @@ def test_lof_input_dtype_preservation(global_dtype, algorithm, contamination, no
     iso.fit(X)
 
     assert iso.negative_outlier_factor_.dtype == global_dtype
+
+    for method in ("score_samples", "decision_function"):
+        if hasattr(iso, method):
+            y_pred = getattr(iso, method)(X)
+            assert y_pred.dtype == global_dtype
+
+
+@pytest.mark.parametrize("algorithm", ["auto", "ball_tree", "kd_tree", "brute"])
+@pytest.mark.parametrize("novelty", [True, False])
+@pytest.mark.parametrize("contamination", [0.5, "auto"])
+def test_lof_dtype_equivalence(algorithm, novelty, contamination):
+    """Check the equivalence of the results with 32 and 64 bits input."""
+
+    inliers = iris.data[:50]  # setosa iris are really distinct from others
+    outliers = iris.data[-5:]  # virginica will be considered as outliers
+    # lower the precision of the input data to check that we have an equivalence when
+    # making the computation in 32 and 64 bits.
+    X = np.concatenate([inliers, outliers], axis=0).astype(np.float32)
+
+    lof_32 = neighbors.LocalOutlierFactor(
+        algorithm=algorithm, novelty=novelty, contamination=contamination
+    )
+    X_32 = X.astype(np.float32, copy=True)
+    lof_32.fit(X_32)
+
+    lof_64 = neighbors.LocalOutlierFactor(
+        algorithm=algorithm, novelty=novelty, contamination=contamination
+    )
+    X_64 = X.astype(np.float64, copy=True)
+    lof_64.fit(X_64)
+
+    assert_allclose(lof_32.negative_outlier_factor_, lof_64.negative_outlier_factor_)
+
+    for method in ("score_samples", "decision_function", "predict", "fit_predict"):
+        if hasattr(lof_32, method):
+            y_pred_32 = getattr(lof_32, method)(X_32)
+            y_pred_64 = getattr(lof_64, method)(X_64)
+            assert_allclose(y_pred_32, y_pred_64)
+
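
When comparing 32-bit and 64-bit results as in this patch, the tolerance usually has to reflect float32 precision. The following NumPy-only sketch illustrates the idea on k-th nearest neighbor distances; `kth_neighbor_distance`, the random data, and `rtol=1e-4` are illustrative assumptions, not values from the PR:

```python
import numpy as np


def kth_neighbor_distance(X, k):
    """Brute-force distance to the k-th nearest neighbor, in X's own dtype."""
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    # Column 0 of the sorted distances is each point's distance to itself.
    return np.sort(dist, axis=1)[:, k]


rng = np.random.RandomState(0)
X_32 = rng.normal(size=(40, 4)).astype(np.float32)
X_64 = X_32.astype(np.float64)  # same values, higher precision

d_32 = kth_neighbor_distance(X_32, k=5)
d_64 = kth_neighbor_distance(X_64, k=5)

# The dtype follows the input; the values agree up to float32 precision.
np.testing.assert_allclose(d_32, d_64, rtol=1e-4)
print(d_32.dtype, d_64.dtype)
```

Both arrays start from identical values, so any remaining difference comes only from the precision of the arithmetic.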

Authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

« _Les opticiens!_ »

glemaitre

LGTM

TST Adapt test_lof.py to test implementations on 32bit datasets

835390b

github-actions bot added the module:neighbors label Mar 3, 2022

jjerphan added the No Changelog Needed label Mar 3, 2022

jjerphan marked this pull request as ready for review March 3, 2022 15:01

jjerphan added 2 commits March 18, 2022 10:59

Merge branch 'main' into tst/test_lof-32bit

5b583de

TST Use global_dtype

147372f

jjerphan changed the title ~~TST Adapt test_lof.py to test implementations on 32bit datasets~~ TST use global_dtype in sklearn/neighbors/tests/test_lof.py Mar 18, 2022

jjerphan mentioned this pull request Mar 18, 2022

Improve tests to make them run on variously typed data using the global_dtype fixture #22881

Open

jeremiedbb reviewed Mar 18, 2022

TST Do not copy on when casting

ee61e23

jjerphan added the Waiting for Reviewer label Mar 24, 2022

jeremiedbb reviewed Mar 25, 2022

jjerphan and others added 2 commits March 29, 2022 13:49

Address review comments

de2954f

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

fixup! Address review comments

78214c1

jeremiedbb approved these changes Mar 29, 2022

jeremiedbb added the Quick Review For PRs that are quick to review label Mar 29, 2022

glemaitre removed the Waiting for Reviewer label May 6, 2022

glemaitre self-requested a review May 6, 2022 14:48

glemaitre reviewed May 6, 2022

jjerphan and others added 2 commits May 16, 2022 15:50

Merge branch 'main' into tst/test_lof-32bit

d575212

Review comments

ff94415

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

ogrisel reviewed May 16, 2022

jjerphan added 3 commits October 24, 2022 16:50

Merge branch 'main' into tst/test_lof-32bit

5981f8f

Preserve input dtype for LOF

43a61c4

Tests currently fail because X can be a list of lists of numbers and thus have no `dtype` attribute. See this RFC for discussion: scikit-learn#24745

Correctly preserves dtypes and assert this preservation

eacde86

jjerphan commented Nov 3, 2022

Fix typo

0d9bff9

glemaitre self-requested a review November 3, 2022 16:01

glemaitre reviewed Nov 3, 2022

glemaitre reviewed Nov 3, 2022

jjerphan and others added 2 commits November 4, 2022 10:46

Add a guard to only cast np.float32 arrays

d49dcdb

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Add whats_new entry for 1.2

41f307a

jjerphan changed the title ~~TST use global_dtype in sklearn/neighbors/tests/test_lof.py~~ ENH Add dtype preservation to LocalOutlierFactor Nov 4, 2022

jjerphan removed the No Changelog Needed label Nov 4, 2022

glemaitre reviewed Nov 4, 2022

jjerphan and others added 2 commits November 7, 2022 11:27

TST Add dedicated test for dtype preservation

15ad8e6

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

fixup! TST Add dedicated test for dtype preservation

a23078e

glemaitre self-requested a review November 15, 2022 13:15

glemaitre removed their request for review November 15, 2022 13:51

jjerphan added 3 commits November 16, 2022 09:46

TST Add dedicated test for dtype equivalence

86b27ca

Authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

TST Use proper atol

9e02e8b

« _Les opticiens!_ »

Merge branch 'main' into tst/test_lof-32bit

86a9029

glemaitre approved these changes Nov 23, 2022

glemaitre merged commit 3eb00d8 into scikit-learn:main Nov 23, 2022

jjerphan deleted the tst/test_lof-32bit branch November 23, 2022 18:53
