Skip to content

ENH Add dtype preservation to LocalOutlierFactor #22665

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Nov 23, 2022

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Mar 3, 2022

Reference Issues/PRs

Partially addresses #22881
Precedes #22590

What does this implement/fix? Explain your changes.

This makes LocalOutlierFactor preserves inputs dtype, in particular np.float32 dtyped inputs.

This also parametrizes tests from test_lof.py to run on np.float32 datasets.

@jjerphan jjerphan marked this pull request as ready for review March 3, 2022 15:01
@jjerphan jjerphan changed the title TST Adapt test_lof.py to test implementations on 32bit datasets TST use global_dtype in sklearn/neighbors/tests/test_lof.py Mar 18, 2022
jjerphan and others added 2 commits March 29, 2022 13:49
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jeremiedbb jeremiedbb added the Quick Review For PRs that are quick to review label Mar 29, 2022
@glemaitre glemaitre self-requested a review May 6, 2022 14:48
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be worth to add the global_dtype for the test_predicted_outlier_number test, I think.

jjerphan and others added 2 commits May 16, 2022 15:50
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Tests currently fails because X can be a list of list of numbers
and thus not have a `dtype` attribute.

See this RFC for discussions:
scikit-learn#24745
@glemaitre glemaitre self-requested a review November 3, 2022 16:01
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not look at the test yet but I am a bit confused here. If I understand properly, to be able to use global_dtype in the test, we have to make LocalOutlierFactor to preserve dtype since we make some casting in some parts of the code.

@glemaitre
Copy link
Member

glemaitre commented Nov 3, 2022

We should be adding the _more_tags and define that we preserve np.float32 and np.float64 dtype. It will complete some of the tests where we are testing the type preservation of some of the attributes.

@@ -71,32 +78,32 @@ def test_lof_performance():
assert roc_auc_score(y_test, y_pred) > 0.99
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be testing somewhere that y_pred is preserving dtype also.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do it later for decision_function and score_samples

Copy link
Member Author

@jjerphan jjerphan Nov 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your last message as indicating that we are not needing and assertion for dtype preservation because this is done in test_score_samples. Is this what you meant?

@glemaitre
Copy link
Member

Otherwise, I think that the other tests are fine.

jjerphan and others added 2 commits November 4, 2022 10:46
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@jjerphan jjerphan changed the title TST use global_dtype in sklearn/neighbors/tests/test_lof.py ENH Add dtype preservation to LocalOutlierFactor Nov 4, 2022
@jjerphan
Copy link
Member Author

jjerphan commented Nov 4, 2022

I've changed the scope of this PR from TST to ENH as it also makes LocalOutlierFactor preserve dtypes.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that we could isolate the check of the attribute dtype in a separate test. It would be more readable. We could also add a test to check the consistency 32/64 bits.

In some way this is what we try to do in other preservation dtype PRs: https://github.com/scikit-learn/scikit-learn/pull/24714/files#diff-1f2d3a6511fee4ecd34ef77089119895929c26b5f1193493eacdfd4e745cded7R253

@glemaitre glemaitre self-requested a review November 15, 2022 13:15
@glemaitre
Copy link
Member

I would still propose adding a test for the equivalence 32/64 bits regarding the prediction/score functions.

I also find it cleaner to delay the check for the dtype in the "preservation" test.

I would propose the following patch:

diff --git a/sklearn/neighbors/tests/test_lof.py b/sklearn/neighbors/tests/test_lof.py
index 6ef5e145d4..3ccb3bc3ea 100644
--- a/sklearn/neighbors/tests/test_lof.py
+++ b/sklearn/neighbors/tests/test_lof.py
@@ -149,12 +149,6 @@ def test_score_samples(global_dtype):
     clf2_scores = clf2.score_samples(X_test)
     clf2_decisions = clf2.decision_function(X_test)
 
-    assert clf1_scores.dtype == global_dtype
-    assert clf1_decisions.dtype == global_dtype
-
-    assert clf2_scores.dtype == global_dtype
-    assert clf2_decisions.dtype == global_dtype
-
     assert_allclose(
         clf1_scores,
         clf1_decisions + clf1.offset_,
@@ -201,7 +195,6 @@ def test_novelty_training_scores(global_dtype):
     scores_2 = clf_2.negative_outlier_factor_
 
     assert_allclose(scores_1, scores_2)
-    assert scores_1.dtype == scores_2.dtype == global_dtype
 
 
 def test_hasattr_prediction():
@@ -278,3 +271,42 @@ def test_lof_input_dtype_preservation(global_dtype, algorithm, contamination, no
     iso.fit(X)
 
     assert iso.negative_outlier_factor_.dtype == global_dtype
+
+    for method in ("score_samples", "decision_function"):
+        if hasattr(iso, method):
+            y_pred = getattr(iso, method)(X)
+            assert y_pred.dtype == global_dtype
+
+
+@pytest.mark.parametrize("algorithm", ["auto", "ball_tree", "kd_tree", "brute"])
+@pytest.mark.parametrize("novelty", [True, False])
+@pytest.mark.parametrize("contamination", [0.5, "auto"])
+def test_lof_dtype_equivalence(algorithm, novelty, contamination):
+    """Check the equivalence of the results with 32 and 64 bits input."""
+
+    inliers = iris.data[:50]  # setosa iris are really distinct from others
+    outliers = iris.data[-5:]  # virginica will be considered as outliers
+    # lower the precision of the input data to check that we have an equivalence when
+    # making the computation in 32 and 64 bits.
+    X = np.concatenate([inliers, outliers], axis=0).astype(np.float32)
+
+    lof_32 = neighbors.LocalOutlierFactor(
+        algorithm=algorithm, novelty=novelty, contamination=contamination
+    )
+    X_32 = X.astype(np.float32, copy=True)
+    lof_32.fit(X_32)
+
+    lof_64 = neighbors.LocalOutlierFactor(
+        algorithm=algorithm, novelty=novelty, contamination=contamination
+    )
+    X_64 = X.astype(np.float64, copy=True)
+    lof_64.fit(X_64)
+
+    assert_allclose(lof_32.negative_outlier_factor_, lof_64.negative_outlier_factor_)
+
+    for method in ("score_samples", "decision_function", "predict", "fit_predict"):
+        if hasattr(lof_32, method):
+            y_pred_32 = getattr(lof_32, method)(X_32)
+            y_pred_64 = getattr(lof_64, method)(X_64)
+            assert_allclose(y_pred_32, y_pred_64)
+

@glemaitre glemaitre removed their request for review November 15, 2022 13:51
Authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
« _Les opticiens!_ »
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@glemaitre glemaitre merged commit 3eb00d8 into scikit-learn:main Nov 23, 2022
@jjerphan jjerphan deleted the tst/test_lof-32bit branch November 23, 2022 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:neighbors Quick Review For PRs that are quick to review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants