ENH Add dtype preservation to LocalOutlierFactor
#22665
Conversation
Parametrize test_lof.py to test implementations on 32-bit datasets.
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
LGTM
It could be worth adding global_dtype to the test_predicted_outlier_number test, I think.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
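For reference, a rough sketch of how that suggestion could look (illustrative only: the actual body of test_predicted_outlier_number in test_lof.py may differ, and global_dtype is scikit-learn's test fixture cycling over np.float32 and np.float64):

import numpy as np
import pytest
from sklearn import neighbors
from sklearn.datasets import load_iris

iris = load_iris()

def test_predicted_outlier_number(global_dtype):
    # Cast the data to the dtype provided by the global_dtype fixture.
    X = iris.data.astype(global_dtype, copy=False)
    clf = neighbors.LocalOutlierFactor(contamination=0.1)
    y_pred = clf.fit_predict(X)
    # With contamination=0.1, roughly 10% of the samples should be flagged
    # as outliers (ties in the scores can shift the count slightly).
    assert np.sum(y_pred == -1) == pytest.approx(0.1 * X.shape[0], abs=2)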
Tests currently fail because X can be a list of lists of numbers and thus not have a `dtype` attribute. See this RFC for discussions: scikit-learn#24745
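One illustrative way to handle that (hedged: the helper name below is made up, and this is not necessarily what the PR or the common tests end up doing) is to only derive the expected dtype when the input actually carries one:

import numpy as np

def expected_dtype(X, default=np.float64):
    # Lists of lists carry no `dtype` attribute; fall back to the default
    # dtype that validation would produce for them.
    return getattr(X, "dtype", np.dtype(default))

# usage sketch: assert lof.negative_outlier_factor_.dtype == expected_dtype(X)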
I did not look at the test yet, but I am a bit confused here. If I understand properly, to be able to use global_dtype in the test, we have to make LocalOutlierFactor preserve dtype, since we currently do some casting in parts of the code.
We should be adding the …
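For context, a hedged sketch (not code from this PR) of what preserving dtype typically looks like in scikit-learn estimators: accept both float64 and float32 at validation time instead of always upcasting. The helper name below is made up for illustration.

import numpy as np
from sklearn.utils import check_array

def _validate_preserving_dtype(X):
    # dtype=[np.float64, np.float32] keeps float32 inputs as float32;
    # other inputs (ints, lists of lists) are converted to float64.
    return check_array(X, dtype=[np.float64, np.float32])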
@@ -71,32 +78,32 @@ def test_lof_performance():
     assert roc_auc_score(y_test, y_pred) > 0.99
We should also be testing somewhere that y_pred preserves dtype.
We do it later for decision_function and score_samples.
I understand your last message as indicating that we do not need an assertion for dtype preservation because this is done in test_score_samples. Is this what you meant?
Otherwise, I think that the other tests are fine.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
I've changed the scope of this PR from parametrizing the tests on 32-bit datasets to making LocalOutlierFactor preserve dtype.
I am thinking that we could isolate the check of the attribute's dtype in a separate test; it would be more readable. We could also add a test to check the 32/64-bit consistency.
In some way this is what we try to do in other dtype preservation PRs: https://github.com/scikit-learn/scikit-learn/pull/24714/files#diff-1f2d3a6511fee4ecd34ef77089119895929c26b5f1193493eacdfd4e745cded7R253
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
I would still propose adding a test for the 32/64-bit equivalence of the prediction/score functions. I also find it cleaner to delay the check for the output dtypes to the dedicated dtype-preservation test. I would propose the following patch:
diff --git a/sklearn/neighbors/tests/test_lof.py b/sklearn/neighbors/tests/test_lof.py
index 6ef5e145d4..3ccb3bc3ea 100644
--- a/sklearn/neighbors/tests/test_lof.py
+++ b/sklearn/neighbors/tests/test_lof.py
@@ -149,12 +149,6 @@ def test_score_samples(global_dtype):
     clf2_scores = clf2.score_samples(X_test)
     clf2_decisions = clf2.decision_function(X_test)

-    assert clf1_scores.dtype == global_dtype
-    assert clf1_decisions.dtype == global_dtype
-
-    assert clf2_scores.dtype == global_dtype
-    assert clf2_decisions.dtype == global_dtype
-
     assert_allclose(
         clf1_scores,
         clf1_decisions + clf1.offset_,
@@ -201,7 +195,6 @@ def test_novelty_training_scores(global_dtype):
     scores_2 = clf_2.negative_outlier_factor_

     assert_allclose(scores_1, scores_2)
-    assert scores_1.dtype == scores_2.dtype == global_dtype


 def test_hasattr_prediction():
@@ -278,3 +271,42 @@ def test_lof_input_dtype_preservation(global_dtype, algorithm, contamination, no
     iso.fit(X)

     assert iso.negative_outlier_factor_.dtype == global_dtype
+
+    for method in ("score_samples", "decision_function"):
+        if hasattr(iso, method):
+            y_pred = getattr(iso, method)(X)
+            assert y_pred.dtype == global_dtype
+
+
+@pytest.mark.parametrize("algorithm", ["auto", "ball_tree", "kd_tree", "brute"])
+@pytest.mark.parametrize("novelty", [True, False])
+@pytest.mark.parametrize("contamination", [0.5, "auto"])
+def test_lof_dtype_equivalence(algorithm, novelty, contamination):
+    """Check the equivalence of the results with 32 and 64 bits input."""
+
+    inliers = iris.data[:50]  # setosa iris are really distinct from others
+    outliers = iris.data[-5:]  # virginica will be considered as outliers
+    # lower the precision of the input data to check that we have an equivalence when
+    # making the computation in 32 and 64 bits.
+    X = np.concatenate([inliers, outliers], axis=0).astype(np.float32)
+
+    lof_32 = neighbors.LocalOutlierFactor(
+        algorithm=algorithm, novelty=novelty, contamination=contamination
+    )
+    X_32 = X.astype(np.float32, copy=True)
+    lof_32.fit(X_32)
+
+    lof_64 = neighbors.LocalOutlierFactor(
+        algorithm=algorithm, novelty=novelty, contamination=contamination
+    )
+    X_64 = X.astype(np.float64, copy=True)
+    lof_64.fit(X_64)
+
+    assert_allclose(lof_32.negative_outlier_factor_, lof_64.negative_outlier_factor_)
+
+    for method in ("score_samples", "decision_function", "predict", "fit_predict"):
+        if hasattr(lof_32, method):
+            y_pred_32 = getattr(lof_32, method)(X_32)
+            y_pred_64 = getattr(lof_64, method)(X_64)
+            assert_allclose(y_pred_32, y_pred_64)
+
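Assuming the patch above is applied as-is, the new dtype and 32/64-bit equivalence checks could be run locally with something like the following (the -k filter expression is just an assumption about convenient test selection):

pytest sklearn/neighbors/tests/test_lof.py -k "dtype"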
Authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
“The opticians!”
LGTM
Reference Issues/PRs
Partially addresses #22881
Precedes #22590
What does this implement/fix? Explain your changes.
This makes LocalOutlierFactor preserve the inputs' dtype, in particular np.float32-dtyped inputs.
This also parametrizes tests from test_lof.py to run on np.float32 datasets.
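For example, a minimal sketch of the intended behaviour after this change (assuming the assertions in the proposed tests hold; not code taken from the PR):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.standard_normal(size=(100, 5)).astype(np.float32)

# novelty=True exposes score_samples / decision_function after fitting.
lof = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(X)

# With this change, float32 inputs stay float32 end to end.
assert lof.negative_outlier_factor_.dtype == np.float32
assert lof.score_samples(X).dtype == np.float32
assert lof.decision_function(X).dtype == np.float32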