DOC Rework outlier detection estimators example #25878
This PR is good to be reviewed now.
I looked at the prose and left some small comments on that; I haven't looked at the code yet.
Overall the prose reads nicely!
Co-authored-by: Tim Head <betatim@gmail.com>
For info, this PR is failing on the CI with the following traceback:
File "/home/circleci/project/examples/miscellaneous/plot_outlier_detection_bench.py", line 218, in <module>
y_pred[model_name]["ames_housing"] = fit_predict(
File "/home/circleci/project/examples/miscellaneous/plot_outlier_detection_bench.py", line 84, in fit_predict
y_pred = model.fit(X).decision_function(X)
File "/home/circleci/project/sklearn/pipeline.py", line 420, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/home/circleci/project/sklearn/ensemble/_iforest.py", line 291, in fit
X = self._validate_data(X, accept_sparse=["csc"], dtype=tree_dtype)
File "/home/circleci/project/sklearn/base.py", line 594, in _validate_data
out = check_array(X, input_name="X", **check_params)
File "/home/circleci/project/sklearn/utils/validation.py", line 964, in check_array
_assert_all_finite(
File "/home/circleci/project/sklearn/utils/validation.py", line 129, in _assert_all_finite
_assert_all_finite_element_wise(
File "/home/circleci/project/sklearn/utils/validation.py", line 178, in _assert_all_finite_element_wise
raise ValueError(msg_err)
ValueError: Input X contains NaN.
IsolationForest does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
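The error message itself points to one possible fix. As an illustrative sketch (not code from this PR, and using toy data rather than the Ames housing dataset), an imputer step placed before IsolationForest avoids the NaN validation error:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data with NaNs; the real example uses the Ames housing dataset.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [2.0, np.nan]])

# The imputer replaces NaNs before they reach the forest's input validation.
model = make_pipeline(
    SimpleImputer(strategy="median"), IsolationForest(random_state=0)
)
scores = model.fit(X).decision_function(X)  # one outlier score per sample
```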
from time import perf_counter


def fit_predict(X, model_name, expected_anomaly_frac, categorical_columns=()):
Not a big fan of the empty tuple. I would find it more natural with:
def fit_predict(..., categorical_columns=None):
categorical_columns = [] if categorical_columns is None else categorical_columns
Since expected_anomaly_frac is only used for LOF, I would expect it to have a default of None.
Also, I would use expected_anomaly_fraction since this is only 3 letters more.
model.fit(X)
y_pred = model[-1].negative_outlier_factor_

if model_name == "IForest":
It should be an elif here.
ordinal_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
iforest = IsolationForest(random_state=rng)
preprocessor = ColumnTransformer(
    [("categorical", ordinal_encoder, categorical_columns)],
    remainder="passthrough",
)
model = make_pipeline(preprocessor, iforest)
Suggested change (inlining the intermediate iforest variable):

ordinal_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [("categorical", ordinal_encoder, categorical_columns)],
    remainder="passthrough",
)
model = make_pipeline(preprocessor, IsolationForest(random_state=rng))
ordinal_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
iforest = IsolationForest(random_state=rng)
Using the global rng is not a good practice here. We should pass the random_state as an argument.
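To illustrate the point (a hypothetical helper, not from the PR), the seed can be threaded through as an explicit argument instead of being read from a module-level rng:

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def make_iforest(random_state=None):
    # Explicit argument: no hidden dependency on a global rng object.
    return IsolationForest(random_state=random_state)


X = np.random.RandomState(0).normal(size=(50, 2))
scores = make_iforest(random_state=0).fit(X).decision_function(X)
```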
from time import perf_counter


def fit_predict(X, model_name, expected_anomaly_frac, categorical_columns=()):
Reading a bit more about the example, I would decouple the creation of the estimator from the actual fit_predict:
# %%
# ... text about the desired preprocessing and modelling
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    RobustScaler,
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline


def make_estimator(name, categorical_columns=None, **kwargs):
    """Create an outlier detection estimator based on its name."""
    if name == "LOF":
        outlier_detector = LocalOutlierFactor(**kwargs)
        if categorical_columns is None:
            preprocessor = RobustScaler()
        else:
            preprocessor = ColumnTransformer(
                transformers=[("categorical", OneHotEncoder(), categorical_columns)],
                remainder=RobustScaler(),
            )
    else:  # name == "IForest"
        outlier_detector = IsolationForest(**kwargs)
        if categorical_columns is None:
            preprocessor = None
        else:
            ordinal_encoder = OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            )
            preprocessor = ColumnTransformer(
                transformers=[
                    ("categorical", ordinal_encoder, categorical_columns),
                ],
                remainder="passthrough",
            )
    return make_pipeline(preprocessor, outlier_detector)


# %%
# ... text about the `fit_predict`

# %%
from time import perf_counter


def fit_predict(estimator, X):
    tic = perf_counter()
    if estimator[-1].__class__.__name__ == "LocalOutlierFactor":
        estimator.fit(X)
        y_pred = estimator[-1].negative_outlier_factor_
    else:  # "IsolationForest"
        y_pred = estimator.fit(X).decision_function(X)
    toc = perf_counter()
    model_name = estimator[-1].__class__.__name__
    print(f"Duration for {model_name}: {toc - tic:.2f} s")
    return y_pred
In the code above, it means that we will show how to define n_neighbors each time, but we can also pass random_state when requesting an isolation forest.
As LocalOutlierFactor and IsolationForest don't share kwargs, the suggested make_estimator function would only work inside if-statements, which makes for more boilerplate code. The idea was to have a function that handles both cases to avoid this.
I could still pass kwargs to the LocalOutlierFactor and set the random_state directly in the construction of the IsolationForest, though I'm aware this is not a good practice. WDYT?
Edit: I can indeed reproduce the error with pandas 2.0.1 while the code was working with pandas 1.5.1.
Here is the difference: pandas 2.0.1:
pandas 1.5.3
So the difference is in the
Basically, the culprit is pandas-dev/pandas#50286 where I think we can tweak it on our side by omitting,
For the rest, LGTM.
from sklearn.preprocessing import LabelBinarizer
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
The rationale is that each time rng is used, its state is modified, so changing the code ordering could potentially lead to different results. It is safer to just pass integers around.
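A small illustration of why (toy code, not from the PR): a shared RandomState makes values depend on call order, while constructing one from an integer seed at each use site does not:

```python
import numpy as np

rng = np.random.RandomState(42)
a = rng.rand()  # consumes the shared state
b = rng.rand()  # result depends on the previous call
assert a != b   # reordering these calls would swap the values

# An integer seed gives each consumer its own reproducible stream.
x = np.random.RandomState(42).rand()
y = np.random.RandomState(42).rand()
assert x == y   # order-independent
```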
lof = LocalOutlierFactor(n_neighbors=int(n_samples * expected_anomaly_fraction))

fig, ax = plt.subplots()
for model_idx, preprocessor in enumerate(preprocessor_list):
I think that the green line will be difficult to distinguish from the red one for colorblind readers. Would you mind also using a different linestyle for each line, so they can be distinguished that way as well?
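A sketch of the idea (illustrative data, not the example's): cycle a distinct linestyle per curve so the plot does not rely on color alone:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 1.0, 50)
linestyles = ["-", "--", "-.", ":"]  # one style per curve
fig, ax = plt.subplots()
for idx, ls in enumerate(linestyles):
    ax.plot(x, x ** (idx + 1), linestyle=ls, label=f"curve {idx}")
ax.legend()
```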
LGTM. Enabling auto-merge. Thanks @ArturoAmorQ
Co-authored-by: ArturoAmorQ <arturo.amor-quiroz@polytechnique.edu> Co-authored-by: Tim Head <betatim@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
I realized I had a pending review on this PR. Here were the comments:
This example benchmarks two outlier detection algorithms, namely
:ref:`local_outlier_factor` (LOF) and :ref:`isolation_forest` (IForest), using
ROC curves on classical anomaly detection datasets. The goal is to show that
different algorithms perform well on different datasets.
Suggested change:

different algorithms perform well on different datasets and highlight
differences in training speed and sensitivity to hyper-parameters.
1. The algorithms are trained on the whole dataset which is assumed to
   contain outliers.
Suggested change:

1. The algorithms are trained (without labels) on the whole dataset which is
   assumed to contain outliers.
# similarly in terms of ROC AUC for the forestcover and cardiotocography
# datasets. The score for IForest is slightly better for the SA dataset and LOF
# performs considerably better on WDBC than IForest.
#
Let's add something about the tradeoff between ROC and computational performance:
Recall however that Isolation Forest tends to train much faster than LOF on datasets with a large number of samples. Indeed, LOF needs to compute pairwise distances to find nearest neighbors, which has a quadratic complexity. This can make this method prohibitive on large datasets.
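The tradeoff can be sketched on synthetic data (timings vary by machine, so no numbers are shown here; the dataset size and hyper-parameters are arbitrary):

```python
from time import perf_counter

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).normal(size=(2000, 10))

# Time a bare fit of each outlier detector on the same data.
tic = perf_counter()
IsolationForest(random_state=0).fit(X)
t_iforest = perf_counter() - tic

tic = perf_counter()
LocalOutlierFactor(n_neighbors=20).fit(X)
t_lof = perf_counter() - tic

print(f"IForest fit: {t_iforest:.3f}s, LOF fit: {t_lof:.3f}s")
```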
Reference Issues/PRs
What does this implement/fix? Explain your changes.
In preparation to address this comment, the current state of the Evaluation of outlier detection estimators example can benefit from a "tutorialization" to explain the steps and results.
For such purpose, this PR:
- switches from a LabelBinarizer to a proper OneHotEncoder when needed;

Any other comments?
The ablation section is a nice to have but makes the example take longer to run. Hopefully the soon to come speed-ups in neighbors computations will solve this issue.