DOC Rework outlier detection estimators example #25878


Merged
merged 57 commits into scikit-learn:main from ArturoAmorQ:outlier_benchmarks
Sep 11, 2023

Conversation

ArturoAmorQ
Member

Reference Issues/PRs

What does this implement/fix? Explain your changes.

In preparation for addressing this comment, the current state of the Evaluation of outlier detection estimators example can benefit from a "tutorialization" that explains the steps and results.

To that end, this PR:

  • replaces the use of LabelBinarizer with a proper OneHotEncoder where needed (see the sketch after this list);
  • reduces the number of evaluated datasets to keep the message clear and use fewer resources;
  • splits the pre-processing of the different datasets into individually tuned pipelines (the current state only shows the performance of default parameters);
  • adds an ablation section showing the impact of pre-processing on a local outlier factor model.
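
For reference, a minimal sketch of the encoder swap described in the first bullet, assuming categorical columns such as those of the KDD Cup 99 data used in the example:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder is the appropriate tool for categorical *feature* columns;
# LabelBinarizer is meant for target labels and was being misused for this.
preprocessor = ColumnTransformer(
    [("categorical", OneHotEncoder(), ["protocol_type", "service"])],
    remainder="passthrough",
)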

Any other comments?

The ablation section is a nice-to-have but makes the example take longer to run. Hopefully the upcoming speed-ups in neighbors computations will solve this issue.

@ArturoAmorQ
Member Author

This PR is ready for review now.

Member

@betatim betatim left a comment


I looked at the prose and left some small comments on it; I haven't thought about the code.

Overall, the prose reads nicely!

@glemaitre glemaitre self-requested a review May 25, 2023 08:41
@ArturoAmorQ
Member Author

For info, this PR is failing on the CI with the following traceback:

File "/home/circleci/project/examples/miscellaneous/plot_outlier_detection_bench.py", line 218, in <module>
    y_pred[model_name]["ames_housing"] = fit_predict(
  File "/home/circleci/project/examples/miscellaneous/plot_outlier_detection_bench.py", line 84, in fit_predict
    y_pred = model.fit(X).decision_function(X)
  File "/home/circleci/project/sklearn/pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/circleci/project/sklearn/ensemble/_iforest.py", line 291, in fit
    X = self._validate_data(X, accept_sparse=["csc"], dtype=tree_dtype)
  File "/home/circleci/project/sklearn/base.py", line 594, in _validate_data
    out = check_array(X, input_name="X", **check_params)
  File "/home/circleci/project/sklearn/utils/validation.py", line 964, in check_array
    _assert_all_finite(
  File "/home/circleci/project/sklearn/utils/validation.py", line 129, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/home/circleci/project/sklearn/utils/validation.py", line 178, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
IsolationForest does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

The ames_housing dataset does not contain any missing values, and I don't get the traceback when running locally, even after merging main. Could there be an issue with the datasets.fetch_openml function on the CI only?

from time import perf_counter


def fit_predict(X, model_name, expected_anomaly_frac, categorical_columns=()):
Member


Not a big fan of the empty tuple. I would find it more natural with:

def fit_predict(..., categorical_columns=None):
    categorical_columns = [] if categorical_columns is None else categorical_columns

Member


Since expected_anomaly_frac is only used for LOF, I would expect it to have a default of None.
Also, I would use expected_anomaly_fraction since this is only three letters more.
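
A minimal sketch of the revised signature combining both suggestions (the body is elided):

def fit_predict(
    X, model_name, expected_anomaly_fraction=None, categorical_columns=None
):
    categorical_columns = [] if categorical_columns is None else categorical_columns
    ...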

model.fit(X)
y_pred = model[-1].negative_outlier_factor_

if model_name == "IForest":
Member


It should be an elif here.

Comment on lines 75 to 83
ordinal_encoder = OrdinalEncoder(
handle_unknown="use_encoded_value", unknown_value=-1
)
iforest = IsolationForest(random_state=rng)
preprocessor = ColumnTransformer(
[("categorical", ordinal_encoder, categorical_columns)],
remainder="passthrough",
)
model = make_pipeline(preprocessor, iforest)
Member


Suggested change
ordinal_encoder = OrdinalEncoder(
handle_unknown="use_encoded_value", unknown_value=-1
)
iforest = IsolationForest(random_state=rng)
preprocessor = ColumnTransformer(
[("categorical", ordinal_encoder, categorical_columns)],
remainder="passthrough",
)
model = make_pipeline(preprocessor, iforest)
ordinal_encoder = OrdinalEncoder(
handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
[("categorical", ordinal_encoder, categorical_columns)],
remainder="passthrough",
)
model = make_pipeline(preprocessor, IsolationForest(random_state=rng))

ordinal_encoder = OrdinalEncoder(
handle_unknown="use_encoded_value", unknown_value=-1
)
iforest = IsolationForest(random_state=rng)
Member


Using the global rng is not a good practice here. We should pass the random_state as an argument.
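
A minimal sketch of what that could look like, with a hypothetical helper name:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder


def make_iforest_pipeline(categorical_columns, random_state=None):
    # Passing the seed explicitly keeps results independent of unrelated
    # draws from a shared global rng.
    ordinal_encoder = OrdinalEncoder(
        handle_unknown="use_encoded_value", unknown_value=-1
    )
    preprocessor = ColumnTransformer(
        [("categorical", ordinal_encoder, categorical_columns)],
        remainder="passthrough",
    )
    return make_pipeline(preprocessor, IsolationForest(random_state=random_state))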

from time import perf_counter


def fit_predict(X, model_name, expected_anomaly_frac, categorical_columns=()):
Member


Reading a bit more of the example, I would decouple the creation of the estimator from the actual fit_predict:

# %%
# ... text about the desired preprocessing and modelling
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    RobustScaler,
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline


def make_estimator(name, categorical_columns=None, **kwargs):
    """Create an outlier detection estimator based on its name."""
    if name == "LOF":
        outlier_detector = LocalOutlierFactor(**kwargs)
        if categorical_columns is None:
            preprocessor = RobustScaler()
        else:
            preprocessor = ColumnTransformer(
                transformers=[("categorical", OneHotEncoder(), categorical_columns)],
                remainder=RobustScaler(),
            )
    else:  # name == "IForest"
        outlier_detector = IsolationForest(**kwargs)
        if categorical_columns is None:
            preprocessor = None
        else:
            ordinal_encoder = OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            )
            preprocessor = ColumnTransformer(
                transformers=[
                    ("categorical", ordinal_encoder, categorical_columns),
                ],
                remainder="passthrough",
            )

    return make_pipeline(preprocessor, outlier_detector)


# %%
# ... text about the `fit_predict`
# %%
from time import perf_counter


def fit_predict(estimator, X):
    tic = perf_counter()
    if estimator[-1].__class__.__name__ == "LocalOutlierFactor":
        estimator.fit(X)
        y_pred = estimator[-1].negative_outlier_factor_
    else:  # "IsolationForest"
        y_pred = estimator.fit(X).decision_function(X)
    toc = perf_counter()
    print(f"Duration for {estimator[-1].__class__.__name__}: {toc - tic:.2f} s")
    return y_pred
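
For completeness, a hypothetical usage of the two helpers above on synthetic data:

from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, random_state=0)
model = make_estimator("LOF", n_neighbors=20)
y_pred = fit_predict(model, X)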

Member


With the code above, we show how to define n_neighbors each time, but we can also pass random_state when requesting an isolation forest.

Member Author


As LocalOutlierFactor and IsolationForest don't share kwargs, the suggested make_estimator function would only work inside if-statements, which adds more boilerplate code. The idea was to have a single function that handles both cases to avoid this.

Member Author


I could still pass kwargs to LocalOutlierFactor and set the random_state directly in the construction of the IsolationForest, though I'm aware this is not good practice. WDYT?
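
A sketch of that alternative, purely illustrative:

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def make_estimator(name, **lof_kwargs):
    # Forward kwargs to LOF only; the IsolationForest seed is hard-coded,
    # which is the practice acknowledged above as not ideal.
    if name == "LOF":
        return LocalOutlierFactor(**lof_kwargs)
    return IsolationForest(random_state=42)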

@glemaitre
Member

glemaitre commented May 25, 2023

Could there be an issue with the datasets.fetch_openml function on the CI only?

Since doc_min_dependencies does not fail, I would think this is a change of behavior induced by a newer version of pandas.

Edit: I can indeed reproduce the error with pandas 2.0.1 while the code was working with pandas 1.5.1.

@glemaitre
Member

Here is the difference:

pandas 2.0.1:

<class 'pandas.core.frame.DataFrame'>
Index: 2714 entries, 0 to 2929
Data columns (total 79 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   MS_SubClass         2714 non-null   category
 1   MS_Zoning           2714 non-null   category
 2   Lot_Frontage        2714 non-null   int64   
 3   Street              2714 non-null   category
 4   Alley               2714 non-null   category
 5   Lot_Shape           2714 non-null   category
 6   Land_Contour        2714 non-null   category
 7   Utilities           2714 non-null   category
 8   Lot_Config          2714 non-null   category
 9   Land_Slope          2714 non-null   category
 10  Neighborhood        2714 non-null   category
 11  Condition_1         2714 non-null   category
 12  Condition_2         2714 non-null   category
 13  Bldg_Type           2714 non-null   category
 14  House_Style         2714 non-null   category
 15  Overall_Qual        2714 non-null   category
 16  Overall_Cond        2714 non-null   category
 17  Year_Built          2714 non-null   int64   
 18  Year_Remod_Add      2714 non-null   int64   
 19  Roof_Style          2714 non-null   category
 20  Roof_Matl           2714 non-null   category
 21  Exterior_1st        2714 non-null   category
 22  Exterior_2nd        2714 non-null   category
 23  Mas_Vnr_Type        1038 non-null   category
 24  Mas_Vnr_Area        2714 non-null   int64   
 25  Exter_Qual          2714 non-null   category
 26  Exter_Cond          2714 non-null   category
 27  Foundation          2714 non-null   category
 28  Bsmt_Qual           2714 non-null   category
 29  Bsmt_Cond           2714 non-null   category
 30  Bsmt_Exposure       2714 non-null   category
 31  BsmtFin_Type_1      2714 non-null   category
 32  BsmtFin_SF_1        2714 non-null   int64   
 33  BsmtFin_Type_2      2714 non-null   category
 34  BsmtFin_SF_2        2714 non-null   int64   
 35  Bsmt_Unf_SF         2714 non-null   int64   
 36  Total_Bsmt_SF       2714 non-null   int64   
 37  Heating             2714 non-null   category
 38  Heating_QC          2714 non-null   category
 39  Central_Air         2714 non-null   category
 40  Electrical          2714 non-null   category
 41  First_Flr_SF        2714 non-null   int64   
 42  Second_Flr_SF       2714 non-null   int64   
 43  Low_Qual_Fin_SF     2714 non-null   int64   
 44  Gr_Liv_Area         2714 non-null   int64   
 45  Bsmt_Full_Bath      2714 non-null   int64   
 46  Bsmt_Half_Bath      2714 non-null   int64   
 47  Full_Bath           2714 non-null   int64   
 48  Half_Bath           2714 non-null   int64   
 49  Bedroom_AbvGr       2714 non-null   int64   
 50  Kitchen_AbvGr       2714 non-null   int64   
 51  Kitchen_Qual        2714 non-null   category
 52  TotRms_AbvGrd       2714 non-null   int64   
 53  Functional          2714 non-null   category
 54  Fireplaces          2714 non-null   int64   
 55  Fireplace_Qu        2714 non-null   category
 56  Garage_Type         2714 non-null   category
 57  Garage_Finish       2714 non-null   category
 58  Garage_Cars         2714 non-null   int64   
 59  Garage_Area         2714 non-null   int64   
 60  Garage_Qual         2714 non-null   category
 61  Garage_Cond         2714 non-null   category
 62  Paved_Drive         2714 non-null   category
 63  Wood_Deck_SF        2714 non-null   int64   
 64  Open_Porch_SF       2714 non-null   int64   
 65  Enclosed_Porch      2714 non-null   int64   
 66  Three_season_porch  2714 non-null   int64   
 67  Screen_Porch        2714 non-null   int64   
 68  Pool_Area           2714 non-null   int64   
 69  Pool_QC             2714 non-null   category
 70  Fence               2714 non-null   category
 71  Misc_Feature        106 non-null    category
 72  Misc_Val            2714 non-null   int64   
 73  Mo_Sold             2714 non-null   int64   
 74  Year_Sold           2714 non-null   int64   
 75  Sale_Type           2714 non-null   category
 76  Sale_Condition      2714 non-null   category
 77  Longitude           2714 non-null   float64 
 78  Latitude            2714 non-null   float64 
dtypes: category(46), float64(2), int64(31)
memory usage: 856.1 KB

pandas 1.5.3

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2714 entries, 0 to 2929
Data columns (total 79 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   MS_SubClass         2714 non-null   category
 1   MS_Zoning           2714 non-null   category
 2   Lot_Frontage        2714 non-null   int64   
 3   Street              2714 non-null   category
 4   Alley               2714 non-null   category
 5   Lot_Shape           2714 non-null   category
 6   Land_Contour        2714 non-null   category
 7   Utilities           2714 non-null   category
 8   Lot_Config          2714 non-null   category
 9   Land_Slope          2714 non-null   category
 10  Neighborhood        2714 non-null   category
 11  Condition_1         2714 non-null   category
 12  Condition_2         2714 non-null   category
 13  Bldg_Type           2714 non-null   category
 14  House_Style         2714 non-null   category
 15  Overall_Qual        2714 non-null   category
 16  Overall_Cond        2714 non-null   category
 17  Year_Built          2714 non-null   int64   
 18  Year_Remod_Add      2714 non-null   int64   
 19  Roof_Style          2714 non-null   category
 20  Roof_Matl           2714 non-null   category
 21  Exterior_1st        2714 non-null   category
 22  Exterior_2nd        2714 non-null   category
 23  Mas_Vnr_Type        2714 non-null   category
 24  Mas_Vnr_Area        2714 non-null   int64   
 25  Exter_Qual          2714 non-null   category
 26  Exter_Cond          2714 non-null   category
 27  Foundation          2714 non-null   category
 28  Bsmt_Qual           2714 non-null   category
 29  Bsmt_Cond           2714 non-null   category
 30  Bsmt_Exposure       2714 non-null   category
 31  BsmtFin_Type_1      2714 non-null   category
 32  BsmtFin_SF_1        2714 non-null   int64   
 33  BsmtFin_Type_2      2714 non-null   category
 34  BsmtFin_SF_2        2714 non-null   int64   
 35  Bsmt_Unf_SF         2714 non-null   int64   
 36  Total_Bsmt_SF       2714 non-null   int64   
 37  Heating             2714 non-null   category
 38  Heating_QC          2714 non-null   category
 39  Central_Air         2714 non-null   category
 40  Electrical          2714 non-null   category
 41  First_Flr_SF        2714 non-null   int64   
 42  Second_Flr_SF       2714 non-null   int64   
 43  Low_Qual_Fin_SF     2714 non-null   int64   
 44  Gr_Liv_Area         2714 non-null   int64   
 45  Bsmt_Full_Bath      2714 non-null   int64   
 46  Bsmt_Half_Bath      2714 non-null   int64   
 47  Full_Bath           2714 non-null   int64   
 48  Half_Bath           2714 non-null   int64   
 49  Bedroom_AbvGr       2714 non-null   int64   
 50  Kitchen_AbvGr       2714 non-null   int64   
 51  Kitchen_Qual        2714 non-null   category
 52  TotRms_AbvGrd       2714 non-null   int64   
 53  Functional          2714 non-null   category
 54  Fireplaces          2714 non-null   int64   
 55  Fireplace_Qu        2714 non-null   category
 56  Garage_Type         2714 non-null   category
 57  Garage_Finish       2714 non-null   category
 58  Garage_Cars         2714 non-null   int64   
 59  Garage_Area         2714 non-null   int64   
 60  Garage_Qual         2714 non-null   category
 61  Garage_Cond         2714 non-null   category
 62  Paved_Drive         2714 non-null   category
 63  Wood_Deck_SF        2714 non-null   int64   
 64  Open_Porch_SF       2714 non-null   int64   
 65  Enclosed_Porch      2714 non-null   int64   
 66  Three_season_porch  2714 non-null   int64   
 67  Screen_Porch        2714 non-null   int64   
 68  Pool_Area           2714 non-null   int64   
 69  Pool_QC             2714 non-null   category
 70  Fence               2714 non-null   category
 71  Misc_Feature        2714 non-null   category
 72  Misc_Val            2714 non-null   int64   
 73  Mo_Sold             2714 non-null   int64   
 74  Year_Sold           2714 non-null   int64   
 75  Sale_Type           2714 non-null   category
 76  Sale_Condition      2714 non-null   category
 77  Longitude           2714 non-null   float64 
 78  Latitude            2714 non-null   float64 
dtypes: category(46), float64(2), int64(31)
memory usage: 856.2 KB

So the difference is in Misc_Feature (and similarly Mas_Vnr_Type), where values read as the string "None" in pandas 1.5.x are mapped to np.nan in pandas 2.0.1.

@glemaitre
Member

glemaitre commented May 25, 2023

Basically, the culprit is pandas-dev/pandas#50286, where "None" has been added to the default na_values in read_csv. So this is a breaking change, but it might be considered part of the expected breaking changes between 1.X and 2.X.

I think we can tweak it on our side by omitting None from the na_values and announcing a FutureWarning.
It looks like a good case to revive #25488 to be able to handle this case.
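
A minimal sketch of the behavior change, reusing the column name from the dataset above:

import io

import pandas as pd

data = "Misc_Feature\nNone\nShed\n"
df = pd.read_csv(io.StringIO(data))
# Under pandas >= 2.0, the literal string "None" is part of the default
# na_values, so this prints 1; under pandas 1.5.x it prints 0.
print(df["Misc_Feature"].isna().sum())

# Opting out of the new default would look like:
df_kept = pd.read_csv(io.StringIO(data), keep_default_na=False, na_values=[""])
print(df_kept["Misc_Feature"].isna().sum())  # 0 on both versions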

@github-actions

github-actions bot commented Aug 24, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: add9833. Link to the linter CI: here

@glemaitre glemaitre self-requested a review August 29, 2023 16:45
Member

@glemaitre glemaitre left a comment


For the rest, LGTM.

from sklearn.preprocessing import LabelBinarizer
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
Member


The rationale is that each time rng is used, its state is modified, so changing the code ordering could lead to different results. It is therefore safer to just pass integers around.
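
A minimal illustration of the point:

import numpy as np

rng = np.random.RandomState(0)
first_draw = rng.uniform()   # consumes the shared state...
second_draw = rng.uniform()  # ...so this value depends on the call above
assert first_draw != second_draw

# Passing an integer seed instead gives each consumer its own stream,
# insensitive to the order of unrelated calls:
assert np.random.RandomState(0).uniform() == np.random.RandomState(0).uniform()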

lof = LocalOutlierFactor(n_neighbors=int(n_samples * expected_anomaly_fraction))

fig, ax = plt.subplots()
for model_idx, preprocessor in enumerate(preprocessor_list):
Member


I think the green line will be difficult to distinguish from the red one for colorblind readers. Would you mind also giving each line a different linestyle so they can be told apart that way as well?
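
A sketch of cycling linestyles on dummy curves (the example's actual data is not reproduced here):

import matplotlib.pyplot as plt
import numpy as np

linestyles = ["solid", "dashed", "dashdot", "dotted"]
fpr = np.linspace(0, 1, 100)
fig, ax = plt.subplots()
for idx, exponent in enumerate([0.2, 0.4, 0.6, 0.8]):
    # Varying the linestyle in addition to the color keeps the curves
    # distinguishable for colorblind readers.
    ax.plot(fpr, fpr**exponent, linestyle=linestyles[idx], label=f"model {idx}")
ax.legend()
plt.show()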

@glemaitre glemaitre merged commit 3f11069 into scikit-learn:main Sep 11, 2023
@glemaitre
Member

LGTM. Enabling auto-merge. Thanks @ArturoAmorQ

@ArturoAmorQ ArturoAmorQ deleted the outlier_benchmarks branch September 11, 2023 18:06
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023
Co-authored-by: ArturoAmorQ <arturo.amor-quiroz@polytechnique.edu>
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
jeremiedbb pushed a commit that referenced this pull request Sep 20, 2023
Co-authored-by: ArturoAmorQ <arturo.amor-quiroz@polytechnique.edu>
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Co-authored-by: ArturoAmorQ <arturo.amor-quiroz@polytechnique.edu>
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Member

@ogrisel ogrisel left a comment


I realized I had a pending review on this PR. Here were the comments:

This example benchmarks two outlier detection algorithms, namely
:ref:`local_outlier_factor` (LOF) and :ref:`isolation_forest` (IForest), using
ROC curves on classical anomaly detection datasets. The goal is to show that
different algorithms perform well on different datasets.
Member


Suggested change
different algorithms perform well on different datasets.
different algorithms perform well on different datasets and highlight
differences in training speed and sensitivity to hyper-parameters.

Comment on lines 13 to 14
1. The algorithms are trained on the whole dataset which is assumed to
contain outliers.
Member


Suggested change
1. The algorithms are trained on the whole dataset which is assumed to
contain outliers.
1. The algorithms are trained (without labels) on the whole dataset which is assumed
to contain outliers.

# similarly in terms of ROC AUC for the forestcover and cardiotocography
# datasets. The score for IForest is slightly better for the SA dataset and LOF
# performs considerably better on WDBC than IForest.
#
Member


Let's add something about the tradeoff between ROC performance and computational performance:

Recall, however, that Isolation Forest tends to train much faster than LOF on datasets with a large number of samples. Indeed, LOF needs to compute pairwise distances to find nearest neighbors, which has quadratic complexity in the number of samples when the dimensionality is large. This can make the method prohibitive on large datasets.
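
A rough illustration of that speed gap on synthetic data (timings vary by machine; the parameters are arbitrary):

from time import perf_counter

from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X, _ = make_blobs(n_samples=20_000, n_features=10, random_state=0)
for estimator in (IsolationForest(random_state=0), LocalOutlierFactor()):
    tic = perf_counter()
    estimator.fit(X)  # LOF's neighbor search can approach quadratic cost
    print(f"{estimator.__class__.__name__}: {perf_counter() - tic:.2f} s")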
