[MRG+1] iforest backward compatibility #11553

ngoix · 2018-07-16T13:38:44Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Add a parameter which switches old/new behaviour. Default value to be changed in 0.22 and parameter to be removed in 0.24

agramfort · 2018-07-16T13:43:30Z

sklearn/ensemble/iforest.py

+        Assuming the behaviour parameter is set to 'old', we always have
+        offset_ = -0.5.
+
+    behaviour: str, default == 'old'


should be written

behaviour: str, optional (default='old')

agramfort · 2018-07-16T13:44:00Z

sklearn/ensemble/iforest.py

@@ -131,6 +141,7 @@ def __init__(self,
                 n_estimators=100,
                 max_samples="auto",
                 contamination="legacy",
+                 behaviour='old',


should be positioned after n_jobs param

albertcthomas

We need to update the whatsnew entry that you added in the PR about the outlier detection API. Common tests are also failing. The common tests are written for the new API. We can maybe do check_estimator(IsolationForest(behavior='new')) in test_iforest.py and skip IsolationForest in the common test until the end of the deprecation cycle? or just do estimator.set_params(behavior='new') if estimator=IsolationForest in the common tests.

albertcthomas · 2018-07-16T15:17:20Z

sklearn/ensemble/iforest.py

+        Assuming the behaviour parameter is set to 'old', we always have
+        offset_ = -0.5.
+
+    behaviour: str, optional (default='old')


Should be in the Parameters section of the docstring not Attributes section

GaelVaroquaux

One cosmetic comment, and then this is good.

GaelVaroquaux · 2018-07-16T15:37:30Z

sklearn/ensemble/iforest.py

+        Default "behaviour" parameter will change to "new" in version 0.22.
+        Passing behaviour="new" makes the decision_function change to match
+        other anomaly detection algorithm API, as explained in the offset_
+        parameter documentation.


I think that you should be slightly more explicit as to what switching this parameter entails.

GaelVaroquaux · 2018-07-16T16:14:10Z

LGTM. Thank you!

albertcthomas · 2018-07-16T16:38:01Z

for the common tests you can add

    if estimator.__class__.__name__ == "IsolationForest":
        # XXX to be removed in 0.22.
        # this is used because the old IsolationForest does not
        # respect the outlier detection API and thus and does not
        # pass the outlier detection common tests.
        estimator.set_params(behaviour='new')

in the set_checking_parameters(estimator) function located here

albertcthomas

The examples should also be modified so that IsolationForest is instantiated with behaviour=new.

jnothman

I've not checked that the examples render similarly to 0.19, which is what we'd aim for...?

jnothman · 2018-07-16T22:16:07Z

sklearn/ensemble/iforest.py

+        Passing behaviour="new" makes the decision_function change to match
+        other anomaly detection algorithm API, as explained in details in the
+        offset_ attribute documentation. Basically, the decision_function
+        becomes dependent from the contamination parameter, in such a way that


"dependent from" is bad English. Do you mean "dependent on"?

jnothman · 2018-07-16T22:17:46Z

sklearn/ensemble/iforest.py

+            return self
+
+        # else, define offset_ wrt contamination parameter, so that the
+        # threshold_ attribute is implicitely 0 and is not needed anymore:


*implicitly

jnothman · 2018-07-17T04:39:22Z

flake8 errors

jnothman · 2018-07-17T04:40:04Z

sklearn/ensemble/tests/test_iforest.py:267:27: E127 continuation line over-indented for visual indent
                          'in version 0.22',
                          ^
sklearn/ensemble/tests/test_iforest.py:276:1: E302 expected 2 blank lines, found 1
def test_behaviour_param():
^

albertcthomas · 2018-07-17T07:34:29Z

I've not checked that the examples render similarly to 0.19, which is what we'd aim for...?

Yes we need to check the examples. (Btw the example given in #11337 is not an example from scikit-learn but from a scipy tutorial).

agramfort · 2018-07-17T09:25:05Z

you have a travis error https://travis-ci.org/scikit-learn/scikit-learn/jobs/404541612

ngoix · 2018-07-17T12:39:42Z

rebased on master, fixing conflict with #11561
I checked the examples, they seem fine.
I run @glemaitre script from #11337, which now returns:

(same as on 0.19)

amueller · 2018-07-17T18:43:51Z

looks like travis will pass in a sec. should we wait for the rest?

GaelVaroquaux · 2018-07-17T18:52:25Z

I'm very worried about the appveyor failure ⁣Sent from my phone. Please forgive typos and briefness.

…

On Jul 17, 2018, 20:44, at 20:44, Andreas Mueller ***@***.***> wrote: looks like travis will pass in a sec. should we wait for the rest? -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: #11553 (comment)

agramfort · 2018-07-17T20:42:16Z

@ngoix you need to rebase.

glemaitre · 2018-07-17T22:25:57Z

sklearn/ensemble/iforest.py

        self.contamination = contamination

+        if behaviour == 'old':


This should be done in fit

glemaitre · 2018-07-17T22:42:35Z

You need to catch the warning using the pytest.mark.filterwarings for when behaviour='old'

glemaitre

Basically, we turn warning into errors so you can refer to the following traceback:

================================ test session starts ================================
platform linux -- Python 3.6.5, pytest-3.6.1, py-1.5.3, pluggy-0.6.0 -- /home/lemaitre/miniconda3/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /home/lemaitre/Documents/code/toolbox/scikit-learn, inifile: setup.cfg
collected 14 items                                                                  

sklearn/ensemble/tests/test_iforest.py::test_iforest PASSED                   [  7%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_sparse FAILED            [ 14%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_error FAILED             [ 21%]
sklearn/ensemble/tests/test_iforest.py::test_recalculate_max_depth FAILED     [ 28%]
sklearn/ensemble/tests/test_iforest.py::test_max_samples_attribute FAILED     [ 35%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_parallel_regression FAILED [ 42%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_performance FAILED       [ 50%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_works PASSED             [ 57%]
sklearn/ensemble/tests/test_iforest.py::test_max_samples_consistency FAILED   [ 64%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_subsampled_features FAILED [ 71%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_average_path_length PASSED [ 78%]
sklearn/ensemble/tests/test_iforest.py::test_score_samples FAILED             [ 85%]
sklearn/ensemble/tests/test_iforest.py::test_deprecation FAILED               [ 92%]
sklearn/ensemble/tests/test_iforest.py::test_behaviour_param FAILED           [100%]

===================================== FAILURES ======================================
________________________________ test_iforest_sparse ________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_sparse():
        """Check IForest for various parameter settings on sparse input."""
        rng = check_random_state(0)
        X_train, X_test, y_train, y_test = train_test_split(boston.data[:50],
                                                            boston.target[:50],
                                                            random_state=rng)
        grid = ParameterGrid({"max_samples": [0.5, 1.0],
                              "bootstrap": [True, False]})
    
        for sparse_format in [csc_matrix, csr_matrix]:
            X_train_sparse = sparse_format(X_train)
            X_test_sparse = sparse_format(X_test)
    
            for params in grid:
                # Trained on sparse format
                sparse_classifier = IsolationForest(
>                   n_estimators=10, random_state=1, **params).fit(X_train_sparse)

sklearn/ensemble/tests/test_iforest.py:85: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=True, contamination='legacy',
        max_features=1.0, max_samples=0.5, n_estimators=10, n_jobs=1,
        random_state=1, verbose=0)
X = <37x13 sparse matrix of type '<class 'numpy.float64'>'
	with 420 stored elements in Compressed Sparse Column format>
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
________________________________ test_iforest_error _________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_error():
        """Test that it gives proper exception on deficient input."""
        X = iris.data
    
        # Test max_samples
        assert_raises(ValueError,
>                     IsolationForest(max_samples=-1).fit, X)

sklearn/ensemble/tests/test_iforest.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/_unittest_backport.py:204: in assertRaises
    return context.handle('assertRaises', args, kwargs)
sklearn/utils/_unittest_backport.py:113: in handle
    callable_obj(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples=-1, n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
____________________________ test_recalculate_max_depth _____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_recalculate_max_depth():
        """Check max_depth recalculation when max_samples is reset to n_samples"""
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:145: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
____________________________ test_max_samples_attribute _____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_max_samples_attribute():
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:153: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________ test_iforest_parallel_regression __________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_parallel_regression():
        """Check parallel regression."""
        rng = check_random_state(0)
    
        X_train, X_test, y_train, y_test = train_test_split(boston.data,
                                                            boston.target,
                                                            random_state=rng)
    
        ensemble = IsolationForest(n_jobs=3,
>                                  random_state=0).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:177: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=3,
        random_state=0, verbose=0)
X = array([[1.44760e-01, 0.00000e+00, 1.00100e+01, ..., 1.78000e+01,
        3.91500e+02, 1.36100e+01],
       [3.46600e-0...+02, 6.53000e+00],
       [1.40507e+01, 0.00000e+00, 1.81000e+01, ..., 2.02000e+01,
        3.50500e+01, 2.12200e+01]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_____________________________ test_iforest_performance ______________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_iforest_performance():
        """Test Isolation Forest performs well"""
    
        # Generate train/test data
        rng = check_random_state(2)
        X = 0.3 * rng.randn(120, 2)
        X_train = np.r_[X + 2, X - 2]
        X_train = X[:100]
    
        # Generate some abnormal novel observations
        X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
        X_test = np.r_[X[100:], X_outliers]
        y_test = np.array([0] * 20 + [1] * 20)
    
        # fit the model
>       clf = IsolationForest(max_samples=100, random_state=rng).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:208: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples=100, n_estimators=100, n_jobs=1,
        random_state=<mtrand.RandomState object at 0x7fe2e0dc3360>,
        verbose=0)
X = array([[-1.25027354e-01, -1.68800482e-02],
       [-6.40858829e-01,  4.92081243e-01],
       [-5.38030676e-01, -2.5252....66428750e-01,  5.41342992e-01],
       [ 5.41229421e-02,  1.65949282e-01],
       [ 3.09908720e-01, -9.87007304e-02]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
___________________________ test_max_samples_consistency ____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_max_samples_consistency():
        # Make sure validated max_samples in iforest and BaseBagging are identical
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:238: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________ test_iforest_subsampled_features __________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_subsampled_features():
        # It tests non-regression for #5732 which failed at predict.
        rng = check_random_state(0)
        X_train, X_test, y_train, y_test = train_test_split(boston.data[:50],
                                                            boston.target[:50],
                                                            random_state=rng)
        clf = IsolationForest(max_features=0.8)
>       clf.fit(X_train, y_train)

sklearn/ensemble/tests/test_iforest.py:251: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=0.8, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.20177e+00, 0.00000e+00, 1.81000e+01, 1.00000e+00, 7.70000e-01,
        6.12700e+00, 8.34000e+01, 2.72270e+00...      6.89700e+00, 5.43000e+01, 6.33610e+00, 6.00000e+00, 3.00000e+02,
        1.66000e+01, 3.91250e+02, 1.13800e+01]])
y = array([22.7, 26.2, 17.8, 10.9, 22.2, 18.2, 24.6, 22. , 13.4,  5. ,  9.5,
       36.4, 21.6, 31.5, 14.9, 33.3, 24. , 16... , 15.1, 14.9,
       28.4, 14.4, 16.4, 19. , 14.5, 24.4, 31.1, 23.7, 19.3, 32.9, 33.1,
        8.8, 19.9, 26.6, 22. ])
sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
________________________________ test_score_samples _________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_score_samples():
        X_train = [[1, 1], [1, 2], [2, 1]]
>       clf1 = IsolationForest(contamination=0.1).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:271: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination=0.1,
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[1, 1], [1, 2], [2, 1]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________________ test_deprecation __________________________________

    def test_deprecation():
        X = [[0.0], [1.0]]
        clf = IsolationForest()
    
        assert_warns_message(FutureWarning,
                             'default contamination parameter 0.1 will change '
                             'in version 0.22 to "auto"',
                             clf.fit, X)
    
        assert_warns_message(FutureWarning,
                             'Default "behaviour" parameter will change to "new" '
                             'in version 0.22',
                             clf.fit, X)
    
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:295: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[0.0], [1.0]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

sklearn/ensemble/iforest.py:199: FutureWarning
_______________________________ test_behaviour_param ________________________________

    def test_behaviour_param():
        X_train = [[1, 1], [1, 2], [2, 1]]
>       clf1 = IsolationForest(behaviour='old').fit(X_train)

sklearn/ensemble/tests/test_iforest.py:304: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[1, 1], [1, 2], [2, 1]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

sklearn/ensemble/iforest.py:199: FutureWarning
======================== 11 failed, 3 passed in 0.96 seconds ========================

ngoix · 2018-07-18T16:34:59Z

PR ready to be merged I think

agramfort · 2018-07-18T19:57:53Z

LGTM

@glemaitre merge if you're happy

albertcthomas

Sorry to be late but I think we should be clearer about the deprecation cycle.

we are going to remove the behaviour parameter at some point (0.24 according to the PR description) right? In that case I think we should mention it somewhere (in whastnew, the docstring or the deprecation warning). I don't see something mentioning this except the description of the PR.
I would also add in whatsnew that the default value of behaviour will be new in 0.22

glemaitre · 2018-07-19T15:18:25Z

we are going to remove the behaviour parameter at some point (0.24 according to the PR description) right? In that case I think we should mention it somewhere (in whastnew, the docstring or the deprecation warning). I don't see something mentioning this except the description of the PR.

We have to be careful there. In 0.22, we will not be able to make the old behaviour since a bunch of thing will be deprecated or change from default. Therefore, behaviour needs to be introduce and deprecated right now because we will not be able to use behavior=old in 0.22.

Does it make sense (or is it make jet-lag speaking)?

albertcthomas · 2018-07-19T21:18:42Z

We have to be careful there. In 0.22, we will not be able to make the old behaviour since a bunch of thing will be deprecated or change from default. Therefore, behaviour needs to be introduce and deprecated right now because we will not be able to use behavior=old in 0.22.

Does it make sense (or is it make jet-lag speaking)?

Yes it does. In this case the docstring of the behaviour attribute and the warning are a bit misleading because they say Default "behaviour" parameter will change to "new" in version 0.22 and don't say that there will no longer be a behaviour parameter in 0.22. Same thing for the whatsnew entry which promotes the introduction of a new parameter behaviour and does not mention that it will no longer be here in 0.22.

Should we add in the docstring, the warning and the whatsnew entry that behaviour parameter will be removed in 0.22?

ngoix · 2018-07-20T06:58:11Z

Yes we definitely should, thanks for pointing out!

ngoix · 2018-07-20T07:18:21Z

Actually I'm not very comfortable in introducing a deprecated parameter. We can't both change the behaviour + remove the parameter in one single version.
I think we should stick to switch default behavior to 'new' in 0.22, and remove it in 0.24.
behaviour='old' will still be usable in 0.22 if the user set contamination to something different than 'auto'.

glemaitre · 2018-07-20T07:25:44Z

We are doing this right now there:
#11585
to keep the wrong behaviour for the smallest amount of time.

Moreover, here I think that this is even more problematic since we will not be able to maintain behaviour=old after 0.22 since that we are supposed to remove some deprecated code which allows for such behaviour.

ngoix · 2018-07-20T07:36:36Z

here this is not a wrong behaviour though, just a different convention...

Moreover, here I think that this is even more problematic since we will not be able to maintain behaviour=old after 0.22 since that we are supposed to remove some deprecated code which allows for such behaviour.

behaviour=old will be ok in 0.22, though not the default argument, and we remove this param in 0.24 so I don't see the problem?

jorisvandenbossche · 2018-07-20T07:39:10Z

I don't think we should deprecate behaviour='new' already now. We need to give users the opportunity to already have the new behaviour with 0.20, but without having a deprecation warning that the keyword they need to specify to have that behaviour is deprecated.
So in my view it is only behaviour='old' that is deprecated, so it is fine to remove that option in 0.22, and from then only behaviour='new' will be accepted (and then deprecated at that point).
But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

glemaitre · 2018-07-20T07:42:07Z

behaviour=old will be ok in 0.22, though not the default argument, and we remove this param in 0.24 so I don't see the problem?

We are supposed to not have threshold_ anymore, isn't it.

ngoix · 2018-07-20T07:55:28Z

We are supposed to not have threshold_ anymore, isn't it.

Right, but we can still allow behaviour='old' in 0.22 (deprecated) without allowing to access this attribute anymore.

I'm also ok with @jorisvandenbossche compromise solution, ie remove behavior='old' in 0.22 and remove the param in 0.24

albertcthomas · 2018-07-20T08:14:37Z

So in my view it is only behaviour='old' that is deprecated, so it is fine to remove that option in 0.22, and from then only behaviour='new' will be accepted (and then deprecated at that point).
But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

I'm +1 for @jorisvandenbossche 's solution, even if it takes 2 deprecation cycles. We introduce behaviour now and deprecate the default old value. So in 0.22, only new will be available (and we don't need to maintain the old value anymore). We then deprecate behaviour to remove it in 0.24.

albertcthomas · 2018-07-20T08:17:00Z

But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

And +1 for this also if we choose @jorisvandenbossche 's solution

fix + cosmit cosmit common test update examples cosmit + fix travis attribute error instead of value + mask warning in test whatsnew

glemaitre

I think that we are almost good then. Just a weird thing with the example and I would slightly modify the behaviour docstring.

glemaitre · 2018-07-20T09:00:47Z

examples/covariance/plot_outlier_detection.py

@@ -0,0 +1,129 @@
+"""


Is it a new file?

This example was removed in PR #10700 (merged yesterday) in favor of plot_anomaly_comparison.py. We don’t need it anymore.

glemaitre · 2018-07-20T09:05:08Z

sklearn/ensemble/iforest.py

+        offset_ attribute documentation. Basically, the decision_function
+        becomes dependent on the contamination parameter, in such a way that
+        0 becomes its natural threshold to detect outliers.
+


Maybe we can use the sphinx syntax to highlight those remarks. I would also remove the optional since that people should probably use 'new' and state that 'new' will be the default in the future.

behaviour : str, default='old' Behaviour of the ``decision_function`` which can be either 'old' or 'new'. Passing ``behaviour='new'`` makes the ``decision_function`` change to match other anomaly detection algorithm API which will be the default behaviour in the future. As explained in details in the ``offset_`` attribute documentation, the ``decision_function`` becomes dependent on the contamination parameter, in such a way that 0 becomes its natural threshold to detect outliers. .. versionadded:: 0.20 ``behaviour`` is added in 0.20 for back-compatibility purpose. .. deprecated:: 0.20 ``behaviour='old'`` is deprecated in 0.20 and will not be possible in 0.22. .. deprecated:: 0.22 ``behaviour`` parameter will be deprecated in 0.22 and removed in 0.24.

You have the explanation duplicated now

glemaitre · 2018-07-23T11:05:08Z

Thanks @ngoix !!! Merging now.

ngoix changed the title ~~[WIP] iforest backward compati~~ [WIP] iforest backward compatibility Jul 16, 2018

agramfort reviewed Jul 16, 2018

View reviewed changes

ngoix changed the title ~~[WIP] iforest backward compatibility~~ [MRG] iforest backward compatibility Jul 16, 2018

albertcthomas reviewed Jul 16, 2018

View reviewed changes

GaelVaroquaux approved these changes Jul 16, 2018

View reviewed changes

GaelVaroquaux changed the title ~~[MRG] iforest backward compatibility~~ [MRG+1] iforest backward compatibility Jul 16, 2018

albertcthomas reviewed Jul 16, 2018

View reviewed changes

amueller mentioned this pull request Jul 16, 2018

[MRG] Examples deprecations #11561

Merged

ngoix force-pushed the iforest_fix branch from f0f141e to eed8f68 Compare July 16, 2018 21:11

jnothman reviewed Jul 16, 2018

View reviewed changes

ngoix force-pushed the iforest_fix branch from 518cbba to dce33fe Compare July 17, 2018 12:34

agramfort approved these changes Jul 17, 2018

View reviewed changes

ngoix force-pushed the iforest_fix branch from 084b877 to 9ac95bd Compare July 17, 2018 22:22

glemaitre reviewed Jul 17, 2018

View reviewed changes

sklearn/ensemble/iforest.py Outdated

self.contamination = contamination

if behaviour == 'old':

Copy link

Member

glemaitre Jul 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done in fit

ngoix force-pushed the iforest_fix branch from 9ac95bd to f11f657 Compare July 17, 2018 22:28

glemaitre requested changes Jul 17, 2018

View reviewed changes

albertcthomas mentioned this pull request Jul 18, 2018

[MRG+1] Novelty detection for LOF #10700

Merged

qinhanmin2014 added this to the 0.20 milestone Jul 18, 2018

albertcthomas reviewed Jul 19, 2018

View reviewed changes

ngoix added 6 commits July 20, 2018 10:45

iforest behaviour param

0cad3f3

fix + cosmit cosmit common test update examples cosmit + fix travis attribute error instead of value + mask warning in test whatsnew

warning in fit not init

0413c36

catch behaviour warnings

3cc43d5

catch contam warning

b7eac94

fix docstring travis error

ca6848c

depreciate behaviour value instead of default

bd2619a

ngoix force-pushed the iforest_fix branch from d06cb61 to bd2619a Compare July 20, 2018 08:46

glemaitre requested changes Jul 20, 2018

View reviewed changes

ngoix added 2 commits July 23, 2018 11:35

rm ex + change behaviour doc

7afe223

rm dupl doc

05c905d

glemaitre approved these changes Jul 23, 2018

View reviewed changes

glemaitre merged commit 53622e8 into scikit-learn:master Jul 23, 2018

[MRG+1] iforest backward compatibility #11553

[MRG+1] iforest backward compatibility #11553

Conversation

ngoix commented Jul 16, 2018

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

albertcthomas left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GaelVaroquaux left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GaelVaroquaux commented Jul 16, 2018

albertcthomas commented Jul 16, 2018

albertcthomas left a comment

Choose a reason for hiding this comment

jnothman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jnothman commented Jul 17, 2018

jnothman commented Jul 17, 2018

albertcthomas commented Jul 17, 2018 • edited Loading

agramfort commented Jul 17, 2018

ngoix commented Jul 17, 2018

amueller commented Jul 17, 2018

GaelVaroquaux commented Jul 17, 2018 via email

agramfort commented Jul 17, 2018

Choose a reason for hiding this comment

glemaitre commented Jul 17, 2018

glemaitre left a comment

Choose a reason for hiding this comment

ngoix commented Jul 18, 2018

agramfort commented Jul 18, 2018

albertcthomas left a comment

Choose a reason for hiding this comment

glemaitre commented Jul 19, 2018 • edited Loading

albertcthomas commented Jul 19, 2018

ngoix commented Jul 20, 2018

ngoix commented Jul 20, 2018

glemaitre commented Jul 20, 2018

ngoix commented Jul 20, 2018

jorisvandenbossche commented Jul 20, 2018

glemaitre commented Jul 20, 2018

ngoix commented Jul 20, 2018

albertcthomas commented Jul 20, 2018 • edited Loading

albertcthomas commented Jul 20, 2018

glemaitre left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

glemaitre commented Jul 23, 2018

albertcthomas left a comment •

edited

Loading

albertcthomas commented Jul 17, 2018 •

edited

Loading

glemaitre commented Jul 19, 2018 •

edited

Loading

albertcthomas commented Jul 20, 2018 •

edited

Loading