Skip to content

[MRG+1] iforest backward compatibility #11553

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 23, 2018
Merged

Conversation

ngoix
Copy link
Contributor

@ngoix ngoix commented Jul 16, 2018

Reference Issues/PRs

Fixes #11337

What does this implement/fix? Explain your changes.

Add a parameter which switches old/new behaviour. Default value to be changed in 0.22 and parameter to be removed in 0.24

@ngoix ngoix changed the title [WIP] iforest backward compati [WIP] iforest backward compatibility Jul 16, 2018
Assuming the behaviour parameter is set to 'old', we always have
offset_ = -0.5.

behaviour: str, default == 'old'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be written

behaviour: str, optional (default='old')

@@ -131,6 +141,7 @@ def __init__(self,
n_estimators=100,
max_samples="auto",
contamination="legacy",
behaviour='old',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be positioned after n_jobs param

@ngoix ngoix changed the title [WIP] iforest backward compatibility [MRG] iforest backward compatibility Jul 16, 2018
Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update the whatsnew entry that you added in the PR about the outlier detection API. Common tests are also failing. The common tests are written for the new API. We can maybe do check_estimator(IsolationForest(behavior='new')) in test_iforest.py and skip IsolationForest in the common test until the end of the deprecation cycle? or just do estimator.set_params(behavior='new') if estimator=IsolationForest in the common tests.

Assuming the behaviour parameter is set to 'old', we always have
offset_ = -0.5.

behaviour: str, optional (default='old')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be in the Parameters section of the docstring not Attributes section

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One cosmetic comment, and then this is good.

Default "behaviour" parameter will change to "new" in version 0.22.
Passing behaviour="new" makes the decision_function change to match
other anomaly detection algorithm API, as explained in the offset_
parameter documentation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you should be slightly more explicit as to what switching this parameter entails.

@GaelVaroquaux GaelVaroquaux changed the title [MRG] iforest backward compatibility [MRG+1] iforest backward compatibility Jul 16, 2018
@GaelVaroquaux
Copy link
Member

LGTM. Thank you!

@albertcthomas
Copy link
Contributor

for the common tests you can add

    if estimator.__class__.__name__ == "IsolationForest":
        # XXX to be removed in 0.22.
        # this is used because the old IsolationForest does not
        # respect the outlier detection API and thus and does not
        # pass the outlier detection common tests.
        estimator.set_params(behaviour='new')

in the set_checking_parameters(estimator) function located here

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples should also be modified so that IsolationForest is instantiated with behaviour=new.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not checked that the examples render similarly to 0.19, which is what we'd aim for...?

Passing behaviour="new" makes the decision_function change to match
other anomaly detection algorithm API, as explained in details in the
offset_ attribute documentation. Basically, the decision_function
becomes dependent from the contamination parameter, in such a way that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"dependent from" is bad English. Do you mean "dependent on"?

return self

# else, define offset_ wrt contamination parameter, so that the
# threshold_ attribute is implicitely 0 and is not needed anymore:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*implicitly

@jnothman
Copy link
Member

flake8 errors

@jnothman
Copy link
Member

sklearn/ensemble/tests/test_iforest.py:267:27: E127 continuation line over-indented for visual indent
                          'in version 0.22',
                          ^
sklearn/ensemble/tests/test_iforest.py:276:1: E302 expected 2 blank lines, found 1
def test_behaviour_param():
^

@albertcthomas
Copy link
Contributor

albertcthomas commented Jul 17, 2018

I've not checked that the examples render similarly to 0.19, which is what we'd aim for...?

Yes we need to check the examples. (Btw the example given in #11337 is not an example from scikit-learn but from a scipy tutorial).

@agramfort
Copy link
Member

@ngoix
Copy link
Contributor Author

ngoix commented Jul 17, 2018

rebased on master, fixing conflict with #11561
I checked the examples, they seem fine.
I run @glemaitre script from #11337, which now returns:

figure_1

(same as on 0.19)

@amueller
Copy link
Member

looks like travis will pass in a sec. should we wait for the rest?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jul 17, 2018 via email

@agramfort
Copy link
Member

@ngoix you need to rebase.

self.contamination = contamination

if behaviour == 'old':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done in fit

@glemaitre
Copy link
Member

You need to catch the warning using the pytest.mark.filterwarings for when behaviour='old'

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, we turn warning into errors so you can refer to the following traceback:

================================ test session starts ================================
platform linux -- Python 3.6.5, pytest-3.6.1, py-1.5.3, pluggy-0.6.0 -- /home/lemaitre/miniconda3/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /home/lemaitre/Documents/code/toolbox/scikit-learn, inifile: setup.cfg
collected 14 items                                                                  

sklearn/ensemble/tests/test_iforest.py::test_iforest PASSED                   [  7%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_sparse FAILED            [ 14%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_error FAILED             [ 21%]
sklearn/ensemble/tests/test_iforest.py::test_recalculate_max_depth FAILED     [ 28%]
sklearn/ensemble/tests/test_iforest.py::test_max_samples_attribute FAILED     [ 35%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_parallel_regression FAILED [ 42%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_performance FAILED       [ 50%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_works PASSED             [ 57%]
sklearn/ensemble/tests/test_iforest.py::test_max_samples_consistency FAILED   [ 64%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_subsampled_features FAILED [ 71%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_average_path_length PASSED [ 78%]
sklearn/ensemble/tests/test_iforest.py::test_score_samples FAILED             [ 85%]
sklearn/ensemble/tests/test_iforest.py::test_deprecation FAILED               [ 92%]
sklearn/ensemble/tests/test_iforest.py::test_behaviour_param FAILED           [100%]

===================================== FAILURES ======================================
________________________________ test_iforest_sparse ________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_sparse():
        """Check IForest for various parameter settings on sparse input."""
        rng = check_random_state(0)
        X_train, X_test, y_train, y_test = train_test_split(boston.data[:50],
                                                            boston.target[:50],
                                                            random_state=rng)
        grid = ParameterGrid({"max_samples": [0.5, 1.0],
                              "bootstrap": [True, False]})
    
        for sparse_format in [csc_matrix, csr_matrix]:
            X_train_sparse = sparse_format(X_train)
            X_test_sparse = sparse_format(X_test)
    
            for params in grid:
                # Trained on sparse format
                sparse_classifier = IsolationForest(
>                   n_estimators=10, random_state=1, **params).fit(X_train_sparse)

sklearn/ensemble/tests/test_iforest.py:85: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=True, contamination='legacy',
        max_features=1.0, max_samples=0.5, n_estimators=10, n_jobs=1,
        random_state=1, verbose=0)
X = <37x13 sparse matrix of type '<class 'numpy.float64'>'
	with 420 stored elements in Compressed Sparse Column format>
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
________________________________ test_iforest_error _________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_error():
        """Test that it gives proper exception on deficient input."""
        X = iris.data
    
        # Test max_samples
        assert_raises(ValueError,
>                     IsolationForest(max_samples=-1).fit, X)

sklearn/ensemble/tests/test_iforest.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/_unittest_backport.py:204: in assertRaises
    return context.handle('assertRaises', args, kwargs)
sklearn/utils/_unittest_backport.py:113: in handle
    callable_obj(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples=-1, n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
____________________________ test_recalculate_max_depth _____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_recalculate_max_depth():
        """Check max_depth recalculation when max_samples is reset to n_samples"""
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:145: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
____________________________ test_max_samples_attribute _____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_max_samples_attribute():
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:153: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________ test_iforest_parallel_regression __________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_parallel_regression():
        """Check parallel regression."""
        rng = check_random_state(0)
    
        X_train, X_test, y_train, y_test = train_test_split(boston.data,
                                                            boston.target,
                                                            random_state=rng)
    
        ensemble = IsolationForest(n_jobs=3,
>                                  random_state=0).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:177: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=3,
        random_state=0, verbose=0)
X = array([[1.44760e-01, 0.00000e+00, 1.00100e+01, ..., 1.78000e+01,
        3.91500e+02, 1.36100e+01],
       [3.46600e-0...+02, 6.53000e+00],
       [1.40507e+01, 0.00000e+00, 1.81000e+01, ..., 2.02000e+01,
        3.50500e+01, 2.12200e+01]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_____________________________ test_iforest_performance ______________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_iforest_performance():
        """Test Isolation Forest performs well"""
    
        # Generate train/test data
        rng = check_random_state(2)
        X = 0.3 * rng.randn(120, 2)
        X_train = np.r_[X + 2, X - 2]
        X_train = X[:100]
    
        # Generate some abnormal novel observations
        X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
        X_test = np.r_[X[100:], X_outliers]
        y_test = np.array([0] * 20 + [1] * 20)
    
        # fit the model
>       clf = IsolationForest(max_samples=100, random_state=rng).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:208: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples=100, n_estimators=100, n_jobs=1,
        random_state=<mtrand.RandomState object at 0x7fe2e0dc3360>,
        verbose=0)
X = array([[-1.25027354e-01, -1.68800482e-02],
       [-6.40858829e-01,  4.92081243e-01],
       [-5.38030676e-01, -2.5252....66428750e-01,  5.41342992e-01],
       [ 5.41229421e-02,  1.65949282e-01],
       [ 3.09908720e-01, -9.87007304e-02]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
___________________________ test_max_samples_consistency ____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_max_samples_consistency():
        # Make sure validated max_samples in iforest and BaseBagging are identical
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:238: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________ test_iforest_subsampled_features __________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_subsampled_features():
        # It tests non-regression for #5732 which failed at predict.
        rng = check_random_state(0)
        X_train, X_test, y_train, y_test = train_test_split(boston.data[:50],
                                                            boston.target[:50],
                                                            random_state=rng)
        clf = IsolationForest(max_features=0.8)
>       clf.fit(X_train, y_train)

sklearn/ensemble/tests/test_iforest.py:251: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=0.8, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.20177e+00, 0.00000e+00, 1.81000e+01, 1.00000e+00, 7.70000e-01,
        6.12700e+00, 8.34000e+01, 2.72270e+00...      6.89700e+00, 5.43000e+01, 6.33610e+00, 6.00000e+00, 3.00000e+02,
        1.66000e+01, 3.91250e+02, 1.13800e+01]])
y = array([22.7, 26.2, 17.8, 10.9, 22.2, 18.2, 24.6, 22. , 13.4,  5. ,  9.5,
       36.4, 21.6, 31.5, 14.9, 33.3, 24. , 16... , 15.1, 14.9,
       28.4, 14.4, 16.4, 19. , 14.5, 24.4, 31.1, 23.7, 19.3, 32.9, 33.1,
        8.8, 19.9, 26.6, 22. ])
sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
________________________________ test_score_samples _________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_score_samples():
        X_train = [[1, 1], [1, 2], [2, 1]]
>       clf1 = IsolationForest(contamination=0.1).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:271: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination=0.1,
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[1, 1], [1, 2], [2, 1]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________________ test_deprecation __________________________________

    def test_deprecation():
        X = [[0.0], [1.0]]
        clf = IsolationForest()
    
        assert_warns_message(FutureWarning,
                             'default contamination parameter 0.1 will change '
                             'in version 0.22 to "auto"',
                             clf.fit, X)
    
        assert_warns_message(FutureWarning,
                             'Default "behaviour" parameter will change to "new" '
                             'in version 0.22',
                             clf.fit, X)
    
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:295: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[0.0], [1.0]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

sklearn/ensemble/iforest.py:199: FutureWarning
_______________________________ test_behaviour_param ________________________________

    def test_behaviour_param():
        X_train = [[1, 1], [1, 2], [2, 1]]
>       clf1 = IsolationForest(behaviour='old').fit(X_train)

sklearn/ensemble/tests/test_iforest.py:304: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[1, 1], [1, 2], [2, 1]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

sklearn/ensemble/iforest.py:199: FutureWarning
======================== 11 failed, 3 passed in 0.96 seconds ========================

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 18, 2018
@ngoix
Copy link
Contributor Author

ngoix commented Jul 18, 2018

PR ready to be merged I think

@agramfort
Copy link
Member

LGTM

@glemaitre merge if you're happy

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be late but I think we should be clearer about the deprecation cycle.

  1. we are going to remove the behaviour parameter at some point (0.24 according to the PR description) right? In that case I think we should mention it somewhere (in whastnew, the docstring or the deprecation warning). I don't see something mentioning this except the description of the PR.
  2. I would also add in whatsnew that the default value of behaviour will be new in 0.22

@glemaitre
Copy link
Member

glemaitre commented Jul 19, 2018

we are going to remove the behaviour parameter at some point (0.24 according to the PR description) right? In that case I think we should mention it somewhere (in whastnew, the docstring or the deprecation warning). I don't see something mentioning this except the description of the PR.

We have to be careful there. In 0.22, we will not be able to make the old behaviour since a bunch of thing will be deprecated or change from default. Therefore, behaviour needs to be introduce and deprecated right now because we will not be able to use behavior=old in 0.22.

Does it make sense (or is it make jet-lag speaking)?

@albertcthomas
Copy link
Contributor

We have to be careful there. In 0.22, we will not be able to make the old behaviour since a bunch of thing will be deprecated or change from default. Therefore, behaviour needs to be introduce and deprecated right now because we will not be able to use behavior=old in 0.22.

Does it make sense (or is it make jet-lag speaking)?

Yes it does. In this case the docstring of the behaviour attribute and the warning are a bit misleading because they say Default "behaviour" parameter will change to "new" in version 0.22 and don't say that there will no longer be a behaviour parameter in 0.22. Same thing for the whatsnew entry which promotes the introduction of a new parameter behaviour and does not mention that it will no longer be here in 0.22.

Should we add in the docstring, the warning and the whatsnew entry that behaviour parameter will be removed in 0.22?

@ngoix
Copy link
Contributor Author

ngoix commented Jul 20, 2018

Yes we definitely should, thanks for pointing out!

@ngoix
Copy link
Contributor Author

ngoix commented Jul 20, 2018

Actually I'm not very comfortable in introducing a deprecated parameter. We can't both change the behaviour + remove the parameter in one single version.
I think we should stick to switch default behavior to 'new' in 0.22, and remove it in 0.24.
behaviour='old' will still be usable in 0.22 if the user set contamination to something different than 'auto'.

@glemaitre
Copy link
Member

We are doing this right now there:
#11585
to keep the wrong behaviour for the smallest amount of time.

Moreover, here I think that this is even more problematic since we will not be able to maintain behaviour=old after 0.22 since that we are supposed to remove some deprecated code which allows for such behaviour.

@ngoix
Copy link
Contributor Author

ngoix commented Jul 20, 2018

here this is not a wrong behaviour though, just a different convention...

Moreover, here I think that this is even more problematic since we will not be able to maintain behaviour=old after 0.22 since that we are supposed to remove some deprecated code which allows for such behaviour.

behaviour=old will be ok in 0.22, though not the default argument, and we remove this param in 0.24 so I don't see the problem?

@jorisvandenbossche
Copy link
Member

I don't think we should deprecate behaviour='new' already now. We need to give users the opportunity to already have the new behaviour with 0.20, but without having a deprecation warning that the keyword they need to specify to have that behaviour is deprecated.
So in my view it is only behaviour='old' that is deprecated, so it is fine to remove that option in 0.22, and from then only behaviour='new' will be accepted (and then deprecated at that point).
But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

@glemaitre
Copy link
Member

behaviour=old will be ok in 0.22, though not the default argument, and we remove this param in 0.24 so I don't see the problem?

We are supposed to not have threshold_ anymore, isn't it.

@ngoix
Copy link
Contributor Author

ngoix commented Jul 20, 2018

We are supposed to not have threshold_ anymore, isn't it.

Right, but we can still allow behaviour='old' in 0.22 (deprecated) without allowing to access this attribute anymore.

I'm also ok with @jorisvandenbossche compromise solution, ie remove behavior='old' in 0.22 and remove the param in 0.24

@albertcthomas
Copy link
Contributor

albertcthomas commented Jul 20, 2018

So in my view it is only behaviour='old' that is deprecated, so it is fine to remove that option in 0.22, and from then only behaviour='new' will be accepted (and then deprecated at that point).
But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

I'm +1 for @jorisvandenbossche 's solution, even if it takes 2 deprecation cycles. We introduce behaviour now and deprecate the default old value. So in 0.22, only new will be available (and we don't need to maintain the old value anymore). We then deprecate behaviour to remove it in 0.24.

@albertcthomas
Copy link
Contributor

But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

And +1 for this also if we choose @jorisvandenbossche 's solution

ngoix added 6 commits July 20, 2018 10:45
fix + cosmit

cosmit

common test

update examples

cosmit + fix travis

attribute error instead of value + mask warning in test

whatsnew
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we are almost good then. Just a weird thing with the example and I would slightly modify the behaviour docstring.

@@ -0,0 +1,129 @@
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a new file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example was removed in PR #10700 (merged yesterday) in favor of plot_anomaly_comparison.py. We don’t need it anymore.

offset_ attribute documentation. Basically, the decision_function
becomes dependent on the contamination parameter, in such a way that
0 becomes its natural threshold to detect outliers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use the sphinx syntax to highlight those remarks. I would also remove the optional since that people should probably use 'new' and state that 'new' will be the default in the future.

 behaviour : str, default='old'
    Behaviour of the ``decision_function`` which can be either 'old' or 'new'.
    Passing ``behaviour='new'`` makes the ``decision_function`` change to match
    other anomaly detection algorithm API which will be the default behaviour
    in the future. As explained in details in the ``offset_`` attribute
    documentation, the ``decision_function`` becomes dependent on the
    contamination parameter, in such a way that 0 becomes its natural 
    threshold to detect outliers.

    .. versionadded:: 0.20
       ``behaviour`` is added in 0.20 for back-compatibility purpose.
   
    .. deprecated:: 0.20
       ``behaviour='old'`` is deprecated in 0.20 and will not be possible in 0.22.

    .. deprecated:: 0.22
       ``behaviour`` parameter will be deprecated in 0.22 and removed in 0.24.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have the explanation duplicated now

@glemaitre
Copy link
Member

Thanks @ngoix !!! Merging now.

@glemaitre glemaitre merged commit 53622e8 into scikit-learn:master Jul 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IsolationForest is breaking backward compatibility
9 participants