diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst index d44d372f1b7ca..ddc2a7fdaf816 100644 --- a/doc/developers/contributing.rst +++ b/doc/developers/contributing.rst @@ -1108,7 +1108,7 @@ multiple interfaces): :Transformer: - For filtering or modifying the data, in a supervised or unsupervised + For modifying the data, in a supervised or unsupervised way, implements:: new_data = transformer.transform(data) @@ -1118,6 +1118,13 @@ multiple interfaces): new_data = transformer.fit_transform(data) +:Resampler: + + For filtering or augmenting the data, in a supervised or unsupervised + way, implements:: + + new_X, new_y, new_kwargs = resampler.fit_resample(data_X, data_y) + :Model: A model that can give a `goodness of fit `_ diff --git a/doc/glossary.rst b/doc/glossary.rst index dba7ffa746732..ecad1132f2356 100644 --- a/doc/glossary.rst +++ b/doc/glossary.rst @@ -918,6 +918,19 @@ Class APIs and Estimator Types outliers have score below 0. :term:`score_samples` may provide an unnormalized score per sample. + outlier rejector + outlier rejectors + An :term:`outlier detector` which is a resampler. It will remove + outliers from a passed dataset when :term:`fit_resample` is called. + + Outlier rejectors must implement: + + * :term:`fit_resample` + + If the estimator implements :term:`fit_predict` according to the + :class:`OutlierMixin` API, :class:`OutlierRejectionMixin` should be used + to automatically implement correct :term:`fit_resample` behavior. + predictor predictors An :term:`estimator` supporting :term:`predict` and/or @@ -949,6 +962,12 @@ Class APIs and Estimator Types A purely :term:`transductive` transformer, such as :class:`manifold.TSNE`, may not implement ``transform``. + resampler + resamplers + An estimator supporting :term:`fit_resample`. This can be used in a + :class:`ResampledTrainer` to resample, augment or reduce the training + dataset passed to another estimator. + vectorizer vectorizers See :term:`feature extractor`. @@ -1218,6 +1237,27 @@ Methods (i.e. training and test data together) before further modelling, as this results in :term:`data leakage`. + ``fit_resample`` + A method whose presence in an estimator is necessary and sufficient for + it to be a :term:`resampler`. + When called it should fit the estimator and return a new + dataset. In the new dataset, samples may be removed, added or modified. + In contrast to :term:`fit_transform`: + * X, y, and any other sample-aligned data may be generated; + * the samples in the returned dataset need not have any alignment or + correspondence to the input dataset. + + This method has the signature ``fit_resample(X, y, **kw)`` and returns + a 3-tuple ``X_new, y_new, kw_new`` where ``kw_new`` is a dict mapping + names to data-aligned values that should be passed as fit parameters + to the subsequent estimator. Any keyword arguments passed in should be + resampled and returned; if the resampler is not capable of + resampling the keyword arguments, it should raise a ``TypeError``. + + Ordinarily, this method is only called by a :class:`ResampledTrainer`, + which acts like a specialised pipeline for cases when the training data + should be augmented or resampled.
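[Editor's note] To make the ``fit_resample`` contract above concrete, here is a minimal sketch of a resampler honouring the 3-tuple return convention. The class and its name are illustrative only and not part of this changeset; it assumes nothing beyond ``safe_indexing`` and ``check_random_state`` from ``sklearn.utils``:

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils import check_random_state, safe_indexing


class RandomHalfSampler(BaseEstimator):
    """Illustrative resampler: keep a random half of the training samples."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y, **kw):
        rng = check_random_state(self.random_state)
        n_samples = len(X)
        keep = rng.permutation(n_samples)[:n_samples // 2]
        # Sample-aligned keyword arguments (e.g. sample_weight) are resampled
        # with the same indices and returned as the third element.
        kw_new = {name: safe_indexing(value, keep) for name, value in kw.items()}
        return safe_indexing(X, keep), safe_indexing(y, keep), kw_new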
+ ``get_feature_names`` Primarily for :term:`feature extractors`, but also used for other transformers to provide string names for each column in the output of diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst index 7ad12680353bb..2cc1335c22765 100644 --- a/doc/modules/classes.rst +++ b/doc/modules/classes.rst @@ -30,6 +30,8 @@ Base classes base.BiclusterMixin base.ClassifierMixin base.ClusterMixin + base.OutlierMixin + base.OutlierRejectionMixin base.DensityMixin base.RegressorMixin base.TransformerMixin @@ -164,6 +166,7 @@ details. :template: class.rst compose.ColumnTransformer + compose.ResampledTrainer compose.TransformedTargetRegressor .. autosummary:: diff --git a/doc/modules/compose.rst b/doc/modules/compose.rst index 0ac33ce7a4d4a..2583b21716525 100644 --- a/doc/modules/compose.rst +++ b/doc/modules/compose.rst @@ -5,14 +5,16 @@ Pipelines and composite estimators ================================== -Transformers are usually combined with classifiers, regressors or other +Transformers and resamplers are usually combined with classifiers, regressors +or other estimators to build a composite estimator. The most common tool is a :ref:`Pipeline `. Pipeline is often used in combination with :ref:`FeatureUnion ` which concatenates the output of transformers into a composite feature space. :ref:`TransformedTargetRegressor ` deals with transforming the :term:`target` (i.e. log-transform :term:`y`). In contrast, Pipelines only transform the -observed data (:term:`X`). +observed data (:term:`X`). Additionally, :term:`resamplers` can be composed +with a predictor to resample the training dataset on fit +(see :ref:`pipeline_resamplers`). .. _pipeline: @@ -236,6 +238,46 @@ object:: * :ref:`sphx_glr_auto_examples_compose_plot_compare_reduction.py` +.. _pipeline_resamplers: + +Resampling or modifying samples in training +=========================================== + +All transformers in a Pipeline must output a dataset with samples corresponding +to their input. Sometimes you want a process to modify the set of samples +used in training, such as balanced resampling, outlier removal, or data +augmentation/perturbation. Such processes are called resamplers, rather than +transformers, in scikit-learn, and should be composed with a predictor using +a :class:`compose.ResampledTrainer` rather than a Pipeline. Resamplers provide +a ``fit_resample`` method which is called by the ``ResampledTrainer`` when +fitting, so that the resampled data is used to train the subsequent predictor. + +:term:`outlier rejectors` provide ``fit_resample`` methods that remove samples +from the dataset if they are classified as outliers. Consider the following:: + + >>> from sklearn.compose import ResampledTrainer + >>> from sklearn.covariance import EllipticEnvelope + >>> from sklearn.linear_model import LogisticRegression + >>> resampled = ResampledTrainer(EllipticEnvelope(), LogisticRegression()) + >>> from sklearn.datasets import load_iris + >>> X, y = load_iris(return_X_y=True) + >>> resampled.fit(X, y) + ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS + ResampledTrainer(...) + + +In ``resampled``, we remove outliers before fitting our ``LogisticRegression`` +model, so that the samples passed to fit come from the same distribution. We do +this to improve the quality of the fit (see :ref:`outlier_detection`). +Therefore, during ``fit``, we want our resampler to be applied.
+ +Now assume that we would like to make predictions on some new data ``X_test``:: + + >>> predictions = resampled.predict(X_test) + +This does not apply resampling, but provides predictions for all samples in +``X_test``. + .. _transformed_target_regressor: Transforming target in regression ================================== @@ -327,8 +369,7 @@ is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix. -When you want to apply different transformations to each field of the data, -see the related class :class:`sklearn.compose.ColumnTransformer` +When you want to apply different transformations to each field of the data, see the related class :class:`sklearn.compose.ColumnTransformer` (see :ref:`user guide `). :class:`FeatureUnion` serves the same purposes as :class:`Pipeline` - diff --git a/doc/modules/outlier_detection.rst b/doc/modules/outlier_detection.rst index c061feb0b1d7c..135bb92a4c5e6 100644 --- a/doc/modules/outlier_detection.rst +++ b/doc/modules/outlier_detection.rst @@ -349,6 +349,16 @@ This strategy is illustrated below. `_ Proc. ACM SIGMOD +.. _outlier_rejectors: + +Outlier Rejectors +----------------- +All :term:`outlier detectors` can be used as :term:`outlier rejectors`, a form +of :term:`resampler` that returns the input dataset with the outliers removed. +This is especially useful when composing a detector with a predictor. +See :ref:`pipeline_resamplers` and the examples. + .. _novelty_with_lof: Novelty detection with Local Outlier Factor diff --git a/sklearn/base.py b/sklearn/base.py index fb0818efc8248..ed19804fb0ef0 100644 --- a/sklearn/base.py +++ b/sklearn/base.py @@ -13,7 +13,8 @@ import numpy as np from . import __version__ -from .utils import _IS_32BIT +from .utils import _IS_32BIT +from .utils import safe_indexing, check_X_y_kwargs _DEFAULT_TAGS = { 'non_deterministic': False, @@ -603,6 +604,45 @@ def fit_predict(self, X, y=None): return self.fit(X).predict(X) +class OutlierRejectionMixin: + """Mixin class for all outlier detection resamplers in scikit-learn. Child + classes remove outliers from the dataset. + """ + _estimator_type = "outlier_rejector" + + def fit_resample(self, X, y, **kws): + """Performs fit on X and returns a new X and y consisting of only the + inliers. + + Parameters + ---------- + X : ndarray, shape (n_samples, n_features) + Input data X. + + y : ndarray, shape (n_samples,) + Input data y. + + Returns + ------- + X : ndarray, shape (n_inliers, n_features) + The original X with outlier samples removed. + + y : ndarray, shape (n_inliers,) + The original y with outlier samples removed. + + kws : dict of ndarray + dict of keyword arguments, with all outlier samples removed.
+ """ + + check_X_y_kwargs(X, y, kws) + inliers = self.fit_predict(X) == 1 + kwsr = { + kw: safe_indexing(kws[kw], inliers) + for kw in kws + } + return safe_indexing(X, inliers), safe_indexing(y, inliers), kwsr + + class MetaEstimatorMixin: _required_parameters = ["estimator"] """Mixin class for all meta estimators in scikit-learn.""" diff --git a/sklearn/compose/__init__.py b/sklearn/compose/__init__.py index 1cfd53c50d682..80e537c741de2 100644 --- a/sklearn/compose/__init__.py +++ b/sklearn/compose/__init__.py @@ -7,10 +7,12 @@ from ._column_transformer import ColumnTransformer, make_column_transformer from ._target import TransformedTargetRegressor +from ._resampled import ResampledTrainer __all__ = [ 'ColumnTransformer', 'make_column_transformer', 'TransformedTargetRegressor', + 'ResampledTrainer', ] diff --git a/sklearn/compose/_resampled.py b/sklearn/compose/_resampled.py new file mode 100644 index 0000000000000..97b8a81973bc5 --- /dev/null +++ b/sklearn/compose/_resampled.py @@ -0,0 +1,112 @@ +# Author: Joel Nothman + +from ..base import BaseEstimator, MetaEstimatorMixin, clone +from ..utils.metaestimators import if_delegate_has_method +from ..utils.validation import check_is_fitted, check_X_y_kwargs + + +class ResampledTrainer(MetaEstimatorMixin, BaseEstimator): + """Composition of a resampler and a estimator + + Read more in the :ref:`User Guide `. + + Parameters + ---------- + resampler : Estimator supporting fit_resample + + estimator : Estimator + + Attributes + ---------- + resampler_ : Estimator + Fitted clone of `resampler`. + + estimator_ : Estimator + Fitted clone of `estimator`. + + Examples + -------- + >>> from sklearn.base import BaseEstimator + >>> from sklearn.compose import ResampledTrainer + >>> from sklearn.datasets import load_iris + >>> from sklearn.linear_model import LogisticRegression + >>> + >>> class HalfSampler(BaseEstimator): + ... "Train with every second sample" + ... def fit_resample(self, X, y, **kw): + ... return X[::2], y[::2] + >>> + >>> est = ResampledTrainer(HalfSampler(), LogisticRegression()) + >>> X, y = load_iris(return_X_y=True) + >>> est.fit(X, y) + ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS + ResampledTrainer(...) + >>> est.predict(X[:2]) + array([0, 0]) + """ + + def __init__(self, resampler, estimator): + self.resampler = resampler + self.estimator = estimator + + _required_parameters = ["resampler", "estimator"] + + # TODO: tags? 
+ + def fit(self, X, y=None, **kw): + X, y, kw = check_X_y_kwargs(X, y, kw) + self.resampler_ = clone(self.resampler) + X, y, kw = self.resampler_.fit_resample(X, y, **kw) + + self.estimator_ = clone(self.estimator).fit(X, y, **kw) + return self + + @if_delegate_has_method(delegate="estimator") + def predict(self, X, **predict_params): + check_is_fitted(self, "estimator_") + return self.estimator_.predict(X, **predict_params) + + @if_delegate_has_method(delegate="estimator") + def transform(self, X): + check_is_fitted(self, "estimator_") + return self.estimator_.transform(X) + + @if_delegate_has_method(delegate="estimator") + def predict_proba(self, X): + check_is_fitted(self, "estimator_") + return self.estimator_.predict_proba(X) + + @if_delegate_has_method(delegate="estimator") + def predict_log_proba(self, X): + check_is_fitted(self, "estimator_") + return self.estimator_.predict_log_proba(X) + + @if_delegate_has_method(delegate="estimator") + def decision_function(self, X): + check_is_fitted(self, "estimator_") + return self.estimator_.decision_function(X) + + @if_delegate_has_method(delegate="estimator") + def score(self, X, y, **kw): + check_is_fitted(self, "estimator_") + return self.estimator_.score(X, y, **kw) + + @property + def fit_transform(self): + # check if the estimator has a transform function + self.estimator.transform + + def fit_transform(X, y, **kwargs): + self.fit(X, y, **kwargs) + # since estimator_ exists now, we can return transform + return self.estimator_.transform(X) + + return fit_transform + + @property + def fit_predict(self): + # mirror fit_transform: check that the estimator can predict + self.estimator.predict + + def fit_predict(X, y, **kwargs): + self.fit(X, y, **kwargs) + # predictions are made on the full, unresampled X + return self.estimator_.predict(X) + + return fit_predict + + @property + def _estimator_type(self): + return self.estimator._estimator_type + + @property + def classes_(self): + return self.estimator_.classes_ diff --git a/sklearn/compose/tests/test_resampled.py b/sklearn/compose/tests/test_resampled.py new file mode 100644 index 0000000000000..6478e4d8c88b0 --- /dev/null +++ b/sklearn/compose/tests/test_resampled.py @@ -0,0 +1,123 @@ +# Authors: Joel Nothman +# Oliver Rausch +import numpy as np +from sklearn.base import BaseEstimator +from sklearn.datasets import make_classification +from sklearn.svm import SVC, OneClassSVM +from sklearn.decomposition import PCA +from sklearn.pipeline import Pipeline +from sklearn.compose import ResampledTrainer +from sklearn.utils.estimator_checks import check_estimator +from sklearn.utils.validation import _num_samples, check_X_y_kwargs + + +class HalfSampler(BaseEstimator): + "Train with every second sample" + + def fit_resample(self, X, y, **kws): + X, y, kws = check_X_y_kwargs(X, y, kws, accept_sparse="csr") + if _num_samples(X) > 1: + return X[::2], y[::2], {kw: kws[kw][::2] for kw in kws} + + return X, y, kws + + +class DataSaver(BaseEstimator): + "remembers the data that it was fitted with" + + def fit(self, X, y, **kws): + self.X = X + self.y = y + self.kws = kws + return self + + def predict(self, X): + return np.zeros((X.shape[0],)) + + def transform(self, X): + return np.zeros((X.shape[0],)) + + +def test_estimator_checks(): + check_estimator(ResampledTrainer(HalfSampler(), SVC())) + + +def test_correct_halfsampler(): + # check that the estimator is fitted with the correct data + X = np.zeros((10, 2)) + y = np.arange(10) + + rt = ResampledTrainer(HalfSampler(), DataSaver()) + for method in [rt.fit, rt.fit_transform, rt.fit_predict]: + method(X, y) + + np.testing.assert_array_equal( + rt.estimator_.y, np.array([0, 2, 4, 6, 8]) + ) + assert rt.estimator_.kws == {} + method(X, y, sample_weight=np.arange(10, 20), + sample_prop=np.arange(20, 30)) + + np.testing.assert_array_equal( +
rt.estimator_.y, np.array([0, 2, 4, 6, 8]) + ) + np.testing.assert_array_equal( + rt.estimator_.kws['sample_weight'], np.array([10, 12, 14, 16, 18]) + ) + np.testing.assert_array_equal( + rt.estimator_.kws['sample_prop'], np.array([20, 22, 24, 26, 28]) + ) + + +def test_pca_outlier_svm(): + # Test the various methods of the pipeline (pca + resampled svm). + X, y = make_classification( + n_classes=2, + class_sep=2, + weights=[0.1, 0.9], + n_informative=3, + n_redundant=1, + flip_y=0, + n_features=20, + n_clusters_per_class=1, + n_samples=500, + random_state=0, + ) + + # Test with PCA + resampled SVC + clf = SVC(gamma="scale", probability=True, random_state=0) + pca = PCA() + outlier = OneClassSVM(gamma="scale") + pipe = Pipeline([("pca", pca), ("clf", ResampledTrainer(outlier, clf))]) + pipe.fit(X, y) + pipe.predict(X) + pipe.predict_proba(X) + pipe.predict_log_proba(X) + pipe.score(X, y) + + +def test_outlier_pca_svm(): + # Test the various methods of the composition + # (outlier rejection + pca + svm). + X, y = make_classification( + n_classes=2, + class_sep=2, + weights=[0.1, 0.9], + n_informative=3, + n_redundant=1, + flip_y=0, + n_features=20, + n_clusters_per_class=1, + n_samples=500, + random_state=0, + ) + + # Test with outlier rejection + PCA + SVC + clf = SVC(gamma="scale", probability=True, random_state=0) + pca = PCA() + outlier = OneClassSVM(gamma="scale") + pipe = ResampledTrainer(outlier, Pipeline([("pca", pca), ("svc", clf)])) + pipe.fit(X, y) + pipe.predict(X) + pipe.predict_proba(X) + pipe.predict_log_proba(X) + pipe.score(X, y) diff --git a/sklearn/covariance/elliptic_envelope.py b/sklearn/covariance/elliptic_envelope.py index 517f9a32dc9af..53b466f3fcb72 100644 --- a/sklearn/covariance/elliptic_envelope.py +++ b/sklearn/covariance/elliptic_envelope.py @@ -7,9 +7,10 @@ from ..utils.validation import check_is_fitted, check_array from ..metrics import accuracy_score from ..base import OutlierMixin +from ..base import OutlierRejectionMixin -class EllipticEnvelope(MinCovDet, OutlierMixin): +class EllipticEnvelope(MinCovDet, OutlierMixin, OutlierRejectionMixin): """An object for detecting outliers in a Gaussian distributed dataset. Read more in the :ref:`User Guide `.
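[Editor's note] With ``OutlierRejectionMixin`` in its bases, ``EllipticEnvelope`` (like the other detectors changed below) gains ``fit_resample``. A minimal sketch of what that enables, assuming the mixin behaviour defined in ``sklearn/base.py`` above; the data is made up:

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
X[:5] += 10  # plant a few obvious outliers
y = rng.randint(0, 2, size=100)

# fit_resample fits the detector and drops the samples it flags (-1).
X_new, y_new, kw_new = EllipticEnvelope(contamination=0.05).fit_resample(X, y)
assert X_new.shape[0] == y_new.shape[0] < X.shape[0]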
diff --git a/sklearn/ensemble/iforest.py b/sklearn/ensemble/iforest.py index 8aaae2925ccaf..b81832d98e160 100644 --- a/sklearn/ensemble/iforest.py +++ b/sklearn/ensemble/iforest.py @@ -18,13 +18,14 @@ from ..utils.fixes import _joblib_parallel_args from ..utils.validation import check_is_fitted, _num_samples from ..base import OutlierMixin +from ..base import OutlierRejectionMixin from .bagging import BaseBagging __all__ = ["IsolationForest"] -class IsolationForest(BaseBagging, OutlierMixin): +class IsolationForest(BaseBagging, OutlierMixin, OutlierRejectionMixin): """Isolation Forest Algorithm Return the anomaly score of each sample using the IsolationForest algorithm diff --git a/sklearn/neighbors/lof.py b/sklearn/neighbors/lof.py index a58997502be91..36919d182604c 100644 --- a/sklearn/neighbors/lof.py +++ b/sklearn/neighbors/lof.py @@ -9,15 +9,16 @@ from .base import KNeighborsMixin from .base import UnsupervisedMixin from ..base import OutlierMixin +from ..base import OutlierRejectionMixin from ..utils.validation import check_is_fitted -from ..utils import check_array +from ..utils import check_array, safe_indexing, check_X_y __all__ = ["LocalOutlierFactor"] class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin, - OutlierMixin): + OutlierMixin, OutlierRejectionMixin): """Unsupervised Outlier Detection using Local Outlier Factor (LOF) The anomaly score of each sample is called Local Outlier Factor. @@ -193,6 +194,51 @@ def fit_predict(self): return self._fit_predict + @property + def fit_resample(self): + """Performs fit on X and returns new X and y consisting of only the + inliers. + + Parameters + ---------- + X : ndarray, shape (n_samples, n_features) + Input data X. + + y : ndarray, shape (n_samples,) + Input data y. + + Returns + ------- + X : ndarray, shape (n_inliers, n_features) + The input X with outlier samples removed. + + y : ndarray, shape (n_inliers,) + The input y with outlier samples removed. + + kws : dict of ndarray + dict of keyword arguments, with all outlier samples removed. + """ + # fit_resample requires fit_predict + if self.novelty: + msg = ('fit_resample is not available when novelty=True. Use ' + 'novelty=False if you want to use outlier rejection') + raise AttributeError(msg) + + return self._fit_resample + + def _fit_resample(self, X, y, **kws): + X, y = check_X_y(X, y) + kws = { + kw: check_X_y(X, kws[kw], force_all_finite='allow-nan')[1] + for kw in kws + } + inliers = self.fit_predict(X) == 1 + kwsr = { + kw: safe_indexing(kws[kw], inliers) + for kw in kws + } + return safe_indexing(X, inliers), safe_indexing(y, inliers), kwsr + def _fit_predict(self, X, y=None): """"Fits the model to the training set X and returns the labels.
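[Editor's note] A brief sketch of the ``novelty`` guard above, assuming the LOF changes in this diff; the dataset is arbitrary:

from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

X, y = make_blobs(random_state=0)

# novelty=False (the default): outlier rejection is available.
X_new, y_new, kw_new = LocalOutlierFactor().fit_resample(X, y)

# novelty=True: even accessing fit_resample raises AttributeError,
# mirroring the existing guard on fit_predict.
try:
    LocalOutlierFactor(novelty=True).fit_resample
except AttributeError as exc:
    print(exc)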
diff --git a/sklearn/preprocessing/__init__.py b/sklearn/preprocessing/__init__.py index 2eb41a66220c7..e03f3d5c6f1e2 100644 --- a/sklearn/preprocessing/__init__.py +++ b/sklearn/preprocessing/__init__.py @@ -24,6 +24,7 @@ from .data import power_transform from .data import PowerTransformer from .data import PolynomialFeatures +from .data import NaNFilter from ._encoders import OneHotEncoder from ._encoders import OrdinalEncoder @@ -64,4 +65,5 @@ 'label_binarize', 'quantile_transform', 'power_transform', + 'NaNFilter', ] diff --git a/sklearn/preprocessing/data.py b/sklearn/preprocessing/data.py index 823eedc8b7dd9..3778f7bc4d4f5 100644 --- a/sklearn/preprocessing/data.py +++ b/sklearn/preprocessing/data.py @@ -20,7 +20,7 @@ from scipy.special import boxcox from ..base import BaseEstimator, TransformerMixin -from ..utils import check_array +from ..utils import check_array, check_X_y_kwargs, safe_indexing from ..utils.extmath import row_norms from ..utils.extmath import _incremental_mean_and_var from ..utils.sparsefuncs_fast import (inplace_csr_row_normalize_l1, @@ -57,6 +57,7 @@ 'minmax_scale', 'quantile_transform', 'power_transform', + 'NaNFilter', ] @@ -3009,3 +3010,72 @@ def power_transform(X, method='warn', standardize=True, copy=True): method = 'box-cox' pt = PowerTransformer(method=method, standardize=standardize, copy=copy) return pt.fit_transform(X) + + +class NaNFilter(BaseEstimator): + """ + A resampler that removes samples containing NaN values in X. + + Parameters + ---------- + count : int, optional, default=1 + The minimum number of NaN values a sample must contain for it to + be removed. + + Examples + -------- + >>> import numpy as np + >>> from sklearn.preprocessing import NaNFilter + >>> nan = float('nan') + >>> data = [[1, 2], [1, nan], [nan, nan]] + >>> y = [0, 0, 0] + >>> Xr, yr, _ = NaNFilter().fit_resample(data, y) + >>> print(Xr) + [[1. 2.]] + >>> Xr, yr, _ = NaNFilter(count=2).fit_resample(data, y) + >>> print(Xr) + [[ 1. 2.] + [ 1. nan]] + + See also + -------- + SimpleImputer : Transformer for completing missing values. + """ + def __init__(self, count=1): + self.count = count + + def fit_resample(self, X, y, **kws): + """Removes samples containing at least `count` NaN values from X. + + Parameters + ---------- + X : ndarray, shape (n_samples, n_features) + Input data X. + + y : ndarray, shape (n_samples,) + Input data y. + + Returns + ------- + X : ndarray, shape (n_samples, n_features) + The input X, with all samples containing at least `count` NaN + values removed. + + y : ndarray, shape (n_samples,) + The input y, with all samples containing at least `count` NaN + values removed. + + kws : dict of ndarray + dict of keyword arguments, with all samples containing at least + `count` NaN values removed.
+ """ + X, y, kws = check_X_y_kwargs(X, y, kws, force_all_finite='allow-nan') + + mask = np.sum(np.isnan(X), axis=1) < self.count + kwsr = { + kw: safe_indexing(kws[kw], mask) + for kw in kws + } + return safe_indexing(X, mask), safe_indexing(y, mask), kwsr + + def _more_tags(self): + return {'allow_nan': True} diff --git a/sklearn/preprocessing/tests/test_data.py b/sklearn/preprocessing/tests/test_data.py index 46769cad40edf..ede79ac5a29eb 100644 --- a/sklearn/preprocessing/tests/test_data.py +++ b/sklearn/preprocessing/tests/test_data.py @@ -50,6 +50,7 @@ from sklearn.preprocessing.data import PowerTransformer from sklearn.preprocessing.data import power_transform from sklearn.preprocessing.data import BOUNDS_THRESHOLD +from sklearn.preprocessing.data import NaNFilter from sklearn.exceptions import NotFittedError from sklearn.base import clone @@ -2463,3 +2464,53 @@ def test_power_transform_default_method(): X_trans_boxcox = power_transform(X, method='box-cox') assert_array_equal(X_trans_boxcox, X_trans_default) + + +def test_nanfilter(): + nan = float('nan') + data = [[1, 2], [1, nan], [nan, nan]] + y = [0, 1, 2] + sample_weights = np.array([0.1, 0.4, 0.5]) + other_sample_prop = [0.3, 0.4, 0.5] + Xr, yr, kws = NaNFilter().fit_resample( + data, y) + + assert_array_equal( + Xr, + np.array([[1, 2]]) + ) + assert_array_equal( + yr, + np.array([0]) + ) + + Xr, yr, kws = NaNFilter(count=2).fit_resample( + data, y) + assert_array_equal( + Xr, + np.array([[1, 2], [1, nan]]) + ) + assert_array_equal( + yr, + np.array([0, 1]) + ) + + Xr, yr, kws = NaNFilter().fit_resample( + data, y, + sample_weights=sample_weights, + other_sample_prop=other_sample_prop + ) + + assert_array_equal( + Xr, + np.array([[1, 2]]) + ) + + assert_array_equal( + kws['sample_weights'], + np.array([0.1]) + ) + assert_array_equal( + kws['other_sample_prop'], + np.array([0.3]) + ) diff --git a/sklearn/svm/classes.py b/sklearn/svm/classes.py index 067e1a6ef8d34..ccfc5a229c2d9 100644 --- a/sklearn/svm/classes.py +++ b/sklearn/svm/classes.py @@ -2,7 +2,8 @@ import numpy as np from .base import _fit_liblinear, BaseSVC, BaseLibSVM -from ..base import BaseEstimator, RegressorMixin, OutlierMixin +from ..base import (BaseEstimator, RegressorMixin, OutlierMixin, + OutlierRejectionMixin) from ..linear_model.base import LinearClassifierMixin, SparseCoefMixin, \ LinearModel from ..utils import check_X_y @@ -1056,7 +1057,7 @@ def __init__(self, nu=0.5, C=1.0, kernel='rbf', degree=3, verbose=verbose, max_iter=max_iter, random_state=None) -class OneClassSVM(BaseLibSVM, OutlierMixin): +class OneClassSVM(BaseLibSVM, OutlierMixin, OutlierRejectionMixin): """Unsupervised Outlier Detection. Estimate the support of a high-dimensional distribution. 
diff --git a/sklearn/tests/test_metaestimators.py b/sklearn/tests/test_metaestimators.py index 822dd0edb5501..45fd68709a3ab 100644 --- a/sklearn/tests/test_metaestimators.py +++ b/sklearn/tests/test_metaestimators.py @@ -13,6 +13,7 @@ from sklearn.feature_selection import RFE, RFECV from sklearn.ensemble import BaggingClassifier from sklearn.exceptions import NotFittedError +from sklearn.compose import ResampledTrainer class DelegatorData: @@ -24,6 +25,12 @@ def __init__(self, name, construct, skip_methods=(), self.skip_methods = skip_methods +class DummyResampler(BaseEstimator): + "passes the dataset through unchanged" + def fit_resample(self, X, y, **kw): + return X, y, kw + + DELEGATING_METAESTIMATORS = [ DelegatorData('Pipeline', lambda est: Pipeline([('est', est)])), DelegatorData('GridSearchCV', @@ -41,7 +48,10 @@ def __init__(self, name, construct, skip_methods=(), DelegatorData('BaggingClassifier', BaggingClassifier, skip_methods=['transform', 'inverse_transform', 'score', 'predict_proba', 'predict_log_proba', - 'predict']) + 'predict']), + DelegatorData('ResampledTrainer', + lambda est: ResampledTrainer(DummyResampler(), est), + skip_methods=['inverse_transform']) ] @@ -62,7 +72,7 @@ def __init__(self, param=1, hidden_method=None): def fit(self, X, y=None, *args, **kwargs): self.coef_ = np.arange(X.shape[1]) - return True + return self def _check_fit(self): check_is_fitted(self, 'coef_') diff --git a/sklearn/utils/__init__.py b/sklearn/utils/__init__.py index d40414c36a03d..324ce632cb2be 100644 --- a/sklearn/utils/__init__.py +++ b/sklearn/utils/__init__.py @@ -20,8 +20,8 @@ from .validation import (as_float_array, assert_all_finite, check_random_state, column_or_1d, check_array, - check_consistent_length, check_X_y, indexable, - check_symmetric, check_scalar) + check_consistent_length, check_X_y, check_X_y_kwargs, + indexable, check_symmetric, check_scalar) from ..
import get_config @@ -264,6 +264,8 @@ def _safe_indexing_row(X, indices): indices.dtype.kind == 'i'): # This is often substantially faster than X[indices] return X.take(indices, axis=0) + elif getattr(X, 'format', None) in {'dia', 'bsr', 'coo'}: + return X.tocsr()[indices] else: return X[indices] else: diff --git a/sklearn/utils/estimator_checks.py b/sklearn/utils/estimator_checks.py index 249cb022f8e87..ea3c25396ac0c 100644 --- a/sklearn/utils/estimator_checks.py +++ b/sklearn/utils/estimator_checks.py @@ -50,7 +50,7 @@ from .import shuffle from .validation import has_fit_parameter, _num_samples from ..preprocessing import StandardScaler -from ..datasets import load_iris, load_boston, make_blobs +from ..datasets import load_iris, load_boston, make_blobs, make_classification BOSTON = None @@ -75,11 +75,12 @@ def _yield_checks(name, estimator): tags = _safe_tags(estimator) yield check_estimators_dtypes yield check_fit_score_takes_y - yield check_sample_weights_pandas_series - yield check_sample_weights_list - yield check_sample_weights_invariance - yield check_estimators_fit_returns_self - yield partial(check_estimators_fit_returns_self, readonly_memmap=True) + if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'): + yield check_sample_weights_pandas_series + yield check_sample_weights_list + yield check_sample_weights_invariance + yield check_estimators_fit_returns_self + yield partial(check_estimators_fit_returns_self, readonly_memmap=True) # Check that all estimator yield informative messages when # trained on empty datasets @@ -88,7 +89,11 @@ def _yield_checks(name, estimator): yield check_dtype_object yield check_estimators_empty_data_messages - if name not in CROSS_DECOMPOSITION: + if (name not in CROSS_DECOMPOSITION + and not hasattr(estimator, 'fit_resample')): + # TODO potentially re-add fit_resample tests after the SLEP has + # been clarified. + + # cross-decomposition's "transform" returns X and Y yield check_pipeline_consistency @@ -102,9 +107,10 @@ def _yield_checks(name, estimator): yield check_estimator_sparse_data - # Test that estimators can be pickled, and once pickled - # give the same answer as before. - yield check_estimators_pickle + if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'): + # Test that estimators can be pickled, and once pickled + # give the same answer as before.
+ yield check_estimators_pickle def _yield_classifier_checks(name, classifier): @@ -212,6 +218,9 @@ def _yield_outliers_checks(name, estimator): if hasattr(estimator, 'fit_predict'): yield check_outliers_fit_predict + if hasattr(estimator, 'fit_resample'): + yield check_outlier_rejectors + # checks for estimators that can be used on a test set if hasattr(estimator, 'predict'): yield check_outliers_train @@ -223,6 +232,13 @@ def _yield_outliers_checks(name, estimator): yield check_estimators_unfitted +def _yield_resamplers_checks(name, estimator): + yield check_resampler_structure + yield check_resamplers_have_no_transform + yield check_resample_repeated + yield check_fit_resample2d + + def _yield_all_checks(name, estimator): tags = _safe_tags(estimator) if "2darray" not in tags["X_types"]: @@ -247,22 +263,27 @@ def _yield_all_checks(name, estimator): if hasattr(estimator, 'transform'): for check in _yield_transformer_checks(name, estimator): yield check + if hasattr(estimator, 'fit_resample'): + for check in _yield_resamplers_checks(name, estimator): + yield check if isinstance(estimator, ClusterMixin): for check in _yield_clustering_checks(name, estimator): yield check if is_outlier_detector(estimator): for check in _yield_outliers_checks(name, estimator): yield check - yield check_fit2d_predict1d - yield check_methods_subset_invariance yield check_fit2d_1sample yield check_fit2d_1feature yield check_fit1d yield check_get_params_invariance - yield check_set_params - yield check_dict_unchanged yield check_dont_overwrite_parameters - yield check_fit_idempotent + yield check_set_params + + if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'): + yield check_fit2d_predict1d + yield check_methods_subset_invariance + yield check_dict_unchanged + yield check_fit_idempotent def check_estimator(Estimator): @@ -514,7 +535,10 @@ def check_estimator_sparse_data(name, estimator_orig): # fit and predict try: with ignore_warnings(category=(DeprecationWarning, FutureWarning)): - estimator.fit(X, y) + if hasattr(estimator, "fit"): + estimator.fit(X, y) + if hasattr(estimator, "fit_resample"): + estimator.fit_resample(X, y) if hasattr(estimator, "predict"): pred = estimator.predict(X) if tags['multioutput_only']: @@ -634,50 +658,58 @@ def check_sample_weights_invariance(name, estimator_orig): @ignore_warnings(category=(DeprecationWarning, FutureWarning, UserWarning)) def check_dtype_object(name, estimator_orig): # check that estimators treat dtype object as numeric if possible - rng = np.random.RandomState(0) - X = pairwise_estimator_convert_X(rng.rand(40, 10), estimator_orig) - X = X.astype(object) - tags = _safe_tags(estimator_orig) - if tags['binary_only']: - y = (X[:, 0] * 2).astype(np.int) - else: - y = (X[:, 0] * 4).astype(np.int) - estimator = clone(estimator_orig) - y = enforce_estimator_tags_y(estimator, y) - - estimator.fit(X, y) - if hasattr(estimator, "predict"): - estimator.predict(X) + methods = ['fit', 'fit_resample', 'fit_transform'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + rng = np.random.RandomState(0) + X = pairwise_estimator_convert_X(rng.rand(40, 10), estimator_orig) + X = X.astype(object) + tags = _safe_tags(estimator_orig) + if tags['binary_only']: + y = (X[:, 0] * 2).astype(np.int) + else: + y = (X[:, 0] * 4).astype(np.int) + estimator = clone(estimator_orig) + y = enforce_estimator_tags_y(estimator, y) - if hasattr(estimator, "transform"): - estimator.transform(X) + getattr(estimator, method)(X, y) + if 
hasattr(estimator, "predict"): + estimator.predict(X) - try: - estimator.fit(X, y.astype(object)) - except Exception as e: - if "Unknown label type" not in str(e): - raise + if hasattr(estimator, "transform"): + estimator.transform(X) - if 'string' not in tags['X_types']: - X[0, 0] = {'foo': 'bar'} - msg = "argument must be a string.* number" - assert_raises_regex(TypeError, msg, estimator.fit, X, y) - else: - # Estimators supporting string will not call np.asarray to convert the - # data to numeric and therefore, the error will not be raised. - # Checking for each element dtype in the input array will be costly. - # Refer to #11401 for full discussion. - estimator.fit(X, y) + try: + getattr(estimator, method)(X, y.astype(object)) + except Exception as e: + if "Unknown label type" not in str(e): + raise + + if 'string' not in tags['X_types']: + X[0, 0] = {'foo': 'bar'} + msg = "argument must be a string.* number" + assert_raises_regex( + TypeError, msg, getattr(estimator, method), X, y) + else: + # Estimators supporting string will not call np.asarray to convert + # the data to numeric and therefore, the error will not be raised. + # Checking for each element dtype in the input array will be + # costly. + # Refer to #11401 for full discussion. + getattr(estimator, method)(X, y) def check_complex_data(name, estimator_orig): # check that estimators raise an exception on providing complex data - X = np.random.sample(10) + 1j * np.random.sample(10) - X = X.reshape(-1, 1) - y = np.random.sample(10) + 1j * np.random.sample(10) - estimator = clone(estimator_orig) - assert_raises_regex(ValueError, "Complex data not supported", - estimator.fit, X, y) + methods = ['fit', 'fit_resample', 'fit_transform'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + X = np.random.sample(10) + 1j * np.random.sample(10) + X = X.reshape(-1, 1) + y = np.random.sample(10) + 1j * np.random.sample(10) + estimator = clone(estimator_orig) + assert_raises_regex(ValueError, "Complex data not supported", + getattr(estimator, method), X, y) @ignore_warnings @@ -726,57 +758,62 @@ def is_public_parameter(attr): @ignore_warnings(category=(DeprecationWarning, FutureWarning)) def check_dont_overwrite_parameters(name, estimator_orig): - # check that fit method only changes or sets private attributes - if hasattr(estimator_orig.__init__, "deprecated_original"): - # to not check deprecated classes - return - estimator = clone(estimator_orig) - rnd = np.random.RandomState(0) - X = 3 * rnd.uniform(size=(20, 3)) - X = pairwise_estimator_convert_X(X, estimator_orig) - y = X[:, 0].astype(np.int) - if _safe_tags(estimator, 'binary_only'): - y[y == 2] = 1 - y = enforce_estimator_tags_y(estimator, y) + # check that fit methods only change or set private attributes + methods = ['fit', 'fit_resample', 'fit_transform', 'fit_predict'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + if hasattr(estimator_orig.__init__, "deprecated_original"): + # to not check deprecated classes + return + estimator = clone(estimator_orig) + rnd = np.random.RandomState(0) + X = 3 * rnd.uniform(size=(20, 3)) + X = pairwise_estimator_convert_X(X, estimator_orig) + y = X[:, 0].astype(np.int) + if _safe_tags(estimator, 'binary_only'): + y[y == 2] = 1 + y = enforce_estimator_tags_y(estimator, y) - if hasattr(estimator, "n_components"): - estimator.n_components = 1 - if hasattr(estimator, "n_clusters"): - estimator.n_clusters = 1 + if hasattr(estimator, 
"n_components"): + estimator.n_components = 1 + if hasattr(estimator, "n_clusters"): + estimator.n_clusters = 1 - set_random_state(estimator, 1) - dict_before_fit = estimator.__dict__.copy() - estimator.fit(X, y) + set_random_state(estimator, 1) + dict_before_fit = estimator.__dict__.copy() + + method = getattr(estimator, method) + method(X, y) - dict_after_fit = estimator.__dict__ + dict_after_fit = estimator.__dict__ - public_keys_after_fit = [key for key in dict_after_fit.keys() - if is_public_parameter(key)] + public_keys_after_fit = [key for key in dict_after_fit.keys() + if is_public_parameter(key)] - attrs_added_by_fit = [key for key in public_keys_after_fit - if key not in dict_before_fit.keys()] + attrs_added_by_fit = [key for key in public_keys_after_fit + if key not in dict_before_fit.keys()] - # check that fit doesn't add any public attribute - assert not attrs_added_by_fit, ( - 'Estimator adds public attribute(s) during' - ' the fit method.' - ' Estimators are only allowed to add private attributes' - ' either started with _ or ended' - ' with _ but %s added' - % ', '.join(attrs_added_by_fit)) + # check that fit doesn't add any public attribute + assert not attrs_added_by_fit, ( + 'Estimator adds public attribute(s) during' + ' the fit method.' + ' Estimators are only allowed to add private attributes' + ' either started with _ or ended' + ' with _ but %s added' + % ', '.join(attrs_added_by_fit)) - # check that fit doesn't change any public attribute - attrs_changed_by_fit = [key for key in public_keys_after_fit - if (dict_before_fit[key] - is not dict_after_fit[key])] + # check that fit doesn't change any public attribute + attrs_changed_by_fit = [key for key in public_keys_after_fit + if (dict_before_fit[key] + is not dict_after_fit[key])] - assert not attrs_changed_by_fit, ( - 'Estimator changes public attribute(s) during' - ' the fit method. Estimators are only allowed' - ' to change attributes started' - ' or ended with _, but' - ' %s changed' - % ', '.join(attrs_changed_by_fit)) + assert not attrs_changed_by_fit, ( + 'Estimator changes public attribute(s) during' + ' the fit method. Estimators are only allowed' + ' to change attributes started' + ' or ended with _, but' + ' %s changed' + % ', '.join(attrs_changed_by_fit)) @ignore_warnings(category=(DeprecationWarning, FutureWarning)) @@ -810,6 +847,27 @@ def check_fit2d_predict1d(name, estimator_orig): getattr(estimator, method), X[0]) +def check_fit_resample2d(name, estimator_orig): + # check by fit resampling a 2d array + rnd = np.random.RandomState(0) + X = 3 * rnd.uniform(size=(20, 3)) + X = pairwise_estimator_convert_X(X, estimator_orig) + y = X[:, 0].astype(np.int) + tags = _safe_tags(estimator_orig) + if tags['binary_only']: + y[y == 2] = 1 + estimator = clone(estimator_orig) + y = enforce_estimator_tags_y(estimator, y) + + if hasattr(estimator, "n_components"): + estimator.n_components = 1 + if hasattr(estimator, "n_clusters"): + estimator.n_clusters = 1 + + set_random_state(estimator, 1) + X, y, kw = estimator.fit_resample(X, y) + + def _apply_on_subsets(func, X): # apply function on the whole set and on mini batches result_full = func(X) @@ -873,87 +931,96 @@ def check_fit2d_1sample(name, estimator_orig): # Check that fitting a 2d array with only one sample either works or # returns an informative message. The error message should either mention # the number of samples or the number of classes. 
- rnd = np.random.RandomState(0) - X = 3 * rnd.uniform(size=(1, 10)) - y = X[:, 0].astype(np.int) - estimator = clone(estimator_orig) - y = enforce_estimator_tags_y(estimator, y) + methods = ['fit', 'fit_resample', 'fit_transform', 'fit_predict'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + rnd = np.random.RandomState(0) + X = 3 * rnd.uniform(size=(1, 10)) + y = X[:, 0].astype(np.int) + estimator = clone(estimator_orig) + y = enforce_estimator_tags_y(estimator, y) - if hasattr(estimator, "n_components"): - estimator.n_components = 1 - if hasattr(estimator, "n_clusters"): - estimator.n_clusters = 1 + if hasattr(estimator, "n_components"): + estimator.n_components = 1 + if hasattr(estimator, "n_clusters"): + estimator.n_clusters = 1 - set_random_state(estimator, 1) + set_random_state(estimator, 1) - # min_cluster_size cannot be less than the data size for OPTICS. - if name == 'OPTICS': - estimator.set_params(min_samples=1) + # min_cluster_size cannot be less than the data size for OPTICS. + if name == 'OPTICS': + estimator.set_params(min_samples=1) - msgs = ["1 sample", "n_samples = 1", "n_samples=1", "one sample", - "1 class", "one class"] + msgs = ["1 sample", "n_samples = 1", "n_samples=1", "one sample", + "1 class", "one class"] - try: - estimator.fit(X, y) - except ValueError as e: - if all(msg not in repr(e) for msg in msgs): - raise e + try: + getattr(estimator, method)(X, y) + except ValueError as e: + if all(msg not in repr(e) for msg in msgs): + raise e @ignore_warnings def check_fit2d_1feature(name, estimator_orig): # check fitting a 2d array with only 1 feature either works or returns # informative message - rnd = np.random.RandomState(0) - X = 3 * rnd.uniform(size=(10, 1)) - X = pairwise_estimator_convert_X(X, estimator_orig) - y = X[:, 0].astype(np.int) - estimator = clone(estimator_orig) - y = enforce_estimator_tags_y(estimator, y) + methods = ['fit', 'fit_resample', 'fit_transform', 'fit_predict'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + rnd = np.random.RandomState(0) + X = 3 * rnd.uniform(size=(10, 1)) + X = pairwise_estimator_convert_X(X, estimator_orig) + y = X[:, 0].astype(np.int) + estimator = clone(estimator_orig) + y = enforce_estimator_tags_y(estimator, y) - if hasattr(estimator, "n_components"): - estimator.n_components = 1 - if hasattr(estimator, "n_clusters"): - estimator.n_clusters = 1 - # ensure two labels in subsample for RandomizedLogisticRegression - if name == 'RandomizedLogisticRegression': - estimator.sample_fraction = 1 - # ensure non skipped trials for RANSACRegressor - if name == 'RANSACRegressor': - estimator.residual_threshold = 0.5 + if hasattr(estimator, "n_components"): + estimator.n_components = 1 + if hasattr(estimator, "n_clusters"): + estimator.n_clusters = 1 + # ensure two labels in subsample for RandomizedLogisticRegression + if name == 'RandomizedLogisticRegression': + estimator.sample_fraction = 1 + # ensure non skipped trials for RANSACRegressor + if name == 'RANSACRegressor': + estimator.residual_threshold = 0.5 - y = enforce_estimator_tags_y(estimator, y) - set_random_state(estimator, 1) + y = enforce_estimator_tags_y(estimator, y) + set_random_state(estimator, 1) - msgs = ["1 feature(s)", "n_features = 1", "n_features=1"] + msgs = ["1 feature(s)", "n_features = 1", "n_features=1"] - try: - estimator.fit(X, y) - except ValueError as e: - if all(msg not in repr(e) for msg in msgs): - raise e + try: + getattr(estimator, 
method)(X, y) + except ValueError as e: + if all(msg not in repr(e) for msg in msgs): + raise e @ignore_warnings def check_fit1d(name, estimator_orig): # check fitting 1d X array raises a ValueError - rnd = np.random.RandomState(0) - X = 3 * rnd.uniform(size=(20)) - y = X.astype(np.int) - estimator = clone(estimator_orig) - tags = _safe_tags(estimator) - if tags["no_validation"]: - # FIXME this is a bit loose - return - y = enforce_estimator_tags_y(estimator, y) + methods = ['fit', 'fit_resample', 'fit_transform', 'fit_predict'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + rnd = np.random.RandomState(0) + X = 3 * rnd.uniform(size=(20)) + y = X.astype(np.int) + estimator = clone(estimator_orig) + tags = _safe_tags(estimator) + if tags["no_validation"]: + # FIXME this is a bit loose + return + y = enforce_estimator_tags_y(estimator, y) - if hasattr(estimator, "n_components"): - estimator.n_components = 1 - if hasattr(estimator, "n_clusters"): - estimator.n_clusters = 1 + if hasattr(estimator, "n_components"): + estimator.n_components = 1 + if hasattr(estimator, "n_clusters"): + estimator.n_clusters = 1 - set_random_state(estimator, 1) - assert_raises(ValueError, estimator.fit, X, y) + set_random_state(estimator, 1) + assert_raises(ValueError, getattr(estimator, method), X, y) @ignore_warnings(category=(DeprecationWarning, FutureWarning)) @@ -1147,34 +1214,42 @@ def check_estimators_dtypes(name, estimator_orig): for X_train in [X_train_32, X_train_64, X_train_int_64, X_train_int_32]: estimator = clone(estimator_orig) set_random_state(estimator, 1) - estimator.fit(X_train, y) + if hasattr(estimator, "fit"): + estimator.fit(X_train, y) for method in methods: if hasattr(estimator, method): getattr(estimator, method)(X_train) + if hasattr(estimator, "fit_resample"): + getattr(estimator, "fit_resample")(X_train, y) @ignore_warnings(category=(DeprecationWarning, FutureWarning)) def check_estimators_empty_data_messages(name, estimator_orig): - e = clone(estimator_orig) - set_random_state(e, 1) - - X_zero_samples = np.empty(0).reshape(0, 3) - # The precise message can change depending on whether X or y is - # validated first. Let us test the type of exception only: - with assert_raises(ValueError, msg="The estimator {} does not" - " raise an error when an empty data is used " - "to train. Perhaps use " - "check_array in train.".format(name)): - e.fit(X_zero_samples, []) - - X_zero_features = np.empty(0).reshape(3, 0) - # the following y should be accepted by both classifiers and regressors - # and ignored by unsupervised models - y = enforce_estimator_tags_y(e, np.array([1, 0, 1])) - msg = (r"0 feature\(s\) \(shape=\(3, 0\)\) while a minimum of \d* " - "is required.") - assert_raises_regex(ValueError, msg, e.fit, X_zero_features, y) + methods = ['fit', 'fit_resample', 'fit_transform'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + e = clone(estimator_orig) + set_random_state(e, 1) + + X_zero_samples = np.empty(0).reshape(0, 3) + # The precise message can change depending on whether X or y is + # validated first. Let us test the type of exception only: + with assert_raises(ValueError, msg="The estimator {} does not" + " raise an error when an empty data is used " + "to train. 
Perhaps use " + "check_array in train.".format(name)): + getattr(e, method)(X_zero_samples, []) + + X_zero_features = np.empty(0).reshape(3, 0) + # the following y should be accepted by both classifiers and regressors + # and ignored by unsupervised models + y = enforce_estimator_tags_y(e, np.array([1, 0, 1])) + msg = (r"0 feature\(s\) \(shape=\(\d*, 0\)\) while a minimum of \d* " + "is required.") + assert_raises_regex( + ValueError, msg, getattr(e, method), X_zero_features, y + ) @ignore_warnings(category=DeprecationWarning) @@ -1191,6 +1266,8 @@ def check_estimators_nan_inf(name, estimator_orig): y[:5] = 0 y = enforce_estimator_tags_y(estimator_orig, y) error_string_fit = "Estimator doesn't check for NaN and inf in fit." + error_string_fit_resample = ("Estimator doesn't check for NaN and inf in" + " fit_resample.") error_string_predict = ("Estimator doesn't check for NaN and inf in" " predict.") error_string_transform = ("Estimator doesn't check for NaN and inf in" @@ -1200,22 +1277,40 @@ def check_estimators_nan_inf(name, estimator_orig): with ignore_warnings(category=(DeprecationWarning, FutureWarning)): estimator = clone(estimator_orig) set_random_state(estimator, 1) - # try to fit - try: - estimator.fit(X_train, y) - except ValueError as e: - if 'inf' not in repr(e) and 'NaN' not in repr(e): - print(error_string_fit, estimator, e) + + if (not hasattr(estimator, 'fit_resample') + or hasattr(estimator, 'fit')): + # try to fit + try: + estimator.fit(X_train, y) + except ValueError as e: + if 'inf' not in repr(e) and 'NaN' not in repr(e): + print(error_string_fit, estimator, e) + traceback.print_exc(file=sys.stdout) + raise e + except Exception as exc: + print(error_string_fit, estimator, exc) traceback.print_exc(file=sys.stdout) - raise e - except Exception as exc: - print(error_string_fit, estimator, exc) - traceback.print_exc(file=sys.stdout) - raise exc - else: - raise AssertionError(error_string_fit, estimator) - # actually fit - estimator.fit(X_train_finite, y) + raise exc + else: + raise AssertionError(error_string_fit, estimator) + # actually fit + estimator.fit(X_train_finite, y) + + # fit_resample + if hasattr(estimator, "fit_resample"): + try: + estimator.fit_resample(X_train, y) + except ValueError as e: + if 'inf' not in repr(e) and 'NaN' not in repr(e): + print(error_string_predict, estimator, e) + traceback.print_exc(file=sys.stdout) + raise e + except Exception as exc: + print(error_string_predict, estimator, exc) + traceback.print_exc(file=sys.stdout) + else: + raise AssertionError(error_string_fit_resample, estimator) # predict if hasattr(estimator, "predict"): @@ -2020,41 +2115,47 @@ def check_class_weight_balanced_linear_classifier(name, Classifier): @ignore_warnings(category=(DeprecationWarning, FutureWarning)) def check_estimators_overwrite_params(name, estimator_orig): - if _safe_tags(estimator_orig, 'binary_only'): - n_centers = 2 - else: - n_centers = 3 - X, y = make_blobs(random_state=0, n_samples=9, centers=n_centers) - # some want non-negative input - X -= X.min() - X = pairwise_estimator_convert_X(X, estimator_orig, kernel=rbf_kernel) - estimator = clone(estimator_orig) - y = enforce_estimator_tags_y(estimator, y) - - set_random_state(estimator) - - # Make a physical copy of the original estimator parameters before fitting. 
- params = estimator.get_params() - original_params = deepcopy(params) - - # Fit the model - estimator.fit(X, y) + methods = ['fit', 'fit_resample', 'fit_transform'] + methods = filter(lambda method: hasattr(estimator_orig, method), methods) + for method in methods: + if _safe_tags(estimator_orig, 'binary_only'): + n_centers = 2 + else: + n_centers = 3 + X, y = make_blobs(random_state=0, n_samples=9, centers=n_centers) + # some want non-negative input + X -= X.min() + X = pairwise_estimator_convert_X(X, estimator_orig, kernel=rbf_kernel) + estimator = clone(estimator_orig) + y = enforce_estimator_tags_y(estimator, y) - # Compare the state of the model parameters with the original parameters - new_params = estimator.get_params() - for param_name, original_value in original_params.items(): - new_value = new_params[param_name] + set_random_state(estimator) - # We should never change or mutate the internal state of input - # parameters by default. To check this we use the joblib.hash function - # that introspects recursively any subobjects to compute a checksum. - # The only exception to this rule of immutable constructor parameters - # is possible RandomState instance but in this check we explicitly - # fixed the random_state params recursively to be integer seeds. - assert joblib.hash(new_value) == joblib.hash(original_value), ( - "Estimator %s should not change or mutate " - " the parameter %s from %s to %s during fit." - % (name, param_name, original_value, new_value)) + # Make a physical copy of the original estimator parameters before + # fitting. + params = estimator.get_params() + original_params = deepcopy(params) + + # Fit the model + getattr(estimator, method)(X, y) + + # Compare the state of the model parameters with the original + # parameters + new_params = estimator.get_params() + for param_name, original_value in original_params.items(): + new_value = new_params[param_name] + + # We should never change or mutate the internal state of input + # parameters by default. To check this we use the joblib.hash + # function that introspects recursively any subobjects to compute a + # checksum. The only exception to this rule of immutable + # constructor parameters is possible RandomState instance but in + # this check we explicitly fixed the random_state params + # recursively to be integer seeds. + assert joblib.hash(new_value) == joblib.hash(original_value), ( + "Estimator %s should not change or mutate " + " the parameter %s from %s to %s during fit." 
+ % (name, param_name, original_value, new_value)) def check_no_attributes_set_in_init(name, estimator): @@ -2500,6 +2601,7 @@ def check_fit_idempotent(name, estimator_orig): for method in check_methods: if hasattr(estimator, method): new_result = getattr(estimator, method)(X_test) + if np.issubdtype(new_result.dtype, np.floating): tol = 2*np.finfo(new_result.dtype).eps else: @@ -2509,3 +2611,51 @@ def check_fit_idempotent(name, estimator_orig): atol=max(tol, 1e-9), rtol=max(tol, 1e-7), err_msg="Idempotency check failed for method {}".format(method) ) + + +def check_outlier_rejectors(name, estimator_orig): + X, y = make_blobs(random_state=0) + outliers = estimator_orig.fit_predict(X, y) == -1 + n_outliers = np.sum(outliers) + + X_new, y_new, kws = estimator_orig.fit_resample(X, y) + + assert X_new.shape[0] == X.shape[0] - n_outliers + assert y_new.shape[0] == y.shape[0] - n_outliers + + +def check_resampler_structure(name, estimator_orig): + X, y = make_blobs(n_samples=10) + X_new, y_new, kw = estimator_orig.fit_resample(X, y) + + +def check_resample_fails_on_non_matching_shapes(): + # check that resamplers enforce matching shapes between kwargs, X and y + pass + + +def check_resample_resamples_kwargs(): + pass + + +def check_resample_repeated(name, estimator_orig): + X, y = make_classification( + n_classes=2, + weights=[0.1, 0.9], + n_features=20, + n_clusters_per_class=1, + n_samples=50, + random_state=0) + + set_random_state(estimator_orig, random_state=0) + X_new, y_new, kw = estimator_orig.fit_resample(X, y) + set_random_state(estimator_orig, random_state=0) + X_new2, y_new2, kw = estimator_orig.fit_resample(X, y) + + assert_array_equal(X_new, X_new2) + assert_array_equal(y_new, y_new2) + + +def check_resamplers_have_no_transform(name, estimator_orig): + assert not hasattr(estimator_orig, 'transform') + assert not hasattr(estimator_orig, 'fit_transform') diff --git a/sklearn/utils/validation.py b/sklearn/utils/validation.py index bb6cf1c8ffe00..110cc1c6e82ea 100644 --- a/sklearn/utils/validation.py +++ b/sklearn/utils/validation.py @@ -596,6 +596,131 @@ def _check_large_sparse(X, accept_large_sparse=False): % indices_datatype) +def check_X_y_kwargs(X, y, kwargs, accept_sparse=False, + accept_large_sparse=True, dtype="numeric", order=None, + copy=False, force_all_finite=True, ensure_2d=True, + allow_nd=False, multi_output=False, ensure_min_samples=1, + ensure_min_features=1, y_numeric=False, + estimator=None): + """Input validation for standard estimators. + + Checks X, y and all kwargs for consistent length, enforces X to be 2D and y + and kwargs 1D. By default, X is checked to be non-empty and containing only + finite values. Standard input checks are also applied to y, such as + checking that y does not have np.nan or np.inf targets. For multi-label y, + set multi_output=True to allow 2D and sparse y. If the dtype of X is + object, attempt converting to float, raising on failure. + + Further, kwargs are checked to not have np.nan or np.inf. + + Parameters + ---------- + X : nd-array, list or sparse matrix + Input data. + + y : nd-array, list or sparse matrix + Labels. + + accept_sparse : string, boolean or list of string (default=False) + String[s] representing allowed sparse matrix formats, such as 'csc', + 'csr', etc. If the input is sparse but not in the allowed format, + it will be converted to the first listed format. True allows the input + to be any format. False means that a sparse matrix input will + raise an error. 
+ + accept_large_sparse : bool (default=True) + If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by + accept_sparse, accept_large_sparse will cause it to be accepted only + if its indices are stored with a 32-bit dtype. + + dtype : string, type, list of types or None (default="numeric") + Data type of result. If None, the dtype of the input is preserved. + If "numeric", dtype is preserved unless array.dtype is object. + If dtype is a list of types, conversion on the first type is only + performed if the dtype of the input is not in the list. + + order : 'F', 'C' or None (default=None) + Whether an array will be forced to be fortran or c-style. + + copy : boolean (default=False) + Whether a forced copy will be triggered. If copy=False, a copy might + be triggered by a conversion. + + force_all_finite : boolean or 'allow-nan', (default=True) + Whether to raise an error on np.inf and np.nan in X. This parameter + does not influence whether y can have np.inf or np.nan values. + The possibilities are: + + - True: Force all values of X to be finite. + - False: accept both np.inf and np.nan in X. + - 'allow-nan': accept only np.nan values in X. Values cannot be + infinite. + + ensure_2d : boolean (default=True) + Whether to raise a value error if X is not 2D. + + allow_nd : boolean (default=False) + Whether to allow X.ndim > 2. + + multi_output : boolean (default=False) + Whether to allow 2D y (array or sparse matrix). If false, y will be + validated as a vector. y cannot have np.nan or np.inf values if + multi_output=True. + + ensure_min_samples : int (default=1) + Make sure that X has a minimum number of samples in its first + axis (rows for a 2D array). + + ensure_min_features : int (default=1) + Make sure that the 2D array has some minimum number of features + (columns). The default value of 1 rejects empty datasets. + This check is only enforced when X has effectively 2 dimensions or + is originally 1D and ``ensure_2d`` is True. Setting to 0 disables + this check. + + y_numeric : boolean (default=False) + Whether to ensure that y has a numeric type. If dtype of y is object, + it is converted to float64. Should only be used for regression + algorithms. + + estimator : str or estimator instance (default=None) + If passed, include the name of the estimator in warning messages. + + Returns + ------- + X_converted : object + The converted and validated X. + + y_converted : object + The converted and validated y. + + kwargs_converted: dict of string -> object + The converted and validated kwargs + """ + X_converted, y_converted = check_X_y( + X, y, + accept_sparse=accept_sparse, accept_large_sparse=accept_large_sparse, + dtype=dtype, order=order, copy=copy, force_all_finite=force_all_finite, + ensure_2d=ensure_2d, allow_nd=allow_nd, multi_output=multi_output, + ensure_min_samples=ensure_min_samples, + ensure_min_features=ensure_min_features, y_numeric=y_numeric, + estimator=estimator + ) + kwargs_converted = { + kw: check_array( + kwargs[kw], force_all_finite=True, dtype="numeric", + ensure_2d=False + ) + for kw in kwargs + } + check_consistent_length( + X_converted, y_converted, *(kwargs_converted[kw] for kw in + kwargs_converted) + ) + + return X_converted, y_converted, kwargs_converted + + def check_X_y(X, y, accept_sparse=False, accept_large_sparse=True, dtype="numeric", order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, multi_output=False,
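[Editor's note] A quick sketch of the helper's behaviour as documented above, assuming the ``check_X_y_kwargs`` function added in this diff; the values are made up:

import numpy as np
from sklearn.utils import check_X_y_kwargs

X = [[0., 1.], [1., 0.], [2., 2.]]
y = [0, 1, 1]
kwargs = {"sample_weight": [0.5, 1.0, 1.0]}

# All three outputs are validated, converted to arrays and length-checked.
X, y, kwargs = check_X_y_kwargs(X, y, kwargs)
print(X.shape, y.shape, kwargs["sample_weight"].shape)  # (3, 2) (3,) (3,)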