
[MRG] Add n_features_in_ attribute to BaseEstimator #13603


Status: Closed · wants to merge 50 commits

Commits
7d9dcc4
Basic validate_X and validate_X_y methods for _n_features_in attribute
NicolasHug Apr 9, 2019
f117745
created NonRectangularInputMixin
NicolasHug Apr 19, 2019
95b330c
Merge remote-tracking branch 'upstream/master' into n_features_in
NicolasHug Apr 19, 2019
e56592b
resolved conflicts
NicolasHug Apr 19, 2019
3bdcb5c
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug May 30, 2019
8ecc690
_validate** is not private
NicolasHug May 30, 2019
60e4cea
Added support for pipeline and grid search
NicolasHug May 30, 2019
ff19f22
pep8
NicolasHug May 31, 2019
a44318b
Trigger CI??
NicolasHug May 31, 2019
42249fb
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Jun 26, 2019
abdc94e
Added to decision tree for gridsearch tests to pass
NicolasHug Jun 26, 2019
a50e76f
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Aug 1, 2019
62fc42e
Added support for ColumnTransformer and FeatureUnion
NicolasHug Aug 1, 2019
6845788
pep8
NicolasHug Aug 1, 2019
3246436
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Aug 7, 2019
ee2598b
BaseSearchCV now raises AttributeError
NicolasHug Aug 12, 2019
6a14e4b
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Aug 19, 2019
3f2d44f
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 2, 2019
25fda0f
Added common test + used _validate_XXX on most estimators
NicolasHug Sep 2, 2019
9bdfb65
Fixed some test
NicolasHug Sep 2, 2019
be76ef4
fixed issues for some estimators
NicolasHug Sep 4, 2019
b464f86
Merge branch 'n_features_in' of github.com:NicolasHug/scikit-learn; b…
NicolasHug Sep 5, 2019
70dc4ed
fixed tests in test_data.py
NicolasHug Sep 5, 2019
988f9c4
Fixed some tests
NicolasHug Sep 5, 2019
fd9b72c
validate twice for Kmeans and FastICA
NicolasHug Sep 5, 2019
4f3d6ff
again
NicolasHug Sep 5, 2019
08f7192
and again
NicolasHug Sep 5, 2019
5a41275
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 6, 2019
f0e7b41
should fix dep warning error
NicolasHug Sep 6, 2019
193fda1
removed superfluous tests
NicolasHug Sep 8, 2019
5b20a4c
Added specific tests for vectorizers
NicolasHug Sep 8, 2019
a49e5ea
flake8
NicolasHug Sep 8, 2019
968fbff
Dummies now have n_feautures_in_ to None and raise error if not fitted
NicolasHug Sep 9, 2019
e4faf13
still don't check n_features_in_ for LDA (will be done in later PR)
NicolasHug Sep 9, 2019
908aea6
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 11, 2019
a88a4c5
Added tests for some estimators
NicolasHug Sep 11, 2019
f3fb539
removed NonRectangularInputMixin and set n_features_in to SparseCoder
NicolasHug Sep 11, 2019
4b7b758
simpler logic for dummies
NicolasHug Sep 12, 2019
53027d3
comments
NicolasHug Sep 12, 2019
c5dfbbd
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 15, 2019
a1aea70
pep8
NicolasHug Sep 15, 2019
9ecc396
remove print
NicolasHug Sep 15, 2019
e9c3104
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 16, 2019
9292c84
avoid dep warning
NicolasHug Sep 16, 2019
e11b0bb
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 18, 2019
6846bea
merged (maybe)
NicolasHug Sep 19, 2019
60c5108
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 19, 2019
615140e
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 19, 2019
fe052e6
set n_features_in_ for stacking estimators
NicolasHug Sep 19, 2019
9a205dd
dont hardcode attribute in init for sparsecoder
NicolasHug Sep 25, 2019
27 changes: 27 additions & 0 deletions sklearn/base.py
@@ -14,6 +14,8 @@

from . import __version__
from .utils import _IS_32BIT
from .utils.validation import check_X_y
from .utils.validation import check_array

_DEFAULT_TAGS = {
'non_deterministic': False,
@@ -323,6 +325,31 @@ def _get_tags(self):
collected_tags.update(more_tags)
return collected_tags

def _validate_n_features(self, X, check_n_features):
if check_n_features:
if not hasattr(self, 'n_features_in_'):
raise RuntimeError(
"check_n_features is True but there is no n_features_in_ "
"attribute."
)
if X.shape[1] != self.n_features_in_:
raise ValueError(
'X has {} features, but this {} is expecting {} features '
'as input.'.format(X.shape[1], self.__class__.__name__,
self.n_features_in_)
)
else:
self.n_features_in_ = X.shape[1]

def _validate_X(self, X, check_n_features=False, **check_array_params):
Member:
Does using **kwargs make it harder to add and remove parameters to this function?
Seems potentially a bit tricky, right?
I guess the worst thing that could happen is that we add another argument here in a future version, and a user on an older version who tries to pass it will have it land in **check_array_params, where check_array will raise an error?

Is there an issue with making all parameters explicit here?

Member (Author):
I'm not sure I understand your concern.

I can definitely make all the parameters explicit. The only downside is that we have to keep the signature synchronized with that of check_array, but that might be a good thing.

Member:
That would only mean the user hasn't properly specified the minimum sklearn version requirement. I don't think that's something we should worry about.
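For reference, a minimal sketch of the explicit-signature alternative discussed in this thread (hypothetical; only a subset of check_array's parameters is shown):

    def _validate_X(self, X, check_n_features=False, accept_sparse=False,
                    dtype='numeric', copy=False, ensure_min_samples=1):
        # Mirror check_array's parameters explicitly instead of forwarding
        # **kwargs; an unknown keyword then fails at the call site rather
        # than inside check_array.
        X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
                        copy=copy, ensure_min_samples=ensure_min_samples)
        self._validate_n_features(X, check_n_features)
        return X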

X = check_array(X, **check_array_params)
self._validate_n_features(X, check_n_features)
return X

def _validate_X_y(self, X, y, check_n_features=False, **check_X_y_params):
X, y = check_X_y(X, y, **check_X_y_params)
self._validate_n_features(X, check_n_features)
return X, y

class ClassifierMixin:
"""Mixin class for all classifiers in scikit-learn."""
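To show how the two helpers above are meant to interact, here is a minimal sketch with a hypothetical ToyEstimator; the behavior follows the _validate_X and _validate_n_features code in this diff:

    import numpy as np
    from sklearn.base import BaseEstimator

    class ToyEstimator(BaseEstimator):
        def fit(self, X, y=None):
            X = self._validate_X(X)  # records X.shape[1] as n_features_in_
            return self

        def predict(self, X):
            # check_n_features=True enforces the count recorded at fit time
            X = self._validate_X(X, check_n_features=True)
            return np.zeros(X.shape[0])

    est = ToyEstimator().fit(np.ones((10, 3)))
    assert est.n_features_in_ == 3
    est.predict(np.ones((5, 4)))
    # ValueError: X has 4 features, but this ToyEstimator is expecting
    # 3 features as input.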
4 changes: 2 additions & 2 deletions sklearn/calibration.py
@@ -130,8 +130,8 @@ def fit(self, X, y, sample_weight=None):
self : object
Returns an instance of self.
"""
X, y = check_X_y(X, y, accept_sparse=['csc', 'csr', 'coo'],
force_all_finite=False, allow_nd=True)
X, y = self._validate_X_y(X, y, accept_sparse=['csc', 'csr', 'coo'],
force_all_finite=False, allow_nd=True)
X, y = indexable(X, y)
le = LabelBinarizer().fit(y)
self.classes_ = le.classes_
2 changes: 1 addition & 1 deletion sklearn/cluster/affinity_propagation_.py
@@ -372,7 +372,7 @@ def fit(self, X, y=None):
accept_sparse = False
else:
accept_sparse = 'csr'
X = check_array(X, accept_sparse=accept_sparse)
X = self._validate_X(X, accept_sparse=accept_sparse)
if self.affinity == "precomputed":
self.affinity_matrix_ = X
elif self.affinity == "euclidean":
2 changes: 1 addition & 1 deletion sklearn/cluster/bicluster.py
@@ -115,7 +115,7 @@ def fit(self, X, y=None):
y : Ignored

"""
X = check_array(X, accept_sparse='csr', dtype=np.float64)
X = self._validate_X(X, accept_sparse='csr', dtype=np.float64)
self._check_parameters()
self._fit(X)
return self
2 changes: 1 addition & 1 deletion sklearn/cluster/birch.py
@@ -445,7 +445,7 @@ def fit(self, X, y=None):
return self._fit(X)

def _fit(self, X):
X = check_array(X, accept_sparse='csr', copy=self.copy)
X = self._validate_X(X, accept_sparse='csr', copy=self.copy)
threshold = self.threshold
branching_factor = self.branching_factor

2 changes: 1 addition & 1 deletion sklearn/cluster/dbscan_.py
@@ -306,7 +306,7 @@ def fit(self, X, y=None, sample_weight=None):
self

"""
X = check_array(X, accept_sparse='csr')
X = self._validate_X(X, accept_sparse='csr')

if not self.eps > 0.0:
raise ValueError("eps must be positive.")
13 changes: 9 additions & 4 deletions sklearn/cluster/hierarchical.py
@@ -790,7 +790,7 @@ def fit(self, X, y=None):
-------
self
"""
X = check_array(X, ensure_min_samples=2, estimator=self)
X = self._validate_X(X, ensure_min_samples=2, estimator=self)
memory = check_memory(self.memory)

if self.n_clusters is not None and self.n_clusters <= 0:
@@ -1034,9 +1034,14 @@ def fit(self, X, y=None, **params):
-------
self
"""
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
ensure_min_features=2, estimator=self)
return AgglomerativeClustering.fit(self, X.T, **params)
X = self._validate_X(X, accept_sparse=['csr', 'csc', 'coo'],
ensure_min_features=2, estimator=self)
n_features_in_ = self.n_features_in_
AgglomerativeClustering.fit(self, X.T, **params)
# Need to restore n_features_in_ attribute that was overridden in
# AgglomerativeClustering since we passed it X.T.
self.n_features_in_ = n_features_in_
return self

@property
def fit_predict(self):
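To illustrate why the attribute has to be restored (a sketch based on the change above): FeatureAgglomeration clusters features rather than samples, so it fits AgglomerativeClustering on X.T, which would otherwise record the sample count as n_features_in_.

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration

    X = np.random.rand(5, 3)  # 5 samples, 3 features
    agglo = FeatureAgglomeration(n_clusters=2).fit(X)
    # AgglomerativeClustering.fit saw X.T (shape (3, 5)) and set
    # n_features_in_ to 5; the override above restores the correct value.
    assert agglo.n_features_in_ == 3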
9 changes: 5 additions & 4 deletions sklearn/cluster/k_means_.py
@@ -852,8 +852,9 @@ def fit(self, X, y=None, sample_weight=None):

# avoid forcing order when copy_x=False
order = "C" if self.copy_x else None
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
order=order, copy=self.copy_x)
X = self._validate_X(X, accept_sparse='csr',
dtype=[np.float64, np.float32],
order=order, copy=self.copy_x)
# verify that the number of samples given is larger than k
if _num_samples(X) < self.n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
@@ -1497,8 +1498,8 @@ def fit(self, X, y=None, sample_weight=None):

"""
random_state = check_random_state(self.random_state)
X = check_array(X, accept_sparse="csr", order='C',
dtype=[np.float64, np.float32])
X = self._validate_X(X, accept_sparse="csr", order='C',
dtype=[np.float64, np.float32])
n_samples, n_features = X.shape
if n_samples < self.n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d"
2 changes: 1 addition & 1 deletion sklearn/cluster/mean_shift_.py
@@ -414,7 +414,7 @@ def fit(self, X, y=None):
y : Ignored

"""
X = check_array(X)
X = self._validate_X(X)
self.cluster_centers_, self.labels_ = \
mean_shift(X, bandwidth=self.bandwidth, seeds=self.seeds,
min_bin_freq=self.min_bin_freq,
2 changes: 1 addition & 1 deletion sklearn/cluster/optics_.py
@@ -233,7 +233,7 @@ def fit(self, X, y=None):
self : instance of OPTICS
The instance.
"""
X = check_array(X, dtype=np.float)
X = self._validate_X(X, dtype=np.float)

if self.cluster_method not in ['dbscan', 'xi']:
raise ValueError("cluster_method should be one of"
4 changes: 2 additions & 2 deletions sklearn/cluster/spectral.py
@@ -474,8 +474,8 @@ def fit(self, X, y=None):
self

"""
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
dtype=np.float64, ensure_min_samples=2)
X = self._validate_X(X, accept_sparse=['csr', 'csc', 'coo'],
dtype=np.float64, ensure_min_samples=2)
allow_squared = self.affinity in ["precomputed",
"precomputed_nearest_neighbors"]
if X.shape[0] == X.shape[1] and not allow_squared:
11 changes: 11 additions & 0 deletions sklearn/cluster/tests/test_bicluster.py
@@ -256,3 +256,14 @@ def test_wrong_shape():
data = np.arange(27).reshape((3, 3, 3))
with pytest.raises(ValueError):
model.fit(data)


@pytest.mark.parametrize('est',
(SpectralBiclustering(), SpectralCoclustering()))
def test_n_features_in_(est):

X, _, _ = make_biclusters((3, 3), 3, random_state=0)

assert not hasattr(est, 'n_features_in_')
est.fit(X)
assert est.n_features_in_ == 3
3 changes: 3 additions & 0 deletions sklearn/compose/_column_transformer.py
@@ -506,6 +506,8 @@ def fit_transform(self, X, y=None):
else:
self._feature_names_in = None
X = _check_X(X)
# set n_features_in_ attribute
self._validate_n_features(X, check_n_features=False)
self._validate_transformers()
self._validate_column_callables(X)
self._validate_remainder(X)
@@ -579,6 +581,7 @@ def transform(self, X):
'and for transform when using the '
'remainder keyword')

# TODO: also call _validate_n_features(check_n_features=True) in 0.24
self._validate_features(X.shape[1], X_feature_names)
Xs = self._fit_transform(X, None, _transform_one, fitted=True)
self._validate_output(Xs)
15 changes: 15 additions & 0 deletions sklearn/compose/_target.py
@@ -10,6 +10,7 @@
from ..utils.validation import check_is_fitted
from ..utils import check_array, safe_indexing
from ..preprocessing import FunctionTransformer
from ..exceptions import NotFittedError

__all__ = ['TransformedTargetRegressor']

@@ -234,3 +235,17 @@ def predict(self, X):

def _more_tags(self):
return {'poor_score': True, 'no_validation': True}

@property
def n_features_in_(self):
# For consistency with other estimators we raise an AttributeError so
# that hasattr() fails if the estimator isn't fitted.
try:
check_is_fitted(self)
except NotFittedError as nfe:
raise AttributeError(
"{} object has no n_features_in_ attribute."
.format(self.__class__.__name__)
) from nfe

return self.regressor_.n_features_in_
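A quick sketch of the resulting behavior, assuming the wrapped regressor (here LinearRegression) has adopted the new validation helpers: the raising getter makes hasattr() return False until fit, after which the value is delegated to the fitted regressor.

    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    tt = TransformedTargetRegressor(regressor=LinearRegression())
    assert not hasattr(tt, 'n_features_in_')  # getter raises AttributeError

    tt.fit([[1], [2], [3]], [1, 2, 3])
    assert tt.n_features_in_ == 1  # taken from self.regressor_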
12 changes: 12 additions & 0 deletions sklearn/compose/tests/test_column_transformer.py
@@ -1180,3 +1180,15 @@ def test_column_transformer_mask_indexing(array_type):
)
X_trans = column_transformer.fit_transform(X)
assert X_trans.shape == (3, 2)


def test_n_features_in():
# make sure n_features_in_ reflects the number of features in the input
# passed to the ColumnTransformer itself.

X = [[1, 2], [3, 4], [5, 6]]
ct = ColumnTransformer([('a', DoubleTrans(), [0]),
('b', DoubleTrans(), [1])])
assert not hasattr(ct, 'n_features_in_')
ct.fit(X)
assert ct.n_features_in_ == 2
2 changes: 1 addition & 1 deletion sklearn/covariance/empirical_covariance_.py
@@ -191,7 +191,7 @@ def fit(self, X, y=None):
self : object

"""
X = check_array(X)
X = self._validate_X(X)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
6 changes: 3 additions & 3 deletions sklearn/covariance/graph_lasso_.py
@@ -378,8 +378,8 @@ def fit(self, X, y=None):
y : (ignored)
"""
# Covariance does not make sense for a single feature
X = check_array(X, ensure_min_features=2, ensure_min_samples=2,
estimator=self)
X = self._validate_X(X, ensure_min_features=2, ensure_min_samples=2,
estimator=self)

if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
@@ -645,7 +645,7 @@ def fit(self, X, y=None):
y : (ignored)
"""
# Covariance does not make sense for a single feature
X = check_array(X, ensure_min_features=2, estimator=self)
X = self._validate_X(X, ensure_min_features=2, estimator=self)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
2 changes: 1 addition & 1 deletion sklearn/covariance/robust_covariance.py
@@ -636,7 +636,7 @@ def fit(self, X, y=None):
self : object

"""
X = check_array(X, ensure_min_samples=2, estimator='MinCovDet')
X = self._validate_X(X, ensure_min_samples=2, estimator='MinCovDet')
random_state = check_random_state(self.random_state)
n_samples, n_features = X.shape
# check that the empirical covariance is full rank
6 changes: 3 additions & 3 deletions sklearn/covariance/shrunk_covariance_.py
@@ -143,7 +143,7 @@ def fit(self, X, y=None):
self : object

"""
X = check_array(X)
X = self._validate_X(X)
# Not calling the parent object to fit, to avoid a potential
# matrix inversion when setting the precision
if self.assume_centered:
@@ -419,7 +419,7 @@ def fit(self, X, y=None):
"""
# Not calling the parent object to fit, to avoid computing the
# covariance matrix (and potentially the precision)
X = check_array(X)
X = self._validate_X(X)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
@@ -572,7 +572,7 @@ def fit(self, X, y=None):
self : object

"""
X = check_array(X)
X = self._validate_X(X)
# Not calling the parent object to fit, to avoid computing the
# covariance matrix (and potentially the precision)
if self.assume_centered:
8 changes: 4 additions & 4 deletions sklearn/cross_decomposition/pls_.py
@@ -252,8 +252,8 @@ def fit(self, X, Y):

# copy since this will contain the residuals (deflated) matrices
check_consistent_length(X, Y)
X = check_array(X, dtype=np.float64, copy=self.copy,
ensure_min_samples=2)
X = self._validate_X(X, dtype=np.float64, copy=self.copy,
ensure_min_samples=2)
Y = check_array(Y, dtype=np.float64, copy=self.copy, ensure_2d=False)
if Y.ndim == 1:
Y = Y.reshape(-1, 1)
@@ -828,8 +828,8 @@ def fit(self, X, Y):
"""
# copy since this will contain the centered data
check_consistent_length(X, Y)
X = check_array(X, dtype=np.float64, copy=self.copy,
ensure_min_samples=2)
X = self._validate_X(X, dtype=np.float64, copy=self.copy,
ensure_min_samples=2)
Y = check_array(Y, dtype=np.float64, copy=self.copy, ensure_2d=False)
if Y.ndim == 1:
Y = Y.reshape(-1, 1)
8 changes: 6 additions & 2 deletions sklearn/decomposition/dict_learning.py
@@ -1044,6 +1044,10 @@ def fit(self, X, y=None):
"""
return self

@property
def n_features_in_(self):
return self.components_.shape[1]


class DictionaryLearning(SparseCodingMixin, BaseEstimator):
"""Dictionary learning
@@ -1217,7 +1221,7 @@ def fit(self, X, y=None):
Returns the object itself
"""
random_state = check_random_state(self.random_state)
X = check_array(X)
X = self._validate_X(X)
if self.n_components is None:
n_components = X.shape[1]
else:
@@ -1423,7 +1427,7 @@ def fit(self, X, y=None):
Returns the instance itself.
"""
random_state = check_random_state(self.random_state)
X = check_array(X)
X = self._validate_X(X)

U, (A, B), self.n_iter_ = dict_learning_online(
X, self.n_components, self.alpha,
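Since SparseCoder receives its dictionary at construction (components_ is set in __init__ at this point in the codebase, per the commits above), the n_features_in_ property is available without calling fit; a minimal sketch:

    import numpy as np
    from sklearn.decomposition import SparseCoder

    D = np.eye(4)  # dictionary: 4 atoms, each with 4 features
    coder = SparseCoder(dictionary=D)
    assert coder.n_features_in_ == 4  # components_.shape[1]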
2 changes: 1 addition & 1 deletion sklearn/decomposition/factor_analysis.py
@@ -167,7 +167,7 @@ def fit(self, X, y=None):
-------
self
"""
X = check_array(X, copy=self.copy, dtype=np.float64)
X = self._validate_X(X, copy=self.copy, dtype=np.float64)

n_samples, n_features = X.shape
n_components = self.n_components
5 changes: 5 additions & 0 deletions sklearn/decomposition/fastica_.py
@@ -501,6 +501,11 @@ def _fit(self, X, compute_sources=False):
-------
X_new : array-like, shape (n_samples, n_components)
"""

# This validates twice, but there is no clean way to avoid validating
# again in fastica(). Please see issue 14897.
self._validate_X(X, copy=self.whiten, dtype=FLOAT_DTYPES,
                 ensure_min_samples=2)
fun_args = {} if self.fun_args is None else self.fun_args
whitening, unmixing, sources, X_mean, self.n_iter_ = fastica(
X=X, n_components=self.n_components, algorithm=self.algorithm,
4 changes: 2 additions & 2 deletions sklearn/decomposition/incremental_pca.py
@@ -192,8 +192,8 @@ def fit(self, X, y=None):
self.singular_values_ = None
self.noise_variance_ = None

X = check_array(X, accept_sparse=['csr', 'csc', 'lil'],
copy=self.copy, dtype=[np.float64, np.float32])
X = self._validate_X(X, accept_sparse=['csr', 'csc', 'lil'],
copy=self.copy, dtype=[np.float64, np.float32])
n_samples, n_features = X.shape

if self.batch_size is None:
2 changes: 1 addition & 1 deletion sklearn/decomposition/kernel_pca.py
@@ -271,7 +271,7 @@ def fit(self, X, y=None):
self : object
Returns the instance itself.
"""
X = check_array(X, accept_sparse='csr', copy=self.copy_X)
X = self._validate_X(X, accept_sparse='csr', copy=self.copy_X)
self._centerer = KernelCenterer()
K = self._get_kernel(X)
self._fit_transform(K)