
MNT Introduction of n_features_in_ attr with _validate_data mtd #16112


Merged: 78 commits merged into scikit-learn:master from NicolasHug:n_features_in on Feb 29, 2020
Changes from all commits (78 commits):
7d9dcc4
Basic validate_X and validate_X_y methods for _n_features_in attribute
NicolasHug Apr 9, 2019
f117745
created NonRectangularInputMixin
NicolasHug Apr 19, 2019
95b330c
Merge remote-tracking branch 'upstream/master' into n_features_in
NicolasHug Apr 19, 2019
e56592b
resolved conflicts
NicolasHug Apr 19, 2019
3bdcb5c
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug May 30, 2019
8ecc690
_validate** is not private
NicolasHug May 30, 2019
60e4cea
Added support for pipeline and grid search
NicolasHug May 30, 2019
ff19f22
pep8
NicolasHug May 31, 2019
a44318b
Trigger CI??
NicolasHug May 31, 2019
42249fb
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Jun 26, 2019
abdc94e
Added to decision tree for gridsearch tests to pass
NicolasHug Jun 26, 2019
a50e76f
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Aug 1, 2019
62fc42e
Added support for ColumnTransformer and FeatureUnion
NicolasHug Aug 1, 2019
6845788
pep8
NicolasHug Aug 1, 2019
3246436
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Aug 7, 2019
ee2598b
BaseSearchCV now raises AttributeError
NicolasHug Aug 12, 2019
6a14e4b
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Aug 19, 2019
3f2d44f
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 2, 2019
25fda0f
Added common test + used _validate_XXX on most estimators
NicolasHug Sep 2, 2019
9bdfb65
Fixed some test
NicolasHug Sep 2, 2019
be76ef4
fixed issues for some estimators
NicolasHug Sep 4, 2019
b464f86
Merge branch 'n_features_in' of github.com:NicolasHug/scikit-learn; b…
NicolasHug Sep 5, 2019
70dc4ed
fixed tests in test_data.py
NicolasHug Sep 5, 2019
988f9c4
Fixed some tests
NicolasHug Sep 5, 2019
fd9b72c
validate twice for Kmeans and FastICA
NicolasHug Sep 5, 2019
4f3d6ff
again
NicolasHug Sep 5, 2019
08f7192
and again
NicolasHug Sep 5, 2019
5a41275
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 6, 2019
f0e7b41
should fix dep warning error
NicolasHug Sep 6, 2019
193fda1
removed superfluous tests
NicolasHug Sep 8, 2019
5b20a4c
Added specific tests for vectorizers
NicolasHug Sep 8, 2019
a49e5ea
flake8
NicolasHug Sep 8, 2019
968fbff
Dummies now have n_feautures_in_ to None and raise error if not fitted
NicolasHug Sep 9, 2019
e4faf13
still don't check n_features_in_ for LDA (will be done in later PR)
NicolasHug Sep 9, 2019
908aea6
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 11, 2019
a88a4c5
Added tests for some estimators
NicolasHug Sep 11, 2019
f3fb539
removed NonRectangularInputMixin and set n_features_in to SparseCoder
NicolasHug Sep 11, 2019
4b7b758
simpler logic for dummies
NicolasHug Sep 12, 2019
53027d3
comments
NicolasHug Sep 12, 2019
c5dfbbd
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 15, 2019
a1aea70
pep8
NicolasHug Sep 15, 2019
9ecc396
remove print
NicolasHug Sep 15, 2019
e9c3104
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 16, 2019
9292c84
avoid dep warning
NicolasHug Sep 16, 2019
e11b0bb
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 18, 2019
6846bea
merged (maybe)
NicolasHug Sep 19, 2019
60c5108
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 19, 2019
615140e
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Sep 19, 2019
fe052e6
set n_features_in_ for stacking estimators
NicolasHug Sep 19, 2019
9a205dd
dont hardcode attribute in init for sparsecoder
NicolasHug Sep 25, 2019
da6a15f
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Jan 13, 2020
84e62ce
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Jan 13, 2020
0e81156
more merge
NicolasHug Jan 13, 2020
d4d92bc
fixed some bugs
NicolasHug Jan 13, 2020
40cd141
fixed more bugs
NicolasHug Jan 14, 2020
b3251fe
fixed warnings
NicolasHug Jan 14, 2020
7e73a24
use _validate_data() method
NicolasHug Jan 14, 2020
2f448aa
fixed columntransformer issue
NicolasHug Jan 14, 2020
9e25594
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Jan 14, 2020
a6a344d
comments
NicolasHug Jan 14, 2020
9e0c3d7
minor renaming
NicolasHug Jan 14, 2020
edc6545
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Jan 31, 2020
d7963e7
Apply suggestions from code review
NicolasHug Jan 31, 2020
d344f04
Merge branch 'n_features_in' of github.com:NicolasHug/scikit-learn in…
NicolasHug Jan 31, 2020
d6f0451
addressed most comments
NicolasHug Jan 31, 2020
b917d72
Better comments
NicolasHug Jan 31, 2020
08a8d72
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Feb 5, 2020
4fb756e
pep8
NicolasHug Feb 5, 2020
d62394b
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Feb 7, 2020
511c395
Addressed comments and raise warning instad of error
NicolasHug Feb 7, 2020
c270884
Added whatsnew
NicolasHug Feb 7, 2020
39d8371
Updated estimator API
NicolasHug Feb 7, 2020
9effdbf
formulation
NicolasHug Feb 7, 2020
4f8ca86
Comment about estimator API
NicolasHug Feb 10, 2020
a1045e0
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Feb 10, 2020
2c70b1b
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Feb 11, 2020
a101d2d
Updated changelog
NicolasHug Feb 11, 2020
dd983c4
Merge branch 'master' of github.com:scikit-learn/scikit-learn into n_…
NicolasHug Feb 29, 2020
11 changes: 11 additions & 0 deletions doc/developers/develop.rst
@@ -226,6 +226,17 @@ the dataset, e.g. when ``X`` is a precomputed kernel matrix. Specifically,
 the :term:`_pairwise` property is used by ``utils.metaestimators._safe_split``
 to slice rows and columns.
 
+Universal attributes
+^^^^^^^^^^^^^^^^^^^^
+
+Estimators that expect tabular input should set an `n_features_in_`
+attribute at `fit` time to indicate the number of features that the estimator
+expects for subsequent calls to `predict` or `transform`.
+See
+`SLEP010
+<https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html>`_
+for details.
+
 .. _rolling_your_own_estimator:
 
 Rolling your own estimator
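To make the documented contract concrete, here is a minimal sketch (not part of the diff) of a third-party estimator honouring it; the estimator and its internals are hypothetical:

import numpy as np
from sklearn.base import BaseEstimator

class MeanPredictor(BaseEstimator):
    """Hypothetical estimator following the contract described above."""

    def fit(self, X, y):
        X = np.asarray(X)
        # record the number of features seen during fit
        self.n_features_in_ = X.shape[1]
        self.mean_ = np.asarray(y).mean()
        return self

    def predict(self, X):
        X = np.asarray(X)
        # subsequent calls must pass data with the same number of features
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                "X has {} features, but this estimator is expecting {} "
                "features as input.".format(X.shape[1], self.n_features_in_)
            )
        return np.full(X.shape[0], self.mean_)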
10 changes: 9 additions & 1 deletion doc/whats_new/v0.23.rst
@@ -14,7 +14,6 @@ Version 0.23.0

 Put the changes in their relevant module.
 
-
 Changed models
 --------------
 
@@ -378,3 +377,12 @@ Changelog
 - |Fix| :class:`cluster.AgglomerativeClustering` add specific error when
   distance matrix is not square and `affinity=precomputed`.
   :pr:`16257` by :user:`Simona Maggio <simonamaggio>`.
+
+Miscellaneous
+.............
+
+- |API| Most estimators now expose an `n_features_in_` attribute. This
+  attribute is equal to the number of features passed to the `fit` method.
+  See `SLEP010
+  <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html>`_
+  for details. :pr:`16112` by `Nicolas Hug`_.
68 changes: 68 additions & 0 deletions sklearn/base.py
@@ -18,6 +18,8 @@

 from . import __version__
 from .utils import _IS_32BIT
+from .utils.validation import check_X_y
+from .utils.validation import check_array
 
 _DEFAULT_TAGS = {
     'non_deterministic': False,
@@ -343,6 +345,72 @@ def _get_tags(self):
                 collected_tags.update(more_tags)
         return collected_tags

+    def _check_n_features(self, X, reset):
+        """Set the `n_features_in_` attribute, or check against it.
+
+        Parameters
+        ----------
+        X : {ndarray, sparse matrix} of shape (n_samples, n_features)
+            The input samples.
+        reset : bool
+            If True, the `n_features_in_` attribute is set to `X.shape[1]`.
+            Else, the attribute must already exist and the function checks
+            that it is equal to `X.shape[1]`.
+        """
+        n_features = X.shape[1]
+
+        if reset:
+            self.n_features_in_ = n_features
+        else:
+            if not hasattr(self, 'n_features_in_'):
Member: I always wonder if I should write it the way you have it here, or try to access the attribute and catch the AttributeError exception.

NicolasHug (Author): I try to avoid try/except when I can, but that's personal preference. I guess hasattr exists for a reason, so we might as well just use it. It might also be faster, since it's written in C.

Member: I always get the same question. FYI, I quite like the discussion here: https://codecalamity.com/is-it-easier-to-ask-forgiveness-in-python/#when-to-use-what
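For reference, a minimal illustration of the two idioms being discussed (the class and messages are made up for the example):

class _Demo:

    def n_features_lbyl(self):
        # LBYL ("look before you leap"), the style used in this PR
        if not hasattr(self, 'n_features_in_'):
            raise RuntimeError("This estimator is not fitted yet.")
        return self.n_features_in_

    def n_features_eafp(self):
        # EAFP ("easier to ask forgiveness than permission")
        try:
            return self.n_features_in_
        except AttributeError as exc:
            raise RuntimeError("This estimator is not fitted yet.") from exc

Both behave identically here; the difference is mostly stylistic, though the EAFP version avoids looking the attribute up twice.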

+                raise RuntimeError(
+                    "The reset parameter is False but there is no "
+                    "n_features_in_ attribute. Is this estimator fitted?"
+                )
+            if n_features != self.n_features_in_:
+                raise ValueError(
+                    'X has {} features, but this {} is expecting {} features '
+                    'as input.'.format(n_features, self.__class__.__name__,
+                                       self.n_features_in_)
+                )
+
+    def _validate_data(self, X, y=None, reset=True, **check_params):
+        """Validate input data and set or check the `n_features_in_` attribute.
+
+        Parameters
+        ----------
+        X : {array-like, sparse matrix, dataframe} of shape \
+                (n_samples, n_features)
+            The input samples.
+        y : array-like of shape (n_samples,), default=None
+            The targets. If None, `check_array` is called on `X`; otherwise
+            `check_X_y` is called on `X` and `y`.
Member: Previously, check_X_y would enforce that y was not None in many estimators. What happens if the user passes y=None now (e.g. through a Pipeline where y is not a required parameter)?

Member: I think the implication is that now this will break in tuple unpacking.

Member:

> I think the implication is that now this will break in tuple unpacking.

I don't understand this. Tuple unpacking of which function call, and for what kind of input?

NicolasHug (Author) on Feb 7, 2020: I think you're correct, Joel.

Calls to X, y = check_X_y(X, y) have been replaced by X, y = _validate_data(X, y).

Calling fit(X, y=None) will now go through check_array instead of check_X_y because y is None (see the code below), and only X will be returned instead of X, y.

So instead of the nicer error in check_X_y ("y cannot be None"), we now get a tuple unpacking error.

I'm not sure what to do here, apart from e.g. adding a new parameter and replacing check_X_y(X, y) with X, y = _validate_data(X, y, force_y_not_none=True).

Member: +1 overall, but rather than force_y_not_none, may I suggest y_ignored=True/False or y_required=True/False?

The "force" terminology makes me think that the validation method can automatically convert y into something that is not None, while here it will raise a ValueError if y is None.

Member:

> It could be weird to start using estimator tags for this use-case. Up to now, we used estimator tags to run some specific common tests.

Inheritance then?

How does forcing the caller to pass y help us check when the user has indirectly provided a null y? Do you rather mean that _validate_data should always return an X, y pair? That would be good in some ways: it would produce a better error message than tuple unpacking.

NicolasHug (Author): Indeed, my solution is useless on its own; we would still need the y_required parameter.

I guess inheritance could almost work, i.e. using our current Mixins, but how do we deal with meta-estimators?

I feel like a tag could work here. We can define a requires_y tag which defaults to True and which is overridden to False in OutlierMixin, [Bi]ClusterMixin, and TransformerMixin.

For the meta-estimators, we can have ad-hoc handling, e.g. for pipelines that would be something like tags['requires_y'] = any(est._get_tags()['requires_y'] for name, est in self.steps).

That's not a trivial change though, and it should be done in another PR which should be merged first.

NicolasHug (Author):

> It could be weird to start using estimator tags for this use-case. Up to now, we used estimator tags to run some specific common tests. Here, I have the impression that we want to use the tag such that the estimator will change its internal validation, isn't it?

I don't think the proposal above would change the internal validation. With my proposal above, we can add a common check that makes sure a proper error is raised when requires_y is True but None is passed.

My main concerns with using estimator tags outside of estimator checks were that it was impossible to override them in child classes more than once (fixed now), and that it required a deprecation cycle / warnings to give third-party developers time to adapt (#13565 (comment)). But that's a non-issue here: we're already introducing a common check with warnings for 2 cycles, so we might as well have 2 of them.

Member:

> It could be weird to start using estimator tags for this use-case. Up to now, we used estimator tags to run some specific common tests.
>
> Inheritance then?

I think it would be good if there was a programmatic way to figure out whether an estimator requires y (cc @thomasjpfan). I think it's a valid question whether we want to use estimator tags for that. While it would be the first time we use estimator tags in an estimator, I think it's a good use. Right now, we actually use _estimator_type when we call check_cv. While _estimator_type is not an estimator tag, I think it actually should be.

Whether we want to make this decision a prereq for this PR is another question. So maybe just always returning X, y is the simpler solution? I would be OK with either.

I opened #16468 for tracking adding a tag on requiring y for fitting.

NicolasHug (Author): The only concern I have with returning both X and y is that this is a design we would not have come up with if we hadn't encountered this issue. So it's more of a workaround than a deliberate design choice. Not strongly against it, though.
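For concreteness, a rough sketch of how the tag idea floated above could be consumed (hypothetical: no requires_y tag existed at the time of this PR; #16468 was opened to track it):

def _validate_data(self, X, y=None, reset=True, **check_params):
    # hypothetical guard, assuming a requires_y tag defaulting to True
    if y is None and self._get_tags().get('requires_y', True):
        raise ValueError(
            "This estimator requires y to be passed, but y is None."
        )
    ...  # proceed with check_array / check_X_y as in the method body below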

+        reset : bool, default=True
+            Whether to reset the `n_features_in_` attribute.
+            If False, the input will be checked for consistency with data
+            provided when reset was last True.
+        **check_params : kwargs
+            Parameters passed to :func:`sklearn.utils.check_array` or
+            :func:`sklearn.utils.check_X_y`.
+
+        Returns
+        -------
+        out : {ndarray, sparse matrix} or tuple of these
+            The validated input. A tuple is returned if `y` is not None.
+        """
+        if y is None:
+            X = check_array(X, **check_params)
+            out = X
+        else:
+            X, y = check_X_y(X, y, **check_params)
+            out = X, y
Member: Couldn't we change check_X_y to allow y=None while we are at it? Maybe we could add an explicit option named y_required=True/False to raise a ValueError if y is required and y is None.

NicolasHug (Author): At first glance, I'm not sure of the benefits? It would just be a less obvious way to call check_array, which goes against the Zen: "There should be one-- and preferably only one --obvious way to do it."

Member: The main benefit would be an explicit and standard check_params entry for estimators to specify when they accept None as a valid value for y (e.g. most unsupervised transformers) and when y=None is not valid and should raise a ValueError (e.g. all classifiers and regressors).

NicolasHug (Author): Wouldn't it be better directly in _validate_data then? As proposed in #16112 (comment).

Member: Why not both, to avoid duplication?

NicolasHug (Author): I'm not sure I follow @ogrisel. Currently, check_X_y raises an error when y is None. Are you suggesting adding a parameter to make this optional (both in check_X_y and in _validate_data)? In _validate_data this would be a bit strange, because the call would be X = _validate_data(X, y_ignored=True) and y isn't passed anyway.

Member: I don't think we need to change check_X_y in this PR. There is enough to review here.


+        if check_params.get('ensure_2d', True):
+            self._check_n_features(X, reset=reset)
+
+        return out


 class ClassifierMixin:
     """Mixin class for all classifiers in scikit-learn."""
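Putting the new helpers together, a minimal sketch of how an estimator is expected to use them (the estimator below is hypothetical; _validate_data and _check_n_features are the methods added above):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_array

class DemoRegressor(RegressorMixin, BaseEstimator):

    def fit(self, X, y):
        # sets self.n_features_in_ as a side effect (reset=True by default)
        X, y = self._validate_data(X, y)
        self.coef_ = np.linalg.pinv(X) @ np.asarray(y, dtype=float)
        return self

    def predict(self, X):
        X = check_array(X)
        # raises a ValueError if X has a different number of features
        # than the data seen in fit
        self._check_n_features(X, reset=False)
        return X @ self.coef_

Note that _check_n_features is only triggered when ensure_2d is left True, so estimators validating non-rectangular input skip the check.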
4 changes: 2 additions & 2 deletions sklearn/calibration.py
@@ -124,8 +124,8 @@ def fit(self, X, y, sample_weight=None):
         self : object
             Returns an instance of self.
         """
-        X, y = check_X_y(X, y, accept_sparse=['csc', 'csr', 'coo'],
-                         force_all_finite=False, allow_nd=True)
+        X, y = self._validate_data(X, y, accept_sparse=['csc', 'csr', 'coo'],
+                                   force_all_finite=False, allow_nd=True)
         X, y = indexable(X, y)
         le = LabelBinarizer().fit(y)
         self.classes_ = le.classes_
2 changes: 1 addition & 1 deletion sklearn/cluster/_affinity_propagation.py
@@ -374,7 +374,7 @@ def fit(self, X, y=None):
             accept_sparse = False
         else:
             accept_sparse = 'csr'
-        X = check_array(X, accept_sparse=accept_sparse)
+        X = self._validate_data(X, accept_sparse=accept_sparse)
         if self.affinity == "precomputed":
             self.affinity_matrix_ = X
         elif self.affinity == "euclidean":
13 changes: 9 additions & 4 deletions sklearn/cluster/_agglomerative.py
@@ -809,7 +809,7 @@ def fit(self, X, y=None):
         -------
         self
         """
-        X = check_array(X, ensure_min_samples=2, estimator=self)
+        X = self._validate_data(X, ensure_min_samples=2, estimator=self)
         memory = check_memory(self.memory)
 
         if self.n_clusters is not None and self.n_clusters <= 0:
@@ -1055,9 +1055,14 @@ def fit(self, X, y=None, **params):
         -------
         self
         """
-        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                        ensure_min_features=2, estimator=self)
-        return AgglomerativeClustering.fit(self, X.T, **params)
+        X = self._validate_data(X, accept_sparse=['csr', 'csc', 'coo'],
+                                ensure_min_features=2, estimator=self)
+        # save n_features_in_ attribute here to reset it after, because it will
+        # be overridden in AgglomerativeClustering since we passed it X.T.
+        n_features_in_ = self.n_features_in_
+        AgglomerativeClustering.fit(self, X.T, **params)
+        self.n_features_in_ = n_features_in_
+        return self
 
     @property
     def fit_predict(self):
2 changes: 1 addition & 1 deletion sklearn/cluster/_bicluster.py
@@ -121,7 +121,7 @@ def fit(self, X, y=None):
             warnings.warn("'n_jobs' was deprecated in version 0.23 and will be"
                           " removed in 0.25.", FutureWarning)
 
-        X = check_array(X, accept_sparse='csr', dtype=np.float64)
+        X = self._validate_data(X, accept_sparse='csr', dtype=np.float64)
         self._check_parameters()
         self._fit(X)
         return self
2 changes: 1 addition & 1 deletion sklearn/cluster/_birch.py
@@ -463,7 +463,7 @@ def fit(self, X, y=None):
         return self._fit(X)
 
     def _fit(self, X):
-        X = check_array(X, accept_sparse='csr', copy=self.copy)
+        X = self._validate_data(X, accept_sparse='csr', copy=self.copy)
         threshold = self.threshold
         branching_factor = self.branching_factor

2 changes: 1 addition & 1 deletion sklearn/cluster/_dbscan.py
@@ -308,7 +308,7 @@ def fit(self, X, y=None, sample_weight=None):
         self
 
         """
-        X = check_array(X, accept_sparse='csr')
+        X = self._validate_data(X, accept_sparse='csr')
 
         if not self.eps > 0.0:
             raise ValueError("eps must be positive.")
10 changes: 6 additions & 4 deletions sklearn/cluster/_kmeans.py
@@ -975,8 +975,10 @@ def fit(self, X, y=None, sample_weight=None):
                 ' got %d instead' % self.max_iter
             )
 
-        X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
-                        order='C', copy=self.copy_x, accept_large_sparse=False)
+        X = self._validate_data(X, accept_sparse='csr',
+                                dtype=[np.float64, np.float32],
+                                order='C', copy=self.copy_x,
+                                accept_large_sparse=False)
         # verify that the number of samples given is larger than k
         if _num_samples(X) < self.n_clusters:
             raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
@@ -1591,8 +1593,8 @@ def fit(self, X, y=None, sample_weight=None):
         self
         """
         random_state = check_random_state(self.random_state)
-        X = check_array(X, accept_sparse="csr", order='C',
-                        dtype=[np.float64, np.float32])
+        X = self._validate_data(X, accept_sparse="csr", order='C',
+                                dtype=[np.float64, np.float32])
         n_samples, n_features = X.shape
         if n_samples < self.n_clusters:
             raise ValueError("n_samples=%d should be >= n_clusters=%d"
2 changes: 1 addition & 1 deletion sklearn/cluster/_mean_shift.py
@@ -368,7 +368,7 @@ def fit(self, X, y=None):
         y : Ignored
 
         """
-        X = check_array(X)
+        X = self._validate_data(X)
         bandwidth = self.bandwidth
         if bandwidth is None:
             bandwidth = estimate_bandwidth(X, n_jobs=self.n_jobs)
2 changes: 1 addition & 1 deletion sklearn/cluster/_optics.py
@@ -245,7 +245,7 @@ def fit(self, X, y=None):
         self : instance of OPTICS
             The instance.
         """
-        X = check_array(X, dtype=np.float)
+        X = self._validate_data(X, dtype=np.float)
 
         if self.cluster_method not in ['dbscan', 'xi']:
             raise ValueError("cluster_method should be one of"
4 changes: 2 additions & 2 deletions sklearn/cluster/_spectral.py
@@ -474,8 +474,8 @@ def fit(self, X, y=None):
         self
 
         """
-        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                        dtype=np.float64, ensure_min_samples=2)
+        X = self._validate_data(X, accept_sparse=['csr', 'csc', 'coo'],
+                                dtype=np.float64, ensure_min_samples=2)
         allow_squared = self.affinity in ["precomputed",
                                           "precomputed_nearest_neighbors"]
         if X.shape[0] == X.shape[1] and not allow_squared:
11 changes: 11 additions & 0 deletions sklearn/cluster/tests/test_bicluster.py
@@ -254,6 +254,17 @@ def test_wrong_shape():
         model.fit(data)
 
 
+@pytest.mark.parametrize('est',
+                         (SpectralBiclustering(), SpectralCoclustering()))
+def test_n_features_in_(est):
+
+    X, _, _ = make_biclusters((3, 3), 3, random_state=0)
+
+    assert not hasattr(est, 'n_features_in_')
+    est.fit(X)
+    assert est.n_features_in_ == 3
+
+
 @pytest.mark.parametrize("klass", [SpectralBiclustering, SpectralCoclustering])
 @pytest.mark.parametrize("n_jobs", [None, 1])
 def test_n_jobs_deprecated(klass, n_jobs):
3 changes: 3 additions & 0 deletions sklearn/compose/_column_transformer.py
@@ -513,6 +513,8 @@ def fit_transform(self, X, y=None):
         else:
             self._feature_names_in = None
         X = _check_X(X)
+        # set n_features_in_ attribute
+        self._check_n_features(X, reset=True)
         self._validate_transformers()
         self._validate_column_callables(X)
         self._validate_remainder(X)
@@ -587,6 +589,7 @@ def transform(self, X):
                                  'and for transform when using the '
                                  'remainder keyword')
 
+        # TODO: also call _check_n_features(reset=False) in 0.24
         self._validate_features(X.shape[1], X_feature_names)
         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
         self._validate_output(Xs)
15 changes: 15 additions & 0 deletions sklearn/compose/_target.py
@@ -10,6 +10,7 @@
 from ..utils.validation import check_is_fitted
 from ..utils import check_array, _safe_indexing
 from ..preprocessing import FunctionTransformer
+from ..exceptions import NotFittedError
 
 __all__ = ['TransformedTargetRegressor']

@@ -235,3 +236,17 @@ def predict(self, X):

     def _more_tags(self):
         return {'poor_score': True, 'no_validation': True}
 
+    @property
+    def n_features_in_(self):
+        # For consistency with other estimators we raise an AttributeError so
+        # that hasattr() returns False if the estimator isn't fitted.
+        try:
+            check_is_fitted(self)
+        except NotFittedError as nfe:
+            raise AttributeError(
+                "{} object has no n_features_in_ attribute."
+                .format(self.__class__.__name__)
+            ) from nfe
+
+        return self.regressor_.n_features_in_
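A quick sketch of the behaviour this property is aiming for (values are illustrative):

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

reg = TransformedTargetRegressor(regressor=LinearRegression())
# the property raises AttributeError before fit, so hasattr is False
assert not hasattr(reg, 'n_features_in_')
reg.fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
# after fit, the attribute is delegated to the fitted regressor_
assert reg.n_features_in_ == 1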
12 changes: 12 additions & 0 deletions sklearn/compose/tests/test_column_transformer.py
@@ -1187,6 +1187,18 @@ def test_column_transformer_mask_indexing(array_type):
     assert X_trans.shape == (3, 2)
 
 
+def test_n_features_in():
+    # make sure n_features_in is what is passed as input to the column
+    # transformer.
+
+    X = [[1, 2], [3, 4], [5, 6]]
+    ct = ColumnTransformer([('a', DoubleTrans(), [0]),
+                            ('b', DoubleTrans(), [1])])
+    assert not hasattr(ct, 'n_features_in_')
+    ct.fit(X)
+    assert ct.n_features_in_ == 2
+
+
 @pytest.mark.parametrize('cols, pattern, include, exclude', [
     (['col_int', 'col_float'], None, np.number, None),
     (['col_int', 'col_float'], None, None, object),
2 changes: 1 addition & 1 deletion sklearn/covariance/_empirical_covariance.py
@@ -198,7 +198,7 @@ def fit(self, X, y=None):
         -------
         self : object
         """
-        X = check_array(X)
+        X = self._validate_data(X)
         if self.assume_centered:
             self.location_ = np.zeros(X.shape[1])
         else:
6 changes: 3 additions & 3 deletions sklearn/covariance/_graph_lasso.py
@@ -382,8 +382,8 @@ def fit(self, X, y=None):
         self : object
         """
         # Covariance does not make sense for a single feature
-        X = check_array(X, ensure_min_features=2, ensure_min_samples=2,
-                        estimator=self)
+        X = self._validate_data(X, ensure_min_features=2, ensure_min_samples=2,
+                                estimator=self)
 
         if self.assume_centered:
             self.location_ = np.zeros(X.shape[1])
@@ -659,7 +659,7 @@ def fit(self, X, y=None):
         self : object
         """
         # Covariance does not make sense for a single feature
-        X = check_array(X, ensure_min_features=2, estimator=self)
+        X = self._validate_data(X, ensure_min_features=2, estimator=self)
         if self.assume_centered:
             self.location_ = np.zeros(X.shape[1])
         else:
2 changes: 1 addition & 1 deletion sklearn/covariance/_robust_covariance.py
@@ -637,7 +637,7 @@ def fit(self, X, y=None):
         -------
         self : object
         """
-        X = check_array(X, ensure_min_samples=2, estimator='MinCovDet')
+        X = self._validate_data(X, ensure_min_samples=2, estimator='MinCovDet')
         random_state = check_random_state(self.random_state)
         n_samples, n_features = X.shape
         # check that the empirical covariance is full rank
6 changes: 3 additions & 3 deletions sklearn/covariance/_shrunk_covariance.py
@@ -140,7 +140,7 @@ def fit(self, X, y=None):
         -------
         self : object
         """
-        X = check_array(X)
+        X = self._validate_data(X)
         # Not calling the parent object to fit, to avoid a potential
         # matrix inversion when setting the precision
         if self.assume_centered:
@@ -412,7 +412,7 @@ def fit(self, X, y=None):
"""
# Not calling the parent object to fit, to avoid computing the
# covariance matrix (and potentially the precision)
X = check_array(X)
X = self._validate_data(X)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
@@ -562,7 +562,7 @@ def fit(self, X, y=None):
         -------
         self : object
         """
-        X = check_array(X)
+        X = self._validate_data(X)
         # Not calling the parent object to fit, to avoid computing the
         # covariance matrix (and potentially the precision)
         if self.assume_centered: