
[MRG+1] More robust input validation, more testing. #4136


Merged: 4 commits merged on Feb 13, 2015
5 changes: 3 additions & 2 deletions sklearn/cluster/spectral.py
Expand Up @@ -243,7 +243,7 @@ def spectral_clustering(affinity, n_clusters=8, n_components=None,
This algorithm solves the normalized cut for k=2: it is a
normalized spectral clustering.
"""
if not assign_labels in ('kmeans', 'discretize'):
if assign_labels not in ('kmeans', 'discretize'):
raise ValueError("The 'assign_labels' parameter should be "
"'kmeans' or 'discretize', but '%s' was given"
% assign_labels)
Expand Down Expand Up @@ -415,7 +415,8 @@ def fit(self, X, y=None):
OR, if affinity==`precomputed`, a precomputed affinity
matrix of shape (n_samples, n_samples)
"""
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
dtype=np.float64)
if X.shape[0] == X.shape[1] and self.affinity != "precomputed":
warnings.warn("The spectral clustering API has changed. ``fit``"
"now constructs an affinity matrix from data. To use"
4 changes: 3 additions & 1 deletion sklearn/covariance/empirical_covariance_.py
Expand Up @@ -52,7 +52,7 @@ def empirical_covariance(X, assume_centered=False):

Parameters
----------
X : 2D ndarray, shape (n_samples, n_features)
X : ndarray, shape (n_samples, n_features)
Data from which to compute the covariance estimate

assume_centered : Boolean
Expand All @@ -70,6 +70,7 @@ def empirical_covariance(X, assume_centered=False):
X = np.asarray(X)
if X.ndim == 1:
X = np.reshape(X, (1, -1))
if X.shape[0] == 1:
warnings.warn("Only one sample available. "
"You may want to reshape your data array")

Expand Down Expand Up @@ -164,6 +165,7 @@ def fit(self, X, y=None):
Returns self.

"""
X = check_array(X)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
16 changes: 13 additions & 3 deletions sklearn/covariance/graph_lasso_.py
Expand Up @@ -18,7 +18,7 @@

from ..utils import ConvergenceWarning
from ..utils.extmath import pinvh
from ..utils.validation import check_random_state
from ..utils.validation import check_random_state, check_array
from ..linear_model import lars_path
from ..linear_model import cd_fast
from ..cross_validation import _check_cv as check_cv, cross_val_score
Expand Down Expand Up @@ -191,6 +191,9 @@ def graph_lasso(emp_cov, alpha, cov_init=None, mode='cd', tol=1e-4,
else:
errors = dict(invalid='raise')
try:
# be robust to the max_iter=0 edge case, see:
# https://github.com/scikit-learn/scikit-learn/issues/4134
d_gap = np.inf
Member:
It looks like it's not needed.

Member Author:
I ran into a case where max_iter is 0. I opened #4134 because I'm not familiar with the details.
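For context, a minimal sketch (not the scikit-learn code; compute_gap is a hypothetical stand-in) of the failure mode that initializing d_gap guards against: with a zero iteration budget the loop body never runs, so anything assigned only inside it is undefined afterwards.

def compute_gap():
    # hypothetical stand-in for the real dual-gap computation
    return 0.0

def run(max_iter=0):
    for _ in range(max_iter):
        d_gap = compute_gap()
    # raises UnboundLocalError when max_iter == 0, unless d_gap is
    # initialized before the loop (as this diff does with np.inf)
    return d_gap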

Member:
Let's put an inline comment to make it explicit that this is to support the max_iter=0 case.

Alternatively, I would be OK with adding input validation to ensure that models configured with max_iter < 1 consistently raise a ValueError (see the sketch at the end of this thread).

I don't see the point in supporting the max_iter=0 edge case in the first place. If most sklearn models already do something reasonable (i.e. they do not crash at fit, predict, or transform time), then this is fine. But if it is problematic for any of them, I would rather explicitly disable support for max_iter=0 than hack in workarounds for this useless edge case.

Member Author:
Have you had a look at #4134? It is not like I'm setting max_iter anywhere. I just run with the default arguments on random data, and the estimator crashes every time. With this fix, at least it runs through. I don't know what is happening, which is why I opened #4134.

Member:
Alright, I did not make the connection. Still, please add an inline comment such as:

try:
    # be robust to the max_iter=0 edge case, see:
    # https://github.com/scikit-learn/scikit-learn/issues/4134
    d_gap = np.inf
    ...
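And a hedged sketch of the validation alternative mentioned earlier in this thread (a hypothetical helper, not something this PR adds): reject non-positive iteration budgets up front with a consistent ValueError.

def _check_max_iter(max_iter):
    # hypothetical validator; scikit-learn does not ship this helper
    if max_iter < 1:
        raise ValueError("max_iter must be >= 1, got %r" % max_iter)
    return max_iter

_check_max_iter(100)   # passes
_check_max_iter(0)     # raises ValueError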

for i in range(max_iter):
for idx in range(n_features):
sub_covariance = covariance_[indices != idx].T[indices != idx]
Expand Down Expand Up @@ -314,7 +317,7 @@ def __init__(self, alpha=.01, mode='cd', tol=1e-4, max_iter=100,
self.store_precision = True

def fit(self, X, y=None):
X = np.asarray(X)
X = check_array(X)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
Expand Down Expand Up @@ -514,7 +517,14 @@ def __init__(self, alphas=4, n_refinements=4, cv=None, tol=1e-4,
self.store_precision = True

def fit(self, X, y=None):
X = np.asarray(X)
"""Fits the GraphLasso covariance model to X.

Parameters
----------
X : ndarray, shape (n_samples, n_features)
Data from which to compute the covariance estimate
"""
X = check_array(X)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
3 changes: 2 additions & 1 deletion sklearn/covariance/robust_covariance.py
Expand Up @@ -15,7 +15,7 @@

from . import empirical_covariance, EmpiricalCovariance
from ..utils.extmath import fast_logdet, pinvh
from ..utils import check_random_state
from ..utils import check_random_state, check_array


# Minimum Covariance Determinant
Expand Down Expand Up @@ -605,6 +605,7 @@ def fit(self, X, y=None):
Returns self.

"""
X = check_array(X)
random_state = check_random_state(self.random_state)
n_samples, n_features = X.shape
# check that the empirical covariance is full rank
15 changes: 8 additions & 7 deletions sklearn/covariance/shrunk_covariance_.py
Expand Up @@ -126,6 +126,7 @@ def fit(self, X, y=None):
Returns self.

"""
X = check_array(X)
# Not calling the parent object to fit, to avoid a potential
# matrix inversion when setting the precision
if self.assume_centered:
Expand Down Expand Up @@ -181,12 +182,11 @@ def ledoit_wolf_shrinkage(X, assume_centered=False, block_size=1000):
return 0.
if X.ndim == 1:
X = np.reshape(X, (1, -1))

if X.shape[0] == 1:
warnings.warn("Only one sample available. "
"You may want to reshape your data array")
n_samples = 1
n_features = X.size
else:
n_samples, n_features = X.shape
n_samples, n_features = X.shape

# optionaly center data
if not assume_centered:
Expand Down Expand Up @@ -228,8 +228,7 @@ def ledoit_wolf_shrinkage(X, assume_centered=False, block_size=1000):
# get final beta as the min between beta and delta
beta = min(beta, delta)
# finally get shrinkage
shrinkage = beta / delta

shrinkage = 0 if beta == 0 else beta / delta
return shrinkage


Expand Down Expand Up @@ -384,6 +383,7 @@ def fit(self, X, y=None):
"""
# Not calling the parent object to fit, to avoid computing the
# covariance matrix (and potentially the precision)
X = check_array(X)
if self.assume_centered:
self.location_ = np.zeros(X.shape[1])
else:
Expand Down Expand Up @@ -460,7 +460,7 @@ def oas(X, assume_centered=False):
num = alpha + mu ** 2
den = (n_samples + 1.) * (alpha - (mu ** 2) / n_features)

shrinkage = min(num / den, 1.)
shrinkage = 1. if den == 0 else min(num / den, 1.)
shrunk_cov = (1. - shrinkage) * emp_cov
shrunk_cov.flat[::n_features + 1] += shrinkage * mu

Expand Down Expand Up @@ -536,6 +536,7 @@ def fit(self, X, y=None):
Returns self.

"""
X = check_array(X)
# Not calling the parent object to fit, to avoid computing the
# covariance matrix (and potentially the precision)
if self.assume_centered:
10 changes: 9 additions & 1 deletion sklearn/covariance/tests/test_covariance.py
Expand Up @@ -57,10 +57,12 @@ def test_covariance():
cov.error_norm(empirical_covariance(X_1d), norm='spectral'), 0)

# test with one sample
# FIXME I don't know what this test does
X_1sample = np.arange(5)
cov = EmpiricalCovariance()

assert_warns(UserWarning, cov.fit, X_1sample)
assert_array_almost_equal(cov.covariance_,
np.zeros(shape=(5, 5), dtype=np.float64))

# test integer type
X_integer = np.asarray([[0, 1], [1, 0]])
Expand Down Expand Up @@ -180,9 +182,12 @@ def test_ledoit_wolf():
assert_array_almost_equal(empirical_covariance(X_1d), lw.covariance_, 4)

# test with one sample
# FIXME I don't know what this test does
X_1sample = np.arange(5)
lw = LedoitWolf()
assert_warns(UserWarning, lw.fit, X_1sample)
assert_array_almost_equal(lw.covariance_,
np.zeros(shape=(5, 5), dtype=np.float64))

# test shrinkage coeff on a simple data set (without saving precision)
lw = LedoitWolf(store_precision=False)
Expand Down Expand Up @@ -251,9 +256,12 @@ def test_oas():
assert_array_almost_equal(empirical_covariance(X_1d), oa.covariance_, 4)

# test with one sample
# FIXME I don't know what this test does
X_1sample = np.arange(5)
oa = OAS()
assert_warns(UserWarning, oa.fit, X_1sample)
assert_array_almost_equal(oa.covariance_,
np.zeros(shape=(5, 5), dtype=np.float64))

# test shrinkage coeff on a simple data set (without saving precision)
oa = OAS(store_precision=False)
6 changes: 5 additions & 1 deletion sklearn/linear_model/coordinate_descent.py
Expand Up @@ -348,6 +348,9 @@ def enet_path(X, y, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
ElasticNetCV
"""
X = check_array(X, 'csc', dtype=np.float64, order='F', copy=copy_X)
if Xy is not None:
Member Author:
Maybe @agramfort wants to have a look at this. This comes up in ElasticNetCV. In c2f0b31 I enforced it there, but there was a comment that it shouldn't be enforced there; I'm not sure about that.
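For reference, a hedged sketch (not part of this PR) of the call path this check guards: ElasticNetCV-style callers can hand enet_path a precomputed Gram matrix together with X.T @ y, so Xy needs the same dtype/layout validation as X.

import numpy as np
from sklearn.linear_model import enet_path

rng = np.random.RandomState(0)
X = rng.randn(20, 5)
y = rng.randn(20)

gram = np.dot(X.T, X)   # precomputed Gram matrix
Xy = np.dot(X.T, y)     # this is what the new check_array call validates

alphas, coefs, dual_gaps = enet_path(X, y, precompute=gram, Xy=Xy)
print(coefs.shape)      # (n_features, n_alphas)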

Xy = check_array(Xy, 'csc', dtype=np.float64, order='F', copy=False,
ensure_2d=False)
n_samples, n_features = X.shape

multi_output = False
Expand Down Expand Up @@ -389,7 +392,6 @@ def enet_path(X, y, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
if selection not in ['random', 'cyclic']:
raise ValueError("selection should be either random or cyclic.")
random = (selection == 'random')
models = []

if not multi_output:
coefs = np.empty((n_features, n_alphas), dtype=np.float64)
Expand All @@ -414,6 +416,7 @@ def enet_path(X, y, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
model = cd_fast.enet_coordinate_descent_multi_task(
coef_, l1_reg, l2_reg, X, y, max_iter, tol, rng, random)
elif isinstance(precompute, np.ndarray):
precompute = check_array(precompute, 'csc', dtype=np.float64, order='F')
model = cd_fast.enet_coordinate_descent_gram(
coef_, l1_reg, l2_reg, precompute, Xy, y, max_iter,
tol, rng, random, positive)
Expand Down Expand Up @@ -1418,6 +1421,7 @@ def __init__(self, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
self.random_state = random_state
self.selection = selection


###############################################################################
# Multi Task ElasticNet and Lasso models (with joint feature selection)

1 change: 1 addition & 0 deletions sklearn/linear_model/omp.py
Expand Up @@ -806,6 +806,7 @@ def fit(self, X, y):
returns an instance of self.
"""
X, y = check_X_y(X, y)
X = as_float_array(X, copy=False, force_all_finite=False)
cv = check_cv(self.cv, X, y, classifier=False)
max_iter = (min(max(int(0.1 * X.shape[1]), 5), X.shape[1])
if not self.max_iter
5 changes: 3 additions & 2 deletions sklearn/manifold/mds.py
Expand Up @@ -357,7 +357,7 @@ def __init__(self, n_components=2, metric=True, n_init=4,
def _pairwise(self):
return self.kernel == "precomputed"

def fit(self, X, init=None, y=None):
def fit(self, X, y=None, init=None):
Member Author:
Maybe I merged too fast. I am still not 100% sure about this :-/

Member:
I forgot about that one. I don't think it's a big deal though.
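For anyone following along, a small sketch of why the argument order matters (toy classes, not the MDS code): meta-estimators call fit positionally as fit(X, y), so with init before y the targets would silently land in init.

class OldSignature(object):
    def fit(self, X, init=None, y=None):
        print("init received:", init)

class NewSignature(object):
    def fit(self, X, y=None, init=None):
        print("init received:", init)

X, y = [[0.0], [1.0]], [0, 1]
OldSignature().fit(X, y)   # init received: [0, 1]  (y swallowed as init)
NewSignature().fit(X, y)   # init received: None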

"""
Computes the position of the points in the embedding space

Expand All @@ -374,7 +374,7 @@ def fit(self, X, init=None, y=None):
self.fit_transform(X, init=init)
return self

def fit_transform(self, X, init=None, y=None):
def fit_transform(self, X, y=None, init=None):
"""
Fit the data from X, and returns the embedded coordinates

Expand All @@ -389,6 +389,7 @@ def fit_transform(self, X, init=None, y=None):
if ndarray, initialize the SMACOF algorithm with this array.

"""
X = check_array(X)
if X.shape[0] == X.shape[1] and self.dissimilarity != "precomputed":
warnings.warn("The MDS API has changed. ``fit`` now constructs an"
" dissimilarity matrix from data. To use a custom "
10 changes: 5 additions & 5 deletions sklearn/manifold/t_sne.py
Expand Up @@ -215,7 +215,7 @@ def _gradient_descent(objective, p0, it, n_iter, n_iter_without_progress=30,
update = momentum * update - learning_rate * grad
p += update

if verbose >= 2 and (i+1) % 10 == 0:
if verbose >= 2 and (i + 1) % 10 == 0:
print("[t-SNE] Iteration %d: error = %.7f, gradient norm = %.7f"
% (i + 1, error, grad_norm))

Expand Down Expand Up @@ -404,7 +404,7 @@ def __init__(self, n_components=2, perplexity=30.0,
self.verbose = verbose
self.random_state = random_state

def _fit(self, X):
def fit(self, X, y=None):
"""Fit the model using X as training data.

Parameters
Expand All @@ -413,7 +413,7 @@ def _fit(self, X):
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row.
"""
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'], dtype=np.float64)
random_state = check_random_state(self.random_state)

if self.early_exaggeration < 1.0:
Expand Down Expand Up @@ -507,7 +507,7 @@ def _tsne(self, P, alpha, n_samples, random_state, X_embedded=None):

return X_embedded

def fit_transform(self, X):
def fit_transform(self, X, y=None):
"""Transform X to the embedded space.

Parameters
Expand All @@ -521,5 +521,5 @@ def fit_transform(self, X):
X_new : array, shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.
"""
self._fit(X)
self.fit(X)
Member:
This looks strange.

Member:
Never mind, I thought that we were in the fit method.

return self.embedding_
20 changes: 10 additions & 10 deletions sklearn/mixture/dpgmm.py
Expand Up @@ -16,7 +16,7 @@
from scipy.spatial.distance import cdist

from ..externals.six.moves import xrange
from ..utils import check_random_state
from ..utils import check_random_state, check_array
from ..utils.extmath import logsumexp, pinvh, squared_norm
from ..utils.validation import check_is_fitted
from .. import cluster
Expand Down Expand Up @@ -148,12 +148,12 @@ class DPGMM(GMM):
n_iter : int, default 10
Maximum number of iterations to perform before convergence.

params : string, default 'wmc'
params : string, default 'wmc'
Controls which parameters are updated in the training
process. Can contain any combination of 'w' for weights,
'm' for means, and 'c' for covars.

init_params : string, default 'wmc'
init_params : string, default 'wmc'
Controls which parameters are updated in the initialization
process. Can contain any combination of 'w' for weights,
'm' for means, and 'c' for covars. Defaults to 'wmc'.
Expand Down Expand Up @@ -250,7 +250,7 @@ def score_samples(self, X):
"""
check_is_fitted(self, 'gamma_')

X = np.asarray(X)
X = check_array(X)
if X.ndim == 1:
X = X[:, np.newaxis]
z = np.zeros((X.shape[0], self.n_components))
Expand Down Expand Up @@ -461,7 +461,7 @@ def _logprior(self, z):
def lower_bound(self, X, z):
"""returns a lower bound on model evidence based on X and membership"""
check_is_fitted(self, 'means_')

if self.covariance_type not in ['full', 'tied', 'diag', 'spherical']:
raise NotImplementedError("This ctype is not implemented: %s"
% self.covariance_type)
Expand All @@ -480,7 +480,7 @@ def _set_weights(self):
+ self.gamma_[i, 2])
self.weights_ /= np.sum(self.weights_)

def fit(self, X):
def fit(self, X, y=None):
"""Estimate model parameters with the variational
algorithm.

Expand All @@ -501,10 +501,10 @@ def fit(self, X):
List of n_features-dimensional data points. Each row
corresponds to a single data point.
"""
self.random_state = check_random_state(self.random_state)
self.random_state_ = check_random_state(self.random_state)
Member:
It looks strange that we need to store the random state as a fitted attribute.

Member Author:
Why? How else should we keep sampling? If we do check_random_state before each sampling, it will yield the same result on each call, right?
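A minimal sketch of that point: re-deriving the RNG from the integer seed before every draw replays the same samples, so the seeded RandomState has to be created once (here, kept as random_state_) for sampling to advance.

from sklearn.utils import check_random_state

draw_a = check_random_state(0).rand(3)
draw_b = check_random_state(0).rand(3)
print((draw_a == draw_b).all())             # True: identical draws, not fresh samples

rng = check_random_state(0)                 # create once, keep, and reuse
print((rng.rand(3) == rng.rand(3)).all())   # False: the generator advances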

Member:
Hmm, this is a good question; I wonder whether we deal with this case consistently.

Member Author:
We used this pattern in other places, though I don't remember where. The previous pattern of just overwriting the parameter is clearly not what we want.

Member:
> The previous pattern of just overwriting the parameter is clearly not what we want.

I agree.

Member:
BTW, wouldn't it make sense to have a common test to check that self.random_state is not mutated by a call to fit when the constructor parameter is passed with an integer seed?

Member Author:
Do you think that will be different from a None seed? That is the default parameter, so we do test for it (well, after this PR we do).

Member:
This is different but we can do that in another PR.

Member Author:
Oh, right, I got it. BTW, do you think it is OK to modify it if a RandomState object was passed? (And do you want to open an issue for the test?)
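A hedged sketch of the common test suggested above (assuming this PR's DPGMM with its new random_state_ attribute; the real version would live in the shared estimator checks): fitting with an integer seed must leave the public random_state parameter untouched.

import numpy as np
from sklearn.mixture import DPGMM

X = np.random.RandomState(0).randn(50, 2)

model = DPGMM(n_components=2, random_state=42)
model.fit(X)

assert model.random_state == 42                                  # constructor param not mutated
assert isinstance(model.random_state_, np.random.RandomState)    # consumed RNG stored separately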


## initialization step
X = np.asarray(X)
X = check_array(X)
if X.ndim == 1:
X = X[:, np.newaxis]

Expand All @@ -521,7 +521,7 @@ def fit(self, X):
if 'm' in self.init_params or not hasattr(self, 'means_'):
self.means_ = cluster.KMeans(
n_clusters=self.n_components,
random_state=self.random_state).fit(X).cluster_centers_[::-1]
random_state=self.random_state_).fit(X).cluster_centers_[::-1]

if 'w' in self.init_params or not hasattr(self, 'weights_'):
self.weights_ = np.tile(1.0 / self.n_components, self.n_components)
Expand Down Expand Up @@ -705,7 +705,7 @@ def score_samples(self, X):
"""
check_is_fitted(self, 'gamma_')

X = np.asarray(X)
X = check_array(X)
if X.ndim == 1:
X = X[:, np.newaxis]
dg = digamma(self.gamma_) - digamma(np.sum(self.gamma_))