[MRG+2] Ridge linear model dtype consistency (all solvers but sag) #9033
Conversation
All credit goes to @ncordier.
Oops, there is an assertion error!
Should we also add a test for _preprocess_data, directly in test_base.py, to check the dtypes of X, X_scale and X_offset? It would check them with all possible options, including having a sparse input X.
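A minimal sketch of what such a test could look like (hypothetical, not the PR's actual test; it assumes the 2017-era import path sklearn.linear_model.base and only covers the dense float32 case):

import numpy as np
from sklearn.linear_model.base import _preprocess_data
from sklearn.utils.testing import assert_equal


def test_preprocess_data_dtype():
    # float32 input should come back as float32, including the
    # auxiliary X_offset and X_scale arrays
    rng = np.random.RandomState(0)
    X = rng.rand(10, 3).astype(np.float32)
    y = rng.rand(10).astype(np.float32)
    Xt, yt, X_offset, y_offset, X_scale = _preprocess_data(
        X, y, fit_intercept=True, normalize=True)
    assert_equal(Xt.dtype, np.float32)
    assert_equal(X_offset.dtype, np.float32)
    assert_equal(X_scale.dtype, np.float32)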
You also need to:
- update _solve_cholesky_kernel, and test it (needs n_features > n_samples)
- update check_X_y in _RidgeGCV, and test RidgeCV
- test directly the function ridge_regression, which is also public (see the sketch after this list)
- test dtype consistency with a sparse X input matrix
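For the ridge_regression item, a hedged sketch of a direct call with n_features > n_samples, which routes the 'cholesky' solver through _solve_cholesky_kernel (shapes are illustrative only):

import numpy as np
from sklearn.linear_model import ridge_regression

rng = np.random.RandomState(0)
# n_features > n_samples exercises the kernelized cholesky path
X = rng.rand(5, 20).astype(np.float32)
y = rng.rand(5).astype(np.float32)
coef = ridge_regression(X, y, alpha=1.0, solver='cholesky')
print(coef.dtype)  # expected to be float32 once this PR is in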
@@ -111,7 +111,7 @@ def _solve_cholesky(X, y, alpha):
        return linalg.solve(A, Xy, sym_pos=True,
What happens in this case? Does linalg.solve return the correct dtype? Is it tested?
Isn't this handled here? If the right dtype is not passed, it would eventually get propagated up and caught there.
I double-checked it; it does pass through. A and Xy are float32 and float64 (as expected), and linalg.solve preserves the dtype. So, all good. And both branches are tested.
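A quick demonstration of that behaviour (a sketch; sym_pos was the current scipy API at the time of this PR, newer scipy versions spell it assume_a='pos'):

import numpy as np
from scipy import linalg

A = np.eye(3, dtype=np.float32) * 2.0  # symmetric positive definite
b = np.ones(3, dtype=np.float32)
x = linalg.solve(A, b, sym_pos=True)
print(x.dtype)  # float32: the input dtype is preserved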
sklearn/linear_model/ridge.py (outdated)
@@ -327,7 +327,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         y = check_array(y, dtype=np.float64, ensure_2d=False, order='F')
     else:
         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                        dtype=np.float64)
+                        dtype='numeric')
Any reason you use 'numeric' here and [np.float64, np.float32] below?
Ignore this change. I don't know how it made it through. SAGA is not touched in this PR.
I went with 'numeric' because I thought it was more generic. I think you are right though, and we should go for [np.float64, np.float32] instead. However, the dtype consistency test should fail if you stick with np.float64.
Are you sure the test currently fails when you stick with np.float64? It seems to me that it only tests self.coef_. Not sure how to test it though...
I can test it again. However, as I understand it, the goal of this PR is:
- to ensure dtype consistency between input (X) and output (self.coef_),
- to make sure the computations use X with its original dtype (to reduce the memory footprint, as asked in LogisticRegression convert to float64 #8769).
Wouldn't sticking with np.float64 cancel the second bullet point?
Just for reference: test_ridge.py does not appear to make use of unit tests, so the way I test the code is by adding the following lines at the end of my local version of test_ridge.py:
if __name__ == "__main__":
test_dtype_match()
> Wouldn't sticking with np.float64 cancel the second bullet point?

Yes it would; my point is that we don't have a test to prove it. I am not sure it is easy to test, though.

> the way I test the code is by adding the following lines at the end of my local version

You can use nosetests sklearn/linear_model/tests/test_ridge.py -v or pytest sklearn/linear_model/tests/test_ridge.py -v. To run one particular test, use nosetests sklearn/linear_model/tests/test_ridge.py:test_dtype_match or pytest sklearn/linear_model/tests/test_ridge.py::test_dtype_match.
Edit: Nevermind, I will be replying directly next to the line of the code.
sklearn/linear_model/ridge.py (outdated)
    # SAG needs X and y columns to be C-contiguous and np.float64
    if solver in ['sag', 'saga']:
        X = check_array(X, accept_sparse=['csr'],
                        dtype=np.float64, order='C')
        y = check_array(y, dtype=np.float64, ensure_2d=False, order='F')
    else:
        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                       dtype=np.float64)
+                       dtype=_dtype)
        y = check_array(y, dtype='numeric', ensure_2d=False)
In order to have X and y with the exact same dtype (to take advantage of BLAS), what do you guys think about the following line?
y = check_array(y, dtype=X.dtype, ensure_2d=False)
We let X be whatever is in _dtype and then force the X and y dtypes to be equal. This is related to #8976.
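A small sketch of the proposed pattern in isolation (illustrative values only):

import numpy as np
from sklearn.utils import check_array

X = check_array(np.random.rand(5, 2).astype(np.float32),
                dtype=[np.float64, np.float32])  # float32 is in the list, so it is kept
y = check_array(np.random.rand(5), dtype=X.dtype, ensure_2d=False)
print(X.dtype, y.dtype)  # both float32, so BLAS calls stay in single precision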
cc: @TomDLT, @raghavrv, @GaelVaroquaux
y = check_array(y, dtype=X.dtype, ensure_2d=False)
+1
@TomDLT any other comments?
sklearn/linear_model/base.py (outdated)
@@ -171,7 +171,7 @@ def _preprocess_data(X, y, fit_intercept, normalize=False, copy=True,
     if sp.issparse(X):
         X_offset, X_var = mean_variance_axis(X, axis=0)
         if not return_mean:
-            X_offset = np.zeros(X.shape[1])
+            X_offset = np.zeros(X.shape[1], dtype=X.dtype)
Would it be better to do X_offset[:] = 0 instead?
+1
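A sketch of the in-place variant (hypothetical; it assumes mean_variance_axis already returns X_offset with X's dtype, which is what makes the explicit dtype argument unnecessary):

import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

X = sp.random(10, 4, density=0.5, format='csr', dtype=np.float32)
X_offset, X_var = mean_variance_axis(X, axis=0)
X_offset[:] = 0  # zero in place; the float32 dtype is preserved for free
print(X_offset.dtype)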
# Do the actual checks at once for easier debug
assert_equal(coef_32.dtype, X_32.dtype)
assert_equal(coef_64.dtype, X_64.dtype)
You need one more empty line here to be pep8 compliant.
assert_equal(coef_32.dtype, X_32.dtype)
assert_equal(coef_64.dtype, X_64.dtype)


def test_dtype_match_cholesky():
I think it would be useful to have a comment here saying that this test differs from the one above because we are testing with multiple targets.
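Something along these lines (a sketch, not the PR's actual test; names mirror the test above, and it assumes the PR's dtype preservation is in place):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils.testing import assert_equal


def test_dtype_match_cholesky():
    # Unlike the single-target test above, y has two columns, which
    # exercises the multi-target branch of the cholesky solver.
    rng = np.random.RandomState(0)
    X_64 = rng.rand(10, 3)
    y_64 = rng.rand(10, 2)
    X_32 = X_64.astype(np.float32)
    y_32 = y_64.astype(np.float32)

    ridge_32 = Ridge(solver='cholesky').fit(X_32, y_32)
    ridge_64 = Ridge(solver='cholesky').fit(X_64, y_64)

    assert_equal(ridge_32.coef_.dtype, np.float32)
    assert_equal(ridge_64.coef_.dtype, np.float64)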
A few details need to be addressed (including getting Travis flake8 to like your PR), but I am giving my +1 conditional on addressing these.
Besides that, +1 for MRG when Travis is happy.
@agramfort Travis is happy; AppVeyor is not, due to master.
LGTM! +1 for MRG after updating what's new.
Thanks @massich!
if self.solver in ['svd', 'sparse_cg', 'cholesky', 'lsqr']:
    _dtype = [np.float64, np.float32]
else:
    _dtype = np.float64
@massich I forgot to comment here: it would be better to change the condition to assume that, by default, all solvers should accept both dtypes unless there is a known exception (preferably tracked by an issue):
if self.solver == 'lbfgs':
    # scipy lbfgs does not support float32 yet:
    # https://github.com/scipy/scipy/issues/4873
    _dtype = np.float64
else:
    # all other solvers work at both float precision levels
    _dtype = [np.float64, np.float32]
I have not tested the above change. Please feel free to do a PR if it works as expected.
I made a mistake; I thought this was LogisticRegression. For Ridge, the only unsupported solvers remaining are sag and saga.
I think this comment should be in #8769, so that everyone can make sure that the default policy is to support both dtypes.
I'll post it there.
# Do the actual checks at once for easier debug
assert_equal(coef_32.dtype, X_32.dtype)
assert_equal(coef_64.dtype, X_64.dtype)
You should also add a check for the dtypes of ridge_64.predict(X_64) and ridge_32.predict(X_32).
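For instance (a sketch, reusing the ridge_32 / ridge_64 fixtures from the surrounding test):

assert_equal(ridge_32.predict(X_32).dtype, X_32.dtype)
assert_equal(ridge_64.predict(X_64).dtype, X_64.dtype)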
Reference Issue
Works on #8769 for the Ridge case.
What does this implement/fix? Explain your changes.
Avoids Ridge aggressively casting the data to np.float64 when np.float32 is supplied.
Any other comments?
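A minimal before/after illustration of the behaviour this PR targets (hypothetical data; applies to all solvers except sag and saga):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 3).astype(np.float32)
y = rng.rand(20).astype(np.float32)

ridge = Ridge(solver='cholesky').fit(X, y)
print(ridge.coef_.dtype)  # float64 before this PR, float32 after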