[MRG+2] Ridge linear model dtype consistency (all solvers but sag) #9033

Merged: 7 commits into scikit-learn:master from the is/8769_ridge branch, Jun 14, 2017

Conversation

@massich (Contributor) commented Jun 7, 2017

Reference Issue

Works on #8769 for the Ridge case.

What does this implement/fix? Explain your changes.

Prevents Ridge from aggressively casting the data to np.float64 when np.float32 is supplied, as sketched below.
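
For illustration, a minimal sketch of the behaviour this PR enables (assuming a scikit-learn build with this patch applied): fitting on float32 data keeps coef_ in float32 instead of silently upcasting to float64.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_32 = rng.rand(20, 5).astype(np.float32)
y_32 = rng.rand(20).astype(np.float32)

# With this PR, solvers other than sag/saga preserve the float32 dtype.
ridge = Ridge(solver='svd').fit(X_32, y_32)
print(ridge.coef_.dtype)  # expected: float32 (float64 before the fix)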

Any other comments?

@massich massich changed the title [WIP] Ridge linear model dtype consistency (svd) [WIP] Ridge linear model dtype consistency (all solvers but sag) Jun 8, 2017
@massich (Contributor Author) commented Jun 8, 2017

All credit goes to @ncordier

@massich massich changed the title [WIP] Ridge linear model dtype consistency (all solvers but sag) [MRG] Ridge linear model dtype consistency (all solvers but sag) Jun 8, 2017
@ncordier commented Jun 8, 2017

Oops, there is an assertion error!

FAIL: sklearn.linear_model.tests.test_ridge.test_dtype_match
----------------------------------------------------------------------
[...]
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/linear_model/tests/test_ridge.py", line 817, in test_dtype_match
    assert_almost_equal(ridge_32.coef_, ridge_64.coef_.astype(np.float32))
[...]
AssertionError:
Arrays are not almost equal to 7 decimals
(mismatch 20.0%)
 x: array([-0.03347596, -0.28948903, -0.15810712,  0.09406258,  0.37783372], dtype=float32)
 y: array([-0.0334758 , -0.28948909, -0.15810704,  0.09406273,  0.37783384], dtype=float32)
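
The mismatch above is ordinary float32 round-off: float32 carries only about seven significant digits, so comparing float32 coefficients against downcast float64 ones at assert_almost_equal's default of 7 decimals is too strict. A hedged sketch of a more tolerant comparison (decimal=5 is an illustrative choice, not necessarily what the PR settled on):

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_64, y_64 = rng.rand(20, 5), rng.rand(20)
X_32, y_32 = X_64.astype(np.float32), y_64.astype(np.float32)

ridge_64 = Ridge(solver='svd').fit(X_64, y_64)
ridge_32 = Ridge(solver='svd').fit(X_32, y_32)

# decimal=5 leaves headroom for float32 round-off inside the solver.
assert_array_almost_equal(ridge_32.coef_,
                          ridge_64.coef_.astype(np.float32), decimal=5)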

@TomDLT (Member) left a comment

Should we also add a test for _preprocess_data, directly in test_base.py, to check the dtypes of X, X_scale and X_offset? It would check it with all possible options, including having a sparse input X (a sketch follows after the list below).

You also need to:

  • update _solve_cholesky_kernel, and test it (needs n_features > n_samples)
  • update check_X_y in _RidgeGCV and test RidgeCV
  • test directly the function ridge_regression which is also public
  • test dtype consistency with sparse X input matrix
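
A hedged sketch of the _preprocess_data dtype test suggested above (the test name, parametrization and import path are illustrative assumptions, not taken from the PR):

import numpy as np
import scipy.sparse as sp
from sklearn.linear_model.base import _preprocess_data
from sklearn.utils.testing import assert_equal


def test_preprocess_data_dtype():
    rng = np.random.RandomState(0)
    for dtype in (np.float32, np.float64):
        X_dense = rng.rand(10, 3).astype(dtype)
        y = rng.rand(10).astype(dtype)
        # Exercise both the dense and the sparse code paths.
        for X in (X_dense, sp.csr_matrix(X_dense)):
            Xt, yt, X_offset, y_offset, X_scale = _preprocess_data(
                X, y, fit_intercept=True, normalize=False)
            assert_equal(Xt.dtype, dtype)
            assert_equal(X_offset.dtype, dtype)
            assert_equal(X_scale.dtype, dtype)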

@@ -111,7 +111,7 @@ def _solve_cholesky(X, y, alpha):
return linalg.solve(A, Xy, sym_pos=True,
Member

What happens in this case? Does linalg.solve return the correct dtype? Is it tested?

Contributor Author

Isn't this handled here? If the right dtype is not passed, it would eventually get propagated up and caught there.

Contributor Author

I double-checked it; it does pass through. A and Xy are float32 and float64 (as expected), and linalg.solve preserves the dtype.

So, all good. And both branches are tested.
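
A quick self-contained check (a sketch, not code from the PR) that scipy.linalg.solve preserves the input dtype at both precisions:

import numpy as np
from scipy import linalg

A_32 = 2 * np.eye(3, dtype=np.float32)  # positive definite
b_32 = np.ones(3, dtype=np.float32)
print(linalg.solve(A_32, b_32, sym_pos=True).dtype)  # float32

A_64, b_64 = A_32.astype(np.float64), b_32.astype(np.float64)
print(linalg.solve(A_64, b_64, sym_pos=True).dtype)  # float64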

@@ -327,7 +327,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         y = check_array(y, dtype=np.float64, ensure_2d=False, order='F')
     else:
         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                        dtype=np.float64)
+                        dtype='numeric')
Member

Any reason you use 'numeric' here and [np.float64, np.float32] below?

Contributor Author

Ignore this change; I don't know how it made it through. SAGA is not touched in this PR.

@ncordier

I went with 'numeric' because I thought it was more generic. I think you are right, though, and we should go for [np.float64, np.float32] instead. However, the dtype consistency test should fail if you stick with np.float64.
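
For context, a sketch of how check_array treats the two options: with a dtype list the input dtype is kept when it appears in the list and is otherwise cast to the first entry, whereas 'numeric' preserves any numeric dtype, including integers.

import numpy as np
from sklearn.utils import check_array

X_32 = np.ones((2, 2), dtype=np.float32)
print(check_array(X_32, dtype=[np.float64, np.float32]).dtype)  # float32, kept
print(check_array(X_32, dtype=np.float64).dtype)                # cast to float64

X_int = np.ones((2, 2), dtype=np.int64)
print(check_array(X_int, dtype=[np.float64, np.float32]).dtype)  # cast to float64
print(check_array(X_int, dtype='numeric').dtype)                 # int64 preserved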

@TomDLT (Member) Jun 9, 2017

Are you sure the test currently fails when you stick with np.float64?
It seems to me that it only tests self.coef_. Not sure how to test it though...

@ncordier Jun 9, 2017

I can test it again. However, as I understand it, the goal of this PR is:

  • to ensure dtype consistency between input (X) and output (self.coef_),
  • to make sure the computations use X with its original dtype (to reduce the memory footprint, as asked in "LogisticRegression convert to float64" #8769).

Wouldn't sticking with np.float64 cancel the second bullet point?

Just for reference: test_ridge.py does not appear to make use of unit tests, so the way I test the code is by adding the following lines at the end of my local version of test_ridge.py:

if __name__ == "__main__":
    test_dtype_match()

Member

> Wouldn't sticking with np.float64 cancel the second bullet point?

Yes, it would; my point is that we don't have a test to prove it. I am not sure it is easy to test, though.

> the way I test the code is by adding the following lines at the end of my local version

You can use nosetests sklearn/linear_model/tests/test_ridge.py -v or pytest sklearn/linear_model/tests/test_ridge.py -v.

To run one particular test, you can use nosetests sklearn/linear_model/tests/test_ridge.py:test_dtype_match or pytest sklearn/linear_model/tests/test_ridge.py::test_dtype_match.

@massich (Contributor Author) left a comment

> Should we also add a test for _preprocess_data, directly in test_base.py, to check the dtypes of X, X_scale and X_offset? It would check it with all possible options, including having a sparse input X.

@TomDLT this can be handled in a different PR, like #9040.


@ncordier left a comment

Edit: never mind, I will reply directly next to the relevant line of code.

    # SAG needs X and y columns to be C-contiguous and np.float64
    if solver in ['sag', 'saga']:
        X = check_array(X, accept_sparse=['csr'],
                        dtype=np.float64, order='C')
        y = check_array(y, dtype=np.float64, ensure_2d=False, order='F')
    else:
        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                       dtype=np.float64)
+                       dtype=_dtype)
        y = check_array(y, dtype='numeric', ensure_2d=False)
@massich (Contributor Author) Jun 9, 2017

In order to have X and y with the exact same dtype (to take advantage of BLAS), what do you think about the following line?

 y = check_array(y, dtype=X.dtype, ensure_2d=False)

We let X be whatever is in _dtype, and then force the dtypes of X and y to be equal. This is related to #8976.
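
A self-contained sketch of the proposed pattern: validate X first (letting it keep either float precision), then force y to match X's dtype so the BLAS routines see homogeneous inputs.

import numpy as np
from sklearn.utils import check_array

X_in = np.ones((4, 2), dtype=np.float32)
y_in = np.ones(4)  # float64 on input

X = check_array(X_in, dtype=[np.float64, np.float32])
y = check_array(y_in, dtype=X.dtype, ensure_2d=False)
print(X.dtype, y.dtype)  # float32 float32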


Member

> y = check_array(y, dtype=X.dtype, ensure_2d=False)

+1

@massich (Contributor Author) commented Jun 9, 2017

@TomDLT any other comments?

@@ -171,7 +171,7 @@ def _preprocess_data(X, y, fit_intercept, normalize=False, copy=True,
     if sp.issparse(X):
         X_offset, X_var = mean_variance_axis(X, axis=0)
         if not return_mean:
-            X_offset = np.zeros(X.shape[1])
+            X_offset = np.zeros(X.shape[1], dtype=X.dtype)
Member

Would it be better to do the following?

X_offset[:] = 0
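
For illustration (plain NumPy, not the PR's code): zeroing in place reuses the array returned by mean_variance_axis and therefore keeps its dtype, while np.zeros allocates a fresh float64 array unless the dtype is overridden.

import numpy as np

# Stand-in for the mean_variance_axis output on a float32 matrix.
X_offset = np.array([0.5, 1.5], dtype=np.float32)
X_offset[:] = 0
print(X_offset.dtype)     # float32: dtype preserved, no reallocation

print(np.zeros(2).dtype)  # float64: the default the diff had to override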

Member

+1


# Do the actual checks at once for easier debug
assert_equal(coef_32.dtype, X_32.dtype)
assert_equal(coef_64.dtype, X_64.dtype)
Member

You need one more empty line here to be PEP8 compliant.

assert_equal(coef_32.dtype, X_32.dtype)
assert_equal(coef_64.dtype, X_64.dtype)

def test_dtype_match_cholesky():
Member

I think it would be useful to have a comment here saying that this test differs from the one above because it tests the multitarget case (a sketch follows).
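
A hedged sketch of the multitarget test meant here (names and shapes are illustrative; the PR's actual test may differ): a 2-D y exercises the Cholesky solver's multi-output path, so the dtype checks need to run there as well.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils.testing import assert_equal


def test_dtype_match_cholesky():
    # Unlike the test above, y has two columns (multitarget).
    rng = np.random.RandomState(0)
    X_64 = rng.rand(10, 5)
    Y_64 = rng.rand(10, 2)
    X_32, Y_32 = X_64.astype(np.float32), Y_64.astype(np.float32)

    ridge_32 = Ridge(solver='cholesky').fit(X_32, Y_32)
    ridge_64 = Ridge(solver='cholesky').fit(X_64, Y_64)

    assert_equal(ridge_32.coef_.dtype, X_32.dtype)
    assert_equal(ridge_64.coef_.dtype, X_64.dtype)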

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Ridge linear model dtype consistency (all solvers but sag) [MRG+1] Ridge linear model dtype consistency (all solvers but sag) Jun 10, 2017
@GaelVaroquaux (Member)

A few details need to be addressed (including getting the Travis flake8 check to like your PR), but I am giving my +1, conditional on addressing these.

@agramfort (Member)

Besides that, +1 for MRG when Travis is happy.

@massich (Contributor Author) commented Jun 14, 2017

@agramfort Travis is happy; AppVeyor is not, due to master.

@agramfort (Member)

LGTM!

+1 for MRG after updating what's new.

@agramfort agramfort changed the title [MRG+1] Ridge linear model dtype consistency (all solvers but sag) [MRG+2] Ridge linear model dtype consistency (all solvers but sag) Jun 14, 2017
@jnothman jnothman merged commit 7cfec78 into scikit-learn:master Jun 14, 2017
@massich massich deleted the is/8769_ridge branch June 14, 2017 13:29
@jnothman (Member)

Thanks @massich!

if self.solver in ['svd', 'sparse_cg', 'cholesky', 'lsqr']:
    _dtype = [np.float64, np.float32]
else:
    _dtype = np.float64
@ogrisel (Member) Jun 14, 2017

@massich I forgot to comment here; it would be better to change the condition to assume that, by default, all solvers accept both dtypes unless there is a known exception (preferably tracked by an issue):

if self.solver == 'lbfgs':
    # scipy lbfgs does not support float32 yet:
    # https://github.com/scipy/scipy/issues/4873
    _dtype = np.float64
else:
    # all other solvers work at both float precision levels
    _dtype = [np.float64, np.float32]

I have not tested the above change. Please feel free to do a PR if it works as expected.

@ogrisel (Member)

I made a mistake, I thought this was LogisticRegression. For Ridge the only unsupported solvers remaining are sag and saga.

Contributor Author

I think this comment should be in #8769, so that everyone can make sure that the default policy is to support both dtypes.
I'll post it there.


# Do the actual checks at once for easier debug
assert_equal(coef_32.dtype, X_32.dtype)
assert_equal(coef_64.dtype, X_64.dtype)
Member

You should also add a check for the dtypes of ridge_64.predict(X_64) and ridge_32.predict(X_32).
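
A self-contained sketch of the suggested check (variable names mirror the test above but are re-created here):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils.testing import assert_equal

rng = np.random.RandomState(0)
X_64, y_64 = rng.rand(10, 5), rng.rand(10)
X_32, y_32 = X_64.astype(np.float32), y_64.astype(np.float32)

ridge_64 = Ridge(solver='svd').fit(X_64, y_64)
ridge_32 = Ridge(solver='svd').fit(X_32, y_32)

# Predictions should come back in the dtype of the input data.
assert_equal(ridge_64.predict(X_64).dtype, X_64.dtype)
assert_equal(ridge_32.predict(X_32).dtype, X_32.dtype)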

@massich massich restored the is/8769_ridge branch June 14, 2017 15:13
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017