[MRG] LogisticRegression convert to float64 (sag) #11155
Conversation
The test failure on appveyor is intriguing:
It means that the SGDClassifier model has converged in 2 epochs. The data is iris, so it's deterministic, and the random state of the model is fixed, so the stopping criterion should be consistent across platforms. Maybe this test is too sensitive to small numerical changes, and the simple fact that we use a different compiler under Windows to compile the sgd routine and the dataset access can explain this?
This test is indeed very brittle: a simple change of the random seed makes it fail very easily (tested under linux). I will try to push a more numerically stable version of this test to see if that solves the failure we observe on appveyor.
The unstable test is fixed.
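For illustration, here is a hedged sketch (not the actual test from the PR) of why asserting an exact epoch count is brittle, and what a more robust check can look like:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

# Sketch only: the model and parameters are illustrative, not the PR's test.
X, y = load_iris(return_X_y=True)
clf = SGDClassifier(tol=1e-3, max_iter=1000, random_state=42).fit(X, y)

# Asserting an exact epoch count (e.g. n_iter_ == 2) is brittle: tiny
# numerical differences from a different compiler or seed can shift the
# epoch at which `tol` is reached. Asserting convergence well within the
# iteration budget is far more stable across platforms.
assert clf.n_iter_ < clf.max_iter
```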
Great! Thanks @ogrisel
This pull request introduces 1 alert when merging 6aa7b93 into 20cb37e - view new alerts on lgtm.com.
Comment posted by lgtm.com
Force-pushed from 6aa7b93 to 2bb42c8.
It might be that the gradient memory array can fit in the CPU cache in float32 while it cannot in float64, causing the slowdown for the latter. This is just a hunch though. Anyway, this PR LGTM.
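The cache hunch can be made concrete with rough arithmetic (the array shape below is an illustrative assumption, not a measurement from the benchmark):

```python
import numpy as np

# Assumed gradient memory array shape: one gradient per sample.
n_samples, n_features = 30000, 10

grad_mem_32 = n_samples * n_features * np.dtype(np.float32).itemsize
grad_mem_64 = n_samples * n_features * np.dtype(np.float64).itemsize

# ~1.2 MB in float32 vs ~2.4 MB in float64: with a CPU cache of a
# couple of MB, the float32 array may stay cache-resident while the
# float64 one does not, which would explain the slowdown.
print(grad_mem_32, grad_mem_64)  # 1200000 2400000
```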
sklearn/linear_model/setup.py
Outdated
@@ -34,6 +35,20 @@ def configuration(parent_package='', top_path=None):
                  []),
                  **blas_info)

    # generate sag_fast from template
    sag_template = 'sklearn/linear_model/sag_fast.pyx.tp'
    sag_file = sag_template[:-3]
Nitpick: I would rename to sag_cython_file.
benchmarks/bench_saga.py
Outdated
dtypes_mapping = {
    "float64": np.float64,
    "float32": np.float32,
}
style:
dtypes_mapping = {
"float64": np.float64,
"float32": np.float32,
}
Force-pushed from 2bb42c8 to 84e337c.
hum… I screwed up somewhere…
Force-pushed from 84e337c to dbb22b7.
Cool, LGTM (apart from Travis complaining about PEP8). Thx @NelleV
Force-pushed from dbb22b7 to 921ab38.
benchmarks/bench_saga.py
Outdated
@@ -151,21 +160,25 @@ def exp(solvers, penalties, single_target, n_samples=30000, max_iter=20,
    X = X[:n_samples]
    y = y[:n_samples]

    cached_fit = mem.cache(fit_single)
    # cached_fit = mem.cache(fit_single)
Maybe you should delete this line
if self.solver in ['newton-cg']:
    _dtype = [np.float64, np.float32]
else:
    if self.solver in ['lbfgs', 'liblinear']:
It would be good to document in the Notes section of the object what solvers can handle what dtype.
Done!
Although I could improve the wording…
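The dtype-selection pattern under review relies on `check_array` accepting a list of dtypes; a small sketch of that behavior:

```python
import numpy as np
from sklearn.utils import check_array

# When `dtype` is a list, check_array keeps the input dtype if it is
# already in the list, so float32 input is not upcast to float64.
X32 = np.ones((3, 2), dtype=np.float32)
assert check_array(X32, dtype=[np.float64, np.float32]).dtype == np.float32

# Input whose dtype is not in the list is cast to the first entry.
Xint = np.ones((3, 2), dtype=np.int64)
assert check_array(Xint, dtype=[np.float64, np.float32]).dtype == np.float64
```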
sklearn/linear_model/sag_fast.pyx.tp
Outdated
# # tol
# cdef {{c_type}} tol = <{{c_type}}> _tol
# # intercept_decay
# cdef {{c_type}} intercept_decay = <{{c_type}}> _intercept_decay
I think that this should be cleaned.
I've not looked through in detail... A broad question: should we be preferring Tempita over fused types in general? Why / why not? @NelleV, I don't get what the "I am not able to reproduce those benchmarks" above means...?
A broad question: should we be preferring Tempita over fused types in general? Why / why not?

Fused types didn't work here because they couldn't be used to introduce polymorphism with the classes.
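A toy illustration of the difference (plain string templating standing in for Tempita, which is the engine the `.pyx.tp` files actually use; the class names mirror the generated ones):

```python
from string import Template

# Sketch only: a template is expanded once per dtype, emitting a
# distinct class each time - the kind of class-level polymorphism
# that Cython fused types cannot express for cdef classes.
template = Template(
    "class ArrayDataset$bits:\n"
    "    dtype_bits = $bits\n"
)

namespace = {}
for bits in (32, 64):
    exec(template.substitute(bits=bits), namespace)

assert namespace["ArrayDataset32"].dtype_bits == 32
assert namespace["ArrayDataset64"].dtype_bits == 64
```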
sklearn/utils/setup.py
Outdated
    'sklearn/utils/seq_dataset.pxd.tp']

for pyxfiles in pyx_templates:
    outfile = pyxfiles[:-3]
small comment: I find .replace('.tp', '') slightly more readable.
I should take mental note of this one.
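Both spellings strip the template suffix; a quick check (with the caveat that replace() removes every '.tp' occurrence, not just a trailing one):

```python
template = 'sklearn/linear_model/sag_fast.pyx.tp'

# The slice drops exactly the last 3 characters; replace() is easier
# to read but would also rewrite a '.tp' appearing mid-path.
assert template[:-3] == 'sklearn/linear_model/sag_fast.pyx'
assert template.replace('.tp', '') == template[:-3]
```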
sklearn/linear_model/setup.py
Outdated
@@ -34,6 +36,20 @@ def configuration(parent_package='', top_path=None):
                  []),
                  **blas_info)

    # generate sag_fast from template
    sag_cython_file = 'sklearn/linear_model/sag_fast.pyx.tp'
    sag_file = sag_cython_file[:-3]
small comment: I find .replace('.tp', '') slightly more readable.
@@ -243,8 +243,9 @@ def sag_solver(X, y, sample_weight=None, loss='log', alpha=1., beta=0.,
        max_iter = 1000

    if check_input:
        X = check_array(X, dtype=np.float64, accept_sparse='csr', order='C')
        y = check_array(y, dtype=np.float64, ensure_2d=False, order='C')
        _dtype = [np.float64, np.float32]
This does not seem to be covered by any of the tests.
I see a pattern that we applied sometimes: the solvers always check the inputs while the class level does not (e.g. KMeans). Is this something that we want to extend here, or should we leave these lines uncovered, or remove them?
if 'coef' in warm_start_mem.keys():
    coef_init = warm_start_mem['coef']
else:
    # assume fit_intercept is False
    coef_init = np.zeros((n_features, n_classes), dtype=np.float64,
    coef_init = np.zeros((n_features, n_classes), dtype=X.dtype,
This does not seem to be covered by any of the tests.
Force-pushed from 34596c6 to 8d89662.
I broke everything in the rebase, so I'll need a bit of time to fix things up…
@NelleV: any update on this? I was hoping to merge.
Not yet, but I can make it a priority starting tomorrow or Wednesday. I had almost nothing left to do, but apparently broke a huge amount of things in my rebase.
@NelleV I did remove a couple of pieces of outdated code and changed some precision in a test. I will try to review the changes now.
We need 2 additional tests:
I think that we should make the support in ridge_regression as well.
I think that we should make the support in ridge_regression as well.

Yes! Maybe in a separate PR (you could add an issue about this).
So, someone almost force-pushed on top of all the changes in this branch… In general, it'd be nice to have a heads-up before pushing to someone else's branch.
You're right. Euphoria of the sprint, I assume... Sorry about that.
This line seems weird: I still see some places where the dtype is converted:
The first case is not triggered because we usually do not pass check_input=False, while the second one is not tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments. I did not finish going through all the tests.
I think that if we want to be complete, we need a test which passes mixed sample_weight and X dtypes, to be sure that sample_weight will be cast to the dtype of X.
The conversion is done, but the test is not there.
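A minimal sketch of the cast being discussed (shapes are hypothetical; the real conversion lives in the solver's input validation):

```python
import numpy as np

# X arrives as float32 while sample_weight is the default float64.
X = np.ones((4, 2), dtype=np.float32)
sample_weight = np.arange(4, dtype=np.float64)

# The conversion under discussion: align sample_weight with X's dtype
# so the 32-bit code path never sees mixed precision.
sample_weight = np.asarray(sample_weight, dtype=X.dtype)
assert sample_weight.dtype == X.dtype == np.float32
```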
xi_data_32, _, _ = xi_32
xi_data_64, _, _ = xi_64

assert_equal(xi_data_32.dtype, np.float32)
We need to use assert ... == ...
xi_data_64, _, _ = xi_64

assert_equal(xi_data_32.dtype, np.float32)
assert_equal(xi_data_64.dtype, np.float64)
We need to use assert ... == ...
@@ -161,17 +195,18 @@ cdef class SequentialDataset:

    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef double* x_data_ptr
I am unsure about this function for testing. Both double and float in C are converted to float in Python. Therefore, we cannot test that we are getting the right type, so I think the *_py functions are not useful for type checking.
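The width-erasure being described is easy to demonstrate in plain NumPy:

```python
import numpy as np

arr32 = np.arange(3, dtype=np.float32)
arr64 = np.arange(3, dtype=np.float64)

# Converting to a Python scalar erases the width: both C float and
# C double surface as the same Python `float` type.
assert isinstance(arr32[0].item(), float)
assert isinstance(arr64[0].item(), float)

# Indexing instead yields NumPy scalars that retain their dtype, so
# only values kept inside arrays can be type-checked reliably.
assert arr32[0].dtype == np.float32
assert arr64[0].dtype == np.float64
```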
assert_equal(xicsr_data_32.dtype, np.float32)
assert_equal(xicsr_data_64.dtype, np.float64)
assert isinstance(yicsr_32, float)
This looks weird, since I was expecting to get a np.float32. This is because _sample_py converts the C double to a Python float.
If we want to get the proper inner type, we need to return a value from the numpy array, which will then not be converted and will have the proper dtype.
Actually there is no way to know. I would remove the check.
xicsr_data_32, _, _ = xicsr_32
xicsr_data_64, _, _ = xicsr_64

assert_equal(xicsr_data_32.dtype, np.float32)
We need to use assert ... == ...
assert_array_almost_equal(xicsr_data_64, xicsr_data_32, decimal=5)
assert_array_almost_equal(yicsr_64, yicsr_32, decimal=5)

assert_array_equal(xi_data_32, xicsr_data_32)
Shall we use assert_allclose here as well, since these are floating-point comparisons?
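For instance, assert_allclose makes the tolerance explicit instead of hiding it in a decimal count (values here are arbitrary):

```python
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(0)
x64 = rng.rand(10)
x32 = x64.astype(np.float32)

# rtol around 1e-6 comfortably absorbs the float64 -> float32 cast
# error (float32 has ~1e-7 relative precision).
assert_allclose(x64, x32, rtol=1e-6)
```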
assert_equal(xi_data_64.dtype, np.float64)
assert isinstance(yi_32, float)
assert isinstance(yi_64, float)
# assert_array_almost_equal(yi_64, yi_32, decimal=5)
We could make this check with assert_allclose and a tolerance large enough to account for the cast.
# Check accuracy consistency
lr_64 = LogisticRegression(solver=solver, multi_class=multi_class)
lr_64.fit(X_64, y_64)
assert_equal(lr_64.coef_.dtype, X_64.dtype)
assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32))
tolerance = 5 if solver != 'saga' else 3
assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32),
assert_allclose is better
assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32),
                    decimal=tolerance)
# Do all asserts at once (it facilitates interactive type checking)
assert_equal(lr_32_coef.dtype, X_32.dtype)
we can use the plain assert
assert_equal(lr_32_coef.dtype, X_32.dtype)
assert_equal(lr_64_coef.dtype, X_64.dtype)
assert_equal(lr_32_sparse.coef_.dtype, X_sparse_32.dtype)
assert_almost_equal(lr_32_coef, lr_64_coef.astype(np.float32),
is it a single coefficient or an array?
@@ -6,17 +6,25 @@
from numpy.testing import assert_array_equal
import scipy.sparse as sp

from sklearn.utils.seq_dataset import ArrayDataset, CSRDataset
from sklearn.utils.seq_dataset import ArrayDataset64 as ArrayDataset
It could be confusing; I would keep the 64 at the end and update the tests instead.
@@ -6,17 +6,25 @@
from numpy.testing import assert_array_equal
import scipy.sparse as sp

from sklearn.utils.seq_dataset import ArrayDataset, CSRDataset
from sklearn.utils.seq_dataset import ArrayDataset64 as ArrayDataset
from sklearn.utils.seq_dataset import CSRDataset64 as CSRDataset
I would keep the 64
y32 = iris.target.astype(np.float32)
X_csr32 = sp.csr_matrix(X32)
sample_weight32 = np.arange(y32.size, dtype=np.float32)


def assert_csr_equal(X, Y):
We should probably test CSRDataset32 as well
# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False
#
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
We should probably keep this part and repeat it in both template files.
It would be nice to have this in 0.20. @NelleV would you have time to address the last review comments, or should it be continued by someone else?
@NelleV Can I take the PR back? I'll be working on it at the Paris sprint.
Reference Issue
Works on #8769 (for SAG solver), Fixes #9040
What does this implement/fix? Explain your changes.
Avoids logistic regression aggressively casting the data to np.float64 when np.float32 is supplied.
Any other comments?
This PR follows up on PR #8835 and PR #9020
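The behavior this PR targets can be sketched end to end (assuming a scikit-learn version that includes the float32 support for the sag/saga solvers):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X32 = X.astype(np.float32)

# Before this change the sag solver upcast float32 input to float64;
# with it, the fitted coefficients stay in the input dtype.
clf = LogisticRegression(solver='sag', max_iter=1000).fit(X32, y)
assert clf.coef_.dtype == np.float32
```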