[MRG] LogisticRegression convert to float64 (sag) #11155


Closed
wants to merge 12 commits

Conversation

NelleV
Member

@NelleV NelleV commented May 28, 2018

Reference Issue

Works on #8769 (for SAG solver), Fixes #9040
What does this implement/fix? Explain your changes.

Avoids logistic regression aggressively casting the data to np.float64 when np.float32 is supplied.

Any other comments?

This PR follows up on PR #8835 and PR #9020
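The intended behaviour can be sketched as follows (LogisticRegression and load_iris are real scikit-learn APIs; the float32 preservation for solver='sag' is precisely what this PR adds, so the final dtype only holds once it is merged):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X32 = X.astype(np.float32)

# Before this PR, the sag solver silently upcasts float32 input to
# float64; afterwards, float32 data should stay float32 end to end.
clf = LogisticRegression(solver='sag', max_iter=1000)
clf.fit(X32, y)
print(clf.coef_.dtype)  # float32 once this PR is in
```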

@NelleV NelleV changed the title [MRG] LogisticRegression convert to float64 (sag) [WIP] LogisticRegression convert to float64 (sag) May 28, 2018
@ogrisel
Member

ogrisel commented May 29, 2018

The test failure on appveyor is intriguing:

    def test_tol_parameter():
        # Test that the tol parameter behaves as expected
        X = StandardScaler().fit_transform(iris.data)
        y = iris.target == 1

        # With tol is None, the number of iteration should be equal to max_iter
        max_iter = 42
        model_0 = SGDClassifier(tol=None, random_state=0, max_iter=max_iter)
        model_0.fit(X, y)
        assert_equal(max_iter, model_0.n_iter_)

        # If tol is not None, the number of iteration should be less than max_iter
        max_iter = 2000
        model_1 = SGDClassifier(tol=0, random_state=0, max_iter=max_iter)
        model_1.fit(X, y)
        assert_greater(max_iter, model_1.n_iter_)
>       assert_greater(model_1.n_iter_, 5)

X          = array([[ -9.00681170e-01,   1.01900435e+00,  -1.34022653e+00,
         -1.3154...6.86617933e-02,  -1.31979479e-01,   7.62758269e-01,
          7.90670654e-01]])
max_iter   = 2000
model_0    = SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
   ....5, random_state=0, shuffle=True,
       tol=None, verbose=0, warm_start=False)
model_1    = SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
   ...t=0.5, random_state=0, shuffle=True,
       tol=0, verbose=0, warm_start=False)
y          = array([False, False, False, False, False, False, False, False, False,
       F...se, False, False,
       False, False, False, False, False, False], dtype=bool)

c:\python27\lib\site-packages\sklearn\linear_model\tests\test_sgd.py:1205:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
c:\python27\lib\unittest\case.py:943: in assertGreater
    self.fail(self._formatMessage(msg, standardMsg))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <sklearn.utils._unittest_backport.TestCase testMethod=__init__>
msg = '2 not greater than 5'

    def fail(self, msg=None):
        """Fail immediately, with the given message."""
>       raise self.failureException(msg)
E       AssertionError: 2 not greater than 5

msg        = '2 not greater than 5'
self       = <sklearn.utils._unittest_backport.TestCase testMethod=__init__>

c:\python27\lib\unittest\case.py:410: AssertionError

It means that the SGDClassifier model converged in 2 epochs. The data is iris, so it's deterministic, and the random state of the model is fixed, so the stopping criterion should be consistent across platforms. Maybe this test is too sensitive to small numerical changes, and the simple fact that we use a different compiler under Windows to compile the SGD routine and the dataset access can explain this?

@ogrisel
Member

ogrisel commented May 29, 2018

This test is indeed very brittle: a simple change of the random seed makes it fail very easily (tested under Linux). I will try to push a more numerically stable version of this test to see if that solves the failure we observe on AppVeyor.

@ogrisel
Member

ogrisel commented May 29, 2018

The unstable test is fixed.

@NelleV
Member Author

NelleV commented May 29, 2018

Great! Thanks @ogrisel
I'll try to wrap up the benchmarks today.

@sklearn-lgtm

This pull request introduces 1 alert when merging 6aa7b93 into 20cb37e - view on lgtm.com

new alerts:

  • 1 for Unreachable code

Comment posted by lgtm.com

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from 6aa7b93 to 2bb42c8 Compare May 31, 2018 21:21
@NelleV
Member Author

NelleV commented May 31, 2018

On large datasets, the 32-bit data can sometimes fit in cache while the 64-bit data can't. This leads to a significant speed-up:

[benchmark figure: single_target_l1]

There's no difference on small datasets.

I am not able to reproduce those benchmarks

@NelleV NelleV changed the title [WIP] LogisticRegression convert to float64 (sag) [MRG] LogisticRegression convert to float64 (sag) May 31, 2018
@ogrisel
Member

ogrisel commented May 31, 2018

It might be that the gradient memory array fits in CPU cache in float32 while it cannot in float64, causing the slowdown for the latter. This is just a hunch though.
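A back-of-the-envelope sketch of that hunch (the array shape and the 32 MiB L3 size are illustrative assumptions, not measurements from the benchmark machine):

```python
import numpy as np

n_samples, n_features = 30_000, 200
L3_CACHE = 32 * 2**20  # assumed 32 MiB L3 cache

# sag keeps a per-sample gradient memory array; halving the element size
# can move it from "spills to RAM" to "fits in cache".
for dtype in (np.float32, np.float64):
    grad_mem = np.zeros((n_samples, n_features), dtype=dtype)
    fits = grad_mem.nbytes <= L3_CACHE
    print(dtype.__name__,
          f"{grad_mem.nbytes / 2**20:.1f} MiB",
          "fits" if fits else "spills")
```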

Anyway, this PR LGTM.

@@ -34,6 +35,20 @@ def configuration(parent_package='', top_path=None):
[]),
**blas_info)

# generate sag_fast from template
sag_template = 'sklearn/linear_model/sag_fast.pyx.tp'
sag_file = sag_template[:-3]
Member

Nitpick: I would rename to sag_cython_file.

dtypes_mapping = {
"float64": np.float64,
"float32": np.float32,
}
Member

style:

dtypes_mapping = {
     "float64": np.float64,
     "float32": np.float32, 
}

@NelleV
Member Author

NelleV commented May 31, 2018

hum… I screwed up somewhere…

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from 84e337c to dbb22b7 Compare June 1, 2018 00:18
@massich
Contributor

massich commented Jun 1, 2018

Cool, LGTM (apart from Travis complaining about PEP8)

Thx @NelleV

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from dbb22b7 to 921ab38 Compare June 1, 2018 16:51
@NelleV
Member Author

NelleV commented Jun 1, 2018

In most cases (99%) we don't see a difference between the 32-bit and 64-bit versions.

[benchmark figure: saga_l1_200000]

@@ -151,21 +160,25 @@ def exp(solvers, penalties, single_target, n_samples=30000, max_iter=20,
X = X[:n_samples]
y = y[:n_samples]

cached_fit = mem.cache(fit_single)
# cached_fit = mem.cache(fit_single)
Member

Maybe you should delete this line

if self.solver in ['newton-cg']:
_dtype = [np.float64, np.float32]
else:
if self.solver in ['lbfgs', 'liblinear']:
Member

It would be good to document in the Notes section of the object which solvers can handle which dtypes.

Member Author

Done!

Member Author

Although I could improve the wording…

# # tol
# cdef {{c_type}} tol = <{{c_type}}> _tol
# # intercept_decay
# cdef {{c_type}} intercept_decay = <{{c_type}}> _intercept_decay
Member

I think that this should be cleaned up.

@jnothman
Member

jnothman commented Jun 3, 2018

I've not looked through in detail... A broad question: should we prefer Tempita over fused types in general? Why / why not?

@NelleV, I don't get what the "I am not able to reproduce those benchmarks" above means...?

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 4, 2018 via email

'sklearn/utils/seq_dataset.pxd.tp']

for pyxfiles in pyx_templates:
outfile = pyxfiles[:-3]
Member

small comment: I find .replace('.tp', '') slightly more readable.
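For what it's worth, the two spellings differ subtly: slicing strips exactly the trailing three characters, while str.replace removes every occurrence of the substring anywhere in the path:

```python
template = 'sklearn/linear_model/sag_fast.pyx.tp'
assert template[:-3] == template.replace('.tp', '')  # identical here

# ...but not in general: a hypothetical path containing '.tp' elsewhere
tricky = 'some.tp.dir/file.pyx.tp'
print(tricky[:-3])                # some.tp.dir/file.pyx
print(tricky.replace('.tp', ''))  # some.dir/file.pyx
```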

Contributor

I should take mental note of this one.

@@ -34,6 +36,20 @@ def configuration(parent_package='', top_path=None):
[]),
**blas_info)

# generate sag_fast from template
sag_cython_file = 'sklearn/linear_model/sag_fast.pyx.tp'
sag_file = sag_cython_file[:-3]
Member

small comment: I find .replace('.tp', '') slightly more readable.

@@ -243,8 +243,9 @@ def sag_solver(X, y, sample_weight=None, loss='log', alpha=1., beta=0.,
max_iter = 1000

if check_input:
X = check_array(X, dtype=np.float64, accept_sparse='csr', order='C')
y = check_array(y, dtype=np.float64, ensure_2d=False, order='C')
_dtype = [np.float64, np.float32]
Member

This does not seem to be covered by any of the tests.
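For context, passing a list of dtypes to check_array keeps the input dtype when it is already in the list and otherwise converts to the first entry, so a test covering this line only needs a float32 input — a sketch:

```python
import numpy as np
from sklearn.utils import check_array

X32 = np.ones((3, 2), dtype=np.float32)
X_int = np.ones((3, 2), dtype=np.int64)

dtype_list = [np.float64, np.float32]
print(check_array(X32, dtype=dtype_list).dtype)    # float32 (preserved)
print(check_array(X_int, dtype=dtype_list).dtype)  # float64 (first entry)
```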

Member

I see a pattern that we have applied in some places: the solvers always check the inputs while the class level does not (e.g. KMeans).

Do we want to extend that pattern here, leave these lines uncovered, or remove them?


if 'coef' in warm_start_mem.keys():
coef_init = warm_start_mem['coef']
else:
# assume fit_intercept is False
coef_init = np.zeros((n_features, n_classes), dtype=np.float64,
coef_init = np.zeros((n_features, n_classes), dtype=X.dtype,
Member

This does not seem to be covered by any of the tests.

Member

@TomDLT as mentioned by @lesteve, this line is not covered since LogisticRegression and Ridge always pass an array with initialized coefficients.

Is there a danger in removing this line? Could sag_solver be used as a standalone function?

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from 34596c6 to 8d89662 Compare July 8, 2018 22:52
@NelleV
Member Author

NelleV commented Jul 9, 2018

I broke everything in the rebase, so I'll need a bit of time to fix things up…

@GaelVaroquaux
Member

@NelleV : any update on this? I was hoping to merge.

@NelleV
Member Author

NelleV commented Jul 16, 2018

Not yet, but I can make it a priority starting tomorrow or Wednesday. I had almost nothing left to do, but apparently broke a huge number of things in my rebase.
Once I fix my rebase, it'll be a minimal amount of work.

@glemaitre
Member

@NelleV I removed a couple of pieces of outdated code and changed the precision in a test. I will now try to review the changes.

@glemaitre
Member

We need 2 additional tests:

  • a test with check_input=True for sag_solver
  • a test with a cold start to initialize the coefficients
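A minimal sketch of the first of these tests (the import path is the one from this era; in later releases sag_solver lives in the private sklearn.linear_model._sag module, hence the fallback):

```python
import numpy as np
from sklearn.datasets import load_iris
try:
    from sklearn.linear_model.sag import sag_solver   # path at the time of this PR
except ImportError:
    from sklearn.linear_model._sag import sag_solver  # newer releases

X, y = load_iris(return_X_y=True)
X32 = np.ascontiguousarray(X, dtype=np.float32)
y32 = np.ascontiguousarray(y, dtype=np.float32)

# check_input=True exercises the check_array(..., dtype=[float64, float32])
# path: float32 input should yield float32 coefficients.
coef, n_iter, warm_start_mem = sag_solver(X32, y32, loss='squared',
                                          check_input=True)
assert coef.dtype == np.float32
```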

@glemaitre
Member

I think that we should add this support in ridge_regression as well.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 19, 2018 via email

@NelleV
Member Author

NelleV commented Jul 19, 2018

So, I almost force-pushed on top of all the changes in this branch… In general, it'd be nice to have a heads up before pushing to someone else's branch.

@glemaitre
Member

In general, it'd be nice to have a heads up before pushing to someone else's branch.

You're right. Euphoria of the sprint, I assume... Sorry about that.

@glemaitre
Member

This line seems weird:

I still see some places where the dtype is converted:

The first case is not triggered because we usually do not pass check_input=False, while the second one is not tested.

Member

@glemaitre glemaitre left a comment

A couple of comments. I did not finish going through all the tests.

I think that, to be complete, we need a test which passes a sample_weight and X with mixed dtypes, to be sure that sample_weight is cast to the dtype of X.

The conversion is done but the test is not there.

xi_data_32, _, _ = xi_32
xi_data_64, _, _ = xi_64

assert_equal(xi_data_32.dtype, np.float32)
Member

We need to use assert ... == ...

xi_data_64, _, _ = xi_64

assert_equal(xi_data_32.dtype, np.float32)
assert_equal(xi_data_64.dtype, np.float64)
Member

We need to use assert ... == ...

@@ -161,17 +195,18 @@ cdef class SequentialDataset:

def _sample_py(self, int current_index):
"""python function used for easy testing"""
cdef double* x_data_ptr
Member

I am unsure about this function for testing.

Both double and float in C are converted to float in Python. Therefore, we cannot check that we have the right type, so the *_py functions are not useful for type checking, I think.
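A quick illustration of the point (plain Python/numpy, no assumptions):

```python
import numpy as np

# A C double and a C float both surface in Python as the built-in float,
# so a Python-level accessor cannot reveal the inner precision...
assert isinstance(float(np.float32(1.5)), float)
assert isinstance(float(np.float64(1.5)), float)

# ...whereas numpy scalars keep the dtype information:
print(np.float32(1.5).dtype, np.float64(1.5).dtype)  # float32 float64
```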


assert_equal(xicsr_data_32.dtype, np.float32)
assert_equal(xicsr_data_64.dtype, np.float64)
assert isinstance(yicsr_32, float)
Member

@glemaitre glemaitre Jul 19, 2018

This looks weird, since I expected to get a np.float32.
This is because _sample_py converts the C double to a Python float.
If we wanted to get the proper inner type, we would need to return a value from the numpy array, which would not be converted and would keep the proper dtype.
Actually there is no way to know, so I would remove the check.

xicsr_data_32, _, _ = xicsr_32
xicsr_data_64, _, _ = xicsr_64

assert_equal(xicsr_data_32.dtype, np.float32)
Member

We need to use assert ... == ...

assert_array_almost_equal(xicsr_data_64, xicsr_data_32, decimal=5)
assert_array_almost_equal(yicsr_64, yicsr_32, decimal=5)

assert_array_equal(xi_data_32, xicsr_data_32)
Member

Shouldn't we use assert_allclose here as well, since these are floating-point comparisons?

assert_equal(xi_data_64.dtype, np.float64)
assert isinstance(yi_32, float)
assert isinstance(yi_64, float)
# assert_array_almost_equal(yi_64, yi_32, decimal=5)
Member

We could make this check here with assert_allclose and a large enough tolerance accounting for the cast
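Something along these lines (the rtol value is an assumption; float32 rounding introduces a relative error of at most about 6e-8, so 1e-6 leaves comfortable headroom):

```python
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(0)
yi_64 = rng.uniform(size=10)      # float64 reference values
yi_32 = yi_64.astype(np.float32)  # same values after the cast

# The tolerance only needs to absorb float32 rounding error.
assert_allclose(yi_64, yi_32, rtol=1e-6)
```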


# Check accuracy consistency
lr_64 = LogisticRegression(solver=solver, multi_class=multi_class)
lr_64.fit(X_64, y_64)
assert_equal(lr_64.coef_.dtype, X_64.dtype)
assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32))
tolerance = 5 if solver != 'saga' else 3
assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32),
Member

assert_allclose is better

assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32),
decimal=tolerance)
# Do all asserts at once (it facilitate to interactive type check)
assert_equal(lr_32_coef.dtype, X_32.dtype)
Member

we can use the plain assert

assert_equal(lr_32_coef.dtype, X_32.dtype)
assert_equal(lr_64_coef.dtype, X_64.dtype)
assert_equal(lr_32_sparse.coef_.dtype, X_sparse_32.dtype)
assert_almost_equal(lr_32_coef, lr_64_coef.astype(np.float32),
Member

is it a single coefficient or an array?

@@ -6,17 +6,25 @@
from numpy.testing import assert_array_equal
import scipy.sparse as sp

from sklearn.utils.seq_dataset import ArrayDataset, CSRDataset
from sklearn.utils.seq_dataset import ArrayDataset64 as ArrayDataset
Member

It could be confusing; I would keep the 64 at the end and update the tests accordingly.

@@ -6,17 +6,25 @@
from numpy.testing import assert_array_equal
import scipy.sparse as sp

from sklearn.utils.seq_dataset import ArrayDataset, CSRDataset
from sklearn.utils.seq_dataset import ArrayDataset64 as ArrayDataset
from sklearn.utils.seq_dataset import CSRDataset64 as CSRDataset
Member

I would keep the 64

y32 = iris.target.astype(np.float32)
X_csr32 = sp.csr_matrix(X32)
sample_weight32 = np.arange(y32.size, dtype=np.float32)


def assert_csr_equal(X, Y):
Member

We should probably test CSRDataset32 as well

# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False
#
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
Member

We should probably keep this part and repeat it in both template files

@rth
Member

rth commented Jul 27, 2018

It would be nice to have this in 0.20. @NelleV would you have time to address the last review comments, or should someone else continue it?

@amueller amueller modified the milestones: 0.20, 0.21 Aug 17, 2018
@massich
Contributor

massich commented Feb 25, 2019

@NelleV Can I take the PR back? I'll be working on it in the Paris sprint

@GaelVaroquaux
Member

Closing as overridden by #11643 and #13243
