[MRG] LogisticRegression convert to float64 (sag) #11155


Closed
wants to merge 12 commits

Conversation

NelleV
Member

@NelleV NelleV commented May 28, 2018

Reference Issue

Works on #8769 (for SAG solver), Fixes #9040
What does this implement/fix? Explain your changes.

Avoids logistic regression aggressively casting the data to np.float64 when np.float32 is supplied.

Any other comments?

This PR follows up on PR #8835 and PR #9020
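The intended behaviour can be sketched as follows (LogisticRegression and load_iris are real scikit-learn APIs; the float32 preservation for solver='sag' is precisely what this PR adds, so the final dtype only holds once it is merged):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X32 = X.astype(np.float32)

# Before this PR, the sag solver silently upcasts float32 input to
# float64; afterwards, float32 data should stay float32 end to end.
clf = LogisticRegression(solver='sag', max_iter=1000)
clf.fit(X32, y)
print(clf.coef_.dtype)  # float32 once this PR is in
```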

@NelleV NelleV changed the title [MRG] LogisticRegression convert to float64 (sag) [WIP] LogisticRegression convert to float64 (sag) May 28, 2018
@ogrisel
Member

ogrisel commented May 29, 2018

The test failure on appveyor is intriguing:

    def test_tol_parameter():
        # Test that the tol parameter behaves as expected
        X = StandardScaler().fit_transform(iris.data)
        y = iris.target == 1

        # With tol is None, the number of iteration should be equal to max_iter
        max_iter = 42
        model_0 = SGDClassifier(tol=None, random_state=0, max_iter=max_iter)
        model_0.fit(X, y)
        assert_equal(max_iter, model_0.n_iter_)

        # If tol is not None, the number of iteration should be less than max_iter
        max_iter = 2000
        model_1 = SGDClassifier(tol=0, random_state=0, max_iter=max_iter)
        model_1.fit(X, y)
        assert_greater(max_iter, model_1.n_iter_)
>       assert_greater(model_1.n_iter_, 5)

X          = array([[ -9.00681170e-01,   1.01900435e+00,  -1.34022653e+00,
         -1.3154...6.86617933e-02,  -1.31979479e-01,   7.62758269e-01,
          7.90670654e-01]])
max_iter   = 2000
model_0    = SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
   ....5, random_state=0, shuffle=True,
       tol=None, verbose=0, warm_start=False)
model_1    = SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
   ...t=0.5, random_state=0, shuffle=True,
       tol=0, verbose=0, warm_start=False)
y          = array([False, False, False, False, False, False, False, False, False,
       F...se, False, False,
       False, False, False, False, False, False], dtype=bool)

c:\python27\lib\site-packages\sklearn\linear_model\tests\test_sgd.py:1205:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
c:\python27\lib\unittest\case.py:943: in assertGreater
    self.fail(self._formatMessage(msg, standardMsg))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <sklearn.utils._unittest_backport.TestCase testMethod=__init__>
msg = '2 not greater than 5'

    def fail(self, msg=None):
        """Fail immediately, with the given message."""
>       raise self.failureException(msg)
E       AssertionError: 2 not greater than 5

msg        = '2 not greater than 5'
self       = <sklearn.utils._unittest_backport.TestCase testMethod=__init__>

c:\python27\lib\unittest\case.py:410: AssertionError

It means that the SGDClassifier model converged in 2 epochs. The data is iris, so it's deterministic, and the random state of the model is fixed, so the stopping criterion should be consistent across platforms. Maybe this test is too sensitive to small numerical changes, and the simple fact that we use a different compiler under Windows to compile the SGD routine and the dataset access can explain this?

@ogrisel
Member

ogrisel commented May 29, 2018

This test is indeed very brittle: a simple change of the random seed makes it fail very easily (tested under Linux). I will try to push a more numerically stable version of this test to see if that solves the failure we observe on AppVeyor.

@ogrisel
Member

ogrisel commented May 29, 2018

The unstable test is fixed.

@NelleV
Member Author

NelleV commented May 29, 2018

Great! Thanks @ogrisel
I'll try to wrap up the benchmarks today.

@sklearn-lgtm

This pull request introduces 1 alert when merging 6aa7b93 into 20cb37e - view on lgtm.com

new alerts:

  • 1 for Unreachable code

Comment posted by lgtm.com

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from 6aa7b93 to 2bb42c8 Compare May 31, 2018 21:21
@NelleV
Member Author

NelleV commented May 31, 2018

On large datasets, the 32-bit data can sometimes fit in cache while the 64-bit data can't. This leads to a significant speed-up:

[benchmark figure: single_target_l1]

There's no difference on small datasets.

I am not able to reproduce those benchmarks

@NelleV NelleV changed the title [WIP] LogisticRegression convert to float64 (sag) [MRG] LogisticRegression convert to float64 (sag) May 31, 2018
@ogrisel
Member

ogrisel commented May 31, 2018

It might be that the gradient memory array fits in CPU cache in float32 while it cannot in float64, causing the slowdown for the latter. This is just a hunch though.
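A back-of-the-envelope sketch of that hunch (the array shape and the 32 MiB L3 size are illustrative assumptions, not measurements from the benchmark machine):

```python
import numpy as np

n_samples, n_features = 30_000, 200
L3_CACHE = 32 * 2**20  # assumed 32 MiB L3 cache

# sag keeps a per-sample gradient memory array; halving the element size
# can move it from "spills to RAM" to "fits in cache".
for dtype in (np.float32, np.float64):
    grad_mem = np.zeros((n_samples, n_features), dtype=dtype)
    fits = grad_mem.nbytes <= L3_CACHE
    print(dtype.__name__,
          f"{grad_mem.nbytes / 2**20:.1f} MiB",
          "fits" if fits else "spills")
```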

Anyway, this PR LGTM.

@@ -34,6 +35,20 @@ def configuration(parent_package='', top_path=None):
[]),
**blas_info)

# generate sag_fast from template
sag_template = 'sklearn/linear_model/sag_fast.pyx.tp'
sag_file = sag_template[:-3]
Member

Nitpick: I would rename to sag_cython_file.

dtypes_mapping = {
"float64": np.float64,
"float32": np.float32,
}
Member

style:

dtypes_mapping = {
     "float64": np.float64,
     "float32": np.float32, 
}

@NelleV
Member Author

NelleV commented May 31, 2018

hum… I screwed up somewhere…

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from 84e337c to dbb22b7 Compare June 1, 2018 00:18
@massich
Contributor

massich commented Jun 1, 2018

Cool, LGTM (apart from Travis complaining about PEP8)

Thx @NelleV

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from dbb22b7 to 921ab38 Compare June 1, 2018 16:51
@NelleV
Member Author

NelleV commented Jun 1, 2018

In most cases (99%) we don't see a difference between the 32-bit and 64-bit versions.

[benchmark figure: saga_l1_200000]

@@ -151,21 +160,25 @@ def exp(solvers, penalties, single_target, n_samples=30000, max_iter=20,
X = X[:n_samples]
y = y[:n_samples]

cached_fit = mem.cache(fit_single)
# cached_fit = mem.cache(fit_single)
Member

Maybe you should delete this line

if self.solver in ['newton-cg']:
_dtype = [np.float64, np.float32]
else:
if self.solver in ['lbfgs', 'liblinear']:
Member

It would be good to document in the Notes section of the object which solvers can handle which dtypes.

Member Author

Done!

Member Author

Although I could improve the wording…

# # tol
# cdef {{c_type}} tol = <{{c_type}}> _tol
# # intercept_decay
# cdef {{c_type}} intercept_decay = <{{c_type}}> _intercept_decay
Member

I think that this should be cleaned up.

@jnothman
Member

jnothman commented Jun 3, 2018

I've not looked through in detail... A broad question: should we prefer Tempita over fused types in general? Why / why not?

@NelleV, I don't get what the "I am not able to reproduce those benchmarks" above means...?

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 4, 2018 via email

'sklearn/utils/seq_dataset.pxd.tp']

for pyxfiles in pyx_templates:
outfile = pyxfiles[:-3]
Member

small comment: I find .replace('.tp', '') slightly more readable.
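For what it's worth, the two spellings differ subtly: slicing strips exactly the trailing three characters, while str.replace removes every occurrence of the substring anywhere in the path:

```python
template = 'sklearn/linear_model/sag_fast.pyx.tp'
assert template[:-3] == template.replace('.tp', '')  # identical here

# ...but not in general: a hypothetical path containing '.tp' elsewhere
tricky = 'some.tp.dir/file.pyx.tp'
print(tricky[:-3])                # some.tp.dir/file.pyx
print(tricky.replace('.tp', ''))  # some.dir/file.pyx
```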

Contributor

I should take mental note of this one.

@@ -34,6 +36,20 @@ def configuration(parent_package='', top_path=None):
[]),
**blas_info)

# generate sag_fast from template
sag_cython_file = 'sklearn/linear_model/sag_fast.pyx.tp'
sag_file = sag_cython_file[:-3]
Member

small comment: I find .replace('.tp', '') slightly more readable.

@@ -243,8 +243,9 @@ def sag_solver(X, y, sample_weight=None, loss='log', alpha=1., beta=0.,
max_iter = 1000

if check_input:
X = check_array(X, dtype=np.float64, accept_sparse='csr', order='C')
y = check_array(y, dtype=np.float64, ensure_2d=False, order='C')
_dtype = [np.float64, np.float32]
Member

This does not seem to be covered by any of the tests.
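For context, passing a list of dtypes to check_array keeps the input dtype when it is already in the list and otherwise converts to the first entry, so a test covering this line only needs a float32 input — a sketch:

```python
import numpy as np
from sklearn.utils import check_array

X32 = np.ones((3, 2), dtype=np.float32)
X_int = np.ones((3, 2), dtype=np.int64)

dtype_list = [np.float64, np.float32]
print(check_array(X32, dtype=dtype_list).dtype)    # float32 (preserved)
print(check_array(X_int, dtype=dtype_list).dtype)  # float64 (first entry)
```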

Member

I see a pattern that we have applied in some places: the solvers always check the inputs while the class level does not (e.g. KMeans).

Do we want to extend that pattern here, leave these lines uncovered, or remove them?


if 'coef' in warm_start_mem.keys():
coef_init = warm_start_mem['coef']
else:
# assume fit_intercept is False
coef_init = np.zeros((n_features, n_classes), dtype=np.float64,
coef_init = np.zeros((n_features, n_classes), dtype=X.dtype,
Member

This does not seem to be covered by any of the tests.

Member

@TomDLT as mentioned by @lesteve, this line is not covered since LogisticRegression and Ridge always pass an array with initialized coefficients.

Is there a danger in removing this line? Could sag_solver be used as a standalone function?

@NelleV NelleV force-pushed the henley_is_8769_minus_merged branch from 34596c6 to 8d89662 Compare July 8, 2018 22:52
@NelleV
Member Author

NelleV commented Jul 9, 2018

I broke everything in the rebase, so I'll need a bit of time to fix things up…

@GaelVaroquaux
Member

@NelleV : any update on this? I was hoping to merge.

@NelleV
Member Author

NelleV commented Jul 16, 2018

Not yet, but I can make it a priority starting tomorrow or Wednesday. I had almost nothing left to do, but apparently broke a huge number of things in my rebase.
Once I fix my rebase, it'll be a minimal amount of work.

@glemaitre
Member

@NelleV I removed a couple of pieces of outdated code and changed the precision in a test. I will now try to review the changes.

@glemaitre
Member

We need 2 additional tests:

  • a test with check_input=True for sag_solver
  • a test with a cold start to initialize the coefficients
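A minimal sketch of the first of these tests (the import path is the one from this era; in later releases sag_solver lives in the private sklearn.linear_model._sag module, hence the fallback):

```python
import numpy as np
from sklearn.datasets import load_iris
try:
    from sklearn.linear_model.sag import sag_solver   # path at the time of this PR
except ImportError:
    from sklearn.linear_model._sag import sag_solver  # newer releases

X, y = load_iris(return_X_y=True)
X32 = np.ascontiguousarray(X, dtype=np.float32)
y32 = np.ascontiguousarray(y, dtype=np.float32)

# check_input=True exercises the check_array(..., dtype=[float64, float32])
# path: float32 input should yield float32 coefficients.
coef, n_iter, warm_start_mem = sag_solver(X32, y32, loss='squared',
                                          check_input=True)
assert coef.dtype == np.float32
```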

@glemaitre
Member

I think that we should add this support in ridge_regression as well.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 19, 2018 via email

@NelleV
Member Author

NelleV commented Jul 19, 2018

So, I almost force-pushed on top of all the changes in this branch… In general, it'd be nice to have a heads up before pushing to someone else's branch.

@glemaitre
Member

In general, it'd be nice to have a heads up before pushing to someone else's branch.

You're right. Euphoria of the sprint, I assume... Sorry about that.

@glemaitre
Member

This line seems weird:

I still see some places where the dtype is converted:

The first case is not triggered because we usually do not pass check_input=False, while the second one is not tested.

Member

@glemaitre glemaitre left a comment

A couple of comments. I did not finish going through all the tests.

I think that, to be complete, we need a test which passes a sample_weight and X with mixed dtypes, to be sure that sample_weight is cast to the dtype of X.

The conversion is done but the test is not there.

xi_data_32, _, _ = xi_32
xi_data_64, _, _ = xi_64

assert_equal(xi_data_32.dtype, np.float32)
Member

We need to use assert ... == ...

xi_data_64, _, _ = xi_64

assert_equal(xi_data_32.dtype, np.float32)
assert_equal(xi_data_64.dtype, np.float64)
Member

We need to use assert ... == ...

@@ -161,17 +195,18 @@ cdef class SequentialDataset:

def _sample_py(self, int current_index):
"""python function used for easy testing"""
cdef double* x_data_ptr
Member

I am unsure about this function for testing.

Both double and float in C are converted to float in Python. Therefore, we cannot check that we have the right type, so the *_py functions are not useful for type checking, I think.
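A quick illustration of the point (plain Python/numpy, no assumptions):

```python
import numpy as np

# A C double and a C float both surface in Python as the built-in float,
# so a Python-level accessor cannot reveal the inner precision...
assert isinstance(float(np.float32(1.5)), float)
assert isinstance(float(np.float64(1.5)), float)

# ...whereas numpy scalars keep the dtype information:
print(np.float32(1.5).dtype, np.float64(1.5).dtype)  # float32 float64
```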


assert_equal(xicsr_data_32.dtype, np.float32)
assert_equal(xicsr_data_64.dtype, np.float64)
assert isinstance(yicsr_32, float)
Member

@glemaitre glemaitre Jul 19, 2018

This looks weird, since I expected to get a np.float32.
This is because _sample_py converts the C double to a Python float.
If we wanted to get the proper inner type, we would need to return a value from the numpy array, which would not be converted and would keep the proper dtype.
Actually there is no way to know, so I would remove the check.

xicsr_data_32, _, _ = xicsr_32
xicsr_data_64, _, _ = xicsr_64

assert_equal(xicsr_data_32.dtype, np.float32)
Member

We need to use assert ... == ...

assert_array_almost_equal(xicsr_data_64, xicsr_data_32, decimal=5)
assert_array_almost_equal(yicsr_64, yicsr_32, decimal=5)

assert_array_equal(xi_data_32, xicsr_data_32)
Member

Shouldn't we use assert_allclose here as well, since these are floating-point comparisons?

assert_equal(xi_data_64.dtype, np.float64)
assert isinstance(yi_32, float)
assert isinstance(yi_64, float)
# assert_array_almost_equal(yi_64, yi_32, decimal=5)
Member

We could make this check here with assert_allclose and a large enough tolerance accounting for the cast
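Something along these lines (the rtol value is an assumption; float32 rounding introduces a relative error of at most about 6e-8, so 1e-6 leaves comfortable headroom):

```python
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(0)
yi_64 = rng.uniform(size=10)      # float64 reference values
yi_32 = yi_64.astype(np.float32)  # same values after the cast

# The tolerance only needs to absorb float32 rounding error.
assert_allclose(yi_64, yi_32, rtol=1e-6)
```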


# Check accuracy consistency
lr_64 = LogisticRegression(solver=solver, multi_class=multi_class)
lr_64.fit(X_64, y_64)
assert_equal(lr_64.coef_.dtype, X_64.dtype)
assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32))
tolerance = 5 if solver != 'saga' else 3
assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32),
Member

assert_allclose is better

assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32),
decimal=tolerance)
# Do all asserts at once (it facilitate to interactive type check)
assert_equal(lr_32_coef.dtype, X_32.dtype)
Member

we can use the plain assert

assert_equal(lr_32_coef.dtype, X_32.dtype)
assert_equal(lr_64_coef.dtype, X_64.dtype)
assert_equal(lr_32_sparse.coef_.dtype, X_sparse_32.dtype)
assert_almost_equal(lr_32_coef, lr_64_coef.astype(np.float32),
Member

is it a single coefficient or an array?

@@ -6,17 +6,25 @@
from numpy.testing import assert_array_equal
import scipy.sparse as sp

from sklearn.utils.seq_dataset import ArrayDataset, CSRDataset
from sklearn.utils.seq_dataset import ArrayDataset64 as ArrayDataset
Member

It could be confusing; I would keep the 64 at the end and update the tests accordingly.

@@ -6,17 +6,25 @@
from numpy.testing import assert_array_equal
import scipy.sparse as sp

from sklearn.utils.seq_dataset import ArrayDataset, CSRDataset
from sklearn.utils.seq_dataset import ArrayDataset64 as ArrayDataset
from sklearn.utils.seq_dataset import CSRDataset64 as CSRDataset
Member

I would keep the 64

y32 = iris.target.astype(np.float32)
X_csr32 = sp.csr_matrix(X32)
sample_weight32 = np.arange(y32.size, dtype=np.float32)


def assert_csr_equal(X, Y):
Member

We should probably test CSRDataset32 as well

# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False
#
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
Member

We should probably keep this part and repeat it in both template files

@rth
Member

rth commented Jul 27, 2018

It would be nice to have this in 0.20. @NelleV would you have time to address the last review comments, or should someone else continue it?

@amueller amueller modified the milestones: 0.20, 0.21 Aug 17, 2018
@massich
Contributor

massich commented Feb 25, 2019

@NelleV Can I take the PR back? I'll be working on it in the Paris sprint

@GaelVaroquaux
Member

Closing as overridden by #11643 and #13243
