
[MRG] LogisticRegression convert to float64 (sag) #9020


Closed
wants to merge 26 commits into from

Conversation

Henley13
Contributor

@Henley13 Henley13 commented Jun 6, 2017

Reference Issue

Works on #8769 (for SAG solver), Fixes #9040

What does this implement/fix? Explain your changes.

Prevents logistic regression from aggressively casting the data to np.float64 when np.float32 is supplied.
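The dtype handling at stake can be illustrated with a minimal sketch (plain numpy, not the actual scikit-learn validation code):

```python
import numpy as np

X32 = np.ones((5, 3), dtype=np.float32)

# old behavior: float32 input is unconditionally upcast to float64
X_old = X32.astype(np.float64)
assert X_old.dtype == np.float64

# intended behavior: a supplied float32 array keeps its dtype
X_new = X32.astype(X32.dtype, copy=False)
assert X_new.dtype == np.float32
```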

Any other comments?

This PR follows up on PR #8835

@arthurmensch
Contributor

Do you plan to handle all solvers?

@Henley13
Contributor Author

Henley13 commented Jun 7, 2017

@arthurmensch, you check the Cython scripts and I take care of the Python ones?
Maybe we could do sag and saga together in this PR, but for the rest it would be one solver per PR.

@arthurmensch
Contributor

arthurmensch commented Jun 7, 2017 via email

@pcerda

pcerda commented Jun 7, 2017

Hi, I also want to work on that.

@massich
Contributor

massich commented Jun 7, 2017

This PR is actually blocked by #9040. cc: @arthurmensch

@massich
Contributor

massich commented Jun 10, 2017

I'm working on it here

Joan Massich and others added 9 commits June 16, 2017 15:44
initial PR commit

seq_dataset.pyx generated from template

seq_dataset.pyx generated from template #2

rename variables

fused types consistency test for seq_dataset

a

sklearn/utils/tests/test_seq_dataset.py

new if statement

add doc

sklearn/utils/seq_dataset.pyx.tp

minor changes

minor changes

typo fix

check numeric accuracy only up 5th decimal

Address oliver's request for changing test name

add test for make_dataset and rename a variable in test_seq_dataset
@massich
Contributor

massich commented Jun 19, 2017

TODO:

  • Revert 3e25392
    (_multinomial_loss_grad is used only in testing to ensure consistent results. I think there is no need to test it in both 32 and 64 bits. The latter should be enough.)
  • Mimic the 64-bit behavior via templating, using from my_file import my_function64 as my_function
  • Add saga to the testing
  • Remove the mimicking behavior
  • Fix whatever we broke
  • Benchmark it!
  • Keep the coverage score (here)
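The aliasing trick in the second item can be sketched as follows (my_file and my_function64 are the placeholder names from the list above, not real sklearn identifiers):

```python
import numpy as np

# Placeholder 64-bit implementation standing in for a templated function.
def my_function64(x):
    return np.float64(x) * 2

# Binding the generic name to the 64-bit variant mimics the historical
# behavior, so existing call sites keep working unchanged; in a module this
# would read: from my_file import my_function64 as my_function
my_function = my_function64

# the generic name still produces 64-bit results, even for float32 input
assert my_function(np.float32(1.5)).dtype == np.float64
```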

@massich
Contributor

massich commented Jun 19, 2017

Up to the sag32 and sag64 calls in sag_fast.pyx, the calling parameters seem correct. From there on, however, somewhere a cast is not done properly (because the test breaks). The only non-array structure is wscale, but its type is created by the templating, and its initial computation using mixed types should not be a problem.

I have checked the code, but just reading it is not enough. Do you know of any way to trace the call further into the Cython code?

cc: @raghavrv, @ogrisel, @arthurmensch

Member

@raghavrv raghavrv left a comment

Some minor questions. But looks good to me overall. Thanks for the persistent efforts @massich @Henley13

.gitignore Outdated
@@ -65,3 +65,7 @@ benchmarks/bench_covertype_data/

# Used by py.test
.cache

# files generated from a template
sklearn/utils/seq.dataset.pyx
Member

--> seq_dataset.pyx

Member

Also add the sag_fast.pyx

@@ -5,6 +5,7 @@

from sklearn._build_utils import get_blas_info

from Cython import Tempita as tempita
Member

Why not just use the Tempita?

@@ -3,7 +3,11 @@
#ifdef _MSC_VER
# include <float.h>
# define skl_isfinite _finite
# define skl_isfinite32 _finite
Member

Could you add a comment saying why this change is needed?

# When re-declaring the functions in the template, Cython needs two
# distinct function names, one per input parameter type, because Cython
# does not support function overloading.

@jnothman can you help in deciding whether this comment is needed or whether it is obvious?

# array
dataset_32, _ = make_dataset(X_32, y_32, sample_weight_32)
dataset_64, _ = make_dataset(X_64, y_64, sample_weight_64)
xi_32, _, _, _ = dataset_32._next_py()
Member

Why are we not checking for y? cc: @Henley13

xi32, yi32, swi32, idx32 = dataset32._next_py()
xi64, yi64, swi64, idx64 = dataset64._next_py()

xi_data32, _, _ = xi32
Member

same comment as above. Why don't we check for y?

@Henley13 Henley13 changed the title [WIP] LogisticRegression convert to float64 (sag) [MRG] LogisticRegression convert to float64 (sag) Jun 20, 2017
@@ -32,7 +32,9 @@
from ..utils.extmath import safe_sparse_dot
from ..utils.sparsefuncs import mean_variance_axis, inplace_column_scale
from ..utils.fixes import sparse_lsqr
from ..utils.seq_dataset import ArrayDataset, CSRDataset
from ..utils.seq_dataset import ArrayDataset32, CSRDataset32
from ..utils.seq_dataset import ArrayDataset64 as ArrayDataset
Contributor Author

Should we keep the name "ArrayDataset"? Why not write the number of bits explicitly?

Contributor

IMHO that's how I would do it for the moment. This way you ensure that we don't break anything.

Member

+1 for preserving backward compatibility. The ArrayDataset class of scikit-learn 0.18.2 is 64-bit float based. Let's keep the same name as an alias. We could even set up the alias in sklearn/utils/seq_dataset, even if the Cython code is not an official part of the project's public API.

Member

@raghavrv raghavrv Jun 29, 2017

this also needs to be addressed

@massich massich force-pushed the is/8769 branch 2 times, most recently from d1634e7 to 8655f4a Compare June 28, 2017 12:11
@raghavrv raghavrv changed the title [MRG] LogisticRegression convert to float64 (sag) [MRG + 1] LogisticRegression convert to float64 (sag) Jun 29, 2017
@raghavrv raghavrv changed the title [MRG + 1] LogisticRegression convert to float64 (sag) [MRG] LogisticRegression convert to float64 (sag) Jun 29, 2017
@raghavrv
Member

This looks good to me. But I guess you want to do the benchmark. Ping me once the CIs pass and you are satisfied with this.


for i in range(5):
# next sample
xi32, yi32, _, _ = dataset32._next_py()
Contributor

@Henley13 this test cannot work. yi32 and yi64 are single elements, not arrays, so you cannot compare their dtypes against np.float32 or np.float64. Moreover, Python's built-in float type is always 64-bit. For a single value this is not a problem; it is only a problem for vectors.
I think we should revert the test asked for by @raghavrv.

@massich
Contributor

massich commented Jul 17, 2017

Ok, I need some help here.

That's what we generate from the template:

cdef class SequentialDataset64:
                        # .....  Some functions
    def _next_py(self):
        """python function used for easy testing"""
        cdef int current_index = self._get_next_index()
        return self._sample_py(current_index)
                        # ......  more functions
    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef double* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef double y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)
                        #  .... some stuff when X or y are sparse
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx


cdef class ArrayDataset32(SequentialDataset32):
               # ..... Similar things
    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef float* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef float y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)
                # ............................
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx

However this is not working:

    xi_32, yi_32, _, _ = dataset_32._next_py()
    xi_64, yi_64, _, _ = dataset_64._next_py()
    assert_equal(yi_32.dtype, np.float32)
    assert_equal(yi_64.dtype, np.float64)

It is not working because neither of them is a numpy element: they are both plain Python floats, and therefore 64-bit, regardless of the fact that in the containers where they are stored they are np.float32 and np.float64.

>>> type(yi_32)                                                                                                       
<class 'float'>                                                                                                       
>>> type(yi_64)                                                                                                       
<class 'float'> 

This might have to do with the fact that {{np_type}} is not used in seq_dataset.pyx.tp, but I'm not sure you can create a single-element {{np_type}}.
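The dtype loss can be reproduced with plain numpy: a C float or double returned to Python becomes a built-in float, while arrays keep their dtype:

```python
import numpy as np

# Converting a numpy scalar to a Python float drops the dtype, which is
# effectively what happens when Cython returns a C float/double to Python.
yi_32 = float(np.float32(1.5))
yi_64 = float(np.float64(1.5))
assert type(yi_32) is float and type(yi_64) is float  # both plain 64-bit floats

# Arrays, in contrast, keep their dtype across the boundary.
x32 = np.array([1.5], dtype=np.float32)
assert x32.dtype == np.float32
```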

@TomDLT
Member

TomDLT commented Jul 18, 2017

I think we should revert the test asked by @raghavrv.

I agree, let's not test the type of y.
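A trimmed check restricted to the x arrays could look like this sketch (plain numpy stand-ins for the samples; the real test pulls them from `_next_py()`):

```python
import numpy as np

# Stand-ins for the x samples returned by the 32- and 64-bit datasets.
xi_64 = np.linspace(0.0, 1.0, 5, dtype=np.float64)
xi_32 = xi_64.astype(np.float32)

# dtype is asserted on the x arrays only; y stays untested because it is
# returned to Python as a plain float.
assert xi_32.dtype == np.float32
assert xi_64.dtype == np.float64

# values should still agree up to float32 precision (cf. the 5-decimal check)
np.testing.assert_array_almost_equal(xi_32, xi_64, decimal=5)
```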

@ogrisel
Member

ogrisel commented May 31, 2018

I think we can close this in favor of #11155.

@ogrisel ogrisel closed this May 31, 2018