
[MRG] LogisticRegression convert to float64 (sag) #9020


Closed
wants to merge 26 commits into from

Conversation

Henley13
Contributor

@Henley13 Henley13 commented Jun 6, 2017

Reference Issue

Works on #8769 (for SAG solver), Fixes #9040

What does this implement/fix? Explain your changes.

Prevents logistic regression from aggressively casting the data to np.float64 when np.float32 is supplied.
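The dtype handling at stake can be illustrated with a minimal sketch (plain numpy, not the actual scikit-learn validation code):

```python
import numpy as np

X32 = np.ones((5, 3), dtype=np.float32)

# old behavior: float32 input is unconditionally upcast to float64
X_old = X32.astype(np.float64)
assert X_old.dtype == np.float64

# intended behavior: a supplied float32 array keeps its dtype
X_new = X32.astype(X32.dtype, copy=False)
assert X_new.dtype == np.float32
```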

Any other comments?

This PR follows up on PR #8835

@arthurmensch
Contributor

Do you plan to handle all solvers?

@Henley13
Contributor Author

Henley13 commented Jun 7, 2017

@arthurmensch, you check the Cython scripts and I take care of the Python ones?
Maybe we could do sag and saga together in this PR, but for the rest it would be one solver per PR.

@arthurmensch
Contributor

arthurmensch commented Jun 7, 2017 via email

@pcerda

pcerda commented Jun 7, 2017

Hi, I also want to work on that.

@massich
Contributor

massich commented Jun 7, 2017

This PR is actually blocked by #9040. cc: @arthurmensch

@massich
Contributor

massich commented Jun 10, 2017

I'm working on it here

Joan Massich and others added 9 commits June 16, 2017 15:44
initial PR commit

seq_dataset.pyx generated from template

seq_dataset.pyx generated from template #2

rename variables

fused types consistency test for seq_dataset

a

sklearn/utils/tests/test_seq_dataset.py

new if statement

add doc

sklearn/utils/seq_dataset.pyx.tp

minor changes

minor changes

typo fix

check numeric accuracy only up 5th decimal

Address oliver's request for changing test name

add test for make_dataset and rename a variable in test_seq_dataset
@massich
Contributor

massich commented Jun 19, 2017

TODO:

  • Revert 3e25392
    (_multinomial_loss_grad is used only in testing to ensure consistent results. I think there is no need to test it in both 32 and 64 bits. The latter should be enough.)
  • Mimic the 64-bit behavior via templating, using from my_file import my_function64 as my_function
  • Add saga to the testing
  • Remove the mimicking behavior
  • Fix whatever we broke
  • Benchmark it!
  • Keep the coverage score (here)
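The aliasing trick in the second item can be sketched as follows (my_file and my_function64 are the placeholder names from the list above, not real sklearn identifiers):

```python
import numpy as np

# Placeholder 64-bit implementation standing in for a templated function.
def my_function64(x):
    return np.float64(x) * 2

# Binding the generic name to the 64-bit variant mimics the historical
# behavior, so existing call sites keep working unchanged; in a module this
# would read: from my_file import my_function64 as my_function
my_function = my_function64

# the generic name still produces 64-bit results, even for float32 input
assert my_function(np.float32(1.5)).dtype == np.float64
```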

@massich
Contributor

massich commented Jun 19, 2017

Up to the sag32 and sag64 calls in sag_fast.pyx, the calling parameters seem correct. From there on, however, somewhere a cast is not done properly (because the test breaks). The only non-array structure is wscale, but its type is created by the templating, and its initial computation using mixed types should not be a problem.

I have checked the code, but just reading it is not enough. Do you know of any way to trace the call further into the Cython code?

cc: @raghavrv, @ogrisel, @arthurmensch

Member

@raghavrv raghavrv left a comment

Some minor questions. But looks good to me overall. Thanks for the persistent efforts @massich @Henley13

.gitignore Outdated
@@ -65,3 +65,7 @@ benchmarks/bench_covertype_data/

# Used by py.test
.cache

# files generated from a template
sklearn/utils/seq.dataset.pyx
Member

--> seq_dataset.pyx

Member

Also add the sag_fast.pyx

@@ -5,6 +5,7 @@

from sklearn._build_utils import get_blas_info

from Cython import Tempita as tempita
Member

Why not just use the Tempita?

@@ -3,7 +3,11 @@
#ifdef _MSC_VER
# include <float.h>
# define skl_isfinite _finite
# define skl_isfinite32 _finite
Member

Could you add a comment saying why this change is needed?

# When re-declaring the functions in the template, Cython needs two
# distinct function names, one per input parameter type, because Cython
# does not support function overloading.

@jnothman can you help in deciding whether this comment is needed or whether it is obvious?

# array
dataset_32, _ = make_dataset(X_32, y_32, sample_weight_32)
dataset_64, _ = make_dataset(X_64, y_64, sample_weight_64)
xi_32, _, _, _ = dataset_32._next_py()
Member

Why are we not checking for y? cc: @Henley13

xi32, yi32, swi32, idx32 = dataset32._next_py()
xi64, yi64, swi64, idx64 = dataset64._next_py()

xi_data32, _, _ = xi32
Member

same comment as above. Why don't we check for y?

@Henley13 Henley13 changed the title [WIP] LogisticRegression convert to float64 (sag) [MRG] LogisticRegression convert to float64 (sag) Jun 20, 2017
@@ -32,7 +32,9 @@
from ..utils.extmath import safe_sparse_dot
from ..utils.sparsefuncs import mean_variance_axis, inplace_column_scale
from ..utils.fixes import sparse_lsqr
from ..utils.seq_dataset import ArrayDataset, CSRDataset
from ..utils.seq_dataset import ArrayDataset32, CSRDataset32
from ..utils.seq_dataset import ArrayDataset64 as ArrayDataset
Contributor Author

Should we keep the name "ArrayDataset"? Why not write the number of bits explicitly?

Contributor

IMHO that's how I would do it for the moment. This way you ensure that we don't break anything.

Member

+1 for preserving backward compatibility. The ArrayDataset class of scikit-learn 0.18.2 is 64-bit float based. Let's keep the same name as an alias. We could even set up the alias in sklearn/utils/seq_dataset, even if the Cython code is not an official part of the project's public API.

Member

@raghavrv raghavrv Jun 29, 2017

this also needs to be addressed

@massich massich force-pushed the is/8769 branch 2 times, most recently from d1634e7 to 8655f4a Compare June 28, 2017 12:11
@raghavrv raghavrv changed the title [MRG] LogisticRegression convert to float64 (sag) [MRG + 1] LogisticRegression convert to float64 (sag) Jun 29, 2017
@raghavrv raghavrv changed the title [MRG + 1] LogisticRegression convert to float64 (sag) [MRG] LogisticRegression convert to float64 (sag) Jun 29, 2017
@raghavrv
Member

This looks good to me. But I guess you want to do the benchmark. Ping me once the CIs pass and you are satisfied with this.


for i in range(5):
# next sample
xi32, yi32, _, _ = dataset32._next_py()
Contributor

@Henley13 this test cannot work. yi32 and yi64 are single elements, not arrays, so you cannot compare their dtypes against np.float32 or np.float64. Moreover, Python's built-in float type is always 64-bit. For a single value this is not a problem; it is only a problem for vectors.
I think we should revert the test asked for by @raghavrv.

@massich
Contributor

massich commented Jul 17, 2017

Ok, I need some help here.

That's what we generate from the template:

cdef class SequentialDataset64:
                        # .....  Some functions
    def _next_py(self):
        """python function used for easy testing"""
        cdef int current_index = self._get_next_index()
        return self._sample_py(current_index)
                        # ......  more functions
    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef double* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef double y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)
                        #  .... some stuff when X or y are sparse
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx


cdef class ArrayDataset32(SequentialDataset32):
               # ..... Similar things
    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef float* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef float y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)
                # ............................
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx

However this is not working:

    xi_32, yi_32, _, _ = dataset_32._next_py()
    xi_64, yi_64, _, _ = dataset_64._next_py()
    assert_equal(yi_32.dtype, np.float32)
    assert_equal(yi_64.dtype, np.float64)

It is not working because neither of them is a numpy element: they are both plain Python floats, and therefore 64-bit, regardless of the fact that in the containers where they are stored they are np.float32 and np.float64.

>>> type(yi_32)                                                                                                       
<class 'float'>                                                                                                       
>>> type(yi_64)                                                                                                       
<class 'float'> 

This might have to do with the fact that {{np_type}} is not used in seq_dataset.pyx.tp, but I'm not sure you can create a single-element {{np_type}}.
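The dtype loss can be reproduced with plain numpy: a C float or double returned to Python becomes a built-in float, while arrays keep their dtype:

```python
import numpy as np

# Converting a numpy scalar to a Python float drops the dtype, which is
# effectively what happens when Cython returns a C float/double to Python.
yi_32 = float(np.float32(1.5))
yi_64 = float(np.float64(1.5))
assert type(yi_32) is float and type(yi_64) is float  # both plain 64-bit floats

# Arrays, in contrast, keep their dtype across the boundary.
x32 = np.array([1.5], dtype=np.float32)
assert x32.dtype == np.float32
```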

@TomDLT
Member

TomDLT commented Jul 18, 2017

I think we should revert the test asked by @raghavrv.

I agree, let's not test the type of y.
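A trimmed check restricted to the x arrays could look like this sketch (plain numpy stand-ins for the samples; the real test pulls them from `_next_py()`):

```python
import numpy as np

# Stand-ins for the x samples returned by the 32- and 64-bit datasets.
xi_64 = np.linspace(0.0, 1.0, 5, dtype=np.float64)
xi_32 = xi_64.astype(np.float32)

# dtype is asserted on the x arrays only; y stays untested because it is
# returned to Python as a plain float.
assert xi_32.dtype == np.float32
assert xi_64.dtype == np.float64

# values should still agree up to float32 precision (cf. the 5-decimal check)
np.testing.assert_array_almost_equal(xi_32, xi_64, decimal=5)
```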

@ogrisel
Member

ogrisel commented May 31, 2018

I think we can close this in favor of #11155.

@ogrisel ogrisel closed this May 31, 2018