[MRG] LogisticRegression convert to float64 (sag) #9020
Conversation
Do you plan to handle all solvers?
@arthurmensch, you check the Cython scripts and I take care of the Python ones?
I'll do a separate PR for sag and saga.

> On Wed, Jun 7, 2017, 10:39 AM Arthur Imbert wrote:
> @arthurmensch, you check the cython scripts and I take care of the python ones?
> Maybe we could do sag and saga together in this PR, but for the rest it would be one solver per PR.
Hi, I also want to work on that.
This PR is actually blocked by #9040. cc: @arthurmensch
I'm working on it here.
Commits:
- initial PR commit
- seq_dataset.pyx generated from template
- seq_dataset.pyx generated from template #2
- rename variables
- fused types consistency test for seq_dataset (sklearn/utils/tests/test_seq_dataset.py)
- new if statement
- add doc sklearn/utils/seq_dataset.pyx.tp
- minor changes
- minor changes
- typo fix
- check numeric accuracy only up to 5th decimal
- Address oliver's request for changing test name
- add test for make_dataset and rename a variable in test_seq_dataset
TODO:
cc: @raghavrv, @ogrisel, @arthurmensch
.gitignore (outdated):

```diff
@@ -65,3 +65,7 @@ benchmarks/bench_covertype_data/

+# Used by py.test
+.cache
+
+# files generated from a template
+sklearn/utils/seq.dataset.pyx
```
--> seq_dataset.pyx
Also add the sag_fast.pyx
sklearn/linear_model/setup.py (outdated):

```diff
@@ -5,6 +5,7 @@

 from sklearn._build_utils import get_blas_info

+from Cython import Tempita as tempita
```
Why not just use `Tempita`?
```diff
@@ -3,7 +3,11 @@
 #ifdef _MSC_VER
 # include <float.h>
 # define skl_isfinite _finite
+# define skl_isfinite32 _finite
```
Could you add a comment saying why this change is needed?
# When re-declaring the functions in the template, one per input
# parameter type, two distinct functions are needed because Cython
# doesn't support function overloading.
@jnothman, can you help decide whether this comment is needed or whether this is obvious?
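The pattern behind the template approach can be sketched in plain Python (hypothetical names and a plain `string.Template` stand-in; the real code uses Cython's Tempita to render the `.pyx.tp` files): one declaration is emitted per input type, since Cython cannot overload a single name.

```python
from string import Template

# Hypothetical sketch, not the project's actual template: render one
# declaration per dtype, mirroring how a Tempita template expands into
# separate skl_isfinite / skl_isfinite32 variants.
decl = Template("cdef inline bint skl_isfinite$suffix($ctype value):\n"
                "    return isfinite(value)")

variants = [("", "double"), ("32", "float")]
rendered = "\n\n".join(decl.substitute(suffix=s, ctype=t) for s, t in variants)
print(rendered)
```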
```python
# array
dataset_32, _ = make_dataset(X_32, y_32, sample_weight_32)
dataset_64, _ = make_dataset(X_64, y_64, sample_weight_64)
xi_32, _, _, _ = dataset_32._next_py()
```
Why are we not checking for `y`? cc: @Henley13
```python
xi32, yi32, swi32, idx32 = dataset32._next_py()
xi64, yi64, swi64, idx64 = dataset64._next_py()

xi_data32, _, _ = xi32
```
Same comment as above: why don't we check for `y`?
```diff
@@ -32,7 +32,9 @@
 from ..utils.extmath import safe_sparse_dot
 from ..utils.sparsefuncs import mean_variance_axis, inplace_column_scale
 from ..utils.fixes import sparse_lsqr
-from ..utils.seq_dataset import ArrayDataset, CSRDataset
+from ..utils.seq_dataset import ArrayDataset32, CSRDataset32
+from ..utils.seq_dataset import ArrayDataset64 as ArrayDataset
```
Should we keep the name "ArrayDataset"? Why not write the number of bits explicitly?
IMHO that's how I would do it for the moment. This way you ensure that we don't break anything.
+1 for preserving backward compatibility. The ArrayDataset class of scikit-learn 0.18.2 is 64-bit-float based, so let's keep the same name as an alias. We could even set up the alias in sklearn/utils/seq_dataset, even though the Cython code is not an official part of the project's public API.
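The aliasing proposal can be sketched like this (hypothetical plain-Python stand-ins for the Cython classes; only the aliasing pattern is the point):

```python
import numpy as np

# Hypothetical sketch: the dtype-suffixed classes are the real
# implementations, and the historical name stays available as an alias
# to the 64-bit variant, matching the float64-based ArrayDataset of
# scikit-learn 0.18.2.
class ArrayDataset32:
    dtype = np.float32

class ArrayDataset64:
    dtype = np.float64

# Backward-compatible alias: existing imports of ArrayDataset keep working.
ArrayDataset = ArrayDataset64

print(ArrayDataset is ArrayDataset64, ArrayDataset.dtype)
```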
this also needs to be addressed
Force-pushed d1634e7 to 8655f4a.
This looks good to me. But I guess you want to do the benchmark. Ping me once the CIs pass and you are satisfied with this.
```python
for i in range(5):
    # next sample
    xi32, yi32, _, _ = dataset32._next_py()
```
@Henley13 this test cannot work. `yi32` and `yi64` are single elements, not arrays, so you cannot compare them to `np.float32` or `np.float64`. Moreover, Python's built-in float type is always 64-bit. For a single value this is not a problem; it is only a problem for vectors.
I think we should revert the test asked for by @raghavrv.
OK, I need some help here. This is what we generate from the template:

```cython
cdef class SequentialDataset64:
    # ..... some functions

    def _next_py(self):
        """python function used for easy testing"""
        cdef int current_index = self._get_next_index()
        return self._sample_py(current_index)

    # ...... more functions

    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef double* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef double y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)

        # .... some stuff when X or y are sparse
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx


cdef class ArrayDataset32(SequentialDataset32):
    # ..... similar things

    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef float* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef float y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)

        # ............................
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx
```

However, this is not working:

```python
xi_32, yi_32, _, _ = dataset_32._next_py()
xi_64, yi_64, _, _ = dataset_64._next_py()
assert_equal(yi_32.dtype, np.float32)
assert_equal(yi_64.dtype, np.float64)
```

It is not working because neither of them is a NumPy type:

```python
>>> type(yi_32)
<class 'float'>
>>> type(yi_64)
<class 'float'>
```

This might have to do with the fact that …
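The behaviour described above can be reproduced with plain NumPy, without the dataset classes: extracting a single element through `.item()` (analogous to a C `double`/`float` crossing back into Python) yields the built-in Python `float`, which is always 64-bit and has no `dtype` attribute.

```python
import numpy as np

x32 = np.array([1.5], dtype=np.float32)
x64 = np.array([1.5], dtype=np.float64)

# .item() converts a NumPy scalar to the nearest built-in Python type,
# so the 32-bit vs 64-bit distinction is lost for single values.
yi_32 = x32[0].item()
yi_64 = x64[0].item()

print(type(yi_32), type(yi_64))  # both: <class 'float'>
```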
I agree, let's not test the type of `y`.
I think we can close in favor of #11155.
Reference Issue

Works on #8769 (for the SAG solver). Fixes #9040.

What does this implement/fix? Explain your changes.

Avoids logistic regression aggressively casting the data to `np.float64` when `np.float32` is supplied.

Any other comments?

This PR follows up on PR #8835.
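The behaviour this PR targets can be sketched with hypothetical helpers (not scikit-learn's actual validation code): the old path always casts input to `np.float64`, while the new path lets `np.float32` pass through unchanged.

```python
import numpy as np

def validate_aggressive(X):
    # Old behaviour: everything is cast to float64, even float32 input.
    return np.asarray(X, dtype=np.float64)

def validate_preserving(X):
    # New behaviour: float32 and float64 pass through unchanged; other
    # dtypes still fall back to float64.
    X = np.asarray(X)
    if X.dtype not in (np.float32, np.float64):
        X = X.astype(np.float64)
    return X

X = np.ones((3, 2), dtype=np.float32)
print(validate_aggressive(X).dtype)   # float64
print(validate_preserving(X).dtype)   # float32
```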