[MRG] Fused type makedataset #9040
Conversation
ping: @raghavrv
The idea was to import the `floating` fused type from Cython. However, fused types cannot be used as class attributes, and we have several of them. A possible workaround is to store the value behind a `void *` pointer and cast it back when needed:

from cython cimport floating

cdef class Printer:
    # We want num to behave like a fused type, so declare it as void*
    cdef void *num
    cdef bint is_float
    # Concrete attributes used as fused-type arguments
    cdef float float_sample
    cdef double double_sample

    def __init__(self):
        cdef float num = 5.0
        self.num = &num
        if type(num) is float:
            self.is_float = True
        else:
            self.is_float = False
        if self.is_float:
            self._print(self.float_sample)
        else:
            self._print(self.double_sample)

    # Underlying function
    def _print(self, floating sample):
        # Typecast the pointer when we want to access its value
        cdef floating *num = <floating *>self.num
        print(num[0])

@jnothman's concern is that
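The same void-pointer trick can be mimicked in plain Python with ctypes. This is a standalone analogue of the sketch above, not the PR's code; the variable names are mine:

```python
import ctypes

# Store a single-precision value behind an untyped pointer, remembering
# which concrete type it was (analogue of the void* / is_float trick).
value = ctypes.c_float(5.0)
ptr = ctypes.cast(ctypes.pointer(value), ctypes.c_void_p)
is_float = True

# Cast the untyped pointer back to the right concrete type before
# dereferencing it, just like <floating*>self.num in the Cython sketch.
concrete = ctypes.c_float if is_float else ctypes.c_double
out = ctypes.cast(ptr, ctypes.POINTER(concrete)).contents.value
print(out)  # → 5.0
```

The cost of this approach is that every access goes through a runtime branch plus a cast, which is what the later template-based solution avoids.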
Talking to @ogrisel, @amueller and @raghavrv, we could do something similar to what has been done in pandas. Can you give us any hints on how to do that?
@arthurmensch, can you take a look at all this?
Example of how pandas uses cython templates to have versions of the same function for different dtypes: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby_helper.pxi.in
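As a rough illustration of what such templating does (using plain `str.format` instead of the Tempita syntax pandas and this PR use; the `mean64`/`mean32` names are made up for the example), each dtype entry expands to its own specialized copy of the code:

```python
# Minimal sketch of dtype templating: one template, one generated
# function per (name suffix, c_type) pair.
TEMPLATE = '''
def mean{name}(values):
    # values are treated as {c_type} numbers
    total = 0.0
    for v in values:
        total += v
    return total / len(values)
'''

namespace = {}
for name, c_type in [('64', 'double'), ('32', 'float')]:
    code = TEMPLATE.format(name=name, c_type=c_type)
    exec(code, namespace)

print(sorted(k for k in namespace if k.startswith('mean')))  # → ['mean32', 'mean64']
print(namespace['mean64']([1.0, 2.0, 3.0]))                  # → 2.0
```

In the real `.pyx.tp` files the expansion happens at build time, producing a plain `.pyx` file that Cython then compiles, so there is no runtime dispatch cost.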
Thanks @jorisvandenbossche. I think this is what @ogrisel suggested yesterday IRL... We need a code-generation step. I'm trying to see if we can avoid that... Most of the class attributes are not data themselves but pointers, and pointers can actually be stored untyped. BTW, why does this PR diff have unrelated changes?
(force-pushed from b80cd39 to de9189e)
@raghavrv, sorry for the unrelated changes.
(force-pushed from 9e13ead to 1bcee61)
Can you give us feedback on this?
sklearn/linear_model/base.py (Outdated)

@@ -56,12 +57,19 @@ def make_dataset(X, y, sample_weight, random_state=None):
    # seed should never be 0 in SequentialDataset
    seed = rng.randint(1, np.iinfo(np.int32).max)

    if isinstance(X, np.float32):
Shouldn't you be checking the dtype, rather than X?
Something like X.dtype == np.float32
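For context (a small standalone check, not from the PR): `isinstance` tests the container's Python type, while the dtype comparison is what actually distinguishes precisions:

```python
import numpy as np

X = np.zeros((2, 3), dtype=np.float32)

# isinstance inspects the object itself, which is an ndarray,
# never an np.float32 scalar:
print(isinstance(X, np.float32))  # → False

# Comparing the dtype is what actually detects single precision:
print(X.dtype == np.float32)      # → True
```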
Done
sklearn/utils/seq_dataset.pxd.tp (Outdated)

@@ -0,0 +1,69 @@
"""Dataset abstractions for sequential data access. """
It would be useful here to point out what kind of file this is (a Cython template) and give a URL where the format is described. Such files are uncommon, and people might be surprised.
sklearn/utils/seq_dataset.pyx.tp (Outdated)

from libc.limits cimport INT_MAX
cimport numpy as np
import numpy as np
{{py:
I don't think those strings need to be inside the {{py part (so I would leave the author information and the note you added at the top, before the {{py block).
Looks fine to me, apart from minor comments. I would have preferred to go with void * pointers, as I think this does not hinder performance much, but I guess you discussed this. Let's wait for CI.
sklearn/utils/seq_dataset.pxd.tp (Outdated)

"""

# name, c_type
dtypes = [('', 'double'),
['64', 'double']
sounds more consistent
I agree. But doing that, we would need to rework sag_fast.pyx, for example... We could put it off to another PR (#9020) or just rename SequentialDataset in sag_fast.pyx to SequentialDataset64.
sklearn/utils/seq_dataset.pyx.tp (Outdated)

@@ -297,4 +334,4 @@ cdef inline np.uint32_t our_rand_r(np.uint32_t* seed) nogil:
    seed[0] ^= <np.uint32_t>(seed[0] >> 17)
    seed[0] ^= <np.uint32_t>(seed[0] << 5)

-    return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)
+    return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)
EOF
done
for i in range(5):
    # next sample
    xi_32, yi32, swi32, idx32 = dataset32._next_py()
    xi_, yi, swi, idx = dataset64._next_py()
Check name consistency
done
xi_data32, _, _ = xi_32
xi_data, _, _ = xi_
assert_equal(xi_data32.dtype, np.float32)
assert_equal(xi_data.dtype, np.float64)
assert_array_almost_equal(xi_data64.astype('double'), xi_data32)
?
Good idea
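To see why an almost-equal comparison is the right tool here (a standalone illustration, not PR code): a float64 → float32 round-trip perturbs the values slightly, so exact equality fails, but the default 6-decimal tolerance of assert_array_almost_equal holds:

```python
import numpy as np

x64 = np.array([0.1, 0.2, 0.3], dtype=np.float64)
x32 = x64.astype(np.float32)

# The round-trip back to double is not bit-identical...
print(np.array_equal(x32.astype('double'), x64))  # → False

# ...but the arrays agree to roughly 7 significant digits, well
# within the default decimal=6 tolerance:
np.testing.assert_array_almost_equal(x32.astype('double'), x64)
```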
sklearn/utils/setup.py (Outdated)

'sklearn/utils/seq_dataset.pxd.tp']

for pyxfiles in pyx_templates:
    assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))
useless
removed
sklearn/utils/setup.py (Outdated)

for pyxfiles in pyx_templates:
    assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))
    outfile = pyxfiles[:-3]
    # if .pxi.in is not updated, no need to output .pxi
Edit the comment to match the code. Maybe note that this is a good idea for cythonization?
Can you elaborate?
xi_data32, _, _ = xi_32
xi_data, _, _ = xi_
assert_equal(xi_data32.dtype, np.float32)
assert_equal(xi_data.dtype, np.float64)
Remember to add the generated files to .gitignore.
Yes it's done!
sklearn/utils/seq_dataset.pxd.tp (Outdated)

"""Dataset abstractions for sequential data access. """

cimport numpy as np
{{py:
Shouldn't you put the header outside the {{py block? It would be cleaner.
I put it just below, actually. First comes the code defining the template (in the braces {{py ...}}), and then the template itself. Only the second part is generated.
sklearn/utils/seq_dataset.pyx.tp
Outdated
# | ||
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com> | ||
# Arthur Imbert <arthurimbert05@gmail.com> | ||
# License: BSD 3 clause |
You should move author and license outside the generation loop.
If I do that, they won't appear in the generated file. Is that what you want?
Yes. They won't be in the repo anyway
done :)
The Windows test failure looks like a numerical accuracy issue; the tolerance should be relaxed when testing with 32-bit floats (but kept precise when running with 64-bit floats).
Also, please use informative commit messages.
@ogrisel I changed the numerical accuracy in the test, is it better?
@@ -82,3 +88,18 @@ def test_seq_dataset_shuffle():
    _, _, _, idx2 = dataset2._random_py()
    assert_equal(idx1, idx2)


def test_consistency_check_fused_types():
test_fused_types_consistency
Address Olivier's request to change the test name
I am not sure what the scope of this PR is exactly, but I think we should check that the change is actually useful for user-facing estimators such as SGDClassifier. See my comment below.
@@ -56,12 +57,19 @@ def make_dataset(X, y, sample_weight, random_state=None):
    # seed should never be 0 in SequentialDataset
    seed = rng.randint(1, np.iinfo(np.int32).max)

    if X.dtype == np.float32:
        CSRData = CSRDataset32
        ArrayData = ArrayDataset32
The codecov chrome/firefox extension tells me that those lines are not covered. Please add a test for that case (and install the codecov browser extension ;). This test should probably fit an SGDClassifier on 32-bit float iris and check that the coef_ attribute is 32-bit float as well (and that the output of decision_function is also a float32 array).
Currently, there isn't any specific test for make_dataset. It's supposed to be covered by the tests for the sag solver. So we can either put the test off to a later PR about the sag solver (#9020) or do the test you are proposing in this PR. What do you prefer?
Yes, this test is good now.
> doing the test you are proposing in this PR. What do you prefer?

A dedicated test seems useful.
I added a test for make_dataset.
I'm stuck at line 383. dataset is of type SequentialDataset (see line 156). This comes from the templated code in scikit-learn#9040, so dataset should be able to be either float or double, but it does not pick the type up when instantiated.
sklearn/linear_model/base.py (Outdated)

@@ -32,7 +32,8 @@
 from ..utils.extmath import safe_sparse_dot
 from ..utils.sparsefuncs import mean_variance_axis, inplace_column_scale
 from ..utils.fixes import sparse_lsqr
-from ..utils.seq_dataset import ArrayDataset, CSRDataset
+from ..utils.seq_dataset import ArrayDataset32, CSRDataset32, ArrayDataset, \
We prefer to avoid the line continuation with a backslash and rather start a new line, repeating the "from ...utils"
# name, c_type
dtypes = [('', 'double'),
          ('32', 'float')]
It is much clearer if we use:

dtypes = [('64', 'double'),
          ('32', 'float')]

I found an error complaining about SequentialDataset and had a hard time figuring out that it was referring to the 64-bit version. @Henley13, do you remember why we chose '' over '64'?
To avoid refactoring. But I don't think it was a good idea.
I've rebased this PR and fixed the tiny remaining issue that caused the tests to fail. I've opened a new PR (#11155). Thanks a lot for all the work!
Reference Issue
Works on #8769.
What does this implement/fix? Explain your changes.
Make make_dataset from linear_model/base.py support fused types.
Any other comments?
Intermediate step to make the sag solver efficiently support fused types (PR #9020).