Skip to content

[MRG] Fused type makedataset #9040

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 22 commits into from

Conversation

Henley13
Copy link
Contributor

@Henley13 Henley13 commented Jun 7, 2017

Reference Issue

Works on #8769.

What does this implement/fix? Explain your changes.

Make make_dataset from linear_model/base.py support fused types.

Any other comments?

Intermediate step to make sag solver efficiently support fused types (PR #9020).

@Henley13 Henley13 changed the title Fused type makedataset [WIP] Fused type makedataset Jun 7, 2017
@massich
Copy link
Contributor

massich commented Jun 7, 2017

ping: @raghavrv

@massich
Copy link
Contributor

massich commented Jun 8, 2017

The idea was to import the floating fuse type using from cython cimport floating and replace all double in seq_dataset.{pxy, pxd}.

However, fused types cannot be used as class attributes and we have several *double attributes both in ArrayDataset and CSRDataset (see here)

A possible workaround is to add cdef void *X_data_ptr, keep track of the type and fill the code with if else statements:

from cython cimport floating
cdef class Printer:
	
	// We wish num to be fused types, so declared it as void*
	cdef void *num;
	
	cdef bint is_float;
	
	// Used as fused type arguments
	cdef float float_sample;
	cdef double double_sample;

	def __init__(self):
		cdef float num = float(5)
		self.num = &num
		
		if type(num) == float:
			self.is_float = True
		else:
			self.is_float = False
			
		if self.is_float:
			self._print(float_sample)
		else:
			self._print(double_sample)
	
	// Underlying function
	def _print(self, floating sample):
		// Typecast it when we want to access its value
		cdef floating *num = <floating*>self.num
		print num[0]

@jnothman concern is that

A small problem with your workaround is that unlike if floating is float, the branching for if/else self.is_float needs to be compiled into C by Cython. Cython doesn't know that it can statically factor out that conditional.

Talking to @ogrisel, @amueller, @raghavrv we could do something like what it has been done in svm/src/libsvm/svm.{h,cpp} and svm/src/libsvm/libsvm_template.cpp. Where two interfaces are defined, in svm.h. The beginning of the namespace declaration is wrapped within an #ifdef declaration.

Can you guys give us any hints on how to do that?

@massich
Copy link
Contributor

massich commented Jun 8, 2017

@massich
Copy link
Contributor

massich commented Jun 8, 2017

@arthurmensch can you give a look to all this?

@jorisvandenbossche
Copy link
Member

Example of how pandas uses cython templates to have versions of the same function for different dtypes: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby_helper.pxi.in

@raghavrv
Copy link
Member

raghavrv commented Jun 8, 2017

Thanks @jorisvandenbossche. I think this is what @ogrisel suggested yesterday IRL...

@massich @Henley13

We need a seq_dataset.pyx.in with the template, a generator for seq_dataset_float.pyx and seq_dataset_double.pyx...

I'm trying to see if we can avoid that...

Most of the class attributes are not data themselves but pointers.

Pointers can actually be void* and be later cast to <floating> void* circumventing cython's limitation of "no fused type for class attributes"... But this needs trial and error and may not end up successful...

BTW why does this PR diff have changes unrelated to seq_dataset.pyx?

@Henley13 Henley13 force-pushed the fused_type_makedataset branch from b80cd39 to de9189e Compare June 8, 2017 13:53
@Henley13
Copy link
Contributor Author

Henley13 commented Jun 8, 2017

@raghavrv, Sorry for the changes unrelated to seq_dataset.pyx, I badly initialized the PR, but it's fixed now.

@massich
Copy link
Contributor

massich commented Jun 9, 2017

can you guys give us feedback in this ?

cc: @GaelVaroquaux @raghavrv @lesteve

@@ -56,12 +57,19 @@ def make_dataset(X, y, sample_weight, random_state=None):
# seed should never be 0 in SequentialDataset
seed = rng.randint(1, np.iinfo(np.int32).max)

if isinstance(X, np.float32):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you be checking the dtype, rather than X?

Something like X.dtype == np.float32

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Henley13 Henley13 changed the title [WIP] Fused type makedataset [MRG] Fused type makedataset Jun 9, 2017
@@ -0,0 +1,69 @@
"""Dataset abstractions for sequential data access. """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful here to point to what kind of file this is (a Cython template), given an URL where it is described. Indeed, such files are uncommon, and people might be surprised.

from libc.limits cimport INT_MAX
cimport numpy as np
import numpy as np
{{py:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is needed to put those strings inside the {{py part (so I would leave the author information and the note you added at the top before the {{py stuff)

Copy link
Contributor

@arthurmensch arthurmensch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, apart from minor comments. I would have prefered to go with void * pointers as I think this does not hinder performance much but I guess you discussed this. Let's wait for CI.

"""

# name, c_type
dtypes = [('', 'double'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

['64', 'double'] sounds more consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. But doing that we would need to recode sag_fast.pyx for example... we could put it off to another PR (#9020) or just rename SequentialDataset in sag_fast.pyx as SequentialDataset64.

@@ -297,4 +334,4 @@ cdef inline np.uint32_t our_rand_r(np.uint32_t* seed) nogil:
seed[0] ^= <np.uint32_t>(seed[0] >> 17)
seed[0] ^= <np.uint32_t>(seed[0] << 5)

return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)
return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EOF

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for i in range(5):
# next sample
xi_32, yi32, swi32, idx32 = dataset32._next_py()
xi_, yi, swi, idx = dataset64._next_py()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check name consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

xi_data32, _, _ = xi_32
xi_data, _, _ = xi_
assert_equal(xi_data32.dtype, np.float32)
assert_equal(xi_data.dtype, np.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_array_almost_equal(xi_data64.astype('double'), xi_data32)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

'sklearn/utils/seq_dataset.pxd.tp']

for pyxfiles in pyx_templates:
assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

for pyxfiles in pyx_templates:
assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))
outfile = pyxfiles[:-3]
# if .pxi.in is not updated, no need to output .pxi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit comment to match code. Maybe precise that this a good idea for cythonization ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate ?

xi_data32, _, _ = xi_32
xi_data, _, _ = xi_
assert_equal(xi_data32.dtype, np.float32)
assert_equal(xi_data.dtype, np.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware to add generated files to .gitignore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's done!

"""Dataset abstractions for sequential data access. """

cimport numpy as np
{{py:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you put the header outside the {{py ? It would be cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it just below actually. First the code defining the template (in the braces {{py blablabla}}) and then the template itself. Only the second part is generated.

#
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
# Arthur Imbert <arthurimbert05@gmail.com>
# License: BSD 3 clause
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should move author and license outside the generation loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I do that, they won't appear in the generated file. That's what you want?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. They won't be in the repo anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done :)

@ogrisel
Copy link
Member

ogrisel commented Jun 10, 2017

The windows test failure looks like a numerical accuracy that should be relaxed when tested with 32 bit floats (but kept precise when run with 64 bit floats).

@ogrisel
Copy link
Member

ogrisel commented Jun 10, 2017

Also please use informative commit message.

@Henley13
Copy link
Contributor Author

@ogrisel I changed the numerical accuracy in the test, is it better?

@@ -82,3 +88,18 @@ def test_seq_dataset_shuffle():
_, _, _, idx2 = dataset2._random_py()
assert_equal(idx1, idx2)


def test_consistency_check_fused_types():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_fused_types_consistency

Joan Massich and others added 2 commits June 10, 2017 14:53
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the scope of this PR is exactly but I think we should check the change is actually useful for the user-facing estimators such as SGDClassifier. See my comment below.

@@ -56,12 +57,19 @@ def make_dataset(X, y, sample_weight, random_state=None):
# seed should never be 0 in SequentialDataset
seed = rng.randint(1, np.iinfo(np.int32).max)

if X.dtype == np.float32:
CSRData = CSRDataset32
ArrayData = ArrayDataset32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The codecov chrome/firefox extension tells me that those lines are not covered. Please add a test for that case (and install the codecov browser extension ;).

This test should probably fit a SGDClassifier on 32 bit float iris and check that the coef_ attribute should be 32 bit float as well ( and the output of decision_function should also output a float32 array).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, there isn't any specific test for make_dataset. It's supposed to be covered by the tests for the sag solver. So we can either put the test off to a next PR about the sag solver (#9020) or doing the test you are proposing in this PR. What do you prefer?

@ogrisel
Copy link
Member

ogrisel commented Jun 10, 2017

@ogrisel I changed the numerical accuracy in the test, is it better?

Yes this test is good now.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jun 10, 2017 via email

@Henley13
Copy link
Contributor Author

I added a test for make_dataset, but without using a SGDClassifier (it's not consistent yet with fused types). @ogrisel @GaelVaroquaux ;)

massich pushed a commit to massich/scikit-learn that referenced this pull request Jun 10, 2017
I'm stuck in line 383. dataset is of the type sequentialDataset (see line 156)
This comes form the templated code in scikit-learn#9040. So dataset would be able to be
either float or double.

but it does not pick it up when instanciated.
@@ -32,7 +32,8 @@
from ..utils.extmath import safe_sparse_dot
from ..utils.sparsefuncs import mean_variance_axis, inplace_column_scale
from ..utils.fixes import sparse_lsqr
from ..utils.seq_dataset import ArrayDataset, CSRDataset
from ..utils.seq_dataset import ArrayDataset32, CSRDataset32, ArrayDataset, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer to avoid the line continuation with a backslash and rather start a new line, repeating the "from ...utils"


# name, c_type
dtypes = [('', 'double'),
('32', 'float')]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is much more clear if we use:

 dtypes = [('64', 'double'),
            ('32', 'float')]

I found an error complaining about SequentialDataset and I had a hard time to figure out that it was referring to the 64bits version. @Henley13 do you remember why we choose '' over '64'?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid refactoring. But I don't think it is a good idea

@NelleV
Copy link
Member

NelleV commented May 28, 2018

I've rebased this PR, and fixed the tiny element remaining that caused the tests to fail. I've opened a new PR (#11155 )

Thanks a lot for all the work!

@massich
Copy link
Contributor

massich commented Jun 1, 2018

@ogrisel I think you can close this one aswell in favor of #11155 (As you did in #9020)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants