
Any way to make BinaryTree support fused type? #7059


Open
yenchenlin opened this issue Jul 21, 2016 · 30 comments
Labels
cython Enhancement float32 Issues related to support for 32bit data module:tree

Comments

@yenchenlin
Contributor

yenchenlin commented Jul 21, 2016

Hello guys, currently I'm trying to make the neighbors tree algorithms support Cython fused types so that the required memory can be drastically reduced.

However, both KDTree and BallTree are subclasses of BinaryTree, and this line (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/binary_tree.pxi#L1031) of BinaryTree seems to decide the datatype (np.float64) of the data it stores at initialization time, i.e., even before it sees the actual datatype of the training data (which may be np.float32).
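For a rough sense of the saving (a plain-Python illustration, not scikit-learn code): float64 takes 8 bytes per value versus 4 for float32, so keeping single-precision data in its native dtype halves the storage footprint.

```python
import struct

# IEEE-754 sizes: 'd' is double precision (float64), 'f' is single (float32).
bytes_float64 = struct.calcsize('d')
bytes_float32 = struct.calcsize('f')

# A hypothetical training set: 1,000,000 samples x 128 features.
n_values = 1_000_000 * 128
size_float64 = n_values * bytes_float64
size_float32 = n_values * bytes_float32

print(size_float64 // size_float32)  # 2: float64 storage is twice as large
```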

The reason for initializing the arrays in __cinit__ is explained in a code comment:

  # Use cinit to initialize all arrays to empty: this will prevent memory
  # errors and seg-faults in rare cases where __init__ is not called

I wonder what are the rare cases where __init__ is not called?
Can we handle those rare cases specially and move the data initialization to __init__, so that it can initialize an array with the datatype of the training data?

Thanks in advance.
ping @jakevdp @jnothman @MechCoder

@jnothman
Member

Either way, the value of data_arr is overwritten in __init__, and data_arr's cdef does not prescribe a dtype, so there's nothing wrong with changing the dtype set between __cinit__ and __init__ in the current implementation.


@yenchenlin
Contributor Author

Thanks @jnothman, you're right.

However, after some attempts to convert the data variable into fused types, I got the error Fused types not allowed here.

And then I found this unfortunate note on the official documentation page:

"Note Fused types are not currently supported as attributes of extension types. Only variables and function/method arguments can be declared with fused types."

The only solution I've come up with so far is like this commit: declare another float memoryview attribute, and decide which memoryview to use when accessing the data later. However, I know this looks clumsy.

@jnothman
Member

You can make it a void * and typecast it within a function, and see how that goes.
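That suggestion might look roughly like this (a hypothetical sketch with made-up names, not the actual BinaryTree code): the attribute is stored type-erased, and a fused-type argument in the method signature drives the specialisation that makes the cast legal.

```cython
from cython cimport floating   # fused type covering float and double

cdef class Tree:
    cdef void* data_ptr        # type-erased pointer to the training data
    cdef int n_samples
    cdef int n_features

    cdef floating _get(self, floating* spec_driver, int i, int j):
        # The fused-type argument fixes the specialisation of this method,
        # which makes the cast below resolvable at compile time.
        cdef floating* data = <floating*> self.data_ptr
        return data[i * self.n_features + j]
```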

@yenchenlin
Contributor Author

yenchenlin commented Jul 22, 2016

Hello @jnothman , sorry to bother you again.
After a day of head-scratching I still can't make it work with fused types.

I'm assuming you mean something like this?

(I use the variable another_data for the possible new implementation, so as not to affect the original data variable and therefore still be able to build scikit-learn successfully.)

The above code works, but if I change both occurrences of double on this line to floating, it outputs an error:

Error compiling Cython file:
------------------------------------------------------------
...


        cdef floating[:, ::1] another_data = <floating[:self.data_arr.shape[0],
                                                       :self.data_arr.shape[1]:1]> self.another_data
        n_samples = another_data.shape[0]
        n_features = another_data.shape[1]
                                ^
------------------------------------------------------------

sklearn/neighbors/binary_tree.pxi:1096:33: Compiler crash in AnalyseExpressionsTransform

ModuleNode.body = StatListNode(ball_tree.pyx:9:0)
StatListNode.stats[3] = StatListNode(binary_tree.pxi:144:0)
StatListNode.stats[60] = StatListNode(binary_tree.pxi:1012:5)
StatListNode.stats[0] = CClassDefNode(binary_tree.pxi:1012:5,
    as_name = u'BinaryTree',
    class_name = u'BinaryTree',
    module_name = u'',
    visibility = u'private')
CClassDefNode.body = StatListNode(binary_tree.pxi:1014:4)
StatListNode.stats[2] = DefNode(binary_tree.pxi:1067:4,
    modifiers = [...]/0,
    name = u'__init__',
    num_required_args = 2,
    py_wrapper_required = True,
    reqd_kw_flags_cname = '0',
    used = True)
File 'ExprNodes.py', line 3032, in infer_type: IndexNode(binary_tree.pxi:1096:39,
    is_subscript = True,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 5845, in infer_type: AttributeNode(binary_tree.pxi:1096:33,
    attribute = u'shape',
    is_attribute = 1,
    member = u'shape',
    needs_none_check = True,
    op = '.',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 6016, in analyse_attribute: AttributeNode(binary_tree.pxi:1096:33,
    attribute = u'shape',
    is_attribute = 1,
    member = u'shape',
    needs_none_check = True,
    op = '.',
    result_is_used = True,
    use_managed_ref = True)

which I can't understand, since another_data shouldn't have affected other .py files.

Any idea?
Thanks in advance.

@jnothman
Member

jnothman commented Jul 23, 2016

Well, I think the first rule of compilers is that "Compiler crash" isn't expected behaviour! I presume you're uncovering a Cython bug. It looks like type inference is failing. Have you attempted to make the code that crashes minimal?

I don't know what you mean by should not affect other py files. Do you mean ball_tree.pyx? The .pxi is not compiled by itself, only as an include within the two .pyx files.

@jnothman
Member

Oh do you mean ExprNodes.py and such .py files? Those are files in the Cython compiler.

@jnothman
Member

The line numbers there don't seem to correspond to the most recent Cython, so you should probably upgrade before trying to alert someone to a bug.

@yenchenlin
Contributor Author

yenchenlin commented Jul 23, 2016

@jnothman Thanks for clarifying.
About "not affect other py files": I mean I don't want to directly change the data variable at this early stage, so that I can still build scikit-learn without errors. And yes, I mean what you said.

I'll dive into it to see if this is really a Cython bug and get back.

@yenchenlin
Contributor Author

yenchenlin commented Jul 24, 2016

I produced a minimal example here, but I'm sorry that I overlooked another error that actually happened on top of it, i.e.,

Cannot coerce to a type that is not specialized.

I think this error occurred because the Cython compiler can't decide which specialization of floating (float or double) to use at compile time.

In previous PRs, we didn't encounter this because fused types like floating always appeared in the function arguments, so the compiler could generate several versions of that function with different data types (e.g., one for float and one for double).
And once the function arguments' fused types are decided, the compiler can figure out the actual type of local variables that are declared with the same fused type.

However, in this case, it seems that the compiler can't decide what floating is at compile time and therefore outputs the error.
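The distinction being described can be sketched as follows (illustrative only; `twice` and `broken` are made-up names):

```cython
from cython cimport floating

# Works: `floating` appears in the signature, so Cython emits a float
# and a double specialisation and resolves the type at each call site.
cdef floating twice(floating x):
    cdef floating y = x + x   # the local fused type is tied to the argument
    return y

# Fails: no argument mentions `floating`, so the compiler has no way to
# pick a specialisation ("Cannot coerce to a type that is not specialized").
cdef double broken():
    cdef floating z = 0.0
    return z
```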

Does this make sense?

@yenchenlin
Contributor Author

If so, do we need fused types to appear in the function arguments whenever we want to declare fused-type local variables and assign values to them in that function?

This seems like a big limitation of Cython fused types and is worth documenting.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 24, 2016 via email

@yenchenlin
Contributor Author

@GaelVaroquaux thanks, I just asked the question.
Hope they can provide some other workarounds in this situation 😟

@jnothman
Member

Ah of course, I too missed the context. Yes, fused types only work in functions/methods where the type can be specialised by sniffing the arguments. Which means we just need a way to specify the type at __init__ time. I don't think this is a big deal, but I can't elaborate until tomorrow.

@jnothman
Member

You may be able to specialise things with the func_name[spec_type] notation.
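For reference, the explicit-specialisation notation looks roughly like this (hypothetical function name, untested sketch):

```cython
cimport cython
from cython cimport floating

cdef floating scale(floating x):
    return x + x

def run():
    # Explicitly select the double specialisation instead of relying on
    # the compiler sniffing the argument types at the call site:
    return scale[cython.double](1.5)
```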

@yenchenlin
Contributor Author

I'm worried that since we declare self.data as void *, we may need to typecast it in every function that wants to access it, and it seems hard to make every function's arguments include a fused type.

Like here (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/binary_tree.pxi#L1943), we can't typecast it to floating since the function's arguments don't involve fused types.

Sorry, I'm not 100% sure of my concern; I will wait for your elaboration. Thanks a lot!

@jnothman
Member

Again I'll have to put aside some time to look into it later in the week, but one option may be using a templated cppclass, such that you can specialise the entire class at once. I.e. it might be a case for using the C++-side type generics rather than Cython's.


@yenchenlin
Contributor Author

I think a data_sample argument is a nice solution, but I'm curious how to declare data?

Using void*?

If so, it seems that we'd almost need to pass self.data_sample into every function, since we need to convert void* to either double* or float*.

@jnothman
Member

jnothman commented Aug 4, 2016

Yes, we'd use void* and have to cast everywhere. It's not exactly elegant. That's why it should possibly just be limited to the dataset itself, rather than worrying about it in other places.

(And we don't actually need a separate data_sample attribute in BinaryTree where there is data_arr already)

@yenchenlin
Contributor Author

Hello @jnothman , am I doing anything stupid here?

I thought it might work, but it outputs an error:

sklearn/neighbors/binary_tree.pxi:1262:4: Special method __defaults__ has wrong number of arguments (0 declared, 1 or more expected)

@jnothman
Member

jnothman commented Aug 5, 2016

I've never heard of this __defaults__, and the error message appears in a discussion here:
https://groups.google.com/forum/#!topic/cython-users/VTOxyVbeZ0s


@jnothman
Member

jnothman commented Aug 5, 2016

It doesn't look like anyone responded there. I suppose one should check whether fused types work at all with extension class methods, as opposed to functions... :-/


@yenchenlin
Contributor Author

yenchenlin commented Aug 5, 2016

Well ...
~~This should work according to a minimal example I just created!~~😭

Never mind, it's a Cython bug, see below.

@yenchenlin
Contributor Author

@jnothman After countless repeated experiments, I found out that it's a bug! 🐞

It turns out that if you want to put fused-type arguments in extension class methods, you cannot have default-valued arguments in the same function.

See this IPython notebook I just created.
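Based on that description, a minimal reproduction might look like this (hypothetical names; expected to fail on the Cython version used at the time):

```cython
from cython cimport floating

cdef class Demo:
    # Reported to trigger "Special method __defaults__ has wrong number of
    # arguments": a fused-type argument combined with a default-valued
    # argument in an extension-type method.
    cdef floating scale(self, floating x, floating factor=2.0):
        return x * factor
```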

@yenchenlin
Contributor Author

I've found a way to successfully include fused types arguments in cdef function, see #7153 .

@jnothman
Member

jnothman commented Aug 6, 2016

Great that we've got some kind of solution! That's a pretty frustrating bug/limitation in Cython. It'd be worth reporting so that at least there's a record of it.

@yenchenlin
Contributor Author

yenchenlin commented Aug 6, 2016

Yes, I almost killed myself during the Cython experiments.

And I have to say sorry: I'm currently modifying #7153, so I may force-push it later, since some of the code on GitHub will cause problems.

@cod3licious

hey @yenchenlin and @jnothman,

I just created a KDTree from a dataset that was stored as uint8. The original data is about 200mb, but the final KDTree is like 1.6gb and when I looked at the arrays they seem to be stored as floats. Is there any progress on keeping the original dtypes in the tree and thereby possibly reducing the memory footprint? Or at least a way to change the dtypes after the tree was constructed? (I'm not familiar with Cython so I didn't get how to access any of the arrays directly in a way that lets you modify them, preferably without breaking anything.)
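The reported numbers are consistent with the dtype change alone (simple arithmetic, assuming the ~200 MB figure above): uint8 occupies 1 byte per value and float64 occupies 8, so an 8x blow-up to roughly 1.6 GB is exactly what a float64 copy would take.

```python
# Back-of-the-envelope check of the reported sizes.
original_mb = 200          # uint8 dataset: 1 byte per value
bytes_per_value_before = 1
bytes_per_value_after = 8  # float64

converted_mb = original_mb * bytes_per_value_after // bytes_per_value_before
print(converted_mb)        # 1600 MB, i.e. ~1.6 GB, matching the observation
```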

I'm especially concerned as I wanted to use the tree in a WebApp, but the heroku server only allows for 512mb of RAM... >.>

Thanks! :)

@rth
Member

rth commented Sep 27, 2019

Maybe finishing refactorings such as #4217 would help. I started some of it in #15097.

Another blocker is read-only fused memory-views (#10624), which will only become available in the future Cython 3.0.

I just created a KDTree from a dataset that was stored as uint8.

In any case, this would likely apply to float64 and float32 only. Anything else is so far out of scope in scikit-learn.

@ajouellette

Is there any update on this? I would really like to be able to construct trees with float32 data. The data I have is single-precision (and large, ~1.5GB), so constructing a tree not only has to copy the data but also doubles the memory usage by converting to float64.

If it were possible to construct trees with single-precision data internally, then this data set would be manageable.

Thanks!

@jjerphan
Member

jjerphan commented Mar 10, 2022

Having sklearn.neighbors.{KDTree,BallTree} support 32bit would definitely be useful.

Yet, it all comes down to a limitation within Cython, as explained in cython/cython#3283 (comment).

For scikit-learn, we are currently trying to get over this problem by extending some Cython implementations to support 32bit datasets using Tempita (see #22590).

This first necessitates refactoring the implementations of sklearn.metrics.DistanceMetric to support 32bit, which is also needed to solve this issue. This is currently being explored in #22590, and the changes made to sklearn.metrics.DistanceMetric might then be extracted into a separate, smaller PR.
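For illustration only, a Tempita template along these lines (hypothetical names, not the actual scikit-learn template) expands one implementation per dtype at build time, sidestepping the fused-type restriction on extension-type attributes:

```
{{py:
# (suffix, C type) pairs: one generated class per supported dtype.
implementations = [('64', 'double'), ('32', 'float')]
}}

{{for suffix, c_type in implementations}}
cdef class BinaryTree{{suffix}}:
    cdef const {{c_type}}[:, ::1] data
{{endfor}}
```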

@jjerphan jjerphan added the float32 Issues related to support for 32bit data label Mar 18, 2022