-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types #6430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Based on the cython doc, Please replace those |
Please also feel free to squash those commits. |
elif floating is double: | ||
centers = np.zeros((n_clusters, n_features), dtype=np.float64) | ||
else: | ||
raise ValueError("Unknown floating type.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following should work no?
centers = np.zeros((n_clusters, n_features), dtype=X.dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar remark for the allocation of the center_squared_norms
and centers
arrays in other functions of this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel This change works on my machine with Ubuntu, but leads to failing appveyor builds.
Other than that, LGTM, +1 for merge with a new entry in |
ec80989
to
cebb687
Compare
Thanks for your review and the comments. |
8ae6d1f
to
07dcec5
Compare
Please squash the new commit as well. Commits with a message such as "Adresses several comments" are not interesting when reviewing the history of a file 6 months from now :) |
Ok, thanks, I squashed them. :) |
if floating is float: | ||
centers = np.zeros((n_clusters, n_features), dtype=np.float32) | ||
else: | ||
centers = np.zeros((n_clusters, n_features), dtype=np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be further simplified as:
centers = np.zeros((n_clusters, n_features), dtype=X.dtype)
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that and it works on my machine with Ubuntu, but leads to failing appveyor builds.
I don't know exactly why this happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is really weird. It would be worth investigating with a debugger under windows. This might be a bug in Cython / MSVC or even numpy. But we can leave it as it is for now in this PR.
Hello @ssaeger ,
What do you mean by internally? |
Hi @yenchenlin1994 , With internally I mean that if the user inputs sparse data with dtype I hope this makes it a bit more clear. :) |
@ssaeger great thanks for your explanation. |
It would be great to wait for #6593 |
Seems like we still need |
Yeah ... I'll do that. |
Hello @MechCoder , Am I on the right track? |
It's actually possible that we could have used a cdef there (though the benefit is not as great), but I'm fairly sure we can here if we used typed memoryviews instead of from cython cimport floating
cdef floating a1(floating[:] x):
return x[0]
def b(X, floating y):
cdef floating[:] X_data = X.data
return a1(X_data) |
(Also, should we be doing the fused typing of CSR/CSC sparse matrix |
Hello @jnothman , I see the code you paste compiles.
But I wonder why the benefit there is not as great as here? |
Because you're going from Cython to Cython here not from Python to Cython as there; and because here it is called repeatedly, while calculating means is not so often repeated. |
Thanks a lot! Now I understand 😄 So, do you mean I can simply replace (I just learned memoryviews, sorry if this is obvious) |
I actually removed the function here (#6676) |
Note about memory views: they tend to not work correctly with readonly memory buffers: https://mail.python.org/pipermail/cython-devel/2013-February/003384.html See the discussion in pandas-dev/pandas#10043 for an example on how this can cause problems when working with readonly mapped data. If we can I think we should try to stick to the |
@yenchenlin1994 Since Olivier just said that using memoryviews would break joblib parallelism. You have two options.
|
@MechCoder thanks for your help! |
The first option seems to be the easier to try out. |
@ssaeger It seems we have fused types support for all sparsefuncs. Would you be able to rebase or would you prefer @yenchenlin1994 to do the rebase and testing for the sparse case as part of GSoC? |
@MechCoder At the moment I'm very busy. Therefore I would prefer @yenchenlin1994 to continue the work on this. |
Sure! @ssaeger thanks for your hard work. |
Hello @ssaeger , really sorry to interrupt you. |
hey, no problem 😃
I used following memory profiler: https://pypi.python.org/pypi/memory_profiler |
|
Hello guys, I've created #6846 to replace this PR, maybe we can close this one? |
As mentioned in #5973, #5776 or #5464 we could use fused types in Cython to work with float32 inputs without wasting memory by converting them internally to float64.
This PR implements fused types for KMeans/MiniBatchKMeans so that float32 input will result in using only float32 internally. Additionally it adds tests to ensure the desired data types are used.
Memory usage for float32 inputs with dimensions 500k x 20:


Before the code changes of this commit:
and after the code changes:
Unfortunately I was not able to add support to handle sparse float32 input data as float32 internally, so that this is still converted to float64. This would require many changes in
sparsefuncs_fast.pyx
.