[MRG+1] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types #6846
Conversation
It seems that we are facing precision issues again ...
Some imprecision in inertia may be tolerable. However, not having looked this over in detail, it seems as if you've changed all the …
Hello @jnothman, if I understand you correctly, you mean changing some of the variables back to `np.float64`?
Yes.
I get it, will do. However, it's a little bit weird that CI only fails for precision reasons on Python 2.6 & 2.7.
I recall differences in handling of large ints, but don't recall if there was anything widely publicised about float formats. Feel free to investigate a little for your own knowledge, but it's not essential to getting the job done here.
Please ping when you've looked at potential changes, whether or not we can make any beneficial changes.
Hello @jnothman and @MechCoder, I've fixed the test precision issue. The only change I made here is the dtype used when computing inertia: since inertia is the sum of the distances from points to centers, accumulating it in `np.float64` avoids the precision loss we saw with `np.float32`.
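A minimal NumPy sketch of that idea (the PR's actual implementation lives in Cython; the function name and loop here are illustrative only):

```python
import numpy as np

def compute_inertia(X, labels, centers):
    # Accumulate in float64 even when X and centers are float32: inertia is
    # a sum of many squared distances, so float32 accumulation drifts.
    total = np.float64(0.0)
    for i in range(X.shape[0]):
        diff = X[i].astype(np.float64) - centers[labels[i]].astype(np.float64)
        total += np.dot(diff, diff)
    return total
```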
It's a bit unfortunate that it involves …
So …
```diff
-            centers[center_idx] /= counts[center_idx]
+            # Note: numpy >= 1.10 does not support '/=' for the following
+            # expression for a mixture of int and float (see numpy issue #6464)
+            centers[center_idx] = centers[center_idx]/counts[center_idx]
```
Space around `/`, please.
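For context, a small standalone illustration of the NumPy behavior that the in-line comment works around; the exact exception and message vary across NumPy versions:

```python
import numpy as np

centers = np.ones((3, 2), dtype=np.float32)
counts = np.array([2, 4, 8], dtype=np.int64)

try:
    # Under the 'same_kind' casting rule of numpy >= 1.10 this raised a
    # TypeError: the true division yields float64, which cannot be written
    # back in place into a float32 array. (Later scalar-promotion changes
    # in NumPy may accept it again.)
    centers[0] /= counts[0]
except TypeError as exc:
    print("in-place division failed:", exc)

# The workaround from the diff above: compute a temporary, then assign.
centers[0] = centers[0] / counts[0]
```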
Could you please refactor those tests into one? I think the code duplication doesn't give much value.
```diff
-        centers = init
+        # ensure that the centers have the same dtype as X
+        # this is a requirement of fused types of cython
+        centers = np.array(init, dtype=X.dtype)
```
This will also trigger a copy of `init` even if `init` is already an array with the right dtype, which was not the case before.
I think it is good to systematically copy, as it is probably a bad idea to silently mutate the user-provided input array. I think you should add a test to check this behavior.
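A minimal sketch of the guarantee being asked for (illustrative only, not the test that was eventually added):

```python
import numpy as np

init = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float64)
init_before = init.copy()

# np.array(init, dtype=...) always copies, so mutating the resulting
# centers can never touch the user-provided init array.
centers = np.array(init, dtype=np.float32)
centers += 1.0                            # mutate the copy...

assert np.array_equal(init, init_before)  # ...init is untouched
```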
Has this been addressed?
There is a block of code under the function `k_means` that starts with `if hasattr(init, '__array__')` and also explicitly converts `init` to higher precision. That should be fixed as well, and we should test the attributes `cluster_centers_` and `inertia_` for precision when `init` is an ndarray ...
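A hypothetical sketch of the requested test (estimator parameters here are assumptions, not the merged test code):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(50, 2).astype(np.float32)
init = X[:3].copy()  # an explicit float32 ndarray passed as init

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)

# The fitted attributes should keep the input precision end to end.
assert km.cluster_centers_.dtype == np.float32
assert np.isfinite(km.inertia_)
```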
I've added a test to make sure that we do copy `init` here.
@MechCoder, since the function `_init_centroids` will always be called after the code block you mentioned, and thus makes sure `centers` has the same dtype as the input, I think maybe we don't need to touch that part of the code?
Ouch, that means a copy is already triggered before `_init_centroids` is called, right?
Correct me if I'm wrong, but in other words, if `X` is of dtype `np.float32`, there are 2 copies made of `init` (traced in the sketch below):

- One in `k_means`, which makes `init` of dtype `np.float64`
- And then one in `_init_centroids`, which converts it back to `np.float32`
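Roughly what that double copy looks like, reconstructed from the discussion (the surrounding function bodies are elided):

```python
import numpy as np
from sklearn.utils import check_array

X = np.random.rand(100, 2).astype(np.float32)
init = np.random.rand(3, 2)  # user-provided centers, float64 here

# Copy 1 -- in k_means (before this PR): force init to float64.
init = check_array(init, dtype=np.float64, copy=True)

# Copy 2 -- in _init_centroids: cast back to X's dtype for the fused-type code.
centers = np.array(init, dtype=X.dtype)

print(init.dtype, centers.dtype)  # float64 float32
```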
You are right!
But I think it is reasonable to copy the data twice: in step 1, `init = check_array(init, dtype=np.float64, copy=True)` is used to make sure `KMeans.init` is a copy of the array provided by the user, and in step 2, `centers = np.array(init, dtype=X.dtype)` is used to make sure the `centers` we compute won't alter its argument `init`.
Yeah, but I still changed this line to `init = check_array(init, dtype=X.dtype.type, copy=True)`, since it is more consistent to keep it the same dtype as `X`, and it does no harm to precision! 😃
Great work @yenchenlin. I added some further comments to address on top of @jnothman's. Looking forward to merging this once they are addressed.
Thanks a lot for your comments, guys!
(force-pushed from 60fb242 to 5f5a81e)
Please avoid amending commits and force-pushing your changes. It's much easier to review changes incrementally, especially weeks after the previous change, if the commits show precisely what has changed. There are other reasons too, but commits with clear messages were designed that way for a reason.
```python
for dtype in [np.int32, np.int64, np.float32, np.float64]:
    X_test = dtype(X_small)
    init_centers_test = dtype(init_centers)
    assert_equal(X_test.dtype, init_centers_test.dtype)
```
This line does not really test anything, does it?
I did it yesterday to emphasize that they are of the same type, but overnight it started to look stupid to me. Thanks!
That should be it from me as well. LGTM pending comments. Great work!
Thanks @MechCoder for the inputs, please have a look when you have time!
In terms of benchmarking, I mostly meant to make sure that we're actually reducing memory consumption in the sparse and dense cases from what it is at master...
Hello @jnothman, sorry for the misunderstanding ... About the benchmarking, I've updated the results for sparse input in the main description of this PR.
```diff
@@ -305,7 +305,7 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
         X -= X_mean

     if hasattr(init, '__array__'):
-        init = check_array(init, dtype=np.float64, copy=True)
+        init = check_array(init, dtype=X.dtype.type, copy=True)
```
(I'm surprised that we need this `.type` here; perhaps `check_array` should allow a dtype object to be passed.)
I think so.
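A quick illustration of the distinction behind that remark:

```python
import numpy as np

X = np.zeros((2, 2), dtype=np.float32)

print(X.dtype)       # float32 -- a np.dtype instance
print(X.dtype.type)  # <class 'numpy.float32'> -- the scalar type

# Per the comments above, check_array at the time matched on scalar types,
# so the dtype instance had to be unwrapped with `.type`.
```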
Let us know when you've dealt with those cosmetic things, after which I think this looks good for merge.
@jnothman Thanks for the check, done!
Thanks for the updates. My +1 still holds.
Good work, @yenchenlin !
Thanks for your review!
Thanks @yenchenlin , great work!
This is a follow-up PR for #6430, which uses fused types in Cython to work with `np.float32` inputs without wasting memory by converting them internally to `np.float64`. Since Sparse Func Utils now supports Cython fused types, we can avoid the zig-zag memory usage that showed up in #6430.
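The user-visible effect can be sketched with the public API (a minimal example; the assertion reflects this PR's intended behavior):

```python
import numpy as np
from sklearn.cluster import KMeans

X32 = np.random.RandomState(0).rand(1000, 10).astype(np.float32)
km = KMeans(n_clusters=8, random_state=0).fit(X32)

# With this PR, float32 input is processed as float32 internally, so no
# float64 copy of X is made and the fitted centers come out float32.
assert km.cluster_centers_.dtype == np.float32
```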
Updated on 6/19
[Benchmark figures: memory usage profiles for np.float32 data, dense and sparse input]
Here is the test script used to generate the above figures (thanks @ssaeger):