[MRG+1] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types #6846

Merged
merged 17 commits into scikit-learn:master from yenchenlin:fused_types on Jun 21, 2016

Conversation

yenchenlin
Contributor

@yenchenlin yenchenlin commented May 31, 2016

This is a follow-up PR for #6430: it uses Cython fused types so that KMeans/MiniBatchKMeans can work with np.float32 inputs directly, without wasting memory by converting them internally to np.float64.

Since the sparse utility functions now support Cython fused types, we can also avoid the zig-zag memory usage shown in #6430.
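For readers unfamiliar with the mechanism, here is a minimal, illustrative Cython sketch (not the actual scikit-learn code; row_sums is a made-up helper) of how a fused type lets a single routine accept both np.float32 and np.float64 buffers without an up-front cast to double:

# illustrative only: one function, compiled for both float and double inputs
from cython cimport floating  # Cython's built-in fused type: float or double

def row_sums(floating[:, ::1] X):
    """Sum each row of a C-contiguous float32 or float64 2-D array."""
    cdef Py_ssize_t i, j
    cdef double total  # accumulate in double to limit precision loss
    sums = []
    for i in range(X.shape[0]):
        total = 0.0
        for j in range(X.shape[1]):
            total += X[i, j]
        sums.append(total)
    return sums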

Updated on 6/19

  • Dense np.float32 Data
    • Before this PR: [memory profile plot: master]
    • After this PR: [memory profile plot: fused_types]
  • Sparse np.float32 Data
    • Before this PR: [memory profile plot: master_sparse_32]
    • After this PR: [memory profile plot: fused_types_sparse_32]

Here is the test script used to generate the figures above (thanks @ssaeger):

import numpy as np
from scipy import sparse as sp
from sklearn.cluster import KMeans
from memory_profiler import profile  # supplies the @profile decorator

@profile
def fit_est():
    estimator.fit(X)

np.random.seed(5)
X = np.random.rand(200000, 20)
X = np.float32(X)

# X = sp.csr_matrix(X)  # uncomment to benchmark the sparse case

estimator = KMeans()
fit_est()
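(The memory-over-time plots above were presumably produced with memory_profiler's mprof tooling, e.g. `mprof run test_script.py` followed by `mprof plot`, which is what the @profile marker is for.)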

@yenchenlin
Contributor Author

It seems that we are facing precision issues again ...

@jnothman
Member

Some imprecision in inertia may be tolerable. However, having not looked this over in detail, it seems as if you've changed all the DOUBLEs into floatings. If changing some of these back to DOUBLEs will only increase memory usage in O(n) or O(m) but not O(mn) (n=samples, m=features) I would think it acceptable if it did not also substantially increase runtime. Do you think this is worth investigating?

@yenchenlin
Contributor Author

yenchenlin commented Jun 1, 2016

Hello @jnothman, if I understand you correctly, you mean that changing some of the variables back to DOUBLE may only increase memory usage by O(n) or O(m), without sacrificing precision, right?

@jnothman
Member

jnothman commented Jun 1, 2016

Yes.

@yenchenlin
Contributor Author

I get it, will do.

However, it's a little bit weird that CI only fails for precision reasons on Python 2.6 and Python 2.7.
Are there any substantial differences in precision between Python 2 and Python 3?

@jnothman
Member

jnothman commented Jun 1, 2016

Are there any substantial differences in precision between Python 2 and Python 3?

I recall differences in the handling of large ints, but I don't recall anything widely publicised about float formats. Feel free to investigate a little for your own knowledge, but it's not essential to getting the job done here.

@jnothman
Member

jnothman commented Jun 2, 2016

Please ping when you've looked at potential changes, whether or not we can make any beneficial changes.

@yenchenlin
Contributor Author

yenchenlin commented Jun 4, 2016

Hello @jnothman and @MechCoder ,

I've fixed the test precision issue.

The only change I made here is to the dtype used when computing inertia, i.e., in np.sum.

Since inertia is the sum of the distances from points to their centers, using an np.float64 accumulator preserves precision for np.float32 data.
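As a small illustration (not code from the PR itself), passing dtype=np.float64 to np.sum makes NumPy accumulate in double precision even for float32 input, without the caller having to create a float64 copy of the data first:

import numpy as np

rng = np.random.RandomState(0)
# stand-in for the per-sample squared distances that make up inertia
distances = rng.rand(1000000).astype(np.float32)

inertia_f32 = np.sum(distances)                    # float32 accumulator/result
inertia_f64 = np.sum(distances, dtype=np.float64)  # float64 accumulator/result

# at this magnitude the float32 result only carries ~7 significant digits,
# while the float64 accumulator keeps the extra digits the tests rely on
print(inertia_f32, inertia_f64)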

@yenchenlin yenchenlin changed the title [WIP] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types [MRG] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types Jun 4, 2016
@jnothman
Member

jnothman commented Jun 4, 2016

It's a bit unfortunate that it involves O(mn) conversions to float64, especially since it's done in every iteration. Can you get an estimate of the runtime cost, relative to the lower-precision version?

@jnothman
Member

jnothman commented Jun 4, 2016

So cumsum(..., dtype=np.float64) is our best option, perhaps?

centers[center_idx] /= counts[center_idx]
# Note: numpy >= 1.10 does not support '/=' for the following
# expression for a mixture of int and float (see numpy issue #6464)
centers[center_idx] = centers[center_idx]/counts[center_idx]
Member

space around / please

@jnothman
Member

jnothman commented Jun 4, 2016

Could you please refactor those tests into one? I think the code duplication doesn't give much value.

centers = init
# ensure that the centers have the same dtype as X
# this is a requirement of fused types of cython
centers = np.array(init, dtype=X.dtype)
Member

@ogrisel ogrisel Jun 4, 2016

This will also trigger a copy of init even if init is already an array with the right dtype, which was not the case before.

I think it is good to systematically copy, as it is probably a bad idea to silently mutate the user-provided input array. I think you should add a test to check this behavior.

Member

Has this been addressed?

Member

@MechCoder MechCoder Jun 17, 2016

There is a block of code under the function k_means, starting with if hasattr(init, '__array__'), that also explicitly converts init to higher precision.

That should also be fixed, and we should test the attributes cluster_centers_ and inertia_ for precision when init is an ndarray ...

Contributor Author

@yenchenlin yenchenlin Jun 17, 2016

I've added a test to make sure that we do copy init here.

@MechCoder, since the function _init_centroids will always be called after the code block you mentioned, and thus ensures centers has the same dtype as the input, I think maybe we don't need to touch that part of the code?

Member

Ouch, that means a copy is already triggered before _init_centroids is called, right?

Correct me if I'm wrong, but in other words, if X is of dtype np.float32, two copies of init are made:

  1. One in kmeans which makes init of dtype np.float64
  2. And then one in _init_centroids that converts it back to np.float32

Contributor Author

You are right!

But I think it is reasonable to copy the data twice, because in step 1,
init = check_array(init, dtype=np.float64, copy=True) is used to make sure KMeans.init is a copy of the array provided by the user.

And in step 2, centers = np.array(init, dtype=X.dtype)
is used to make sure the centers we compute won't alter its argument init.

Contributor Author

Yeah, but I still changed this line to init = check_array(init, dtype=X.dtype.type, copy=True), since it is more consistent to keep it as the same datatype as X, and it also does no harm to precision! 😃
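For illustration, a rough sketch of the kind of test discussed in this thread might look like the following (hypothetical code, not the test actually added in the PR):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 4).astype(np.float32)
init_centers = X[:3].copy()        # user-provided initial centers
init_before = init_centers.copy()

km = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(X)

# the user's init array must not be mutated (KMeans copies it internally)
np.testing.assert_array_equal(init_centers, init_before)
# and with this PR the fitted centers keep the input dtype
assert km.cluster_centers_.dtype == np.float32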

@ogrisel
Member

ogrisel commented Jun 4, 2016

Great work @yenchenlin. I added some further comments to address, on top of @jnothman's. Looking forward to merging this once they are addressed.

@ogrisel ogrisel added this to the 0.18 milestone Jun 4, 2016
@yenchenlin
Contributor Author

yenchenlin commented Jun 4, 2016

Thanks a lot for your comments, guys!
I've started addressing them.

@yenchenlin yenchenlin force-pushed the fused_types branch 2 times, most recently from 60fb242 to 5f5a81e, on June 14, 2016 at 00:58
@jnothman
Member

Please avoid amending commits and force-pushing your changes. It's much easier to review changes incrementally, especially weeks after the previous change, if the commits show precisely what has changed. There are other reasons too, but commits with clear messages were designed that way for a reason.

for dtype in [np.int32, np.int64, np.float32, np.float64]:
    X_test = dtype(X_small)
    init_centers_test = dtype(init_centers)
    assert_equal(X_test.dtype, init_centers_test.dtype)
Member

This line does not really test anything, does it?

Contributor Author

I did it yesterday to emphasize that they are of the same type, but it looks stupid to me now. Thanks!

@MechCoder
Member

That should be it from me as well. LGTM pending comments. Great work!

@MechCoder MechCoder changed the title [MRG] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types [MRG+1] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types Jun 18, 2016
@yenchenlin
Contributor Author

Thanks @MechCoder for the inputs, please have a look when you have time!

@jnothman
Member

In terms of benchmarking, I mostly meant to make sure that we're actually reducing memory consumption in the sparse and dense cases from what it is at master...

@yenchenlin
Contributor Author

yenchenlin commented Jun 19, 2016

Hello @jnothman, sorry for the misunderstanding ...
What I wanted to emphasize in that comment is that we can keep precision without increasing overhead.

About the benchmarking, I've updated the results for sparse input in the main description of this PR; the results for the dense case remain the same.

@@ -305,7 +305,7 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
X -= X_mean

if hasattr(init, '__array__'):
init = check_array(init, dtype=np.float64, copy=True)
init = check_array(init, dtype=X.dtype.type, copy=True)
Member

(I'm surprised that we need this .type here and perhaps check_array should allow a dtype object to be passed.)

Contributor Author

I think so.

@jnothman
Member

Let us know when you've dealt with those cosmetic things, after which I think this looks good for merge.

@yenchenlin
Contributor Author

@jnothman Thanks for the check, done!

@MechCoder
Member

Thanks for the updates. My +1 still holds.

@jnothman
Member

Good work, @yenchenlin !

@jnothman jnothman merged commit 6bea11b into scikit-learn:master Jun 21, 2016
@yenchenlin
Contributor Author

yenchenlin commented Jun 21, 2016

Thanks for your review!
Also thanks @ssaeger !

@ssaeger

ssaeger commented Jun 21, 2016

Thanks @yenchenlin , great work!

imaculate pushed a commit to imaculate/scikit-learn that referenced this pull request Jun 23, 2016
agramfort pushed a commit that referenced this pull request Jun 23, 2016
olologin pushed a commit to olologin/scikit-learn that referenced this pull request Aug 24, 2016
olologin pushed a commit to olologin/scikit-learn that referenced this pull request Aug 24, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016