[MRG+2] Use fused types in sparse mean variance functions #6593
Conversation
is_CSC = isinstance(X, sp.csc_matrix)
return _incr_mean_variance_axis0(X.data, X.shape, X.indices, X.indptr,
                                 last_mean, last_var, last_n,
                                 is_CSR, is_CSC)
Maybe there is a better solution to distinguish X as a csr_matrix or a csc_matrix?
You could use X.format, but the present approach is better. In any case, why did you move the logic that checks the type of sparse matrix here?
Ah, I see why.
This might be nitpicking, but you need not pass is_CSR or is_CSC. You can get whatever information you want from len(X_indptr):

    if len(X_indptr) == shape[0] + 1:
        ...  # Then CSR
    else:
        ...  # Then CSC
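A minimal sketch of that idea in plain Python (the helper name _is_csr is invented here, the real code is Cython, and it assumes a non-square matrix, since for a square matrix the two lengths coincide):

```python
def _is_csr(X_indptr, shape):
    # CSR stores one indptr entry per row plus one; CSC stores one per
    # column plus one, so the indptr length alone identifies the format.
    return len(X_indptr) == shape[0] + 1
```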
Hello @MechCoder, I've addressed the issues you pointed out.
@jnothman, would you please have a look when you have time?
cdef unsigned int n_samples = X.shape[0]
cdef unsigned int n_features = X.shape[1]
if X.dtype == np.int32 or X.dtype == np.int64:
    X = X.astype(np.float64)
We can cast X to np.float32 if X.dtype is np.int32, right?
Thanks for the refactoring! Can you use memory_profiler to do some quick benchmarks and confirm the lessened memory usage when the dtype is np.float32?
Hello @MechCoder, about memory profiling: I called the mean_variance_axis function on an np.float32 array with shape (5 * 10^6, 20). Here is the memory usage over time:
[memory usage plot]
As you can see, the memory usage surrounded by the bracket decreases drastically.
I would expect the peak memory usage to be drastically reduced. Did you forget to rebuild sklearn?
Here is my test script:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.utils.sparsefuncs import mean_variance_axis

    X = np.random.rand(5000000, 20)
    X = X.astype(np.float32)
    X_csr = sp.csr_matrix(X)

    @profile
    def test():
        X_means, X_vars = mean_variance_axis(X_csr, axis=0)
        print X_means.dtype

    test()

I think the peak memory usage appears when I initialize the matrices above.
In that case, you can run the script under memory_profiler, which will provide line-by-line output.
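Another option for isolating the call itself from the cost of building the matrices is memory_profiler's memory_usage helper; below is a sketch under the assumption that its (func, args, kwargs) calling form is available and that measuring only the call is what we want:

```python
import numpy as np
import scipy.sparse as sp
from memory_profiler import memory_usage
from sklearn.utils.sparsefuncs import mean_variance_axis

# Build the input up front so its allocation is not attributed to the call
# being measured below.
X_csr = sp.csr_matrix(np.random.rand(5000000, 20).astype(np.float32))

def run():
    return mean_variance_axis(X_csr, axis=0)

# Sample memory while only run() executes; the peak of these samples then
# reflects mean_variance_axis rather than matrix construction.
samples = memory_usage((run, (), {}), interval=0.01)
print("peak MiB during mean_variance_axis:", max(samples))
```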
Oh, the graph looks good; I did not read it properly :/
Oh okay 😄
LGTM. cc: @jnothman
cdef np.ndarray[DOUBLE, ndim=1, mode="c"] X_data
X_data = np.asarray(X.data, dtype=np.float64)  # might copy!
cdef np.ndarray[int, ndim=1] X_indices = X.indices
if X.dtype == np.int32:
I'm not sure what I think of these rules... The question, I guess, is: what level of precision do we need in a mean?
For instance, an int32 converted to float32 loses precision, so we're going to be providing less precise answers than we used to for integer input.
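For context (this snippet is not from the thread), a quick way to see the int32 → float32 concern:

```python
import numpy as np

# float32 has a 24-bit significand, so integers above 2**24 are no longer
# represented exactly, while float64 can hold any int32 value exactly.
x = np.int32(2**24 + 1)
print(np.float32(x) == x)  # False: the value is rounded to 16777216.0
print(np.float64(x) == x)  # True
```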
And is there a reason we don't support other integer sparse matrix types (including bool, although mean and variance are silly functions to run in that case)? If the code supported those dtypes before (even if untested), I think we'd better maintain support.
Does this have test coverage?
Regarding "is there a reason we don't support other integer sparse matrix types": I have to admit that I don't know the reason 😢
Regarding "an int32 converted to float32 loses precision": yeah, you are right, but there's a workaround: still convert int32 and int64 into float64, just like before. What do you think?
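In other words, the proposed rule maps input dtypes to computation dtypes roughly as follows (a sketch; the helper name is invented for illustration):

```python
import numpy as np

def _computation_dtype(input_dtype):
    # Keep float32 in float32 to avoid the extra copy and memory blow-up;
    # every other dtype (int32, int64, float64, ...) keeps the previous
    # behaviour of being computed in float64.
    return np.float32 if input_dtype == np.float32 else np.float64
```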
I think we're playing at the margins, but keeping it as float64 (at least for int32) is safest. The most important of my comments is that other types remain supported.
Indeed, I think we've actually lost some precision by calculating the mean of float32s in float32 rather than float64, but in that case I think the difference is really marginal.
Not sure what that question means.
Sorry for not being clear. I mean that the workaround (still convert every dtype except float32 into float64) can already solve the related issues arguing that in some cases we only need float32 precision. So I think maybe we can merge the workaround solution first?
Yes, I think that's the right solution, "workaround" or otherwise.
@jnothman thanks!
🐝 ping @MechCoder
Yes, we should revert this. Sorry about that!
Hello @jnothman @MechCoder, I've updated the code; please have a look. Thanks!
I will let @jnothman do the honours.
It's not really an issue with this PR, but I suspect we should have a test there to ensure the result is sensible for integer dtypes. You could just have something like:

    for input_dtype, expected_dtype in [(np.float32, np.float32), (np.int32, np.float64), ...]

Thinking further about it, this change still copies data for integers, which we could avoid with a more generic fused type while still having float output. Can we assume that the mean of explicitly integer features is not something we're often interested in, so that it is not worth the additional compilation time?
We also need a what's new entry boasting what we've enhanced.
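A rough sketch of such a non-regression test (the dtype pairs and the test body are assumptions based on this discussion, not the test that was eventually merged):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

def test_mean_variance_axis_dtypes():
    rng = np.random.RandomState(0)
    X = rng.rand(40, 5) * 10
    for input_dtype, expected_dtype in [(np.float32, np.float32),
                                        (np.float64, np.float64),
                                        (np.int32, np.float64),
                                        (np.int64, np.float64)]:
        X_csr = sp.csr_matrix(X.astype(input_dtype))
        means, variances = mean_variance_axis(X_csr, axis=0)
        # The result should be floating point even for integer input,
        # and should not be upcast for float32 input.
        assert means.dtype == expected_dtype
        assert variances.dtype == expected_dtype
```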
Hello @jnothman, thanks for your review.
Maybe it's my lack of knowledge, but why is the mean of integer features often not something we are interested in?
I guess integer features, or at least binary ones, are common enough in many problem spaces. So the current code is still performing a copy for integer input. Do we mind?
+1 for not delaying this PR because of integer handling.
# Implement the function here since variables using fused types
# cannot be declared directly and can only be passed as function arguments
cdef unsigned int n_samples = shape[0]
cdef unsigned int n_features =shape[1]
style: spacing around =
But +1 for adding a non-regression test to check that integer dtypes are still supported.
Hi @jnothman @ogrisel @MechCoder, would you please check again?
@@ -131,6 +131,9 @@ Enhancements
- Add option to show ``indicator features`` in the output of Imputer.
  By `Mani Teja`_.

- Reduce the memory usage of :func:`utils.mean_variance_axis` and :func:`utils.incr_mean_variance_axis`
So I was hoping we could say that this reduced the memory usage of some estimators, rather than talking about utils here. Is that wrong? Or do we need to do more to reduce estimator memory usage?
What do you think about addressing that separately (checking all estimators that use this function)?
I don't mind mentioning utils. Saying "for 32-bit float arrays" might be worthwhile.
I think I still need to make assign_rows_csr support fused types as well, which can benefit #6430 a lot. After that, we may add something about estimators in whats_new?
Hi @jnothman, are you talking about something like the following?

    Reduce the memory usage for 32-bit float input arrays of :func:`utils.mean_variance_axis` and :func:`utils.incr_mean_variance_axis`
yeah, that's better
done! 🔨
LGTM!
Thanks, Yen!
…rn#6593)

* Use fused types in mean variance functions
* Add test for mean-variance functions using fused types
* Add whats_new
This is a follow-up PR to #6588, which tries to make the functions in utils/sparsefuncs_fast.pyx support Cython fused types. In this PR, I focus on the functions listed below:

- csr_mean_variance_axis0
- csc_mean_variance_axis0
- incr_mean_variance_axis0

EDIT:
I called the mean_variance_axis function on an np.float32 array with shape (5 * 10^6, 20). Here is the memory usage over time:
[memory usage plot]
The memory usage surrounded by the bracket indeed decreases.
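A small usage sketch of the behaviour described above (illustrative, not from the PR): with fused types, a float32 CSR matrix is expected to be processed without first being copied to float64, and the returned statistics should follow the input dtype.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

X_csr = sp.csr_matrix(np.random.rand(1000, 20).astype(np.float32))
means, variances = mean_variance_axis(X_csr, axis=0)
print(means.dtype, variances.dtype)  # expected: float32 float32
```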