MRG: Speed up euclidean_distances with Cython #1006
Conversation
@dwf did something similar during the Granada sprint, but he and @GaelVaroquaux decided not to merge the code (I don't remember why, though).
Me neither. I was thinking about this recently and wanted to look at the …
Basically it's because, in a lot of cases, as @jakevdp pointed out to me, BLAS is faster. Past a certain number of features/number of examples, you are fighting a land war in Asia, so to speak. Thus, it's better to use np.dot() (or, if you want to call into dgemm/sgemm from Cython, that could be an option too). One thing that we still need is an out= argument, though, for algorithms where repeated distance computations are the bottleneck, to prevent multiple allocations.
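For readers following along, here is a rough numpy-level sketch (not the PR's Cython code) of the BLAS-backed trick being discussed: expand ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y so the heavy lifting is a single np.dot (dgemm) call. The function name and the clipping step are illustrative assumptions:

```python
import numpy as np

def squared_euclidean_distances(X, Y):
    """Illustrative sketch: pairwise squared distances via one GEMM call."""
    XX = (X * X).sum(axis=1)[:, np.newaxis]  # squared row norms of X, shape (n_x, 1)
    YY = (Y * Y).sum(axis=1)[np.newaxis, :]  # squared row norms of Y, shape (1, n_y)
    D = -2.0 * np.dot(X, Y.T)                # the part BLAS (dgemm) handles
    D += XX
    D += YY
    np.maximum(D, 0, out=D)                  # clip tiny negatives from round-off
    return D
```

An out= argument, as mentioned above, would let callers reuse the (n_x, n_y) buffer across repeated calls instead of reallocating it each time.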
On Aug 9, 2012, at 19:13 , David Warde-Farley notifications@github.com wrote:
As for the sparse case, I'm calling … And either way, could I please see your code? I want to gulp up all the cython tricks I can.
Partly implemented. Should I keep going with this, guys?
Vlad N.
Besides the speedup, another benefit is that your implementation avoids intermediary memory allocations (like …).
On Thu, Aug 9, 2012 at 1:19 PM, Vlad Niculae notifications@github.com wrote:
Ahh ok. Sorry I didn't look over your code, I'm in the middle of writing my …
Sounds good to me.
On Thu, Aug 09, 2012 at 10:19:34AM -0700, Vlad Niculae wrote:
I don't have time to look at the code (currently in panic mode), but this …
Looks good - could you put up some comparison benchmarks? There's been a lot of buzz on the cython list lately about the best way to pass pointers from numpy to C (see https://groups.google.com/forum/?fromgroups#!topic/cython-users/8uuxjB_wbBQ[1-25]). It seems the …
```cython
            np.ndarray[DOUBLE, ndim=2, mode="c"] out,
            bool squared=False):
    cdef np.ndarray[DOUBLE, ndim=1] XX = np.empty(X.shape[0], dtype=np.float64)
    cdef np.ndarray[DOUBLE, ndim=1] YY = np.empty(Y.shape[0], dtype=np.float64)
```
Defining arrays like this does not allow fast indexing (this was true in cython <0.16; I'm not sure if that's been fixed since then).
I should specify: I believe defining numpy arrays as a function argument leads to fast indexing, but defining arrays as variables does not.
Really? I've never noticed that, and I'm almost sure I have seen some old generated C that was doing indexing properly on such local variables.
It's definitely the case if you do "cdef np.ndarray foo" that you get Python overhead, but I think if a C type and ndim are there you're alright.
@vene could you check whether indexing via .data is faster?
EDIT: Cancel that, I managed.
The speed is about the same. The generated C code is slightly different. Compare:
```cython
XX[ii] = 0
```
->
```c
__pyx_t_10 = __pyx_v_ii;
*__Pyx_BufPtrStrided1d(__pyx_t_7sklearn_7metrics_14euclidean_fast_DOUBLE *, __pyx_pybuffernd_XX.rcbuffer->pybuffer.buf, __pyx_t_10, __pyx_pybuffernd_XX.diminfo[0].strides) = 0.0;
```
versus
```cython
(<DOUBLE *>(XX.data))[ii] = 0
```
->
```c
(((__pyx_t_7sklearn_7metrics_14euclidean_fast_DOUBLE *)__pyx_v_XX->data)[__pyx_v_ii]) = 0.0;
```
Thanks for giving that a shot. Good to know.
cython -a gives me some yellow lines that hurt my eyes :)
@jakevdp I recall a discussion in Austin with you and Kurt Smith that there were some memoryview performance issues. Has that been resolved?
On Aug 9, 2012, at 23:35 , Alexandre Gramfort notifications@github.com wrote:
I just learned about that moments before submitting the PR, sorry. It's still not obvious to me how to fix some of them. I can conclude that it's not cool to return an array, but I was gonna use an out= parameter anyway, so that's that. But what's with the RefNannyDeclarations that make all the lines where I declare a function go yellow? Should I aim for no yellow at all?
Vlad N.
On Thu, Aug 09, 2012 at 03:48:12PM -0700, Vlad Niculae wrote:
Are you sure? Yellow lines are important only if they occur in a for …
On Thu, Aug 9, 2012 at 6:51 PM, Gael Varoquaux notifications@github.com wrote:
Indeed, you should not pay attention to yellow lines at the beginning/end … I will add to what @GaelVaroquaux said that yellow lines inside a for loop …
I can't believe I didn't know about …
@dwf - the performance issues come up when passing around many small slices: essentially, the overhead of creating a memoryview slice, though small, can be significant in some circumstances. I.e., if you're slicing an array in order to multiply two 3-dimensional vectors, it's not so good. If you're slicing in order to multiply two 1000-dimensional vectors, then you're probably OK. The way this PR is written, typed memory views will be as fast as or faster than using the cython numpy interface, and will lead to code that's more stable in numpy 1.7+ (with respect to the …)
On Fri, Aug 10, 2012 at 06:32:10AM -0700, Jacob Vanderplas wrote:
That worries me a bit. It forces developers to install their own version …
We do not care about Python 2.4. I see two options: …
@jakevdp I wrote a cython csr_matmat_t function to compute X * Y.T, imagining it would be faster for this specific setting. My implementation computes every element by finding the common nonzero indices in the rows. I get the right result, but the default, two-pass implementation kicks my ass. I thought X * Y.T would be a simpler case. Is there any way to improve this, or should I keep using scipy's dot?
@vene - I think two passes are required for this. Take a look at the scipy implementation: …
@GaelVaroquaux good points. I think, though, that because cython is only required for developers and not users, we can safely depend on the latest stable version. If I'm not mistaken, in the past we've settled on everyone using the latest stable version in order to prevent minor changes causing huge diffs in the generated C code. Also, there is a way to maintain future compatibility without using typed memory views: access the pointer with …
On Aug 10, 2012, at 17:48 , Jacob Vanderplas notifications@github.com wrote:
Actually, I toyed with cd_fast.pyx to see what its cython -a output looks like, and it turns out the C file was generated with an older version than the one I have (pip-installed). So this is not a rule. Vlad N.
You can do the dot product in one pass like this: … This assumes that the indices are sorted. Indices can be sorted by …
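A minimal sketch of the one-pass merge being described, assuming two CSR-style rows given as (data, indices) pairs with sorted indices; this is an illustration, not the code from the thread:

```python
def sparse_row_dot(data_a, indices_a, data_b, indices_b):
    """Dot product of two sparse rows in one pass over their sorted indices."""
    result = 0.0
    i = j = 0
    while i < len(indices_a) and j < len(indices_b):
        if indices_a[i] == indices_b[j]:    # common nonzero column
            result += data_a[i] * data_b[j]
            i += 1
            j += 1
        elif indices_a[i] < indices_b[j]:   # advance whichever row is behind
            i += 1
        else:
            j += 1
    return result
```

With scipy.sparse CSR matrices, the per-row data and indices slices can be pulled out via the indptr array; whether this beats scipy's own sparse matrix product will depend on shapes and sparsity.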
On Aug 10, 2012, at 18:12 , Mathieu Blondel notifications@github.com wrote:
Mathieu, you just blew my mind! :)
This is the same as in the SVM implementation we are using. Did you get it from there, @mblondel?
For the cython issue: we are using …
@amueller Yes, that's the same as in libsvm.
Re the fact that it's only needed for devs: I don't buy this argument.
On 8/10/12, Mathieu Blondel notifications@github.com wrote: …
OK, new observations: …
I don't think that the np.empty is a problem in itself: the yellow …
On 8/10/12, Vlad Niculae notifications@github.com wrote: …
Fixed, the branch was indeed unreachable because … How does this look? Should I rename …
I'll check ASAP.
Yes please, if it's not too much of a pain. But only after review, as it will screw …
This PR looks good. I am 👍 for merge. Before you merge, @vene: did you …
Yes, that is supposed to be fixed now. I hope the impact will be visible.
Regarding the scikit-learn speed page: the readme on GitHub specifies that the URL is vene.github.com, but I have the impression that it hasn't been updated lately. Is this the right URL?
You are correct in that it has moved to http://scikit-learn.github.com/scikit-learn-speed. Thanks for pointing that out. The idea was to link it from the main website and to alias it to speed.scikit-learn.org.
```python
from numpy.distutils.system_info import get_info
config = Configuration('metrics', parent_package, top_path)

# euclidean_fast needs CBLAS
```
This bit here was refactored into _build_utils, IIRC.
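For context, a hedged sketch of what hooking a CBLAS-dependent extension into numpy.distutils can look like; the extension name, source file, and fallback handling here are assumptions for illustration, not the actual scikit-learn build code:

```python
from numpy.distutils.misc_util import Configuration
from numpy.distutils.system_info import get_info


def configuration(parent_package='', top_path=None):
    config = Configuration('metrics', parent_package, top_path)

    # euclidean_fast needs CBLAS: pick up the optimized BLAS that numpy found,
    # if any (notfound_action=0 returns an empty dict instead of raising).
    blas_info = get_info('blas_opt', 0)

    config.add_extension(
        'euclidean_fast',
        sources=['euclidean_fast.c'],
        include_dirs=blas_info.pop('include_dirs', []),
        **blas_info
    )
    return config
```

(The real build also has to handle the case where no system BLAS is found, which the sketch above skips; centralizing that is presumably what the _build_utils refactor is about.)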
Can this be merged? Is another review needed? What is the test coverage?
I think that it is good.
Dunno.
@vene can you change the setup.py I pointed to and run a coverage report? Then I'll merge.
Coverage dropped from 90% to 89% in the …
Coverage report now: …
Coverage report before: …
I just rebuilt and ran the test suite; everything looks fine. I'd say this is good to go (just a FeatureUnion failure that is fixed in master). Sorry, but if the history is not pretty enough, I'll leave the squashing to someone else, as I will have a busy weekend.
There seems to be a linker problem. I get …
I couldn't reproduce the error and merged. Thanks for all the work! :)
OK, so Jenkins segfaulted. That's not good :(
Maybe it just needs a cleanup. I think that I got that once on that … I am trying to see if I can spot a problem on one of my computers.
On Tue, Oct 23, 2012 at 04:03:02PM +0200, Gael Varoquaux wrote:
Can't spot any problem, of course. How do I trigger a new build (without making a commit) to see if this …
On Tue, Oct 23, 2012 at 04:21:17PM +0200, Gael Varoquaux wrote:
Good news: I can reproduce it on one of my computers.
For some definition of "good".
On Tue, Oct 23, 2012 at 04:49:05PM +0200, Gael Varoquaux wrote:
I think that I found the problem: … I can't believe that we didn't fall into this trap before...
On Tue, Oct 23, 2012 at 05:19:17PM +0200, Gael Varoquaux wrote:
I was wrong. To reproduce, `python -c "import sklearn.metrics.pairwise"` is sufficient.
A backtrace, just in case people have great ideas looking at it:
```text
#0  0x0000000000568af0 in visit_decref ()
#1  0x000000000056953e in list_traverse.16457 ()
#2  0x0000000000543015 in collect.48496 ()
#3  0x00000000005439b9 in _PyObject_GC_Malloc ()
#4  0x00000000004df8cd in list_iter ()
#5  0x00000000004f5602 in PyObject_GetIter ()
#6  0x0000000000440e60 in builtin_zip.32579 ()
#7  0x0000000000497ea4 in PyEval_EvalFrameEx ()
#8  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#9  0x00000000004983b8 in PyEval_EvalFrameEx ()
#10 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#11 0x00000000004983b8 in PyEval_EvalFrameEx ()
#12 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#13 0x00000000004a7f18 in PyImport_ExecCodeModuleEx ()
#14 0x000000000053cde1 in load_source_module.39052 ()
#15 0x000000000056211a in imp_load_module.39102 ()
#16 0x0000000000497ea4 in PyEval_EvalFrameEx ()
#17 0x0000000000498602 in PyEval_EvalFrameEx ()
#18 0x0000000000498602 in PyEval_EvalFrameEx ()
#19 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#20 0x00000000004983b8 in PyEval_EvalFrameEx ()
#21 0x00000000004a7548 in gen_send_ex.isra.0 ()
#22 0x000000000043f6be in listextend.16657 ()
#23 0x0000000000498546 in PyEval_EvalFrameEx ()
#24 0x000000000049f1c0 in PyEval_EvalCodeEx ()
```
It clearly happens during a collect.
euclidean_distances usually pops up near the top by cumulative time in the profiler output of every benchmark that uses it (I'm talking about our new toy at http://jenkins-scikit-learn.github.com/scikit-learn-speed/), so optimizing it sounds like a good idea.
Here is my first go. I'm new at writing Cython, so feedback is highly appreciated. I have a sinking feeling that I'm copying memory around when I don't want to.
Here is a hackish benchmark script: https://gist.github.com/3305769
I have intentionally not included the .c file in this PR, so the commits will be lighter until this is done.
TODO:
P.S. be gentle, I'm typing with a broken finger :)