WIP: Refactor euclidean_distance, add optional argument for preallocated output array. #483

dwf · 2011-12-19T17:31:03Z

This pull request adds:

- Several optimized Cython routines for (squared) Euclidean distance, now used in the dense-dense case by euclidean_distance.
- An explicit check that euclidean_distance is operating on floating point arrays (general consensus in the sprint room is that it doesn't usually make much sense in the integer case, and if you really want that, you can cast).
- An optional out argument for euclidean_distance (only in the dense-dense case; raises an error otherwise).

Explicit is better than implicit!
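To make the float check and the optional out argument concrete, here is a minimal NumPy sketch of the behaviour described above. It is not the Cython code from this pull request: the function name sq_euclidean_dense, its signature, and the choice of exception types are all assumptions made for illustration.

```python
import numpy as np


def sq_euclidean_dense(X, Y, out=None):
    """Squared Euclidean distances between rows of X and rows of Y.

    Hypothetical helper: the name, signature and error types are
    illustrative assumptions, not the API added by this pull request.
    """
    X = np.asarray(X)
    Y = np.asarray(Y)

    # Explicit dtype check: refuse integer input rather than casting silently.
    if not (np.issubdtype(X.dtype, np.floating)
            and np.issubdtype(Y.dtype, np.floating)):
        raise TypeError("X and Y must be floating point arrays; cast first")
    if X.ndim != 2 or Y.ndim != 2 or X.shape[1] != Y.shape[1]:
        raise ValueError("X and Y must be 2D with the same number of features")

    shape = (X.shape[0], Y.shape[0])
    if out is None:
        # No buffer supplied: allocate the result here.
        out = np.empty(shape, dtype=np.result_type(X, Y))
    elif out.shape != shape:
        raise ValueError("out must have shape %r" % (shape,))

    # Fill one row at a time so no (n_a, n_b, n_features) temporary is built.
    for i in range(X.shape[0]):
        diff = Y - X[i]
        out[i] = (diff * diff).sum(axis=1)
    return out
```

Passing a preallocated array, e.g. `buf = np.empty((len(X), len(Y)))` followed by `sq_euclidean_dense(X, Y, out=buf)`, writes the result into `buf` and returns that same array, which avoids a fresh allocation when the function is called repeatedly.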

fabianp · 2011-12-19T17:47:15Z

sklearn/utils/arrayfuncs.pyx

+    ----------
+    X : ndarray, float32, shape = [n_samples_a, n_features]
+
+    Y : ndarray, float32, shape = [n_sample_b, n_features]


typo: n_sampleS_b

Thanks, fixed.

dwf · 2011-12-19T18:02:08Z

It seems that using BLAS can be faster under some circumstances, but it loses precision in certain cases. @jakevdp has done some benchmarking.

In an ideal world, these Cython functions would guess whether the BLAS call is worth it and use it only when it is.
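The trade-off can be illustrated with a small NumPy sketch. It assumes that the BLAS route means expanding the squared distance as ||x||^2 - 2 x.y + ||y||^2 so that the cross term becomes a matrix product; the thread does not spell out the formula, so treat that as an assumption. The helper names sq_dists_direct and sq_dists_expanded are hypothetical.

```python
import numpy as np


def sq_dists_direct(X, Y):
    # Explicit differences: accurate, but builds an (n_a, n_b, n_features) temporary.
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    return (diff * diff).sum(axis=-1)


def sq_dists_expanded(X, Y):
    # ||x||^2 - 2 x.y + ||y||^2: the cross term is a single matrix product,
    # which NumPy typically hands to BLAS for floating point input.
    XX = (X * X).sum(axis=1)[:, np.newaxis]
    YY = (Y * Y).sum(axis=1)[np.newaxis, :]
    D = XX - 2.0 * np.dot(X, Y.T) + YY
    np.maximum(D, 0, out=D)  # rounding can push small distances below zero
    return D


# float32 points that sit far from the origin but close to each other:
# the squared norms are huge compared with the distances between points.
rng = np.random.RandomState(0)
X = (1000.0 + rng.random_sample((20, 50))).astype(np.float32)
Y = (1000.0 + rng.random_sample((20, 50))).astype(np.float32)

# float64 reference computed from the same (already rounded) inputs.
reference = sq_dists_direct(X.astype(np.float64), Y.astype(np.float64))
err_direct = np.abs(sq_dists_direct(X, Y) - reference).max()
err_expanded = np.abs(sq_dists_expanded(X, Y) - reference).max()
print("max abs error, direct:   %g" % err_direct)
print("max abs error, expanded: %g" % err_expanded)
```

In this setup the expanded form subtracts large, nearly equal numbers, so most of the significant digits cancel, while the direct form only squares small differences. Which variant is faster depends on the array sizes and the BLAS build, which is what the benchmarking is meant to settle.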

GaelVaroquaux · 2011-12-20T16:03:45Z

Code to benchmark (save it as an '.ipy' file and run it in IPython):

from sklearn.metrics.pairwise import euclidean_distances

import numpy as np
np.random.seed(0)

for n_features in (10, 100, 1000):
    for n_samples in (10, 100, 1000):
        # Rows are samples, columns are features.
        Y = np.random.random(size=(n_samples, n_features))
        X = np.random.random(size=(n_samples, n_features))

        print('n_features: %4i, n_samples: %4i' % (n_features, n_samples))
        %timeit euclidean_distances(X, Y)
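For runs outside IPython, the same loop can be sketched with the standard timeit module. The bench helper below, including its name and parameters, is a hypothetical illustration rather than part of this pull request; it takes the function to time as an argument.

```python
import timeit

import numpy as np


def bench(func, sizes=(10, 100, 1000), repeat=3, number=5):
    """Time func(X, Y) over a grid of shapes; returns best seconds per call.

    Hypothetical helper for illustration only.
    """
    rng = np.random.RandomState(0)
    results = {}
    for n_features in sizes:
        for n_samples in sizes:
            # Rows are samples, columns are features.
            X = rng.random_sample((n_samples, n_features))
            Y = rng.random_sample((n_samples, n_features))
            timer = timeit.Timer(lambda: func(X, Y))
            # Keep the best of several repeats to reduce timing noise.
            best = min(timer.repeat(repeat=repeat, number=number)) / number
            results[(n_features, n_samples)] = best
    return results
```

With scikit-learn installed, `bench(euclidean_distances)` (after `from sklearn.metrics.pairwise import euclidean_distances`) returns a dict keyed by `(n_features, n_samples)`.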

fabianp · 2011-12-20T16:13:15Z

Another implementation I found in milk: https://github.com/luispedro/milk/blob/master/milk/unsupervised/pdist.py

dwf added 8 commits December 19, 2011 17:12

Cython-based dense Euclidean distance computation.

1a127c1

Updated C code for arrayfuncs.pyx

b8c6ef1

Avoid copies in new euclidean_distance code.

e1a6087

Check for float dtype in euclidean_distance.

2eb5c59

Explicit is better than implicit!

Remove computation no longer needed in dense-dense case.

a698bb3

Add optional output argument to euclidean_distance.

8494696

More consistent doc formatting in arrayfuncs.pyx.

45f731f

Update C code for arrayfuncs.pyx.

555ded9

fabianp reviewed Dec 19, 2011

Typo: n_sample_b -> n_samples_b

fefd318

dwf closed this Jul 21, 2012
