[MRG+1] ENH use pairwise_distances_chunked in brute nearest neighbors #11136
Conversation
Note that I believe no tests are needed: the functionality of algorithm='brute' is tested, and the memory consumption of pairwise_distances_chunked is tested elsewhere.
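For context, a rough sketch of the pattern this PR relies on (not the PR's actual code): `pairwise_distances_chunked` yields the distance matrix in row blocks whose size is derived from the `working_memory` setting, and an optional `reduce_func` collapses each block into a small result before the next block is computed. The shapes and the `working_memory` value below are purely illustrative.

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.rand(5000, 100)

def reduce_func(dist_chunk, start):
    # dist_chunk has shape (chunk_size, n_samples): distances from rows
    # X[start:start + chunk_size] to all of X. Keep only the indices of the
    # 5 nearest neighbors (excluding self), so the full distance matrix
    # never has to be held in memory at once.
    return np.argsort(dist_chunk, axis=1)[:, 1:6]

chunks = pairwise_distances_chunked(X, reduce_func=reduce_func, working_memory=64)
neigh_ind = np.vstack(list(chunks))
print(neigh_ind.shape)  # (5000, 5)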
This will be very useful!
Some benchmarks to double check that this doesn't affect performance too much on a standard use case with the default working_memory size could be helpful. There were some benchmarks in #10280 (comment), but I'm not sure to what extent they would generalize to this.
For readability, it would help to have some docstrings in the reduce functions specifying type and shape of inputs.
@@ -276,9 +278,25 @@ def _pairwise(self):
class KNeighborsMixin(object):
    """Mixin for k-neighbors searches"""

    def _kneighbors_reduce_func(self, dist, start,
                                n_neighbors, return_distance):
Would it make sense to decorate it with @staticmethod and remove self, to help readability?
It uses self.effective_metric_.
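For illustration, a reduce function along these lines, with the kind of docstring requested above, might look roughly like the sketch below. This is not necessarily the PR's exact code; it assumes squared euclidean distances are passed in when the effective metric is 'euclidean', which is exactly why the method needs `self`.

import numpy as np

def _kneighbors_reduce_func(self, dist, start, n_neighbors, return_distance):
    """Reduce a chunk of the distance matrix to the k nearest neighbors.

    Parameters
    ----------
    dist : ndarray of shape (n_samples_chunk, n_samples_fit)
        Distances from the query rows of this chunk to all fitted points
        (assumed to be squared distances when the effective metric is
        'euclidean').
    start : int
        Index of the first row of this chunk in the full query set
        (part of the chunked reduce_func signature, unused here).
    n_neighbors : int
        Number of neighbors to keep per row.
    return_distance : bool
        Whether to return distances along with the indices.

    Returns
    -------
    dist : ndarray of shape (n_samples_chunk, n_neighbors), if return_distance
    neigh_ind : ndarray of shape (n_samples_chunk, n_neighbors)
    """
    sample_range = np.arange(dist.shape[0])[:, None]
    # argpartition finds the n_neighbors smallest distances per row
    neigh_ind = np.argpartition(dist, n_neighbors - 1, axis=1)[:, :n_neighbors]
    # argpartition does not guarantee sorted order, so sort the kept columns
    neigh_ind = neigh_ind[sample_range,
                          np.argsort(dist[sample_range, neigh_ind])]
    if return_distance:
        if self.effective_metric_ == 'euclidean':
            # squared euclidean distances were computed for speed; undo here
            return np.sqrt(dist[sample_range, neigh_ind]), neigh_ind
        return dist[sample_range, neigh_ind], neigh_ind
    return neigh_ind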
Benchmark:

from pprint import pprint
import time
import sklearn
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import euclidean_distances

n_features = 100
results = []
for n_samples in [1000, 10000, 20000, 50000, ]:
    X = np.random.rand(n_samples, n_features)
    sampleD = euclidean_distances(X[:100])
    avg_dist = np.mean(sampleD[np.triu_indices_from(sampleD, 1)])
    for metric in ['euclidean']:
        nn = NearestNeighbors(radius=avg_dist * .9, n_neighbors=10,
                              metric=metric, algorithm='brute').fit(X)
        for method in ['kneighbors', 'radius_neighbors']:
            ### for working_mem in ['master']:
            for working_mem in [100, 1024, 999999999]:
                with sklearn.config_context(working_memory=working_mem):
                    s = time.time()
                    out = getattr(nn, method)()
                    t = time.time() - s
                    n_neighbors = sum(len(x) for x in out[0])
                    d = {'metric': metric,
                         'n_samples': n_samples,
                         'method': method,
                         'working_memory': working_mem,
                         'total_n_neighbors': n_neighbors,
                         'time': t}
                    pprint(d)
                    results.append(d)

All timings:
Basically, we see good improvements with capped working_memory.
@rth, let's merge this too?
Thanks for the benchmarks! I was actually curious whether this has an impact on the run time; it turns out that it does, a little. Below are a few more benchmarks.

For `NearestNeighbors.kneighbors`:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from neurtu import delayed, Benchmark

n_features = 100
n_samples_predict = 1000

def runs():
    for n_samples in [5000, 10000, 20000, 40000, 80000, 160000, 320000]:
        for n_neighbors in [1, 5, 10]:
            tags = {'n_samples': n_samples, 'n_neighbors': n_neighbors}
            X = np.random.rand(n_samples, n_features)
            Xp = np.random.rand(n_samples_predict, n_features)
            estimator = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute').fit(X)
            yield delayed(estimator, tags=tags).kneighbors(Xp)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['n_neighbors', 'n_samples'], inplace=True)
df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})
df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)
For `NearestNeighbors.radius_neighbors`:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from neurtu import delayed, Benchmark

n_features = 100
n_neighbors = 5
n_samples_predict = 1000
radius = 0.5

def runs():
    for n_samples in [5000, 10000, 20000, 40000, 80000, 160000, 320000]:
        tags = {'n_samples': n_samples, 'radius': radius}
        X = np.random.rand(n_samples, n_features)
        Xp = np.random.rand(n_samples_predict, n_features)
        estimator = NearestNeighbors(radius=radius, algorithm='brute').fit(X)
        yield delayed(estimator, tags=tags).radius_neighbors(Xp)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['radius', 'n_samples'], inplace=True)
df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})
df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)
So the memory usage works as expected; however, we pay for it with a performance overhead of around 4-5% once the full distance matrix no longer fits into a single chunk. I imagine the overhead is due to chunking and concatenating. Overall this sounds like a reasonable compromise to me.
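For reference, the chunk size in question is controlled by the global working_memory option (in MiB, default 1024); raising it trades peak memory for fewer, larger chunks and less concatenation overhead. A minimal illustration, with purely illustrative sizes:

import numpy as np
import sklearn
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(20000, 100)
nn = NearestNeighbors(n_neighbors=10, algorithm='brute').fit(X)

# Allow distance chunks of up to ~4 GiB instead of the default 1 GiB,
# reducing the chunking/concatenation overhead at the cost of peak memory.
with sklearn.config_context(working_memory=4096):
    dist, ind = nn.kneighbors(X)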
LGTM, if some docstrings are added to the private reduce functions (#11136 (review))
Thanks @jnothman !
Huh. My benchmarks above show substantial wall time decreases for large n_samples.
The plots represent run time? I thought it was memory usage. Do you still have the benchmark code?
The benchmark code is included above. Yes, it was time.
(although naively using …)
How much RAM were you running these benchmarks with? I get a memory error for 50000 samples with 32 GB RAM... Hmm, but for the rest, yes, with your script I can confirm the results.

The difference appears to be that in your benchmark you were running with X=None, i.e. querying the training set itself. The 20k case for me takes around 20 GB RAM (out of 32 GB available). I'm not familiar enough with the details of dynamic memory allocation; could it be that allocating such a large contiguous array without chunking carries some overhead at the OS level (fragmentation issues etc.), compared to allocating/deallocating smaller chunks? In any case the results look good in both benchmarks!
Hmm... I think I ran these on my laptop, with 16 GB... Yes, your benchmarks probably had fewer experimental variables. I forgot that I should probably keep the size of the test data constant; although in practice users are often using X=None with NN. Let's not worry about it too much, given that we have separately confirmed that this PR is a good thing.
LGTM
Formerly part of #10280
Fixes #7287