Deprecate LSHForest #8996

Closed
amueller opened this issue Jun 6, 2017 · 24 comments · Fixed by #9078
Labels
Easy, Sprint

Comments

@amueller
Member

amueller commented Jun 6, 2017

LSHForest should be deprecated and scheduled for removal in 0.21. The deprecation warning should also mention its poor performance. cc @ogrisel

amueller added the Easy, Need Contributor and Sprint labels Jun 6, 2017
@jnothman
Member

jnothman commented Jun 6, 2017

Sigh. Yes, I'm not persuaded it's doing its job. Part of that is because it's not fast enough, part because it's not flexible enough.

Where I think we have most prominently failed, however, is in ensuring its integration: providing a way to do approximate NN in the various estimators that use neighbours. I think if it were better integrated into the API, it could have been enhanced or alternative implementations proposed. Should we still be trying to do that?

@rth
Member

rth commented Jun 6, 2017

@jnothman I'm currently working on rebasing your PR #3922 and fixing things that broke due to API changes since then. That is still relevant, right (even if it would be used with external implementations instead of LSHForest)?

@jnothman
Member

jnothman commented Jun 6, 2017 via email

@amueller
Member Author

amueller commented Jun 6, 2017

@ogrisel is looking into other approximate algorithms right now. I'm not sure we should try to include another one in sklearn now, but maybe in contrib.

@ogrisel
Member

ogrisel commented Jun 9, 2017

Note that we can currently implement ANN with other scikit-learn components:

from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

ann_pca_balltree = make_pipeline(
    PCA(n_components=30),
    NearestNeighbors(algorithm='ball_tree')
)

or even:

from sklearn.ensemble import RandomTreesEmbedding

ann_rtree_pca_balltree = make_pipeline(
    RandomTreesEmbedding(n_estimators=300, max_depth=10),
    PCA(n_components=30),
    NearestNeighbors(algorithm='ball_tree')
)

But I have not tested any of those. BTW @lesteve it would be great to try to benchmark those against annoy and HNSW.
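
A minimal way to sanity-check the recall of the first pipeline against exact brute-force neighbours could look like the sketch below; the synthetic data, sizes and parameters are illustrative assumptions, and it is untested in the same sense as the pipelines above. (The PCA and NearestNeighbors steps are fitted separately here because Pipeline does not expose kneighbors for querying.)

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100_000, n_features=128, centers=100, random_state=0)
queries = X[:100]

# exact ground truth with brute force
exact = NearestNeighbors(algorithm='brute').fit(X)
_, true_idx = exact.kneighbors(queries, n_neighbors=10)

# approximate route: reduce to 30 dimensions, then query a ball tree
pca = PCA(n_components=30).fit(X)
approx = NearestNeighbors(algorithm='ball_tree').fit(pca.transform(X))
_, approx_idx = approx.kneighbors(pca.transform(queries), n_neighbors=10)

recall = np.mean([len(set(t) & set(a)) / 10 for t, a in zip(true_idx, approx_idx)])
print(f"recall@10: {recall:.2f}")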

@rth
Member

rth commented Jun 9, 2017

Unfortunately that approach wouldn't work that well for data to which PCA was already applied with 100-200 components; in that case BallTree is unlikely to outperform brute force...

@ogrisel
Member

ogrisel commented Jun 9, 2017

By reducing with PCA(n_components=30) I think you can get a good speed-up. I would not bet too high on the speed / accuracy tradeoffs of my pipelines (otherwise this kind of strategy would probably be well known already), but it is worth checking on standard benchmark datasets (word2vec or GloVe word embeddings and SIFT image embeddings).

@rth
Member

rth commented Jun 9, 2017

> would not bet too high on the speed / accuracy tradeoffs

Well for word2vec and GloVe the accuracy seems to drop pretty fast below 100 dimensions,

From the word2vec paper:
[figure: word2vec accuracy vs. vector dimensionality]

and from the GloVe paper:
[figure: GloVe accuracy vs. vector dimension]

Applying PCA with 30 components to the initial 100-dimensional vectors would produce roughly equivalent results to training 30-dimensional embeddings directly, right?

So when going from 100 to 30 dimensions, the loss in accuracy appears to be ~30%. In annoy, the equivalent loss of accuracy would yield roughly an 80x speedup. Going from 100 to 30 dimensions is already a 3x speedup for brute force, so the question is whether switching from brute force to ball_tree in 30 dimensions would give anything like an ~80/3 ≈ 26x speedup. In quick tests under conditions similar to the ANN benchmarks linked above (n_samples=1M) but with n_features=30, the query speed per sample I get is actually faster with brute force than with ball_tree. These are very quick considerations, so I might have forgotten something...
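
For what it's worth, a quick-and-dirty way to reproduce that kind of comparison looks roughly like the sketch below; the sizes are smaller than the 1M-sample runs discussed above and purely illustrative.

from time import perf_counter

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(200_000, 30)      # smaller than the 1M-sample benchmark conditions
queries = rng.randn(100, 30)

for algo in ("brute", "ball_tree"):
    nn = NearestNeighbors(algorithm=algo).fit(X)
    tic = perf_counter()
    nn.kneighbors(queries, n_neighbors=10)
    elapsed = perf_counter() - tic
    print(f"{algo}: {elapsed / len(queries) * 1e3:.2f} ms per query")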

@ogrisel
Member

ogrisel commented Jun 9, 2017

@rth you are probably right :)

@jnothman
Member

jnothman commented Jun 10, 2017 via email

@espg
Contributor

espg commented Aug 14, 2017

...are there plans to replace LSH with something else? Right now, it seems to be the only thing that can query >500 features for large (>1M sample) datasets in reasonable time... unless there's another option I'm missing, perhaps?

@jnothman
Member

jnothman commented Aug 15, 2017 via email

@jnothman
Member

Also, you're welcome to state why you disagree...

@jnothman
Member

I must say that most of the reasons above are not sufficient to remove something. I think we felt that having it available made people assume it was good quality, or efficient for their purposes. It wasn't.

@espg
Contributor

espg commented Aug 15, 2017

Thanks for the info, I appreciate it. It looks like Spotify's annoy is a reasonably close drop-in replacement that will work much better than LSH for my application.

I do think it would be good to have an approximate neighbors interface in sklearn; something that fits well with the API and could be selected via the 'algorithm' flag would let quite a few existing algorithms be sped up in cases where clustering accuracy isn't paramount (i.e., when clustering is a preprocessing step). I understand that the LSH implementation isn't this, both in terms of quality and in terms of how it's integrated with the neighbors API... but if there were an implementation on par (quality-wise) with BallTree, that would be a very useful enhancement for the library.
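
For reference, the kind of annoy usage being described above looks roughly like the sketch below; the data and parameters are placeholders, not a benchmark.

import numpy as np
from annoy import AnnoyIndex

X = np.random.RandomState(0).randn(10_000, 512).astype('float32')

index = AnnoyIndex(X.shape[1], 'euclidean')
for i, vec in enumerate(X):
    index.add_item(i, vec)
index.build(50)  # more trees give better recall at the cost of a larger index

# the ten approximate nearest neighbours of the first vector
neighbours = index.get_nns_by_vector(X[0], 10)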

@amueller
Member Author

@espg PR welcome ;) I think someone on the Paris team was actually looking into it, though I don't remember who. @agramfort might know?

@rth
Member

rth commented Aug 15, 2017

In #8999 (I will try to find time to finish it)...

@agramfort
Member

Also, @lesteve has started to play with this.

@jaallbri

I couldn't find anywhere else to comment on this, but we run into high-dimensional problems well solved by approximate nearest neighbor algorithms like LSHForest. For example, we have custom approximate string matching problems where we have a corpus of 10 million strings and we want to search for the best matches. Using scikit-learn we can control vectorization and matching a lot better than with something like Solr. I also have other use cases where we may have a sparse, high-dimensional (millions of features) space that could be used to find duplicate objects in our data, and LSHForest is ideal because it supports sparse input. What do you need to see in order to keep it from being removed?
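
For reference, the kind of usage described above looks roughly like the sketch below; the corpus and parameters are illustrative placeholders, the point being that LSHForest accepts the sparse TF-IDF matrix directly.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest  # deprecated in 0.19, removed in 0.21

corpus = ["Acme Corporation", "ACME Corp.", "Globex LLC", "Initech Inc."]

vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = vectorizer.fit_transform(corpus)   # sparse, high-dimensional character n-grams

lshf = LSHForest(n_estimators=20, n_candidates=200, random_state=42).fit(X)
distances, indices = lshf.kneighbors(vectorizer.transform(["acme corp"]), n_neighbors=2)
print([corpus[i] for i in indices[0]])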

@amueller
Member Author

@jaallbri we need to see it actually being useful. Check out https://erikbern.com/2018/06/17/new-approximate-nearest-neighbor-benchmarks.htm for a good comparison.

@cottrell

> ...we run into high dimension problems well solved by approximate nearest neighbor algorithms like LSHForest [...] What do you need to see in order to keep it from being removed?

I'm also occasionally kicking around with LSH, largely for high-dimensional structured text that has been mislabeled. There are a lot of other LSH implementations, but I haven't experimented much with them. Curious what people are typically using instead of sklearn for similar tasks.

@espg
Contributor

espg commented Jan 18, 2019

> Curious what people are typically using instead of sklearn for similar tasks.

@cottrell I still use sklearn, just an earlier version that still has LSH. I have played around with spotify-annoy for a bit, and found it was faster but less accurate... so it seemed better to use sklearn given that I wanted higher accuracy (recall)

It would be nice to fix LSH in sklearn to be faster, although I don't have a good sense of how poor the performance is; @amueller, the benchmarks you posted don't include sklearn's deprecated LSH model, so I don't know where it fell performance-wise :-/

@jnothman
Member

I think we never had strong enough motivation for including LSHForest in a machine learning library, to be honest. It makes sense as a substitute for nearest neighbours, but we needed to provide an API for learning from approximate nearest neighbours. #10482 should largely solve that if we can get consensus on some API purity issues. But all that still won't fix the fact that our LSHForest was very weak relative to the market of Python approximate nearest neighbours implementations.
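
To make that direction concrete, the pattern #10482 is aiming at looks roughly like the sketch below: neighbour computation decoupled from the estimator via a precomputed sparse distance graph. Here the graph comes from exact NearestNeighbors; an external approximate index (annoy, HNSW, ...) could produce the same kind of matrix. The data and parameters are illustrative assumptions, not the final API.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=2000, centers=5, random_state=0)

# sparse distance graph restricted to each point's 10 nearest neighbours
nn = NearestNeighbors(n_neighbors=10).fit(X)
graph = nn.kneighbors_graph(X, mode='distance')

# DBSCAN accepts a sparse precomputed distance matrix; pairs absent from the
# graph are treated as farther apart than eps
labels = DBSCAN(eps=0.5, metric='precomputed').fit_predict(graph)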

@amueller
Member Author

@espg you can run them yourself: https://github.com/erikbern/ann-benchmarks
