Deprecate LSHForest #8996

Closed
amueller opened this issue Jun 6, 2017 · 24 comments · Fixed by #9078
Labels
Easy, Sprint

Comments

@amueller
Member

amueller commented Jun 6, 2017

LSHForest should be deprecated and scheduled for removal in 0.21. The deprecation warning should also mention its poor performance. cc @ogrisel

amueller added the Easy, Need Contributor and Sprint labels Jun 6, 2017
@jnothman
Member

jnothman commented Jun 6, 2017

Sigh. Yes, I'm not persuaded it's doing its job. Part of that is because it's not fast enough, part because it's not flexible enough.

Where I think we have most prominently failed, however, is in ensuring its integration: providing a way to do approximate NN in the various estimators that use neighbours. I think if it were better integrated into the API, it could have been enhanced or alternative implementations proposed. Should we still be trying to do that?

@rth
Member

rth commented Jun 6, 2017

@jnothman I'm currently working on rebasing your PR #3922 and fixing things that broke due to API changes since then. That is still relevant, right (even if it would be used with external implementations instead of LSHForest)?

@jnothman
Member

jnothman commented Jun 6, 2017 via email

@amueller
Member Author

amueller commented Jun 6, 2017

@ogrisel is looking into other approximate algorithms right now. I'm not sure we should try to include another one in sklearn now, but maybe in contrib.

@ogrisel
Member

ogrisel commented Jun 9, 2017

Note that we can currently implement ANN with other scikit-learn components:

from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

ann_pca_balltree = make_pipeline(
    PCA(n_components=30),
    NearestNeighbors(algorithm='ball_tree')
)

or even:

from sklearn.ensemble import RandomTreesEmbedding

ann_rtree_pca_balltree = make_pipeline(
    RandomTreesEmbedding(n_estimators=300, max_depth=10),
    PCA(n_components=30),
    NearestNeighbors(algorithm='ball_tree')
)

But I have not tested any of those. BTW @lesteve it would be great to try to benchmark those against annoy and HNSW.
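
A minimal way to sanity-check the recall of the first pipeline against exact brute-force neighbours could look like the sketch below; the synthetic data, sizes and parameters are illustrative assumptions, and it is untested in the same sense as the pipelines above. (The PCA and NearestNeighbors steps are fitted separately here because Pipeline does not expose kneighbors for querying.)

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100_000, n_features=128, centers=100, random_state=0)
queries = X[:100]

# exact ground truth with brute force
exact = NearestNeighbors(algorithm='brute').fit(X)
_, true_idx = exact.kneighbors(queries, n_neighbors=10)

# approximate route: reduce to 30 dimensions, then query a ball tree
pca = PCA(n_components=30).fit(X)
approx = NearestNeighbors(algorithm='ball_tree').fit(pca.transform(X))
_, approx_idx = approx.kneighbors(pca.transform(queries), n_neighbors=10)

recall = np.mean([len(set(t) & set(a)) / 10 for t, a in zip(true_idx, approx_idx)])
print(f"recall@10: {recall:.2f}")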

@rth
Member

rth commented Jun 9, 2017

Unfortunately that approach wouldn't work that well for data to which PCA was already applied with 100-200 components; in that case BallTree is unlikely to outperform brute force...

@ogrisel
Member

ogrisel commented Jun 9, 2017

By reducing with PCA(n_components=30) I think you can get a good speed-up. I would not bet too high on the speed / accuracy tradeoffs of my pipelines (otherwise this kind of strategy would probably be well known already), but it is worth checking on standard benchmark datasets (word2vec or GloVe word embeddings and SIFT image embeddings).

@rth
Member

rth commented Jun 9, 2017

> would not bet too high on the speed / accuracy tradeoffs

Well for word2vec and GloVe the accuracy seems to drop pretty fast below 100 dimensions,

From the word2vec paper:
[figure: word2vec accuracy vs. vector dimensionality]

and from the GloVe paper:
[figure: GloVe accuracy vs. vector dimension]

Applying PCA with 30 components to the initial 100-dimensional vectors would produce roughly equivalent results to training 30-dimensional embeddings directly, right?

So when going from 100 to 30 dimensions, the loss in accuracy appears to be ~30%. In annoy, the equivalent loss of accuracy would yield roughly an 80x speedup. Going from 100 to 30 dimensions is already a 3x speedup for brute force, so the question is whether switching from brute force to ball_tree in 30 dimensions would give anything like an ~80/3 ≈ 26x speedup. In quick tests under conditions similar to the ANN benchmarks linked above (n_samples=1M) but with n_features=30, the query speed per sample I get is actually faster with brute force than with ball_tree. These are very quick considerations, so I might have forgotten something...
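
For what it's worth, a quick-and-dirty way to reproduce that kind of comparison looks roughly like the sketch below; the sizes are smaller than the 1M-sample runs discussed above and purely illustrative.

from time import perf_counter

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(200_000, 30)      # smaller than the 1M-sample benchmark conditions
queries = rng.randn(100, 30)

for algo in ("brute", "ball_tree"):
    nn = NearestNeighbors(algorithm=algo).fit(X)
    tic = perf_counter()
    nn.kneighbors(queries, n_neighbors=10)
    elapsed = perf_counter() - tic
    print(f"{algo}: {elapsed / len(queries) * 1e3:.2f} ms per query")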

@ogrisel
Member

ogrisel commented Jun 9, 2017

@rth you are probably right :)

@jnothman
Member

jnothman commented Jun 10, 2017 via email

@espg
Contributor

espg commented Aug 14, 2017

...are there plans to replace LSH with something else? Right now, it seems to be the only thing that can query >500 features for large (>1M sample) datasets in reasonable time... unless there's another option I'm missing, perhaps?

@jnothman
Member

jnothman commented Aug 15, 2017 via email

@jnothman
Member

Also, you're welcome to state why you disagree...

@jnothman
Member

I must say that most of the reasons above are not sufficient to remove something. I think we felt that having it available made people assume it was good quality, or efficient for their purposes. It wasn't.

@espg
Contributor

espg commented Aug 15, 2017

Thanks for the info, I appreciate it. It looks like Spotify's annoy is a reasonably close drop-in replacement that will work much better than LSH for my application.

I do think it would be good to have an approximate neighbors interface in sklearn; something that fits well with the API and could be selected via the 'algorithm' flag would let quite a few existing algorithms be sped up in cases where clustering accuracy isn't paramount (i.e., when clustering is a preprocessing step). I understand that the LSH implementation isn't this, both in terms of quality and in terms of how it's integrated with the neighbors API... but if there were an implementation on par (quality-wise) with BallTree, that would be a very useful enhancement for the library.
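
For reference, the kind of annoy usage being described above looks roughly like the sketch below; the data and parameters are placeholders, not a benchmark.

import numpy as np
from annoy import AnnoyIndex

X = np.random.RandomState(0).randn(10_000, 512).astype('float32')

index = AnnoyIndex(X.shape[1], 'euclidean')
for i, vec in enumerate(X):
    index.add_item(i, vec)
index.build(50)  # more trees give better recall at the cost of a larger index

# the ten approximate nearest neighbours of the first vector
neighbours = index.get_nns_by_vector(X[0], 10)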

@amueller
Member Author

@espg PR welcome ;) I think someone on the Paris team was actually looking into it, though I don't remember who. @agramfort might know?

@rth
Member

rth commented Aug 15, 2017

In #8999 (I will try to find time to finish it)...

@agramfort
Member

Also, @lesteve has started to play with this.

@jaallbri

I couldn't find anywhere else to comment on this, but we run into high-dimensional problems well solved by approximate nearest neighbor algorithms like LSHForest. For example, we have custom approximate string matching problems where we have a corpus of 10 million strings and we want to search for the best matches. Using scikit-learn we can control vectorization and matching a lot better than with something like Solr. I also have other use cases where we may have a sparse, high-dimensional (millions of features) space that could be used to find duplicate objects in our data, and LSHForest is ideal because it supports sparse input. What do you need to see in order to keep it from being removed?
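
For reference, the kind of usage described above looks roughly like the sketch below; the corpus and parameters are illustrative placeholders, the point being that LSHForest accepts the sparse TF-IDF matrix directly.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest  # deprecated in 0.19, removed in 0.21

corpus = ["Acme Corporation", "ACME Corp.", "Globex LLC", "Initech Inc."]

vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = vectorizer.fit_transform(corpus)   # sparse, high-dimensional character n-grams

lshf = LSHForest(n_estimators=20, n_candidates=200, random_state=42).fit(X)
distances, indices = lshf.kneighbors(vectorizer.transform(["acme corp"]), n_neighbors=2)
print([corpus[i] for i in indices[0]])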

@amueller
Member Author

@jaallbri we need to see it actually being useful. Check out https://erikbern.com/2018/06/17/new-approximate-nearest-neighbor-benchmarks.htm for a good comparison.

@cottrell

> ...we run into high dimension problems well solved by approximate nearest neighbor algorithms like LSHForest [...] What do you need to see in order to keep it from being removed?

I'm also occasionally kicking around with LSH, largely for high-dimensional structured text that has been mislabeled. There are a lot of other LSH implementations, but I haven't experimented much with them. Curious what people are typically using instead of sklearn for similar tasks.

@espg
Contributor

espg commented Jan 18, 2019

> Curious what people are typically using instead of sklearn for similar tasks.

@cottrell I still use sklearn, just an earlier version that still has LSH. I have played around with spotify-annoy for a bit, and found it was faster but less accurate... so it seemed better to use sklearn given that I wanted higher accuracy (recall)

It would be nice to fix LSH in sklearn to be faster, although I don't have a good sense of how poor the performance is; @amueller, the benchmarks you posted don't include sklearn's deprecated LSH model, so I don't know where it fell performance-wise :-/

@jnothman
Member

I think we never had strong enough motivation for including LSHForest in a machine learning library, to be honest. It makes sense as a substitute for nearest neighbours, but we needed to provide an API for learning from approximate nearest neighbours. #10482 should largely solve that if we can get consensus on some API purity issues. But all that still won't fix the fact that our LSHForest was very weak relative to the market of Python approximate nearest neighbours implementations.
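
To make that direction concrete, the pattern #10482 is aiming at looks roughly like the sketch below: neighbour computation decoupled from the estimator via a precomputed sparse distance graph. Here the graph comes from exact NearestNeighbors; an external approximate index (annoy, HNSW, ...) could produce the same kind of matrix. The data and parameters are illustrative assumptions, not the final API.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=2000, centers=5, random_state=0)

# sparse distance graph restricted to each point's 10 nearest neighbours
nn = NearestNeighbors(n_neighbors=10).fit(X)
graph = nn.kneighbors_graph(X, mode='distance')

# DBSCAN accepts a sparse precomputed distance matrix; pairs absent from the
# graph are treated as farther apart than eps
labels = DBSCAN(eps=0.5, metric='precomputed').fit_predict(graph)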

@amueller
Member Author

@espg you can run them yourself: https://github.com/erikbern/ann-benchmarks
