Deprecate LSHForest #8996
sigh. Yes, I'm not persuaded it's doing its job. Part of that is because... Where I think we have most prominently failed, however, is in ensuring its |
yes, it's still relevant, except the question is how we illustrate that
feature without even a basic approximate NN implementation. I fear that
without illustration it becomes an obscurity and somewhat difficult to maintain
…On 6 Jun 2017 8:12 pm, "Roman Yurchak" wrote:
@jnothman I'm currently working on rebasing your PR #3922 and fixing things
that broke due to API changes since then. That is still relevant, right
(even if it would be used with external implementations instead of
LSHForest)?
|
@ogrisel is looking into other approximate algorithms right now. I'm not sure we should try to include another one into sklearn now, but maybe into contrib. |
Note that we can currently implement ANN with other scikit-learn components:

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

ann_pca_balltree = make_pipeline(
    PCA(n_components=30),
    NearestNeighbors(algorithm='ball_tree'),
)
```

or even:

```python
from sklearn.ensemble import RandomTreesEmbedding

ann_rtree_pca_balltree = make_pipeline(
    RandomTreesEmbedding(n_estimators=300, max_depth=10),
    PCA(n_components=30),
    NearestNeighbors(algorithm='ball_tree'),
)
```

But I have not tested any of those. BTW @lesteve it would be great to try to benchmark those against annoy and HNSW. |
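Since `Pipeline` itself has no `kneighbors` method (see the comment below about adding one), a rough usage sketch for the first pipeline above is to transform queries with all but the last step and query the final `NearestNeighbors` estimator directly. The data here is random and purely illustrative, and the pipeline slicing syntax assumes a reasonably recent scikit-learn:

```python
import numpy as np

# Purely illustrative random data.
rng = np.random.RandomState(0)
X = rng.rand(10_000, 100)
queries = X[:5]

ann_pca_balltree.fit(X)

# Reduce the queries with every step except the final NearestNeighbors,
# then query that estimator directly.
queries_reduced = ann_pca_balltree[:-1].transform(queries)
distances, indices = ann_pca_balltree[-1].kneighbors(queries_reduced, n_neighbors=10)
```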
Unfortunately that approach wouldn't work that well for data on which PCA was already applied with 100-200 components, in which case BallTree is unlikely to outperform brute force... |
By reducing with |
Well, for word2vec and GloVe the accuracy seems to drop pretty fast below 100 dimensions. Applying PCA with 30 dimensions to initial 100-dimensional vectors would produce roughly equivalent results, right? So when going from 100 to 30 dimensions, the loss in accuracy appears to be ~30%. In annoy, the equivalent loss of accuracy would yield roughly an 80x speedup. Going from 100 to 30 dimensions is already a 3x speedup for brute force, so the question is whether brute force -> ball_tree in 30 dimensions would give anything like a ~80/3 = 26x speedup. In quick tests under similar conditions to the ANN benchmarks linked above (n_samples=1M) but with n_features=30, the query speed per sample I get is actually faster with brute force than with ball_tree. These are very quick considerations, I might have forgotten something... |
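For reference, a minimal benchmark sketch in that spirit (not from this thread; the random data and sizes are assumptions for illustration, and real embedding vectors would behave differently):

```python
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(42)
X = rng.rand(1_000_000, 30).astype(np.float32)   # n_samples=1M, n_features=30
queries = rng.rand(100, 30).astype(np.float32)

for algorithm in ("brute", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=10, algorithm=algorithm).fit(X)
    start = time.perf_counter()
    nn.kneighbors(queries)
    elapsed = time.perf_counter() - start
    print(f"{algorithm}: {elapsed / len(queries) * 1e3:.2f} ms per query")
```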
@rth you are probably right :) |
we would also need to add neighbour query methods to Pipeline
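For instance, a neighbour query method on Pipeline might look roughly like the following. This is only a hypothetical sketch of the idea, not an existing scikit-learn API; the `PipelineWithNeighbors` name is made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline

class PipelineWithNeighbors(Pipeline):
    """Hypothetical Pipeline subclass delegating neighbour queries to the
    final estimator after applying all preceding transforms."""

    def kneighbors(self, X, n_neighbors=None, return_distance=True):
        # Apply every step except the final estimator, then delegate to it.
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].kneighbors(
            X, n_neighbors=n_neighbors, return_distance=return_distance
        )

# Illustrative usage on random data.
X = np.random.RandomState(0).rand(1000, 100)
pipe = PipelineWithNeighbors([
    ("pca", PCA(n_components=30)),
    ("nn", NearestNeighbors(algorithm="ball_tree")),
]).fit(X)
distances, indices = pipe.kneighbors(X[:3], n_neighbors=5)
```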
|
...are there plans to replace LSH with something else? Right now, it seems to be the only thing that can query >500 features for large (>1M samples) datasets in reasonable time... unless there's another option I'm missing, perhaps? |
We found that it was hard to find a case where this performed a lot better
than standard nearest neighbours, while alternative approximate nearest
neighbours (ANN) implementations exist that are more efficient and flexible
(ours only supported cosine similarity, batch fitting, etc.), including
Facebook's pysparnn, Spotify's annoy, ryanrhymes' panns, and FLANN.
You could also perform exact nearest neighbour search in a reduced
dimensionality.
At the end of the day, nearest neighbour search is not machine learning. It
is a tool used by machine learning and may involve machine learning. We
need to be able to integrate with external ANN tools more than we need to
implement them.
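As an illustration of that integration point, here is a rough sketch of wrapping an external ANN library (Spotify's annoy) behind a kneighbors-style interface. This is an assumption-laden example, not an official scikit-learn or annoy API, and the metric and tree counts are arbitrary:

```python
import numpy as np
from annoy import AnnoyIndex  # external dependency: pip install annoy

class AnnoyNeighbors:
    """Minimal wrapper exposing a kneighbors-like method on top of annoy."""

    def __init__(self, n_neighbors=5, metric="angular", n_trees=10):
        self.n_neighbors = n_neighbors
        self.metric = metric
        self.n_trees = n_trees

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=np.float32)
        self.index_ = AnnoyIndex(X.shape[1], self.metric)
        for i, row in enumerate(X):
            self.index_.add_item(i, row)
        self.index_.build(self.n_trees)
        return self

    def kneighbors(self, X, n_neighbors=None, return_distance=True):
        n_neighbors = n_neighbors or self.n_neighbors
        indices, distances = [], []
        for row in np.asarray(X, dtype=np.float32):
            idx, dist = self.index_.get_nns_by_vector(
                row, n_neighbors, include_distances=True
            )
            indices.append(idx)
            distances.append(dist)
        if return_distance:
            return np.array(distances), np.array(indices)
        return np.array(indices)
```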
|
Also, you're welcome to state why you disagree... |
I must say that most of the reasons above are not sufficient to remove something. I think we felt that having it available made people assume it was good quality, or efficient for their purposes. It wasn't. |
Thanks for the info, I appreciate it. It looks like Spotify's annoy is a reasonably close drop-in replacement that will work much better than LSH for my application. I do think it would be good to have an approximate neighbors interface in sklearn; something that fits well with the API and could be specified by the 'algo' flag would let quite a few existing algorithms be sped up in cases where clustering accuracy isn't paramount (i.e., when clustering is a preprocessing step). I understand that the LSH implementation isn't this, both in terms of quality and in terms of how it's integrated with the neighbors API... but if there were an implementation on par (quality-wise) with BallTree, that would be a very useful enhancement for the library. |
@espg PR welcome ;) I think someone in the Paris team was actually looking into it, though I don't remember who. @agramfort might know? |
In #8999 (will try to find time to finish it).. |
also @lesteve has started to play with this |
I couldn't find anywhere else to comment on this, but we run into high-dimension problems that are well solved by approximate nearest neighbor algorithms like LSHForest. For example, we have custom approximate string matching problems where we have a corpus of 10 million strings and we want to search for the best matches. Using scikit-learn we can control vectorization and matching a lot better than using something like Solr. I also have other use cases where we may have a sparse high-dimension (millions of features) space that could be used to find duplicate objects in our data, and LSHForest is ideal because it supports sparse input. What do you need to see in order to keep it from being removed? |
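One hedged sketch of that kind of sparse, high-dimensional string-matching setup using scikit-learn components only (the tiny corpus, n-gram range and neighbor counts are made-up illustrations; brute-force cosine search is exact rather than approximate and may be slow at the 10-million-string scale described above):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors

# Tiny stand-in corpus; the use case above involves ~10 million strings.
corpus = [
    "approximate string matching",
    "aproximate string matchng",
    "nearest neighbor search",
    "locality sensitive hashing",
]

# Character n-grams give a sparse, very high-dimensional representation
# that tolerates typos reasonably well.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                               n_features=2 ** 20)
X = vectorizer.transform(corpus)

# Brute-force cosine search accepts sparse input directly.
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(X)
distances, indices = nn.kneighbors(vectorizer.transform(["approximate string matcing"]))
print([corpus[i] for i in indices[0]])
```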
@jaallbri we need to see it actually being useful. Check out https://erikbern.com/2018/06/17/new-approximate-nearest-neighbor-benchmarks.htm for a good comparison. |
Am also occasionally kicking around with LSH, largely for high-dimensional structured text that has been mislabeled. There are a lot of other LSH implementations but I haven't experimented much with them. Curious what people are typically using instead of sklearn for similar tasks. |
@cottrell I still use sklearn, just an earlier version that still has LSH. I have played around with spotify-annoy for a bit, and found it was faster but less accurate... so it seemed better to use sklearn given that I wanted higher accuracy (recall). It would be nice to fix LSH in sklearn to be faster, although I don't have a good sense of how poor the performance is -- @amueller the benchmarks you posted don't include sklearn's deprecated LSH model, so I don't know where it fell performance-wise :-/ |
I think we never had strong enough motivation for including LSHForest in a machine learning library, to be honest. It makes sense as a substitute for nearest neighbours, but we needed to provide an API for learning from approximate nearest neighbours. #10482 should largely solve that if we can get consensus on some API purity issues. But all that still won't fix the fact that our LSHForest was very weak relative to the market of Python approximate nearest neighbours implementations. |
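To make "learning from approximate nearest neighbours" concrete, one common pattern (a sketch only, not necessarily what #10482 implements) is to precompute a sparse neighbourhood graph, possibly with an external ANN library, and feed it to an estimator that accepts precomputed distances:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import kneighbors_graph

# Toy data; in practice the sparse graph could come from an external ANN
# library rather than the exact kneighbors_graph used here.
X = np.random.RandomState(0).rand(500, 50)
distance_graph = kneighbors_graph(X, n_neighbors=20, mode="distance")

# DBSCAN consumes the precomputed sparse distance matrix directly, so the
# (approximate) neighbour search is decoupled from the learning step.
# The eps value is arbitrary for this illustration.
labels = DBSCAN(eps=0.5, metric="precomputed").fit_predict(distance_graph)
```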
@espg you can run them yourself: https://github.com/erikbern/ann-benchmarks |
LSHForest should be deprecated and scheduled for removal in 0.21. The deprecation warning should also mention its poor performance. cc @ogrisel
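A minimal sketch of what that could look like (illustrative only; the actual deprecation in scikit-learn may use different wording, helpers, or versions):

```python
import warnings

class LSHForest:
    """Placeholder illustrating the deprecation pattern, not the real class."""

    def __init__(self, n_estimators=10, n_candidates=50):
        warnings.warn(
            "LSHForest is deprecated and will be removed in 0.21. "
            "Its performance is poor compared to alternative approximate "
            "nearest neighbour implementations such as annoy or FLANN.",
            DeprecationWarning,
        )
        self.n_estimators = n_estimators
        self.n_candidates = n_candidates
```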