WIP: Sparse Random Projections #372
Conversation
Projected array.
"""
return X * self.components_.T
safe_sparse_dot?
self.components_ is always a CSR matrix so there is no need for safe_sparse_dot in that case.
Indeed, I realized that afterwards. I don't think it's sparse if you use a gaussian N(0, 1) for generating R though.
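For readers following this exchange, here is a minimal sketch of the `safe_sparse_dot` variant under discussion. `safe_sparse_dot` (from `sklearn.utils.extmath`) dispatches to the sparse or dense dot product depending on the operand types, so the same expression stays valid whether `components_` is a CSR matrix or a dense Gaussian matrix; the method body below is illustrative, not the PR's actual code.

```python
from sklearn.utils.extmath import safe_sparse_dot

def transform(self, X):
    """Project X onto the random components_.

    safe_sparse_dot picks the right dot product whether components_ is a
    scipy.sparse CSR matrix (the sparse scheme) or a dense ndarray drawn
    from a Gaussian (the case mentioned above).
    """
    return safe_sparse_dot(X, self.components_.T, dense_output=False)
```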
I wonder if it'd make sense to put the module in decomposition. Good job on bootstrapping this PR, @ogrisel!
I am not sure about moving to decomposition. It is mostly data-independent. I think I should first finish the remaining bullet points on the TODO list and then discuss where to put this module.
"""Random Projection transformers | ||
|
||
Random Projections are an efficient way to reduce the dimensionality | ||
of the data by trading a controlled amout of accuracy (as additional |
typo: "amout" -> "amount".
Shouldn't this be in the …
Thanks for the typos @larsmans, I will address them asap. As for the right package, I have no idea :) We will have to discuss this on the ML once the rest of this PR is merge ready.
…g instead of direct uniform sampling in the n_features space
On Sat, Oct 01, 2011 at 05:18:03PM -0700, Olivier Grisel wrote:
Make sure that both in the docstring and in the narrative doc, you stress …

Thanks for the poule!
I was skimming through "Fast and Accurate k-means for Large Datasets" (NIPS 2011) and what caught my attention is that they're using RP for fast approximate neighbor search (to avoid comparing each point with K clusters). The paper was quite vague so I had a quick look at the code and apparently this is what they are doing: project the points and the cluster centers onto a single random direction, sort the projected centers, and for each projected point find by binary search the two adjacent centers whose projections bracket it; only those are compared to the point.
That seems like a very aggressive approximation to me but we could try it during the sprint. Instead of considering just two centers, we can consider a window for more flexibility (once we know between which centers the point is, we readily know the other closest centers since the centers are sorted). Code and paper: http://web.engr.oregonstate.edu/~shindler/kMeansCode/
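A hedged NumPy sketch of the trick described above; the function name and the choice of exactly two candidate centers are illustrative, not the paper's or the PR's code.

```python
import numpy as np

def approx_nearest_centers(X, centers, random_state=0):
    """Sketch of the 1-d random projection shortcut discussed above:
    project points and centers onto one random direction, sort the
    projected centers, and only compute exact distances to the two
    centers whose projections bracket each point."""
    rng = np.random.RandomState(random_state)
    direction = rng.randn(X.shape[1])
    x_proj = X @ direction
    c_proj = centers @ direction
    order = np.argsort(c_proj)                      # centers sorted along the line
    pos = np.searchsorted(c_proj[order], x_proj)    # bracketing position per point
    left = np.clip(pos - 1, 0, len(order) - 1)
    right = np.clip(pos, 0, len(order) - 1)
    candidates = order[np.stack([left, right], axis=1)]   # shape (n_samples, 2)
    dists = np.linalg.norm(X[:, None, :] - centers[candidates], axis=2)
    return candidates[np.arange(len(X)), dists.argmin(axis=1)]
```

Widening the slice around `pos` would give the "window" variant mentioned above.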
Interesting, but I would personally rather focus on the text extraction / random projection / hashing vectorizer part during the sprint.
…nto random_projection
What is the state of this pull request?
It's a bit stalled but I would like to revive it if I can find the time. We could introduce dense random projections, but I don't see any real-life application, as the projection matrix would be too big to fit in memory for real-life dimensionality reduction tasks. Do you have some need in this area in particular?
Yes, I am highly interested in random projections as a data transformation technique. Indeed, in my current project, the storage of the random projection matrix is not my bottleneck. Edit: Note also that in compressed sensing, dense Gaussian and binomial random projections are well studied.
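To make the memory argument above concrete, here is a hedged sketch of a dense Gaussian baseline, with entries drawn i.i.d. from N(0, 1/n_components) so that squared norms are preserved in expectation; the helper name and the sizing example are illustrative, not part of the PR.

```python
import numpy as np

def gaussian_random_matrix(n_components, n_features, random_state=0):
    """Dense Gaussian projection matrix R with R_ij ~ N(0, 1 / n_components),
    so that E[||R x||^2] = ||x||^2 for any fixed x."""
    rng = np.random.RandomState(random_state)
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(n_components),
                      size=(n_components, n_features))

# Rough footprint of the dense matrix in float64, illustrating the memory
# concern above for a text-sized feature space:
n_components, n_features = 1000, 1_000_000
print(n_components * n_features * 8 / 1e9, "GB")   # -> 8.0 GB
```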
@arjoly please feel free to branch it on your repo and issue a new pull request from here to add the dense Gaussian and binomial baselines then, either in the same class or in a new class of the same module. I was also thinking of dropping the hashing-based, implicit sparse random projection. While it allows simulating the sparse RP without materializing the random matrix in memory, I think it's too CPU-intensive to be competitive in real-world applications.
Maybe @mblondel has an opinion on the latter.
BTW, I am pretty proud of the narrative documentation and the JL lemma related plots.
I'm personally fine either way (with or without dense Gaussian).
I was referring to the option to remove the hash-based variant. At the moment the two options are possible, but the hash variant seems useless in practice (too slow to be useful compared to the pre-allocated variant), and hence I was thinking of removing that part of the code to make it simpler.
@ogrisel +1 for removal if it seems useless in practice. Besides, it's not published work, right?
Nope, but it's just an alternative implementation of the sparse random projection: it's mathematically equivalent to materializing the random matrix in memory.
It could probably be sped up to be fast enough to be useful by rewriting it in Cython at some point, but it's low priority for me, so we should not hold back this old PR because of that. So @arjoly please feel free to remove it.
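For clarity on what "materializing the random matrix in memory" means here, a hedged sketch of the very sparse scheme of Li, Hastie and Church (2006), the reference paper of this PR: with s = 1/density, the entries are +/- sqrt(s)/sqrt(n_components) with probability 1/(2s) each and 0 otherwise, and the recommended density is 1/sqrt(n_features). The helper name and the dense Boolean mask are illustrative; a real implementation would sample the nonzero positions directly.

```python
import numpy as np
import scipy.sparse as sp

def sparse_random_matrix(n_components, n_features, density=None, random_state=0):
    """Materialized very sparse random matrix (Li, Hastie and Church, 2006)."""
    rng = np.random.RandomState(random_state)
    if density is None:
        density = 1.0 / np.sqrt(n_features)              # recommended density
    s = 1.0 / density
    mask = rng.rand(n_components, n_features) < density  # sparsity pattern
    signs = rng.choice([-1.0, 1.0], size=mask.sum())      # random signs
    data = signs * np.sqrt(s) / np.sqrt(n_components)
    rows, cols = np.nonzero(mask)
    return sp.csr_matrix((data, (rows, cols)), shape=(n_components, n_features))
```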
Ok, thanks for the clarification.
All right, I am going to open a new pull request shortly.
@ogrisel Can you explain why you use murmurhash3_32 instead of numpy.random?
It would have been possible to use …
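To illustrate why a hash function can stand in for numpy.random here, a hedged sketch of the implicit (hash-based) idea discussed above: each entry of the sparse random matrix is recomputed on demand from a hash of its coordinates, so nothing has to be stored and the result is reproducible without keeping per-column RNG state. `sklearn.utils.murmurhash3_32` is used for concreteness; the helper name and the exact hashing scheme are illustrative, not the PR's code.

```python
import numpy as np
from sklearn.utils import murmurhash3_32

def implicit_entry(i, j, n_components, density, seed=42):
    """Recompute entry (i, j) of the sparse random matrix on the fly.

    Same marginal distribution as the materialized version: with
    s = 1 / density, +/- sqrt(s) / sqrt(n_components) with probability
    1 / (2 s) each, and 0 otherwise."""
    h = murmurhash3_32("%d_%d" % (i, j), seed=seed, positive=True)
    u = h / 2.0 ** 32              # pseudo-uniform draw in [0, 1)
    if u >= density:
        return 0.0
    # reuse one hash bit for the sign (a real implementation might use a
    # second, independent hash)
    sign = 1.0 if (h & 1) else -1.0
    return sign * np.sqrt(1.0 / density) / np.sqrt(n_components)
```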
Early pull requests for sparse random projection
Main paper used as a reference for this PR:
Li, Hastie and Church, KDD 2006, Very Sparse Random Projections
TODO before merge
- empirical checks for johnson lindenstrauss in test suite (see the sketch below)
- example explaining johnson lindenstrauss embedding
- update manifold example to use RP module instead of random unitary
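As a companion to the Johnson-Lindenstrauss items above, a hedged sketch of the classical bound such checks would exercise: to preserve all pairwise squared distances of n samples within a (1 ± eps) factor, it suffices to project onto k >= 4 ln(n) / (eps^2/2 - eps^3/3) components, independently of the original dimensionality. The helper name is illustrative.

```python
import numpy as np

def jl_min_dim(n_samples, eps=0.1):
    """Classical Johnson-Lindenstrauss bound on the number of components
    needed to preserve pairwise distances within a (1 +/- eps) factor."""
    denom = eps ** 2 / 2.0 - eps ** 3 / 3.0
    return int(np.ceil(4.0 * np.log(n_samples) / denom))

# e.g. 10,000 points with at most 10% distortion:
print(jl_min_dim(10_000, eps=0.1))   # -> 7895 components, regardless of n_features
```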