
WIP: Sparse Random Projections #372


Closed · wants to merge 58 commits

Conversation

@ogrisel (Member) commented Oct 2, 2011

Early pull request for sparse random projection

Main paper used as a reference for this PR:

Li, Hastie and Church, KDD 2006, Very Sparse Random Projections

TODO before merge

  • write narrative doc
  • empirical checks for the Johnson-Lindenstrauss lemma in the test suite (a rough sketch follows this list)
  • example explaining the Johnson-Lindenstrauss embedding
  • update manifold example to use the RP module instead of a random unitary projection
  • example on dim-reduction (e.g. ANN search on faces data with a ball tree)
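
For the empirical Johnson-Lindenstrauss check, something along these lines could serve as a starting point (a minimal sketch using only NumPy/SciPy, with a dense Gaussian matrix standing in for the sparse projection; the eps, n_components and dataset sizes are placeholder values, not the PR's actual test):

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.RandomState(42)
    n_samples, n_features, n_components = 100, 10000, 1000
    eps = 0.5

    X = rng.randn(n_samples, n_features)

    # Dense Gaussian projection matrix used as a stand-in for the sparse one,
    # scaled so that squared norms are preserved in expectation.
    R = rng.randn(n_features, n_components) / np.sqrt(n_components)
    X_proj = X.dot(R)

    # All pairwise distances should be preserved within a (1 +/- eps) factor
    # with high probability when n_components is large enough (JL lemma).
    ratios = pdist(X_proj) / pdist(X)
    assert ratios.min() > 1 - eps and ratios.max() < 1 + eps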

Review thread on this diff excerpt:

    Projected array.

    """
    return X * self.components_.T
Member

safe_sparse_dot?

Member Author

self.components_ is always a CSR matrix so there is no need for safe_sparse_dot in that case.

Member

Indeed, I realized that afterwards. I don't think it's sparse if you use a Gaussian N(0, 1) for generating R, though.
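
To make that concrete, here is a rough illustration (assuming scipy.sparse and the existing sklearn.utils.extmath.safe_sparse_dot helper): with a scipy sparse components_ matrix, the * operator is a matrix product, but with a dense ndarray it would attempt elementwise broadcasting, so a dense Gaussian R would need safe_sparse_dot (or np.dot):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.utils.extmath import safe_sparse_dot

    X = np.random.randn(5, 20)
    components_sparse = sp.csr_matrix(np.random.randn(3, 20))
    components_dense = np.random.randn(3, 20)

    # Works: when one operand is a scipy sparse matrix, * is a matrix product.
    out_sparse = X * components_sparse.T                  # shape (5, 3)

    # X * components_dense.T would try to broadcast elementwise and fail;
    # safe_sparse_dot handles both the sparse and the dense case.
    out_dense = safe_sparse_dot(X, components_dense.T)    # shape (5, 3)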

@mblondel (Member) commented Oct 2, 2011

I wonder if it'd make sense to put the module in decomposition/?

Good job on bootstrapping this PR, @ogrisel!

@ogrisel (Member Author) commented Oct 2, 2011

I am not sure about moving it to decomposition: random projection is mostly data-independent. I think I should first finish the remaining bullet points on the TODO list and then discuss where to put this module.

"""Random Projection transformers

Random Projections are an efficient way to reduce the dimensionality
of the data by trading a controlled amout of accuracy (as additional
Member

typo: "amout" -> "amount".

@larsmans (Member) commented Oct 2, 2011

Shouldn't this be in the feature_selection submodule? Or in a new module dimensionality_reduction?

@ogrisel (Member Author) commented Oct 2, 2011

Thanks for catching the typos @larsmans, I will address them ASAP. As for the right package, I have no idea :) We will have to discuss this on the ML once the rest of this PR is merge-ready.

@GaelVaroquaux (Member) commented:

On Sat, Oct 01, 2011 at 05:18:03PM -0700, Olivier Grisel wrote:

TODO before merge

  • write narrative doc

Make sure that both in the docstring and in the narrative doc, you stress
the use case: the scikit is becoming richer and richer, and we want to
guide the users as much as possible.

Thanks for the pull request!

@mblondel (Member) commented:

I was skimming through "Fast and Accurate k-means for Large Datasets" (NIPS 2011) and what caught my attention is that they're using RP for fast approximate neighbor search (to avoid comparing each point against all K cluster centers). The paper was quite vague so I had a quick look at the code and apparently this is what they are doing:

  • project the centers to 1 dimension (i.e. onto the real line) via random projection
  • sort the centers
  • project the point to 1 dimension
  • find the 2 centers between which the point falls via binary search
    • if the point falls to the left or to the right of all centers, just pick the single center next to it
    • if the point falls between 2 centers, find the closer of the two using an exact search (i.e. in the original space)

That seems like a very aggressive approximation to me but we could try it during the sprint (a rough sketch follows the link below). Instead of considering just two centers, we can consider a window for more flexibility (once we know between which centers the point falls, we readily know the other nearby centers since the centers are sorted).

Code and paper: http://web.engr.oregonstate.edu/~shindler/kMeansCode/
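
A rough sketch of that procedure (illustrative only, not the paper's code; the data shapes and the choice of a Gaussian random direction are placeholders):

    import numpy as np

    rng = np.random.RandomState(0)
    n_centers, n_features = 50, 100
    centers = rng.randn(n_centers, n_features)
    point = rng.randn(n_features)

    # Random direction defining the 1-D projection.
    direction = rng.randn(n_features)

    proj_centers = centers.dot(direction)
    order = np.argsort(proj_centers)
    sorted_proj = proj_centers[order]

    proj_point = point.dot(direction)

    # Binary search for the point's position among the sorted projected centers.
    pos = np.searchsorted(sorted_proj, proj_point)

    if pos == 0:
        candidates = [order[0]]                    # left of all centers
    elif pos == n_centers:
        candidates = [order[-1]]                   # right of all centers
    else:
        candidates = [order[pos - 1], order[pos]]  # between two centers

    # Exact search restricted to the (at most two) candidate centers.
    dists = [np.linalg.norm(point - centers[i]) for i in candidates]
    closest_center = candidates[int(np.argmin(dists))]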

@ogrisel (Member Author) commented Nov 25, 2011

Interesting, but personally I would rather focus on the text extraction / random projection / hashing vectorizer part during the sprint.

@arjoly (Member) commented Nov 29, 2012

What is the state of this pull request?
Is there a plan to incorporate dense random projections such as Gaussian and binomial?

@ogrisel (Member Author) commented Nov 29, 2012

It's a bit stalled but I would like to revive it if I can find the time.

We could introduce dense random projections, but I don't see any real-life application as the projection matrix would be too big to fit in memory for real-life dimensionality reduction tasks.

Do you have some need in this area in particular?

@arjoly (Member) commented Nov 29, 2012

Yes, I am highly interested in random projections as a data transformation technique
(expansion and/or reduction). And in this context, dense Gaussian and binomial random
projections are the baseline algorithms.

Indeed, in my current project, the storage of the random projection matrix is not my bottleneck.

Edit: Note also that in compressed sensing, dense Gaussian and binomial random projections are well studied.
Edit 2: I am seriously considering helping you finish this pull request, if you accept.
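
For reference, a minimal sketch of those dense baselines using only NumPy (the function names and scaling choices here are illustrative, not part of the module's API):

    import numpy as np

    def gaussian_random_matrix(n_components, n_features, random_state=None):
        """Dense Gaussian projection matrix with entries drawn from N(0, 1/n_components)."""
        rng = np.random.RandomState(random_state)
        return rng.normal(0.0, 1.0 / np.sqrt(n_components),
                          size=(n_components, n_features))

    def binomial_random_matrix(n_components, n_features, random_state=None):
        """Dense +/-1 ("binomial" / Rademacher) projection matrix, scaled by 1/sqrt(n_components)."""
        rng = np.random.RandomState(random_state)
        signs = rng.binomial(1, 0.5, size=(n_components, n_features)) * 2 - 1
        return signs / np.sqrt(n_components)

    X = np.random.randn(10, 5000)
    R = gaussian_random_matrix(300, 5000, random_state=0)
    X_new = X.dot(R.T)   # shape (10, 300)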

@ogrisel (Member Author) commented Nov 29, 2012

@arjoly please feel free to branch it on your repo and issue a new pull request from here to add the dense Gaussian and binomial baselines, either in the same class or in a new class of the same module.

I was also thinking of dropping the hashing-based, implicit sparse random projection. While it allows simulating the sparse RP without materializing the random matrix in memory, I think it's too CPU-intensive to be competitive in real-world applications.

@ogrisel (Member Author) commented Nov 29, 2012

Maybe @mblondel has an opinion on the latter.

@ogrisel (Member Author) commented Nov 29, 2012

BTW, I am pretty proud of the narrative documentation and the JL-lemma-related plots.

@mblondel (Member) commented:

I'm personally fine either way (with or without dense Gaussian).

@ogrisel (Member Author) commented Nov 29, 2012

I was referring to the option to remove the random_dot function that implements the sparse random projection using murmurhash on the fly instead of pre-allocating a sparse random matrix in memory.

At the moment both options are possible, but the hash variant seems useless in practice (too slow to be useful compared to the pre-allocated variant), hence I was thinking of removing that part of the code to make it simpler.
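
For readers unfamiliar with that part of the code, the idea behind the hashing variant is roughly the following (an illustrative sketch, not the PR's random_dot implementation; it shows the special case of one non-zero per input feature and omits the density/scaling constants of the full scheme):

    import numpy as np
    from sklearn.utils import murmurhash3_32

    def implicit_sparse_projection(X, n_components, seed=0):
        """Project X without ever materializing the random matrix in memory."""
        n_samples, n_features = X.shape
        out = np.zeros((n_samples, n_components))
        for j in range(n_features):
            # Hash the original feature index to pick a target component and a sign.
            h = murmurhash3_32(j, seed=seed, positive=True)
            target = h % n_components
            sign = 1.0 if (h // n_components) % 2 == 0 else -1.0
            out[:, target] += sign * X[:, j]
        return out

    X = np.random.randn(10, 5000)
    X_small = implicit_sparse_projection(X, n_components=256)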

@mblondel (Member) commented:

@ogrisel +1 for removal if it seems useless in practice. Besides, it's not published work, right?

@ogrisel (Member Author) commented Nov 29, 2012

Nope, but it's just an alternative implementation of the sparse random projection: it's mathematically equivalent to materializing the random matrix in memory.

@ogrisel (Member Author) commented Nov 29, 2012

It could probably be sped up enough to be useful by rewriting it in Cython at some point, but it's low priority for me, so we should not hold back this old PR because of that. So @arjoly, please feel free to remove the random_dot function and the materialize=False option in your own branch.

@mblondel (Member) commented:

Ok, thanks for the clarification.

@arjoly (Member) commented Nov 29, 2012

All right, I am going to open a new pull request shortly.

@arjoly (Member) commented Nov 30, 2012

@ogrisel Can you explain why you use murmurhash3_32 instead of numpy.random?

@ogrisel (Member Author) commented Nov 30, 2012

> @ogrisel Can you explain why you use murmurhash3_32 instead of numpy.random?

It would have been possible to use numpy.random.RandomState(seed=original_feature_idx).randint(n_target_features - 1) but it's probably a lot more overhead than a single call to murmurhash3_32 + a modulo operation.
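
To illustrate the overhead argument (assuming sklearn's murmurhash3_32 helper; the index and size values are placeholders), both snippets below map an original feature index to a target index deterministically, but the first has to build and seed a full Mersenne Twister state on every call while the second is a single hash plus a modulo:

    import numpy as np
    from sklearn.utils import murmurhash3_32

    n_target_features = 1024
    original_feature_idx = 42

    # Option 1: seed a fresh RandomState per original feature (heavy).
    target_a = np.random.RandomState(seed=original_feature_idx).randint(n_target_features)

    # Option 2: a single 32-bit hash plus a modulo (cheap, stateless).
    target_b = murmurhash3_32(original_feature_idx, positive=True) % n_target_features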

@arjoly mentioned this pull request Dec 3, 2012
@amueller (Member) commented:

Closed by @arjoly and @ogrisel in #1438. Thanks folks.

@amueller closed this Dec 21, 2012