
WIP: Sparse Random Projections #372


Closed · wants to merge 58 commits

Conversation

@ogrisel (Member) commented Oct 2, 2011

Early pull request for sparse random projection

Main paper used as a reference for this PR:

Li, Hastie and Church, KDD 2006, Very Sparse Random Projections

TODO before merge

  • write narrative doc
  • empirical checks for the Johnson-Lindenstrauss lemma in the test suite (a rough sketch follows this list)
  • example explaining the Johnson-Lindenstrauss embedding
  • update manifold example to use the RP module instead of a random unitary projection
  • example on dim-reduction (e.g. ANN search on faces data with a ball tree)
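
For the empirical Johnson-Lindenstrauss check, something along these lines could serve as a starting point (a minimal sketch using only NumPy/SciPy, with a dense Gaussian matrix standing in for the sparse projection; the eps, n_components and dataset sizes are placeholder values, not the PR's actual test):

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.RandomState(42)
    n_samples, n_features, n_components = 100, 10000, 1000
    eps = 0.5

    X = rng.randn(n_samples, n_features)

    # Dense Gaussian projection matrix used as a stand-in for the sparse one,
    # scaled so that squared norms are preserved in expectation.
    R = rng.randn(n_features, n_components) / np.sqrt(n_components)
    X_proj = X.dot(R)

    # All pairwise distances should be preserved within a (1 +/- eps) factor
    # with high probability when n_components is large enough (JL lemma).
    ratios = pdist(X_proj) / pdist(X)
    assert ratios.min() > 1 - eps and ratios.max() < 1 + eps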

Review thread on this diff excerpt:

    Projected array.

    """
    return X * self.components_.T
Member

safe_sparse_dot?

Member Author

self.components_ is always a CSR matrix so there is no need for safe_sparse_dot in that case.

Member

Indeed, I realized that afterwards. I don't think it's sparse if you use a Gaussian N(0, 1) for generating R, though.
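
To make that concrete, here is a rough illustration (assuming scipy.sparse and the existing sklearn.utils.extmath.safe_sparse_dot helper): with a scipy sparse components_ matrix, the * operator is a matrix product, but with a dense ndarray it would attempt elementwise broadcasting, so a dense Gaussian R would need safe_sparse_dot (or np.dot):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.utils.extmath import safe_sparse_dot

    X = np.random.randn(5, 20)
    components_sparse = sp.csr_matrix(np.random.randn(3, 20))
    components_dense = np.random.randn(3, 20)

    # Works: when one operand is a scipy sparse matrix, * is a matrix product.
    out_sparse = X * components_sparse.T                  # shape (5, 3)

    # X * components_dense.T would try to broadcast elementwise and fail;
    # safe_sparse_dot handles both the sparse and the dense case.
    out_dense = safe_sparse_dot(X, components_dense.T)    # shape (5, 3)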

@mblondel (Member) commented Oct 2, 2011

I wonder if it'd make sense to put the module in decomposition/?

Good job on bootstrapping this PR, @ogrisel!

@ogrisel (Member Author) commented Oct 2, 2011

I am not sure about moving it to decomposition: random projection is mostly data-independent. I think I should first finish the remaining bullet points on the TODO list and then discuss where to put this module.

"""Random Projection transformers

Random Projections are an efficient way to reduce the dimensionality
of the data by trading a controlled amout of accuracy (as additional
Member

typo: "amout" -> "amount".

@larsmans (Member) commented Oct 2, 2011

Shouldn't this be in the feature_selection submodule? Or in a new module dimensionality_reduction?

@ogrisel (Member Author) commented Oct 2, 2011

Thanks for catching the typos @larsmans, I will address them ASAP. As for the right package, I have no idea :) We will have to discuss this on the ML once the rest of this PR is merge-ready.

@GaelVaroquaux (Member) commented:

On Sat, Oct 01, 2011 at 05:18:03PM -0700, Olivier Grisel wrote:

TODO before merge

  • write narrative doc

Make sure that both in the docstring and in the narrative doc, you stress
the use case: the scikit is becoming richer and richer, and we want to
guide the users as much as possible.

Thanks for the pull request!

@mblondel (Member) commented:

I was skimming through "Fast and Accurate k-means for Large Datasets" (NIPS 2011) and what caught my attention is that they're using RP for fast approximate neighbor search (to avoid comparing each point against all K cluster centers). The paper was quite vague so I had a quick look at the code and apparently this is what they are doing:

  • project the centers to 1 dimension (i.e. onto the real line) via random projection
  • sort the centers
  • project the point to 1 dimension
  • find the 2 centers between which the point falls via binary search
    • if the point falls to the left or to the right of all centers, just pick the single center next to it
    • if the point falls between 2 centers, find the closer of the two using an exact search (i.e. in the original space)

That seems like a very aggressive approximation to me but we could try it during the sprint (a rough sketch follows the link below). Instead of considering just two centers, we can consider a window for more flexibility (once we know between which centers the point falls, we readily know the other nearby centers since the centers are sorted).

Code and paper: http://web.engr.oregonstate.edu/~shindler/kMeansCode/
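
A rough sketch of that procedure (illustrative only, not the paper's code; the data shapes and the choice of a Gaussian random direction are placeholders):

    import numpy as np

    rng = np.random.RandomState(0)
    n_centers, n_features = 50, 100
    centers = rng.randn(n_centers, n_features)
    point = rng.randn(n_features)

    # Random direction defining the 1-D projection.
    direction = rng.randn(n_features)

    proj_centers = centers.dot(direction)
    order = np.argsort(proj_centers)
    sorted_proj = proj_centers[order]

    proj_point = point.dot(direction)

    # Binary search for the point's position among the sorted projected centers.
    pos = np.searchsorted(sorted_proj, proj_point)

    if pos == 0:
        candidates = [order[0]]                    # left of all centers
    elif pos == n_centers:
        candidates = [order[-1]]                   # right of all centers
    else:
        candidates = [order[pos - 1], order[pos]]  # between two centers

    # Exact search restricted to the (at most two) candidate centers.
    dists = [np.linalg.norm(point - centers[i]) for i in candidates]
    closest_center = candidates[int(np.argmin(dists))]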

@ogrisel (Member Author) commented Nov 25, 2011

Interesting, but personally I would rather focus on the text extraction / random projection / hashing vectorizer part during the sprint.

@arjoly (Member) commented Nov 29, 2012

What is the state of this pull request?
Is there a plan to incorporate dense random projections such as Gaussian and binomial?

@ogrisel (Member Author) commented Nov 29, 2012

It's a bit stalled but I would like to revive it if I can find the time.

We could introduce dense random projections, but I don't see any real-life application as the projection matrix would be too big to fit in memory for real-life dimensionality reduction tasks.

Do you have some need in this area in particular?

@arjoly (Member) commented Nov 29, 2012

Yes, I am highly interested in random projections as a data transformation technique
(expansion and/or reduction). And in this context, dense Gaussian and binomial random
projections are the baseline algorithms.

Indeed, in my current project, the storage of the random projection matrix is not my bottleneck.

Edit: Note also that in compressed sensing, dense Gaussian and binomial random projections are well studied.
Edit 2: I am seriously considering helping you finish this pull request, if you accept.
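
For reference, a minimal sketch of those dense baselines using only NumPy (the function names and scaling choices here are illustrative, not part of the module's API):

    import numpy as np

    def gaussian_random_matrix(n_components, n_features, random_state=None):
        """Dense Gaussian projection matrix with entries drawn from N(0, 1/n_components)."""
        rng = np.random.RandomState(random_state)
        return rng.normal(0.0, 1.0 / np.sqrt(n_components),
                          size=(n_components, n_features))

    def binomial_random_matrix(n_components, n_features, random_state=None):
        """Dense +/-1 ("binomial" / Rademacher) projection matrix, scaled by 1/sqrt(n_components)."""
        rng = np.random.RandomState(random_state)
        signs = rng.binomial(1, 0.5, size=(n_components, n_features)) * 2 - 1
        return signs / np.sqrt(n_components)

    X = np.random.randn(10, 5000)
    R = gaussian_random_matrix(300, 5000, random_state=0)
    X_new = X.dot(R.T)   # shape (10, 300)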

@ogrisel (Member Author) commented Nov 29, 2012

@arjoly please feel free to branch it on your repo and issue a new pull request from here to add the dense Gaussian and binomial baselines, either in the same class or in a new class of the same module.

I was also thinking of dropping the hashing-based, implicit sparse random projection. While it allows simulating the sparse RP without materializing the random matrix in memory, I think it's too CPU-intensive to be competitive in real-world applications.

@ogrisel (Member Author) commented Nov 29, 2012

Maybe @mblondel has an opinion on the latter.

@ogrisel (Member Author) commented Nov 29, 2012

BTW, I am pretty proud of the narrative documentation and the JL-lemma-related plots.

@mblondel (Member) commented:

I'm personally fine either way (with or without dense Gaussian).

@ogrisel (Member Author) commented Nov 29, 2012

I was referring to the option to remove the random_dot function that implements the sparse random projection using murmurhash on the fly instead of pre-allocating a sparse random matrix in memory.

At the moment both options are possible, but the hash variant seems useless in practice (too slow to be useful compared to the pre-allocated variant), hence I was thinking of removing that part of the code to make it simpler.
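
For readers unfamiliar with that part of the code, the idea behind the hashing variant is roughly the following (an illustrative sketch, not the PR's random_dot implementation; it shows the special case of one non-zero per input feature and omits the density/scaling constants of the full scheme):

    import numpy as np
    from sklearn.utils import murmurhash3_32

    def implicit_sparse_projection(X, n_components, seed=0):
        """Project X without ever materializing the random matrix in memory."""
        n_samples, n_features = X.shape
        out = np.zeros((n_samples, n_components))
        for j in range(n_features):
            # Hash the original feature index to pick a target component and a sign.
            h = murmurhash3_32(j, seed=seed, positive=True)
            target = h % n_components
            sign = 1.0 if (h // n_components) % 2 == 0 else -1.0
            out[:, target] += sign * X[:, j]
        return out

    X = np.random.randn(10, 5000)
    X_small = implicit_sparse_projection(X, n_components=256)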

@mblondel (Member) commented:

@ogrisel +1 for removal if it seems useless in practice. Besides, it's not published work, right?

@ogrisel (Member Author) commented Nov 29, 2012

Nope, but it's just an alternative implementation of the sparse random projection: it's mathematically equivalent to materializing the random matrix in memory.

@ogrisel (Member Author) commented Nov 29, 2012

It could probably be sped up enough to be useful by rewriting it in Cython at some point, but it's low priority for me, so we should not hold back this old PR because of that. So @arjoly, please feel free to remove the random_dot function and the materialize=False option in your own branch.

@mblondel (Member) commented:

Ok, thanks for the clarification.

@arjoly (Member) commented Nov 29, 2012

All right, I am going to open a new pull request shortly.

@arjoly (Member) commented Nov 30, 2012

@ogrisel Can you explain why you use murmurhash3_32 instead of numpy.random?

@ogrisel (Member Author) commented Nov 30, 2012

> @ogrisel Can you explain why you use murmurhash3_32 instead of numpy.random?

It would have been possible to use numpy.random.RandomState(seed=original_feature_idx).randint(n_target_features - 1) but it's probably a lot more overhead than a single call to murmurhash3_32 + a modulo operation.
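
To illustrate the overhead argument (assuming sklearn's murmurhash3_32 helper; the index and size values are placeholders), both snippets below map an original feature index to a target index deterministically, but the first has to build and seed a full Mersenne Twister state on every call while the second is a single hash plus a modulo:

    import numpy as np
    from sklearn.utils import murmurhash3_32

    n_target_features = 1024
    original_feature_idx = 42

    # Option 1: seed a fresh RandomState per original feature (heavy).
    target_a = np.random.RandomState(seed=original_feature_idx).randint(n_target_features)

    # Option 2: a single 32-bit hash plus a modulo (cheap, stateless).
    target_b = murmurhash3_32(original_feature_idx, positive=True) % n_target_features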

@arjoly mentioned this pull request Dec 3, 2012
@amueller (Member) commented:

Closed by @arjoly and @ogrisel in #1438. Thanks folks.

@amueller closed this Dec 21, 2012