[MRG] Random Projections #1438


Closed · wants to merge 114 commits from arjoly:random_projection into scikit-learn:master

Conversation

@arjoly (Member) commented Dec 3, 2012

Early pull request for random projection; I want to finish PR #372.

My goal is to have a workable implementation of Gaussian, Bernoulli and sparse random projection matrices.

Note that the `materialize=False` and random_dot features probably won't be implemented in this pull request.

@@ -69,6 +68,20 @@ def _assert_greater(a, b, msg=None):
assert_greater = _assert_greater


def assert_raise_message(exception, message, callable, *args, **kwargs):
Member:

I think there is already a way to do that. Could you have a look at the sparse tests in sklearn/tests/test_common.py?

Member:

I think sklearn/utils/testing.py is the right location for such helpers. If there is redundancy, it should be factorized here rather than in test_common.

Member:

No argument there. I thought it would be possible without a helper function, but what Gael did there is basically write down this function. So sorry for the noise, and +1 for including it :)
It would be nice to maybe also match substrings, as was done there.
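For illustration, here is a minimal sketch of what such a substring-matching helper could look like; the version that actually landed in sklearn/utils/testing.py may differ in details.

def assert_raise_message(exception, message, callable, *args, **kwargs):
    """Check that `callable` raises `exception` and that `message`
    appears as a substring of the raised error's message."""
    try:
        callable(*args, **kwargs)
    except exception as e:
        if message not in str(e):
            raise AssertionError(
                "Error message should include %r; got %r"
                % (message, str(e)))
    else:
        raise AssertionError("%s not raised" % exception.__name__)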

Member Author:

@amueller It is done.

Member:

thanks :)

@arjoly (Member Author) commented Dec 3, 2012

Keep a gist of random_dot before cleaning it (code and tests): https://gist.github.com/4194105

@ogrisel (Member) commented Dec 3, 2012

Note that the `materialize=False` and random_dot features probably won't be implemented in this pull request.

To clarify this sentence: the materialize=False and random_dot code currently in this branch will be removed, as it seems too slow to be useful. It is better to merge only the simple, fast-running features first and to implement the more experimental tricks from random_dot later, in another PR, if required.

@ogrisel (Member) commented Dec 3, 2012

Thanks for the gist. Good initiative.

@arjoly (Member Author) commented Dec 4, 2012

If the user explicitly asks to generate more random projections than there are features, I wouldn't deny them the right to do it.

Nevertheless, I can add a warning message.
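As a purely illustrative sketch of the proposed check (the helper name warn_if_not_reducing and the message wording are hypothetical, not the PR's actual code):

import warnings

def warn_if_not_reducing(n_components, n_features):
    # Hypothetical helper; only illustrates the proposed warning.
    if n_components > n_features:
        warnings.warn(
            "n_components (%d) is greater than n_features (%d): the "
            "random projection will not reduce dimensionality."
            % (n_components, n_features))

warn_if_not_reducing(110, 3)  # emits the warning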

@ogrisel (Member) commented Dec 4, 2012

Fair enough. A warning would indeed be useful.

@arjoly (Member Author) commented Dec 4, 2012

I have added a warning.

I would like to change components_ to projection_matrix_: it avoids confusion with n_components_ and n_components. Any thoughts about that?

@ogrisel (Member) commented Dec 4, 2012

I'd rather keep the components notation, as it's already used consistently throughout the code base for similar projection purposes (see PCA, NMF, ...).

@GaelVaroquaux (Member):

I'd rather keep the components notation, as it's already used consistently throughout the code base for similar projection purposes (see PCA, NMF, ...).

+1

@amueller (Member) commented Dec 4, 2012

On 12/04/2012 05:14 PM, Arnaud Joly wrote:

I have added a warning.

I would like to change `components_` to `projection_matrix_`. It avoids confusion with `n_components_` and `n_components`. Some thoughts about that?

I agree with Gael and Olivier.
Could you explain your concern a bit more?

Stuff that starts with `n_` is always an integer; stuff that ends with `_` is always estimated by the model. And `components_` is usually a (linear) projection.
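For illustration, the convention as it already appears in an existing estimator such as PCA:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(20, 5)

pca = PCA(n_components=2)      # n_* parameter: always an integer
pca.fit(X)
print(pca.components_.shape)   # (2, 5): the trailing underscore marks an
                               # attribute estimated by the model; here it
                               # is a (linear) projection matrix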

@arjoly (Member Author) commented Dec 5, 2012

To me, the word "component" evokes part of something bigger, not a projection matrix / linear transformation. Anyway, it is better to be consistent.

Thanks for the input!

@amueller (Member) commented Dec 5, 2012

On 12/05/2012 08:39 AM, Arnaud Joly wrote:

To me, the word "component" evokes part of something bigger, not a projection matrix / linear transformation. Anyway, it is better to be consistent.

Thanks for the input!

The word comes from Principal Component Analysis and Independent
Component Analysis.
It kind of spread from there ;)

@arjoly (Member Author) commented Dec 7, 2012

I am currently improving the generation of the sparse Bernoulli matrix.

Can I use modified code from core Python? Is the license compatible?

@amueller (Member) commented Dec 8, 2012

From core Python? What do you want to take? It looks like the Python license is compatible with BSD; you just have to leave in the copyright notice. Though having mixed licenses is always a bit of a hassle (then all files are under BSD except one file, which is under the PSF Python license).

@arjoly (Member Author) commented Dec 8, 2012

The one line that kills performance is the use of random.sample from core Python.

I have achieved a 2.6442x speed-up using a Cythonized and specialized version of random.sample which uses np.random.randint. Furthermore, they have two nice tests for this kind of function that I would happily adapt.

There might still be some room for improvement; I am a beginner with Cython.

edit: 2.6442 (and not 6; sorry, the line profiler incurred a large slowdown)
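For reference, here is a pure-Python sketch of the set-based selection loop from random.sample, specialized to draw candidate indices with np.random.randint; the function name is illustrative, and the actual Cython version in this PR differs.

import numpy as np

def sample_without_replacement(n_population, n_samples, random_state=None):
    # Sketch of the retry loop random.sample uses for small sample
    # sizes, with np.random.randint as the index source.
    rng = np.random.RandomState(random_state)
    selected = set()
    selected_add = selected.add
    while len(selected) < n_samples:
        j = rng.randint(0, n_population)  # uniform draw in [0, n_population)
        selected_add(j)                   # duplicates are simply retried
    return np.array(list(selected))

indices = sample_without_replacement(100000, 100, random_state=13)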

@amueller (Member) commented Dec 9, 2012

Can you say which line it is? Sorry, I couldn't find it; I hadn't studied the code before.

@amueller (Member) commented Dec 9, 2012

Ok, found it. Can this not be done with np.random.permutation?

@amueller (Member) commented Dec 9, 2012

It seems np.random.permutation is way slower than random.sample. Interesting.
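A quick, machine-dependent way to reproduce that observation (illustrative only; absolute timings will vary):

import random
import timeit

import numpy as np

# Draw 100 distinct indices out of 100000, the regime used in this PR.
t_sample = timeit.timeit(
    lambda: random.sample(range(100000), 100), number=100)
t_perm = timeit.timeit(
    lambda: np.random.permutation(100000)[:100], number=100)
print("random.sample: %.4fs, np.random.permutation: %.4fs"
      % (t_sample, t_perm))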

@glouppe (Contributor) commented Dec 9, 2012

Hmm, that could be interesting for speeding up the tree construction process. I'll give it a try.

@amueller (Member) commented Dec 9, 2012

About taking code from core Python, let's ask the elders: @ogrisel @GaelVaroquaux what do you think? ;)

@arjoly (Member Author) commented Dec 9, 2012

I can show you the code in a gist. However, I don't know whether it might break any license rules. If it doesn't, say so; I would happily share this code.

@ogrisel (Member) commented Dec 10, 2012

No problem with reusing code from Python as long as we clearly state the copyright / license / origin of the code section using inline comments. The Python license is a non-viral license very similar to the BSD license, so it's OK.
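For example, an inline attribution of roughly this form (the wording is illustrative):

# The following function is adapted from CPython's random.sample
# (Lib/random.py). Copyright (c) Python Software Foundation;
# distributed under the PSF license.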

@arjoly (Member Author) commented Dec 10, 2012

Thanks @ogrisel for the answer.

I ran the following benchmark.

Dataset statistics

n_samples = 1000
n_features = 100000
n_components = 5920 (auto)
n_elements = 100000000
n_nonzeros = 100 per feature
ratio_nonzeros = 0.001

Script arguments

Arguments fit
random_seed 13
density 0.333333333333
eps 0.1
ratio_nonzeros 0.001
n_components auto
sparse True
n_samples 1000
n_times 10
selected_transformers Bernouilli
n_features 100000

Performance for the Bernoulli sparse matrix:

Transformer                                  fit
Bernoulli sparse random matrix - original    3.9888s
Bernoulli sparse random matrix - cython      1.5082s

Performance is improved by a factor of 2.6442 with a custom integer sampler written in Cython (and not 6; sorry, the line profiler incurred a large slowdown for the Python version).
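As a side note, the n_components = 5920 (auto) value above follows from the Johnson-Lindenstrauss bound used for the automatic target dimension; a minimal sketch of the computation (jl_min_dim is an illustrative name, not the library's API):

import numpy as np

def jl_min_dim(n_samples, eps):
    # Johnson-Lindenstrauss bound:
    #   n_components >= 4 * log(n_samples) / (eps**2 / 2 - eps**3 / 3)
    denominator = (eps ** 2 / 2.0) - (eps ** 3 / 3.0)
    return int(4 * np.log(n_samples) / denominator)

print(jl_min_dim(n_samples=1000, eps=0.1))  # -> 5920, the "auto" value above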

@arjoly (Member Author) commented Dec 10, 2012

I got the following errors when I run make test:

======================================================================
ERROR: sklearn.tests.test_common.test_transformers_sparse_data
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ajoly/.virtualenv/sklearn/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ajoly/git/scikit-learn/sklearn/tests/test_common.py", line 254, in test_transformers_sparse_data
    raise exc
ValueError: eps=0.500000 and n_samples=40 lead to a target dimension of 177 which is larger than the original space with n_features=10

======================================================================
ERROR: sklearn.tests.test_common.test_estimators_nan_inf
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ajoly/.virtualenv/sklearn/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ajoly/git/scikit-learn/sklearn/tests/test_common.py", line 296, in test_estimators_nan_inf
    raise e
ValueError: eps=0.500000 and n_samples=10 lead to a target dimension of 110 which is larger than the original space with n_features=3

======================================================================
FAIL: sklearn.tests.test_common.test_transformers
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ajoly/.virtualenv/sklearn/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ajoly/git/scikit-learn/sklearn/tests/test_common.py", line 218, in test_transformers
    assert_true(succeeded)
AssertionError: False is not true
    'False is not true' = self._formatMessage('False is not true', "%s is not true" % safe_repr(False))
>>  raise self.failureException('False is not true')

The first two errors are due to the chosen default behavior. I don't understand the third one.

What do you advise?

@ogrisel (Member) commented Dec 11, 2012

I am not sure whether we should convert errors like ValueError: eps=0.500000 and n_samples=40 lead to a target dimension of 177 which is larger than the original space with n_features=10 into warnings instead.

Also, reading https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_common.py#L148, it seems that the last failure is a consequence of the same ValueError on the target dimension check.

@arjoly (Member Author) commented Dec 11, 2012

@ogrisel I agree with you, so I have tweaked test_common where necessary.

# The following lines of code are heavily inspired by core Python,
# more precisely by random.sample.
cdef set selected = set()       # typed set of already-drawn indices
selected_add = selected.add     # look up the bound method once, outside the loop
Member:

Is this kind of local-scope optimization required in Cython code?

Member Author:

Declaring selected as a set improves performance a little. The same goes for selected_add and rng_randint.

Member:

Thanks.

Contributor:

@ogrisel Declaring the type of the variables in Cython allows for static typing. This basically circumvents a lot of Python overhead.

Member:

I agree about the types; I was wondering about the selected_add optimization.

Member Author:

You are right about selected_add: in the annotated .html it shows a little overhead. I mistook this for a Python-level optimization.

@ogrisel (Member) commented Dec 21, 2012

LGTM, 👍 for merging.

@amueller (Member):

Ok, I'm doing it (manually).

@arjoly (Member Author) commented Dec 21, 2012

Thanks all!!! :-) It was fun to make this PR!

@amueller (Member):

Done. Thanks a lot!

@amueller amueller closed this Dec 21, 2012
@amueller (Member):

Great work @ogrisel and @arjoly 🍻

@arjoly (Member Author) commented Dec 21, 2012

#372 can be closed!!!

Great work!!! :-)

@ogrisel (Member) commented Dec 21, 2012

Prosit! 🍻

@arjoly (Member Author) commented Dec 21, 2012

It is strange.

Pull request closed — the arjoly:random_projection branch has unmerged commits.

@amueller (Member):

Sorry @arjoly, this is caused by me trying to keep master less messy. I rebased your branch again before pushing, and you didn't use the newest master. (Rebasing changes the hashes of the commits, so the commits now in master have different hashes than the ones in your PR.)

@amueller (Member):

Or maybe I messed up again... I'm not sure what is wrong with me today. Some commits do seem to be missing, indeed.

@amueller (Member):

For some reason my pr/1438 branch was not up to date with the PR. I have no idea why. I removed it and pulled again and it was still 10 days behind...

@amueller (Member):

Now all commits are in. I cloned the branch the old-fashioned way, directly from @arjoly...

@arjoly (Member Author) commented Dec 21, 2012

Thanks!!!

May I be added to the contributor list?

@amueller (Member):

You are in the whatsnew, right?
Do you mean the AUTHORS file? Or do you mean obtaining commit rights? Both are usually for people who have been contributing for some time.

Hm, there was an error about assert_allclose. It looked odd to me but I didn't say anything because it looked OK. Problem is, it's not in numpy 1.3. I fixed it just now.
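The usual fix for this kind of NumPy version gap is an import-time fallback in the testing utilities; a minimal sketch, assuming assert_array_almost_equal as the older substitute (the actual fix in the repo may differ):

try:
    from numpy.testing import assert_allclose
except ImportError:
    # Older NumPy (e.g. 1.3) has no assert_allclose; fall back to the
    # decimal-based check, ignoring rtol/atol.
    from numpy.testing import assert_array_almost_equal

    def assert_allclose(actual, desired, rtol=1e-7, atol=0, err_msg=''):
        assert_array_almost_equal(actual, desired, err_msg=err_msg)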

@arjoly (Member Author) commented Dec 21, 2012

Fine, it will be for a later time :-)

@amueller (Member):

Btw, I didn't really follow your work, as there is a lot going on and I struggle to keep up. You had several PRs already, right? So you might qualify for commit rights.

@arjoly (Member Author) commented Dec 21, 2012

I just wanted to be on the author list. Sorry for the allclose; I hadn't checked.
Yes, 3 PRs (two minor ones and this one).

@amueller (Member):

The author list needs some rewriting but that won't happen soon.
Also, I think that was supposed to be somewhat exclusive. But don't worry, if you keep up the good work, you'll definitely land there ;) And your contributions will be very visible in the next release via the "whatsnew".

@arjoly (Member Author) commented Dec 21, 2012

Fine, thanks :-)

@arjoly arjoly deleted the random_projection branch January 6, 2013 21:08
@coveralls commented:

Coverage Status: changes unknown when pulling a61645c on arjoly:random_projection into scikit-learn:master.
