[MRG] Random Projections #1438


Closed · wants to merge 114 commits from arjoly:random_projection into scikit-learn:master

Conversation

@arjoly (Member) commented Dec 3, 2012

Early pull request for random projection; I want to finish PR #372.

My goal is to have a workable implementation of Gaussian, Bernoulli and sparse random projection matrices.

Note that the `materialize=False` and random_dot features probably won't be implemented in this pull request.

@@ -69,6 +68,20 @@ def _assert_greater(a, b, msg=None):
assert_greater = _assert_greater


def assert_raise_message(exception, message, callable, *args, **kwargs):
Member:

I think there is already a way to do that. Could you have a look at the sparse tests in sklearn/tests/test_common.py?

Member:

I think sklearn/utils/testing.py is the right location for such helpers. If there is redundancy, it should be factorized here rather than in test_common.

Member:

No argument there. I thought it would be possible without a helper function, but what Gael did there is basically write down this function. So sorry for the noise, and +1 for including it :)
It would be nice to maybe also match substrings, as was done there.
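For illustration, here is a minimal sketch of what such a substring-matching helper could look like; the version that actually landed in sklearn/utils/testing.py may differ in details.

def assert_raise_message(exception, message, callable, *args, **kwargs):
    """Check that `callable` raises `exception` and that `message`
    appears as a substring of the raised error's message."""
    try:
        callable(*args, **kwargs)
    except exception as e:
        if message not in str(e):
            raise AssertionError(
                "Error message should include %r; got %r"
                % (message, str(e)))
    else:
        raise AssertionError("%s not raised" % exception.__name__)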

Member Author:

@amueller It is done.

Member:

thanks :)

@arjoly (Member Author) commented Dec 3, 2012

Keep a gist of random_dot before cleaning it (code and tests): https://gist.github.com/4194105

@ogrisel (Member) commented Dec 3, 2012

Note that the `materialize=False` and random_dot features probably won't be implemented in this pull request.

To clarify this sentence: the materialize=False and random_dot code currently in this branch will be removed, as it seems too slow to be useful. It is better to merge only the simple, fast-running features first and to implement the more experimental tricks from random_dot later, in another PR, if required.

@ogrisel (Member) commented Dec 3, 2012

Thanks for the gist. Good initiative.

@arjoly (Member Author) commented Dec 4, 2012

If the user explicitly asks to generate more random projections than there are features, I wouldn't deny them the right to do it.

Nevertheless, I can add a warning message.
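As a purely illustrative sketch of the proposed check (the helper name warn_if_not_reducing and the message wording are hypothetical, not the PR's actual code):

import warnings

def warn_if_not_reducing(n_components, n_features):
    # Hypothetical helper; only illustrates the proposed warning.
    if n_components > n_features:
        warnings.warn(
            "n_components (%d) is greater than n_features (%d): the "
            "random projection will not reduce dimensionality."
            % (n_components, n_features))

warn_if_not_reducing(110, 3)  # emits the warning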

@ogrisel (Member) commented Dec 4, 2012

Fair enough. A warning would indeed be useful.

@arjoly (Member Author) commented Dec 4, 2012

I have added a warning.

I would like to change components_ to projection_matrix_: it avoids confusion with n_components_ and n_components. Any thoughts about that?

@ogrisel (Member) commented Dec 4, 2012

I'd rather keep the components notation, as it's already used consistently throughout the code base for similar projection purposes (see PCA, NMF, ...).

@GaelVaroquaux (Member):

I'd rather keep the components notation, as it's already used consistently throughout the code base for similar projection purposes (see PCA, NMF, ...).

+1

@amueller (Member) commented Dec 4, 2012

On 12/04/2012 05:14 PM, Arnaud Joly wrote:

I have added a warning.

I would like to change `components_` to `projection_matrix_`. It avoids confusion with `n_components_` and `n_components`. Some thoughts about that?

I agree with Gael and Olivier.
Could you explain your concern a bit more?

Stuff that starts with `n_` is always an integer; stuff that ends with `_` is always estimated by the model. And `components_` is usually a (linear) projection.
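For illustration, the convention as it already appears in an existing estimator such as PCA:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(20, 5)

pca = PCA(n_components=2)      # n_* parameter: always an integer
pca.fit(X)
print(pca.components_.shape)   # (2, 5): the trailing underscore marks an
                               # attribute estimated by the model; here it
                               # is a (linear) projection matrix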

@arjoly (Member Author) commented Dec 5, 2012

To me, the word "component" evokes part of something bigger, not a projection matrix / linear transformation. Anyway, it is better to be consistent.

Thanks for the input!

@amueller (Member) commented Dec 5, 2012

On 12/05/2012 08:39 AM, Arnaud Joly wrote:

To me, the word "component" evokes part of something bigger, not a projection matrix / linear transformation. Anyway, it is better to be consistent.

Thanks for the input!

The word comes from Principal Component Analysis and Independent
Component Analysis.
It kind of spread from there ;)

@arjoly (Member Author) commented Dec 7, 2012

I am currently improving the generation of the sparse Bernoulli matrix.

Can I use modified code from core Python? Is the license compatible?

@amueller (Member) commented Dec 8, 2012

From core Python? What do you want to take? It looks like the Python license is compatible with BSD; you just have to leave in the copyright notice. Though having mixed licenses is always a bit of a hassle (then all files are under BSD except one file, which is under the PSF Python license).

@arjoly (Member Author) commented Dec 8, 2012

The one line that kills performance is the use of random.sample from core Python.

I have achieved a 2.6442x speed-up using a Cythonized and specialized version of random.sample which uses np.random.randint. Furthermore, they have two nice tests for this kind of function that I would happily adapt.

There might still be some room for improvement; I am a beginner with Cython.

edit: 2.6442 (and not 6; sorry, the line profiler incurred a large slowdown)
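For reference, here is a pure-Python sketch of the set-based selection loop from random.sample, specialized to draw candidate indices with np.random.randint; the function name is illustrative, and the actual Cython version in this PR differs.

import numpy as np

def sample_without_replacement(n_population, n_samples, random_state=None):
    # Sketch of the retry loop random.sample uses for small sample
    # sizes, with np.random.randint as the index source.
    rng = np.random.RandomState(random_state)
    selected = set()
    selected_add = selected.add
    while len(selected) < n_samples:
        j = rng.randint(0, n_population)  # uniform draw in [0, n_population)
        selected_add(j)                   # duplicates are simply retried
    return np.array(list(selected))

indices = sample_without_replacement(100000, 100, random_state=13)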

@amueller (Member) commented Dec 9, 2012

Can you say which line it is? Sorry, I couldn't find it; I hadn't studied the code before.

@amueller (Member) commented Dec 9, 2012

Ok, found it. Can this not be done with np.random.permutation?

@amueller (Member) commented Dec 9, 2012

It seems np.random.permutation is way slower than random.sample. Interesting.
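A quick, machine-dependent way to reproduce that observation (illustrative only; absolute timings will vary):

import random
import timeit

import numpy as np

# Draw 100 distinct indices out of 100000, the regime used in this PR.
t_sample = timeit.timeit(
    lambda: random.sample(range(100000), 100), number=100)
t_perm = timeit.timeit(
    lambda: np.random.permutation(100000)[:100], number=100)
print("random.sample: %.4fs, np.random.permutation: %.4fs"
      % (t_sample, t_perm))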

@glouppe (Contributor) commented Dec 9, 2012

Hmm, that could be interesting for speeding up the tree construction process. I'll give it a try.

@amueller (Member) commented Dec 9, 2012

About taking code from core Python, let's ask the elders: @ogrisel @GaelVaroquaux what do you think? ;)

@arjoly (Member Author) commented Dec 9, 2012

I can show you the code in a gist. However, I don't know whether it might break any license rules. If it doesn't, say so; I would happily share this code.

@ogrisel (Member) commented Dec 10, 2012

No problem with reusing code from Python as long as we clearly state the copyright / license / origin of the code section using inline comments. The Python license is a non-viral license very similar to the BSD license, so it's OK.
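For example, an inline attribution of roughly this form (the wording is illustrative):

# The following function is adapted from CPython's random.sample
# (Lib/random.py). Copyright (c) Python Software Foundation;
# distributed under the PSF license.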

@arjoly (Member Author) commented Dec 10, 2012

Thanks @ogrisel for the answer.

I ran the following benchmark.

Dataset statistics

n_samples = 1000
n_features = 100000
n_components = 5920 (auto)
n_elements = 100000000
n_nonzeros = 100 per feature
ratio_nonzeros = 0.001

Script arguments

Arguments fit
random_seed 13
density 0.333333333333
eps 0.1
ratio_nonzeros 0.001
n_components auto
sparse True
n_samples 1000
n_times 10
selected_transformers Bernouilli
n_features 100000

Performance for the Bernoulli sparse matrix:

Transformer                                  fit
Bernoulli sparse random matrix - original    3.9888s
Bernoulli sparse random matrix - cython      1.5082s

Performance is improved by a factor of 2.6442 with a custom integer sampler written in Cython (and not 6; sorry, the line profiler incurred a large slowdown for the Python version).
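As a side note, the n_components = 5920 (auto) value above follows from the Johnson-Lindenstrauss bound used for the automatic target dimension; a minimal sketch of the computation (jl_min_dim is an illustrative name, not the library's API):

import numpy as np

def jl_min_dim(n_samples, eps):
    # Johnson-Lindenstrauss bound:
    #   n_components >= 4 * log(n_samples) / (eps**2 / 2 - eps**3 / 3)
    denominator = (eps ** 2 / 2.0) - (eps ** 3 / 3.0)
    return int(4 * np.log(n_samples) / denominator)

print(jl_min_dim(n_samples=1000, eps=0.1))  # -> 5920, the "auto" value above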

@arjoly (Member Author) commented Dec 10, 2012

I got the following errors when I run make test:

======================================================================
ERROR: sklearn.tests.test_common.test_transformers_sparse_data
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ajoly/.virtualenv/sklearn/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ajoly/git/scikit-learn/sklearn/tests/test_common.py", line 254, in test_transformers_sparse_data
    raise exc
ValueError: eps=0.500000 and n_samples=40 lead to a target dimension of 177 which is larger than the original space with n_features=10

======================================================================
ERROR: sklearn.tests.test_common.test_estimators_nan_inf
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ajoly/.virtualenv/sklearn/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ajoly/git/scikit-learn/sklearn/tests/test_common.py", line 296, in test_estimators_nan_inf
    raise e
ValueError: eps=0.500000 and n_samples=10 lead to a target dimension of 110 which is larger than the original space with n_features=3

======================================================================
FAIL: sklearn.tests.test_common.test_transformers
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ajoly/.virtualenv/sklearn/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ajoly/git/scikit-learn/sklearn/tests/test_common.py", line 218, in test_transformers
    assert_true(succeeded)
AssertionError: False is not true
    'False is not true' = self._formatMessage('False is not true', "%s is not true" % safe_repr(False))
>>  raise self.failureException('False is not true')

The first two errors are due to the chosen default behavior. I don't understand the third one.

What do you advise?

@ogrisel (Member) commented Dec 11, 2012

I am not sure whether we should convert errors like ValueError: eps=0.500000 and n_samples=40 lead to a target dimension of 177 which is larger than the original space with n_features=10 into warnings instead.

Also, reading https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_common.py#L148, it seems that the last failure is a consequence of the same ValueError on the target dimension check.

@arjoly (Member Author) commented Dec 11, 2012

@ogrisel I agree with you, so I have tweaked test_common where necessary.

# The following lines of code are heavily inspired by core Python,
# more precisely by random.sample.
cdef set selected = set()       # typed set of already-drawn indices
selected_add = selected.add     # look up the bound method once, outside the loop
Member:

Is this kind of local-scope optimization required in Cython code?

Member Author:

Declaring selected as a set improves performance a little. The same goes for selected_add and rng_randint.

Member:

Thanks.

Contributor:

@ogrisel Declaring the type of the variables in Cython allows for static typing. This basically circumvents a lot of Python overhead.

Member:

I agree about the types; I was wondering about the selected_add optimization.

Member Author:

You are right about selected_add: in the annotated .html it shows a little overhead. I mistook this for a Python-level optimization.

@ogrisel (Member) commented Dec 21, 2012

LGTM, 👍 for merging.

@amueller (Member):

Ok, I'm doing it (manually).

@arjoly (Member Author) commented Dec 21, 2012

Thanks all!!! :-) It was fun to make this PR!

@amueller (Member):

Done. Thanks a lot!

@amueller amueller closed this Dec 21, 2012
@amueller (Member):

Great work @ogrisel and @arjoly 🍻

@arjoly (Member Author) commented Dec 21, 2012

#372 can be closed!!!

Great work!!! :-)

@ogrisel (Member) commented Dec 21, 2012

Prosit! 🍻

@arjoly (Member Author) commented Dec 21, 2012

It is strange.

Pull request closed — the arjoly:random_projection branch has unmerged commits.

@amueller (Member):

Sorry @arjoly, this is caused by me trying to keep master less messy. I rebased your branch again before pushing, and you didn't use the newest master. (Rebasing changes the hashes of the commits, so the commits now in master have different hashes than the ones in your PR.)

@amueller (Member):

Or maybe I messed up again... I'm not sure what is wrong with me today. Some commits do seem to be missing, indeed.

@amueller (Member):

For some reason my pr/1438 branch was not up to date with the PR. I have no idea why. I removed it and pulled again and it was still 10 days behind...

@amueller (Member):

Now all commits are in. I cloned the branch the old-fashioned way, directly from @arjoly...

@arjoly (Member Author) commented Dec 21, 2012

Thanks!!!

May I be added to the contributor list?

@amueller (Member):

You are in the whatsnew, right?
Do you mean the AUTHORS file? Or do you mean obtaining commit rights? Both are usually for people who have been contributing for some time.

Hm, there was an error about assert_allclose. It looked odd to me but I didn't say anything because it looked OK. Problem is, it's not in numpy 1.3. I fixed it just now.
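The usual fix for this kind of NumPy version gap is an import-time fallback in the testing utilities; a minimal sketch, assuming assert_array_almost_equal as the older substitute (the actual fix in the repo may differ):

try:
    from numpy.testing import assert_allclose
except ImportError:
    # Older NumPy (e.g. 1.3) has no assert_allclose; fall back to the
    # decimal-based check, ignoring rtol/atol.
    from numpy.testing import assert_array_almost_equal

    def assert_allclose(actual, desired, rtol=1e-7, atol=0, err_msg=''):
        assert_array_almost_equal(actual, desired, err_msg=err_msg)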

@arjoly (Member Author) commented Dec 21, 2012

Fine, it will be for a later time :-)

@amueller (Member):

Btw, I didn't really follow your work, as there is a lot going on and I struggle to keep up. You had several PRs already, right? So you might qualify for commit rights.

@arjoly (Member Author) commented Dec 21, 2012

I just wanted to be on the author list. Sorry for the allclose; I hadn't checked.
Yes, 3 PRs (two minor ones and this one).

@amueller (Member):

The author list needs some rewriting but that won't happen soon.
Also, I think that was supposed to be somewhat exclusive. But don't worry, if you keep up the good work, you'll definitely land there ;) And your contributions will be very visible in the next release via the "whatsnew".

@arjoly (Member Author) commented Dec 21, 2012

Fine, thanks :-)

@arjoly arjoly deleted the random_projection branch January 6, 2013 21:08
@coveralls commented:

Coverage Status: changes unknown when pulling a61645c on arjoly:random_projection into scikit-learn:master.
