[MRG] Random Projections #1438
Conversation
@@ -69,6 +68,20 @@ def _assert_greater(a, b, msg=None):
assert_greater = _assert_greater


def assert_raise_message(exception, message, callable, *args, **kwargs):
I think there is already a way to do that. Could you have a look at the sparse tests in sklearn/tests/test_common.py?
I think sklearn/utils/testing.py is the right location for such helpers. If there is redundancy, it should be factorized here rather than in test_common, I think.
No argument there.
I thought it would be possible without a helper function. But what Gael did there is basically write down this function.
So sorry for the noise and +1 for including it :)
It would be nice to maybe also match substrings, as was done there.
@amueller It is done.
thanks :)
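The helper under discussion can be sketched roughly as follows. This is a minimal illustration, not the exact code that landed in sklearn/utils/testing.py (whose signature may differ); it does implement the substring matching suggested above:

```python
def assert_raise_message(exception, message, callable_obj, *args, **kwargs):
    """Check that callable_obj(*args, **kwargs) raises `exception` and that
    `message` occurs as a substring of the raised error's message."""
    try:
        callable_obj(*args, **kwargs)
    except exception as error:
        if message not in str(error):
            raise AssertionError(
                "Error message should contain %r, got %r"
                % (message, str(error)))
    else:
        raise AssertionError("%s was not raised" % exception.__name__)
```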
Keep a gist of random_dot before cleaning it (code and tests): https://gist.github.com/4194105
To clarify this sentence, the currently implemented
Thanks for the gist. Good initiative.
If the user asks to generate more random projections than there are features, I wouldn't deny them the right to do it. Nevertheless, I can add a warning message.
Fair enough. Maybe a warning would indeed be useful.
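The warning discussed above might look roughly like this (the helper name and message wording here are hypothetical, not the PR's actual code):

```python
import warnings


def warn_if_expanding(n_components, n_features):
    # Hypothetical helper: projecting to more dimensions than there are
    # features is allowed, but it expands rather than reduces
    # dimensionality, so the user is warned.
    if n_components > n_features:
        warnings.warn(
            "The number of components (%d) is greater than the number of "
            "features (%d); the projection will expand rather than reduce "
            "the dimensionality." % (n_components, n_features))
```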
I have added a warning. I would like to change
I'd rather keep the
+1
On 12/04/2012 05:14 PM, Arnaud Joly wrote:
I agree with Gael and Olivier. Stuff that starts with
To me, the word "component" evokes part of something bigger, not a projection matrix / linear transformation. Anyway, it is better to be consistent. Thanks for the input!
On 12/05/2012 08:39 AM, Arnaud Joly wrote:
I am currently improving the generation of the sparse Bernoulli matrix. Can I use modified code from core Python? Is it license compatible?
From core Python? What do you want to get? It looks like the Python license is compatible with BSD; we just have to leave in the copyright. Though having mixed licenses is always a bit of a hassle (then all files are under BSD except one file which is under the PSF Python license).
The one line that kills performance is the use of random.sample from core Python. I have achieved a 2.6442x speedup using a cythonized and specialized version of it. There might still be some room for improvement; I am a beginner with Cython. Edit: 2.6442x (and not 6x; sorry, the line profiler incurred a large slowdown).
Can you say which line it is? Sorry, I didn't find it; I hadn't studied the code before.
Ok, found it. Can this not be done with
It seems
Hmm, that could be interesting for speeding up the tree construction process. I'll give it a try.
About taking code from core Python, let's ask the elders: @ogrisel @GaelVaroquaux what do you think? ;)
I can show you the code in a gist. However, I don't know whether it might break any license rules.
No problem with reusing code from Python as long as we clearly state the copyright / license / origin of this code section in inline comments. The Python License is a non-viral license very similar to the BSD license, so it's ok.
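For reference, the set-based rejection sampling strategy of CPython's random.sample, which is the part being specialized in Cython here, looks roughly like this in pure Python (a sketch, not the PR's actual code):

```python
import random


def sample_without_replacement(n_population, n_samples, rng=None):
    """Draw n_samples distinct integers from range(n_population), using
    the set-based rejection strategy of CPython's random.sample."""
    if rng is None:
        rng = random.Random()
    selected = set()
    selected_add = selected.add  # cache the bound method (micro-optimization)
    result = [0] * n_samples
    for i in range(n_samples):
        j = rng.randrange(n_population)
        while j in selected:  # rejection: resample until an unseen index
            j = rng.randrange(n_population)
        selected_add(j)
        result[i] = j
    return result
```

This strategy is efficient when n_samples is small relative to n_population, since collisions (and thus resampling) are rare.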
Thanks @ogrisel for the answer. I ran the following benchmark.

Dataset statistics: n_samples = 1000

Script arguments:
Performance for the Bernoulli sparse matrix:
Performance is improved by a factor of 2.6442 with a custom integer sampler in Cython (and not 6; sorry, the line profiler incurred a large slowdown for Python).
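As I understand it, the sparse Bernoulli matrix being benchmarked follows the Achlioptas-style construction: each entry is zero with probability 1 - density, and otherwise ±sqrt(1/density)/sqrt(n_components) with equal probability. A rough dense NumPy sketch of that construction (illustrative only, not the PR's Cython code, and the function name is made up here):

```python
import numpy as np


def sparse_bernoulli_matrix(n_components, n_features, density=1.0 / 3.0,
                            seed=0):
    """Dense sketch of an Achlioptas-style sparse projection matrix:
    entries are 0 w.p. 1 - density, else +/- sqrt(1/density)/sqrt(n_components)
    with equal probability."""
    rng = np.random.RandomState(seed)
    scale = np.sqrt(1.0 / density) / np.sqrt(n_components)
    # boolean mask selecting which entries are nonzero
    nonzero = rng.uniform(size=(n_components, n_features)) < density
    # random signs for the nonzero entries
    signs = rng.choice([-1.0, 1.0], size=(n_components, n_features))
    return nonzero * signs * scale
```

A real implementation would store this as a scipy.sparse matrix rather than a dense array, which is where the fast sampling of nonzero positions matters.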
I got the following errors when I perform
The first two errors are due to the chosen default behavior. I don't understand the third one. What do you advise?
I am not sure if we should convert the errors like: Also, by reading https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_common.py#L148 it seems that the last failure is a consequence of the same
@ogrisel I agree with you, so I have tweaked test_common where necessary.
# The following lines of code are heavily inspired by core Python,
# more precisely by random.sample.
cdef set selected = set()
selected_add = selected.add
Is this kind of local-scope optimization required in Cython code?
Declaring selected as a set improves performance a little. The same holds for selected_add and rng_randint.
Thanks.
@ogrisel Declaring the type of the variables in Cython enables static typing. This basically circumvents a lot of Python overhead.
I agree about the types; I was wondering about the selected_add optimization.
You are right about selected_add; in the annotated .html output, it shows a little overhead. I confused this with a pure-Python optimization.
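For context, the selected_add pattern is the classic CPython idiom of hoisting a method lookup out of a hot loop; in typed Cython code the gain is small, as the discussion concludes. In plain Python it looks like this:

```python
selected = set()
selected_add = selected.add  # look up the bound method once, outside the loop
for i in range(1000):
    selected_add(i)          # call it directly in the hot loop,
                             # avoiding a per-iteration attribute lookup
```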
LGTM, 👍 for merging.
Ok, I'm doing it (manually).
Thanks all!!! :-) It was fun to make this PR!
Done. Thanks a lot!
#372 can be closed!!! Great work!!! :-)
Prosit 🍻!
That is strange. The pull request is closed, but the arjoly:random_projection branch has unmerged commits.
Sorry @arjoly, this is caused by me trying to keep master less messy. I rebased your branch again before pushing; you didn't use the newest master. (Rebasing changes the hashes of the commits, so the commits now in master have different hashes than the ones in your PR.)
Or maybe I messed up again... not sure what is wrong with me today. Some commits do seem to be missing.
For some reason my
Now all the commits are in. I cloned the branch in the old-fashioned way, from @arjoly directly...
Thanks!!! May I be added to the contributors list?
You are in the whatsnew, right? Hm, there was an error about assert_allclose. It looked odd to me, but I didn't say anything because it looked ok. The problem is, it's not in numpy 1.3. I fixed it just now.
Fine, it will be for a later time :-)
Btw, I didn't really follow your work, as there is a lot going on and I struggle to keep up. You have had several PRs already, right? So you might qualify for commit rights.
I just wanted to be on the author list. Sorry for the assert_allclose issue, I hadn't checked.
The author list needs some rewriting, but that won't happen soon.
Fine, thanks :-)
Changes unknown when pulling a61645c on arjoly:random_projection into scikit-learn:master.
Early pull request for random projections; I want to finish PR #372.
My goal is to have a workable implementation of Gaussian, Bernoulli, and sparse random projection matrices.
Note that the materialize=False and random_dot features probably won't be implemented in this pull request.
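A minimal sketch of the Gaussian variant described in this PR (illustrative only; the actual estimator API that was merged may differ):

```python
import numpy as np


def gaussian_random_matrix(n_components, n_features, seed=0):
    # Entries are drawn i.i.d. from N(0, 1 / n_components); with this
    # variance the projection approximately preserves pairwise distances
    # (Johnson-Lindenstrauss lemma).
    rng = np.random.RandomState(seed)
    return rng.normal(0.0, 1.0 / np.sqrt(n_components),
                      size=(n_components, n_features))


# Project 5 samples from 1000 features down to 50 dimensions.
X = np.random.RandomState(42).rand(5, 1000)
X_new = X.dot(gaussian_random_matrix(50, 1000).T)
```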