[MRG] Random Projections #1438

Closed (wants to merge 114 commits)

Commits (114)
143f605  first pass at implementing sparse random projections (ogrisel, Oct 2, 2011)
dbc26ca  DOC: better docstrings (ogrisel, Oct 2, 2011)
c3c45a1  DOC: more docstring improvements (ogrisel, Oct 2, 2011)
4efa9ff  Remove non-ASCII char from docstring (ogrisel, Oct 2, 2011)
5d24689  use random projections in the digits manifold example (ogrisel, Oct 2, 2011)
95c1c2f  test embedding quality and bad inputs (100% line coverage) (ogrisel, Oct 2, 2011)
f1ff664  typos (ogrisel, Oct 2, 2011)
908acde  one more typo (ogrisel, Oct 2, 2011)
f9818b4  OPTIM: CPU and memory optim by using a binomial and reservoir samplin… (ogrisel, Oct 2, 2011)
3028955  note for later possible optims (ogrisel, Oct 2, 2011)
2cbc59d  fix borked doctests (ogrisel, Oct 2, 2011)
ae0247a  make it possible to use random projection on the 20 newsgroups classi… (ogrisel, Oct 2, 2011)
69151c7  FIX: raise ValueError when n_components is too large (ogrisel, Dec 31, 2011)
5ffd5ad  remove the random projection option from the 20 newsgroups example (ogrisel, Dec 31, 2011)
49ab2dd  leave self.density to 'auto' to implement the curified estimator pattern (ogrisel, Dec 31, 2011)
dc3d799  more curified estimator API (ogrisel, Dec 31, 2011)
d4dd361  useless import (ogrisel, Dec 31, 2011)
e3984f6  change API to enforce dense_output representation by default (ogrisel, Dec 31, 2011)
0e6b17c  ENH: vectorize the johnson_lindenstrauss_bound function (ogrisel, Jan 3, 2012)
ae2f8c8  started work on plotting the JL bounds to be used in the narrative do… (ogrisel, Jan 4, 2012)
087b69b  More vectorization of the johnson_lindenstraus_bound function (ogrisel, Jan 4, 2012)
588afc8  More work on the JL example to plot the distribution of the distortion (ogrisel, Jan 6, 2012)
07752a2  WIP: tweaking JL function names (ogrisel, Jan 7, 2012)
0c79d2f  check JL bound domain (ogrisel, Jan 14, 2012)
c14bda0  JL Example improvements (ogrisel, Jan 14, 2012)
d35bc44  WIP: starting implementation implicit random matrix dot product (ogrisel, Jan 18, 2012)
8382b41  working on implicit random projections using a hashing function (ogrisel, Jan 22, 2012)
b243341  OPTIM: call murmurhash once + update test & example (ogrisel, Jan 22, 2012)
9c1dc5c  first stab at CSR input for hashing dot projections (ogrisel, Jan 22, 2012)
af9174a  implemented dense_output=False for hashing_dot (ogrisel, Jan 23, 2012)
f5fbfb0  refactored test to check that both materialized and implicit RP behav… (ogrisel, Jan 24, 2012)
a1e6bd9  fixed broken seeding of the hashing_dot function (ogrisel, Jan 24, 2012)
8b87825  leave dense_output=False by default (ogrisel, Jan 24, 2012)
f8f81df  use the 20 newsgroups as example dataset instead (ogrisel, Jan 24, 2012)
4de6744  make it possible to use a preallocated output array for hashing_dot (ogrisel, Jan 25, 2012)
dc5ee08  missing docstring and s/hashing_dot/random_dot/g (ogrisel, Jan 25, 2012)
10e1bb1  COSMIT use sklearn.utils.testing (arjoly, Dec 3, 2012)
43efc0d  ENH Let the user decide the number of random projections (arjoly, Dec 3, 2012)
39c704a  Clean random_dot features (arjoly, Dec 3, 2012)
feb3b45  Clean random_dot features (2) (arjoly, Dec 3, 2012)
fb54161  Clean random_dot features (3) (arjoly, Dec 3, 2012)
42d8229  Clean random_dot features (3) (arjoly, Dec 3, 2012)
31a0001  ENH let the user decide density between 0 and 1 (arjoly, Dec 3, 2012)
332784a  COSMIT (arjoly, Dec 3, 2012)
7dd8884  ENH Strenghtens the input checking (arjoly, Dec 3, 2012)
662869d  ENH Add gaussian projeciton + refactor sparse random matrix to reuse … (arjoly, Dec 4, 2012)
53dde07  ENH add more tests with wrong input (arjoly, Dec 4, 2012)
8337a34  ENH add warning when user ask n_components > n_features (arjoly, Dec 4, 2012)
e027861  DOC: correct doc (arjoly, Dec 4, 2012)
16c2a5a  ENH add more tests (arjoly, Dec 5, 2012)
482e5f6  Update doctests (arjoly, Dec 5, 2012)
d19f1da  ENH cosmit naming consistency (arjoly, Dec 5, 2012)
24f3a2c  FIX renaming bug (arjoly, Dec 5, 2012)
e8403e3  COSMIT (arjoly, Dec 5, 2012)
87d5ab0  WIP: add benchmark for random_projection module (arjoly, Dec 5, 2012)
6f729ca  ENH finish benchmark (arjoly, Dec 6, 2012)
180c9c5  Typo (arjoly, Dec 8, 2012)
af78760  ENH optim sparse bernouilli matrix (arjoly, Dec 10, 2012)
2d08535  FIX example import (name changed) (arjoly, Dec 10, 2012)
cb19c77  FIX: argument passing selection of sparse/dense matrix (arjoly, Dec 10, 2012)
b1caedd  ENH assert_raise_message check for substring existence (arjoly, Dec 10, 2012)
2c79417  ENH add two tests to check proper transformation matrix (arjoly, Dec 10, 2012)
29fb160  PEP8 + PEP257 (arjoly, Dec 11, 2012)
14e0344  DOC improve dev doc on reservoir sampling (arjoly, Dec 11, 2012)
b24e05b  COSMIT + ENH better handle dense bernouilli random matrix (arjoly, Dec 11, 2012)
0bdc0b5  FIX: make test_commons succeed with random_projection (arjoly, Dec 11, 2012)
0170366  DOC removed unrelevant paragraph(s) (arjoly, Dec 11, 2012)
7cf6c98  ENH add implementation choice for sample_int (arjoly, Dec 11, 2012)
ca302aa  ENH add various sampling without replacement algorithm (arjoly, Dec 12, 2012)
6d22689  Typo (arjoly, Dec 12, 2012)
fe3ef30  TST: Add tests for every sampling algorithm + DOC: improved doc (arjoly, Dec 18, 2012)
bee6f47  DOC: fix mistake in the doc + ADD benchmarking script (arjoly, Dec 18, 2012)
c6fd908  ENH Rename sample_int to sample_without_replacement (arjoly, Dec 18, 2012)
f27d6e8  DOC + ENH: minor add in doc + set correct default (arjoly, Dec 19, 2012)
d981c8d  FIX: broken import (arjoly, Dec 19, 2012)
5a821aa  FIX typo mistakes + ENH change default behavior to speed the bench wi… (arjoly, Dec 19, 2012)
969a35d  ENH Add allclose to sklearn.testing (arjoly, Dec 19, 2012)
7fd4445  ENH improve naming consistency (arjoly, Dec 19, 2012)
562a239  PEP8 (arjoly, Dec 19, 2012)
b852662  COSMIT (arjoly, Dec 19, 2012)
2ef97cf  DOC + typo (arjoly, Dec 19, 2012)
596c83b  DOC set narrative doc for random projection (arjoly, Dec 19, 2012)
3ee3497  FIX: broken test due to typo correction (arjoly, Dec 19, 2012)
8e1e3eb  DOC minor improvements (arjoly, Dec 19, 2012)
f2e398d  DOC mainly switch from .\n:: to :: (arjoly, Dec 19, 2012)
51d63d5  FIX typo mistakes (arjoly, Dec 19, 2012)
3f36baa  DOC improve name in example (arjoly, Dec 19, 2012)
3c41a24  DOC Separate the jl example from references (arjoly, Dec 19, 2012)
fa2411b  ENH Add jl lemma figure to random_projection.rst (arjoly, Dec 19, 2012)
0a9e366  COSMIT (typo, doc, simplify code) (arjoly, Dec 19, 2012)
0f0460c  pep8 (arjoly, Dec 19, 2012)
0ef4b0d  Typo (arjoly, Dec 19, 2012)
e5849ec  DOC typo in narrative doc (arjoly, Dec 20, 2012)
439c210  DOC fix typo in filename (arjoly, Dec 20, 2012)
2b0675f  DOC clarification (arjoly, Dec 20, 2012)
e801000  ENH flatten random_projection module + add sklearn.utils.random (arjoly, Dec 20, 2012)
7fff25b  ENH refactor matrix generation BaseRandomProjectiona and subclass (arjoly, Dec 20, 2012)
e15b127  DOC improve layout (url) (arjoly, Dec 20, 2012)
5174284  Make the JL / RP example use the digits dataset by default (arjoly, Dec 20, 2012)
6356837  FIX broken import (arjoly, Dec 20, 2012)
f86ee55  pep257 + COSMIT: naming consistency (arjoly, Dec 20, 2012)
ab6c963  COSMIT (arjoly, Dec 20, 2012)
a7aa944  COSMIT (arjoly, Dec 20, 2012)
7641f1d  Remove unused line (arjoly, Dec 20, 2012)
e1f675d  DOC improve doc for jl lemma function (arjoly, Dec 20, 2012)
75598ad  typo (arjoly, Dec 20, 2012)
e6345b5  ENH Rename Bernoulli random projection to sparse random projection (arjoly, Dec 20, 2012)
f3c2a6a  ENH Rename Bernoulli random projection to sparse random projection (arjoly, Dec 20, 2012)
ac32c95  DOC add see also (arjoly, Dec 20, 2012)
75b3568  pep8 (arjoly, Dec 20, 2012)
5f507f6  COSMIT make everything use the common interface (arjoly, Dec 21, 2012)
72d16ee  DOC improve + fix mistakes + TST added (arjoly, Dec 21, 2012)
f4bbc48  ENH Simplify assert_raise_message + TST add them (arjoly, Dec 21, 2012)
a61645c  DOC add utitilies to the doc (arjoly, Dec 21, 2012)
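
The commits above converge on a small public API: GaussianRandomProjection, SparseRandomProjection and the johnson_lindenstrauss_min_dim helper, all exercised by the benchmark script below. As a minimal sketch of how these estimators are used (assuming the final names in this pull request and the usual scikit-learn fit/transform interface):

import numpy as np
from sklearn.random_projection import (SparseRandomProjection,
                                       johnson_lindenstrauss_min_dim)

rng = np.random.RandomState(42)
X = rng.rand(100, 10000)

# Lower bound on the embedding dimension that preserves pairwise distances
# within a factor (1 +/- eps), per the Johnson-Lindenstrauss lemma.
n_components = johnson_lindenstrauss_min_dim(n_samples=100, eps=0.5)

# Project onto a sparse random matrix; n_components='auto' would apply
# the same bound internally.
transformer = SparseRandomProjection(n_components=n_components,
                                     random_state=42)
X_new = transformer.fit_transform(X)
print(X_new.shape)  # (100, n_components)
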
benchmarks/bench_random_projections.py (new file: 253 additions, 0 deletions)
"""
===========================
Random projection benchmark
===========================

Benchmarks for random projections.

"""
from __future__ import division
from __future__ import print_function

import gc
import sys
import optparse
from datetime import datetime
import collections

import numpy as np
import scipy.sparse as sp

from sklearn import clone
from sklearn.random_projection import (SparseRandomProjection,
                                       GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)


def type_auto_or_float(val):
    if val == "auto":
        return "auto"
    else:
        return float(val)


def type_auto_or_int(val):
    if val == "auto":
        return "auto"
    else:
        return int(val)


def compute_time(t_start, delta):
    mu_second = 0.0 + 10 ** 6  # number of microseconds in a second

    return delta.seconds + delta.microseconds / mu_second


def bench_scikit_transformer(X, transformer):
    gc.collect()

    clf = clone(transformer)

    # time the call to fit
    t_start = datetime.now()
    clf.fit(X)
    delta = (datetime.now() - t_start)
    time_to_fit = compute_time(t_start, delta)

    # time the call to transform
    t_start = datetime.now()
    clf.transform(X)
    delta = (datetime.now() - t_start)
    time_to_transform = compute_time(t_start, delta)

    return time_to_fit, time_to_transform


# Make some random data with non-zero entries at uniformly drawn locations
# and Gaussian distributed values
def make_sparse_random_data(n_samples, n_features, n_nonzeros,
                            random_state=None):
    rng = np.random.RandomState(random_state)
    data_coo = sp.coo_matrix(
        (rng.randn(n_nonzeros),
         (rng.randint(n_samples, size=n_nonzeros),
          rng.randint(n_features, size=n_nonzeros))),
        shape=(n_samples, n_features))
    return data_coo.toarray(), data_coo.tocsr()


def print_row(clf_type, time_fit, time_transform):
    print("%s | %s | %s" % (clf_type.ljust(30),
                            ("%.4fs" % time_fit).center(12),
                            ("%.4fs" % time_transform).center(12)))


if __name__ == "__main__":
    ###########################################################################
    # Option parser
    ###########################################################################
    op = optparse.OptionParser()
    op.add_option("--n-times",
                  dest="n_times", default=5, type=int,
                  help="Benchmark results are averaged over n_times "
                       "experiments")

    op.add_option("--n-features",
                  dest="n_features", default=10 ** 4, type=int,
                  help="Number of features in the benchmarks")

    op.add_option("--n-components",
                  dest="n_components", default="auto",
                  help="Size of the random subspace "
                       "('auto' or int > 0)")

    op.add_option("--ratio-nonzeros",
                  dest="ratio_nonzeros", default=10 ** -3, type=float,
                  help="Ratio of non-zero entries (relative to n_features) "
                       "used to generate the benchmark data")

    op.add_option("--n-samples",
                  dest="n_samples", default=500, type=int,
                  help="Number of samples in the benchmarks")

    op.add_option("--random-seed",
                  dest="random_seed", default=13, type=int,
                  help="Seed used by the random number generators.")

    op.add_option("--density",
                  dest="density", default=1 / 3,
                  help="Density used by the sparse random projection "
                       "('auto' or a float in (0.0, 1.0])")

    op.add_option("--eps",
                  dest="eps", default=0.5, type=float,
                  help="See the documentation of the underlying transformers.")

    op.add_option("--transformers",
                  dest="selected_transformers",
                  default='GaussianRandomProjection,SparseRandomProjection',
                  type=str,
                  help="Comma-separated list of transformers to benchmark. "
                       "Default: %default. Available: "
                       "GaussianRandomProjection,SparseRandomProjection")

    op.add_option("--dense",
                  dest="dense",
                  default=False,
                  action="store_true",
                  help="Use a dense matrix as the input space.")

    (opts, args) = op.parse_args()
    if len(args) > 0:
        op.error("this script takes no arguments.")
        sys.exit(1)

    opts.n_components = type_auto_or_int(opts.n_components)
    opts.density = type_auto_or_float(opts.density)
    selected_transformers = opts.selected_transformers.split(',')

    ###########################################################################
    # Generate dataset
    ###########################################################################
    n_nonzeros = int(opts.ratio_nonzeros * opts.n_features)

    print('Dataset statistics')
    print("===========================")
    print('n_samples \t= %s' % opts.n_samples)
    print('n_features \t= %s' % opts.n_features)
    if opts.n_components == "auto":
        print('n_components \t= %s (auto)' %
              johnson_lindenstrauss_min_dim(n_samples=opts.n_samples,
                                            eps=opts.eps))
    else:
        print('n_components \t= %s' % opts.n_components)
    print('n_elements \t= %s' % (opts.n_features * opts.n_samples))
    print('n_nonzeros \t= %s per feature' % n_nonzeros)
    print('ratio_nonzeros \t= %s' % opts.ratio_nonzeros)
    print('')

    ###########################################################################
    # Set transformer input
    ###########################################################################
    transformers = {}

    ###########################################################################
    # Set GaussianRandomProjection input
    gaussian_matrix_params = {
        "n_components": opts.n_components,
        "random_state": opts.random_seed
    }
    transformers["GaussianRandomProjection"] = \
        GaussianRandomProjection(**gaussian_matrix_params)

    ###########################################################################
    # Set SparseRandomProjection input
    sparse_matrix_params = {
        "n_components": opts.n_components,
        "random_state": opts.random_seed,
        "density": opts.density,
        "eps": opts.eps,
    }

    transformers["SparseRandomProjection"] = \
        SparseRandomProjection(**sparse_matrix_params)

    ###########################################################################
    # Perform benchmark
    ###########################################################################
    time_fit = collections.defaultdict(list)
    time_transform = collections.defaultdict(list)

    print('Benchmarks')
    print("===========================")
    print("Generate dataset benchmarks... ", end="")
    X_dense, X_sparse = make_sparse_random_data(opts.n_samples,
                                                opts.n_features,
                                                n_nonzeros,
                                                random_state=opts.random_seed)
    X = X_dense if opts.dense else X_sparse
    print("done")

    for name in selected_transformers:
        print("Perform benchmarks for %s..." % name)

        for iteration in xrange(opts.n_times):
            print("\titer %s..." % iteration, end="")
            time_to_fit, time_to_transform = bench_scikit_transformer(
                X, transformers[name])
            time_fit[name].append(time_to_fit)
            time_transform[name].append(time_to_transform)
            print("done")

    print("")

    ###########################################################################
    # Print results
    ###########################################################################
    print("Script arguments")
    print("===========================")
    arguments = vars(opts)
    print("%s \t | %s " % ("Arguments".ljust(16),
                           "Value".center(12),))
    print(25 * "-" + ("|" + "-" * 14) * 1)
    for key, value in arguments.items():
        print("%s \t | %s " % (str(key).ljust(16),
                               str(value).strip().center(12)))
    print("")

    print("Transformer performance:")
    print("===========================")
    print("Results are averaged over %s repetition(s)." % opts.n_times)
    print("")
    print("%s | %s | %s" % ("Transformer".ljust(30),
                            "fit".center(12),
                            "transform".center(12)))
    print(31 * "-" + ("|" + "-" * 14) * 2)

    for name in sorted(selected_transformers):
        print_row(name,
                  np.mean(time_fit[name]),
                  np.mean(time_transform[name]))

    print("")
    print("")