Apache Mahout - SSVD PDF
Apache Mahout - SSVD PDF
Apache Mahout - SSVD PDF
mathematical definition: A UV .
Map-Reduce characteristics:
SSVD uses at most 3 MR sequential steps (map-only + map-reduce + 2 optional parallel map-reduce jobs) to
produce reduced rank approximation of U, V and S matrices. Additionally, two more map-reduce steps are added
for each power iteration step if requested.
Potential drawbacks:
potentially less precise (but adding even one power iteration seems to fix that quite a bit).
Twitter
Tweets
Paul Scott
@paulscott56
Thanks to the 6 brave souls for
attending my @ApacheMahout talk at
#tech4africa. Hope I answered some
questions!
Retweeted by Apache Mahout
Expand
Vintzooz
@vintzooz
Tutoriel sur Apache Mahout, un sousprojet Hadoop pour la mise en place
d'outils de recommandation
soat.developpez.com/tutoriels/nosq
Retweeted by Apache Mahout
Expand
John Kennedy
@CommerceJohn
A Great AWS Mahout tutorial
blogs.aws.amazon.com/bigdata/post/T
#mapreduce #ecommerce #tech #math
Retweeted by Apache Mahout
Expand
Documentation
Overview and Usage
Note: Please use 0.6 or later! for PCA workflow, please use 0.7 or later.
Zaizi Social
@Zaizi
Content Recommendation in @Alfresco
using @ApacheMahout, read our latest
blog bit.ly/1rurNHN
Tweet to @ApacheMahout
Publications
Nathan Halko's dissertation "Randomized methods for computing low-rank approximations of matrices" contains
comprehensive definition of parallelization strategy taken in Mahout SSVD implementation and also some
precision/scalability benchmarks, esp. w.r.t. Mahout Lanczos implementation on a typical corpus data set.
Halko, Martinsson, Tropp paper discusses family of random projection-based algorithms and contains theoretical
error estimates.
Apache Software
Foundation
How the ASF works
R simulation
Get Involved
Non-parallel SSVD simulation in R with power iterations and PCA options. Note that this implementation is not
most optimal for sequential flow solver, but it is for demonstration purposes only.
Developer Resources
Thanks
Sponsorship
tests.R
n<-1000
m<-2000
k<-10
qi<-1
Related Projects
Lucene
Hadoop
#simulated input
svalsim<-diag(k:1)
usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))
x<- usim %*% svalsim %*% t(vsim)
and try to compare ssvd.svd(x) and stock svd(x) performance for the same rank k, notice the difference in the
running time. Also play with power iterations (qIter) and compare accuracies of standard svd and SSVD.
Note: numerical stability of R algorithms may differ from that of Mahout's distributed version. We haven't studied
accuracy of the R simulation. For study of accuracy of Mahout's version, please refer to Nathan's dissertation as
referenced above.
Modified SSVD Algorithm.
Given an m n matrix A , a target rank k N 1 , an oversampling parameter p N 1 , and the number of
1/2
10/15/2014
1.
Create seed for random n (k + p) matrix . The seed defines matrix using Gaussian unit vectors
per one of suggestions in Halko, Martinsson, Tropp .
2.
Y = A, Y R
3.
Column-orthonormalize Y
Q R
4.
m(k+p)
B0 = Q
A :
m(k+p)
, R R
B R
(k+p)n
. I denote this as Q
6.
7.
Singular values
8.
9.
: Bi
If q
= 1..q
= A
qr (AB i1 ) . Q
If needed, compute
0.5
^
U = QU
If needed, compute V
. Also,
^
^
= U U
= i
, Bq Bq
(k+p)(k+p)
1
^
= Bq U
= QR
= qr (Y) . Q
5.
> 0
repeat: for i
(k+p)(k+p)
. Another way is V
= A
Copyright 2014 The Apache Software Foundation, Licensed under the Apache License, Version 2.0 .
Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
https://mahout.apache.org/users/dim-reduction/ssvd.html
2/2