An Introduction to Sparse Coding and
Dictionary Learning
Kai Cao
January 14, 2014
Outline
Introduction
Mathematical foundation
Sparse coding
Dictionary learning
Summary
Introduction
What is sparsity?
Sparsity implies many zeros in a vector or a matrix
[Figure: a fingerprint patch is FFT-transformed; the FFT response is a sparse representation, and an inverse FFT reconstructs the patch.]

Usage:
Compression
Analysis
Denoising
Sparse Representation

A signal $x$ is represented as $x \approx D\alpha$, where $D$ is a dictionary and $\alpha$ is a sparse code.
Dictionary learning problem: learn $D$ from data.
Sparse coding problem: given $D$, find the sparse code $\alpha$.
Application---Denoising

[Figure: noisy image (PSNR = 22.1 dB), the learned dictionary, and the denoised result (PSNR = 30.829 dB), shown next to the source image. M. Elad, Springer 2010.]
Application---Compression

[Figure: a face image compressed to 550 bytes per image with JPEG, JPEG 2000, PCA, and a dictionary-based coder. Bottom: RMSE values.]

RMSE at 550 bytes per image, for three test images:
JPEG: 15.81, 14.67, 15.30
JPEG 2000: 13.89, 12.41, 12.57
PCA: 10.66, 9.44, 10.27
Dictionary-based: 6.60, 5.49, 6.36

[O. Bryt, M. Elad, 2008]
Mathematical foundation

Derivatives of vectors

First order:
$$\frac{\partial\, a^T x}{\partial x} = \frac{\partial\, x^T a}{\partial x} = a$$

Second order:
$$\frac{\partial\, x^T B x}{\partial x} = (B + B^T)\, x$$

Exercise:
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \frac{\lambda}{2}\|\alpha\|_2^2, \qquad x \in \mathbb{R}^n,\ D \in \mathbb{R}^{n \times m}$$
Solution:
$$\alpha = (D^T D + \lambda I)^{-1} D^T x$$
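This closed-form solution is easy to sanity-check numerically. A minimal NumPy sketch (the sizes and λ here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 20, 10, 0.5
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)

# Closed-form minimizer of (1/2)||x - D a||_2^2 + (lam/2)||a||_2^2
alpha = np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ x)

# The gradient D^T(D a - x) + lam * a vanishes at the minimizer
grad = D.T @ (D @ alpha - x) + lam * alpha
print(np.abs(grad).max())  # ~1e-14
```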
Trace of a Matrix

Definition: for $A = (a_{ij}) \in \mathbb{R}^{n \times n}$,
$$\mathrm{Tr}(A) = \sum_{i=1}^{n} a_{ii}$$

Properties:
$$\|A\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2 = \mathrm{Tr}(A^T A)$$
$$\mathrm{Tr}(A) = \mathrm{Tr}(A^T)$$
$$\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B), \qquad B \in \mathbb{R}^{n \times n}$$
$$\mathrm{Tr}(aA) = a\,\mathrm{Tr}(A), \qquad a \in \mathbb{R}$$
$$\mathrm{Tr}(AB) = \mathrm{Tr}(BA), \qquad B \in \mathbb{R}^{n \times n}$$
$$\mathrm{Tr}(ABC) = \mathrm{Tr}(BCA) = \mathrm{Tr}(CAB), \qquad B, C \in \mathbb{R}^{n \times n}$$
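These identities are cheap to verify numerically; a small NumPy check (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))

# ||A||_F^2 = Tr(A^T A)
assert np.isclose(np.linalg.norm(A, "fro") ** 2, np.trace(A.T @ A))
# Invariance under transposition and the cyclic property
assert np.isclose(np.trace(A), np.trace(A.T))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
```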
Derivatives of traces

First order:
$$\frac{\partial\, \mathrm{Tr}(XA)}{\partial X} = A^T, \qquad \frac{\partial\, \mathrm{Tr}(X^T A)}{\partial X} = A$$

Exercise:
$$\frac{\partial\, \mathrm{Tr}(X^T X A)}{\partial X} = X A^T + X A, \qquad \frac{\partial\, \mathrm{Tr}(X^T B X)}{\partial X} = B^T X + B X$$
$$\min_{A \in \mathbb{R}^{k \times m}} \|X - DA\|_F^2 + \lambda\|A\|_F^2, \qquad X \in \mathbb{R}^{n \times m},\ D \in \mathbb{R}^{n \times k}$$
Solution:
$$A = (D^T D + \lambda I)^{-1} D^T X$$
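The matrix-valued exercise can be checked the same way; a minimal NumPy sketch (again with arbitrary sizes), confirming the gradient vanishes at the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m, lam = 20, 8, 12, 0.3
D = rng.standard_normal((n, k))
X = rng.standard_normal((n, m))

# Closed-form minimizer of ||X - D A||_F^2 + lam ||A||_F^2
A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)

# Gradient 2 D^T (D A - X) + 2 lam A vanishes at the minimizer
grad = 2 * D.T @ (D @ A - X) + 2 * lam * A
print(np.abs(grad).max())  # ~1e-14
```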
Sparse coding
Sparse linear model

Let $x \in \mathbb{R}^n$ be a signal.
Let $D = [d_1, d_2, \ldots, d_m] \in \mathbb{R}^{n \times m}$ be a set of normalized ($d_i^T d_i = 1$) basis vectors, the dictionary.
Sparse representation seeks a sparse vector $\alpha \in \mathbb{R}^m$ such that $x \approx D\alpha$, where $\alpha$ is regarded as the sparse code.
The sparse coding model

Objective function:
$$\min_{\alpha \in \mathbb{R}^m} \underbrace{\frac{1}{2}\|x - D\alpha\|_2^2}_{\text{data fitting term}} + \underbrace{\lambda\,\psi(\alpha)}_{\text{regularization term}}$$

The regularization term $\psi(\alpha)$ can be
the $\ell_2$ norm: $\|\alpha\|_2^2 = \sum_{i=1}^{m} \alpha_i^2$;
the $\ell_0$ pseudo-norm: $\|\alpha\|_0 = \#\{i \mid \alpha_i \neq 0\}$ (sparsity inducing);
the $\ell_1$ norm: $\|\alpha\|_1 = \sum_{i=1}^{m} |\alpha_i|$ (sparsity inducing).
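For concreteness, the three regularizers evaluated on a small vector (NumPy):

```python
import numpy as np

alpha = np.array([0.0, 0.7, 0.0, -0.2, 0.0])

l2_sq = np.sum(alpha ** 2)    # l2 norm squared: 0.53
l0 = np.count_nonzero(alpha)  # l0 "norm": number of nonzeros, here 2
l1 = np.sum(np.abs(alpha))    # l1 norm: 0.9
print(l2_sq, l0, l1)
```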
Matching pursuit

$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \leq L$$

1. Initialization: $\alpha = 0$, residual $r = x$
2. While $\|\alpha\|_0 < L$:
3. Select the element with maximum correlation with the residual: $\hat{\imath} = \arg\max_{i=1,\ldots,m} |d_i^T r|$
4. Update the coefficient and the residual: $\alpha_{\hat{\imath}} \leftarrow \alpha_{\hat{\imath}} + d_{\hat{\imath}}^T r$, then $r \leftarrow r - (d_{\hat{\imath}}^T r)\, d_{\hat{\imath}}$
5. End while
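A minimal NumPy sketch of this loop, assuming D already has unit-norm columns (the early-exit tolerance is an addition for robustness):

```python
import numpy as np

def matching_pursuit(x, D, L):
    """Greedy MP: repeatedly add the atom most correlated with the residual.

    Assumes D has unit-norm columns. An atom may be selected more than once;
    its coefficient is then refined rather than replaced."""
    alpha = np.zeros(D.shape[1])
    r = x.astype(float).copy()
    while np.count_nonzero(alpha) < L:
        c = D.T @ r                    # correlations with the residual
        i = int(np.argmax(np.abs(c)))  # best-matching atom
        if abs(c[i]) < 1e-12:          # residual (numerically) fully explained
            break
        alpha[i] += c[i]               # accumulate the coefficient
        r -= c[i] * D[:, i]            # remove the atom's contribution
    return alpha
```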
An example for matching pursuit

[Figure: a patch from a latent fingerprint is coded with five dictionary elements d1, ..., d5.]

First iteration: the correlations $c_i = d_i^T x$ are c1 = -0.039, c2 = 0.577, c3 = 0.054, c4 = -0.031, c5 = -0.437; atom d2 is selected with coefficient 0.577, and the residual is updated.
Second iteration: the correlations $c_i = d_i^T r$ are c1 = -0.035, c2 = 0, c3 = 0.037, c4 = -0.046, c5 = -0.289; atom d5 is selected with coefficient -0.289. Note that the coefficient of d2 does not update!
Reconstructed patch: $\hat{x} = 0.577\, d_2 + (-0.289)\, d_5$, with error $\|x - \hat{x}\|_2 = 0.763$.
Orthogonal matching pursuit

$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \leq L$$

1. Initialization: $\alpha = 0$, residual $r = x$, active set $\Gamma = \emptyset$
2. While $\|\alpha\|_0 < L$:
3. Select the element with maximum correlation with the residual: $\hat{\imath} = \arg\max_{i=1,\ldots,m} |d_i^T r|$
4. Update the active set, coefficients, and residual: $\Gamma \leftarrow \Gamma \cup \{\hat{\imath}\}$, $\alpha_\Gamma = (D_\Gamma^T D_\Gamma)^{-1} D_\Gamma^T x$, $r = x - D_\Gamma \alpha_\Gamma$
5. End while
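A minimal NumPy sketch of OMP; the least-squares refit over the active set is what distinguishes it from plain MP:

```python
import numpy as np

def orthogonal_matching_pursuit(x, D, L):
    """OMP: after each atom selection, refit all active coefficients by
    least squares so the residual stays orthogonal to the active atoms."""
    alpha = np.zeros(D.shape[1])
    active = []
    r = x.astype(float).copy()
    for _ in range(L):
        c = D.T @ r
        i = int(np.argmax(np.abs(c)))
        if abs(c[i]) < 1e-12:          # residual fully explained
            break
        active.append(i)
        Dg = D[:, active]
        # Least-squares refit on the active set
        coef, *_ = np.linalg.lstsq(Dg, x, rcond=None)
        alpha[:] = 0.0
        alpha[active] = coef
        r = x - Dg @ coef
    return alpha
```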
An example for orthogonal matching pursuit

[Figure: the same latent fingerprint patch and dictionary elements d1, ..., d5.]

First iteration: the correlations $c_i = d_i^T x$ are c1 = -0.039, c2 = 0.577, c3 = 0.054, c4 = -0.031, c5 = -0.437; atom d2 is selected.
Second iteration: the correlations $c_i = d_i^T r$ are c1 = -0.035, c2 = 0, c3 = 0.037, c4 = -0.046, c5 = -0.289; atom d5 is selected, and the least-squares refit on the active set updates both coefficients.
Reconstructed patch: $\hat{x} = 0.499\, d_2 + (-0.309)\, d_5$, with error $\|x - \hat{x}\|_2 = 0.759$, slightly lower than matching pursuit's 0.763.
Why does the l1-norm induce sparsity?

Analysis in 1D (comparison with the $\ell_2$ norm):
$$\min_{\alpha} \frac{1}{2}(x - \alpha)^2 + \lambda|\alpha|$$
If $x \geq \lambda$, $\alpha^* = x - \lambda$; if $x \leq -\lambda$, $\alpha^* = x + \lambda$; else $\alpha^* = 0$. This is soft thresholding: small inputs map to exactly zero.
$$\min_{\alpha} \frac{1}{2}(x - \alpha)^2 + \lambda\alpha^2$$
$\alpha^* = x/(1 + 2\lambda)$, a linear shrinkage with slope $1/(1 + 2\lambda)$ that is zero only at $x = 0$.
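The two shrinkage rules side by side (NumPy; λ = 0.5 is an arbitrary choice):

```python
import numpy as np

lam = 0.5
x = np.linspace(-2, 2, 9)

# l1 solution: soft thresholding, exactly zero on [-lam, lam]
alpha_l1 = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# l2 solution: linear shrinkage, zero only at x = 0
alpha_l2 = x / (1 + 2 * lam)

print(alpha_l1)  # contains exact zeros around the origin
print(alpha_l2)  # shrunk but nonzero
```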
Why does the l1-norm induce sparsity?

Analysis in 2D (comparison with the $\ell_2$ norm):
$$\min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2 + \lambda\|\alpha\|_1 \quad \Longleftrightarrow \quad \min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2 \ \ \text{s.t.}\ \|\alpha\|_1 \leq \mu$$
$$\min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2 + \lambda\|\alpha\|_2^2 \quad \Longleftrightarrow \quad \min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2 \ \ \text{s.t.}\ \|\alpha\|_2 \leq \mu$$

[Figure: level sets of the quadratic loss against the two constraint balls. The $\ell_1$ ball is a diamond with corners on the coordinate axes, so the constrained minimizer often lands on a corner where a coordinate is exactly zero; the $\ell_2$ ball is round and gives no exact zeros.]
Optimality condition for l1-norm regularization

$$\min_{\alpha \in \mathbb{R}^m} J(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$$

Directional derivative in the direction $u$ at $\alpha$:
$$\nabla J(\alpha, u) = \lim_{t \to 0^+} \frac{J(\alpha + tu) - J(\alpha)}{t}$$

$g$ is a subgradient of $J$ at $\alpha$ if and only if
$$\forall t \in \mathbb{R}^m, \quad J(t) \geq J(\alpha) + g^T(t - \alpha)$$

Proposition 1: $g$ is a subgradient $\iff \forall u \in \mathbb{R}^m,\ g^T u \leq \nabla J(\alpha, u)$.
Proposition 2: if $J$ is differentiable at $\alpha$, then $\nabla J(\alpha, u) = \nabla J(\alpha)^T u$.
Proposition 3: $\alpha$ is optimal if and only if $\nabla J(\alpha, u) \geq 0$ for all $u$.
Subgradient for l1-norm regularization

Example: $f(x) = |x|$.
$$\nabla f(x, u) = \begin{cases} |u| & x = 0 \\ \mathrm{sign}(x)\, u & x \neq 0 \end{cases}$$

[Figure: the graph of $f(x) = |x|$ and its subgradient, which is $\mathrm{sign}(x)$ away from the origin and the whole interval $[-1, 1]$ at $x = 0$.]
Subgradient for l1-norm regularization

$$\min_{\alpha \in \mathbb{R}^m} J(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$$
$$\nabla J(\alpha, u) = -u^T D^T(x - D\alpha) + \lambda \sum_{i:\,\alpha_i \neq 0} \mathrm{sign}(\alpha_i)\, u_i + \lambda \sum_{i:\,\alpha_i = 0} |u_i|$$

$g$ is a subgradient at $\alpha$ if and only if, for all $i$,
$$|g_i + d_i^T(x - D\alpha)| \leq \lambda \quad \text{if } \alpha_i = 0,$$
$$g_i = -d_i^T(x - D\alpha) + \lambda\,\mathrm{sign}(\alpha_i) \quad \text{if } \alpha_i \neq 0.$$

In particular, $\alpha$ is optimal ($0$ is a subgradient) if and only if $|d_i^T(x - D\alpha)| \leq \lambda$ whenever $\alpha_i = 0$, and $d_i^T(x - D\alpha) = \lambda\,\mathrm{sign}(\alpha_i)$ whenever $\alpha_i \neq 0$.
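These conditions can be verified numerically. A minimal sketch: with an orthonormal dictionary (a random orthogonal basis, chosen here purely for illustration) the exact lasso solution is soft thresholding of $D^T x$, and the optimality conditions hold as stated:

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 8, 0.4
D, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthonormal dictionary
x = rng.standard_normal(n)

c = D.T @ x
alpha = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)  # soft thresholding

corr = D.T @ (x - D @ alpha)
zero = alpha == 0
assert np.all(np.abs(corr[zero]) <= lam + 1e-12)        # zero coefficients
assert np.allclose(corr[~zero], lam * np.sign(alpha[~zero]))  # nonzeros
```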
First-order methods for convex optimization

Differentiable objective:
Gradient descent: $\alpha_{t+1} = \alpha_t - \eta_t \nabla J(\alpha_t)$,
with line search for a decent step size $\eta_t$, or a diminishing step size, e.g., $\eta_t = (t + t_0)^{-1}$.

Non-differentiable objective:
Subgradient descent: $\alpha_{t+1} = \alpha_t - \eta_t g_t$, where $g_t$ is a subgradient,
again with line search or a diminishing step size.
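A minimal sketch of subgradient descent on the l1-regularized objective (the step-size schedule and iteration count are arbitrary; the best iterate is tracked because subgradient steps need not decrease the objective):

```python
import numpy as np

def lasso_subgradient_descent(x, D, lam, T=5000, t0=10.0):
    """Minimize (1/2)||x - D a||_2^2 + lam ||a||_1 by subgradient descent
    with a diminishing step size 1/(t + t0)."""
    alpha = np.zeros(D.shape[1])
    best, best_obj = alpha.copy(), np.inf
    for t in range(T):
        # np.sign(0) = 0 is a valid subgradient of |.| at 0
        g = D.T @ (D @ alpha - x) + lam * np.sign(alpha)
        alpha = alpha - g / (t + t0)
        obj = 0.5 * np.sum((x - D @ alpha) ** 2) + lam * np.sum(np.abs(alpha))
        if obj < best_obj:
            best, best_obj = alpha.copy(), obj
    return best
```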
Reformulation as a quadratic program

$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$$
Splitting $\alpha = \alpha^+ - \alpha^-$ with $\alpha^+, \alpha^- \geq 0$ gives a smooth problem with nonnegativity constraints:
$$\min_{\alpha^+ \geq 0,\ \alpha^- \geq 0} \frac{1}{2}\|x - D\alpha^+ + D\alpha^-\|_2^2 + \lambda(\mathbf{1}^T \alpha^+ + \mathbf{1}^T \alpha^-)$$
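Because the split objective is smooth and the constraints are simple bounds, an off-the-shelf bounded solver applies. A sketch using SciPy's L-BFGS-B (the problem data are random placeholders):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, m, lam = 30, 15, 0.3
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)

def fun(z):
    ap, am = z[:m], z[m:]            # alpha = ap - am, both constrained >= 0
    r = x - D @ (ap - am)
    f = 0.5 * r @ r + lam * z.sum()  # smooth objective in (ap, am)
    g = -D.T @ r                     # gradient w.r.t. alpha
    return f, np.concatenate([g + lam, -g + lam])

res = minimize(fun, np.zeros(2 * m), jac=True,
               method="L-BFGS-B", bounds=[(0, None)] * (2 * m))
alpha = res.x[:m] - res.x[m:]
print(np.count_nonzero(np.abs(alpha) > 1e-8), "nonzeros")
```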
Dictionary Learning
Dictionary selection

Which D to use?
A fixed set of bases:
Steerable wavelets
Contourlets
DCT basis
A data-adaptive dictionary, learned from the data:
K-SVD (l0-norm)
Online dictionary learning (l1-norm)
The objective function for K-SVD

$$\min_{D, A} \|X - DA\|_F^2 \quad \text{s.t.} \quad \forall j,\ \|\alpha_j\|_0 \leq L$$

The examples (columns of $X$) are linear combinations of atoms from $D$, and each example has a sparse representation with no more than $L$ atoms.

www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
K-SVD: An Overview

Initialize $D$, then alternate two stages:
Sparse coding: use MP or OMP.
Dictionary update: column by column, by SVD computation.

www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
K-SVD: Sparse Coding Stage

$$\min_{A} \|X - DA\|_F^2 \quad \text{s.t.} \quad \forall j,\ \|\alpha_j\|_0 \leq L$$
With $D$ fixed, the problem separates over the examples. For the $j$th example we solve
$$\min_{\alpha} \|x_j - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \leq L$$
Ordinary sparse coding!

www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
K-SVD: Dictionary Update Stage

$$\min_{D} \|X - DA\|_F^2 \quad \text{s.t.} \quad \forall j,\ \|\alpha_j\|_0 \leq L$$
For the $k$th atom we solve
$$\min_{d_k} \|E_k - d_k \alpha_T^k\|_F^2, \quad \text{where } E_k = X - \sum_{i \neq k} d_i \alpha_T^i \ \text{(the residual)}$$
and $\alpha_T^i$ denotes the $i$th row of $A$. Solve with the SVD: $E_k = U \Sigma V^T$, $d_k = u_1$.

www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
K-SVD: Dictionary Update Stage (continued)

We want to solve $\min_{d_k} \|E_k - d_k \alpha_T^k\|_F^2$, but only some of the examples use the column $d_k$. Restrict $E_k$ and $\alpha_T^k$ to those examples; when updating $\alpha_T^k$, only the coefficients corresponding to those examples are recomputed. Solve with the SVD, as sketched below.

www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
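A sketch of one dictionary-update sweep in NumPy, following the restricted-SVD recipe above (the handling of unused atoms is my own choice):

```python
import numpy as np

def ksvd_dictionary_update(X, D, A):
    """One K-SVD sweep: update each atom (and the coefficients that use it)
    from the rank-1 SVD of the restricted residual."""
    for k in range(D.shape[1]):
        omega = np.flatnonzero(A[k, :])     # examples that use atom k
        if omega.size == 0:
            continue                        # atom unused; leave it as-is
        A[k, omega] = 0.0
        Ek = X[:, omega] - D @ A[:, omega]  # residual without atom k
        U, s, Vt = np.linalg.svd(Ek, full_matrices=False)
        D[:, k] = U[:, 0]                   # new atom: leading left sing. vec.
        A[k, omega] = s[0] * Vt[0, :]       # matching restricted coefficients
    return D, A
```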
Compare K-SVD with K-means

K-SVD: initialize the dictionary; sparse coding (use MP or OMP); dictionary update, column by column, by SVD computation.
K-means: initialize the cluster centers; assignment of each vector to a center; cluster-center update, cluster by cluster.
Dictionary learning with l1-norm regularization

Objective function:
$$\min_{D} \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1 \right)$$
where
$$\alpha_i = \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x_i - D\alpha\|_2^2 + \lambda\|\alpha\|_1$$

Advantages of online learning:
It handles large and dynamic datasets,
and can be much faster than batch algorithms.
Dictionary learning with l1-norm regularization

$$F_t(D) = \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1 \right) = \frac{1}{t}\left( \frac{1}{2}\mathrm{Tr}(D^T D A_t) - \mathrm{Tr}(D^T B_t) \right) + \frac{1}{t}\sum_{i=1}^{t} \lambda\|\alpha_i\|_1 + \text{const}$$
where
$$A_t = \sum_{i=1}^{t} \alpha_i \alpha_i^T, \qquad B_t = \sum_{i=1}^{t} x_i \alpha_i^T$$
$$\frac{\partial F_t(D)}{\partial D} = \frac{1}{t}(D A_t - B_t)$$
For a new sample $x_{t+1}$:
$$A_{t+1} = A_t + \alpha_{t+1}\alpha_{t+1}^T, \qquad B_{t+1} = B_t + x_{t+1}\alpha_{t+1}^T$$
Online dictionary learning

1) Initialization: $D_0 \in \mathbb{R}^{n \times m}$; $A_0 = 0$; $B_0 = 0$
2) For $t = 1, \ldots, T$:
3) Draw $x_t$ from the training data set.
4) Get the sparse code: $\alpha_t = \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x_t - D_{t-1}\alpha\|_2^2 + \lambda\|\alpha\|_1$
5) Aggregate the sufficient statistics: $A_t = A_{t-1} + \alpha_t \alpha_t^T$, $B_t = B_{t-1} + x_t \alpha_t^T$
6) Dictionary update: $D_t = D_{t-1} - \eta_t \frac{\partial F_t(D)}{\partial D}\Big|_{D = D_{t-1}}$
7) End for
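A compact sketch of this loop in NumPy; the ISTA sub-solver for step 4, the fixed step size η, and the atom renormalization are my choices rather than part of the algorithm as stated:

```python
import numpy as np

def lasso_ista(x, D, lam, iters=200):
    """Small ISTA solver for the sparse-coding step (my choice of solver)."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz const. of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        z = a - (D.T @ (D @ a - x)) / L      # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def online_dictionary_learning(X, m, lam=0.1, eta=0.05, seed=0):
    """Sketch of the online loop: sparse code, accumulate A and B,
    take a gradient step on the surrogate F_t."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    D = rng.standard_normal((n, m))
    D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
    A, B = np.zeros((m, m)), np.zeros((n, m))
    for t in range(1, T + 1):
        x = X[:, t - 1]
        a = lasso_ista(x, D, lam)            # step 4: sparse code
        A += np.outer(a, a)                  # step 5: sufficient statistics
        B += np.outer(x, a)
        D -= eta * (D @ A - B) / t           # step 6: gradient step on F_t
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # renormalize atoms
    return D
```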
Toolbox - SPAMS

SPArse Modeling Software:
Sparse coding
l0-norm regularization
l1-norm regularization
Dictionary learning
K-SVD
Online dictionary learning
Implemented in C++ with a Matlab interface.
http://spams-devel.gforge.inria.fr/
Summary

Sparsity and sparse representation
Sparse coding with l0- and l1-norm regularization
Matching pursuit / orthogonal matching pursuit
Subgradients and the optimality condition
Dictionary learning with l0- and l1-norm regularization
K-SVD
Online dictionary learning
Try them out!
References

T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680-4688, 2011.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, November 2006.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
Thank you for listening