Hidden Markov Model
Contents
• Introduction
• Markov Model
• Hidden Markov model (HMM)
• Three algorithms of HMM
– Model evaluation
– Most probable path decoding
– Model training
• Pattern classification by HMM
• Application of HMM to on-line handwriting recognition with the HMM toolbox for Matlab
• Summary
• References
What’s HMM?
• Scenario
• Graphical representation
• Definition
• Sequence probability
• State probability
[Figure: three-state weather Markov chain (1: rain, 2: cloudy, 3: sunny) with transition probabilities on the arcs, e.g. $P(\text{sunny} \to \text{sunny}) = 0.8$]

• State sequence: $q_1, q_2, \ldots, q_T$
• State transition probability matrix
$$A = \{a_{ij}\} = \begin{pmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}$$
– Where
$$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$$
– With constraints
$$a_{ij} \ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1$$
• Chain rule (the first-order Markov assumption is applied in the second line):
$$P(q_1, q_2, \ldots, q_T) = P(q_1)\, P(q_2 \mid q_1) \cdots P(q_{T-1} \mid q_1, \ldots, q_{T-2})\, P(q_T \mid q_1, \ldots, q_{T-1})$$
$$= P(q_1)\, P(q_2 \mid q_1) \cdots P(q_{T-1} \mid q_{T-2})\, P(q_T \mid q_{T-1})$$
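As a concrete illustration (ours, not from the original slides), a minimal Matlab sketch that evaluates this product for the weather chain above, assuming the transition matrix from the figure and a uniform initial distribution:

    % Illustrative sketch: probability of a state sequence under a
    % first-order Markov chain. States: 1 = rain, 2 = cloudy, 3 = sunny.
    A  = [0.4 0.3 0.3; 0.2 0.6 0.2; 0.1 0.1 0.8];  % transition matrix from the figure
    p0 = [1/3 1/3 1/3];                            % assumed uniform initial distribution
    q  = [3 3 1 2];                                % sunny, sunny, rain, cloudy
    p = p0(q(1));
    for t = 2:length(q)
        p = p * A(q(t-1), q(t));   % multiply first-order transition probabilities
    end
    disp(p)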
[Figure: lattice of states $S_1, S_2, \ldots, S_N$; each state $S_i$ at time $t$ collects probability from every state $S_j$ at time $t-1$ through $a_{ji}$]

• State probability at time $t$:
$$P(q_t = i) = \sum_{j=1}^{N} P(q_{t-1} = j)\, P(q_t = i \mid q_{t-1} = j) = \sum_{j=1}^{N} P(q_{t-1} = j)\, a_{ji}$$
• Time complexity: $O(N^2 t)$
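In matrix form this recursion is one vector-matrix product per time step, as the illustrative sketch below shows (our own example, reusing the weather chain):

    % Illustrative sketch: the O(N^2) recursion is one vector-matrix product per step.
    A = [0.4 0.3 0.3; 0.2 0.6 0.2; 0.1 0.1 0.8];
    p = [0 0 1];           % start in state 3 (sunny)
    for t = 2:5
        p = p * A;         % p(i) = sum_j p(j) * a(j,i)
    end
    disp(p)                % state distribution P(q_5 = i)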
What’s HMM?
• Example
• Generation process
• Definition
• Model evaluation algorithm
• Path decoding algorithm
• Training algorithm
[Figure: example three-state HMM with transition probabilities on the arcs, and the corresponding trellis of hidden states $q_1, q_2, \ldots, q_{t-1}, q_t$ emitting observations $X_1, X_2, \ldots, X_{t-1}, X_t$]

• Initial state probability distribution
$$\pi = \{P(q_1 = i)\} = [1, 0, 0]$$
• State transition probability distribution
$$A = \{a_{ij}\} = \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.1 & 0.3 & 0.6 \\ 0.3 & 0.1 & 0.6 \end{pmatrix}$$
• Observation symbol probability distribution
$$B = \{b_i(v_k)\} = \begin{pmatrix} 3/6 & 2/6 & 1/6 \\ 1/6 & 3/6 & 2/6 \\ 1/6 & 1/6 & 4/6 \end{pmatrix}$$
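The generation process this definition implies can be sketched in a few lines of Matlab (an illustration of ours, not code from the slides): sample a state path from $\pi$ and $A$, and at each step emit a symbol from the corresponding row of $B$.

    % Illustrative sketch: generate T observation symbols from the example HMM above.
    pi0 = [1 0 0];
    A = [0.6 0.2 0.2; 0.1 0.3 0.6; 0.3 0.1 0.6];
    B = [3/6 2/6 1/6; 1/6 3/6 2/6; 1/6 1/6 4/6];
    sample = @(p) find(rand < cumsum(p), 1);   % draw an index with probabilities p
    T = 10;
    q = zeros(1,T); x = zeros(1,T);
    q(1) = sample(pi0);                        % initial state from pi
    for t = 1:T
        x(t) = sample(B(q(t),:));              % emit a symbol from state q(t)
        if t < T, q(t+1) = sample(A(q(t),:)); end  % move to the next state
    end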
HMM: Three Problems

[Figure: the example three-state HMM with its transition probabilities]

• What is the probability of generating a given observation sequence from the model?
– Forward algorithm
– Backward algorithm

Definition
• Given a model λ
• Observation sequence: $X = x_1, x_2, \ldots, x_T$
• $P(X \mid \lambda) = ?$

$$P(X \mid \lambda) = \sum_{Q} P(X, Q \mid \lambda) = \sum_{Q} P(X \mid Q, \lambda)\, P(Q \mid \lambda) = \sum_{Q} b_{q_1}(x_1)\, b_{q_2}(x_2) \cdots b_{q_T}(x_T)\; \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$$

[Figure: two trellis diagrams over observations $x_1, \ldots, x_8$, showing that many different state paths $Q$ can generate the same observation sequence]
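Evaluated literally, this sum enumerates all $N^T$ state paths. A brute-force sketch (our own illustration, reusing the example model defined earlier) makes the exponential cost concrete and motivates the forward algorithm:

    % Illustrative sketch: brute-force evaluation of P(X | lambda) by
    % enumerating all N^T state paths -- exponential, hence the forward algorithm.
    pi0 = [1 0 0];
    A = [0.6 0.2 0.2; 0.1 0.3 0.6; 0.3 0.1 0.6];
    B = [3/6 2/6 1/6; 1/6 3/6 2/6; 1/6 1/6 4/6];
    x = [1 1 2 3];                 % an observation sequence (symbol indices)
    N = 3; T = length(x);
    P = 0;
    for k = 0:N^T-1                % each k encodes one state path in base N
        q = mod(floor(k ./ N.^(0:T-1)), N) + 1;
        p = pi0(q(1)) * B(q(1), x(1));
        for t = 2:T
            p = p * A(q(t-1), q(t)) * B(q(t), x(t));
        end
        P = P + p;
    end
    disp(P)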
Forward Algorithm
• Key idea
– Span a lattice of N states and T times
– Keep the sum of probabilities of all the paths coming to each state i at time t
[Figure: trellis of states against observations $x_1, x_2, \ldots, x_T$; all paths entering state $S_j$ at time $t$ are summed]

• Forward probability
$$\alpha_t(j) = P(x_1 x_2 \ldots x_t,\, q_t = S_j \mid \lambda) = \sum_{q_1 \ldots q_{t-1}} P(x_1 \ldots x_t,\, q_1 \ldots q_{t-1},\, q_t = S_j \mid \lambda) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(x_t)$$
Forward Algorithm
• Initialization
$$\alpha_1(i) = \pi_i\, b_i(x_1), \quad 1 \le i \le N$$
• Induction
$$\alpha_t(j) = \left[\, \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(x_t), \quad 1 \le j \le N,\ t = 2, 3, \ldots, T$$
• Termination
$$P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
[Figure: worked forward trellis for the observation sequence R, R, G, B with $\pi = [1\ 0\ 0]^T$; the rows of $\alpha_t(i)$ accumulate as (0.6, 0.18, 0.018, 0.0018), (0.0, 0.048, 0.0504, 0.01123) and (0.0, 0.0, 0.01116, 0.01537)]
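A compact Matlab sketch of the same recursion (function and variable names are ours, not from the slides):

    % Illustrative sketch: forward algorithm, O(N^2 T).
    function P = forward_prob(pi0, A, B, x)
        N = size(A, 1); T = length(x);
        alpha = zeros(N, T);
        alpha(:,1) = pi0(:) .* B(:, x(1));                  % initialization
        for t = 2:T
            alpha(:,t) = (A' * alpha(:,t-1)) .* B(:, x(t)); % induction
        end
        P = sum(alpha(:,T));                                % termination
    end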
• Backward probability
$$\beta_t(i) = P(x_{t+1} x_{t+2} \ldots x_T \mid q_t = S_i, \lambda) = \sum_{q_{t+1} \ldots q_T} P(x_{t+1} \ldots x_T,\, q_{t+1} \ldots q_T \mid q_t = S_i, \lambda) = \sum_{j=1}^{N} a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)$$
• Initialization
$$\beta_T(i) = 1, \quad 1 \le i \le N$$
• Induction
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ t = T-1, T-2, \ldots, 1$$

[Figure: trellis showing the backward induction step out of state $S_i$]
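A matching sketch for the backward pass (again illustrative, mirroring forward_prob above):

    % Illustrative sketch: backward algorithm.
    function beta = backward_prob(A, B, x)
        N = size(A, 1); T = length(x);
        beta = zeros(N, T);
        beta(:,T) = 1;                                      % initialization
        for t = T-1:-1:1
            beta(:,t) = A * (B(:, x(t+1)) .* beta(:,t+1));  % induction
        end
    end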
• State sequence
• Optimal path
• Viterbi algorithm
• Sequence segmentation

[Figure: two trellis diagrams over observations $x_1, \ldots, x_8$, contrasting the full set of state paths with the single most probable path]
• Viterbi Algorithm
– Alignment of observation and state transition
– Dynamic programming technique
• Initialization: $\delta_1(i) = \pi_i\, b_i(x_1)$
• Recursion (keep only the best path into each state at time $t+1$):
$$\delta_{t+1}(j) = \max_{1 \le i \le N} \delta_t(i)\, a_{ij}\, b_j(x_{t+1}), \qquad \psi_{t+1}(j) = \arg\max_{1 \le i \le N} \delta_t(i)\, a_{ij}$$
• Termination:
$$P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$$
• Path backtracking:
$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1$$
[Figure: worked Viterbi trellis for the same observation sequence R, R, G, B with $\pi = [1\ 0\ 0]^T$; identical to the forward trellis but with max in place of sum, the rows of $\delta_t(i)$ are (0.6, 0.18, 0.018, 0.0018), (0.0, 0.048, 0.036, 0.00648) and (0.0, 0.0, 0.00576, 0.01008)]
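An illustrative Matlab sketch of the Viterbi recursion with backtracking (our own naming, not from the slides):

    % Illustrative sketch: Viterbi algorithm with path backtracking.
    function [P, q] = viterbi_path(pi0, A, B, x)
        N = size(A, 1); T = length(x);
        delta = zeros(N, T); psi = zeros(N, T); q = zeros(1, T);
        delta(:,1) = pi0(:) .* B(:, x(1));                      % initialization
        for t = 2:T
            [m, bi] = max(A .* repmat(delta(:,t-1), 1, N), [], 1);  % best predecessor per state
            delta(:,t) = m(:) .* B(:, x(t));
            psi(:,t) = bi(:);
        end
        [P, q(T)] = max(delta(:,T));                            % termination
        for t = T-1:-1:1
            q(t) = psi(q(t+1), t+1);                            % backtracking
        end
    end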
MLE Example
• Scenario
– Known: 3 balls inside an urn (some red, some white)
– Unknown: R = number of red balls
– Observation: two reds drawn (see the check below)
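A hedged reading of this example (the slide does not state the sampling scheme): assuming the two balls are drawn without replacement, $P(\text{two reds}) = \frac{R}{3} \cdot \frac{R-1}{2}$, which is 0, 0, 1/3, 1 for $R = 0, 1, 2, 3$, so the maximum-likelihood estimate is $R = 3$. A short Matlab check:

    % Assumed: two draws without replacement from 3 balls, R of them red.
    R = 0:3;
    L = (R .* (R-1)) / 6;     % likelihood P(both red) for each candidate R
    [~, idx] = max(L);
    R_mle = R(idx)            % R_mle = 3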
• Expected number of visits to state $S_i$:
$$\Gamma(i) = \sum_{t} \gamma_t(i)$$

[Figure: trellis highlighting state $S_i$ at time $t$ and the transition $S_i \to S_j$ between times $t$ and $t+1$]

• Probability of taking the transition $S_i \to S_j$ at time $t$:
$$\xi_t(i,j) = P(q_t = S_i,\, q_{t+1} = S_j \mid X, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}{\sum_{i}\sum_{j} \alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}$$
• Expected number of transitions from $S_i$ to $S_j$:
$$\Xi(i,j) = \sum_{t} \xi_t(i,j)$$
• Parameter re-estimation ($\lambda$: HMM parameters):
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \bar{b}_i(v_k) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \delta(x_t, v_k)}{\sum_{t=1}^{T} \gamma_t(i)}, \qquad \bar{\pi}_i = \gamma_1(i)$$
– where $\delta(x_t, v_k) = 1$ if $x_t = v_k$ and 0 otherwise
• Iterative re-estimation: $P(X \mid \lambda^{(t+1)}) \ge P(X \mid \lambda^{(t)})$
– Convergence is proven, but only to a local optimum
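A minimal sketch of one re-estimation pass for a discrete HMM (illustrative only; it assumes a single training sequence x over K symbols and recomputes the forward/backward quantities from the sketches above):

    % Illustrative sketch: one Baum-Welch re-estimation step.
    function [pi_new, A_new, B_new] = baum_welch_step(pi0, A, B, x)
        [N, K] = size(B); T = length(x);
        alpha = zeros(N,T); beta = ones(N,T);
        alpha(:,1) = pi0(:) .* B(:,x(1));                   % forward pass
        for t = 2:T
            alpha(:,t) = (A' * alpha(:,t-1)) .* B(:,x(t));
        end
        for t = T-1:-1:1                                    % backward pass
            beta(:,t) = A * (B(:,x(t+1)) .* beta(:,t+1));
        end
        PX = sum(alpha(:,T));                               % P(X | lambda)
        gamma = (alpha .* beta) / PX;                       % gamma_t(i)
        xi = zeros(N,N,T-1);
        for t = 1:T-1                                       % xi_t(i,j)
            xi(:,:,t) = (alpha(:,t) * (B(:,x(t+1)) .* beta(:,t+1))') .* A / PX;
        end
        pi_new = gamma(:,1)';                               % pi_i = gamma_1(i)
        A_new  = sum(xi,3) ./ repmat(sum(gamma(:,1:T-1),2), 1, N);
        B_new  = zeros(N,K);
        for k = 1:K
            B_new(:,k) = sum(gamma(:, x==k), 2);            % counts weighted by gamma
        end
        B_new = B_new ./ repmat(sum(gamma,2), 1, K);
    end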
• Pattern classification
• Extension of HMM structure
• Extension of HMM training method
• Practical issues of HMM
• HMM history
• Recognition flow
[Figure: recognition pipeline — input → pre-processing → feature encoding]
HMM Initialization
• HMM for class ‘0’ with random initialization
– hmm0.prior = [1 0 0];
– hmm0.transmat = rand(3,3); % 3-by-3 transition matrix
– hmm0.transmat(2,1) = 0; hmm0.transmat(3,1) = 0; hmm0.transmat(3,2) = 0; % left-to-right topology
– hmm0.transmat = mk_stochastic(hmm0.transmat);
– hmm0.transmat
0.20 0.47 0.33
0.00 0.45 0.55
0.00 0.00 1.00
[Figure: left-to-right model over states 1 → 2 → 3]
– hmm0.obsmat = rand(3, 16); % (# of states) × (# of observation symbols)
– hmm0.obsmat = mk_stochastic(hmm0.obsmat)
0.02 0.04 0.05 0.00 0.12 0.11 0.13 0.00 0.06 0.09 0.02 0.11 0.06 0.05 0.04 0.08
0.12 0.04 0.07 0.06 0.03 0.03 0.08 0.02 0.11 0.04 0.02 0.06 0.06 0.11 0.01 0.12
0.05 0.04 0.01 0.11 0.02 0.08 0.11 0.10 0.09 0.02 0.05 0.10 0.06 0.00 0.09 0.07
HMM Training
• Training of model 0
– [LL0, hmm0.prior, hmm0.transmat, hmm0.obsmat] = dhmm_em(data0, hmm0.prior, hmm0.transmat, hmm0.obsmat)
iteration 1, loglik = -365.390770
iteration 2, loglik = -251.112160
…
iteration 9, loglik = -210.991114
• Trained result
[Figure: trained left-to-right model over states 1 → 2 → 3, with self-loop probabilities 0.91 and 0.93 and transition probabilities 0.09 and 0.07]
– hmm0.transmat
0.91 0.09 0.00
0.00 0.93 0.07
0.00 0.00 1.00
– hmm0.obsmat
0.00 0.00 0.00 0.00 0.30 0.33 0.21 0.12 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.09 0.07 0.07 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.28 0.11
0.00 0.00 0.00 0.00 0.00 0.06 0.06 0.16 0.23 0.13 0.10 0.10 0.16 0.00 0.00 0.00
HMM Evaluation
• Evaluation of data 0
– for dt = 1:length(data0)
–     loglike0 = dhmm_logprob(data0{dt}, hmm0.prior, hmm0.transmat, hmm0.obsmat);
–     loglike1 = dhmm_logprob(data0{dt}, hmm1.prior, hmm1.transmat, hmm1.obsmat);
–     disp(sprintf('[class 0: %d-th data] model 0: %.3f, model 1: %.3f', dt, loglike0, loglike1));
– end
[class 0: 1-th data] model 0: -68.969, model 1: -289.652
[class 0: 2-th data] model 0: -66.370, model 1: -291.671
[class 0: 3-th data] model 0: -75.649, model 1: -310.484
• Evaluation of data 1
– for dt = 1:length(data1)
–     loglike0 = dhmm_logprob(data1{dt}, hmm0.prior, hmm0.transmat, hmm0.obsmat);
–     loglike1 = dhmm_logprob(data1{dt}, hmm1.prior, hmm1.transmat, hmm1.obsmat);
–     disp(sprintf('[class 1: %d-th data] model 0: %.3f, model 1: %.3f', dt, loglike0, loglike1));
– end
[class 1: 1-th data] model 0: -18.676, model 1: -5.775
[class 1: 2-th data] model 0: -17.914, model 1: -11.162
[class 1: 3-th data] model 0: -21.193, model 1: -13.037
• Each sample scores a higher log-likelihood under its own class model, so maximum-likelihood classification assigns every example correctly
Summary
• Markov model
– First-order Markov assumption on state transitions
– ‘Visible’: the observation sequence determines the state transition sequence
• Hidden Markov model
– First-order Markov assumption on state transitions
– ‘Hidden’: an observation sequence may result from many possible state transition sequences
– Fits the modeling of spatio-temporally variable signals well
– Three algorithms: model evaluation, the most probable path
decoding, model training
• Example of HMM application to on-line handwriting
recognition
– Uses the HMM toolbox for Matlab
References (Cont.)
• HMM software
– Kevin Murphy, “HMM toolbox for Matlab”, freely downloadable software written in Matlab, http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
– Speech Vision and Robotics Group, Cambridge University, “HTK (Hidden Markov Model Toolkit)”, freely downloadable software written in C, http://htk.eng.cam.ac.uk/
• On-line Character Recognition
– C. C. Tappert et al., “The State of the Art in On-Line Handwriting Recognition”, IEEE PAMI, pp. 787-808, Aug. 1990
– B.-K. Sin and J. H. Kim, “Ligature Modeling for Online Cursive Script Recognition”, IEEE PAMI, pp. 623-633, Jun. 1997
– S.-J. Cho and J. H. Kim, “Bayesian Network Modeling of Character Components and Their Relationships for On-line Handwriting Recognition”, Pattern Recognition, pp. 253-264, Feb. 2004
– J. Hu et al., “Writer Independent On-line Handwriting Recognition Using an HMM Approach”, Pattern Recognition, pp. 133-147, 2000