Regression On A Cylinder: A Project Submitted To The Faculty of The Graduate School of The University of Minnesota BY

Download as pdf or txt
Download as pdf or txt
You are on page 1of 42

Regression on a Cylinder

A PROJECT
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Xianjing Tang
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE
Kang James
May, 2012
c Xianjing Tang 2012
ALL RIGHTS RESERVED
Acknowledgements
First of all, I would like to express my greatest gratitude to my adviser Dr. Kang
James. She guided me all through this project from the initial topic pick to the nal
presentation and provided a lot of advice.
Next, thank Dr. Barry James and Dr. Qi for being my committee members. Dr.
Barry James oered help with my simulation.
I also want to show my sincere appreciation to my peer graduate students for their
help during my graduate study.
Last but not the least, I am very grateful for the encouragement and support from
my family, who are always there to support when it is needed.
i
Abstract
The model Regression on a Cylinder is used for modeling the linear relationship
between variables which are wrapped around a xed number. I studied two meth-
ods, a mixture model with EM algorithm and an iterative method in the least square
sense, to solve the problem. Simulations are performed to compare the two methods
in the following aspects: Accuracy of parameter estimators; Rate of convergence (iter-
ation steps); Dependence of initial values. Results and analysis showed support that
the second method has a better performance in general. The weaknesses and possible
improvements of the methods are also discussed.
ii
Contents
Acknowledgements i
Abstract ii
List of Tables v
List of Figures vi
1 Introduction 1
1.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Method 6
2.1 The Mixture Model Method . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Maximum-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Alternative Method based on Least Squares . . . . . . . . . . . . 13
3 Simulation 15
3.1 Simulation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Algorithm of the Mixture Model . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Algorithm of Least Square Estimate . . . . . . . . . . . . . . . . . . . . 16
3.4 Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Analysis 18
4.1 Accuracy of Parameter Estimators . . . . . . . . . . . . . . . . . . . . . 18
iii
4.2 Speed of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Dependence on Initial Values . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Conclusion and Discussion 25
References 27
Appendix A. R code 28
A.1 code of mixture model method . . . . . . . . . . . . . . . . . . . . . . . 28
A.2 code of LSE method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
A.3 code for simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
iv
List of Tables
4.1 iteration steps when = .5 and
0
= 0 . . . . . . . . . . . . . . . . . . . 23
5.1 summary of two methods . . . . . . . . . . . . . . . . . . . . . . . . . . 25
v
List of Figures
1.1 An example of simple linear regression . . . . . . . . . . . . . . . . . . . 1
1.2 scatter plot of data points . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 unit circle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 data points on a cylinder . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 A graphical view of mixture model . . . . . . . . . . . . . . . . . . . . . 7
4.1 = .5,
1
= 2 .1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 = 1,
1
= 2 .1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 = .5,
1
= 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 = 1,
1
= 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.5 = .5,
1
= 2 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.6 = 1,
1
= 2 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.7 = .5,
1
= 2 .1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.8 = 1,
1
= 2 .1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.9 = .5,
1
= 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.10 = 1,
1
= 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.11 = .5,
1
= 2 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.12 = 1,
1
= 2 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.13
1
= 2,
0
= 0, = .5,n = 100 with 1000 trials . . . . . . . . . . . . . . 24
vi
Chapter 1
Introduction
Linear regression is a widely used statistical technique for modeling the linear relation-
ship between variables by tting a linear equation to observed data. Its applications
are numerous and in almost every eld, including business, biology, social science and
engineering. The following is an example of simple linear regression with the tted
regression line.
Figure 1.1: An example of simple linear regression
However, under some circumstances it is very dicult to acquire the desired lin-
early related data which can be directly used for linear regression model. For example,
suppose that the observed data are the desired data being wrapped around (Modulo
1
2
operation) after they reach a certain value. A scatter plot shows the property of the
data set.
Figure 1.2: scatter plot of data points
From the graph, we can see that the data are clustered into several groups, and the
response region is restricted into an interval [0, m) (m is 6 in the graph). Assuming the
two original variables still have a linear relationship, without knowing the original data,
we want to model the linear relationship by tting a linear equation with only wrapped
data.
1.1 Model
To solve this problem, the model Regression on a Cylinder[1] was proposed.
3
The model is:
Y
i
=
0
+
1
X
i
+
i

i
N(0,
2
)
where Y
i
is not observed but
U
i
= Y
i
%m
is observed.
Here m is xed and known.
% is the modulo operation: For a, b > 0, a%b = a a/bb
The goal is to estimate the coecients
0
,
1
, as well as the standard deviation .
Notice that the estimate of
0
has to be under modular denition(
0
is equivalent to

0
+km).
A familiar use of modular arithmetic is in the unit circle, in which each point corre-
sponds to a radian in [0, 2) adding or subtracting a multiple of 2.
Figure 1.3: unit circle
Suppose all the observed data(U
i
) are points on the circle which is constrained to
[0, 2). The regressor (X
i
) and the response(U
i
) therefore form a cylinder, thus changing
the problem into a linear regression tting problem performed on a cylinder. This is
where the name of the model comes from.
1.2 Application
One typical application is phase unwrapping and frequency estimation in digital signal
and image processing where phase and frequency are both very important concepts.[2]
4
Figure 1.4: data points on a cylinder
The phase of a function x(t) is the real-valued function:
(t) = arg(x(t))
When (t) is constrained to an interval such as [0, 2), its called the wrapped phase.
In the light of the importance of phase to a signal, phase extraction process is a neces-
sary stage in many applications, such as magnetic resonance imaging (MRI), synthetic
aperture radar (SAR), fringe pattern analysis, tomography and spectroscopy.[3] These
systems have mature algorithms to extract the phase signal. However, the phase is
returned wrapped modulo 2, and the wrapped phase is unusable until being un-
wrapped as a continuous function of time. As a result, phase unwrapping is the nal
and most challenging step in the phase extraction process.[3]
In the digital image processing area, the phase unwrapping problem is one of the
most active and challenging research topics, and a huge amount eort has been de-
voted to it in the last twenty years. More than 500 papers have been published, and
a large variety of approaches have been proposed to solve this problem. Examples
of theoretical principles that have been used to solve the phase unwrapping problem
include the following: local and global optimization theory, signal and image process-
ing algorithms such as region growing, number theory, dynamic programming, graph
theory, wavelets, the Fourier, Hilbert and cosine transforms, network ow algorithms,
Bayesian approaches, probability and estimation theories, statistical approaches, Greens
functions, fractals, cellular automata, heuristic and exhaustive search algorithms, cost
functions, minimum spanning tree methods and articial intelligence etc.[3]
5
On the other hand, frequency variation of synthetic signals(for example in radio,
TV and sensor devices) are often used in design of systems to convey information. The
estimation of frequency has application to vibration analysis, data set testing, telephone
transmission system testing, radar, and other measurement situations.[4] One method
of frequency estimation is to estimate the frequency by performing regression on the
phase of the received signal which is also wrapped modulo 2 and therefore must be
unwrapped before the regression can be performed.
The model Regression on a Cylinder can be applied to the unwrapping phase and
frequency estimation problem as one statistical technique solution which integrates the
unwrapping and coecient estimation into one process.
Chapter 2
Method
The statistical method of parameter estimation attempts to perform linear regression on
the unwrapped data. This procedure is complicated by the fact that the observed data
are wrapped modulo a xed and known number mand therefore must be unwrapped
before the regression can be performed.
To solve this problem, Aiyou Chen, Barry James and Kang James proposed two dif-
ferent approaches for estimating the parameters of the model regression on a cylinder[1]
in 2010. One method builds a mixture model, and uses the EM algorithm to perform
iteration based on the maximum likelihood estimate(MLE). The other one is developing
an iterative algorithm in the least square sense.
2.1 The Mixture Model Method
Mixture model[5] applies to data that are divided into groups. The scatter plot in
Chapter 1 Figure 1.2 suggest that the data points of our model are clustered into
several groups, therefore intuitively are suitable for mixture model.
A mixture distribution is a probability density function of the form
f(x) =
K

k=1

k
f(x;
k
)

k
: unknown parameter of the distribution of cluster k
K: the number of clusters in the mixture model.
6
7
f(x;
k
): the pdf of cluster number k.

k
: weight or prior probability of cluster k.
In order for the mixture model to be a proper pdf, it must be the case that

k

k
= 1.
For a simple example, suppose we have a data set where each point x
i
comes from
one of two univariate Gaussian distributions N(0, 1) and N(5, 2) with probability .3 and
.7. The following gure gives a graphical view of the mixture.[6]
Figure 2.1: A graphical view of mixture model
Back to our model, since each observed data point comes from one of the clusters
and U
i
= Y
i
%m, we introduce
Y
i
= U
i
+k
i
m
This is an equivalent representation of mod(m). This step is essentially the unwrap-
ping process where k
i
is the membership of U
i
. Then we can treat this as a mixture
model:
(U
i
|k
i
= k)
0
+
1
X
i
km+
i
(2.1)
P(k
i
= k) =
k
(2.2)
The distribution of U
i
under the condition that it comes from the k
th
component
is the probability density function of component k. The probability that it belongs to
component k is
k
. In addition, we have

K
k=1

k
= 1.
8
Since N(0,
2
), we know that
f(
i
) =
1

2
e

2
i
2
2
.
Because

i
= U
i
+k
i
m
0

1
x
i
,
the above equation can be written as:
f(U
i
) =
1

2
e

(U
i
+k
i
m
0

1
x
i
)
2
2
2
= (
U
i
+k
i
m
0

1
x
i

).
Therefore, apply the mixture (2.1) (2.2) to get:
U
i

K

k=1

k
(
U
i
+km
0

1
x
i

)
If we denote (
U
i
+km
0

1
x
i

) as

(U
i
+km
0

1
x
i
), the observed data
U
i
has a mixture distribution:
U
i

K

k=1

(U
i
+km
0

1
x
i
) (2.3)
2.1.1 Maximum-likelihood
A common estimator we use for solving mixture model is the maximum likelihood
estimate(MLE). Recall the denition of the maximum-likelihood estimation problem.
We have a density function f(x; ) and a data set of size n, supposedly drawn from this
distribution. Assuming these data are independent and identically distributed(i.i.d),
the resulting density for the sample is:
f(x|) =

n
i=1
f(x; ) = L(|x)
This function L(|x) is called the likelihood function of the given data. The like-
lihood is thought of as a function of the parameter where the data x is xed. The
principle of maximum likelihood is to choose the parameters of the mixture distribution
that maximize the likelihood L of the observed data . That is, we wish to nd

where

= arg max

L(|x)
9
In most cases, we try to maximize log(L(|x)) instead to make the computation
easier. Depending on the form of f(x|), this problem can be easy or hard.
In our model, the likelihood function is:
L(
0
,
1
, ) =

n
i=1

K
k=1

(U
i
+km
0

1
x
i
)
Take log to get the log-likelihood function:
log L(
0
,
1
, ) = log
n

i=1
K

k=1

(U
i
+km
0

1
x
i
)
=
n

i=1
log[
K

k=1

(U
i
+km
0

1
x
i
)] (2.4)
Next,we attempt to solve for
0
,
1
directly by setting the derivatives of log(L(
0
,
1
))
to zero. However, we notice that its not possible to nd such a closed-form solution.
The numerical diculty comes from the sum inside the log function. Therefore, we have
to resort to more elaborate techniques.
2.1.2 EM algorithm
The expectation-maximization (EM) algorithm is one of the elaborate techniques which
is most commonly used for computing maximum-likelihood estimates of a mixture model
or when the data is incomplete.[7]
In the mixture model, the actual observed data are the points U
1
to U
n
. Conceptu-
ally, there is an unknown label 1 k
i
K for each U
i
. The actual observed data are
often called the incomplete data, while the set (U
i
, k
i
) is called the complete data.
Before performing the EM algorithm, let us look at the complete or full log-likelihood
function.
The complete density function of U
i
given U
i
comes from component k
i
is:
f(U
i
|k
i
) =
K

k=1

(U
i
+km
0

1
x
i
)
I(k
i
=k)
where I(k
i
= k) is the indicator function of k
i
= k. When k
i
= k, the value of the
function is 1, 0 otherwise.
10
So, the density of the complete data is:
f(U
i
, k
i
) = f(U
i
|k
i
)f(k
i
)
=
K

k=1
[
k

(U
i
+km
0

1
x
i
)]
I(k
i
=k)
Similarly, we compute the likelihood function of the complete data:
L
c
(
0
,
1
, ) =
n

i=1
K

k=1
[
k

(U
i
+km
0

1
x
i
)]
I(k
i
=k)
Take log,
log L
c
(
0
,
1
, ) = log
n

i=1
K

k=1
[
k

(U
i
+km
0

1
x
i
)]
I(k
i
=k)
=
n

i=1
K

k=1
I(k
i
= k)[log
k

(U
i
+km
0

1
x
i
)
2
2
2

1
2
log(2
2
)]
(2.5)
The idea of the EM algorithm is to maximize the complete log-likelihood function by
iterating two steps: E-step(expectation) and M-step(maximization) until convergence.
The E-step computes the expected value of the complete log-likelihood function with
respect to the unknown value k
i
, and estimates the missing data
ik
= P(k
i
= k|U
i
, X
i
)
given the observed data U,X and the current parameter estimates

0
,

1
, . To simplify
our notation, let
= {
0
,
1
, };
thus we have:
Q(
(r)
|

(r1)
) = E
k
i
(log L
c
()|U, X,

(r1)
)
=
n

i=1
K

k=1
1[(log
k

(U
i
+km
0

1
x
i
)
2
2
2
)
1
2
log(2
2
)]P(k
i
= k|U
i
, X
i
)
+
n

i=1
K

k=1
0[(log
k

(U
i
+km
0

1
x
i
)
2
2
2
)
1
2
log(2
2
)]P(k
i
= k|U
i
, X
i
)
=
n

i=1
K

k=1

ik
[log
k

(U
i
+km
0

1
x
i
)
2
2
2

1
2
log(2
2
)] (2.6)
11
where
ik
is the posterior probability of i
th
data point from component k. We calculate

ik
using Bayes rule.

ik
= P(k
i
= k|U
i
, X
i
)
=
P(k
i
= k)P(U
i
|k
i
= k)

K
k=1
P(k
i
= k)P(x
i
|k
i
= k)
P(k
i
= k)P(U
i
|k
i
= k)
=
k

(U
i
+km

1
x
i
) (2.7)
Notice:

K
k=1

ik
= 1
The M step maximize the expectation we computed in the E step. That is, we nd

(r)
= arg max

Q(|

(r1)
)
From the equation (2.6), we can see that maximizing the function Q is equivalent
to minimizing
n

i=1
K

k=1

ik
[
(U
i
+km
0

1
x
i
)
2
2
2
+
1
2
log(2
2
)] (2.8)
since log
k
does not involve any unknown parameters.
The only part which involves
0
and
1
is the rst part in (2.8). Therefore,
(

0
,

1
) = arg min

0
,
1
n

i=1
K

k=1

ik
(U
i
+km
0

1
x
i
)
2
12
The function to be minimized, denoted as f(
0
,
1
) is:
f(
0
,
1
) =
n

i=1
K

k=1

ik
(U
i
+km
0

1
x
i
)
2
=
n

i=1
K

k=1

ik
(
0
+
1
x
i
)
2
2
n

i=1
K

k=1

ik
(U
i
+km)(
0
+
1
x
i
) +C
1
=
n

i=1
(
0
+
1
x
i
)
2
K

k=1

ik
2
n

i=1
(
0
+
1
x
i
)
K

k=1

ik
(U
i
+km) +C
1
=
n

i=1
(
0
+
1
x
i
)
2
2
n

i=1
(
0
+
1
x
i
)(U
i
K

k=1

ik
+
K

k=1

ik
km) +C
1
=
n

i=1
(
0
+
1
x
i
)
2
2
n

i=1
(
0
+
1
x
i
)(U
i
+m
K

k=1

ik
k) +C
1
=
n

i=1
(Y

i

0

1
x
i
)
2
+C
2
(2.9)
where Y

i
= U
i
+m

K
k=1

ik
k.
Notice equation (2.9) has the same form as a sum for least squares. Therefore,
applying the least square formula, we have
(

0
,

1
)
T
= (X
T
X)
1
X
T
Y (2.10)
where X = {1, X
i
} and Y = {Y

i
}.
For

2
, the function which involves ,denoted as g(
2
) is:
g(
2
) =
n

i=1
K

k=1

ik
[
(U
i
+km
0

1
x
i
)
2
2
2
+
1
2
log(2
2
)]
=
n

i=1
K

k=1

ik
(U
i
+km
0

1
x
i
)
2
2
2
+
n

i=1
K

k=1

ik
1
2
log(2
2
)
=
n

i=1
K

k=1

ik
(U
i
+km
0

1
x
i
)
2
2
2
+
n

i=1
1
2
log(2
2
)
K

k=1

ik
=
n

i=1
K

k=1

ik
(U
i
+km
0

1
x
i
)
2
2
2
+
n

i=1
1
2
log(2
2
)
=

n
i=1

K
k=1

ik
(U
i
+km
0

1
x
i
)
2
2
2
+
n
2
log(2
2
). (2.11)
13
Minimize (2.11) by taking the derivative with respect to
2
:
g

(
2
) =

n
i=1

K
k=1

ik
(U
i
+km
0

1
x
i
)
2
2
1

4
+
n
2
1
2
2
2
=
1
2
2
[

n
i=1

K
k=1

ik
(U
i
+km
0

1
x
i
)
2

2
n] (2.12)
Set (2.12) equal to 0.

1
2
2
[

n
i=1

K
k=1

ik
(U
i
+km
0

1
x
i
)
2

2
n] = 0

n
i=1

K
k=1

ik
(U
i
+km
0

1
x
i
)
2

2
n = 0
Therefore, we get

2
=

n
i=1

K
k=1

ik
(U
i
+km
0

1
x
i
)
2
n
. (2.13)
Then we update
k
in the M step:

k
=
1
n
n

i=1

ik
. (2.14)
In summary, for a xed number of components, the EM algorithm is: start with an
initial value of each parameter, calculate the Q function and the posterior probability

ik
according to (2.7) from the E step, next update parameters and
k
from the M step.
Keep iterating until convergence.
2.1.3 Alternative Method based on Least Squares
The principle of the least square estimate is to t a straight line which minimizes the
sum of squared residuals. In the simple linear regression case, the sum of squares is:
S(
0
,
1
) =
n

i=1
(Y
i

1
x
i
)
2
For our model, the observed data are the U
i
, which are the actual data wrapped
around m (modulo m). So we always have the relationship of U
i
and Y
i
as follows:
U
i
= Y
i
%m Y
i
= U
i
+k
i
m
14
Therefore, the least square formula of our data is modied to be:
S(
0
,
1
) =
n

i=1
(U
i
+k
i
m
0

1
x
i
)
2
. (2.15)
Normally, we would set the partial derivatives equal to 0, then solve for
0
and
1
directly. But this is not the case. Notice the label of each data point k
i
is unknown.
The number of unknown parameters are more than that of the equations we have.
Therefore, iteration is needed.
The idea of the iteration in the least square sense is to start with a initial value
of
0
and
1
, pick the label k
i
for each U
i
to minimize the sum of squares. With
those labels obtained, update the parameters
0
and
1
using the simple least square
estimation. Keep iterating until convergence. The estimate of is obtain as in simple
linear regression:

2
=
1
n 2
n

i=1
(U
i
+k
i
m

1
x
i
)
2
. (2.16)
The algorithms of these two methods will be illustrated more explicitly in the next
chapter.
Chapter 3
Simulation
The goal of simulation is to compare the performance of the two methods in the following
aspects:
1. Accuracy of parameter estimators.
2. Rate of convergence (iteration steps).
3. Dependence on initial values.
I choose m = 2 all through the simulation process. The results will be presented
and analyzed in detail in the next chapter. This chapter will focus on the algorithm
and programming used in the simulation.
3.1 Simulation Step
The steps for comparison of the two methods performance are as follows:
1. Generate and sort 100 random numbers as X.
2. Generate 100 random numbers with distribution N(0, 1) (independent with step 1 ).
3. Choose values of
0
,
1
and .
4. Obtain Y
i
and U
i
using the formula Y
i
=
0
+
1
x
i
+N(0, 1) and U
i
= Y
i
Y
i
/mm.
5. Apply the two methods to obtain

0
,

1
and .
6. Compare or observe the distribution of the dierence between the actual value and
the estimate, for example

1
, with 1000 trials.
The algorithm of the two methods in step 5 is described.
15
16
3.2 Algorithm of the Mixture Model
0) Let K be the number of components in the scatter plot, and start with an initial
value (

0
(0)
,

1
(0)
,
(0)
k
);
1) Calculate
ik
from the E step:

ik
= P(k
i
= k|U
i
, X
i
) P(k
i
= k)P(x
i
|k
i
= k)
2) Update (

0
,

1
,

2
) and
i
from the M step:
(

0
,

1
)
T
= (X
T
X)
1
X
T
Y
where X = {1, X
i
} and Y = {Y

i
} with Y

i
= U
i
+m

K
k=1

ik
k,

2
=

n
i=1

K
k=1

ik
(U
i
+km
0

1
x
i
)
2
n

k
=
1
n
n

i=1

ik
3) Iterate 1) and 2) until convergence.
3.3 Algorithm of Least Square Estimate
0) Start with an initial value

0
(0)
and

1
(0)
;
1) Choose k
i
to minimize (U
i
+k
i
m
0

1
x
i
)
2
for each i;
2) Update

0
and

1
by least squares
(

0
,

1
)
T
= (X
T
X)
1
X
T
Y
where X = {1, X
i
} and Y = {U
i
+k
i
m}.
3) Iterate 1) and 2) until convergence.
4) An estimate for
2
can be obtained from the nal sum of squares

2
=
1
n 2
n

i=1
(U
i
+k
i
m
0

1
x
i
)
2
17
3.4 Criterion
I evaluated the mean-squared error (MSE) performance of the two algorithm via simu-
lation and compared them in n = 1000 trials. The MSE is dened as:
MSE() =

n
i=1
(

)
2
n
In addition to MSE, I also compared the distribution of estimate error from the
boxplot. The convergence criterion for the two algorithms is set the same:

0
(r)

0
(r1)
)
2
+ (

1
(r)

1
(r1)
)
2
< 10
3
Since the performance of the two algorithms may depend on both the coecient val-
ues
0
,
1
, the standard deviation and initial values, I did several groups of simulations
with dierent values of parameters. The results are shown in next chapter.
Chapter 4
Analysis
In terms of complexity, we notice that both algorithms have two iteration steps. The
mixture model updates four parameters
0
,
1
, ,
k
in the M step, while the LSE algo-
rithm updates only two parameters
0
and
1
. The estimate of is not involved in the
iteration step, yet obtained from the nal sum of squares. Therefore, the LSE algorithm
is simpler than the mixture model. This may also relate to the convergence rate.
I also notice that both algorithms need an initial value to start the iteration. So the
dependence on initial value is another factor to be considered.
In this chapter, I compare the performance of estimate accuracy via simulation, as
well as the speed of convergence and dependence on initial values.
4.1 Accuracy of Parameter Estimators
The performance of the estimator may depend on the coecient values, in addition to
the noise (). So simulation is performed with = .5, 1 in three levels of
1
: 2 .1 ,
2 and 2 10. Here estimate of
0
is not considered, since
0
has to be under modular
denition(
0
is equivalent to
0
+km). The following are boxplots of
1
error:
18
19
Figure 4.1: = .5,
1
= 2 .1 Figure 4.2: = 1,
1
= 2 .1
Figure 4.3: = .5,
1
= 2 Figure 4.4: = 1,
1
= 2
20
Figure 4.5: = .5,
1
= 2 10 Figure 4.6: = 1,
1
= 2 10
From the above graph, we can see the distribution of
1
error is symmetric for LSE,
while its more negative for the mixture model. In terms of MSE, LSE is consistently
less than that for the mixture.
At the level of = .5,
1
error of both algorithms are small, which is fairly good.
However, at the level of = 1, the mixture model lost its normality, and has a small
amount of points far apart to the negative side, while other points stay normal. There-
fore, we can conclude that has a big inuence on the mixture model.
At dierent level of
1
, the boxplot is proportionally changed, since my plot scale is
proportion to the value of
1
, and the shape of plots almost stay the same. This implies
that the relative error of
1
(

1
) stays the same as
1
changes for both algorithms,
while does aect the estimate of accuracy for the Mixture model, but not for LSE.
As for error, I have the following plots:
21
Figure 4.7: = .5,
1
= 2 .1 Figure 4.8: = 1,
1
= 2 .1
Figure 4.9: = .5,
1
= 2 Figure 4.10: = 1,
1
= 2
22
Figure 4.11: = .5,
1
= 2 10 Figure 4.12: = 1,
1
= 2 10
Just as for the
1
error, the MSE of error for LSE is consistently less than that
for the mixture model. However, the MSE for LSE is less when is big, while it
behaves oppositely for the mixture model. The MSE for the mixture model is greater
when is big. The plot also implies the distributions of error are similar for the two
methods. But for the mixture model, it spreads out more.
1
seems to have no eect
on estimation.
4.2 Speed of Convergence
Take the number of iteration steps as the criterion for rate of convergence. Pick an
arbitrary value of
1
, x = .5 and
0
= 0. Part of the simulation results are displayed
in the table.
23
Table 4.1: iteration steps when = .5 and
0
= 0

1
Mixture LSE
36.284632 3 2
31.260325 5 2
11.280412 2 2
34.080453 3 2
9.8529282 2 2
52.388282 2 2
31.475833 3 2
56.826892 2 2
30.708335 5 2
62.263023 3 2
Generally, the LSE method converges faster than the mixture method. Similar
results are obtained in other simulation trials with other parameter values. This makes
sense because as we can see from the algorithm in the simulation chapter, there are three
updated parameters in the EM algorithm while there are only two in LSE iteration.
4.3 Dependence on Initial Values
In LSE, is NOT involved in the iteration, so the initial value of has no eect on
the convergence. We only consider the initial value of
1
. The mean-square error(MSE)
performance of
1
is evaluated via simulation. For xed
1
= 2,
0
= 0, = .5, do 1000
trials for each initial value. Then plot the MSE of
1
.
The graph shows that both of the two algorithms highly depend on the initial value
of
1
. Only when its close to or not very far away from the actual value, can a good
estimate be obtained.
24
Figure 4.13:
1
= 2,
0
= 0, = .5,n = 100 with 1000 trials
Chapter 5
Conclusion and Discussion
I have studied the two methods for solving the model Regression on a Cylinder. One
method builds a mixture model, and uses the EM algorithm to perform iteration based
on the maximum likelihood estimate. The other one develops an iterative algorithm in
the least square sense. They both integrate the unwrapping and coecients estimation
into one process. The performance of the two methods can be summarized by the
following table:
Table 5.1: summary of two methods
Mixture Model LSE
2 steps per iteration 2 steps per iteration
3 parameters updated in each iteration 2 parameters updated in each iteration
Slow convergence Fast convergence
Highly dependent on the initial values Highly dependent on the initial values
Less accurate in terms of MSE More accurate in terms of MSE
Generally speaking, LSE method works better than the mixture model. However,
there are some disadvantages of LSE and the Mixture model: They both highly depend
on the initial values; When the initial value is not very close to the actual value, the
convergence rate is slow, although they will achieve the value nally; the EM algorithm
may have hidden properties where some research can be conducted. For example,there
is one theorem proved to guarantee the EM algorithm converges to a local maximum of
the likelihood function: Each iteration of EM either increases or holds constant
the log-likelihood. However, it is not guaranteed to converge to a global maximum.[8]
25
26
[9]
Based on these facts, my future work may include: developing a better approach
that is less sensitive to the initial values and parameter values; improving the mixture
model by using a better algorithm with faster convergence; studying the convergence
property of the EM algorithm to improve its performance for this specic model.
References
[1] Kang James Aiyou Chen, Barry James. Regression on a cylinder. Technical report,
University of California, berkeley, 2010.
[2] J.Kitchen B.J.Slocumb. A polynomial phase parameter estimation-phase unwrap-
ping algorithm. In Acoustics, Speech, and Signal Processing, pages 1922, April
1994.
[3] Helen Pottle. The phase unwrapping problem. Technical report, Liverpool John
Moores University, 2012. http://www.ljmu.ac.uk/GERI/90205.htm.
[4] The estimation and tracking of frequency. Cambridge University Press, February
2001.
[5] Charles Elkan. Mixture models. Technical report, University of California, San
Diego, March 2010.
[6] Partha De. Finite mixture model. Technical report, Hunter College and the Graduate
Center, CUNY, July 2008.
[7] Wencheng Geng. Application of gap statistics to penalized model-based clustering.
Masters Project, 2010.
[8] C.F.J. Wu. On the convergence properties of the em algorithm. The Annals of
Statistics, 11(1):95103, 1983.
[9] L. Xu and M.I. Jordan. On convergence properties of the em algorithm for gaussian
mixtures. Neural Computation, (8):129151, 1996.
27
Appendix A
R code
A.1 code of mixture model method
# objective function:log-likelihood
"objfun"<-function(X,U,m,K,Kpi,beta,sigma){
uxki=matrix(0,nrow=length(U),ncol=K)
uxki[,1]=U+m-beta[1]-beta[2]*X
for (k in 2:K){
uxki[,k]=uxki[,1]+(k-1)*m
}
uxki=uxki^2/sigma^2/2
minux=min(uxki)
uxki=uxki-minux
obj=mean(log(exp(-uxki)%*%Kpi))-minux-log(2*pi*sigma^2)/2
}
# fit a mixture model
"fitmix"<-function(X,U,m,K,theta){
#mixture model
Kpi=rep(1/K,K)
if (is.null(theta)){
28
29
b0=m*2
b1=runif(1)
sigma=.1
}else{
b0=theta[1]
b1=theta[2]
sigma=theta[3]
}
iter=0
eps=1
uxki=matrix(0,n,K)
while (iter<100 & eps>1e-3){
\# E-step
Kipi=matrix(0,n,K)
for (i in 1:n){
tt=(U[i]+(1:K)*m-b0-b1*X[i])^2/sigma^2
tt=tt-min(tt)
Kipi[i,]=Kpi*exp(-.5*tt)
Kipi[i,]=Kipi[i,]/sum(Kipi[i,])
}
# M-step
Xm=cbind(rep(1,n),X)
Yt=U+m*Kipi%*%(1:K)
newbeta=solve(t(Xm)%*%Xm)%*%t(Xm)%*%Yt
v=U-Xm%*%cbind(newbeta)
k1=Kipi%*%(1:K)
k2=Kipi%*%(1:K)^2
sigma=sqrt(mean(v^2+2*m*v*k1+m^2*k2))
Kpi=apply(Kipi,2,mean)
30
# check obj
obj=objfun(X,U,m,K,Kpi,newbeta,sigma)
# updata parameters
eps=sqrt((newbeta[1]-b0)^2+(newbeta[2]-b1)^2)
b0=newbeta[1]
b1=newbeta[2]
iter=iter+1
#cat(iter,eps,obj,"\n")
}
# calculate cluster labels based n posterior prob
mem=rep(0,n)
tt=apply(Kipi,1,max)
for (i in 1:n){
mem[i]=which(Kipi[i,]==tt[i])[1]
}
# output
list(mem=mem,beta=newbeta,Kpi=Kpi,sigma=sigma)
}
A.2 code of LSE method
# least square function
"lsefun"<-function(x,U,m,theta){
if (is.null(theta)){
b0=m*2
b1=runif(1)
sigma=.1
}else{
b0=theta[1]
31
b1=theta[2]
sigma=theta[3]
}
iter=0
eps=1
n=length(x)
while (iter<100 & eps>1e-3){
k=(b0+b1*x-U)/m
for (i in 1:n){
eps1=(U[i]+floor(k[i])*m-b0-b1*x[i])^2
eps2=(U[i]+ceiling(k[i])*m-b0-b1*x[i])^2
if (eps1<eps2){
k[i]=floor(k[i])
}
else{
k[i]=ceiling(k[i])
}
}
fitlse<-lm(U+k*m ~ x)
newb0=fitlse$coefficients[[1]]
newb1=fitlse$coefficients[[2]]
eps=sqrt((newb0-b0)^2+(newb1-b1)^2)
b0=newb0
b1=newb1
iter=iter+1
}
sigma=sqrt(1/(n-2)*sum((U+k*m-b0-b1*X)^2))
# output
list(b0=b0,b1=b1,sigma=sigma,iter=iter)
}
32
A.3 code for simulation
#generate the data
"getdata"<-function(n,m,b0,b1,sigma){
#generate Y
#set.seed(1234)
X=runif(n,0,.2)
Y=b0+b1*X+rnorm(n)*sigma
#generate U
U=Y-floor(Y/m)*m
ind=order(X)
list(X=X[ind],U=U[ind],Y=Y[ind])
}
Try<-function(trials,initial){
b1lse=rep(0,trials)
b1mix=rep(0,trials)
siglse=rep(0,trials)
sigmix=rep(0,trials)
if (is.null(initial)){
for (i in 1:trials){
dat=getdata(n,m,b0,b1,sigma)
X=dat$X
U=dat$U
lsefit=lsefun(X,U,m,NULL)
mixfit=fitmix(X,U,m,K,NULL)
b1lse[i]=lsefit$b1
b1mix[i]=mixfit$beta[2]
siglse[i]=lsefit$sigma
sigmix[i]=mixfit$sigma
33
}
}else{
for (i in 1:trials){
dat=getdata(n,m,b0,b1,sigma)
X=dat$X
U=dat$U
lsefit=lsefun(X,U,m,initial)
mixfit=fitmix(X,U,m,K,theta=initial)
b1lse[i]=lsefit$b1
b1mix[i]=mixfit$beta[2]
siglse[i]=lsefit$sigma
sigmix[i]=mixfit$sigma
}
}
mse_lse_b1=mean((b1lse-b1)^2)
mse_mix_b1=mean((b1mix-b1)^2)
mse_lse_sig=mean((siglse-sigma)^2)
mse_mix_sig=mean((sigmix-sigma)^2)
list(b1l=b1lse-b1,b1m=b1mix-b1,b1msel=mse_lse_b1,b1msem=mse_mix_b1,
sigl=siglse-sigma,sigm=sigmix-sigma,sigmsel=mse_lse_sig,sigmsem=mse_mix_sig)
}
#simulation examples
b0=0
b1=2*pi*10
sigma=1
n=100
m=2*pi
K=5
dat=getdata(n,m,b0,b1,sigma)
34
X=dat$X
U=dat$U
plot(X,U)
initial=c(2*pi,62.8,1)
trials=1000
result=Try(trials,initial)
png(file="boxplot101_b1.png",width=480,height=480)
Lse=result$b1l
Mixture=result$b1m
mat<-cbind(Lse,Mixture)
boxplot(data.frame(mat),range=10,col=(c("blue","red")))
title(paste("MSE of LSE=" , result$b1msel),line=2)
title(paste("MSE of Mixture=" , result$b1msem),line=1)
dev.off()
png(file="boxplot101_sig.png",width=480,height=480)
Lse=result$sigl
Mixture=result$sigm
mat<-cbind(Lse,Mixture)
boxplot(data.frame(mat),range=10,col=(c("blue","red")))
title(paste("MSE of LSE=" , result$sigmsel),line=2)
title(paste("MSE of Mixture=" , result$sigmsem),line=1)
dev.off()

You might also like