Kun Lu
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
May 2021
© Copyright by Kun Lu, 2021.
All rights reserved.
Abstract
This dissertation focuses on developing new statistical and machine learning methods
for financial applications. We first propose a new model, the Feature Augmented
Hidden Markov Model (FAHMM), which extends the traditional Hidden Markov
Model (HMM) by incorporating a feature structure. The model is general in two respects:
(1) the emission distribution can take different forms (e.g., members of the exponential
family); (2) different feature structures (e.g., high dimensionality, multicollinearity) are
handled by adding different penalization terms. A theoretical proof of convergence,
simulations, and an empirical application to currency regime identification are provided.
Next, we develop a new neural natural language processing model, which combines
reinforcement learning with the Bidirectional Encoder Representations from
Transformers (BERT) model to handle long-document classification. Because BERT
accepts at most 512 tokens, it cannot process long documents, which are common in
financial data (e.g., financial news, earnings call transcripts). We therefore train a
reinforcement learning model end to end together with the BERT-based model,
using policy-gradient reinforcement learning to select sentences/chunks. We then
apply our model to earnings conference call transcripts and predict the stock price
movement after the call.
Finally, we develop a method to estimate high-dimensional covariance matrices
using high-frequency data. We use a factor structure and thresholding methods to
deal with high dimensionality, and pre-averaging and refresh-time sampling to tackle
the peculiarities of high-frequency data: microstructure noise and asynchronicity. We
also consider three scenarios, in which only the factors, only the loadings, or neither
are known. Theoretical proofs and simulations are provided to support the theory,
and a horse race on out-of-sample portfolio allocation is conducted with the Dow
Jones 30, S&P 100, and S&P 500 index constituents, respectively.
Acknowledgements
To my family.
Contents
Abstract iii
Acknowledgements iv
List of Tables x
List of Figures xi
1.9 Concluding Remarks 24
1.10 Proof of Theorem 1 24
1.11 Proof of Theorem 2 31
3.2 Model Setup and Assumptions 68
3.2.1 Notation 68
3.2.2 Factor Dynamics 68
3.2.3 Sparsity 71
3.2.4 Microstructure Noise 73
3.2.5 Asynchronicity 75
3.3 Three Estimators and their Asymptotic Properties 77
3.3.1 Time-Series Regression (TSR) 78
3.3.2 Cross-Sectional Regression (CSR) 80
3.3.3 Principal Component Analysis (PCA) 83
3.4 Practical Considerations 86
3.4.1 Choice of δ and kn 86
3.4.2 Choice of r and f(n, d) 87
3.4.3 Choice of the Thresholding Methods 88
3.5 Monte Carlo Simulations 89
3.6 Empirical Applications 92
3.6.1 Data 92
3.6.2 Out-of-Sample Portfolio Allocation 95
3.7 Conclusion 97
3.1 Mathematical Proofs 109
3.1.1 Proof of Theorem 3 109
3.1.2 Proof of Theorem 4 127
3.1.3 Proof of Theorem 5 127
3.1.4 Proof of Theorem 6 129
Bibliography 138
List of Tables
List of Figures
2.1 Zoom's stock price movement after Q4 2020 earnings conference call. 42
2.2 BERT-based model's architecture. 47
2.3 The Deep Reinforced Model's architecture. 50
2.4 Distribution of returns among the 1,500 samples. 54
2.5 Returns distribution of label 1 (left) and label -1 (right). 58
2.6 Alpha distribution of label 1 (left) and label -1 (right). 59
2.7 Cumulative return for directional betting for test data from 2019-07-17 to 2020-02-21, 103 trades in total. 60
2.8 Cumulative return for delta hedging for test data from 2018-10-31 to 2019-06-27, 63 trades in total. 61
2.9 Comparison between the models using only text, using only financial data, and combining them. 62
Chapter 1
1.1 Introduction
The hidden Markov model plays a significant role in machine learning and other
domains, especially for applications in temporal pattern recognition such as speech,
handwriting, bioinformatics and financial economics. With the explosive growth of
data, however, the selection of features in an HMM can be burdensome. Traditionally,
in finance, the modeler must choose her data to be included by reference to expert
judgement, theoretical studies, and analyses of historical performance. There have been
few papers that focus on the feature selection problem in the HMM setting. (Adams,
Beling, and Cogill 2016) proposed a model called Feature Saliency Hidden Markov
Model. Here, they introduce a vector of binary variables, called saliency variables,
to indicate whether a feature is significant or not. This approach complicates the
search algorithm and is difficult to extend to high dimensional applications. Instead,
we propose a feature/factor structural model for the output emission variables and
incorporate a regularization within the M-estimation step. This allows for high
dimensional applications.
An early representative form of factor modeling within HMMs is regime-switching
regression. The idea of switching between two regressions dates back to (Quandt 1958),
which assumes two regimes that can be fitted by separate regressions using maximum
likelihood estimation. (Goldfeld and Quandt 1973) were the first to model the change
of latent states as a Markov chain. (Hamilton 1989) extends the Markov switching
model to the autoregressive case and applies it to investigate the U.S. business cycle.
(Kim, Piger, and Startz 2008) relax the assumption that the latent state variable is
exogenous and develop an endogenous Markov switching regression model.
Today, there is a vast literature on Markov switching regression applications in
economics and finance: Foreign exchange rates: (Engle and Hamilton 1990), (Cai 1994),
(Jeanne and Masson 2000); stock market: (Hamilton and Susmel 1994), (Alizadeh
and Nomikos 2004), (Lee, Yoder, Mittelhammer, and McCluskey 2006); financial
derivatives: (Buffington and Elliott 2002), (Alizadeh, Nomikos, and Pouliasis 2008);
the term structure of interest rates: (Ang and Bekaert 2002), and portfolio allocation:
(Valdez and Vargiolu 2013), among others. Markov switching models appear in other
areas: speech recognition (Rabiner, Juang, and Rutledge 1993) and computer vision
(Horst and Michael 2001).
On the other hand, as technology develops and the cost of acquiring large amounts of
data falls, high-dimensional data are used increasingly often in many fields. However,
there are few studies involving high-dimensional data within the HMM framework.
Exceptions are (Hering, Kazor, and Kleiber 2015) and (Ailliot, Monbet, and Prevosto
2006), who search for parametric shapes for the autoregressive and covariance matrices,
typically assuming that the off-diagonal entries of the matrices decrease as the distance
from the diagonal increases. (Monbet and Ailliot 2017) applied the SCAD penalty to
Markov Switching Vector Autoregressive (MSVAR) models to deal with the
dimensionality of the VAR problem.
Contributions: First, we propose a novel model named the Feature Augmented
Hidden Markov Model (FAHMM), which improves the performance of the Hidden Markov
Model by imposing a factor structure on the output variable. When features x are
available in addition to the output variable y, basic HMMs stack x and y into an output
vector. However, by doing this, the inferred latent states are driven by x and y together,
rather than by y itself, which may cause serious errors in practice. Our FAHMM,
which treats x through a factor structure for y, can correctly identify the true states. We also
cover different forms of emission distribution, and by adding different regularization
terms we can achieve various practical goals, e.g. feature selection, handling high
dimensionality, or addressing multicollinearity.
Second, to fit our model, we design a new algorithm called regularized Baum-Welch,
which is the first Baum-Welch algorithm for the high-dimensional setting. In the
high-dimensional case, the M-step is often unstable or even ill-defined. We solve
this issue by introducing a regularization. We show how to perform this iterative
regularization and how to update the regularization parameter appropriately.
In addition, we establish a general local convergence theorem and statistical
and computational guarantees for this algorithm. The central challenge is to handle
data dependence and high dimensionality at the same time, since we must
control the optimization error as well as the statistical error.
Last, as far as we know, ours is the first convergence analysis for Markov
switching regression, which can be regarded as a special case of our model and
is widely used in econometrics. The local convergence theorem and theoretical
guarantees are presented in Section 1.5.
Related Work: Our work first builds upon a framework for analysis of the
EM algorithm. Until recently, there have been few papers discussing convergence
and theoretical guarantees. (Balakrishnan, Wainwright, Yu, et al. 2017) propose a
theoretical framework for the EM algorithm's convergence to a small neighborhood of
the global optimum; (Yi and Caramanis 2015) then extend this analysis framework
to the regularized EM case. Both of these works, however, apply to models with i.i.d.
data, while for hidden Markov models, since the data are dependent, the population Q
function might not even exist. (Yang, Balakrishnan, and Wainwright 2017) generalize
the analysis to dependent data and give statistical and computational guarantees for
the Baum-Welch algorithm. Our paper extends the analysis of Baum-Welch to the
high-dimensional case, with a factor structure required for the output variable.
To build as general a framework as possible, we also borrow results and analysis from
(Negahban, Ravikumar, Wainwright, Yu, et al. 2012). They provide a unified framework
for establishing convergence rates and consistency for regularized M-estimators
under high-dimensional scaling. However, our problem is more complex due to the
iterative nature of the regularization: different but appropriate choices of the
regularization parameter are needed at each iteration to guarantee the convergence
of the estimator.
Consider an N-state hidden Markov model with hidden state random variables $\{S_i\}_{i\in\mathbb{Z}}$ taking values in the discrete set $\{1,\ldots,N\}$, as shown in Figure 1.1. The latent states follow an ergodic Markov chain with transition probability matrix $P = (p_{jk})_{j,k\in\{1,\ldots,N\}}$, where $p_{jk} = P(S_2 = k \mid S_1 = j)$ and $\sum_{k=1}^N p_{jk} = 1$ for every $j = 1,\ldots,N$. We denote the stationary distribution of the chain by $\pi$. We also assume that the probability density/mass function $f(\cdot)$ is finite.
Figure 1.1: Graphical illustration of Hidden Markov Model with factor structure and
general emission distribution.
Suppose that for sample i the emission distribution comes from an exponential family
of probability distributions, a large class of probability distributions that includes
the normal, binomial, Poisson, and gamma distributions, among others, with a
corresponding link function g.
1.3 Estimation
In this section, we introduce the estimation method for our model, the Feature Aug-
mented Hidden Markov Model (FAHMM), and the regularized Baum-Welch algorithm.
1.3.1 Likelihood
Suppose we observe sequences $y_1^n = \{y_1,\ldots,y_n\}$ and $x_1^n = \{x_1,\ldots,x_n\}$ from the response-covariate pair $(Y, X) \in \mathbb{R} \times \mathbb{R}^p$. Then the conditional log-likelihood function is
$$L_n(\theta) = \frac{1}{n}\log\Big(\sum_{S_0^n} f(y_1^n, S_0^n \mid x_1^n; \theta)\Big),$$
where $\theta := (\beta, \gamma)$ parameterizes the emission distribution and the transition probability matrix.

We note that the log of a sum is not guaranteed to be concave. For computational and analytical reasons, one normally works instead with a concave lower bound of the conditional log-likelihood obtained via Jensen's inequality. For any choice of $\theta'$, we have
$$L_n(\theta) = \frac{1}{n}\log\Big(E_{S_0^n \mid y_1^n, x_1^n, \theta'}\Big[\frac{f(y_1^n, S_0^n \mid x_1^n; \theta)}{f(S_0^n \mid y_1^n, x_1^n, \theta')}\Big]\Big)
\ge \underbrace{\frac{1}{n}E_{S_0^n \mid y_1^n, x_1^n, \theta'}\big[\log f(y_1^n, S_0^n \mid x_1^n; \theta)\big]}_{Q_n(\theta\mid\theta')}
- \underbrace{\frac{1}{n}E_{S_0^n \mid y_1^n, x_1^n, \theta'}\big[\log f(S_0^n \mid y_1^n, x_1^n, \theta')\big]}_{H_n(\theta')}.$$
Since only the first term $Q_n(\theta\mid\theta')$ depends on $\theta$, EM maximizes $L_n(\theta)$ by maximizing $Q_n(\theta\mid\theta')$.
We expand $Q_n(\theta\mid\theta')$ and get
$$Q_n(\theta\mid\theta') = \frac{1}{n}\sum_{i=1}^n E_{S_{i-1},S_i \mid y_1^n, x_1^n, \theta'}\big[\log p(S_i \mid S_{i-1}; \theta)\big] + \frac{1}{n}\sum_{i=1}^n E_{S_i \mid y_1^n, x_1^n, \theta'}\big[\log f(y_i \mid x_i, S_i; \beta)\big],$$
where $f(y_i \mid x_i, S_i; \beta)$ is the emission distribution under state $S_i$. For example, when the emission distribution is Gaussian, $f(y_i \mid x_i, S_i; \beta) = (2\pi\sigma^2)^{-1/2}\exp\{-\tfrac{1}{2}(y_i - x_i^\top\beta_{S_i})^2/\sigma^2\}$ is the density of $y_i$, where $\sigma$ is constant across states.

If we define $\xi'_{i,j} = P(S_i = j \mid y_1^n, x_1^n, \theta')$ and $\tau'_{i,jk} = P(S_{i-1} = j, S_i = k \mid y_1^n, x_1^n, \theta')$, then
$$Q_n(\theta\mid\theta') = \underbrace{\frac{1}{n}\sum_{i=1}^n E_{S_{i-1},S_i \mid y_1^n, x_1^n, \theta'}\big[\log p(S_i \mid S_{i-1}; \gamma)\big] + \frac{1}{n}E_{S_0 \mid y_1^n, x_1^n, \theta'}\big[\log \pi_0(S_0; \gamma)\big]}_{Q_{2,n}(\gamma\mid\theta')}
+ \underbrace{\frac{1}{n}\sum_{i=1}^n E_{S_i \mid y_1^n, x_1^n, \theta'}\big[\log f(y_i \mid x_i, S_i; \beta)\big]}_{Q_{1,n}(\beta\mid\theta')}. \tag{1.1}$$
We observe that $Q_{1,n}(\beta\mid\theta')$ depends only on $\beta$ and $Q_{2,n}(\gamma\mid\theta')$ depends only on $\gamma$. Therefore we can optimize them separately to estimate $\beta$ and $\gamma$.
1.3.2 Regularization
To encourage the assumed structure (sparsity, multicollinearity, etc.) of the factor part, we add a regularization term to $Q_{1,n}(\beta\mid\theta')$:
$$Q^p_{1,n}(\beta\mid\theta') = Q_{1,n}(\beta\mid\theta') - \sum_{j=1}^N \lambda_j R(\beta_j).$$
1.3.3 Algorithm
Since we allow the number of features to be high dimensional, we propose a new
regularized Baum-Welch algorithm. Using this iterative algorithm, we approximate the
maximum likelihood estimator of $\theta^*$ by cycling between the following Expectation
step and Maximization step.

Algorithm 1 Regularized Baum-Welch Algorithm
Input: samples $y_1^n$, $x_1^n$, regularizer $R$, number of iterations $T$, initial parameters $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)})$, initial regularization parameters $\{\lambda_j^{(0)}\}_{j=1}^N$, estimated statistical error $\Delta$, contraction factor $\kappa < 1$.
for $t = 1$ to $T$ do
  E-step: compute $Q_n(\theta \mid \theta^{(t-1)})$ given $\theta^{(t-1)}$.
  Regularized M-step: compute
  $$\lambda_j^{(t)} \leftarrow \kappa\lambda_j^{(t-1)} + \Delta,$$
  $$\beta^{(t)} = \arg\max_\beta \Big\{Q_{1,n}(\beta \mid \theta^{(t-1)}) - \sum_{j=1}^N \lambda_j^{(t)} R(\beta_j)\Big\},$$
  and
  $$\gamma^{(t)} = \arg\max_\gamma Q_{2,n}(\gamma \mid \theta^{(t-1)}).$$
end for
Output: $\beta^{(T)}$, $\gamma^{(T)}$.
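To make the iteration concrete, the following Python sketch implements the outer loop of Algorithm 1 under stated assumptions: the helpers e_step, m_step_beta, and m_step_gamma are hypothetical placeholders for the forward-backward computation and the two maximizations described below, not the dissertation's actual implementation.

```python
# A minimal sketch of the regularized Baum-Welch loop of Algorithm 1, assuming
# hypothetical helpers e_step (forward-backward smoothing), m_step_beta, and
# m_step_gamma (the penalized and unpenalized maximizations), passed in by the caller.
import numpy as np

def regularized_baum_welch(y, x, R, T, beta0, gamma0, lambda0, delta, kappa,
                           e_step, m_step_beta, m_step_gamma):
    """Cycle between the E-step and the regularized M-step for T iterations."""
    beta, gamma, lam = beta0, gamma0, np.asarray(lambda0, dtype=float)
    for _ in range(T):
        # E-step: smoothed state probabilities xi and pair probabilities tau.
        xi, tau = e_step(y, x, beta, gamma)
        # Shrink the regularization level towards the statistical error floor.
        lam = kappa * lam + delta
        # Regularized M-step: penalized update of beta, plain update of gamma.
        beta = m_step_beta(y, x, xi, lam, R)
        gamma = m_step_gamma(xi, tau)
    return beta, gamma
```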
E-step
At the t-th iteration, the E-step computes the function $Q_n(\theta \mid \theta^{(t-1)})$, which amounts to calculating $\xi^{(t)}_{i,j}$ and $\tau^{(t)}_{i,jk}$ given $\theta^{(t-1)}$. This can be done by the forward-backward algorithm. The "forward" probabilities are defined as
$$a^{(t)}_{i,j} = P(y_1,\ldots,y_i, S_i = j \mid x_1^n, \theta^{(t)}),$$
and can be calculated using the forward recursion
$$a^{(t)}_{i+1,j} = \Big(\sum_{k=1}^N a^{(t)}_{i,k}\, p^{(t)}_{kj}\Big) f_j(y_{i+1}; \theta^{(t)}).$$
Similarly, $b^{(t)}_{i,k}$ are the "backward" probabilities, defined as
$$b^{(t)}_{i,k} = P(y_{i+1},\ldots,y_n \mid S_i = k, x_1^n, \theta^{(t)}),$$
which satisfy the backward recursion
$$b^{(t)}_{i,k} = \sum_{j=1}^N p^{(t)}_{kj} f_j(y_{i+1}; \theta^{(t)})\, b^{(t)}_{i+1,j}.$$
We initialize these recursions by setting $a^{(t)}_{1,j} = \pi^{(t)}_j f_j(y_1; \theta^{(t)})$ and $b^{(t)}_{n,k} = 1$, where $\pi^{(t)} = (\pi^{(t)}_1,\ldots,\pi^{(t)}_N)$ is the stationary distribution of $P^{(t)}$ (its principal left eigenvector). The smoothed pair probabilities $\tau^{(t)}_{i,jk}$ are then computed from $a^{(t)}$, $b^{(t)}$, and the transition probabilities, and
$$\xi^{(t)}_{i,j} = \sum_{k=1}^N \tau^{(t)}_{i,jk}.$$
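As an illustration, the following Python sketch implements the scaled forward-backward recursions for a Gaussian emission; the variable names and the scaling scheme are illustrative choices, not the author's code.

```python
# A sketch of the scaled forward-backward recursions used in the E-step, assuming a
# Gaussian emission f_j(y_i | x_i) with state-specific coefficients beta[j].
import numpy as np
from scipy.stats import norm

def forward_backward(y, x, beta, sigma, trans, pi0):
    n, N = len(y), trans.shape[0]
    # Emission likelihoods f_j(y_i | x_i) for each observation and state.
    f = np.stack([norm.pdf(y, loc=x @ beta[j], scale=sigma) for j in range(N)], axis=1)
    a = np.zeros((n, N)); b = np.ones((n, N)); c = np.zeros(n)
    a[0] = pi0 * f[0]; c[0] = a[0].sum(); a[0] /= c[0]
    for i in range(1, n):                 # forward: a_{i,j} = (sum_k a_{i-1,k} p_{kj}) f_j(y_i)
        a[i] = (a[i - 1] @ trans) * f[i]
        c[i] = a[i].sum(); a[i] /= c[i]   # scaling for numerical stability
    for i in range(n - 2, -1, -1):        # backward: b_{i,k} = sum_j p_{kj} f_j(y_{i+1}) b_{i+1,j}
        b[i] = trans @ (f[i + 1] * b[i + 1]) / c[i + 1]
    xi = a * b
    xi /= xi.sum(axis=1, keepdims=True)   # smoothed state probabilities xi_{i,j}
    tau = a[:-1, :, None] * trans[None] * (f[1:] * b[1:])[:, None, :]
    tau /= tau.sum(axis=(1, 2), keepdims=True)   # smoothed pair probabilities tau_{i,jk}
    return xi, tau
```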
M-step
The M-step maximizes $Q^p_n(\theta \mid \theta^{(t-1)})$ to obtain estimates of $\beta$ and $\gamma$; note that we can maximize $Q^p_{1,n}(\beta \mid \theta^{(t-1)})$ and $Q_{2,n}(\gamma \mid \theta^{(t-1)})$ separately:
$$\beta^{(t)} = \arg\max_\beta \Big\{\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^N \xi^{(t-1)}_{i,j}\log f(y_i \mid x_i, S_i = j; \beta_j) - \sum_{j=1}^N \lambda_j^{(t)} R(\beta_j)\Big\},$$
and
$$\gamma^{(t)} = \arg\max_\gamma \Big\{\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^N\sum_{k=1}^N \tau^{(t-1)}_{i,jk}\log p(S_i = k \mid S_{i-1} = j; \gamma) + \frac{1}{n}\sum_{j=1}^N \xi^{(t-1)}_{0,j}\log \pi(S_0 = j; \gamma)\Big\}.$$
In fact, we can also carry out the maximization for each state separately, that is,
$$\beta^{(t)}_j = \arg\max_{\beta_j} \Big\{\frac{1}{n}\sum_{i=1}^n \xi^{(t-1)}_{i,j}\log f(y_i \mid x_i, S_i = j; \beta_j) - \lambda^{(t)}_j R(\beta_j)\Big\}.$$
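For the Gaussian emission with an $L_1$ penalty, each per-state update above is a weighted lasso. The following Python sketch solves it by proximal gradient descent (ISTA); the step size and iteration count are illustrative choices, not prescribed by the text.

```python
# A sketch of one per-state regularized M-step: a weighted lasso solved by ISTA.
import numpy as np

def m_step_beta_lasso(y, x, w, lam, n_iter=500):
    """Approximately minimize (1/(2n)) * sum_i w_i (y_i - x_i'b)^2 + lam * ||b||_1."""
    n, p = x.shape
    beta = np.zeros(p)
    # Step size 1/L, with L the Lipschitz constant of the weighted least-squares gradient.
    step = 1.0 / (np.linalg.norm(x * np.sqrt(w)[:, None], 2) ** 2 / n + 1e-12)
    for _ in range(n_iter):
        resid = y - x @ beta
        grad = -(x.T @ (w * resid)) / n                      # gradient of the smooth part
        beta = beta - step * grad
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)  # soft-threshold
    return beta
```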
Moreover, since we expect linear convergence of the optimization, it is natural to update $\lambda^{(t)}_j$ recursively as well:
$$\lambda^{(t)}_j = \kappa\lambda^{(t-1)}_j + \Delta.$$
Regularizer. We consider regularizers that are decomposable with respect to a pair of subspaces. For a subspace $\mathcal{M} \subseteq \mathbb{R}^p$, define its orthogonal complement
$$\mathcal{M}^{\perp} := \{v \in \mathbb{R}^p \mid \langle u, v\rangle = 0 \text{ for all } u \in \mathcal{M}\};$$
a regularizer $R$ is decomposable with respect to $(\mathcal{M}, \mathcal{M}^{\perp})$ if $R(u + v) = R(u) + R(v)$ for any $u \in \mathcal{M}$, $v \in \mathcal{M}^{\perp}$.
Remark 1. The insight behind a decomposable regularizer is that the structure of the model parameter $\beta^*$ can be mostly captured by a small model subspace $\mathcal{M}$ of $\mathbb{R}^p$. The regularizer penalizes deviations away from $\mathcal{M}$ as much as possible.

Definition 3 (Dual norm). For any inner product, the dual norm of $R$ is given by
$$R^*(v) := \sup_{u \in \mathbb{R}^p\setminus\{0\}} \frac{\langle u, v\rangle}{R(u)} = \sup_{R(u) \le 1}\langle u, v\rangle.$$

Remark 3. For example, the dual of the $\ell_1$ norm is the $\ell_\infty$ norm. The dual norm is used to specify a suitable choice of the regularization weight $\lambda_n$.
1.4.2 Conditions on Q functions
To give the statistical and computational guarantees for the FAHMM, we also need some technical conditions, originally proposed by (Yang, Balakrishnan, and Wainwright 2017), on the population $Q(\theta\mid\theta')$, the truncated population $Q^k(\theta\mid\theta')$, the sample $Q_n(\theta\mid\theta')$, and the truncated sample $Q^k_n(\theta\mid\theta')$.

Since the performance of Baum-Welch is known to depend on initialization, we only consider local properties in our analysis, with respect to an initialization $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)}) \in \Omega := B_2(r; \beta^*) \times \Omega_\gamma$, where $\gamma$ is a particular parameterization of the stationary distribution and the transition matrix of the Markov chain. Here $B_2(r; \beta^*)$ is a ball centered at $\beta^*$ with radius $r$, so $\Omega$ is a smaller set than the complete set $\tilde\Omega := \Omega_\beta \times \Omega_\gamma$.
Existence of Qpθ|θ1 q
We have already defined $Q_n(\theta\mid\theta')$ in equation (1.1). In the i.i.d. setting, the law of large numbers (LLN) guarantees that as the sample size n goes to infinity, $Q_n(\theta\mid\theta')$ approaches the population Q function $Q(\theta\mid\theta')$, which is the expectation of $Q_n(\theta\mid\theta')$. In our case, due to the dependence of the data, we need to show that the population Q function exists; a natural candidate is given in equation (1.2).

To deal with the dependence, we also need to consider a suitably truncated version of the Q-function. We first define
$$Q^k_n(\theta\mid\theta') = \underbrace{\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^N\sum_{k'=1}^N \tau^{k,\theta'}_{i,jk'}\log p(S_i = k' \mid S_{i-1} = j; \gamma) + \frac{1}{n}\sum_{j=1}^N \xi^{k,\theta'}_{0,j}\log\pi(S_0 = j; \gamma)}_{Q^k_{2,n}(\gamma\mid\theta')}
+ \underbrace{\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^N \xi^{k,\theta'}_{i,j}\log f(y_i \mid x_i, S_i = j; \beta)}_{Q^k_{1,n}(\beta\mid\theta')}. \tag{1.3}$$
By this construction, we can guarantee that the k-truncated population Q function is well-defined. (Yang, Balakrishnan, and Wainwright 2017) show that under the following Condition 2, the population Q function defined in equation (1.2) exists and can be uniformly approximated by $Q^k(\theta\mid\theta')$.

Condition 2 (Mixing Rate). We assume there exists a mixing constant $\epsilon_{\mathrm{mix}} \in (0, 1]$ such that
$$\epsilon_{\mathrm{mix}} \le \frac{P(S_i = k \mid S_{i-1} = j; \gamma)}{\pi_k} \le \epsilon_{\mathrm{mix}}^{-1} \qquad \text{for all } j, k.$$
Remark 5. This condition implies that the dependence on the initial distribution
decreases geometrically.
Population level
We denote the population maximization operator by $M(\theta)$; similar definitions apply to the k-truncated version $M^k(\theta)$ and the corresponding sample versions $M_n(\theta)$ and $M^k_n(\theta)$.

Condition 3 (Strong Concavity and Smoothness). For all $\theta' \in \Omega$, $\beta_1, \beta_2 \in \Omega_\beta$, and $\gamma_1, \gamma_2 \in \Omega_\gamma$,
$$Q^k_1(\beta_1\mid\theta') - Q^k_1(\beta_2\mid\theta') - \langle\nabla_\beta Q^k_1(\beta_2\mid\theta'), \beta_1 - \beta_2\rangle \le -\frac{\lambda_\beta}{2}\|\beta_1 - \beta_2\|_2^2,$$
$$Q^k_2(\gamma_1\mid\theta') - Q^k_2(\gamma_2\mid\theta') - \langle\nabla_\gamma Q^k_2(\gamma_2\mid\theta'), \gamma_1 - \gamma_2\rangle \le -\frac{\lambda_\gamma}{2}\|\gamma_1 - \gamma_2\|_2^2,$$
$$Q^k_1(\beta_1\mid\theta') - Q^k_1(\beta_2\mid\theta') - \langle\nabla_\beta Q^k_1(\beta_2\mid\theta'), \beta_1 - \beta_2\rangle \ge -\frac{\mu_\beta}{2}\|\beta_1 - \beta_2\|_2^2,$$
$$Q^k_2(\gamma_1\mid\theta') - Q^k_2(\gamma_2\mid\theta') - \langle\nabla_\gamma Q^k_2(\gamma_2\mid\theta'), \gamma_1 - \gamma_2\rangle \ge -\frac{\mu_\gamma}{2}\|\gamma_1 - \gamma_2\|_2^2.$$
If we could access $Q^k(\theta\mid\theta^*)$, Baum-Welch would converge in one step to the true optimum. Therefore, if we can show the closeness of $Q^k(\theta\mid\theta')$ and $Q^k(\theta\mid\theta^*)$, we can conclude that $M^k(\theta)$ is close to $\theta^*$. The following gradient stability condition captures this closeness in a first-order sense.

Condition 4 (Gradient Stability). We impose first-order stability conditions on the gradients of each component of $Q^k$.
Sample Level
Define the cone
$$\mathcal{C}(\mathcal{M}, \mathcal{M}^{\perp}, R) := \big\{u \in \mathbb{R}^p \mid 2R(u_{\mathcal{M}^{\perp}}) \le 4R(u_{\mathcal{M}}) + 3N\Psi(\mathcal{M})\|u\|\big\}. \tag{1.5}$$
Condition 5 (Restricted Strong Concavity). For all $\beta$ with $\beta - \beta^* \in \mathcal{C}(\mathcal{M}, \mathcal{M}^{\perp}, R)$,
$$Q^k_{1,n}(\beta\mid\theta') - Q^k_{1,n}(\beta^*\mid\theta') - \langle\nabla_\beta Q^k_{1,n}(\beta^*\mid\theta'), \beta - \beta^*\rangle \le -\frac{\lambda_n}{2}\|\beta - \beta^*\|_2^2.$$
The above condition states that $Q^k_{1,n}(\cdot\mid\theta')$ is strongly concave in directions $\beta - \beta^*$ belonging to $\mathcal{C}(\mathcal{M}, \mathcal{M}^{\perp}, R)$.
This statistical error replaces the quantity $\|M^{\beta,k}_n(\theta) - M^{\beta,k}(\theta)\|$ used in (Balakrishnan, Wainwright, Yu, et al. 2017) and (Yang, Balakrishnan, and Wainwright 2017), in order to deal with the high-dimensional issue.
In addition to the above conditions, we also need a concentration inequality for $\gamma$. For any $\delta \in (0,1)$, we let $\gamma_n(\delta, k)$ be the smallest positive number such that
$$P\Big(\sup_{\theta\in\Omega}\|M^{\gamma,k}_n(\theta) - M^{\gamma,k}(\theta)\| \ge \gamma_n(\delta, k)\Big) \le \delta.$$
For a given truncation level k, this concentration gives an upper bound on the difference between the population and sample versions of the maximization operator for $\gamma$.

At the same time, we need to ensure that the truncation error is controlled. For a tolerance parameter $\delta \in (0,1)$, we let $\varphi_n(\delta, k)$ be the smallest positive number such that
$$P\Big(\sup_{\theta\in\Omega}\|M_n(\theta) - M^k_n(\theta)\| \ge \varphi_n(\delta, k)\Big) \le \delta.$$
This quantity gives an upper bound on the difference between the truncated-sample and sample versions of the maximization operator.
1.4.3 Main Results
We are now ready to present our main theorem about the estimation. We begin by showing the linear convergence of the sequence $\{\theta^{(t)}\}_{t=0}^T$ obtained by repeatedly applying the regularized Baum-Welch update. In the sequel, we let $\alpha := \sup_{u\in\mathbb{R}^p\setminus\{0\}}\|u\|_*/\|u\|$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, and define $\kappa := (\max\{L_{\beta,1}, L_{\beta,2}\} + \max\{L_{\gamma,1}, L_{\gamma,2}\})/\max\{\lambda_\beta, \lambda_\gamma\}$, $\bar\Delta := r\lambda_n/(60\Psi(\mathcal{M}))$, and $\kappa^* := 5\alpha\mu_\beta\kappa/\lambda_n$, and assume that $\Delta_n \le \bar\Delta$.

Theorem 1. Choose the regularization parameters as
$$\lambda^{(t)}_j = \kappa^t\frac{\lambda_n}{5\Psi(\mathcal{M})}\|\beta^{(0)}_j - \beta^*_j\| + \frac{1-\kappa^t}{1-\kappa}\Delta, \qquad j = 1,\ldots,N, \; t = 1,\ldots,T,$$
for any $\Delta \in [3\Delta_n, 3\bar\Delta]$ and $\kappa \in [\kappa^*, 3/4]$. For a given sample size n, suppose $(n, k)$ is large enough to ensure that
$$\varphi_n(\delta, k) + 6\sqrt{N}\,\Psi(\mathcal{M})\frac{\Delta}{\lambda_n} + \varphi(k) \le (1-\kappa)r - \kappa\max_{\gamma\in\Omega_\gamma}\|\gamma - \gamma^*\|.$$
Then, with probability at least $1-\delta$, the Baum-Welch sequence $\{\theta^{(t)}\}_{t=0}^T$ satisfies the bound
$$\|\theta^{(t)} - \theta^*\| \le \kappa^t\|\theta^{(0)} - \theta^*\| + \frac{1}{1-\kappa}\Big\{\varphi_n(\delta, k) + 6\sqrt{N}\,\Psi(\mathcal{M})\frac{\Delta}{\lambda_n} + \gamma_n(\delta, k) + \varphi(k)\Big\}.$$

Remark 6. Theorem 1 shows that the estimation error is bounded by two terms: (1) $\kappa^t\|\theta^{(0)} - \theta^*\|$ decays linearly in the number of iterations and can be regarded as the optimization error; (2) $\frac{1}{1-\kappa}\{\varphi_n(\delta,k) + 6\sqrt{N}\frac{\Delta}{\lambda_n}\Psi(\mathcal{M}) + \gamma_n(\delta,k) + \varphi(k)\}$ can be controlled by letting the sample size n be large enough, and can therefore be interpreted as the statistical error. With $T = O(\log n)$ and an appropriate choice of $\Delta = O(\Delta_n)$, we have
$$\|\theta^{(T)} - \theta^*\| \lesssim \frac{1}{1-\kappa}\Big\{\varphi_n(\delta, k) + 6\sqrt{N}\frac{\Delta_n}{\lambda_n}\Psi(\mathcal{M}) + \gamma_n(\delta, k) + \varphi(k)\Big\}.$$
Gaussian Emission with $L_1$ Penalty
In this section, we give results for a special case of Theorem 1: a Gaussian emission distribution with an $L_1$ penalty. This special case has its own name, Markov switching regression, which was first proposed in (Goldfeld and Quandt 1973) and has been used frequently in econometric research, as discussed in the Introduction. Here we add an $L_1$ penalty to include feature selection and to allow a high-dimensional feature setting. We assume $x_1^n \sim N(0, \mathbf{I}_p)$, the noise follows $N(0, \sigma^2)$, and the number of states is 2. As far as we know, this is the first theoretical analysis of the regularized Markov switching regression model.
Remark 7. Similar to Theorem 1, the error bound again corresponds to an optimization error plus a statistical error. If we let $T = O(\log(n/(s\log p)))$, the error rate becomes
$$\frac{1}{1-\kappa}\Big[C_1(\|\beta^*\|_2 + \delta)\sqrt{\frac{s\log p}{n}} + C_2\sqrt{\frac{\log^8(np)}{n}}\Big].$$

1.6 Determination of the Number of States
In many real-life applications, the number of states is often not known a priori. In this paper, AIC is used to determine the optimal number of states. The AIC criterion consists of two terms, a log-likelihood term and a penalty term:
$$\mathrm{AIC}(N) = -2\sum_{i=1}^n\log f(y_i \mid x_i; \theta) + 2N = -2\sum_{i=1}^n\log\Big(\sum_{j=1}^N f(y_i, S_i = j \mid x_i, \theta)\,P(S_i = j)\Big) + 2N,$$
where N is the number of states. The optimal number of states is then chosen to be the minimizer
$$\hat N = \arg\min_N \mathrm{AIC}(N). \tag{1.6}$$
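As an illustration, the following Python sketch shows the model-selection loop implied by equation (1.6); the fitting routine is passed in as a parameter and is a hypothetical placeholder for the regularized Baum-Welch fit returning the maximized log-likelihood.

```python
# A sketch of AIC-based selection of the number of states.
def select_num_states(y, x, fit_fahmm, candidates=(1, 2, 3)):
    """fit_fahmm(y, x, num_states=N) is a hypothetical helper returning the fitted log-likelihood."""
    aic = {}
    for N in candidates:
        loglik = fit_fahmm(y, x, num_states=N)
        aic[N] = -2.0 * loglik + 2.0 * N          # AIC(N) as in the text
    return min(aic, key=aic.get), aic             # minimizer and the full AIC table
```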
1.7 Simulation
In this section, we investigate the finite-sample performance of our estimators and of the regularized Baum-Welch algorithm. We first sample 500 paths from the model, with the transition matrix set to
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}.$$
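For concreteness, the following Python sketch generates data of this type: a hidden two-state path drawn from the transition matrix above, together with Gaussian emissions. The feature dimension, sparse coefficient values, and noise level are illustrative placeholders, not the exact settings used in the simulations.

```python
# A sketch of the simulation design with assumed (illustrative) coefficients.
import numpy as np

def simulate_path(n=500, p=100, seed=0):
    rng = np.random.default_rng(seed)
    P = np.array([[0.9, 0.1], [0.2, 0.8]])
    beta = np.zeros((2, p))
    beta[0, :5], beta[1, :5] = 1.0, -1.0          # assumed sparse, state-specific coefficients
    s = np.zeros(n, dtype=int)
    for i in range(1, n):                          # hidden Markov chain of states
        s[i] = rng.choice(2, p=P[s[i - 1]])
    x = rng.standard_normal((n, p))
    y = np.einsum("ij,ij->i", x, beta[s]) + 0.5 * rng.standard_normal(n)
    return y, x, s
```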
First of all, before digging into model fitting, we need to check whether AIC can correctly recover the number of states, which is 2 in our experiment. We choose LASSO as the regularizer, set N to 1, 2, and 3, and run our algorithm. The results are shown in Table 1.1: according to equation (1.6), we successfully recover the true number of states.

Figure 1.2 shows the results of regime identification. The upper left panel is the true states generated from a Markov chain, the upper right and lower left panels are the FAHMM with LASSO and Elastic Net regularizers, and the lower right panel is the FAHMM without a regularizer. Both LASSO and Elastic Net correctly recover the true states, while the Markov regression with no regularizer fails to identify them.

Table 1.2 shows different error measures for the Gaussian-emission FAHMM with LASSO, Elastic Net, and no regularizer, averaged over 1000 simulations. Because the ordinary FAHMM without a regularizer cannot identify the regimes, every error measure is inaccurate;
Table 1.2: Simulation results for different models.

                                        Lasso     ElasticNet   Ordinary
||β̂ − β||_1                 S = 1      0.0206    0.0541       66.0323
                             S = 2      0.0186    0.0721       66.0323
||β̂ − β||_2^2               S = 1      0.0116    0.0206       8.769
                             S = 2      0.0082    0.0263       8.769
|σ̂ − σ|                     S = 1      0.015     0.007        4.576
                             S = 2      0.005     0.089        4.576
||P̂ − P||_F^2               ——         0.0766    0.0586       0.707
||P̂ − P||_MAX               ——         0.0481    0.0375       0.400
Model Size                   S = 1      5         13           100
                             S = 2      9         16           100
MSE                          ——         2.704     2.706        105.192
Regime Identification Error  ——         0         0            NA
while both LASSO and Elastic Net approximate the true parameters well. Since there is no strong correlation among the features here, LASSO performs better than Elastic Net, with smaller estimation error and a smaller model size.

We apply the FAHMM with and without a regularizer to predict the foreign exchange rate of USD/BRL (US dollars per Brazilian real). Since we use a Gaussian emission distribution and an $L_1$ penalty, this example can also be called a Markov switching regression. We choose 7 features, shown in Table 1.3, as the observed features $x_1^n$, and use the return of the USD/BRL exchange rate as the output variable $y_1^n$. These features are based on both practical and theoretical considerations.
Firstly, Table 1.3 shows that adding a regularizer shrinks many of the coefficients to 0, which makes the model easier to interpret. The regularized model shows that the Stock feature is the most significant feature in Regime 1 (Bull) and the Interest Rate is the most significant feature in Regime 2 (Bear). These variable selection results make sense: according to the Taylor rule, the

Figure 1.2: The top picture is the simulated true states. The rest are the estimated probabilities of state 1 and state 2 using the regularized FAHMM with an Elastic Net penalty, a LASSO penalty, and the ordinary FAHMM (without regularizer, least squares), respectively.
Table 1.3: Coefficients comparison between FAHMM with and without regularization.
Regime 1
Features Interests Politics Export Stock CPI Foreign Reserve Balance
Regularized / / / 0.0793 / / /
Ordinary -0.072 0.132 0.066 0.221 0.427 0.266 0.335
Regime 2
Features Interests Politics Export Stock CPI Foreign Reserve Balance
Regularized -0.180 / / / / / /
Ordinary -0.090 0.132 1.881 0.302 -5.161 -0.676 -0.894
interest rate is the most significant factor affecting the exchange rate, while in crash periods stock indexes have the highest correlation with the exchange rate. On the other hand, from Figure 1.3 we can see that both methods identify the hidden regimes well, but the regularized version uses a much simpler model. The superiority of the regularized Markov switching regression becomes more pronounced if we include additional, correlated features, and also in out-of-sample tests, because a simpler model helps prevent overfitting. The estimated probabilities for the two regimes are shown in Figure 1.4.

Figure 1.3: The upper picture shows the regime identification results for BRL/USD using the LASSO FAHMM; the lower picture shows the regime identification results using the least squares FAHMM. The red and blue lines stand for Regime 1 and Regime 2.

Figure 1.4: The upper picture shows the estimated probabilities of Regime 1 and Regime 2 for BRL/USD using the LASSO FAHMM; the lower picture shows the estimated probabilities using the least squares FAHMM. The red and blue lines stand for Regime 1 and Regime 2.
1.9 Concluding Remarks
In this paper, we propose a new model, the Feature Augmented Hidden Markov
Model (FAHMM), which improves the performance of the hidden Markov model by
adding a feature structure to the output variable. At the same time, we allow the
number of features to be high dimensional and consider general forms of emission
distributions. To fit the model, we provide a new regularized Baum-Welch algorithm;
the regularization in the M-step can address different practical issues with the
features: feature selection, high dimensionality, multicollinearity, etc. In addition,
we establish general local convergence theorems and statistical and computational
guarantees for the estimator.
We leave a number of issues for future research. As is well known, the EM algorithm,
as a concave approximation to the non-concave log-likelihood objective function,
depends on good starting estimates, and as we have shown in the convergence
section, the global optimal solution can be recovered when the initial estimate is close
enough (suitably defined). The proposed FAHMM algorithm displays similar characteristics.
Future research on computing good starting estimates is a worthwhile goal. Another
interesting direction is an online version of this algorithm. Although the Baum-Welch
algorithm is intuitively easy to apply, like many other iterative algorithms its
computational inefficiency limits its applications in many areas, for example with
high-frequency data.
1.10 Proof of Theorem 1
In this section, we give the proof of Theorem 1 on the statistical and computational guarantees of the regularized Baum-Welch algorithm. First, we state a result on the population-level convergence guarantees.
Lemma 1. Under Condition 2 (mixing rate) and Condition 3 (strong concavity), the truncation bias $\varphi(k)$ satisfies
$$\varphi^2(k) = \frac{c_0 N^4}{\lambda\,\epsilon_{\mathrm{mix}}^{9}\,\bar\pi_{\min}^{2}}\big(1 - \epsilon_{\mathrm{mix}}\bar\pi_{\min}\big)^k$$
for a universal constant $c_0$.
Proof. The proof of this lemma can be found in the proof of Theorem 1, part (a), in (Yang, Balakrishnan, and Wainwright 2017).
This implies that the operator $M^k(\cdot)$ always decreases the distance between $\theta$ and $\theta^*$.

Lemma 3. Under Conditions 1-6, consider an initialization $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)}) \in \Omega$, with regularization parameters given by
$$\lambda^{(t)}_j = \frac{1-\kappa^t}{1-\kappa}\Delta + \kappa^t\frac{\lambda_n}{5\Psi(\mathcal{M})}\|\beta^{(0)} - M^{\beta,k}_{n,j}(\theta^*)\|, \qquad j = 1,\ldots,N, \; t = 1,\ldots,T.$$
Then
$$\|M^{\beta,k}_n(\theta^{(t-1)}) - M^{\beta,k}(\theta^*)\| \le \frac{5\Psi(\mathcal{M})N}{\lambda_n}\frac{1-\kappa^t}{1-\kappa}\Delta + \kappa^t\|\beta^{(0)} - M^{\beta,k}_n(\theta^*)\|.$$
Proof. Recall that
$$M^{\beta,k}_n(\theta^{(t-1)}) = \arg\max_{\beta\in\Omega}\Big\{Q^k_{1,n}(\beta\mid\theta^{(t-1)}) - \sum_{j=1}^N \lambda_j R(\beta_j)\Big\}.$$
By the optimality of $\beta^+ := M^{\beta,k}_n(\theta^{(t-1)})$,
$$Q^k_{1,n}(\beta^+\mid\beta) - \sum_{j=1}^N \lambda_j R(\beta^+_j) \ge Q^k_{1,n}(\beta^*\mid\beta) - \sum_{j=1}^N \lambda_j R(\beta^*_j),$$
or equivalently,
$$Q^k_{1,n}(\beta^+\mid\beta) - Q^k_{1,n}(\beta^*\mid\beta) \ge \sum_{j=1}^N \lambda_j R(\beta^+_j) - \sum_{j=1}^N \lambda_j R(\beta^*_j).$$
Define $\Theta := \beta^+ - \beta^*$. Since
$$|\langle\nabla_\beta Q^k_{1,n}(\beta^*\mid\beta), \Theta\rangle| = |\langle\nabla_\beta Q^k_{1,n}(\beta^*\mid\beta) - \nabla_\beta Q^k_1(\beta^*\mid\beta) + \nabla_\beta Q^k_1(\beta^*\mid\beta), \Theta\rangle| \le |\langle\nabla_\beta Q^k_{1,n}(\beta^*\mid\beta) - \nabla_\beta Q^k_1(\beta^*\mid\beta), \Theta\rangle| + |\langle\nabla_\beta Q^k_1(\beta^*\mid\beta), \Theta\rangle|,$$
we have
$$\sum_{j=1}^N \lambda_j R(\beta^+_j) - \sum_{j=1}^N \lambda_j R(\beta^*_j) \le \Delta_n R(\Theta) + \alpha\mu\kappa\|\beta - \beta^*\|\,\|\Theta\|,$$
and hence, for each j,
$$\lambda_j\big(R(\Theta_{j,\mathcal{M}^{\perp}}) - R(\Theta_{j,\mathcal{M}})\big) \le \Delta_n R(\Theta_j) + \alpha\mu\kappa\|\beta_j - \beta^*_j\|\,\|\Theta\|.$$
If we choose
$$\lambda_j \ge 3\Delta_n + \frac{\alpha\mu\kappa}{\Psi(\mathcal{M})}\|\beta^*_j - \beta_j\|,$$
then
$$R(\Theta_{j,\mathcal{M}^{\perp}}) - R(\Theta_{j,\mathcal{M}}) \le \frac{\Delta_n}{\lambda_j}R(\Theta_j) + \frac{\alpha\mu\kappa\|\beta^*_j - \beta_j\|}{\lambda_j}\|\Theta\| \le \frac{1}{3}R(\Theta_j) + \Psi(\mathcal{M})\|\Theta\|,$$
and summing over j,
$$R(\Theta_{\mathcal{M}^{\perp}}) - R(\Theta_{\mathcal{M}}) \le \frac{1}{3}R(\Theta) + N\Psi(\mathcal{M})\|\Theta\|.$$
Therefore $\Theta$ lies in $\mathcal{C}(\mathcal{M}, \mathcal{M}^{\perp}, R)$, and according to Condition 5,
$$Q^k_{1,n}(\beta^* + \Theta\mid\beta) - Q^k_{1,n}(\beta^*\mid\beta) \le \langle\nabla_\beta Q^k_{1,n}(\beta^*\mid\beta), \Theta\rangle - \frac{\lambda_n}{2}\|\Theta\|^2 \le \Delta_n R(\Theta) + \alpha\mu\kappa\|\beta - \beta^*\|\,\|\Theta\| - \frac{\lambda_n}{2}\|\Theta\|^2.$$
Using the optimality condition again,
$$Q^k_{1,n}(\beta^* + \Theta\mid\beta) - Q^k_{1,n}(\beta^*\mid\beta) \ge \sum_{j=1}^N \lambda_j R(\beta^*_j + \Theta_j) - \sum_{j=1}^N \lambda_j R(\beta^*_j) \ge -\sum_{j=1}^N \lambda_j R(\Theta_{j,\mathcal{M}}). \tag{1.7}$$
Combining the two bounds gives
$$\frac{\lambda_n}{2}\|\Theta\|^2 \le \Delta_n R(\Theta) + \alpha\mu\kappa\|\beta - \beta^*\|\,\|\Theta\| + \sum_{j=1}^N \lambda_j R(\Theta_{j,\mathcal{M}}) \le N\Delta_n\Psi(\mathcal{M})\|\Theta\| + \alpha\mu\kappa\|\beta - \beta^*\|\,\|\Theta\| + \frac{9}{2}\Psi(\mathcal{M})\sum_{j=1}^N \lambda_j\|\Theta\|,$$
so that
$$\|\Theta\| \le 9N\Psi(\mathcal{M})\frac{\Delta_n}{\lambda_n} + \frac{2}{\lambda_n}\alpha\mu\kappa\|\beta - \beta^*\| + \frac{2}{\lambda_n}\Psi(\mathcal{M})\sum_{j=1}^N \lambda_j = \frac{\Psi(\mathcal{M})}{\lambda_n}\Big(9N\Delta_n + \frac{2\alpha\mu\kappa}{\Psi(\mathcal{M})}\|\beta - \beta^*\|\Big) + 2\Psi(\mathcal{M})\frac{\sum_{j=1}^N \lambda_j}{\lambda_n} \le 5\Psi(\mathcal{M})\frac{\sum_{j=1}^N \lambda_j}{\lambda_n}.$$
Combining the results above, we conclude that if $M^{\beta,k}(\theta^{(t-1)}) \in \Omega$, then
$$\|M^{\beta,k}_n(\theta^{(t-1)}) - M^{\beta,k}(\theta^*)\| \le 5\Psi(\mathcal{M})\frac{\sum_{j=1}^N \lambda_j}{\lambda_n}.$$
We now let $\kappa^* := 5\alpha\mu\kappa/\lambda_n$ and assume $\kappa^* \le 3/4$. Then for any $\kappa \in [\kappa^*, 3/4]$ and $\Delta > 3\Delta_n$, we can set
$$\lambda^{(t)}_j = \frac{1-\kappa^t}{1-\kappa}\Delta + \kappa^t\frac{\lambda_n}{5\Psi(\mathcal{M})}\|\beta^{(0)} - M^{\beta,k}_{n,j}(\theta^*)\|,$$
which yields
$$\|M^{\beta,k}_n(\theta^{(t-1)}) - M^{\beta,k}(\theta^*)\| \le \frac{5\Psi(\mathcal{M})N}{\lambda_n}\frac{1-\kappa^t}{1-\kappa}\Delta + \kappa^t\|\beta^{(0)} - M^{\beta,k}_n(\theta^*)\|.$$
$$\le \kappa\|\theta - \theta^*\|_2 + \psi^2(k),$$
$$\le \varphi_n(\delta, k) + \|M^{\beta,k}_n(\theta^{(t-1)}) - M^{\beta,k}(\theta^*)\| + \|M^{\gamma,k}_n(\theta^{(t-1)}) - M^{\gamma,k}(\theta^{(t-1)})\|.$$
Here the second inequality is based on the following concentration inequalities: for a tolerance of δ, we let $\varphi_n(\delta, k)$ and $\gamma_n(\delta, k)$ be the smallest positive scalars such that
$$P\Big[\sup_{\theta\in\Omega}\|M_n(\theta) - M^k_n(\theta)\| \ge \varphi_n(\delta, k)\Big] \le \delta$$
and
$$P\Big[\sup_{\theta\in\Omega}\|M^{\gamma,k}_n(\theta) - M^{\gamma,k}(\theta)\| \ge \gamma_n(\delta, k)\Big] \le \delta.$$
Since we need to guarantee that the iterates do not leave $B_2(r; \beta^*)$, we require
$$\|\beta^{(t)} - \beta^*\| \le \|M^\beta_n(\theta^{(t-1)}) - M^{\beta,k}_n(\theta^{(t-1)})\| + \|M^{\beta,k}_n(\theta^{(t-1)}) - M^{\beta,k}(\theta^*)\| + \|M^{\beta,k}(\theta^*) - \beta^*\| \le \varphi_n(\delta, k) + \frac{5\Psi(\mathcal{M})N}{\lambda_n}\frac{1-\kappa^t}{1-\kappa}\Delta + \kappa^t\|\theta^{(0)} - M^{\beta,k}_n(\theta^*)\| + \varphi(k) \le r.$$
Consequently, we need
$$\varphi_n(\delta, k) + \frac{5\Psi(\mathcal{M})N}{\lambda_n}\frac{1-\kappa^t}{1-\kappa}\Delta + \varphi(k) \le (1-\kappa^t)r. \tag{1.9}$$
Therefore, under equation (1.9), summing the geometric series in equation (3.20), we have
$$\|\theta^{(t)} - \theta^*\| \le \kappa^t\|\theta^{(0)} - \theta^*\| + \frac{1}{1-\kappa}\Big\{\varphi_n(\delta, k) + \frac{5\Psi(\mathcal{M})N}{\lambda_n}\frac{1-\kappa^t}{1-\kappa}\Delta + \gamma_n(\delta, k) + (1+\kappa^t)\varphi(k)\Big\}. \tag{1.10}$$
The rest of the proof then consists of determining $\varphi_n(\delta, k)$ and $\gamma_n(\delta, k)$ from their concentration inequalities.
1.11 Proof of Theorem 2
Lemma 4. Suppose that the truncation level satisfies $k \gtrsim \log n$ and $n \ge C[(\|\beta^*\|_2 + \delta)/\|\beta^*\|_2]^2 s\log p$ for a sufficiently large C. Then
$$\gamma_n(\delta, k) \le C_0\sigma\sqrt{\frac{k^3\log(k^2/\delta)}{n}}.$$
Proof. Define weight functions $\omega_{\theta,1}$ and $\omega_{\theta,2}$, where
$$\omega_{\theta,2}(Y_{i-k}^{i+k}; X_{i-k}^{i+k}) = p(S_0 = S_1 = -1 \mid Y_{i-k}^{i+k}; X_{i-k}^{i+k}),$$
so that
$$M^{\gamma_1,k}_n(\theta) = \arg\max_{\gamma\in\Omega_\gamma}\Big\{\frac{1}{n}\sum_{i=1}^{n-1}\sum_{j=1}^N\sum_{k'=1}^N\tau^{k,(t)}_{i,jk'}\log p_{jk'} + \frac{1}{n}\sum_{i=1}^n\xi^{k,l}_{i,1}\log\pi_1\Big\} = \Pi_{\Omega_{\gamma_1}}\Big(\frac{1}{n}\sum_{i=1}^n\omega_{\theta,1}(Y_{i-k}^{i+k}; X_{i-k}^{i+k})\Big). \tag{1.11}$$
The weights $\omega_{\theta,1}$ are Lipschitz continuous in the parameters, that is, $|\omega_{\theta,1} - \omega_{\theta',1}| \le L\|\theta - \theta'\|_2$. As a consequence, we have
$$P\Big[\sup_{\theta\in\Omega}\Big|\frac{1}{m}\sum_{i=1}^m\omega_{\theta,1}(\tilde X_{i:2k}, \tilde Y_{i:2k}) - E\,\omega_{\theta,1}(\tilde X_{i:2k}, \tilde Y_{i:2k})\Big| \ge \beta_n\Big] \le \frac{\delta}{8k}.$$
Then, using arguments similar to the proof of inequality (47c) in (Yang, Balakrishnan, and Wainwright 2017), we have
$$\|M^{\gamma,k}_n(\theta^{(t-1)}) - M^{\gamma,k}(\theta^{(t-1)})\|_2 \le C_0\sigma\sqrt{\frac{k^3\log(k^2/\delta)}{n}}.$$
Lemma 5. For any geometrically $\rho_{\mathrm{mix}}$-mixing and time-reversible Markov chain $\{S_n\}$ with N states, there is a universal constant c such that
$$\sup_i\big|p(s_{i,0} = 1 \mid Y^n_k, X^n_k) - p(s_{i,0} = 1)\big| \le \frac{c(N+1)}{\bar\pi^3_{\min}\epsilon^3_{\mathrm{mix}}}\rho^k_{\mathrm{mix}}.$$
Proof. The proof follows from Lemma 12 in (Yang, Balakrishnan, and Wainwright 2017).

Lemma 6. Suppose that the truncation level satisfies $k \gtrsim \log n$ and $n \ge C[(\|\beta^*\|_2 + \delta)/\|\beta^*\|_2]^2 s\log p$ for a sufficiently large C. Then
$$\|M_n(\theta^{(t-1)}) - M^k_n(\theta^{(t-1)})\|_2 \le C\sqrt{\frac{\log(n/\delta)}{n}} + C\frac{\delta}{n}.$$
Proof. We have
$$\sup_{\theta\in\Omega}\|M_n(\theta) - M^k_n(\theta)\|_2^2 \le \frac{8\|Q_n - Q^k_n\|_\infty}{\lambda}, \qquad \text{where } \lambda \ge \frac{2}{3}(1 - b^2).$$
Therefore the remaining part of the proof is to bound the difference $\|Q_n - Q^k_n\|_\infty$. Let us denote
$$h(y_n, x_n, \beta_j, \sigma_i, \theta^{(t)}) = \log p(y_n \mid x_n, \beta_j, \sigma_i) + \sum_{j=1}^N p(S = i \mid S = j, \theta^{(t)})\log p_{ji}.$$
We obtain
$$\|Q_n - Q^k_n\|_\infty = \Big|\sup_{\theta,\theta^{(t)}}\frac{1}{n}\sum_{i=1}^n\sum_{t'=1}^n\big(\xi^{(t)}_{i,t'} - \xi^{k,l}_{i,t'}\big)h(y_n, x_n, \beta_j, \sigma_i, \theta^{(t)})\Big| + \Big|\sup_{\theta,\theta^{(t)}}\frac{1}{n}\sum_{i}p(S = i \mid X, Y, \theta^{(t)})\log\pi_i\Big|$$
$$\le \frac{2Cs^3}{\epsilon^8_{\mathrm{mix}}\bar\pi_{\min}}\Big[\frac{1}{n}\sum_{i=1}^k\tilde\rho^{\,i}_{\mathrm{mix}}\max_i\big|h(y_n, x_n, \beta_j, \sigma_i, \theta^{(t)})\big| + \log\bar\pi^{-1}_{\min}\Big] + \frac{CN^3\tilde\rho^{\,k}_{\mathrm{mix}}}{\epsilon^8_{\mathrm{mix}}\bar\pi_{\min}}\frac{1}{T-2k}\sum_{i=k}^{T-k}\max_i\big|h(y_n, x_n, \beta_j, \sigma_i, \theta^{(t)})\big|$$
$$\le \frac{2CN^3}{\epsilon^8_{\mathrm{mix}}\bar\pi_{\min}}\Big[\max_i\big|\log p(y_n \mid x_n, \beta_j, \sigma_i)\big|\sum_{i=1}^k\tilde\rho^{\,i}_{\mathrm{mix}} + \frac{N\log(\bar\pi_{\min}\epsilon_{\mathrm{mix}})^{-1}}{\bar\pi_{\min}T} + \log\bar\pi^{-1}_{\min}\Big] + \frac{CN^3\rho^k_{\mathrm{mix}}}{\epsilon^8_{\mathrm{mix}}\bar\pi_{\min}}\Big[E\max_i\big|\log f_j(y_n, x_n; \beta_j)\big| + e_{T-2k}(Y) + N\log(\bar\pi_{\min}\epsilon_{\mathrm{mix}})^{-1}\Big],$$
where
$$e_n(Y) := \Big|\frac{1}{n}\sum_{i=1}^n\max_i\big|\log p(y_n, x_n, \beta_j, \sigma_i, \theta^{(t)})\big| - E\max_i\big|\log p(y_n, x_n, \beta_j, \sigma_i, \theta^{(t)})\big|\Big| = \Big|\frac{1}{n}\sum_{i=1}^n\max_i\frac{(y_i - x_i^\top\beta_j)^2}{\sigma_i^2} - E\max_i\frac{(y_i - x_i^\top\beta_j)^2}{\sigma_i^2}\Big|.$$
We can show that
$$P\big(e_n(Y) \ge u\big) \le P\Big(\Big|\frac{1}{n}\sum_{i=1}^n\frac{(y_i - x_i^\top\beta_j)^2}{\sigma_i^2} - E\frac{(y_i - x_i^\top\beta_j)^2}{\sigma_i^2}\Big| \ge \frac{u}{n}\Big) \le CN\tilde k\Big[P\Big(\Big|\frac{1}{m}\sum_{t=1}^m\frac{(y_t - x_t^\top\beta_j)^2}{\sigma_i^2} - E\frac{(y_t - x_t^\top\beta_j)^2}{\sigma_i^2}\Big| \ge \frac{u}{n}\Big) + m\gamma(\tilde k)\Big],$$
where
$$\gamma(k) = \sup_{A\in\sigma(S^{(0)}_{-\infty}, S^{\infty}_k)}\big|P(A) - P^{(0)}_{-\infty}\times P^{\infty}_k(A)\big|.$$
Here $S^{(0)}_{-\infty}$ and $S^{\infty}_k$ are the sigma algebras generated by the random vectors $Y^{(0)}_{-\infty}$ and $Y^{\infty}_k$, respectively, and $P^{(0)}_{-\infty}\times P^{\infty}_k$ is the measure under which the sequences $Y^{(0)}_{-\infty}$ and $Y^{\infty}_1$ are independent. It is easy to show that
$$\gamma(k) = \big|f(y) - f(y^{\infty}_k)f(y^{(0)}_{-\infty})\big| \le \big|f(y^{\infty}_k \mid y^{(0)}) - f(y^{(0)}_{-\infty})\big|.$$
We choose
$$t := C_1\sqrt{\frac{\log\frac{\tilde k N}{\delta}}{m}}, \qquad \tilde k := \frac{C_2\log\big(\frac{3n}{\epsilon^3_{\mathrm{mix}}\bar\pi^3_{\min}\delta}\big)}{\log 1/\tilde\rho_{\mathrm{mix}}}.$$
Therefore,
$$\|M_n(\theta^{(t)}) - M^k_n(\theta^{(t)})\|_2 \le C\sqrt{\frac{\log(n/\delta)}{n}} + C\frac{\delta}{n}.$$
Lemma 7. Suppose that the truncation level satisfies $k \gtrsim \log n$ and $n \ge C[(\|\beta^*\|_2 + \delta)/\|\beta^*\|_2]^2 s\log p$ for a sufficiently large C. Then for any $\delta \in (0,1)$ we have
$$\Delta_n = C(\|\beta^*\|_2 + \delta)\sqrt{\frac{\log p}{n}}.$$
Proof. A similar proof can be found in Lemma 9 of (Yi and Caramanis 2015).

Lemma 8. Suppose that the truncation level satisfies $k \gtrsim \log n$ and
$$n \ge \max\Big\{\Big(\frac{54K(2C_1\log p)^{1/\gamma}}{\lambda_{\min}(\Sigma_X)}\Big)^{\frac{2-\gamma}{1-\gamma}},\ \Big(\frac{54K}{\lambda_{\min}(\Sigma_X)}\Big)^{\frac{1}{1-\gamma}}\frac{C_2}{C_1}\Big\}$$
for a sufficiently large C. Then we have
$$\lambda_n = \frac{1}{2}\lambda_{\min}(\Sigma_X), \qquad (\mathcal{M}, \overline{\mathcal{M}}) = (\mathrm{supp}(\beta^*), \mathrm{supp}(\beta^*)), \qquad \delta = 2n\exp\Big\{-\frac{(\lambda_{\min}(\Sigma_X))^\gamma}{(54K)^\gamma\,2C_1}n^\gamma\Big\}.$$
2 p54Kqγ 2C1
Proof. Here we prove a stronger version when xn1 follows a subweibull distribution. In
our problem we have
n ÿ
ÿ n n
ÿ
ptq
Qk1,n pθ, θptq q “ ξi,t log fj pyn , xn ; βj q ´ λi Rpβj q
i“1 i“1 i“1
Since we assume fi is a normal density coming from state i, the loss function part is
actually a weighted least square. We maximize the Qk1,n pθ, θptq q for different state j,
since each of them can be maximized separately. We let
n
ÿ ptq
Qk1,j,n pθ, θptq q “ ξi,t log fj pyn , xn ; βj q ´ λi Rpβj q.
i“1
n
ÿ ptq 1 pyn ´ xn βj q2
“ ξi,t p´ log 2π ´ log σi ´ q ´ λi Rpβj q
i“1
2 2σi2
1 1 1
“ ` .
γ γ1 γ 2
˜ˇ ˇ ¸
n " * " *
ˇ1 ÿ ˇ punqγ u2 n
Xn ˇ ą u ďn exp ´ γ ` exp ´ ,
ˇ ˇ
P ˇ
ˇ n i“1 ˇ K C1 C2 K 2
The proof is based on Theorem 1 of (Merlevède, Peligrad, and Rio 2011), under
our β-mixing and subweibull assumptions, we have:
" * " *
uγ u2
Pp|Σ̂n pωq ´ EΣ̂n pωq| ą uq ďn exp ´ ` exp ´
C1 C2 p1 ` T V q
" " **
u3 1 up1´γq γ
` exp ´ exp p q
C3 T C4 logpuq
here V is a worst case measure of the partial sum of the auto-covariances on the
dependent time series Xn , t “ 1, . . . , T , then for any geometrically decaying β-mixing,
according to (Viennet 1997), it can be shown that V can be bounded by a constant.
We just replace γ2 with γ2 {2, γ “ p1{γ1 ` 2{γ2 q´1 , then we have
" * " *
punqγ u2 n
Pp|Σ̂n pωq ´ EΣ̂n pωq| ą uq ďn exp ´ γ ` exp ´ .
K C1 C2 K 2
Let Kp2kq to be the set of 2k sparse vector with Euclidean norm to be 1. Then we
can get the uniform concentration of |Σ̂n pωq ´ EΣ̂n pωq| on Kp2kq:
" * " *
punqγ u2 n
Pp sup |Σ̂n pωq ´ EΣ̂n pωq| ą uq ďn exp ´ γ ` k log p ` exp ´ ` k log p .
ωPKp2kq K C1 C2 K 2
36
Then, based on Lemma 12 in (Loh and Wainwright 2011), we know that for $k \ge 1$, with probability
$$1 - n\exp\Big\{-\frac{(un)^\gamma}{K^\gamma C_1} + k\log p\Big\} - \exp\Big\{-\frac{u^2 n}{C_2 K^2} + k\log p\Big\},$$
$$|\hat\Sigma_n(\omega) - E\hat\Sigma_n(\omega)| \le 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big),$$
so that
$$\hat\Sigma_n(\omega) \le \omega^\top\Sigma_X\omega + 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big) \qquad \text{and} \qquad \hat\Sigma_n(\omega) \ge \omega^\top\Sigma_X\omega - 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big).$$
Let $u = \frac{1}{54}\lambda_{\min}(\Sigma_X)$; then
$$\hat\Sigma(\omega) \ge \lambda_{\min}(\Sigma_X)\|\omega\|_2^2 - 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big) = \frac{1}{2}\lambda_{\min}(\Sigma_X)\|\omega\|_2^2 - \frac{\lambda_{\min}}{2k}\|\omega\|_1^2.$$
We set
$$k\log p = \min\Big\{\frac{(un)^\gamma}{K^\gamma C_1}, \frac{u^2 n}{C_2 K^2}\Big\},$$
and if we want the minimum to be attained at the first term while also ensuring that $k \ge 1$, then we have:
When the sample size satisfies
$$n \ge \max\Big\{\Big(\frac{54K(2C_1\log p)^{1/\gamma}}{\lambda_{\min}(\Sigma_X)}\Big)^{\frac{2-\gamma}{1-\gamma}},\ \Big(\frac{54K}{\lambda_{\min}(\Sigma_X)}\Big)^{\frac{1}{1-\gamma}}\frac{C_2}{C_1}\Big\},$$
then with probability at least
$$1 - 2n\exp\Big\{-\frac{(\lambda_{\min}(\Sigma_X))^\gamma}{(54K)^\gamma\,2C_1}n^\gamma\Big\}$$
we have
$$\hat\Sigma(\omega) \ge \frac{1}{2}\lambda_{\min}(\Sigma_X)\|\omega\|_2^2 - \frac{(54K)^\gamma C_1}{2\lambda_{\min}(\Sigma_X)^{\gamma-1}}\|\omega\|_1^2.$$
Therefore we establish the restricted eigenvalue (RE) condition for equation (1.14). Conditions 3 and 4 are straightforward but tedious to verify and can be inferred from (Yi and Caramanis 2015). Finally, Lemmas 5 to 10, together with Theorem 1, conclude the proof of Theorem 2.
Chapter 2
Text classification can be done by manual labelling or automatic labelling. With the
growth of the size of text data, automatic labelling becomes more and more important
and unavoidable. Approaches to automatic text classification can be grouped into
three main categories: 1. Rule-based methods; 2. Machine learning based methods; 3.
Hybrid methods.
Rule-based methods classify texts into different labels using a set of pre-defined
rules. This requires deep domain knowledge and intensive manual work, and the resulting
systems are difficult to maintain. Machine learning based methods, on the other hand,
require pre-labeled samples as training data and then use supervised machine learning
methods to model the relationship between the text and the labels. Machine learning
methods are thus able to detect hidden patterns in the data, are more scalable,
and are easily applied to different tasks. Hybrid methods use a combination of
rule-based and machine learning methods to do the classification.
Machine learning methods have drawn increasing attention in recent years. Most
classical machine learning based models follow a popular two-step procedure:
in the first step, hand-crafted features are extracted from the text, and these features
are then fed to machine learning models for prediction. Some of the most
popular feature extraction methods are bag of words (BOW), term frequency-inverse
document frequency (TFIDF), and their extensions. However, these two-step
processes have several limitations. For example, reliance on hand-crafted features
requires tedious feature engineering and analysis to obtain good performance. The
strong dependence on domain knowledge for designing features makes these methods
difficult to generalize to other tasks. Finally, such models cannot take full advantage
of large amounts of training data because the features are pre-defined.
A paradigm shift started in 2012, when a deep learning based model,
AlexNet (Krizhevsky, Sutskever, and Hinton 2017), won the ImageNet competition by
a large margin. Since then, deep learning methods have been applied to a wide range
of tasks in computer vision and NLP, improving the state of the art ((Vaswani, Shazeer,
Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin 2017), (He, Zhang, Ren, and
Sun 2016), (Devlin, Chang, Lee, and Toutanova 2018), (Yang, Dai, Yang, Carbonell,
Salakhutdinov, and Le 2019)). These models learn feature representations and
perform classification or regression in an end-to-end fashion. They not only have the
ability to uncover hidden patterns in the data, but also generalize easily to
other tasks. They have become the mainstream framework for various text
classification tasks in recent years.
Among text classification tasks, document classification is a special sub-area,
since many of the large language models (e.g. BERT, GPT) limit the
number of input tokens. RNN models, on the other hand, fail when the document is
very long, because the memory of an RNN vanishes. There is only limited research
on document classification: (Yang, Yang, Dyer, He, Smola, and Hovy 2016) uses a
hierarchical model to extract features from documents, with word- and sentence-level
attention mechanisms, to classify documents; (Adhikari, Ram, Tang, and Lin 2019b)
proposes a simple, properly-regularized single-layer BiLSTM and achieves performance
similar to, or even better than, many very complex models; (Adhikari, Ram, Tang, and
Lin 2019a) takes advantage of knowledge distillation to compress the BERT model into
a simpler LSTM model.
2.2 Introduction
Stock movement prediction has been an attractive topic for both investors and
researchers for many years, dating back to (Fama 1965), who showed that there is no
significant autocorrelation in the daily returns of 30 stocks from the Dow Jones
Industrial Average. Similar studies were conducted by (Taylor 2008) and (Ding,
Granger, and Engle 1993), who find significant autocorrelation in squared and absolute
returns (i.e. volatility). These effects are also observed in intraday volatility patterns,
as demonstrated by (Wood, McInish, and Ord 1985) and (Andersen and Bollerslev 1997)
on absolute returns. These findings suggest that, given solely historical
stock returns, future stock returns are not predictable while volatility is.
However, many empirical studies show that the market is closer to being efficiently
inefficient (Pedersen 2019): prices can be pushed away from their fundamental values
by strong information. Due to market inefficiency, it takes time for the market
to price in new information. Representative significant events include, for example,
earnings reports, mergers, and split-offs. Figure 2.1 shows the price movement
on March 5, 2020 after the earnings conference call on the previous day. It took the
entire after-market hours plus an additional hour and a half, from 9:30 a.m. to around
11:00 a.m., to reach a new steady state. A useful and powerful way to extract signals
from news and reports is natural language processing (NLP), which many hedge funds
have shown to be profitable when designing trading strategies.
Figure 2.1: Zoom’s stock price movement after Q4 2020 earnings conference call.
Among previous research on using NLP for stock movement prediction,
public news and social media are the two primary content sources for stock market
prediction, and the models that use these sources are often discriminative. Among
them, classic research relies heavily on feature engineering ((Schumaker and Chen 2009);
(Oliveira, Cortez, and Areal 2013)). With the prevalence of deep neural networks
((Le and Mikolov 2014)), event-driven approaches were studied with structured event
representations ((Ding, Zhang, Liu, and Duan 2014); (Ding, Zhang, Liu, and Duan
2015)). Most recently, (Xu and Cohen 2018) presents a deep generative model for
a binary classification of stock price movement based on tweets and historic prices.
(Duan, Ding, and Liu 2018) learns sentence representations over tree structures for a
binary classification task of stock return movement in reaction to financial news.
In this paper, we combine established knowledge from the financial domain with
recent advances in NLP to create our two models, a BERT-based model and a deep
reinforced model: state-of-the-art neural models that jointly exploit financial and textual
data for stock movement prediction. We collect a comprehensive set of historical
financial data and enrich it with the natural language information revealed in recurring
events, so-called earnings calls; in these calls, the performance of publicly traded
companies is summarized and prognosticated by their management. We then train a
joint model to predict short-term risk following these calls.
Earnings calls are a rather underexplored type of disclosure for stock movement
prediction, despite their unique and interesting properties: they start with a scripted
presentation by the company management, followed by an open question-and-answer
session in which banking analysts can pose challenging questions to the executives.
Hence, unlike already well-explored disclosures such as the uniform and formal annual
report 10-K, earnings calls allow for unscripted, spontaneous interaction (Larcker and
Zakolyukina 2012) of high authenticity.
2.3 Problem Formulation
In this paper, we want to predict the stock movement on the next trading day right after the earnings conference call. We formulate the problem as a classification task based on the next day's intraday return.

Before defining the stock movement, we first calculate the maximum up and down moves since the open:
$$r_{\mathrm{up},t} = \frac{p_{\mathrm{high},t} - p_{\mathrm{open},t}}{p_{\mathrm{open},t}}, \qquad r_{\mathrm{down},t} = \frac{p_{\mathrm{low},t} - p_{\mathrm{open},t}}{p_{\mathrm{open},t}},$$
where $p_{\mathrm{open},t}$, $p_{\mathrm{high},t}$, and $p_{\mathrm{low},t}$ are the open, high, and low prices of day t. Based on these, we formulate our task as two classification problems:
Direction Prediction
In this case, we define the stock movement to be
$$r_{1,t} = \begin{cases} r_{\mathrm{up},t} & \text{if } r_{\mathrm{up},t} \ge -r_{\mathrm{down},t}, \\ r_{\mathrm{down},t} & \text{if } r_{\mathrm{up},t} < -r_{\mathrm{down},t}. \end{cases}$$
Here $\tau_1$ is a thresholding hyper-parameter, which needs to be tuned, used to turn $r_{1,t}$ into a class label.
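As an illustration, the following Python sketch constructs direction labels from daily open/high/low prices. The exact rule mapping $r_{1,t}$ to a class via $\tau_1$ is left implicit in the text, so the cutoff used here is an assumption.

```python
# A sketch of the direction-label construction; the tau_1 cutoff rule is assumed.
import pandas as pd

def direction_labels(prices: pd.DataFrame, tau1: float = 0.0) -> pd.Series:
    r_up = (prices["high"] - prices["open"]) / prices["open"]
    r_down = (prices["low"] - prices["open"]) / prices["open"]
    r1 = r_up.where(r_up >= -r_down, r_down)   # dominant intraday move r_{1,t}
    return (r1 >= tau1).astype(int) * 2 - 1    # +1 (up) / -1 (down), assumed cutoff at tau1
```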
Abnormal Return Prediction
Abnormal return is relevant to many direction-neutral strategies.
2.3.2 Features
We consider both textual and financial features which are defined below.
Textual Features We split each transcript into two sections: presentation and question-and-answers. Each of the two sections is represented by a vector t, i.e. $t_P$, $t_Q$: the tokens $w_1, w_2, \ldots, w_n$ of a transcript section are represented by embeddings $w^{(1)}, w^{(2)}, \ldots, w^{(n)}$, and the BERT model then composes these tokens into a document/chunk representation t. Embeddings of size 768 are trained with our BERT-based model and deep reinforced model on the earnings transcript data.
Financial Features
In addition to the text features, for each transcript we also include a set of financial features, since companies with different characteristics typically react differently, in terms of stock movement, to new information.
Earnings surprise is the difference between the actual earnings per share and its consensus estimate, and is one of the main drivers of drastic movements after an earnings report.
Market volatility is closely related to the volatility of each stock. We use the CBOE
Volatility Index (VIX) to measure it. We retrieve the VIX for the trading day before
the earnings conference call.
Past volatility is another strong indicator of the movement after the earnings conference
call. We calculated the daily volatility for the past week and past month for the stock
before its earnings conference call.
Past return reflects investors' expectations for the stock; for example, if the stock recently went down a lot but the actual earnings beat the consensus by a wide margin, the stock has a larger probability of soaring after the earnings call. We calculate the return for the week and the month before the earnings day.
Book to market ratio is the ratio of the book value and market value of the stock,
which shows the price level of the company compared to its intrinsic value.
Market value captures the size of the company; research shows that smaller companies tend to be more volatile than large companies, and therefore tend to react more drastically to strong information.
Our first model, the BERT-based model, is a neural model incorporating both the textual and financial features to predict the stock movement of the next trading day after the earnings day. Many previous works use linear models that extract text features from the document, e.g. bags of key words, TFIDF scores, etc. Intuitively, the relationship between the market movement and the features is very complex due to hidden or indirect relationships, and such term-frequency text features lose a lot of the information in the document. Therefore, we train an end-to-end neural model to do the prediction. The structure of the BERT-based model is shown in Figure 2.2.
We split the earnings call transcripts into two parts: presentation and questions-and-answers. In the presentation section, the company's management team typically reports the key financial information and summarizes the company's performance. In the question-and-answer section, investors then ask informed questions about the company. Intuitively, the two sections are therefore expected to contain different information. For each section, the tokens of that section are encoded by a deep transformer (BERT). The BERT model only allows an input of at most 512 tokens, including the two special tokens [CLS] and [SEP], so we keep only the 100th to 610th tokens of each section. We drop the first 100 tokens because there is always an uninformative part at the beginning, which mainly introduces who the speakers are and who will join the Q&A session. However, even using only these 510 tokens, we still achieve state-of-the-art performance in our experiments, better even than the Hierarchical Attention Network (HAN) applied to the whole section. This shows that the opening paragraphs already contain enough information to learn the sentiment of each section, whereas an LSTM fails when the sequence is long. Each token $w_i$ is transformed into a contextualized embedding $c_i$, $i = 1, \ldots, 512$. By the design of BERT, we use the first embedding $c_1$, corresponding to the special token [CLS], as the embedding of that section. The obtained embeddings of the presentation section $c_P$ and the question-and-answer section $c_Q$ are concatenated and fed into a one-layer multilayer perceptron (MLP) to generate the textual features $f_{\mathrm{textual}}$.
On the other hand, we pass all the financial features into another MLP to generate the final financial features $f_{\mathrm{financial}}$. We then concatenate $f_{\mathrm{textual}}$ and $f_{\mathrm{financial}}$ to form the final features, which are fed into a third MLP for classification.
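As an illustration, the following PyTorch sketch mirrors the fusion architecture described above. The hidden sizes, the use of a single shared BERT encoder for both sections, and the pretrained checkpoint name are illustrative assumptions rather than the dissertation's exact configuration.

```python
# A minimal sketch of the BERT-based fusion classifier (illustrative hyper-parameters).
import torch
import torch.nn as nn
from transformers import BertModel

class BertEarningsClassifier(nn.Module):
    def __init__(self, n_fin_features, n_classes=2, hidden=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # assumed checkpoint
        self.text_mlp = nn.Sequential(nn.Linear(2 * 768, hidden), nn.ReLU())
        self.fin_mlp = nn.Sequential(nn.Linear(n_fin_features, hidden), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_classes))

    def forward(self, pres_ids, pres_mask, qa_ids, qa_mask, fin):
        # c_1 ([CLS]) embedding of each truncated section
        c_p = self.bert(input_ids=pres_ids, attention_mask=pres_mask).last_hidden_state[:, 0]
        c_q = self.bert(input_ids=qa_ids, attention_mask=qa_mask).last_hidden_state[:, 0]
        f_text = self.text_mlp(torch.cat([c_p, c_q], dim=-1))   # textual features
        f_fin = self.fin_mlp(fin)                                # financial features
        return self.classifier(torch.cat([f_text, f_fin], dim=-1))
```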
The drawback of the BERT-based model is that the input size of BERT is at most 512 tokens (around 400 words), which is impractical in many real problems. For example, in our earnings data, the average transcript contains around 6,000 words, far more than the BERT limit. To tackle this issue, there are two possible directions: (1) summarize the long document into a shorter one and then feed it into BERT; (2) select chunks or sentences. The first method does not work in many cases because of its two-step nature. For example, in our application of using earnings transcripts to predict stock movement, if we summarize the data we only obtain structural information about the transcript, which is already known beforehand. However, the information that can move the stock price is the "surprise", or "new information": for example, how the executives explain the financial data or how they answer the questions from investors. Therefore, in such situations we should use an end-to-end rather than a two-step method, because the final classification performance can guide the selection of the most appropriate content; this motivates the second direction, chunk or sentence selection.

We therefore propose using a reinforcement learning model to do chunk/sentence selection for the BERT-based model. The overall process is shown in Figure 2.3. The model consists of four components: a policy network, a BERT embedding network, a financial data network, and a classification network. The policy network adopts a stochastic policy, which samples a random action at each state. The whole document is divided into several chunks; the policy network then goes through the chunks and produces an action ({retain, delete}) sequence for them. The selected chunks are treated as the input to the BERT model. The embedding from BERT, combined with the features generated by the financial network, then goes to the classification network for classification. The classification performance provides the reward for the policy network.
Policy Network
The policy network follows a stochastic policy $\pi(a_t \mid s_t; \Theta)$ and uses the classification performance as a delayed reward to guide policy learning. To perform chunk selection, we sample actions for all chunks. Once all actions are decided, the selected chunks are treated as the input to the BERT model to generate embeddings for the text. Combined with the features generated by the financial data network, these features are used by the classification network to compute $P(y \mid X)$, where $X$ is the feature set including the network outputs from both BERT and the financial data network. The reward of the classification is used as a reward for policy learning.

Figure 2.3: The Deep Reinforced Model's architecture.
Here we briefly introduce the state, action, policy, reward and objective function
as follows:
State: At each time step, the state $s_t$ is the concatenation of two parts: the representation of the previously selected chunks and the representation of the current chunk. The previously selected chunks serve as context and provide a prior notion of importance. Both parts are represented by pre-trained BERT embeddings.
Action: The action space at each step is binary: {Retain, Delete}. If the action is Retain, the current chunk is added to the selected chunks; otherwise, the current chunk is discarded. The final selected text is the concatenation of all chunks selected at different steps.
Reward: The reward comes at the end, after actions have been sampled for the entire sequence. We use the logarithm of the classification probability of the true label of the input $X$, i.e. $P(y = c_g \mid X)$, where $c_g$ is the gold label. Moreover, to encourage the model to delete more chunks so as to satisfy BERT's 512-token limitation, we include an additional term computed from the proportion of chunks being deleted. Here $\Theta$ denotes the set of policy parameters, including $W$ and $b$, and $a_t$ and $s_t$ are the action and state at step $t$. During training, the action at each step is sampled with the probability given by the policy network. After sampling over the whole document, the delayed reward is calculated. For validation and testing, the action is picked by $\arg\max_a \pi(a \mid s_t; \Theta)$.
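Since the exact weighting of the deletion bonus is not reproduced here, the following is a small sketch of one plausible form of the delayed reward, with the weight gamma as an illustrative placeholder.

```python
import math

def delayed_reward(prob_true_label, num_deleted, num_chunks, gamma=0.1):
    """Sketch of the delayed reward: log-probability of the gold label plus a bonus
    proportional to the fraction of deleted chunks. The weight `gamma` is an
    illustrative placeholder, not the value used in the dissertation."""
    return math.log(prob_true_label) + gamma * num_deleted / num_chunks
```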
Objective Function: The training is guided by the REINFORCE algorithm, which optimizes the policy to maximize the expected reward:
$$J(\Theta) = \mathbb{E}_{\tau \sim \prod_t \pi_\Theta(a_t \mid s_t)}\, R_\tau,$$
where $\tau$ represents a sampled trajectory $\{a_1, a_2, \ldots, a_T\}$. The gradient is calculated as:
$$\nabla_\Theta J(\Theta) = \mathbb{E}_\tau \sum_t \nabla_\Theta \log \pi_\Theta(a_t \mid s_t)\, R_\tau \;\approx\; \frac{1}{N}\sum_{i=1}^{N}\sum_t \nabla_\Theta \log \pi_\Theta\big(a^i_t \mid s^i_t\big)\, R^i_\tau,$$
where $N$ is the number of sampled trajectories, and $R^i_\tau$ equals the delayed reward from the downstream classifier at the last step.
Additionally, to encourage exploration and avoid local optima, we add an entropy regularization term to the policy objective:
$$J_{\mathrm{reg}}(\Theta) = \frac{\lambda}{N}\sum_{i=1}^{N}\frac{1}{T_i}\,\nabla_\Theta H\big(\pi(s^i_t;\Theta)\big),$$
where $H$ is the entropy, $\lambda$ is the regularization strength, and $T_i$ is the trajectory length. However, this objective function by itself still cannot guarantee that no more than 512 tokens are chosen. We therefore add a final step that keeps the chunks/sentences with the highest predicted probability $\pi(a_t \mid s_t; \Theta)$.
Finally, the downstream classification network and the policy network are warm-started by training them separately, and are then trained jointly.
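The following is a schematic sketch of the REINFORCE update described above, assuming a simple linear policy head over concatenated BERT chunk embeddings; the class ChunkPolicy and the entropy weight lam are illustrative assumptions, not the exact implementation.

```python
# Schematic REINFORCE loss for the chunk-selection policy; all shapes and names are
# illustrative. The policy maps a state [context ; current chunk] to {retain, delete}.
import torch
import torch.nn as nn

class ChunkPolicy(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.linear = nn.Linear(2 * dim, 2)   # two actions: retain / delete

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.linear(state))

def reinforce_loss(policy, states, actions, reward, lam=0.01):
    """states: (T, 2*dim) tensor, actions: (T,) tensor of 0/1, reward: scalar R_tau."""
    dist = policy(states)
    log_probs = dist.log_prob(actions)    # log pi_Theta(a_t | s_t)
    entropy = dist.entropy().mean()       # entropy regularization term
    # Negative sign: maximizing the expected reward = minimizing this loss.
    return -(log_probs.sum() * reward + lam * entropy)
```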
2.5 Experiments
Ticker   Date         Transcript   Return   Earn_Surp   VIX      ...   Book to Market Ratio
HST      2007-07-19   ######        0.010      0.174    16.000   ...   0.443
MRO      2007-08-01   ######       -0.074      0.047    23.52    ...   0.387
MS       2007-12-20   ######        0.039     -2.842    21.68    ...   0.562
SKX      2008-04-23   ######       -0.019     -0.611    20.87    ...   0.712
...      ...          ...           ...        ...      ...      ...   ...
We collected 141,965 earnings call transcripts from Seekingalpha.com over the period from 2007 to early 2020, covering the most recent economic cycles. Company names, tickers, and dates are also saved to align with the financial data.

The stock price data are downloaded through the Yahoo Finance API. If the earnings call happens after the market close, our stock movement measures are calculated using the next day's data; if instead the call happens before the market open, we use that day's stock prices to calculate the measures. We also use this data source to obtain the VIX and to calculate past volatility and past returns.

The earnings surprise, book-to-market ratio, and market value are calculated using data downloaded from the CRSP and CRSP/Compustat Merged databases, which we access via the WRDS Python API.

Since fetching financial data from WRDS is very time consuming, we randomly sample 1,500 transcripts from the 141,965 samples and obtain data for all variables. A sketch of our data is shown in Table 2.1. Figure 2.4 shows the returns distribution of the 1,500 samples.
Figure 2.4: Distribution of returns among the 1,500 samples.

The training set covers 2007 to Oct 2018, the validation set from Oct 2018 to July 2019, and the test set from July 2019 to February 2020.
We treat the transcript data differently for the BERT-based model and the deep reinforced model. For the BERT-based model, we keep the 100th to the 610th token of each transcript, together with the two extra special tokens [CLS] and [SEP], to satisfy the 512-token limit of BERT. For the deep reinforced model, the text input is the whole transcript, but we divide it into chunks of 255 tokens each, which means the model will choose the two most important chunks as the final input to the BERT model.

We set the embedding size of the BERT model to 768 and the batch size to 32. For the financial-network MLP and the final combination MLP, we use two fully connected layers each. We train the final model with the Adam optimizer with an initial learning rate of 0.00001.
We use two evaluation metrics. The first is the classical performance measure, classification accuracy. However, when the samples are imbalanced, accuracy can be high even when the classifier always predicts the majority class. To deal with the imbalance issue, previous work (Xie, Passonneau, Wu, and Creamer 2013), (Ding, Zhang, Liu, and Duan 2014), and (Ding, Zhang, Liu, and Duan 2015) uses the Matthews Correlation Coefficient (MCC) to measure performance. It is a single summary value of the confusion matrix, which has been shown to be more informative than accuracy and the F1 score. In this paper, we also use this metric to compare with previous work. For binary classification, the MCC is calculated as:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
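For completeness, a direct computation of the MCC from the confusion-matrix counts might look as follows (the same quantity is also available as matthews_corrcoef in scikit-learn).

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for binary classification."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```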
2.5.4 Baselines
• Random guess: a naive estimator that randomly guesses the class labels using the corresponding class proportions.
• Majority class: another naive estimator that always guesses the majority class.
• SVM: a support vector machine using TF-IDF scores calculated for chosen words to do classification (Luss and d'Aspremont 2015).
Model Accuracy MCC
Random guess 33.56 0.001
Majority class 37.22 0.000
ARIMA 34.26 0.013
SVM 35.61 0.038
HAN 38.23 0.091
BERT-based Model 40.35 0.120
Deep Reinforced Model 41.57 0.127
Predicting the direction of stock movements is a very challenging task. Here we choose $\tau_1 = 0.03$, so that the three classes in the dataset are almost balanced. We show the performance of the baseline models and our models in Table 2.2. Our BERT-based model achieves a new state of the art of 40.35 in accuracy and 0.120 in MCC, outperforming HAN significantly, by 2.12 in accuracy and 0.029 in MCC. It is worth mentioning that our BERT-based model only uses the first 510 tokens of both the presentation and the questions-and-answers sections, while HAN uses all of the tokens in each section. We believe there are two major reasons: (1) according to many studies in extractive summarization, the opening paragraph can be a good summary of the whole article, a typical phenomenon in news and reports; (2) the capacity of BERT is much larger than that of HAN, and the self-attention mechanism has also been shown to be more powerful than an RNN at learning dependencies. Our deep reinforced model achieves even better results, with an accuracy of 41.57 and an MCC of 0.127. This means that the reinforcement learning does help us find the most important chunks; however, the difference between the BERT-based model and the deep reinforced model is not very large, which suggests that the first 510 tokens have already set the sentiment of the whole document.
Model Accuracy MCC
Random guess 61.50 0.009
Majority class 73.98 0.000
ARIMA 61.23 0.021
SVM 65.32 0.081
HAN 68.53 0.116
BERT-based Model 72.23 0.156
Deep Reinforced Model 73.38 0.161
Though slightly better than random guessing, a classic time series model such as ARIMA does not yield satisfying results. The majority class's MCC is 0 by construction of the formula. The performance of (Luss and d'Aspremont 2015) is only slightly better than random guessing and ARIMA, and still close to no predictability, which is consistent with their finding that their model has no power to predict the direction of stock movements. This is due to the loss of information from using only TF-IDF to extract information from the document, and also because the capacity of an SVM is much smaller than that of complicated neural network models.

This task is relatively easier than Task 1, since we only need to predict the magnitude rather than the direction, and predicting the absolute return is essentially predicting volatility; many studies have found predictability in abnormal returns, e.g. (Taylor 2008), (Ding, Granger, and Engle 1993), and (Luss and d'Aspremont 2015).

Here we set $\tau_2 = 0.03$, so the proportion of type 1 is 0.31, which makes the data imbalanced; we therefore add weights to the loss function. According to Table 2.3, our BERT-based model and deep reinforced model still achieve state-of-the-art performance. The reasons are the same as in Task 1. One thing worth mentioning is that, since the data in this task are imbalanced, the majority class does better than random guessing and even ARIMA, while its MCC is still 0; the SVM performs much better in this task, since the original paper (Luss and d'Aspremont 2015) already found predictive power for abnormal returns using an SVM.

Figure 2.5: Returns distribution of label 1 (left) and label -1 (right).
Alpha Distribution
In this section, we present descriptive statistics on the distribution of alpha over the week after the earnings conference call, to determine on which day we should make the prediction. We split the data into three parts corresponding to label 1 (going up), 0 (neutral), and -1 (going down), and focus on labels 1 and -1, since those are the cases in which we want to trade. Figure 2.5 shows the return distribution over different days for label 1 (left) and label -1 (right); we can see that for both labels, the largest positive return occurs on the first day after the earnings conference call. To show that these abnormal returns are not due to market beta, we estimate the market beta, exclude the part of the return explained by beta, and regard the residuals as alpha. In Figure 2.6, we can see that most of the alpha also comes from the first day; on the second day there is a small reversion, while for the other days the returns look random. This supports making our prediction for the first day after the earnings call.
Figure 2.6: Alpha distribution of label 1 (left) and label -1 (right).
Direction betting
We imitate the strategy used by (Lavrenko, Schmill, Lawrie, Ogilvie, Jensen, and Allan 2000) and (Ding, Zhang, Liu, and Duan 2015) to demonstrate the performance of trading on our predictions. If our model predicts that on the next trading day the stock price will go up by more than 2% (type 1), we invest $10,000 to buy the stock at the opening price, assuming we can buy in dollar amounts. If the price instead drops by more than 2%, meaning the prediction is wrong, we sell right away to stop the loss. Otherwise, we sell at the end of the day at the closing price. The resulting P&L is shown in Figure 2.7. The P&L includes the $10,000 base for easier comparison.
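A minimal sketch of this betting rule, using the day's low as the stop-loss trigger (an approximation of selling "right away"), is given below; the function and argument names are illustrative.

```python
def directional_bet_pnl(pred_up, open_price, low_price, close_price, notional=10_000):
    """Sketch of the direction-betting rule: if the model predicts an up-move of more
    than 2% (type 1), buy `notional` dollars at the open; if the price instead drops by
    more than 2%, stop out at the -2% level; otherwise sell at the close."""
    if not pred_up:
        return 0.0
    shares = notional / open_price
    if low_price <= open_price * 0.98:      # stop-loss triggered during the day
        return shares * (open_price * 0.98 - open_price)
    return shares * (close_price - open_price)
```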
Delta Hedging
In this section, we use a delta hedging strategy to evaluate our prediction of abnormal returns. Delta ($\Delta$) measures the rate of change of the option price with respect to changes in the underlying asset's price; it is the first derivative of the option price $V$ with respect to the underlying instrument's price $S$:
$$\Delta = \frac{\partial V}{\partial S}.$$
Figure 2.7: Cumulative return of directional betting for the test data from 2019-07-17 to 2020-02-21, 103 trades in total.
Delta hedging uses the underlying stock to hedge out the market risk of the option (or the opposite direction), so that the profit is purely a bet on volatility. We obtain options data from OptionMetrics in the WRDS database. We buy call options at the close of the trading day before the earnings call, and short delta shares of the underlying stock for each call option, maintaining a total value of $10,000 across the long and short positions. We then unwind the position at the market close on the next day. However, if the total value invested drops by 2%, we close the position right away to prevent a serious drawdown. Figure 2.8 shows the P&L of applying our strategy.
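A stylized sketch of the per-event delta-hedged P&L is given below; the position-sizing rule, which splits the $10,000 across the long call leg and the short stock leg, is an illustrative assumption rather than the exact rule used here.

```python
def delta_hedged_pnl(call_t0, call_t1, delta, spot_t0, spot_t1, notional=10_000):
    """Sketch of one delta-hedging trade: buy calls at the close before the earnings
    call, short `delta` shares of stock per call, and unwind at the next day's close.
    The sizing below makes the long and short legs sum to `notional` in absolute value."""
    n_calls = notional / (call_t0 + delta * spot_t0)
    option_leg = n_calls * (call_t1 - call_t0)
    stock_leg = -n_calls * delta * (spot_t1 - spot_t0)
    return option_leg + stock_leg
```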
Figure 2.8: Cumulative return of delta hedging for the test data from 2018-10-31 to 2019-06-27, 63 trades in total.
is for direction prediction. We can see that text contributes more to the prediction. Although the financial data contribute almost nothing to the final prediction by themselves, combining them with the text features increases performance. An intuitive example: size by itself does not have much predictive power, since according to the efficient market hypothesis this information has already been priced in; however, small companies react more drastically to the information coming from the text, since their stock prices are more volatile. Similar results can be observed for the abnormal-return prediction. Another interesting observation is that the abnormal-return prediction performance is better than the direction prediction performance. This also meets our expectation, as (Luss and d'Aspremont 2015) found in their paper, while our method shows predictive power even for the direction prediction task.
2.7 Conclusion
introduce the deep reinforced model. We tested our models on a new comprehensive dataset and showed that they both perform better than previous strong baselines.
Chapter 3
3.1 Introduction
are known. In the first two scenarios, we employ either a time-series regression (TSR)
or a cross-sectional regression (CSR) to estimate the unknown loadings or factors,
using either portfolios and ETFs as proxies for factors, or characteristics as proxies for
factor loadings. In the third scenario, we employ the principal component analysis
(PCA) to identify latent factors and their loadings. Combining the estimated factors
and/or their loadings yields the low-rank component of the covariance matrix. With
respect to the sparse component, we adopt a variety of thresholding methods that
warrant positive semi-definite estimates of the covariance matrix.
In addition, we use the large cross section of transaction-level prices available at
high frequencies. High frequency data provide a unique opportunity to measure the
variation and covariation among stock returns. The massive amount of data facilitates
the use of simple nonparametric estimators within a short window, such as the sample
covariance matrix estimator on a daily, weekly, or monthly basis, so that several issues
associated with parametric estimation using low-frequency time series covering a long
timespan become irrelevant, such as structural breaks and time-varying parameters,
see, e.g., (Aït-Sahalia and Jacod 2014) for a review. However, the microstructure
noise and the asynchronous arrival of trades, which come together with intraday data,
result in biases of the sample covariance estimator with data sampled at a frequency
higher than, say, every 15 minutes, exacerbating the curse of dimensionality due to
data elimination.
By adapting the pre-averaging estimator designed for low-dimensional covariance
matrix estimation, e.g., (Jacod, Li, Mykland, Podolskij, and Vetter 2009), we construct
noise-robust estimators for large covariance matrices, making use of a factor model
in each of the three aforementioned scenarios. Using the large deviation theory of
martingales, we establish the desired consistency of our covariance matrix estimators
under the infinity norm (on the vector space) and the weighted quadratic norm, as
well as the precision matrix estimators under the operator norm. Moreover, we show
that TSR converges as the sample size increases, regardless of a fixed or an increasing dimension. By contrast, the convergence of CSR and PCA requires a joint increase of the dimensionality and the sample size – the so-called blessings of dimensionality, see (Donoho et al. 2000).
Empirically, we analyze the out-of-sample risk of optimal portfolios in a horse race
among various estimators of the covariance matrix as inputs. The portfolios comprise
constituents of Dow Jones 30, S&P 100, and S&P 500 indices, respectively. We find
covariance matrix estimators based on pre-averaged returns sampled at refresh times
outperform those based on returns subsampled at a fixed 15-minute frequency, for
almost all combinations of estimation strategies and thresholding methods. Moreover,
either TSR or PCA, plus the location thresholding that utilizes the Global Industry
Classification Standards (GICS) codes, yield the best performance for constituents of
the S&P 500 and S&P 100 indices, whereas TSR dominates in the case of Dow Jones
30.
Our paper is closely related to a growing literature on continuous-time factor
models for high frequency data. (Fan, Furger, and Xiu 2016) and (Aït-Sahalia and
Xiu 2017) develop the asymptotic theory for large dimensional factor models with
known and unknown factors, respectively, assuming a synchronous and noiseless
dataset. Their simulations show a clear break-down of either TSR or PCA when noise
is present and the sampling frequency is beyond every 15 minutes. (Pelger 2015a) and
(Pelger 2015b) develop the central limit results of such models in the absence of the
noise. Their asymptotic results are element-wise, whereas the theoretical results in
this paper focus on matrix-wise properties. (Wang and Zou 2010) propose the first
noise-robust covariance matrix estimator in the high dimensional setting, by imposing
the sparsity assumption on the covariance matrix itself, see also (Tao, Wang, and
Zhou 2013), (Tao, Wang, Yao, and Zou 2011), (Tao, Wang, and Chen 2013), and
(Kim, Wang, and Zou 2016) for related results. (Brownlees, Nualart, and Sun 2017)
impose the sparsity condition on the inverse of the covariance matrix or the inverse of
the idiosyncratic covariance matrix. By contrast, we impose the sparsity assumption
on the covariance of the idiosyncratic components of a factor model, as motivated
from the economic theory, which also fits the empirical data better.
Our paper is also related to the recent literature on the estimation of low-
dimensional covariance matrix using noisy high frequency data. The noise-robust
estimators include, among others, the multivariate realized kernels by (Barndorff-
Nielsen, Hansen, Lunde, and Shephard 2011), the quasi-maximum likelihood estimator
by (Aït-Sahalia, Fan, and Xiu 2010) and (Shephard and Xiu 2017), the pre-averaging
estimator by (Christensen, Kinnebrock, and Podolskij 2010), the local method of
moments by (Bibinger, Hautsch, Malec, and Reiss 2014), and the two-scale and
multi-scale estimators by (Zhang 2011) and (Bibinger 2012). (Shephard and Xiu 2017)
document the advantage of using a factor model in their empirical study, even when the
dimension of assets is as low as 13. We build our estimator based on the pre-averaging
method because of its simplicity in deriving the in-fill and high-dimensional asymp-
totic results. Allowing for an increasing dimensionality asymptotics sheds light on
important statistical properties of the covariance matrix estimators, such as minimum
and maximum eigenvalues, condition numbers, etc, which are critical for portfolio
allocation exercises. (Aït-Sahalia and Xiu 2017) develop a related theory of principal
component analysis for low-dimensional high frequency data.
Our paper is also related to the recent literature on large covariance matrix
estimation with low frequency data. (Fan, Fan, and Lv 2008) propose a large covariance
matrix estimator using a strict factor model with observable factors. (Fan, Liao, and
Mincheva 2011) extend this result to approximate factor models. (Fan, Liao, and
Mincheva 2013) develop the POET method for models with unobservable factors.
Alternative covariance matrix estimators include the shrinkage method by (Ledoit
and Wolf 2004) and (Ledoit and Wolf 2012), and the thresholding method proposed
by (Bickel and Levina 2008a), (Bickel and Levina 2008b), (Cai and Liu 2011) and
(Rothman, Levina, and Zhu 2009).
Our paper shares the theoretical insight with existing literature of factor models
specified in discrete time. (Bai and Ng 2002) and (Onatski 2010) propose estimators
to determine the number of factors. (Bai 2003) develops the element-wise inferential
theory for factors and their loadings. These papers, including ours, allow for more
general models than the approximate factor models introduced in (Chamberlain and
Rothschild 1983). The above factor models are static, as opposed to the dynamic
factor models in which the lagged values of the unobserved factors may also affect the
observed dependent variables. Inference on dynamic factor models are developed in
(Forni, Hallin, Lippi, and Reichlin 2000), (Forni and Lippi 2001), (Forni, Hallin, Lippi,
and Reichlin 2004), and (Doz, Giannone, and Reichlin 2011); see (Croux, Renault,
and Werker 2004) for a discussion.
Factor models based on stock characteristics date back to (Rosenberg 1974), which
suggests to model factor betas of stocks as linear functions of observable security
characteristics. (Connor and Linton 2007) and (Connor, Hagmann, and Linton 2012)
further extend this model to allow for nonlinear or nonparametric functions. One
of our covariance matrix estimators, namely, CSR, is designed to leverage the linear
factor model with characteristics as loadings. Such an estimation strategy is widely
used in the financial industry, but is largely ignored by the academic literature.1 Our
asymptotic analysis of this estimator fills in this gap.
The rest of the paper is structured as follows. In Section 3.2, we set up the model
and discuss model assumptions. Section 3.3 proposes the estimation procedure for
each scenario of the factor model and establishes the asymptotic properties of these
estimators. Section 3.4 discusses the choice of tuning parameters. Section 3.5 provides
Monte Carlo simulation evidence. Section 3.6 evaluates these estimators in an out-
1 Barra Inc., which was acquired by MSCI Inc., was a leading provider of this type of covariance matrices to practitioners, see, e.g., (Kahn, Brougham, and Green 1998).
of-sample optimal portfolio allocation race. The appendix contains all mathematical
proofs. The paper is published in Journal of Econometrics (Dai, Lu, and Xiu 2019).
3.2.1 Notation
Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, \mathbb{P})$ be a filtered probability space. Throughout this paper, we use $\lambda_j(A)$, $\lambda_{\min}(A)$, and $\lambda_{\max}(A)$ to denote the $j$-th (in descending order), the minimum, and the maximum eigenvalues of a square matrix $A$. In addition, we use $\|A\|_1$, $\|A\|$, $\|A\|_F$, and $\|A\|_\Sigma$ to denote the $L_1$ norm, the operator norm (or $L_2$ norm), the Frobenius norm, and the weighted quadratic norm of a matrix $A$, that is, $\max_j \sum_i |A_{ij}|$, $\sqrt{\lambda_{\max}(A^\top A)}$, $\sqrt{\mathrm{Tr}(A^\top A)}$, and $d^{-1/2}\|\Sigma^{-1/2} A \Sigma^{-1/2}\|_F$, respectively. Notice that $\|A\|_\Sigma$ is only defined for a $d \times d$ square matrix. We use $I_d$ to denote a $d \times d$ identity matrix. All vectors are regarded as column vectors, unless otherwise specified. When $A$ is a vector, both $\|A\|$ and $\|A\|_F$ are equal to its Euclidean norm. We also use $\|A\|_{\mathrm{MAX}} = \max_{i,j}|A_{ij}|$ to denote the $L_\infty$ norm of $A$ on the vector space. We use $e_i$ to denote a $d$-dimensional column vector whose $i$th entry is 1 and 0 elsewhere. We write $A_n \asymp B_n$ if $|A_n|/|B_n| = O(1)$. We use $C$ to denote a generic constant that may change from line to line.
$$Y_t = \beta X_t + Z_t, \qquad (3.1)$$
in which $X_t$ is a continuous Itô semimartingale, that is,
$$X_t = \int_0^t h_s\, ds + \int_0^t \eta_s\, dW_s, \qquad Z_t = \int_0^t f_s\, ds + \int_0^t \gamma_s\, dB_s,$$
where $W_s$ and $B_s$ are standard Brownian motions. In addition, $h_s$ and $f_s$ are progressively measurable. Moreover, the processes $\eta_s$ and $\gamma_s$ are càdlàg, and, writing $e_s = \eta_s\eta_s^\top$ and $g_s = \gamma_s\gamma_s^\top$, the matrices $e_s$, $e_{s-}$, $g_s$, and $g_{s-}$ are positive-definite. Finally, for all $1 \le i, j \le r$ and $1 \le k, l \le d$, $|\beta_{kj}| \le C$ for some $C > 0$, and there exists a locally bounded process $H_s$ such that $|h_{i,s}|$, $|\eta_{ij,s}|$, $|\gamma_{kl,s}|$, $|e_{ij,s}|$, $|f_{kl,s}|$, and $|g_{kl,s}|$ are bounded by $H_s$ uniformly for $0 \le s \le t$.
$$\Sigma = \beta E \beta^\top + \Gamma,$$
where
$$\Sigma = \frac{1}{t}\int_0^t c_s\, ds, \qquad \Gamma = \frac{1}{t}\int_0^t g_s\, ds, \qquad E = \frac{1}{t}\int_0^t e_s\, ds.$$
Assumption 3. $E$ has distinct eigenvalues, with $\lambda_{\min}(E)$ bounded away from 0. Moreover, there exists some positive-definite matrix $B$ such that $\|d^{-1}\beta^\top\beta - B\| = o(1)$ as $d \to \infty$, and $\lambda_{\min}(B)$ is bounded away from 0.
norm bound for the precision matrix. Such an assumption may also be restrictive in
that it excludes the existence of weak factors, see, e.g., (Onatski 2010). Dealing with
weak factors requires a rather different setup, so we leave it for future work.
3.2.3 Sparsity
$$m_d = \max_{i \le d} \sum_{j \le d} |\Gamma_{ij}|^q, \quad \text{for some } q \in [0, 1).$$
$$\|\Gamma\| \le \|\Gamma\|_1 = \max_{i \le d} \sum_{j \le d} |\Gamma_{ij}| = O(m_d).$$
Therefore, imposing $m_d/d \to 0$ creates a gap between the eigenvalues of the low-rank ($\beta E\beta^\top$) and the sparse ($\Gamma$) components of the covariance matrix $\Sigma$, which is essential for identification in a latent factor model specification, and for estimation of the factor models in general.
To use sparsity for estimation, we define a class of thresholding functions $s_\lambda(z): \mathbb{R} \to \mathbb{R}$ which satisfy:
(i) $|s_\lambda(z)| \le |z|$; (ii) $s_\lambda(z) = 0$ for $|z| \le \lambda$; (iii) $|s_\lambda(z) - z| \le \lambda$.
As discussed in (Rothman, Levina, and Zhu 2009), condition (i) imposes shrinkage, condition (ii) enforces thresholding, and condition (iii) restricts the amount of shrinkage to be no more than $\lambda$. These three requirements on $s_\lambda(\cdot)$ ensure desirable statistical properties of the estimated covariance matrix. Examples of such thresholding functions we use include hard thresholding, soft thresholding, the smoothly clipped absolute deviation (SCAD) ((Fan and Li 2001)), and the adaptive lasso (AL) ((Zou 2006)):
$$s^{\mathrm{Hard}}_\lambda(z) = z\,1(|z| > \lambda), \qquad s^{\mathrm{Soft}}_\lambda(z) = \mathrm{sign}(z)\,(|z| - \lambda)_+, \qquad s^{\mathrm{AL}}_\lambda(z) = \mathrm{sign}(z)\,\big(|z| - \lambda^{\eta+1}|z|^{-\eta}\big)_+,$$
$$s^{\mathrm{SCAD}}_\lambda(z) = \begin{cases} \mathrm{sign}(z)\,(|z| - \lambda)_+, & \text{when } |z| \le 2\lambda;\\ \dfrac{(a-1)z - \mathrm{sign}(z)\,a\lambda}{a-2}, & \text{when } 2\lambda < |z| \le a\lambda;\\ z, & \text{when } a\lambda < |z|. \end{cases}$$
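For reference, these four thresholding rules can be written as simple NumPy functions; the SCAD constant $a = 3.7$ and the adaptive-lasso exponent $\eta = 1$ below are common defaults used purely for illustration.

```python
import numpy as np

def s_hard(z, lam):
    return z * (np.abs(z) > lam)

def s_soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def s_adaptive_lasso(z, lam, eta=1.0):
    # assumes z != 0 (in practice applied to nonzero off-diagonal covariance entries)
    return np.sign(z) * np.maximum(np.abs(z) - lam**(eta + 1) * np.abs(z)**(-eta), 0.0)

def s_scad(z, lam, a=3.7):
    absz = np.abs(z)
    return np.where(absz <= 2 * lam,
                    np.sign(z) * np.maximum(absz - lam, 0.0),
                    np.where(absz <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))
```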
3.2.4 Microstructure Noise
There are three scenarios of factor models we analyze, depending on whether the factor
X or its loading β are known or not. We use the term “known” instead of “observable”,
because even if we assume that the factor X can be proxied by certain portfolios in
the literature, for instance the Fama-French three factors by (Fama and French 1993),
we allow for potential microstructure noise so that the true factors are always latent
in our setup.
The first scenario assumes that $X$ is known, in which case we denote the observed factor as $X^\star$:
$$Y^\star_{t^i_j} = Y_{t^i_j} + \varepsilon^y_{t^i_j}, \qquad X^\star_{t^i_j} = X_{t^i_j} + \varepsilon^x_{t^i_j},$$
where $\varepsilon^y$ and $\varepsilon^x$ are additive noises associated with the observations at the sampling times $t^i_j$, and $t^i_j$ denotes the arrival time of the $j$th transaction of asset $i$. We can thereby rewrite the factor model (3.1) as:
$$Y^\star_{t^i_j} = \beta X^\star_{t^i_j} + Z^\star_{o,t^i_j},$$
where $Z^\star_{o,t^i_j} = Z_{t^i_j} + \varepsilon^y_{t^i_j} - \beta\varepsilon^x_{t^i_j}$. Apart from the noise, this model is a standard linear regression. In the empirical study, we regard those portfolios that are useful for explaining the cross-section of expected asset returns as factors, including the five Fama-French factors ((Fama and French 2015)) and the momentum factor ((Carhart 1997)).2 We also add industry portfolios, as suggested by (King 1966).
2 While these factors explain the cross-section of expected returns, they also account for significant variations in the time series of realizations.

The second scenario assumes that $\beta$ is known and perfectly observed, yet $Y$ is again contaminated, so we can write the model as
$$Y^\star_{t^i_j} = \beta X_{t^i_j} + Z^\star_{u,t^i_j},$$
where $Z^\star_{u,t^i_j} = Z_{t^i_j} + \varepsilon^y_{t^i_j}$. This model dates back to (Rosenberg 1974), who developed a
factor model of stock returns in which the factor loadings of stocks are linear functions
of observable security characteristics. This is equivalent to a model with characteristics
as βs associated with some linear latent factors. In the empirical study, we use 13
characteristics obtained from the MSCI Barra, a leading company that provides factors
and covariance matrices using this method.
In the third scenario, we only observe a noisy $Y$, so the model can be written in the same form as (3.3). This model is a "noisy" version of the approximate factor model of (Chamberlain and Rothschild 1983) and (Bai 2003), which can only be identified as the dimension of $Y$ increases to $\infty$, thanks to the "blessing" of dimensionality.
With respect to the microstructure noise, following (Kim, Wang, and Zou 2016), we assume
$$\varepsilon_{i,t^i_j} = u_{i,t^i_j} + v_{i,t^i_j},$$
where, for some $\alpha \ge 0$ and for each $i \le d$, writing $\Delta t^i_l = t^i_l - t^i_{l-1}$,
$$u_{i,t^i_j} = \sum_{l=0}^{\infty} \rho_{i,t^i_{j-l}}\,\xi_{i,t^i_{j-l}}, \qquad v_{i,t^i_j} = \sum_{l=0}^{\infty} b_{i,t^i_{j-l}}\,\big(\Delta t^i_{j-l}\big)^{-\alpha}\,\big[\widetilde{B}_{i,t^i_{j-l}} - \widetilde{B}_{i,t^i_{j-l-1}}\big].$$
We assume the $\xi_{i,t^i_l}$ are random variables with mean 0, independent across distinct indices, with $\sum_{l=0}^\infty|\rho_{i,t^i_l}| < \infty$ uniformly in $i$, and with $\xi_{i,t^i_l}$ independent of the filtration $\{\mathcal{F}_t\}$ generated by $X$ and $Z$. Moreover, $\widetilde{B}_i$ is a Brownian motion independent of $\xi_i$ but potentially correlated with the $W$ and $B$ of Assumption 1, and $b_{i,t^i_l}$ is adapted to the filtration $\{\mathcal{F}_{t^i_l}\}$ and bounded in probability, with $\sum_{l=0}^\infty|b_{i,t^i_l}| < \infty$ uniformly in $i$.
The microstructure noise has two independent components: u and v, where u
allows for serial dependence, and v allows for correlation with returns. This assumption
is motivated from the microstructure theory and existing empirical findings that order
flows tend to cluster in the same direction and that they tend to be correlated with
returns in the short run, see, e.g., (Hasbrouck 2007), (Brogaard, Hendershott, and
Riordan 2014). This assumption is therefore more realistic in particular for data
sampled at ultra-high frequencies. A similar assumption is also adopted by (Kalnina
and Linton 2008) and (Barndorff-Nielsen, Hansen, Lunde, and Shephard 2011).
3.2.5 Asynchronicity
Since the transactions arrive asynchronously, it is common to adopt the refresh time sampling scheme proposed by (Martens 2004) prior to estimation. The first refresh time is defined as $t_1 = \max\{t^1_1, t^2_1, \ldots, t^d_1\}$. The subsequent refresh times are defined recursively as
$$t_{j+1} = \max\Big\{t^1_{N^1_{t_j}+1}, \ldots, t^d_{N^d_{t_j}+1}\Big\},$$
where $N^i_{t_j}$ is the number of transactions for asset $i$ prior to time $t_j$. We denote by $n$ the resulting sample size after refresh time sampling.
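A minimal sketch of refresh-time sampling, assuming each asset's transaction times are given as a sorted array, is as follows.

```python
import numpy as np

def refresh_times(transaction_times):
    """`transaction_times` is a list of d sorted arrays of time stamps, one per asset.
    Returns the refresh times t_1, t_2, ... as defined above."""
    d = len(transaction_times)
    idx = [0] * d                                        # next unused observation per asset
    refresh = []
    t = max(times[0] for times in transaction_times)     # first refresh time
    while True:
        refresh.append(t)
        # advance each asset to its first transaction strictly after t
        for i in range(d):
            times = transaction_times[i]
            while idx[i] < len(times) and times[idx[i]] <= t:
                idx[i] += 1
            if idx[i] >= len(times):
                return np.array(refresh)                 # some asset has no further updates
        t = max(transaction_times[i][idx[i]] for i in range(d))
```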
Effectively, this sampling scheme selects the most recent transactions at refresh
times, which avoids adding zero returns artificially, because by design all assets have at
least one update between two refresh times. By comparison, the alternative previous-
tick subsampling scheme by (Zhang 2011) discards more data in order to avoid artificial
zero returns. That said, the refresh time scheme is notoriously influenced by the most
illiquid asset, which largely determines the number of observations after sampling.
Pair-wise refresh time as suggested by (Aït-Sahalia, Fan, and Xiu 2010), is more
efficient for entry-wise consistency of the covariance matrix. In this paper, since
our focus is the consistency under matrix-wise norms, we adopt the refresh time
sampling throughout. We also find empirically that the refresh time approach delivers satisfactory performance. We make the following assumption on the observation times, following (Kim, Wang, and Zou 2016): $c_1\bar{n} \le n \le c_2\bar{n}$ holds with probability approaching 1, where $c_1$ and $c_2$ are some positive constants.
There is a large literature devoted to the modeling of durations, i.e., time intervals
between adjacent transactions, since the seminal paper by (Engle and Russell 1998),
which proposes an autoregressive conditional duration (ACD) model and shows that
this parametric model can successfully describe the evolution of time durations for
(heavily traded) stocks. Our focus here is the covariance matrix of returns, so we are
agnostic about the dynamics of durations. The independence between durations and
prices is a strong assumption yet commonly used in the literature, with the exception
of (Li, Mykland, Renault, Zhang, and Zheng 2014). This assumption means we can conduct our inference treating the times of trades, and therefore the refresh times, as fixed. To simplify the notation, we treat $n$ as if it were deterministic in what follows. Also, we use $Y^\star_{t_i}$ to denote the most recent observation prior to or at the $i$th refresh time, and relabel the associated noise as $\varepsilon^x_i$ and $\varepsilon^y_i$. Note that $Y^\star_{t_i} - \varepsilon^y_i$ is not necessarily equal to $Y_{t_i}$; instead, it equals the value of $Y$ at the actual transaction times. The same convention applies to $X^\star$ and $Z^\star$.
3.3 Three Estimators and their Asymptotic Properties
We now proceed to the estimators and their asymptotic properties. Our results rely on the joint large sample ($n \to \infty$) and increasing dimensionality ($d \to \infty$) asymptotics, with a fixed number of factors $r$ and a fixed time window $[0, t]$.
To deal with the bias due to the microstructure noise, we adopt the pre-averaging method proposed by (Jacod, Li, Mykland, Podolskij, and Vetter 2009) to pre-weight the returns. Specifically, we divide the whole sample into a sequence of blocks, with the size of each block being $k_n$, which satisfies:
$$\frac{k_n}{n^{1/2+\delta}} = \theta + o\big(n_\delta^{-1/2}\big), \qquad (3.4)$$
where $\theta$ and $\delta$ are positive constants, and $n_\delta^{-1/2} := n^{-1/4+\delta/2} + n^{-2\delta}$.
As shown in (Christensen, Kinnebrock, and Podolskij 2010) and (Kim, Wang, and Zou 2016), our choice of $k_n$ is not optimal ($\delta = 0$) for estimating the covariance matrix. However, the resulting estimator is simpler, robust to dependent and endogenous noise, and relies on fewer tuning parameters. More importantly, it guarantees a positive semi-definite covariance matrix estimate in any finite sample, a desirable property that our subsequent thresholding procedure relies on.
The returns in each block are weighted by a piecewise continuously differentiable function $g$ on $[0,1]$, with a piecewise Lipschitz derivative $g'$. Moreover, $g(0) = g(1) = 0$ and $\int_0^1 g^2(s)\,ds > 0$. A simple example of $g$ would be $g(s) = s \wedge (1-s)$. We define $\psi_1 = \phi_1(0)$, $\psi_2 = \phi_2(0)$, where
$$\phi_1(s) = \int_s^1 g'(u)\,g'(u-s)\,du, \qquad \phi_2(s) = \int_s^1 g(u)\,g(u-s)\,du.$$
For a sequence of vectors $V$, we define its weighted average return $\bar{V}$ by
$$\bar{V}_i = \sum_{j=1}^{k_n-1} g\Big(\frac{j}{k_n}\Big)\,\Delta^n_{i+j}V, \qquad \text{for } i = 0, \ldots, n - k_n + 1,$$
where $\Delta^n_i V = V_i - V_{i-1}$, for $i = 1, 2, \ldots, n$, and $V_i \in \{X_{t_i}, Y_{t_i}, Z_{t_i}, X^\star_{t_i}, Y^\star_{t_i}, Z^\star_{t_i}, \varepsilon^x_i, \varepsilon^y_i\}$.
In what follows we propose three estimators corresponding to different scenarios
of the factor model. Each estimator uses the sample covariance matrix of these
pre-averaged returns as an input, which leads to robustness to the microstructure
noise. Intuitively, the effect of the noise is dominated by a strengthened return signal
of each block.
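A small NumPy sketch of the pre-averaging step with the simple weight function $g(s) = s \wedge (1-s)$ is given below; it assumes the log-prices have already been aligned on the refresh-time grid.

```python
import numpy as np

def g(s):
    """Simple weight function g(s) = min(s, 1 - s)."""
    return np.minimum(s, 1.0 - s)

def preaveraged_returns(prices, kn):
    """`prices` is an (n+1) x d array of log-prices at refresh times; returns the
    (n - kn + 2) x d array of weighted average returns
    V_bar_i = sum_{j=1}^{kn-1} g(j/kn) * Delta_{i+j} V."""
    returns = np.diff(prices, axis=0)         # Delta^n_i V, i = 1, ..., n
    n = returns.shape[0]
    weights = g(np.arange(1, kn) / kn)        # g(j/kn), j = 1, ..., kn-1
    bar = np.empty((n - kn + 2, prices.shape[1]))
    for i in range(n - kn + 2):
        bar[i] = weights @ returns[i:i + kn - 1]
    return bar
```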
$$U := (Y^\top, X^\top)^\top, \qquad \bar{U}^\star := (\bar{Y}^{\star\top}, \bar{X}^{\star\top})^\top,$$
$$\widehat{\Pi} = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t}\sum_{i=0}^{n-k_n+1} \bar{U}^\star_i\,\bar{U}^{\star\top}_i,$$
where $\bar{U}^\star_i$ is the $(i+1)$th column of $\bar{U}^\star$. We then construct estimators of each component of the covariance matrix as:
$$\widehat{\beta} = \widehat{\Pi}_{12}\big(\widehat{\Pi}_{22}\big)^{-1}, \qquad \widehat{E} = \widehat{\Pi}_{22}, \qquad \widehat{\Gamma} = \widehat{\Pi}_{11} - \widehat{\Pi}_{12}\big(\widehat{\Pi}_{22}\big)^{-1}\widehat{\Pi}_{21},$$
$$\widehat{\Sigma}_{\mathrm{TSR}} = \widehat{\beta}\,\widehat{E}\,\widehat{\beta}^\top + \widehat{\Gamma}^{\mathcal{S}}.$$
We postpone the choice of the thresholding method $s(\cdot)$ and the tuning parameters $\lambda_{ij}$ to Section 3.4, where we provide a procedure to guarantee the positive semi-definiteness of $\widehat{\Sigma}_{\mathrm{TSR}}$.
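The following is a schematic NumPy sketch of the TSR construction above; the argument threshold_fn stands in for whichever thresholding rule $s_{\lambda_{ij}}(\cdot)$ and tuning procedure of Section 3.4 is adopted, and is assumed rather than specified here.

```python
import numpy as np

def tsr_covariance(Y_bar, X_bar, kn, psi2, t, threshold_fn):
    """Y_bar (m x d) and X_bar (m x r) are pre-averaged returns of assets and observed
    factors, with m = n - kn + 2; `threshold_fn` thresholds the off-diagonal entries
    of Gamma_hat while keeping its diagonal."""
    m, d = Y_bar.shape
    n = m + kn - 2
    U_bar = np.hstack([Y_bar, X_bar])           # columns: assets then factors
    scale = n / (m * psi2 * kn * t)
    Pi = scale * U_bar.T @ U_bar                # pre-averaged second-moment matrix
    Pi11, Pi12 = Pi[:d, :d], Pi[:d, d:]
    Pi21, Pi22 = Pi[d:, :d], Pi[d:, d:]
    beta_hat = Pi12 @ np.linalg.inv(Pi22)
    E_hat = Pi22
    Gamma_hat = Pi11 - Pi12 @ np.linalg.inv(Pi22) @ Pi21
    Gamma_S = threshold_fn(Gamma_hat)
    return beta_hat @ E_hat @ beta_hat.T + Gamma_S
```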
We provide the convergence rates under several matrix norms for both the covariance matrix estimator $\widehat{\Sigma}_{\mathrm{TSR}}$ and its inverse:

Theorem 3. Suppose that Assumptions 1 - 5 hold, and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$. Then we have
$$\big\|\widehat{\Sigma}_{\mathrm{TSR}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big),$$
$$\big\|\widehat{\Sigma}_{\mathrm{TSR}} - \Sigma\big\|_{\Sigma} = O_p\big(d^{1/2}\,n_\delta^{-1}\log d + (n_\delta^{-1}\log d)^{(1-q)/2}\,m_d\big),$$
$$\big\|\widehat{\beta} - \beta\big\| = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big).$$
Moreover, if $\big(n_\delta^{-1}\log d\big)^{(1-q)/2} m_d = o(1)$, we have
$$\Big\|\big(\widehat{\Sigma}_{\mathrm{TSR}}\big)^{-1} - \Sigma^{-1}\Big\| = O_p\Big(\big(n_\delta^{-1}\log d\big)^{(1-q)/2} m_d\Big).$$
where
$$\check{X} = (\beta^\top\beta)^{-1}\beta^\top Y, \qquad \check{E}^\star = \frac{1}{t}\check{X}\check{X}^\top, \qquad \check{\Gamma}^\star = \frac{1}{t}\big(Y - \beta\check{X}\big)\big(Y - \beta\check{X}\big)^\top,$$
and
$$\check{\Gamma}^{\star\mathcal{S}} = \big(\check{\Gamma}^{\star\mathcal{S}}_{ij}\big), \qquad \check{\Gamma}^{\star\mathcal{S}}_{ij} = \begin{cases} \check{\Gamma}^\star_{ij}, & i = j,\\ s_{\lambda_{ij}}\big(\check{\Gamma}^\star_{ij}\big), & i \ne j. \end{cases}$$
This estimator is similar in spirit to the covariance matrix estimator provided by the
MSCI Barra, see (Kahn, Brougham, and Green 1998). We analyze its properties as
follows:
Theorem 4. Suppose that Assumptions 1 - 3, 5 hold, and $n^{-1/2}\sqrt{\log d} = o(1)$. Then we have
$$\big\|\widehat{\Sigma}^{\star}_{\mathrm{CSR}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big),$$
$$\big\|\widehat{\Sigma}^{\star}_{\mathrm{CSR}} - \Sigma\big\|_{\Sigma} = O_p\Big(\big[n^{-1/2}d^{1/4}\sqrt{\log d} + d^{-1/4}m_d^{1/2}\big]^2 + \big[n^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big]^{1-q} m_d\Big),$$
$$\big\|\check{X} - X\big\| = O_p\big(d^{-1/2}m_d^{1/2}\big).$$
Moreover, if $\big[n^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big]^{1-q} m_d = o(1)$, we have
$$\Big\|\big(\widehat{\Sigma}^{\star}_{\mathrm{CSR}}\big)^{-1} - \Sigma^{-1}\Big\| = O_p\Big(\big[n^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big]^{1-q} m_d\Big).$$
Theorem 4 shows that the CSR estimator does not converge under $\|\cdot\|_{\mathrm{MAX}}$ when $d$ is fixed, due to the second term $d^{-1/2}m_d^{1/2}$, unlike the TSR estimator. This is not surprising, as the cross-sectional regression exploits an increasing dimensionality to estimate $X$. The first term $n^{-1/2}\sqrt{\log d}$ is the same as that in the convergence rate of TSR in the absence of noise (see (Fan, Furger, and Xiu 2016)), because both approaches estimate $\Gamma$ based on a thresholded sample covariance estimator. It is also interesting to compare this convergence rate with that of the PCA estimator given by (Aït-Sahalia and Xiu 2017), which is $n^{-1/2}\sqrt{\log d} + d^{-1/2}m_d$. The rate improvement in the CSR estimator comes from the second term and is due to the knowledge of $\beta$. Overall, the convergence rate of CSR depends on a striking trade-off between $n$ and $d$.
In the general scenario where the noise plagues the data, we construct a pre-averaging based covariance matrix estimator:
$$\widehat{\Sigma}_{\mathrm{CSR}} = \beta\,\check{E}\,\beta^\top + \check{\Gamma}^{\mathcal{S}},$$
where
$$\check{X}^\star = (\beta^\top\beta)^{-1}\beta^\top\bar{Y}^\star, \qquad \check{E} = \frac{n}{n-k_n+2}\,\frac{1}{\psi_2 k_n t}\,\check{X}^\star\check{X}^{\star\top},$$
$$\check{\Gamma} = \frac{n}{n-k_n+2}\,\frac{1}{\psi_2 k_n t}\,\big(\bar{Y}^\star - \beta\check{X}^\star\big)\big(\bar{Y}^\star - \beta\check{X}^\star\big)^\top,$$
and
$$\check{\Gamma}^{\mathcal{S}} = \big(\check{\Gamma}^{\mathcal{S}}_{ij}\big), \qquad \check{\Gamma}^{\mathcal{S}}_{ij} = \begin{cases} \check{\Gamma}_{ij}, & i = j,\\ s_{\lambda_{ij}}(\check{\Gamma}_{ij}), & i \ne j. \end{cases}$$
Theorem 5. Suppose that Assumptions 1 - 5 hold, and $n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2} = o(1)$. Then we have
$$\big\|\widehat{\Sigma}_{\mathrm{CSR}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big),$$
$$\big\|\widehat{\Sigma}_{\mathrm{CSR}} - \Sigma\big\|_{\Sigma} = O_p\Big(\big[n_\delta^{-1/2}d^{1/4}\sqrt{\log d} + d^{-1/4}m_d^{1/2}\big]^2 + \big[n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big]^{1-q} m_d\Big),$$
$$\big\|\check{X}^\star - \bar{X}^\star\big\| = O_p\big(d^{-1/2}m_d^{1/2}\big).$$
Moreover, if $\big[n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big]^{1-q} m_d = o(1)$, we have
$$\Big\|\big(\widehat{\Sigma}_{\mathrm{CSR}}\big)^{-1} - \Sigma^{-1}\Big\| = O_p\Big(\big[n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d^{1/2}\big]^{1-q} m_d\Big).$$
Compared to the results of Theorem 4, the rates in Theorem 5 remain the same
except that n is replaced by nδ , which is the effective sample size when noise is present.
The same intuition as in the no-noise case holds as well.
Without prior knowledge of the factors or their loadings, we apply PCA to the pre-averaged covariance matrix estimate based on $\bar{Y}^\star$:
$$\widetilde{\Sigma} = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t}\sum_{i=0}^{n-k_n+1} \bar{Y}^\star_i\,\bar{Y}^{\star\top}_i.$$
Suppose that $\widehat{\lambda}_1 > \widehat{\lambda}_2 > \ldots > \widehat{\lambda}_d$ are the simple eigenvalues of $\widetilde{\Sigma}$, and that $\widehat{\xi}_1, \widehat{\xi}_2, \ldots, \widehat{\xi}_d$ are the corresponding eigenvectors. We write
$$\widetilde{\Sigma} = \sum_{j=1}^{\widehat{r}} \widehat{\lambda}_j\,\widehat{\xi}_j\widehat{\xi}_j^\top + \widetilde{\Gamma}, \qquad (3.6)$$
$$\widetilde{\Gamma}^{\mathcal{S}} = \big(\widetilde{\Gamma}^{\mathcal{S}}_{ij}\big), \qquad \widetilde{\Gamma}^{\mathcal{S}}_{ij} = \begin{cases} \widetilde{\Gamma}_{ij}, & i = j,\\ s_{\lambda_{ij}}(\widetilde{\Gamma}_{ij}), & i \ne j, \end{cases}$$
and the resulting estimator of $\Sigma$ is:
$$\widehat{\Sigma}_{\mathrm{PCA}} = \sum_{j=1}^{\widehat{r}} \widehat{\lambda}_j\,\widehat{\xi}_j\widehat{\xi}_j^\top + \widetilde{\Gamma}^{\mathcal{S}}. \qquad (3.7)$$
This estimator is motivated from the POET strategy by (Fan, Liao, and Mincheva
2013) for low-frequency data, and adapted from the PCA approach by (Aït-Sahalia
and Xiu 2017) for high-frequency data. The idea behind this strategy is that the
eigenvectors corresponding to the first r eigenvalues of the sample covariance matrix
can be used to construct proxies of β, so that the low rank component of Σ can be
approximated by the first term on the right hand side of (3.6). The second term
then approximates the sparse component of Σ, which leads to the construction of the
estimator given by (3.7).
This estimator can also be constructed from a least-squares point of view, seeking $F$ and $G$ such that
$$(F, G) = \arg\min_{F \in \mathcal{M}_{d\times\widehat{r}},\, G \in \mathcal{M}_{\widehat{r}\times n}} \big\|\bar{Y}^\star - FG\big\|_F^2,$$
(Bai and Ng 2002) and (Bai 2003) propose this estimator to estimate factors and their loadings in a factor model for low-frequency data.
Once we have estimates of the factors and loadings, we can obtain the same $\widetilde{\Gamma}$ and $\widetilde{\Gamma}^{\mathcal{S}}$ as above by:
$$\widetilde{\Gamma} = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t}\,\big(\bar{Y}^\star - FG\big)\big(\bar{Y}^\star - FG\big)^\top, \qquad \widetilde{\Gamma}^{\mathcal{S}} = \big(\widetilde{\Gamma}^{\mathcal{S}}_{ij}\big), \quad \widetilde{\Gamma}^{\mathcal{S}}_{ij} = \begin{cases} \widetilde{\Gamma}_{ij}, & i = j,\\ s_{\lambda_{ij}}(\widetilde{\Gamma}_{ij}), & i \ne j, \end{cases}$$
with which we can construct
$$\widehat{\Sigma}_{\mathrm{PCA}} = t^{-1} FGG^\top F^\top + \widetilde{\Gamma}^{\mathcal{S}}, \qquad (3.8)$$
which also provides estimates of the factors and their loadings (up to some rotation).
To determine the number of factors $r$, we propose the following estimator using a penalty function:
$$\widehat{r} = \arg\min_{1 \le j \le r_{\max}} \big(d^{-1}\lambda_j(\widetilde{\Sigma}) + j \times f(n,d)\big) - 1,$$
where $r_{\max}$ is some upper bound. This estimator is similar to that of (Aït-Sahalia and Xiu 2017), which shares its spirit with (Bai and Ng 2002). The penalty function $f(n,d)$ satisfies two criteria. On the one hand, the penalty is dominated by the signal, i.e., the value of $d^{-1}\lambda_j(\Sigma)$, for $1 \le j \le r$. Because $d^{-1}\lambda_r(\Sigma)$ is $O_p(1)$ as $d$ increases, we select a penalty that shrinks to 0. On the other hand, we require the penalty to dominate the estimation error as well as $d^{-1}\lambda_j(\Sigma)$ for $r+1 \le j \le d$ to avoid overshooting. The choice of $r_{\max}$ does not play any role in theory, yet it warrants an economically meaningful estimate of $\widehat{r}$ in a finite sample or in practice.
Theorem 6. Suppose that Assumptions 1 - 5 hold. Also, $n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d = o(1)$, $f(n,d) \to 0$, and $f(n,d)\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1}m_d\big)^{-1} \to \infty$. Then we have
$$\big\|\widehat{\Sigma}_{\mathrm{PCA}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big),$$
$$\big\|\widehat{\Sigma}_{\mathrm{PCA}} - \Sigma\big\|_{\Sigma} = O_p\Big(d^{1/2}n_\delta^{-1}\log d + d^{-1/2}m_d^2 + m_d\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big)^{1-q}\Big).$$
Also, there exists an $r \times r$ matrix $H$ such that, with probability approaching 1, $H$ is invertible, $\|HH^\top - I_r\| = o_p(1)$, and
$$\big\|F - \beta H\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big),$$
$$\big\|G - H^{-1}X\big\| = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big).$$
If in addition $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, then $\lambda_{\min}(\widehat{\Sigma}_{\mathrm{PCA}})$ is bounded away from 0 with probability approaching 1, and
$$\Big\|\big(\widehat{\Sigma}_{\mathrm{PCA}}\big)^{-1} - \Sigma^{-1}\Big\| = O_p\Big(m_d^3\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big)^{1-q}\Big).$$
is given. With a large number of observations, the estimates are not sensitive to the choice of $k_n$ even when $d$ is moderately large. In simulations and empirical studies, we adopt a range of $\theta$s, and thus $k_n$s, all of which lead to similar estimates with identical interpretations.
$$f(n,d) = \mu\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1}m_d\big)^{\kappa} \cdot \mathrm{median}\big(\{\widehat{\lambda}_1, \ldots, \widehat{\lambda}_d\}\big), \qquad (3.9)$$
for some tuning parameters $\mu$ and $\kappa$. One might also use the perturbed eigenvalue ratio estimator in (Pelger 2015a) to determine $r$, which gives almost the same result in simulations. The latter requires one fewer tuning parameter, but its proof is more involved. Alternatively, as argued by (Aït-Sahalia and Xiu 2017), we can simply regard $r$ as a tuning parameter from a practical point of view. In fact, the performance of the estimator is not sensitive to $\widehat{r}$, as long as $\widehat{r}$ is greater than or equal to $r$ yet not too large, as our simulation results show. We conjecture that the same convergence rate holds for our estimators as long as $\widehat{r} \ge r$ and $\widehat{r}$ is finite, which is indeed the case for parameter estimation in interactive effect models, see, e.g., (Moon and Weidner 2015). The proof likely involves random matrix theory that is not available for continuous martingales. We leave this for future work. In our empirical studies, we find that as long as $r$ is greater than 3 but not as large as 20, the comparison results remain the same qualitatively and the interpretations are identical.
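As an illustration, the selection of $\widehat{r}$ can be sketched as follows, where the penalty value is assumed to be computed separately, e.g. as in (3.9).

```python
import numpy as np

def estimate_num_factors(Sigma_tilde, penalty, r_max=20):
    """Sketch of the factor-number estimator: eigenvalues of the pre-averaged covariance
    estimate are scaled by 1/d and penalized by j * penalty; r_hat = argmin_j (...) - 1."""
    d = Sigma_tilde.shape[0]
    eigvals = np.sort(np.linalg.eigvalsh(Sigma_tilde))[::-1]     # descending order
    scores = eigvals[:r_max] / d + np.arange(1, r_max + 1) * penalty
    j_star = int(np.argmin(scores)) + 1
    return j_star - 1
```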
3.4.3 Choice of the Thresholding Methods
The first one is the location-based thresholding utilizing domain knowledge (denoted as Location thresholding), as in (Fan, Furger, and Xiu 2016). This approach preserves positive semi-definiteness in a finite sample and is computationally efficient, as neither tuning nor optimization is involved. Specifically, we first sort the residual covariance matrix $\widehat{\Gamma}$ into blocks by the assets' industrial classifications (sector, industry group,
$$s^{\mathrm{loc}}_{\lambda_{ij}}(z) = z\,1(\lambda_{ij} = 1), \qquad \text{where } \lambda_{ij} = 1 \text{ if and only if } (i,j) \text{ belong to the same block},$$
$$\lambda_{ij} = \tau\sqrt{\widehat{\Gamma}_{ii}\widehat{\Gamma}_{jj}},$$
sufficiently sparse.
3 Alternatively, one can adopt the adaptive thresholding method by (Cai and Liu 2011). In our simulations, this approach does not perform as well as the location-based and correlation-based thresholding methods we consider.
To fix these issues, we find an appropriate $\tau$ via a grid search algorithm, following (Fan, Liao, and Mincheva 2013). We start from a small value of $\tau$ and gradually increase it until a positive semi-definite $\widehat{\Gamma}^{\mathcal{S}}$ is obtained and the degree of sparsity is below a certain threshold. As $\tau$ increases, the degree of sparsity decreases, and the solution shrinks towards the diagonal of $\widehat{\Gamma}$, which is positive semi-definite. Thus our grid search is guaranteed to produce a solution; in other words, this algorithm yields a positive semi-definite estimate in a finite sample. Notice that the grid search for $\tau$ is easier here compared to that of (Fan, Liao, and Mincheva 2013), because $\tau$ is bounded between 0 and 1. In practice, a natural choice of the desired degree of sparsity can be obtained from that of the location-based thresholding, which is computationally less expensive than cross-validation.
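A sketch of this grid search is given below; hard thresholding at $\lambda_{ij} = \tau\sqrt{\widehat{\Gamma}_{ii}\widehat{\Gamma}_{jj}}$ is used purely for illustration, and the target degree of sparsity is assumed to be supplied (e.g., from the location-based thresholding).

```python
import numpy as np

def correlation_threshold_gridsearch(Gamma_hat, target_sparsity, grid_size=51):
    """Increase tau from 0 to 1 until the thresholded residual covariance is positive
    semi-definite and its fraction of nonzero off-diagonal entries is below
    `target_sparsity`. Falls back to the diagonal matrix if no tau qualifies."""
    d = Gamma_hat.shape[0]
    sd = np.sqrt(np.diag(Gamma_hat))
    for tau in np.linspace(0.0, 1.0, grid_size):
        lam = tau * np.outer(sd, sd)                                  # lambda_ij
        Gamma_S = np.where(np.abs(Gamma_hat) > lam, Gamma_hat, 0.0)   # hard thresholding
        np.fill_diagonal(Gamma_S, np.diag(Gamma_hat))                 # keep the diagonal
        off_diag_nonzero = (np.count_nonzero(Gamma_S) - d) / (d * (d - 1))
        psd = np.linalg.eigvalsh(Gamma_S).min() >= 0
        if psd and off_diag_nonzero <= target_sparsity:
            return tau, Gamma_S
    return 1.0, np.diag(np.diag(Gamma_hat))
```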
$$dY_{i,t} = \sum_{j=1}^{r}\beta_{i,j}\,dX_{j,t} + dZ_{i,t}, \qquad dX_{j,t} = b_j\,dt + \sigma_{j,t}\,dW_{j,t}, \qquad dZ_{i,t} = \gamma_i^\top\,dB_{i,t},$$
allow for time-varying $\sigma^2_{j,t}$, which evolves as
$$d\sigma^2_{j,t} = \kappa_j\big(\theta_j - \sigma^2_{j,t}\big)dt + \eta_j\,\sigma_{j,t}\,d\widetilde{B}_{j,t}, \qquad j = 1, 2, \ldots, r,$$
$0.001^2$. To mimic the asynchronicity, we censor the data using Poisson sampling, where the number of observations for each stock is sampled from a truncated log-normal distribution. The log-normal distribution $\log N(\mu, \sigma^2)$ has parameters $\mu = 2{,}500$ and $\sigma = 0.8$, and the lower and upper truncation boundaries are 1,000 and 23,400, respectively, which matches the empirical data on S&P 500 index constituents.
For the pre-averaging estimators, we compare a range of local window sizes by varying $\theta$ in (3.4). For the subsampling estimators, we compare a range of subsampling frequencies from every 5 minutes to every 65 minutes, denoted $\Delta_n$ (in seconds). We also add benchmark estimates using noiseless returns sampled at a 5-minute frequency without asynchronicity for comparison. In addition, we consider a "mixed" approach by applying the pre-averaging method to the subsampled data, in order to check the
marginal effect of the refresh-time versus subsampling. Since the simulated Γ is
a block-diagonal matrix, we regard the location-based thresholding method as the
benchmark, and compare its performance with other thresholding methods in Section
3.4.3. We present the simulation results in Tables 3.1 and 3.2.
First, we find that the pre-averaging estimators achieve smaller errors compared to the subsampling estimators, and are closer to the noiseless benchmark, for almost all model specifications and almost all criteria. The pre-averaging method achieves a slightly better estimate of the low-rank part $\beta E\beta^\top$, and a better residual part $\Gamma$, especially under the operator norm for $\Gamma^{-1}$, which in turn gives a better estimate of the precision matrix $\Sigma^{-1}$. Throughout, the pre-averaging estimators are robust to a wide range of tuning parameters $\theta$. The sweet spot appears to be $\theta = 0.03$, while the optimal frequency for the subsampling method seems to be achieved near $\Delta_n = 900$.
Secondly, comparing across the various thresholding methods, we find that in almost all scenarios the Location thresholding achieves the best result, followed by Soft, AL, and SCAD thresholding, whereas Hard thresholding appears to be the worst. The differences are smaller in terms of $\|\widehat{\Sigma} - \Sigma\|_{\mathrm{MAX}}$ and $\|\widehat{\Gamma} - \Gamma\|_{\mathrm{MAX}}$, because the largest entry-wise errors are achieved either along the diagonal or on off-diagonal entries with large magnitudes, which are least affected by the various thresholding methods. Nevertheless, the differences are most salient in terms of $\|\widehat{\Sigma}^{-1} - \Sigma^{-1}\|$ and $\|\widehat{\Gamma}^{-1} - \Gamma^{-1}\|$. The former is arguably the most important metric in this table, since it dictates the performance of the portfolio allocation in the empirical exercise.
Thirdly, the comparison between the "mixed" approach and subsampling depends on a bias-variance tradeoff. If the bias due to noise is large, the mixed approach outperforms; if the noise effect becomes negligible at a sufficiently low sampling frequency, the subsampling approach can outperform because its convergence rate is faster. By contrast, the "mixed" approach is always dominated by the pre-averaging based on the refresh time scheme, because the latter preserves more data and the two estimators are equally efficient given the same amount of data.
Finally, to demonstrate the impact of the selected number of factors on the PCA, we present in Table 3.3 the errors with a range of $\widehat{r}$ values pre-set instead of estimated. To conserve space, we only report the results with $\theta = 0.03$ for the pre-averaging method, $\Delta_n = 900$ for the subsampling method, and the benchmark no-noise and synchronous case with $\Delta^\star_n = 300$. Location thresholding is used in all cases. For the estimation of $\Sigma$ and $\Sigma^{-1}$, we find that when $\widehat{r} < r$, the performance is much worse in every metric than in the case with $\widehat{r} = r$, whereas in the case of $\widehat{r} > r$, the performance is only slightly worse, in particular when $\widehat{r}$ is within a reasonable range (smaller than 20). For the purpose of covariance matrix estimation, this result justifies treating $r$ as a tuning parameter without estimating it. With respect to estimating the low-rank and sparse components of $\Sigma$, however, using an incorrect number of factors is harmful.
3.6.1 Data
We collect from the TAQ database intraday transaction prices of the constituents of
Dow Jones 30 index, S&P 100 index, and S&P 500 index, respectively, from January
2004 to December 2013. There are in total 42, 152, and 735 stocks for each index,
respectively, during this sampling period.
We select stocks that are members of these indices on the last day of each month,
and exclude those among them that have no trading activities on one or more trading
days of this month, as well as the bottom 5% stocks in terms of the number of
observations for liquidity concerns. To clean the data, we adopt the procedure detailed
in (Da and Xiu 2017), which relies only on the condition codes from the exchanges and the range of NBBO quotes to identify outliers. Overnight returns are excluded
to avoid dividend issuances and stock splits. Days with half trading hours are also
excluded. We do not, however, remove jumps from these intraday returns as they
do not seem to matter for our purpose. We sample the stocks using the refresh time
approach, as well as the previous tick method at a 15-minute frequency. We select
15 minutes because the Hausman tests proposed in (Aït-Sahalia and Xiu 2016) suggest it is a safe frequency at which to use realized covariance estimators for this pool of stocks.
Figure 3.1 plots the daily sample sizes after refresh time sampling for S&P 500,
S&P 100, and Dow Jones 30 index constituents, respectively. In addition, it presents
the quartiles of the daily sample sizes for S&P 500 index constituents by different colors
of shading. As we increase the number of assets, the daily refresh time observations
decreases rapidly. Still we are able to obtain on average 284 observations per day
for S&P 500, which is approximately equivalent to sampling every 90 seconds. The
average number of observations for S&P 100 and Dow Jones 30 constituents are 905
and 2,105, respectively.
We collect the Global Industrial Classification Standard (GICS) codes from the
Compustat database for the Location thresholding method. There are 8 digits in the
codes. Digits 1-2 indicate the company’s sector; digits 3-4 describe the company’s
industry group; digits 5-6 describe the industry; digits 7-8 describe the sub-industry.
Throughout 120 months and among the assets we consider, the time series median of
the largest block size is 80 for sector-based classification, 39 for industry group, 27 for
industry, and 15 for sub-industry categories, for S&P 500 index constituents.
We construct observable factors from high-frequency transaction prices at a 15-
minute frequency. The factors include the market portfolio, the small-minus-big
(SMB) portfolio, the high-minus-low (HML) portfolio, the robust-minus-weak (RMW)
portfolio, and the conservative-minus-aggressive (CMA) portfolio in the Fama-French
5 factor model. We also include the momentum (MOM) portfolio formed by sorting
stock returns between the past 250 days and 21 days. We also collect the 9 industry
SPDR ETFs from the TAQ database (Energy (XLE), Materials (XLB), Industrials
(XLI), Consumer Discretionary (XLY), Consumer Staples (XLP), Health Care (XLV),
Financial (XLF), Information Technology (XLK), and Utilities (XLU)). The time
series of cumulative returns of all factors are plotted in Figure 3.2.
We obtain monthly factor loadings (exposures) from the MSCI Barra USE3 by
(Kahn, Brougham, and Green 1998). The loadings we utilize include Earnings Yield,
Momentum, Trading Activity, Currency Sensitivity, Earnings Variation, Growth,
Volatility, Dividend Yield, Size, Size Nonlinearity, Leverage, and Value. In addition,
we construct and add the market exposure for each stock, using the slope coefficient
in a time-series regression of its weighted daily returns on the weighted S&P 500
index returns over the trailing 252 trading days. The weights are chosen to have a
half life of 63 days, so as to match the method documented by USE3. In total, we
have 14 observable loadings including the intercept term. We normalize the factor
exposures such that their cross-sectional means are 0 and variances are 1 for each month.
Although the covariance matrix estimation is invariant under such transformations,
the estimated factors now have similar scales. In case of missing exposures, we use
their latest available values, or set them to 0 if they are missing throughout the entire
sample period for certain stocks.4 The cumulative returns of the estimated factors
based on S&P 500 constituents are shown in Figure 3.3.
We plot the cumulative leading principal components of S&P 500 constituents using
our PCA method in Figure 3.4. It is difficult to recognize a one-to-one correspondence
among the factors in Figures 3.2, 3.3, and 3.4, because the list of characteristics
available from the MSCI Barra do not match the observed factors we obtain. Instead,
we plot their generalized correlations using 15-minute returns in Figure 3.5, which
measure how correlated two sets of factors are as suggested by (Bai and Ng 2006) and
recently employed in (Pelger 2015b). Indeed, there exists strong coherence among the observed and inferred factors obtained using the different approaches, in particular between the PCA and the CSR factors.
4 The empirical findings remain the same if we exclude stocks whose loadings are missing from the USE3 dataset.
$$\min_{\omega}\ \omega^\top\widehat{\Sigma}\,\omega, \qquad \text{subject to } \omega^\top\mathbf{1} = 1,\ \|\omega\|_1 \le \gamma, \qquad (3.10)$$
realized covariance matrices using data of the previous month are directly used for the portfolio construction of the next month. For a range of exposure constraints, we measure the out-of-sample portfolio risk using the 15-minute realized volatility. We compare the covariance matrices based on the pre-averaging and subsampling methods, with many choices of $\theta$ and subsampling frequencies $\Delta_n$. Figures 3.6, 3.7, and 3.8 provide the results for the best choice of $\theta = 0.03$ and $\Delta_n = 900$ from the simulations, and the five thresholding methods we consider.
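Problem (3.10) is a standard quadratic program; a minimal sketch of its solution, assuming the cvxpy package and a positive semi-definite covariance estimate, is given below.

```python
import cvxpy as cp

def min_variance_weights(Sigma_hat, gamma):
    """Sketch of the constrained minimum-variance problem (3.10): minimize
    w' Sigma_hat w subject to w'1 = 1 and ||w||_1 <= gamma.
    Sigma_hat must be symmetric positive semi-definite (as the estimators above guarantee)."""
    d = Sigma_hat.shape[0]
    w = cp.Variable(d)
    objective = cp.Minimize(cp.quad_form(w, Sigma_hat))
    constraints = [cp.sum(w) == 1, cp.norm1(w) <= gamma]
    cp.Problem(objective, constraints).solve()
    return w.value
```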
Figure 3.6 shows that (i) for S&P 500 constituents, Location thresholding (black) performs the best among all thresholding methods, followed by the Soft (blue), AL (red), and SCAD (green) approaches, with Hard thresholding (yellow) being the worst. (ii) The TSR approach appears to be the best, with the lowest out-of-sample risk and the most stable performance across different thresholding methods. PCA is almost the same as TSR when Location thresholding is applied, but its performance deteriorates if we apply other thresholding techniques. CSR is dominated by the other two by a very large margin. This differs from our simulation results, indicating that CSR suffers from more serious model misspecification. (iii) When the exposure constraint γ is small, the performance gap among different thresholding methods is small. This is expected because the ‖·‖_MAX norm differences across all methods are similar, and in this case the portfolios are so heavily constrained that they are effectively low-dimensional. (iv) The pre-averaging estimators (solid lines) dominate the subsampling estimators (dash-dotted lines) in almost all cases, which agrees with our simulation results.
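As a rough illustration of how the Hard, Soft, and SCAD rules compared above differ, the sketch below applies generic entry-wise thresholding to a residual covariance estimate. It is a simplification: the chapter's estimators use correlation-adaptive, entry-specific thresholds, and the Location rule instead keeps entries within GICS industry blocks. All names are illustrative.

```python
import numpy as np

def threshold_residual_cov(gamma_hat, tau, method="soft"):
    """Entry-wise thresholding of an estimated residual covariance matrix."""
    g = gamma_hat.copy()
    d = g.shape[0]
    off = ~np.eye(d, dtype=bool)            # diagonal entries are never thresholded
    x = g[off]
    if method == "hard":
        y = x * (np.abs(x) > tau)
    elif method == "soft":
        y = np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)
    elif method == "scad":                  # SCAD thresholding rule, a = 3.7 (Fan and Li 2001)
        a = 3.7
        y = np.where(np.abs(x) <= 2 * tau,
                     np.sign(x) * np.maximum(np.abs(x) - tau, 0.0),
                     np.where(np.abs(x) <= a * tau,
                              ((a - 1) * x - np.sign(x) * a * tau) / (a - 2),
                              x))
    else:
        raise ValueError(method)
    g[off] = y
    return g
```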
For S&P 100 constituents, we observe similar patterns in Figure 3.7: the pre-averaging estimators outperform the subsampling methods, and PCA performs slightly worse than TSR. For the Dow Jones 30, Figure 3.8 shows that CSR and PCA perform considerably worse than TSR. This is not surprising, given that our theory suggests that consistency of these two estimators requires a large cross-sectional dimension, whereas TSR works well with a smaller cross-section.
Finally, we report in Table 3.4 the Diebold and Mariano (2002) tests comparing the out-of-sample risk of portfolios based on the pre-averaging estimators against that based on the subsampling estimators. Negative test statistics favor the pre-averaging approach. Consistent with Figures 3.6 - 3.8, when γ is large, the pre-averaging estimators deliver significantly smaller out-of-sample risk using TSR and PCA across all thresholding methods for S&P 500 index constituents. For S&P 100 and Dow Jones index constituents, the difference between pre-averaging and subsampling is significant only when using TSR.
3.7 Conclusion
Estimator        Pre-Averaging (θ)             Subsampling (Δn, seconds)        Noiseless (Δ*n)   Mixed (θ, Δn)
                 0.02   0.03   0.04   0.05     300   900   1,800   3,900        300               0.03, 900
Location Thresholding
TSR 0.06 0.06 0.06 0.06 0.12 0.08 0.09 0.13 0.03 0.08
‖Σ̂ − Σ‖_MAX    CSR 0.05 0.04 0.04 0.05 0.11 0.07 0.07 0.09 0.03 0.06
PCA 0.06 0.06 0.06 0.07 0.12 0.08 0.09 0.17 0.04 0.08
TSR 0.04 0.04 0.05 0.06 0.04 0.06 0.08 0.11 0.03 0.07
‖β̂Êβ̂⊤ − βEβ⊤‖_MAX    CSR 0.02 0.02 0.03 0.03 0.02 0.03 0.04 0.06 0.02 0.03
PCA 0.04 0.05 0.05 0.06 0.04 0.06 0.08 0.19 0.04 0.07
TSR 0.05 0.04 0.04 0.04 0.12 0.06 0.06 0.07 0.02 0.05
‖Γ̂ − Γ‖_MAX    CSR 0.04 0.03 0.03 0.04 0.11 0.06 0.06 0.07 0.02 0.05
PCA 0.04 0.03 0.03 0.04 0.11 0.06 0.06 0.17 0.02 0.05
TSR 0.28 0.24 0.25 0.28 0.80 0.40 0.43 0.62 0.14 0.36
‖Σ̂ − Σ‖_Σ    CSR 0.27 0.22 0.22 0.24 0.79 0.37 0.37 0.48 0.13 0.32
PCA 0.28 0.24 0.26 0.28 0.79 0.40 0.43 0.93 0.15 0.36
TSR 5.32 4.31 5.41 6.98 9.43 6.50 10.45 30.16 3.57 7.00
‖(Σ̂)⁻¹ − Σ⁻¹‖    CSR 5.35 4.35 5.35 6.88 9.42 6.52 10.26 28.90 3.54 6.90
PCA 5.33 4.33 5.43 6.99 9.41 6.50 10.49 28.68 3.56 7.03
TSR 5.34 4.32 5.45 7.03 9.45 6.52 10.54 30.59 3.59 7.06
‖(Γ̂)⁻¹ − Γ⁻¹‖    CSR 5.37 4.38 5.39 6.93 9.44 6.54 10.38 29.33 3.57 6.97
PCA 5.35 4.35 5.47 7.05 9.44 6.52 10.59 28.89 3.58 7.09
r̂    PCA 3.00 3.00 3.00 3.00 3.00 3.00 3.00 1.71 3.00 3.00
Hard Thresholding
TSR 0.08 0.08 0.09 0.09 0.12 0.09 0.11 0.13 0.08 0.10
‖Σ̂ − Σ‖_MAX    CSR 0.08 0.08 0.08 0.08 0.11 0.08 0.08 0.09 0.07 0.08
PCA 0.08 0.08 0.08 0.09 0.12 0.09 0.10 0.17 0.08 0.09
TSR 0.08 0.08 0.08 0.08 0.12 0.08 0.08 0.08 0.08 0.08
‖Γ̂ − Γ‖_MAX    CSR 0.08 0.08 0.08 0.08 0.11 0.08 0.08 0.08 0.08 0.08
PCA 0.08 0.08 0.08 0.08 0.11 0.08 0.08 0.17 0.08 0.08
TSR 0.56 0.46 0.43 0.42 1.09 0.61 0.53 0.58 0.37 0.51
‖Σ̂ − Σ‖_Σ    CSR 0.53 0.43 0.40 0.38 1.04 0.56 0.46 0.41 0.35 0.45
PCA 0.54 0.44 0.41 0.40 1.04 0.58 0.51 0.67 0.35 0.49
TSR 8.16 7.36 7.07 6.97 10.65 8.52 7.79 7.56 6.58 7.70
‖(Σ̂)⁻¹ − Σ⁻¹‖    CSR 8.01 7.20 6.93 6.88 10.53 8.38 7.67 7.52 6.67 7.57
PCA 7.98 7.18 7.18 7.30 10.52 8.35 7.61 8.45 7.43 7.51
TSR 8.18 7.38 7.09 6.99 10.67 8.54 7.81 7.59 6.60 7.72
‖(Γ̂)⁻¹ − Γ⁻¹‖    CSR 8.03 7.22 6.95 6.90 10.55 8.41 7.70 7.55 6.69 7.59
PCA 8.01 7.18 6.91 6.87 10.54 8.37 7.62 8.40 6.70 7.53
Note: In this table, we compare the performance of the pre-averaging estimators with the subsampling approach using different thresholding methods under a variety of matrix norms. The tuning parameter θ governs the pre-averaging local window size (kn ≈ θ n^{1/2+δ}), while Δn is the subsampling frequency in seconds. The "Noiseless" column provides the estimates using clean and synchronous data with a sampling frequency of Δ*n = 300 (5-minute), and the "Mixed" column provides the estimates using the pre-averaging approach on the subsampled data (θ = 0.03, Δn = 900). Since the low-rank part β̂Êβ̂⊤ is identical across thresholding methods, we only report it in the upper panel. We also report the average estimated number of factors r̂ for the PCA approach. The number of Monte Carlo repetitions is 1,000.
Estimator        Pre-Averaging (θ)             Subsampling (Δn, seconds)        Noiseless (Δ*n)   Mixed (θ, Δn)
                 0.02   0.03   0.04   0.05     300   900   1,800   3,900        300               0.03, 900
Soft Thresholding
TSR 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.13 0.05 0.09
‖Σ̂ − Σ‖_MAX    CSR 0.05 0.05 0.06 0.06 0.11 0.07 0.07 0.09 0.04 0.07
PCA 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.15 0.05 0.09
TSR 0.05 0.05 0.05 0.05 0.12 0.06 0.06 0.08 0.04 0.06
‖Γ̂ − Γ‖_MAX    CSR 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.08 0.05 0.06
PCA 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.17 0.05 0.06
TSR 0.47 0.37 0.35 0.35 1.05 0.55 0.50 0.60 0.25 0.45
‖Σ̂ − Σ‖_Σ    CSR 0.46 0.35 0.32 0.32 1.01 0.51 0.44 0.45 0.25 0.41
PCA 0.46 0.37 0.35 0.35 1.02 0.53 0.49 0.74 0.25 0.45
TSR 6.94 5.91 5.64 5.62 10.39 7.76 7.14 7.79 5.43 6.78
‖(Σ̂)⁻¹ − Σ⁻¹‖    CSR 6.93 5.92 5.67 5.66 10.36 7.76 7.18 7.99 5.54 6.81
PCA 6.92 7.55 8.00 8.31 10.35 7.73 8.12 12.54 8.24 7.73
TSR 6.97 5.93 5.66 5.64 10.42 7.78 7.16 7.82 5.68 6.80
‖(Γ̂)⁻¹ − Γ⁻¹‖    CSR 6.96 5.94 5.69 5.68 10.39 7.78 7.20 8.03 5.76 6.83
PCA 6.94 5.91 5.66 5.64 10.38 7.75 7.17 12.05 5.77 6.79
SCAD Thresholding
TSR 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.13 0.05 0.09
‖Σ̂ − Σ‖_MAX    CSR 0.05 0.05 0.06 0.06 0.11 0.07 0.07 0.09 0.04 0.07
PCA 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.15 0.05 0.09
TSR 0.05 0.05 0.05 0.05 0.12 0.06 0.06 0.08 0.04 0.06
‖Γ̂ − Γ‖_MAX    CSR 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.08 0.05 0.06
PCA 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.17 0.05 0.06
TSR 0.47 0.37 0.36 0.37 1.05 0.55 0.52 0.61 0.26 0.46
‖Σ̂ − Σ‖_Σ    CSR 0.46 0.35 0.33 0.33 1.01 0.52 0.45 0.45 0.25 0.42
PCA 0.46 0.37 0.35 0.36 1.02 0.54 0.51 0.74 0.26 0.45
TSR 7.11 6.22 6.13 7.54 10.39 7.99 12.21 9.69 5.44 7.63
‖(Σ̂)⁻¹ − Σ⁻¹‖    CSR 7.08 6.26 6.13 7.00 10.36 7.94 10.78 9.96 5.55 7.39
PCA 7.06 7.66 8.26 8.88 10.35 7.91 11.04 12.96 8.39 8.24
TSR 7.14 6.24 6.15 7.60 10.42 8.01 12.36 9.78 5.68 7.68
‖(Γ̂)⁻¹ − Γ⁻¹‖    CSR 7.11 6.28 6.16 7.06 10.39 7.97 10.91 10.05 5.76 7.43
PCA 7.09 6.25 6.13 7.11 10.38 7.94 10.68 12.60 5.77 7.42
AL Thresholding
TSR 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.13 0.05 0.09
‖Σ̂ − Σ‖_MAX    CSR 0.05 0.05 0.05 0.06 0.11 0.07 0.07 0.09 0.04 0.07
PCA 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.16 0.05 0.08
TSR 0.05 0.04 0.05 0.05 0.12 0.06 0.06 0.08 0.04 0.06
‖Γ̂ − Γ‖_MAX    CSR 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.08 0.04 0.06
PCA 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.17 0.04 0.06
TSR 0.44 0.36 0.36 0.37 1.03 0.55 0.55 0.67 0.23 0.47
‖Σ̂ − Σ‖_Σ    CSR 0.43 0.35 0.33 0.34 1.00 0.52 0.50 0.53 0.23 0.44
PCA 0.44 0.36 0.36 0.37 1.00 0.54 0.55 0.75 0.24 0.47
TSR 6.34 5.33 6.12 8.12 10.26 7.46 16.46 41.68 5.34 8.95
‖(Σ̂)⁻¹ − Σ⁻¹‖    CSR 6.40 5.41 6.08 8.05 10.30 7.53 17.40 40.75 5.46 9.00
PCA 7.21 8.47 9.30 10.14 10.29 7.71 18.10 25.70 9.12 10.23
TSR 6.36 5.35 6.17 8.20 10.29 7.48 16.70 42.81 5.59 9.05
‖(Γ̂)⁻¹ − Γ⁻¹‖    CSR 6.42 5.44 6.19 8.18 10.32 7.56 17.72 41.92 5.71 9.15
PCA 6.41 5.43 6.27 8.34 10.32 7.54 17.95 25.58 5.71 9.38
Table 3.3: Impact of the Selected Number of Factors on the PCA Method
Note: We present the errors of the PCA approach under a variety of norms, using different numbers of factors r̂, with the true value of r equal to 3. For tuning parameters, we select θ = 0.03 for pre-averaging, Δn = 900 for the subsampling method, and Δ*n = 300 for the benchmark noise-free and synchronous case. Location thresholding is used in all cases. The number of Monte Carlo repetitions is 1,000.
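To illustrate the role of the selected number of factors r̂ examined in Table 3.3, the following sketch assembles a PCA-based estimator from a pilot covariance matrix by keeping the r̂ leading principal components and soft-thresholding the residual. This is a simplified stand-in for the chapter's estimator, which builds the pilot matrix from pre-averaged, refresh-time-sampled returns; all names are illustrative.

```python
import numpy as np

def pca_covariance(cov_hat, r_hat, tau):
    """Low-rank (r_hat factors) plus thresholded-residual covariance estimate."""
    eigval, eigvec = np.linalg.eigh(cov_hat)
    order = np.argsort(eigval)[::-1]
    lam, v = eigval[order[:r_hat]], eigvec[:, order[:r_hat]]
    low_rank = (v * lam) @ v.T              # sum of lam_j * v_j v_j'
    resid = cov_hat - low_rank
    d = resid.shape[0]
    off = ~np.eye(d, dtype=bool)
    resid[off] = np.sign(resid[off]) * np.maximum(np.abs(resid[off]) - tau, 0.0)
    return low_rank + resid
```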
Exposure Constraint γ 1 1.2 1.6 2 3 4 8
S&P 500
TSR -1.60 -1.88* -2.24** -2.76*** -3.12*** -3.16*** -3.15***
Location CSR -0.09 -0.51 -1.55 -1.75* -2.21** -2.07** -2.02**
PCA -2.31** -2.55** -2.64*** -2.80*** -2.96*** -2.81*** -2.79***
TSR -1.53 -2.07** -2.59*** -2.76*** -2.72*** -2.58*** -2.56**
Hard CSR 0.48 0.47 0.31 0.73 1.20 1.28 1.30
PCA -2.39** -2.91*** -3.27*** -3.44*** -3.48*** -3.27*** -3.26***
TSR -1.64 -2.14** -2.81*** -3.27*** -3.43*** -3.40*** -3.38***
Soft CSR 0.38 0.24 -0.05 0.20 0.54 0.62 0.66
PCA -2.33** -2.76*** -3.28*** -3.44*** -3.56*** -3.47*** -3.46***
TSR -1.63 -2.09** -2.62*** -3.00*** -3.08*** -2.97*** -2.96***
SCAD CSR 0.45 0.39 0.21 0.38 0.87 0.98 1.00
PCA -2.32** -2.77*** -3.17*** -3.35*** -3.22*** -2.86*** -2.84***
TSR -1.64 -2.20** -2.89*** -3.35*** -3.53*** -3.46*** -3.44***
AL CSR 0.53 0.38 0.45 0.76 1.23 1.36 1.38
PCA -2.30** -2.71*** -3.21*** -3.36*** -3.42*** -3.22*** -3.21***
S&P 100
TSR -1.57 -2.09** -2.31** -2.45** -1.88* -1.46 -1.47
Location CSR -2.38** -2.78*** -3.22*** -3.18*** -3.29*** -3.29*** -3.29***
PCA -1.56 -1.90* -1.55 -1.39 -0.82 -0.59 -0.59
TSR -1.69* -2.27** -2.44** -2.42** -2.05** -2.07** -2.07**
Hard CSR -1.76* -1.69* -1.48 -1.70* -1.79* -1.79* -1.79*
PCA -1.35 -1.58 -1.04 -1.19 -0.70 -0.43 -0.43
TSR -1.68* -2.33** -2.60*** -2.59*** -2.15** -2.07** -2.08**
Soft CSR -2.09** -2.50** -2.23** -2.33** -2.42** -2.42** -2.42**
PCA -1.70* -2.12** -1.50 -1.59 -1.18 -1.11 -1.11
TSR -1.70* -2.34** -2.59*** -2.54** -2.04** -1.92* -1.93*
SCAD CSR -2.06** -2.41** -2.16** -2.26** -2.36** -2.36** -2.36**
PCA -1.57 -1.95* -1.18 -1.29 -0.82 -0.61 -0.61
TSR -1.68* -2.30** -2.57** -2.59*** -2.30** -1.98** -2.00**
AL CSR -2.10** -2.44** -2.17** -2.30** -2.40** -2.40** -2.40**
PCA -1.71* -2.07** -1.47 -1.48 -1.02 -0.88 -0.88
Dow Jones 30
TSR -1.85* -2.29** -2.74*** -2.66*** -2.55** -2.55** -2.55**
Location CSR -0.12 -0.07 0.10 0.09 0.16 0.16 0.16
PCA 0.67 0.36 -0.04 -0.62 -0.63 -0.63 -0.63
TSR -1.80* -2.01** -2.70*** -2.53** -2.45** -2.45** -2.45**
Hard CSR -0.84 -0.88 -0.75 -0.67 -0.55 -0.55 -0.55
PCA 0.59 0.19 -0.53 -1.07 -0.93 -0.89 -0.89
TSR -0.59 -1.06 -2.43** -2.29** -2.21** -2.21** -2.21**
Soft CSR -0.71 -0.90 -1.09 -1.08 -0.94 -0.94 -0.94
PCA 0.58 0.18 -0.45 -0.94 -1.00 -1.00 -1.00
TSR -0.56 -1.03 -2.43** -2.29** -2.21** -2.21** -2.21**
SCAD CSR -0.71 -0.90 -1.10 -1.09 -0.95 -0.95 -0.95
PCA 0.57 0.17 -0.48 -0.93 -1.00 -1.00 -1.00
TSR -0.03 -0.55 -2.06** -1.98** -1.94* -1.94* -1.94*
AL CSR -0.63 -0.76 -1.13 -1.11 -0.96 -0.96 -0.96
PCA 0.60 0.20 -0.44 -0.82 -0.92 -0.92 -0.92
Note: This table reports the Diebold-Mariano test statistics against the null that the out-of-sample risk of the portfolios based on the pre-averaging method is equal to that based on the subsampling approach. Negative numbers favor the pre-averaging approach. We use *, **, and *** to indicate significance at the 10%, 5%, and 1% levels, respectively.
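For reference, a minimal sketch of the Diebold-Mariano statistic reported in this table is given below. The lag length, the Bartlett weights, and the use of monthly realized variances as the loss series are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def diebold_mariano(loss_pre, loss_sub, lag=1):
    """DM statistic for equal out-of-sample risk; negative values favor pre-averaging."""
    d = np.asarray(loss_pre, dtype=float) - np.asarray(loss_sub, dtype=float)
    T = len(d)
    d_bar = d.mean()
    # Newey-West long-run variance of the loss differential (Bartlett kernel).
    lrv = np.mean((d - d_bar) ** 2)
    for j in range(1, lag + 1):
        gamma_j = np.mean((d[j:] - d_bar) * (d[:-j] - d_bar))
        lrv += 2 * (1 - j / (lag + 1)) * gamma_j
    dm = d_bar / np.sqrt(lrv / T)
    pval = 2 * (1 - stats.norm.cdf(abs(dm)))
    return dm, pval
```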
[Figure: Trading intensity. Series: SP500 Refresh, SP100 Refresh, DJI30 Refresh; shaded bands: 5%-25% & 75%-max, 25%-75%, and the median. Y-axis: daily number of observations (log scale, roughly 78 to 23,400); X-axis: 2004-2013.]
Note: This figure plots the max, 5%, 25%, 50%, and 75% quantiles of the daily number of
observations for the S&P 500 index constituents after cleaning. It also plots the sample size
after refresh time sampling for the Dow Jones 30, S&P 100, and S&P 500 index constituents,
respectively.
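A minimal sketch of the refresh-time synchronization referenced in this figure is given below. The data layout (one sorted array of observation times per asset) is an assumption made for illustration.

```python
import numpy as np

def refresh_times(timestamp_lists):
    """Refresh-time grid for asynchronously observed assets.

    Each element of timestamp_lists is a sorted array of observation times;
    the next refresh time is the first instant by which every asset has
    traded at least once since the previous refresh time.
    """
    idx = [0] * len(timestamp_lists)
    refresh = []
    while all(i < len(t) for i, t in zip(idx, timestamp_lists)):
        tau = max(t[i] for i, t in zip(idx, timestamp_lists))
        refresh.append(tau)
        # Advance each asset to its first observation strictly after tau.
        idx = [int(np.searchsorted(t, tau, side="right")) for t in timestamp_lists]
    return np.array(refresh)
```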
[Figure 3.2: Cumulative factor returns. Visible panel titles: Market, SMB, HML (the remaining panels cover the other factors listed in the note below); X-axis: 2004-2013.]
Note: This figure plots the cumulative returns of the factors we use in TSR, including the market portfolio, the small-minus-big market capitalization (SMB) portfolio, the high-minus-low price-earnings ratio (HML) portfolio, the robust-minus-weak (RMW) portfolio, the conservative-minus-aggressive (CMA) portfolio, the momentum (MOM) portfolio, as well as 9 industry SPDR ETFs (Energy (XLE), Materials (XLB), Industrials (XLI), Consumer Discretionary (XLY), Consumer Staples (XLP), Health Care (XLV), Financial (XLF), Information Technology (XLK), and Utilities (XLU)). Overnight returns and half trading days are excluded.
[Figure 3.3: Cumulative returns of the CSR factors. Visible panel titles: Intercept, MarketExp, EarningsYieldExp, ..., LeverageExp, ValueExp; X-axis: 2004-2013.]
Note: This figure plots the cumulative returns of factors we estimate using CSR, based on
S&P 500 constituents. The corresponding factor exposures include the intercept, the market
beta, and 12 other variables from MSCI Barra USE3 (Earnings Yield, Momentum, Trading
Activity, Currency Sensitivity, Earnings Variation, Growth, Volatility, Dividend Yield, Size,
Size Nonlinearity, Leverage, and Value).
[Figure 3.4: Cumulative returns of the leading principal components. Visible panel titles: PC1, PC2, PC3; X-axis: 2004-2013.]
Note: This figure plots the cumulative returns of the leading principal components of S&P
500 constituents we estimate using PCA.
[Figure 3.5: Heatmaps of monthly generalized correlations. Panels: TSR vs PCA, TSR vs CSR, CSR vs PCA; Y-axis: canonical rank (1-15); X-axis: 2004-2013; color scale: 0-1.]
Note: This figure provides the heatmap of the monthly generalized (canonical) correlations among the time series of factors used in TSR and estimated from the CSR and PCA methods. The y-axis is the rank of the generalized correlation. For two sets of factors X_a and X_b, the generalized correlation of rank k is calculated as the square root of the k-th largest eigenvalue of the matrix [X_a, X_a]^{-1}[X_a, X_b][X_b, X_b]^{-1}[X_b, X_a], where [·,·] denotes the quadratic covariation.
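A small sketch of this computation is given below, with the quadratic covariations replaced by sample covariation matrices of the 15-minute factor returns; the function and input names are illustrative.

```python
import numpy as np

def generalized_correlations(xa, xb):
    """Generalized (canonical) correlations between two sets of factor returns.

    xa has shape (T, ra), xb has shape (T, rb); returns the sorted square
    roots of the eigenvalues of [Xa,Xa]^{-1}[Xa,Xb][Xb,Xb]^{-1}[Xb,Xa].
    """
    xa = np.asarray(xa, dtype=float)
    xb = np.asarray(xb, dtype=float)
    saa, sbb, sab = xa.T @ xa, xb.T @ xb, xa.T @ xb
    m = np.linalg.solve(saa, sab) @ np.linalg.solve(sbb, sab.T)
    eig = np.sort(np.linalg.eigvals(m).real)[::-1]
    return np.sqrt(np.clip(eig, 0.0, 1.0))
```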
[Figure 3.6: Out-of-sample annualized risk (%) against the exposure constraint, S&P 500 constituents. Panels: TSR with 15 Factors, CSR with 14 Loadings, PCA with 15 Components. Lines: Pre-Averaging and Subsampling, each combined with Location, Hard, Soft, SCAD, and AL thresholding.]
Note: This figure compares the time series average of the out-of-sample monthly volatility from 2004 to 2013 using S&P 500 Index constituents. The x-axis is the exposure constraint γ in the optimization problem (3.10). The results are based on a combination of methods: (pre-averaging, subsampling) × (TSR, CSR, PCA) × (Location, Soft, AL, SCAD, Hard thresholding). We use the GICS codes at the industry group level for the Location thresholding method. The pre-averaging method uses θ = 0.03, while the subsampling method uses 15-minute returns. The out-of-sample volatility is calculated using 15-minute subsampled returns each month.
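To make the role of the tuning parameter θ concrete, the sketch below computes pre-averaged returns using the window kn ≈ θ n^{1/2+δ} and the common weight function g(x) = min(x, 1 − x). It omits the bias correction for microstructure noise used by the actual estimator, and all names are illustrative.

```python
import numpy as np

def preaverage(returns, theta=0.03, delta=0.0):
    """Pre-averaged returns with local window kn ~ theta * n**(1/2 + delta)."""
    r = np.asarray(returns, dtype=float)
    n = len(r)
    kn = max(2, int(np.ceil(theta * np.sqrt(n) * n ** delta)))
    j = np.arange(1, kn)
    g = np.minimum(j / kn, 1 - j / kn)       # weight function g(x) = min(x, 1 - x)
    psi2 = np.sum(g ** 2) / kn               # normalizing constant psi_2
    # Rolling weighted sums of kn - 1 consecutive high-frequency returns.
    pre = np.array([g @ r[i:i + kn - 1] for i in range(n - kn + 2)])
    return pre, kn, psi2
```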
[Figure 3.7: Out-of-sample annualized risk (%) against the exposure constraint, S&P 100 constituents. Panels: TSR with 15 Factors, CSR with 14 Loadings, PCA with 5 Components.]
Note: This figure compares the time series average of the out-of-sample monthly volatility from 2004 to 2013 using S&P 100 Index constituents. All other settings remain the same as those in Figure 3.6.
[Figure 3.8: Out-of-sample annualized risk (%) against the exposure constraint, Dow Jones 30 constituents.]
Note: This figure compares the time series average of the out-of-sample monthly volatility from 2004 to 2013 using Dow Jones 30 Index constituents. All other settings remain the same as those in Figure 3.6.
3.1 Mathematical Proofs
(i) $\mathbb{P}\big(\|\widehat{E}-E\|_{\mathrm{MAX}}\ge C_0\,n_\delta^{-1/2}\sqrt{\log d}\big)=O\big(r^2C_1d^{-C_0^2C_2}\big)$,  (3.1)
(ii) $\mathbb{P}\big(\|\widehat{E}-E\|_{F}\ge C_0\,r\,n_\delta^{-1/2}\sqrt{\log d}\big)=O\big(r^2C_1d^{-C_0^2C_2}\big)$,  (3.2)
(iii) $\mathbb{P}\Big(\max_{1\le k\le r,\,1\le l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\bar{X}^{\star}_{k,i}\bar{Z}^{\star}_{o,li}\Big|\ge C_0\,n_\delta^{-1/2}\sqrt{\log d}\Big)=O\big(rC_1d^{-C_0^2C_2+1}\big)$,  (3.3)
(iv) $\mathbb{P}\Big(\max_{1\le k,l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\bar{Z}^{\star}_{o,ki}\bar{Z}^{\star}_{o,li}-\int_0^t g_{s,lk}\,ds\Big|\ge C_0\,n_\delta^{-1/2}\sqrt{\log d}\Big)=O\big(C_1d^{-C_0^2C_2+2}\big)$,  (3.4)
(v) $\mathbb{P}\big(\max_{1\le j\le d}\|\widehat{\beta}_j-\beta_j\|\ge C_0\,n_\delta^{-1/2}\sqrt{\log d}\big)=O\big(C_1d^{-C_0^2C_2+1}\big)$,  (3.5)
(vi) $\mathbb{P}\big(\|\widehat{\beta}-\beta\|_{F}\ge C_0\,d^{1/2}n_\delta^{-1/2}\sqrt{\log d}\big)=O\big(C_1d^{-C_0^2C_2+1}\big)$,  (3.6)
(vii) $\mathbb{P}\big(\|\widehat{\beta}-\beta\|_{\mathrm{MAX}}\ge C_0\,n_\delta^{-1/2}\sqrt{\log d}\big)=O\big(C_1d^{-C_0^2C_2+1}\big)$,  (3.7)
(viii) $\mathbb{P}\Big(\max_{1\le k\le d}\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big(\big(\beta-\widehat{\beta}\big)\bar{X}^{\star}_i\big)_k^2\ge C_0\,r^2n^{-1/2+\delta}\log d\Big)=O\big(C_1d^{-C_0^2C_2+1}\big)$,  (3.8)
(ix) $\mathbb{P}\big(\max_{1\le k,l\le d}|\widehat{\Gamma}_{kl}-\Gamma_{kl}|\ge C_0\,n_\delta^{-1/2}\sqrt{\log d}\big)=O\big(C_1d^{-C_0^2C_2+2}\big)$,  (3.9)
(x) $\mathbb{P}\big(\max_{1\le k,l\le d}|\widehat{\Gamma}^{S}_{kl}-\Gamma_{kl}|\ge C_0\,n_\delta^{-1/2}\sqrt{\log d}\big)=O\big(C_1d^{-C_0^2C_2+2}\big)$.  (3.10)
We decompose
$$\widehat{E}_{kl}=\frac{n}{(n-k_n+2)\psi_2k_n}\Big(\sum_{i=0}^{n-k_n+1}\bar{X}_{k,i}\bar{X}_{l,i}+\sum_{i=0}^{n-k_n+1}\bar{X}_{k,i}\bar{\varepsilon}_{l,i}+\sum_{i=0}^{n-k_n+1}\bar{X}_{l,i}\bar{\varepsilon}_{k,i}+\sum_{i=0}^{n-k_n+1}\bar{\varepsilon}_{k,i}\bar{\varepsilon}_{l,i}\Big)=T_1+T_2+T_3+T_4,$$
therefore,
$$\mathbb{P}\big(|\widehat{E}_{kl}-E_{kl}|\ge u\big)\le\mathbb{P}\big(|T_1-E_{kl}|\ge u/4\big)+\mathbb{P}\big(|T_2|\ge u/4\big)+\mathbb{P}\big(|T_3|\ge u/4\big)+\mathbb{P}\big(|T_4|\ge u/4\big),$$
for certain numbers $a_{1,i}(k,l)$ and $b_{1,ij}(k,l)$, where
$$F=\big\{(i,j)\,\big|\,1\le i\le n-k_n+1,\ 1\le j\le n-k_n+1,\ |i-j|\le k_n-1,\ i\ne j\big\}.$$
Let
$$A^i_{k,l}:=\frac{n}{(n-k_n+2)\psi_2k_n}\,a_{1,i}(k,l)=O(1).$$
We insert the synchronized true prices $X_{k,t_i}$ and $X_{k,t_{i-1}}$ between $X_{k,t^k_i}$ and $X_{k,t^k_{i-1}}$ and write out the corresponding expansion. Using this expansion of $(X_{k,t^k_i}-X_{k,t^k_{i-1}})(X_{l,t^l_i}-X_{l,t^l_{i-1}})$, we obtain the following decomposition.
For $\sum_{i=1}^{n-k_n+1}A^i_{k,l}\,\Delta^n_iX_k\,\Delta^n_iX_l$, denote $X^{\star}_t=\int_0^t\sigma_s\,dW_s$ and define, for $1\le i\le n$ and $1\le k,l\le p$, the increments $\zeta^2_{i,kl}$ given below; then
$$M_t=\sum_{i=1}^{n-k_n+1}A^i_{k,l}\,\zeta^2_{i,kl}$$
is a continuous-time martingale. By Itô's lemma, we have
$$\big(X^{\star}_{t,k}-X^{\star}_{s,k}\big)\big(X^{\star}_{t,l}-X^{\star}_{s,l}\big)-\int_s^t(\sigma\sigma^{\top})_{v,kl}\,dv=\int_s^t\big(X^{\star}_{v,k}-X^{\star}_{s,k}\big)(\sigma_v\,dW_v)_l+\int_s^t\big(X^{\star}_{v,l}-X^{\star}_{s,l}\big)(\sigma_v\,dW_v)_k.$$
Therefore
$$\zeta^2_{i,kl}=\int_{(i-1)\Delta_n}^{i\Delta_n}\big(X^{\star}_{s,k}-X^{\star}_{(i-1)\Delta_n,k}\big)(\sigma_s\,dW_s)_l+\int_{(i-1)\Delta_n}^{i\Delta_n}\big(X^{\star}_{s,l}-X^{\star}_{(i-1)\Delta_n,l}\big)(\sigma_s\,dW_s)_k.$$
Now we have
$$\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,l}\big(X_{k,t^k_i}-X_{k,t^k_{i-1}}\big)\big(X_{l,t^l_i}-X_{l,t^l_{i-1}}\big)-\sum_{i=1}^{n-k_n+1}A^i_{k,l}\int_{(i-1)\Delta_n}^{i\Delta_n}(\sigma\sigma^{\top})_{s,kl}\,ds\Big|$$
$$=\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,l}\,\Delta^n_iX^{\star}_k\int_{(i-1)\Delta_n}^{i\Delta_n}b_{s,l}\,ds+\sum_{i=1}^{n-k_n+1}A^i_{k,l}\,\Delta^n_iX^{\star}_l\int_{(i-1)\Delta_n}^{i\Delta_n}b_{s,k}\,ds+\sum_{i=1}^{n-k_n+1}A^i_{k,l}\int_{(i-1)\Delta_n}^{i\Delta_n}b_{s,l}\,ds\int_{(i-1)\Delta_n}^{i\Delta_n}b_{s,k}\,ds+M_t\Big|.$$
$$\le\mathbb{P}\Big(\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,k}\big(\Delta^n_iX^{\star}_k\big)^2-\sum_{i=1}^{n-k_n+1}A^i_{k,k}\int_{(i-1)\Delta_n}^{i\Delta_n}(\sigma\sigma^{\top})_{s,kk}\,ds\Big|>\frac{u^2}{tK\Delta_n}-tK\Big)\le\exp\bigg(-\frac{\big(\frac{u^2}{tK\Delta_n}-tK\big)^2}{32K^3t\Delta_n}\bigg),$$
where the last inequality follows from (3.11). Finally, notice that
$$\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,l}\int_{(i-1)\Delta_n}^{i\Delta_n}b_{s,l}\,ds\int_{(i-1)\Delta_n}^{i\Delta_n}b_{s,k}\,ds\Big|\le tK^2\Delta_n\max_i|A^i_{k,l}|,$$
so we can derive
$$\mathbb{P}\Big(\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,l}\big(X_{k,t_i}-X_{k,t_{i-1}}\big)\big(X_{l,t_i}-X_{l,t_{i-1}}\big)-\sum_{i=1}^{n-k_n+1}A^i_{k,l}\int_{(i-1)\Delta_n}^{i\Delta_n}(\sigma\sigma^{\top})_{s,kl}\,ds\Big|>4u\Big)$$
$$\le\mathbb{P}\big(|M_t|>u\big)+2\,\mathbb{P}\Big(\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,l}\,\Delta^n_iX^{\star}_k\int_{(i-1)\Delta_n}^{i\Delta_n}b_{s,l}\,ds\Big|>u\Big)+\mathbf{1}_{\{tK^2\Delta_n\max_i|A^i_{k,l}|>u\}}$$
$$\le\exp\Big(-\frac{u^2}{32K^3t\Delta_n}\Big)+2\exp\bigg(-\frac{\big(\frac{u^2}{tK\Delta_n}-tK\big)^2}{32K^3t\Delta_n}\bigg)\le C_1\exp\Big(-\frac{16C_2u^2}{\Delta_n}\Big),$$
where the last inequality holds if $u>\big(tK^2\Delta_n\max_i|A^i_{k,l}|\big)\vee\big(tK\sqrt{\Delta_n}\big)\vee\big(\tfrac{tK\Delta_n}{2}\sqrt{1+4/\Delta_n}\big)$, with $C_1\ge3$ and $C_2\le(512K^3t)^{-1}$. On the other hand, if $u$ violates this bound, i.e., $u\le C'\sqrt{\Delta_n}$, we can choose $C_1$ such that $C_1\exp(-16C_2C'^2)\ge1$, so that the inequality holds trivially. For $H^1_{kl}(1),\dots,H^1_{kl}(8)$, we can use exactly the same technique as for $\sum_{i=1}^{n-k_n+1}A^i_{k,l}\Delta^n_iX_k\Delta^n_iX_l$, and obtain
$$\mathbb{P}\big(\big|H^1_{kl}(1)+\cdots+H^1_{kl}(8)\big|\ge u\big)\le C_pe^{-C_pnu^2}.$$
Since it is easy to show that
$$\sum_{i=1}^{n-k_n+1}A^i_{k,l}\int_{(i-1)\Delta_n}^{i\Delta_n}(\sigma\sigma^{\top})_{s,kl}\,ds=\int_0^t(\sigma\sigma^{\top})_{s,kl}\,ds+o\big(n_\delta^{-1/2}\big),$$
we have
$$\mathbb{P}\Big(\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=1}^{n-k_n+1}a_{1,i}(k,l)\big(X_{k,t^k_i}-X_{k,t^k_{i-1}}\big)\big(X_{l,t^l_i}-X_{l,t^l_{i-1}}\big)-\int_0^t(\sigma\sigma^{\top})_{s,kl}\,ds\Big|\ge u\Big)$$
$$\le\mathbb{P}\Big(\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,l}\big(X_{k,t_i}-X_{k,t_{i-1}}\big)\big(X_{l,t_i}-X_{l,t_{i-1}}\big)-\sum_{i=1}^{n-k_n+1}A^i_{k,l}\int_{(i-1)\Delta_n}^{i\Delta_n}(\sigma\sigma^{\top})_{s,kl}\,ds\Big|>u/2\Big)$$
$$\quad+\mathbb{P}\big(\big|H^1_{kl}(1)+\cdots+H^1_{kl}(8)\big|\ge u/3\big)+\mathbb{P}\Big(\Big|\sum_{i=1}^{n-k_n+1}A^i_{k,l}\int_{(i-1)\Delta_n}^{i\Delta_n}(\sigma\sigma^{\top})_{s,kl}\,ds-\int_0^t(\sigma\sigma^{\top})_{s,kl}\,ds\Big|\ge u/3\Big)$$
$$\le C_pe^{-C_pnu^2}+\mathbf{1}_{\{u<o(n_\delta^{-1/2})\}}\le C_pe^{-C_pnu^2}.$$
Moreover, since $(X_{k,t^k_i}-X_{k,t^k_{i-1}})(X_{l,t^l_i}-X_{l,t^l_{i-1}})=O_p(n^{-1})$, writing
$$X_{1,ij}=\frac{n}{(n-k_n+2)\psi_2k_n}\,b_{1,ij}(k,l)\big(X_{k,t^k_i}-X_{k,t^k_{i-1}}\big)\big(X_{l,t^l_i}-X_{l,t^l_{i-1}}\big)$$
and using a similar decomposition, we can obtain
$$\sum_{(i,j)\in F}X_{1,ij}=\sum_{(i,j)\in F}B^i_{k,l}\,\Delta^n_iX_k\,\Delta^n_iX_l+H^2_{kl}(1)+\cdots+H^2_{kl}(8).$$
Since $\|X_t\|_\infty$ is bounded, $B^i_{k,l}\Delta^n_iX_k\Delta^n_iX_l-\mathbb{E}\big(B^i_{k,l}\Delta^n_iX_k\Delta^n_iX_l\big)$ is also bounded, and by Hoeffding's lemma it is a sub-Gaussian random variable; similar arguments extend to $H^2_{kl}(1),\dots,H^2_{kl}(8)$. Then, by Hoeffding's inequality and the fact that $\#F_{kl}\le Cnk_n$, where $\#F_{kl}$ denotes the number of elements of the set $F_{kl}$, we have
$$\mathbb{P}\Big(\Big|\frac1n\sum_{(i,j)\in F}X_{1,ij}\Big|\ge u/8\Big)\le\mathbb{P}\Big(\Big|\frac1n\sum_{(i,j)\in F}B^i_{k,l}\Delta^n_iX_k\Delta^n_iX_l-\mathbb{E}\Big(\frac1n\sum_{(i,j)\in F}B^i_{k,l}\Delta^n_iX_k\Delta^n_iX_l\Big)\Big|\ge u/24\Big)$$
$$\quad+\mathbb{P}\Big(\Big|H^2_{kl}(1)+\cdots+H^2_{kl}(8)-\mathbb{E}\big(H^2_{kl}(1)+\cdots+H^2_{kl}(8)\big)\Big|\ge u/24\Big)+\mathbf{1}_{\{|\mathbb{E}(\frac1n\sum_{(i,j)\in F}X_{1,ij})|\ge u/24\}}$$
$$\le C_pe^{-\frac{C_p(nu)^2}{nk_n}}+\mathbf{1}_{\{|\mathbb{E}(\frac1n\sum_{(i,j)\in F}X_{1,ij})|\ge u/24\}}.$$
Next,
$$T_4=\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\Big(\bar{u}_{k,t^k_i}\bar{u}_{l,t^l_i}+\bar{v}_{k,t^k_i}\bar{v}_{l,t^l_i}+\bar{u}_{k,t^k_i}\bar{v}_{l,t^l_i}+\bar{v}_{k,t^k_i}\bar{u}_{l,t^l_i}\Big).$$
Using techniques similar to the proof of Proposition 10 in Kim, Wang, and Zou (2016), we can prove that
$$\max_{1\le k,l\le d}\mathbb{E}|T_4|\le C_p\big(n^{-2\delta}+n^{\alpha-1/2-2\delta}+n^{2\alpha-1-2\delta}\big).$$
Similarly,
$$T_2=\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big(\bar{X}_{k,t^k_i}\bar{u}_{l,t^l_i}+\bar{X}_{k,t^k_i}\bar{v}_{l,t^l_i}\big)=T_{21}+T_{22},$$
with
$$\max_{1\le k,l\le d}\mathbb{E}|T_{21}|\le C_p\frac{1}{k_n}\sum_{i=0}^{n-k_n+1}\mathbb{E}\big|\bar{X}_{k,t^k_i}\big|\,\mathbb{E}\big|\bar{u}_{l,t^l_i}\big|\le C_p\frac{1}{k_n}\sum_{i=0}^{n-k_n+1}\mathbb{E}\big|\bar{X}_{k,t^k_i}\big|\,k_n^{-1/2}\le C_p\,n\,n^{-1}k_n^{-1/2}\,k_n^{-1}=C_p\,k_n^{-3/2},$$
$$\max_{1\le k,l\le d}\mathbb{E}|T_{22}|\le C_p\frac{1}{k_n}\sum_{i=0}^{n-k_n+1}\big[\mathbb{E}|\bar{X}_{k,t^k_i}|^2\big]^{1/2}\big[\mathbb{E}|\bar{v}_{l,t^l_i}|^2\big]^{1/2}\le C_p\,k_n^{-1}n^{\alpha\varsigma}.$$
Therefore,
$$\max_{1\le k,l\le d}\mathbb{E}|T_2|\le C_p\big(k_n^{-3/2}+k_n^{-1}n^{\alpha\varsigma}\big),$$
and hence
$$\mathbb{P}\Big(\max_{1\le k,l\le d}|T_2|\ge u\Big)\le d^2\exp\Big(-\frac{C_p\,u^2}{k_n^{-3}+k_n^{-2}n^{2\alpha\varsigma}}\Big),\qquad\mathbb{P}\Big(\max_{1\le k,l\le d}|T_3|\ge u\Big)\le d^2\exp\Big(-\frac{C_p\,u^2}{k_n^{-3}+k_n^{-2}n^{2\alpha\varsigma}}\Big),$$
and
$$\mathbb{P}\Big(\max_{1\le k,l\le d}|T_4|\ge u\Big)\le d^2\exp\Big(-\frac{C_p\,u^2}{n^{-2\delta}+n^{\alpha-1/2-2\delta}+n^{2\alpha-1-2\delta}}\Big).$$
Thus
$$\mathbb{P}\Big(\|\widehat{E}-E\|_{\mathrm{MAX}}\ge C_0\big[n_\delta^{-1/2}+n^{-2\delta}\big]\sqrt{\log d}\Big)\le C_1r^2\exp\Big(-C_2\big[n^{1/2-\delta}+n^{2\delta}\big]\big(C_0n_\delta^{-1/2}\sqrt{\log d}\big)^2\Big)=C_1d^{-C_0^2C_2}.$$
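For reference, the concentration inequality invoked repeatedly in the bounds above is the following standard form of Hoeffding's inequality, stated here for completeness with generic constants (not taken from the source):

```latex
% Hoeffding's inequality: for independent random variables X_1, ..., X_m with
% |X_i - \mathbb{E}X_i| \le c_i almost surely,
\mathbb{P}\Big(\Big|\sum_{i=1}^{m}\big(X_i-\mathbb{E}X_i\big)\Big|\ge u\Big)
  \le 2\exp\Big(-\frac{u^2}{2\sum_{i=1}^{m}c_i^2}\Big).
```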
(ii) Since
› › › ›
›E ´ E› ď r ›E ´ E› ,
›p › ›p ›
F MAX
then
´› › a ¯
´1{2 2
P ›E ´ E› ě C0 rnδ log d ď C1 d´C0 C2 .
›p ›
F
‹ ‹ ‹ ‹
(iii) The derivation of X̄k,i Z̄o,li is similar to that of X̄k,i X̄l,i given by (i), therefore we
obtain
˜ ˇ ˇ ¸
n´kn `1
ˇ n 1 ÿ ˇ
´1{2
a 2
max ˇ ‹
X̄k,i ‹ ˇ
Z̄o,li ˇ ě C0 nδ log d ď C1 d´C0 C2 `1 .
ˇ
P
1ďkďr,1ďlďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0
˜ ˇ ˇ ¸
n´kn `1 żt
ˇ n 1 ÿ
‹ ‹
ˇ
´1{2
a
max ˇ Z̄o,ki Z̄o,li gs,kl dsˇ ě C0 nδ log d
ˇ ˇ
P ´
1ďkďr,1ďlďd ˇ n ´ kn ` 2 ψ2 kn 0 ˇ
i“0
2
ďC1 d´C0 C2 `2 .
117
(v)-(vii) Moreover, note that
$$\widehat{\beta}_j-\beta_j=\big(\widehat{\Pi}_{22}\big)^{-1}\widehat{\Pi}_{12}^{\,j},$$
so that
$$\|\widehat{\beta}-\beta\|_{F}^2\le\frac{4rC_0^2\,n^{-1/2+\delta}\,d\log d}{\lambda_{\min}^2\big(\int_0^te_s\,ds\big)}.$$
Moreover,
$$\mathbb{P}\Big(\Big\|\widehat{E}-\int_0^te_s\,ds\Big\|\le\frac12\lambda_{\min}\Big(\int_0^te_s\,ds\Big)\Big)\ge\mathbb{P}\Big(r\max_{i,j}\Big|\widehat{E}_{ij}-\int_0^te_{ij,s}\,ds\Big|\le\frac12\lambda_{\min}\Big(\int_0^te_s\,ds\Big)\Big)\ge1-O\big(C_1d^{-C_0^2C_2}\big),$$
hence
$$\mathbb{P}\Big(\lambda_{\min}\big(\widehat{E}\big)\ge\frac12\lambda_{\min}\Big(\int_0^te_s\,ds\Big)\Big)\ge1-O\big(C_1d^{-C_0^2C_2}\big).$$
Combining this with (3.3), we have $\mathbb{P}(A)\ge1-O\big(C_1d^{-C_0^2C_2+1}\big)$.
(viii) To prove (3.8), we note that
$$\max_{1\le k\le d}\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\Big(\sum_{l=1}^{r}\big(\beta_{k,l}-\widehat{\beta}_{k,l}\big)\bar{X}^{\star}_{l,i}\Big)^2\le\max_{1\le k\le d}\big\|\widehat{\beta}_k-\beta_k\big\|^2\cdot\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big\|\bar{X}^{\star}_i\big\|^2.$$
Then by (3.1), with $C>\max_{1\le k\le r}\int_0^te_{s,kk}\,ds$,
$$\mathbb{P}\Big(\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big\|\bar{X}^{\star}_i\big\|^2\le rC\Big)\ge\mathbb{P}\Big(r\max_{1\le k\le r}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big(\bar{X}^{\star}_{k,i}\big)^2-\int_0^te_{s,kk}\,ds\Big|+r\max_{1\le k\le r}\int_0^te_{s,kk}\,ds\le rC\Big)\ge1-O\big(C_1d^{-C_0^2C_2}\big).$$
By (3.5), we obtain
$$\mathbb{P}\Big(\max_{1\le k\le d}\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\Big(\sum_{l=1}^{r}\big(\beta_{k,l}-\widehat{\beta}_{k,l}\big)\bar{X}^{\star}_{l,i}\Big)^2>Cn^{-1/2+\delta}r^2\log d\Big)\le O\big(C_1d^{-C_0^2C_2+1}\big).$$
On the event
$$\Big\{\max_{1\le l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big(\bar{Z}^{\star}_{o,li}\big)^2-\int_0^tg_{s,ll}\,ds\Big|\le\frac14\max_{1\le l\le d}\int_0^tg_{s,ll}\,ds\Big\}\cap\Big\{\max_{1\le k\le d}\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\Big(\sum_{l=1}^{r}\big(\beta_{k,l}-\widehat{\beta}_{k,l}\big)\bar{X}^{\star}_{l,i}\Big)^2\le Cn^{-1/2+\delta}r^2\log d\Big\},$$
the difference between the residual cross-products and $\bar{Z}^{\star}_{o,ki}\bar{Z}^{\star}_{o,li}$ is bounded by
$$\max_{1\le k\le d}\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big(\big(\widehat{\beta}-\beta\big)\bar{X}^{\star}_i\big)_k^2+2\sqrt{\max_{1\le k\le d}\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big(\bar{Z}^{\star}_{o,ki}\big)^2\cdot\max_{1\le k\le d}\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\big(\big(\widehat{\beta}-\beta\big)\bar{X}^{\star}_i\big)_k^2}$$
$$\le C_0\,n^{-1/2+\delta}r^2\log d+2\sqrt{\tfrac54Ct\cdot C_0\,n^{-1/2+\delta}r^2\log d}\le C_0'\,n^{-1/4+\delta}r\sqrt{\log d}.$$
Consequently, we have
$$\max_{1\le k,l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\Big[\big(\bar{Y}^{\star}_{k,i}-(\widehat{\beta}\bar{X}^{\star}_i)_k\big)\big(\bar{Y}^{\star}_{l,i}-(\widehat{\beta}\bar{X}^{\star}_i)_l\big)-\bar{Z}^{\star}_{o,ki}\bar{Z}^{\star}_{o,li}\Big]\Big|\le C_0'\,n^{-1/4+\delta}r\sqrt{\log d},$$
with probability $1-O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$ by (3.4) and (3.6). Finally, by the triangle inequality, we obtain
$$\max_{1\le k,l\le d}\big|\widehat{\Gamma}_{kl}-\Gamma_{kl}\big|\le\max_{1\le k\le r,\,1\le l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\bar{Z}^{\star}_{k,i}\bar{Z}^{\star}_{l,i}-\int_0^tg_{s,lk}\,ds\Big|+\max_{1\le k,l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\Big[\big(\bar{Y}^{\star}_{k,i}-(\widehat{\beta}\bar{X}^{\star}_i)_k\big)\big(\bar{Y}^{\star}_{l,i}-(\widehat{\beta}\bar{X}^{\star}_i)_l\big)-\bar{Z}^{\star}_{o,ki}\bar{Z}^{\star}_{o,li}\Big]\Big|.$$
Define the events
$$B_1=\Big\{\max_{k,l}\big|\widehat{\Gamma}_{kl}-\Gamma_{kl}\big|\le C\,n_\delta^{-1/2}\sqrt{\log d}\Big\}\quad\text{and}\quad B_2=\Big\{C_1<\sqrt{\widehat{\Gamma}_{kk}\widehat{\Gamma}_{ll}}<C_2\ \text{ for all }k\le d,\ l\le d\Big\},$$
where $C_1$ and $C_2$ are some constants, and $\omega_n=n_\delta^{-1/2}\sqrt{\log d}$. Because $s_{kl}(z)=s_{kl}\big(z\,\mathbf{1}_{\{|z|>\tau\omega_n\sqrt{\widehat{\Gamma}_{kk}\widehat{\Gamma}_{ll}}\}}\big)$, we have
$$\big\|\widehat{\Gamma}^{S}-\Gamma\big\|_{\mathrm{MAX}}\le\max_{1\le k,l\le d}\Big[\big|s_{kl}(\widehat{\Gamma}_{kl})-\widehat{\Gamma}_{kl}\big|\mathbf{1}_{\{|\widehat{\Gamma}_{kl}|\ge\tau\omega_n\sqrt{\widehat{\Gamma}_{kk}\widehat{\Gamma}_{ll}}\}}+\big|\widehat{\Gamma}_{kl}-\Gamma_{kl}\big|\mathbf{1}_{\{|\widehat{\Gamma}_{kl}|\ge\tau\omega_n\sqrt{\widehat{\Gamma}_{kk}\widehat{\Gamma}_{ll}}\}}+\big|\Gamma_{kl}\big|\mathbf{1}_{\{|\widehat{\Gamma}_{kl}|\le\tau\omega_n\sqrt{\widehat{\Gamma}_{kk}\widehat{\Gamma}_{ll}}\}}\Big]$$
$$\le C\omega_n\sqrt{\widehat{\Gamma}_{kk}\widehat{\Gamma}_{ll}}\,\mathbf{1}_{\{|\widehat{\Gamma}_{kl}|\ge C\omega_n\theta_1\}}+C\omega_n\mathbf{1}_{\{|\widehat{\Gamma}_{kl}|\ge C\omega_n\theta_1\}}+C\omega_n\sqrt{\widehat{\Gamma}_{kk}\widehat{\Gamma}_{ll}}\le C\omega_n.$$
Then we obtain
$$\big\|\widehat{\Gamma}^{S}-\Gamma\big\|_{\mathrm{MAX}}\le C\,n_\delta^{-1/2}\sqrt{\log d}$$
with probability at least $1-O\big(C_1d^{-C_0^2C_2+2}\big)$.
Lemma 11. Under Assumptions 1 - 4, and $n_\delta^{-1/2}\sqrt{\log d}=o(1)$, we have
$$\text{(i)}\quad\mathbb{P}\Big(\big\|\widehat{\beta}(\widehat{E}-E)\widehat{\beta}^{\top}\big\|_{\Sigma}^2+\big\|\beta\widehat{E}(\widehat{\beta}-\beta)^{\top}\big\|_{\Sigma}^2\ge C_0\,d^{-1}n^{-1/2+\delta}\log d\Big)=O\big(C_1d^{-C_0^2C_2+1}\big),$$
and
$$\text{(ii)}\quad\mathbb{P}\Big(\big\|(\widehat{\beta}-\beta)\widehat{E}(\widehat{\beta}-\beta)^{\top}\big\|_{\Sigma}^2\ge C_0\,d\,n^{-1+2\delta}\log^2d\Big)=O\big(C_1d^{-C_0^2C_2+1}\big).$$
Proof of Lemma 11. (i) For the first part, using the same argument as in the proof of Theorem 2 in Fan, Fan, and Lv (2008), we have
$$\big\|\beta^{\top}\Sigma^{-1}\beta\big\|\le2\big\|\mathrm{cov}^{-1}(X)\big\|.$$
Therefore
$$\big\|\widehat{\beta}(\widehat{E}-E)\widehat{\beta}^{\top}\big\|_{\Sigma}^2=d^{-1}\mathrm{tr}\Big(\Sigma^{-1/2}\widehat{\beta}(\widehat{E}-E)\widehat{\beta}^{\top}\Sigma^{-1}\widehat{\beta}(\widehat{E}-E)\widehat{\beta}^{\top}\Sigma^{-1/2}\Big)=d^{-1}\mathrm{tr}\Big((\widehat{E}-E)\widehat{\beta}^{\top}\Sigma^{-1}\widehat{\beta}(\widehat{E}-E)\widehat{\beta}^{\top}\Sigma^{-1}\widehat{\beta}\Big)\le d^{-1}\big\|(\widehat{E}-E)\widehat{\beta}^{\top}\Sigma^{-1}\widehat{\beta}\big\|_F^2\le O(d^{-1})\big\|\widehat{E}-E\big\|_F^2.$$
Similarly,
$$\big\|\beta\widehat{E}(\widehat{\beta}-\beta)^{\top}\big\|_{\Sigma}^2\le\frac1d\big\|\beta^{\top}\Sigma^{-1}\beta\,\widehat{E}(\widehat{\beta}-\beta)\big\|_F\big\|\widehat{E}\Sigma^{-1}(\widehat{\beta}-\beta)^{\top}\big\|_F\le\frac1d\big\|\beta^{\top}\Sigma^{-1}\beta\big\|_F\big\|\widehat{E}\big\|_F^2\big\|\widehat{\beta}-\beta\big\|_F^2.$$
Then, by Lemma 10 (3.2) and (3.6), and $\mathbb{P}\big(\|\widehat{E}\|_F^2>C\big)=O\big(C_1r^2d^{-C_0^2C_2}\big)$, we can obtain the desired result.
Lemma 12. Under Assumptions 1 - 4, and $n_\delta^{-1/2}\sqrt{\log d}=o(1)$, we have
$$\text{(i)}\ \ \mathbb{P}\Big(\big\|\widehat{\Gamma}^{S}-\Gamma\big\|\ge C_0\,m_d\,n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}\Big)=O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big),\quad(3.12)$$
$$\text{(ii)}\ \ \mathbb{P}\Big(\lambda_{\min}\big(\widehat{\Gamma}^{S}\big)\ge\tfrac12\lambda_{\min}(\Gamma)\Big)>1-O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big),\quad(3.13)$$
$$\text{(iii)}\ \ \mathbb{P}\Big(\big\|(\widehat{\Gamma}^{S})^{-1}-\Gamma^{-1}\big\|\ge C_0\,m_d\,n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}\Big)=O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big),\quad(3.14)$$
$$\text{(iv)}\ \ \mathbb{P}\Big(\big\|\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}-\beta^{\top}\Gamma^{-1}\beta\big\|\ge C_0\,m_d\,n^{-(1/4-\delta/2)(1-q)}\,d\,(\log d)^{(1-q)/2}\Big)=O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big),\quad(3.15)$$
$$\text{(v)}\ \ \mathbb{P}\Big(\big\|\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)-\big(E^{-1}+\beta^{\top}(\Gamma^{S})^{-1}\beta\big)\big\|\ge C_0\,m_d\,n^{-(1/4-\delta/2)(1-q)}\,d\,(\log d)^{(1-q)/2}\Big)=O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big),\quad(3.16)$$
$$\text{(vi)}\ \ \mathbb{P}\Big(\big\|\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)^{-1}\big\|\ge C_0\,m_d\,d^{-1}\Big)=O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big),\quad(3.17)$$
$$\text{(vii)}\ \ \mathbb{P}\Big(\big\|\widehat{\beta}\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)^{-1}\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\big\|\ge C_0\,m_d\Big)=O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big),\quad(3.18)$$
$$\text{(viii)}\ \ \mathbb{P}\Big(\big\|\widehat{\beta}\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)^{-1}\widehat{\beta}^{\top}\Gamma^{-1}\big\|\ge C_0\,m_d\Big)=O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big).\quad(3.19)$$
To prove (3.12), note that
$$\big\|\widehat{\Gamma}^{S}-\Gamma\big\|\le\max_{1\le l\le d}\sum_{k=1}^{d}\big|\widehat{\Gamma}^{S}_{lk}-\Gamma_{lk}\big|,$$
and the same technique used to prove (3.10) then yields the stated bound with probability no less than $1-O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$. The proofs of (3.13)-(3.19) are similar to that of Lemma 5 in Fan, Furger, and Xiu (2016); therefore we omit the details.
Proof of Theorem 3. Based on Lemma 10, and following the same steps as in the proof of Theorem 1 in Fan, Furger, and Xiu (2016), we can obtain
$$\big\|\widehat{\Sigma}^{\mathrm{TSR}}-\Sigma\big\|_{\mathrm{MAX}}=O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big).$$
For the next part, we prove the convergence results under the $\Sigma$ norm:
$$\big\|\widehat{\Sigma}^{\mathrm{TSR}}-\Sigma\big\|_{\Sigma}^2\le4\big\|\widehat{\beta}(\widehat{E}-E)\widehat{\beta}^{\top}\big\|_{\Sigma}^2+24\big\|\beta\widehat{E}(\widehat{\beta}-\beta)^{\top}\big\|_{\Sigma}^2+16\big\|(\widehat{\beta}-\beta)\widehat{E}(\widehat{\beta}-\beta)^{\top}\big\|_{\Sigma}^2+2\big\|\widehat{\Gamma}^{S}-\Gamma\big\|_{\Sigma}^2.\qquad(3.20)$$
Moreover, we have
$$\big\|\widehat{\Gamma}^{S}-\Gamma\big\|_{\Sigma}=d^{-1/2}\big\|\Sigma^{-1/2}\big(\widehat{\Gamma}^{S}-\Gamma\big)\Sigma^{-1/2}\big\|_F\le\big\|\Sigma^{-1/2}\big(\widehat{\Gamma}^{S}-\Gamma\big)\Sigma^{-1/2}\big\|\le\big\|\widehat{\Gamma}^{S}-\Gamma\big\|\,\lambda_{\max}\big(\Sigma^{-1}\big).$$
Then, based on (3.20), Lemma 11, Lemma 12 (3.12), and the fact above, we prove that
$$\big\|\widehat{\Sigma}^{\mathrm{TSR}}-\Sigma\big\|_{\Sigma}=O_p\big(d^{1/2}n^{-1/2+\delta}\log d+n^{-(1/4-\delta/2)(1-q)}m_d(\log d)^{(1-q)/2}\big).$$
On the other hand, if we do not assume the factor structure and instead use a direct pre-averaging estimator $\widehat{\Sigma}^{\star}$, then we get
$$\big\|\widehat{\Sigma}^{\star}-\Sigma\big\|_{\Sigma}^2\le C\big\|\beta(\widehat{E}-E)\beta^{\top}\big\|_{\Sigma}^2+C\big\|\beta\bar{X}\bar{Z}^{\top}\big\|_{\Sigma}^2+C\big\|\bar{Z}\bar{Z}^{\top}-\Gamma\big\|_{\Sigma}^2.$$
According to the proof of Lemma 11, we obtain $\|\beta(\widehat{E}-E)\beta^{\top}\|_{\Sigma}^2=O_p\big(d^{-1}n^{-1/2+\delta}\log d\big)$ and $\|\beta\bar{X}\bar{Z}^{\top}\|_{\Sigma}^2=O_p\big(d^{-1}n^{-1/2+\delta}\log d\big)$. We can also get $\|\bar{Z}\bar{Z}^{\top}-\Gamma\|_{\Sigma}^2=O_p\big(d\,n^{-1/2+\delta}\log d\big)$. Therefore $\|\widehat{\Sigma}^{\star}-\Sigma\|_{\Sigma}=O_p\big(d^{1/2}n_\delta^{-1/2}\sqrt{\log d}\big)$, which has a slower convergence rate than our estimator.
For the inverse part, by the localization argument, we only need to prove the result under the stronger assumption that the entry-wise norms of all the processes are bounded uniformly on $[0,t]$. By the Sherman-Morrison-Woodbury formula, we can decompose
$$\big(\widehat{\Sigma}^{\mathrm{TSR}}\big)^{-1}-\Sigma^{-1}=:L_1+L_2+L_3+L_4+L_5+L_6.$$
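For reference, the matrix identity being applied here is the following standard statement, written in the chapter's notation; the explicit six-term expansion itself is not reproduced from the source.

```latex
% Woodbury identity for a low-rank-plus-sparse covariance \Sigma = \beta E \beta^{\top} + \Gamma
% (E and \Gamma invertible):
\left(\beta E \beta^{\top} + \Gamma\right)^{-1}
  = \Gamma^{-1}
    - \Gamma^{-1}\beta\left(E^{-1} + \beta^{\top}\Gamma^{-1}\beta\right)^{-1}\beta^{\top}\Gamma^{-1}.
```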
We now bound each term with probability no less than $1-O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$. First of all, by (3.12),
$$\|L_2\|\le\big\|(\widehat{\Gamma}^{S})^{-1}-\Gamma^{-1}\big\|\cdot\big\|\widehat{\beta}\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)^{-1}\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\big\|\le C\,m_d^2\,n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}.$$
Similarly,
$$\|L_4\|\le\big\|\Gamma^{-1}\big\|\cdot\big\|\widehat{\beta}-\beta\big\|\cdot\big\|\widehat{\beta}\big\|\cdot\big\|\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)^{-1}\big\|\le C\,m_d^2\,n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}.$$
Using the fact that $\|\beta\|\le\|\beta\|_F=O(\sqrt{d})$, we can establish the same bound for $L_5$. Finally, we have
$$\|L_6\|\le\big\|\Gamma^{-1}\big\|^2\cdot\|\beta\|^2\cdot\big\|\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)^{-1}-\big(E^{-1}+\beta^{\top}(\Gamma^{S})^{-1}\beta\big)^{-1}\big\|$$
$$\le\big\|\Gamma^{-1}\big\|^2\cdot\|\beta\|^2\cdot\big\|\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)-\big(E^{-1}+\beta^{\top}(\Gamma^{S})^{-1}\beta\big)\big\|\cdot\big\|\big(\widehat{E}^{-1}+\widehat{\beta}^{\top}(\widehat{\Gamma}^{S})^{-1}\widehat{\beta}\big)^{-1}\big\|\cdot\big\|\big(E^{-1}+\beta^{\top}\Gamma^{-1}\beta\big)^{-1}\big\|.$$
Note that for any vector $v$ with $\|v\|=1$, by the definition of the operator norm,
$$\frac1dv^{\top}\beta^{\top}\beta v=v^{\top}Bv-v^{\top}\Big(B-\frac1d\beta^{\top}\beta\Big)v\ge\lambda_{\min}(B)-\Big\|\frac1d\beta^{\top}\beta-B\Big\|>C,$$
where $C$ is some constant. Thus $\lambda_{\min}(\beta^{\top}\beta)>Cd$, and since $\lambda_{\max}(\Gamma)\le Km_d$, it follows that $\lambda_{\min}(\beta^{\top}\Gamma^{-1}\beta)>Cm_d^{-1}d$. This implies $\lambda_{\min}\big(E^{-1}+\beta^{\top}\Gamma^{-1}\beta\big)\ge\lambda_{\min}\big(\beta^{\top}\Gamma^{-1}\beta\big)>Cm_d^{-1}d$, i.e.,
$$\big\|\big(E^{-1}+\beta^{\top}\Gamma^{-1}\beta\big)^{-1}\big\|=O_p\big(m_d\,d^{-1}\big).$$
Using (3.16) and (3.17), we have
$$\big\|\big(\widehat{\Sigma}^{\mathrm{TSR}}\big)^{-1}-\Sigma^{-1}\big\|\le C\big(m_d^3\,n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}+m_d^2\,n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}\big).$$
The second term on the right is dominated by the first, which yields the desired result.
To prove the second statement, note that for any vector $v$ with $\|v\|=1$, we have
$$v^{\top}\widehat{\Sigma}^{\mathrm{TSR}}v=v^{\top}\widehat{\beta}\widehat{E}\widehat{\beta}^{\top}v+v^{\top}\widehat{\Gamma}^{S}v\ge\lambda_{\min}\big(\widehat{\Gamma}^{S}\big),$$
so that $\lambda_{\min}\big(\widehat{\Sigma}^{\mathrm{TSR}}\big)\ge\lambda_{\min}\big(\widehat{\Gamma}^{S}\big)$. This inequality, combined with (3.13) of Lemma 12, concludes the proof.
We note that
$$\widetilde{X}^{\star}-\bar{X}^{\star}=\big(\beta^{\top}\beta\big)^{-1}\beta^{\top}\bar{Z}^{\star}.$$
Similar to the proofs of (3.1)-(3.3), and by the Hoeffding inequality, we have
$$\big\|\beta^{\top}\bar{Z}^{\star}\big\|=\sqrt{\big\|\beta^{\top}\bar{Z}^{\star}\bar{Z}^{\star\top}\beta\big\|}\le\sqrt{\big\|\beta^{\top}\big(\bar{Z}^{\star}\bar{Z}^{\star\top}-\Gamma\big)\beta\big\|+\big\|\beta^{\top}\Gamma\beta\big\|}\le\sqrt{\big\|\beta^{\top}\big\|\big\|\bar{Z}^{\star}\bar{Z}^{\star\top}-\Gamma\big\|\big\|\beta\big\|+\big\|\beta^{\top}\big\|\big\|\Gamma\big\|\big\|\beta\big\|}$$
$$\le\sqrt{C\,d\,m_d\,n_\delta^{-(1-q)/2}(\log d)^{(1-q)/2}+d\,m_d}\le C\,d^{1/2}m_d^{1/2}\,n_\delta^{-(1-q)/4}(\log d)^{(1-q)/4}+d^{1/2}m_d^{1/2},$$
where we use $\|\Gamma\|\le m_d$ and $\|\bar{Z}^{\star}\bar{Z}^{\star\top}-\Gamma\|=O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big)$. Then we have
$$\big\|\widetilde{X}^{\star}-\bar{X}^{\star}\big\|\le\big\|(\beta^{\top}\beta)^{-1}\big\|\big\|\beta^{\top}\bar{Z}^{\star}\big\|\le\frac{\big\|\beta^{\top}\bar{Z}^{\star}\big\|}{\lambda_{\min}(\beta^{\top}\beta)}\le\frac{\big\|\beta^{\top}\bar{Z}^{\star}\big\|\,d^{-1}}{\lambda_{\min}(d^{-1}\beta^{\top}\beta)}\le C\,d^{-1/2}m_d^{1/2}\,n_\delta^{-(1-q)/4}(\log d)^{(1-q)/4}+d^{-1/2}m_d^{1/2}.$$
Moreover,
$$\max_{1\le k\le d}\sum_{i=0}^{n-k_n+1}\Big(\sum_{l=1}^{r}\beta_{k,l}\big(\widetilde{X}^{\star}_{l,i}-\bar{X}^{\star}_{l,i}\big)\Big)^2\le\max_{1\le k\le d,\,1\le i\le n}\|\beta_k\|^2\,\big\|\widetilde{X}^{\star}-\bar{X}^{\star}\big\|_F^2\le C\,d^{-1}m_d\,n_\delta^{-(1-q)/2}(\log d)^{(1-q)/2}+d^{-1}m_d.$$
Therefore, according to (3.21) and (3.4), and using the triangle inequality, we have
$$\max_{1\le k,l\le d}\big|\widehat{\Gamma}_{kl}-\Gamma_{kl}\big|\le\max_{1\le k\le r,\,1\le l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\bar{Z}^{\star}_{k,i}\bar{Z}^{\star}_{l,i}-\int_0^tg_{s,lk}\,ds\Big|+\max_{1\le k,l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\Big[\big(\bar{Y}^{\star}_{k,i}-(\beta\widetilde{X}^{\star}_i)_k\big)\big(\bar{Y}^{\star}_{l,i}-(\beta\widetilde{X}^{\star}_i)_l\big)-\bar{Z}^{\star}_{o,ki}\bar{Z}^{\star}_{o,li}\Big]\Big|$$
$$\le C\big(n_\delta^{-1/2}\sqrt{\log d}\big)+C\,d^{-1/2}m_d^{1/2}\,n_\delta^{-(1-q)/4}(\log d)^{(1-q)/4}+d^{-1/2}m_d^{1/2}\le C\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d^{1/2}\big).$$
The remaining steps are similar to the TSR case. This concludes the proof.
where $\{\lambda_j,1\le j\le d\}$ and $\{\xi_j,1\le j\le d\}$ are the eigenvalues of $\Sigma$ and their corresponding eigenvectors, and $\bar{r}=\arg\min_{1\le j\le d}\big(\frac{\lambda_j}{d}+jd^{-1/2}m_d\big)-1$.
Lemma 13. Suppose Assumptions 1 - 4 hold, and $n_\delta^{-1/2}\sqrt{\log d}=o(1)$; then we have
$$\max_{1\le k\le r,\,1\le l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\bar{X}_{ik}\bar{X}_{il}-\int_0^te_{s,kl}\,ds\Big|=O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big),\qquad(3.22)$$
$$\max_{1\le k\le r,\,1\le l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\bar{X}_{ik}\bar{Z}^{\star}_{u,li}\Big|=O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big),\qquad(3.23)$$
$$\max_{1\le k\le r,\,1\le l\le d}\Big|\frac{n}{(n-k_n+2)\psi_2k_n}\sum_{i=0}^{n-k_n+1}\bar{Z}^{\star}_{u,ki}\bar{Z}^{\star}_{u,li}-\int_0^tg_{s,kl}\,ds\Big|=O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big).\qquad(3.24)$$
Recall that
$$\Lambda=\mathrm{Diag}\big(\lambda_1,\lambda_2,\dots,\lambda_r\big),\qquad F=\sqrt{d}\,\big(\xi_1,\xi_2,\dots,\xi_r\big),\qquad G=d^{-1}F^{\top}Y.$$
We write
$$H=\frac{n}{(n-k_n+2)\psi_2k_n\,t}\,\bar{X}^{\star}\bar{X}^{\star\top}\beta^{\top}F\Lambda^{-1}.$$
Moreover, $\widehat{\Sigma}F=F\Lambda$, $GG^{\top}=t\,d^{-1}\Lambda$, $F^{\top}F=d\,I_r$, and
$$\widehat{\Gamma}=\frac1t\big(Y-FG\big)\big(Y-FG\big)^{\top}=\frac1tYY^{\top}-\frac1dF\Lambda F^{\top}.$$
Lemma 14. Suppose Assumptions 1 - 4 hold with $\lambda_{ij}=O\big(n_\delta^{-1/2}\sqrt{\log d}\big)$. Suppose $d^{-1/2}m_d=o(1)$, $n_\delta^{-1/2}\sqrt{\log d}=o(1)$, and $\widehat{r}\to r$ with probability approaching 1. Then there exists an $r\times r$ matrix $H$ such that, with probability approaching 1, $H$ is invertible, $\|HH^{\top}-I_r\|=\|H^{\top}H-I_r\|=o_p(1)$, and, more importantly,
$$\|F-\beta H\|_{\mathrm{MAX}}=O_p\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big),\qquad\big\|G-H^{-1}\bar{X}\big\|=O_p\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big).$$
Lemma 15. Under Assumptions 1 - 4, $d^{-1/2}m_d=o(1)$, and $n_\delta^{-1/2}\sqrt{\log d}=o(1)$, we have
$$\|F-\beta H\|_{\mathrm{MAX}}=O_p\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big),\qquad(3.25)$$
$$\big\|H^{-1}\big\|=O_p(1),\qquad(3.26)$$
$$\big\|G-H^{-1}\bar{X}\big\|=O_p\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big).\qquad(3.27)$$
Lemma 16. Under Assumptions 1 - 4, $d^{-1/2}m_d=o(1)$, and $n_\delta^{-1/2}\sqrt{\log d}=o(1)$, we have
$$\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|_{\mathrm{MAX}}\le\big\|\widehat{\Gamma}-\Gamma\big\|_{\mathrm{MAX}}=O_p\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big).\qquad(3.28)$$
Lemma 17. Under Assumptions 1 - 4, $d^{-1/2}m_d=o(1)$, and $n^{-1/2+\delta}\log d=o(1)$, we have
$$\Big\|\frac1tFGG^{\top}F^{\top}-\beta E\beta^{\top}\Big\|_{\mathrm{MAX}}=O_p\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big).$$
Proof of Proposition 1 and Lemmas 13 - 17. The proofs follow the same arguments as in Aït-Sahalia and Xiu (2017); thus we omit the details.
Lemma 18. Under Assumptions 1 - 4, $d\big(n_\delta^{-1/2}\sqrt{\log d}\big)^2=o(1)$, $n^{\delta-1/2}\log d=o(1)$, and $d^{-1/2}m_d=o(1)$, we have
$$\text{(i)}\quad\|F-\beta H\|_F^2=O_p\big(d\big(n_\delta^{-1/2}\sqrt{\log d}\big)^2+m_d^2\big).\qquad(3.29)$$
Proof of Lemma 18. (i) We have $\|F-\beta H\|_F^2\le d\,\|F-\beta H\|_{\mathrm{MAX}}^2$.
(ii) According to the definition of $\|\cdot\|_{\Sigma}$, we have
$$\big\|(F-\beta H)(F-\beta H)^{\top}\big\|_{\Sigma}^2=O_p\Big(\frac1d\|F-\beta H\|_F^4\Big)=O_p\big(d\,\|F-\beta H\|_{\mathrm{MAX}}^4\big),$$
$$\big\|\beta H(F-\beta H)^{\top}\big\|_{\Sigma}^2=\frac1d\mathrm{tr}\big(H(F-\beta H)^{\top}\Sigma^{-1}(F-\beta H)H\beta^{\top}\Sigma^{-1}\beta\big)\le\frac1d\|H\|^2\big\|\beta^{\top}\Sigma^{-1}\beta\big\|\,\big\|\Sigma^{-1}\big\|\,\|F-\beta H\|_F^2=O_p\big(\|F-\beta H\|_{\mathrm{MAX}}^2\big),$$
$$\big\|\beta(H^{\top}H-I_r)\beta^{\top}\big\|_{\Sigma}^2\le\frac1d\mathrm{tr}\big((H^{\top}H-I_r)\beta^{\top}\Sigma^{-1}\beta(H^{\top}H-I_r)\beta^{\top}\Sigma^{-1}\beta\big)\le\frac1d\big\|\beta^{\top}\Sigma^{-1}\beta\big\|^2\|H^{\top}H-I_r\|_F^2=o_p(1).$$
Lemma 19. Under Assumptions 1 - 4, $d^{-1/2}m_d=o(1)$, and $n_\delta^{-1/2}\sqrt{\log d}=o(1)$, we have
$$\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|=O_p\big(m_d\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big).\qquad(3.33)$$
Moreover, if in addition $d^{-1/2}m_d^2=o(1)$ and $m_d\,n_\delta^{-1/2}\sqrt{\log d}=o(1)$ hold, then $\lambda_{\min}\big(\widetilde{\Gamma}^{S}\big)$ is bounded away from 0 with probability approaching 1, and
$$\big\|\big(\widetilde{\Gamma}^{S}\big)^{-1}-\Gamma^{-1}\big\|=O_p\big(m_d\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big).\qquad(3.34)$$
Proof of Lemma 19. Note that since $\widetilde{\Gamma}^{S}-\Gamma$ is symmetric,
$$\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|\le\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|_{\infty}=\max_{1\le l\le d}\sum_{k=1}^{d}\big|\widetilde{\Gamma}^{S}_{lk}-\Gamma_{lk}\big|.$$
By Lemma 16, and using the same technique as in the proof of (3.10), we have
$$\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|=O_p\big(m_d\,S^{-q}\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)+m_d\,S^{1-q}\big).$$
Choosing $\lambda_{ij}=M'\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)$, where $M'$ is some positive constant, we have
$$\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|=O_p\big(m_d\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big).$$
Moreover, since $\lambda_{\min}(\Gamma)>K$ for some constant $K$, by Weyl's inequality we have $\lambda_{\min}\big(\widetilde{\Gamma}^{S}\big)>K-o_p(1)$. As a result,
$$\big\|\big(\widetilde{\Gamma}^{S}\big)^{-1}-\Gamma^{-1}\big\|=\big\|\big(\widetilde{\Gamma}^{S}\big)^{-1}\big(\Gamma-\widetilde{\Gamma}^{S}\big)\Gamma^{-1}\big\|\le\lambda_{\min}\big(\widetilde{\Gamma}^{S}\big)^{-1}\lambda_{\min}(\Gamma)^{-1}\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|\le O_p\big(m_d\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big).$$
Recall that
$$\widehat{\Sigma}^{\mathrm{PCA}}=\frac1dF\Lambda F^{\top}+\widetilde{\Gamma}^{S}=\frac1tFGG^{\top}F^{\top}+\widetilde{\Gamma}^{S}.$$
By Lemma 16, we have
$$\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|_{\mathrm{MAX}}=O_p\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big),$$
hence
$$\big\|\widehat{\Sigma}^{\mathrm{PCA}}-\Sigma\big\|_{\mathrm{MAX}}\le\Big\|\frac1dF\Lambda F^{\top}-\beta E\beta^{\top}\Big\|_{\mathrm{MAX}}+\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|_{\mathrm{MAX}},$$
and
$$\big\|\widehat{\Sigma}^{\mathrm{PCA}}-\Sigma\big\|_{\Sigma}^2\le C\Big[\big\|\beta(H^{\top}H-I_r)\beta^{\top}\big\|_{\Sigma}^2+\big\|\beta H(F-\beta H)^{\top}\big\|_{\Sigma}^2+\big\|(F-\beta H)(F-\beta H)^{\top}\big\|_{\Sigma}^2\Big]+\big\|\widetilde{\Gamma}^{S}-\Gamma\big\|_{\Sigma}^2$$
$$=O_p\Big(d\big(n_\delta^{-1/2}\sqrt{\log d}\big)^4+\frac1dm_d^4+m_d^2\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{2(1-q)}\Big).$$
For the inverse, first, by Lemma 19 and the fact that $\lambda_{\min}\big(\widehat{\Sigma}^{\mathrm{PCA}}\big)\ge\lambda_{\min}\big(\widetilde{\Gamma}^{S}\big)$, we can write
$$\big(\widehat{\Sigma}^{\mathrm{PCA}}\big)^{-1}-\widetilde{\Sigma}^{-1}=\Big(t^{-1}FGG^{\top}F^{\top}+\widetilde{\Gamma}^{S}\Big)^{-1}-\Big(t^{-1}\beta HH^{-1}\bar{X}^{\star}\bar{X}^{\star\top}(H^{-1})^{\top}H^{\top}\beta^{\top}+\Gamma\Big)^{-1}$$
$$=\Big[\big(\widetilde{\Gamma}^{S}\big)^{-1}-\Gamma^{-1}\Big]-\Big[\big(\widetilde{\Gamma}^{S}\big)^{-1}-\Gamma^{-1}\Big]F\Big(d\Lambda^{-1}+F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)^{-1}F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}-\Gamma^{-1}F\Big(d\Lambda^{-1}+F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)^{-1}F^{\top}\Big[\big(\widetilde{\Gamma}^{S}\big)^{-1}-\Gamma^{-1}\Big]$$
$$\quad+\Gamma^{-1}\big(\beta H-F\big)\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}H^{\top}\beta^{\top}\Gamma^{-1}-\Gamma^{-1}F\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}\big(F^{\top}-H^{\top}\beta^{\top}\big)\Gamma^{-1}$$
$$\quad+\Gamma^{-1}F\Big[\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}-\Big(d\Lambda^{-1}+F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)^{-1}\Big]F^{\top}\Gamma^{-1}$$
$$=:L_1+L_2+L_3+L_4+L_5+L_6.$$
By Lemma 19, we have
$$\|L_1\|=O_p\big(m_d\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big).$$
For $L_2$, because $\|F\|=O_p(d^{1/2})$, $\lambda_{\max}\big((\widetilde{\Gamma}^{S})^{-1}\big)\le\lambda_{\min}\big(\widetilde{\Gamma}^{S}\big)^{-1}\le K+o_p(1)$, and
$$\lambda_{\min}\Big(d\Lambda^{-1}+F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)\ge\lambda_{\min}\Big(F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)\ge\lambda_{\min}\big(F^{\top}F\big)\,\lambda_{\min}\big((\widetilde{\Gamma}^{S})^{-1}\big)\ge m_d^{-1}d,$$
we have
$$\|L_2\|\le\big\|\big(\widetilde{\Gamma}^{S}\big)^{-1}-\Gamma^{-1}\big\|\,\|F\|\,\Big\|\Big(d\Lambda^{-1}+F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)^{-1}\Big\|\,\big\|F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}\big\|=O_p\big(m_d\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big).$$
The same bound holds for $\|L_3\|$. As for $L_4$, note that $\|\beta\|=O_p(d^{1/2})$, $\|\Gamma^{-1}\|\le\lambda_{\min}(\Gamma)^{-1}\le K$, $\|H\|=O_p(1)$, $\|\beta H-F\|\le\sqrt{rd}\,\|\beta H-F\|_{\mathrm{MAX}}=O_p\big(d^{1/2}n_\delta^{-1/2}\sqrt{\log d}+m_d\big)$, and that
$$\lambda_{\min}\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)\ge\lambda_{\min}\big(H^{\top}\beta^{\top}\Gamma^{-1}\beta H\big)>Km_d^{-1}d;$$
hence we have
$$\|L_4\|\le\big\|\Gamma^{-1}\big\|\,\|\beta H-F\|\,\Big\|\Big(tH^{\top}\big(\bar{X}^{\star}\bar{X}^{\star\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}\Big\|\,\big\|H^{\top}\beta^{\top}\big\|\,\big\|\Gamma^{-1}\big\|=O_p\big(m_d\,n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d^2\big).$$
The same bound holds for $L_5$. Finally, with respect to $L_6$, we have
$$\Big\|\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}-\Big(d\Lambda^{-1}+F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)^{-1}\Big\|\le Kd^{-2}m_d^2\,\Big\|\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)-\Big(d\Lambda^{-1}+F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\Big)\Big\|,$$
and
$$\big\|H^{\top}\beta^{\top}\Gamma^{-1}\beta H-F^{\top}\big(\widetilde{\Gamma}^{S}\big)^{-1}F\big\|\le\big\|(H^{\top}\beta^{\top}-F^{\top})\Gamma^{-1}\beta H\big\|+\big\|F^{\top}\Gamma^{-1}(\beta H-F)\big\|+\big\|F^{\top}\big(\Gamma^{-1}-(\widetilde{\Gamma}^{S})^{-1}\big)F\big\|=O_p\big(d\,m_d\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big),$$
so that
$$\|L_6\|=O_p\big(m_d^3\big(n_\delta^{-1/2}\sqrt{\log d}+d^{-1/2}m_d\big)^{1-q}\big).$$
On the other hand, using the Sherman-Morrison-Woodbury formula again,
$$\big\|\widetilde{\Sigma}^{-1}-\Sigma^{-1}\big\|=\big\|\big(t^{-1}\beta\bar{X}\bar{X}^{\top}\beta^{\top}+\Gamma\big)^{-1}-\big(\beta E\beta^{\top}+\Gamma\big)^{-1}\big\|$$
$$\le\big\|\Gamma^{-1}\big\|^2\,\|\beta H\|^2\,\Big\|\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}-\Big(H^{\top}E^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}\Big\|$$
$$\le Kd\,\Big\|\Big(tH^{\top}\big(\bar{X}\bar{X}^{\top}\big)^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}\Big\|\,\Big\|\Big(H^{\top}E^{-1}H+H^{\top}\beta^{\top}\Gamma^{-1}\beta H\Big)^{-1}\Big\|\,\big\|t\big(\bar{X}\bar{X}^{\top}\big)^{-1}-E^{-1}\big\|=O_p\big(m_d\,n_\delta^{-1/2}\sqrt{\log d}\big).$$
By the triangle inequality, we obtain the desired bound on $\big\|\big(\widehat{\Sigma}^{\mathrm{PCA}}\big)^{-1}-\Sigma^{-1}\big\|$.
Bibliography
Adams, S., P. A. Beling, and R. Cogill (2016): “Feature selection for hidden
Markov models and hidden semi-Markov models,” IEEE Access, 4, 1642–1657.
Adhikari, A., A. Ram, R. Tang, and J. Lin (2019a): “Docbert: Bert for document
classification,” arXiv preprint arXiv:1904.08398.
Aït-Sahalia, Y., and D. Xiu (2016): “A Hausman Test for the Presence of Market
Microstructure Noise in High Frequency Data,” Journal of Econometrics, forthcom-
ing.
Alizadeh, A., and N. Nomikos (2004): “A Markov regime switching approach for
hedging stock indices,” Journal of Futures Markets: Futures, Options, and Other
Derivative Products, 24(7), 649–674.
Ang, A., and G. Bekaert (2002): “Regime switches in interest rates,” Journal of
Business & Economic Statistics, 20(2), 163–182.
Bai, J. (2003): “Inferential Theory for Factor Models of Large Dimensions,” Econo-
metrica, 71(1), 135–171.
Cai, J. (1994): “A Markov model of switching-regime ARCH,” Journal of Business &
Economic Statistics, 12(3), 309–316.
Cai, T., and W. Liu (2011): “Adaptive Thresholding for Sparse Covariance Matrix
Estimation,” Journal of the American Statistical Association, 106(494), 672–684.
Da, R., and D. Xiu (2017): “When Moving-Average Models Meet High-Frequency
Data: Uniform Inference on Volatility,” Discussion paper, University of Chicago.
Dai, C., K. Lu, and D. Xiu (2019): “Knowing factors or factor loadings, or neither?
Evaluating estimators of large covariance matrices with noisy and asynchronous
data,” Journal of Econometrics, 208(1), 43–79.
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova (2018): “Bert: Pre-
training of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805.
Ding, X., Y. Zhang, T. Liu, and J. Duan (2014): “Using structured events to
predict stock price movement: An empirical investigation,” in Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1415–1425.
(2015): “Deep learning for event-driven stock prediction,” in Twenty-fourth
international joint conference on artificial intelligence.
Donoho, D. L., et al. (2000): “High-Dimensional Data Analysis: The Curses and
Blessings of Dimensionality,” AMS Math Challenges Lecture, pp. 1–32.
Duan, J., X. Ding, and T. Liu (2018): “Learning sentence representations over tree
structures for target-dependent classification,” in Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long Papers), pp. 551–560.
Engle, C., and J. D. Hamilton (1990): “Long swings in the dollar: are they in the
data and do markets know it,” American Economic Review, 80(4), 689–713.
Fama, E. F., and K. R. French (1993): “Common Risk Factors in the Returns on
Stocks and Bonds,” Journal of Financial Economics, 33(1), 3–56.
Fan, J., Y. Fan, and J. Lv (2008): “High Dimensional Covariance Matrix Estimation
Using a Factor Model,” Journal of Econometrics, 147(1), 186–197.
Fan, J., A. Furger, and D. Xiu (2016): “Incorporating Global Industrial Classifica-
tion Standard Into Portfolio Allocation: A Simple Factor-Based Large Covariance
Matrix Estimator With High-Frequency Data,” Journal of Business & Economic
Statistics, 34(4), 489–503.
Fan, J., and R. Li (2001): “Variable selection via nonconcave penalized likelihood
and its oracle properties,” Journal of the American statistical Association, 96(456),
1348–1360.
Fan, J., Y. Liao, and M. Mincheva (2013): "Large Covariance Estimation by Thresholding Principal Orthogonal Complements," Journal of the Royal Statistical Society, B, 75(4), 603–680, with 33 discussions by 57 authors and a reply by Fan, Liao and Mincheva.
Fan, J., J. Zhang, and K. Yu (2012): “Vast Portfolio Selection with Gross-Exposure
Constraints,” Journal of the American Statistical Association, 107(498), 592–606.
Forni, M., and M. Lippi (2001): “The Generalized Dynamic Factor Model: Repre-
sentation Theory,” Econometric Theory, 17, 1113–1141.
He, K., X. Zhang, S. Ren, and J. Sun (2016): “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 770–778.
Hu, Z., W. Liu, J. Bian, X. Liu, and T.-Y. Liu (2018): “Listening to chaotic
whispers: A deep learning framework for news-oriented stock trend prediction,” in
Proceedings of the eleventh ACM international conference on web search and data
mining, pp. 261–269.
Kahn, R. N., P. Brougham, and P. Green (1998): United States Equity Model (USE3) – Risk Model Handbook. MSCI Inc.
Kim, D., Y. Wang, and J. Zou (2016): “Asymptotic Theory for Large Volatility
Matrix Estimation Based on High-Frequency Financial Data,” Stochastic Processes
and Their Applications, 126(11), 3527–3577.
King, B. F. (1966): “Market and Industry Factors in Stock Price Behavior,” The
Journal of Business, 39(1), 139–190.
Ledoit, O., and M. Wolf (2004): “Honey, I Shrunk the Sample Covariance Matrix,”
Journal of Portfolio Management, 30(4), 110–119.
Luss, R., and A. d’Aspremont (2015): “Predicting abnormal returns from news
using text classification,” Quantitative Finance, 15(6), 999–1012.
Monbet, V., and P. Ailliot (2017): “Sparse vector Markov switching autoregressive
models. Application to multivariate time series of temperature,” Computational
Statistics & Data Analysis, 108, 40–51.
Moon, H. R., and M. Weidner (2015): “Linear Regression for Panel with Unknown
Number of Factors as Interactive Fixed Effects,” Econometrica, 83(4), 1543–1579.
Pedersen, L. H. (2019): Efficiently inefficient: how smart money invests and market
prices are determined. Princeton University Press.
Reiß, M., V. Todorov, and G. Tauchen (2015): “Nonparametric Test for a
Constant Beta Between Itô semi-Martingales Based on High-Frequency Data,”
Stochastic Processes and Their Applications, 125(8), 2955–2988.
Schumaker, R. P., and H. Chen (2009): “Textual analysis of stock market predic-
tion using breaking financial news: The AZFin text system,” ACM Transactions on
Information Systems (TOIS), 27(2), 1–19.
Tao, M., Y. Wang, and X. Chen (2013): “Fast Convergence Rates in Estimating
Large Volatility Matrices Using High-Frequency Financial Data,” Econometric
Theory, 29(4), 838–856.
Tao, M., Y. Wang, Q. Yao, and J. Zou (2011): “Large Volatility Matrix Inference
via Combining Low-Frequency and High-Frequency Approaches,” Journal of the
American Statistical Association, 106(495), 1025–1040.
Tao, M., Y. Wang, and H. H. Zhou (2013): “Optimal Sparse Volatility Matrix
Estimation for High-Dimensional Itô Processes with Measurement Errors,” Annals
of Statistics, 41(4), 1816–1864.
Todorov, V., and T. Bollerslev (2010): “Jumps and Betas: A New Framework
for Disentangling and Estimating Systematic Risks,” Journal of Econometrics, 157,
220–235.
Viennet, G. (1997): “Inequalities for absolutely regular sequences: application to
density estimation,” Probability theory and related fields, 107(4), 467–492.
Wang, Y., and J. Zou (2010): “Vast Volatility Matrix Estimation for High-Frequency
Financial Data,” Annals of Statistics, 38(2), 943–978.
Xu, Y., and S. B. Cohen (2018): “Stock movement prediction from tweets and
historical prices,” in Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 1970–1979.
Yang, Z., D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016): “Hi-
erarchical attention networks for document classification,” in Proceedings of the
2016 conference of the North American chapter of the association for computational
linguistics: human language technologies, pp. 1480–1489.
Zou, H. (2006): “The adaptive lasso and its oracle properties,” Journal of the American
statistical association, 101(476), 1418–1429.
Zou, H., and T. Hastie (2005): “Regularization and variable selection via the elastic
net,” Journal of the royal statistical society: series B (statistical methodology), 67(2),
301–320.