Statistical and Machine Learning

Methods For Financial Data

Kun Lu

A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy

Recommended for Acceptance


by the Department of
Operations Research and Financial Engineering
Adviser: Professor John M. Mulvey

May 2021
© Copyright by Kun Lu, 2021.
All rights reserved.
Abstract

This dissertation focuses on developing new statistical and machine learning methods
for financial applications. We first propose a new model, the Feature Augmented
Hidden Markov Model (FAHMM), which extends the traditional Hidden Markov
Model (HMM) by adding a feature structure. The model is general in two respects:
(1) the emission distribution may take different forms (e.g., exponential family);
(2) different feature structures (e.g., high dimensionality, multicollinearity) are handled
by adding different penalization terms. A theoretical proof of convergence, simulations,
and an empirical application to currency regime identification are provided.
Next, we develop a new neural natural language processing model that combines
reinforcement learning with the Bidirectional Encoder Representations from
Transformers (BERT) model for long-document classification. Because BERT accepts
at most 512 tokens, it cannot handle long documents, which are common in financial
data (e.g., financial news, earnings transcripts). We therefore train a reinforcement
learning model together with the BERT-based model end to end, using policy-gradient
reinforcement learning to select sentences/chunks. We then apply the model to earnings
conference call transcripts and predict the stock price movement after the call.
Finally, we develop a method to estimate high-dimensional covariance matrices
using high-frequency data. We use a factor structure and thresholding methods to
deal with high dimensionality, and pre-averaging and refresh-time sampling to tackle
the particular features of high-frequency data: microstructure noise and asynchronicity.
We consider three scenarios, in which only the factors are known, only the loadings
are known, or neither is known. Theoretical proofs and simulations support the theory,
and a horse race on out-of-sample portfolio allocation is conducted with the Dow Jones
30, S&P 100, and S&P 500 index constituents, respectively.
iii
Acknowledgements

First and foremost, I want to express my sincerest gratitude to my advisor,
Professor John M. Mulvey, for his guidance along my journey at Princeton. His brilliant
and insightful views on research and finance greatly influenced my development into a
mature researcher and my understanding of the finance world. I really appreciate his
mentorship; at the same time, he always gave me a great deal of freedom to explore my
own research interests and to gain a deeper understanding by working in industry. I
am very grateful for all the valuable opportunities he provided; without his guidance
and kindness, none of my achievements today would have been possible.
I am also sincerely grateful to the members of my FPO committee, Professor Mykhaylo
Shkolnikov and Professor Boris Hanin, for serving on my committee and providing
valuable comments. I would also like to thank Professor Jason Klusowski for spending
his precious time reading my dissertation very carefully and providing constructive
feedback. I also want to thank Professor Jianqing Fan, Professor Mykhaylo Shkolnikov,
and Dr. Robert Almgren for their excellent leadership in teaching. Special thanks to
Dr. Robert Almgren for his guidance on my internship project at his company and
for the insightful and sincere career advice he gave me the day I left. Thanks to
Professor Danqi Chen for taking the time to give me advice on my work. I also want
to express my sincere gratitude to my two advisors at the University of Chicago,
Professor Wei Biao Wu and Professor Dacheng Xiu. Although I did not get the chance
to visit them during the last few years of my PhD studies, their guidance and help are
the reasons I am here. Special thanks to my undergraduate advisor at Renmin
University of China, Professor Hanfang Yang, who is not only my advisor but also a
life-long friend.
I appreciate the financial support from Princeton University and the teaching
assistantships I received. I also want to express my gratitude to all the professors,
scholars, and friends I met at Princeton. Thank you for an amazing five years.
Lastly, I would not have been able to finish this thesis without the sincerest love and
support of my parents, who are the greatest parents in the world. My dearest love
goes to my girlfriend, Yixu Lin, who has always been my strongest supporter. She
helped me through the tough times and shared the happy days with me. Having them
is the luckiest thing in my life.

v
To my family.

vi
Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

1 Feature Augmented Hidden Markov Model: A Unified Framework


and Theoretical Guarantees 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Model Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Main Results for General Emission Distribution and Regularizer . . . 10
1.4.1 Decomposable Regularizer . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Conditions on Q functions . . . . . . . . . . . . . . . . . . . . 12
1.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Concrete Results for Gaussian Emission and L1 Penalty . . . . . . . . 18
1.6 Determination of number of states . . . . . . . . . . . . . . . . . . . . 19
1.7 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.8 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

vii
1.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.10 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.11 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2 A Deep Reinforced Model For Long Document Classification With


An Application In Earnings Transcript Data 39
2.1 An Overview of Text Classification . . . . . . . . . . . . . . . . . . . 39
2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.1 Classification Label . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4 Model Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.1 BERT based Model . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.2 A Deep Reinforced Model . . . . . . . . . . . . . . . . . . . . 48
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.5.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.5.2 Training Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.5.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . 54
2.5.4 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.6 Results and discussions . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6.1 Direction Prediction . . . . . . . . . . . . . . . . . . . . . . . 56
2.6.2 Abnormal Return Prediction . . . . . . . . . . . . . . . . . . . 57
2.6.3 Market Simulation . . . . . . . . . . . . . . . . . . . . . . . . 58
2.6.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3 Large Covariance Matrix Estimation Using High Frequency Data 63


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

viii
3.2 Model Setup and Assumptions . . . . . . . . . . . . . . . . . . . . . 68
3.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.2 Factor Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.3 Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.4 Microstructure Noise . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.5 Asynchronicity . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.3 Three Estimators and their Asymptotic Properties . . . . . . . . . . . 77
3.3.1 Time-Series Regression (TSR) . . . . . . . . . . . . . . . . . . 78
3.3.2 Cross-Sectional Regression (CSR) . . . . . . . . . . . . . . . 80
3.3.3 Principal Component Analysis (PCA) . . . . . . . . . . . . . 83
3.4 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.4.1 Choice of δ and kn . . . . . . . . . . . . . . . . . . . . . . . . 86
3.4.2 Choice of r and f(n, d) . . . . . . . . . . . . . . . . . . . . . . 87
3.4.3 Choice of the Thresholding Methods . . . . . . . . . . . . . . 88
3.5 Monte Carlo Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.6 Empirical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.6.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.6.2 Out-of-Sample Portfolio Allocation . . . . . . . . . . . . . . . 95
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.1 Mathematical Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.1.1 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . 109
3.1.2 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . 127
3.1.3 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . 127
3.1.4 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . 129

Bibliography 138

ix
List of Tables

1.1 AIC for different number of states. . . . . . . . . . . . . . . . . . . . 20


1.2 Simulation results for different models. . . . . . . . . . . . . . . . . . 21
1.3 Coefficients comparison between FAHMM with and without regularization. 22

2.1 The Structure of the Dataset. . . . . . . . . . . . . . . . . . . . . . . 53


2.2 Results for task 1: direction prediction. . . . . . . . . . . . . . . . . 56
2.3 Results for task 2: Abnormal return prediction. . . . . . . . . . . . . 57

3.1 Pre-Averaging v.s. Subsampling using Location and Hard Thresholding 98


3.2 Pre-Averaging v.s. Subsampling using Soft, SCAD, and AL Threshold-
ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3 Impact of the Selected Number of Factors on the PCA Method . . . 100
3.4 Diebold-Mariano Tests for the Out-of-Sample Risk Comparison . . . 101

x
List of Figures

1.1 Graphical illustration of Hidden Markov Model with factor structure


and general emission distribution. . . . . . . . . . . . . . . . . . . . . 5
1.2 The top picture is the simulated true states. The rest are estimated
probabilities of state 1 and state 2 using regularized FAHMM with
Elastic Net penalty, lasso penalty and ordinary FAHMM (without
regularizer, least square) accordingly. . . . . . . . . . . . . . . . . . . 22
1.3 The upper picture is regime identification results for BRL/USD using
LASSO FAHMM; the lower picture is the regime identification results
using Least Square FAHMM. The red and blue line stand for Regime 1
and Regime 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4 The upper picture is the estimated probability for regime 1 and regime 2
for BRL/USD using LASSO FAHMM; the lower picture is the estimated
probability for regime 1 and regime 2 for BRL/USD using Least Square
FAHMM. The red and blue line stand for Regime 1 and Regime 2. . . 23

2.1 Zoom’s stock price movement after Q4 2020 earnings conference call. 42
2.2 BERT-based model’s architecture. . . . . . . . . . . . . . . . . . . . . 47
2.3 The Deep Reinforced Model’s architecture. . . . . . . . . . . . . . . . 50
2.4 Distribution of returns among the 1,500 samples. . . . . . . . . . . . 54
2.5 Returns distribution of label 1 (left) and label -1 (right). . . . . . . . 58
2.6 Alpha distribution of label 1 (left) and label -1 (right). . . . . . . . . 59

xi
2.7 Accumulative return for directional betting for test data from 2019-07-17
to 2020-02-21, in total of 103 trades. . . . . . . . . . . . . . . . . . . 60
2.8 Accumulative return for delta hedging for test data from 2018-10-31 to
2019-06-27, in total of 63 trades. . . . . . . . . . . . . . . . . . . . . . 61
2.9 Comparison between the models only using text, only using financial
data and combining them together. . . . . . . . . . . . . . . . . . . . 62

3.1 Trading Intensity of Assets . . . . . . . . . . . . . . . . . . . . . . . 102


3.2 Time Series of Factors Used in TSR . . . . . . . . . . . . . . . . . . 103
3.3 Time Series of Estimated Factors by CSR . . . . . . . . . . . . . . . 104
3.4 Time Series of Estimated Factors by PCA . . . . . . . . . . . . . . . 105
3.5 Generalized Correlation Plot . . . . . . . . . . . . . . . . . . . . . . 106
3.6 Out-of-Sample Risk of the S&P 500 Portfolio . . . . . . . . . . . . . 107
3.7 Out-of-Sample Risk of the S&P 100 Portfolio . . . . . . . . . . . . . 108
3.8 Out-of-Sample Risk of the Dow Jones 30 Portfolio . . . . . . . . . . 108

xii
Chapter 1

Feature Augmented Hidden Markov


Model: A Unified Framework
and Theoretical Guarantees

1.1 Introduction

The hidden Markov model plays a significant role in machine learning and other
domains, especially for applications in temporal pattern recognition such as speech,
handwriting, bioinformatics and financial economics. With the explosive growth of
data, however, the selection of features in an HMM can be burdensome. Traditionally,
in finance, the modeler must choose her data to be included by reference to expert
judgement, theoretical studies, and analyses of historical performance. There have been
few papers that focus on the feature selection problem in the HMM setting. (Adams,
Beling, and Cogill 2016) proposed the Feature Saliency Hidden Markov Model, which
introduces a vector of binary variables, called saliency variables, to indicate whether a
feature is significant. This approach complicates the search algorithm and is difficult
to extend to high dimensional applications. Instead,
we propose a feature/factor structural model for the output emission variables and
incorporate a regularization within the M-estimation step. This allows for high
dimensional applications.
An early representative form of factor modeling within an HMM is regime-switching
regression. The idea of switching between two regressions dates back to (Quandt 1958),
which assumes two regimes that can be fitted by separate regressions using maximum
likelihood estimation. (Goldfeld and Quandt 1973) were the first to model the change
of latent states as a Markov chain. (Hamilton 1989) extends the Markov switching
model to the autoregressive case and applies it to investigate the U.S. business cycle.
(Kim, Piger, and Startz 2008) relax the assumption that the latent state variable is
exogenous, and develop an endogenous Markov switching regression model.
Today, there is a vast literature on Markov switching regression applications in
economics and finance: Foreign exchange rates: (Engle and Hamilton 1990), (Cai 1994),
(Jeanne and Masson 2000); stock market: (Hamilton and Susmel 1994), (Alizadeh
and Nomikos 2004), (Lee, Yoder, Mittelhammer, and McCluskey 2006); financial
derivatives: (Buffington and Elliott 2002), (Alizadeh, Nomikos, and Pouliasis 2008);
the term structure of interest rates: (Ang and Bekaert 2002), and portfolio allocation:
(Valdez and Vargiolu 2013), among others. Markov switching models appear in other
areas: speech recognition (Rabiner, Juang, and Rutledge 1993) and computer vision
(Horst and Michael 2001).
On the other hand, with the development of technology and the decreasing cost of
acquiring large amounts of data, high dimensional data are nowadays used more and
more frequently in many fields. However, there are few studies involving high
dimensional data within the HMM framework. Exceptions are (Hering, Kazor,
and Kleiber 2015) and (Ailliot, Monbet, and Prevosto 2006), who search for parametric
shapes for the autoregressive and covariance matrices, typically assuming that the
off-diagonal entries of the matrices decrease as the distance from the diagonal increases.
(Monbet and Ailliot 2017) applied the SCAD penalty to Markov Switching Vector
Autoregressions (MSVAR) to deal with the dimensionality of VAR problems.
Contributions: First, we propose a novel model, the Feature Augmented
Hidden Markov Model (FAHMM), which improves the performance of the Hidden Markov
Model by giving the output variable a factor structure. When features x are
available in addition to the output variable y, basic HMMs stack x and y into an output
vector. However, by doing this, the inferred latent states come from x and y together,
rather than from y itself, which may cause serious errors in practice. By instead
modeling x through a factor structure for y, our FAHMM can correctly identify the true
states. We also cover different forms of emission distribution, and by adding different
regularization terms, we can achieve several practical goals, e.g. feature selection and
dealing with high dimensionality or multicollinearity.
Second, to fit our model, we design a new algorithm, the regularized Baum-Welch
algorithm, which is the first Baum-Welch algorithm for the high dimensional setting. In
the high dimensional case, the M-step is normally unstable or not even well defined.
We solve this issue by introducing a regularization. We show how to perform
this iterative regularization and appropriately update the regularization parameter.
In addition, we establish a general local convergence theorem and statistical
and computational guarantees for this algorithm. The central challenge is to handle
the dependence issue and the high dimensionality issue at the same time, since we
must control both the optimization error and the statistical error.
Last, as far as we know, our paper provides the first convergence analysis for Markov
switching regression, which can be regarded as a special case of our model and
is widely used in econometrics. The local convergence theorem and theoretical
guarantees for this case are presented in Section 1.5.
Related Work: Our work first builds upon a framework for analysis of the
EM algorithm. Until recently, there had been few papers discussing its convergence
and theoretical guarantees. (Balakrishnan, Wainwright, Yu, et al. 2017) propose a
theoretical framework for the EM algorithm's convergence to a small neighborhood of
the global optimum; (Yi and Caramanis 2015) then extend this analysis framework
to the regularized EM case. Both of these works, however, apply to models with i.i.d.
data, whereas for hidden Markov models, since the data are dependent, the population Q
function might not even exist. (Yang, Balakrishnan, and Wainwright 2017) generalize
the analysis to dependent data, and give statistical and computational guarantees for
the Baum-Welch algorithm. Our paper extends the analysis of Baum-Welch to the
high dimensional case, where we require a factor structure for the output variable.
To build as general a framework as possible, we also borrow results and analysis from
(Negahban, Ravikumar, Wainwright, Yu, et al. 2012). They provide a unified frame-
work for establishing convergence rates and consistency for regularized M-estimators
under high-dimensional scaling. However, our problem is more complex because the
regularization is applied iteratively: an appropriate, iteration-specific choice of the
regularization parameter is needed at each step to guarantee the convergence
of the estimator.

1.2 Model Structure

Consider an $N$-state Hidden Markov Model with hidden state random variables $\{S_i\}_{i \in \mathbb{Z}}$ taking values in a discrete set $\{1, \ldots, N\}$, as shown in Figure 1.1. The latent state follows an ergodic Markov chain with transition probability matrix $P = (p_{jk})_{j,k \in \{1,\ldots,N\}}$, where $p_{jk} = \mathbb{P}(S_2 = k \mid S_1 = j)$ and $\sum_{k=1}^{N} p_{jk} = 1$ for any $j = 1, \ldots, N$. We define the ergodic probabilities of the Markov process to be the vector $\pi = (\pi_1, \ldots, \pi_N)^\top$, where $\sum_{j=1}^{N} \pi_j = 1$, and assume throughout that the ergodic Markov chain starts in its stationary state. We also assume that the probability density/mass function $f(\cdot)$ is finite.

4
Figure 1.1: Graphical illustration of Hidden Markov Model with factor structure and
general emission distribution.

Suppose that for sample $i$ the emission distribution comes from an exponential family of probability distributions, a large class that includes the normal, binomial, Poisson, and gamma distributions, among others. With a corresponding link function $g$ we have
$$\mathbb{E}(y_i \mid x_i, S_i) = g^{-1}(x_i^\top \beta_{S_i}),$$
where $x_i$ is a $p$-dimensional vector of exogenous explanatory variables and $\beta_{S_i} \in \mathbb{R}^p$ denotes the corresponding regression coefficients. In our problem setting we allow $p \to \infty$, which means we have a large pool of factors.
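To make the generative model concrete, the following minimal Python sketch (our illustration, not code from the dissertation) simulates a path from the model under the Gaussian-emission special case with an identity link; the state count, transition matrix, and coefficient values are placeholder assumptions.

```python
import numpy as np

def simulate_fahmm(n, P, betas, sigma=1.0, seed=0):
    """Simulate (x, y, S) from a Gaussian-emission FAHMM: y_i = x_i' beta_{S_i} + eps_i."""
    rng = np.random.default_rng(seed)
    N, p = betas.shape                          # number of states, feature dimension
    pi = np.linalg.matrix_power(P, 1000)[0]     # approximate stationary distribution of P
    S = np.empty(n, dtype=int)
    S[0] = rng.choice(N, p=pi)
    for i in range(1, n):
        S[i] = rng.choice(N, p=P[S[i - 1]])     # Markov transition
    x = rng.standard_normal((n, p))             # exogenous features
    y = np.einsum("ij,ij->i", x, betas[S]) + sigma * rng.standard_normal(n)
    return x, y, S

# illustrative two-state example with sparse, state-specific coefficients
P = np.array([[0.9, 0.1], [0.2, 0.8]])
betas = np.zeros((2, 10)); betas[0, :3] = 1.0; betas[1, 3:6] = -2.0
x, y, S = simulate_fahmm(500, P, betas)
```

Other exponential-family emissions would only change the last step (e.g., drawing $y_i$ from a Poisson with mean $\exp(x_i^\top \beta_{S_i})$).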

1.3 Estimation

In this section, we introduce the estimation method for our model, the Feature Augmented
Hidden Markov Model (FAHMM), namely the regularized Baum-Welch algorithm.

1.3.1 Likelihood

Suppose we observe sequences $y_1^n = \{y_1, \ldots, y_n\}$, $x_1^n = \{x_1, \ldots, x_n\}$ from the response-covariate pair random vector $(Y, X) \in \mathbb{R} \times \mathbb{R}^p$. Then the conditional log likelihood function is
$$L_n(\theta) = \frac{1}{n} \log\Bigl( \sum_{S_0^n} f(y_1^n, S_0^n \mid x_1^n; \theta) \Bigr),$$
where $\theta := (\beta, \gamma)$ parameterizes the emission distribution and the transition probability matrix.

We note that the log-of-sum function is not guaranteed to be concave. For computational and analytical convenience, we instead work with an alternative concave problem, namely the concave lower bound of the conditional log likelihood obtained via Jensen's inequality. For any choice of $\theta'$, we have
$$L_n(\theta) = \frac{1}{n} \log\Bigl( \mathbb{E}_{S_0^n \mid y_1^n, x_1^n, \theta'}\Bigl[ \frac{f(y_1^n, S_0^n \mid x_1^n; \theta)}{f(S_0^n \mid y_1^n, x_1^n, \theta')} \Bigr] \Bigr) \ge \underbrace{\frac{1}{n} \mathbb{E}_{S_0^n \mid y_1^n, x_1^n, \theta'}\bigl[\log f(y_1^n, S_0^n \mid x_1^n; \theta)\bigr]}_{Q_n(\theta \mid \theta')} - \underbrace{\frac{1}{n} \mathbb{E}_{S_0^n \mid y_1^n, x_1^n, \theta'}\bigl[\log f(S_0^n \mid y_1^n, x_1^n, \theta')\bigr]}_{H_n(\theta')}.$$

Since only the first term $Q_n(\theta \mid \theta')$ depends on $\theta$, EM tries to maximize $L_n(\theta)$ by maximizing $Q_n(\theta \mid \theta')$. Expanding $Q_n(\theta \mid \theta')$ gives
$$Q_n(\theta \mid \theta') = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{S_{i-1}, S_i \mid y_1^n, x_1^n, \theta'}\bigl[\log p(S_i \mid S_{i-1}; \theta)\bigr] + \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{S_i \mid y_1^n, x_1^n, \theta'}\bigl[\log f(y_i \mid x_i, S_i; \beta)\bigr],$$
where $f(y_i \mid x_i, S_i; \beta)$ is the emission distribution under state $S_i$. For example, when the emission distribution is Gaussian, $f(y_i \mid x_i, S_i; \beta) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(y_i - x_i^\top \beta_{S_i})^2/\sigma^2\}$ is the density of $y_i$, where $\sigma$ is constant across states.

6
If we define $\xi'_{i,j} = \mathbb{P}(S_i = j \mid y_1^n, x_1^n, \theta')$ and $\tau'_{i,jk} = \mathbb{P}(S_{i-1} = j, S_i = k \mid y_1^n, x_1^n, \theta')$, then
$$Q_n(\theta \mid \theta') = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{S_{i-1}, S_i \mid y_1^n, x_1^n, \theta'}\bigl[\log p(S_i \mid S_{i-1}; \gamma)\bigr] + \frac{1}{n} \mathbb{E}_{S_0 \mid y_1^n, x_1^n, \theta'}\bigl[\log \pi_0(S_0; \gamma)\bigr]}_{Q_{2,n}(\gamma \mid \theta')} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{S_i \mid y_1^n, x_1^n, \theta'}\bigl[\log f(y_i \mid x_i, S_i; \beta)\bigr]}_{Q_{1,n}(\beta \mid \theta')}. \tag{1.1}$$
We observe that $Q_{1,n}(\beta \mid \theta')$ depends only on $\beta$ and $Q_{2,n}(\gamma \mid \theta')$ depends only on $\gamma$; therefore we can optimize them separately to estimate $\beta$ and $\gamma$.

1.3.2 Regularization

To encourage the assumed structure of the factor part (e.g., sparsity) and to handle issues such as multicollinearity, we add a regularization term to $Q_{1,n}(\beta \mid \theta')$:
$$Q^{p}_{1,n}(\beta \mid \theta') = Q_{1,n}(\beta \mid \theta') - \sum_{j=1}^{N} \lambda_j \mathcal{R}(\beta_j).$$
Here $\mathcal{R}(\cdot): \mathbb{R}^p \to \mathbb{R}$ can be one of several types of penalty functions. Examples we use include the $\ell_1$ penalty, the $\ell_2$ penalty, the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001), the adaptive lasso (AL) (Zou 2006), and the Elastic Net (Zou and Hastie 2005).
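For reference, the sketch below (an illustration we add here, not code from the dissertation) evaluates several of these penalties coordinatewise in Python. The SCAD expression follows the standard piecewise form with the conventional choice a = 3.7, which is our assumption rather than a choice made in this chapter.

```python
import numpy as np

def l1_penalty(beta, lam):
    return lam * np.sum(np.abs(beta))

def l2_penalty(beta, lam):
    return lam * np.sum(beta ** 2)

def elastic_net_penalty(beta, lam, alpha=0.5):
    # convex combination of the l1 and squared-l2 penalties
    return lam * (alpha * np.sum(np.abs(beta)) + (1 - alpha) * np.sum(beta ** 2))

def adaptive_lasso_penalty(beta, lam, weights):
    # weights are typically 1 / |initial estimate|**tau
    return lam * np.sum(weights * np.abs(beta))

def scad_penalty(beta, lam, a=3.7):
    # smoothly clipped absolute deviation, evaluated coordinatewise and summed
    b = np.abs(beta)
    small = lam * b
    middle = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    large = np.full_like(b, lam ** 2 * (a + 1) / 2)
    return np.sum(np.where(b <= lam, small, np.where(b <= a * lam, middle, large)))
```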

1.3.3 Algorithm

Since we allow the number of features to be high dimensional, we propose a new regularized Baum-Welch algorithm. Using this iterative algorithm, we approximate the maximum likelihood estimator of $\theta^*$ by cycling between the following Expectation step and Maximization step.

7
Algorithm 1: Regularized Baum-Welch Algorithm

Input: samples $y_1^n, x_1^n$; regularizer $\mathcal{R}$; number of iterations $T$; initial parameters $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)})$; initial regularization parameters $\{\lambda_j^{(0)}\}_{j=1}^{N}$; estimated statistical error $\Delta$; contractive factor $\kappa < 1$.
for $t = 1$ to $T$ do
    E-step: compute $Q_n(\theta \mid \theta^{(t-1)})$ given $\theta^{(t-1)}$.
    Regularized M-step: compute
    $$\lambda_j^{(t)} \leftarrow \kappa \lambda_j^{(t-1)} + \Delta,$$
    $$\beta^{(t)} = \arg\max_{\beta} \Bigl\{ Q_{1,n}(\beta \mid \theta^{(t-1)}) - \sum_{j=1}^{N} \lambda_j^{(t)} \mathcal{R}(\beta_j) \Bigr\},$$
    and
    $$\gamma^{(t)} = \arg\max_{\gamma} Q_{2,n}(\gamma \mid \theta^{(t-1)}).$$
end for
Output: $\beta^{(T)}$, $\gamma^{(T)}$.

E-step

At the $t$-th iteration, the E-step computes the function $Q_n(\theta \mid \theta^{(t-1)})$, which amounts to calculating $\xi_{i,j}^{(t)}$ and $\tau_{i,jk}^{(t)}$ given $\theta^{(t-1)}$. This can be done by the forward-backward algorithm:
$$\tau_{i,jk}^{(t)} = \frac{a_{i,j}^{(t)}\, p_{jk}^{(t)}\, f_k(y_{i+1}; \theta^{(t)})\, b_{i+1,k}^{(t)}}{\sum_{j'=1}^{N} \sum_{k'=1}^{N} a_{i,j'}^{(t)}\, p_{j'k'}^{(t)}\, f_{k'}(y_{i+1}; \theta^{(t)})\, b_{i+1,k'}^{(t)}}$$
and
$$\xi_{i,j}^{(t)} = \sum_{k=1}^{N} \tau_{i,jk}^{(t)}.$$
Here $a_{i,j}^{(t)}$ is the "forward" probability, defined as
$$a_{i,j}^{(t)} = \mathbb{P}(y_1, \ldots, y_i, S_i = j \mid x_1^n, \theta^{(t)}),$$
which can be calculated by the forward recursion
$$a_{i+1,j}^{(t)} = \Bigl( \sum_{k=1}^{N} a_{i,k}^{(t)}\, p_{kj}^{(t)} \Bigr) f_j(y_{i+1}; \theta^{(t)}).$$
Similarly, $b_{i,k}^{(t)}$ is the "backward" probability, defined as
$$b_{i,k}^{(t)} = \mathbb{P}(y_{i+1}, \ldots, y_n \mid S_i = k, x_1^n, \theta^{(t)}),$$
which can be calculated by the backward recursion
$$b_{i,k}^{(t)} = \sum_{j=1}^{N} p_{kj}^{(t)}\, f_j(y_{i+1}; \theta^{(t)})\, b_{i+1,j}^{(t)}.$$
We initialize these recursions by setting $a_{1,j}^{(t)} = \pi_j^{(t)} f_j(y_1; \theta^{(t)})$ and $b_{n,k}^{(t)} = 1$, where $\pi^{(t)} = (\pi_1^{(t)}, \ldots, \pi_N^{(t)})$ is the stationary distribution of $P^{(t)}$, i.e., the principal eigenvector satisfying $(P^{(t)})^\top \pi^{(t)} = \pi^{(t)}$.
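A minimal Python sketch of these recursions is given below (our illustration, not code from the dissertation); it assumes Gaussian emissions with a common sigma and rescales the forward/backward variables at each step for numerical stability, a standard implementation device.

```python
import numpy as np
from scipy.stats import norm

def e_step(y, x, P, betas, sigma):
    """Scaled forward-backward pass for a Gaussian-emission FAHMM.
    Returns xi[i, j] = P(S_i = j | data) and tau[i, j, k] = P(S_i = j, S_{i+1} = k | data)."""
    n, N = len(y), P.shape[0]
    f = norm.pdf(y[:, None], loc=x @ betas.T, scale=sigma)   # f[i, j] = f(y_i | x_i, S_i = j)
    pi = np.linalg.matrix_power(P, 1000)[0]                  # stationary distribution of P
    a = np.zeros((n, N)); b = np.zeros((n, N)); c = np.zeros(n)
    a[0] = pi * f[0]; c[0] = a[0].sum(); a[0] /= c[0]
    for i in range(1, n):                                    # forward recursion (rescaled)
        a[i] = (a[i - 1] @ P) * f[i]
        c[i] = a[i].sum(); a[i] /= c[i]
    b[-1] = 1.0
    for i in range(n - 2, -1, -1):                           # backward recursion (rescaled)
        b[i] = P @ (f[i + 1] * b[i + 1]) / c[i + 1]
    xi = a * b
    xi /= xi.sum(axis=1, keepdims=True)
    tau = a[:-1, :, None] * P[None] * (f[1:] * b[1:])[:, None, :]
    tau /= tau.sum(axis=(1, 2), keepdims=True)
    return xi, tau
```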

M-step

The M-step maximizes $Q_n^{p}(\theta \mid \theta^{(t-1)})$ to obtain estimates of $\beta$ and $\gamma$; note that $Q_{1,n}^{p}(\beta \mid \theta^{(t-1)})$ and $Q_{2,n}(\gamma \mid \theta^{(t-1)})$ can be maximized separately:
$$\beta^{(t)} = \arg\max_{\beta} \Bigl\{ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N} \xi_{i,j}^{(t-1)} \log f(y_i \mid x_i, S_i = j; \beta_j) - \sum_{j=1}^{N} \lambda_j^{(t)} \mathcal{R}(\beta_j) \Bigr\},$$
and
$$\gamma^{(t)} = \arg\max_{\gamma} \Bigl\{ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N} \sum_{k=1}^{N} \tau_{i,jk}^{(t-1)} \log p(S_i = k \mid S_{i-1} = j; \gamma) + \frac{1}{n} \sum_{j=1}^{N} \xi_{0,j}^{(t-1)} \log \pi(S_0 = j; \gamma) \Bigr\}.$$
In fact, the maximization over $\beta$ can also be carried out for each state separately:
$$\beta_j^{(t)} = \arg\max_{\beta_j} \Bigl\{ \frac{1}{n} \sum_{i=1}^{n} \xi_{i,j}^{(t-1)} \log f(y_i \mid x_i, S_i = j; \beta_j) - \lambda_j^{(t)} \mathcal{R}(\beta_j) \Bigr\}.$$
Moreover, since we expect linear convergence of the optimization, it is natural to update $\lambda_j^{(t)}$ recursively as well:
$$\lambda_j^{(t)} = \kappa \lambda_j^{(t-1)} + \Delta,$$
where $\Delta$ is the estimated statistical error.
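For the Gaussian emission with an $\ell_1$ penalty, the per-state M-step is a responsibility-weighted lasso regression and the transition update is a normalized expected-transition count. The sketch below is an illustrative implementation under those assumptions, using scikit-learn's Lasso as a stand-in solver; the library choice, the exact mapping from our lambda to the solver's alpha, and the default kappa and Delta values are our assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def m_step(y, x, xi, tau, lam, kappa=0.7, delta=0.05):
    """One regularized M-step for a Gaussian-emission FAHMM with an l1 penalty.
    lam is the array of per-state penalties lambda_j^{(t-1)}; xi, tau come from the E-step."""
    n, N = xi.shape
    lam_new = kappa * lam + delta                  # lambda_j^{(t)} = kappa * lambda_j^{(t-1)} + Delta
    betas = np.zeros((N, x.shape[1]))
    for j in range(N):
        w = xi[:, j]                               # responsibilities for state j
        sw = np.sqrt(w)
        # weighted least squares with l1 penalty, solved as a lasso on rescaled data
        model = Lasso(alpha=lam_new[j], fit_intercept=False)
        model.fit(sw[:, None] * x, sw * y)
        betas[j] = model.coef_
    # transition update: normalized expected transition counts
    P = tau.sum(axis=0)
    P /= P.sum(axis=1, keepdims=True)
    return betas, P, lam_new
```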

1.4 Main Results for General Emission Distribution and Regularizer

We now turn to the theoretical analysis of the Feature Augmented Hidden Markov Model
(FAHMM). We first introduce one of the key conditions, the decomposable regularizer,
which covers most common regularizers, including those discussed in Section
1.3.2. We then state several technical conditions on the Q functions. In Section 1.4.3, we
present our main results on the statistical and computational performance of the estimators.

1.4.1 Decomposable Regularizer

Decomposability has become a very important ingredient of high dimensional regularized estimation. It is defined in terms of a pair of subspaces $\mathcal{M} \subseteq \overline{\mathcal{M}}$ in $\mathbb{R}^p$, where $\mathcal{M}$ is called the model subspace; the orthogonal complement of $\overline{\mathcal{M}}$, namely the perturbation subspace, is defined as
$$\overline{\mathcal{M}}^{\perp} := \bigl\{ v \in \mathbb{R}^p \mid \langle u, v \rangle = 0 \text{ for all } u \in \overline{\mathcal{M}} \bigr\}.$$

Definition 1. (Decomposability). A norm-based regularizer $\mathcal{R}$ is decomposable with respect to $(\mathcal{M}, \overline{\mathcal{M}}^{\perp})$ if
$$\mathcal{R}(u + v) = \mathcal{R}(u) + \mathcal{R}(v)$$
for any $u \in \mathcal{M}$, $v \in \overline{\mathcal{M}}^{\perp}$.

Remark 1. The insight behind a decomposable regularizer is that the structure of the model parameter $\beta^*$ can be mostly captured by a small model subspace $\mathcal{M}$ of $\mathbb{R}^p$. The regularizer penalizes deviations away from $\mathcal{M}$ as much as possible.

Definition 2. (Subspace compatibility constant). For any subspace $\mathcal{M}$ of $\mathbb{R}^p$, the subspace compatibility constant of $\mathcal{M}$ with respect to $\mathcal{R}$ and some norm $\|\cdot\|$ is given by
$$\Psi(\mathcal{M}) := \sup_{u \in \mathcal{M} \setminus \{0\}} \frac{\mathcal{R}(u)}{\|u\|}.$$

Remark 2. The subspace compatibility constant measures the compatibility between the regularizer and the error norm on the space $\mathcal{M}$. In the simple but common $s$-sparse LASSO case, $\mathcal{R}(u) = \|u\|_1$ and $\|u\| = \|u\|_2$, so $\Psi(\mathcal{M}) = \sqrt{s}$.

Definition 3. (Dual norm). For a given inner product, the dual norm of $\mathcal{R}$ is
$$\mathcal{R}^*(v) := \sup_{u \in \mathbb{R}^p \setminus \{0\}} \frac{\langle u, v \rangle}{\mathcal{R}(u)} = \sup_{\mathcal{R}(u) \le 1} \langle u, v \rangle.$$

Remark 3. For example, the dual of the $\ell_1$ norm is the $\ell_\infty$ norm. The dual norm is used to specify a suitable choice of the regularization weight $\lambda_n$.

Condition 1. The regularizer $\mathcal{R}$ is decomposable with respect to the subspace pair $(\mathcal{M}, \overline{\mathcal{M}}^{\perp})$.

Remark 4. This condition is standard in the analysis of high dimensional M-estimators; details can be found in (Negahban, Ravikumar, Wainwright, Yu, et al. 2012).

11
1.4.2 Conditions on Q functions

To establish the statistical and computational guarantees of FAHMM, we need several technical conditions, originally proposed by (Yang, Balakrishnan, and Wainwright 2017), on the population $Q(\theta \mid \theta')$, the truncated population $Q^k(\theta \mid \theta')$, the sample $Q_n(\theta \mid \theta')$, and the truncated sample $Q_n^k(\theta \mid \theta')$.

Since the performance of Baum-Welch is known to depend on initialization, we only consider local properties in our analysis, with respect to an initialization $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)}) \in \Omega := B_2(r; \beta^*) \times \Omega_\gamma$, where $\gamma$ is a particular parameterization of the stationary distribution and the transition matrix of the Markov chain. $B_2(r; \beta^*)$ is a ball centered at $\beta^*$ with radius $r$; therefore $\Omega$ is a smaller set than the complete set $\widetilde{\Omega} := \Omega_\beta \times \Omega_\gamma$.

Existence of $Q(\theta \mid \theta')$

We have already defined $Q_n(\theta \mid \theta')$ in equation (1.1). In the i.i.d. setting, the law of large numbers (LLN) guarantees that as the sample size $n$ goes to infinity, $Q_n(\theta \mid \theta')$ approaches the population Q function $Q(\theta \mid \theta')$, which is its expectation. In our case, due to the dependence of the data, we first need to show that the population Q function even exists. A natural candidate is given by
$$Q(\theta \mid \theta') := \lim_{n \to \infty} \bigl[\mathbb{E}\, Q_n(\theta \mid \theta')\bigr]. \tag{1.2}$$

12
To deal with dependence, we also need to consider a suitably truncated version of the Q-function. We first define
$$Q_n^k(\theta \mid \theta') = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N} \sum_{l=1}^{N} \tau_{i,jl}^{k\prime} \log p(S_i = l \mid S_{i-1} = j; \gamma) + \frac{1}{n} \sum_{j=1}^{N} \xi_{0,j}^{k\prime} \log \pi(S_0 = j; \gamma)}_{Q_{2,n}^k(\gamma \mid \theta')} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{N} \xi_{i,j}^{k\prime} \log f(y_i \mid x_i, S_i = j; \beta)}_{Q_{1,n}^k(\beta \mid \theta')}, \tag{1.3}$$
where $\xi_{i,j}^{k\prime} = \mathbb{P}(S_i = j \mid y_{i-k}^{i+k}, x_{i-k}^{i+k}, \theta')$ and $\tau_{i,jl}^{k\prime} = \mathbb{P}(S_{i-1} = j, S_i = l \mid y_{i-k}^{i+k}, x_{i-k}^{i+k}, \theta')$. By construction, the $k$-truncated population function defined by
$$Q^k(\theta \mid \theta') = \lim_{n \to \infty} \mathbb{E}\, Q_n^k(\theta \mid \theta') \tag{1.4}$$
is well defined. (Yang, Balakrishnan, and Wainwright 2017) show that under the following Condition 2, the population Q function defined in equation (1.2) exists and can be uniformly approximated by $Q^k(\theta \mid \theta')$.

Condition 2. (Mixing Rate). We assume there exists a mixing constant $\epsilon_{\mathrm{mix}}$ such that
$$\epsilon_{\mathrm{mix}} \le \frac{\mathbb{P}(S_i = k \mid S_{i-1} = j; \gamma)}{\pi_k} \le \epsilon_{\mathrm{mix}}^{-1}$$
for all $j, k \in \{1, \ldots, N\}$.

Remark 5. This condition implies that the dependence on the initial distribution decays geometrically.

13
Population level

To infer the closeness of the maximizer of $Q^k$ to that of $Q$, we need not only the closeness of $Q^k$ and $Q$ but also the strong concavity of $Q^k$. To simplify the analysis, we define
$$M(\theta) := \arg\max_{\theta'} Q(\theta' \mid \theta).$$
Analogous definitions apply to the $k$-truncated version $M^k(\theta)$ and to the corresponding sample versions $M_n(\theta)$ and $M_n^k(\theta)$.

Condition 3. (Strong Concavity and Smoothness). We require that the function $Q^k(\cdot \mid \theta')$ satisfies strong concavity and smoothness over $\Omega$:

• Strong concavity: for all $(\beta_1, \gamma_1), (\beta_2, \gamma_2) \in \Omega$ and $\theta' \in \Omega$, we have
$$Q_1^k(\beta_1 \mid \theta') - Q_1^k(\beta_2 \mid \theta') - \langle \nabla_\beta Q_1^k(\beta_2 \mid \theta'), \beta_1 - \beta_2 \rangle \le -\frac{\lambda_\beta}{2}\|\beta_1 - \beta_2\|_2^2,$$
$$Q_2^k(\gamma_1 \mid \theta') - Q_2^k(\gamma_2 \mid \theta') - \langle \nabla_\gamma Q_2^k(\gamma_2 \mid \theta'), \gamma_1 - \gamma_2 \rangle \le -\frac{\lambda_\gamma}{2}\|\gamma_1 - \gamma_2\|_2^2.$$

• Smoothness: for all $(\beta_1, \gamma_1), (\beta_2, \gamma_2) \in \Omega$ and $\theta' \in \Omega$, we have
$$Q_1^k(\beta_1 \mid \theta') - Q_1^k(\beta_2 \mid \theta') - \langle \nabla_\beta Q_1^k(\beta_2 \mid \theta'), \beta_1 - \beta_2 \rangle \ge -\frac{\mu_\beta}{2}\|\beta_1 - \beta_2\|_2^2,$$
$$Q_2^k(\gamma_1 \mid \theta') - Q_2^k(\gamma_2 \mid \theta') - \langle \nabla_\gamma Q_2^k(\gamma_2 \mid \theta'), \gamma_1 - \gamma_2 \rangle \ge -\frac{\mu_\gamma}{2}\|\gamma_1 - \gamma_2\|_2^2.$$

If we could access $Q^k(\theta \mid \theta^*)$, Baum-Welch would converge in one step to the true optimum. Therefore, if we can show that $Q^k(\theta \mid \theta')$ is close to $Q^k(\theta \mid \theta^*)$, we can conclude that $M^k(\theta)$ is close to $\theta^*$. The following gradient stability condition expresses this closeness in a first-order sense.
14
Condition 4. (Gradient Stability). We impose first-order stability conditions on the gradients of each component of $Q^k$:

• For each $\beta \in \Omega_\beta$ and $\theta' \in \Omega$, we have
$$\|\nabla_\beta Q_1^k(\beta \mid \beta', \gamma') - \nabla_\beta Q_1^k(\beta \mid \beta^*, \gamma')\|_2 \le L_{\beta,1}\|\beta' - \beta^*\|_2,$$
$$\|\nabla_\beta Q_1^k(\beta \mid \beta', \gamma') - \nabla_\beta Q_1^k(\beta \mid \beta', \gamma^*)\|_2 \le L_{\beta,2}\|\gamma' - \gamma^*\|_2.$$

• For each $\gamma \in \Omega_\gamma$ and $\theta' \in \Omega$, we have
$$\|\nabla_\gamma Q_2^k(\gamma \mid \beta', \gamma') - \nabla_\gamma Q_2^k(\gamma \mid \beta^*, \gamma')\|_2 \le L_{\gamma,1}\|\beta' - \beta^*\|_2,$$
$$\|\nabla_\gamma Q_2^k(\gamma \mid \beta', \gamma') - \nabla_\gamma Q_2^k(\gamma \mid \beta', \gamma^*)\|_2 \le L_{\gamma,2}\|\gamma' - \gamma^*\|_2.$$

Sample Level

We now turn to the regularized Baum-Welch algorithm based on sample data. In order to guarantee that the estimation error $\|\hat{\beta}^{(t)} - \beta^*\|$ is well controlled, we still need $Q_{1,n}^k(\beta \mid \theta')$ to be strongly concave at $\beta^*$. In the high dimensional case, $n \ll p$, $Q_{1,n}^k(\beta \mid \theta')$ is flat in some directions. Therefore, we only require strong concavity over a particular set $\mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}^{\perp}, \mathcal{R})$, defined as
$$\mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}^{\perp}, \mathcal{R}) := \bigl\{ u \in \mathbb{R}^p \mid 2\mathcal{R}(u_{\overline{\mathcal{M}}^{\perp}}) \le 4\mathcal{R}(u_{\mathcal{M}}) + 3N\Psi(\mathcal{M})\|u\| \bigr\}. \tag{1.5}$$

Condition 5. (Restricted Strong Concavity). For any $\theta' \in \Omega$, with probability at least $1 - \delta$, for all $\beta - \beta^* \in \Omega \cap \mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}^{\perp}, \mathcal{R})$ we have
$$Q_{1,n}^k(\beta \mid \theta') - Q_{1,n}^k(\beta^* \mid \theta') - \langle \nabla_\beta Q_{1,n}^k(\beta^* \mid \theta'), \beta - \beta^* \rangle \le -\frac{\lambda_n}{2}\|\beta - \beta^*\|_2^2.$$

The above condition states that $Q_{1,n}^k(\cdot \mid \theta')$ is strongly concave in directions $\beta - \beta^*$ belonging to $\mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}^{\perp}, \mathcal{R})$.

Condition 6. (Statistical Error). For any fixed $\theta' \in \Omega$, with probability $1 - \delta$, we have
$$\|\nabla_\beta Q_{1,n}^k(\beta^* \mid \theta') - \nabla_\beta Q_1^k(\beta^* \mid \theta')\|_{\mathcal{R}^*} \le \Delta_n.$$

This statistical error replaces the quantity $\|M_n^{\beta,k}(\theta) - M^{\beta,k}(\theta)\|$ used in (Balakrishnan, Wainwright, Yu, et al. 2017) and (Yang, Balakrishnan, and Wainwright 2017) in order to deal with the high dimensional issue.
In addition to the above conditions, we also need a concentration inequality for $\gamma$. For any $\delta \in (0, 1)$, we let $\gamma_n(\delta, k)$ be the smallest positive number such that
$$\mathbb{P}\Bigl( \sup_{\theta \in \Omega} \|M_n^{\gamma,k}(\theta) - M^{\gamma,k}(\theta)\| \ge \gamma_n(\delta, k) \Bigr) \le \delta.$$
For a given truncation level $k$, this concentration gives an upper bound on the difference between the population and sample versions of the M operator for $\gamma$.

At the same time, we need to ensure that the truncation error is controlled. For a tolerance parameter $\delta \in (0, 1)$, we let $\phi_n(\delta, k)$ be the smallest positive number such that
$$\mathbb{P}\Bigl( \sup_{\theta \in \Omega} \|M_n(\theta) - M_n^k(\theta)\| \ge \phi_n(\delta, k) \Bigr) \le \delta.$$
This quantity gives an upper bound on the difference between the truncated sample and sample versions of the M operator.

16
1.4.3 Main Results

We are now ready to present our main theorems about the estimation. We begin by showing the linear convergence of the sequence $\{\theta^{(t)}\}_{t=0}^{T}$ obtained by repeatedly applying the regularized Baum-Welch update. In the sequel, we let $\alpha := \sup_{u \in \mathbb{R}^p \setminus \{0\}} \|u\|_* / \|u\|$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, define
$$\kappa := \frac{\max\{L_{\beta,1}, L_{\beta,2}\} + \max\{L_{\gamma,1}, L_{\gamma,2}\}}{\max\{\lambda_\beta, \lambda_\gamma\}}, \qquad \bar{\Delta} := \frac{r\lambda_n}{60\Psi(\mathcal{M})}, \qquad \kappa^* := \frac{5\alpha\mu_\beta\kappa}{\lambda_n},$$
and assume that $\Delta_n \le \bar{\Delta}$.

Theorem 1. Under Conditions 1-6, consider an initialization $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)}) \in \Omega$, with regularization parameters given by
$$\lambda_j^{(t)} = \kappa^t \frac{\lambda_n}{5\Psi(\mathcal{M})}\|\beta_j^{(0)} - \beta_j^*\| + \frac{1 - \kappa^t}{1 - \kappa}\Delta, \qquad j = 1, \ldots, N, \; t = 1, \ldots, T,$$
for any $\Delta \in [3\Delta_n, 3\bar{\Delta}]$ and $\kappa \in [\kappa^*, 3/4]$. For a given sample size $n$, suppose $(n, k)$ is large enough to ensure that
$$\phi_n(\delta, k) + 6\sqrt{N}\frac{\Delta}{\lambda_n}\Psi(\mathcal{M}) + \phi(k) \le (1 - \kappa)r - \kappa \max_{\gamma \in \Omega_\gamma}\|\gamma - \gamma^*\|.$$
Then, with probability at least $1 - \delta$, the Baum-Welch sequence $\{\theta^{(t)}\}_{t=0}^{T}$ satisfies the bound
$$\|\theta^{(t)} - \theta^*\| \le \kappa^t\|\theta^{(0)} - \theta^*\| + \frac{1}{1 - \kappa}\Bigl\{\phi_n(\delta, k) + 6\sqrt{N}\frac{\Delta}{\lambda_n}\Psi(\mathcal{M}) + \gamma_n(\delta, k) + \phi(k)\Bigr\}.$$

Remark 6. Theorem 1 shows that the estimation error is bounded by two terms: (1) $\kappa^t\|\theta^{(0)} - \theta^*\|$ decays geometrically with the number of iterations and can be regarded as the optimization error; (2) $\frac{1}{1-\kappa}\{\phi_n(\delta, k) + 6\sqrt{N}\frac{\Delta}{\lambda_n}\Psi(\mathcal{M}) + \gamma_n(\delta, k) + \phi(k)\}$ can be controlled by taking the sample size $n$ large enough, and can therefore be interpreted as the statistical error. With $T = O(\log n)$ and an appropriate choice $\Delta = O(\Delta_n)$, we have
$$\|\theta^{(T)} - \theta^*\| \lesssim \frac{1}{1 - \kappa}\Bigl\{\phi_n(\delta, k) + 6\sqrt{N}\frac{\Delta_n}{\lambda_n}\Psi(\mathcal{M}) + \gamma_n(\delta, k) + \phi(k)\Bigr\}.$$

1.5 Concrete Results for Gaussian Emission and L1 Penalty

In this section, we give results for a special case of Theorem 1: a Gaussian emission distribution with an $\ell_1$ penalty. This special case has its own name, Markov switching regression; it was first proposed by (Goldfeld and Quandt 1973) and has been frequently used in econometric research, as discussed in the Introduction. Here we add an $\ell_1$ penalty to perform feature selection and to allow for a high dimensional feature setting. We assume $x_1^n \sim \mathcal{N}(0, \mathbf{I}_p)$, the noise follows $\mathcal{N}(0, \sigma^2)$, and the number of states is 2. As far as we know, this is the first theoretical analysis of the Markov switching regression model.

Theorem 2. Let $n \ge C[(\|\beta^*\|_2 + \delta)/\|\beta^*\|_2]^2 s\log p$, $k \gtrsim \log n$, $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)}) \in \Omega$, $\Delta = C(\|\beta^*\|_2 + \delta)\sqrt{\log p / n}$, and $\lambda_j^{(0)} = \|\beta_j^{(0)} - \beta_j^*\|_2/(15\sqrt{s})$, $j = 1, \ldots, N$. Then with probability $1 - 1/p$, for $\kappa \in [1/2, 3/4]$, the Baum-Welch sequence $\{\theta^{(t)}\}_{t=0}^{T}$ satisfies the bound
$$\|\theta^{(t)} - \theta^*\| \le \kappa^t\|\theta^{(0)} - \theta^*\| + \frac{C_1(\|\beta^*\|_2 + \delta)\sqrt{\frac{s\log p}{n}} + C_2\sqrt{\frac{\log^8(np)}{n}}}{1 - \kappa}.$$

Remark 7. Similar to Theorem 1, the error bound decomposes into an optimization error and a statistical error. If we let $T = O(\log(n/(s\log p)))$, the error rate becomes
$$\frac{1}{1 - \kappa}\Bigl[C_1(\|\beta^*\|_2 + \delta)\sqrt{\frac{s\log p}{n}} + C_2\sqrt{\frac{\log^8(np)}{n}}\Bigr].$$

18
1.6 Determination of number of states

In many real life applications, the number of states is often not known a priori. In this paper, AIC is used to determine the optimal number of states. The AIC criterion consists of two terms, a log likelihood term and a penalty term:
$$\mathrm{AIC}(N) = -2\sum_{i=1}^{n}\log f(y_i \mid x_i; \theta) + 2N = -2\sum_{i=1}^{n}\log\Bigl(\sum_{j=1}^{N} f(y_i \mid x_i, S_i = j; \theta)\,\mathbb{P}(S_i = j)\Bigr) + 2N,$$
where $N$ is the number of states. The optimal number of states is then chosen as the minimizer
$$N_{\mathrm{opt}} = \arg\min_{N} \mathrm{AIC}(N). \tag{1.6}$$
Mallows' $C_p$, BIC, and other similar criteria can also be used.
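Assuming a fitting routine fit_fahmm that returns the maximized conditional log likelihood for a given number of states (a hypothetical helper, not defined in this chapter), the selection rule (1.6) is a simple loop over candidate values of N:

```python
def select_num_states(y, x, candidates=(1, 2, 3, 4)):
    """Pick the number of states minimizing AIC(N) = -2 * loglik + 2 * N."""
    aic = {}
    for N in candidates:
        loglik = fit_fahmm(y, x, num_states=N)   # hypothetical helper: maximized log likelihood
        aic[N] = -2.0 * loglik + 2.0 * N
    return min(aic, key=aic.get), aic
```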

1.7 Simulation

In this section, we investigate the finite sample performance of our estimators and of the regularized Baum-Welch algorithm. We first generate samples $\{(x_i, y_i)\}_{i=1}^{500}$ from the model
$$y_i = x_i^\top \beta_{S_i} + \epsilon_i, \qquad i = 1, \ldots, 500,$$
with the following settings (a Python data-generation sketch follows the list):

• Features $x_i$: $x_i = [x_{1,i}, \ldots, x_{100,i}]^\top$ is a $100 \times 1$ vector generated from a standard Gaussian distribution, with the correlation between $x_{j,i}$ and $x_{k,i}$ equal to $0.5^{|j-k|}$.

• States $S_i$: we assume there are two states, so $S_i$ equals 1 or 2. We generate $S_i$ from a Markov chain of length 500 with transition probability matrix
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}.$$

• Coefficients $\beta_{S_i}$: if $S_i = 1$, we let $\beta_1 = [1, 1, 1, 1, 0, 0, \ldots, 0]^\top$, with the first 4 elements equal to 1 and the rest 0; if $S_i = 2$, we let $\beta_2 = [0, 0, -2, -2, -2, -2, 0, \ldots, 0]^\top$, with the third through sixth elements equal to $-2$ and the rest 0.

• Innovation $\epsilon_i$: $\epsilon_i$ is generated from a standard normal distribution for every sample $i$.
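A self-contained data-generation sketch matching these settings is given below (our illustration, not code from the dissertation). The AR(1)-style feature correlation is produced via a Cholesky factor, and the chain is arbitrarily started in state 1, which is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
# features with corr(x_j, x_k) = 0.5 ** |j - k|
cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
x = rng.standard_normal((n, p)) @ np.linalg.cholesky(cov).T
# two-state Markov chain with the transition matrix above (state 1 -> index 0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])
S = np.zeros(n, dtype=int)
for i in range(1, n):
    S[i] = rng.choice(2, p=P[S[i - 1]])
# state-specific sparse coefficients
beta = np.zeros((2, p))
beta[0, :4] = 1.0
beta[1, 2:6] = -2.0
# responses y_i = x_i' beta_{S_i} + eps_i
y = np.einsum("ij,ij->i", x, beta[S]) + rng.standard_normal(n)
```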

Table 1.1: AIC for different numbers of states.

    N      1          2           3
    AIC    -538.176   -2928.594   -525.979

First of all, before fitting the model, we need to check whether AIC can correctly recover the number of states, which is 2 in our experiment. We choose LASSO as the regularizer, set $N$ to 1, 2, and 3, and run our algorithm. The results are shown in Table 1.1. According to equation (1.6), we successfully recover the true number of states.

Figure 1.2 shows the regime identification results. The first panel shows the true states generated from a Markov chain; the next two panels show the FAHMM with LASSO and Elastic Net regularizers; and the last panel shows the FAHMM without a regularizer. Both LASSO and Elastic Net correctly recover the true states, while the Markov regression with no regularizer fails to identify them.

Table 1.2 reports several error measures for the Gaussian emission FAHMM with LASSO, Elastic Net, and no regularizer, averaged over 1000 simulations. Because the ordinary FAHMM without a regularizer cannot identify the regimes, every error measure is inaccurate;
20
Table 1.2: Simulation results for different models.

                                           Lasso     ElasticNet   Ordinary
    ‖β̂ − β‖₁                     S = 1     0.0206    0.0541       66.0323
                                 S = 2     0.0186    0.0721       66.0323
    ‖β̂ − β‖₂²                    S = 1     0.0116    0.0206       8.769
                                 S = 2     0.0082    0.0263       8.769
    |σ̂ − σ|                      S = 1     0.015     0.007        4.576
                                 S = 2     0.005     0.089        4.576
    ‖P̂ − P‖_F²                    ——       0.0766    0.0586       0.707
    ‖P̂ − P‖_MAX                   ——       0.0481    0.0375       0.400
    Model Size                   S = 1     5         13           100
                                 S = 2     9         16           100
    MSE                           ——       2.704     2.706        105.192
    Regime Identification Error   ——       0         0            NA

whereas both LASSO and Elastic Net approximate the true parameters well. Since the features are not strongly correlated here, LASSO performs better than Elastic Net, with smaller estimation error and a smaller model size.

1.8 Empirical Results

We apply the FAHMM, both with and without a regularizer, to predict the foreign exchange rate of USD/BRL (US Dollar per Brazilian Real). Since in this case we use a Gaussian emission distribution and an $\ell_1$ penalty, we can also call this example a Markov switching regression. We choose the 7 features shown in Table 1.3 as the observed features $x_1^n$, and use the return of the USD/BRL exchange rate as the output variable $y_1^n$. These features are based on both practical and theoretical considerations.
First, Table 1.3 shows that adding a regularizer successfully shrinks many of the coefficients to 0, which increases the explanatory power of the model. The regularized model shows that the Stock feature is the most significant feature in Regime 1 (Bull) and the Interest Rate is the most significant feature in Regime 2 (Bear). These variable selection results make sense: according to the Taylor rule, the

21
Figure 1.2: The top picture is the simulated true states. The rest are estimated
probabilities of state 1 and state 2 using regularized FAHMM with Elastic Net penalty,
lasso penalty and ordinary FAHMM (without regularizer, least square) accordingly.

Table 1.3: Coefficients comparison between FAHMM with and without regularization.
Regime 1
Features Interests Politics Export Stock CPI Foreign Reserve Balance
Regularized / / / 0.0793 / / /
Ordinary -0.072 0.132 0.066 0.221 0.427 0.266 0.335
Regime 2
Features Interests Politics Export Stock CPI Foreign Reserve Balance
Regularized -0.180 / / / / / /
Ordinary -0.090 0.132 1.881 0.302 -5.161 -0.676 -0.894

interest rate is the most significant factor affecting the exchange rate, while in crash periods stock indexes have the highest correlation with the exchange rate. On the other hand, Figure 1.3 shows that both methods identify the hidden regimes well, but with regularization we obtain a much simpler model. The

22
superiority of regularized Markov switching regression will become more significant
if we include additional and correlated features, and also for the out of sample tests,
because simpler model can prevent overfitting. The estimated probability for two
regimes are shown in Figure 1.4.

Figure 1.3: The upper picture is regime identification results for BRL/USD using
LASSO FAHMM; the lower picture is the regime identification results using Least
Square FAHMM. The red and blue line stand for Regime 1 and Regime 2.

Figure 1.4: The upper picture is the estimated probability for regime 1 and regime 2
for BRL/USD using LASSO FAHMM; the lower picture is the estimated probability
for regime 1 and regime 2 for BRL/USD using Least Square FAHMM. The red and
blue line stand for Regime 1 and Regime 2.

23
1.9 Concluding Remarks

In this paper, we propose a new model called the Feature Augmented Hidden Markov Model (FAHMM); it improves the performance of the Hidden Markov Model by adding a feature structure to the output variable. At the same time, we allow the number of features to be high dimensional and consider general forms of emission distributions. To fit the model, we provide a new regularized Baum-Welch algorithm, in which the regularization in the M-step can deal with different practical issues with the features: feature selection, high dimensionality, multicollinearity, etc. In addition, we establish general local convergence theorems and statistical and computational guarantees for the estimator.

We leave a number of issues for future research. As is well known, the EM algorithm, as a concave approximation to the non-concave log-likelihood objective, depends on good starting estimates; as we have shown in the convergence analysis, the globally optimal solution can be discovered when the initial estimate is close enough (suitably defined). The proposed FAHMM algorithm displays similar characteristics, so future research on computing good starting estimates is a worthwhile goal. Another interesting direction is an online version of this algorithm. Although the Baum-Welch algorithm is intuitive and easy to apply, like many other iterative algorithms its computational inefficiency limits its applications in some areas, for example with high frequency data.

1.10 Proof of Theorem 1

In this section, we give the proof of Theorem 1 on the statistical and computational guarantees of the regularized Baum-Welch algorithm. First, we state a result on the population-level convergence guarantees.

24
Lemma 1. Under Condition 2 (mixing rate) and Condition 3 (strong concavity), there is a universal constant $c_0$ such that
$$\|M^k(\theta) - M(\theta)\|_2^2 \le \phi^2(k) \quad \text{for all } \theta \in \Omega,$$
where
$$\phi^2(k) = \frac{c_0 N^4}{\lambda\,\epsilon_{\mathrm{mix}}^{9}\,\bar{\pi}_{\min}^{2}}\,(1 - \epsilon_{\mathrm{mix}}\bar{\pi}_{\min})^k,$$
and $\bar{\pi}_{\min} := \min_{\gamma \in \Omega_\gamma}\min_{i \in S} \pi(i, \gamma)$.

Proof. The proof of this lemma can be found in the proof of Theorem 1, part (a), in (Yang, Balakrishnan, and Wainwright 2017).

Lemma 2. Under Condition 3 (strong concavity) and Condition 4 (gradient stability), we have
$$\|M^k(\theta) - M^k(\theta^*)\| \le \kappa\|\theta - \theta^*\|.$$
This lemma says that the operator $M^k(\cdot)$ contracts the distance between $\theta$ and $\theta^*$.

Proof. The proof is given in (Yang, Balakrishnan, and Wainwright 2017).

Lemma 3. Under Conditions 1-6, consider an initialization $\theta^{(0)} := (\beta^{(0)}, \gamma^{(0)}) \in \Omega$, with regularization parameters given by
$$\lambda_j^{(t)} = \frac{1 - \kappa^t}{1 - \kappa}\Delta + \kappa^t \frac{\lambda_n}{5\Psi(\mathcal{M})}\|\beta^{(0)} - M_{n,j}^{\beta,k}(\theta^*)\|, \qquad j = 1, \ldots, N, \; t = 1, \ldots, T,$$
for any $\Delta \in [3\Delta_n, 3\bar{\Delta}]$ and $\kappa \in [\kappa^*, 3/4]$. Then we have
$$\|M_n^{\beta,k}(\theta^{(t-1)}) - M^{\beta,k}(\theta^*)\| \le \frac{5\Psi(\mathcal{M})N}{\lambda_n}\frac{1 - \kappa^t}{1 - \kappa}\Delta + \kappa^t\|\beta^{(0)} - M_n^{\beta,k}(\theta^*)\|.$$
25
Proof. Recall that

N
ÿ pt`1q
Mnβ,k pθpt´1q q “ arg max Qk1,n pβ|θpt´1q q ´ λj Rpβj q.
βPΩ
j“1

Suppose β k,pt´1q P Ω, then according to optimality of β k,ptq , and we simplify β k,pt´1q ,


ptq
λj and β k,ptq to be β, λj and β ` :

N
ÿ N
ÿ
Qk1,n pβ ` |βq ´ λj Rpβj` q ě Qk1,n pβ ˚ |βq ´ λj Rpβj˚ q.
j“1 j“1

Equivalently,

N
ÿ N
ÿ
Qk1,n pβ ` |βq ´ Qk1,n pβ ˚ |βq ě λj Rpβj` q ´ λj Rpβj˚ q.
j“1 j“1

Using concavity of Qk1,n p¨|βq, we have

Qk1,n pβ ` |βq ´ Qk1,n pβ ˚ |βq ď x5β Qk1,n pβ ˚ |βq, β ` ´ β ˚ y.

Define Θ :“ β ` ´ β ˚ , since

|x5β Qk1,n pβ ˚ |βq, β ` ´ βy| “|x5β Qk1,n pβ ˚ |βq ´ 5β Qk1 pβ ˚ |βq ` 5β Qk1 pβ ˚ |βq, Θy|

ď|x5β Qk1,n pβ ˚ |βq ´ 5β Qk1 pβ ˚ |βq, Θy| ` |x5β Qk1 pβ ˚ |βq, Θy|

ď} 5β Qk1,n pβ ˚ |βq ´ 5β Qk1 pβ ˚ |βq}R˚ ¨ RpΘq ` } 5β Qk1,n pβ ˚ |βq}˚ }Θ}

ď∆n RpΘq ` α} 5β Qk1,n pβ ˚ |βq}}Θ}

ď∆n RpΘq ` α} 5β Qk1,n pβ ˚ |βq ´ 5β Qk1,n pMβ,k pβq, βq}}Θ}

ď∆n RpΘq ` αµ}Mβ,k pβq, βq ´ β ˚ }}Θ}

ď∆n RpΘq ` αµκ}β ´ β ˚ }}Θ}.

26
Therefore we have

N
ÿ N
ÿ
λj Rpβj` q ´ λj Rpβj˚ q ď ∆n RpΘq ` αµκ}β ´ β ˚ }}Θ}.
j“1 j“1

Using the decomposability of R, we get

Rpβj˚ ` Θi q ´ Rpβj˚ q ě Rpβj˚ ` Θi,M| q ´ RpΘi,M q ´ Rpβj˚ q “ RpΘi,M| q ´ RpΘi,M q.

Then
λj pRpΘi,M| q ´ RpΘi,M qq ď ∆n RpΘi q ` αµκ}βj ´ βj˚ }}Θ}.

If we choose
αµκ
λj ě 3∆n ` }βj˚ ´ βj },
ΨpMq

then we have

∆n αµκ}βj˚ ´ βj } 1
RpΘi,M| q ´ RpΘi,M q ď RpΘi q ` }Θ} ď RpΘi q ` ΨpMq}Θ}.
λj λj 3

Then
1
RpΘM| q ´ RpΘM q ď RpΘq ` N ΨpMq}Θ}.
3

Since RpΘq ď RpΘM| q ` RpΘM q, we have

2RpΘq ď RpΘM| q ď 4RpΘM q ` 3N ΨpMq}Θ}.

|
Therefore, we show that Θ lies in CpM, M , Rq, then according to Condition 5, we
have

λn
Qk1,n pβ ` Θ|βq ´ Qk1,n pβ ˚ |βq ďx5β Qk1,n pβ ˚ |βq, Θy ´ }Θ}2 .
2
λn
ď∆n RpΘq ` αµκ}β ´ β ˚ } ˆ }Θ} ´ }Θ}2 .
2
27
Then using the optimality condition again, we have

N
ÿ N
ÿ N
ÿ
Qk1,n pβ`Θ|βq´Qk1,n pβ ˚ |βq ě ˚
λj Rpβ `Θq´ ˚
λj Rpβ q ě ´ λj RpΘi,M q. (1.7)
j“1 j“1 j“1

Then combine the above two inequalities, we have

N
λn 2 ˚
ÿ
}Θ} ď ∆n RpΘq ` αµκ}β ´ β } ˆ }Θ} ` λj RpΘi,M q.
2 j“1

Using RpΘi q ď RpΘi,M| q ` Rpi, ΘM q ď p9{2qΨpMq}Θ}, we further have

N
λn 9 ÿ
}Θ}2 ď N ∆n ΨpMq}Θ} ` αµκ}β ´ β ˚ } ˆ }Θ} ` ΨpMq λj }Θ}.
2 2 j“1

Then we cancel }Θ}, we get

N
∆n 2 2 ÿ
}Θ} ď9N ΨpMq ` αµκ}β ´ β ˚ } ˆ }Θ} ` ΨpMq λj .
λn λn λn j“1
řN
ΨpMq αµκ j“1 λj
“ p9N ∆n ` 2 }β ´ β ˚ }q ` 2ΨpMq
λn ΨpMq λn
řN
j“1 λj
ď5ΨpMq .
λn

Combine the results above, we reach the conclusion that if M β,k pθpt´1q q P Ω, and

αµκ β,k pt´1q


q ´ Mjβ,k pθ˚ q},
ptq
λj ě 3∆n ` }Mn,j pθ
ΨpMq

then we have řN
j“1 λj
}Mnβ,k pθpt´1q q ´ M β,k pθ˚ q} ď 5ΨpMq .
λn

28
5αµκ
Then we let κ˚ :“ λn
, and assume κ˚ ď 3{4. Then for any κ P rκ˚ , 3{4s. We
have for any κ P rκ˚ , 3{4s, ∆ ą 3∆n , we can set

ptq 1 ´ κt λn β,k ˚
λj “ ∆ ` κt }β p0q ´ Mn,j pθ q},
1´κ 5ΨpMq

combing with the previous equations yields that:

5ΨpMqN 1 ´ κt
}Mnβ,k pθpt´1q q ´ M β,k pθ˚ q} ď ∆ ` κt }β p0q ´ Mnβ,k pθ˚ q}.
λn 1´κ

Then based on Lemma 1 and Lemma 2, we can show that:

}M k pθptq q ´ θ˚ }2 ď}M k pθptq q ´ M k pθ˚ q}2 ` }M k pθ˚ q ´ θ˚ }2

ďκ}θ ´ θ˚ }2 ` ψ 2 pkq.

Now we are ready to prove Theorem 1, using triangle inequality, we get

}θptq ´ θ˚ }2 ď}Mn pθpt´1q q ´ Mnk pθpt´1q q} ` }Mnk pθpt´1q q ´ M k pθ˚ q} ` }M k pθ˚ q ´ θ˚ }

ďφn pδ, kq ` }Mnβ,k pθpt´1q q ´ M β,k pθ˚ q} ` }Mnγ,k pθpt´1q q ´ M γ,k pθpt´1q q}

` }M γ,k pθpt´1q q ´ M γ,k pθ˚ q} ` φpkq


5ΨpMqN 1 ´ κt
ďφn pδ, kq ` ∆ ` κt }β p0q ´ Mnβ,k pθ˚ q} ` γn pδ, kq
λn 1´κ
` }M γ,k pθpt´1q q ´ M γ,k pθ˚ q} ` φpkq
5ΨpMqN 1 ´ κt
ďφn pδ, kq ` ∆ ` κt }β p0q ´ β ˚ } ` κt }β ˚ ´ Mnβ,k pθ˚ q} ` γn pδ, kq
λn 1´κ
` }M γ,k pθpt´1q q ´ M γ,k pθ˚ q} ` φpkq
5ΨpMqN 1 ´ κt
ďφn pδ, kq ` ∆ ` κt }β p0q ´ β ˚ } ` γn pδ, kq
λn 1´κ
` κ}θpt´1q ´ θ˚ } ` p1 ` κt qφpkq. (1.8)

29
Here the second equation is based on the following concentration inequalities: For a
tolerance of δ, we let φn pδ, kq and γn pδ, kq to be the smallest positive scalar such that:

„ 
P sup }Mn pθq ´ Mnk pθq} ě φn pδ, kq ď δ,
θPΩ

and
„ 
P sup }Mnγ,k pθq ´M γ,k
pθq} ě γn pδ, kq ď δ.
θPΩ

Since we need to guarantee that the iterates won’t get out of B2 pr; β ˚ q, which
means

}β ptq ´ β ˚ } ď}Mnβ pθpt´1q q ´ Mnβ,k pθpt´1q q} ` }Mnβ,k pθpt´1q q ´ M β,k pθ˚ q} ` }M β,k pθ˚ q ´ β ˚ }
5ΨpMqN 1 ´ κt
ďφn pδ, kq ` ∆ ` κt }θp0q ´ Mnβ,k pθ˚ q} ` φpkq ď r,
λn 1´κ

Consequently, we have

5ΨpMqN 1 ´ κt
φn pδ, kq ` ∆ ` φpkq ď p1 ´ κt qr. (1.9)
λn 1´κ

Therefore, under equation (1.9), summing up the geometric series in equation (3.20),
we have

1 5ΨpMqN 1 ´ κt
}θptq ´θ˚ }2 ď κt }θp0q ´θ˚ }` tφn pkq` ∆`γn pδ, kq`p1`κt qφpkqu.
1´κ λn 1´κ
(1.10)
Then the rest of the proof are naturally about finding the φn pδ, kq and n pδ, kq in their
concentration inequality.

30
1.11 Proof of Theorem 2

Since Theorem 2 is a special case of Theorem 1 with a normal emission distribution and an $\ell_1$ regularizer, we only need to verify Conditions 1-6.

Lemma 4. Suppose that the truncation level satisfies $k \gtrsim \log n$ and $n \ge C[(\|\beta^*\|_2 + \delta)/\|\beta^*\|_2]^2 s\log p$ for a sufficiently large $C$. Then we have
$$\gamma_n(\delta, k) \le C_0\,\sigma\sqrt{\frac{k^3\log(k^2/\delta)}{n}}.$$

Proof. First, let us define the weights:

i`k i`k i`k i`k


ωθ,1 pYi´k ; Xi´k q “ ppS0 “ S1 “ 1|Yi´k ; Xi´k q,

and
i`k i`k i`k i`k
ωθ,2 pYi´k ; Xi´k q “ ppS0 “ S1 “ ´1|Yi´k ; Xi´k q,

then we can write Mnγ,k pθq in the form of:

# +
n´1 N
N ÿ n
γ1 ,k 1 ÿÿ k,ptq 1 1 ÿ k,l
Mn pθq “ arg max τ log pij ` ξ log πi
γPΩγ n i“1 j“1 k“1 i,jk n n i“1 i,1
˜ ¸
n
1ÿ i`k i`k
“ΠΩγ1 ωθ,1 pYi´k ; Xi´k q . (1.11)
n i“1

Since weights ωθ,1 is Lipschitz continuous on the parameters, that is |ωθ,1 ´ ωθ1 ,1 | ď
L}θ ´ θ1 }2 . As a consequence, we have

m
1 ÿ δ
Prsup | ωθ,1 pX̃i:2k , Ỹi;2k q ´ Eωθ,1 pX̃i:2k , Ỹi;2k qq| ě βn s ď ,
θPΩ m i“1 8k

31
Then using similar arguments as Proof of inequality (47c) in (Yang, Balakrishnan,
and Wainwright 2017), we have

c
k 3 logpk 2 {δq
}Mnγ,k pθpt´1q q ´M γ,k
pθ pt´1q
q}2 ď C0 σ .
n

Lemma 5. For any geometrically $\rho_{\mathrm{mix}}$-mixing and time-reversible Markov chain $\{S_n\}$ with $N$ states, there is a universal constant $c$ such that
$$\sup_i \bigl| p(s_{i,0} = 1 \mid Y_k^n, X_k^n) - p(s_{i,0} = 1) \bigr| \le \frac{c(N+1)}{\bar{\pi}_{\min}^3\,\epsilon_{\mathrm{mix}}^3}\,\rho_{\mathrm{mix}}^k.$$

Proof. The proof follows from Lemma 12 in (Yang, Balakrishnan, and Wainwright 2017).

Lemma 6. Suppose that the truncation level satisfies $k \gtrsim \log n$ and $n \ge C[(\|\beta^*\|_2 + \delta)/\|\beta^*\|_2]^2 s\log p$ for a sufficiently large $C$. Then we have
$$\|M_n(\theta^{(t-1)}) - M_n^k(\theta^{(t-1)})\|_2 = C\sqrt{\frac{\log(n/\delta)}{n}} + C\,\frac{\delta}{n}.$$

Proof. By strong concavity of Qn and Qkn , it is easy to show that

8}Qn ´ Qkn }8 2
sup }Mn pθq ´ Mnk pθq}22 ď , where λ ě p1 ´ b2 q.
θPΩ λ 3

Therefore the remaining part of the proof is to bound the difference }Qn ´ Qkn }8 . Let
us denote

N
ÿ
hpyn , xn , βj , σi , θptq q “ log ppyn |xn , βj , σi q ` ppS “ i|S “ j, θptq q log pji .
j“1

32
We obtain
ˇ ˇ
n ÿ n
k
ˇ 1 ÿ ptq k,l
ˇ
ptq ˇ
}Qn ´ Qn }8 “ ˇ sup pξ ´ ξi,t qhpyn , xn , βj , σi , θ qˇ
ˇ
ˇθ,θptq n i“1 i“1 i,t ˇ
ˇ ˇ
ˇ 1ÿ ptq
ˇ
` ˇ sup ppS “ i|X, Y, θ q log πi ˇ
ˇ ˇ
ˇθ,θptq n i“0 ˇ
k
2Cs3 1 ÿ i
ď 8 r ρ̃mix max |hpyn , xn , βj , σi , θptq q| ` log π̄min
´1
s
mix π̄min n t“1 i

i´k
CN 3 ρ̃kmix 1 ÿ
` 8
max |hpyn , xn , βj , σi , θptq q|
mix π̄min T ´ 2k i“k i
k
2CN 3 ÿ
i N logpπ̄mix q´1 ´1
ď rmax
¯ min T i | log ppy |x
n n , βj , σi q| ρ̃ mix ` ` log π̄min s
8mix pi i“1
π̄ min min

˜
CN 3 ρkmix
` ¯ rE max | log fj pyn , xn ; βj q| ` eT ´2k pY q ` N logpπ̄min mix q´1 s,
8 min i

where
ˇ ˇ
ˇ1 ÿ n ˇ
ˇ ˇ ˇ ˇ
en pY q :“ ˇ max ˇlog ppyn , xn , βj , σi , θptq qˇ ´ E max ˇlog ppyn , xn , βj , σi , θptq qˇˇ
ˇ ˇ
ˇ n i“1 i i ˇ
ˇ ˇ
n
ˇ1 ÿ pyn ´ x|n βj q2 pyi ´ x|i βj q2 ˇˇ
max ´ E max ˇ.
ˇ
“ˇ
ˇ n i“1 i σi2 i σi2 ˇ

We can show
˜ˇ ˇ ¸
n n
ˇ1 ÿ
ÿ pyn ´ x|n βj q2 pyi ´ x|i βj q2 ˇˇ u
Ppen pY q ě uq ď P ˇ
ˇ
´E ˇě
i“1
ˇ n i“1 σi2 σi2 ˇ n
˜ ˜ˇ ˇ ¸ ¸
m
ˇ1 ÿ pyn ´ x|n βj q2 pyn ´ x|n βj q2 ˇˇ u
ďCN k̃ P ˇ ` mγpk̃q ,
ˇ
´E ˇě
ˇ m t“1 σi2 σi2 ˇ n

where
p0q
γpkq “ sup |PpP q ´ P´8 ˆ P8
1 pP q|.
p0q
APσpS´8 ,Sk8 q

p0q p0q
Here S´8 and Sk8 are sigma algebras generated by random vector Y´8 and Yk8
p0q 0
correspondingly, and P´8 ˆ P8
1 is the measure under which the sequences Y´8 and

33
Y18 are independent. It is easy to show that

p0q
γpkq “|f pyq ´ f pyk8 qf py´8 q|
p0q n p0q
ď|f py8 |yk q ´ f py´8 q|

ď|ppsi,0 “ 1|ykn q ´ ppsi,0 “ 1q|


N `1 k
ď ρ ,
π̄3mix mix

then this implies that


cmρk̃mix δ
mγpk̃q ď 3 3
ď . (1.12)
mix π̄min 2N k̃

On the other hand, since


˜ˇ ˇ ¸
ˇ1 ÿ m | 2 | 2ˇ
py n ´ x β
n j q pyn ´ x β q
n j ˇ u
(1.13)
ˇ
P ˇ ´E ˇě
ˇ m t“1 σi2 σi2 ˇ n

We choose d
log k̃N C2 logp 3 3n 3 q
δ mix π̄min δ
t :“ C1 , and k̃ :“
m log 1{ρ̃mix

Therefore, c
logpn{δq δ
ptq
}Mn pθ q ´ Mnk pθptq q}2 “C `C .
n n

Lemma 7. Suppose that the truncation level satisfies $k \gtrsim \log n$ and $n \ge C[(\|\beta^*\|_2 + \delta)/\|\beta^*\|_2]^2 s\log p$ for a sufficiently large $C$. Then for any $\delta \in (0, 1)$ we have
$$\Delta_n = C(\|\beta^*\|_2 + \delta)\sqrt{\frac{\log p}{n}}$$
for all $\beta \in B_2(r; \beta^*)$.

Proof. A similar proof can be found in Lemma 9 of (Yi and Caramanis 2015).

34
Lemma 8. Suppose that the truncation level satisfies $k \gtrsim \log n$ and
$$n \ge \max\Biggl\{ \Bigl(\frac{54K(2C_1\log p)^{1/\gamma}}{\lambda_{\min}(\Sigma_X)}\Bigr)^{\frac{2-\gamma}{1-\gamma}},\; \Bigl(\frac{54K}{\lambda_{\min}(\Sigma_X)}\cdot\frac{C_2}{C_1}\Bigr)^{\frac{1}{1-\gamma}} \Biggr\}$$
for a sufficiently large $C$. Then we have
$$\lambda_n = \tfrac{1}{2}\lambda_{\min}(\Sigma_X), \qquad (\mathcal{M}, \overline{\mathcal{M}}) = (\mathrm{supp}(\beta^*), \mathrm{supp}(\beta^*)), \qquad \delta = 2n\exp\Bigl\{-\frac{(\lambda_{\min}(\Sigma_X))^\gamma}{(54K)^\gamma\,2C_1}\,n^\gamma\Bigr\}.$$

Proof. Here we prove a stronger version when xn1 follows a subweibull distribution. In
our problem we have

n ÿ
ÿ n n
ÿ
ptq
Qk1,n pθ, θptq q “ ξi,t log fj pyn , xn ; βj q ´ λi Rpβj q
i“1 i“1 i“1

Since we assume fi is a normal density coming from state i, the loss function part is
actually a weighted least square. We maximize the Qk1,n pθ, θptq q for different state j,
since each of them can be maximized separately. We let

n
ÿ ptq
Qk1,j,n pθ, θptq q “ ξi,t log fj pyn , xn ; βj q ´ λi Rpβj q.
i“1
n
ÿ ptq 1 pyn ´ xn βj q2
“ ξi,t p´ log 2π ´ log σi ´ q ´ λi Rpβj q
i“1
2 2σi2

Then write it into matrix form, we have

Qk1,n pθ, θptq q “ ´pY ´ Xβj q| ΞpY ´ Xβj q ´ λi Rpβj q, (1.14)

where Y “ py1 , . . . , yn q| , X “ pX1 , . . . , Xn q| , and Ξ P RT ˆT is a diagonal matrix with


ptq
diagonal elements to be ξi,t , t “ 1, . . . , T .
First, let $\hat\Sigma_n(\omega) := \frac{1}{n}\|X\omega\|_2^2$ for any $\omega\in\mathbb{R}^p$. According to Lemma II.13 of (Wong 2017), we have the following lemma:

Lemma 9. Let $\{X_n\}_{t=1}^{T}$ be a geometrically decaying $\beta$-mixing stationary time series with mixing coefficients $\beta(l) \le 2\exp\{-c\,l^{\gamma_1}\}$, and suppose each $X_n$ is mean zero and subweibull($\gamma_2$) with subweibull constant $K$. Let $\gamma$ be the parameter given by
$$\frac{1}{\gamma} = \frac{1}{\gamma_1} + \frac{1}{\gamma_2}.$$
Assuming $\gamma < 1$, then for $T > 4$ and all $t > n_1$, we have
$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^{n}X_n\Big| > u\Big) \le n\exp\Big\{-\frac{(un)^{\gamma}}{K^{\gamma}C_1}\Big\} + \exp\Big\{-\frac{u^2 n}{C_2 K^2}\Big\},$$
where $C_1$, $C_2$ are constants depending only on $\gamma_1$, $\gamma_2$, and $c$.

The proof is based on Theorem 1 of (Merlevède, Peligrad, and Rio 2011); under our $\beta$-mixing and subweibull assumptions, we have
$$\mathbb{P}(|\hat\Sigma_n(\omega) - \mathbb{E}\hat\Sigma_n(\omega)| > u) \le n\exp\Big\{-\frac{u^{\gamma}}{C_1}\Big\} + \exp\Big\{-\frac{u^2}{C_2(1 + TV)}\Big\} + \exp\Big\{-\frac{u^2}{C_3 T}\exp\Big(\frac{1}{C_4}\frac{u^{(1-\gamma)\gamma}}{\log(u)}\Big)\Big\},$$
where $V$ is a worst-case measure of the partial sums of the autocovariances of the dependent time series $X_n$, $t = 1,\ldots,T$. For any geometrically decaying $\beta$-mixing sequence, according to (Viennet 1997), $V$ can be bounded by a constant. Replacing $\gamma_2$ with $\gamma_2/2$, so that $\gamma = (1/\gamma_1 + 2/\gamma_2)^{-1}$, we then have
$$\mathbb{P}(|\hat\Sigma_n(\omega) - \mathbb{E}\hat\Sigma_n(\omega)| > u) \le n\exp\Big\{-\frac{(un)^{\gamma}}{K^{\gamma}C_1}\Big\} + \exp\Big\{-\frac{u^2 n}{C_2 K^2}\Big\}.$$

Let $\mathcal{K}(2k)$ be the set of $2k$-sparse vectors with unit Euclidean norm. Then we can obtain the uniform concentration of $|\hat\Sigma_n(\omega) - \mathbb{E}\hat\Sigma_n(\omega)|$ over $\mathcal{K}(2k)$:
$$\mathbb{P}\Big(\sup_{\omega\in\mathcal{K}(2k)}|\hat\Sigma_n(\omega) - \mathbb{E}\hat\Sigma_n(\omega)| > u\Big) \le n\exp\Big\{-\frac{(un)^{\gamma}}{K^{\gamma}C_1} + k\log p\Big\} + \exp\Big\{-\frac{u^2 n}{C_2 K^2} + k\log p\Big\}.$$

Then, based on Lemma 12 in (Loh and Wainwright 2011), we know that for $k \ge 1$, with probability
$$1 - n\exp\Big\{-\frac{(un)^{\gamma}}{K^{\gamma}C_1} + k\log p\Big\} - \exp\Big\{-\frac{u^2 n}{C_2 K^2} + k\log p\Big\},$$
the following holds uniformly for all $\omega\in\mathbb{R}^p$:
$$|\hat\Sigma_n(\omega) - \mathbb{E}\hat\Sigma_n(\omega)| \le 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big).$$
Since $\mathbb{E}\hat\Sigma_n(\omega) = \omega^{\top}\Sigma_X\omega$, it follows that
$$\hat\Sigma_n(\omega) \le \omega^{\top}\Sigma_X\omega + 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big),$$
and
$$\hat\Sigma_n(\omega) \ge \omega^{\top}\Sigma_X\omega - 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big).$$
Let $u = \frac{1}{54}\lambda_{\min}(\Sigma_X)$; then
$$
\begin{aligned}
\hat\Sigma(\omega) &\ge \lambda_{\min}(\Sigma_X)\|\omega\|_2^2 - 27u\Big(\|\omega\|_2^2 + \frac{1}{k}\|\omega\|_1^2\Big) \\
&= \frac{1}{2}\lambda_{\min}(\Sigma_X)\|\omega\|_2^2 - \frac{\lambda_{\min}}{2k}\|\omega\|_1^2.
\end{aligned}
$$
We want to pick $k$ such that
$$k\log p = \min\Big\{\frac{(un)^{\gamma}}{K^{\gamma}C_1}, \frac{u^2 n}{C_2 K^2}\Big\},$$
and if we want the minimum to be attained by the first term, and also to ensure that $k \ge 1$, then we obtain the following. When the sample size
$$n \ge \max\Bigg\{\bigg(\frac{54K(2C_1\log p)^{1/\gamma}}{\lambda_{\min}(\Sigma_X)}\bigg)^{\frac{2-\gamma}{1-\gamma}},\ \bigg(\frac{54K}{\lambda_{\min}(\Sigma_X)}\bigg)^{\frac{1}{1-\gamma}}\frac{C_2}{C_1}\Bigg\},$$
with probability at least
$$1 - 2n\exp\Big\{-\frac{(\lambda_{\min}(\Sigma_X))^{\gamma}}{(54K)^{\gamma}2C_1}\, n^{\gamma}\Big\},$$
we have
$$\hat\Sigma(\omega) \ge \frac{1}{2}\lambda_{\min}(\Sigma_X)\|\omega\|_2^2 - \frac{(54K)^{\gamma}C_1}{2\lambda_{\min}(\Sigma_X)^{\gamma-1}}\|\omega\|_1^2.$$
Therefore, we establish the Restricted Eigenvalue (RE) condition for equation (1.14). Conditions 3 and 4 are direct but tedious, and can be easily inferred from (Yi and Caramanis 2015). Finally, Lemma 5 to Lemma 10, together with Theorem 1, conclude the proof of Theorem 2.

Chapter 2

A Deep Reinforced Model For Long


Document Classification With An
Application In Earnings Transcript
Data

2.1 An Overview of Text Classification

Text classification is a classical problem in natural language processing (NLP), which


aims to assign a label to text units such as sentences, paragraphs, documents, and so on. It
has a wide range of applications in different areas including sentiment analysis, spam
detection, question answering, and so on. Text data can come from different sources,
for example, recording transcripts, social media, web data, user reviews, and many
more. Due to the unstructured nature of text data, extracting information from it
can be challenging and time-consuming, even though it is an extremely rich source of
information.

Text classification can be done by manual labelling or automatic labelling. With the
growth of the size of text data, automatic labelling becomes more and more important
and unavoidable. Approaches to automatic text classification can be grouped into
three main categories: 1. Rule-based methods; 2. Machine learning based methods; 3.
Hybrid methods.
Rule-based methods classify texts into different labels using a set of pre-defined
rules. They require deep domain knowledge and intensive manual work, and the systems
are also difficult to maintain. On the other hand, machine learning based methods
require pre-labeled samples as training data and then use supervised machine learning
methods to model the relationship between the text and the labels. Thus, machine
learning methods are able to detect hidden patterns in the data, are more scalable,
and are easily applied to different tasks. The hybrid methods use a combination of
rule-based and machine learning methods to do the classification.
Machine learning methods are drawing increasing attention in recent years. Most
of the classical machine learning-based models follow a popular two-step procedure:
in the first step, we extract hand-crafted features from the text, and then we
feed these features to machine learning models to do the prediction. Some of the most
popular feature extraction methods are bag of words (BOW), term frequency-inverse
document frequency (TFIDF), and their extensions. However, these two-step
processes have several limitations. For example, reliance on hand-crafted features
requires tedious feature engineering and analysis to obtain good performance. The
strong dependence on domain knowledge for designing features makes the methods
difficult to generalize to other tasks. Finally, these models cannot take full advantage
of the large amounts of training data because the features are pre-defined.
A paradigm shift started in 2012, when a deep learning-based model, AlexNet
(Krizhevsky, Sutskever, and Hinton 2017), won the ImageNet competition by
a large margin. After that, deep learning methods have been applied to a wide range

of tasks in computer vision and NLP, improving the state of the art ((Vaswani, Shazeer,
Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin 2017), (He, Zhang, Ren, and
Sun 2016), (Devlin, Chang, Lee, and Toutanova 2018), (Yang, Dai, Yang, Carbonell,
Salakhutdinov, and Le 2019)). These models try to learn feature representations and
perform classification or regression in an end-to-end fashion. They not only have the
ability to uncover hidden patterns in data, but are also easily generalizable to
other tasks. These methods have become the mainstream framework for various text
classification tasks in recent years.
Among different text classification tasks, document classification is a special subarea,
since many of the large language models (e.g. BERT, GPT) have limits on
the number of input tokens. On the other hand, RNN models fail
when the document is very long, because the memory of an RNN vanishes over long sequences.
There are only a few research works on document classification: (Yang, Yang, Dyer,
He, Smola, and Hovy 2016) uses a hierarchical model to extract features from the
documents, with word and sentence level attention mechanism to classify documents;
(Adhikari, Ram, Tang, and Lin 2019b) proposes to use a simple, properly-regularized
single layer BiLSTM, and achieves similar or even better performance than many
very complex models; (Adhikari, Ram, Tang, and Lin 2019a) takes advantage of
knowledge distillation to compress the BERT model into a simpler LSTM model.

2.2 Introduction

Stock movement prediction has been an attractive topic for both investors and re-
searchers for many years, dating back to (Fama 1965), who showed that there is no
significant autocorrelation in the daily returns of 30 stocks from the Dow-Jones Indus-
trial Average. Similar studies were conducted by (Taylor 2008) and (Ding, Granger,
and Engle 1993), who find significant autocorrelation in squared and absolute returns

(i.e. volatility). These effects are also observed on intraday volatility patterns, as
demonstrated by (Wood, McInish, and Ord 1985) and (Andersen and Bollerslev 1997)
on absolute returns. These findings tend to demonstrate that, given solely historical
stock returns, future stock returns are not predictable while volatility is.
However, many empirical studies show that the market is closer to being efficiently
inefficient (Pedersen 2019): prices can be pushed away from their fundamental values
by strong information. Due to market inefficiency, it takes time for the market
to price in new information. Representative examples of such significant events include
earnings reports, mergers, and split-offs. Figure 2.1 shows the price movement
on March 5, 2020 after the earnings conference call on the previous day. It took the
whole after-market period plus an additional hour and a half, from 9:30 a.m. to around
11:00 a.m., to enter a new steady state. A useful and powerful way to extract signals
from news and reports is natural language processing (NLP), which many hedge funds
have profitably applied in designing trading strategies.

Figure 2.1: Zoom’s stock price movement after Q4 2020 earnings conference call.

Among previous research works about using NLP for stock movement prediction,
public news and social media are two primary content resources for stock market
prediction, and the models that use these sources are often discriminative. Among
them, classic research relies heavily on feature engineering ((Schumaker and Chen 2009);
(Oliveira, Cortez, and Areal 2013)). With the prevalence of deep neural networks
((Le and Mikolov 2014)), event-driven approaches were studied with structured event
representations ((Ding, Zhang, Liu, and Duan 2014); (Ding, Zhang, Liu, and Duan
2015)). Most recently, (Xu and Cohen 2018) presents a deep generative model for
a binary classification of stock price movement based on tweets and historic prices.
(Duan, Ding, and Liu 2018) learns sentence representations over tree structures for a
binary classification task of stock return movement in reaction to financial news.
In this paper, we combine established knowledge from the financial domain with
recent advancements in NLP to create two models, a BERT-based model and a deep
reinforced model: state-of-the-art neural models jointly exploiting financial and textual
data for stock movement prediction. We collect a comprehensive set of historical
financial data and enrich it with the natural language information revealed in recurring
events, so-called earnings calls; in these calls, the performance of publicly traded
companies is summarized and prognosticated by their management. We then train a
joint model to predict short-term risk following these calls.
Earnings calls are a rather underexplored type of disclosure for stock movement
prediction—despite their unique and interesting properties: they start with a scripted
presentation by the company management, and after that they also contain an
open questions-and-answers session, in which banking analysts can pose challenging
questions to the executives. Hence, unlike already well-explored disclosures such as
the uniform and formal annual report 10-K, this allows for an unscripted, spontaneous
interaction (Larcker and Zakolyukina 2012) of high authenticity.

2.3 Problem Formulation

In this paper, we want to predict the stock movement of the next trading day right
after the earnings conference call. Here we formulate the problem as a classification
task based on the next day's intraday return.

2.3.1 Classification Label

Before defining the stock movement, we first calculate the maximum up movement
since open:
$$r_{up,t} = \frac{p_{high,t} - p_{open,t}}{p_{open,t}},$$

and the maximum down movement since open:

$$r_{down,t} = \frac{p_{low,t} - p_{open,t}}{p_{open,t}},$$

where popen,t , phigh,t , and plow,t are the open, high and low price of day t. Based on
this, we formulate our task as two classification problems:
Direction Prediction
In this case, we define the stock movement to be
$$r_{1,t} = \begin{cases} r_{up,t} & \text{if } r_{up,t} \ge -r_{down,t}, \\ r_{down,t} & \text{if } r_{up,t} < -r_{down,t}. \end{cases}$$
Then we specify the classification label as
$$y_{1,t} = \begin{cases} 0 & \text{if } -\tau_1 \le r_{1,t} \le \tau_1, \\ 1 & \text{if } r_{1,t} > \tau_1, \\ -1 & \text{if } r_{1,t} < -\tau_1. \end{cases}$$

Here $\tau_1$ is a thresholding hyper-parameter which needs to be tuned.
Abnormal Return Prediction
Abnormal return is relevant to many direction-neutral strategies; it is defined as
$$r_{2,t} = \max\{r_{up,t}, -r_{down,t}\}.$$
Correspondingly, we label it as
$$y_{2,t} = \begin{cases} 0 & \text{if } r_{2,t} \le \tau_2, \\ 1 & \text{if } r_{2,t} > \tau_2. \end{cases}$$
Here $\tau_2$ is the thresholding hyper-parameter of the abnormal return task. From the perspective of finance theory,
the latter task is easier than the former one.
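To make the label construction concrete, the following minimal sketch (a hypothetical helper; the column names open/high/low and the default thresholds are illustrative assumptions, not part of our pipeline) computes the two sets of labels from daily price data.

import pandas as pd

def make_labels(df: pd.DataFrame, tau1: float = 0.03, tau2: float = 0.03) -> pd.DataFrame:
    """Compute direction and abnormal-return labels from daily open/high/low prices."""
    out = df.copy()
    out["r_up"] = (out["high"] - out["open"]) / out["open"]
    out["r_down"] = (out["low"] - out["open"]) / out["open"]
    # r1: the up or down move, whichever is larger in absolute value, keeping its sign
    up_dominates = out["r_up"] >= -out["r_down"]
    out["r1"] = out["r_down"].where(~up_dominates, out["r_up"])
    out["y1"] = 0
    out.loc[out["r1"] > tau1, "y1"] = 1
    out.loc[out["r1"] < -tau1, "y1"] = -1
    # r2: magnitude of the largest intraday move, regardless of direction
    out["r2"] = pd.concat([out["r_up"], -out["r_down"]], axis=1).max(axis=1)
    out["y2"] = (out["r2"] > tau2).astype(int)
    return out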

2.3.2 Features

We consider both textual and financial features which are defined below.
Textual Features We split each transcript into two sections: presentation and
questions-and-answers. Each of the two sections is represented by a vector $t$, i.e.
$t_P$, $t_Q$: the tokens $w_1, w_2, \ldots, w_n$ of a transcript section are represented with
embeddings $w^{(1)}, w^{(2)}, \ldots, w^{(n)}$, and the BERT model then composes these tokens into
a document/chunk representation $t$. Embeddings of size 768 are trained with our
BERT-based model and deep reinforced model on the earnings transcript data.
Financial Features
In addition to text features, for each transcript we also include a set of financial features,
since companies with different characteristics normally react differently, in terms
of stock movement, to new information.
Earnings surprise is the difference between the actual earnings per share and its
consensus estimate; it is one of the main drivers of the drastic movement after an
earnings report.
Market volatility is closely related to the volatility of each stock. We use the CBOE
Volatility Index (VIX) to measure it. We retrieve the VIX for the trading day before
the earnings conference call.
Past volatility is another strong indicator of the movement after the earnings conference
call. We calculated the daily volatility for the past week and past month for the stock
before its earnings conference call.
Past return shows investors' expectations for the stock; for example, if the stock went
down a lot recently but the actual earnings beat the consensus by a wide margin, it has
a larger probability of soaring after the earnings call. We calculate the return for the
week and the month before the earnings day.
Book to market ratio is the ratio of the book value and market value of the stock,
which shows the price level of the company compared to its intrinsic value.
Market value reflects the size of the company; research shows that smaller companies
tend to be more volatile than large companies and therefore tend to react more
drastically to strong information.

2.4 Model Overview

2.4.1 BERT based Model

Our first model, the BERT-based model, is a neural model incorporating both the textual and
financial features to predict the stock movement of the next trading day after the
earnings day. Many previous works use linear models, which extract text features from
the document, e.g. a bag of key words, TFIDF scores, etc. Intuitively, the relationship
between the market movement and the features is very complex due to hidden or indirect
relationships, and these term-frequency text features lose a lot of the information
in the document. Therefore, we want to train an end-to-end neural model to do the
prediction. The structure of the BERT-based model is shown in Figure 2.2.

Figure 2.2: BERT-based model’s architecture.

We split the earnings call transcripts into two parts: presentation and questions-
and-answers. In the presentation section, the company's management team typically
reports the key financial information and summarizes the company's performance. Then,
in the question-and-answer section, investors ask informed questions regarding the
company. Therefore, the two sections are expected to contain different information.
For each section, the tokens of that section are encoded by a deep transformer (BERT).
The BERT model only allows an input of at most 512 tokens, including the two
special tokens [CLS] and [SEP], thus we only keep the 100th through 610th tokens of
each section. We drop the first 100 tokens because there is always an uninformative
part at the beginning, which mainly introduces who the speakers are and who will
join the Q&A session. However, even using only these 510 tokens, we still achieve
state-of-the-art performance in our experiments, even better than the
Hierarchical Attention Model (HAN) applied to the whole section. This shows that
the opening paragraphs already contain enough information to learn the sentiment
of each section, while an LSTM fails when the sequence is long enough. Each
token $w_i$ is transformed into a contextualized embedding $c_i$, $i = 1, \ldots, 512$. Due to
the design of BERT, we use the first embedding $c_1$, corresponding to the special token
[CLS], as the embedding of that section. The obtained embeddings of the presentation
section $c_P$ and the questions-and-answers section $c_Q$ are concatenated and fed into a
one-layer multilayer perceptron (MLP) to generate textual features
$$f_{textual} = MLP([c_P, c_Q]).$$

On the other hand, we pass all the financial features into another MLP to generate the
final financial features $f_{financial}$. We then concatenate $f_{textual}$ and $f_{financial}$ to form the
final feature vector, which is fed into a third MLP for classification.
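A minimal PyTorch sketch of this architecture is given below. It is illustrative only: the class name, the hidden width of 128, and the use of bert-base-uncased are assumptions rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn
from transformers import BertModel

class BertStockClassifier(nn.Module):
    """Combine [CLS] embeddings of the two transcript sections with financial features."""
    def __init__(self, n_fin_features: int, n_classes: int = 3, hidden: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")   # 768-dim embeddings
        self.text_mlp = nn.Sequential(nn.Linear(2 * 768, hidden), nn.ReLU())
        self.fin_mlp = nn.Sequential(nn.Linear(n_fin_features, hidden), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_classes))

    def forward(self, pres_ids, pres_mask, qa_ids, qa_mask, fin_feats):
        # c_P, c_Q: embeddings of the [CLS] token of each section
        c_p = self.bert(input_ids=pres_ids, attention_mask=pres_mask).last_hidden_state[:, 0]
        c_q = self.bert(input_ids=qa_ids, attention_mask=qa_mask).last_hidden_state[:, 0]
        f_text = self.text_mlp(torch.cat([c_p, c_q], dim=-1))
        f_fin = self.fin_mlp(fin_feats)
        return self.classifier(torch.cat([f_text, f_fin], dim=-1))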

2.4.2 A Deep Reinforced Model

The drawback of the BERT-based model is that the input size of BERT is at most
512 tokens (around 400 words), which is insufficient for many practical problems.
For example, in our earnings data, the average length of a transcript is around
6,000 words, much larger than the BERT limit.
To tackle this issue, there are two possible directions: 1) summarize the long
document into a shorter one and then feed it into BERT; 2) select chunks or sentences.
The first method does not work in many cases due to its two-step nature. For example,
consider our application of using earnings transcript data to predict stock movement: if we
summarize the data, we can only obtain structural information about the transcripts, which
is already known beforehand. However, the information that can impact the
stock price is the "surprise", or "new information", for example, how
the executives explain the financial data or how they answer the questions
from investors. Therefore, in such situations, we should use an end-to-end
rather than a two-step method, because the final classification performance
can provide guidance for choosing the most appropriate content. This brings the
idea of the second direction: chunk or sentence selection.
Therefore, we propose using a reinforcement learning model to perform
chunk/sentence selection for the BERT-based model. This overall process is shown
in Figure 2.3. The model consists of four components: a Policy Network, a BERT
Embedding Network, a Financial Data Network, and a Classification Network. The policy
network adopts a stochastic policy, which samples a random action at each state.
The whole document is divided into several chunks; the policy network then
goes through the chunks and produces an action ({retain or delete}) sequence for
them. The selected chunks are treated as the input to the BERT model. Then
the embedding from BERT, combined with the features generated by the financial
network, is passed to the classification network for classification. The classification
performance provides the reward for the policy network.

Policy Network

The policy network follows a stochastic policy πpat |st ; Θq, and uses the classification
performance as a delayed reward to guide the training of policy learning. In order to
do the chunk selection, we perform action sampling for all the chunks. Once all the
actions are decided, the selected chunks will be treated as an input to the BERT model
to generate embeddings for the text. Combined with the generated features from the
financial data network, these features will be used by the classification network to
compute $P(y \mid X)$, where $X$ is the feature set including the network outputs from both BERT

Figure 2.3: The Deep Reinforced Model’s architecture.

and the financial data network. The reward of the classification is used as a reward
for policy learning.
Here we briefly introduce the state, action, policy, reward and objective function
as follows:
State: At each time step, the state st is the concatenation of two parts: the represen-
tation of the previously selected chunk and the representation of the current chunk.
The previously selected chunk serves as the context and provides a prior importance.
Both parts are represented by the pre-trained BERT embedding.
Action: The action space at each step is a binary variable: {Retain, Delete}. If the
action is Retain, the current chunk will be added to the selected chunk; otherwise, the

current chunk will be discarded. The final selected text will be a concatenation of all
selected chunks in different steps.
Reward: The reward comes at the end, when all actions have been sampled for the entire
sequence. We use the logarithm of the classification probability of the true label of
input $X$, i.e. $P(y = c_g \mid X)$, where $c_g$ is the gold label. Moreover, since we want to
encourage the model to delete more chunks so as to satisfy BERT's 512-token limitation,
we also include an additional term computed from the proportion of chunks being
deleted:
$$R_L = \log P(y = c_g \mid X) + \gamma L'/L,$$
where $L'$ is the number of chunks deleted (i.e. the corresponding action is delete), and $\gamma$ is a
hyperparameter balancing the two terms.
Policy Network: The policy network is a stochastic policy:
$$\pi(a_t \mid s_t; \Theta) := \sigma(W s_t + b),$$
where $\Theta$ is the set of policy parameters including $W$ and $b$, and $a_t$ and $s_t$ are the action and
state at step $t$. During training, an action is sampled at each step with probability given by
the policy network. After sampling for the whole document, the delayed reward is
calculated. For validation and testing, the action is picked by $\arg\max_a \pi(a \mid s_t; \Theta)$.
Objective Function: The training is guided by the REINFORCE algorithm, which
optimizes the policy to maximize the expected reward:
$$
\begin{aligned}
J(\Theta) &= \mathbb{E}_{(s_t, a_t)\sim P_{\Theta}(s_t, a_t)}\, r(s_1 a_1 \cdots s_L a_L) \\
&= \sum_{s_1 a_1 \cdots s_L a_L} P_{\Theta}(s_1 a_1 \cdots s_L a_L)\, R_{\tau} \\
&= \sum_{s_1 a_1 \cdots s_L a_L} p(s_1)\prod_t \pi_{\Theta}(a_t \mid s_t)\, p(s_{t+1}\mid s_t, a_t)\, R_{\tau} \\
&= \mathbb{E}_{\tau}\,\prod_t \pi_{\Theta}(a_t \mid s_t)\, R_{\tau},
\end{aligned}
$$
where $\tau$ represents the sampled trajectory $\{a_1, a_2, \ldots, a_T\}$. The gradient is calculated as
$$
\begin{aligned}
\nabla_{\Theta} J(\Theta) &= \mathbb{E}_{\tau} \sum_t \nabla_{\Theta}\log\pi_{\Theta}(a_t \mid s_t)\, R_{\tau} \\
&\approx \frac{1}{N}\sum_{i=1}^{N}\sum_t \nabla_{\Theta}\log\pi_{\Theta}(a_t^i \mid s_t^i)\, R_{\tau}^i,
\end{aligned}
$$
where $N$ is the number of sampled trajectories. Here $R_{\tau}^i$ equals the delayed reward
from the downstream classifier at the last step.
Additionally, to encourage exploration and avoid local optima, we add entropy
regularization to the policy objective function:
$$J_{reg}(\Theta) = \frac{\lambda}{N}\sum_{i=1}^{N}\frac{1}{T_i}\nabla_{\Theta} H(\pi(s_t^i;\Theta)),$$
where $H$ is the entropy, $\lambda$ is the regularization strength, and $T_i$ is the trajectory
length. However, this objective function by itself still cannot guarantee that we choose no more than
512 tokens. Therefore, we add a final step that keeps the chunks/sentences
with the highest predicted probability $\pi(a_t \mid s_t; \Theta)$.
Finally, the downstream classification network and policy network are warm-started
by training separately and then jointly trained together.
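The following sketch illustrates the policy network and a single REINFORCE update as described above. It assumes pre-computed 768-dimensional chunk embeddings and a user-supplied classify_logprob function returning the log-probability of the gold label given the retained chunks; the entropy term, the 512-token cap, and warm-starting are omitted for brevity.

import torch
import torch.nn as nn

class ChunkPolicy(nn.Module):
    """Stochastic policy pi(a_t | s_t; Theta) = sigma(W s_t + b) over {delete, retain}."""
    def __init__(self, emb_dim: int = 768):
        super().__init__()
        # state s_t = [embedding of previously retained chunk, embedding of current chunk]
        self.linear = nn.Linear(2 * emb_dim, 1)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(state)).squeeze(-1)   # probability of "retain"

def reinforce_step(policy, optimizer, chunk_embs, classify_logprob, gamma=0.1):
    """One policy-gradient update on a single document (a list of chunk embeddings)."""
    context = torch.zeros_like(chunk_embs[0])
    log_probs, kept = [], []
    for emb in chunk_embs:                              # sample one action per chunk
        p_retain = policy(torch.cat([context, emb], dim=-1))
        action = torch.bernoulli(p_retain)
        log_probs.append(torch.log(p_retain if action > 0 else 1.0 - p_retain))
        if action > 0:
            kept.append(emb)
            context = emb                               # last retained chunk is the new context
    deleted_frac = 1.0 - len(kept) / len(chunk_embs)
    # delayed reward: log-likelihood of the gold label plus a deletion bonus
    reward = classify_logprob(kept) + gamma * deleted_frac
    loss = -torch.stack(log_probs).sum() * reward.detach()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(reward)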

2.5 Experiments

In this section, we detail our experiments setup and results.

Ticker  Date        Transcript  Return   Earn_Surp  VIX     ...  Book to Market Ratio
HST     2007-07-19  ######       0.010    0.174     16.000  ...  0.443
MRO     2007-08-01  ######      -0.074    0.047     23.52   ...  0.387
MS      2007-12-20  ######       0.039   -2.842     21.68   ...  0.562
SKX     2008-04-23  ######      -0.019   -0.611     20.87   ...  0.712
...     ...         ...          ...      ...       ...     ...  ...

Table 2.1: The Structure of the Dataset.

2.5.1 Data Collection

We collected 141,965 earnings call transcripts from Seekingalpha.com over the period
from 2007 to early 2020. This covers most recent economic cycles. Company names,
tickers, and dates information are also saved to align with financial data.
The stock price data is downloaded through the Yahoo Finance API. If the earnings
call happens after the close of market, our stock movement measures are calculated
using the next day’s data; on the other hand, if the call happens before the opening
of the market, we will use the stock price that day to calculate the measures. We also
use this data source to get VIX and calculate past volatility and past returns.
For earnings surprise, book to market ratio and market value, these are calculated
using data downloaded from databases CRSP and CRSP/Computstat Merged which
we access via the WRDS Python API.
Since fetching financial data from WRDS is very time-consuming, we randomly
sample 1,500 transcripts from the full set of 141,965 and obtain the data
for all variables. A sketch of our data is shown in Table 2.1, and Figure 2.4 shows the
return distribution of the 1,500 samples.

2.5.2 Training Setup

To prevent look-ahead bias, we use a temporal 80/10/10 percentage split to divide


the samples into training, validation and test sets. The training set spans from July

Figure 2.4: Distribution of returns among the 1,500 samples.

2007 to Oct 2018, validation set from Oct 2018 to July 2019, and test from July 2019
to February 2020.
For the transcript data, we proceed differently for the BERT-based model and the
deep reinforced model. For the BERT-based model, we keep the 100th through 610th
tokens of the transcript, together with the two extra special tokens [CLS] and [SEP], to satisfy the
512-token limit of BERT. For the deep reinforced model, the text input is the
whole transcript, but we divide it into chunks of 255 tokens each, which means the
model will choose the two most important chunks as the final input to the BERT model.
We set the embedding size of the BERT model to 768 and the batch size to 32.
For both the financial-network MLP and the final combination MLP, we use two fully
connected layers. We train the final model with an Adam optimizer with an initial
learning rate of 0.00001.

2.5.3 Evaluation metrics

We use two evaluation metrics. The first is the classical performance measure,
classification accuracy. However, when the samples are imbalanced, the accuracy can be
high even when the classifier always predicts the majority class. To deal
with the imbalance issue, previous work ((Xie, Passonneau, Wu, and Creamer 2013), (Ding,
Zhang, Liu, and Duan 2014), and (Ding, Zhang, Liu, and Duan 2015)) uses the Matthews
Correlation Coefficient (MCC) to measure performance. It is a single summary value
of the confusion matrix, which has proved to be more informative than accuracy and the F1
score. In this paper, we also use this metric to compare with previous work. For
binary classification, MCC is calculated as:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and
false negatives, respectively.
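For reference, a direct implementation of this formula from confusion-matrix counts is shown below (the helper name is ours; sklearn.metrics.matthews_corrcoef provides an equivalent computation).

import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0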

2.5.4 Baselines

We construct the following five baselines in different genres:

• Random guess: a naive estimator randomly guessing the class labels according to
their proportions in the data.

• Majority Class: another naive estimator always guessing the majority class.

• ARIMA: autoregressive integrated moving average, a classical model for forecasting
non-stationary time series using only the price data itself.

• SVM: a support vector machine using TFIDF scores calculated for chosen words
to do the classification (Luss and d'Aspremont 2015).

• HAN: the previous state of the art in stock movement prediction, using a
hierarchical attention model (Hu, Liu, Bian, Liu, and Liu 2018).

Model Accuracy MCC
Random guess 33.56 0.001
Majority class 37.22 0.000
ARIMA 34.26 0.013
SVM 35.61 0.038
HAN 38.23 0.091
BERT-based Model 40.35 0.120
Deep Reinforced Model 41.57 0.127

Table 2.2: Results for task 1: direction prediction.

2.6 Results and discussions

2.6.1 Direction Prediction

Predicting the direction of the stock movement is a very challenging task. Here we
choose $\tau_1 = 0.03$, so that the three classes in the dataset are almost balanced. We show the
performance of the baseline models and our models in Table 2.2; our BERT-based
model achieves a new state of the art of 40.35 in accuracy and 0.120 in MCC, outperforming
HAN significantly by 2.12 in accuracy and 0.029 in MCC. It is worth mentioning
that our BERT-based model only uses the first 510 tokens of both the presentation
and the questions-and-answers section, while HAN uses all of the tokens in each section.
We believe there are two major reasons: (1) according to many research studies in
extractive summarization, the opening paragraph can be a good summary of
the whole article, a typical phenomenon in news and reports; (2) the capacity of
BERT is much larger than that of HAN, and the self-attention mechanism has also proved
more powerful than RNNs at learning dependencies. Our deep reinforced model achieves
even better results, with an accuracy of 41.57 and an MCC of 0.127. This means that
reinforcement learning does help us find the most important chunks; however, the
difference between the BERT-based model and the deep reinforced model is not very
large, which suggests that the first 510 tokens have already set the sentiment of the whole
document.

Model Accuracy MCC
Random guess 61.50 0.009
Majority class 73.98 0.000
ARIMA 61.23 0.021
SVM 65.32 0.081
HAN 68.53 0.116
BERT-based Model 72.23 0.156
Deep Reinforced Model 73.38 0.161

Table 2.3: Results for task 2: Abnormal return prediction.

Though slightly better than random guess, the classic time series model ARIMA
does not yield satisfying results. The majority class baseline's MCC is 0 by construction.
The performance of (Luss and d'Aspremont 2015) is only slightly better than
random guess and ARIMA, still close to no predictability, which is consistent
with their finding that their model has no power to predict the direction
of stock movement. This is due to the loss of information from using only TFIDF to
extract information from the document, and also because the capacity of the SVM is much smaller
than that of the complex neural models.

2.6.2 Abnormal Return Prediction

This task is relatively easier than task 1, since we only need to predict the magnitude
rather than the direction, and predicting the absolute return is essentially predicting volatility;
many research works have found predictability in abnormal returns, e.g. (Taylor 2008), (Ding,
Granger, and Engle 1993), and (Luss and d'Aspremont 2015).
Here we set $\tau_2 = 0.03$; the proportion of class 1 is then 0.31, so the data are imbalanced
and we add class weights to the loss function. According to Table 2.3, our
BERT-based model and deep reinforced model again achieve state-of-the-art
performance. The reasons are the same as in task 1. One thing worth mentioning is
that, since the data in this task are imbalanced, the majority class baseline is better than
random guess and even ARIMA, while its MCC is still 0. The SVM performs much
better in this task, since the original paper (Luss and d'Aspremont 2015) already
found predictive power for abnormal returns using an SVM.

Figure 2.5: Returns distribution of label 1 (left) and label -1 (right).

2.6.3 Market Simulation

Alpha Distribution

In this section, we present some descriptive statistics on the distribution
of alpha over the week after the earnings conference call, to see on which day we should
act on the prediction. We split the data into three parts corresponding to label 1 (going up),
0 (neutral), and -1 (going down), and focus on labels 1 and -1, since they are the cases
we want to trade. Figure 2.5 shows the return distribution over different days for label
1 (left) and label -1 (right) separately; we can see that for both labels, the largest
positive return is on the first day after the earnings conference call. To show that these
abnormal returns are not due to market beta, we estimate the market beta, exclude
the part of the return explained by beta, and regard the residuals as alpha. In Figure
2.6, we can also see that most of the alpha comes from the first day; on the
second day there is a small reversion, while for the other days the returns look close
to random. This supports making the prediction for the first day after the earnings call.

Figure 2.6: Alpha distribution of label 1 (left) and label -1 (right).

Direction betting

We imitate the strategy used by (Lavrenko, Schmill, Lawrie, Ogilvie, Jensen, and
Allan 2000) and (Ding, Zhang, Liu, and Duan 2015) to demonstrate the performance of
trading on our predictions. If our model predicts that on the next trading day the stock
price will go up by more than 2% (type 1), then we invest $10,000 to buy the stock
at the opening price, assuming we can buy in dollar amounts. If the price actually drops by more
than 2%, which means the prediction is wrong, we sell it right away to stop
the loss. Otherwise, we sell it at the end of the day at the closing price. The
resulting P&L is shown in Figure 2.7. Here the P&L includes the base $10,000 for easier
comparison.
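A simplified sketch of this trading rule is given below; it assumes one record per predicted type-1 event with that day's open, low, and close prices, and it ignores transaction costs.

def direction_betting_pnl(trades, capital=10_000.0, stop=0.02):
    """Equity after buying $10,000 at the open with a 2% stop-loss on each predicted up move.

    `trades` is an iterable of dicts with keys 'open', 'low', and 'close' for the days
    on which the model predicts an up move of more than 2% (type 1).
    """
    equity = capital
    for day in trades:
        shares = capital / day["open"]
        if (day["low"] - day["open"]) / day["open"] <= -stop:
            exit_price = day["open"] * (1 - stop)   # stop-loss triggered intraday
        else:
            exit_price = day["close"]               # otherwise hold until the close
        equity += shares * (exit_price - day["open"])
    return equity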
Delta Hedging
In this section, we use a delta hedging strategy to evaluate our prediction of
abnormal returns. Delta ($\Delta$) measures the rate of change of the option price with respect
to changes in the underlying asset's price; $\Delta$ is the first derivative of the option price $V$
with respect to the underlying instrument's price $S$:
$$\Delta = \frac{\partial V}{\partial S}.$$

Figure 2.7: Accumulative return for directional betting for test data from 2019-07-17
to 2020-02-21, in total of 103 trades.

Delta hedging uses the underlying stock to hedge out the market risk of an option
(or the opposite direction), which earns profits by betting purely on volatility. We obtain
options data from OptionMetrics in the WRDS database. We buy call options at the close
of the trading day before the earnings call and short delta shares of the underlying
stock for each call option, maintaining a total value of $10,000 across the long and short
positions. We then clear our position when the market closes on the next day. However, if
the total value invested drops by 2%, we close our position right away to prevent a
serious drawdown. Figure 2.8 shows the P&L of applying our strategy.

Figure 2.8: Accumulative return for delta hedging for test data from 2018-10-31 to 2019-06-27, in total of 63 trades.

2.6.4 Ablation Study

We continue by performing a systematic ablation study to answer the question: how
do textual and financial features influence the prediction?
Feature Ablation
Figure 2.9 shows the results of the ablation study on features. We consider three models:
a model using only text data, a model using only financial data, and a model using both text
and financial data (without the reinforcement learning layer). The figure on the left
is for direction prediction; we can see that text contributes more to the prediction.
Although financial data by itself contributes almost nothing to the final prediction,
combining it with the text features increases performance. An intuitive example might
be: size itself does not have much predictive power, since according to efficient market
theory all such information has already been priced in; however, small companies
react more drastically to information coming from the text, since their stock prices
are more volatile. Similar results can be observed for abnormal return prediction. Another
interesting observation is that the abnormal return prediction performance is
better than the direction prediction performance. This also meets our expectation, as (Luss and
d'Aspremont 2015) found in their paper, while our method shows predictive power
even for the direction prediction task.

Figure 2.9: Comparison between the models only using text, only using financial data, and combining them together.

2.7 Conclusion

We demonstrated the effectiveness of deep transformers for stock movement prediction
from earnings call transcripts by introducing the BERT-based model, a neural network
architecture for this task. Then, to deal with the long-document problem, we also
introduced the deep reinforced model. We tested our models on a new comprehensive
dataset and showed that they both performed better than previous strong baselines.

Chapter 3

Large Covariance Matrix Estimation


Using High Frequency Data

3.1 Introduction

Factor models provide a parsimonious representation of the dynamics of asset returns,


as motivated from the arbitrage pricing theory by (Ross 1976). Since this seminal
work, there has been much effort devoted to the search for proxies of factors (e.g.,
(Fama and French 1993) and (Fama and French 2015)) or characteristics of stocks
(e.g., (Daniel and Titman 1997)) to explain the cross-sectional variation of expected
returns. These factors and characteristics also serve as natural candidates for factors
and loadings that drive the time-series dynamics of stock returns. In this paper, we
make use of a factor model to assist the estimation of large covariance matrices among
stock returns.
The factor model specification leads to a low-rank plus sparse structure of the
covariance matrix, which guarantees a well-conditioned estimator as well as a desirable
performance of its inverse (precision matrix). To estimate the low-rank component, we
consider three scenarios: known factors, known factor loadings, or neither of the two

are known. In the first two scenarios, we employ either a time-series regression (TSR)
or a cross-sectional regression (CSR) to estimate the unknown loadings or factors,
using either portfolios and ETFs as proxies for factors, or characteristics as proxies for
factor loadings. In the third scenario, we employ the principal component analysis
(PCA) to identify latent factors and their loadings. Combining the estimated factors
and/or their loadings yields the low-rank component of the covariance matrix. With
respect to the sparse component, we adopt a variety of thresholding methods that
warrant positive semi-definite estimates of the covariance matrix.
In addition, we use the large cross section of transaction-level prices available at
high frequencies. High frequency data provide a unique opportunity to measure the
variation and covariation among stock returns. The massive amount of data facilitates
the use of simple nonparametric estimators within a short window, such as the sample
covariance matrix estimator on a daily, weekly, or monthly basis, so that several issues
associated with parametric estimation using low-frequency time series covering a long
timespan become irrelevant, such as structural breaks and time-varying parameters,
see, e.g., (Aït-Sahalia and Jacod 2014) for a review. However, the microstructure
noise and the asynchronous arrival of trades, which come together with intraday data,
result in biases of the sample covariance estimator with data sampled at a frequency
higher than, say, every 15 minutes, exacerbating the curse of dimensionality due to
data elimination.
By adapting the pre-averaging estimator designed for low-dimensional covariance
matrix estimation, e.g., (Jacod, Li, Mykland, Podolskij, and Vetter 2009), we construct
noise-robust estimators for large covariance matrices, making use of a factor model
in each of the three aforementioned scenarios. Using the large deviation theory of
martingales, we establish the desired consistency of our covariance matrix estimators
under the infinity norm (on the vector space) and the weighted quadratic norm, as
well as the precision matrix estimators under the operator norm. Moreover, we show

that TSR converges as the sample size increases, regardless of a fixed or an increasing
dimension. By contrast, the convergence of CSR and PCA requires a joint increase of
the dimensionality and the sample size – the so called blessings of dimensionality, see
(Donoho et al. 2000).
Empirically, we analyze the out-of-sample risk of optimal portfolios in a horse race
among various estimators of the covariance matrix as inputs. The portfolios comprise
constituents of Dow Jones 30, S&P 100, and S&P 500 indices, respectively. We find
covariance matrix estimators based on pre-averaged returns sampled at refresh times
outperform those based on returns subsampled at a fixed 15-minute frequency, for
almost all combinations of estimation strategies and thresholding methods. Moreover,
either TSR or PCA, plus the location thresholding that utilizes the Global Industry
Classification Standards (GICS) codes, yield the best performance for constituents of
the S&P 500 and S&P 100 indices, whereas TSR dominates in the case of Dow Jones
30.
Our paper is closely related to a growing literature on continuous-time factor
models for high frequency data. (Fan, Furger, and Xiu 2016) and (Aït-Sahalia and
Xiu 2017) develop the asymptotic theory for large dimensional factor models with
known and unknown factors, respectively, assuming a synchronous and noiseless
dataset. Their simulations show a clear break-down of either TSR or PCA when noise
is present and the sampling frequency is beyond every 15 minutes. (Pelger 2015a) and
(Pelger 2015b) develop the central limit results of such models in the absence of the
noise. Their asymptotic results are element-wise, whereas the theoretical results in
this paper focus on matrix-wise properties. (Wang and Zou 2010) propose the first
noise-robust covariance matrix estimator in the high dimensional setting, by imposing
the sparsity assumption on the covariance matrix itself, see also (Tao, Wang, and
Zhou 2013), (Tao, Wang, Yao, and Zou 2011), (Tao, Wang, and Chen 2013), and
(Kim, Wang, and Zou 2016) for related results. (Brownlees, Nualart, and Sun 2017)

impose the sparsity condition on the inverse of the covariance matrix or the inverse of
the idiosyncratic covariance matrix. By contrast, we impose the sparsity assumption
on the covariance of the idiosyncratic components of a factor model, as motivated
from the economic theory, which also fits the empirical data better.
Our paper is also related to the recent literature on the estimation of low-
dimensional covariance matrix using noisy high frequency data. The noise-robust
estimators include, among others, the multivariate realized kernels by (Barndorff-
Nielsen, Hansen, Lunde, and Shephard 2011), the quasi-maximum likelihood estimator
by (Aït-Sahalia, Fan, and Xiu 2010) and (Shephard and Xiu 2017), the pre-averaging
estimator by (Christensen, Kinnebrock, and Podolskij 2010), the local method of
moments by (Bibinger, Hautsch, Malec, and Reiss 2014), and the two-scale and
multi-scale estimators by (Zhang 2011) and (Bibinger 2012). (Shephard and Xiu 2017)
document the advantage of using a factor model in their empirical study, even when the
dimension of assets is as low as 13. We build our estimator based on the pre-averaging
method because of its simplicity in deriving the in-fill and high-dimensional asymp-
totic results. Allowing for an increasing dimensionality asymptotics sheds light on
important statistical properties of the covariance matrix estimators, such as minimum
and maximum eigenvalues, condition numbers, etc, which are critical for portfolio
allocation exercises. (Aït-Sahalia and Xiu 2017) develop a related theory of principal
component analysis for low-dimensional high frequency data.
Our paper is also related to the recent literature on large covariance matrix
estimation with low frequency data. (Fan, Fan, and Lv 2008) propose a large covariance
matrix estimator using a strict factor model with observable factors. (Fan, Liao, and
Mincheva 2011) extend this result to approximate factor models. (Fan, Liao, and
Mincheva 2013) develop the POET method for models with unobservable factors.
Alternative covariance matrix estimators include the shrinkage method by (Ledoit
and Wolf 2004) and (Ledoit and Wolf 2012), and the thresholding method proposed

by (Bickel and Levina 2008a), (Bickel and Levina 2008b), (Cai and Liu 2011) and
(Rothman, Levina, and Zhu 2009).
Our paper shares the theoretical insight with existing literature of factor models
specified in discrete time. (Bai and Ng 2002) and (Onatski 2010) propose estimators
to determine the number of factors. (Bai 2003) develops the element-wise inferential
theory for factors and their loadings. These papers, including ours, allow for more
general models than the approximate factor models introduced in (Chamberlain and
Rothschild 1983). The above factor models are static, as opposed to the dynamic
factor models in which the lagged values of the unobserved factors may also affect the
observed dependent variables. Inference on dynamic factor models are developed in
(Forni, Hallin, Lippi, and Reichlin 2000), (Forni and Lippi 2001), (Forni, Hallin, Lippi,
and Reichlin 2004), and (Doz, Giannone, and Reichlin 2011); see (Croux, Renault,
and Werker 2004) for a discussion.
Factor models based on stock characteristics date back to (Rosenberg 1974), which
suggests to model factor betas of stocks as linear functions of observable security
characteristics. (Connor and Linton 2007) and (Connor, Hagmann, and Linton 2012)
further extend this model to allow for nonlinear or nonparametric functions. One
of our covariance matrix estimators, namely, CSR, is designed to leverage the linear
factor model with characteristics as loadings. Such an estimation strategy is widely
used in the financial industry, but is largely ignored by the academic literature.1 Our
asymptotic analysis of this estimator fills in this gap.
The rest of the paper is structured as follows. In Section 3.2, we set up the model
and discuss model assumptions. Section 3.3 proposes the estimation procedure for
each scenario of the factor model and establishes the asymptotic properties of these
estimators. Section 3.4 discusses the choice of tuning parameters. Section 3.5 provides
Monte Carlo simulation evidence. Section 3.6 evaluates these estimators in an out-
1 Barra Inc., which was acquired by MSCI Inc., was a leading provider of this type of covariance matrices to practitioners; see, e.g., (Kahn, Brougham, and Green 1998).

of-sample optimal portfolio allocation race. The appendix contains all mathematical
proofs. The paper is published in Journal of Econometrics (Dai, Lu, and Xiu 2019).

3.2 Model Setup and Assumptions

3.2.1 Notation

Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, \mathbb{P})$ be a filtered probability space. Throughout this paper, we use
$\lambda_j(A)$, $\lambda_{\min}(A)$, and $\lambda_{\max}(A)$ to denote the $j$-th (in descending order), the minimum, and
the maximum eigenvalues of a square matrix $A$. In addition, we use $\|A\|_1$, $\|A\|$, $\|A\|_F$,
and $\|A\|_\Sigma$ to denote the $L_1$ norm, the operator norm (or $L_2$ norm), the Frobenius norm,
and the weighted quadratic norm of a matrix $A$, that is, $\max_j \sum_i |A_{ij}|$, $\sqrt{\lambda_{\max}(A^{\top}A)}$,
$\sqrt{\mathrm{Tr}(A^{\top}A)}$, and $d^{-1/2}\|\Sigma^{-1/2} A \Sigma^{-1/2}\|_F$, respectively. Notice that $\|A\|_\Sigma$ is only defined
for a $d\times d$ square matrix. We use $I_d$ to denote a $d\times d$ identity matrix. All vectors are
regarded as column vectors, unless otherwise specified. When $A$ is a vector, both $\|A\|$
and $\|A\|_F$ are equal to its Euclidean norm. We also use $\|A\|_{\mathrm{MAX}} = \max_{i,j}|A_{ij}|$ to denote
the $L_\infty$ norm of $A$ on the vector space. We use $e_i$ to denote a $d$-dimensional column
vector whose $i$th entry is 1 and 0 elsewhere. We write $A_n \asymp B_n$ if $|A_n|/|B_n| = O(1)$.
We use $C$ to denote a generic constant that may change from line to line.

3.2.2 Factor Dynamics

Let $Y$ be a $d$-dimensional log-price process, $X$ be an $r$-dimensional factor process, $Z$ be
the idiosyncratic component, and $\beta$ be a constant factor loading matrix of size $d\times r$.
We make the following assumption about their dynamic relationship:

Assumption 1. Suppose that $Y_t$ follows a continuous-time factor model:
$$Y_t = \beta X_t + Z_t, \tag{3.1}$$
in which $X_t$ is a continuous Itô semimartingale, that is,
$$X_t = \int_0^t h_s\,ds + \int_0^t \eta_s\,dW_s,$$
and $Z_t$ is another continuous Itô semimartingale, satisfying
$$Z_t = \int_0^t f_s\,ds + \int_0^t \gamma_s\,dB_s,$$
where $W_s$ and $B_s$ are standard Brownian motions. In addition, $h_s$ and $f_s$ are progressively
measurable. Moreover, the processes $\eta_s$ and $\gamma_s$ are càdlàg, and, writing $e_s = \eta_s\eta_s^{\top}$,
$g_s = \gamma_s\gamma_s^{\top}$, the matrices $e_s$, $e_{s-}$, $g_s$, and $g_{s-}$ are positive-definite. Finally, for all $1\le i,j\le r$,
$1\le k,l\le d$, $|\beta_{kj}| \le C$ for some $C > 0$, and there exists a locally bounded process $H_s$
such that $|h_{i,s}|$, $|\eta_{ij,s}|$, $|\gamma_{kl,s}|$, $|e_{ij,s}|$, $|f_{kl,s}|$, and $|g_{kl,s}|$ are bounded by $H_s$ uniformly for
$0\le s\le t$.

Assumption 1 is fairly general except for two important limitations – β is constant


and jumps are excluded. The constant β assumption seems restrictive, but (Reiß,
Todorov, and Tauchen 2015) find evidence in support of this within a short time
window, say, a month. The same assumption has been adopted by e.g., (Todorov and
Bollerslev 2010) in a low dimensional setting and (Fan, Furger, and Xiu 2016) and
(Aït-Sahalia and Xiu 2017) in high dimensional settings. In our empirical study, we
impose a constant β within each month, also because it is available from the MSCI
Barra at a monthly frequency.
To emphasize and highlight the theoretical trade-offs from a high dimensional
perspective, we exclude jumps from our theoretical analysis to avoid delving into
unnecessary technicalities. As a result, our empirical covariance matrix estimates
contain the quadratic covariation contributed by co-jumps. While co-jumps may be
an important component, we do not find it particularly important to separate them
from the total quadratic covariation for the large portfolio allocation exercise we
are interested in. Moreover, separating jumps would substantially complicate the
estimation procedure, e.g., with more tuning parameters. In this paper, we desire
simpler estimators while leaving analysis of jumps for future work.
Next, we impose the usual exogeneity assumption:

Assumption 2. For any $1\le k\le d$ and $1\le l\le r$, we have $[Z_{k,s}, X_{l,s}] = 0$ for any
$0\le s\le t$, where $[\cdot,\cdot]$ denotes the quadratic covariation.

Our main goal is to estimate the integrated covariance matrix of $Y$, denoted as
$\Sigma = \frac{1}{t}\int_0^t c_s\,ds$, where $c_s$ is the spot covariance of $Y_s$. Assumptions 1 and 2 imply a factor
structure on $c_s$:
$$c_s = \beta e_s\beta^{\top} + g_s, \quad 0\le s\le t.$$
As a result, we can decompose the quadratic covariation of $Y$ within $[0,t]$, $\Sigma$, as
$$\Sigma = \beta E\beta^{\top} + \Gamma,$$
where
$$\Sigma = \frac{1}{t}\int_0^t c_s\,ds, \qquad \Gamma = \frac{1}{t}\int_0^t g_s\,ds, \qquad\text{and}\qquad E = \frac{1}{t}\int_0^t e_s\,ds.$$

We omit the dependence of Σ, E, and Γ on t for brevity in notation.


Next we impose that factors are pervasive, in the sense that they influence a large
number of assets, see, e.g., (Chamberlain and Rothschild 1983).

Assumption 3. $E$ has distinct eigenvalues, with $\lambda_{\min}(E)$ bounded away from 0. Moreover,
there exists some positive-definite matrix $B$ such that $\|d^{-1}\beta^{\top}\beta - B\| = o(1)$ as
$d \to \infty$, and $\lambda_{\min}(B)$ is bounded away from 0.

As in (Aït-Sahalia and Xiu 2017), Assumption 3 leads to the identification of the


number of factors when factors and their loadings are latent. It is also used in the
case of known factors by (Fan, Furger, and Xiu 2016), when building the operator

norm bound for the precision matrix. Such an assumption may also be restrictive in
that it excludes the existence of weak factors, see, e.g., (Onatski 2010). Dealing with
weak factors requires a rather different setup, so we leave it for future work.

3.2.3 Sparsity

For high dimensional covariance matrix estimation, certain “sparsity” condition is


necessary for dimension reduction, in addition to a factor model. This is because the
idiosyncratic component of the covariance matrix, once the low-rank component is
removed, is equally large. One cannot obtain a good estimate of it without additional
assumptions. Sparsity seems a reasonable choice for both known factor and unknown
factor models given the empirical findings of (Aït-Sahalia and Xiu 2017).
We define $m_d$ as the degree of sparsity of $\Gamma$, where
$$m_d = \max_{i\le d}\sum_{j\le d}|\Gamma_{ij}|^q, \quad \text{for some } q\in[0,1).$$

The sparsity assumption imposes md {d Ñ 0. This notion of sparsity follows from


(Rothman, Levina, and Zhu 2009) and (Bickel and Levina 2008b). When q “ 0, md is
equal to the maximum number of non-zero elements in rows of Γ, the usual notion used
by (Bickel and Levina 2008a). In this case, sparsity simply means there are not many
non-zero elements in each row of Γ. (Fan, Furger, and Xiu 2016) and (Aït-Sahalia
and Xiu 2017) consider this special case, while requiring Γ is block diagonal. Our
assumption below is more general.
Under this notion of sparsity, we have

$$\|\Gamma\| \le \|\Gamma\|_1 = \max_{i\le d}\sum_{j\le d}|\Gamma_{ij}| = O(m_d).$$

Therefore, imposing md {d Ñ 0 creates a gap between eigenvalues of the low rank
(βEβ | ) and the sparse (Γ) components of the covariance matrix Σ, which is essential
for identification in a latent factor model specification, and for estimation of the factor
models in general.
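A small simulation, with all parameter values assumed purely for illustration, shows this spiked-eigenvalue structure: the r factor-driven eigenvalues grow with d, while the remaining eigenvalues stay at the level of the sparse component.

import numpy as np

rng = np.random.default_rng(0)
d, r = 100, 3
beta = rng.normal(size=(d, r))                  # pervasive loadings
E = np.diag([0.09, 0.04, 0.01])                 # integrated factor covariance
Gamma = np.diag(rng.uniform(0.01, 0.05, d))     # sparse (here diagonal) idiosyncratic part
Sigma = beta @ E @ beta.T + Gamma               # low-rank plus sparse covariance
eigvals = np.linalg.eigvalsh(Sigma)[::-1]
print(eigvals[:r], eigvals[r])                  # r spiked eigenvalues, then a clear gap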
To use sparsity for estimation, we define a class of thresholding functions $s_\lambda(z): \mathbb{R}\to\mathbb{R}$, which satisfies:
$$(i)\ |s_\lambda(z)| \le |z|; \qquad (ii)\ s_\lambda(z) = 0 \ \text{for}\ |z|\le\lambda; \qquad (iii)\ |s_\lambda(z) - z|\le\lambda.$$

As discussed in (Rothman, Levina, and Zhu 2009), Condition (i) imposes shrinkage,
condition (ii) enforces thresholding, and condition (iii) restricts the amount of shrinkage
to be no more than λ. The exact three requirements of sλ p¨q ensure desirable statistical
properties of the estimated covariance matrix. Examples of such thresholding functions
we use include hard thresholding, soft thresholding, smoothly clipped absolute deviation
(SCAD) ((Fan and Li 2001)), and adaptive lasso (AL) ((Zou 2006)):

$$s_\lambda^{Hard}(z) = z\,1(|z| > \lambda), \qquad s_\lambda^{Soft}(z) = \mathrm{sign}(z)(|z| - \lambda)_+, \qquad s_\lambda^{AL}(z) = \mathrm{sign}(z)\big(|z| - \lambda^{\eta+1}|z|^{-\eta}\big)_+,$$
$$s_\lambda^{SCAD}(z) = \begin{cases} \mathrm{sign}(z)(|z| - \lambda)_+, & \text{when } |z|\le 2\lambda; \\ \dfrac{(a-1)z - \mathrm{sign}(z)a\lambda}{a-2}, & \text{when } 2\lambda < |z| \le a\lambda; \\ z, & \text{when } a\lambda < |z|, \end{cases}$$
where $a = 3.7$ and $\eta = 1$, as suggested by (Rothman, Levina, and Zhu 2009). We


adopt these functions in the construction of the estimators in Section 3.3. Although
these choices lead to the same convergence rate in our analysis, the resulting finite-sample
performance of the covariance matrices differs quite a bit, which we investigate
in simulations and the empirical study.
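For concreteness, a direct NumPy translation of these four thresholding rules (with the defaults a = 3.7 and eta = 1) might look as follows; such helpers would be applied entrywise to the estimated idiosyncratic covariance.

import numpy as np

def hard(z, lam):
    return z * (np.abs(z) > lam)

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def adaptive_lasso(z, lam, eta=1.0):
    with np.errstate(divide="ignore"):
        shrink = lam ** (eta + 1) * np.abs(z) ** (-eta)
    return np.sign(z) * np.maximum(np.abs(z) - shrink, 0.0)

def scad(z, lam, a=3.7):
    return np.where(np.abs(z) <= 2 * lam,
                    np.sign(z) * np.maximum(np.abs(z) - lam, 0.0),
                    np.where(np.abs(z) <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))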

3.2.4 Microstructure Noise

There are three scenarios of factor models we analyze, depending on whether the factor
X or its loading β are known or not. We use the term “known” instead of “observable”,
because even if we assume that the factor X can be proxied by certain portfolios in
the literature, for instance the Fama-French three factors by (Fama and French 1993),
we allow for potential microstructure noise so that the true factors are always latent
in our setup.
The first scenario assumes that $X$ is known, in which case we denote the observed
factor as $X^{\star}$:
$$Y^{\star}_{t^i_j} = Y_{t^i_j} + \varepsilon^{y}_{t^i_j}, \qquad X^{\star}_{t^i_j} = X_{t^i_j} + \varepsilon^{x}_{t^i_j},$$
where $\varepsilon^{y}$ and $\varepsilon^{x}$ are some additive noises associated with the observations at the sampling
times $t^i_j$. We use $t^i_j$ to denote the arrival time of the $j$th transaction of asset $i$. We
can thereby rewrite the factor model (3.1) as
$$Y^{\star}_{t^i_j} = \beta X^{\star}_{t^i_j} + Z^{\star}_{o,t^i_j}, \tag{3.2}$$
where $Z^{\star}_{o,t^i_j} = Z_{t^i_j} + \varepsilon^{y}_{t^i_j} - \beta\varepsilon^{x}_{t^i_j}$. Apart from the noise, this model is a standard linear

regression. In the empirical study, we regard those portfolios that are useful to explain
the cross-section of expected asset returns as factors, including the five Fama-French
factors ((Fama and French 2015)) and the momentum factor ((Carhart 1997)).2 We
also add industry portfolios as suggested by (King 1966).
2 While these factors explain the cross-section of expected returns, they also account for significant variations in the time series of realizations.

The second scenario assumes that $\beta$ is known and perfectly observed, yet $Y$ is
again contaminated, so we can write the model as
$$Y^{\star}_{t^i_j} = \beta X_{t^i_j} + Z^{\star}_{u,t^i_j}, \tag{3.3}$$
where $Z^{\star}_{u,t^i_j} = Z_{t^i_j} + \varepsilon^{y}_{t^i_j}$. This model dates back to (Rosenberg 1974), who developed a

factor model of stock returns in which the factor loadings of stocks are linear functions
of observable security characteristics. This is equivalent to a model with characteristics
as βs associated with some linear latent factors. In the empirical study, we use 13
characteristics obtained from the MSCI Barra, a leading company that provides factors
and covariance matrices using this method.
In the third scenario, we only observe a noisy $Y$, so the model can be written in
the same form as (3.3). This model is a "noisy" version of the approximate factor model
of (Chamberlain and Rothschild 1983) and (Bai 2003), which can only be identified
as the dimension of $Y$ increases to infinity, thanks to the "blessing" of dimensionality.
With respect to the microstructure noises, following (Kim, Wang, and Zou 2016),
we assume:

Assumption 4. Both $\{\varepsilon^x_i\}$ and $\{\varepsilon^y_i\}$ have the following structure,
$$\varepsilon_{i,t^i_j} = u_{i,t^i_j} + v_{i,t^i_j},$$
where, for some $\alpha \ge 0$ and for each $i\le d$, writing $\Delta t^i_l = t^i_l - t^i_{l-1}$,
$$u_{i,t^i_j} = \sum_{l=0}^{\infty}\rho_{i,t^i_{j-l}}\,\xi_{i,t^i_{j-l}}, \qquad v_{i,t^i_j} = \sum_{l=0}^{\infty} b_{i,t^i_{j-l}}\,\Delta^{-\alpha}_{t^i_{j-l}}\big[\tilde B_{i,t^i_{j-l}} - \tilde B_{i,t^i_{j-l-1}}\big].$$
We assume $\xi_{i,t^i_l}$ and $\xi_{j,t^j_{l'}}$ are random variables with mean 0, independent when
$l \ne l'$, but potentially dependent for $l = l'$. Also, $\rho_{i,t^i_l}$ is bounded in probability with
$\sum_{l=0}^{\infty}|\rho_{i,t^i_l}| < \infty$ uniformly in $i$, and $\xi_{i,t^i_l}$ is independent of the filtration $\{\mathcal{F}_t\}$ generated
by $X$ and $Z$. Moreover, $\tilde B_i$ is a Brownian motion independent of $\xi_i$ but potentially
correlated with $W$ and $B$ of Assumption 1, and $b_{i,t^i_l}$ is adapted to the filtration $\{\mathcal{F}_{t^i_l}\}$,
and bounded in probability with $\sum_{l=0}^{\infty}|b_{i,t^i_l}| < \infty$ uniformly in $i$.

The microstructure noise has two independent components: u and v, where u
allows for serial dependence, and v allows for correlation with returns. This assumption
is motivated by microstructure theory and existing empirical findings that order
flows tend to cluster in the same direction and that they tend to be correlated with
returns in the short run, see, e.g., (Hasbrouck 2007), (Brogaard, Hendershott, and
Riordan 2014). This assumption is therefore more realistic in particular for data
sampled at ultra-high frequencies. A similar assumption is also adopted by (Kalnina
and Linton 2008) and (Barndorff-Nielsen, Hansen, Lunde, and Shephard 2011).

3.2.5 Asynchronicity

Since the transactions arrive asynchronously, it is common to adopt the refresh time
sampling scheme proposed by (Martens 2004) prior to estimation. The first refresh time is defined as $t_1 = \max\{t^1_1, t^2_1, \ldots, t^d_1\}$. The subsequent refresh times are defined recursively as
$$t_{j+1} = \max\Big\{ t^{1}_{N^{1}_{t_j}+1}, \ldots, t^{d}_{N^{d}_{t_j}+1} \Big\},$$
where $N^i_{t_j}$ is the number of transactions for asset $i$ prior to time $t_j$. We denote by $n$
the resulting sample size after refresh time sampling.
Effectively, this sampling scheme selects the most recent transactions at refresh
times, which avoids adding zero returns artificially, because by design all assets have at
least one update between two refresh times. By comparison, the alternative previous-
tick subsampling scheme by (Zhang 2011) discards more data in order to avoid artificial
zero returns. That said, the refresh time scheme is notoriously influenced by the most
illiquid asset, which largely determines the number of observations after sampling.
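For concreteness, a small sketch of the refresh-time construction just described, assuming each asset's transaction times are supplied as a sorted array; the helper name and data layout are our own.

```python
import numpy as np

def refresh_times(times):
    """times: list of d sorted 1-D arrays of transaction times, one per asset.
    Returns the refresh times t_1, t_2, ..., each being the first instant by
    which every asset has recorded at least one new transaction."""
    d = len(times)
    ptr = [0] * d                      # index of the next unused transaction per asset
    out = []
    while all(ptr[i] < len(times[i]) for i in range(d)):
        t = max(times[i][ptr[i]] for i in range(d))   # latest "next" transaction time
        out.append(t)
        for i in range(d):             # advance each asset past t
            while ptr[i] < len(times[i]) and times[i][ptr[i]] <= t:
                ptr[i] += 1
    return np.array(out)
```

By construction every asset trades at least once between consecutive refresh times, so no artificial zero returns are introduced.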
Pair-wise refresh time, as suggested by (Aït-Sahalia, Fan, and Xiu 2010), is more efficient for entry-wise consistency of the covariance matrix. In this paper, since
our focus is the consistency under matrix-wise norms, we adopt the refresh time

sampling throughout. We also find empirically that the refresh time approach delivers
satisfactory performance. We make the following assumption on the observation times,
following (Kim, Wang, and Zou 2016):

Assumption 5. The observation times $t^i_j$, $1 \le j \le N^i_t$, $1 \le i \le d$, are independent of the price processes $X_t$ and $Z_t$, and of the noise $\varepsilon^x$ and $\varepsilon^y$. We assume the intervals between two adjacent observations are independent, and that there exist constants $\bar n$, $C$ and $\varsigma > 2$, such that
$$\max_{1 \le i \le d} \mathbb{E}\,\big|t^i_j - t^i_{j-1}\big|^{a} \le C\,\bar n^{-a}, \quad \text{for any } 1 \le a \le 2\varsigma,$$
and that $c_1 \bar n \le n \le c_2 \bar n$ holds with probability approaching 1, where $c_1$ and $c_2$ are some positive constants.

There is a large literature devoted to the modeling of durations, i.e., time intervals
between adjacent transactions, since the seminal paper by (Engle and Russell 1998),
which proposes an autoregressive conditional duration (ACD) model and shows that
this parametric model can successfully describe the evolution of time durations for
(heavily traded) stocks. Our focus here is the covariance matrix of returns, so we are
agnostic about the dynamics of durations. The independence between durations and
prices is a strong assumption yet commonly used in the literature, with the exception
of (Li, Mykland, Renault, Zhang, and Zheng 2014). This assumption means we can treat the times of trades, and therefore the refresh times, as fixed for the purpose of inference. To simplify the notation, we treat $n$ as if it were deterministic in what follows. Also, we use $Y^{\star}_{t_i}$ to denote the most recent observation prior to or at the $i$th refresh time, and relabel the associated noise as $\varepsilon^x_i$ and $\varepsilon^y_i$. Note that $Y^{\star}_{t_i} - \varepsilon^y_i$ is not necessarily equal to $Y_{t_i}$; instead, it equals the value of $Y$ at the actual transaction times. The same convention applies to $X^{\star}$ and $Z^{\star}$.

3.3 Three Estimators and their Asymptotic Properties

We now proceed to the estimators and their asymptotic properties. Our results rely on
the joint large sample (n Ñ 8) and increasing dimensionality (d Ñ 8) asymptotics,
with a fixed number of factors $r$ and a fixed time window $[0, t]$.
To deal with the bias due to the microstructure noise, we adopt the pre-averaging
method proposed by (Jacod, Li, Mykland, Podolskij, and Vetter 2009) to pre-weight
the returns. Specifically, we divide the whole sample into a sequence of blocks, with
the size of each block being $k_n$, which satisfies:
$$\frac{k_n}{n^{1/2+\delta}} = \theta + o\big(n_\delta^{-1/2}\big), \tag{3.4}$$
where $\theta$ and $\delta$ are positive constants, and $n_\delta^{-1/2} := n^{-1/4+\delta/2} + n^{-2\delta}$.
As shown in (Christensen, Kinnebrock, and Podolskij 2010) and (Kim, Wang, and Zou 2016), our choice of $k_n$ is not optimal ($\delta = 0$) for estimating the covariance matrix. However, the resulting estimator is simpler, robust to dependent and endogenous noise, and relies on fewer tuning parameters. More importantly, it guarantees a positive semi-definite covariance matrix estimate in any finite sample, a desirable property that our subsequent thresholding procedure relies on.
The returns in each block are weighted by a piecewise continuously differentiable function $g$ on $[0,1]$, with a piecewise Lipschitz derivative $g'$. Moreover, $g(0) = g(1) = 0$ and $\int_0^1 g^2(s)\,ds > 0$. A simple example of $g$ would be $g(s) = s \wedge (1-s)$. We define $\psi_1 = \phi_1(0)$, $\psi_2 = \phi_2(0)$, where
$$\phi_1(s) = \int_s^1 g'(u)\,g'(u-s)\,du, \qquad \phi_2(s) = \int_s^1 g(u)\,g(u-s)\,du.$$

For a sequence of vectors $V$, we define its weighted average return as $\bar V$, given by
$$\bar V_i = \sum_{j=1}^{k_n - 1} g\!\left(\frac{j}{k_n}\right) \Delta^n_{i+j} V, \qquad \text{for } i = 0, \ldots, n - k_n + 1,$$
where $\Delta^n_i V = V_i - V_{i-1}$, for $i = 1, 2, \ldots, n$, and $V_i \in \{X_{t_i}, Y_{t_i}, Z_{t_i}, X^{\star}_{t_i}, Y^{\star}_{t_i}, Z^{\star}_{t_i}, \varepsilon^x_i, \varepsilon^y_i\}$.
In what follows we propose three estimators corresponding to different scenarios
of the factor model. Each estimator uses the sample covariance matrix of these
pre-averaged returns as an input, which leads to robustness to the microstructure
noise. Intuitively, the effect of the noise is dominated by a strengthened return signal
of each block.
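A minimal sketch of the pre-averaging step with the example weight $g(s) = s \wedge (1-s)$, assuming the log prices have already been aligned on refresh times; the array layout and names are our own.

```python
import numpy as np

def preaverage(prices, kn):
    """prices: (n+1) x d matrix of log prices at refresh times.
    Returns the (n - kn + 2) x d matrix of pre-averaged returns."""
    w = np.minimum(np.arange(1, kn) / kn, 1.0 - np.arange(1, kn) / kn)  # g(j/kn)
    returns = np.diff(prices, axis=0)                                    # Delta_i^n
    n = returns.shape[0]
    return np.array([w @ returns[i:i + kn - 1] for i in range(n - kn + 2)])

def psi2(kn):
    # midpoint approximation of psi_2 = int_0^1 g(u)^2 du  (= 1/12 for s ^ (1-s))
    u = (np.arange(kn) + 0.5) / kn
    return np.mean(np.minimum(u, 1 - u) ** 2)
```

The sample covariance of these weighted returns, rescaled by $n/((n-k_n+2)\psi_2 k_n t)$, is the input shared by the three estimators below.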

3.3.1 Time-Series Regression (TSR)

When factors are known, we adopt a time-series regression-based approach using $\bar Y^{\star}$ and $\bar X^{\star}$. We stack the $d$- and $r$-dimensional processes $Y$ and $X$ into $U$, and their pre-averaged returns $\bar Y^{\star}$, $\bar X^{\star}$ into $\bar U^{\star}$, respectively:
$$U := (Y^{\top}, X^{\top})^{\top}, \qquad \bar U^{\star} := (\bar Y^{\star\top}, \bar X^{\star\top})^{\top},$$
where $U$ is a $(d+r)$-dimensional process and $\bar U^{\star}$ is a $(d+r) \times (n - k_n + 2)$ matrix.


The quadratic covariation of $U$ is given by
$$\Pi := \frac{1}{t}\int_0^t [\mathrm{d}U_s, \mathrm{d}U_s]\,\mathrm{d}s = \frac{1}{t}\int_0^t \begin{pmatrix} \beta e_s \beta^{\top} + g_s & \beta e_s \\ e_s \beta^{\top} & e_s \end{pmatrix} \mathrm{d}s =: \begin{pmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{pmatrix},$$
which can be estimated by the sample covariance matrix of $\bar U^{\star}$:
$$\widehat\Pi = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t} \sum_{i=0}^{n - k_n + 1} \bar U^{\star}_i \bar U^{\star\top}_i,$$
where $\bar U^{\star}_i$ is the $(i+1)$th column of $\bar U^{\star}$. We then construct estimators of each component of the covariance matrix as:
$$\widehat\beta = \widehat\Pi_{12}\big(\widehat\Pi_{22}\big)^{-1}, \qquad \widehat E = \widehat\Pi_{22}, \qquad \widehat\Gamma = \widehat\Pi_{11} - \widehat\Pi_{12}\big(\widehat\Pi_{22}\big)^{-1}\widehat\Pi_{21}.$$

We apply thresholding to the covariance matrix estimates and obtain:
$$\widehat\Gamma^{S} = \big(\widehat\Gamma^{S}_{ij}\big), \qquad \widehat\Gamma^{S}_{ij} = \begin{cases} \widehat\Gamma_{ij}, & i = j,\\ s_{\lambda_{ij}}\big(\widehat\Gamma_{ij}\big), & i \ne j. \end{cases} \tag{3.5}$$

A plug-in covariance matrix estimator is therefore given by:
$$\widehat\Sigma_{\mathrm{TSR}} = \widehat\beta\,\widehat E\,\widehat\beta^{\top} + \widehat\Gamma^{S}.$$
We postpone the choice of the thresholding method $s(\cdot)$ and the tuning parameters $\lambda_{ij}$ to Section 3.4, where we provide a procedure to guarantee the positive semi-definiteness of $\widehat\Sigma_{\mathrm{TSR}}$.
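Putting the pieces together, the TSR construction can be sketched as follows, assuming the pre-averaged stacked returns are supplied as an $(n-k_n+2)\times(d+r)$ array with the $d$ stock columns first, and `threshold` is any of the entrywise rules of Section 3.2.3; the helper names are ours.

```python
import numpy as np

def tsr_covariance(U_bar, d, n, kn, psi2, t, threshold):
    """U_bar: (n - kn + 2) x (d + r) pre-averaged returns (stocks first, factors last).
    threshold: callable mapping the d x d residual covariance to its sparsified version."""
    scale = n / ((n - kn + 2) * psi2 * kn * t)
    Pi = scale * U_bar.T @ U_bar                      # sample covariance Pi_hat
    Pi11, Pi12, Pi22 = Pi[:d, :d], Pi[:d, d:], Pi[d:, d:]
    Pi22_inv = np.linalg.inv(Pi22)
    beta_hat = Pi12 @ Pi22_inv                        # hat beta
    Gamma_hat = Pi11 - Pi12 @ Pi22_inv @ Pi12.T       # residual covariance
    Gamma_S = threshold(Gamma_hat)                    # thresholding, as in (3.5)
    return beta_hat @ Pi22 @ beta_hat.T + Gamma_S     # Sigma_hat_TSR
```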

We provide the convergence rates under several matrix norms for both the covariance matrix $\widehat\Sigma_{\mathrm{TSR}}$ and its inverse:

Theorem 3. Suppose that Assumptions 1 - 5 hold, and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$. Then we have
$$\big\|\widehat\Sigma_{\mathrm{TSR}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big),$$
$$\big\|\widehat\Sigma_{\mathrm{TSR}} - \Sigma\big\|_{\Sigma} = O_p\big(d^{1/2} n_\delta^{-1}\log d + (n_\delta^{-1}\log d)^{(1-q)/2} m_d\big),$$
$$\big\|\widehat\beta - \beta\big\| = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big).$$
Moreover, if $\big(n_\delta^{-1}\log d\big)^{(1-q)/2} m_d = o(1)$, we have
$$\Big\|\big(\widehat\Sigma_{\mathrm{TSR}}\big)^{-1} - \Sigma^{-1}\Big\| = O_p\big(\big(n_\delta^{-1}\log d\big)^{(1-q)/2} m_d\big),$$
and $\lambda_{\min}\big(\widehat\Sigma_{\mathrm{TSR}}\big) \ge \frac{1}{2}\lambda_{\min}(\Gamma)$, with probability approaching 1.

Theorem 3 establishes the convergence rate, which depends on the degree of sparsity $m_d$, $q$, the dimension $d$, and the local window size parameter $\delta$. Under the $\|\cdot\|_{\mathrm{MAX}}$ norm, covariance matrix estimators with and without a factor model deliver the same rate. This is because even in a factor model, there remain too many parameters to estimate in the low rank component, which determines the slow convergence rate under the $\|\cdot\|_{\mathrm{MAX}}$ norm.
In terms of the inverse, when $d > n$, estimating $\Sigma^{-1}$ becomes infeasible without a factor model, while the factor-based covariance matrix is invertible with high probability, and the inverse converges to the target under the operator norm. Since the $\|\cdot\|_{\Sigma}$ norm depends on both $\Sigma$ and $\Sigma^{-1}$, and the latter is more accurately estimated using a factor model, under the $\|\cdot\|_{\Sigma}$ norm a factor model gives a better rate than the rate without one, which would be $d^{1/2} n_\delta^{-1/2}\sqrt{\log d}$.
More importantly, the minimum eigenvalue of the resulting covariance matrix is
bounded away from 0 with high probability, so that the covariance matrix estimate
is well-conditioned. This property is essential to warrant an economically feasible
optimal portfolio using the estimated covariance matrix as the input.

3.3.2 Cross-Sectional Regression (CSR)

In case we observe the factor loading matrix β, we propose a cross-sectional regression


approach which recovers X at first. We start with a scenario in which the data are
synchronous and noise-free, since the asymptotic property of such an estimator is
not available in the literature to the best of our knowledge. The estimator can be constructed as:
$$\widehat\Sigma^{*}_{\mathrm{CSR}} = \beta\,\check E^{*}\beta^{\top} + \check\Gamma^{*S},$$

where
$$\check X = (\beta^{\top}\beta)^{-1}\beta^{\top} Y, \qquad \check E^{*} = \frac{1}{t}\,\check X \check X^{\top}, \qquad \check\Gamma^{*} = \frac{1}{t}\big(Y - \beta\check X\big)\big(Y - \beta\check X\big)^{\top},$$
and
$$\check\Gamma^{*S} = \big(\check\Gamma^{*S}_{ij}\big), \qquad \check\Gamma^{*S}_{ij} = \begin{cases} \check\Gamma^{*}_{ij}, & i = j,\\ s_{\lambda_{ij}}\big(\check\Gamma^{*}_{ij}\big), & i \ne j. \end{cases}$$

This estimator is similar in spirit to the covariance matrix estimator provided by the
MSCI Barra, see (Kahn, Brougham, and Green 1998). We analyze its properties as
follows:
Theorem 4. Suppose that Assumptions 1 - 3 and 5 hold, and $n^{-1/2}\sqrt{\log d} = o(1)$. Then we have
$$\big\|\widehat\Sigma^{*}_{\mathrm{CSR}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big),$$
$$\big\|\widehat\Sigma^{*}_{\mathrm{CSR}} - \Sigma\big\|_{\Sigma} = O_p\Big(\big[n^{-1/2} d^{1/4}\sqrt{\log d} + d^{-1/4} m_d^{1/2}\big]^{2} + \big[n^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big]^{1-q} m_d\Big),$$
$$\big\|\check X - X\big\| = O_p\big(d^{-1/2} m_d^{1/2}\big).$$
Moreover, if $\big[n^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big]^{1-q} m_d = o(1)$, we have
$$\Big\|\big(\widehat\Sigma^{*}_{\mathrm{CSR}}\big)^{-1} - \Sigma^{-1}\Big\| = O_p\Big(\big[n^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big]^{1-q} m_d\Big),$$
and $\lambda_{\min}\big(\widehat\Sigma^{*}_{\mathrm{CSR}}\big) \ge \frac{1}{2}\lambda_{\min}(\Gamma)$, with probability approaching 1.

Theorem 4 shows that the CSR estimator does not converge under $\|\cdot\|_{\mathrm{MAX}}$ when $d$ is fixed, due to the second term $d^{-1/2} m_d^{1/2}$, unlike the TSR estimator. This is not surprising, as the cross-sectional regression exploits an increasing dimensionality to estimate $X$. The first term $n^{-1/2}\sqrt{\log d}$ is the same as that in the convergence rate of TSR in the absence of noise (see (Fan, Furger, and Xiu 2016)), because both approaches estimate $\Gamma$ based on a thresholded sample covariance estimator. It is also interesting to compare this convergence rate with that of the PCA estimator given by (Aït-Sahalia and Xiu 2017), which is $n^{-1/2}\sqrt{\log d} + d^{-1/2} m_d$. The rate improvement in the CSR estimator comes from the second term and is due to the knowledge of $\beta$. Overall, the convergence rate of CSR depends on a striking trade-off between $n$ and $d$.
In the general scenario where the noise plagues the data, we construct a pre-averaging based covariance matrix estimator:
$$\widehat\Sigma_{\mathrm{CSR}} = \beta\,\check E\,\beta^{\top} + \check\Gamma^{S},$$
where
$$\check X^{\star} = (\beta^{\top}\beta)^{-1}\beta^{\top}\bar Y^{\star}, \qquad \check E = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t}\,\check X^{\star}\check X^{\star\top},$$
$$\check\Gamma = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t}\,\big(\bar Y^{\star} - \beta\check X^{\star}\big)\big(\bar Y^{\star} - \beta\check X^{\star}\big)^{\top},$$
and
$$\check\Gamma^{S} = \big(\check\Gamma^{S}_{ij}\big), \qquad \check\Gamma^{S}_{ij} = \begin{cases} \check\Gamma_{ij}, & i = j,\\ s_{\lambda_{ij}}\big(\check\Gamma_{ij}\big), & i \ne j. \end{cases}$$

The next theorem gives the convergence rates:

Theorem 5. Suppose that Assumptions 1 - 5 hold, and $n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2} = o(1)$. Then we have
$$\big\|\widehat\Sigma_{\mathrm{CSR}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big),$$
$$\big\|\widehat\Sigma_{\mathrm{CSR}} - \Sigma\big\|_{\Sigma} = O_p\Big(\big[n_\delta^{-1/2} d^{1/4}\sqrt{\log d} + d^{-1/4} m_d^{1/2}\big]^{2} + \big[n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big]^{1-q} m_d\Big),$$
$$\big\|\check X^{\star} - \bar X^{\star}\big\| = O_p\big(d^{-1/2} m_d^{1/2}\big).$$
Moreover, if $\big[n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big]^{1-q} m_d = o(1)$, we have
$$\Big\|\big(\widehat\Sigma_{\mathrm{CSR}}\big)^{-1} - \Sigma^{-1}\Big\| = O_p\Big(\big[n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d^{1/2}\big]^{1-q} m_d\Big),$$
and $\lambda_{\min}\big(\widehat\Sigma_{\mathrm{CSR}}\big) \ge \frac{1}{2}\lambda_{\min}(\Gamma)$, with probability approaching 1.

Compared to the results of Theorem 4, the rates in Theorem 5 remain the same
except that n is replaced by nδ , which is the effective sample size when noise is present.
The same intuition as in the no-noise case holds as well.
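Under the same conventions as the TSR sketch above, the noisy-data CSR estimator can be sketched as follows, with `Y_bar` the $(n-k_n+2)\times d$ matrix of pre-averaged stock returns and `beta` the known loading matrix; the names are ours.

```python
import numpy as np

def csr_covariance(Y_bar, beta, n, kn, psi2, t, threshold):
    """Y_bar: (n - kn + 2) x d pre-averaged stock returns; beta: d x r known loadings."""
    scale = n / ((n - kn + 2) * psi2 * kn * t)
    Yb = Y_bar.T                                              # d x (n - kn + 2)
    X_check = np.linalg.solve(beta.T @ beta, beta.T @ Yb)     # recovered factors, r x (n-kn+2)
    E_check = scale * X_check @ X_check.T                     # factor covariance
    resid = Yb - beta @ X_check
    Gamma_check = scale * resid @ resid.T                     # residual covariance
    return beta @ E_check @ beta.T + threshold(Gamma_check)   # Sigma_hat_CSR
```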

3.3.3 Principal Component Analysis (PCA)

Without prior knowledge of factors or their loadings, we apply PCA to the pre-averaged covariance matrix estimate based on $\bar Y^{\star}$:
$$\widetilde\Sigma = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t} \sum_{i=0}^{n - k_n + 1} \bar Y^{\star}_i \bar Y^{\star\top}_i.$$
Suppose that $\widehat\lambda_1 > \widehat\lambda_2 > \ldots > \widehat\lambda_d$ are the simple eigenvalues of $\widetilde\Sigma$, and that $\widehat\xi_1, \widehat\xi_2, \ldots, \widehat\xi_d$ are the corresponding eigenvectors. Then $\widetilde\Sigma$ can be decomposed as:
$$\widetilde\Sigma = \sum_{j=1}^{\widehat r} \widehat\lambda_j \widehat\xi_j \widehat\xi_j^{\top} + \widetilde\Gamma, \tag{3.6}$$
where $\widehat r$ is an estimator of $r$ to be introduced below. Similar to TSR, we apply thresholding on $\widetilde\Gamma$ and obtain:
$$\widetilde\Gamma^{S} = \big(\widetilde\Gamma^{S}_{ij}\big), \qquad \widetilde\Gamma^{S}_{ij} = \begin{cases} \widetilde\Gamma_{ij}, & i = j,\\ s_{\lambda_{ij}}\big(\widetilde\Gamma_{ij}\big), & i \ne j, \end{cases}$$
and the resulting estimator of $\Sigma$ is:
$$\widehat\Sigma_{\mathrm{PCA}} = \sum_{j=1}^{\widehat r} \widehat\lambda_j \widehat\xi_j \widehat\xi_j^{\top} + \widetilde\Gamma^{S}. \tag{3.7}$$

This estimator is motivated by the POET strategy of (Fan, Liao, and Mincheva 2013) for low-frequency data, and adapted from the PCA approach of (Aït-Sahalia and Xiu 2017) for high-frequency data. The idea behind this strategy is that the
eigenvectors corresponding to the first r eigenvalues of the sample covariance matrix
can be used to construct proxies of β, so that the low rank component of Σ can be
approximated by the first term on the right hand side of (3.6). The second term
then approximates the sparse component of Σ, which leads to the construction of the
estimator given by (3.7).
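A minimal sketch of the construction in (3.6)–(3.7), assuming the pre-averaged sample covariance matrix and the number of factors have already been computed; the names are ours.

```python
import numpy as np

def pca_covariance(Sigma_tilde, r_hat, threshold):
    # eigh returns eigenvalues in ascending order; sort descending
    lam, xi = np.linalg.eigh(Sigma_tilde)
    lam, xi = lam[::-1], xi[:, ::-1]
    low_rank = (xi[:, :r_hat] * lam[:r_hat]) @ xi[:, :r_hat].T   # sum_j lam_j xi_j xi_j'
    Gamma_tilde = Sigma_tilde - low_rank                          # residual component
    return low_rank + threshold(Gamma_tilde)                      # eq. (3.7)
```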
This estimator can also be constructed from a least-squares point of view, which seeks $F$ and $G$ such that
$$(F, G) = \arg\min_{F \in \mathcal{M}_{d\times\widehat r},\; G \in \mathcal{M}_{\widehat r\times n}} \big\|\bar Y^{\star} - FG\big\|_{F}^{2},$$
subject to the normalization
$$d^{-1} F^{\top}F = I_{\widehat r}, \qquad GG^{\top} \text{ is an } \widehat r \times \widehat r \text{ diagonal matrix.}$$
(Bai and Ng 2002) and (Bai 2003) propose this estimator to estimate factors and their loadings in a factor model for low-frequency data.
loadings in a factor model for low-frequency data.
Once we have estimates of factors and loadings, we can obtain the same $\widetilde\Gamma$ and $\widetilde\Gamma^{S}$ as above by:
$$\widetilde\Gamma = \frac{n}{n - k_n + 2}\,\frac{1}{\psi_2 k_n t}\,\big(\bar Y^{\star} - FG\big)\big(\bar Y^{\star} - FG\big)^{\top}, \qquad \widetilde\Gamma^{S} = \big(\widetilde\Gamma^{S}_{ij}\big), \quad \widetilde\Gamma^{S}_{ij} = \begin{cases} \widetilde\Gamma_{ij}, & i = j,\\ s_{\lambda_{ij}}\big(\widetilde\Gamma_{ij}\big), & i \ne j, \end{cases}$$
with which we can construct
$$\widehat\Sigma_{\mathrm{PCA}} = t^{-1} F G G^{\top} F^{\top} + \widetilde\Gamma^{S}. \tag{3.8}$$

While (3.7) is easier to implement, this equivalent form of Σ


p PCA is useful in the proof

and provides estimates of factors and their loadings (up to some rotation).
To determine the number of factors $r$, we propose the following estimator using a penalty function:
$$\widehat r = \arg\min_{1 \le j \le r_{\max}} \Big( d^{-1}\lambda_j\big(\widetilde\Sigma\big) + j \times f(n,d) \Big) - 1,$$
where $r_{\max}$ is some upper bound. This estimator is similar to that of (Aït-Sahalia and Xiu 2017), which shares the spirit of (Bai and Ng 2002). The penalty function $f(n,d)$ satisfies two criteria. On the one hand, the penalty is dominated by the signal, i.e., the value of $d^{-1}\lambda_j(\Sigma)$, for $1 \le j \le r$. Because $d^{-1}\lambda_r(\Sigma)$ is $O_p(1)$ as $d$ increases, we select a penalty that shrinks to 0. On the other hand, we require the penalty to dominate the estimation error as well as $d^{-1}\lambda_j(\Sigma)$ for $r+1 \le j \le d$ to avoid overshooting. The choice of $r_{\max}$ does not play any role in theory, yet it warrants an economically meaningful estimate of $\widehat r$ in a finite sample or in practice.
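The selection rule can be sketched as follows, with the eigenvalues of the pre-averaged covariance matrix sorted in descending order and `penalty` standing for the value of $f(n,d)$; the helper name is ours.

```python
import numpy as np

def select_num_factors(eigvals, d, penalty, r_max=20):
    """eigvals: eigenvalues of Sigma_tilde, descending.
    Returns argmin_{1<=j<=r_max} { d^{-1} lambda_j + j * penalty } - 1."""
    j = np.arange(1, r_max + 1)
    objective = eigvals[:r_max] / d + j * penalty
    return int(j[np.argmin(objective)]) - 1
```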

Theorem 6. Suppose that Assumptions 1 - 5 hold. Also, $n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d = o(1)$, $f(n,d) \to 0$, and $f(n,d)\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1} m_d\big)^{-1} \to \infty$. Then we have
$$\big\|\widehat\Sigma_{\mathrm{PCA}} - \Sigma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d\big),$$
$$\big\|\widehat\Sigma_{\mathrm{PCA}} - \Sigma\big\|_{\Sigma} = O_p\Big(d^{1/2} n_\delta^{-1}\log d + d^{-1/2} m_d^{2} + m_d\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d\big)^{1-q}\Big).$$
Also, there exists an $r \times r$ matrix $H$, such that with probability approaching 1, $H$ is invertible, $\|HH^{\top} - I_r\| = o_p(1)$, and that
$$\big\|F - \beta H\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d\big),$$
$$\big\|G - H^{-1}X\big\| = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d\big).$$
If in addition $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, then $\lambda_{\min}\big(\widehat\Sigma_{\mathrm{PCA}}\big)$ is bounded away from 0 with probability approaching 1, and
$$\big\|\widehat\Sigma_{\mathrm{PCA}}^{-1} - \Sigma^{-1}\big\| = O_p\Big(m_d^{3}\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2} m_d\big)^{1-q}\Big).$$

Due to the fundamental indeterminacy of a factor model, we only identify the


latent factors and their loadings up to some invertible matrix H. That said, the
covariance matrix estimator itself is invariant to H.
So far we have obtained the convergence rates of all scenarios of factor models for the covariance matrix and its inverse, under a variety of norms. By comparison, we observe that the convergence rates of the PCA estimator are the slowest among the three. This is the usual tradeoff between efficiency and robustness: PCA requires the least amount of information and is therefore more robust to model misspecification errors.

3.4 Practical Considerations

3.4.1 Choice of δ and kn

As discussed in (Christensen, Kinnebrock, and Podolskij 2010), the pre-averaging


estimator is consistent in the low dimensional case if $0.1 < \delta < 0.5$. We fix $\delta$ at 0.2.
The pre-averaging estimator involves a tuning parameter kn or equivalently θ, once δ

is given. With a large number of observations, the estimates are not sensitive to the
choice of kn even when d is moderately large. In simulations and empirical studies, we
adopt a range of θs, and thus kn s, all of which lead to similar estimates with identical
interpretations.

3.4.2 Choice of r and f pn, dq

A sensible choice of the penalty function could be
$$f(n,d) = \mu\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1} m_d\big)^{\kappa}\cdot \mathrm{median}\big(\{\widehat\lambda_1, \ldots, \widehat\lambda_d\}\big), \tag{3.9}$$

for some tuning parameters $\mu$ and $\kappa$. One might also use the perturbed eigenvalue ratio estimator in (Pelger 2015a) to determine $r$, which gives almost the same result in simulations. The latter requires one fewer tuning parameter, but its proof is more involved. Alternatively, as argued by (Aït-Sahalia and Xiu 2017), we can simply regard $r$ as a tuning parameter from a practical point of view. In fact, the performance of the estimator is not sensitive to $\widehat r$, as long as $\widehat r$ is greater than or equal to $r$ yet not too large, as shown by our simulation results. We conjecture that the same convergence rate holds for our estimators as long as $\widehat r \ge r$ and $\widehat r$ is finite, which is indeed the case for parameter estimation in interactive effect models, see, e.g., (Moon and Weidner 2015). The proof likely involves random matrix theory that is not available for continuous martingales. We leave this for future work. In our empirical studies, we find that as soon as $r$ is greater than 3 but not as large as 20, the comparison results remain the same qualitatively and the interpretations are identical.

3.4.3 Choice of the Thresholding Methods

We compare two types of thresholding methods on the residual covariance matrix, e.g., $\widehat\Gamma$ constructed in TSR.³ The same applies to $\check\Gamma$ (CSR) and $\widetilde\Gamma$ (PCA).

The first one is the location-based thresholding utilizing domain knowledge (denoted
as location thresholding), as in (Fan, Furger, and Xiu 2016). This approach preserves
positive semi-definiteness in a finite sample and is computationally efficient, as neither
tuning nor optimization is involved. Specifically, we first sort the residual covariance
matrix $\widehat\Gamma$ into blocks by assets' industrial classifications (sector, industry group, industry, or sub-industry), then apply a block-diagonal mask to this residual covariance matrix. The thresholding function can be written explicitly as:
$$s^{\mathrm{loc}}_{\lambda_{ij}}(z) = z\,1(\lambda_{ij} = 1), \qquad \text{where } \lambda_{ij} = 1 \text{ if and only if } (i,j) \text{ belong to the same block.}$$

For each block, we have a positive semi-definite sub-matrix because $\widehat\Gamma$ is positive semi-definite by construction, so that stacking these blocks on the diagonal produces a positive semi-definite $\widehat\Gamma^{S}$.
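A sketch of the location-based rule, assuming each asset carries a block label derived from its GICS code; within-block entries and the diagonal are kept, everything else is set to zero. The helper name is ours.

```python
import numpy as np

def location_threshold(Gamma, labels):
    """Gamma: d x d residual covariance; labels: length-d array of block ids
    (e.g., GICS industry groups). Keep (i, j) iff the assets share a block."""
    labels = np.asarray(labels)
    mask = labels[:, None] == labels[None, :]      # block-diagonal mask
    return np.where(mask, Gamma, 0.0)
```

Because each retained block is a principal submatrix of the positive semi-definite $\widehat\Gamma$, the masked matrix is again positive semi-definite.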
The second class of methods we consider employs a threshold based on the sample correlation matrix. Specifically, we set
$$\lambda_{ij} = \tau\sqrt{\widehat\Gamma_{ii}\widehat\Gamma_{jj}},$$
where $\tau$ is some constant to be determined. With this threshold, we then apply the Hard, Soft, SCAD, and AL thresholding methods, with the $s_{\lambda_{ij}}(\cdot)$ given in Section 3.2.3, respectively, in the construction of (3.5). These methods do not always guarantee a positive semi-definite $\widehat\Gamma^{S}$ in a finite sample. Also, when $\tau$ is small, $\widehat\Gamma^{S}$ might not be sufficiently sparse.
3
Alternatively, one can adopt the adaptive thresholding method by (Cai and Liu 2011). In our
simulations, this approach does not perform as well as the location-based and correlation-based
thresholding methods we consider.
To fix these issues, we find an appropriate $\tau$ via a grid search algorithm, following (Fan, Liao, and Mincheva 2013). We start from a small value of $\tau$, and gradually increase it until a positive semi-definite $\widehat\Gamma^{S}$ is obtained and the degree of sparsity is below a certain threshold. As $\tau$ increases, the degree of sparsity decreases, and the solution shrinks towards the diagonal of $\widehat\Gamma$, which is positive semi-definite. Thus our

grid search is guaranteed to produce a solution. In other words, this algorithm yields
a positive semi-definite estimate in a finite sample. Notice that the grid search for τ is
easier here compared to that of (Fan, Liao, and Mincheva 2013), because τ is bounded
between 0 and 1. In practice, a natural choice of the desired degree of sparsity can be
obtained using that of the location based thresholding, which is computationally less
expensive compared to the cross-validation method.
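The grid search can be sketched as follows; `rule` is any of the correlation-based thresholding functions sketched in Section 3.2.3, and `target_sparsity` (for instance, the sparsity produced by location thresholding) is a hypothetical argument of our own.

```python
import numpy as np

def correlation_threshold(Gamma, tau, rule):
    d = Gamma.shape[0]
    lam = tau * np.sqrt(np.outer(np.diag(Gamma), np.diag(Gamma)))  # lambda_ij
    out = rule(Gamma, lam)                      # entrywise thresholding
    out[np.diag_indices(d)] = np.diag(Gamma)    # keep the diagonal untouched
    return out

def search_tau(Gamma, rule, target_sparsity, grid=np.linspace(0.0, 1.0, 101)):
    d = Gamma.shape[0]
    for tau in grid:                            # increase tau until PSD and sparse enough
        out = correlation_threshold(Gamma, tau, rule)
        sparsity = np.mean(out[~np.eye(d, dtype=bool)] != 0)
        if np.linalg.eigvalsh(out).min() >= -1e-10 and sparsity <= target_sparsity:
            return tau, out
    return 1.0, correlation_threshold(Gamma, 1.0, rule)   # diagonal fallback, always PSD
```

Since $\tau$ is bounded between 0 and 1, the search terminates, in the worst case with a diagonal (hence positive semi-definite) estimate.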

3.5 Monte Carlo Simulations

In this section, we investigate the finite sample performance of the pre-averaging


estimator and compare it with the subsampling method discussed in (Fan, Furger,
and Xiu 2016) and (Aït-Sahalia and Xiu 2017). The latter estimators are built upon
the realized covariance estimators using subsampled returns. We simulate 1,000 paths
from a continuous-time r-factor model of d assets specified as:

$$\mathrm{d}Y_{i,t} = \sum_{j=1}^{r} \beta_{i,j}\,\mathrm{d}X_{j,t} + \mathrm{d}Z_{i,t}, \qquad \mathrm{d}X_{j,t} = b_j\,\mathrm{d}t + \sigma_{j,t}\,\mathrm{d}W_{j,t}, \qquad \mathrm{d}Z_{i,t} = \gamma_i^{\top}\mathrm{d}B_{i,t},$$

where Wj is a standard Brownian motion and Bi is a d-dimensional Brownian motion,


for i “ 1, 2, . . . , d, and j “ 1, 2, . . . , r. They are mutually independent. Xj is the jth
factor, and we set X1 as the market factor, with the associated loadings being positive.
The covariance matrix of $Z$, $\Gamma$, is a block-diagonal matrix with $\Gamma_{i,l} = \gamma_i^{\top}\gamma_l$. We also allow for a time-varying $\sigma^2_{j,t}$, which evolves as
$$\mathrm{d}\sigma^2_{j,t} = \kappa_j\big(\theta_j - \sigma^2_{j,t}\big)\,\mathrm{d}t + \eta_j\,\sigma_{j,t}\,\mathrm{d}\widetilde B_{j,t}, \qquad j = 1, 2, \ldots, r,$$
where $\widetilde B_j$ is a standard Brownian motion with $\mathrm{E}[\mathrm{d}W_{j,t}\,\mathrm{d}\widetilde B_{j,t}] = \rho_j\,\mathrm{d}t$. We choose $d = 500$ and $r = 3$. We fix $t$ at 21 trading days, i.e., $t = 1/12$. In addition, $\kappa = (3, 4, 5)$, $\theta = (0.09, 0.04, 0.06)$, $\eta = (0.3, 0.4, 0.3)$, $\rho = (-0.6, -0.4, -0.25)$, and $b = (0.05, 0.03, 0.02)$. As to the factor loadings, we sample $\beta_1 \sim U[0.25, 1.75]$ and $\beta_2, \beta_3 \sim N(0, 0.5^2)$. The diagonal elements of $\Gamma$ are sampled independently from $U[0.1, 0.2]$, with constant within-block correlations sampled from $U[0.10, 0.40]$ for each block. To generate blocks with random sizes, we fix the largest block size at 35, and randomly generate the sizes of the blocks from a uniform distribution between 10 and 35, such that the total size of all blocks is $d$. The $\beta$s and block sizes are randomly generated but fixed across Monte Carlo repetitions.
We simulate the noise in log prices for each asset as an MA(1) process, i.e., $\varepsilon_{i,t^i_j} = \xi_{i,t^i_j} - 0.5\,\xi_{i,t^i_{j-1}}$, where $\xi_{i,t^i_j}$ is an i.i.d. normal noise with mean 0 and variance $0.001^2$. To mimic the asynchronicity, we censor the data using Poisson sampling, where the number of observations for each stock is sampled from a truncated log-normal distribution. The log-normal distribution $\log N(\mu, \sigma^2)$ has parameters $\mu = 2{,}500$ and $\sigma = 0.8$, and the lower and upper truncation boundaries are 1,000 and 23,400, respectively, which matches the empirical data on S&P 500 index constituents.
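A condensed sketch of one simulated path under the specification above, with the constants taken from the text; for brevity it uses a fixed Euler grid, a diagonal $\Gamma$ (no within-block correlation), and omits the Poisson censoring, so it illustrates the dynamics rather than reproducing the full design.

```python
import numpy as np

def simulate_path(d=500, r=3, t=1/12, steps=21 * 390, seed=0):
    rng = np.random.default_rng(seed)
    dt = t / steps
    kappa, theta = np.array([3., 4., 5.]), np.array([.09, .04, .06])
    eta, rho, b = np.array([.3, .4, .3]), np.array([-.6, -.4, -.25]), np.array([.05, .03, .02])
    beta = np.column_stack([rng.uniform(0.25, 1.75, d), rng.normal(0, 0.5, (d, 2))])
    gamma = rng.uniform(0.1, 0.2, d)                    # idiosyncratic variances (diag of Gamma)
    sig2 = theta.copy()                                  # start volatility at its long-run mean
    X, Z = np.zeros((steps + 1, r)), np.zeros((steps + 1, d))
    for i in range(steps):
        dW = rng.normal(0, np.sqrt(dt), r)
        dB_tilde = rho * dW + np.sqrt(1 - rho ** 2) * rng.normal(0, np.sqrt(dt), r)
        X[i + 1] = X[i] + b * dt + np.sqrt(sig2) * dW
        sig2 = np.maximum(sig2 + kappa * (theta - sig2) * dt + eta * np.sqrt(sig2) * dB_tilde, 1e-8)
        Z[i + 1] = Z[i] + np.sqrt(gamma) * rng.normal(0, np.sqrt(dt), d)
    Y = X @ beta.T + Z                                   # latent log prices
    xi = rng.normal(0, 0.001, (steps + 1, d))
    Y_obs = Y + xi - 0.5 * np.vstack([np.zeros(d), xi[:-1]])   # MA(1) microstructure noise
    return Y_obs, beta
```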
For pre-averaging estimators, we compare a range of local window sizes by varying
θ in (3.4). For subsampling estimators, we compare a range of subsampling frequencies from every 5 minutes to every 65 minutes, denoted as $\Delta_n$ in seconds. We also add the
benchmark estimates using noiseless returns sampled at 5-minute frequency without
asynchronicity for comparison. In addition, we consider a “mixed” approach by
applying the pre-averaging method to the subsampled data, in order to check the

marginal effect of the refresh-time versus subsampling. Since the simulated Γ is
a block-diagonal matrix, we regard the location-based thresholding method as the
benchmark, and compare its performance with other thresholding methods in Section
3.4.3. We present the simulation results in Tables 3.1 and 3.2.
First, we find that the pre-averaging estimators achieve smaller errors compared
to subsampling estimators, and are closer to the noiseless benchmark, for almost
all model specifications and almost all criteria. The pre-averaging method achieves
a slightly better estimate of the low rank part $\beta E\beta^{\top}$, and a better residual part $\Gamma$, especially under the operator norm for $\Gamma^{-1}$, which in turn gives a better estimate of the precision matrix $\Sigma^{-1}$. Throughout, the pre-averaging estimators are robust to a wide range of tuning parameters $\theta$. The sweet spot appears to be $\theta = 0.03$, while the optimal frequency for the subsampling method seems to be achieved near $\Delta_n = 900$.
Secondly, for comparison across various thresholding methods, we find in almost
all scenarios that the Location thresholding achieves the best result, followed by Soft,
AL, and SCAD thresholding, whereas Hard thresholding appears to be the worst.
The differences are smaller in terms of $\|\widehat\Sigma - \Sigma\|_{\mathrm{MAX}}$ and $\|\widehat\Gamma - \Gamma\|_{\mathrm{MAX}}$, because the largest entry-wise errors are either achieved along the diagonal or on entries off the diagonal with large magnitudes, which are least affected by the various thresholding methods. Nevertheless, the differences are most salient in terms of $\|\widehat\Sigma^{-1} - \Sigma^{-1}\|$ and $\|\widehat\Gamma^{-1} - \Gamma^{-1}\|$. The former is arguably the most important metric in this table, since it dictates the performance of the portfolio allocation in the empirical exercise.
Thirdly, the comparison between the “mixed” approach vs subsampling depends on
a bias-variance tradeoff. If the bias due to noise is large, mixed approach outperforms;
if the noise effect becomes negligible for a sufficiently low sampling frequency, the
subsampling approach can outperform because its convergence rate is faster. By
contrast, the “mixed” approach is always dominated by the pre-averaging based on the refresh time scheme, because the latter preserves more data and the two estimators are equally efficient given the same amount of data.
Finally, to demonstrate the impact of the selected number of factors on the PCA, we present in Table 3.3 the errors with a range of $\widehat r$ pre-set instead of being estimated. For reasons of space, we only report the results with $\theta = 0.03$ for the pre-averaging, $\Delta_n = 900$ for the subsampling method, and the benchmark no-noise and synchronous case with $\Delta^{*}_n = 300$. Location thresholding is used in all cases. For the estimation of $\Sigma$ and $\Sigma^{-1}$, we find that when $\widehat r < r$, the performance is much worse in every metric than in the case with $\widehat r = r$, whereas in the case of $\widehat r > r$, the performance is only slightly worse, in particular when $\widehat r$ is within a reasonable range (smaller than 20). For the purpose of covariance matrix estimation, this result justifies treating $r$ as a tuning parameter without estimating it. With respect to estimating the low rank and sparse components of $\Sigma$, using an incorrect number of factors is harmful.

3.6 Empirical Applications

3.6.1 Data

We collect from the TAQ database intraday transaction prices of the constituents of
Dow Jones 30 index, S&P 100 index, and S&P 500 index, respectively, from January
2004 to December 2013. There are in total 42, 152, and 735 stocks for each index,
respectively, during this sampling period.
We select stocks that are members of these indices on the last day of each month,
and exclude those among them that have no trading activities on one or more trading
days of this month, as well as the bottom 5% stocks in terms of the number of
observations for liquidity concerns. To clean the data, we adopt the procedure detailed in (Da and Xiu 2017), which relies only on the condition codes from the exchanges and the range of NBBO quotes to identify outliers. Overnight returns are excluded

to avoid dividend issuances and stock splits. Days with half trading hours are also
excluded. We do not, however, remove jumps from these intraday returns as they
do not seem to matter for our purpose. We sample the stocks using the refresh time
approach, as well as the previous tick method at a 15-minute frequency. We select 15 minutes because the Hausman tests proposed in (Aït-Sahalia and Xiu 2016) suggest it is a safe frequency at which to use realized covariance estimators for this pool of stocks.
Figure 3.1 plots the daily sample sizes after refresh time sampling for S&P 500,
S&P 100, and Dow Jones 30 index constituents, respectively. In addition, it presents
the quartiles of the daily sample sizes for S&P 500 index constituents by different colors
of shading. As we increase the number of assets, the number of daily refresh-time observations decreases rapidly. Still, we are able to obtain on average 284 observations per day
for S&P 500, which is approximately equivalent to sampling every 90 seconds. The
average number of observations for S&P 100 and Dow Jones 30 constituents are 905
and 2,105, respectively.
We collect the Global Industrial Classification Standard (GICS) codes from the
Compustat database for the Location thresholding method. There are 8 digits in the
codes. Digits 1-2 indicate the company’s sector; digits 3-4 describe the company’s
industry group; digits 5-6 describe the industry; digits 7-8 describe the sub-industry.
Throughout 120 months and among the assets we consider, the time series median of
the largest block size is 80 for sector-based classification, 39 for industry group, 27 for
industry, and 15 for sub-industry categories, for S&P 500 index constituents.
We construct observable factors from high-frequency transaction prices at a 15-
minute frequency. The factors include the market portfolio, the small-minus-big
(SMB) portfolio, the high-minus-low (HML) portfolio, the robust-minus-weak (RMW)
portfolio, and the conservative-minus-aggressive (CMA) portfolio in the Fama-French
5 factor model. We also include the momentum (MOM) portfolio, formed by sorting stocks on their returns between the past 250 days and the past 21 days. In addition, we collect the 9 industry

SPDR ETFs from the TAQ database (Energy (XLE), Materials (XLB), Industrials
(XLI), Consumer Discretionary (XLY), Consumer Staples (XLP), Health Care (XLV),
Financial (XLF), Information Technology (XLK), and Utilities (XLU)). The time
series of cumulative returns of all factors are plotted in Figure 3.2.
We obtain monthly factor loadings (exposures) from the MSCI Barra USE3 by
(Kahn, Brougham, and Green 1998). The loadings we utilize include Earnings Yield,
Momentum, Trading Activity, Currency Sensitivity, Earnings Variation, Growth,
Volatility, Dividend Yield, Size, Size Nonlinearity, Leverage, and Value. In addition,
we construct and add the market exposure for each stock, using the slope coefficient
in a time-series regression of its weighted daily returns on the weighted S&P 500
index returns over the trailing 252 trading days. The weights are chosen to have a
half life of 63 days, so as to match the method documented by USE3. In total, we
have 14 observable loadings including the intercept term. We normalize the factor
exposures such that their cross-sectional means are 0 and variances are 1 for each month.
Although the covariance matrix estimation is invariant under such transformations,
the estimated factors now have similar scales. In case of missing exposures, we use
their latest available values, or set them to 0 if they are missing throughout the entire
sample period for certain stocks.4 The cumulative returns of the estimated factors
based on S&P 500 constituents are shown in Figure 3.3.
We plot the cumulative returns of the leading principal components of S&P 500 constituents estimated by our PCA method in Figure 3.4. It is difficult to recognize a one-to-one correspondence
among the factors in Figures 3.2, 3.3, and 3.4, because the list of characteristics
available from the MSCI Barra does not match the observed factors we obtain. Instead, we plot their generalized correlations using 15-minute returns in Figure 3.5, which
measure how correlated two sets of factors are as suggested by (Bai and Ng 2006) and
recently employed in (Pelger 2015b). Indeed, there exists strong coherence among the
4
The empirical findings remain the same if we exclude stocks whose loadings are missing from the
USE3 dataset.

observed and inferred factors using different approaches, in particular among the PCA
and the CSR factors.

3.6.2 Out-of-Sample Portfolio Allocation

We then examine the performance of the covariance matrix estimators in a constrained minimum variance portfolio allocation exercise, as it requires only the estimated covariance matrix as input. This is a commonly used approach to evaluating estimators
of large covariance matrices, see, e.g., (Fan, Zhang, and Yu 2012). Specifically, we
consider the following optimization problem:

$$\min_{\omega}\ \omega^{\top}\widehat\Sigma\,\omega, \qquad \text{subject to } \omega^{\top}\mathbf{1} = 1,\ \|\omega\|_{1} \le \gamma, \tag{3.10}$$

where $\|\omega\|_1 \le \gamma$ imposes an exposure constraint. As explained in (Fan, Furger, and Xiu 2016), when $\gamma = 1$, all portfolio weights must be non-negative, i.e., there is no short selling. When $\gamma$ is small, $\|\widehat\Sigma - \Sigma\|_{\mathrm{MAX}}$ dictates the performance of the portfolio risk because the optimal portfolio comprises a relatively small number of stocks. By contrast, when $\gamma$ is large, the portfolio is close to the global minimum variance portfolio, for which $\|\widehat\Sigma^{-1} - \Sigma^{-1}\|$ drives the performance of the portfolio risk. Therefore, investigating the out-of-sample risk of the portfolios in (3.10) for a variety of exposure constraints is informative about the quality of the covariance matrix estimation.
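Problem (3.10) is a convex program; a minimal sketch using CVXPY (an assumed dependency, not something the chapter relies on) reads as follows.

```python
import numpy as np
import cvxpy as cp

def min_variance_weights(Sigma_hat, gamma):
    """Solve min w' Sigma w  s.t.  sum(w) = 1, ||w||_1 <= gamma  (problem 3.10).
    Assumes Sigma_hat is positive semi-definite, as our estimators guarantee."""
    Sigma = 0.5 * (Sigma_hat + Sigma_hat.T)      # symmetrize for the solver
    d = Sigma.shape[0]
    w = cp.Variable(d)
    prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)),
                      [cp.sum(w) == 1, cp.norm(w, 1) <= gamma])
    prob.solve()
    return w.value
```

With $\gamma = 1$, the $\ell_1$ constraint combined with the budget constraint forces all weights to be non-negative, which reproduces the no-short-selling case.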
To focus on the evaluation of covariance matrix estimators, we intentionally adopt the simplest random walk forecasting model, i.e., $\widehat\Sigma_t \approx \mathrm{E}_t(\Sigma_{t+1})$, so that the estimated realized covariance matrices using data of the previous month are directly used for the portfolio construction in the next month. For a range of exposure constraints, we measure the out-of-sample portfolio risk using 15-minute realized volatility. We compare the covariance matrices based on the pre-averaging and subsampling methods, with many

choices of $\theta$ and subsampling frequencies $\Delta_n$. Figures 3.6, 3.7, and 3.8 provide the results for the best choice of $\theta = 0.03$ and $\Delta_n = 900$ from the simulations and the five thresholding methods we consider.
Figure 3.6 shows that (i) for S&P 500 constituents, the Location thresholding (black) performs the best among all thresholding methods, followed by the Soft (blue), AL (red), and SCAD (green) approaches, with Hard thresholding (yellow) being the worst. (ii) The TSR approach appears to be the best, with the lowest out-of-sample risk and the most stable performance across different thresholding methods. PCA is almost the same as TSR when Location thresholding is applied, but its performance deteriorates if we apply other thresholding techniques. CSR is dominated by the other two by a very large margin. This differs from our simulation results, indicating that CSR suffers from more serious model misspecification. (iii) When the exposure constraint $\gamma$ is small, the performance gap among different thresholding methods is small. This is expected because the $\|\cdot\|_{\mathrm{MAX}}$ norm differences across all methods are similar, and in this case the portfolios are so heavily constrained that they are effectively low-dimensional. (iv) The pre-averaging estimators (solid lines) dominate the subsampling estimators (dash-dotted lines) across almost all cases, which agrees with our simulation results.
For S&P 100, we observe from Figure 3.7 the similar pattern that pre-averaging outperforms the subsampling methods. PCA performs slightly worse than TSR. With
respect to the Dow Jones 30, Figure 3.8 shows that the CSR and PCA perform
considerably worse than TSR. This is not surprising, given that our theory suggests
that consistency of these two estimators requires a large dimension, whereas for TSR
a smaller cross-section works better.
Finally, we report in Table 3.4 the Diebold-Mariano (Diebold and Mariano 2002) tests comparing the out-of-sample risk of portfolios based on the pre-averaging estimators against that of the subsampling estimators. Negative test statistics favor the pre-averaging approach. Similar to Figures 3.6 - 3.8, when $\gamma$ is large, pre-averaging estimators deliver significantly smaller out-of-sample risk using TSR and PCA across all thresholding methods for S&P 500 index constituents. For S&P 100 and Dow Jones index constituents, the difference between pre-averaging and subsampling is significant only when using TSR.
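For reference, a sketch of the Diebold-Mariano statistic we tabulate, applied to the monthly out-of-sample loss differential of two competing portfolios; the Bartlett lag choice is our own.

```python
import numpy as np

def diebold_mariano(loss_a, loss_b, lags=2):
    """DM statistic for H0: equal expected loss. Negative values favor method a."""
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    T, d_bar = d.size, d.mean()
    lrv = np.mean((d - d_bar) ** 2)               # Bartlett (Newey-West) long-run variance
    for k in range(1, lags + 1):
        gk = np.mean((d[k:] - d_bar) * (d[:-k] - d_bar))
        lrv += 2 * (1 - k / (lags + 1)) * gk
    return d_bar / np.sqrt(lrv / T)
```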

3.7 Conclusion

Leveraging a variety of factor models, we construct pre-averaging based large covari-


ance matrix estimators using high frequency transaction prices, which are robust to
asynchronous arrival of trades and the market microstructure noise. We compare
various estimators based on different combinations of factor model specifications and
thresholding methods, in terms of their convergence rates, their finite sample behavior,
and their empirical performance in a portfolio allocation horse race. Throughout, we
find that pre-averaging plus TSR or PCA with Location thresholding dominates the
other combinations, in particular the subsampling method. Also, CSR, the method
which the MSCI Barra adopts for low frequency data, performs considerably worse
in almost all scenarios we study. This is perhaps driven by model misspecification,
which can be alleviated provided a potentially better set of factor exposures.

Estimator Pre-Averaging Subsampling Noiseless Mixed
Tuning θ θ θ θ ∆n ∆n ∆n ∆n ∆˚n θ, ∆n
Parameters 0.02 0.03 0.04 0.05 300 900 1, 800 3, 900 300 0.03, 900
Location Thresholding
TSR 0.06 0.06 0.06 0.06 0.12 0.08 0.09 0.13 0.03 0.08

p ´ Σ}MAX CSR 0.05 0.04 0.04 0.05 0.11 0.07 0.07 0.09 0.03 0.06
PCA 0.06 0.06 0.06 0.07 0.12 0.08 0.09 0.17 0.04 0.08
TSR 0.04 0.04 0.05 0.06 0.04 0.06 0.08 0.11 0.03 0.07
p | ´ βEβ | }MAX
}β Eβ CSR 0.02 0.02 0.03 0.03 0.02 0.03 0.04 0.06 0.02 0.03
PCA 0.04 0.05 0.05 0.06 0.04 0.06 0.08 0.19 0.04 0.07
TSR 0.05 0.04 0.04 0.04 0.12 0.06 0.06 0.07 0.02 0.05

p ´ Γ}MAX CSR 0.04 0.03 0.03 0.04 0.11 0.06 0.06 0.07 0.02 0.05
PCA 0.04 0.03 0.03 0.04 0.11 0.06 0.06 0.17 0.02 0.05
TSR 0.28 0.24 0.25 0.28 0.80 0.40 0.43 0.62 0.14 0.36

p ´ Σ}Σ CSR 0.27 0.22 0.22 0.24 0.79 0.37 0.37 0.48 0.13 0.32
PCA 0.28 0.24 0.26 0.28 0.79 0.40 0.43 0.93 0.15 0.36
TSR 5.32 4.31 5.41 6.98 9.43 6.50 10.45 30.16 3.57 7.00
p ´1 ´ Σ´1 }
}pΣq CSR 5.35 4.35 5.35 6.88 9.42 6.52 10.26 28.90 3.54 6.90
PCA 5.33 4.33 5.43 6.99 9.41 6.50 10.49 28.68 3.56 7.03
TSR 5.34 4.32 5.45 7.03 9.45 6.52 10.54 30.59 3.59 7.06
p ´1 ´ Γ´1 }
}pΓq CSR 5.37 4.38 5.39 6.93 9.44 6.54 10.38 29.33 3.57 6.97
PCA 5.35 4.35 5.47 7.05 9.44 6.52 10.59 28.89 3.58 7.09
rp PCA 3.00 3.00 3.00 3.00 3.00 3.00 3.00 1.71 3.00 3.00
Hard Thresholding
TSR 0.08 0.08 0.09 0.09 0.12 0.09 0.11 0.13 0.08 0.10

p ´ Σ}MAX CSR 0.08 0.08 0.08 0.08 0.11 0.08 0.08 0.09 0.07 0.08
PCA 0.08 0.08 0.08 0.09 0.12 0.09 0.10 0.17 0.08 0.09
TSR 0.08 0.08 0.08 0.08 0.12 0.08 0.08 0.08 0.08 0.08

p ´ Γ}MAX CSR 0.08 0.08 0.08 0.08 0.11 0.08 0.08 0.08 0.08 0.08
PCA 0.08 0.08 0.08 0.08 0.11 0.08 0.08 0.17 0.08 0.08
TSR 0.56 0.46 0.43 0.42 1.09 0.61 0.53 0.58 0.37 0.51

p ´ Σ}Σ CSR 0.53 0.43 0.40 0.38 1.04 0.56 0.46 0.41 0.35 0.45
PCA 0.54 0.44 0.41 0.40 1.04 0.58 0.51 0.67 0.35 0.49
TSR 8.16 7.36 7.07 6.97 10.65 8.52 7.79 7.56 6.58 7.70
p ´1 ´ Σ´1 }
}pΣq CSR 8.01 7.20 6.93 6.88 10.53 8.38 7.67 7.52 6.67 7.57
PCA 7.98 7.18 7.18 7.30 10.52 8.35 7.61 8.45 7.43 7.51
TSR 8.18 7.38 7.09 6.99 10.67 8.54 7.81 7.59 6.60 7.72
p ´1 ´ Γ´1 }
}pΓq CSR 8.03 7.22 6.95 6.90 10.55 8.41 7.70 7.55 6.69 7.59
PCA 8.01 7.18 6.91 6.87 10.54 8.37 7.62 8.40 6.70 7.53

Table 3.1: Pre-Averaging v.s. Subsampling using Location and Hard


Thresholding

Note: In this table, we compare the performance of pre-averaging estimators with the subsampling approach using different thresholding methods under a variety of matrix norms. The tuning parameter $\theta$ governs the pre-averaging local window size ($k_n \approx \theta n^{1/2+\delta}$), while $\Delta_n$ is the subsampling frequency in seconds. The “Noiseless” column provides the estimates using clean and synchronous data with a sampling frequency of $\Delta^{*}_n = 300$ (5 minutes), and the “Mixed” column provides the estimates using the pre-averaging approach on the subsampled data ($\theta = 0.03$, $\Delta_n = 900$). Since the low-rank part $\widehat\beta\widehat E\widehat\beta^{\top}$ is identical across thresholding methods, we only report it in the upper panel. We also report the average estimated number of factors for the PCA approach. The number of Monte Carlo repetitions is 1,000.
Estimator Pre-Averaging Subsampling Noiseless Mixed
Tuning θ θ θ θ ∆n ∆n ∆n ∆n ∆˚n θ, ∆n
Parameters 0.02 0.03 0.04 0.05 300 900 1, 800 3, 900 300 0.03, 900
Soft Thresholding
TSR 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.13 0.05 0.09

p ´ Σ}MAX CSR 0.05 0.05 0.06 0.06 0.11 0.07 0.07 0.09 0.04 0.07
PCA 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.15 0.05 0.09
TSR 0.05 0.05 0.05 0.05 0.12 0.06 0.06 0.08 0.04 0.06

p ´ Γ}MAX CSR 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.08 0.05 0.06
PCA 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.17 0.05 0.06
TSR 0.47 0.37 0.35 0.35 1.05 0.55 0.50 0.60 0.25 0.45
}Σp ´ Σ}Σ CSR 0.46 0.35 0.32 0.32 1.01 0.51 0.44 0.45 0.25 0.41
PCA 0.46 0.37 0.35 0.35 1.02 0.53 0.49 0.74 0.25 0.45
TSR 6.94 5.91 5.64 5.62 10.39 7.76 7.14 7.79 5.43 6.78
p ´1 ´ Σ´1 } CSR
}pΣq 6.93 5.92 5.67 5.66 10.36 7.76 7.18 7.99 5.54 6.81
PCA 6.92 7.55 8.00 8.31 10.35 7.73 8.12 12.54 8.24 7.73
TSR 6.97 5.93 5.66 5.64 10.42 7.78 7.16 7.82 5.68 6.80
p ´1 ´ Γ´1 } CSR
}pΓq 6.96 5.94 5.69 5.68 10.39 7.78 7.20 8.03 5.76 6.83
PCA 6.94 5.91 5.66 5.64 10.38 7.75 7.17 12.05 5.77 6.79
SCAD Thresholding
TSR 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.13 0.05 0.09

p ´ Σ}MAX CSR 0.05 0.05 0.06 0.06 0.11 0.07 0.07 0.09 0.04 0.07
PCA 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.15 0.05 0.09
TSR 0.05 0.05 0.05 0.05 0.12 0.06 0.06 0.08 0.04 0.06

p ´ Γ}MAX CSR 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.08 0.05 0.06
PCA 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.17 0.05 0.06
TSR 0.47 0.37 0.36 0.37 1.05 0.55 0.52 0.61 0.26 0.46
}Σp ´ Σ}Σ CSR 0.46 0.35 0.33 0.33 1.01 0.52 0.45 0.45 0.25 0.42
PCA 0.46 0.37 0.35 0.36 1.02 0.54 0.51 0.74 0.26 0.45
TSR 7.11 6.22 6.13 7.54 10.39 7.99 12.21 9.69 5.44 7.63
p ´1 ´ Σ´1 } CSR
}pΣq 7.08 6.26 6.13 7.00 10.36 7.94 10.78 9.96 5.55 7.39
PCA 7.06 7.66 8.26 8.88 10.35 7.91 11.04 12.96 8.39 8.24
TSR 7.14 6.24 6.15 7.60 10.42 8.01 12.36 9.78 5.68 7.68
p ´1 ´ Γ´1 } CSR
}pΓq 7.11 6.28 6.16 7.06 10.39 7.97 10.91 10.05 5.76 7.43
PCA 7.09 6.25 6.13 7.11 10.38 7.94 10.68 12.60 5.77 7.42
AL Thresholding
TSR 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.13 0.05 0.09

p ´ Σ}MAX CSR 0.05 0.05 0.05 0.06 0.11 0.07 0.07 0.09 0.04 0.07
PCA 0.06 0.06 0.07 0.07 0.12 0.09 0.10 0.16 0.05 0.08
TSR 0.05 0.04 0.05 0.05 0.12 0.06 0.06 0.08 0.04 0.06

p ´ Γ}MAX CSR 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.08 0.04 0.06
PCA 0.05 0.05 0.05 0.05 0.11 0.06 0.06 0.17 0.04 0.06
TSR 0.44 0.36 0.36 0.37 1.03 0.55 0.55 0.67 0.23 0.47
}Σp ´ Σ}Σ CSR 0.43 0.35 0.33 0.34 1.00 0.52 0.50 0.53 0.23 0.44
PCA 0.44 0.36 0.36 0.37 1.00 0.54 0.55 0.75 0.24 0.47
TSR 6.34 5.33 6.12 8.12 10.26 7.46 16.46 41.68 5.34 8.95
p ´1 ´ Σ´1 } CSR
}pΣq 6.40 5.41 6.08 8.05 10.30 7.53 17.40 40.75 5.46 9.00
PCA 7.21 8.47 9.30 10.14 10.29 7.71 18.10 25.70 9.12 10.23
TSR 6.36 5.35 6.17 8.20 10.29 7.48 16.70 42.81 5.59 9.05
p ´1 ´ Γ´1 } CSR
}pΓq 6.42 5.44 6.19 8.18 10.32 7.56 17.72 41.92 5.71 9.15
PCA 6.41 5.43 6.27 8.34 10.32 7.54 17.95 25.58 5.71 9.38

Table 3.2: Pre-Averaging v.s. Subsampling using Soft, SCAD, and AL


Thresholding
Note: This is a continuation of Table 3.1, where we report the simulation results using
different thresholding methods. All other settings remain the same.
Location Thresholding
rp 1 2 3 4 5 6 8 10 15 20 30 50
Pre-Averaging 0.20 0.14 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06

p ´ Σ}MAX Subsampling 0.20 0.15 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08
Noiseless 0.20 0.14 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
Pre-Averaging 0.24 0.16 0.05 0.08 0.09 0.10 0.10 0.10 0.10 0.11 0.12 0.14
p | ´ βEβ | }MAX
}β Eβ Subsampling 0.24 0.16 0.06 0.09 0.10 0.10 0.11 0.11 0.11 0.12 0.13 0.16
Noiseless 0.24 0.16 0.04 0.08 0.08 0.09 0.09 0.09 0.09 0.10 0.10 0.12
Pre-Averaging 0.25 0.17 0.03 0.07 0.07 0.08 0.09 0.09 0.09 0.09 0.09 0.10

p ´ Γ}MAX Subsampling 0.27 0.19 0.06 0.07 0.08 0.08 0.09 0.09 0.09 0.09 0.09 0.10
Noiseless 0.24 0.16 0.02 0.07 0.07 0.08 0.09 0.09 0.09 0.09 0.09 0.11
Pre-Averaging 1.11 0.66 0.24 0.25 0.26 0.27 0.29 0.30 0.34 0.37 0.53 0.73

p ´ Σ}Σ Subsampling 1.19 0.76 0.40 0.41 0.43 0.44 0.46 0.48 0.54 0.59 0.78 1.03
Noiseless 1.07 0.62 0.15 0.15 0.16 0.16 0.18 0.19 0.21 0.23 0.34 0.49
Pre-Averaging 0.10 0.09 0.04 0.04 0.04 0.04 0.04 0.04 0.05 0.06 0.07 0.08
p ´1 ´ Σ´1 } ˆ 10´2
}pΣq Subsampling 0.10 0.09 0.07 0.06 0.06 0.06 0.06 0.06 0.06 0.07 0.08 0.09
Noiseless 0.10 0.09 0.04 0.04 0.04 0.04 0.04 0.04 0.05 0.06 0.06 0.07
Pre-Averaging 0.10 0.09 0.04 0.09 0.14 0.23 0.40 0.54 0.95 1.46 1.84 2.21
p ´1 ´ Γ´1 } ˆ 10´2
}pΓq Subsampling 0.10 0.09 0.07 0.08 0.11 0.17 0.26 0.35 0.59 0.84 1.03 1.26
Noiseless 0.10 0.09 0.04 0.14 0.23 0.42 0.79 1.09 2.03 3.23 4.36 5.01

Table 3.3: Impact of the Selected Number of Factors on the PCA Method

Note: We present the errors of the PCA approach for a variety of norms, using different numbers of factors $\widehat r$, with the true value of $r$ equal to 3. For tuning parameters, we select $\theta = 0.03$ for pre-averaging, $\Delta_n = 900$ for the subsampling method, and the benchmark no-noise and synchronous case with $\Delta^{*}_n = 300$. Location thresholding is used in all cases. The number of Monte Carlo repetitions is 1,000.

Exposure Constraint γ 1 1.2 1.6 2 3 4 8
S&P 500
TSR -1.60 -1.88* -2.24** -2.76*** -3.12*** -3.16*** -3.15***
Location CSR -0.09 -0.51 -1.55 -1.75* -2.21** -2.07** -2.02**
PCA -2.31** -2.55** -2.64*** -2.80*** -2.96*** -2.81*** -2.79***
TSR -1.53 -2.07** -2.59*** -2.76*** -2.72*** -2.58*** -2.56**
Hard CSR 0.48 0.47 0.31 0.73 1.20 1.28 1.30
PCA -2.39** -2.91*** -3.27*** -3.44*** -3.48*** -3.27*** -3.26***
TSR -1.64 -2.14** -2.81*** -3.27*** -3.43*** -3.40*** -3.38***
Soft CSR 0.38 0.24 -0.05 0.20 0.54 0.62 0.66
PCA -2.33** -2.76*** -3.28*** -3.44*** -3.56*** -3.47*** -3.46***
TSR -1.63 -2.09** -2.62*** -3.00*** -3.08*** -2.97*** -2.96***
SCAD CSR 0.45 0.39 0.21 0.38 0.87 0.98 1.00
PCA -2.32** -2.77*** -3.17*** -3.35*** -3.22*** -2.86*** -2.84***
TSR -1.64 -2.20** -2.89*** -3.35*** -3.53*** -3.46*** -3.44***
AL CSR 0.53 0.38 0.45 0.76 1.23 1.36 1.38
PCA -2.30** -2.71*** -3.21*** -3.36*** -3.42*** -3.22*** -3.21***
S&P 100
TSR -1.57 -2.09** -2.31** -2.45** -1.88* -1.46 -1.47
Location CSR -2.38** -2.78*** -3.22*** -3.18*** -3.29*** -3.29*** -3.29***
PCA -1.56 -1.90* -1.55 -1.39 -0.82 -0.59 -0.59
TSR -1.69* -2.27** -2.44** -2.42** -2.05** -2.07** -2.07**
Hard CSR -1.76* -1.69* -1.48 -1.70* -1.79* -1.79* -1.79*
PCA -1.35 -1.58 -1.04 -1.19 -0.70 -0.43 -0.43
TSR -1.68* -2.33** -2.60*** -2.59*** -2.15** -2.07** -2.08**
Soft CSR -2.09** -2.50** -2.23** -2.33** -2.42** -2.42** -2.42**
PCA -1.70* -2.12** -1.50 -1.59 -1.18 -1.11 -1.11
TSR -1.70* -2.34** -2.59*** -2.54** -2.04** -1.92* -1.93*
SCAD CSR -2.06** -2.41** -2.16** -2.26** -2.36** -2.36** -2.36**
PCA -1.57 -1.95* -1.18 -1.29 -0.82 -0.61 -0.61
TSR -1.68* -2.30** -2.57** -2.59*** -2.30** -1.98** -2.00**
AL CSR -2.10** -2.44** -2.17** -2.30** -2.40** -2.40** -2.40**
PCA -1.71* -2.07** -1.47 -1.48 -1.02 -0.88 -0.88
Dow Jones 30
TSR -1.85* -2.29** -2.74*** -2.66*** -2.55** -2.55** -2.55**
Location CSR -0.12 -0.07 0.10 0.09 0.16 0.16 0.16
PCA 0.67 0.36 -0.04 -0.62 -0.63 -0.63 -0.63
TSR -1.80* -2.01** -2.70*** -2.53** -2.45** -2.45** -2.45**
Hard CSR -0.84 -0.88 -0.75 -0.67 -0.55 -0.55 -0.55
PCA 0.59 0.19 -0.53 -1.07 -0.93 -0.89 -0.89
TSR -0.59 -1.06 -2.43** -2.29** -2.21** -2.21** -2.21**
Soft CSR -0.71 -0.90 -1.09 -1.08 -0.94 -0.94 -0.94
PCA 0.58 0.18 -0.45 -0.94 -1.00 -1.00 -1.00
TSR -0.56 -1.03 -2.43** -2.29** -2.21** -2.21** -2.21**
SCAD CSR -0.71 -0.90 -1.10 -1.09 -0.95 -0.95 -0.95
PCA 0.57 0.17 -0.48 -0.93 -1.00 -1.00 -1.00
TSR -0.03 -0.55 -2.06** -1.98** -1.94* -1.94* -1.94*
AL CSR -0.63 -0.76 -1.13 -1.11 -0.96 -0.96 -0.96
PCA 0.60 0.20 -0.44 -0.82 -0.92 -0.92 -0.92

Table 3.4: Diebold-Mariano Tests for the Out-of-Sample Risk Comparison

Note: This table reports the Diebold-Mariano test statistics against the null that the out-of-sample risk of the portfolios based on the pre-averaging method is equal to that based on the subsampling approach. Negative numbers favor the pre-averaging approach. We use *, **, and *** to indicate significance at the 10%, 5%, and 1% levels, respectively.
[Figure 3.1 image: trading intensity graph over 2004–2013 showing the daily number of observations (log scale, roughly 78 to 23,400) for the SP500, SP100, and DJI30 refresh-time samples, with the 5%–25% & 75%–max, 25%–75%, and median (50%) bands.]

Figure 3.1: Trading Intensity of Assets

Note: This figure plots the max, 5%, 25%, 50%, and 75% quantiles of the daily number of
observations for the S&P 500 index constituents after cleaning. It also plots the sample size
after refresh time sampling for the Dow Jones 30, S&P 100, and S&P 500 index constituents,
respectively.

[Figure 3.2 image: 15 panels of cumulative returns over 2004–2013 — Market, SMB, HML, RMW, CMA, MOM, XLE, XLB, XLI, XLY, XLP, XLV, XLF, XLK, and XLU.]

Figure 3.2: Time Series of Factors Used in TSR

Note: This figure plots the cumulative returns of the factors we have used in TSR, including
the market portfolio, the small-minus-big market capitalization (SMB) portfolio, the high-
minus-low price-earning ratio (HML) portfolio, the robust-minus-weak (RMW) portfolio, the
conservative-minus-aggressive (CMA) portfolio, the momentum (MOM) portfolio, as well
as 9 industry SPDR ETFs (Energy (XLE), Materials (XLB), Industrials (XLI), Consumer
Discretionary (XLY), Consumer Staples (XLP), Health Care (XLV), Financial (XLF), Infor-
mation Technology (XLK), and Utilities (XLU)). The overnight returns are excluded, as are the half trading days.
[Figure 3.3 image: 14 panels of cumulative returns over 2004–2013 for the CSR-estimated factors — Intercept, Market, Earnings Yield, Momentum, Trading Activity, Currency Sensitivity, Earnings Variation, Growth, Volatility, Dividend Yield, Size, Size Nonlinearity, Leverage, and Value exposures.]

Figure 3.3: Time Series of Estimated Factors by CSR

Note: This figure plots the cumulative returns of factors we estimate using CSR, based on
S&P 500 constituents. The corresponding factor exposures include the intercept, the market
beta, and 12 other variables from MSCI Barra USE3 (Earnings Yield, Momentum, Trading
Activity, Currency Sensitivity, Earnings Variation, Growth, Volatility, Dividend Yield, Size,
Size Nonlinearity, Leverage, and Value).

[Figure 3.4 image: 15 panels of cumulative returns over 2004–2013 for the leading principal components PC1–PC15.]

Figure 3.4: Time Series of Estimated Factors by PCA

Note: This figure plots the cumulative returns of the leading principal components of S&P
500 constituents we estimate using PCA.

[Figure 3.5 image: three heatmaps over 2004–2013 — TSR vs PCA, TSR vs CSR, and CSR vs PCA — with the canonical rank on the y-axis and generalized correlations between 0 and 1 indicated by color.]

Figure 3.5: Generalized Correlation Plot

Note: This figure provides the heatmap of the monthly generalized (canonical) correlations
among the time series of factors used in TSR and estimated from CSR and PCA methods.
The y-axis is the rank of the generalized correlation. For two sets of factors $X_a$ and $X_b$, the generalized correlation of rank $k$ is calculated as the square root of the $k$th-largest eigenvalue of the matrix $[X_a, X_a]^{-1}[X_a, X_b][X_b, X_b]^{-1}[X_b, X_a]$, where $[\cdot,\cdot]$ denotes the quadratic covariation.
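A sketch of the generalized-correlation computation described in the note, with the two factor panels given as $T \times r$ matrices of 15-minute returns and the quadratic covariations approximated by realized covariances; the names are ours.

```python
import numpy as np

def generalized_correlations(Xa, Xb):
    """Xa: T x ra, Xb: T x rb matrices of high-frequency factor returns.
    Returns the generalized (canonical) correlations, largest first."""
    Caa, Cab = Xa.T @ Xa, Xa.T @ Xb
    Cbb, Cba = Xb.T @ Xb, Xb.T @ Xa
    M = np.linalg.solve(Caa, Cab) @ np.linalg.solve(Cbb, Cba)
    eig = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(eig, 0.0, 1.0))
```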

[Figure 3.6 image: three panels — TSR with 15 factors, CSR with 14 loadings, and PCA with 15 components — plotting annualized risk (%) against the exposure constraint for pre-averaging and subsampling combined with Location, Hard, Soft, SCAD, and AL thresholding.]

Figure 3.6: Out-of-Sample Risk of the S&P 500 Portfolio

Note: This figure compares the time series average of the out-of-sample monthly volatility
from 2004 to 2013 using S&P 500 Index constituents. The x-axis is the exposure constraint
γ in the optimization problem (3.10). The results are based on a combination of methods: (pre-averaging, subsampling) × (TSR, CSR, PCA) × (Location, Soft, AL, SCAD, Hard thresholding). We use the GICS codes at the industry group level for the location thresholding method. The pre-averaging method uses $\theta = 0.03$, while the subsampling method uses 15-minute returns. The out-of-sample volatility is calculated using 15-minute subsampled returns each month.

[Figure 3.7 image: three panels — TSR with 15 factors, CSR with 14 loadings, and PCA with 8 components — plotting annualized risk (%) against the exposure constraint for the same estimator–thresholding combinations as Figure 3.6.]

Figure 3.7: Out-of-Sample Risk of the S&P 100 Portfolio

Note: This figure compares the time series average of the out-of-sample monthly volatility
from 2004 to 2013 using S&P 100 Index constituents. All other settings remain the same as
those in Figure 3.6.

[Figure 3.8 image: three panels — TSR with 15 factors, CSR with 14 loadings, and PCA with 5 components — plotting annualized risk (%) against the exposure constraint for the same estimator–thresholding combinations as Figure 3.6.]

Figure 3.8: Out-of-Sample Risk of the Dow Jones 30 Portfolio

Note: This figure compares the time series average of the out-of-sample monthly volatility
from 2004 to 2013 using Dow Jones 30 Index constituents. All other settings remain the
same as those in Figure 3.6.

3.1 Mathematical Proofs

3.1.1 Proof of Theorem 3

We need a few lemmas.


Lemma 10. Suppose that $n_\delta^{-1/2}\sqrt{\log d} = o(1)$. Under Assumptions 1 - 4, and for some constants $C_0$, $C_1$ and $C_2$, we have

(i) $\mathbb{P}\left(\|\hat{E} - E\|_{\mathrm{MAX}} \ge C_0\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(r^2 C_1 d^{-C_0^2 C_2}\right)$,  (3.1)

(ii) $\mathbb{P}\left(\|\hat{E} - E\|_{F} \ge C_0\, r\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(r^2 C_1 d^{-C_0^2 C_2}\right)$,  (3.2)

(iii) $\mathbb{P}\left(\max_{1\le k\le r,\,1\le l\le d}\left|\frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\sum_{i=0}^{n-k_n+1}\bar{X}^{\star}_{k,i}\bar{Z}^{\star}_{o,li}\right| \ge C_0\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(r C_1 d^{-C_0^2 C_2+1}\right)$,  (3.3)

(iv) $\mathbb{P}\left(\max_{1\le k,l\le d}\left|\frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\sum_{i=0}^{n-k_n+1}\bar{Z}^{\star}_{o,ki}\bar{Z}^{\star}_{o,li} - \int_0^t g_{s,lk}\,ds\right| \ge C_0\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(C_1 d^{-C_0^2 C_2+2}\right)$,  (3.4)

(v) $\mathbb{P}\left(\max_{1\le j\le d}\|\hat{\beta}_j - \beta_j\| \ge C_0\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(C_1 d^{-C_0^2 C_2+1}\right)$,  (3.5)

(vi) $\mathbb{P}\left(\|\hat{\beta} - \beta\|_F \ge C_0\, d^{1/2} n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(C_1 d^{-C_0^2 C_2+1}\right)$,  (3.6)

(vii) $\mathbb{P}\left(\|\hat{\beta} - \beta\|_{\mathrm{MAX}} \ge C_0\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(C_1 d^{-C_0^2 C_2+1}\right)$,  (3.7)

(viii) $\mathbb{P}\left(\max_{1\le k\le d}\frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\sum_{i=0}^{n-k_n+1}\Big(\big((\beta - \hat{\beta})\bar{X}^{\star}_i\big)_k\Big)^2 \ge C_0\, r^2 n^{-1/2+\delta}\log d\right) = O\!\left(C_1 d^{-C_0^2 C_2+1}\right)$,  (3.8)

(ix) $\mathbb{P}\left(\max_{1\le k,l\le d}\big|\hat{\Gamma}_{kl} - \Gamma_{kl}\big| \ge C_0\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(C_1 d^{-C_0^2 C_2+2}\right)$,  (3.9)

(x) $\mathbb{P}\left(\max_{1\le k,l\le d}\big|\hat{\Gamma}^{\mathcal{S}}_{kl} - \Gamma_{kl}\big| \ge C_0\, n_\delta^{-1/2}\sqrt{\log d}\right) = O\!\left(C_1 d^{-C_0^2 C_2+2}\right)$.  (3.10)

Proof of Lemma 10. (i) Note that we have
$$\hat{E}_{kl} = \frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\left(\sum_{i=0}^{n-k_n+1}\bar{X}_{k,i}\bar{X}_{l,i} + \sum_{i=0}^{n-k_n+1}\bar{X}_{k,i}\bar{\varepsilon}_{l,i} + \sum_{i=0}^{n-k_n+1}\bar{X}_{l,i}\bar{\varepsilon}_{k,i} + \sum_{i=0}^{n-k_n+1}\bar{\varepsilon}_{k,i}\bar{\varepsilon}_{l,i}\right) = T_1 + T_2 + T_3 + T_4,$$
therefore,
$$\mathbb{P}\big(|\hat{E}_{kl}-E_{kl}| \ge u\big) \le \mathbb{P}\big(|T_1-E_{kl}| \ge u/4\big) + \mathbb{P}\big(|T_2| \ge u/4\big) + \mathbb{P}\big(|T_3| \ge u/4\big) + \mathbb{P}\big(|T_4| \ge u/4\big).$$
For $T_1$, this expression can be further decomposed as
$$T_1 = \frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\left\{\sum_{i=1}^{n-k_n+1} a_{1,i}(k,l)\big(X_{k,t^k_i}-X_{k,t^k_{i-1}}\big)\big(X_{l,t^l_i}-X_{l,t^l_{i-1}}\big) + \sum_{(i,j)\in\mathcal{F}} b_{1,ij}(k,l)\big(X_{k,t^k_i}-X_{k,t^k_{i-1}}\big)\big(X_{l,t^l_j}-X_{l,t^l_{j-1}}\big)\right\},$$
for certain numbers $a_{1,i}(k,l)$ and $b_{1,ij}(k,l)$ such that
$$|a_{1,i}(k,l)| + |b_{1,ij}(k,l)| \le Ck_n.$$
The set $\mathcal{F}$ is given by
$$\mathcal{F} = \left\{(i,j)\,\middle|\, 1\le i\le n-k_n+1,\ 1\le j\le n-k_n+1,\ |i-j|\le k_n-1,\ i\ne j\right\}.$$
Let
$$A^i_{k,l} := \frac{n}{n-k_n+2}\frac{a_{1,i}(k,l)}{\psi_2 k_n} = O(1).$$

We insert the synchronized true prices $X_{k,t_i}$ and $X_{k,t_{i-1}}$ between $X_{k,t^k_i}$ and $X_{k,t^k_{i-1}}$ and write
$$X_{k,t^k_i} - X_{k,t^k_{i-1}} = \big(X_{k,t^k_i} - X_{k,t_i}\big) + \big(X_{k,t_i} - X_{k,t_{i-1}}\big) + \big(X_{k,t_{i-1}} - X_{k,t^k_{i-1}}\big).$$

Now, using the above expression to expand $\big(X_{k,t^k_i}-X_{k,t^k_{i-1}}\big)\big(X_{l,t^l_i}-X_{l,t^l_{i-1}}\big)$, we obtain the following decomposition

n´k
ÿn `1

Aik,l pXk,tki ´ Xk,tki´1 qpXl,tli ´ Xl,tli´1 q


i“1
n´k
ÿ n `1 !
“ Aik,l pXk,ti ´ Xk,ti´1 qpXl,ti ´ Xl,ti´1 q ` pXk,tki ´ Xk,ti qpXl,tli ´ Xl,ti q
i“1

` pXk,tki ´ Xk,ti qpXl,ti ´ Xl,ti´1 q ` pXk,tki ´ Xk,ti qpXl,ti´1 ´ Xl,tli´1 q

` pXk,ti ´ Xk,ti´1 qpXl,tli ´ Xl,ti q ` pXk,ti ´ Xk,ti´1 qpXl,ti´1 ´ Xl,tli´1 q

` pXk,ti´1 ´ Xk,tki´1 qpXl,tli ´ Xl,ti q ` pXk,ti´1 ´ Xk,tki´1 qpXl,ti ´ Xl,ti´1 q


)
`pXk,ti´1 ´ Xk,tki´1 qpXl,ti´1 ´ Xl,tli´1 q
n´k
ÿ n `1

” Aik,l ∆ni Xk ∆ni Xl ` Hkl


1 1
p1q ` ¨ ¨ ¨ ` Hkl p8q.
i“1

řn´kn `1 şt
For i“1 Aik,l ∆ni Xk ∆ni Xl , denote Xt˚ “ 0
σs dWs , and denote for 1 ď i ď n,
1 ď k, l ď p,

ζi,kl “ p∆ni Xk˚ qp∆ni Xl˚ q, ζi,kl


1
“ Epp∆ni Xk˚ qp∆ni Xl˚ q|Fpi´1q∆n q, 2
ζi,kl 1
“ ζi,kl ´ ζi,kl ,

řn´kn `1
then Mt “ i“1 Aik,l ζi,kl
2
is a continuous-time martingale. By Itô’s lemma, we have

żt żt żt
` ˚ ˚
˘` ˚ ˚
˘ |
` ˚ ˚
˘ ` ˚ ˚
˘
Xt,k ´ Xs,k Xt,l ´ Xs,l ´ pσσ qv,kl dv “ Xv,k ´ Xs,k pσv dWv ql ` Xv,l ´ Xs,l pσv dWv qk .
s s s

Therefore

ż i∆n ż i∆n
2
` ˚ ˚
˘ ` ˚ ˚
˘
ζi,kl “ Xs,k ´ Xpi´1q∆ n ,k
pσs dWs ql ` Xs,l ´ Xpi´1q∆ n ,l
pσs dWs qk .
pi´1q∆n pi´1q∆n

Now we have
ˇ ˇ
ˇn´k
ÿn `1 n´k
ÿ n `1 ż i∆n ˇ
i i |
Ak,l pXk,tki ´ Xk,tki´1 qpXl,tli ´ Xl,tli´1 q ´ Ak,l pσσ qs,kl dsˇ
ˇ ˇ
ˇ
ˇ pi´1q∆n ˇ
ˇ i“1 i“1
ˇn´k n `1 ż i∆n n´k n `1 ż i∆n
ˇ ÿ i n ˚
ÿ
i n ˚
“ˇ Ak,l ∆i Xk bs,l ds ` Ak,l ∆i Xl bs,k ds
ˇ i“1 pi´1q∆n i“1 pi´1q∆n
ˇ
n´kÿ n `1 ż i∆n ż i∆n ˇ
i
Ak,l bs,l ds bs,k ds ` Mt ˇ .
ˇ
`
i“1 pi´1q∆n pi´1q∆n ˇ

We proceed with each of the four terms, starting with Mt .


The quadratic variation of Mt is given by
˜ q q
n´kn `1 ż i∆n
ÿ ` ˚ ˘2 ÿ ` ˚ ˘2 ÿ
rM, M st “∆n pAik,l q2 Xs,k ˚
´ Xpi´1q∆ n ,k
σ 2
s,lr ` X s,l ´ X ˚
pi´1q∆n ,l
2
σs,kr
i“1 pi´1q∆n r“1 r“1
q
¸
` ˚ ˘` ˚ ˘ ÿ
˚ ˚
`2 Xs,k ´ Xpi´1q∆ n ,k
Xs,l ´ Xpi´1q∆ n ,l
σs,lr σs,kr ds.
r“1

According to Assumption 1, here we assume that $\|X_t\|_\infty \le K$, $\|h_t\|_\infty \le K$, and $\|\sigma_t\sigma_t^\intercal\|_{\mathrm{MAX}} \le K$, for some constant $K > 0$. Therefore, by the Cauchy-Schwarz inequality, we have

rM, M st ď 16K 3 t∆n .

Then by the exponential inequality for a continuous martingale, we have


˜ ˇ ˇ ¸
ˇn´k
ÿ n `1 ˇ ˆ
nu2
˙
i 2 ˇ
P sup ˇ Ak,l ζi,kl ˇ ą u ď exp ´ . (3.11)
ˇ
0ďsďt ˇ i“1 ˇ 32K 3 t

In addition, by Cauchy-Schwarz inequality:


˜ˇ ˇ ¸
ˇn´k
ÿ n `1 ż i∆n ˇ
Aik,k ∆ni Xk˚ bs,l dsˇ ą u
ˇ ˇ
P ˇ
ˇ i“1 pi´1q∆n ˇ
˜
n´k n `1 n´k n `1 ż i∆n n´kn `1 ż i∆n
ÿ
i n ˚ 2
ÿ x2 ÿ
ďP Ak,k p∆i Xk q ´ Aik,k pσσ | qs,kk ds ą ´ Aik,k pσσ | qs,kk ds
i“1 i“1 pi´1q∆n tK∆n i“1 pi´1q∆n

˜ˇ ˇ ¸
ˇn´k n `1 n´k n `1 ż i∆n ˇ u2
ˇ ÿ ÿ
Aik,k p∆ni Xk˚ q2 ´ Aik,k pσσ | qs,kk dsˇ ą ´ tK
ˇ
ďP ˇ
ˇ i“1 i“1 pi´1q∆n ˇ tK∆n
¨ ´ ¯2 ˛
u2
tK∆n
´ tK ‹
ď exp ˝´ ‚,
˚
32K 3 t∆n

where the last inequality follows from (3.11). Finally, notice that

ˇ ˇ
ˇn´k n `1 ż i∆n ż i∆n ˇ
ˇ ÿ i
Ak,l bs,l ds bs,k dsˇ ď tK 2 ∆n max |Aik,l |,
ˇ
ˇ
ˇ i“1 pi´1q∆n pi´1q∆n ˇ i

we can derive
˜ˇ ˇ ¸
ˇn´k n `1 n´k n `1 ż i∆n ˇ
ˇ ÿ ÿ
Aik,l pXk,ti ´ Xk,ti´1 qpXl,ti ´ Xl,ti´1 q ´ Aik,l pσσ | qs,kl dsˇ ą 4u
ˇ
P ˇ
ˇ i“1 i“1 pi´1q∆ n
ˇ
˜ˇ ˇ ¸
ˇn´k n `1 ż i∆n ˇ
ˇ ÿ i n ˚
ďP p|Mt | ą uq ` 2P ˇ Ak,l ∆i Xk bs,l dsˇ ą u ` 1ttK 2 ∆n maxi |Aik,l |ąuu
ˇ
ˇ i“1 pi´1q∆n ˇ
¨ ´ ¯ 2
˛
u2
ˆ
u 2
˙
tK∆ n
´ tK
ď exp ´ ` 2 exp ˝´
˚ ‹
3
32K t∆n 3
32K t∆n

ˆ ˙
16C2 u2
ďC1 exp ´ ,
∆n

where the above inequality holds if $x > \big(tK^2\Delta_n\max_i|A^i_{k,l}|\big) \vee \big(tK\sqrt{\Delta_n}\big) \vee \big((tK\Delta_n/2)\sqrt{1+4/\Delta_n}\big)$, and $C_1 \ge 3$, $C_2 \le (512K^3 t)^{-1}$. On the other hand, if $x$ violates this bound, i.e., $x \le C'\sqrt{\Delta_n}$, we can choose $C_1$ such that $C_1\exp(-16C_2 C'^2) \ge 1$, so that the inequality follows trivially. For $H^1_{kl}(1),\dots,H^1_{kl}(8)$, we can use exactly the same technique as for $\sum_{i=1}^{n-k_n+1}A^i_{k,l}\Delta^n_i X_k\,\Delta^n_i X_l$, and have that

`ˇ 1 1
ˇ ˘ 2
P ˇHkl p1q ` ¨ ¨ ¨ ` Hkl p8qˇ ě u ď Cp e´Cp nu .

Since it is easy to show that

n´k
ÿ n `1 ż i∆n ż|
Aik,l pσσ | qs,kl ds “ pσσ | qs,kl ds ` opnδ q,
i“1 pi´1q∆n 0

we have
˜ˇ ˇ ¸
n´k n `1 ż|
ˇ n 1 ÿ ˇ
a1,i pk, lqAik,l pXk,tki ´ Xk,tki´1 qpXl,tli ´ Xl,tli´1 q ´ pσσ | qs,kl dsˇ ě u
ˇ ˇ
P ˇ
ˇ n ´ kn ` 2 ψ2 kn i“1 0 ˇ
˜ˇ ˇ ¸
ˇn´k n `1 n´k n `1 ż i∆n ˇ
ˇ ÿ i
ÿ
Ak,l pXk,ti ´ Xk,ti´1 qpXl,ti ´ Xl,ti´1 q ´ Aik,l pσσ | qs,kl dsˇ ą u{2
ˇ
ďP ˇ
ˇ i“1 i“1 pi´1q∆n ˇ
˜ˇ ˇ ¸
`ˇ 1 ˇ ˘ ˇn´k ÿn `1 ż i∆n ż| ˇ
1
` P ˇHkl p1q ` ¨ ¨ ¨ ` Hkl p8qˇ ě u{3 ` P ˇ Ai pσσ | qs,kl ds ´ pσσ | qs,kl dsˇ ě u{3
ˇ ˇ
ˇ i“1 pi´1q∆n k,l 0 ˇ
2
ďCp e´Cp nu ` 1tuăopnδ qu
2
ďCp e´Cp nu .

On the other hand, since we have

|a1,i pk, lq| ` |b1,ij pk, lq| ď Ckn ,

and
pXk,tki ´ Xk,tki´1 qpXl,tli ´ Xl,tli´1 q “ Op pn´1 q,

pXk,tki ´ Xk,tki´1 qpXl,tli ´ Xl,tli´1 q “ Op pn´1 q,

writing

n n
X1,ij “ b1,ij pk, lqpXk,tki ´ Xk,tki´1 qpXl,tli ´ Xl,tli´1 q,
n ´ kn ` 2 ψ2 kn

then for pi, jq P F,


X1,ij “ Op pn´1 q.

Using a similar decomposition, we can obtain

ÿ ÿ
i
X1,ij “ Bk,l ∆ni Xk ∆ni Xl ` Hkl
2 2
p1q ` ¨ ¨ ¨ ` Hkl p8q.
pi,jqPF pi,jqPF

Since $\|X_t\|_\infty$ is bounded, $B^i_{k,l}\Delta^n_i X_k\,\Delta^n_i X_l - \mathbb{E}\big(B^i_{k,l}\Delta^n_i X_k\,\Delta^n_i X_l\big)$ is also bounded. Then, by Hoeffding's lemma, $B^i_{k,l}\Delta^n_i X_k\,\Delta^n_i X_l - \mathbb{E}\big(B^i_{k,l}\Delta^n_i X_k\,\Delta^n_i X_l\big)$ is a sub-Gaussian random variable. Similar arguments extend to $H^2_{kl}(1),\dots,H^2_{kl}(8)$. Then, by Hoeffding's inequality and $\#\mathcal{F}_{kl}\le Cnk_n$, where $\#\mathcal{F}_{kl}$ denotes the number of elements in the set $\mathcal{F}_{kl}$, we have
¨ˇ ˇ ˛
ˇ ˇ
ˇ1 ÿ ˇ
P ˝ˇˇ X1,ij ˇˇ ě u{8‚
ˇ n pi,jqPF ˇ
¨ˇ ¨ ˛ˇ ˛
ˇ ˇ
ˇ1 ÿ
i 1 ÿ ˇ
ďP ˝ˇˇ Bk,l ∆ni Xk ∆ni Xl ´ E ˝ i
Bk,l ∆ni Xk ∆ni Xl ‚ˇˇ ě u{24‚
ˇ n pi,jqPF n pi,jqPF ˇ
¨ˇ ¨ ˛ˇ ˛
ˇ ˇ
`ˇ 2 2 2 2
ˇ ˘ ˇ 1 ÿ ˇ
`P ˇHkl p1q ` ¨ ¨ ¨ ` Hkl p8q ´ EpHkl p1q ` ¨ ¨ ¨ ` Hkl p8qqˇ ě u{24 ` 1 ˝ˇˇE ˝ X1,ij ‚ˇˇ ě u{24‚
ˇ n pi,jqPF ˇ
Cp pnuq2
´
ďCp e nkn ` 1t|Ep 1 ř X1,ij q|ěu{24u
n pi,jqPF

1{2´δ `n2δ su2


ďCp e´Cp rn .

This inequality holds when u ě Ckn {n for some constant C.


As for $T_4$, it can be decomposed as
$$T_4 = \frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\sum_{i=0}^{n-k_n+1}\left(\bar{u}_{k,t^k_i}\bar{u}_{l,t^l_i} + \bar{v}_{k,t^k_i}\bar{v}_{l,t^l_i} + \bar{u}_{k,t^k_i}\bar{v}_{l,t^l_i} + \bar{v}_{k,t^k_i}\bar{u}_{l,t^l_i}\right) = T_{41} + T_{42} + T_{43} + T_{44}.$$

Using techniques similar to those used to prove Proposition 10 in (Kim, Wang, and Zou 2016), we can show that

` ˘
max E |T4 | ď Cp n´2δ ` nα´1{2´2δ ` n2α´1´2δ .
1ďk,lďd

With respect to T2 , note that

n´kn `1
n 1 ÿ
T2 “ X̄k,tki ūl,tli ` X̄k,tki v̄l,tli “ T21 ` T22 .
n ´ kn ` 2 ψ2 kn i“0

For T21 , by Jensen’s inequality, we have

n´kn `1
1 ÿ
max E|T21 | ďCp E|X̄k,tki |E|ūl,tli |
1ďk,lďd kn i“0
n´k n `1
1 ÿ 1
ďCp E|X̄k,tki |kn´1{2 ď Cp nn´1 kn´1{2 “ kn´3{2 .
kn i“0
kn

For $T_{22}$, by Jensen's inequality and Hölder's inequality, we have

n´k n `1
1 ÿ
max E|T22 | ďCp rE|X̄k,tki |2 s1{2 rE|v̄l,t
2 1{2
ls ď Cp kn´1 nας .
1ďk,lďd kn i“0
i

Therefore,
max E|T2 | ď Cp pkn´3{2 ` kn´1 nας q.
1ďk,lďd

Similarly, we can show that

max E|T3 | ď Cp pkn´3{2 ` kn´1 nας q.


1ďk,lďd

Using Talagrand’s concentration inequality and some simple calculation, we have

´3 ´2 2ας
qu2
Pp max |T2 | ě uq ď d2 e´Cp pkn `kn n
,
1ďk,lďd

´3 ´2 2ας
qu2
Pp max |T3 | ě uq ď d2 e´Cp pkn `kn n
,
1ďk,lďd

and
´2δ `nα´1{2´2δ `n2α´1´2δ qu2
Pp max |T4 | ě uq ď d2 e´Cp pn .
1ďk,lďd

Combining the above bounds, we have proved that

1{2´δ `n2δ su2


p kl ´ Ekl | ě uq ď C1 e´C2 rn
Pp|E .

Thus

´› › a ¯
P ›E ´ E› ě C0 rnδ ` n´2δ s log d
›p ›
MAX
´1{2 ?
´C2 rn1{2´δ `n2δ spC0 nδ log dq2 2
ďC1 r2 e “ C1 d´C0 C2 .

(ii) Since
› › › ›
›E ´ E› ď r ›E ´ E› ,
›p › ›p ›
F MAX

then
´› › a ¯
´1{2 2
P ›E ´ E› ě C0 rnδ log d ď C1 d´C0 C2 .
›p ›
F

(iii) The derivation for $\bar{X}^{\star}_{k,i}\bar{Z}^{\star}_{o,li}$ is similar to that for $\bar{X}^{\star}_{k,i}\bar{X}^{\star}_{l,i}$ in (i); therefore we obtain
˜ ˇ ˇ ¸
n´kn `1
ˇ n 1 ÿ ˇ
´1{2
a 2
max ˇ ‹
X̄k,i ‹ ˇ
Z̄o,li ˇ ě C0 nδ log d ď C1 d´C0 C2 `1 .
ˇ
P
1ďkďr,1ďlďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0

(iv) By a similar argument as in (i), we have

˜ ˇ ˇ ¸
n´kn `1 żt
ˇ n 1 ÿ
‹ ‹
ˇ
´1{2
a
max ˇ Z̄o,ki Z̄o,li gs,kl dsˇ ě C0 nδ log d
ˇ ˇ
P ´
1ďkďr,1ďlďd ˇ n ´ kn ` 2 ψ2 kn 0 ˇ
i“0
2
ďC1 d´C0 C2 `2 .

(v)-(vii) Moreover, note that

´ ¯´1
p 22
βj ´ βj “ Π
p p 12 ,
Π j

therefore, under the event that


# ˇ ˇ +
n´k n `1
ˇ n 1 ÿ
‹ ‹ ˇ
ˇ
´1{2
a
A“ max ˇ X̄k,i Z̄o,li ˇ ď C0 nδ log d
ˇ
1ďiďs,1ďjďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0
" ´ ¯ 1 ˆż t ˙*
X λmin E p ě λmin es ds ,
2 0

we have

r ´ ¯2 4rC 2 n´1{2`δ log d


4 ÿ
0
}βpj ´ βj }2 ď ´ş
t
¯ 12
Πij ď
p
2
şt .
λ2min e s ds i“1 λmin p 0 es dsq
0

and
4rC02 n´1{2`δ d log d
}βp ´ β}2F ď şt .
λ2min p 0 es dsq

Therefore, it suffices to show that PpAq ě 1 ´ Oprn3{2`δ´3ν{4´νδ{2 d´1 q.


´ş ¯
t
We assume λmin 0 es ds is bounded away from 0 and r is finite, so it follows that

ˆ› żt › ˆż t ˙˙ ˆ ˇ żt ˇ ˆż t ˙˙
› › 1 ˇ ˇ 1
P ››Ep ´ es ds› ď λmin
› 2 es ds ěP r max ˇˇE p ij ´ eij,s dsˇ ď λmin
ˇ 2 es ds
0 0 1ďi,jďs 0 0
2
ě1 ´ OpC1 d´C0 C2 q.

By Lemma A.1 of (Fan, Liao, and Mincheva 2011), we have

ˆ ˆż t ˙˙
p ě 1 λmin
P λmin pEq es ds
2
ě 1 ´ OpC1 d´C0 C2 q.
2 0

2
Combining this with (3.3), we have PpAq ě 1 ´ OpC1 d´C0 C2 `1 q.

(viii) To prove (3.8), we note that

˜ ¸2
n´kn `1 r n´kn `1
n 1 ÿ ÿ n 1 ÿ
max pβk,l ´ βpk,l qX̄l,i ď max }βpk ´βk }2 }X̄i‹ }2 .
1ďkďd n ´ kn ` 2 ψ2 kn 1ďkďd n ´ kn ` 2 ψ2 kn
i“0 l“1 i“0

Then, by (3.1) with $C > \max_{1\le k\le r}\int_0^t e_{s,kk}\,ds$,

˜ ¸
n´k n `1
n 1 ÿ
‹ 2
P }X̄i } ď rC
n ´ kn ` 2 ψ2 kn i“0
˜ ˇ ˇ ¸
n´k n `1 żt żt
ˇ n 1 ÿ
‹ 2
ˇ
ěP s max ˇ }X̄i } ´ es,kk dsˇ ` r max es,kk ds ď rC
ˇ ˇ
1ďkďr ˇ n ´ kn ` 2 ψ2 kn 0 ˇ 1ďkďr 0
i“0
2
ě1 ´ OpC1 d´C0 C2 q.

By (3.5), we obtain

¨ ˜ ¸2 ˛
n´k n `1 r
n 1 ÿ ÿ 2
P ˝ max ‹
pβk,l ´ βpk,l qX̄l,i ą Cn´1{2`δ r2 log d‚ ď OpC1 d´C0 C2 `1 q.
1ďkďd n ´ kn ` 2 ψ2 kn i“0 l“1

(ix) Finally, under the event of

# ˇ ˇ +
n´k n `1 żt żt
ˇ n 1 ÿ
‹ 2
ˇ 1
max ˇ pZ̄o,li q ´ gs,ll dsˇ ď max gs,ll ds
ˇ ˇ
1ďlďd ˇ n ´ kn ` 2 ψ2 kn 0 ˇ 4 1ďlďd 0
i“0
$ ˜ ¸2 ,
n´k n `1 r
& n 1 ÿ ÿ
‹ ´1{2`δ 2
.
X max pβk,l ´ βk,l qX̄l,i
p ď Cn r log d ,
%1ďkďd n ´ kn ` 2 ψ2 kn i“0 l“1
-

according to Cauchy-Schwarz inequality, we have


ˇ ˇ
n´k n `1 ”
ˇ n 1 ÿ

ıˇ
max ˇ pȲk,i ´ pβpX̄i‹ qk qpȲl,i‹ ´ pβpX̄i‹ ql q ´ Z̄o,ki
‹ ‹
Z̄o,li
ˇ ˇ
ˇ
1ďk,lďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0
ˇ ˇ
n´k n `1
ˇ n 1 ÿ
‹ ‹ ˇ
ˇ
ď max ˇ ppβ ´ βqX̄i qk pβ ´ βqX̄i ql ˇ
ˇ p p
1ďk,lďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0
ˇ ˇ
n´k n `1
ˇ n 1 ÿ
‹ ‹ ˇ
ˇ
` 2 max ˇ Z̄o,ki ppβ ´ βqX̄i ql ˇ
ˇ p
1ďl,kďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0

n´k n `1
n 1 ÿ
ď max ppβp ´ βqX̄i‹ q2k
1ďkďd n ´ kn ` 2 ψ2 kn
i“0
g
n´k n´k
f n `1 n `1
f n 1 ÿ n 1 ÿ
` 2e max ‹
pZ̄o,ki q2 max ppβp ´ βqX̄i‹ q2k
1ďkďd n ´ kn ` 2 ψ2 kn 1ďkďd n ´ kn ` 2 ψ2 kn
i“0 i“0
dˆ ˙
´1{2`δ 2 5
ďC0 n r log d ` 2 Ct pC0 n´1{2`δ r2 log dq
4
1
a
ďC0 n´1{4`δ r log d.

Consequently, we have
ˇ ˇ
n´k n `1 ”´
ˇ n 1 ÿ

¯´ ¯ ıˇ
max ˇ Ȳk,i ´ pβpX̄i‹ qk Ȳl,i‹ ´ pβpX̄i‹ ql ´ Z̄o,ki
‹ ‹
Z̄o,li
ˇ ˇ
ˇ
1ďk,lďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0
1
a
ďC0 n´1{4`δ r log d,

` ˘
with probability 1 ´ O n3{2`δ´3ν{4´νδ{2 by (3.4) and (3.6). Finally, by triangle
inequality, we obtain

ˇ ˇ
max ˇΓ Γ
ˇp ˇ
kl ´ kl ˇ
1ďl,kďd
ˇ ˇ
n´k n `1 żt
ˇ n 1 ÿ
‹ ‹
ˇ
ď max ˇ Z̄k,i Z̄l,i ´ gs,lk dsˇ
ˇ ˇ
1ďkďr,1ďlďd ˇ n ´ kn ` 2 ψ2 kn 0 ˇ
i“0
ˇ ˇ
n´k n `1 ”´
ˇ n 1 ÿ

¯´ ¯ ıˇ
` max ˇ Ȳk,i ´ pβpX̄i‹ qk Ȳl,i‹ ´ pβpX̄i‹ ql ´ Z̄o,ki
‹ ‹
Z̄o,li ˇ,
ˇ ˇ
1ďk,lďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0

which leads to the result by (3.4).


(x) Under the event of

" *
´1{2
a
B1 “ max |Γkl ´ Γ| ď Csnδ
p log d ,
k,l

and
" b *
B2 “ C1 ă Γpkk Γ
pll ă C2 , f or all k ď d, l ď d ,

where $C_1$ and $C_2$ are constants, and $\omega_n = n_\delta^{-1/2}\sqrt{\log d}$. Since $s_{kl}(z) = s_{kl}\Big(z\,\mathbf{1}_{\{|z| > \tau\omega_n\sqrt{\hat{\Gamma}_{kk}\hat{\Gamma}_{ll}}\}}\Big)$, we have

› › ˇ ˇ
›pS ? ? ?
›Γ ´ Γ› ď max ˇΓ 1 Γ 1 Γ 1
› ˇp ˇ
kl |Γ ` kl ´ kl ˇ
MAX 1ďk,lďd kl |ěτ ωn Γkk Γll |Γ̂kl |ěτ ωn Γkk Γll |Γ̂kl |ďτ ωn Γkk Γll
p p p p p p p
ˇ
ď max ˇskl pΓ pkl q ´ Γ ? ? ?
p ll ` |Γkl ´ Γkl |1|Γ̂kl |ďτ ωn Γ p ll ` Γkl 1|Γ̂kl |ďτ ωn Γ
ˇ pkl |1 p
1ďk,lďd |Γ̂kl |ěτ ωn Γ p kk Γ p kk Γ pk
b b
ďCωn Γ pkk Γ
pll 1
|Γ̂kl |ěCωn θ1 ` Cωn 1|Γ̂kl |ěCωn θ1 ` Cωn Γ
pkk Γ pll

ďCωn .

Then we obtain
› › a
›pS ´1{2
›Γ ´ Γ› ď Crnδ log d,

MAX

2
with probability at least 1 ´ OpC1 d´C0 C2 `2 q.
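
To make the thresholding map $s_{kl}(\cdot)$ used in part (x) more concrete, here is a small illustrative sketch. Assumptions: the soft, hard, and adaptive-lasso (AL) parameterizations follow Rothman, Levina, and Zhu (2009), the SCAD rule is omitted for brevity, and the entry-wise threshold is taken as $\tau\omega_n\sqrt{\hat{\Gamma}_{kk}\hat{\Gamma}_{ll}}$ as in the proof; this is not the implementation used in the empirical section.

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard(z, lam):
    return z * (np.abs(z) > lam)

def adaptive_lasso(z, lam, eta=3.0):
    # one common parameterization (Rothman, Levina, and Zhu 2009):
    # s_lam(z) = z * max(1 - |lam / z|**eta, 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        factor = np.where(np.abs(z) > 0,
                          np.maximum(1.0 - np.abs(lam / z) ** eta, 0.0),
                          0.0)
    return z * factor

def threshold_residual_cov(Gamma_hat, tau, omega_n, rule=soft):
    """Apply a generalized thresholding rule entry-wise (off-diagonal only),
    with entry (k, l) thresholded at tau * omega_n * sqrt(Gamma_kk * Gamma_ll)."""
    d = Gamma_hat.shape[0]
    scale = np.sqrt(np.outer(np.diag(Gamma_hat), np.diag(Gamma_hat)))
    lam = tau * omega_n * scale
    out = rule(Gamma_hat, lam)
    out[np.arange(d), np.arange(d)] = np.diag(Gamma_hat)  # keep the diagonal intact
    return out
```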

Lemma 11. Under Assumptions 1 - 4, and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, we have
$$(i)\quad \mathbb{P}\left(\big\|\beta(\hat{E}-E)\beta^\intercal\big\|_\Sigma^2 + \big\|\beta\hat{E}(\hat{\beta}-\beta)^\intercal\big\|_\Sigma^2 \ge C_0\, d^{-1}n^{-1/2+\delta}\log d\right) = O\!\left(C_1 d^{-C_0^2 C_2+1}\right),$$
and
$$(ii)\quad \mathbb{P}\left(\big\|(\hat{\beta}-\beta)\hat{E}(\hat{\beta}-\beta)^\intercal\big\|_\Sigma^2 \ge C_0\, d\, n^{-1+2\delta}\log^2 d\right) = O\!\left(C_1 d^{-C_0^2 C_2+1}\right).$$

Proof of Lemma 11. (i) For the first part, using the same argument as in the proof of Theorem 2 in (Fan, Fan, and Lv 2008), we have

› | ´1 › › ›
›β Σ β › ď 2 ›cov ´1 pXq› .

Therefore

› ›2 ´ ¯
|› ´1 ´1{2 | ´1 | ´1{2
›βpE ´ Eqβ › “d tr Σ βpE ´ Eqβ Σ βpE ´ Eqβ Σ
› p p p
Σ
´ ¯
“d´1 tr pEp ´ Eqβ | Σ´1 βpEp ´ Eqβ | Σ´1 β
› ›2
ďd´1 ›pE ´ Eqβ | Σ´1 β ›
› p ›
F
› ›2
ďOpd´1 q ›E ´ E› .
›p ›
F

On the other hand, we also have

› ›2 1 › › › ›
›β Epβ ´ βq| › ď ›β | Σ´1 β Ep
p βp ´ βq›› ››EΣ p ´1 pβp ´ βq| ››
› p p › ›
Σ d F F
1 ›› | ´1 ›› ›› p ››2 ›› p ›2
ď β Σ β F ›E› ›β ´ β › .

d F F

Then, by Lemma 10 ((3.2) and (3.6)), together with $\mathbb{P}\big(\|\hat{E}\|_F^2 > C\big) = O\big(C_1 r^2 d^{-C_0^2 C_2}\big)$, we can get the final results.


(ii) For the second part, we have

p βp ´ βq| ›› “ 1 tr pβp ´ βqEp


› ›2 ´ ¯
›pβ ´ βqEp
› p p βp ´ βq| Σ´1 pβp ´ βqEp
p βp ´ βq| Σ´1
Σ d
1 ›› ›2
p βp ´ βq| Σ´1 ››
ď ›pβp ´ βqEp
d F
1 2 ´1 2
› ›4
ď λMAX pΣ qλMAX pEq ›β ´ β › .
p ›p ›
d F

Since $\lambda^2_{\mathrm{MAX}}(\Sigma^{-1})$ and $\lambda^2_{\mathrm{MAX}}(\hat{E})$ are both bounded, the result follows from (3.6).

Lemma 12. Under Assumptions 1 - 4, and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, we have

(i) $\mathbb{P}\left(\big\|\hat{\Gamma}^{\mathcal{S}} - \Gamma\big\| > C_0\, m_d\, n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}\right) = O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$,  (3.12)

(ii) $\mathbb{P}\left(\lambda_{\min}\big(\hat{\Gamma}^{\mathcal{S}}\big) \ge \tfrac12\lambda_{\min}(\Gamma)\right) > 1 - O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$,  (3.13)

(iii) $\mathbb{P}\left(\big\|(\hat{\Gamma}^{\mathcal{S}})^{-1} - \Gamma^{-1}\big\| > C_0\, m_d\, n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}\right) = O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$,  (3.14)

(iv) $\mathbb{P}\left(\big\|\hat{\beta}^\intercal(\hat{\Gamma}^{\mathcal{S}})^{-1}\hat{\beta} - \beta^\intercal\Gamma^{-1}\beta\big\| > C_0\, m_d\, n^{-(1/4-\delta/2)(1-q)}\cdot d(\log d)^{(1-q)/2}\right) = O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$,  (3.15)

(v) $\mathbb{P}\left(\big\|\big(\hat{E}^{-1} + \hat{\beta}^\intercal(\hat{\Gamma}^{\mathcal{S}})^{-1}\hat{\beta}\big) - \big(E^{-1} + \beta^\intercal(\Gamma^{\mathcal{S}})^{-1}\beta\big)\big\| > C_0\, m_d\, n^{-(1/4-\delta/2)(1-q)}\, d(\log d)^{(1-q)/2}\right) = O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$,  (3.16)

(vi) $\mathbb{P}\left(\big\|\big(\hat{E}^{-1} + \hat{\beta}^\intercal(\hat{\Gamma}^{\mathcal{S}})^{-1}\hat{\beta}\big)^{-1}\big\| > C_0\, m_d\, d^{-1}\right) = O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$,  (3.17)

(vii) $\mathbb{P}\left(\big\|\hat{\beta}\big(\hat{E}^{-1} + \hat{\beta}^\intercal(\hat{\Gamma}^{\mathcal{S}})^{-1}\hat{\beta}\big)^{-1}\hat{\beta}^\intercal(\hat{\Gamma}^{\mathcal{S}})^{-1}\big\| > C_0\, m_d\right) = O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$,  (3.18)

(viii) $\mathbb{P}\left(\big\|\hat{\beta}\big(\hat{E}^{-1} + \hat{\beta}^\intercal(\hat{\Gamma}^{\mathcal{S}})^{-1}\hat{\beta}\big)^{-1}\hat{\beta}^\intercal\Gamma^{-1}\big\| > C_0\, m_d\right) = O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$.  (3.19)

Proof of Lemma 12. (i) Since $\hat{\Gamma}^{\mathcal{S}} - \Gamma$ is symmetric, its operator norm is bounded by the $\infty$-norm:
$$\big\|\hat{\Gamma}^{\mathcal{S}} - \Gamma\big\| \le \max_{1\le l\le d}\sum_{k=1}^{d}\big|\hat{\Gamma}^{\mathcal{S}}_{lk} - \Gamma_{lk}\big|.$$
Then, using the same technique as in the proof of (3.10), we can show that, with probability no less than $1 - O\big(n^{3/2+\delta-3\nu/4-\nu\delta/2}\big)$, we have
$$\big\|\hat{\Gamma}^{\mathcal{S}} - \Gamma\big\| \le C_0\, m_d\, n^{-(1/4-\delta/2)(1-q)}(\log d)^{(1-q)/2}.$$

Proofs of (3.13)-(3.19) are similar to those of Lemma 5 in (Fan, Furger, and Xiu 2016); therefore we omit the details.

Proof of Theorem 3. Based on Lemma 10, and following the same steps as those in the proof of Theorem 1 in (Fan, Furger, and Xiu 2016), we can obtain
$$\big\|\hat{\Sigma}_{\mathrm{TSR}} - \Sigma\big\|_{\mathrm{MAX}} = O\!\left(n_\delta^{-1/2}\sqrt{\log d}\right).$$

For the next part, we prove the convergence results under the $\Sigma$ norm:
$$\big\|\hat{\Sigma}_{\mathrm{TSR}} - \Sigma\big\|_\Sigma^2 \le 4\big\|\beta(\hat{E}-E)\beta^\intercal\big\|_\Sigma^2 + 24\big\|\beta\hat{E}(\hat{\beta}-\beta)^\intercal\big\|_\Sigma^2 + 16\big\|(\hat{\beta}-\beta)\hat{E}(\hat{\beta}-\beta)^\intercal\big\|_\Sigma^2 + 2\big\|\hat{\Gamma}^{\mathcal{S}} - \Gamma\big\|_\Sigma^2. \qquad (3.20)$$

Finally, we have
$$\big\|\hat{\Gamma}^{\mathcal{S}} - \Gamma\big\|_\Sigma = d^{-1/2}\big\|\Sigma^{-1/2}\big(\hat{\Gamma}^{\mathcal{S}} - \Gamma\big)\Sigma^{-1/2}\big\|_F \le \big\|\Sigma^{-1/2}\big(\hat{\Gamma}^{\mathcal{S}} - \Gamma\big)\Sigma^{-1/2}\big\| \le \big\|\hat{\Gamma}^{\mathcal{S}} - \Gamma\big\|\,\lambda_{\max}\big(\Sigma^{-1}\big).$$

Then, based on (3.20), Lemma 11 and Lemma 12 (3.12), and the fact that
$$d^{-1}n^{-1/2+\delta}\log d + dn^{-1+2\delta}\log^2 d + m_d^2\, n^{-(1/2-\delta)(1-q)}(\log d)^{1-q} = O\!\left(dn^{-1+2\delta}\log^2 d + m_d^2\, n^{-(1/2-\delta)(1-q)}(\log d)^{1-q}\right),$$
we prove that
$$\big\|\hat{\Sigma}_{\mathrm{TSR}} - \Sigma\big\|_\Sigma = O_p\!\left(d^{1/2}n^{-1/2+\delta}\log d + n^{-(1/4-\delta/2)(1-q)}m_d(\log d)^{(1-q)/2}\right).$$

On the other hand, if we do not assume the factor structure and instead use a direct pre-averaging estimator $\hat{\Sigma}^{\star}$, then we get
$$\big\|\hat{\Sigma}^{\star} - \Sigma\big\|_\Sigma^2 \le C\big\|\beta(\hat{E}-E)\beta^\intercal\big\|_\Sigma^2 + C\big\|\beta\bar{X}\bar{Z}^\intercal\big\|_\Sigma^2 + C\big\|\bar{Z}\bar{Z}^\intercal - \Gamma\big\|_\Sigma^2.$$
According to the proof of Lemma 11, we obtain $\big\|\beta(\hat{E}-E)\beta^\intercal\big\|_\Sigma^2 = O_p\big(d^{-1}n^{-1/2+\delta}\log d\big)$ and $\big\|\beta\bar{X}\bar{Z}^\intercal\big\|_\Sigma^2 = O_p\big(d^{-1}n^{-1/2+\delta}\log d\big)$. We can also get $\big\|\bar{Z}\bar{Z}^\intercal - \Gamma\big\|_\Sigma^2 = O_p\big(dn^{-1/2+\delta}\log d\big)$. Therefore $\big\|\hat{\Sigma}^{\star} - \Sigma\big\|_\Sigma = O_p\big(d^{1/2}n_\delta^{-1/2}\sqrt{\log d}\big)$, which has a slower convergence rate than our estimator.

For the inverse part, by the localization argument, we only need to prove the result under the stronger assumption that the entry-wise norms of all the processes are bounded uniformly on $[0,t]$. By the Sherman-Morrison-Woodbury formula, we have

p TSR q´1 ´ Σ´1



´ ¯ ´ ¯´1
pS q´1 ´ Γ´1 `
ď pΓ pS q´1 ´ Γ´1 βp E
pΓ pS q´1 βp
p ´1 ` βp| pΓ pS q´1
βp| pΓ
´ ¯ ´ ¯´1
` pS q´1 ´ Γ´1 βp E
pΓ pS q´1 βp
p ´1 ` βp| pΓ βp| Γ´1
´ ¯´1 ´ ¯´1
` Γ´1 pβp ´ βq E pS q´1 βp
p ´1 ` βp| pΓ βp| Γ´1 ` Γ´1 pβp ´ βq E pS q´1 βp
p ´1 ` βp| pΓ β | Γ´1
ˆ´ ¯´1 ` ˙
| S | S
˘´1
´1
` Γ β E ´1
p ` βp pΓ ´1
p q βp ´1
´ E ` β pΓ q β ´1
β | Γ´1

:“L1 ` L2 ` L3 ` L4 ` L5 ` L6 .

We now bound each term above with probability no less than 1´Opn3{2`δ´3ν{4´νδ{2 q.
First of all, by (3.12)

L1 ď Cmd n´p1{4´δ{2qp1´qq plog dqp1´qq{2 .

To bound L2 , by (3.14) and (3.18), we have

´ ¯´1
pS q´1 ´ Γ´1 ¨ βp E
L2 ď pΓ pS q´1 βp
p ´1 ` βp| pΓ pS q´1 ď Cm2 n´p1{4´δ{2qp1´qq plog dqp1´qq{2 .
βp| pΓ d

Similarly, L3 can be bounded using (3.14) and (3.19).


Next, for L4 , we use (3.6), and (3.17), k¨k ď k¨kF , and λmin pΓq is bounded below
by some constant,

´ ¯´1
2 pS q´1 βp
L4 ď Γ´1 ¨ βp ´ β ¨ βp ¨ p ´1 ` βp| pΓ
E ď Cm2d n´p1{4´δ{2qp1´qq plog dqp1´qq{2 .

?
Similarly, using the fact that kβk ď kβkF “ Op dq, we can establish the same bound
for L5 .
Finally, we have

´ ¯´1 `
´1 2 2 | p S ´1 p
˘´1
L6 ď Γ ¨ kβk ¨ ´1
E ` β pΓ q β
p p ´ E´1 ` β | pΓS q´1 β
´ ¯ `
´1 2 2
˘
ď Γ ¨ kβk ¨ E ` β pΓ q β ´ E´1 ` β | pΓS q´1 β
p ´1 p| p S ´1 p

´ ¯´1 ` ˘´1
¨ E pS q´1 βp
p ´1 ` βp| pΓ ¨ E´1 ` β | Γ´1 β .

Note that for any vector v such that kvk “ 1, by the definition of operator norm, we
have

v | β | Γ´1 βv ě λmin pΓ´1 qv | β | βv ě λmin pΓ´1 qλmin pβ | βq.

It then follows that

λmin pβ | Γ´1 βq ě λmin pΓ´1 qλmin pβ | βq.

On the other hand, by Assumption 3, we have

1 | | 1 1
v β βv “ v | Bv ´ v | pB ´ β | βqv ě λmin pBq ´ β | β ´ B ą C,
d d d

where C is some constant. Thus, λmin pβ | βq ą Cd. Therefore λmin pβ | Γ´1 βq ą Cd,
following from the fact that λmax pΓq ď Kmd . It then implies that λmin pE´1 `
β | Γ´1 βq ě λmin pβ | Γ´1 βq ą Cm´1
d d.

` ˘´1
E´1 ` β | Γ´1 β “ Op pmd d´1 q.

Using (3.16) and (3.17) , we have

L6 ď Cm3d n´p1{4´δ{2qp1´qq plog dqp1´qq{2 .

Finally, combining these results, we can obtain, for some constant C ą 0,

` ˘
p TSR q´1 ´ Σ´1 ď C m3 n´p1{4´δ{2qp1´qq plog dqp1´qq{2 ` m2 n´p1{4´δ{2qp1´qq plog dqp1´qq{2 .
pΣ d d

Since the second term on the right-hand side is dominated by the first, the bound reduces to the first term, which yields the desired result.
To prove the second statement, note that for any vector $v$ with $\|v\| = 1$, we have
$$v^\intercal\hat{\Sigma}_{\mathrm{TSR}}v = v^\intercal\hat{\beta}\hat{E}\hat{\beta}^\intercal v + v^\intercal\hat{\Gamma}^{\mathcal{S}}v \ge \lambda_{\min}\big(\hat{\Gamma}^{\mathcal{S}}\big),$$
which implies that
$$\lambda_{\min}\big(\hat{\Sigma}_{\mathrm{TSR}}\big) \ge \lambda_{\min}\big(\hat{\Gamma}^{\mathcal{S}}\big).$$
This inequality, combined with (3.13) of Lemma 12, concludes the proof.
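
The Sherman-Morrison-Woodbury identity used repeatedly above can be checked numerically. The sketch below is illustrative only; the diagonal choices for $E$ and $\Gamma$ and all variable names are assumptions made for the example. It inverts a low-rank-plus-residual covariance matrix through the $r\times r$ core matrix rather than the full $d\times d$ inverse, which is the computational point of the identity.

```python
import numpy as np

def woodbury_inverse(beta_hat, E_hat, Gamma_S_hat):
    """Inverse of Sigma = beta E beta' + G via
    Sigma^{-1} = G^{-1} - G^{-1} beta (E^{-1} + beta' G^{-1} beta)^{-1} beta' G^{-1},
    where G = Gamma_S_hat."""
    G_inv = np.linalg.inv(Gamma_S_hat)                           # d x d (typically sparse/diagonal)
    core = np.linalg.inv(E_hat) + beta_hat.T @ G_inv @ beta_hat  # r x r
    middle = np.linalg.inv(core)                                 # only an r x r inverse
    GB = G_inv @ beta_hat
    return G_inv - GB @ middle @ GB.T

# numerical check against direct inversion
rng = np.random.default_rng(2)
d, r = 200, 3
beta = rng.standard_normal((d, r))
E = np.diag(rng.uniform(1.0, 2.0, r))
Gamma = np.diag(rng.uniform(0.5, 1.5, d))
Sigma = beta @ E @ beta.T + Gamma
err = np.max(np.abs(woodbury_inverse(beta, E, Gamma) - np.linalg.inv(Sigma)))
print(err)   # close to machine precision
```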

3.1.2 Proof of Theorem 4

The proof of Theorem 4 follows the same arguments as that of Theorem 5.

3.1.3 Proof of Theorem 5

We note that

q ‹ ´ X̄ ‹ “ pβ | βq´1 β | Z̄ ‹ .
X

Similarly to the proofs of (3.1)-(3.3), and by Hoeffding's inequality, we have

b b
| ‹
}β Z̄ } “ }β Z̄ Z̄ β} ď }β | pZ̄ ‹ Z̄ ‹| ´ Γqβ} ` }β | Γβ}
| ‹ ‹|
b
ď }β | }}Z̄ ‹ Z̄ ‹| ´ Γ}}β} ` }β | }}Γ}}β}
b
´p1´qq{2
ďC dmd nδ plog dqp1´qq{2 ` dmd
1{2 ´p1´qq{4 1{2
ďCd1{2 md nδ plog dqp1´qq{4 ` d1{2 md ,

where we use $\|\Gamma\| \le m_d$, and $\big\|\bar{Z}^{\star}\bar{Z}^{\star\intercal} - \Gamma\big\| = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big)$. Then we have

› › }β | Z̄ ‹ }
›X ´ X̄ › ď}pβ | βq´1 }}β | Z̄ ‹ } ď
› q‹ ‹›
λmin pβ | βq
}β | Z̄ ‹ }d´1 1{2 ´p1´qq{4 1{2
ď |
ď Cd´1{2 md nδ plog dqp1´qq{4 ` d´1{2 md .
λmin pd β βq
´1

Moreover, we note that

˜ ¸2
n´k
ÿ n `1 r
ÿ
max p ‹ ´ X̄ ‹ q
βk,l pX l,i l,i ď max }βk }2 }X
q ‹ ´ X̄ ‹ }2
F
1ďkďd 1ďkďd,1ďiďn
i“0 l“1

ď max }βk }2 r}X


q ‹ ´ X̄ ‹ }2 ,
1ďkďd,1ďiďn

´p1´qq{2
ďCd´1 md nδ plog dqp1´qq{2 ` d´1 md .

Using these estimates, we obtain


ˇ ˇ
ˇn´k n `1 ” ıˇ
ˇ ÿ ‹ p ‹ qk qpȲ ‹ ´ pβ X
p ‹ ql q ´ Z̄ ‹ Z̄ ‹ ˇˇ
max ˇ pȲk,i ´ pβ X i l,i i o,ki o,li
1ďk,lďd ˇ ˇ
i“0
ˇ ˇ ˇ ˇ
ˇn´k n `1 ˇ ˇn´k n `1 ˇ
ˇ ÿ ‹ ‹ ‹ ‹
ÿ
‹ ‹ ‹
ď max ˇ pβpX ´ X̄i qqk pβpX ´ X̄i qql ˇ ` 2 max ˇ Z̄o,ki pβpX ´ X̄i qql ˇ
q q ˇ ˇ q ˇ
1ďk,lďd ˇ ˇ 1ďl,kďd ˇ ˇ
i“0 i“0
g
n´k n´k n´k
n `1
f n `1 n `1
ÿ f ÿ ÿ
ď max pβpXq ‹ ´ X̄ ‹ qq2 ` 2e max pZ̄ ‹
q 2 max pβpX q ‹ ´ X̄ ‹ qq2
i k o,ki i k
1ďkďd 1ďkďd 1ďkďd
i“0 i“0 i“0

1{2 ´p1´qq{4 1{2


ďCd´1{2 md nδ plog dqp1´qq{4 ` d´1{2 md . (3.21)

Therefore, according to (3.21) and (3.4), and using the triangle inequality, we have

ˇ ˇ
ˇp1
max ˇΓkl ´ Γkl ˇ
ˇ
1ďl,kďd
ˇ ˇ
n´k n `1 żt
ˇ n 1 ÿ
‹ ‹
ˇ
ď max ˇ Z̄k,i Z̄l,i ´ gs,lk dsˇ
ˇ ˇ
1ďkďr,1ďlďd ˇ n ´ kn ` 2 ψ2 kn 0 ˇ
i“0
ˇ ˇ
n´k n `1 ”´
ˇ n 1 ÿ

¯´
q ‹ qk Ȳ ‹ ´ pβ X
¯
q ‹ ql ´ Z̄ ‹ Z̄ ‹ ˇˇ
ıˇ
` max ˇ Ȳk,i ´ pβ X
ˇ
i l,i i o,ki o,li
1ďk,lďd ˇ n ´ kn ` 2 ψ2 kn ˇ
i“0
´ a ¯
´1{2 1{2 ´p1´qq{4 1{2
ďC nδ log d ` Cd´1{2 md nδ plog dqp1´qq{4 ` d´1{2 md
´ a ¯
´1{2 ´1{2 1{2
ďC nδ log d ` d md .

The remaining steps are similar to those in the TSR case. This concludes the proof.

3.1.4 Proof of Theorem 6

Proposition 1. Suppose that Assumptions 1 - 4 hold. Also, assume that $\|E\|_{\mathrm{MAX}} \le K$ and $\|\Gamma\|_{\mathrm{MAX}} \le K$ almost surely for some constant $K$, $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, and $d^{-1/2}m_d = o(1)$. Then $r$, $\beta E\beta^\intercal$, and $\Gamma$ can be identified as $d\to\infty$. That is, $\bar{r} = r$ if $d$ is sufficiently large. Moreover, we have
$$\Big\|\sum_{j=1}^{\bar{r}}\lambda_j\xi_j\xi_j^\intercal - \beta E\beta^\intercal\Big\|_{\mathrm{MAX}} \le Cd^{-1/2}m_d, \qquad\text{and}\qquad \Big\|\sum_{j=\bar{r}+1}^{d}\lambda_j\xi_j\xi_j^\intercal - \Gamma\Big\|_{\mathrm{MAX}} \le Cd^{-1/2}m_d,$$
where $\{\lambda_j,\ 1\le j\le d\}$ and $\{\xi_j,\ 1\le j\le d\}$ are the eigenvalues of $\Sigma$ and their corresponding eigenvectors, and $\bar{r} = \arg\min_{1\le j\le d}\big(\tfrac{\lambda_j}{d} + jd^{-1/2}m_d\big) - 1$.
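
A direct, if simplified, reading of Proposition 1 is the following sketch, which selects $\bar{r}$ from the eigenvalue criterion and splits the matrix into its systematic and residual parts. Assumptions: $m_d$ is supplied by the user, the input is a finite-sample covariance estimate rather than the population $\Sigma$, and the function name is ours.

```python
import numpy as np

def split_by_pca(Sigma, m_d):
    """Split Sigma into a low-rank (systematic) part and a residual part,
    with the number of factors chosen as in Proposition 1:
    r_bar = argmin_{1<=j<=d} (lambda_j / d + j * d**-0.5 * m_d) - 1."""
    d = Sigma.shape[0]
    lam, xi = np.linalg.eigh(Sigma)          # ascending order
    lam, xi = lam[::-1], xi[:, ::-1]         # descending eigenvalues/eigenvectors
    j = np.arange(1, d + 1)
    # argmin is 0-based, so it already equals (j* - 1) = r_bar
    r_bar = int(np.argmin(lam / d + j * d ** -0.5 * m_d))
    low_rank = (xi[:, :r_bar] * lam[:r_bar]) @ xi[:, :r_bar].T
    residual = Sigma - low_rank
    return r_bar, low_rank, residual
```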

Lemma 13. Suppose Assumptions 1 - 4 hold and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$; then we have
$$\max_{1\le k,l\le r}\left|\frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\sum_{i=0}^{n-k_n+1}\bar{X}_{ik}\bar{X}_{il} - \int_0^t e_{s,kl}\,ds\right| = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big), \qquad (3.22)$$
$$\max_{1\le k\le r,\,1\le l\le d}\left|\frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\sum_{i=0}^{n-k_n+1}\bar{X}_{ik}\bar{Z}^{\star}_{u,li}\right| = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big), \qquad (3.23)$$
$$\max_{1\le k,l\le d}\left|\frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n}\sum_{i=0}^{n-k_n+1}\bar{Z}^{\star}_{u,ki}\bar{Z}^{\star}_{u,li} - \int_0^t g_{s,kl}\,ds\right| = O_p\big(n_\delta^{-1/2}\sqrt{\log d}\big). \qquad (3.24)$$

Recall that
$$\Lambda = \mathrm{Diag}\big(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_r\big), \qquad F = d^{1/2}\big(\hat{\xi}_1, \hat{\xi}_2, \dots, \hat{\xi}_r\big), \qquad\text{and}\qquad G = d^{-1}F^\intercal Y.$$
We write
$$H = \frac{n}{n-k_n+2}\frac{1}{\psi_2 k_n t}\,\bar{X}^{\star}\bar{X}^{\star\intercal}\beta^\intercal F\Lambda^{-1}.$$
It is easy to verify that
$$\hat{\Sigma}F = F\Lambda, \qquad GG^\intercal = td^{-1}\Lambda, \qquad F^\intercal F = d\,I_r, \qquad\text{and}\qquad \hat{\Gamma} = \frac1t(Y - FG)(Y - FG)^\intercal = \frac1t YY^\intercal - \frac1d F\Lambda F^\intercal.$$
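
These identities can be verified mechanically. The sketch below is a schematic only: it uses a plain sample covariance $YY^\intercal/t$ in place of the pre-averaged estimator, and the variable names are assumptions. It constructs $F$, $G$, $\Lambda$, and $\hat{\Gamma}$ and asserts the relations recalled above.

```python
import numpy as np

def pca_factor_decomposition(Y, t, r):
    """Y: d x n matrix of (pre-averaged) returns; t: time span; r: number of factors.
    Returns F (d x r), G (r x n), Lambda (r x r), and the residual covariance Gamma_hat."""
    d = Y.shape[0]
    Sigma_hat = Y @ Y.T / t
    lam, xi = np.linalg.eigh(Sigma_hat)
    lam, xi = lam[::-1][:r], xi[:, ::-1][:, :r]   # leading r eigenpairs
    Lambda = np.diag(lam)
    F = np.sqrt(d) * xi                           # so that F'F = d * I_r
    G = F.T @ Y / d                               # G = d^{-1} F'Y
    Gamma_hat = (Y - F @ G) @ (Y - F @ G).T / t
    # sanity checks of the identities recalled above
    assert np.allclose(Sigma_hat @ F, F @ Lambda)
    assert np.allclose(Gamma_hat, Sigma_hat - F @ Lambda @ F.T / d)
    return F, G, Lambda, Gamma_hat

# toy usage
Y = np.random.default_rng(0).standard_normal((100, 500))
F, G, Lambda, Gamma_hat = pca_factor_decomposition(Y, t=1.0, r=3)
```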

Lemma 14. Suppose Assumptions 1 - 4 hold with $\lambda_{ij} = O\big(n_\delta^{-1/2}\sqrt{\log d}\big)$. Suppose $d^{-1/2}m_d = o(1)$, $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, and $\hat{r} \to r$ with probability approaching 1. Then there exists an $r\times r$ matrix $H$ such that, with probability approaching 1, $H$ is invertible, $\|HH^\intercal - I_r\| = \|H^\intercal H - I_r\| = o_p(1)$, and, more importantly,
$$\|F - \beta H\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big), \qquad \big\|G - H^{-1}\bar{X}\big\| = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big).$$
Lemma 15. Under Assumptions 1 - 4, $d^{-1/2}m_d = o(1)$, and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, we have
$$\|F - \beta H\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big), \qquad (3.25)$$
$$\big\|H^{-1}\big\| = O_p(1), \qquad (3.26)$$
$$\big\|G - H^{-1}\bar{X}\big\| = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big). \qquad (3.27)$$

Lemma 16. Under Assumptions 1 - 4, $d^{-1/2}m_d = o(1)$, and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, we have
$$\big\|\tilde{\Gamma}^{\mathcal{S}} - \Gamma\big\|_{\mathrm{MAX}} \le \big\|\hat{\Gamma} - \Gamma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big). \qquad (3.28)$$

Lemma 17. Under Assumptions 1 - 4, $d^{-1/2}m_d = o(1)$, and $n^{-1/2+\delta}\log d = o(1)$, we have
$$\Big\|\frac1t FGG^\intercal F^\intercal - \beta E\beta^\intercal\Big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big).$$

Proof of Proposition 1 and Lemmas 13-17. The proofs follow the same arguments as in (Aït-Sahalia and Xiu 2017); thus we omit the details.

Lemma 18. Under Assumptions 1 - 4, $d\big(n_\delta^{-1/2}\sqrt{\log d}\big)^2 = o(1)$, $n^{\delta-1/2}\log d = o(1)$, and $d^{-1/2}m_d = o(1)$, we have

(i) $\|F - \beta H\|_F^2 = O_p\Big(d\big(n_\delta^{-1/2}\sqrt{\log d}\big)^2 + m_d^2\Big)$,  (3.29)

(ii) $\big\|(F-\beta H)(F-\beta H)^\intercal\big\|_\Sigma^2 = O_p\big(dn^{2\delta-1}\log^2 d + d^{-1}m_d^4\big)$,  (3.30)

(iii) $\big\|\beta H(F-\beta H)^\intercal\big\|_\Sigma^2 = O_p\big(n^{\delta-1/2}\log d + d^{-1}m_d^2\big)$,  (3.31)

(iv) $\big\|\beta(H^\intercal H - I_r)\beta^\intercal\big\|_\Sigma^2 = o_p(1)$.  (3.32)

Proof of Lemma 18. (i) We have kF ´ βHk2F ď d kF ´ βHk2MAX .
(ii) According to the definition of } ¨ }Σ , we have

1
kpF ´ βHqpF ´ βHq| k2Σ “ Op p }F ´ βH}4F q “ Op pd kF ´ βHk4MAX q.
d

(iii) By kβ | Σ´1 βk “ Op1q, We have

1
kβHpF ´ βHq| k2Σ “ trpHpF ´ βHq| Σ´1 pF ´ βHqHβ | Σ´1 βq
d
1
ď kHk2 β | Σ´1 β Σ´1 kF ´ βHk2F
d
“Op pkF ´ βHk2MAX q.

(iv) by kβ | Σ´1 βk “ Op1q, We have

1
kβpH | H ´ Ir qβ | k2Σ ď trtpH | H ´ Ir qβ | Σ´1 βpH | H ´ Ir qβ | Σ´1 βu
d
1 2
ď β | Σ´1 β }H | H ´ Ir }2F
d
“op p1q.

Lemma 19. Under Assumptions 1 - 4, $d^{-1/2}m_d = o(1)$, and $n_\delta^{-1/2}\sqrt{\log d} = o(1)$, we have
$$\big\|\tilde{\Gamma}^{\mathcal{S}} - \Gamma\big\| = O_p\Big(m_d\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big)^{1-q}\Big). \qquad (3.33)$$
Moreover, if in addition $d^{-1/2}m_d^2 = o(1)$ and $m_d\, n_\delta^{-1/2}\sqrt{\log d} = o(1)$ hold, then $\lambda_{\min}\big(\tilde{\Gamma}^{\mathcal{S}}\big)$ is bounded away from 0 with probability approaching 1, and
$$\big\|\big(\tilde{\Gamma}^{\mathcal{S}}\big)^{-1} - \Gamma^{-1}\big\| = O_p\Big(m_d\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big)^{1-q}\Big). \qquad (3.34)$$

pS ´ Γ is symmetric,
Proof of Lemma 19. Note that since Γ

d ˇ
ÿ ˇ
rS ´ Γ ď Γ
Γ rS ´ Γ “ max
ˇpS
ˇΓlk ´ Γlk ˇ .
ˇ
8 1ďlďd
k“1

By Lemma 16, and using the same technique as proving (3.10), we have

´ a ¯
rS ´ Γ “ Op md S ´q pn´1{2 log d ` d´1{2 md q ` md S 1´q .
Γ δ

Choosing $\lambda_{ij} = M'\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big)$, where $M'$ is some positive constant, we have

´ a ¯
S ´1{2 ´1{2 1´q
Γ ´ Γ “ Op md pnδ
r log d ` d md q .

Moreover, since λmin pΓq ą K for some constant K and by Weyl’s inequality, we
rS q ą K ´ op p1q. As a result, we have
have λmin pΓ

´ ¯´1 ´ ¯´1 ´ ´ ¯¯
S ´1 S rS Γ´1 ď λmin pΓ
rS q´1 λmin pΓq´1 Γ ´ Γ
rS
Γ
r ´Γ “ Γr Γ´ Γ
´ a ¯
´1{2 ´1{2 1´q
ďOp md pnδ log d ` d md q .

Proof of Theorem 6. Note that
$$\hat{\Sigma}_{\mathrm{PCA}} = \frac1d F\Lambda F^\intercal + \tilde{\Gamma}^{\mathcal{S}} = \frac1t FGG^\intercal F^\intercal + \tilde{\Gamma}^{\mathcal{S}}.$$
By Lemma 16, we have
$$\big\|\tilde{\Gamma}^{\mathcal{S}} - \Gamma\big\|_{\mathrm{MAX}} = O_p\big(n_\delta^{-1/2}\sqrt{\log d} + d^{-1/2}m_d\big).$$
By the triangle inequality, we have
$$\big\|\hat{\Sigma}_{\mathrm{PCA}} - \Sigma\big\|_{\mathrm{MAX}} \le \frac1d\big\|F\Lambda F^\intercal - \beta E\beta^\intercal\big\|_{\mathrm{MAX}} + \big\|\tilde{\Gamma}^{\mathcal{S}} - \Gamma\big\|_{\mathrm{MAX}}.$$
Therefore, the desired result follows from Lemmas 16 and 17.


Using Lemma 18, for some constant C, we have

” ı
p PCA ´ Σ}2 ďC kβpH | H ´ Ir qβ | k2 ` kβHpF ´ βHq| k2 ` kpF ´ βHqpF ´ βHq| k2 ` Γ
}Σ rS ´ Γ
Σ Σ Σ Σ

´1{2
a 1 ´1{2
a
“Op pdpnδ log dq4 ` m4d ` m2d pnδ log d ` d´1{2 md q2p1´qq q.
d

For the inverse, first, by Lemma 19 and the fact that $\lambda_{\min}\big(\hat{\Sigma}_{\mathrm{PCA}}\big) \ge \lambda_{\min}\big(\tilde{\Gamma}^{\mathcal{S}}\big)$, we can establish the first two statements.

To bound $\big(\hat{\Sigma}_{\mathrm{PCA}}\big)^{-1} - \Sigma^{-1}$, by the Sherman-Morrison-Woodbury formula, we have

´ ¯´1 ´ ¯´1
Σp PCA ´ Σ r
´ ¯´1 ` ˘´1
“ t´1 F GG| F | ` Γ rS ´ t´1 βHH ´1 X̄ ‹ X̄ ‹| pH ´1 q| H | β | ` Γ
´ ¯ ´ ¯ ´ ¯´1
“ pΓrS q´1 ´ Γ´1 ´ pΓ rS q´1 ´ Γ´1 F dΛ´1 ` F | pΓ rS q´1 F rS q´1
F | pΓ
´ ¯´1 ´ ¯
´1 ´1 | r S ´1 | S ´1 ´1
´ Γ F dΛ ` F pΓ q F F pΓ q ´ Γ
r
´ ¯´1
| ´1
|
` ˘ | | ´1
´1
` Γ pβH ´ F q tH X̄ X̄ H ` H β Γ βH H | β | Γ´1
´ ¯´1
| ´1
|
` ˘ | | ´1
´1
´ Γ F tH X̄ X̄ H ` H β Γ βH pF | ´ H | β | qΓ´1
ˆ´ ¯´1 ´ ¯´1 ˙
| ´1
|
` ˘ | | ´1 | r S ´1
`Γ F ´1
tH X̄ X̄ H ` H β Γ βH ´1
´ dΛ ` F pΓ q F F | Γ´1

“L1 ` L2 ` L3 ` L4 ` L5 ` L6 .
By Lemma 19, we have

´ a ¯
´1{2 ´1{2 1´q
kL1 k “ Op md pnδ log d ` d md q .

´ ¯ ´ ¯´1
1{2 S ´1 S
For L2 , because kF k “ Op pd q, λmax pΓ q
r ď λmin pΓ q
r ď K ` op p1q,

´ ¯ ´ ¯ ´ ¯
rS q´1 F ě λmin F | pΓ
λmin dΛ´1 ` F | pΓ rS q´1 F ě λmin pF | F q λmin pΓ
rS q´1 ě m´1 d,
d

and by Lemma 19, we have

´ ¯ ´ ¯´1
kL2 k ď rS q´1 ´ Γ´1 kF k dΛ´1 ` F | pΓ
pΓ rS q´1 F rS q´1
F | pΓ
´ a ¯
´1{2
“ Op md pnδ log d ` d´1{2 md q1´q .

The same bound holds for kL3 k. As for L4 , note that kβk “ Op pd1{2 q, kΓ´1 k ď
?
pλmin pΓqq´1 ď K, kHk “ Op p1q, and kβH ´ F k ď rd kβH ´ F kMAX “
Op pnδ d2{ν`1{2 ` md q, and that

´ ` ˘´1 ¯ ` ˘
λmin tH | X̄ X̄ | H ` H | β | Γ´1 βH ě λmin H | β | Γ´1 βH

ě λmin pΓ´1 qλmin pβ | βqλmin pH | Hq

ą Km´1
d d,

hence we have

´ ` ‹ ‹| ˘´1 ¯´1
| | | ´1
kL4 k ď Γ ´1
kpβH ´ F qk tH X̄ X̄ H ` H β Γ βH kH | β | k Γ´1

´1{2
a
“ Op pmd nδ log d ` d´1{2 m2d q.

The same bound holds for L5 . Finally, with respect to L6 , we have

´ ¯´1 ´ ¯´1
| ´1
` ˘
tH |
X̄ X̄ | | ´1
H ` H β Γ βH ´ dΛ ´1 |rS q´1 F
` F pΓ
´ ¯ ´ ¯
| ´1
` ˘
ďKd ´2
m2d |
tH X̄ X̄ | | ´1 ´1 | r S ´1
H ` H β Γ βH ´ dΛ ` F pΓ q F .

Moreover, since we have

tH | pX̄ X̄ | q´1 H ´ dΛ´1 “ Λ´1 F | pβH ´ F q


´ a ¯
´1{2
“ Op nδ log d ` d´1{2 md ,

and

rS q´1 F
H | β | Γ´1 βH ´ F | pΓ
´ ¯
rS q´1 F
ď pH | β | ´ F | qΓ´1 βH ` F | Γ´1 pβH ´ F q ` F | Γ´1 ´ pΓ
´ a ¯
´1{2
“Op dmd pnδ log d ` d´1{2 md q1´q .

Combining these inequalities yields

´ a ¯
´1{2
kL6 k “ Op m3d pnδ log d ` d´1{2 md q1´q .

On the other hand, using the Sherman-Morrison-Woodbury formula again,

r ´1 ´ Σ´1
Σ
` ˘´1
“ t´1 β X̄ X̄ | β | ` Γ ´ pβEβ | ` Γq´1
ˆ´ ¯´1 ` ˙
´1 2 | ´1
2 |
` ˘ | | ´1 | ´1 | | ´1
˘´1
ď Γ kβHk tH X̄ X̄ H ` H β Γ βH ´ H E H ` H β Γ βH
` ˘´1 ´1 ´1 ` ˘´1
ďKd tH | X̄ X̄ | H ` H | β | Γ´1 βH H | E´1 H ` H | β | Γ´1 βH t X̄ X̄ | ´ E´1
´ a ¯
´1{2
“Op md nδ log d .

By the triangle inequality, we obtain

p PCA q´1 ´ Σ´1 ď pΣ


pΣ p PCA q´1 ´ Σr ´1 ` Σ r ´1 ´ Σ´1
´ a ¯
3 ´1{2 ´1{2 1´q
“Op md pnδ log d ` d md q .

This concludes the proof.

Bibliography

Adams, S., P. A. Beling, and R. Cogill (2016): “Feature selection for hidden
Markov models and hidden semi-Markov models,” IEEE Access, 4, 1642–1657.

Adhikari, A., A. Ram, R. Tang, and J. Lin (2019a): “Docbert: Bert for document
classification,” arXiv preprint arXiv:1904.08398.

(2019b): “Rethinking complex neural network architectures for document


classification,” in Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pp. 4046–4051.

Ailliot, P., V. Monbet, and M. Prevosto (2006): “An autoregressive model


with time-varying coefficients for wind fields,” Environmetrics, 17(2), 107–117.

Aït-Sahalia, Y., J. Fan, and D. Xiu (2010): “High-Frequency Covariance Estimates


with Noisy and Asynchronous Financial Data,” Journal of the American Statistical
Association, 105(492), 1504–1517.

Aït-Sahalia, Y., and J. Jacod (2014): High Frequency Financial Econometrics.


Princeton University Press.

Aït-Sahalia, Y., and D. Xiu (2016): “A Hausman Test for the Presence of Market
Microstructure Noise in High Frequency Data,” Journal of Econometrics, forthcom-
ing.

(2017): “Principal Component Analysis of High Frequency Data,” Journal of


the American Statistical Association, forthcoming.

Aït-Sahalia, Y., and D. Xiu (2017): “Using Principal Component Analysis to


Estimate a High Dimensional Factor Model with High-Frequency Data,” Journal of
Econometrics, 201(2), 384–399.

Alizadeh, A., and N. Nomikos (2004): “A Markov regime switching approach for
hedging stock indices,” Journal of Futures Markets: Futures, Options, and Other
Derivative Products, 24(7), 649–674.

Alizadeh, A. H., N. K. Nomikos, and P. K. Pouliasis (2008): “A Markov


regime switching approach for hedging energy commodities,” Journal of Banking &
Finance, 32(9), 1970–1983.
Andersen, T. G., and T. Bollerslev (1997): “Intraday periodicity and volatility
persistence in financial markets,” Journal of empirical finance, 4(2-3), 115–158.

Ang, A., and G. Bekaert (2002): “Regime switches in interest rates,” Journal of
Business & Economic Statistics, 20(2), 163–182.

Bai, J. (2003): “Inferential Theory for Factor Models of Large Dimensions,” Econo-
metrica, 71(1), 135–171.

Bai, J., and S. Ng (2002): “Determining the Number of Factors in Approximate


Factor Models,” Econometrica, 70(1), 191–221.

(2006): “Evaluating Latent and Observed Factors in Macroeconomics and


Finance,” Journal of Econometrics, 131(1), 507–537.

Balakrishnan, S., M. J. Wainwright, B. Yu, et al. (2017): “Statistical


guarantees for the EM algorithm: From population to sample-based analysis,” The
Annals of Statistics, 45(1), 77–120.

Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard (2011):


“Multivariate Realised Kernels: Consistent Positive Semi-Definite Estimators of the
Covariation of Equity Prices with Noise and Non-Synchronous Trading,” Journal of
Econometrics, 162(2), 149–169.

Bibinger, M. (2012): “An Estimator for the Quadratic Covariation of Asynchronously


Observed Itô Processes with Noise: Asymptotic Distribution Theory,” Stochastic
Processes and their Applications, 122, 2411–2453.

Bibinger, M., N. Hautsch, P. Malec, and M. Reiss (2014): “Estimating the


Quadratic Covariation Matrix from Noisy Observations: Local Method of Moments
and Efficiency,” Annals of Statistics, 42(4), 80–114.

Bickel, P. J., and E. Levina (2008a): “Regularized Estimation of Large Covariance


Matrices,” Annals of Statistics, 36(1), 199–227.

(2008b): “Covariance Regularization by Thresholding,” Annals of Statistics,


36(6), 2577–2604.

Brogaard, J. A., T. Hendershott, and R. Riordan (2014): “High Frequency


Trading and Price Discovery,” Review of Financial Studies, 27, 2267–2306.

Brownlees, C. T., E. Nualart, and Y. Sun (2017): “Realized Networks,” Discus-


sion paper, Universitat Pompeu Fabra - Department of Economics and Business;
Barcelona Graduate School of Economics (Barcelona GSE).

Buffington, J., and R. J. Elliott (2002): “American options with regime


switching,” International Journal of Theoretical and Applied Finance, 5(05), 497–
514.

Cai, J. (1994): “A Markov model of switching-regime ARCH,” Journal of Business &
Economic Statistics, 12(3), 309–316.

Cai, T., and W. Liu (2011): “Adaptive Thresholding for Sparse Covariance Matrix
Estimation,” Journal of the American Statistical Association, 106(494), 672–684.

Carhart, M. M. (1997): “On Persistence in Mutual Fund Performance,” The Journal


of Finance, 52(1), 57–82.

Chamberlain, G., and M. Rothschild (1983): “Arbitrage, Factor Structure, and


Mean-Variance Analysis on Large Asset Markets,” Econometrica, 51(5), 1281–1304.

Christensen, K., S. Kinnebrock, and M. Podolskij (2010): “Pre-Averaging


Estimators of the Ex-Post Covariance Matrix in Noisy Diffusion Models with
Non-Synchronous Data,” Journal of Econometrics, 159(1), 116–133.

Connor, G., M. Hagmann, and O. Linton (2012): “Efficient Semiparametric


Estimation of the Fama–French Model and Extensions,” Econometrica, 80(2), 713–
754.

Connor, G., and O. Linton (2007): “Semiparametric Estimation of a Characteristic-


Based Factor Model of Common Stock Returns,” Journal of Empirical Finance,
14(5), 694–717.

Croux, C., E. Renault, and B. Werker (2004): “Dynamic Factor Models,”


Journal of Econometrics, 119, 223–230.

Da, R., and D. Xiu (2017): “When Moving-Average Models Meet High-Frequency
Data: Uniform Inference on Volatility,” Discussion paper, University of Chicago.

Dai, C., K. Lu, and D. Xiu (2019): “Knowing factors or factor loadings, or neither?
Evaluating estimators of large covariance matrices with noisy and asynchronous
data,” Journal of Econometrics, 208(1), 43–79.

Daniel, K., and S. Titman (1997): “Evidence on the Characteristics of Cross


Sectional Variation in Stock Returns,” The Journal of Finance, 52(1), 1–33.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova (2018): “Bert: Pre-
training of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805.

Diebold, F. X., and R. S. Mariano (2002): “Comparing Predictive Accuracy,”


Journal of Business & Economic Statistics, 20(1), 134–144.

Ding, X., Y. Zhang, T. Liu, and J. Duan (2014): “Using structured events to
predict stock price movement: An empirical investigation,” in Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1415–1425.

(2015): “Deep learning for event-driven stock prediction,” in Twenty-fourth
international joint conference on artificial intelligence.

Ding, Z., C. W. Granger, and R. F. Engle (1993): “A long memory property of


stock market returns and a new model,” Journal of empirical finance, 1(1), 83–106.

Donoho, D. L., et al. (2000): “High-Dimensional Data Analysis: The Curses and
Blessings of Dimensionality,” AMS Math Challenges Lecture, pp. 1–32.

Doz, C., D. Giannone, and L. Reichlin (2011): “A Two-Step Estimator for


Large Approximate Dynamic Factor Models Based on Kalman Filtering,” Journal
of Econometrics, 164, 188–205.

Duan, J., X. Ding, and T. Liu (2018): “Learning sentence representations over tree
structures for target-dependent classification,” in Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long Papers), pp. 551–560.

Engle, C., and J. D. Hamilton (1990): “Long swings in the dollar: are they in the
data and do markets know it,” American Economic Review, 80(4), 689–713.

Engle, R. F., and J. R. Russell (1998): “Autoregressive Conditional Duration: a


New Model for Irregularly Spaced Transaction Data,” Econometrica, pp. 1127–1162.

Fama, E. F. (1965): “The behavior of stock-market prices,” The journal of Business,


38(1), 34–105.

Fama, E. F., and K. R. French (1993): “Common Risk Factors in the Returns on
Stocks and Bonds,” Journal of Financial Economics, 33(1), 3–56.

(2015): “A Five-Factor Asset Pricing Model,” Journal of Financial Economics,


116(1), 1–22.

Fan, J., Y. Fan, and J. Lv (2008): “High Dimensional Covariance Matrix Estimation
Using a Factor Model,” Journal of Econometrics, 147(1), 186–197.

Fan, J., A. Furger, and D. Xiu (2016): “Incorporating Global Industrial Classifica-
tion Standard Into Portfolio Allocation: A Simple Factor-Based Large Covariance
Matrix Estimator With High-Frequency Data,” Journal of Business & Economic
Statistics, 34(4), 489–503.

Fan, J., and R. Li (2001): “Variable selection via nonconcave penalized likelihood
and its oracle properties,” Journal of the American statistical Association, 96(456),
1348–1360.

Fan, J., Y. Liao, and M. Mincheva (2011): “High-Dimensional Covariance Matrix


Estimation in Approximate Factor Models,” Annals of Statistics, 39(6), 3320–3356.

(2013): “Large Covariance Estimation by Thresholding Principal Orthogonal
Complements,” Journal of the Royal Statistical Society, B, 75(4), 603–680, With 33
discussions by 57 authors and a reply by Fan, Liao and Mincheva.

Fan, J., J. Zhang, and K. Yu (2012): “Vast Portfolio Selection with Gross-Exposure
Constraints,” Journal of the American Statistical Association, 107(498), 592–606.

Forni, M., M. Hallin, M. Lippi, and L. Reichlin (2000): “The Generalized


Dynamic-Factor Model: Identification and Estimation,” The Review of Economics
and Statistics, 82, 540–554.

(2004): “The Generalized Dynamic Factor Model: Consistency and Rates,”


Journal of Econometrics, 119(2), 231–255.

Forni, M., and M. Lippi (2001): “The Generalized Dynamic Factor Model: Repre-
sentation Theory,” Econometric Theory, 17, 1113–1141.

Goldfeld, S. M., and R. E. Quandt (1973): “A Markov model for switching


regressions,” Journal of econometrics, 1(1), 3–15.

Hamilton, J. D. (1989): “A new approach to the economic analysis of nonstationary


time series and the business cycle,” Econometrica: Journal of the Econometric
Society, pp. 357–384.

Hamilton, J. D., and R. Susmel (1994): “Autoregressive conditional heteroskedas-


ticity and changes in regime,” Journal of econometrics, 64(1-2), 307–333.

Hasbrouck, J. (2007): Empirical Market Microstructure. Oxford University Press,


New York, NY.

He, K., X. Zhang, S. Ren, and J. Sun (2016): “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 770–778.

Hering, A., K. Kazor, and W. Kleiber (2015): “A Markov-switching vector


autoregressive stochastic wind generator for multiple spatial and temporal scales,”
Resources, 4(1), 70–92.

Horst, B., and C. T. Michael (2001): Hidden Markov models: Applications In


Computer Vision, vol. 45. World Scientific.

Hu, Z., W. Liu, J. Bian, X. Liu, and T.-Y. Liu (2018): “Listening to chaotic
whispers: A deep learning framework for news-oriented stock trend prediction,” in
Proceedings of the eleventh ACM international conference on web search and data
mining, pp. 261–269.

Jacod, J., Y. Li, P. A. Mykland, M. Podolskij, and M. Vetter (2009): “Mi-


crostructure Noise in the Continuous Case: the Pre-Averaging Approach,” Stochastic
Processes and Their Applications, 119(7), 2249–2276.
Jeanne, O., and P. Masson (2000): “Currency crises, sunspots and Markov-
switching regimes,” Journal of international economics, 50(2), 327–350.

Kahn, R. N., P. Brougham, and P. Green (1998): United States Equity Model
(USE3) - Risk Model Handbook MSCI INC.

Kalnina, I., and O. Linton (2008): “Estimating quadratic variation consistently in


the presence of endogenous and diurnal measurement error,” Journal of Economet-
rics, 147, 47–59.

Kim, C.-J., J. Piger, and R. Startz (2008): “Estimation of Markov regime-


switching regression models with endogenous switching,” Journal of Econometrics,
143(2), 263–273.

Kim, D., Y. Wang, and J. Zou (2016): “Asymptotic Theory for Large Volatility
Matrix Estimation Based on High-Frequency Financial Data,” Stochastic Processes
and Their Applications, 126(11), 3527–3577.

King, B. F. (1966): “Market and Industry Factors in Stock Price Behavior,” The
Journal of Business, 39(1), 139–190.

Krizhevsky, A., I. Sutskever, and G. E. Hinton (2017): “Imagenet classification


with deep convolutional neural networks,” Communications of the ACM, 60(6),
84–90.

Larcker, D. F., and A. A. Zakolyukina (2012): “Detecting deceptive discussions


in conference calls,” Journal of Accounting Research, 50(2), 495–540.

Lavrenko, V., M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan


(2000): “Mining of concurrent text and time series,” in KDD-2000 Workshop on
Text Mining, vol. 2000, pp. 37–44. Citeseer.

Le, Q., and T. Mikolov (2014): “Distributed representations of sentences and


documents,” in International conference on machine learning, pp. 1188–1196.

Ledoit, O., and M. Wolf (2004): “Honey, I Shrunk the Sample Covariance Matrix,”
Journal of Portfolio Management, 30(4), 110–119.

(2012): “Nonlinear Shrinkage Estimation of Large-Dimensional Covariance


Matrices,” Annals of Statistics, 40(2), 1024–1060.

Lee, H.-T., J. K. Yoder, R. C. Mittelhammer, and J. J. McCluskey (2006):


“A random coefficient autoregressive Markov regime switching model for dynamic fu-
tures hedging,” Journal of Futures Markets: Futures, Options, and Other Derivative
Products, 26(2), 103–129.

Li, Y., P. Mykland, E. Renault, L. Zhang, and X. Zheng (2014): “Realized


Volatility When Sampling Times are Possibly Endogenous,” Econometric Theory,
30, 580–605.
Loh, P.-L., and M. J. Wainwright (2011): “High-dimensional regression with
noisy and missing data: Provable guarantees with non-convexity,” in Advances in
Neural Information Processing Systems, pp. 2726–2734.

Luss, R., and A. d’Aspremont (2015): “Predicting abnormal returns from news
using text classification,” Quantitative Finance, 15(6), 999–1012.

Martens, M. (2004): “Estimating Unbiased and Precise Realized Covariances,” Dis-


cussion paper, Erasmus University Rotterdam (EUR); Robeco Asset Management.

Merlevède, F., M. Peligrad, and E. Rio (2011): “A Bernstein type inequality


and moderate deviations for weakly dependent sequences,” Probability Theory and
Related Fields, 151(3-4), 435–474.

Monbet, V., and P. Ailliot (2017): “Sparse vector Markov switching autoregressive
models. Application to multivariate time series of temperature,” Computational
Statistics & Data Analysis, 108, 40–51.

Moon, H. R., and M. Weidner (2015): “Linear Regression for Panel with Unknown
Number of Factors as Interactive Fixed Effects,” Econometrica, 83(4), 1543–1579.

Negahban, S. N., P. Ravikumar, M. J. Wainwright, B. Yu, et al. (2012): “A


unified framework for high-dimensional analysis of M -estimators with decomposable
regularizers,” Statistical Science, 27(4), 538–557.

Oliveira, N., P. Cortez, and N. Areal (2013): “Some experiments on modeling


stock market behavior using investor sentiment analysis and posting volume from
Twitter,” in Proceedings of the 3rd International Conference on Web Intelligence,
Mining and Semantics, pp. 1–8.

Onatski, A. (2010): “Determining the Number of Factors from Empirical Distribution


of Eigenvalues,” Review of Economics and Statistics, 92, 1004–1016.

Pedersen, L. H. (2019): Efficiently inefficient: how smart money invests and market
prices are determined. Princeton University Press.

Pelger, M. (2015a): “Large-Dimensional Factor Modeling Based on High-Frequency


Observations,” Available at SSRN 2584172.

(2015b): “Understanding Systematic Risk: A High-Frequency Approach,”


Available at SSRN 2647040.

Quandt, R. E. (1958): “The estimation of the parameters of a linear regression


system obeying two separate regimes,” Journal of the american statistical association,
53(284), 873–880.

Rabiner, L. R., B.-H. Juang, and J. C. Rutledge (1993): Fundamentals of


speech recognition, vol. 14. PTR Prentice Hall Englewood Cliffs.

Reiß, M., V. Todorov, and G. Tauchen (2015): “Nonparametric Test for a
Constant Beta Between Itô semi-Martingales Based on High-Frequency Data,”
Stochastic Processes and Their Applications, 125(8), 2955–2988.

Rosenberg, B. (1974): “Extra-Market Components of Covariance in Security Re-


turns,” Journal of Financial and Quantitative Analysis, 9(02), 263–274.

Ross, S. A. (1976): “The Arbitrage Theory of Capital Asset Pricing,” Journal of


Economic Theory, 13(3), 341–360.

Rothman, A. J., E. Levina, and J. Zhu (2009): “Generalized Thresholding of Large


Covariance Matrices,” Journal of the American Statistical Association, 104(485),
177–186.

Schumaker, R. P., and H. Chen (2009): “Textual analysis of stock market predic-
tion using breaking financial news: The AZFin text system,” ACM Transactions on
Information Systems (TOIS), 27(2), 1–19.

Shephard, N., and D. Xiu (2017): “Econometric Analysis of Multivariate Realised


QML: Estimation of the Covariation of Equity Prices under Asynchronous Trading,”
Journal of Econometrics, 201(1), 19–42.

Tao, M., Y. Wang, and X. Chen (2013): “Fast Convergence Rates in Estimating
Large Volatility Matrices Using High-Frequency Financial Data,” Econometric
Theory, 29(4), 838–856.

Tao, M., Y. Wang, Q. Yao, and J. Zou (2011): “Large Volatility Matrix Inference
via Combining Low-Frequency and High-Frequency Approaches,” Journal of the
American Statistical Association, 106(495), 1025–1040.

Tao, M., Y. Wang, and H. H. Zhou (2013): “Optimal Sparse Volatility Matrix
Estimation for High-Dimensional Itô Processes with Measurement Errors,” Annals
of Statistics, 41(4), 1816–1864.

Taylor, S. J. (2008): Modelling financial time series. World Scientific.

Todorov, V., and T. Bollerslev (2010): “Jumps and Betas: A New Framework
for Disentangling and Estimating Systematic Risks,” Journal of Econometrics, 157,
220–235.

Valdez, A. R. L., and T. Vargiolu (2013): “Optimal portfolio in a regime-switching


model,” in Seminar on Stochastic Analysis, Random Fields and Applications VII,
pp. 435–449. Springer.

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,


Ł. Kaiser, and I. Polosukhin (2017): “Attention is all you need,” in Advances
in neural information processing systems, pp. 5998–6008.

Viennet, G. (1997): “Inequalities for absolutely regular sequences: application to
density estimation,” Probability theory and related fields, 107(4), 467–492.

Wang, Y., and J. Zou (2010): “Vast Volatility Matrix Estimation for High-Frequency
Financial Data,” Annals of Statistics, 38(2), 943–978.

Wong, K. C. (2017): “Lasso Guarantees for Dependent Data,” Ph.D. thesis.

Wood, R. A., T. H. McInish, and J. K. Ord (1985): “An investigation of


transactions data for NYSE stocks,” The Journal of Finance, 40(3), 723–739.

Xie, B., R. Passonneau, L. Wu, and G. G. Creamer (2013): “Semantic frames


to predict stock price movement,” in Proceedings of the 51st annual meeting of the
association for computational linguistics, pp. 873–883.

Xu, Y., and S. B. Cohen (2018): “Stock movement prediction from tweets and
historical prices,” in Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 1970–1979.

Yang, F., S. Balakrishnan, and M. J. Wainwright (2017): “Statistical and


computational guarantees for the Baum-Welch algorithm,” The Journal of Machine
Learning Research, 18(1), 4528–4580.

Yang, Z., Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le


(2019): “Xlnet: Generalized autoregressive pretraining for language understanding,”
in Advances in neural information processing systems, pp. 5753–5763.

Yang, Z., D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016): “Hi-
erarchical attention networks for document classification,” in Proceedings of the
2016 conference of the North American chapter of the association for computational
linguistics: human language technologies, pp. 1480–1489.

Yi, X., and C. Caramanis (2015): “Regularized em algorithms: A unified framework


and statistical guarantees,” in Advances in Neural Information Processing Systems,
pp. 1567–1575.

Zhang, L. (2011): “Estimating Covariation: Epps effect, Microstructure Noise,”


Journal of Econometrics, 160(1), 33–47.

Zou, H. (2006): “The adaptive lasso and its oracle properties,” Journal of the American
statistical association, 101(476), 1418–1429.

Zou, H., and T. Hastie (2005): “Regularization and variable selection via the elastic
net,” Journal of the royal statistical society: series B (statistical methodology), 67(2),
301–320.

