MicroRCA: Root Cause Localization of Performance Issues in Microservices
Li Wu, Johan Tordsson, Erik Elmroth, Odej Kao
Abstract—Software architecture is undergoing a transition from monolithic architectures to microservices to achieve resilience, agility and scalability in software development. However, with microservices it is difficult to diagnose performance issues due to technology heterogeneity, the large number of microservices, and frequent updates to both software features and infrastructure. This paper presents MicroRCA, a system to locate root causes of performance issues in microservices. MicroRCA infers root causes in real time by correlating application performance symptoms with the corresponding system resource utilization, without any application instrumentation. The root cause localization is based on an attributed graph that models anomaly propagation across services and machines. Our experimental evaluation, where common anomalies are injected into a microservice benchmark running in a Kubernetes cluster, shows that MicroRCA locates root causes well, with 89% precision and 97% mean average precision, outperforming several state-of-the-art methods.

Index Terms—root cause analysis, performance degradation, microservices

I. INTRODUCTION

More and more applications use microservices architectures (MSA) in domains such as the internet of things (IoT) [1], mobile and cloud [2], to build large-scale systems that are more resilient, robust and better adapted to dynamic customer requirements. With MSA, an application is decomposed into self-contained and independently deployable services with lightweight intercommunication [3].

To operate microservices reliably and with high uptime, performance issues must be detected quickly and their root causes pinpointed. However, this is difficult in microservices systems due to the following challenges: 1) complex dependencies: the number of services can reach hundreds or even thousands (e.g., Uber has deployed 4000 microservices [4]). Consequently, the dependencies among services are much more complex than in traditional distributed systems, and a performance degradation in one service can propagate widely and cause multiple alarms, making it difficult to locate the root causes; 2) numerous metrics: the number of available monitoring metrics is very high. According to [5], Netflix exposes 2 million metrics and Uber exposes 500 million metrics. Using all of these metrics for performance issue diagnosis would cause significant overhead; 3) heterogeneous services: technology heterogeneity [6], one key benefit of MSA, enables development teams to use different programming languages and technology stacks for their services. However, performance anomalies manifest differently for different technology stacks, making it hard to detect performance issues and locate root causes; 4) frequent updates: microservices are frequently updated to meet customer requirements (e.g., Netflix updates thousands of times per day [7]). This highly dynamic environment aggravates the difficulty of root cause localization.

To date, many studies have addressed root cause diagnostics in distributed systems, clouds and microservices. These either require the application to be instrumented (e.g., [8], [9]) or numerous metrics to be analyzed (e.g., [5], [10]). A third class of approaches [11]–[13] avoids these limitations by building a causality graph and inferring the causes along the graph based on application-level metrics. With this approach, potential root causes are commonly ranked through the correlation between back-end services and front-end services. However, this may fail to identify faulty services that have little or no impact on front-end services.

In this paper, we propose a new system, MicroRCA, to locate root causes of performance issues in microservices. MicroRCA is an application-agnostic system designed for container-based microservices environments. It continuously collects application- and system-level metrics and detects anomalies on SLO (Service Level Objective) metrics. Once an anomaly is detected, MicroRCA constructs an attributed graph with services and hosts to model the anomaly propagation among services. This graph includes not only the service call paths but also services collocated on the same (virtual) machines. MicroRCA correlates anomaly symptoms of communicating services with relevant resource utilization to infer the potentially abnormal services and ranks the potential root causes. Through this correlation of service anomalies and resource utilization, MicroRCA can identify abnormal non-compute-intensive services that show non-obvious anomaly symptoms, and it mitigates the effect of false alarms on root cause localization. We evaluate MicroRCA by injecting various anomalies into the Sock-shop microservice benchmark deployed on Kubernetes running in Google Cloud Engine (GCE).

1 MicroRCA stands for Microservices Root Cause Analysis
2 Sock-shop - https://microservices-demo.github.io/
3 Google Cloud Engine - https://cloud.google.com/compute/
The results show that MicroRCA achieves a good diagnosis result, with 89% precision and 97% mean average precision (MAP), which is higher than several state-of-the-art methods.

In summary, our contributions are threefold:
• We propose an attributed graph with service and host nodes to model anomaly propagation in container-based microservices environments. Our approach is purely based on metrics collected at the application and system levels and requires no application instrumentation.
• We provide a method to identify anomalous services by correlating service performance symptoms with the corresponding resource utilization, which adapts well to the heterogeneity of microservices.
• We evaluate MicroRCA by locating root causes for different types of faults and different kinds of faulty services. Averaged over 95 test scenarios, it achieves a 13% precision improvement over the baseline methods.

The remainder of this paper is organized as follows. Related work is discussed in Section II. Section III gives an overview of the MicroRCA system. Root cause localization procedures are detailed in Section IV. Section V describes the experimental evaluation and Section VI concludes the paper.

II. RELATED WORK

In recent years, many solutions have been proposed to identify root causes in distributed systems, clouds and microservices. Log-based approaches [14]–[17] build problem detection and identification models based on log parsing. Even though log-based approaches can discover more informative causes, they are hard to apply in real time and require the relevant abnormal information to be present in the logs. Similarly, trace-based approaches [8], [9], [18]–[23] gather information through complete tracing of the execution paths, then identify root causes by analyzing the deviation of latencies along the paths, e.g., based on machine learning. These approaches are very useful for debugging distributed systems. However, it is a daunting task for developers to understand source code well enough to instrument tracing code.

In addition, there are many metrics-based approaches [5], [9]–[13], [24], [25], to which this work also belongs. These use metrics from applications and/or additional infrastructure levels to construct a causality graph that is used to infer root causes. Seer [9] and Loud [10] use multiple-level metrics to identify root causes. Seer identifies the faulty services and the resource, like CPU overhead, that causes the service performance degradation. It is a proactive method that applies deep learning to a massive amount of data to identify root causes. This approach requires instrumentation of source code to obtain metrics, and its performance may decrease when microservices are frequently updated. Loud identifies any faulty components that generate metrics. It directly uses the causality graph of KPI metrics from the anomaly detection system and locates the faulty components by means of different graph centrality algorithms. However, it requires anomaly detection to be performed on all gathered metrics, which would cause significant overhead.

MonitorRank [13], Microscope [12] and CloudRanger [11] identify root causes based on application-level metrics only. MonitorRank considers internal and external factors, proposes a pseudo-anomaly clustering algorithm to classify external factors, and then traverses the provided service call graph with a random walk algorithm to identify anomalous services. Microscope considers communicating and non-communicating dependencies between services and constructs a service causality graph to represent these two types of dependencies. Next, it traverses the constructed graph from the front-end service to find root cause candidates and ranks them based on the metric similarity between each candidate and the front-end service. CloudRanger constructs an impact graph with causal analysis and proposes a second-order random walk algorithm to locate root causes. All of these approaches achieve good performance in identifying faulty services that impact a front-end service. However, these methods commonly fail to identify root causes in back-end services of the kind that scarcely impacts front-end services. Similar to these works, we also use a graph model and localize root causes with an algorithm similar to random walk. However, we correlate anomalous performance symptoms with the relevant resource utilization to represent service anomalies more comprehensively, which improves the precision of root cause localization.

III. SYSTEM OVERVIEW

In this section, we briefly introduce MicroRCA and its main components.

A. MicroRCA Overview

Figure 1 shows the overview of MicroRCA. The data collection module continuously collects metrics from the application and system levels. The application-level metrics, in particular the response times of microservices, are used to detect performance issues, and metrics from both levels are used to locate root causes. Once anomalies are detected, the cause analysis engine constructs an attributed graph G with service and host nodes to represent the anomaly propagation paths. Next, the engine extracts an anomalous subgraph SG based on the detected anomalies and infers which service is the most likely to cause the anomalies.
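To make this flow concrete, the sketch below wires the stages together in Python. It is only an illustration of the detect, build-graph and rank sequence described above; all function names and signatures are hypothetical and do not correspond to MicroRCA's actual code.

```python
# Hypothetical skeleton of the MicroRCA flow; every name below is a placeholder.

def collect_metrics():
    """Pull application-level (response times) and system-level metrics."""
    return {"response_times": {}, "container": {}, "host": {}}

def detect_anomalies(response_times):
    """Return the (caller, callee) edges whose SLO metric looks anomalous."""
    return []

def localize(metrics, anomalies):
    """Build the attributed graph, extract the anomalous subgraph, rank causes."""
    return []

if __name__ == "__main__":
    metrics = collect_metrics()
    anomalies = detect_anomalies(metrics["response_times"])
    if anomalies:   # only run localization when an SLO anomaly appears
        print(localize(metrics, anomalies))
```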
B. Data Collection

MicroRCA is designed to be application-agnostic. It collects application- and system-level metrics from a service mesh [26] and a monitoring system, respectively, and stores them in a time series database. In container-based microservices, system-level metrics include container and host resource utilization, as illustrated by container and host in "Model Overview" in Figure 1. Application-level metrics include the response times between two communicating services, etc.
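As an illustration of such collection, per-edge response times can be read from Prometheus over its HTTP API. The endpoint address and the Istio histogram metric names below are deployment-dependent assumptions (they differ across Istio versions), so this is a sketch rather than MicroRCA's actual collection code.

```python
import requests

# Assumed Prometheus address and Istio metric names; adjust to the deployment.
PROMETHEUS = "http://localhost:9090/api/v1/query"
QUERY = (
    'sum(rate(istio_request_duration_milliseconds_sum{reporter="destination"}[1m]))'
    ' by (source_workload, destination_workload)'
    ' / '
    'sum(rate(istio_request_duration_milliseconds_count{reporter="destination"}[1m]))'
    ' by (source_workload, destination_workload)'
)

def edge_response_times():
    """Return {(caller, callee): average response time in ms over the last minute}."""
    payload = requests.get(PROMETHEUS, params={"query": QUERY}).json()
    edges = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        key = (labels["source_workload"], labels["destination_workload"])
        edges[key] = float(sample["value"][1])
    return edges
```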
C. Anomaly Detection

Anomaly detection is the starting point of root cause localization. MicroRCA leverages the unsupervised, distance-based online clustering algorithm BIRCH [27] as its anomaly detection method.
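A minimal sketch of BIRCH-based detection on response times with scikit-learn is shown below. The exact criterion and the distance threshold are assumptions made for illustration and would need tuning per service.

```python
import numpy as np
from sklearn.cluster import Birch

def fit_detector(history, threshold=0.1):
    """Fit BIRCH on a window of response times assumed to be normal."""
    model = Birch(n_clusters=None, threshold=threshold)
    model.fit(np.asarray(history, dtype=float).reshape(-1, 1))
    return model

def is_anomalous(model, sample, threshold=0.1):
    """Flag a new sample that is far from every learned subcluster centroid."""
    centers = model.subcluster_centers_.ravel()
    return float(np.min(np.abs(centers - sample))) > threshold

history = [0.020, 0.021, 0.019, 0.022, 0.020, 0.023]   # seconds
detector = fit_detector(history)
print(is_anomalous(detector, 0.021))   # False: close to normal behaviour
print(is_anomalous(detector, 0.250))   # True: latency spike
```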
among all communicating services and located hosts. Figure 2(b) shows an example of a constructed attributed graph from the initial microservices environment.

Figure 2. (a) Initial Microservices Environment; (b) Attributed Graph (G); (c) Anomalous Subgraph (SG); (d) Weighted and Scored Anomalous Subgraph; (e) A Ranked List of Root Causes.
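Such a graph maps naturally onto networkx: service nodes are linked along observed call edges, and each service is also linked to the host it runs on. The sketch below is a simplified reading of that description; the node attributes and edge directions are assumptions, not MicroRCA's exact construction.

```python
import networkx as nx

def build_attributed_graph(call_edges, placements):
    """Build a directed graph with service and host nodes.

    call_edges: iterable of (caller_service, callee_service) pairs.
    placements: dict mapping each service to the (virtual) machine it runs on.
    """
    g = nx.DiGraph()
    for caller, callee in call_edges:
        g.add_node(caller, type="service")
        g.add_node(callee, type="service")
        g.add_edge(caller, callee)            # service call path
    for service, host in placements.items():
        g.add_node(host, type="host")
        g.add_edge(service, host)             # service is collocated on this host
    return g

# Toy example in the spirit of Figure 2(a)-(b): three services on two hosts.
calls = [("front-end", "orders"), ("orders", "payment")]
hosts = {"front-end": "node-1", "orders": "node-1", "payment": "node-2"}
print(build_attributed_graph(calls, hosts).edges())
```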
anomalous node s3 is the correlation between the response time rt(s1,s3) between s1 and s3, and the response time rt(s2,s3), which is the average anomalous response time of s3.
• Weight between an anomalous service node and a normal host node, w^h. We use the maximum correlation coefficient between the average anomalous response time rt_a of an anomalous service node and the host resource utilization metrics U_h, including CPU, memory, I/O and network, to represent the similarity between the service anomaly symptoms and the corresponding resource utilization. Given that the response times of both normal and abnormal services have strong correlations with the host resource utilization, we take the average of the in-edge weights, denoted w̄_in, as a factor of the similarity. Thus, w^h_ij between service node i and host node j can be formulated as:

    w^h_ij = max_{k: u_k ∈ U_h(j)} corr(u_k, rt_a(i)) · w̄_in(i)    (1)

To summarize, the weight w_ij between node i and node j is given by Equation 2, where rt_a(i) denotes the average anomalous response time of anomalous service node i. Figure 2(d) shows these three types of weights along the edges.

    w_ij = α,                        if rt(i,j) is anomalous,
           corr(rt(i,j), rt_a(j)),   if i ∈ S and normal,
           corr(rt(i,j), rt_a(i)),   if j ∈ S and normal,
           as per Equation 1,        if j ∈ H and normal.    (2)

The procedure of anomalous subgraph weighting is presented in Algorithm 1. In this algorithm, we iterate over the anomalous nodes and compute the weights for in-edges in lines 1-8 and for out-edges in lines 9-15.

Algorithm 1: Anomalous Subgraph Weighting
Input: Anomalous subgraph SG, anomalous edges, anomalous nodes, anomalous response times rt_a, host metrics U_h, response times of edges rt
Output: Weighted SG
1  for node v_j in anomalous nodes do
2      for edge e_ij in in-edges of v_j do
3          if rt(i,j) in anomalous edges then
4              Assign α to w_ij;
5          else
6              Assign corr(rt_a(j), rt(i,j)) to w_ij
7          end
8      end
9      for edge e_jk in out-edges of v_j do
10         if node v_k is a service node then
11             Assign corr(rt_a(j), rt(j,k)) to w_jk;
12         else
13             Assign avg(w_in(j)) × max(corr(rt_a(j), U_h(k))) to w_jk
14         end
15     end
16 end
17 return Weighted SG
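For illustration, the weighting loop of Algorithm 1 could be implemented as follows on top of a networkx subgraph like the one sketched earlier. The Pearson correlation via numpy and the assumed data layout (per-edge response-time series, per-host lists of utilization series) are choices made for this example, and α is simply a tunable constant here.

```python
import numpy as np

def corr(x, y):
    """Absolute Pearson correlation between two equally long series."""
    return abs(np.corrcoef(x, y)[0, 1])

def weight_subgraph(sg, anomalous_nodes, anomalous_edges,
                    rt_a, rt, host_metrics, alpha=0.5):
    """Assign edge weights in the anomalous subgraph, cf. Algorithm 1.

    sg: networkx.DiGraph whose nodes carry a 'type' attribute ('service'/'host').
    rt_a[j]: anomalous response-time series of service j.
    rt[(i, j)]: response-time series observed on edge (i, j).
    host_metrics[h]: list of resource-utilization series for host h.
    """
    for j in anomalous_nodes:
        for i, _ in sg.in_edges(j):                      # lines 2-8
            if (i, j) in anomalous_edges:
                sg[i][j]["weight"] = alpha
            else:
                sg[i][j]["weight"] = corr(rt_a[j], rt[(i, j)])
        in_w = [sg[i][j]["weight"] for i, _ in sg.in_edges(j)]
        for _, k in sg.out_edges(j):                     # lines 9-15
            if sg.nodes[k].get("type") == "service":
                sg[j][k]["weight"] = corr(rt_a[j], rt[(j, k)])
            else:                                        # edge towards a host node
                avg_in = float(np.mean(in_w)) if in_w else 0.0
                sg[j][k]["weight"] = avg_in * max(
                    corr(rt_a[j], u) for u in host_metrics[k])
    return sg
```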
Assigning Service Anomaly Score: We calculate anomaly scores for the anomalous service nodes and assign them to a node attribute, denoted AS ∈ [0, 1]. To quantify the anomaly, we take the average weight of service node w(s_j), which indicates its impact on the linked nodes. Furthermore, in container-based microservices, we assume that container resource utilization is correlated with service performance. We thus complement the service anomaly with the maximum correlation coefficient between the average anomalous response time rt_a of the anomalous service node and its container resource utilization U_c. To summarize, given anomalous service node s_j, the anomaly score AS(s_j) is defined as:

    AS(s_j) = w(s_j) · max_{k: u_k ∈ U_c(s_j)} corr(u_k, rt_a(s_j))    (3)

Localizing Faulty Services: We locate the faulty services in the anomalous subgraph with a graph centrality algorithm, Personalized PageRank [28], which has shown good performance in capturing anomaly propagation in previous work [10], [11], [13], [29]. In Personalized PageRank, the Personalized PageRank vector (PPV) v is regarded as the root cause score of each node. To compute the PPV, we first define the transition probability matrix P, where P_ij = w_ij / Σ_j w_ij if node i links to node j, and P_ij = 0 otherwise. A preference vector u denotes the preference of nodes, which we assign
so that more anomalous nodes are visited more frequently when the random teleportation occurs. Formally, the Personalized PageRank equation [28] is defined as:

    v = (1 − c) P v + c u    (4)

where c ∈ (0, 1) is the teleportation probability, indicating that each step jumps back to a random node with probability c and continues along the graph with probability 1 − c. Typically c = 0.15 [28]. After ranking, we remove the host nodes from the ranked list, as MicroRCA is designed to locate faulty services. As a link between service nodes in the anomalous subgraph represents the caller-callee relationship, we need to reverse the edges before running the localization algorithm. We give an example of the ranked list of root causes in Figure 2(e).
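This ranking step maps directly onto networkx's personalized PageRank, as sketched below: the weighted subgraph is reversed, the anomaly scores are used as the personalization (preference) vector, and host nodes are dropped from the result. Using the AS values as the personalization weights is an assumption made for this illustration.

```python
import networkx as nx

def rank_root_causes(weighted_sg, anomaly_scores, c=0.15):
    """Rank candidate faulty services with Personalized PageRank.

    weighted_sg: networkx.DiGraph with 'weight' edge attributes and a
                 'type' node attribute ('service' or 'host').
    anomaly_scores: dict service -> AS(s_j), assumed to drive the teleportation.
    c: teleportation probability (networkx's damping factor alpha is 1 - c).
    """
    reversed_sg = weighted_sg.reverse(copy=True)   # walk against the call direction
    personalization = {n: anomaly_scores.get(n, 0.0) + 1e-6   # keep strictly positive
                       for n in reversed_sg.nodes}
    scores = nx.pagerank(reversed_sg, alpha=1 - c,
                         personalization=personalization, weight="weight")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(node, score) for node, score in ranked
            if weighted_sg.nodes[node].get("type") == "service"]
```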
V. EXPERIMENTAL EVALUATION

In this section, we present the experimental setup, the experimental results, and a comparison with state-of-the-art methods, and we discuss the characteristics of our approach.

A. Experimental Setup

Testbed: We evaluate MicroRCA in a testbed established in Google Cloud Engine (GCE), where we set up a Kubernetes cluster, deploy the monitoring system and the service mesh, and run the Sock-shop benchmark. The cluster has four worker nodes and one master node; three of the worker nodes are dedicated to the microservices and one to data collection. In addition, one server outside of the cluster runs the workload generator. Table I describes the detailed configuration of the hardware and software in the testbed.

Table I. Hardware and software configuration used in experiments.

Benchmark: Sock-shop is a microservice demo application that simulates an e-commerce website that sells socks. It is a widely used microservice benchmark designed to aid demonstration and testing of microservices and cloud-native technologies. Sock-shop consists of 13 microservices, which are implemented in heterogeneous technologies and intercommunicate using REST over HTTP. In particular, front-end serves as the entry point for user requests; catalogue provides a sock catalogue and product information; carts holds the shopping carts; user provides user authentication and stores user accounts, including payment cards and addresses; orders places orders from carts after users log in through the user service, and then processes the payment and shipping through the payment and shipping services, respectively. For each microservice, we limit the CPU resource to 1 vCPU and the memory to 1 GB. The replication factor of each microservice is set to 1.

Workload Generator: We developed a workload generator using Locust, a distributed, open-source load testing tool that simulates concurrent users of an application. The workload is selected to reflect real user behavior, e.g., more requests are sent to the entry points front-end and catalogue, and fewer to the shopping carts, user and orders services. We distribute requests to front-end, orders, catalogue, user and carts with five Locust slaves, and provision 500 users that in total generate about 600 queries per second to Sock-shop. The request rate of each microservice is listed in Table II.

Table II. Request rates sent to microservices.

4 Locust - https://locust.io/
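A locustfile along the following lines could reproduce such a skewed request mix. The endpoint paths and task weights are illustrative assumptions rather than the exact workload definition used in the experiments, and the API shown is that of recent Locust releases.

```python
from locust import HttpUser, task, between

class SockShopUser(HttpUser):
    """Simulated shopper; higher task weights mean proportionally more requests."""
    wait_time = between(0.5, 1.0)          # pause between tasks, in seconds

    @task(10)
    def front_end(self):
        self.client.get("/")               # entry point receives most traffic

    @task(6)
    def catalogue(self):
        self.client.get("/catalogue")      # assumed endpoint path

    @task(2)
    def cart(self):
        self.client.get("/cart")           # fewer requests to the carts

    @task(1)
    def orders(self):
        self.client.get("/orders")         # least traffic
```

Started in distributed mode, one Locust master coordinating several workers against the front-end URL mirrors the five-slave, 500-user setup described above.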
Data Collection: We use the Istio service mesh, which in turn uses Prometheus, to collect service-level and container-level metrics, and node-exporter to collect host-level metrics. Prometheus is configured to collect metrics every 5 seconds and sends the collected data to MicroRCA. At the service level, we collect the response time between each pair of services. At both the container and host levels, we collect CPU usage, memory usage, and the total number of bytes sent.

5 Istio - https://istio.io/
6 Prometheus - https://prometheus.io/
7 Node-exporter - https://github.com/prometheus/node_exporter

Faults Injection: Our method is applicable to any type of anomaly that manifests itself as increased microservice response time. In this evaluation, we inject three types of faults commonly used in the evaluation of state-of-the-art approaches [10], [12], [31] into the Sock-shop microservices to simulate performance issues. (1) Latency: we use tc to delay network packets. (2) CPU hog: we use stress-ng, a tool to load and stress compute systems, to exhaust CPU resources. As the microservice payment is non-compute intensive, with a CPU usage of only 50mHz, we exhaust its CPU heavily with 99% usage. (3) Memory leak: we use stress-ng to allocate memory continuously. As the microservices carts and orders are CPU- and memory-intensive services, and a memory leak also causes CPU overhead [27], we only provision 1 virtual machine. The details of the injected faults are described in Table III.

8 tc - https://linux.die.net/man/8/tc
9 stress-ng - https://kernel.ubuntu.com/~cking/stress-ng/

To inject performance issues into the microservices, we customize the existing Sock-shop Docker images by installing the above fault injection tools. Each fault lasts 1 minute. To increase generality, we repeat the injection process 5 times for each fault. This produces a total of 95 experiment cases.
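Such faults can be scripted inside the target container roughly as follows. The network interface name, load levels and durations are assumptions for illustration, and the commands require root privileges in the container.

```python
import subprocess

FAULTS = {
    # ~200 ms extra delay on outgoing packets (interface name is assumed).
    "latency":     ["tc", "qdisc", "add", "dev", "eth0", "root",
                    "netem", "delay", "200ms"],
    # Load all CPUs at ~99% for 60 seconds.
    "cpu_hog":     ["stress-ng", "--cpu", "0", "--cpu-load", "99",
                    "--timeout", "60s"],
    # Keep allocating memory for 60 seconds.
    "memory_leak": ["stress-ng", "--vm", "1", "--vm-bytes", "90%",
                    "--timeout", "60s"],
}

def inject(fault):
    """Run one fault inside the target container (requires root)."""
    subprocess.run(FAULTS[fault], check=True)

def clear_latency():
    """Remove the netem delay added by the latency fault."""
    subprocess.run(["tc", "qdisc", "del", "dev", "eth0", "root", "netem"],
                   check=True)
```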
Baseline Methods: We compare MicroRCA with the following baseline methods:
• Random selection (RS): Random selection models what an operations team would do without specific domain knowledge of the system. Each time, the team randomly selects one of the uninvestigated microservices to investigate, until it finds the root cause.
• MonitorRank [13]: MonitorRank uses a customized random walk algorithm to identify the root cause, similar to the Personalized PageRank [28] we adopt in MicroRCA. As we only consider internal faults in microservices, we do not implement its pseudo-anomaly clustering algorithm for identifying external factors. To implement MonitorRank, we use its customized random walk algorithm to identify root causes in the extracted anomalous subgraph.
• Microscope [12]: Microscope is another graph-based method to identify faulty services in microservices environments, and it also uses Sock-shop as its benchmark. To implement Microscope, we construct the service causality graph by parsing the service-level metrics, and then use its cause inference, which traverses the causality graph from the detected anomalous service nodes, to locate root causes.

The anomaly detection methods in MonitorRank and Microscope fail to detect some anomalies, especially non-obvious ones such as those of the payment service. As our experiments aim to compare root cause localization performance, we therefore use the same results from our anomaly detection module for all methods.

Evaluation Metrics: To quantify the performance of each algorithm on a set of anomalies A, we use the following metrics:
• Precision at top k (PR@k) denotes the probability that the top k results given by an algorithm include the real root cause. A higher PR@k score, especially for small values of k, means that the algorithm identifies the root cause more accurately. Let R[i] be the rank of each candidate cause and v_rc be the set of root causes. More formally, PR@k is defined on a set of given anomalies A as:

    PR@k = (1/|A|) Σ_{a∈A} [ Σ_{i≤k} (R[i] ∈ v_rc) / min(k, |v_rc|) ]    (5)

• Mean Average Precision (MAP) quantifies the overall performance of an algorithm, where N is the number of microservices:

    MAP = (1/|A|) Σ_{a∈A} Σ_{1≤k≤N} PR@k    (6)
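These metrics are straightforward to compute from a ranked list, as sketched below. The sketch interprets the inner sum of Equation 5 as the hits within the top k results and adds a 1/N normalization to MAP so that it stays in [0, 1]; both are assumptions where the extracted formulas are ambiguous.

```python
def pr_at_k(cases, k):
    """PR@k over a set of cases.

    cases: list of (ranking, root_causes) pairs, where ranking is the ordered
    output of a localization algorithm and root_causes is a set of services.
    """
    total = 0.0
    for ranking, root_causes in cases:
        hits = sum(1 for s in ranking[:k] if s in root_causes)
        total += hits / min(k, len(root_causes))
    return total / len(cases)

def mean_average_precision(cases, n_services):
    """Mean of PR@k over k = 1..N, normalized by N."""
    return sum(pr_at_k(cases, k) for k in range(1, n_services + 1)) / n_services

# Toy usage: one case where the true root cause is ranked second.
cases = [(["orders", "payment", "carts"], {"payment"})]
print(pr_at_k(cases, 1), pr_at_k(cases, 3))   # 0.0 1.0
```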
B. Experimental Results

Figure 3 shows the results of our proposed root cause localization method for the different types of faults and Sock-shop microservices. We observe that all services achieve a MAP in the range of 80%-100% and an average PR@1 over 80%, except shipping and payment.

Table IV. Performance of MicroRCA.

Fault / metric  front-end  orders  catalogue  user  carts  shipping  payment  average
Latency
  PR@1          1.0        1.0     1.0        0.6   1.0    0.6       1.0      0.89
  PR@3          1.0        1.0     1.0        1.0   1.0    1.0       1.0      1.0
  MAP           1.0        1.0     1.0        0.91  1.0    0.89      1.0      0.97
CPU Hog
  PR@1          -          1.0     1.0        1.0   0.8    1.0       0.6      0.9
  PR@3          -          1.0     1.0        1.0   1.0    1.0       1.0      1.0
  MAP           -          1.0     1.0        1.0   0.94   1.0       0.89     0.97
Memory Leak
  PR@1          -          1.0     0.8        1.0   1.0    0.8       0.8      0.9
  PR@3          -          1.0     1.0        1.0   1.0    1.0       1.0      1.0
  MAP           -          1.0     0.97       1.0   1.0    0.94      0.94     0.975

The average precision of shipping and payment is lower than for the other services, likely for three reasons. First, payment is a non-compute-intensive service; even though we exhaust its CPU and memory resources, its response time is scarcely impacted. Second, there are few requests to shipping and payment, and they do not call any other services, which makes their response time increase less obvious than for the other faulty services. Third, in order to detect anomalies in these experimental cases, we use a small threshold in the anomaly detection module, which causes more false alarms.

Table IV details the performance of MicroRCA for the different types of faults and microservices. It shows that MicroRCA achieves almost 90% in terms of PR@1 and effectively locates all root causes within the top three ranked services.

C. Comparisons

To evaluate the performance of MicroRCA further, we apply it and the baseline methods to all experimental cases. We compare the overall performance of all methods as well as their performance in identifying different types of faults. Table V shows the performance, in terms of PR@1, PR@3 and MAP, of all methods. We observe that MicroRCA outperforms the baseline methods overall. In particular, MicroRCA achieves a precision of 89% and a MAP of 97%, which are at least 13% and 15% higher than the baseline methods, respectively. In general, Microscope performs well for CPU hog and memory leak faults, where the anomalies are detected correctly, but worse for the latency fault, where more false alarms are detected. MicroRCA, in contrast, performs well for all types of faults, and achieves an average improvement of 24% in MAP compared to MonitorRank and Microscope.

Next, we compare the performance of each method on different microservices. Figure 4 shows the comparison results, in terms of PR@1, PR@3 and MAP, for the different services. We can see that MonitorRank performs well in identifying dominating nodes with large degrees, like the microservice orders. However, it fails to identify leaf nodes, like the microservice payment. This is because MonitorRank calculates the similarity based on the correlation between front-end services and back-end services. The anomaly of the microservice payment weakens this correlation during propagation and thus the root cause localization fails. On the contrary, Microscope performs better in identifying leaf nodes, such as the microservice payment,
Figure 3. The result of PR@1 and MAP per microservice, for the latency, CPU hog and memory leak faults.
Table V. Performance of each algorithm.

Table VI. The overhead of MicroRCA.

Figure 4. The comparison results of PR@1, PR@3 and MAP for different microservices (RS, MonitorRank, Microscope and MicroRCA).
Figure 5. Root cause rank against anomaly detection F1-score (MonitorRank, Microscope and MicroRCA).
Figure 6. Performance against anomaly detection confidence (α).

Figure 7. Performance against anomaly detection threshold.