
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

Anomaly Detection and Classification using
Distributed Tracing and Deep Learning

Sasho Nedelkoski*, Jorge Cardoso†, Odej Kao*
*Complex and Distributed IT-Systems Group, TU Berlin, Berlin, Germany
Email: {nedelkoski, odej.kao}@tu-berlin.de
†Huawei Munich Research Center, Huawei Technologies, Munich, Germany
Email: jorge.cardoso@huawei.com

Abstract—Artificial Intelligence for IT Operations (AIOps) combines big data and machine learning to replace a broad range of IT Operations tasks including availability, performance, and monitoring of services. By exploiting log, tracing, metric, and network data, AIOps enable detection of faults and issues of services. The focus of this work is on detecting anomalies based on distributed tracing records that contain detailed information for the availability and the response time of the services. In large-scale distributed systems, where a service is deployed on heterogeneous hardware and has multiple scenarios of normal operation, it becomes challenging to detect such anomalous cases. We address the problem by proposing unsupervised, response time anomaly detection based on deep learning data modeling techniques; an unsupervised dynamic error threshold approach; a tolerance module for false positive reduction; and descriptive classification of the anomalies. The evaluation shows that the approach achieves high accuracy and solid performance in both the experimental testbed and a large-scale production cloud.

Index Terms—AIOps; anomaly detection; service reliability; time series; distributed tracing; autoencoders; RNNs; GRUs; CNNs.

I. INTRODUCTION

The increasing number of IoT applications with dynamically linked devices and their embedding in real-world (smart) environments drive the creation of large multi-layered systems. As a consequence, the complexity of the systems is steadily growing up to a level where it is impossible for human operators to oversee and holistically manage the systems without additional support and automation. Uninterrupted services with guaranteed latency, response times, and other QoS parameters are, however, a mandatory prerequisite for many of the data-driven and autonomous applications. Therefore, losing control is not a feasible option for any system or infrastructure.

The large service providers are aware of the need for always-on, dependable services and have already deployed numberless measures by introducing additional intelligence to the IT ecosystem: for example, by employing network reliability engineers (NRE) and site reliability engineers (SRE), by using automated tools for infrastructure monitoring, and by developing tools based on artificial intelligence (AIOps) for load balancing, capacity planning, resource utilization, storage management, and anomaly detection.

The next piece in the puzzle aims at rapidly decreasing the reaction time in case an urgent activity of a system administrator is necessary. That usually involves performance problems, component/system failures (e.g., outages, degraded performance), or security incidents. All these examples describe situations where the system operates outside of the normal, expected or pre-defined behaviour. Thus, the system exposes an anomaly that must be detected and recognized before it leads to a service or a system failure.

The foundation for AIOps systems is the availability of suitable and descriptive data, which is typically observed by three core components: tracing, logging, and resource monitoring metrics. The tracing component produces events (spans) containing information on the execution path and response time. The logging data represents interactions between data, files, or applications and is used to analyze specific trends or to record events/actions for later forensics. The resource monitoring data reflects the current utilization and status of the infrastructure, typically as cross-layer information regarding CPU, memory, disk, and network throughput and latency. Most of the current AIOps platforms apply deep learning solely on monitoring data [1], [2], as this data is simple to collect and interpret, but not sufficient for a holistic approach. We aim at exploring an additional path for anomaly detection using a second category of data, namely the tracing data collected during the execution of system operations. Tracing technologies [3]–[5] generate events to externalize the state of the system by combining performance data from the end-to-end execution path with structured and causally related execution traces. We are confident that such data can improve the anomaly detection, root-cause analysis, and remediation in the system. It contains detailed information for individual services and the causal relationship to other related services that form part of the trace.

The focus of the study is to tackle the problem of anomaly detection in real-world tracing data. It faces several challenges, including the lack of labeled data, concept drift, and concept evolution. Other major sources of difficulty emerge due to the low signal-to-noise ratio, the presence of multiple frequencies and multiple distributions, and the large number of distinct time series generated by microservice applications. The signal-to-noise ratio is typically very low as many different components affect the response time of microservices, such as switches, routers, memory capacity, CPU performance, programming languages, thread and process concurrency, bugs, and volume of user

978-1-7281-0912-1/19/$31.00 ©2019 IEEE   DOI 10.1109/CCGRID.2019.00038
requests. Multiple frequencies are correlated with system and user behaviour since request patterns are different, e.g., from hour to hour due to business operation hours, from day to day due to system maintenance tasks, and from month to month due to cyclic patterns. For these reasons, the utilization of unsupervised approaches is required. In such scenarios, where anomaly detection is being used as a diagnostic tool, a degree of additional description is required. Identifying the potential anomaly in the service is of limited value for the operators without having a more detailed explanation.

Contributions. We provide an algorithm that adapts and extends deep learning methods from various domains. This work focuses on anomaly detection from tracing data in large-scale distributed systems, but can also be used in other applications involving anomaly detection on time series data containing multiple normal operating scenarios. We show the capability of Auto-Encoding Variational Bayes (variational autoencoder, AEVB) to learn multiple complex distributions representing normal behavior over a longer period of time and to detect anomalies by employing a dynamic, probability-based error threshold setting. Furthermore, we propose a combination of the threshold setting and post-processing that aims to reduce the number of false positives. Lastly, we present a classification module that provides descriptions for the detected anomalies.

The remainder of the paper is structured as follows. In Section II, we provide the related work for the field of anomaly detection. In Sections III and IV, we present the preliminary knowledge and our proposed methodology. Section V summarizes the performance evaluation in terms of speed and accuracy for different types of anomalies on both experimental and real-world production cloud data.

II. RELATED WORK

Tracing technologies for distributed services record information about all the individual components participating in, e.g., a user request (initiator) within the system. Two classes of solutions have been proposed to aggregate this information so that one can associate all record entries with a given initiator: black-box and annotation-based monitoring schemes [3]. Black-box schemes [6]–[8] assume there is no additional information other than the message record described above, and use statistical regression techniques to infer that association. Annotation-based schemes [5], [8]–[10] rely on applications or middleware to explicitly tag every record with a global identifier that links these message records back to the originating request. We use an annotation-based system (Zipkin, based on Dapper [3]), which relies on proper service instrumentation.

While anomaly detection on other categories of data like logs and metrics is part of previous research [1], [2], [11]–[15], the related work on time series and structural anomaly detection in trace data is still limited.

Anomaly detection for services has been studied exhaustively over many years on different kinds of data. In general, we distinguish between statistical and machine learning methods. The machine learning approaches can be divided into two general categories [1], supervised [16]–[19] and unsupervised [20]–[23].

Vallis et al. [24] proposed a novel approach, which builds on the Extreme Studentized Deviate test (ESD), for detecting anomalies in long-term time series data. The approach requires the detection of the trend component. This technique is similar to most of the statistical methods, which have limitations when applied to large systems based on service-oriented and microservice architectures. These systems produce time series data with high noise and with more than a single normal behavior in the signal. Specifically, if the time series has more than two different normal (expected) scenarios of operation, the algorithm would not be able to capture this information.

Supervised methods use labeled data to train machine learning models. The anomaly detection algorithms are classification models trained on data containing the information whether a data point is an anomaly or not. For practical usage, the labelling by experts or injection of anomalies either is not sufficient (evolving time series, concept drifts) or may harm the running system. Therefore, unsupervised methods are investigated, having the positive property of performing the same task without using labeled input data.

Recently, deep learning techniques are increasingly investigated because of their success in a range of domains. In that direction, Malhotra et al. [25] used stacked recurrent hidden layers to enable learning of higher-level temporal features. They presented a model of stacked Long Short-Term Memory (LSTM) networks for anomaly detection in time series. A network was trained on non-anomalous data and used as a predictor over a number of time steps. The resulting prediction errors were modeled as a multivariate Gaussian distribution, which was used to assess the likelihood of anomalous behavior. The efficacy of this approach was also demonstrated on four datasets.

Xu et al. [26] show the usability of variational autoencoders for anomaly detection and triggering of timely troubleshooting problems on Key Performance Indicator (KPI) data of Web applications (e.g., page views, number of online users, and number of orders). They proposed Donut, an unsupervised anomaly detection algorithm based on AEVB. Furthermore, Hundman et al. [27] show the use of LSTM recurrent neural networks for spacecraft anomalies on multi-variate telemetry data.

Existing supervised and unsupervised auto-regressive approaches fail with data of small signal-to-noise ratio and autocorrelation, either by learning only the running mean or by not preserving the order in the time series. In a similar direction as Xu et al. and Hundman et al., we combine the methods from both and show that the integration of Gated Recurrent Units (GRUs, simplified LSTMs) with a variational autoencoder produces results which are able to meet the accuracy and performance requirements. Of course, the inclusion of preprocessing and postprocessing improves the accuracy by reducing the amount of false positives.
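The error-modeling idea shared by several of these approaches — fit a Gaussian to the prediction (reconstruction) errors observed on normal data, then flag points whose errors are improbable under it — can be sketched with the standard library alone. This is an illustrative univariate sketch, not the paper's implementation; the function names and the tail probability `alpha` are ours:

```python
import math

def gaussian_tail(x, mu, sigma):
    """P(X > x) for X ~ N(mu, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def fit_error_model(validation_errors):
    """Model reconstruction errors on non-anomalous data as a Gaussian."""
    n = len(validation_errors)
    mu = sum(validation_errors) / n
    var = sum((e - mu) ** 2 for e in validation_errors) / n
    return mu, math.sqrt(var)

def is_anomalous(error, mu, sigma, alpha=0.01):
    """Flag an error whose upper-tail probability under the fitted Gaussian is below alpha."""
    return gaussian_tail(error, mu, sigma) < alpha

# Errors seen on normal validation windows cluster tightly around 0.05 ...
mu, sigma = fit_error_model([0.04, 0.05, 0.06, 0.05, 0.04, 0.06])
# ... so a typical error passes and a much larger one is flagged.
print(is_anomalous(0.05, mu, sigma))  # False
print(is_anomalous(0.50, mu, sigma))  # True
```

The same principle, with a dynamic threshold learned from a validation set, underlies the approach proposed in Section IV.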

III. PRELIMINARIES

Systems based on microservices or service-oriented architectures consist of several services connected by a network, providing a larger system application. To monitor the user requests with a detailed description of the different participating microservices, distributed tracing technologies are utilized.

A trace T = (E_0, E_1, ..., E_l) is represented by an enumerated collection of events. Each event is represented by key-value pairs (k_i, v_i) describing the state, performance, and other characteristics of a service at a given time t_j. The events contain a timestamp when the particular service was invoked, a response time, and an HTTP URL among other meta information (e.g., host IP, service name, etc.). Depending on the request, traces can have different lengths and services invoked. The response time is one of the most important attributes of the event; e.g., if its value, which characterizes the intra-service calls in the system, suddenly increases at times t_{j+1}, t_{j+2}, it may indicate a problem with the underlying distributed system.

Let us assume that we observe two traces: T_1 = (E_{url_a1}, E_{url_b1}, E_{url_c1}) and T_2 = (E_{url_a2}, E_{url_b2}, E_{url_c2}). Each HTTP URL recorded in the platform is a source of events. Events of the same type are clustered together by their HTTP URL to form a time series TS_1 = (E_{a1}, E_{a2}).

Further, let us assume that we record the following two HTTP URLs from two events:
• 1.1.1.5/tag_1/group_id1/tag_2/service1
• 1.1.1.6/tag_5/group_id2/tag_4/service1

We use a regular expression for each of the events having an HTTP URL of the form {host IP}/{tag_id}/{group_id}/{tag_2}/service1 and assign them to the same cluster. The same procedure is done with other events, such as those that belong to {host}/groups/{group_id}/logs. We name such groups endpoints, and to each we assign a cluster ID represented by the regular expression. The time series formed by the groups of events, having the properties explained in Section I, are used to study the dynamics of the system and to detect anomalies.

Anomaly detection on time series consisting of the service's response time can be formulated as follows: for any time t, given historical observations x_t = {e_{t−w}, e_{t−w+1}, ..., e_t}, where w is the sliding window size and e_t is the event's response time at time t, determine whether an anomaly occurs or not (1/0). We use a sliding window to break the time series into fixed-size inputs, required for the autoencoder. The sequential order of the points inside the window is important. Therefore, we combine the AEVB with the ability of RNNs to extract temporal information from sequential data.

An anomaly detection algorithm typically computes a real-valued score indicating the certainty of having an anomaly, e.g., p(anomaly = 1 | x_t), instead of directly computing whether the window represents an anomaly.

A. Variational autoencoder for anomaly detection

An autoencoder is an unsupervised neural network architecture. It applies backpropagation like the standard feedforward neural network, setting the output (target) value to be equal to the inputs, i.e., y_t = x_t [28]. The identity function seems a trivial function to learn, but by placing some constraints, such as limiting the number of hidden units or applying regularization, interesting features can be extracted from the data. A typical architecture of an autoencoder is shown in Figure 1, where h is called the latent representation or bottleneck of the autoencoder.

Fig. 1. Architecture of an autoencoder network. The x and y can be of any type; in this paper x = y is time series data.

By training on non-anomalous data, the autoencoder learns and is able to produce good reconstructions of new non-anomalous samples (low error). If we use the same model to predict the reconstruction for an anomalous sample, then the reconstruction error will be larger.

A variational autoencoder (AEVB) [29] is a deep neural network architecture that can learn complex representations from data without supervision. The AEVB is composed of an encoder and a decoder, both neural networks, and a loss function. Instead of mapping the input vector onto a fixed vector as in the usual autoencoders, the model maps any input into a predefined distribution. Moreover, the bottleneck vector in the variational autoencoder is replaced by two vectors of the same size, one representing the mean and the other the variance of the distribution. So, whenever we need the output of the encoder in order to feed it into the decoder network, we need a sample from the distribution defined by the mean and standard deviation vectors that represent the latent low-dimensional space. Let us assume that we have a dataset X of samples from a distribution parametrized by a ground truth generative factor. The variational autoencoder aims to learn the marginal likelihood of the data in a generative process:

maximize_{φ,θ} E_{q_φ(z|x)} [log p_θ(x|z)]   (1)

where φ and θ parametrize the distributions of the VAE encoder and decoder, respectively. Furthermore, the complete loss function is given by:

L(θ, φ; x, z) = E_{q_φ(z|x)} [log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z))   (2)

The loss function, as written in (2), consists of two terms. The first term represents the reconstruction loss, which is part

of any autoencoder architecture, except that we have the expectation operator because we are sampling from the distribution. The second term is the Kullback–Leibler divergence, which ensures a close mapping to a predefined distribution.

Recently, there is an increasing adoption of unsupervised, generative machine learning models for anomaly detection. Similarly, the variational autoencoder (AEVB) first learns the normal scenario (one, or many) [26]. Then, conditioned on its input, it is able to generate reconstructions. By setting a threshold on the reconstruction error, we are able to classify a given window of response times as anomalous or normal.

B. Recurrent variational autoencoder

Recurrent neural networks (RNNs) [30] are a type of neural network where the connections between neurons form a directed cycle. They are capable of learning features and long-term dependencies from sequential and time-series data. A typical architecture of an RNN is shown in Figure 2. Each step in the unfolding is referred to as a time step, where x_t is the input at time step t. RNNs can take an arbitrary-length sequence as input, by providing the RNN a feature representation of one element of the sequence at each time step. s_t is the hidden state at time step t and contains information extracted from all time steps up to t. The hidden state s is updated with information of the new input x_t after each time step: s_t = f(U x_t + W s_{t−1}), where U and W are weight matrices and f is the non-linear activation function. The most used RNN types in practice are RNNs with LSTM (Long Short-Term Memory) [31] or GRU (Gated Recurrent Unit) [32] cells, which we use in this paper as well.

Fig. 2. Architecture of an RNN, with a_t = b + W s_{t−1} + U x_t, s_t = tanh(a_t), and the output o_t computed from the hidden state s_t.

A recurrent variational autoencoder [33] is a combination of the AEVB and an RNN. The encoder is a recurrent neural network (RNN) that processes the input sequence x_t = {e_{t−w}, e_{t−w+1}, ..., e_t} and produces a sequence of hidden states {h_{t−w}, h_{t−w+1}, ..., h_t}. The parameters of the distribution over the latent code are then set as a function of h_t. The decoder uses the sampled latent vector z to set the initial state of a decoder RNN, which produces the output sequence y = y_1, y_2, ..., y_t. The model is trained both to reconstruct the input sequence and to learn an approximate posterior close to the prior, as in a standard variational autoencoder.

IV. RESPONSE TIME ANOMALY DETECTION

The following methods form the core components of our unsupervised anomaly detection approach for microservice or service-oriented systems observed by distributed tracing technology. First, the time series data is preprocessed and a neural network model is trained on it to capture the normal system behavior. Based on this model, the predictions for the reconstruction are obtained. Then, a probability-based, adaptive threshold method is used to determine whether the resulting prediction errors represent anomalies for individual services. Further, a post-processing strategy, incorporated in a tolerance module, is used to mitigate false positives. Lastly, we provide anomaly pattern classification to provide descriptive and useful analysis results. We divide the proposed methods into four core steps or modules that exchange results in-between:
• time series preprocessing
• model training
• test-time prediction
• faulty pattern classification

For simplicity, we will describe the methods through the lens of a single time series. Given K time series, the solution scales since it is meant to be applied to every time series in parallel.

A. Time series preprocessing

This step involves two parts: preprocessing in model training and in test-time prediction. The module queries the latest N data points (events) belonging to the same cluster ID (time series) and forwards them into a three-stage pipeline: data cleaning, normalization, and noise reduction.

Tracing events are JSON objects, but depending on the service instrumentation they might have a slightly different structure. Common to all is the response time, which is extracted for further processing. We assume that most of the time the services in the system are in a normal mode of operation. That is true in real-world systems, where failures happen rarely. However, the large amount of events in the time series and the fact that proper training of neural networks requires normalization lead to the obligation of having an outlier removal technique. The presence of a strong outlier will lead to values clamped to zero after the normalization. Therefore, events having a response time greater than three standard deviations from the mean are removed from the training batch. Next, we normalize the values by using min-max scaling to (0, 1) to ensure that we stay in the positive range of values for the response time. In contrast, using standardization might produce negative values that do not have a natural meaning when we deal with response time (no negative time). Normalization is required and makes the optimization function of the neural network well-conditioned, which is key for convergence [34]. Min-max normalization is given by the following equation:

X_{t,scaled} = (X_t − min(X)) / (max(X) − min(X))   (3)

where min(X) and max(X) are saved and then used for the normalization in test-time prediction. Lastly in the pipeline,

we apply smoothing for noise removal and robustness to small deviations. The time series is convolved with a Hamming smoothing filter, defined with its optimal parameters [35] and size M as:

w(n) = 0.54 − 0.46 cos(2πn / (M − 1)),  0 ≤ n ≤ M − 1   (4)

We use smoothing with a window size of M = 12, but one can adjust the size depending on the noise.

For test-time prediction, the preprocessing is executed on each new recorded event. During test time, the event follows the same preprocessing steps as for model training, except for the normalization, where min(X) and max(X) are the values saved during the model training part.

Time series partitioning: After the steps in the preprocessing, we define window size, which represents the number of points in a sliding window that need to be considered for evaluation. The window with the predefined size and stride is applied to the time series. This results in training data of shape (N − window size, window size, 1). The data in such format is then fed into the neural network for training. In test-time prediction, each window size number of events is fed to the network for prediction.

B. Model architecture

The architecture of our proposed neural network is shown in Figure 3 and described in the following.

Fig. 3. Model architecture. Input and output are both windows of the form (1st response time, 2nd response time, ..., window size-th response time).

Input layer: has window size number of units, each containing the response time as input. In Figure 3, we use window size = 32.

First hidden GRU layer: contains (window size/2) = 16 GRU cells for each timestep in the input window. Each of the window size input units is fed to the corresponding GRU. In the first timestep T = 0, the 0th response time is fed. The abstract representation learned in the 16 GRU cells is then propagated to the next timestep T = 1, where the 1st response time of the window is fed, and so on. Here, we have the ability to condition the reconstruction of the next point given the past points, such that in the last timestep we have an abstract representation of the window of points, which carries salient information for that part of the time series.

Sampling layer: represents the key part that enables learning multiple distributions (a model of models). This layer consists of (window size/4) = 8 units for the mean and for the variance. The sampling layer just performs sampling from a multivariate normal distribution with the given mean and variance.

Repeat layer: repeats the sampling layer window size times, which is needed to feed the last hidden (GRU) layer.

Output/GRU layer: here, the network takes the output from the previous layer as input, learns an abstract representation, and as output has the same window size number of input timesteps, only with the response time as feature.

1) Training details: We observed that the required number of data points in a particular time series used to produce a good model in training should be more than 1000. The training data is split into two parts in sequential order. The smaller part, or 20%, goes for estimating parameters and tuning the model. We train the model for 1000 epochs and choose the one with the best validation score. The solution uses the Adam optimizer with a learning rate of 0.001, which are the standard values for training deep neural networks [28]. As mentioned, the error function which we optimize is described in Section III. As a last step, when the training is finished, the model is saved and used in test-time prediction.

2) Dynamic error threshold: The difference between a prediction and an observed parameter value vector is measured by the mean square error (MSE), which is given by the following equation:

MSE = (1/n) Σ_{i=1}^{n} (x_i − y_i)²   (5)

Instead of setting a magic error threshold for anomaly detection purposes, we use the validation set for threshold setting. For each window/sample in the validation set, we apply the model produced by the training set and calculate the MSE between the prediction (reconstruction) and the actual sample. At every time step, the errors between the predicted vectors and the actual ones in the validation group are modeled as a Gaussian distribution. Assume that the validation data has 1000 windows with window size = 32. The MSE for all of them will produce an array of 1000 error values. Next, we estimate the mean and the variance of the MSE scores and save them on disk along with the model. These values are used in test-time prediction. In test-time prediction, if the error between a reconstructed and an observed window of events is within a high-level confidence interval of the above
Gaussian distribution is considered as normal, otherwise as D. Faulty pattern classification
anomaly.
Identifying the existence of an anomaly without providing
C. Test-time prediction any insight into its nature is of limited value. The user may
be interested in detecting particular types of anomalies which
Previously, we already showed the architecture, model train­ reflect in the time series (e.g., incremental, mean shift, gradual
ing and dynamic threshold setting. After having the trained increase, cylinder etc.). Here, we expect that an expert knows
model, this module receives data from the preprocessing the types of patterns that commonly lead to service or com­
module described previously. The latest model along with the ponent failure. Therefore, we provide a module based on one
saved training parameters are loaded, and used for prediction. dimensional convolutional neural networks that, given as input
For each new event, the past values forming a window a window of the event response time (e.g., 32 events), is able to
x t = {et- w,e t- w+ i, ...,et } are fed as input for prediction. classify into one of the user defined patterns described before.
The reconstruction error M SEtest and the probability under Convolutional Neural Networks (CNNs) [36] are common
the Gaussian are computed: deep learning technique for image, signal processing, sequence
l\r ,i — 1 - P ( X > M S E test) (6) and time series classification tasks. Typically, the architecture
of the CNN consists combination of convolutional and max­
Remembering that the parameters of the probability density pooling layers followed by softmax layer that distributes the
function p and a 2 are computed as parameters in the training probability for a given pattern. Recently, similar work for time
step. series classification is found in [37].
1) Tolerance: false positive reduction: In large-scale sys­ Similar to the preprocessing module, the latest N points
tem architectures, there are thousands of events recorded in from time series are queried and utilized to represent the
short period of time and there are cases where it might happen normal class. After that, we create a dataset which contains
that the response time is greater compared to the expected normal data preprocessed using the preprocessing module
time. If it is only a single anomalous point in the time series described and added to the augmented samples of predefined
or even a few of them with increased service response time, does not mean that anything is wrong with the service. For example, it can be a small bottleneck in the disk usage or in one of the many components or services. Aiming to detect anomalies that have a larger impact enables the DevOps team to pay attention only to the most critical potential failures.

We define the tolerance and the probability error threshold as parameters. The tolerance represents the allowed number of anomalous windows that have P(x) greater than the probability error threshold before the whole period is flagged as anomalous. In practical scenarios, the tolerance parameter usually ranges from 1 to 100, but it depends on the dynamics of the system. The probability outputs P_test are kept in a queue of the same size (tolerance) for each new window. Each time a new sample is shown to the network to be reconstructed, assigned a probability of being anomalous, and added to the queue, the tolerance module checks whether the average probability

    P_m = (1 / tolerance) * Σ_{i=1}^{tolerance} P_test(i)    (7)

of all the points in the queue is greater than the error threshold. If this is the case, the submodule flags this part of the time series as unstable and reports an anomaly. In this way, we can deal with the problem of having too many false positives and allow users to set the sensitivity of the algorithm according to their needs. The output of the whole module is: (first anomaly window timestamp, last anomaly window timestamp). In our setting, we used window size = 32, a Hamming smoothing window with M = 12, a confidence interval under the Gaussian (error threshold = 0.99), and a queue size of tolerance = 32 windows.

patterns (translation and adding a small amount of noise) by the user.

The model architecture consists of three (convolutional, max-pooling) layers with dropout (0.5) regularization. The last layer, typical for multi-class classification, is fully connected with a softmax function for computing the probability distribution over the classes. Convolutional networks are naturally invariant to translation, which makes them suitable for faulty pattern detection with a sliding window over the time series. The network is trained using the data described, and the model is saved and used for prediction. We again used the Adam optimizer with optimal parameters obtained via cross-validation (learning rate = 0.001, number of epochs = 200, and batch size = 1000). The classifier triggers when the test-time prediction detects an anomaly. The classifier module receives the output from test-time prediction and requests the particular time series within the provided anomalous time interval. Next, using the trained model, we map each sliding window to the predicted class; if the particular pattern is recognized, the module outputs the name of the class to which the pattern belongs and flags the interval as anomalous.

V. Evaluation

The deep learning methods are implemented in Python using Keras [38]. The evaluation on the collected datasets was conducted on a regular personal computer with the following specifications: an NVIDIA GTX 1060 6 GB GPU, a 1 TB HDD, a 256 GB SSD, and an Intel(R) Core(TM) i7-7700HQ CPU at 2.80 GHz.

In this section, we show evaluations of the separate modules on two datasets. First, we show results on the experimental testbed system based on a microservice architecture, and later the evaluation on real large-scale production cloud data.
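The tolerance mechanism built around Eq. (7) can be illustrated with a minimal sketch; the function name and the boolean-returning interface are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

def make_tolerance_checker(tolerance=32, error_threshold=0.99):
    # Keep the last `tolerance` per-window anomaly probabilities P_test(i).
    queue = deque(maxlen=tolerance)

    def check(p_test):
        # Append the newest window probability; once the queue is full,
        # flag the period as anomalous if the mean (Eq. 7) exceeds the
        # probability error threshold.
        queue.append(p_test)
        if len(queue) < tolerance:
            return False
        return sum(queue) / tolerance > error_threshold

    return check
```

With the paper's setting (tolerance = 32, error threshold = 0.99), a single high-probability window does not trip the detector; only a sustained run of anomalous windows raises the queue average above the threshold.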

246
Fig. 4. Multiple distributions: We notice the existence of multiple distributions that need to be learned as normal. The distribution of P1 differs from those of P2 and P3.
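A minimal sketch of the Hamming smoothing window (M = 12) named in the settings above; the exact window variant and the moving-average normalization are assumptions, since the paper only names the window type and its length:

```python
import math

def hamming_window(M):
    # Symmetric Hamming window of length M (assumed standard definition:
    # w[n] = 0.54 - 0.46 * cos(2*pi*n / (M - 1))).
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]

def smooth(series, M=12):
    # Weighted moving average of the response-time series with the
    # Hamming kernel, normalized so a constant series stays constant.
    w = hamming_window(M)
    norm = sum(w)
    return [
        sum(wi * x for wi, x in zip(w, series[i:i + M])) / norm
        for i in range(len(series) - M + 1)
    ]
```

Smoothing with a tapered kernel like this attenuates the high noise levels visible in Fig. 4 while preserving sustained level shifts.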

Fig. 5. Experimental microservice system architecture. (The figure shows Go, Python, and Java/Tomcat services, two instances each, running with tracing enabled on three virtual nodes over two physical nodes; the injected anomalies include latency increase, process kill, network delay, and network packet loss.)

A. Experimental microservice system

We created an experimental microservice system to evaluate representative anomaly scenarios for microservice architectures. It allows fast preparation of experiments, a known data format, and precise control over the anomalies injected into the system. For the setup, we used 2 physical nodes and 3 virtual machines with tracing enabled, each of them running instances of Python, Go, and Java applications respectively. The testbed architecture is shown in Figure 5. For the anomaly injections we used stress-ng, traffic control, and simulator parameters.

We injected timed anomalies into the physical network in the form of delay and packet loss, physical node anomalies in the form of CPU stress, and event response time increases injected directly into the event. Thus, we created 6 different scenarios on different endpoints in the system, described in the following.

• Scenario 1: Baseline with no anomaly - represents the normal operation (no anomalies) of the system and is used to train the detection algorithms.
• Scenario 2: Increase service latency - profile 1 (injection of latency (1 second) for a duration of 15 seconds).
• Scenario 3: Increase service latency - profile 2 (injection of latency (1, 5, 30 seconds) for durations of 30 seconds, 1 and 10 minutes on Nodes 1, 2, 3).
• Scenario 4: Network packet loss - packet loss (10%, 20%, 30%) is injected on one of the network links for 1, 5, 10 minutes.
• Scenario 5: Network delay - network delay (1, 2, 3 sec) is injected on the network for 1, 5, 10 minutes.
• Scenario 6: Server process dies - one process is killed on nodes 1 and 3 for 1, 5, 10 minutes.

B. Dataset: production-cloud data

Even in small, controlled experimental setups, the amount of noise is high and the time series changes rapidly over time. This already poses challenges for the anomaly detection algorithm. However, testing the approach on large-scale production cloud data is required to show its viability. There, the signal-to-noise ratio is even smaller, since many components affect the response time of microservices; the time series evolves faster and changes its distribution over time while also exhibiting some stochastic behavior.

We collected the dataset over a period of four days from a real-world production cloud. We used only the attributes that are open-sourced by Zipkin [39], removing all the proprietary instrumentation. Therefore, from all of the attributes that one event has (around 80 in total, depending on the event), we extract only the HTTP URL and the response time. The number of unique HTTP URLs was 100168, and after clustering, the total number of time series (cluster IDs/endpoints) was 143. Out of them, we selected three services of interest. The anonymized names along with the count of the samples are given in Table I.

TABLE I
Selected clusters for analysis

Cluster ID                          Count
{host}/v1/{p_id}/cs/limits          12900
{host}/v1/{t_id}/cs/delete          2732
{host}/v2/{t_id}/servers/detail     6468

Fig. 6. Example of predefined patterns: mean shift, incremental, additive outlier, step and decrement, temporary change, and gradual.
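Two of the predefined patterns in Fig. 6 can be sketched as simple generators over a response-time series; the function names and the linear ramp shape of the gradual pattern are illustrative assumptions:

```python
def mean_shift(series, start, rt):
    # Mean shift pattern: from `start` on, lift the series by a fraction
    # `rt` of its maximum value (mirrors the RT_i parameter in the text).
    delta = rt * max(series)
    return series[:start] + [x + delta for x in series[start:]]

def gradual(series, start, size, rt):
    # Gradual pattern: ramp from the original amplitude to the shifted
    # amplitude over `size` consecutive events, then stay shifted.
    delta = rt * max(series)
    out = list(series)
    for k in range(size):
        if start + k < len(out):
            out[start + k] += delta * (k + 1) / size
    for i in range(start + size, len(out)):
        out[i] += delta
    return out
```

Generators of this kind make it easy to inject labeled anomalies into otherwise normal production series, which is how the classification dataset is enriched.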

Fig. 7. Detected anomalies injected for different scenarios: (a) scenario 5, and (b) scenario 6. (Both panels plot the detected anomalies over the event number.)
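The detection accuracy reported for these scenarios is the fraction of injected anomaly events that are accurately detected; a minimal sketch, where matching injected and detected events by identifier is an assumption made for illustration:

```python
def detection_accuracy(injected, detected):
    # accuracy = T_a / T_i: T_i is the number of injected anomaly events,
    # T_a the number of them that the detector accurately flags.
    t_i = len(injected)
    t_a = sum(1 for event in injected if event in detected)
    return t_a / t_i

# Example: 4 injected anomaly events, 3 of them flagged by the detector
# (a false positive such as event 250 does not enter this metric).
acc = detection_accuracy({101, 102, 103, 104}, {101, 103, 104, 250})
```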

the figure. For the evaluation, we will use part of the time series as the normal scenario for training and then inject faults to check the model accuracy in the rest of the series. The presence of strong noise leads to low autocorrelation. That means that simple sequence learning without learning the distribution (with a variational autoencoder) will result in learning only the running mean of the time series. Moreover, the distribution is negatively skewed. Therefore, the typical log transform of the data won't help the model and is omitted.

C. Results: variational recurrent model

In the following, we show the results obtained by the models on both datasets.

1) Results: microservice system: The accuracy of the unsupervised method for all 15 endpoints in our experimental testbed across 5 different scenarios is shown in Table II. We compute the accuracy in the following way. Let the number of injected anomaly events be denoted by T_i and the number of accurately detected anomalies by T_a; then the accuracy over the injected anomalies is computed as accuracy = T_a / T_i. The number of injected anomalies depends on each scenario and endpoint. The missing values in Table II mean that the anomaly did not affect those endpoints.

TABLE II
Accuracy for 15 endpoints in 5 anomaly scenarios

Clus. ID   S2     S3     S4     S5     S6
1          -      85     95     98     99
2          -      -      99     98     -
3          -      -      96     99     96
4          -      99     -      -      -
5          -      -      100    98     97
6          98     -      -      86     -
7          -      95     98     97     -
8          -      98     -      -      -
9          -      92     91     99     100
10         90     95     95     99     98
11         -      96     -      -      -
12         85     -      83     99     97
13         95     98     -      94     99
14         99     96     -      -      98
15         97     95     98     100    -
average    85.5   95.3   94.6   97.9   97.0

We note that the accuracy in the normal system scenario is greater than 99% due to the tolerance module. We show scenarios 5 and 6 graphically in Figure 7. The method successfully flags almost all anomalous events. Overall, the results shown in the table and in the figure indicate that the combination of generative models such as the variational autoencoder with GRU units that extract temporal information achieves solid results.

2) Results: production cloud data: Due to the low number of production-system errors, we injected several types of anomalies. We defined seven types of common anomalies: samples from a normal distribution with a different mean, additive outlier, mean shift, step and decrement, incremental, temporary change, and gradual. Some of these anomaly types are shown in Figure 6.

The model is trained on each of the three endpoints, and for each endpoint we compute the accuracy of detection for each of the patterns injected into the time series. To test the robustness of the algorithm, we evaluated it for several different augmentations of the original patterns, including translation, increasing the response time (e.g., RT_i = 0.2 means setting the response time of the event to 0.2 of the maximum value), and changing the size of the anomaly (e.g., in a gradual increase, size = 10 means that the increase from amplitude A to amplitude B is gradual over 10 events/data points). In Table III, we show the aggregated results. The table summarizes the different anomaly patterns and the needed minimum values
Fig. 8. Example of successfully detected anomalies injected in {host}/v1/{p_id}/cs/limits. Gradual and mean shift anomalies are injected in (a) and (b) respectively. (Both panels plot response time over the event number, with normal and anomalous points marked.)

for the parameters size and RT_i which lead to the detection of the corresponding anomaly type. Further, in Figures 8a and 8b we visually illustrate some of the patterns detected in the series. We notice that the algorithm successfully detects the three types of anomalies injected.

TABLE III
Results for the Variational Recurrent Model

Pattern name         Parameters
additive outlier     RT_i > 0.25
normal_Mean          RT_i > 0.2
temporary change     RT_i > 0.25
gradual              RT_i > 0.3, size > 10
mean shift           RT_i > 0.2
step and decrement   RT_i > 0.3, size > 10
incremental          RT_i > 0.3, size > 10

3) Results: faulty pattern classification: The module can be evaluated separately from the rest of the solution, since the data and predefined patterns are as described. The dataset consists of 15 different types of patterns similar to the ones shown in Figure 6. In practice, the user has the option to define their own patterns.

Furthermore, besides the patterns shown in Figure 6, we produce augmentations to enrich the dataset. The augmentations are produced using horizontal shifts, the addition of a small amount of noise, and amplitude shifts.

We evaluated the algorithm to see its performance and its limits with respect to the level of noise in the signal and the accuracy of classification. We achieved 100% accuracy on data with no additional noise, and 80% and 48% accuracy when Gaussian noise was added with σ = 0.05 and 0.1, respectively. The convolutional neural network model accurately classifies the tested anomaly patterns, with expectedly lower accuracy on noisy patterns.

TABLE IV
Performance evaluation in training

#windows   ms/window
60000      0.29
27000      0.28
9000       0.53
1000       1.25

4) Performance evaluation: In real-world production systems, the performance of the model in training and at prediction time is very important. Having large amounts of traces and events generated in a short period of time requires fast prediction and timely detection of anomalies. For that reason, we evaluate the performance of the approach. We show the results in Tables IV and V. Lastly, imposing industrial requirements (prediction time < 10 ms) and meeting the performance criteria shows that the approach is fast enough to be used in a production setting. In streaming test-time prediction we achieve a performance of 6.64 ms per predicted window of points. Of course, the prediction times can differ when reducing or expanding the window size, but they remain within the limits of the necessary requirements.

TABLE V
Performance evaluation in test-time prediction

#windows   ms/window
1500       0.22
1000       0.28
500        0.53
100        1.25

VI. Conclusion and Future Work

This paper deals with an important and growing challenge: the automation of operation and maintenance tasks of planet-scale IT infrastructures. We experimentally demonstrate the advantages of combining GRUs (simplified LSTMs) with variational autoencoders (AEVB) - two deep learning models - for learning multiple, complex data distributions underlying time series data generated by distributed tracing systems. Our investigation on experimental and real-world production data with artificially injected anomalies showed that our approach reaches accuracy greater than 90%, prediction time lower than 10 ms, and robust classification of detected anomalies. The tracing data was generated by an experimental microservice application and by a planet-scale cloud infrastructure. These high levels of accuracy open a new door in the field of AIOps. Namely, the approach can be extended to also consider the structure of distributed traces, use knowledge from cross-event and cross-trace relations, and analyze structural trace anomalies using complete trace information. These achievements can ultimately support the development of zero-touch AIOps solutions for the automated detection, root-cause analysis, and remediation of IT infrastructures.

References

[1] F. Schmidt, A. Gulenko, M. Wallschläger, A. Acker, V. Hennig, F. Liu, and O. Kao, "IFTM - unsupervised anomaly detection for virtualized network function services," in 2018 IEEE International Conference on Web Services (ICWS), July 2018, pp. 187-194.
[2] A. Gulenko, F. Schmidt, A. Acker, M. Wallschläger, O. Kao, and F. Liu, "Detecting anomalous behavior of black-box services modeled with distance-based online clustering," in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), Jul 2018, pp. 912-915.
[3] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a large-scale distributed systems tracing infrastructure," Google, Inc., Tech. Rep., 2010. [Online]. Available: https://research.google.com/archive/papers/dapper-2010-1.pdf
[4] J. Kaldor, J. Mace, M. Bejda, E. Gao, W. Kuropatwa, J. O'Neill, K. W. Ong, B. Schaller, P. Shan, B. Viscomi et al., "Canopy: an end-to-end performance tracing and analysis system," in Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 2017, pp. 34-50.
[5] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, "X-Trace: A pervasive network tracing framework," in Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation, ser. NSDI'07. Berkeley, CA, USA: USENIX Association, 2007, pp. 20-20.
[6] P. Reynolds, J. L. Wiener, J. C. Mogul, M. K. Aguilera, and A. Vahdat, "WAP5: black-box performance debugging for wide-area systems," in Proceedings of the 15th International Conference on World Wide Web. ACM, 2006, pp. 347-356.
[7] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang, "Towards highly reliable enterprise network services via inference of multi-level dependencies," in ACM SIGCOMM Computer Communication Review, vol. 37, no. 4. ACM, 2007, pp. 13-24.
[8] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen, "Performance debugging for distributed systems of black boxes," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 74-89, 2003.
[9] T. Gschwind, K. Eshghi, P. K. Garg, and K. Wurster, "WebMon: A performance profiler for web transactions," in Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS 2002), Proceedings of the Fourth IEEE International Workshop on. IEEE, 2002, pp. 171-176.
[10] P. Barham, R. Isaacs, and D. Narayanan, "Magpie: online modelling and performance-aware systems," in 9th Workshop on Hot Topics in Operating Systems (HotOS-IX). USENIX, May 2003, pp. 85-90. [Online]. Available: https://www.microsoft.com/en-us/research/publication/magpie-online-modelling-and-performance-aware-systems/
[11] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 1285-1298.
[12] F. Bezerra and J. Wainer, "Algorithms for anomaly detection of traces in logs of process aware information systems," Information Systems, vol. 38, no. 1, pp. 33-44, 2013.
[13] A. Brown, A. Tuor, B. Hutchinson, and N. Nichols, "Recurrent neural network attention mechanisms for interpretable system log anomaly detection," arXiv preprint arXiv:1803.04967, 2018.
[14] Mining Invariants from Console Logs for System Problem Detection. USENIX, June 2010. [Online]. Available: https://www.microsoft.com/en-us/research/publication/mining-invariants-from-console-logs-for-system-problem-detection/
[15] D. Battre, O. Kao, and D. Warneke, "Evaluation of network topology inference in opaque compute clouds through end-to-end measurements," in 2011 IEEE 4th International Conference on Cloud Computing, July 2011, pp. 17-24.
[16] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, Oct 2001.
[17] M. Joshi, R. Agarwal, and V. Kumar, "Mining needle in a haystack: classifying rare classes via two-phase rule induction," ACM SIGMOD Record, vol. 30, no. 2, pp. 91-102, 2001.
[18] M. V. Joshi, R. C. Agarwal, and V. Kumar, "Predicting rare classes: Can boosting make any weak learner strong?" in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002, pp. 297-306.
[19] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Special issue on learning from imbalanced data sets," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1-6, 2004.
[20] H. Fichtenberger, M. Gille, M. Schmidt, C. Schwiegelshohn, and C. Sohler, "BICO: BIRCH meets coresets for k-means clustering," in European Symposium on Algorithms. Springer, 2013, pp. 481-492.
[21] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 7, pp. 881-892, 2002.
[22] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413-422.
[23] L. M. Manevitz and M. Yousef, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, no. Dec, pp. 139-154, 2001.
[24] O. Vallis, J. Hochenbaum, and A. Kejariwal, "A novel technique for long-term anomaly detection in the cloud," in Proceedings of the 6th USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud'14. Berkeley, CA, USA: USENIX Association, 2014, pp. 15-15.
[25] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, "Long short term memory networks for anomaly detection in time series," in Proceedings. Presses universitaires de Louvain, 2015, p. 89.
[26] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng et al., "Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications," in Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 2018, pp. 187-196.
[27] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and T. Soderstrom, "Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD '18. New York, NY, USA: ACM, 2018, pp. 387-395.
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[29] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in International Conference on Learning Representations (ICLR), 2014.
[30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[32] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
[33] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, "A recurrent latent variable model for sequential data," in Advances in Neural Information Processing Systems, 2015, pp. 2980-2988.
[34] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, ser. ICML'15. JMLR.org, 2015, pp. 448-456.
[35] A. Nuttall, "Some windows with very good sidelobe behavior," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 1, pp. 84-91, February 1981.
[36] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.
[37] B. Zhao, H. Lu, S. Chen, J. Liu, and D. Wu, "Convolutional neural networks for time series classification," Journal of Systems Engineering and Electronics, vol. 28, no. 1, pp. 162-169, 2017.
[38] F. Chollet et al., "Keras," https://keras.io, 2018.
[39] OpenZipkin, "openzipkin/zipkin," 2018. [Online]. Available: https://github.com/openzipkin/zipkin
