LogSig: Generating System Events from Raw Textual Logs
1.1 Motivation
Each log message consists of a sequence of terms. Some of the terms are variables or parameters of a system event, such as the host name, the user name, the IP address, and so on. Other terms are plain text words describing semantic information of the event. For example, three sample log
messages of the Hadoop system [3], describing one type of event about the IPC (Inter-Process Communication) subsystem, are listed below:

1. 2011-01-26 13:02:28,335 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting;

2. 2011-01-27 09:24:17,057 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: starting;

3. 2011-01-27 23:46:21,883 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000: starting.

The three messages contain many different words (or terms), such as the date, the hours, the handler name, and the port number. People can identify them as the same event
type because they share a common subsequence: "INFO: org.apache.hadoop.ipc.Server: IPC Server: starting".
Let's consider how the three log messages are generated by the system. The Java source code for generating them is shown below:

logger = Logger.getLogger("org.apache.hadoop.ipc.Server");
logger.info("IPC Server " + handlerName + ": starting");

where logger is the log producer for the IPC subsystem. Using different parameters, such as handlerName, the code can output different log messages. But the subsequence "INFO: org.apache.hadoop.ipc.Server: IPC Server: starting" is fixed in the source code. It will never change unless the source code is modified.
Therefore, the fixed subsequence can be viewed as a signature for an event type. In other words, we can check the signatures to identify the event type of a log message. The other parameter terms in the log message should be ignored, since messages of the same event type can have different parameter terms. Note that some parameters, such as the handlerName in this example, consist of different numbers of terms. Consequently, the position of a message signature may vary across log messages. Hence, the string-matching similarity proposed in [6] would mismatch some terms. Another method, IPLoM, proposed in [18], also fails to partition these log messages using the term count, since the length of handlerName is not fixed and the three log messages have different numbers of terms.
Given an arbitrary log message, we do not know in advance which terms belong to its signature and which terms are its parameters. That is the key challenge we aim to address in this paper.
1.2 Contributions
In this paper, we first describe the drawbacks of traditional text clustering techniques for event generation from log messages. We show that it is difficult for the bag-of-words model to accurately partition log messages. We also show that the string-kernel-based approach would be inefficient when the log vocabulary size is large. In addition, we discuss some limitations of related approaches proposed in the previous literature.
Then, we propose the logSig algorithm to generate system events from textual log messages. The logSig algorithm tries to find k message signatures that match all given messages as much as possible, where k is specified by the user. We conduct experiments on five real system log data sets. The experiments show that logSig outperforms the alternative algorithms in terms of overall performance.
The rest of the paper is organized as follows: Section 2 describes the related work on system event generation from textual logs. Then, we formulate the problem of system event generation in Section 3. In Section 4, we present an overview of the logSig algorithm. Section 5 discusses some detailed implementation issues. Section 6 proposes two approaches for incorporating domain knowledge to improve the accuracy of the logSig algorithm. In Section 7, we present the experimental studies on five real system log data sets. Finally, Section 8 concludes the paper and discusses future work.

2. RELATED WORK
It has been shown in [24] that log messages are relatively short text messages but can have a large vocabulary size. This characteristic often leads to poor performance when the bag-of-words model is used for text mining on log data. The reason is that each single log message has only a few terms while the vocabulary is very large, so the vector space established on the sets of terms is very sparse. The string kernel can be used to extract deeper semantic information (e.g., the order of terms) to improve the performance of a clustering algorithm. It maps a string to a high-dimensional vector that represents all possible term orders. However, because of the large vocabulary size, the dimensionality of the transformed space would be very high. Although the kernel trick does not have to explicitly create those high-dimensional vectors, clustering algorithms would still be affected by the high dimensionality due to the curse of dimensionality [25].
The related work on log data analysis can be broadly summarized into two categories. One category is system event generation from raw log data [6] [9] [18] [26], and the other is analyzing patterns from system events [22] [30] [13] [15] [10] [29] [14]. Our work in this paper belongs to the first category. A word-matching similarity measurement is introduced in [6] for clustering log messages. One problem is that, if most terms of a log message are parameter terms, messages of this type may not have many matched common words. This method is denoted as StringMatch in this paper. [18] develops a 4-step partitioning method, IPLoM, for clustering log messages based on inherent characteristics of the log format. However, that method is only useful for strictly formatted log data. The logSig algorithm proposed in this paper can handle various types of log data without much prior knowledge about the log format.

3. PROBLEM FORMULATION
The goal of this paper is to identify the event type of each log message according to a set of message signatures. Given a log message and a set of signatures, we need a metric to determine which signature best matches the log message. Therefore, we first propose the Match Score metric.
Notations: Let D be a set of log messages, D = {X1, ..., XN}, where Xi is the i-th log message, i = 1, 2, ..., N. Each Xi is a sequence of terms, i.e., Xi = wi1 wi2 ... wini. A message signature S is also a sequence of terms, S = wj1 wj2 ... wjn. Given a sequence X = w1 w2 ... wn and a term wi, wi ∈ X
indicates that wi is a term in X. X − {wi} denotes the subsequence w1 ... wi−1 wi+1 ... wn. |X| denotes the length of the sequence X. LCS(X, S) denotes the longest common subsequence between the two sequences X and S.

Definition 1. (Match Score) Given a log message Xi and a message signature S, the match score is computed by the function below:

match(Xi, S) = |LCS(Xi, S)| − (|S| − |LCS(Xi, S)|) = 2|LCS(Xi, S)| − |S|.

Intuitively, |LCS(Xi, S)| is the number of terms of Xi matched with S, and |S| − |LCS(Xi, S)| is the number of terms of S not matched by Xi. Thus match(Xi, S) is the number of matched terms minus the number of unmatched terms. We illustrate this by a simple example below:

Example 1. Consider a log message X = abcdef and a message signature S = axcey. The longest common subsequence is LCS(X, S) = ace. The matched terms are "a", "c" and "e", shown underlined in Table 1. The terms "x" and "y" in S are not matched with any term in X. Hence, match(X, S) = |ace| − |xy| = 3 − 2 = 1.

Table 1: Example of Match Score
X: a b c d e f
S: a x c e y

Note that this score can be negative. match(Xi, S) is used to measure the degree to which the log message Xi owns the signature S. If two log messages Xi and Xj have the same signature S, we regard Xi and Xj as being of the same event type. Longest common subsequence matching is a widely used similarity metric in biological data analysis [7] [19], such as for RNA sequences.
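To make Definition 1 concrete, here is a small illustrative sketch (our own helper, not code from the paper) that computes the match score with a standard dynamic-programming LCS over tokenized term sequences:

import java.util.Arrays;
import java.util.List;

public class MatchScore {
    // Length of the longest common subsequence of two term sequences (standard DP).
    static int lcsLength(List<String> x, List<String> s) {
        int[][] dp = new int[x.size() + 1][s.size() + 1];
        for (int i = 1; i <= x.size(); i++) {
            for (int j = 1; j <= s.size(); j++) {
                dp[i][j] = x.get(i - 1).equals(s.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[x.size()][s.size()];
    }

    // match(X, S) = 2 * |LCS(X, S)| - |S|  (Definition 1).
    static int match(List<String> x, List<String> s) {
        return 2 * lcsLength(x, s) - s.size();
    }

    public static void main(String[] args) {
        // Example 1: X = a b c d e f, S = a x c e y  =>  2 * 3 - 5 = 1
        List<String> x = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<String> s = Arrays.asList("a", "x", "c", "e", "y");
        System.out.println(match(x, s)); // 1
    }
}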
3.1 Problem Statement
If all message signatures S1, S2, ..., Sk were known, identifying the event type of each log message in D would be straightforward. But we do not know any message signature at the beginning. Therefore, we have to partition the log messages and find their message signatures simultaneously. The optimal result is that, within each partition, every log message matches its signature as much as possible. This problem is formulated below.

Problem 1. Given a set of log messages D and an integer k, find k message signatures S = {S1, ..., Sk} and a k-partition C1, ..., Ck of D to maximize

J(S, D) = Σ_{i=1}^{k} Σ_{Xj ∈ Ci} match(Xj, Si).

The objective function J(S, D) is the summation of all match scores. It is similar to the k-means clustering objective. The choice of k depends on the user's domain knowledge about the system logs. If there is no domain knowledge, we can borrow the idea from the method for finding k in k-means [11], which plots the clustering results against k. We can also display the generated message signatures for k = 2, 3, ... until the results are approved by experts.

Comparison with the k-means clustering problem
Problem 1 is similar to the classic k-means clustering problem, since a message signature can be regarded as the representative of a cluster. People may ask the following questions: Why do we propose the match function to find the optimal partition? Why not use the LCS as the similarity function and run k-means clustering? The answer to both questions is that our goal is not to find good clusters of log messages, but to find the message signatures of all types of log messages. K-means can ensure that every two messages in one cluster share a subsequence. However, it cannot guarantee that there exists a common subsequence shared by all (or most) messages in one cluster. We illustrate this by the following example.

Example 2. There are three log messages X1: "abcdef", X2: "abghij" and X3: "xyghef". Clearly, |LCS(X1, X2)| = 2, |LCS(X2, X3)| = 2, and |LCS(X1, X3)| = 2. However, there is no common subsequence shared by all of X1, X2 and X3. In our setting, this means there is no message signature that describes all three log messages. Hence, it is hard to believe that they are generated by the same log message template.

3.2 Problem Analysis
Problem 1 is NP-hard, even if k = 1. When k = 1, we can reduce the Multiple Longest Common Subsequence problem to Problem 1. The Multiple Longest Common Subsequence problem is known to be NP-hard [17].

Lemma 1. Problem 1 is an NP-hard problem when k = 1.

Proof: Let D = {X1, ..., XN}. When k = 1, S = {S1}. Construct another set of N sequences Y = {Y1, ..., YN} in which every term is unique in both D and Y. Let D′ = D ∪ Y, so that

J(S, D′) = Σ_{Xj ∈ D} match(Xj, S1) + Σ_{Yl ∈ Y} match(Yl, S1).

Let S1* be the optimal message signature for D′, i.e., S1* = arg max_{S1} J({S1}, D′). Then the longest common subsequence of X1, ..., XN must be an optimal solution S1*. This can be proved by contradiction as follows. Let Slcs be the longest common subsequence of X1, ..., XN. Note that Slcs may be an empty sequence if there is no common subsequence among all messages.
Case 1: Suppose there exists a term wi ∈ S1* with wi ∉ Slcs. Since wi ∉ Slcs, wi fails to match at least one message among X1, ..., XN. Moreover, Y1, ..., YN are composed of unique terms, so wi cannot be matched with any of them. In D′, the number of messages not matching wi is therefore at least N + 1, which is greater than the number of messages matching wi. Hence

J({S1* − {wi}}, D′) > J({S1*}, D′),

which contradicts S1* = arg max_{S1} J({S1}, D′).
Case 2: Suppose there exists a term wi ∈ Slcs with wi ∉ S1*. Since wi ∈ Slcs, all of X1, ..., XN match wi, so the total number of messages matching wi in D′ is N, while the N remaining messages Y1, ..., YN do not match wi. Adding wi to a signature therefore changes the objective by N − N = 0, so

J({Slcs}, D′) = J({S1*}, D′),

which indicates that Slcs is also an optimal solution maximizing the objective function J on D′.
To sum up the two cases, if there were a polynomial-time algorithm finding the optimal solution S1* in D′, the Multiple Longest Common Subsequence problem for X1, ..., XN could be solved in polynomial time as well. However, the Multiple Longest Common Subsequence problem is NP-hard [17].
Lemma 2. If Problem 1 is NP-hard when k = n, then Problem 1 is NP-hard when k = n + 1, where n is a positive integer.

Proof-Sketch: This can be proved by contradiction. We can construct, in linear time, a message Y whose term set has no overlap with the term sets of the messages in D. Suppose the optimal solution for k = n and D is C = {C1, ..., Cn}; then the optimal solution for k = n + 1 and D ∪ {Y} is C′ = {C1, ..., Cn, {Y}}. If there were a polynomial-time solution to Problem 1 when k = n + 1, we could solve Problem 1 for k = n in polynomial time as well.
4. ALGORITHM OVERVIEW
In this section, we first present an approximated version of Problem 1 and then present our logSig algorithm. The logSig algorithm consists of three steps. The first step separates every log message into several pairs of terms. The second step finds k groups of log messages using a local search strategy, such that the messages in each group share as many common pairs as possible. The last step constructs a message signature for each message group based on the identified common pairs.

4.1 Term Pair Generation
The first step of the logSig algorithm converts each log message into a set of term pairs. For example, consider a log message collected from a FileZilla [2] client:

2010-05-02 11:34:06 Command: mkdir ".indexes"

We extract each pair of terms, preserving the order of the two terms. The converted pairs are as follows:

{2010-05-02, 11:34:06}, {2010-05-02, Command:},
{2010-05-02, mkdir}, {2010-05-02, ".indexes"},
{11:34:06, Command:}, {11:34:06, mkdir},
{11:34:06, ".indexes"}, {Command:, mkdir},
{Command:, ".indexes"}, {mkdir, ".indexes"}

The converted term pairs preserve the order information of the message terms. At the same time, computation on the discrete term pairs is easier than on a sequence. A similar idea was proposed in the string kernel [16] for text classification; however, its output is a high-dimensional vector, whereas our output is a set of term pairs.
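As an illustration of this step, the sketch below (our own helper names, not the paper's code) emits every ordered pair (wi, wj) with i < j for a tokenized message; applied to the FileZilla message above, it produces the ten pairs just listed.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TermPairs {
    // All ordered term pairs of a message, preserving the left-to-right order of the two terms.
    static List<String[]> termPairs(List<String> terms) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                pairs.add(new String[]{terms.get(i), terms.get(j)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> msg = Arrays.asList("2010-05-02", "11:34:06", "Command:", "mkdir", "\".indexes\"");
        for (String[] p : termPairs(msg)) {
            System.out.println("{" + p[0] + ", " + p[1] + "}");  // 10 pairs for a 5-term message
        }
    }
}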
objective function F (·). The reason is that, the value of
4.2 Log Messages Partition F (·) may only be updated after a bunch of movements, not
The second step is to partition log messages into k groups just after every single movement. We illustrate this by the
based on converted term pairs. following example.
4.2.1 The Approximated Version of Problem 1 Example 3. Message set D is composed of 100 “ab” and
100 “cd”. Now we have 2-partition C = {C1 , C2 }. Each
Notations: Let X be a log message, R(X) denotes the set message group has 50% of each message type as shown in
of term pairs converted from X, and |R(X)| denotes the Table 2. The optimal 2-partition is C1 has 100 “ab” and C2
number of term pairs in R(X).
Problem 2. Given a set of log messages D and an inte- Table 2: Example of two message groups
XXX
ger k, find a k-partition C = {C1 , ..., Ck } of D to maximize XX group C1 C2
term pair XXXX
objective function F (C, D):
“ab” 50 50
k “cd” 50 50
F (C, D) = | R(Xj )|. has 100 “cd”, or in the reverse way. However, beginning with
i=1 Xj ∈Ci current C1 and C2 , F (C, D) is always 0 until we move 50
“ab” from C2 to C1 , or move 50 “cd” from C1 to C2 . Hence,
Object function F (C, D) is the total number of common pairs
for first 50 movements, F (C, D) cannot guide the local search
over all groups. Intuitively, if a group has more common
because no matter what movement you choose, it is always
pairs, it is more likely to have a longer common subsequence.
0.
Then, the match score of that group would be higher. There-
fore, maximizing function F is approximately maximizing J Therefore, F (·) is not proper to guide the movement in the
in Problem 1. Lemma 4 shows the average lower bound for local search. The decision of every movement should consid-
this approximation. er the potential value of the objective function, rather than
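For small data, F(C, D) can be evaluated directly by intersecting the term-pair sets of the messages in each group. The sketch below is our own illustration (pairs are encoded as "wi -> wj" strings); it is not how logSig evaluates moves in practice, since the local search relies on the potential function introduced next.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PairObjective {
    // Term pairs of one message, encoded as "wi -> wj" strings for easy set operations.
    static Set<String> pairSet(List<String> terms) {
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i < terms.size(); i++)
            for (int j = i + 1; j < terms.size(); j++)
                pairs.add(terms.get(i) + " -> " + terms.get(j));
        return pairs;
    }

    // F(C, D) = sum over groups of the number of term pairs common to every message in the group.
    static int F(List<List<List<String>>> groups) {
        int total = 0;
        for (List<List<String>> group : groups) {
            if (group.isEmpty()) continue;
            Set<String> common = pairSet(group.get(0));
            for (List<String> msg : group.subList(1, group.size()))
                common.retainAll(pairSet(msg));
            total += common.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Two toy groups; the first shares only the pair "a -> b", the second shares none.
        List<List<String>> g1 = Arrays.asList(Arrays.asList("a", "b", "c"), Arrays.asList("a", "x", "b"));
        List<List<String>> g2 = Arrays.asList(Arrays.asList("c", "d"), Arrays.asList("e", "f"));
        System.out.println(F(Arrays.asList(g1, g2))); // 1
    }
}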
Lemma 3. Given a message group C with n common term pairs, the length of the longest common subsequence of the messages in C is at least √(2n).

Proof-sketch: Let l be the length of a longest common subsequence of the messages in C, and let T(l) be the number of term pairs generated by that longest common subsequence. Since each term pair has two terms, the sequence can generate at most (l choose 2) = l(l − 1)/2 pairs, so T(l) ≤ l(l − 1)/2. Note that each term pair of the longest common subsequence is a common term pair in C. We already know T(l) = n, so n ≤ l(l − 1)/2, and therefore l ≥ √(2n).

Lemma 4. Given a set of log messages D and a k-partition C = {C1, ..., Ck} of D, if F(C, D) ≥ y, where y is a constant, we can find a set of message signatures S such that, on average,

J(S, D) ≥ |D| · √(2y/k).

Proof-sketch: Since F(C, D) ≥ y, each group has, on average, at least y/k common pairs. Then, by Lemma 3, the length of the longest common subsequence of each group is at least √(2y/k). If we choose this longest common subsequence as the message signature, each log message matches at least √(2y/k) terms of the signature, so the match score of each log message is at least √(2y/k). D has |D| messages, so the total match score is J(S, D) ≥ |D| · √(2y/k) on average.

Lemma 4 shows that maximizing F(C, D) approximately maximizes the original objective function J(S, D). But F(C, D) is easier to optimize because it deals with discrete pairs.

4.2.2 Local Search
The logSig algorithm applies a local search strategy to solve Problem 2. It iteratively moves one message to another message group so as to increase the objective function as much as possible. However, unlike classic local search optimization methods, the movement is not explicitly determined by the objective function F(·). The reason is that the value of F(·) may only change after a batch of movements, not after every single movement. We illustrate this by the following example.

Example 3. The message set D is composed of 100 copies of "ab" and 100 copies of "cd". Suppose the current 2-partition is C = {C1, C2}, where each message group contains 50% of each message type, as shown in Table 2. The optimal 2-partition puts the 100 "ab" messages in C1 and the 100 "cd" messages in C2, or the reverse. However, starting from the current C1 and C2, F(C, D) remains 0 until we move 50 "ab" messages from C2 to C1, or 50 "cd" messages from C1 to C2. Hence, for the first 50 movements, F(C, D) cannot guide the local search, because no matter which movement is chosen, it is always 0.

Table 2: Example of two message groups
           C1    C2
"ab"       50    50
"cd"       50    50

Therefore, F(·) is not suitable for guiding the movements in the local search. The decision for every movement should consider the potential value of the objective function, rather than
the immediate value. So we develop the potential function to guide the local search instead.
Notations: Given a message group C, R(C) denotes the union of the term-pair sets of the messages in C. For a term pair r ∈ R(C), N(r, C) denotes the number of messages in C that contain r, and p(r, C) = N(r, C)/|C| is the portion of messages in C having r.

Definition 2. Given a message group C, the potential of C is defined as φ(C),

φ(C) = Σ_{r ∈ R(C)} N(r, C)[p(r, C)]².

The potential value indicates the overall "purity" of the term pairs in C. φ(C) is maximized when every term pair is contained in every message of the group. In that case, N(r, C) = |C| for each r and φ(C) = |C| · |R(C)|; it also means that all term pairs are common pairs shared by every log message. φ(C) is minimized when each term pair in R(C) is contained in only one message of C. In that case, N(r, C) = 1 for each r, |R(C)| = |C|, and φ(C) = 1/|C|.

Definition 3. Given a k-partition C = {C1, ..., Ck} of a message set D, the overall potential of D is defined as Φ(D),

Φ(D) = Σ_{i=1}^{k} φ(Ci),

where φ(Ci) is the potential of Ci, i = 1, ..., k.
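As an illustration of Definitions 2 and 3, the sketch below (our own helper names; pairs encoded as strings, as in the earlier sketches) computes φ(C) from a per-group term-pair histogram, the same histogram structure used in Section 5, and sums the group potentials to obtain Φ(D).

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Potential {
    // Histogram of a group: term pair r (encoded "wi -> wj") -> N(r, C),
    // counting each pair at most once per message.
    static Map<String, Integer> histogram(List<List<String>> group) {
        Map<String, Integer> hist = new HashMap<>();
        for (List<String> terms : group) {
            Set<String> pairsOfMessage = new HashSet<>();
            for (int i = 0; i < terms.size(); i++)
                for (int j = i + 1; j < terms.size(); j++)
                    pairsOfMessage.add(terms.get(i) + " -> " + terms.get(j));
            for (String r : pairsOfMessage) hist.merge(r, 1, Integer::sum);
        }
        return hist;
    }

    // phi(C) = sum over r in R(C) of N(r, C) * p(r, C)^2, with p(r, C) = N(r, C) / |C|  (Definition 2).
    static double phi(List<List<String>> group) {
        double size = group.size();
        double sum = 0.0;
        for (int n : histogram(group).values()) {
            double p = n / size;
            sum += n * p * p;
        }
        return sum;
    }

    // Phi(D) = sum of phi(Ci) over all k groups  (Definition 3).
    static double overallPotential(List<List<List<String>>> partition) {
        double total = 0.0;
        for (List<List<String>> group : partition) total += phi(group);
        return total;
    }
}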
Connection between Φ and F:
The objective function F counts the total number of common term pairs in each group. Both Φ and F are maximized when each term pair is a common term pair of its corresponding message group. Let us consider the average case.

Lemma 5. Given a set of log messages D and a k-partition C = {C1, ..., Ck} of D, if F(C, D) ≥ y, y is a constant, then

Algorithm 1 logSig_localsearch(D, k)
Parameters: D: set of log messages; k: the number of groups to partition into
Result: C: log message partition
1: C ← RandomSeeds(k)
2: C′ ← ∅   // last iteration's partition
3: Create a map G to store each message's group index
4: for Ci ∈ C do
5:   for Xj ∈ Ci do
6:     G[Xj] ← i
7:   end for
8: end for
9: while C ≠ C′ do
10:   C′ ← C
11:   for Xj ∈ D do
12:     i ← G[Xj]
13:     j* ← arg max_{j=1,...,k} Δ_{i→j}^{Xj} Φ(D)   // change of Φ(D) if Xj moves from group i to group j
14:     if i ≠ j* then
15:       Ci ← Ci − {Xj}
16:       Cj* ← Cj* ∪ {Xj}
17:       G[Xj] ← j*
18:     end if
19:   end for
20: end while
21: return C

Why choose this Potential Function?
Given a message group C, let g(r) = N(r, C)[p(r, C)]²; then φ(C) = Σ_{r ∈ R(C)} g(r). Since we have to consider all term pairs in C, we define φ(C) as the sum of all g(r). As for g(r), it should be a convex function. Figure 1 shows a curve of g(r) obtained by varying the number of messages having r, i.e., N(r, C). The reason why g(r) is convex is that we hope

[Figure 1: curve of g(r) as N(r, C) varies]
The proof of Lemma 6 is similar to the proof of Lemma 1. If there exists a term wj ∈ S that appears in fewer than |C|/2 messages, then J({S − {wj}}, C) > J({S}, C), so S is worse than S − {wj} and S is not optimal.
Lemma 6 indicates that we only need to consider the terms that appear in at least one half of the messages of a group. By scanning every message in a group, we can obtain the optimal sequence of those terms. Since log messages are usually very short, there are only a few terms whose occurrence count is equal to or greater than |C|/2. Therefore, there are not many candidate sequences generated by those terms. We enumerate each of them and select the best one to be the message signature.
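A simplified reading of this construction is sketched below (our own code, not the paper's exact enumeration procedure): keep the terms that occur in at least half of the group's messages, take each message's projection onto those terms as a candidate signature, and return the candidate with the highest total match score over the group.

import java.util.*;

public class SignatureBuilder {
    static int lcs(List<String> x, List<String> s) {
        int[][] dp = new int[x.size() + 1][s.size() + 1];
        for (int i = 1; i <= x.size(); i++)
            for (int j = 1; j <= s.size(); j++)
                dp[i][j] = x.get(i - 1).equals(s.get(j - 1))
                        ? dp[i - 1][j - 1] + 1 : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return dp[x.size()][s.size()];
    }

    static int match(List<String> x, List<String> s) { return 2 * lcs(x, s) - s.size(); }

    // Build a message signature for one group C.
    static List<String> buildSignature(List<List<String>> group) {
        // 1. Count in how many messages each term appears.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> msg : group)
            for (String t : new HashSet<>(msg)) df.merge(t, 1, Integer::sum);
        // 2. Keep only terms occurring in at least half of the messages (cf. Lemma 6).
        Set<String> frequent = new HashSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet())
            if (2 * e.getValue() >= group.size()) frequent.add(e.getKey());
        // 3. Each message, restricted to frequent terms, is one candidate sequence;
        //    pick the candidate with the largest total match score over the group.
        List<String> best = Collections.emptyList();
        int bestScore = Integer.MIN_VALUE;
        for (List<String> msg : group) {
            List<String> cand = new ArrayList<>();
            for (String t : msg) if (frequent.contains(t)) cand.add(t);
            int score = 0;
            for (List<String> other : group) score += match(other, cand);
            if (score > bestScore) { bestScore = score; best = cand; }
        }
        return best;
    }
}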
5. ALGORITHM IMPLEMENTATION
In this section, we discuss some detailed issues in the implementation of the logSig algorithm.
Term Pair Histogram
To efficiently compute the potential value of each message group Ci ∈ C, a histogram is maintained for each group. The histogram is implemented as a hash table whose key is a term pair and whose value is the number of messages in the group containing that term pair. Then, for a term pair r, N(r, Ci) and p(r, Ci) can be obtained in constant time.
Approximately Computing ΔΦ(D)
The straightforward way to compute Δ_{i→j}^{Xj} Φ(D), the change of Φ(D) caused by moving Xj from group Ci to group Cj, is to enumerate all term pairs in Ci and Cj. The computation cost of finding the maximum Δ_{i→j}^{Xj} Φ(D) is then O(Σ_{l=1}^{k} |R(Cl)|), which is the total number of term pairs in the entire data set. In log data, this number can be tens of times greater than |D|. Hence, this computation cost is not affordable for every single log message.
Our approximate solution is to consider only the change of N(r, C) for each r ∈ R(X). The reason is that |C| is large, so the impact on |C| of inserting or removing one message can be ignored compared with the impact on N(r, C): |C| + 1 ≈ |C| − 1 ≈ |C|. Thus |C| can be treated as a constant within one exchange operation of the local search. Then,

∂φ(C)/∂N(r, C) = ∂[N(r, C)³/|C|²]/∂N(r, C) = 3[N(r, C)]²/|C|² = 3[p(r, C)]²,

and

Δ_{i→j}^{X} Φ(D) ≈ 3 · Σ_{r ∈ R(X)} ([p(r, Cj)]² − [p(r, Ci)]²).

By utilizing the histograms, p(r, Cj) and p(r, Ci) can be obtained in constant time. Hence, for each log message X ∈ D, the time complexity of finding the largest Δ_{i→j}^{X} Φ(D) is at most O(k · |R(X)|). Although this approximation may underestimate the actual change of Φ(D), it can save a lot of computation cost.
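With one histogram per group, the approximate gain of moving one message can be evaluated in O(|R(X)|) time, as in the sketch below (our own helper names; term pairs are encoded as strings and each group's histogram maps a pair r to N(r, C)).

import java.util.List;
import java.util.Map;

public class MoveGain {
    // Approximate change of Phi(D) when message X (given by its term pairs) moves
    // from group Ci to group Cj:
    //   delta ~= 3 * sum over r in R(X) of ( p(r, Cj)^2 - p(r, Ci)^2 ).
    static double approxDeltaPhi(List<String> pairsOfX,
                                 Map<String, Integer> histI, int sizeI,
                                 Map<String, Integer> histJ, int sizeJ) {
        double delta = 0.0;
        for (String r : pairsOfX) {
            double pi = histI.getOrDefault(r, 0) / (double) sizeI;
            double pj = histJ.getOrDefault(r, 0) / (double) sizeJ;
            delta += pj * pj - pi * pi;
        }
        return 3.0 * delta;
    }
}

In the loop of Algorithm 1, the message is then moved to the group j* that maximizes this quantity, after which the two affected histograms are updated.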
5.1 Algorithm Analysis
Convergence and Local Optima
In the local search algorithm, exchanging messages between groups only increases the overall potential Φ(D); there is no operation that decreases it. Φ(D) is bounded, being less than or equal to |R(D)| · |D|. Therefore, the local search algorithm converges.
Similar to other local-search-based optimization algorithms, logSig may converge to a local optimum as well. However, logSig's potential function provides a more reliable heuristic to guide the optimization process, so it is much less likely to stop at a local optimum.
Time Complexity and Space Complexity
The time complexity of converting the log messages into term pairs is O(|D| · L²), where L is the maximum length of a log message; for every Xj ∈ D, L² ≥ |R(Xj)|. For the local search, each iteration goes through every message, so the time complexity of an iteration is O(k · L² · |D|). Let t be the number of iterations; the time complexity of the local search is then O(t · k · L² · |D|). Our experiments show that t is usually 3 to 12 for most system log data.
The space cost of the algorithm is mainly determined by the term pair histograms created for every message group. The total space complexity is the total number of term pairs, O(|R(D)|).

6. INCORPORATING DOMAIN KNOWLEDGE
In real-world applications, analyzing system behaviors mainly relies on the generated system events, so the event generation algorithm should be reliable. Current approaches for grouping log messages into different event types are totally unsupervised. In practice, for most sorts of logs (e.g., Hadoop system logs), there is some domain knowledge, such as a fixed catalog of Java exceptions. In addition, the log generation mechanisms implicitly create associations between terminologies and situations. Incorporating domain knowledge into the clustering process can improve the performance of event generation. In this section, we present two approaches for incorporating domain knowledge to improve the accuracy of the logSig algorithm.

6.1 Term Feature Layer
Some terms or words in log messages share common features that can help our algorithm identify the log message type. The features are domain knowledge from human experts. For example, "2011-02-01" and "2011-01-02" are different terms, but they are both dates; "192.168.0.12" and "202.119.23.10" are different terms, but they are both IP addresses.
The logSig algorithm allows users to provide a list of features. Each feature is described by a regular expression (the grammar of regular expressions is that of the Java regular expression library). An additional feature layer is built to incorporate those features when representing log messages. The feature layer is created by regular expression matching. For example, suppose we have the following regular expressions:

Timestamp : \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
FilePath : "(\w|\\)+"

Then we can scan the log message X1 to create the feature layer Y1, as shown in Table 3.

Table 3: Example of Feature Layer
X1: 2010-05-02 11:34:06 Command: mkdir ".indexes"
Y1: Timestamp Command: mkdir FilePath

The feature layer Y1 contains more semantic information. It can gather similar terms and reduce noisy terms. In addition, regular expressions are easy for domain experts to write. Using a sophisticated regular expression toolkit, a feature layer can be built efficiently.
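A feature layer of this kind can be sketched with java.util.regex string replacement: each feature is a (name, pattern) entry, and every matched substring is rewritten to the feature name before term pairs are generated. This is our own minimal sketch; the FilePath pattern is broadened to any quoted string so that the ".indexes" example of Table 3 matches.

import java.util.LinkedHashMap;
import java.util.Map;

public class FeatureLayer {
    // Features are applied in insertion order; earlier (more specific) patterns win.
    static final Map<String, String> FEATURES = new LinkedHashMap<>();
    static {
        FEATURES.put("Timestamp", "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");
        // Broadened from the paper's "(\w|\\)+" so that quoted names such as ".indexes" also match.
        FEATURES.put("FilePath", "\"[^\"]+\"");
    }

    // Replace every match of a feature pattern by the feature name.
    static String toFeatureLayer(String message) {
        String layer = message;
        for (Map.Entry<String, String> f : FEATURES.entrySet()) {
            layer = layer.replaceAll(f.getValue(), f.getKey());
        }
        return layer;
    }

    public static void main(String[] args) {
        String x1 = "2010-05-02 11:34:06 Command: mkdir \".indexes\"";
        System.out.println(toFeatureLayer(x1));
        // -> Timestamp Command: mkdir FilePath   (cf. Table 3)
    }
}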
6.2 Constraints on Message Signatures
In reality, domain experts usually identify the event type of a log message by scanning for keywords or phrases. For example, in SFTP/FTP logs, they would be sensitive to phrases such as "Command", "No such file", "Error", "Transfer failed", and so on. Those sensitive phrases should be included in the generated message signatures to express the system events. On the other hand, domain experts also know that some trivial phrases, such as message IDs and timestamps, should be ignored; those trivial phrases should not be included in any message signature. To sum up, this knowledge can be translated into constraints on the message signatures. These constraints can help the event generation algorithm improve its accuracy. Unlike traditional constraints in semi-supervised clustering, such as Must-Link or Cannot-Link [28], our constraints are placed on the subsequences (or phrases) of messages, not on the messages themselves.
Constraint-based logSig Algorithm
To incorporate constraints on message phrases, Problem 1 can be revised as follows:

Problem 3. Given a set of log messages D, a set of sensitive phrases PS and a set of trivial phrases PT, find k message signatures S = {S1, ..., Sk} with a k-partition C1, ..., Ck of D to maximize

J(S, D, PS, PT) = λ Σ_{i=1}^{k} Σ_{Xj ∈ Ci} match(Xj, Si) + (1 − λ) Σ_{i=1}^{k} [ Σ_{Ps ∈ PS} match(Ps, Si) − Σ_{Pt ∈ PT} match(Pt, Si) ],

where λ is a user-defined parameter between 0 and 1.

The revised optimization problem can be solved by the constraint-based logSig algorithm. The basic idea of constraint-based logSig is to increase the weight of the term pairs in R(PS) and to decrease the weight of the term pairs in R(PT). Let wr denote the weight of pair r. The only revised part of the logSig algorithm is Δ_{i→j}^{X} Φ(D). Multiplying by the weight wr, it becomes

Δ_{i→j}^{X} Φ(D) ≈ 3 · Σ_{r ∈ R(X)} ([p(r, Cj)]² − [p(r, Ci)]²) · wr,

and

wr = 1.0 if r ∉ R(PS) and r ∉ R(PT);
wr = λ′ if r ∈ R(PS) and r ∉ R(PT);
wr = 1/λ′ if r ∉ R(PS) and r ∈ R(PT),

where λ′ ≥ 1 is a user-defined parameter that can be approximately derived from the original parameter λ. λ′ is used to increase (and decrease) the importance of the term pairs in sensitive phrases (and trivial phrases). The choice of λ′ depends on the users' confidence in those sensitive phrases and trivial phrases. In the experiments, we let λ′ = 10.0. We show the performance of the algorithm when varying λ′ from 0.0 to 20.0 and explain why we choose 10 in the experimental section.
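The constraint-based variant only changes how each term pair contributes to the approximate move gain. A minimal sketch, reusing the histogram representation from the earlier sketches (helper names are ours; R(PS) and R(PT) are the term-pair sets of the sensitive and trivial phrases), is:

import java.util.List;
import java.util.Map;
import java.util.Set;

public class ConstrainedMoveGain {
    // Weight of a term pair r under the constraints (lambdaPrime >= 1, e.g. 10.0).
    static double weight(String r, Set<String> sensitivePairs, Set<String> trivialPairs, double lambdaPrime) {
        if (sensitivePairs.contains(r) && !trivialPairs.contains(r)) return lambdaPrime;
        if (!sensitivePairs.contains(r) && trivialPairs.contains(r)) return 1.0 / lambdaPrime;
        return 1.0;
    }

    // Weighted version of the approximate move gain:
    //   delta ~= 3 * sum over r in R(X) of ( p(r, Cj)^2 - p(r, Ci)^2 ) * w_r .
    static double approxDeltaPhi(List<String> pairsOfX,
                                 Map<String, Integer> histI, int sizeI,
                                 Map<String, Integer> histJ, int sizeJ,
                                 Set<String> sensitivePairs, Set<String> trivialPairs,
                                 double lambdaPrime) {
        double delta = 0.0;
        for (String r : pairsOfX) {
            double pi = histI.getOrDefault(r, 0) / (double) sizeI;
            double pj = histJ.getOrDefault(r, 0) / (double) sizeJ;
            delta += (pj * pj - pi * pi) * weight(r, sensitivePairs, trivialPairs, lambdaPrime);
        }
        return 3.0 * delta;
    }
}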
7. EVALUATION
7.1 Experimental Platforms
We implement our algorithm and the other comparative algorithms on the Java 1.6 platform. Table 5 summarizes our experimental environment.

Table 5: Experimental Machine
OS: Linux 2.6.18 | CPU: Intel Xeon(R) @ 2.5GHz, 8 cores | 64 bit | Memory: 16G | JVM Heap Size: 12G

7.2 Data Collection
We collect log data from 5 different real systems, which are summarized in Table 6. The logs of FileZilla [2], PVFS2 [4], Apache [1] and Hadoop [3] are collected from the server machines/systems in the computer lab of a research center. The log data of ThunderBird [5] is collected from a supercomputer at Sandia National Laboratories. The true categories of the log messages are obtained by specialized log parsers. For instance, FileZilla's log messages are categorized into 4 types: "Command", "Status", "Response", "Error". Apache error log messages are categorized by the error type: "Permission denied", "File not exist", and so on.

Table 6: Summary of Collected System Logs
System | Description | #Messages | #Terms per Message | #Category
FileZilla | SFTP/FTP Client | 22,421 | 7 to 15 | 4
ThunderBird | Supercomputer | 3,248,239 | 15 to 30 | 12
PVFS2 | Parallel File System | 95,496 | 2 to 20 | 11
Apache Error | Web Server | 236,055 | 10 to 20 | 6
Hadoop | Parallel Computing Platform | 2,479 | 15 to 30 | 11

The vocabulary size is an important characteristic of log data. Figure 2 exhibits the vocabulary sizes of the 5 different logs as the data size grows. It can be seen that the vocabulary size can become very large when the data size is large.

[Figure 2: Vocabulary size versus number of log messages for PVFS2, FileZilla, ThunderBird, ApacheError and Hadoop]

7.3 Comparative Algorithms
We compare our algorithm with 7 alternative algorithms in this experiment; those algorithms are described in Table 7. Six of them are unsupervised algorithms that only look at the terms of the log messages, and three of them are semi-supervised algorithms that are able to incorporate domain knowledge. IPLoM [18] and StringMatch [6] are two methods proposed in the recent literature. VectorModel [23], Jaccard [25] and StringKernel [16] are traditional methods for text clustering. VectorModel and semi-StringKernel are implemented with the k-means clustering algorithm [25]. Jaccard and StringMatch are implemented with the k-medoid algorithm [12], since they cannot compute the centroid point of a cluster. For Jaccard, the Jaccard similarity is computed with a hash table to accelerate the computation. VectorModel and StringKernel use sparse vectors [23] to reduce the computation and space costs.
Table 4: Average F-Measure Comparison
Algorithm | FileZilla | PVFS2 | ThunderBird | Apache Error | Hadoop
Jaccard | 0.3794 | 0.4072 | 0.6503 | 0.7866 | 0.5088
VectorModel | 0.4443 | 0.5243 | 0.4963 | 0.7575 | 0.3506
IPLoM | 0.2415 | 0.2993 | 0.8881 | 0.7409 | 0.2015
StringMatch | 0.5639 | 0.4774 | 0.6663 | 0.7932 | 0.4840
StringKernel0.8 | 0.4462 | 0.3894 | 0.6416 | 0.8810 | 0.3103
StringKernel0.5 | 0.4716 | 0.4345 | 0.7361 | 0.9616 | 0.3963
StringKernel0.3 | 0.4139 | 0.6189 | 0.8321 | 0.9291 | 0.4256
logSig | 0.6949 | 0.7179 | 0.7882 | 0.9521 | 0.7658
semi-Jaccard | 0.8283 | 0.4017 | 0.7222 | 0.7415 | 0.4997
semi-StringKernel0.8 | 0.8951 | 0.6471 | 0.7657 | 0.8645 | 0.7162
semi-StringKernel0.5 | 0.7920 | 0.4245 | 0.7466 | 0.8991 | 0.7461
semi-StringKernel0.3 | 0.8325 | 0.7925 | 0.7113 | 0.8537 | 0.6259
semi-logSig | 1.0000 | 0.8009 | 0.8547 | 0.7707 | 0.9531
Table 9: Message Signatures
System Log | Message Signature | Associated Category
FileZilla | Date Hours Number Number Status: ... | Status
FileZilla | Date Hours Number Number Response: Number | Response
FileZilla | Date Hours Number Number Command: | Command
FileZilla | Date Hours Number Number Error: File transfer failed | Error
Apache Error | Timestamp ( 13 ) Permission denied: /home/bear-005/users/xxx/public_html/ke/.htaccess pcfg_openfile: unable to check htaccess file ensure it is readable | Permission denied
Apache Error | Timestamp Error [ client ] File does not exist: /opt/website/sites/users.cs.fiu.edu/data/favicon.ico | File does not exist
Apache Error | Timestamp Error [ client 66.249.65.4 ] suexec policy violation: see suexec log for more details | Policy violation
Apache Error | Timestamp /home/hpdrc-demo/sdbtools/public_html/hpdrc2/.htaccess: AuthName takes one argument The Authentication realm ( e.g. "Members Only" ) | Authentication
Apache Error | Timestamp Error [ client ] 2010-04-01 using | N/A
Apache Error | Timestamp Error [ client ] | N/A

[Figure 3: Average running time for FileZilla logs. Figure 4: Average running time for ThunderBird logs. Figure 5: Average running time for Apache logs. Each plots running time against the number of log messages.]
[Figure: Average F-Measure versus λ′ (0 to 20) for FileZilla, PVFS2, ThunderBird, ApacheError and Hadoop]
[Figure: Average F-Measure of logSig using Φ versus logSig using F on each log data set]
[Figure 8: Running time of logSig versus number of log messages for ThunderBird and Apache Error logs]
7.5 Scalability
Scalability is an important factor for log analysis algorithms. Many high-performance computing systems generate more than 1 MB of log messages per second [21]. Figure 3, Figure 4 and Figure 5 show the average running time of all algorithms on data sets of different sizes. We run each algorithm 3 times and plot the average running times. IPLoM is the fastest algorithm. The running times of the other algorithms depend on the number of iterations. Clearly, the k-medoid-based algorithms are not capable of handling large log data. Moreover, StringKernel is not efficient even though we use sparse vectors to implement the computation of its kernel functions. We track its running process and find that the slow convergence is mainly due to the high dimensionality.
Figure 8 shows the scalability of the logSig algorithm on the ThunderBird logs and the Apache Error logs. Its actual running time is approximately linear in the log data size.

8. CONCLUSION AND FUTURE WORK
In this paper, we show the drawbacks of traditional methods and previous log message analysis methods. To address the limitations of existing methods, we propose logSig, a message-signature-based method for generating events from system log messages. logSig utilizes the common subsequence information of messages to partition the log messages and describe the events generated from them.
As for future work, we will integrate structural learning techniques into our framework to capture the structural information of log messages. We hope those structures can further improve the performance of the logSig algorithm.

Acknowledgement
The work is supported in part by NSF grants IIS-0546280 and HRD-0833093.

9. REFERENCES
[1] Apache HTTP Server: An open-source HTTP web server. http://httpd.apache.org/.
[2] FileZilla: An open-source and free FTP/SFTP solution. http://filezilla-project.org.
[3] Hadoop: An open-source MapReduce computing platform. http://hadoop.apache.org/.
[4] PVFS2: The state-of-the-art parallel I/O and high performance virtual file system. http://pvfs.org.
[5] ThunderBird: A supercomputer in Sandia National Laboratories. http://www.cs.sandia.gov/~jrstear/logs/.
[6] M. Aharon, G. Barash, I. Cohen, and E. Mordechai. One graph is worth a thousand logs: Uncovering hidden structures in massive system event logs. In Proceedings of ECML/PKDD, pages 227–243, Bled, Slovenia, September 2009.
[7] S. Bereg, M. Kubica, T. Walen, and B. Zhu. RNA multiple structural alignment with longest common subsequences. Journal of Combinatorial Optimization, 13(2):179–188, 2007.
[8] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of ICML, Alberta, Canada, July 2004.
[9] K. Fisher, D. Walker, and K. Q. Zhu. Incremental learning of system log formats. SIGOPS Oper. Syst. Rev., 44(1):85–90, 2010.
[10] J. Gao, G. Jiang, H. Chen, and J. Han. Modeling probabilistic measurement correlations for problem determination in large-scale distributed systems. In Proceedings of the 29th International Conference on Distributed Computing Systems (ICDCS'09), pages 623–630, 2009.
[11] G. Hamerly and C. Elkan. Learning the k in k-means. In Proceedings of NIPS, Vancouver, British Columbia, Canada, December 2003.
[12] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2005.
[13] J. L. Hellerstein, S. Ma, and C.-S. Perng. Discovering actionable patterns in event data. IBM Systems Journal, 43(3):475–493, 2002.
[14] J. Kiernan and E. Terzi. Constructing comprehensive summaries of large event sequences. In Proceedings of ACM KDD, pages 417–425, Las Vegas, Nevada, USA, August 2008.
[15] T. Li, F. Liang, S. Ma, and W. Peng. An integrated framework on mining logs files for computing system management. In Proceedings of ACM KDD, pages 776–781, Chicago, Illinois, USA, August 2005.
[16] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. The Journal of Machine Learning Research, 2:419–444, March 2002.
[17] D. Maier. The complexity of some problems on subsequences and supersequences. Journal of the ACM, 25:322–336, April 1978.
[18] A. Makanju, A. N. Zincir-Heywood, and E. E. Milios. Clustering event logs using iterative partitioning. In Proceedings of ACM KDD, pages 1255–1264, Paris, France, June 2009.
[19] K. Ning, H. K. Ng, and H. W. Leong. Finding patterns in biological sequences by longest common subsequences and shortest common supersequences. In Proceedings of BIBE, pages 53–60, Arlington, Virginia, USA, 2006.
[20] A. J. Oliner, A. Aiken, and J. Stearley. Alert detection in system logs. In Proceedings of ICDM, pages 959–964, Pisa, Italy, December 2008.
[21] A. J. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Proceedings of DSN 2007, pages 575–584, Edinburgh, UK, June 2007.
[22] W. Peng, C. Perng, T. Li, and H. Wang. Event summarization for system management. In Proceedings of ACM KDD, pages 1028–1032, San Jose, California, USA, August 2007.
[23] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1984.
[24] J. Stearley. Towards informatic analysis of syslogs. In Proceedings of the IEEE International Conference on Cluster Computing, pages 309–318, San Diego, California, USA, September 2004.
[25] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
[26] L. Tang and T. Li. LogTree: A framework for generating system events from raw textual logs. In Proceedings of ICDM, pages 491–500, Sydney, Australia, December 2010.
[27] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 2006.
[28] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In Proceedings of ICML, pages 577–584, June 2001.
[29] P. Wang, H. Wang, M. Liu, and W. Wang. An algorithmic approach to event summarization. In Proceedings of ACM SIGMOD, pages 183–194, Indianapolis, Indiana, USA, June 2010.
[30] W. Xu, L. Huang, A. Fox, D. A. Patterson, and M. I. Jordan. Mining console logs for large-scale system problem detection. In SysML, San Diego, CA, USA, December 2008.