LogSig Generating System Events From Raw Textual Logs

The document describes a new algorithm called logSig that can generate system events from raw textual log messages. Log messages typically contain variables like host names, dates, and error codes that make them difficult to cluster using traditional text mining techniques. The logSig algorithm searches for message signatures - common subsequences of terms - to categorize messages into event types. The authors evaluate logSig on five real-world log datasets and find it outperforms other clustering algorithms at transforming unstructured logs into structured system events.

Uploaded by

redzgn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

109 views10 pages

LogSig Generating System Events From Raw Textual Logs

Uploaded by

redzgn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

LogSig: Generating System Events from Raw Textual Logs

Liang Tang Tao Li Chang-Shing Perng

School of Computer Science School of Computer Science IBM T.J. Watson Research
Florida International University Florida International University Center
11200 S.W. 8th Street 11200 S.W. 8th Street 19 Skyline Drive
Miami, Florida, 33199 Miami, Florida, 33199 Hawthorne, NY, 10532
U.S.A U.S.A U.S.A
ltang002@cs.fiu.edu taoli@cs.fiu.edu perng@us.ibm.com

ABSTRACT and records system internal operations, such as the starting

Modern computing systems generate large amounts of log and stopping of services, detection of network connections,
data. System administrators or domain experts utilize the software configuration modifications, and execution errors.
log data to understand and optimize system behaviors. Most System administrators or domain experts utilize the log data
system logs are raw textual and unstructured. One main to understand and optimize system behaviors.
fundamental challenge in automated log analysis is the gen- Most system logs are raw textual and unstructured. There
eration of system events from raw textual logs. Log messages are two challenges in analyzing system log data. The first
are relatively short text messages but may have a large vo- challenge is transforming raw textual logs into system events.
cabulary, which often result in poor performance when ap- The number of distinct events observed can be very large and
plying traditional text clustering techniques to the log da- also grows rapidly due to the large vocabulary size as well
ta. Other related methods have various limitations and only as various parameters in log generation [6]. Once raw tex-
work well for some particular system logs. In this paper, we tual logs are transformed into events, the second challenge
propose a message signature based algorithm logSig to gen- is to develop efficient algorithms to analyze or summarize
erate system events from textual log messages. By searching the patterns from events. A lot of studies investigate the
the most representative message signatures, logSig catego- second challenge and develop many algorithms to mine sys-
rizes log messages into a set of event types. logSig can tem events [22] [30] [13] [15] [10] [20] [29] [14]. In this paper,
handle various types of log data, and is able to incorporate we focus on the first challenge. The traditional solution to
human’s domain knowledge to achieve a high performance. the first challenge is to develop a specialized log parser for
We conduct experiments on five real system log data. Ex- a particular system. However, it requires users fully under-
periments show that logSig outperforms other alternative stand all kinds of log messages from the system. In practice,
algorithms in terms of the overall performance. this is time-consuming, if not impossible given the complex-
ity of current computing systems. In addition, a specialized
log parser is not universal and does not work well for other
Categories and Subject Descriptors types of systems.
I.5.4 [Pattern Recognition]: Applications; H.4.m [Information Recent studies [6] [18] [26] apply data clustering tech-
Systems Applications]: Miscellaneous; G.2.3 [Discrete niques to automatically partition log messages into different
Mathematics]: Applications groups. Each message group represents a particular type
of events. Due to the short length and large vocabulary
size of log messages [24], traditional data clustering meth-
General Terms ods based on the bag-of-word model cannot perform well
Algorithms, Experimentation when applied to the log message data. Therefore, new clus-
tering methods have been introduced to utilize both the for-
Keywords mat and the structure information of log data [6] [18] [26].
However, these methods only work well for strictly format-
Event generation, Message signature, System logs ted/structured logs and their performances heavily rely on
the format/structure features of the log messages. To ad-
1. INTRODUCTION dress these limitations, this paper proposes a message sig-
Modern computing systems generate large amounts of log nature based algorithm logSig to generate system events
data. The log data describes the status of each component from raw textual logs. It can handle various types of log
data, and is able to incorporate human’s domain knowledge
to achieve a high performance.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are 1.1 Motivation
not made or distributed for profit or commercial advantage and that copies Each log message consists of a sequence of terms. Some
bear this notice and the full citation on the first page. To copy otherwise, to of the terms are variables or parameters for a system even-
republish, to post on servers or to redistribute to lists, requires prior specific t, such as the host name, the user name, IP address and
permission and/or a fee.
CIKM’11, October 24–28, 2011, Glasgow, Scotland, UK. so on. Other terms are plain text words describing seman-
Copyright 2011 ACM 978-1-4503-0717-8/11/10 ...$10.00. tic information of the event. For example, three sample log

785
messages of the Hadoop system [3] describing one type of duct experiments on five real system log data. Experiments
events about the IPC (Inter-Process Communication) sub- show that logSig outperforms other alternative algorithms
system are listed below: in terms of the overall performance.
The rest of the paper is organized as follows: Section 2 de-
1. 2011-01-26 13:02:28,335 INFO org.apache.hadoop.ipc. scribes the related work about system event generation from
Server: IPC Server Responder: starting; textual logs. Then, we formulate the problem of the system
events generation in Section 3. In Section 4, we present
2. 2011-01-27 09:24:17,057 INFO org.apache.hadoop.ipc. an overview of the logSig algorithm. Section 5 discusses
Server: IPC Server listener on 9000: starting; some detailed implementation issues. Section 6 proposes t-
wo approaches for incorporating the domain knowledge to
3. 2011-01-27 23:46:21,883 INFO org.apache.hadoop.ipc. improve the accuracy of the logSig algorithm. In Section 7,
Server: IPC Server handler 1 on 9000: starting. we present the experimental studies on five real system log
data. Finally, Section 8 concludes our paper and discusses
The three messages contain many different words(or terms), the future work.
such as the date, the hours, the handler name, and the
port number. People can identify them as the same even-
t type because they share a common subsequence: “INFO:
2. RELATED WORK
org.apache.hadoop.ipc.Server: IPC Server:starting ”. It has been shown in [24] that log messages are relative-
Let’s consider how the three log messages are generated by ly short text messages but could have a large vocabulary
the system. The Java source code for generating them is size. This characteristic often leads to a poor performance
described below: when using the bag-of-words model in text mining on log
data. The reason is that, each single log message has only
logger = Logger.getLogger ("org.apache.hadoop.ipc.Server"); a few terms, but the vocabulary size is very large. Hence,
logger.info("IPC Server "+handlerName+": starting"); the vector space established on sets of terms would be very
sparse. The string kernel can be used to extract deep se-
where logger is the log producer for the IPC subsystem. Us- mantic information (e.g., the order of terms) to improve the
ing different parameters, such as handlerName, the code can performance of the clustering algorithm. It maps a string
output different log messages. But the subsequence “INFO: to a high dimensional vector to represent all possible ter-
org.apache.hadoop.ipc.Server: IPC Server : start- m orders. However, because of the large vocabulary size,
ing ” is fixed in the source code. It will never change unless the dimensionality of the transformed space would be very
the source code has been modified. high. Although the kernel trick does not have to explicitly
Therefore, the fixed subsequence can be viewed as a sig- create those high dimensional vectors, clustering algorithms
nature for an event type. In other words, we can check the would still be influenced by the high dimensionality due to
signatures to identify the event type of a log message. Oth- the Curse of Dimensionality [25].
er parameter terms in the log message should be ignored, The related work about the log data analysis can be broad-
since messages of the same event type can have different ly summarized into two categories. One category is on sys-
parameter terms. Note that some parameters, such as the tem event generation from raw log data [6] [9] [18] [26] and
handlerName in this example, consist of different numbers the other category is on analyzing patterns from system
of terms. Consequently, the position of a message signa- events [22] [30] [13] [15] [10] [29] [14]. Our work in this pa-
ture may vary in different log messages. Hence, the string per belongs to the first category. A word matching similarity
matching similarity proposed in [6] would mismatch some measurement is introduced in [6] for clustering the log mes-
terms. Another method IPLoM proposed in [18] also fails to sages. One problem is that, if most terms of a log message
partition log messages using the term count since the length are parameter terms, then this type of log messages may
of handlerName is not fixed and three log messages have not have many matched common words. This method is de-
different numbers of terms. noted as StringMatch in this paper. [18] develops a 4-steps
Given an arbitrary log message, we do not know in ad- partitioning method IPLoM for clustering the log messages
vance which item is of its signature, or which term is its based on inherent characteristics of the log format. Howev-
parameter. That is the key challenge we aim to address in er, the method can only be useful for strictly formatted log
this paper. data. The logSig algorithm proposed in this paper can han-
dle various types of log data without much prior knowledge
about the log format.
1.2 Contributions
In this paper, we first describe the drawbacks of tradition-
al text clustering techniques for event generation from log 3. PROBLEM FORMULATION
messages. We show that, it is difficult for the bag-of-word The goal of this paper is to identify the event type of each
model to accurately partition log messages. We also analyze log message according to a set of message signatures. Given
that, the string kernel based approach would be inefficient a log message and a set of signatures, we need a metric to
when the log vocabulary size is large. In addition, we discuss determine which signature best matches this log message.
some limitations of related approaches proposed in previous Therefore, we propose the Match Score metric first.
literatures. Notations: Let D be a set of log messages, D = {X1 , ..., XN },
Then, we propose logSig algorithm to generate system where Xi is the ith log message, i = 1, 2, ..., N . Each Xi is
events from textual log messages. logSig algorithm tries a sequence of terms, i.e., Xi = wi1 wi2 ....wini . A message
to find k message signatures to match all given messages as signature S is also a sequence of terms S = wj1 wj2 ....wjn .
much as possible, where k is specified by the user. We con- Given a sequence X = w1 w2 ...wn and a term wi , wi ∈ X

786
indicates wi is a term in X. X −{wi } denotes a subsequence partition? Why not use the LCS as the similarity function
w1 ...wi−1 wi+1 ...wn . |X| denotes the length of the sequence to do k-means clustering? The answer for the two ques-
X. LCS(X, S) denotes the Longest Common Subsequence tions is that, our goal is not to ﬁnd good clusters of log
between two sequences X and S. messages, but to ﬁnd the message signatures of all types of
Definition 1. (Match Score) Given a log message Xi log messages. K-means can ensure every two messages in
and a message signature S, the match score is computed by one cluster share a subsequence. However, it cannot guar-
the function below: antee that there exists a common subsequence shared by all
(or most) messages in one cluster. We illustrate this by the
match(Xi , S) = |LCS(Xi , S)| − (|S| − |LCS(Xi , S)|) following example.
= 2|LCS(Xi , S)| − |S|. Example 2. There are three log messages X1 : “abcdef,
Intuitively, |LCS(Xi , S)| is the number of terms in Xi matched X2 : “abghij” and X3 : “xyghef”. Clearly, LCS(X1 , X2 )=2,
with S. |S| − |LCS(Xi , S)| is the number of terms in Xi not LCS(X2 , X3 )=2, and LCS(X1 , X3 )=2. However, there is
matched with S. match(Xi , S) is the number of matched no common subsequence that exists among all X1 , X2 and
terms minus the number of not-matched terms. We illus- X3 . In our case, it means there is no message signature to
trate this by a simple example below: describe all three log messages. Hence, it is hard to believe
that they are generated by the same log message template.
Example 1. A log messages X = abcdef and a mes-
sage signature S = axcey. The longest common subsequence 3.2 Problem Analysis
LCS(X, S) = ace. The matched terms are “a”,“c”,“e”, shown Problem 1 is an NP-hard problem, even if k = 1. When
Table 1: Example of Match Score k = 1, we can reduce the Multiple Longest Common Sub-
X a b c d e f sequence problem to the Problem 1. The Multiple Longest
S a x c e y Common Subsequence problem is a known NP-hard [17].
by underline words in Table 1. “x” and “y” in S are not Lemma 1. Problem 1 is an NP-hard problem when k = 1.
matched with any term in X. Hence, match(X, S) = |ace| −
|xy| = 1. Proof: Let D = {X1 , ..., XN }. When k = 1, S = {S1 }.
Note that this score can be negative. match(Xi , S) is used Construct another set of N sequences Y = {Y1 , ..., YN }, in
to measure the degree of the log message Xi owning the which each term is unique in both D and Y. Let D = D ∪Y,

signature S. If two log messages Xi and Xj have the same J(S, D ) = Xj ∈D match(Xj , S1 ) + Yl ∈Y match(Yl , S1 )
signature S, then we regard Xi and Xj as of the same event
type. The longest common subsequence matching is a widely Let S1∗ be the optimal message signature for D , i.e.,
used similarity metric in biological data analysis [7] [19], S1∗ = arg max J({S1 }, D ).
such as RNA sequences. S1

3.1 Problem Statement Then, the longest common subsequence of X1 ,...,XN must
be an optimal solution S1∗ . This can be proved by contradic-
If all message signatures S1 , S2 ,...,Sk are known, identi-
tion as follows. Let Slcs be the longest common subsequence
fying the event type of each log message in D is straight-
of X1 ,...,XN . Note that Slcs may be an empty sequence if
forward. But we don’t know any message signature at the
there is no common subsequence among all messages.
beginning. Therefore, we should partition log messages and
Case 1: If there exists a term wi ∈ S1∗ , but wi ∈
/ Slcs . Since
find their message signatures simultaneously. The optimal
wi ∈/ Slcs , wi is not matched with at least one message
result is that, within each partition, every log message match-
in X1 ,...,XN . Moreover, Y1 ,...,YN are composed by unique
es its signature as much as possible. This problem is formu-
terms, so wi cannot be matched with any of them. In D ,
lated below.
the number of messages not matching wi is at least N + 1,
Problem 1. Given a set of log messages D and an in- which is greater than the number of messages matching wi .
teger k, find k message signatures S = {S1 , ..., Sk } and a Therefore,
k-partition C1 ,...,Ck of D to maximize
J({S1∗ − {wi }}, D ) > J({S1∗ }, D ),

k
J(S, D) = match(Xj , Si ). which contradicts with S1∗ = arg max J({S1 }, D ).
S1
i=1 Xj ∈Ci
Case 2: If there exists a term wi ∈ Slcs , but wi ∈ / S1∗ .
The objective function J(S, D) is the summation of all match Since wi ∈ Slcs , X1 ,...,XN all match wi . The total number
scores. It is similar to the k-means clustering problem. The of messages that match wi in D is N . Then, there are N
choice of k depends on the user’s domain knowledge to the remaining messages not matching wi : Y1 ,...,YN . Therefore,
system logs. If there is no domain knowledge, we can bor-
J({Slcs }, D ) = J({S1∗ }, D ),
row the idea from the method finding k for k-means [11],
which plots clustering results with k. We can also display which indicates Slcs is also an optimal solution to maximize
generated message signatures for k = 2, 3, .. until the results objective function J on D .
can be approved by experts. To sum up the two cases above, if there is a polynomial
Comparing with k-means clustering problem time-complexity solution to find the optimal solution S1∗ in
Problem 1 is similar to the classic k-means clustering prob- D , the Multiple Longest Common Subsequence problem for
lem, since a message signature can be regarded as the repre- X1 ,...,XN can be solved in polynomial time as well. How-
sentative of a cluster. People may ask the following question- ever, Multiple Longest Common Subsequence problem is an
s: Why we propose the match function to find the optimal NP-hard problem [17].

787
Lemma 2. If when k = n Problem 1 is NP-hard, then Lemma 3. Given a message group C, it has n common
when k = n + 1 Problem 1 is NP-hard, where n is a positive term pairs, then the length of the longest
√ common subse-
integer. quence of messages in C is at least 2n.
Proof-Sketch: This can be proved by contradiction. We
Proof-sketch: Let l be the length of a longest common sub-
can construct a message Y whose term set has no overlap to
sequence of messages in C. Let T (l) be the number of term
the term set of messages in D in a linear time. Suppose the
pairs that generated by that longest common subsequence.
optimal solution for k = n and D is C = {C1 , ..., Ck }, then
Since each term
pair has two terms, this
sequence can gener-
the optimal solution for k = n + 1 and D ∪ {Y } should be
ate at most 2l pairs. Hence, T (l) ≤ 2l = l(l − 1)/2. Note
C = {C1 , ..., Ck , {Y }}. If there is a polynomial time solution
that each term pair of the longest common subsequence is a
for Problem 1 when k = n + 1, we could solve Problem 1
common term pair in C. Now, we already know √ T (l) = n,
when k = n in polynomial time.
so T (l) = n ≤ l(l − 1)/2. Then, we have l ≥ 2n.
4. ALGORITHM OVERVIEW Lemma 4. Given a set of log messages D and a k-partition
C = {C1 , ..., Ck } of D, if F (C, D) ≥ y, y is a constant, we
In this section, we first present an approximated version of can find a set of message signatures S such that on average:
Problem 1 and then present our logSig algorithm. logSig
algorithm consists of three steps. The first step is to separate 2y
J(S, D) ≥ |D| ·
every log message into several pairs of terms. The second k
step is to find k groups of log messages using local search
Proof-sketch: Since F (C, D) ≥ y, on average, each group
strategy such that each group share common pairs as many
has at least y/k common pairs. Then for each group, by
as possible. The last step is to construct message signatures
Lemma 3, the length
of the longest common subsequence
based on identified common pairs for each message group.
must be at least 2y
k
. If we choose this longest common
4.1 Term Pair Generation subsequence as
the message signature, each log message can
The first step of logSig algorithm is converting each log 2y
match at least k

terms of the signature. As a result, the
message into a set of term pairs. For example, there is a log
message collected from FileZilla [2] client: match score of each log message is at least 2y
k
. D has |D|
Then, we have the total match score J(S, D) ≥
2010-05-02 11:34:06 Command: mkdir ".indexes" messages.
We extract each pairwise of terms and preserve the order of |D| · 2y on average.
k
two terms. Then, the converted pairs are as follows:
{2010-05-02,11:34:06}, {2010-05-02,Command:}, Lemma 4 shows that, maximizing the F (C, D) is approx-
{2010-05-02, mkdir}, {2010-05-02, ".indexes" } imately maximizing the original objective function J(S, D).
{11:34:06,Command:}, {11:34:06,mkdir}, But F (C, D) is easier to optimize because it deals with dis-
{11:34:06,".indexes"}, {Command:,mkdir}, crete pairs.
{Command:,".indexes"}, {mkdir,".indexes"}
The converted term pairs preserve the order information of 4.2.2 Local Search
message terms. On the other hand, the computation on The logSig algorithm applies the local search strategy to
the discrete term pairs is easier than a sequence. A similar solve Problem 2. It iteratively moves one message to another
idea was proposed in string kernel [16] for text classification. message group to increase the objective function as large as
Their output is a high dimensional vector, and our output possible. However, unlike the classic local search optimiza-
is a set of pairs. tion method, the movement is not explicitly determined by
objective function F (·). The reason is that, the value of
4.2 Log Messages Partition F (·) may only be updated after a bunch of movements, not
The second step is to partition log messages into k groups just after every single movement. We illustrate this by the
based on converted term pairs. following example.

4.2.1 The Approximated Version of Problem 1 Example 3. Message set D is composed of 100 “ab” and
100 “cd”. Now we have 2-partition C = {C1 , C2 }. Each
Notations: Let X be a log message, R(X) denotes the set message group has 50% of each message type as shown in
of term pairs converted from X, and |R(X)| denotes the Table 2. The optimal 2-partition is C1 has 100 “ab” and C2
number of term pairs in R(X).
Problem 2. Given a set of log messages D and an inte- Table 2: Example of two message groups
XXX
ger k, find a k-partition C = {C1 , ..., Ck } of D to maximize XX group C1 C2
term pair XXXX
objective function F (C, D):
“ab” 50 50

k “cd” 50 50
F (C, D) = | R(Xj )|. has 100 “cd”, or in the reverse way. However, beginning with
i=1 Xj ∈Ci current C1 and C2 , F (C, D) is always 0 until we move 50
“ab” from C2 to C1 , or move 50 “cd” from C1 to C2 . Hence,
Object function F (C, D) is the total number of common pairs
for first 50 movements, F (C, D) cannot guide the local search
over all groups. Intuitively, if a group has more common
because no matter what movement you choose, it is always
pairs, it is more likely to have a longer common subsequence.
0.
Then, the match score of that group would be higher. There-
fore, maximizing function F is approximately maximizing J Therefore, F (·) is not proper to guide the movement in the
in Problem 1. Lemma 4 shows the average lower bound for local search. The decision of every movement should consid-
this approximation. er the potential value of the objective function, rather than

788
the immediate value. So we develop the potential function Algorithm 1 logSig_localsearch (D, k)
to guide the local search instead. Parameter: D : log messages set; k: the number of groups
Notations: Given a message group C, R(C) denotes the to partition;
union set of term pairs from messages of C. For a term pair Result: C : log message partition;
r ∈ R(C), N (r, C) denotes the number of messages in C 1: C ← RandomSeeds(k)
which contains r. p(r, C) = N (r, C)/|C| is the portion of 2: C ← ∅ // Last iteration’s partition
messages in C having r . 3: Create a map G to store message’s group index
4: for Ci ∈ C do
Definition 2. Given a message group C, the potential of 5: for Xj ∈ Ci do
C is defined as φ(C), 6: G[Xj ] ← i
7: end for
φ(C) = N (r, C)[p(r, C)]2 . 8: end for
r∈R(C)
9: while C = C do
10: C ← C
11: for Xj ∈ D do
The potential value indicates the overall “purity” of term
12: i ← G[Xj ]
pairs in C. φ(C) is maximized when every term pair is 13: j ∗ = arg max ΔiX j Φ(D)
contained by every message in the group. In that case, for j=1,..,k →
−

each r, N (r, C) = |C|, φ(C) = |C| · |R(C)|. It also means all 14: if i = j ∗ then
term pairs are common pairs shared by every log message. 15: Ci ← Ci − {Xj }
16: Cj ∗ ← Cj ∗ ∪ {Xj }
φ(C) is minimized when each term pair in R(C) is only 17: G[Xj ] ← j ∗
contained by one message in C. In that case, for each r, 18: end if
N (r, C) = 1, |R(C)| = |C|, φ(C) = 1/|C|. 19: end for
20: end while
Definition 3. Given a k-partition C = {C1 , ..., Ck } of a 21: return C
message set D, the overall potential of D is defined as Φ(D),

k
Φ(D) = φ(Ci ), Why choose this Potential Function?
i=1 Given a message
group C, let g(r) = N (r, C)[p(r, C)]2 , then
φ(C) = r∈R(C) g(r). Since we have to consider all term
where φ(Ci ) is the potential of Ci , i = 1, ..., k.
pairs in C, we deﬁne φ(C) as the sum of all g(r). As for
Connection between Φ and F : g(r), it should be a convex function. Figure 1 shows a curve
Objective function F computes the total number of common of g(r) by varying the number of messages having r, i.e.,
term pairs in each group. Both Φ and F are maximized N (r, C). The reason for why g(r) is convex is that, we hope
when each term pair is a common term in its corresponding 100
message group. Let’s consider the average case.
80
Lemma 5. Given a set of log messages D and a k-partition
C = {C1 , ..., Ck } of D, if F (C, D) ≥ y, y is a constant, then 60
g(r)

in the average case, Φ(D) ≥ y · |D|/k. 40

Proof-sketch: Since F (C, D) ≥ y, there are at least y com- 20

mon term pairs distributed in message groups. For each
common term pair ri , let Ci be its corresponding group. On 0
0 20 40 60 80 100
N(r, C)
average, |Ci | = |D|/k. Note that the common pair ri ap-
pears in every message of Ci , so N (ri , Ci ) = |Ci | = |D|/k
Figure 1: Function g(r), |C| = 100
and p(ri , Ci ) = 1. There are at least y common term pairs,
by Definition 2, we have Φ(D) ≥ y · |D|/k. to give larger awards to r when r is about to being a common
Lemma 5 implies, in the average case, if we try to increase term pair. That is because, if N (r, C) is large, then r is more
the value of F to be at least y, we have to increase the overall likely to be a common term pair. Only when r becomes a
potential Φ to be at least y · |D|/k. As for the local search common term pair, it can increase F (·). In other words, r
algorithm, we mentioned that Φ is easier to optimize than has more potential to increase the value of objective function
F. F (·), so the algorithm should pay more attention to r first.
Let ΔiX j Φ(D) denote the increase of Φ(D) by moving In the experimental section, we will empirically compare
→
− the effectiveness of our proposed potential function Φ with
X ∈ D from group Ci into group Cj , i, j = 1, ..., k, i = j.
the objective function function F in the local search.
Then, by Definition 3,
ΔiX j Φ(D) = [φ(Cj ∪ {X}) − φ(Cj )] 4.3 Message Signature Construction
→
−
The final step of logSig algorithm is to construct the
−[φ(Ci ) − φ(Ci − {X})],
message signature for each message group. Recall that a
where φ(Cj ∪ {X}) − φ(Cj ) is the potential increase brought message signature is a sequence of terms that has a high
by inserting X to Cj , φ(Ci )−φ(Ci −{X}) is the potential loss match score to every message in the corresponding group.
brought by removing X from Ci . Algorithm 1 is the pseu- So it could be constructed by highly frequent term pairs
docode of the local search algorithm in logSig. Basically, it identified in the second step.
iteratively updates every log message’s group according to Lemma 6. Let S be an optimal message signature for a
ΔiX→
−
j Φ(D) to increase Φ(D) until no more update operation message group C, the occurrence number of every term wj ∈
can be done. S must be equal or greater than |C|/2 .

789
The proof of Lemma 6 is similar to the proof of Lemma logSig’s potential function provides a more reliable heuris-
1. If there exists a term wj ∈ S only appearing in less tic to guide the optimization process. It is much less likely
than |C|/2 messages, we can have: J({S − {wj }}, C) > to stop at a local optima.
J({S}, C), then S is worse than S − {wj }. Thus, S is not Time Complexity and Space Complexity
optimal. The time complexity of converting log messages into term
Lemma 6 indicates that we only need to care about those pairs is O(|D| · L2 ), where L is the maximum length of log
terms which appear in at least one half of the messages in messages. For every Xj ∈ D, L2 ≥ |R(Xj )|. For the local
a group. By scanning every message in a group, we could search, each iteration goes through every message. So the
obtain the optimal sequence of those terms. Since log mes- time complexity of an iteration is O(k · L2 · |D|). Let t be
sages are usually very short, there are only a few term whose the number of iterations, so the time complexity of the local
occurrence number is equal or greater than |C|/2 . There- search is O(t · k · L2 · |D|). Our experiments shows that t is
fore, there are not many candidate sequences generated by usually 3 to 12 for most system log data.
those terms. We enumerate each one of them and select the The space cost of the algorithm is mainly determined by
best one to be the message signature. the term pair histograms created for every message group.
The total space complexity is the total number of term pairs,
O(|R(D)|).
5. ALGORITHM IMPLEMENTATION
In this section, we discuss some detailed issues about the
implementation of the logSig algorithm. 6. INCORPORATING DOMAIN KNOWLEDGE
Term Pair Histogram In real world applications, analyzing system behaviors
To efficiently compute the potential value of each message mainly relies on the generated system events, so the event
group Ci ∈ C, a histogram is maintained for each group. The generation algorithm should be reliable. Current approach-
histogram is implemented by a hash table, whose key is a es for grouping log messages into different event types are
term pair and the value is the number of messages containing totally unsupervised. In practice, for most sorts of logs
that term pair in the group. Then, for a term pair r, c(r, Ci ) (e.g., Hadoop system logs), there are some domain knowl-
and p(r, Ci ) can be obtained in a constant time complexity. edge about a fixed catalog of Java exceptions. In addition,
Approximately Computing ΔΦ(D) the log generation mechanisms implicitly create some asso-
The straightforward way to compute ΔiX j Φ(D) is enumer- ciations between the terminologies and the situations. In-
→
−
ating all term pairs in Ci and Cj . The computation cost of corporating domain knowledge into the clustering process
k could improve the performance of event generation. In this
finding the maximum ΔiX j Φ(D) is O( l=1 |R(Cl )|), which
→
− section, we present two approaches for incorporating domain
is the total number of term pairs in the entire data set. In
knowledge to improve the accuracy of logSig algorithm.
log data, this number is even tens of times greater than |D|.
Hence, this computation cost is not affordable for each single 6.1 Term Feature Layer
log message.
Some terms or words in log messages share some com-
Our approximated solution is to consider the change of
mon features which can help our algorithm identify the log
N (r, C), for each r ∈ R(X). The reason is that, |C| is large.
message type. The features are domain knowledge from hu-
The impact to |C| by inserting or removing one message
man experts. For example, “2011-02-01” and “2011-01-02”
could be ignored comparing to the impact to N (r, C). |C)| ≈
are different terms, but they are both dates; “192.168.0.12”
|C| + 1 ≈ |C| − 1. So |C| can be treated as a constant in one
and “202.119.23.10” are different terms, but they are both
exchange operation in local search. Then,
IP addresses.
∂φ(C) ∂[[N (r, C)]3 /|C|2 ] 3[N (r, C)]2 logSig algorithm allows users to provide a list of features.
= = = 3[p(r, C)]2 . Each feature is described by a regular expression. An addi-
∂N (r, C) ∂N (r, C) |C|2
tional feature layer is built to incorporate those features for
representing log messages. The feature layer is created by
ΔiX j Φ(D) ≈ 3· [[p(r, Cj )]2 − [p(r, Ci )]2 ]. regular expression matching. For example, we have follow-
→
−
r∈R(X) ing regular expressions 1 :
By utilizing the histograms, p(r, Cj ) and p(r, Ci ) can be
Timestamp : \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
obtained in a constant time. Hence, for each log message
FilePath : "(\w|\\)+"
X ∈ D, the time complexity of finding the largest ΔiX j Φ(D)
→
−
is at most O(k · |R(X)|). Although ΔiX j Φ(D) is would un- Then, we could scan the log message X1 to create the feature
→
−
derestimate the actual change of Φ(D), it can save a lot of layer Y1 as shown in Table 3. The feature layer Y1 contains
computation cost.
Table 3: Example of Feature Layer
5.1 Algorithm Analysis X1 2010-05-02 11:34:06 Command: mkdir ".indexes"
Convergence and Local Optima Y1 Timestamp Command: mkdir FilePath
In the local search algorithm, exchanging messages’ groups
only increases the overall potential Φ(D). There is no oper- more semantic information. It can gather similar terms and
ation to decrease the overall potential. Φ(D) is not infinite, reduce noisy terms. In addition, regular expression is easy
which is less than or equal to |R(D)| · |D|. Therefore, the for domain experts to write. Using a sophisticated regular
local search algorithm would converge. language toolkit, a feature layer can be built efficiently.
Similar to other local search based optimization algorithm- 1
The grammar of regular expressions is defined by Java Reg-
s, logSig may converge into a local optima as well. However, ular Library

790
Table 5: Experimental Machine
6.2 Constraints on Message Signature OS CPU bits Memory JVM Heap Size
In reality, domain experts usually identify the event type Linux 2.6.18 Intel Xeon(R) @ 64 16G 12G
of a log message by scanning keywords or phrases. For ex- 2.5GHz, 8 core
ample, as for SFTP/FTP logs, they would be sensitive to
those phrases: “Command”, “No such file”, “Error”, “Transfer 7.2 Data Collection
failed” and so on. Those sensitive phrases should be in- We collect log data from 5 diﬀerent real systems, which
cluded in the generated message signature to express system are summarized in Table 6. Logs of FileZilla [2], PVFS2
events. On the other hand, domain experts also know that [4] Apache [1] and Hadoop [3] are collected from the server
some trivial phrases should be ignored, such as the message machines/systems in the computer lab of a research center.
IDs and the timestamps. Those trivial phrases should not be Log data of ThunderBird [5] is collected from a supercom-
included in any message signature. To sum up, those knowl- puter in Sandia National Lab. The true categories of log
edge can be transferred as constraints on message signatures. messages are obtained by specialized log parsers. For in-
Those constraints can help the event generation algorithm stance, FillZilla’s log messages are categorized into 4 types:
to improve its accuracy. Unlike traditional constraints in “Command ”, “Status”, “Response”, “Error ”. Apache error
semi-supervised clustering, such as Must-Link or Cannot- log messages are categorized by the error type: “Permission
Link [28], our constraints are placed in the subsequences (or denied ”, “File not exist” and so on.
phrases) of messages, not on the messages themselves.
Constraint-based logSig Algorithm Table 6: Summary of Collected System Logs
To incorporate constraints on message phrases, Problem 1 System Description #Messages #Terms #Category
Per Mes-
can be revised as follows: sage
Problem 3. Given a set of log messages D, a set of sen- FileZilla SFTP/FTP Client 22,421 7 to 15 4
sitive phrases PS and trivial phrases PT , find k message ThunderBird
PVFS2
Supercomputer
Parallel File System
3,248,239
95,496
15 to 30
2 to 20
12
11
signatures S = {S1 , ..., Sk } with a k-partition C1 ,...,Ck of D Apache Web Server 236,055 10 to 20 6
to maximize: Error
Hadoop Parallel Computing 2,479 15 to 30 11

k Platform
J (S, D, PS , PT ) = λ match(Xj , Si ) +
i=1 Xj ∈Ci The vocabulary size is an important characteristic of log
data. Figure 2 exhibits the vocabulary sizes of the 5 diﬀerent

k logs along with the data size. It can be seen that some
(1 − λ) [ match(Ps , Si ) − match(Pt , Si )],
i=1 Ps ∈PS Ps ∈PT
vocabulary size could become very large if the data size is
large.
where λ is a user-defined parameter between 0 and 1.
The revised optimization problem can be solved by constraint- 50000
PVFS2
based logSig Algorithm. The basic idea of constraint-based FileZilla
logSig is to increase the weight of term pairs in R(PS ) and 40000 ThunderBird
decrease the weight of term pairs in R(PT ). Let wr de- ApacheError
Vocabulary size

30000 Hadoop
note the weight of pair r. The only revised part in logSig
algorithm is ΔiX j Φ(D). By multiplying the weight wr , it
→
− 20000
becomes:

ΔiX j Φ(D) ≈ 3 [[p(r, Cj )]2 − [p(r, Ci )]2 ] · wr , 10000
→
−
r∈R(X)
0
0K 10K 20K 30K 40K 50K
and
⎧ Number of log messages
⎨ 1.0, r∈/ R(PS ), r ∈
/ R(PT )
wr = λ , r ∈ R(PS ), r ∈
/ R(PT ) , Figure 2: Vocabulary size
⎩ 1/λ , r ∈
/ R(PS ), r ∈ R(PT )
7.3 Comparative Algorithms
where λ ≥ 1 is a user-deﬁned parameter that can be approx- We compare our algorithm with 7 alternative algorithms
imately derived from the original parameter λ. λ is utilized in this experiment. Those algorithms are described in Table
to increase(and decrease) the importance of term pairs in 7. 6 of them are unsupervised algorithms which only look
sensitive phrases(and trivial phrases). The choice of λ de- at the terms of log messages. 3 of them are semi-supervised
pends on users’ conﬁdence in those sensitive phrases and algorithms which are able to incorporate the domain knowl-
trivial phrases. In experiments, we let λ = 10.0. We show edge. IPLoM [18] and StringMatch [6] are two methods
the performances of the algorithm when varying λ from 0.0 proposed in recent related literatures . VectorModel [23],
to 20.0 and explain why we choose 10 in the experimental Jaccard [25], StringKernel [16] are traditional methods for
section. text clustering. VectorModel and semi-StringKernel are
implemented by k-means clustering algorithm [25]. Jaccard
7. EVALUATION and StringMatch are implemented by k-medoid algorithm
[12], since they cannot compute the centroid point of a clus-
7.1 Experimental Platforms ter. As for Jaccard, the Jaccard similarity is obtained by a
We implement our algorithm and other comparative algo- hash table to accelerate the computation. VectorModel and
rithms in Java 1.6 platform. Table 5 summarizes our exper- StringKernel use Sparse Vector [23] to reduce the compu-
imental enviroment. tation and space costs.

791
Table 4: Average F-Measure Comparison
hhh
h hhhhLog Data FileZilla PVFS2 ThunderBird Apache Error
hhh Hadoop
Algorithm hh
Jaccard 0.3794 0.4072 0.6503 0.7866 0.5088
VectorModel 0.4443 0.5243 0.4963 0.7575 0.3506
IPLoM 0.2415 0.2993 0.8881 0.7409 0.2015
StringMatch 0.5639 0.4774 0.6663 0.7932 0.4840
StringKernel0.8 0.4462 0.3894 0.6416 0.8810 0.3103
StringKernel0.5 0.4716 0.4345 0.7361 0.9616 0.3963
StringKernel0.3 0.4139 0.6189 0.8321 0.9291 0.4256
logSig 0.6949 0.7179 0.7882 0.9521 0.7658
semi-Jaccard 0.8283 0.4017 0.7222 0.7415 0.4997
semi-StringKernel0.8 0.8951 0.6471 0.7657 0.8645 0.7162
semi-StringKernel0.5 0.7920 0.4245 0.7466 0.8991 0.7461
semi-StringKernel0.3 0.8325 0.7925 0.7113 0.8537 0.6259
semi-logSig 1.0000 0.8009 0.8547 0.7707 0.9531

Table 7: Summary of comparative algorithms Table 8: Summary of small log data

Algorithm Description Measure #Message #Feature |R(P)S | |R(P)T |
VectorModel Vector space model proposed in information re- FileZilla 8555 10 4 4
treival ThunderBird 5000 10 11 9
Jaccard Jaccard similarity based k-medoid algorithm PVFS2 12570 10 10 1
StringKernel String kernel based k-means algorithm Apache Error 5000 2 4 2
IPLoM Iterative partition method proposed in [18] Hadoop 2479 2 7 3
StringMatch String matching method proposed in [6]
logSig Message signature based method proposed in PT are summarized in Table 8. In Section 7.5, larger logs
this paper are used to test the scalability.
semi-logSig logSig incorporating domain knowledge
semi-StringKernel Weighted string kernel based k-means 7.4 Quality of Generated Events
semi-Jaccard Weighted Jaccard similarity based k-medoid
7.4.1 Average F-Measure
Table 4 shows the accuracy comparison of generated sys-
semi-logSig, semi-StringKernel and semi-Jaccard are
tem events by different algorithms. The accuracy is evalu-
semi-supervised versions of logSig, StringKernel and Jac-
ated by F-measure (F1 score) [23], which is a traditional
card respectively. To make a fair comparison, all those semi-
metric combining precision and recall. Since the results of
supervised algorithms incorporate the same domain knowl-
k-medoid, k-means and logSig depend on the initial ran-
edge offered by users. Specifically, the 3 algorithms run on
dom seeds, we run each algorithm for 10 times, and put
the same transformed feature layer, and the same sensitive
the average F-measures into Table 4. From this table, it can
phrases PS and trivial phrases PT . Obviously, the choices of
be seen that StringKernel and logSig outperform other
features, PS and PT have a huge impact to the performances
algorithms in terms of the overall performance.
of semi-supervised algorithms. But we only compare a semi-
Jaccard and VectorModel apply the bag-of-word model,
supervised algorithm with other semi-supervised algorithms.
which ignores the order information about terms. Log mes-
Hence, they are compared under the same choice of features,
sages are usually short, so the information from the bag-of-
PS and PT . The approaches for those 3 algorithms to incor-
word model is very limited. In addition, different log mes-
porate with features, PS and PT are described as follows:
sages have many identical terms, such as date, username.
Feature Layer: Replacing every log message by the trans- That’s the reason why the two methods cannot achieve high
formed sequence of terms with features. F-measures. IPLoM performs well in ThunderBird log da-
ta, but poorly in other log data. The reason is that, the
PS and PT : As for semi-StringKernel, replacing Euclidean first step of IPLoM is to partition log message by the term
distance by Mahalanobis distance [8]: count. One type of log message may have different num-
bers of terms. For instance, in FileZilla logs, the length
DM (x, y) = (x − y)T M (x − y). of Command messages depends on the type of SFTP/FTP
command in the message. But for ThunderBird, most even-
where matrix M is constructed according to term pairs
t types are strictly associated with one message format.
PS , PT and λ . As for semi-Jaccard, for each term,
Therefore, IPLoM could easily achieve the highest score.
multiply a weight λ (or 1/λ ) if this term appears in
Due to the Curse of dimensionality [25], k-means based
PS ( or PT ).
StringKernel is not easy to converge in a high dimensional
Jaccard, StringMatch and semi-Jaccard algorithms ap- space. Figure 2 shows that, 50K ThunderBird log messages
ply classic k-medoid algorithm for message clustering. The contain over 30K distinct terms. As a result, the trans-
time complexity of k-medoid algorithm is very high: O(tn2 ) formed space has over (30K)2 = 900M dimensions. It is
[27], where t is the number of iterations, n is the number of quite sparse for 50K data points.
log messages. As a result, those 3 algorithms are not capa-
ble of handling large log data. Therefore, for the accuracy 7.4.2 Message Signatures
comparison, we split our log files into smaller files by time Generated message signatures are used as descriptors for
frame, and conduct the experiments on the small log data. system events, so that users can understand the meanings
The amounts of log messages, features, term pairs in PS and of those events. Due to the space limit, we cannot list all

792
Table 9: Message Signatures
System Log Message Signature Associated Category
FileZilla Date Hours Number Number Status: ... Status
Date Hours Number Number Response: Number Response
Date Hours Number Number Command: Command
Date Hours Number Number Error: File transfer failed Error
Apache Error Timestamp ( 13 ) Permission denied: /home/bear-005/users/xxx/public_html/ke/.htaccess Permission denied
pcfg_openfile: unable to check htaccess file ensure it is readable
Timestamp Error [ client ] File does not exist: /opt/website/sites/users.cs.fiu.edu/ File does not exist
data/favicon.ico
Timestamp Error [ client 66.249.65.4 ] suexec policy violation: see suexec log for Policy violation
more details
Timestamp /home/hpdrc-demo/sdbtools/public_html/hpdrc2/.htaccess: AuthName takes one Authentication
argument The Authentication realm ( e.g. "Members Only" )
Timestamp Error [ client ] 2010-04-01 using N/A
Timestamp Error [ client ] N/A

300 1000 1000

logsig logsig
vectormodel vectormodel
250
Running Time (Seconds)

Running Time (Seconds)

800 IPLoM 800 IPLoM
logsig jaccard jaccard
200 stringmatch stringmatch
vectormodel 600 600
IPLoM stringkernel stringkernel
150
jaccard
stringmatch 400 400
100 stringkernel
200 200
50

0 0 0
5K 8K 11K 14K 17K 20K 5K 15K 25K 35K 45K 55K 65K 75K 85K 95K 5K 15K 25K 35K 45K 55K 65K 75K 85K 95K
Number of Log Messages Number of Log Messages Number of Log Messages

Figure 3: Average Running Time Figure 4: Average Running Time Figure 5: Average Running Time
for FileZilla logs for ThunderBird logs for Apache logs
1.0
logsig using Φ 1400 Thunderbird
0.9
1.0 logsig using F Apache Error

Running Time (Seconds)

0.8 1200
F-Measure of logSig

Average F-Measure

0.7
0.8 1000
0.6
0.6 0.5 800
FileZilla 0.4 600
0.4 PVFS2 0.3
ThunderBird 400
0.2
0.2 ApacheError
Hadoop 0.1 200
0.0 0.0 0
0 2 4 6 8 10 12 14 16 18 20 FileZilla PVFS2 ThunderBirdApacheError Hadoop 15K 35K 55K 75K 100K 120K 140K 160K 180K 200K
′ Log Data
λ Number of Log Messages

Figure 6: Varying parameter λ Figure 7: Eﬀectiveness of Potential Figure 8: Scalability of logSig

Function
message signatures. Table 9 shows generated signatures of string kernel algorithms using three different decay factors:
FileZilla and Apache Error by semi-logSig, in which fea- StringKernel0.8 , StringKernel0.5 and StringKernel0.3 .
tures are indicated by italic words. As for the parameter λ of our algorithm logSig, we set

As for FileZilla log, each message signature corresponds to λ = 10 based on the experimental result shown by Figure 6.
a message category, so that the F-measure of FileZilla could For each value of λ , we run the algorithm for 10 times, and
achieve 1.0. But for Apache Error log, Only 4 message sig- plot the average F-measure in this figure. It can be seen that,
natures are associated with corresponding categories. The the performance becomes stable when λ is greater than 4.
other 2 signatures are generated by two ill-partitioned mes-
sage groups. They cannot be associates with any category of 7.4.4 Effectiveness of Potential Function
Apache Error logs. As a result, their “Associated Category” To evaluate the effectiveness of the potential function Φ,
in Table 9 are “N/A”. Therefore, the overall F-measure on we compare our proposed logSig algorithm with anoth-
Apache error log in Table 4 is only 0.7707. er logSig algorithm which uses the objective function F
to guide its local search. Figure 7 shows the average F-
7.4.3 Parameter Setting measures of the two algorithms on each data set. Clearly,
our proposed potential function Φ is more effective than F
All those algorithms have the parameter k, which is the in all data sets. In addition, we find logSig algorithm using
number of events to create. We let k be the actual number F always converges within 2 or 3 iterations. In other words,
of message categories. String kernel method has an addi- F is more likely to stop at a local optima in the local search.
tional parameter λ, which is the decay factor of a pair of The reason for this has been discussed in Section 4.2.2.
terms. We use StringKernelλ to denote the string kernel
method using decay factor λ. In our experiments, we set up

793
7.5 Scalability determination in large-scale distributed systems. In
Scalability is an important factor for log analysis algo- Proceedings of the 29th International Conference on
Distributed Computing Systems (ICDCS’09), pages
rithms. Many high performance computing systems gener- 623–630, 2009.
ate more than 1Mbytes log messages per second [21]. Fig- [11] G. Hamerly and C. Elkan. Learning the k in k-means. In
ure 3, Figure 4 and Figure 5 show the average running time Proceedings of NIPS, Vancouver, British Columbia,
comparison for all algorithms on the data sets with different Canada, December 2003.
sizes. We run each algorithm 3 times and plot the average [12] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts
running times. IPLoM is the fastest algorithm. The running and Techniques, 2ed. Morgan Kaufmann, 2005.
times of other algorithms depend on the number of itera- [13] J. L. Hellerstein, S. Ma, and C.-S. Perng. Discovering
actionable patterns in event data. IBM Systems Journal,
tions. Clearly, k-medoid based algorithms are not capable
43(3):475–493, 2002.
of handling large log data. Moreover, StringKernel is not [14] J. Kiernan and E. Terzi. Constructing comprehensive
efficient even though we use Sparse Vector to implement summaries of large event sequences. In Proceedings of ACM
the computation of its kernel functions. We keep track of KDD, pages 417–425, Las Vegas, Nevada, USA, August
its running process, and find out the low speed convergence 2008.
is mainly due to the high dimensionality. [15] T. Li, F. Liang, S. Ma, and W. Peng. An integrated
Figure 8 shows the scalability of logSig algorithm on framework on mining logs files for computing system
ThunderBird logs and Apache Error logs. Its actual run- management. In Proceedings of ACM KDD, pages 776–781,
Chicago, Illinois, USA, August 2005.
ning time is approximated linear with the log data size.
[16] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and
C. Watkins. Text classification using string kernels. The
8. CONCLUSION AND FUTURE WORK Journal of Machine Learning Research, 2:419–444, March
2002.
In this paper, we show the drawbacks of traditional meth-
[17] D. Maier. The complexity of some problems on
ods and previous log message analyzing methods. To address subsequences and supersequences. Journal of the ACM,
the limitations of existing methods, we propose logSig, a 25:322–336, April 1978.
message signature based method for events creation from [18] A. Makanju, A. N. Zincir-Heywood, and E. E. Milios.
system log messages. logSig utilizes the common subse- Clustering event logs using iterative partitioning. In
quence information of messages to partition and describe Proceedings of ACM KDD, pages 1255–1264, Paris, France,
the events generated from log messages. June 2009.
As for the future work, we will integrate the structural [19] K. Ning, H. K. Ng, and H. W. Leong. Finding patterns in
biological sequences by longest common subsequencesand
learning techniques into our framework to capture the struc- shortest common supersequences. In Proceedings of BIBE,
tural information of log messages. We hope those structures pages 53–60, Arlington, Virginia, USA, 2006.
could improve the performance of the logSig algorithm. [20] A. J. Oliner, A. Aiken, and J. Stearley. Alert detection in
system logs. In Proccedings of ICDM, pages 959–964, Pisa,
Italy, December 2008.
Acknowledgement [21] A. J. Oliner and J. Stearley:. What supercomputers say: A
The work is supported in part by NSF grants IIS-0546280 study of five system logs. In Proceedings of DSN 2007,
and HRD-0833093. pages 575–584, Edinburgh, UK, June 2007.
[22] W. Peng, C. Perng, T. Li, and H. Wang. Event
summarization for system management. In Proceedings of
9. REFERENCES ACM KDD, pages 1028–1032, San Jose, California, USA,
[1] Apache HTTP Server : An Open-Source HTTP Web
August 2007.
Server. http://httpd.apache.org/.
[23] G. Salton and M. McGill. Introduction to Modern
[2] FileZilla: An open-source and free FTP/SFTP solution. Information Retrieval. McGraw-Hill, 1984.
http://filezilla-project.org.
[24] J. Stearley. Towards informatic analysis of syslogs. In
[3] Hadoop : An Open-Source MapReduce computing Proceedings of IEEE International Conference on Cluster
platform. http://hadoop.apache.org/. Computing, pages 309–318, San Diego, California, USA,
[4] PVFS2 : The state-of-the-art parallel I/O and high September 2004.
performance virtual file system. http://pvfs.org.
[25] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to
[5] ThunderBird: A supercomputer in Sandia National Data Mining. Addison Wesley, 2005.
Laboratories. http://www.cs.sandia.gov/~jrstear/logs/. [26] L. Tang and T. Li. LogTree: A framework for generating
[6] M. Aharon, G. Barash, I. Cohen, and E. Mordechai. One system events from raw textual logs. In Proceedings of
graph is worth a thousand logs: Uncovering hidden ICDM, pages 491–500, Sydney, Australia, December 2010.
structures in massive system event logs. In Proceedings of [27] S. Theodoridis and r. e. Konstantinos Koutroumbas.
ECML/PKDD, pages 227–243, Bled, Slovenia, September
Pattern Recognition. Academic Press, 2006.
2009.
[28] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl:.
[7] S. Bereg, M. Kubica, T. Walen, and B. Zhu. RNA multiple Constrained k-means clustering with background
structural alignment with longest common subsequences. knowledge. In Proceedings of ICML, pages 577–584, June
Journal of Combinatorial Optimization, 13(2):179–188,
2001.
2007.
[29] P. Wang, H. Wang, M. Liu, and W. Wang. An algorithmic
[8] M. Bilenko, S. Basu, and R. J. Mooney. Integrating approach to event summarization. In Proceedings of ACM
constraints and metric learning in semi-supervised SIGMOD, pages 183–194, Indianapolis, Indiana, USA, June
clustering. In Proceedings of ICML, Alberta, Canada, July 2010.
2004.
[30] W. Xu, L. Huang, A. Fox, D. A. Patterson, and M. I.
[9] K. Fisher, D. Walker, and K. Q. Zhu. Incremental learning
Jordan. Mining console logs for large-scale system problem
of system log formats. SIGOPS Oper. Syst. Rev., detection. In SysML, San Diego, CA, USA, December 2008.
44(1):85–90, 2010.
[10] J. Gao, G. Jiang, H. Chen, and J. Han. Modeling
probabilistic measurement correlations for problem