Automatic Protocol Signature Generation Framework for
Deep Packet Inspection
Géza Szabó
Zoltán Turányi
László Toka
TrafficLab, Ericsson Research
Hungary
TrafficLab, Ericsson Research
Hungary
HSN Lab, BUTE, Hungary
Eurecom, France
toka@tmit.bme.hu
geza.szabo
zoltan.turanyi
@ericsson.com
@ericsson.com
Sándor Molnár
Alysson Santos
High Speed Networks Lab.
Dept. of Telecomm. and
Mediainformatics
Budapest Univ. of Technology
and Economics
molnar@tmit.bme.hu
ABSTRACT
We present an automatic application protocol signature generating framework for Deep Packet Inspection (DPI) techniques with performance evaluation. We propose to utilize
algorithms from the field of bioinformatics. We also present
preprocessing methods to accelerate our system. Moreover,
we developed several postprocessing techniques to refine the
accuracy of the results. Finally, we propose a DPI system,
based on approximate string matching, and find it a viable,
novel alternative for the refinement of exact string matching
algorithm’s results.
Keywords
deep packet inspection, automatic protocol signature generation, motif finding
Networking and Telecomm.
Research Group
Universidade Federal de
Pernambuco, Recife, Brazil
alysson@gprt.ufpe.br
Raw traffic
Preprocessing
Motif finding
Application
signatures
Postprocessing
Regexp conv.
Figure 1: The proposed framework
experts. To ease this manual work automatic protocol signature generation tools help to process the network traces
of a specific application and define signature candidates.
Automatic signature generation is cumbersome due to the
following requirements that it must fulfill:
• it should be automatic to as high extent as possible;
1. INTRODUCTION
In-depth understanding of the Internet traffic profile is a
challenging task for researchers and a mandatory requirement for most Internet Service Providers (ISP). Deep Packet
Inspection (DPI) can aid to ISPs in the profiling of networked applications. With this information ISPs may then
apply different charging policies, traffic shaping and offer
different quality of service guarantees to selected users or
applications. Current DPI tools and techniques rely on comparing the content of the packet payload with a set of strings
or regular expressions, which essentially assumed to represent a given “signature” of an application. The collection
and definition of the proper signatures is a time consuming,
challenging task requiring lot of manual work from protocol
• it should process a high number of samples within a
reasonable time period;
• it should provide the most descriptive (often this longest)
possible signature candidates;
• it should find important signatures to represent the
underlying traffic well.
In this paper we propose a framework for automatic signature generation. This framework is built on two basic algorithms: motif finding and sequence alignment inspired by
related methods from bioinformatics. Moreover, we also propose preprocessing and postprocessing methods to accelerate
the system and increase its accuracy. As a result we have a
system with several building boxes, see Figure 1. Throughout the paper the following notation is used to denote the
various processing steps. ’P’ for preprocessing, ’MF’ for motif finding, ’R’ for Regexp conversion and ’Po’ for Postprocessing. The performance factors of the system we consider
are speed and signature expressiveness. Speed can be measured by the CPU time used to generate the signatures from
recorded traffic traces. Signature expressiveness reflects the
appropriateness of the signature set found. The goal is to
find the smallest set of signatures for the biggest coverage
ratio for a specific application. There is an obvious tradeoff
between these two performance metrics, but we found that
our proposed system manages to perform better than prior
solutions in terms of both speed and expressiveness. The
improvement is so significant that this approach may open
new use cases in traffic classification e.g., online per-user
signature generation.
This paper is organized as follows. Section 2 overviews the
related work. The elements of our framework are introduced
in Section 3. Preprocessing and resulting speedup is discussed in Section 4. Our postprocessing steps to select the
best performing signatures are explained and evaluated in
Section 5. We discuss using Approximate String Matching
for DPI in Section 6. Finally, the paper is concluded in Section 7. The methods considered, their input and output are
summarized in Figure 1.
The methods for motif finding and sequence alignment are
algorithms widely used in bioinformatics for similar purposes
and a well-established tool set and literature are available to
rely on. However, the application of these algorithms for networking purposes is far from trivial due to several reasons,
such as the different distribution and number of symbols.
(In bioinformatics there are 4 symbols in DNA, 5 in RNA
and 19 in aminoacid sequences, while in the networking case
a 1-byte representation of network traffic streams induces
256 different symbols.)
2.
Since motif finding is a complex, time consuming procedure,
we introduce a preprocessing step to remove parts of the
input which appear only once or a few times. Preprocessing
comprises of two steps. The first step is a hash algorithm
based on the Rabin-Karp fingerprinting technique to filter
substrings which occur only once. The second step is a prefix
tree construction algorithm to collect substrings that occur
frequently. Preprocessing can reduce the running time to
about 3-16% of the original processing time.
We also introduce a postprocessing step in order to increase
signature expressiveness by decreasing the overlapping coverage on the flow set. It is composed of three steps. In the
first step the candidate signature set is refined by removing
those signatures which give false positive results by crosschecking the candidate signatures with other applications.
In the second step further information is collected about the
positions of signatures in specific byte streams of flows. In
the third step the minimal signature set with the maximal
flow hit is determined. The postprocessing results show significant improvements in signature effectiveness, i.e. the size
of the resulting signature set is decreased 5 times or even
more.
In addition for proposing a system for the automatic generation of regexp signatures, we also propose to use Approximate String Matching (ASM) for actual DPI as an alternative to the common DPI techniques based on regular
expressions. The proposed system results in high signature
expressiveness.
The main contribution of the paper is as follows:
• a general framework for automatic signature generation;
• various adaptations of the framework to achieve different performance purposes; and
• a DPI system using Approximate String Matching with
motifs as signatures.
RELATED WORK
Three types of protocol signature generation methods can
be found in the literature: a) worm signature generation
e.g., [18, 8, 10, 16], b) spam rule generation [2] and c) application signature generation [12, 14, 17, 26]. Authors of
[26] presented AutoSig which extracts multiple common substring sequences from sample flows as application signature.
First, all possible common substrings in an application protocol are extracted and then a substring tree is constructed
to generate the final signature of the application. Being one
of the latest articles in this topic we used AutoSig as a reference to measure the performance of our proposed system.
Topics in bioinformatics relevant to our problem are exact
string matching, global pair-wise sequence alignments, local
pair-wise sequence alignments, multiple sequence alignments
and sequence motif finding [5].
The adaptation of bioinformatics algorithms for network
protocol analysis has recently been found extremely useful.
The primary goal of bioinformatics is to increase our understanding of biological processes. One can observe a similarity in the problems of bioinformatics compared to protocol
analysis. As an example, in bioinformatics the purpose is
to identify genes that produce proteins, while in protocol
analysis the task is to identify the location and purpose of
fields in the packets. This similarity makes it possible to
investigate the application of bioinformatics algorithms in
protocol analysis, however, the differences in the problems
make this application a challenge. As an example, in [6] the
authors use bioinformatics algorithms to determine fields in
protocol packets. These authors propose a global sequence
alignment based on the Needleman-Wunsch algorithm [15]
with encouraging results.
Takeda proposes to apply bioinformatics algorithms for network intrusion detection in [22]. The method based on two
algorithms. The Smith-Waterman algorithm [20] is applied
to captured network traffic to locate patterns similar to
known intrusion traffic. The Needleman-Wunsch algorithm
is used to measure the similarity of the result to the known
intrusion patterns. Coull et al. also addresses the intrusion
detection by a bioinformatics approach [11]. Their method
is a variation of the Smith-Waterman algorithm and using a
novel scoring scheme to construct a semi-global alignment.
Tang et al. present a bioinformatics approach to generate accurate exploit-based signatures for polymorphic worms [23].
The core of the method is multiple sequence alignment which
is used to identify invariant bytes from a set of polymorphic
worm samples. The proposed pairwise sequence algorithm is
also an improvement of the Needleman-Wunsch algorithm.
The method is powerful to accurately analyze the intrinsic
Estimate
Dirichlet
Motif
finding
Traffic mixture
Traffic
Motifs,
Alignment
score
Hit offsets,
lengths
Sequence
alignment
Create Regexps,
motif occurences
clusters
Traffic
Remove flows with hit
Figure 2: The regular expression construction process
similarities of worm samples.
The main difference of our approach proposed in this paper
compared to the ones in related work is that we directly
apply algorithms from the field of bioinformatics for motif
finding. To our knowledge this is a novel approach and has
not been published before. For further detailed discussion on
related work and the application of bioinformatics approach
to our purpose see our technical report [4].
3. REGULAR EXPRESSION CONSTRUCTION
(M+R)
To construct regular expressions from the network traffic
we propose a system applying motif finding and sequence
alignment methods.
3.1 Proposed architecture
The input of the system is collected network traffic: either
an application-aware active measurement or the capture of
the traffic of an aggregating measurement point. In case
the input traffic is classified according to the protocols, the
generated application signatures can be associated with applications. Before feeding the input to our system, the TCP
flows are reconstructed from the actual packet trace using
tcpflow [9]. Since signatures are expected to be at the beginning of the flows only the first 10 − 100 packets worth of
data is considered from each flow.
In the following paragraph we show how we applied two
bioinformatics algorithms – motif finding and multiple sequence alignment –, for the automatic application signature
generation problem. A motif is a possibly gapped sequence
of key positions, which is a re-occurring semi-deterministic
sequence pattern found in multiple sequences generated by
the same source. Key positions hold symbols (sequence elements) that are important for the motif’s function. We used
the glam2 software package [3] developed to find motifs in
biological sequences. The main innovation of glam2 is that
it allows insertions and deletions in motifs: it essentially implements a generalization of Gibbs Sampling technique1 [24]
to allow insertions and deletions in a fully general fashion.
The tool takes the distribution of the symbol appearances as
input. This distribution should incorporate prior knowledge
1
A special case of the Metropolis-Hastings algorithm, and
thus an example of a Markov chain Monte Carlo algorithm.
Sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its
equilibrium distribution: the state of the chain after a large
number of steps is then used as a sample from the desired
distribution.
of the functional similarities between symbols. To accomplish this, a Dirichlet mixture distribution is used which is
a weighted sum of Dirichlet distributions [19].
The input flows are first fed to the Dirichlet mixture estimation algorithm. The output of this step is a Dirichlet
mixture. Then, the input flows are fed to the motif finding
algorithm as multiple sequences together with the Dirichlet
mixture. The output of the motif finding algorithm is a set
of motifs with alignment scores, which are the sum of the
score of each appearance of the given motif (how well the
motif fits the concrete character sequence it matches). In
this step we consider only the motif with the best alignment
score. To find the flows in which a hit occurred with the best
motif we apply sequence alignment on the input flows. The
output of sequence alignment is a list of flow ids, starting
and ending positions of the match in the decreasing order
of the matching scores. As we would like to get signatures
in the form of regular expression, we collect all the appearances of the best motif in the original flows by saving the
substrings in the positions indicated by the sequence alignment process. The byte values on the same positions with
multiple occurrences are collected and a regular expression
is created by putting an OR operator between them2 .
Applications typically have several protocol messages. In an
extreme case one particular motif could describe them all,
but the total score would be lower comparing to the case
when the protocol messages are clustered and several motifs
are defined for the message clusters. The creation of motif
clusters can be done by defining the clusters based on the
alignment scores. Those flows are considered together which
score at least 80% of the maximum value in the list. These
flows are separated from the original set of flows and the
whole regular expression construction process is started over
until no more flows left or only less than 10% of the original
flows can be removed from the original set. To follow the
steps of the signature generation algorithm, see Figure 2.
3.2
Performance evaluation
In order to evaluate the algorithm in different use cases, we
investigated the following metrics:
• Number of motifs generated by the process;
• Flow coverage ratio: the ratio of flows covered by the
yielded clusters;
2
The same method is used in MEME-suite [7] for motif to
regular expression conversion
100
with preprocessing
without
with full
but without common
preprocessing
preprocessing
Flow
coverage
ratio
CPU
occupancy
period [sec]
0
1
10
0
1
10
0
1
10
4
3.92
6.17
9.17
47%
61%
83%
884
972
1022
Figure 3: The
coverage ratio
1000 Gnutella
and number of
8
3.50
4.92
8.67
45%
49%
74%
889
1261
2609
bit/symbol
4
8
1.65
0.91
3.17
2.00
7.06
5.37
24%
17%
44%
34%
72%
55%
74
90
113
252
174
594
4
1.68
3.00
7.61
23%
37%
63%
22
36
76
8
0.84
2.33
6.85
15%
31%
47%
57
177
423
average number of motifs, the flow
and the CPU occupancy period of
flows in the function of bit/symbol
Dirichlet components
true positive coverage (flow) [%]
Average
number of
motifs
# of Dirichlet components
substring extraction
95
90
85
AutoSig (A)
Preprocessing (P)
Motif converted to Regexp (M+R)
Preprocess+Motif converted to Regexp (P+M+R)
Preproc+Motif to Regexp+Postproc (P+M+R+Po)
Motif using ASM (M)
80
75
0
10
20
30 40 50 60 70
number of signatures
80
90 100
• CPU time of the process.
Our aim is to find the alignment(s) with maximum score
and derive a well-fitting motif. The various parameters on
which the outcomes highly depend are3 :
• Number of bits describing a symbol: aggregating bits
may impose loss of information (e.g., in case of insertion, deletion); length of sequences and motifs decreases; number of symbols increases (varying alphabet size, e.g., 4 bits/symbol induces 16 symbols total).
Modifications on applied tools are required in order to
allow for general size of alphabets.
• Number of Dirichlet components: many components
slow down motif finding, few components lower quality of a priori information. The decision to use a given
number of components is somewhat arbitrary. As in
any statistical model, a balance must be struck between the complexity of the model and the data available to estimate the parameters of the model. A mixture with too few components will have a limited ability to represent different contexts for the symbols. On
the other hand, there may not be sufficient data to
estimate precisely the parameters of the mixture if it
has too many components.
For evaluation purposes we used active measurements with
per packet information about the generating application [21].
The traces were divided into training and testing data sets
each containing 1000 flows of the specific application. Note
that motif finding contains random initialization sequences
regarding e.g., the starting positions of the motif candidates.
In order to filter out its effects we repeated every measurement 100 times.
3.2.1
3
Effects of parameter settings on the motif finding algorithm
The investigation of other parameters, e.g., the minimum
number of sequences in the alignment or the deletion and
insertion preferences could be the target for future work.
Figure 4: The true positive coverage per flow in
the function of average number of signatures of the
examined methods
In Figure 3 (“without preprocessing” column, “Average number of motifs” row) the average number of motifs generated
for 1000 Gnutella flows shows that increasing the number of
Dirichlet components, the number of motifs also increases.
The consequence of this is that the flow coverage ratio and
the overall CPU occupancy period also increase. In the other
dimension, if the bit/symbol parameter is decreased from the
intuitively set 8 bit/symbol to 4 bit/symbol, the number of
found motifs also increases resulting in higher coverage ratio.
It is interesting to note that the CPU occupancy decreases,
probably due to the lower number of symbols. We tried to
further decrease the bit/symbol parameter to 2 and 1, but
in these cases the CPU occupancy increased with approximately 2-3 orders of magnitudes, therefore we considered
these parameter sets practically inapplicable. The conclusion is that the motif finding algorithm is more sensitive to
the length of the input sequences than the number of symbols.
3.2.2
Performance comparison to state-of-the-art tool
To compare the performance of the motif finding method for
regular expression creation (M+R in Figure 4) with a stateof-the-art tool, we used [1], which is an implementation of
AutoSig [26] with a slight speed-up. The matching algorithm for DPI is the conventional Deterministic Finite-state
Automata (DFA) method.
The true positive (TP) metric is calculated as follows. The
motifs are constructed per application, thus the training
data of the motif finding algorithm contained only the flows
of one specific application and later the found motifs are
evaluated on the testing data set of that specific application
which is a different set of flows than the training data. The
resulting true positive ratios were averaged over the tested
11 applications (see Table 3). The false positive (FP) metric is calculated in a similar methodology but the calculation
was different in the testing phase when the found motifs of
false positive coverage (flow) [%]
6
5
AutoSig (A)
Preprocessing (P)
Motif converted to Regexp (M+R)
Preprocess+Motif converted to Regexp (P+M+R)
Preproc+Motif to Regexp+Postpr (P+M+R+Po)
Motif using ASM (M)
Speed [flow/sec]
Avg. sig#
A
P
M+R
0.02
51.43
12.76
171.23
0.16
13.62
P+
M+R
3.38
29.3
P+M+ M
R+Po
2.72
0.16
12.41
9.17
4
Table 1: The speed and average number of generated signatures of the methods
3
4.
The above presented process may take hours even in a limited set of flows, thus we made efforts to speedup the process with a two step preprocessing phase. The preprocessing
steps can be seen in Figure 6.
2
1
4.1
0
65
SPEEDUP WITH PREPROCESSING
(P,P+M+R)
70
75
80
85
90
95
true positive coverage (flow) [%]
100
Figure 5: The false positive coverage of the examined methods in the function of true positive coverage
one specific application were tested on the testing flows of
10 other applications each.
Packet inspection complexity increases with the number of
signatures to look for. Hence the compactness of the signature set describing an application (i.e., the number of signature required) is an important metric. By tuning the parameters of the methods we can generate signature sets of
varying size – of course with different coverage of application traffic. On Figure 4 we have depicted the TP coverage
of the signature sets generated by the various methods as
a function of the signature set size. E.g., the method generated 100 signatures for a specific application and we take
the first 1, 2, ...100 which gives the highest flow hit number.
It can be seen that the regular expressions created by the
motif generation converge faster to a total coverage than the
output of the AutoSig tool. The most straightforward explanation for this is that AutoSig creates string signatures,
and the expressiveness of such a grammar is limited by definition.
An other metric of signature expressiveness is the FP ratio
in the function of TP ratio (M+R in Figure 5). If the FP ratio is high it means that the particular method finds mainly
short signatures which can be found in other applications as
well. Note that the ideal method would result in a dot at the
bottom right corner of the figure with TP=100 and FP=0
values. AutoSig converges to 100% TP coverage with much
higher FP coverage comparing to the M+R case. Regarding the average number of generated signatures the M+R
method creates only 1/4 of the AutoSig signatures. This
means a higher expressiveness of the constructed regular expressions. An other big difference is that the motif finding
is 8 times faster than the AutoSig method (see Table 1).
Speedup with fingerprinting
The first preprocessing phase applies a fast, memory efficient
technique to significantly reduce the input size of the raw
traffic by filtering substrings that occurred only once in the
raw traffic. A possible way to do this is to create hashes
from the content of a sliding window. The size of the hash
table can be estimated and limited, so to control memory
consumption. Then by flagging each hash value seen, we
can roughly determine if a certain substring has been seen
or not. In order to correctly detect substrings shorter than
the window size (Wlen ) there has to be a separate hash table
for all string lengths below Wlen . The hash algorithm we
used in this step is the Rabin-Karp fingerprinting method
in a similar way as in the Earlybird paper [8].
To compare the results of preprocessing to the raw traffic
input case we aggregated the total number of motifs, flow
coverage and CPU occupancy time for all examined applications and compared them to the original case when the
input of the motif finding was the raw traffic. In this way
we can obtain information which extent the preprocessing
phase affected the original traffic. Considering the flow coverage in the case of 4 bit/symbol and 10 Dirichlet components, it provides approx. 90% of the original coverage (see
Figure 3 “with preprocessing but without common substring
extraction” column, “flow coverage ratio” row). The required
overall CPU time was only approx. 8-23% comparing to the
original case (see Figure 3 “CPU occupancy period” row).
4.2
Speedup with prefix tree construction
The first preprocessing phase passes a substring to the second proposed step of the preprocessing phase only if it has
already been seen more than once. The output of the first
phase may contain longer substrings divided into shifted
smaller substrings occurring multiple times in the output.
Therefore we introduced a second preprocessing step to collect the same pre- and postfixes into the longest common
substring. This way the input to the motif finding algorithm
is further compressed.
The first step is to extract common substrings from the input streams. This is basically done running a fixed length
sliding window (of length Wlen ) over the input and inserting all window content into a tree, counting the times each
string has been inserted. Each node in the tree represents
a substring, which is not longer than Wlen . By summing
Byte
streams
Pre-selection
Rabin-Karp
fingerprinting
Substrings appearing
more than once
Common substring extraction
Variable depth pre- and
postfix word-trees
Frequent substrings
with appearance
count
Remove
paddings
Frequent
substrings
Figure 6: The preprocessing phases
the counters on the leafs of each sub-tree below a node, we
can see how often the prefix represented by the node occurred in the input stream. This allows us to generate a list
of substrings that occurred more than Omin times. When
one of two substrings is a prefix of the other, we take only
the longer one, except if the shorter one occurred at least
by Omin times more than the longer one. For example, if
“abcde” occurred 10 times and “abc” 30 times, we can deduct
that 10 out of the 30 occurrences of “abc” were as part of
“abcde”. So if Omin is, e.g., 15, we will print “abc”, too, with
20 occurrences. The resulting substrings are then checked
in the reverse direction once more to eliminate those which
are postfixes of another string present.
The above preprocessing construction algorithm can be run
in a second pass on the input stream to detect common
substrings longer than Wlen . In this case we consider only
those window contents which are preceded in the input by
one of the substrings of maximum length (Wlen ) resulting
from the first pass. If we find many occurrences of a such
substring (always following the same Wlen -length substring
from the first pass), we can concatenate this to the substring
from the first pass. This can then be repeated in multiple
passes to detect even longer common substrings. The result
of the whole tree operation is a list of common substrings
with an occurrence count.
The bottleneck in the prefix tree construction operation is
memory consumption in the first pass. Thus Wlen has to
be chosen in the function of the available memory. Many of
the window contents will occur only once, yet they are all
inserted into the tree. This limits the length of the window
(Wlen ) and makes the whole process longer.
The output of the prefix tree is substring candidates with
occurrence values. Motif finding is still needed as there are
several examples in practice where e.g., the middle of a signature there is a sequence number and takes all the possible
256 values of a byte many times (over the minimum occurrence threshold). These cases can not be handled with the
prefix tree. Feeding the substring candidates to the motif
finding tool causes the loss of the occurrence information.
Furthermore, a specific substring with high number of occurrences but with few substring variants is not found by
the motif finding algorithm, thus these signatures should be
added to motif clusters later. For example if “abc” occurred
100 times, “efxg”, “efyg” and “efzg” occurred 10 times each,
than the motif finding algorithm in this step would find with
the last three, as a motif (“ef.g”) can be found for them and
does not consider the first one at all.
Comparing the results of preprocessing to the raw traffic input case, we found that the 4 bit/symbol with 10 Dirichlet
component case provides approx. 76% of the original scores,
being the closest to it (see Figure 3 “with full preprocessing”
column, “flow coverage ratio” row). The required overall
CPU time is only approx. 3-16% comparing to the original case, so further gain is achieved comparing to the first
preprocessing phase (see Figure 3 “CPU occupancy period”
row).
4.3
Remove padding
The output of the two preprocessing phases usually contains
signature candidates with long padding, e.g., “00” and “ff”
runs. It also frequently occurs that some optional fields are
typically unused or unset in a protocol, or reserved for later
usage thus resulting in long zero runs. The motif finding
algorithm can not judge which zero runs are part of a signature or are only padding. We introduced a preprocessing
step to remove these zero runs: the third phase of preprocessing on the signatures skips all the forthcoming zeros in
case of 2 zero bytes. At the following non-zero byte, the
method starts to collect a new signature, thus the original
signatures are split by the double zero bytes. The same is
performed for the “ff ff” bytes.
4.4
Performance evaluation
4.4.1
Preprocessing as a method on its own
We compared the performance of the preprocessing phase (P
in Figures 4 and 5) with a state-of-the-art tool AutoSig [26].
The signatures of preprocessing converge to total coverage
the slowest in the function of the number of signatures in
Figure 4. Thus it needs the most signatures for certain TP
coverage among all the methods. It has the highest average
number of generated signatures. On the other hand, the FP
coverage does not jump up to high values converging to the
total TP coverage in Figure 5. It is 3 magnitudes faster than
AutoSig and 2 magnitudes faster than motif finding on the
raw traffic (P in Table 1).
4.4.2
Preprocessing with motif finding
Considering the preprocessing and motif finding methods together (P+M+R in Figure 4) it can be seen that it covers
better with the same number of signatures than the output
of the preprocessing (P) on its own but remains under AutoSig (A). Regarding the FP coverage (P+M+R in Figure 5)
it has similar characteristics to M+R. Note that in this case
the joint algorithm is 2 magnitudes faster than the AutoSig
method (P+M+R in Table 1).
5.
IMPROVING SIGNATURE EXPRESSIVENESS WITH POSTPROCESSING
(P+M+R+PO)
The signature candidates yielded by motif finding are frequently occurring signatures in the given traffic. To further
refine and restrict the signatures to the most valuable ones,
we applied several postprocessing phases (see Figure 7).
App. specific
unique signatures Offset distribution
Regexps,
Cross-check generated
occurences signatures with other applications
analysis
Signatures with frequently
occurring offsets
Check maximum
coverage
Regexps
Figure 7: The postprocessing phases
5.1 Proposed stages
The first stage in the post processing phase is the cross-check
of the resulting signature candidates with other applications.
Those signatures should be removed which can give false
positive results.
The second stage is gathering additional information about
the positions of signatures in specific byte streams of flows or
packets. This step receives the signatures and the flow list as
input and provides the following information per signature:
• the number of occurrences the given signature occurred
at a specific offset considering all the flows
• the total number of matches of the specific signature
(considering multiple times a multiple match per flow)
• the number of matches of the specific signature in different flows
• the number of different users with hits
The resulting signature set has usually overlapping coverage
on the flow set, meaning that for one given flow there are several signatures which occurred. This overlap is non-optimal
for the DPI engine as it has to check several signatures for
the same hit ratio. In the third stage the goal is to select
the minimal signature set which gives maximal flow, volume or user coverage. The problem is called the weighted
maximum coverage problem [25] and considered to be NPhard. A global optimum can be reached only by brute-force
method comparing the coverage of every possible signature
set.
The problem can be formulated in the following way. Lets
consider the example flow and signature set in Table 2, where
each row represents a flow, each column represents a signature and an ’X’ is placed in a cell if the specific signature
matches to the specific flow. The first column is an id of the
flow, while the second column is the weight of the flow (Wi ).
It can be tuned e.g., with the byte volume that some of the
flows are more important than the others. The third column
is the user id of the flow generating terminal (Ui ). The last
column is the logical connection of the specific signatures for
the specific flow. The elements of the signature set (S) are
binary variables (s1 , s2 , s3 = {0, 1}).
Several optimization problems can be formulated:
• Optimize on P
flow number coverage: The problem
is to
P low#
low#
determine a fi=1
Ci is maximum while fi=1
Si
is minimum.
• Optimize onP
byte volume coverage: The problem
is to
P low#
low#
Si
Ci ∗Wi is maximum while fi=1
determine a fi=1
is minimum.
flow
id (i)
1
2
3
weight
(Wi )
10
15
20
user id (Ui )
U1 = 192.168.1.1
U2 = 192.168.1.1
U3 = 192.168.1.2
Sign
(s1 )
X
s2
X
s3
Coverage (Ci )
X
X
C1 = s1 ∨ s3
C2 = s3
C3 = s2
Table 2: Flow coverage of signatures
• Optimize
user coverage: P
The problem is to deterPon
low#
low#
Ui are maximum
Ci ∧ Ui and fi=1
mine a fi=1
Pf low#
while i=1 Si is minimum.
Note that the obtained signatures for the different optimization cases are likely to differ. For instance regarding a P2P
application, the signaling (e.g., peer search, file search) flows
will be dominating the dataset, thus the optimization on
flow number coverage will suggest these signatures. After a
successful peer search, the data transfer flows will dominate
the dataset in volume. Thus signatures referring to data
transport will be suggested by the optimization on byte volume coverage. Few heavy users can dominate datasets thus
both the above optimizations may suggest signatures for the
specific user e.g., its user id or preferred music performers.
Optimization for user coverage can overcome this issue and
can provide non-user specific signatures.
We used constraint logic programming [13] to efficiently search
for the optimal signature set. Constraint logic programming
over finite domains makes it possible to narrow the search
space as much as possible. When we run out of state space
narrowing ideas at any time during the program execution,
labeling of the variables can be started which is an exhaustive search over the possible values of the variables (bruteforce search).
5.2
Performance evaluation
Considering the preprocessing, motif finding and postprocessing methods together (P+M+R+Po in Figure 4) it can be
seen that it preserved the main attributes of the P+M+R
line as the starting and ending TP coverage is similar to that
case but the number of required signatures to achieve this is
significantly lower. Regarding the FP coverage (P+M+R+Po
in Figure 5) the FP coverage is 0% as a consequence of the
working mechanism of the postprocessing phase. Note that
postprocessing can only consider those applications which
have signature candidates and traffic for cross-checking purposes. Further unprocessed applications may result in increased FP hits. A further consequence of postprocessing is
the limited TP coverage. For instance such flows of an application which uses also common protocols for communication
would provide signatures with FP hits thus removed from
the signature set resulting later false negative hits on those
flows by DPI. Regarding the overall speed of the combined
methods (P+M+R+Po in Table 1), it remained about 2
magnitudes faster than the AutoSig method (P+M+R+Po
in Table 1).
5.3 Example of found signatures
Table 3 shows a few examples of the resulting signatures
on the tested applications. It is important to note that
the number of motifs is far less than the number of different signatures found. A good example is in the case of
World of Warcraft, where approx. 30 gaming servers with
their ingame names, IPs and ports are collected out of the
traces, but they are all covered with one motif.
Application
BitTorrent
DirectConnect
Gnutella
MSN Messenger
POP3
RTP
SSH
World of Warcraft
eDonkey
pplive
Spotify
Regexp
BitTorrent ex
1:rd2:id2
node1:t8
6/6.TTH:A
ock ZLIG |
-Ultrapeer
NUTELLA CON
TLS@.UPC
uting: 0.1
DHTC.....DU
text value="aW..."
@hotmail
MIME-Version:1.0
w" /> <imtext
+OK 0 me
[T|U][T|U]UUUUUUUU
aes128-cb
.\%.@.2mv/.P’T3y.}...m..
rider.80.239.179.51:372
Blade:80.239.179.39.372
light:80.239.185.80.372
E.....].
..{......p
!.B..!.B..!.B..!.B..!
._.e..9^
aĂe֒
Table 3: Examples of signatures on the tested applications
One advantage of the proposed method is to make the DPI
engines to be able to use such signature sets which would
otherwise give false positive hits on their own. E.g., ’@hotmail.com’ for MSN is a good factor of the sum motif score (as
MSN usernames are usually hotmail addresses), but not application specific on its own. As not necessarily every motif
is specific for only one application but using the sum of the
motif scores for one specific application make them reliable
indicator for an application hint. It is also a straightforward
advantage of the method when such motifs are the application descriptors which known to be changed deliberately
e.g., the e-mail spam and other text-like characteristics protocols, such as ’VIAGRA’ changes to ’V.I.A.G.R.A’. The
motifs are even more robust for protocol version changes
over time than regular expressions. E.g., new option fields
in a protocol do not largely affect the motifs. It is also important to note that fewer motifs are enough to describe the
same protocol compared to the required number of regular
expressions.
6.2
Comparing the calculation complexity of the ASM with DFA
the following can be found. The DFA has O(n) complexity
where n is the length of input string. The sequence alignment has O(nm) complexity [5] where n is the length of the
input string, m is the length of the motif. The difference
is linear, thus the algorithm may be a proper candidate on
e.g., post processing of such traffic which can not be identified with the common DPI techniques.
6. APPLICATION OF APPROXIMATE STRING
MATCHING IN DPI (M)
7.
The most accurate method to recognize protocols would be
complete protocol parsing. As these techniques are very
resource consuming, DPI is used which searches for characteristic byte signatures in the traffic. This technique is
accepted to be the most accurate among the traffic classification techniques but it should be noted that this technique
remains a heuristic. On the contrary, results are considered
as a final verdict. If a match occurs, the traffic is classified
to the signature of the application which generated the hit.
All information related to the reliability of the hit is lost.
6.1 Proposed system
We propose to use approximate string matching (ASM) as a
basis of DPI. The identification of unknown traffic is as simple as performing sequence alignments for the unknown traffic with the several motifs describing various applications.
The resulting scores of the sequence alignment runs can be
summed per application motifs and can be compared to each
other. The underlying application of the group of motifs
with the highest sum score is the most probable generating
application of the unknown traffic.
Performance evaluation
Figure 4 (M) shows the case when the usual DFA is substituted with sequence alignment and motifs are used. It
can be seen that it has the highest coverage in the function
of the number of signatures. Regarding the FP coverage in
Figure 5, it has similar characteristics to the M+R case and
has the lowest among all methods. The motifs evaluated via
ASM represent a theoretical maximum of the motif expressiveness. During the regular expression conversion and the
DFA based evaluation in the M+R case, information lost
occurs by definition.
CONCLUSION
In this paper we present a general framework of an automatic
application protocol signature generation for Deep Packet
Inspection (DPI) techniques. The proposed framework utilizes algorithms from the field of bioinformatics. The framework also consists of preprocessing and postprocessing techniques to gain better performance. In the preprocessing
phase we applied a Rabin-Karp fingerprinting based method
to filter the once occurring substrings and a prefix tree construction method to summarize the substrings with common
pre- and postfixes. We found that the motif finding system
extended with the preprocessing phase can achieve high flow
coverage ratio with low CPU occupancy period. We also introduced several postprocessing methods to select the best
performing signatures of the candidates. We applied a crosschecking phase to filter out signatures with false positive hits
for other applications, an offset distribution analysis phase
and a maximum-coverage optimization phase with focus on
either flow number, volume or user number.
We carried out a detailed performance analysis and systematically compared the quality of the signatures generated
by the framework composed of different building boxes to a
state-of-the-art tool. We found that our introduced method
can result in better performance in terms of both speed and
signature expressiveness. It gives about 5 times smaller signature sets in about 100 times shorter period of time than
the state-of-the-art tool. Furthermore, our general framework can also be tuned by using different building boxes to
optimize specifically for speed or signature expressiveness.
Finally, we discussed a DPI system based on approximate
string matching and our results showed that it is a viable
alternative for the refinement of exact string matching algorithm outcomes.
8. REFERENCES
[1] A. F. Santos: Automatic Signature Generation,
Diploma Thesis, 2009.
http://www.cin.ufpe.br/~tg/2009-1/afs5.pdf.
[2] Eric Conrad: Detecting Spam with Genetic Regular
Expressions. http://www.sans.org/reading_room/
whitepapers/email/detecting_spam_with_genetic_
regular_expressions_2006.
[3] glam2 – a motif finding tool.
http://bioinformatics.org.au/glam2/.
[4] L. Toka, G. Szabó, Z. Turányi: Discovering Motifs in
Application Flows, Tech Report, 2010.
http://www.crysys.hu/~szabog/TR2010.pdf.
[5] Lectures in bioinformatics. http:
//www.cs.otago.ac.nz/cosc348/lectures.html.
[6] M. A. Beddoe: Network Protocol Analysis using
Bioinformatics Algorithms, 2009.
www.4tphi.net/~awalters/PI/pi.pdf.
[7] MEME-suite [2010].
http://meme.nbcr.net/meme4_3_0/doc/examples/
meme_example_output_files/meme.html.
[8] S. Singh, C. Estan, G. Varghese, and S. Savage: The
Earlybird System for the Real-time Detection of
Unknown Worms, UCSD, Department of Computer
Science, Technical Report CS2003-0761,.
http://www.cs.unc.edu/~jeffay/courses/nidsS05/
signatures/savage-earlybird03.pdf.
[9] tcpflow.
http://sourceforge.net/projects/tcpflow/.
[10] H. ah Kim. Autograph: Toward Automated,
Distributed Worm Signature Detection. In In
Proceedings of the 13th Usenix Security Symposium,
pages 271–286, 2004.
[11] S. Coull, J. Branch, B. Szymanski, and E. Breimer.
Intrusion detection: A bioinformatics approach. In
Proceedings of the 19th Annual Computer Security
Applications Conference, pages 24–33, 2001.
[12] P. Haffner, S. Sen, O. Spatscheck, and D. Wang. Acas:
automated construction of application signatures. In
MineNet ’05, New York, NY, USA, 2005.
[13] M. Hanus. Programming with constraints: An
introduction by kim marriott and peter j. stuckey, mit
press, 1998. J. Funct. Program., 11(2):253–262, 2001.
[14] J. Ma, K. Levchenko, C. Kreibich, S. Savage, and
G. M. Voelker. Unexpected Means of Protocol
Inference. In IMC ’06: Proceedings of the 6th ACM
SIGCOMM conference on Internet measurement,
pages 313–326, New York, NY, USA, 2006. ACM.
[15] S. B. Needleman and C. D. Wunsch. A general
method applicable to the search for similarities in the
amino acid sequence of two proteins. Journal of
Molecular Biology, 48(3):443–453, March 1970.
[16] J. Newsome, B. Karp, and D. Song. Polygraph:
Automatically generating signatures for polymorphic
worms. In Proceedings of the 2005 IEEE Symposium
on Security and Privacy, pages 226–241, Washington,
DC, USA, 2005. IEEE Computer Society.
[17] B. Park, Y. J. Won, M. Kim, and J. W. Hong.
Towards automated application signature generation
for traffic identification. In NOMS, pages 160–167,
2008.
[18] W. Scheirer and M. Chuah. The Strength of Syntax
Based Approaches to Dynamic Network Intrusion
Detection. In Information Sciences and Systems, 40th
Annual Conference on Volume, March 2006.
[19] K. Sjölander, K. Karplus, M. Brown, R. Hughey,
A. Krogh, I. Mian, and D. Haussler. Dirichlet
mixtures: a method for improved detection of weak
but significant protein sequence homology. Computer
Applications in the Biosciences, 12(4):327–345, 1996.
[20] T. F. Smith and M. S. Waterman. Identification of
common molecular subsequences. Journal of Molecular
Biology, 147(1):195–197, 1981.
[21] G. Szabó, D. Orincsay, I. Szabó, and S. Malomsoky.
On the validation of traffic classification algorithms. In
Proc. PAM, Cleveland, Ohio, USA, April 2008.
[22] K. Takeda. The application of bioinformatics to
network intrusion detection. In Security Technology,
2005. CCST ’05. 39th Annual 2005 International
Carnahan Conference on, pages 130 – 132, 11-14 2005.
[23] Y. Tang, B. Xiao, and X. Lu. Using a bioinformatics
approach to generate accurate exploit-based signatures
for polymorphic worms. Computers & Security,
28(8):827–842, 2009.
[24] G. Thijs, K. Marchal, M. Lescot, S. Rombauts,
B. De Moor, P. Rouzé, and Y. Moreau. A gibbs
sampling method to detect over-represented motifs in
the upstream regions of co-expressed genes. In
RECOMB ’01: Proceedings of the fifth annual
international conference on Computational biology,
pages 305–312, New York, NY, USA, 2001. ACM.
[25] V. V. Vazirani. Approximation algorithms.
Springer-Verlag New York, Inc., New York, NY, USA,
2001.
[26] M. Ye, K. Xu, J. Wu, and H. Po.
Autosig-automatically generating signatures for
applications. In CIT (2), pages 104–109. IEEE
Computer Society, 2009.