
Heterogeneous Pattern Matching for Intrusion Detection Systems and Digital Forensics

2012, PhD Thesis


Heterogeneous Pattern Matching for Intrusion Detection Systems and Digital Forensics

PhD Adviser: Prof. Dr. Viorel Negru
Candidate: Ciprian-Petrişor Pungilǎ
Faculty of Mathematics and Informatics
West University of Timişoara

A thesis submitted for the degree of Doctor of Philosophy
2012

“Life is like a box of chocolates - you never know what you’re gonna get.” (Forrest Gump)

Acknowledgments

I would like to express my most grateful thoughts to my parents, for their continued motivation to pursue an academic career. I dedicate this thesis to them, as it was their support and advice throughout the years that made me the person I am today.

Part of this thesis was supported by several international research grants, including:

• European Commission FP7-REGPOT-CT-2011-284595 (HOST)
• Romanian national grant PN-II-ID-PCE-2011-3-0260 (AMICAS)
• European Community's Seventh Framework Programme FP7/2007-2013 under grant agreement No. 224609 (DEHEMS)

Besides the motivational background involved in these projects, there are a number of people I would like to thank, who have guided my work at the university, encouraged my research, and provided the means to pursue both. First, I would like to express my gratitude to my supervisor, Prof. Dr. Viorel Negru, who inspired my academic career and whose help and support throughout the years has had a significant impact on the research I performed. Second, I would like to thank Prof. Dr. Dana Petcu for her support and patience throughout the development of the DEHEMS project. I wish to express my gratitude to the entire Computer Science department of the Faculty of Mathematics and Informatics at the West University of Timişoara, in particular to Prof. Dr. Daniela Zaharie (for her insight and advice related to the work I carried out throughout the years), Assoc. Prof. Dr. Florin Fortiş (for trusting me and supporting my ideas in the different courses we worked on together during the past few years) and Assoc. Prof. Dr. Mircea Drǎgan (for his help and attention to detail in proof-reading, as well as his continued interest in my research). I also wish to thank the members of the IeAT Research Institute for the work we have done together in the past. Special thanks go to Sean Baxter and Tim Murray at nVIDIA Corp., for their help and useful advice on the CUDA architecture throughout the development of the work for this thesis.

Abstract

This thesis focuses on the development, implementation and optimization of pattern-matching algorithms in two different, yet closely-related research fields: malicious code detection in intrusion detection systems and digital forensics (with a special focus on the data recovery process and the metadata collection stages it involves). The thesis introduces the motivational background for the work, then presents the related work, continues with the main achievements obtained, and ends with conclusions and future research directions. The four main chapters of this thesis, Chapters 3 through 6, present the main contributions of our work and address the following topics. Chapter 3 presents an efficient storage mechanism for hybrid CPU/GPU-based systems and compares it with other known approaches to date.
Chapter 4 proposes an innovative, highly parallel approach to the fast construction of very large Aho-Corasick and Commentz-Walter pattern matching automata on hybrid CPU/GPU-based systems, and compares it to existing sequential approaches. Chapter 5 proposes a new heuristic for profiling malicious behavior based on system-call analysis, using the Aho-Corasick algorithm, and also discusses a new hybrid compression mechanism for this automaton, based on dynamic programming, that reduces the storage space it requires. Finally, Chapter 6 proposes an efficient new method for collecting metadata and assisting the human operator or automated tools used in the data recovery process as part of computer forensic investigations. The research and models obtained in this thesis extend the existing literature in the field of intrusion detection systems (malicious code detection in particular) by presenting: an innovative heuristic for behavioral analysis of code in executable files through system-call interception; a novel and highly efficient approach to storing pattern-matching automata in hybrid CPU/GPU-based systems, which serves as the basis for an innovative model for the fast, GPU-accelerated construction of such very large automata (for both the Aho-Corasick and Commentz-Walter algorithms); and a new hybrid compression technique applied to the Aho-Corasick automata using a dynamic programming approach, which reduces storage space significantly.

Contents

List of Figures
List of Tables

1 Introduction
1.1 Introduction
1.2 Motivation
1.3 Objectives of the Thesis
1.4 Contribution of the Thesis
1.5 Organization

2 Related Work
2.1 Pattern-Matching Algorithms
2.1.1 Single-Pattern Matching Algorithms
2.1.1.1 The Brute-Force Algorithm
2.1.1.2 The Karp-Rabin Algorithm
2.1.1.3 The Knuth-Morris-Pratt Algorithm
2.1.1.4 The Boyer-Moore Algorithm
2.1.2 Multiple-Pattern Matching Algorithms
2.1.2.1 The Aho-Corasick Algorithm
2.1.2.2 The Commentz-Walter Algorithm
2.1.2.3 The Wu-Manber Algorithm
2.2 Intrusion Detection Systems
2.2.1 Classification
2.2.2 Detecting Malicious Code Behavior
2.3 Digital Forensics
2.3.1 Cyber-Crime Investigations
2.3.2 Data Recovery
2.4 Summary

3 An Efficient Mechanism for Storing Very Large Automata in Hybrid CPU/GPU-Based Memory
3.1 The CUDA Architecture
3.1.1 The GF114 Streaming Multiprocessor
3.1.2 Experimental Testbed
3.1.3 The Programming Paradigm
3.1.4 Atomics
3.2 The ClamAV Signatures
3.2.1 The Signature Formats
3.2.2 The Storage Methodology
3.3 CPU and GPU-Accelerated Approaches to Virus Scanning
3.3.1 SplitScreen
3.3.2 GPU Gems 3
3.3.3 GrAVity
3.3.4 MIDeA
3.3.5 PFAC
3.4 A Highly-Efficient Storage Model for Pattern Matching Automata
3.4.1 Motivation
3.4.2 Towards a Hybrid CPU/GPU Efficient Storage Architecture
3.4.3 The Storage Model
3.4.4 The Constraint-Based Aho-Corasick Algorithm
3.4.5 Towards a Parallel Implementation for the GPU
3.4.6 Performance Evaluation
3.4.6.1 The CPU Implementation
3.4.6.2 The GPU Implementation
3.5 Summary

4 An Efficient Model for Hybrid CPU/GPU-Accelerated Construction of Very Large Automata
4.1 Background
4.2 A Hybrid GPU-Accelerated Architecture
4.3 The Storage Format
4.3.1 The Aho-Corasick Automaton
4.3.2 The Commentz-Walter Automaton - Algorithm A
4.3.3 The Commentz-Walter Automaton - Algorithm B
4.4 Optimizing the Performance of the Pre-Processing Stage for the Commentz-Walter Automaton
4.5 Achieving Automaton Construction Parallelism
4.5.1 The Aho-Corasick Automaton
4.5.2 The Commentz-Walter Automaton - Algorithm A
4.5.3 The Commentz-Walter Automaton - Algorithm B
4.6 Performance Evaluation
4.6.1 The Aho-Corasick Automaton
4.6.2 The Commentz-Walter Automaton - Algorithm A
4.6.3 The Commentz-Walter Automaton - Algorithm B
4.7 Summary

5 Implementing Behavioral Heuristics and Hybrid Compression Techniques in the Aho-Corasick Automata
5.1 Behavioral Heuristics
5.1.1 Towards Behavioral Analysis
5.1.2 Applications to Malicious Code Detection
5.1.3 A Metric-Based Heuristic for Malicious Code Detection
5.1.3.1 Program Slicing
5.1.3.2 Time Dependency
5.1.3.3 Measuring Similarity Using the Bray-Curtis Metric
5.1.3.4 A Weighted Automaton Model
5.1.4 Performance Evaluation
5.2 Hybrid Compression
5.2.1 The Smith-Waterman Local Sequence Alignment Algorithm
5.2.2 Towards An Efficient Compression Mechanism
5.2.3 Performance Evaluation
5.3 Summary

6 Improving Carving Analysis in the Digital Forensics Process
6.1 Challenges of File-Carving in Digital Forensics
6.1.1 File Fragmentation
6.1.2 Forensic Analysis
6.1.2.1 Header and Footer Analysis
6.1.2.2 Structural Analysis
6.1.2.3 Storage-Based Analysis
6.1.3 Quantitative Criteria for Measuring Quality
6.1.4 Fighting Fragmentation
6.2 Improving File-Carving Through Data-Parallel Header and Structural Analysis
6.2.1 Background
6.2.2 A Comparison Between Scalpel and the TrID Pattern Scanner
6.2.3 Defining Signatures for File-Carving
6.2.4 Signatures as Regular Expressions
6.2.5 Extracting Relevant Patterns
6.2.6 Performance Evaluation
6.3 Summary

7 Conclusions
7.1 Discussions
7.2 Contributions
7.3 Future Research
7.4 Final Thoughts

Bibliography
Chapter 1
Introduction

1.1 Introduction

Along with the invention of the transistor in the 1950s, a new era emerged, dominated by increasingly fast machines controlled through fine-tuned electronics. Starting with the first transistor built, digital computing has become more and more pervasive in everyday life over the years, and its effects are immensely felt today - we are, essentially, a society that depends on the correct operation of the machines we have built. It is therefore not surprising that the transistor is considered one of the greatest breakthroughs of the 20th century [Pri04]. Increasing operational speeds and the significant progress in nano-technology in recent years have made computers faster and more accurate. Smaller transistors translate into better performance and improved response times. Consumer PCs nowadays use 22nm technology on a daily basis, while 14nm-based technology is scheduled to appear soon, with a roadmap for 5nm also released and advertised as being actively researched [int12]. While the immediate benefits of such progress have numerous applications in most scientific areas, they are unfortunately also being used with malicious intent. Probably the oldest example of such intent is the computer virus, whose history begins in the early ages of computing, more exactly in Von Neumann's research [vN66], where his theory of "self-reproducing automata" describes the possibility of designing a program that can reproduce itself.
A computer virus is, in essence, an executable program which may cause damage to the computer or may interfere with its normal operation, according to [mic12]. Computer viruses have evolved significantly over time. Along with the benefits of improved technology and continuously improved operating systems, viruses have followed a massively increasing trend, from a single known virus in 1985 to over 1 million viruses in 2010, according to [SQ10]. Nowadays, there are over 17.7 million signatures [vir12]. Furthermore, complexity has also increased significantly, beginning with polymorphic viruses (which first appeared in 1990) and ending with the most complex, most difficult to detect and remove specimens of them all: metamorphic viruses. A polymorphic virus is based on an engine which can mutate the code while leaving the functionality the same. Mutation is achieved most of the time through encryption, but encryption alone is not sufficient - a decryption module also needs to exist in order to correctly interpret the mutated code and leave the original semantics untouched. The decryption module is usually the Achilles' heel of polymorphic viruses, since antivirus programs can look for that particular module in order to try to identify the virus type or family. Metamorphic viruses are designed to rewrite their own code entirely from one infection to another. This makes them extremely resilient to antivirus detection, and recent research into metamorphic code, such as Stuxnet [stu11] and Flame [fla12] (considered by Kaspersky Lab researchers to be "the most sophisticated cyber weapon" to have ever been in the wild [red12]), shows how tremendously difficult it is to decrypt, analyze and interpret such code.

1.2 Motivation

In parallel with virus development, antivirus research, as part of the intrusion detection systems field, has progressed significantly. While many antivirus competitors exist on the market, offering different protection solutions with different efficiencies, open-source alternatives, such as the ClamAV initiative [cla12], have been the primary focus of academic security research. Antivirus programs can detect viruses essentially in two ways: by looking up virus signatures in the executable code, or by profiling the run-time behavior, usually through emulation. Virus signatures are data patterns which uniquely describe the identity of a virus or of a family of viruses. The problem of identifying a signature inside binary code is a classic pattern matching problem. With the number of signatures growing each day, it is becoming increasingly difficult to implement efficient pattern matching techniques. Although multiple pattern matching algorithms have been proposed, such as the Aho-Corasick [AC75], Wu-Manber [WM94] and Commentz-Walter [CW79] variants, real-time implementations still pose a challenge to reducing the scanning time of an executable.
For intrusion detection systems, GPU-accelerated approaches are an active field of research. A lot of research has also been conducted on integrating GPU acceleration into network intrusion detection systems (NIDS), and important speed-ups have also been obtained with GPU-accelerated variants of known algorithms applied to malicious code detection, particularly in antivirus-related research and implementations. Given the plural nature of pattern matching algorithms and the various fields of research where they find applications, an increasing interest has been observed in the past few years in the digital forensics area, where data recovery procedures and defragmentation approaches have been successfully used for tracking down pieces of information, putting together clues in forensic analysis and recovering lost or fragmented pieces of information from damaged hardware devices. The recent engineering breakthroughs in building more power-efficient GPUs, with better architectures and higher semiconductor densities, have certainly opened up a wide range of possibilities for building complex hybrid systems in the near future, where the CPU and GPU interact, share information and perform computations as a single, unique and homogeneous entity.

The main objective of this PhD thesis is to propose, present and evaluate the theoretical background for achieving pattern matching on two different levels of heterogeneity: in hybrid hardware environments (using both the CPU and the GPU for carrying out tasks as a single, hybrid processing unit - where we concentrate on two important aspects of the implementation: low storage requirements and high-performance throughput) and in hybrid implementations (such as hybrid-parallel approaches and implementations of various extensions to the existing research) for solving the pattern-matching problem more efficiently in both intrusion detection systems (with primary focus on malicious code detection) and digital forensics (where we focus on carving analysis as part of the digital forensics process).

1.3 Objectives of the Thesis

The objectives of this thesis are the following:

1. to analyze the current state of the art for implementing pattern matching in heterogeneous hardware environments, where the CPU and GPU work as a single entity designed to carry out the tasks given to it;

2. to propose a formalism for achieving efficient storage of very large pattern-matching automata in hybrid CPU/GPU memory, implement it and evaluate its performance compared to other existing known approaches;

3. to propose a formalism, architecture and computational model for the highly-parallel construction of very large pattern-matching automata in heterogeneous hardware environments;

4. to propose a model for implementing behavioral heuristics through heterogeneous implementations or extensions of pattern-matching algorithms for tracking down malicious code behavior;

5. to propose a model for achieving highly-compact pattern-matching automata by using different compression techniques, and apply the model to the Aho-Corasick automaton in the virus signature matching process;

6. to propose a heterogeneous approach to extending the regular-expression pattern-matching problem to the area of carving analysis in digital forensics, outlining the primary advantages and benefits of such an approach as part of cyber-crime investigations and data recovery.
1.4 Contribution of the Thesis

This thesis focuses on two aspects of research: the first relates to profiling malicious code behavior by optimizing and accelerating pattern matching algorithms, while the second discusses the applicability of our research to digital forensics and its possible future uses. Profiling malicious code behavior is a highly complex task, requiring several steps of computation, which is why numerous attempts to parallelize and optimize the known approaches have emerged in recent years. In this thesis, we tackle some of those approaches and propose innovative and highly efficient models for accelerating such algorithms, as well as optimizing them to support regular expression scanning and constraint-based matching.

We start by discussing an innovative and highly efficient storage model in Chapter 3, presenting a new hybrid approach for achieving very efficient malware detection in hybrid CPU/GPU systems, without affecting throughput performance in any way. We discuss known GPU and hybrid storage mechanisms and propose a new model whose primary advantages are its low storage requirements and full-duplex memory transfers at the highest possible throughput between the host and the device, while requiring only a single external pointer to the memory area holding the automaton. We present the final results by comparing storage models in modern antivirus implementations, starting with the ClamAV antivirus and outlining our approach's primary benefits in a hybrid-parallel implementation during the virus scanning process.

The storage model in Chapter 3 is used as a basis for the discussion in Chapter 4, where we present an innovative highly-parallel architecture for constructing fast pattern matching automata using a hybrid CPU/GPU approach, which improves processing times severalfold and allows real-time updates to such automata, opening new research opportunities by offering the possibility to construct very fast, self-adjustable automata. This research represents the first parallel model (which is also capable of offering maximum throughput performance) for the very fast construction of pattern-matching automata, showing how construction performance can be improved significantly when using virus signatures as part of the virus signature matching process, while maintaining highly efficient storage in both host and device memory.

Chapter 5 proposes a new heuristic for profiling malicious code behavior through system call analysis by taking into account the moment in time when a call is made during program execution, with very little impact on performance. This represents the first model for time-based analysis of system calls, an important part of malicious code detection through system-call analysis heuristics in antivirus engines. The chapter also covers a new approach to achieving hybrid compression of the Aho-Corasick pattern matching automaton using dynamic programming techniques which significantly reduce the storage required for this automaton, presenting the experimental results obtained when applying our methodology to the virus signature matching automaton in the virus detection process, a common component of all modern antivirus engines. Our hybrid approach shows that a combination of our dynamic programming implementation, along with several other improvements, can reduce the memory required severalfold compared to the naive or other existing approaches.
The last chapter, Chapter 6, discusses a new approach to improving the carving analysis process in digital forensics by employing efficient pattern-matching automata that support regular expressions, and shows how the classic file-carving problem can be adapted to this model. The implementation covers a lexical analyzer for converting the XML-based signatures used in the structural analysis process into the regular-expression-based signature formats used in automata creation. Our approach is based on header and structural analysis and allows different heuristics to be built on top of the detection engine for an accurate reconstruction of the original file, therefore permitting efficient, low-memory, knowledge-based forensic analysis engines to be built.

1.5 Organization

The remaining chapters of this thesis are organized as follows. Chapter 2 presents the latest existing algorithmic approaches to pattern matching, discussing both single and multiple pattern matching approaches (with special emphasis on the latter), and discusses their applicability to intrusion detection systems, malicious code behavior detection and digital forensics. Chapter 3 discusses the challenges involved in achieving efficient storage of the automata used in malicious code detection, discusses the GPU storage mechanisms and proposes a new, highly efficient approach to storing such automata in both RAM and GPU (device) memory. Chapter 4 proposes an innovative and highly efficient hybrid CPU/GPU-based model for the fast construction of very large Aho-Corasick [AC75] and Commentz-Walter [CW79] automata, showing how such a model may be applied to malicious code detection in particular. We then discuss the applicability of the model to other types of patterns and show that it is not limited to the intrusion detection systems field of research, but can be used without any limitations in any other field involving multiple pattern matching through the two types of automata. Chapter 5 discusses in depth a few profiling heuristics for behavioral analysis commonly used in pattern matching automata based on the Aho-Corasick algorithm [AC75], as well as a hybrid approach to compressing the automata using a dynamic programming method. We present and discuss the challenges involved, along with the mathematical models for behavioral analysis, and show how they can be applied to malicious code detection in intrusion detection systems. Chapter 6 proposes a new model (based on a pattern-matching automata heuristic) for collecting metadata, a vital component in the data recovery process commonly used in digital forensics, and shows how the model can significantly reduce human operator intervention by locating and reconstructing an improved partial map of the original undamaged file, provided that the proper binary signatures for recovery are used. Finally, Chapter 7 outlines the main contributions of the thesis and discusses future research opportunities and directions.

Chapter 2
Related Work
2.1 Pattern-Matching Algorithms

The pattern matching challenge has a long history, dating back to the early stages of computing. The single pattern matching problem aims to find all occurrences of a given, non-empty keyword in an input string, while later applications have extended the problem to finding all occurrences of a finite, non-empty set of keywords in an input string. Pattern matching applications are widely used in web search engines (where they are used to identify webpages of interest to the users), bioinformatics (in [Tum10], the authors discuss the ability to perform DNA analysis on GPU clusters using a multiple pattern matching algorithm; in [Dud06], the authors propose a heuristic that can reduce the number of comparisons by increasing the number of skips in the pattern matching problem) and intrusion detection systems (where they are, for instance, used in tracking down malicious network data packets, such as the approaches proposed in [MK06] and [CY06], or for enabling high-level architectures for pattern matching and implementing them on the GPU, as discussed in [LH12]; they are also widely used in antivirus engine implementations, such as the approach suggested in [Lee07], or in GPU-accelerated approaches for performing virus signature matching, as shown in [PN12]), with several other important applications in numerous fields of research. Since pattern matching algorithms are numerous and have a wide range of applications, with several variants existing, we will present in the following only those algorithms finding the most common and relevant applications in the fields of intrusion detection systems and digital forensics. Cleophas et al. in [CZ04] discuss a taxonomy of pattern matching algorithms that improves on the work of Watson et al. from [WZ96], including the SPARE toolkit that Watson et al. had developed and presented in [Wat04].

Figure 2.1: The input string and the pattern for upcoming examples.

2.1.1 Single-Pattern Matching Algorithms

According to Watson in [Wat94], the single pattern-matching problem can be defined as follows: given an alphabet (a finite set of characters) Σ and a text string S = Γ[0], Γ[1], ..., Γ[N−1] of length N, find all occurrences of a string pattern P = Λ[0], Λ[1], ..., Λ[M−1] of length M, where Γ[i] ∈ Σ and Λ[j] ∈ Σ, for each 0 ≤ i ≤ N−1, 0 ≤ j ≤ M−1 and M ≤ N. Existing algorithms propose several approaches to efficiently implementing and solving the pattern matching problem, aiming to improve run-time performance while also decreasing storage space as much as possible. We are going to present some of the most commonly used variants of single pattern matching algorithms, beginning with the brute-force approach and ending with the approaches proposed by Knuth-Morris-Pratt ([Knu77]) and Boyer-Moore ([BM77]), which are the most commonly used in implementations nowadays given their very good performance. Most of the time, |Σ| = 256, to cover the entire extended ASCII alphabet; however, some research, especially related to bioinformatics, uses smaller alphabet sizes. In the following, we denote by P[i] the character at position i in the string P.

2.1.1.1 The Brute-Force Algorithm

The brute-force algorithm is the simplest (and the slowest) of the existing variants of single pattern matching.
Its basic idea lies in comparing, starting at each position j (j ∈ {0, 1, ..., N−M}) in the input string, the next M characters of S with those of P. In case of a mismatch, the comparisons begin again, starting at the next position, Γ[j+1]. The algorithm's worst-case performance is O(M × (N−M+1)), which matches O(N × M) according to [Cor02], and occurs when, at every position, all comparisons match except the one on the last character of P, while the best-case performance is O(N) (assuming that the only comparison performed at each position is the one against the first character of P). According to Cormen in [Cor02], the expected number of comparisons, or the average-case performance, is linear and at most 2 × (N−M+1), which equals O(N). Figure 2.2 shows the list of comparisons performed by the brute-force algorithm for the example shown in Figure 2.1.

Figure 2.2: Comparisons performed by the brute-force simple pattern matching algorithm.

The algorithm for performing a brute-force search is presented as follows:

• Input
  S, the input string (of length N)
  P, the pattern to look for (of length M)

• Brute-force simple pattern matching
  for i ← 0 to N − M do
    match ← true
    for j ← 0 to M − 1 do
      if S[i + j] ≠ P[j] then
        match ← false
        break
      end if
    end for
    if match = true then
      report match found at position i
    end if
  end for
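For concreteness, a brute-force matcher can be written in a few lines of C. The sketch below mirrors the pseudocode above; the function name, the example strings and the use of printf for reporting are our own illustrative choices, not part of any implementation referenced in this thesis.

#include <stdio.h>
#include <string.h>

/* Report every position in s (length n) where p (length m) occurs,
   using direct character-by-character comparison. */
static void brute_force_search(const char *s, int n, const char *p, int m)
{
    for (int i = 0; i + m <= n; i++) {
        int j = 0;
        while (j < m && s[i + j] == p[j])
            j++;
        if (j == m)
            printf("match found at position %d\n", i);
    }
}

int main(void)
{
    /* arbitrary example strings, not the example from Figure 2.1 */
    const char *text = "here is a simple abracadabra example";
    const char *pat  = "abracadabra";
    brute_force_search(text, (int)strlen(text), pat, (int)strlen(pat));
    return 0;
}

The quadratic worst case discussed above corresponds to inputs where the inner loop repeatedly runs almost to the end of the pattern before failing.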
2.1.1.2 The Karp-Rabin Algorithm

Karp and Rabin [KR87] proposed a string matching solution relying on hash computation. Hashes are mathematical functions that assign a numerical value to each string. Starting from the observation that if the hashes of two strings differ, then the strings themselves must also differ (while equal hashes indicate only a probable match), Karp and Rabin build hashes for the pattern and for each of the successive substrings of the input string. Given that the algorithm relies heavily on the efficient computation of the hash values, and that in certain situations a hash function can produce the same output for two different string inputs (a hash collision, resulting in a possible mismatch which may be reported by the algorithm as a match), an efficient hash function is vital to the correct functional behavior of the approach. The algorithm uses Rabin's fingerprinting technique presented in [Rab81]. Broder, in [Bro93], discusses applications of Rabin's fingerprinting method, and shows that a fingerprint is in fact similar to the hash computation of a larger input (similar to the MD5 hashes presented in [Riv92]), and that there is a slight chance that collisions may appear whenever two hash values are identical, even though the fingerprinted inputs are different. The algorithm works by computing the hash values of the N−M+1 substrings of length M found in the input string of length N and making use of them at run time: comparing hash values allows us to avoid many of the character comparisons performed, for example, by the brute-force algorithm in Section 2.1.1.1. The algorithm maintains a sliding window of length M over the input string, moving from left to right, and recomputes the hash value of the window at each iteration. If the hash values of the keyword and of the sliding window are equal, we have a probable match; otherwise, the sliding window slides one more character to the right. According to [Cor02], the worst-case run-time performance is O((N−M+1) × M) and matches that of the brute-force algorithm, while the expected run-time performance is O(N+M).

Figure 2.3: Comparisons performed by the Karp-Rabin pattern matching algorithm.

A possible hashing function for a keyword w of length M (where |w[i]| represents the ASCII numerical code of the character at position i in w) is:

hash(w[0, ..., M−1]) = ( Σ_{i=0}^{M−1} |w[i]| × 2^(M−1−i) ) mod q,    (2.1)

where q is a large number (and usually prime). The sliding window concept ensures that, for the input string S, the hash value hash(S[i+1, ..., i+M]) is easily recomputed from hash(S[i, ..., i+M−1]) and S[i+M]; therefore we can express this as:

hash(S[i+1, ..., i+M]) = recomputeHash(S[i], S[i+M], hash(S[i, ..., i+M−1]))    (2.2)

which would determine:

recomputeHash(a, b, d) = ((d − |a| × 2^(M−1)) × 2 + |b|) mod q    (2.3)

Whenever the hash values are equal, a character-by-character comparison must still be performed to ensure that there really is a match in the input string (guarding against hash collisions). The algorithm for performing a Karp-Rabin search is shown below:

• Input
  S, the input string (of length N)
  P, the pattern to look for (of length M)
  hash(S[0, ..., M−1]), the hash of the initial sliding window

• The Karp-Rabin simple pattern matching algorithm
  i ← 0
  while i ≤ N − M do
    if hash(P) = hash(S[i, ..., i+M−1]) then
      compare S[i, ..., i+M−1] with P character by character and, if equal, report match found at position i
    end if
    hash(S[i+1, ..., i+M]) ← recomputeHash(S[i], S[i+M], hash(S[i, ..., i+M−1]))
    i ← i + 1
  end while

An example of the run-time process on the pattern and input string from Figure 2.1 is shown in Figure 2.3.
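To illustrate how equation (2.3) is used in practice, the following C sketch maintains the rolling hash while scanning a text. The modulus Q, the helper names and the example strings are our own assumptions chosen for readability, and equal hashes are confirmed with memcmp() to guard against collisions, as required above.

#include <stdio.h>
#include <string.h>

#define Q 1000003UL   /* a large prime modulus; illustrative choice, not from the thesis */

/* hash of w[0..m-1] as in equation (2.1): character codes weighted by powers of two, mod Q */
static unsigned long hash_of(const unsigned char *w, int m)
{
    unsigned long h = 0;
    for (int i = 0; i < m; i++)
        h = (h * 2 + w[i]) % Q;
    return h;
}

/* equation (2.3): drop the leading character a, shift, append the new character b */
static unsigned long recompute_hash(unsigned char a, unsigned char b,
                                    unsigned long d, unsigned long pow2m1)
{
    /* adding Q keeps the intermediate value non-negative under modular arithmetic */
    unsigned long without_first = (d + Q - (a * pow2m1) % Q) % Q;
    return (without_first * 2 + b) % Q;
}

int main(void)
{
    /* arbitrary example strings */
    const unsigned char *s = (const unsigned char *)"the quick brown fox";
    const unsigned char *p = (const unsigned char *)"brown";
    int n = (int)strlen((const char *)s), m = (int)strlen((const char *)p);

    unsigned long pow2m1 = 1;                 /* 2^(M-1) mod Q */
    for (int i = 0; i < m - 1; i++) pow2m1 = (pow2m1 * 2) % Q;

    unsigned long hp = hash_of(p, m), hs = hash_of(s, m);
    for (int i = 0; i + m <= n; i++) {
        if (hs == hp && memcmp(s + i, p, (size_t)m) == 0)
            printf("match found at position %d\n", i);
        if (i + m < n)                        /* slide the window one character */
            hs = recompute_hash(s[i], s[i + m], hs, pow2m1);
    }
    return 0;
}

The base-2 weighting follows equation (2.1) literally; practical implementations usually pick a larger base to reduce collisions, but the explicit character-by-character verification keeps the sketch correct either way.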
In the preprocessing stage, v[i+1] is determined by verifying if a border of the prefix P rei (P ) can be extended with the character P[i], e.g. if P[v[i+1]] = P[i]. The list of borders that are to be examined, in decreasing order, is obtained from the values v[i], v[v[i]], etc. and the loop performing this task terminates when no border can be extended (that is, the value becomes -1): • Input v, the array used in pre-computation S, the input string 3: P, the pattern to look for 1: 2: • The Knuth-Morris-Pratt preprocessing stage 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: i←0 j ← −1 v[0] ← −1 while i < M do while j ≥ 0 and P [i] 6= P [j] do j ← v[j] end while i←i+1 j ←j+1 if P [i] = P [j] then v[i] ← v[j] else v[i] ← j end if end while • The Knuth-Morris-Pratt algorithm PhD Dissertation 14 Figure 2.5: The Knuth-Morris-Pratt matching process for the example in Figure 2.1. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: i←0 j←0 while i < N do while j ≥ 0 and S[i] 6= P [j] do j ← v[j] end while i←i+1 j ←j+1 if j ≥ M then match found at position i − j j ← v[j] end if end while A jumping example based on the precomputation table for the Knuth-Morris-Pratt algorithm is presented in Figure 2.4: in case of a mismatch at position j, the longest border Bormaxj (P rej (P )) is considered. When resuming comparisons at position v[j], a jump is performed and if a mismatch occurs once more, the next longest border is considered, and so on, until no border is left (j = -1 ) or the next character is a match. If all M characters are matched in the sliding window (j = M ), a match is reported at position i - j, after which a jump is performed by the length of the widest border possible. As shown by Cormen et al in [Cor02], the searching algorithm has a complexity of O(N), while the preprocessing requires O(M) steps. An example of the comparisons performed by the algorithm for our example is shown in Figure 2.5. PhD Dissertation 15 Figure 2.6: The Boyer-Moore jump distance computation for good-suffix (left) and bad-character (right) shifts. 2.1.1.4 The Boyer-Moore Algorithm One of the fastest algorithms in practice was proposed by Boyer and Moore in[BM77] and bases its idea on performing comparisons from right to left, as opposed to usual left-to-right searches performed by most other algorithms, hoping that we can jump in this way on higher distances whenever a mismatch is found. By using two precomputed jumping distances for the pattern (called good-suffix shift and bad-character N shift), the algorithm can perform as fast as O( M ) in the best case. The formal definition of the jump distances for the Boyer-Moore algorithm, based on the definition proposed by Lecroq et al in [LC04], is: GoodSuffixShift[i+1] = min{s > 0: (for each j, i < j < M: k ≥ j or P[j-k] = P[j]) and (if j < i then P[i-s] 6= P[i])} and GoodSuffixShift[0] is equal to the length of the period of P. The bad-character jump distance is stored in a table of size |Σ| (size of the alphabet) and is defined as follows: BadCharacterShift[c] = {min(i: 1 ≤ i ≤ M-1 and P[M-1-i] = c)} if c appears in P, otherwise the value of the function is M. The jumping distances for the algorithm are calculated as follows (see Figure 2.6 for a graphical illustration): • For the good-suffix shift: 1. If a mismatch occurs between P[i] and S[i+j] at position j, the segment S[i+j+1, ..., j+M-1] = P[i+1, ..., M-1] will be aligned with the rightmost occurrence in P that is followed by a character different than P[i] (situation A, P[k] 6= S[i] ). 2. 
2.1.1.4 The Boyer-Moore Algorithm

Figure 2.6: The Boyer-Moore jump distance computation for good-suffix (left) and bad-character (right) shifts.

One of the fastest algorithms in practice was proposed by Boyer and Moore in [BM77] and bases its idea on performing comparisons from right to left, as opposed to the usual left-to-right searches performed by most other algorithms, in the hope that larger jumps can be performed whenever a mismatch is found. By using two precomputed jumping distances for the pattern (called the good-suffix shift and the bad-character shift), the algorithm can perform as fast as O(N/M) in the best case. The formal definition of the jump distances for the Boyer-Moore algorithm, based on the definition proposed by Lecroq et al. in [LC04], is:

GoodSuffixShift[i+1] = min{s > 0: (for each j, i < j < M: s ≥ j or P[j−s] = P[j]) and (if s < i then P[i−s] ≠ P[i])}

and GoodSuffixShift[0] is equal to the length of the period of P. The bad-character jump distance is stored in a table of size |Σ| (the size of the alphabet) and is defined as follows:

BadCharacterShift[c] = min{i: 1 ≤ i ≤ M-1 and P[M-1-i] = c} if c appears in P, otherwise the value of the function is M.

The jumping distances for the algorithm are calculated as follows (see Figure 2.6 for a graphical illustration):

• For the good-suffix shift:
1. If a mismatch occurs between P[i] and S[i+j] at position j, the segment S[i+j+1, ..., j+M-1] = P[i+1, ..., M-1] will be aligned with its rightmost occurrence in P that is preceded by a character different from P[i] (situation A, P[k] ≠ P[i]).
2. If there is no segment that verifies this property, the longest suffix Suf_max(S[i+j+1, ..., j+M−1]) will be aligned with a matching prefix of P (situation B).

• For the bad-character shift:

Figure 2.7: The Boyer-Moore pattern matching process for the example in Figure 2.1.

1. The character S[i+j] is aligned with its right-most occurrence in P[0, ..., M-2] (situation A, P[k] = S[i+j]).
2. If there is no occurrence of S[i+j] in the pattern P, the sliding window is aligned with S[i+j+1] (situation B).

For performing the shift, the maximum between the good-suffix shift and the bad-character shift values is used, since the bad-character jump distance can be negative. The Boyer-Moore algorithm is presented as follows:

• Input
1: GoodSuffixShift (GS), the pre-computed shifting distances
2: BadCharacterShift (BC), the pre-computed shifting distances
3: S, the input string
4: P, the pattern to look for

• The Boyer-Moore algorithm
1: j ← 0
2: while j ≤ N − M do
3:   i ← M − 1
4:   while i ≥ 0 and P[i] = S[i + j] do
5:     i ← i − 1
6:   end while
7:   if i < 0 then
8:     match found at position j
9:     j ← j + GS[0]
10:  else
11:    j ← j + max(GS[i], BC[S[i + j]] − M + 1 + i)
12:  end if
13: end while

The run-time performance of the algorithm is still quadratic in the worst case, O(N × M), for a pattern containing repeated cycles. If the pattern is non-periodic however, Cole showed in [Col91] that at most 3N comparisons are performed.
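As a minimal illustration, the sketch below precomputes only the bad-character table defined above and searches with it alone, which is essentially the Boyer-Moore-Horspool simplification; adding the good-suffix table follows the same structure. Function names and the byte-sized alphabet are illustrative assumptions.

#include <stdio.h>

#define SIGMA 256   /* |Sigma|: one entry per possible byte value */

/* BadCharacterShift[c] = min{ i : 1 <= i <= M-1 and P[M-1-i] = c }, or M if c is absent. */
static void bad_character_table(const unsigned char *P, int M, int bc[SIGMA]) {
    for (int c = 0; c < SIGMA; c++) bc[c] = M;
    for (int i = 1; i <= M - 1; i++)
        if (i < bc[P[M - 1 - i]]) bc[P[M - 1 - i]] = i;
}

/* Right-to-left comparisons; the window shifts by the bad-character rule only. */
static void bm_bc_search(const unsigned char *S, int N,
                         const unsigned char *P, int M) {
    int bc[SIGMA], j = 0;
    bad_character_table(P, M, bc);
    while (j <= N - M) {
        int i = M - 1;
        while (i >= 0 && P[i] == S[i + j]) i--;
        if (i < 0) printf("match at position %d\n", j);
        j += bc[S[j + M - 1]];   /* shift keyed on the last character of the window */
    }
}

Keying the shift on the window's last character keeps every shift strictly positive, which is why this reduced variant does not need the max() combination with a good-suffix value used by the full algorithm.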
2.1.2 Multiple-Pattern Matching Algorithms

The multiple pattern matching problem is a logical extension of the single pattern matching challenge and modifies the requirements so that the problem no longer focuses on a single pattern, but rather on multiple patterns to be searched for in the input string. While this problem could be easily (yet not necessarily efficiently) solved by applying repeated single pattern matching algorithms, there are better solutions that perform faster and have improved storage efficiency. We will focus in this section on these solutions and present some of the most widely spread approaches used nowadays for solving this problem.

The multiple pattern matching problem can be formulated as follows, keeping the notations from 2.1.1: given the alphabet (a finite set of characters) Σ and the input string S = Γ[0], Γ[1], ..., Γ[N-1] of length N, find all occurrences of the K patterns Pi = Λi[0], Λi[1], ..., Λi[Mi−1] of length Mi in S, where S and Pi are strings over Σ, i ∈ {1, 2, ..., K} and Mi ≤ N.

2.1.2.1 The Aho-Corasick Algorithm

One of the most widespread algorithms used nowadays to solve the multiple pattern matching problem is the one proposed by Aho and Corasick in [AC75]. The algorithm is an extension of the Knuth-Morris-Pratt approach to multiple keywords, constructing a finite automaton and building on top of it. The Aho-Corasick algorithm is also related, although not identical, to the notion of suffix (position) trees, a concept introduced by Weiner in [Wei73], another approach commonly used in pattern matching applications because of its ability to find and represent repeated substrings efficiently. Crochemore et al. in [CR94] have described an algorithm called the Backward Directed Acyclic Word Graph (Backward-DAWG), which simulates the bit parallelism of the suffix automaton (therefore, its efficiency can be observed only if the length of the pattern is smaller than the memory word length of the machine it runs on); later on, Fredriksson in [Fre09] provided an algorithm that is a hybrid of the suffix tree concept ([Wei73]) and the Backward-DAWG algorithm ([CR94]).

Cormen et al. in [Cor02] have introduced the notion of a finite automaton as a 5-tuple. Based on this definition, we can introduce the definition of an Aho-Corasick finite state machine as a 7-tuple (S, s0, S^F, Σ, goto, failure, output), where:

• S is a finite set of states
• s0 = root ∈ S is the initial state
• S^F ⊆ S is the set of accepting states
• Σ is the input alphabet
• goto is a function defined as goto: S × Σ → S, also known as the transition function
• failure is defined as failure: S → S and represents the failure (mismatch) function
• output: S → S represents the output function

The basic behavior of the automaton can be expressed as follows:

1. if the automaton is in state s and reads input character c, the current state becomes goto(s, c) (if defined) or failure(s) if undefined
2. if the current state of the automaton is s and s ∈ S^F, s is an accepting or final state
3. the output(s) function returns true if s ∈ S^F

The algorithm is based on the trie (prefix) tree data structure, which is commonly used to store an array having strings as keys. The root of the trie is associated with the empty string ε, and for a given node v of depth d, all nodes v' of the trie with depth d' > d share a common prefix of length d (Figure 2.8). The algorithm is comprised of the following steps:

1. create a trie of all the keywords to be searched for in the input string
2. pre-process the trie by computing, for each node, the failure function defined earlier:
   • each node v is located at depth depth(v) and has the word word(v) associated to it
   • failure(v) = v', where Suf_{d'}(word(v)) = word(v'), d' < depth(v) and d' is the largest integer for which this property is verified (in other words, the failure function points to a node v' whose associated word is the maximal proper suffix of word(v), the word of the current node)
   • if there exists no d' for which the property above is verified, failure(v) = root
3. perform the match process

Figure 2.8: The trie tree created for the keyword set {he, she, his, hers}. Accepting states are numbered.

The algorithm is presented as follows:

• Input
1: S, the input string
2: Pi, the patterns to look for

• The Aho-Corasick algorithm
1: state ← root
2: for i ← 1 to N do
3:   while goto(state, S[i]) = undefined do
4:     state ← failure(state)
5:   end while
6:   state ← goto(state, S[i])
7:   if output(state) ≠ ∅ then
8:     match found, state ∈ S^F
9:   end if
10: end for

One great practical advantage of the algorithm is that it is optimal in the worst case, having a run-time performance of O(N + Σ_{i=1..K} M_i + Z), where Z is the number of matches found. The automaton created for the trie in Figure 2.8 is shown in Figure 2.9.

Figure 2.9: The Aho-Corasick automaton created for the keyword set {he, she, his, hers}. Failure functions are dashed.
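A compact C sketch of the three steps above (trie construction, breadth-first computation of the failure function, and the scanning loop) is given below. It assumes a small fixed node pool and a 256-symbol alphabet, pre-resolves failure transitions into a full goto table (equivalent to following failure(state) repeatedly at scan time), and, for brevity, stores a single pattern index per node; the names ac_init, ac_insert, ac_build_failures and ac_search are illustrative.

#include <stdio.h>
#include <string.h>

#define SIGMA 256
#define MAX_NODES 1024   /* illustrative fixed pool size */

static int trans[MAX_NODES][SIGMA];  /* goto function; -1 = undefined                 */
static int fail[MAX_NODES];          /* failure function                              */
static int out[MAX_NODES];           /* index of a pattern ending at this node, or -1 */
static int node_count = 1;           /* node 0 is the root                            */

static void ac_init(void) {
    memset(trans[0], -1, sizeof trans[0]);
    out[0] = -1;
    fail[0] = 0;
}

static void ac_insert(const char *pattern, int id) {
    int s = 0;
    for (const unsigned char *p = (const unsigned char *)pattern; *p; p++) {
        if (trans[s][*p] < 0) {
            memset(trans[node_count], -1, sizeof trans[node_count]);
            out[node_count] = -1;
            trans[s][*p] = node_count++;
        }
        s = trans[s][*p];
    }
    out[s] = id;
}

static void ac_build_failures(void) {
    int queue[MAX_NODES], head = 0, tail = 0;
    for (int c = 0; c < SIGMA; c++) {
        if (trans[0][c] < 0) trans[0][c] = 0;        /* the root loops on undefined symbols */
        else { fail[trans[0][c]] = 0; queue[tail++] = trans[0][c]; }
    }
    while (head < tail) {                            /* breadth-first traversal */
        int s = queue[head++];
        for (int c = 0; c < SIGMA; c++) {
            int t = trans[s][c];
            if (t < 0) { trans[s][c] = trans[fail[s]][c]; continue; } /* resolve failures */
            fail[t] = trans[fail[s]][c];
            if (out[t] < 0) out[t] = out[fail[t]];                    /* inherit an output */
            queue[tail++] = t;
        }
    }
}

static void ac_search(const unsigned char *S, int N) {
    for (int i = 0, s = 0; i < N; i++) {
        s = trans[s][S[i]];
        if (out[s] >= 0) printf("pattern %d ends at position %d\n", out[s], i);
    }
}

For the keyword set of Figure 2.8, calling ac_init(), then ac_insert("he", 0), ac_insert("she", 1), ac_insert("his", 2), ac_insert("hers", 3), followed by ac_build_failures() and ac_search(), mirrors the automaton of Figure 2.9; a production implementation would store a list of pattern indices per state so that overlapping matches (he inside she) are all reported.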
2.1.2.2 The Commentz-Walter Algorithm

Commentz-Walter proposed in [CW79] an algorithm based on an idea similar to the one proposed by Boyer and Moore in [BM77], but extended to the multiple pattern matching problem, resulting in a variant which attempts to perform larger jumps in the input string whenever a mismatch occurs, if the situation allows it, based on a pre-computed model similar to the bad-character and good-suffix shifts used in the single pattern matching variant. In other words, the algorithm attempts to exploit certain properties of the keywords to compute an efficient jump table that can be used at run-time to decrease the number of comparisons and therefore achieve better running times.

The Boyer-Moore variant discussed in [BM77] performs the search by comparing characters from right to left, as opposed to the normal left-to-right direction of comparison used by the brute-force or Knuth-Morris-Pratt variants. In a similar manner, the Commentz-Walter algorithm builds upon a trie constructed from the set of reversed patterns. Additionally, the algorithm introduces two new jump distances, called shift1 and shift2. They are computed as follows:

• for the root of the trie, shift1 = 1 and shift2 = wmin (where wmin is the length of the shortest keyword in the trie);
• otherwise, shift1 = min{wmin, {l; l = d(v') − d(v)}}, where d(v) is the depth of the current node v in the tree, and v' is an element of set1(v) = {v'; word(v) is a proper suffix of word(v')};
• shift2 = min{shift2(parent node of v), {l; l = d(v') − d(v)}}, where v' is an element of set2(v) = {v'; v' is an element of set1(v) and v' is a leaf in the trie}.

Figure 2.10: The Commentz-Walter tree for the keyword set {he, she, his, hers}. Shift distances are represented as [shift1, shift2] and leaf nodes are marked bold.

Basically, the shift distances for a node are computed as follows:

• the shift1 distance is computed as the minimum between wmin and the depth difference between the current node and a node of higher depth for which the word at the current node is a suffix;
• the shift2 distance is computed as the minimum between the shift2 value of the parent node and the minimum depth difference between the current node and a leaf of higher depth for which the word at the current node is a suffix.

The algorithm scans the input text and makes comparisons from right to left, shifting the text by a value of min{shift2(v), max{shift1(v), char(inputText[pos − depth]) − depth − 1}} (pos is the current position in the file, depth is the current depth in the tree). The char function is defined as char(c) = min{wmin + 1, min{d(v); v is a node with label c}}. An example of what the Commentz-Walter tree looks like for the same set of patterns used in the Aho-Corasick automaton in Figure 2.9 is shown in Figure 2.10.
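Assuming a trie of the reversed patterns with per-node depth, parent and suffix (failure) links, analogous to the structures in the Aho-Corasick sketch earlier, the two shift tables can be filled as sketched below. The array names (depth, suffix_link, is_word, parent) and the BFS node ordering are assumptions for illustration; the update rules mirror the definitions of set1 and set2 above.

/* Sketch: shift1/shift2 for a Commentz-Walter trie built over the reversed patterns.
 * Assumes per-node arrays:
 *   depth[u]       - depth of node u (root = 0)
 *   suffix_link[u] - node whose word is the longest proper suffix of word(u)
 *   is_word[u]     - non-zero if a (reversed) keyword ends at u (a leaf in Figure 2.10)
 *   parent[u]      - parent of u in the trie
 * and that nodes 0..node_count-1 are numbered in BFS (non-decreasing depth) order. */
void cw_compute_shifts(int node_count, int wmin,
                       const int *depth, const int *suffix_link,
                       const int *is_word, const int *parent,
                       int *shift1, int *shift2) {
    shift1[0] = 1;                                   /* root: shift1 = 1, shift2 = wmin */
    shift2[0] = wmin;
    for (int u = 1; u < node_count; u++) { shift1[u] = wmin; shift2[u] = wmin; }

    /* every node v on the suffix-link chain of u has word(v) as a proper suffix of word(u) */
    for (int u = 1; u < node_count; u++) {
        for (int v = suffix_link[u]; ; v = suffix_link[v]) {
            int l = depth[u] - depth[v];
            if (l > 0) {
                if (l < shift1[v]) shift1[v] = l;                 /* u belongs to set1(v) */
                if (is_word[u] && l < shift2[v]) shift2[v] = l;   /* u belongs to set2(v) */
            }
            if (v == 0) break;                                    /* reached the root */
        }
    }

    /* shift2 also takes the minimum with the parent's shift2 value */
    for (int u = 1; u < node_count; u++)
        if (shift2[parent[u]] < shift2[u]) shift2[u] = shift2[parent[u]];
}

At scan time, the shift applied when the traversal stops at node v of depth j is then min(shift2(v), max(shift1(v), char(c) − j − 1)), exactly the formula above.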
The algorithm for performing the scanning is presented below:

• Input
1: S, the input string
2: Pi, the patterns to look for, with wmin = min(wPi) the length of the shortest pattern

• The Commentz-Walter algorithm
1: v ← root
2: i ← wmin
3: j ← 0
4: while i ≤ N do
5:   while v has a child v' with label S[i − j] do
6:     v ← v'
7:     j ← j + 1
8:     if output(v) ≠ ∅ then
9:       match found at position i − j
10:    end if
11:  end while
12:  i ← i + min(shift2(v), max(shift1(v), char(S[i − j]) − j − 1))
13:  v ← root
14:  j ← 0
15: end while

The algorithm is not optimal in the worst case and, in fact, has a worst-case behavior that is much worse than that of the Aho-Corasick algorithm, being O(N × wmax), where wmax is the maximum length of the patterns in the input set. The variant presented is marked in the original paper [CW79] as B; a different variant, B1, uses the exact same trie as the algorithm we have discussed, but reduces the worst-case scanning time to O(N) by using a buffer of size wmax which holds the last few characters scanned from the input string. In practice though, as shown by Watson in [Wat95], the algorithm performs much better than its Aho-Corasick counterpart, since it can skip larger portions of text.

2.1.2.3 The Wu-Manber Algorithm

Wu and Manber in [WM94] have proposed a fast algorithm based on a hashing principle, similar to the Karp-Rabin approach in 2.1.1.2. The Wu-Manber multiple pattern matching algorithm focuses on the generic purpose of the matching, that is, speeding up the search in the average case by implementing shift and hash tables, in a manner similar to the Boyer-Moore bad-character shift approach [BM77]. The algorithm uses three tables built during the pre-computation phase: a SHIFT table, a HASH table, and a PREFIX table. The SHIFT table is similar to the Boyer-Moore bad-character skip table [BM77]. The algorithm considers blocks of text instead of single characters. If M is the sum of the keyword lengths and c the size of the alphabet, then the block length B is optimized for running time and space when B = log_c(2M) (in practice, B = 2 or B = 3 are recommended). The SHIFT table determines the shift based on the last B bytes rather than just one. A hash function maps blocks to an integer used as an index into SHIFT, and the total space required here is O(c^B) (for B = 2 we have 2^16 = 65,536 elements, and for B = 3 we have 2^24 elements). If X denotes the block of B bytes currently being scanned in the input data, the algorithm takes into account two different situations:

• if X does not appear as a substring in any pattern, SHIFT[hash(X)] = wmin − B + 1
• otherwise, find the rightmost position q at which X ends inside any of the keywords (considering only their first wmin characters), then SHIFT[hash(X)] = wmin − q

The matching phase works as follows:

1. compute the hash value h based on the current B bytes of data (e.g. input[i−B+1] ... input[i])
2. if SHIFT[h] > 0, shift the text by this value and return to step 1, otherwise go to the next step
3. compute the hash value of the prefix of the text (reading wmin bytes to the left) and store it as hashPrefix
4.
for each p where HASH[h] ≤ HASH[h+1 ], if PREFIX[p] = hashPrefix check actual keyword (pattern[p] ) against the text directly, output matches (if any) 2.2 Intrusion Detection Systems Scarfone et al in [SM07] define an intrusion detection system (or commonly known in literate under the acronym IDS) as being the process of monitoring all events which are occurring in a computer system or network, followed by the analysis of possible or imminent violations of security policies, with either of them being related to computer security, acceptable use or standard security practices. Intrusion prevention is the process of detecting violations of such policies and preventing them from occurring, while eradication refers to eliminating all violations of such sort and restoring the system to its original, pre-intrusion state. Malware (a term used for malicious software) is a particular type of program or software aimed at damaging or offering access to a victim’s computer. The term is used nowadays to express a wide variety of types of dangerous, intrusive or harassing software. The degree of malicious activity that a software contains is based on the perceived intent of the creator and not on functionality of the software itself. The most common types of malware existing today are: computer viruses, spyware, adware, PhD Dissertation 24 trojan horses, worms, rootkits and others. In digital forensics and law, the term computer contaminant may be found as describing malware, with several U.S. states (for example, California and West Virginia) containing a description of such term in their legal code ([Malb, Mala]). It has become a common practice during time for many malware programs to disguise themselves as being genuine products, while the actual purpose remained hidden. Such software can cause damages to the operating system it is running on (e.g. many viruses caused damages to the MS-DOS operating system back in the 90s and even changed or completely eradicated the bootloader, a critical component that loads the operating system in memory on computer startup), can disrupt normal activities during work (e.g. commonly known as adware and spyware programs), or can even go as far as damaging internal hardware devices (e.g. one of the first viruses to cause major damage was CIH, also known as Chernobyl or Spacefiller according to [Wikb], causing more than 60 million computers to be infected and an estimated of more than 1 billion dollars in damage). Another important area of research for intrusion detection systems are network IDS. A network IDS is an intrusion detection system which aims to detect unauthorized access to a computer network through traffic analysis. Network IDS are important to maintaining a secure network and preventing attacks on it, such as those of type Denial-of-Service (DoS). Denial of service attacks (or distributed DoS attacks) are attempts to make the a network or computer resource unavailable to its users, either temporarily or indefinitely. The common victims of such attacks are usually represented by banks, domain nameservers, payment gateways, high-profile websites and so on. The term is not limited to the network IDS terminology, but also finds applicability in the management of CPU resources, as discussed by Fledel et al. in [FC10]. 2.2.1 Classification Early intrusion detection system classifications were proposed in the late 1980s by Lunt [Lun88], followed by similar classification work in the early 90s by McAuliffe et al. [MH90] and Esmaili et al. 
[Esm95], with more in-depth reviews by Jackson et al. in [Jac99]. According to Rahman in [Rah00], there are two large categories of intrusion detection systems: • Statistical anomaly based systems, which evaluate the incoming data and attempt to quantify, using one or more algorithms, the anomalous activity which characterizes the data. If this classification takes the anomalous value associated with the data over a certain threshold, the data is classified as being anomalous and is probably therefore discarded. PhD Dissertation 25 Figure 2.11: A typical misuse detection system (according to [Rah00]). Figure 2.12: A typical anomaly detection system (according to [Rah00]). • Pattern-based systems, or signature-based, which attempt to classify the incoming data by comparing it to a list of known malicious activities, described by the signatures or patterns mentioned. The system is, in this case, dependant on an existing pattern database which characterizes the known malicious behaviors. A typical misuse detection system is shown in Figure 2.11, while an anomaly detection system is depicted in Figure 2.12. Debar et al in [DHW99] have proposed a taxonomy for classifying intrusion detection systems, presented in Figure 2.13. In the taxonomy proposed, they focus on two types of characteristics: functional and non-functional. The functional characteristics refer to three types of IDS, based on detection method, the behavior and the audit source location (an element mostly common to network IDS). The taxonomy of Debar et al. from [DHW99] has been later on revised in [DH00] (by Debar et al. themselves) and by Axelsson in [Axe00]. The revised taxonomy of PhD Dissertation 26 Figure 2.13: The taxonomy for IDS classification as proposed by Debar et al in [DHW99]. Debar et al. in [DH00] adds one new type of intrusion detection system, based on a detection paradigm, which is state (where the IDS attempts to classify the state as being a failure or an error state) or transition based (where the IDS is being monitored for all states that may be representing an intrusion), both supporting proactive and nonperturbing evaluation. 2.2.2 Detecting Malicious Code Behavior The problem of detecting malicious code is a critical component of modern antivirus engines nowadays and focuses on detecting pieces of code belonging to an executable file that are dangerous to the system or represent an imminent threat to it. Such malicious code can be sometimes difficult to track down and eradicate, and is most commonly found in viruses, worms and trojan horses. Detecting and removing malicious code can be a difficult task to accomplish, as pointed out by Tanase from Symantec Corp. in [Tan10]. The most important part in keeping the system safe is backing up permanently the important data, as later on there is very little room for error when tracking down intrusions and sometimes harmful compromises need to be made (e.g. files may need to be deleted as they become unrecoverable, or data is lost as a consequence of the intrusion). In order PhD Dissertation 27 to determine the malign behavior, a step-by-step analysis of the evidence needs be made, since most intruders leave traces of information behind which could be picked up by an in-depth forensic analysis, for instance logs, access file times, and so on. As we had already outlined in 1.1, the most obvious example of malicious code is represented by computer viruses. 
Polymorphic viruses and metamorphic viruses pose many additional challenges, since polymorphic viruses partially encrypt their own code to hide its detection from signature-based antivirus engines, while metamorphic viruses shift their code and structure permanently, creating completely new copies of themselves which keep the malicious functionality intact, yet are completely different than the previous copy of the virus that had propagated. Determining whenever an intrusion occurs can be usually conducted as a survey comprised of several steps, as pointed out in [Tan10], paying extra attention to the details that might show information about a potential perpetrator. Here are some generic guidelines to how malicious behavior and intrusions can be detected: • A first step would be the log analysis of the machine and, if possible, taking down the machine during the process to prevent further damage. Since some systems cannot be taken down too easily without disrupting certain types of services that they offer, the logs may have to be analyzed separately. And even in such a situation, there is always the possibility for an intruder to alter, delete or remove clues from the logs. Furthermore, certain logs are vast and the analysis may not be possible by human operator intervention, but using forensic tools or utilities. Any unusual or suspicious traces of activity must be captured with great attention to detail and later on analyzed for indications as to how and where the intrusion occurred. • A next logical step is the behavior analysis of the system. Discovering anomalies can be easier if there are changes in the running configuration of the system, and such information is usually observed by investigating the running processes, the amount of resources (both in terms of CPU usage and memory usage) a process uses, how often a job is scheduled to execute and to whom it belongs (for instance, system-scheduled jobs that require high CPU priority and run at a pre-defined regular interval could mean a potential breach in the system, a risk commonly found in a trojan horse’s behavior). • Network traffic analysis and network resource utilization is also an important step for locating intrusions. Certain types of malware (worms, trojan horses, rootkits, etc.) open ports on a victim’s computer and sometimes also establish permanent connections to a certain server from where the victim computer may be manipulated easily. Observing network data traffic, port usage and the running services on a machine can be therefore a vital component to tracking down abnormal behavior. PhD Dissertation 28 • Observing file system changes can offer clues of vital important to tracking down virus infections. Viruses replicate from one executable code to another, and keeping hashes of files can be a good practice in observing whenever file changes occur - an approach that had been used by older antivirus engines, such as the Norman Thunder-Byte AntiVirus (TBAV), and which is still found in certain modern antivirus engines today. Since viruses can easily change the access and modified times of a file, the only way to determine if an infection had occurred remains comparing the hash obtained by analyzing the file’s contents at regular time intervals: if the hashes differ, the file contents has changed which may suggest an infection of some sort; if a pattern can be observed in the changes performed on the files (e.g. 
only executable files have changed from the last hash verification), then the risk of an infection can increase exponentially and may suggest that immediate action is required. Shevchenko from Kaspersky Lab in [She08] discusses a model for detecting malicious code behavior, starting from the premise that any architecture of such type is comprised of two components: one that is analytical, and one technical, pointing out that in practice for certain infections such components may not be clearly separated, although in terms of functionality their differences are important. The technical component represents the set of program algorithms which select the data to be analyzed by the analytical component (the data itself varies and can be of several types, from program actions to file data contents), while the analytical component acts as a decision-making system, issuing after the analysis a verdict on the threat level of the data analyzed. One of the simplest techniques for finding infections is looking for virus signatures inside an executable file’s contents. In this scenario, the technical component discussed in [She08] collects information about the file and passes the data to the analytical component, which compares sequences of bytes in the files with the known signatures in order to determine whether an infection exists or not. A separation of the analytical and technical component is shown in Figure 2.14, where the HIPS acronym stands for Host Prevention Intrusion System. The emulation module in Figure 2.14 is a component widely spread today and relies on breaking down the program into commands, analyzing these commands and issuing a threat level for their effects. This way, the program’s behavior can be observed without posing any threat to the operating system or to the system itself. Emulating engines exist in most, if not all, modern antivirus engines, although sometimes they may be a part of a virtualization engine (also called a sandbox ). The sandbox offers real-world access to the executing program, but poses many additional limitation from preventing any damage from happening, for instance denying access to vital operating system files or resources during run-time. The level of abstraction PhD Dissertation 29 Figure 2.14: The analytical and technical components used in intrusion detection systems as proposed by Shevchenko [She08]. increases with the monitoring of system events and the final element of the architecture is represented by the system anomaly scanner, which determines if the operating system’s state is healthy or if it has been compromised in any way as a result of an infection. One of the most spread open-source antivirus programs today is ClamAV [cla12]. One important advantage towards research in the field of computer security and malicious code detection in particular is the permanently updated database (usually one or more times per day), with a large number of malware signatures, which are used in many studies for comparison and improvement to the existing solutions. 2.3 Digital Forensics Digital forensics, according to Reith et al. in [Rei02] and Carrier et al. in [Car02], is part of the forensic science, a branch of science focused on the investigation of material found in digital devices, along with its potential recovery, and is often related to crime investigations, also called cyber-crime [cyb]. 
The president of the United States, Barrack Obama, had declared in [Oba09] that the number of threats his country is dealing with “is one of the most serious economic and national security challenges we face as a nation”. Verizon, the Dutch High Tech Crime Unit and the U.S. Secret PhD Dissertation 30 Service had produced a report in 2011 [ea11], where they had stated that the amount of intrusions (hacking and malware combined) was at the highest level ever at that time. Palmer in [Pal01] had described the first steps to performing work in digital forensics as being preservation, collection, validation and identification. The full abstract digital forensics model proposed by Palmer in the same paper defines as key components the following though: identification, preparation, approach strategy, preservation, collection, examination, analysis, presentation and returning evidence. Identification refers to determining the type of an incident after successfully identifying it, preparation refers to preparing search warrants and the tools required, the approach strategy refers to defining a strategy which is capable of collecting untouched evidence without affecting the victim or interfering with it as little as possible, while preservation refers to isolating, securing and preserving the digital evidence. The collection of digital evidence refers to duplicating the evidence using standard forensic methods, so that the examination can be performed afterwards through an in-depth review of the collected evidence. The last two steps are the presentation of the evidence, which exposes a summary of the conclusions achieved as a result of all previous steps, and returning the evidence once the process has completed. Returning the evidence is not an explicit step in digital forensics and it is not used often in the investigation process. The real-world impact of digital forensics is significant and we will discuss as follows a few applications of the field to different areas of interest for this thesis. 2.3.1 Cyber-Crime Investigations One of the first major applications of digital forensics is cyber-crime investigations, where clues that may be vital to a trial can be found and prosecutors may be offered the proof they need to motivate the charges. As pointed out by Nelson et al. in [NC10], the evidence found as a result of the investigation can be incriminatory (also called inculpatory) or exculpatory, meaning the suspect is being cleared of the charges. While most of the time it is sufficient to use software forensic tools for looking up pieces of information useful in the investigation process, sometimes it may be required to use state-of-the-art technology to retrieve lost or deleted data, with the costs starting from $3,000 and going up to $20,000 or even more. Some of the most common crimes involving computers include covert financial operations (e.g. money laundering from drug operations), human trafficking, electronic fraud, identity theft, etc. Whenever the evidence is deleted a priori by the suspect, data forensic technologies help with its recovery and provide essential clues to the prosecution. Along the technical side of the cyber-crime investigations itself, which is the main subject of interest for this thesis, there is also a social and ethical side to the problem, PhD Dissertation 31 with direct impact in law, a topic thoroughly discussed by Harrington in [Har11]. 
The author discusses the challenges existing when co-working with a digital forensics expert and discusses the moral and ethical aspects that had appeared throughout time in several such cases. 2.3.2 Data Recovery A second major application of digital forensics is in the data recovery process. While computer forensics and data recovery are separated domains, according to Nelson et al. [NC10], the techniques in computer forensics can be successfully be applied to help improve the data recovery process. Such a process might be necessary in order to recover important business-related data that had been accidentally or unintentionally lost, but may also be useful in the cyber-crime investigation process when attempting to recover files or photos erased on purpose by potential criminals in order to hide their traces. Data loss is affecting business of all sizes, but the financial losses are usually reaching their peak when we are discussing medium or enterprise-sized businesses. While data loss has a number of potential causes (e.g. hardware defect, faulty human intervention, mechanical error, etc.), the consequences can be of great importance to how the business is further conducted and sometimes can mean bankruptcy if the matter is not treated seriously. Although storage devices have evolved during time, with the classic hard-disk drives being replaced slowly by the non-mechanical solid state disk (SSD) devices, which also offer the high performance benefit at a much higher cost, both types of storage are prone to hardware or logical failures, which directly affect the reliability and persistence of the data in question [Pun12]. While sometimes data recovery is impossible given the amount of damage produced (in which case, the backup solution remains the only viable approach to securing a business’s assets in time), most of the times partial data recovery is possible through the use of forensic investigation tools (for instance, when logical errors exist in the filesystem, or when the partition table has been damaged). The financial aspects of data recovery are also worth mentioning, given that according to [IoM], a typical recovery scenario costs from $5,000 per day, and can vary with the degree of damage to the data. In 2009, a study [HA] discussing the average SSD data recovery costs had estimated the amount of $2,850 per damaged SSD, with a timeframe to completion of about three weeks. A series of best approaches for data leakage protection is being discussed by Bunn et al. in [Ben11], where the authors discuss a series of steps required for protection, such as developing the incident response plan, establishing an emergency response team and testing the plan. They also discuss internal and external investigations and evaluate a series of products for performing the forensic analysis. PhD Dissertation 2.4 32 Summary In this chapter we have discussed a series of the most common pattern matching algorithms used in malicious code detection and digital forensics in Section 2.1, discussing both single pattern matching (Section 2.1.1) and multiple pattern matching (Section 2.1.2) variants, we have presented a classification of intrusion detection systems (Section 2.2) and a discussion on how malicious code can be detected (Section 2.2.2), and have introduced digital forensics in Section 2.3, along with some of its potential applications and real-world impact. 
We have outlined the most important aspects of an IDS, discussing some of the most commonly known classifications (Section 2.2.1) based on the works of Rahman in [Rah00] and Debar et al in [DHW99], and discussed the basic steps in a forensic investigation as defined by Palmer in [Pal01] (Section 2.3). Lastly, we have presented two of the major applications of the digital forensics process, particularly in fighting cyber-crime investigations (Section 2.3.1) and in the data recovery process (Section 2.3.2). Chapter 3 An Efficient Mechanism for Storing Very Large Automata in Hybrid CPU/GPU-Based Memory Contents 3.1 3.2 3.3 3.4 The CUDA Architecture . . . . . . . . . . . . . . . . . . . 34 3.1.1 The GF114 Streaming Multiprocessor . . . . . . . . . . . . 34 3.1.2 Experimental Testbed . . . . . . . . . . . . . . . . . . . . . 37 3.1.3 The Programming Paradigm . . . . . . . . . . . . . . . . . 37 3.1.4 Atomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 The ClamAV Signatures . . . . . . . . . . . . . . . . . . . 40 3.2.1 The Signature Formats . . . . . . . . . . . . . . . . . . . . 40 3.2.2 The Storage Methodology . . . . . . . . . . . . . . . . . . . 41 CPU and GPU-Accelerated Approaches to Virus Scanning 42 3.3.1 Split-Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.2 GPU Gems 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.3 GrAVity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3.4 MIDeA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.5 PFAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 A Highly-Efficient Storage Model for Pattern Matching Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Towards a Hybrid CPU/GPU Efficient Storage Architecture 53 3.4.3 The Storage Model . . . . . . . . . . . . . . . . . . . . . . . 54 3.4.4 The Constraint-Based Aho-Corasick Algorithm . . . . . . . 58 3.4.5 Towards a Parallel Implementation for the GPU . . . . . . 59 33 52 34 PhD Dissertation 3.4.6 3.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 69 The rapid progress of technology in the last years had determined the sudden growth in processing power of both CPUs but also GPUs, with performance nowadays in GPUs that can far exceed the bandwidth of a normal CPU. The immediate advantage is that the CPU can be offloaded with certain computationally intensive tasks, while the GPU can handle them without disrupting the normal user’s work. The two major players in the graphics card industry, nVIDIA and Advanced Micro Devices (AMD, former ATI), have both announced graphics cards that are programmable, nVIDIA when introducing the CUDA [cud12] programming framework for their graphics cards, followed by AMD with the AMD FireStream [AMD], a streaming multiprocessor to use parallel stream processing for heavy computations. 3.1 The CUDA Architecture CUDA (Compute Unified Device Architecture) [cud12] was first proposed by nVIDIA in 2007, offering versions for both the Windows and Linux operating systems, while version 2.0 contained support for MacOS X also. CUDA was designed to be of use in both professional and casual graphics cards, having similarities with the OpenCL [Ope] (Khronos Group) framework and with Microsoft’s DirectCompute [Dir] alternative. Figure 3.1 presents the overall CUDA architecture. 
The latest GPUs in the nVIDIA line, starting with the G8X, can perform computations similar to a microprocessor using the CUDA programming model, however there is a particular feature to CUDA that makes it different than normal programming on a CPU: GPUs are designed to run multiple (but slower) execution threads, in parallel, compared to running one single (but much faster) thread as a normal CPU. Nonetheless, given the massively parallel architecture of GPUs today, the throughput of a GPU can greatly exceed (by a number of times) that of a CPU. The ability to perform CPU-alike operations on a GPU is often referred to as general-purpose computing on graphics processing units, or GPGPU [GPG]. 3.1.1 The GF114 Streaming Multiprocessor The GF114 streaming multiprocessor is commonly found in graphics card in the nVIDIA GTX 560 series. For the tests in this thesis, we have used a middle-range GeForce GTX 560 Ti graphics card, based on the Fermi [Fer] architecture, built in a 40nm fabrication process and having an estimated of 1.95 billion transistors in a die size of just 360mm2 . The theoretical peak memory bandwidth of the GF114 is 128.3 GB/s, while overclocked versions can reach up to 146.6 GB/s according to [Tec]. The chip’s overall architecture is presented in Figure 3.2. 35 PhD Dissertation Figure 3.1: The CUDA architecture [cud12]. The GTX 560 Ti graphics card has 8 streaming multiprocessors (also called SMs), with 48 cores per each SM, as shown in Figure 3.3. In GPGPU terminology [GPG], the term host refers to the CPU, while the term device refers to the GPU. Both have their own types of memory, with the host using the RAM memory, while the device uses the video memory, that is usually much more limited. The GF114 chip has both shared and global memory. Global memory is accessible to all threads, but accessing it is much slower than when working with shared memory, which is shared by all threads in a block and has the lifespan of the block. While registers are the fastest form of memory on the multiprocessor, they are only accessible from a thread and have the lifetime of the thread - similar to these, shared memory (which can go up to 48KB per multiprocessor as of now) can be just as fast as a register in certain situations (reading from same memory location, etc.), it is accessible from any thread of the block from which it was created and has the lifespan of a block. On the other side, global memory is much slower (up to 150× to 200× slower, but it is accessible from both the host and the device and has the lifespan of the application itself. Local memory belongs to a thread, can be still as slow as global memory (since it resides in it), and is only accessible by the thread (having therefore the lifespan of the thread). An optimization problem common to GPU programming using the CUDA architecture refers to uncoalesced reads and writes from and to the global memory, given PhD Dissertation Figure 3.2: The GTX 560 Ti overall architecture [Tom]. 36 PhD Dissertation 37 their slow performance compared to the shared memory. In order to significantly optimize the performance and throughput of the algorithms, it is recommended therefore to use the shared memory resources (limited as they are) and avoid dispersed reads and writes if at all possible. 
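A small illustration of that recommendation in CUDA C: the kernel below stages a tile of global memory into shared memory with one coalesced read per thread, then performs its repeated (and otherwise dispersed) accesses against the on-chip copy. TILE and process_tile are illustrative names, and the per-thread work is a placeholder reduction.

#define TILE 128   /* one element per thread of the block (illustrative size) */

__global__ void process_tile(const int *input, int *output, int n) {
    __shared__ int tile[TILE];                      /* on-chip, shared by the whole block */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (gid < n) ? input[gid] : 0; /* coalesced: consecutive threads,    */
    __syncthreads();                                /* consecutive addresses              */

    if (gid < n) {
        int acc = 0;
        for (int k = 0; k < TILE; k++)              /* repeated, scattered accesses now   */
            acc += tile[(threadIdx.x + k) % TILE];  /* hit shared memory, not global      */
        output[gid] = acc;
    }
}

Launched as process_tile<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n), the kernel issues one global read and one global write per element, while the inner loop's 128 accesses per thread are served from the much faster shared memory.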
3.1.2 Experimental Testbed

For our implementation and experiments throughout this thesis, we have used an ASUS GeForce GTX 560 Ti DirectCU II Top graphics card with 384 CUDA cores, equipped with 1 GB of GDDR5 memory and having an engine clock of 900 MHz, a memory clock of 4,200 MHz and a shader clock of 1,800 MHz. To compare the results against a normal CPU, we have used an overclocked Core i7 2600K CPU based on the Sandy Bridge architecture [San], clocked at 4.3 GHz.

3.1.3 The Programming Paradigm

The fundamental execution unit in the CUDA programming model is the thread. A thread is a piece of code being executed sequentially. A block in CUDA can contain from 64 to 512 threads, and the recommended occupancy according to the CUDA Occupancy Calculator [CUDa] is about 66.66% for the GF114. For our GPU, in which a block contains at most 192 threads, that results in 128 threads per block, which is the amount we have used in all our implementations and experiments.

Data is processed in warps, which are groups of 32 threads in the GF114, the minimum unit of data processed in SIMD fashion by a CUDA multiprocessor. Warps which are currently active are scheduled using a thread scheduler, which periodically switches between warps using a proprietary scheduling algorithm, in order to make efficient use of the computational power offered by all multiprocessors. Blocks are organized as linear, two-dimensional or three-dimensional grids (Figure 3.4), and the code runs on the GPU through kernels. A kernel is a function executed by one thread on the GPU, but multiple threads execute the same kernel at the same time. Since the GF114 chip has 8 streaming multiprocessors with 48 warps per SM and 32 threads per warp, it could in theory run up to 8 × 48 × 32 = 12,288 threads in parallel.

3.1.4 Atomics

Atomic operations in parallel programming are operations which are performed without interference from other threads, in order to prevent race conditions, which are common problems in multi-threaded applications. According to [Rac], a race condition or hazard is a flaw in a process causing the output result of the process to be "unexpectedly and critically dependent on the sequence of timing of other events".

Figure 3.3: The GF114 streaming multiprocessor architecture [Tom].

Figure 3.4: The CUDA thread model [CUDb].

As a simple explanation, whenever a thread attempts to increment a value at a memory location, it performs a few simple logical steps to accomplish that: it reads the value at that memory location (step 1), it increments it (step 2) and then writes it back (step 3). If a race condition occurs (for example, the thread is interrupted by another executing thread immediately after finishing step 2 but before reaching step 3, and that other thread increments the same variable at the same memory location), then, when the original thread resumes, it will write back the original value incremented only once, instead of it being incremented twice (once by each thread). The CUDA architecture supports several atomic operations that are capable of reading, modifying and writing a value back to a memory address without the execution of such an operation being interrupted by another thread, which therefore guarantees that a race condition does not occur.
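The lost-update scenario just described can be reproduced directly in CUDA C: the two kernels below increment a single global counter once per thread, first with a plain read-modify-write and then with atomicAdd(); only the atomic version is guaranteed to finish with a value equal to the number of launched threads. This is an illustrative sketch rather than code used in the thesis.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void increment_plain(int *counter) {
    int old = *counter;   /* step 1: read                                        */
    old = old + 1;        /* step 2: modify                                      */
    *counter = old;       /* step 3: write back; updates made by other threads   */
                          /* between steps 1 and 3 are silently lost             */
}

__global__ void increment_atomic(int *counter) {
    atomicAdd(counter, 1);  /* read-modify-write performed as one uninterruptible operation */
}

int main(void) {
    int *d_counter, h_counter;
    cudaMalloc(&d_counter, sizeof(int));

    cudaMemset(d_counter, 0, sizeof(int));
    increment_plain<<<128, 128>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("plain increments : %d (expected 16384, typically far lower)\n", h_counter);

    cudaMemset(d_counter, 0, sizeof(int));
    increment_atomic<<<128, 128>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic increments: %d (always 16384)\n", h_counter);

    cudaFree(d_counter);
    return 0;
}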
Atomic operations applied to global memory addresses ensure that any two threads, even from different blocks, can access that memory location with the race condition safely avoided. We will exemplify with the atomicMax() function from the CUDA C Programming Guide [CUDc]. The prototypes of this function are the following: 1 2 int atomicMax(int* address, int val); unsigned int atomicMax(unsigned int* address, unsigned int val); The function reads the 32-bit word old from the memory address indicated by address (which can be located in both shared or global memory), computes the PhD Dissertation 40 maximum between old and val, and stores back at the same memory address the maximum of the two, returning the old value. A similar behavior is common to many other atomic functions in the CUDA SDK as well. 3.2 The ClamAV Signatures ClamAV [cla12] is an open-source initiative with a large database of malware samples. Security researchers throughout the world are using many of these samples as a testbed when testing experimental algorithms or when proposing new solutions for malicious code detection. The ClamAV signature database is comprised of two different types of signatures: a daily updated database, which is smaller in size (for easier updates), called daily.cvd, and a primary signature database which is larger in size, called main.cvd. Using the sigtool that ClamAV comes with, one can extract these two files from the database and inspect their contents, therefore obtaining access to the actual signatures. There are also other types of signature files that ClamAV uses, which we will discuss in Section 3.2.1. 3.2.1 The Signature Formats According to ClamAV’s own internal documents [Cre], the signatures formats that ClamAV understands are split into several types: • For .hdb files: – Format: MD5:Size:MalwareName • For .mdb files: – Format: PESectionSize:MD5:MalwareName – Example: 237568:ce914ca1bbea795a7021854431663623:Trojan.Bagle-328 • For .ndb files: – Format: MalwareName:TargetType:Offset:HexSignature – TargetType can be any of the following: 0 (any file), 1 (PE), 2 (OLE2 component, e.g. VBA script), 3 (HTML, normalized), 4 (mail file), 5 (graphics), 6 (ELF, UNIX format), 7 (ASCII text file, normalized) – Example: Trojan.Exchanger:1:EP+229:e81c000000e8e 6ffffff81c34e8dbffffffe 846ffffffe2e4 • For local signatures (ClamAV 0.94 and onwards), .ldb files: PhD Dissertation 41 – Format: SignatureName; TargetDescriptionBlock; LogicalExpression; Subsig0; Subsig1; Subsig2; ... – Example: Worm.Godog;Target:0;((0—1—2—3)& (4));(0);(1);(2);(3);(4); ... (signatures following) ... For .mdb files, the pe-sig Ruby script [pe-] is commonly used to produce signatures for each section inside a Portable Executable (PE) file. For antivirus programs, the signatures posing the most interest are those based on MD5 hashes [Riv92] and the ones containing opcodes that need to be matched inside a program executable’s contents. In Section 3.2.2 we discuss how ClamAV stores such signatures. The hash-based signatures ensure a fast and accurate match for most trojans and worms, but cannot detect the majority of viral infections, since the hash is being calculated for the entire file in question, while polymorphic and other viruses may shift their code and therefore trigger the hash to change from one infection to another. The opcode-based signatures can be significantly complex and may also contain several regular expressions in them, which allow for greater accuracy when performing a match. 
The regular expressions used are the following:

• ?, used to match a single half-byte
• ??, used to match an entire byte
• *, used to match any number of bytes
• (aa|bb), used to match either expression aa or expression bb
• {n}, used to match exactly n bytes
• {n-}, used to match at least n bytes
• {n-m}, used to match between n and m bytes
• {-m}, used to match at most m bytes

3.2.2 The Storage Methodology

Miretskiy et al. in [MDWZ04] have proposed an on-access antivirus file system, describing also the ClamAV storage format. ClamAV stores its virus signatures in memory using a modified Aho-Corasick [AC75] tree (Figure 3.5), whose depth actually depends on the length of the smallest signature. Since the alphabet of the ClamAV signatures is the entire ASCII set, ClamAV needs to hold, at each node in the tree, 256 pointers to all possible successors. Each node therefore occupies (assuming 32-bit pointers) at least 4 bytes per pointer × 256 pointers = 1,024 bytes (in their paper [MDWZ04], the authors state that it occupied 1,049 bytes).

Figure 3.5: The ClamAV modified Aho-Corasick tree [MDWZ04]. Dashed lines show failure transitions.

There is a direct dependence between the amount of memory required to store such a large automaton and the number of nodes in the automaton, as well as the depth of the tree. The Aho-Corasick automaton builds a tree of depth equal to the length of the longest pattern in the input set, making such a storage solution very demanding in terms of space. Furthermore, ClamAV limits the tree depth to the length of the smallest pattern, which is currently only 2 bytes; therefore, the tree's depth is just 2. From this point onwards, all patterns are stored as linked lists. The primary performance bottleneck here appears whenever a large number of patterns share the same prefix.

ClamAV's regular expression engine has been improved over time [Lin06] and it now uses two matching engines: the first uses the Aho-Corasick algorithm described in Section 2.1.2.1, and the second uses the Wu-Manber approach described in Section 2.1.2.3. The slower Aho-Corasick engine is only used when scanning for regular expressions containing any type of wildcards, while Wu-Manber is used for the remaining patterns.

3.3 CPU and GPU-Accelerated Approaches to Virus Scanning

Virus scanning implies determining whether a known virus signature resides in an executable file or not, which is in fact an application of the pattern matching problem discussed in Section 2.1. Given the high number of viruses and families of viruses in
This section represents an outline of the research performed by the author in [PN] and outlines some of the most common CPU/GPU-based architectures for achieving fast virus signature matching. 3.3.1 Split-Screen Cha et al. in [CMJ+ 10] have proposed an efficient anti-malware architecture called Split-Screen that performs an additional screening step prior to the signature matching process which can eliminate, according to its authors, over 90% of the non-infected files, leaving only suspicious files to be scanned and therefore reducing the scanning time significantly. Their approach is based on bloom filters, a concept introduced by Bloom in [Blo70]. A Bloom filter is currently a very compact structure to store a dictionary, and can determine if a pattern belongs to a dictionary or not. In the construction stage, fragments of fixed length are extracted from each signature and inserted into the filter. Later on, during scanning, a sliding window of the same size as the fragments is parsing the input data and the content at every byte is tested against the filter. The architecture used by SplitScreen is presented in Figure 3.7. PhD Dissertation 44 Figure 3.7: The SplitScreen architecture as proposed by Cha et al. in [CMJ+ 10]. One important observation from [CMJ+ 10] is that ClamAV has spent, in the authors’ tests, over 95% of the time scanning and matching regular expressions, although these represent only a small fraction (about 16%) from the total number of signatures. The authors also propose a distributed architecture for their approach, showing that by allowing a small number of false positives in the detection process (which can be later on re-scanned using the classic algorithms implemented in ClamAV for greater accuracy), the throughput of ClamAV can be speed-up by 2 times, using twice less memory. As outlined in [PN12], the majority of the signatures (more exactly, 96.6% from the set of a little over 62,000 used in [PN12]) use only the last four types of constraints discussed in Section 3.2.1. It is also possible to transform the remaining signatures internally into a format similar to the four types mentioned earlier, for instance ?? can be replaced with {1-1}, * with {0,-} and regular expressions of type a(bkc)d can be transformed into two separate expressions, abd and acd. In the testbed used in [PN12], from 62,302 signatures extracted from the ClamAV database, only 6,831 had contained regular expressions (a fraction of 11%, less than the 16% mentioned in [CMJ+ 10], which could show that there is a decreasing trend in adding signatures containing regular expressions in the ClamAV database over the past few years). 3.3.2 GPU Gems 3 One of the first studies on how GPU acceleration can be used to achieve fast parallel virus signature matching was presented by Seamans et al. in [SA07]. Their approach has a small resemblance (when it comes to the storage methodology) to the one proposed in [MDWZ04], which is discussed in Section 3.2.2, and is similar in terms of parsing the input data with the original PFAC algorithm discussed in Section 3.3.5. Here, the GPU maintains a list of virus signatures by first matching 2 bytes from PhD Dissertation 45 Figure 3.8: The algorithm for achieving virus signature matching as proposed by Seamans et al. in [SA07]. the signature itself and using them as an index into a 64,000-entry array, with each index being mapped at most to one signature. The next 4 bytes of the signature are stored in the array. 
Each thread running on the GPU will read the two bytes from the input string, which will be used as an index in the array, it will read the next 4 bytes preceding the initial two assuring that this way every consecutive 2 bytes are read and compared to the database (Figure 3.8). The speedup achieved when using a GeForce 7800GTX GPu versus a Pentium 4 3GHz CPU averaged from 27× for the situation when no signatures match down to 11× as more matches are detected. While there are no discussions on the memory usage of the algorithm, the authors point out the lack of scalability of their approach (the algorithm supported at most 64,000 signatures, the size of the signature array in the GPU; furthermore, each signature had to have a unique sequence of 6 characters in it, including the two characters used as an index). A potential solution to the scalability problem is given with the mapping of the GPU array entries to multiple signatures, at the expense of additional processing at run-time. 3.3.3 GrAVity Vasiliadis et al. in [VI10] have proposed a GPU-based architecture for virus scanning called GrAVity, that benefits fully from GPU acceleration and operates some modifications on the Aho-Corasick tree commonly used when performing scanning. Their architecture is shown in Figure 3.9 and bases its grounds on an idea similar to that proposed in the AVFS antivirus file-system in [MDWZ04]: instead of storing the entire pattern in the automaton, only patterns with common prefixes are stored (the length of the prefix varies in their tests) and leaf nodes store the remaining of the patterns. PhD Dissertation 46 Figure 3.9: The GrAVity architecture as proposed by Vasiliadis et al. in [VI10]. Using this approach, certain false positives may appear in the matching process of the automaton, since the partial signatures may match certain files which are not threats or malware. In the experiments they performed, authors of GrAVity have observed that using a prefix of length 8 for the patterns generates less than 0.0001% false positives in a set of ordinary files in a computer system. One downside to the GrAVity approach is the high memory storage required for the automaton, since each node uses 1KB in their implementation, leading to almost 400 MB of memory required for storing the automaton comprised of 400,000 nodes (obtained for a prefix length of 14). Their experiments achieved a throughput in ClamAV that was 100 times higher on the GTX 295 used versus a Xeon E5520 CPU, and up to 550 times higher when using cached files. 3.3.4 MIDeA MIDeA is a network intrusion detection architecture using GPU acceleration proposed by Vasiliadis et al in [VPI11] that can benefit from network card, CPU and GPU parallelization, for increased scalability and performance. In their approach, there are no components needing to be serialized, meaning no component waits for another to execute or finish executing. The overall architecture is presented in Figure 3.10. The set of patterns used in MIDeA come from Snort [sno], a popular open-source network intrusion detection system. MIDeA is using a memory representation based on compact states tables. With such tables, each state only contains the elements between the lower and the upper limits. The authors call this method the AC-Compact algorithm, while AC-Full is used to describe the classic Aho-Corasick algorithm. 
In their case, the transition function pointing to the next state is represented by T[state, ch], where state is the current state and ch is the input character. The next state is therefore computed using the model shown in Figure 3.11, which reduces the throughput of the algorithm in exchange for lower memory consumption (AC-Compact achieves about 77.7% of the throughput of the AC-Full approach in their experiments). According to their experiments, for 8,192 rules and over 193,000 patterns, totalling a little over 1.7 million states in the automaton, the AC-Full algorithm used 890.46 MB of memory, while AC-Compact used only 24.18 MB. It is worth noting, however, that if the bandwidth of each row in the table (the distance between the first and the last non-zero value in the row) is sufficiently high, with a high dispersion of the elements throughout it, the memory consumption becomes significantly higher, in the worst case reaching that of the AC-Full algorithm, while also behaving slower.

Figure 3.10: The MIDeA architecture as proposed by Vasiliadis et al. in [VPI11].
Figure 3.11: The MIDeA storage format used by Vasiliadis et al. in [VPI11].

The original author of the banded-row storage format used in MIDeA is Marc Norton, a member of the Snort team, who described the algorithm in [Nor04]. Later on, Lee et al. in [LH08] suggested an improved format for banded-row storage. First, they improved the original format by removing unnecessary failure conditions existing in the state table. For instance, assuming the tree in Figure 3.12 and the alphabet {a, b, c, d, e, f, g, h}, the transition vector for state 8 would be (11, _, _, 9, _, _, _, _) (where _ represents a failure), but this could also be represented as (11, _, _, 9), since all the following state transitions are failures. However, given that b and c fall within the bandwidth, they are both replaced with the transition obtained from the failure function by applying the character (e.g. for c, state 8's failure points to the root, that is to state 0, and applying c to state 0, as would normally happen in case of a failure at state 8, leads once more to state 8; applying the same logic to the character b shows that its failure transition points to state 0 again, so the new, compressed vector becomes (4 0 11 0 8 9), where the first element, 4, represents the bandwidth length and the second element, 0, represents the starting index for this state). A small illustrative lookup over this compressed format is sketched below.

Figure 3.12: An example tree proposed by Lee et al. in [LH08].
Figure 3.13: The Parallel Failureless Aho-Corasick algorithm proposed by Lin et al. in [LTL+ 10].

The improved approach that Lee et al. present in [LH08] continues the previous idea of the banded-row format and focuses on compressing consecutive single-child states of the automaton which are not final, therefore reducing storage requirements even further. The data structures required for this operation are significant and the pre-processing time for the automaton increases due to the complexity of their construction: they use vectors for branched states and for single-child states (storing the pattern number, the position in the pattern and the distance to the end of the pattern for each state), a failure function for all states and an output function for final states.
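As an illustration of the banded-row idea discussed above, the sketch below shows how a next-state lookup could proceed over a compressed row such as the vector (4 0 11 0 8 9) computed for state 8; the structure layout and field names are assumptions made here for clarity and do not reproduce the exact formats of [Nor04] or [LH08].

/* Illustrative banded-row lookup (sketch only, not the exact Snort/MIDeA layout).
 * A row stores its bandwidth, the alphabet index of its first stored transition,
 * and then 'bandwidth' next-state values. */
typedef struct {
    unsigned short bandwidth;   /* number of stored transitions (4 in the example)  */
    unsigned short startIndex;  /* alphabet index of the first entry (0, i.e. 'a')  */
    const unsigned int *next;   /* the 'bandwidth' next-state values: 11, 0, 8, 9   */
} BandedRow;

/* Returns the next state for character ch; characters outside the band are treated
 * as failures, for which the caller would normally follow the failure function. */
static unsigned int bandedNextState(const BandedRow *row, unsigned char ch, int *fail)
{
    if (ch < row->startIndex || ch >= row->startIndex + row->bandwidth) {
        *fail = 1;   /* outside the stored band: no explicit transition kept */
        return 0;
    }
    *fail = 0;
    return row->next[ch - row->startIndex];
}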
Nevertheless, the approach of Lee et al. uses much less memory than the classic Aho-Corasick algorithm, with memory requirements under 2% of the original algorithm in their experiments, and is about 40% more efficient than the improved banded-row format presented earlier. While ClamAV still used less memory, given the depth-2 tree it employs, it lacks performance, since whenever a match occurs all patterns get checked sequentially for further exact matches, an operation which can be very time-consuming.

3.3.5 PFAC

Lin et al. have presented in [Liu12] a library called PFAC for performing GPU-based string matching, which works in a similar way (in terms of input data parsing) to the approach presented in Section 3.3.2. PFAC stands for Parallel Failureless Aho-Corasick and was first introduced by Lin et al. in [LTL+ 10]. Its basic idea is to remove the failure function from the Aho-Corasick state machine, allocate each byte position of the input string to an individual thread in the GPU, and traverse the state machine beginning at that location. Whenever a full match is found, the thread knows the exact position at which the match occurred. If, at a given state during the traversal, no valid transition exists, the thread terminates, reporting a mismatch at that position. An example of how the input string is parsed and how thread allocation works is shown in Figure 3.13. The time complexity of the initial PFAC algorithm is O(N × M), with a space complexity of O(256 × N) (N is the number of nodes in the automaton, while M is the longest pattern length). Some possible optimizations are suggested, such as using shared memory for reducing uncoalesced reads, or caching the state transition table into texture memory. In their experiments, the system throughput increased when using PFAC; however, one of its main drawbacks in real-world implementations is the lack of regular expression support in its native form. Additional synchronization routines and race-condition handling would have to be implemented in order to add regular expression support, even when using a naive approach such as splitting all regular expressions into their non-regex components and searching for each individually, with a final step verifying all partial matches for exact full matches.

Figure 3.14: The Parallel Failureless Aho-Corasick perfect hashing algorithm proposed by Lin et al. in [Liu12].

In order to efficiently store the automaton in GPU (device) memory, the authors use a perfect hashing architecture (Figure 3.14), which allows them to store only the valid transitions of the automaton in a hash table. They used the Slide-Left-then-Right First-Fit (SRFF) algorithm for creating a two-dimensional array of width w, where each key k gets placed at location (row, column) = (⌊k/w⌋, k mod w); rows are prioritized by the number of keys in them and then, in order of priority: a) each row is slid to the left so that its first key is aligned at the first column, and b) each row is slid to the right until each column holds only a single key, with the resulting offset of every row stored in an array RT. The architecture of PFAC as proposed in [Liu12] may be found in Figure 3.16. To improve the perfect-hashing procedure, in the same paper the authors propose a two-level perfect hashing approach, where at level 2 the same hashing function is applied but with different parameters.
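A schematic sketch of how such a hashed transition lookup might look is given below; the key construction (state × 256 + character), the table width w and the RT/entry layout are assumptions made here for illustration and are not the exact PFAC data structures.

/* Illustrative perfect-hash transition lookup in the spirit of the SRFF layout:
 * a key is split into a row (key / w) and a column (key mod w), the row offset
 * stored in RT locates the slot, and the stored key confirms the transition.   */
typedef struct {
    unsigned int key;        /* the (state, character) key stored in this slot */
    unsigned int nextState;  /* the state this transition leads to             */
} HashEntry;

static int lookupTransition(const HashEntry *table, const int *RT, unsigned int w,
                            unsigned int state, unsigned char ch, unsigned int *next)
{
    unsigned int key = state * 256u + ch;  /* assumed key construction          */
    unsigned int row = key / w;
    unsigned int col = key % w;            /* the modulo that is costly on GPUs */
    const HashEntry *e = &table[RT[row] + col];
    if (e->key != key)
        return 0;                          /* no valid transition stored here   */
    *next = e->nextState;
    return 1;
}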
The computational drawbacks here were related to the mod operation (which is slow on GPUs) and also to the fact that the initial perfect hashing approach does not consider the tree structure of the automaton. To solve the problem, given that each state has only a few valid transitions, they also propose a modulo-free perfect hashing approach and prove a theorem discussing the storage space requirements of their approach, stating that, for the prime p = 257 applied to their hash function:

S = Σ_{i=0..N} s_i ≤ min(21.4, 1 + 71 × (L − 1)/(N − 1)) × R,    (3.1)

where S is the total storage required, s_i is the number of memory locations for storing the valid transitions of state i, N is the number of states and R the number of transitions.

Figure 3.15: An example for the perfect hashing algorithm proposed by Lin et al. in [Liu12].
Figure 3.16: The architecture for the perfect hashing algorithm proposed by Lin et al. in [Liu12].

The run-time performance of the improved PFAC using perfect hashing was about the same as that of the initial PFAC algorithm, while using about 4 times less memory than an optimal implementation of PFAC (620 KB for 10,076 rules obtained from Snort [sno], totalling 126,776 states).

3.4 A Highly-Efficient Storage Model for Pattern Matching Automata

In [PN12] we have proposed a highly efficient memory storage architecture for pattern-matching automata, supporting both the Aho-Corasick and Commentz-Walter approaches. We present the model as follows, outlining its primary benefits and comparing it to other known approaches to date.

3.4.1 Motivation

The motivation for using exact signature matching when performing virus scanning lies in the history of malware and antivirus evolution over time: it is still feasible to detect and accurately match exact virus signatures, rather than going through more complicated steps (such as emulation or sandboxing, as discussed in Section 2.2.2) which may produce false positives. Furthermore, it is still the only accurate way of detecting most viral infections, excepting metamorphic viruses, which are completely resilient to signature matching and for which different heuristics need to be employed (in Chapter 5 we discuss a series of such heuristics). The downside of this approach, compared to behavioral analysis, is that exact signature matching is much slower than employing heuristics. As a consequence, GPU-based virus scanners have long been planned by both Kaspersky [Kas] and nVIDIA [nVI].

Operating system support for GPU-enabled processing has increased over the past few years. Microsoft introduced the Windows Driver Model (WDM) in [wdm], which specifies that if the device is non-responsive for more than 3 seconds by default, as a result of a driver bug or another type of problem, the driver will reset itself. This has direct implications on how CUDA-enabled processing is handled, since blocking the GPU with computations for more than 3 seconds would cause the device driver to reset itself and re-initialize, forcefully terminating all computations and therefore losing the results completely. Currently, when performing development on CUDA-enabled devices, this feature can be disabled; for end-users, however, disabling it is not an option, since the watchdog increases the stability of the system.
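Purely as an illustration (the kernel, grid configuration and batch size below are hypothetical, not taken from any of the cited systems), a scanning engine can stay below such a watchdog limit by submitting its work in smaller batches and returning control to the driver between launches:

#include <cuda_runtime.h>

/* Hypothetical scanning kernel; the real pattern-matching work would go here. */
__global__ void scanKernel(const unsigned char *data, size_t length)
{
    /* ... traverse the automaton over data[0..length) ... */
}

/* Launches the scan in batches sized so that each launch finishes well under
 * the (roughly 3 second) watchdog limit, synchronizing after every batch.    */
void scanInBatches(const unsigned char *d_data, size_t totalBytes)
{
    const size_t BATCH_BYTES = 64u * 1024u * 1024u;   /* assumed safe batch size */
    for (size_t offset = 0; offset < totalBytes; offset += BATCH_BYTES) {
        size_t len = totalBytes - offset;
        if (len > BATCH_BYTES)
            len = BATCH_BYTES;
        scanKernel<<<256, 128>>>(d_data + offset, len);
        cudaDeviceSynchronize();   /* give control back before the next batch */
    }
}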
Furthermore, realtime architectures for virus scanning which work (or will work) with GPU acceleration will need to consider this limitation implicitly and implement a watchdog mechanism that would cease control of the GPU to the operating system at a time interval of less than 3 seconds, so that it can later on resume the processing and finish the tasks it was assigned. PhD Dissertation 3.4.2 53 Towards a Hybrid CPU/GPU Efficient Storage Architecture The computational processing power of a GPU can be easily adapted to a large variety of algorithms, in many different fields of research. The computations on the GPU however can only be performed using the GPU’s memory. In this case, data from the host (CPU) memory must be transferred to device (GPU) memory upfront, and the transfer needs to be completed before the computations begin, to ensure that the results are accurate and correct. Therefore, the basic operations that need to be performed in most hybrid architectures are the following: 1. (Host-side) Prepare the data to be processed by the device in host memory 2. (Host-side) Transfer the data from host memory to device memory 3. (Host-side) Initiate device computations 4. (Device-side) Perform computations using the transferred data 5. (Host-side) Wait for the results to be available in device memory 6. (Host-side) Transfer results from device memory to host memory 7. (Host-side) Interpret results and provide output to user While the computational steps themselves have variable run-times, the memory transfers are usually the bottleneck of such architectures. In fact, in several architectures and models existing today, such memory transfers are not given the proper attention or focus. Let’s exemplify by referring to the memory allocation process, both in the host and device memory: whenever employing malloc() type of calls in the CPU or GPU memory, the amount of memory requested is allocated and a pointer address is returned. However, repeating the memory allocation procedure a few times will generate different pointer addresses, at diverse (apparently sparse) memory locations (Figure 3.17) - this is caused by the internal memory management routines, which allocate memory so that future memory allocations or re-allocations can be successfully accomplished in the free spaces left behind, or memory areas expanded as needed by the program. While this has immediate benefits when using with memory-prone algorithms, which need to constantly resize their memory areas dynamically, it has severe drawbacks when using large data structures which require a large amount of memory, since those memory requirements can sometimes be a few times bigger than the actual memory being used by the program. Another problem related to memory allocation is step 2 above: transferring memory from host memory to device memory would require going through each memory location from host memory, reading it, allocating memory in the device and copying PhD Dissertation 54 Figure 3.17: a) Sparse, consecutive memory allocations and gaps that appear in GPU memory; b) A single, sequentially allocated memory block in the GPU device. it there. For a pattern matching automaton with a very high number of nodes (e.g. in the order of millions or more), this would require many consecutive memory allocation calls in device memory, which would again produce different pointers at sparse memory locations, as pointed out in Figure 3.17. 
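The contrast between the two allocation strategies can be sketched as follows; the node structure, counts and function names are placeholders used only for illustration.

#include <cuda_runtime.h>

typedef struct {                  /* placeholder automaton node */
    unsigned int failOffset;
    unsigned int childOffset;
    unsigned int bitmap[8];
} Node;

/* (a) One allocation and one transfer per node: scattered device pointers and
 *     one PCI-Express transaction for every node of the automaton.            */
void copyPerNode(const Node *hostNodes, Node **devNodePtrs, int n)
{
    for (int i = 0; i < n; i++) {
        cudaMalloc((void **)&devNodePtrs[i], sizeof(Node));
        cudaMemcpy(devNodePtrs[i], &hostNodes[i], sizeof(Node),
                   cudaMemcpyHostToDevice);
    }
}

/* (b) The serialized-stack alternative: a single allocation and a single copy
 *     move the entire automaton to the device in one transfer.                */
Node *copyWholeStack(const Node *hostStack, int n)
{
    Node *devStack = NULL;
    cudaMalloc((void **)&devStack, (size_t)n * sizeof(Node));
    cudaMemcpy(devStack, hostStack, (size_t)n * sizeof(Node),
               cudaMemcpyHostToDevice);
    return devStack;
}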
Given the low amount of memory resources of a GPU compared to a CPU (a high-end GPU still carries memory in the order of only a few GB, while consumer CPUs commonly work with 8 GB and up to 32 GB of RAM on a daily basis), such repeated per-node allocations would quickly fill the device memory with unnecessary overhead, leaving little room for additional allocations and eventually causing the program to terminate forcefully when an allocation fails. The PCI-Express standard [pci] ensures top transfer speeds of 8 gigatransfers per second (8 GT/s), which offers sufficient bandwidth for transferring large chunks of data between device and host memory. When dealing with pattern matching automata, however, developers aim for low memory requirements (a small structure for each node of the automaton) given the high number of nodes, and the bottleneck becomes obvious when attempting to transfer millions of nodes (of only a few bytes or kilobytes each) as in step 2 above. In one of our simple tests, performed on only 350,000 nodes, each occupying a little over 40 bytes, copying the nodes individually from the host to the device took almost 3 hours.

The model we are presenting does not suffer from these drawbacks: instead, it is able to represent the entire automaton as one consecutive memory area, using a series of compression and processing techniques, which allows it to be transferred all at once from CPU to GPU memory, fully benefitting from the high throughput offered by the PCI-Express interface in contemporary computers. Furthermore, the storage space it requires is a constant amount per state, and therefore linear in S, the total number of states (nodes) in the automaton.

3.4.3 The Storage Model

In order to implement our model, we used a testing bed of 62,302 virus signatures from the March 2012 version of the ClamAV [cla12] database. In order to achieve a sequential representation of our pattern matching automaton, we have used a stack-based layout, replacing the pointers inside the code (usually pointing to widely spread memory locations) with offsets in the stack.

Figure 3.18: The bitmapped node implementation of a tree-based structure.
Figure 3.19: The stack-based memory model (with the offsets vector) used for efficient storage of the pattern-matching automaton in the GPU.

We have proposed two variants: the first uses a separate vector of offsets (Figure 3.19), while the second discards the vector entirely. Our approach is a serialized representation of the tree, obtained by flattening it so that all children of a node are kept at consecutive offsets. By ensuring that all children of a node are stored in a consecutive area, only the starting offset of the first child needs to be kept in the parent node. In order to reduce the size of the node structure, we used a bitmapped representation as opposed to the classic state-table approach. In a bitmap representation (Figure 3.18), |Σ| bits are used for representing the possible transitions, one bit for each character of the alphabet Σ. Additionally, a pointer to a linked list of locations is stored in the node, and the linked list only stores the valid transitions, in lexicographical order. In order to determine the location of the child for character i of the alphabet, we compute the popcount [Knu09] of the first i positions in the bitmap, which gives the index into the linked list where the desired location is found. The popcount value represents the number of 1 bits in a binary string.
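A minimal sketch of this bitmapped child lookup, assuming a 256-character alphabet stored as eight 32-bit words (field and function names are illustrative; __popc() would be the device-side equivalent of the host intrinsic used here):

/* Returns the index of the child for character c in the node's consecutive list
 * of children, i.e. the popcount of the bitmap positions that precede c, or -1
 * if the node has no transition for c.                                          */
static int childIndex(const unsigned int bitmap[8], unsigned char c)
{
    int word = c >> 5;                        /* which 32-bit word holds bit c  */
    int bit  = c & 31;                        /* position of c inside that word */
    if (!(bitmap[word] & (1u << bit)))
        return -1;                            /* no valid transition for c      */
    int idx = 0;
    for (int i = 0; i < word; i++)
        idx += __builtin_popcount(bitmap[i]);
    idx += __builtin_popcount(bitmap[word] & ((1u << bit) - 1u));
    return idx;
}

The child itself is then found at the offset recorded for the node plus this index, as described in the construction algorithms that follow.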
The algorithm for the construction of the stack (along with the offsets array) from the host's memory representation is given below:

• Initialization
1: topOfNodeStack ← 1 (top of node stack)
2: currentPosition ← 0 (position in the stack)
3: topOfOffsetStack ← 0 (top of the stack of offsets)
4: node (the currently processed node)

• function addNode(node, currentPosition)
1: nodeStack[currentPosition] ← node
2: nodeStack[currentPosition].offset ← topOfOffsetStack
3: add to hash (key ← node, value ← currentPosition)
4: pc ← popCount(node.bitmap)
5: for i ← 0 to pc − 1 do
6:   offsetsStack[topOfOffsetStack] ← topOfNodeStack + i
7:   topOfOffsetStack ← topOfOffsetStack + 1
8: end for
9: old ← topOfNodeStack
10: topOfNodeStack ← topOfNodeStack + pc
11: for i ← 0 to pc − 1 do
12:   addNode(node.child[i], old + i)
13: end for

The offsets array can, however, be removed and the offset pointer stored inside the node structure itself (Figure 3.20). As a consequence, the child at index i of a node n, which in the first variant is reached through the reference nodes[offsets[n->offset+i]], becomes directly accessible as nodes[n->offset+i]. The algorithm for performing the construction using this approach is presented below:

Figure 3.20: The stack-based memory model (without the offsets vector) used for efficient storage of the pattern-matching automaton in the GPU.

• Initialization
1: topOfNodeStack ← 1 (top of node stack)
2: currentPosition ← 0 (position in the stack)
3: node (the currently processed node)

• function addNode(node, currentPosition)
1: nodeStack[currentPosition] ← node
2: nodeStack[currentPosition].offset ← topOfNodeStack
3: add in hash (key ← node, value ← currentPosition)
4: pc ← popCount(node.bitmap)
5: old ← topOfNodeStack
6: topOfNodeStack ← topOfNodeStack + pc
7: for i ← 0 to pc − 1 do
8:   addNode(node.child[i], old + i)
9: end for

If the node structures contain other pointers to different nodes (for instance, for the failure function in the Aho-Corasick automaton), the approach can be extended using a hash table that stores all the nodes of the automaton together with their corresponding offset in the serialized stack representation. A second preprocessing step then parses the tree (in pre-order, although any other traversal can be used) and replaces the remaining pointers in the stack with offsets, by locating them in the hash.

The storage requirements for the Aho-Corasick automaton, using an alphabet Σ of size |Σ| with a total of S states (nodes) and bitmapped nodes, amount to S × (2 log2 S + |Σ|) bits. The calculation takes into account that, for each node, a failure offset occupying log2 S bits is needed to indicate the new state in case of a mismatch; the children offset pointer also requires log2 S bits and is used to determine the starting offset of the current node's children (it is important to note that the children of a node do not need to immediately follow the node itself in the representation, only to be stored consecutively among themselves); finally, the bitmapped representation of the alphabet occupies |Σ| bits and is the one used when performing popcount computations.
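A possible C-style layout for such a serialized node is sketched below; the field widths are rounded to machine words for readability, whereas in the model above the two offsets only need log2 S bits each.

/* One serialized Aho-Corasick node in the stack-based representation. */
typedef struct {
    unsigned int failOffset;   /* offset of the failure state in the stack            */
    unsigned int childOffset;  /* offset of the first child; siblings are consecutive */
    unsigned int bitmap[8];    /* 256-bit transition bitmap (|Sigma| = 256)           */
} SerializedNode;

/* Following a transition for character c from the node at offset cur (illustrative):
 *   idx  = childIndex(nodes[cur].bitmap, c);          // popcount-based index
 *   next = (idx < 0) ? nodes[cur].failOffset          // mismatch: follow the failure
 *                    : nodes[cur].childOffset + idx;  // match: jump to the child
 */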
Compared to the PFAC approach in [Liu12], where the AC-Compact representation used by the authors occupied 24.18 MB for a total of 1,703,023 states (with about 15 bytes per state, assuming - since in the paper it is not specified - an alphabet size of | Σ |= 32), our approach would require a storage of just 2 × 21 + 32 = 74 bits per node, or 10 bytes per node, resulting PhD Dissertation 58 Figure 3.21: Our constraint-based Aho-Corasick automaton for the set of patterns {00FA0F{4-8}*02FF1F*ABCCD0, 00AE{-12}01CE} [PN12]. Dashed lines represent links between leaves and failures have been omitted. in a total of 15.02 MB of memory consumed - which is 1.6 times less memory compared to the PFAC storage methodology. Comparing our methodology with the one from GrAVity in Section 3.3.3, for 352,921 nodes, our storage model requires 18.63 MB of memory, while GrAVity with the inefficient storage of 1 KB per node, would use about 345 MB, almost 19 times more memory. 3.4.4 The Constraint-Based Aho-Corasick Algorithm In order to accommodate for the different types of regular expressions which the patterns from the ClamAV database contain, as discussed in Section 3.2.1, we have extended the Aho-Corasick automaton to support the last four most common types of constraints, which represent more than 96.5% from the entire ClamAV database [Pun13]. Each regular expression in the ClamAV signature database is being trimmed into its subexpression parts (none of the subexpressions will contain any regular expressions, they will be plain patterns only). Our algorithm adds pointers from the leaf of a subexpression at position i of such type to leaf of the previous subexpression at position i-1 in the pattern. We also store, for each leaf, the position of partial matches for it in the input data. Whenever a leaf is encountered at position j, the leaf of the subexpression at position j-1 is checked for a match; if a match is found, and if the constraint is verified, the current leaf is being marked as a match as well, and a PhD Dissertation 59 partial match is added for this leaf also. If a leaf is belonging to the last subexpression of a pattern, a full match has been found and is reported (Figure 3.21). 3.4.5 Towards a Parallel Implementation for the GPU Applying pattern matching in the GPU relies on heavy parallelism, which requires changes or adaptations of the classic algorithms in order to be used in data-parallel, task-parallel or hybrid-parallel manners. For the Aho-Corasick and Commentz-Walter automata, one of the simplest parallelization method is a data-parallel decomposition of the input string, and its applicability to the algorithm itself. One of the first attempts to parallelize the Aho-Corasick algorithm was proposed by Intel in 2007 [XW07], when Xu et al. had used Intel threading and profiling tools to implement a data-parallel version. They were able, as a consequence, to improve the speed of the algorithm by 1.714 times on a dual-core system. Data-parallel decomposition refers to splitting the input data into several, smaller chunks, and scanning each of them separately in its own thread. However, for pattern matching algorithms this could pose problems when scanning at the border of two different chunks of data, because the match might be split between those chunks (Figure 3.22). In order to solve the situation, several techniques could be employed. 
One of the simplest would be extending each chunk of data with D-1 bytes, where D is the length of the longest pattern in the dictionary, as discussed in our approach from [Pun09] and by Zha et al. in [Zha]. This approach may produce duplicate results, especially if there are patterns of smaller length which exist in the first D-1 bytes of the next chunk: for instance in the case of acad from Figure 3.22, if the pattern ad also exists in the dictionary, it would be found by both thread i and also by thread i+1, so it would be reported twice. A possible solution here would be to use a hash table to store match positions. Another approach to avoid border-line issues with data-parallel pattern matching would be to store, for each node, its depth d, and when reaching the end of the chunk to extend the search by D-d characters, however this approach is also prone to duplicate results and would also increase the storage requirements for the automata, by a significant amount when dealing with large structures. Throughout the implementations in this thesis, we have always followed the first one presented above, to avoid introducing additional memory overhead. 3.4.6 Performance Evaluation We have implemented the Aho-Corasick, Commentz-Walter and an extended version of the Aho-Corasick algorithm (presented in Section 3.4.4). The constraint-based automaton uses an additional array, called links, which is used for indicating the leaf of a previous subexpression in the pattern. Construction of the automaton is not the PhD Dissertation 60 Figure 3.22: A situation where the match acad is split over two different chunks of data being processed in parallel, with none of the chunks matching the string. Figure 3.23: a) Aho-Corasick compressed nodes; b) Aho-Corasick normal nodes in the constraint-based automaton supporting RegEx signatures; c) Commentz-Walter normal nodes; d) Commentz-Walter compressed nodes. subject of this performance evaluation, since the automaton is usually delivered in real-world environments as alread being pre-processed or constructed, however it has been thoroughly discussed in Chapter 4. Given that the automaton is now stored as a continuous block of memory, the transfer between the host memory and the device memory is instant. The Aho-Corasick automaton has been subject to numerous improvements in terms of efficiency over the past years. Some of the early attempts belonged to Tuck et al. in [TSCV], which had presented a path compression approach and applied it to the Snort [sno] network intrusion detection system. The model had been further improved to work with bitmapped nodes by Zha et al. in [ZS08]. Later on, Liu et al. in [LYLT09] had proposed a table compression approach for the same automaton, by eliminating equivalent rows in the state table and grouping state rows into equivalent classes. In our approach, we implement CPU-based version of the Aho-Corasick, CommentzWalter and Wu-Manber algorithms, and GPU versions as well. For the CPU implementations we used path compression and bitmapped nodes (as shown in Figure 3.23), and for the GPU implementation we used the stack-based representation discussed in Section 3.4.3. A data parallel approach was employed for performing scanning of the input data. PhD Dissertation 61 Figure 3.24: The signatures and their distribution when performing single-core and multi-core scanning on both the CPU and the GPU. Figure 3.25: The throughput of the Aho-Corasick automaton when using a single-core CPU implementation. 
Figure 3.26: The throughput of the Commentz-Walter automaton when using a single-core CPU implementation.

The data structures used in our implementation from [PN12] for benchmarking purposes are presented in Figure 3.23. Although it would be possible to implement path compression in the GPU-based implementation as well, the memory relocation routines in device memory are much slower compared to host memory: equivalents of functions such as memmove() or realloc() in host memory, when applied to device memory, have a significant impact on performance. By avoiding path compression, coding complexity is also greatly reduced, which eventually results in better performance and better error resilience.

Given that the Commentz-Walter algorithm relies heavily, as discussed in Section 2.1.2.2, on the length of the shortest keyword in the dictionary, wmin, we have trimmed the signatures chosen from the ClamAV database (62,302 in total) and grouped them according to several different values of wmin, namely {16, 32, 64, 96, 128}. Some of the signatures in the ClamAV database contain regular expressions, which is why we counted them separately for reference, with the results presented in Figure 3.24.

3.4.6.1 The CPU Implementation

We then measured the performance of the Aho-Corasick, Commentz-Walter and Wu-Manber algorithms based on the different values chosen for wmin, and the results are shown in Figures 3.25, 3.26 and 3.27, respectively. The results have shown the following:

• The Commentz-Walter algorithm outperformed the Aho-Corasick approach by a minimum of more than 14% (for wmin = 16) and up to more than 44% (for wmin = 128), as expected, given that higher wmin values eventually translate into longer jumps in the input data.

• The Wu-Manber algorithm outperformed the Commentz-Walter approach by a factor ranging from more than 6.8× (for wmin = 64) to more than 53× (for wmin = 32). The highest throughput speed-ups were obtained by the Wu-Manber implementation for wmin = 32 (53× faster) and wmin = 128 (27× faster).

Figure 3.27: The throughput of the Wu-Manber automaton when using a single-core CPU implementation.

Based on the observations made by Lin et al. in [LLL11], where an improved version of Wu-Manber called Backward Hashing was used for scanning simple signature patterns and Aho-Corasick was used for scanning regular-expression-based ones, and given the observation made in GrAVity (Section 3.3.3), where the authors used a limited tree depth of 8 (which achieved a very low number of false positives), we have built a multi-threaded scanner aiming to perform a similar operation, while also aiming to maximize throughput and minimize memory requirements when running on a GPU architecture. Such an approach would be the most feasible to implement in a real-time antivirus scanning engine, where low memory usage and high bandwidth are most desirable.

Figure 3.28: The throughput of the different hybrid parallel approaches tested using our storage model.

Similar to Lin et al. in [LLL11], in our approach we have combined, using our own storage model, the three most common multiple pattern matching algorithms, Aho-Corasick, Commentz-Walter and Wu-Manber, into a hybrid-parallel, multi-threaded scanner performing regular expression scanning in one thread and regular virus pattern scanning in the other. We implemented three variants for performing the scanning
and measured the overall throughput of the system, using the following combinations:

• One thread scanning regular expressions using Aho-Corasick and the other scanning regular virus signatures using Aho-Corasick as well (depicted as AC-AC).

• One thread scanning regular expressions using Aho-Corasick and the other scanning regular virus signatures using Commentz-Walter (depicted as AC-CW).

• One thread scanning regular expressions using Aho-Corasick and the other scanning regular virus signatures using Wu-Manber (depicted as AC-WM).

• We tried out combinations using 1, 2 and 4 threads for both scanners, using a data-parallel scanning approach for each thread (for instance, for 4 threads per scanner, we would have 4 threads performing data-parallel scanning of regular expressions using Aho-Corasick, and another 4 threads performing data-parallel scanning of regular virus signatures using one of the other algorithms).

The results obtained for AC-WM and AC-CW were identical, which is no surprise given that the Wu-Manber bandwidth is much higher than that of Aho-Corasick, as shown earlier: although the regular signature scanning process finishes first, it must still wait for the completion of the regular expression scanning, which becomes the bottleneck of the system. We have therefore not included the AC-WM results in the graphical representation shown in Figure 3.28, for better clarity.

Finally, we measured the memory storage requirements for all hybrid implementations, and the results are presented in Figure 3.29. The AC-AC approach produced almost constant memory usage throughout the test, which is normal given that the same structures were used for the same number of nodes in the automaton. The AC-WM approach had the highest memory usage, given the hash-table structures implicitly required by the Wu-Manber algorithm. For the Wu-Manber implementation specifically, we used a block size of 3 (as recommended by the authors in [WM94] for a large number of patterns, to avoid collisions), which results in 2^24 = 16.7 million elements in the hash table. For storing the two elements that comprise the hash table described in Section 2.1.2.3, considering 32 or 64 bits per integer value (we used a 64-bit integer in our implementation), a total of 64 MB or 128 MB of memory is used, not counting any additional structures.

Figure 3.29: The memory usage measured for our hybrid parallel implementations.

3.4.6.2 The GPU Implementation

We implemented the same algorithms as in the CPU implementation from Section 3.4.6.1 on our test GPU and started from the same premise as GrAVity, by limiting the depth of the tree. We tested our algorithms with three different depths, of 8, 12 and 16. We started by implementing the classic Aho-Corasick algorithm for the three different depths, and one of the first concerns was memory usage and the number of states/nodes in the automaton. Figure 3.30 shows that there were about 350,000 nodes at a depth of 8, close to 600,000 nodes for a depth of 12 and over 800,000 nodes for a depth of 16. Their corresponding memory usage using our proposed methodology is shown in Figure 3.31.

Figure 3.30: The state/node count for the depth-limited Aho-Corasick automaton.
Figure 3.31: The memory usage for the depth-limited Aho-Corasick automaton.
Figure 3.32: The bandwidth performance for the depth-limited Aho-Corasick automaton.
Figure 3.33: The bandwidth performance for the depth-limited Aho-Corasick automaton. The bandwidth performance of the Aho-Corasick GPU implementation is shown in Figure 3.32. In the experiment performed, the GPU implementation had been about 38 times faster than our single-core CPU implementation of the same algorithm. We also tested the bandwidth performance of our proposed extended Aho-Corasick algorithm (Figure 3.33) and the throughput was about half of the original GPU implementation of the automaton. Given that in the previous attempt the best memory usage pattern was obtained when implementing a hybrid-parallel combination of the Aho-Corasick and Commentz-Walter algorithms (AC-CW), for the GPU implementation we focused on the same ones, however in our situation, with the trees being depth-limited (we imposed a permanent limit of 8 to the depth of both trees this time), we recomputed the total number of nodes in the automata and the memory usage of both combined. Results are shown in Figures 3.34 and 3.35. In the hybrid implementation, one kernel (kernel 1) in the GPU was running the regular expression scanner, and the PhD Dissertation 68 Figure 3.34: The total number of nodes for the both types of the automata used. Figure 3.35: The total memory usage for the hybrid-parallel GPU implementation of AC-CW. other (kernel 2) was running the regular virus signature scanner. The memory usage was the lowest for wmin = 32, being about 14.75 MB. In order to test the bandwidth, we have chosen to test with three different scenarios, each employing a different number of threads to run on the GPU - the first scenario used 128K threads, the second used 512K and the third used 1,024K threads (over 1 million threads in total). We performed a test where the AC-CW approach was used with a default Aho-Corasick implementation (not supporting regular expression scanning, Figure 3.36), and a second test where we had used the extended Aho-Corasick implementation in the first kernel (Figure 3.37). For 1 million threads, for AC-CW in Figure 3.36 the bandwidth reached 1,420Mbps, almost half the SATA-II maximum theoretical throughput, and about 34 times faster than the equivalent CPU implementation for a value of wmin = 128. When using the extended Aho-Corasick though, performance dropped from 50% for wmin = 16 to about 37% when using 1,024K threads and wmin = 128 (for the rest of the values, the performance was about half as the standard Aho-Corasick, similar to the CPU PhD Dissertation 69 Figure 3.36: The bandwidth performance of the AC-CW implementation on the GPU. Figure 3.37: The bandwidth performance of the extended Aho-Corasick + CommentzWalter implementation on the GPU. implementation). 3.5 Summary This chapter focused on describing the particular challenges involved when performing virus scanning and malicious code detection, starting with a short motivation on why we consider this is still an important subject today, continuing with the ClamAV database signature format and discussing a series of known and recent CPU and GPUaccelerated frameworks to virus scanning, such as Split-Screen, GrAVity, MIDeA or PFAC. 
Next, we focused on the challenges involved in achieving efficient and realtime virus scanning in modern systems today, with the final aim being to formalize, implement and test an efficient storage model which may be successfully used in achieving higher bandwidth performance for transfers between the host and the device and vice-versa, as well studying which hybrid-parallel memory models are the most PhD Dissertation 70 efficient for performing virus scanning, both on the CPU and on the GPU. In order to build an efficient storage model, we had to take into consideration several facts representing the current state-of-the-art in virus scanning: • we start with the open-source ClamAV antivirus, and continue by analyzing its memory storage format, algorithms employed and the virus signature formats used in describing malware; • we discuss the challenges of using GPU-based acceleration and the limitations imposed by the resources of the GPU when employing it in a hybrid system; • we present some of the most recent frameworks used in malicious code detection and network intrusion detection systems, analyzing their performance and focusing on their efficiency compared to existing solutions; Based on the above, we conclude that our proposed storage model offers additional benefits compared to other existing solutions, such as: • it offers the maximum possible bandwidth performance for transfers between the host memory and the device memory, and vice-versa; • it offers an efficient storage model for completely storing the entire automaton into memory, allowing it to be parsed entirely through simple offset references; • it imposes no limitation on the order of nodes, its only restriction being related to the consecutive storage of all immediate descendants of a given node; • it offers better performance than other memory models, by always employing constant storage space, as discussed in Section 3.4.3; • it bridges the gap between CPU and GPU-based hybrid algorithms involving pattern matching automaton, as discussed in Chapter 4; • it opens up new research opportunities for building self-modifying automaton which can adjust their size and structure based on dynamic demands; We have proposed and tested a hybrid-parallel model which employs multi-threaded scanning using different variants of multiple pattern matching algorithms, and have concluded that the best performance is obtained using our storage model for a hybridparallel combination of the Aho-Corasick and Commentz-Walter algorithms, with the first scanning regular expressions in patterns and the second scanning for regular virus signatures. We also tested the performance of such a model on both the CPU and the GPU, and issued a few remarks about their performance and behavior at runtime. The summary of the work conducted herein shows that our approach is a highly viable method for implementing and storing the most common types of pattern matching automaton used in both network intrusion detection systems and virus scanners, offering flexibility, high performance and efficiency in the implementation. Chapter 4 An Efficient Model for Hybrid CPU/GPU-Accelerated Construction of Very Large Automata Contents 4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.2 A Hybrid GPU-Accelerated Architecture . . . . . . . . . 73 4.3 The Storage Format . . . . . . . . . . . . . . . . . . . . . . 74 4.4 4.5 4.6 4.7 4.3.1 The Aho-Corasick Automaton . . . . . . . . . . . . . . . . 75 4.3.2 The Commentz-Walter Automaton - Algorithm A . . 
. . . 76 4.3.3 The Commentz-Walter Automaton - Algorithm B . . . . . 76 Optimizing the Performance of the Pre-Processing Stage for the Commentz-Walter Automaton . . . . . . . . . . . 77 Achieving Automaton Construction Parallelism . . . . . 78 4.5.1 The Aho-Corasick Automaton . . . . . . . . . . . . . . . . 78 4.5.2 The Commentz-Walter Automaton - Algorithm A . . . . . 79 4.5.3 The Commentz-Walter Automaton - Algorithm B . . . . . 79 Performance Evaluation . . . . . . . . . . . . . . . . . . . 82 4.6.1 The Aho-Corasick Automaton . . . . . . . . . . . . . . . . 82 4.6.2 The Commentz-Walter Automaton - Algorithm A . . . . . 83 4.6.3 The Commentz-Walter Automaton - Algorithm B . . . . . 88 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Several research areas work with large or very large pattern matching automata. One of the most conclusive examples is the virus signature matching process, where 71 PhD Dissertation 72 the number of viral signatures is in the order of tens of thousands or more, with the corresponding automata reaching a high number number of nodes, in the order of millions or tens of millions. While the algorithms themselves only require the automata to be built when running, without the need to construct them on-the-fly, accelerating their construction could have important benefits in the long run, allowing not only self-adjustable automata to be built, but also permitting implementation of such automata on different hardware platforms and devices. In this chapter we are proposing a very fast, low-storage and innovative approach to efficiently constructing pattern matching automata using a hybrid CPU/GPUaccelerated architecture, and we are proposing, to the best of our knowledge, the first parallel model to allow such a construction, that can fully benefit from the best of both computational units. Heavy applications of pattern matching automata may be found in the intrusion detection system industry. Both network IDS [DPP07, VAP+ 08, Yun12, MK06, CY06] and virus scanners [Lee07, Pun09, VI10, PN12, Pun13] make use of these data structures for matching suspicious data packets or finding viral infections. Performance in such situations is critical for building real-time environments and offering responsive solutions which do not slow down the system considerably. The GPUindustry had made processing power more accessible to the end-consumer, allowing more algorithms to be parallelized and rewritten so they fully benefit from the GPU acceleration. Besides the IDS applications of pattern matching, there are numerous others in the field of bioinformatics [Dud06, Tum10, BS11], as well as natural language processing [Fre06, ZSS11] and many others. 4.1 Background While there are many studies on improving the performance of the pattern matching automata discussed, there are a very limited number of such studies dealing with the construction performance of these algorithms and ways of optimizing that. The reason why this happens is caused by the run-time performance of both the AhoCorasick and Commentz-Walter algorithms, which is completely separated from the construction phase and therefore it is assumed that the construction is performed upfront. While this is true and applicable, the uncompromising fact remains that for very large automata, the construction phase remains quite important due to the tremendous amount of computational power involved when dealing with a very large number of nodes in the tree. Kouzinopoulos et al. 
in [KM11] have tempered with the preprocessing performance of pattern matching algorithms. They had used a binary alphabet and an alphabet of size 8, which is very low and not applicable to many fields of research nowadays (by comparison, the virus signature matching uses an alphabet of size 256). PhD Dissertation 73 Their keywords, in total ranging from 10,000 to 100,000, were having lengths of 8 to 32 and were taken from the E.coli genome, the SWISS-PROT Amino Acid sequence database, the FASTA Amino Acid (FAA) and FASTA Nucleidic Acid (FNA) sequences of the A-thaliana genome. One important observation made by the authors as a result of the test, regarding the Commentz-Walter algorithm, was related to the increase by 50 times in the preprocessing time when larger keyword sets were used. In practice, the Commentz-Walter algorithm preprocessing stage is very complex, as discussed in Section 2.1.2.2. One important direct application of the construction phase performance is related to the update procedure in real-time antivirus scanners: a common approach to virus vendors when issuing updates to the virus signature database is to remove the currently loaded automaton from memory and re-map it from disk, where it is already pre-computed and ready to use. This does not require the structures to be reconstructed, possibly only a re-mapping of the pointers if needed, and eludes the preprocessing stage therefore entirely. However, whenever preparing the disk-mapped version of the automaton, antivirus vendors still need to rebuild the automaton on every change performed to it, and the pre-processing stage can no longer be ignored. Furthermore, as we have discovered in our experiments using the ClamAV [cla12] virus signature database, the construction of a large Commentz-Walter automata can take hours even on some of the fastest consumer CPUs today, which makes the update procedure of such automata very difficult and unfeasible to use in real-world scenarios. 4.2 A Hybrid GPU-Accelerated Architecture In [PN12], also discussed in Chapter 3, we have proposed an efficient storage mechanism for the two most common types of pattern matching automata, Aho-Corasick [AC75] and Commentz-Walter [CW79]. Using this approach, the entire automaton can be stored efficiently in the memory of the device, where it can be parsed allowing multiple GPU threads to access it as needed. The approach presented as follows offers a model for the fast construction of the Aho-Corasick automaton, and two algorithms (noted algorithm A and algorithm B ) for the construction of the Commentz-Walter automaton. We are going to test its performance by using the ClamAV [cla12] virus database and a set of 62,302 virus signatures from the version of the database dated March, 2012. Figure 4.1 shows the overall architecture of the system along with the steps involved. The basic idea is to combine the workload performed on the CPU with the computational resources offered by the GPU in an attempt to increase the efficiency of the overall system. PhD Dissertation 74 Figure 4.1: Our hybrid CPU/GPU approach to the fast construction of the two types of automata. Steps marked (*) are only applicable to the Commentz-Walter automaton. 4.3 The Storage Format Since both types of the automata rely on the trie-tree concept, their nodes will have common elements. 
In our case, these elements are: a bitmap of size | Σ |, where Σ is the input alphabet (in our situation, the alphabet size is 256), the offset pointing to the start location in the stack of the children of the current node (which may be 0, or any other value to that matter, if the popcount of the bitmap is zero). At this stage, two different approaches may be used for performing suffix computations: the first would store the patterns in GPU memory, and additionally storing at each node the pattern ID, as in Figure 4.2; a second approach would avoid storing the patterns in the device memory, but requires that we store, for each node, an offset to its parent (which would occupy 4 bytes if using a 32-bit unsigned integer value) and an additional character (occupying 1 byte), specifying the last matched character that led to the current node (e.g. for a node v of depth d belonging to pattern ID pid , i is the character at position v in the pattern patterns[pid ] ). Both approaches used do not affect the runtime performance, but they affect the storage space required for the automaton, and are suitable each in different situations: for instance, the second approach is only preferable when Ω > N × (sizeof (unsigned int) + 1), where Ω is the total length of all patterns and N is the number of nodes in the automaton; if the inequality is not verified, the first approach is the preferable one, as it uses less memory. PhD Dissertation 75 Figure 4.2: Our approach to storing the automata in memory. Elements marked with [*] correspond to the Aho-Corasick automata only, while those marked (*) correspond to the Commentz-Walter automata only. Having P patterns, we are creating an array called patternIndexes of size P+1 which stores, the each pattern pi , the starting offset of the pattern in the stack of nodes we have built with our model (Figure 4.3). 4.3.1 The Aho-Corasick Automaton For the Aho-Corasick automaton, we are using an extended version of the approach we proposed in [PN12] and we are storing, in addition to the fields in the original approach, the pattern ID number and the depth of the node in question. This information is used for parsing the patterns in Figure 4.3 in order to perform failure Figure 4.3: Storing patterns in the GPU (device) memory for both types of automata. PhD Dissertation 76 function computation. 4.3.2 The Commentz-Walter Automaton - Algorithm A The storage approach for the Commentz-Walter algorithm A is presented in Figure 4.2. Starting with the model we proposed in [PN12], we are using an additional stack for storing only the leaves of the tree built, which will later on be used to help compute the shift2 distances faster. 
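Because the suffix and shift computations repeatedly need the characters of the pattern a node belongs to, the patternIndexes layout of Figure 4.3 makes them cheap to recover; the small device-side helper below is only a sketch of this access pattern (the names mirror the pseudocode used later in this chapter).

/* Recovers the character associated with a node in the jump-distance computations:
 * patternIndexes[patternID] is the start of the pattern in the flat 'patterns'
 * buffer, and depth - 1 is the character position used for that node.            */
__device__ unsigned char nodeCharacter(const unsigned char *patterns,
                                       const unsigned int *patternIndexes,
                                       unsigned int patternID,
                                       unsigned int depth)
{
    unsigned int patternStart = patternIndexes[patternID];
    return patterns[patternStart + depth - 1];
}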
The adapted algorithm from Section 3.4.3 for constructing the stack of nodes along with the stack of leaves is presented as follows: • Initialization topOf N odeStack ← 1 (top of node stack) 2: topOf LeavesStack ← 1 (top of leaves stack) 3: currentP os ← 0 (position in the stack) 4: node (the currently processed node) 1: • function addNode(node, currentPos) 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: nodeStack[currentP os] ← node nodeStack[currentP os].of f set ← topOf N odeStack add in hash (key ← node, value ← currentP os) pc ← popCount(node.bitmap) if pc = 0 then topOf LeavesStack ← topOf LeavesStack + 1 leavesStack[topOf LeavesStack − 1] ← currentP os end if old ← currentP os topOf N odeStack ← topOf N odeStack + pc for i ← 0 to pc − 1 do addNode(node.child[i], old + i) end for We are employing here the same hash-based technique for restoring pointers, by inserting all pointers of the original tree in the hash and then, on re-parsing the serialized tree, replacing each pointer with its offset from the hash. 4.3.3 The Commentz-Walter Automaton - Algorithm B The algorithm B proposed for the Commentz-Walter automaton represents an improved version of algorithm A, and uses the same storage format (Figure 4.2). In this approach, we are going to reduce the computation of the shift1 jumping distance by building a stack for each character in the alphabet (in total, at most 256 stacks for an alphabet of size 256, such as the ASCII charset). Using this methodology, PhD Dissertation 77 stacki represents the list of nodes L of the automaton, so that if v is a node in L then word(v) ends in character “i”. As the construction of the tree relies heavily on the suffix computation stage at every node in the tree, we are going to use the stacks to limit the number of searches performed for a node whenever performing the suffix computation. Furthermore, using stacks allows us to better distribute the workload in a data-parallel implementation of the algorithm as will be shown later on. Following the same analogy, the shift2 jump distance computation can be reduced if using another set of stacks, of the same size as the alphabet, however in this case each stack will only contain the leaf nodes in the automata. A formal description of the approach is: stackLeavesi represents the list of nodes Lleaves in the automaton, so that if v is a node in Lleaves then word(v) ends in character “i” and v is a leaf node. 4.4 Optimizing the Performance of the Pre-Processing Stage for the Commentz-Walter Automaton Based on the definitions of the shift1 and shift2 distances in [CW79], we will demonstrate as follows a few properties which will help significantly reduce the workload during the pre-processing stage of this automaton. Proposition 4.4.1 (Parent nodes have the smallest shift1 values) Assuming that we are located at node v of depth ∆ in the Commentz-Walter tree, then all shift1 values of all child nodes at all higher depths in the tree have a value of at least shift1(v). Proof We will demonstrate this by reductio ad absurdum. Let’s assume that there is a node v of depth ∆ having shift1v =s1 . This means by definition that there is at least one node v’ of depth ∆ + s1 , for which word(v ) is a suffix of word(v’ ), and that s1 is the minimum value possible for which this property is verified. Assuming now that there is a second node, z at depth ∆ + d, where d > 0 for which shift1z =s2 < s1 , that means that there is at least one node z’ in the tree of depth ∆ + d + s2 for which word(z ) is a suffix of word(z’ ). 
Since word(v) is a prefix of word(z) by construction, it follows that word(v) also occurs within word(z'). The ending position of this occurrence of word(v) in word(z') can then be computed as the depth of z' minus the distance between z and v, which translates to ∆ + d + s2 − ((∆ + d) − ∆) = ∆ + s2, meaning that the node w at depth ∆ + s2 (an ancestor of z') verifies the property that word(v) is a suffix of word(w), which contradicts the assumption that s1 is the smallest positive integer value for which that property is verified.

Proof By definition, shift1 = min{Wmin, {l; l = d(v') − d(v)}}, and since the first member of the min function is Wmin, it follows that if the other member only takes higher values, the expression simplifies to the first member. It is therefore sufficient to look, from the current node's depth, at most Wmin − 1 additional depths down the tree to find the smaller possible values for shift1, if they exist.

Proposition 4.4.3 (Limiting searches for shift2 computation) For each node v in the Commentz-Walter tree, we only need to determine whether word(v) is a suffix of word(z), where z is a leaf node of the tree, in order to correctly compute shift2.

Proof By definition, shift2 = min{shift2(parent node of v), {l; l = d(v') − d(v)}}, where v' is a leaf node of the tree. If none of the leaves in the tree have word(v) as a suffix, then it is sufficient, once all nodes have been processed, to parse the tree recursively and compute shift2(v) = min(shift2 of the parent node of v, current value of shift2(v)).

4.5 Achieving Automaton Construction Parallelism

For both types of automata, we will follow a data-parallel decomposition model, applied to the different scenarios as follows.

4.5.1 The Aho-Corasick Automaton

The most computationally intensive part of the Aho-Corasick pre-processing stage is the failure function computation. In order to parallelize this process, each node of the tree is assigned a thread in the GPU, all threads executing the same kernel. For each thread t assigned to a node v of depth greater than 1, we verify whether the suffix sl of length l of word(v) exists in the tree, beginning with the longest suffix length, which is depth(v) − 1; if found, the failure function is set to point to the offset in the stack where this suffix is located, and the thread terminates.

1: if node.depth <= 1 then
2:   node.failOffset ← 0
3:   stop
4: end if
5: node.failOffset ← 0
6: for l ← node.depth − 1 downto 1 do
7:   if suffix sl of word(node) is matched by the automaton at offset matchOffset then
8:     node.failOffset ← matchOffset
9:     stop
10:   end if
11: end for

Figure 4.4: Grid structure of our hybrid-parallel model for Commentz-Walter algorithm A.

4.5.2 The Commentz-Walter Automaton - Algorithm A

Following the same data-parallel decomposition approach for the Commentz-Walter automaton, we have built a hybrid-parallel model that uses a data-parallel approach to parse the stack of nodes constructed in Section 4.3.2 when computing the shift1 jump distance, and a task-parallel approach which assigns part of the available threads in the GPU to the shift2 jump distance computation. We have assigned 65,536 threads to each node of the automaton (the maximum allowed as a grid dimension in CUDA being 65,535), where each thread of index i performs comparisons between the node it is associated with and the nodes from i × topOfNodeStack/65,536 to (i + 1) × topOfNodeStack/65,536 − 1, where topOfNodeStack is the total number of nodes in the stack. In order to achieve this, we created a kernel that uses 128 threads per block and a grid of blocks of dimensions topOfNodeStack/128 × 65,535 (Figure 4.4).

4.5.3 The Commentz-Walter Automaton - Algorithm B

Algorithm B reduces the computational workload by assigning only a limited number of nodes to a thread in the data-parallel decomposition, as opposed to algorithm A. In this improved approach, we assign to each node v a number of T threads (where T was chosen from a wider set, as discussed later), which are used for parsing the two stacks described earlier (Figure 4.2), stack[node(v).c] and stackLeaves[node(v).c], using a data-parallel approach. After finishing this processing, we parse the automaton once more, recursively, and set the shift2 values based on the relationship defined in the original algorithm from Section 2.1.2.2, shift2 = min(shift2(parent), current value).
For each node of the automaton we have assigned 65,536 threads (the maximum allowed as a grid dimension in CUDA is 65,535), where each thread of index i performs comparisons between the node it is associated with and the nodes from i × topOfNodeStack/65,536 to (i + 1) × topOfNodeStack/65,536 − 1, where topOfNodeStack is the total number of nodes in the stack. In order to achieve this, we created a kernel that uses 128 threads per block and a grid of blocks of dimensions (topOfNodeStack/128) × 65,535 (Figure 4.4).

4.5.3 The Commentz-Walter Automaton - Algorithm B

Algorithm B reduces the computational workload by assigning only a limited number of nodes to a thread in the data-parallel decomposition, as opposed to Algorithm A. In this improved approach, we assign to each node v a number of T threads (where T was chosen from a wider set, as discussed later), which are used for parsing the two stacks used (Figure 4.2), stack_{node(v).c} and stackLeaves_{node(v).c}, using a data-parallel approach. After finishing the processing, we parse the automaton once more recursively and set the shift2 values based on the relationship defined in the original algorithm from Section 2.1.2.2, shift2 = min(shift2(parent), current value).

• Initialization
1: nodes[0].shift1 ← 1 (for the root); nodes[i].shift1 ← Wmin (for all other nodes i)
2: nodes[i].shift2 ← Wmin (shift2 initialization)
3: node (the currently processed node)
4: T (number of threads per node), N (total number of nodes)
5: patterns (the list of patterns), patternIndexes (the index list)

• Jump distance computation
1: if ThreadID > N × T then
2:   Stop
3: end if
4: nodeIndex ← ThreadID / T
5: toProcess ← ThreadID mod T
6: charIndex ← node.depth − 1
7: patternStart ← patternIndexes[node.patternID]
8: c ← patterns[patternStart + charIndex]
9: call computeShift1()
10: call computeShift2()

• setShift1(node, value)
1: atomicMin(node.shift1, value)
2: if node.shift1 = Wmin then
3:   for all children child of node: child.shift1 ← Wmin (recursive)
4: end if

In the listings below, stack[c][0] and stackLeaves[c][0] store the number of entries in the corresponding per-character stack, so the useful entries start at index 1.

• computeShift1()
1: if node.shift1 <= Wmin then
2:   Stop
3: end if
4: if T > stack[c][0] then
5:   if toProcess = 0 then
6:     from ← 1
7:     to ← stack[c][0]
8:   else
9:     from ← 1
10:    to ← 0
11:  end if
12: else
13:   from ← 1 + toProcess × stack[c][0] / T
14:   to ← (toProcess + 1) × stack[c][0] / T
15: end if
16: if to > stack[c][0] then
17:   to ← stack[c][0]
18: end if
19: found ← false
20: for i ← from up to to do
21:   if node.shift1 = Wmin or node.shift1 = 1 then
22:     break
23:   end if
24:   j ← stack[c][i]
25:   if nodes[j].depth > node.depth and nodes[j].depth < node.depth + min(Wmin, node.shift1) then
26:     if isSuffix(node.patternID, charIndex, nodes[j].patternID, nodes[j].depth − 1) then
27:       if node.shift1 > nodes[j].depth − node.depth then
28:         found ← true
29:         call setShift1(node, min(nodes[j].depth − node.depth, Wmin))
30:       end if
31:     end if
32:   end if
33: end for
34: if node.shift1 > Wmin or found = false then
35:   call setShift1(node, Wmin)
36: end if

• computeShift2()
1: if node.shift2 = 1 then
2:   Stop
3: end if
4: if T > stackLeaves[c][0] then
5:   if toProcess = 0 then
6:     from ← 1
7:     to ← stackLeaves[c][0]
8:   else
9:     from ← 1
10:    to ← 0
11:  end if
12: else
13:   from ← 1 + toProcess × stackLeaves[c][0] / T
14:   to ← (toProcess + 1) × stackLeaves[c][0] / T
15: end if
16: if to > stackLeaves[c][0] then
17:   to ← stackLeaves[c][0]
18: end if
19: for i ← from up to to do
20:   if node.shift2 = 1 then
21:     break
22:   end if
23:   j ← stackLeaves[c][i]
24:   if nodes[j].depth > node.depth and nodes[j].depth < node.depth + min(Wmin, node.shift2) then
25:     if isSuffix(node.patternID, charIndex, nodes[j].patternID, nodes[j].depth − 1) then
26:       if node.shift2 > nodes[j].depth − node.depth then
27:         atomicMin(node.shift2, nodes[j].depth − node.depth)
28:       end if
29:     end if
30:   end if
31: end for

4.6 Performance Evaluation

The pre-processing stage for both types of automata studied in this chapter is the most computationally demanding part of the construction stages. As the other stages of the construction, as defined in Figure 4.1, are performed on the CPU, we have benchmarked the pre-processing time in our performance evaluation and compared it with the single-core CPU implementations, in order to determine the degree of speedup achieved by our proposed parallel architecture.

4.6.1 The Aho-Corasick Automaton

In order to experiment with a high number of nodes in our automaton, we have used a set of 56,900 signatures extracted from the ClamAV virus database, where each pattern had a minimum length of 16 bytes, and we have created two additional datasets by concatenating the initial file with the same set of patterns, but in reversed order. This way, we obtained a total of 113,800 unique patterns to test with, and we ended up choosing three different datasets for the benchmark: the initial dataset of 56,900 virus signatures (in total, over 6.8 million nodes), the second dataset representing the 113,800 signatures (in total, over 13.5 million nodes) and a third dataset, representing the first 90,000 signatures out of the second (totalling about 10.5 million nodes).

Figure 4.5: Results for the single-core CPU vs. GPU implementation of the Aho-Corasick automaton.

The results obtained have shown a similar trendline in the running time for both the CPU and the GPU implementations (Figure 4.5), with better throughput in the GPU implementation by almost 24× for the first (smaller) dataset, more than 28× for the intermediate dataset and more than 23× for the third (largest) dataset (Figure 4.6). There is an important remark to be made here: for the largest dataset, the GPU implementation had to be built as a 64-bit application given the large memory requirements; as a consequence, the performance degraded, since words in memory are double the length in 64-bit implementations compared to 32-bit implementations. However, the rising trend in the acceleration is certain to continue until it reaches saturation (in our case, for as few as 6.8 million nodes, meaning 6.8 million threads in the GPU, the bandwidth was not saturated, which was immediately observable in the next test employing over 10.5 million threads, which brought a 12% performance increase compared to the previous dataset), assuming that we compile both the CPU and GPU implementations for the same 32 or 64-bit architecture.

4.6.2 The Commentz-Walter Automaton - Algorithm A

Given the high dependence of the Commentz-Walter algorithm on the length of the smallest keyword in the dictionary, wmin, we have opted to use the same set of signatures as in Figure 3.24, having wmin ∈ {16, 32, 64, 96, 128}. The number of nodes obtained in each of these situations is shown in Figure 4.7, and their corresponding memory usage on the device is shown in Figure 4.8.
In order to test the performance of the classic approach to the precomputation stage for this automaton and the new, optimized one, based on the mathematical PhD Dissertation 84 Figure 4.6: The speed-up obtained in the GPU implementation of the Aho-Corasick automaton, compared to the CPU single-core version. Figure 4.7: The number of nodes obtained in the Commentz-Walter automaton when using different wmin values. PhD Dissertation 85 Figure 4.8: The memory usage on the device (GPU) for the Commentz-Walter automaton when using different wmin values. demonstrations presented in Section 4.4, we have conducted a series of two benchmarks: the first had implemented the classic, non-optimized version of the algorithm on the CPU, and tested it against the optimized GPU implementation; the second benchmark tested the optimized version on both the GPU and the CPU, and measured the performance improvement. The running times for the non-optimized CPU single-core implementation are shown in Figure 4.9, while for the optimized GPU implementation are shown in Figure 4.10. The overall speed-up achieved here is presented in Figure 4.11. While the results may seem somewhat side-tracked, they are perfectly explainable here for two reasons: the first, we tested a parallel, mathematically-improved version of the algorithm on the GPU, while the single-core CPU version was implemented using the classic approach from the original algorithm definition with no improvements of any kind, and the second, much of the performance increase here is thanks to the low number of computations performed by the algorithm based on the propositions discussed earlier. In fact, by applying the shift1 and shift2 computation limitations, only about 700,000 nodes get processed and are updated through suffix computation for a 6.8 million-node automaton, which is less than 10%; the remaining nodes are updated through recursive calls based on proposition 1 from Section 4.4. A more fair comparison between the optimized CPU single-core implementation and the parallel GPU implementation and the resulting speedup is shown in Figure 4.12, where we noticed throughput speedups from 40× (for wmin = 96) to 50× (for wmin = 64) for the GPU implementation. PhD Dissertation 86 Figure 4.9: The running times for the classic (non-optimized) version of the Commentz-Walter pre-processing stage on a single-core CPU. Figure 4.10: [Algorithm A] The running times for the optimized version of the Commentz-Walter pre-processing stage running on the GPU. PhD Dissertation 87 Figure 4.11: [Algorithm A] The speed-up of the optimized GPU implementation versus the non-optimized CPU single-core implementation. Figure 4.12: [Algorithm A] The speed-up of both optimized implementations in the GPU versus a single-core CPU. PhD Dissertation 88 Figure 4.13: [Algorithm B] The running times on the GPU. 4.6.3 The Commentz-Walter Automaton - Algorithm B As discussed in Section 4.5.3, for this improved version of the algorithm we have chosen different values for T , the number of threads in the GPU per node in the automaton. We started with a value of 4 and went up to 4,096. The primary interest was measuring the speed-up compared to the speed of algorithm A discussed previously. Running times of the improved algorithm B are presented in Figures 4.13 and 4.14. The worst running times were obtained for T = 4, which is explainable by a non-saturation of the GPU pipelines. 
The best performance for wmin ∈ {16, 32, 64, 96} was obtained when using 16 threads per node: in this situation, the algorithm finished in 1.67s for wmin = 16 (12× faster than algorithm A), 1.623s for wmin = 32 (17× faster than algorithm A), 1.498s for wmin = 64 (14× faster than algorithm A) and 1.295s for wmin = 96 (19× faster than algorithm A). The best running time for wmin = 128 was obtained for T = 64, when the algorithm finished the processing in 0.89s (again, almost 19× better performance than algorithm A). The results are impressive and highly useful, considering a reduction of processing time from the magnitude of hours in the classic algorithm, to just a couple of seconds or less in the optimized GPU-based version in algorithm B, and offer the first parallel approach to the pre-computation stage in the Aho-Corasick and Commentz-Walter automata, with the best improvements in the latter due to the different properties of the nodes in the automaton. PhD Dissertation 89 Figure 4.14: [Algorithm B] The running times on the GPU (continued). 4.7 Summary This chapter focused on presenting the first hybrid-parallel CPU/GPU-accelerated architecture for the fast construction of very large Aho-Corasick and Commentz-Walter automata, two of the most common types of automata used in pattern matching applications in several fields of research. Starting with the efficient storage model proposed in Chapter 3, we designed and implemented an architecture which relies on the CPU-based construction of both types of automata, while employing the preprocessing stage to be computed in the GPU using a hybrid-parallel approach. We proposed an algorithm for the Aho-Corasick automaton in Section 4.5.1 and two different variants for the Commentz-Walter approach in Sections 4.5.2 and 4.5.3, with the latter offering much better performance. We based our improvements on a mathematical model outlining several properties of the jumping distances which could be used to reduce the computational workload, and which we have proven to be correct in Section 4.4. We conducted several benchmarks for determining the performance of our architecture in Section 4.6. Our proposed architecture adds the following benefits to the existing approaches: • it proposes the first hybrid-parallel architecture for constructing very large automata through an efficient collaboration between the CPU and the GPU, with the GPU handling the most computationally-intensive task, the pre-processing stage; PhD Dissertation 90 • it builds a formalism which can be applied to any type of pattern matching automata, not just to intrusion detection systems; • it offers an efficient way to store the automata in device memory, fully benefitting from the bandwidth transfers between the host and device memory and offering direct access to the nodes of the automata in a serialized structure; • it offers the first pre-processing results on the set of virus signatures in the ClamAV database for both the Aho-Corasick and Commentz-Walter automata; • it reduces the processing time for such automata from the magnitude of hours in the case of the Commentz-Walter automaton to less than a couple of seconds; • it opens up new research opportunities for implementing self-adjusting automata in real-time. 
It is our belief that in the future, self-adjusting automata (containing millions or tens of millions of states) which are capable of dynamically updating their contents according to the user’s requirements, will be easily implementable using parallel, GPU-accelerated approaches. We are also confident that the approach has applicability in several other fields of research where a very large number of patterns is required, such as bioinformatics or approximate pattern matching algorithms in natural language processing. Chapter 5 Implementing Behavioral Heuristics and Hybrid Compression Techniques in the Aho-Corasick Automata Contents 5.1 5.2 5.3 Behavioral Heuristics . . . . . . . . . . . . . . . . . . . . . 92 5.1.1 Towards Behavioral Analysis . . . . . . . . . . . . . . . . . 92 5.1.2 Applications to Malicious Code Detection . . . . . . . . . . 93 5.1.3 A Metric-Based Heuristic for Malicious Code Detection . . 93 5.1.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . 103 Hybrid Compression . . . . . . . . . . . . . . . . . . . . . 106 5.2.1 The Smith-Waterman Local Sequence Alignment Algorithm 107 5.2.2 Towards An Efficient Compression Mechanism . . . . . . . 108 5.2.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . 113 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 In this chapter we are going to present known approaches to using the AhoCorasick pattern matching automata for implementing behavioral heuristics and hybrid compression techniques, in different fields of research. We will focus primarily on the intrusion detection field and particularly on the virus scanning process, but the formal models used can be easily applicable to any other type of applications involving automaton-based heuristics or requiring low-storage space. 91 PhD Dissertation 5.1 92 Behavioral Heuristics Behavioral heuristics are focusing on reducing the time allocated to a decision-making process and are applied when there is insufficient time for assessing the ideal choices in each possible circumstance. In a game of chess for instance, a player may choose to take some non-trivial choices, such as sacrificing the rook in order to gain a tactical advantage with no other immediate benefit. While this rule of thumb may seem harsh at the time, such compromises are often beneficial in the long run. Of course, there are numerous situations as well when such behavior is rendered useless in the long run (e.g. by a smarter opponent which may use the sacrifice for winning the game) and may cause more harm than actually help. Gigerenzer et al. in [GG11] have described heuristics as being “efficient cognitive processes, conscious or unconscious, that ignore part of the information”. He points out that formal models for answering the descriptive question (which heuristics are the most appropriate given a certain situation? ) and the prescriptive question (why are heuristics to be considered when there may be better choices? ) are essential. 5.1.1 Towards Behavioral Analysis Using the Aho-Corasick automaton from [AC75] for dictionary-based storing has the immediate advantage of low-storage space thanks to the trie-based tree structure and the ability to convert the trie into an automaton by adding the failure function to it. 
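As a rough illustration of the storage advantage mentioned above, a dictionary trie only needs one extra failure link (and an output marker) per node to become an Aho-Corasick automaton state. The following minimal C sketch uses an illustrative dense 256-entry child table; the field names are assumptions for this example and do not reflect the compact storage layout used elsewhere in this thesis.

#include <stdint.h>

#define ALPHABET_SIZE 256

/* A trie node extended into an Aho-Corasick state: the only additions over
 * a plain dictionary trie are the failure link and the output marker. */
typedef struct ac_state {
    struct ac_state *children[ALPHABET_SIZE]; /* goto function g(s, a)        */
    struct ac_state *failure;                 /* failure function f(s)        */
    int32_t          output;                  /* pattern id ending here, -1 if none */
} ac_state;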
We have presented two different applications of the behavioral analysis approach discussed in Section 5.1.3 in [Pun09] (which is also the base of our discussion) and [Pun10], using as a starting point the study performed in [PAF09], where we analyzed and benchmarked the best approaches for storing the data collected from the wireless sensors installed in a smart home, in order to efficiently collect, analyze and interpret such data. As a conclusion to this initial, partially empirical research, in [Pun10] we presented a model for the behavioral analysis of electrical apparatus in an intelligent home (using the approach presented in Section 3.4.4), by storing and learning their power consumption profiles. In this approach, the dictionary is comprised of a custom-built alphabet, used for storing the energy patterns in the tree. The automaton ensures accurate matching and also allows us to store the power profiles after an adaptation period (which is variable for each home) and to recognize different power profiles of the apparatus afterwards, by matching future consumption values through our extended automaton approach. With this implementation it is therefore easier to build additional heuristics on top of the automaton for classifying power consumption profiles and offering informed suggestions to the end-user about how they can improve their power consumption habits.

5.1.2 Applications to Malicious Code Detection

Heuristics have been used in intrusion detection systems in several topics of interest, such as network IDS [CH11, PAP12], cloud security [AX11, HZB11] and virus scanning [Pun09, Pun13]. There are two major parts to the analysis of executable code in order to assess its threat level: the static analysis component and the dynamic analysis component, related to the classification of intrusion detection systems in Section 2.2.1 into misuse detection and anomaly detection. Rozinov, in his Master's thesis [Roz05], discussed the challenges of efficiently detecting malicious code behavior and proposed an approach that uses only static analysis in order to assess the threat level of an executable, without requiring it to be run upfront. As Cohen [Coh87] and Chess and White [CW00] have pointed out, it is not possible to construct an algorithm that can detect all possible viruses. Landi in [Lan92] has shown that the problem of static analysis can be undecidable and uncomputable. These are the premises for building formal models which can assess the threat level of executable code, either by analyzing it before execution (through techniques such as sandboxing or emulation, as discussed in Section 2.2.2), or by analyzing it at run-time through watchdog-like mechanisms. Several anomaly-detection-based techniques for performing malicious code detection have been proposed [SBDB01, ALJ+95, HFS98, PN97, FKF+03, Roz05, CWL07, Pun09], with one of the most common methodologies involving system-call analysis: in short, a program is traced, the sequence of system calls it makes at run-time is analyzed and a threat level is computed for it; if the result exceeds the maximum threshold permitted, the program is considered to be dangerous, otherwise it is considered to be safe.

5.1.3 A Metric-Based Heuristic for Malicious Code Detection

This section represents a review of the research we have performed in [Pun09]. We present as follows the general overview of our approach and outline its most important results and contributions. In [SBDB01], Sekar et al.
have discussed an approach for the static extraction of sequences of system calls performed by a program, directly from the program's executable code, along with a storage model for them. In [BGI+12], Barak et al. have shown that it is impossible to completely obfuscate a program in a bounded computational environment. A direct consequence is that no virus can completely hide itself from all possible detection techniques. Modern techniques for malicious code detection do not require access to the source code, but instead rely on different heuristics or prediction models for classifying programs into malicious or safe. The architecture for performing such an analysis requires the following components:

• a module to disassemble the executable code and transform the opcode mnemonics into a humanly-readable form (e.g. assembler code);
• a static analysis module for parsing the source code obtained in the previous step and determining all possible paths of execution for the given program, assembling all possible sequences of system calls as a consequence thereof;
• an algorithm for analyzing the sequences extracted and applying different heuristics for assessing the threat level and classifying it into malicious or safe.

5.1.3.1 Program Slicing

The system-call sequence extraction process mentioned earlier is based on the concept of program slicing, commonly found in static analysis research. Weiser in [Wei81] proposed a slicing technique which was used later on by Sung et al. in [SXCM04] to build an architecture which captures every system call made by the running program. Weiser defined program slicing as a method to build an abstract form of a program which is still functionally identical to the original; the reduced program, or "slice", is an autonomous program independent of the original, but preserving its behavior. His approach involved control flow analysis by parsing the program's source code. Barak et al. in [BGI+12] have also proposed a slicing algorithm which extracts only the relevant parts of an executable program. Weiser's approach has also been used by Kiss et al. and Cifuentes et al. in [kKJLG03, CF97] for performing static analysis on executable code. We will demonstrate the program slicing technique with the help of a small C program, based on the example from [Gra]:

int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}

The first step in performing slicing is the construction of the control dependence graph (CDG) and the flow dependence graph (FDG) [Gra]. Based on these two, the program dependence graph (PDG) can be constructed (Figure 5.1). At this stage, backward slicing is performed by determining all incoming arcs for a given node in the PDG, identifying all nodes from which these arcs originate, and continuing the process recursively until no new nodes are added to the list of nodes (Figure 5.2). At this stage, a program slice (as defined by Weiser in [Wei81]) has been identified.

Figure 5.1: The program dependence graph (PDG) for the example code in Section 5.1.3.1 [Gra].
Figure 5.2: A slice starting from the last printf() statement for the PDG in Figure 5.1 [Gra].

Interprocedural slicing builds a system dependence graph (SDG) for the code (Figure 5.3).
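To make the intraprocedural case concrete before moving on: the backward slice of the example program above, computed with respect to the last printf() statement (the value of i), keeps only the statements that the value of i depends on and drops everything related to sum. The sketch below assumes the loop bound of the example above; it is an illustration of the slicing result, not output produced by a specific tool.

#include <stdio.h>

/* Backward slice of the example program w.r.t. the final printf(i):
 * only statements influencing the value of i at that point are kept. */
int main() {
    int i = 1;
    while (i < 11) {
        i = i + 1;
    }
    printf("%d\n", i);
    return 0;
}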
We now perform backward slicing again, taking extra care with the intersecting paths of any two calls in different calling contexts - such intersections must not occur, or the algorithm might run into an infinite loop. The result of the backward slicing is the program slice containing all the calls inside that portion of the program's code (Figure 5.4). This approach is used to extract the sequences of system calls that a program makes during its execution. It is easy at this stage to assign ordinal numbers to each system call. Usually, an additional cleanup process removes trivial (non-malicious) system calls with no real impact on security, and the remaining result is the sequence of interest to be analyzed. The cleanup process does not, however, affect the ordinal numbers assigned to the system calls.

Figure 5.3: System dependence graph (SDG) for the program in Figure 5.1 [Gra].
Figure 5.4: Interprocedural backward slicing for the program in Figure 5.1 [Gra].

5.1.3.2 Time Dependency

While there are many studies showing how to extract, analyze and classify program slices from a binary program's executable, there is little research related to the time dependency of the system calls. In practice, this dependency relation can represent the difference between detecting a false positive or not. Consider, for instance, the following scenario: if a program partially matches a malicious system-call sequence, but stops to perform additional harmless operations for a very long time, and only after a significant amount of time (when the effect of the previous sequences has been neutralized) performs what remains of the sequence, the result would be a false positive in the dynamic analysis engine of an antivirus, although the program did not perform any malicious activity at all. Some approaches that we proposed in [Pun09] for solving this problem are:

1. Associating each system call with the machine time at the moment the call is made. This is a hardware-dependent approach which is not portable and therefore prone to errors.

2. Computing the number of instructions between system calls in the decompiled assembler source and associating an ordinal number with each system call. The number is computed according to the program counter (PC) value of the program, or the number of instructions reached, but poses additional challenges on CISC architectures (which have variable-length instructions, making them difficult to backtrace in static analysis).

3. One other possible approach, which we have also used, is computing the number of harmless system calls between consecutive suspicious calls and associating an ordinal number with each suspicious system call.

5.1.3.3 Measuring Similarity Using the Bray-Curtis Metric

The Bray-Curtis distance [Bra] (also called the Sorensen distance) is a normalization method commonly used in botany, ecology and environmental sciences, viewing space as a grid in a manner similar to the city-block distance. An important property is that if all coordinates are positive, the value of the distance is between 0 and 1, with 0 representing identical coordinates.
The Bray-Curtis distance between two vectors x_i and x_j is defined as:

d_{BC}(x_i, x_j) = \frac{\sum_{k=1}^{n} |x_{ik} - x_{jk}|}{\sum_{k=1}^{n} (x_{ik} + x_{jk})} = \frac{\sum_{k=1}^{n} |x_{ik} - x_{jk}|}{\sum_{k=1}^{n} x_{ik} + \sum_{k=1}^{n} x_{jk}}

and we define the Bray-Curtis function as follows:

f_{BC} : R_+ \to [0, 1], with x_{ij} \in R_+, \forall i, j \in N and x_{jk} \geq x_{ik}, \forall j, k \in N,

f_{BC}(x_{j,n+1}) = \frac{\sum_{k=1}^{n} |x_{ik} - x_{jk}| + |x_{i,n+1} - x_{j,n+1}|}{\sum_{k=1}^{n} (x_{ik} + x_{jk}) + (x_{i,n+1} + x_{j,n+1})}

Let us prove the following:

Theorem 5.1.1 (monotonicity of f_BC) f_BC is a monotonically increasing function, that is, \forall x_1, x_2 \in R_+ with x_1 > x_2, f_{BC}(x_1) > f_{BC}(x_2).

Proof. Using the notations

\alpha = \sum_{k=1}^{n} |x_{ik} - x_{jk}|, \quad \beta = \sum_{k=1}^{n} (x_{ik} + x_{jk}), \quad a = x_{i,n+1},

we need to prove that

\frac{\alpha + |a - x_2|}{\beta + (a + x_2)} < \frac{\alpha + |a - x_1|}{\beta + (a + x_1)}

where we know that a < x_2 < x_1. Since both the numerators and the denominators are positive, cross-multiplying gives:

\alpha(a + x_1) + \alpha\beta + \beta(x_2 - a) + (x_2 - a)(a + x_1) < \alpha\beta + \alpha(a + x_2) + \beta(x_1 - a) + (a + x_2)(x_1 - a)

which, after performing the calculus, reduces to:

\beta(x_1 - x_2) - \alpha(x_1 - x_2) + 2a(x_1 - x_2) > 0 \Leftrightarrow (x_1 - x_2)(\beta - \alpha + 2a) > 0

which is true, since x_1 > x_2 from the hypothesis, a > 0 and \beta \geq \alpha (all coordinates being positive). □

Theorem 5.1.2 (convergence of f_BC) f_BC converges towards 1, that is, \lim_{x \to \infty} f_{BC}(x) = 1.

Proof. Since f_BC is monotonic and f_{BC}(x) \in [0, 1], \forall x \in R_+, from Weierstrass's theorem we deduce that it converges towards its upper bound, 1. □

Let us now define the extended Bray-Curtis function f_BCX as follows:

f_{BCX} : R_+^{n+1} \to [0, 1], with x_{ij} \in R_+, \forall i, j \in N and x_{jk} \geq x_{ik}, \forall j, k \in N,

f_{BCX}((x_{j0}, x_{j1}, ..., x_{jn})) = \frac{\sum_{k=0}^{n} |x_{ik} - x_{jk}|}{\sum_{k=0}^{n} (x_{ik} + x_{jk})}

Theorem 5.1.3 f_BCX is monotonically increasing, meaning f_{BCX}(v_1) > f_{BCX}(v_2), where v_1 = (x_0, x_1, ..., x_n) \in R_+^{n+1}, v_2 = (y_0, y_1, ..., y_n) \in R_+^{n+1} and x_i \leq y_i, \forall i \in \{0, 1, ..., n\}.

Proof. We base the proof on the fact that if v = (x_0, x_1, ..., x_n) \in R_+^{n+1}, by fixing the last n values of the vector we obtain f_{BCX}(v) = f_{BC}(x_0), and we know from the previous theorem that f_BC is a monotonically increasing function. Therefore, if all but the first variable of the input vector of f_BCX are fixed, the theorem is proven. If we now consider fixed all the values in v except x_1, we see that f_{BCX}(v) = f_{BC}(x_1), and so on, which proves that whenever one variable x_i, i \in \{0, 1, ..., n\}, of v is free and all the others are fixed, f_BCX is monotonically increasing. In order to prove it for all the variables, one can use mathematical induction, by considering all variables fixed except one and by using permutations between the variables. □

We encode the system calls by assigning them ordinal numbers, represented as hexadecimal numbers - at this stage, the encoding methodology is similar to that used in the ClamAV signature database, which we have addressed in Section 3.2.1. Assuming that no sequence of system calls has a length shorter than 5, up to 2^16 = 65,536 can be inserted into the dictionary, a sufficient number for all modern operating systems today. There is no limit imposed on the length of a rule. Each system call which is part of a rule has an ordinal number assigned to it (the ordinal number of the system call before the resampling process mentioned in Section 5.1.3.1 takes place). Using an empirically determined threshold value for the degree of similarity between two sequences of system calls, we can use the above functions to perform the similarity comparison between the sequences existing in our dictionary and the ones detected in the program code.
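As a small, hedged illustration of the metric above (not the exact engine code), the following C function computes the Bray-Curtis value for two equally long vectors of ordinal numbers; the caller compares the result against the empirically chosen threshold. The ordinal values and the threshold in main() are illustrative only.

#include <math.h>
#include <stdio.h>

/* Bray-Curtis value for two n-dimensional vectors with positive coordinates:
 * sum of absolute differences divided by the sum of all coordinates.
 * The result lies in [0, 1]; 0 means the two vectors are identical. */
double bray_curtis(const double *x, const double *y, int n)
{
    double num = 0.0, den = 0.0;
    for (int k = 0; k < n; k++) {
        num += fabs(x[k] - y[k]);
        den += x[k] + y[k];
    }
    return den > 0.0 ? num / den : 0.0;
}

int main(void)
{
    /* Ordinal numbers of a dictionary rule vs. an observed call sequence
     * (illustrative values; theta is an example threshold). */
    double rule[] = {1, 3, 6, 7, 11, 20, 21, 23, 29};
    double obs[]  = {1, 4, 8, 10, 13, 22, 25, 29, 33};
    double theta  = 0.15;

    double d = bray_curtis(rule, obs, 9);
    printf("f_BC = %.4f -> %s\n", d, d < theta ? "match" : "mismatch");
    return 0;
}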
As a consequence, the metric that we have proposed, which is based on the BrayCurtis distance, can be used only when the ordinal values associated to the system calls detected in the program being analyzed have greater values than the corresponding positions in the dictionary (caused by the function fBCX being monotonously increasing). To that effect, a rule in the dictionary which has the ordinal values {1, 3, 6, 7, 11, 20, 21, 23, 29} associated with it would determine an input sequence of {1, 2, 5, 7, 11, 20, 21, 23, 29} to cause a mismatch unless a proper threshold Θ is used, because 2 ≤ 3 and 5 ≤ 6. However, having an input sequence of {1, 4, 8, 10, 13, 22, 25, 29, 33} preserves the conditions required to achieve function convergence above, showing that the higher the dispersion of system-calls in time, the probability of the sequence’s similarity to a rule decreases (basically, fBC is increasing towards 1 as the system calls are more and more dispersed in the binary code, and we assume that as time passes the probability of the program having a malicious behavior reduces; therefore, by carefully selecting a proper threshold value Θ so that a match is obtained if fBC < Θ we can ensure an accurate identification of such malicious patterns of sequences of system calls). PhD Dissertation 100 Figure 5.5: The extended Aho-Corasick automaton for the set of rules {AF 1E 2B 01 02 03 ∗ AA 1B 0E 03 AA BB, AA BB 1D CC}. Failure transitions are dashed lines. PhD Dissertation 5.1.3.4 101 A Weighted Automaton Model The extended Aho-Corasick automaton model for our approach, which assigns weights to each transition (with each weight representing the ordinal number of the systemcall), is comprised of the following elements: 1. a goto function g(s,a) that gives the next state entered in case of a match of the character a 2. each transition in the automaton from state s, in case of a match of the character a, has a weight associated to it given by a function h(s, a) • if the edge(u,v) is labeled by a, then g(u, a)=v • g(0, a) = 0 for each a that does not label an edge out of the root (the automaton remains at the initial state while scanning non-matching characters) 3. a failure function f(s) that shows what state to enter in case of a mismatch • search the longest proper suffix ps of word(s) so that ps is a prefix of a pattern, f(s) is the node labeled by ps 4. an output function out(s) that gives the set of patterns matched after entering state s 5. we are proposing two different approaches that may be used for when performing sequence matching in the tree • the first is taking into consideration a minimum depth ∆min of the tree, so that when the current matching is performed at a depth ∆ ≥ ∆min (or when we are reaching a leaf in the tree) we compute the Bray-Curtis function fBC defined earlier and if fBC < Θ, where Θ is the threshold value, we go on with the match, otherwise we call in the failure function and report a mismatch; taking into account the minimum length of the rules in the knowledge base, we conclude that ∆min ≥ 5. • a second approach is computing the Bray-Curtis distance fBC only when we have reached a leaf in the tree, independent of the current depth; if fBC < Θ, where Θ is the threshold value, we continue the algorithm as in the normal extended automaton, otherwise we call in the failure function and report a mismatch. PhD Dissertation Figure 5.6: A step in the matching process for the automaton in Figure 5.5. 102 PhD Dissertation 103 6. 
in both of these cases, the weights are automatically adjusted as the matching is being performed: if the next state (at position µ) where the matching is being accomplished is of depth λ, and the weight associated with the (µ − λ + 1)-th last accepted system call in the automaton is η, then for each system call that follows, including the current one, we subtract η − 1 from its associated weight and assign the result as its new weight.

For example, let us take the following rules as part of our dictionary: {AF/1 1E/5 2B/7 01/8 02/13 03/19 ∗ AA/1 1B/4 0E/6 03/8 AA/11 BB/12, AA/1 BB/5 1D/8 CC/11} (a rule in the form AA/10 means that the system call with the code AA has an ordinal value of 10 associated with it) and let us build the automaton for it (Figure 5.5). Figure 5.6 shows the automaton as it is fed the following sequence of system calls and their associated weights: (AF/1 1E/6 2B/8 01/12 02/13 AA/15 BB/19 1D/21 CC/25). Considering ∆min = 5, once that depth in the tree is reached, the value of the Bray-Curtis function is compared to the threshold and, since it verifies the similarity condition, the matching continues (Figure 5.7) until it is complete.

5.1.4 Performance Evaluation

We have evaluated the performance of our implementation using sequences of variable length (up to a maximum of 100) as patterns in our dictionary, with Θ ∈ {0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5}. We have then generated different input strings to be fed to the automaton, for which we computed the Bray-Curtis normalization in order to determine the maximum average distribution of non-suspicious system calls between any two consecutive suspicious system calls so that fBC < Θ. This is useful for knowing the maximum average number of system calls that can exist between two consecutive suspicious system calls which are part of a sequence before the threshold condition is no longer verified. The results are presented in Table 5.1 (a visual representation is also reproduced in Figure 5.8) and show that as the threshold value increases, so does the maximum average. One advantage of our approach is that the threshold value can either be statically determined a priori, or be dynamically adjusted as the system requires. Good performance was obtained for 0.1 ≤ Θ ≤ 0.2, where the maximum average varies from about 6 to 13 for rules of length 50, and from approximately 9 to 20 for rules of length 75.

We have tested our implementation against the classic Aho-Corasick algorithm, using a set of 38,439 signatures from the ClamAV database [cla12] containing both simple signatures and regular expressions. We associated weights to our model before beginning the computation, by generating them randomly. The input data used in our experiment was 7 MB in size (again, with randomly generated weights).

Figure 5.7: Finding matches in the automaton in Figure 5.5.
Figure 5.8: Maximum average distribution of non-suspicious system calls at different lengths for the rules in the knowledge base and for different threshold values, as shown in [Pun09].
Table 5.1: Maximum Average Distribution of Intermediary Non-Suspicious System Calls for Bray-Curtis Normalization

                              Rule Length
  Threshold Value      5      25      50      75      100
  Θ = 0.05            0.25    1.42    2.73    4.05    5.36
  Θ = 0.10            0.75    3.00    5.78    8.56   11.33
  Θ = 0.15            1.25    4.75    9.18   13.58   18.00
  Θ = 0.20            1.75    6.75   13.00   19.26   25.50
  Θ = 0.30            3.00   11.58   22.29   33.00   43.72
  Θ = 0.40            5.00   18.04   34.69   51.35   49.49
  Θ = 0.50            7.50   27.08   52.04   66.55   49.49

Table 5.2 shows the benchmark results obtained when testing the two automata on a quad-core Q6600 CPU.

Table 5.2: Benchmark Results of the Standard AC Machine, Extended AC Machine and Bray-Curtis-Weighted AC Machine

                                          Aho-Corasick Implementation Results
                                        Standard       Extended       Bray-Curtis-weighted
  Scanning time                         1.248 s        1.311 s        13.246 s
  Memory consumption                    119.41 MB      144.81 MB      239.04 MB
  Processing bandwidth (syscalls/s)     5,743.59·10³   5,467.58·10³   541.14·10³

In the weighted implementation, additional information is stored for each node in the automaton (almost double the size compared to the classic approach), and the processing complexity also increases whenever a partial match (at a leaf in the tree) is found. Our approach produced a throughput of 541,000 system calls analyzed per second, which is more than sufficient for contemporary programs, which usually make at most a few thousand system calls per second. We therefore conclude that our approach is suitable for implementation in a real-time heuristic malicious code detection engine, having very little impact on the performance of the overall system.

5.2 Hybrid Compression

Compressing the Aho-Corasick automaton in order to reduce its storage space is an idea that has long been pursued in intrusion detection systems. Tuck et al. in [TSCV] have proposed a method for compressing the automaton by reducing single-childed consecutive nodes to a single, compressed node. Their technique, called path-compression, reduces the storage requirements of the automaton when many nodes have a single direct descendant. The authors applied their approach to the Snort [sno] signature database; it was later slightly improved by Zha et al. in [ZS08], where bitmapped nodes were used together with path-compression to achieve even lower storage requirements. While path-compression produces good results in general, it increases the coding complexity of the automaton: for instance, there may be nodes whose failure functions point to a path-compressed node, which would require us to store an additional offset for the failure pointer, used only when pointing inside a path-compressed node.

Intrusion detection systems build very large automata, as we have discussed in Section 4.6, which requires the development of efficient storage methodologies for these structures. In [PN12] (detailed in Chapter 3) we have proposed such an efficient storage mechanism for both CPU and GPU implementations. In what follows, we propose a hybrid compression mechanism which uses dynamic programming concepts in order to reduce the memory requirements of very large automata. This section represents an outline of the research performed by us in [Pun13].

5.2.1 The Smith-Waterman Local Sequence Alignment Algorithm

The Smith-Waterman local sequence alignment algorithm was proposed by Smith et al. in [SW81] as a dynamic programming approach commonly used in bioinformatics [bio] for detecting and locating the longest common subsequences in two DNA strands.
It works by building a two-dimensional matrix, with each cell corresponding to a pairing of a letter from each sequence, and begins by parsing the matrix from the top-left corner, following a mostly diagonal path as it moves to the right or down in the matrix. In case of a match between the two letters compared, the path traverses the matrix diagonally. The algorithm comprises two steps: the computation of a similarity score and the search for the optimal alignment. A value v_{i,j} in the matrix for sequences A (of length n) and B (of length m) is computed as below in O(mn), where sub, ins and del are penalties for substitution, insertion and deletion:

v_{i,j} = \min \begin{cases} v_{i-1,j-1}, & A_i = B_j \\ v_{i-1,j-1} + sub, & A_i \neq B_j \\ v_{i,j-1} + ins \\ v_{i-1,j} + del \end{cases}    (5.1)

For our implementation, we used the common values ins = 1, del = 1, sub = 2. An example is shown in Table 5.3. An important application is the problem of finding the longest common subsequence (LCS), as well as finding all common subsequences between two given sequences.

Table 5.3: An example of the Smith-Waterman [SW81] local sequence alignment algorithm

            A   C
        0   1   2
    A   1   0   1
    T   2   1   2
    C   3   2   1
    G   4   3   2

Figure 5.9: Our constraint-based Aho-Corasick automaton (failures not included) for the regular expressions 001E2C*AABB00{6-}AABBFF and 000FA0{4-10}000FEE{3}AACCFF.

5.2.2 Towards An Efficient Compression Mechanism

As mentioned in Section 3.4.4, the majority of the ClamAV signature database (in total, 62,302 signatures, with an average length of 45 bytes and a maximum length of 804 bytes), which we have used for testing our approach, is represented (by an overwhelming percentage of 96.5%) by the last four types of constraints defined by us in Section 3.2.1. In order to implement the automaton, we use the extended, constraint-based Aho-Corasick automaton described in Section 3.4.4, while as a storage model we use the approach discussed in Section 3.4.3. In essence, the algorithm works by splitting regular expressions into their regular pattern parts and creating links between the different parts in the automaton. An outline of this approach is shown in Figure 5.9.

The basic idea for the compression mechanism we are proposing lies in the virus development process itself: many viruses of related families share similar portions of code, which eventually leads to their identification inside executable files based on highly similar (yet not entirely equivalent) signatures in virus databases. Based on this idea, we aim to extract common subsequences of code from the set of virus signatures in the ClamAV database and fragment the signatures so that they become regular expressions themselves. We are further encouraged to take this approach by the results obtained in GrAVity [VI10] (discussed in Section 3.3.3), where limiting the tree to a depth of only 8 characters produced false positives at a very low rate of 0.0001%. Since we used a very high number of signatures, we sorted them in lexicographical order of their characters and then applied the Smith-Waterman algorithm for identifying all common subsequences of code in the signatures, in groups of 1,000, progressively, up to the maximum tested.
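A minimal C sketch of the similarity-matrix computation from Equation (5.1), with the penalties used in our implementation (ins = del = 1, sub = 2); this only illustrates the recurrence itself, not the batched signature-fragmentation code, and the sequences in main() simply reproduce the Table 5.3 example.

#include <stdio.h>
#include <string.h>

#define INS 1
#define DEL 1
#define SUB 2

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Fills v[i][j] for sequences a (length n, rows) and b (length m, columns)
 * following Equation (5.1); v must be at least (n+1) x (m+1). */
void similarity_matrix(const char *a, int n, const char *b, int m,
                       int v[n + 1][m + 1])
{
    for (int j = 0; j <= m; j++) v[0][j] = j * INS;   /* gap row    */
    for (int i = 0; i <= n; i++) v[i][0] = i * DEL;   /* gap column */

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            v[i][j] = (a[i - 1] == b[j - 1])
                        ? v[i - 1][j - 1]                 /* match: move diagonally */
                        : min3(v[i - 1][j - 1] + SUB,     /* substitution           */
                               v[i][j - 1] + INS,         /* insertion              */
                               v[i - 1][j] + DEL);        /* deletion               */
}

int main(void)
{
    const char *rows = "ATCG", *cols = "AC";   /* the Table 5.3 example */
    int n = strlen(rows), m = strlen(cols);
    int v[n + 1][m + 1];

    similarity_matrix(rows, n, cols, m, v);
    for (int i = 0; i <= n; i++) {
        for (int j = 0; j <= m; j++) printf("%3d", v[i][j]);
        printf("\n");
    }
    return 0;
}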
Assuming the common subsequence is C, and that the words in the tree are P1 = S1 C S2 and P2 = T1 C T2, the resulting modification implies transforming the patterns into P1 = S1{0-0}C{0-0}S2 and P2 = T1{0-0}C{0-0}T2, so that C is stored only once in the automaton. We have used minimum subsequence lengths of 2 to 8, 10, 12, 16 and 32, and also two variations of the algorithm: one performs the splits starting with the longest common subsequences first (referenced from here onwards as max), while the other begins with the shortest (referenced from here onwards as min):

start ← 0, LIMIT ← 1000, LENGTH ← {2, 3, 4, 5, 6, 7, 8, 10, 12, 16, 32}
beginFrom ← 2, endAt ← maxKeywordLength (for min; reversed for max)
repeat
  lowerLimit ← start × LIMIT
  upperLimit ← lowerLimit + LIMIT − 1
  for i from lowerLimit to upperLimit do
    for j from i + 1 to upperLimit do
      pArray ← getAllCommonSubsequences(pattern[i], pattern[j])
      for k from 0 to pArray.count() − 1 do
        if pArray[k].length() > LENGTH then
          insertSubSeqInArray(pArray[k].length(), pArray[k], i, j)
          incrementOccurences(pArray[k])
        end if
      end for
    end for
  end for
  start ← start + 1
until all sequences processed
for i from beginFrom to endAt do
  for j from 0 to subArray[i].count() − 1 do
    splitPattern(pattern[subArray[i][j].id1], subArray[i][j].subsequence)
    splitPattern(pattern[subArray[i][j].id2], subArray[i][j].subsequence)
  end for
end for

Figure 5.10: The total number of nodes obtained for our automaton, when using the ClamAV signatures and also when using our min-based common signature extraction approach. For max, a very similar distribution was obtained and is not included, for clarity.

It is easy to deduce that, when splitting signatures into groups of two bytes, the automaton shrinks in size considerably. This has been observed in the number of nodes obtained after performing our analysis, shown in Figure 5.10. The density of the children distribution obtained as a result of the signature fragmentation process presented above is shown in Figure 5.11. Table 5.4 shows the actual numbers for the different min values considered.

Table 5.4: The number of nodes having from 0 up to 6 children in the constraint-based Aho-Corasick automaton [Pun13]

  No. of children       0          1           2        3        4       5       6
  Default            67,046    6,882,919    14,174    3,431    1,091     533     262
  min=2             618,980    1,316,209     9,138    2,141    2,942   4,562   6,504
  min=3             197,492    5,462,903    28,309   11,394    5,357   2,918   1,775
  min=4             141,090    5,930,441    25,379    8,112    3,425   1,736     961
  min=5             118,409    6,190,955    22,634    6,679    2,654   1,374     743

An empirical observation from Table 5.4 shows that the leaf nodes in the default tree number over 67,000, which is more than the number of signatures (62,302). The difference is explained by the fact that there are many signatures using regular expressions, which themselves contain different patterns that are inserted in the tree - therefore, a
The high-number of single-childed nodes in the automaton shows that employing path-compression here could produce much lower memory usage for this automaton. As also discussed in Section 3.4.4, the lists used for storing match positions are dynamically modified: when a match is found, if the match is part of a regular expression, a check is being made to determine if the previous subexpression node has been matched before, and if the constraints are being met (constraints of types {n-} are the only ones that, when matched, permanently store the match position), store the match position at the current node in the list; if constraints are not met, remove all permitted previous matches. We have implemented two different approaches to storing the information for each node: the first uses a sorted linked list with pointers to all children of the automaton (an approach also discussed in Section 6.2.4), which requires logarithmic matching time for verifying a transition using a divide-et-impera approach, and a second which uses bitmapped nodes (similar to the idea proposed in [ZS08, PN12]), PhD Dissertation 112 Figure 5.12: Children density distribution in the automaton built from the ClamAV database. Similar distributions (not shown) as for min were also obtained for max. which uses slightly more memory but at the benefit of a throughput increase. Based on the empirical observations from Figure 5.11 and Table 5.4, for the bitmapped implementation we have built four different types of nodes: type I, for nodes having from 0 to 15 transitions, type II for nodes having from 16 to 48 transitions, type III, for generic bitmapped nodes and type IV for compressed nodes. In fact, in the logarithmic matching time implementation, to even further reduce memory usage, types I, II and III all have two sub-types: the first includes the bitmapSequences, pointer and matches fields (which only apply to nodes that are leaves in a subexpression), while the second sub-type does not have these fields (this only applies to nodes that are not leaves). In conclusion, the bitmapped implementation has 4 types of nodes, while the non-bitmapped implementation uses 7 types of nodes (4 primary types and 3 sub-types), as shown in Figure 5.12. Additionally, the transition field is used for nodes whose parent node is type I or II. The leaf field determines the presence of the pointerToMatches and pointerToPatterns fields. To avoid creating gaps in memory, type IV (bitmapped) nodes are followed in the actual stack by the pattern itself (as a normal sequence of bytes), after which the next node structure follows. Our implementation followed the data-parallel approach presented in Section 3.4.5. PhD Dissertation 113 Figure 5.13: Memory usage for different implementations of the hybrid-compression mechanisms in the Aho-Corasick automaton. 5.2.3 Performance Evaluation We tested the throughput of the automaton on a file comprised of 50 MB random binary data. We have applied different compression techniques in order to determine the most efficient, as follows: we tested the default (uncompressed) automaton against a bitmapped implementation, a path-compression implementation, an extended bitmapped implementation using path-compression, and a Smith-Watermanbased implementation featuring no other compression techniques, but also featuring, separately and combined, path-compression and bitmapped implementations. We have obtained various results which are depicted in Figure 5.13. 
Applying path-compression to the ClamAV virus signature dataset used in the unoptimized automaton had determined a decrease in memory usage of more than 64%, while a bitmapped implementation (without path-compression) had decreased memory usage by 27%. A bitmapped implementation featuring path-compression re- PhD Dissertation 114 Figure 5.14: The number of local hits achieved at run-time. duced the memory usage by more than 78%. Applying the Smith-Waterman-based algorithm we proposed, the compression ratios were even better, with the best occurring for a minimum subsequence of length 2; the primary drawback here was that the local hit ratio (the total number of hits in leaves of the automaton) had increased exponentially, which was to be expected given that the tree was highly dense at a depth of 2, making almost every possible two-byte combination a hit. The fragmentation approach we proposed becomes however efficient for a high number of patterns to be matched, with a large number of common subsequences of minimum lengths of 3 or 4, where the probability of a local hit decreases as the minimum length increases (Figure 5.14). The local hit rate was very high when using a minimum subsequence length of 2, with the throughput decreasing to about 0.1 Mbps for the single-threaded implementation and 0.4 Mbps for the 8-threads implementation. For a minimum length of 3 (Figure 5.15), the throughput decreased by 25% compared to the standard implementation (although the local hit ratio was 4 times higher than that of the default implementation), while for consecutive lengths it was kept at the same level as the initial approach, because of lower local hit rates. It is worth observing a close similarity between the Wu-Manber [WM94] hashing technique when using 2-bytes and our Smith-Waterman-based approach for a minimum subsequence length of 2. The Wu-Manber algorithm can be successfully used when using 2-bytes hashing for a small number of patterns only, while for a large number the authors suggest using 3-bytes hashing; a similar empirical observation can be made for our approach, since for a minimum subsequence length of 3 or more, the number of local hits had been reduced considerably and the run-time performance PhD Dissertation 115 Figure 5.15: The throughput of the automaton when using our Smith-Watermanbased approach. of the automaton had increased up to the level of the uncompressed implementation, although the memory usage had decreased considerably. An open-problem here still lies in the ability to quickly perform the Smith-Waterman decomposition of patterns, which is not a trivial problem and requires significant processing power. In conclusion, we believe that our approach can be successfully applied to different pattern datasets wherever memory usage is a critical factor, assuming that the proper analysis is performed upfront on the dataset and the proper implementation decisions are taken. 
5.3 Summary This chapter had discussed two common problems in the Aho-Corasick pattern matching automaton: the first, presented in Section 5.1 deals with implementing behavioral heuristics in pattern-matching automata, and how such heuristics can be applied to malicious code detection in particular, and we have also proposed a new heuristic, based on a distance metric, for assessing the threat level of a program by analyzing the system-call sequences that the program contains; the second problem, discussed in Section 5.2 deals with the various approaches existing today for reducing memory usage in the Aho-Corasick automaton, and we have proposed (in Section 5.2.2) a new, hybrid-approach to achieving better compression by employing a dynamicprogramming algorithm (presented in Section 5.2.1) and performing pattern decomposition. PhD Dissertation 116 The benefits of our approach, when it comes to performing malicious code detection, are outlined as follows: • the algorithm proposed is very fast and has a throughput of several hundreds of thousands of system-calls per second, which is a lot more than any contemporary program would use; • our approach does not require the code to be executed upfront, making it suitable for both static and dynamic (run-time) analysis in malicious code detection; • our approach is, to the best of our knowledge, the first to use time-dependency in the heuristic employed; • our approach is based on a metric which can be easily changed and adapted to the model desired. In regards to the hybrid compression mechanism, our approach’s particular characteristics are outlined below: • our approach is the first to use the dynamic programming aspect for performing pattern decomposition, and the first to propose a model for efficiently storing the automaton using the new datasets of patterns obtained; • our formalized model can be applied to any type of pattern, with any alphabet and of any size; • our approach represents the first to analyze the impact of hybrid compression techniques on the virus signatures existing in ClamAV database, and obtain conclusive experimental results about the degree to which the memory efficiency can be improved in real-world implementations. We believe that there is still room for much improvement here in both areas, as we have already pointed in the different sections throughout this chapter, therefore both the behavioral analysis and compression problems of the Aho-Corasick automaton remain open to debate. As a final thought, we are hoping that we have made small, yet significant steps towards improving the performance of the problems we tackled with throughout this chapter. Chapter 6 Improving Carving Analysis in the Digital Forensics Process Contents 6.1 6.2 6.3 Challenges of File-Carving in Digital Forensics . . . . . . 118 6.1.1 File Fragmentation . . . . . . . . . . . . . . . . . . . . . . . 118 6.1.2 Forensic Analysis . . . . . . . . . . . . . . . . . . . . . . . . 120 6.1.3 Quantitative Criteria for Measuring Quality . . . . . . . . . 122 6.1.4 Fighting Fragmentation . . . . . . . . . . . . . . . . . . . . 123 Improving File-Carving Through Data-Parallel Header and Structural Analysis . . . . . . . . . . . . . . . . . . . . 124 6.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.2.2 A Comparison Between Scalpel and the TrID Pattern Scanner126 6.2.3 Defining Signatures for File-Carving . . . . . . . . . . . . . 127 6.2.4 Signatures as Regular Expressions . . . . . . . . . . . . . . 
The data recovery process is vital in the business environment and, although safety measures can be employed to reduce the risks of data loss or corruption, important data can still be lost as a result of mechanical defects or faulty human intervention. In digital forensics, data recovery is often used to collect forensic evidence to be used by prosecutors against criminal suspects. In this chapter we discuss the challenges involving data recovery in the digital forensics process, and we propose and test a new and efficient method for identifying and collecting the metadata necessary to the human operator in the forensic analysis process.

6.1 Challenges of File-Carving in Digital Forensics

The data recovery process in digital forensics is based on the file-carving process. In general, the term carving, as also pointed out by Kloet in his Master's Thesis [Klo07], refers to extracting files from raw, unformatted and unordered binary data, based on the particular characteristics of that data. The extraction process does not use the filesystem information to reconstruct the file, just the information contained within. Mikus had pointed out in his Master's Thesis [Mik05] that disc carving is "an essential aspect of Computer Forensics". Metz and Mora of Hoffman Investigations formed one of the teams competing in the Digital Forensics Research Workshop (DFRWS) in 2006, responding to the challenge raised at that time of building an efficient file carving algorithm that recovers more files and reduces the number of false positives. While the first attempts were mainly performed manually, which is not scalable to higher data sizes as Mora pointed out, in 2007 the challenge bar was raised to designing completely automated tools that perform the same task. The team formed by Metz and Mora had initially made an empirical observation on the speed of a file-carver, stating that it should process a theoretical 100 GB of unallocated data daily, or about 1.16 MB per second, with tools handling less than half of that amount being considered unusable.

6.1.1 File Fragmentation

Databases (or datasets, as they are called in [Klo07]) are bit-by-bit clones of the hard-disk drives (or any other type of storage device, for that matter) to be examined through the digital forensics process. The nature of such databases is dynamic, since data is continuously read from and written to the device over time. One major challenge in the investigation process is represented by file fragmentation (Figure 6.2). A fragmented file is a file whose contents have been spread over several different, non-consecutive areas of the disk; in other words, a file which has, between the starting address of its contents and its ending address, one or more gaps which are empty or occupied by other files. Kloet in [Klo07] classifies fragmented files as being of two types: with linear fragmentation or with non-linear fragmentation. In the case of linear fragmentation, the parts of the file are present in the database in their original order, while with non-linear fragmentation the order is different. Modern filesystems contain a file allocation table, where information about each file and directory structure is usually kept (e.g. access time, creation time, size, access rights, etc.).
Whenever a file is deleted, it would be impractical (and slow, for larger files) to erase its contents on disk byte-by-byte, which is why the only operation usually performed by a filesystem is the removal of the file information from the file allocation table. While this approach is fast, it has two potential effects: first, the file could be recovered by restoring the entry in the file allocation table; second, the file's contents (now marked as free space on the disk by the filesystem) may be overwritten by other files created in the meantime. Therefore, certain deleted files in this situation are only partially recoverable.

Figure 6.2: An example of how a disk can have fragmented files (disk was analyzed using our custom-built program). Each block represents a sector on the disk.

6.1.2 Forensic Analysis

The recovery techniques can vary, from header & footer based carving, to structural and content-based analysis. Each of these methods has its own disadvantages, which we discuss in what follows. Nevertheless, it is important to notice that the forensic analysis process is in fact another application of pattern matching algorithms, as it mainly involves locating pieces of the data of interest. Additionally, the results provided by the pattern matching algorithms may be used by different heuristic approaches to reconstruct the original file of interest to the investigator. Carrier in [Car05] had proposed an organizational chart for the different types of forensic analysis, depicted in Figure 6.3.

Figure 6.3: Different forensic analysis types as proposed by Carrier in [Car05].

6.1.2.1 Header and Footer Analysis

Analyzing the header and footer assumes that the forensic investigator has a precompiled list of known headers for different types of files (e.g. JPEG, PNG, TIFF images or other types of documents), along with a potential footer, which could identify that file accurately. The contents between the header and the footer would be relevant to the investigator because they could potentially be the actual contents of the file. Figure 6.1 shows the structure of a PNG image file.

Figure 6.1: An example of the PNG image file header, footer and contents, as discussed by Kloet in [Klo07].

However, for large files or for a highly fragmented storage device, additional data may be placed between the header and the footer. Furthermore, assuming the non-linear fragmentation pattern we have discussed above, it may happen that a file's footer comes before the actual header, which could cause false positives in the forensic analysis or could even cause the entire process to fail. The primary problem of this type of analysis is the low accuracy obtained in highly fragmented file systems. Kloet in [Klo07] states that another direct consequence of this limitation is that the file size of the recovered file is at least the same as that of the original, which is not entirely true: if a non-linear fragmentation pattern is employed for the file, and the footer is found before the header on the storage device, the recovery process would rely heavily on additional heuristics for reconstructing the original file, which may produce a smaller file or even no file at all, if insufficient data exists.
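To make the above limitations concrete, the sketch below shows a naive header-and-footer carver for PNG files, relying on the well-known eight-byte PNG header and the constant bytes of the IEND footer; it simply reports the span between the first header and the first footer that follows it, which is precisely the strategy that breaks down on fragmented or reordered data. The function names, buffer handling and toy input are illustrative only, not the implementation evaluated later in this chapter.

#include <stdio.h>
#include <string.h>

/* Naive header/footer carving of a PNG file from a raw buffer: find the
 * 8-byte PNG header, then the constant IEND footer that follows it, and
 * report the enclosed span. Fragmentation or footer-before-header layouts
 * defeat this approach, as discussed above. */
static const unsigned char PNG_HDR[8] = {0x89,'P','N','G',0x0D,0x0A,0x1A,0x0A};
static const unsigned char PNG_FTR[8] = {'I','E','N','D',0xAE,0x42,0x60,0x82};

static long find_bytes(const unsigned char *buf, long n,
                       const unsigned char *pat, long m, long from)
{
    for (long i = from; i + m <= n; i++)
        if (memcmp(buf + i, pat, m) == 0)
            return i;
    return -1;
}

/* Returns the carved length, or 0 if no complete header/footer pair exists. */
static long carve_png(const unsigned char *raw, long n, long *start)
{
    long h = find_bytes(raw, n, PNG_HDR, 8, 0);
    if (h < 0) return 0;
    long f = find_bytes(raw, n, PNG_FTR, 8, h + 8);
    if (f < 0) return 0;
    *start = h;
    return (f + 8) - h;   /* header .. end of footer, contiguity assumed */
}

int main(void)
{
    unsigned char img[32] = {0};          /* toy raw buffer, not real image data */
    memcpy(img, PNG_HDR, 8);
    memcpy(img + 20, PNG_FTR, 8);
    long start = 0, len = carve_png(img, sizeof img, &start);
    printf("carved %ld bytes starting at offset %ld\n", len, start);
    return 0;
}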
6.1.2.2 Structural Analysis

Structural analysis starts from the concept of analyzing and identifying the structural elements of the file in question. Given, for instance, the PNG image structure in Figure 6.1, the structural analysis would take into consideration the header and footer locations, along with the different locations where the identifier IDAT (which is associated with the size, in bytes, of a block of data in the file) appears, and would attempt to recover from that location the amount of data the identifier specifies. This would of course not be a perfect reconstruction, especially since the size value of the identifier may have already been overwritten by another file's contents, but it gives the investigator a good chance of obtaining partially valid content for that file. By adding heuristics to the investigation process, an automated tool would be capable of estimating how the file can be reconstructed, and of attempting its reconstruction. If the estimation fails, the resulting file would be invalid or incomplete, which would yield only partially recoverable content, and the file would become a false positive. If the heuristics fail, the file would be discarded, leading to a lost file in the database, or a false negative. This method works best when there is sufficient information about a file's structure which could be carved and therefore reconstructed. In essence, structural analysis builds a map of the original file by putting together many pieces of the puzzle, and from that point onwards it employs diverse heuristics for guessing what would fit in the missing spots. The more information a puzzle contains, the easier it is to reconstruct the big picture. Keeping the analogy, the more quantitative information a portion of the file holds, the easier it is for the heuristics employed to reconstruct it with greater accuracy.

6.1.2.3 Storage-Based Analysis

Storage-based analysis (also called block content based carving in [Klo07]) works best by analyzing the type of storage the database had been retrieved from. Different storage devices, such as hard-disk drives, SSDs, floppy disk drives, etc., use different block sizes to read and write information (e.g. a block could hold 512 bytes on early magnetic disks, or 4,096 bytes in the NTFS filesystem). Information is mostly written in sectors (e.g. blocks of 512 bytes for some hard-disks), meaning that over a certain number of blocks in some storage devices, data is guaranteed not to be fragmented in any way. This type of analysis performs a statistical evaluation of the bytes in a block and employs different heuristics to determine whether they are relevant to the forensic investigation or not. A common approach to performing statistical evaluation in storage-based analysis is based on entropy computation. The entropy, according to [Wika], refers to the average number of bits used to encode a symbol: for instance, compressed or randomly generated data has a much higher entropy than plain text data.
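As an illustration of the statistical evaluation mentioned above, the short sketch below computes the Shannon entropy of a block, in bits per byte: values close to 8 are typical of compressed or encrypted content, while plain text scores considerably lower. The block size and function name are merely illustrative, not part of a specific carving tool.

#include <math.h>
#include <stdio.h>

/* Shannon entropy of a data block, in bits per byte: a common heuristic in
 * storage-based (block content based) carving for deciding whether a block
 * looks like text, compressed data or random/encrypted content. */
static double block_entropy(const unsigned char *block, size_t n)
{
    unsigned long counts[256] = {0};
    for (size_t i = 0; i < n; i++)
        counts[block[i]]++;
    double h = 0.0;
    for (int c = 0; c < 256; c++) {
        if (counts[c] == 0) continue;
        double p = (double)counts[c] / (double)n;
        h -= p * log2(p);          /* contribution of symbol c */
    }
    return h;                      /* ranges from 0.0 to 8.0 bits per byte */
}

int main(void)
{
    unsigned char block[512];      /* synthetic block with evenly spread bytes */
    for (size_t i = 0; i < sizeof block; i++)
        block[i] = (unsigned char)(i & 0xFF);
    printf("entropy: %.3f bits/byte\n", block_entropy(block, sizeof block));
    return 0;
}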
6.1.3 Quantitative Criteria for Measuring Quality

One of the first quantitative criteria for quality measurement was proposed by Manning in [MS99]. Manning defines the recall as the proportion of the relevant items that the system actually selected, the precision as the proportion of the selected items that were identified correctly, and a combined metric of the two for evaluating the overall performance, called the F-measure:

recall = p / (p + m_neg),    (6.1)

where p is the number of matches that are true positives and m_neg is the number of false negatives;

precision = p / (p + m_pos),    (6.2)

where m_pos is the number of false positives;

F_measure = 1 / (α · (1/P) + (1 − α) · (1/R)),    (6.3)

where P is the precision, R is the recall and α ∈ [0, 1] is a weight indicating the preference for precision or recall.
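To make equations (6.1)-(6.3) concrete, the small example below computes the three metrics for a purely hypothetical carving run (the counts are not taken from our experiments); the weight α = 0.5 yields the usual balanced F-measure.

#include <stdio.h>

/* Quality metrics for a carving run, following equations (6.1)-(6.3):
 * p      - correctly carved files (true positives)
 * m_neg  - files missed by the carver (false negatives)
 * m_pos  - files carved incorrectly (false positives)                */
static double recall(double p, double m_neg)    { return p / (p + m_neg); }
static double precision(double p, double m_pos) { return p / (p + m_pos); }
static double f_measure(double P, double R, double alpha)
{
    return 1.0 / (alpha * (1.0 / P) + (1.0 - alpha) * (1.0 / R));
}

int main(void)
{
    /* purely illustrative counts */
    double p = 40, m_neg = 10, m_pos = 5;
    double R = recall(p, m_neg), P = precision(p, m_pos);
    printf("recall=%.2f precision=%.2f F=%.2f\n", R, P, f_measure(P, R, 0.5));
    return 0;
}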
Based on the above, Kloet [Klo07] proposes a metric for measuring the qualitative performance and defines normative scores for the performance of file carving tools, which are presented in Table 6.1 and which are used to evaluate several such tools.

Table 6.1: The qualitative scores as proposed by Kloet in [Klo07]

Descriptive score range      Evaluation
0 ≤ score < 0.5              Bad
0.5 ≤ score < 0.75           Mediocre
0.75 ≤ score < 0.85          Good
0.85 ≤ score < 0.95          Very good
0.95 ≤ score < 1.0           Almost perfect
score = 1.0                  Perfect

Hoffman and Metz have proposed the initial empirical metric discussed in Section 6.1 for measuring carving speed and performance, summarized in Table 6.2.

Table 6.2: The empirical evaluation of the carving speed, proposed by Hoffman and Metz

Throughput (MB/s)                Evaluation
bandwidth < 0.58                 Not useful
0.58 ≤ bandwidth < 1.16          Useful in testing only
1.16 ≤ bandwidth                 Useful in real-world environments

Kloet [Klo07] had also performed a benchmark of several file-carving tools, including Foremost, FTK, PhotoRec, Recover My Files and Scalpel. The results, based on metrics similar to those defined above, have shown that Recover My Files was the best performer, having a very good carving precision and an almost perfect carving recall, followed closely by Foremost and PhotoRec. These last two programs use a combined methodology, performing header and footer analysis for some files and structural analysis for a (limited) number of other files. FTK and Scalpel were the worst performers in the field, mainly because the former does not know all the file types and is less efficient, while the latter only uses header and footer analysis when carving. The author also proposes MultiCarver, a file carver aiming to support multiple different types of carving in one single application (Figure 6.4).

Figure 6.4: The Multi-Carver architecture as proposed by Kloet in [Klo07].

6.1.4 Fighting Fragmentation

Memon et al. in [BM06] have proposed a method to reassemble the fragments of an image using greedy algorithms. By constructing a weighted graph, they reduce the problem to a maximization or minimization problem, and propose three different approaches (pixel matching, sum of differences and median edge detector) using variants of greedy heuristics, achieving a reconstruction rate of up to 86% on their tested datasets. An immediate disadvantage of this work is that the approach works mainly for images, and needs adaptation or rewriting for additional filetypes or formats. A sequential hypothesis approach for solving the fragmentation problem was proposed by Pal et al. in [PSM08]. Their work focuses on determining the fragmentation point of a file by sequentially analyzing adjacent pairs of blocks in the file until the point of interest is reached. Their approach relies on an adaptation of Dijkstra's shortest path algorithm proposed in [Dij59], called Parallel Unique Path (PUP), that can reassemble a file by analyzing the headers of a file and then performing a sequential analysis where the best match for each header is chosen and merged with the header. The process continues until the file is fully reconstructed. They propose a few additional heuristics which enhance the detection rate and reduce processing time several times over. Thing et al. in [TCC11] have proposed an architecture for the detection and recovery of fragments for forensic purposes, where they were able to reduce the processing time to half of that of a known commercial solution.

6.2 Improving File-Carving Through Data-Parallel Header and Structural Analysis

In this section we present the file-carving architecture proposed in [Pun12] for improving the structural analysis of files in digital forensic investigations.

6.2.1 Background

The motivation for improving file-carving begins with the history of carvers: one of the first file-carvers built was Foremost [For], which was the basis for the popular open-source alternative Scalpel [Ric05]. The Scalpel set of rules is described in the external file scalpel.conf, which allows the user to configure how the data is to be carved. Scalpel (which may be downloaded from [Sca]) performs header and footer analysis, as discussed in Section 6.1.2.1, by specifying patterns for the headers of the files it is capable of carving (unique sequences of bytes to be identified in the raw data contents, for instance <html for HTML documents), and also for the footers (for instance, </html> for HTML documents). In order to address the false positives issue which most file carvers can produce, an approach known as in-file carving had been proposed by Richard et al. in [RI07]. Their approach generates sufficient metadata to allow a human expert operator to intervene later on and determine whether the recovered contents are accurate or not. The slow performance of this methodology, which is nevertheless faster and uses less storage than other similar solutions, has been pointed out by Zha et al. in [Zha], where they have also proposed a different approach for speeding up the algorithm by employing the Aho-Corasick [AC75], Boyer-Moore [BM77] and Wu-Manber [WM94] algorithms in a multi-threaded implementation. As a result, the worst-case optimality of the Aho-Corasick automaton had produced significantly better results compared to the initial approaches. One freeware solution that has existed since 2003 in the online community, and which is partially community-based, is the TrID pattern scanner [tri]. The software's purpose is to identify file types by analyzing their contents, and it is complemented by the TrIDScan software, a module that creates signatures for new file types by analyzing groups of multiple files belonging to the same type, extracting relevant information (or metadata) particular to those file types (based on the properties of the files and on their contents) and compiling it into a definition to be used by the TrID scanner. The resulting signatures produced by the TrIDScan software are written as XML-based sequences of bytes that uniquely identify that type of file from that point onwards.
Of course, the approach can be prone to false positive matching, assuming that:

• a small number of files was analyzed, with highly similar contents - in this case, two scenarios may happen: the program may produce long signatures which may also produce false positives if the extracted information is not relevant (for example, when dealing with text files with common and repetitive textual patterns in them), or may be so strict that the identification process would fail for other files of that type;

• a high number of files was analyzed, with a high degree of differences in their contents - in this situation, there is a high probability that the signatures would be invalid or would produce false positives, since it would be very difficult to extract fixed-length patterns of data which exist in all of the analyzed files.

Memory investigation has been of strong forensic interest in the past few years. With the potential ability to implement intrusion detection systems or detect perpetrators, this topic becomes even more interesting as it relies on the file-carving techniques presented in Section 6.1.2. Interesting approaches to this problem have been proposed by Hejazi et al. in [HDT08], where a methodology for an automated extraction of files from Windows memory had been proposed. The authors have been able to extract several filetypes using their method, which is similar to the file-carving technique although it is applied to the Windows memory layout; once obtained, the files could be correlated with other sources of data found on disk. Butler and Murdock have proposed in [BM] a method for reconstructing the process space in physical memory, by outlining EXE and DLL (Dynamic Link Library) file structures and performing structural analysis (described in Section 6.1.2.2) on the physical memory areas. They had also discussed the ability to extract files from the shared cache, extending the discussion to other file-types.

6.2.2 A Comparison Between Scalpel and the TrID Pattern Scanner

Scalpel's [Ric05] limitations lie mostly in the number of pattern signatures existing in its database. At the time of this writing, the number of definitions in the Scalpel configuration file was about 60, which is not only low, but also leaves many other file types undetectable. While Scalpel focuses mainly on the ability to detect the most common file types known today, by employing multiple pattern matching, a forensic investigator often needs additional document types in a thorough investigation (e.g. spreadsheet documents, compressed archives, videos of different formats and encodings, etc.). Another important limitation of Scalpel is related to the way the program scans for files: currently, the program only supports the ability to define raw header and footer contents, as mentioned in Section 6.1.2.1, and bases its matching strategies on the premise that both can be successfully identified. However, as pointed out in Sections 6.1.2.2 and 6.1.2.3, this type of strategy is not sufficient and could fail in many scenarios and situations, because it does not take into account any structural features of the filetype, nor any relevant metadata existing in the file, which may later be used for identifying pieces of the file throughout the storage device (if the file is, for instance, fragmented). A direct consequence is that the post-analysis process involved in digital forensics is only going to be lengthier and more difficult, with additional human operator intervention required.
The TrID pattern scanner has had a long history since it first appeared online. Although it is not open-source, it is freeware (available for both the Windows and Linux operating systems) and, most importantly, it offers a community-based definitions database, to which people can contribute whenever they have a new file-type to add. This does not make TrID suitable for direct file-carving, since the tool had not been built for such a purpose and it cannot read through raw data like a file-carver does; but given that its database has grown considerably ever since its appearance in 2003, to over 4,800 patterns for known filetypes along with their definitions, there is a high degree of interest in this tool from the online community, which could ultimately prove beneficial to the file-carving process.

6.2.3 Defining Signatures for File-Carving

The Scalpel configuration stored in scalpel.conf allows definitions in a simple form, also supporting simple wildcards such as ?. An example for a PNG image file, taken from its database, is listed below:

png y 2500000 \x89\x50\x4e\x47\x0d\x0a\x1a\x0a

The TrID pattern scanner uses an XML-based format for defining signatures, and the definition has several areas. The first is the general information about the file, having two sections. The first section is the Info section:

<TrID ver="2.00">
  <Info>
    <FileType>Java Bytecode</FileType>
    <Ext>class</Ext>
    <ExtraInfo>
      <Rem></Rem>
      <RefURL></RefURL>
    </ExtraInfo>
    <User>Marco Pontello</User>
    <E-Mail>marcopon@nospam@gmail.com</E-Mail>
  </Info>

where information is stored about the filetype, the extension it usually has, any extra information relevant to this file type (e.g. a URL), and also the user who provided this file definition (in this case, the author of TrID). The next section is General, and contains information about the definition itself: the date when the definition was added and whether to check for internal strings (metadata) in the file once the header information is found.

<General>
  <FileNum>80</FileNum>
  <CheckStrings>True</CheckStrings>
  <Date>
    <Year>2003</Year>
    <Month>11</Month>
    <Day>14</Day>
  </Date>
  <Time>
    <Hour>03</Hour>
    <Min>10</Min>
    <Sec>51</Sec>
  </Time>
</General>

The actual definition used by the TrID software lies in the FrontBlock, Pattern and GlobalStrings areas of the XML.

<FrontBlock>
  <Pattern>
    <Bytes>CAFEBABE00</Bytes>
    <Pos>0</Pos>
  </Pattern>
  <Pattern>
    <Bytes>00</Bytes>
    <Pos>6</Pos>
  </Pattern>
  <Pattern>
    <Bytes>00</Bytes>
    <Pos>11</Pos>
  </Pattern>
</FrontBlock>
<GlobalStrings>
  <String>CODE</String>
  <String>INIT</String>
  <String>JAVA</String>
</GlobalStrings>

Here, the software finds information about the header of the file (in the FrontBlock area) and each of the unique patterns found in the header along with their position relative to the header (e.g. Pattern describes one such pattern, with the element Bytes describing the actual sequence of bytes, for instance CAFEBABE00, and Pos describing the position relative to the header, in this case 0, meaning the file's header begins with this sequence). Currently, TrID does not seem to offer a way to define the footer of the file but, as we had already discussed in Section 6.1.2 about the different fragmentation types, footers can be difficult to detect in non-linear fragmentation models.
There is however a downside to the community-based submission process implemented in TrID: certain submissions are false positives, which we have also observed in our implementation and experimental research using them. We have found that many patterns are completely unusable since they are too generic (the number of false positives was too high), or because they have been incorrectly submitted or analyzed (for instance, when performing the analysis on an insufficient number of sample files in the group).

6.2.4 Signatures as Regular Expressions

For our own approach as presented in [Pun12], we have written an XML parsing tool which processes the XML definitions of TrID and stores them in our internal format. There is a high resemblance between the ClamAV [cla12] regular expression virus signatures and the TrID XML-based file-signature storage format. For a start, the same alphabet is used in the definitions of both programs, and all strings are expressed in hexadecimal format as well. The approach we have proposed in [Pun12] uses a background similar to the extended, constraint-based Aho-Corasick automaton from Section 3.4.4. As a reminder, this automaton supports regular expressions with four types of constraints, the ones most commonly found in the ClamAV definitions database as pointed out in Section 3.2.1. We have analyzed the TrID definitions database and have concluded that it is feasible to convert the signatures from the XML-based format they are currently in into a new, regular-expression-based format similar to the four types of constraints discussed. For the implementation in [Pun12], we have used a different approach for storing the automata, which stores at each node a sorted linked list of its children (Figure 6.5). This way, locating an element is equivalent to the problem of locating an item in a sorted list, of complexity O(log2(N)), where N is the number of elements in the list (in our case, the number of children of a node).

Figure 6.5: The sorted linked-list of children in a tree-based structure.

We have written a parser for converting the XML-based signatures into a regular expression format similar to that used in the ClamAV signature definitions in Section 3.2.1, by combining all Pattern elements into a single expression, as follows:

• We start with an empty regular expression.

• For each Pattern element of index i, we store the Bytes element p_i (of length l_i bytes) and the Pos value pos_i, which represents the relative position in the file where this pattern should occur, and concatenate p_i to the final expression.

• Before the pattern at index i+1 is appended, the expression is extended with a fixed-distance constraint covering the gap of pos_(i+1) − (pos_i + l_i) bytes between the end of pattern i and the start of pattern i+1, and only then with p_(i+1).

• The final regular expression thus represents a concatenation of all patterns found in the XML description of the TrID signature database (a short sketch of this construction is given below).
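A minimal sketch of this construction step is shown below; the actual implementation parses the TrID XML definitions directly, whereas here the pattern list and all names (struct trid_pattern, build_signature) are hard-coded and purely illustrative, mirroring the worked example that follows.

#include <stdio.h>
#include <string.h>

/* One FrontBlock pattern from a TrID definition: a hex byte string and the
 * relative position at which it must occur. */
struct trid_pattern { const char *hex; int pos; };

/* Builds a ClamAV-style regular expression from the FrontBlock patterns:
 * consecutive patterns are separated by a fixed-distance constraint {g-g},
 * where g is the gap between the end of one pattern and the start of the
 * next; GlobalStrings are appended as hex, separated by '*'. */
static void build_signature(const struct trid_pattern *p, int np,
                            const char **global, int ng, char *out, size_t cap)
{
    out[0] = '\0';
    int end = 0;                             /* offset right after previous pattern */
    for (int i = 0; i < np; i++) {
        int len = (int)strlen(p[i].hex) / 2; /* two hex chars per byte */
        if (i > 0) {
            char gap[32];
            snprintf(gap, sizeof gap, "{%d-%d}", p[i].pos - end, p[i].pos - end);
            strncat(out, gap, cap - strlen(out) - 1);
        }
        strncat(out, p[i].hex, cap - strlen(out) - 1);
        end = p[i].pos + len;
    }
    for (int i = 0; i < ng; i++) {
        strncat(out, "*", cap - strlen(out) - 1);
        for (const char *c = global[i]; *c; c++) { /* ASCII -> hex */
            char b[3];
            snprintf(b, sizeof b, "%02X", (unsigned char)*c);
            strncat(out, b, cap - strlen(out) - 1);
        }
    }
}

int main(void)
{
    const struct trid_pattern pats[] = {
        { "4D4D", 0 }, { "0002000A000000030000003D3D", 5 }, { "00", 21 },
        { "0A000000", 24 }, { "000000", 29 }, { "000000", 36 } };
    const char *globals[] = { "SOME", "MORE", "DATA" };
    char sig[512];
    build_signature(pats, 6, globals, 3, sig, sizeof sig);
    printf("%s\n", sig);   /* reproduces the regular expression shown below */
    return 0;
}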
For example, the patterns below:

<FrontBlock>
  <Pattern>
    <Bytes>4D4D</Bytes>
    <Pos>0</Pos>
  </Pattern>
  <Pattern>
    <Bytes>0002000A000000030000003D3D</Bytes>
    <Pos>5</Pos>
  </Pattern>
  <Pattern>
    <Bytes>00</Bytes>
    <Pos>21</Pos>
  </Pattern>
  <Pattern>
    <Bytes>0A000000</Bytes>
    <Pos>24</Pos>
  </Pattern>
  <Pattern>
    <Bytes>000000</Bytes>
    <Pos>29</Pos>
  </Pattern>
  <Pattern>
    <Bytes>000000</Bytes>
    <Pos>36</Pos>
  </Pattern>
</FrontBlock>
<GlobalStrings>
  <String>SOME</String>
  <String>MORE</String>
  <String>DATA</String>
</GlobalStrings>

would be transformed into the regular expression:

4D4D{3-3}0002000A000000030000003D3D{3-3}00{2-2}0A000000{1-1}000000{4-4}000000*534F4D45*4D4F5245*44415441

In order to assemble the regular expression, the starting 4D4D pattern is inserted, and its length is l_0 = 2. The next pattern, 0002000A000000030000003D3D, begins at position pos_1 = 5, which means that it will be inserted after the previous pattern and a distance of pos_1 − (pos_0 + l_0) = 3 must be kept; this is why the {3-3} expression is inserted between the two, denoting that exactly three characters (of any value) must exist between the two patterns. The logic continues for the remaining patterns. After the FrontBlock element has been fully processed, the remaining patterns in the GlobalStrings element are inserted (still as hexadecimal characters), using * separators to denote any number of characters between the header and this metadata. It is worth noticing that this approach assumes that the GlobalStrings patterns are mentioned in the logical order of their appearance, but the algorithm does not necessarily need to use the same logic when looking for matches, especially considering the potential file fragmentation in the filesystem. Instead, for the GlobalStrings patterns, it suffices to mark their matching locations in the leaf nodes of the automaton and use that information in an automated tool for reconstructing the file, or have it analyzed by a human operator supervising the process. An immediate benefit is the high scalability of other, commonly intrusion detection-related algorithms to the problem of file-carving, with immediate performance benefits and advantages. Another benefit is the reduction of the storage space required for the patterns.

6.2.5 Extracting Relevant Patterns

After building the regular expressions and designing the mechanism for employing pattern matching using our modified version of the constraint-based Aho-Corasick algorithm, the final step before performing the experiments was the elimination of false positives from the TrID definitions database. Using patterns of length 2, even as parts of regular expressions, produced many false positives in our tests on a set of regular files from a Windows 7 disk, and even decreased the throughput performance of the automaton, since many local hits occurred at depth 2 in the tree. In order to avoid such situations, we have worked with a minimum pattern length of 3, and have conducted different experiments to determine how the pattern length affects the number of definitions. We started with a set of 4,610 definitions; the numbers of valid signatures obtained after removing the false positives are shown in Tables 6.3 and 6.4.
For a minimum length of 12, we observed that the same number of signatures was obtained as when using a minimum length of 16, which is why our research stopped at this threshold.

Table 6.3: Number of patterns of lengths 3 to 7, extracted after pre-processing the XML definitions

Pattern length       3      4      5      6      7
Patterns extracted   3,052  2,838  2,031  1,864  1,666

Table 6.4: Number of patterns of lengths 8 to 12, extracted after pre-processing the XML definitions

Pattern length       8      9      10     11     12
Patterns extracted   1,556  1,263  1,129  1,049  992

6.2.6 Performance Evaluation

Our testing methodology used an intentionally damaged USB stick with 128 MB of storage space, using an NTFS filesystem. We initially filled the disk with different filetypes of different sizes, after which the raw data of the disk was read and parsed by our processing algorithm and all partial and full matches were reported. In order to accurately determine the performance of the modified constraint-based Aho-Corasick algorithm, we tested it against the classic Aho-Corasick implementation, which does not support matching of regular expressions (instead, it only provides all partial matches for all parts of a regular expression). We have used a data-parallel approach similar to the one described in Section 3.4.5.

Figure 6.6: The results of the single-threaded file-carving process on a 128 MB USB flash disk using definitions from TrID and the classic and modified versions of the Aho-Corasick algorithm.

Figure 6.7: The results of the multi-threaded file-carving process on a 128 MB USB flash disk using definitions from TrID and the classic and modified versions of the Aho-Corasick algorithm.

Results in Figure 6.6 for the single-threaded implementation show that the modified version, although slower, behaved only about 3% slower than the standard Aho-Corasick implementation. A multi-threaded, data-parallel implementation featuring 8 threads was also tested and the results are shown in Figure 6.7. This experiment was less conclusive, given the low amount of data and the high throughput of the system used; however, the same sequential approach shows that the performance degradation is small, given the low number of local hits at the leaves of the automaton. In conclusion, the primary benefits of our proposed model are the following:

• extended support for both header and footer analysis and structural analysis, as part of the same carving process;

• support for many file types (over 4,600 in our experiments, with the ability to add many more);

• easier metadata collection by enhancing the partial results and storing their locations, building a map of all the logical pieces of a file;

• partial carving of files, by supporting partial contents through regular-expression-formatted definitions;

• partially fighting file fragmentation (although corrupted files are not yet supported) by building a logical map of a file's structural contents through the analysis of all local hits in the Aho-Corasick automaton.

6.3 Summary

This chapter has discussed some of the most common challenges involved in improving the file-carving process in digital forensics analysis, proposing different classifications for the different analysis types and showing how they can be accomplished.
We have also proposed a new model for improving the file-carving process, which builds on a popular, freeware software program with a permanently growing database for identifying different filetypes, outlining its benefits and its immediate application to the process of file-carving. The general outline of this chapter is as follows:

• In Section 6.1 we discussed file fragmentation and how it affects the carving process, outlining the different fragmentation types and the particular characteristics of each of them.

• We moved on to classifying the different types of forensic analysis and discussing a few solutions proposed in the literature in recent years in Section 6.1.2, also discussing some quantitative criteria for measuring quality and how they can be applied to the file-carving procedures and methodologies proposed so far (Section 6.1.3).

• We discussed some common file-carvers existing today and outlined the advantages and disadvantages of each.

• We proposed a new model for employing file-carving using header and structural analysis, by constructing regular expressions similar in format to those used in Section 3.2.1 for the ClamAV virus signatures, and have written a parser for building the expressions from the database of a freeware, multi-platform software for identifying filetypes, called TrIDScan.

• We outlined the performance and benefits of using our approach, by showing how an efficient file-carver performing both header and structural analysis can be successfully used by the human operator in the post-analysis process for an efficient reconstruction of the original file.

The summary of the work and model proposed in our approach shows that performing header, structural and storage-based analysis as part of the carving process is vital to the post-analysis process, where the human operator is usually employed or where automated tools using different heuristics reconstruct the original file's contents by assembling the pieces of data in their logical order. Our proposed model uses thousands of signatures compared to the tens used in equivalent file-carvers such as Scalpel [Ric05], therefore offering higher flexibility. Furthermore, it shows that the model can be successfully used to perform both single-threaded and multi-threaded data-parallel pattern matching, without significant performance degradation, offering a scalable approach to extending our research to other platforms and/or hardware devices.

Chapter 7

Conclusions

Contents

7.1 Discussions
7.2 Contributions
7.3 Future Research
7.4 Final Thoughts

Given the highly dynamic nature of the cyberspace today and the continuously growing number of threats being released, with the newest ones targeting the emerging mobile platforms, such as Android, MacOS or iOS, the need for fast and efficient malicious code detection engines is becoming more stringent than ever, especially considering the very limited computational and storage resources imposed by such platforms.
The versatile nature of pattern matching algorithms and their applicability, to various degrees, in numerous fields of research have driven significant improvements to the known classic algorithms, with some of the latest approaches looking to benefit from GPU acceleration in modern systems and others aiming to develop hybrid solutions that offer a good compromise between storage space and run-time performance.

As a highly active computer security research area, data forensics is still a subject widely open to innovation, where improvements and performance optimizations can still be found to increase the reliability, feasibility and integrity of the approaches known today. Its high applicability to cyber-crime prevention, whether for tracking down vital clues or pieces of information in crime-related trials, or for recovering lost, yet highly valuable, business data for medium or large enterprises, shows that this chapter is far from being closed in the coming years and that contributions here could help improve the way we look at the world in the near future. The diverse applicability of pattern matching algorithms in this area has produced numerous improvements in recent years, with breakthroughs that simplify the work of the human operators, increase the processing speed of the processes involved and help assess important and relevant information more easily than ever.

7.1 Discussions

At the beginning of this thesis we outline related work in the field of pattern-matching, starting with the area of single-pattern matching (and presenting the most important contributions there) and ending with multiple-pattern matching (with special emphasis on the Aho-Corasick, Commentz-Walter and Wu-Manber implementations, the most widely used today in most pattern-matching applications). We then move on to the field of intrusion detection systems, presenting a classification of generic intrusion detection systems and outlining ways to detect malicious code behavior. The state of the art ends by presenting the recently emerging field of digital forensics, outlining its applicability in cyber-crime investigations and data forensics.

The next part of the thesis discusses our own contribution to improving storage efficiency on heterogeneous hardware platforms, such as those relying on hybrid CPU/GPU computations, by presenting a model used to significantly reduce the storage requirements of very large pattern-matching automata and discussing how the model may be successfully applied in the virus signature matching process. We outline the primary benefits of the approach and show the overall improvement compared to the other methodologies existing at this time.

We continue our dissertation by showing how the previous efficient storage model may be used for the very fast, highly-parallel and throughput-optimized construction of very large pattern-matching automata in any type of application, with specific focus on intrusion detection systems and, particularly, malicious code detection. We also present a few mathematical proofs showing how the number of computations may be dramatically reduced, and apply the model to our approach, showing the great impact this has on performance in hybrid-parallel implementations and discussing the primary benefits of the approach.
We observe that our approach may also be applied without any restrictions to any type of pattern-matching application that uses large amounts of data and requires highly intensive computational tasks to be performed as part of the construction stage.

The next topic the thesis tackles focuses on implementing behavioral heuristics in large pattern-matching automata, and we propose a new heuristic, based on a weighted automaton, which takes into account the moment in time of a system-call during program execution for classifying malicious code behavior. We also discuss the mathematical formalism behind the approach, outlining the very good performance results obtained in real-world scenarios, and concluding that the approach is suitable for both static (offline) and dynamic (online, at run-time) analysis of the code.

Another contribution we have made is discussed next, outlining a hybrid compression approach for large pattern-matching automata by means of dynamic programming. Our focus was to reduce the storage required for the automaton by compacting the tree sufficiently, converting regular patterns into regular expressions and using our own constraint-based model for matching them. We discuss the performance evaluation of the algorithm and show that, for sufficiently-sized patterns, the performance suffers almost no impact, while allowing much better storage efficiency than the default naive implementation.

The final contribution of this thesis is outlined last, showing how we can adapt the constraint-based regular-expression matching automata to the problem of file-carving, focusing in particular on the header and structural analysis process as part of data forensics. We discuss the challenges involved in resolving this problem, outlining the major problems posed by file fragmentation on modern file-systems, and also showing how we can improve the file-carving process by converting a popular, open database of file-types to the regular-expression format used by our implementation. Furthermore, we outline the potential future benefits of our approach and conclude that the performance impact of the implementation is very small compared to the initial algorithm.

The final chapter of the thesis outlines its main contributions and the future research topics which have emerged as a result of the work and results presented herein.

7.2 Contributions

The main contributions of this thesis are the following:

• We have built a highly-efficient storage mechanism (outlined in Chapter 3) for hybrid CPU/GPU memory storage, with the following characteristics:

– it can be easily constructed by parsing the tree associated with the automaton;

– it stores the entire automaton in host and/or device memory while still allowing it to be traversed entirely;

– it uses a single pointer to external memory, with all other child pointers being offsets to different areas of the internal memory;

– it offers the ability to transfer data back and forth (full-duplex) between the host and the device at the maximum throughput allowed by the PCI-Express architecture;

– it imposes no limitation on the ordering of nodes, with the single exception that all children of a node must be stored in consecutive memory locations in the layout;

– it always employs constant storage space, an improvement over other, previously discussed, similar approaches.
• We have proposed, formalized and implemented (in Chapter 4) the first hybrid-parallel architecture for constructing very large automata through efficient CPU/GPU collaboration in heterogeneous hardware environments, as follows:

– we have proposed and proved that the computation can be significantly reduced through the use of parallelism and a few mathematical properties of the resulting automata;

– we have improved performance significantly as a result of our formalism, dropping the processing time from the order of a few hours to just under a few seconds on high-end consumer hardware;

– our approach may be successfully used in any type of pattern-matching application involving very large automata;

– the approach uses our own highly-compact storage format described earlier, which allows access to nodes in a single serialized memory area;

– it offers the first pre-processing performance results for the ClamAV database for both the Aho-Corasick and Commentz-Walter automata.

• We have proposed, to the best of our knowledge, the first time-dependent system-call analysis heuristic approach for detecting malicious code behavior (in Section 5.1):

– the methodology uses the first known approach to tracing malicious code behavior by analyzing system-calls while also taking into account the moment in time when each call has been made;

– our approach has very fast performance (several hundreds of thousands of system-calls analyzed per second on average consumer hardware), which makes it suitable for both static and dynamic (run-time) analysis, as well as for implementation in real-time monitoring engines;

– the approach does not necessarily require the code to be executed upfront;

– the algorithm is based on a metric which may be easily adapted and changed to comply with various requirements during the implementation stage.

• We have proposed a hybrid compression mechanism for the Aho-Corasick automaton to be used in the virus signature matching process (Section 5.2):

– our approach is the first to use a Smith-Waterman-based, dynamic programming approach to converting regular patterns into regular expressions in order to create a more compact tree representation;

– our formalized model may be applied to any other type of application using pattern-matching automata, with any alphabet of any size;

– we have presented the first real-world results of the impact of our hybrid compression approach on the ClamAV database of virus signatures, through conclusive experiments showing the degree to which storage efficiency can be improved as a result of its application.

• We have proposed an innovative model and approach using regular expressions for performing carving analysis in the data forensics process, in Chapter 6:

– our approach may be successfully used for performing both header and structural data analysis, as part of the carving process;

– our implementation can support several tens or hundreds of thousands of file-types and structural definitions of files;

– our model can output a structural map of the pieces of a file, as they are spread throughout the disk's physical layout;

– partial carving is also permitted, allowing fragments of files to be recovered more easily through accurate regular-expression matching;

– it supports heuristics on top of the implementation, for reconstructing the initial file by matching the found file pieces and restoring their logical order.
7.3 Future Research

Future research may be structured according to the chapters of this thesis, as follows:

• One possible improvement would be to design a methodology for storing the very large pattern-matching automata discussed in Chapter 3 in GPU texture memory, which is faster since it has a caching mechanism behind it. The approach may, however, limit the throughput performance if a large number of transfers is required between the host and the device, which makes it suitable for environments where cross-collaboration is minimal.

• Another possible future research topic is given by the specific morphing ability of the architecture presented in Chapter 4, where the pattern-matching automata can change their own structure dynamically in real-time. Such an approach could be successfully used in other algorithmic implementations, with potential applicability to different heuristics (behavioral ones, in particular), where structures change all the time as a result of the impact of one or more external factors.

• Another potential improvement would be an extension or adaptation of the Bray-Curtis weighted automaton presented in Section 5.1 to more complex regular expressions or to other similarity measurements, comparing the results obtained. Careful development may be required if preservation of the system-call time dependency is desired in the new implementation as well.

• Although widely discussed, compression of pattern-matching automata remains a challenging subject and it is our belief that there is still room for improvement on this topic. The approach we presented in Section 5.2 could, for instance, be extended to use a much better, more efficient and parallel (by construction) approach to pre-processing the patterns and determining the longest common subsequences, which is the most laborious step of the work performed as part of that research. Although there have been studies on the topic, a heterogeneous hardware approach to solving the LCS problem for multiple patterns at once is still a potential subject for further research.

• The algorithmic backgrounds discussed in Chapter 6 for improving the carving analysis process could be extended further to accommodate hybrid-parallel implementations where one task is given the header analysis, another is given the structural analysis, and a potentially new task applies different heuristics for merging file fragments together into a single data piece. Additionally, heterogeneous implementations of structural analysis engines for data forensics are still an active subject of the currently on-going research in the field.

7.4 Final Thoughts

The future of computing lies in hybrid, heterogeneous systems, where the integrated computational devices, such as the CPU and the GPU, work together flawlessly for the greater benefit of the end-user and for highly improved processing power. Improved nano-technology and improved electrical and chemical properties of semiconductors make this scenario a closer reality with each passing day, making research in this topic a viable source of interest for developing, building and improving on the existing architectures and algorithms, either by further parallelizing those known so far or by constructing new ones that benefit from the advantages offered by the new platforms.
In conclusion, we argue that in this thesis we have identified and solved some of the pattern matching challenges commonly found nowadays in some of the most challenging fields of research in computer security, particularly related to the malicious code detection and digital data forensics problems, and especially concerning heterogeneous hardware platforms making use of hybrid implementations. Given the fast-changing pace of the world, it is our belief that the primary focus from here onwards needs to carve deeper into building heterogeneous systems that are efficient in terms of both performance and storage, which, mixed together with the technological progress of the upcoming years, would represent the foundational background of computational technology for the next couple of decades or more.

Bibliography

[AC75] A. V. Aho and M. J. Corasick, Efficient string matching: an aid to bibliographic search, Communications of the ACM, ACM Press, 1975, pp. 333–340.

[ALJ+95] Debra Anderson, Teresa Lunt, Harold Javitz, Ann Tamaru, and Alfonso Valdes, Next-generation Intrusion Detection Expert System (NIDES): A Summary, 1995.

[AMD] Amd firestream, http://en.wikipedia.org/wiki/AMD FireStream.

[AX11] Townend P. Arshad, J. and J. Xu, A novel intrusion severity analysis approach for clouds, Future Generation Computer Systems, 2011, doi: 10.1016/j.future.2011.08.009.

[Axe00] S. Axelsson, Intrusion detection systems: A survey and taxonomy, Technical Report 99-15, Chalmers University of Technology, Dept. of Computer Engineering, Goteborg, Sweden, 2000, http://www.ce.chalmers.se/staff/sax/taxonomy.ps.

[Ben11] M. et al. Benn, Digital forensics and data leakage protection, https://www.bewglobal.com/pdfs/digital-forensics-data-leakage.pdf.

[BGI+12] Boaz Barak, Oded Goldreich, Russell Impagliazzo, Steven Rudich, Amit Sahai, Salil Vadhan, and Ke Yang, On the (im)possibility of obfuscating programs, J. ACM 59 (2012), no. 2, 6:1–6:48.

[bio] Bioinformatics explained: Smith-waterman, http://www.clcbio.com/sciencearticles/Smith-Waterman.pdf.

[Blo70] Burton H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM 13 (1970), no. 7, 422–426.

[BM] J. Butler and J. Murdock, Physical memory forensics for files and cache, BlackHat.com, http://media.blackhat.com/bh-us11/Butler/BH US 11 ButlerMurdock Physical Memory ForensicsWP.pdf.

[BM77] R.S. Boyer and J.S. Moore, A fast string searching algorithm, Communications of the ACM, 1977, ISSN 0001-0782, pp. 762–772.

[BM06] J. Butler and J. Murdock, Automated reassembly of file fragmented images using greedy algorithms, IEEE Transactions on Image Processing, 2006, ISSN 1057-7149, pp. 385–393.

[Bra] Kardi teknomo's page: Bray curtis distance, http://people.revoledu.com/kardi/tutorial/Similarity/BrayCurtisDistance.html.

[Bro93] A.Z. Broder, Some applications of rabin's fingerprinting method, Sequences II: Methods in Communications, Security, and Computer Science, Springer-Verlag, 1993, pp. 143–152.

[BS11] R. Bhukya and D.V.L.N. Somayajulu, Exact multiple pattern matching algorithm using dna sequence and pattern pair, International Journal of Computer Applications, 2011, pp. 32–38.

[Car02] Brian Carrier, Defining digital forensic examination and analysis tools, International Journal of Digital Evidence 1 (2002), 2003.

[Car05] B. Carrier, File system forensic analysis, Addison-Wesley Professional, 2005, ISBN 0321268172.
[CF97] Cristina Cifuentes and Antoine Fraboulet, Intraprocedural static slicing of binary executables, Proceedings of the International Conference on Software Maintenance (Washington, DC, USA), ICSM ’97, IEEE Computer Society, 1997, pp. 188–. [CH11] Emilio Corchado and Álvaro Herrero, Neural visualization of network traffic data for intrusion detection, Appl. Soft Comput. 11 (2011), no. 2, 2042–2056. [cla12] Clamav antivirus, 2012, http://www.clamav.net/lang/en/. [CMJ+ 10] Sang Kil Cha, Iulian Moraru, Jiyong Jang, John Truelove, David Brumley, and David G. Andersen, Splitscreen: enabling efficient, distributed malware detection, Proceedings of the 7th USENIX conference on Networked systems design and implementation (Berkeley, CA, USA), NSDI’10, USENIX Association, 2010, pp. 25–25. [Coh87] F. Cohen, Computer viruses: theory and experiments, Comput. Secur. 6 (1987), no. 1, 22–35. 144 PhD Dissertation [Col91] R. Cole, Tight bounds on the complexity of the boyer-moore string matching algorithm, Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics, 1991. [Cor02] Leiserson C.E. Rivest R.I Stein C. Cormen, T.H., Introduction to algorithms, second edition, The MIT Press and McGraw-Hill Book Company, 2002, ISBN 978-0262032933. [CR94] Czumaj A. Gasieniec L. Jarominek S. Lecroq T. Plandowski W. Crochemore, M. and W. Rytter, Speeding up two string matching algorithms, Algorithmica, 1994, pp. 247–267. [Cre] Creating signatures for clamav, http://www.clamav.net/doc/latest/signatures.pdf. [CUDa] The cuda occupancy calculator, http://developer.download.nvidia.com/compute/ cuda/CUDA Occupancy calculator.xls. [CUDb] The cuda thread model, http://pg-server.csc.ncsu.edu/mediawiki/index.php/ CSC/ECE 506 Spring 2011/ch2a mc. [CUDc] nvidia cuda c programming guide, http://developer.download.nvidia.com/compute/ DevZone/docs/html/C/doc/CUDA C Programming Guide.pdf. [cud12] The cuda parallel computing platform, 2012, http://www.nvidia.com/object/cuda-parallel-computing-platform.html. [CW79] B. Commentz-Walter, A string matching algorithm fast on the average, Proceedings of the 6th Colloquium, on Automata, Languages and Programming, Springer-Verlag, 1979, ISBN 3-540-09510-1, pp. 118–132. [CW00] D.M. Chess and R.S. White, An undetectable computer virus. [CWL07] Mohamed R. Chouchane, Andrew Walenstein, and Arun Lakhotia, Statistical signatures for fast filtering of instruction-substituting metamorphic malware, Proceedings of the 2007 ACM workshop on Recurring malcode (New York, NY, USA), WORM ’07, ACM, 2007, pp. 31–37. [CY06] Hongke Z. Chunyue, Z. and L. Yun, A pattern matching based network intrusion detection system, 9th International Conference on Control, Automation, Robotics and Vision (ICARCV), IEEE Computer Society, 2006, ISBN 1-4244-0341-3, pp. 1–4. [cyb] Cybercrime, areas/Cybercrime/Cybercrime. http://www.interpol.int/Crime- PhD Dissertation 145 [CZ04] Watson B.W. Cleophas, L. and G. Zwaan, A new taxonomy of sublinear keyword pattern matching algorithms, Technical Report CS-TR 04-07, Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, 2004. [DH00] et al. Debar H., A revised taxonomy for intrusion detection systems, Annales des Telecommunications (vol. 55), 2000, pp. 361–378. [DHW99] Dacier M. Debar H. and A. Wespi, Towards a taxonomy of intrusion detection systems, Computer Networks (vol. 31), 1999, pp. 805–822. [Dij59] E.W. 
[Dir] Microsoft DirectCompute, http://developer.nvidia.com/cuda/directcompute.

[DPP07] Vassilis Dimopoulos, Ioannis Papaefstathiou, and Dionisios N. Pnevmatikatos, A memory-efficient reconfigurable Aho-Corasick FSM implementation for intrusion detection systems, ICSAMOS '07, 2007, pp. 186–193.

[Dud06] L. Dudas, Improved pattern matching to find DNA patterns, IEEE International Conference on Automation, Quality and Testing, Robotics, IEEE Xplore, 2006, ISBN 1-4244-0360-X, pp. 345–349.

[ea11] W. Baker et al., 2011 data breach investigations report, 2011, http://www.verizonbusiness.com/resources/reports/rp_data-breach-investigations-report-2011_en_xg.pdf.

[Esm95] M. Esmaili, R. Safavi-Naini, and J. Pieprzyk, Computer intrusion detection: A comparative survey, Technical Report 95-07/06, Center for Computer Security Research, University of Wollongong, Wollongong, NSW, Australia, 1995.

[FC10] Y. Fledel, U. Kanonov, Y. Elovici, S. Dolev, and C. Glezer, Google Android: a comprehensive security assessment, IEEE Security & Privacy, IEEE, 2010, ISSN 1540-7993.

[Fer] NVIDIA Fermi architecture, http://www.nvidia.com/object/fermi-architecture.html.

[FKF+03] Henry Hanping Feng, Oleg M. Kolesnikov, Prahlad Fogla, Wenke Lee, and Weibo Gong, Anomaly detection using call stack information, Proceedings of the 2003 IEEE Symposium on Security and Privacy (Washington, DC, USA), SP '03, IEEE Computer Society, 2003, pp. 62–.

[fla12] Rt.com, 2012, http://www.rt.com/news/flame-stuxnet-kaspersky-iran-607/.

[For] Foremost, http://foremost.sourceforge.net/.

[Fre06] Kimmo Fredriksson, On-line approximate string matching in natural language, Fundam. Inf. 72 (2006), no. 4, 453–466.

[Fre09] K. Fredriksson, Succinct backward-DAWG-matching, Journal of Experimental Algorithmics (JEA), ACM, 2009, ISSN 1084-6654.

[GG11] G. Gigerenzer and W. Gaissmaier, Heuristic decision making, Annual Review of Psychology, vol. 62, 2011, pp. 451–482.

[GPG] GPGPU, http://ro.wikipedia.org/wiki/GPGPU.

[Gra] Control & flow dependence graphs, http://www.grammatech.com/research/papers/staticAnalysis/imgSlides/sld021.html.

[HA] S. Holewinski and G. Andrzejewski.

[Har11] S.L. Harrington, Collaborating with a digital forensics expert: ultimate tag-team or disastrous duo?, William Mitchell Law Review 38 (2011), 353–396, ISSN 0270-272X.

[HDT08] Seyed Mahmood Hejazi, Mourad Debbabi, and Chamseddine Talhi, Automated Windows memory file extraction for cyber forensics investigation, J. Digit. Forensic Pract. 2 (2008), no. 3, 117–131.

[HFS98] Steven A. Hofmeyr, Stephanie Forrest, and Anil Somayaji, Intrusion detection using sequences of system calls, J. Comput. Secur. 6 (1998), no. 3, 151–180.

[HZB11] Amir Houmansadr, Saman A. Zonouz, and Robin Berthier, A cloud-based intrusion detection and response system for mobile phones, Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (Washington, DC, USA), DSNW '11, IEEE Computer Society, 2011, pp. 31–32.

[int12] Hexus, 2012, http://hexus.net/tech/news/cpu/39381-intel-currently-developing-14nm-aiming-towards-5nm-chips/.

[IoM] Data recovery pricing, http://iomega.com/data-recovery/datarecoveryfaq-prices.html.

[Jac99] K. Jackson, Intrusion detection system (IDS) product survey, Technical Report LA-UR-99-3883, Los Alamos National Laboratory, Los Alamos, NM, 1999, http://libwww.lanl.gov/la-pubs/00416750.pdf.
[Kas] Kaspersky Lab utilizes NVIDIA technologies to enhance protection, http://www.kaspersky.com/news?id=207575979.

[kKJLG03] Ákos Kiss, Judit Jász, Gábor Lehotai, and Tibor Gyimóthy, Interprocedural static slicing of binary executables, Source Code Analysis and Manipulation, IEEE, 2003, pp. 118–127.

[Klo07] S.J.J. Kloet, Measuring and improving the quality of file carving methods, Master's Thesis, 2007.

[KM11] Charalampos S. Kouzinopoulos and Konstantinos G. Margaritis, A performance evaluation of the preprocessing phase of multiple keyword matching algorithms, Panhellenic Conference on Informatics, 2011, pp. 85–89.

[Knu77] D. Knuth, J.H. Morris, and V. Pratt, Fast pattern matching in strings, SIAM Journal on Computing, 1977, ISSN 0097-5397, pp. 323–350.

[Knu09] D.E. Knuth, The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams, Addison-Wesley Professional, 2009.

[KR87] R.M. Karp and M.O. Rabin, Efficient randomized pattern-matching algorithms, IBM Journal of Research and Development, 1987, pp. 249–260.

[Lan92] William Landi, Undecidability of static analysis, ACM Lett. Program. Lang. Syst. 1 (1992), no. 4, 323–337.

[LC04] T. Lecroq and C. Charras, Handbook of exact string matching algorithms, College Publications, 2004, ISBN 978-0954300647.

[Lee07] T.H. Lee, Generalized Aho-Corasick algorithm for signature based antivirus applications, Proceedings of the 16th International Conference on Computer Communications and Networks, IEEE Computer Society, 2007, ISBN 978-1-4244-1251-8, pp. 792–797.

[LH08] T.H. Lee and N.L. Huang, An efficient and scalable pattern matching scheme for network security applications, Proceedings of the 17th International Conference on Computer Communications and Networks (ICCCN), IEEE, 2008, pp. 1–7.

[LH12] C.H. Lin, S.C. Chang, and W.K. Hon, Memory-efficient pattern matching architectures using perfect hashing on graphic processing units, Proceedings of the INFOCOM, IEEE Computer Society, 2012, ISBN 978-1-4673-0773-4, pp. 1978–1986.

[Lin06] P.C. Lin, Profiling and accelerating string matching algorithms in three network content security applications, IEEE Communications Surveys & Tutorials, vol. 8, issue 2, IEEE, 2006, pp. 24–37.

[Liu12] C.H. Liu, L.S. Chien, S.C. Chang, and W.K. Hon, PFAC library: GPU-based string matching algorithm, GPU Technology Conference (GTC), 2012, http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0054-GTC2012-PFAC-GPU-Algorithm.pdf.

[LLL11] Po-Ching Lin, Ying-Dar Lin, and Yuan-Cheng Lai, A hybrid algorithm of backward hashing and automaton tracking for virus scanning, IEEE Trans. Comput. 60 (2011), no. 4, 594–601.

[LTL+10] Cheng-Hung Lin, Sheng-Yu Tsai, Chen-Hsiung Liu, Shih-Chieh Chang, and Jyuo-Min Shyu, Accelerating string matching using multi-threaded algorithm on GPU, GLOBECOM '10, 2010, pp. 1–5.

[Lun88] T.F. Lunt, Automated audit trail analysis and intrusion detection: A survey, 11th National Computer Security Conference, Baltimore, MD, 1988, pp. 65–73.

[LYLT09] Yanbing Liu, Yifu Yang, Ping Liu, and Jianlong Tan, A table compression method for extended Aho-Corasick automaton, Proceedings of the 14th International Conference on Implementation and Application of Automata (Berlin, Heidelberg), CIAA '09, Springer-Verlag, 2009, pp. 84–93.

[Mala] 18.2-152.4:1 Penalty for computer contamination, Joint Commission on Technology and Science, http://jcots.state.va.us/2005 Content/pdf/Computer Contamination Bill.pdf.
[Malb] National Conference of State Legislatures: virus/contaminant/destructive transmission statutes by state, http://www.ncsl.org/programs/lis/cip/viruslaws.htm.

[MDWZ04] Yevgeniy Miretskiy, Abhijith Das, Charles P. Wright, and Erez Zadok, Avfs: an on-access anti-virus file system, Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13 (Berkeley, CA, USA), SSYM '04, USENIX Association, 2004, pp. 6–6.

[MH90] N. McAuliffe, D. Wolcott, L. Schaefer, N. Kelem, B. Hubbard, and T. Haley, Is your computer being misused? A survey of current intrusion detection system technology, Sixth Computer Security Applications Conference, 1990, pp. 260–272.

[mic12] Microsoft Safety & Security Center, 2012, http://www.microsoft.com/security/pc-security/virus-whatis.aspx.

[Mik05] N.A. Mikus, An analysis of disc carving techniques, Naval Postgraduate School Master's Thesis, 2005, http://handle.dtic.mil/100.2/ADA432468.

[MK06] A. Mansoor, M. Muthuprasanna, and V. Kumar, High speed pattern matching for network IDS/IPS, Proceedings of the 2006 IEEE International Conference on Network Protocols, IEEE Computer Society, 2006, ISBN 1-4244-0593-9, pp. 187–196.

[MP70] J.H. Morris and V.R. Pratt, A linear pattern-matching algorithm, Technical Report TR 40, University of California, Berkeley, CA, 1970.

[MS99] Christopher D. Manning and Hinrich Schütze, Foundations of statistical natural language processing, MIT Press, Cambridge, MA, USA, 1999.

[NC10] B. Nelson, A. Phillips, and C. Steuart, Guide to computer forensics and investigations, Cengage Learning, 2010, ISBN 9781435498839.

[Nor04] M. Norton, Optimizing pattern matching for intrusion detection, Sourcefire, Inc., 2004.

[nVI] CUDA-powered anti-virus scanning is in development, http://news.softpedia.com/news/CUDA-Powered-Anti-Virus-Scanning-Is-in-Development-123648.shtml.

[Oba09] Barack Obama, Remarks by the President on securing our nation's cyber infrastructure, 2009, http://www.whitehouse.gov/the_press_office/Remarks-by-the-President-on-Securing-Our-Nations-Cyber-Infrastructure/.

[Ope] Khronos Group: OpenCL, http://www.khronos.org/opencl/.

[PAF09] Ciprian Pungila, Ovidiu Aritoni, and Teodor-Florin Fortis, Benchmarking Database Systems for the Requirements of Sensor Readings, IETE Technical Review 26 (2009), no. 5, 342–349.

[Pal01] G. Palmer, A road map for digital forensic research, DFRWS 16 (2001), http://www.dfrws.org/2001/dfrws-rm-final.pdf.

[PAP12] Mrutyunjaya Panda, Ajith Abraham, and Manas Ranjan Patra, A hybrid intelligent approach for network intrusion detection, Procedia Engineering 30 (2012), 1–9, International Conference on Communication Technology and System Design 2011.

[pci] PCI Express, http://en.wikipedia.org/wiki/PCI_Express.

[pe-] pe-sig, http://vrt-blog.snort.org/2009/03/generating-virus-signatures-automated.html.

[PN] C. Pungila and V. Negru, Towards building efficient malware detection engines using hybrid CPU/GPU-accelerated approaches, Architectures and Protocols for Secure Information Technology, IGI Global (accepted, to be published in 2013).

[PN97] Phillip A. Porras and Peter G. Neumann, EMERALD: Event monitoring enabling responses to anomalous live disturbances, Proceedings of the 20th National Information Systems Security Conference, 1997, pp. 353–365.

[PN12] C. Pungila and V. Negru, A highly-efficient memory-compression approach for GPU-accelerated virus signature matching, Proceedings of the 8th International Information Security Conference (ISC), Lecture Notes in Computer Science, 2012.
[Pri04] R. W. Price, Roadmap to entrepreneurial success: Powerful strategies for building a high-profit business, AMACOM, 2004, p. 42.

[PSM08] Anandabrata Pal, Husrev T. Sencar, and Nasir Memon, Detecting file fragmentation point using sequential hypothesis testing, Digit. Investig. 5 (2008), S2–S13.

[Pun09] Ciprian-Petrisor Pungila, A Bray-Curtis weighted automaton for detecting malicious code through system-call analysis, Proceedings of the 2009 11th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (Washington, DC, USA), SYNASC '09, IEEE Computer Society, 2009, pp. 392–400.

[Pun10] Ciprian Pungila, A model for energy-efficient household maintenance through behavioral analysis of electrical appliances, IEEE International Conference on E-Business Engineering, 2010, pp. 409–414.

[Pun12] C. Pungila, Improved file-carving through data-parallel pattern matching for data forensics, 7th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI), 2012, ISBN 978-1-4673-1013-0, pp. 197–202.

[Pun13] Ciprian Pungila, Hybrid compression of the Aho-Corasick automaton for static analysis in intrusion detection systems, International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions (Álvaro Herrero, Václav Snášel, Ajith Abraham, Ivan Zelinka, Bruno Baruque, Héctor Quintián, José Luis Calvo, Javier Sedano, and Emilio Corchado, eds.), Advances in Intelligent Systems and Computing, vol. 189, Springer Berlin Heidelberg, 2013, doi: 10.1007/978-3-642-33018-6_8, pp. 77–86.

[Rab81] M.O. Rabin, Fingerprinting by random polynomials, Technical Report TR-15-81, Harvard Aiken Computation Laboratory, 1981.

[Rac] Race condition, http://en.wikipedia.org/wiki/Race_condition.

[Rah00] S.A. Rahman, Network intrusion detection systems, CS594 - Computer and Network Security, 2000.

[red12] Redorbit, 2012, http://www.redorbit.com/news/technology/1112543496/flame-virus-most-sophisticated-cyber-weapon-ever-used/.

[Rei02] M. Reith, C. Carr, and G. Gunsch, An examination of digital forensic models, International Journal of Digital Evidence, 2002.

[RI07] G. Richard III, V. Roussev, and L. Marziale, In-place file carving, Science Direct, 2007.

[Ric05] G. Richard III and V. Roussev, Scalpel: A frugal, high performance file carver, Digital Forensics Research Workshop, 2005.

[Riv92] R. Rivest, The MD5 message-digest algorithm, RFC 1321, 1992.

[Roz05] K. Rozinov, Efficient static analysis of executables for detecting malicious behaviors, Master's Thesis, Polytechnic University, Brooklyn, NY, 2005.

[SA07] E. Seamans and T. Alexander, GPU Gems 3, Chapter 35: Fast virus signature matching on the GPU, first ed., Addison-Wesley Professional, 2007.

[San] The Sandy Bridge microarchitecture, http://en.wikipedia.org/wiki/Sandy_Bridge_(microarchitecture).

[SBDB01] R. Sekar, M. Bendre, D. Dhurjati, and P. Bollineni, A fast automaton-based method for detecting anomalous program behaviors, Proceedings of the 2001 IEEE Symposium on Security and Privacy (Washington, DC, USA), SP '01, IEEE Computer Society, 2001, pp. 144–.

[Sca] Scalpel, http://www.digitalforensicssolutions.com/Scalpel/.

[She08] A. Shevchenko, Malicious code detection technologies, Kaspersky Lab, 2008, http://brazil.kaspersky.com/sites/brazil.kaspersky.com/files/knowledgecenter/malicious_code_detection_technologies.pdf.
[SM07] K. Scarfone and P. Mell, Guide to intrusion detection and prevention systems (IDPS), Computer Security Resource Center (National Institute of Standards and Technology), 2007.

[sno] Snort, http://www.snort.org.

[SQ10] G. B. Shelly, M. E. Vermaat, and J. J. Quasney, Discovering Computers 2010: Living in a digital world, introductory, Cengage Learning, 2010, p. 144.

[stu11] Wired.com, 2011, http://www.wired.com/threatlevel/2011/07/how-digital-detectives-deciphered-stuxnet/all/1.

[SW81] T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology 147 (1981), no. 1, 195–197.

[SXCM04] A. H. Sung, J. Xu, P. Chavez, and S. Mukkamala, Static analyzer of vicious executables (SAVE), Proceedings of the 20th Annual Computer Security Applications Conference (Washington, DC, USA), ACSAC '04, IEEE Computer Society, 2004, pp. 326–334.

[Tan10] M. Tanase, Detecting and removing malicious code, 2010, http://www.symantec.com/connect/articles/detecting-and-removing-malicious-code.

[TCC11] Vrizlynn L. L. Thing, Tong-Wei Chua, and Ming-Lee Cheong, Design of a digital forensics evidence reconstruction system for complex and obscure fragmented file carving, CIS '11, 2011, pp. 793–797.

[Tec] Techreport.com: The GF114, http://techreport.com/articles.x/20293/6.

[Tom] Tom's Hardware: GeForce GTX 560 Ti, http://www.tomshardware.com/reviews/nvidia-geforce-gtx-560-ti-gf114,2845.html.

[tri] TrIDScan, http://mark0.net/soft-tridscan-e.html.

[TSCV] Nathan Tuck, Timothy Sherwood, Brad Calder, and George Varghese, Deterministic memory-efficient string matching algorithms for intrusion detection, IEEE INFOCOM, Hong Kong, pp. 333–340.

[Tum10] A. Tumeo, Accelerating DNA analysis applications on GPU clusters, IEEE 8th Symposium on Application Specific Processors (SASP), IEEE Xplore, 2010, ISBN 978-1-4244-7953-5, pp. 71–76.

[VAP+08] Giorgos Vasiliadis, Spiros Antonatos, Michalis Polychronakis, Evangelos P. Markatos, and Sotiris Ioannidis, Gnort: High performance network intrusion detection using graphics processors, Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection (Berlin, Heidelberg), RAID '08, Springer-Verlag, 2008, pp. 116–134.

[VI10] Giorgos Vasiliadis and Sotiris Ioannidis, GrAVity: a massively parallel antivirus engine, Proceedings of the 13th International Conference on Recent Advances in Intrusion Detection (Berlin, Heidelberg), RAID '10, Springer-Verlag, 2010, pp. 79–96.

[vir12] ZDNet, 2012, http://www.zdnet.com/blog/bott/the-malware-numbers-game-how-many-viruses-are-out-there/4783/.

[vN66] J. von Neumann, Theory of self-reproducing automata, Essays on Cellular Automata, University of Illinois Press, 1966, pp. 66–87.

[VPI11] Giorgos Vasiliadis, Michalis Polychronakis, and Sotiris Ioannidis, MIDeA: a multi-parallel intrusion detection architecture, Proceedings of the 18th ACM Conference on Computer and Communications Security (New York, NY, USA), CCS '11, ACM, 2011, pp. 297–308.

[Wat94] B. Watson, The performance of single-keyword and multiple-keyword pattern matching algorithms, Technical Report CS TR 94-19, Department of Computing Science, Eindhoven University of Technology, 1994.

[Wat95] B.W. Watson, Taxonomies and toolkits of regular language algorithms, Eindhoven University of Technology, 1995, pp. 53–116.
[Wat04] B.W. Watson et al., SPARE Parts: a C++ toolkit for string pattern recognition, Software Practice and Experience, 2004.

[wdm] Windows Driver Model, http://msdn.microsoft.com/en-us/library/windows/hardware/ff565698(v=vs.85).aspx.

[Wei73] P. Weiner, Linear pattern matching algorithms, 14th Annual IEEE Symposium on Switching and Automata Theory, IEEE Computer Society, 1973, pp. 1–11.

[Wei81] Mark Weiser, Program slicing, Proceedings of the 5th International Conference on Software Engineering (Piscataway, NJ, USA), ICSE '81, IEEE Press, 1981, pp. 439–449.

[Wika] Wikipedia - Entropy, http://en.wikipedia.org/wiki/Information_entropy.

[Wikb] Wikipedia - CIH (computer virus), http://en.wikipedia.org/wiki/CIH_(computer_virus).

[WM94] S. Wu and U. Manber, A fast algorithm for multi-pattern searching, Technical Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.

[WZ96] B.W. Watson and G. Zwaan, A taxonomy of sublinear multiple keyword pattern matching algorithms, Science of Computer Programming, 1996.

[XW07] X. Xu and L. Wang, Parallel software development with Intel threading analysis tools, Intel Technology Journal, vol. 11, issue 04, 2007.

[Yun12] SangKyun Yun, An efficient TCAM-based implementation of multipattern matching using covered state encoding, IEEE Trans. Comput. 61 (2012), no. 2, 213–221.

[Zha] X. Zha and S. Sahni, Fast in-place file carving for digital forensics, e-Forensics.

[ZS08] Xinyan Zha and Sartaj Sahni, Highly compressed Aho-Corasick automata for efficient intrusion detection, ISCC '08, 2008, pp. 298–303.

[ZSS11] Xinyan Zha, D.P. Scarpazza, and S. Sahni, Highly compressed multi-pattern string matching on the Cell Broadband Engine, 2011 IEEE Symposium on Computers and Communications (ISCC), June 28 - July 1, 2011, pp. 257–264.