Dos and Don’ts of Machine Learning
in Computer Security
Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke,
Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, Konrad Rieck
USENIX Security 2022
Machine Learning has already solved many problems in computer security?
Unfortunately not…
Motivation—Historical Examples
Network intrusion detection: The base rate fallacy
• Intrusion detectors should have low false positive rates (FPR)
• ‘Low’ FPR often still corresponds to large number of false positives
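This effect follows directly from Bayes' theorem. A minimal sketch with illustrative numbers (the FPR, TPR, and base rate below are assumptions, not figures from the paper):

```python
# Base-rate fallacy: when intrusions are rare, even a 'low' FPR means
# most alarms are false. All numbers below are illustrative.
fpr = 0.01        # false positive rate of the detector
tpr = 0.80        # true positive rate (detection rate)
base_rate = 1e-4  # fraction of events that are real intrusions

# Bayes' theorem: probability that a raised alarm is a real intrusion
precision = (tpr * base_rate) / (tpr * base_rate + fpr * (1 - base_rate))
print(f"P(intrusion | alarm) = {precision:.4f}")
```

Under these assumptions, despite a 1% FPR, fewer than 1% of raised alarms correspond to actual intrusions.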
Android malware detection: Spatio-temporal bias inflating performance
• Models trained with access to ‘future’ information
• Unrealistic class balance inflates performance
Axelsson. The base-rate fallacy and the difficulty of intrusion detection. ACM TISSEC, 2000.
Pendlebury et al. TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time. USENIX Security, 2019.
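One way to avoid training on 'future' information is to split strictly by time, as TESSERACT recommends: every training sample must predate every test sample. A minimal sketch (the sample format, field names, and dates are hypothetical):

```python
from datetime import datetime

# Temporally consistent train/test split: train only on samples that
# predate the cutoff, so the model never sees 'future' data.
def temporal_split(samples, cutoff):
    train = [s for s in samples if s["timestamp"] < cutoff]
    test = [s for s in samples if s["timestamp"] >= cutoff]
    return train, test

apps = [
    {"name": "app_a", "timestamp": datetime(2013, 5, 1), "label": 0},
    {"name": "app_b", "timestamp": datetime(2014, 2, 1), "label": 1},
    {"name": "app_c", "timestamp": datetime(2015, 7, 1), "label": 0},
]
train, test = temporal_split(apps, datetime(2014, 1, 1))
# train contains only app_a; test contains app_b and app_c
```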
Overview
1. Identification of common pitfalls
• 10 subtle issues affecting ML for security
• Recommendations for avoiding them
2. Survey on the prevalence of pitfalls
• Review of 30 top papers in security
• Pitfalls are widespread
(Timeline: surveyed papers span 2011–2020)
3. Case studies demonstrating impact of pitfalls
• Mobile malware detection
• Vulnerability discovery
• Source code authorship attribution
• Network intrusion detection
Important remark
This work should not be interpreted as a finger-pointing exercise. Any work mentioned as having pitfalls still makes important contributions, and we identify pitfalls in our own work as well.
ML Pipeline and Pitfalls
Data Collection and Labeling
• P1 Sampling bias
• P2 Label inaccuracy
System Design and Learning
• P3 Data snooping
• P4 Spurious correlations
• P5 Biased parameters
Performance Evaluation
• P6 Inappropriate baselines
• P7 Inappropriate measures
• P8 Base rate fallacy
Deployment and Operation
• P9 Lab-only evaluation
• P10 Inappropriate threat model
Prevalence Study
1. Paper Selection → 2. Review Process → 3. Author Feedback
Pitfall is either…
• present (or present but discussed)
• partly present (or partly present but discussed)
• not present
• unclear from text
Prevalence Study
[Figure: stacked bars showing, for each of the 10 pitfalls, in how many of the 30 surveyed papers it is present, partly present, or discussed.]
Pitfalls are prevalent even in top research!
Impact Analysis
• Android Malware Detection: P1 Sampling Bias, P4 Spurious Correlations
• Network Intrusion Detection: P6 Inappropriate Baselines, P7 Inappropriate Performance Measures, P9 Lab-Only Evaluation
• Authorship Attribution: P1 Sampling Bias, P4 Spurious Correlations
• Vulnerability Discovery: P2 Label Inaccuracy, P4 Spurious Correlations, P6 Inappropriate Baselines
Impact Study: Mobile Malware Detection
(P1 Sampling Bias, P4 Spurious Correlations, P7 Inappropriate Performance Measures)
What is the problem?
• Merging data from different sources leads to sampling bias
• Different origins of malware and benign apps can introduce unwanted shortcuts
[Figure: sampling probability vs. number of AV detections (0–25) for apps from the Google Play Store, Chinese markets, and other origins; ≈80% and ≈70% annotations mark the dominant shares.]
Allix et al. AndroZoo: Collecting Millions of Android Apps for the Research Community. ACM MSR, 2016.
Arp et al. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket. NDSS, 2014.
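The shortcut can be illustrated with synthetic data: if app origin correlates with the label, a 'detector' that only checks the origin already looks accurate. The 90/10 correlation and balanced classes below are assumptions for illustration, not the paper's measured proportions:

```python
import random

random.seed(0)

# Synthetic sampling bias: most malware comes from market B and most
# benign apps from market A (illustrative 90/10 split). A classifier
# that only looks at the origin scores well -- a shortcut, not detection.
def make_dataset(n):
    data = []
    for _ in range(n):
        label = random.random() < 0.5  # 1 = malware, balanced classes
        if label:
            origin = "B" if random.random() < 0.9 else "A"
        else:
            origin = "A" if random.random() < 0.9 else "B"
        data.append((origin, int(label)))
    return data

dataset = make_dataset(10_000)
# Predict 'malware' whenever the app comes from market B.
correct = sum((origin == "B") == bool(label) for origin, label in dataset)
accuracy = correct / len(dataset)
print(f"origin-only accuracy: {accuracy:.2f}")  # close to 0.90
```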
Impact Study: Mobile Malware Detection (continued)
What is the impact?
• Comparison on datasets with (D1) and without (D2) the artifact
• Training of an SVM on two different feature sets
True positive rate:
• Drebin: 0.96 (D1) → 0.85 (D2), a relative drop of ≈11%
• Opseqs: 0.88 (D1) → 0.73 (D2), a relative drop of ≈17%
Results
• The experimental results show how sampling bias affects performance (P1)
• The URL "play.google.com" is among the top features in D1 (P4)
• Using accuracy would have underestimated the presence of bias (P7)
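To see why accuracy can mask such a drop (P7): with imbalanced test data, the benign class dominates the metric. A toy calculation (the class ratio and TNR are assumptions; only the two TPR values come from the slide):

```python
# Assumed test set: 1,000 malware and 9,000 benign samples, with a
# fixed true negative rate of 0.99 -- illustrative numbers only.
n_mal, n_ben, tnr = 1_000, 9_000, 0.99

def accuracy(tpr):
    # Overall accuracy over both classes
    return (tpr * n_mal + tnr * n_ben) / (n_mal + n_ben)

acc_d1 = accuracy(0.96)  # with artifact (D1)
acc_d2 = accuracy(0.85)  # without artifact (D2)
print(f"TPR 0.96 -> 0.85, but accuracy only {acc_d1:.3f} -> {acc_d2:.3f}")
```

Under these assumptions, an 11-point drop in TPR shrinks to roughly one point of accuracy, hiding the bias.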
Dos and Don’ts of Machine Learning in Computer Security
• We identify 10 subtle pitfalls affecting the field
• Find that they are prevalent throughout top research
• Demonstrate their impact through case studies
Updates on pitfalls and recommendations:
• https://dodo-mlsec.org/