Abstract
This paper addresses a major challenge in cyber-security: the need to respond to the constantly renewed techniques that attackers use to evade detection based on static malware features. These techniques involve various changes in file geometry, entropy, and so on. As a consequence, static feature sets describe malicious files less and less accurately; hence, the performance of machine learning models in detecting new variants of the same malware family may be severely impaired. The paper focuses on a promising approach to this detection challenge: defining file features based on sequences of OS (operating system) API (Application Program Interface) calls. We explore in detail the detection potential of such features, since the API calls a file must make in order to act maliciously are highly unlikely to be hidden. We studied several tens of thousands of such features, a modest-sized subset of which was subsequently fed to several machine learning models. The database used for training and testing consists of 1.5 million files, including malicious files from the polymorphic families Emotet and Trickbot. Using this database, nearly 4,000 (classifier, feature selection algorithm) pairings were trained and tested. Our experimental results show that the API calls-oriented feature mining process is well suited for detecting polymorphic malware. A comparative discussion of the detection results of the various models is presented; depending on the target optimisation criterion (detection rate, false positive rate, or resource savings), three of the nearly 4,000 classification models turn out to be best suited for real-world applications: Random Forest, Legacy Neural Networks and Decision Tree.
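To make the idea of API call sequence features concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that a sequence-pattern feature is a contiguous n-gram of API calls, that features are selected by the difference in how often they occur in malicious versus benign traces, and that a simple "any selected pattern present" rule stands in for a trained classifier. The traces, the function names and the scoring rule are all hypothetical and chosen only for illustration.

```python
from collections import Counter

Trace = list[str]        # one file's API calls, in the order they were observed
NGram = tuple[str, ...]  # a candidate sequence-pattern feature (assumed form)


def ngrams(calls: Trace, n: int = 2) -> set[NGram]:
    """Return the set of contiguous n-grams found in one API call trace."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


def select_features(traces: list[Trace], labels: list[int],
                    n: int = 2, top_k: int = 3) -> list[NGram]:
    """Keep the top_k n-grams ranked by |malicious freq - benign freq|.

    This scoring rule is an illustrative assumption, not one of the
    feature selection algorithms evaluated in the paper.
    """
    mal, ben = Counter(), Counter()
    n_mal = max(sum(labels), 1)
    n_ben = max(len(labels) - sum(labels), 1)
    for calls, label in zip(traces, labels):
        (mal if label else ben).update(ngrams(calls, n))

    def score(g: NGram) -> float:
        return abs(mal[g] / n_mal - ben[g] / n_ben)

    # Sort by descending score; break ties by the n-gram itself for determinism.
    return sorted(set(mal) | set(ben), key=lambda g: (-score(g), g))[:top_k]


def detect(calls: Trace, selected: list[NGram], n: int = 2) -> bool:
    """Stand-in rule: flag the file if any selected pattern is present."""
    present = ngrams(calls, n)
    return any(g in present for g in selected)


def rates(labels: list[int], preds: list[bool]) -> tuple[float, float]:
    """Return (detection rate, false positive rate)."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Made-up traces (1 = malicious, 0 = benign); not taken from the paper's data.
traces = [
    ["OpenProcess", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"],
    ["CreateFileW", "OpenProcess", "VirtualAllocEx", "WriteProcessMemory",
     "CreateRemoteThread"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "CloseHandle"],
]
labels = [1, 1, 0, 0]

selected = select_features(traces, labels)
preds = [detect(t, selected) for t in traces]
tpr, fpr = rates(labels, preds)
print("selected patterns:", selected)
print("detection rate:", tpr, "false positive rate:", fpr)
```

In a realistic setting the selected patterns would be turned into feature vectors and passed to a trained model such as a random forest or a decision tree, and the rates would be measured on files held out from training rather than on the training traces themselves.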



Availability of data and materials
The data that support the findings of this study are available from Bitdefender’s Cyber Threat Intelligence Lab but restrictions apply to the availability of these data, which were used under licence for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Bitdefender’s Cyber Threat Intelligence Lab.
Code Availability
The emulator used for this research is the property of Bitdefender, and restrictions apply to the availability of its source code. The designed algorithms are available from the corresponding author on reasonable request.
Notes
file-compacting techniques
Endpoint Detection and Response
eXtended Detection and Response
Security Information and Event Management
Managed Detection and Response
Support Vector Machine
Principal Component Analysis
Exceptions include the APIs that trigger memory operations like VirtualAlloc/VirtualFree/etc.
The emulator will either allow or block a required access
true positive rate
true negative rate
The use of AdaBoost and DT in malware detection is presented, for example, in [33].
Sensitivity
Specificity
While a file is being scanned, access to that file is delayed until the scan process finishes.
The so-called whitelisting methods
References
Balan G, Gavriluţ DT, Luchian H (2022) Using api calls for sequence-pattern feature mining-based malware detection. In: Information security practice and experience, pp 233–251
Catalano C, Chezzi A, Angelelli M, Tommasi F (2022) Deceiving ai-based malware detection through polymorphic attacks. Comput Ind 143:103751. https://doi.org/10.1016/j.compind.2022.103751
Alhashmi AA, Darem AA, Alashjaee AM, Alanazi SM, Alkhaldi TM, Ebad SA, Ghaleb FA, Almadani AM (2023) Similarity-based hybrid malware detection model using api calls. Mathematics 11(13). https://doi.org/10.3390/math11132944
Pascanu R, Stokes J, Sanossian H, Marinescu M, Thomas A (2015) Malware classification with recurrent networks, pp 1916–1920. https://doi.org/10.1109/ICASSP.2015.7178304
Athiwaratkun B, Stokes J (2017) Malware classification with lstm and gru language models and a character-level cnn, pp 2482–2486. https://doi.org/10.1109/ICASSP.2017.7952603
Rabadi D, Teo S (2020) Advanced windows methods on malware detection and classification, pp 54–68. https://doi.org/10.1145/3427228.3427242
Amer E, Zelinka I (2020) A dynamic windows malware detection and prediction method based on contextual understanding of api call sequence. Comput Secur. https://doi.org/10.1016/j.cose.2020.101760
Amer E, El-Sappagh S, Hu J (2020) Contextual identification of windows malware through semantic interpretation of api call sequence. Appl Sci 10. https://doi.org/10.3390/app10217673
Li C, Cheng Z, Zhu H, Wang L, Lv Q, Wang Y, Li N, Sun D (2022) Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning. Comput Secur 122:102872. https://doi.org/10.1016/j.cose.2022.102872
Lin C-T, Wang N-J, Xiao H, Eckert C (2015) Feature selection and extraction for malware classification
Xu K, Li Y, Deng R, Chen K, Xu J (2019) Droidevolver: Self-evolving android malware detection system. https://doi.org/10.1109/EuroSP.2019.00014
Kim H, Kim J, Kim Y, Kim I, Kim K, Kim H (2019) Improvement of malware detection and classification using api call sequence alignment and visualization. Clust Comput 22. https://doi.org/10.1007/s10586-017-1110-2
Uppal D, Sinha R, Mehra V, Jain V (2014) Malware detection and classification based on extraction of api sequences, pp 2337–2342. https://doi.org/10.1109/ICACCI.2014.6968547
Choi S, Bae J, Lee C, Kim Y, Kim J (2020) Attention-based automated feature extraction for malware analysis. https://doi.org/10.3390/s20102893
Wang X, Wu P, Xu Q, Zeng Z, Xie Y (2021) Joint image clustering and feature selection with auto-adjoined learning for high-dimensional data. Knowl-Based Syst 232:107443. https://doi.org/10.1016/j.knosys.2021.107443
Tahir R (2018) A study on malware and malware detection techniques. Int J Educ Manag Eng 8:20–30. https://doi.org/10.5815/ijeme.2018.02.03
Anderson H (2017) Evading machine learning malware detection
Anderson H, Kharkar A, Filar B, Evans D, Roth P (2018) Learning to evade static pe machine learning malware models via reinforcement learning
TrendMicro (2023) DARKCOMET. https://www.trendmicro.com/vinfo/us/threat-encyclopedia/malware/DARKCOMET. Accessed 2023-07-31
Sentinel H (2023) HD Sentinel. https://www.hdsentinel.com/download.php. Accessed 2023-07-31
Virustotal (2023) DarkComet. https://www.virustotal.com/gui/file/707d4a225237425bb60718dd0b914cba. Accessed 2023-07-31
Lita C, Cosovan D, Gavrilut D (2018) Anti-emulation trends in modern packers: a survey on the evolution of anti-emulation techniques in upa packers. https://doi.org/10.1007/s11416-017-0291-9
Sundarkumar G, Vadlamani R, Nwogu I, Govindaraju V (2015) Malware detection via api calls, topic models and machine learning, pp 1212–1217. https://doi.org/10.1109/CoASE.2015.7294263
Alazab M, Venkatraman S, Watters P (2010) Towards understanding malware behaviour by the extraction of api calls. https://doi.org/10.1109/CTC.2010.8
Elhadi A, Maarof M, Barry B (2013) Improving the detection of malware behaviour using simplified data dependent api call graph. Int J Secur its Appl 7:29–42. https://doi.org/10.14257/ijsia.2013.7.5.03
Ki Y, Kim E, Kim HK (2015) A novel approach to detect malware based on api call sequence analysis. Int J Distrib Sens Netw 2015:1–9. https://doi.org/10.1155/2015/659101
Gavrilut D, Cimpoesu M, Anton D, Ciortuz L (2009) Malware detection using perceptrons and support vector machines, pp 283–288. https://doi.org/10.1109/ComputationWorld.2009.85
Balan G, Popescu A (2018) Detecting java compiled malware using machine learning techniques, pp 435–439. https://doi.org/10.1109/SYNASC.2018.00073
Gavrilut D, Benchea R, Vatamanu C (2012) Optimized zero false positives perceptron training for malware detection, pp 247–253. https://doi.org/10.1109/SYNASC.2012.34
Kurbiel T, Khaleghian S (2017) Training of deep neural networks based on distance measures using RMSProp. https://doi.org/10.48550/ARXIV.1708.01911
Zhao M, Ge F, Zhang T, Yuan Z (2011) Antimaldroid: An efficient svm-based malware detection framework for android. In: Liu C, Chang J, Yang A (eds.) Information computing and applications
Sanjaa B, Chuluun E (2013) Malware detection using linear svm. In: Ifost. https://doi.org/10.1109/IFOST.2013.6616872
Abu Al-Haija Q, Odeh A, Qattous H (2022) Pdf malware detection based on optimizable decision trees 11(19). https://doi.org/10.3390/electronics11193142
Garcia FCC, II FPM (2016) Random forest for malware classification. CoRR arXiv:1609.07770
Artur M (2021) Review the performance of the bernoulli naive bayes classifier in intrusion detection systems using recursive feature elimination with cross-validated selection of the best number of features. Procedia Comput Sci. https://doi.org/10.1016/j.procs.2021.06.066
Gavrilut DT, Anton DG, Popoiu G (2017) Machine learning based malware detection - how to balance memory footprint with model accuracy. In: 2017 19th International symposium on symbolic and numeric algorithms for scientific computing (SYNASC), pp 232–238. https://doi.org/10.1109/SYNASC.2017.00045
Acknowledgements
We would like to thank Bitdefender’s Cyber Threat Intelligence Lab for providing malicious and benign files needed to conduct our research.
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
All authors contributed to the study conception, design and implementation. All authors read and approved the final manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest or competing interests regarding the publication of this study.
Consent to participate / for publication
All authors of this paper made substantial contributions to the conception and design of the work, drafted the work or revised it critically for important intellectual content, approved the version to be published and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Balan, G., Simion, CA., Gavriluţ, D.T. et al. Feature mining and classifier selection for API calls-based malware detection. Appl Intell 53, 29094–29108 (2023). https://doi.org/10.1007/s10489-023-05086-2
DOI: https://doi.org/10.1007/s10489-023-05086-2