Feature mining and classifier selection for API calls-based malware detection

Balan, Gheorghe; Simion, Ciprian-Alin; Gavriluţ, Dragoş Teodor; Luchian, Henri

doi:10.1007/s10489-023-05086-2

Published: 21 October 2023

Volume 53, pages 29094–29108, (2023)

Applied Intelligence

Gheorghe Balan ORCID: orcid.org/0009-0001-0138-8604^1,2,
Ciprian-Alin Simion^1,2^na1,
Dragoş Teodor Gavriluţ^1,2^na1 &
…
Henri Luchian¹^na1

Abstract

This paper deals with a major challenge in cyber-security: the need to respond to the constantly renewed techniques that attackers use to avoid detection based on analysing static features of malware. These techniques consist of various changes in file geometry, entropy and so on. As a consequence, static malware feature sets describe malicious files less and less accurately; hence, the performance of machine learning models in detecting new variants of the same malware family may be severely impaired. The paper focuses on a promising approach to this detection challenge: defining file features based on OS (operating system) API (Application Program Interface) call sequences. We explore in detail the detection potential of such features because a file that is to act maliciously must issue these calls, so they are highly unlikely to be hidden. We studied several tens of thousands of such features, a modest-sized subset of which were subsequently fed to several machine learning models. The database used for training and testing consists of 1.5 million files, including malicious files from the polymorphic families Emotet and Trickbot. Using this database, nearly 4,000 pairings (classifier, feature selection algorithm) were trained and tested. Our experimental results show that the API calls-oriented feature mining process is well suited for detecting polymorphic malware. A comparative discussion of the detection results of the various models is presented; depending on the target optimisation criterion (detection rate / false positive rate / saving resources), three of the nearly 4,000 classification models turn out to be best suited for real-world applications: Random Forest, Legacy Neural Networks and Decision Tree.
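As a rough illustration of the general idea of mining features from API call sequences, the following minimal Python sketch turns call traces into contiguous n-grams, ranks them by how much more often they appear in malicious traces than in benign ones, and reports a detection rate and false positive rate for a trivial rule. Everything here is an assumption made for illustration: the abstract does not specify how the features are defined, so n-grams, the ranking score, the helper names (`ngrams`, `rank_features`, `to_vector`, `rates`) and the toy traces are hypothetical and are not the authors' method or data.

```python
from __future__ import annotations

from collections import Counter


def ngrams(calls: list[str], n: int = 2) -> set[tuple[str, ...]]:
    """Return the set of contiguous n-grams found in one API call trace."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


def rank_features(malicious, benign, n=2, top_k=2):
    """Rank n-grams by (share of malicious traces) - (share of benign traces).

    This score is a stand-in for a real feature selection algorithm.
    """
    mal_df = Counter(g for trace in malicious for g in ngrams(trace, n))
    ben_df = Counter(g for trace in benign for g in ngrams(trace, n))
    score = {
        g: mal_df[g] / len(malicious) - ben_df[g] / len(benign)
        for g in set(mal_df) | set(ben_df)
    }
    # Sort by descending score; ties are broken by the n-gram itself.
    return sorted(score, key=lambda g: (-score[g], g))[:top_k]


def to_vector(calls, features, n=2):
    """Encode one trace as a binary vector over the selected n-grams."""
    present = ngrams(calls, n)
    return [1 if f in present else 0 for f in features]


def rates(y_true, y_pred):
    """Return (detection rate, false positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


# Hypothetical toy traces; real traces would come from an emulator or sandbox.
malicious = [
    ["VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"],
    ["OpenProcess", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"],
]
benign = [
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["VirtualAlloc", "ReadFile", "VirtualFree"],
]

features = rank_features(malicious, benign)
vectors = [to_vector(trace, features) for trace in malicious + benign]
y_true = [1, 1, 0, 0]
# Trivial rule standing in for a trained classifier: flag if any feature fires.
y_pred = [1 if any(v) else 0 for v in vectors]
detection_rate, false_positive_rate = rates(y_true, y_pred)
```

In practice the binary vectors would be passed to a trained model (the abstract names Random Forest, neural networks and Decision Tree among the best performers) rather than to the any-feature rule used here.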

Availability of data and materials

The data that support the findings of this study are available from Bitdefender’s Cyber Threat Intelligence Lab but restrictions apply to the availability of these data, which were used under licence for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Bitdefender’s Cyber Threat Intelligence Lab.

Code Availability

The emulator used for this research is the property of Bitdefender, and restrictions apply to the availability of its source code. The designed algorithms are available from the corresponding author on reasonable request.

Notes

file-compacting techniques
Endpoint Detection and Response
eXtended Detection and Response
Security Information and Event Management
Managed Detection and Response
https://www.av-comparatives.org/tests/real-world-protection-test-february-may-2023/
Support Vector Machine
Principal Component Analysis
Exceptions include the APIs that trigger memory operations like VirtualAlloc/VirtualFree/etc.
The emulator will either allow or block a required access
true positive rate
true negative rate
The use of AdaBoost and DT in malware detection is presented, for example, in [33].
Sensitivity
Specificity
While a file is being scanned, access to that file is delayed until the scan process finishes.
The so-called whitelisting methods

References

Balan G, Gavriluţ DT, Luchian H (2022) Using api calls for sequence-pattern feature mining-based malware detection. In: Information security practice and experience, pp 233–251
Catalano C, Chezzi A, Angelelli M, Tommasi F (2022) Deceiving ai-based malware detection through polymorphic attacks. Comput Ind 143:103751. https://doi.org/10.1016/j.compind.2022.103751
Alhashmi AA, Darem AA, Alashjaee AM, Alanazi SM, Alkhaldi TM, Ebad SA, Ghaleb FA, Almadani AM (2023) Similarity-based hybrid malware detection model using api calls. Mathematics 11(13). https://doi.org/10.3390/math11132944
Pascanu R, Stokes J, Sanossian H, Marinescu M, Thomas A (2015) Malware classification with recurrent networks, pp 1916–1920. https://doi.org/10.1109/ICASSP.2015.7178304
Athiwaratkun B, Stokes J (2017) Malware classification with lstm and gru language models and a character-level cnn, pp 2482–2486. https://doi.org/10.1109/ICASSP.2017.7952603
Rabadi D, Teo S (2020) Advanced windows methods on malware detection and classification, pp 54–68. https://doi.org/10.1145/3427228.3427242
Amer E, Zelinka I (2020) A dynamic windows malware detection and prediction method based on contextual understanding of api call sequence. Comput Secur. https://doi.org/10.1016/j.cose.2020.101760
Amer E, El-Sappagh S, Hu J (2020) Contextual identification of windows malware through semantic interpretation of api call sequence. Appl Sci 10. https://doi.org/10.3390/app10217673
Li C, Cheng Z, Zhu H, Wang L, Lv Q, Wang Y, Li N, Sun D (2022) Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning. Comput Secur 122:102872. https://doi.org/10.1016/j.cose.2022.102872
Lin C-T, Wang N-J, Xiao H, Eckert C (2015) Feature selection and extraction for malware classification
Xu K, Li Y, Deng R, Chen K, Xu J (2019) Droidevolver: Self-evolving android malware detection system. https://doi.org/10.1109/EuroSP.2019.00014
Kim H, Kim J, Kim Y, Kim I, Kim K, Kim H (2019) Improvement of malware detection and classification using api call sequence alignment and visualization. Clust Comput 22. https://doi.org/10.1007/s10586-017-1110-2
Uppal D, Sinha R, Mehra V, Jain V (2014) Malware detection and classification based on extraction of api sequences, pp 2337–2342. https://doi.org/10.1109/ICACCI.2014.6968547
Choi S, Bae J, Lee C, Kim Y, Kim J (2020) Attention-based automated feature extraction for malware analysis. https://doi.org/10.3390/s20102893
Wang X, Wu P, Xu Q, Zeng Z, Xie Y (2021) Joint image clustering and feature selection with auto-adjoined learning for high-dimensional data. Knowl-Based Syst 232:107443. https://doi.org/10.1016/j.knosys.2021.107443
Tahir R (2018) A study on malware and malware detection techniques. Int J Educ Manag Eng 8:20–30. https://doi.org/10.5815/ijeme.2018.02.03
Anderson H (2017) Evading machine learning malware detection
Anderson H, Kharkar A, Filar B, Evans D, Roth P (2018) Learning to evade static pe machine learning malware models via reinforcement learning
TrendMicro (2023) DARKCOMET. https://www.trendmicro.com/vinfo/us/threat-encyclopedia/malware/DARKCOMET. Accessed 2023-07-31
Sentinel H (2023) HD Sentinel. https://www.hdsentinel.com/download.php. Accessed 2023-07-31
Virustotal (2023) DarkComet. https://www.virustotal.com/gui/file/707d4a225237425bb60718dd0b914cba. Accessed 2023-07-31
Lita C, Cosovan D, Gavrilut D (2018) Anti-emulation trends in modern packers: a survey on the evolution of anti-emulation techniques in upa packers. https://doi.org/10.1007/s11416-017-0291-9
Sundarkumar G, Vadlamani R, Nwogu I, Govindaraju V (2015) Malware detection via api calls, topic models and machine learning, pp 1212–1217. https://doi.org/10.1109/CoASE.2015.7294263
Alazab M, Venkatraman S, Watters P (2010) Towards understanding malware behaviour by the extraction of api calls. https://doi.org/10.1109/CTC.2010.8
Elhadi A, Maarof M, Barry B (2013) Improving the detection of malware behaviour using simplified data dependent api call graph. Int J Secur its Appl 7:29–42. https://doi.org/10.14257/ijsia.2013.7.5.03
Ki Y, Kim E, Kim HK (2015) A novel approach to detect malware based on api call sequence analysis. Int J Distrib Sens Netw 2015:1–9. https://doi.org/10.1155/2015/659101
Gavrilut D, Cimpoesu M, Anton D, Ciortuz L (2009) Malware detection using perceptrons and support vector machines, pp 283–288. https://doi.org/10.1109/ComputationWorld.2009.85
Balan G, Popescu A (2018) Detecting java compiled malware using machine learning techniques, pp 435–439. https://doi.org/10.1109/SYNASC.2018.00073
Gavrilut D, Benchea R, Vatamanu C (2012) Optimized zero false positives perceptron training for malware detection, pp 247–253. https://doi.org/10.1109/SYNASC.2012.34
Kurbiel T, Khaleghian S (2017) Training of deep neural networks based on distance measures using RMSProp. https://doi.org/10.48550/ARXIV.1708.01911
Zhao M, Ge F, Zhang T, Yuan Z (2011) Antimaldroid: An efficient svm-based malware detection framework for android. In: Liu C, Chang J, Yang A (eds.) Information computing and applications
Sanjaa B, Chuluun E (2013) Malware detection using linear svm. In: Ifost. https://doi.org/10.1109/IFOST.2013.6616872
Abu Al-Haija Q, Odeh A, Qattous H (2022) Pdf malware detection based on optimizable decision trees 11(19). https://doi.org/10.3390/electronics11193142
Garcia FCC, II FPM (2016) Random forest for malware classification. CoRR arXiv:1609.07770
Artur M (2021) Review the performance of the bernoulli naive bayes classifier in intrusion detection systems using recursive feature elimination with cross-validated selection of the best number of features. Procedia Comput Sci. https://doi.org/10.1016/j.procs.2021.06.066
Gavrilut DT, Anton DG, Popoiu G (2017) Machine learning based malware detection - how to balance memory footprint with model accuracy. In: 2017 19th International symposium on symbolic and numeric algorithms for scientific computing (SYNASC), pp 232–238. https://doi.org/10.1109/SYNASC.2017.00045

Acknowledgements

We would like to thank Bitdefender’s Cyber Threat Intelligence Lab for providing malicious and benign files needed to conduct our research.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Ciprian-Alin Simion, Dragoş Teodor Gavriluţ and Henri Luchian contributed equally to this work.

Authors and Affiliations

Faculty of Computer Science, Al.I. Cuza University, Iaşi, Romania
Gheorghe Balan, Ciprian-Alin Simion, Dragoş Teodor Gavriluţ & Henri Luchian
Cyber Threat Intelligence Laboratory, Bitdefender, Iaşi, Romania
Gheorghe Balan, Ciprian-Alin Simion & Dragoş Teodor Gavriluţ

Contributions

All authors contributed to the study conception, design and implementation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Gheorghe Balan.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest or competing interests regarding the publication of this study.

Consent to participate / for publication

All authors of this paper made substantial contributions to the conception and design of the work, drafted the work or revised it critically for important intellectual content, approved the version to be published and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Balan, G., Simion, CA., Gavriluţ, D.T. et al. Feature mining and classifier selection for API calls-based malware detection. Appl Intell 53, 29094–29108 (2023). https://doi.org/10.1007/s10489-023-05086-2

Accepted: 05 October 2023
Published: 21 October 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10489-023-05086-2
