Feature mining and classifier selection for API calls-based malware detection

Applied Intelligence

Abstract

This paper addresses a major challenge in cyber-security: the need to respond to the constantly renewed techniques that attackers use to evade detection based on analysing static features of malware. These techniques involve various changes in file geometry, entropy, and so on. As a consequence, static malware feature sets describe malicious files less and less accurately; hence, the performance of machine learning models in detecting new variants of the same malware family may be severely impaired. The paper focuses on a promising approach to this detection challenge: defining file features based on sequences of OS (operating system) API (Application Programming Interface) calls. We explore in detail the detection potential of such features since, in order to act maliciously, a file must issue these calls, which makes such features highly unlikely to be hidden. We studied several tens of thousands of such features, a modest-sized subset of which was subsequently fed to several machine learning models. The database used for training and testing consists of 1.5 million files, including malicious files from the polymorphic families Emotet and Trickbot. Using this database, nearly 4,000 pairings (classifier, feature selection algorithm) were trained and tested. Our experimental results show that the API calls-oriented feature mining process is well suited for detecting polymorphic malware. A comparative discussion of the detection results of the various models is presented; depending on the target optimisation criterion (detection rate / false positive rate / saving resources), three of the nearly 4,000 classification models turn out to be best suited for real-world applications: Random Forest, Legacy Neural Networks and Decision Tree.
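
As a rough illustration of the general idea of mining features from API call sequences, the following minimal sketch treats contiguous pairs of API names as binary features, keeps the pairs whose frequency differs most between malicious and benign traces, and reports a detection rate and a false positive rate. Everything in it is an assumption made for illustration: the toy traces, the function names, the frequency-difference score and the "any selected feature present" decision rule are invented here and are not the paper's emulator output, feature selection algorithms or classifiers.

```python
from collections import Counter


def ngram_features(calls, n=2):
    """Return the set of contiguous n-grams of API names in one trace."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


def select_features(traces, labels, n=2, top_k=2):
    """Keep the top_k n-grams whose presence rate differs most between classes.

    The frequency-difference score is an illustrative stand-in, not one of
    the feature selection algorithms evaluated in the paper.
    """
    malicious, benign = Counter(), Counter()
    for calls, is_malicious in zip(traces, labels):
        (malicious if is_malicious else benign).update(ngram_features(calls, n))
    n_mal = sum(labels) or 1
    n_ben = (len(labels) - sum(labels)) or 1
    score = {
        g: abs(malicious[g] / n_mal - benign[g] / n_ben)
        for g in set(malicious) | set(benign)
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(score, key=lambda g: (-score[g], g))[:top_k]


def vectorize(calls, features, n=2):
    """Encode one trace as a 0/1 vector over the selected n-grams."""
    present = ngram_features(calls, n)
    return [1 if f in present else 0 for f in features]


def rates(y_true, y_pred):
    """Return (detection rate, false positive rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    return (tp / pos if pos else 0.0, fp / neg if neg else 0.0)


# Toy traces (invented for this sketch); 1 = malicious, 0 = benign.
traces = [
    ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
    ["OpenProcess", "VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "CloseHandle"],
]
labels = [1, 1, 0, 0]

selected = select_features(traces, labels)
# Placeholder decision rule: flag a trace if any selected n-gram occurs in it.
predictions = [1 if any(vectorize(t, selected)) else 0 for t in traces]
detection_rate, false_positive_rate = rates(labels, predictions)
```

In a realistic setting the 0/1 vectors produced by vectorize would be passed to a trained classifier and evaluated on held-out files rather than on the training traces used here.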

Availability of data and materials

The data that support the findings of this study are available from Bitdefender’s Cyber Threat Intelligence Lab but restrictions apply to the availability of these data, which were used under licence for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Bitdefender’s Cyber Threat Intelligence Lab.

Code Availability

The emulator used for this research is the property of Bitdefender, and restrictions apply to the availability of its source code. The designed algorithms are available from the corresponding author on reasonable request.

Notes

  1. file-compacting techniques

  2. Endpoint Detection and Response

  3. eXtended Detection and Response

  4. Security Information and Event Management

  5. Managed Detection and Response

  6. https://www.av-comparatives.org/tests/real-world-protection-test-february-may-2023/

  7. Support Vector Machine

  8. Principal Component Analysis

  9. Exceptions include the APIs that trigger memory operations like VirtualAlloc/VirtualFree/etc.

  10. The emulator will either allow or block a required access

  11. true positive rate

  12. true negative rate

  13. The use of AdaBoost and DT in malware detection is presented, for example, in [33].

  14. Sensitivity

  15. Specificity

  16. While a file is being scanned, access to that file is delayed until the scan process finishes.

  17. The so-called whitelisting methods

References

  1. Balan G, Gavriluţ DT, Luchian H (2022) Using api calls for sequence-pattern feature mining-based malware detection. In: Information security practice and experience, pp 233–251

  2. Catalano C, Chezzi A, Angelelli M, Tommasi F (2022) Deceiving ai-based malware detection through polymorphic attacks. Comput Ind 143:103751. https://doi.org/10.1016/j.compind.2022.103751

  3. Alhashmi AA, Darem AA, Alashjaee AM, Alanazi SM, Alkhaldi TM, Ebad SA, Ghaleb FA, Almadani AM (2023) Similarity-based hybrid malware detection model using api calls. Mathematics 11(13). https://doi.org/10.3390/math11132944

  4. Pascanu R, Stokes J, Sanossian H, Marinescu M, Thomas A (2015) Malware classification with recurrent networks, pp 1916–1920. https://doi.org/10.1109/ICASSP.2015.7178304

  5. Athiwaratkun B, Stokes J (2017) Malware classification with lstm and gru language models and a character-level cnn, pp 2482–2486. https://doi.org/10.1109/ICASSP.2017.7952603

  6. Rabadi D, Teo S (2020) Advanced windows methods on malware detection and classification, pp 54–68. https://doi.org/10.1145/3427228.3427242

  7. Amer E, Zelinka I (2020) A dynamic windows malware detection and prediction method based on contextual understanding of api call sequence. Comput Secur. https://doi.org/10.1016/j.cose.2020.101760

  8. Amer E, El-Sappagh S, Hu J (2020) Contextual identification of windows malware through semantic interpretation of api call sequence. Appl Sci 10. https://doi.org/10.3390/app10217673

  9. Li C, Cheng Z, Zhu H, Wang L, Lv Q, Wang Y, Li N, Sun D (2022) Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning. Comput Secur 122:102872. https://doi.org/10.1016/j.cose.2022.102872

  10. Lin C-T, Wang N-J, Xiao H, Eckert C (2015) Feature selection and extraction for malware classification

  11. Xu K, Li Y, Deng R, Chen K, Xu J (2019) Droidevolver: Self-evolving android malware detection system. https://doi.org/10.1109/EuroSP.2019.00014

  12. Kim H, Kim J, Kim Y, Kim I, Kim K, Kim H (2019) Improvement of malware detection and classification using api call sequence alignment and visualization. Clust Comput 22. https://doi.org/10.1007/s10586-017-1110-2

  13. Uppal D, Sinha R, Mehra V, Jain V (2014) Malware detection and classification based on extraction of api sequences, pp 2337–2342. https://doi.org/10.1109/ICACCI.2014.6968547

  14. Choi S, Bae J, Lee C, Kim Y, Kim J (2020) Attention-based automated feature extraction for malware analysis. https://doi.org/10.3390/s20102893

  15. Wang X, Wu P, Xu Q, Zeng Z, Xie Y (2021) Joint image clustering and feature selection with auto-adjoined learning for high-dimensional data. Knowl-Based Syst 232:107443. https://doi.org/10.1016/j.knosys.2021.107443

  16. Tahir R (2018) A study on malware and malware detection techniques. Int J Educ Manag Eng 8:20–30. https://doi.org/10.5815/ijeme.2018.02.03

  17. Anderson H (2017) Evading machine learning malware detection

  18. Anderson H, Kharkar A, Filar B, Evans D, Roth P (2018) Learning to evade static pe machine learning malware models via reinforcement learning

  19. TrendMicro (2023) DARKCOMET. https://www.trendmicro.com/vinfo/us/threat-encyclopedia/malware/DARKCOMET. Accessed 2023-07-31

  20. Sentinel H (2023) HD Sentinel. https://www.hdsentinel.com/download.php. Accessed 2023-07-31

  21. Virustotal (2023) DarkComet. https://www.virustotal.com/gui/file/707d4a225237425bb60718dd0b914cba. Accessed 2023-07-31

  22. Lita C, Cosovan D, Gavrilut D (2018) Anti-emulation trends in modern packers: a survey on the evolution of anti-emulation techniques in upa packers. https://doi.org/10.1007/s11416-017-0291-9

  23. Sundarkumar G, Vadlamani R, Nwogu I, Govindaraju V (2015) Malware detection via api calls, topic models and machine learning, pp 1212–1217. https://doi.org/10.1109/CoASE.2015.7294263

  24. Alazab M, Venkatraman S, Watters P (2010) Towards understanding malware behaviour by the extraction of api calls. https://doi.org/10.1109/CTC.2010.8

  25. Elhadi A, Maarof M, Barry B (2013) Improving the detection of malware behaviour using simplified data dependent api call graph. Int J Secur its Appl 7:29–42. https://doi.org/10.14257/ijsia.2013.7.5.03

  26. Ki Y, Kim E, Kim HK (2015) A novel approach to detect malware based on api call sequence analysis. Int J Distrib Sens Netw 2015:1–9. https://doi.org/10.1155/2015/659101

  27. Gavrilut D, Cimpoesu M, Anton D, Ciortuz L (2009) Malware detection using perceptrons and support vector machines, pp 283–288. https://doi.org/10.1109/ComputationWorld.2009.85

  28. Balan G, Popescu A (2018) Detecting java compiled malware using machine learning techniques, pp 435–439. https://doi.org/10.1109/SYNASC.2018.00073

  29. Gavrilut D, Benchea R, Vatamanu C (2012) Optimized zero false positives perceptron training for malware detection, pp 247–253. https://doi.org/10.1109/SYNASC.2012.34

  30. Kurbiel T, Khaleghian S (2017) Training of deep neural networks based on distance measures using RMSProp. https://doi.org/10.48550/ARXIV.1708.01911

  31. Zhao M, Ge F, Zhang T, Yuan Z (2011) Antimaldroid: An efficient svm-based malware detection framework for android. In: Liu C, Chang J, Yang A (eds.) Information computing and applications

  32. Sanjaa B, Chuluun E (2013) Malware detection using linear svm. In: Ifost. https://doi.org/10.1109/IFOST.2013.6616872

  33. Abu Al-Haija Q, Odeh A, Qattous H (2022) Pdf malware detection based on optimizable decision trees. Electronics 11(19). https://doi.org/10.3390/electronics11193142

  34. Garcia FCC, II FPM (2016) Random forest for malware classification. CoRR arXiv:1609.07770

  35. Artur M (2021) Review the performance of the bernoulli naive bayes classifier in intrusion detection systems using recursive feature elimination with cross-validated selection of the best number of features. Procedia Comput Sci. https://doi.org/10.1016/j.procs.2021.06.066

  36. Gavrilut DT, Anton DG, Popoiu G (2017) Machine learning based malware detection - how to balance memory footprint with model accuracy. In: 2017 19th International symposium on symbolic and numeric algorithms for scientific computing (SYNASC), pp 232–238. https://doi.org/10.1109/SYNASC.2017.00045

Acknowledgements

We would like to thank Bitdefender’s Cyber Threat Intelligence Lab for providing malicious and benign files needed to conduct our research.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Contributions

All authors contributed to the study conception, design and implementation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Gheorghe Balan.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest or competing interests regarding the publication of this study.

Consent to participate / for publication

All authors of this paper made substantial contributions to the conception and design of the work, drafted the work or revised it critically for important intellectual content, approved the version to be published and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Balan, G., Simion, CA., Gavriluţ, D.T. et al. Feature mining and classifier selection for API calls-based malware detection. Appl Intell 53, 29094–29108 (2023). https://doi.org/10.1007/s10489-023-05086-2

  • DOI: https://doi.org/10.1007/s10489-023-05086-2