Abstract
Many real-world datasets exhibit imbalanced class distributions, in which the majority classes have sufficient samples while the minority classes often have very few. Data resampling has proven effective in alleviating such imbalance, and feature selection is a commonly used technique for improving classification performance. However, the joint impact of feature selection and data resampling on two-class imbalance classification has rarely been addressed. This work investigates the performance of two opposing imbalanced classification frameworks, in which feature selection is applied either before or after data resampling. We conduct a large-scale empirical study with a total of 9225 experiments on 52 publicly available datasets. The results show that both frameworks should be considered when searching for the best-performing imbalanced classification model. We also study how the classifier, the imbalance ratio (IR, the ratio between the number of majority and minority samples), and the sample-to-feature ratio (SFR, the ratio between the number of samples and the number of features) affect the performance of imbalance classification. Overall, this work provides a new reference for researchers and practitioners in imbalance learning.
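As an illustration of the two frameworks compared in this study, the following is a minimal sketch in Python, assuming scikit-learn and imbalanced-learn are available; SMOTE, an ANOVA-based filter (SelectKBest), and a decision tree are illustrative stand-ins rather than the specific methods benchmarked in the paper.

```python
# Sketch of the two pipelines under comparison: feature selection applied
# before vs. after data resampling. The imblearn Pipeline applies the
# resampler only during fit, so test folds are never resampled.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class imbalanced data: roughly 5% minority class, 20 features.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95],
                           random_state=42)

# Framework 1: feature selection first, then data resampling (FS -> DR).
fs_then_dr = Pipeline([
    ("fs", SelectKBest(f_classif, k=10)),
    ("dr", SMOTE(random_state=42)),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# Framework 2: data resampling first, then feature selection (DR -> FS).
dr_then_fs = Pipeline([
    ("dr", SMOTE(random_state=42)),
    ("fs", SelectKBest(f_classif, k=10)),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

for name, pipe in [("FS -> DR", fs_then_dr), ("DR -> FS", dr_then_fs)]:
    scores = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: balanced accuracy = {scores.mean():.3f}")
```

Since neither ordering dominates in general, both pipelines would be evaluated per dataset and the better-performing one retained, which is the comparison the study carries out at scale.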
Change history
30 July 2022
A Correction to this paper has been published: https://doi.org/10.1007/s10489-022-03953-y
Acknowledgments
Prof. Chongsheng Zhang was partially funded by the Laboratory of Yellow River Heritage, Henan University, and the Henan Laboratory of Yellow River (Henan University). Prof. Salvador Garcia was partially supported by the research projects TIN2017-89517-P and A-TIC-434-UGR20.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Chongsheng Zhang and Paolo Soda contributed equally to this work.
The original online version of this article was revised: there was a mistake in the images and captions of Figures 10, 11, 12 and 13.
About this article
Cite this article
Zhang, C., Soda, P., Bi, J. et al. An empirical study on the joint impact of feature selection and data resampling on imbalance classification. Appl Intell 53, 5449–5461 (2023). https://doi.org/10.1007/s10489-022-03772-1