A Boundary-Information-Based Oversampling Approach to Improve Learning Performance for Imbalanced Datasets
Abstract
1. Introduction
- (1) Algorithmic modification. Some traditional classifiers work well on imbalanced data once their internal operations are modified. Yu et al. [10] utilized an optimized decision threshold adjustment strategy in a support vector machine (SVM). Zhao et al. [11] proposed a weighted maximum margin criterion to optimize the data-dependent kernel in an SVM. In addition to kernel-based SVMs, fuzzy rule-based classification systems have been modified to deal with imbalanced data, as in López et al. [12] and Alshomrani et al. [13].
- (2) Cost-sensitive classification. This approach takes into account that misclassifying the minority class is more costly than misclassifying the majority class. Zhou and Liu [14] moved the output threshold toward the majority class when training a cost-sensitive neural network, so that minority class examples become harder to misclassify. Siers and Islam [15] proposed a cost-sensitive voting technique to minimize the classification costs of a decision forest. Lee et al. [16] categorized instances based on an SVM’s margin to adjust their weights in AdaBoost.
- (3) Ensemble learning. This class of methods reduces variance by aggregating the predictions of a set of base classifiers. Sun et al. [17] investigated cost-sensitive boosting algorithms with different weight updating strategies for imbalanced data. Sun et al. [18] turned an imbalanced dataset into multiple balanced sub-datasets and used them to train base classifiers. Another common form of ensemble learning combines it with resampling techniques, such as SMOTEBagging [19], random balance-boost [20], and the synthetic oversampling ensemble [21].
- (4) Data particle geometrical divide (GD). The GD technique creates class-based data particles and classifies data examples by comparing the data gravitation between different data particles. Rybak and Dudczyk [22] developed a new GD method with four algorithms for determining the mass of a data particle, effectively improving gravitational classification on the Moons and Circles datasets. Furthermore, Rybak and Dudczyk [23] proposed a GD variant named unequal geometrical divide to improve classification performance on imbalanced occupancy detection datasets.
- (5) Resampling techniques. Here, the aim is to balance the class distribution by removing majority class examples (undersampling) or by inflating the minority class with new examples (oversampling). Since the synthetic minority oversampling technique (SMOTE) [24] was proposed in 2002, it has become one of the most influential data preprocessing/oversampling techniques in machine learning and data mining (a minimal sketch of its interpolation step is given after this list). To improve SMOTE, undersampling techniques such as the condensed nearest neighbor (CNN) rule [25] and Tomek links [26] have been applied after the oversampling. SMOTE–IPF [27] is another combined resampling method; it uses an iterative-partitioning filter [28] to remove noisy samples from both the majority and minority classes, cleaning up the class boundaries and making them more regular. Li et al. [29] used the mega-trend-diffusion technique [30] for undersampling and a two-parameter Weibull distribution estimation for oversampling. Improved oversampling techniques are introduced in Section 2.
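Below is that sketch: SMOTE creates each synthetic example by interpolating between a randomly chosen minority example and one of its k nearest minority-class neighbors. The function and parameter names are illustrative only; in practice, a maintained implementation such as the smote-variants Python package should be preferred.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, seed=0):
    """Minimal SMOTE-style interpolation over the minority class only."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    samples = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # neighbors of X_min[i] among the minority class (index 0 is the point itself)
        nbrs = nn.kneighbors(X_min[i].reshape(1, -1), return_distance=False)[0][1:]
        j = rng.choice(nbrs)
        gap = rng.uniform(0.0, 1.0)
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)
```

Borderline-SMOTE, ADASYN, and the other variants reviewed in Section 2 mainly differ in how the seed example and the interpolation weight are chosen.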
2. Oversampling Techniques
3. Motivation
4. Methodology
4.1. Boundary Information
4.2. Procedure for the Boundary-Information-Based Oversampler
4.3. Computational Complexity of BIBO
4.4. Strengths Analysis
5. Experiment
5.1. Evaluation Metrics
5.2. Dataset Description
5.2.1. The Simulated Datasets
5.2.2. The Real-World Datasets
5.3. Oversampler Performance Evaluation
6. Results and Discussion
6.1. Comparative Strengths Results
- (1) All the oversamplers outperform RAW on rec and some of the other basic metrics, but not on acc and spec. This means that, with oversampling, more positive examples are correctly classified, but at the same time many more negative examples are incorrectly classified as positive.
- (2) The oversamplers outperform RAW on most of the integrated metrics and in terms of the average of all of the metrics (ave). This means that oversampling techniques are helpful for imbalanced data learning problems.
- (3) Compared with SMOTE, the DIBOs perform better on rec and the other basic metrics, and most SIBOs are better on acc and spec (the exception is SL_SMOTE, which creates too many virtual samples in negative class areas). In other words, DIBOs tend not to miss positive examples that should be classified correctly, while SIBOs reduce the number of negative examples that are incorrectly classified as positive. However, DIBOs and SIBOs do not always outperform SMOTE on the integrated metrics.
- (4) On the acc and spec basic metrics, BIBO performs well, like the SIBOs, and it outperforms all the DIBOs. Conversely, on rec, BIBO outperforms all the SIBOs, similar to the DIBOs. These findings confirm that BIBO is generally better than both the SIBOs and DIBOs because it moves virtual samples toward the decision boundaries.
- (5) BIBO achieves better results on Fmeas, AUC, and ave. Hence, BIBO is a better oversampler for imbalanced dataset learning problems.
6.2. Performance Results
- (1) When C4.5 is used as the classifier, BIBO obtains better results on all of the metrics, even acc. It can be deduced that the virtual samples created by BIBO lie near the real decision boundaries captured by the decision tree’s split nodes.
- (2) Across the four classifiers and the five metrics for each of them, 20 metric results in total, half indicate that BIBO has the better performance (the values in bold and underlined). This further confirms that BIBO is a better technique for imbalanced data learning problems.
- (3) Some oversampler results are not better than those for RAW, especially in the case of SVC_L and SVC_S. This may be caused by (1) contradictory metrics, (2) blurring of the overlapping class regions, (3) noise introduced by the virtual samples, or (4) the limited effectiveness of some classifiers on certain imbalanced datasets.
6.3. Comparative Results of Computational Complexity
6.4. An Example of Using the Proposed BIBO Method
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wang, G. D-self-SMOTE: New method for customer credit risk prediction based on self-training and smote. ICIC Express Lett. Part B Appl. Int. J. Res. Surv. 2018, 9, 241–246. [Google Scholar]
- Veganzones, D.; Séverin, E. An investigation of bankruptcy prediction in imbalanced datasets. Decis. Support Syst. 2018, 112, 111–124. [Google Scholar] [CrossRef]
- Mao, W.; Liu, Y.; Ding, L.; Li, Y. Imbalanced fault diagnosis of rolling bearing based on generative adversarial network: A comparative study. IEEE Access 2019, 7, 9515–9530. [Google Scholar] [CrossRef]
- Al-Shehari, T.; Alsowail, R.A. An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques. Entropy 2021, 23, 1258. [Google Scholar] [CrossRef] [PubMed]
- Lokanan, M.; Liu, S. Predicting Fraud Victimization Using Classical Machine Learning. Entropy 2021, 23, 300. [Google Scholar] [CrossRef]
- Jo, T.; Japkowicz, N. Class imbalances versus small disjuncts. ACM Sigkdd Explor. Newsl. 2004, 6, 40–49. [Google Scholar] [CrossRef]
- Weiss, G.M. The impact of small disjuncts on classifier learning. In Data Mining; Springer: Boston, MA, USA, 2010; pp. 193–226. [Google Scholar]
- García, V.; Alejo, R.; Sánchez, J.S.; Sotoca, J.M.; Mollineda, R.A. Combined Effects of Class Imbalance and Class Overlap on Instance-Based Classification. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning, Burgos, Spain, 20–23 September 2006; pp. 371–378. [Google Scholar]
- García, V.; Mollineda, R.A.; Sánchez, J.S. On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal. Appl. 2008, 11, 269–280. [Google Scholar] [CrossRef]
- Yu, H.; Mu, C.; Sun, C.; Yang, W.; Yang, X.; Zuo, X. Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data. Knowl. Based. Syst. 2015, 76, 67–78. [Google Scholar] [CrossRef]
- Zhao, Z.; Zhong, P.; Zhao, Y. Learning SVM with weighted maximum margin criterion for classification of imbalanced data. Math. Comput. Modell. 2011, 54, 1093–1099. [Google Scholar] [CrossRef]
- López, V.; Fernández, A.; Del Jesus, M.J.; Herrera, F. A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline data-sets. Knowl. Based. Syst. 2013, 38, 85–104. [Google Scholar] [CrossRef]
- Alshomrani, S.; Bawakid, A.; Shim, S.-O.; Fernández, A.; Herrera, F. A proposal for evolutionary fuzzy systems using feature weighting: Dealing with overlapping in imbalanced datasets. Knowl. Based. Syst. 2015, 73, 1–17. [Google Scholar] [CrossRef]
- Zhou, Z.-H.; Liu, X.-Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng. 2005, 18, 63–77. [Google Scholar] [CrossRef]
- Siers, M.J.; Islam, M.Z. Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf. Syst. 2015, 51, 62–71. [Google Scholar] [CrossRef]
- Lee, W.; Jun, C.-H.; Lee, J.-S. Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Inf. Sci. 2017, 381, 92–103. [Google Scholar] [CrossRef]
- Sun, Y.; Kamel, M.S.; Wong, A.K.; Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 2007, 40, 3358–3378. [Google Scholar] [CrossRef]
- Sun, Z.; Song, Q.; Zhu, X.; Sun, H.; Xu, B.; Zhou, Y. A novel ensemble method for classifying imbalanced data. Pattern Recognit. 2015, 48, 1623–1637. [Google Scholar] [CrossRef]
- Wang, S.; Yao, X. Diversity Analysis on Imbalanced Data Sets by Using Ensemble Models. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, Nashville, TN, USA, 30 March–2 April 2009; pp. 324–331. [Google Scholar]
- Díez-Pastor, J.F.; Rodríguez, J.J.; Garcia-Osorio, C.; Kuncheva, L.I. Random balance: Ensembles of variable priors classifiers for imbalanced data. Knowl. Based. Syst. 2015, 85, 96–111. [Google Scholar] [CrossRef]
- Wang, Q.; Luo, Z.; Huang, J.; Feng, Y.; Liu, Z. A novel ensemble method for imbalanced data learning: Bagging of extrapolation-SMOTE SVM. Comput. Intell. Neurosci. 2017, 2017, 1827016. [Google Scholar] [CrossRef]
- Rybak, Ł.; Dudczyk, J. A geometrical divide of data particle in gravitational classification of moons and circles data sets. Entropy 2020, 22, 1088. [Google Scholar] [CrossRef]
- Rybak, Ł.; Dudczyk, J. Variant of Data Particle Geometrical Divide for Imbalanced Data Sets Classification by the Example of Occupancy Detection. Appl. Sci. 2021, 11, 4970. [Google Scholar] [CrossRef]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Hart, P. The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 1968, 14, 515–516. [Google Scholar] [CrossRef]
- Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772. [Google Scholar]
- Sáez, J.A.; Luengo, J.; Stefanowski, J.; Herrera, F. SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 2015, 291, 184–203. [Google Scholar] [CrossRef]
- Khoshgoftaar, T.M.; Rebours, P. Improving software quality prediction by noise filtering techniques. J. Comput. Sci. Technol. 2007, 22, 387–396. [Google Scholar] [CrossRef]
- Li, D.-C.; Hu, S.C.; Lin, L.-S.; Yeh, C.-W. Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets. PLoS ONE 2017, 12, e0181853. [Google Scholar] [CrossRef] [Green Version]
- Li, D.-C.; Wu, C.-S.; Tsai, T.-I.; Lin, Y.-S. Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge. Comput. Oper. Res. 2007, 34, 966–982. [Google Scholar] [CrossRef]
- Dal Pozzolo, A.; Caelen, O.; Bontempi, G. When Is Undersampling Effective in Unbalanced Classification Tasks? In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Porto, Portugal, 7–11 September 2015; pp. 200–215. [Google Scholar]
- Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
- Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradig. 2011, 3, 4–21. [Google Scholar] [CrossRef]
- Piri, S.; Delen, D.; Liu, T. A synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine to enhance learning from imbalanced datasets. Decis. Support Syst. 2018, 106, 15–29. [Google Scholar] [CrossRef]
- Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2012, 26, 405–425. [Google Scholar] [CrossRef]
- Fahrudin, T.; Buliali, J.L.; Fatichah, C. Enhancing the performance of smote algorithm by using attribute weighting scheme and new selective sampling method for imbalanced data set. Int. J. Innov. Comput. Inf. Control 2019, 15, 423–444. [Google Scholar] [CrossRef]
- Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-Level-Smote: Safe-Level-Synthetic Minority Over-Sampling Technique for Handling the Class Imbalanced Problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 475–482. [Google Scholar]
- Maciejewski, T.; Stefanowski, J. Local Neighbourhood Extension of SMOTE for Mining Imbalanced Data. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Paris, France, 11–15 April 2011; pp. 104–111. [Google Scholar]
- Cieslak, D.A.; Chawla, N.V.; Striegel, A. Combating Imbalance in Network Intrusion Datasets. In Proceedings of the IEEE International Conference on Granular Computing, Atlanta, GA, USA, 10–12 May 2006; pp. 732–737. [Google Scholar]
- Sanchez, A.I.; Morales, E.F.; Gonzalez, J.A. Synthetic oversampling of instances using clustering. Int. J. Artif. Intell. Tools 2013, 22, 1350008. [Google Scholar] [CrossRef] [Green Version]
- Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef] [Green Version]
- Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159. [Google Scholar] [CrossRef] [Green Version]
- Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
- Napierała, K.; Stefanowski, J.; Wilk, S. Learning from Imbalanced Data in Presence of Noisy and Borderline Examples. In Proceedings of the International Conference on Rough Sets and Current Trends in Computing, Warsaw, Poland, 28–30 June 2010; pp. 158–167. [Google Scholar]
- Fernández, A.; García, S.; del Jesus, M.J.; Herrera, F. A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets Syst. 2008, 159, 2378–2398. [Google Scholar] [CrossRef]
- Asuncion, A.; Newman, D. UCI Machine Learning Repository. 2007. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 22 February 2022).
- Kovács, G. Smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing 2019, 366, 352–354. [Google Scholar] [CrossRef]
- Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
Oversamplers | Authors [Reference] | Methods
---|---|---
Danger-information-based oversamplers (DIBOs) | Han et al. [32] | B1_SMOTE
 | Han et al. [32] | B2_SMOTE
 | He et al. [33] | ADASYN
 | Nguyen et al. [34] | BOS
 | Barua et al. [36] | MWMOTE
 | Piri et al. [35] | SIMO
 | Fahrudin et al. [37] | AWH_SMOTE
Safe-information-based oversamplers (SIBOs) | Cieslak et al. [40] | C_SMOTE
 | Bunkhumpornpat et al. [38] | SL_SMOTE
 | Maciejewski and Stefanowski [39] | LN_SMOTE
 | Sanchez et al. [41] | SOI_CJ
 | Douzas et al. [42] | km_SMOTE
Input:
  imbData: An imbalanced dataset.
  K: The number of kNPNs for oversampling.
  k: The number of kNNs for computing BIW.
Output:
  resData: The imbData that has been resampled by this procedure.
Procedure Begin
1.  P = the minority class set from imbData
2.  N = the majority class set from imbData
3.  resData = imbData
4.  while the length of resData < twice the length of N:
5.    for p in P:
6.      KNNp = the K nearest neighbors of p from P
7.      kNNp = the k nearest neighbors of p from imbData
8.      compute the BIW of p with kNNp
9.      if …: return to 4, then continue
10.     pp = a randomly selected example from KNNp
11.     kNNpp = the k nearest neighbors of pp from imbData
12.     compute the BIW of pp with kNNpp
13.     if …:
14.       …
15.       …
16.     else:
17.       …
18.       …
19.     resData = resData + s
20.     if the length of resData >= twice the length of N:
21.       break  # the numbers of the two classes are balanced
22. return resData
Procedure End.
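The listing above leaves the BIW formula and the branching conditions in steps 9 and 13–18 unspecified, so the Python sketch below fills them with plausible assumptions rather than the authors’ exact rules: the BIW of an example is taken as the fraction of majority-class points among its k nearest neighbors, apparently noisy minority points (BIW = 1) are skipped, and the interpolation weight is biased toward whichever endpoint has the larger BIW, i.e., toward the decision boundary. All names and thresholds are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bibo_oversample(X, y, K=5, k=5, positive_label=1, seed=0):
    """Boundary-information-based oversampling sketch (assumed BIW and weighting)."""
    rng = np.random.default_rng(seed)
    pos = X[y == positive_label]                   # minority class set P
    neg_count = int(np.sum(y != positive_label))   # size of majority class N

    knn_pos = NearestNeighbors(n_neighbors=K + 1).fit(pos)  # kNPNs of each minority point
    knn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)    # kNNs used to compute BIW

    def biw(point):
        # assumed BIW: share of majority-class points among the k nearest neighbors
        idx = knn_all.kneighbors(point.reshape(1, -1), return_distance=False)[0][1:]
        return float(np.mean(y[idx] != positive_label))

    synthetic = []
    while len(pos) + len(synthetic) < neg_count:   # stop when the two classes are balanced
        added = 0
        for p in pos:
            if len(pos) + len(synthetic) >= neg_count:
                break
            biw_p = biw(p)
            if biw_p == 1.0:                       # assumed skip rule: p looks like noise
                continue
            nbrs = knn_pos.kneighbors(p.reshape(1, -1), return_distance=False)[0][1:]
            pp = pos[rng.choice(nbrs)]
            # assumed weighting: place the virtual sample closer to the endpoint whose
            # neighborhood contains more majority examples (the boundary side)
            gap = rng.uniform(0.5, 1.0) if biw(pp) > biw_p else rng.uniform(0.0, 0.5)
            synthetic.append(p + gap * (pp - p))
            added += 1
        if added == 0:                             # every candidate was skipped
            break

    if not synthetic:
        return X, y
    X_res = np.vstack([X, np.array(synthetic)])
    y_res = np.concatenate([y, np.full(len(synthetic), positive_label)])
    return X_res, y_res
```

The parameters K and k mirror the procedure’s inputs: K controls how many positive neighbors are candidates for interpolation, and k controls the neighborhood used for the boundary information weight.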
Actual Class | Predicted Positive | Predicted Negative
---|---|---
Positive class | True positive (TP) | False negative (FN)
Negative class | False positive (FP) | True negative (TN)
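All of the evaluation metrics referenced in Section 6 can be derived from these four counts, with the exception of AUC, which is computed from the ROC curve rather than from a single confusion matrix. The sketch below uses the standard definitions commonly attached to these abbreviations; treating Fmeas as the F1-score is an assumption about the paper’s exact weighting.

```python
from math import sqrt

def metrics_from_confusion(tp, fn, fp, tn):
    """Standard metrics derived from the confusion-matrix counts above."""
    acc = (tp + tn) / (tp + fn + fp + tn)     # overall accuracy (acc)
    rec = tp / (tp + fn)                      # recall / sensitivity (rec)
    spec = tn / (tn + fp)                     # specificity (spec)
    prec = tp / (tp + fp)                     # precision
    gmean = sqrt(rec * spec)                  # geometric mean of rec and spec (Gmean)
    fmeas = 2 * prec * rec / (prec + rec)     # F-measure, here the F1-score (Fmeas)
    return {"acc": acc, "rec": rec, "spec": spec, "Gmean": gmean, "Fmeas": fmeas}

# e.g., 40 of 50 positives and 180 of 200 negatives correctly classified:
print(metrics_from_confusion(tp=40, fn=10, fp=20, tn=180))
# {'acc': 0.88, 'rec': 0.8, 'spec': 0.9, 'Gmean': 0.848..., 'Fmeas': 0.727...}
```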
No. | Name | Ex. | IR | DR (%) | No. | Name | Ex. | IR | DR (%) |
---|---|---|---|---|---|---|---|---|---|
1 | paw1 | 600 | 5 | 0 | 10 | clover4 | 800 | 7 | 0 |
2 | paw2 | 600 | 5 | 30 | 11 | clover5 | 800 | 7 | 30 |
3 | paw3 | 600 | 5 | 60 | 12 | clover6 | 800 | 7 | 60 |
4 | paw4 | 800 | 7 | 0 | 13 | subcl1 | 600 | 5 | 0 |
5 | paw5 | 800 | 7 | 30 | 14 | subcl2 | 600 | 5 | 30 |
6 | paw6 | 800 | 7 | 60 | 15 | subcl3 | 600 | 5 | 60 |
7 | clover1 | 600 | 5 | 0 | 16 | subcl4 | 800 | 7 | 0 |
8 | clover2 | 600 | 5 | 30 | 17 | subcl5 | 800 | 7 | 30 |
9 | clover3 | 600 | 5 | 60 | 18 | subcl6 | 800 | 7 | 60 |
No. | Name | Att. | Ex. | IR | No. | Name | Att. | Ex. | IR |
---|---|---|---|---|---|---|---|---|---|
1 | ecoli-0_vs_1 | 7 | 220 | 1.86 | 12 | page-blocks0 | 10 | 5472 | 8.79 |
2 | ecoli1 | 7 | 336 | 3.36 | 13 | pima | 8 | 768 | 1.87 |
3 | ecoli2 | 7 | 336 | 5.46 | 14 | segment0 | 19 | 2308 | 6.02 |
4 | ecoli3 | 7 | 336 | 8.6 | 15 | vehicle0 | 18 | 846 | 3.25 |
5 | glass-0-1-2-3_vs_4-5-6 | 9 | 214 | 3.2 | 16 | vehicle1 | 18 | 846 | 2.9
 | | | | | 17 | vehicle2 | 18 | 846 | 2.88
6 | glass0 | 9 | 214 | 2.06 | 18 | vehicle3 | 18 | 846 | 2.99 |
7 | glass1 | 9 | 214 | 1.82 | 19 | wisconsin | 9 | 683 | 1.86 |
8 | glass6 | 9 | 214 | 6.38 | 20 | yeast1 | 8 | 1484 | 2.46 |
9 | haberman | 3 | 306 | 2.78 | 21 | yeast3 | 8 | 1484 | 8.1 |
10 | new-thyroid1 | 5 | 215 | 5.14 | 22 | ionosphere | 34 | 351 | 1.79 |
11 | new-thyroid2 | 5 | 215 | 5.14 | 23 | Swarm Behaviour Aligned | 2400 | 24017 | 2.20 |
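In the dataset tables above, Ex. denotes the number of examples, Att. the number of attributes, and IR the imbalance ratio, i.e., the number of majority class examples divided by the number of minority class examples. For instance, ecoli-0_vs_1 with 220 examples and IR = 1.86 contains roughly 143 majority and 77 minority examples.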
Oversamplers | Methods | acc | rec | spec | Gmean | Fmeas | AUC | ave | ||
---|---|---|---|---|---|---|---|---|---|---|
RAW | - | 4.09 | 11.00 | 2.52 | 5.57 | 2.39 | 10.43 | 7.65 | 11.00 | 8.00 |
SMOTE | - | 7.13 | 4.22 | 7.17 | 7.22 | 7.91 | 2.91 | 6.83 | 4.22 | 5.57 |
DIBOs | B1_SMOTE | 7.30 | 5.30 | 7.87 | 7.26 | 5.96 | 4.00 | 6.26 | 5.30 | 6.80 |
B2_SMOTE | 10.00 | 7.78 | 9.96 | 10.22 | 9.65 | 6.35 | 9.91 | 7.78 | 9.91 | |
ADASYN | 10.43 | 5.43 | 10.26 | 9.39 | 9.09 | 5.22 | 9.78 | 5.43 | 9.65 | |
MWMOTE | 6.70 | 1.57 | 8.00 | 5.91 | 7.70 | 2.04 | 5.57 | 1.57 | 5.26 | |
SIBOs | SL_SMOTE | 9.17 | 8.96 | 7.22 | 9.74 | 9.04 | 8.22 | 9.87 | 8.96 | 9.87 |
LN_SMOTE | 4.93 | 2.65 | 5.09 | 4.52 | 5.22 | 2.83 | 3.43 | 2.65 | 3.78 | |
SOI_CJ | 3.02 | 4.61 | 4.04 | 2.83 | 4.00 | 6.09 | 2.04 | 4.61 | 2.04 | |
km_SMOTE | 1.78 | 4.91 | 2.83 | 1.09 | 3.04 | 8.09 | 1.13 | 4.91 | 1.04 | |
BIBO | - | 1.43 | 6.57 | 2.04 | 2.26 | 1.00 | 9.83 | 3.52 | 6.57 | 3.30 |
Classifiers | kNN | C4.5 | |||||||||
Oversamplers | Methods | acc | Gmean | Fmeas | AUC | ave | acc | Gmean | Fmeas | AUC | ave |
RAW | - | 4.65 | 6.52 | 8.35 | 8.00 | 8.96 | 7.61 | 8.09 | 8.57 | 9.52 | 8.26 |
SMOTE | - | 7.59 | 8.04 | 7.61 | 5.83 | 7.30 | 7.39 | 7.13 | 7.09 | 7.00 | 7.04 |
DIBOs | B1_SMOTE | 7.48 | 7.91 | 7.00 | 5.78 | 7.26 | 8.78 | 8.83 | 8.22 | 8.30 | 8.65 |
B2_SMOTE | 9.87 | 9.26 | 9.30 | 7.57 | 9.26 | 9.65 | 9.65 | 9.65 | 9.04 | 9.65 | |
ADASYN | 9.17 | 7.43 | 7.57 | 3.74 | 7.04 | 6.30 | 5.35 | 4.61 | 2.87 | 4.30 | |
MWMOTE | 6.20 | 5.39 | 5.26 | 1.30 | 5.00 | 4.48 | 3.61 | 2.48 | 4.78 | 3.25 | |
SIBOs | SL_SMOTE | 10.26 | 10.96 | 9.78 | 9.26 | 10.87 | 10.26 | 9.17 | 9.26 | 9.91 | 8.22 |
LN_SMOTE | 4.70 | 4.17 | 3.26 | 3.35 | 3.48 | 2.63 | 3.56 | 3.48 | 2.61 | 3.74 | |
SOI_CJ | 2.65 | 2.35 | 1.30 | 3.04 | 1.43 | 2.57 | 3.04 | 4.13 | 4.39 | 4.13 | |
km_SMOTE | 1.78 | 1.30 | 1.87 | 5.74 | 1.70 | 4.26 | 5.09 | 5.17 | 5.35 | 5.22 | |
BIBO | - | 1.65 | 2.65 | 3.70 | 6.39 | 3.70 | 2.57 | 2.83 | 2.35 | 2.22 | 2.30 |
Classifiers | SVC_L | SVC_S | |||||||||
Oversamplers | Methods | acc | Gmean | Fmeas | AUC | ave | acc | Gmean | Fmeas | AUC | ave |
RAW | - | 4.50 | 7.70 | 6.50 | 4.50 | 6.50 | 7.37 | 8.46 | 7.50 | 5.50 | 5.50 |
SMOTE | - | 7.57 | 6.13 | 5.04 | 4.30 | 5.78 | 7.74 | 6.30 | 5.04 | 4.57 | 6.00 |
DIBOs | B1_SMOTE | 8.96 | 9.46 | 6.87 | 6.48 | 7.13 | 8.91 | 7.70 | 6.70 | 6.22 | 7.09 |
B2_SMOTE | 9.96 | 8.70 | 7.87 | 7.39 | 7.96 | 9.00 | 8.61 | 7.83 | 7.17 | 8.00 | |
ADASYN | 4.96 | 2.74 | 2.09 | 1.61 | 2.26 | 4.87 | 3.70 | 1.83 | 2.04 | 3.91 | |
MWMOTE | 5.39 | 3.09 | 1.96 | 1.91 | 2.35 | 5.87 | 2.39 | 2.13 | 1.35 | 2.78 | |
SIBOs | SL_SMOTE | 7.54 | 10.46 | 10.50 | 10.50 | 10.50 | 7.37 | 10.46 | 10.50 | 10.50 | 10.50 |
LN_SMOTE | 5.00 | 5.35 | 4.91 | 5.26 | 5.13 | 4.48 | 5.30 | 5.00 | 5.70 | 5.00 | |
SOI_CJ | 7.54 | 6.74 | 8.74 | 8.74 | 8.74 | 4.74 | 3.61 | 2.91 | 2.83 | 2.87 | |
km_SMOTE | 2.76 | 2.63 | 4.70 | 6.61 | 2.57 | 2.70 | 2.53 | 4.87 | 6.39 | 2.61 | |
BIBO | - | 1.83 | 3.65 | 2.83 | 2.70 | 3.09 | 1.96 | 6.48 | 8.70 | 8.74 | 8.74 |
Oversamplers | SMOTE | B1_SMOTE | B2_SMOTE | ADASYN | MWMOTE |
computational time (s) | 0.085 | 0.353 | 0.355 | 0.371 | 3.025 |
Oversamplers | SL_SMOTE | LN_SMOTE | SOI_CJ | km_SMOTE | BIBO |
computational time (s) | 0.899 | 2.002 | 50.090 | 4.076 | 0.689 |
NO. | Mcg | Gvh | Lip | Chg | Aac | Alm1 | Alm2 | Class |
---|---|---|---|---|---|---|---|---|
1 | 0.23 | 0.48 | 0.48 | 0.50 | 0.59 | 0.88 | 0.89 | Negative |
2 | 0.56 | 0.40 | 0.48 | 0.50 | 0.49 | 0.37 | 0.46 | Positive |
… | … | … | … | … | … | … | … | … |
175 | 0.24 | 0.41 | 0.48 | 0.50 | 0.49 | 0.23 | 0.34 | Positive |
176 | 0.20 | 0.44 | 0.48 | 0.50 | 0.46 | 0.51 | 0.57 | Positive |
No. | Synthetic Examples | |||
---|---|---|---|---|
1 | 11.053 | 11.458 | 0.982 | [0.50, 0.37, …, 0.69] |
2 | 16.335 | 9.340 | 0.272 | [0.00, 0.51, …, 0.44] |
… | … | … | … | … |
53 | 9.503 | 10.345 | 0.958 | [0.12, 0.67, …, 0.63] |
54 | 5.326 | 9.503 | 0.718 | [0.33, 0.37, …, 0.65] |