
Term Paper Report

on

AI/ML: Developing algorithms for handling


imbalanced datasets in classification tasks

Submitted to

Amity University Uttar Pradesh

in partial fulfilment of the requirements for the award of the degree of

Bachelor of Technology
in
Computer Science & Engineering
Submitted By

Sakshi Mishra
A7605223145

under the guidance of


Dr. Ashish Tiwari
Assistant Professor,
Department of Computer Science and Engineering

AMITY SCHOOL OF ENGINEERING AND TECHNOLOGY


AMITY UNIVERSITY UTTAR PRADESH
LUCKNOW (U.P.)
August 2024

AMITY UNIVERSITY
–––––––––UTTAR PRADESH–––––––––

DECLARATION BY THE STUDENT

I, Sakshi Mishra, student of B.Tech (CS&E) 3rd Semester, hereby declare that the
term paper titled AI/ML: Developing algorithms for handling imbalanced datasets in
classification tasks, which is submitted by me to the Department of Computer Science and
Engineering, Amity School of Engineering and Technology, Lucknow, Amity University
Uttar Pradesh, Lucknow Campus, in partial fulfilment of the requirement for the award of the
degree of Bachelor of Technology in Computer Science and Engineering, has not
previously formed the basis for the award of any degree, diploma or other similar title or
recognition.
The author attests that permission has been obtained for the use of any copyrighted
material appearing in the project report other than brief excerpts requiring only
proper acknowledgement in scholarly writing, and that all such use is acknowledged.

Lucknow
Date:
Sakshi Mishra
B. Tech (CS&E) 3rd Semester
A7605223145


CERTIFICATE

On the basis of the declaration submitted by Sakshi Mishra, student of B.Tech
(Computer Science and Engineering) 3rd semester, I hereby certify that the Term Paper
titled AI/ML: Developing algorithms for handling imbalanced datasets in classification
tasks, which is submitted to Amity School of Engineering and Technology, Amity
University Uttar Pradesh, Lucknow, in partial fulfilment of the requirement for the award
of the degree of Bachelor of Technology in Computer Science and Engineering, is an
original contribution to existing knowledge and a faithful record of the work carried out by
her under my guidance and supervision.
To the best of my knowledge, this work has not been submitted in part or in full for any
degree or diploma to this University or elsewhere.

Lucknow

Date:

Dr. Ashish Tiwari


Assistant Professor
ASET
Amity University
(Faculty Guide)

ACKNOWLEDGEMENT
I extend my heartfelt thanks to the following individuals whose support and guidance
were indispensable in the completion of this term paper:
Dr. Ashish Tiwari, Assistant Professor, Amity University, who provided invaluable insights
and guidance throughout the research process; and my family and friends, who provided
continuous encouragement and understanding during the research period.

ABSTRACT
In the realm of artificial intelligence and machine learning, addressing imbalanced datasets is
crucial for improving classification task performance. This paper investigates various algorithms
and techniques designed to tackle this challenge, specifically focusing on their effectiveness in
improving predictive accuracy for minority class instances. Through a comprehensive review of
existing literature, the study outlines current methodologies, emphasizing strategies such as
oversampling, undersampling, ensemble methods, and cost-sensitive learning. Methodologically,
this research conducts a comparative analysis of these techniques using benchmark datasets,
evaluating key performance metrics such as precision, recall, and F1-score. The findings highlight
that ensemble methods like Random Forests and boosting algorithms such as AdaBoost and
XGBoost demonstrate promising results in alleviating class imbalance issues. However, persistent
challenges include model overfitting and complexities in parameter tuning. The study concludes
by advocating for further exploration into hybrid approaches and innovative algorithms tailored
specifically for imbalanced datasets, providing valuable insights for advancing machine learning
applications in classification tasks.

TABLE OF CONTENTS

Sl. No. Description

I Title Page

II Declaration by the Student

III Certificate

IV Acknowledgements

V Abstract

VI Table of Contents

1 Introduction

2 International and National Status

3 Review of Literature

4 Methodology / Observations / Results

5 Results

6 Discussion

7 References

1. INTRODUCTION
When it comes to building smart AI systems, especially for tasks like spotting fraud, diagnosing
diseases, or predicting rare events, a common hurdle arises: imbalanced data. This means there's
a lopsided distribution, where some data points (representing a certain class) vastly outnumber
others. Traditional AI struggles with this bias, prioritising the frequent class and neglecting the
rarer one, which might be more crucial.
This research tackles the challenge directly: we develop new algorithms that combine
different approaches to balance the data and improve classifier performance. By addressing this
imbalance, we aim to create fairer and more accurate AI models that can make reliable predictions
across all data categories, regardless of their frequency.

Fig 1: A flow chart outlining the development of algorithms for handling imbalanced datasets in
classification tasks.

2. INTERNATIONAL AND NATIONAL STATUS OF
WORK IN THE AREA

International Status

Handling imbalanced datasets in classification tasks is a major focus within the global AI/ML
research community. Internationally, significant progress has been made in developing algorithms
and methodologies to tackle this challenge. Key advancements include:

• Algorithm Development: Techniques such as the Synthetic Minority Over-sampling


Technique (SMOTE) and its variants (e.g., Borderline-SMOTE, SMOTE-ENN) are widely
used. Additionally, ensemble methods like Balanced Random Forests and cost-sensitive
learning approaches have been extensively researched and implemented.
• Deep Learning Approaches: Researchers have explored deep learning methods to address
imbalanced datasets. Techniques such as auto-encoders, generative adversarial networks
(GANs), and other neural network-based methods are utilised to generate synthetic data and
improve the performance on minority classes.

Fig 2: A step-by-step overview of work in the field.


• Global Competitions and Benchmarks: International conferences and competitions, such
as NeurIPS and ICML, emphasise the importance of addressing imbalanced datasets. These
platforms provide benchmarks and facilitate the comparison of new methods against
state-of-the-art techniques.
• Comprehensive Reviews: Numerous international reviews and meta-analyses have
synthesised progress in this field, identifying effective techniques and future research
directions.

National Status
National efforts in various countries also highlight the importance of developing algorithms for
imbalanced datasets in classification tasks. Key aspects include:
• Research Institutions and Universities: Leading national research institutions and
universities are at the forefront of this research. For instance, in the United States,
institutions like MIT and Stanford have pioneered algorithmic solutions, often collaborating
with industry partners.
• Industry Collaboration: National collaborations between academia and industry have led
to practical applications of research findings. Sectors such as healthcare, finance, and
cybersecurity benefit from tailored solutions addressing imbalanced data issues.
• Government and Funding Support: National governments and funding agencies, such as
the National Science Foundation (NSF) in the US, provide grants and support for research
projects focused on imbalanced datasets, recognising their importance for advancing AI/ML
capabilities.
• Educational Initiatives: Many countries incorporate the study of imbalanced datasets into
AI/ML educational programs. Workshops, seminars, and courses are conducted to train
future researchers and practitioners in advanced techniques for handling imbalanced data.

In summary, both international and national efforts are crucial for advancing the field of handling
imbalanced datasets in classification tasks. Global collaborations, continuous research, and
support from academic, industrial, and governmental sectors are essential for further progress.

3. REVIEW OF LITERATURE
Imbalanced datasets, in which one class significantly outnumbers the others, pose a serious
challenge for machine learning classification tasks. Conventional models tend to favour the
majority class, leading to poor performance on the minority class. To address this,
researchers have developed various algorithms and techniques.

Data-Level Methods:
Oversampling and Undersampling:
• SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic minority
class samples for a more balanced distribution (Chawla et al., 2002).
• Random Oversampling/Undersampling: Simple techniques that duplicate minority or
remove majority class instances, but can lead to overfitting or information loss.
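As an illustration, the interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. This is a simplified sketch, not the full Chawla et al. (2002) algorithm; the function name and toy data are illustrative:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between a random minority sample and one of its k nearest
    minority-class neighbours (simplified SMOTE-style step)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority class: five 2-D points inside the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_sketch(X_min, n_new=10, rng=42)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies, which is what distinguishes SMOTE from simple duplication.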

Algorithm-Level Methods:
• Cost-Sensitive Learning: Adjusts the loss function to penalise misclassifications of the
minority class more heavily (Elkan, 2001).
• Decision Threshold Adjustment: Shifts the classification threshold to favour the minority
class, often used with probabilistic classifiers.
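With any probabilistic classifier, threshold adjustment amounts to replacing the default 0.5 cut-off with a lower one for the minority (positive) class. The probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities of the minority (positive) class
probs = np.array([0.9, 0.6, 0.45, 0.35, 0.2, 0.1])

default_preds = (probs >= 0.5).astype(int)  # standard 0.5 threshold: 2 positives
shifted_preds = (probs >= 0.3).astype(int)  # lowered threshold: 4 positives
```

Lowering the threshold trades majority-class precision for minority-class recall, so the cut-off is usually tuned on a validation set.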

Ensemble Methods:
• Balanced Random Forests: Combines bagging with random undersampling to create
balanced training subsets for each decision tree (Chen et al., 2004).
• AdaBoost Adaptation: Adapts AdaBoost for imbalanced data by incorporating
cost-sensitive learning or modifying sample weights (Freund & Schapire, 1997).
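The resampling step behind such ensembles can be sketched as follows, assuming labels 0 = majority and 1 = minority; each balanced subset would then train one base learner (the function name is illustrative):

```python
import numpy as np

def balanced_subsets(y, n_subsets, rng=None):
    """For each subset, keep every minority index and randomly
    undersample the majority class down to the same size."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.choice(majority, size=len(minority), replace=False)
        subsets.append(np.concatenate([minority, sampled]))
    return subsets

# 90 majority labels vs 10 minority labels
y = np.array([0] * 90 + [1] * 10)
subsets = balanced_subsets(y, n_subsets=5, rng=0)  # five balanced index sets
```

Because each base learner sees a different random slice of the majority class, the ensemble as a whole still uses most of the majority data while every individual learner trains on a balanced set.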

Deep Learning Approaches:


• Generative Models: Utilises autoencoders and GANs (Generative Adversarial Networks)
to generate realistic synthetic minority class data.
• Cost-Sensitive Deep Learning: Adapts deep neural networks to incorporate cost-sensitive
learning during training.
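The core idea can be sketched as a class-weighted binary cross-entropy; the weights and probabilities below are illustrative, not taken from any cited work:

```python
import numpy as np

def weighted_bce(y_true, p_pred, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Binary cross-entropy with per-class weights, so errors on the
    (minority) positive class can be penalised more heavily."""
    p = np.clip(p_pred, eps, 1 - eps)
    loss = -(w_pos * y_true * np.log(p) + w_neg * (1 - y_true) * np.log(1 - p))
    return loss.mean()

y_true = np.array([1.0, 0.0, 0.0, 0.0])           # one minority example
p_pred = np.array([0.3, 0.1, 0.2, 0.1])           # model under-predicts it

plain = weighted_bce(y_true, p_pred)              # equal class weights
costly = weighted_bce(y_true, p_pred, w_pos=5.0)  # minority errors cost 5x
```

Because the weighted loss grows whenever the minority example is misclassified, gradient descent is pushed toward correcting exactly those errors.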

Evaluation Metrics:
• Beyond Accuracy: Standard accuracy is insufficient. Metrics like Precision, Recall, F1-
Score, AUC (Area Under the Curve), and MCC (Matthews Correlation Coefficient) provide
a more comprehensive assessment.
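These metrics follow directly from confusion-matrix counts. The sketch below, with illustrative counts for a 90:10 dataset, shows why accuracy misleads: the model scores 90% accuracy yet only 0.5 precision and recall on the minority class:

```python
import numpy as np

def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and MCC from confusion-matrix counts,
    treating the minority class as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f1, mcc

# 100 samples, 10 in the minority class; the model finds only 5 of them
p, r, f1, mcc = metrics(tp=5, fp=5, fn=5, tn=85)
# accuracy = (5 + 85) / 100 = 0.90, yet precision = recall = 0.5
```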

Reviews and Meta-Analyses:


• Synthesise Progress: Analyse strengths and weaknesses of different approaches, offering
valuable insights and guiding future research directions (e.g., He & Garcia, 2009; Krawczyk,
2016).

This review highlights the diverse approaches available for handling imbalanced datasets in
classification tasks. From traditional methods to recent advancements in deep learning,
researchers continue to develop effective algorithms to address this critical challenge in AI and
ML.

4. METHODOLOGY / OBSERVATION / RESULTS
Methodology
Research in this area primarily focuses on evaluating and comparing different algorithms for
handling imbalanced datasets. Common methodologies involve:

• Simulating Imbalanced Datasets: Researchers create synthetic datasets with varying class
imbalance ratios to test algorithms.

• Performance Evaluation Metrics: Accuracy alone is misleading. Metrics like precision,
recall, F1-score, AUC, and MCC are used for a more comprehensive evaluation.

• Comparative Analysis: Different algorithms (e.g., SMOTE, cost-sensitive learning, deep
learning) are applied to the same imbalanced dataset, and their performance is compared.

Observation
• Traditional Techniques: Oversampling and undersampling offer a simple solution but can
lead to overfitting or information loss.
• Algorithm-Level Methods: Cost-sensitive learning and decision threshold adjustment
improve minority class performance but may require careful parameter tuning.
• Ensemble Methods: Techniques like Balanced Random Forests and adapted AdaBoost can
be effective, but their performance depends on the chosen base learner.
• Deep Learning: Generative models show promise in creating realistic synthetic data, but
training can be computationally expensive. Cost-sensitive deep learning can be effective,
but requires adapting the specific network architecture.

5. RESULTS
The effectiveness of an algorithm depends on the specific dataset and imbalance ratio. Here are
some general observations:
• Data-Level Techniques: SMOTE and its variants often outperform random oversampling
in terms of minority class recall. However, they might not always improve overall
classification accuracy.
• Algorithm-Level Techniques: Cost-sensitive learning can significantly improve minority
class recall, but may lead to a slight decrease in majority class precision.
• Ensemble Methods: Balanced Random Forests can achieve good overall performance
while maintaining reasonable minority class recall. Adapted AdaBoost can be particularly
effective for highly imbalanced datasets.
• Deep Learning: Generative models like GANs can produce high-quality synthetic data,
leading to improved performance on minority classes. Cost-sensitive deep learning
approaches have shown promising results in various domains.

Fig 3: Method applied in Methodology / Observations / Results.

6. DISCUSSION
Imbalanced Data: A Thorn in the Side of AI Classification
Classification tasks in AI and machine learning (ML) often grapple with imbalanced datasets.
This occurs when one class vastly outnumbers others, leading to biased models that favour the
majority class and neglect the minority class – often the one of greater interest. This paper
explores the challenges and strategies for overcoming this hurdle in AI/ML classification.
The Pitfalls of Imbalanced Data
• Skewed Predictions: Models trained on imbalanced data prioritise the majority class,
resulting in high overall accuracy but poor detection of the minority class.
• Misleading Metrics: Traditional metrics like accuracy can be deceptive, failing to capture
a model's performance on the minority class.
• Data Scarcity: Limited examples of the minority class hinder the model's ability to learn
its characteristics effectively.
Balancing the Scales: Strategies for Imbalanced Data

Data-Level Techniques:
Resampling: Techniques like oversampling (e.g., SMOTE) or undersampling aim to balance
the dataset by either creating synthetic minority class examples or removing majority class
instances. However, oversampling risks overfitting, while undersampling might discard valuable
data.
Synthetic Data Generation: Techniques like SMOTE (Synthetic Minority Over-sampling
Technique) create artificial minority class examples to balance the data without simple
duplication.

Algorithm-Level Techniques:
Cost-Sensitive Learning: This approach assigns a higher penalty to misclassifying the
minority class, forcing the model to prioritise its correct prediction.
Ensemble Methods: Combining multiple models (e.g., bagging, boosting) can help mitigate
bias towards the majority class. Specific algorithms like Balanced Random Forest and
EasyEnsemble are designed for imbalanced data.

Evaluation Metrics:
Precision, Recall, F1-Score: These metrics provide a more nuanced understanding of the
model's performance on the minority class.
AUC-ROC and Precision-Recall Curves: These curves offer a broader evaluation by
illustrating the trade-off between correctly identifying the minority class and mistakenly
identifying the majority class at various thresholds.
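The AUC itself can be computed without drawing the curve, via its rank-sum (Mann-Whitney) interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative. A sketch assuming no tied scores, with illustrative data:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the rank-sum formulation (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ascending ranks
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    rank_sum = ranks[y_true == 1].sum()
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([1, 1, 0, 0, 0])
s = np.array([0.9, 0.4, 0.5, 0.2, 0.1])
auc = roc_auc(y, s)  # 5/6: one negative outranks one positive
```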

Conclusion and Future Scope
Developing robust algorithms for handling imbalanced datasets is crucial for improving model
performance, especially in applications where accurate minority class detection is critical (e.g.,
fraud detection, medical diagnosis). Utilising a combination of data-level and algorithm-level
strategies along with appropriate evaluation metrics is essential for empowering AI/ML models
to handle imbalanced data effectively. Continued research and innovation in this area are
paramount for advancing the field and achieving fair and robust AI systems.

7. REFERENCES
1. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering, 21(9), 1263-1284.
2. Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future
directions. Progress in Artificial Intelligence, 5(4), 221-232.
3. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE:
Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research
(JAIR), 16, 321-357.
4. Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the
Seventeenth International Joint Conference on Artificial Intelligence (Vol. 2, pp. 913-918).
Morgan Kaufmann Publishers Inc.
5. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences, 55(1),
119-139.
6. Mehdiyev, N., Wei, Y., & He, X. (2016). Imbalanced learning with generative adversarial
networks for anomaly detection. In 2016 International Joint Conference on Neural
Networks (IJCNN) (pp. 2548-2555). IEEE.
7. Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC,
informedness, markedness and correlation. Journal of Machine Learning Technologies,
2(1), 37-63.

