
ANOMALY DETECTION IN NETWORK TRAFFIC
ANOMALY DETECTION USING ML
Muhamad Nizam Azmi
Mataram University
WHY ANOMALY DETECTION?
Anomaly Detection Overview

A crucial task in identifying deviations from normal behavior
Applications in fraud detection, industrial maintenance, and cybersecurity

Challenges

High-dimensional data
Large-scale distributed systems
Vast amounts of data

Advancements in Machine Learning

Powerful tools for pattern recognition
Effective in complex and high-dimensional datasets

NEXT: PROJECT
Distinguishing between normal and anomalous behaviors

Focus: Unsupervised Learning Algorithms

No need for labeled attack data
Ideal for real-world applications with scarce labeled data
3 TYPES OF ANOMALY

Point anomaly
Contextual anomaly
Collective anomaly

Project focus: point anomaly detection

COMPARISON OF VARIOUS MACHINE LEARNING ALGORITHMS

Aim: Identify the most effective techniques for different anomaly detection tasks
Provide insights and guidelines for future applications in cybersecurity, industrial monitoring, and beyond

AdaBoost    Naive Bayes    Gradient Boosting
Logistic Regression    K-Nearest Neighbors (KNN)    SVM
Random Forest    Decision Tree    Neural Network (NN)


WHY CHOOSE THESE ALGORITHMS?

AdaBoost: chosen for its ability to enhance the performance of simple models, making it effective in scenarios where the data may have a lot of noise or complex patterns.
Naive Bayes: selected for its simplicity, speed, and effectiveness in handling large datasets, making it suitable for real-time anomaly detection tasks.
Gradient Boosting: chosen for its high predictive accuracy and its ability to handle a variety of data types and distributions, which is crucial for detecting subtle anomalies.
Random Forest: chosen for its high accuracy, its ability to handle large datasets with higher dimensionality, and its robustness against overfitting.
Logistic Regression: selected for its interpretability and effectiveness in binary classification problems, making it useful for understanding and explaining anomaly detection results.
K-Nearest Neighbors (KNN): chosen for its simplicity and its ability to perform well with small to medium-sized datasets, particularly when the data is not linearly separable.
SVM: selected for its robustness in high-dimensional spaces and its effectiveness in cases where the anomaly classes are not linearly separable.
Neural Network (NN): chosen for its flexibility and its ability to model complex, non-linear relationships in data, which is essential for accurately detecting anomalies in high-dimensional datasets.
These algorithms collectively provide a robust framework for identifying anomalies, leveraging their individual strengths to enhance the accuracy and reliability of the anomaly detection system.
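A minimal sketch of how such a comparison could be run with scikit-learn, assuming the KDD features have already been encoded and split into X_train, X_test, y_train, y_test (the variable names and hyperparameters are illustrative, not taken from the slides):

# Minimal sketch: compare the nine classifiers on an already preprocessed,
# label-encoded KDD split. X_train, X_test, y_train, y_test are assumed to exist.
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

models = {
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": LinearSVC(),                      # linear SVM keeps training tractable on ~500k rows
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Decision Tree": DecisionTreeClassifier(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=200),
}

for name, model in models.items():
    model.fit(X_train, y_train)              # train on the 5-class labels (dos/normal/probe/r2l/u2r)
    pred = model.predict(X_test)
    print(f"{name:20s} acc={accuracy_score(y_test, pred):.4f} "
          f"macro-F1={f1_score(y_test, pred, average='macro'):.4f}")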
PREVIOUS RESEARCH

Evaluation of Distributed ML Algorithms for Anomaly Detection

Astekin, M., Zengin, H., and Sözer, H. (2018) compared distributed machine learning algorithms for system log analysis. They focused on scalability and efficiency, highlighting the strengths of certain algorithms in handling large datasets.

Metaheuristics and Machine Learning for Anomaly Detection in Big Data


Cavallaro, C., Cutello, V., Pavone, M., and Zito, F. (2023) reviewed the use of
metaheuristics combined with machine learning for anomaly detection.
Their study showed improved detection accuracy and adaptability to
different datasets.

Industrial Anomaly Detection with Neural Network Architectures


Siegel, B. (2020) compared neural network architectures for detecting
industrial anomalies. The study focused on real-time detection capabilities
and accuracy, discussing implementation challenges and solutions.

ETC.
SUMMARY OF PREVIOUS RESEARCH
DATASET INFORMATION
Purpose: The KDD Cup 1999 dataset is used to build a predictive model to distinguish
between "bad" connections (intrusions or attacks) and "good" (normal) connections in
a computer network. It aims to protect the network from unauthorized users, including
potential insiders.

Background: The dataset is based on the 1998 DARPA Intrusion Detection Evaluation
Program, managed by MIT Lincoln Labs. The program's objective was to evaluate
research in intrusion detection using a standard set of data that includes various
intrusions simulated in a military network environment.

DATA COLLECTION
Environment: Simulated a typical U.S. Air Force LAN with multiple simulated attacks.
Duration: Data was collected over nine weeks (seven weeks for training, two weeks for testing).
Data Size:
Training data: 4 gigabytes of compressed binary TCP dump data, resulting in about five million connection records.
Test data: Around two million connection records.
Connection Records: Each connection is a sequence of TCP packets between a source IP address and a target IP address, labeled as either normal or a specific type of attack.
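As a rough sketch of the loading step (the file name comes from the public KDD Cup 1999 release and is an assumption, not stated in the slides), the 10% training subset can be read with pandas using the 41 feature names from kddcup.names plus the attack label:

# Minimal sketch: load the KDD Cup 1999 10% training subset into a DataFrame.
import pandas as pd

columns = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]

# The raw file is comma-separated; labels carry a trailing '.' (e.g. "smurf.").
df = pd.read_csv("kddcup.data_10_percent", names=columns)
df["label"] = df["label"].str.rstrip(".")
print(df.shape)          # about 494,021 rows and 42 columns for the 10% subset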
TYPES OF ATTACKS

DOS: Denial of Service, e.g., SYN flood
R2L: Remote to Local, e.g., guessing passwords
U2R: User to Root, e.g., buffer overflow attacks
Probing: e.g., port scanning
RESULT: PROTOCOL TYPE DISTRIBUTION

ICMP: The most frequent protocol type, with over 250,000 occurrences.
TCP: The second most common protocol type, with over 150,000 occurrences.
UDP: The least frequent protocol type, with fewer than 50,000 occurrences.
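A small exploratory sketch of how counts like these could be reproduced, assuming the df DataFrame from the loading step above:

# Count connections per protocol type, mirroring the distribution reported here.
protocol_counts = df["protocol_type"].value_counts()
print(protocol_counts)     # expected order on the 10% subset: icmp, tcp, udp

# Same idea for the logged_in flag shown on the next slide.
print(df["logged_in"].value_counts())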
RESULT: LOGGED_IN FLAG

0: Not logged in
1: Successfully logged in

The number of connections that were not logged in (0) is significantly higher than the number that were successfully logged in (1).
RESULT: ATTACK CATEGORIES

dos: Denial of Service attacks - 391,458 instances
normal: Normal traffic (no attack) - 97,278 instances
probe: Surveillance and probing attacks - 4,107 instances
r2l: Remote to Local attacks - 1,126 instances
u2r: User to Root attacks - 52 instances
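These five categories come from grouping the individual attack labels. A hedged sketch using the standard KDD Cup 1999 grouping (assumed to match the one used here) to reproduce the counts above:

# Collapse individual attack names into the five categories used in the slides.
attack_map = {
    "normal": "normal",
    # dos
    "back": "dos", "land": "dos", "neptune": "dos", "pod": "dos",
    "smurf": "dos", "teardrop": "dos",
    # probe
    "ipsweep": "probe", "nmap": "probe", "portsweep": "probe", "satan": "probe",
    # r2l
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l", "multihop": "r2l",
    "phf": "r2l", "spy": "r2l", "warezclient": "r2l", "warezmaster": "r2l",
    # u2r
    "buffer_overflow": "u2r", "loadmodule": "u2r", "perl": "u2r", "rootkit": "u2r",
}

df["category"] = df["label"].map(attack_map)
print(df["category"].value_counts())   # dos 391458, normal 97278, probe 4107, r2l 1126, u2r 52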
RESULT: FEATURES REMOVED DUE TO HIGH CORRELATION

num_root: Removed due to high correlation with num_compromised (correlation = 0.9938).
srv_serror_rate: Removed due to high correlation with serror_rate (correlation = 0.9984).
srv_rerror_rate: Removed due to high correlation with rerror_rate (correlation = 0.9947).
dst_host_srv_serror_rate: Removed due to high correlation with srv_serror_rate (correlation = 0.9993).
dst_host_serror_rate: Removed due to high correlation with rerror_rate (correlation = 0.9870).
dst_host_rerror_rate: Removed due to high correlation with srv_rerror_rate (correlation = 0.9822).
dst_host_srv_rerror_rate: Removed due to high correlation with rerror_rate (correlation = 0.9852).
dst_host_same_srv_rate: Removed due to high correlation with dst_host_srv_count (correlation = 0.9737).
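A hedged sketch of this kind of correlation filter (the 0.97 threshold and the exact pairwise procedure are assumptions, not taken from the slides):

# Drop one feature from every numeric pair whose absolute Pearson correlation
# exceeds the chosen threshold.
import numpy as np

corr = df.select_dtypes(include=[np.number]).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.97).any()]
print("dropping:", to_drop)
df = df.drop(columns=to_drop)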
DECISION TREE RESULT
RANDOM FOREST RESULT
SVM RESULT
KNN RESULT
LOGISTIC REGRESSION RESULT
NEURAL NETWORK RESULT
GRADIENT BOOSTING RESULT
NAIVE BAYES RESULT
ADABOOST RESULT
ROC CURVE & FEATURE IMPORTANCES RESULT
DOS (Class 0): AUC (Area Under the Curve) = 1.00
Normal (Class 1): AUC = 1.00
Probe (Class 2): AUC = 0.99
R2L (Class 3): AUC = 0.97
U2R (Class 4): AUC = 0.82

High performance:
Classes 0 and 1 have perfect AUC scores of 1.00, indicating that the model separates these classes from the rest almost flawlessly.
Class 2 also demonstrates very high performance with an AUC of 0.99.

Moderate performance:
Class 3 has an AUC of 0.97, showing strong performance with few misclassifications.
Class 4 has a lower AUC of 0.82, indicating room for improvement in distinguishing this class from the others.
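A minimal sketch of how per-class AUC values like these could be computed one-vs-rest with scikit-learn; the fitted model name rf_model is an assumption, since the slide does not state which classifier produced the curves:

# One-vs-rest ROC AUC per class for a fitted model that exposes predict_proba.
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = rf_model.classes_                          # e.g. ['dos', 'normal', 'probe', 'r2l', 'u2r']
y_score = rf_model.predict_proba(X_test)             # shape (n_samples, n_classes)
y_true = label_binarize(y_test, classes=classes)     # one binary column per class

for i, cls in enumerate(classes):
    auc = roc_auc_score(y_true[:, i], y_score[:, i])
    print(f"{cls:7s} AUC = {auc:.2f}")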
ROC CURVE & FEATURE IMPORTANCES RESULT

Dominant feature: The srv_count feature overwhelmingly dominates the feature importances, indicating it has a critical impact on the model's predictions.
Other features: Although the remaining features contribute less, they still hold importance for the model, affecting specific aspects of its predictions.
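A short sketch of how such a ranking could be produced from an impurity-based importance attribute, assuming the fitted rf_model and the DataFrame X_train from the earlier sketches:

# Rank features by impurity-based importance, as in the chart summarized above.
import pandas as pd

importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))   # srv_count expected near the top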
CONCLUSION
Effectiveness of Machine Learning Algorithms:
This project successfully demonstrated that various machine learning
algorithms such as ADABOOST, Naive Bayes, Gradient Boosting, Logistic
Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM),
Random Forest, Decision Tree, and Neural Networks (NN) are capable of
detecting anomalies in large and complex datasets.

Recommendations for Further Development:


For future research, it is recommended to apply advanced data
augmentation techniques and feature engineering to further enhance
model performance. Additionally, combining multiple algorithms or using
ensemble approaches can help improve the accuracy and robustness of
anomaly detection.
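As an illustration of the ensemble direction suggested above (not part of the reported experiments), a soft-voting combination of three of the stronger models might look like this:

# Soft-voting ensemble over three classifiers; averages predicted probabilities.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("gb", GradientBoostingClassifier()),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))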
REFERENCES
Astekin, M., Zengin, H., & Sözer, H. (2018). Evaluation of Distributed Machine Learning Algorithms for Anomaly Detection from Large-Scale System Logs: A Case Study. Proceedings of the IEEE International Conference on Big Data, 862-1967.
Cavallaro, C., Cutello, V., Pavone, M., & Zito, F. (2023). Discovering anomalies in big data: a review focused on the application of metaheuristics and machine learning techniques. Frontiers in Big Data, 6.
Shabat, G., Segev, D., & Averbuch, A. (2017). Uncovering Unknown Unknowns in Financial Services Big Data by Unsupervised Methodologies: Present and Future Trends. Proceedings of Machine Learning Research, 71, 8-19.
Siegel, B. (2020). Industrial Anomaly Detection: A Comparison of Unsupervised Neural Network Architectures. IEEE Sensors Journal, 4(8), 1-4.
Zoppi, T., Ceccarelli, A., & Bondavalli, A. (2020). Into the Unknown: Unsupervised Machine Learning Algorithms for Anomaly-Based Intrusion Detection. Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 50200, 44.

THANK YOU FOR
YOUR ATTENTION
