
Classification Algorithms

INTELIGÊNCIA ARTIFICIAL E CIBERSEGURANÇA (INACS)


nunal@isep.ipp.pt
oms@isep.ipp.pt
icp@isep.ipp.pt
Classification Algorithms
• Machine Learning comprises different learning paradigms that relate to different tasks
• When focusing on Supervised Learning
• We can perform Classification and Regression tasks

• Classification will be addressed in INACS since it is widely used in Cybersecurity


• There are multiple algorithms for classification; in INACS we are going to address some of them:
• K-Nearest Neighbors (KNN)
• A simple and versatile algorithm that is quite explainable on its own
• Decision Tree (DT)
• DTs are highly explainable and well-performing algorithms, especially when multiple trees are combined in an Ensemble
• Ensemble Algorithms
• Well-performing models that take advantage of multiple estimators
K-Nearest Neighbors
• It works for both classification and regression problems

• It is easy to implement and understand, but it has a major drawback: it becomes significantly slower as the size of the dataset grows
K-Nearest Neighbors
• KNN requires:
• A set of labeled records

• A proximity metric to compute the distance/similarity between a pair of records

• E.g., Euclidean, Manhattan or Minkowski distance

• The value of k, the number of nearest neighbors to retrieve

• A method for using the class labels of the k nearest neighbors to determine the class label of an unknown record
• It can be the majority vote of the k nearest neighbors of the unknown record
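As a quick illustration of the proximity metrics listed above, a minimal Python/NumPy sketch (the feature vectors x and y are made-up numeric records, not course data):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # L2 norm
manhattan = np.sum(np.abs(x - y))                  # L1 norm
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)  # generalizes both (p=1 -> Manhattan, p=2 -> Euclidean)

print(euclidean, manhattan, minkowski)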
K-Nearest Neighbors
• KNN works as follows:

• In the training phase:


• Store all training records and labels in an efficient data structure

• Store the algorithm's configuration

• E.g., the number of neighbors k and the distance metric

• There is no real "training" involved, as the algorithm simply memorizes the training data (it is a "lazy learner")
K-Nearest Neighbors
• KNN works as follows:

• In the inference phase:


1. Determine the distance between the unknown record and all records of the dataset according to a specific measure (e.g., Euclidean)
• This is the algorithm's bottleneck, as it gets slower as the dataset grows
• Note: Manhattan distance is preferred over Euclidean for high-dimensional data

2. Take the k nearest neighbors

3. Among these k neighbors, count the data points that belong to each label

4. Assign the new data point to the most frequent label among its k nearest neighbors
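A minimal from-scratch sketch of these four inference steps in Python/NumPy (the tiny dataset and the "benign"/"attack" labels are illustrative only, not part of the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. distance between the unknown record and all training records (Euclidean)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. take the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. count the labels among these k neighbors
    votes = Counter(y_train[nearest])
    # 4. assign the most frequent label
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array(["benign", "benign", "attack", "attack"])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> "attack"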
K-Nearest Neighbors
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
• Sklearn’s implementation
• https://towardsdatascience.com/make-knn-300-times-faster-than-scikit-learns-in-20-lines-5e29d74e76bb
• https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
• With Facebook's faiss, a custom implementation can be up to 300x faster
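A short usage sketch of sklearn's KNeighborsClassifier on synthetic data (all parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_neighbors is k; metric can be "euclidean", "manhattan" or "minkowski"
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))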
Decision Trees
How can an algorithm be represented as a tree?

• Example from the Titanic dataset


• 3 features / attributes / columns considered
• Sex, age and sibsp (the number of siblings or spouses aboard)
• Decision nodes test the features; leaf nodes assign the classes
Decision Trees
• Nodes:
• Root node
• Node
• Leaf node
• Parent and Child Node

• Processes:
• Splitting
• Pruning
Decision Trees
• Intuition

Hitters dataset to predict a baseball player's Salary (mean log salary) based on Years (the number of years played in the major leagues) and Hits (the number of hits made in the previous year).

• R1 = {X | Years < 4.5}
• R2 = {X | Years >= 4.5, Hits < 117.5}
• R3 = {X | Years >= 4.5, Hits >= 117.5}
Decision Trees
• The decision boundaries are parallel to the axes because the condition tested at each tree node involves only one attribute at a time
Decision Trees
• Greedy strategy
• Separate the records based on the test of an attribute that optimizes a specific criterion

• Particularities
• How to split the records?
• How to determine the attribute(s) that lead to the best partition?
• How to specify the test condition of the attribute?
• Different algorithms lead to different decision trees, based on local optimization of attributes

• When to stop partitioning the data?


Decision Trees
• Splitting in Decision Trees

• Start at the tree root and split the data on the feature that results in the largest information gain (IG)
• The objective is to maximize the information gain (IG) at each split

• f is the feature used to perform the split

• Dp and Dj are the datasets of the parent and the j-th child node
• I is the impurity measure
• Np is the total number of samples at the parent node
• Nj is the number of samples in the j-th child node
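Putting these symbols together, the information gain maximized at each split can be written as (a standard formulation reconstructed from the definitions above; m is the number of child nodes):

IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)

For the binary trees used by most libraries this reduces to IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right}).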
Decision Trees
• Most libraries (including scikit-learn) implement binary decision trees

• How to calculate the Impurity?


• Gini Impurity
• Entropy
• Misclassification Impurity

Note: the lower the impurity of the child nodes, the larger the information gain
Decision Trees
• Gini Impurity
• A measure of how often a randomly chosen element from the dataset would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset

• where j indexes the classes present in the node and p_j is the proportion of class j in the node

• Total Gini Impurity = weighted average of the Gini impurities of the child nodes
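In symbols (a standard formulation consistent with the description above, where p_j is the proportion of class j in the node):

I_{Gini} = 1 - \sum_{j} p_j^{2}

Total Gini impurity of a split: \sum_{k} \frac{N_k}{N_p} \, I_{Gini}(D_k), i.e. the weighted average over the child nodes.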


Decision Trees
• Gini Impurity - example
• A simple simulation with the Heart Disease dataset, which has 303 rows and 13 attributes
• The target consists of 138 records with value 0 and 165 records with value 1
• Focus on the attributes Sex, fbs (fasting blood sugar) and exang (exercise-induced angina)

https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Decision Trees
• Gini Impurity - example
• Gini Impurity (Sex = 0)
• Gini Impurity (Sex = 1)
• Total Gini Impurity


Decision Trees
• Gini Impurity - example
• Gini Impurity (Fbs = 0)
• Gini Impurity (Fbs = 1)
• Total Gini Impurity


Decision Trees
• Gini Impurity - example
• Gini Impurity (Exang = 0) = 0.495
• Gini Impurity (Exang = 1) = 0.24
• Total Gini Impurity = 0.356
Decision Trees
• Gini Impurity – example
• Gini Impurity
• Sex = 0.399
• Fbs = 0.249
• Exang = 0.356

• Gini Impurity tells us the probability of misclassifying an observation

• Which is the best attribute to use for the partition?

• The lower the Gini, the better the split, as the likelihood of misclassification is lower
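A small Python sketch of the computation behind this example; the only numbers taken from the slides are the target counts (138 vs 165), while the child counts of the split are placeholders that would come from the actual dataset:

def gini(counts):
    # Gini impurity of a node given its class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def total_gini(children):
    # Weighted average of the children's Gini impurities (the "Total Gini Impurity")
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Class counts of the full Heart Disease target: 138 zeros and 165 ones
print(round(gini([138, 165]), 3))        # ~0.496

# Hypothetical child counts for a binary attribute split (placeholder numbers)
print(round(total_gini([[30, 70], [108, 95]]), 3))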
Decision Trees
• Entropy
• Measures the amount of information of a variable that can take n values

• where j indexes the classes present in the node and p_j is the proportion of class j in the node

• The attribute with the highest Information Gain on the first iteration becomes the Root Node

• Information Gain = information before partition – information after partition
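In symbols (a standard formulation, with p_j the proportion of class j in the node):

I_H = - \sum_{j} p_j \log_2 p_j

Information Gain: IG = I_H(D_p) - \sum_{j} \frac{N_j}{N_p} I_H(D_j)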


Decision Trees
• Entropy - example
• A simple simulation with the Heart Disease dataset, which has 303 rows and 13 attributes
• The target consists of 138 records with value 0 and 165 records with value 1
• Focus on the attributes Sex, fbs (fasting blood sugar) and exang (exercise-induced angina)

• Calculate the entropy of the Target attribute first

• Then compute the Information Gain of each attribute

https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Decision Trees
• Entropy - example
• Entropy (Sex = 0)
• Entropy (Sex = 1)
• Information Gain
Decision Trees
• Entropy - example
• Entropy (Fbs = 0)
• Entropy (Fbs = 1)
• Information Gain
Decision Trees
• Entropy - example
• Entropy (Exang = 0)
• Entropy (Exang = 1)
• Information Gain
Decision Trees
• Entropy – example

• Information Gain
• Sex = 0.328
• Fbs = 0.389
• Exang = 0.224

• Which is the best attribute to use for the partition?

• The higher the Information Gain, the better the split
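A short Python sketch of these entropy and Information Gain computations; the target counts (138 vs 165) come from the slides, the split counts are placeholders:

import math

def entropy(counts):
    # Entropy of a node given its class counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    # Entropy before the split minus the weighted entropy after the split
    n = sum(parent_counts)
    weighted_after = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted_after

# Entropy of the Heart Disease target (138 zeros, 165 ones), computed first as in the slides
print(round(entropy([138, 165]), 3))     # ~0.994

# Information gain of a hypothetical binary split of that target (placeholder counts)
print(round(information_gain([138, 165], [[30, 70], [108, 95]]), 3))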


Decision Trees
• Misclassification Impurity
• The fraction of records that would be misclassified if the node predicted its majority class: Error = 1 - max_j p_j

• Node with C1 = 0, C2 = 6: Error = 1 - 6/6 = 0
• Node with C1 = 1, C2 = 5: Error = 1 - 5/6 ≈ 0.167
• Node with C1 = 3, C2 = 3: Error = 1 - 3/6 = 0.5
Decision Trees
• Classification and Regression Trees (CART) algorithm
• Step 1: Start at the root node with all training instances
• Step 2: Select an attribute based on a splitting criterion
• Step 3: Partition the instances according to the selected attribute, recursively

• Partitioning stops when:

• There are no examples left
• All examples for a given node belong to the same class
• There are no remaining attributes for further partitioning – the majority class becomes the leaf

• Uses Gini Impurity
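A minimal sketch of these steps with sklearn's DecisionTreeClassifier, which implements an optimized CART-style algorithm (the dataset and parameter values below are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" selects the splitting attribute by Gini impurity, as in CART
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # text view of the learned splits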


CART - example
◦ Number of instances: 14
◦ Number of attributes: 4

◦ Step 1: Start at the root node with all training instances
◦ Step 2: Select an attribute based on splitting criteria (use Gini Impurity)
◦ Step 3: Partition instances according to the selected attribute recursively

https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Outlook

Gini(Outlook)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Temp)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Humidity)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Wind)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let’s place Outlook at the top of the tree
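A sketch of this "time to decide" step in Python/pandas: compute the weighted Gini impurity of every candidate attribute and pick the lowest. The tiny weather-style table below is illustrative only, not the slide's exact data:

import pandas as pd

def gini(labels):
    # Gini impurity of a set of class labels
    probs = labels.value_counts(normalize=True)
    return 1 - (probs ** 2).sum()

def weighted_gini(df, feature, target="Decision"):
    # Weighted average of the Gini impurity of each branch created by `feature`
    return sum(len(group) / len(df) * gini(group[target]) for _, group in df.groupby(feature))

# Tiny illustrative table in the spirit of the weather example (not the slide's exact data)
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Strong", "Strong"],
    "Decision": ["No", "No", "Yes", "Yes", "No", "Yes"],
})

scores = {f: weighted_gini(data, f) for f in ["Outlook", "Wind"]}
print(scores)                       # lower weighted Gini is better
print(min(scores, key=scores.get))  # attribute chosen for this node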


CART - example
CART - example

Next decisions:
• Focus on the sub dataset for Sunny outlook
• Focus on the sub dataset for Rain outlook
CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Outlook=Sunny and Temp.)


CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Outlook=Sunny and Humidity)


CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Outlook=Sunny and Wind) = 0.466
CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let's place Humidity under the Sunny branch of the Outlook node


CART - example
• Next decisions:
• Focus on the sub dataset for Sunny outlook
CART - example
• Next decisions:
• Focus on the sub dataset for Rain outlook
CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Outlook=Rain and Temp.)


CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Outlook=Rain and Humidity)


CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Outlook=Rain and Wind)


CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let's place Wind under the Rain branch of the Outlook node


CART - example
• Next decisions:
• Focus on the sub dataset for Rain outlook
CART - example
Decision Trees
• Stopping criteria

• All examples belong to the same class

• All examples have the same values in their attributes (but different classes)

• Few examples left

• The gain of all possible splits is very low


Decision Trees
• Model Selection
• The risk of model overfitting increases as the model becomes more complex

• Occam’s razor or the principle of parsimony


• Given two models with the same errors, the simpler model is preferred over the more complex one – keep it simple!
Decision Trees
• Advantages
• Simple to Understand, Visualize and Interpret
• Non-linear relationships between features do not affect its performance
• Can handle both Numerical and Categorical data

• Disadvantages
• The model can become unstable due to small variations in the data
• Overfitting happens when the algorithm captures noise in the data
• A highly complicated decision tree tends to have low bias and high variance
• Which makes it hard for the model to generalize to new data
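A short sketch of how these overfitting/instability issues are usually mitigated in practice with sklearn's depth and pruning parameters (all values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree grows until it memorizes the training data (low bias / high variance)
deep = DecisionTreeClassifier(random_state=0)

# Limiting depth/leaf size and cost-complexity pruning trade a little bias for less variance
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, ccp_alpha=0.01, random_state=0)

for name, model in [("unconstrained", deep), ("pruned", pruned)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())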
Decision Trees
• Using a Decision Tree Analysis for Intrusion Detection: A How-To Guide
• https://sansorg.egnyte.com/dl/6edQgfwngE

• Classification of Malware Attacks Using Machine Learning In Decision Tree


• http://repository.uwl.ac.uk/id/eprint/8022/1/IJS-155.pdf

• Network intrusion detection system using J48 Decision Tree


• https://ieeexplore.ieee.org/document/7275914

• A Consolidated Decision Tree-Based Intrusion Detection System for Binary and Multiclass
Imbalanced Datasets
• https://www.mdpi.com/2227-7390/9/7/751/pdf

• Decision Tree and Context Based Malicious Event Detection


• Chapter 7 of Book “Hands-On Machine Learning for Cybersecurity”
Ensemble Learning
• It uses multiple machine learning models to make better predictions

• Specific models do well in modeling one aspect of the data, while others do well in modeling another

• Instead of learning a single complex model, learn several simple models and combine their outputs to build the final decision

• The combined strength of the models compensates for the individual models' variances and biases

• Delivers a combined prediction whose final accuracy is better than the accuracy of the individual models
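A minimal sketch of combining several simple models into one decision, here with sklearn's VotingClassifier and majority voting (the models and data are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each simple model captures a different aspect of the data; the vote combines them
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
    ],
    voting="hard",  # majority voting of the predicted labels
)
print(cross_val_score(ensemble, X, y, cv=5).mean())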
Bagging
• Bagging (Bootstrap Aggregating)
• Random samples of the training data set (subsets of the training data set) are created
• A classifier is built for each sample
• The results of these multiple classifiers are combined using averaging or majority voting
• Bagging helps to reduce the variance error (!)
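A minimal bagging sketch with sklearn's BaggingClassifier (in recent scikit-learn versions the base model is passed as "estimator"; older versions call it "base_estimator"; the other values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 50 trees, each trained on a bootstrap sample of the training data;
# their majority vote is the final prediction
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,   # fraction of the data drawn for each bootstrap sample
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(bagging, X, y, cv=5).mean())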
Boosting
• Boosting
• Boosting provides sequential learning of the predictors
• The first predictor is learned on the whole data set, while the following ones are learned on the errors of the previous ones
Boosting
• Boosting
• It starts by classifying the original data set and giving equal weights to each observation

• If classes are predicted incorrectly by the first learner, it gives a higher weight to the misclassified observations

• Being an iterative process, it continues to add learners until a limit on the number of models or on accuracy is reached

• Boosting has shown better predictive accuracy than bagging, but it also tends to overfit the training data (!)
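As a sketch of this re-weighting idea, AdaBoost (one classic boosting algorithm, used here only as an example) with sklearn:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each new weak learner focuses on the observations the previous ones got wrong
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print(cross_val_score(boosting, X, y, cv=5).mean())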
Stacking
• Stacking
• Use a learner to combine the output from different learners, learning a function of their predictions

• This can lead to a decrease in either bias or variance error, depending on the combining learner we use
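A minimal stacking sketch with sklearn's StackingClassifier, where a logistic regression learns how to combine the base learners' predictions (models and data are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A final learner (logistic regression) learns a function of the base learners' predictions
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("dt", DecisionTreeClassifier(max_depth=5))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5).mean())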
Tree Ensembles
• Tree Ensembles

• Random Forest
• The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method
• Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features
• This results in a wide diversity that generally leads to a better model
• It is based on the "Wisdom of Crowds" theory

• To assure good performance:

• It requires good, representative data
• The correlation between the predictions of the individual trees must be low
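A minimal Random Forest sketch with sklearn (parameter values are illustrative; max_features controls the random subset of features considered at each split):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 100 bagged trees; each split considers only a random subset of the features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())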
Tree Ensembles
• Tree Ensembles

• XGBoost (eXtreme Gradient Boosting)


• Highly efficient tree-ensemble based on “boosting”
• It has been the winning model in many open Kaggle competitions
• Features:
• Sparse Aware implementation with automatic handling of missing data values
• Block Structure to support the parallelization of tree construction
• Continued Training, so that you can further boost an already fitted model on new data
• Supports several boosting algorithms:
• Gradient Boosting with a custom learning rate
• Stochastic Gradient Boosting with sub-sampling at the row and column levels
• Regularized Gradient Boosting with both L1 and L2 regularization
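A minimal usage sketch of the xgboost package's sklearn-style XGBClassifier; the parameter values are illustrative and map onto the features listed above (learning rate, row sub-sampling, L1/L2 regularization):

# Requires the xgboost package (pip install xgboost)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,   # custom learning rate for gradient boosting
    subsample=0.8,       # stochastic gradient boosting (row sub-sampling)
    reg_alpha=0.1,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))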
Tree Ensembles
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• sklearn’s “RandomForestClassifier”
• https://xgboost.readthedocs.io/en/stable/tutorials/model.html
• Inner details of XGBoost framework
• https://practicaldatascience.co.uk/machine-learning/how-to-create-a-classification-model-using-xgboost
• https://www.kaggle.com/code/dansbecker/xgboost/notebook
• Practical examples with XGBoost
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
• sklearn’s “StackingClassifier”
