
Classification Algorithms

INTELIGÊNCIA ARTIFICIAL E CIBERSEGURANÇA (INACS)


nunal@isep.ipp.pt
oms@isep.ipp.pt
icp@isep.ipp.pt
Classification Algorithms
• Machine Learning comprises different learning paradigms that relate to different tasks
• When focusing on Supervised Learning
• We can perform Classification and Regression tasks

• Classification will be addressed in INACS since it is widely used in Cybersecurity


• There are multiple algorithms for classification; in INACS we are going to address some of them:
• K-Nearest Neighbors (KNN)
• A simple and versatile algorithm that is quite explainable on its own
• Decision Tree (DT)
• DTs are highly explainable and well-performing algorithms, especially when multiple trees are combined in an Ensemble
• Ensemble Algorithms
• Well-performing models that take advantage of multiple estimators
K-Nearest Neighbors
• It works for both classification and regression problems

• It is easy to implement and understand, but it has a major drawback: it becomes significantly slower as the size of the dataset grows
K-Nearest Neighbors
• KNN requires:
• A set of labeled records

• A proximity metric to compute the distance/similarity between a pair of records

• E.g., Euclidean, Manhattan or Minkowski distance

• The value of k, the number of nearest neighbors to retrieve

• A method for using the class labels of the k nearest neighbors to determine the class label of an unknown record
• It can be the majority vote of the k nearest neighbors of the unknown record
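As a quick illustration of the proximity metrics listed above, a minimal Python/NumPy sketch (the feature vectors x and y are made-up numeric records, not course data):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # L2 norm
manhattan = np.sum(np.abs(x - y))                  # L1 norm
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)  # generalizes both (p=1 -> Manhattan, p=2 -> Euclidean)

print(euclidean, manhattan, minkowski)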
K-Nearest Neighbors
• KNN works as follows:

• In the training phase:


• Store all training records and labels in an efficient data structure

• Store the algorithm's configuration

• E.g., the number of neighbors k and the distance metric

• There is no real "training" involved, as the algorithm simply memorizes the training data (it is a "lazy learner")
K-Nearest Neighbors
• KNN works as follows:

• In the inference phase:


1. Determine the distance between the unknown record and all records of the dataset according to a specific measure (e.g., Euclidean)
• This is the algorithm's bottleneck, as it gets slower as the dataset grows
• Note: Manhattan distance is preferred over Euclidean for high-dimensional data

2. Take the k nearest neighbors

3. Among these k neighbors, count the data points that belong to each label

4. Assign the new data point to the most frequent label among its k nearest neighbors
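A minimal from-scratch sketch of these four inference steps in Python/NumPy (the tiny dataset and the "benign"/"attack" labels are illustrative only, not part of the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. distance between the unknown record and all training records (Euclidean)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. take the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. count the labels among these k neighbors
    votes = Counter(y_train[nearest])
    # 4. assign the most frequent label
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array(["benign", "benign", "attack", "attack"])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> "attack"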
K-Nearest Neighbors
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
• Sklearn’s implementation
• https://towardsdatascience.com/make-knn-300-times-faster-than-scikit-learns-in-20-lines-5e29d74e76bb
• https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
• With Facebook's faiss, a custom implementation can be up to 300x faster
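A short usage sketch of sklearn's KNeighborsClassifier on synthetic data (all parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_neighbors is k; metric can be "euclidean", "manhattan" or "minkowski"
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))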
Decision Trees
How can an algorithm be represented as a tree?

• Example from the Titanic dataset


• 3 features / attributes / columns considered
• Sex, age and sibsp (the number of siblings or spouses aboard)
• Decision nodes test the features; leaf nodes assign the classes
Decision Trees
• Nodes:
• Root node
• Node
• Leaf node
• Parent and Child Node

• Processes:
• Splitting
• Pruning
Decision Trees
• Intuition

Hitters dataset to predict a baseball player's Salary (mean log salary) based on Years (the number of years played in the major leagues) and Hits (the number of hits made in the previous year).

• R1 = {X | Years < 4.5}
• R2 = {X | Years >= 4.5, Hits < 117.5}
• R3 = {X | Years >= 4.5, Hits >= 117.5}
Decision Trees
• The decision boundaries are parallel to the axes because the condition tested at each tree node involves only one attribute at a time
Decision Trees
• Greedy strategy
• Separate the records based on the test of an attribute that optimizes a specific criterion

• Particularities
• How to split the records?
• How to determine the attribute(s) that lead to the best partition?
• How to specify the test condition of the attribute?
• Different algorithms lead to different decision trees, based on local optimization of attributes

• When to stop partitioning the data?


Decision Trees
• Splitting in Decision Trees

• Start at the tree root and split the data on the feature that results in the largest information gain (IG)
• The objective is to maximize the information gain (IG) at each split

• f is the feature used to perform the split

• Dp and Dj are the datasets of the parent and the j-th child node
• I is the impurity measure
• Np is the total number of samples at the parent node
• Nj is the number of samples in the j-th child node
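Putting these symbols together, the information gain maximized at each split can be written as (a standard formulation reconstructed from the definitions above; m is the number of child nodes):

IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)

For the binary trees used by most libraries this reduces to IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right}).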
Decision Trees
• Most libraries (including scikit-learn) implement binary decision trees

• How to calculate the Impurity?


• Gini Impurity
• Entropy
• Misclassification Impurity

Note: the lower the impurity of the child nodes, the larger the information gain
Decision Trees
• Gini Impurity
• A measure of how often a randomly chosen element from the dataset would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset

• where j indexes the classes present in the node and p_j is the proportion of class j in the node

• Total Gini Impurity = weighted average of the Gini impurities of the child nodes
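In symbols (a standard formulation consistent with the description above, where p_j is the proportion of class j in the node):

I_{Gini} = 1 - \sum_{j} p_j^{2}

Total Gini impurity of a split: \sum_{k} \frac{N_k}{N_p} \, I_{Gini}(D_k), i.e. the weighted average over the child nodes.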


Decision Trees
• Gini Impurity - example
• A simple simulation with the Heart Disease dataset, which has 303 rows and 13 attributes
• The target consists of 138 records with value 0 and 165 records with value 1
• Focus on the attributes Sex, fbs (fasting blood sugar) and exang (exercise-induced angina)

https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Decision Trees
• Gini Impurity - example
• Gini Impurity (Sex = 0)
• Gini Impurity (Sex = 1)
• Total Gini Impurity


Decision Trees
• Gini Impurity - example
• Gini Impurity (Fbs = 0)
• Gini Impurity (Fbs = 1)
• Total Gini Impurity


Decision Trees
• Gini Impurity - example
• Gini Impurity (Exang = 0) = 0.495
• Gini Impurity (Exang = 1) = 0.24
• Total Gini Impurity = 0.356
Decision Trees
• Gini Impurity – example
• Gini Impurity
• Sex = 0.399
• Fbs = 0.249
• Exang = 0.356

• Gini Impurity tells us the probability of misclassifying an observation

• Which is the best attribute to use for the partition?

• The lower the Gini, the better the split, as the likelihood of misclassification is lower
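A small Python sketch of the computation behind this example; the only numbers taken from the slides are the target counts (138 vs 165), while the child counts of the split are placeholders that would come from the actual dataset:

def gini(counts):
    # Gini impurity of a node given its class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def total_gini(children):
    # Weighted average of the children's Gini impurities (the "Total Gini Impurity")
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Class counts of the full Heart Disease target: 138 zeros and 165 ones
print(round(gini([138, 165]), 3))        # ~0.496

# Hypothetical child counts for a binary attribute split (placeholder numbers)
print(round(total_gini([[30, 70], [108, 95]]), 3))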
Decision Trees
• Entropy
• Measures the amount of information of a variable that can take n values

• where j indexes the classes present in the node and p_j is the proportion of class j in the node

• The attribute with the highest Information Gain on the first iteration becomes the Root Node

• Information Gain = information before partition – information after partition
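In symbols (a standard formulation, with p_j the proportion of class j in the node):

I_H = - \sum_{j} p_j \log_2 p_j

Information Gain: IG = I_H(D_p) - \sum_{j} \frac{N_j}{N_p} I_H(D_j)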


Decision Trees
• Entropy - example
• A simple simulation with the Heart Disease dataset, which has 303 rows and 13 attributes
• The target consists of 138 records with value 0 and 165 records with value 1
• Focus on the attributes Sex, fbs (fasting blood sugar) and exang (exercise-induced angina)

• Calculate the entropy of the Target attribute first

• Then compute the Information Gain of each attribute

https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Decision Trees
• Entropy - example
• Entropy (Sex = 0)
• Entropy (Sex = 1)
• Information Gain
Decision Trees
• Entropy - example
• Entropy (Fbs = 0)
• Entropy (Fbs = 1)
• Information Gain
Decision Trees
• Entropy - example
• Entropy (Exang = 0)
• Entropy (Exang = 1)
• Information Gain
Decision Trees
• Entropy – example

• Information Gain
• Sex = 0.328
• Fbs = 0.389
• Exang = 0.224

• Which is the best attribute to use for the partition?

• The higher the Information Gain, the better the split
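A short Python sketch of these entropy and Information Gain computations; the target counts (138 vs 165) come from the slides, the split counts are placeholders:

import math

def entropy(counts):
    # Entropy of a node given its class counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    # Entropy before the split minus the weighted entropy after the split
    n = sum(parent_counts)
    weighted_after = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted_after

# Entropy of the Heart Disease target (138 zeros, 165 ones), computed first as in the slides
print(round(entropy([138, 165]), 3))     # ~0.994

# Information gain of a hypothetical binary split of that target (placeholder counts)
print(round(information_gain([138, 165], [[30, 70], [108, 95]]), 3))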


Decision Trees
• Misclassification Impurity
• The fraction of records that would be misclassified if the node predicted its majority class: Error = 1 - max_j p_j

• Node with C1 = 0, C2 = 6: Error = 1 - 6/6 = 0
• Node with C1 = 1, C2 = 5: Error = 1 - 5/6 ≈ 0.167
• Node with C1 = 3, C2 = 3: Error = 1 - 3/6 = 0.5
Decision Trees
• Classification and Regression Trees (CART) algorithm
• Step 1: Start at the root node with all training instances
• Step 2: Select an attribute based on a splitting criterion
• Step 3: Partition the instances according to the selected attribute, recursively

• Partitioning stops when:

• There are no examples left
• All examples for a given node belong to the same class
• There are no remaining attributes for further partitioning – the majority class becomes the leaf

• Uses Gini Impurity
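A minimal sketch of these steps with sklearn's DecisionTreeClassifier, which implements an optimized CART-style algorithm (the dataset and parameter values below are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" selects the splitting attribute by Gini impurity, as in CART
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # text view of the learned splits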


CART - example
◦ Number of instances: 14
◦ Number of attributes: 4

◦ Step 1: Start at the root node with all training instances
◦ Step 2: Select an attribute based on splitting criteria (use Gini Impurity)
◦ Step 3: Partition instances according to the selected attribute recursively

https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Outlook

Gini(Outlook)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Temp)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Humidity)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Wind)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let’s place Outlook at the top of the tree
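A sketch of this "time to decide" step in Python/pandas: compute the weighted Gini impurity of every candidate attribute and pick the lowest. The tiny weather-style table below is illustrative only, not the slide's exact data:

import pandas as pd

def gini(labels):
    # Gini impurity of a set of class labels
    probs = labels.value_counts(normalize=True)
    return 1 - (probs ** 2).sum()

def weighted_gini(df, feature, target="Decision"):
    # Weighted average of the Gini impurity of each branch created by `feature`
    return sum(len(group) / len(df) * gini(group[target]) for _, group in df.groupby(feature))

# Tiny illustrative table in the spirit of the weather example (not the slide's exact data)
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Strong", "Strong"],
    "Decision": ["No", "No", "Yes", "Yes", "No", "Yes"],
})

scores = {f: weighted_gini(data, f) for f in ["Outlook", "Wind"]}
print(scores)                       # lower weighted Gini is better
print(min(scores, key=scores.get))  # attribute chosen for this node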


CART - example
CART - example

Next decisions:
• Focus on the sub dataset for Sunny outlook
• Focus on the sub dataset for Rain outlook
CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Outlook=Sunny and Temp.)


CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Outlook=Sunny and Humidity)


CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Outlook=Sunny and Wind) = 0.466
CART - example
Focus on the sub dataset for Sunny outlook
Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let's place Humidity under the Sunny branch of the Outlook node


CART - example
• Next decisions:
• Focus on the sub dataset for Sunny outlook
CART - example
• Next decisions:
• Focus on the sub dataset for Rain outlook
CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Outlook=Rain and Temp.)


CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Outlook=Rain and Humidity)


CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Outlook=Rain and Wind)


CART - example
Focus on the sub dataset for Rain outlook
Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let's place Wind under the Rain branch of the Outlook node


CART - example
• Next decisions:
• Focus on the sub dataset for Rain outlook
CART - example
Decision Trees
• Stopping criteria

• All examples belong to the same class

• All examples have the same values in their attributes (but different classes)

• Few examples left

• The gain of all possible splits is very low


Decision Trees
• Model Selection
• The risk of model overfitting increases as the model becomes more complex

• Occam’s razor or the principle of parsimony


• Given two models with the same errors, the simpler model is preferred over the more complex one – keep it simple!
Decision Trees
• Advantages
• Simple to Understand, Visualize and Interpret
• Non-linear relationships between features do not affect its performance
• Can handle both Numerical and Categorical data

• Disadvantages
• The model can become unstable due to small variations in the data
• Overfitting happens when the algorithm captures noise in the data
• A highly complicated decision tree tends to have low bias and high variance
• Which makes it hard for the model to generalize to new data
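A short sketch of how these overfitting/instability issues are usually mitigated in practice with sklearn's depth and pruning parameters (all values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree grows until it memorizes the training data (low bias / high variance)
deep = DecisionTreeClassifier(random_state=0)

# Limiting depth/leaf size and cost-complexity pruning trade a little bias for less variance
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, ccp_alpha=0.01, random_state=0)

for name, model in [("unconstrained", deep), ("pruned", pruned)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())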
Decision Trees
• Using a Decision Tree Analysis for Intrusion Detection: A How-To Guide
• https://sansorg.egnyte.com/dl/6edQgfwngE

• Classification of Malware Attacks Using Machine Learning In Decision Tree


• http://repository.uwl.ac.uk/id/eprint/8022/1/IJS-155.pdf

• Network intrusion detection system using J48 Decision Tree


• https://ieeexplore.ieee.org/document/7275914

• A Consolidated Decision Tree-Based Intrusion Detection System for Binary and Multiclass
Imbalanced Datasets
• https://www.mdpi.com/2227-7390/9/7/751/pdf

• Decision Tree and Context Based Malicious Event Detection


• Chapter 7 of Book “Hands-On Machine Learning for Cybersecurity”
Ensemble Learning
• It uses multiple machine learning models to make better predictions

• Specific models do well in modeling one aspect of the data, while others do well in modeling another

• Instead of learning a single complex model, learn several simple models and combine their outputs to build the final decision

• The combined strength of the models compensates for the individual models' variances and biases

• Delivers a combined prediction whose final accuracy is better than the accuracy of the individual models
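A minimal sketch of combining several simple models into one decision, here with sklearn's VotingClassifier and majority voting (the models and data are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each simple model captures a different aspect of the data; the vote combines them
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
    ],
    voting="hard",  # majority voting of the predicted labels
)
print(cross_val_score(ensemble, X, y, cv=5).mean())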
Bagging
• Bagging (Bootstrap Aggregating)
• Random samples of the training data set (subsets of the training data set) are created
• A classifier is built for each sample
• The results of these multiple classifiers are combined using averaging or majority voting
• Bagging helps to reduce the variance error (!)
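A minimal bagging sketch with sklearn's BaggingClassifier (in recent scikit-learn versions the base model is passed as "estimator"; older versions call it "base_estimator"; the other values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 50 trees, each trained on a bootstrap sample of the training data;
# their majority vote is the final prediction
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,   # fraction of the data drawn for each bootstrap sample
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(bagging, X, y, cv=5).mean())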
Boosting
• Boosting
• Boosting provides sequential learning of the predictors
• The first predictor is learned on the whole data set, while the following ones are learned on the errors of the previous ones
Boosting
• Boosting
• It starts by classifying the original data set and giving equal weights to each observation

• If classes are predicted incorrectly by the first learner, it gives a higher weight to the misclassified observations

• Being an iterative process, it continues to add learners until a limit on the number of models or on accuracy is reached

• Boosting has shown better predictive accuracy than bagging, but it also tends to overfit the training data (!)
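As a sketch of this re-weighting idea, AdaBoost (one classic boosting algorithm, used here only as an example) with sklearn:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each new weak learner focuses on the observations the previous ones got wrong
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print(cross_val_score(boosting, X, y, cv=5).mean())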
Stacking
• Stacking
• Use a learner to combine the output from different learners, learning a function of their predictions

• This can lead to a decrease in either bias or variance error, depending on the combining learner we use
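A minimal stacking sketch with sklearn's StackingClassifier, where a logistic regression learns how to combine the base learners' predictions (models and data are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A final learner (logistic regression) learns a function of the base learners' predictions
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("dt", DecisionTreeClassifier(max_depth=5))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5).mean())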
Tree Ensembles
• Tree Ensembles

• Random Forest
• The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method
• Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features
• This results in a wide diversity that generally leads to a better model
• It is based on the "Wisdom of Crowds" theory

• To assure good performance:

• It requires good, representative data
• The correlation between the predictions of the individual trees must be low
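A minimal Random Forest sketch with sklearn (parameter values are illustrative; max_features controls the random subset of features considered at each split):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 100 bagged trees; each split considers only a random subset of the features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())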
Tree Ensembles
• Tree Ensembles

• XGBoost (eXtreme Gradient Boosting)


• Highly efficient tree-ensemble based on “boosting”
• It has been the winning model in many open Kaggle competitions
• Features:
• Sparse Aware implementation with automatic handling of missing data values
• Block Structure to support the parallelization of tree construction
• Continued Training, so that you can further boost an already fitted model on new data
• Supports several boosting algorithms:
• Gradient Boosting with a custom learning rate
• Stochastic Gradient Boosting with sub-sampling at the row and column levels
• Regularized Gradient Boosting with both L1 and L2 regularization
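A minimal usage sketch of the xgboost package's sklearn-style XGBClassifier; the parameter values are illustrative and map onto the features listed above (learning rate, row sub-sampling, L1/L2 regularization):

# Requires the xgboost package (pip install xgboost)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,   # custom learning rate for gradient boosting
    subsample=0.8,       # stochastic gradient boosting (row sub-sampling)
    reg_alpha=0.1,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))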
Tree Ensembles
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• sklearn’s “RandomForestClassifier”
• https://xgboost.readthedocs.io/en/stable/tutorials/model.html
• Inner details of XGBoost framework
• https://practicaldatascience.co.uk/machine-learning/how-to-create-a-classification-model-using-xgboost
• https://www.kaggle.com/code/dansbecker/xgboost/notebook
• Practical examples with XGBoost
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
• sklearn’s “StackingClassifier”
