Anomaly Detection - Ensemble - Classifiers

Anomaly Detection

Outlier detection is the process of identifying data points that lie significantly far from the average and, depending on the goal, either removing or resolving them so they do not skew the analysis.
Outliers are most commonly caused by
➢ Dummy outliers created to test detection methods
➢ Data manipulation or unintended mutations of the data set
➢ Extracting or mixing data from wrong or varied sources
➢ Genuine novelties in the data (not errors)
➢ Data entry errors (human errors)
Difference Between Anomaly and Outlier
➢ An outlier is an actual data point that lies significantly outside a distribution’s mean or median.
➢ Extreme values in a data series are called outliers. They are questionable but possible: one student can genuinely be much more brilliant than the other students in the same class.
➢ An anomaly is a false data point generated by a different process than the rest of the data.
➢ Anomalies are unquestionably errors. For example, a reading of one million degrees outside, or an air temperature that stays exactly the same for two weeks. As a result, you disregard such data.
Types of Anomaly Detection
1. Point Anomaly
2. Contextual Anomaly
3. Collective Anomaly
Point Anomaly

A tuple within the dataset is called a point anomaly if it lies far away from the rest of the data.
Example: a sudden transaction of a very large amount on a credit card.
Contextual Anomaly
A contextual anomaly is also known as a conditional outlier.
If a particular observation differs from other data points only within a specific context, it is known as a contextual anomaly.
In this type of anomaly, a value that is anomalous in one context may not be anomalous in another context.
Collective Anomaly
Collective anomalies occur when a set of related data points is anomalous with respect to the whole dataset; such values are known as collective outliers.
In this type of anomaly, the individual values are not anomalous on their own, either globally or contextually, but their occurrence together is.
Categories of Anomaly Detection Techniques
Anomaly detection techniques are broadly categorized into two types:
1. Supervised anomaly detection
2. Unsupervised anomaly detection
Supervised Anomaly detection
➢ Supervised learning techniques use real-world input and output data
to detect anomalies.
➢ These types of anomaly detection systems require a data analyst to
label data points as either normal or abnormal to be used as training
data.
➢ A machine learning model trained with labeled data will be able to
detect outliers based on the examples it is given.
➢ This type of machine learning is useful for detecting known outliers but is not capable of discovering unknown anomalies or predicting future issues.
Unsupervised Anomaly detection
➢ Unsupervised learning techniques do not require labeled data and can
handle more complex data sets.
➢ They find patterns in the input data and make assumptions about which data is considered normal.
Anomaly detection techniques
➢ Density-based algorithms determine when an outlier differs from a larger, and hence denser, normal data set, using algorithms like K-nearest neighbors and Isolation Forest.
➢ Cluster-based algorithms evaluate how any point differs from clusters of related data using techniques like K-means cluster analysis.
➢ Bayesian network algorithms develop models for estimating the probability that events will occur based on related data and then identify significant deviations from these predictions.
➢ Neural network algorithms train a neural network to predict an expected time series and then flag deviations.
1. Isolation forest
2. Local outlier factor
3. Robust covariance
4. One-Class support vector machine (SVM)
5. One-Class SVM with stochastic gradient descent
(SGD)
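As a quick illustration of one of these methods, here is a minimal sketch using scikit-learn's IsolationForest; the toy data and the contamination value are illustrative assumptions.

import numpy as np
from sklearn.ensemble import IsolationForest

# toy 2-D data: a dense cluster plus two far-away points (assumed for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[6, 6], [-7, 5]]])

iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)      # -1 = anomaly, 1 = normal
print(X[labels == -1])           # the points flagged as anomalies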
Interquartile Range (IQR)
• IQR measures variability by dividing the dataset into four equal parts.
• First, the entire data set is sorted in ascending order.
• It is then split into four equal parts separated by the quartiles Q1, Q2, and Q3, and the interquartile range is IQR = Q3 − Q1.
• Points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers.
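A minimal sketch of IQR-based outlier flagging with NumPy; the data array and the conventional 1.5 multiplier are illustrative assumptions.

import numpy as np

# illustrative data; replace with your own 1-D numeric array (assumption)
data = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, 13])

q1, q3 = np.percentile(data, [25, 75])            # first and third quartiles
iqr = q3 - q1                                     # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # outlier fences

outliers = data[(data < lower) | (data > upper)]
print(outliers)                                   # -> [95]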
Z-score Method
• The Z-score of a value is the difference between that value and the mean, divided by the standard deviation.
• Z-scores help identify outliers: a data point is flagged if its Z-score is less than -3 or greater than +3.
• The Z-score can be mathematically expressed as z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
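A minimal sketch of Z-score outlier flagging with NumPy; the ±3 threshold follows the slide, while the generated data is an illustrative assumption.

import numpy as np

# illustrative data: 50 typical readings plus one extreme value (assumption)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(12, 1, size=50), [95]])

z = (data - data.mean()) / data.std()   # z = (x - mu) / sigma
outliers = data[np.abs(z) > 3]          # flag points with |z| > 3
print(outliers)                         # the extreme value is flagged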
Local Outlier Factor (LOF)
• Local Outlier Factor is an unsupervised machine learning technique that detects outliers based on the density of a data point's closest neighborhood; it works well when the spread (density) of the dataset is not uniform.
• LOF considers the K-distance (the distance between points) and the K-neighbors (the set of points lying within the circle of radius K-distance).
• LOF takes two major parameters:
(1) n_neighbors: the number of neighbors, which has a default value of 20
(2) contamination: the proportion of outliers in the given dataset, which can be set to ‘auto’ or a float value (e.g., 0.02, 0.005)
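A minimal sketch using scikit-learn's LocalOutlierFactor with the two parameters named above; the toy 2-D data is an illustrative assumption.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# toy 2-D data: a dense cluster plus a few far-away points (assumption)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[4, 4], [5, -4], [-5, 5]]])

lof = LocalOutlierFactor(n_neighbors=20, contamination='auto')
labels = lof.fit_predict(X)          # -1 = outlier, 1 = inlier
print(X[labels == -1])               # the points flagged as outliers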
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Works with multiple numeric features (multivariate data).
• DBSCAN forms clusters from nearby data points based on high- and low-density regions, and it considers two main parameters:
(1) epsilon: the neighborhood radius around a data point, which can be chosen from a k-distance graph
(2) min_samples: the number of data points required within the epsilon radius, which depends on domain knowledge or expert advice
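A minimal sketch using scikit-learn's DBSCAN, where points labelled -1 are treated as noise/outliers; the eps and min_samples values and the toy data are illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

# toy 2-D data: two dense blobs plus a couple of isolated points (assumption)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(5, 0.3, size=(100, 2)),
               [[2.5, 2.5], [8, -3]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps and min_samples are assumed values
outliers = X[db.labels_ == -1]               # label -1 marks noise points
print(outliers)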
IQR is the simplest technique and the easiest to explain mathematically.
It is good for identifying outliers in univariate and bivariate data, as it relies on the median and quartiles, which are robust to extreme values.
It is limited for multivariate datasets with a large number of numeric features.
The Z-score measures how far a raw value is from the mean in units of standard deviation, and it works best on normally distributed data sets.
When the dataset is not symmetric (left- or right-skewed), the Z-score technique may lead to erroneous results.
LOF (Local Outlier Factor) has an advantage when the data spread (density) is not uniform throughout the space, as it identifies outliers based on their proximity to neighboring dense regions, where global methods struggle.
However, explainability is an issue, as it is difficult to say at what threshold a data point can be considered an outlier.
DBSCAN does not require the number of clusters to be specified in advance and can detect anomalies where the data spread is arbitrarily distributed and not linearly separable.
However, it has limitations when working with data of varying density.
Difference between Supervised and Unsupervised Learning
Regression analysis in ML
Introduction
➢ Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
➢ More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed.
➢ It predicts continuous/real values such as temperature, age, salary, price, etc.
Terminologies Related to Regression Analysis
• Dependent Variable
• Independent Variable
• Outliers
• Multicollinearity
• If the independent variables are highly correlated with each other, the condition is called multicollinearity.
• It should not be present in the dataset, because it creates problems when ranking the most influential variables.
• Underfitting and Overfitting
Why do we use Regression Analysis?
➢ Regression estimates the relationship between the target and the independent variables.
➢ It is used to find trends in the data.
➢ It helps to predict real/continuous values.
➢ By performing regression, we can confidently determine the most important factor, the least important factor, and how each factor affects the others. A minimal fitting sketch follows below.
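A minimal sketch of fitting a regression model with scikit-learn's LinearRegression; the experience-vs-salary data and variable names are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

# illustrative data: years of experience (independent) vs salary in thousands (dependent)
X = np.array([[1], [2], [3], [4], [5], [6]])   # predictor
y = np.array([30, 35, 42, 48, 55, 61])         # continuous target

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)           # estimated relationship
print(model.predict([[7]]))                    # predict a continuous value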
Ensemble Learning
Ensemble learning is a machine learning technique that enhances accuracy and resilience in forecasting by merging predictions from multiple models.
It aims to mitigate errors or biases that may exist in individual models by leveraging the collective intelligence of the ensemble.
It combines the outputs of diverse models to create a more precise prediction.
Ensemble learning improves the overall performance of the learning system.
It not only enhances accuracy but also provides resilience against uncertainties in the data.
By effectively merging predictions from multiple models, ensemble learning has proven to be a powerful tool in various domains, offering more robust and reliable forecasts.
Simple Ensemble Techniques
1. Max Voting
2. Averaging
3. Weighted Averaging
Max Voting technique
➢ Used for classification problems.
➢ Multiple models are used to make predictions for each data point.
➢ The prediction made by each model is considered a ‘vote’.
➢ The prediction we get from the majority of the models is used as the final prediction.
Averaging technique
➢ The predictions from all models are averaged, and the average is used as the final prediction.
Weighted Averaging technique
➢ Each model's prediction is multiplied by a weight reflecting its importance, and the weighted average is used as the final prediction.
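A minimal NumPy sketch of the three simple combiners for a single sample, assuming three hypothetical classifiers that output the probability of the class “spam”; the probabilities and weights are illustrative.

import numpy as np

# predicted probability of "spam" from three hypothetical models (assumption)
probs = np.array([0.8, 0.4, 0.3])
votes = (probs >= 0.5).astype(int)           # hard predictions: 1 = spam, 0 = not spam

max_vote = np.bincount(votes).argmax()       # max voting: majority class -> 0 (not spam)
avg = probs.mean()                           # averaging -> 0.5

weights = np.array([0.5, 0.3, 0.2])          # assumed model weights
weighted_avg = np.average(probs, weights=weights)   # weighted averaging -> 0.58

print(max_vote, avg, weighted_avg)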
Advanced Ensemble Techniques
1. Stacking
2. Blending
3. Bagging
4. Boosting


Bagging
• Bagging, or Bootstrap Aggregation, is a parallel ensemble learning technique used to reduce the variance of the final prediction.
• It is similar to averaging; the only difference is that bagging uses random sub-samples of the original dataset to train the same or multiple models and then combines their predictions, whereas in averaging the same dataset is used to train the models.
• It is called Bootstrap Aggregation because it combines Bootstrapping (sampling of the data) and Aggregation to form an ensemble model.
In bagging, every base model is trained on a different subset of the data and all the results are combined, so the final model is less overfitted and variance is reduced.
Example: Random Forest
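A minimal sketch of bagging with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the synthetic dataset and parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# synthetic classification data (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each base learner is trained on a different bootstrap sample of the training data
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(bag.score(X_test, y_test))   # accuracy of the aggregated ensemble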


Voting Technique: Hard and Soft
Hard Voting
▪ 3 classifiers out of 5 voted for the email being not spam.
▪ On the other hand, 2 out of 5 voted the email as spam.
▪ The class with the most votes is the final prediction: NOT SPAM.
Soft Voting
Soft voting considers the probability scores of each class predicted by the individual models and averages them to produce a more refined final prediction.
“Not spam” (0.502) > “spam” (0.498)
Final Prediction: “NOT SPAM”
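A minimal sketch of hard vs. soft voting using scikit-learn's VotingClassifier; the three base models and the synthetic data are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data (assumption)
estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('nb', GaussianNB()),
              ('dt', DecisionTreeClassifier(random_state=0))]

hard = VotingClassifier(estimators, voting='hard').fit(X, y)  # majority of class votes
soft = VotingClassifier(estimators, voting='soft').fit(X, y)  # average of class probabilities
print(hard.predict(X[:1]), soft.predict_proba(X[:1]))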


Boosting
• Boosting is a sequential ensemble learning technique that converts weak base learners into strong learners that perform better and are less biased.
• Boosting is an iterative method that adjusts the weight of an observation based on the previous classification.
Types of boosting
➢ Adaptive boosting (AdaBoost)
▪ AdaBoost initially gives the same weight to each data point.
▪ Then, it automatically adjusts the weights of the data points after every decision tree.
▪ It gives more weight to incorrectly classified items so that they are corrected in the next round.
▪ It repeats until the residual error, or the difference between actual and predicted values, falls below an acceptable threshold.
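A minimal sketch of AdaBoost with scikit-learn's AdaBoostClassifier; the synthetic data and the n_estimators value are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data (assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learners are added sequentially; misclassified points get larger weights each round
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(ada.score(X_test, y_test))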


Gradient boosting
➢ Gradient boosting (GB) is similar to AdaBoost.
➢ GB does not give incorrectly classified items more weight.
➢ It optimizes a loss function by generating base learners sequentially, so that the current base learner is always more effective than the previous one.
➢ It attempts to generate accurate results from the start, rather than correcting errors throughout the process as AdaBoost does.
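A minimal sketch with scikit-learn's GradientBoostingClassifier; the data and hyperparameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data (assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each new tree is fit to the gradient of the loss of the current ensemble
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=0).fit(X_train, y_train)
print(gb.score(X_test, y_test))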


Extreme gradient boosting (XGBoost)
• Improves gradient boosting for computational speed and scale in several ways.
• XGBoost uses multiple cores on the CPU so that learning can occur in parallel during training.
• It is a boosting algorithm that can handle extensive datasets, making it attractive for big data applications.
• The key features of XGBoost are parallelization, distributed computing, and cache optimization.
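A minimal sketch with the xgboost package's scikit-learn-style XGBClassifier, assuming xgboost is installed; the data and parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data (assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs=-1 uses all CPU cores so that trees are built in parallel
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, n_jobs=-1).fit(X_train, y_train)
print(xgb.score(X_test, y_test))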
Benefits of boosting
➢ Ease of implementation
➢ Reduction of bias (presence of uncertainty)
➢ Computational efficiency
Stacking
A technique that combines multiple machine learning algorithms via meta-learning (one algorithm learns from the outputs of other learning algorithms).
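A minimal sketch of stacking with scikit-learn's StackingClassifier, where a meta-learner is trained on the base models' predictions; the choice of base models, meta-learner, and data are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data (assumption)

base_models = [('rf', RandomForestClassifier(random_state=0)),
               ('svc', SVC(probability=True, random_state=0))]

# the meta-learner (logistic regression) learns from the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression()).fit(X, y)
print(stack.score(X, y))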
Hyperparameter Tuning Using Grid Search and Random Search in Python
• Hyperparameter optimization is a technique that involves searching through a range of hyperparameter values to find the combination that achieves the best performance on a given dataset.
• Two popular techniques used to perform hyperparameter optimization are:
• Grid search
• Random search
Grid Search
➢ We first need to define a parameter space or parameter grid that includes a set of possible hyperparameter values that can be used to build the model.
➢ Grid search places these hyperparameters in a matrix-like structure, and the model is trained on every combination of hyperparameter values.
➢ The model with the best performance is then selected.
➢ All possible combinations of the selected values are evaluated. The main arguments (those of scikit-learn's GridSearchCV) are listed below, followed by a short sketch.
1. estimator – A scikit-learn model.
2. param_grid – A dictionary with parameter names as keys and lists of parameter values.
3. scoring – The performance measure. For example, ‘r2’ for regression models, and ‘precision’ for classification models.
4. cv – An integer that is the number of folds for K-fold cross-validation.
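A minimal GridSearchCV sketch; the random forest estimator, the parameter grid, and the scoring choice are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data (assumption)

param_grid = {'n_estimators': [100, 200],    # every combination of these values is tried
              'max_depth': [None, 5, 10]}

grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                    param_grid=param_grid,
                    scoring='precision',
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)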
Random Search
• Random search draws random samples from a grid (or from distributions) of hyperparameters instead of conducting an exhaustive search.
• You can specify the total number of runs the random search should try before returning the best model.
# rf and rs_space are assumed to be a random forest and its search space, for example:
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier()
rs_space = {'n_estimators': randint(50, 500), 'max_depth': randint(2, 20)}

rf_random = RandomizedSearchCV(rf, rs_space, n_iter=500, scoring='accuracy',
                               n_jobs=-1, cv=3)
model_random = rf_random.fit(X, y)   # X, y: your feature matrix and labels
