Module_11(c)
The definition of both “normal” and anomalous data significantly varies depending on the context.
Below are a few examples of anomaly detection in action.
1. Financial transactions
Outlier: A massive withdrawal from Ireland on the same account, hinting at potential fraud.
2. Cybersecurity
Outlier: An abrupt increase in data transfer or the use of unknown protocols, signaling a potential
breach or malware.
3. Healthcare
Outlier: A sudden increase in heart rate and decrease in blood pressure, indicating a potential
emergency or equipment failure.
Anomaly detection includes many types of unsupervised methods for identifying divergent samples.
Data specialists choose among them based on the type of anomaly and the context, structure, and
characteristics of the dataset at hand. We'll cover them in the coming sections.
Even though we saw some examples above, let’s look at a real-life story of how anomaly detection
works in finance.
Shaquille O'Neal, a four-time NBA champion, gets traded from the Miami Heat to the Phoenix Suns.
When Shaq arrives at the empty apartment provided by the Phoenix Suns, he wants to furnish it
immediately, in the middle of the night. So he goes to Walmart and makes the biggest purchase in
Walmart history: $70,000. Or at least, he tries to; his card gets declined.
He wonders what could possibly be the problem (he can't be broke!). At 2 a.m., American Express
security calls him and tells him that his card was flagged as stolen, because somebody appeared to
be making a $70,000 purchase at a Walmart in Phoenix.
There are many other real-world applications of anomaly detection beyond finance and fraud
detection:
Cybersecurity
Healthcare
Anomaly detection is deeply woven into the daily services we use, and often we don't even notice it.
Data is the most precious commodity in data science, and anomalies are the most disruptive threats
to its quality. Bad data quality means bad:
Statistical tests
Dashboards
Decisions
As almost all machine learning models depend heavily on the quality of their training data, timely
detection of anomalies is crucial.
Types of Anomalies
Anomaly detection encompasses two broad practices: outlier detection and novelty detection.
Outliers are abnormal or extreme data points that exist only in training data. In contrast, novelties
are new or previously unseen instances compared to the original (training) data.
For example, consider a dataset of daily temperatures in a city. Most days, the temperatures range
between 20°C and 30°C. However, one day, there’s a spike of 40°C. This extreme temperature is an
outlier as it significantly deviates from the usual daily temperature range.
Now, imagine that the city installs a new, more accurate weather monitoring station. As a result, the
dataset starts consistently recording slightly higher temperatures, ranging from 25°C to 35°C. This
sustained increase in temperatures is a novelty, representing a new pattern introduced by the
improved monitoring system.
Anomaly, on the other hand, is a broad term covering both outliers and novelties: it can describe
any abnormal instance in any context.
Identifying the type of anomalies is crucial as it allows you to choose the right algorithm to detect
them.
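In practice, the distinction shapes how a detector is used: an outlier detector is fit on data that already contains anomalies, while a novelty detector is fit on clean data and then judges previously unseen points. Below is a minimal sketch of the novelty case using scikit-learn's Local Outlier Factor on made-up temperature readings (the 20–30°C range and the 40°C spike mirror the example above; all numbers are hypothetical):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Training data: hypothetical daily temperatures, mostly between 20 and 30 °C.
train = rng.uniform(20, 30, size=(300, 1))

# Novelty detection: fit on clean training data, then judge unseen points.
detector = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train)

new_days = np.array([[25.0], [40.0]])  # a typical day and a 40 °C spike
labels = detector.predict(new_days)    # 1 = normal, -1 = novelty
print(labels)
```

With `novelty=True`, the detector refuses to score its own training data and only evaluates new samples, which is exactly the novelty-detection setting described above.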
Types of Outliers
Just as there are two types of anomalies, there are two types of outliers: univariate and
multivariate. Depending on the type, we will use different detection algorithms.
1. Univariate outliers exist in a single variable or feature in isolation. Univariate outliers are
extreme or abnormal values that deviate from the typical range of values for that specific
feature.
2. Multivariate outliers are found by combining the values of multiple variables at the same
time.
For example, consider a dataset of housing prices in a neighborhood. Most houses cost between
$200,000 and $400,000, but there is House A with an exceptionally high price of $1,000,000. When
we analyze only the price, House A is a clear outlier.
Now, let’s add two more variables to our dataset: the square footage and the number of bedrooms.
When we consider the square footage, the number of bedrooms, and the price together, it's House B
that looks odd.
When we look at these variables individually, they seem ordinary. Only when we put them together
do we find out that House B is a clear multivariate outlier.
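To make this concrete, here is a small sketch with made-up housing numbers (all values hypothetical). Each of House B's features falls inside the range observed elsewhere in the neighborhood, so no univariate rule flags it; only a derived, multivariate quantity such as price per square foot reveals the oddity:

```python
import numpy as np

# Hypothetical neighborhood: columns are [square_feet, bedrooms, price].
neighborhood = np.array([
    [2400, 3, 350_000],
    [2800, 4, 390_000],
    [2000, 2, 280_000],
    [3200, 4, 400_000],
    [2600, 3, 360_000],
    [1800, 2, 250_000],
])
house_b = np.array([3200, 2, 260_000])  # large house, few bedrooms, low price

# Feature by feature, House B sits inside the observed ranges...
in_range = (house_b >= neighborhood.min(axis=0)) & (house_b <= neighborhood.max(axis=0))
print(in_range)  # every univariate check passes

# ...yet its price per square foot is far below every other house.
ppsf = neighborhood[:, 2] / neighborhood[:, 0]
print("neighborhood min $/sqft:", ppsf.min())
print("House B $/sqft:", house_b[2] / house_b[0])
```

This is the core difficulty with multivariate outliers: no single column betrays them, so per-feature thresholds pass them through.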
Anomaly detection algorithms differ depending on the type of outliers and the structure in the
dataset.
1. Z-score (standard score): the z-score measures how many standard deviations a data point
is from the mean. Generally, instances with an absolute z-score over 3 are flagged as outliers.
2. Interquartile range (IQR): the IQR is the range between the first quartile (Q1) and the third
quartile (Q3) of a distribution. When an instance falls below Q1 or above Q3 by more than
some multiple of the IQR, it is considered an outlier. The most common multiplier is 1.5,
making the normal range [Q1 − 1.5 * IQR, Q3 + 1.5 * IQR]; anything outside it is an outlier.
3. Modified z-scores: similar to z-scores, but modified z-scores use the median and a measure
called the Median Absolute Deviation (MAD) to find outliers. Since the mean and standard
deviation are easily skewed by outliers, modified z-scores are generally considered more robust.
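The three statistical rules above can be sketched in a few lines of NumPy on made-up data. The cutoffs of 3, 1.5, and 3.5 are the common conventions mentioned above, not fixed laws, and the 0.6745 constant scales MAD to be comparable with the standard deviation for normal data:

```python
import numpy as np

# Hypothetical sensor readings clustered around 100, plus one injected outlier.
data = np.array([100, 102, 98, 101, 99, 103, 97, 100, 104, 96,
                 101, 99, 100, 102, 98, 103, 97, 101, 99, 100, 500.0])

# 1. Z-score: flag points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# 2. IQR: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# 3. Modified z-score: median and MAD instead of mean and std; 3.5 is a
#    commonly used cutoff.
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
mad_outliers = data[np.abs(modified_z) > 3.5]

print(z_outliers, iqr_outliers, mad_outliers)  # each flags only the 500.0
```

Note that with very small samples or many extreme values, the plain z-score can fail to flag an outlier at all, because the outlier itself inflates the mean and standard deviation; that is exactly the weakness the modified z-score addresses.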
For multivariate outliers, we generally use machine learning algorithms. Because they can model
many features jointly, they are able to find intricate patterns in complex datasets:
1. Isolation Forest: uses a collection of isolation trees (similar to decision trees) that recursively
divide complex datasets until each instance is isolated. The instances that get isolated the
quickest are considered outliers.
2. Local Outlier Factor (LOF): LOF measures the local density deviation of a sample compared
to its neighbours. Points with significantly lower density are chosen as outliers.
3. Clustering techniques: techniques such as k-means or hierarchical clustering divide the
dataset into groups. Points that don’t belong to any group or are in their own little clusters
are considered outliers.
4. Angle-based Outlier Detection (ABOD): ABOD looks at the angles a point forms with pairs of
other points. Points far from the bulk of the data see the rest of the dataset within a narrow,
low-variance range of angles, so they can be flagged as outliers.
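As a sketch of the first two methods, scikit-learn ships both as ready-made estimators. The housing-style data below is made up, with one multivariate outlier (a House-B-like point) planted at the last row; `contamination` tells each detector roughly what fraction of points to treat as outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Hypothetical housing data: columns are [square footage, price].
normal = rng.normal(loc=[2000, 300_000], scale=[300, 40_000], size=(200, 2))
# Planted outlier: each value is plausible alone, the combination is odd
# (small house, very high price).
house_b = np.array([[1200, 900_000]])
X = np.vstack([normal, house_b])

# Isolation Forest: points that random splits isolate quickest get label -1.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
iso_labels = iso.predict(X)  # 1 = inlier, -1 = outlier

# LOF: points in regions of much lower local density get label -1.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_labels = lof.fit_predict(X)

print("Isolation Forest flags rows:", np.where(iso_labels == -1)[0])
print("LOF flags rows:", np.where(lof_labels == -1)[0])
```

Both detectors flag the planted point at row 200; which one to prefer depends on the trade-offs discussed next.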
Apart from the type of anomalies, you should consider dataset characteristics and project
constraints. For example, Isolation Forest works well on almost any dataset, but it is slower and
more computation-heavy since it is an ensemble method. In comparison, LOF is very fast to train
but may not perform as well as Isolation Forest.