8 Data Mining Algorithms
An algorithm in data mining (or machine learning) is a set of heuristics and calculations that
creates a model from data.
To create a model, the algorithm first analyzes the data you provide, looking for specific types of
patterns or trends. The algorithm uses the results of this analysis over many iterations to find the
optimal parameters for creating the mining model. These parameters are then applied across the
entire data set to extract actionable patterns and detailed statistics.
The mining model that an algorithm creates from your data can take various forms, including:
A set of clusters that describe how the cases in a dataset are related.
A decision tree that predicts an outcome, and describes how different criteria affect that
outcome.
A mathematical model that forecasts sales.
A set of rules that describe how products are grouped together in a transaction, and the
probabilities that products are purchased together.
However, there is no reason that you should be limited to one algorithm in your solutions.
Experienced analysts will sometimes use one algorithm to determine the most effective inputs
(that is, variables), and then apply a different algorithm to predict a specific outcome based on
that data.
Support and confidence can be supplemented with another interestingness measure, correlation
analysis, which helps in mining interesting patterns. A correlation rule has the form:
A => B [support, confidence, correlation].
A correlation rule is measured by the support, the confidence, and the correlation between
itemsets A and B. Correlation can be measured by Lift and Chi-Square.
(i) Lift: As the name suggests, Lift represents the degree to which the presence of one itemset
lifts the occurrence of the other itemset.
The lift between the occurrence of A and B can be measured by:
Lift(A, B) = P(A U B) / (P(A) * P(B))
Here P(A U B) is the probability that a transaction contains both itemset A and itemset B.
If it is < 1, then A and B are negatively correlated: the occurrence of one makes the
occurrence of the other less likely.
If it is > 1, then A and B are positively correlated, which means that the occurrence of one
implies the occurrence of the other.
If it is = 1, then A and B are independent and there is no correlation between them.
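A minimal sketch of the lift calculation, assuming transactions are represented as Python sets of item names. The function name `lift` and the sample baskets are invented for illustration and are not part of the original notes.

```python
def lift(transactions, a, b):
    """Lift of itemsets a and b over a list of transactions (sets of items).

    Assumes both a and b occur in at least one transaction; otherwise
    the denominator is zero.
    """
    a, b = set(a), set(b)
    n = len(transactions)
    p_a = sum(a <= t for t in transactions) / n          # P(A)
    p_b = sum(b <= t for t in transactions) / n          # P(B)
    p_ab = sum((a | b) <= t for t in transactions) / n   # P(A U B): both present
    return p_ab / (p_a * p_b)


# Hypothetical market baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

value = lift(transactions, {"bread"}, {"milk"})
print(round(value, 3))  # below 1, so bread and milk are negatively correlated here
```

In this toy data bread and milk each appear in 3 of 4 baskets but together in only 2, so the lift is 0.5 / (0.75 * 0.75), which is less than 1.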
(ii) Chi-Square: This is another correlation measure. For each slot (A and B pair) of the
contingency table, it takes the squared difference between the observed and the expected value,
divided by the expected value, and sums these terms over all slots.
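The chi-square statistic described above can be sketched as follows. This assumes a contingency table given as a list of rows (A present / A absent) by columns (B present / B absent); the function name `chi_square` and the counts are illustrative only.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = A present / absent, columns = B present / absent
table = [
    [4000, 2000],
    [3500, 500],
]

stat = chi_square(table)
print(round(stat, 1))
```

A large statistic indicates that A and B are not independent. Here the observed count for "A and B together" (4000) is below the expected count (4500), which suggests a negative correlation.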
5. Bayes Classification
Bayesian Classification is another method of Classification Analysis. Bayes Classifiers predict
the probability of a given tuple to belong to a particular class. It is based on the Bayes theorem,
which is based on probability and decision theory.
Bayes Classification uses posterior probability and prior probability in the decision-making
process. The posterior probability, P(H|X), is the probability of a hypothesis H given the
observed information X, i.e. when the attribute values are known. The prior probability, P(H),
is the probability of the hypothesis regardless of the attribute values. Bayes theorem relates
the two: P(H|X) = P(X|H) * P(H) / P(X).
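As a rough illustration of how priors and posteriors combine, the sketch below applies Bayes theorem to two classes. The class names, the probabilities, and the helper `posteriors` are assumptions made up for this example.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes theorem to each class.

    priors:      {class: P(H)}    probability of each class before seeing X
    likelihoods: {class: P(X|H)}  probability of the observed attributes per class
    returns:     {class: P(H|X)}
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(X), by the law of total probability
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers for a tuple X with known attribute values
priors = {"buys": 0.6, "does_not_buy": 0.4}
likelihoods = {"buys": 0.2, "does_not_buy": 0.05}

result = posteriors(priors, likelihoods)
predicted = max(result, key=result.get)  # class with the highest posterior
print(predicted, round(result[predicted], 3))
```

The classifier predicts the class with the highest posterior probability. A naive Bayes classifier would additionally compute P(X|H) as a product of per-attribute probabilities, which is not shown here.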
6. Clustering Analysis
Clustering analysis is a technique for partitioning a set of data into clusters, or groups of
objects, using a clustering algorithm. It is a type of unsupervised learning because the label
information is not known. Clustering methods group objects that are similar to each other and
separate those that are different, after which the characteristics of each cluster can be
analyzed.
Cluster analysis can be used as a pre-step for applying various other algorithms such as
characterization, attribute subset selection, etc. Cluster analysis can also be used for outlier
detection, such as spotting unusually high purchases in credit card transactions.
Applications: Image recognition, web search, and security.
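The notes do not name a specific clustering algorithm. As one possible illustration, here is a minimal k-means sketch in plain Python; the function name `kmeans`, the naive initialization (first k points as starting centroids), and the sample points are all assumptions.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means: returns (centroids, clusters) for tuples of floats."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: put each point in the cluster of its nearest centroid
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Hypothetical 2-D data with two visible groups
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
for members in clusters:
    print(members)
```

No labels are supplied; the grouping emerges only from distances between points, which is what makes the method unsupervised. The result of k-means depends on the initialization, so practical implementations usually choose starting centroids more carefully.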
7. Outlier Detection
The process of finding data objects whose behavior differs markedly from that of the other
objects is called outlier detection. Outlier detection and cluster analysis are related to each
other. Outlier detection methods are categorized into statistical, proximity-based,
clustering-based and classification-based methods.
There are different types of outliers. Some of them are:
Global Outlier: A data object that deviates significantly from the rest of the data set.
Contextual Outlier: A data object that deviates significantly with reference to a specific
context, such as the day, time, or location.
Collective Outlier: A group of data objects that, taken together, behaves differently from the
entire data set.
Application: Detection of credit card fraud risks, novelty detection, etc.
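A simple statistical method for finding global outliers is the z-score test, sketched below. The threshold of 2.0, the function name `global_outliers`, and the purchase amounts are assumptions chosen for this small example, not values from the notes.

```python
import statistics


def global_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical credit card purchase amounts
purchases = [20, 25, 22, 19, 24, 21, 23, 500]
flagged = global_outliers(purchases)
print(flagged)
```

With only a few data points a single extreme value inflates the standard deviation, so the usual threshold of 3 would miss it here; robust alternatives based on the median are often preferred for small samples.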
8. Sequential Patterns
This type of data mining recognizes trends or consistent patterns in events that occur in
sequence. Stores use their understanding of customer purchase behavior and sequential patterns
to decide how to display their products on shelves.
Application: An e-commerce site that, when you buy item A, shows that item B is often bought
with item A, based on past purchasing history.
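One very simplified way to look for sequential patterns is to count how often one item is bought before another across customers' purchase histories. The sketch below does this for ordered pairs only; the function name `sequential_pair_support` and the purchase histories are invented for illustration, and real sequential pattern mining algorithms handle longer patterns.

```python
from collections import Counter


def sequential_pair_support(sequences):
    """Fraction of sequences in which item X appears before item Y, per (X, Y)."""
    counts = Counter()
    for seq in sequences:
        seen_pairs = set()  # count each ordered pair at most once per sequence
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical purchase histories, one list per customer, in time order
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
    ["phone", "case"],
]

support = sequential_pair_support(sequences)
print(support[("phone", "case")], support[("case", "phone")])
```

Unlike a plain association rule, the order matters: in this toy data "phone then case" is more frequent than "case then phone".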
9. Regression Analysis
This type of analysis is supervised and identifies which variables are related to, or
independent of, each other. It can be used to predict sales, profit, or temperature, to
forecast human behavior, etc. It is trained on a data set in which the value to be predicted
is already known.
When an input is provided, the regression algorithm compares its predicted value with the
known expected value, and the resulting error is used to adjust the model toward a more
accurate result.
Application: Comparing the effectiveness of marketing and product development efforts.
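As an illustration of fitting a model to known values and measuring the error, here is a minimal least-squares sketch for a straight line with one input variable. The function names and the advertising/sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between predicted and known values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data: advertising spend vs. known sales
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
mse = mean_squared_error(ad_spend, sales, slope, intercept)
predicted_sales = slope * 6.0 + intercept  # forecast for an unseen input
print(round(slope, 2), round(intercept, 2), round(mse, 4))
```

The known sales values play the role of the expected values: the fit chooses the slope and intercept that minimize the squared error against them, and the fitted line is then used to predict the outcome for new inputs.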
Assignment