Top 10 Data Mining Algorithms

rayli.
net
Learn. Teach. Grow.
Top 10 data mining algorithms in

plain English
Today, Im going to explain in plain English the top 10 most influential data
miningalgorithms as voted on by 3 separate panels in this survey paper.
Once you know what they are, how they work, what they do and where you can find
them, my hope is youll have this blog post as a springboard to learn even more
about data mining.
What are we waiting for? Lets get started!
Contents [hide]
1. C4.5
2. k-means
3. Support vector machines
4. Apriori
5. EM
6. PageRank
7. AdaBoost
8. kNN
9. Naive Bayes
10. CART
Interesting Resources
Now its your turn
Update 16-May-2015: Thanks to Yuval Merhav and Oliver Keyes for their
suggestions which Ive incorporated into the post.
Update 28-May-2015: Thanks to DanSteinberg (yes, the CART expert!) for the
suggested updates to the CART section which have now been added.
1. C4.5
What does it do?C4.5 constructs a classifier in the form of a decision tree.In order
to do this, C4.5 is given a set of data representing things that are already classified.
Wait, whats a classifier?A classifier is a tool in data mining that takes a bunch of
data representing things we want to classify and attempts to predict which class the
new data belongs to.
Whats an example of this?Sure, suppose a dataset contains a bunch of patients.

We know various things about each patient like age, pulse, blood pressure, VO2max,
family history, etc. These are called attributes.
Now:
Given these attributes, we want to predict whether the patient will get cancer. The
patient can fall into 1 of 2 classes: will get cancer or wont get cancer.C4.5 is told the
class for each patient.
And heres the deal:
Using a set of patient attributes and the patients corresponding class, C4.5
constructs a decision tree that can predict the class for new patients based on their
attributes.
Cool, so whats a decision tree?Decision tree learning creates something similar
to a flowchart to classify new data. Using the same patient example, one particular
path in the flowchart could be:
Patient has a history of cancer

Patient is expressing a gene highly correlated with cancer patients
Patient has tumors
Patients tumor size is greater than 5cm
The bottomline is:
At each point in the flowchart is a question about the value of some attribute, and
depending on those values, he or she gets classified. You can find lots of examples
of decision trees.
Is this supervised or unsupervised?This is supervised learning, since the training

dataset is labeled with classes. Using the patient example, C4.5 doesnt learn on its
own that a patient will get cancer or wont get cancer. We told it first, it generated a
decision tree, and now it uses the decision tree to classify.
You might be wondering how C4.5 is different than other decision tree
systems?
First, C4.5 uses information gain when generating the decision tree.
Second, although other systems also incorporate pruning, C4.5 uses a single-
pass pruning process to mitigate over-fitting. Pruning results in many
improvements.
Third, C4.5 can work with both continuous and discrete data. My
understanding is it does this by specifying ranges or thresholds for
continuous data thus turning continuous data into discrete data.
Finally, incomplete data is dealt with inits own ways.
Why use C4.5?Arguably, the best selling point of decision trees is their ease of
interpretation and explanation. They are also quite fast, quite popular and the
output is human readable.
Where is it used? A popular open-source Java implementation can be found over

atOpenTox.Orange, an open-source data visualization and analysistool for data
mining,implements C4.5 in their decision tree classifier.
Checkout how I used C5.0 (latest version of C4.5)
Classifiers are great, but make sure to checkout the next algorithm about
clustering
2. k-means
What does it do?k-means creates groups from a set of objects so that the
members of a group are more similar. Its a popular cluster analysis technique for
exploring a dataset.
Hang on, whats cluster analysis?Cluster analysis is a family of algorithms

designed to form groups such that the group members are more similar versus
non-group members. Clusters and groups are synonymous in the world of cluster
analysis.
Is there an example of this?Definitely, suppose we have a dataset of patients. In

cluster analysis, these would be called observations. We know various things about
each patient like age, pulse, blood pressure, VO2max, cholesterol, etc. This is a
vector representing the patient.
Look:
You can basically think of a vector as a list of numbers we know about the patient.
This list can also be interpreted as coordinates in multi-dimensional space. Pulse
can be one dimension, blood pressure another dimension and so forth.
You might be wondering:
Given this set of vectors, how do we cluster together patients that have similar age,
pulse, blood pressure, etc?
Want to know the best part?
You tell k-means how many clusters you want. K-means takes care of the rest.
How does k-means take care of the rest?k-means has lots of variations to
optimize for certain types of data.
At a high level, they all do something like this:

1. k-means picks points in multi-dimensional space to represent each of the k
clusters. These are called centroids.
2. Every patient will be closest to 1 of these k centroids. They hopefully wont
allbe closest to the same one, so theyll form a cluster around their nearest
centroid.
3. What we have are k clusters, and each patient is now a member of a cluster.
4. k-means then finds the center for each of the k clusters based on its cluster
members (yep, using the patient vectors!).
5. This center becomes the new centroid for the cluster.
6. Since the centroid is in a different place now, patients might now be closer to
other centroids. In other words, they may change cluster membership.
7. Steps 2-6 are repeated until the centroids no longer change, and the cluster
memberships stabilize. This is called convergence.
Is this supervised or unsupervised?It depends, but most would classify k-means

as unsupervised. Other than specifying the number of clusters, k-means learns
the clusters on its own without any information about which cluster an observation
belongs to. k-means can besemi-supervised.
Why use k-means?I dont think many will have an issue with this:
The key selling point of k-means is its simplicity. Its simplicity means its generally
faster and more efficient than other algorithms, especially over large datasets.
It gets better:
k-means can be used topre-clustera massive dataset followed by a more

expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly
play with k and explore whether there are overlooked patterns or relationships in
the dataset.
Its not all smooth sailing:
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to
the initial choice of centroids. One final thing to keep in mind is k-means is designed
to operate on continuous data youll need to do some tricksto get it to work on
discrete data.
Where is it used?A ton of implementations for k-means clustering are available

online:
Apache Mahout
Julia
R
SciPy
Weka
MATLAB
SAS
Checkout how I used k-means
If decision trees and clustering didnt impress you, youre going to love the next
algorithm
3. Support vector machines

What does it do?Support vector machine (SVM) learns a hyperplane to classify
data into 2 classes. At a high-level, SVM performs a similar task like C4.5 except SVM
doesnt use decision trees at all.
Whoa, a hyper-what?A hyperplane is a function like the equation for a line,

. In fact, for a simple classification task with just 2 features, the
hyperplane can be a line.
As it turns out
SVM can perform a trick to project your data into higher dimensions. Once
projected into higher dimensions
SVM figures out the best hyperplane which separates your data into the 2 classes.
Do you have an example?Absolutely, the simplest example I found starts with a

bunch of red and blue balls on a table. If the balls arent too mixed together, you
could take a stick and without moving the balls, separate them with the stick.
You see:
When a new ball is added on the table, by knowing which side of the stick the ball is
on, you can predict its color.
What do the balls, table and stick represent? The balls represent data points,
and the red and blue color represent 2 classes. The stick represents the hyperplane
which in this case is a line.
And the coolest part?
SVM figures out the function for the hyperplane.
What if things get more complicated?Right, they frequently do. If the balls are
mixed together, a straight stick wont work.
Heres the work-around:
Quickly lift up the table throwing the balls in the air. While the balls are in the air
and thrown up in just the right way, you use a large sheet of paper to divide the
balls in the air.
You might be wondering if this is cheating:
Nope, lifting up the table is the equivalent of mapping your data into higher
dimensions. In this case, we go from the 2 dimensional table surface to the 3
dimensional balls in the air.
How does SVM do this? By using a kernel we have a nice way to operate in higher
dimensions. The large sheet of paper is still called a hyperplane, but it is now a
function for a plane rather than a line. Note from Yuval that once were in 3
dimensions, the hyperplane must be a plane rather than a line.
I found this visualization super helpful:

Reddit also has 2 great threads on this in theELI5andMLsubreddits.
How do balls on a table or in the air map to real-life data?A ball on a table has
a location that we can specify using coordinates. For example, a ball could be 20cm
from the left edge and 50cm from the bottom edge. Another way to describe the
ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.
Heres the deal:
If we had a patient dataset, each patient could be described by various

measurements like pulse, cholesterol level, blood pressure, etc. Each of these
measurements is a dimension.
The bottomline is:
SVM does its thing, maps them into a higher dimension and then finds the
hyperplane to separate the classes.
Margins are often associated with SVM? What are they?The margin is the
distance between the hyperplane and the 2 closest data points from each
respective class. In the ball and table example, the distance between the stick and
the closest red and blue ball is the margin.
The key is:
SVM attempts to maximize the margin, so that the hyperplane is just as far away
from red ball as the blue ball. In this way, it decreases the chance of
misclassification.
Where does SVM get its name from?Using the ball and table example, the
hyperplane is equidistant from a red ball and a blue ball. These balls or data points
are called support vectors, because they support the hyperplane.
Is this supervised or unsupervised?This is a supervised learning, since a dataset

is used to first teach the SVM about the classes. Only then is the SVM capable of
classifying new data.
Why use SVM?SVM along with C4.5 are generally the2 classifiers to try first. No
classifier will be the best in all cases due to theNo Free Lunch Theorem. In
addition, kernel selection and interpretability are some weaknesses.
Where is it used?There are many implementations of SVM. A few of the popular

ones arescikit-learn,MATLABand of courselibsvm.
Checkout how I used SVM
The next algorithm is one of my favorites
4. Apriori
What does it do?The Apriori algorithm learns association rules and is applied to a
database containing a large number of transactions.
What are association rules?Association rule learning is a data mining technique

for learning correlations and relations among variables in a database.
Whats an example of Apriori?Lets say we have a database full of supermarket

transactions. You can think of a database as a giant spreadsheet where each row is
a customer transaction and every column represents a different grocery item.
Heres the best part:
By applying the Apriori algorithm, we can learn the grocery items that are
purchased together a.k.a association rules.
The power of this is:
You can find those items that tend to be purchased together more frequently than
other items the ultimate goal being to get shoppers to buy more. Together, these
items are called itemsets.
For example:
You can probably quickly see that chips + dip and chips + soda seem to frequently
occur together. These are called 2-itemsets. With a large enough dataset, it will be
much harder to see the relationships especially when youre dealing with
3-itemsets or more. Thats precisely what Apriori helps with!
You might be wondering how Apriori works?Before getting into the nitty gritty of
algorithm, youll need to define 3 things:
1. The first is thesizeof your itemset. Do you want to see patterns for a
2-itemset, 3-itemset, etc.?
2. The second is yoursupportor the number of transactions containing the
itemset divided by the total number of transactions. An itemset that meets
the support is called a frequent itemset.
3. The third is yourconfidenceor the conditional probability of some item
given you have certain other items in your itemset. A good example is given
chips in your itemset, there is a 67% confidence of having soda also in the
itemset.
The basic Apriori algorithm is a 3 step approach:
1. Join.Scan the whole database for how frequent 1-itemsets are.

2. Prune.Those itemsets that satisfy thesupportandconfidencemove onto
the next round for 2-itemsets.
3. Repeat.This is repeated for each itemset level until we reach our previously
definedsize.
Is this supervised or unsupervised?Apriori is generally considered an

unsupervised learning approach, since its often used to discover or mine for
interesting patterns and relationships.
But wait, theres more
Apriori can also be modified todo classificationbased on labelled data.
Why use Apriori?Apriori is well understood, easy to implement and hasmany

derivatives.
On the other hand

The algorithm can be quite memory, space and time intensive when generating
itemsets.
Where is it used?Plenty of implementations of Apriori are available. Some popular

ones are theARtool,Weka, andOrange.
Checkout how I used Apriori
The next algorithm was the most difficult for me to understand, look at the next
algorithm
5. EM
What does it do?In data mining, expectation-maximization (EM) is generally used
as a clustering algorithm (like k-means) for knowledge discovery.
In statistics, the EM algorithmiterates and optimizes the likelihood of seeing

observed data while estimating the parameters of a statistical model with
unobserved variables.
OK, hang on while I explain
Im not a statistician, so hopefully my simplification is both correct and helps with

understanding.
Here are a few concepts that will make this way easier
Whats a statistical model?I see amodelas something that describes how

observed data is generated. For example, the grades for an exam could fit a bell
curve, so the assumption that the grades are generated via a bell curve (a.k.a.
normal distribution) is the model.
Wait, whats a distribution?A distribution represents the probabilities for all

measurable outcomes.For example, the grades for an exam could fit a normal
distribution. This normal distribution represents all the probabilities of a grade.
In other words, given a grade, you can use the distribution to determine how many
exam takers are expected to get that grade.
Cool, what are the parameters of a model?Aparameterdescribes a
distribution which is part of a model. For example, a bell curve can be described by
itsmeanandvariance.
Using the exam scenario, the distribution of grades on an exam (the measurable
outcomes) followed a bell curve (this is the distribution). The mean was85and the
variance was100.
So, all you need to describe a normal distribution are 2 parameters:
1. The mean
2. The variance
And likelihood?Going back to our previous bell curve example suppose we have
a bunch of grades and are told the grades follow a bell curve. However, were not
given all the grades only a sample.
Heres the deal:
We dont know the mean or variance of all the grades, but we can estimate them
using the sample. The likelihood is the probability that the bell curve with estimated
mean and variance results in those bunch of grades.
In other words, given a set of measurable outcomes, lets estimate the parameters.
Using these estimated parameters, the hypothetical probability of the outcomes is
called likelihood.
Remember, its the hypothetical probability of theexistinggrades, not the

probability of afuture grade.
Youre probably wondering, whatsprobabilitythen?
Using the bell curve example, suppose we know the mean and variance. Then were
told the grades follow a bell curve. The chance that we observe certain grades and
how often they are observed is the probability.
In more general terms, given the parameters, lets estimate what outcomes should
be observed. Thats what probability does for us.
Great! Now, whats the difference between observed and unobserved
data?Observed data is the data that you saw or recorded. Unobserved data is data
that is missing. There a number of reasons that the data could be missing (not
recorded, ignored, etc.).
Heres the kicker:
For data mining and clustering, whats important to us is looking at the class of a
data point as missing data. We dont know the class, so interpreting missing data
this way is crucial for applying EM to the task of clustering.
Once again:The EM algorithmiterates and optimizes the likelihood of seeing

observed data while estimating the parameters of a statistical model with
unobserved variables. Hopefully, this is way more understandable now.
The best part is
By optimizing the likelihood, EM generates an awesome model that assigns class

labels to data points sounds like clustering to me!
How does EM help with clustering?EM begins by making a guess at the model
parameters.
Then it follows an iterative 3-step process:
1. E-step:Based on the model parameters, it calculates theprobabilities for

assignments of each data point to a cluster.
2. M-step:Update the model parameters based on the cluster assignments
from the E-step.
3. Repeat until the model parameters and cluster assignments stabilize (a.k.a.
convergence).
Is this supervised or unsupervised?Since we do not provide labeled class

information, this is unsupervised learning.
Why use EM?A key selling point of EMis its simple and straight-forward to
implement. In addition, not only can it optimize for model parameters, it can also
iteratively make guesses about missing data.
This makes it great for clustering and generating a model with parameters. Knowing
the clusters and model parameters, its possible to reason about what the clusters
have in common and which cluster new data belongs to.
EM is not without weaknesses though
First, EM is fast in the early iterations, but slow in the later iterations.
Second, EM doesnt always find the optimal parameters and gets stuck in
local optima rather than global optima.
Where is it used?The EM algorithm is available inWeka. R has an implementation

in themclust package. scikit-learn also has an implementation in itsgmm module.
Checkout how I used EM
What data mining does Google do? Take a look
6. PageRank
What does it do?PageRankis a link analysis algorithm designed to determine the
relative importance of some object linked within a network of objects.
Yikes.. whats link analysis?Its a type of network analysis looking to explore the
associations (a.k.a. links) among objects.
Heres an example:The most prevalent example of PageRank is Googles search

engine. Although their search engine doesnt solely rely on PageRank, its one of the
measures Google uses to determine a web pages importance.
Let me explain:
Web pages on the World Wide Web link to each other. If rayli.net links to a web
page on CNN, a vote is added for the CNN page indicating rayli.net finds the CNN
web page relevant.
And it doesnt stop there
rayli.nets votes are in turn weighted by rayli.nets importance and relevance. In

other words, any web page thats voted for rayli.net increases rayli.nets relevance.
The bottom line?
This concept of voting and relevance is PageRank. rayli.nets vote for CNN increases
CNNs PageRank, and the strength of rayli.nets PageRank influences how much its
vote affects CNNs PageRank.
What does a PageRank of 0, 1, 2, 3, etc. mean?Although the precise meaning of a

PageRank number isnt disclosed by Google, we can get a sense of its relative
meaning.
And heres how:
You see?
Its a bit like a popularity contest. We all have a sense of which websites are relevant
and popular in our minds. PageRank is just an uber elegant way to define it.
What other applications are there of PageRank?PageRank was specifically

designed for the World Wide Web.
Think about it:
At its core, PageRank is really just a super effective way to do link analysis.The
objects being linked dont have to be web pages.
Here are 3 innovative applications of PageRank:
1. Dr Stefano Allesina, from theUniversity of Chicago,applied PageRank to

ecologytodetermine which species are critical for sustaining ecosystems.
2. Twitter developedWTF(Who-to-Follow) which is a personalized PageRank
recommendation engine about who to follow.
3. Bin Jiang, fromThe Hong Kong Polytechnic University, used avariant of
PageRank to predict human movement ratesbased on topographical
metrics in London.
Is this supervised or unsupervised?PageRank is generally considered an

unsupervised learning approach, since its often used to discover the importance or
relevance of a web page.
Why use PageRank?Arguably, the main selling point of PageRank is its robustness
due to the difficulty of getting a relevant incoming link.
Simply stated:
If you have a graph or network and want to understand relative importance,

priority, ranking or relevance, give PageRank a try.
Where is it used?The PageRank trademark is owned by Google. However, the

PageRank algorithm is actuallypatented by Stanford University.
You might be wondering if you can use PageRank:
Im not a lawyer, so best to check with an actual lawyer, but you can probably use
the algorithm as long as it doesnt commercially compete against Google/Stanford.
Here are 3 implementations of PageRank:
1. C++ OpenSource PageRank Implementation

2. Python PageRank Implementation
3. igraph The network analysis package (R)
Checkout how I used PageRank
With our powers combined, we are
7. AdaBoost
What does it do?AdaBoost is a boosting algorithm which constructs a classifier.
As you probably remember, a classifier takes a bunch of data and attempts to

predict or classify which class a new data element belongs to.
But whats boosting?Boosting is an ensemble learning algorithm which takes
multiple learning algorithms (e.g. decision trees) and combines them. The goal is to
take an ensembleor group of weak learners and combine them to create a single
strong learner.
Whats the difference between a strong and weak learner?A weak learner
classifies with accuracy barely above chance. A popular example of a weak learner
is the decision stump which is a one-level decision tree.
Alternatively
A strong learner has much higher accuracy, and an often used example of a strong
learner is SVM.
Whats an example of AdaBoost?Lets start with 3 weak learners. Were going to

train them in 10 rounds on a training dataset containing patient data. The dataset
contains details about the patients medical records.
The question is
How can we predict whether the patient will get cancer?
Heres how AdaBoost answers the question
In round 1:AdaBoost takes a sample of the training dataset and tests to see how
accurate each learner is. The end result is we find the best learner.
In addition, samples that are misclassified are given a heavier weight, so that they
have a higher chance of being picked in the next round.
One more thing, the best learner is also given a weight depending on its accuracy
and incorporated into the ensemble of learners (right now theres just 1 learner).
In round 2:AdaBoost again attempts to look for the best learner.
And heres the kicker:
The sample of patient training data is now influenced by the more heavily
misclassified weights. In other words, previously misclassified patients have a
higher chance of showing up in the sample.
Why?
Its like getting to the second level of a video game and not having to start all over
again when your character is killed. Instead, you start at level 2 and focus all your
efforts on getting to level 3.
Likewise, the first learner likely classified some patients correctly. Instead of trying
to classify them again, lets focus all the efforts on getting the misclassified patients.
The best learner is again weighted and incorporated into the ensemble,
misclassified patients are weighted so they have a higher chance of being picked
and we rinse and repeat.
At the end of the 10 rounds:Were left with an ensemble of weighted learners

trained and then repeatedly retrained on misclassifieddata from the previous
rounds.
Is this supervised or unsupervised?This is supervised learning, since each

iteration trains the weaker learners with the labelled dataset.
Why use AdaBoost?AdaBoost is simple. The algorithm is relatively straight-

forward to program.
In addition, its fast!Weak learners are generally simpler than strong learners. Being
simpler means theyll likely execute faster.
Another thing
Its a super elegant way to auto-tune a classifier, since each successive AdaBoost
round refines the weights for each of the best learners. All you need to specify is
the number of rounds.
Finally, its flexible and versatile. AdaBoost can incorporate any learning algorithm,
and it can work with a large variety of data.
Where is it used?AdaBoost has a ton of implementations and variants. Here are a

few:
scikit-learn
ICSIBoost
gbm: Generalized Boosted Regression Models
Checkout how I used AdaBoost
If you like Mr. Rogers, youll like the next algorithm
8. kNN
What does it do?kNN, or k-Nearest Neighbors, is a classification algorithm.
However, it differs from the classifiers previously described because its a lazy
learner.
Whats a lazy learner?A lazy learner doesnt do much during the training process
other than store the training data. Only when new unlabeled data is input does this
type of learner look to classify.
On the other hand, an eager learner builds a classification model during training.
When new unlabeled data is input, this type of learner feeds the data into the
classification model.
How does C4.5, SVM and AdaBoost fit into this?Unlike kNN, they are all eager
learners.
Heres why:
1. C4.5 builds a decision tree classification model during training.

2. SVM builds a hyperplane classification model during training.
3. AdaBoost builds an ensemble classification model during training.
So what does kNN do?kNN builds no such classification model. Instead, it just
stores the labeled training data.
When new unlabeled data comes in, kNN operates in 2 basic steps:
1. First, it looks at the closest labeled training data points in other words,
the k-nearest neighbors.
2. Second, using the neighbors classes, kNN gets a better idea of how the new
data should be classified.
You might be wondering

How does kNN figure out whats closer?For continuous data, kNN uses a
distance metric like Euclidean distance. The choice of distance metric largely
depends on the data. Some even suggest learning a distance metric based on the
training data. Theres tons more details and paperson kNN distance metrics.
For discrete data, the idea is transform discrete data into continuous data. 2
examples of this are:
1. UsingHamming distanceas a metric for the closeness of 2 text strings.

2. Transforming discrete data into binary features.
These 2 Stack Overflow threads have some more suggestions on dealing with
discrete data:
KNN classification with categorical data

Using k-NN in R with categorical values
How does kNN classify new data when neighbors disagree?kNN has an easy
time when all neighbors are the same class. The intuition is if all the neighbors
agree, then the new data point likely falls in the same class.
Ill bet you can guess where things get hairy
How does kNN decide the class when neighbors dont have the same class?
2 common techniques for dealing with this are:
1. Take a simple majority vote from the neighbors. Whichever class has the
greatest number of votes becomes the class for the new data point.
2. Take a similar vote except give a heavier weight to those neighbors that are
closer. A simple way to do this is to use reciprocal distance e.g. if the
neighbor is 5 units away, then weight its vote 1/5. As the neighbor gets
further away, the reciprocal distance gets smaller and smaller exactly what
we want!
Is this supervised or unsupervised?This is supervised learning, since kNN is

provided a labeled training dataset.
Why use kNN?Ease of understanding and implementing are 2 of the key reasons
to use kNN. Depending on the distance metric, kNN can be quite accurate.
But thats just part of the story
Here are 5 things to watch out for:
1. kNN can get very computationally expensive when trying to determine the
nearest neighbors on a large dataset.
2. Noisy data can throw off kNN classifications.
3. Features with a larger range of values can dominate the distance metric
relative to features that have a smaller range, so feature scaling is important.
4. Since data processing is deferred, kNN generally requires greater storage
requirements than eager classifiers.
5. Selecting a good distance metric is crucial to kNNs accuracy.
Where is it used?A number of kNN implementations exist:
MATLAB k-nearest neighbor classification

scikit-learn KNeighborsClassifier
k-Nearest Neighbour Classification in R
Checkout how I used kNN
Spam? Fuhgeddaboudit! Read ahead to learn about the next algorithm
9. Naive Bayes
What does it do?Naive Bayes is not a single algorithm, but a family of classification
algorithms that share one common assumption:
Every feature of the data being classified is independent of all other features given
the class.
What does independent mean?2 features are independent when the value of one
feature has no effect on the value of another feature.
For example:
Lets say you have a patient dataset containing features like pulse, cholesterol level,
weight, height and zip code. All features would be independent if the value of all
features have no effect on each other. For this dataset, its reasonable to assume
that the patients height and zip code are independent, since a patients height has
little to do with their zip code.
But lets not stop there, are the other features independent?
Sadly, the answer is no. Here are 3 feature relationships which are not
independent:
If height increases, weight likely increases.

If cholesterol level increases, weight likely increases.
If cholesterol level increases, pulse likely increases as well.
In my experience, the features of a dataset are generally not all independent.
And that ties in with the next question
Why is it called naive?The assumption that all features of a dataset are

independent is precisely why its called naive its generally not the case that all
features are independent.
Whats Bayes?Thomas Bayes was an English statistician for which Bayes Theorem
is named after. You can click on the link to find about more aboutBayes Theorem.
In a nutshell, the theorem allows us to predict the class given a set of features using
probability.
The simplified equation for classification looks something like this:
Lets dig deeper into this
What does the equation mean?The equation finds the probability of Class A
given Features 1 and 2. In other words, if you see Features 1 and 2, this is the
probability the data is Class A.
The equation reads: The probability of Class A given Features 1 and 2 is a fraction.
The fractions numerator is the probability of Feature 1 given Class A
multiplied by the probability of Feature 2 given Class A multiplied by the
probability of Class A.
The fractions denominator is the probability of Feature 1 multiplied by the
probability of Feature 2.
What is an example of Naive Bayes?Below is a great example taken from aStack

Overflow thread (Rams answer).
Heres the deal:
We have a training dataset of 1,000 fruits.

The fruit can be a Banana, Orange or Other (these are the classes).
The fruit can be Long, Sweet or Yellow (these are the features).
What do you see in this training dataset?
Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.
If we are given the length, sweetness and color of a fruit (without knowing its class),
we can now calculate the probability of it being a banana, orange or other fruit.
Suppose we are told the unknown fruit islong, sweetandyellow.
Heres how we calculate all the probabilities in 4 steps:
Step 1:To calculate the probability the fruit is a banana, lets first recognize that this
looks familiar. Its the probability of the class Banana given the features Long, Sweet
and Yellow or more succinctly:
This is exactly like the equation discussed earlier.

Step 2:Starting with the numerator, lets plug everything in.
Multiplying everything together (as in the equation), we get:
Step 3:Ignore the denominator, since itll be the same for all the other calculations.
Step 4:Do a similar calculation for the other classes:
Since the is greater than , Naive Bayes would classify this long, sweet
and yellow fruit as a banana.
Is this supervised or unsupervised?This is supervised learning, since Naive Bayes

is provided a labeled training dataset in order to construct the tables.
Why use Naive Bayes?As you could see in the example above, Naive Bayes
involves simple arithmetic. Its just tallying up counts, multiplying and dividing.
Once the frequency tables are calculated, classifying an unknown fruit just involves
calculating the probabilities for all the classes, and then choosing the highest
probability.
Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, its
been found to be effective for spam filtering.
Where is it used?Implementations of Naive Bayes can be found inOrange,scikit-

learn,WekaandR.
Check out how I used Naive Bayes
Finally, check out the 10th algorithm

10. CART
What does it do?CART stands for classification and regression trees.It is a
decision tree learning technique that outputs either classification or regression
trees. Like C4.5, CART is a classifier.
Is a classification tree like a decision tree?A classification tree is a type of

decision tree. The output of a classification tree is a class.
For example, given a patient dataset, you might attempt to predict whether the
patient will get cancer. The class would either be will get cancer or wont get
cancer.
Whats a regression tree?Unlike a classification tree which predicts a class,

regression trees predict a numeric or continuous value e.g. a patients length of stay
or the price of a smartphone.
Heres an easy way to remember
Classification trees output classes, regression trees output numbers.
Since weve already covered how decision trees are used to classify data, lets jump
right into things
How does this compare with C4.5?
C4.5 CART
Usesinformation gainto segment UsesGini impurity(not to be confused

data during decision tree generation. withGini coefficient).A good
discussionof the differences between
the impurity and coefficient is available
on Stack Overflow.
Uses asingle-pass pruning processto Uses thecost-complexity methodof

mitigate over-fitting. pruning.Starting at the bottom of the
tree, CART evaluates the
misclassification cost with the node vs.
without the node.If the cost doesnt
meet a threshold, it is pruned away.
The decision nodes can have 2 or more The decision nodes have exactly 2
branches. branches.
Probabilitically distributesmissing Uses surrogatesto distribute the

values to children. missing values to children.
Is this supervised or unsupervised?CART is a supervised learning technique,

since it is provided a labeled training dataset in order to construct the classification
or regression tree model.
Why use CART?Many of the reasons youd use C4.5 also apply to CART, since they
are both decision tree learning techniques.Things like ease of interpretation and
explanation also apply to CART as well.
Like C4.5, they are also quite fast, quite popular and the output is human readable.
Where is it used? scikit-learnimplements CART in their decision tree classifier.

Rstree packagehas an implementation of CART.WekaandMATLABalso have
implementations.
Finally, Salford Systems has the only implementation of the original proprietary
CART code based on the theory introduced byworld-renowned statisticians at
Stanford University and the University of California at Berkeley.
Checkout how I used CART
Interesting Resources
Apriori algorithm for Data Mining made simple
What Is Google PageRank and How Is It Earned and Transferred?
2 main differences between classification and regression trees
AdaBoost Tutorial
Ton of References
Now its your turn

Now that Ive shared my thoughts and research around these data mining
algorithms, I want to turn it over to you.
Are you going to give data mining a try?

Which data mining algorithms have you heard of but werent on the list?
Or maybe you have a question about an algorithm?
Let me know what you think by leaving a comment below right now.
Related Posts:
1. Top 10 data mining algorithms in plain R
2. History of data mining
MAY 2, 2015
Posted in Data Tagged data mining 148 Comments
PREVIOUS POST
HISTORY OF DATA MINING
NEXT POST
TOP 10 DATA MINING ALGORITHMS IN PLAIN R
148 thoughts on Top 10 data mining algorithms in

plain English
Pingback: 1 Top data mining algorithms in plain English | blog.offeryour.com
Pingback: Bookmarks for May 17th | Chris's Digital Detritus
Joe Guy
May 17, 2015 at 6:19 pm
Your explanation of SVM is the best I have ever seen. Thanks!

REPLY
Raymond Li
May 17, 2015 at 7:14 pm
Thanks, Joe. Definitely appreciate it!
I owe a lot of it to a few threads from Reddit and Yuval (both are linked in the
post above).
REPLY
Heather Stark
May 29, 2015 at 9:25 am
Agree!!!
REPLY
Roger Huang
May 17, 2015 at 7:10 pm
Really snappy and informative view into data mining algorithms. I clicked on a
whole ton of links: always a mark of a resource done right! Kudos.
REPLY
Raymond Li
May 17, 2015 at 8:27 pm
Thanks, Roger. Im happy you found it snappy and click-worthy. Sometimes

data mining resources can be a bit on the dry side.
REPLY
Lakshminarayanan
May 17, 2015 at 11:41 pm
Thanks for the excellent compile

This is what I was looking for as a starter.
REPLY
Ray Li
May 18, 2015 at 12:19 pm
Thanks, Lakshminarayanan!
REPLY
Pingback: Els 10 primers algoritmes del Big data explicats en paraules | Blog
d'estadstica oficial
Pingback: LessThunk.com Top 10 data mining algorithms in plain English

recommended for big data users
Meghana
May 18, 2015 at 2:57 am
Out of all the numerous websites about data mining algorithms I have gone
through, this one is by far the best! Explaining everything in such casual terms really
helps beginners like me. The examples were definitely apt and helpful.
Thank you so much! You made my work a lot easier.
REPLY
Ray Li
May 18, 2015 at 12:26 pm
Im excited to hear this helped with your work, Meghana! And really appreciate
the kind words.
REPLY
Vrashabh Irde
May 18, 2015 at 3:36 am
This is an awesome list! Thanks. Trying to dabble into ML myself and having a
simple know how of everything is very useful
REPLY
Ray Li
May 18, 2015 at 12:29 pm
Very glad to hear you find it useful, Vrashabh. Thank you!
REPLY
Pingback: Data Mining Algorithms Explained : Stephen E. Arnold @ Beyond

Search
Pingback: Distilled News | Data Analytics & R
Kyle
May 18, 2015 at 9:23 am
Excellent man, this is so well explained. Thanks!!
REPLY
Ray Li
May 18, 2015 at 12:32 pm
My pleasure, Kyle.
REPLY
suanfazu
May 18, 2015 at 9:47 am
Thanks for the excellent share
REPLY
Ray Li
May 18, 2015 at 12:35 pm
My pleasure, Suanfazu! Thanks for exploring the blog and leaving your kind
words.
REPLY
Anonymous
May 18, 2015 at 9:53 am
Hey, great introduction! I would love to see more posts like this in our community;
great way to grasp the concept of algorithms before diving into the hard math.
Just one thing, though: On Step 2 in Naive Bayes you repeated P(Long | Banana)
twice. The third one should be P(Yellow | Banana).
Thanks again!
REPLY
Ray Li
May 18, 2015 at 12:45 pm
Hi Anonymous,
Nice catch! I fixed it now, but have no one to attribute the fix to.
I totally agree about understanding the concepts of the algorithm before the
hard math. Ive always felt using concepts and examples as a platform for
understanding makes the math part way easier.
Thanks again,
Ray
REPLY
Robert Klein
May 18, 2015 at 9:59 am
This is a great resource. Ive bookmarked it. Thanks for your work. I love using
height-zip code to illustrate independence. That will be a go-to for me now. The only
thing I can offer in return is a heads-up about the API we just released for ML
preprocessing. Its all about correlating themes in unstructured information
streams. Hope its useful. Let us know what you think. Thanks again.
REPLY
Ray Li
May 18, 2015 at 12:57 pm
Thanks for bookmarking and the heads-up, Robert!
REPLY
Raghav
May 18, 2015 at 12:46 pm
Hello Ray,
Thanks for a great article.
It looks like there is a typo in step 2 of Naive Bayes. One of the probabilities should
be P(Yellow|Banana).
Thanks again!
REPLY
Ray Li
May 18, 2015 at 1:00 pm
My pleasure, Raghav. Thanks also for letting me know about the typo. It should
be corrected now.
REPLY
Jens
May 18, 2015 at 1:00 pm
Hello Raymond,
first of all kudos for your sum up of data mining algos!
Ive been exploring this for a few weeks now (mainly using scikit learn and nltk in
python).
In the past few days I came up with the idea to create a classifier that is able to
group products by their title to a corresponding product taxonomy.
For that I crawled a German product marketplace for their category landingpages
and created a corpus consisting of a taxonomy tree node in column a and a set of
snowball stemmed relevant uni and bigram keywords ( appx. 50 per node) that
have been extracted from all products on each category page (this is comma
separated in column b).
Now I would like to build a classifier from that with the idea in mind, that I could
throw stemmed product titles at the classifier and let it return the most probable
taxonomy node.
Could you advise which would be the most appropriate one for the given task. I can
email you the corpus
Hope to get some direction to omit any detours / too much trial and error.
Looking forward to your reply.
Thanks again for your great article.
Cheers from Cologne Germany
Jens
REPLY
Ray Li
May 18, 2015 at 9:05 pm
Hi Jens,
Thanks for the kudos and taking the time to leave a comment.
Short answer to your question

I dont know. It sounds like theres a bunch I could learn from you!
For example:
You just taught me about stemming and the Snowball framework. Honestly, Im
amazed there are tools like Snowball that can create stemming algorithms. Very
cool!
Longer answer
I found the StackOverflow.com, stats.stackexchange.com and reddit.com forums
invaluable when I was learning, researching and simplifying the algorithms to
make them easier to describe.
Sorry I couldnt be more help, but Im working to catch up
Ray
REPLY
Jens
May 20, 2015 at 6:48 am
Hi Ray,
thanks for your feedback

I found a good solution in the meantime using a naive bayes approach.
By the way your regular contact form does not work. There is an htaccess
authentication popping up upon form submit.
Cheers
Jens
REPLY
Ray Li
May 20, 2015 at 7:30 am
Awesome!
Also, thanks for the heads up about the contact form. It should be fixed
now. Theres a small issue with the confirmation message (some fields are
not displayed), but no more auth pop-up and the message successfully
sends.
REPLY
Malhar
May 18, 2015 at 1:21 pm
This goes in my bookmarks. Excellent simple explanation. Loved you have taken
SVM. It would be great if you can put Neural network with various kernels.
REPLY
Ray Li
May 18, 2015 at 9:21 pm
Definitely appreciate the bookmark, Malhar! Thanks for your suggestion about
the neural nets. Ill definitely be diving into that one very soon.
REPLY
Meghana
May 18, 2015 at 11:37 pm
Exactly the same concern, Malhar. I was looking for information on Neural
Networks as well.
REPLY
Serge
May 18, 2015 at 6:04 pm
Man, I really wish I had this guide a few years ago! I was trying my hand at
unsupervised categorization of email messages. I didnt know what terms to google,
so the only thing I used was LSM (latent semantic mapping). The problem is, when
you have thousands of words and tens of thousands of emails, the N^2 matrix gets
a little hard to handle, computationally. I ended up giving up on it.
What I had never considered was using a different algorithm to pre-create groups,
which would have helped a lot. This was a useful read.
REPLY
Ray Li
May 18, 2015 at 9:31 pm
Thanks for reading and your kind words, Serge!
REPLY
Pingback: The Data Scientist - Professional Data Science in Singapore 10 Data
Science Algorithms Explained In English
David
May 19, 2015 at 5:27 pm
Great article! Now, as a public service, how about a decision tree or categorization
matrix for selecting the right algorithm?
REPLY
Ray Li
May 20, 2015 at 12:28 am
Thanks, David.
Its a good call about selecting the right algorithm. From all the readings so far, I
feel picking the right one is the hardest part.
Its one of the main reasons I was attracted to the original survey paper despite
it being a bit outdated. Might as well dive into the ones the panelists thought
were important, and then figure out why they use them.
I certainly have a lot more to learn, and Im already having some ideas on future
posts.
Ray
REPLY
D Lego
May 19, 2015 at 5:48 pm
Good post. It is curious, Im write one version in spanish about this same theme.
REPLY
Ray Li
May 20, 2015 at 12:34 am
Thank you, D Lego. Im curious can you email me the link?
REPLY
Pingback: Data mining algorithms | has many :code_blocks
michael davies
May 20, 2015 at 8:24 am
Great work Raymond
REPLY
Ray Li
May 20, 2015 at 12:57 pm
Appreciate it, Michael!
REPLY
Sthitaprajna Sahoo
May 20, 2015 at 9:04 am
Couldnt ask for more simpler explanation. A very good collection and hoping more
posts from you .
REPLY
Ray Li
May 20, 2015 at 1:00 pm
My pleasure, Sthitaprajna.
REPLY
Pingback: Data Mining Algorithms Explained In Plain Language | Artificial

Intelligence Matters
Stephen Oman
May 20, 2015 at 10:36 am
This is a really excellent article with some nice explanations. Looking forward to
your piece on Artificial Neural Networks too!
REPLY
Ray Li
May 20, 2015 at 1:31 pm
Thanks, Stephen!
REPLY
Richard Grigonis
May 21, 2015 at 9:13 am
Including Decision Forests would have been nice.
REPLY
Ray Li
May 21, 2015 at 10:13 pm
Although I havent used that one myself, thats a good one, Richard!
REPLY
Pingback: Top 10 data mining algorithms in plain English Another Word For It
Daniel Zilber
May 21, 2015 at 12:03 pm
Thanks for the write up!
REPLY
Ray Li
May 21, 2015 at 10:13 pm
Appreciate it, Daniel.
REPLY
Pingback: Els 10 primers algoritmes del Big data explicats en paraules |

Econometria aplicada
Sylvio Allore
May 21, 2015 at 10:26 pm
Hello,
It is a good review of things undergraduates learn but what about starting with just
a single example of application in predicting stock returns, for example. Do you
have an example of applying, for example, naive Bayes to predicting stock returns?
That would be more useful that listing a set of methods one can find in most ML
books.
REPLY
Ray Li
May 21, 2015 at 11:00 pm
Thanks, Sylvio. I appreciate the constructive comments.
Depth and real-life applications are certainly something to improve on in this

article series (Yep I think it deserves to be a series!). Stay tuned
REPLY
Ray Li
May 21, 2015 at 10:31 pm
Super excited about this
Due to all your comments and sharing, this article has been reposted to KDnuggets,
a leading resource on data mining: http://bit.ly/1AoicbW!
Theres no way this couldve happened without you reading, commenting and
sharing. My sincerest thank you!
REPLY
Matt Cairnduff
May 22, 2015 at 5:07 am
Echoing all the sentiments above Ray. This is a tremendously useful resource thats
gone straight into my bookmarks. Really appreciate the informal writing style as
well, which makes it nice and accessible, and easy to share with colleagues!
REPLY
Ray Li
May 22, 2015 at 5:41 pm
Thank you, Matt. Im glad you found the writing style accessible and shareable.
Please do share
REPLY
Adriana Wilde
May 22, 2015 at 5:36 am
Excellent blogpost! Very accessible and rather complete (apart from multilayer
perceptrons, which I hope youll touch in a follow up post).
I found useful that you refer to the NFL theorem and list characteristics of each
algorithm which make them more suited to one type of problem than another (e.g.
lazy learners are faster in training but slower classifiers, and why). I also liked you
explained which algorithms are for supervised and unsupervised learning. These
are all things to take into account when choosing a classifier. Wish I read this 5
years ago!
Thanks!
REPLY
Ray Li
May 22, 2015 at 5:52 pm
Hi Adriana,
Thank you for your kind words.
I think I came across the standard perceptron while researching SVM. Definitely
thinking about tackling MLPs and more recently all the buzz about deep learning
at some point.
Thanks for your insightful comment.
Ray
REPLY
brian piercy
May 22, 2015 at 8:00 am
What an awesome article! I learned more from this than 20 hours of plowing
through SciKit. Well done!
REPLY
Ray Li
May 22, 2015 at 5:53 pm
Appreciate it, Brian!
REPLY
david berneda
May 25, 2015 at 3:04 am
Thanks a lot Ray for your article !

I did a clustering library sometime ago, your article encourages me to try expanding
it with more algorithms.
regards
david
REPLY
Ray Li
May 25, 2015 at 1:11 pm
My pleasure, David.
REPLY
Pingback: Les liens de la semaine dition #133 | French Coding
Pingback: #1 Time Management is Key | Kenechi Learns Code
Martin Campbell
May 25, 2015 at 11:25 pm
This is a fantastic article and just what I needed as I start attempting to learn all this
stuff. Ill be shooting up the Kaggle rankings in now time (well, from 100,000 to
90,000 perhaps!).
REPLY
Ray Li
May 26, 2015 at 12:45 pm
Appreciate it, Martin. Im really happy to hear that it helps to get the ball rolling
for you. Your increased Kaggle ranking would be nice icing on the cake!
REPLY
Yolande Tra
May 26, 2015 at 6:27 am
Excellent overview. You have a gift in teaching complex topics into down-to earth
terms. Here is my comment: when using data mining algorithm, in this list
(classifiers) I am more concerned about accuracy. We can try and use each one of
these but in the end we are interested in validation after training. Accuracy was only
addressed with SVM and Adaboost.
REPLY
Ray Li
May 26, 2015 at 12:50 pm
Thank you for your kind words, Yolande.
Its a good point about the accuracy. Ill definitely keep this in mind to explore
accuracy in an upcoming post.
REPLY
Maksim Gayduk
May 26, 2015 at 8:43 am
I didnt quite understand the part about C4.5 pruning.

In the link provided, it says that in order to decide whether to prune a tree or not, it
calculates error rate of both pruned and unpruned tree and decides which one
leads to the lower limit of confidence interval.
It should work okey for already pruned trees, but how does it start? Usually decision
tree algorhythms build the tree until it reaches entrophy = 0, which means zero
error rate, and zero upper limit for confidence interval. In this case, such tree can
never be pruned, using that logic
REPLY
Ray Li
May 26, 2015 at 5:30 pm
This is a great question, Maksim. It got me thinking a bunch, but unfortunately I

dont have an answer that Im satisfied with.
My investigation so far indicates that the error rate for the training data is
distinct from the estimated error rate for the unseen data. As you pointed out,
this is what the confidence interval is meant to bound. Based on the formula in
the link, given f=0, Im also at a loss on how a pruned tree could beat the
unpruned tree.
If youre up for it, CrossValidated or StackOverflow might be an awesome place

to get your question answered. You or I could even post a link here for
reference.
REPLY
Pingback: No solutions for a simple predictive analytics challenge? | Decision

Management Community
Ilan Sharfer
May 26, 2015 at 12:42 pm
Ray, thanks a lot for this really useful review. Some of the algorithms are
already familiar to me, others are new. So it surely helps to have them all in
one place.
As a practical application Im interested in a data mining algorithm that can
be used in investment portfolio selection based on historical data, that is,
decide which stocks to invest in and make timely buy/sell orders. Can you
recommend a suitable algorithm?
REPLY
Ray Li
May 26, 2015 at 6:33 pm
My pleasure, Ilan. Same here, Ive come across a few of these algorithms before
writing this article, and I had to teach myself the unfamiliar ones.
Im planning to go into more practical applications in an upcoming post. Stay

tuned for that one
On a side note, you might already be aware of them, and the random walk
hypothesis and efficient-market hypothesis might be of interest to you. It
doesnt answer your question, but it is an alternate perspective on predicting
future returns based on historical data.
REPLY
Zeeshan
May 26, 2015 at 7:59 pm
Awesome explanation!
REPLY
Ray Li
May 26, 2015 at 8:29 pm
Much appreciated, Zeeshan.

REPLY
Lalit A Patel
May 26, 2015 at 11:09 pm
This is an excellent blog. It is helping me digest what I have studied elsewhere.

Thanks a lot.
REPLY
Ray Li
May 28, 2015 at 8:07 am
Thank you, Lalit. Im happy to hear the blog is helping you with your studies.
REPLY
Phaneendra
May 28, 2015 at 1:40 am
Fantastic post ray. Nicely explained. Helped me enhancing my understanding.

Please keep sharing the knowledge It helps.
Regards,
Phaneendra
REPLY
Ray Li
May 28, 2015 at 8:10 am
Thanks, Phaneendra. More is definitely on the way
REPLY
Adrian Cuyugan
May 28, 2015 at 7:01 am
These are very good and simple explanation. Thank you for sharing!
REPLY
Ray Li
May 28, 2015 at 8:11 am
Appreciate it, Adrian.
REPLY
Pingback: BirdView (2) Ranking Everything: an Overview of Link Analysis

Using PageRank Algorithm | datawarrior
Peter Nour
May 28, 2015 at 4:50 pm
Thanks Ray! This is a fantastic post with great details and yet so simple to
understand.
Cheers,
Peter
REPLY
Ray Li
May 28, 2015 at 8:39 pm
Much appreciated, Peter. Glad you liked the post.
REPLY
Sanjoy
May 29, 2015 at 8:08 am
Awesome explanation of some of the oft-used data-mining algorithms.
Are you thinking of doing something similar for some of the other algorithms
(Discriminant Analysis, Neural Networks, etc.) as well?
Would love to read your posts on them.
Thanks,
Sanjoy
REPLY
Ray Li
May 31, 2015 at 12:55 am
Thanks, Sanjoy. Those are good ones. NNs are definitely at the top of the list.
REPLY
Suresh
May 29, 2015 at 11:12 am
Thanks Ray!! Awesome compilation and explanation. This truly helps me get started
with learning and applying data science.
REPLY
Ray Li
May 31, 2015 at 12:56 am
My pleasure, Suresh. Im really happy to hear the post helped you start learning
and applying.
REPLY
Pingback: June 2015 Items of Interest | Tidewater Analytics
Ulf
May 30, 2015 at 10:24 am
Im afraid to be rather boring by having nothing to contribute than more of the well
deserved praise to the quality of your article: thanks, really a great wrap-up and
very good primer for the subject.
I shared the link to your post on the intranet of my company and rarely an article
has received so many likes in no time.
The only thing I was missing was a bit more visual support. You have an excellent
video embedded for SVM. But for many of the other concepts, there are also rather
straight forward visual representations possible (e.g. clustering, k-nearest-
neighbour).
I found the book Data Science for Business (http://www.data-science-for-
biz.com/) a VERY good start into the subject (.though I would have prefered to
have read your article beore, as it really wraps it up so well.). This book offers real
real inspiration as to how the underlying concepts of the algorithms you explain can
be visualized and thus be made more intuitively understandable.
Enhancing your article with a bit more visual support would be the cherry on the
icing on the cake
REPLY
Ray Li
May 31, 2015 at 1:06 am
Hi Ulf,
Really appreciate your kind words and you sharing it with your colleagues.
Thats a good point about visualizations especially for visual learners. Like in
the case of the SVM video, I found seeing it in action made it so much clearer.
I definitely appreciate the book recommendation. From the sound of it, that
book might be a fantastic reference not just for this article but for future articles
covering this area.
Thanks again,
Ray
REPLY
Praveen G S
May 31, 2015 at 11:57 pm
Thanks for your wonderful post. I like the way you describe the SVM, kNN, Bayes.
Since you language is so user friendly and easy to understand. Can you also write a
blog on the some of the ensembles like random forest which is one of the most
popular machine learning algorithm and has a good predictive power compared to
other algorithms
REPLY
Ray Li
June 1, 2015 at 5:56 pm
Thanks, Praveen. Those are good ones, and Ill add them to my growing list of
potential algorithms to dive into.
REPLY
Tom F
June 2, 2015 at 5:17 am
Fantastic article. Thanks.
One point:
>> What do the balls, table and stick represent? The balls represent data points, and
the red and blue color represent 2 classes. The stick represents the simplest
hyperplane which is a line.
The simplest (i.e. 1 dimensional) hyperplane is a point, not a line.
REPLY
Ray Li
June 2, 2015 at 1:32 pm
Thanks, Tom. Good point about the simplest hyperplane. Ive modified the
sentence to read The stick represents the hyperplane which in this case is a
line.
REPLY
Pingback: Guide to Data Science Competitions | Happy Endpoints
vdep
June 15, 2015 at 2:31 am
Hi Ray,
All Algorithms are explained in a simple and neat manner. It will be extremely
useful for beginners as well as pros if u could come up with a cheat sheet,
explaining best and worst scenario, for each algorithms. ( I mean how to choose the
best algorithm for a given data).
Thank you
REPLY
Ray Li
June 15, 2015 at 12:14 pm
Appreciate your kind words, vdep! Thanks also for your suggestion about the
cheat sheet.
REPLY
Houssem
June 16, 2015 at 12:30 am
Hi Ray,
Thank you for your effort to explain such algorithms with such simplicity.
Good to start on data science !
REPLY
Ray Li
June 16, 2015 at 12:37 am
My pleasure, Houssem!
REPLY
Pingback: Poesa eres t se suma a la IA: ahora compone y recita poemas |

Rubn Hinojosa Chapel - Blog personal
Pingback: Linkblog #6 | Ivan Yurchenko
Pingback: Web Picks (week of 1 June 2015) | DataMiningApps
Pingback: DB Weekly No.59 | ENUE Blog
Paris
September 11, 2015 at 2:45 am
Excellent simplified approach!
REPLY
Ray Li
September 11, 2015 at 9:37 am
Thanks, Paris! Much appreciated
REPLY
Pingback: Klicks #33: Vielmehr berbleibsel - Ole Reimann
Pingback: Very interesting explainer: Top 10 data mining algorithms in plain

English rayli.net/blog/data/top-10-dat (via @TheBrowser) | Stromabnehmer
Pingback: (Machine Learning)&(Deep Learning)(Chapter 1)

| ~ Code flavor ~
Pingback: Data Lab Link Roundup: python pivot tables, Hypothesis for testing,
data mining algorithms in plain english and more | Open Data Aha!
Pingback: Top 10 Data mining algorithm C4.5 | Ken's Study Note
Pingback: Top 10 Data mining algorithm k-means | Ken's Study Note
Pingback: Top 10 Data mining algorithm kNN | Ken's Study Note
Pingback: How To Learn Everything About Machine Learning | Meanchey

Center
Kurac
November 23, 2015 at 2:54 pm
The latest downloadable Orange data mining suite and its Associate add-on
doesnt seem to be using Apriori for enumerating frequent itemsets but FP-growth
algorithm instead.
I must say its MUCH faster now.
REPLY
Ray Li
November 29, 2015 at 2:34 pm
Thanks, Kurac.
REPLY
Pingback: Simulando, visualizando ML, algoritmos, cheatsheet y conjuntos de

datos: Lecturas para el fin de semana | To the mean!
Pingback: February 2016 Items of Interest | Tidewater Analytics
mounika
February 20, 2016 at 12:30 am
is there any searching technique algorithm in data mining ..please help me..
REPLY
Ray Li
February 20, 2016 at 2:06 pm
Yes, even within the context of the 10 data mining algorithms, we are searching.
The first 3 that come to mind are K-means, Apriori and PageRank.
K-means groups similar data together. Its essentially a way to search through
the data and group together data that have similar attributes.
Apriori attempts to search for relationships and patterns among a set of

transactions.
Finally, PageRank searches through a network in order to unearth the relative

importance of an object in the network.
Hope this helps!
REPLY
Ray Li
February 20, 2016 at 2:12 pm
However, if youre looking for a search algorithm that finds specific item(s) that
match certain attributes, these 10 data mining algorithms may not be a good fit.
REPLY
Pingback: :
Jenny
March 1, 2016 at 9:37 pm
This article is so helpful!
Ive always have trouble understanding the Naive Bayes and SVM algorithms.
Your article has done a really great job in explaining these two algorithms that now
I have a much better understanding on these algorithms.
Thanks alot!
REPLY
Ray Li
March 2, 2016 at 9:34 am
Glad you found the article helpful, Jenny. Thanks for the kind words!
REPLY
Pingback: Spectroscopy and Chemometrics News Weekly #9, 2016 | NIR

Calibration Model
Mikail
March 14, 2016 at 3:34 pm
Thank you!
REPLY
David Millie
April 2, 2016 at 1:35 pm
very nice summary article question is the current implementation of Orange

(still) using C4.5 as the classification tree algorithm I cannot find any reference to
it in the current documentation
REPLY
Ray Li
April 3, 2016 at 2:54 pm
Thanks, David. This might help:

http://orange.biolab.si/docs/latest/reference/rst/Orange.classification.tree
.html.
Orange includes multiple implementations of classification tree learners: a

very flexible TreeLearner, a fast SimpleTreeLearner, and a C45Learner, which
uses the C4.5 tree induction algorithm.
Hope this helps!
REPLY
Mak
April 3, 2016 at 4:04 pm
Good job! This is a great resource for a beginner like me.
REPLY
Ray Li
April 3, 2016 at 8:18 pm
Thank you, Mak!
REPLY
Jermaine Allgood
April 12, 2016 at 10:31 am
THANK YOU!!!!!!! As a budding data scientist, this is really helpful. I appreciate it

immensely!!!!!
REPLY
Ray Li
April 13, 2016 at 7:02 am
Thanks, Jermaine! Good luck in your data scientist journey.
REPLY
Bruno Ferreira
April 23, 2016 at 1:57 pm
Thank very much for this article.
This is from a far the best page about the most used data-mining algorithms.
As a data-mining student, this was very helpful.
REPLY
Ray Li
April 24, 2016 at 3:28 pm
My pleasure, Bruno. Thanks for the kind words!
REPLY
Paolo
May 2, 2016 at 7:27 am
Great article, Ray, top level, thank you so much!
This question could be a bit OT: which technique do you feel to suggest for the
analysis of biological networks? Classical graph theory measures, functional
cartography (by Guimera & Amaral), entropy and clustering are already used with
good results. PageRank on undirected networks provides similar results to
betweenness centrality, I am looking for innovative approaches to be compared
with the mentioned ones.
Thanks again!
REPLY
Ray Li
May 8, 2016 at 7:19 pm
Thank you, Paolo. Really appreciate it!
From the techniques youve already mentioned, it sounds like youre already
deep into the area of biological network analysis. Although I dont have any new
approaches to add (and probably not as familiar with this area as you are),
perhaps someone reading this thread could point us in the right direction.
REPLY
abdul
May 7, 2016 at 7:17 am
Wonderful list and even more wonderful explanations. Question though, you dont
think Random Forests merit a place on that list?
Cheers
REPLY
Ray Li
May 8, 2016 at 7:26 pm
Thanks, Abdul! Random forests is a great one. However, the authors of the
original 2007 paper describe how their analysis arrived at these top 10. If a
similar analysis were done today, Im sure random forest would be a strong
contender.
REPLY
Abdul
May 11, 2016 at 6:05 am
Ok. Fair enough

Again, nice work
REPLY
Phil
May 8, 2016 at 9:41 pm
I did not read the whole article, but the description of the Apriori algorithm is
incorrect.
It is said that there are three steps and that the second step is Those itemsets that
satisfy the support and confidence move onto the next round for 2-itemsets.
This is incorrect and it is not how the Apriori algorithm works.. The Apriori
algorithms does NOT consider the confidence when generating itemsets. It only
considers the confidence after finding the itemsets, when it is generating the rules.
In other words, the Apriori algorithms first find the frequent itemsets by applying
the three steps. Then it applies another algorithm for generating the rules from
these itemsets. The confidence is only considered by the second algorithm. It is not
considered during itemset generation.
REPLY
Pingback: (Machine Learning)&(Deep Learning) | Dotte
Pingback: d204: Top 10 data mining algorithms explained in plain English

[nd009 study materials] AI
Pingback: Top 10 data mining algorithms in plain English | rayli.net Unstable

Contextuality Research
Aftab khan
January 28, 2017 at 4:29 am
Sir,
This information is very helpful for the students like me. I was searching for an
algorithm for my final year project in data mining. Now i can easily select an
algorithm to start my work on my final year project. Thanks
REPLY
Leave a Reply
Your email address will not be published. Required fields are marked *
Name *
Email *
Website
Please enter an answer in digits:
21=
Comment
You may use these HTML tags and attributes: <a href="" title=""> <abbr title="">
<acronym title=""> <b> <blockquote cite=""> <cite> <code> <del
datetime=""> <em> <i> <q cite=""> <strike> <strong>
POST COMMENT
Replies to my comments Notify me of followup comments via e-mail. You can
also subscribe without commenting.
rayli.net
Enter your email address to get your free email updates, stuff you can't find on the
blog and a free report.
Email Address
First Name
GIVE ME!
Topics
Data
Programming
Life
Blogging
Search
Recent Comments
Burcu Kolbay on Top 10 data mining algorithms in plain R
abdur rahman on Top 10 data mining algorithms in plain R
HamSl NisseManden on How to convert Fahrenheit to Celsius?
Aftab khan on Top 10 data mining algorithms in plain English
Por que hay tantos Pythons? Desarrollando Juntos on What is the

purpose of PyPy?
RaySF on Top 10 data mining algorithms in plain R
saurabh gupta on Top 10 differences between Java and C#
7 habits of highly effective data analysis | rayli.net Unstable

Contextuality Research on 7 habits of highly effective data analysis
Top 10 data mining algorithms in plain English | rayli.net Unstable

Contextuality Research on Top 10 data mining algorithms in plain
English
Alireza on History of data mining
Free report reveals
The Top 10 Tools I Use To Supercharge My Learning (Hint: The Top 5 Are Free)
Enter your email address to get your free email updates, stuff you can't find on the
blog and a free report:
Email Address
First Name
GIVE ME!
Topics
Data
Programming
Life
Blogging
Follow Me
Facebook
Twitter
LinkedIn
RSS
Home Articles Hacker Bits Resume About Contact
Proudly powered by WordPress Theme: Sequential by WordPress.com.

Top 10 Data Mining Algorithms

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Top 10 Data Mining Algorithms

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Top 10 Data Mining Algorithms

Uploaded by

Copyright:

Available Formats

rayli.

Top 10 data mining algorithms in

What are we waiting for? Lets get started!

Whats an example of this?Sure, suppose a dataset contains a bunch of patients.

And heres the deal:

Patient has a history of cancer

The bottomline is:

Is this supervised or unsupervised?This is supervised learning, since the training

Where is it used? A popular open-source Java implementation can be found over

Hang on, whats cluster analysis?Cluster analysis is a family of algorithms

Is there an example of this?Definitely, suppose we have a dataset of patients. In

You might be wondering:

Want to know the best part?

At a high level, they all do something like this:

Is this supervised or unsupervised?It depends, but most would classify k-means

k-means can be used topre-clustera massive dataset followed by a more

Its not all smooth sailing:

Where is it used?A ton of implementations for k-means clustering are available

Checkout how I used k-means

3. Support vector machines

Whoa, a hyper-what?A hyperplane is a function like the equation for a line,

Do you have an example?Absolutely, the simplest example I found starts with a

And the coolest part?

SVM figures out the function for the hyperplane.

Heres the work-around:

You might be wondering if this is cheating:

I found this visualization super helpful:

Heres the deal:

If we had a patient dataset, each patient could be described by various

The bottomline is:

The key is:

Is this supervised or unsupervised?This is a supervised learning, since a dataset

Where is it used?There are many implementations of SVM. A few of the popular

Checkout how I used SVM

The next algorithm is one of my favorites

What are association rules?Association rule learning is a data mining technique

Whats an example of Apriori?Lets say we have a database full of supermarket

Heres the best part:

The power of this is:

The basic Apriori algorithm is a 3 step approach:

1. Join.Scan the whole database for how frequent 1-itemsets are.

Is this supervised or unsupervised?Apriori is generally considered an

But wait, theres more

Apriori can also be modified todo classificationbased on labelled data.

Why use Apriori?Apriori is well understood, easy to implement and hasmany

On the other hand

Where is it used?Plenty of implementations of Apriori are available. Some popular

Checkout how I used Apriori

In statistics, the EM algorithmiterates and optimizes the likelihood of seeing

OK, hang on while I explain

Im not a statistician, so hopefully my simplification is both correct and helps with

Whats a statistical model?I see amodelas something that describes how

Wait, whats a distribution?A distribution represents the probabilities for all

So, all you need to describe a normal distribution are 2 parameters:

Heres the deal:

Remember, its the hypothetical probability of theexistinggrades, not the

Youre probably wondering, whatsprobabilitythen?

Heres the kicker: