Machine Learning
Supervised learning
Classification vs. Prediction
• Supervised learning may be used for both classification and
prediction
Classification and Prediction
• What is classification? What is prediction?
• Classification:
– predicts categorical class labels (discrete or nominal),
e.g., yes/no, treatment A/B/C, safe/risky, etc.
– classifies data (constructs a model) based on the training set
and the values (class labels) of a classifying attribute, and
uses the model to classify new data
Classification vs. Prediction cont..
Prediction
– models continuous-valued functions, i.e., predicts
unknown or missing values
– e.g., the loan amount that is safe to approve, or the amount a
customer will spend on a purchase
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification Process steps
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
Example Dataset
Age          Income   Student   Credit_rating   Buys_Computer
youth        high     no        fair            no
youth        high     no        excellent       no
middle_aged  high     no        fair            yes
senior       medium   no        fair            yes
senior       low      yes       fair            yes
senior       low      yes       excellent       no
middle_aged  low      yes       excellent       yes
youth        medium   no        fair            no
youth        low      yes       fair            yes
senior       medium   yes       fair            yes
youth        medium   yes       excellent       yes
middle_aged  medium   no        excellent       yes
middle_aged  high     yes       fair            yes
senior       medium   no        excellent       no
Classification Process steps cont..
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that
are correctly classified by the model
• Test set is independent of training set, otherwise over-
fitting will occur
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Model Construction: Phase 1
Training phase
Training Data  →  Classification Algorithm  →  Classifier (Model)

NAME       RANK                 Experience   Senior
VPS        Assistant Professor  17           yes
PK Goyal   Associate            7            yes
Shubhas    Professor            2            yes
Denendra   Associate            7            yes
Dinesh     Assistant            6            no
Kapil      Associate            3            no

Learned model: IF rank = ‘professor’ OR experience > 6 THEN senior = ‘yes’
Use the Model in Prediction: Phase 2
• Testing Phase
Testing Data  →  Classifier  →  Predicted class label

NAME     RANK                 Experience   Senior
Rk       Assistant Professor  2            no
Dimple   Associate            7            no
KV       Professor            8            yes
Rajesh   Assistant            7            no

Unseen data: (RP, Professor, 4)  →  Senior?
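A minimal Python sketch of the two phases, assuming the rule learned above (IF rank = 'professor' OR experience > 6 THEN senior = 'yes') is the model; the tuples come from the tables on these slides, and the function and variable names are illustrative.

```python
# Phase 1 model (learned from the training data above):
#   IF rank = 'professor' OR experience > 6 THEN senior = 'yes'
def classify(rank, experience):
    """Apply the learned rule to one tuple."""
    return "yes" if rank == "professor" or experience > 6 else "no"

# Phase 2: apply the model to the test set (known labels kept for evaluation)
test_data = [
    ("Rk", "assistant professor", 2, "no"),
    ("Dimple", "associate", 7, "no"),
    ("KV", "professor", 8, "yes"),
    ("Rajesh", "assistant", 7, "no"),
]
correct = sum(classify(rank, exp) == label for _, rank, exp, label in test_data)
print(f"accuracy = {correct / len(test_data):.2f}")   # fraction correctly classified

# Classify the unseen tuple (RP, professor, 4): the rule gives senior = 'yes'
print("RP ->", classify("professor", 4))
```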
Evaluating Classification Methods
• Accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Example Dataset
Age          Income   Student   Credit_rating   Buys_Computer
youth        high     no        fair            no
youth        high     no        excellent       no
middle_aged  high     no        fair            yes
senior       medium   no        fair            yes
senior       low      yes       fair            yes
senior       low      yes       excellent       no
middle_aged  low      yes       excellent       yes
youth        medium   no        fair            no
youth        low      yes       fair            yes
senior       medium   yes       fair            yes
youth        medium   yes       excellent       yes
middle_aged  medium   no        excellent       yes
middle_aged  high     yes       fair            yes
senior       medium   no        excellent       no
Decision Trees
Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• E.g., decision making
• Who buys a computer?
A Decision Tree for “buys_computer”
• Example
age?
├── youth → student?
│     ├── no  → buys_computer = no
│     └── yes → buys_computer = yes
├── middle_aged → buys_computer = yes
└── senior → credit_rating?
      ├── excellent → buys_computer = no
      └── fair      → buys_computer = yes
Decision Trees cont..
Decision tree induction
– Many algorithms exist
• Hunt’s Algorithm – one of the earliest methods
• ID3 and its successor C4.5, CART, SLIQ, and SPRINT
• Greedy strategy
• Split the records based on an attribute test that optimizes a certain
criterion
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-
conquer manner
• The algorithm is called with three parameters: D, attribute_list,
and Attribute_selection_method
– At the start, all the training examples are at the root node ND
• Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
– Node ND is labeled with the splitting criterion and branch
is grown from node ND
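A compact Python sketch of the top-down, divide-and-conquer procedure described above; best_split stands in for Attribute_selection_method (e.g., information gain, defined on the next slides), and all names are illustrative rather than a fixed API.

```python
# Sketch of greedy, top-down decision tree induction.
# D is a list of tuples whose last element is the class label; attribute_list
# holds column indices; best_split stands in for Attribute_selection_method.
from collections import Counter

def majority_class(D):
    """Majority voting over the class labels of the tuples in D."""
    return Counter(t[-1] for t in D).most_common(1)[0][0]

def build_tree(D, attribute_list, best_split):
    labels = {t[-1] for t in D}
    if len(labels) == 1:                   # all samples belong to the same class
        return labels.pop()
    if not attribute_list:                 # no attributes left: majority voting
        return majority_class(D)
    A = best_split(D, attribute_list)      # node ND is labeled with attribute A
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]
    for value in {t[A] for t in D}:        # grow one branch per outcome of the test
        Dj = [t for t in D if t[A] == value]
        node["branches"][value] = build_tree(Dj, remaining, best_split)
    return node
```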
Algorithm for Decision Tree Induction cont..
Splitting: how to specify the attribute test condition?
Depends on attribute types
• Discrete
• Continuous
Depends on number of ways to split
• Binary split
• Multi-way split
Algorithm for Decision Tree Induction cont..
Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning
– majority voting is employed for classifying the leaf
• There are no samples left
Attribute Selection Measure
• Determine the best split?
Information gain
• Select the attribute with the highest information gain
• Construct simple tree
• The expected information needed to classify a tuple in D is given by
      Info(D) = − Σ_{i=1..m} pi log2(pi)
  where pi is the probability that an arbitrary tuple in D belongs to class
  Ci and is estimated by |Ci,D|/|D|.
Information gain
• After partitioning on attribute A, the information we still need to
classify a tuple is
      InfoA(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
• Information gain is then defined as
      Gain(A) = Info(D) − InfoA(D)
Example
• Let’s consider Buys_ computer example
• There are two distinct classes (that is, m = 2). Let class C1
correspond to yes and class C2 correspond to no
• The expected information needed to classify a tuple in D (9 tuples of
class yes and 5 of class no) is
      Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
Example cont..
• Now, compute the expected information requirement for each attribute.
• Let’s start with the attribute age:
      Infoage(D) = (5/14) × Info(2 yes, 3 no) + (4/14) × Info(4 yes, 0 no)
                 + (5/14) × Info(3 yes, 2 no) = 0.694 bits
• Hence, the gain in information from such a partitioning would be
      Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits
Example cont..
• Similarly, Gain(income) = 0.029 bits
• Gain(student) = 0.151 bits, and
• Gain(credit rating) = 0.048 bits.
• Because age has the highest information gain among the
attributes, it is selected as the splitting attribute.
• Continue recursively to grow the tree until stop conditions are
met
Example cont..
• So the node is labeled with age, and branches are grown for
each value of the attribute
Example cont..
• For continuous-valued attributes, the midpoint between each pair of
adjacent values is considered as a possible split point
• For each possible split point of A we calculate InfoA(D)
• The point giving the minimum expected information requirement for A is
selected as the split_point of A
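A small Python sketch of the entropy and information-gain computation for the buys_computer training set above; it reproduces the values quoted in the example (Info(D) ≈ 0.940 bits, Gain(age) ≈ 0.246 bits, etc.). Function names are illustrative.

```python
from math import log2
from collections import Counter

# Training set D: (age, income, student, credit_rating, buys_computer)
D = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(tuples):
    """Info(D) = -sum pi log2 pi over the class labels (the last column)."""
    counts = Counter(t[-1] for t in tuples)
    n = len(tuples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(tuples, attr):
    """Gain(A) = Info(D) - Info_A(D), with Info_A weighted by partition sizes."""
    idx, n = ATTRS[attr], len(tuples)
    info_a = 0.0
    for value in {t[idx] for t in tuples}:
        part = [t for t in tuples if t[idx] == value]
        info_a += len(part) / n * info(part)
    return info(tuples) - info_a

print(f"Info(D) = {info(D):.3f} bits")            # ~0.940
for a in ATTRS:
    print(f"Gain({a}) = {gain(D, a):.3f} bits")   # age has the highest gain (~0.246)
```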
Attribute Selection Measure cont..
Gini Index (CART)
• Considers a binary split for each attribute
• The Gini index measures the impurity of D as
      Gini(D) = 1 − Σ_{i=1..m} pi²
• To determine the best binary split on A, consider all possible
subsets of the values of A (excluding the full set and the empty set)
• For discrete-valued attributes, the subset that gives the
minimum Gini index for that attribute is selected as its
splitting subset.
Gini Index cont..
• The weighted sum of the impurities of the resulting partitions is
computed as
      GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
• For continuous-valued attributes, the split point giving the
minimum Gini index for a given (continuous-valued)
attribute is taken as the split-point
• The reduction in impurity is
      ΔGini(A) = Gini(D) − GiniA(D)
• The attribute that maximizes the reduction in impurity (or,
equivalently, has the minimum Gini index) is selected as
the splitting attribute
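A corresponding sketch for the Gini index of a candidate binary split, following the definitions above; the commented usage assumes the D and ATTRS variables from the information-gain sketch, and the subset {low, medium} of income is purely illustrative.

```python
from collections import Counter

def gini(tuples):
    """Gini(D) = 1 - sum pi^2 over the class labels (last column)."""
    counts = Counter(t[-1] for t in tuples)
    n = len(tuples)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_binary_split(tuples, idx, subset):
    """Weighted Gini of splitting on column `idx` into `subset` vs. the rest."""
    d1 = [t for t in tuples if t[idx] in subset]
    d2 = [t for t in tuples if t[idx] not in subset]
    n = len(tuples)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Illustrative use with D and ATTRS from the information-gain sketch:
#   income in {low, medium} vs {high} is one candidate binary split of income
#   delta = gini(D) - gini_binary_split(D, ATTRS["income"], {"low", "medium"})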
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “youth” AND student = “no” THEN buys_computer = “no”
IF age = “youth” AND student = “yes” THEN buys_computer = “yes”
IF age = “middle_aged” THEN buys_computer = “yes”
IF age = “senior” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “senior” AND credit_rating = “fair” THEN buys_computer = “yes”
Decision tree advantages
Advantages:
• Inexpensive to construct (does not require parameter setting
or domain knowledge)
• Extremely fast at the learning step and also at classifying
unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques
for many simple data sets
• Very good average performance over many
(multidimensional) datasets
Overfitting in Classification
What is overfitting?
• Happens when a model learns the detail and noise in the
training data to the extent that it negatively impacts the
performance of the model on new data
• That is, noise and random fluctuations in the training data are picked
up and learned as concepts by the model
• An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
Solutions for overfitting
Pruning : to remove the least reliable branches.
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a node
if this would result in the goodness measure falling below a
threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—
get a sequence of progressively pruned trees
– e.g., the cost-complexity pruning algorithm, which compares the
cost complexity of the subtree at each node with and without pruning
• Use a set of data different from the training data to decide
which is the “best pruned tree”
Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation (a sketch follows after this list)
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
• Use minimum description length (MDL) principle
– halting growth of the tree when the encoding is minimized
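A rough Python sketch of estimating accuracy with 10-fold cross validation, as listed above; train_tree and predict are placeholders for any induction and classification routines, so this is an outline rather than a complete implementation.

```python
import random

def k_fold_accuracy(data, train_tree, predict, k=10, seed=0):
    """Estimate accuracy by k-fold cross validation (k = 10 by default)."""
    data = list(data)
    random.Random(seed).shuffle(data)            # shuffle once, reproducibly
    folds = [data[i::k] for i in range(k)]       # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [t for j in range(k) if j != i for t in folds[j]]
        model = train_tree(train)                # e.g. decision tree induction
        correct = sum(predict(model, t) == t[-1] for t in test)
        scores.append(correct / len(test))
    return sum(scores) / k                       # average accuracy over the folds
```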
Bayesian Classification
Introduction
• These are statistical classifiers
• Predict class membership probabilities, such as the probability
that a given tuple belongs to a particular class.
• Based on Bayes’ theorem
• The simplest Bayesian classifier, known as the naïve Bayesian
classifier, assumes class-conditional independence.
• Bayesian classifiers exhibit high accuracy and speed when applied to
large databases.
Bayes’ Theorem
• Let X be a data sample whose class label is unknown
• Let H be a hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability
that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (i.e., the initial
probability before we observe any data)
• P(X): prior probability of X
• P(X|H) : Posterior probability of observing the sample X,
given that the hypothesis holds
Bayes’ Theorem cont..
• Given training data X, posteriori probability of a hypothesis H,
P(H|X) follows the Bayes’ theorem
      P(H|X) = P(X|H) P(H) / P(X)
• Informally, this can be written as
      posterior = (likelihood × prior) / evidence
• P(H), P(X|H), and P(X) may be estimated from the given data
Naïve Bayes Classifier
• Let D be a training set of tuples and their associated class
labels
• Suppose that there are m classes, C1, C2, …. , Cm. Given tuple
X , naïve Bayesian classifier predicts that tuple X belongs to
the class Ci if and only if
      P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is
maximized is called the maximum posteriori hypothesis. By
Bayes’ theorem,
      P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naïve Bayes Classifier cont..
• As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be
maximized
• If the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm),
we would therefore maximize P(X|Ci). Otherwise, we
maximize P(X|Ci) P(Ci)
• P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples
of class Ci in D.
• A simplifying assumption: attributes are conditionally
independent, so
      P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• Here, xk refers to the value of attribute Ak for tuple X.
Naïve Bayes Classifier cont..
• If Ak is categorical, then P(xk|Ci) is the number of tuples of class
Ci in D having the value xk for Ak, divided by |Ci,D|, the number of
tuples of class Ci in D
• If Ak is continuous-valued, it is typically assumed to have a Gaussian
distribution with mean μ and standard deviation σ, given by
      g(x, μ, σ) = (1 / (√(2π) σ)) e^( −(x−μ)² / (2σ²) )
  so that P(xk|Ci) = g(xk, μCi, σCi)
Naïve Bayes Classifier cont..
• To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each
class Ci
• The classifier predicts class Ci if and only if
      P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
Naïve Bayes Classifier cont..
• What happens if we end up with a probability value of
zero for some P(xk|Ci)?
• Plugging this zero value into the equation above would return a zero
probability for P(X|Ci)
• A zero probability cancels the effects of all the other (posteriori)
probabilities (on Ci) involved in the product
• A common remedy is the Laplacian correction (add-one smoothing): add 1
to each count so that no estimated probability is exactly zero
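A minimal Python sketch of the naïve Bayesian classifier on the buys_computer data, with the Laplacian correction applied so no estimate is exactly zero; the query tuple matches the standard worked example, and all names are illustrative.

```python
from collections import Counter

# Training set D and attribute index map, as in the buys_computer table above
D = [
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def naive_bayes_predict(tuples, x):
    """Return the class Ci that maximizes P(X|Ci) P(Ci), with add-one smoothing."""
    classes = Counter(t[-1] for t in tuples)
    n = len(tuples)
    best_class, best_score = None, -1.0
    for ci, count_ci in classes.items():
        rows_ci = [t for t in tuples if t[-1] == ci]
        score = count_ci / n                       # P(Ci) = |Ci,D| / |D|
        for attr, idx in ATTRS.items():
            n_values = len({t[idx] for t in tuples})
            matches = sum(1 for t in rows_ci if t[idx] == x[attr])
            # Laplacian correction: add 1 to the count and the number of
            # distinct attribute values to the denominator.
            score *= (matches + 1) / (count_ci + n_values)
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

# Classify X = (age=youth, income=medium, student=yes, credit_rating=fair)
x = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(naive_bayes_predict(D, x))                   # -> 'yes' for this tuple
```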
K-Nearest Neighbor classifier
• Eager learners, when given a set of training tuples, will
construct a generalization (i.e., classification) model before
receiving new (e.g., test) tuples to classify
• When given a training tuple, a lazy learner simply stores it (or
does only minor processing)
• It waits until it is given a test tuple
• Because lazy learners store the training tuples or “instances,” they are
also referred to as instance-based learners
KNN cont..
• Labor intensive when given large training sets
• Based on learning by comparing a given test tuple with
training tuples that are similar to it
• When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training tuples
that are closest to the unknown tuple.
• These k training tuples are the k “nearest neighbors” of the
unknown tuple.
• “Closeness” is defined in terms of a distance metric, such as
Euclidean distance.
KNN cont..
• The Euclidean distance between two points or tuples, say, X1 =
(x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is
      dist(X1, X2) = √( Σ_{i=1..n} (x1i − x2i)² )
• Normalize the values of each attribute before using this equation
• unknown tuple is assigned the most common class among its k
nearest neighbors.
Example
Height(cm)   Weight(kg)   T-shirt size
158          58           M
158          59           M
158          63           M
160          59           M
160          60           M
163          60           M
163          61           M
160          64           L
163          64           L
165          61           L
165          62           L
165          65           L
168          62           L
168          63           L
168          66           L

New customer Akshay: height 161 cm, weight 61 kg. T-shirt size = ?
Euclidean distance between the first observation and Akshay
= SQRT((161−158)^2 + (61−58)^2) ≈ 4.2
Example cont..
Height(cm)   Weight(kg)   T-shirt size   Distance   Rank
158          58           M              4.2
158          59           M              3.6
158          63           M              3.6
160          59           M              2.2        4
160          60           M              1.4        1
163          60           M              2.2        3
163          61           M              2.0        2
160          64           L              3.2        5
163          64           L              3.6
165          61           L              4.0
165          62           L              4.1
165          65           L              5.7
168          62           L              7.1
168          63           L              7.3
168          66           L              8.6

Similarly, calculate the distance of all tuples from Akshay and give the
smallest distance rank 1.
Example cont..
• Let k = 5; the algorithm then searches for the 5 customers closest to
Akshay
• Look at their categories and assign the most common class to Akshay
• E.g., 4 of them have T-shirt size M and 1 has L, so the classification for
Akshay is M (a code sketch follows below)
• Also used for numeric or real-valued prediction for a given
unknown tuple
• The classifier returns the average value of the real-valued
labels associated with the k nearest neighbors of the unknown
tuple.
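A short Python sketch of the k-nearest-neighbor step just described, using the T-shirt data and k = 5; names are illustrative.

```python
from math import sqrt
from collections import Counter

# (height_cm, weight_kg, t_shirt_size) from the table above
shirts = [
    (158, 58, "M"), (158, 59, "M"), (158, 63, "M"), (160, 59, "M"),
    (160, 60, "M"), (163, 60, "M"), (163, 61, "M"), (160, 64, "L"),
    (163, 64, "L"), (165, 61, "L"), (165, 62, "L"), (165, 65, "L"),
    (168, 62, "L"), (168, 63, "L"), (168, 66, "L"),
]

def knn_classify(data, query, k=5):
    """Rank training tuples by Euclidean distance to `query`, then majority vote."""
    def dist(row):
        return sqrt((row[0] - query[0]) ** 2 + (row[1] - query[1]) ** 2)
    neighbors = sorted(data, key=dist)[:k]           # the k nearest neighbors
    return Counter(label for _, _, label in neighbors).most_common(1)[0][0]

# New customer Akshay: 161 cm, 61 kg -> 4 of the 5 nearest are 'M', so 'M'
print(knn_classify(shirts, (161, 61)))
```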
KNN cont..
• If the variables in the training sample are measured in different units,
they need to be standardized before computing distances
• Otherwise one variable can dominate the others in the distance calculation
• Min-max and z-score normalization are commonly used; for min-max,
      v’ = (v − minA) / (maxA − minA)
  (see the sketch after this list)
• How can distance be computed for attributes that are not numeric
but categorical, such as color?
• Many methods exist, e.g., distance zero if the colors are the same and
one if they differ
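A tiny sketch of the min-max normalization mentioned above, applied to the height and weight columns of the T-shirt data; purely illustrative.

```python
# Min-max normalization: v' = (v - min_A) / (max_A - min_A), mapping values to [0, 1]
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [158, 158, 158, 160, 160, 163, 163, 160, 163, 165, 165, 165, 168, 168, 168]
weights = [58, 59, 63, 59, 60, 60, 61, 64, 64, 61, 62, 65, 62, 63, 66]
print(min_max_normalize(heights)[:4])   # 158 cm -> 0.0, 160 cm -> 0.2, ...
print(min_max_normalize(weights)[:4])
```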
KNN cont..
• If the value of either one or both variables is missing, the distance for
that attribute is taken as 1 (the maximum for normalized values)
• A good value for k is typically found by trial and error
• Poor accuracy with noisy or irrelevant attributes
• Slow when classifying test tuples
Regression Model
Predictions ?
• Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
– Classification refers to predict categorical class label
– Prediction models continuous-valued functions
Regression Model cont..
• Predictive modeling: Predict data values or construct
generalized linear models based on the database data
Introduction
• Regression analysis is a statistical technique used to describe
relationships among variables
• The simplest case to examine is one in which a variable y,
referred to as the dependent or target variable, may be related to
one variable x, called an independent or explanatory variable, or
simply a regressor
Linear regression
• Linear regression: If the relationship between y and x is
believed to be linear, then the equation
y = w0+w1x;
– The two parameters, w0 and w1, are regression coefficients; they
specify the line and are to be estimated using the data at
hand
• In simplest terms, the purpose of regression is to try to find the
best line or equation that expresses the relationship between y
and x
– the least squares criterion is applied to the known values y1,
y2, …, and x1, x2, … to solve for the coefficients
• This minimizes the error between the actual data and the estimate
of the line.
Linear regression cont..
• The coefficients are estimated by
      w1 = Σ_{i=1..n} (xi − x̄)(yi − ȳ) / Σ_{i=1..n} (xi − x̄)²
      w0 = ȳ − w1 x̄
  where x̄ and ȳ are the means of the x and y values
Linear regression
Consider the following data points
x (Experience):   3   8   9  13   3   6  11  21   1  16
y (Salary):      30  57  64  72  36  43  59  90  20  83
For the given data we compute x̄ = 9.1 and ȳ = 55.4
Linear regression cont..
• The equation of the least squares line is estimated by
y = 23.6+3.5x.
• Using this equation, we can predict that the
salary of a college graduate with, say, 10 years of experience
will be $58,600
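A short Python sketch reproducing the least-squares fit for the experience/salary data above; note the intercept comes out as about 23.2 when w1 is not rounded first, versus the 23.6 quoted on the slide, while the prediction at 10 years is still about 58.6.

```python
# Least-squares estimates for y = w0 + w1*x on the experience/salary data above
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(x) / len(x)                       # 9.1
y_bar = sum(y) / len(y)                       # 55.4
w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
w0 = y_bar - w1 * x_bar
print(f"y = {w0:.1f} + {w1:.2f}x")            # about y = 23.2 + 3.54x
# (the slide's 23.6 comes from rounding w1 to 3.5 before computing w0)
print(f"predicted salary at 10 years: {w0 + w1 * 10:.1f}")   # about 58.6, i.e. $58,600
```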
Multiple Linear regression
• Extension of straight-line regression and involve more than
one predictor variable
• Example
• Say there are 2 predictor variables or attributes, A1 and A2, describing a
tuple X; then the multiple linear regression model is
      y = w0 + w1x1 + w2x2
• Least squares can be extended to solve for the coefficients, but the
equations become tedious to solve by hand
• Statistical software packages such as SPSS are typically used
Nonlinear Regression
• Used to model data that does not show a linear dependence
• e.g., the relationship between the response variable and the predictor
variable can be modeled by a polynomial function
      y = w0 + w1x + w2x² + w3x³
• For a linear transformation of the above equation, define new
variables
      x1 = x,  x2 = x²,  x3 = x³
• We get y = w0 + w1x1 + w2x2 + w3x3, which can be solved using least
squares
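A tiny sketch of the variable substitution described above: the cubic model is fitted as an ordinary linear least-squares problem. NumPy is assumed to be available, and the data points are illustrative.

```python
import numpy as np

# y = w0 + w1*x + w2*x^2 + w3*x^3 fitted as a linear model in x1=x, x2=x^2, x3=x^3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # illustrative data, not from the slides
y = np.array([2.1, 9.8, 28.9, 66.0, 127.2])

X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # design matrix [1, x1, x2, x3]
w, *_ = np.linalg.lstsq(X, y, rcond=None)               # least-squares coefficients
print(w)                                                 # [w0, w1, w2, w3]
```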
Logistic regression
• Linear regression is used to model continuous-valued functions
• To predict categorical labels, generalized linear models such as logistic
regression are used
• Logistic regression models the probability of some event occurring as a function
of a set of predictor variables
• Used for classification problems, e.g., binary classification (0 or 1)
• It builds on the linear model but passes its output through the sigmoid
(logistic) function
• The sigmoid maps predicted values to probabilities, i.e., it maps any real value
to a value between 0 and 1:
      σ(z) = 1 / (1 + e^(−z))
Logistic regression cont..
• As z tends to ∞, σ(z) tends to 1, because e^(−z) tends to 0.
• As z tends to −∞, σ(z) tends to 0, because e^(−z) tends to ∞. Exactly
at z = 0, σ(z) is equal to 0.5.
• So it is a monotonic function.
• Plot of the sigmoid curve
• y will always lie between 0 and 1.
Logistic regression cont..
• A linear function fails to represent this, because its value may be
greater than 1 or less than 0
• In linear regression the formula used is y = w·x
• For logistic regression the modified formula is
      y = σ(w·x), where w includes w0, w1, …, wn if you have n
features, and x is x1, x2, …, xn
• For a single feature,
      y = σ(z) = σ(w0 + w1x) = 1 / (1 + e^(−(w0 + w1x)))
Logistic regression cont..
Decision boundary
• The classifier gives us a set of outputs or classes based on a
probability score between 0 and 1
• E.g., we have classes 1 (red) and 0 (orange)
• We decide a threshold value above which we classify values
into class red and below which into class orange
• So if we choose the threshold as 0.5, then a value of 0.7 returned by the
prediction function is classified as class red, and 0.2 as class orange
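A brief Python sketch putting the sigmoid and the 0.5 decision threshold together; the weights w0 and w1 are illustrative values, not fitted ones.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict(x, w0, w1, threshold=0.5):
    """y = sigmoid(w0 + w1*x); class 1 ('red') if y >= threshold, else 0 ('orange')."""
    p = sigmoid(w0 + w1 * x)
    return (1 if p >= threshold else 0), p

# Illustrative (not fitted) weights: w0 = -4, w1 = 2
print(predict(3.0, -4, 2))    # z = 2  -> p ~ 0.88 -> class 1 (red)
print(predict(1.0, -4, 2))    # z = -2 -> p ~ 0.12 -> class 0 (orange)
```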