Unit 3

ML

Uploaded by mohan

LEARNING TECHNIQUES

• Supervised Learning
• Decision Tree Learning
• Instance Based Learning: k-Nearest neighbor algorithm
• Support Vector Machines
• Ensemble learning: boosting, bagging
• Artificial Neural Networks
• Probabilistic Models
• Maximum Likelihood Estimation
• MAP
• Naive Bayes Classifiers
• Minimum description length principle
• Bayesian Networks
• Unsupervised Learning
• Principal Components Analysis
• K-means and Hierarchical Clustering
• Gaussian Mixture Models
• EM algorithm
• Hidden Markov Models

Decision Tree Algorithm

• A decision tree is a graphical representation that uses a branching methodology to exemplify all possible outcomes of a decision, based on certain conditions.
• In a decision tree, an internal node represents a test on an attribute, each branch represents an outcome of the test, and a leaf node represents a class label, i.e., the decision made after evaluating all of the attributes.
• The classification rules are represented by the paths from the root to the leaf nodes.

Types of Decision Trees

• Classification Trees: the default kind of decision tree, used to separate a dataset into different classes based on the response variable. These are generally used when the response variable is categorical in nature.
• Regression Trees: used when the response or target variable is continuous or numerical. These are generally used in predictive problems rather than classification.
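To make the node/branch/leaf structure concrete, a decision tree can be represented as nested attribute tests. This minimal Python sketch uses a hypothetical "play tennis" style tree (the attribute names and labels are illustrative, not from the slides): internal nodes are (attribute, branches) pairs and leaves are class labels.

```python
# A tiny hand-built decision tree over hypothetical weather attributes.
# Internal nodes are (attribute, branches) pairs; leaves are class labels.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",                                   # leaf node
    "rainy": ("windy", {"yes": "no", "no": "yes"}),
})

def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[record[attribute]]
    return node

print(classify(TREE, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

The path taken from the root to the leaf is exactly one classification rule, which is the point made later when rules are extracted from trees.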

Why Should You Use the Decision Tree Machine Learning Algorithm?

• These machine learning algorithms help make decisions under uncertainty and improve communication, as they present a visual representation of a decision situation.
• Decision tree machine learning algorithms help a data scientist capture how the operation of a situation or model would have changed if a different decision had been taken.
• Decision tree algorithms help make optimal decisions by allowing a data scientist to traverse forward and backward calculation paths.
When to Use the Decision Tree Machine Learning Algorithm

• Decision trees are robust to errors: if the training data contains errors, decision tree algorithms are well suited to the problem.
• Decision trees are best suited for problems where instances are represented by attribute-value pairs.
• Decision trees can be used when the training data has missing values, as they handle missing values gracefully by looking at the data in other columns.
• Decision trees are best suited when the target function has discrete output values.

Advantages of Using Decision Tree Machine Learning Algorithms

• Decision trees are very intuitive and can be explained to anyone with ease. People from a non-technical background can also decipher the hypothesis drawn from a decision tree, as it is self-explanatory.
• Data type is not a constraint: decision trees can handle both categorical and numerical variables.
• Decision tree algorithms do not require any assumption of linearity in the data and hence can be used where the parameters are non-linearly related. They also make no assumptions about classifier structure or space distribution.
• These algorithms are useful in data exploration. Decision trees implicitly perform feature selection, which is very important in predictive analytics: when a decision tree is fit to a training dataset, the nodes at the top, on which the tree is split, are the most important variables in the dataset, so feature selection is completed by default.
• Decision trees save data preparation time, as they are not sensitive to missing values and outliers. Missing values do not stop you from splitting the data, and outliers do not affect the trees, because splitting happens on the samples within each split range and not on exact absolute values.

Drawbacks of Using Decision Tree Machine Learning Algorithms

• The more decisions there are in a tree, the lower the accuracy of any expected outcome.
• A major drawback is that outcomes may be based on expectations. When decisions are made in real time, the payoffs and resulting outcomes might not be the same as expected or planned. This can lead to unrealistic decision trees and bad decision making. Irrational expectations can introduce major errors and flaws into decision tree analysis, as it is not always possible to plan for every eventuality that can arise from a decision.
• Decision trees do not fit well for continuous variables, resulting in instability and classification plateaus.
• Decision trees are easy to use compared with other decision-making models, but creating large decision trees with several branches is a complex and time-consuming task.
• Decision tree machine learning algorithms consider only one attribute at a time and might not be best suited to actual data in the decision space.
• Large decision trees with multiple branches are not comprehensible and pose several presentation difficulties.

Applications of the Decision Tree Machine Learning Algorithm

• Decision trees are among the popular machine learning algorithms and find great use in finance for option pricing.
• Remote sensing is an application area for pattern recognition based on decision trees.
• Decision tree algorithms are used by banks to classify loan applicants by their probability of defaulting on payments.
• Gerber Products, a popular baby product company, used a decision tree machine learning algorithm to decide whether to continue using PVC (polyvinyl chloride) plastic in its products.
• Rush University Medical Center has developed a tool named Guardian that uses a decision tree machine learning algorithm to identify at-risk patients and disease trends.

Bayesian Classification: Why?

• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
• Foundation: based on Bayes' theorem.
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem: Basics

• Let X be a data sample ("evidence") whose class label is unknown.
• Let H be the hypothesis that X belongs to class C.
• Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X.
• P(H) (prior probability): the initial probability.
  – E.g., X will buy a computer, regardless of age, income, …
• P(X): the probability that the sample data is observed.
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds.
  – E.g., given that X will buy a computer, the probability that X is 31..40 with medium income.
Bayesian Theorem

• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be written as: posterior = likelihood × prior / evidence.
• Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
• Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.

Towards the Naïve Bayesian Classifier

• Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-D attribute vector X = (x1, x2, …, xn).
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X).
• This can be derived from Bayes' theorem:

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only

  P(Ci|X) ∝ P(X|Ci) P(Ci)

  needs to be maximized.

Derivation of the Naïve Bayes Classifier

• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

  P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• This greatly reduces the computation cost: only count the class distribution.
• If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
• If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

  g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

  and P(xk|Ci) = g(xk, μCi, σCi)

Naïve Bayesian Classifier: Training Dataset

Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no

Naïve Bayesian Classifier: An Example

• P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
         P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X|buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
• P(X|Ci) × P(Ci):
  P(X|buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X|buys_computer = "no") × P(buys_computer = "no") = 0.007
  Therefore, X belongs to class "buys_computer = yes".

Avoiding the 0-Probability Problem

• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  P(X|Ci) = ∏(k=1..n) P(xk|Ci)

• Example: suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10).
• Use the Laplacian correction (or Laplacian estimator):
  – Add 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  – The "corrected" probability estimates are close to their "uncorrected" counterparts.
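The arithmetic in the worked example above can be reproduced directly. This minimal Python sketch takes the conditional probabilities from the slide's buys_computer example, compares the two unnormalized posteriors P(X|Ci)·P(Ci), and then shows the Laplacian correction on the 1000-tuple income counts:

```python
from math import prod

# Conditional probabilities from the worked example, per class, for
# X = (age<=30, income=medium, student=yes, credit_rating=fair).
likelihoods = {
    "yes": [2/9, 4/9, 6/9, 6/9],
    "no":  [3/5, 2/5, 1/5, 2/5],
}
priors = {"yes": 9/14, "no": 5/14}

# Unnormalized posterior: P(X|Ci) * P(Ci)
score = {c: prod(likelihoods[c]) * priors[c] for c in priors}
prediction = max(score, key=score.get)
print(round(score["yes"], 3), round(score["no"], 3), prediction)  # 0.028 0.007 yes

# Laplacian correction: add 1 to each case so no estimate is zero.
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values()) + len(counts)          # 1000 + 3 = 1003
laplace = {v: (n + 1) / total for v, n in counts.items()}
print(laplace["low"])  # 1/1003, instead of an impossible 0
```

Without the correction, any test tuple with income = low would receive probability zero for every class containing no low-income training tuples, regardless of the other attributes.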
Naïve Bayesian Classifier: Comments

• Advantages
  – Easy to implement
  – Good results obtained in most of the cases
• Disadvantages
  – Assumption of class conditional independence, and therefore loss of accuracy
  – Practically, dependencies exist among variables
    • E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    • Dependencies among these cannot be modeled by the naïve Bayesian classifier
• How to deal with these dependencies? Bayesian belief networks

Rule-Based Classifier

• Classify records by using a collection of "if…then…" rules
• Rule: (Condition) → y
  – where
    • Condition is a conjunction of attribute tests
    • y is the class label
  – LHS: rule antecedent or condition
  – RHS: rule consequent
  – Examples of classification rules:
    • (Blood Type = warm) ∧ (Lay Eggs = yes) → Birds
    • (Taxable Income < 50K) ∧ (Refund = yes) → Evade = No

Rule Extraction from a Decision Tree

• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive

• Example: rule extraction from our buys_computer decision tree (root tests age?; the <=30 branch tests student?, the 31..40 branch is a leaf, and the >40 branch tests credit rating?):
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = yes
  IF age = old AND credit_rating = fair THEN buys_computer = no

Rule-Based Classifier (Example)

Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
human         | warm       | yes        | no      | no            | mammals
python        | cold       | no         | no      | no            | reptiles
salmon        | cold       | no         | no      | yes           | fishes
whale         | warm       | yes        | no      | yes           | mammals
frog          | cold       | no         | no      | sometimes     | amphibians
komodo        | cold       | no         | no      | no            | reptiles
bat           | warm       | yes        | yes     | no            | mammals
pigeon        | warm       | no         | yes     | no            | birds
cat           | warm       | yes        | no      | no            | mammals
leopard shark | cold       | yes        | no      | yes           | fishes
turtle        | cold       | no         | no      | sometimes     | reptiles
penguin       | warm       | no         | no      | sometimes     | birds
porcupine     | warm       | yes        | no      | no            | mammals
eel           | cold       | no         | no      | yes           | fishes
salamander    | cold       | no         | no      | sometimes     | amphibians
gila monster  | cold       | no         | no      | no            | reptiles
platypus      | warm       | no         | no      | no            | mammals
owl           | warm       | no         | yes     | no            | birds
dolphin       | warm       | yes        | no      | yes           | mammals
eagle         | warm       | no         | yes     | no            | birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Application of a Rule-Based Classifier

• A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name         | Blood Type | Give Birth | Can Fly | Live in Water | Class
hawk         | warm       | no         | yes     | no            | ?
grizzly bear | warm       | yes        | no      | no            | ?

• The rule R1 covers the hawk => Bird
• The rule R3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy

• Coverage of a rule: the fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule: the fraction of records satisfying the antecedent that also satisfy the consequent

Tid | Refund | Marital Status | Taxable Income | Class
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Rule: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
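The coverage and accuracy figures quoted above can be checked mechanically. A small Python sketch over the ten-tuple table:

```python
# The ten training records from the slide: (Refund, Marital Status, Income, Class).
records = [
    ("Yes", "Single", 125, "No"),  ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),   ("No", "Single", 90, "Yes"),
]

# Rule: (Status = Single) -> No
covered = [r for r in records if r[1] == "Single"]   # satisfy the antecedent
correct = [r for r in covered if r[3] == "No"]       # also satisfy the consequent

coverage = len(covered) / len(records)
accuracy = len(correct) / len(covered)
print(coverage, accuracy)  # 0.4 0.5
```

Four of the ten records are Single (coverage 40%), and of those four only Tids 1 and 3 have class No (accuracy 50%).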
How Does a Rule-Based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
lemur         | warm       | yes        | no      | no            | ?
turtle        | cold       | no         | no      | sometimes     | ?
dogfish shark | cold       | yes        | no      | yes           | ?

• A lemur triggers rule R3, so it is classified as a mammal
• A turtle triggers both R4 and R5
• A dogfish shark triggers none of the rules

Characteristics of a Rule-Based Classifier

• Mutually exclusive rules
  – A classifier contains mutually exclusive rules if the rules are independent of each other
  – Every record is covered by at most one rule
• Exhaustive rules
  – A classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  – Each record is covered by at least one rule
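The three triggering cases above can be sketched directly: each rule is a predicate plus a class label, and a record may trigger zero, one, or several rules (which is exactly why ordering and default classes are needed):

```python
# Rules R1-R5 from the slide, each as (predicate, class label).
rules = [
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "Birds"),    # R1
    (lambda r: r["give_birth"] == "no" and r["water"] == "yes", "Fishes"),     # R2
    (lambda r: r["give_birth"] == "yes" and r["blood"] == "warm", "Mammals"),  # R3
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "no", "Reptiles"),  # R4
    (lambda r: r["water"] == "sometimes", "Amphibians"),                       # R5
]

def triggered(record):
    """Return the class labels of every rule the record triggers."""
    return [label for cond, label in rules if cond(record)]

lemur  = {"blood": "warm", "give_birth": "yes", "can_fly": "no", "water": "no"}
turtle = {"blood": "cold", "give_birth": "no",  "can_fly": "no", "water": "sometimes"}
shark  = {"blood": "cold", "give_birth": "yes", "can_fly": "no", "water": "yes"}

print(triggered(lemur))   # ['Mammals']
print(triggered(turtle))  # ['Reptiles', 'Amphibians']
print(triggered(shark))   # []
```

This rule set is neither mutually exclusive (the turtle triggers two rules) nor exhaustive (the dogfish shark triggers none).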

From Decision Trees to Rules

(Figure: the decision tree — root Refund?; Yes → NO; No → Marital Status?; {Single, Divorced} → Taxable Income? (< 80K → NO, > 80K → YES); {Married} → NO — alongside the ten-tuple Refund/Marital Status/Taxable Income/Cheat table.)

Classification rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

• Rules are mutually exclusive and exhaustive
• The rule set contains as much information as the tree

Rules Can Be Simplified

Initial rule: (Refund = No) ∧ (Status = Married) → No
Simplified rule: (Status = Married) → No

Effect of Rule Simplification

• Rules are no longer mutually exclusive
  – A record may trigger more than one rule
  – Solution?
    • Ordered rule set
    • Unordered rule set: use voting schemes
• Rules are no longer exhaustive
  – A record may not trigger any rule
  – Solution?
    • Use a default class

Ordered Rule Set

• Rules are rank ordered according to their priority
  – An ordered rule set is known as a decision list
• When a test record is presented to the classifier
  – It is assigned to the class label of the highest-ranked rule it has triggered
  – If none of the rules fire, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name   | Blood Type | Give Birth | Can Fly | Live in Water | Class
turtle | cold       | no         | no      | sometimes     | ?

With this ordering, the turtle matches R4 before R5, so it is classified as a reptile.
Rule Ordering Schemes

• Rule-based ordering
  – Individual rules are ranked based on their quality
• Class-based ordering
  – Rules that belong to the same class appear together

Rule-based ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

Building Classification Rules

• Direct method: extract rules directly from data
  – e.g., RIPPER, CN2, Holte's 1R
• Indirect method: extract rules from other classification models (e.g., decision trees, neural networks)
  – e.g., C4.5rules
Rule Extraction from the Training Data

• Sequential covering algorithm: extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
  – Rules are learned one at a time
  – Each time a rule is learned, the tuples covered by the rule are removed
  – The process repeats on the remaining tuples until a termination condition is met, e.g., there are no more training examples, or the quality of a returned rule falls below a user-specified threshold
• Compare with decision-tree induction, which learns a set of rules simultaneously

How to Learn One Rule?

• Start with the most general rule possible: condition = empty
• Add new attribute tests by adopting a greedy depth-first strategy
  – Pick the one that most improves the rule quality
• Rule-quality measures consider both coverage and accuracy
  – FOIL gain (in FOIL and RIPPER) assesses the information gained by extending the condition:

    FOIL_Gain = pos′ × (log2(pos′ / (pos′ + neg′)) − log2(pos / (pos + neg)))

    It favors rules that have high accuracy and cover many positive tuples.
• Rule pruning is based on an independent set of test tuples:

    FOIL_Prune(R) = (pos − neg) / (pos + neg)

  where pos and neg are the numbers of positive and negative tuples covered by R.
  If FOIL_Prune is higher for a pruned version of R, prune R.
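The FOIL gain formula rewards extensions that raise a rule's accuracy while still covering many positives. A minimal sketch of both measures, applied to hypothetical counts (not taken from the slides):

```python
from math import log2

def foil_gain(pos, neg, pos2, neg2):
    """Information gained by extending a rule's condition.

    pos/neg: positive and negative tuples covered before the extension;
    pos2/neg2: the same counts after the extension.
    """
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """Pruning measure evaluated on an independent set of test tuples."""
    return (pos - neg) / (pos + neg)

# Extending a rule so it keeps 90 of 100 positives but only 10 of 100
# negatives raises accuracy from 0.5 to 0.9: a large positive gain.
gain = foil_gain(100, 100, 90, 10)
print(round(gain, 2))  # 76.32
```

An extension that leaves the accuracy unchanged, e.g. `foil_gain(100, 100, 50, 50)`, yields zero gain, so greedy growing would not choose it.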

Direct Method: Sequential Covering

1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove the training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met

Example of Sequential Covering

(Figure: (i) the original data; (ii) step 1, in which the first rule R1 is grown.)
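The four steps above can be sketched as a loop. This toy Python version uses a deliberately simplified `learn_one_rule` (a hypothetical helper, not FOIL itself) that just picks the single attribute-value test whose covered records are most accurately labelled with the target class, then removes the covered records and repeats:

```python
def learn_one_rule(records, target):
    """Toy Learn-One-Rule: pick the (attribute, value) test whose covered
    records are most accurately labelled `target`."""
    best = None
    for attr in records[0][0]:
        for value in {r[0][attr] for r in records}:
            covered = [r for r in records if r[0][attr] == value]
            acc = sum(1 for r in covered if r[1] == target) / len(covered)
            if best is None or acc > best[0]:
                best = (acc, attr, value)
    _, attr, value = best
    return attr, value

def sequential_covering(records, target):
    rules = []
    while any(label == target for _, label in records):        # stopping criterion
        attr, value = learn_one_rule(records, target)          # grow a rule
        rules.append((attr, value, target))
        records = [r for r in records if r[0][attr] != value]  # remove covered
    return rules

data = [({"fly": "yes", "legs": "2"}, "bird"),
        ({"fly": "yes", "legs": "2"}, "bird"),
        ({"fly": "no",  "legs": "4"}, "mammal"),
        ({"fly": "no",  "legs": "2"}, "mammal")]
print(sequential_covering(data, "bird"))  # one rule: fly=yes -> bird
```

Real systems such as RIPPER grow conjunctions of several tests and use FOIL gain rather than raw accuracy, but the cover-and-remove loop is the same.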
Example of Sequential Covering (continued)

(Figure: (iii) step 2, in which R2 is grown after the records covered by R1 are removed; (iv) step 3.)

Aspects of Sequential Covering

• Rule growing
• Instance elimination
• Rule evaluation
• Stopping criterion
• Rule pruning

Instance Elimination

• Why do we need to eliminate instances?
  – Otherwise, the next rule is identical to the previous rule
• Why do we remove positive instances?
  – To ensure that the next rule is different
• Why do we remove negative instances?
  – To prevent underestimating the accuracy of the rule
  – Compare rules R2 and R3 in the diagram

(Figure: a scatter of positive (+) and negative (−) instances with rules R1, R2, and R3 drawn as regions.)

Stopping Criterion and Rule Pruning

• Stopping criterion
  – Compute the gain
  – If the gain is not significant, discard the new rule
• Rule pruning
  – Similar to post-pruning of decision trees
  – Reduced error pruning:
    • Remove one of the conjuncts in the rule
    • Compare the error rate on a validation set before and after pruning
    • If the error improves, prune the conjunct

Advantages of Rule-Based Classifiers

• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees

Classification: A Mathematical Mapping

• Classification predicts categorical class labels
• E.g., personal homepage classification
  – xi = (x1, x2, x3, …), yi = +1 or −1
  – x1: # of occurrences of the word "homepage"
  – x2: # of occurrences of the word "welcome"
• Mathematically
  – x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}
  – We want a function f: X → Y
Linear Classification

• Binary classification problem
• The data above the red line belongs to class 'x'; the data below the red line belongs to class 'o'
• Examples: SVM, Perceptron, probabilistic classifiers

(Figure: x's above and o's below a separating line.)

SVM—Support Vector Machines

• A classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a higher dimension
• In the new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary")
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

SVM—History and Applications

• Vapnik and colleagues (1992): groundwork from Vapnik and Chervonenkis' statistical learning theory of the 1960s
• Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
• Used for both classification and prediction
• Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

SVM—General Philosophy

(Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the tuples closest to the boundary.)

SVM—Margins and Support Vectors

(Figure: training tuples, the maximum-margin hyperplane, and its support vectors.)

SVM—When Data Is Linearly Separable

• Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its associated class label
• There are an infinite number of separating lines (hyperplanes), but we want to find the best one (the one that minimizes classification error on unseen data)
• SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM—Linearly Separable

• A separating hyperplane can be written as
  W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b is a scalar (bias)
• For 2-D data, it can be written as
  w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin:
  H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1
• Any training tuples that fall on hyperplanes H1 or H2 (the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → quadratic programming (QP) → Lagrangian multipliers

Why Is SVM Effective on High-Dimensional Data?

• The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
• The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
• If all other training examples were removed and the training repeated, the same separating hyperplane would be found
• The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
• Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
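Once W and b are known, classification reduces to the sign of W·X + b, i.e., which side of the hyperplane a point falls on. A minimal sketch with hypothetical weights (chosen by hand, not fitted by any QP solver):

```python
def predict(w, b, x):
    """Classify x by which side of the hyperplane w.x + b = 0 it lies on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score >= 0 else -1

# Hypothetical 2-D hyperplane x1 + x2 - 3 = 0, i.e. w = (1, 1), b = -3.
w, b = [1.0, 1.0], -3.0
print(predict(w, b, [2.5, 2.5]))   # above the line -> +1
print(predict(w, b, [0.5, 0.5]))   # below the line -> -1
```

Training an SVM means choosing W and b so that this decision function separates the classes with the widest possible margin; the prediction step itself stays this simple.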

SVM—Linearly Inseparable

• Transform the original input data into a higher-dimensional space
• Search for a linear separating hyperplane in the new space

(Figure: data plotted on axes A1 and A2 that cannot be separated by a line in the original space.)

SVM—Kernel Functions

• Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
• Typical kernel functions: (figure showing the polynomial, Gaussian radial basis, and sigmoid kernel formulas)
• SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)

Nearest Neighbor Classifiers

• Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck

(Figure: a test record, the distances computed to the training records, and the k "nearest" records chosen.)

Nearest Neighbor Method

• Majority vote within the k nearest neighbors
• Example (figure): for a new point, k = 1 gives class brown, while k = 3 gives class green
Nearest-Neighbor Classifiers

• Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute the distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor

(Figure: (a) the 1-nearest, (b) 2-nearest, and (c) 3-nearest neighbors of an unknown record X.)

• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

1-Nearest Neighbor

• The decision boundary of a 1-NN classifier forms a Voronoi diagram of the training records

Nearest Neighbor Classification

• Compute the distance between two points, e.g., the Euclidean distance:

  d(p, q) = √( Σi (pi − qi)² )

• Determine the class from the nearest neighbor list
  – Take the majority vote of the class labels among the k nearest neighbors
  – Optionally weigh each vote according to distance
    • weight factor w = 1/d²
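The distance computation and majority vote combine into a few lines. A minimal k-NN sketch on hypothetical 2-D points (two well-separated clusters, labels 'o' and 'x'):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_classify(train, query, k):
    """train: list of (point, label) pairs; classify query by majority
    vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "o"), ((1, 0), "o"), ((0, 1), "o"),
         ((5, 5), "x"), ((6, 5), "x"), ((5, 6), "x")]
print(knn_classify(train, (1, 1), k=3))      # o
print(knn_classify(train, (5.5, 5.5), k=3))  # x
```

Sorting the whole training set is the "lazy" part: nothing is precomputed at training time, and every query pays the full distance-computation cost.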

Nearest Neighbor Classification…

• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes
• Scaling issues
  – Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
  – Example:
    • height of a person may vary from 1.5 m to 1.8 m
    • weight of a person may vary from 90 lb to 300 lb
    • income of a person may vary from $10K to $1M
Nearest Neighbor Classification…

• Problem with the Euclidean measure:
  – High-dimensional data: curse of dimensionality
  – It can produce counter-intuitive results, e.g.:
    111111111110 vs. 011111111111: d = 1.4142
    100000000000 vs. 000000000001: d = 1.4142
  – Solution: normalize the vectors to unit length

• k-NN classifiers are lazy learners
  – They do not build models explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive
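The two distances quoted above are easy to verify: each pair of bit vectors differs in exactly two positions, so both Euclidean distances come out as √2 even though the first pair is almost all ones and the second almost all zeros:

```python
from math import dist

a = [int(ch) for ch in "111111111110"]
b = [int(ch) for ch in "011111111111"]
c = [int(ch) for ch in "100000000000"]
d = [int(ch) for ch in "000000000001"]

# Each pair differs in two coordinates, so both distances are sqrt(2).
print(round(dist(a, b), 4), round(dist(c, d), 4))  # 1.4142 1.4142
```

The counter-intuitive part is that the raw distance treats the nearly identical dense pair and the nearly empty sparse pair as equally far apart, which is why normalizing to unit length helps.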

Discussion on the k-NN Algorithm

• The k-NN algorithm for continuous-valued target functions
  – Calculate the mean value of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
  – Weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors:

    w ≡ 1 / d(xq, xi)²

  – Similarly for real-valued target functions
• Robust to noisy data by averaging the k nearest neighbors
• Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
  – To overcome it, stretch the axes or eliminate the least relevant attributes

Lazy vs. Eager Learning

• Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
• Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
  – Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  – Eager methods must commit to a single hypothesis that covers the entire instance space

Lazy Learner: Instance-Based Methods

• Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
• Typical approaches
  – k-nearest neighbor approach: instances represented as points in a Euclidean space
  – Locally weighted regression: constructs a local approximation
  – Case-based reasoning: uses symbolic representations and knowledge-based inference

Case-Based Reasoning (CBR)

• CBR: uses a database of problem solutions to solve new problems
• Stores symbolic descriptions (tuples or cases), not points in a Euclidean space
• Applications: customer service (product-related diagnosis), legal rulings
• Methodology
  – Instances represented by rich symbolic descriptions (e.g., function graphs)
  – Search for similar cases; multiple retrieved cases may be combined
  – Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
• Challenges
  – Finding a good similarity metric
  – Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases
What Is Prediction? Linear Regression
• Linear regression: involves a response variable y and a single predictor
• (Numerical) prediction is similar to classification variable x
– construct a model y = w0 + w1 x
– use model to predict continuous or ordered value for a given input where w0 (y-intercept) and w1 (slope) are regression coefficients
• Prediction is different from classification • Method of least squares: estimates the best-fitting straight line
– Classification refers to predicting a categorical class label
– Prediction models continuous-valued functions
• Major method for prediction: regression

w1 = Σ_{i=1}^{|D|} (xi – x̄)(yi – ȳ) / Σ_{i=1}^{|D|} (xi – x̄)² ,   w0 = ȳ – w1 x̄
– model the relationship between one or more independent or predictor
variables and a dependent or response variable • Multiple linear regression: involves more than one predictor variable
• Regression analysis – Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
– Linear and multiple regression – Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
– Non-linear regression – Solvable by extension of least square method or using SAS, S-Plus
– Other regression methods: generalized linear model, Poisson regression, – Many nonlinear functions can be transformed into the above
log-linear models, regression trees
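As an illustrative sketch (not part of the original slides), the least-squares estimates for w0 and w1 above can be computed directly; the function name `linear_regression` is our own:

```python
def linear_regression(xs, ys):
    # Least-squares estimates for y = w0 + w1 * x
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

w0, w1 = linear_regression([1, 2, 3, 4], [2, 4, 6, 8])  # data on the line y = 2x
```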

Nonlinear Regression Other Regression-Based Models


• Some nonlinear models can be modeled by a polynomial function • Generalized linear model:
• A polynomial regression model can be transformed into linear – Foundation on which linear regression can be applied to modeling
regression model. For example, categorical response variables
– Variance of y is a function of the mean value of y, not a constant
y = w0 + w1 x + w2 x^2 + w3 x^3
– Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables
convertible to linear with new variables x2 = x^2, x3 = x^3: y = w0 + w1 x + w2 x2 + w3 x3
– Poisson regression: models the data that exhibit a Poisson distribution
• Other functions, such as power function, can also be transformed • Log-linear models: (for categorical data)
to linear model – Approximate discrete multidimensional prob. distributions
• Some models are intractable nonlinear (e.g., sum of exponential – Also useful for data compression and smoothing
terms) • Regression trees and model trees
– possible to obtain least square estimates through extensive – Trees to predict continuous values rather than class labels
calculation on more complex formulae
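To illustrate the variable-substitution idea, here is an assumed toy example (not from the slides): data generated by a pure quadratic becomes an exact linear fit once we regress on the new variable z = x².

```python
def fit_simple(xs, ys):
    # Ordinary least squares for y = w0 + w1 * x
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    w1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) \
        / sum((x - xb) ** 2 for x in xs)
    return yb - w1 * xb, w1

xs = [1, 2, 3, 4]
ys = [3 * x ** 2 for x in xs]   # data generated by y = 3 x^2
zs = [x ** 2 for x in xs]       # new variable z = x^2
w0, w1 = fit_simple(zs, ys)     # linear fit on z recovers w1 = 3, w0 = 0
```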

Regression Trees and Model Trees Predictive Modeling in Multidimensional Databases

• Regression tree: proposed in CART system (Breiman et al. 1984) • Predictive modeling: Predict data values or construct
– CART: Classification And Regression Trees generalized linear models based on the database data
– Each leaf stores a continuous-valued prediction
• One can only predict value ranges or category distributions
• Method outline:
– It is the average value of the predicted attribute for the training tuples that reach the leaf
– Minimal generalization
– Attribute relevance analysis
• Model tree: proposed by Quinlan (1992)
– Generalized linear model construction
– Each leaf holds a regression model—a multivariate linear equation for – Prediction
the predicted attribute • Determine the major factors which influence the prediction
– A more general case than regression tree – Data relevance analysis: uncertainty measurement, entropy
• Regression and model trees tend to be more accurate than linear regression analysis, expert judgement, etc.
when the data are not represented well by a simple linear model • Multi-level prediction: drill-down and roll-up analysis


Estimating Error Rates I Estimating Error Rates II

• Training Error: • Holdout method


– Partition: Training-and-testing
– The proportion of training records that are misclassified • Given data is randomly partitioned into two independent sets
– Overly optimistic. Classifiers that overfit the datasets can • Training set (e.g., 2/3) for model construction
have poor accuracy on unseen data • Test set (e.g., 1/3) for accuracy estimation
• Unbiased and efficient, but requires a large number of samples
• Used for data sets with a large number of samples
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained


Estimating Error Rates III

• Cross-validation (k-fold, where k = 10 is most popular)


– Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized
data
– Stratified cross-validation: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial
data
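A minimal sketch of the random partitioning step (function name and seed are our own choices):

```python
import random

def k_fold_indices(n, k, seed=0):
    # Randomly partition indices 0..n-1 into k mutually exclusive,
    # roughly equal-sized folds (fold i is the test set at iteration i,
    # the union of the other folds is the training set).
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(20, 5)
```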


Estimating Error Rates IV


• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
– Suppose we are given a data set of d tuples. The data set is sampled d times, with
replacement, resulting in a training set of d samples. The data tuples that did not
make it into the training set end up forming the test set. About 63.2% of the
original data will end up in the bootstrap, and the remaining 36.8% will form the
test set (since (1 – 1/d)^d ≈ e^–1 = 0.368)
– Repeat the sampling procedure k times; overall accuracy of the model:

acc(M) = Σ_{i=1}^{k} ( 0.632 × acc(Mi)_test_set + 0.368 × acc(Mi)_train_set )
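The 63.2% figure can be checked empirically with a short sketch (names are illustrative):

```python
import random

def bootstrap_split(n, seed=0):
    # Sample n indices uniformly with replacement (the training set);
    # indices that are never drawn form the test set.
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = set(range(n)) - set(train)
    return train, test

train, test = bootstrap_split(10000)
frac_in_train = 1 - len(test) / 10000   # close to 0.632 for large n
```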
Model Selection: ROC Curves
Metrics for Performance Evaluation
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer the area is to 0.5 (i.e., the closer the curve is to the diagonal line), the less accurate is the model
 Vertical axis represents the true positive rate
 Horizontal axis represents the false positive rate
 The plot also shows a diagonal line
 A model with perfect accuracy will have an area of 1.0

• Focus on the predictive capability of a model
– Rather than how fast it takes to classify or build models, scalability, etc.
• Confusion Matrix:

                  PREDICTED CLASS
                  Class=Yes  Class=No
ACTUAL  Class=Yes     a         b        a: TP (true positive)
CLASS   Class=No      c         d        b: FN (false negative)
                                         c: FP (false positive)
                                         d: TN (true negative)

Metrics for Performance Evaluation… Limitation of Accuracy


PREDICTED CLASS

Class=Yes Class=No
• Consider a 2-class problem
– Number of Class 0 examples = 9990
Class=Yes a b
ACTUAL (TP) (FN) – Number of Class 1 examples = 10
CLASS
Class=No c d
(FP) (TN)
• If model predicts everything to be class 0,
ad TP  TN accuracy is 9990/10000 = 99.9 %
Accuracy  
a  b  c  d TP  TN  FP  FN – Accuracy is misleading because model does not
• Most widely-used metric: detect any class 1 example

Unit 3 81 Unit 3 82

Cost Matrix Computing Cost of Classification


Cost PREDICTED CLASS
Matrix
PREDICTED CLASS C(i|j) + -
ACTUAL
+ -1 100
C(i|j) Class=Yes Class=No CLASS
- 1 0
Class=Yes C(Yes|Yes) C(No|Yes)
ACTUAL Model M1 PREDICTED CLASS Model M2 PREDICTED CLASS
CLASS Class=No C(Yes|No) C(No|No)
+ - + -
ACTUAL ACTUAL
+ 150 40 + 250 45
CLASS CLASS
- 60 250 - 5 200
C(i|j): Cost of misclassifying class j example as class i
Accuracy = 80% Accuracy = 90%
Cost = 3910 Cost = 4255

Cost vs Accuracy Cost-Sensitive Measures
Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

N = a + b + c + d
Accuracy = (a + d)/N
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

 Precision is biased towards C(Yes|Yes) & C(Yes|No)
 Recall is biased towards C(Yes|Yes) & C(No|Yes)
 F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
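A small sketch of these measures, using the confusion-matrix counts a (TP), b (FN), c (FP); the helper name `prf` is ours:

```python
def prf(a, b, c):
    # a = TP, b = FN, c = FP, as in the confusion matrix above
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)
    return precision, recall, f_measure

p, r, f = prf(a=150, b=40, c=60)   # counts from model M1 above
```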

More measures
• True Positive Rate (TPR) (Sensitivity): a / (a + b)
• True Negative Rate (TNR) (Specificity): d / (c + d)
• False Positive Rate (FPR): c / (c + d)
• False Negative Rate (FNR): b / (a + b)

Predictor Error Measures
• Measure predictor accuracy: measure how far off the predicted value is from the actual known value
• Loss function: measures the error between yi and the predicted value yi’
– Absolute error: |yi – yi’|
– Squared error: (yi – yi’)²
• Test error (generalization error): the average loss over the test set
– Mean absolute error: ( Σ_{i=1}^{d} |yi – yi’| ) / d
– Mean squared error: ( Σ_{i=1}^{d} (yi – yi’)² ) / d
– Relative absolute error: Σ_{i=1}^{d} |yi – yi’| / Σ_{i=1}^{d} |yi – ȳ|
– Relative squared error: Σ_{i=1}^{d} (yi – yi’)² / Σ_{i=1}^{d} (yi – ȳ)²
• The mean squared-error exaggerates the presence of outliers
• Popularly use (square) root mean-square error, similarly, root relative squared error
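A sketch of the absolute- and squared-error measures (illustrative names, toy data):

```python
def error_measures(actual, predicted):
    # Mean absolute error, mean squared error and root mean squared
    # error over a test set of d tuples.
    d = len(actual)
    mae = sum(abs(y - yp) for y, yp in zip(actual, predicted)) / d
    mse = sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / d
    return mae, mse, mse ** 0.5

mae, mse, rmse = error_measures([3, 5, 2], [2, 5, 4])
```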

Ensemble Methods: Increasing the Accuracy


Ensemble Methods
• Construct a set of classifiers from the training
data
• Ensemble methods
– Use a combination of models to increase accuracy
• Predict class label of previously unseen – Combine a series of k learned models, M1, M2, …, Mk, with
records by aggregating predictions made by the aim of creating an improved model M*
multiple classifiers • Popular ensemble methods
– Bagging: averaging the prediction over a collection of
classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers

General Idea
Step 1: Create multiple data sets D1, D2, …, Dt from the original training data D
Step 2: Build multiple classifiers C1, C2, …, Ct
Step 3: Combine the classifiers into C*

Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier (majority vote) makes a wrong prediction:

Σ_{i=13}^{25} C(25, i) ε^i (1 – ε)^{25–i} ≈ 0.06
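The 0.06 figure can be reproduced directly (a sketch; it assumes independence exactly as the slide does):

```python
from math import comb

eps = 0.35
# The 25-classifier majority vote errs when 13 or more classifiers err
ensemble_error = sum(
    comb(25, i) * eps ** i * (1 - eps) ** (25 - i) for i in range(13, 26)
)
```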

Examples of Ensemble Methods Bagging: Bootstrap Aggregation


• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• How to generate an ensemble of classifiers? – Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– Bagging
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– Boosting
– The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proved improved accuracy in prediction
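A minimal sketch of the vote-counting step; the stub classifiers stand in for the learned models M1…Mk and are purely illustrative:

```python
from collections import Counter

def bagged_predict(classifiers, x):
    # Each model votes; the bagged classifier M* assigns the class
    # with the most votes to x.
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

clfs = [lambda x: 'yes', lambda x: 'yes', lambda x: 'no']  # stub models
label = bagged_predict(clfs, x=None)
```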

Bagging Boosting
• Sampling with replacement

Original Data     1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2

• An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of
• Build classifier on each bootstrap sample boosting round

• Each sample has probability (1 – 1/n)n of being


selected
Unit 3 95 Unit 3 96

16
Boosting Boosting
• Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
• How boosting works? • Records that are wrongly classified will have
– Weights are assigned to each training tuple their weights increased
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the • Records that are classified correctly will have
subsequent classifier, Mi+1, to pay more attention to the training tuples their weights decreased
that were misclassified by Mi
Original Data 1 2 3 4 5 6 7 8 9 10
– The final M* combines the votes of each individual classifier, where the Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
weight of each classifier's vote is a function of its accuracy
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• The boosting algorithm can be extended for the prediction of continuous • Example 4 is hard to classify
values
• Its weight is increased, therefore it is more
• Comparing with bagging: boosting tends to achieve greater accuracy, but it likely to be chosen again in subsequent rounds
also risks overfitting the model to misclassified data

Unit 3 97 Unit 3 98

Adaboost (Freund and Schapire, 1997) Example: AdaBoost


• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased; otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi’s error rate is the sum of the weights of the misclassified tuples:

error(Mi) = Σ_j wj × err(Xj)

• The weight of classifier Mi’s vote is log( (1 – error(Mi)) / error(Mi) )

• Base classifiers: C1, C2, …, CT
• Error rate:

εi = (1/N) Σ_{j=1}^{N} wj δ( Ci(xj) ≠ yj )

• Importance of a classifier:

αi = (1/2) ln( (1 – εi) / εi )

Example: AdaBoost Illustrating AdaBoost


• Weight update:

wi^(j+1) = ( wi^(j) / Zj ) × exp(–αj)   if Cj(xi) = yi
wi^(j+1) = ( wi^(j) / Zj ) × exp(+αj)   if Cj(xi) ≠ yi
where Zj is the normalization factor

• If any intermediate rounds produce an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
• Classification:

C*(x) = arg max_y Σ_{j=1}^{T} αj δ( Cj(x) = y )

Initial weights for each data point (0.1 each):
Original Data:         + + + - - - - - + +
Boosting Round 1 (B1): + + + - - - - - - -    weights 0.0094, 0.0094, 0.4623;  α = 1.9459
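A sketch of one AdaBoost round under the update rule above (function name and toy weights are our own):

```python
from math import exp, log

def adaboost_round(weights, wrong):
    # One boosting round: wrong[i] is True if the current base
    # classifier misclassified example i.
    err = sum(w for w, m in zip(weights, wrong) if m)
    alpha = 0.5 * log((1 - err) / err)           # classifier importance
    new_w = [w * exp(alpha if m else -alpha)     # raise wrong, lower right
             for w, m in zip(weights, wrong)]
    z = sum(new_w)                               # normalization factor Zj
    return alpha, [w / z for w in new_w]

alpha, w_next = adaboost_round([0.1] * 10, [False] * 9 + [True])
```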

Illustrating AdaBoost
B1
K Means Clustering Algorithm
0.0094 0.0094 0.4623
Boosting
+++ - - - - - - - • K-means is a popularly used unsupervised machine learning
Round 1  = 1.9459 algorithm for cluster analysis. K-Means is a non-deterministic and
iterative method. The algorithm operates on a given data set
B2 through pre-defined number of clusters, k. The output of K Means
Boosting
0.3037 0.0009 0.0422 algorithm is k clusters with input data partitioned among the
Round 2 - - - - - - - - ++  = 2.9323
clusters.
• For instance, let’s consider K-Means Clustering for Wikipedia Search
results. The search term “Jaguar” on Wikipedia will return all pages
B3 containing the word Jaguar which can refer to Jaguar as a Car,
0.0276 0.1819 0.0038 Jaguar as Mac OS version and Jaguar as an Animal. K Means
Boosting clustering algorithm can be applied to group the webpages that talk
Round 3 +++ ++ ++ + ++  = 3.8744
about similar concepts. So, the algorithm will group all web pages
that talk about Jaguar as an Animal into one cluster, Jaguar as a Car
into another cluster and so on.

Overall +++ - - - - - ++

Advantages of using K-Means


Applications of K-Means Clustering
Clustering
• In case of globular clusters, K-Means produces • K Means Clustering algorithm is used by most of
the search engines like Yahoo, Google to cluster
tighter clusters than hierarchical clustering. web pages by similarity and identify the
• Given a smaller value of K, K-Means clustering ‘relevance rate’ of search results. This helps
search engines reduce the computational time for
computes faster than hierarchical clustering the users.
for large number of variables. • Data Science Libraries in Python to implement K-
Means Clustering – SciPy, Sci-Kit Learn, Python
Wrapper
• Data Science Libraries in R to implement K-Means
Clustering – stats
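A minimal 1-D sketch of the K-Means iteration (Lloyd's algorithm); the data and names are illustrative — a real use would call the SciPy or Sci-Kit Learn implementations noted above:

```python
import random

def k_means(points, k, iters=20, seed=0):
    # Lloyd's algorithm on 1-D data: assign each point to the nearest
    # centroid, then move each centroid to its cluster mean.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

cents = k_means([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2)
# two well-separated groups -> centroids near 1.0 and 10.0
```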


Support Vector Machine Learning


Advantages of Using SVM
Algorithm
• Support Vector Machine is a supervised machine learning algorithm for
classification or regression problems where the dataset teaches SVM about the • SVM offers best classification performance
classes so that SVM can classify any new data. It works by classifying the data into
different classes by finding a line (hyperplane) which separates the training data
set into classes. As there are many such linear hyperplanes, SVM algorithm tries to
(accuracy) on the training data.
maximize the distance between the various classes that are involved and this is
referred as margin maximization. If the line that maximizes the distance between • SVM renders more efficiency for correct
the classes is identified, the probability to generalize well to unseen data is
increased. classification of the future data.
• SVM’s are classified into two categories:
• Linear SVM’s – In linear SVM’s the training data i.e. classifiers are separated by a
hyperplane.
• The best thing about SVM is that it does not
• Non-Linear SVM’s- In non-linear SVM’s it is not possible to separate the training
data using a hyperplane. For example, the training data for Face detection consists
make any strong assumptions on data.
of group of images that are faces and another group of images that are not faces
(in other words all other images in the world except faces). Under such conditions, • It does not over-fit the data.
the training data is too complex that it is impossible to find a representation for
every feature vector. Separating the set of faces linearly from the set of non-face is
a complex task.


Applications of Support Vector PCA
Machine
• principal components analysis (PCA) is a technique
• SVM is commonly used for stock market that can be used to simplify a dataset
forecasting by various financial institutions. • It is a linear transformation that chooses a new
For instance, it can be used to compare the coordinate system for the data set such that
relative performance of the stocks when – greatest variance by any projection of the data set comes
to lie on the first axis (then called the first principal
compared to performance of other stocks in component),
the same sector. The relative comparison of – the second greatest variance on the second axis, and so
stocks helps manage investment making on.
decisions based on the classifications made by • PCA can be used for reducing dimensionality by
the SVM learning algorithm. eliminating the later principal components.


PCA PCA process –STEP 1


• By finding the eigenvalues and eigenvectors of the
covariance matrix, we find that the eigenvectors with • Subtract the mean
the largest eigenvalues correspond to the dimensions from each of the data dimensions. All the x values
that have the strongest correlation in the dataset. have x subtracted and y values have y subtracted
• This is the principal component. from them. This produces a data set whose mean is
zero.
• PCA is a useful statistical technique that has found
application in: Subtracting the mean makes variance and covariance
– fields such as face recognition and image compression calculation easier by simplifying their equations. The
– finding patterns in data of high dimension
variance and co-variance values are not affected by
the mean value.


PCA process –STEP 1 PCA process –STEP 1


DATA:            ZERO MEAN DATA:
 x     y           x      y
2.5   2.4         .69    .49
0.5   0.7       -1.31  -1.21
2.2   2.9         .39    .99
1.9   2.2         .09    .29
3.1   3.0        1.29   1.09
2.3   2.7         .49    .79
2     1.6         .19   -.31
1     1.1        -.81   -.81
1.5   1.6        -.31   -.31
1.1   0.9        -.71  -1.01

PCA process –STEP 2 PCA process –STEP 3
• Calculate the covariance matrix

cov = | .616555556  .615444444 |
      | .615444444  .716555556 |

• Since the non-diagonal elements in this covariance matrix are positive, we should expect that both the x and y variables increase together.

• Calculate the eigenvectors and eigenvalues of the covariance matrix

eigenvalues  =  .0490833989
                1.28402771

eigenvectors = | -.735178656  -.677873399 |
               |  .677873399  -.735178656 |
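For a 2×2 covariance matrix the eigen-decomposition has a closed form, so the slide's numbers can be checked with a short sketch (function name is ours; the data is from STEP 1):

```python
from math import atan2, cos, sin, sqrt

def pca_2d(xs, ys):
    # Covariance of mean-centred 2-D data, then the top eigenpair of
    # the symmetric 2x2 matrix [[cxx, cxy], [cxy, cyy]] in closed form.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    half_sum, half_diff = (cxx + cyy) / 2, (cxx - cyy) / 2
    lam1 = half_sum + sqrt(half_diff ** 2 + cxy ** 2)  # largest eigenvalue
    theta = atan2(lam1 - cxx, cxy)                     # its eigenvector angle
    return lam1, (cos(theta), sin(theta))

xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
lam1, pc1 = pca_2d(xs, ys)   # matches the slide's largest eigenpair
```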

PCA process –STEP 3 PCA process –STEP 4


• The eigenvectors are plotted as diagonal dotted lines on the plot.
• Note they are perpendicular to each other.
• Note one of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
• The second eigenvector gives us the other, less important, pattern in the data: that all the points follow the main line, but are off to the side of the main line by some amount.

• Reduce dimensionality and form feature vector: the eigenvector with the highest eigenvalue is the principal component of the data set.
• In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.
• Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance.

PCA process –STEP 4 PCA process –STEP 4


• Now, if you like, you can decide to ignore the • Feature Vector
components of lesser significance
FeatureVector = (eig1 eig2 eig3 … eign)
We can either form a feature vector with both of the
• You do lose some information, but if the eigenvectors:
eigenvalues are small, you don’t lose much -.677873399 -.735178656
-.735178656 .677873399
– n dimensions in your data
or, we can choose to leave out the smaller, less
– calculate n eigenvectors and eigenvalues significant component and only have a single
– choose only the first p eigenvectors column:
– final data set has only p dimensions. - .677873399
- .735178656

PCA process –STEP 5 PCA process –STEP 5
• Deriving the new data
FinalData = RowFeatureVector x RowZeroMeanData
• RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
• RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension.

[Diagram: the factorization R = U S V^T, splitting variables × samples data into significant factors and noise]

PCA process –STEP 5 PCA process –STEP 5


• FinalData is the final data set, with data items in columns, and dimensions along rows.
• What will this give us?
– It will give us the original data solely in terms of the vectors we chose.
• We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2 eigenvectors.

FinalData transpose (dimensions along columns):
      x              y
 -.827970186    -.175115307
 1.77758033      .142857227
 -.992197494     .384374989
 -.274210416     .130417207
-1.67580142     -.209498461
 -.912949103     .175282444
  .0991094375   -.349824698
 1.14457216      .0464172582
  .438046137     .0177646297
 1.22382056     -.162675287

PCA process –STEP 5 Reconstruction of original Data


• If we reduced the dimensionality, obviously, when
reconstructing the data we would lose those
dimensions we chose to discard. In our example let
us assume that we considered only the x dimension…


Reconstruction of original Data Markov Models
x:  -.827970186, 1.77758033, -.992197494, -.274210416, -1.67580142,
    -.912949103, .0991094375, 1.14457216, .438046137, 1.22382056

• A discrete (finite) system:
– N distinct states.
– Begins (at time t=1) in some initial state(s).
– At each time step (t=1,2,…) the system moves from current to next state (possibly the same as the current state) according to transition probabilities associated with current state.
• This kind of system is called a finite, or discrete Markov model
• After Andrei Andreyevich Markov (1856-1922)

Discrete Markov Model: Example


• Discrete Markov Model with 5 states.
• Each aij represents the probability of moving from state i to state j
• The aij are given in a matrix A = {aij}
• The probability to start in a given state i is πi; the vector π represents these start probabilities.

Markov Property
• Markov Property: The state of the system at time t+1 depends only on the state of the system at time t

P[X_{t+1} = x_{t+1} | X_t = x_t, X_{t-1} = x_{t-1}, …, X_1 = x_1, X_0 = x_0] = P[X_{t+1} = x_{t+1} | X_t = x_t]

Markov Chains Simple Minded Weather Example


Stationarity Assumption
Probabilities independent of t when process is “stationary”
So, for all t, P[X_{t+1} = xj | X_t = xi] = pij
This means that if the system is in state i, the probability that the system will next move to state j is pij, no matter what the value of t is

• raining today, rain tomorrow: prr = 0.4
• raining today, no rain tomorrow: prn = 0.6
• no rain today, rain tomorrow: pnr = 0.2
• no rain today, no rain tomorrow: pnn = 0.8


Simple Minded Weather Example Coke vs. Pepsi (a central cultural dilemma)
Transition matrix for our example Given that a person’s last cola purchase was Coke ™,
 0.4 0.6  there is a 90% chance that her next cola purchase will
P    also be Coke ™.
 0.2 0.8 
If that person’s last cola purchase was Pepsi™, there
• Note that rows sum to 1 is an 80% chance that her next cola purchase will also
• Such a matrix is called a Stochastic Matrix be Pepsi™.
• If the rows of a matrix and the columns of a matrix all
0.9 0.1
sum to 1, we have a Doubly Stochastic Matrix 0.8

coke pepsi

0.2

Coke vs. Pepsi Coke vs. Pepsi


Given that a person is currently a Pepsi purchaser,
Given that a person is currently a Coke
what is the probability that she will purchase Coke
drinker, what is the probability that she will
two purchases from now?
purchase Pepsi three purchases from now?
The transition matrices are:

P  = | 0.9  0.1 |   (corresponding to one purchase ahead)
     | 0.2  0.8 |

P² = | 0.9  0.1 | | 0.9  0.1 | = | 0.83  0.17 |
     | 0.2  0.8 | | 0.2  0.8 |   | 0.34  0.66 |

P³ = | 0.9  0.1 | | 0.83  0.17 | = | 0.781  0.219 |
     | 0.2  0.8 | | 0.34  0.66 |   | 0.438  0.562 |
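The matrix products can be verified with a few lines (names are illustrative):

```python
def mat_mul(a, b):
    # Product of two 2x2 matrices
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

P = [[0.9, 0.1], [0.2, 0.8]]   # state 0 = Coke, state 1 = Pepsi
P2 = mat_mul(P, P)             # two purchases ahead
P3 = mat_mul(P, P2)            # three purchases ahead
# P2[1][0] is the Pepsi-to-Coke probability two purchases from now
```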

Equilibrium (Stationary) Distribution


Coke vs. Pepsi • Suppose 60% of all people now drink Coke, and
Assume each person makes one cola purchase per 40% drink Pepsi. What fraction will be drinking
week. Suppose 60% of all people now drink Coke, and
40% drink Pepsi. Coke 10,100,1000,10000 … weeks from now?
What fraction of people will be drinking Coke three • For each week, probability is well defined. But
weeks from now? does it converge to some equilibrium distribution
P00 [p0,p1]?
Let (Q0, Q1) = (0.6, 0.4) be the initial probabilities.
We will regard Coke as 0 and Pepsi as 1.
We want to find P(X3 = 0):

P(X3 = 0) = Σ_{i=0}^{1} Qi pi0^(3) = Q0 p00^(3) + Q1 p10^(3) = 0.6 × 0.781 + 0.4 × 0.438 = 0.6438

• If it does, then the equations .9p0 + .2p1 = p0 and .8p1 + .1p0 = p1 must hold, yielding p0 = 2/3, p1 = 1/3.

Equilibrium (Stationary) Distribution Equilibrium (Stationary) Distribution
Whether or not there is a stationary distribution, and • If the Markov chain is positive recurrent, there
whether or not it is unique if it does exist, are determined exists a stationary distribution. If it is positive
by certain properties of the process. Irreducible means recurrent and irreducible, there exists a
that
every state is accessible from every other state. Aperiodic unique stationary distribution, and
means that there exists at least one state for which the furthermore the process constructed by taking
transition from that state to itself is possible. Positive the stationary distribution as the initial
recurrent means that the expected return time is finite for distribution is ergodic. Then the average of a
every state.
function f over samples of the Markov chain is equal to the average with respect to the stationary distribution.

Equilibrium (Stationary) Distribution Discrete Markov Model - Example

• Writing P for the transition matrix, a stationary


• States – Rainy:1, Cloudy:2, Sunny:3
distribution is a vector π which satisfies the
equation
• Matrix A –
– Pπ = π .
• In this case, the stationary distribution π is an
eigenvector of the transition matrix, associated • Problem – given that the weather on day 1 (t=1) is sunny(3), what is the
probability for the observation O:
with the eigenvalue 1.


Discrete Markov Model – Example (cont.) Types of Models


• The answer is -
• Ergodic model
Strongly connected - directed
path w/ positive probabilities
from each state i to state j
(but not necessarily complete
directed graph)


Third Example: A Friendly Gambler Fourth Example: A Friendly Gambler
Game starts with 10$ in gambler’s pocket
Irreducible means that every state is accessible from every other state.
– At each round we have the following: Aperiodic means that there exists at least one state for which the
• Gambler wins 1$ with probability p transition from that state to itself is possible. Positive
or recurrent means that the expected return time is finite for every state.
• Gambler loses 1$ with probability 1-p If the Markov chain is positive recurrent, there exists a stationary
– Game ends when gambler goes broke (no sister in bank), distribution.

or accumulates a capital of 100$ (including initial capital) Is the gambler’s chain positive recurrent? Does it have a stationary
– Both 0$ and 100$ are absorbing states distribution (independent upon initial distribution)?
[Diagram, both slides: a chain over states 0, 1, 2, …, N-1, N; each step right (win 1$) has probability p, each step left (lose 1$) has probability 1-p; play starts at the state marked (10$)]

Hidden Markov Models


Let Us Change Gear (probabilistic finite state automata)
• Enough with these simple Markov chains. Often we face scenarios where states cannot be
directly observed.
We need an extension: Hidden Markov Models
• Our next destination: Hidden Markov chains.

[Diagram: the fair/loaded coin HMM — states Fair and loaded, each emitting head or tail; transition probabilities 0.9 (stay) and 0.1 (switch); emission probabilities 1/2, 1/2 for the fair coin and 3/4 head, 1/4 tail for the loaded coin; the coin state is hidden, only the observed phenomenon (head/tail) is seen]
[Diagram: four-state automaton with state transition probabilities aij and observation probabilities bik]

aij are state transition probabilities.
bik are observation (output) probabilities.
b11 + b12 + b13 + b14 = 1, b21 + b22 + b23 + b24 = 1, etc.

Hidden Markov Models - HMM Example: Dishonest Casino

Hidden variables

H1 H2 Hi HL-1 HL

X1 X2 Xi XL-1 XL

Observed data Actually, what is hidden in this model?


Coin-Tossing Example Loaded Coin Example
[Diagram, both slides: the fair/loaded coin HMM unrolled over L tosses — hidden chain H1 H2 … Hi … HL-1 HL (Fair/Loaded), observed chain X1 X2 … Xi … XL-1 XL (Head/Tail); transitions 0.9 stay / 0.1 switch, emissions 1/2, 1/2 for the fair coin and 3/4 head, 1/4 tail for the loaded coin]

Q1.: What is the probability of the sequence of observed outcomes (e.g. HHHTHTTHHT), given the model?

C-G Islands Example


HMMs – Question I
C-G islands: DNA parts which are very rich in C and G

• Given an observation sequence O = (O1 O2 O3 … OL) and a model M = {A, B, π}, how do we efficiently compute P(O|M), the probability that the given model M produces the observation O in a run of length L?
• This probability can be viewed as a measure of the quality of the model M. Viewed this way, it enables discrimination/selection among alternative models.

[Diagram: two four-state chains over A, C, G, T — one for regular DNA and one for a C-G island — connected by change transitions; edge labels such as p/6, p/3, q/4, (1-p)/4, (1-q)/6, (1-q)/3 give the transition probabilities]

Example: CpG islands CpG Islands


• In human genome, CG dinucleotides are relatively rare
– CG pairs undergo a process called methylation that modifies the C nucleotide
– A methylated C mutates (with relatively high chance) to a T
• Promoter regions are CG rich
– These regions are not methylated, and thus mutate less often
– These are called CpG islands

• We construct Markov chains for CpG rich and poor regions
• Using maximum likelihood estimates from 60K nucleotides, we get two models

Ratio Test for CpG islands
• Given a sequence X1,…,Xn we compute the log-likelihood ratio
      S(X1,…,Xn) = log [ P(X1,…,Xn | +) / P(X1,…,Xn | −) ]
                 = Σi log ( A+XiXi+1 / A−XiXi+1 )
                 = Σi βXiXi+1
  where A+ and A− are the transition matrices of the "+" (island) and "−" (non-island) models, and βXiXi+1 denotes the per-transition log-ratio.

Empirical Evaluation
[Figure]
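The ratio test is a few lines to implement. A sketch — the two transition tables below are made-up illustrative values, not the 60K-nucleotide maximum likelihood estimates from the slides:

```python
import math

# Illustrative (made-up) transition probabilities for the "+" (CpG island)
# and "-" (regular DNA) models; a real table covers all 16 dinucleotides.
A_plus  = {("C", "G"): 0.30, ("G", "C"): 0.30, ("C", "C"): 0.20, ("G", "G"): 0.20}
A_minus = {("C", "G"): 0.05, ("G", "C"): 0.25, ("C", "C"): 0.35, ("G", "G"): 0.35}

def log_odds(seq):
    """S(X) = sum_i log( A+_{Xi,Xi+1} / A-_{Xi,Xi+1} ).

    Positive scores favor the "+" model; pairs not tabulated here score 0.
    """
    return sum(math.log(A_plus.get(p, 1.0) / A_minus.get(p, 1.0))
               for p in zip(seq, seq[1:]))
```

With these toy numbers, a C-to-G transition strongly favors the island model, while C-to-C favors regular DNA.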
Finding CpG islands
Simple-minded approach:
• Pick a window of size N (N = 100, for example).
• Compute the log-ratio for the sequence in the window, and classify based on that.
Problems:
• How do we select N?
• What do we do when the window intersects the boundary of a CpG island?

Alternative Approach
• Build a model that includes "+" states and "-" states.
• A state "remembers" the last nucleotide and the type of region.
• A transition from a "-" state to a "+" state describes the start of a CpG island.
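The simple-minded window scan might look like the following sketch. The window size N and the single β value are illustrative assumptions; a real table would cover all 16 dinucleotides:

```python
import math

# Hypothetical per-dinucleotide log-ratios (beta values); one entry for brevity.
BETA = {("C", "G"): math.log(6.0)}

def window_scores(seq, N):
    """Score every length-N window; windows with positive score look like islands."""
    def score(s):
        return sum(BETA.get(pair, 0.0) for pair in zip(s, s[1:]))
    return [score(seq[i:i + N]) for i in range(len(seq) - N + 1)]
```

The scan makes the slide's problems concrete: the output depends on the choice of N, and a window straddling an island boundary mixes "+" and "-" statistics.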
A Different C-G Islands Model
[Figure: an eight-state model — states A, C, G, T for the C-G island regime and A, C, G, T for regular DNA — with "change" transitions between the two regimes; hidden states H1 … HL emit the observed nucleotides X1 … XL (A/C/G/T), and "C-G island?" is answered by which regime the hidden state is in.]

HMM Recognition (question I)
• For a given model M = {A, B, π} and a given state sequence Q1 Q2 Q3 … QL, the probability of an observation sequence O1 O2 O3 … OL is
      P(O|Q,M) = bQ1O1 bQ2O2 bQ3O3 … bQLOL
• For a given hidden Markov model M = {A, B, π} the probability of the state sequence Q1 Q2 Q3 … QL is (the initial probability of Q1 is taken to be πQ1)
      P(Q|M) = πQ1 aQ1Q2 aQ2Q3 aQ3Q4 … aQL-1QL
• So, for a given HMM M, the probability of an observation sequence O1 O2 O3 … OL is obtained by summing over all possible state sequences.
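For tiny cases the sum over all state sequences can be evaluated literally. A sketch with assumed fair/loaded-style parameters (state names and values are my own illustration); it is exponential in the sequence length, which is exactly the inefficiency the recognition slides go on to discuss:

```python
from itertools import product

# Illustrative two-state model (fair/loaded coin style; parameters assumed).
PI = {"F": 0.5, "L": 0.5}
A  = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
B  = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def brute_force_likelihood(obs):
    """P(O|M) = sum over all Q^L state sequences of P(Q|M) * P(O|Q,M)."""
    total = 0.0
    for q in product(PI, repeat=len(obs)):      # every possible state sequence
        p = PI[q[0]] * B[q[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= A[q[i - 1]][q[i]] * B[q[i]][obs[i]]
        total += p
    return total
```

Even for this toy model, the loop runs over 2^L sequences, so the approach is only usable for very short observation strings.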
HMM – Recognition (cont.)
• P(O|M) = ΣQ P(Q|M) P(O|Q,M)
         = ΣQ πQ1 bQ1O1 aQ1Q2 bQ2O2 aQ2Q3 bQ3O3 …
• P(Q|M) = πQ1 aQ1Q2 aQ2Q3 aQ3Q4 … aQT-1QT
• P(O|Q,M) = bQ1O1 bQ2O2 bQ3O3 … bQTOT
• Requires summing over exponentially many paths.
• Can this be made more efficient?

HMM – Recognition (cont.)
• Why isn't it efficient? – O(2L·Q^L)
  – For a given state sequence of length L we have about 2L calculations.
  – There are Q^L possible state sequences.
  – So, if Q = 5 and L = 100, then the algorithm requires 200×5^100 computations.
• We can use the forward-backward (F-B) algorithm to do things efficiently.
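The forward half of the F-B algorithm is a few lines of dynamic programming: instead of enumerating all Q^L paths, it carries one number per state per position, giving O(L·Q²) time. A sketch, again with assumed two-state fair/loaded-style parameters:

```python
# Illustrative two-state model (parameters assumed, fair/loaded coin style).
PI = {"F": 0.5, "L": 0.5}
A  = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
B  = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def forward(obs):
    """P(O|M) via the forward recursion f_k(i) = P(O1..Oi, Qi = k)."""
    f = {k: PI[k] * B[k][obs[0]] for k in PI}           # initialization
    for o in obs[1:]:
        f = {k: sum(f[j] * A[j][k] for j in f) * B[k][o]  # one DP step
             for k in PI}
    return sum(f.values())                              # marginalize final state
```

On the same inputs this returns exactly the brute-force sum over all state sequences, but in time linear in the sequence length.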
The F-B Algorithm (cont.)
Option 1) The likelihood is measured using any sequence of states of length T.
  – This is known as the "Any Path" method.
Option 2) We can choose an HMM by the probability generated using the best possible sequence of states Q.
  – We'll refer to this method as the "Best Path" method.

HMM – Question II (Harder)
• Given an observation sequence O = (O1 O2 … OT), and a model M = {A, B, π}, how do we efficiently compute the most probable sequence(s) of states Q?
• Namely, the sequence of states Q = (Q1 Q2 … QT) which maximizes P(O|Q,M), the probability that the given model M produces the given observation O when it goes through the specific sequence of states.
• Recall that given a model M, a sequence of observations O, and a sequence of states Q, we can efficiently compute P(O|Q,M) (should watch out for numeric underflows).
Most Probable States Sequence (Q. II)
Idea:
• If we know the identity of Qi, then the most probable sequence on i+1,…,n does not depend on observations before time i.
• A white-board presentation of Viterbi's algorithm.

Dishonest Casino (again)
• Computing posterior probabilities for "fair" at each point in a long sequence.
[Figure: the posterior probability of "fair" plotted along the sequence.]
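Viterbi's algorithm exploits exactly the idea above: the best path ending in state k at position i depends on the past only through the best paths at position i-1. A sketch for a dishonest-casino-style model (parameters and state names "F"/"L" are assumed, not taken from the slides):

```python
# Illustrative two-state model (parameters assumed).
PI = {"F": 0.5, "L": 0.5}
A  = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
B  = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def viterbi(obs):
    """Return the state path maximizing the joint probability P(Q, O | M)."""
    v = {k: PI[k] * B[k][obs[0]] for k in PI}   # best path prob. ending in k
    back = []                                   # back-pointers per position
    for o in obs[1:]:
        nv, ptr = {}, {}
        for k in PI:
            prev = max(v, key=lambda j: v[j] * A[j][k])
            ptr[k] = prev
            nv[k] = v[prev] * A[prev][k] * B[k][o]
        back.append(ptr)
        v = nv
    state = max(v, key=v.get)                   # best final state, then trace back
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

A run of heads is best explained by the loaded coin and a run of tails by the fair one; in practice the products should be replaced by sums of logs to avoid the numeric underflows mentioned on the slide.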
HMM – Question III (Hardest)
• Given an observation sequence O = (O1 O2 … OL), and a class of models, each of the form M = {A, B, π}, which specific model "best" explains the observations?
• A solution to Question I enables the efficient computation of P(O|M) (the probability that a specific model M produces the observation O).
• Question III can be viewed as a learning problem: we want to use the sequence of observations in order to "train" an HMM and learn the optimal underlying model parameters (transition and output probabilities).

Learning
Given a sequence x1,…,xn together with its hidden states h1,…,hn
• How do we learn Akl and Bka?
• We want to find parameters that maximize the likelihood P(x1,…,xn, h1,…,hn).
We simply count:
• Nkl – number of times hi=k & hi+1=l
• Nka – number of times hi=k & xi=a
      Akl = Nkl / Σl' Nkl'        Bka = Nka / Σa' Nka'
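The fully observed counting recipe is direct to implement. A sketch — the representation of sequences as strings of state/symbol characters is my own convention:

```python
from collections import Counter

# Supervised MLE for an HMM: count N_kl and N_ka in a labeled sequence
# and normalize each row, as on the slide.
def mle_estimate(hidden, observed):
    n_kl = Counter(zip(hidden, hidden[1:]))          # N_kl transition counts
    n_ka = Counter(zip(hidden, observed))            # N_ka emission counts
    row_kl = Counter(k for k, _ in n_kl.elements())  # sum over l' of N_kl'
    row_ka = Counter(k for k, _ in n_ka.elements())  # sum over a' of N_ka'
    A = {pair: c / row_kl[pair[0]] for pair, c in n_kl.items()}
    B = {pair: c / row_ka[pair[0]] for pair, c in n_ka.items()}
    return A, B
```

For example, with hidden states "FFLL" and observations "HTHH", state F transitions once to F and once to L, giving Akl = 1/2 for each.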
Learning
Given only the sequence x1,…,xn
• How do we learn Akl and Bka?
• We want to find parameters that maximize the likelihood P(x1,…,xn).
Problem:
• The counts are inaccessible, since we do not observe hi.

• If we have Akl and Bka we can compute
      P(Hi=k, Hi+1=l | x1,…,xn)
        = P(Hi=k, Hi+1=l, x1,…,xn) / P(x1,…,xn)
        = P(Hi=k, x1,…,xi) Akl P(xi+1 | Hi+1=l) P(xi+2,…,xn | Hi+1=l) / P(x1,…,xn)
        = fk(i) Akl Bl,xi+1 bl(i+1) / P(x1,…,xn)
  where fk(i) and bl(i+1) are the forward and backward messages.
Expected Counts
• We can compute the expected number of times hi=k & hi+1=l:
      E[Nkl] = Σi P(Hi=k, Hi+1=l | x1,…,xn)
• Similarly:
      E[Nka] = Σi: xi=a P(Hi=k | x1,…,xn)

Expectation Maximization (EM)
• Choose initial Akl and Bka
E-step:
• Compute expected counts E[Nkl], E[Nka]
M-step:
• Re-estimate:
      A'kl = E[Nkl] / Σl' E[Nkl']        B'ka = E[Nka] / Σa' E[Nka']
• Reiterate
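One EM (Baum-Welch) step can be sketched compactly for a two-state model. All parameter values below are illustrative assumptions, and only the emission matrix B is re-estimated, to keep the sketch short; A' follows the same pattern using the pairwise posteriors:

```python
# Compact Baum-Welch sketch: E-step via forward/backward messages,
# M-step by normalizing expected emission counts. Parameters assumed.
STATES = ("F", "L")
PI = {"F": 0.5, "L": 0.5}
A  = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
B  = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def forward_msgs(obs):
    """f_k(i) = P(x1..xi, Hi = k)."""
    f = [{k: PI[k] * B[k][obs[0]] for k in STATES}]
    for o in obs[1:]:
        f.append({k: sum(f[-1][j] * A[j][k] for j in STATES) * B[k][o]
                  for k in STATES})
    return f

def backward_msgs(obs):
    """b_k(i) = P(x(i+1)..xn | Hi = k)."""
    b = [{k: 1.0 for k in STATES}]
    for o in reversed(obs[1:]):
        b.insert(0, {k: sum(A[k][j] * B[j][o] * b[0][j] for j in STATES)
                     for k in STATES})
    return b

def em_step_B(obs):
    """E-step: posteriors P(Hi=k|x) give E[N_ka]; M-step: normalize rows."""
    f, b = forward_msgs(obs), backward_msgs(obs)
    px = sum(f[-1].values())                       # P(x1..xn | M)
    counts = {k: {} for k in STATES}               # expected counts E[N_ka]
    for i, o in enumerate(obs):
        for k in STATES:
            post = f[i][k] * b[i][k] / px          # P(Hi = k | x1..xn)
            counts[k][o] = counts[k].get(o, 0.0) + post
    return {k: {a: c / sum(counts[k].values()) for a, c in counts[k].items()}
            for k in STATES}
```

Each call to `em_step_B` returns properly normalized rows; iterating (and re-estimating A the same way) gives the monotone likelihood improvement described on the next slide of the deck.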
EM – basic properties
• P(x1,…,xn ; Akl, Bka) ≤ P(x1,…,xn ; A'kl, B'ka)
  – The likelihood grows in each iteration.
• If P(x1,…,xn ; Akl, Bka) = P(x1,…,xn ; A'kl, B'ka), then Akl, Bka is a stationary point of the likelihood
  – either a local maximum, a minimum, or a saddle point.

Complexity of E-step
• Compute forward and backward messages
  – Time & space complexity: O(nL)
• Accumulate expected counts
  – Time complexity: O(nL²)
  – Space complexity: O(L²)
EM – problems
Local maxima:
• Learning can get stuck in local maxima.
• Sensitive to initialization.
• Requires some method for escaping such maxima.
Choosing L:
• We often do not know how many hidden values we should have or can learn.

Communication Example
[Figure]
Thank You