Unit 3
Unit 3
Supervised Learning
Decision Tree Learning
Instance Based Learning: k-Nearest neighbor algorithm
Unit 3 Support Vector Machines
Ensemble learning: boosting, bagging
Artificial Neural Networks
Probabilistic Models
Maximum Likelihood Estimation
LEARNING TECHNIQUES MAP
Naive Bayes Classifiers
Minimum description length principle
Bayesian Networks
Unsupervised Learning
Principal Components Analysis
K-means and Hierarchical Clustering
Gaussian Mixture Models
EM algorithm
Hidden Markov Models
Unit 3 2
Unit 3 3 Unit 3 4
Unit 3 5 Unit 3 6
1
When to use Decision Tree Machine Advantages of Using Decision Tree
Learning Algorithm Machine Learning Algorithms
• Decision trees are robust to errors and if the training • Decision trees are very instinctual and can be explained to anyone with ease.
People from a non-technical background, can also decipher the hypothesis drawn
data contains errors- decision tree algorithms will be from a decision tree, as they are self-explanatory.
• When using decision tree machine learning algorithms, data type is not a
best suited to address such problems. constraint as they can handle both categorical and numerical variables.
• Decision trees are best suited for problems where • Decision tree machine learning algorithms do not require making any assumption
on the linearity in the data and hence can be used in circumstances where the
instances are represented by attribute value pairs. parameters are non-linearly related. These machine learning algorithms do not
make any assumptions on the classifier structure and space distribution.
• If the training data has missing value then decision • These algorithms are useful in data exploration. Decision trees implicitly perform
feature selection which is very important in predictive analytics. When a decision
trees can be used, as they can handle missing values tree is fit to a training dataset, the nodes at the top on which the decision tree is
nicely by looking at the data in other columns. split, are considered as important variables within a given dataset and feature
selection is completed by default.
• Decision trees are best suited when the target function • Decision trees help save data preparation time, as they are not sensitive to missing
values and outliers. Missing values will not stop you from splitting the data for
has discrete output values. building a decision tree. Outliers will also not affect the decision trees as data
splitting happens based on some samples within the split range and not on exact
absolute values.
Unit 3 7 Unit 3 8
Unit 3 9 Unit 3 10
2
Bayesian Theorem Towards Naïve Bayesian Classifier
• Given training data X, posteriori probability of a hypothesis H, • Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
P(H|X), follows the Bayes theorem
x2, …, xn)
P(H | X) P(X | H )P(H ) • Suppose there are m classes C1, C2, …, Cm.
P(X) • Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• Informally, this can be written as
• This can be derived from Bayes’ theorem
posteriori = likelihood x prior/evidence P(X | C )P(C )
P(C | X) i i
• Predicts X belongs to C2 iff the probability P(Ci|X) is the highest i P(X)
among all the P(Ck|X) for all the k classes • Since P(X) is constant for all classes, only
P(C | X) P(X | C )P(C )
• Practical difficulty: require initial knowledge of many i i i
needs to be maximized
probabilities, significant computational cost
Unit 3 13 Unit 3 14
3
Naïve Bayesian Classifier: Comments Rule-Based Classifier
• Advantages
– Easy to implement • Classify records by using a collection of “if…then…”
– Good results obtained in most of the cases rules
• Disadvantages
– Assumption: class conditional independence, therefore loss of
• Rule: (Condition) y
accuracy – where
• Condition is a conjunctions of attributes
– Practically, dependencies exist among variables • y is the class label
• E.g., hospitals: patients: Profile: age, family history, etc.
– LHS: rule antecedent or condition
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
– RHS: rule consequent
• Dependencies among these cannot be modeled by Naïve Bayesian
Classifier – Examples of classification rules:
• (Blood Type=Warm) (Lay Eggs=Yes) Birds
• How to deal with these dependencies?
• (Taxable Income < 50K) (Refund=Yes) Evade=No
– Bayesian Belief Networks
Unit 3 19 Unit 3 20
R1: (Give Birth = no) (Can Fly = yes) Birds that satisfy the 3 No Single 70K No
R2: (Give Birth = no) (Live in Water = yes) Fishes 4 Yes Married 120K No
antecedent of a rule
R3: (Give Birth = yes) (Blood Type = warm) Mammals 5 No Divorced 95K Yes
R4: (Give Birth = no) (Can Fly = no) Reptiles • Accuracy of a rule: 6 No Married 60K No
R5: (Live in Water = sometimes) Amphibians 7 Yes Divorced 220K No
– Fraction of records 8 No Single 85K Yes
hawk
Name Blood Type
warm
Give Birth
no
Can Fly
yes
Live in Water
no
Class
?
that satisfy both the 9 No Married 75K No
grizzly bear warm yes no no ? antecedent and 10
10 No Single 90K Yes
Unit 3 23 Unit 3 24
4
How does Rule-based Classifier Work? Characteristics of Rule-Based Classifier
R1: (Give Birth = no) (Can Fly = yes) Birds
R2: (Give Birth = no) (Live in Water = yes) Fishes
R3: (Give Birth = yes) (Blood Type = warm) Mammals • Mutually exclusive rules
R4: (Give Birth = no) (Can Fly = no) Reptiles – Classifier contains mutually exclusive rules if the
R5: (Live in Water = sometimes) Amphibians
rules are independent of each other
– Every record is covered by at most one rule
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ? • Exhaustive rules
– Classifier has exhaustive coverage if it accounts for
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
every possible combination of attribute values
A dogfish shark triggers none of the rules – Each record is covered by at least one rule
Unit 3 25 Unit 3 26
5
Rule Ordering Schemes Building Classification Rules
• Rule-based ordering • Direct Method:
– Individual rules are ranked based on their quality • Extract rules directly from data
• Class-based ordering • e.g.: RIPPER, CN2, Holte’s 1R
– Rules that belong to the same class appear together
Unit 3 31 Unit 3 32
• Steps: – Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
pos' pos
FOIL _ Gain pos'(log 2 log 2 )
– Rules are learned one at a time pos'neg ' pos neg
It favors rules that have high accuracy and cover many positive tuples
– Each time a rule is learned, the tuples covered by the rules are removed
• Rule pruning based on an independent set of test tuples
– The process repeats on the remaining tuples unless termination condition, pos neg
e.g., when no more training examples or when the quality of a rule FOIL _ Prune( R)
pos neg
returned is below a user-specified threshold
Pos/neg are # of positive/negative tuples covered by R.
• Comp. w. decision-tree induction: learning a set of rules simultaneously
If FOIL_Prune is higher for the pruned version of R, prune R
Unit 3 33 Unit 3 34
Unit 3 35 Unit 3 36
6
Example of Sequential Covering… Aspects of Sequential Covering
• Rule Growing
R1 R1
• Instance Elimination
• Rule Evaluation
R2
• Rule Pruning
Unit 3 37 Unit 3 38
Unit 3 39 Unit 3 40
Unit 3 41 Unit 3 42
7
Linear Classification SVM—Support Vector Machines
• Binary Classification problem • A new classification method for both linear and nonlinear data
• The data above the red line • It uses a nonlinear mapping to transform the original training
belongs to class ‘x’ data into a higher dimension
• The data below red line • With the new dimension, it searches for the linear optimal
x belongs to class ‘o’
x x separating hyperplane (i.e., “decision boundary”)
x x • Examples: SVM, Perceptron, • With an appropriate nonlinear mapping to a sufficiently high
x Probabilistic Classifiers
x x x o dimension, data from two classes can always be separated by a
o hyperplane
x o o
ooo • SVM finds this hyperplane using support vectors (“essential”
o o training tuples) and margins (defined by the support vectors)
o o o o
Unit 3 43 Unit 3 44
Unit 3 45 Unit 3 46
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
Unit 3 47 Unit 3 48
8
Why Is SVM Effective on High Dimensional Data?
SVM—Linearly Separable
A separating hyperplane can be written as The complexity of trained classifier is characterized by the # of support
W●X+b=0 vectors rather than the dimensionality of the data
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
The support vectors are the essential or critical training examples —they lie
For 2-D it can be written as
closest to the decision boundary (MMH)
w0 + w1 x1 + w2 x2 = 0
The hyperplane defining the sides of the margin: If all other training examples are removed and the training is repeated, the
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and same separating hyperplane would be found
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1 The number of support vectors found can be used to compute an (upper)
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the bound on the expected error rate of the SVM classifier, which is independent
sides defining the margin) are support vectors
of the data dimensionality
This becomes a constrained (convex) quadratic optimization problem:
Quadratic objective function and linear constraints Quadratic Thus, an SVM with a small number of support vectors can have good
Programming (QP) Lagrangian multipliers generalization, even when the dimensionality of the data is high
Unit 3 49 Unit 3 50
A2
SVM can also be used for classifying multiple (> 2) classes and for regression
analysis (with additional user parameters)
Search for a linear separating hyperplane in the new space
Unit 3 51 Unit 3 52
Unit 3 53 Unit 3 54
9
Nearest-Neighbor Classifiers Definition of Nearest Neighbor
Unknown record Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
X X X
Unit 3 55 Unit 3 56
Unit 3 57 Unit 3 58
Unit 3 59 Unit 3 60
10
Nearest Neighbor Classification… Nearest neighbor Classification…
• Problem with Euclidean measure: • k-NN classifiers are lazy learners
– High dimensional data – It does not build models explicitly
• curse of dimensionality – Unlike eager learners such as decision tree
– Can produce counter-intuitive results induction and rule-based systems
111111111110 100000000000 – Classifying unknown records are relatively
vs
expensive
011111111111 000000000001
d = 1.4142 d = 1.4142
Unit 3 65 Unit 3 66
11
What Is Prediction? Linear Regression
• Linear regression: involves a response variable y and a single predictor
• (Numerical) prediction is similar to classification variable x
– construct a model y = w0 + w1 x
– use model to predict continuous or ordered value for a given input where w0 (y-intercept) and w1 (slope) are regression coefficients
• Prediction is different from classification • Method of least squares: estimates the best-fitting straight line
– Classification refers to predict categorical class label | D|
(x x )( yi y )
– Prediction models continuous-valued functions w i 1
i
w y w x
1 | D|
0 1
• Major method for prediction: regression (x
i 1
i x )2
– model the relationship between one or more independent or predictor
variables and a dependent or response variable • Multiple linear regression: involves more than one predictor variable
• Regression analysis – Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
– Linear and multiple regression – Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
– Non-linear regression – Solvable by extension of least square method or using SAS, S-Plus
– Other regression methods: generalized linear model, Poisson regression, – Many nonlinear functions can be transformed into the above
log-linear models, regression trees
Unit 3 67 Unit 3 68
• Regression tree: proposed in CART system (Breiman et al. 1984) • Predictive modeling: Predict data values or construct
– CART: Classification And Regression Trees generalized linear models based on the database data
– Each leaf stores a continuous-valued prediction
• One can only predict value ranges or category distributions
• Method outline:
– It is the average value of the predicted attribute for the training tuples
– Minimal generalization
that reach the leaf
– Attribute relevance analysis
• Model tree: proposed by Quinlan (1992)
– Generalized linear model construction
– Each leaf holds a regression model—a multivariate linear equation for – Prediction
the predicted attribute • Determine the major factors which influence the prediction
– A more general case than regression tree – Data relevance analysis: uncertainty measurement, entropy
• Regression and model trees tend to be more accurate than linear regression analysis, expert judgement, etc.
when the data are not represented well by a simple linear model • Multi-level prediction: drill-down and roll-up analysis
Unit 3 71 Unit 3 72
12
Estimating Error Rates I Estimating Error Rates II
Unit 3 73 Unit 3 74
Unit 3 75 Unit 3 76
Unit 3 77 Unit 3 78
13
Model Selection: ROC Curves
Metrics for Performance Evaluation
• ROC (Receiver Operating Characteristics) • Focus on the predictive capability of a model
curves: for visual comparison of classification
models
– Rather than how fast it takes to classify or build
• Originated from signal detection theory
models, scalability, etc.
• Shows the trade-off between the true positive • Confusion Matrix:
rate and the false positive rate
• The area under the ROC curve is a measure of
Vertical axis represents PREDICTED CLASS
the true positive rate
the accuracy of the model
Horizontal axis rep. the Class=Yes Class=No
• Rank the test tuples in decreasing order: the false positive rate
a: TP (true positive)
b: FN (false negative)
one that is most likely to belong to the positive The plot also shows a Class=Yes a b
class appears at the top of the list
ACTUAL c: FP (false positive)
diagonal line
• The closer to the diagonal line (i.e., the closer CLASS Class=No c d
d: TN (true negative)
A model with perfect
the area is to 0.5), the less accurate is the accuracy will have an area
model of 1.0
Unit 3 79 Unit 3 80
Class=Yes Class=No
• Consider a 2-class problem
– Number of Class 0 examples = 9990
Class=Yes a b
ACTUAL (TP) (FN) – Number of Class 1 examples = 10
CLASS
Class=No c d
(FP) (TN)
• If model predicts everything to be class 0,
ad TP TN accuracy is 9990/10000 = 99.9 %
Accuracy
a b c d TP TN FP FN – Accuracy is misleading because model does not
• Most widely-used metric: detect any class 1 example
Unit 3 81 Unit 3 82
14
Cost vs Accuracy Cost-Sensitive Measures
a
Accuracy is proportional to cost if Precision (p)
Count PREDICTED CLASS 1. C(Yes|No)=C(No|Yes) = q ac
Class=Yes Class=No
2. C(Yes|Yes)=C(No|No) = p
a
Recall (r)
Class=Yes a b N=a+b+c+d ab
ACTUAL
CLASS 2rp 2a
Class=No c d F - measure (F)
Accuracy = (a + d)/N r p 2a b c
Cost PREDICTED CLASS
Cost = p (a + d) + q (b + c) Precision is biased towards C(Yes|Yes) & C(Yes|No)
Class=Yes Class=No
= p (a + d) + q (N – a – d) Recall is biased towards C(Yes|Yes) & C(No|Yes)
Class=Yes p q = q N – (q – p)(a + d) F-measure is biased towards all except C(No|No)
ACTUAL
= N [q – (q-p) Accuracy]
CLASS Class=No q p wa w d
Weighted Accuracy 1 4
wa wb wc w d
1 2 3 4
Unit 3 85 Unit 3 86
d d
(y
d
| y yi ' ) 2
– c/c+d – Relative absolute error: i 1
d
i yi ' |
Relative squared error: i 1
i
| y
d
y|
• False Negative Rate (FNR) i 1
i
(y
i 1
i y)2
15
General Idea Why does it work?
Original
D Training data
• Suppose there are 25 base classifiers
– Each classifier has error rate, = 0.35
Step 1:
Create Multiple D1 D2 .... Dt-1 Dt – Assume classifiers are independent
Data Sets
– Probability that the ensemble classifier makes a
Step 2: wrong prediction:
Build Multiple C1 C2 Ct -1 Ct
Classifiers 25
25 i
i (1 )25i 0.06
i 13
Step 3:
Combine C*
Classifiers
Unit 3 91 Unit 3 92
Bagging Boosting
• Sampling with replacement • An iterative procedure to adaptively change
Original Data 1 2 3 4 5 6 7 8 9 10
distribution of training data by focusing more
Bagging (Round 1)
Bagging (Round 2)
7
1
8
4
10
9
8
1
2
2
5
3
10
2
10
7
5
3
9
2 on previously misclassified records
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of
• Build classifier on each bootstrap sample boosting round
16
Boosting Boosting
• Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
• How boosting works? • Records that are wrongly classified will have
– Weights are assigned to each training tuple their weights increased
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the • Records that are classified correctly will have
subsequent classifier, Mi+1, to pay more attention to the training tuples their weights decreased
that were misclassified by Mi
Original Data 1 2 3 4 5 6 7 8 9 10
– The final M* combines the votes of each individual classifier, where the Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
weight of each classifier's vote is a function of its accuracy
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• The boosting algorithm can be extended for the prediction of continuous • Example 4 is hard to classify
values
• Its weight is increased, therefore it is more
• Comparing with bagging: boosting tends to achieve greater accuracy, but it likely to be chosen again in subsequent rounds
also risks overfitting the model to misclassified data
Unit 3 97 Unit 3 98
w C ( x ) y
N
– Each tuple’s chance of being selected is based on its weight 1
– A classification model Mi is derived from Di i j i j j
– Its error rate is calculated using Di as a test set
N j 1
y j 1
Unit 3 101 Unit 3 102
17
Illustrating AdaBoost
B1
K Means Clustering Algorithm
0.0094 0.0094 0.4623
Boosting
+++ - - - - - - - • K-means is a popularly used unsupervised machine learning
Round 1 = 1.9459 algorithm for cluster analysis. K-Means is a non-deterministic and
iterative method. The algorithm operates on a given data set
B2 through pre-defined number of clusters, k. The output of K Means
Boosting
0.3037 0.0009 0.0422 algorithm is k clusters with input data partitioned among the
Round 2 - - - - - - - - ++ = 2.9323
clusters.
• For instance, let’s consider K-Means Clustering for Wikipedia Search
results. The search term “Jaguar” on Wikipedia will return all pages
B3 containing the word Jaguar which can refer to Jaguar as a Car,
0.0276 0.1819 0.0038 Jaguar as Mac OS version and Jaguar as an Animal. K Means
Boosting clustering algorithm can be applied to group the webpages that talk
Round 3 +++ ++ ++ + ++ = 3.8744
about similar concepts. So, the algorithm will group all web pages
that talk about Jaguar as an Animal into one cluster, Jaguar as a Car
into another cluster and so on.
Overall +++ - - - - - ++
Unit 3 103 Unit 3 104
18
Applications of Support Vector PCA
Machine
• principal components analysis (PCA) is a technique
• SVM is commonly used for stock market that can be used to simplify a dataset
forecasting by various financial institutions. • It is a linear transformation that chooses a new
For instance, it can be used to compare the coordinate system for the data set such that
relative performance of the stocks when – greatest variance by any projection of the data set comes
to lie on the first axis (then called the first principal
compared to performance of other stocks in component),
the same sector. The relative comparison of – the second greatest variance on the second axis, and so
stocks helps manage investment making on.
decisions based on the classifications made by • PCA can be used for reducing dimensionality by
the SVM learning algorithm. eliminating the later principal components.
19
PCA process –STEP 2 PCA process –STEP 3
• Calculate the covariance matrix
• Calculate the eigenvectors and eigenvalues of
cov = .616555556 .615444444
the covariance matrix
.615444444 .716555556
eigenvalues = .0490833989
1.28402771
• since the non-diagonal elements in this
eigenvectors = -.735178656 -.677873399
covariance matrix are positive, we should
expect that both the x and y variable increase .677873399 -.735178656
together.
20
PCA process –STEP 5 PCA process –STEP 5
• Deriving the new data
FinalData = RowFeatureVector x RowZeroMeanData
• RowFeatureVector is the matrix with the eigenvectors in the R = U S VT
columns transposed so that the eigenvectors are now in the
variables factors factors variables
rows, with the most significant eigenvector at the top
sig. significant
• RowZeroMeanData is the mean-adjusted data transposed, ie.
the data items are in each column, with each row holding a
significant
noise noise
noise
separate dimension.
factors factors
samples samples
21
Reconstruction of original Data Markov Models
• A discrete (finite) system:
x – N distinct states.
-.827970186
– Begins (at time t=1) in some initial state(s).
1.77758033
-.992197494
– At each time step (t=1,2,…) the system moves
-.274210416 from current to next state (possibly the same as
-1.67580142 the current state) according to transition
-.912949103 probabilities associated with current state.
.0991094375
• This kind of system is called a finite, or discrete
1.14457216
.438046137
Markov model
1.22382056
• After Andrei Andreyevich Markov (1856 -1922)
Unit 3 127 1283
Unit
• Each aij represents the depends only on the state of the system at time t
probability of moving P[X t 1 x t 1 | X t x t , X t -1 x t -1 , . . . , X1 x1 , X 0 x 0 ]
from state i to state j
P[X t 1 x t 1 | X t x t ]
• The aij are given in a
matrix A = {aij}
• The probability to Xt= Xt=3 Xt=
Xt=1 Xt=5
start in a given state i 2 4
is i , The vector
repre-sents these start
probabilities. 1293
Unit 1303
Unit
1313
Unit 1323
Unit
22
Simple Minded Weather Example Coke vs. Pepsi (a cental cultural dilemma)
Transition matrix for our example Given that a person’s last cola purchase was Coke ™,
0.4 0.6 there is a 90% chance that her next cola purchase will
P also be Coke ™.
0.2 0.8
If that person’s last cola purchase was Pepsi™, there
• Note that rows sum to 1 is an 80% chance that her next cola purchase will also
• Such a matrix is called a Stochastic Matrix be Pepsi™.
• If the rows of a matrix and the columns of a matrix all
0.9 0.1
sum to 1, we have a Doubly Stochastic Matrix 0.8
coke pepsi
0.2
1333
Unit 1343
Unit
1373
Unit 1383
Unit
0.2
23
Equilibrium (Stationary) Distribution Equilibrium (Stationary) Distribution
Whether or not there is a stationary distribution, and • If the Markov chain is positive recurrent, there
whether or not it is unique if it does exist, are determined exists a stationary distribution. If it is positive
by certain properties of the process. Irreducible means recurrent and irreducible, there exists a
that
every state is accessible from every other state. Aperiodic unique stationary distribution, and
means that there exists at least one state for which the furthermore the process constructed by taking
transition from that state to itself is possible. Positive the stationary distribution as the initial
recurrent means that the expected return time is finite for distribution is ergodic. Then the average of a
every state. 0.9 0.1
0.8 function f over samples of the Markov chain is
equal to the average with respect to the
coke pepsi
stationary distribution,
0.2
1393
Unit 1403
Unit
1413
Unit 1423
Unit
1433
Unit 1443
Unit
24
Third Example: A Friendly Gambler Fourth Example: A Friendly Gambler
Game starts with 10$ in gambler’s pocket
Irreducible means that every state is accessible from every other state.
– At each round we have the following: Aperiodic means that there exists at least one state for which the
• Gambler wins 1$ with probability p transition from that state to itself is possible. Positive
or recurrent means that the expected return time is finite for every state.
• Gambler loses 1$ with probability 1-p If the Markov chain is positive recurrent, there exists a stationary
– Game ends when gambler goes broke (no sister in bank), distribution.
or accumulates a capital of 100$ (including initial capital) Is the gambler’s chain positive recurrent? Does it have a stationary
– Both 0$ and 100$ are absorbing states distribution (independent upon initial distribution)?
p p p p p p p p
0 1 2 N-1 N 0 1 2 N-1 N
Hidden variables
H1 H2 Hi HL-1 HL
X1 X2 Xi XL-1 XL
1493
Unit 1503
Unit
25
Coin-Tossing Example Loaded Coin Example
Start Start
1/2 tail 1/2 1/2
tail 1/2
1/2 tail
1/2 tail 0.1 1/4
1/4
0.1 Fair loaded
Fair loaded 0.9 0.1
1/2 3/4 0.9
0.9 0.1
3/4 0.9
1/2 head
head
head head
Fair/Loade
L tosses d
L tosses Fair/Loade
d H1 H2 Hi HL-1 HL
X1 X2 Xi XL-1 XL
H1 H2 Hi HL-1 HL
Head/Tail
X1 X2 Xi XL-1 XL
Q1.: What is the probability of the sequence of observed
Head/Tail outcome (e.g. HHHTHTTHHT), given the model?
1513
Unit 1523
Unit
P q
P(O|M), the probability that the given model M DNA P q
C-G island T C
1533
Unit 1553
Unit
1563
Unit 1573
Unit
26
Ratio Test for CpC islands Empirical Evalation
• Given a sequence X1,…,Xn we compute the
likelihood ratio P ( X1 ,, Xn | )
S ( X1 , , Xn ) log
P ( X1 , , Xn | )
A Xi X i
log 1
i A X i Xi 1
Xi X i 1
i
1583
Unit 1593
Unit
Problems:
• How do we select N?
• What do we do when the window intersects the • A state “remembers” last nucleotide and the type of region
boundary of a CpG island? • A transition from a - state to a + describes a start of CpG island
1603
Unit 1613
Unit
27
HMM – Recognition (cont.) HMM – Recognition (cont.)
• Why isn’t it efficient? – O(2LQL)
P(O| M) = S P(O|Q) P(Q|M) – For a given state sequence of length L we have
about 2L calculations
= SQ Q1 bQ1O1 aQ1Q2 bQ2O2 aQ2Q3 bQ2O2 …
• P(Q|M) = Q aQ Q aQ Q aQ Q … aQ
1 1 2 2 3 3 4 T-1QT
• P(O|Q) = bQ O bQ O bQ O … bQ O
1 1 2 2 3 3 T T
1643
Unit 1653
Unit
1683
Unit 1693
Unit
28
HMM – Question III (Hardest) Learning
• Given an observation sequence O = (O1 O2 … OL), and
a class of models, each of the form M = {A,B,p}, which Given a sequence x1,…,xn, h1,…,hn
specific model “best” explains the observations?
• How do we learn Akl and Bka ?
• A solution to question I enables the efficient
computation of P(O|M) (the probability that a • We want to find parameters that maximize the
specific model M produces the observation O). likelihood P(x1,…,xn, h1,…,hn)
• Question III can be viewed as a learning problem: We We simply count:
want to use the sequence of observations in order to
• Nkl - number of times hi=k & hi+1=l
“train” an HMM and learn the optimal underlying
model parameters (transition and output • Nka - number of times hi=k & xi = a
probabilities). Nkl Nka
Akl Bka
1703
Unit
l'
Nkl '
a
Nka
'3
171
Unit
'
Learning
Given only sequence x1,…,xn • If we have Akl and Bka we can compute
• How do we learn Akl and Bka ? P (Hi k , Hi 1 l | x1 , , xn )
P (Hi k , Hi 1 l , x1 , , xn )
• We want to find parameters that maximize the P (x1 , , xn )
likelihood P(x1,…,xn) P (Hi k , x1 , , xi )P (xi 1 | Hi 1 l )P (xi 2 , , xn | Hi 1 l )
P (x1 , , xn )
Problem: fk (i )Blxi 1 bl (i 1 )
E [Nka ]
• Similarly B'ka
E [Nka ] P (Hi
i x a
k | x1 ,, xn )
a'
E [Nka ] '
, i • Reiterate
1743
Unit 1753
Unit
29
EM - basic properties Complexity of E-step
• P(x1,…,xn: Akl, Bka) P(x1,…,xn: A’kl, B’ka) • Compute forward and backward messages
– Likelihood grows in each iteration – Time & Space complexity: O(nL)
• Accumulate expected counts
• If P(x1,…,xn: Akl, Bka) = P(x1,…,xn: A’kl, B’ka) – Time complexity O(nL2)
then Akl, Bka is a stationary point of the – Space complexity O(L2)
likelihood
– either a local maxima, minima, or saddle point
1763
Unit 1773
Unit
Choosing L
• We often do not know how many hidden values
we should have or can learn
1783
Unit 1793
Unit
Thank You
30