
Machine Learning

Lesson 1
Probability Basics
• A probability is a quantitative measure of uncertainty - a number that
conveys the strength of our belief in the occurrence of an uncertain
event.
• It refers to the chance that an event will occur.
Probability: CONCEPTS
• BASIC CONCEPTS
• An experiment
• An experiment is a process that leads to one of several possible outcomes. An outcome
of an experiment is some observation or measurement.
• A sample space
• The sample space is the universal set S pertinent to a given experiment. It is the set of all
possible outcomes of an experiment.
• An event
• An event is a subset of a sample space. It is a set of basic outcomes.
Probability: APPROACHES
• APPROACHES TO PROBABILITY THEORY
• Situation I : THE CLASSICAL APPROACH
• Applies when the sample space has equally likely basic outcomes.

• P(A) = n(A) / N(S)
• n(A) = number of outcomes favorable to the event A.
• N(S) = total number of outcomes in the sample space S.
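As an illustrative sketch (not from the slides), the classical formula can be checked by simply counting outcomes, e.g. for a fair six-sided die:

# Classical probability: equally likely outcomes of a fair six-sided die
S = {1, 2, 3, 4, 5, 6}                    # sample space
A = {x for x in S if x % 2 == 0}          # event: an even number is rolled
print(len(A) / len(S))                    # 0.5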
Probability: APPROACHES
• APPROACHES TO PROBABILITY THEORY
• Situation II : THE RELATIVE FREQUENCY APPROACH
• Outcomes are not equally likely.
• The total number of possible outcomes is not known; we rely on observed frequencies.
• The probability of occurrence of an event is the ratio of the number of times the event occurs to the total number of trials.
• P(A) = n/N
• n = the number of times the event occurs.
• N = total number of trials.
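A minimal sketch (assuming a simulated fair coin toss) of how a relative frequency estimate approaches the underlying probability:

import random

# Relative frequency: estimate P(heads) from repeated coin-toss trials
N = 10_000                                        # total number of trials
n = sum(random.random() < 0.5 for _ in range(N))  # number of times the event occurs
print(n / N)                                      # close to 0.5, and closer as N grows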
Probability Basics
• Events that are certain to occur have probability 1.00. The probability
of the entire sample space S is equal to 1.00: P(S) = 1.00.
Probability Basics
• PROBABILITY DISTRIBUTIONS
• Sample space: the 8 equally likely outcomes of three births; let X be the number of one kind of outcome (say, girls) among the three births.
• P(X=0) = 1/8 since one out of 8 equally likely points leads to X = 0


• P(X=1) = 3/8 since three out of 8 equally likely points lead to X = 1
• P(X=2) = 3/8 since three out of 8 equally likely points lead to X = 2
• P(X=3) = 1/8 since one out of 8 equally likely points leads to X = 3
• A random variable is an uncertain quantity whose value depends on
chance.
Probability Basics

The probability distribution of a discrete random variable lists the probabilities of occurrence of different values of the random variable. For X above:

X       0       1       2       3
P(X)    0.125   0.375   0.375   0.125
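A short sketch (my own illustration, assuming the two birth outcomes are equally likely) that enumerates the 8 sample points and recovers the distribution of X:

from itertools import product
from collections import Counter

# The 8 equally likely outcomes of three births (G = girl, B = boy)
sample_space = list(product("GB", repeat=3))
counts = Counter(outcome.count("G") for outcome in sample_space)  # X = number of girls
for x in sorted(counts):
    print(x, counts[x] / len(sample_space))  # 0 0.125, 1 0.375, 2 0.375, 3 0.125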
Permutations
• Sample space is a set of arrangements or sequences (Permutations)
• The letters a,b,c,d,e,f are arranged at random to form a six-letter
word. we must use each letter once only.
• S = {fabcde, abcdfe, . . . , fedcba}
• Assign the same probability to each word (since the word is formed "at random").
• Count the number of ways that we can construct such a word – each
way corresponds to a unique word.
Permutations
• For any positive integer n, n factorial is defined as n! = n(n - 1)(n - 2) ……… 1.
• For n = 6: 6 x 5 x 4 x 3 x 2 x 1 = 720 possible words.
• Event A: the second letter is 'e' or 'a'.
• The second box can be filled in 2 ways (a or e), the first box in 5 ways (any remaining letter), and the remaining four boxes in 4 x 3 x 2 x 1 = 24 ways, so A contains 2 x 5 x 24 = 240 words.

P(A) = number of outcomes in A / number of outcomes in S
     = 240 / 720 = 1/3
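A brute-force check of this count (my own sketch) using itertools:

from itertools import permutations

# All 6! = 720 arrangements of the letters a..f
words = list(permutations("abcdef"))
# Event A: the second letter is 'e' or 'a'
A = [w for w in words if w[1] in ("a", "e")]
print(len(A), len(words), len(A) / len(words))  # 240 720 0.3333...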
Permutations
• Arrangements of length n using each symbol once and only once. This
product is denoted by n! (“n factorial”).
• n! = n(n- 1)(n- 2) ………1.
• Five applications arrive at a center on the same day, all written at different
times.
• What is the probability that they will be read in the order in which they
were written?
• Ans: 5! = (5)(4)(3)(2)(1) = 120 possible reading orders.
• Since there are 120 equally likely ways to order five applications, the probability of one particular order (the order in which the applications were written) is 1/120.
Permutations
• Permutations are the possible ordered selections of r objects out of a total of n objects. The number of permutations of n objects taken r at a time is denoted by nPr.

• nPr = n! / (n - r)! = n(n - 1)(n - 2) ……… (n - r + 1)

• 4 people are to be randomly chosen out of 10 people.


• 4 people are to be assigned to four interviewers.
• What are the possibilities?
• The first interviewer has 10 choices, the second 9 choices, the third 8, and the fourth 7. Thus,
there are (10)(9)(8)(7) = 5,040 selections.
• the probability of any predetermined assignment of 4 people out of a group of 10 is 1/5,040.
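A quick confirmation of the count (illustrative sketch, Python 3.8+):

import math

# Ordered selections (permutations) of 4 people out of 10
print(math.perm(10, 4))       # 5040
print(1 / math.perm(10, 4))   # probability of one predetermined assignment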
Combinations
• Randomly select a subset of three digits from the set {0; 1; 2; 3; 4; 5;
6; 7; 8; 9} so that the sample space is:
• S = {{1; 2; 3}; {0; 1; 3}; {0; 1; 4}; … ; {7; 8; 9}}
• Digits in each outcome are unique.
• Do not consider {1; 1; 2} to be a subset of S.
• Order of the elements in a subset is not relevant.
• Subsets {1; 2; 3} and {3; 1; 2} are the same.
Combinations
• Combinations are the possible selections of r items from a group of n items regardless of the order of selection. The number of combinations is denoted by nCr and is read "n choose r". We define the number of combinations of r out of n elements as:

• nCr = n! / (r! (n - r)!)

• Suppose that 3 out of the 10 members of the board of directors of a large corporation are to be randomly selected to serve on a particular task committee. How many possible selections are there?

• 10C3 = 10! / (3! 7!) = 120 possible selections.

• If the committee is chosen in a truly random fashion, what is the probability that the three committee members chosen will be the three senior board members? This is 1 combination out of a total of 120, so the answer is 1/120 = 0.00833.
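Again, a one-line check (sketch, Python 3.8+):

import math

print(math.comb(10, 3))       # 120 possible committees
print(1 / math.comb(10, 3))   # 0.00833... probability of one particular committee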
Combinations
• In a region, a telecommunication service provider offers the following services to its subscriber base of a million.
• Voice call.
• Video call.
• MMS. (Multimedia Messaging Service)
• SMS. (Short Message Service)
• Text messaging.

The possible combinations of services that a subscriber might avail are:
Combinations

The list of availing services:

The individual services: n=5 & r=1. 120/24 = 5.
{ Voice call (VC), Video call (VD), MMS, SMS, Text Msg (TM) }

The combinations of 2: n=5 & r=2. 120/12 = 10.
{ (VC,VD), (VC,MMS), (VC,SMS), (VC,TM), (VD,MMS), (VD,SMS), (VD,TM), (MMS,SMS), (MMS,TM), (SMS,TM) }

The combinations of 3: n=5 & r=3. 120/12 = 10.
{ (VC,VD,MMS), (VC,VD,SMS), (VC,VD,TM), (VC,MMS,SMS), (VC,MMS,TM), (VC,SMS,TM), (VD,MMS,SMS), (VD,MMS,TM), (VD,SMS,TM), (MMS,SMS,TM) }

The combinations of 4: n=5 & r=4. 120/24 = 5.
{ (VC,VD,MMS,SMS), (VC,VD,MMS,TM), (VC,VD,SMS,TM), (VC,MMS,SMS,TM), (VD,MMS,SMS,TM) }

The combinations of 5: n=5 & r=5. 120/120 = 1.
{ (VC,VD,MMS,SMS,TM) }

Total Combinations = 5 + 10 + 10 + 5 + 1 = 31.
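The same enumeration, sketched with itertools (my own illustration):

from itertools import combinations

services = ["VC", "VD", "MMS", "SMS", "TM"]
total = 0
for r in range(1, 6):
    combos = list(combinations(services, r))
    print(r, len(combos))                 # 1:5, 2:10, 3:10, 4:5, 5:1
    total += len(combos)
print("Total combinations =", total)      # 31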


Big-O Notation – Scaling Out
• Big-O notation is a way of quantifying the rate at which some quantity
grows.
• A brick: height 10 cm & width 20 cm.

• In order to build a 100 x 100 size wall, how many bricks are required? - O(n)?
Big-O Notation – Scaling Out
Max_value_combination(int N){   // {Voice call (VC), Video call (VD), MMS, SMS, Text Msg (TM)}
    finds_the_single_value_comb(N)
    finds_2_value_comb(N)
    finds_3_value_comb(N)
    finds_4_value_comb(N)
    finds_5_value_comb(N)
    ...
    finds_n_value_comb(N)
}

Subscriber   VC       VD       MMS      SMS      TM       Business Value
A            High     Medium   Low      Medium   High     Pos
B            Medium   High     Medium   Low      Medium   Pos
C            Medium   Low      Medium   Medium   High     Neg
...
N            High     Medium   Low      Medium   High     Pos
Big-O Notation – Scaling Out
finds_the_single_value_comb(int N){   // {Voice call (VC), Video call (VD), MMS, SMS, Text Msg (TM)}
    for(i = 0; i < N; i++){
        QueryDataForEachSingleComb()   // --- O(n) run time
    }
}

More practically: doubling the size of the input roughly doubles the runtime, so the input size and
the runtime have a linear (O(n)) relationship.
Big-O Notation – Scaling Out
finds_2_value_comb(int N){
    for(i = 0; i < N; i++){
        selectEachAttribute()          // --- O(n)
        for(j = 0; j < N; j++){
            QueryDataForEachComb()     // --- O(n)
        }
    }
}
Total run time: O(n^2)
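To make the scaling concrete, here is a small counting sketch (my own illustration, in Python): the number of query operations grows linearly when enumerating single values and quadratically when enumerating pairs.

# Count the "query" operations done for single values vs. pairs of values
def single_value_ops(n):
    return sum(1 for i in range(n))                     # O(n)

def two_value_ops(n):
    return sum(1 for i in range(n) for j in range(n))   # O(n^2)

for n in (10, 20, 40):
    print(n, single_value_ops(n), two_value_ops(n))
# Doubling n doubles the first count and quadruples the second.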
User-Driven and Data-Driven approaches
• User-Driven:
• Data is only one of several inputs to decision making, rather than the sole input.
• The other factors include preconceived objectives (ideas) and employee expertise.
• It is judgmental.

• Data-Driven:
• Data-driven means making decisions based solely on data.
• This approach potentially explores all possible solution states (the sample space).
• It may therefore reveal hidden truths or solution states that user expertise has left unexplored.
Machine learning
• Machine learning refers to the automated detection of meaningful
patterns in data.
• In contrast to more traditional uses of computers.
• Due to the complexity of the patterns that need to be detected (exploring the entire sample space - Big O), a human programmer cannot provide an explicit, finely detailed specification of how such tasks should be executed.
• Machine learning tools are concerned with providing programs with
the ability to “learn” and adapt.
Machine learning
• Tasks beyond Human Capabilities:
• With more and more available digitally recorded data, it becomes obvious
that there are treasures of meaningful information buried in data archives
that are way too large and too complex for humans to make sense of.
• Learning to detect meaningful patterns in large and complex data sets is a promising domain for programs that can learn and detect the predominant patterns.
Information Gain
• In order to define the information gain precisely, a measure
"ENTROPY" is used in information theory.
• ENTROPY: measures the level of impurity/homogeneity of a set.

• Entropy = - Σi pi log2(pi)
• pi - probability of class i.
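A small helper (my own sketch) implementing this definition; it reproduces the slide calculations below:

import math

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

print(entropy(["High", "High", "Medium", "Medium"]))  # 1.0   (VC values below)
print(entropy(["Pos", "Pos", "Neg", "Pos"]))          # ~0.811 (Business Value, the parent)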
Entropy
Subscriber   VC       VD       MMS      SMS      TM       Business Value
A            High     Medium   Low      Medium   High     Pos
B            Medium   High     Medium   Low      Medium   Pos
C            Medium   Low      Medium   Medium   Medium   Neg
D            High     Medium   Low      Low      High     Pos

• Attribute: VC
• PHigh = 2/4 = 0.5
• PMedium = 2/4 = 0.5
• Log2(PHigh) = Log2(0.5) = -1
• Log2(PMedium) = Log2(0.5) = -1
• Entropy = -(0.5)(-1) - (0.5)(-1) = 0.5 + 0.5 = 1
Entropy
(Training data as above.)

• Attribute: Business Value (Parent)
• PPos = 3/4 = 0.75
• PNeg = 1/4 = 0.25
• Log2(PPos) = Log2(0.75) = -0.415
• Log2(PNeg) = Log2(0.25) = -2
• Entropy = -(0.75)(-0.415) - (0.25)(-2) = 0.31125 + 0.5 = 0.81125
Information Gain
Information Gain = entropy(parent) - [weighted average entropy(children)]
(Training data as above.)

• Split on VC:
• VC = High: {A, D} -> Pos, Pos. Entropy = -1 log2(1) = 0.
• VC = Medium: {B, C} -> Pos, Neg.
  PPos = 1/2 = 0.5, PNeg = 1/2 = 0.5
  Log2(0.5) = -1
  Entropy = -(0.5)(-1) - (0.5)(-1) = 1

• (Weighted) average entropy of children (VC):
  (2/4) * 0 + (2/4) * 1 = 0.5

• Information Gain = 0.81125 - 0.5 = 0.31125


Information Gain
(Training data as above.)

• Split on VD:
• VD = Medium: {A, D} -> Pos, Pos. Entropy = -1 log2(1) = 0.
• VD = High: {B} -> Pos. Entropy = 0.
• VD = Low: {C} -> Neg. Entropy = 0.

• (Weighted) average entropy of children (VD):
  (2/4) * 0 + (1/4) * 0 + (1/4) * 0 = 0

• Information Gain = 0.81125 - 0 = 0.81125
Information Gain
(Training data as above.)

• Split on MMS:
• MMS = Low: {A, D} -> Pos, Pos. Entropy = -1 log2(1) = 0.
• MMS = Medium: {B, C} -> Pos, Neg.
  PPos = 0.5, PNeg = 0.5, Entropy = 1

• (Weighted) average entropy of children (MMS):
  (2/4) * 0 + (2/4) * 1 = 0.5

• Information Gain = 0.81125 - 0.5 = 0.31125


Information Gain
(Training data as above.)

• Split on SMS:
• SMS = Low: {B, D} -> Pos, Pos. Entropy = -1 log2(1) = 0.
• SMS = Medium: {A, C} -> Pos, Neg.
  PPos = 0.5, PNeg = 0.5, Entropy = 1

• (Weighted) average entropy of children (SMS):
  (2/4) * 0 + (2/4) * 1 = 0.5

• Information Gain = 0.81125 - 0.5 = 0.31125


Information Gain
(Training data as above.)

• Split on TM:
• TM = High: {A, D} -> Pos, Pos. Entropy = -1 log2(1) = 0.
• TM = Medium: {B, C} -> Pos, Neg.
  PPos = 0.5, PNeg = 0.5, Entropy = 1

• (Weighted) average entropy of children (TM):
  (2/4) * 0 + (2/4) * 1 = 0.5

• Information Gain = 0.81125 - 0.5 = 0.31125
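A compact sketch (my own) that reproduces these numbers for all five attributes:

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

# The four training records from the slides
data = [
    {"VC": "High",   "VD": "Medium", "MMS": "Low",    "SMS": "Medium", "TM": "High",   "BV": "Pos"},
    {"VC": "Medium", "VD": "High",   "MMS": "Medium", "SMS": "Low",    "TM": "Medium", "BV": "Pos"},
    {"VC": "Medium", "VD": "Low",    "MMS": "Medium", "SMS": "Medium", "TM": "Medium", "BV": "Neg"},
    {"VC": "High",   "VD": "Medium", "MMS": "Low",    "SMS": "Low",    "TM": "High",   "BV": "Pos"},
]

parent = entropy([r["BV"] for r in data])                 # ~0.811

for attr in ["VC", "VD", "MMS", "SMS", "TM"]:
    weighted = 0.0
    for value in set(r[attr] for r in data):
        subset = [r["BV"] for r in data if r[attr] == value]
        weighted += len(subset) / len(data) * entropy(subset)
    print(attr, round(parent - weighted, 5))
# VD -> ~0.811 (highest gain); VC, MMS, SMS, TM -> ~0.311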


Decision Tree
• A decision tree is a tree-like model.
• Nodes represent decision points.
• Branches represent the outcomes of those decisions.
• The decision points (node) are based on the values of the input
variables.
• The outcomes (branches) are the possible classifications.
Decision Tree
• Constructed by recursively partitioning the input data into subsets
based on the values of the input variables.
• Each partition corresponds to a node in the tree.
• The partitions are chosen so as to minimize the impurity or increase
the homogeneity of the resulting subsets.
• The stopping criterion could be a maximum depth for the tree, a minimum number of instances in each leaf node, or the point at which every branch of the decision tree ends in a leaf node.
Decision Tree
Full Training Set (as above), split on VD - the attribute with the highest information gain.

• When the VD value is "Medium" ({A, D}), the probability of Pos is 2/2 = 1.
• When the VD value is "High" ({B}), the probability of Pos is 1/1 = 1.
• When the VD value is "Low" ({C}), the probability of Neg is 1/1 = 1.

Services   Information Gain
VC         0.31125
VD         0.81125
MMS        0.31125
SMS        0.31125
TM         0.31125
Decision Tree (Example to explain the recursion)
Full Training Set (as above), this time split on VC to illustrate the recursion.

• When the VC value is "High" ({A, D}), the probability of Pos is 2/2 = 1 - a leaf node.
• When the VC value is "Medium", the subset data is {Rec B, Rec C} with P = 0.5, so this subset is split again, on VD:
  • VD = High ({B}): P = 1 (Pos).
  • VD = Low ({C}): P = 1 (Neg).

Services   Information Gain
VC         0.31125
VD         0.81125
MMS        0.31125
SMS        0.31125
TM         0.31125
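As a cross-check (not part of the slides), the same toy table can be fed to scikit-learn's DecisionTreeClassifier; the ordinal encoding Low=0, Medium=1, High=2 is an assumption made only for this sketch:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data encoded ordinally: Low=0, Medium=1, High=2 (encoding chosen for illustration)
X = [[2, 1, 0, 1, 2],   # A
     [1, 2, 1, 0, 1],   # B
     [1, 0, 1, 1, 1],   # C
     [2, 1, 0, 0, 2]]   # D
y = ["Pos", "Pos", "Neg", "Pos"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["VC", "VD", "MMS", "SMS", "TM"]))
# A single split (e.g. on VD) is enough to separate the one Neg record from the Pos records.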
Classification and Regression Trees (CART)
• CART: Ability to handle both categorical and continuous features.
• CART: Ability to capture non-linear relationships between features and the target variable.
• The CART algorithm produces only binary Trees: non-leaf nodes always have two children.
• In contrast, ID3 can produce decision trees whose nodes have more than two children.
• CART can be used to explain a continuous or categorical dependent variable in terms of
multiple independent variables. The independent variables can be continuous or categorical.
• The CART algorithm works to find the independent variable that creates the best
homogeneous group when splitting the data.

• The algorithm selects a feature (attribute) and a threshold that best separates the training
data into two groups. Gini Impurity is used.
Gini impurity
• Measure of the quality of a split in a decision tree algorithm.
• It measures the probability of misclassification of a random sample.

• Gini = Σi pi (1 - pi) = 1 - Σi pi², where pi is the probability of class i.

• The Gini index operates on categorical target variables in terms of "success" (Pos) or "failure" (Neg) and performs only binary splits.
• The Gini index is used by CART algorithms.
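A small helper (sketch) matching this definition:

def gini(labels):
    """Gini impurity: sum over classes of p * (1 - p)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(p * (1 - p) for p in probs)

print(gini(["Pos", "Pos", "Neg", "Pos"]))  # 0.375, as computed on the next slide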
GINI Impurity (Full Dataset)
Subscriber   VC    VD    MMS   SMS   TM    Business Value
A            800   500   100   510   900   Pos
B            500   950   300   100   550   Pos
C            490   200   300   480   490   Neg
D            900   480   200   190   950   Pos

• Number of classes C = 2.
• P(Pos) = 3/4 = 0.75
• P(Neg) = 1/4 = 0.25
• G = 0.75 * (1 - 0.75) + 0.25 * (1 - 0.25) = 0.1875 + 0.1875 = 0.375
Gini Gain: VC
(Numeric training data as above.)

• Split at VC > 500.
• Right branch (VC > 500): {A, D} -> Pos, Pos. Gini = 1 * (1 - 1) + 0 * (1 - 0) = 0.
• Left branch (VC <= 500): {B, C} -> Pos, Neg. Gini = 0.5 * (1 - 0.5) + 0.5 * (1 - 0.5) = 0.5.

• Weighting the impurity of each branch by how many elements it has:
  (2/4) * 0.5 + (2/4) * 0 = 0.25

• Thus, the amount of impurity we've "removed" with this split is:
  Gini Gain = 0.375 - 0.25 = 0.125


Gini Gain: VD
(Numeric training data as above.)

• Split at VD > 450.
• Right branch (VD > 450): {A, B, D} -> all Pos. Gini = 1 * (1 - 1) + 0 * (1 - 0) = 0.
• Left branch (VD <= 450): {C} -> Neg. Gini = 1 * (1 - 1) + 0 * (1 - 0) = 0.

• Weighting the impurity of each branch by how many elements it has:
  (1/4) * 0 + (3/4) * 0 = 0

• Thus, the amount of impurity we've "removed" with this split is:
  Gini Gain = 0.375 - 0 = 0.375


Gini Gain: MMS
(Numeric training data as above.)

• Split at MMS > 200.
• Right branch (MMS > 200): {B, C} -> Pos, Neg. Gini = 0.5 * (1 - 0.5) + 0.5 * (1 - 0.5) = 0.5.
• Left branch (MMS <= 200): {A, D} -> Pos, Pos. Gini = 1 * (1 - 1) + 0 * (1 - 0) = 0.

• Weighting the impurity of each branch by how many elements it has:
  (2/4) * 0.5 + (2/4) * 0 = 0.25

• Thus, the amount of impurity we've "removed" with this split is:
  Gini Gain = 0.375 - 0.25 = 0.125


Gini Gain: SMS
(Numeric training data as above.)

• Split at SMS > 400.
• Right branch (SMS > 400): {A, C} -> Pos, Neg. Gini = 0.5 * (1 - 0.5) + 0.5 * (1 - 0.5) = 0.5.
• Left branch (SMS <= 400): {B, D} -> Pos, Pos. Gini = 1 * (1 - 1) + 0 * (1 - 0) = 0.

• Weighting the impurity of each branch by how many elements it has:
  (2/4) * 0.5 + (2/4) * 0 = 0.25

• Thus, the amount of impurity we've "removed" with this split is:
  Gini Gain = 0.375 - 0.25 = 0.125


Gini Gain: TM
(Numeric training data as above.)

• Split at TM > 600.
• Right branch (TM > 600): {A, D} -> Pos, Pos. Gini = 1 * (1 - 1) + 0 * (1 - 0) = 0.
• Left branch (TM <= 600): {B, C} -> Pos, Neg. Gini = 0.5 * (1 - 0.5) + 0.5 * (1 - 0.5) = 0.5.

• Weighting the impurity of each branch by how many elements it has:
  (2/4) * 0.5 + (2/4) * 0 = 0.25

• Thus, the amount of impurity we've "removed" with this split is:
  Gini Gain = 0.375 - 0.25 = 0.125
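A sketch (my own) that computes the Gini gain of each of these thresholded splits on the numeric data:

def gini(labels):
    n = len(labels)
    return sum((labels.count(c) / n) * (1 - labels.count(c) / n) for c in set(labels))

rows = [  # VC, VD, MMS, SMS, TM, Business Value
    (800, 500, 100, 510, 900, "Pos"),
    (500, 950, 300, 100, 550, "Pos"),
    (490, 200, 300, 480, 490, "Neg"),
    (900, 480, 200, 190, 950, "Pos"),
]
names = ["VC", "VD", "MMS", "SMS", "TM"]
thresholds = {"VC": 500, "VD": 450, "MMS": 200, "SMS": 400, "TM": 600}

parent = gini([r[-1] for r in rows])                      # 0.375
for i, name in enumerate(names):
    t = thresholds[name]
    left = [r[-1] for r in rows if r[i] <= t]
    right = [r[-1] for r in rows if r[i] > t]
    weighted = len(left) / len(rows) * gini(left) + len(right) / len(rows) * gini(right)
    print(name, parent - weighted)                        # VD -> 0.375, others -> 0.125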


CART Decision Tree
(Numeric training data as above.)

Full Training Set: split on VD at 450 - the attribute and threshold with the highest Gini gain.

• When the VD value is <= 450 ({C}), the probability of Neg is 1/1 = 1.
• When the VD value is > 450 ({A, B, D}), the probability of Pos is 3/3 = 1.

Services   Gini Gain
VC         0.125
VD         0.375
MMS        0.125
SMS        0.125
TM         0.125
The Split Threshold choosing methods
• ChiMerge Discretization method.
• Learning Vector Quantization (LVQ) based method
• Histogram-based method.
• For a classification problem, where the response variable is categorical, the threshold is decided by calculating the information gain based on the entropy resulting from the split. For a numeric response, homogeneity is measured by statistics such as the standard deviation or variance.