DeepLearning.AI
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Andrew Ng
Decision Tree: New test example
[Figure: classifying a new test example with a decision tree. The root node tests Ear shape; its Pointy and Floppy branches lead to further decision nodes, and the leaf nodes at the bottom give the prediction (cat / not cat).]
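A minimal sketch (not from the slides) of how such a tree classifies a new example: start at the root and follow the branch that matches the example's feature value until a leaf is reached. The tree structure and feature names below are illustrative, matching the cat example.

# Illustrative tree: decision nodes are dicts, leaf nodes are strings.
tree = {
    "feature": "Ear shape",                                        # root node
    "branches": {
        "Pointy": {"feature": "Face shape",
                   "branches": {"Round": "Cat", "Not round": "Not cat"}},
        "Floppy": {"feature": "Whiskers",
                   "branches": {"Present": "Cat", "Absent": "Not cat"}},
    },
}

def classify(node, example):
    """Follow branches matching the example's feature values until a leaf is reached."""
    while isinstance(node, dict):                                  # decision node
        node = node["branches"][example[node["feature"]]]
    return node                                                    # leaf node: predicted class

new_example = {"Ear shape": "Pointy", "Face shape": "Round", "Whiskers": "Present"}
print(classify(tree, new_example))                                 # -> "Cat"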
Decision Tree
Fac e Fac e
E ar E ar s hape s hape
s hape s hape
C at N ot c at Ear
Not Cat
shape
Whiskers N ot c at Not cat Fac e
s hape
P ointy Floppy
P res ent A bs ent N ot round
Round
Andrew Ng
Decision Trees
Learning Process
Decision Tree Learning
[Figure: building the tree on the training set, step by step. All ten examples start at the root. Splitting on Ear shape sends the pointy-eared animals down the left branch and the floppy-eared animals down the right. The left branch is then split on Face shape: the round-faced group is 4/4 cats, so that leaf predicts Cat; the remaining group is 0/1 cats, so it predicts Not cat. The right branch is split on Whiskers in the same way.]
Decision Tree Learning
Decision 1: How to choose what feature to split on at each node?
Maximize purity (or minimize impurity)
[Figure: candidate splits compared by purity. An ideal feature would separate the examples into one branch that is all cats (yes) and one that is all not-cats (no); real features give more or less mixed branches.]
Decision Tree Learning
Decision 2: When do you stop splitting?
[Figure: a tree whose root node is at depth 0 and whose deepest decision nodes are at depth 2; one option is to stop splitting once the tree reaches a maximum depth.]
Decision Tree Learning
Decision 2: When do you stop splitting?
[Figure: a candidate split that would leave groups of 4/7 cats and 1/3 cats; if a split improves purity by too little, or leaves too few examples in a node, you can stop splitting.]
Decision Tree Learning
Measuring purity
Entropy as a measure of impurity
p1 = fraction of examples that are cats
6 dogs, 0 cats: p1 = 0 → H(p1) = 0
3 cats, 3 dogs: p1 = 3/6 → H(p1) = 1
6 cats, 0 dogs: p1 = 6/6 → H(p1) = 0
Entropy as a measure of impurity
p1 = fraction of examples that are cats
p0 = 1 − p1
H(p1) = −p1 log2(p1) − p0 log2(p0)
      = −p1 log2(p1) − (1 − p1) log2(1 − p1)
Note: "0 log(0)" is taken to be 0.
[Figure: the curve of H(p1) versus p1, rising from 0 at p1 = 0 to a peak of 1 at p1 = 0.5 and back to 0 at p1 = 1.]
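A small sketch of this entropy function (the name is my own), using the convention that 0 · log2(0) is treated as 0:

import numpy as np

def entropy(p1):
    """H(p1) = -p1*log2(p1) - (1 - p1)*log2(1 - p1), with 0*log2(0) taken to be 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.0))   # 0.0  (all dogs)
print(entropy(0.5))   # 1.0  (3 cats, 3 dogs)
print(entropy(1.0))   # 0.0  (all cats)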
Decision Tree Learning
Split on Ear shape: p1_left = 4/5 = 0.8, p1_right = 1/5 = 0.2; H(0.8) = 0.72, H(0.2) = 0.72
  Information gain = H(0.5) − (5/10 · H(0.8) + 5/10 · H(0.2)) = 0.28
Split on Face shape: p1_left = 4/7 = 0.57, p1_right = 1/3 = 0.33; H(0.57) = 0.99, H(0.33) = 0.92
  Information gain = H(0.5) − (7/10 · H(0.57) + 3/10 · H(0.33)) = 0.03
Split on Whiskers: p1_left = 3/4 = 0.75, p1_right = 2/6 = 0.33; H(0.75) = 0.81, H(0.33) = 0.92
  Information gain = H(0.5) − (4/10 · H(0.75) + 6/10 · H(0.33)) = 0.12
Splitting on Ear shape gives the highest information gain.
Information Gain
p1_root = 5/10 = 0.5
[Figure: the candidate split on Ear shape (Pointy / Floppy).]
Information gain = H(p1_root) − (w_left · H(p1_left) + w_right · H(p1_right))
p1_left = 4/5, w_left = 5/10
p1_right = 1/5, w_right = 5/10
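A minimal sketch of this information-gain computation (labels are 1 for cat, 0 for not cat; the function names are my own, and entropy follows the earlier sketch):

import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(y, left_idx, right_idx):
    """H(p1_root) - (w_left * H(p1_left) + w_right * H(p1_right)) for a candidate split."""
    p1_root = y.mean()
    w_left, w_right = len(left_idx) / len(y), len(right_idx) / len(y)
    return entropy(p1_root) - (w_left * entropy(y[left_idx].mean())
                               + w_right * entropy(y[right_idx].mean()))

# Ear-shape split from the slide: 4/5 cats on the left (Pointy), 1/5 on the right (Floppy).
y = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])          # 5 cats, 5 not-cats
left_idx, right_idx = np.arange(0, 5), np.arange(5, 10)
print(round(information_gain(y, left_idx, right_idx), 2))   # ~0.28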
Decision Tree Learning
Putting it together
Decision Tree Learning
• Start with all examples at the root node
• Calculate information gain for all possible features, and pick the one with the highest information gain
• Split the dataset according to the selected feature, and create left and right branches of the tree
• Keep repeating the splitting process until a stopping criterion is met:
  • When a node is 100% one class
  • When splitting a node would result in the tree exceeding a maximum depth
  • When the information gain from additional splits is less than a threshold
  • When the number of examples in a node is below a threshold
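A minimal, illustrative sketch of this procedure for binary (0/1) features and labels; real implementations (e.g. scikit-learn's DecisionTreeClassifier) are more sophisticated, so treat this only as an outline of the steps above. All names are my own.

import numpy as np

def entropy(p1):
    return 0.0 if p1 in (0, 1) else -p1*np.log2(p1) - (1 - p1)*np.log2(1 - p1)

def build_tree(X, y, depth=0, max_depth=2, min_gain=1e-9):
    """Recursively split on the feature with the highest information gain."""
    # Stopping criteria: node is 100% one class, or maximum depth reached.
    if y.mean() in (0, 1) or depth == max_depth:
        return {"leaf": int(round(y.mean()))}
    best_gain, best_j = 0.0, None
    for j in range(X.shape[1]):                          # try every feature
        left, right = y[X[:, j] == 1], y[X[:, j] == 0]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = entropy(y.mean()) - (len(left)/len(y)*entropy(left.mean())
                                    + len(right)/len(y)*entropy(right.mean()))
        if gain > best_gain:
            best_gain, best_j = gain, j
    # Stopping criterion: information gain too small to be worth a split.
    if best_j is None or best_gain < min_gain:
        return {"leaf": int(round(y.mean()))}
    mask = X[:, best_j] == 1
    return {"feature": best_j,
            "left":  build_tree(X[mask],  y[mask],  depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

# Tiny example: 4 examples, 2 binary features; feature 0 separates the classes perfectly.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
print(build_tree(X, y))   # {'feature': 0, 'left': {'leaf': 1}, 'right': {'leaf': 0}}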
Recursive splitting
[Figure: the tree built one node at a time. The root is split on Ear shape (Pointy / Floppy). The Pointy branch is split on Face shape: Round → Cat, Not round → Not cat. The Floppy branch is split on Whiskers: Present → Cat, Absent → Not cat.]
Recursive algorithm: after the first split, building each sub-tree is simply decision-tree learning applied to the subset of examples sent down that branch.
Decision Tree Learning
A categorical feature with 3 possible values (ear shape: pointy, floppy, or oval)
One hot encoding
[Table columns: Ear shape | Pointy ears | Floppy ears | Oval ears | Face shape | Whiskers | Cat]
If a categorical feature can take on k possible values, replace it with k binary (0/1) features; for each example, exactly one of them takes the value 1.
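A small sketch of one-hot encoding the ear-shape feature using pandas.get_dummies (the column and value names are illustrative):

import pandas as pd

# Illustrative data: the categorical "Ear shape" feature has 3 possible values.
df = pd.DataFrame({"Ear shape": ["Pointy", "Floppy", "Oval", "Pointy"],
                   "Whiskers":  ["Present", "Absent", "Present", "Absent"]})

# Replace "Ear shape" with 3 binary columns; exactly one of them is 1 per row.
encoded = pd.get_dummies(df, columns=["Ear shape"])
print(encoded)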
One hot encoding and neural networks
[Table: examples with ear shape one-hot encoded, e.g. Pointy ears = 1, Floppy ears = 0, Round ears = 0, Face shape = Round, Whiskers = Present, Cat = 1.]
Writing the remaining binary features (face shape, whiskers) as 0/1 values as well gives purely numeric inputs, so the data can also be fed to a neural network.
Decision Tree Learning
Continuous valued features
Splitting on a continuous variable
[Figure: the training examples plotted by weight (lbs.), with cat / not cat on the vertical axis, and candidate thresholds such as Weight ≤ 8 lbs. and Weight ≤ 9 lbs.]
Compute the information gain for each candidate threshold:
H(0.5) − ( (2/10)·H(2/2) + (8/10)·H(3/8) ) = 0.24
H(0.5) − ( (4/10)·H(4/4) + (6/10)·H(1/6) ) = 0.61
H(0.5) − ( (7/10)·H(5/7) + (3/10)·H(0/3) ) = 0.40
Split on the threshold that gives the highest information gain.
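A minimal sketch of picking a threshold for a continuous feature: try candidate thresholds (here the midpoints between sorted values) and keep the one with the highest information gain. The weights and labels below are illustrative, not the exact dataset from the slide.

import numpy as np

def entropy(p1):
    return 0.0 if p1 in (0, 1) else -p1*np.log2(p1) - (1 - p1)*np.log2(1 - p1)

def best_threshold(x, y):
    """Try midpoints between sorted unique values; return (threshold, information gain)."""
    vals = np.unique(x)                                  # sorted unique feature values
    best_t, best_gain = None, 0.0
    for t in (vals[:-1] + vals[1:]) / 2:                 # candidate thresholds (midpoints)
        left, right = y[x <= t], y[x > t]
        gain = entropy(y.mean()) - (len(left)/len(y)*entropy(left.mean())
                                    + len(right)/len(y)*entropy(right.mean()))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Illustrative weights (lbs.) and cat labels (1 = cat).
weights = np.array([7.2, 7.6, 8.4, 8.8, 9.2, 10.2, 11.0, 15.0, 18.0, 20.0])
labels  = np.array([1,   1,   1,   1,   1,   0,    0,    0,    0,    0  ])
print(best_threshold(weights, labels))   # best split falls between 9.2 and 10.2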
Decision Tree Learning
Regression Trees
Regression with Decision Trees
[Figure: a tree for predicting a number (the animal's weight) rather than a class. The root is split on Ear shape (Pointy / Floppy) and each branch is split on Face shape (Round / Not round); each leaf predicts the average weight of the training examples that end up in it.]
Choosing a split
Variance of the weights at the root node: 20.51
Split on Ear shape (Pointy / Floppy):
  left weights: 7.2, 9.2, 8.4, 7.6, 10.2 (variance 1.47); right weights: 8.8, 15, 11, 18, 20 (variance 21.87)
  w_left = 5/10, w_right = 5/10
  reduction in variance = 20.51 − (5/10 · 1.47 + 5/10 · 21.87) = 8.84
Split on Face shape (Round / Not round):
  left weights: 7.2, 15, 8.4, 7.6, 10.2, 18, 20 (variance 27.80); right weights: 8.8, 9.2, 11 (variance 1.37)
  w_left = 7/10, w_right = 3/10
  reduction in variance = 20.51 − (7/10 · 27.80 + 3/10 · 1.37) = 0.64
Split on Whiskers (Present / Absent):
  left weights: 7.2, 8.8, 9.2, 8.4 (variance 0.75); right weights: 15, 7.6, 11, 10.2, 18, 20 (variance 23.32)
  w_left = 4/10, w_right = 6/10
  reduction in variance = 20.51 − (4/10 · 0.75 + 6/10 · 23.32) = 6.22
Choose the split with the largest reduction in variance: Ear shape.
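A small sketch of the reduction-in-variance computation above (numbers taken from the slide; the function name is my own, and the slide's figures correspond to the sample variance, i.e. ddof=1):

import numpy as np

def variance_reduction(y, left_idx, right_idx):
    """variance(root) - (w_left * variance(left) + w_right * variance(right))."""
    w_left, w_right = len(left_idx) / len(y), len(right_idx) / len(y)
    return np.var(y, ddof=1) - (w_left * np.var(y[left_idx], ddof=1)
                                + w_right * np.var(y[right_idx], ddof=1))

# Weights from the slide, split on Ear shape (first 5 are the pointy-eared animals).
weights = np.array([7.2, 9.2, 8.4, 7.6, 10.2, 8.8, 15.0, 11.0, 18.0, 20.0])
left_idx, right_idx = np.arange(0, 5), np.arange(5, 10)
print(round(variance_reduction(weights, left_idx, right_idx), 2))   # ~8.84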
Tree ensembles
[Figure: two decision trees, one splitting on Ear shape at the root and one on Whiskers; a single tree can be very sensitive to small changes in the training data, which motivates building an ensemble of many trees.]
Tree ensemble: New test example
[Figure: an ensemble of three trees, rooted at Ear shape, Face shape, and Whiskers respectively. Each tree classifies the new test example on its own (e.g. Cat, Not cat, Cat); the ensemble's final prediction is the majority vote: Cat.]
Tree ensembles
Sampling with replacement
[Table: a new training set built by drawing 10 examples with replacement from the original dataset (columns: Ear shape, Face shape, Whiskers, Cat), e.g. (Pointy, Round, Present, 1); some original examples appear several times and others not at all.]
Tree ensembles
For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
Train a decision tree on the new dataset
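A minimal sketch of this bagged-ensemble procedure, here using scikit-learn decision trees (a library choice of my own; the slides do not prescribe one). Each tree is trained on a bootstrap sample of size m drawn with replacement, and the ensemble predicts by majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    """Train B trees, each on a sample of size m drawn with replacement from (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    trees = []
    for _ in range(B):
        idx = rng.choice(m, size=m, replace=True)      # sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    """Majority vote over the trees' predictions (binary labels assumed)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)

scikit-learn's RandomForestClassifier implements this idea directly, together with the randomized feature choice described on the next slide.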
Randomizing the feature choice
At each node, when choosing which feature to split on, if n features are available, pick only from a random subset of k < n features (a common choice is k = √n). Combined with sampling with replacement, this gives the random forest algorithm.
Tree ensembles
XGBoost
Boosted trees intuition
Given training set of size 𝑚
For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
But instead of picking from all examples with equal (1/m) probability, make it
more likely to pick examples that the previously trained trees misclassify
Train a decision tree on the new dataset
[Table: the training examples (Ear shape, Face shape, Whiskers, Cat) alongside the prediction of the trees trained so far, with each prediction marked correct (✅) or misclassified (❌); the misclassified examples are given a higher probability of being drawn when sampling the training set for the next tree.]
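A small sketch of the sampling intuition only (not how XGBoost is actually implemented; it uses gradient boosting rather than literal resampling): examples the current ensemble misclassifies get a higher probability of being drawn for the next tree's training set. The boost_factor value is an illustrative assumption.

import numpy as np

def boosted_sample(X, y, current_predictions, boost_factor=3.0, seed=0):
    """Sample m examples with replacement, weighting misclassified examples more heavily."""
    rng = np.random.default_rng(seed)
    misclassified = (current_predictions != y)
    weights = np.where(misclassified, boost_factor, 1.0)   # misclassified examples count ~3x
    probs = weights / weights.sum()
    idx = rng.choice(len(y), size=len(y), replace=True, p=probs)
    return X[idx], y[idx]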
XGBoost (eXtreme Gradient Boosting)
• Open-source implementation of boosted trees
• Fast, efficient implementation
• Good choice of default splitting criteria and criteria for when to stop splitting
• Built-in regularization to prevent overfitting
• Highly competitive algorithm for machine learning competitions (e.g. Kaggle competitions)
Using XGBoost
Classification and regression
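A typical usage sketch of the xgboost Python package for both tasks (the tiny dataset below is illustrative, not from the slides):

import numpy as np
from xgboost import XGBClassifier, XGBRegressor

# Tiny illustrative dataset: 3 binary features (e.g. one-hot ear shape), binary label.
X_train = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y_train = np.array([1, 0, 1, 0])

# Classification
clf = XGBClassifier(n_estimators=10)
clf.fit(X_train, y_train)
print(clf.predict(X_train))

# Regression (e.g. predicting weight): same API with XGBRegressor.
y_weights = np.array([9.2, 15.0, 8.4, 18.0])
reg = XGBRegressor(n_estimators=10)
reg.fit(X_train, y_weights)
print(reg.predict(X_train))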
Conclusion
[Figure: the iterative loop of machine learning development: train a model, run diagnostics (bias, variance and error analysis), adjust, and repeat.]
Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
• Work well on tabular (structured) data
• Not recommended for unstructured data (images, audio, text)
• Fast
• Small decision trees may be human interpretable
Neural Networks
• Work well on all types of data, including tabular (structured) and unstructured data
• May be slower than a decision tree
• Work with transfer learning
• When building a system of multiple models working together, it might be easier to string together multiple neural networks