DeepLearning.AI
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Andrew Ng
Decision Tree: New test example
[Figure: classifying a new test example with a decision tree. The root node tests Ear shape; its Pointy and Floppy branches lead to further decision nodes, and the leaf nodes at the bottom give the prediction (cat / not cat).]
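A minimal sketch (not from the slides) of how such a tree classifies a new example: start at the root and follow the branch that matches the example's feature value until a leaf is reached. The tree structure and feature names below are illustrative, matching the cat example.

# Illustrative tree: decision nodes are dicts, leaf nodes are strings.
tree = {
    "feature": "Ear shape",                                        # root node
    "branches": {
        "Pointy": {"feature": "Face shape",
                   "branches": {"Round": "Cat", "Not round": "Not cat"}},
        "Floppy": {"feature": "Whiskers",
                   "branches": {"Present": "Cat", "Absent": "Not cat"}},
    },
}

def classify(node, example):
    """Follow branches matching the example's feature values until a leaf is reached."""
    while isinstance(node, dict):                                  # decision node
        node = node["branches"][example[node["feature"]]]
    return node                                                    # leaf node: predicted class

new_example = {"Ear shape": "Pointy", "Face shape": "Round", "Whiskers": "Present"}
print(classify(tree, new_example))                                 # -> "Cat"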
Decision Tree
Fac e Fac e
E ar E ar s hape s hape
s hape s hape
C at N ot c at Ear
Not Cat
shape
Whiskers N ot c at Not cat Fac e
s hape
P ointy Floppy
P res ent A bs ent N ot round
Round
Andrew Ng
Decision Trees
Learning Process
Decision Tree Learning
[Figure: building the tree on the training set, step by step. All ten examples start at the root. Splitting on Ear shape sends the pointy-eared animals down the left branch and the floppy-eared animals down the right. The left branch is then split on Face shape: the round-faced group is 4/4 cats, so that leaf predicts Cat; the remaining group is 0/1 cats, so it predicts Not cat. The right branch is split on Whiskers in the same way.]
Decision Tree Learning
Decision 1: How to choose what feature to split on at each node?
Maximize purity (or minimize impurity)
[Figure: candidate splits compared by purity. An ideal feature would separate the examples into one branch that is all cats (yes) and one that is all not-cats (no); real features give more or less mixed branches.]
Decision Tree Learning
Decision 2: When do you stop splitting?
[Figure: a tree whose root node is at depth 0 and whose deepest decision nodes are at depth 2; one option is to stop splitting once the tree reaches a maximum depth.]
Decision Tree Learning
Decision 2: When do you stop splitting?
[Figure: a candidate split that would leave groups of 4/7 cats and 1/3 cats; if a split improves purity by too little, or leaves too few examples in a node, you can stop splitting.]
Decision Tree Learning
Measuring purity
Entropy as a measure of impurity
p1 = fraction of examples that are cats
6 dogs, 0 cats: p1 = 0 → H(p1) = 0
3 cats, 3 dogs: p1 = 3/6 → H(p1) = 1
6 cats, 0 dogs: p1 = 6/6 → H(p1) = 0
Entropy as a measure of impurity
p1 = fraction of examples that are cats
p0 = 1 − p1
H(p1) = −p1 log2(p1) − p0 log2(p0)
      = −p1 log2(p1) − (1 − p1) log2(1 − p1)
Note: "0 log(0)" is taken to be 0.
[Figure: the curve of H(p1) versus p1, rising from 0 at p1 = 0 to a peak of 1 at p1 = 0.5 and back to 0 at p1 = 1.]
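A small sketch of this entropy function (the name is my own), using the convention that 0 · log2(0) is treated as 0:

import numpy as np

def entropy(p1):
    """H(p1) = -p1*log2(p1) - (1 - p1)*log2(1 - p1), with 0*log2(0) taken to be 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.0))   # 0.0  (all dogs)
print(entropy(0.5))   # 1.0  (3 cats, 3 dogs)
print(entropy(1.0))   # 0.0  (all cats)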
Decision Tree Learning
Split on Ear shape: p1_left = 4/5 = 0.8, p1_right = 1/5 = 0.2; H(0.8) = 0.72, H(0.2) = 0.72
  Information gain = H(0.5) − (5/10 · H(0.8) + 5/10 · H(0.2)) = 0.28
Split on Face shape: p1_left = 4/7 = 0.57, p1_right = 1/3 = 0.33; H(0.57) = 0.99, H(0.33) = 0.92
  Information gain = H(0.5) − (7/10 · H(0.57) + 3/10 · H(0.33)) = 0.03
Split on Whiskers: p1_left = 3/4 = 0.75, p1_right = 2/6 = 0.33; H(0.75) = 0.81, H(0.33) = 0.92
  Information gain = H(0.5) − (4/10 · H(0.75) + 6/10 · H(0.33)) = 0.12
Splitting on Ear shape gives the highest information gain.
Information Gain
p1_root = 5/10 = 0.5
[Figure: the candidate split on Ear shape (Pointy / Floppy).]
Information gain = H(p1_root) − (w_left · H(p1_left) + w_right · H(p1_right))
p1_left = 4/5, w_left = 5/10
p1_right = 1/5, w_right = 5/10
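A minimal sketch of this information-gain computation (labels are 1 for cat, 0 for not cat; the function names are my own, and entropy follows the earlier sketch):

import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(y, left_idx, right_idx):
    """H(p1_root) - (w_left * H(p1_left) + w_right * H(p1_right)) for a candidate split."""
    p1_root = y.mean()
    w_left, w_right = len(left_idx) / len(y), len(right_idx) / len(y)
    return entropy(p1_root) - (w_left * entropy(y[left_idx].mean())
                               + w_right * entropy(y[right_idx].mean()))

# Ear-shape split from the slide: 4/5 cats on the left (Pointy), 1/5 on the right (Floppy).
y = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])          # 5 cats, 5 not-cats
left_idx, right_idx = np.arange(0, 5), np.arange(5, 10)
print(round(information_gain(y, left_idx, right_idx), 2))   # ~0.28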
Decision Tree Learning
Putting it together
Decision Tree Learning
• Start with all examples at the root node
• Calculate information gain for all possible features, and pick the one with the highest information gain
• Split the dataset according to the selected feature, and create left and right branches of the tree
• Keep repeating the splitting process until a stopping criterion is met:
  • When a node is 100% one class
  • When splitting a node would result in the tree exceeding a maximum depth
  • When the information gain from additional splits is less than a threshold
  • When the number of examples in a node is below a threshold
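A minimal, illustrative sketch of this procedure for binary (0/1) features and labels; real implementations (e.g. scikit-learn's DecisionTreeClassifier) are more sophisticated, so treat this only as an outline of the steps above. All names are my own.

import numpy as np

def entropy(p1):
    return 0.0 if p1 in (0, 1) else -p1*np.log2(p1) - (1 - p1)*np.log2(1 - p1)

def build_tree(X, y, depth=0, max_depth=2, min_gain=1e-9):
    """Recursively split on the feature with the highest information gain."""
    # Stopping criteria: node is 100% one class, or maximum depth reached.
    if y.mean() in (0, 1) or depth == max_depth:
        return {"leaf": int(round(y.mean()))}
    best_gain, best_j = 0.0, None
    for j in range(X.shape[1]):                          # try every feature
        left, right = y[X[:, j] == 1], y[X[:, j] == 0]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = entropy(y.mean()) - (len(left)/len(y)*entropy(left.mean())
                                    + len(right)/len(y)*entropy(right.mean()))
        if gain > best_gain:
            best_gain, best_j = gain, j
    # Stopping criterion: information gain too small to be worth a split.
    if best_j is None or best_gain < min_gain:
        return {"leaf": int(round(y.mean()))}
    mask = X[:, best_j] == 1
    return {"feature": best_j,
            "left":  build_tree(X[mask],  y[mask],  depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

# Tiny example: 4 examples, 2 binary features; feature 0 separates the classes perfectly.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
print(build_tree(X, y))   # {'feature': 0, 'left': {'leaf': 1}, 'right': {'leaf': 0}}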
Recursive splitting
[Figure: the tree built one node at a time. The root is split on Ear shape (Pointy / Floppy). The Pointy branch is split on Face shape: Round → Cat, Not round → Not cat. The Floppy branch is split on Whiskers: Present → Cat, Absent → Not cat.]
Recursive algorithm: after the first split, building each sub-tree is simply decision-tree learning applied to the subset of examples sent down that branch.
Decision Tree Learning
A categorical feature with 3 possible values (ear shape: pointy, floppy, or oval)
One hot encoding
[Table columns: Ear shape | Pointy ears | Floppy ears | Oval ears | Face shape | Whiskers | Cat]
If a categorical feature can take on k possible values, replace it with k binary (0/1) features; for each example, exactly one of them takes the value 1.
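A small sketch of one-hot encoding the ear-shape feature using pandas.get_dummies (the column and value names are illustrative):

import pandas as pd

# Illustrative data: the categorical "Ear shape" feature has 3 possible values.
df = pd.DataFrame({"Ear shape": ["Pointy", "Floppy", "Oval", "Pointy"],
                   "Whiskers":  ["Present", "Absent", "Present", "Absent"]})

# Replace "Ear shape" with 3 binary columns; exactly one of them is 1 per row.
encoded = pd.get_dummies(df, columns=["Ear shape"])
print(encoded)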
One hot encoding and neural networks
[Table: examples with ear shape one-hot encoded, e.g. Pointy ears = 1, Floppy ears = 0, Round ears = 0, Face shape = Round, Whiskers = Present, Cat = 1.]
Writing the remaining binary features (face shape, whiskers) as 0/1 values as well gives purely numeric inputs, so the data can also be fed to a neural network.
Decision Tree Learning
Continuous valued features
Splitting on a continuous variable
[Figure: the training examples plotted by weight (lbs.), with cat / not cat on the vertical axis, and candidate thresholds such as Weight ≤ 8 lbs. and Weight ≤ 9 lbs.]
Compute the information gain for each candidate threshold:
H(0.5) − ( (2/10)·H(2/2) + (8/10)·H(3/8) ) = 0.24
H(0.5) − ( (4/10)·H(4/4) + (6/10)·H(1/6) ) = 0.61
H(0.5) − ( (7/10)·H(5/7) + (3/10)·H(0/3) ) = 0.40
Split on the threshold that gives the highest information gain.
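A minimal sketch of picking a threshold for a continuous feature: try candidate thresholds (here the midpoints between sorted values) and keep the one with the highest information gain. The weights and labels below are illustrative, not the exact dataset from the slide.

import numpy as np

def entropy(p1):
    return 0.0 if p1 in (0, 1) else -p1*np.log2(p1) - (1 - p1)*np.log2(1 - p1)

def best_threshold(x, y):
    """Try midpoints between sorted unique values; return (threshold, information gain)."""
    vals = np.unique(x)                                  # sorted unique feature values
    best_t, best_gain = None, 0.0
    for t in (vals[:-1] + vals[1:]) / 2:                 # candidate thresholds (midpoints)
        left, right = y[x <= t], y[x > t]
        gain = entropy(y.mean()) - (len(left)/len(y)*entropy(left.mean())
                                    + len(right)/len(y)*entropy(right.mean()))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Illustrative weights (lbs.) and cat labels (1 = cat).
weights = np.array([7.2, 7.6, 8.4, 8.8, 9.2, 10.2, 11.0, 15.0, 18.0, 20.0])
labels  = np.array([1,   1,   1,   1,   1,   0,    0,    0,    0,    0  ])
print(best_threshold(weights, labels))   # best split falls between 9.2 and 10.2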
Decision Tree Learning
Regression Trees
Regression with Decision Trees
[Figure: a tree for predicting a number (the animal's weight) rather than a class. The root is split on Ear shape (Pointy / Floppy) and each branch is split on Face shape (Round / Not round); each leaf predicts the average weight of the training examples that end up in it.]
Choosing a split
Variance of the weights at the root node: 20.51
Split on Ear shape (Pointy / Floppy):
  left weights: 7.2, 9.2, 8.4, 7.6, 10.2 (variance 1.47); right weights: 8.8, 15, 11, 18, 20 (variance 21.87)
  w_left = 5/10, w_right = 5/10
  reduction in variance = 20.51 − (5/10 · 1.47 + 5/10 · 21.87) = 8.84
Split on Face shape (Round / Not round):
  left weights: 7.2, 15, 8.4, 7.6, 10.2, 18, 20 (variance 27.80); right weights: 8.8, 9.2, 11 (variance 1.37)
  w_left = 7/10, w_right = 3/10
  reduction in variance = 20.51 − (7/10 · 27.80 + 3/10 · 1.37) = 0.64
Split on Whiskers (Present / Absent):
  left weights: 7.2, 8.8, 9.2, 8.4 (variance 0.75); right weights: 15, 7.6, 11, 10.2, 18, 20 (variance 23.32)
  w_left = 4/10, w_right = 6/10
  reduction in variance = 20.51 − (4/10 · 0.75 + 6/10 · 23.32) = 6.22
Choose the split with the largest reduction in variance: Ear shape.
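A small sketch of the reduction-in-variance computation above (numbers taken from the slide; the function name is my own, and the slide's figures correspond to the sample variance, i.e. ddof=1):

import numpy as np

def variance_reduction(y, left_idx, right_idx):
    """variance(root) - (w_left * variance(left) + w_right * variance(right))."""
    w_left, w_right = len(left_idx) / len(y), len(right_idx) / len(y)
    return np.var(y, ddof=1) - (w_left * np.var(y[left_idx], ddof=1)
                                + w_right * np.var(y[right_idx], ddof=1))

# Weights from the slide, split on Ear shape (first 5 are the pointy-eared animals).
weights = np.array([7.2, 9.2, 8.4, 7.6, 10.2, 8.8, 15.0, 11.0, 18.0, 20.0])
left_idx, right_idx = np.arange(0, 5), np.arange(5, 10)
print(round(variance_reduction(weights, left_idx, right_idx), 2))   # ~8.84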
Tree ensembles
[Figure: two decision trees, one splitting on Ear shape at the root and one on Whiskers; a single tree can be very sensitive to small changes in the training data, which motivates building an ensemble of many trees.]
Tree ensemble: New test example
[Figure: an ensemble of three trees, rooted at Ear shape, Face shape, and Whiskers respectively. Each tree classifies the new test example on its own (e.g. Cat, Not cat, Cat); the ensemble's final prediction is the majority vote: Cat.]
Tree ensembles
Sampling with replacement
[Table: a new training set built by drawing 10 examples with replacement from the original dataset (columns: Ear shape, Face shape, Whiskers, Cat), e.g. (Pointy, Round, Present, 1); some original examples appear several times and others not at all.]
Tree ensembles
For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
Train a decision tree on the new dataset
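A minimal sketch of this bagged-ensemble procedure, here using scikit-learn decision trees (a library choice of my own; the slides do not prescribe one). Each tree is trained on a bootstrap sample of size m drawn with replacement, and the ensemble predicts by majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    """Train B trees, each on a sample of size m drawn with replacement from (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    trees = []
    for _ in range(B):
        idx = rng.choice(m, size=m, replace=True)      # sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    """Majority vote over the trees' predictions (binary labels assumed)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)

scikit-learn's RandomForestClassifier implements this idea directly, together with the randomized feature choice described on the next slide.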
Randomizing the feature choice
At each node, when choosing which feature to split on, if n features are available, pick only from a random subset of k < n features (a common choice is k = √n). Combined with sampling with replacement, this gives the random forest algorithm.
Tree ensembles
XGBoost
Boosted trees intuition
Given training set of size 𝑚
For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
But instead of picking from all examples with equal (1/m) probability, make it
more likely to pick examples that the previously trained trees misclassify
Train a decision tree on the new dataset
[Table: the training examples (Ear shape, Face shape, Whiskers, Cat) alongside the prediction of the trees trained so far, with each prediction marked correct (✅) or misclassified (❌); the misclassified examples are given a higher probability of being drawn when sampling the training set for the next tree.]
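A small sketch of the sampling intuition only (not how XGBoost is actually implemented; it uses gradient boosting rather than literal resampling): examples the current ensemble misclassifies get a higher probability of being drawn for the next tree's training set. The boost_factor value is an illustrative assumption.

import numpy as np

def boosted_sample(X, y, current_predictions, boost_factor=3.0, seed=0):
    """Sample m examples with replacement, weighting misclassified examples more heavily."""
    rng = np.random.default_rng(seed)
    misclassified = (current_predictions != y)
    weights = np.where(misclassified, boost_factor, 1.0)   # misclassified examples count ~3x
    probs = weights / weights.sum()
    idx = rng.choice(len(y), size=len(y), replace=True, p=probs)
    return X[idx], y[idx]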
XGBoost (eXtreme Gradient Boosting)
• Open-source implementation of boosted trees
• Fast, efficient implementation
• Good choice of default splitting criteria and criteria for when to stop splitting
• Built-in regularization to prevent overfitting
• Highly competitive algorithm for machine learning competitions (e.g. Kaggle competitions)
Using XGBoost
Classification and regression
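A typical usage sketch of the xgboost Python package for both tasks (the tiny dataset below is illustrative, not from the slides):

import numpy as np
from xgboost import XGBClassifier, XGBRegressor

# Tiny illustrative dataset: 3 binary features (e.g. one-hot ear shape), binary label.
X_train = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y_train = np.array([1, 0, 1, 0])

# Classification
clf = XGBClassifier(n_estimators=10)
clf.fit(X_train, y_train)
print(clf.predict(X_train))

# Regression (e.g. predicting weight): same API with XGBRegressor.
y_weights = np.array([9.2, 15.0, 8.4, 18.0])
reg = XGBRegressor(n_estimators=10)
reg.fit(X_train, y_weights)
print(reg.predict(X_train))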
Conclusion
[Figure: the iterative loop of machine learning development: train a model, run diagnostics (bias, variance and error analysis), adjust, and repeat.]
Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
• Work well on tabular (structured) data
• Not recommended for unstructured data (images, audio, text)
• Fast
• Small decision trees may be human interpretable
Neural Networks
• Work well on all types of data, including tabular (structured) and unstructured data
• May be slower than a decision tree
• Work with transfer learning
• When building a system of multiple models working together, it might be easier to string together multiple neural networks