Note 5



Lecture Notes #5: Neural Nets

(Textbook reading: Chapter 11)

Basic Idea
 Combine input information in a complex & flexible neural net “model”

 Model “coefficients” are continually tweaked in an iterative process

 The network’s interim performance in classification and prediction informs successive adjustments


Network Structure
 Multiple layers
1. Input layer (raw observations)
2. Hidden layers
3. Output layer
 Nodes
 Weights (like coefficients, subject to iterative adjustment)
 Bias values (also like coefficients, and likewise subject to iterative adjustment)

Schematic Diagram

Example – Using fat & salt content to predict consumer acceptance of cheese

Circles are nodes, wij on arrows are weights, and ϴj are node bias values

Tiny Example – Data

Moving Through the Network

The Input Layer

For input layer, input = output

E.g., for record #1:

Fat input = output = 0.2
Salt input = output = 0.9

Output of input layer = input into hidden layer

The Hidden Layer
 In this example, it has 3 nodes
 Each node receives as input the output of all input nodes
 Output of each hidden node is some function of the weighted sum of inputs

$$\mathrm{output}_j = g\left(\theta_j + \sum_i w_{ij} x_i\right)$$


The Weights
 The weights θ (theta) and w are typically initialized to random values in the range -0.05 to +0.05
 Equivalent to a model with random prediction (in other words, no predictive value)
 These initial weights are used in the first round of training

Output of Node 3 if g is a Logistic Function


$$\mathrm{output}_j = g\left(\theta_j + \sum_i w_{ij} x_i\right)$$

$$\mathrm{output}_3 = \frac{1}{1 + e^{-[-0.3 + (0.05)(0.2) + (0.01)(0.9)]}} = 0.43$$
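The node 3 calculation can be checked numerically. The following is a minimal sketch in base R; the names logistic, theta3, w and x are illustrative choices rather than anything defined in the notes, while the numbers are the ones shown in the formula above.

```r
# Logistic activation g(s) = 1 / (1 + e^(-s))
logistic <- function(s) 1 / (1 + exp(-s))

theta3 <- -0.3                      # bias value of hidden node 3
w <- c(fat = 0.05, salt = 0.01)     # weights on the arrows into node 3
x <- c(fat = 0.2, salt = 0.9)       # inputs for record #1

# Weighted sum of the inputs plus the bias, passed through g
output3 <- logistic(theta3 + sum(w * x))
print(round(output3, 2))            # 0.43
```

Because w and x are vectors, the same line works unchanged for a node with more than two inputs.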

Initial Pass of the Network

Node outputs (bold) using first record in this example, and logistic function

Output Layer
 The output of the last hidden layer becomes input for the output layer
 Uses same function as above, i.e. a function g of the weighted sum

Mapping the output to a classification

 Output = 0.506 for “like” and 0.481 for “dislike”
 So classification, at this early stage, is “like”, using a cut-off probability of 0.5

Relation to Linear Regression

A net with a single output node and no hidden layers, where g is the identity function, takes the
same form as a linear regression model

$$y = \theta + \sum_i w_i x_i$$
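To make the correspondence concrete, the sketch below fits an ordinary linear regression and then reproduces its fitted values as "bias plus weighted sum of inputs" with an identity function g. The choice of the built-in mtcars data and of the predictors wt and hp is an assumption made purely for illustration.

```r
# Fit an ordinary linear regression on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

theta <- coef(fit)[["(Intercept)"]]   # intercept plays the role of the bias
w <- coef(fit)[c("wt", "hp")]         # slopes play the role of the weights

identity_g <- function(s) s           # g is the identity function
X <- as.matrix(mtcars[, c("wt", "hp")])

# Output of a net with one output node and no hidden layer
net_output <- identity_g(theta + as.vector(X %*% w))

# Should agree with the regression's fitted values
print(all.equal(net_output, unname(fitted(fit))))
```

The only difference in practice is how the coefficients are found: least squares solves for them directly, whereas a network would reach them iteratively.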

Training the Model

Preprocessing Steps
 Scale variables to 0-1
 Categorical variables
 If equidistant categories, map to equidistant interval points in 0-1 range
 Otherwise, create dummy variables
 Transform (e.g., log) skewed variables
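The preprocessing steps above might look like this in base R. This is only a sketch: the data frame raw and its columns income, size and fuel are hypothetical, chosen to show one skewed numeric variable, one ordered categorical variable and one unordered categorical variable.

```r
# Hypothetical raw data (not from the notes)
raw <- data.frame(
  income = c(20000, 35000, 50000, 400000),           # right-skewed numeric
  size   = factor(c("small", "medium", "large", "medium"),
                  levels = c("small", "medium", "large")),  # ordered categories
  fuel   = factor(c("petrol", "diesel", "cng", "petrol"))   # unordered categories
)

# Rescale a numeric vector to the 0-1 range
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Equidistant categories: map level positions to equidistant points in 0-1
size01 <- (as.integer(raw$size) - 1) / (nlevels(raw$size) - 1)

# Non-equidistant categories: one dummy column per level
fuel_dummies <- model.matrix(~ fuel - 1, data = raw)

# Skewed variable: log-transform first, then rescale to 0-1
income01 <- rescale01(log(raw$income))

prepared <- data.frame(income01, size01, fuel_dummies)
print(prepared)
```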

Initial Pass Through Network

 Goal: Find weights that yield best predictions
 The process we described above is repeated for all records

 At each record compare prediction to actual
 Difference is the error for the output node
 Error is propagated back and distributed to all the hidden nodes and used to update their weights

Back Propagation (“back-prop”)

 Output from output node k: $\hat{y}_k$
 Error associated with that node: $err_k = \hat{y}_k (1 - \hat{y}_k)(y_k - \hat{y}_k)$

Note: this is like ordinary error, multiplied by a correction factor

Error is used to Update Weights

$$\theta_j^{new} = \theta_j^{old} + l \cdot err_j$$

$$w_{ij}^{new} = w_{ij}^{old} + l \cdot err_j$$

l = constant between 0 and 1, reflects the “learning rate” or “weight decay parameter”
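A single back-propagation update for an output node can be sketched as follows. The sketch assumes a logistic output node, for which the correction factor is ŷ(1 - ŷ). The prediction 0.506 and the actual value 1 ("like") come from record #1 above; the learning rate l = 0.5 and the starting values theta_old and w_old are hypothetical.

```r
y_hat <- 0.506   # network output for "like" on record #1 (from the notes)
y     <- 1       # actual value: record #1 is "like"

# Ordinary error (y - y_hat) multiplied by the logistic correction factor
err <- y_hat * (1 - y_hat) * (y - y_hat)

l         <- 0.5                    # learning rate (assumed value)
theta_old <- -0.015                 # hypothetical current bias of the node
w_old     <- c(0.01, 0.05, 0.015)   # hypothetical current weights into the node

# Additive updates, following the update equations in these notes
theta_new <- theta_old + l * err
w_new     <- w_old + l * err

print(round(c(err = err, theta_new = theta_new), 4))
print(round(w_new, 4))
```

Since the prediction was too low (0.506 against an actual value of 1), the error is positive and the bias and weights all move upward.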

Why It Works
 Big errors lead to big changes in weights
 Small errors leave weights relatively unchanged
 Over thousands of updates, a given weight keeps changing until the error associated with
that weight is negligible, at which point weights change little
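The shrinking-update behaviour can be seen in a toy loop. This sketch trains a single logistic node on one record only (record #1, actual value 1), using the additive update rule from these notes; the starting values of zero, the learning rate of 0.5 and the 1000 repetitions are arbitrary assumptions.

```r
logistic <- function(s) 1 / (1 + exp(-s))

x <- c(fat = 0.2, salt = 0.9)   # record #1
y <- 1                          # actual value ("like")

theta <- 0                      # assumed starting bias
w <- c(0, 0)                    # assumed starting weights
l <- 0.5                        # assumed learning rate

n_updates <- 1000
errs <- numeric(n_updates)

for (k in seq_len(n_updates)) {
  y_hat <- logistic(theta + sum(w * x))
  err <- y_hat * (1 - y_hat) * (y - y_hat)  # error with correction factor
  theta <- theta + l * err                  # big error -> big change
  w <- w + l * err                          # small error -> small change
  errs[k] <- err
}

# Error (and hence the size of each update) at selected iterations
print(round(errs[c(1, 10, 100, 1000)], 5))
```

The first update is the largest, and later updates become progressively smaller as the prediction approaches the actual value.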

R Functions for Neural Nets

neuralnet (used here)
nnet (supports only a single hidden layer)

#### Table 11.2

> df <- read.csv("tinydata.csv")
> df
Obs. Fat Salt Acceptance
1 1 0.2 0.9 like
2 2 0.1 0.1 dislike
3 3 0.2 0.4 dislike
4 4 0.2 0.5 dislike
5 5 0.4 0.5 like
6 6 0.3 0.8 like
> df$Like <- df$Acceptance=="like"
> df$Dislike <- df$Acceptance=="dislike"

> class(df$Like)
[1] "logical"
> class(df$Dislike)
[1] "logical"
> df
Obs. Fat Salt Acceptance Like Dislike
1 1 0.2 0.9 like TRUE FALSE
2 2 0.1 0.1 dislike FALSE TRUE
3 3 0.2 0.4 dislike FALSE TRUE
4 4 0.2 0.5 dislike FALSE TRUE
5 5 0.4 0.5 like TRUE FALSE
6 6 0.3 0.8 like TRUE FALSE
> library(neuralnet)
> set.seed(1)
> # Multiclass classification
> # Like + Dislike are 2 logical classes; can extend to multi class
> nn <- neuralnet(Like + Dislike ~ Salt + Fat, data = df,
+                 linear.output = FALSE, hidden = 3)
> # hidden = 3 indicates 1 hidden layer with 3 nodes; the syntax hidden = c(3, 4)
> # would mean 2 layers, 3 nodes in the first, 4 in the second

> # display weights
> nn$weights
[,1] [,2] [,3]
[1,] -1.061143694 3.057021840 3.337952001
[2,] 2.326024132 -3.408663181 -4.293213530
[3,] 4.106434697 -6.525668384 -5.929418648

[,1] [,2]
[1,]-0.3495332882 -1.677855862
[2,] 5.8777145665 -3.606625360
[3,]-5.3529200726 5.620329700
[4,]-6.1115038896 6.696286857
> # display predictions
> prediction(nn)
Data Error: 0;
Salt Fat Like Dislike
1 0.1 0.1 0.0002415535993 0.99965512479
2 0.4 0.2 0.0344215786564 0.96556787694
3 0.5 0.2 0.1248666747740 0.87816827940
4 0.9 0.2 0.9349452648141 0.07022732257
5 0.8 0.3 0.9591361793188 0.04505630529
6 0.5 0.4 0.8841904620140 0.12672437721

Salt Fat Like Dislike
1 0.1 0.1 0 1
2 0.4 0.2 0 1
3 0.5 0.2 0 1
4 0.9 0.2 1 0
5 0.8 0.3 1 0
6 0.5 0.4 1 0

# plot network
plot(nn, rep="best")

#neuralnet(formula, data, hidden = 1, threshold = 0.01,
# stepmax = 1e+05, rep = 1, startweights = NULL,
# learningrate.limit = NULL, learningrate.factor = list(minus = 0.5,
# plus = 1.2), learningrate = NULL, lifesign = "none",
# lifesign.step = 1000, algorithm = "rprop+", err.fct = "sse",
# act.fct = "logistic", linear.output = TRUE, exclude = NULL,
# constant.weights = NULL, likelihood = FALSE)

Arguments of plot() for a neuralnet object:
x: an object of class nn
rep: repetition of the neural network. If rep="best", the repetition
with the smallest error will be plotted. If not stated all repetitions
will be plotted, each in a separate window.

### Table 11.3
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
Warning message:
package ‘caret’ was built under R version 3.4.4
# prediction from nn model using “compute”
> predict <- compute(nn, data.frame(df$Salt, df$Fat))

> class(predict)
[1] "list"
> names(predict)
[1] "neurons" "net.result"
> predict

1 df.Salt df.Fat
[1,] 1 0.9 0.2
[2,] 1 0.1 0.1
[3,] 1 0.4 0.2
[4,] 1 0.5 0.2
[5,] 1 0.5 0.4
[6,] 1 0.8 0.3

[,1] [,2] [,3] [,4]
[1,] 1 0.8645451275 0.2114997846 0.1529272964
[2,] 1 0.3970198947 0.8873134952 0.9101680690
[3,] 1 0.6660899104 0.5959029849 0.6070151908
[4,] 1 0.7156845856 0.5118869030 0.5013653731
[5,] 1 0.8512504353 0.2213912623 0.2349762878
[6,] 1 0.8840757738 0.1641581374 0.1329130140

[,1] [,2]
[1,] 0.9349452648141 0.07022732257
[2,] 0.0002415535993 0.99965512479
[3,] 0.0344215786564 0.96556787694
[4,] 0.1248666747740 0.87816827940
[5,] 0.8841904620140 0.12672437721
[6,] 0.9591361793188 0.04505630529

> apply(predict$net.result,1,which.max)

[1] 1 2 2 2 1 1


apply() is an R function for quickly applying an operation to a matrix or array. The
operation can be applied over the rows, over the columns, or over both.

How does it work?

The pattern is simple: apply(variable, margin, function).

–variable is the object you want to apply the function to.
–margin specifies whether to apply the function by row (margin = 1), by column (margin = 2), or
to each element (margin = 1:2). Margin can be greater than 2 when the object has more than
two dimensions.
–function is the function you want to apply to the elements of your variable.

#the matrix we will work on:

a = matrix(c(1:15), nrow = 5 , ncol = 3)

#will apply the function mean to all the elements of each row
apply(a, 1, mean)
# [1] 6 7 8 9 10

#will apply the function mean to all the elements of each column
apply(a, 2, mean)
# [1] 3 8 13

> predicted.class = apply(predict$net.result,1,which.max)-1

> apply(predict$net.result,1,which.max)-1

[1] 0 1 1 1 0 0


> df
Obs. Fat Salt Acceptance Like Dislike
1 1 0.2 0.9 like TRUE FALSE
2 2 0.1 0.1 dislike FALSE TRUE
3 3 0.2 0.4 dislike FALSE TRUE
4 4 0.2 0.5 dislike FALSE TRUE
5 5 0.4 0.5 like TRUE FALSE
6 6 0.3 0.8 like TRUE FALSE
> table(ifelse(predicted.class=="1", "dislike", "like"), df$Acceptance)

dislike like
dislike 3 0
like 0 3
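The index arithmetic above (subtracting 1 and then testing for "1") is easy to get backwards. A possibly more readable alternative, sketched here in base R, is to name the output columns and look the label up directly. The matrix net_result is typed in from the compute() output shown earlier (rounded), so the snippet runs without the neuralnet package; the object names are illustrative.

```r
# Network outputs for the six records, copied from the compute() result above
net_result <- matrix(c(0.9349452648, 0.07022732,
                       0.0002415536, 0.99965512,
                       0.0344215787, 0.96556788,
                       0.1248666748, 0.87816828,
                       0.8841904620, 0.12672438,
                       0.9591361793, 0.04505631),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(NULL, c("like", "dislike")))

# Actual classes, in the same row order as df
actual <- c("like", "dislike", "dislike", "dislike", "like", "like")

# For each row, pick the name of the column with the largest output
predicted <- colnames(net_result)[apply(net_result, 1, which.max)]

confusion <- table(predicted, actual)
print(confusion)

accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```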

> # alternatively, use predict as opposed to compute

> predict <- predict(nn, df)
> class(predict)
[1] "matrix"
> names(predict)
> predict
[,1] [,2]
[1,] 0.9349452648 0.07022732
[2,] 0.0002415536 0.99965512
[3,] 0.0344215787 0.96556788
[4,] 0.1248666748 0.87816828
[5,] 0.8841904620 0.12672438
[6,] 0.9591361793 0.04505631
> predicted.class=apply(predict,1,which.max)-1
> table(ifelse(predicted.class=="1", "dislike", "like"), df$Acceptance)

dislike like
dislike 3 0
like 0 3

Data Normalization

One of the most important procedures when forming a neural network is data normalization. This
involves adjusting the data to a common scale so as to accurately compare predicted and actual
values. We can do this in two ways in R:
 Scale the data frame automatically using the scale function in R
 Transform the data using a max-min normalization technique

Scaled Normalization
scale( )

Max-Min Normalization
For this method, we invoke the following function to normalize our data:
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

Then, we use lapply to run the function across our existing data (e.g., mydata):
maxmindf <- as.data.frame(lapply(mydata, normalize))
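As a small worked example of max-min normalization, the sketch below applies the function to the Fat and Salt columns of the tiny cheese data. Here mydata is simply a hand-typed copy of those two columns, an assumption made so that the snippet is self-contained.

```r
# Max-min normalization: maps the smallest value to 0 and the largest to 1
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Fat and Salt columns of the tiny example, typed in by hand
mydata <- data.frame(Fat  = c(0.2, 0.1, 0.2, 0.2, 0.4, 0.3),
                     Salt = c(0.9, 0.1, 0.4, 0.5, 0.5, 0.8))

# lapply runs normalize on each column; as.data.frame restores the table shape
maxmindf <- as.data.frame(lapply(mydata, normalize))
print(maxmindf)

# Every column should now span exactly 0 to 1
print(sapply(maxmindf, range))
```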

# normalizing the Salt and Fat in df

> df[,2:3]<-scale(df[,2:3])

> scale(df[,2:3])
Fat Salt
[1,] -0.3227486 1.2752820
[2,] -1.2909944 -1.5071514
[3,] -0.3227486 -0.4637389
[4,] -0.3227486 -0.1159347
[5,] 1.6137431 -0.1159347
[6,] 0.6454972 0.9274778
attr(,"scaled:center")
Fat Salt
3.700743e-17 8.326673e-17
attr(,"scaled:scale")
Fat Salt
1 1
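What scale() does by default can be verified by hand: it subtracts each column's mean and divides by its standard deviation. The sketch below uses a hand-typed copy of the original (unscaled) Fat and Salt values; the object names tiny, scaled and by_hand are illustrative.

```r
# Original, unscaled Fat and Salt values from the tiny example
tiny <- data.frame(Fat  = c(0.2, 0.1, 0.2, 0.2, 0.4, 0.3),
                   Salt = c(0.9, 0.1, 0.4, 0.5, 0.5, 0.8))

scaled <- scale(tiny)   # default: center = TRUE, scale = TRUE

# The same standardization written out explicitly
by_hand <- sapply(tiny, function(v) (v - mean(v)) / sd(v))

print(round(scaled[, c("Fat", "Salt")], 7))
print(all.equal(as.vector(scaled), as.vector(by_hand)))

# After scaling, each column has mean 0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

Unlike max-min normalization, standardized values are not confined to the 0-1 range, which is why the scaled Fat and Salt values above include negatives.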

> nn <- neuralnet(Like + Dislike ~ Salt + Fat, data = df,
linear.output = F, hidden = 3)
> #nn # long output
> # display weights
> nn$weights
[,1] [,2] [,3]
[1,] 0.3960458 0.03305449 -1.049633
[2,] -2.4471788 -1.47511998 4.219826
[3,] -2.6744987 -2.18269129 3.543871

[,1] [,2]
[1,] 0.4923133 0.5164109
[2,] -3.2832295 1.7656123
[3,] -2.0970303 1.2261949
[4,] 4.0179416 -3.6538050

> # display predictions
> prediction(nn)
Data Error: 0;
Salt Fat Like Dislike
1 -1.5071514 -1.2909944 0.007591287 0.97067726
2 -0.4637389 -0.3227486 0.015621549 0.95544668
3 -0.1159347 -0.3227486 0.030765916 0.93157454
4 1.2752820 -0.3227486 0.967788452 0.07879514
5 0.9274778 0.6454972 0.986252988 0.04763819
6 -0.1159347 1.6137431 0.986508633 0.04769084

Salt Fat Like Dislike
1 -1.5071514 -1.2909944 0 1
2 -0.4637389 -0.3227486 0 1
3 -0.1159347 -0.3227486 0 1
4 1.2752820 -0.3227486 1 0
5 0.9274778 0.6454972 1 0
6 -0.1159347 1.6137431 1 0

> # plot network

> plot(nn, rep="best")

> predict <- predict(nn, df)
> predict
[,1] [,2]
[1,] 0.967788452 0.07879514
[2,] 0.007591287 0.97067726
[3,] 0.015621549 0.95544668
[4,] 0.030765916 0.93157454
[5,] 0.986508633 0.04769084
[6,] 0.986252988 0.04763819
> predicted.class=apply(predict,1,which.max)-1
> table(ifelse(predicted.class=="1", "dislike", "like"), df$Acceptance)

dislike like
dislike 3 0
like 0 3

Example 2: Classifying Accident Severity

Subset from the accidents data, for a high-fatality region

Obs.  ALCHL_I  PROFIL_I_R  SUR_COND  VEH_INVL  MAX_SEV_IR
1     1        1           1         1         1
2     2        1           1         1         0
3     2        1           1         1         1
4     1        1           1         1         0
5     2        1           1         1         2
6     2        0           1         1         1
7     2        0           1         3         1
8     2        0           1         4         1
9     2        0           1         2         0
10    2        0           1         2         0

Description of Variables for Automobile Accident Example

ALCHL_I Presence (1) or absence (2) of alcohol
PROFIL_I_R Profile of the roadway: level (1), other (0)
SUR_COND Surface condition of the road: dry (1), wet (2), snow/slush (3), ice (4),
unknown (9)
VEH_INVL Number of vehicles involved
MAX_SEV_IR Presence of injuries/fatalities: no injuries (0), injury (1), fatality (2)

A neural network with two nodes in the hidden layer (accidents data)

#### Table 11.6, 11.7


> accidents.df <- read.csv("accidents1.csv")

> View(accidents.df)

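> head(accidents.df)
  HOUR_I_R ALCHL_I ALIGN_I STRATUM_R WRK_ZONE WKDY_I_R INT_HWY LGTCON_I_R MANCOL_I_R PED_ACC_R RELJCT_I_R REL_RWY_R PROFIL_I_R
1        0       2       2         1        0        1       0          3          0         0          1         0          1
2        1       2       1         0        0        1       1          3          2         0          1         1          1
3        1       2       1         0        0        1       0          3          2         0          1         1          1
4        1       2       1         1        0        0       0          3          2         0          1         1          1
5        1       1       1         0        0        1       0          3          2         0          0         1          1
6        1       2       1         1        0        1       0          3          0         0          1         0          1
  SPD_LIM SUR_COND TRAF_CON_R TRAF_WAY VEH_INVL WEATHER_R INJURY_CRASH NO_INJ_I PRPTYDMG_CRASH FATALITIES
1      40        4          0        3        1         1            1        1              0          0
2      70        4          0        3        2         2            0        0              1          0
3      35        4          1        2        2         2            0        0              1          0
4      35        4          1        2        2         1            0        0              1          0
5      25        4          0        2        3         1            0        0              1          0
6      70        4          0        2        1         2            1        1              0          0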
> str(accidents.df)
'data.frame': 42183 obs. of 24 variables:
$ HOUR_I_R : int 0 1 1 1 1 1 1 1 1 0 ...
$ ALCHL_I : int 2 2 2 2 1 2 2 2 2 2 ...
$ ALIGN_I : int 2 1 1 1 1 1 1 1 1 1 ...
$ STRATUM_R : int 1 0 0 1 0 1 0 1 1 0 ...
$ WRK_ZONE : int 0 0 0 0 0 0 0 0 0 0 ...
$ WKDY_I_R : int 1 1 1 0 1 1 1 1 1 0 ...
$ INT_HWY : int 0 1 0 0 0 0 1 0 0 0 ...
$ LGTCON_I_R : int 3 3 3 3 3 3 3 3 3 3 ...
$ MANCOL_I_R : int 0 2 2 2 2 0 0 0 0 0 ...
$ PED_ACC_R : int 0 0 0 0 0 0 0 0 0 0 ...
$ RELJCT_I_R : int 1 1 1 1 0 1 0 0 1 1 ...
$ REL_RWY_R : int 0 1 1 1 1 0 0 0 0 0 ...
$ PROFIL_I_R : int 1 1 1 1 1 1 1 1 1 1 ...
$ SPD_LIM : int 40 70 35 35 25 70 70 35 30 25 ...
$ SUR_COND : int 4 4 4 4 4 4 4 4 4 4 ...
$ TRAF_CON_R : int 0 0 1 1 0 0 0 0 0 0 ...
$ TRAF_WAY : int 3 3 2 2 2 2 2 1 1 1 ...
$ VEH_INVL : int 1 2 2 2 3 1 1 1 1 1 ...
$ WEATHER_R : int 1 2 2 1 1 2 2 1 2 2 ...
$ INJURY_CRASH : int 1 0 0 0 0 1 0 1 0 0 ...
$ NO_INJ_I : int 1 0 0 0 0 1 0 1 0 0 ...

$ PRPTYDMG_CRASH: int 0 1 1 1 1 0 1 0 1 1 ...
$ FATALITIES : int 0 0 0 0 0 0 0 0 0 0 ...
$ MAX_SEV_IR : int 1 0 0 0 0 1 0 1 0 0 ...
> # selected variables
> vars <- c("ALCHL_I", "PROFIL_I_R", "VEH_INVL")
> vars

> # partition the data
> set.seed(2)
> training=sample(row.names(accidents.df), dim(accidents.df)[1]*0.6)
> validation=setdiff(row.names(accidents.df), training)

# convert a categorical vector into a matrix of 0/1 dummy columns, one per level
class.ind <- function(cl) {
  n <- length(cl)
  cl <- as.factor(cl)
  x <- matrix(0, n, length(levels(cl)) )
  x[(1:n) + n*(unclass(cl)-1)] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}

> #
> head(class.ind(accidents.df[training,]$SUR_COND) )
1 2 3 4 9
[1,] 1 0 0 0 0
[2,] 1 0 0 0 0
[3,] 1 0 0 0 0
[4,] 1 0 0 0 0
[5,] 1 0 0 0 0
[6,] 1 0 0 0 0
> tail(class.ind(accidents.df[training,]$SUR_COND) )
1 2 3 4 9
[25304,] 1 0 0 0 0
[25305,] 1 0 0 0 0
[25306,] 1 0 0 0 0
[25307,] 1 0 0 0 0
[25308,] 1 0 0 0 0
[25309,] 0 1 0 0 0

> head(class.ind(accidents.df[training,]$MAX_SEV_IR))
0 1 2
[1,] 1 0 0
[2,] 1 0 0
[3,] 1 0 0
[4,] 0 1 0
[5,] 1 0 0
[6,] 1 0 0
> tail(class.ind(accidents.df[training,]$MAX_SEV_IR))
0 1 2
[25304,] 0 1 0
[25305,] 0 1 0
[25306,] 0 1 0
[25307,] 1 0 0
[25308,] 0 1 0
[25309,] 0 1 0
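To see what class.ind produces without loading the accidents file, here is a small self-contained sketch. The function body follows the definition used in these notes (with the braces and return value written out), and the five-element vector sur_cond is made up for illustration.

```r
# One 0/1 indicator column per level of a categorical vector
class.ind <- function(cl) {
  n <- length(cl)
  cl <- as.factor(cl)
  x <- matrix(0, n, length(levels(cl)))
  # Row i gets a 1 in the column of its level (column-major matrix indexing)
  x[(1:n) + n * (unclass(cl) - 1)] <- 1
  dimnames(x) <- list(names(cl), levels(cl))
  x
}

sur_cond <- c(1, 2, 4, 1, 9)   # hypothetical surface-condition codes
dummies <- class.ind(sur_cond)
print(dummies)
```

Note that the hypothetical vector contains no 3, so no column for level 3 is created. Columns exist only for levels present in the input, which matters if the training and validation partitions do not contain the same set of levels.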

# when y has multiple classes - need to dummify

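trainData <- cbind(accidents.df[training,c(vars)],
                   class.ind(accidents.df[training,]$SUR_COND),
                   class.ind(accidents.df[training,]$MAX_SEV_IR))
names(trainData) <- c(vars,
                      paste("SUR_COND_", c(1, 2, 3, 4, 9), sep=""),
                      paste("MAX_SEV_IR_", c(0, 1, 2), sep=""))

validData <- cbind(accidents.df[validation,c(vars)],
                   class.ind(accidents.df[validation,]$SUR_COND),
                   class.ind(accidents.df[validation,]$MAX_SEV_IR))
names(validData) <- c(vars,
                      paste("SUR_COND_", c(1, 2, 3, 4, 9), sep=""),
                      paste("MAX_SEV_IR_", c(0, 1, 2), sep=""))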

# run nn with 2 hidden nodes
# use hidden= with a vector of integers specifying number of hidden
# nodes in each layer
nn <- neuralnet(MAX_SEV_IR_0 + MAX_SEV_IR_1 + MAX_SEV_IR_2 ~
                  ALCHL_I + PROFIL_I_R + VEH_INVL + SUR_COND_1 + SUR_COND_2 +
                  SUR_COND_3 + SUR_COND_4, data = trainData, hidden = 2)

> # this takes longer to run
> plot(nn)

> training.prediction <- compute(nn, trainData[,-c(8:11)])
> training.class <- apply(training.prediction$net.result,1,which.max)-1
> table(training.class, accidents.df[training,]$MAX_SEV_IR)

training.class 0 1 2
0 10798 9964 190
1 1644 2645 68
> validation.prediction <- compute(nn, validData[,-c(8:11)])
> validation.class <- apply(validation.prediction$net.result,1,which.max)-1
> table(validation.class, accidents.df[validation,]$MAX_SEV_IR)

validation.class 0 1 2
0 7179 6596 148
1 1100 1791 60


Confusion Matrix (training data)

Prediction 0 1 2
0 10798 9964 190
1 1644 2645 68
2 0 0 0

Confusion Matrix (validation data)

Prediction 0 1 2
0 7179 6596 148
1 1100 1791 60
2 0 0 0
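Overall accuracy and per-class hit rates can be read off the two confusion matrices above. In this sketch the counts are typed in by hand, with the first matrix taken to be the training partition and the second the validation partition (their counts match the table() output earlier); rows are predictions and columns are actual classes.

```r
classes <- c("0", "1", "2")

# Rows = predicted class, columns = actual class
train_cm <- matrix(c(10798, 9964, 190,
                      1644, 2645,  68,
                         0,    0,   0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Prediction = classes, Actual = classes))

valid_cm <- matrix(c(7179, 6596, 148,
                     1100, 1791,  60,
                        0,    0,   0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Prediction = classes, Actual = classes))

# Share of records on the diagonal (correctly classified)
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Share of each actual class that was predicted correctly
class_hit_rate <- function(cm) diag(cm) / colSums(cm)

print(round(c(training = accuracy(train_cm), validation = accuracy(valid_cm)), 3))
print(round(class_hit_rate(valid_cm), 3))
```

Both partitions come out at roughly 53% accuracy, and no record is ever assigned to class 2 (fatality), so the hit rate for that class is zero.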

Common Criteria to Stop the Updating
 When weights change very little from one iteration to the next

 When the misclassification rate reaches a required threshold

 When a limit on runs is reached

Avoiding Overfitting
With sufficient iterations, neural net can easily overfit the data

To avoid overfitting:
 Track error in validation data
 Limit iterations
 Limit complexity of network

User Inputs

Specify Network Architecture

Number of hidden layers

 Most popular – one hidden layer

Number of nodes in hidden layer(s)

 More nodes capture complexity, but increase chances of overfit

Number of output nodes

 For classification with m classes, use m or m-1 nodes
 For numerical prediction use one

“Learning Rate”
 Low values “downweight” the new information from errors at each iteration
 This slows learning, but reduces tendency to overfit to local structure

“Momentum”
 High values keep weights changing in the same direction as in the previous iteration
 Likewise, this helps avoid overfitting to local structure, but also slows learning

Arguments in neuralnet
 hidden: a vector specifying the number of nodes per layer (thus specifying both the size
and number of layers)
 learningrate: value between 0 and 1 (used only with traditional backpropagation, algorithm = "backprop")

Advantages
 Good predictive ability
 Can capture complex relationships
 No need to specify a model

Disadvantages
 Considered a “black box” prediction machine, with no insight into relationships between
predictors and outcome
 No variable-selection mechanism, so you have to exercise care in selecting variables
 Heavy computational requirements if there are many variables (additional variables
dramatically increase the number of weights to calculate)
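How quickly the number of weights grows can be counted directly: every node in a layer has one weight per node in the previous layer, plus one bias value. The helper count_weights below is a hypothetical function written for this illustration, not part of neuralnet.

```r
# Total number of weights (including bias values) in a fully connected net.
# hidden is a vector of hidden-layer sizes, as in neuralnet's hidden argument.
count_weights <- function(n_inputs, hidden, n_outputs) {
  sizes <- c(n_inputs, hidden, n_outputs)
  from <- head(sizes, -1)   # size of each layer feeding forward
  to   <- tail(sizes, -1)   # size of the layer it feeds into
  sum((from + 1) * to)      # +1 for the bias of each receiving node
}

# Cheese example: 2 inputs, one hidden layer of 3 nodes, 2 outputs
print(count_weights(2, 3, 2))         # 17

# Accidents example: 7 inputs, one hidden layer of 2 nodes, 3 outputs
print(count_weights(7, 2, 3))         # 25

# Same 7 inputs with two hidden layers of 5 nodes each
print(count_weights(7, c(5, 5), 3))   # 88
```

The count of 17 for the cheese example agrees with the 3 x 3 and 4 x 2 weight matrices printed by nn$weights above.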

Deep Learning
The most active application area for neural nets
• In image recognition, pixel values are predictors, and there might be 100,000+ predictors
– big data! (voice recognition similar)
• Deep neural nets with many layers (“neural nets on steroids”) have facilitated
revolutionary breakthroughs in image/voice recognition, and in artificial intelligence (AI)
• Key is the ability to self-learn features (“unsupervised”)
• For example, clustering could separate the pixels in this 12” by 12” football field image
into the “green field” and “yard marker” areas without knowing that those concepts exist
• From there, the concept of a boundary, or “edge” emerges
• Successive stages move from identification of local, simple features to more global &
complex features

Summary
 Neural networks can be used for classification and prediction
 Can capture a very flexible/complicated relationship between the outcome and a set of predictors
 The network “learns” and updates its model iteratively as more data are fed into it
 Major danger: overfitting
 Requires large amounts of data
 Good predictive performance, yet “black box” in nature
 Deep learning, very complex neural nets, is effective in image recognition and AI

3. Car Sales. Consider the data on used cars (ToyotaCorolla.csv) with 1436 records and
details on 38 attributes, including Price, Age, KM, HP, and other specifications. The goal
is to predict the price of a used Toyota Corolla based on its specifications.
a. Fit a neural network model to the data. Use a single hidden layer with 2 nodes.
 Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic,
Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco,
Automatic_airco, CD_Player,
Powered_Windows, Sport_Model, and Tow_Bar.
 Remember to first scale the numerical predictor and outcome
variables to a 0–1 scale (use function preProcess() with method =
“range”; see Chapter 7) and convert categorical predictors to dummies.
Record the RMS error for the training data and the validation data. Repeat the
process, changing the number of hidden layers and nodes to {single layer with 5
nodes}, {two layers, 5 nodes in each layer}.
iii. What happens to the RMS error for the training data as the number of layers and
nodes increases?

iv. What happens to the RMS error for the validation data?
v. Comment on the appropriate number of layers and nodes for this application.

