Note 5
Basic Idea
Combine input information in a complex & flexible neural net “model”
Network Structure
Multiple layers
1. Input layer (raw observations)
2. Hidden layers
3. Output layer
Nodes
Weights (like coefficients, subject to iterative adjustment)
Bias values (also like coefficients; like the weights, they are adjusted iteratively during training)
Schematic Diagram
Example – Using fat & salt content to predict consumer acceptance of cheese
Circles are nodes, the $w_{ij}$ on the arrows are weights, and the $\theta_j$ are node bias values
The Hidden Layer
In this example, it has 3 nodes
Each node receives as input the output of all input nodes
Output of each hidden node is some function $g$ of the weighted sum of its inputs:
$\text{output}_j = g\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right)$, where $p$ is the number of input nodes
The Weights
The weights θ (theta) and w are typically initialized to random values in the range -0.05 to +0.05
Equivalent to a model with random prediction (in other words, no predictive value)
These initial weights are used in the first round of training
$\text{output}_3 = \dfrac{1}{1 + e^{-[-0.3 + (0.05)(0.2) + (0.01)(0.9)]}} = 0.43$
Node output for hidden node 3, computed for the first record in this example using the logistic function
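As a quick check, the same calculation can be reproduced in R (the bias -0.3 and weights 0.05 and 0.01 are the values from the example above; 0.2 and 0.9 are Fat and Salt for the first record):

# logistic activation applied to the weighted sum reaching hidden node 3
g <- function(x) 1 / (1 + exp(-x))
fat <- 0.2; salt <- 0.9                    # first record
theta3 <- -0.3; w13 <- 0.05; w23 <- 0.01   # bias and incoming weights for node 3
g(theta3 + w13 * fat + w23 * salt)         # approximately 0.43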
Output Layer
The output of the last hidden layer becomes input for the output layer
Applies the same kind of function as above, i.e. a function $g$ of the weighted sum of its inputs:
$y = g\left(\theta + \sum_i w_i x_i\right)$, where the $x_i$ here are the hidden-node outputs
Preprocessing Steps
Scale numerical variables to a 0-1 range
Categorical variables:
If the categories are equidistant, map them to equidistant interval points in the 0-1 range
Otherwise, create dummy variables (a short sketch follows this list)
Transform (e.g., log) skewed variables
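A minimal sketch of the dummy-variable step, using a hypothetical 3-level categorical predictor named color; model.matrix() is one standard way to create the 0/1 dummies:

# hypothetical categorical predictor; one 0/1 dummy column per category
color <- factor(c("red", "green", "blue", "green"))
dummies <- model.matrix(~ color - 1)
dummies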
Updating the Weights
At each record, compare the prediction to the actual value
The difference is the error for the output node
The error is propagated back and distributed to all the hidden nodes, and used to update their weights, e.g. $w^{new} = w^{old} + \ell \cdot err$
$\ell$ = a constant between 0 and 1, reflecting the "learning rate" or "weight decay parameter"
Why It Works
Big errors lead to big changes in weights
Small errors leave weights relatively unchanged
Over thousands of updates, a given weight keeps changing until the error associated with
that weight is negligible, at which point weights change little
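A tiny illustration of a single weight update under the rule above (all values are made up for illustration):

l   <- 0.5      # learning rate, between 0 and 1
w   <- 0.01     # current value of one weight
err <- 0.2      # error attributed to that weight's node
w + l * err     # 0.11: a large error moves the weight a lot; a small error barely moves it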
#### Table 11.2
library(neuralnet)
> df <- read.csv("tinydata.csv")
> df
Obs. Fat Salt Acceptance
1 1 0.2 0.9 like
2 2 0.1 0.1 dislike
3 3 0.2 0.4 dislike
4 4 0.2 0.5 dislike
5 5 0.4 0.5 like
6 6 0.3 0.8 like
>
> df$Like <- df$Acceptance=="like"
> df$Dislike <- df$Acceptance=="dislike"
> class(df$Like)
[1] "logical"
> class(df$Dislike)
[1] "logical"
>
> df
Obs. Fat Salt Acceptance Like Dislike
1 1 0.2 0.9 like TRUE FALSE
2 2 0.1 0.1 dislike FALSE TRUE
3 3 0.2 0.4 dislike FALSE TRUE
4 4 0.2 0.5 dislike FALSE TRUE
5 5 0.4 0.5 like TRUE FALSE
6 6 0.3 0.8 like TRUE FALSE
>
> set.seed(1)
> # fit a network with a single hidden layer of 3 nodes; Like and Dislike are two
> # logical (dummy) outcome classes, and the same approach extends to more classes
> nn <- neuralnet(Like + Dislike ~ Salt + Fat, data = df, linear.output = F, hidden = 3)
> # display weights
> nn$weights
[[1]]
[[1]][[1]]
[,1] [,2] [,3]
[1,] -1.061143694 3.057021840 3.337952001
[2,] 2.326024132 -3.408663181 -4.293213530
[3,] 4.106434697 -6.525668384 -5.929418648
[[1]][[2]]
[,1] [,2]
[1,] -0.3495332882 -1.677855862
[2,]  5.8777145665 -3.606625360
[3,] -5.3529200726  5.620329700
[4,] -6.1115038896  6.696286857
>
> # display predictions
> prediction(nn)
Data Error: 0;
$rep1
Salt Fat Like Dislike
1 0.1 0.1 0.0002415535993 0.99965512479
2 0.4 0.2 0.0344215786564 0.96556787694
3 0.5 0.2 0.1248666747740 0.87816827940
4 0.9 0.2 0.9349452648141 0.07022732257
5 0.8 0.3 0.9591361793188 0.04505630529
6 0.5 0.4 0.8841904620140 0.12672437721
$data
Salt Fat Like Dislike
1 0.1 0.1 0 1
2 0.4 0.2 0 1
3 0.5 0.2 0 1
4 0.9 0.2 1 0
5 0.8 0.3 1 0
6 0.5 0.4 1 0
>
# plot network
plot(nn, rep="best")
Remarks:
#neuralnet(formula, data, hidden = 1, threshold = 0.01,
# stepmax = 1e+05, rep = 1, startweights = NULL,
# learningrate.limit = NULL, learningrate.factor = list(minus = 0.5,
# plus = 1.2), learningrate = NULL, lifesign = "none",
# lifesign.step = 1000, algorithm = "rprop+", err.fct = "sse",
# act.fct = "logistic", linear.output = TRUE, exclude = NULL,
# constant.weights = NULL, likelihood = FALSE)
plot.nn
x: an object of class nn
rep: repetition of the neural network. If rep="best", the repetition
with the smallest error will be plotted. If not stated all repetitions
will be plotted, each in a separate window.
#### Table 11.3
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
Warning message:
package ‘caret’ was built under R version 3.4.4
# prediction from nn model using “compute”
> predict <- compute(nn, data.frame(df$Salt, df$Fat))
> class(predict)
[1] "list"
> names(predict)
[1] "neurons" "net.result"
>
> predict
$neurons
$neurons[[1]]
1 df.Salt df.Fat
[1,] 1 0.9 0.2
[2,] 1 0.1 0.1
[3,] 1 0.4 0.2
[4,] 1 0.5 0.2
[5,] 1 0.5 0.4
[6,] 1 0.8 0.3
$neurons[[2]]
[,1] [,2] [,3] [,4]
[1,] 1 0.8645451275 0.2114997846 0.1529272964
[2,] 1 0.3970198947 0.8873134952 0.9101680690
[3,] 1 0.6660899104 0.5959029849 0.6070151908
[4,] 1 0.7156845856 0.5118869030 0.5013653731
[5,] 1 0.8512504353 0.2213912623 0.2349762878
[6,] 1 0.8840757738 0.1641581374 0.1329130140
$net.result
[,1] [,2]
[1,] 0.9349452648141 0.07022732257
[2,] 0.0002415535993 0.99965512479
[3,] 0.0344215786564 0.96556787694
[4,] 0.1248666747740 0.87816827940
[5,] 0.8841904620140 0.12672437721
[6,] 0.9591361793188 0.04505630529
>
>
> apply(predict$net.result,1,which.max)
[1] 1 2 2 2 1 1
Remarks:
apply() is an R function for applying a function over the rows or columns of a matrix (or, more generally, over the margins of an array). For example, with a <- matrix(1:15, nrow = 5):
# apply the function mean to each row
apply(a, 1, mean)
# [1] 6 7 8 9 10
# apply the function mean to each column
apply(a, 2, mean)
# [1] 3 8 13
> predicted.class <- apply(predict$net.result,1,which.max)-1
> predicted.class
[1] 0 1 1 1 0 0
>
> df
Obs. Fat Salt Acceptance Like Dislike
1 1 0.2 0.9 like TRUE FALSE
2 2 0.1 0.1 dislike FALSE TRUE
3 3 0.2 0.4 dislike FALSE TRUE
4 4 0.2 0.5 dislike FALSE TRUE
5 5 0.4 0.5 like TRUE FALSE
6 6 0.3 0.8 like TRUE FALSE
>
> table(ifelse(predicted.class=="1", "dislike", "like"), df$Acceptance)
dislike like
dislike 3 0
like 0 3
>
Data Normalization
One of the most important procedures when forming a neural network is data normalization. This
involves adjusting the data to a common scale so as to accurately compare predicted and actual
values. We can do this in two ways in R:
Scale the data frame automatically using the scale function in R
Transform the data using a max-min normalization technique
Scaled Normalization
scale( )
Max-Min Normalization
For this method, we invoke the following function to normalize our data:
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
Then, we use lapply to run the function across our existing data (e.g., mydata):
maxmindf <- as.data.frame(lapply(mydata, normalize))
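For example, applied to a small vector the function maps the minimum to 0 and the maximum to 1:

normalize(c(2, 4, 6, 10))
# [1] 0.00 0.25 0.50 1.00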
> df[,2:3]<-scale(df[,2:3])
> scale(df[,2:3])
Fat Salt
[1,] -0.3227486 1.2752820
[2,] -1.2909944 -1.5071514
[3,] -0.3227486 -0.4637389
[4,] -0.3227486 -0.1159347
[5,] 1.6137431 -0.1159347
[6,] 0.6454972 0.9274778
attr(,"scaled:center")
Fat Salt
3.700743e-17 8.326673e-17
attr(,"scaled:scale")
Fat Salt
1 1
>
> # refit the network, now using the scaled Fat and Salt values
> nn <- neuralnet(Like + Dislike ~ Salt + Fat, data = df,
                  linear.output = F, hidden = 3)
> #nn # long output
>
>
> # display weights
> nn$weights
[[1]]
[[1]][[1]]
[,1] [,2] [,3]
[1,] 0.3960458 0.03305449 -1.049633
[2,] -2.4471788 -1.47511998 4.219826
[3,] -2.6744987 -2.18269129 3.543871
[[1]][[2]]
[,1] [,2]
[1,] 0.4923133 0.5164109
[2,] -3.2832295 1.7656123
[3,] -2.0970303 1.2261949
[4,] 4.0179416 -3.6538050
>
> # display predictions
> prediction(nn)
Data Error: 0;
$rep1
Salt Fat Like Dislike
1 -1.5071514 -1.2909944 0.007591287 0.97067726
2 -0.4637389 -0.3227486 0.015621549 0.95544668
3 -0.1159347 -0.3227486 0.030765916 0.93157454
4 1.2752820 -0.3227486 0.967788452 0.07879514
5 0.9274778 0.6454972 0.986252988 0.04763819
6 -0.1159347 1.6137431 0.986508633 0.04769084
$data
Salt Fat Like Dislike
1 -1.5071514 -1.2909944 0 1
2 -0.4637389 -0.3227486 0 1
3 -0.1159347 -0.3227486 0 1
4 1.2752820 -0.3227486 1 0
5 0.9274778 0.6454972 1 0
6 -0.1159347 1.6137431 1 0
>
> # plot network
> plot(nn, rep="best")
>
>
> predict <- predict(nn, df)
> predict
[,1] [,2]
[1,] 0.967788452 0.07879514
[2,] 0.007591287 0.97067726
[3,] 0.015621549 0.95544668
[4,] 0.030765916 0.93157454
[5,] 0.986508633 0.04769084
[6,] 0.986252988 0.04763819
>
> predicted.class=apply(predict,1,which.max)-1
>
> table(ifelse(predicted.class=="1", "dislike", "like"),
df$Acceptance)
dislike like
dislike 3 0
like 0 3
>
>
Example 2: Classifying Accident Severity
A neural network with two nodes in the hidden layer (accidents data)
> head(accidents.df)
  HOUR_I_R ALCHL_I ALIGN_I STRATUM_R WRK_ZONE WKDY_I_R INT_HWY LGTCON_I_R MANCOL_I_R PED_ACC_R RELJCT_I_R REL_RWY_R PROFIL_I_R
1        0       2       2         1        0        1       0          3          0         0          1         0          1
2        1       2       1         0        0        1       1          3          2         0          1         1          1
3        1       2       1         0        0        1       0          3          2         0          1         1          1
4        1       2       1         1        0        0       0          3          2         0          1         1          1
5        1       1       1         0        0        1       0          3          2         0          0         1          1
6        1       2       1         1        0        1       0          3          0         0          1         0          1
  SPD_LIM SUR_COND TRAF_CON_R TRAF_WAY VEH_INVL WEATHER_R INJURY_CRASH NO_INJ_I PRPTYDMG_CRASH FATALITIES MAX_SEV_IR
1      40        4          0        3        1         1            1        1              0          0          1
2      70        4          0        3        2         2            0        0              1          0          0
3      35        4          1        2        2         2            0        0              1          0          0
4      35        4          1        2        2         1            0        0              1          0          0
5      25        4          0        2        3         1            0        0              1          0          0
6      70        4          0        2        1         2            1        1              0          0          1
> str(accidents.df)
'data.frame': 42183 obs. of 24 variables:
$ HOUR_I_R : int 0 1 1 1 1 1 1 1 1 0 ...
$ ALCHL_I : int 2 2 2 2 1 2 2 2 2 2 ...
$ ALIGN_I : int 2 1 1 1 1 1 1 1 1 1 ...
$ STRATUM_R : int 1 0 0 1 0 1 0 1 1 0 ...
$ WRK_ZONE : int 0 0 0 0 0 0 0 0 0 0 ...
$ WKDY_I_R : int 1 1 1 0 1 1 1 1 1 0 ...
$ INT_HWY : int 0 1 0 0 0 0 1 0 0 0 ...
$ LGTCON_I_R : int 3 3 3 3 3 3 3 3 3 3 ...
$ MANCOL_I_R : int 0 2 2 2 2 0 0 0 0 0 ...
$ PED_ACC_R : int 0 0 0 0 0 0 0 0 0 0 ...
$ RELJCT_I_R : int 1 1 1 1 0 1 0 0 1 1 ...
$ REL_RWY_R : int 0 1 1 1 1 0 0 0 0 0 ...
$ PROFIL_I_R : int 1 1 1 1 1 1 1 1 1 1 ...
$ SPD_LIM : int 40 70 35 35 25 70 70 35 30 25 ...
$ SUR_COND : int 4 4 4 4 4 4 4 4 4 4 ...
$ TRAF_CON_R : int 0 0 1 1 0 0 0 0 0 0 ...
$ TRAF_WAY : int 3 3 2 2 2 2 2 1 1 1 ...
$ VEH_INVL : int 1 2 2 2 3 1 1 1 1 1 ...
$ WEATHER_R : int 1 2 2 1 1 2 2 1 2 2 ...
$ INJURY_CRASH : int 1 0 0 0 0 1 0 1 0 0 ...
$ NO_INJ_I : int 1 0 0 0 0 1 0 1 0 0 ...
$ PRPTYDMG_CRASH: int 0 1 1 1 1 0 1 0 1 1 ...
$ FATALITIES : int 0 0 0 0 0 0 0 0 0 0 ...
$ MAX_SEV_IR : int 1 0 0 0 0 1 0 1 0 0 ...
>
> # selected variables
> vars <- c("ALCHL_I", "PROFIL_I_R", "VEH_INVL")
> vars
[1] "ALCHL_I" "PROFIL_I_R" "VEH_INVL"
>
>
> # partition the data
> set.seed(2)
> training=sample(row.names(accidents.df), dim(accidents.df)[1]*0.6)
> validation=setdiff(row.names(accidents.df), training)
>
> # class.ind() from the nnet package creates a matrix of 0/1 dummy columns
> library(nnet)
> head(class.ind(accidents.df[training,]$SUR_COND))
1 2 3 4 9
[1,] 1 0 0 0 0
[2,] 1 0 0 0 0
[3,] 1 0 0 0 0
[4,] 1 0 0 0 0
[5,] 1 0 0 0 0
[6,] 1 0 0 0 0
> tail(class.ind(accidents.df[training,]$SUR_COND) )
1 2 3 4 9
[25304,] 1 0 0 0 0
[25305,] 1 0 0 0 0
[25306,] 1 0 0 0 0
[25307,] 1 0 0 0 0
[25308,] 1 0 0 0 0
[25309,] 0 1 0 0 0
> head(class.ind(accidents.df[training,]$MAX_SEV_IR))
0 1 2
[1,] 1 0 0
[2,] 1 0 0
[3,] 1 0 0
[4,] 0 1 0
[5,] 1 0 0
[6,] 1 0 0
> tail(class.ind(accidents.df[training,]$MAX_SEV_IR))
0 1 2
[25304,] 0 1 0
[25305,] 0 1 0
[25306,] 0 1 0
[25307,] 1 0 0
[25308,] 0 1 0
[25309,] 0 1 0
>
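The slide does not show how trainData, validData, and the fitted nn used below were created. The following is a hedged reconstruction, consistent with the column positions dropped below (-c(8:11)) and the two-node hidden layer mentioned above; the exact column names and formula are assumptions:

# predictors: the 3 selected variables plus dummies for SUR_COND;
# outcome: dummies for MAX_SEV_IR (columns 9-11 of the resulting frames)
trainData <- cbind(accidents.df[training, vars],
                   class.ind(accidents.df[training, ]$SUR_COND),
                   class.ind(accidents.df[training, ]$MAX_SEV_IR))
names(trainData) <- c(vars, paste0("SUR_COND_", c(1, 2, 3, 4, 9)),
                      paste0("MAX_SEV_IR_", c(0, 1, 2)))
validData <- cbind(accidents.df[validation, vars],
                   class.ind(accidents.df[validation, ]$SUR_COND),
                   class.ind(accidents.df[validation, ]$MAX_SEV_IR))
names(validData) <- c(vars, paste0("SUR_COND_", c(1, 2, 3, 4, 9)),
                      paste0("MAX_SEV_IR_", c(0, 1, 2)))
# single hidden layer with 2 nodes; SUR_COND_9 (column 8) is left out,
# matching the -c(8:11) indexing used below
nn <- neuralnet(MAX_SEV_IR_0 + MAX_SEV_IR_1 + MAX_SEV_IR_2 ~
                  ALCHL_I + PROFIL_I_R + VEH_INVL +
                  SUR_COND_1 + SUR_COND_2 + SUR_COND_3 + SUR_COND_4,
                data = trainData, hidden = 2)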
> # this takes longer to run
> plot(nn)
> training.prediction <- compute(nn, trainData[,-c(8:11)])
> training.class <- apply(training.prediction$net.result,1,which.max)-1
> table(training.class, accidents.df[training,]$MAX_SEV_IR)
training.class 0 1 2
0 10798 9964 190
1 1644 2645 68
>
> validation.prediction <- compute(nn, validData[,-c(8:11)])
> validation.class <-
apply(validation.prediction$net.result,1,which.max)-1
> table(validation.class, accidents.df[validation,]$MAX_SEV_IR)
validation.class 0 1 2
0 7179 6596 148
1 1100 1791 60
>
Remarks
Confusion Matrix (training data)
          Reference
Prediction     0     1    2
         0 10798  9964  190
         1  1644  2645   68
         2     0     0    0
Confusion Matrix (validation data)
          Reference
Prediction    0    1    2
         0 7179 6596  148
         1 1100 1791   60
         2    0    0    0
Common Criteria to Stop the Updating
When weights change very little from one iteration to the next
Avoiding Overfitting
With sufficient iterations, neural net can easily overfit the data
To avoid overfitting:
Track the error on the validation data
Limit the number of iterations (see the neuralnet sketch after this list)
Limit the complexity of the network
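In neuralnet, the stopping rule and the iteration limit correspond to the threshold and stepmax arguments shown earlier; a minimal sketch, reusing the cheese data (the specific values are illustrative):

# stop when the partial derivatives of the error fall below `threshold`,
# or give up after `stepmax` training steps
nn.limited <- neuralnet(Like + Dislike ~ Salt + Fat, data = df,
                        hidden = 3, linear.output = FALSE,
                        threshold = 0.05, stepmax = 1e4)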
User Inputs
“Learning Rate”
Low values “downweight” the new information from errors at each iteration
This slows learning, but reduces tendency to overfit to local structure
“Momentum”
High values keep weights changing in same direction as previous iteration
Likewise, this helps avoid overfitting to local structure, but also slows learning
Arguments in neuralnet
hidden: a vector specifying the number of nodes per layer, thus specifying both the size and number of layers (see the example below)
learningrate: value between 0 and 1
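For example, on the cheese data a single hidden layer with 5 nodes versus two hidden layers with 5 nodes each would be specified as follows (illustrative only):

nn.one.layer  <- neuralnet(Like + Dislike ~ Salt + Fat, data = df,
                           hidden = 5, linear.output = FALSE)
nn.two.layers <- neuralnet(Like + Dislike ~ Salt + Fat, data = df,
                           hidden = c(5, 5), linear.output = FALSE)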
Advantages
Good predictive ability
Can capture complex relationships
No need to specify a model
Disadvantages
Considered a “black box” prediction machine, with no insight into relationships between
predictors and outcome
No variable-selection mechanism, so you have to exercise care in selecting variables
Heavy computational requirements if there are many variables (additional variables dramatically increase the number of weights to calculate; a quick count follows this list)
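A quick count for the cheese example shows how the number of weights grows: with p input nodes, h hidden nodes, and m output nodes there are (p+1)*h + (h+1)*m weights, counting the bias terms.

p <- 2; h <- 3; m <- 2       # the cheese network above
(p + 1) * h + (h + 1) * m    # 17, matching the 3x3 and 4x2 weight matrices printed earlier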
Deep Learning
The most active application area for neural nets
• In image recognition, pixel values are predictors, and there might be 100,000+ predictors
– big data! (voice recognition similar)
• Deep neural nets with many layers (“neural nets on steroids”) have facilitated
revolutionary breakthroughs in image/voice recognition, and in artificial intelligence (AI)
• Key is the ability to self-learn features (“unsupervised”)
• For example, clustering could separate the pixels of a football field image into the "green field" and "yard marker" areas without knowing that those concepts exist
• From there, the concept of a boundary, or “edge” emerges
• Successive stages move from identification of local, simple features to more global &
complex features
Summary
Neural networks can be used for classification and prediction
Can capture a very flexible/complicated relationship between the outcome and a set of
predictors
The network “learns” and updates its model iteratively as more data are fed into it
Major danger: overfitting
Requires large amounts of data
Good predictive performance, yet “black box” in nature
Deep learning (very complex, multi-layer neural nets) is effective in image recognition and AI
Problems
3. Car Sales. Consider the data on used cars (ToyotaCorolla.csv) with 1436 records and
details on 38 attributes, including Price, Age, KM, HP, and other specifications. The goal
is to predict the price of a used Toyota Corolla based on its specifications.
a. Fit a neural network model to the data. Use a single hidden layer with 2 nodes.
Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.
Remember to first scale the numerical predictor and outcome variables to a 0-1 scale (use function preProcess() with method = "range"; see Chapter 7) and convert categorical predictors to dummies. A short sketch of the scaling step follows this problem.
Record the RMS error for the training data and the validation data. Repeat the process, changing the number of hidden layers and nodes to {single layer with 5 nodes}, {two layers, 5 nodes in each layer}.
i. What happens to the RMS error for the training data as the number of layers and nodes increases?
ii. What happens to the RMS error for the validation data?
iii. Comment on the appropriate number of layers and nodes for this application.
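A hedged sketch of the scaling step in part (a), assuming caret's preProcess() with method = "range" and assuming training and validation partitions named train.df and valid.df have already been created (these object names are placeholders):

library(caret)
# learn the 0-1 rescaling from the training partition, then apply it to both partitions
norm.values <- preProcess(train.df, method = "range")
train.norm  <- predict(norm.values, train.df)
valid.norm  <- predict(norm.values, valid.df)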