Exploratory_Data_Analysis_Course_Notes
Xing Su
Contents
Principle of Analytic Graphics
Exploratory Graphs (examples)
    One Dimension Summary of Data
    Two Dimensional Summaries
    Process of Making a Plot/Considerations
Base Plotting
    Base Graphics Functions and Parameters
    Base Plot Example
    Multiple Plot Example
Graphics Device
lattice Plotting System
    lattice Functions and Parameters
    lattice Example
ggplot2 Plotting System
    ggplot2 Functions and Parameters
    ggplot2 Comprehensive Example
Hierarchical Clustering
    Procedure for Constructing Hierarchical Clusters (hclust function)
    Approaches for Merging Points/Clusters
    Characteristics of Hierarchical Clustering Algorithms
    hclust Function and Example
    myplclust Function and Example
    heatmap Function and Example
    image Function and Example
K-means Clustering
    Procedure for Constructing K-means Clusters (kmeans function)
    Characteristics of K-means Clustering Algorithms
Dimension Reduction
    Singular Value Decomposition (SVD)
    Principal Components Analysis (PCA)
    SVD and PCA Example
    Create Approximations/Data Compression
Color Packages in R Plots
    grDevices Package
    RColorBrewer Package
Case Study: Human Activity Tracking with Smart Phones
Case Study: Fine Particle Pollution in the U.S. from 1999 to 2012
Principle of Analytic Graphics
• Principle 4: Integration of evidence
– use as many modes of evidence/displaying evidence as possible (modes of data presentation)
– integrate words/numbers/images/diagrams (information rich)
– analysis should drive the tool
• Principle 5: Describe/document evidence with appropriate labels/scales/sources
– add credibility to that data graphic
• Principle 6: Content is the most important
– analytical presentations ultimately stand/fall depending on quality/relevance/integrity of content
Exploratory Graphs (examples)
• Purpose: understand data properties, find patterns in data, suggest modeling strategies, debug
• Characteristics: made quickly, large number produced, gain personal understanding, appearances and presentation aren't as important
• summary(data) = returns min, 1st quartile, median, mean, 3rd quartile, max
• boxplot(data, col = "blue") = produces a box plot with the middle 50% highlighted in the specified color
– whiskers extend to the most extreme data points within 1.5 × IQR of the box; the notches (when requested) extend to ±1.58 × IQR/√n
∗ IQR = interquartile range, Q3 - Q1
– box = 25th percentile, median, 75th percentile
• hist(data, col = "green") = produces a histogram with the specified breaks and color
– breaks = 100 = the higher the number, the narrower the histogram columns
• rug(data) = adds a strip under the histogram indicating the location of each data point (a 1-D density representation)
• barplot(data, col = "wheat") = produces a bar graph, usually for categorical data
• Overlaying Features
• abline(h/v = 12) = overlays a horizontal/vertical line at the specified location
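The summary and box-plot quantities above are easy to compute by hand; a small Python sketch of the five-number summary and the 1.58 × IQR/√n rule quoted above (the function names are mine, not R's):

```python
import math
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max: what R's summary() reports (minus the mean)."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

def notch_halfwidth(data):
    """Half-width of a boxplot notch: 1.58 * IQR / sqrt(n)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return 1.58 * (q3 - q1) / math.sqrt(len(data))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))        # (1, 3.0, 5.0, 7.0, 9)
print(round(notch_halfwidth(data), 3))  # 2.107
```

The "inclusive" quantile method roughly matches R's default quantile algorithm (type 7); R offers several others.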
Two Dimensional Summaries
• multiple histograms:
– par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) = sets up a two-row panel layout and the margins
– hist(subset(pollution, region == "east")$pm25, col = "green") = first histogram
– hist(subset(pollution, region == "west")$pm25, col = "green") = second histogram
• scatterplot
– with(pollution, plot(latitude, pm25, col = region))
– abline(h = 12, lwd = 2, lty = 2) = plots a horizontal dashed line
– plot(jitter(child, 4) ~ parent, galton) = spreads out data points at the same position to simulate measurement error/make high-frequency points more visible
[Figure: scatterplot of jitter(child, 4) against parent from the galton data]
Process of Making a Plot/Considerations
Base Plotting
• arguments
– pch: plotting symbol (default = open circle)
– lty: line type (default is solid)
7
∗ 0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash
– lwd: line width (integer)
– col: plotting color (number, string, or hex code; colors() returns a vector of available colors)
– xlab, ylab: x-y label character strings
– cex: numerical value giving the amount by which plotting text/symbols should be magnified
relative to the default
∗ cex = 0.15 * variable: plot size as an additional variable
• par() function = specifies global graphics parameters, affects all plots in an R session (can be overridden)
– las: orientation of axis labels
– bg: background color
– mar: margin size (order = bottom left top right)
– oma: outer margin size (default = 0 for all sides)
– mfrow: number of plots per row, column (plots are filled row-wise)
– mfcol: number of plots per row, column (plots are filled column-wise)
– can verify all above parameters by calling par("parameter")
• plotting functions
– lines: adds lines to a plot, given a vector of x values and a corresponding vector of y values
– points: adds a point to the plot
– text: add text labels to a plot using specified x,y coordinates
– title: add annotations to x,y axis labels, title, subtitles, outer margin
– mtext: add arbitrary text to margins (inner or outer) of plot
– axis: specify axis ticks
Base Plot Example
library(datasets)
# type = "n" sets up the plot axes without plotting the data
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", type = "n"))
# subsets of data are plotted here using different colors
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))
model <- lm(Ozone ~ Wind, airquality)
# regression line is produced here
abline(model, lwd = 2)
[Figure: "Ozone and Wind in New York City": Ozone vs Wind, May in blue, other months in red, with regression line]
• Note: typing example(points) in R will launch a demo of the base plotting system and may provide some helpful tips on graphing
Multiple Plot Example
# this expression sets up a plot with 1 row and 3 columns, and sets the margins and outer margins
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
# here three plots are filled in with their respective titles
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
plot(Temp, Ozone, main = "Ozone and Temperature")
# this adds a line of text in the outer margin
mtext("Ozone and Weather in New York City", outer = TRUE)}
)
[Figure: three-panel plot "Ozone and Weather in New York City": Ozone vs Wind, Ozone vs Solar Radiation, Ozone vs Temperature]
Graphics Device
## Create plot on screen device
with(faithful, plot(eruptions, waiting))
## Add a main title
title(main = "Old Faithful Geyser data")
[Figure: "Old Faithful Geyser data": waiting vs eruptions scatterplot]
lattice Plotting System
• Functions
– xyplot() = main function for creating scatterplots
– bwplot() = box and whiskers plots (box plots)
– histogram() = histograms
– stripplot() = box plot with actual points
– dotplot() = plot dots on “violin strings”
– splom() = scatterplot matrix (like pairs() in base plotting system)
– levelplot()/contourplot() = plotting image data
• Arguments for xyplot(y ~ x | f * g, data, layout, panel)
– default blue open circles for data points
– formula notation is used here (~) = left hand side is the y-axis variable, and the right hand side is
the x-axis variable
– f/g = conditioning/categorical variables (optional)
∗ basically creates multi-panelled plots (for different factor levels)
∗ * indicates interaction between two variables
∗ intuitively, the xyplot displays a graph between x and y for every level of f and g
– data = the data frame/list from which the variables should be looked up
∗ if nothing is passed, the parent frame is used (searching for variables in the workspace)
∗ if no other arguments are passed, defaults will be used
– layout = specifies how the different plots will appear
∗ layout = c(5, 1) = produces 5 subplots in a horizontal fashion
∗ padding/spacing/margin automatically set
– [optional] panel function can be added to control what is plotted inside each panel of the plot
∗ panel functions receive x/y coordinates of the data points in their panel (along with any
additional arguments)
∗ ?panel.xyplot = brings up documentation for the panel functions
∗ Note: no base plot functions can be used for lattice plots
lattice Example
library(lattice)
set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x + rnorm(100, sd = 0.5)
f <- factor(f, labels = c("Group 1", "Group 2"))
## Plot with 2 panels with custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
# call the default panel function for xyplot
panel.xyplot(x, y, ...)
# adds a horizontal line at the median
panel.abline(h = median(y), lty = 2)
# overlays a simple linear regression line
panel.lmline(x, y, col = 2)
})
[Figure: two-panel xyplot of y ~ x by group (Group 1, Group 2), each panel with a dashed median line and a regression line]
ggplot2 Plotting System
“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic
attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain
statistical transformations of the data and is drawn on a specific coordinate system”
• grammar of graphics plot; splits the difference between the base and lattice systems
• automatically sets spacing/text/titles but also allows annotations to be added
• the defaults make a lot of choices, but are still customizable
library(ggplot2)
qplot(displ, hwy, data = mpg, color = drv, shape = drv)
[Figures: qplot(displ, hwy) colored and shaped by drv; a second hwy vs displ scatterplot (points with smoother)]
qplot(hwy, data = mpg, fill = drv)
[Figures: histogram of hwy filled by drv; hwy vs displ scatterplots faceted by drv; hwy histograms faceted by drv]
• density smooth: smooths the histograms into a line tracing its shape
– geom = "density" = replaces the default scatterplot with density smooth curve
• ggplot()
– built up in layers/modularly (similar to base plotting system)
∗ data → overlay summary → metadata/annotation
– g <- ggplot(data, aes(var1, var2))
∗ initiates call to ggplot and specifies the data frame that will be used
∗ aes(var1, var2) = specifies aesthetic mapping, or var1 = x variable, and var2 = y variable
∗ summary(g) = displays summary of ggplot object
∗ print(g) = returns error (“no layer on plot”), which means the plot does not know how to draw the data yet
– g + geom_point() = takes information from g object and produces scatter plot
– + geom_smooth() = adds a smoothed conditional-mean curve (loess by default) with a confidence interval
18
∗ method = "lm" = changes the smooth curve to be linear regression
∗ size = 4, linetype = 3 = can be specified to change the size/style of the line
∗ se = FALSE = turns off confidence interval
– + facet_grid(row ~ col) = splits data into subplots by factor variables (see facets from qplot())
∗ conditioning on continuous variables is possible through cutting/making a new categorical variable
∗ cutPts <- quantile(df$cVar, seq(0, 1, length = 4), na.rm = TRUE) = creates quantiles where the continuous variable will be cut
· seq(0, 1, length=4) = creates 4 quantile points
· na.rm = TRUE = removes all NA values
∗ df$newFactor <- cut(df$cVar, cutPts) = creates new categorical/factor variable by using
the cutpoints
· creates n-1 ranges from n points = in this case 3
– annotations:
∗ xlab(), ylab(), labs(), ggtitle() = for labels and titles
· labs(x = expression("log " * PM[2.5]), y = "Nocturnal") = specifies x and y la-
bels
· expression() = used to produce mathematical expressions
∗ geom functions = many options to modify
∗ theme() = for global changes in presentation
· example: theme(legend.position = "none")
∗ two standard themes defined: theme_gray() and theme_bw()
∗ base_family = "Times" = changes font to Times
– aesthetics
∗ + geom_point(color, size, alpha) = specifies how the points are supposed to be plotted
on the graph (style)
· Note: this translates to geom_line()/other forms of plots
· color = "steelblue" = specifies color of the data points
· aes(color = var1) = wrapping color argument this way allows a factor variable to be
assigned to the data points, thus subsetting it with different colors based on factor variable
values
· size = 4 = specifies size of the data points
· alpha = 0.5 = specifies transparency of the data points
– axis limits
∗ + ylim(-3, 3) = limits the range of y variable to a specific range
· Note: ggplot will exclude (not plot) points that fall outside of this range (outliers),
potentially leaving gaps in plot
∗ + coord_cartesian(ylim = c(-3, 3)) = limits the visible range but plots all points of the data
# initiates ggplot
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
g + geom_point(alpha = 1/3) +                                  # adds points
  facet_wrap(bmicat ~ no2dec, nrow = 2, ncol = 4) +            # makes panels
  geom_smooth(method = "lm", se = FALSE, col = "steelblue") +  # adds smoother
  theme_bw(base_family = "Avenir", base_size = 10) +           # changes theme
  labs(x = expression("log " * PM[2.5])) +                     # adds labels
  labs(y = "Nocturnal Symptoms") +
  labs(title = "MAACS Cohort")
Hierarchical Clustering
• useful for visualizing high dimensional data, organizes things that are close into groups
• agglomerative approach (most common) — bottom up
1. start with data
2. find closest pairs, put them together (create “super point” and remove original data)
3. find the next closest
4. repeat = yields a tree showing order of merging (dendrogram)
– requires
∗ merging approach: how to merge two points/clusters
∗ distance metric: how to calculate the distance between two points
∗ continuous - Euclidean distance → √[(A1 − A2)² + (B1 − B2)² + · · · + (Z1 − Z2)²]
∗ continuous - correlation similarity → how correlated two data points are
∗ binary - Manhattan distance (“city block” distance) → |A1 − A2| + |B1 − B2| + · · · + |Z1 − Z2|
1. calculate all pair wise distances between all points to see which points are closest together
• dist(data.frame(x = x, y = y)) = returns pairwise distances for all of the (x, y) coordinates
• Note: dist() function uses Euclidean distance by default
2. group two closest points from the calculated distances and merge them to a single point
3. find the next two closest points and merge them, and repeat
4. order of clustering is shown in the dendrogram
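The two distance metrics above translate directly into code; a minimal Python sketch (R's dist() computes the Euclidean one for every pair of rows by default):

```python
import math

def euclidean(p, q):
    # sqrt((A1 - A2)^2 + (B1 - B2)^2 + ...): dist()'s default metric
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # |A1 - A2| + |B1 - B2| + ...: "city block" distance
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```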
• the approach is specified in the argument method = "complete" or "average" in hclust() function
• average linkage = the distance between two clusters is the average of all pairwise distances between their points (averaging the x and y coordinates instead, i.e. taking the center of mass, is strictly centroid linkage)
• complete linkage = to measure distance of two clusters, take the two points in the clusters that are
the furthest apart
• Note: two approaches may produce different results so it’s a good idea to use both approaches to validate
results
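The two merging rules can be compared on a pair of toy clusters; a minimal Python sketch (complete linkage takes the furthest pair, and hclust's method = "average" averages all pairwise distances):

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def complete_linkage(ca, cb):
    # distance between clusters = distance of the two furthest-apart points
    return max(dist(p, q) for p in ca for q in cb)

def average_linkage(ca, cb):
    # mean of all pairwise distances between the two clusters
    return sum(dist(p, q) for p in ca for q in cb) / (len(ca) * len(cb))

a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
print(complete_linkage(a, b))  # sqrt(17), about 4.123
print(average_linkage(a, b))   # about 3.571
```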
• hh <- hclust(dist(dataFrame)) function = produces a hierarchical cluster object based on pair wise
distances from a data frame of x and y values
– dist() = defaults to Euclidean; calculates the distance/similarity between two observations; when applied to a data frame, the function applies the √[(A1 − A2)² + (B1 − B2)² + · · · + (Z1 − Z2)²] formula to every pair of rows of data to construct a matrix of distances between the rows
∗ order of the hierarchical cluster is derived from the distance
– plot(hh) = plots the dendrogram
– automatically sorts column and row according to cluster
– names(hh) = returns all parameters of the hclust object
∗ hh$order = returns the order of the rows/clusters from the dendrogram
∗ hh$dist.method = returns method for calculating distance/similarity
• Note: dendrogram that gets generated DOES NOT show how many clusters there are, so cutting
(at 2.0 level for example) must be done to determine number of clusters — must be a convenient and
sensible point
• hclust Example
set.seed(1234)
x <- rnorm(12,mean=rep(1:3,each=4),sd=0.2)
y <- rnorm(12,mean=rep(c(1,2,1),each=4),sd=0.2)
dataFrame <- data.frame(x=x,y=y)
distxy <- dist(dataFrame)
hClustering <- hclust(distxy)
plot(hClustering)
[Figure: Cluster Dendrogram of distxy; hclust (*, "complete")]
• Note: myplcclust = a function to plot hclust objects in color (clusters labeled 1 2 3 etc.), but must
know how many clusters there are initially
[Figure: Cluster Dendrogram of distxy with leaves labeled by cluster number (1, 2, 3) in color; hclust (*, "complete")]

heatmap Function and Example
set.seed(12345)
data <- matrix(rnorm(400), nrow = 40)
heatmap(data)
[Figure: heatmap(data): clustered image of the 40 x 10 matrix with row and column dendrograms]

image Function and Example
– t(dataMatrix)[, nrow(dataMatrix):1]
∗ t(dataMatrix) = transpose of dataMatrix, so that the plot is displayed in the same fashion as the matrix (rows as values on the y axis and columns as values on the x axis)
· example: a 40 x 10 matrix will graph the 10 columns as x values and the 40 rows as y values
∗ [, nrow(dataMatrix):1] = subsets the data frame in reverse column order; when combined with the t() function, it reorders the rows of data from 40 to 1, so that the data from the matrix is displayed in order from top to bottom
· Note: without this statement the rows will be displayed in order from bottom to top, in line with the positive y axis
– x, y = used to specify the values displayed on the x and y axis
∗ Note: must be in increasing order
• example
[Figure: image() example with the x and y axis values specified (x axis displays 1:10)]
K-means Clustering
• similar to hierarchical clustering, focuses on finding things that are close together
– define close, groups, visualizing/interpreting grouping
• partitioning approach
1. set number of clusters initially
2. find centroids for each cluster
3. assign points to the closest centroid
4. recalculate centroid
5. repeat = yields estimate of cluster centroids and which cluster each point belongs to
– requires
∗ distance metric
∗ initial number of clusters
∗ initial guess as to where the cluster centroids are
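The procedure above (assign each point to the nearest centroid, recompute the centroids, repeat) can be sketched in a few lines of Python; the toy data, initial guess, and fixed iteration count are made up for illustration:

```python
def assign(points, centroids):
    # step 3: each point goes to its closest centroid (squared distance suffices)
    def d2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda k: d2(p, centroids[k]))
            for p in points]

def recompute(points, labels, k):
    # step 4: each centroid moves to the mean of its assigned points
    cents = []
    for j in range(k):
        mine = [p for p, lab in zip(points, labels) if lab == j]
        cents.append(tuple(sum(coord) / len(mine) for coord in zip(*mine)))
    return cents

points = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
cents = [(0.5, 0.5), (2.5, 2.5)]   # initial guess for 2 clusters
for _ in range(5):                 # step 5: repeat
    labels = assign(points, cents)
    cents = recompute(points, labels, 2)
print(labels)  # [0, 0, 1, 1]
```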
• example
set.seed(1234)
x <- rnorm(12,mean=rep(1:3,each=4),sd=0.2)
y <- rnorm(12,mean=rep(c(1,2,1),each=4),sd=0.2)
dataFrame <- data.frame(x=x,y=y)
# specifies initial number of clusters to be 3
kmeansObj <- kmeans(dataFrame,centers=3)
# cluster assignment for each point
kmeansObj$cluster
## [1] 3 3 3 3 1 1 1 1 2 2 2 2
par(mar=rep(0.2,4))
plot(x,y,col=kmeansObj$cluster,pch=19,cex=2)
points(kmeansObj$centers,col=1:3,pch=3,cex=3,lwd=3)
Characteristics of K-means Clustering Algorithms
Dimension Reduction
• two kinds of problems that relate to high-dimension dataset/matrix with many variables
1. find a new set (smaller) of variables that are uncorrelated and explain as much variance of data as
possible
– normally many variables are not independent (i.e. height vs weight)
– statistical problem, commonly solved with PCA
2. find a lower rank matrix (best matrix created with fewer variables) that still explains the data
– data compression problem, commonly solved with SVD
• example
– Note: we arbitrarily introduce a pattern into the data: we flip a coin and, if it is heads, replace the row with [0, 0, 0, 0, 0, 3, 3, 3, 3, 3]
– here we plot the patterns in rows and columns (already sorted)
# start from a random 40 x 10 matrix (as in the earlier heatmap example)
data <- matrix(rnorm(400), nrow = 40)
for(i in 1:40){
# flip a coin
coinFlip <- rbinom(1,size=1,prob=0.5)
# if coin is heads add a common pattern to that row
if(coinFlip){
data[i,] <- data[i,] + rep(c(0,3),each=5)
}
}
# hierarchical clustering
hh <- hclust(dist(data))
dataOrdered <- data[hh$order,]
# create 1 x 3 panel plot
par(mfrow=c(1,3))
# heat map (sorted)
image(t(dataOrdered)[,nrow(dataOrdered):1])
# row means (40 rows)
plot(rowMeans(dataOrdered),40:1,xlab="Row Mean",ylab="Row",pch=19)
# column means (10 columns)
plot(colMeans(dataOrdered),xlab="Column",ylab="Column Mean",pch=19)
[Figure: three panels: heatmap of dataOrdered; row means (40 rows) plotted against Row; column means (10 columns) plotted against Column]
Singular Value Decomposition (SVD)
• Let X be a matrix with each variable in a column (measurement) and each observation in a row (subject)
• SVD in this case is a matrix decomposition process, in which X is divided into three separate
matrices as follows:
X = UDV^T
– U = left singular vector, orthogonal matrix (columns independent of each other)
– D = singular values, diagonal matrix
– V = right singular vector, orthogonal matrix (columns independent of each other)
– Note: orthogonal implies that the matrix is always invertible with A^-1 = A^T, and that the product of the matrix and its transpose equals the identity matrix [AA^T = I]
∗ when an orthogonal matrix A is multiplied with another matrix B, it is effectively a linear transformation in which the lengths and angles of B are preserved
– Note: diagonal implies that any value outside of the main diagonal is 0
∗ example
      1 0 0
A =   0 2 0
      0 0 3
• Note: the scale of the data matters for SVD/PCA (scaling the data may help), detected patterns may be mixed together, and the computation is intensive for these operations
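The orthogonality claims above (A^-1 = A^T, AA^T = I, lengths preserved) can be checked numerically; a small Python sketch using a 2 x 2 rotation matrix, which is orthogonal:

```python
import math

theta = 0.7
A = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]  # rotation matrices are orthogonal

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

I = matmul(A, transpose(A))  # A A^T should be the identity matrix
print([[round(val, 10) for val in row] for row in I])  # [[1.0, 0.0], [0.0, 1.0]]

v = [1.0, 2.0]
Av = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
print(abs(math.hypot(*v) - math.hypot(*Av)) < 1e-12)  # True: length preserved
```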
• U and V Matrices
– s <- svd(data) = performs SVD on data (n × m matrix) and splits it into u, v, and d matrices
∗ s$u = n × m matrix → horizontal variation
∗ s$d = 1 × m vector → vector of the singular/diagonal values
· diag(s$d) = m × m diagonal matrix
∗ s$v = m × m matrix → vertical variation
∗ s$u %*% diag(s$d) %*% t(s$v) = returns the original data → X = UDV^T
– scale(data) = scales the original data by subtracting each data point's column mean from it and dividing by its column standard deviation
# running svd
svd1 <- svd(scale(dataOrdered))
# create 1 by 3 panel plot
par(mfrow=c(1,3))
# data heatmap (sorted)
image(t(dataOrdered)[,nrow(dataOrdered):1])
# U Matrix - first column
plot(svd1$u[,1],40:1,xlab="Row",ylab="First left singular vector",pch=19)
# V vector - first column
plot(svd1$v[,1],xlab="Column",ylab="First right singular vector",pch=19)
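scale(), used in the snippet above, has a simple definition: per column, subtract the mean and divide by the sample standard deviation. A minimal Python sketch (the helper name mirrors R's):

```python
import statistics

def scale(rows):
    # per column: subtract the column mean, divide by the column standard deviation
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(scale(X))  # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```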
30
1.0
40
0.4
First right singular vector
0.8
0.3
30
0.6
0.2
20
0.1
0.4
−0.1 0.0
10
0.2
0.0
0
0.0 0.2 0.4 0.6 0.8 1.0 −0.2 0.0 0.1 0.2 2 4 6 8 10
Row Column
0.4
Singular value
10
0.2
6
0.0
2
2 4 6 8 2 4 6 8
Column Column
• Relationship to PCA
– p <- prcomp(data, scale = TRUE) = performs PCA on data specified
∗ scale = TRUE = scales the data before performing PCA
∗ returns prcomp object
∗ summary(p) = prints out the principal component’s standard deviation, proportion of variance,
and cumulative proportion
– PCA’s rotation vectors are equivalent to their counterparts in the V matrix from the SVD
# SVD
svd1 <- svd(scale(dataOrdered))
# PCA
pca1 <- prcomp(dataOrdered,scale=TRUE)
# Plot the rotation from PCA (Principal Components) vs v vector from SVD
plot(pca1$rotation[,1],svd1$v[,1],pch=19,xlab="Principal Component 1",
ylab="Right Singular Vector 1")
abline(c(0,1))
[Figure: Right Singular Vector 1 vs Principal Component 1; the points fall on the y = x line]
# summarize PCA
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9930 1.2518 1.0905 1.0024 0.90836 0.72211 0.60630
## Proportion of Variance 0.3972 0.1567 0.1189 0.1005 0.08251 0.05214 0.03676
## Cumulative Proportion 0.3972 0.5539 0.6728 0.7733 0.85582 0.90797 0.94473
## PC8 PC9 PC10
## Standard deviation 0.52145 0.40286 0.34423
## Proportion of Variance 0.02719 0.01623 0.01185
## Cumulative Proportion 0.97192 0.98815 1.00000
set.seed(678910)
# setting pattern
data <- matrix(rnorm(400), nrow = 40)
for(i in 1:40){
# flip a coin
coinFlip1 <- rbinom(1,size=1,prob=0.5)
coinFlip2 <- rbinom(1,size=1,prob=0.5)
# if coin is heads add a common pattern to that row
if(coinFlip1){
data[i,] <- data[i,] + rep(c(0,5),each=5)
}
if(coinFlip2){
data[i,] <- data[i,] + rep(c(0,5),5)
}
}
hh <- hclust(dist(data)); dataOrdered <- data[hh$order,]
# perform SVD
svd2 <- svd(scale(dataOrdered))
par(mfrow=c(2,3))
image(t(dataOrdered)[,nrow(dataOrdered):1])
plot(rep(c(0,1),each=5),pch=19,xlab="Column", main="True Pattern 1")
plot(rep(c(0,1),5),pch=19,xlab="Column",main="True Pattern 2")
image(t(dataOrdered)[,nrow(dataOrdered):1])
plot(svd2$v[,1],pch=19,xlab="Column",ylab="First right singular vector",
main="Detected Pattern 1")
plot(svd2$v[,2],pch=19,xlab="Column",ylab="Second right singular vector",
main="Detected Pattern 2")
[Figure: heatmap of dataOrdered with True Pattern 1 (rep(c(0,1), each = 5)) and True Pattern 2 (rep(c(0,1), 5)); below, the first and second right singular vectors as Detected Pattern 1 and Detected Pattern 2]
• Missing Data
– SVD cannot be performed on dataset with NA values
– impute package from Bioconductor can help approximate missing values from surrounding values
∗ impute.knn function takes the missing row and imputes the data using the k nearest neighbors
to that row
· k=10 = default value (take the nearest 10 rows)
[Figure: first right singular vector of the original data (svd1$v[, 1]) next to that of the imputed data (svd2$v[, 1])]

Create Approximations/Data Compression
• SVD can be used to create a lower rank representation, or compressed representation, of the data
• if we look at the variance explained plot below, most of the variation is explained by the first few
principal components
# load faceData
load("figures/face.rda")
# perform SVD
svd3 <- svd(scale(faceData))
plot(svd3$d^2/sum(svd3$d^2),pch=19,xlab="Singular vector",ylab="Variance explained")
[Figure: variance explained (svd3$d^2/sum(svd3$d^2)) by singular vector; most variance is captured by the first few]
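The quantity plotted above, svd3$d^2/sum(svd3$d^2), is just each squared singular value divided by their total; a Python sketch with made-up singular values:

```python
def variance_explained(d):
    # fraction of total variance captured by each singular vector
    total = sum(x * x for x in d)
    return [x * x / total for x in d]

fracs = variance_explained([10.0, 5.0, 1.0])  # hypothetical singular values
print([round(f, 3) for f in fracs])  # [0.794, 0.198, 0.008]
```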
• approximations can thus be created by taking the first few components and using matrix multiplication
with the corresponding U , V , and D components
# create the approximation from the first 10 components (U, D, and V subsets multiplied back together)
approx10 <- svd3$u[, 1:10] %*% diag(svd3$d[1:10]) %*% t(svd3$v[, 1:10])
par(mfrow = c(1, 2))
image(t(approx10)[,nrow(approx10):1], main = "10 Component")
image(t(faceData)[,nrow(faceData):1], main = "Original")
Color Packages in R Plots
• proper use of color can help convey the message by improving clarity/contrast of data presented
• default color schemes for most plots in R are fairly terrible, so some external packages are helpful
grDevices Package
• colorRampPalette function
– takes any set of colors and returns a function that takes integer arguments and returns a vector of colors interpolating the palette (like heat.colors or topo.colors)
– pal <- colorRampPalette(c("red", "yellow")) defines a colorRampPalette function
– pal(10) returns 10 interpolated colors in hexadecimal format that range between the defined ends
of spectrum
– example
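A rough Python sketch of what colorRampPalette returns: a function yielding n hex colors linearly interpolated between two endpoints (R interpolates in a configurable color space; this simplified version works directly in RGB):

```python
def color_ramp_palette(colors):
    def to_rgb(h):
        return tuple(int(h[i:i + 2], 16) for i in (1, 3, 5))
    lo, hi = to_rgb(colors[0]), to_rgb(colors[1])
    def pal(n):
        # n evenly spaced colors from lo to hi, as "#RRGGBB" strings
        out = []
        for i in range(n):
            t = i / (n - 1) if n > 1 else 0.0
            rgb = tuple(round(a + (b - a) * t) for a, b in zip(lo, hi))
            out.append("#%02X%02X%02X" % rgb)
        return out
    return pal

pal = color_ramp_palette(["#FF0000", "#FFFF00"])  # red to yellow
print(pal(3))  # ['#FF0000', '#FF8000', '#FFFF00']
```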
• rgb function
– red, green, and blue arguments = values between 0 and 1
– alpha = 0.5 = transparency control, values between 0 and 1
– returns hexadecimal string for color that can be used in plot/image commands
– the colorspace package can be used for finer control over colors
– example
x <- rnorm(200); y <- rnorm(200)
par(mfrow=c(1,2))
# normal scatter plot
plot(x, y, pch = 19, main = "Default")
# using transparency shows data much better
plot(x, y, col = rgb(0, 0, 0, 0.2), main = "With Transparency")
[Figure: "Default" vs "With Transparency" scatterplots; transparency reveals overlapping points]
RColorBrewer Package
• brewer.pal(n, "BuGn") function
– n = number of colors to generate
– "BuGn" = name of the palette
∗ ?brewer.pal lists all available palettes
– returns a list of n hexadecimal colors
• example
library(RColorBrewer)
# generate 3 colors using brewer.pal function
cols <- brewer.pal(3, "BuGn")
pal <- colorRampPalette(cols)
par(mfrow=c(1,3))
# heat.colors/default
image(volcano, main = "Heat.colors/Default")
# topographical colors
image(volcano, col = topo.colors(20), main = "Topographical Colors")
# RColorBrewer colors
image(volcano, col = pal(20), main = "RColorBrewer Colors")
[Figure: three image() plots of the volcano dataset: heat.colors (default), topo.colors, and the RColorBrewer palette]
• smoothScatter function
– used to plot large quantities of data points
– creates 2D histogram of points and plots the histogram
– default color scheme = “Blues” palette from RColorBrewer package
– example
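The 2D histogram behind smoothScatter is simple binning; a minimal Python sketch (function and data made up for illustration; it assumes all points fall inside [lo, hi) on both axes):

```python
def hist2d(xs, ys, bins, lo, hi):
    # count points per cell of a bins x bins grid over [lo, hi) on each axis
    counts = [[0] * bins for _ in range(bins)]
    width = (hi - lo) / bins
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), bins - 1)
        j = min(int((y - lo) / width), bins - 1)
        counts[j][i] += 1
    return counts

xs = [0.1, 0.2, 0.9, 0.8]
ys = [0.1, 0.15, 0.9, 0.85]
print(hist2d(xs, ys, 2, 0.0, 1.0))  # [[2, 0], [0, 2]]
```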
[Figure: smoothScatter() example showing the 2D density of a large point cloud]
Case Study: Human Activity Tracking with Smart Phones
##
## laying sitting standing walk walkdown walkup
## 1407 1286 1374 1226 986 1073
0.0
tBodyAcc.mean...X
tBodyAcc.mean...Y
0.2
−0.2
0.0
standing
−0.4
sitting
laying
−0.2
walk
−0.6
walkdown
walkup
Index Index
41
Height
67 239 72
68 71 238 73
par(mfrow=c(1,2))
133
261
84
174
191
15
338
242
135
141
271
241
92
106
# create 1 x 2 panel
# form hclust object
146
101
117
337
93
source("myplclust.R")
262
85
252
52
214
301
341
136
139
# run myplclust on data
147
300
126
346
293
332
118
251
272
143
# load myplclust function
340
144
142
104
171
333
258
97
162
# calculate distance matrix
82
96
110
122
34290
123
87
115
279
29
268
286
266
321
178
288
310
256
278
121
325
40
75
324
130
277
250
228
167
179
170
66
distanceMatrix <- dist(sub1[,1:3])
163
294
330
263
273
164
215
127
161
316
153
243
154
107
hclustering <- hclust(distanceMatrix)
307
88
112
336
86
168
176
254
108
119
157
264
315
116
344
295
305
328
274
42
Clustering Based on Only Average Acceleration
39
98
102
105
76
10
25
77
180
53
194
229
distanceMatrix
43
46
78
232
237
36
253
11
205
185
496
plot(sub1[,10],pch=19,col=sub1$activity,ylab=names(sub1)[10],
231
61
223
plot(sub1[,11],pch=19,col = sub1$activity,ylab=names(sub1)[11],
234
14
200
21
Cluster Dendrogram
199
18
26
188
201
220
60
30
51
236
184
198
62
213
221
38
56
20
235
211
230
202
47
216
224
37
24
55
225
41
217
57
64
207
222
42
59
2197
45
218
19
58
233
206
209
175
197
44
187
13
50 3
186
210
23
195
12
189
181
190
298
248
265
334
303
335
99
257
137
246
302
81
103
347
34
114
287
203
109
311
148
158
165
247
283
22669
297 7028
32
33
331 65
149
156
318
327
125
345
150
140
155
173
313
322
151
304
159
145
138
270
299
31
95
89
134
Max Body Acceleration for X Max Body Acceleration for Y
0.2
0.5
tBodyAcc.max...X
tBodyAcc.max...Y
0.0
0.0
−0.2
−0.5
−0.4
−0.6
−1.0
Index Index
43
# cluster based on maximum acceleration
distanceMatrix <- dist(sub1[,10:12])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col = unclass(sub1$activity))

[figure: Cluster Dendrogram, distanceMatrix, hclust (*, "complete"); the moving
activities separate from the stationary ones, which remain jumbled together]

# singular value decomposition, excluding the subject and activity columns
svd1 <- svd(scale(sub1[,-c(562,563)]))
# create 1 x 2 panel
par(mfrow=c(1,2))
# plot the first two left singular vectors, colored by activity
plot(svd1$u[,1], col = sub1$activity, pch = 19)
plot(svd1$u[,2], col = sub1$activity, pch = 19)

[figure: svd1$u[, 1] and svd1$u[, 2] vs. Index, colored by activity]
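How the right singular vectors (`svd1$v`) point at influential columns can be sketched on a toy matrix (all values hypothetical; the columns are centered but not rescaled here, so the inflated column keeps its large variance):

```r
# toy matrix: 20 rows, 5 columns of noise, with a strong pattern added to column 4
set.seed(2)
m <- matrix(rnorm(100), nrow = 20)
m[1:10, 4] <- m[1:10, 4] + 10

# center the columns, then decompose
s <- svd(scale(m, center = TRUE, scale = FALSE))

# the column with the largest loading on the first right singular vector
# is the one driving the dominant pattern
maxCol <- which.max(abs(s$v[, 1]))
maxCol
## [1] 4
```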
# find the feature contributing most to the variation in the second singular vector
maxContrib <- which.max(svd1$v[,2])
# cluster based on maximum acceleration plus the maximum contributor
distanceMatrix <- dist(sub1[,c(10:12,maxContrib)])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col = unclass(sub1$activity))

[figure: Cluster Dendrogram, distanceMatrix, hclust (*, "complete"); the three moving
activities now separate into distinct clusters, while laying/sitting/standing remain mixed]

# name of the maximum contributing feature
names(samsungData)[maxContrib]
## [1] "fBodyAcc.meanFreq...Z"

# run the k-means algorithm with 6 clusters
kClust <- kmeans(sub1[,-c(562,563)],centers=6)
# tabulate the 6 clusters against the 6 activities, but many clusters contain multiple activities
table(kClust$cluster,sub1$activity)
##
##     laying sitting standing walk walkdown walkup
##   1      0      37       51    0        0      0
##   2      3       0        0    0        0     53
##   3     18      10        2    0        0      0
##   4      0       0        0    0       49      0
##   5     29       0        0    0        0      0
##   6      0       0        0   95        0      0
# run k-means algorithm 100 times
kClust <- kmeans(sub1[,-c(562,563)],centers=6,nstart=100)
# tabulate results
table(kClust$cluster,sub1$activity)
##
##     laying sitting standing walk walkdown walkup
##   1     18      10        2    0        0      0
##   2      0       0        0    0       49      0
##   3     29       0        0    0        0      0
##   4      0      37       51    0        0      0
##   5      0       0        0   95        0      0
##   6      3       0        0    0        0     53
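The role of `nstart` can be sketched on toy data: `kmeans` reruns the algorithm `nstart` times from different random centers and keeps the solution with the smallest total within-cluster sum of squares (values here are hypothetical):

```r
# three well-separated 1-D groups (hypothetical)
set.seed(3)
x <- matrix(c(rnorm(30, 0), rnorm(30, 10), rnorm(30, 20)))

k1   <- kmeans(x, centers = 3)               # a single random start
k100 <- kmeans(x, centers = 3, nstart = 100) # best of 100 random starts

# the best-of-100 fit is never worse by this criterion
k100$tot.withinss <= k1$tot.withinss
## [1] TRUE
```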
# plot first 10 centers of the laying cluster to understand which features drive the activity
plot(kClust$centers[1,1:10],pch=19,ylab="Cluster Center",xlab="")

[figure: first 10 cluster-center values for the laying cluster]
# plot first 10 centers of the walking cluster to understand which features drive the activity
plot(kClust$centers[4,1:10],pch=19,ylab="Cluster Center",xlab="")

[figure: Cluster 2 Variable Centers (Walking), first 10 cluster-center values]
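The `centers` component plotted above is a matrix with one row per cluster and one column per feature; a toy sketch with hypothetical values:

```r
# two hypothetical groups in 4 features
set.seed(5)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
             matrix(rnorm(40, mean = 8), ncol = 4))
k <- kmeans(toy, centers = 2, nstart = 20)

# one row per cluster, one column per feature
dim(k$centers)
## [1] 2 4

# plotting a single row shows that cluster's mean value for each feature
plot(k$centers[1, ], pch = 19, ylab = "Cluster Center", xlab = "")
```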
Case Study: Fine Particle Pollution in the U.S. from 1999 to 2012
summary(x0)
## NA.1999 NA.2012
## 1 0.1125608 0.05607125
par(mfrow = c(1,2))
# regular boxplot, data too right skewed
boxplot(x0, x1, main = "Regular Boxplot")
# log boxplot, significant difference in means, but more spread
boxplot(log10(x0), log10(x1), main = "log Boxplot")
[figure: Regular Boxplot vs. log Boxplot of x0 and x1; the raw scale is dominated by
the right tail, while the log10 scale shows the shift in the distributions]
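Why the log scale helps here can be sketched with a hypothetical right-skewed sample: on the raw scale the mean is dragged far above the median, while on the log10 scale the sample is roughly symmetric:

```r
# hypothetical right-skewed data (lognormal)
set.seed(4)
x <- rlnorm(1000, meanlog = 2, sdlog = 1)

mean(x) > median(x)                            # raw scale: strong right skew
## [1] TRUE
abs(mean(log10(x)) - median(log10(x))) < 0.1   # log scale: roughly symmetric
## [1] TRUE
```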
# count negative values in the 2012 data
negative <- x1 < 0
sum(negative, na.rm = TRUE)
## [1] 26474
# proportion of negative values (about 2%)
mean(negative, na.rm = TRUE)
## [1] 0.0215034
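The count/proportion pair above is the usual trick of calling `sum` and `mean` on a logical vector (hypothetical values):

```r
# hypothetical vector with a missing value
v <- c(-1, 2, NA, -3, 4, 5)
neg <- v < 0             # logical vector; the NA propagates

sum(neg, na.rm = TRUE)   # how many values are negative
## [1] 2
mean(neg, na.rm = TRUE)  # what fraction of the non-missing values are negative
## [1] 0.4
```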
[figure: Histogram of dates, plotted as a density]
sapply(split(cnt1, cnt1$county.site), nrow)
## [1] 30 29
dim(pm0sub)
## [1] 122 29
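The `split`/`sapply` idiom above (count rows per group) can be sketched with a hypothetical monitor table:

```r
# hypothetical monitor records; county.site identifies a monitor
cnt <- data.frame(county.site = c("1.5", "1.5", "1.5", "2.3", "2.3"),
                  value = c(9, 8, 7, 6, 5))

# number of observations recorded at each monitor
sapply(split(cnt, cnt$county.site), nrow)
## 1.5 2.3
##   3   2
```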
[figure: 2012 subset pollution levels (x1sub vs. dates1)]
Plot data for 1999

[figure: PM2.5 Pollution Level in 1999 (x0sub vs. dates0)]
[figure: Pollution in 1999 vs. Pollution in 2012, side by side; x0sub vs. dates0
(Jul to Nov) and x1sub vs. dates1 (Jan 01 to Mar 15) on the same y-scale]
# divide data by state and find the mean pollution level for 1999
mn0 <- with(pm0, tapply(Sample.Value, State.Code, mean, na.rm = T))
# divide data by state and find the mean pollution level for 2012
mn1 <- with(pm1, tapply(Sample.Value, State.Code, mean, na.rm = T))
# convert to data frames while preserving state names
d0 <- data.frame(state = names(mn0), mean = mn0)
d1 <- data.frame(state = names(mn1), mean = mn1)
# merge the 1999 and 2012 means by state
mrg <- merge(d0, d1, by = "state")
# dimension of combined data frame
dim(mrg)
## [1] 52 3
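The tapply-then-merge pipeline above can be sketched end to end with hypothetical values; note that `merge` defaults to an inner join, so only states present in both years survive:

```r
# hypothetical readings for two years, tagged with a state code
pm99 <- data.frame(State.Code = c("01", "01", "02", "02"),
                   Sample.Value = c(10, 14, 6, NA))
pm12 <- data.frame(State.Code = c("01", "02", "03"),
                   Sample.Value = c(9, 5, 4))

# state-wise means, ignoring missing values
m0 <- with(pm99, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
m1 <- with(pm12, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))

# convert to data frames while preserving state names, then merge by state
e0 <- data.frame(state = names(m0), mean = m0)
e1 <- data.frame(state = names(m1), mean = m1)
m  <- merge(e0, e1, by = "state")
m
##   state mean.x mean.y
## 1    01     12      9
## 2    02      6      5
```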
# plot the pollution levels data points for 1999
with(mrg, plot(rep(1, 52), mrg[, 2], xlim = c(.8, 2.2), ylim = c(3, 20),
main = "PM2.5 Pollution Level by State for 1999 & 2012",
xlab = "", ylab = "State-wide Mean PM"))
# plot the pollution levels data points for 2012
with(mrg, points(rep(2, 52), mrg[, 3]))
# connect the dots with line segments
segments(rep(1, 52), mrg[, 2], rep(2, 52), mrg[, 3])
# add 1999 and 2012 labels
axis(1, c(1, 2), c("1999", "2012"))
[figure: PM2.5 Pollution Level by State for 1999 & 2012; each state's 1999 and 2012
means connected by a line segment, x-axis labeled 1999 and 2012]