R Session A
R Session A
R Session A
> v = rnorm(256)
> A = as.matrix (v,16,16)
> summary(A)
> library (fields)
> image.plot (A)
>…
> dyn.load( “foo.so”)
> .C( “foobar” )
> dyn.unload( “foo.so” )
2
Why R?
Scripting
(R, MATLAB, IDL)
Object Oriented
(C++, Java)
Functional languages
(C, Fortran)
Assembly
4
Features of R
6
Exploring the iris data
• Load iris data into your R session:
– data (iris);
– help (data);
• Check that iris was indeed loaded:
– ls ();
• Check the class that the iris object belongs to:
– class (iris);
• Print the content of iris data:
– iris;
• Check the dimensions of the iris data:
– dim (iris);
• Check the names of the columns:
– names (iris);
7
Basic usage: arithmetic in R
9
1
0 Outline
• Variables and Vectors
• Factors
• Arrays and Matrices
• Data Frames
• Functions and Conditionals
• Graphical Procedures
Variables and assignment
• More
• msg <- “hello”
• print(msg)
• R vector, integer sequence of length 20
• x <- 10:30
• R Objects. R has five basic or “atomic” classes of objects
• Character
• Numeric(real numbers)
• Integer
• Complex
• Logical
R Commands
• Commands can be expressions or assignments
• Separate by semicolon or new line
• Can split across multiple lines
• R will change prompt to + if command not finished
• Useful commands for variables
• ls(): List all stored variables
• rm(x): Delete one or more variables
• class(x): Describe what type of data a variable stores
• save(x,file=“filename”): Store variable(s) to a binary file
• load(“filename”): Load all variables from a binary file
• Save/load in current directory or My Documents by default
1
4 A Numeric Vector
• Simplest data structure
– Numeric vector
– > v <- c(1,2,3)
– <- is the assignment operator
– c is the list concatenation operator
• To print the value, v
– Type : > v
– Output: [1] 1 2 3
1
5 A vector is a full fledged variable
• Let us do the following:
• > 1/v
[1] 1.0000000 0.5000000 0.3333333
• >v+2
[1] 3 4 5
• We can treat a vector as a regular variable
• For example, we can have:
– > v1 <- v / 2
> v1
[1] 0.5 1.0 1.5
1
6 Creating a vector with vectors
• > v <- c (1,2,3)
>v
[1] 1 2 3
> vnew <- c (v,0,v)
> vnew
[1] 1 2 3 0 1 2 3
The c operator concatenates all the vectors
R Vectors
• The c() function can be used to create vectors of objects by
concatenating things together.
• X <- c(0.5, 0.5) ## Numeric
• x <- c(TRUE, FALSE) ## logical
• x <- c(T, F) ## logical
• x <- c("a", "b", "c") ## character
• x <- 9:29 ## integer
• x <- c(1+0i, 2+4i) ## complex
• You can also use the vector() function to initialize vectors.
• x <- vector("numeric", length = 10)
1
8
Functions on Vectors and Complex
Numbers
• If v is a vector
• Here, are a few of the functions that take vectors as
inputs:
mean(v), max(v), sqrt(v), length(v), sum(v),
prod(v), sort (v) (in ascending order)
• > x <- 1 + 1i
> y <- 1i
>x*y
[1] -1+1i
1
9 Generating Vectors
• Suppose we want a vector of the form:
(1,2,3,... 100)
• We do not have to generate it manually.
• We can use the following commands:
> v <- 1:100
OR
> v <- seq(1,100)
• seq takes an additional argument, which is the
difference between consecutive numbers:
– seq (1,100,10) gives (1,11,21,31 ... , 91)
• rep (2,5) generates a vector (2, 2, 2, 2, 2)
2
0 Boolean Variables and Vectors
25
2
6 Outline
Indian 6
Chinese 8
Indian
Indian 7
Chinese
Chinese 9
Russian
Indian 8
Factor
Russian 10
Nationality Marks
2
9 Code
# character starts
a comment
List of marks
3
1 Time for the results
> results
Chinese Indian Russian
8.5 7.0 10.0
> v[2,1,2]
[1] 6
3
7 The matrix command
• A matrix is a 2-D array
• There is a fast method of creating a matrix
– Use the matrix (data, dim1, dim2) command
• Example:
> matrix(1:4, 2, 2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
38 cbind and rbind
cbind
mat1
mat1 mat2
mat2
rbind
3
9
Problem: set the diagonal elements of a
matrix to 0
Feature Function
QR decomposition qr
Matrices & matrix operations
To create a matrix:
# matrix() command to create matrix A with rows and cols
A=matrix(c(54,49,49,41,26,43,49,50,58,71),nrow=5,ncol=2))
B=matrix(1,nrow=4,ncol=4)
It is a table in R
> df
entries price num
1 cars 8 1
2 trucks 10 2
3 bikes 5 3
50 Accessing an Element
Group by entries
> aggregate(df,by = list(entries), mean)
Group.1 entries price num
1 bikes NA 5 3
2 cars NA 8 1
3 trucks NA 10 2
52 Reading Data from Files
Additional resources
Beginning R: An Introduction to Statistical Programming by Larry Pace
Introduction to R webpage on APSnet:
http://www.apsnet.org/edcenter/advanced/topics/ecologyandepidemiologyinr
/introductiontor/Pages/default.aspx
The R Inferno:
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
59
Conditional statements
• Perform different commands in different situations
• if (condition) command_if_true
• Can add else command_if_false to end
• Group multiple commands together with braces {}
• if (cond1) {cmd1; cmd2;} else if (cond2) {cmd3; cmd4;}
• Conditions use relational operators
• ==, !=, <, >, <=, >=
• Do not confuse = (assignment) with == (equality)
• = is a command, == is a question
• Combine conditions with and (&&) and or (||)
• Use & and | for vectors of length > 1 (element-wise)
Loops
• Most common type of loop is the for loop
• for (x in v) { loop_commands; }
• v is a vector, commands repeat for each value in v
• Variable x becomes each value in v, in order
• Example: adding the numbers 1-10
• total = 0; for (x in 1:10) total = total + x;
• Other type of loop is the while loop
• while (condition) { loop_commands; }
• Condition is identical to if statement
• Commands are repeated until condition is false
• Might execute commands 0 times if already false
• while loops are useful when you don’t know number of iterations
Scripting in R
• A script is a sequence of R commands that perform some
common task
• E.g., defining a specific function, performing some analysis
routine, etc.
• Save R commands in a plain text file
• Usually have extension of .R
• Run scripts with source() :
• source(“filename.R”)
• To save command output to a file, use sink():
• sink(“output.Rout”)
• sink() restores output to console
• Can be used with or outside of a script
Lists
• Objects containing an ordered collection of objects
• Components do not have to be of same type
• Use list() to create a list:
• a <- list(“hello”,c(4,2,1),“class”);
• Components can be named:
• a <- list(string1=“hello”,num=c(4,2,1),string2=“class”)
• Use [[position#]] or $name to access list elements
• E.g., a[[2]] and a$num are equivalent
• Running the length() command on a list gives the number of
higher-level objects
Sampling and Creating simulated data
• Create data from a specific distribution
• rnorm(5,0,1)
• ## [1] -0.2986 -0.3762 0.7070 0.1605 0.7018
• rnorm(10,6,2)
• ## [1] 7.530 7.741 5.947 4.622 8.804 8.149 10.949 3.339 6.213
• ## [10] 6.554
• Sample from existing data
• sample(1:10)
• sample(1:10,replace = TRUE)
• sd
Writing your own functions
• Writing functions in R is defined by an assignment like:
• a <- function(arg1,arg2) { function_commands; }
• Functions are R objects of type “function”
• Functions can be written in C/FORTRAN and called via .C() or
.Fortran()
• Arguments may have default values
• Example: my.pow <- function(base, pow = 2) {return
base^pow;}
• Arguments with default values become optional, should usually
appear at end of argument list (though not required)
• Arguments are untyped
• Allows multipurpose functions that depend on argument type
Writing your own functions
• Writing functions in R is defined by an assignment like:
• a <- function(arg1,arg2) { function_commands; }
• myMean<- function(y1){
mean<- sum(y1)/length(y1)
return(mean)
}
• testVec<- rnorm(50,20,4)
• mean(testVec)
• ## [1] 19.72
• myMean(testVec)
• ## [1] 19.72
Writing your own functions
• Writing functions in R is defined by an assignment like:
• a <- function(arg1,arg2) { function_commands; }
• Functions are R objects of type “function”
• Functions can be written in C/FORTRAN and called via .C() or
.Fortran()
• Arguments may have default values
• Example: my.pow <- function(base, pow = 2) {return
base^pow;}
• Arguments with default values become optional, should usually
appear at end of argument list (though not required)
• Arguments are untyped
• Allows multipurpose functions that depend on argument type
68 Applying a Function
[[2]]
[1] 8
> li = list("klaus","martin","georg")
> sapply(li, toupper)
[1] "KLAUS" "MARTIN" "GEORG"
Scope of variables in R
Function arguments (valid only inside the function)
Local variables (valid only inside the function)
Global variables (balance)
74 Functional Programming: Closures
area <- 0
+ points <- seq(a, b, length = n + 1)
+
+ area <- 0
+ for (i in seq_len(n)) { Function for
+ area <- area + rule(f, points[i], numerical
points[i + 1]) integration
+ }
+
+ area
+ }
> midpoint <- function(f, a, b) {
+ (b - a) * f((a + b) / 2) Midpoint rule
function passed + }
as an argument > composite(sin, 0, pi, n = 1000, rule =
midpoint)
[1] 2.00000
76 Outline
A basic 2D plot:
vec1 <-cube(seq(1,100,10)) Plot type
(overplotted)
vec2 <-cube(seq(5,100,10))
plot(vec1, type="o", col="blue“, ylim=c(0,3e5))
title(main=“Plot of Cubes", col.main="red")
library("MASS")
data(cats) # load data
plot(cats$Bwt, cats$Hwt) # scatter plot of cats body weight vs heart rate
M <- lm(formula = cats$Hwt ~ cats$Bwt, data=cats) # fit a linear model
regmodel <- predict(M) # predict values using this model
plot(cats$Bwt, cats$Hwt, pch = 16, cex = 1.3, col = "blue", main = "Heart
rate plotted against body weight of cats", xlab = "Body weight", ylab =
"Heart rate") # scatter plot
abline(M) # plot the regression line
79 Creating 3-D plots
#source: http://blog.revolutionanalytics.com/2014/02/3d-
plots-in-r.html
library ('ggplot2')
library(plot3D)
par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
R <- 3; r <- 2
x <- seq(0, 2*pi,length.out=50)
y <- seq(0, pi,length.out=50)
M <- mesh(x, y)
alpha <- M$x; beta <- M$y
surf3D(x = (R + r*cos(alpha)) * cos(beta),
y = (R + r*cos(alpha)) * sin(beta),
z = r * sin(alpha),
colkey=FALSE,
bty="b2",
main="Half of a Torus")
81 Creating 3-D plots: persp3d
xdim <- 16
newmap <- array(0,dim=c(xdim,xdim))
newmap <- rnorm(256,1,.2)
jet.colors <- colorRampPalette( c("yellow", "red") )
pal <- jet.colors(100)
col.ind <- cut(newmap,100) # colour indices of each point
persp3d(seq(1:xdim),seq(1:xdim),newmap,shade=TRUE,
type="wire", col=pal[col.ind],xlab="",ylab="",zlab="",
cex.axis=1.5,xtics="",aspect=2,zlim=c(0,5))
Graphical display and plotting
Splitting Attributes
Tid Refund Marital Taxable
Status Income Cheat
MarSt Single,
Married Divorced
Tid Refund Marital Taxable
Status Income Cheat
NO Refund
1 Yes Single 125K No
Yes No
2 No Married 100K No
3 No Single 70K No NO TaxInc
4 Yes Married 120K No < 80K > 80K
5 No Divorced 95K Yes
NO YES
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No There could be more than one tree that
10 No Single 90K Yes fits the same data!
10
Decision Tree Classification Task
6 No Medium 60K No
Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?
15 No Large 67K ?
10
Test Set
Apply Model to Test Data
Test Data
Start from the root of tree. Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
Apply Model to Test Data
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
Decision Trees
• Used for classifying data by partitioning
attribute space
• Tries to find axis-parallel decision boundaries
for specified optimality criteria
• Leaf nodes contain class labels, representing
classification decisions
• Keeps splitting nodes based on split criterion,
such as GINI index, information gain or entropy
• Pruning necessary to avoid overfitting
Decision Trees in R
mydata<-data.frame(iris)
attach(mydata)
library(rpart)
model<-rpart(Species ~ Sepal.Length +
Sepal.Width + Petal.Length +
Petal.Width,
data=mydata,
method="class")
plot(model)
text(model,use.n=TRUE,all=TRUE,cex=0.8)
Decision Trees in R
library(tree)
model1<-tree(Species ~ Sepal.Length
+ Sepal.Width + Petal.Length +
Petal.Width,
data=mydata,
method="class",
split="gini")
plot(model1)
text(model1,all=TRUE,cex=0.6)
Decision Trees in R
library(party)
model2<-
ctree(Species ~
Sepal.Length +
Sepal.Width +
Petal.Length +
Petal.Width,
data=mydata)
plot(model2)
Controlling number of nodes
This is just an
example. You
library(tree)
can come up
mydata<-
with better or
data.frame(iris)
more efficient
attach(mydata)
methods!
model1<-tree(Species
~ Sepal.Length +
Sepal.Width +
Petal.Length +
Petal.Width,
data=mydata,
method="clas
s",
control =
tree.control(n
obs = 150, mincut =
10))
Controlling number of nodes
model2<-
ctree(Species ~ This is just an
Sepal.Length + example. You
Sepal.Width + can come up
Petal.Length + with better or
Petal.Width, more efficient
data = mydata, methods!
controls =
ctree_control(maxd
epth=2))
plot(model2)
id <- seq(1:18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
> anova(reg)
Analysis of Variance Table
Response: chol
Df Sum Sq Mean Sq F value Pr(>F)
age 1 10.4944 10.4944 114.57 1.058e-08 ***
Residuals 16 1.4656 0.0916
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results of R analysis
> summary(reg)
Call:
lm(formula = chol ~ age)
Residuals:
Min 1Q Median 3Q Max
-0.40729 -0.24133 -0.04522 0.17939 0.63040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218 0.221466 4.918 0.000154 ***
age 0.057788 0.005399 10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2
Standardized residuals
6 6
Residuals
1
0
-1
-0.4
17
17
8 1
8
2
6
Standardized residuals
Standardized residuals
17 6 0.5
1.0
1
0
0.5
-1
2
Cook's distance
0.0
0.5
2.5 3.0 3.5 4.0 4.5 0.00 0.05 0.10 0.15 0.20 0.25
id <- seq(1:44)
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00,
14.00, 14.00, 14.80, 15.00, 15.00, 15.50, 16.00,
16.50, 17.00, 17.00, 18.00, 18.00, 19.00, 19.00,
20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00,
24.00, 24.50, 25.00, 25.00, 26.00, 26.00, 26.50,
28.00, 29.00, 31.00, 32.00, 33.00, 34.00, 35.50,
36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5,
3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3,
6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7,
3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1,
2.1, 2.0, 1.8, 1.7)
Linear regression analysis of BMI
and SA
reg <- lm (sa ~ bmi)
summary(reg)
Residuals:
Min 1Q Median 3Q Max
-2.54204 -0.97584 0.05082 1.16160 2.70856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.92512 0.64489 7.637 1.81e-09 ***
bmi -0.05967 0.02862 -2.084 0.0432 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2
21 21
20 20
Standardized residuals
2
1
1
Residuals
0
-1
-1
-2
10
-3
-2
10
2
1.2
Standardized residuals
Standardized residuals
1
0.8
0
0.4
-1
1
-2
10 3
Cook's distance
0.0
3.0 3.5 4.0 0.00 0.02 0.04 0.06 0.08 0.10 0.12
6
5
sa
4
3
2
10 15 20 25 30 35
bmi
Reference