R Session A


Introduction to R

What is R and why do we use it?

• Open source, most widely used for statistical analysis and graphics
• Extensible via dynamically loadable add-on packages
• >1,800 packages on CRAN

> v = rnorm(256)
> A = matrix(v, 16, 16)
> summary(A)
> library(fields)
> image.plot(A)
> …
> dyn.load("foo.so")
> .C("foobar")
> dyn.unload("foo.so")
Why R?

• Statistics & Data Mining
– Commercial
• Technical computing
– Matrix and vector formulations
– Data visualization and analysis platform
– Image processing, vector computing
• Statistical computing and graphics (http://www.r-project.org)
– Developed by R. Gentleman & R. Ihaka
– Expanded by community as open source
– Statistically rich
The Programmer’s Dilemma
What programming language to use & why?

• Scripting (R, MATLAB, IDL)
• Object-oriented (C++, Java)
• Procedural (C, Fortran)
• Assembly
Features of R

R is an integrated suite of software for data manipulation, calculation, and graphical display.

• Effective data handling
• Various operators for calculations on arrays/matrices
• Graphical facilities for data analysis
• Well-developed language including conditionals, loops, recursive functions, and I/O capabilities
Useful R links
• R Home: http://www.r-project.org/
• R’s CRAN package distribution: http://cran.cnr.berkeley.edu/
• Introduction to R manual:
http://cran.cnr.berkeley.edu/doc/manuals/R-intro.pdf
• Writing R extensions:
http://cran.cnr.berkeley.edu/doc/manuals/R-exts.pdf
• Other R documentation:
http://cran.cnr.berkeley.edu/manuals.html

Exploring the iris data
• Load iris data into your R session:
– data (iris);
– help (data);
• Check that iris was indeed loaded:
– ls ();
• Check the class that the iris object belongs to:
– class (iris);
• Print the content of iris data:
– iris;
• Check the dimensions of the iris data:
– dim (iris);
• Check the names of the columns:
– names (iris);

Basic usage: arithmetic in R

• You can use R as a calculator
• Typed expressions will be evaluated and printed out
• Main operations: +, -, *, /, ^
• Obeys order of operations
• Use parentheses to group expressions
• More complex operations appear as functions
– sqrt(2)
– sin(pi/4), cos(pi/4), tan(pi/4), asin(1), acos(1), atan(1)
– exp(1), log(2), log10(10)
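As a small illustration of the points on this slide, the sketch below evaluates a few expressions and stores the results. The variable names (a, b, root) are chosen here for illustration only.

```r
# Operator precedence: ^ binds tighter than * and /, which bind tighter than + and -
a <- 2 + 3 * 4 ^ 2      # evaluated as 2 + (3 * 16)
b <- (2 + 3) * 4 ^ 2    # parentheses change the grouping: 5 * 16
print(c(a, b))

# Functions cover the more complex operations
root <- sqrt(2)
print(root ^ 2)

# log() is the natural logarithm, so it undoes exp()
print(log(exp(1)))
```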
Getting help
• help(function_name)
– help(prcomp)
• ?function_name
– ?prcomp
• help.search("topic")
– ??topic or ??"topic"
• Search CRAN
– http://www.r-project.org
• From R GUI: Help → Search help…
• CRAN Task Views (for individual packages)
– http://cran.cnr.berkeley.edu/web/views/

Outline
• Variables and Vectors
• Factors
• Arrays and Matrices
• Data Frames
• Functions and Conditionals
• Graphical Procedures
Variables and assignment

• Use variables to store values
• Three ways to assign variables
– a = 6
– a <- 6
– 6 -> a
• Update variables by using the current value in an assignment
– x = x + 1
• Naming rules
– Can include letters, numbers, ., and _
– Names are case sensitive
– Must start with a letter, or with . not followed by a number
Variables and assignment

• More examples
– msg <- "hello"
– print(msg)
• R vector, integer sequence of length 21
– x <- 10:30
• R objects: R has five basic or "atomic" classes of objects
– Character
– Numeric (real numbers)
– Integer
– Complex
– Logical
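The five atomic classes can be checked directly with class(). The following is a minimal sketch; the list name values and its component names are illustrative only.

```r
# One example value for each atomic class
values <- list(
  character = "hello",
  numeric   = 3.14,
  integer   = 10L,      # the L suffix requests an integer literal
  complex   = 1 + 2i,
  logical   = TRUE
)

# class() reports the class of each value
classes <- sapply(values, class)
print(classes)

# The : operator produces an integer vector
x <- 10:30
print(class(x))
print(length(x))
```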
R Commands
• Commands can be expressions or assignments
• Separate by semicolon or new line
• Can split across multiple lines
• R will change prompt to + if command not finished
• Useful commands for variables
• ls(): List all stored variables
• rm(x): Delete one or more variables
• class(x): Describe what type of data a variable stores
• save(x,file="filename"): Store variable(s) to a binary file
• load("filename"): Load all variables from a binary file
• Save/load in current directory or My Documents by default
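A short round trip through save() and load() might look like the sketch below. It writes to a temporary file (an assumption made here so the example does not touch the working directory).

```r
x <- c(1, 2, 3)
msg <- "hello"

# Temporary file used only for this illustration
path <- tempfile(fileext = ".RData")
save(x, msg, file = path)

# Remove the variables, then restore them from the file
rm(x, msg)
restored <- load(path)   # load() returns the names of the restored objects
print(restored)
print(x)

unlink(path)             # clean up the temporary file
```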
A Numeric Vector
• Simplest data structure
– Numeric vector
– > v <- c(1,2,3)
– <- is the assignment operator
– c() is the function that combines (concatenates) values into a vector
• To print the value of v
– Type: > v
– Output: [1] 1 2 3
A vector is a full-fledged variable
• Let us do the following:
> 1/v
[1] 1.0000000 0.5000000 0.3333333
> v + 2
[1] 3 4 5
• We can treat a vector as a regular variable
• For example, we can have:
> v1 <- v / 2
> v1
[1] 0.5 1.0 1.5
Creating a vector with vectors
> v <- c(1,2,3)
> v
[1] 1 2 3
> vnew <- c(v,0,v)
> vnew
[1] 1 2 3 0 1 2 3
The c() function concatenates all the vectors
R Vectors
• The c() function can be used to create vectors of objects by
concatenating things together.
• x <- c(0.5, 0.5) ## numeric
• x <- c(TRUE, FALSE) ## logical
• x <- c(T, F) ## logical
• x <- c("a", "b", "c") ## character
• x <- 9:29 ## integer
• x <- c(1+0i, 2+4i) ## complex
• You can also use the vector() function to initialize vectors.
• x <- vector("numeric", length = 10)
Functions on Vectors and Complex Numbers
• If v is a vector, here are a few of the functions that take vectors as inputs:
mean(v), max(v), sqrt(v), length(v), sum(v), prod(v), sort(v) (in ascending order)
• Complex numbers:
> x <- 1 + 1i
> y <- 1i
> x * y
[1] -1+1i
Generating Vectors
• Suppose we want a vector of the form:
(1,2,3,... 100)
• We do not have to generate it manually.
• We can use the following commands:
> v <- 1:100
OR
> v <- seq(1,100)
• seq takes an additional argument, which is the
difference between consecutive numbers:
– seq (1,100,10) gives (1,11,21,31 ... , 91)
• rep (2,5) generates a vector (2, 2, 2, 2, 2)
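seq() and rep() also accept named arguments that make the intent explicit. The sketch below shows a few common forms; the variable names are illustrative.

```r
# Step size given explicitly with by
by_step <- seq(1, 100, by = 10)
print(by_step)

# Number of points given instead of a step size
by_length <- seq(0, 1, length.out = 5)
print(by_length)

# rep() can repeat each element, or repeat the whole vector
each_rep  <- rep(c(1, 2), each = 3)
times_rep <- rep(c(1, 2), times = 3)
print(each_rep)
print(times_rep)
```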
Boolean Variables and Vectors
• R recognizes the constants: TRUE, FALSE
– TRUE corresponds to 1
– FALSE corresponds to 0
• We can define a vector of the form:
– v <- c(TRUE, FALSE, TRUE)
• A logical vector can also be created with comparison and logical operators: <, <=, >, >=, ==, !=, & and |

> v <- 1:9 > 5
> v
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
String Vectors
• Similarly, we can have a vector of strings
> vec <- c("f1", "f2", "f3")
> vec
[1] "f1" "f2" "f3"
• The paste function can be used to create a vector of strings
> paste(1:3, 3:5, sep="*")
[1] "1*3" "2*4" "3*5"
Here paste takes two vectors of the same length and an optional argument, sep. The ith element of the result contains the ith elements of both arguments, separated by the string specified by sep.
Vectors and vector operations

To create a vector:
# c() command to create vector x
x=c(12,32,54,33,21,65)
# c() to add elements to vector x
x=c(x,55,32)
# seq() command to create a sequence of numbers
years=seq(1990,2003)
# sequence in steps of .5
a=seq(3,5,.5)
# can use : to step by 1
years=1990:2003;
# rep() command to create data that follow a regular pattern
b=rep(1,5)
c=rep(1:2,4)

To access vector elements:
# 2nd element of x
x[2]
# first five elements of x
x[1:5]
# all but the 3rd element of x
x[-3]
# values of x that are < 40
x[x<40]
# values of y such that x is < 40 (y is defined below)
y[x<40]

To perform operations:
# mathematical operations on vectors
y=c(3,2,4,3,7,6,1,1)
x+y; 2*y; x*y; x/y; y^2
R Vectors - Removing NA Values
> x <- c(1, 2, NA, 4, NA, 5)
> bad <- is.na(x)
> print(bad)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> x[!bad]
[1] 1 2 4 5
Vectorized Operations
> x <- 1:4
> y <- 6:9
> z <- x + y
> z
[1] 7 9 11 13
> x
[1] 1 2 3 4
> x > 2
[1] FALSE FALSE TRUE TRUE
> x >= 2
[1] FALSE TRUE TRUE TRUE
> x < 3
[1] TRUE TRUE FALSE FALSE
> y == 8
[1] FALSE FALSE TRUE FALSE
Factors

Factor definition: a vector used to specify a grouping (classification) of objects in other vectors.

• Consider the following problem:
– We have a vector of the nationalities of students, and a vector of their marks in a given subject.
– AIM: Find the average score per nationality.
Graphical View of the Problem

Nationality  Marks
Indian       6
Chinese      8
Indian       7
Chinese      9
Indian       8
Russian      10

Factor (distinct categories): Indian, Chinese, Russian
Code

> # the # character starts a comment
> nationalities <- c("Indian", "Chinese", "Indian", "Chinese", "Indian", "Russian")
> marks <- c(6, 8, 7, 9, 8, 10)
> fac <- factor(nationalities) # create a factor
> fac
[1] Indian Chinese Indian Chinese Indian Russian
Levels: Chinese Indian Russian

• The levels of a factor indicate the categories
Code - II

• Now let us apply the factor to the marks vector
> results <- tapply(marks, fac, mean)
– marks: list of marks
– fac: the factor that assigns each mark to a category
– mean: the function applied to the marks in each category (here, compute the mean)
Time for the results

> results
Chinese Indian Russian
8.5 7.0 10.0

• Let us now apply the sum function
> tapply(marks, fac, sum)
Chinese Indian Russian
17 21 10
levels and table

> levels(fac)
[1] "Chinese" "Indian" "Russian"
> table (fac)
fac
Chinese Indian Russian
2 3 1
• Let us assume that the factor is fac.
• fac is
[1] Indian Chinese Indian Chinese Indian Russian
Levels: Chinese Indian Russian
• levels returns a vector containing all the
unique labels
• table returns a special kind of array that
contains the counts of entries for each
label
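To see what a factor stores and how grouping by it works, the sketch below rebuilds the nationality example and computes the group means by hand with split() and sapply(), which is one way (not necessarily how tapply is implemented) to get the same numbers.

```r
nationalities <- c("Indian", "Chinese", "Indian", "Chinese", "Indian", "Russian")
marks <- c(6, 8, 7, 9, 8, 10)
fac <- factor(nationalities)

# A factor stores integer codes that index into its levels
print(levels(fac))
print(as.integer(fac))

# Group the marks by level, then summarise each group
by_group <- split(marks, fac)
means <- sapply(by_group, mean)
print(means)
```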
Arrays and Matrices

• Generic array function
• Creates an array. Takes two arguments:
– data_vector: vector of values
– dimension_vector
• Example:
> array(1:10, c(2,5))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
The numbers are laid out in column-major order.
Indices count from 1, not 0.
Other ways to make arrays
• Take a vector, and assign it dimensions
> v <- c(1,2,3,4)
> dim(v) <- c(2,2)
> v
[,1] [,2]
[1,] 1 3
[2,] 2 4
Arrays are Created in Column Major Order

> v <- 1:8
> dim(v) <- c(2,2,2)
> v
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4

, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8

> v[2,1,2]
[1] 6

• Slices are printed by the last index (, , 1 then , , 2)
• Array elements are accessed by specifying their index (within square brackets)
The matrix command
• A matrix is a 2-D array
• There is a fast method of creating a matrix
– Use the matrix (data, dim1, dim2) command
• Example:
> matrix(1:4, 2, 2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
cbind and rbind

• cbind(mat1, mat2) places mat1 and mat2 side by side (joins columns)
• rbind(mat1, mat2) stacks mat1 on top of mat2 (joins rows)
Problem: set the diagonal elements of a matrix to 0

> mat <- matrix(1:16,4,4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> indices <- cbind (1:4, 1:4)
> mat[indices] <- 0
> mat
[,1] [,2] [,3] [,4]
[1,] 0 5 9 13
[2,] 2 0 10 14
[3,] 3 7 0 15
[4,] 4 8 12 0
Recycling Rule

> cbind(1:4, 1:8)
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 1 5
[6,] 2 6
[7,] 3 7
[8,] 4 8

The smaller structure is replicated to match the length of the longer structure.
Note that the length of the longer structure should be a multiple of the length of the smaller structure; otherwise R issues a warning.
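The same recycling rule applies to vector arithmetic. The sketch below shows a clean case and a case where the lengths do not divide evenly; the handler is only there to print the warning text and continue.

```r
short <- 1:2
long  <- 1:6

# short is reused three times to match the length of long
print(long + short)

# Length 3 is not a multiple of length 2, so R recycles anyway but warns
result <- withCallingHandlers(
  1:3 + 1:2,
  warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(result)
```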
Matrix Operations

• A * B is a normal element-by-element product
• A %*% B is a matrix product
• Equation solution:
– solve(A, b) (for equations of the form Ax = b)
– solve(A) returns the inverse of the matrix
> A <- matrix(1:4, 2, 2)
> b <- 5:6
> solve(A, b) # solve an equation of the form Ax = b
[1] -1 2
> solve(A) %*% b # A^-1 %*% b = x
[,1]
[1,] -1
[2,] 2
Additional Features

• nrow(mat): number of rows in the matrix
• ncol(mat): number of columns in the matrix

Feature                        Function
Eigenvalues                    eigen
Singular value decomposition   svd
Least squares fitting          lsfit
QR decomposition               qr
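A brief sketch of three of the functions in the table, applied to a small symmetric matrix chosen here for illustration. Each decomposition is checked by multiplying the factors back together.

```r
A <- matrix(c(2, 1, 1, 2), 2, 2)   # small symmetric example matrix

# Eigenvalues, returned in decreasing order
e <- eigen(A)
print(e$values)

# Singular value decomposition: A = U diag(d) V'
s <- svd(A)
rebuilt_svd <- s$u %*% diag(s$d) %*% t(s$v)

# QR decomposition: A = Q R
decomp <- qr(A)
rebuilt_qr <- qr.Q(decomp) %*% qr.R(decomp)

print(all.equal(rebuilt_svd, A))
print(all.equal(rebuilt_qr, A))
```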
Matrices & matrix operations

To create a matrix:
# matrix() command to create matrix A with rows and cols
A=matrix(c(54,49,49,41,26,43,49,50,58,71),nrow=5,ncol=2)
B=matrix(1,nrow=4,ncol=4)

To access matrix elements:
# matrix_name[row_no, col_no]
A[2,1] # 2nd row, 1st column element
A[3,] # 3rd row
A[,2] # 2nd column of the matrix
A[2:4,c(3,1)] # submatrix of 2nd-4th elements of the 3rd and 1st columns (requires a matrix with at least 3 columns)
A["KC",] # access row by name, "KC" (requires row names)

Statistical operations:
rowSums(A)
colSums(A)
rowMeans(A)
colMeans(A)
# max of each column
apply(A,2,max)
# min of each row
apply(A,1,min)

Element by element ops (A and B must have the same dimensions):
2*A+3; A+B; A*B; A/B;

Matrix/vector multiplication (columns of A must match rows of B):
A %*% B;
Vectorized Matrix Operations
> x <- matrix(1:4, 2, 2)
> y <- matrix(rep(10, 4), 2, 2)
>
> ## element-wise multiplication
> x * y
[,1] [,2]
[1,] 10 30
[2,] 20 40
>
> ## element-wise division
> x / y
[,1] [,2]
[1,] 0.1 0.3
[2,] 0.2 0.4
>
> ## true matrix multiplication
> x %*% y
[,1] [,2]
[1,] 40 40
[2,] 60 60
Useful functions for vectors and matrices

• Find # of elements or dimensions
– length(v), length(A), dim(A)
• Transpose
– t(v), t(A)
• Matrix inverse
– solve(A)
• Sort vector values
– sort(v)
• Statistics
– min(), max(), mean(), median(), sum(), sd(), quantile()
– These treat a matrix as a single vector (same with sort())
Lists and Data Frames

• A list is a heterogeneous data structure
• It can contain data belonging to all kinds of types
• Example:
> lst <- list("one", 1, TRUE)
• Elements can be lists, arrays, factors, and normal variables
• The components are always numbered
• They are accessed as follows: lst[[1]], lst[[2]], lst[[3]]
• [[ ... ]] is the operator for accessing an element in a list
Named Components

• Lists can also have named components
> lst <- list(name="Sofia", age=29, marks=33.7)
• The three components are: lst$name, lst$age, lst$marks
• We can also use lst[["name"]], lst[["age"]], lst[["marks"]]
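A common point of confusion is the difference between [[ ]] and [ ] on a list. The sketch below reuses the named list from this slide; the city component is a hypothetical addition for illustration.

```r
lst <- list(name = "Sofia", age = 29, marks = 33.7)

# [[ ]] and $ return the component itself
single <- lst[["age"]]
print(single)

# [ ] returns a list containing the selected components
sub <- lst["age"]
print(class(sub))

# New components can be added by assignment (hypothetical extra field)
lst$city <- "unknown"
print(names(lst))
```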
Data Frames

• A data frame is a table in R, organized into rows and columns

> entries <- c("cars", "trucks", "bikes")
> price <- c(8, 10, 5)
> num <- c(1, 2, 3)
> df <- data.frame(entries, price, num)
> df
entries price num
1 cars 8 1
2 trucks 10 2
3 bikes 5 3
Accessing an Element

• Can be accessed as a regular array, or as a list
> df[1,2]
[1] 8
> df[2,]
entries price num
2 trucks 10 2
> df$price
[1] 8 10 5
• The leading 2 in the df[2,] output is a row name, i.e. a character value
• summary shows a summary of each variable in the data frame (output as printed when entries is a factor, the default for character columns before R 4.0)
> summary(df)
entries price num
bikes :1 Min. : 5.000 Min. :1.0
cars :1 1st Qu.: 6.500 1st Qu.:1.5
trucks:1 Median : 8.000 Median :2.0
Mean : 7.667 Mean :2.0
3rd Qu.: 9.000 3rd Qu.:2.5
Max. :10.000 Max. :3.0

Feature                                      Function
Show first 6 rows of df                      head(df)
List objects                                 ls()
Remove variables x & y from the workspace    rm(x,y)
Sort df on variable x                        df[order(df$x),]
Operations on Data Frames

• A data frame can be sorted on the values of a variable, filtered using values of a variable, and grouped by a variable.

• E.g., filter rows where entries == "cars"
> df[df$entries == "cars",]
entries price num
1 cars 8 1

• Group by entries
> aggregate(df, by = list(entries), mean)
Group.1 entries price num
1 bikes NA 5 3
2 cars NA 8 1
3 trucks NA 10 2
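The NA values in the entries column arise because mean() is applied to a non-numeric column. One way to avoid this, sketched below under the assumption that only price and num should be averaged, is the formula interface of aggregate(). The same data frame is rebuilt so the example is self-contained.

```r
df <- data.frame(
  entries = c("cars", "trucks", "bikes"),
  price   = c(8, 10, 5),
  num     = c(1, 2, 3)
)

# Average only the numeric columns, grouped by entries
agg <- aggregate(cbind(price, num) ~ entries, data = df, FUN = mean)
print(agg)

# Sorting on a variable uses order() on that column
sorted <- df[order(df$price), ]
print(sorted)
```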
Reading Data from Files

• read.table reads in a data frame from a file
• Steps:
– Store the data frame in a file
– Read it in
> df <- read.table("<filename>")
– Access the data frame

Grouping, Loops, Conditional Execution

• R supports regular if statements, while loops, and other control structures
• if statement
– if (condition) statement1 else statement2. Use {} to group statements
– The condition should evaluate to a single logical value (not a vector)
• Example:
> x <- 3
> if (x > 0) x <- x + 3 else x <- x + 6
> x
[1] 6
For loop

for (var in expr1) {
....
....
}

Example:
> for (v in 1:10) print(v)


[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
While loop

> x <- 1:10 # values implied by the output below
> i <- 1
> while (x[i] < 10) {
+ print(x[i])
+ i <- i + 1
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9

Use the break statement to exit a loop
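A sketch of break in use, with made-up data. It also guards against running past the end of the vector, where x[i] would be NA and the loop condition would raise an error.

```r
x <- c(3, 8, 12, 5)   # illustrative data
i <- 1
seen <- c()

while (TRUE) {
  # Stop at the end of the vector or at the first value >= 10
  if (i > length(x) || x[i] >= 10) break
  seen <- c(seen, x[i])
  i <- i + 1
}

print(seen)
print(i)
```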
Writing one's own functions

> cube <- function(x) {
+ x * x * x
+ }
> cube(4)
[1] 64

• A function takes a list of arguments within ( ... )
• To return a value, make the expression the last statement in the body (its value is returned), or use return()
• Function calling convention: similar to C
Reading data from files
• Large data sets are better loaded through the file input interface in R
• Reading a table of data can be done using the read.table() command:
– a <- read.table("a.txt")
• The values are read into R as an object of type data frame (a sort of matrix in which different columns can have different types). Various options can specify reading or discarding of headers and other metadata.
• A more primitive but universal file-reading function exists, called scan()
– b = scan("input.dat");
• scan() returns a vector of the data read
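The sketch below contrasts read.table() and scan() on the same small file. The file contents and the temporary path are made up for illustration; header = TRUE is one of the options that controls how the first line is treated.

```r
# Write a tiny whitespace-separated file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("name price", "cars 8", "trucks 10"), path)

# read.table() returns a data frame; header = TRUE uses the first line as column names
a <- read.table(path, header = TRUE)
print(a)

# scan() returns a flat vector of all fields, here read as character strings
b <- scan(path, what = character(), quiet = TRUE)
print(b)

unlink(path)
```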
Programming in R

• The following slides assume a basic understanding of programming concepts
• For more information, please see chapters 9 and 10 of the R manual: http://cran.r-project.org/doc/manuals/R-intro.html

Additional resources
• Beginning R: An Introduction to Statistical Programming by Larry Pace
• Introduction to R webpage on APSnet: http://www.apsnet.org/edcenter/advanced/topics/ecologyandepidemiologyinr/introductiontor/Pages/default.aspx
• The R Inferno: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
Conditional statements
• Perform different commands in different situations
• if (condition) command_if_true
• Can add else command_if_false to end
• Group multiple commands together with braces {}
• if (cond1) {cmd1; cmd2;} else if (cond2) {cmd3; cmd4;}
• Conditions use relational operators
• ==, !=, <, >, <=, >=
• Do not confuse = (assignment) with == (equality)
• = is a command, == is a question
• Combine conditions with and (&&) and or (||)
• Use & and | for vectors of length > 1 (element-wise)
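The difference between the element-wise and the single-value operators can be sketched as follows. The data and variable names are illustrative.

```r
x <- c(1, 5, 10)

# & works element by element and returns a logical vector
elementwise <- x > 2 & x < 8
print(elementwise)

# && expects single values and is the form to use inside if()
value <- 5
if (value > 2 && value < 8) {
  label <- "in range"
} else {
  label <- "out of range"
}
print(label)

# For a vector of conditions, ifelse() chooses a result per element
labels <- ifelse(elementwise, "in", "out")
print(labels)
```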
Loops
• Most common type of loop is the for loop
• for (x in v) { loop_commands; }
• v is a vector, commands repeat for each value in v
• Variable x becomes each value in v, in order
• Example: adding the numbers 1-10
• total = 0; for (x in 1:10) total = total + x;
• Other type of loop is the while loop
• while (condition) { loop_commands; }
• Condition is identical to if statement
• Commands are repeated until condition is false
• Might execute commands 0 times if already false
• while loops are useful when you don’t know number of iterations
Scripting in R
• A script is a sequence of R commands that perform some common task
– E.g., defining a specific function, performing some analysis routine, etc.
• Save R commands in a plain text file
– Usually have extension of .R
• Run scripts with source():
– source("filename.R")
• To save command output to a file, use sink():
– sink("output.Rout")
– sink() restores output to console
– Can be used with or outside of a script
Lists
• Objects containing an ordered collection of objects
• Components do not have to be of same type
• Use list() to create a list:
– a <- list("hello",c(4,2,1),"class");
• Components can be named:
– a <- list(string1="hello",num=c(4,2,1),string2="class")
• Use [[position#]] or $name to access list elements
– E.g., a[[2]] and a$num are equivalent
• Running the length() command on a list gives the number of top-level components
Sampling and Creating simulated data
• Create data from a specific distribution
• rnorm(5,0,1)
• ## [1] -0.2986 -0.3762 0.7070 0.1605 0.7018
• rnorm(10,6,2)
• ## [1] 7.530 7.741 5.947 4.622 8.804 8.149 10.949 3.339 6.213
• ## [10] 6.554
• Sample from existing data
• sample(1:10)
• sample(1:10,replace = TRUE)
• sd
Writing your own functions
• Functions in R are defined by an assignment like:
– a <- function(arg1,arg2) { function_commands; }
• Functions are R objects of type "function"
• Functions can be written in C/FORTRAN and called via .C() or .Fortran()
• Arguments may have default values
– Example: my.pow <- function(base, pow = 2) {return(base^pow);}
– Arguments with default values become optional, should usually appear at end of argument list (though not required)
• Arguments are untyped
– Allows multipurpose functions that depend on argument type
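The my.pow example from this slide can be exercised as follows. The calls show the default being used, overridden by position, and overridden by name; because arguments are untyped, the same function also works on a vector.

```r
my.pow <- function(base, pow = 2) {
  return(base^pow)
}

squared <- my.pow(3)                 # pow falls back to its default of 2
cubed   <- my.pow(3, 3)              # default overridden by position
named   <- my.pow(pow = 3, base = 2) # arguments given by name, in any order
vec     <- my.pow(1:3)               # works element-wise on a vector

print(c(squared, cubed, named))
print(vec)
```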
Writing your own functions
• Functions in R are defined by an assignment like:
– a <- function(arg1,arg2) { function_commands; }
• Example:
myMean <- function(y1) {
result <- sum(y1)/length(y1) # avoid naming a local variable after the built-in mean()
return(result)
}
• testVec<- rnorm(50,20,4)
• mean(testVec)
• ## [1] 19.72
• myMean(testVec)
• ## [1] 19.72
Applying a Function

> lapply(1:2, cube)
[[1]]
[1] 1

[[2]]
[1] 8

• Apply the cube function to a vector
• Applies the function to each and every element
• lapply returns a list; sapply simplifies the result to a vector where possible

> sapply(1:3, cube)
[1] 1 8 27
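The return types can be compared side by side. The sketch below also shows vapply(), a base R relative of sapply() that checks each result against a declared template; its use here is an addition for illustration, not something the slides cover.

```r
cube <- function(x) x * x * x
nums <- c(1, 2, 3)

as_list   <- lapply(nums, cube)   # always a list
as_vector <- sapply(nums, cube)   # simplified to a numeric vector
checked   <- vapply(nums, cube, FUN.VALUE = numeric(1))  # each result must be one number

print(class(as_list))
print(as_vector)
print(checked)
```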
lapply, sapply, apply
• When the same or similar tasks need to be performed multiple
times for all elements of a list or for all columns of an array.
• May be easier and faster than “for” loops
• lapply(li, function )
• To each element of the list li, the function function is applied.
• The result is a list whose elements are the individual function
results.
> li = list("klaus","martin","georg")
> lapply(li, toupper)
[[1]]
[1] "KLAUS"
[[2]]
[1] "MARTIN"
[[3]]
[1] "GEORG"
lapply, sapply, apply
sapply(li, fct)
Like lapply, but tries to simplify the result by converting it into a vector or array of appropriate size

> li = list("klaus","martin","georg")
> sapply(li, toupper)
[1] "KLAUS" "MARTIN" "GEORG"

> fct = function(x) { return(c(x, x*x, x*x*x)) }


> sapply(1:5, fct)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 1 4 9 16 25
[3,] 1 8 27 64 125
apply
apply( arr, margin, fct )
Apply the function fct along some dimensions of the array arr, according
to margin, and return a vector or array of the appropriate size.
> x
[,1] [,2] [,3]
[1,] 5 7 0
[2,] 7 9 8
[3,] 4 6 7
[4,] 6 3 5
> apply(x, 1, sum)
[1] 12 24 17 14
> apply(x, 2, sum)
[1] 22 25 20
Named arguments

> fun <- function(x=4, y=3) { x - y }
> fun()
[1] 1
> fun(4,3)
[1] 1
> fun(y=4, x=3)
[1] -1

• Possible to specify default values in the function declaration
• If an argument is not specified, the default value is used
• We can also specify the values of the arguments by name (last call)
Scoping in R

> balance <- 100 # initial value, implied by the result below
> deposit <- function(amt) balance + amt
> withdraw <- function(amt) balance - amt
> balance <- withdraw(10)
> balance <- deposit(20)
> balance
[1] 110

• Scope of variables in R
– Function arguments (valid only inside the function)
– Local variables (valid only inside the function)
– Global variables (balance)
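The functions on this slide only read the global balance. The sketch below, with hypothetical function names, contrasts a local assignment inside a function with the <<- operator, which assigns to a variable in the enclosing environment.

```r
balance <- 100

# <- inside a function creates a local variable; the outer balance is untouched
deposit_local <- function(amt) {
  balance <- balance + amt
  balance
}

# <<- assigns to the existing balance in the enclosing environment
deposit_global <- function(amt) {
  balance <<- balance + amt
  invisible(balance)
}

returned     <- deposit_local(20)
after_local  <- balance
deposit_global(20)
after_global <- balance

print(c(returned, after_local, after_global))
```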
Functional Programming: Closures

> exponent <- function(n) {
+ power <- function(x) {
+ x ^ n
+ }
+ }
> square <- exponent(2)
> square(4)
[1] 16

• A function with pre-specified data is called a closure
• exponent returns a function power (with n = 2)
Example: Numerical Integration
Source: http://adv-r.had.co.nz/Functional-programming.html

> composite <- function(f, a, b, n = 10, rule) { # function for numerical integration
+ points <- seq(a, b, length = n + 1)
+
+ area <- 0
+ for (i in seq_len(n)) {
+ area <- area + rule(f, points[i], points[i + 1])
+ }
+
+ area
+ }
> midpoint <- function(f, a, b) { # midpoint rule
+ (b - a) * f((a + b) / 2)
+ }
> composite(sin, 0, pi, n = 1000, rule = midpoint) # function passed as an argument
[1] 2.00000
Plotting a Function

• A basic 2D plot:
vec1 <- cube(seq(1,100,10))
vec2 <- cube(seq(5,100,10))
plot(vec1, type="o", col="blue", ylim=c(0,3e5)) # type="o": overplotted points and lines
title(main="Plot of Cubes", col.main="red")

• To add a line to the same plot:
lines(vec2, type="o", lty=2, pch=22, col="red") # lty=2: dashed line; pch=22: square marker

• To add a legend:
legend(1, max(vec1), c("vec1","vec2"), cex=0.8, col=c("blue","red"), pch=21:22, lty=1:2)
Plotting: Linear Regression

library("MASS")
data(cats) # load data
plot(cats$Bwt, cats$Hwt) # scatter plot of cats' body weight vs heart weight
M <- lm(Hwt ~ Bwt, data=cats) # fit a linear model
regmodel <- predict(M) # predict values using this model
plot(cats$Bwt, cats$Hwt, pch = 16, cex = 1.3, col = "blue", main = "Heart weight plotted against body weight of cats", xlab = "Body weight", ylab = "Heart weight") # scatter plot
abline(M) # plot the regression line
Creating 3-D plots

• Packages such as plot3D and rgl contain useful 3D plotting options
• plot3d and persp3d (rgl) and scatter3D and surf3D (plot3D) are some of the commonly used plots
• plot3d is from package rgl. It allows creating interactive 3D plots that can be rotated using the mouse.
plot3d(x, y, z, col="red", size=3)
Creating 3-D plots: surf3D

surf3D (package: plot3D) allows us to create surface plots:

# source: http://blog.revolutionanalytics.com/2014/02/3d-plots-in-r.html
library('ggplot2')
library(plot3D)
par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
R <- 3; r <- 2
x <- seq(0, 2*pi,length.out=50)
y <- seq(0, pi,length.out=50)
M <- mesh(x, y)
alpha <- M$x; beta <- M$y
surf3D(x = (R + r*cos(alpha)) * cos(beta),
y = (R + r*cos(alpha)) * sin(beta),
z = r * sin(alpha),
colkey=FALSE,
bty="b2",
main="Half of a Torus")
Creating 3-D plots: persp3d

persp3d (package: rgl) allows us to create surface plots:

library(rgl)
xdim <- 16
newmap <- matrix(rnorm(xdim * xdim, 1, .2), xdim, xdim) # z values must form a matrix
jet.colors <- colorRampPalette( c("yellow", "red") )
pal <- jet.colors(100)
col.ind <- cut(newmap,100) # colour indices of each point
persp3d(seq(1:xdim),seq(1:xdim),newmap,shade=TRUE,
type="wire", col=pal[col.ind],xlab="",ylab="",zlab="",
cex.axis=1.5,xtics="",aspect=2,zlim=c(0,5))
Graphical display and plotting

• Most common plotting function is plot()


• plot(x,y) plots y vs x
• plot(x) plots x vs 1:length(x)
• plot() has many options for labels, colors, symbol, size, etc.
• Check help with ?plot
• Use points(), lines(), or text() to add to an existing plot
• Use x11() to start a new output window
• Save plots with png(), jpeg(), tiff(), or bmp()
R Packages
• R functions and datasets are organized into packages
• Packages base and stats include many of the built-in functions in R
• CRAN provides thousands of packages contributed by R users
• Package contents are only available when loaded
• Load a package with library(pkgname)
• Packages must be installed before they can be loaded
• Use library() to see installed packages
• Use install.packages("pkgname") to install a package and update.packages() to update installed packages
• Can also run R CMD INSTALL pkgname.tar.gz from command line
if you have downloaded package source
Exploring the iris data (cont.)
• Plot Petal.Length vs. Petal.Width:
– plot (iris[ , 3], iris[ , 4]);
– example(plot)
• Exercise: create a plot similar to the figure from Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (figure not reproduced here)
Decision Trees in R
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
– Yes: NO
– No: MarSt
  – Single, Divorced: TaxInc
    – < 80K: NO
    – > 80K: YES
  – Married: NO


Another Example of Decision Tree

Using the same training data as in the previous example:
MarSt
– Married: NO
– Single, Divorced: Refund
  – Yes: NO
  – No: TaxInc
    – < 80K: NO
    – > 80K: YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

• Induction: the tree induction algorithm learns a model (decision tree) from the training set
• Deduction: the model is applied to the test set
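Apply Model to Test Data

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches each attribute value:
Refund
– Yes: NO
– No: MarSt
  – Single, Divorced: TaxInc (< 80K: NO, > 80K: YES)
  – Married: NO

The test record follows Refund = No, then MarSt = Married, and reaches the leaf NO.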
Decision Trees
• Used for classifying data by partitioning attribute space
• Tries to find axis-parallel decision boundaries for specified optimality criteria
• Leaf nodes contain class labels, representing classification decisions
• Keeps splitting nodes based on split criterion, such as GINI index, information gain or entropy
• Pruning necessary to avoid overfitting
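To make the split criterion concrete, the sketch below computes the Gini index for the Cheat labels of the ten training records shown earlier, and the weighted Gini after splitting on Refund. The helper name gini is chosen here for illustration; tree packages compute this internally and may differ in detail.

```r
# Gini index of a set of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Cheat and Refund columns of the training data (Tid 1 to 10)
cheat  <- c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes")
refund <- c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No")

# Impurity before splitting
parent <- gini(cheat)

# Impurity after splitting on Refund: size-weighted average over the groups
groups <- split(cheat, refund)
weighted <- sum(sapply(groups, function(g) length(g) / length(cheat) * gini(g)))

print(parent)
print(weighted)
```

A lower weighted value than the parent value indicates that the split makes the groups purer.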
Decision Trees in R

mydata<-data.frame(iris)
attach(mydata)

library(rpart)
model<-rpart(Species ~ Sepal.Length +
Sepal.Width + Petal.Length +
Petal.Width,
data=mydata,
method="class")
plot(model)
text(model,use.n=TRUE,all=TRUE,cex=0.8)
Decision Trees in R

library(tree)
model1<-tree(Species ~ Sepal.Length
+ Sepal.Width + Petal.Length +
Petal.Width,
data=mydata,
method="class",
split="gini")
plot(model1)
text(model1,all=TRUE,cex=0.6)
Decision Trees in R

library(party)
model2<-
ctree(Species ~
Sepal.Length +
Sepal.Width +
Petal.Length +
Petal.Width,
data=mydata)
plot(model2)
Controlling number of nodes

This is just an example. You can come up with better or more efficient methods!

library(tree)
mydata<-data.frame(iris)
attach(mydata)
model1<-tree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=mydata,
method="class",
control = tree.control(nobs = 150, mincut = 10))
Controlling number of nodes
This is just an example. You can come up with better or more efficient methods!

model2<-ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = mydata,
controls = ctree_control(maxdepth=2))
plot(model2)

Note that setting the maximum depth to 2
http://data.princeton.edu/R/linearmodels.h
Linear Models in R
• abline() – adds one or more straight lines to a plot
• lm() – function to fit linear regression model
x1<-c(1:5,1:3)
x2<-c(2,2,2,3,6,7,5,1)
plot(x1,x2)
abline(lm(x2~x1))
title('Regression of x2 on x1')
s<-lm(x2~x1)
lm(x1~x2)
abline(1,2) # line with intercept 1 and slope 2
Regression analysis using R

id <- seq(1:18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

#Fit linear regression model

reg <- lm(chol ~ age)


summary(reg)
ANOVA result

> anova(reg)
Analysis of Variance Table

Response: chol
Df Sum Sq Mean Sq F value Pr(>F)
age 1 10.4944 10.4944 114.57 1.058e-08 ***
Residuals 16 1.4656 0.0916
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results of R analysis
> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
Min 1Q Median 3Q Max
-0.40729 -0.24133 -0.04522 0.17939 0.63040

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218 0.221466 4.918 0.000154 ***
age 0.057788 0.005399 10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom


Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08
Diagnostics: influential data
par(mfrow=c(2,2))
plot(reg)

[Figure: diagnostic plots for reg in a 2x2 grid: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 6, 8, and 17 are labelled.]

A non-linear illustration: BMI and sexual attractiveness
– Study on 44 university students
– Measure body mass index (BMI)
– Sexual attractiveness (SA) score

id <- seq(1:44)
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00,
14.00, 14.00, 14.80, 15.00, 15.00, 15.50, 16.00,
16.50, 17.00, 17.00, 18.00, 18.00, 19.00, 19.00,
20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00,
24.00, 24.50, 25.00, 25.00, 26.00, 26.00, 26.50,
28.00, 29.00, 31.00, 32.00, 33.00, 34.00, 35.50,
36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5,
3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3,
6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7,
3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1,
2.1, 2.0, 1.8, 1.7)
Linear regression analysis of BMI and SA
reg <- lm (sa ~ bmi)
summary(reg)

Residuals:
Min 1Q Median 3Q Max
-2.54204 -0.97584 0.05082 1.16160 2.70856

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.92512 0.64489 7.637 1.81e-09 ***
bmi -0.05967 0.02862 -2.084 0.0432 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom


Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323
BMI and SA: analysis of residuals
plot(reg)

[Figure: diagnostic plots for reg: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance). Observations 10, 20, and 21 are labelled.]

BMI and SA: a simple plot
par(mfrow=c(1,1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch=16)
abline(reg)

[Figure: scatter plot of sa against bmi with the fitted regression line.]
Reference

• Dr. Nagiza F. Samatova


• Arko Barman