R & statistics tutorial

Andre Garenne

May 17, 2017

Contents

1 Introduction
1.1 Downloading and installing R
1.2 Running R
1.3 Package installation
1.4 Package activation
1.5 R help

2 R language basics
2.1 Variable types in R
2.1.1 Numerical variables creation and processing
2.1.2 String variables
2.1.3 Logical variables
2.1.4 Vectors, matrices and arrays
2.1.5 Factor variables
2.1.6 Lists and data frames
2.1.7 Class, type, mode and storage.mode in R
2.2 Manipulation and processing of R variables
2.2.1 Operators
2.2.2 Conditional data selection
2.2.3 R controlled sequences generation
2.2.4 R random series generation
2.3 Control structures in R (flow of control)
2.3.1 General considerations on coding in general and R coding in particular
2.3.2 What is a code block ?
2.3.3 Tests
2.3.4 Functions
2.3.5 Loops
2.3.6 User input and output functions
2.4 Input and output files
2.5 R graphic libraries
2.5.1 How to create and manage graphic windows?
2.5.2 Line and point charts
2.5.3 Histogram
2.5.4 Whisker box
2.5.5 Pie chart
2.5.6 Bar charts
2.5.7 The ggplot2 package

3 Applications
3.1 R variables
3.1.1 Variable creation
3.1.2 Variable manipulation and selection
3.2 R control structures
3.2.1 "For" loops examples
3.2.2 "While" loops examples
3.2.3 Functions examples

1 Introduction

1.1 Downloading and installing R

According to its authors, "R is a language and environment for statistical computing and graphics". It can
be downloaded from http://cran.univ-lyon1.fr/, for example, although many other mirrors exist. Binaries are
available for MacOSX, Linux and Windows, and the sources can also be recompiled on other OS (footnote 1).
Installation from the binaries is usually very simple and is also well documented for Linux users. In this
tutorial we will use the "standard" R graphic environment (R-GUI) (footnote 2).

Footnote 1: Operating system.
Footnote 2: Friendlier R IDEs also exist, such as RKWard (https://rkward.kde.org/) or RStudio (https://www.rstudio.com/).

1.2 Running R

The following screenshots and examples originate from the MacOSX version of R, but the look and feel are
quite similar in the other versions. R can be used from the command line in a terminal window (Figure 1).
The standard GUI version provides some practical functionalities such as a script file editor and easy access
to package or working directory management, as shown in Figure 2. All these functionalities are available
from the top menu bar.

Figure 1: R command line example.

Figure 2: R-GUI screenshot.

1.3 Package installation

The standard distribution of R comes with numerous useful statistical and graphic packages. Nevertheless,
it rapidly becomes necessary to import or even build our own packages according to our specific needs. The
R-GUI provides a simple method to do so. The required packages first have to be installed (i.e. downloaded
from the site and made available to R): click on the Packages & Data menu item, choose Package Installer
and get the full package list from the CRAN (footnote 3) binaries. The "Install Dependencies" checkbox has
to be checked in order to avoid partial or dysfunctional installations. The required package can then be
selected and installed. This procedure is summarized in Figure 3. The other way to install packages from
remote sites is to use the install.packages() command, but it will be left aside for the moment:

install.packages("agricolae")

Figure 3: R-GUI package installation.

Footnote 3: Comprehensive R Archive Network.

1.4 Package activation

Once a package has been installed, it has to be activated in the current R session (or at the beginning of
a script), otherwise its functions are not available. This can be done via the main menu by clicking on the
Packages & Data menu item and choosing Package Manager (Figure 4).

Figure 4: R-GUI package activation.

The packages required for the current R session are selected by ticking the checkboxes (note that when a
package has just been installed, the list has to be refreshed, otherwise the package is not visible). The
alternative to the R-GUI activation method consists in calling the library() function like this:

library(agricolae) # loads the agricolae package

This latter method is often necessary, especially at the beginning of a script in which "non standard"
functions are needed.
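
Installation and activation are often combined at the top of a script. The following is only a minimal sketch and is not part of the original tutorial: the helper name loadOrInstall is hypothetical, and a missing package can only be installed if an internet connection and a CRAN mirror are available.

```r
# Hypothetical helper (name chosen for this sketch): installs a package
# only if it is missing, then loads it into the current session.
loadOrInstall <- function(pkg) {
  # requireNamespace() returns FALSE instead of failing when the package is absent
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # character.only = TRUE tells library() that pkg contains the name as a string
  library(pkg, character.only = TRUE)
}

loadOrInstall("stats") # "stats" ships with R, so nothing is downloaded here
```

In a real script the call would be, for instance, loadOrInstall("agricolae").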

1.5 R help

The R help can be accessed from the main menu (Help item), but it is useful to know some basic commands:

help.start() # opens the general R help menu

If the name of a function is already known, the help() and "?" commands can also be used:

?t.test # requests help on Student's t-test
help(t.test)

These commands open a new window which contains the function description and examples of use, as shown in
Figure 5. If the exact spelling of the function is not known, the "??" operator can be used.

Figure 5: R help window.

For instance, the following command opens a new window listing all the help items that match the keyword
(Figure 6).

??student

Figure 6: R help search with the "student" keyword.
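
Two other standard functions can help when the exact name of a function is not known. This short sketch is an addition to the tutorial: apropos() searches the names of the loaded objects and args() prints the arguments of a function without opening a help page.

```r
# apropos() lists the loaded objects whose name contains the given pattern
hits <- apropos("test")
head(hits)

# help.search() is the function behind the "??" operator
# help.search("student")   # equivalent to ??student (opens a results window)

# args() shows the arguments of a function, here the standard deviation sd()
args(sd)
```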

2 R language basics

2.1 Variable types in R

2.1.1 Numerical variables creation and processing

The simplest way to store real numbers in R is to assign a value to a variable:

a <- 1.5

The "<-" symbol (or "=", as in the majority of computer languages) is the assignment operator. In this case
it dynamically creates a numeric variable named "a" with 1.5 as its initial value. If a variable has to be
created without an initial value, the NULL keyword can be used. A new assignment to a variable simply
replaces the previous value. The type of a variable can be obtained with the typeof() command.

> a <- 1.5
> a
[1] 1.5
> typeof(a)
[1] "double"
> a <- NULL
> typeof(a)
[1] "NULL"
> a <- 10.5
> typeof(a)
[1] "double"
> a*2
[1] 21

The list of current variables can be obtained using the ls() command:

> a <- 1.5
> b <- 2.5
> n <- NULL
> d <- 4.5
> ls()
[1] "a" "b" "d" "n"
> remove(n) # removes (deletes) the variable n
> rm(d) # rm() is the short form: deletes the variable d
> ls()
[1] "a" "b"

Note that if the parentheses are omitted, the shell returns the R source code of the function. For example, if
"ls" alone is entered, it returns the result shown in Figure 7. The existence of a variable can be
tested using the exists() function, whose first argument is the name of the variable to be tested, given as a
character string:

> exists("fooBar")
[1] FALSE
> fooBar <- 1
> exists("fooBar")
[1] TRUE

As in many other modern languages, names are case sensitive. Moreover, it is recommended to avoid using
reserved keywords of the language as variable names. For example, the keyword "for" cannot be used as a
variable name, whereas function names can, which can lead to confusing source code:

> for <- 12
Error: unexpected assignment in "for <-"
> For <- 12
> For
[1] 12
> For = For / 2
> For
[1] 6
> print <- 12
> print(print)
[1] 12

Figure 7: Sample of R source code of the ls function.

Fundamental mathematical operations can be applied to variables and to constant values:

> a <- 1.2
> b <- 2.45
> a*b
[1] 2.94
> a/b
[1] 0.4897959
> a*2
[1] 2.4
> a^2 # exponent
[1] 1.44
> sqrt(2.45) # square root
[1] 1.565248
> # etc ...

2.1.2 String variables

The dedicated type of variable to store text is the string (type "character"). Strings can be created using
single ' or double " quotes, or alternatively by converting other types of values to strings:

> s1 <- "hello"
> s2 <- "machin"
> print(s1)
[1] "hello"
> typeof(s1)
[1] "character"
> s1*2 # a string cannot be multiplied by a numeric value
Error in s1 * 2 : non-numeric argument to binary operator
> s3 <- 123
> s4 <- as.character(s3) # converts a numerical value to a string (character)
> s3
[1] 123
> s4
[1] "123"
> s3*2
[1] 246
> s4*2
Error in s4 * 2 : non-numeric argument to binary operator
> as.numeric(s4)*2 # converts a string to a numerical value
[1] 246
> s5 <- paste(s1, s2) # concatenates several elements into a string
> s5
[1] "hello machin"

A string object can be manipulated using some dedicated functions (footnote 4):

> s6 <- "Hello Machin"
> toupper(s6)
[1] "HELLO MACHIN"
> tolower(s6)
[1] "hello machin"
> substring(s6, first=3, last=7)
[1] "llo M"
> v1 <- strsplit(s6, "") # an empty separator splits the string into single characters
> v1
[[1]] # a list object is returned
[1] "H" "e" "l" "l" "o" " " "M" "a" "c" "h" "i" "n"
> v2 <- unlist(v1)
> v2
[1] "H" "e" "l" "l" "o" " " "M" "a" "c" "h" "i" "n"
> nchar("abc")
[1] 3

Footnote 4: http://gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf

2.1.3 Logical variables

This data type can only take 2 values: FALSE or TRUE. Most of the time, booleans are used to introduce
conditional processing using tests and operators.

> b1 <- FALSE
> b2 <- TRUE
> as.character(b1)
[1] "FALSE"
> as.logical("TRUE")
[1] TRUE
> as.logical(1) # implicitly, 1 stands for TRUE whereas 0 stands for FALSE
[1] TRUE
> as.logical(0)
[1] FALSE

The use of boolean values will be described in detail in the next chapters.

2.1.4 Vectors, matrices and arrays

Basically, a vector is a variable which contains several values (a matrix behaves mostly like a vector but with
an additional dimension). The c() command combines a series of values and returns a vector (note that
a<-1 is equivalent to a<-c(1)). Each element of a vector can be accessed or modified using its index number.
This index ranges from 1 to n, n being the number of elements of the vector, whereas in most other
computer languages it ranges from 0 to n-1.

> vec_1 <- c(1,2,3)
> vec_1
[1] 1 2 3
> vec_1 <- c(vec_1,4)
> vec_1
[1] 1 2 3 4
> sum(vec_1)
[1] 10
> mean(vec_1) # function application example on a vector
[1] 2.5
> sd(vec_1) # standard deviation
[1] 1.290994
> length(vec_1) # returns the number of elements
[1] 4
> vec_1+1
[1] 2 3 4 5
> vec_1*2
[1] 2 4 6 8
> vec_1^0.5
[1] 1.000000 1.414214 1.732051 2.000000
> # index values range from 1 to the vector length, not from 0 to the vector length - 1
> vec_1[1] # returns the content of the first position
[1] 1
> vec_1[5] # beyond the last element, NA is returned
[1] NA
> vec_1[2:3] # slice
[1] 2 3
> vec_1[-1] # returns the vector without its first value
[1] 2 3 4
> vec_1 # the content of the variable is unchanged
[1] 1 2 3 4
> vec_1[-4]
[1] 1 2 3

Vector elements can be modified:

> vec_1
[1] 1 2 3 4
> vec_1[2] <- 10
> vec_1
[1]  1 10  3  4
> vec_1[2:3] <- -100
> vec_1
[1]    1 -100 -100    4
> vec_1 <- vec_1[-4] # reassignment of a subset of vec_1
> vec_1
[1]    1 -100 -100
> is.vector(vec_1)
[1] TRUE
> is.vector(1)
[1] TRUE

Vectors and matrices can contain only one type of value:

> v1 <- c(123)
> v1
[1] 123 # the initial vector contains a real value
> v1 <- c(v1, "A") # if a string is added
> v1
[1] "123" "A" # the variable becomes a vector of strings
> v1 <- c(v1, TRUE) # and if a logical value is added
> v1
[1] "123" "A" "TRUE" # the variable remains a vector of strings because every type can be converted to characters
> v2 <- c(123)
> v2 <- c(v2, TRUE)
> v2
[1] 123 1 # here the boolean is converted to a double (TRUE -> 1)

A vector has a field which names its elements; it is accessed with names() and is empty (NULL) by default.
This field can be defined and used to access the vector elements:

> v1 <- c(3,1,2)
> names(v1)
NULL
> names(v1) <- c("a","b","c")
> v1
a b c
3 1 2
> v1["a"]
a
3
> v1["a"] <- 12
> v1
 a  b  c
12  1  2
> sort(v1) # example of a simple sorting function applied to the vector
 b  c  a
 1  2 12

Values are inserted into a vector with the append() function; its third argument is the position after which
the new values are inserted:

> a <- 1:6
> a
[1] 1 2 3 4 5 6
> a <- append(a, c(55,55), 3)
> a
[1]  1  2  3 55 55  4  5  6

Matrices can be created in several ways:

> m1 <- c(1,2,3,4,5,6) # starts as a vector
> m1
[1] 1 2 3 4 5 6
> dim(m1) # dim() returns the sizes of the dimensions of the variable
NULL # a vector has neither rows nor columns, therefore the result is NULL
> dim(m1) <- c(2,3) # sets the dimensions: 2 rows and 3 columns
> m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrices can also be created this way:

> 1:6 # automatically generates a sequence
[1] 1 2 3 4 5 6
> m2 <- matrix(1:6, ncol=2) # provides the content and specifies the number of columns
> m2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> diag(m2) # returns the diagonal starting from the top-left value
[1] 1 5
> m2 <- matrix(1:6, nrow=2) # provides the content and specifies the number of rows
> m2
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Or this way:

> v1 <- 1:3
> v2 <- 4:6
> m3 <- rbind(v1, v2) # binds 2 vectors considered as rows
> m3
   [,1] [,2] [,3]
v1    1    2    3
v2    4    5    6
> t(m3) # transpose function
     v1 v2
[1,]  1  4
[2,]  2  5
[3,]  3  6
> m4 <- cbind(v1, v2) # binds 2 vectors considered as columns
> m4
     v1 v2
[1,]  1  4
[2,]  2  5
[3,]  3  6

The content of a matrix can be accessed and modified:

> m1[1,1] # access to the matrix elements (row, column)
[1] 1
> m1[2,3]
[1] 6
> m1[3,2]
Error in m1[3, 2] : subscript out of bounds # this element does not exist in m1
> m1[,1] # first column
[1] 1 2
> m1[1,] # first row
[1] 1 3 5
> m1[2,2] <- 123
> m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2  123    6
> m1[,3] <- -10
> m1
     [,1] [,2] [,3]
[1,]    1    3  -10
[2,]    2  123  -10

Like vectors, matrices contain only one type of data and their rows and columns can also be labelled:

> m1
     [,1] [,2] [,3]
[1,]    1    3  -10
[2,]    2  123  -10
> colnames(m1)
NULL
> colnames(m1) <- c("A","B","C")
> rownames(m1) <- c("M","F")
> m1
  A   B   C
M 1   3 -10
F 2 123 -10
> summary(m1) # a major command in R
       A              B             C
 Min.   :1.00   Min.   :  3   Min.   :-10
 1st Qu.:1.25   1st Qu.: 33   1st Qu.:-10
 Median :1.50   Median : 63   Median :-10
 Mean   :1.50   Mean   : 63   Mean   :-10
 3rd Qu.:1.75   3rd Qu.: 93   3rd Qu.:-10
 Max.   :2.00   Max.   :123   Max.   :-10
> m1[,"A"]
M F
1 2
> m1["F","A"]
[1] 2
> m1[2,1] <- 1000
> m1["F","A"]
[1] 1000

Both vectors and matrices can be used by R functions. For instance, the sample() function randomly reorders
a series (the result changes at each call):

> v1 <- 1:6
> sample(v1)
[1] 3 1 6 2 4 5
> sample(v1)
[1] 2 6 5 1 4 3
> sample(v1)
[1] 4 1 6 3 5 2
> sample(v1)
[1] 2 4 5 6 1 3
> sample(v1)
[1] 2 3 6 1 4 5

Arrays are similar to matrices but with more than 2 dimensions. More functionalities of dimensioned variables
will be discussed in section 2.2.
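
Since the tutorial gives no array example, here is a minimal sketch (an addition, with arbitrary values) that builds a 3-dimensional array with the standard array() function and accesses it with one index per dimension:

```r
# A 3-dimensional array: 2 rows, 3 columns and 2 "layers", filled column by column
arr <- array(1:12, dim = c(2, 3, 2))

dim(arr)       # sizes of the 3 dimensions
arr[2, 3, 1]   # row 2, column 3 of the first layer
arr[, , 2]     # the whole second layer, returned as a 2 x 3 matrix
```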

2.1.5 Factor variables

In R, factors are variables which can take only a limited number of different values. Roughly speaking,
statistical variables can be quantitative (continuous or discrete) or qualitative (ordinal or nominal), and
the factor is the best choice to create categories in R. Factor objects are explicitly constructed with
the factor() command:

> f1 <- factor("cat_1")
> f1
[1] cat_1
Levels: cat_1 # levels: list of the different values identified in the variable f1

Along with the factor definition comes the notion of "level". The levels of a variable are the different
possible values of the factor.

> data1 <- c(3,2,3,3,4,5,3,2,1,2,2,2,4,5,6)
> factor(data1)
[1] 3 2 3 3 4 5 3 2 1 2 2 2 4 5 6 # numerical values are turned into factor type values
Levels: 1 2 3 4 5 6 # lists the distinct values found in the data
> data2 <- c("A","A","A","A","B","B","B","C","D","D")
> levels(data2)
NULL # data2 is a vector of strings, therefore it has no "level"
> levels(factor(data2)) # but if data2 is "factorized" ...
[1] "A" "B" "C" "D"
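
The distinction between nominal and ordinal variables mentioned above can also be expressed with factor(). The following sketch is an addition to the tutorial and uses made-up data; the levels and ordered arguments are standard arguments of factor().

```r
sizes <- c("small", "large", "medium", "small", "large", "small") # made-up data

# Nominal factor: the levels are sorted alphabetically by default
f_nominal <- factor(sizes)
levels(f_nominal)

# Ordinal factor: the order of the levels is given explicitly
f_ordinal <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
f_ordinal[1] < f_ordinal[2]   # ordered factors can be compared (small < large)

# table() counts the occurrences of each level
counts <- table(f_ordinal)
counts
```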

2.1.6 Lists and data frames

Compared to vectors or matrices, these 2 container classes add flexibility and powerful functionalities
(footnote 5). The main difference between a data.frame and a list is that all elements in a data.frame have
an equal length. In this tutorial we will mainly focus on the use of the data.frame rather than on the list
class. Indeed, the data.frame allows us to deal with most of the data formats we will use. To create a
data.frame, the first possibility is to use the class constructor. In the following example we build the
data shown in table 2.1.6.

Footnote 5: http://stackoverflow.com/questions/15901224/what-is-difference-between-dataframe-and-list-in-r

          mark1  mark2
student1  11     16
student2  13     10
student3  10.5   5.5

> name <- c("student1","student2","student3") # first column
> marks1 <- c(11,13,10.5) # second column
> marks2 <- c(16,10,5.5) # third column
> res <- data.frame(name, marks1, marks2) # the variable names implicitly become the column names
> res
      name marks1 marks2
1 student1   11.0   16.0
2 student2   13.0   10.0
3 student3   10.5    5.5

It is also possible to use row (rbind) or column (cbind) associations in order to change the final shape.

> res1 <- data.frame(cbind(name, marks1, marks2))
> res1
      name marks1 marks2
1 student1     11     16
2 student2     13     10
3 student3   10.5    5.5
> res2 <- data.frame(rbind(name, marks1, marks2))
> res2
             X1       X2       X3
name   student1 student2 student3
marks1       11       13     10.5
marks2       16       10      5.5

Nevertheless, cbind() and rbind() first convert all the values to strings, and data.frame() then turns them
into factor objects. This is the behaviour shown here; since R 4.0.0, strings are kept as character vectors
by default, but the numerical values are lost in the same way.

> res[1,2]
[1] 11 # double value
> res1[1,2]
[1] 11 # factor type value
Levels: 10.5 11 13
> res1[,2] <- as.double(as.character(res1[,2])) # cumbersome conversion method
> res1[1,2]
[1] 11 # double value
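
The conversion problem can be avoided or repaired explicitly. The sketch below is an addition to the tutorial: it reuses the same vectors and sets the standard stringsAsFactors argument of data.frame() to FALSE, so that it behaves the same way in old and recent versions of R.

```r
name   <- c("student1", "student2", "student3")
marks1 <- c(11, 13, 10.5)
marks2 <- c(16, 10, 5.5)

# Passing the vectors directly keeps the type of each column
res <- data.frame(name, marks1, marks2, stringsAsFactors = FALSE)
sapply(res, class)

# cbind() builds a character matrix first, so the marks must be converted back
res1 <- data.frame(cbind(name, marks1, marks2), stringsAsFactors = FALSE)
res1$marks1 <- as.numeric(res1$marks1)
res1$marks2 <- as.numeric(res1$marks2)
sapply(res1, class)
```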

Note that when using the rbind command, since the variable names cannot be used to name the
columns, Xn names are arbitrarily assigned. When the data.frame is built, row names are also arbitrarily set
to the row numbers.

> colnames(res)
[1] "name" "marks1" "marks2"
> rownames(res)
[1] "1" "2" "3"
> res
      name marks1 marks2
1 student1   11.0   16.0
2 student2   13.0   10.0
3 student3   10.5    5.5

In table 2.1.6 the rows are labelled with the names of the students and not with numbers. To obtain this, our
res variable has to be modified:

> rownames(res) <- res[,1] # sets the row names using the values of the first column
> res <- res[,-1] # removes the first column
> res
         marks1 marks2
student1   11.0   16.0
student2   13.0   10.0
student3   10.5    5.5
> rownames(res)
[1] "student1" "student2" "student3"

The data.frame object can be addressed using the row and column names, but also using the field
names:

> res["student1",]
         marks1 marks2
student1     11     16
> res[,"marks1"]
[1] 11.0 13.0 10.5
> res$marks1 # shortcut
[1] 11.0 13.0 10.5

2.1.7 Class, type, mode and storage.mode in R

The R documentation provides detailed explanations of the differences between a class, a mode
and a type (footnote 6). Our aim is not to go too deep into such considerations, because the object model
in R is complex and derives from the S and S+ languages. To summarize:

• mode() and class() are very similar and only differ for objects belonging to specific classes, such as
test results

• typeof() returns the "type" of the object from R's point of view (footnote 7).

• storage.mode() returns the storage type of the variable

For instance:

> a <- c(1,2,3)
> b <- matrix(c("a","b","c"))
> c <- data.frame(a)
> d <- list(b)
> e <- t.test(c(1,2))
> f <- function() {}

result        a        b          c           d     e      f
class         numeric  matrix     data.frame  list  htest  function
mode          numeric  character  list        list  list   function
typeof        double   character  list        list  list   closure
storage.mode  double   character  list        list  list   function
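
A table of this kind can be rebuilt programmatically. The following is a sketch added to the tutorial: the objects are stored in a list (named objs here, an arbitrary name) and the four functions are applied to each of them with sapply(). In recent versions of R, class() returns several values for a matrix ("matrix" and "array"), which is why the class names are pasted together.

```r
# The same six objects as above, collected in a named list
objs <- list(
  a = c(1, 2, 3),
  b = matrix(c("a", "b", "c")),
  c = data.frame(a = c(1, 2, 3)),
  d = list(1),
  e = t.test(c(1, 2)),
  f = function() {}
)

# For each object, collect the results of the four functions as strings
info <- sapply(objs, function(x) {
  c(class        = paste(class(x), collapse = "/"),
    mode         = mode(x),
    typeof       = typeof(x),
    storage.mode = storage.mode(x))
})
print(info)
```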

2.2 Manipulation and processing of R variables

2.2.1 Operators

Arithmetic, comparison and logical operators are summarized here:

Arithmetic              Comparison                    Logic
+    addition           <   less than                 !x        logical NOT
-    subtraction        >   greater than              x & y     logical AND (element by element)
*    multiplication     <=  less than or equal to     x && y    logical AND (single values only)
/    division           >=  greater than or equal to  x | y     logical OR (element by element)
^    exponent           ==  equal to                  x || y    logical OR (single values only)
%%   modulo             !=  different from            xor(x,y)  exclusive OR
%/%  integer division

Here are some examples of operations on variables:

> a <- c(1,2,3)
> a^2
[1] 1 4 9
> a>=2
[1] FALSE TRUE TRUE
> a %% 2
[1] 1 0 1

Footnote 6: http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html#Objects
Footnote 7: http://stats.stackexchange.com/questions/3212/mode-class-and-type-of-r-objects
Parentheses make it possible to control operator priorities. What is the result of the following command?

(1>3 || 4==3) && (3%%2 < (2 <= 3%%2))
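
One way to check the answer is to evaluate the expression piece by piece. The sketch below is an addition; the intermediate variable names (left, inner, right) are arbitrary.

```r
left  <- (1 > 3 || 4 == 3)   # FALSE || FALSE gives FALSE
inner <- (2 <= 3 %% 2)       # 3 %% 2 is 1, and 2 <= 1 gives FALSE
right <- (3 %% 2 < inner)    # 1 < FALSE: FALSE is converted to 0, so the result is FALSE
left && right                # FALSE && FALSE gives FALSE
```

Note that "%%" is evaluated before the comparison operators, so 3%%2 needs no parentheses of its own.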

2.2.2 Conditional data selection

There are many ways to extract data from an array. They often rely on the use of indices or ranges, as
previously shown in 2.1.4, but it is also possible to use conditional operators. For instance, suppose that
all the values less than or equal to 10 have to be removed from v1:

> v1 <- 1:20
> v1>10 # tests the items to choose
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> v1 <- v1[v1>10] # keeps only the items corresponding to TRUE
> v1
[1] 11 12 13 14 15 16 17 18 19 20

The ">" operator is used to test all the elements of v1, and the resulting logical series is then used to
select the corresponding v1 values.
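
The same mechanism extends to combined conditions. The following sketch (an addition to the tutorial) uses the element-wise "&" and "!" operators of section 2.2.1 and the standard which() function, which returns positions instead of values:

```r
v1 <- 1:20

# Two conditions combined with the element-wise "&": even values greater than 10
v1[v1 > 10 & v1 %% 2 == 0]

# which() returns the positions (indices) of the TRUE values
which(v1 > 15)

# Negating the condition with "!" keeps the complementary subset
v1[!(v1 > 10)]
```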

2.2.3 R controlled sequences generation

The R environment includes numerous possibilities to automatically generate ordered or regular sequences.
The use of the ":" symbol has already been mentioned. A useful way to generate controlled sequences is the
seq() command:

> 1:5
[1] 1 2 3 4 5
> seq(from=1, to=5, by=2) # the names from, to and by are optional but improve clarity
[1] 1 3 5
> seq(10,1,-3)
[1] 10 7 4 1
> seq(from=1, to=10, length=4) # the total number of values is set by length
[1] 1 4 7 10
> seq(1,10,length=5)
[1] 1.00 3.25 5.50 7.75 10.00

If a given sequence or pattern has to be repeated several times, the rep() function can be used:

> rep(2,3)
[1] 2 2 2
> rep("Hi!",4)
[1] "Hi!" "Hi!" "Hi!" "Hi!"
> rep(c(1,2,4),3)
[1] 1 2 4 1 2 4 1 2 4

A similar function, gl(), produces series of factor levels:

> gl(3,2,labels=c("small","medium","large"))
[1] small small medium medium large large
Levels: small medium large

And if patterns of variable length (incremental or decremental) are needed, the sequence() function
(different from seq()) can solve the problem:

> sequence(1:3)
[1] 1 1 2 1 2 3
> sequence(3:1)
[1] 1 2 3 1 2 1
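
A few related variants are worth knowing. This short sketch is an addition to the tutorial and only uses standard arguments and functions of base R (each and times for rep(), seq_len() and seq_along()):

```r
rep(c(1, 2, 4), each = 2)           # each value is repeated before moving to the next one
rep(c("a", "b"), times = c(3, 1))   # a different number of repetitions for each value
seq_len(4)                          # same as 1:4, but returns an empty vector for seq_len(0)
seq_along(c("x", "y", "z"))         # the indices of an existing vector
```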

2.2.4 R random series generation

Since R is intended for statistics and modeling, many useful (pseudo)random number generators
have been implemented. For instance, uniform, Poisson and normal random series can be created using
the runif(), rpois() and rnorm() functions respectively. These functions take the number of values to
generate as the first argument and the parameters of the distribution as the following arguments:

> series1 <- runif(100,0,1) # 100 values in the interval [0;1[
> series2 <- rnorm(100,10,2) # 100 values following a Gaussian distribution with mean=10 and sd=2
> series3 <- rpois(100,1) # 100 values following a Poisson distribution with lambda=1

For instance, to simulate a series of 1000 rolls of a 6-sided die, either of the following commands can be used:

> s <- floor(runif(1000,1,7)) # uniform numbers are real numbers and we want to obtain discrete 1-6 values
> s <- ceiling(runif(1000,0,6))

These functions (and others) will be used extensively hereafter.
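
The sample() function seen in 2.1.4 offers a more direct way to simulate the die. The sketch below is an addition to the tutorial; the seed value 42 is arbitrary and only serves to make the series reproducible.

```r
set.seed(42)   # arbitrary seed: the same series is produced at each run

# 1000 rolls of a 6-sided die: draws with replacement among the values 1 to 6
rolls <- sample(1:6, size = 1000, replace = TRUE)

table(rolls)   # number of occurrences of each face
mean(rolls)    # should be close to the expected value 3.5
```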

2.3 Control structures in R (flow of control)

2.3.1 General considerations on coding in general and R coding in particular

A part of this tutorial is dedicated to script development. Most of the time, a script is built as a short
list of commands aiming at processing, analyzing and plotting data. The script itself is a text file whose
name ends with ".r". This suffix is not mandatory but is strongly recommended to avoid confusion with other
file types. A script can be written using the R-GUI or any kind of text editor, provided that the file is
saved as plain text. Once the file containing the script is saved in a given folder, the current R working
directory has to be set to this folder. There are 2 ways to set a working directory. The first makes use of
the R-GUI: in the main menu, click on the Misc/Change working directory item and then select the folder in
which the scripts are created. The other makes use of R commands:

> getwd() # what is the current working directory?
[1] "/Users/sam"
> setwd("/Users/sam/Desktop/") # sets the new working directory
> getwd() # and checks it
[1] "/Users/sam/Desktop"

Once the working directory is set, the scripts can be run with the source() command. An example is
given in Figure 8.

Figure 8: my first R script.

This script can be run this way:

> source("myFirstRScript.R")
[1] "Hello I am Pointless"
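
Since the screenshot of Figure 8 is not reproduced here, the following sketch shows what such a script could contain. It is an assumption based on the output above, not the author's file: a one-line script is written into R's temporary folder and then run with source(), which also accepts a full path.

```r
# Write a one-line script (assumed content) into the temporary folder of the session;
# a real script would simply be typed in an editor and saved
scriptFile <- file.path(tempdir(), "myFirstRScript.R")
writeLines('print("Hello I am Pointless")', scriptFile)

# source() accepts a full path, so the working directory does not have to be changed
source(scriptFile)
```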

To write a script, several considerations have to be taken into account. To start with, the first lines of
the script should look like this:

# "fileName.R"
# last modification date
# *** EXPLICIT TITLE ***
# Author
# Description:
Beginning of the code ...

Such a header is especially useful for a substantial script that other people may use. The R developer
should also keep the following points in mind:

• Comment the code and functions using #

• Remember to indent the code

• Choose adequate variable names (avoid names that are too short, too long or misleading)

• Whatever language is chosen (French, English, Swahili ...), always use the same one throughout the code

2.3.2 What is a code block ?

In R (as in many other languages like C, C++, Java ...) a code block is a piece of code which is grouped
together and delimited by the { and } symbols (curly braces). This entity is the building block of
structured programming, where control structures are formed out of blocks (footnote 8).

2.3.3 Tests

As previously mentioned in 2.2.1, it is possible to execute conditional tasks thanks to logical tests. The
result of a test is generally TRUE or FALSE and, according to this result, the script will select one
processing path instead of another:

a <- 42
b <- 25
if (a >= b) { # tests if a is greater than or equal to b. If TRUE, the block is executed; if FALSE, execution continues after the block
  print("a superior or equal to b")
}
...

Note: to test for equality the "=" symbol has to be doubled. Example:

if (a == b) {
  ...
}

The first script above is saved as test01.r and run:

> source("test01.r")
[1] "a superior or equal to b"

The test operators are detailed in 2.2.1. The else command allows alternative processing. For instance, the
case where b is greater than a has to be handled in the previous script:

a <- 12
b <- 25
if (a >= b) { # tests if a is greater than or equal to b. If TRUE, the first block is executed; if FALSE, the else block is executed
  print("a superior or equal to b")
} else { # the whole "} else {" command has to be on the same line
  print("b superior to a")
}

Of course, in this case the 2 alternatives are mutually exclusive.
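
More than two cases can be chained with "else if". The sketch below is an addition to the tutorial (the variable name msg and the values are arbitrary); only the first block whose test is TRUE is executed.

```r
a <- 12
b <- 12
if (a > b) {
  msg <- "a superior to b"
} else if (a == b) {   # only tested when the first condition is FALSE
  msg <- "a equal to b"
} else {               # reached when none of the previous tests is TRUE
  msg <- "b superior to a"
}
print(msg)
```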


Footnote 8: https://en.wikipedia.org/wiki/Block_(programming)

2.3.4 Functions

One of the most interesting features of coding is the possibility to create functions. For instance, if a
statistical processing requires 20 successive steps, each involving a different command applied to the preceding
result, it is very tedious to enter all these commands each time on a data set. The first good idea is therefore
to write a script and, if this piece of code is often used, the second good idea is to create a function out
of the script. The following minVal() function, for example, returns the minimal value of a vector argument
(Figure 9).

Figure 9: my second R script.
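The script of Figure 9 is not reproduced in this text. A minimal sketch of what such a function could look like is given here, assuming a numeric vector argument and an explicit loop; the body is an assumption, not the original script, and R's built-in min() does the same job:

```r
# Hypothetical reconstruction of minVal(): the original body is only visible in Figure 9.
minVal <- function(values) {
    smallest <- values[1]          # start with the first element
    for (v in values) {
        if (v < smallest) {        # keep the smallest value seen so far
            smallest <- v
        }
    }
    return(smallest)
}

print(minVal(c(3, 22)))
print(minVal(c(24, 2)))
```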

Once the file is properly saved in the working directory, it can be run from the R console:

> source("minVal.r")
> minVal(c(3, 22)) # call the function with some given arguments
[1] 3
> minVal(c(24, 2))
[1] 2
> val <- c(102, 12)
> minVal(val)
[1] 12

The template of R functions is the following:

nameOfFunction <- function(arguments) {
    instructions ...
    return()
}

Arguments and return values are optional. The following function is perfectly correct (though not very
useful):

myPoorFunction <- function() {
    print("nothing")
}
> myPoorFunction()
[1] "nothing"

Of course any kind of data structure can be taken as an argument or returned by a function. It only depends
on the specific needs.
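As an illustration of a function taking a vector and returning a more complex structure, the following sketch returns a list. The function name describeValues() and the list fields are hypothetical and chosen only for this example:

```r
# describeValues() is a hypothetical name used only for this illustration.
describeValues <- function(values) {
    return(list(n = length(values), mean = mean(values), range = range(values)))
}

res <- describeValues(c(12, 14, 16))
print(res$n)       # number of values
print(res$mean)    # average
print(res$range)   # smallest and largest values
```

The elements of the returned list are accessed with the $ operator.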

The apply() function is a powerful complement to user-defined functions. It applies a function to each row
or each column of a matrix or array:

myNewFunction <- function(a) {
    return(a * 10)
}
# MARGIN (second argument of apply): for a matrix, 1 indicates rows,
# 2 indicates columns, c(1, 2) indicates rows and columns.
> m1 <- matrix(1:12, nrow = 3)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> r1 <- apply(m1, 2, myNewFunction)
> r1
     [,1] [,2] [,3] [,4]
[1,]   10   40   70  100
[2,]   20   50   80  110
[3,]   30   60   90  120
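The effect of the MARGIN argument is easier to see with a function that reduces each row or column to a single value. A short sketch using the built-in sum() on the same matrix (the variable names rowTotals and colTotals are chosen for this example):

```r
m1 <- matrix(1:12, nrow = 3)
rowTotals <- apply(m1, 1, sum)  # MARGIN = 1: one result per row
colTotals <- apply(m1, 2, sum)  # MARGIN = 2: one result per column
print(rowTotals)  # 22 26 30
print(colTotals)  # 6 15 24 33
```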

2.3.5 Loops

One of the main interests of scripts is the possibility to automate and repeat tedious data processing, and
loops allow controlled repetitive operations. For instance, the "for" loop can be used to repeat a given
piece of code n times:

for (i in 1:6) {
    print(paste("hello", i))
}
# result:
[1] "hello 1"
[1] "hello 2"
[1] "hello 3"
[1] "hello 4"
[1] "hello 5"
[1] "hello 6"

In many languages a "for" loop defines 3 elements: (i) the counting variable and its initial value, (ii) the
loop stop condition, and (iii) the increment in the variable's value. In R these are implicit in the sequence
given after the keyword "in", which makes the variable take each value of 1:6 in turn (note that the
parentheses are mandatory in this syntax). Different types of variable series can be used:

a <- c("a1", "f1", "lll")
for (gobbledigook in a) {
    print(gobbledigook)
}
# result:
[1] "a1"
[1] "f1"
[1] "lll"
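When both the position and the value of each element are needed, the loop can run over the indices instead of the values. A minimal sketch using the built-in seq_along() function (the labels vector is introduced only for this example):

```r
a <- c("a1", "f1", "lll")
labels <- character(length(a))      # pre-allocated result vector
for (i in seq_along(a)) {           # i takes the values 1, 2, 3
    labels[i] <- paste(i, a[i], sep = ":")
}
print(labels)
```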

Loops can be nested. For example, if all the coordinates in a 2 rows x 3 columns matrix (see footnote 9) have
to be displayed, the following script can be used:

m <- NULL
for (y in 1:2) {
    l <- NULL
    for (x in 1:3) {
        l <- c(l, paste(y, x, sep = ","))
    }
    m <- rbind(m, l)
}

Once saved as loop.r in the working directory:

> source("loop.r")
> m
  [,1]  [,2]  [,3]
l "1,1" "1,2" "1,3"
l "2,1" "2,2" "2,3"
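For comparison, the same table of coordinates can be obtained without explicit loops. The sketch below relies on the built-in outer() function, which applies a function to every combination of its first two arguments; it is an alternative to the nested loops, not a replacement for understanding them:

```r
# Same 2 x 3 matrix of "row,column" labels, built without explicit loops.
m <- outer(1:2, 1:3, paste, sep = ",")
print(m)
```

The result has no row names, unlike the matrix built with rbind().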

If the number of loop iterations is not known in advance, the "while" structure is preferable. It repeats
its internal block as long as the condition defined in the header is true. The previous "for" loop counting
from 1 to 6 can thus be implemented this way:

m <- 0
while (m < 6) {
    m <- m + 1
    print(m)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6

9 It is conventional to define the first coordinate as the row and the second as the column.

A "while" loop can run forever unless its condition becomes false or a "break" instruction is applied. In the
following example the condition is always TRUE, so the loop iterates until m exceeds 6: when m reaches 7 the
loop is broken.

m <- 0
while (TRUE) {
    m <- m + 1
    if (m > 6) {
        break
    }
    print(m)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
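A case where the number of iterations really depends on the data is sketched below: a value is halved until it drops below 1, and the loop counts how many halvings were needed. The variable names x and steps are chosen for this example only:

```r
x <- 100
steps <- 0
while (x >= 1) {      # the number of iterations depends on the starting value
    x <- x / 2
    steps <- steps + 1
}
print(steps)  # 7
print(x)      # 0.78125
```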

2.3.6 User input and output functions

The print() function provides a handy way to output results from a script, but it can also be necessary for
a script to request user input. This can be done with the scan() function:

a <- scan() # template: variableName <- scan()
1: 12 14 16
4: # press enter twice to stop entering values
Read 3 items
a
[1] 12 14 16
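To try scan() without typing at the keyboard (for instance when testing a script), the values can be passed through its text argument. A minimal sketch; quiet = TRUE only suppresses the "Read 3 items" message:

```r
a <- scan(text = "12 14 16", quiet = TRUE)
print(a)
print(sum(a))  # 42
```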

2.4 Input and output files

Most of the time, the data to analyze are stored in files whose format depends on the software which produced
them. R has many libraries and packages and can thus read a lot of different data file formats, but in this
tutorial we will focus on the manipulation of ASCII (see footnote 10) or Unicode text files instead of binary
or proprietary file formats. The simplest way to read data from a file and store them in a variable is to use
the read.table() function. If a file named "data1.txt" located in the working directory contains:

1 0.1 12
2 0.2 14
3 0.3 11

It can be opened and read this way:

> res1 <- read.table("data1.txt")
> res1
  V1  V2 V3
1  1 0.1 12
2  2 0.2 14
3  3 0.3 11

10 American Standard Code for Information Interchange

The read.table() function returns a data.frame object. Arbitrary column names and row numbers are thus
added to the variable. If a "data2.txt" file contains an explicit header like this:

A B C
1 0.1 12
2 0.2 14
3 0.3 11

The optional argument header=TRUE can be added to set the column names directly:

> res2 <- read.table("data2.txt")
> res2
  V1  V2 V3
1  A   B  C
2  1 0.1 12
3  2 0.2 14
4  3 0.3 11
> res2 <- read.table("data2.txt", header = TRUE)
> res2
  A   B  C
1 1 0.1 12
2 2 0.2 14
3 3 0.3 11

If the content of a variable has to be saved to a file, the output command equivalent to read.table() is
write.table():

> res2
  A   B  C
1 1 0.1 12
2 2 0.2 14
3 3 0.3 11
> res2[2, 2] <- 0.55 # change the content of the current variable res2
> res2
  A    B  C
1 1 0.10 12
2 2 0.55 14
3 3 0.30 11
> # writes the new content of res2 in the same file
> write.table(res2, file = "data2.txt", col.names = TRUE)

The new "data2.txt" file now contains:

"A" "B" "C"
"1" 1 0.1 12
"2" 2 0.55 14
"3" 3 0.3 11

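By default write.table() also writes the row names and puts quotes around text. When the file is meant to be read back with header=TRUE, it can be convenient to switch both off. The sketch below writes to a temporary file (the outFile name and the use of tempfile() are choices made for this example) and reads the data back:

```r
res <- data.frame(A = 1:3, B = c(0.1, 0.55, 0.3), C = c(12L, 14L, 11L))
outFile <- tempfile(fileext = ".txt")   # temporary file, used only for this sketch
write.table(res, file = outFile, row.names = FALSE, quote = FALSE)
print(readLines(outFile))               # header line followed by 3 data lines
back <- read.table(outFile, header = TRUE)
print(back)
```

The file then starts with the plain header line A B C, followed by one line per row without row names.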
The csv file format allows data exchanges between Excel, LibreOffice ... and R. For instance, an Excel
worksheet can be saved under the file name "data3.csv" (Figure 10).

Figure 10: Excel csv file creation.

The file can then be read in R with the following command:

# read.csv() uses "," as its default separator; this file uses ";", so the sep argument has to be specified
> res3 <- read.csv("data3.csv", sep = ";")
> res3
    name mark
1   truc   12
2 machin   13
3 bidule   11
> res3$mark[1] * 2
[1] 24
> res3$name
[1] truc   machin bidule
Levels: bidule machin truc
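The same reading step can be tried without any file by passing the text directly to read.csv(). In this sketch the csvText string is a stand-in whose content is assumed from the output shown above; stringsAsFactors is set explicitly because its default value changed in R 4.0.0, so the name column may or may not be a factor depending on the R version:

```r
# Inline stand-in for data3.csv (content assumed from the output shown above).
csvText <- "name;mark\ntruc;12\nmachin;13\nbidule;11"
res3 <- read.csv(text = csvText, sep = ";", stringsAsFactors = FALSE)
print(res3$name)        # plain character strings, not factor levels
print(mean(res3$mark))  # 12
```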

2.5 R graphic libraries

One of the main advantages of R lies in its strong graphic capabilities. This tutorial will give an overview of
some basic graphs but many other 2D, 3D, animated and even interactive graphs can be obtained thanks
to the R packages (see footnotes 11 and 12).
11 http://www.cookbook-r.com/Graphs/
12 http://www.statmethods.net/

2.5.1 How to create and manage graphic windows?

The first step to create a chart is to build a blank graphic window using the function dev.new() (Figure 11).
Many different windows can be visualized at the same time. Each time a new graph window is required, the
function dev.new() must be used.

Figure 11: R blank graph windows.

When several graph windows are open, it is possible to select the current one with dev.set(), combined with
the dev.next() and dev.prev() functions. For example, if the current graph is number 5 and some data have
to be plotted in window number 3:

# quartz() for MacOS, X11() for Linux and windows() for Windows are possible alternative commands
> dev.new()
> dev.new()
> dev.new()
> dev.new()
> dev.cur()
quartz
     5 # the graph window number starts from 2
> dev.set(dev.prev())
quartz
     4
> dev.set(dev.prev())
quartz
     3
> dev.cur()
quartz
     3

The par() function permits the plotting of several figures on the same graph:

dev.new()
par(mfrow = c(2, 3)) # indicates the dimensions of the layout (2 rows and 3 columns)
plot(1:1) # fake data ... just to show
title("title 1")
plot(1:2)
title("title 2")
plot(1:3)
title("title 3")
plot(1:4)
title("title 4")
plot(1:5)
title("title 5")
plot(1:6)
title("title 6")

This script plots a main window with 6 subparts, each containing a different plot and title, as shown in Figure 12.

Figure 12: Combined plots in the same window.

2.5.2 Line and point charts

The generic R functions for plotting R objects are plot() and matplot(). These commands can produce single
or multiple curves as well as graphs of individual points. For instance:

x0 <- seq(from = 0, to = 10, by = 0.5)
y0 <- cos(x0)
x1 <- seq(from = 0, to = 10, by = 0.01)
y1 <- cos(x1)
y2 <- sin(x1)
y3 <- sin(1.5 * x1)
m1 <- cbind(y1, y2, y3)
dev.new()
par(mfrow = c(1, 3))
plot(x0, y0, type = "l")
# a previous plot() is mandatory; pch = 21 draws circles filled with the bg color
points(x0, y0, bg = "white", pch = 21)
title("Simple plot")
matplot(x1, m1, type = "l")
title("Multiple plot")
matplot(y1, y2, type = "l")
title("Parametric plot")

The output of the script above is shown in Figure 13.

Figure 13: Use of the plot and matplot commands in R.

dev.new()
x <- 0:64 / 64
y1 <- sin(3 * pi * x)
y2 <- sin(2 * pi * x)
y <- cbind(y1, y2)
# here the main argument is the graph title
matplot(x, y, type = "l", col = c("blue", "red"), lty = 1,
        main = "Points, lines and legends")
points(x, y1, bg = "white", pch = 21)
points(x, y2, bg = "white", pch = 21)
legend("bottomleft", legend = c("sin(3 * pi * x)", "sin(2 * pi * x)"),
       lty = 1, col = c("blue", "red"))

Legends are necessary to obtain clear and understandable graphs especially in case of multiple plots. As
shown above, the legend() function takes the following arguments:

• legend=vector containing the different plots names

• lty=1 (line type)

• the location (here "bottomleft")

• col=the associated colors

The output is shown in Figure 14.

Figure 14: Use of the legend function in R.

2.5.3 Histogram

A visual check of the data distribution is often very useful to describe a population. The hist() function
takes a series of data and plots its distribution histogram:

d1 <- rnorm(1000, 10, 2) # 1000 random values generated with mean = 10 and sd = 2 (gaussian)
d2 <- runif(1000, 8, 12) # 1000 random values generated between 8 and 12 (uniform)
dev.new()
par(mfrow = c(1, 3))
hist(d1, main = "Normal distribution")
hist(d2, main = "Uniform distribution")
hist(d1, main = "N=50 breaks", breaks = 50)

Figure 15: R histogram samples.

The left and middle panels of Figure 15 show 2 examples of different distributions. The "breaks" argument
suggests the number of intervals of the histogram (R may adjust it to obtain round boundaries) and the right
panel of Figure 15 shows its effect on the plot (here it is set to 50). The second argument to consider in the
hist() function is "right". Its default value is TRUE, which means that the histogram cells are right-closed
(left-open) intervals. It is sometimes clearer to set the interval boundaries explicitly. The following script
plots the histogram of a single data series with different settings (Figure 16).

a1 <- c(2, 1, 3, 2, 4, 3, 2, 3, 1, 2, 4, 3, 4, 5, 5, 6, 2, 4, 1, 6)
dev.new()
par(mfrow = c(1, 3))
h1 <- hist(a1, main = "Default")
h2 <- hist(a1, breaks = 1:7, main = "Breaks=1:7")
h3 <- hist(a1, breaks = 1:7, right = FALSE, ylim = c(0, 8), main = "Breaks=1:7, Right=FALSE")
print("h1$counts"); print(h1$counts) # number of occurrences in each cell
print("h2$counts"); print(h2$counts)
print("h3$counts"); print(h3$counts)
# The table() function returns the contingency table of the counts at each
# combination of factor levels of a variable
print("table(a1)"); print(table(a1)) # equivalent to h3$counts in this case

Figure 16: Effect of the "right" & "breaks" arguments combination in the hist() function.
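The effect of the "right" argument can also be checked numerically, without drawing anything, by calling hist() with plot = FALSE and looking at the counts. A short sketch on the same a1 series (hRight and hLeft are names chosen for this example):

```r
a1 <- c(2, 1, 3, 2, 4, 3, 2, 3, 1, 2, 4, 3, 4, 5, 5, 6, 2, 4, 1, 6)
# plot = FALSE returns the histogram object without drawing it
hRight <- hist(a1, breaks = 1:7, plot = FALSE)                 # right-closed cells
hLeft  <- hist(a1, breaks = 1:7, right = FALSE, plot = FALSE)  # left-closed cells
print(hRight$counts)         # 8 4 4 2 2 0
print(hLeft$counts)          # 3 5 4 4 2 2
print(as.vector(table(a1)))  # 3 5 4 4 2 2
```

With right-closed cells the values 1 and 2 fall into the same first cell, which is why the first count is 8 (3 + 5) and the last cell is empty.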

2.5.4 Whisker box

The box-and-whisker plot is particularly suited to the representation of small samples or of data that are
not normally distributed. In both cases the use of median and quartiles instead of mean and standard
error or standard deviation is meaningful. This graph contains a lot of important information regarding central
tendency and dispersion as shown in Figure 17 (see footnote 13).

Figure 17: Box-and-whisker plot description.

Note that the [Q1 - 1.5 x IQR; Q3 + 1.5 x IQR] interval is used to select the actual most extreme values
which will be used as whisker edges (the default range value is 1.5). The R function is boxplot() and it can
be used this way:

d1 <- rnorm(10000, 5, 1)
d2 <- rnorm(10000, 8, 2)
d3 <- runif(10000, 3, 7)
d4 <- runif(10000, 6, 8)
mm <- cbind(d1, d2, d3, d4) # first creates the matrix to plot
boxplot(mm, names = c("ND 5,1", "ND 8,2", "UD 3,7", "UD 6,8"))

The above script generates 4 series with different distributions and plots the box-and-whisker graph
(Figure 18). If the outliers are numerous (Figure 18: ND 5,1 and ND 8,2) or of no interest, the outline=FALSE
argument can be used to hide them.
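The whisker rule can be reproduced by hand on a small series. The sketch below uses made-up values and the built-in fivenum() and boxplot.stats() functions; note that R's boxplot() works with hinges, which are close to, but not always identical to, the quartiles returned by quantile():

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one extreme point
hinges <- fivenum(x)[c(2, 4)]            # lower and upper hinges (close to Q1 and Q3)
iqr <- diff(hinges)
fences <- c(hinges[1] - 1.5 * iqr, hinges[2] + 1.5 * iqr)
print(fences)                            # -4.5 15.5
bs <- boxplot.stats(x)                   # statistics used by boxplot(), without the plot
print(bs$stats)                          # 1.0 3.0 5.5 8.0 9.0
print(bs$out)                            # 30
```

The upper whisker stops at 9, the largest value inside the fences, and 30 is reported as an outlier.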

13 http://www.unc.edu/~nielsen/soci708/m3/m3.htm

Figure 18: Box-and-whisker plot sample.

2.5.5 Pie chart

This graph is used to summarize the distribution of values using relative areas.

pc <- c(8, 12, 25, 35, 5, 15) # creates the values
pie(pc, labels = c("a", "b", "c", "d", "e", "f"), main = "Pie chart") # plots the graph

The resulting plot is shown in Figure 19.

Figure 19: Pie chart plot sample.

2.5.6 Bar charts

Bar graphs can be plotted using the barplot() function:

dev.new()
barplot(c(1, 2, 3))

These command lines create a simple barplot as shown in Figure 20.

Figure 20: Barplot sample.

Most of the time bar plots come along with error bars (corresponding to standard deviation, standard error
of the mean ...). Such bar plots are more difficult to create because they require the use of another function:
arrows(). Indeed there is no automated graph command which plots at the same time mean + error bars
from a set of data (at least not in the R base package). The following script provides a function which can
automatically plot such graphs:

# barValues and errorValues are matrices with column: group, row: value in each group
myBarPlot <- function(barValues, errorValues, titleText) {
    # the y axis must be tall enough for the highest bar plus the largest error bar
    barx <- barplot(barValues, beside = TRUE, ylim = c(0, max(barValues) + max(errorValues)))
    arrows(barx, barValues + errorValues, barx, barValues, angle = 90,
           code = 1, length = 0.05)
    title(titleText)
}

dev.new()
par(mfrow = c(1, 2))
m1 <- matrix(1:3, nrow = 1) # bar values
sem1 <- matrix(3:1, nrow = 1) # error bars values
myBarPlot(m1, sem1, "Simple series")
m2 <- matrix(1:6, nrow = 2) # bar values
sem2 <- matrix(6:1, nrow = 2) # error bars values
myBarPlot(m2, sem2, "3 groups of 2 values")

Figure 21 shows the output of this script.

Figure 21: Barplots and error bars.
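In practice the bar heights and error bars are computed from raw measurements. A minimal sketch, assuming one column per group and one row per replicate; the raw values are made up for illustration and the standard error of the mean is taken as sd / sqrt(n):

```r
# Made-up raw measurements: one column per group, one row per replicate.
raw <- cbind(g1 = c(2, 4, 6), g2 = c(10, 10, 13), g3 = c(5, 7, 9))
means <- colMeans(raw)
sems <- apply(raw, 2, sd) / sqrt(nrow(raw))   # standard error of the mean
barValues <- matrix(means, nrow = 1)          # same layout as m1 in the script above
errorValues <- matrix(sems, nrow = 1)
print(barValues)
print(errorValues)
```

The two resulting matrices can then be passed to a plotting function such as myBarPlot().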

2.5.7 The ggplot2 package

The ggplot2 package was created by Hadley Wickham in 2005 and is very easy to install:

install.packages("ggplot2")

It provides new functionalities to produce high quality enriched figures that can be used to illustrate research
papers as shown in Figure 22.

Figure 22: ggplot2 package figures samples.

The graph construction is slightly different from the base plot approach and we are going to test this package
on an example (see footnote 14). Download the data_ggplot2.txt data file from http://agbx.byethost9.com/data/
into your working directory, store its content in a variable and load the ggplot2 package:

d1 <- read.table("data_ggplot2.txt", header = TRUE)
library(ggplot2)

The next step consists in the creation of the main figure.

p0 <- ggplot(data = d1, aes(x = Time, y = Abs, colour = Cond, shape = Cond))

In this instruction, aes refers to the aesthetic parameters. In this case, the Time column is used for the x
axis and the Abs column for the y axis. The colour and shape of the dots will automatically vary according
to the levels detected in the Cond column. At the moment p0 is just an empty figure as shown in Figure 23:

p0

Figure 23: First ggplot.


14 http://bioinfo-fr.net/guide-de-demarrage-pour-ggplot2-un-package-graphique-pour-r

The content of the figure has then to be specified. The "+" operator is used to add supplementary description
to the initial p0 figure. If simple points have to be added to the plot the geom_point() function will be used:

p0 <- p0 + geom_point(size = 4)

As shown in Figure 24, the size parameter sets the point size.

Figure 24: First ggplot points series: p0.

Once the ggplot2 principle is understood, it is possible to add many additional features such as a continuous
line (Figure 25-A):

p1 <- p0 + ggtitle("My beautiful enzyme") + geom_line(size = 2)
p2 <- ggplot(data = d1, aes(x = Time, y = Abs, colour = Cond, shape = Cond)) +
    geom_point(size = 4)
p2 <- p2 + geom_smooth() # smoothing data, drawn with its confidence interval
p3 <- ggplot(data = d1, aes(x = Time, y = Abs, fill = Cond)) +
    geom_bar(aes(y = Abs), stat = "identity", position = "dodge")

It is also possible to automatically smooth a series of values with its confidence interval or to change the plot
type to a bar plot, as shown above with p2 and p3 respectively (Figure 25-B-C).
A very common way to visualize spike trains in neuroscience is to use raster plots. This can be done with
the ggplot2 library as shown in Figure 26-A-B:

Figure 25: ggplot2 curve samples: p1, p2 and p3.

sp1 <- c(10.1, 14, 25, 27, 30, 45)
sp2 <- c(1.1, 28, 44)
sp3 <- c(3, 15.5, 31, 60, 67, 99)
sp <- c(sp1, sp2, sp3)
neuron <- c(rep(1, length(sp1)), rep(2, length(sp2)), rep(3, length(sp3)))
d4 <- data.frame(neuron, sp)
p4 <- ggplot(d4, aes(x = sp, y = factor(neuron)))
p4 <- p4 + geom_point() + ylab("neuron")
p5 <- ggplot(d4, aes(x = sp, y = factor(neuron), color = factor(neuron)))
p5 <- p5 + geom_point() + ylab("neuron")

Figure 26: ggplot2 raster plot samples (A: p4 and B: p5).

3 Applications

3.1 R variables

3.1.1 Variable creation

i)

> v1
[1] 101 102 103 104 105 106 107 108 109 110 111 112
> v2
[1] 20 18 16 14 12 10
> v3
[1] 4 6 3 4 6 3 4 6 3 4 6 3
> v4
[1] 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
> v5
[1] 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 3 3 3 3 3

ii)

> v6
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20
[6,]   21   22   23   24
> v7
 [1] 1 0 1 0 1 0 1 0 1 0 1 0
> v8
       Value_1 Value_2
Rank_1       3       6
Rank_2       9      12
Rank_3      15      18
> v9
            Rank Result_1 Result_2
Candidate_1    A       11     14.0
Candidate_2    B       12     12.5
Candidate_3    C       14     18.0
> v10
  e1   e2  e3
1 AA 11.0 110
2 AA 12.0 120
3 AA  9.0 115
4 BB 12.5  95
5 BB 10.0 140
6 BB  6.0 199
7 BB  8.0  55

3.1.2 Variable manipulation and selection

i) Create a variable according to this table:

         mark1 mark2 mark3
student1    11    16    10
student2    13    10   2.5
student3  10.5   5.5    12
student4    12    14    18
student5    10    11   8.5
student6  15.5   9.5     7

ii) Using conditional operators and R commands, answer the following questions:

• Who are the students whose mark1 is inferior to 12 ?


Note: your command can involve intermediary steps and should return the following results:

[1] " student1 " " student3 " " student5 "

• Who are the students whose mark2 is superior to 11 and inferior to 15 ?

[1] " student4 "

• What is the sum of all the mark1 values of the students whose mark2 is inferior to 10 ?

[1] 26

• How many students have a mark3 superior or equal to 10 ?

[1] 3

3.2 R control structures

3.2.1 "For" loops examples

Write the "test.r" scripts which produce the following outputs:


i)

> source("test.r")
[1] 1 1 1
[1] 2 4 8
[1] 3 9 27
[1] 4 16 64
[1] 5 25 125
[1] 6 36 216

ii)

> source("test.r") # the only difference with the previous one is the "string" format of the output
[1] "1 1 1"
[1] "2 4 8"
[1] "3 9 27"
[1] "4 16 64"
[1] "5 25 125"
[1] "6 36 216"

iii)

> source("test.r")
[1] "1 is less than four"
[1] "2 is less than four"
[1] "3 is less than four"
[1] "4 is four!"
[1] "5 is more than four"
[1] "6 is more than four"
[1] "7 is more than four"

3.2.2 "While" loops examples

i)

> source("test.r")
[1] 36
[1] 25
[1] 16
[1] 9
[1] 4
[1] 1

ii) The same "test.r" script should produce both outputs:

> source("test.r")
[1] "Enter 0 to stop"
1: 4 1 2 5
5: # press enter
Read 4 items
1: # press enter
Read 0 items
[1] "Sum the values:"
[1] 12
[1] "Average value:"
[1] 3
[1] "Sorted values:"
[1] 1 2 4 5

> source("test.r")
[1] "Enter 0 to stop"
1: 4
2: 1
3: 2
4: 5
5: # press enter
Read 4 items
1: # press enter
Read 0 items
[1] "Sum the values:"
[1] 12
[1] "Average value:"
[1] 3
[1] "Sorted values:"
[1] 1 2 4 5

3.2.3 Functions examples

Write the "test.r" scripts which produce the following outputs:


i)

> source("test.r")
> counter(3, 6)
[1] 9
[1] 16
[1] 25
[1] 36

ii)

> source("test.r")
> facto(0)
[1] 1
> facto(1)
[1] 1
> facto(2)
[1] 2
> facto(5)
[1] 120
