R & Statistics Tutorial: Andre Garenne May 17, 2017
Contents
1 Introduction
1.1 Downloading and installing R
1.2 Running R
1.3 Package installation
1.4 Package activation
1.5 R help
2 R language basics
2.1 Variables types in R
2.1.1 Numerical variables creation and processing
2.1.2 String variables
2.1.3 Logical variables
2.1.4 Vectors, matrices and arrays
2.1.5 Factor variables
2.1.6 Lists and data frames
2.1.7 Class, type, mode and storage.mode in R
2.2 Manipulation and processing of R variables
2.2.1 Operators
2.2.2 Conditional data selection
2.2.3 R controlled sequences generation
2.2.4 R random series generation
2.3 Control structures in R (flow of control)
2.3.1 General considerations on coding in general and R coding in particular
2.3.2 What is a code block ?
2.3.3 Tests
2.3.4 Functions
2.3.5 Loops
2.3.6 User input and output functions
2.4 Input and output files
2.5 R graphic libraries
2.5.1 How to create and manage graphic windows?
2.5.2 Line and point charts
2.5.3 Histogram
2.5.4 Whisker box
2.5.5 Pie chart
2.5.6 Bar charts
2.5.7 The ggplot2 package
3 Applications
3.1 R variables
3.1.1 Variable creation
3.1.2 Variable manipulation and selection
3.2 R control structures
3.2.1 "For" loops examples
3.2.2 "While" loops examples
3.2.3 Functions examples
1 Introduction
1.1 Downloading and installing R
According to its authors, "R is a language and environment for statistical computing and graphics". It can
be downloaded, for example, from http://cran.univ-lyon1.fr/, but there are many other possible mirrors.
Binaries are available for MacOSX, Linux and Windows, and the sources can also be recompiled on
different OS [1]. The installation is usually very simple with the binaries and is also well documented
for Linux users. In this tutorial we will use the "standard" R graphic environment (R-GUI) [2].
1.2 Running R
The following screenshots and examples originate from the MacOSX version of R but the look and feel are
quite similar with the other versions. R can be used from the command line in a terminal window (Figure 1).
The standard GUI version provides some practical functionalities such as a script file editor and easy access
to packages or working directories management as shown in Figure 2. All these functionalities are available
from the top-menu bar.
1.3 Package installation
The standard distribution of R comes with numerous useful statistical and graphic packages. Nevertheless,
it rapidly becomes necessary to import or even build our own packages according to our specific needs. The
R-GUI provides a simple method to do so. To install the required packages (i.e. to download them from the
site and make them available to R), click on the Packages & Data menu item, choose Package Installer and
get the full package list from the CRAN (Comprehensive R Archive Network) binaries. The "Install
Dependencies" checkbox has to be checked in order to avoid partial or dysfunctional installations.
The required package can then be selected and installed. This procedure is summarized in Figure 3. The
other way to install packages from distant sites is to use the install.packages() command, but it will be
left aside for the moment.
Figure 2: R-GUI screenshot.
[1] Operating system.
[2] There also exist friendlier R IDEs such as RKWard (https://rkward.kde.org/) or RStudio (https://www.rstudio.com/).
1.4 Package activation
Once a package has been installed, it has to be activated in the current R session (or at the beginning of
a script) otherwise its functions are not available. This can be done via the main menu by clicking on the
Packages & Data menu item and choosing Package Manager (Figure 4). The required packages for the current
R session are selected by clicking on the checkboxes (note that when a package has just been installed it is
necessary to update the list otherwise it is not visible). The alternative to the R-GUI activation method
consists in calling the library() function like this:
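As a minimal sketch of this call, the example below attaches the parallel package, which is chosen here only because it ships with every standard R installation; it stands in for whichever package has just been installed.

```r
# 'parallel' is only a stand-in: replace it with the package you installed.
library(parallel)

# search() lists the attached packages; the new one appears as "package:<name>".
loaded <- "package:parallel" %in% search()
print(loaded)
```

Placing such library() calls in the first lines of a script makes its dependencies visible at a glance.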
This latter method is often necessary especially at the beginning of a script in which "non standard" functions
are needed.
1.5 R help
The R help can be accessed from the main menu (help item) but it is useful to know some basic commands:
help.start() # opens the general R help menu
If the name of a function is already known, the help() and "?" commands can also be used:
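A possible illustration, assuming that the help page of the t.test() function is the one being looked for (any other function name would do):

```r
# Function form; typing ?t.test at the prompt is equivalent.
page <- help("t.test")

# The result is a help object; printing it displays the documentation page.
print(class(page))
if (interactive()) print(page)
```

At the console the result does not need to be stored: entering help(t.test) or ?t.test directly opens the page.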
These commands open a new window which contains function description and examples of use as shown in
Figure 5. If the function spelling is not exactly known, the "??" operator can be used.
For instance the following command opens a new window in which all the help items containing similar
spelling are listed (Figure 6).
??student
Figure 6: R help search with the "student" keyword.
2 R language basics
2.1 Variables types in R
2.1.1 Numerical variables creation and processing
a <- 1.5
The "<-" symbol (or "=" as in the majority of computer languages) is an assignment operator. In this case
it dynamically creates a numeric variable named "a" containing 1.5 as initial value. If a variable has to be
created with no initial type, the NULL keyword can be used. A new assignment to a variable simply replaces
the previous value. The type of a variable can be obtained with the typeof() command.
> a <- 1.5
> a
[1] 1.5
> typeof(a)
[1] "double"
> a <- NULL
> typeof(a)
[1] "NULL"
> a <- 10.5
> typeof(a)
[1] "double"
> a*2
[1] 21
The current variables list can be obtained using the ls() command:
Note that if the parentheses are omitted, the console returns the R source code of the function. For example,
if the "ls" command is entered it will return the result shown in Figure 7. The existence of a variable can be
tested using the exists() function. It requires at least one argument, the name of the variable to be tested,
given as a character string. The environment to search can optionally be specified as well (by default the
search starts from the current environment, whose variable names are those returned by ls()):
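The two commands can be sketched as follows; the variable names a, b and undefined_variable_xyz are arbitrary choices made for this example.

```r
a <- 1.5
b <- "text"

# Names of the variables defined in the current environment.
print(ls())

# exists() takes the variable name as a character string.
a_found <- exists("a")
z_found <- exists("undefined_variable_xyz")
print(c(a_found, z_found))
```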
As in many other modern languages, naming is case sensitive. Moreover, it is recommended to avoid using
reserved keywords of the language as variable names. For example, the keyword "for" cannot be used as a
variable name, but function names can be used, which can lead to weird source code:
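A short sketch of both points, using arbitrary variable names. An assignment such as for <- 1 is not shown as code because it is rejected as a syntax error.

```r
Value <- 1
value <- 2            # a different variable: names are case sensitive

# Legal but confusing: a vector named like the mean() function.
mean <- c(2, 4, 6)
result <- mean(mean)  # R still finds the function when the name is called
print(result)

rm(mean)              # removes the vector, the function is untouched
```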
Figure 7: Sample of R source code of the ls function.
2.1.2 String variables
The dedicated type of variable to store text is the "string" (its R type is "character"). Strings can be
created using single ' or double " quotes, or alternatively by converting other types of values to strings:
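A minimal sketch of these possibilities, with arbitrary variable names and values:

```r
s1 <- 'single quotes'
s2 <- "double quotes"
s3 <- as.character(12.5)     # conversion of a number to a string
s4 <- paste(s1, "and", s2)   # concatenation, separated by spaces

print(typeof(s3))
print(s4)
```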
[4] http://gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf
2.1.3 Logical variables
This data type can only take 2 values: FALSE or TRUE. Most of the time the boolean is used to introduce
conditional processing using tests and operators.
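For example, a comparison returns a logical value that can be stored and combined with other logical values (the variable names below are arbitrary):

```r
a <- 3
is_big <- a > 2              # TRUE
is_negative <- a < 0         # FALSE
both <- is_big & !is_negative

print(typeof(is_big))
print(both)
```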
2.1.4 Vectors, matrices and arrays
Basically, a vector is a variable which contains several values (a matrix behaves mostly like a vector but with
a supplementary dimension). The c() command combines a series of values and returns a vector (note that
a<-1 is equivalent to a<-c(1)). Each element of a vector can be accessed or modified using its index number.
This index ranges from 1 to n, with n being the number of elements of the vector, whereas in most other
computer languages it ranges from 0 to n-1.
> vec_1 <- c(1, 2, 3)
> vec_1
[1] 1 2 3
> vec_1 <- c(vec_1, 4)
> vec_1
[1] 1 2 3 4
> sum(vec_1)
[1] 10
> length(vec_1)
[1] 4
> mean(vec_1) # function application example on a vector
[1] 2.5
> sd(vec_1) # standard deviation
[1] 1.290994
> length(vec_1) # returns the number of elements
[1] 4
> vec_1 + 1
[1] 2 3 4 5
> vec_1 * 2
[1] 2 4 6 8
> vec_1^0.5
[1] 1.000000 1.414214 1.732051 2.000000
# index values range from 1 to vector length and not from 0 to vector length - 1
> vec_1[1] # returns the content of the first position
[1] 1
> vec_1[5]
[1] NA
> vec_1[2:3] # slice
[1] 2 3
> vec_1[-1] # removes the first value and returns the result
[1] 2 3 4
> vec_1 # does not change the variable content
[1] 1 2 3 4
> vec_1[-4]
[1] 1 2 3
> vec_1
[1] 1 2 3 4
> vec_1[2] <- 10
> vec_1
[1]  1 10  3  4
> vec_1[2:3] <- -100
> vec_1
[1]    1 -100 -100    4
> vec_1 <- vec_1[-4] # reassignment of a subset of vec_1
> vec_1
[1]    1 -100 -100
> is.vector(vec_1)
[1] TRUE
> is.vector(1)
[1] TRUE
Vectors and matrices can contain only one type of value:
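The following sketch shows what happens when values of different types are combined: R silently converts them to a common type (the values are arbitrary).

```r
v_num <- c(1, 2, 3)
v_mix <- c(1, "a", TRUE)   # the number and the logical are converted to strings

print(typeof(v_num))
print(v_mix)
print(typeof(v_mix))
```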
A vector contains a field to name its elements called names() and which is empty by default. This field can
be defined and used to access the vector elements:
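A minimal sketch, in which the labels "anna", "bob" and "carl" are hypothetical names chosen for the example:

```r
marks <- c(11, 13, 10.5)
print(names(marks))               # NULL: no names by default

names(marks) <- c("anna", "bob", "carl")
print(marks["bob"])               # access by name, the name is kept
bob_mark <- marks[["bob"]]        # access by name, value only
print(bob_mark)
```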
> m1 <- c(1, 2, 3, 4, 5, 6) # starts as a vector
> m1
[1] 1 2 3 4 5 6
> dim(m1) # dim() returns the sizes of the variable dimensions
NULL # a vector has neither rows nor columns, therefore the result is NULL
> dim(m1) <- c(2, 3) # sets the variable dimensions: 2 rows and 3 columns
> m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Or this way:
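One way to build the same matrix directly is the matrix() constructor. The sketch below is only one possible form; the second matrix, m_by_row, is an addition showing the effect of the byrow argument.

```r
# Filled column by column by default: same content as above.
m1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(m1)

# Filled row by row instead.
m_by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
print(m_by_row)
```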
> m1[1, 1] # access to the matrix elements (row, column)
[1] 1
> m1[2, 3]
[1] 6
> m1[3, 2]
Error in m1[3, 2] : subscript out of bounds # does not exist in this variable
> m1[, 1]
[1] 1 2
> m1[1, ]
[1] 1 3 5
> m1[2, 2] <- 123
> m1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2  123    6
> m1[, 3] <- -10
> m1
     [,1] [,2] [,3]
[1,]    1    3  -10
[2,]    2  123  -10
Like vectors, matrices contain only one type of data and their rows and columns can also be labelled:
> m1
     [,1] [,2] [,3]
[1,]    1    3  -10
[2,]    2  123  -10
> colnames(m1)
NULL
> colnames(m1) <- c("A", "B", "C")
> rownames(m1) <- c("M", "F")
> m1
  A   B   C
M 1   3 -10
F 2 123 -10
> summary(m1) # a major command in R
       A              B             C
 Min.   :1.00   Min.   :  3   Min.   :-10
 1st Qu.:1.25   1st Qu.: 33   1st Qu.:-10
 Median :1.50   Median : 63   Median :-10
 Mean   :1.50   Mean   : 63   Mean   :-10
 3rd Qu.:1.75   3rd Qu.: 93   3rd Qu.:-10
 Max.   :2.00   Max.   :123   Max.   :-10
> m1[, "A"]
M F
1 2
> m1["F", "A"]
[1] 2
> m1[2, 1] <- 1000
> m1["F", "A"]
[1] 1000
Both vectors and matrices can be used by R functions. For instance the sample() function randomly reorders
series:
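A minimal sketch of such a reordering. The call to set.seed() is an addition that only makes the run repeatable, and the values are arbitrary.

```r
set.seed(1)                          # fixed seed, only for repeatability
vec_1 <- c(1, 2, 3, 4)
shuffled <- sample(vec_1)            # random permutation of the vector
print(shuffled)

# A matrix is handled as a plain series of values: the result is a vector.
m_shuffled <- sample(matrix(1:6, nrow = 2))
print(m_shuffled)
```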
Arrays are similar to matrices but with more than 2 dimensions. More functionalities of dimensioned variables
will be discussed in section 2.2.
2.1.5 Factor variables
In R, factors are variables which can take only a limited number of different values. Roughly speaking,
statistical variables can be quantitative (continuous or discrete) or qualitative (ordinal or nominal) and to
create categories in R the factor is the best choice. The explicit construction of factor objects makes use of
the factor() command:
Along with the factor definition comes the notion of "level". The levels of a factor are its different possible
values.
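A short sketch of a factor and of its levels, built from hypothetical size categories:

```r
sizes <- c("small", "large", "medium", "small", "large")
f <- factor(sizes, levels = c("small", "medium", "large"))

print(levels(f))        # the possible values
print(table(f))         # number of occurrences of each level
print(as.integer(f))    # internal integer codes, following the level order
```

Giving the levels explicitly fixes their order; otherwise they are sorted alphabetically.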
2.1.6 Lists and data frames
Compared to vectors or matrices, these 2 container classes add flexibility and powerful functionalities [5]. The
main difference between a data.frame and a list is that all elements in a data.frame have an equal length. In
this tutorial we will mainly focus on the use of the data.frame rather than on the list class. Indeed,
data.frame allows us to deal with most of the data formats we will use. To create a data.frame, the first
possibility is to use the class constructor. In the following example we create an array of data according to
Table 2.1.6.
Table 2.1.6:
          mark1  mark2
student1  11     16
student2  13     10
student3  10.5   5.5
> name <- c("student1", "student2", "student3") # first column
> marks1 <- c(11, 13, 10.5) # second column
> marks2 <- c(16, 10, 5.5) # third column
> res <- data.frame(name, marks1, marks2) # the variable names implicitly become the column names
> res
      name marks1 marks2
1 student1   11.0   16.0
2 student2   13.0   10.0
3 student3   10.5    5.5
It is also possible to use row (rbind) or column (cbind) associations in order to change the final shape.
Nevertheless, when one of the associated series contains strings, cbind() and rbind() convert the numerical
values to strings, which data.frame() then turns into factor objects (in R versions before 4.0; more recent
versions keep them as character strings).
[5] http://stackoverflow.com/questions/15901224/what-is-difference-between-dataframe-and-list-in-r
> res[1, 2]
[1] 11 # double value
> res1[1, 2]
[1] 11 # factor type value
Levels: 10.5 11 13
> res1[, 2] <- as.double(as.character(res1[, 2])) # annoying conversion method
> res1[1, 2]
[1] 11 # double value
Note that when using the rbind command, since the variable names cannot be used to name the
columns, Xn names are arbitrarily added. When the data.frame is built, row names are also arbitrarily set
to the row number.
In Table 2.1.6 the first column contains the names of the students and not numbers. To obtain this, our res
variable has to be modified:
> rownames(res) <- res[, 1] # sets the row names using the first column values
> res <- res[, -1] # removes the first column
> res
         marks1 marks2
student1   11.0   16.0
student2   13.0   10.0
student3   10.5    5.5
> rownames(res)
[1] "student1" "student2" "student3"
The data.frame object can be addressed using the row and column names, but also using the field
names:
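A minimal sketch of these access modes. The data.frame is rebuilt here so that the example is self-contained, and mean_marks2 is an arbitrary name.

```r
res <- data.frame(marks1 = c(11, 13, 10.5), marks2 = c(16, 10, 5.5),
                  row.names = c("student1", "student2", "student3"))

print(res["student2", "marks1"])   # row and column names
print(res$marks2)                  # field (column) name with the $ operator
print(res[["marks1"]][3])          # same idea with double brackets

mean_marks2 <- mean(res$marks2)    # a field behaves like a plain vector
print(mean_marks2)
```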
2.1.7 Class, type, mode and storage.mode in R
The R documentation provides detailed explanations regarding the differences between a class, a mode
and a type. Our aim is not to go too deep into such considerations because the object model in R is complex
and derives from the S and S-PLUS languages. To summarize:
• mode() and class() are very similar and only differ in the case of objects belonging to specific classes, like
statistical test results for example
For instance:
result        a        b          c           d     e      f
class         numeric  matrix     data.frame  list  htest  function
mode          numeric  character  list        list  list   function
typeof        double   character  list        list  list   closure
storage.mode  double   character  list        list  list   function
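The objects a to f behind this table are not shown, so the sketch below rebuilds comparable ones under assumptions (a number, a character matrix, a data.frame, a list, a t-test result and a function) and queries them with the four commands. Recent R versions report two classes for a matrix ("matrix" and "array"); only the first one is kept here.

```r
# Hypothetical objects standing in for a, b, c, d, e and f.
objs <- list(
  a = 1.5,
  b = matrix(c("x", "y", "z", "w"), nrow = 2),
  c = data.frame(x = 1:3),
  d = list(1, "a"),
  e = t.test(c(1.2, 0.8, 1.1, 0.9)),
  f = function() NULL
)

first_class <- sapply(objs, function(o) class(o)[1])
modes <- sapply(objs, mode)
types <- sapply(objs, typeof)
storage <- sapply(objs, storage.mode)

print(rbind(class = first_class, mode = modes,
            typeof = types, storage.mode = storage))
```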
2.2 Manipulation and processing of R variables
2.2.1 Operators
Table: arithmetic, comparison and logic operators.
> a <- c(1, 2, 3)
> a^2
[1] 1 4 9
> a >= 2
[1] FALSE TRUE TRUE
> a %% 2
[1] 1 0 1
The parentheses allow the management of operator priorities. What is the result of the following command?
2.2.2 Conditional data selection
There are many ways to extract data from an array. They often rely on the use of indices or ranges as
previously shown in 2.1.4. But it is also possible to use conditional operators. For instance, all the data
lower than 10 have to be removed from v1:
The ">" operator is used to test all the elements of v1 and the resulting logical series is then used to select
the corresponding v1 values.
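The selection described above can be sketched as follows, with arbitrary values for v1 and an intermediate variable named keep for clarity:

```r
v1 <- c(3, 25, 8, 12, 40, 1)

keep <- v1 > 10     # logical series, one value per element of v1
print(keep)

v1 <- v1[keep]      # only the elements for which the test is TRUE remain
print(v1)
```

The two steps are usually written in a single line: v1 <- v1[v1 > 10].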
2.2.3 R controlled sequences generation
The R environment includes numerous possibilities to automatically generate ordered or regular sequences.
The use of the ":" symbol has already been mentioned. A useful way to generate controlled sequences is the
command seq():
> 1:5
[1] 1 2 3 4 5
> seq(from = 1, to = 5, by = 2) # the names from, to and by are optional but can help for clarity
[1] 1 3 5
> seq(10, 1, -3)
[1] 10  7  4  1
> seq(from = 1, to = 10, length = 4) # the total number of values is set by length
[1]  1  4  7 10
> seq(1, 10, length = 5)
[1]  1.00  3.25  5.50  7.75 10.00
If a given sequence or pattern has to be repeated several times the rep() function can be used. In a similar
way, regular series of factor levels can be generated with the gl() function:
> gl(3, 2, labels = c("small", "medium", "large"))
[1] small  small  medium medium large  large
Levels: small medium large
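The rep() function itself can be sketched as follows; the repeated values are arbitrary:

```r
r1 <- rep(1:3, times = 2)          # the whole pattern is repeated twice
r2 <- rep(c("a", "b"), each = 2)   # each element is repeated twice
r3 <- rep(0, 5)                    # five zeros

print(r1)
print(r2)
print(r3)
```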
And if it is necessary to obtain variable patterns (incremental or decremental) the sequence() method (different
from seq()) can solve the problem:
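As a minimal illustration, sequence() receives a series of lengths and concatenates the sequences 1, 2, ..., n built for each of them (the lengths below are arbitrary):

```r
s1 <- sequence(c(3, 2, 4))   # 1:3 followed by 1:2 followed by 1:4
print(s1)

s2 <- sequence(4:1)          # shorter and shorter sequences
print(s2)
```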
2.2.4 R random series generation
Since R is intended to work on statistics and modeling, many useful (pseudo)random number generators
have been implemented. For instance, uniform, Poisson and normal random series can be created using
respectively the runif(), rpois() and rnorm() functions. These functions take the number of values to generate
as the first argument and the parameters of the distribution law as the following arguments:
> series1 <- runif(100, 0, 1) # 100 values in the interval [0, 1)
> series2 <- rnorm(100, 10, 2) # 100 values following a gaussian distribution with mean=10 and sd=2
> series3 <- rpois(100, 1) # 100 values following a Poisson distribution with lambda=1
> s <- floor(runif(1000, 1, 7)) # uniform numbers are real numbers and we want to obtain discrete 1-6 values
> s <- ceiling(runif(1000, 0, 6)) # alternative giving the same discrete 1-6 values
2.3 Control structures in R (flow of control)
2.3.1 General considerations on coding in general and R coding in particular
A part of this tutorial is dedicated to script development. Most of the time the script is built as a short
list of commands aiming at processing, analyzing and plotting data. The script itself is a text file whose
name ends with ".r". This suffix is not mandatory but strongly recommended to avoid confusion with other
file types. A script can be written using the R-GUI or any kind of text editor, provided that the file is saved
as plain text (without formatting). Once the file containing the script is saved in a given folder, the current
R working directory has to be set to this folder. There are 2 possibilities to set a working directory. The first
makes use of the R-GUI: in the main menu, click on the Misc/Change working directory item and then select
the folder in which the scripts are created. The other possibility makes use of the R command line:
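A minimal sketch of the command-line approach, based on getwd() and setwd(). The temporary folder and the file name my_first_script.r are stand-ins chosen so that the example runs anywhere; in practice the path of the folder that holds the scripts is given to setwd().

```r
old_dir <- getwd()              # current working directory
script_dir <- tempdir()         # stand-in for the folder holding the scripts
setwd(script_dir)

# A one-line script is written here only to have something to run.
writeLines("x_from_script <- 6 * 7", "my_first_script.r")
print(list.files(pattern = "\\.r$"))   # scripts visible from the working directory

source("my_first_script.r")     # runs the script
print(x_from_script)

setwd(old_dir)                  # back to the initial directory
```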
Once the working directory is set, the scripts can be run with the source() command. An example is
given in Figure 8.
Figure 8: my first R script.
To write a script several considerations have to be taken into account. As a start the first lines of the script
should look like this:
This is good practice, especially when writing a substantial script that other people may use. The R developer
should also keep the following points in mind:
• Choose adequate variable names (avoid names that are too short, too long or misleading)
• Whatever language is chosen (French, English, Swahili ...) always use the same one in all the code
2.3.2 What is a code block ?
In R (as in many other languages like C, C++, Java ...) a code block is a piece of code which is grouped
together and delimited by the { and } symbols (curly braces). This entity is the building block of
structured programming, where control structures are formed out of blocks.
2.3.3 Tests
As previously mentioned in 2.2.1, it is possible to execute conditional tasks thanks to logical tests. The
result of a test is generally TRUE or FALSE and, according to this result, the script will select one processing
instead of another:
a <- 42
b <- 25
# tests whether a is superior or equal to b: if TRUE the block is executed,
# if FALSE the execution continues after the block
if (a >= b) {
  print("a superior or equal to b")
}
...
Note: to test for equality the "=" symbol has to be doubled. Example:
if (a == b) {
  ...
}
The test operators are detailed in 2.2.1. The else command allows alternative processing. For instance, the
case where b is superior to a has to be considered in the previous script:
a <- 12
b <- 25
# tests whether a is superior or equal to b: if TRUE the first block is executed,
# if FALSE the block following else is executed
if (a >= b) {
  print("a superior or equal to b")
} else { # the whole command "} else {" has to be on the same line
  print("b superior to a")
}
2.3.4 Functions
One of the most interesting features of coding is the possibility to create functions. For instance, if a
statistical processing requires 20 successive steps, each involving a different command applied to the preceding
result, it is very tedious to enter all these commands each time on a data set. The first good idea is therefore
to write a script and, if this piece of code is often used, the second good idea is to create a function out
of the script. The following minVal() function for example returns the minimal value of a vector argument
(Figure 9).
Once the file is properly saved in the working directory, it can be run from the R console:
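The code of Figure 9 is not reproduced here. The sketch below is one possible way to write such a minVal() function and to call it; its body and its argument name are assumptions and not necessarily those of the figure.

```r
# Hypothetical implementation: scans the vector and keeps the smallest value.
minVal <- function(values) {
  smallest <- values[1]
  for (v in values) {
    if (v < smallest) {
      smallest <- v
    }
  }
  return(smallest)
}

print(minVal(c(12, 4, 9, 7)))
```

In everyday use the built-in min() function does the same job; writing it by hand only serves to show the structure of a function.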
Arguments and return values are optional. The following function is perfectly correct (though not very
useful):
myPoorFunction <- function() {
  print("nothing")
}
> myPoorFunction()
[1] "nothing"
Of course any kind of data structure can be taken as an argument or returned by a function. It only depends
on the specific needs.
The apply() function is powerful and complementary to other R functions. It permits serial processing of
arrays of values:
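A minimal sketch of apply() on a small matrix. The second argument selects the dimension to scan (1 for rows, 2 for columns) and the third is the function to run on each row or column; the matrix and the variable names are arbitrary.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

row_sums <- apply(m, 1, sum)      # one sum per row
col_means <- apply(m, 2, mean)    # one mean per column
col_range <- apply(m, 2, function(column) max(column) - min(column))

print(row_sums)
print(col_means)
print(col_range)
```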
2.3.5 Loops
One of the main interests of scripts is the possibility to automate and repeat tedious data processing, and
loops allow controlled repetitive operations. For instance, the "for" loop can be used to repeat a given piece
of code n times:
for (i in 1:6) {
  print(paste("hello", i))
}
# result:
[1] "hello 1"
[1] "hello 2"
[1] "hello 3"
[1] "hello 4"
[1] "hello 5"
[1] "hello 6"
A "for" loop conceptually defines 3 elements: (i) the counting variable and its initial value, (ii) the loop stop
condition, and (iii) the increment in the variable's value. In R these 3 elements are all given by the series
which follows the keyword "in": the variable takes in turn each value of the interval 1:6 (note that the
parentheses are mandatory in this syntax). Different types of variable series can be used:
a <- c("a1", "f1", "lll")
for (gobbledigook in a) {
  print(gobbledigook)
}
# result:
[1] "a1"
[1] "f1"
[1] "lll"
Loops can be nested. For example, if all the coordinates in a 2 rows x 3 columns [9] matrix have to be
displayed, the following script can be used:
m <- NULL
for (y in 1:2) {
  l <- NULL
  for (x in 1:3) {
    l <- c(l, paste(y, x, sep = ","))
  }
  m <- rbind(m, l)
}
print(m) # displays the matrix of coordinates
If the number of loop iterations is not known in advance, the "while" structure is preferable. It repeats
its internal block as long as the condition defined in the header is true. The previous "for" loop counting
from 1 to 6 can thus be implemented this way:
m <- 0
while (m < 6) {
  m <- m + 1
  print(m)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[9] It is conventional to define the first coordinate as the row and the second as the column.
The "while" loops can be infinite unless a "break" instruction is applied. In the following example, the loop
iterates as long as the m value is lower than or equal to 6. When it reaches 7 the loop is broken.
m <- 0
while (TRUE) {
  m <- m + 1
  if (m > 6) {
    break
  }
  print(m)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
2.3.6 User input and output functions
The print() function provides a handy way to produce script output, but it can also be necessary for a script
to request user input. This can be done with the scan() function:
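A minimal sketch of such a request. In an interactive session scan() waits for values typed at the keyboard; as an assumption made for this example, a fixed string is parsed instead when the script runs non-interactively, so that it never blocks.

```r
if (interactive()) {
  values <- scan(n = 3)             # waits for 3 numbers typed by the user
} else {
  values <- scan(text = "4 8 15")   # stand-in input for non-interactive runs
}

total <- sum(values)
print(total)
```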
2.4 Input and output files
Most of the time, the data to analyze are stored in files whose format depends on the software which produced
them. R has many libraries and packages and can thus read a lot of different data file formats, but in this
tutorial we will focus on the manipulation of ASCII or unicode text files instead of binary or proprietary file
formats. The simplest way to read data from a file and store them in a variable consists in using the
read.table() function. If a file named "data1.txt" located in the working directory contains:
1 0.1 12
2 0.2 14
3 0.3 11
> res1 <- read.table("data1.txt")
> res1
  V1  V2 V3
1  1 0.1 12
2  2 0.2 14
3  3 0.3 11
The read.table() function returns a data.frame object. Arbitrary column names and row numbers are thus
added to the variable. If a "data2.txt" file contains an explicit header like this:
A B C
1 1 0.1 12
2 2 0.2 14
3 3 0.3 11
The optional argument header=TRUE can be added to set the column names directly:
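A possible illustration, in which a temporary file stands in for "data2.txt" so that the example is self-contained:

```r
data_file <- tempfile(fileext = ".txt")   # stand-in for "data2.txt"
writeLines(c("A B C",
             "1 1 0.1 12",
             "2 2 0.2 14",
             "3 3 0.3 11"), data_file)

# The header line has one field less than the data lines, so the first
# column of the file is used for the row names.
res2 <- read.table(data_file, header = TRUE)
print(res2)
print(colnames(res2))
```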
If the content of a variable has to be saved to a file, the output command equivalent to read.table() is
write.table():
> res2
  A   B  C
1 1 0.1 12
2 2 0.2 14
3 3 0.3 11
> res2[2, 2] <- 0.55 # changes the content of the current variable res2
> res2
  A    B  C
1 1 0.10 12
2 2 0.55 14
3 3 0.30 11
> write.table(res2, file = "data2.txt", col.names = TRUE) # writes the new content of res2 to the same file
The new "data2.txt" file now contains:
The csv file format allows data exchanges between Excel, LibreOffice ... and R. For instance, an Excel
worksheet can be saved with the file name "data3.csv" (Figure 10) and then read in R:
# this file uses ";" as field separator, so the sep argument has to be specified (the default value of read.csv() is ",")
> res3 <- read.csv("data3.csv", sep = ";")
> res3
    name mark
1   truc   12
2 machin   13
3 bidule   11
> res3$mark[1] * 2
[1] 24
> res3$name
[1] truc   machin bidule
Levels: bidule machin truc
2.5 R graphic libraries
One of the main advantages of R lies in its strong graphic capabilities. This tutorial will give an overview of
some basic graphs, but many other 2D and 3D graphs, animations and even interactive graphs can be obtained
thanks to the R packages [11][12].
[11] http://www.cookbook-r.com/Graphs/
[12] http://www.statmethods.net/
2.5.1 How to create and manage graphic windows?
The first step to create a chart is to build a blank graphic window using the function dev.new() (Figure 11).
Many different windows can be visualized at the same time. Each time a new graph window is required the
function dev.new() must be used. When several graph windows are open, it is possible to select the current
one with the dev.set() function combined with dev.next() and dev.prev(). For example, if the current graph
is number 5 and some data have to be plotted in window number 3:
> dev.new() # quartz() for MacOS, X11() for Linux and windows() for Windows are possible alternative commands
> dev.new()
> dev.new()
> dev.new()
> dev.cur()
quartz
     5 # the graph window number starts from 2
> dev.set(dev.prev())
quartz
     4
> dev.set(dev.prev())
quartz
     3
> dev.cur()
quartz
     3
The par() function permits the plotting of several figures on the same graph:
dev.new()
par(mfrow = c(2, 3)) # indicates the dimensions of the layout (2 rows and 3 columns)
plot(1:1) # fake data ... just to show
title("title 1")
plot(1:2)
title("title 2")
plot(1:3)
title("title 3")
plot(1:4)
title("title 4")
plot(1:5)
title("title 5")
plot(1:6)
title("title 6")
This script plots a main window and 6 subparts, each containing a different plot and title as shown in Figure 12.
2.5.2 Line and point charts
The generic R functions for plotting R objects are plot() and matplot(). These commands can produce simple
and multiple curves and individual point graphs. For instance:
dev.new()
x <- 0:64 / 64
y1 <- sin(3 * pi * x)
y2 <- sin(2 * pi * x)
y <- cbind(y1, y2)
# here the main argument is the graph title
matplot(x, y, type = "l", col = c("blue", "red"), lty = 1, main = "Points, lines and legends")
points(x, y1, bg = "white", pch = 21)
points(x, y2, bg = "white", pch = 21)
legend("bottomleft", legend = c("sin(3*pi*x)", "sin(2*pi*x)"), lty = 1, col = c("blue", "red"))
Legends are necessary to obtain clear and understandable graphs especially in case of multiple plots. As
shown above, the legend() function takes the following arguments:
2.5.3 Histogram
A visual control of the data distribution is often very useful to describe a population. The hist() function
takes a series of data and plots its distribution histogram:
d1 <- rnorm(1000, 10, 2) # 1000 random values generated with mean=10 and sd=2 (gaussian)
d2 <- runif(1000, 8, 12) # 1000 random values generated in the interval [8, 12) (uniform)
dev.new()
par(mfrow = c(1, 3))
hist(d1, main = "Normal distribution")
hist(d2, main = "Uniform distribution")
hist(d1, main = "N=50 breaks", breaks = 50)
Figure 15-left shows 2 examples of different distributions. The "breaks" argument defines the (approximate)
number of intervals of the histogram and Figure 15-right shows its effect on the plot (here it is set to 50).
The second argument to consider in the hist() function is "right". Its default value is TRUE, which means
that the histogram cells are right-closed (left-open) intervals. It is sometimes clearer to explicitly set the
intervals of the histogram. The following script plots the histogram of the same series with different
settings (Figure 16).
a1 <- c(2, 1, 3, 2, 4, 3, 2, 3, 1, 2, 4, 3, 4, 5, 5, 6, 2, 4, 1, 6)
dev.new()
par(mfrow = c(1, 3))
h1 <- hist(a1, main = "Default")
h2 <- hist(a1, breaks = 1:7, main = "Breaks=1:7")
h3 <- hist(a1, breaks = 1:7, right = FALSE, ylim = c(0, 8), main = "Breaks=1:7, Right=FALSE")
print("h1$counts"); print(h1$counts) # number of occurrences in each cell
print("h2$counts"); print(h2$counts)
print("h3$counts"); print(h3$counts)
# The table() method returns the contingency table of the counts at each
# combination of factor levels of a variable
print("table(a1)"); print(table(a1)) # equivalent to h3$counts in this case
Figure 16: Effect of the "right" & "breaks" arguments combination in the hist() function.
2.5.4 Whisker box
The box-and-whisker plot is particularly adapted to the representation of small samples or of data that
are not normally distributed. In both cases the use of median and quartiles instead of mean and standard
error or standard deviation is meaningful. This graph contains much important information regarding central
tendency and dispersion as shown in Figure 17 [13]. Note that the [Q1 - 1.5xIQR; Q3 + 1.5xIQR] interval is
used to select the most extreme actual values, which will be used as whisker edges (the default range value
is 1.5). The R function is boxplot() and it can be used this way:
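The original script is not reproduced here. The sketch below is a hypothetical reconstruction: the labels "ND 5,1" and "ND 8,2" come from Figure 18 and are assumed to denote normal distributions (mean 5, sd 1 and mean 8, sd 2), while the uniform and Poisson series, the sample size and the seed are assumptions.

```r
set.seed(42)                     # only to make the sketch repeatable
series <- list(
  "ND 5,1"  = rnorm(200, mean = 5, sd = 1),
  "ND 8,2"  = rnorm(200, mean = 8, sd = 2),
  "UD 2,10" = runif(200, min = 2, max = 10),   # hypothetical third series
  "PD 4"    = rpois(200, lambda = 4)           # hypothetical fourth series
)

dev.new()
b <- boxplot(series, main = "Box-and-whisker plot", range = 1.5)

# For each series: lower whisker, Q1, median, Q3 and upper whisker.
print(b$stats)
```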
The above script generates 4 differently distributed series and plots the box-and-whisker graph
(Figure 18). If the outliers are numerous (Figure 18: ND 5,1 and ND 8,2) or not needed, the outline=FALSE
argument can be used to hide them.
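A minimal sketch of such a script is given below. The four series, their sizes and their labels are hypothetical choices made for illustration (only the "ND 5,1" and "ND 8,2" names come from Figure 18), and the seed is arbitrary. The last lines recompute the whisker edges by hand from the values returned by boxplot.stats(), which uses the hinges as quartile estimates.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Four hypothetical series with different distributions
series <- list(
  "ND 5,1" = rnorm(200, mean = 5, sd = 1),
  "ND 8,2" = rnorm(200, mean = 8, sd = 2),
  "Unif 0-10" = runif(200, min = 0, max = 10),
  "Exp 0.5" = rexp(200, rate = 0.5)
)

par(mfrow = c(1, 2))
b1 <- boxplot(series, main = "Default")
b2 <- boxplot(series, outline = FALSE, main = "outline = FALSE")  # outliers not drawn

# Whisker edges of the first series, recomputed by hand
x <- series[[1]]
st <- boxplot.stats(x, coef = 1.5)$stats  # lower whisker, Q1, median, Q3, upper whisker
iqr <- st[4] - st[2]
fenceLow <- st[2] - 1.5 * iqr
fenceHigh <- st[4] + 1.5 * iqr
whiskers <- range(x[x >= fenceLow & x <= fenceHigh])
print(whiskers)
print(st[c(1, 5)])
```

The whiskers are the most extreme observations that still lie inside the fences, not the fences themselves.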
13 http://www.unc.edu/ nielsen/soci708/m3/m3.htm
Figure 18: Box-and-whisker plot sample.
Figure 19: Pie chart plot sample.
dev.new()
barplot(c(1, 2, 3))
These command lines create a simple bar plot as shown in Figure 20.
Figure 20: Barplot sample.
Most of the time bar plots come along with error bars (corresponding to standard deviation, standard error
of the mean ...). Such bar plots are more difficult to create because they require the use of another function:
arrows(). Indeed there is no automated graph command which plots at the same time mean + error bars
from a set of data (at least not in the R base package). The following script provides a function which can
automatically plot such graphs:
# barValues and errorValues are matrices with column: group, row: value in
# each group
myBarPlot <- function(barValues, errorValues, titleText) {
  # The y axis is scaled from the arguments so that every error bar fits
  barx <- barplot(barValues, beside = TRUE, ylim = c(0, max(barValues + errorValues)))
  # Draw each error bar from the top of the bar upwards, with a flat cap
  arrows(barx, barValues + errorValues, barx, barValues, angle = 90,
         code = 1, length = 0.05)
  title(titleText)
}
dev.new()
par(mfrow = c(1, 2))
m1 <- matrix(1:3, nrow = 1) # bar values
sem1 <- matrix(3:1, nrow = 1) # error bars values
myBarPlot(m1, sem1, "Simple series")
m2 <- matrix(1:6, nrow = 2) # bar values
sem2 <- matrix(6:1, nrow = 2) # error bars values
myBarPlot(m2, sem2, "3 groups of 2 values")
Figure 21: Barplots and error bars.
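In practice the bar heights and error bars are usually computed from raw measurements. The following sketch assumes a hypothetical data set of 3 groups of 10 values (the group names, sample sizes and seed are arbitrary) and computes the mean and the standard error of the mean of each group before plotting them with barplot() and arrows().

```r
set.seed(2)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical raw data: 3 groups of 10 measurements
raw <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  value = c(rnorm(10, mean = 5, sd = 1),
            rnorm(10, mean = 7, sd = 1.5),
            rnorm(10, mean = 6, sd = 0.5))
)

# Mean and standard error of the mean (sd / sqrt(n)) for each group
means <- tapply(raw$value, raw$group, mean)
sems <- tapply(raw$value, raw$group, function(v) sd(v) / sqrt(length(v)))

# barplot() returns the x position of each bar, which is reused by arrows()
barx <- barplot(means, ylim = c(0, max(means + sems) * 1.1), main = "Mean +/- SEM")
arrows(barx, means - sems, barx, means + sems, angle = 90, code = 3, length = 0.05)
```

Here code = 3 draws a cap at both ends of each bar, so the interval mean - SEM to mean + SEM is shown in full.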
The ggplot2 package provides new functionalities to produce high-quality, enriched figures that can be used to illustrate research
papers, as shown in Figure 22.
The graph construction is slightly different from the base plot approach, and we are going to test this package
on an example (footnote 14). Download the data_ggplot2.txt data file from http://agbx.byethost9.com/data/ into your
working directory, store its content in a variable and load the ggplot2 package:
p0 <- ggplot(data = d1, aes(x = Time, y = Abs, colour = Cond, shape = Cond))
In this instruction, aes refers to the aesthetic parameters. In this case, the Time column is used for the x
axis and the Abs column for the y axis. The colour and shape of the dots are automatically assigned
from the levels detected in the Cond column. At this point p0 is just an empty figure, as shown in Figure 23:
p0
The content of the figure has then to be specified. The "+" operator is used to add supplementary description
to the initial p0 figure. If simple points have to be added to the plot the geom_point() function will be used:
As shown in Figure 24, the size parameter sets the point size.
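The point layer described above can be sketched as follows. Since the content of data_ggplot2.txt is not shown here, d1 is replaced by a small hypothetical data frame with the same three columns (Time, Abs, Cond); the values and the condition names are invented for illustration. The commented read.table() call is only an assumption about how the file could be loaded.

```r
library(ggplot2)

# d1 <- read.table("data_ggplot2.txt", header = TRUE)  # assumed file format
# Hypothetical stand-in for the content of data_ggplot2.txt
d1 <- data.frame(
  Time = rep(0:5, times = 2),
  Abs = c(0.00, 0.12, 0.22, 0.30, 0.36, 0.40,
          0.00, 0.05, 0.11, 0.16, 0.20, 0.23),
  Cond = rep(c("Control", "Treated"), each = 6)
)

# Empty figure: only the data and the aesthetic mapping are declared
p0 <- ggplot(data = d1, aes(x = Time, y = Abs, colour = Cond, shape = Cond))

# The "+" operator adds a layer of points of size 4
pPoints <- p0 + geom_point(size = 4)
print(pPoints)
```

The p0 object is left unchanged by the addition, so several variants of the figure can be built from the same base.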
Once the ggplot2 principle is understood, it is possible to add many additional features such as a continuous
line (Figure 25-A):
p1 <- p0 + ggtitle("My beautiful enzyme") + geom_line(size = 2)
p2 <- ggplot(data = d1, aes(x = Time, y = Abs, colour = Cond, shape = Cond)) +
  geom_point(size = 4)
p2 <- p2 + geom_smooth() # smoothing data using confidence interval
p3 <- ggplot(data = d1, aes(x = Time, y = Abs, fill = Cond)) +
  geom_bar(aes(y = Abs), stat = "identity", position = "dodge")
It is also possible to automatically smooth a series of values and display its confidence interval, or to change the plot
type to a bar plot, as shown above with p2 and p3 respectively (Figure 25-B-C).
A very common way to visualize spike trains in neuroscience is to use raster plots. This can be done with
the ggplot2 library as shown in Figure 26-A-B:
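One possible way to build a raster plot with ggplot2 is sketched below. The spike trains are simulated (10 hypothetical trials with uniformly distributed spike times in a 1 s window, arbitrary seed), and each spike is drawn as a short vertical segment with geom_segment(); this is an illustrative choice and not necessarily the approach used for Figure 26.

```r
library(ggplot2)
set.seed(3)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical spike trains: 10 trials, random spike times within 1 s
nTrials <- 10
spikes <- do.call(rbind, lapply(1:nTrials, function(trial) {
  n <- rpois(1, lambda = 20)  # number of spikes in this trial
  data.frame(trial = rep(trial, n), time = sort(runif(n, min = 0, max = 1)))
}))

# One short vertical tick per spike, one row per trial
raster <- ggplot(spikes, aes(x = time, xend = time,
                             y = trial - 0.4, yend = trial + 0.4)) +
  geom_segment() +
  scale_y_continuous(breaks = 1:nTrials) +
  labs(x = "Time (s)", y = "Trial", title = "Raster plot")
print(raster)
```

Each row of the spikes data frame is one spike, which is the long format that ggplot2 expects.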
Figure 25: ggplot2 curve samples: p1, p2 and p3.
3 Applications
3.1 R variables
i)
> v1
[1] 101 102 103 104 105 106 107 108 109 110 111 112
> v2
[1] 20 18 16 14 12 10
> v3
[1] 4 6 3 4 6 3 4 6 3 4 6 3
> v4
[1] 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
> v5
[1] 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 3 3 3 3 3
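The vectors printed above can be obtained in several ways. The following sketch shows one possible set of commands, using only base R sequence and repetition functions; the exercise may well expect different functions.

```r
v1 <- 101:112                              # consecutive integers
v2 <- seq(from = 20, to = 10, by = -2)     # decreasing sequence with step -2
v3 <- rep(c(4, 6, 3), times = 4)           # whole pattern repeated 4 times
v4 <- sequence(6:1)                        # 1:6, then 1:5, ..., then 1
v5 <- rep(c(4, 6, 3), times = c(8, 7, 5))  # each value repeated a given number of times

print(v1); print(v2); print(v3); print(v4); print(v5)
```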
ii)
> v6
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20
[6,]   21   22   23   24
> v7
[1] 1 0 1 0 1 0 1 0 1 0 1 0
> v8
       Value_1 Value_2
Rank_1       3       6
Rank_2       9      12
Rank_3      15      18
> v9
            Rank Result_1 Result_2
Candidate_1    A       11     14.0
Candidate_2    B       12     12.5
Candidate_3    C       14     18.0
> v10
  e1   e2  e3
1 AA 11.0 110
2 AA 12.0 120
3 AA  9.0 115
4 BB 12.5  95
5 BB 10.0 140
6 BB  6.0 199
7 BB  8.0  55
3.1.2 Variable manipulation and selection
ii) Using conditional operators and R commands, answer the following questions:
[1] "student1" "student3" "student5"
• What is the sum of all the mark1 values of the students whose mark2 is lower than 10?
[1] 26
[1] 3
> source("test.r")
[1] 1 1 1
[1] 2 4 8
[1] 3 9 27
[1] 4 16 64
[1] 5 25 125
[1] 6 36 216
ii)
> source("test.r") # the only difference with the previous one is the "string"
                   # format of the output
[1] "1 1 1"
[1] "2 4 8"
[1] "3 9 27"
[1] "4 16 64"
[1] "5 25 125"
[1] "6 36 216"
iii)
i)
> source("test.r")
[1] "Enter 0 to stop"
1: 4 1 2 5
5: # press enter
Read 4 items
1: # press enter
Read 0 items
[1] "Sum the values:"
[1] 12
[1] "Average value:"
[1] 3
[1] "Sorted values:"
[1] 1 2 4 5
ii)
> source("test.r")
> facto (0)
[1] 1
> facto (1)
[1] 1
> facto (2)
[1] 2
> facto (5)
[1] 120
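A function producing the outputs above could be written recursively, as in the following sketch. Only the name facto comes from the session shown; the argument check and the recursive formulation are illustrative choices, and an iterative loop or prod(1:n) would work as well.

```r
# Factorial of a non-negative integer n, computed recursively
facto <- function(n) {
  stopifnot(n >= 0, n == round(n))  # reject negative or non-integer input
  if (n <= 1) {
    return(1)                       # 0! = 1! = 1 ends the recursion
  }
  n * facto(n - 1)
}

print(facto(0))
print(facto(5))
```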