Regression Methods for Astrophysics
D. Fraix-Burnet and D. Valls-Gabaud (eds)
EAS Publications Series, 66 (2014) 11–18
PRACTICAL SESSION: INTRODUCTION TO R
M. Clausel1 and G. Grégoire 1
Abstract. An introduction to R is proposed. This pratical session is
an excerpt from practical exercises proposed by A. Dalalyan at EPNC
(see http://certis.enpc.fr/∼dalalyan/Download/TP ENPC 1.pdf).
Datas are also extracted from a practical session proposed by hydrologic
data from Amazonia proposed by D. Chessel A.B. Dufour in Lyon 1
(website indicated in the text below) and from other practical exercises
proposed by A. Dalalyan at ENPC (same address as above but ended
by /TP ENPC 4.pdf).
1
More about R
• The R project at http://www.r-project.org is an open-source community effort to develop a free software environment, called R, for statistical
computing and graphics.
• R is also a programming language and currently the lingua franca for statistical research and data analysis.
• It was started in 1993 by the statisticians Ross Ihaka and Robert Gentleman
and has become extremely popular in both academy and industry with an
estimated users base of one million people worldwide. See a couple of articles
about R at the New York times here and here.
• The R software can be freely downloaded and installed in Linux, Windows
and Mac computers from http://cran.r-project.org
• A new stable version is released every year about march/april and development versions are updated on daily basis. Last major release (version 3.0.0)
took place on April 3rd, 2013.
1
Laboratory LJK, Grenoble University, BP. 53, 38041 Grenoble Cedex 09, France
c EAS, EDP Sciences 2015
DOI: 10.1051/eas/1466002
12
Regression Methods for Astrophysics
Some books and websites
• More than 80 books about R:
– Specific collection Springer Use R
– Introductory Statistics with R (Dalgaard)
– The R book (Crawley)
– Linear models with R (Faraway)
– Software for data analysis: programming with R (Chambers).
• Some web sites: the R home page http://cran.r-project.org/ (Manuals,
Task view, RSiteSearch, Mailing lists, FAQ, . . . ).
2
First steps
1. To know more about R:
? help
help ( rep )
help (demo)
demo( graphics )
2. Simple manipulations: numbers and vectors:
p i ∗sqrt (10)+exp ( 4 )
3:10 # generating regu la r sequence
seq ( 3 , 1 0 ) # g e n e r a t i n g r e g u l a r s e q u e n c e
x = c (2 ,3 ,5 ,7 ,2 ,1) # assigns a value to a vector
y = c (10 ,15 ,12)
z = c(x , y)
# c o n c a t e n a t i o n o f x and y
z ˆ2
x∗x
w=rep ( x , 3 )
w=rep ( x , each =3)
ls ()
# current variables ?
rm( x )
x
ls ()
x
3. Arrays and matrices
x1 = 1 : 9
dim( x1 ) = c ( 3 , 3 ) # g i v e s t o x t h e dim a t t r i b u t e t h a t
# a l l o w s i t t o be t r e a t e d as a 3X3 a r r a y
?dim
x1
d e t ( x1 )
eigen ( x1 )
M. Clausel and G. Grégoire: Practical Session: Introduction to R
13
x2 = matrix ( seq ( 1 , 6 , by=1) , ncol =2)
m1 = cbind ( x2 , rep ( 0 , 3 ) )
m2 = rbind ( x1 , rep ( 0 , 2 ) )
y= matrix ( 1 : 1 2 , nrow=3 , byrow=T)
t(y)
z = matrix ( 1 : 4 , nrow=2 , byrow=T)
z ˆ2
z∗z
z %∗% z
z i n v=solve ( z )
z %∗% z i n v
4. Dataframe
A dataframe is a useful object to put together statistical observations of
several variables. The variables don’t need to be of same class. They can
be numeric, logical, character, factor. Thus a dataframe can be seen as a
matrix where each column is a variable and each row is an observation. The
lengths of the columns are all the same. Rows and columns can be named.
The function data.frame is used to create a dataframe. Try:
df d i a b <− data . frame ( s i z e=c ( 1 6 3 , 1 8 6 , 1 6 6 , 1 7 6 , 1 6 9 ) ,
s p o r t=c (TRUE,TRUE, FALSE,TRUE, FALSE) ,
s e x=c ( ”F” , ”H” , ”F” , ”H” , ”F” ) )
df d i a b
df d i a b $ s i z e
c l a s s ( df d i a b $ s i z e )
c l a s s ( df d i a b $ s p o r t )
c l a s s ( df d i a b $ s e x )
5. Libraries and packages:
• To see which packages are installed at your site, issue the command
library ( )
• To load a particular package already installed
l i b r a r y (MASS)
• Users connected to the Internet can use the install.packages() and update.packages() functions. Install the following packages from internet:
r e p o s i t o r y <− ” h t t p : // c r a n . univ−l y o n 1 . f r /”
i n s t a l l . packages ( ” UsingR ” , r e p o s=r e p o s i t o r y )
i n s t a l l . packages ( ” d a t a s e t s ” , r e p o s=r e p o s i t o r y )
14
Regression Methods for Astrophysics
install
install
install
install
install
. packages ( ”DAAG” , r e p o s=r e p o s i t o r y )
. packages ( ” l e a p s ” , r e p o s=r e p o s i t o r y )
. packages ( ” c a r ” , r e p o s=r e p o s i t o r y )
. packages ( ”HH” , r e p o s=r e p o s i t o r y )
. packages ( ” e d R G r a p h i c a l T o o l s ” , r e p o s=r e p o s i t o r y )
6. Graphics:
x = runif ( 5 0 , 0 , 2 )
y = runif ( 5 0 , 0 , 2 )
plot ( x , y , main=” T i t r e ” , x l a b=” a b s c i s s a ” ,
y l a b=” o r d i n a t e ” , c o l=” d a r k r e d ” )
abline ( h =.6 , v =.6)
text ( . 6 , . 6 , ” comment t h e graph ! ” )
colors ( )
7. Functions with R:
• A very simple example:
c a r r e = function ( x ) { x ˆ2 }
carre (3)
carre
• More complicated:
h i s t . norm=function ( n , c o l )
{
x = rnorm( n )
h = h i s t ( x , plot=F)
s = sd ( x )
m = mean( x )
ylim = c ( 0 , 1 . 2 ∗max(max( h$density ) , 1 / ( s ∗sqrt ( 2 ∗ p i ) ) ) )
x l a b = ” Histogram and normal a p p r o x i m a t i o n ”
ylab = ””
main = paste ( ” Gaussian sample : n=” , n )
h i s t ( x , f r e q=F , ylim=ylim , x l a b=xlab , y l a b=ylab ,
c o l=col , main=main )
curve (dnorm( x ,m, s ) , add=T, lwd=2)
}
op=par ( m f c o l=c ( 1 , 3 ) )
h i s t . norm ( 2 0 0 , c o l=” y e l l o w ” )
h i s t . norm ( 8 0 0 , c o l=” d a r k g o l d e n r o d ” )
h i s t . norm ( 3 2 0 0 , c o l=” b l u e ” )
par ( op )
M. Clausel and G. Grégoire: Practical Session: Introduction to R
3
15
Random distributions
How generate (pseudo–)random numbers...
1. Try!:
rnorm ( 1 0 )
# g e n e r a t e s 10 r e a l i z a t i o n s
# o f t h e s t a n d a r d Gaussian d i s t r i b u t i o n N( 0 , 1 )
rnorm ( 1 0 )
rnorm ( 1 0 )
plot (rnorm ( 1 0 0 ) )
rbinom ( 1 0 , s i z e =20 , prob =.5)
rcauchy ( 1 0 )
runif ( 1 0 , min=0 , max=1)
sample ( 1 : 4 0 , 5 )
sample ( 1 : 1 0 , 1 0 , replace=T)
sample ( c ( ‘ ‘ f a i l ’ ’ , ‘ ‘ s u c c e s s ’ ’ ) , 1 0 , replace=T,
prob=c ( 0 . 7 , 0 . 3 ) )
2. Some classical distributions: beta, binom, cauchy, chisq, exp, f, gamma,
norm, pois, t, unif. Prefixes r, d, p and q are used to get pseudo-random
samples, probability density functions, cumulative distribution functions and
quantiles. Try:
x = seq ( −4 ,4 ,by=0.01)
y = dnorm( x )
plot ( x , y , type=” l ” )
pnorm ( 1 . 6 5 )
y = pnorm( x )
abline ( h =0.0 , v =0.0)
plot ( x , y , type=” l ” )
qnorm ( 0 . 9 5 )
4
Empirical description
1. Order statistics:
x = rnorm ( 1 0 ) #
y = sort ( x )
#
Echantillon i . i . d .
order s t a t i s t i c s
2. Empirical cumulative distribution:
x = rnorm ( 1 0 0 )
n=length ( x )
plot ( sort ( x ) , 1 : n/n , type=” s ” , ylim=c ( 0 , 1 ) ,
x l a b=” ” , y l a b=” ” )
?pnorm
curve (pnorm( x , 0 , 1 ) , add=T, c o l=” b l u e ” )
16
Regression Methods for Astrophysics
3. Histogram and density:
x = rnorm ( 1 0 0 )
h i s t ( x , b r e a k s =20)
h i s t ( x , b r e a k s =20 , f r e q=F , c o l=” cyan ” )
curve (dnorm( x ) , add=T, c o l=” d a r k b l u e ” )
x = rnorm ( 5 0 )
h = h i s t ( x , plot=F)
h$ b r e a k s
h$ c o u n t s
? hist
x = rnorm( 1 0 0 0 )
h = h i s t ( x , f r e q=F)
l i n e s ( density ( x ) )
? density
4. Boxplot: Box plots are formed by:
• Vertical axis: the response variable.
• Horizontal axis: the factor of interest. More specifically, one
– Calculates the median and the quartiles(the lower quartile is the
25th percentile and the upper quartile is the 75th percentile).
– Plots a symbol at the median (or draw a line) and draw a box
(hence the name–box plot) between the lower and upper quartiles;
this box represents the middle 50% of the data– that is the “body”
of the data.
– Draws a line from the lower quartile to the minimum point and
another line from the upper quartile to the maximum point. Typically a symbol is drawn at these minimum and maximum points,
although this is optional. Thus the box plot identifies the middle
50% of the data, the median, and the extreme points.
5. Box plots with R:
x = rnorm ( 1 0 0 )
par ( m f c o l=c ( 2 , 2 ) , bg=” l i g h t c y a n ” )
boxplot ( x )
boxplot ( x , h o r i z o n t a l=T)
boxplot ( x , c o l=” r e d ” )
boxplot ( x , c o l=” o r a n g e ” , b o r d e r=” d a r k b l u e ” )
6. Parallel boxplots:
x = rnorm ( 1 0 0 )
y = (rnorm(400))ˆ2 −1
z= rnorm ( 5 0 ) ˆ 3
par ( bg=” l i g h t c y a n ” )
M. Clausel and G. Grégoire: Practical Session: Introduction to R
17
boxplot ( x , y , z , c o l=c ( ” b l u e ” , ” w h i t e ” , ” r e d ” ) ,
b o r d e r=c ( ” b l a c k ” , ” d a r k b l u e ” ) )
7. QQ-plots:
x = rnorm ( 1 0 0 )
y = (rnorm(400))ˆ2 −1
z = rnorm ( 2 0 0 ,m=4 ,sd=5)
par ( bg=” l i g h t c y a n ” , mfrow=c ( 2 , 2 ) )
qqplot ( x , y , pch =21 , bg=” r e d ” , f g=” d a r k b l u e ” )
qqplot ( x , z , pch =21 , bg=” r e d ” , f g=” d a r k b l u e ” )
qqnorm( y , pch =21 , bg=” o r a n g e ” , f g=” d a r k b l u e ” )
qqline ( y , pch =21 , c o l=” b l u e ” )
qqnorm( z , pch =21 , bg=” o r a n g e ” , f g=” d a r k b l u e ” )
qqline ( z , pch =21 , c o l=” b l u e ” )
5
Reading data in a file
Some data sets are furnished with R.
data ( ) # l i s t o f d a t a s e t s
?WWWusage # d e s c r i p t i o n o f t h e d a t a WWWusage
plot (WWWusage)
R can also read data from an external file using the command read.table. Here
is an example for a file with a .data extension
# read the data
Donnees = read . table ( ” A i r Q u a l i t y . data ” )
summary( Donnees ) # summary d a t a
h i s t ( Donnees$Ozone , c o l=” g o l d ” ) # h i s t o g r a m o f t h e
# v a r i a b l e Ozone
attach ( Donnees ) # t h e name o f t h e d a t a s e t can be o m i t t e d
h i s t ( Ozone , f r e q=F , c o l=” g o l d ” )
detach ( Donnees )
h i s t ( Ozone , f r e q=F , c o l=” g o l d ” )
or with the .txt extension
data=read . table ( ” . /TPRegMult/ p o l l u . t x t ” )
if the path of the directory where are located the data is ”./TPRegMult/”
Reading data from a website:
data=read . table ( ” h t t p : // p b i l . univ−l y o n 1 . f r /R/ donnees /
amaz . t x t ” , h e a d e r = TRUE)
(hydrologic data from Amazonia proposed by D. Chessel & A.B. Dufour at Lyon 1).
18
Regression Methods for Astrophysics
A convenient way to read data on your own computer is to use:
data=read . table ( f i l e . choose ( ) , h e a d e r=TRUE)
This allows to browse in your computer in order to find out the file your are
interested in.
6
A toy example
1. Statistical description (mean, standard deviation, median, min, max, . . .) of
the following data:
l i b r a r y (MASS)
data ( g e y s e r )
attach ( g e y s e r )
help ( g e y s e r )
Use the command summary as well as other already seen commands (histogram, cumulative distribution...).
2. The following data describe the maximum load to which ropes made in a
given factory can be subjected in service:
10.1 12.2 9.3 12.4 13.7 10.8 11.6 10.1 11.2 11.3
12.2 12.6 11.5 9.2 14.2 11.1 13.3 11.8 7.1 10.5
(a) Create the boxplot corresponding to theses data. Are there some outliers? What is the value of the third quartile?
(b) Using the box plot, do you think that the distribution of the data is
symmetric or not?