Academia.eduAcademia.edu

Practical Session: Introduction to R

2014, EAS Publications Series

Regression Methods for Astrophysics D. Fraix-Burnet and D. Valls-Gabaud (eds) EAS Publications Series, 66 (2014) 11–18 PRACTICAL SESSION: INTRODUCTION TO R M. Clausel1 and G. Grégoire 1 Abstract. An introduction to R is proposed. This pratical session is an excerpt from practical exercises proposed by A. Dalalyan at EPNC (see http://certis.enpc.fr/∼dalalyan/Download/TP ENPC 1.pdf). Datas are also extracted from a practical session proposed by hydrologic data from Amazonia proposed by D. Chessel A.B. Dufour in Lyon 1 (website indicated in the text below) and from other practical exercises proposed by A. Dalalyan at ENPC (same address as above but ended by /TP ENPC 4.pdf). 1 More about R • The R project at http://www.r-project.org is an open-source community effort to develop a free software environment, called R, for statistical computing and graphics. • R is also a programming language and currently the lingua franca for statistical research and data analysis. • It was started in 1993 by the statisticians Ross Ihaka and Robert Gentleman and has become extremely popular in both academy and industry with an estimated users base of one million people worldwide. See a couple of articles about R at the New York times here and here. • The R software can be freely downloaded and installed in Linux, Windows and Mac computers from http://cran.r-project.org • A new stable version is released every year about march/april and development versions are updated on daily basis. Last major release (version 3.0.0) took place on April 3rd, 2013. 1 Laboratory LJK, Grenoble University, BP. 53, 38041 Grenoble Cedex 09, France c EAS, EDP Sciences 2015  DOI: 10.1051/eas/1466002 12 Regression Methods for Astrophysics Some books and websites • More than 80 books about R: – Specific collection Springer Use R – Introductory Statistics with R (Dalgaard) – The R book (Crawley) – Linear models with R (Faraway) – Software for data analysis: programming with R (Chambers). • Some web sites: the R home page http://cran.r-project.org/ (Manuals, Task view, RSiteSearch, Mailing lists, FAQ, . . . ). 2 First steps 1. To know more about R: ? help help ( rep ) help (demo) demo( graphics ) 2. Simple manipulations: numbers and vectors: p i ∗sqrt (10)+exp ( 4 ) 3:10 # generating regu la r sequence seq ( 3 , 1 0 ) # g e n e r a t i n g r e g u l a r s e q u e n c e x = c (2 ,3 ,5 ,7 ,2 ,1) # assigns a value to a vector y = c (10 ,15 ,12) z = c(x , y) # c o n c a t e n a t i o n o f x and y z ˆ2 x∗x w=rep ( x , 3 ) w=rep ( x , each =3) ls () # current variables ? rm( x ) x ls () x 3. Arrays and matrices x1 = 1 : 9 dim( x1 ) = c ( 3 , 3 ) # g i v e s t o x t h e dim a t t r i b u t e t h a t # a l l o w s i t t o be t r e a t e d as a 3X3 a r r a y ?dim x1 d e t ( x1 ) eigen ( x1 ) M. Clausel and G. Grégoire: Practical Session: Introduction to R 13 x2 = matrix ( seq ( 1 , 6 , by=1) , ncol =2) m1 = cbind ( x2 , rep ( 0 , 3 ) ) m2 = rbind ( x1 , rep ( 0 , 2 ) ) y= matrix ( 1 : 1 2 , nrow=3 , byrow=T) t(y) z = matrix ( 1 : 4 , nrow=2 , byrow=T) z ˆ2 z∗z z %∗% z z i n v=solve ( z ) z %∗% z i n v 4. Dataframe A dataframe is a useful object to put together statistical observations of several variables. The variables don’t need to be of same class. They can be numeric, logical, character, factor. Thus a dataframe can be seen as a matrix where each column is a variable and each row is an observation. The lengths of the columns are all the same. Rows and columns can be named. The function data.frame is used to create a dataframe. Try: df d i a b <− data . frame ( s i z e=c ( 1 6 3 , 1 8 6 , 1 6 6 , 1 7 6 , 1 6 9 ) , s p o r t=c (TRUE,TRUE, FALSE,TRUE, FALSE) , s e x=c ( ”F” , ”H” , ”F” , ”H” , ”F” ) ) df d i a b df d i a b $ s i z e c l a s s ( df d i a b $ s i z e ) c l a s s ( df d i a b $ s p o r t ) c l a s s ( df d i a b $ s e x ) 5. Libraries and packages: • To see which packages are installed at your site, issue the command library ( ) • To load a particular package already installed l i b r a r y (MASS) • Users connected to the Internet can use the install.packages() and update.packages() functions. Install the following packages from internet: r e p o s i t o r y <− ” h t t p : // c r a n . univ−l y o n 1 . f r /” i n s t a l l . packages ( ” UsingR ” , r e p o s=r e p o s i t o r y ) i n s t a l l . packages ( ” d a t a s e t s ” , r e p o s=r e p o s i t o r y ) 14 Regression Methods for Astrophysics install install install install install . packages ( ”DAAG” , r e p o s=r e p o s i t o r y ) . packages ( ” l e a p s ” , r e p o s=r e p o s i t o r y ) . packages ( ” c a r ” , r e p o s=r e p o s i t o r y ) . packages ( ”HH” , r e p o s=r e p o s i t o r y ) . packages ( ” e d R G r a p h i c a l T o o l s ” , r e p o s=r e p o s i t o r y ) 6. Graphics: x = runif ( 5 0 , 0 , 2 ) y = runif ( 5 0 , 0 , 2 ) plot ( x , y , main=” T i t r e ” , x l a b=” a b s c i s s a ” , y l a b=” o r d i n a t e ” , c o l=” d a r k r e d ” ) abline ( h =.6 , v =.6) text ( . 6 , . 6 , ” comment t h e graph ! ” ) colors ( ) 7. Functions with R: • A very simple example: c a r r e = function ( x ) { x ˆ2 } carre (3) carre • More complicated: h i s t . norm=function ( n , c o l ) { x = rnorm( n ) h = h i s t ( x , plot=F) s = sd ( x ) m = mean( x ) ylim = c ( 0 , 1 . 2 ∗max(max( h$density ) , 1 / ( s ∗sqrt ( 2 ∗ p i ) ) ) ) x l a b = ” Histogram and normal a p p r o x i m a t i o n ” ylab = ”” main = paste ( ” Gaussian sample : n=” , n ) h i s t ( x , f r e q=F , ylim=ylim , x l a b=xlab , y l a b=ylab , c o l=col , main=main ) curve (dnorm( x ,m, s ) , add=T, lwd=2) } op=par ( m f c o l=c ( 1 , 3 ) ) h i s t . norm ( 2 0 0 , c o l=” y e l l o w ” ) h i s t . norm ( 8 0 0 , c o l=” d a r k g o l d e n r o d ” ) h i s t . norm ( 3 2 0 0 , c o l=” b l u e ” ) par ( op ) M. Clausel and G. Grégoire: Practical Session: Introduction to R 3 15 Random distributions How generate (pseudo–)random numbers... 1. Try!: rnorm ( 1 0 ) # g e n e r a t e s 10 r e a l i z a t i o n s # o f t h e s t a n d a r d Gaussian d i s t r i b u t i o n N( 0 , 1 ) rnorm ( 1 0 ) rnorm ( 1 0 ) plot (rnorm ( 1 0 0 ) ) rbinom ( 1 0 , s i z e =20 , prob =.5) rcauchy ( 1 0 ) runif ( 1 0 , min=0 , max=1) sample ( 1 : 4 0 , 5 ) sample ( 1 : 1 0 , 1 0 , replace=T) sample ( c ( ‘ ‘ f a i l ’ ’ , ‘ ‘ s u c c e s s ’ ’ ) , 1 0 , replace=T, prob=c ( 0 . 7 , 0 . 3 ) ) 2. Some classical distributions: beta, binom, cauchy, chisq, exp, f, gamma, norm, pois, t, unif. Prefixes r, d, p and q are used to get pseudo-random samples, probability density functions, cumulative distribution functions and quantiles. Try: x = seq ( −4 ,4 ,by=0.01) y = dnorm( x ) plot ( x , y , type=” l ” ) pnorm ( 1 . 6 5 ) y = pnorm( x ) abline ( h =0.0 , v =0.0) plot ( x , y , type=” l ” ) qnorm ( 0 . 9 5 ) 4 Empirical description 1. Order statistics: x = rnorm ( 1 0 ) # y = sort ( x ) # Echantillon i . i . d . order s t a t i s t i c s 2. Empirical cumulative distribution: x = rnorm ( 1 0 0 ) n=length ( x ) plot ( sort ( x ) , 1 : n/n , type=” s ” , ylim=c ( 0 , 1 ) , x l a b=” ” , y l a b=” ” ) ?pnorm curve (pnorm( x , 0 , 1 ) , add=T, c o l=” b l u e ” ) 16 Regression Methods for Astrophysics 3. Histogram and density: x = rnorm ( 1 0 0 ) h i s t ( x , b r e a k s =20) h i s t ( x , b r e a k s =20 , f r e q=F , c o l=” cyan ” ) curve (dnorm( x ) , add=T, c o l=” d a r k b l u e ” ) x = rnorm ( 5 0 ) h = h i s t ( x , plot=F) h$ b r e a k s h$ c o u n t s ? hist x = rnorm( 1 0 0 0 ) h = h i s t ( x , f r e q=F) l i n e s ( density ( x ) ) ? density 4. Boxplot: Box plots are formed by: • Vertical axis: the response variable. • Horizontal axis: the factor of interest. More specifically, one – Calculates the median and the quartiles(the lower quartile is the 25th percentile and the upper quartile is the 75th percentile). – Plots a symbol at the median (or draw a line) and draw a box (hence the name–box plot) between the lower and upper quartiles; this box represents the middle 50% of the data– that is the “body” of the data. – Draws a line from the lower quartile to the minimum point and another line from the upper quartile to the maximum point. Typically a symbol is drawn at these minimum and maximum points, although this is optional. Thus the box plot identifies the middle 50% of the data, the median, and the extreme points. 5. Box plots with R: x = rnorm ( 1 0 0 ) par ( m f c o l=c ( 2 , 2 ) , bg=” l i g h t c y a n ” ) boxplot ( x ) boxplot ( x , h o r i z o n t a l=T) boxplot ( x , c o l=” r e d ” ) boxplot ( x , c o l=” o r a n g e ” , b o r d e r=” d a r k b l u e ” ) 6. Parallel boxplots: x = rnorm ( 1 0 0 ) y = (rnorm(400))ˆ2 −1 z= rnorm ( 5 0 ) ˆ 3 par ( bg=” l i g h t c y a n ” ) M. Clausel and G. Grégoire: Practical Session: Introduction to R 17 boxplot ( x , y , z , c o l=c ( ” b l u e ” , ” w h i t e ” , ” r e d ” ) , b o r d e r=c ( ” b l a c k ” , ” d a r k b l u e ” ) ) 7. QQ-plots: x = rnorm ( 1 0 0 ) y = (rnorm(400))ˆ2 −1 z = rnorm ( 2 0 0 ,m=4 ,sd=5) par ( bg=” l i g h t c y a n ” , mfrow=c ( 2 , 2 ) ) qqplot ( x , y , pch =21 , bg=” r e d ” , f g=” d a r k b l u e ” ) qqplot ( x , z , pch =21 , bg=” r e d ” , f g=” d a r k b l u e ” ) qqnorm( y , pch =21 , bg=” o r a n g e ” , f g=” d a r k b l u e ” ) qqline ( y , pch =21 , c o l=” b l u e ” ) qqnorm( z , pch =21 , bg=” o r a n g e ” , f g=” d a r k b l u e ” ) qqline ( z , pch =21 , c o l=” b l u e ” ) 5 Reading data in a file Some data sets are furnished with R. data ( ) # l i s t o f d a t a s e t s ?WWWusage # d e s c r i p t i o n o f t h e d a t a WWWusage plot (WWWusage) R can also read data from an external file using the command read.table. Here is an example for a file with a .data extension # read the data Donnees = read . table ( ” A i r Q u a l i t y . data ” ) summary( Donnees ) # summary d a t a h i s t ( Donnees$Ozone , c o l=” g o l d ” ) # h i s t o g r a m o f t h e # v a r i a b l e Ozone attach ( Donnees ) # t h e name o f t h e d a t a s e t can be o m i t t e d h i s t ( Ozone , f r e q=F , c o l=” g o l d ” ) detach ( Donnees ) h i s t ( Ozone , f r e q=F , c o l=” g o l d ” ) or with the .txt extension data=read . table ( ” . /TPRegMult/ p o l l u . t x t ” ) if the path of the directory where are located the data is ”./TPRegMult/” Reading data from a website: data=read . table ( ” h t t p : // p b i l . univ−l y o n 1 . f r /R/ donnees / amaz . t x t ” , h e a d e r = TRUE) (hydrologic data from Amazonia proposed by D. Chessel & A.B. Dufour at Lyon 1). 18 Regression Methods for Astrophysics A convenient way to read data on your own computer is to use: data=read . table ( f i l e . choose ( ) , h e a d e r=TRUE) This allows to browse in your computer in order to find out the file your are interested in. 6 A toy example 1. Statistical description (mean, standard deviation, median, min, max, . . .) of the following data: l i b r a r y (MASS) data ( g e y s e r ) attach ( g e y s e r ) help ( g e y s e r ) Use the command summary as well as other already seen commands (histogram, cumulative distribution...). 2. The following data describe the maximum load to which ropes made in a given factory can be subjected in service: 10.1 12.2 9.3 12.4 13.7 10.8 11.6 10.1 11.2 11.3 12.2 12.6 11.5 9.2 14.2 11.1 13.3 11.8 7.1 10.5 (a) Create the boxplot corresponding to theses data. Are there some outliers? What is the value of the third quartile? (b) Using the box plot, do you think that the distribution of the data is symmetric or not?