R Programming

R is a programming language and software environment for statistical analysis and graphics. It is similar to S but is available under GNU GPL license. R can be used for data manipulation, calculation, graphical displays, statistical model fitting, and data analysis. It provides a variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. R is an effective tool for practical data analysis tasks and statistical research.

Uploaded by

falconau
R-programming

http://www.r-project.org/
http://cran.r-project.org/
Hung Chen
Outline
• Introduction:
– Historical development
– S, Splus
– Capability
– Statistical Analysis
• References
• Calculator
• Data Type
• Resources
• Simulation and Statistical Tables
– Probability distributions
• Programming
– Grouping, loops and conditional execution
– Function
• Reading and writing data from files
• Modeling
– Regression
– ANOVA
• Data Analysis on Association
– Lottery
– Geyser
• Smoothing
R, S and S-plus
• S: an interactive environment for data analysis developed at Bell
Laboratories since 1976
– 1988 - S2: RA Becker, JM Chambers, A Wilks
– 1992 - S3: JM Chambers, TJ Hastie
– 1998 - S4: JM Chambers

• Exclusively licensed by AT&T/Lucent to Insightful Corporation,
Seattle WA. Product name: “S-plus”.
• Implementation languages C, Fortran.
• See:
http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
• R: initially written by Ross Ihaka and Robert Gentleman at the
Department of Statistics of the University of Auckland, New Zealand,
during the 1990s.
• Since 1997: international “R-core” team of ca. 15 people with
access to a common CVS archive.
Introduction
• R is “GNU S”: a language and environment for data manipulation,
calculation and graphical display.
– R is similar to the award-winning S system, which was developed at Bell
Laboratories by John Chambers et al.
– a suite of operators for calculations on arrays, in particular matrices,
– a large, coherent, integrated collection of intermediate tools for interactive data
analysis,
– graphical facilities for data analysis and display either directly at the computer
or on hardcopy
– a well developed programming language which includes conditionals, loops,
user defined recursive functions and input and output facilities.
• The core of R is an interpreted computer language.
– It allows branching and looping as well as modular programming using
functions.
– Most of the user-visible functions in R are written in R, calling upon a smaller
set of internal primitives.
– It is possible for the user to interface to procedures written in C, C++ or
FORTRAN languages for efficiency, and also to write additional primitives.
What R does and does not
Provides:
o data handling and storage: numeric, textual
o matrix algebra
o hash tables and regular expressions
o high-level data analytic and statistical functions
o classes (“OO”)
o graphics
o programming language: loops, branching, subroutines
Limitations:
o is not a database, but connects to DBMSs
o has no graphical user interfaces, but connects to Java, TclTk
o language interpreter can be very slow, but allows calling your own C/C++ code
o no spreadsheet view of data, but connects to Excel/MsOffice
o no professional / commercial support
R and statistics
o Packaging: a crucial infrastructure to efficiently produce, load
and keep consistent software libraries from (many) different
sources / authors
o Statistics: most packages deal with statistics and data analysis
o State of the art: many statistical researchers provide their
methods as R packages
Data Analysis and Presentation
• The R distribution contains functionality for a large number of
statistical procedures.
– linear and generalized linear models
– nonlinear regression models
– time series analysis
– classical parametric and nonparametric tests
– clustering
– smoothing
• R also has a large set of functions which provide a flexible
graphical environment for creating various kinds of data
presentations.
References
• For R,
– The basic reference is The New S Language: A Programming Environment
for Data Analysis and Graphics by Richard A. Becker, John M. Chambers
and Allan R. Wilks (the “Blue Book”) .
– The new features of the 1991 release of S (S version 3) are covered in
Statistical Models in S edited by John M. Chambers and Trevor J. Hastie (the
“White Book”).
– Classical and modern statistical techniques have been implemented.
• Some of these are built into the base R environment.
• Many are supplied as packages. There are about 8 packages supplied with R
(called “standard” packages) and many more are available through the CRAN
family of Internet sites (via http://cran.r-project.org).
• All the R functions have been documented in the form of help
pages in an “output independent” form which can be used to create
versions for HTML, LaTeX, text etc.
– The document “An Introduction to R” provides a more user-friendly starting
point.
– An “R Language Definition” manual
– More specialized manuals on data import/export and extending R.
R as a calculator
> log2(32)
[1] 5
> sqrt(2)
[1] 1.414214
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> plot(sin(seq(0, 2*pi, length=100)))
[Figure: sin(seq(0, 2 * pi, length = 100)) plotted against Index]


Object orientation

primitive (or: atomic) data types in R are:
• numeric (integer, double, complex)
• character
• logical
• function
out of these, vectors, arrays, lists can be built.
variables
> a = 49
> sqrt(a)                        # numeric
[1] 7

> a = "The dog ate my homework"
> sub("dog","cat",a)             # character string
[1] "The cat ate my homework"

> a = (1+1==3)
> a                              # logical
[1] FALSE
vectors, matrices and arrays
• vector: an ordered collection of data of the same type
> a = c(1,2,3)
> a*2
[1] 2 4 6

• Example: the mean spot intensities of all 15488 spots on a chip:
a vector of 15488 numbers
• In R, a single number is the special case of a vector with 1
element.
• Other vector types: character strings, logical
vectors, matrices and arrays

• matrix: a rectangular table of data of the same type
• example: the expression values for 10000 genes for 30 tissue
biopsies: a matrix with 10000 rows and 30 columns.
• array: 3-, 4-, ... dimensional matrix
• example: the red and green foreground and background values
for 20000 spots on 120 chips: a 4 x 20000 x 120 (3D) array.
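A minimal sketch of how a matrix and a 3D array can be built, using small made-up dimensions in place of the gene and chip examples above. The names `expr` and `chips` and all sizes are illustrative assumptions.

```r
# Hypothetical stand-in for an expression matrix: 4 "genes" x 3 "biopsies".
# matrix() fills column by column, so column 3 holds 9, 10, 11, 12.
expr <- matrix(1:12, nrow = 4, ncol = 3)
dim(expr)        # number of rows and columns
expr[2, 3]       # element in row 2, column 3

# Hypothetical stand-in for the chip data: 4 channels x 5 spots x 2 chips.
chips <- array(0, dim = c(4, 5, 2))
dim(chips)       # the three dimensions
chips[1, 2, 1]   # one element, addressed by three indices
```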
Lists
• vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5

• list: an ordered collection of data of arbitrary types.
> doe = list(name="john",age=28,married=F)
> doe$name
[1] "john"
> doe$age
[1] 28
• Typically, vector elements are accessed by their index (an integer),
list elements by their name (a character string). But both types
support both access methods.
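Both access methods on both types can be sketched as follows. The named vector `v` is an assumption added for illustration; `doe` repeats the list from the slide.

```r
# A named numeric vector: elements can be reached by position or by name.
v <- c(first = 7, second = 5, third = 1)
v[2]            # by index
v["second"]     # by name

# A list: [[ ]] extracts a single element by position or by name,
# and $ is a shorthand for access by name.
doe <- list(name = "john", age = 28, married = FALSE)
doe[[2]]        # by index
doe[["age"]]    # by name
doe$age         # by name, short form
```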
Data frames
data frame: is supposed to represent the typical data table that
researchers come up with – like a spreadsheet.

It is a rectangular table with rows and columns; data within each
column has the same type (e.g. number, text, logical), but
different columns may have different types.

Example:
> a
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE
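One way such a table could be built by hand is with `data.frame()`. This is a sketch that re-creates the example values; the slides do not show how `a` was actually created.

```r
# Re-create the example table; each argument becomes one column.
a <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(FALSE, TRUE, FALSE),
  row.names    = c("XX348", "XX234", "XX987")
)
str(a)                    # one type per column, different types across columns
a["XX987", "tumorsize"]   # a single cell, addressed by row and column name
```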
Factors
A character string can contain arbitrary text. Sometimes it is useful to use a limited
vocabulary, with a small number of allowed words. A factor is a variable that can only
take such a limited number of values, which are called levels.
> a
[1] Kolon(Rektum) Magen Magen
[4] Magen Magen Retroperitoneal
[7] Magen Magen(retrogastral) Magen
Levels: Kolon(Rektum) Magen Magen(retrogastral) Retroperitoneal
> class(a)
[1] "factor"
> as.character(a)
[1] "Kolon(Rektum)" "Magen" "Magen"
[4] "Magen" "Magen" "Retroperitoneal"
[7] "Magen" "Magen(retrogastral)" "Magen"
> as.integer(a)
[1] 1 2 2 2 2 4 2 3 2
> as.integer(as.character(a))
[1] NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
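The difference between level codes and level labels can be sketched with a small made-up factor. The values in `f` are hypothetical. When the labels look like numbers, converting through `as.character()` recovers the values, whereas `as.integer()` alone returns the codes.

```r
# A hypothetical factor whose labels happen to look like numbers.
f <- factor(c("10", "20", "10", "30"))
levels(f)                           # the allowed values (levels)
codes  <- as.integer(f)             # internal level codes, not the labels
values <- as.numeric(as.character(f))  # the numbers the labels stand for
codes
values
```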
Subsetting
Individual elements of a vector, matrix, array or data frame are
accessed with “[ ]” by specifying their index, or their name
> a
localisation tumorsize progress
XX348 proximal 6.3 0
XX234 distal 8.0 1
XX987 proximal 10.0 0
> a[3, 2]
[1] 10
> a["XX987", "tumorsize"]
[1] 10
> a["XX987",]
localisation tumorsize progress
XX987 proximal 10 0
Subsetting
> a
      localisation tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
> a[c(1,3),]                        # subset rows by a vector of indices
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
> a[c(T,F,T),]                      # subset rows by a logical vector
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
> a$localisation                    # subset a column
[1] "proximal" "distal" "proximal"
> a$localisation=="proximal"        # comparison resulting in logical vector
[1] TRUE FALSE TRUE
> a[ a$localisation=="proximal", ]  # subset the selected rows
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
Resources
• A package specification allows the production of loadable modules
for specific purposes, and several contributed packages are made
available through the CRAN sites.
• CRAN and R homepage:
– http://www.r-project.org/
It is R’s central homepage, giving information on the R project and
everything related to it.
– http://cran.r-project.org/
It acts as the download area, carrying the software itself, extension packages,
PDF manuals.
• Getting help with functions and features
– help(solve)
– ?solve
– For a feature specified by special characters, the argument must be enclosed in
double or single quotes, making it a “character string”: help("[[")
Getting help
Details about a specific command whose name you know (input
arguments, options, algorithm, results):

> ?t.test
or
> help(t.test)
Getting help

o HTML search engine
o Search for topics with regular expressions: “help.search”
Probability distributions
• Cumulative distribution function P(X ≤ x): ‘p’ for the CDF
• Probability density function: ‘d’ for the density
• Quantile function (given q, the smallest x such that P(X ≤ x) ≥ q):
‘q’ for the quantile
• Simulate from the distribution: ‘r’
Distribution     R name    additional arguments
beta             beta      shape1, shape2, ncp
binomial         binom     size, prob
Cauchy           cauchy    location, scale
chi-squared      chisq     df, ncp
exponential      exp       rate
F                f         df1, df2, ncp
gamma            gamma     shape, scale
geometric        geom      prob
hypergeometric   hyper     m, n, k
log-normal       lnorm     meanlog, sdlog
Further distributions (R name; additional arguments not listed here):
logistic (logis); negative binomial (nbinom); normal (norm); Poisson (pois);
Student’s t (t); uniform (unif); Weibull (weibull); Wilcoxon (wilcox)
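The prefix convention can be sketched for the normal and binomial distributions. The specific arguments and the seed are arbitrary choices for illustration.

```r
# 'd', 'p', 'q', 'r' prefixes combined with the R name "norm".
dens <- dnorm(0)               # density of the standard normal at 0
prob <- pnorm(1.96)            # P(X <= 1.96)
quan <- qnorm(0.975)           # quantile: inverse of pnorm
set.seed(1)                    # arbitrary seed, only for reproducibility
draws <- rnorm(5, mean = 10, sd = 2)   # five simulated values

# The same convention with the R name "binom" and its extra arguments.
pb <- pbinom(3, size = 10, prob = 0.5) # P(X <= 3) for Binomial(10, 0.5)
```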
Grouping, loops and conditional execution
• Grouped expressions
– R is an expression language in the sense that its only command type is a
function or expression which returns a result.
– Commands may be grouped together in braces, {expr 1; . . .; expr m}, in
which case the value of the group is the result of the last expression in the
group evaluated.
• Control statements
– if statements
– The language has available a conditional construction of the form
if (expr 1) expr 2 else expr 3
where expr 1 must evaluate to a logical value and the result of the entire
expression is then evident.
– a vectorized version of the if/else construct, the ifelse function. This has the
form ifelse(condition, a, b)
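A short sketch of the three constructs above: a grouped expression used as a value, a scalar `if`/`else`, and the vectorized `ifelse`. Variable names and values are made up for illustration.

```r
# The value of a braced group is the value of its last expression.
val <- {
  x <- 3
  y <- 4
  sqrt(x^2 + y^2)
}

# if/else is itself an expression, so its result can be assigned.
size <- if (val > 4) "large" else "small"

# ifelse() works element by element on a whole vector.
z <- c(-2, 0, 3)
sgn <- ifelse(z >= 0, "non-negative", "negative")
```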
Repetitive execution
• for loops, repeat and while
– for (name in expr 1) expr 2
where name is the loop variable. expr 1 is a vector expression,
(often a sequence like 1:20), and expr 2 is often a grouped
expression with its sub-expressions written in terms of the
dummy name. expr 2 is repeatedly evaluated as name ranges
through the values in the vector result of expr 1.
• Other looping facilities include the
– repeat expr statement and the
– while (condition) expr statement.
– The break statement can be used to terminate any loop, possibly
abnormally. This is the only way to terminate repeat loops.
– The next statement can be used to discontinue one particular
cycle and skip to the “next”.
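The interplay of `repeat`, `next` and `break` can be sketched as follows; the task (summing the odd numbers below 10) is an arbitrary example.

```r
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i %% 2 == 0) next   # skip even numbers, go to the next cycle
  if (i > 9) break        # the only way out of a repeat loop
  total <- total + i      # reached only for odd i up to 9
}
total
```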
Branching

if (logical expression) {
  statements
} else {
  alternative statements
}

else branch is optional
Loops
• When the same or similar tasks need to be performed multiple
times; for all elements of a list; for all columns of an array; etc.
• Monte Carlo Simulation
• Cross-Validation (delete-one, etc.)

for (i in 1:10) {
  print(i*i)
}

i = 1
while (i <= 10) {
  print(i*i)
  i = i + sqrt(i)
}
lapply, sapply, apply
• When the same or similar tasks need to be performed multiple
times for all elements of a list or for all columns of an array.
• May be easier and faster than “for” loops
• lapply(li, function )
• To each element of the list li, the function function is applied.
• The result is a list whose elements are the individual function
results.
> li = list("klaus","martin","georg")
> lapply(li, toupper)
[[1]]
[1] "KLAUS"

[[2]]
[1] "MARTIN"

[[3]]
[1] "GEORG"
lapply, sapply, apply
sapply( li, fct )
Like lapply, but tries to simplify the result, by converting it into a
vector or array of appropriate size

> li = list("klaus","martin","georg")
> sapply(li, toupper)
[1] "KLAUS" "MARTIN" "GEORG"

> fct = function(x) { return(c(x, x*x, x*x*x)) }
> sapply(1:5, fct)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    4    9   16   25
[3,]    1    8   27   64  125
apply
apply( arr, margin, fct )
Apply the function fct along some dimensions of the array arr,
according to margin, and return a vector or array of the
appropriate size.
> x
     [,1] [,2] [,3]
[1,]    5    7    0
[2,]    7    9    8
[3,]    4    6    7
[4,]    6    3    5
> apply(x, 1, sum)
[1] 12 24 17 14
> apply(x, 2, sum)
[1] 22 25 20
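For completeness, a sketch that builds the matrix `x` shown above and repeats both calls; the comparison with `rowSums()` and `colSums()` is an addition for illustration.

```r
# The matrix from the example, entered column by column.
x <- matrix(c(5, 7, 4, 6,
              7, 9, 6, 3,
              0, 8, 7, 5), nrow = 4)

row_sums <- apply(x, 1, sum)   # margin 1: apply sum over each row
col_sums <- apply(x, 2, sum)   # margin 2: apply sum over each column

# For plain sums, the dedicated functions give the same numbers.
rowSums(x)
colSums(x)
```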
functions and operators
Functions do things with data
“Input”: function arguments (0,1,2,…)
“Output”: function result (exactly one)

Example:
add = function(a, b) {
  result = a + b
  return(result)
}

Operators:
Short-cut writing for frequently used functions of one or two
arguments.
Examples: + - * / ! & | %%
functions and operators
• Functions do things with data
• “Input”: function arguments (0,1,2,…)
• “Output”: function result (exactly one)
Exceptions to the rule:
• Functions may also use data that sits around in other places, not
just in their argument list: “scoping rules”*
• Functions may also do other things than returning a result. E.g.,
plot something on the screen: “side effects”

* R. Gentleman, R. Ihaka. Lexical Scope and Statistical Computing.
Journal of Computational and Graphical Statistics, 9(3), pp. 491-508 (2000).
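Both exceptions can be sketched in a few lines. The function names `make_counter` and `noisy_square` are hypothetical and chosen only for this illustration.

```r
# Scoping: the inner function uses 'count', which is not one of its
# arguments but lives in the environment where the function was created.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # modify 'count' in the enclosing environment
    count
  }
}
counter <- make_counter()
first  <- counter()
second <- counter()

# Side effect: besides returning a result, the function prints a message.
noisy_square <- function(x) {
  message("squaring ", x)
  x^2
}
```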
Reading data from files
• The read.table() function
– To read an entire data frame directly, the external file will normally have a
special form.
– The first line of the file should have a name for each variable in the data
frame.
– Each additional line of the file has its first item a row label and the values for
each variable.
   Price   Floor  Area  Rooms  Age  Cent.heat
01 52.00   111.0   830      5  6.2  no
02 54.75   128.0   710      5  7.5  no
03 57.50   101.0  1000      5  4.2  no
04 57.50   131.0   690      6  8.8  no
05 59.75    93.0   900      5  1.9  yes
...
• numeric variables and nonnumeric variables (factors)
Reading data from files
• HousePrice <- read.table("houses.data", header=TRUE)
Price   Floor  Area  Rooms  Age  Cent.heat
52.00   111.0   830      5  6.2  no
54.75   128.0   710      5  7.5  no
57.50   101.0  1000      5  4.2  no
57.50   131.0   690      6  8.8  no
59.75    93.0   900      5  1.9  yes
...
• The data file is named ‘input.dat’.
– Suppose the data vectors are of equal length and are to be read in in parallel.
– Suppose that there are three vectors, the first of mode character and the remaining two
of mode numeric.
• The scan() function
– inp<- scan("input.dat", list("",0,0))
– To separate the data items into three separate vectors, use assignments like
label <- inp[[1]]; x <- inp[[2]]; y <- inp[[3]]
– inp <- scan("input.dat", list(id="", x=0, y=0)); inp$id; inp$x; inp$y
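Both functions can be tried without external files by first writing small temporary files. This is a sketch: the temporary paths and the three-line contents of the second file are assumptions, and the house data are a subset of the rows shown above.

```r
# Write a small file in the header-plus-rows form, then read it back.
houses_file <- tempfile(fileext = ".data")
writeLines(c(
  "Price Floor Area Rooms Age Cent.heat",
  "52.00 111.0 830 5 6.2 no",
  "54.75 128.0 710 5 7.5 no",
  "59.75 93.0 900 5 1.9 yes"
), houses_file)
HousePrice <- read.table(houses_file, header = TRUE)
str(HousePrice)

# scan() with a list template: one character and two numeric vectors,
# read in parallel from each line.
input_file <- tempfile(fileext = ".dat")
writeLines(c("a 1 2", "b 3 4", "c 5 6"), input_file)
inp <- scan(input_file, list(id = "", x = 0, y = 0))
inp$id
inp$x
inp$y
```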
Storing data
• Every R object can be stored into and restored from a file with
the commands “save” and “load”.
• This uses the XDR (external data representation) standard of
Sun Microsystems and others, and is portable between MS-
Windows, Unix, Mac.

> save(x, file="x.Rdata")
> load("x.Rdata")
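A round trip can be sketched with a temporary file; the object `x` and the file location are arbitrary choices for illustration.

```r
x <- c(1.5, 2.5, 3.5)
rdata_file <- tempfile(fileext = ".RData")

save(x, file = rdata_file)   # store the object under its name
rm(x)                        # remove it from the workspace

loaded <- load(rdata_file)   # restores 'x'; returns the restored names
loaded
x
```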
Importing and exporting data

There are many ways to get data into R and out of R.

Most programs (e.g. Excel), as well as humans, know how to deal
with rectangular tables in the form of tab-delimited text files.

> x = read.delim("filename.txt")
also: read.table, read.csv

> write.table(x, file="x.txt", sep="\t")
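A sketch of a tab-delimited round trip through a temporary file. The data frame contents are made up, and `quote = FALSE` and `row.names = FALSE` are optional choices that keep the file free of quotes and of a row-label column.

```r
x <- data.frame(id = c("s1", "s2"), value = c(1.5, 2.5))
txt_file <- tempfile(fileext = ".txt")

# Export as a tab-delimited text file.
write.table(x, file = txt_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Import it again; read.delim assumes a header line and tab separators.
y <- read.delim(txt_file)
y
```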


Explore Association
• data(stackloss)
– It is a data frame with 21 observations on 4 variables.
– [,1] 'Air Flow' Flow of cooling air
– [,2] 'Water Temp' Cooling Water Inlet Temperature
– [,3] 'Acid Conc.' Concentration of acid [per 1000, minus 500]
– [,4] 'stack.loss' Stack loss
– The data sets 'stack.x', a matrix with the first three (independent)
variables of the data frame, and 'stack.loss', the numeric vector
giving the fourth (dependent) variable, are provided as well.
• Scatterplots, scatterplot matrix: for two quantitative variables.
– plot(stackloss$Air.Flow, stackloss$Water.Temp)
– plot(stackloss)
• summary(lm.stack <- lm(stack.loss ~ stack.x))
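An equivalent fit can be sketched with the data frame columns named explicitly in the formula instead of the `stack.x` matrix. The object name `fit` is an assumption; the column names are those of the `stackloss` data frame.

```r
data(stackloss)

# Regress stack loss on the three operating variables.
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

coef(fit)                 # intercept plus one coefficient per predictor
summary(fit)$r.squared    # proportion of variance explained
```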
Explore Association
• Boxplot suitable for showing a quantitative and a qualitative variable.
• The variable test is not quantitative but categorical.
– Such variables are also called factors.
