Intro To R
Intro To R
Intro To R
This document very briefly reviews the main features of the R statistical package. For full
details, please refer to the various manuals and help files that come packaged with R, or the
extra material referred to at the end of this document.
3. Basic Statistics: Covers t-tests with confidence intervals for means, chi-square tests
with confidence intervals for proportions, non-parametrics, simple regression, general-
ized linear models including logistic regression.
5. Creating your own functions: We will see here how to create our own functions,
to extend the capabilities of R.
1 Introduction
1.1 What is R?
1
for research in statistical methodology, and R provides an Open Source route to
participation in that activity.
“One of R’s strengths is the ease with which well-designed publication-quality
plots can be produced, including mathematical symbols and formulae where
needed. Great care has been taken over the defaults for the minor design choices
in graphics, but the user retains full control.
“R is available as Free Software under the terms of the Free Software Foundation’s
GNU General Public License in source code form. It compiles and runs on a wide
variety of UNIX platforms and similar systems (including FreeBSD and Linux),
Windows and MacOS.”
In short: R is “free Splus”, a very sophisticated statistics and graphics package, used by
most statisticians these days for modern analysis and statistical research.
2 Installation
2
8. Follow directions on screen until program is installed. In general, you can just leave
all options unchanged.
9. After installation is completed, click on the “R” icon on your desktop to begin the
program. At this point, your screen should have an R prompt on it, and you are ready
to begin using the program.
3 Adding Packages
You can easily extend R’s basic capabilities by adding in functions that others have created
and made freely available through the web. Rather than surfing the web trying to find,
them, R makes adding these capabilities trivially easy by using the “Packages” menu items.
For example, to add a new package not previously downloaded onto your computer, click on
“Packages −→ Install Package(s)”, pick a site from the list (usually a Canadian site) and
choose the package you want to install from the long list (currently, about 900 packages are
available). To activate this package, you then need to click on it in the ‘Packages −→ Load
Package” menu item. You are then ready to use it, and the help files for the package will
also be automatically added to your installation. See, for example, the “Help −→ HTML
help” menu item, and click on “Packages” in the browser window that opens.
4 R Basics
Use R as a Calculator:
> 2+3
[1] 5
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478
> log(2+3*pi)
[1] 2.435785
> exp(2.435785)
[1] 11.42478
3
> 6/7
[1] 0.8571429
> cos(pi)
[1] -1
Scaler variables:
> a <- 3
> a
[1] 3
> b <- 5
> b
[1] 5
> b-a
[1] 2
> b/a
[1] 1.666667
Vector variables:
> x<-c(2,3,1,5,4,6,5,7,6,8)
> x
[1] 2 3 1 5 4 6 5 7 6 8
> y <- c(10, 12, 14, 13, 34, 23, 12, 34, 25, 43)
> y
[1] 10 12 14 13 34 23 12 34 25 43
> x[4]
[1] 5
> y[3]
[1] 14
> x[1:10]
[1] 2 3 1 5 4 6 5 7 6 8
> x[6:8]
[1] 6 5 7
> x[x > 5]
[1] 6 7 6 8
4
Functions on vectors:
> length(x)
[1] 10
> length(y)
[1] 10
> sum(x)
[1] 47
> sum(y)
[1] 220
> sum(x^2)
[1] 265
> mean(x)
[1] 4.7
> mean(y)
[1] 22
> var(x)
[1] 4.9
> var(y)
[1] 136.4444
> sqrt(var(x))
[1] 2.213594
> sqrt(var(y))
[1] 11.68094
> sum((x-mean(x))^2)
[1] 44.1
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.00 4.70 6.00 8.00
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.00 12.25 18.50 22.00 31.75 43.00
> summary(x^2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 10.75 25.00 26.50 36.00 64.00
Matrices:
5
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> z.transpose<- matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=F)
> z.transpose
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> z[2,3]
[1] 6
> z[,1]
[1] 1 4 7
> z[1,]
[1] 1 2 3
> z[3,]
[1] 7 8 9
> sex <- c("male", "female", "female", "male", "male", "female", "female")
> sex
[1] "male" "female" "female" "male" "male" "female" "female"
> smoking <- c("yes", "no", "no", "yes", "no", "yes", "no", "no", "yes", "no")
> smoking
[1] "yes" "no" "no" "yes" "no" "yes" "no" "no" "yes" "no"
You can declare these variables to be factor variables for regression analyses:
6
> x
[1] 2.135011 7.573118 3.840814 5.544555 3.393277 2.731562
[7] 1.780588 4.576343 3.250077 3.908082
> y<-rbinom(10, size=1, prob=0.4)
> y
[1] 0 1 1 1 0 0 1 0 1 0
> z<- 5 + 2*x + 3*y +rnorm(10, mean=0, sd=0.5)
> z
[1] 9.44732 23.30763 15.34688 19.62479 11.30383 10.89258
[7] 11.22488 13.33018 15.11767 12.81894
Data Frames:
Most data sets are entered into R as data frames. To create a data frame, you
can enter:
> my.data.frame<-data.frame(x,y,z)
> my.data.frame
x y z
1 2.135011 0 9.44732
2 7.573118 1 23.30763
3 3.840814 1 15.34688
4 5.544555 1 19.62479
5 3.393277 0 11.30383
6 2.731562 0 10.89258
7 1.780588 1 11.22488
8 4.576343 0 13.33018
9 3.250077 1 15.11767
10 3.908082 0 12.81894
> help(sum)
and a help window will pop up, in this case for the function “sum”. R also comes with
extensive manuals, html help files, a nice introduction, etc. Extremely Useful Suggestion for
this Course: Look at the help files for read.table to see how to enter data from a text file on
your computer.
7
5 Basic Statistical Routines
You can use R to do any of the standard statistical routines you will typically need. R
includes everything you will learn in 607/621, and many other routines, with more added
regularly. R has many more modern (frequentist) routines available compared to other
packages such SAS or STATA, and whatever is not in the base package is likely to be in
the contributed section. In addition, whatever is not available in either of these sources,
“built-in” routines can easily be added on your own, as we will soon see.
T-Tests
Chi-square tests
8
data: c(67, 46) out of c(67 + 34, 46 + 45) X-squared = 4.2965, df
= 1, p-value = 0.03819 alternative hypothesis: two.sided 95
percent confidence interval:
0.009420794 0.306322868
sample estimates:
prop 1 prop 2
0.6633663 0.5054945
More generally:
Linear Models
> my.data.frame
x y z
1 2.135011 0 9.44732
2 7.573118 1 23.30763
3 3.840814 1 15.34688
4 5.544555 1 19.62479
5 3.393277 0 11.30383
6 2.731562 0 10.89258
7 1.780588 1 11.22488
8 4.576343 0 13.33018
9 3.250077 1 15.11767
10 3.908082 0 12.81894
9
Suppose we want simple linear regression model of z on x and y.
> lin.reg<-lm(z ~ x + y)
> summary(lin.reg)
Call:
lm(formula = z ~ x + y)
Residuals:
Min 1Q Median 3Q Max
-0.6821 -0.4339 0.0893 0.3849 0.5679
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8644 0.4245 11.460 8.66e-06 ***
x 1.9989 0.1064 18.792 3.00e-07 ***
y 3.2690 0.3450 9.475 3.05e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Other types of regression equations are available through generalized linear models:
and so on.
10
6 Graphics
R is capable of producing very high quality graphics in a variety of formats (wmf, ps, jpg,
pdf, png, bmp).
> plot(x,z)
11
Figure 2: A simple scatter plot with labels and title
Intercept X
5.236471 2.324865
> age.range <-seq(2,7,by=0.01)
> outcome.fit <- 5.236471 + 2.324865*age.range
> points(age.range, outcome.fit, type="l")
You can plot more than one curve on a single plot, and label them via a legend:
12
Figure 3: Scatter plot with best fitting line added
13
You can also do histograms . . .
Figure 5: Histogram
. . . and boxplots.
14
Figure 6: Boxplot
15
7 Writing your Own Functions
It is very easy to write your own functions in R. Once these are written, they become
permanent functions in your version of R, available just like any other function. Here I
present two examples, one easy, one (slightly) more complex.
First the easy example, creating a function to add two numbers. Of course, this is a silly
example, since R already does this easily, but its simplicity will allow us to see how functions
are constructed in general.
Now for something more complex, we will create a function to carry out Bayesian inferences
for a single normal mean, when the variance is assumed known.
The theory is the following (to remind you of what we covered last class):
Suppose that your data set, X = (x1 , x2 , . . . , xn ), with sample size n is known to arise from
a Normal distribution, xi ∼ N (µ, σ 2 ), i = 1, 2, . . . , n, i.e., the likelihood function for the data
is normal. Suppose also that the variance of the data, σ 2 is a known constant, and so the
only unknown parameter to be estimated is the mean. Suppose that the prior distribution
for the unknown mean µ is N (θ, τ 2 ).
If we combine the information in the prior distribution with the information in the data set
via Bayes Theorem, we derive the posterior distribution, which summarizes current informa-
tion about the mean µ. It happens that the posterior is also normal, and the exact formula
16
is:
θ
+ nx
−1 !
1 n
τ2 σ2
µposterior ∼ N 1 n , 2
+ 2 ,
τ2
+ σ2 τ σ
where x is the mean of the observed data, and [·]−1 indicates matrix inversion.
n<- length(x)
x.bar <- mean(x)
post.mean.numerator <- prior.mean/prior.var + n*x.bar/data.var
post.mean.denominator <- 1/prior.var + n/data.var
post.mean <- post.mean.numerator/post.mean.denominator
post.var <- (1/(1/prior.var + n/data.var))
Another useful way to create output in functions is illustrated below. This format is especially
useful if you want to use the program’s results later on in your R session.
17
> post.normal.mean <- function(x, prior.mean, prior.var, data.var)
{
####################################################################
# R function for Bayesian analysis of normal mean, variance known #
# Parameters included are: #
# #
# Inputs: #
# #
# x = vector of data #
# prior.mean = prior mean #
# prior.var = prior variance #
# data.var = assumed known variance of data #
# #
# Outputs: #
# #
# post.mean = posterior mean #
# post.var = posterior variance #
# #
####################################################################
n<- length(x)
x.bar <- mean(x)
post.mean.numerator <- prior.mean/prior.var + n*x.bar/data.var
post.mean.denominator <- 1/prior.var + n/data.var
post.mean <- post.mean.numerator/post.mean.denominator
post.var <- (1/(1/prior.var + n/data.var))
$post.var
[1] 0.09999
A useful function relating to functions is args, which reminds you of the required arguments
of the function. For example:
> args(post.normal.mean)
function (x, prior.mean, prior.var, data.var)
18
8 Conclusion
R is a very powerful package, we have just scratched the surface of what is available. Its
interactive nature makes it more pleasant to use than “batch” programs like SAS, and
its developmental environment makes it the first choice of many, if not the majority, of
statisticians today.
• A useful book titled “Analysis of Epidemiological Data Using R and Epicalc,” available
at
http://cran.r-project.org/doc/contrib/Epicalc_Book.pdf
http://pj.freefaculty.org/R/statsRus.html
I would be happy to answer any questions you have about R, just email me.
19