Introduction To R
Phil Spector
Statistical Computing Facility
Department of Statistics
University of California, Berkeley
Some Basics
There are three types of data in R: numeric, character and
logical.
R supports vectors, matrices, lists and data frames.
Objects can be assigned values using an equal sign (=) or the
special <- operator.
R is highly vectorized - almost all operations work equally well
on scalars and arrays.
All the elements of a matrix or vector must be of the same type.
Lists provide a very general way to hold a collection of
arbitrary R objects.
A data frame is a cross between a matrix and a list: columns
(variables) of a data frame can be of different types, but they
all must be the same length.
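As a minimal sketch of these object types (the object names v, lst and df are illustrative and not part of the original slides):

```r
# A numeric vector, a list and a data frame (all names are illustrative)
v <- c(1.5, 2, 3.25)
lst <- list(id = 1L, tags = c("a", "b"), flag = TRUE)
df <- data.frame(name = c("x", "y", "z"), value = v)

doubled <- v * 2          # vectorized: every element is doubled
print(doubled)
str(lst)                  # a list can hold objects of different types and lengths
print(sapply(df, class))  # columns of a data frame may differ in type
```

Every column of df has three elements; data.frame() recycles an input only when its length divides the number of rows evenly, and otherwise reports an error.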
Using R
Typing the name of any object will display a printed
representation. Alternatively, the print() function can be
used to display the entire object.
Element numbers are displayed in square brackets.
Typing a function's name will display its argument list and
definition, but sometimes it's not very enlightening.
The str() function shows the structure of an object.
If you don't assign an expression to an R object, R will display
the results, but they are also stored in the .Last.value object.
Function calls require parentheses, even if there are no
arguments. For example, type q() to quit R.
Square brackets ([ ]) are used for subscripting, and can be
applied to any subscriptable value.
Getting Help
To view the manual page for any R function, use the
help(functionname) command, which can be abbreviated by
typing a question mark (?) followed by the function name.
The help.search("topic") command will often help you get
started if you don't know the name of a function.
The command help.start() will open a browser pointing to a
variety of (locally stored) information about R, including a search
engine and access to more lengthy PDF documents. Once the
browser is open, all help requests will be displayed in the browser.
Many functions have examples, available through the example()
function; general demonstrations of R capabilities can be seen
through the demo() function.
Libraries
Libraries in R provide routines for a large variety of data
manipulation and analysis. If something seems to be missing from
R, it is most likely available in a library.
You can see the libraries installed on your system with the
command library() with no arguments. You can view a brief
description of the library using library(help=libraryname )
Finally, you can load a library with the command
library(libraryname )
Many libraries are available through the CRAN (Comprehensive R
Archive Network) at
http://cran.r-project.org/src/contrib/PACKAGES.html .
You can install libraries from CRAN with the install.packages()
function, or through a menu item in Windows. If you don't have
administrative permissions, use the lib= argument of
install.packages() to install into a directory you can write to,
and the lib.loc= argument of library() to load from it.
Search Path
When you type a name into the R interpreter, it checks through
several directories, known as the search path, to determine what
object to use. You can view the search path with the command
search(). To find the names of all the objects in a directory on
the search path, type objects(pos=num ), where num is the
numerical position of the directory on the search path.
You can add a database to the search path with the attach()
function. To make objects from a previous session of R available,
pass attach() the location of the appropriate .RData file. To refer
to the elements of a data frame or list without having to retype the
object name, pass the data frame or list to attach(). (You can
temporarily avoid having to retype the object name by using the
with() function.)
Sizes of Objects
The nchar() function returns the number of characters in a
character string. When applied to numeric data, it returns the
number of characters in the printed representation of the number.
The length() function returns the number of elements in its
argument. Note that, for a matrix, length() will return the total
number of elements in the matrix, while for a data frame it will
return the number of columns in the data frame.
For arrays, the dim() function returns a vector with the dimensions of
its argument. For a matrix, it returns a vector of length two with
the number of rows and number of columns. For convenience, the
nrow() and ncol() functions can be used to get either dimension
of a matrix directly. For non-arrays dim() returns a NULL value.
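A short sketch of these size functions, using a small matrix m and data frame df that are made up for illustration:

```r
m <- matrix(1:6, nrow = 2)                       # 2 rows, 3 columns
df <- data.frame(a = 1:3, b = c("x", "y", "z"))  # 3 rows, 2 columns

nchar("hello")   # 5 characters
nchar(12345)     # 5: characters in the printed representation
length(m)        # 6: total number of elements
length(df)       # 2: number of columns
dim(m)           # 2 3
nrow(m)          # 2
ncol(m)          # 3
dim(1:10)        # NULL: a plain vector is not an array
```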
Finding Objects
The objects() function, called with no arguments, prints the
objects in your working database. This is where the objects you
create will be stored.
The pos= argument allows you to look in other elements of your
search path. The pat= argument allows you to restrict the search
to objects whose name matches a pattern. Setting the all.names=
argument to TRUE will display object names which begin with a
period, which would otherwise be suppressed.
The apropos() function accepts a regular expression, and returns
the names of objects anywhere in your search path which match
the expression.
Combining Objects
The c() function attempts to combine objects in the most general
way. For example, if we combine a matrix and a vector, the result
is a vector.
> c(matrix(1:4,ncol=2),1:3)
[1] 1 2 3 4 1 2 3
Note that the list() function preserves the identity of each of its
elements:
> list(matrix(1:4,ncol=2),1:3)
[[1]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[2]]
[1] 1 2 3
Subscripting
Subscripting in R provides one of the most effective ways to
manipulate and select data from vectors, matrices, data frames and
lists. R supports several types of subscripts:
Empty subscripts - allow modification of an object while
preserving its size and type.
x = 1 creates a new scalar, x, with a value of 1, while
x[] = 1 changes each value of x to 1.
Empty subscripts also allow referring to the i-th row of a
data frame or matrix as matrix[i,] or the j-th column as
matrix[,j].
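The following sketch illustrates empty subscripts on a small vector and matrix (both invented for this example):

```r
x <- c(4, 8, 15)
x[] <- 1            # x keeps its length of 3; every element becomes 1
print(x)

m <- matrix(1:6, nrow = 2)
second_row <- m[2, ]    # empty column subscript: the whole second row
third_col <- m[, 3]     # empty row subscript: the whole third column
print(second_row)       # 2 4 6
print(third_col)        # 5 6
```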
Subscripts (cont'd)
Negative numeric subscripts - allow exclusion of selected
elements
Zero subscripts - subscripts with a value of zero are ignored
Character subscripts - used as an alternative to numeric
subscripts
Elements of R objects can be named. Use names() for vectors
or lists, dimnames(), rownames() or colnames() for data
frames and matrices. For lists and data frames, the notation
object$name can also be used.
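A brief sketch of these subscript types, using a named vector x and a named list lst that are assumptions made for illustration:

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

without_first <- x[-1]     # negative subscript: everything except the first element
middle <- x[c(-1, -4)]     # several exclusions at once: b and c remain
zero_ignored <- x[c(0, 2)] # the zero is ignored, so only element 2 (b) is returned
by_name <- x["c"]          # character subscript: select by name

lst <- list(first = 1:3, second = letters[1:2])
lst$second                 # same element as lst[["second"]]
names(x)                   # "a" "b" "c" "d"
```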
More on Subscripts
By default, when you extract a single column from a matrix or
data.frame, it becomes a simple vector, which may confuse some
functions. Furthermore, if the column was named, the name will be
lost. To prevent this from happening, you can pass the drop=FALSE
argument to the subscript operator:
> mx = matrix(c(1:4,4:7,5:8),ncol=3,byrow=TRUE,
+ dimnames=list(NULL,c("one","two","three")))
> mx[,3]
[1] 3 5 5 8
> mx[,3,drop=FALSE]
     three
[1,]     3
[2,]     5
[3,]     5
[4,]     8
[[ Subscripting Operator
A general principle in R is that subscripted objects retain the mode
of their parent. For vectors and arrays, this rarely causes a
problem, but for lists (and data frames treated like lists), R will
often have problems working directly with such objects.
> mylist = list(1:10,10:20,30:40)
> mean(mylist[1])
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(mylist[1])
For named lists, the problem can also be avoided using the $
notation.
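As a sketch of the usual remedy, the double-bracket operator [[ extracts the element itself rather than a one-element list. The named list below is an assumption added for illustration:

```r
mylist <- list(1:10, 10:20, 30:40)

class(mylist[1])             # "list": single brackets keep the mode of the parent
class(mylist[[1]])           # "integer": double brackets return the element itself
m1 <- mean(mylist[[1]])      # 5.5

named <- list(a = 1:10, b = 10:20)
m2 <- mean(named$b)          # 15, equivalent to mean(named[["b"]])
print(c(m1, m2))
```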
Missing Values
In R, missing values are represented by the symbol NA. You can
assign a missing value by setting a variable equal to NA, but you
must use the is.na() function to test for a missing value.
Missing values are propagated in all calculations, so the presence of
even a single missing value can result in a variety of problems.
Many statistical functions provide a na.rm= argument to remove
missing values before computations. Alternatively, use logical
subscripting to easily extract non-missing values:
> values = c(12,NA,19,15,12,17,14,NA,19)
> values[!is.na(values)]
[1] 12 19 15 12 17 14 19
> vv = matrix(values,ncol=3,byrow=TRUE)
> vv[!is.na(vv[,2]),,drop=FALSE]
     [,1] [,2] [,3]
[1,]   15   12   17
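A small sketch of how missing values propagate, and why is.na() is needed instead of a comparison with NA; it reuses the values vector from the example above:

```r
values <- c(12, NA, 19, 15, 12, 17, 14, NA, 19)

plain_mean <- mean(values)                # NA: the missing values propagate
clean_mean <- mean(values, na.rm = TRUE)  # mean of the 7 non-missing values
compared <- values == NA                  # every comparison with NA is itself NA
n_missing <- sum(is.na(values))           # 2

print(plain_mean)
print(clean_mean)
print(n_missing)
```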
TTY      TIME     CMD
pts/1    00:00:00 tcsh
pts/1    00:00:02 R
pts/1    00:00:00 ps
Printing
The print() function can be used to print any R object, and is
silently invoked when the value of an expression is not assigned to
an object. For lists and arrays, it will always include subscripting
information.
> print(7)
[1] 7
> print(matrix(c(1,2,3,4),ncol=2))
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Output Destinations
By default, the output from an interactive R session is sent to the
screen. To divert this output to a file, use the sink() or
capture.output() function. These provide an exact copy of the
session's output.
To write the contents of a matrix or vector to a file, use the
write() function. Remember that when writing a matrix, it will
be written to the file by columns, so the transpose function (t())
may be useful. The ncolumns= argument can be used to specify the
number of data values on each line; the append= argument can be
used to avoid overwriting an existing file.
To write a data frame to a file, use the write.table() function; it
is basically the mirror image of the read.table() function.
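The sketch below exercises these output functions on temporary files. The file names come from tempfile(), and the matrix and data frame are invented for illustration:

```r
m <- matrix(1:6, nrow = 2)

# write() emits values column by column, so transpose to get one matrix row per line
out <- tempfile(fileext = ".txt")
write(t(m), file = out, ncolumns = ncol(m))
written <- readLines(out)
print(written)                     # "1 3 5" "2 4 6"

# capture.output() returns the printed representation as character strings
txt <- capture.output(print(m))

# write.table() and read.table() round-trip a data frame
df <- data.frame(id = 1:2, score = c(3.5, 4))
df_file <- tempfile(fileext = ".txt")
write.table(df, file = df_file, row.names = FALSE)
back <- read.table(df_file, header = TRUE)
print(back)
```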
Recycling of Vectors
When a vector is involved in an operation that requires more
elements than the vector contains, the values in the vector are
recycled.
> x = matrix(1:3,nrow=2,ncol=6)
> x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    2    1    3    2
[2,]    2    1    3    2    1    3
Operators
All of the binary operators in R are vectorized, operating element
by element on their arguments, recycling values as needed. (The
exceptions are && and ||, which operate on single logical values.)
These operators include:
+      addition               -      subtraction
*      multiplication         /      division
^      exponentiation         %%     modulus
%/%    integer division
<      less than              >      greater than
<=     l.t. or equal          >=     g.t. or equal
==     equality               !=     non-equality
&      elementwise and        &&     scalar and
|      elementwise or         ||     scalar or
!      negation               xor()  exclusive or
Rounding Functions
The following functions are available for rounding numerical values:
round() - uses IEEE standard to round up or down; optional
digits= argument controls precision
signif() - rounds numerical values to the specified digits=
number of significant digits
trunc() - rounds by removing non-integer part of number
floor(), ceiling() - rounds to integers not greater or not
less than their arguments, respectively
zapsmall() - accepts a vector or array, and makes numbers
close to zero (compared to others in the input) zero. digits=
argument controls the rounding.
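A quick sketch of how the rounding functions differ, using arbitrary example values:

```r
r1 <- round(2.567, digits = 2)         # 2.57
r2 <- round(c(0.5, 1.5, 2.5))          # 0 2 2: IEEE rounding goes to the even digit
s1 <- signif(1234.5678, digits = 3)    # 1230
t1 <- trunc(-2.567)                    # -2: drops the fractional part
f1 <- floor(-2.567)                    # -3: largest integer not greater than the value
c1 <- ceiling(-2.567)                  # -2: smallest integer not less than the value
z1 <- zapsmall(c(1, 1e-20))            # 1 0: tiny values relative to the largest become 0

print(list(r1, r2, s1, t1, f1, c1, z1))
```

Note that trunc() and floor() agree for positive numbers and differ only for negative ones.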
Non-vectorized functions
Although most functions in R are vectorized, returning objects which are
the same size and shape as their input, some will always return a single
logical value.
any() tests if any of the elements of its arguments meet a particular
condition; all() tests if they all do.
> x = c(7,3,12,NA,13,8)
> any(is.na(x))
[1] TRUE
> all(x > 0)
[1] NA
> all(x > 0,na.rm=TRUE)
[1] TRUE
Categorical Variables
Categorical variables in R are known as factors, and are stored as
integer codes along with a set of labels. The cut() function creates
factors from continuous variables.
> x = c(17,19,22,43,14,8,12,19,20,51,8,12,27,31,44)
> cut(x,3)
[1] (7.96,22.3] (7.96,22.3] (7.96,22.3] (36.7,51] (7.96,22.3] (7.96,22.3]
[7] (7.96,22.3] (7.96,22.3] (7.96,22.3] (36.7,51] (7.96,22.3] (7.96,22.3]
[13] (22.3,36.7] (22.3,36.7] (36.7,51]
Levels: (7.96,22.3] (22.3,36.7] (36.7,51]
> cut(x,3,labels=c("low","medium","high"))
 [1] low    low    low    high   low    low    low    low    low    high
[11] low    low    medium medium high
Levels: low medium high
Tabulation
The table() function is the main tool for tabulating data in R.
Given a vector, it produces a frequency table:
> x = c(7,12,19,7,19,21,12,14,17,12,19,21)
> table(x)
x
 7 12 14 17 19 21
 2  3  1  1  3  2
Notice that the output is named based on the data values. To use
them as numbers, they must be passed to the as.numeric()
function:
> sum(table(x) * as.numeric(names(table(x))))
[1] 180
Cross-tabulation
If several equal length vectors are passed to table(), it will output
an array with the counts of the vectors' cross-tabulation:
> x1 = c(1,2,3,2,1,3,2,3,1)
> x2 = c(2,1,2,3,1,3,2,2,3)
> x3 = c(1,2,3,3,2,1,2,1,2)
> table(x1,x2)
   x2
x1  1 2 3
  1 1 1 1
  2 1 1 1
  3 0 2 1
ftable()
> tbl = table(x1,x2,x3)
> tbl
, , x3 = 1

   x2
x1  1 2 3
  1 0 1 0
  2 0 0 0
  3 0 1 1

, , x3 = 2

   x2
x1  1 2 3
  1 1 0 1
  2 1 1 0
  3 0 0 0

, , x3 = 3

   x2
x1  1 2 3
  1 0 0 0
  2 0 0 1
  3 0 1 0

> ftable(tbl)
      x3 1 2 3
x1 x2
1  1     0 1 0
   2     1 0 0
   3     0 1 0
2  1     0 1 0
   2     0 1 0
   3     0 0 1
3  1     0 0 0
   2     1 0 1
   3     1 0 0
Date Values
The as.Date() function converts a variety of character date
formats into R dates, stored as the number of days since January 1,
1970. The format= argument is a string composed of codes such as
%Y for full year, %y for 2-digit year, %m for month number, and %d
for day number.
Once converted to dates, the following functions will return
information about dates: weekdays(), months(), quarters() and
julian().
In addition, the cut() function allows the following choices for the
breaks= argument: "day", "week", "month", or "year".
Note: Alternative date and time formats are available through the
chron library and the POSIXct class.
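A minimal sketch of date conversion and the helper functions; the dates themselves are arbitrary examples:

```r
# Convert a US-style date string using format codes
d <- as.Date("3/15/2005", format = "%m/%d/%Y")
days_since_epoch <- as.numeric(d)   # days since January 1, 1970
print(format(d))                    # "2005-03-15"
print(days_since_epoch)

weekdays(d)    # name of the weekday (depends on the locale)
months(d)      # name of the month (depends on the locale)
qtr <- quarters(d)

# cut() can group dates by calendar unit
dates <- as.Date(c("2005-01-10", "2005-01-25", "2005-02-03"))
by_month <- table(cut(dates, breaks = "month"))
print(by_month)
```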
file.exists()
file.remove()
file.rename()
file.append()
file.copy()
file.symlink()
dir.create()
file.show()
Assignment Functions
Many functions in R can be used on the left hand side of an
assignment operator, allowing various properties of objects to be
modified conveniently. Some examples include names(), diag(),
dimnames(), length(), dim() and substr().
For example, one way to reshape a matrix would be to assign a
value to the dim() function:
> m = matrix(1:12,4,3)
> m
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> dim(m) <- c(6,2)
> m
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    4   10
[5,]    5   11
[6,]    6   12
Generating Sequences
The colon (:) operator creates simple integer sequences. While it is
often used as a subscript to slice an array, it can also be used on
its own:
> letters[10:15]
[1] "j" "k" "l" "m" "n" "o"
> x = 1:100
> mean(x)
[1] 50.5
sample()
The sample() function returns random samples or permutations of
a vector. Passed an integer, n, it returns a random permutation of
the integers from 1 to n.
The size= argument specifies the size of the returned vector.
The prob= argument provides a vector of probabilities for sampling.
By default, sampling is done without replacement; the
replace=TRUE option will result in sampling with replacement from
the specified input.
> sample(10)
[1] 7 6 1 2 8 9 3 4 10 5
> sample(c("a","b","c","d","e"),size=10,replace=TRUE)
[1] "d" "c" "b" "b" "c" "a" "a" "e" "b" "a"
> sample(c("a","b","c"),size=10,prob=c(.5,.25,.25),replace=TRUE)
[1] "c" "b" "a" "a" "a" "a" "a" "c" "c" "a"
Options
R has a number of global options which affect the way it behaves.
The options() function, called with no arguments, will display the
value of all options. Options can be set with options() by passing
optionname=value pairs. Some common options include:
prompt - controls prompt(default: ">")
width - number of columns in output
digits - number of digits displayed
height - number of lines per page
browser - used in help.start()
papersize - for Postscript plots (default: "A4")
Functional Programming
R provides a number of tools that map or apply a function to
various parts of a matrix, data frame, vector or list. These are often
more useful than traditional programming constructs because they
automatically determine the size of the object that they return, and
any names assigned to the objects being manipulated are preserved.
These functions include
apply() - operate on selected dimensions of arrays
lapply(), sapply() - operate on each element of a vector or
list
tapply() - operate on elements based on a grouping variable
mapply() - multivariate extension of sapply
In addition, some functions (for example by(), sweep(), and
aggregate()) wrap these functions in a more convenient form.
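The following sketch shows the basic members of this family on small made-up objects:

```r
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)        # apply over rows (dimension 1): 9 12
col_max <- apply(m, 2, max)         # apply over columns (dimension 2): 2 4 6

lst <- list(a = 1:3, b = 4:10)
lens <- lapply(lst, length)         # returns a list, names preserved
means <- sapply(lst, mean)          # simplified to a named vector: a = 2, b = 7

sums <- mapply(function(x, y) x + y, 1:3, 4:6)   # elementwise over two inputs: 5 7 9

print(row_sums); print(col_max); print(means); print(sums)
```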
apply()
apply() will execute a function on every row or every column of a
matrix. Suppose the data frame finance contains the variables
Name, Price, Volume and Cap, and we want the mean of each of the
numerical variables.
> apply(finance[-1],2,mean)
      Price      Volume         Cap
   21.46800 51584.51840          NA
tapply()
tapply() allows you to map a function to a vector, broken up into
groups defined by a second vector or list of vectors. Generally the
vectors will all be from the same matrix or data frame, but they
need not be.
Suppose we have a data frame called banks with columns name,
state and interest.rate, and we wish to find the maximum
interest rate for banks in each of the states.
with(banks,tapply(interest.rate,state,max,na.rm=TRUE))
Note the use of the with() function to avoid retyping the data
frame name (banks) when referring to its columns.
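Since the banks data frame is not shown, the sketch below builds a small hypothetical stand-in with the same column names to show what the call returns:

```r
# Hypothetical data; only the column names come from the text
banks <- data.frame(name = c("A", "B", "C", "D"),
                    state = c("CA", "CA", "NY", "NY"),
                    interest.rate = c(1.5, NA, 2.0, 2.5))

# Extra arguments (here na.rm=TRUE) are passed on to max()
best <- with(banks, tapply(interest.rate, state, max, na.rm = TRUE))
print(best)    # one value per state, named by state
```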
aggregate()
The aggregate() function presents a summary of a scalar valued
statistic, broken down by one or more groups. While similar
operations can be performed using tapply(), aggregate()
presents the results in a data frame instead of a table.
> testdata = data.frame(one=c(1,1,1,2,2,2),two=c(1,2,3,1,2,3),
+                       three=rnorm(6))
> aggregate(testdata$three,list(testdata$one,testdata$two),mean)
  Group.1 Group.2           x
1       1       1 -0.62116475
2       2       1 -0.68367887
3       1       2  0.53058202
4       2       2 -0.61788020
5       1       3  0.02823623
6       2       3 -1.01561697
> tapply(testdata$three,list(testdata$one,testdata$two),mean)
           1          2           3
1 -0.6211648  0.5305820  0.02823623
2 -0.6836789 -0.6178802 -1.01561697
aggregate() (cont'd)
If the first argument to aggregate() is a matrix or multicolumn
data frame, the statistic will be calculated separately for each
column.
> testdata = cbind(sample(1:5,size=100,replace=TRUE),
+                  matrix(rnorm(500),100,5))
> dimnames(testdata) = list(NULL,c("grp","A","B","C","D","E"))
> aggregate(testdata[,-1],list(grp=testdata[,1]),min)
  grp         A         B          C         D         E
1   1 -2.362126 -2.777772 -1.8970320 -2.152187 -1.966436
2   2 -1.690446 -2.202395 -1.6244721 -1.637200 -1.036338
3   3 -2.078137 -1.571017 -2.0555413 -1.401563 -1.881126
4   4 -1.325673 -1.660392 -0.8933617 -2.205563 -1.749313
5   5 -1.644125 -1.518352 -2.4064893 -1.664716 -1.994624
split()
While by() and aggregate() simplify many tasks which arise in
data manipulation, R provides the split() function which takes a
data frame, and creates a list of data frames by breaking up the
original data frame based on the value of a vector. In addition to
being useful in its own right, the returned list can be passed to
lapply() or sapply() as an alternative way of doing repetitive
analyses on subsets of data.
> data(Orange)
> trees = split(Orange,Orange$Tree)
> sapply(trees,function(x)coef(lm(circumference~age,data=x)))
                      3           1         5          2          4
(Intercept) 19.20353638 24.43784664 8.7583446 19.9609034 14.6376202
age          0.08111158  0.08147716 0.1110289  0.1250618  0.1351722
Loops in R
R provides three types of loops, although they are not used as often
as in most other programming languages.
for loop
for(var in sequence) expression
while loop
while(condition) expression
repeat loop
repeat expression
In all cases, the expressions must be surrounded by curly braces if
they are more than one line long.
Unassigned objects are not automatically printed inside of loops;
you must explicitly use the print() or cat() functions.
To terminate loops at any time, use the break statement; to
continue to the next iteration use the next statement.
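A short sketch of the three loop forms together with next and break; the variable names and limits are arbitrary:

```r
# for loop: sum the odd numbers up to 7, skipping evens and stopping after 7
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next      # skip to the next iteration
  if (i > 7) break           # leave the loop entirely
  total <- total + i
  cat("i =", i, "running total =", total, "\n")   # explicit output is required
}

# while loop: smallest n such that 2^n is at least 100
n <- 0
while (2^n < 100) n <- n + 1

# repeat loop: runs until break is reached
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
print(c(total, n, count))
```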
Sorting
The sort() function will return a sorted version of a vector.
> x = c(19,25,31,15,12,43,82,22)
> sort(x)
[1] 12 15 19 22 25 31 43 82
Text Substitution
The functions sub() and gsub() can be used to create new
character strings by replacing regular expressions (first argument)
with replacement strings (second argument) in an existing
string (third argument). The ignore.case=TRUE argument can be
provided to ignore the case of the regular expression.
The only difference between the two functions is that sub() will
only replace the first occurrence of a regular expression within
elements of its third argument, while gsub() replaces all such
occurrences.
> vars = c("date98","size98","x98weight98")
> sub("98","04",vars)
[1] "date04"      "size04"      "x04weight98"
> gsub("98","04",vars)
[1] "date04"      "size04"      "x04weight04"
> two
  a   y
1 9 108
2 3 209
3 2 107
4 7 114
5 8 103
reshape()
Suppose we have a data frame named firmdata, with several
observations of a variable x recorded at each of several times, for
each of several firms, and we wish to create one observation per
firm, with new variables for each of the values of x contained in
that observation. In R terminology, the original format is called
long, and the desired format is called wide.
Our original data set would look like this:
firm time  x
   7    1  7
   7    2 19
   7    3 12
  12    1 13
  12    2 18
  12    3  9
  19    1 21
  19    2 15
  19    3  7
reshape(), cont'd
The following arguments to reshape explain the role that the
variables play in the transformation. Each argument is a character
string or vector of strings.
timevar= the variable which specifies which new variable will
be created from a given observation
idvar= the variable which identifies the unique observations in
the new data set. There will be one observation for each level
of this variable.
v.names= the variables of interest which were recorded multiple
times
direction= "long" or "wide" to describe the desired result
varying= when converting to long format, the variables
which will be broken up to create the new (v.names) variable.
reshape(), cont'd
The following R statements perform the desired transformation:
> newdata = reshape(firmdata,timevar="time",idvar="firm",
+                   v.names="x",direction="wide")
> newdata
  firm x.1 x.2 x.3
1    7   7  19  12
4   12  13  18   9
7   19  21  15   7
Once converted, a call to reshape() in the opposite direction needs
no other arguments, because the reshaping information is stored
with the data.
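To make the example reproducible, the sketch below builds firmdata from the values shown earlier, converts it to wide format, and converts it back. The direction is passed explicitly on the way back as a conservative choice; the name longagain is illustrative:

```r
firmdata <- data.frame(firm = rep(c(7, 12, 19), each = 3),
                       time = rep(1:3, times = 3),
                       x = c(7, 19, 12, 13, 18, 9, 21, 15, 7))

# long -> wide: one row per firm, one x column per time
newdata <- reshape(firmdata, timevar = "time", idvar = "firm",
                   v.names = "x", direction = "wide")
print(newdata)

# wide -> long: the remaining arguments are recovered from attributes
# that the first call stored on newdata
longagain <- reshape(newdata, direction = "long")
print(dim(longagain))
```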
expand.grid()
The expand.grid() function accepts vectors or lists of vectors, and
creates a data frame with one row for each combination of values
from the vectors. Combined with apply(), it can easily perform a
grid search over a specified range of values.
> values = expand.grid(x=seq(-3,3,length=10),
+                      y=seq(-3,3,length=10))
> result = cbind(values,
+         result=apply(values,1,function(z)sin(z[1])*cos(z[2])))
> dim(result)
[1] 100   3
> result[which(result[,3] == max(result[,3])),]
           x  y    result
3  -1.666667 -3 0.9854464
93 -1.666667  3 0.9854464
norm    Normal
exp     Exponential
gamma   Gamma
pois    Poisson
binom   Binomial
chisq   Chi-square
t       Student's t
unif    Uniform
Descriptive Statistics
Among the functions for descriptive, univariate statistics are
mean()        median()      range()
kurtosis()*   skewness()*   var()
mad()         sd()          IQR()
weighted.mean()
* - in e1071 library
Hypothesis Tests
R provides a number of functions for simple hypothesis testing.
They each have an alternative= argument (with choices
two.sided, less, and greater), and a conf.level= argument for
prespecifying a confidence level.
Among the available functions are:
prop.test     Equality of proportions
wilcox.test   Wilcoxon test
binom.test    Exact binomial test
chisq.test    Contingency tables
t.test        Student's t-test
var.test      Equality of variances
cor.test      Correlation coefficient
ks.test       Goodness of fit
Results of t.test()
        Welch Two Sample t-test

data:  x and y
t = 1.6422, df = 18.99, p-value = 0.117
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.766232  6.348050
sample estimates:
mean of x mean of y
 16.09091  13.30000
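The data behind the output above are not shown, so the sketch below uses simulated samples (an assumption) to show how such a result is produced and how its components can be extracted:

```r
set.seed(1)                           # simulated data, for illustration only
x <- rnorm(11, mean = 16, sd = 4)
y <- rnorm(10, mean = 13, sd = 4)

res <- t.test(x, y, alternative = "two.sided", conf.level = 0.95)
print(res)

# The result is a list; individual pieces can be pulled out by name
res$statistic     # t statistic
res$p.value       # p-value
res$conf.int      # confidence interval for the difference in means
res$estimate      # the two sample means
```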
Statistical Models in R
R provides a number of functions for statistical modeling, along
with a variety of functions that extract and display information
about those models. Using object-oriented design principles, most
modeling functions work in similar ways and use similar arguments,
so changing a modeling strategy is usually very simple. Among the
modeling functions available are:
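lm       Linear Models
aov      Analysis of Variance
glm      Generalized Linear Models
gam(1)   Generalized Additive Models
tree     Classification and Regression Trees
cph(2)   Cox Proportional Hazards
nls      Non-linear Models
loess    Local Regression Models
(1) - in mgcv library
(2) - in Design library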
Formulas
R provides a notation to express the idea of a statistical model
which is used in all the modeling functions, as well as some
graphical functions. The dependent variable is listed on the
left-hand side of a tilde (~), and the independent variables are
listed on the right-hand side, joined by plus signs (+).
Most often, the variables in a formula come from a data frame, but
this is not a requirement; in addition, expressions involving
variables can also be used in formulas.
Inside of formulas, some symbols have special meanings, as shown
below.
+      add terms
-      remove terms
:      interaction
*      crossing
%in%   nesting
^      limit crossing
Example of Formulas
Additive Model
y ~ x1 + x2 + x3
Additive Model without Intercept
y ~ x1 + x2 + x3 - 1
Regress response versus all other variables in data frame
response ~ .
Fully Factorial ANOVA model (a, b, and c are factors)
y ~ a*b*c
Factorial ANOVA model limited to depth=2 interactions
y ~ (a*b*c)^2
Polynomial Regression
y ~ x + I(x^2) + I(x^3)
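As a sketch of how these formulas behave in practice, the code below fits a few of them with lm() on the built-in mtcars data set (chosen here only for illustration) and inspects how a crossing formula expands:

```r
fit_add   <- lm(mpg ~ wt + hp, data = mtcars)        # additive model
fit_noint <- lm(mpg ~ wt + hp - 1, data = mtcars)    # same model, no intercept
fit_poly  <- lm(mpg ~ wt + I(wt^2), data = mtcars)   # I() protects arithmetic

print(names(coef(fit_add)))     # "(Intercept)" "wt" "hp"
print(names(coef(fit_noint)))   # "wt" "hp"
print(names(coef(fit_poly)))

# terms() shows which terms a formula expands to, without fitting anything
labels2 <- attr(terms(y ~ (a * b * c)^2), "term.labels")
print(labels2)                  # main effects plus all two-way interactions
```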
formula=
weights=
na.action=   values: na.fail, na.omit, na.pass (do nothing)
Graphics in R
The graphics system in R consists of three components:
High level functions - These functions produce entire plots with
a single command. Examples include barplot(), boxplot(),
contour(), dotchart(), hist(), pairs(), persp(), pie(),
and plot().
Low level functions - These functions add to existing plots.
Examples include abline(), arrows(), axis(), frame(),
legend(), lines(), points(), rug(), symbols(), text(), and
title()
Graphics parameters - Accessed through either plotting
commands or the par() function, these are arguments that
change the layout or appearance of a plot. These parameters
control things like margins, text size, tick marks, plotting style,
and overall size of the plot.
Device Drivers
By default, a window will automatically be opened to display
your graphics. Some other devices available include postscript(),
pdf(), bitmap() and jpeg(). (See the help for Devices for a
complete list.)
To use an alternative driver, either call the appropriate function
before plotting, or use the dev.copy() function to specify a device
to which to copy the current plot, always ending with a call to
dev.off().
For example, to create a PostScript plot, use statements like
postscript(file="myplot.ps")
... plotting commands go here ...
dev.off()
[Figure: four panels titled Cosine, Sine, Tangent and Sin^2/Cosine, plotting cos(x), sin(x), tan(x) and sin(x)^2/cos(x) against x]
Plot Types
The plot() function accepts a type= argument, which can be set
to any of the following values: "p" for points, "l" for lines, "b" for
both, "s" for stairstep, and "n" for none.
By setting type="n", axes will be drawn, but no points will be
displayed, allowing multiple lines to be plotted on the same set of
axes. (The matplot() function is also useful in this setting.)
> data(USArrests)
> popgroup = cut(USArrests$UrbanPop,3,labels=c("Low","Medium","High"))
> plot(range(USArrests$Murder),range(USArrests$Rape),type="n",
+      xlab="Murder",ylab="Rape")
> points(USArrests$Murder[popgroup=="Low"],
+        USArrests$Rape[popgroup=="Low"],col="Red")
> points(USArrests$Murder[popgroup=="Medium"],
+        USArrests$Rape[popgroup=="Medium"],col="Green")
> points(USArrests$Murder[popgroup=="High"],
+        USArrests$Rape[popgroup=="High"],col="Blue")
Legends
The legend() function can produce a legend on a plot, displaying
points or lines with accompanying text. The function accepts x=
and y= arguments to specify the location of the legend, and a
legend= argument with the text to appear in the legend, as well as
many graphics parameters, often in vector form to accommodate the
multiple plots or points on a graph.
The following code could be used to place a legend on the previous
plot; the title() function is also used to add a title.
legend(2,44,levels(popgroup),col=c("Red","Green","Blue"),
pch=1)
title("Rape vs Murder by Population Density")
The locator() function can be used to interactively place the
legend.
[Figure: "Rape vs Murder by Population Density", a scatterplot of Rape versus Murder with a legend for the Low, Medium and High groups]
Plotting Limits
While R's default will usually produce an attractive plot, it is
sometimes useful to restrict plotting to a reduced range of points,
or to expand the range of points. Many plotting routines accept
xlim= and ylim= arguments, which can be set to a vector of length
2 giving the range of points to be displayed.
For example, the airquality data set contains data on different
measures of the air quality in New York City. We could plot the
ozone level versus the temperature for the complete set of points
with the following statements:
data(airquality)
with(airquality,plot(Temp,Ozone))
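A sketch of restricting the plotted range with xlim= and ylim=. The particular limits and the temporary PDF file are assumptions chosen for illustration:

```r
data(airquality)

f <- tempfile(fileext = ".pdf")    # hypothetical output file for this sketch
pdf(file = f)
# Only points with Temp in [60, 80] and Ozone in [20, 100] fall inside the axes
with(airquality, plot(Temp, Ozone, xlim = c(60, 80), ylim = c(20, 100)))
dev.off()
```

Points outside the limits are not removed from the data; they are simply not drawn within the plotting region.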
[Figure: scatterplot of Ozone versus Temp for the complete airquality data]
[Figure: scatterplot of Ozone versus Temp restricted to a reduced range of both variables]
Custom Axes
The axis() function allows creation of custom axes. Graphics
parameters xaxt= and yaxt= can be set to "n" to suppress the
default creation of axes when producing a plot.
Arguments to axis() include side= (1=bottom, 2=left, 3=top,
and 4=right), at=, a vector of locations for tick marks and labels,
and labels= to specify the labels.
We can create a barplot showing mean Rape rates for states in
the three population groups with the following code:
rmeans = with(USArrests,aggregate(Rape,list(popgroup),mean))
where = barplot(rmeans[,2],xlab="Population Density")
axis(1,at=where,labels=as.character(rmeans[,1]))
box()
title("Average Rape Rate by Population Density")
The return value from barplot() gives the centers of the bars; the
box() function draws a box around the plot.
[Figure: barplot of the group means with custom axis labels Low, Medium and High; x-axis label "Population Density"]
Conditioning Plots
[Figure: "Size of Orange Trees", a multi-panel plot of circumference versus age]
Note that all the scales are identical, and all margins between the
plots have been eliminated, making it very easy to compare the
graphs.
cloud()       3D Scatter plots
histogram()   Histograms
qq()          Quantile-Quantile plots
barchart()    Bar charts
dotplot()     Dot Plots
splom()       Scatterplot matrices
Note that it's the function name that is passed to xyplot(), not an
actual call to the function.
[Figure: multi-panel plot of circumference versus age]
Trellis Objects
The trellis library optimizes its plots depending on the current
device. This can lead to problems if dev.copy() is used, since it
will be using settings from the current device. To avoid this
problem, trellis plots can be stored in a device-independent way, and
rendered for a particular device with the print() function.
For example, if the current device is x11, using dev.copy() will
create a PostScript version of the x11-optimized lattice plot. Trellis
objects avoid this problem:
obj = xyplot( ... )
print(obj)                  # required to view the plot if stored as an object
postscript(file="out.ps")
print(obj)                  # creates PostScript file
dev.off()