R For Data Engineers: Greg Wilson
1 Introduction
1.1 Who are these lessons for?
1.2 How do I get started?

2 Simple Beginnings
2.1 Learning Objectives
2.2 How do I say hello?
2.3 How do I add numbers?
2.4 How do I store many numbers together?
2.5 How do I index a vector?
2.6 How do I create new vectors from old?
2.7 How else does R represent the absence of data?
2.8 How can I store a mix of different types of objects?
2.9 What is the difference between [ and [[?
2.10 How can I access elements by name?
2.11 How can I create and index a matrix?
2.12 How do I choose and repeat things?
2.13 How can I vectorize loops and conditionals?
2.14 How can I express a range of values?
2.15 How can I use a vector in a conditional statement?
2.16 How do I create and call functions?
2.17 How can I write a function that takes variable arguments?
2.18 How can I provide default values for arguments?
2.19 How can I hide the value that R returns?
2.20 How can I assign to a global variable from inside a function?
2.21 Key Points
3 The Tidyverse
3.1 Learning Objectives
3.2 How do I read data?
3.3 How do I inspect data?
3.4 How do I index rows and columns?
3.5 How do I calculate basic statistics?
3.6 How do I filter data?
3.7 How do I write tidy code?
3.8 How do I model my data?
3.9 How do I create a plot?
3.10 Do I need more practice with the tidyverse?
3.11 Key Points

4 Creating Reports
4.1 Learning Objectives
4.2 How can I create and preview a simple page?
4.3 How can I run code and include its output in a page?
4.4 How can I format tables in a page?
4.5 How can I share code between pages?
4.6 How can I parameterize documents?
4.7 How can I publish pages on GitHub?
4.8 Key Points

5 Creating Packages
5.1 Learning Objectives
5.2 What is our starting point?
5.3 How do I convert values to numbers?
5.4 How do I reorganize the columns?
5.5 How do I create a package?
5.6 How can I document the contents of a package?
5.7 How can my package import what it needs?
5.8 How can I add data to a package?
5.9 Key Points
Appendix
A License
C Citation
D Contributing
F Glossary
Bibliography
1 Introduction
Years ago, Patrick Burns wrote The R Inferno, a guide to R for those who
think they are in hell. Upon first encountering the language after two decades
of using Python, I thought Burns was an optimist—after all, hell has rules.
I have since realized that R does too, and that they are no more confusing
or contradictory than those of other programming languages. They only ap-
pear so because R draws on a tradition unfamiliar to those of us raised with
derivatives of C. Counting from one, copying data rather than modifying it,
lazy evaluation: to quote the other bard, these are not mad, just differently
sane.
Welcome, then, to a universe where the strange will become familiar, and
everything familiar, strange. Welcome, thrice welcome, to R.
2 Simple Beginnings
We begin by introducing the basic elements of R. You will use these less often
than you might expect, but they are the building blocks for the higher-level
tools introduced in Chapter 3, and offer the comfort of familiarity. Where
we feel comparisons would aid understanding, we provide short examples in
Python.
2.1 Learning Objectives

• Name and describe R’s atomic data types and create objects of those types.
• Explain what ‘scalar’ values actually are in R.
• Identify correct and incorrect variable names in R.
• Create vectors in R and index them to select single values, ranges of values,
and selected values.
• Explain the difference between NA and NULL and correctly use tests for
each.
• Explain the difference between a list and a vector.
• Explain the difference between indexing with [ and with [[.
• Use [ and [[ correctly to extract elements and sub-structures from data
structures in R.
• Create a named list in R.
• Access elements by name using both [ and $ notation.
• Correctly identify cases in which back-quoting is necessary when accessing
elements via $.
• Create and index matrices in R.
• Create for loops and if/else statements in R.
• Explain why vectors cannot be used directly in conditional expressions
and correctly use all and any to combine their values.
• Define functions taking a fixed number of named arguments and/or a
variable number of arguments.
• Explain what vectorization is and create vectorized equivalents of unnested
loops containing simple conditional tests.
2.2 How do I say hello?
print("Hello, world!")
Hello, world!
print("Hello, world!")
Python prints what we asked for, but what does the [1] in R’s output signify?
Is it perhaps something akin to a line number? Let’s take a closer look by
evaluating a couple of expressions without calling print:

'This is in single quotes.'
[1] "This is in single quotes."
"This is in double quotes."
[1] "This is in double quotes."
[1] doesn’t appear to be a line number; let’s ignore it for now and do a little
more exploring.
Note that R uses double quotes to display strings even when we give
it a single-quoted string (which is no worse than Python using single
quotes when we’ve given it doubles).
2.3 How do I add numbers?

In Python, we add numbers and print the result:

print(1 + 2 + 3)
6

We can check the type of the result using type, which tells us that the result
6 is an integer:
print(type(6))
<class 'int'>
1 + 2 + 3
[1] 6
typeof(6)
[1] "double"
R’s type inspection function is called typeof rather than type, and it re-
turns the type’s name as a string. That’s all fine, but it seems odd for integer
addition to produce a double-precision floating-point result. Let’s try an ex-
periment:
typeof(6)
[1] "double"
typeof(6L)
[1] "integer"
typeof(1L + 2L + 3L)
[1] "integer"
typeof(as.integer(6))
[1] "integer"
But wait: what is that dot in as.integer’s name? Is there an object called as
with a method called integer? The answer is “no”: . is (usually) just another
character in R; like the underscore _, it is used to make names more readable.
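For example, nothing stops us from using a dot in an ordinary variable name:

this.is.a.name <- 6
this.is.a.name
[1] 6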
2.4 How do I store many numbers together?

In Python, we create a list and print it:

primes = [3, 5, 7, 11]
print(primes)
[3, 5, 7, 11]

The equivalent in R uses the function c, which stands for "concatenate":

primes <- c(3, 5, 7, 11)
primes
[1] 3 5 7 11
Assignment is done using a left-pointing arrow <- (though other forms exist,
which we will discuss later). As in Python, assignment is a statement rather
than an expression, so we enter the name of the newly-created variable to get
R to display its value.
Now that we can create vectors in R, we can explain the errant [1] in our
previous examples. To start, let’s have a look at the lengths of various things
in Python:
print(primes, len(primes))
[3, 5, 7, 11] 4
print(len(4))
TypeError: object of type 'int' has no len()
Detailed traceback:
File "<string>", line 1, in <module>
Fair enough: the length of a list is the number of elements it contains, and
since a scalar like the integer 4 doesn’t contain elements, it has no length.
What of R’s vectors?
length(primes)
[1] 4
Good—and numbers?
length(4)
[1] 1
typeof(primes)
[1] "double"
That’s also unexpected: the type of the vector is the type of the elements it
contains. This all becomes clear once we realize that there are no scalars in R.
4 is not a single lonely integer, but rather a vector of length one containing
the value 4. When we display its value, the [1] that R prints is the index of
its first value. We can prove this by creating and displaying a longer vector:
c(1:10, 1:10, 1:10, 1:10)
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5
[26] 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
In order to help us find our way in our data, R automatically breaks long lines
and displays the starting index of each line. These indices also show us that R
counts from 1 as humans do, rather than from zero. (There are a great many
myths about why programming languages do the latter. The truth is stranger
than any fiction could be.)
2.5 How do I index a vector?

In Python, we index from zero, and negative indices count backward from
the end of a list:

colors = ["eburnean", "glaucous", "wenge"]
print(colors[0])
eburnean
print(colors[2])
wenge
colors[3]
IndexError: list index out of range
Detailed traceback:
File "<string>", line 1, in <module>
print(colors[-1])
wenge

In R, we would write:

colors <- c("eburnean", "glaucous", "wenge")
colors[1]
[1] "eburnean"
colors[3]
[1] "wenge"
colors[4]
[1] NA
R handles gaps in data using the special value NA (short for “not available”),
and returns this value when we ask for a nonexistent element of a vector.
But it does more than this—much more. In Python, a negative index counts
backward from the end of a list. In R, we use a negative index to indicate a
value that we don’t want:
colors[-1]
[1] "glaucous" "wenge"
colors[1, 2]
Error in colors[1, 2]: incorrect number of dimensions
That didn’t work because R interprets [i, j] as being row and column in-
dices, and our vector has only one dimension. What if we create a vector with
c(...) and use that as a subscript?
colors[c(3, 1, 2)]
[1] "wenge" "eburnean" "glaucous"
colors[c(1, 1, 1)]
[1] "eburnean" "eburnean" "eburnean"
Note that this is pull indexing, i.e., the value at location i in the index vector
specifies which element of the source vector is being pulled into that location
in the result vector (Figure 2.2).
We can also select out several elements:
[Figure 2.2: Pull Indexing]
colors[c(-1, -2)]
[1] "wenge"
colors[c(1, -1)]
Error in colors[c(1, -1)]: only 0's may be mixed with negative subscripts
colors[0]
character(0)
In order to understand this rather cryptic response, we can try calling the
function character ourselves with a positive argument:
character(3)
[1] "" "" ""

character(3) creates a vector of three empty strings, so character(0) must
be a character vector of length zero: indexing with 0 selects nothing. When
zero is mixed with valid indices, it is simply ignored:

colors[c(0, 1)]
[1] "eburnean"
colors[c(1, 10)]
[1] "eburnean" NA
2.6 How do I create new vectors from old?

Suppose we want to double every value in a vector. In Python, we could use
a list comprehension:

original = [3, 5, 7, 9]
doubled = [2 * x for x in original]
print(doubled)
[6, 10, 14, 18]

instead of:

doubled = []
for x in original:
    doubled.append(2 * x)
print(doubled)
[6, 10, 14, 18]

In R, we can simply multiply the vector by 2, and the operation is applied
to every element:

original <- c(3, 5, 7, 9)
doubled <- 2 * original
doubled
[1] 6 10 14 18
If two vectors of unequal length are used together, the elements of the shorter
are recycled. This behaves sensibly if one of the vectors is a scalar—it is just
re-used as many times as necessary:
hundreds <- c(100, 200, 300)
hundreds + 5
[1] 105 205 305
If both vectors have several elements, the shorter is repeated as often as neces-
sary. This works, but is so likely to lead to hard-to-find bugs that R produces
a warning message:
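For example, continuing with the hundreds vector defined above:

hundreds + c(5, 10)
Warning in hundreds + c(5, 10): longer object length is not a multiple of
shorter object length
[1] 105 210 305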
colors # as a reminder
[1] "eburnean" "glaucous" "wenge"

The vectorized conditional ifelse takes three vectors: a condition, values to
use where the condition is TRUE, and values to use where it is not. All three
vectors are of the same length, and the first (the condition) is usually
constructed using the values of one or both of the other vectors:

ifelse(colors < "m", colors, "-")
[1] "eburnean" "glaucous" "-"
2.7 How else does R represent the absence of data?

The special value NULL represents an empty vector rather than a missing value
within a vector. We cannot use == to check for it, because comparing anything
to a zero-length vector produces a zero-length result:

NULL == integer(0)
logical(0)
The safe way to test if something is NULL is to use the function is.null:
is.null(NULL)
[1] TRUE
Circling back, the safe way to test whether a value is NA is not to use direct
comparison:

threshold <- 1.75
threshold == NA
[1] NA
The result is NA because if we don’t know what the value is, we can’t know
if it’s equal to threshold or not. Instead, we should always use the function
is.na:
is.na(threshold)
[1] FALSE
is.na(NA)
[1] TRUE
2.8 How can I store a mix of different types of objects?

One of the things newcomers to R often trip over is the many ways in which
structures can be indexed. All of the following are legal:

thing[i]
thing[i, j]
thing[[i]]
thing[[i, j]]
thing$name
thing$"name"
but they can behave differently depending on what kind of thing thing is. To
explain, we must first take a look at lists.
A list in R is a vector that can contain values of many different types. (The
technical term for this is heterogeneous, in contrast with a homogeneous data
structure that can only contain one type of value.) We’ll use this list in our
examples:
thing <- list("first", c(2, 20, 200), 3.3)
thing
[[1]]
[1] "first"
[[2]]
[1] 2 20 200
[[3]]
[1] 3.3
The output tells us that the first element of thing is a vector of one element,
that the second is a vector of three elements, and the third is again a vector
of one element; the major indices are shown in [[…]], while the indices of the
contained elements are shown in […]. (Again, remember that "first" and 3.3
are actually vectors of length 1.)
2.9 What is the difference between [ and [[?
thing[[1]]
[1] "first"
thing[[2]]
[1] 2 20 200
thing[[3]]
[1] 3.3
typeof(thing[[1]])
[1] "character"
typeof(thing[[2]])
[1] "double"
typeof(thing[[3]])
[1] "double"
That seems sensible. Now, what do we get if we index with single square
brackets […]?
thing[1]
[[1]]
[1] "first"
typeof(thing[1])
[1] "list"
This shows the difference between [[ and [: the former peels away a layer
of data structure, returning only the sub-structure, while the latter gives us
back a structure of the same type as the thing being indexed. Since a “scalar”
is just a vector of length 1, there is no difference between [[ and [ when they
are applied to vectors:
[1] "second"
typeof(v[2])
[1] "character"
v[[2]]
[1] "second"
typeof(v[[2]])
[1] "character"
The function c also flattens its arguments, so c(c(1, 2),
c(3, 4)) produces c(1, 2, 3, 4). It also does automatic type con-
version: c("first", c(2, 20, 200), 30) produces a vector of char-
acter strings c("first", "2", "20", "200", "30"). This is helpful
once you get used to it (which once again is true of both quantum
mechanics and necromancy).
Another “helpful, ish” behavior is that using [[ with a list sub-
sets recursively: if thing <- list(a = list(b = list(c = list(d
= 1)))), then thing[[c("a", "b", "c", "d")]] selects the 1.
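We can check this recursive behavior directly:

thing <- list(a = list(b = list(c = list(d = 1))))
thing[[c("a", "b", "c", "d")]]
[1] 1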
2.10 How can I access elements by name?

The elements of a vector can be given names, and a named vector can then
be indexed by those names. Indexing with [ keeps the names attached to the
values, while [[ strips them away:

genders <- c(m = "Male", f = "Female", nb = "Non-binary")
genders[c("m", "f", "nb", "f", "f", "m")]
         m          f          nb          f          f          m
    "Male"   "Female" "Non-binary"  "Female"   "Female"     "Male"
genders["m"]
     m
"Male"
genders[["m"]]
[1] "Male"
We will explore this in more detail when we look at the tidyverse in Chapter 3,
since that is where access-by-name is used most often. For now, simply note
that if the name of an element isn’t a legal variable name, we have to put it
in backward quotes to use it with $:
[1] "F"
If you have control, or at least the illusion thereof, choose names such
as first_field that don’t require back-quoting.
2.11 How can I create and index a matrix?

We create a matrix with the matrix function, which fills in values column
by column:

a <- matrix(1:9, nrow = 3)
a
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Behind the scenes, a matrix is a vector with an attribute called dim that stores
its dimensions:
dim(a)
[1] 3 3
a[8]
[1] 8
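Since the underlying vector is stored in column-major order, a[8] is the same
element as a[2, 3]:

a[2, 3]
[1] 8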
2.12 How do I choose and repeat things?

R has the usual if/else statements and for loops. For example, we can loop
over a vector’s values and choose what to report (using the glue package to
format our messages):

library(glue)
values <- c(-15, 0, 15)
for (v in values) {
  if (v < 0) {
    sign <- -1
  } else if (v == 0) {
    sign <- 0
  } else {
    sign <- 1
  }
  print(glue("The sign of {v} is {sign}"))
}
The sign of -15 is -1
The sign of 0 is 0
The sign of 15 is 1

2.13 How can I vectorize loops and conditionals?

The entire loop above can be replaced by a single call to the vectorized
function sign:

print(sign(values))
[1] -1 0 1
But what if the function we want doesn’t exist (or if we don’t know what
it’s called)? In that case, the easiest approach is often to create a new vector
whose values are derived from those of the vector we had and trust R to match
up corresponding elements. Suppose we want the lengths of our color names,
which now include a fourth color:

colors <- c("eburnean", "glaucous", "squamous", "wenge")
for (color in colors) {
  print(length(color))
}
[1] 1
[1] 1
[1] 1
[1] 1

This output makes no sense until we remember that every value is a vector,
and that length returns the length of a vector, so that length(colors[1])
is telling us that colors[1] contains one element. If we want the number of
characters in the strings, we can use R’s built-in nchar or the more modern
function stringr::str_length:

stringr::str_length(colors)
[1] 8 8 8 5
2.14 How can I express a range of values?

R’s for loop iterates over the values in a vector. If we want each element’s
index as well, we can generate the indices with seq_along:

for (i in seq_along(colors)) {
  print(glue("The length of color {i} is {stringr::str_length(colors[i])}"))
}
The length of color 1 is 8
The length of color 2 is 8
The length of color 3 is 8
The length of color 4 is 5
seq_along(colors)
[1] 1 2 3 4
Since sequences of this kind are used frequently, R lets us write them using
range expressions:
5:10
[1] 5 6 7 8 9 10
colors[2:3]
[1] "glaucous" "squamous"
colors[-1:-2]
[1] "squamous" "wenge"

If we want a sequence with a stride other than 1, we can use the function seq:

seq(1, 10, 3)
[1] 1 4 7 10
This example also shows that ranges in R are inclusive at both ends, i.e., they
run up to and including the upper bound. As is traditional among program-
ming language advocates, people claim that this is more natural and then cite
some supportive anecdote as if it were proof.
Repeating Things
The function rep repeats things, so rep("a", 3) is c("a", "a", "a"). If
the second argument is a vector of the same length as the first, it specifies
how many times each item in the first vector is to be repeated: rep(c("a",
"b"), c(2, 3)) is c("a", "a", "b", "b", "b").
2.15 How can I use a vector in a conditional statement?

We cannot use a vector directly as the condition in an if statement. If we try,
R uses only the vector’s first element and warns us that it is doing so:

numbers <- c(0, 1, 2)
if (numbers) {
  print("this should not work")
}
Warning in if (numbers) {: the condition has length > 1 and only the first
element will be used
The function all returns TRUE if every element in its argument is TRUE; it
corresponds to a logical “and” of all its inputs. We can use a corresponding
function any to check if at least one value is TRUE, which corresponds to a
logical “or” across the whole input.
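For example, with the numbers vector defined above:

all(numbers > 0)
[1] FALSE
any(numbers > 0)
[1] TRUE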
2.16 How do I create and call functions?

As we have already seen, we call functions in R much as we do in other
languages:

max(1, 3, 5) + min(1, 3, 5)
[1] 6
We define a new function using the function keyword. This creates the func-
tion; to name it, we must assign the newly-created function to a variable:
As this example shows, the result of a function is the value of the last ex-
pression evaluated within it. A function can return a value earlier using the
return function; we can use return for the final value as well, but most R
programmers do not.
swap <- function(pair) {
  if (length(pair) != 2) {
    return(NULL)
  }
  c(pair[2], pair[1])
}
swap(c("one"))
NULL
swap(c("left", "right"))
Returning NULL when our function’s inputs are invalid as we have done above
is foolhardy, as doing so means that swap can fail without telling us that it
has done so. Consider:
NULL[1] # Try to access an element of the vector that does not exist.
NULL
swap(c("one"))[1] # Try to use the result of a failed call to swap.
NULL
2.17 How can I write a function that takes variable arguments?

If the number of arguments we pass to a function is not the number it expects,
R complains:

swap("one", "two", "three")
Error in swap("one", "two", "three"): unused arguments ("two", "three")

(Note that we are passing three separate values here, not a single vector con-
taining three values.) If we want a function to handle a varying number of
arguments, we represent the “extra” arguments with an ellipsis ... (three
dots), which serves the same purpose as Python’s *args:
For example, here is a function that combines all of its extra arguments using
a separator given as its first argument:

combine_with <- function(sep, ...) {
  paste(..., sep = sep)
}
combine_with(" - ", "Monday", "Tuesday", "Wednesday")
[1] "Monday - Tuesday - Wednesday"

The function paste creates a string by combining its arguments with the
specified separator.
2.18 How can I provide default values for arguments?

We can provide default values for arguments so that callers only have to
specify the ones they want to change:

example <- function(first, second = 5, third = 10) {
  first + second + third
}
example(1)
[1] 16
One caution: when you use a name in a function call, R ignores things that
aren’t functions when looking up the function. This means the call
to orange() in the code below produces 110 rather than an error because
purple(purple) is interpreted as “pass the value 10 into the globally-defined
function purple” rather than “try to call a function 10(10)”:

purple <- function(x) {
  x + 100
}
orange <- function() {
  purple <- 10
  purple(purple)
}
orange()
[1] 110
2.19 How can I hide the value that R returns?

When we call a function interactively, R displays the function’s result:

double_it <- function(x) {
  2 * x
}
double_it(10)
[1] 20

If we want to prevent that display, we can wrap the function’s final expression
in a call to invisible, i.e., change the last line from this:

2 * x

to this:

invisible(2 * x)

The value is still returned and can still be assigned, but it is no longer printed
automatically.
2.20 How can I assign to a global variable from inside a function?

The assignment operator <<- means “assign to a variable outside the current
scope”. As the example below shows, this means that what looks like creation
of a new local variable can actually be modification of a global one:
var <- "original value"
demonstrate <- function() {
  var <<- "new value"
}
demonstrate()
var
[1] "new value"
This should only and always be done with care: modern R strongly encourages
a functional style of programming in which functions do not modify their input
data, and nobody thinks that modifying global variables is a good idea any
more.
3 The Tidyverse

There is no point in becoming fluent in Enochian if you do not then call forth
a Dweller Beneath at the time of the new moon. Similarly, there is no point
learning a language designed for data manipulation if you do not then bend
data to your will. This chapter therefore looks at how to do the things that R
was summoned—er, designed—to do.
3.2 How do I read data?

Our data lives in a file called results/infant_hiv.csv, which looks like this:
country,year,estimate,hi,lo
AFG,2009,NA,NA,NA
AFG,2010,NA,NA,NA
...
AFG,2017,NA,NA,NA
AGO,2009,NA,NA,NA
AGO,2010,0.03,0.04,0.02
AGO,2011,0.05,0.07,0.04
AGO,2012,0.06,0.08,0.05
...
ZWE,2016,0.71,0.88,0.62
ZWE,2017,0.65,0.81,0.57
The actual file has many more rows (and no ellipses). It uses NA to show
missing data rather than (for example) -, a space, or a blank. The country
column holds ISO3 country codes, year is the year of the observation, and
estimate, hi, and lo hold the estimated rate and its high and low bounds.
To read this file in Python, we would use Pandas:
import pandas as pd
infant_hiv = pd.read_csv('results/infant_hiv.csv')
print(infant_hiv)
The equivalent in R is to load the tidyverse collection of packages and then call
the read_csv function. We will go through this in stages, since each produces
output.
library(tidyverse)
Error in library(tidyverse): there is no package called 'tidyverse'

Ah. We must install the tidyverse (but only need to do so once per machine):
install.packages("tidyverse")
At any time, we can call sessionInfo to find out what versions of which
packages we have loaded, along with the version of R we’re using and some
other useful information:
sessionInfo()
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
library(tidyverse)
Note that we give install.packages a string to install, but simply give the
name of the package we want to load to library.
Loading the tidyverse gives us eight packages. One of those, dplyr, defines
two functions that mask standard functions in R with the same names. If we
need the originals, we can always get them with their fully-qualified names
stats::filter and stats::lag.
Once we have the tidyverse loaded, reading the file looks remarkably like it
does in Python:

infant_hiv <- read_csv('results/infant_hiv.csv')
Parsed with column specification:
cols(
  country = col_character(),
  year = col_double(),
  estimate = col_double(),
  hi = col_double(),
  lo = col_double()
)
R’s read_csv tells us more about what it has done than Pandas does. In
particular, it guesses the data types of columns based on the first thousand
values and then tells us what types it has inferred. (In a better universe,
people would habitually use the first two rows of their spreadsheets for name
and units, but we do not live there.)
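If we would rather not rely on guessing, we can tell read_csv the column types
explicitly; a minimal sketch using readr’s cols specification:

infant_hiv <- read_csv('results/infant_hiv.csv',
                       col_types = cols(country = col_character(),
                                        year = col_double(),
                                        estimate = col_double(),
                                        hi = col_double(),
                                        lo = col_double()))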
We can now look at what read_csv has produced.
infant_hiv
# A tibble: 1,728 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 AFG 2009 NA NA NA
2 AFG 2010 NA NA NA
3 AFG 2011 NA NA NA
4 AFG 2012 NA NA NA
5 AFG 2013 NA NA NA
6 AFG 2014 NA NA NA
7 AFG 2015 NA NA NA
8 AFG 2016 NA NA NA
9 AFG 2017 NA NA NA
10 AGO 2009 NA NA NA
# ... with 1,718 more rows
3.3 How do I inspect data?

We often have a quick look at the content of a table to remind ourselves what
it contains. Pandas does this using methods whose names are borrowed from
the Unix shell’s head and tail commands:
print(infant_hiv.head())
print(infant_hiv.tail())
head(infant_hiv)
# A tibble: 6 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 AFG 2009 NA NA NA
2 AFG 2010 NA NA NA
3 AFG 2011 NA NA NA
4 AFG 2012 NA NA NA
5 AFG 2013 NA NA NA
6 AFG 2014 NA NA NA
tail(infant_hiv)
# A tibble: 6 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 ZWE 2012 0.38 0.47 0.33
2 ZWE 2013 0.570 0.7 0.49
3 ZWE 2014 0.54 0.67 0.47
4 ZWE 2015 0.59 0.73 0.51
5 ZWE 2016 0.71 0.88 0.62
6 ZWE 2017 0.65 0.81 0.570
Note that the row numbers printed by tail are relative to the output, not
absolute to the table. This is different from Pandas, which retains the original
row numbers.
What about overall information?
print(infant_hiv.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 5 columns):
country 1728 non-null object
year 1728 non-null int64
estimate 728 non-null float64
hi 728 non-null float64
lo 728 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 67.6+ KB
None
summary(infant_hiv)
Your display of R’s summary may or may not wrap, depending on how large
a screen the older acolytes have allowed you.
3.4 How do I index rows and columns?
print(infant_hiv['estimate'])
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
1723 0.57
1724 0.54
1725 0.59
1726 0.71
1727 0.65
Name: estimate, Length: 1728, dtype: float64
infant_hiv['estimate']
# A tibble: 1,728 x 1
estimate
<dbl>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
# ... with 1,718 more rows
infant_hiv$estimate
Again, note that the boxed number on the left is the start index of that row.
What about single values? Remembering to count from zero in Python and
from one in R, we have:
print(infant_hiv.estimate[11])
0.05
infant_hiv$estimate[12]
[1] 0.05
print(len(infant_hiv.estimate[11]))
TypeError: object of type 'numpy.float64' has no len()
Detailed traceback:
File "<string>", line 1, in <module>
length(infant_hiv$estimate[12])
[1] 1
print(infant_hiv.estimate[5:15])
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 0.03
11 0.05
12 0.06
13 0.15
14 0.10
Name: estimate, dtype: float64
infant_hiv$estimate[6:15]
[1]   NA   NA   NA   NA   NA 0.03 0.05 0.06 0.15 0.10
Note that the upper bound is the same, because it’s inclusive in R and exclu-
sive in Python. Note also that nothing prevents us from selecting a range of
rows that spans several countries, which is why selecting by row number is
usually a sign of innocence, insouciance, or desperation.
We can select by column number as well. Pandas uses the rather clumsy
object.iloc[rows, columns] with the usual shortcut : for “entire range”:
print(infant_hiv.iloc[:, 0])
0 AFG
1 AFG
2 AFG
3 AFG
4 AFG
...
1723 ZWE
1724 ZWE
1725 ZWE
1726 ZWE
1727 ZWE
Name: country, Length: 1728, dtype: object
print(infant_hiv.iloc[:, 0][0])
AFG
infant_hiv[1]
# A tibble: 1,728 x 1
country
<chr>
1 AFG
2 AFG
3 AFG
4 AFG
5 AFG
6 AFG
7 AFG
8 AFG
9 AFG
10 AGO
# ... with 1,718 more rows
But notice that the output is not a vector, but another tibble (i.e., a table
with N rows and one column). This means that adding another index does
column-wise indexing on that tibble:
infant_hiv[1][1]
# A tibble: 1,728 x 1
country
<chr>
1 AFG
2 AFG
3 AFG
4 AFG
5 AFG
6 AFG
7 AFG
8 AFG
9 AFG
10 AGO
# ... with 1,718 more rows
How then are we to get the first mention of Afghanistan? The answer is to
use double square brackets to strip away one level of structure:
infant_hiv[[1]]
[1] "AFG" "AFG" "AFG" "AFG" "AFG" "AFG" "AFG" "AFG" "AFG" "AGO" "AGO" "AGO"
[13] "AGO" "AGO" "AGO" "AGO" "AGO" "AGO" "AIA" "AIA" "AIA" "AIA" "AIA" "AIA"
[25] "AIA" "AIA" "AIA" "ALB" "ALB" "ALB" "ALB" "ALB" "ALB" "ALB" "ALB" "ALB"
[37] "ARE" "ARE" "ARE" "ARE" "ARE" "ARE" "ARE" "ARE" "ARE" "ARG" "ARG" "ARG"
[49] "ARG" "ARG" "ARG" "ARG" "ARG" "ARG" "ARM" "ARM" "ARM" "ARM" "ARM" "ARM"
[61] "ARM" "ARM" "ARM" "ATG" "ATG" "ATG" "ATG" "ATG" "ATG" "ATG" "ATG" "ATG"
[73] "AUS" "AUS" "AUS" "AUS" "AUS" "AUS" "AUS" "AUS" "AUS" "AUT" "AUT" "AUT"
[85] "AUT" "AUT" "AUT" "AUT" "AUT" "AUT" "AZE" "AZE" "AZE" "AZE" "AZE" "AZE"
[97] "AZE" "AZE" "AZE" "BDI" "BDI" "BDI" "BDI" "BDI" "BDI" "BDI" "BDI" "BDI"
[109] "BEL" "BEL" "BEL" "BEL" "BEL" "BEL" "BEL" "BEL" "BEL" "BEN" "BEN" "BEN"
[121] "BEN" "BEN" "BEN" "BEN" "BEN" "BEN" "BFA" "BFA" "BFA" "BFA" "BFA" "BFA"
[133] "BFA" "BFA" "BFA" "BGD" "BGD" "BGD" "BGD" "BGD" "BGD" "BGD" "BGD" "BGD"
[145] "BGR" "BGR" "BGR" "BGR" "BGR" "BGR" "BGR" "BGR" "BGR" "BHR" "BHR" "BHR"
[157] "BHR" "BHR" "BHR" "BHR" "BHR" "BHR" "BHS" "BHS" "BHS" "BHS" "BHS" "BHS"
[169] "BHS" "BHS" "BHS" "BIH" "BIH" "BIH" "BIH" "BIH" "BIH" "BIH" "BIH" "BIH"
[181] "BLR" "BLR" "BLR" "BLR" "BLR" "BLR" "BLR" "BLR" "BLR" "BLZ" "BLZ" "BLZ"
[193] "BLZ" "BLZ" "BLZ" "BLZ" "BLZ" "BLZ" "BOL" "BOL" "BOL" "BOL" "BOL" "BOL"
[205] "BOL" "BOL" "BOL" "BRA" "BRA" "BRA" "BRA" "BRA" "BRA" "BRA" "BRA" "BRA"
[217] "BRB" "BRB" "BRB" "BRB" "BRB" "BRB" "BRB" "BRB" "BRB" "BRN" "BRN" "BRN"
[229] "BRN" "BRN" "BRN" "BRN" "BRN" "BRN" "BTN" "BTN" "BTN" "BTN" "BTN" "BTN"
...
This is now a plain old vector, so it can be indexed with single square brackets:
infant_hiv[[1]][1]
[1] "AFG"
But that too is a vector, so it can of course be indexed as well (for some value
of “of course”):
infant_hiv[[1]][1][1]
[1] "AFG"
Thus, data[1][[1]] produces a tibble, then selects the first column vector
from it, so it still gives us a vector. This is not madness. It is merely…differently
sane.
3.5 How do I calculate basic statistics?

What is the average estimate? We start by grabbing that column for
convenience:

estimates = infant_hiv.estimate
print(len(estimates))
1728
print(estimates.mean())
0.3870192307692308
estimates <- infant_hiv$estimate
length(estimates)
[1] 1728
mean(estimates)
[1] NA
The void is always there, waiting for us… Let’s fix this in R first by telling
mean to drop NAs:
mean(estimates, na.rm = TRUE)
[1] 0.3870192
and then get the same sort of behavior in Python by telling mean not to skip
missing values:

print(estimates.mean(skipna=False))
nan
Many functions in R use na.rm to control whether NAs are removed or not.
(Remember, the . character is just another part of the name.) R’s default
behavior is to leave NAs in, and then to include them in aggregate computations.
Python’s is to get rid of missing values early and work with what’s left, which
makes translating code from one language to the other much more interesting
than it might otherwise be. But other than that, the statistics work the same
way. In Python, we write:
print("min", estimates.min())
min 0.0
print("max", estimates.max())
max 0.95
print("std", estimates.std())
std 0.3034511074214113
and in R:

print(glue("min {min(estimates, na.rm = TRUE)}"))
min 0
print(glue("max {max(estimates, na.rm = TRUE)}"))
max 0.95
print(glue("sd {sd(estimates, na.rm = TRUE)}"))
sd 0.303451107421411
A good use of aggregation is to check the quality of the data. For example, we
can ask if there are any records where some of the estimate, the low value, or
the high value are missing, but not all of them:
print((infant_hiv.hi.isnull() != infant_hiv.lo.isnull()).any())
False
any(is.na(infant_hiv$hi) != is.na(infant_hiv$lo))
[1] FALSE
3.6 How do I filter data?

Suppose we want only the records with estimates of 0.95 or better. In Python,
we can filter with a Boolean mask and count what passes:

maximal = estimates[estimates >= 0.95]
print(len(maximal))
52

And in R:

maximal <- estimates[estimates >= 0.95]
length(maximal)
[1] 1052
The difference is unexpected. Let’s have a closer look at the result in Python:
print(maximal)
180 0.95
181 0.95
182 0.95
183 0.95
184 0.95
185 0.95
187 0.95
360 0.95
361 0.95
362 0.95
379 0.95
380 0.95
381 0.95
382 0.95
384 0.95
385 0.95
386 0.95
446 0.95
447 0.95
461 0.95
...
And in R:
maximal
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[15] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[29] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[43] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[57] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[71] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[85] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[99] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[113] NA NA NA NA NA NA NA NA NA NA NA NA 0.95 0.95
[127] 0.95 0.95 0.95 0.95 0.95 NA NA NA NA NA NA NA NA NA
[141] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[155] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[169] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[183] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[197] NA NA NA NA NA NA NA NA NA NA NA NA 0.95 0.95
[211] 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 NA NA NA NA NA NA
[225] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[239] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[253] NA NA NA NA NA NA NA NA NA NA NA NA NA 0.95
[267] 0.95 NA NA 0.95 NA NA NA NA NA NA NA NA NA NA
...
It appears that R has kept the unknown values in order to highlight just how
little we know. More precisely, wherever there was an NA in the original data
there is an NA in the logical vector and hence an NA in the final vector. Let us
then turn to which to get a vector of indices at which a vector contains TRUE.
This function does not return indices for FALSE or NA:
which(estimates >= 0.95)
[1] 181 182 183 184 185 186 188 361 362 363 380 381 382 383 385
[16] 386 387 447 448 462 793 794 795 796 797 798 911 912 955 956
[31] 957 958 959 960 961 962 963 1098 1107 1128 1429 1430 1462 1554 1604
[46] 1607 1625 1626 1627 1629 1708 1710
length(which(estimates >= 0.95))
[1] 52
So now we can index our vector with the result of calling which:

estimates[which(estimates >= 0.95)]
[1] 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95
[16] 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95
[31] 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95
[46] 0.95 0.95 0.95 0.95 0.95 0.95 0.95
But should we do this? Those NAs are important information, and should
not be discarded so blithely. What we should really be doing is using the
tools the tidyverse provides rather than clever indexing tricks. These behave
consistently across a wide scale of problems and encourage use of patterns
that make it easier for others to understand our programs.
The tidyverse’s filter function does the same job as our indexing tricks, but
more consistently. Here, we keep all records whose lo value is greater than 0.5:

filter(infant_hiv, lo > 0.5)
# A tibble: 183 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 ARG 2016 0.67 0.77 0.61
2 ARG 2017 0.66 0.77 0.6
3 AZE 2014 0.74 0.95 0.53
4 AZE 2015 0.83 0.95 0.64
5 AZE 2016 0.75 0.95 0.56
6 AZE 2017 0.74 0.95 0.56
7 BLR 2009 0.95 0.95 0.95
8 BLR 2010 0.95 0.95 0.95
9 BLR 2011 0.95 0.95 0.91
10 BLR 2012 0.95 0.95 0.95
# ... with 173 more rows
Notice that the expression is lo > 0.5 rather than "lo" > 0.5. The latter
expression would return the entire table because the string "lo" is greater
than the number 0.5 everywhere.
But how is it that the name lo can be used on its own? It is the name of
a column, but there is no variable called lo. The answer is that R uses lazy
evaluation: function arguments aren’t evaluated until they’re needed, so the
function filter actually gets the expression lo > 0.5, which allows it to
check that there’s a column called lo and then use it appropriately. It may
seem strange at first, but it is much tidier than filter(data, data$lo >
0.5) or filter(data, "lo > 0.5"). We will explore lazy evaluation further
in Chapter 6.
3.7 How do I write tidy code?

We can make data analysis code more readable by using the pipe operator
%>%:

infant_hiv %>% filter(lo > 0.5)
# A tibble: 183 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 ARG 2016 0.67 0.77 0.61
2 ARG 2017 0.66 0.77 0.6
3 AZE 2014 0.74 0.95 0.53
4 AZE 2015 0.83 0.95 0.64
5 AZE 2016 0.75 0.95 0.56
6 AZE 2017 0.74 0.95 0.56
7 BLR 2009 0.95 0.95 0.95
8 BLR 2010 0.95 0.95 0.95
9 BLR 2011 0.95 0.95 0.91
10 BLR 2012 0.95 0.95 0.95
# ... with 173 more rows
This may not seem like much of an improvement, but neither does a Unix pipe
consisting of cat filename.txt | head. What about this?
filter(infant_hiv, (estimate != 0.95) & (lo > 0.5) & (hi <= (lo + 0.1)))
# A tibble: 1 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 TTO 2017 0.94 0.95 0.86
It uses the vectorized “and” operator & twice, and parsing the condition takes
a human being at least a few seconds. Its pipelined equivalent is:
infant_hiv %>% filter(estimate != 0.95) %>% filter(lo > 0.5) %>% filter(hi <= (lo + 0.1))
# A tibble: 1 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 TTO 2017 0.94 0.95 0.86
Breaking the condition into stages like this often makes reading and testing
much easier, and encourages incremental write-test-extend development. Let’s
increase the band from 10% to 20%, break the line the way the tidyverse style
guide recommends to make the operations easier to spot, and order by lo in
descending order:
infant_hiv %>%
filter(estimate != 0.95) %>%
filter(lo > 0.5) %>%
filter(hi <= (lo + 0.2)) %>%
arrange(desc(lo))
# A tibble: 55 x 5
country year estimate hi lo
<chr> <dbl> <dbl> <dbl> <dbl>
1 TTO 2017 0.94 0.95 0.86
2 SWZ 2011 0.93 0.95 0.84
3 CUB 2014 0.92 0.95 0.83
4 TTO 2016 0.9 0.95 0.83
5 CRI 2009 0.92 0.95 0.81
6 CRI 2012 0.89 0.95 0.81
7 NAM 2014 0.91 0.95 0.81
8 URY 2016 0.9 0.95 0.81
9 ZMB 2014 0.91 0.95 0.81
10 KAZ 2015 0.84 0.95 0.8
# ... with 45 more rows
We can now select the three columns we care about:
infant_hiv %>%
filter(estimate != 0.95) %>%
filter(lo > 0.5) %>%
filter(hi <= (lo + 0.2)) %>%
arrange(desc(lo)) %>%
select(year, lo, hi)
# A tibble: 55 x 3
year lo hi
<dbl> <dbl> <dbl>
1 2017 0.86 0.95
2 2011 0.84 0.95
3 2014 0.83 0.95
4 2016 0.83 0.95
5 2009 0.81 0.95
6 2012 0.81 0.95
7 2014 0.81 0.95
8 2016 0.81 0.95
9 2014 0.81 0.95
10 2015 0.8 0.95
# ... with 45 more rows
Once again, we are using the unquoted column names year, lo, and hi and
letting R’s lazy evaluation take care of the details for us.
Rather than selecting these three columns, we can select out the columns
we’re not interested in by negating their names. This leaves the columns that
are kept in their original order, rather than putting lo before hi, which won’t
matter if we later select by name, but will if we ever want to select by position:
infant_hiv %>%
filter(estimate != 0.95) %>%
filter(lo > 0.5) %>%
filter(hi <= (lo + 0.2)) %>%
arrange(desc(lo)) %>%
select(-country, -estimate)
# A tibble: 55 x 3
year hi lo
<dbl> <dbl> <dbl>
1 2017 0.95 0.86
2 2011 0.95 0.84
3 2014 0.95 0.83
4 2016 0.95 0.83
5 2009 0.95 0.81
6 2012 0.95 0.81
7 2014 0.95 0.81
8 2016 0.95 0.81
9 2014 0.95 0.81
10 2015 0.95 0.8
# ... with 45 more rows
Giddy with power, we now add a column containing the difference between the
low and high values. This can be done using either mutate, which adds new
columns to the end of an existing tibble, or with transmute, which creates a
new tibble containing only the columns we explicitly ask for. (There is also a
function rename which simply renames columns.) Since we want to keep hi
and lo, we decide to use mutate:
infant_hiv %>%
filter(estimate != 0.95) %>%
filter(lo > 0.5) %>%
filter(hi <= (lo + 0.2)) %>%
arrange(desc(lo)) %>%
select(-country, -estimate) %>%
mutate(difference = hi - lo)
# A tibble: 55 x 4
year hi lo difference
<dbl> <dbl> <dbl> <dbl>
1 2017 0.95 0.86 0.0900
2 2011 0.95 0.84 0.110
3 2014 0.95 0.83 0.12
4 2016 0.95 0.83 0.12
5 2009 0.95 0.81 0.140
6 2012 0.95 0.81 0.140
7 2014 0.95 0.81 0.140
8 2016 0.95 0.81 0.140
9 2014 0.95 0.81 0.140
10 2015 0.95 0.8 0.150
# ... with 45 more rows
Does the difference between high and low estimates vary by year? To answer
that question, we use group_by to group records by the values of one or more
columns and then summarize to aggregate within those groups. We might as
well get rid of the arrange and select calls in our pipeline at this point, since
we’re not using them, and count how many records contributed to each
aggregation using n():
infant_hiv %>%
filter(estimate != 0.95) %>%
filter(lo > 0.5) %>%
filter(hi <= (lo + 0.2)) %>%
mutate(difference = hi - lo) %>%
group_by(year) %>%
summarize(count = n(), ave_diff = mean(difference))
# A tibble: 9 x 3
   year count ave_diff
  <dbl> <int>    <dbl>
1  2009     3    0.17
2  2010     3    0.187
3  2011     5    0.168
4  2012     5    0.186
5  2013     6    0.183
6  2014    10    0.168
7  2015     6    0.162
8  2016    10    0.166
9  2017     7    0.153
How might we do this with Pandas? One approach is to use a single multi-part
.query to select data and store the result in a variable so that we can refer to
the hi and lo columns twice without repeating the filtering expression. We
then group by year and aggregate, again using strings for column names:
data = pd.read_csv('results/infant_hiv.csv')
data = data.query('(estimate != 0.95) & (lo > 0.5) & (hi <= (lo + 0.2))')
data = data.assign(difference = (data.hi - data.lo))
grouped = data.groupby('year').agg({'difference' : {'ave_diff' : 'mean', 'count' : 'count'}})
//anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py:1455: FutureWarning:
using a dict with renaming is deprecated and will be removed in a future version.
print(grouped)
difference
ave_diff count
year
2009 0.170000 3
2010 0.186667 3
2011 0.168000 5
2012 0.186000 5
2013 0.183333 6
2014 0.168000 10
2015 0.161667 6
2016 0.166000 10
2017 0.152857 7
There are other ways to tackle this problem with Pandas, but the tidyverse
approach produces code that I find more readable.
3.8 How do I model my data?

Tidying data is often a prelude to modeling it. R’s lm function fits a linear
model; its first argument is a formula, and the notation estimate ~ lo means
“regress estimate against lo”:

lm(estimate ~ lo, data = infant_hiv)

Call:
lm(formula = estimate ~ lo, data = infant_hiv)
Coefficients:
(Intercept) lo
0.0421 1.0707
The second argument to lm is the data frame holding the variables, which is
why the formula can refer to the column names directly. In fact, lm lets us
write much more complex formulas involving functions of multiple variables.
For example, we can regress estimate against the square roots of lo and hi
(though there is no sound statistical reason to do so):

lm(estimate ~ sqrt(lo) + sqrt(hi), data = infant_hiv)
Call:
lm(formula = estimate ~ sqrt(lo) + sqrt(hi), data = infant_hiv)
Coefficients:
(Intercept) sqrt(lo) sqrt(hi)
-0.2225 0.6177 0.4814
One important thing to note here is the way that + is overloaded in formulas.
The formula estimate ~ lo + hi does not mean “regress estimate against
the sum of lo and hi”, but rather, “regress estimate against the two variables
lo and hi”:
lm(estimate ~ lo + hi, data = infant_hiv)

Call:
lm(formula = estimate ~ lo + hi, data = infant_hiv)
Coefficients:
(Intercept) lo hi
-0.01327 0.42979 0.56752
infant_hiv %>%
mutate(ave_lo_hi = (lo + hi)/2) %>%
lm(estimate ~ ave_lo_hi, data = .)
Call:
lm(formula = estimate ~ ave_lo_hi, data = .)
Coefficients:
(Intercept) ave_lo_hi
-0.00897 1.01080
Here, the call to lm is using the variable . to mean “the data coming in from
the previous stage of the pipeline”. Most of the functions in the tidyverse use
this convention so that data can be passed to a function that expects it in a
position other than the first.
3.9 How do I create a plot?

The tidyverse’s ggplot2 package implements a grammar of graphics in which
plots are built up in layers. A basic scatter plot of hi against lo needs only a
data source, a mapping from data to aesthetics, and a geometry:

infant_hiv %>%
  ggplot(mapping = aes(x = lo, y = hi)) +
  geom_point()

[Scatter plot of hi versus lo.]
Let’s create a slightly more appealing plot by dropping NAs, making the points
semi-transparent, and colorizing them according to the value of estimate:
infant_hiv %>%
drop_na() %>%
ggplot(mapping = aes(x = lo, y = hi, color = estimate)) +
geom_point(alpha = 0.5) +
xlim(0.0, 1.0) + ylim(0.0, 1.0)
[Scatter plot of hi versus lo with semi-transparent points colored by estimate.]
We set the transparency alpha outside the aesthetic because its value is con-
stant for all points. If we set it inside aes(...), we would be telling ggplot2 to
set the transparency according to the value of the data. We specify the limits
to the axes manually with xlim and ylim to ensure that ggplot2 includes the
upper bounds: without this, all of the data would be shown, but the upper
label “1.00” would be omitted.
This plot immediately shows us that we have some outliers. There are far
more values with hi equal to 0.95 than it seems there ought to be, and there
are eight points running up the left margin that seem troubling as well. Let’s
create a new tibble that doesn’t have these:
infant_hiv %>%
drop_na() %>%
filter(hi != 0.95) %>%
filter(!((lo < 0.10) & (hi > 0.25))) %>%
ggplot(mapping = aes(x = lo, y = hi, color = estimate)) +
geom_point(alpha = 0.5) +
xlim(0.0, 1.0) + ylim(0.0, 1.0)
[Scatter plot with the outliers removed.]

We can add a straight regression line to the plot with geom_smooth:
infant_hiv %>%
drop_na() %>%
filter(hi != 0.95) %>%
filter(!((lo < 0.10) & (hi > 0.25))) %>%
ggplot(mapping = aes(x = lo, y = hi)) +
geom_point(mapping = aes(color = estimate), alpha = 0.5) +
geom_smooth(method = lm, color = 'red') +
xlim(0.0, 1.0) + ylim(0.0, 1.0)
[Scatter plot with a linear regression line; ggplot2 complains about rows
containing missing values.]
But wait: why is this complaining about missing values? Some online searches
lead to the discovery that geom_smooth adds virtual points to the data for
plotting purposes, some of which lie outside the range of the actual data, and
that setting xlim and ylim then truncates these. (Remember, R is differently
sane…) The safe way to control the range of the data is to add a call to
coord_cartesian, which effectively zooms in on a region of interest:
infant_hiv %>%
drop_na() %>%
filter(hi != 0.95) %>%
filter(!((lo < 0.10) & (hi > 0.25))) %>%
ggplot(mapping = aes(x = lo, y = hi)) +
geom_point(mapping = aes(color = estimate), alpha = 0.5) +
geom_smooth(method = lm, color = 'red') +
coord_cartesian(xlim = c(0.0, 1.0), ylim = c(0.0, 1.0))
[The same plot, with the viewing region controlled by coord_cartesian.]
3.10 Do I need more practice with the tidyverse?

Absolutely. Let’s load the tidyverse along with the here package (which helps
us construct paths relative to our project’s root directory) and read a small
file of people’s names:

library(tidyverse)
library(here)
person <- read_csv(here('person.csv'))
person
# A tibble: 5 x 3
person_id personal_name family_name
<chr> <chr> <chr>
1 dyer William Dyer
2 pb Frank Pabodie
3 lake Anderson Lake
4 roe Valentina Roerich
5 danforth Frank Danforth
nrow(person)
[1] 5
ncol(person)
[1] 3
(These names don’t have a package prefix because they are built in.) Let’s
show that information in a slightly nicer way using glue to insert values into
a string and print to display the result:
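print(glue("person has {nrow(person)} rows and {ncol(person)} columns"))
person has 5 rows and 3 columns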
If we want to display several values, we can use the function paste to combine
the elements of a vector. colnames gives us the names of a tibble’s columns,
and paste’s collapse argument tells the function to use a single space to
separate concatenated values:
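print(glue("person columns are {paste(colnames(person), collapse = ' ')}"))
person columns are person_id personal_name family_name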
Time for some data manipulation. Let’s get everyone’s family and personal
names:

person %>%
  select(family_name, personal_name)
# A tibble: 5 x 2
family_name personal_name
<chr> <chr>
1 Dyer William
2 Pabodie Frank
3 Lake Anderson
4 Roerich Valentina
5 Danforth Frank
and then filter that list to keep only those that come in the first half of the
alphabet:

person %>%
  select(family_name, personal_name) %>%
  filter(family_name < "N")
# A tibble: 3 x 2
family_name personal_name
<chr> <chr>
1 Dyer William
2 Lake Anderson
3 Danforth Frank
We can write the same pipeline with fully-qualified function names, which
makes it clear exactly which package each function comes from:

person %>%
dplyr::select(family_name, personal_name) %>%
dplyr::filter(family_name < "N")
# A tibble: 3 x 2
family_name personal_name
<chr> <chr>
1 Dyer William
2 Lake Anderson
3 Danforth Frank
It’s easy to add a column that records the lengths of family names:
person %>%
dplyr::mutate(name_length = stringr::str_length(family_name))
# A tibble: 5 x 4
person_id personal_name family_name name_length
<chr> <chr> <chr> <int>
1 dyer William Dyer 4
2 pb Frank Pabodie 7
3 lake Anderson Lake 4
4 roe Valentina Roerich 7
5 danforth Frank Danforth 8
and just as easy to sort by that new column:

person %>%
dplyr::mutate(name_length = stringr::str_length(family_name)) %>%
dplyr::arrange(dplyr::desc(name_length))
# A tibble: 5 x 4
person_id personal_name family_name name_length
<chr> <chr> <chr> <int>
1 danforth Frank Danforth 8
2 pb Frank Pabodie 7
3 roe Valentina Roerich 7
4 dyer William Dyer 4
5 lake Anderson Lake 4
4 Creating Reports

4.1 Learning Objectives

• Create a Markdown file that includes headings, links, and external images.
• Compile that file to produce an HTML page.
• Customize the page via its YAML header.
4.2 How can I create and preview a simple page?

To start, let’s put some Markdown in a file called first.Rmd:
## Methods {-}

Something really important.

An [external link](https://tidynomicon.tech).

[Another link][rstudio]

[rstudio]: https://rstudio.com

When we preview this page, we see the heading and the rendered links:

Methods

Something really important.

An external link.

Another link
4.3 How can I run code and include its output in a page?
If this is all R Markdown could do, it would be nothing more than an idiosyn-
cratic way to create HTML pages. What makes it powerful is the ability to
include code chunks that are evaluated as the document is knit, and whose
output is included in the final page. Put this in a file called second.Rmd:
```{r}
colors <- c('red', 'green', 'blue')
colors
```
The triple back-quotes mark the start and end of a block of code; putting {r}
immediately after the back-quotes at the start tells knitr to run the code and
include its output in the generated page, which therefore looks like this:
Displaying the colors:

colors <- c('red', 'green', 'blue')
colors
## [1] "red" "green" "blue"
We can put any code we want inside code blocks. We don’t have to execute
it all at once: the Code pulldown in RStudio’s main menu offers a variety of
ways to run regions of code. The IDE also gives us a keyboard shortcut to
insert a new code chunk, so there really is no excuse for not making notes as
we go along.
We can control execution and formatting by putting options inside the curly
braces at the start of the code block:
• {r label} gives the chunk a label that we can cross-reference. Labels must
be unique within documents, just like the id attributes of HTML elements.
• {r include=FALSE} tells knitr to run the code but not to include either
the code or its output in the finished document. While the option name is
confusing—the code is actually included in processing—this is handy when
we have setup code that loads libraries or does other things that our readers
probably don’t care about.
• {r eval=FALSE} displays the code but doesn’t run it, and is often used for
tutorials like this one.
• {r echo=FALSE} hides the code but includes the output. This is most often
used for displaying static images as we will see below.
# My Thesis {-}
```{r calculate-depth-by-magnitude}
depth_by_magnitude <- earthquakes %>%
mutate(round_mag = round(Magnitude)) %>%
group_by(round_mag) %>%
summarize(depth = mean(Depth_Km))
depth_by_magnitude
```
```{r plot-depth-by-magnitude, echo=FALSE, fig.cap='Depth by Magnitude'}
depth_by_magnitude %>%
ggplot() +
geom_point(mapping = aes(x = round_mag, y = depth))
```
In order, the knitted page shows the heading, the table of depths by
magnitude, and the captioned plot.
The options for the latter give the code chunk an ID, prevent it from being
echoed in the final document, and most importantly, give the figure a caption.
While it requires a bit of extra typing, it produces more predictable results
across different output formats.
4.4 How can I format tables in a page?

Tables appear more attractive in reports if we format them with kable from
the knitr package:

earthquakes %>%
head(5) %>%
kable()
Our output is more attractive if we install and load the kableExtra package
and use it to style the table. We must call its functions after we call kable(),
just as we call the styling functions for plots after ggplot(). Below, we select
four columns from our earthquake data and format them as a narrow table
with two decimal places for latitude and longitude, one for magnitude and
depth, and some multi-column headers:
earthquakes %>%
select(lat = Latitude, long = Longitude,
mag = Magnitude, depth = Depth_Km) %>%
head(5) %>%
kable(digits = c(2, 2, 1, 1)) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c('Location' = 2, 'Details' = 2))
      Location        Details
   lat    long     mag  depth
 42.70   13.23     6.0    8.1
 42.71   13.25     4.5    9.0
 42.76   13.17     3.8    9.7
 42.78   13.17     3.9    9.7
 42.78   13.16     3.6    9.7
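4.5 How can I share code between pages?

If several pages are going to share code, we can put that code in a file of its
own and load it in each page with a setup chunk; a sketch of the pattern,
using the include=FALSE option described earlier:

```{r setup, include=FALSE}
source('common.R')
```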
The chunk is named setup, and neither it nor its output are displayed. All
the chunk does is load and run common.R, which contains the following lines:
library(tidyverse)
library(reticulate)
library(rlang)
library(knitr)
knitr::opts_knit$set(width = 69)
The first few load libraries that various chapters depend on; the last one tells
knitr to set the line width option to 69 characters.
Every R Markdown document can begin with a YAML header like this one:

---
title: "first"
author: "Greg Wilson"
date: "2019-09-18"
output: html_document
---

Two things to note:

1. The header starts with exactly three dashes on a line of their own
and ends the same way. A common mistake is to forget the clos-
ing dashes; another is to use too many or too few, or to include
whitespace in the line.
2. The content of the header is formatted using YAML, which stands
for “Yet Another Markup Language”. In its simplest form it con-
tains key-value pairs: the keys are words, the values can be num-
bers, quoted strings, or a variety of other things, and the two are
separated by a colon.
This header tells knitr what the document’s title is, who its author is, when
it was created (which really ought to be written as an ISO-formatted date,
but worse sins await us), and what output format we want by default. When
we knit the document, knitr reads the header but does not include it in the
output. Instead, its values control knitr’s operation (e.g., select HTML as
the output format) or are inserted into the document itself (e.g., the title).
For example, this header tells knitr to use the united theme and to include a
table of contents:
---
title: "fourth"
author: "Greg Wilson"
date: "2019-09-18"
output:
html_document:
theme: united
toc: true
---
4.6 How can I parameterize documents?

We can parameterize a document by adding a params section to its YAML
header:

---
title: "Fifth Report"
params:
country: Canada
---
```{r load-data}
data <- read_csv(here::here('data', glue(params$country, '.csv')))
```
This document’s YAML header contains the key params, under which is a
sub-key for each parameter we want to create. When the document is knit,
these parameters are put in a named list called params and can be referred
to like any other variable. If we want to display it inline, we use a back-ticked
code fragment that starts with the letter ‘r’; if we want to use it in a fenced
code block, it’s no different from any other variable.
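For example, a parameterized report might use the parameter both ways
(a small sketch; the chunk label is arbitrary):

The country we are studying is `r params$country`.

```{r show-country}
print(params$country)
```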
Parameters don’t have to be single values: they can, for example, be lists of
mysterious ailments whose inexorable spread you are vainly trying to halt.
Parameters can also be provided on the command line:
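For example, using rmarkdown::render from the shell (the country name
here is only an illustration):

Rscript -e "rmarkdown::render('fifth.Rmd', params = list(country = 'Lesotho'))"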
4.7 How can I publish pages on GitHub?

GitHub Pages uses a tool called Jekyll to publish websites. Jekyll follows a
few simple rules:

• It ignores files and directories whose names start with an underscore, or that
it is specifically told to exclude in _config.yml.
• If an HTML or Markdown file starts with a YAML header, Jekyll translates
it.
• If the file does not include such a header, Jekyll simply copies it as-is.
The simplest way to publish our knitted pages is therefore to:

1. Make sure the project is configured to publish from the root direc-
tory of the master branch.
2. Compile our R Markdown files locally to create HTML in the root
directory of our local copy.
3. Commit that HTML to the master branch.
4. Push.
This works because the HTML files generated by knitr don’t contain YAML
headers, so they are copied as-is. If we want to style those pages with our own
CSS or add some JavaScript, we can tell knitr to include files of our choice
during translation:
---
title: "Defenestration By Country"
output:
html_document:
includes:
in_header: extra-header.html
after_body: extra-footer.html
---
...content...
Behind the scenes, knitr translates our R Markdown into plain Markdown,
which is then turned into HTML by yet another tool called Pandoc. If we are
brave, we can create an entirely new Pandoc HTML template and format our
files with that:
---
title: "Defenestration by Season"
output:
html_document:
template: seasonal-report.html
---
...content...
This works, but having the generated HTML in our root directory is messy.
Given that we can configure GitHub Pages to publish from the docs folder,
why don’t we put our HTML there? After all, knitr::knit has an output
parameter with which we can specify a location for the output file.
The answer is that knitr becomes rather vexed when the output directory is
not the same as the current working directory. Programmers being program-
mers, there are several ways around this, though none of them is made any
less frustrating by the fact that other tools in the knitr family, such as
Bookdown, do allow users to specify the output directory through a
configuration parameter.
5 Creating Packages

Data is not born tidy. We must cleanse it to make it serve our needs. The
previous chapter gave us the tools; here, we will see how to apply them and
how to make our work usable by others.
5.2 What is our starting point?

The original UNICEF data file is a spreadsheet export that looks like this:
"Early Infant Diagnosis: Percentage of infants born to women living with HIV...",,,,,,,,,,
,,2009,,,2010,,,2011,,,2012,,,2013,,,2014,,,2015,,,2016,,,2017,,,
ISO3,Countries,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,...
AFG,Afghanistan,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
ALB,Albania,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
DZA,Algeria,-,-,-,-,-,-,38%,42%,35%,23%,25%,21%,55%,60%,50%,27%,30%,25%,23%,25%,21%,33%,37
AGO,Angola,-,-,-,3%,4%,2%,5%,7%,4%,6%,8%,5%,15%,20%,12%,10%,14%,8%,6%,8%,5%,1%,2%,1%,1%,2%
... many more rows ...
YEM,Yemen,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
ZMB,Zambia,59%,70%,53%,27%,32%,24%,70%,84%,63%,74%,88%,67%,64%,76%,57%,91%,>95%,81%,43%,52
ZWE,Zimbabwe,-,-,-,12%,15%,10%,23%,28%,20%,38%,47%,33%,57%,70%,49%,54%,67%,47%,59%,73%,51%
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,2009,,,2010,,,2011,,,2012,,,2013,,,2014,,,2015,,,2016,,,2017,,,
,,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,Estimate,hi,lo,...
Region,East Asia and the Pacific,25%,30%,22%,35%,42%,29%,30%,37%,26%,32%,38%,27%,28%,34%,2
,Eastern and Southern Africa,23%,29%,20%,44%,57%,37%,48%,62%,40%,54%,69%,46%,51%,65%,43%,6
,Eastern Europe and Central Asia,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,
... several more rows ...
,Sub-Saharan Africa,16%,22%,13%,34%,46%,28%,37%,50%,30%,43%,57%,35%,41%,54%,33%,50%,66%,41
,Global,17%,23%,13%,33%,45%,27%,36%,49%,29%,41%,55%,34%,40%,53%,32%,48%,64%,39%,49%,64%,40
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Indicator definition: Percentage of infants born to women living with HIV... ,,,,,,,,,,,,,
Note: Data are not available if country did not submit data...,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Data source: Global AIDS Monitoring 2018 and UNAIDS 2018 estimates,,,,,,,,,,,,,,,,,,,,,,,,
"For more information on this indicator, please visit the guidance:...",,,,,,,,,,,,,,,,,,,
"For more information on the data, visit data.unicef.org",,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Our goal is to turn this into the tidy data we used in Chapter 3:

country,year,estimate,hi,lo
AFG,2009,NA,NA,NA
AFG,2010,NA,NA,NA
AFG,2011,NA,NA,NA
AFG,2012,NA,NA,NA
...
ZWE,2016,0.71,0.88,0.62
ZWE,2017,0.65,0.81,0.57
5.3 How do I convert values to numbers?

Let’s see what read_csv makes of the raw file when we use its default settings:

raw <- read_csv("infant_hiv.csv")
Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4], 'X5' [5],
'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12],
'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17], 'X18' [18],
'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22], 'X23' [23], 'X24' [24],
'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28], 'X29' [29], 'X30' [30]
head(raw)
# A tibble: 6 x 30
`Early Infant D~ X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 <NA> <NA> 2009 <NA> <NA> 2010 <NA> <NA> 2011 <NA> <NA>
2 ISO3 Coun~ Esti~ hi lo Esti~ hi lo Esti~ hi lo
3 AFG Afgh~ - - - - - - - - -
4 ALB Alba~ - - - - - - - - -
5 DZA Alge~ - - - - - - 38% 42% 35%
6 AGO Ango~ - - - 3% 4% 2% 5% 7% 4%
# ... with 19 more variables: X12 <chr>, X13 <chr>, X14 <chr>, X15 <chr>,
# X16 <chr>, X17 <chr>, X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>,
# X22 <chr>, X23 <chr>, X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>,
# X28 <chr>, X29 <chr>, X30 <lgl>
All right: R isn’t able to infer column names, so it uses the entire first comment
string as a very long column name and then makes up names for the other
columns. Looking at the file, the second row has years (spaced at three-column
intervals) and the column after that has the ISO3 country code, the country’s
name, and then “Estimate”, “hi”, and “lo” repeated for every year. We are
going to have to combine what’s in the second and third rows, so we’re going
to have to do some work no matter which we skip or keep. Since we want the
ISO3 code and the country name, let’s skip the first two rows.
raw <- read_csv("infant_hiv.csv", skip = 2)
head(raw)
# A tibble: 6 x 30
ISO3 Countries Estimate hi lo Estimate_1 hi_1 lo_1 Estimate_2 hi_2
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 AFG Afghanis~ - - - - - - - -
2 ALB Albania - - - - - - - -
3 DZA Algeria - - - - - - 38% 42%
4 AGO Angola - - - 3% 4% 2% 5% 7%
5 AIA Anguilla - - - - - - - -
6 ATG Antigua ~ - - - - - - - -
# ... with 20 more variables: lo_2 <chr>, Estimate_3 <chr>, hi_3 <chr>,
# lo_3 <chr>, Estimate_4 <chr>, hi_4 <chr>, lo_4 <chr>, Estimate_5 <chr>,
# hi_5 <chr>, lo_5 <chr>, Estimate_6 <chr>, hi_6 <chr>, lo_6 <chr>,
# Estimate_7 <chr>, hi_7 <chr>, lo_7 <chr>, Estimate_8 <chr>, hi_8 <chr>,
# lo_8 <chr>, X30 <lgl>
That’s a bit of an improvement, but why are all the columns character
instead of numbers? This happens because:
1. our CSV file uses - (a single dash) to show missing data, and
2. all of our numbers end with %, which means those values actually
are character strings.
We will tackle the first problem by setting na = c("-") in our read_csv call
(since we should never do ourselves what a library function will do for us):
raw <- read_csv("infant_hiv.csv", skip = 2, na = c("-"))
head(raw)
# A tibble: 6 x 30
ISO3 Countries Estimate hi lo Estimate_1 hi_1 lo_1 Estimate_2 hi_2
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 AFG Afghanis~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 ALB Albania <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 DZA Algeria <NA> <NA> <NA> <NA> <NA> <NA> 38% 42%
4 AGO Angola <NA> <NA> <NA> 3% 4% 2% 5% 7%
5 AIA Anguilla <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 ATG Antigua ~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# ... with 20 more variables: lo_2 <chr>, Estimate_3 <chr>, hi_3 <chr>,
# lo_3 <chr>, Estimate_4 <chr>, hi_4 <chr>, lo_4 <chr>, Estimate_5 <chr>,
# hi_5 <chr>, lo_5 <chr>, Estimate_6 <chr>, hi_6 <chr>, lo_6 <chr>,
# Estimate_7 <chr>, hi_7 <chr>, lo_7 <chr>, Estimate_8 <chr>, hi_8 <chr>,
# lo_8 <chr>, X30 <lgl>
That’s progress. We now need to strip the percentage signs and convert what’s
left to numeric values. To simplify our lives, let’s get the ISO3 and Countries
columns out of the way. We will save the ISO3 values for later use (and because
it will illustrate a point about data hygiene that we want to make later, but
which we don’t want to reveal just yet). Rather than typing out the names of
all the columns we want to keep in the call to filter, we subtract the ones
we want to discard:
In the Hollywood version of this lesson, we would sigh heavily at this point
as we realize that we should have called select, not filter. Once we make
that change, we can move forward once again:
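countries <- raw$ISO3
body <- raw %>%
  select(-ISO3, -Countries)
head(body)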
# A tibble: 6 x 28
Estimate hi lo Estimate_1 hi_1 lo_1 Estimate_2 hi_2 lo_2 Estimate_3
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> 38% 42% 35% 23%
4 <NA> <NA> <NA> 3% 4% 2% 5% 7% 4% 6%
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# ... with 18 more variables: hi_3 <chr>, lo_3 <chr>, Estimate_4 <chr>,
# hi_4 <chr>, lo_4 <chr>, Estimate_5 <chr>, hi_5 <chr>, lo_5 <chr>,
# Estimate_6 <chr>, hi_6 <chr>, lo_6 <chr>, Estimate_7 <chr>, hi_7 <chr>,
# lo_7 <chr>, Estimate_8 <chr>, hi_8 <chr>, lo_8 <chr>, X30 <lgl>
But wait. Weren’t there some aggregate lines of data at the end of our input?
What happened to them?
tail(countries, n = 25)
[1] "YEM"
[2] "ZMB"
[3] "ZWE"
[4] ""
[5] ""
[6] ""
[7] "Region"
[8] ""
[9] ""
[10] ""
[11] ""
[12] ""
[13] ""
[14] ""
[15] ""
[16] "Super-region"
[17] ""
[18] ""
[19] ""
[20] ""
[21] "Indicator definition: Percentage of infants born to women living with HIV receiving
[22] "Note: Data are not available if country did not submit data to Global AIDS Monitorin
[23] "Data source: Global AIDS Monitoring 2018 and UNAIDS 2018 estimates"
[24] "For more information on this indicator, please visit the guidance: http://www.unaids
[25] "For more information on the data, visit data.unicef.org"
Once again the actor playing our part on screen sighs heavily. How are we to
trim this? Since there is only one file, we can open the file with an editor or
spreadsheet program, scroll down, check the line number, and slice there. This
is a very bad idea if we’re planning to use this script on other files—we should
instead look for the first blank line or the entry for Zimbabwe or something
like that—but let’s revisit the problem once we have our data in place.
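For now, the two lines in question are, in essence:

sliced <- slice(raw, 1:192)
countries <- sliced$ISO3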
Notice that we’re counting rows not including the two we’re skipping, which
means that the 192 in the call to slice above corresponds to row 195 of our
original data: 195, not 194, because we’re using the first row of unskipped
data as headers and yes, you are in fact making that faint whimpering sound
you now hear. You will hear it often when dealing with real-world data…
Notice also that we are slicing, then extracting the column containing the
countries. In an earlier version of this lesson we peeled off the ISO3 country
codes, sliced that vector, and then wondered why our main table still had the
unwanted rows at the bottom. Some quick checks on our intermediate variables
tell us what kind of thing each one is:
is.numeric(result)
[1] TRUE
is.numeric(numbers)
[1] TRUE
is_tibble(body)
[1] TRUE
is_tibble(numbers)
[1] FALSE
Perdition. After browsing the data, we realize that some entries are ">95%",
i.e., there is a greater-than sign as well as a percentage in the text. We will
need to regularize those before we do any conversions.
Before that, however, let's see if we can get rid of the percent signs. The
obvious way is to use str_replace(body, "%", ""), but that doesn't work:
str_replace works on vectors, but a tibble is a list of vectors. Instead, we
can use a higher-order function called map to apply the function str_replace
to each column in turn to get rid of the percent signs:
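trimmed <- map(body, str_replace, pattern = "%", replacement = "")
trimmed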
$Estimate
[1] NA NA NA NA NA NA
[7] NA NA NA NA "26" NA
[13] NA NA NA ">95" NA "77"
[19] NA NA "7" NA NA "25"
[25] NA NA "3" NA ">95" NA
[31] "27" NA "1" NA NA NA
[37] "5" NA "8" NA "92" NA
[43] NA "83" NA NA NA NA
[49] NA NA NA "28" "1" "4"
[55] NA NA NA NA "4" NA
[61] NA NA NA NA "61" NA
[67] NA NA NA NA NA NA
[73] NA NA "61" NA NA NA
[79] NA "2" NA NA NA NA
[85] NA NA NA ">95" NA NA
[91] NA NA NA NA NA "43"
[97] "5" NA NA NA NA NA
[103] "37" NA "8" NA NA NA
[109] NA NA NA NA NA "2"
[115] NA NA NA NA "2" NA
[121] NA "50" NA "4" NA NA
[127] NA "1" NA NA NA NA
[133] NA NA "1" NA NA NA
[139] ">95" NA NA "58" NA NA
[145] NA NA NA NA "11" NA
[151] NA NA NA NA NA NA
[157] NA NA NA NA NA NA
[163] "9" NA NA NA NA "1"
[169] NA NA NA "7" NA NA
[175] NA NA NA NA "8" "78"
[181] NA NA "13" NA NA "0"
[187] NA NA NA NA "59" NA
[193] "" "2009" "Estimate" "25" "23" NA
[199] "24" "2" NA "1" "8" NA
[205] "7" "72" "16" "17" "" ""
[211] "" "" "" ""
$hi
[1] NA NA NA NA NA NA NA NA NA NA "35" NA
...
Perdition once again. The problem now is that map produces a raw list as
output. The function we want is map_dfr, which applies a function to each
element of a tibble and binds the results together by row to give us a tibble.
(There is a corresponding function map_dfc that binds results by column.)
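Re-running the pipeline from the top (which is why read_csv's warning about
duplicated column names appears again), our script is, in essence:

raw <- read_csv("data/infant_hiv.csv", skip = 2, na = c("-"))
sliced <- slice(raw, 1:192)
countries <- sliced$ISO3
body <- raw %>%
  select(-ISO3, -Countries)
trimmed <- map_dfr(body, str_replace, pattern = "%", replacement = "")
head(trimmed)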
Warning: Duplicated column names deduplicated: 'Estimate' => 'Estimate_1' [6],
... 'hi' => 'hi_7' [25], 'lo' => 'lo_7' [26], 'Estimate' => 'Estimate_8' [27], 'hi' =>
'hi_8' [28], 'lo' => 'lo_8' [29]
# A tibble: 6 x 28
Estimate hi lo Estimate_1 hi_1 lo_1 Estimate_2 hi_2 lo_2 Estimate_3
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> 38 42 35 23
4 <NA> <NA> <NA> 3 4 2 5 7 4 6
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# ... with 18 more variables: hi_3 <chr>, lo_3 <chr>, Estimate_4 <chr>,
# hi_4 <chr>, lo_4 <chr>, Estimate_5 <chr>, hi_5 <chr>, lo_5 <chr>,
# Estimate_6 <chr>, hi_6 <chr>, lo_6 <chr>, Estimate_7 <chr>, hi_7 <chr>,
# lo_7 <chr>, Estimate_8 <chr>, hi_8 <chr>, lo_8 <chr>, X30 <chr>
Now to tackle those ">95%" values. It turns out that str_replace uses regular
expressions, not just direct string matches, so we can get rid of the > at the
same time as we get rid of the %. We will check by looking at the first Estimate
column, which earlier inspection informed us had at least one ">95%" in it:
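A pattern that makes the > optional and keeps only the digits does the job.
Re-running from the top once more (reconstructed; the regular expression is
the natural one given the results below):

raw <- read_csv("data/infant_hiv.csv", skip = 2, na = c("-"))
sliced <- slice(raw, 1:192)
countries <- sliced$ISO3
body <- raw %>%
  select(-ISO3, -Countries)
trimmed <- map_dfr(body, str_replace, pattern = ">?(\\d+)%", replacement = "\\1")
trimmed$Estimate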
Warning: Duplicated column names deduplicated: 'Estimate' => 'Estimate_1' [6],
... 'hi' => 'hi_2' [10], 'lo' => 'lo_2' [11], 'Estimate' => 'Estimate_3' [12], 'hi'
=> 'hi_3' [13], 'lo' => 'lo_3' [14], 'Estimate' => 'Estimate_4' [15], 'hi'
=> 'hi_4' [16], 'lo' => 'lo_4' [17], 'Estimate' => 'Estimate_5' [18], 'hi'
=> 'hi_5' [19], 'lo' => 'lo_5' [20], 'Estimate' => 'Estimate_6' [21], 'hi'
=> 'hi_6' [22], 'lo' => 'lo_6' [23], 'Estimate' => 'Estimate_7' [24], 'hi'
=> 'hi_7' [25], 'lo' => 'lo_7' [26], 'Estimate' => 'Estimate_8' [27], 'hi' =>
'hi_8' [28], 'lo' => 'lo_8' [29]
[1] NA NA NA NA NA NA
[7] NA NA NA NA "26" NA
[13] NA NA NA "95" NA "77"
[19] NA NA "7" NA NA "25"
[25] NA NA "3" NA "95" NA
[31] "27" NA "1" NA NA NA
[37] "5" NA "8" NA "92" NA
[43] NA "83" NA NA NA NA
[49] NA NA NA "28" "1" "4"
[55] NA NA NA NA "4" NA
[61] NA NA NA NA "61" NA
[67] NA NA NA NA NA NA
[73] NA NA "61" NA NA NA
[79] NA "2" NA NA NA NA
[85] NA NA NA "95" NA NA
[91] NA NA NA NA NA "43"
[97] "5" NA NA NA NA NA
[103] "37" NA "8" NA NA NA
[109] NA NA NA NA NA "2"
[115] NA NA NA NA "2" NA
...
Excellent. We can now use map_dfr to convert the columns to numeric per-
centages using an anonymous function that we define inside the map_dfr call
itself:
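percents <- map_dfr(trimmed, function(col) as.numeric(col) / 100)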
head(percents)
# A tibble: 6 x 28
Estimate hi lo Estimate_1 hi_1 lo_1 Estimate_2 hi_2 lo_2 Estimate_3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA 0.38 0.42 0.35 0.23
4 NA NA NA 0.03 0.04 0.02 0.05 0.07 0.04 0.06
5 NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA
# ... with 18 more variables: hi_3 <dbl>, lo_3 <dbl>, Estimate_4 <dbl>,
# hi_4 <dbl>, lo_4 <dbl>, Estimate_5 <dbl>, hi_5 <dbl>, lo_5 <dbl>,
# Estimate_6 <dbl>, hi_6 <dbl>, lo_6 <dbl>, Estimate_7 <dbl>, hi_7 <dbl>,
# lo_7 <dbl>, Estimate_8 <dbl>, hi_8 <dbl>, lo_8 <dbl>, X30 <dbl>
warnings()
Something is still not right. The first Estimates column looks all right, so
let’s have a look at the second column:
trimmed$hi
[1] NA NA NA NA NA NA NA NA NA NA "35" NA NA NA NA
[16] "95" NA "89" NA NA "10" NA NA "35" NA NA "5" NA "95" NA
[31] "36" NA "1" NA NA NA "6" NA "12" NA "95" NA NA "95" NA
[46] NA NA NA NA NA NA "36" "1" "4" NA NA NA NA "6" NA
[61] NA NA NA NA "77" NA NA NA NA NA NA NA NA NA "74"
[76] NA NA NA NA "2" NA NA NA NA NA NA NA "95" NA NA
[91] NA NA NA NA NA "53" "7" NA NA NA NA NA "44" NA "9"
[106] NA NA NA NA NA NA NA NA "2" NA NA NA NA "2" NA
[121] NA "69" NA "7" NA NA NA "1" NA NA NA NA NA NA "1"
[136] NA NA NA "95" NA NA "75" NA NA NA NA NA NA "13" NA
[151] NA NA NA NA NA NA NA NA NA NA NA NA "11" NA NA
[166] NA NA "1" NA NA NA "12" NA NA NA NA NA NA "9" "95"
[181] NA NA "16" NA NA "1" NA NA NA NA "70" NA "" "" "hi"
[196] "30" "29" NA "32" "2" NA "2" "12" NA "9" "89" "22" "23" "" ""
[211] "" "" "" ""
Where are the empty strings toward the end of trimmed$hi coming from? Let’s
backtrack by examining the hi column of each of our intermediate variables
interactively in the console…
…and there’s our bug. We are creating a variable called sliced that has only
the rows we care about, but then using the full table in raw to create body.
It’s a simple mistake, and one that could easily have slipped by us. Here is
our revised script:
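In essence:

raw <- read_csv("data/infant_hiv.csv", skip = 2, na = c("-"))
sliced <- slice(raw, 1:192)
countries <- sliced$ISO3
body <- sliced %>%            # sliced, not raw
  select(-ISO3, -Countries)
trimmed <- map_dfr(body, str_replace, pattern = ">?(\\d+)%", replacement = "\\1")
percents <- map_dfr(trimmed, function(col) as.numeric(col) / 100)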
head(percents)
# A tibble: 6 x 28
Estimate hi lo Estimate_1 hi_1 lo_1 Estimate_2 hi_2 lo_2 Estimate_3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA
and tail:
tail(percents)
# A tibble: 6 x 28
Estimate hi lo Estimate_1 hi_1 lo_1 Estimate_2 hi_2 lo_2 Estimate_3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA 0.31 0.37 0.26 0.3
4 NA NA NA NA NA NA NA NA NA NA
5 0.59 0.7 0.53 0.27 0.32 0.24 0.7 0.84 0.63 0.74
6 NA NA NA 0.12 0.15 0.1 0.23 0.28 0.2 0.38
# ... with 18 more variables: hi_3 <dbl>, lo_3 <dbl>, Estimate_4 <dbl>,
# hi_4 <dbl>, lo_4 <dbl>, Estimate_5 <dbl>, hi_5 <dbl>, lo_5 <dbl>,
# Estimate_6 <dbl>, hi_6 <dbl>, lo_6 <dbl>, Estimate_7 <dbl>, hi_7 <dbl>,
# lo_7 <dbl>, Estimate_8 <dbl>, hi_8 <dbl>, lo_8 <dbl>, X30 <dbl>
Comparing this to the raw data file convinces us that yes, we are now con-
verting the percentages properly, which means we are halfway home.
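5.4 How do I reorganize the columns?

To see how the rearranging tools work, let's create a small table to practice
on:

small <- tribble(
  ~ISO,  ~est,  ~hi,   ~lo,
  "ABC", 0.25,  0.30,  0.20,
  "DEF", 0.55,  0.60,  0.50
)
small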
# A tibble: 2 x 4
ISO est hi lo
<chr> <dbl> <dbl> <dbl>
1 ABC 0.25 0.3 0.2
2 DEF 0.55 0.6 0.5
small %>%
pivot_longer(cols = c(est, hi, lo), names_to = "kind", values_to = "reported")
# A tibble: 6 x 3
ISO kind reported
<chr> <chr> <dbl>
1 ABC est 0.25
2 ABC hi 0.3
3 ABC lo 0.2
4 DEF est 0.55
5 DEF hi 0.6
6 DEF lo 0.5
The cols parameter tells pivot_longer which columns to rearrange. The new
column names_to gets the old column titles (in our case, est, hi, and lo),
while the new column values_to gets the values. The result is a table which
is longer and narrower than the original, which is what inspired the function’s
name. (Previous versions of the tidyverse called this function gather, but
users reported that they found the name confusing.)
The other tool we need to rearrange our data is separate, which splits one
column into two. For example, if we have the year and the heading type in a
single column:
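single <- tribble(
  ~combined,  ~value,
  "2009-est", 123,
  "2009-hi",  456,
  "2009-lo",  789,
  "2010-est", 987,
  "2010-hi",  654,
  "2010-lo",  321
)
single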
# A tibble: 6 x 2
combined value
<chr> <dbl>
1 2009-est 123
2 2009-hi 456
3 2009-lo 789
4 2010-est 987
5 2010-hi 654
6 2010-lo 321
we can get the year and the heading into separate columns by separating on
the - character:
single %>%
separate(combined, sep = "-", c("year", "kind"))
# A tibble: 6 x 3
year kind value
<chr> <chr> <dbl>
1 2009 est 123
2 2009 hi 456
3 2009 lo 789
4 2010 est 987
5 2010 hi 654
6 2010 lo 321
Our overall strategy is therefore to:

1. Replace the double column headers in the existing data with a single
header that combines the year with the kind.
2. Gather the data so that the year-kind values are in a single column.
3. Split that column.
We’ve seen the tools we need for the second and third step; the first involves a
little bit of list manipulation. Let’s start by repeating "est", "hi", and "lo"
as many times as we need them:
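One call to rep does it (we will call the result kinds):

kinds <- rep(c("est", "hi", "lo"), 9)   # nine years, 2009-2017
kinds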
[1] "est" "hi" "lo" "est" "hi" "lo" "est" "hi" "lo" "est" "hi" "lo"
[13] "est" "hi" "lo" "est" "hi" "lo" "est" "hi" "lo" "est" "hi" "lo"
[25] "est" "hi" "lo"
As you can probably guess from its name, rep repeats things a spec-
ified number of times, and as noted previously, a vector of vectors is
flattened into a single vector, so what an innocent might expect to be
c(c('est', 'hi', 'lo'), c('est', 'hi', 'lo')) automatically becomes
c('est', 'hi', 'lo', 'est', 'hi', 'lo').
What about the years? We want each year to appear three times in a row,
i.e., 2009 2009 2009 2010 2010 2010 and so on. rep won't do this for us, but
we can get there with map:
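years <- map(2009:2017, rep, 3)
years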
[[1]]
[1] 2009 2009 2009
[[2]]
[1] 2010 2010 2010
[[3]]
[1] 2011 2011 2011
[[4]]
[1] 2012 2012 2012
[[5]]
[1] 2013 2013 2013
[[6]]
[1] 2014 2014 2014
[[7]]
[1] 2015 2015 2015
[[8]]
[1] 2016 2016 2016
[[9]]
[1] 2017 2017 2017
That’s almost right, but map hasn’t flattened the list for us. Luckily, we can
use unlist to do that:
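years <- unlist(years)
years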
[1] 2009 2009 2009 2010 2010 2010 2011 2011 2011 2012 2012 2012 2013 2013 2013
[16] 2014 2014 2014 2015 2015 2015 2016 2016 2016 2017 2017 2017
We can now combine the years and kinds by pasting the two vectors together
with "-" as a separator:
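names(percents) <- paste(years, kinds, sep = "-")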
Warning: The `value` argument of `names<-()` must have the same length as `x` as of tibble 3.0.0.
`names` must have length 28, not 27.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
percents
# A tibble: 192 x 28
`2009-est` `2009-hi` `2009-lo` `2010-est` `2010-hi` `2010-lo` `2011-est`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA 0.38
4 NA NA NA 0.03 0.04 0.02 0.05
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
7 NA NA NA NA NA NA 0.13
8 NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA
# ... with 182 more rows, and 21 more variables: `2011-hi` <dbl>,
# `2011-lo` <dbl>, `2012-est` <dbl>, `2012-hi` <dbl>, `2012-lo` <dbl>,
# `2013-est` <dbl>, `2013-hi` <dbl>, `2013-lo` <dbl>, `2014-est` <dbl>,
# `2014-hi` <dbl>, `2014-lo` <dbl>, `2015-est` <dbl>, `2015-hi` <dbl>,
# `2015-lo` <dbl>, `2016-est` <dbl>, `2016-hi` <dbl>, `2016-lo` <dbl>,
# `2017-est` <dbl>, `2017-hi` <dbl>, `2017-lo` <dbl>, NA <dbl>
This example shows that names(table) doesn’t just give us a list of column
names: it gives us something we can assign to when we want to rename those
columns. This example also shows us that percents has the wrong number
of columns. Inspecting the tibble in the console, we see that the last column
is full of NAs:
percents[, ncol(percents)]
# A tibble: 192 x 1
``
<dbl>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
# ... with 182 more rows
all(is.na(percents[,ncol(percents)]))
[1] TRUE
Let’s relabel our data again and then drop the empty column. (There are
other ways to do this, but I find steps easier to read after the fact this way.)
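names(percents) <- c(paste(years, kinds, sep = "-"), "empty")
percents <- select(percents, -empty)
percents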
# A tibble: 192 x 27
`2009-est` `2009-hi` `2009-lo` `2010-est` `2010-hi` `2010-lo` `2011-est`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA 0.38
4 NA NA NA 0.03 0.04 0.02 0.05
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
7 NA NA NA NA NA NA 0.13
8 NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA
# ... with 182 more rows, and 20 more variables: `2011-hi` <dbl>,
# `2011-lo` <dbl>, `2012-est` <dbl>, `2012-hi` <dbl>, `2012-lo` <dbl>,
# `2013-est` <dbl>, `2013-hi` <dbl>, `2013-lo` <dbl>, `2014-est` <dbl>,
# `2014-hi` <dbl>, `2014-lo` <dbl>, `2015-est` <dbl>, `2015-hi` <dbl>,
# `2015-lo` <dbl>, `2016-est` <dbl>, `2016-hi` <dbl>, `2016-lo` <dbl>,
# `2017-est` <dbl>, `2017-hi` <dbl>, `2017-lo` <dbl>
It’s time to put the country codes back on the table, move the year and kind
from column headers to a column with pivot_longer, and then split that
column with separate:
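In essence:

final <- percents %>%
  mutate(country = countries) %>%
  pivot_longer(cols = -country, names_to = "year_kind", values_to = "reported") %>%
  separate(year_kind, c("year", "kind"))
final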
# A tibble: 5,184 x 4
country year kind reported
<chr> <chr> <chr> <dbl>
1 AFG 2009 est NA
2 AFG 2009 hi NA
3 AFG 2009 lo NA
4 AFG 2010 est NA
5 AFG 2010 hi NA
6 AFG 2010 lo NA
7 AFG 2011 est NA
8 AFG 2011 hi NA
9 AFG 2011 lo NA
10 AFG 2012 est NA
# ... with 5,174 more rows
Wrapped up as a function clean_infant_hiv that takes the filename and the
number of rows to keep as arguments, the whole pipeline can be run in a
single call:

clean_infant_hiv("data/infant_hiv.csv", 192)
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 192 rows [28, 56,
84, 112, 140, 168, 196, 224, 252, 280, 308, 336, 364, 392, 420, 448, 476, 504,
532, 560, ...].
# A tibble: 5,376 x 4
country year kind reported
<chr> <chr> <chr> <dbl>
1 AFG 2009 est NA
2 AFG 2009 hi NA
3 AFG 2009 lo NA
4 AFG 2010 est NA
5 AFG 2010 hi NA
6 AFG 2010 lo NA
5.5 How do I create a package?
We’re done, and we have learned a lot of R, but what we have also learned
is that we make mistakes, and that those mistakes can easily slip past us. It
would be hubris to believe that we will not make more as we continue to clean
this data. What will guide us safely through these dark caverns and back into
the light of day?
The answer is testing. We must test our assumptions, test our code, test our
very being if we are to advance. R provides tools for this purpose, but in order
to use them, we must venture into the greater realm of packaging in R.
An R package must have a specific layout. Its most important parts for our
purposes are:

• The text file DESCRIPTION (with no suffix) describes what the package does,
who wrote it, and what other packages it requires to run. We will edit its
contents as we go along.
• NAMESPACE (whose name also has no suffix) contains the names of everything
exported from the package (i.e., everything that is visible to the outside
world). As we will see, we should leave its management in the hands of
RStudio and the devtools package we will meet below.
We can type all of this in if we want, but R has a very useful package called
usethis to help us create and maintain packages. To use it, we load usethis
in the console with library(usethis) and use usethis::create_package
with the path to the new package directory as an argument:
usethis::create_package('~/unicefdata')
✔ Creating '/Users/gvwilson/unicefdata/'
✔ Setting active project to '/Users/gvwilson/unicefdata'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: unicefdata
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
usethis::use_readme_md()
usethis::use_mit_license(name="UNICEF Data")
use_mit_license creates two files: LICENSE and LICENSE.md. The rules for
R packages require the former, but GitHub expects the latter.
usethis::use_code_of_conduct()
## Code of Conduct
Please note that the placeholder project is released with a [Contributor Code of Conduct
✔ Writing 'CODE_OF_CONDUCT.md'
✔ Adding '^CODE_OF_CONDUCT\\.md$' to '.Rbuildignore'
• Don't forget to describe the code of conduct in your README:
Please note that the 'unicefdata' project is released with a
[Contributor Code of Conduct](CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.
[Copied to clipboard]
# unicefdata
## Installation
Package: unicefdata
Title: Small UNICEF Dataset for Tutorial Purposes
Version: 0.0.0.9000
Authors@R:
person(given = "Greg",
family = "Wilson",
role = c("aut", "cre"),
email = "gvwilson@third-bit.com",
comment = c(ORCID = "0000-0001-8659-8979"))
Description: This package demonstrates how to share small datasets in R.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
We can now go to the Build tab in RStudio and run Check to make sure our
empty package is judged sane by our strict, yet impartial, machine.
We can now put the function we wrote to clean up the infant HIV data in a
file called R/clean_infant_hiv.R, either by using File...New in RStudio or our
favorite editor. When we run Check now, it warns us that our exported function
is undocumented.

5.6 How can I document the contents of a package?

A little documentation seems like a fair request. For this, we turn to Hadley
Wickham’s R Packages and Karl Broman’s “R package primer” for advice on
writing roxygen2 documentation. We then return to our source file and prefix
our existing code with this:
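The block below is representative; the @param, @return, and @export tags are
the ones the discussion that follows relies on:

#' Tidy up the infant HIV data set.
#'
#' @param src Path to source file
#'
#' @return A tibble of tidy data
#'
#' @export

clean_infant_hiv <- function(src) {
  # ...the cleaning code developed earlier...
}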
roxygen2 processes comment lines that start with #' (hash followed by single
quote). Putting a comment block right before a function associates that
documentation with that function, so here we are saying that clean_infant_hiv
takes a single argument src, the path to the source file, and returns a tibble
of tidy data.
Our function is now documented, but when we run Check, we still get a
warning. After a bit more searching and experimentation, we discover that we
need to load the devtools package and run devtools::document() in the
console to regenerate documentation—it isn’t done automatically.
devtools::document()
Another check confirms that our function is now documented. NAMESPACE now
contains:

# Generated by roxygen2: do not edit by hand

export(clean_infant_hiv)

The export directive is what makes our function visible to people who load
the package, and the comment helpfully reminds us that we shouldn't edit
this file ourselves, but should instead trust our tools to do the work for us.
As for man/clean_infant_hiv.Rd, one look at its contents shows us more clearly
than mere words ever could why we want to let roxygen2 write it for us.
5.7 How can my package import what it needs?

Our function uses str_replace from the stringr package, and putting

library(stringr)

at the top of our script works fine when we run it interactively. It probably
won't work when the package is loaded by a user, though, because stringr may
not be in memory on the user's machine at the time str_replace is called.
How then can our packages use libraries? One way is to add @import directives
to the roxygen2 documentation for our functions to tell R what we depend on.
Another is to qualify every external call with the name of the package it
comes from:
percents %>%
  dplyr::mutate(country = countries) %>%
  tidyr::pivot_longer(cols = -country, names_to = "year_kind", values_to = "reported") %>%
  tidyr::separate(year_kind, c("year", "kind"))

Either way, we list the packages and minimum versions we depend on in the
Imports section of DESCRIPTION:
Imports:
readr (>= 1.1.0),
dplyr (>= 0.7.0),
magrittr (>= 1.5.0),
purrr (>= 0.2.0),
rlang (>= 0.3.0),
stringr (>= 1.3.0),
tidyr (>= 0.8.3)
We also import the .data pronoun from rlang in a roxygen2 block:

#' @importFrom rlang .data
NULL
and then modify the calls that use naked column names to look like:
dplyr::select(-.data$ISO3, -.data$Countries)
What is this .data that we have invoked? Typing ?rlang::.data gives us the
answer: it is a pronoun that allows us to be explicit when we refer to an
object inside the data. Being explicit that these names are columns of .data
rather than undefined variables finally (finally) gives us a clean build.
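We also want to document the dataset itself. Following the convention of one
placeholder file per dataset, R/infant_hiv.R is, in outline:

#' Infant HIV testing data.
#'
#' @format A tibble with columns:
#' \describe{
#'   \item{country}{Country reporting (ISO3 code)}
#'   \item{year}{Year of report}
#'   \item{kind}{Type of reported value ("est", "hi", or "lo")}
#'   \item{reported}{Value reported}
#' }
"infant_hiv"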
Everything except the last line is a roxygen2 comment block that describes
the data in plain language, then uses some tags and directives to docu-
ment its format and fields. (Note that we have also documented our data
in inst/extdata/README.md, but that focuses on the format and meaning of
the raw data, not the cleaned-up version.)
The last line is the string "infant_hiv", i.e., the name of the dataset. We
will create one placeholder R file like this for each of our datasets, and each
will have that dataset’s name as the thing being documented.
Let’s run a check:
Depends:
R (>= 2.10)
#' Clean up and share some data from UNICEF on infant HIV rates.
#'
#' @author Greg Wilson, \email{gvwilson@third-bit.com}
#' @docType package
#' @name unicefdata
NULL
• Data in other formats can be put in the inst/extdata directory, and will
be installed when the package is installed.
• Add comments starting with #' to an R file to document functions.
• Use roxygen2 to extract these comments to create manual pages in the man
directory.
• Use @export directives in roxygen2 comment blocks to make functions visi-
ble outside a package.
• Add required libraries to the Imports section of the DESCRIPTION file to
indicate that your package depends on them.
• Use package::function to access externally-defined functions inside a pack-
age.
• Alternatively, add @import directives to roxygen2 comment blocks to make
external functions available inside the package.
• Import .data from rlang and use .data$column to refer to columns instead
of using bare column names.
• Create a file called R/package.R and document NULL to document the pack-
age as a whole.
• Create a file called R/dataset.R and document the string "dataset" to docu-
ment a dataset.
6
Non-Standard Evaluation
The biggest difference between R and Python is not where R starts count-
ing, but its use of lazy evaluation. Nothing in R truly makes sense until we
understand how this works.
def ones_func(ones_arg):
return ones_arg + " ones"
def tens_func(tens_arg):
return ones_func(tens_arg + " tens")
initial = "start"
final = tens_func(initial + " more")
print(final)
start more tens ones
When we call tens_func we pass it initial + " more"; since initial has
just been assigned the value "start", that’s the same as calling tens_func
with "start more". tens_func then calls ones_func with "start more
tens", and ones_func returns "start more tens ones". But there’s more
going on here than that two-sentence summary suggests. Let’s spell out the
steps:
def ones_func(ones_arg):
ones_temp_1 = ones_arg + " ones"
return ones_temp_1
def tens_func(tens_arg):
tens_temp_1 = tens_arg + " tens"
tens_temp_2 = ones_func(tens_temp_1)
return tens_temp_2
initial = "start"
global_temp_1 = initial + " more"
final = tens_func(global_temp_1)
print(final)
Step 3: Python creates a new stack frame to hold the call to tens_func:
Note that tens_arg points to the same thing in memory as global_temp_1,
since Python passes everything by reference.
[Figure: the global frame holds initial ("start") and global_temp_1 ("start more"); a new frame for tens_func binds tens_arg to the same value]
Step 5: Python creates a new stack frame to manage the call to ones_func:
[Figure: a further frame for ones_func binds ones_arg to tens_temp_1 ("start more tens")]
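The R equivalent of the Python program is, give or take cosmetic details:

ones_func <- function(ones_arg) {
  paste(ones_arg, "ones")
}

tens_func <- function(tens_arg) {
  ones_func(paste(tens_arg, "tens"))
}

initial <- "start"
final <- tens_func(paste(initial, "more"))
print(final)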
And here it is with the intermediate steps spelled out in a syntax I just made
up, in which PROMISE(environment, expression, value) records what R knows
about an argument that has not been evaluated yet:
While the original code looked much like our Python, the evaluation trace is
very different, and hinges on the fact that an expression in a programming
language can be represented as a data structure.
What’s an Expression?
An expression is anything that has a value. The simplest expressions are
literal values like the number 1, the string "stuff", and the Boolean TRUE.
A variable like least is also an expression: its value is whatever the variable
currently refers to.
Complex expressions are built out of simpler expressions: 1 + 2 is an ex-
pression that uses + to combine 1 and 2, while the expression c(10, 20,
30) uses the function c to create a vector out of the values 10, 20, 30.
Expressions are often drawn as trees like this:
+
/ \
1 2
[Figure: global_temp_1 bound to PROMISE(@global@, paste(initial, "more"), ____); initial holds "start"]
Instead of evaluating paste(initial, "more") before the call, R passes
tens_func a promise that contains:
• the environment the expression is to be evaluated in,
• the expression itself, and
• the value of that expression (which I'm showing as ____, since it's initially
empty).
[Figure: tens_func's frame binds tens_arg to the still-unevaluated promise]

[Figure: the same promise after resolution, its value slot now holding "start more"]
This evaluation happens after tens_func has been called, not before as in
Python, which is why this scheme is called “lazy” evaluation. Once a promise
has been resolved, R uses its value, and that value never changes.
Step 5: tens_func wants to call ones_func, so R creates another promise to
record what's being passed into ones_func:
Step 6: R calls ones_func, binding the newly-created promise to ones_arg as
it does so:
Step 7: R needs a value for ones_arg to pass to paste, so it resolves the
promise:
[Figures: stack diagrams for Steps 5-7, showing the promise for tens_temp_1 being created, bound to ones_arg, and finally resolved to "start more tens"]

R creates promises for the arguments of every call, not just ours. Even when
we write something as simple as:
sign(2)
then behind the scenes, R is creating a promise and passing it to sign, where
it is automatically resolved to get the number 2 when its value is needed. (If
I wanted to be thorough, I would have shown the promises passed into paste
at each stage of execution above.)
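6.4 Why is lazy evaluation useful?

R lets us work with unevaluated expressions directly. Using rlang's expr, we
can quote the bare name red without evaluating it:

library(rlang)
my_expr <- expr(red)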
my_expr
red
typeof(my_expr)
[1] "symbol"
eval(my_expr)
Error in eval(expr, envir, enclos): object 'red' not found
More usefully, what if we create something that has a value for red:
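color_data <- tribble(
  ~red, ~green,
     1,     10,
     2,     20
)
color_data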
# A tibble: 2 x 2
red green
<dbl> <dbl>
1 1 10
2 2 20
and then ask R to evaluate our expression in the context of that tibble:
eval(my_expr, color_data)
[1] 1 2
When we do this, eval looks for definitions of variables in the data structure
we’ve given it—in this case, the tibble color_data. Since that tibble has a
column called red, eval(my_expr, color_data) gives us that column.
This may not seem life-changing yet, but being able to pass expressions around
and evaluate them in contexts of our choosing allows us to seem very clever
indeed. For example, let’s create another expression:
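add_red_green <- expr(red + green)
typeof(add_red_green)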
[1] "language"
eval(add_red_green, color_data)
[1] 11 22
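We can go further and keep a whole list of checks as unevaluated expressions,
then evaluate each one against the data (the names conditions and checks here
are ours):

conditions <- list(
  expr(red > 0),
  expr(green > red)
)
checks <- map(conditions, eval, color_data)
checks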
[[1]]
[1] TRUE TRUE
[[2]]
[1] TRUE TRUE
We can take it one step further and simply report whether all the checks
passed or not:
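all(unlist(checks))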
[1] TRUE
6.5 What is tidy evaluation?

This is cool, but typing expr(...) over and over is kind of clumsy. It also
seems superfluous, since we know that arguments aren't evaluated before
they're passed into functions. Can we get rid of expr and write something
that does this directly?
The answer is going to be “yes”, but it’s going to take a bit of work.
What I did wrong was use [ instead of [[, which meant that
conditions[1] was not an expression—it was a list containing a single
expression. It turns out that evaluating a list containing an expression
produces a list of expressions rather than an error, which is so helpful that
it only took me an hour to figure out my mistake.
We would like to pass our tests into checking functions directly, without
calling expr to quote our expressions explicitly. For simplicity's sake, our
first attempt only handles a single expression:
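A sketch of that attempt, built around the check_naive function the discussion
below refers to (the table both_hands is a hypothetical example):

both_hands <- tribble(
  ~left, ~right,
      1,     10,
      2,     20
)

check_naive <- function(data, test) {
  eval(test, data)
}

check_naive(both_hands, left != right)

The call fails with an "object 'left' not found" error.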
This actually makes sense: by the time we reach the call to eval, test refers
to a promise that represents the value of red != green in the global envi-
ronment. Promises are not expressions—each promise contains an expression,
but it also contains an environment and a copy of the expression’s value (if
it has ever been calculated). As a result, when R sees the call to eval inside
check_naive it automatically tries to resolve the promise that contains left
!= right, and fails because there are no variables with those names in the
global environment.
So how can we get the expression out of the promise without triggering eval-
uation? One way is to use a function called substitute:
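A sketch:

check_using_substitute <- function(data, test) {
  subst_test <- substitute(test)   # recover the unevaluated expression
  eval(subst_test, data)
}
check_using_substitute(both_hands, left != right)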
However, substitute is frowned upon because it does one thing when called
interactively on the command line and something else when called inside a
function. Instead, we can use a function called enquo from the rlang package.
enquo returns an object called a quosure that contains only an unevaluated
expression and an environment:
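A sketch that prints the quosure and then evaluates it safely with eval_tidy:

check_using_enquo <- function(data, test) {
  q_test <- enquo(test)
  print(q_test)
  all(eval_tidy(q_test, data))
}
check_using_enquo(color_data, red != green)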
<quosure>
expr: ^red != green
env: global
[1] TRUE
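We can use the same tools to parameterize operations by column. A possible
definition of the max_of_var function called below, using the splicing
operator !! that we will meet in a moment, is:

max_of_var <- function(data, the_var) {
  q_var <- enquo(the_var)
  data %>%
    group_by(!!q_var) %>%
    summarize(maximum = max(!!q_var))
}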
max_of_var(color_data, red)
# A tibble: 2 x 2
red maximum
<dbl> <dbl>
1 1 1
2 2 2
We can use this to write a function run_two_checks that runs two checks on
some data:
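A sketch:

run_two_checks <- function(data, first_check, second_check) {
  q_first <- enquo(first_check)
  q_second <- enquo(second_check)
  all(eval_tidy(q_first, data), eval_tidy(q_second, data))
}
run_two_checks(color_data, 0 < red, red < green)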
[1] TRUE
That’s much easier to follow than a bunch of enquo and eval calls, but what
if we want to handle an arbitrary number of checks? Our first attempt is this:
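A reconstruction (the table new_colors and its values are assumptions
consistent with the error and results shown below):

new_colors <- tribble(
  ~yellow, ~violet,
        1,      10,
        2,      20
)

run_all_checks <- function(data, ...) {
  all(map_lgl(list(...), function(check) all(eval_tidy(check, data))))
}

run_all_checks(new_colors, 0 < yellow, violet < yellow)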
Error in run_all_checks(new_colors, 0 < yellow, violet < yellow): object 'yellow' not foun
This code fails because the call to list(...) tries to evaluate the expressions
in ... when adding them to the list. What we need to use instead is enquos,
which does what enquo does but on ...:
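run_all_checks <- function(data, ...) {
  checks <- enquos(...)            # capture the tests without evaluating them
  checks %>%
    map(eval_tidy, data) %>%
    map_lgl(all) %>%
    all()
}

run_all_checks(new_colors, 0 < yellow, violet < yellow)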
[1] FALSE
6.6 What if I truly desire to venture into the depths?
That failed because we're not enquoting the test. Let's modify the code to
enquote the test and then pass in the expression:
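One of the intermediate attempts, in outline (the names here are ours):

new_column <- function(data, x_test) {
  data %>% transmute(result = x_test)   # passes the promise, not its expression
}
new_column(new_colors, yellow > 0)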
Damn—we thought this one had a chance. The problem is that when we say
result = x_test, what actually gets passed into transmute is a promise
containing an expression. Somehow, we need to prevent R from doing that
promise wrapping.
This brings us to enquo’s partner !!, which we can use to splice the expression
in a quosure into a function call. !! is pronounced “bang bang” or “oh hell”,
depending on how your day is going. It only works in contexts like function
calls where R is automatically quoting things for us, but if we use it then, it
does exactly what we want:
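A sketch (the function name is ours):

check_with_splice <- function(data, test) {
  q_test <- enquo(test)
  data %>%
    transmute(result = !!q_test) %>%   # splice the expression into transmute
    pull(result) %>%
    all()
}
check_with_splice(new_colors, 0 < yellow)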
[1] TRUE
We are almost in a state of grace. The two rules we must follow are: use enquo
or enquos to capture unevaluated arguments passed into our functions, and use
!! to splice them back into expressions when we want to evaluate them. With
those rules in hand, our check passes:
[1] TRUE
And just to make sure that it fails when it’s supposed to:
[1] FALSE
The dollar-sign operator $ is a quoting function: it takes the name that
follows it literally. If we store a column's name in a variable the_name (a
hypothetical example) and ask for color_data$the_name, R looks for a column
actually called the_name and quietly gives us:

NULL
The square bracket operators [ and [[, on the other hand, are evaluating
functions, so we can give them a variable containing a column name and get
either a single-column tibble:
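the_name <- "red"     # a column name stored in a variable (our example)
color_data[the_name]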
# A tibble: 2 x 1
red
<dbl>
1 1
2 2
or a naked vector:
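color_data[[the_name]]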
[1] 1 2
That said, being able to pass column names to functions without wrapping
them in strings is very useful, and many powerful tools (such as using formu-
las in models) rely on taking unevaluated expressions apart and rearranging
them. If you would like to know more, or check that what you now think you
understand is accurate, this tutorial is a good next step.
• Explain what the formula operator ~ was created for and what other uses it
has.
• Describe and use ., .x, .y, ..1, ..2, and other convenience parameters.
• Define copy-on-modify and explain its use in R.
7
Intellectual Debt

We have accumulated some intellectual debt in the previous lessons, and we
should pay some of it off now. One example: the here package builds paths
relative to the project's root, so that code keeps working no matter which
directory it is run from:

print(glue('here("book.bib"): {here("book.bib")}'))
here("book.bib"): /Users/gvwilson/tidynomicon/book.bib
Let’s work through a small example. Suppose we’ve read a CSV file and wound
up with this table:
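For the sake of the example, here is that table as a tribble:

raw <- tribble(
  ~person,   ~flavor,      ~ranking,
  "Lhawang", "strawberry", 1.7,
  "Lhawang", "chocolate",  2.5,
  "Lhawang", "mustard",    0.2,
  "Khadee",  "strawberry", 2.1,
  "Khadee",  "chocolate",  2.4,
  "Khadee",  "vanilla",    3.9,
  "Haddad",  "strawberry", 1.8,
  "Haddad",  "vanilla",    2.1
)
raw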
# A tibble: 8 x 3
person flavor ranking
<chr> <chr> <dbl>
1 Lhawang strawberry 1.7
2 Lhawang chocolate 2.5
3 Lhawang mustard 0.2
4 Khadee strawberry 2.1
5 Khadee chocolate 2.4
6 Khadee vanilla 3.9
7 Haddad strawberry 1.8
8 Haddad vanilla 2.1
Let’s aggregate using flavor values so that we can check our factor-based ag-
gregating later:
raw %>%
group_by(flavor) %>%
summarize(number = n(), average = mean(ranking))
# A tibble: 4 x 3
flavor number average
<chr> <int> <dbl>
1 chocolate 2 2.45
2 mustard 1 0.2
3 strawberry 3 1.87
4 vanilla 2 3
It probably doesn’t make sense to turn person into factors, since names are
actually character strings, but flavor is a good candidate:
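raw <- raw %>%
  mutate(flavor = as.factor(flavor))
raw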
# A tibble: 8 x 3
person flavor ranking
<chr> <fct> <dbl>
1 Lhawang strawberry 1.7
2 Lhawang chocolate 2.5
3 Lhawang mustard 0.2
4 Khadee strawberry 2.1
5 Khadee chocolate 2.4
6 Khadee vanilla 3.9
7 Haddad strawberry 1.8
8 Haddad vanilla 2.1
raw %>%
group_by(flavor) %>%
summarize(number = n(), average = mean(ranking))
# A tibble: 4 x 3
flavor number average
<fct> <int> <dbl>
1 chocolate 2 2.45
2 mustard 1 0.2
3 strawberry 3 1.87
4 vanilla 2 3
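We can also impose an order on a factor's levels. The exact call has to be
inferred from the output below; fct_relevel from forcats produces it:

raw <- raw %>%
  mutate(flavor = fct_relevel(flavor, "chocolate", "strawberry", "vanilla", "mustard"))
raw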
# A tibble: 8 x 3
person flavor ranking
<chr> <fct> <dbl>
1 Lhawang strawberry 1.7
2 Lhawang chocolate 2.5
3 Lhawang mustard 0.2
4 Khadee strawberry 2.1
5 Khadee chocolate 2.4
6 Khadee vanilla 3.9
7 Haddad strawberry 1.8
8 Haddad vanilla 2.1
This changes the order in which they are displayed after grouping:
raw %>%
group_by(flavor) %>%
summarize(number = n(), average = mean(ranking))
# A tibble: 4 x 3
flavor number average
<fct> <int> <dbl>
1 chocolate 2 2.45
2 strawberry 3 1.87
3 vanilla 2 3
4 mustard 1 0.2
raw %>%
group_by(flavor) %>%
summarize(number = n(), average = mean(ranking)) %>%
ggplot(mapping = aes(x = flavor, y = average)) +
geom_col()
[Figure: bar chart showing the average ranking for each flavor]
To learn more about how factors work and how to use them when analyzing
categorical data, please see this paper by McNamara and Horton.
When we put a function in a pipeline using %>%, that operator calls the func-
tion with the incoming data as the first argument, so data %>% func(arg) is
the same as func(data, arg). This is fine when we want the incoming data
to be the first argument, but what if we want it to be second? Or third?
One possibility is to save the result so far in a temporary variable and then
start a second pipe:
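The first pipe builds the logical vector; a sketch consistent with the table
used later in this chapter (the data values are assumptions):

data <- tribble(
  ~left, ~right,
      1,     NA,
      2,     20
)

empties <- data %>%
  pmap_lgl(function(...) {
    args <- list(...)
    any(is.na(args))   # TRUE if anything in this row is missing
  })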
data %>%
transmute(id = row_number()) %>%
filter(empties) %>%
pull(id)
[1] 1
This builds a logical vector empties with as many entries as data has rows,
then filters data according to which of the entries in the vector are TRUE.
A better practice is to use the parameter name ., which means “the incoming
data”. In some functions (e.g., a two-argument function being used in map)
we can also use .x and .y for the first and second arguments, and for more
arguments, we can use ..1, ..2, and so on (with two dots at the front):
data %>%
pmap_lgl(function(...) {
args <- list(...)
any(is.na(args))
}) %>%
tibble(empty = .) %>%
mutate(id = row_number()) %>%
filter(empty) %>%
pull(id)
[1] 1
In this model, we create the logical vector, then turn it into a tibble with one
column called empty (which is what empty = . does in tibble’s constructor).
After that, we add another column with row numbers, filter, and pull out the
row numbers.
And while we’re here: row_number doesn’t do what its name suggests. We’re
better off using rowid_to_column:
data %>%
rowid_to_column()
# A tibble: 2 x 3
rowid left right
<int> <dbl> <dbl>
1 1 1 NA
2 2 2 20
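The output below comes from a small helper that reads a CSV file and selects
a single column by name; a sketch (the name col_from_file and the file path
are ours) is:

col_from_file <- function(filename, colname) {
  readr::read_csv(filename) %>%
    dplyr::select(one_of(colname))   # colname is a quoted string
}

col_from_file("data/person.csv", "family_name")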
# A tibble: 5 x 1
family_name
<chr>
1 Dyer
2 Pabodie
3 Lake
4 Roerich
5 Danforth
Note that the column name must be passed as a quoted string; Chapter 6 will
show us how to pass unquoted column names.
We might occasionally want to allow the user to specify what values in the
file are to be considered NAs. This small addition allows us to do that, while
keeping the empty string and the string "NA" as defaults:
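Continuing the sketch, and treating "Dyer" as an NA to reproduce the output
below:

col_from_file <- function(filename, colname, na = c("", "NA")) {
  readr::read_csv(filename, na = na) %>%
    dplyr::select(one_of(colname))
}

col_from_file("data/person.csv", "family_name", na = c("", "NA", "Dyer"))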
# A tibble: 5 x 1
family_name
<chr>
1 <NA>
2 Pabodie
3 Lake
4 Roerich
5 Danforth
We can also allow the user to specify any number of columns by capturing
“extra” parameters in ... and passing that value directly to dplyr::select:
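A sketch (the name cols_from_file is ours):

cols_from_file <- function(filename, ...) {
  readr::read_csv(filename) %>%
    dplyr::select(...)
}

cols_from_file("data/person.csv", personal_name, family_name)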
# A tibble: 5 x 2
personal_name family_name
<chr> <chr>
1 William Dyer
2 Frank Pabodie
3 Anderson Lake
4 Valentina Roerich
5 Frank Danforth
Now that we can create functions, we can use the tools in the purrr library to
wield them. purrr::map applies a function to each value in a vector in turn
and returns a list:
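(is_long_name is not defined above; a definition consistent with the results
below is:)

is_long_name <- function(name) {
  stringr::str_length(name) > 4
}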
purrr::map(person$family_name, is_long_name)
[[1]]
[1] FALSE
[[2]]
[1] TRUE
[[3]]
[1] FALSE
[[4]]
[1] TRUE
[[5]]
[1] TRUE
We can also use purrr::map_lgl so that the result of the call is a logical
vector rather than a list. Similarly-named functions will give us numbers,
character strings, and so on:
purrr::map_lgl(person$family_name,
function(name) stringr::str_length(name) > 4)
Little functions like this are so common that purrr allows us to write them
as formulas using the ~ operator, with .x as a shorthand for the value from
the vector being processed:
purrr::map_chr(person$family_name, ~ stringr::str_to_upper(.x))
purrr::map2_chr(person$personal_name,
person$family_name,
~ stringr::str_c(.y, .x, sep = '_'))
If we need to collapse the result to a single value (e.g., to use in if) we have
purrr::some and purrr::every:
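purrr::every(person$family_name, ~ stringr::str_length(.x) > 4)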
[1] FALSE
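The "WFAVF" below is the initials of the personal names run together, which
points at purrr::reduce; a call that produces it is:

purrr::reduce(person$personal_name,
              ~ stringr::str_c(.x, stringr::str_sub(.y, 1, 1)),
              .init = "")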
[1] "WFAVF"
Modify this so that the initial empty string isn’t in the final result.
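Our next topic is copy-on-modify. The two tibbles printed below were
presumably set up like this:

first <- tribble(
  ~left, ~right,
    101,    202,
    303,    404
)
second <- first        # for the moment, both names share the same data
first$left[1] <- 999   # modifying first leaves second untouched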
first
# A tibble: 2 x 2
left right
<dbl> <dbl>
1 999 202
2 303 404
second
# A tibble: 2 x 2
left right
<dbl> <dbl>
1 101 202
2 303 404
In this case, the entire left column of first has been replaced: tibbles (and
data frames) are stored as lists of vectors, so changing any value in a column
triggers construction of a new column vector.
We can watch this happen using the tracemem function, which shows us where
objects live in the computer’s memory:
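tracemem(first)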
[1] "<0x7fdcb1800d08>"
untracemem(first)
This rather cryptic output tells us the address of the tibble, then notifies us
of changes to the tibble and its contents. We can accomplish something a little
more readable using pryr::address (i.e., the address function from the pryr
package):
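A sketch (the value being assigned is arbitrary):

temp <- first$left
pryr::address(temp)    # the column's address before modification
first$left[2] <- 888
temp <- first$left
pryr::address(temp)    # a different address: the column was copied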
(We need to use the alias temp because address(first$left) doesn’t work:
the argument to address needs to be a variable name.)
R’s copy-on-modify semantics is particularly important when writing func-
tions. If we modify an argument inside a function, that modification isn’t
visible to the caller, so even functions that appear to modify structures usu-
ally don’t. (“Usually”, because there are exceptions, but we must stray off the
path to find them.)
The function order generates indices to pull values into place rather than
push them, i.e., order(x)[i] is the index in x of the element that belongs at
location i. For example:
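bases <- c("g", "c", "t", "a")
order(bases)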
[1] 4 2 1 3
shows that the value at location 4 (the "a") belongs in the first spot of the
vector; it does not mean that the value in the first location (the "g") belongs
in location 4. This convention means that something[order(something)]
does the right thing:
bases[order(bases)]
[1] "a" "c" "g" "t"
The function one_of is a handy way to specify several values for matching
without complicated Boolean conditionals. For example, gather(data, key
= "year", value = "cases", one_of(c("1999", "2000"))) collects data
for the years 1999 and 2000.
The difference is that & always returns a vector result after doing element-by-
element conjunction, while && returns a scalar result. This means that & is
almost always what we want to use when working with data.
There is a function called n. It’s not the same thing as a column called n. I
only made this mistake a dozen times.
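The two results below are consistent with summarizing a table that has a
column called n (the column's values are assumptions):

counts <- tribble(
  ~n,
  10,
  20
)
counts %>% summarize(total = sum(n))   # the column n: total is 30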
# A tibble: 1 x 1
total
<dbl>
1 30
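counts %>% summarize(total = n())      # the function n(): counts the rows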
# A tibble: 1 x 1
total
<int>
1 2
8
Testing and Error Handling

Novices write code and pray that it works. Experienced programmers know
that prayer alone is not enough, and take steps to protect what little sanity
they have left. This chapter looks at the tools R gives us for doing this,
starting with a look at how Python handles errors for comparison:
values = [-1, 0, 1]
for i in range(4):
try:
reciprocal = 1/values[i]
print("index {} value {} reciprocal {}".format(i, values[i], reciprocal))
except ZeroDivisionError:
print("index {} value {} ZeroDivisionError".format(i, values[i]))
except Exception as e:
print("index{} some other Exception: {}".format(i, e))
message("This is a message.")
This is a message.
warning("This is a warning.\n")
stop("This is an error.")
Note that we have to supply our own line ending for warnings but not for
the other two cases. Note also that there are very few situations in which
a warning is appropriate: if something has truly gone wrong then we should
stop, but otherwise we should not distract users from more pressing concerns.
The bluntest of instruments for handling errors is to ignore them. If a state-
ment is wrapped in the function try then errors that occur in it are still
reported, but execution continues. Compare this:
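attemptWithoutTry <- function(left, right) {
  temp <- left + right   # fails outright when right is not a number
  "result"
}
result <- attemptWithoutTry(1, "two")

which stops with a non-numeric argument error,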
with this:
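attemptUsingTry <- function(left, right) {   # the name is ours
  temp <- try(left + right)   # the error is reported, but execution continues
  "result"
}
result <- attemptUsingTry(1, "two")
cat("result is", result, "\n")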
result is result
Do not do this, lest you one day find yourself lost in a silent hellscape.
Should you more sensibly wish to handle conditions rather than ignore them,
you may invoke tryCatch. We begin by raising an error explicitly:
tryCatch(
stop("our message"),
error = function(cnd) print(glue("error object is {cnd}"))
)
tryCatch(
attemptWithoutTry(1, "two"),
error = function(cnd) print(glue("error object is {cnd}"))
)
library(testthat)

Attaching package: 'testthat'

The following object is masked from 'package:dplyr':

    matches

The following object is masked from 'package:purrr':

    is_null

The following object is masked from 'package:tidyr':

    matches
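Our first test is as simple as can be:

test_that("Zero equals itself", {
  expect_equal(0, 0)
})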
Good: we can draw some comfort from the fact that Those Beyond have not
yet changed the fundamental rules of arithmetic. But what are the curly braces
around expect_equal for? The answer is that they create a code block for
test_that to run. We can run expect_equal on its own:
expect_equal(0, 1)
Error: 0 not equal to 1.
1/1 mismatches
[1] 0 - 1 == -1
but that doesn’t produce a summary of how many tests passed or failed. Pass-
ing a block of code to test_that also allows us to check several things in one
test:
library(testthat)
context("Demonstrating the testing library")
The first line loads the testthat package, which gives us our tools. The call to
context on the second line gives this set of tests a name for reporting purposes.
After that, we add as many calls to test_that as we want, each with a name
and a block of code. We can now run this file from within RStudio:
test_dir("tests/testthat")
v | OK F W S | Context
test_determine_skip_rows_a.R:14: failure: The right row is found when there are header row
`result` not equal to 3.
Lengths differ: 0 is not 1
test_determine_skip_rows_a.R:19: failure: The right row is found when there are no header
`result` not equal to 0.
Lengths differ: 0 is not 1
1/1 mismatches
[1] 0 - 0.01 == -0.01
--------------------------------------------------------------------------------
== Results =====================================================================
Duration: 0.3 s
OK: 14
Failed: 9
Warnings: 1
Skipped: 0
Care is needed when interpreting these results. There are four test_that calls,
but eight actual checks, and the number of successes and failures is counted
by recording the latter, not the former.
What then is the purpose of test_that? Why not just use expect_equal and
its kin, such as expect_true, expect_false, expect_length, and so on? The
answer is that it allows us to do one operation and then check several things
afterward. Let’s create another file called tests/testthat/test_tibble.R:
library(tidyverse)
library(testthat)
context("Testing properties of tibbles")
(We don’t actually have to call our test files test_something.R, but test_dir
and the rest of R’s testing infrastructure expect us to. Similarly, we don’t have
to put them in a tests directory, but gibbering incoherence will ensue if we
do not.) Now let’s run all of our tests:
test_dir("tests/testthat")
v | OK F W S | Context
test_determine_skip_rows_a.R:14: failure: The right row is found when there are header row
`result` not equal to 3.
Lengths differ: 0 is not 1
test_determine_skip_rows_a.R:19: failure: The right row is found when there are no header
`result` not equal to 0.
Lengths differ: 0 is not 1
== Results =====================================================================
Duration: 0.2 s
OK: 14
Failed: 9
Warnings: 0
Skipped: 0
Ah. It turns out that filter is applied to filenames after the leading test_
and the trailing .R have been removed. Let’s try again:
v | OK F W S | Context
== Results =====================================================================
OK: 1
Failed: 0
Warnings: 0
Skipped: 0
Woot!
That’s better, and it illustrates our earlier point about the importance of
following conventions.
There is a lot going on here, particularly if you are new to R (as I am at the
time of writing) and needed help to figure out that pmap is the function this
problem wants. But now that we have it, we can do this:
source("scripts/find_empty_01.R")
find_empty_rows("a,b\n1,2\n,\n5,6")
The source function reads R code from the given source. Using this inside
an R Markdown file is usually a bad idea, since the generated HTML or PDF
won’t show readers what code we loaded and ran. On the other hand, if we are
creating command-line tools for use on clusters or in other batch processing
modes, and are careful to display the code in a nearby block, the stain on our
soul is excusable.
The more interesting part of this example is the call to find_empty_rows.
Instead of giving it the name of a file, we have given it the text of the CSV we
want parsed. This string is passed to read_csv, which (according to documen-
tation that only took us 15 minutes to realize we had already seen) interprets
its first argument as a filename, or as the actual text to be parsed if it
contains a newline character. This allows us to put the test fixture right
there in the code as a literal string, which experience shows is easier to
understand and maintain than having test data in separate files.
Wat?
Buried in the middle of the pipe shown above is the expression:
tibble(empty = .)
Quoting from Advanced R, "The function arguments look a little quirky
but allow you to refer to . for one argument functions, .x and .y for two
argument functions, and ..1, ..2, ..3, etc, for functions with an arbitrary
number of arguments." In other words, . in tidyverse functions usually
means “whatever is on the left side of the %>% operator”, i.e., whatever
would normally be passed as the function’s first argument. Without this,
we have no easy way to give the sole column of our newly-constructed
tibble a name.
library(tidyverse)
library(testthat)
context("Finding empty rows")
source("../../scripts/find_empty_02.R")
And here’s what happens when we run this file with test_dir:
test_dir("tests/testthat", "find_empty_a")
v | OK F W S | Context
== Results =====================================================================
OK: 1
Failed: 2
Warnings: 0
Skipped: 0
This is perplexing: we expected that if there were no empty rows, our function
would return NULL. Let’s look more closely:
find_empty_rows("a\n1")
integer(0)
Ah: our function is returning an integer vector of zero length rather than NULL.
Let’s have a closer look at the properties of this strange beast:
print(glue("any(logical(0))? {any(logical(0))}"))
any(logical(0))? FALSE
print(glue("all(logical(0))? {all(logical(0))}"))
all(logical(0))? TRUE
library(tidyverse)
library(testthat)
context("Finding empty rows")
source("../../scripts/find_empty_02.R")
And here’s what happens when we run this file with test_dir:
test_dir("tests/testthat", "find_empty_b")
v | OK F W S | Context
== Results =====================================================================
OK: 3
Failed: 0
Warnings: 0
Skipped: 0
You rock!
,,GLOBAL DATABASES,,,,,,,,,,,,,
,,[data.unicef.org],,,,,,,,,,,,,
,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,
Indicator:,Delivered in health facilities,,,,,,,,,,,,,,
Unit:,Percentage,,,,,,,,,,,,,,
,,,,Mother's age,,,,,,,,,,,
iso3,Country/areas,year,Total ,age 15-17,age 18-19,age less than 20,age more than 20,age 2
AFG,Afghanistan,2010, 33 , 25 , 29 , 28 , 31 , 31 , 31 ,MICS,2010,,,,
ALB,Albania,2005, 98 , 100 , 96 , 97 , 98 , 99 , 92 ,MICS,2005,,,,
ALB,Albania,2008, 98 , 94 , 98 , 97 , 98 , 98 , 99 ,DHS,2008,,,,
...
ZWE,Zimbabwe,2005, 66 , 64 , 64 , 64 , 67 , 69 , 53 ,DHS,2005,,,,
ZWE,Zimbabwe,2009, 58 , 49 , 59 , 55 , 59 , 60 , 52 ,MICS,2009,,,,
ZWE,Zimbabwe,2010, 64 , 56 , 66 , 62 , 64 , 65 , 60 ,DHS,2010,,,,
ZWE,Zimbabwe,2014, 80 , 82 , 82 , 82 , 79 , 80 , 77 ,MICS,2014,,,,
,,,,,,,,,,,,,,,
Definition:,Percentage of births delivered in a health facility.,,,,,,,,,,,,,,
,"The indicator refers to women who had a live birth in a recent time period, generally tw
,,,,,,,,,,,,,,,
Note:,"Database include reanalyzed data from DHS and MICS, using a reference period of two
,Includes surveys which microdata were available as of April 2016. ,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,
Source:,"UNICEF global databases 2016 based on DHS, MICS .",,,,,,,,,,,,,,
,,,,,,,,,,,,,,,
Contact us:,data@unicef.org,,,,,,,,,,,,,,
There are two other files in this collection called c_sections.csv and
skilled_attendant_at_birth.csv, which are the number of Caesarean sec-
tions and the number of births where a midwife or other trained practitioner
was present. All three datasets have been exported from the same Excel spread-
sheet; rather than writing a separate script for each, we should create a tool
that will handle them all.
At first glance, the problems we need to solve to do this are:
1. Each file may have a different number of header rows (by inspection,
two of the files have 7 and one has 8), so we should infer this number
from the file.
2. Each file may contain a different number of records, so our tool
should select rows by content rather than by absolute row number.
3. The files appear to have the same column names (for which we give
thanks), but we should check this in case someone tries to use our
function with a dataset that doesn’t.
These three requirements will make our program significantly more compli-
cated, so we should tackle each with its own testable function.
The data we care about comes after the row with iso3, Country/areas, and
other column headers, so the simplest way to figure out how many rows to
skip is to read the data, look for this row, and discard everything above it. The
simplest way to do that is to read the file once to find the number of header
rows, then read it again, discarding that number of rows. It’s inefficient, but
for a dataset this size, simplicity beats performance.
Here’s our first try:
read_csv("data/at_health_facilities.csv") %>%
select(check = 1) %>%
mutate(id = row_number()) %>%
filter(check == "iso3") %>%
select(id) %>%
first()
Warning: Missing column names filled in: 'X1' [1], 'X2' [2], 'X4' [4], 'X5' [5],
'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12],
'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16]
X14 = col_logical(),
X15 = col_logical(),
X16 = col_logical()
)
[1] 7
Ignoring the messages about missing column names, this tells us that iso3
appears in row 7 of our data, which is almost true: it’s actually in row 8,
because read_csv has interpreted the first row of the raw CSV data as a
header. On the bright side, that means we can immediately use this value as
the skip parameter to the next read_csv call.
How do we test this code? Easy: we turn it into a function, tell that function to
stop if it can’t find iso3 in the data, and write some unit tests. The function
is:
library(tidyverse)
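# Reconstructed from the pipeline above: wrap it up and return the row id
# of the marker row.
determine_skip_rows <- function(src) {
  read_csv(src) %>%
    select(check = 1) %>%
    mutate(id = row_number()) %>%
    filter(check == "iso3") %>%
    select(id) %>%
    first()
}

and the test file that exercises it is: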
library(tidyverse)
library(testthat)
context("Skipping rows correctly")
source("../../scripts/determine_skip_rows_a.R")
test_that("The right row is found when there are header rows and blank lines", {
result <- determine_skip_rows("a1,a2\nb1,b2\n,\nis03,stuff\nc1,c2\n,\n")
expect_equal(result, 3)
})
test_that("The right row is found when there are no header rows to discard", {
result <- determine_skip_rows("iso3,stuff\nc1,c2\n")
expect_equal(result, 0)
})
test_dir("tests/testthat", "determine_skip_rows_a")
v | OK F W S | Context
test_determine_skip_rows_a.R:14: failure: The right row is found when there are header row
`result` not equal to 3.
Lengths differ: 0 is not 1
test_determine_skip_rows_a.R:19: failure: The right row is found when there are no header
`result` not equal to 0.
Lengths differ: 0 is not 1
== Results =====================================================================
OK: 0
Failed: 5
Warnings: 0
Skipped: 0
That’s right: all five fail. The first problem is that we have written is03 (with
a digit 0 instead of a letter o) in the first two tests. If we fix that and re-run
the tests, they pass; what about the other three?
library(tidyverse)
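# A reconstructed sketch of the improved version: return 0 when the marker
# row has already become the header, and stop when it cannot be found at all.
determine_skip_rows <- function(src) {
  data <- read_csv(src)
  if (names(data)[1] == "iso3") {
    return(0)
  }
  result <- data %>%
    select(check = 1) %>%
    mutate(id = row_number()) %>%
    filter(check == "iso3") %>%
    select(id) %>%
    first()
  stopifnot(length(result) == 1)
  result
}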
test_dir("tests/testthat", "determine_skip_rows_b")
v | OK F W S | Context
== Results =====================================================================
OK: 5
Failed: 0
Warnings: 0
Skipped: 0
Our tests still aren’t checking anything statistical, but without trustworthy
data, our statistics will be meaningless. Tests like these allow our future selves
to focus on making new mistakes instead of repeating old ones.
9
Advanced Topics

We have covered a lot of material in the previous chapters, but have only
scratched the surface of what R can do for us. To wrap up, we will look
briefly at a few of its more advanced capabilities.
print("Hello R")
Hello R
but how can those chunks interact with our R and vice versa? The answer is a
package called reticulate that provides two-way communication between Python
and R. To use it, we must first make sure R can find Python:
Sys.which("python")
python
"/anaconda3/bin/python"
If you want to run the Pythonic bits of code we present as well as the R, run
install.packages("reticulate") and then set the RETICULATE_PYTHON
environment variable to point at the version of Python you want to use
before you launch RStudio. This is necessary because you may have a
system-installed version somewhere like /usr/bin/python and a conda-
managed version in ~/anaconda3/bin/python.
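For example, one way to set the variable is a line in the .Renviron file in your home directory (the path shown is illustrative):

# in ~/.Renviron -- point reticulate at a specific interpreter
RETICULATE_PYTHON=/anaconda3/bin/python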
import pandas
data = pandas.read_csv('results/infant_hiv.csv')
print(data.head())
All of our Python variables are available in our R session as part of the py
object, so py$data is our data frame inside a chunk of R code:
library(reticulate)
head(py$data)
Sharing works in the other direction as well. Suppose we define a character vector of element names in R, e.g. elements <- c("hydrogen", "helium", "lithium") (a hypothetical definition, chosen to match the output below). Hydrogen is in position 1 in R:

elements[1]

[1] "hydrogen"

but in position 0 in Python, which indexes from zero:

print(r.elements[0])

hydrogen
Note our use of the object r in our Python code: just as py$whatever gives us
access to Python objects in R, r.whatever gives us access to R objects in
Python.
We don’t have to run Python code, store values in a variable, and then access
that variable from R: we can call the Python directly (or vice versa). For
example, we can use Python’s random number generator in R as follows:
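One way to do this uses reticulate's import function; here is a sketch (random.gauss is an assumption on our part, chosen because the value below lies outside [0, 1), and the name pyrand is ours):

pyrand <- import("random")   # load Python's random module into R
pyrand$gauss(0, 1)           # call Python's random.gauss from R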
[1] 1.114449
We can also bring whole Python scripts into R. Suppose the file countries.py contains:

#!/usr/bin/env python
import pandas as pd

def get_countries(filename):
    data = pd.read_csv(filename)
    return data.country.unique()

Loading it with source_python runs the script:

source_python('countries.py')
There is no output because all the script did was define a function. By default,
that function and all other top-level variables defined in the script are now
available in R:
get_countries('results/infant_hiv.csv')
[1] "AFG" "AGO" "AIA" "ALB" "ARE" "ARG" "ARM" "ATG" "AUS" "AUT" "AZE" "BDI"
[13] "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ" "BOL" "BRA"
[25] "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CIV" "CMR" "COD"
[37] "COG" "COK" "COL" "COM" "CPV" "CRI" "CUB" "CYP" "CZE" "DEU" "DJI" "DMA"
[49] "DNK" "DOM" "DZA" "ECU" "EGY" "ERI" "ESP" "EST" "ETH" "FIN" "FJI" "FRA"
[61] "FSM" "GAB" "GBR" "GEO" "GHA" "GIN" "GMB" "GNB" "GNQ" "GRC" "GRD" "GTM"
[73] "GUY" "HND" "HRV" "HTI" "HUN" "IDN" "IND" "IRL" "IRN" "IRQ" "ISL" "ISR"
[85] "ITA" "JAM" "JOR" "JPN" "KAZ" "KEN" "KGZ" "KHM" "KIR" "KNA" "KOR" "LAO"
[97] "LBN" "LBR" "LBY" "LCA" "LKA" "LSO" "LTU" "LUX" "LVA" "MAR" "MDA" "MDG"
[109] "MDV" "MEX" "MHL" "MKD" "MLI" "MLT" "MMR" "MNE" "MNG" "MOZ" "MRT" "MUS"
[121] "MWI" "MYS" "NAM" "NER" "NGA" "NIC" "NIU" "NLD" "NOR" "NPL" "NRU" "NZL"
[133] "OMN" "PAK" "PAN" "PER" "PHL" "PLW" "PNG" "POL" "PRK" "PRT" "PRY" "PSE"
[145] "ROU" "RUS" "RWA" "SAU" "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SOM" "SRB"
[157] "SSD" "STP" "SUR" "SVK" "SVN" "SWE" "SWZ" "SYC" "SYR" "TCD" "TGO" "THA"
[169] "TJK" "TKM" "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "TZA" "UGA" "UKR" "UNK"
[181] "URY" "USA" "UZB" "VCT" "VEN" "VNM" "VUT" "WSM" "YEM" "ZAF" "ZMB" "ZWE"
There is one small pothole in this. When the script is run, the special Python
variable __name__ is set to '__main__', i.e., the script thinks it is being
called from the command line. If it includes a conditional block to handle
command-line arguments like this:
if __name__ == '__main__':
    input_file, output_files = sys.argv[1], sys.argv[2:]
    main(input_file, output_files)
then that block will be executed, but will fail because sys.argv won’t include
anything.
Let's turn now to the way R represents objects. Suppose we create a 3-by-3 matrix holding the numbers 1 through 9, e.g. m <- matrix(1:9, nrow = 3) (a definition consistent with the output below). Behind the scenes, R continues to store our nine values as a vector. However, it adds an attribute called class to the vector to identify it as a matrix:
class(m)
[1] "matrix"
and another attribute called dim to store its dimensions as a 2-element vector:
dim(m)
[1] 3 3
An object’s attributes are simply a set of name-value pairs. We can find out
what attributes are present using attributes and show or set individual
attributes using attr:
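For example (a sketch consistent with the output shown):

attr(m, "prospects") <- "dismal"   # attach a new attribute to the matrix
attributes(m)                      # list all of m's attributes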
$dim
[1] 3 3
$prospects
[1] "dismal"
Attributes are also used to implement tibbles. Consider:

t <- tribble(
  ~a, ~b,
  1, 2,
  3, 4)
typeof(t)
[1] "list"
attributes(t)
$names
[1] "a" "b"
188 9 Advanced Topics
$row.names
[1] 1 2
$class
[1] "tbl_df" "tbl" "data.frame"
This tells us that a tibble is stored as a list (the first line of output), and that
it has an attribute called names that stores the names of its columns, another
called row.names that stores the names of its rows (a feature we should ignore),
and three classes. These classes tell R what functions to search for when we
are (for example) asking for the length of a tibble (which is the number of
columns it contains):
length(t)
[1] 2
To show how classes and generic functions work together, let’s customize the
way that 2D coordinates are converted to strings. First, we create two coordi-
nate vectors:
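A sketch of what that might look like; the names first and second, and the method toString.two_d, are consistent with the dispatch trace shown further down:

first <- c(0.5, 0.7)                  # a 2D point stored as a two-element vector
class(first) <- "two_d"               # tag it with our new class
second <- c(1.5, 1.7)
class(second) <- "two_d"

toString.two_d <- function(x, ...) {  # the method S3 will dispatch to
  paste0("<", x[1], ", ", x[2], ">")
}

toString(first)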
toString(second)
S3’s protocol is simple: given a function F and an object of class C, S3 looks for
a function named F.C. If it doesn’t find one, it looks at the object’s next class
(assuming it has more than one); once its user-assigned classes are exhausted,
it uses whatever function the system has defined for its base type (in this case,
character vector). We can trace this process by importing the sloop package
and calling s3_dispatch:
library(sloop)
s3_dispatch(toString(first))
=> toString.two_d
* toString.default
s3_dispatch(toString(c(7.1, 7.2)))
toString.double
toString.numeric
=> toString.default
The constructor’s first argument should always be the base object (in our case,
the two-element vector). It should also have one argument for each attribute
the object is to have, if any. Unlike matrices, our 2D points don't have any
extra attributes, so our constructor needs no extra arguments. Crucially, the
constructor checks the type of its arguments to ensure that the object has at
least some chance of being valid.
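A sketch of such a constructor:

new_two_d <- function(coordinates) {
  stopifnot(is.numeric(coordinates))  # the base object must be numeric
  class(coordinates) <- "two_d"
  coordinates
}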
Validators are only needed when checks on data correctness and consistency
are expensive. For example, if we were to define a class to represent sorted
vectors, checking that each element is no less than its predecessor could take
a long time for very long vectors. To illustrate this, we will check that we
have exactly two coordinates; in real code, we would probably include this
(inexpensive) check in the constructor.
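A sketch of the validator:

validate_two_d <- function(coordinates) {
  stopifnot(length(coordinates) == 2)       # exactly two coordinates
  stopifnot(class(coordinates) == "two_d")  # and the right class
}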
The third and final function in our trio provides a user-friendly way to con-
struct objects of our new class. It should call the constructor and the validator
(if one exists), but should also provide a richer set of defaults, better error mes-
sages, and so on. To illustrate this, we shall allow the user to provide either
one argument (which must be a two-element vector) or two (which must each
be numeric):
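A sketch of the helper:

two_d <- function(...) {
  args <- list(...)
  if (length(args) == 1) {
    args <- args[[1]]      # a single two-element vector
  } else if (length(args) == 2) {
    args <- unlist(args)   # two separate numeric values
  } else {
    stop("Must provide either one two-element vector or two numbers")
  }
  result <- new_two_d(args)
  validate_two_d(result)
  result
}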
We said above that an object can have more than one class, and that S3
searches the classes in order when it wants to find a method to call. Methods
can also trigger invocation of other methods explicitly in order to supplement,
rather than replace, the behavior of other classes. To show how this works,
we shall look at that classic of object-oriented design: shapes. (The safe kind,
of course, not those whose non-Euclidean angles have placed such intolerable
stress on the minds of so many of our colleagues over the years.) We start by
defining a polygon class:
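A sketch of the constructor and the toString method for plain polygons, consistent with the string printed for pinkish below (map comes from purrr, which the tidyverse loads):

new_polygon <- function(coords, name) {
  result <- map(coords, two_d)   # each vertex becomes a two_d point
  class(result) <- "polygon"
  attr(result, "name") <- name
  result
}

toString.polygon <- function(poly, ...) {
  paste0(attr(poly, "name"), ": ",
         paste0(map(poly, toString), collapse = ", "))
}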
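The derived-class constructor new_colored_polygon used below is a sketch on our part: it builds a polygon, attaches a color attribute, and prepends a new class so that S3 searches colored_polygon methods first:

new_colored_polygon <- function(coords, name, color) {
  object <- new_polygon(coords, name)
  attr(object, "color") <- color
  class(object) <- c("colored_polygon", class(object))
  object
}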
pinkish <- new_colored_polygon(list(c(0, 0), c(1, 0), c(1, 1)), "triangle", "roseate")
class(pinkish)
toString(pinkish)
So far so good: since we have not defined a method to handle colored poly-
gons specifically, we get the behavior for a regular polygon. Let’s add another
method that supplements the behavior of the existing method:
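One way to write it uses NextMethod to invoke the method for the next class in the list; this sketch is consistent with the output below:

toString.colored_polygon <- function(poly, ...) {
  paste0(NextMethod(), "+ color = ", attr(poly, "color"))
}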
toString(pinkish)
[1] "triangle: <0, 0>, <1, 0>, <1, 1>+ color = roseate"
In practice, we will almost always place all of the methods associated with a
class in the same file as its constructor, validator, and helper.
The time has come to look at how R works with relational databases. Suppose we have a small SQLite database in data/example.db containing tables of people and measurements. Let's get the data about the people into a data frame:
library(DBI)
db <- dbConnect(RSQLite::SQLite(), here::here("data", "example.db"))
dbGetQuery(db, "select * from Person;")
That seems simple enough: the database connection is the first argument to
dbGetQuery, the query itself is the second, and the result is a data frame whose
column names correspond to the names of the fields in the database table.
What if we want to parameterize our query? Inside the text of the query, we
use :name as a placeholder for a query parameter, then pass a list of name-
value pairs to specify what we actually want:
dbGetQuery(db,
           "select * from Measurements where quantity = :desired",
           params = list(desired = "rad"))
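If a result set might be too large to handle comfortably in one go, we can page through it instead; a sketch using DBI's dbSendQuery and dbFetch:

results <- dbSendQuery(db, "select * from Measurements;")
while (!dbHasCompleted(results)) {
  chunk <- dbFetch(results, n = 10)  # fetch up to ten rows at a time
  print(chunk)
}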
dbClearResult(results)
Data scientists spend most of their time reading data, but someone has to
create it. RSQLite makes it easy to map a data frame directly to a database
table; to show how it works, we will create an in-memory database:
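A sketch of one way to do it: connect to ":memory:", then write a small tibble of appointments (the dates shown are chosen to reproduce the output below):

db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "appointments",
             tribble(~who,      ~when,
                     "Dyer",    lubridate::as_date("1927-03-01"),
                     "Peabody", lubridate::as_date("1927-05-05")))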
Let’s see what the combination of R and SQLite has done with our data and
the types thereof:
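Selecting everything back out (again a sketch):

dbGetQuery(db, "select * from appointments;")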
who when
1 Dyer -15647
2 Peabody -15582
What fresh hell is this? After considerable trial and error, we discover that
our dates have been returned to us as the number of days since January 1,
1970:
dbExecute(db,
          "insert into appointments values('Testing', :the_date);",
          params = list(the_date = lubridate::as_date('1971-01-01')))
[1] 1
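Querying for the new row (a sketch; the where clause is an assumption, chosen to match the single row shown):

dbGetQuery(db, "select * from appointments where who = 'Testing';")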
who when
1 Testing 365
There is no point screaming: those who might pity you cannot hear, and those
who can hear will definitely not pity you.
• If an object has multiple classes in its class attribute, R looks for a corre-
sponding method for each in turn.
• Every user-defined class C should have functions new_C (to create it),
validate_C (to validate its integrity), and C (to create and validate).
• Use the DBI package to work with relational databases.
• Use DBI::dbConnect(...) with database-specific parameters to connect to
a specific database.
• Use dbGetQuery(connection, "query") to send an SQL query string to a
database and get a data frame of results.
• Parameterize queries using :name as a placeholder in the query and
params = list(name = value) as a third parameter to dbGetQuery to
specify actual values.
• Use dbFetch in a while loop to page results.
• Use dbWriteTable to write an entire data frame to a table, and dbExecute
to execute a single insertion statement.
• Dates… why did it have to be dates?
A
License
• Remix—remix, transform, and build upon the material for any purpose,
even commercially.
The licensor cannot revoke these freedoms as long as you follow the license
terms.
Under the following terms:
Notices:
You do not have to comply with the license for elements of the material in the
public domain or where your use is permitted by an applicable exception or
limitation.
No warranties are given. The license may not give you all of the permissions
necessary for your intended use. For example, other rights such as publicity,
privacy, or moral rights may limit how you use the material.
B
Code of Conduct
B.3 Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may
be further defined and clarified by project maintainers.
B.4 Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by emailing the project team. All complaints will be reviewed and
investigated and will result in a response that is deemed necessary and ap-
propriate to the circumstances. The project team is obligated to maintain
confidentiality with regard to the reporter of an incident. Further details of
specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project’s leadership.
B.5 Attribution
This Code of Conduct is adapted from the Contributor Covenant version 1.4.
C
Citation
D
Contributing
Contributions of all kinds are welcome, from errata and minor improvements
to entirely new sections and chapters: please email us or submit an issue or
pull request to our GitHub repository. Everyone whose work is incorporated
will be acknowledged; please note that all contributors are required to abide
by our Code of Conduct (Appendix B).
The Jekyll template used in this tutorial can support multiple languages. All
English content should go in the _en directory. (Please note that we use Sim-
plified English rather than Traditional English, i.e., American rather than
British spelling and grammar.) We encourage translations; if you would like
to take this on, please email us.
If you wish to report errata or suggest improvements to wording, please include
the chapter name in the first line of the body of your report (e.g., Testing
Data Analysis).
E
Practice Problems
You need more practice with the functions in Chapter 3. To begin, open a
fresh file and load the tidyverse and the here package used to
construct paths:
library(tidyverse)
library(here)
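Next we read the person table (the file path is an assumption on our part):

person <- readr::read_csv(here::here("data", "person.csv"))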
person
# A tibble: 5 x 3
person_id personal_name family_name
<chr> <chr> <chr>
1 dyer      William       Dyer
2 pb        Frank         Pabodie
3 lake      Anderson      Lake
4 roe       Valentina     Roerich
5 danforth  Frank         Danforth
nrow(person)
[1] 5
ncol(person)
[1] 3
(These names don’t have a package prefix because they are built in.) Let’s
show that information in a slightly nicer way using glue to insert values into
a string and print to display the result:
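A sketch of what that might look like:

print(glue::glue("person has {nrow(person)} rows and {ncol(person)} columns"))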
If we want to display several values, we can use the function paste to combine
the elements of a vector. colnames gives us the names of a tibble’s columns,
and paste’s collapse argument tells the function to use a single space to
separate concatenated values:
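For example (a sketch):

print(glue::glue("person columns are {paste(colnames(person), collapse = ' ')}"))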
Time for some data manipulation. Let’s get everyone’s family and personal
names:
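The select call is the same one that reappears in the fully prefixed pipeline further down:

person %>%
  dplyr::select(family_name, personal_name)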
# A tibble: 5 x 2
family_name personal_name
<chr> <chr>
1 Dyer William
2 Pabodie Frank
3 Lake Anderson
4 Roerich Valentina
5 Danforth Frank
and then filter that list to keep only those that come in the first half of the
alphabet:
person %>%
  dplyr::select(family_name, personal_name) %>%
  dplyr::filter(family_name < "N")
# A tibble: 3 x 2
family_name personal_name
<chr> <chr>
1 Dyer William
2 Lake Anderson
3 Danforth Frank
It’s easy to add a column that records the lengths of family names:
person %>%
  dplyr::mutate(name_length = stringr::str_length(family_name))
# A tibble: 5 x 4
person_id personal_name family_name name_length
<chr> <chr> <chr> <int>
1 dyer William Dyer 4
2 pb Frank Pabodie 7
3 lake Anderson Lake 4
4 roe Valentina Roerich 7
5 danforth Frank Danforth 8
We can sort by those lengths as well, longest first:

person %>%
  dplyr::mutate(name_length = stringr::str_length(family_name)) %>%
  dplyr::arrange(dplyr::desc(name_length))
# A tibble: 5 x 4
person_id personal_name family_name name_length
<chr> <chr> <chr> <int>
1 danforth Frank Danforth 8
2 pb Frank Pabodie 7
3 roe Valentina Roerich 7
4 dyer William Dyer 4
5 lake Anderson Lake 4
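The measurements table can be read the same way (again, the file path is an assumption):

measurements <- readr::read_csv(here::here("data", "measurements.csv"))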
measurements
# A tibble: 21 x 4
visit_id person_id quantity reading
<dbl> <chr> <chr> <dbl>
1 619 dyer rad 9.82
2 619 dyer sal 0.13
3 622 dyer rad 7.8
4 622 dyer sal 0.09
5 734 pb rad 8.41
6 734 lake sal 0.05
7 734 pb temp -21.5
Calling summarize without telling it what to calculate produces a single row with no columns, which is not much use:

dplyr::summarize(measurements)

# A tibble: 1 x 0

A more sensible first step is to remove rows whose reading is missing:
measurements %>%
  dplyr::filter(!is.na(reading))
# A tibble: 20 x 4
visit_id person_id quantity reading
<dbl> <chr> <chr> <dbl>
1 619 dyer rad 9.82
2 619 dyer sal 0.13
3 622 dyer rad 7.8
4 622 dyer sal 0.09
5 734 pb rad 8.41
6 734 lake sal 0.05
7 734 pb temp -21.5
8 735 pb rad 7.22
9 735 <NA> sal 0.06
10 735 <NA> temp -26
11 751 pb rad 4.35
12 751 pb temp -18.5
13 752 lake rad 2.19
14 752 lake sal 0.09
15 752 lake temp -16
16 752 roe sal 41.6
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.5
20 844 roe rad 11.2
Removing rows that contain any NAs is equally easy, though it may be sta-
tistically unsound:
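A sketch using tidyr's drop_na, which discards every row containing at least one NA (and leaves the 18 rows counted below):

cleaned <- measurements %>%
  tidyr::drop_na()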
We can now group our data by the quantity measured and count the number
of each—the column is named n automatically:
cleaned %>%
  dplyr::group_by(quantity) %>%
  dplyr::count()
# A tibble: 3 x 2
# Groups: quantity [3]
quantity n
<chr> <int>
1 rad 8
2 sal 7
3 temp 3
cleaned %>%
  dplyr::group_by(quantity) %>%
  dplyr::summarize(low = min(reading),
                   mid = mean(reading),
                   high = max(reading))
# A tibble: 3 x 4
quantity low mid high
<chr> <dbl> <dbl> <dbl>
1 rad 1.46 6.56 11.2
2 sal 0.05 9.24 41.6
3 temp -21.5 -18.7 -16
After inspection, we realize that most of the salinity measurements lie between
0 and 1, but a handful range up to 100. During a brief interval of lucidity,
the librarian who collected the battered notebooks from which the data was
transcribed informs us that one of the explorers recorded percentages rather
than actual values. We therefore decide to normalize all salinity measurements
greater than 1.0 using ifelse (a two-branch analog of case_when):
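A sketch consistent with the cleaned-up output below:

cleaned <- cleaned %>%
  dplyr::mutate(reading = ifelse(quantity == "sal" & reading > 1.0,
                                 reading / 100, reading))
cleaned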
# A tibble: 18 x 4
visit_id person_id quantity reading
<dbl> <chr> <chr> <dbl>
1 619 dyer rad 9.82
2 619 dyer sal 0.13
3 622 dyer rad 7.8
4 622 dyer sal 0.09
5 734 pb rad 8.41
6 734 lake sal 0.05
7 734 pb temp -21.5
8 735 pb rad 7.22
9 751 pb rad 4.35
10 751 pb temp -18.5
11 752 lake rad 2.19
12 752 lake sal 0.09
13 752 lake temp -16
14 752 roe sal 0.416
15 837 lake rad 1.46
16 837 lake sal 0.21
17 837 roe sal 0.225
18 844 roe rad 11.2
To answer our next set of questions, we need data about when each site was
visited. Let’s read visited.csv and discard entries that are missing the visit
date:
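A sketch (once more, the file path is an assumption):

visited <- readr::read_csv(here::here("data", "visited.csv")) %>%
  dplyr::filter(!is.na(visit_date))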
visited
# A tibble: 7 x 3
visit_id site_id visit_date
<dbl> <chr> <date>
1      619 DR-1    1927-02-08
2      622 DR-1    1927-02-10
3      734 DR-3    1930-01-07
4      735 DR-3    1930-01-12
5      751 DR-3    1930-02-26
6      837 MSK-4   1932-01-14
7      844 DR-1    1932-03-22
and then combine that table with our cleaned measurement data. We will use
an inner join that matches records on the visit ID; dplyr also provides other
kinds of joins should we need them.
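A sketch of the join:

combined <- visited %>%
  dplyr::inner_join(cleaned, by = "visit_id")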
We can now find the date of the highest radiation reading at each site:
combined %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::mutate(max_rad = max(reading)) %>%
  dplyr::filter(reading == max_rad)
# A tibble: 3 x 7
# Groups: site_id [3]
visit_id site_id visit_date person_id quantity reading max_rad
<dbl> <chr> <date> <chr> <chr> <dbl> <dbl>
1 734 DR-3 1930-01-07 pb rad 8.41 8.41
2 837 MSK-4 1932-01-14 lake rad 1.46 1.46
3 844 DR-1 1932-03-22 roe rad 11.2 11.2
or:
combined %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::top_n(1, reading) %>%
  dplyr::select(site_id, visit_date, reading)
# A tibble: 3 x 3
# Groups: site_id [3]
site_id visit_date reading
<chr> <date> <dbl>
1 DR-3 1930-01-07 8.41
2 MSK-4 1932-01-14 1.46
3 DR-1 1932-03-22 11.2
The lag function lets us compare each radiation reading at a site with the one before it:

combined %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::mutate(delta_rad = reading - dplyr::lag(reading)) %>%
  dplyr::arrange(site_id, visit_date)
# A tibble: 7 x 7
# Groups: site_id [3]
visit_id site_id visit_date person_id quantity reading delta_rad
<dbl> <chr> <date> <chr> <chr> <dbl> <dbl>
1 619 DR-1 1927-02-08 dyer rad 9.82 NA
2 622 DR-1 1927-02-10 dyer rad 7.8 -2.02
3 844 DR-1 1932-03-22 roe rad 11.2 3.45
4 734 DR-3 1930-01-07 pb rad 8.41 NA
5 735 DR-3 1930-01-12 pb rad 7.22 -1.19
6 751 DR-3 1930-02-26 pb rad 4.35 -2.87
7 837 MSK-4 1932-01-14 lake rad 1.46 NA
Going one step further, we can create a list of sites at which radiation increased
between any two visits:
combined %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::mutate(delta_rad = reading - dplyr::lag(reading)) %>%
  dplyr::filter(!is.na(delta_rad)) %>%
  dplyr::summarize(any_increase = any(delta_rad > 0)) %>%
  dplyr::filter(any_increase)
# A tibble: 1 x 2
site_id any_increase
<chr> <lgl>
1 DR-1 TRUE
To get some practice making charts, we will use a subset of the data from:
Tamburello N, Côté IM, Dulvy NK (2015) Data from: Energy and the scaling of animal space use. Dryad Digital Repository. https://doi.org/10.5061/dryad.q5j65
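Assuming the data has been downloaded as a local CSV file (the file name here is hypothetical), we load it first:

hra <- readr::read_csv(here::here("data", "home_range_database.csv"))  # hypothetical path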
head(hra)
# A tibble: 6 x 24
taxon common.name class order family genus species primarymethod N
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 lake~ american e~ acti~ angu~ angui~ angu~ rostra~ telemetry 16
2 rive~ blacktail ~ acti~ cypr~ catos~ moxo~ poecil~ mark-recaptu~ <NA>
3 rive~ central st~ acti~ cypr~ cypri~ camp~ anomal~ mark-recaptu~ 20
4 rive~ rosyside d~ acti~ cypr~ cypri~ clin~ fundul~ mark-recaptu~ 26
5 rive~ longnose d~ acti~ cypr~ cypri~ rhin~ catara~ mark-recaptu~ 17
6 rive~ muskellunge acti~ esoc~ esoci~ esox masqui~ telemetry 5
# ... with 15 more variables: mean.mass.g <dbl>, log10.mass <dbl>,
# alternative.mass.reference <chr>, mean.hra.m2 <dbl>, log10.hra <dbl>,
# hra.reference <chr>, realm <chr>, thermoregulation <chr>, locomotion <chr>,
# trophic.guild <chr>, dimension <chr>, preymass <dbl>, log10.preymass <dbl>,
# PPMR <dbl>, prey.size.reference <chr>
A few keystrokes show us how the masses of these animals are distributed:
ggplot2::ggplot(hra) +
  ggplot2::geom_histogram(mapping = aes(x = mean.mass.g))
(Figure: histogram of count versus mean.mass.g.)
The distribution becomes much clearer if we plot the logarithms of the masses,
which are helpfully precalculated in log10.mass:
ggplot2::ggplot(hra) +
  ggplot2::geom_histogram(mapping = aes(x = log10.mass))
(Figure: histogram of count versus log10.mass.)
Adding more bins, a title, and axis labels makes the chart more informative:

ggplot2::ggplot(hra) +
  ggplot2::geom_histogram(mapping = aes(x = log10.mass), bins = 100) +
  ggplot2::ggtitle("Frequency of Species Masses") +
  ggplot2::xlab("Log10 of Mass") +
  ggplot2::ylab("Number of Species") +
  ggplot2::theme_minimal()
(Figure: "Frequency of Species Masses" histogram, Log10 of Mass versus Number of Species.)
How does home range area (log10.hra) relate to mass?

ggplot2::ggplot(hra) +
  ggplot2::geom_point(mapping = aes(x = log10.mass, y = log10.hra))
(Figure: scatter plot of log10.hra versus log10.mass.)
Does the relationship depend on the class of animal? (Here, we use the word
“class” in the biological sense: the class “aves” is birds.)
hra %>%
  dplyr::mutate(class_fct = as.factor(class)) %>%
  ggplot2::ggplot(mapping = aes(x = log10.mass, y = log10.hra, color = class_fct)) +
  ggplot2::geom_point(alpha = 0.5)
(Figure: log10.hra versus log10.mass, colored by class_fct: actinopterygii, aves, mammalia, reptilia.)
What's a Factor?
The code above creates a new column class_fct by converting the text
values in class to a factor. Other languages call this an enumeration: we
will discuss factors in more detail in Chapter 7.
Faceting by class may make the comparison clearer:

hra %>%
  dplyr::mutate(class_fct = as.factor(class)) %>%
  ggplot2::ggplot(mapping = aes(x = log10.mass, y = log10.hra, color = class_fct)) +
  ggplot2::geom_point(alpha = 0.5) +
  ggplot2::facet_wrap(~class_fct)
(Figure: the same scatter plot faceted by class_fct, one panel per class.)
If we want to look at the mass-area relationship more closely for birds, we can
construct a regression line:
hra %>%
  dplyr::filter(class == "aves") %>%
  ggplot2::ggplot(mapping = aes(x = log10.mass, y = log10.hra)) +
  ggplot2::geom_point(alpha = 0.5) +
  ggplot2::geom_smooth(method = lm, color = 'red')
(Figure: log10.hra versus log10.mass for birds, with a fitted regression line.)
Drilling down even further, we can create a violin plot of mass by order for
the birds (where “order” is the biological division below “class”):
hra %>%
  dplyr::filter(class == "aves") %>%
  dplyr::mutate(order_fct = as.factor(order)) %>%
  ggplot2::ggplot(mapping = aes(x = order_fct, y = log10.mass, color = order_fct)) +
  ggplot2::geom_violin()
(Figure: violin plot of log10.mass by order_fct for birds.)
A box plot of the same data may be easier to read:

hra %>%
  dplyr::filter(class == "aves") %>%
  dplyr::mutate(order_fct = as.factor(order)) %>%
  ggplot2::ggplot(mapping = aes(x = order_fct, y = log10.mass, color = order_fct)) +
  ggplot2::geom_boxplot()
228 E Practice Problems
anseriformes
5
apterygiformes
caprimulgiformes
charadriiformes
4 columbidormes
columbiformes
coraciiformes
log10.mass
cuculiformes
3
falconiformes
galliformes
gruiformes
2 passeriformes
pelecaniformes
piciformes
psittaciformes
1
rheiformes
strigiformes
accipitriformes
anseriformes
apterygiformes
caprimulgiformes
charadriiformes
columbidormes
columbiformes
coraciiformes
cuculiformes
falconiformes
galliformes
gruiformes
passeriformes
pelecaniformes
piciformes
psittaciformes
rheiformes
strigiformes
struthioniformes
tinamiformes struthioniformes
order_fct
tinamiformes
And if we want to save our chart to a file, that’s just one more call as well:
hra %>%
  dplyr::filter(class == "aves") %>%
  ggplot2::ggplot(mapping = aes(x = log10.mass, y = log10.hra)) +
  ggplot2::geom_point(alpha = 0.5) +
  ggplot2::geom_smooth(method = lm, color = 'red')
(Figure: the bird mass/home-range scatter plot with regression line, as above.)
ggsave("/tmp/birds.png")
F
Glossary

G
Key Points
• Pre-allocate storage in a list for each result from a loop and fill it in rather
than repeatedly extending the list.
• An R package can contain code, data, and documentation.
• R code is distributed as compiled bytecode in packages, not as source.
• R packages are almost always distributed through CRAN, the Comprehen-
sive R Archive Network.
• Most of a project’s metadata goes in a file called DESCRIPTION.
• Metadata related to imports and exports goes in a file called NAMESPACE.
• Add patterns to a file called .Rbuildignore to ignore files or directories
when building a project.
• All source code for a package must go in the R sub-directory.
• library calls in a package's source code are not executed when the installed
package is loaded.
• Data can be included in a package by putting it in the data sub-directory.
• Data must be in .rda format in order to be loaded as part of a package.
• Data in other formats can be put in the inst/extdata directory, and will
be installed when the package is installed.
• Add comments starting with #' to an R file to document functions.
• Use roxygen2 to extract these comments to create manual pages in the man
directory.
• Use @export directives in roxygen2 comment blocks to make functions visi-
ble outside a package.
• Add required libraries to the Imports section of the DESCRIPTION file to
indicate that your package depends on them.
• Use package::function to access externally-defined functions inside a pack-
age.
• Alternatively, add @import directives to roxygen2 comment blocks to make
external functions available inside the package.
• Import .data from rlang and use .data$column to refer to columns instead
of using bare column names.
• Create a file called R/package.R and document NULL to document the pack-
age as a whole.
• Create a file called R/dataset.R and document the string ‘dataset’ to docu-
ment a dataset.