R Programming for Data Science
Roger D. Peng
© 2014 - 2016 Roger D. Peng
Also By Roger D. Peng
The Art of Data Science
Exploratory Data Analysis with R
Report Writing for Data Science in R
Contents

1. Stay in Touch!

2. Preface

3. History and Overview of R
3.1 What is R?
3.2 What is S?
3.3 The S Philosophy
3.4 Back to R
3.5 Basic Features of R
3.6 Free Software
3.7 Design of the R System
3.8 Limitations of R
3.9 R Resources

4. Getting Started with R
4.1 Installation
4.2 Getting started with the R interface

5. R Nuts and Bolts
5.1 Entering Input
5.2 Evaluation
5.3 R Objects
5.4 Numbers
5.5 Attributes
5.6 Creating Vectors
5.7 Mixing Objects
5.8 Explicit Coercion
5.9 Matrices
5.10 Lists
5.11 Factors
5.12 Missing Values
5.13 Data Frames
5.14 Names
5.15 Summary

6. Getting Data In and Out of R
6.1 Reading and Writing Data
6.2 Reading Data Files with read.table()
6.3 Reading in Larger Datasets with read.table
6.4 Calculating Memory Requirements for R Objects

7. Using the readr Package

8. Using Textual and Binary Formats for Storing Data
8.1 Using dput() and dump()
8.2 Binary Formats

9. Interfaces to the Outside World
9.1 File Connections
9.2 Reading Lines of a Text File
9.3 Reading From a URL Connection

10. Subsetting R Objects
10.1 Subsetting a Vector
10.2 Subsetting a Matrix
10.3 Subsetting Lists
10.4 Subsetting Nested Elements of a List
10.5 Extracting Multiple Elements of a List
10.6 Partial Matching
10.7 Removing NA Values

11. Vectorized Operations
11.1 Vectorized Matrix Operations

12. Dates and Times
12.1 Dates in R
12.2 Times in R
12.3 Operations on Dates and Times
12.4 Summary

13. Managing Data Frames with the dplyr package
13.1 Data Frames
13.2 The dplyr Package
13.3 dplyr Grammar
13.4 Installing the dplyr package
13.5 select()
13.6 filter()
13.7 arrange()
13.8 rename()
13.9 mutate()
13.10 group_by()
13.11 %>%
13.12 Summary

14. Control Structures
14.1 if-else
14.2 for Loops
14.3 Nested for loops
14.4 while Loops
14.5 repeat Loops
14.6 next, break
14.7 Summary

15. Functions
15.1 Functions in R
15.2 Your First Function
15.3 Argument Matching
15.4 Lazy Evaluation
15.5 The ... Argument
15.6 Arguments Coming After the ... Argument
15.7 Summary

16. Scoping Rules of R
16.1 A Diversion on Binding Values to Symbol
16.2 Scoping Rules
16.3 Lexical Scoping: Why Does It Matter?
16.4 Lexical vs. Dynamic Scoping
16.5 Application: Optimization
16.6 Plotting the Likelihood
16.7 Summary

17. Coding Standards for R

18. Loop Functions
18.1 Looping on the Command Line
18.2 lapply()
18.3 sapply()
18.4 split()
18.5 Splitting a Data Frame
18.6 tapply
18.7 apply()
18.8 Col/Row Sums and Means
18.9 Other Ways to Apply
18.10 mapply()
18.11 Vectorizing a Function
18.12 Summary

19. Regular Expressions
19.1 Before You Begin
19.2 Primary R Functions
19.3 grep()
19.4 grepl()
19.5 regexpr()
19.6 sub() and gsub()
19.7 regexec()
19.8 Summary

20. Debugging
20.1 Something’s Wrong!
20.2 Figuring Out What’s Wrong
20.3 Debugging Tools in R
20.4 Using traceback()
20.5 Using debug()
20.6 Using recover()
20.7 Summary

21. Profiling R Code
21.1 Using system.time()
21.2 Timing Longer Expressions
21.3 The R Profiler
21.4 Using summaryRprof()
21.5 Summary

22. Simulation
22.1 Generating Random Numbers
22.2 Setting the random number seed
22.3 Simulating a Linear Model
22.4 Random Sampling
22.5 Summary

23. Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.
23.1 Synopsis
23.2 Loading and Processing the Raw Data
23.3 Results

24. About the Author


1. Stay in Touch!
Thanks for purchasing this book. If you are interested in hearing more from me about things that
I’m working on (books, data science courses, podcast, etc.), you can do two things.
First, I encourage you to join my mailing list of Leanpub Readers¹. On this list I send out updates
on my own activities as well as occasional comments on data science current events. I’ll also let you
know what my co-conspirators Jeff Leek and Brian Caffo are up to because sometimes they do really
cool stuff.
Second, I have a regular podcast called Not So Standard Deviations² that I co-host with Dr. Hilary
Parker, a Senior Data Analyst at Etsy. On this podcast, Hilary and I talk about the craft of data
science and discuss common issues and problems in analyzing data. We’ll also compare how data
science is approached in both academic and industry contexts and discuss the latest industry trends.
You can listen to recent episodes on our SoundCloud page or you can subscribe to it in iTunes³ or
your favorite podcasting app.
Thanks again for purchasing this book and please do stay in touch!
¹http://eepurl.com/bAJ3zj
²https://soundcloud.com/nssd-podcast
³https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570

2. Preface
I started using R in 1998 when I was a college undergraduate working on my senior thesis.
The version was 0.63. I was an applied mathematics major with a statistics concentration and
I was working with Dr. Nicolas Hengartner on an analysis of word frequencies in classic texts
(Shakespeare, Milton, etc.). The idea was to see if we could identify the authorship of each of the
texts based on how frequently they used certain words. We downloaded the data from Project
Gutenberg and used some basic linear discriminant analysis for the modeling. The work was
eventually published¹ and was my first ever peer-reviewed publication. I guess you could argue
it was my first real “data science” experience.
Back then, no one was using R. Most of my classes were taught with Minitab, SPSS, Stata, or
Microsoft Excel. The cool people on the cutting edge of statistical methodology used S-PLUS. I
was working on my thesis late one night and I had a problem. I didn’t have a copy of any of those
software packages because they were expensive and I was a student. I didn’t feel like trekking over
to the computer lab to use the software because it was late at night.
But I had the Internet! After a couple of Yahoo! searches I found a web page for something called R,
which I figured was just a play on the name of the S-PLUS package. From what I could tell, R was a
“clone” of S-PLUS that was free. I had already written some S-PLUS code for my thesis so I figured
I would try to download R and see if I could just run the S-PLUS code.
It didn’t work. At least not at first. It turns out that R is not exactly a clone of S-PLUS and quite a few
modifications needed to be made before the code would run in R. In particular, R was missing a lot of
statistical functionality that had existed in S-PLUS for a long time already. Luckily, R’s programming
language was pretty much there and I was able to more or less re-implement the features that were
missing in R.
After college, I enrolled in a PhD program in statistics at the University of California, Los Angeles.
At the time the department was brand new and they didn’t have a lot of policies or rules (or classes,
for that matter!). So you could kind of do what you wanted, which was good for some students and
not so good for others. The Chair of the department, Jan de Leeuw, was a big fan of XLisp-Stat and
so all of the department’s classes were taught using XLisp-Stat. I diligently bought my copy of Luke
Tierney’s book² and learned to really love XLisp-Stat. It had a number of features that R didn’t have
at all, most notably dynamic graphics.
But ultimately, there were only so many parentheses that I could type, and still all of the research-
level statistics was being done in S-PLUS. The department didn’t really have a lot of copies of S-PLUS
lying around so I turned back to R. When I looked around at my fellow students, I realized that I
was basically the only one who had any experience using R. Since there was a budding interest in R
around the department, I decided to start a “brown bag” series where every week for about an hour
I would talk about something you could do in R (which wasn’t much, really). People seemed to like
it, if only because there wasn’t really anyone to turn to if you wanted to learn about R.
¹http://amstat.tandfonline.com/doi/abs/10.1198/000313002100#.VQGiSELpagE
²http://www.amazon.com/LISP-STAT-Object-Oriented-Environment-Statistical-Probability/dp/0471509167/
By the time I left grad school in 2003, the department had essentially switched over from XLisp-
Stat to R for all its work (although there were a few holdouts). Jan discusses the rationale for the
transition in a paper³ in the Journal of Statistical Software.
In the next step of my career, I went to the Department of Biostatistics⁴ at the Johns Hopkins
Bloomberg School of Public Health, where I have been for the past 12 years. When I got to Johns
Hopkins people already seemed into R. Most people had abandoned S-PLUS a while ago and were
committed to using R for their research. Of all the available statistical packages, R had the most
powerful and expressive programming language, which was perfect for someone developing new
statistical methods.
However, we didn’t really have a class that taught students how to use R. This was a problem because
most of our grad students were coming into the program having never heard of R. Most likely in
their undergraduate programs, they used some other software package. So along with Rafael Irizarry,
Brian Caffo, Ingo Ruczinski, and Karl Broman, I started a new class to teach our graduate students
R and a number of other skills they’d need in grad school.
The class was basically a weekly seminar where one of us talked about a computing topic of interest.
I gave some of the R lectures in that class and when I asked who had heard of R before, almost
no one raised their hand. And no one had actually used it before. The main selling point at the time
was “It’s just like S-PLUS but it’s free!” A lot of people had experience with SAS or Stata or SPSS. A
number of people had used something like Java or C/C++ before and so I often used that as a reference
frame. No one had ever used a functional-style programming language like Scheme or Lisp.
To this day, I still teach the class, known as Biostatistics 140.776 (“Statistical Computing”). However,
the nature of the class has changed quite a bit over the past 10 years. The population of students
(mostly first-year graduate students) has shifted to the point where many of them have been
introduced to R as undergraduates. This trend mirrors the overall trend with statistics where we
are seeing more and more students do undergraduate majors in statistics (as opposed to, say,
mathematics). Eventually, by 2008–2009, when I asked how many people had heard of or used
R before, everyone raised their hand. However, even at that late date, I still felt the need to convince
people that R was a “real” language that could be used for real tasks.
R has grown a lot in recent years, and is being used in so many places now, that I think it’s
essentially impossible for a person to keep track of everything that is going on. That’s fine, but
it makes “introducing” people to R an interesting experience. Nowadays in class, students are often
teaching me something new about R that I’ve never seen or heard of before (they are quite good
at Googling around for themselves). I feel no need to “bring people over” to R. In fact it’s quite the
opposite–people might start asking questions if I weren’t teaching R.
³http://www.jstatsoft.org/v13/i07
⁴http://www.biostat.jhsph.edu

This book comes from my experience teaching R in a variety of settings and through different stages
of its (and my) development. Much of the material has been taken from my Statistical Computing
class as well as the R Programming⁵ class I teach through Coursera.
I’m looking forward to teaching R to people as long as people will let me, and I’m interested in
seeing how the next generation of students will approach it (and how my approach to them will
change). Overall, it’s been just an amazing experience to see the widespread adoption of R over the
past decade. I’m sure the next decade will be just as amazing.
⁵https://www.coursera.org/course/rprog
3. History and Overview of R
There are only two kinds of languages: the ones people complain about and the ones
nobody uses —Bjarne Stroustrup

Watch a video of this chapter¹

3.1 What is R?
This is an easy question to answer. R is a dialect of S.

3.2 What is S?
S is a language that was developed by John Chambers and others at the old Bell Telephone
Laboratories, originally part of AT&T Corp. S was initiated in 1976² as an internal statistical analysis
environment—originally implemented as Fortran libraries. Early versions of the language did not
even contain functions for statistical modeling.
In 1988 the system was rewritten in C and began to resemble the system that we have today (this
was Version 3 of the language). The book Statistical Models in S by Chambers and Hastie (the white
book) documents the statistical analysis functionality. Version 4 of the S language was released in
1998 and is the version we use today. The book Programming with Data by John Chambers (the
green book) documents this version of the language.
Since the early 1990s the life of the S language has gone down a rather winding path. In 1993 Bell Labs
gave StatSci (later Insightful Corp.) an exclusive license to develop and sell the S language. In 2004
Insightful purchased the S language from Lucent for $2 million. In 2006, Alcatel purchased Lucent
Technologies, and the combined company is now called Alcatel-Lucent.
Insightful sold its implementation of the S language under the product name S-PLUS and built a
number of fancy features (GUIs, mostly) on top of it—hence the “PLUS”. In 2008 Insightful was
acquired by TIBCO for $25 million. As of this writing TIBCO is the current owner of the S language
and is its exclusive developer.
The fundamentals of the S language itself have not changed dramatically since the publication of the
Green Book by John Chambers in 1998. In 1998, S won the Association for Computing Machinery’s
Software System Award, a highly prestigious award in the computer science field.
¹https://youtu.be/STihTnVSZnI
²http://cm.bell-labs.com/stat/doc/94.11.ps

3.3 The S Philosophy


The general S philosophy is important to understand for users of S and R because it sets the stage for
the design of the language itself, which many programming veterans find a bit odd and confusing.
In particular, it’s important to realize that the S language had its roots in data analysis, and did not
come from a traditional programming language background. Its inventors were focused on figuring
out how to make data analysis easier, first for themselves, and then eventually for others.
In Stages in the Evolution of S³, John Chambers writes:

“[W]e wanted users to be able to begin in an interactive environment, where they
did not consciously think of themselves as programming. Then as their needs became
clearer and their sophistication increased, they should be able to slide gradually into
programming, when the language and system aspects would become more important.”

The key part here was the transition from user to developer. They wanted to build a language that
could easily serve both kinds of “people”. More technically, they needed to build a language that would
be suitable for interactive data analysis (more command-line based) as well as for writing longer
programs (more traditional programming language-like).

3.4 Back to R
The R language came into use quite a bit after S had been developed. One key limitation of the S
language was that it was only available in a commercial package, S-PLUS. In 1991, R was created
by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland. In
1993 the first announcement of R was made to the public. Ross’s and Robert’s experience developing
R is documented in a 1996 paper in the Journal of Computational and Graphical Statistics:

Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal
of Computational and Graphical Statistics, 5(3):299–314, 1996

In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the
GNU General Public License⁴ to make R free software. This was critical because it allowed for the
source code for the entire R system to be accessible to anyone who wanted to tinker with it (more
on free software later).
In 1996, public mailing lists were created (the R-help and R-devel lists) and in 1997 the R Core
Group was formed, containing some people associated with S and S-PLUS. Currently, the core group
controls the source code for R and is solely able to check in changes to the main R source tree. Finally,
in 2000 R version 1.0.0 was released to the public.
³http://www.stat.bell-labs.com/S/history.html
⁴http://www.gnu.org/licenses/gpl-2.0.html

3.5 Basic Features of R


In the early days, a key feature of R was that its syntax was very similar to that of S, making it easy for
S-PLUS users to switch over. While R’s syntax is nearly identical to that of S, R’s semantics,
though superficially similar to those of S, are quite different. In fact, R is technically much closer to the Scheme
language than it is to the original S language when it comes to how R works under the hood.
Today R runs on almost any standard computing platform and operating system. Its open source
nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R
has been reported to be running on modern tablets, phones, PDAs, and game consoles.
One nice feature that R shares with many popular open source projects is frequent releases. These
days there is a major annual release, typically in the spring, in which major new features are incorporated
and released to the public. Throughout the year, smaller-scale bug-fix releases are made as needed.
The frequent releases and regular release cycle indicate active development of the software and
ensure that bugs will be addressed in a timely manner. Of course, while the core developers control
the primary source tree for R, many people around the world make contributions in the form of new
features, bug fixes, or both.
Another key advantage that R has over many other statistical packages (even today) is its sophisticated
graphics capabilities. R’s ability to create “publication quality” graphics has existed since the
very beginning and has generally been better than that of competing packages. Today, with many more
visualization packages available than before, that trend continues. R’s base graphics system allows
for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems,
like lattice and ggplot2, allow for complex and sophisticated visualizations of high-dimensional data.
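As a rough illustration of the kind of fine control the base graphics system offers, the sketch below customizes several aspects of a single scatterplot. The choice of the built-in `cars` data set, the colours, and the labels are assumptions made for this example only.

```r
# Open a null PDF device so the example runs without a display or output file.
# Drop this line and the dev.off() call to see the plot interactively.
pdf(NULL)

# `cars` is a data set shipped in the datasets package.
plot(cars$speed, cars$dist,
     pch = 19, col = "steelblue",      # point symbol and colour
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed",
     las = 1,                          # axis tick labels drawn horizontally
     bty = "l")                        # L-shaped box around the plot region
abline(lm(dist ~ speed, data = cars), lty = 2)  # dashed least-squares line
grid(col = "grey80")                            # light reference grid

closed <- dev.off()
```

Each argument adjusts one element of the plot, and later calls such as `abline()` and `grid()` add to the existing figure piece by piece.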
R has maintained the original S philosophy, which is to provide a language that is useful for
interactive work and also powerful enough for developing new tools. This
allows the user, who takes existing tools and applies them to data, to slowly but surely become a
developer who creates new tools.
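The move from user to developer can be sketched in a few lines. The data values and the function name `summarize_vector` below are hypothetical, chosen only to illustrate the idea.

```r
# Interactive use: apply existing tools directly to data.
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)
mean(x)
sd(x)

# Developer use: wrap the same steps in a new, reusable tool.
# `summarize_vector` is a hypothetical name chosen for this example.
summarize_vector <- function(v) {
  c(mean = mean(v), sd = sd(v), n = length(v))
}

summarize_vector(x)
```

The same expressions typed at the prompt become the body of the function, which is part of what makes the transition gradual.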
Finally, one of the joys of using R has nothing to do with the language itself, but rather with the
active and vibrant user community. In many ways, a language is successful inasmuch as it creates a
platform with which many people can create new things. R is that platform, and thousands of people
around the world have come together to make contributions to R, to develop packages, and to help
each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly
active for over a decade now and there is considerable activity on web sites like Stack Overflow.

3.6 Free Software


A major advantage that R has over many other statistical packages is that it’s free in the sense
of free software (it’s also free in the sense of free beer). The copyright for the primary source code
for R is held by the R Foundation⁵ and is published under the GNU General Public License version
2.0⁶.
⁵http://www.r-project.org/foundation/
According to the Free Software Foundation, with free software, you are granted the following four
freedoms⁷:

• The freedom to run the program, for any purpose (freedom 0).
• The freedom to study how the program works, and adapt it to your needs (freedom 1). Access
to the source code is a precondition for this.
• The freedom to redistribute copies so you can help your neighbor (freedom 2).
• The freedom to improve the program, and release your improvements to the public, so that
the whole community benefits (freedom 3). Access to the source code is a precondition for
this.

You can visit the Free Software Foundation’s web site⁸ to learn a lot more about free software. The
Free Software Foundation was founded by Richard Stallman in 1985 and Stallman’s personal web
site⁹ is an interesting read if you happen to have some spare time.

3.7 Design of the R System


The primary R system is available from the Comprehensive R Archive Network¹⁰, also known as
CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.
The R system is divided into 2 conceptual parts:

1. The “base” R system that you download from CRAN: Linux¹¹, Windows¹², Mac¹³, Source Code¹⁴
2. Everything else.

R functionality is divided into a number of packages.

• The “base” R system contains, among other things, the base package which is required to run
R and contains the most fundamental functions.
• The other packages contained in the “base” system include utils, stats, datasets, graphics,
grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.
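To see which packages make up the “base” system on a particular installation, something like the following sketch can be used. The exact list may differ from the one above depending on the R version installed.

```r
# Names of the packages that ship with the base R distribution.
base_pkgs <- rownames(installed.packages(priority = "base"))
print(sort(base_pkgs))

# Packages and environments attached in the current session.
print(search())
```

Here `installed.packages()` comes from the utils package, and `priority = "base"` restricts the result to packages marked as part of the base system.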

⁶http://www.gnu.org/licenses/gpl-2.0.html
⁷http://www.gnu.org/philosophy/free-sw.html
⁸http://www.fsf.org
⁹https://stallman.org
¹⁰http://cran.r-project.org
¹¹http://cran.r-project.org/bin/linux/
¹²http://cran.r-project.org/bin/windows/
¹³http://cran.r-project.org/bin/macosx/
¹⁴http://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz