Chapter 9
UCLA
Contents
Learning Objectives
1 Introduction
2 Characters in R
2.1 Basic Definitions
2.2 The paste() Function
2.3 Print Functions for Characters
2.3.1 The print() and noquote() Functions
2.3.2 The cat() Function
2.3.3 The format() Function
4 Pattern Matching
4.1 Introduction and the %in% Operator
4.2 The grep() and grepl() Functions
4.3 The gsub() Function
4.4 Regular Expressions
Learning Objectives
After studying this chapter, you should be able to:
• Perform basic string manipulation in R
• Perform basic pattern matching with grep(), grepl(), and gsub()
• Interpret and use basic regular expressions
• Calculate the Flesch reading ease score
1 Introduction
Most of statistical computing involves working with numeric data. However, many modern applications have
considerable amounts of data in the form of text.
There are whole areas of statistics and machine learning devoted to organizing and interpreting text-based
data, such as textual data analysis, linguistic analysis, text mining, sentiment analysis, and natural language
processing (NLP).
For more information and resources:
• https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
• https://www.tidytextmining.com/
Text-based analyses are beyond the scope of this course. However, even in non-text-based analyses, working
with data in R often requires processing of characters, such as in row/column names, dates, monetary
quantities, longitude/latitude, etc.
Other common scenarios involving characters:
• Removing a given character in the names of your variables
• Changing the level(s) of a categorical variable
• Replacing a given character in a dataset
• Converting labels to upper or lower case
• Extracting a regular pattern of characters from a large text file
• Parsing input from an XML or HTML file
A basic understanding of character (or string) manipulation and regular expressions can be a valuable skill
for any statistical analysis. We will discuss the most common syntax and functions for string manipulation in
base R and introduce basic regular expressions in R.
For more information and resources:
Books and Articles
• Gaston Sanchez’s “Handling Strings with R”: https://www.gastonsanchez.com/r4strings/
• Garrett Grolemund and Hadley Wickham’s “R for Data Science”: http://r4ds.had.co.nz/strings.html
• https://en.wikibooks.org/wiki/R_Programming/Text_Processing
• https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html
Cheat Sheets for stringr and Regular Expressions
• https://github.com/rstudio/cheatsheets/raw/master/strings.pdf
• https://www.cheatography.com//davechild/cheat-sheets/regular-expressions/pdf/
Sites for Testing Regular Expressions
• https://regex101.com/
• https://regexr.com/
2 Characters in R
2.1 Basic Definitions
Symbols in R that represent text or words are called characters. A string is a character variable that
contains one or more characters, but we often will use “character” and “string” interchangeably.
Values that are stored as characters have base type character and are typically printed with quotation
marks.
x <- "Pawnee rules"
x
[1] "Pawnee rules"
class(x)
[1] "character"
Characters can be created using single or double quotation marks, but double quotation marks are almost
universally preferred.
Single quotation marks can be used within double quotation marks and vice versa, but you cannot directly
insert single quotes within single quotes or double quotes within double quotes.
The double quotation mark is a special character inside a double-quoted string, so inserting one requires a backslash (\") to escape its special meaning.
"This is the 'R' Language"
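These rules can be seen in a short sketch (the strings themselves are illustrative):

```r
# Single quotes nest directly inside double quotes, and vice versa
s1 <- "This is the 'R' Language"
s2 <- 'This is the "R" Language'
# A double quote inside double quotes must be escaped with a backslash
s3 <- "This is the \"R\" Language"
# s2 and s3 store exactly the same characters
identical(s2, s3)
# [1] TRUE
```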
2.2 The paste() Function
• The ... argument means the input can be any number of objects.
• The optional sep argument specifies the separator placed between terms as they are pasted together element-wise. The default is a single whitespace " ".
• The optional collapse argument specifies the separator used when collapsing the resulting character vector into a single string. By default (collapse = NULL), the result is not collapsed.
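A small sketch of how sep and collapse interact (the vectors here are illustrative):

```r
# sep separates the pasted pieces element-wise; the result is a vector
paste("x", 1:3, sep = "_")
# [1] "x_1" "x_2" "x_3"

# collapse then joins that vector into a single string
paste("x", 1:3, sep = "_", collapse = " + ")
# [1] "x_1 + x_2 + x_3"
```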
paste("I ate some", pi, "and it was delicious.")
cat(x, "Eagleton drools", sep=", ", file = "pawnee.txt")
When file is specified, an optional logical argument append specifies whether the result should be appended
to or overwrite an existing file.
Side Note: There are a few other optional arguments that are useful for longer text strings. Consult the R
documentation for more information.
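As a sketch of the append behavior (using a temporary file rather than pawnee.txt):

```r
tmp <- tempfile(fileext = ".txt")
cat("Pawnee rules", file = tmp)                       # creates/overwrites the file
cat(", Eagleton drools\n", file = tmp, append = TRUE) # appends to the existing file
readLines(tmp)
# [1] "Pawnee rules, Eagleton drools"
```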
Function Description
nchar() Returns number of characters
tolower() Converts to lower case
toupper() Converts to upper case
casefold() Wrapper for tolower() and toupper()
chartr() Translates characters
abbreviate() Abbreviates characters
substr() Extracts substrings of a character vector
strsplit() Splits strings into substrings
The best way to understand how these functions work is to try them on simple examples and see how the
input character vector changes.
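A few quick illustrations (the input vector is arbitrary):

```r
w <- c("Pawnee", "Eagleton")
nchar(w)                    # [1] 6 8
toupper(w)                  # [1] "PAWNEE" "EAGLETON"
casefold(w, upper = FALSE)  # [1] "pawnee" "eagleton"
# chartr() translates characters one-for-one: a -> 4, e -> 3
chartr(old = "ae", new = "43", w)
# [1] "P4wn33" "E4gl3ton"
```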
3.2 The nchar() Function
The nchar() function inputs a character vector and outputs the number of (human-readable) characters
contained in each entry of the vector.
y <- c("Pawnee rules", "Eagleton drools")
nchar(y)
[1] 12 15
# Extract the 3rd to 5th characters of each value in `y`
substr(y, start = 3, stop = 5)
[1] "wne" "gle"
To separate a sentence into separate words, we can split by the single space character " ".
z <- c("Pawnee rules and Eagleton drools.", "I love friends, waffles, and work.")
word_z <- strsplit(z, split = " ")
word_z
[[1]]
[1] "Pawnee" "rules" "and" "Eagleton" "drools."
[[2]]
[1] "I" "love" "friends," "waffles," "and" "work."
Note: The ith component in word_z contains the words in the ith sentence of z. To combine all values
in separate components of a list into a single vector, we can use the unlist() function to remove the list
structure.
unlist(word_z)
4 Pattern Matching
4.1 Introduction and the %in% Operator
One main application of string manipulation is pattern matching. Finding patterns in text is useful for data
validation, data scraping, text parsing, filtering search results, etc.
A first tool for pattern/value matching is the %in% operator. The %in% operator is a vectorized binary
operator that checks each value in the vector on the left and returns TRUE if the entry matches one of the
values on the right and FALSE otherwise. The output of the %in% operator is always a logical vector.
1:10 %in% c(5, 7, 9)
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
ucla <- c("u", "c", "l", "a")
letters %in% ucla
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[25] FALSE FALSE
letters[letters %in% ucla]
4.2 The grep() and grepl() Functions
• The pattern argument is the character string to be matched in the character vector x. The pattern
can be the literal character(s) to match or a regular expression.
• The x argument is the input character vector where matches are to be found.
• There are other optional arguments as well, including an ignore.case argument that specifies whether
the pattern to match is case sensitive or not.
The command grep(pattern, x) returns a numeric vector of the indices of the entries of x that contain a
match to pattern. The command grepl(pattern, x) returns a logical vector of whether each entry of x
contains a match (TRUE) or not (FALSE).
test <- c("April", "and", "Andy", "love", "Champion", "and", "Lil'", "Sebastian")
grep(pattern = "a", test)
[1] 2 5 6 8
grepl(pattern = "a", test)
4.3 The gsub() Function
The pattern, x, and optional arguments of gsub() are identical to those found in grep() and grepl(). While
grep() and grepl() identify pattern matches, gsub() replaces the pattern match by the replacement
argument.
gsub(pattern = "A", replacement = "a", test)
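For example, the optional ignore.case argument makes the match case-insensitive (a small sketch using the same test vector):

```r
test <- c("April", "and", "Andy", "love", "Champion", "and", "Lil'", "Sebastian")
# With ignore.case = TRUE, both "a" and "A" are matched and replaced
gsub(pattern = "a", replacement = "4", test, ignore.case = TRUE)
# [1] "4pril" "4nd" "4ndy" "love" "Ch4mpion" "4nd" "Lil'" "Seb4sti4n"
```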
4.4 Regular Expressions
For more complicated patterns, we need more tools to efficiently specify the pattern to match.
A regular expression (or regex) is a set of symbols that describes a text pattern. More formally, a regular
expression is a pattern that describes a set of strings.
Regular expressions are a formal language in their own right in the sense that the symbols have a defined set
of rules to specify the desired patterns. Most programming languages, including R, can use and implement
regular expressions. The best way to learn the syntax and become fluent with regular expressions is to
practice.
Some common applications of regular expressions:
• Test if a phone number has the correct number of digits
• Test if a date follows a specific format (e.g., mm/dd/yy)
• Test if an email address is in a valid format
• Test if a password has numbers and special characters
• Search a document for gray spelled either as “gray” or “grey”
• Search a document and replace all occurrences of “Will”, “Bill”, or “W.” with “William”
• Count the number of times in a document that the word “analysis” is immediately preceded by the
words “data”, “computer”, or “statistical”
• Convert a comma-delimited file into a tab-delimited file
• Find duplicate words in a text
We will not cover a full treatment of regular expressions in R (it is typically covered in detail in Stats 102A).
For more information, refer to the ?regex help documentation or one of the references at the beginning of
this chapter.
We will introduce a few basic regular expressions in the next section to allow us to compute a readability
measure of how difficult a passage of English text is to understand.
5 Application: The Flesch Reading Ease Score
5.1 Introduction
The Flesch reading ease score is a numeric measure of English readability, i.e., the ease with which a
reader can understand text.
The formula to compute the Flesch reading ease (RE) score is
RE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)
   = 206.835 − (1.015 × ASL) − (84.6 × ASW)
where ASL is the average sentence length and ASW is the average number of syllables per word.
The reading ease score RE is usually a number between 0 and 100, though there are some exceptions for
non-standard words/sentences. Higher values of RE indicate text that is easier to read, and lower values
indicate text that is more difficult to read.
More information on the Flesch reading ease score:
• http://www.readabilityformulas.com/flesch-reading-ease-readability-formula.php
• https://en.wikipedia.org/wiki/Flesch-Kincaid_readability_tests#Flesch_reading_ease
From the Readability Formulas site:
“Though simple it might seem, the Flesch Reading Ease Formula has certain ambiguities. For
instance, periods, explanation [sic] points, colons and semicolons serve as sentence delimiters;
each group of continuous non-blank characters with beginning and ending punctuation removed
counts as a word; each vowel in a word is considered one syllable subject to: (a) -es, -ed and -e
(except -le) endings are ignored; (b) words of three letters or shorter count as single syllables; and
(c) consecutive vowels count as one syllable.”
We want to write a function in R that inputs an English passage and outputs the reading ease score of the
passage.
As an example, we will compute the Flesch reading ease score of the following passage:
“We need to remember what’s important in life: friends, waffles, work. Or waffles, friends, work.
Doesn’t matter, but work is third.” – Leslie Knope (Parks and Recreation)
For convenience, the command to create a waffles object containing this passage is in the waffles.R file on
Bruin Learn.
source("waffles.R")
waffles
[1] "We need to remember what's important in life: friends, waffles, work. Or
waffles, friends, work. Doesn't matter, but work is third."
The primary components of the Flesch reading ease score formula are sentences, words, and syllables. The
main steps to compute the reading ease score are:
1. Separate the text passage into individual sentences, and count the number of sentences.
2. Separate each sentence into individual words, and count the number of words for each sentence.
3. Separate each individual word into individual syllables, and count the number of syllables.
To find the individual sentences, we need to split the text string based on "end of sentence" punctuation. The
sentence delimiters we want to consider, i.e., the symbols that mark the end of a sentence, are periods
(.), exclamation points (!), question marks (?), colons (:), and semicolons (;).
A regular expression that represents the pattern of “any sentence delimiter” would be [.!?:;]. In the context
of regular expressions, the square brackets define a character set, which means any single character that
is contained within the brackets will match the pattern. The regular expression for “any vowel” would be
[aeiouy].
We will use this regular expression to split the text in the waffles object into separate sentences.
strsplit(waffles, split = "[.!?:;]")
[[1]]
[1] "We need to remember what's important in life"
[2] " friends, waffles, work"
[3] " Or waffles, friends, work"
[4] " Doesn't matter, but work is third"
Remember that the output of strsplit() is always a list. Since the waffles object was a single character
value, then the output of strsplit(waffles, split = "[.!?:;]") has a single component. To continue
processing the text, we will extract the character vector inside.
waffles_sentences <- strsplit(waffles, split = "[.!?:;]")[[1]]
waffles_sentences
5.3 Splitting Sentences Into Words
We now have a vector of sentences that have been processed to remove capitalization and punctuation. We
are ready to move on to counting words!
At this stage in the notes, each entry in the waffles_sentences vector is a sentence that we want to split
into words. Since we have removed all punctuation already, the only character that separates words is the
single whitespace character " ". We thus can use strsplit() again, splitting based on " ".
waffles_words <- strsplit(waffles_sentences, split = " ")
waffles_words
[[1]]
[1] "we" "need" "to" "remember" "whats" "important"
[7] "in" "life"
[[2]]
[1] "" "friends" "waffles" "work"
[[3]]
[1] "" "or" "waffles" "friends" "work"
[[4]]
[1] "" "doesnt" "matter" "but" "work" "is" "third"
Caution: Notice the output! By splitting based on the whitespace " ", we have leading empty characters in
all components of waffles_words except for the first component. This is due to the space after the end of a
sentence. We do not want to count the empty character "" as a word in our reading ease formula, so we need
to remove them.
Within each component of the waffles_words list, we want to only keep the character values that have a
nonzero number of characters. This can be done in one line, but we will create a helper function for clarity.
keep_words <- function(words) {
words[nchar(words) > 0]
}
The keep_words() function inputs a vector of words and returns only the words that have a positive number
of characters. We can now use lapply() to apply the keep_words() function to each component of the
waffles_words list.
waffles_words <- lapply(waffles_words, keep_words)
waffles_words
[[1]]
[1] "we" "need" "to" "remember" "whats" "important"
[7] "in" "life"
[[2]]
[1] "friends" "waffles" "work"
[[3]]
[1] "or" "waffles" "friends" "work"
[[4]]
[1] "doesnt" "matter" "but" "work" "is" "third"
Note: At this point in the notes, you are able to compute the ASL (average sentence length) value in the
reading ease formula.
Question: How would you compute the ASL for the waffles text using the waffles_words object?
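One possible answer (a sketch; waffles_words is written out here so the example is self-contained):

```r
# The cleaned word list from the previous section
waffles_words <- list(
  c("we", "need", "to", "remember", "whats", "important", "in", "life"),
  c("friends", "waffles", "work"),
  c("or", "waffles", "friends", "work"),
  c("doesnt", "matter", "but", "work", "is", "third")
)
# lengths() gives the number of words in each sentence (component)
total_words <- sum(lengths(waffles_words))  # 21
total_sentences <- length(waffles_words)    # 4
ASL <- total_words / total_sentences
ASL
# [1] 5.25
```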
length(tom_letters)
[1] 3
Since length(tom_letters) is 3, the word tom counts as a single syllable.
For longer words, such as "horses" and "eagleton", we will need to consider the other rules and count the
vowels.
5.4.2 Accounting for Special Word Endings
The second rule: -es, -ed and -e (except -le) endings are ignored.
Before we count the syllables in words longer than 3 letters, we need to first ignore the special word endings:
-es, -ed, and -e, unless the ending is -le.
The horses_letters vector contains the individual letters of the word "horses". We can use the tail()
function to extract the last two letters of the word.
horses_tail <- tail(horses_letters, n = 2)
We can write a helper function is_special_ending() that inputs a vector of two letters (that represent the
last two letters of a word) and returns TRUE if the word ends in a special ending (-es, -ed, or -e except -le)
and FALSE otherwise.
is_special_ending <- function(ending) {
is_es <- all(ending == c("e", "s"))
is_ed <- all(ending == c("e", "d"))
is_e_not_le <- ending[2] == "e" & ending[1] != "l"
is_es | is_ed | is_e_not_le
}
is_special_ending(horses_tail)
[1] TRUE
Since the word ends in -es, we will remove the word ending and count the syllables in the remaining “word”
"hors".
rm_special_endings <- function(word_letters) {
word_tail <- tail(word_letters, n = 2)
if (is_special_ending(word_tail)) {
if (word_tail[2] == "e") {
word_letters[-length(word_letters)]
} else {
head(word_letters, n = -2)
}
} else {
word_letters
}
}
rm_special_endings(horses_letters)
5.4.3 Accounting For Consecutive Vowels
The third rule: Consecutive vowels count as one syllable.
Once the first two rules are accounted for, we next need to be able to identify which letters are vowels. As an
example, we will use the character "eagleton".
We can write a helper function is_vowel() that inputs a vector of letters and returns TRUE for each vowel
and FALSE otherwise.
is_vowel <- function(letter) {
letter %in% c("a", "e", "i", "o", "u", "y")
}
eagleton_vowels <- is_vowel(eagleton_letters)
eagleton_vowels
[1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
which(eagleton_vowels)
[1] 1 2 5 7
Based on the numeric indices alone, can you tell which vowels are consecutive?
Consecutive vowels will have consecutive indices! One trick to find consecutive indices is to find the consecutive
differences with the diff() function.
diff(which(eagleton_vowels))
[1] 1 3 2
Each consecutive difference of 1 indicates consecutive vowels. So the total number of syllables in the word is
(number of syllables) = (number of vowels) − (number of consecutive differences of 1 in the vowel indices).
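Putting the three rules together, here is one possible sketch of the count_syllables() function used below. The wrapper itself is an assumption (the notes' own definition may differ in details), and the helper functions from the previous sections are repeated so the block is self-contained:

```r
is_vowel <- function(letter) {
  letter %in% c("a", "e", "i", "o", "u", "y")
}

is_special_ending <- function(ending) {
  is_es <- all(ending == c("e", "s"))
  is_ed <- all(ending == c("e", "d"))
  is_e_not_le <- ending[2] == "e" & ending[1] != "l"
  is_es | is_ed | is_e_not_le
}

rm_special_endings <- function(word_letters) {
  word_tail <- tail(word_letters, n = 2)
  if (is_special_ending(word_tail)) {
    if (word_tail[2] == "e") {
      word_letters[-length(word_letters)]  # drop only the final -e
    } else {
      head(word_letters, n = -2)           # drop -es or -ed
    }
  } else {
    word_letters
  }
}

count_syllables <- function(word) {
  word_letters <- strsplit(word, split = "")[[1]]
  # Rule (b): words of three letters or shorter count as one syllable
  if (length(word_letters) <= 3) {
    return(1)
  }
  # Rule (a): ignore -es, -ed, and -e (except -le) endings
  word_letters <- rm_special_endings(word_letters)
  # Rule (c): consecutive vowels count as one syllable
  vowel_idx <- which(is_vowel(word_letters))
  length(vowel_idx) - sum(diff(vowel_idx) == 1)
}

count_syllables("eagleton")
# [1] 3
```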
count_syllables("tom")
[1] 1
count_syllables("horses")
[1] 1
count_syllables("eagleton")
[1] 3
count_syllables("pneumonoultramicroscopicsilicovolcanoconiosis")
[1] 17
Note: At this point in the notes, you are able to compute the ASW (average number of syllables per word)
value in the reading ease formula.
Question: How would you compute the ASW for the waffles text using the waffles_words object?
reading_ease(waffles)
[1] 96.76339
For the waffles text, the Flesch reading ease score is 96.7633929. Use this value to verify that you have
implemented your function correctly.
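To tie the chapter together, here is one possible end-to-end sketch of a reading_ease() function built from the pieces developed above. The cleaning step (lower-casing and stripping punctuation with "[[:punct:]]") and the exact structure of the function are assumptions; the notes' own implementation may differ:

```r
is_vowel <- function(letter) letter %in% c("a", "e", "i", "o", "u", "y")

is_special_ending <- function(ending) {
  all(ending == c("e", "s")) ||
    all(ending == c("e", "d")) ||
    (ending[2] == "e" && ending[1] != "l")
}

rm_special_endings <- function(word_letters) {
  word_tail <- tail(word_letters, n = 2)
  if (!is_special_ending(word_tail)) return(word_letters)
  if (word_tail[2] == "e") word_letters[-length(word_letters)]
  else head(word_letters, n = -2)
}

count_syllables <- function(word) {
  word_letters <- strsplit(word, split = "")[[1]]
  if (length(word_letters) <= 3) return(1)     # rule (b)
  word_letters <- rm_special_endings(word_letters)  # rule (a)
  vowel_idx <- which(is_vowel(word_letters))
  length(vowel_idx) - sum(diff(vowel_idx) == 1)     # rule (c)
}

reading_ease <- function(text) {
  # Step 1: split into sentences on the delimiters . ! ? : ;
  sentences <- strsplit(text, split = "[.!?:;]")[[1]]
  # Assumed cleaning step: lower-case and strip remaining punctuation
  sentences <- gsub(pattern = "[[:punct:]]", replacement = "", tolower(sentences))
  # Step 2: split into words and drop empty strings left by the spaces
  words <- lapply(strsplit(sentences, split = " "), function(w) w[nchar(w) > 0])
  total_sentences <- length(words)
  total_words <- sum(lengths(words))
  # Step 3: count syllables word by word
  total_syllables <- sum(sapply(unlist(words), count_syllables))
  206.835 - 1.015 * (total_words / total_sentences) -
    84.6 * (total_syllables / total_words)
}

waffles <- paste(
  "We need to remember what's important in life: friends, waffles, work.",
  "Or waffles, friends, work. Doesn't matter, but work is third."
)
reading_ease(waffles)
# [1] 96.76339
```

This sketch reproduces the target value for the waffles passage, which suggests the three-rule syllable counter and the delimiter set match the notes' intent.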