R Programming Text Book
Dr. G. Sudhamathy
If you are looking for complete step-by-step instructions for learning R programming
for statistical data analysis, graphical visualization and data mining, Dr.
Sudhamathy and Dr. Jothi Venkateswaran's "R Programming - An Approach to Data
Analytics" is a hands-on book packed with examples and references that will help
you get started coding in R for a variety of data science problems.
Hopefully you can take the instructions provided in this book to get started in R
programming for your next data analysis project, and do some exciting data
visualization and data mining on your own.
It is my immense happiness to pen this foreword for a book that is quite impressive
for any techie who is interested in R programming. It is equally joyous to have a
book written by experts, Dr. G. Sudhamathy and Dr. C. Jothi Venkateswaran. When
a book can teach you and guide you as you work hands-on with the tool, you are in
the right direction on your learning path.
One can be sure that this book will be of great help and guidance to learners
carrying out their work on analytics using R, whether in research, in practice or
simply to learn the tool.
Best wishes for the success of this book in academia, research and practice.
Dr. S. Justus
Associate Professor & Chair - Software Engineering Research Group
VIT University, Chennai
PREFACE
Huge volumes of data are generated daily by many sources such as commercial
enterprises, scientific domains and the general public. According to recent
research, data production in 2020 will be 44 times greater than it was in 2010.
Data being a vital resource for business organizations and other domains like
education, health, manufacturing etc., its management and analysis is becoming
increasingly important. This data, due to its volume, variety and velocity, often
referred to as Big Data, also includes highly unstructured data in the form of textual
documents, web pages, graphical information and social media comments. Since
Big Data is characterised by massive sample sizes, high dimensionality and intrinsic
heterogeneity, traditional approaches to data management, visualisation and
analytics are no longer satisfactorily applicable. There is therefore an urgent need
for newer tools, better frameworks and workable methodologies for such data to
be appropriately categorised, logically segmented, efficiently analysed and securely
managed. This requirement has resulted in an emerging new discipline of Data
Science that is now gaining much attention with researchers and practitioners in
the field of Data Analytics.
This book introduces the R programming language and makes it easy to approach
for anyone. The chapters are designed so that the first four chapters target
beginners, while the next three chapters target learners of advanced concepts. The
book also provides the reader with a list of all packages and functions used in the
book, along with the page numbers where their usage is shown. Every concept
discussed in the various sections of this book is illustrated with a proper example,
comprising a set of code and its results (as text or as graphs).
The book is organized into 7 chapters and the concept discussed in each chapter
is as detailed below.
Chapter 2 discusses the basic data types in R: the primitive data types such
as vectors, matrices and arrays, lists and factors. It also deals with the complex data
types such as data frames, strings, dates and times. The chapter covers not only
data creation, but also basic operations on data of the different types.
Chapter 3 deals with data preparation: where to fetch datasets from, and how to
import and export data from various sources of different types, such as CSV files,
XML files, etc. It also discusses ways of accessing various databases. Data cleaning
and transformation techniques, such as data reshaping and grouping functions, are
also outlined in this chapter.
Chapter 4 is about using the graphical features in R for exploratory data analysis.
It gives examples of pie charts, scatter plots, line plots, histograms, box plots and
bar plots using the various graphical packages such as base, lattice and ggplot2.
Chapter 5 deals with statistical analysis concepts using R, starting with the basic
statistical measures like mean, median, mode, standard deviation, variance and
ranges. It discusses distributions of data, such as the normal and binomial
distributions, and how they can be viewed and analyzed using R. The chapter then
explores complex statistical techniques such as correlation analysis, regression
analysis, ANOVA and hypothesis testing, all of which can be implemented using R.
Chapter 7 explores various essential case studies, such as text analytics, credit
risk analysis, social network analysis and a few exploratory data analyses. The main
purpose of this chapter is to apply the basic and advanced concepts presented in
the previous chapters of this book.
The author would like to mention her special regards and thanks to Dr. G.
P. Jeyanthi, Research and Consultancy Director, Dr. A. Parvathi, Dean, Faculty of
Science and Dr. V. Radha, Head, Department of Computer Science, Avinashilingam
University, Coimbatore, for their constant encouragement and support to turn this
work into a useful product.
The author wishes to thank all the faculty members of the Department of
Computer Science, Avinashilingam University, Coimbatore, for their continuous
support and suggestions for this book.
We are grateful to the students and teacher community who kept us on our
toes with their constant bombardment of queries which prompted us to learn more,
simplify our learning and findings and place them neatly in a book.
Our special regards to the experts Mr. Sajeev Madhavan, Director of
Architecture, Oracle, USA and Dr. S. Justus, Associate Professor, VIT, Chennai,
who gave their expert opinions in shaping this book into a more appealing format.
Most importantly we would like to thank our family members without whose
support this book would not have been a reality.
Last, but not the least, this work is a dedication to God, the Almighty whose
grace has showered upon us in making our dream come true.
G. Sudhamathy
C. Jothi Venkateswaran
Chapter 1 Basics of R 1
Chapter 2 Data Types in R 27
Chapter 3 Data Preparation 83
Chapter 4 Graphics using R 117
Chapter 5 Statistical Analysis Using R 141
Chapter 6 Data Mining Using R 177
Chapter 7 Case Studies 233
Glossary 299
Packages Used 309
Functions Used 313
References 359
Books 359
Websites 359
Index 361
Basics of R
OBJECTIVES
1.1. Introducing R
R is a programming language, and R also refers to the software that is used to run
R programs. Ross Ihaka and Robert Gentleman of the University of Auckland
created the R language in the 1990s. R is based on the S language, which was
developed by John Chambers at Bell Laboratories in the 1970s. R (the language
and the software) is a free and open source GNU project, developed and maintained
by the R Core Team. R has thus evolved over the past three to four decades, with a
history originating in the 1970s.
One can write a new package in R if the existing packages are not sufficient
for an individual's use. R is a high-level scripting language: it is interpreted, so
programs need not be compiled. R is an imperative language, yet it also supports
object-oriented programming.
The R language allows the user to program loops to successively analyze several
data sets, and different statistical functions can be combined in a single program to
perform more complex analyses. R users also benefit from a large number of
programs written for R and available on the internet. At first, R can look very
complex to a beginner or non-specialist, but this impression is misleading, as a
prominent feature of R is its flexibility. R displays the results of an analysis
immediately, and these results are stored in “objects” so that further analysis can be
done on them. The user can also extract just the part of a result that is of interest.
Looking at the features of R, some users may think “I can’t write programs
using R”. But this is not the case, for two reasons. First, R is an interpreted language
and not a compiled one. This means that all commands typed at the keyboard are
directly executed, without the need to build a complete program as in C, C++ or
Java. Second, R’s syntax is very simple and intuitive.
In R, a function is always written with parentheses, e.g. ls(). If only the name
of the function is typed, R displays the content of the function. In this book,
functions are written with their names followed by parentheses to distinguish them
from other objects. When R is running, variables, data, functions, results, etc. are
stored in the active memory of the computer in the form of named objects. The
user can act on these objects with operators and functions.
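For instance, a minimal session illustrating this object-based workflow might look as follows (the object name x is just an example):
> x <- c(2, 4, 6)   # results are stored in an object named x
> mean(x)           # a function applied to the object
[1] 4
> x[2]              # extract the part of the object that is of interest
[1] 4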
1.2. Installing R
R is available in several forms, essentially for Unix and Linux machines, or some
pre-compiled binaries for Windows, Linux and Macintosh. The files needed to
install R, either from the source or from the pre-compiled binaries are distributed
from the internet site of the Comprehensive R Archive Network (CRAN) where the
instructions for installation are also available.
1.3. Initiating R
Open the R GUI, find the command prompt, type the command below and hit
Enter to run it.
> sum(1:5)
[1] 15
The result above shows that the command returns 15: the command has taken
the integers from 1 to 5 as input and performed the sum operation on them. In this
command, sum() is a function that takes the argument 1:5, a vector consisting of
the sequence of integers from 1 to 5. Like any other command prompt, R allows
the up arrow key to be used to recall previous commands.
1.3.2. Help in R
There are many ways to get help in R. If a function name or a dataset name is
known, then we can type ? followed by the name. If the name is not known, then we
need to type ?? followed by a term related to the search. Keywords, special
characters and two-word search terms need to be enclosed in double or single
quotes. The symbol # is used to comment a line in an R program, as in many other
programming languages.
> ?mean # help page for mean function opens
> ?"+" # help page for addition function opens
> ?"if" # help page for if opens
> ??plotting # searches for the help pages containing the word "plotting"
> ??"regression model" # searches for "regression model" phrase
The same help can be obtained using the functions help() and help.search(). In
these functions the arguments have to be enclosed in quotes.
> help("mean")
> help("+")
> help("if")
> help.search("plotting")
> help.search("regression model")
Variable names consist of letters, numbers, dots and underscores, but a
variable name must start with a letter, and variable names must not be reserved
words. Variables are normally assigned with the "<-" operator; to create global
variables (variables available everywhere) we use the symbol "<<-".
> X <<- exp(exp(1))
Assignment can also be done using the assign() function. For global
assignment the same function assign() can be used, but with globalenv() passed as
an extra argument. To see the value of a variable, simply type the variable name at
the command prompt. The same thing can be done using the print() function.
> assign("F", 3 * 8)
> assign("G", 6 * 9, globalenv())
> F
[1] 24
> print(G)
[1] 54
If assignment and printing of a value have to be done in one line, this can be
achieved in two ways: by separating the two statements with a semicolon, or by
wrapping the assignment in parentheses, as below.
> L <- sum(4:8); L
[1] 30
> (M <- sum(5:9))
[1] 35
The "+" plus operator performs addition. It can be used to add two numbers
or two vectors. A vector represents an ordered set of values; vectors are widely used
to analyse statistical data. The ":" colon operator creates a sequence, a series of
numbers within the given limits. The c() function concatenates the values given
within the brackets "(" and ")". Variable names in R are case sensitive. Open the
R GUI, find the command prompt, type the command below and hit Enter to run it.
> 7:12 + 12:17
[1] 19 21 23 25 27 29
> c(3, 1, 8, 6, 7) + c(9, 2, 5, 7, 1)
[1] 12 3 13 13 8
The vectors and the c() function in R help us avoid loops. The statistical
functions in R can take vectors as input and produce results. The sum() function
accepts either a vector argument or separate number arguments. The median()
function, however, accepts only a vector; passing the numbers as separate
arguments raises an error, as shown below.
> sum(7:10)
[1] 34
> mean(7:10)
[1] 8.5
> median(7:10)
[1] 8.5
> sum(7,8,9,10)
[1] 34
> mean(7,8,9,10) # only 7 is treated as data; 8, 9 and 10 are matched to other arguments
[1] 7
> median(7,8,9,10)
Error in median(7, 8, 9, 10) : unused arguments (9, 10)
Similar to the “+” plus operator all other operators in R take vectors as inputs
and can produce results. The subtraction and the multiplication operations work
as below.
> c(5, 6, 1, 9) - 2
[1] 3 4 -1 7
> c(5, 6, 1, 9) - c(4, 2, 0, 7)
[1] 1 4 1 2
> -1:4 * -2:3
[1] 2 0 0 2 6 12
> -1:4 * 3
[1] -3 0 3 6 9 12
The exponentiation operator is represented using the symbol "^" or "**".
This can be checked using the function identical().
> identical(2^3, 2**3)
[1] TRUE
The other mathematical functions are the trigonometry functions, like sin(),
cos(), tan(), asin(), acos() and atan(), and the logarithmic and exponential functions,
like log(), exp(), log1p() and expm1(). All these mathematical functions can operate on
vectors as well as individual elements. A few more examples of the mathematical
functions are listed below.
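A representative sketch (the original listing falls outside this excerpt):
> log(c(1, exp(1), exp(2)))   # natural logarithm, applied element-wise
[1] 0 1 2
> exp(log(10))                # exp() and log() are inverses
[1] 10
> sin(pi/2)
[1] 1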
The operator "==" is used for comparing two values, and inequality is checked
with the operator "!=". These operators are called the relational operators. The
relational operators also take vectors as input and operate on them element-wise.
The other relational operators are "<", ">", "<=" and ">=".
> c(2, 4 - 2, 1 + 1) == 2
[1] TRUE TRUE TRUE
The equality operator "==" can also be used to compare strings, but, string
comparison is case sensitive. Similarly, the operators "<" and ">" can also be used
on strings. The below examples show the results.
> c("Week", "WEEK", "week", "weak") == "week"
[1] FALSE FALSE TRUE FALSE
1.4. Packages in R
R packages are distributed through an online repository called CRAN
(Comprehensive R Archive Network). A package is a collection of R functions and
datasets. At the time of writing, the CRAN package repository featured 10,756
available packages. The list of all available packages in the CRAN repository can be
viewed at "https://cran.r-project.org/web/packages/available_packages_by_name.html".
To find the list of functions available in a package (say the package "stats") we can
use the command ls("package:stats") or the command library(help = stats) at the
command prompt.
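For example, for the stats package:
> head(ls("package:stats"))   # first few of the many functions exported by stats
> library(help = stats)       # opens an overview of the stats package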
A library is a folder in the machine that stores the files for a package. If a package
is already installed on a machine we can load the same using the library() function.
The name of the package to be loaded is passed to the library() function as argument
without enclosing in quotes. If the package name has to be programmatically passed
to the library() function, then we need to set the argument character.only = TRUE.
If a package is not installed and the library() function is used to load it, an error is
thrown. Alternatively, if the require() function is used to load a package, it returns
TRUE if the package is installed and FALSE if it is not.
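A sketch of the difference (cluster is assumed to be installed; notinstalled is a hypothetical package name):
> library(cluster)      # loads the package; raises an error if it is not installed
> require(cluster)      # loads the package and returns TRUE
> require(notinstalled) # returns FALSE with a warning (hypothetical package)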
We can list all the packages that are currently loaded using the search()
function. This list shows the global environment first, followed by the most
recently loaded packages. The last two entries are special: the "Autoloads"
environment and the "base" package.
> search()
[1] ".GlobalEnv" "package:cluster" "tools:rstudio" "package:stats"
[5] "package:graphics" "package:grDevices" "package:utils" "package:datasets"
[9] "package:methods" "Autoloads" "package:base"
Beyond the standard CRAN repository, a handful of other repositories deserve
attention. To access additional repositories, type setRepositories() and select
the repositories required. The repositories R-Forge and rforge.net contain the
development versions of packages that later appear on CRAN. The function
available.packages() lists the thousands of packages in each of the selected
repositories. (Note: the View() function can be used to browse the result rather
than printing thousands of packages at one go.)
> setRepositories()
--- Please select repositories for use in this session ---
1: + CRAN
2: BioC software
3: BioC annotation
4: BioC experiment
5: BioC extra
6: CRAN (extras)
7: Omegahat
8: R-Forge
9: rforge.net
10: + CRANextra
Enter one or more numbers separated by spaces, or an empty line to cancel
1:
There are many online repositories, like GitHub, Bitbucket and Google Code,
from which R packages can be retrieved. A package can be installed using the
install.packages() function, with the name of the package given as the argument.
An internet connection and write permission to the hard drive are necessary to
install any package. To update the installed packages to their latest versions, we use
the function update.packages(); with the argument ask = FALSE, prompting before
updating each package is suppressed. To delete a package already installed, we use
the function remove.packages(), passing the name of the package to be removed as
the argument.
> install.packages("chron")
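Updating and removing follow the same pattern (a sketch):
> update.packages(ask = FALSE)   # update all installed packages without prompting
> remove.packages("chron")       # delete the chron package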
1.5.1. Environments
A new environment can be created with the new.env() function. We can then
assign variables into the newly created environment using double square brackets
or the dollar operator, as below.
> newenvironment <- new.env()
> newenvironment[["variable1"]] <- c(4, 7, 9)
> newenvironment$variable2 <- TRUE
> assign("variable3", "Value for variable3", newenvironment)
The functions ls() and ls.str() take an environment argument and list its
contents. We can test whether a variable exists in an environment using the exists()
function.
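Continuing the newenvironment example:
> ls(newenvironment)
[1] "variable1" "variable2" "variable3"
> exists("variable1", newenvironment)
[1] TRUE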
An environment can be converted into a list using the function as.list() and a
list can be converted into an environment using the function as.environment() or
the function list2env().
> newlist <- as.list(newenvironment)
> newlist
$variable3
[1] "Value for variable3"
$variable1
[1] 4 7 9
$variable2
[1] TRUE
> as.environment(newlist)
<environment: 0x124730a8>
> list2env(newlist)
<environment: 0x12edf3e8>
> anotherenv <- as.environment(newlist)
> anotherenv[["variable3"]]
[1] "Value for variable3"
All environments are nested, so every environment has a parent environment;
the empty environment sits at the top of the hierarchy without any parent. The
exists() and get() functions also look for variables in the parent environments.
To change this behaviour we need to pass the argument inherits = FALSE.
> subenv <- new.env(parent = newenvironment)
> exists("variable1", subenv)
[1] TRUE
> exists("variable1", subenv, inherits = FALSE)
[1] FALSE
The word frame is used interchangeably with the word environment, and the
function parent.frame() refers to the environment of the calling function. Variables
assigned at the command prompt are stored in the global environment, while the
functions and variables of R's base package are stored in the base environment.
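These special environments can be inspected directly:
> globalenv()     # the global environment
<environment: R_GlobalEnv>
> baseenv()       # the environment of the base package
<environment: base>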
1.5.2. Functions
A function together with its environment is called a closure. When we load a
package, the functions in that package are stored in an environment on the search
path. A function in R is a verb and not a noun: it does things with its data.
Functions are themselves a data type, so we can assign them, manipulate them and
pass them as arguments to other functions. Typing a function name at the
command prompt lists the code associated with the function.
Below is the code listed for the function readLines().
> readLines
function (con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown",
skipNul = FALSE)
{
if (is.character(con)) {
con <- file(con, "r")
on.exit(close(con))
}
.Internal(readLines(con, n, ok, warn, encoding, skipNul))
}
When we call a function by passing values to it, those values are called
arguments. The lines of code between the curly braces form the body of the
function. In R, an explicit return() statement is optional: by default, a function
returns the last value evaluated in its body.
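A sketch of the implicit return behaviour:
> f <- function(x)
+ {
+   x + 1    # last evaluated value; returned automatically
+ }
> f(5)
[1] 6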
The functions formals(), args() and formalArgs() can fetch the arguments
defined for a function. The body of the function can be retrieved using the body()
and deparse() functions.
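The examples below assume a simple user-defined function cube(), presumably defined earlier (the definition shown here is reconstructed from the body(cube) output that follows):
> cube <- function(x)
+ {
+   cu <- x^3
+ }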
> formals(cube)
$x
> args(cube)
function (x)
NULL
> formalArgs(cube)
[1] “x”
> body(cube)
{
cu <- x^3
}
> deparse(cube)
[1] "function (x) " "{" " cu <- x^3" "}"
R will search for a variable in the current environment, and if it cannot find
it there, it will check the parent environment. This search proceeds upwards until
the global environment is reached. The variables defined in the global environment
are called global variables, and they can be accessed from anywhere. The replicate()
function can be used to run a function several times, as below. Here the user-defined
function random() returns 1 if the value returned by rnorm() is positive, and
otherwise it returns the value of the argument passed to random(). The function
random() is called 20 times using replicate().
> random <- function(x)
+ {
+ if(rnorm(1) > 0)
+ {r <- 1}
+ else
+ {r <- x}
+ }
> replicate(20, random(5))
[1] 5 5 1 1 5 1 5 5 5 5 5 5 5 5 5 1 1 5 1 5
The if statement takes a logical value and executes the next statement only if the
value is TRUE.
> if(TRUE) message("TRUE Statement")
TRUE Statement
> if(FALSE) message("FALSE Statement")
It is not necessary to pass the logical values as TRUE or FALSE directly, instead
a variable or expression that returns a logical value can be used. If there are several
statements to execute after the condition, they can be wrapped in curly braces.
a <- 5
if(a < 7)
{
b <- a * 5
c <- b * 3
message("b is ", b)
message("c is ", c)
}
b is 25
c is 75
In the if and else construct, the code that follows the if statement is executed if
the condition is TRUE, and the code that follows the else statement is executed if
the condition is FALSE. It is important to note that the else keyword must occur
on the same line as the closing curly brace of the if block; otherwise an error is
thrown.
a <- 8
if(a < 7)
{
b <- a * 5
c <- b * 3
message("b is ", b)
message("c is ", c)
} else
{
message("a is greater than 7")
}
a is greater than 7
The if and else statements can be used repeatedly to code multiple conditions
and their respective actions. In this case it is important to note that the if and the
else statements are separate words: they are not the single word ifelse. The ifelse()
function serves a different purpose, which is covered shortly.
a <- -8
if(a < 0)
{
message("a is negative")
} else if(a == 0)
{
message("a is zero")
} else if(a > 0)
{
message("a is positive")
}
a is negative
The ifelse() function takes three arguments: the first is a logical condition, the
second is the value returned where the condition is TRUE, and the third is the
value returned where the condition is FALSE.
> a <- 3
> b <- 5
> ifelse(a < b, "a is less than b", "a is greater than b")
[1] "a is less than b"
If there are many else statements, the code becomes confusing, and in such cases
the switch() function is useful. The first argument of switch() is an expression that
returns a string value or an integer. This is followed by several named arguments
that provide the results when the name matches the value of the first argument.
Here also we can execute multiple statements enclosed in curly braces. If there is
no match, switch() returns NULL, so it is safer to mention a default value in case
nothing matches.
> switch("color", "color" = "red", "shape" = "circle", "radius" = 10)
[1] "red"
> switch("position", "color" = "red", "shape" = "circle", "radius" = 10)
# no visible output: switch() returns NULL invisibly when nothing matches
> switch("position", "color" = "red", "shape" = "circle", "radius" = 10, "default")
[1] "default"
> switch(2, "red", "green", "blue")
[1] "green"
1.7. Loops
There are three kinds of loops in R namely, repeat, while and for.
The repeat loop is the simplest loop in R: it executes the same code until it is
forced to stop. It is similar to the do-while statement of other languages. A break
statement can be given when it is required to break out of the loop. It is also possible
to skip the rest of the statements in an iteration and move on to the next one, using
the next statement.
a <- 1
repeat {
message("Inside the loop")
if(a == 3)
{
a = a + 1
next
}
message("The value of a is ", a)
a = a + 1
if(a > 5)
{
message("Exiting the loop")
break
}
}
Inside the loop
The value of a is 1
Inside the loop
The value of a is 2
Inside the loop
Inside the loop
The value of a is 4
Inside the loop
The value of a is 5
Exiting the loop
While loops reverse the order of repeat loops: a repeat loop executes the code and
then checks the condition, whereas a while loop checks the condition first and then
executes the code. Hence the code may not be executed even once, if the condition
fails at entry to the first iteration. The same example as above can be written using
the while statement.
a <- 1
while (a <= 5)
{
message("Inside the loop")
if(a == 3)
{
a = a + 1
next
}
message("The value of a is ", a)
a = a + 1
}
Inside the loop
The value of a is 1
Inside the loop
The value of a is 2
Inside the loop
Inside the loop
The value of a is 4
Inside the loop
The value of a is 5
For loops are used when we know how many times the code needs to be repeated.
The for loop accepts an iterating variable and a vector, and it repeats the loop
assigning the iterating variable each element of the vector in turn. Here too, multiple
statements can be wrapped in curly braces. The vector iterated over can contain
integers, numbers, characters or logical values, and it can even be a list, as in the
sketch that follows the examples below.
for(i in 1:5)
{
j <- i * i
message("The square value of ", i, " is ", j)
}
The square value of 1 is 1
The square value of 2 is 4
The square value of 3 is 9
The square value of 4 is 16
The square value of 5 is 25
for(i in c(TRUE, FALSE, NA))
{
message("This Statement is ", i)
}
This Statement is TRUE
This Statement is FALSE
This Statement is NA
a <- c(1, 2, 3)
b <- c("a", "b", "c", "d")
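The example is cut short by a page break; a plausible continuation iterating over a list of these vectors would be:
for(x in list(a, b))
{
message("Element of length ", length(x))
}
Element of length 3
Element of length 4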
HIGHLIGHTS
R is a free open source language that has cross platform compatibility.
R’s syntax is very simple and intuitive.
R’s installation software can be downloaded from the CRAN Website.
Help in R can be obtained using, e.g., ?mean or help("mean").
Variables can be assigned using the symbol "<-" or the assign() function.
The basic functions are c(), sum(), mean(), median(), exp(), sqrt() etc.
The basic operators are "+", "-", "*", "/", ":", "**", "%%", "%/%", "==",
"!=", "<", ">", "<=", ">=" etc.
At the time of writing, the CRAN package repository featured 10,756 available packages.
A package can be newly installed using the function install.packages() and
it can be loaded using the function library().
When a variable is assigned in the command prompt, it goes by default
into the global environment.
To create a new environment we use the function new.env().
Typing the function name in the command prompt lists the code
associated with the function.
The if and the else statements are separated and they are not one word as
ifelse.
The ifelse() function takes three arguments.
If there are many else statements, the switch() function is required.
Data Types in R
OBJECTIVES
In R, data is stored in R-objects, and there are many types of R-objects. The
frequently used ones are Vectors, Arrays, Matrices, Lists, Data Frames, Strings
and Factors.
The simplest of these objects is the Vector object, and there are six data types
of these atomic vectors, also termed the six classes of vectors. The other R-objects
are built upon the atomic vectors. The basic data types in R are hence Numeric,
Integer, Complex, Logical and Character (the sixth atomic type, Raw, is rarely
used).
2.1.1. Numeric
Decimal values are called numeric in R. It is the default computational data type. If
we assign a decimal value to a variable x as follows, x will be of numeric type.
> x = 10.5
> x
[1] 10.5
> class(x) # print the class name of x
[1] "numeric"
2.1.2. Integer
An integer value is created (or another value converted to integer) with the
as.integer() function.
> y = as.integer(3)
> y
[1] 3
> class(y)
[1] "integer"
> is.integer(y)
[1] TRUE
We can force a numeric value into an integer with the same as.integer() function
as below.
> as.integer(3.14)
[1] 3
The integer values of the logical values TRUE and FALSE are 1 and 0 respectively.
> as.integer(TRUE)
[1] 1
> as.integer(FALSE)
[1] 0
2.1.3. Complex
A complex value is written as a real part plus an imaginary part suffixed with i.
> z = 3 + 4i
> z
[1] 3+4i
> class(z)
[1] "complex"
Finding the square root of -1 directly produces NaN with a warning. But if -1
is first converted into a complex number and the square root is then applied, it
produces the expected result as another complex number.
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> sqrt(as.complex(-1))
[1] 0+1i
2.1.4. Logical
When two variables are compared, logical values are created. The logical
operators are "&" (and), "|" (or) and "!" (negation).
> a = 4; b = 7
> p = a > b
> p
[1] FALSE
> class(p)
[1] "logical"
> a = TRUE; b = FALSE
> a & b
[1] FALSE
> a | b
[1] TRUE
> !a
[1] FALSE
2.1.5. Character
The character object is used to represent string values in R. Objects can be converted
into character values using the as.character() function, and the paste() function can
be used to concatenate two character values.
> s = as.character("7.48")
> s
[1] "7.48"
> class(s)
[1] "character"
> fname = "Adam"
> lname = "Smith"
> paste(fname, lname)
[1] "Adam Smith"
A formatted, readable string can be created using the sprintf() function, whose
syntax is similar to that of the C language.
> sprintf("%s has %d rupees", "Sundar", 1000)
[1] "Sundar has 1000 rupees"
The substr() function can be used to extract a substring from a given string. The
sub() function is used to replace the first occurrence of a string with another string
as below.
> substr("Twinkle Twinkle Little Star", start = 9, stop = 15)
[1] "Twinkle"
> sub("Twinkle", "Wrinkle", "Twinkle Twinkle Little Star")
[1] "Wrinkle Twinkle Little Star"
2.2. Vectors
A sequence of data elements of the same basic type is called a vector, and the
elements are called components or members. The vector() function creates a vector
of a specified type and length; the elements are initialised to zero, FALSE or the
empty string, depending on the type.
> vector("numeric", 3)
[1] 0 0 0
> vector("logical", 5)
[1] FALSE FALSE FALSE FALSE FALSE
> vector("character", 2)
[1] "" ""
The commands below also produce the same results as the commands above.
> numeric(3)
[1] 0 0 0
> logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
> character(2)
[1] "" ""
The seq() function allows us to generate sequences. The function seq.int() also
creates a sequence from one number to another, but it provides more options for
splitting the sequence.
> seq(1:5)
[1] 1 2 3 4 5
> seq.int(5, 12)
[1] 5 6 7 8 9 10 11 12
> seq.int(10, 5, -1.5)
[1] 10.0 8.5 7.0 5.5
The function seq_len() creates a sequence from 1 to the input value. The
function seq_along() creates a sequence from 1 to the length of the input.
> seq_len(7)
[1] 1 2 3 4 5 6 7
> p <- c(3, 4, 5, 6)
> seq_along(p)
[1] 1 2 3 4
The function length() can be used to find the length of the vector, that is the
number of elements in a vector. Using this function, it is possible to assign new
length to a vector. If the vector length is extended NA(s) will be added to the end.
> length(1:7)
[1] 7
> length(c("aa", "ccc", "eeee"))
[1] 3
> nchar(c("aa", "ccc", "eeee"))
[1] 2 3 4
> s <- c(1, 2, 3, 4, 5)
> length(s) <- 3
> s
[1] 1 2 3
> length(s) <- 8
> s
[1] 1 2 3 NA NA NA NA NA
Each element of a vector can be given a name during the vector creation itself. If
there are space or special characters in the name, it needs to be enclosed in quotes.
The names() function can be used to give names to the vector elements after its
creation.
> c(a = 1, b = 2, c = 3)
a b c
1 2 3
> s <- 1:3
> s
[1] 1 2 3
> names(s) <- c("a", "b", "c")
> s
a b c
1 2 3
Elements of a vector can be accessed using its indexes which are specified in
a square bracket. The index number starts from 1 and not 0. Specifying a negative
number as index to a vector means, it returns all the elements except the one
specified. The name of the vector element can also be specified as index to fetch it.
> x <- c(1:5)
> x
[1] 1 2 3 4 5
> x[c(2,3)]
[1] 2 3
> x[c(-1,-4)]
[1] 2 3 5
> s <- 1:3
> s
[1] 1 2 3
> names(s) <- c("a", "b", "c")
> s["b"]
b
2
The which() function returns the indexes of the vector elements that satisfy the
condition specified within it. The functions which.min() and which.max() can be
used to display the positions of the minimum and the maximum elements of the
vector.
> x
[1] 1 2 3 4 5
> which.min(x)
[1] 1
> which.max(x)
[1] 5
> which(x>3)
[1] 4 5
Vectors can be combined using the c() function. When a numeric vector and a
character vector are combined, the numeric values are coerced into character values;
this shows that all the members of a vector must be of the same basic data type.
> f = c(7, 5, 9)
> g = c("aaa", "bbb", "ccc")
> c(f, g)
[1] "7" "5" "9" "aaa" "bbb" "ccc"
> x = c(5, 8, 9)
> y = c(2, 6, 9)
> 4 * y
[1] 8 24 36
> x + y
[1] 7 14 18
> x - y
[1] 3 2 0
> x * y
[1] 10 48 81
> x / y
[1] 2.500000 1.333333 1.000000
When two vectors of different lengths are added, the shorter vector is recycled
to match the longer one; here x is repeated to cover the six elements of v.
> v = c(1, 2, 3, 4, 5, 6)
> x + v
[1] 6 10 12 9 13 15
The rep() function creates a vector with repeated elements. This function has
its other variants such as rep.int() and rep_len() whose usage is as given below.
> rep(1:3, 4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep(1:3, each = 4)
[1] 1 1 1 1 2 2 2 2 3 3 3 3
> rep(1:3, times = 1:3)
[1] 1 2 2 3 3 3
> rep(1:3, length.out = 9)
[1] 1 2 3 1 2 3 1 2 3
> rep.int(1:3, 4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep_len(1:3, 9)
[1] 1 2 3 1 2 3 1 2 3
2.3. Arrays and Matrices
A matrix is created using the function matrix(), passing the nrow or ncol argument
instead of the dim argument used for arrays. A matrix can also be created using the
array() function, with the dimension of the array set to two.
> m <- matrix(1:12, nrow = 3, dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
> m
  d e f  g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12
> m1 <- array(1:12, dim = c(3,4),
dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
> m1
  d e f  g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12
The argument byrow = TRUE in the matrix() function assigns the elements
row wise. If this argument is not specified, by default the elements are filled column
wise.
> m <- matrix(1:12, nrow = 3, byrow = TRUE,
dimnames = list(c("a", "b", "c"), c("d", "e", "f", "g")))
The dim() function returns the dimensions of an array or a matrix (the object x
used below is a three-dimensional array). The functions nrow() and ncol() return
the number of rows and the number of columns of a matrix respectively.
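The array x is presumably the one built in an earlier portion of the text not reproduced here; a definition consistent with the outputs below is:
> x <- array(1:24, dim = c(4, 3, 2),
dimnames = list(c("a", "b", "c", "d"), c("e", "f", "g"), c("h", "i")))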
> dim(x)
[1] 4 3 2
> dim(m)
[1] 3 4
> nrow(m)
[1] 3
> ncol(m)
[1] 4
The length() function also works for matrices and arrays. It is also possible to
assign new dimension for a matrix or an array using the dim() function.
> length(x)
[1] 24
> length(m)
[1] 12
> dim(m) <- c(6,2)
The functions rownames(), colnames() and dimnames() can be used to fetch the
row names, column names and dimension names of matrices and arrays respectively.
> rownames(m1)
[1] "a" "b" "c"
> colnames(m1)
[1] "d" "e" "f" "g"
> dimnames(x)
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
[[3]]
[1] "h" "i"
It is possible to extract the element at the nth row and mth column using the
expression M[n, m]. The entire nth row can be extracted using M[n, ] and similarly,
the mth column can be extracted using M[,m]. Also, it is possible to extract more
than one column or row.
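The matrix M used in these examples is consistent with:
> M = matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> M
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9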
> M[2,3]
[1] 6
> M[2,]
[1] 4 5 6
> M[,3]
[1] 3 6 9
> M[,c(1,3)]
[,1] [,2]
[1,] 1 3
[2,] 4 6
[3,] 7 9
> M[c(1,3),]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 7 8 9
The columns of two matrices can be combined using the cbind() function and
similarly the rows of two matrices can be combined using the rbind() function.
> M1 = matrix(c(2,4,6,8,10,12), nrow=3, ncol=2)
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9), nrow=3, ncol = 1)
> M2
[,1]
[1,] 3
[2,] 6
[3,] 9
> cbind(M1, M2)
[,1] [,2] [,3]
[1,] 2 8 3
[2,] 4 10 6
[3,] 6 12 9
> M3 = matrix(c(4,8), nrow=1, ncol=2)
> M3
[,1] [,2]
[1,] 4 8
> rbind(M1, M3)
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
[4,] 4 8
A matrix can be deconstructed using the c() function which combines all
column vectors into one.
> c(M1)
[1] 2 4 6 8 10 12
Arithmetic on matrices works element-wise. Recall M1:
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9,11,1,5), nrow=3, ncol = 2)
> M2
[,1] [,2]
[1,] 3 11
[2,] 6 1
[3,] 9 5
> M1 + M2
[,1] [,2]
[1,] 5 19
[2,] 10 11
[3,] 15 17
> M1 * M2
[,1] [,2]
[1,] 6 88
[2,] 24 10
[3,] 54 60
For true matrix multiplication the operator %*% is used; here M2 is redefined as
a 2 x 2 matrix so that the dimensions conform.
> M2 = matrix(c(3,6,9,11), nrow=2, ncol = 2)
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M1 %*% M2
[,1] [,2]
[1,] 54 106
[2,] 72 146
[3,] 90 186
The power operator “^” also works element wise on matrices. To find the
inverse of a matrix the function solve() can be used.
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M2^-1
[,1] [,2]
[1,] 0.3333333 0.11111111
[2,] 0.1666667 0.09090909
> solve(M2)
[,1] [,2]
[1,] -0.5238095 0.4285714
[2,] 0.2857143 -0.1428571
2.4. Lists
Lists allow us to combine different data types in a single variable. Lists can be
created using the list() function, which is similar to the c() function: the contents
of the list are simply given as arguments, separated by commas. List elements can
be vectors, matrices or functions. It is possible to name the elements of a list at
creation, or later using the names() function.
> L <- list(c(9,1, 4, 7, 0), matrix(c(1,2,3,4,5,6), nrow = 3))
> L
[[1]]
[1] 9 1 4 7 0
[[2]]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> names(L) <- c("Num", "Mat")
> L
$Num
[1] 9 1 4 7 0
$Mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> L <- list(Num = c(9,1, 4, 7, 0), Mat = matrix(c(1,2,3,4,5,6), nrow = 3))
> L
$Num
[1] 9 1 4 7 0
$Mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
Lists can be nested: a list can be an element of another list. Vectors, arrays and
matrices, by contrast, are not recursive (nested); they are atomic. The functions
is.recursive() and is.atomic() show whether a variable's type is recursive or atomic,
respectively.
> is.atomic(list())
[1] FALSE
> is.recursive(list())
[1] TRUE
> is.atomic(L)
[1] FALSE
> is.recursive(L)
[1] TRUE
> is.atomic(matrix())
[1] TRUE
> is.recursive(matrix())
[1] FALSE
The length() function works on lists as it does on vectors and matrices, but the
dim(), nrow() and ncol() functions return NULL.
> length(L)
[1] 2
> dim(L)
NULL
> nrow(L)
NULL
> ncol(L)
NULL
Arithmetic operations on a list are possible only if the elements of the list are of
the same data type, and generally they are not recommended. As with vectors, the
elements of a list can be accessed by indexing with square brackets. The index
can be a positive number, a negative number, element names or logical values.
> L1 <- list(l1 = c(8, 9, 1), l2 = matrix(c(1,2,3,4), nrow = 2),
l3 = list( l31 = c("a", "b"), l32 = c(TRUE, FALSE) ))
> L1
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
$l3
$l3$l31
[1] "a" "b"
$l3$l32
[1] TRUE FALSE
> L1[1:2]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
> L1[-3]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
> L1["l2"] # indexing by element name (command reconstructed)
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
The list t below contains copies of the vectors a, b and d. A list slice is retrieved
using single square brackets []; t[2] is a slice containing a copy of b. A slice with
multiple members can also be retrieved.
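The creation of t falls outside this excerpt; a reconstruction consistent with the outputs below (the members other than b are hypothetical) is:
> a <- c(1, 2, 3)
> b <- c("abc", "def", "ghi", "jkl", "mno")
> d <- 5
> t <- list(a, b, a, d)   # only t[[2]] and t[[4]] are used below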
> t[2]
[[1]]
[1] "abc" "def" "ghi" "jkl" "mno"
> t[c(2,4)]
[[1]]
[1] "abc" "def" "ghi" "jkl" "mno"
[[2]]
[1] 5
To reference a list member directly double square bracket [[]] is used. Thus
t[[2]] retrieves the second member of the list t. This results in a copy of b, but not a
slice of b. It is also possible to modify the contents of the elements directly, but the
contents of b are unaffected.
> t[[2]]
[1] "abc" "def" "ghi" "jkl" "mno"
> t[[2]][1] = "qqq"
> t[[2]]
[1] "qqq" "def" "ghi" "jkl" "mno"
> b
[1] "abc" "def" "ghi" "jkl" "mno"
We can assign names to the list members and reference lists by names instead of
numeric indexes. A list of two members is given as example below with the member
names as “first” and “second”. The list slice containing the member “first” can be
retrieved using the square brackets [] as shown below.
> l = list(first = c(1,2,3), second = c("a", "b", "c"))
> l
$first
[1] 1 2 3
$second
[1] "a" "b" "c"
> l["first"]
$first
[1] 1 2 3
The named list member can also be directly referenced with the $ operator or
double square brackets [[]] as below.
> l$first
[1] 1 2 3
> l[["first"]]
[1] 1 2 3
A vector can be converted into a list using the as.list() function.
> v <- c(7, 3, 9, 2, 6)
> as.list(v)
[[1]]
[1] 7
[[2]]
[1] 3
[[3]]
[1] 9
[[4]]
[1] 2
[[5]]
[1] 6
A list of single-element character vectors can likewise be converted with
as.character().
> L1 <- list("aaa", "bbb", "ccc")
> as.character(L1)
[1] "aaa" "bbb" "ccc"
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L1
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55
The unlist() function flattens a list into a named vector.
> unlist(L1)
l11 l12 l13 l21 l22 l23 l24 l25
78 90 21 11 22 33 44 55
The c() function can also be used to combine lists as we do for vectors.
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L2 <- list("aaa", "bbb", "ccc")
> c(L1, L2)
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55
[[3]]
[1] “aaa”
[[4]]
[1] “bbb”
[[5]]
[1] “ccc”
2.5. Data Frames
A data frame is a table-like structure built from vectors of equal length, created
with the data.frame() function. By default the row names are automatically
numbered from 1 to the number of rows in the data frame; it is also possible to
provide row names manually using the row.names argument, as below.
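The vectors a, b and d used to build df1 were presumably defined earlier, consistent with:
> a <- 1:3
> b <- c("a", "b", "c")
> d <- c(TRUE, FALSE, TRUE)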
> df1 = data.frame(a, b, d, row.names = c("one", "two", "three"))
> df1
a b d
one 1 a TRUE
two 2 b FALSE
three 3 c TRUE
> nrow(df1)
[1] 3
> ncol(df1)
[1] 3
> dim(df1)
[1] 3 3
> length(df1)
[1] 3
> colnames(df1)
[1] “a” “b” “d”
It is possible to create a data frame from vectors of different lengths, as long as
the shorter ones can be recycled to match the length of the longer ones; otherwise
an error is thrown.
> df2 <- data.frame(x = 1, y = 2:3, y = 4:7)
> df2
x y y.1
1 1 2 4
2 1 3 5
3 1 2 6
4 1 3 7
The argument check.names can be set to FALSE so that the data frame does not
adjust the column names to be valid and unique.
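A sketch of this behaviour (note the duplicated column name is preserved):
> df3 <- data.frame(x = 1, y = 2:3, y = 4:7, check.names = FALSE)
> colnames(df3)
[1] "x" "y" "y"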
There are many built-in data frames available in R (for example, mtcars). When
this data frame is invoked, it produces the result below.
> mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
............
The top line contains the header or the column names. Each row denotes a record
or a row in the table. A row begins with the name of the row. Each data member of a
row is called a cell. To retrieve a cell value, we enter the row and the column number
of the cell in square brackets [] separated by a comma. The cell value of the second
row and third column is retrieved as below. The row and the column names can also
be used inside the square brackets [] instead of the row and column numbers.
> mtcars[2, 3]
[1] 160
> mtcars[“Mazda RX4 Wag”, “disp”]
[1] 160
The nrow() function gives the number of rows in a data frame and the ncol()
function gives the number of columns in a data frame. To get the preview or the first
few records of a data frame along with the header the head() function can be used.
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11
> head(mtcars)
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
......
To retrieve a column from a data frame we use double square brackets [[]] and
the column name or the column number inside the [[]]. The same can be achieved
by making use of the $ symbol as well. This same result can also be achieved by
using single brackets [] by mentioning a comma instead of the row name / number
and using the column name / number as the second index inside the [].
> mtcars[["hp"]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[[4]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars$hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[,"hp"]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[,4]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
Similarly, if we use the column name or the column number inside a single
square bracket [], we get the below result.
> mtcars[4]
hp
Mazda RX4 110
To retrieve a row from a data frame we use the single square brackets [] only by
mentioning the row name / number as the first index inside [] and a comma instead
of the column name / number.
> mtcars[6,]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
> mtcars[c(6,18),]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
Fiat 128 32.4 4 78.7 66 4.08 2.20....
> mtcars["Valiant",]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
The following examples use a data frame D, whose construction (lost to a page
break) is consistent with:
> x <- c("a", "b", "c", "d", "e", "f")
> y <- c(3, 4, 7, 8, 12, 15)
> z <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
> D <- data.frame(x, y, z)
> D
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE
As for matrices, the transpose of a data frame can be obtained using the
t() function, as below.
> t(D)
  [,1]    [,2]    [,3]    [,4]    [,5]    [,6]
x "a"     "b"     "c"     "d"     "e"     "f"
y " 3"    " 4"    " 7"    " 8"    "12"    "15"
z " TRUE" " TRUE" "FALSE" " TRUE" "FALSE" " TRUE"
The functions rbind() and cbind() can also be applied to data frames, as for
matrices. The condition for rbind() is that the column names must match; cbind()
performs no such check, even if column names end up duplicated.
> x1 <- c("aaa", "bbb", "ccc", "ddd", "eee", "fff")
> y1 <- c(9, 12, 17, 18, 23, 32)
> z1 <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
> E <- data.frame(x1, y1, z1)
> E
x1 y1 z1
1 aaa 9 TRUE
2 bbb 12 FALSE
3 ccc 17 TRUE
4 ddd 18 FALSE
5 eee 23 TRUE
6 fff 32 FALSE
> cbind(D, E)
x y z x1 y1 z1
1 a 3 TRUE aaa 9 TRUE
2 b 4 TRUE bbb 12 FALSE
3 c 7 FALSE ccc 17 TRUE
4 d 8 TRUE ddd 18 FALSE
5 e 12 FALSE eee 23 TRUE
6 f 15 TRUE fff 32 FALSE
> y <- c(9, 12, 17, 18, 23, 32) # y and z redefined, consistent with the output below
> z <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
> F <- data.frame(x, y, z)
> F
x y z
1 a 9 TRUE
2 b 12 FALSE
3 c 17 TRUE
4 d 18 FALSE
5 e 23 TRUE
6 f 32 FALSE
> rbind(D, F)
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE
7 a 9 TRUE
8 b 12 FALSE
9 c 17 TRUE
10 d 18 FALSE
11 e 23 TRUE
12 f 32 FALSE
The merge() function can be applied to merge two data frames, provided they
have common column names. By default, the merge() function does the merging
based on all the common columns; otherwise, one of the common column names
has to be specified.
> merge(D, F, by = "x", all = TRUE)
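The output of the merge, lost to the page break, would be (given D and F as above):
  x y.x   z.x y.y   z.y
1 a   3  TRUE   9  TRUE
2 b   4  TRUE  12 FALSE
3 c   7 FALSE  17  TRUE
4 d   8  TRUE  18 FALSE
5 e  12 FALSE  23  TRUE
6 f  15  TRUE  32 FALSE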
The functions rowSums() and rowMeans() compute row totals and row means of
a numeric data frame (G is a numeric data frame with three columns, defined
earlier).
> rowSums(G[1:3, ])
1 2 3
45 48 51
> rowMeans(G[2:4, ])
2 3 4
16 17 18
2.6. Factors
Factors are used to store categorical data, like gender ("male" or "female").
Depending on the context, they behave sometimes like character vectors and
sometimes like integer vectors. Consider a data frame that stores the weights of a
few males and females: the column that stores the gender is a factor, as it holds
categorical data. The choices "female" and "male" are called the levels of the factor,
which can be viewed using the levels() function and counted with the nlevels()
function.
> weight <- data.frame(wt_kg = c(60,82,45, 49,52,75,68),
gender = c("female", "male", "female", "female", "female", "male", "male"))
> weight
wt_kg gender
1 60 female
2 82 male
3 45 female
4 49 female
5 52 female
6 75 male
7 68 male
> weight$gender
[1] female male female female female male male
Levels: female male
At the atomic level a factor can be created using the factor() function, which
takes a character vector as the argument.
> gender <- factor(c("female", "male", "female", "female", "female", "male", "male"))
> gender
[1] female male female female female male male
Levels: female male
The levels argument can be used in the factor() function to specify the levels of
the factor. It is also possible to change the levels once the factor is created, using
the levels() function or the relevel() function; relevel() just specifies which level
comes first.
> gender <- factor(c("female", "male", "female", "female", "female",
"male", "male"), levels = c("male", "female"))
> gender
[1] female male female female female male male
Levels: male female
> levels(gender) <- c("F", "M")
> gender
[1] M F M M M F F
Levels: F M
> relevel(gender, "M")
[1] M F M M M F F
Levels: M F
It is possible to drop a level from a factor using the function droplevels() when
the level is not in use as in the example below. [Note: the function is.na() is used to
remove the missing value].
> diet <- data.frame(eat = c("fruit", "fruit", "vegetable", "fruit"),
type = c("apple", "mango", NA, "papaya"))
> diet
eat type
1 fruit apple
2 fruit mango
3 vegetable <NA>
4 fruit papaya
> diet <- subset(diet, !is.na(type))
> diet
eat type
1 fruit apple
2 fruit mango
4 fruit papaya
> diet$eat
[1] fruit fruit fruit
Levels: fruit vegetable
> levels(diet)
NULL
> levels(diet$eat)
[1] “fruit” “vegetable”
> unique(diet$eat)
[1] fruit
Levels: fruit vegetable
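Applying droplevels(), as described above, then discards the unused level:
> diet$eat <- droplevels(diet$eat)
> levels(diet$eat)
[1] "fruit"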
In some cases, the levels need to be ordered as in rating a product or course. The
ratings can be “Outstanding”, “Excellent”, “Very Good”, “Good”, “Bad”. When a
factor is created with these levels, it is not necessary they are ordered. So, to order the
levels in a factor, we can either use the function ordered() or the argument ordered =
TRUE in the factor() function. Such ordering can be useful when analysing survey
data.
> ch <- c("Outstanding", "Excellent", "Very Good", "Good", "Bad")
> val <- sample(ch, 100, replace = TRUE)
> rating <- factor(val, ch)
> rating
[1] Outstanding Bad Outstanding Good Very Good Very Good
[7] Excellent Outstanding Bad Excellent Very Good Bad
...
Levels: Outstanding Excellent Very Good Good Bad
> is.factor(rating)
[1] TRUE
> is.ordered(rating)
[1] FALSE
> rating_ord <- ordered(val, ch)
> is.factor(rating_ord)
[1] TRUE
> is.ordered(rating_ord)
[1] TRUE
> rating_ord
[1] Outstanding Bad Outstanding Good Very Good Very Good
...
Levels: Outstanding < Excellent < Very Good < Good < Bad
Numeric values can be summarized into factors using the cut() function and
the result can be viewed using the table() function which lists the count of numbers
in each category. For example let us consider the variable age which has the numeric
values of ages. These ages can be grouped using the cut() function with an interval
of 10 and the result is a factor age_group.
> age <- c(18,20, 31, 32, 33, 35, 41, 38, 45, 48, 51, 27, 29, 42, 39)
> age_group <- cut(age, seq.int(15, 55, 10))
> age
[1] 18 20 31 32 33 35 41 38 45 48 51 27 29 42 39
> age_group
[1] (15,25] (15,25] (25,35] (25,35] (25,35] (25,35] (35,45] (35,45] (35,45] (45,55]
[11] (45,55] (25,35] (25,35] (35,45] (35,45]
Levels: (15,25] (25,35] (35,45] (45,55]
> table(age_group)
age_group
(15,25] (25,35] (35,45] (45,55]
2 6 5 2
The function gl() can be used to create a factor. Its first argument tells how many
levels the factor contains, and the second tells how many times each level has to be
repeated. The function can also take the argument labels, which lists the names of
the factor levels, and it can be made to produce alternating values of the labels, as
in the sketch below.
> gl(5,3)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
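The labels argument and the alternating form mentioned above would look like this (a sketch):
> gl(3, 2, labels = c("low", "medium", "high"))
[1] low low medium medium high high
Levels: low medium high
> gl(2, 1, 10, labels = c("odd", "even"))   # alternating values
[1] odd even odd even odd even odd even odd even
Levels: odd even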
The factors thus generated can be combined using the function interaction() to
get a resultant combined factor.
> fac1 <- gl(5, 3, labels = c("one", "two", "three", "four", "five"))
> fac2 <- gl(5, 1, 15, labels = c("a", "b", "c", "d", "e"))
> interaction(fac1, fac2)
[1] one.a one.b one.c two.d two.e two.a three.b three.c three.d four.e
[11] four.a four.b five.c five.d five.e
25 Levels: one.a two.a three.a four.a five.a one.b two.b three.b four.b ... five.e
2.7. Strings
Strings are stored in character vectors, and most string manipulation functions
act on character vectors. Character vectors can be created using the c() function,
enclosing each string in double or single quotes (generally double quotes are
preferred). The paste() function can be used to concatenate two strings with a space
in between; if the space is not wanted, we use the function paste0(). To have a
specific separator between the concatenated strings, we use the argument sep in
the paste() function, and the result can be collapsed into one string using the
collapse argument.
> c("String 1", 'String 2')
[1] "String 1" "String 2"
> paste(c("Pine", "Red"), "Apple")
[1] "Pine Apple" "Red Apple"
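The paste0(), sep and collapse variants described above, sketched:
> paste0(c("Pine", "Red"), "Apple")
[1] "PineApple" "RedApple"
> paste(c("Pine", "Red"), "Apple", sep = "-")
[1] "Pine-Apple" "Red-Apple"
> paste(c("Pine", "Red"), "Apple", collapse = " and ")
[1] "Pine Apple and Red Apple"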
The toString() function can be used to convert a numeric vector into a single
character string, with the elements separated by a comma and a space. It is also
possible to specify the width of the printed string in this function.
> x <- c(1:10)^3
> x
[1] 1 8 27 64 125 216 343 512 729 1000
> toString(x)
[1] "1, 8, 27, 64, 125, 216, 343, 512, 729, 1000"
> toString(x, 18)
[1] "1, 8, 27, 64, ...."
The cat() function is also similar to the paste() function, but with a small
difference, as shown below.
> cat(c("Red", "Pine"), "Apple")
Red Pine Apple
The noquote() function forces string outputs to be displayed without quotes.
> a <- c("I", "am", "a", "data", "scientist")
> a
[1] "I" "am" "a" "data" "scientist"
> noquote(a)
[1] I am a data scientist
The formatC() function is used to format the numbers and display them as
strings. This function has the arguments digits, width, format, flag etc which can be
used as below. A slight variation of the function formatC() is the function format()
whose usage is as shown below.
> h <- c(4.567, 8.981, 27.772)
> h
[1] 4.567 8.981 27.772
> formatC(h)
[1] "4.567" "8.981" "27.77"
> formatC(h, digits = 3)
[1] "4.57" "8.98" "27.8"
> formatC(h, digits = 3, width = 5)
[1] " 4.57" " 8.98" " 27.8"
> formatC(h, digits = 3, format = "e")
[1] "4.567e+00" "8.981e+00" "2.777e+01"
> formatC(h, digits = 3, flag = "+")
[1] "+4.57" "+8.98" "+27.8"
> format(h)
[1] " 4.567" " 8.981" "27.772"
> format(h, digits = 3)
[1] " 4.57" " 8.98" "27.77"
> format(h, digits = 3, trim = TRUE)
[1] "4.57" "8.98" "27.77"
The sprintf() function is also used for formatting strings, inserting values into the
text. The conversion %s stands for a string, while %d and %f stand for an integer
and a floating-point number respectively. Its usage can be understood from the
example below.
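A sketch with all three conversion types (the values are illustrative):
> sprintf("%s is %d years old and %f feet tall", "Arun", 25, 5.5)
[1] "Arun is 25 years old and 5.500000 feet tall"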
To print a tab within text, we can use the cat() function with the special
character "\t" included in the text; similarly, "\n" inserts a new line. In the cat()
function, the argument fill = TRUE means that after printing the text the cursor is
placed on the next line. If a backslash has to appear in the text, it is preceded by
another backslash. If the text is enclosed in double quotes and contains a double
quote, that quote must also be preceded by a backslash; likewise for a single quote
within single-quoted text. However, a single quote inside double-quoted text, or a
double quote inside single-quoted text, is not a problem (no backslash is needed).
> cat("Black\tBerry", fill = TRUE)
Black Berry
> cat("Black\nBerry", fill = TRUE)
Black
Berry
> cat("Black\\Berry", fill = TRUE)
Black\Berry
> cat("Black\"Berry", fill = TRUE)
Black"Berry
> cat('Black\'Berry', fill = TRUE)
Black'Berry
> cat('Black"Berry', fill = TRUE)
Black"Berry
The functions toupper() and tolower() are used to convert a string into upper case or lower case respectively. The substring() or substr() function is used to cut a part of the string out of the given text; its arguments are the text, the starting position and the ending position. Both functions produce the same result here.
> toupper("The cat is on the Wall")
[1] "THE CAT IS ON THE WALL"
> tolower("The cat is on the Wall")
[1] "the cat is on the wall"
The function strsplit() splits a text into many strings based on the splitting character mentioned as an argument. In the example below the splitting is done whenever a space is encountered. It is important to note that this function returns a list, not a character vector.
> strsplit("I like Banana, Orange and Pineapple", " ")
[[1]]
[1] "I" "like" "Banana," "Orange" "and" "Pineapple"
In the same example, if the text has to be split when a comma or a space is encountered, the pattern is given as ",? ". This means that the comma is optional and the space is mandatory for splitting the given text.
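A sketch of this variant, using the same sentence as above:
> strsplit("I like Banana, Orange and Pineapple", ",? ")
[[1]]
[1] "I" "like" "Banana" "Orange" "and" "Pineapple"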
R's default working directory can be obtained using the function getwd(), and it can be changed using the function setwd(). The directory path mentioned in the setwd() function should use forward slashes instead of backslashes, as in the example below.
> getwd()
[1] "C:/Users/admin/Documents"
> setwd("C:/Program Files/R")
> getwd()
[1] "C:/Program Files/R"
It is also possible to construct file paths using the file.path() function, which automatically inserts the forward slash between the directory names. The function R.home() lists the home directory where R is installed.
> file.path("C:", "Program Files", "R", "R-3.3.0")
[1] "C:/Program Files/R/R-3.3.0"
> R.home()
[1] "C:/PROGRA~1/R/R-33~1.0"
Paths can also be specified in relative terms: "." denotes the current directory, ".." the parent directory and "~" the home directory. The function path.expand() expands the tilde to the user's home directory; as the output below shows, "." and ".." are returned unchanged.
> path.expand(".")
[1] "."
> path.expand("..")
[1] ".."
> path.expand("~")
[1] "C:/Users/admin/Documents"
The function basename() returns only the file name, dropping the directory part if one is given. Conversely, the function dirname() returns only the directory name, dropping the file name.
> filename <- "C:/Program Files/R/R-3.3.0/bin/R.exe"
> basename(filename)
[1] "R.exe"
> dirname(filename)
[1] "C:/Program Files/R/R-3.3.0/bin"
R has three base date and time classes: POSIXct, POSIXlt and Date. POSIX is a set of standards that defines how dates and times should be specified, and "ct" stands for "calendar time". POSIXlt stores dates as a list of seconds, minutes, hours, day of month etc. For storing and calculating with dates we can use POSIXct, and for extracting parts of dates we can use POSIXlt.
The function Sys.time() returns the current date and time. The returned value is by default in the POSIXct form, but it can be converted to the POSIXlt form using the function as.POSIXlt(). When printed, both forms are displayed in the same manner, but their internal storage mechanisms differ. We can access individual components of a POSIXlt date using the dollar symbol or double brackets, as shown below.
> Sys.time()
[1] "2017-05-11 14:31:29 IST"
> t <- Sys.time()
> t1 <- Sys.time()
> t2 <- as.POSIXlt(t1)
> t1
[1] "2017-05-11 14:39:39 IST"
> t2
[1] "2017-05-11 14:39:39 IST"
> class(t1)
[1] "POSIXct" "POSIXt"
> class(t2)
[1] "POSIXlt" "POSIXt"
> t2$sec
[1] 39.20794
> t2[["min"]]
[1] 39
> t2$hour
[1] 14
> t2$mday
[1] 11
> t2$wday
[1] 4
The Date class stores dates as the number of days since the start of 1970. This class is useful when the time of day is insignificant. The as.Date() function can be used to convert a date in another class format to the Date class.
> t3 <- as.Date(t2)
> t3
[1] "2017-05-11"
There are also other add-on packages available in R to handle dates and times, among them date, dates, chron, yearmon, yearqtr, timeDate, ti and jul.
In CSV files, dates are normally stored as strings, and they have to be converted into dates and times using one of these classes or packages. For this we parse the strings using the function strptime(), which returns a date in the POSIXlt format. The date format is specified as a string and passed as an argument to the strptime() function. If the given string does not match the given format, NA is returned.
> date1 <- strptime("22:15:45 22/08/2015", "%H:%M:%S %d/%m/%Y")
> date1
[1] "2015-08-22 22:15:45 IST"
In the format string, "%H" denotes the hour in the 24-hour system, "%M" denotes minutes, "%S" denotes seconds, "%m" denotes the number of the month, "%d" denotes the day of the month as a number and "%Y" denotes the four-digit year.
To convert a date into a string, the function strftime() is used. This function also takes a date formatting string as an argument, like strptime(). In the format string, "%I" denotes the hour in the 12-hour system, "%p" denotes AM/PM, "%A" denotes the name of the day of the week and "%B" denotes the name of the month.
> strftime(Sys.Date(), "It's %I:%M%p on %A %d %B, %Y.")
[1] "It's 12:00AM on Thursday 11 May, 2017."
It is possible to specify the time zone when parsing a date string with the strptime() or strftime() functions. If it is not specified, the default time zone is taken. The function Sys.timezone() returns the system's default time zone, and Sys.getlocale("LC_TIME") returns the locale the operating system uses for formatting dates and times.
> Sys.timezone()
[1] "Asia/Calcutta"
> Sys.getlocale("LC_TIME")
[1] "English_India.1252"
A few of the time zones are UTC (Universal Time), IST (Indian Standard Time), EST (Eastern Standard Time), PST (Pacific Standard Time), GMT (Greenwich Mean Time), etc. It is also possible to give a manual offset from UTC as "UTC+n" or "UTC-n" to denote the west and east parts of UTC respectively. Even though this throws a warning message, it gives the correct result.
> strftime(Sys.time(), tz = "UTC")
[1] "2017-05-12 04:59:04"
The time zone change does not happen in the strftime() function if the date is a POSIXlt date. Hence, it is required to convert to the POSIXct format first and then apply the function. If we add a number to the POSIXct or POSIXlt classes, the time shifts by that many seconds. If we add a number to the Date class, the date shifts by that many days.
> ct <- as.POSIXct(Sys.time())
> lt <- as.POSIXlt(Sys.time())
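For instance, adding 3600 to either form shifts the time by one hour, while adding 1 to a Date object shifts it by one day (a sketch; the first two outputs depend on the current time and are omitted):
> ct + 3600
> lt + 3600
> as.Date("2017-05-12") + 1
[1] "2017-05-13"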
Adding two dates throws an error, but subtracting two dates gives the number of days between them. Alternatively, the difftime() function gives the same result, and it also allows specifying the argument units = "secs" (or "mins", "hours", "days" or "weeks"), as sketched below.
> dt1 <- as.Date("10/10/1973", "%d/%m/%Y")
> dt1
[1] "1973-10-10"
> dt2 <- as.Date("25/09/2000", "%d/%m/%Y")
> dt2
[1] "2000-09-25"
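Using the two dates created above, the subtraction and its difftime() equivalent look like this (the day count follows from the dates above):
> dt2 - dt1
Time difference of 9847 days
> difftime(dt2, dt1, units = "weeks")
Time difference of 1406.714 weeks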
The seq() function can be used to generate a sequence of dates. The argument by can take many options based on the class of the dates specified. We can also apply the mean() and summary() functions on the sequence of dates generated.
> seq(dt1, dt2, by = "1 year")
[1] "1973-10-10" "1974-10-10" "1975-10-10" "1976-10-10" "1977-10-10" "1978-10-10"
[7] "1979-10-10" "1980-10-10" "1981-10-10" "1982-10-10" "1983-10-10" "1984-10-10"
[13] "1985-10-10" "1986-10-10" "1987-10-10" "1988-10-10" "1989-10-10" "1990-10-10"
[19] "1991-10-10" "1992-10-10" "1993-10-10" "1994-10-10" "1995-10-10" "1996-10-10"
[25] "1997-10-10" "1998-10-10" "1999-10-10"
The lubridate package makes date and time manipulation easier. The ymd() function in this package converts any date to the format of year, month and day separated by hyphens. (Note: this function requires the date to be specified in the order of year, month and day, but any separator can be used, as below.)
> install.packages("lubridate")
> library(lubridate)
> ymd("2000/09/25", "2000-9-25", "2000*9.25")
[1] "2000-09-25" "2000-09-25" "2000-09-25"
If the given date is in another format that is not in the order of year, month and day, then we have the functions ydm(), mdy(), myd(), dmy() and dym(). These functions can also be accompanied with a time component using ymd_h(), ymd_hm() and ymd_hms() [similar functions are available for ydm(), mdy(), myd(), dmy() and dym()]. All the parsing functions in the lubridate package return POSIXct dates, and the default time zone is UTC. A function named stamp() in the lubridate package allows formatting of dates in a human-readable way.
> dt_format <- stamp("I purchased on Sunday, the 10th of October 2013 at 6:00:00 PM")
Multiple formats matched: "I purchased on %A, the %dth of %B %Y at %H:%M:%S
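The object returned by stamp() is itself a formatting function; calling it on a date applies the remembered format (a sketch: because the stamp above is ambiguous, the format lubridate picks, and hence the exact output, may vary):
> dt_format(ymd("2013-10-10"))
[1] "I purchased on Thursday, the 10th of October 2013 at 00:00:00"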
The lubridate package has three variable types, namely the “Durations”, “Periods”
and “Intervals”. The lubridate package has the functions, dyears(), dweeks(), ddays(),
dhours(), dminutes(), dseconds() etc that specify the duration of year, week, day, hour,
minute and second in terms of seconds. The duration of 1 minute is 60 seconds,
the duration of 1 hour is 3600 seconds (60 minutes * 60 seconds), the duration of 1
day is 86,400 seconds (24 hours * 60 minutes * 60 seconds), the duration of 1 year
is 31,536,000 seconds (365 days * 24 hours * 60 minutes * 60 seconds) and so on.
The function today() returns the current date.
> y <- dyears(1:5)
> y
[1] "31536000s (~52.14 weeks)" "63072000s (~2 years)" "94608000s (~3 years)"
[4] "126144000s (~4 years)" "157680000s (~5 years)"
> w <- dweeks(1:4)
> w
[1] "604800s (~1 weeks)" "1209600s (~2 weeks)" "1814400s (~3 weeks)"
[4] "2419200s (~4 weeks)"
> d <- ddays(1:10)
> d
[1] "86400s (~1 days)" "172800s (~2 days)" "259200s (~3 days)"
[4] "345600s (~4 days)" "432000s (~5 days)" "518400s (~6 days)"
[7] "604800s (~1 weeks)" "691200s (~1.14 weeks)" "777600s (~1.29 weeks)"
[10] "864000s (~1.43 weeks)"
> today() + y
[1] "2018-05-12" "2019-05-12" "2020-05-11" "2021-05-11" "2022-05-11"
“Periods” specify time spans according to the clock time. The lubridate package
has the functions, years(), weeks(), days(), hours(), minutes(), seconds() etc that
specify the period of year, week, day, hour, minute and second in terms of clock
time. The exact length of these periods can be realized only if they are added to an
instance of date or time.
> y <- years(1:7)
> y
[1] "1y 0m 0d 0H 0M 0S" "2y 0m 0d 0H 0M 0S" "3y 0m 0d 0H 0M 0S" "4y 0m 0d 0H 0M 0S"
[5] "5y 0m 0d 0H 0M 0S" "6y 0m 0d 0H 0M 0S" "7y 0m 0d 0H 0M 0S"
> today() + y
[1] "2018-05-12" "2019-05-12" "2020-05-12" "2021-05-12" "2022-05-12" "2023-05-12"
[7] "2024-05-12"
“Intervals” are defined by the instance of date or time at the beginning and
end. They are mostly used for specifying “Periods” and “Durations” and conversion
between “Periods” and “Durations”.
> yr <- dyears(5)
> yr
[1] "157680000s (~5 years)"
> as.period(yr)
[1] "5y 0m 0d 0H 0M 0S"
> sdt <- ymd("2017-05-12")
> int <- new_interval(sdt, sdt + yr)
> int
[1] 2017-05-12 UTC--2022-05-11 UTC
The operator “%--%” is used for defining intervals and the operator “%within%”
is used for checking if a given date is within the given interval.
> intv <- ymd("1973-10-10") %--% ymd("2000-09-25")
> intv
[1] 1973-10-10 UTC--2000-09-25 UTC
> ymd("1979-12-12") %within% intv
[1] TRUE
The function with_tz() can be used to change the time zone of a date (correctly
handles POSIXlt dates) and the function force_tz() is used for updating incorrect
time zones.
> with_tz(Sys.time(), tz = "America/Los_Angeles")
[1] "2017-05-12 06:44:14 PDT"
> with_tz(Sys.time(), tz = "Asia/Kolkata")
[1] "2017-05-12 19:14:29 IST"
The functions floor_date() and ceiling_date() can be used to find the lower and
upper limit of a given date as below.
> floor_date(today(), "year")
[1] "2017-01-01"
> ceiling_date(today(), "year")
[1] "2018-01-01"
> floor_date(today(), "month")
[1] "2017-05-01"
> ceiling_date(today(), "month")
[1] "2017-06-01"
HIGHLIGHTS
The basic data types in R are Numeric, Integer, Complex, Logical and
Character.
R has three date and time base classes and they are POSIXct, POSIXlt and
Date.
The function Sys.time() is used to return the current date and time.
The as.Date() function can be used to convert a date in other class formats
to the Date class format.
The function Sys.timezone() is used to get the default time zone of the system.
The lubridate package has the functions, dyears(), dweeks(), ddays(),
dhours(), dminutes(), dseconds() etc that specify the duration of year,
week, day, hour, minute and second in terms of seconds.
The lubridate package has the functions, years(), weeks(), days(), hours(),
minutes(), seconds() etc that specify the period of year, week, day, hour,
minute and second in terms of clock time.
CHAPTER 3
Data Preparation
OBJECTIVES
3.1. Datasets
R has many datasets built in, and it can read data from a variety of other data sources and in a variety of formats. One of the packages in R is datasets, which is filled with example datasets. Many other packages also contain datasets. We can see all the datasets available in the loaded packages using the data() function.
To access a particular dataset, use the data() function with the dataset name (enclosed in double quotes) as its argument; the second, optional argument is the name of the package containing the dataset, required only if that package is not loaded. The invoked dataset can then be previewed like any data frame using the head() function.
> data("kidney", package = "survival")
> head(kidney)
id time status age sex disease frail
1 1 8 1 28 1 Other 2.3
2 1 16 1 28 1 Other 2.3
3 2 23 1 48 2 GN 1.9
….
Text documents come in several formats. Common formats are CSV (Comma Separated Values), XML (Extensible Markup Language), JSON (JavaScript Object Notation) and YAML. An example of unstructured text data is a book.
A Comma Separated Values (CSV) file holds spreadsheet-like data stored with comma-delimited values. The read.table() function reads such files and stores the result in a data frame. If the data has a header row, the argument header = TRUE must be passed to the read.table() function. The argument fill = TRUE makes the read.table() function substitute NA values for the missing fields. The system.file() function is used to locate files that are inside a package. In the example below, "extdata" is the folder name, "learningr" is the package name and "RedDeerEndocranialVolume.dlm" is the file name. The str() function takes the data frame name as its argument and lists the structure of the dataset stored in the data frame.
> install.packages("learningr")
> library(learningr)
> deer_file <- system.file("extdata", "RedDeerEndocranialVolume.dlm",
package = "learningr")
> deer_data <- read.table(deer_file, header = TRUE, fill = TRUE)
> str(deer_data)
'data.frame': 33 obs. of 8 variables:
$ SkullID : Factor w/ 33 levels "A4","B11","B12",..: 14 2 17 16 15 13 10 11
19 3 ...
$ VolCT : int 389 389 352 388 375 325 346 302 379 410 ...
$ VolBead : int 375 370 345 370 355 320 335 295 360 400 ...
$ VolLWH : int 1484 1722 1495 1683 1458 1363 1250 1011 1621 1740 ...
$ VolFinarelli: int 337 377 328 377 328 291 289 250 347 387 ...
$ VolCT2 : int NA NA NA NA NA NA 346 303 375 413 ...
The column names and row names are listed by default, and if the row names are not given in the dataset, the rows are simply numbered 1, 2, 3 and so on. The remaining arguments specify how the file is read. The argument sep determines the character used as the separator between fields. The nrows argument specifies the number of lines of data to read, and the argument skip specifies the number of lines to skip at the start of the file. For the function read.csv() the default separator is the comma and a header row is assumed; read.table() instead defaults to whitespace separation with no header. The function read.csv2() uses the semicolon as the separator and the comma for decimal places. The read.delim() function imports tab-delimited files with full stops for decimal places, and read.delim2() imports tab-delimited files with commas for decimal places.
> read.csv(deer_file, header = FALSE, skip = 3, nrows = 2)
V1
1 DIC90 352 345 1495 328
2 DIC83 388 370 1683 377
> head(deer_data)
SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
1 DIC44 389 375 1484 337 NA NA NA
2 B11 389 370 1722 377 NA NA NA
3 DIC90 352 345 1495 328 NA NA NA
….
The colbycol and sqldf packages contain functions that allow reading part of a CSV file into R. These are useful when we don't need all the columns or all the rows. For low-level control we can use the scan() function to import a CSV file. For data exported from other software we may need to pass the na.strings argument to the read.table() function to mark the missing values. If the data is exported from SQL, we use na.strings = "NULL"; if the data is exported from SAS or Stata, we use na.strings = "."; if the data is exported from Excel we use na.strings = c("", "#N/A", "#DIV/0!", "#NUM!").
Writing data from R into a file is easier than reading files into R. For this we use the functions write.table() and write.csv(). These functions take a data frame and a file path as arguments. They also have arguments to specify whether to include row names in the output file and to specify the character encoding of the output file.
> write.csv(deer_data, "F:/deer.csv", row.names = FALSE, fileEncoding = "utf8")
If the file structure is weak, it is easier to read the file as lines of text using the function readLines() and then parse the contents. The readLines() function accepts a path to the file as its argument. Similarly, the writeLines() function takes a text line or a character vector and a file name as arguments and writes the text to the file.
> tempest <- readLines("F:/Tempest.txt")
> tempest
[1] "The writing of Prefaces to Plays was probably invented by some very"
[2] "ambitious Poet, who never thought he had done enough: Perhaps by some"
[3] "Ape of the French Eloquence, which uses to make a business of a Letter of"
....
> writeLines("This book is about a story by Shakespeare", "F:/story.csv")
XML files are used for storing nested data. A few examples are RSS (Really Simple Syndication) feeds, SOAP (Simple Object Access Protocol) messages and XHTML web pages. To read XML files, the XML package has to be installed. When an XML file is imported, the result can be stored using internal nodes or R-level nodes. If the result is stored using internal nodes, the node tree can be queried using the XPath language (used for interrogating XML documents). An XML file can be imported using the xmlParse() function. This function can take the argument useInternalNodes = FALSE to use R-level nodes instead of internal nodes while importing; the xmlTreeParse() function sets this by default.
> install.packages("XML")
> library(XML)
The functions for importing HTML pages are htmlParse() and htmlTreeParse(), and they behave the same way as the xmlParse() and xmlTreeParse() functions.
The two packages dealing with JSON data are RJSONIO and rjson, of which RJSONIO is the more capable. The function used to import a JSON file is fromJSON() and the function used to export one is toJSON(). The yaml package has two functions for importing YAML data, yaml.load() and yaml.load_file(); the function as.yaml() performs the reverse task of converting R objects to YAML strings.
Many programs store their data in binary formats, which are smaller than text files. They thus provide performance gains at the expense of human readability.
Excel is perhaps the world's most widely used data analysis tool, and its document formats are XLS and XLSX. Spreadsheets can be imported with the functions read.xlsx() and read.xlsx2(). The optional colClasses argument determines what class each column should have in the resulting data frame. To write to an Excel file from R we use the function write.xlsx2(), which takes the data frame and the file name as arguments. Another package, xlsReadWrite, offers the same functionality as the xlsx package, but it works only in 32-bit R installations and only on Windows.
> install.packages("xlsx")
> library(xlsx)
> logfile <- read.xlsx2("F:/Log2015.xls", sheetIndex = 1, startRow = 2, endRow = 72,
colIndex = 1:5, colClasses = c("character", "numeric", "character",
"character", "integer"))
The files from a statistical package are imported using the foreign package. The
read.ssd() function is used to read SAS datasets and the read.dta() function is
used to read Stata DTA files. The read.spss() function is used to import the SPSS
data files. Similarly, these files can be written with the write.foreign() function.
The MATLAB binary data files can be read and written using the readMat() and
writeMat() functions in the R.matlab package. The files in picture formats can be
read via the jpeg, png, tiff, rtiff and readbitmap packages.
R has ways to import data from web sources using an Application Programming Interface (API). For example, the World Bank makes its data available through the WDI package, and Polish government data can be accessed using the SmarterPoland package. The twitteR package provides access to Twitter users and their tweets.
The read.table() function can accept a URL rather than a local file. Accessing a large file over the internet can be slow, so if the file is required frequently it is better to download it using the download.file() function, create a local copy and import that.
> cancer_url <- "http://repository.seasr.org/Datasets/UCI/csv/breast-cancer.csv"
> cancer_data <- read.csv(cancer_url)
> str(cancer_data)
'data.frame': 287 obs. of 10 variables:
$ age : Factor w/ 7 levels "20-29","30-39",..: 7 3 4 4 3 3 4 4 3 3 ...
$ menopause : Factor w/ 4 levels "ge40","lt40",..: 4 3 1 1 3 3 3 1 3 3 ...
The function dbDisconnect() disconnects from the database, and the function dbUnloadDriver() unloads the defined database driver.
> dbDisconnect(conn)
> dbUnloadDriver(driver)
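These calls assume a connection was opened earlier. A minimal sketch of the usual DBI workflow with the RSQLite driver (the database file and table names here are hypothetical) would be:
> library(RSQLite)
> driver <- dbDriver("SQLite")
> conn <- dbConnect(driver, dbname = "company.sqlite")
> dbListTables(conn)
> results <- dbGetQuery(conn, "SELECT * FROM employees")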
For MySQL database we need to load the RMySQL package and set the
driver type to be “MySQL”. The PostgreSQL, Oracle and JDBC databases need
the PostgreSQL, ROracle and RJDBC packages respectively. To connect to an
SQL Server or Access databases, the RODBC package needs to be loaded. In this
package, the function odbcConnect() is used to connect to the database and the
function sqlQuery() is used to run a query and the function odbcClose() is used to
close and clean up the database connections. There are not yet many mature methods to access NoSQL (Not only SQL) databases (lightweight databases that are more scalable than traditional SQL relational databases). To access the MongoDB database the packages RMongo and rmongodb are used, and the Cassandra database can be accessed using the package RCassandra.
In some datasets or data frames, logical values are represented as "Y" and "N" instead of TRUE and FALSE. In such cases it is possible to replace the strings with the correct logical values, as in the example below.
> a <- c(1, 2, 3)
> b <- c("A", "B", "C")
> d <- c("Y", "N", "Y")
> df1 <- data.frame(a, b, d)
> df1
a b d
1 1 A Y
2 2 B N
3 3 C Y
convt <- function(x)
{
y <- rep.int(NA, length(x))
y[x == "Y"] <- TRUE
y[x == "N"] <- FALSE
y
}
> df1$d <- convt(df1$d)
> df1
a b d
1 1 A TRUE
2 2 B FALSE
3 3 C TRUE
The functions grep() and grepl() are used to find a pattern in a given text, and the functions sub() and gsub() are used to replace a pattern with another in a given text. These four functions belong to the base package, but the stringr package provides many such string manipulation functions. The function str_detect() in the stringr package likewise detects the presence of a given pattern in the given text. We can also wrap the pattern in the function fixed() to state that the string we are searching for is a fixed string rather than a regular expression.
> grep("my", "This is my pen")
[1] 1
> grepl("my", "This is my pen")
[1] TRUE
> sub("my", "your", "This is my pen")
[1] "This is your pen"
> gsub("my", "your", "This is my pen")
[1] "This is your pen"
> library(stringr)
> str_detect("This is my pen", "my")
[1] TRUE
> str_detect("This is my pen", fixed("my"))
[1] TRUE
The function str_split() splits a given text based on the pattern specified, and returns a list. The function str_split_fixed() can be used to split the given text into a fixed number of strings based on the specified pattern; this function returns a matrix.
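A sketch of both functions (the sample string is hypothetical):
> library(stringr)
> str_split("apple,orange,mango", fixed(","))
[[1]]
[1] "apple" "orange" "mango"
> str_split_fixed("apple,orange,mango", fixed(","), 2)
     [,1]    [,2]
[1,] "apple" "orange,mango"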
The function str_replace() replaces the specified pattern with another pattern in the given text, but only the first occurrence. To replace all occurrences of the pattern we use the function str_replace_all(). In these functions, multiple characters to be replaced can be placed within square brackets, meaning that anything matching one of the characters inside the brackets is replaced.
> str_replace("I like mangoes, oranges and pineapples", "s", "sss")
[1] "I like mangoesss, oranges and pineapples"
> str_replace_all("I like mangoes, oranges and pineapples", "s", "sss")
[1] "I like mangoesss, orangesss and pineapplesss"
> str_replace_all("I like mangoes, oranges and pineapples", "[ao]", "-")
[1] "I like m-ng-es, -r-nges -nd pine-pples"
In the example below, the various ways of writing the gender values are transformed into a single form, ignoring case differences. This is done using the str_replace() function with a case-insensitive pattern.
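A sketch using stringr's regex() modifier with ignore_case = TRUE (the vector is hypothetical; recent stringr versions also accept fixed() with an ignore_case argument):
> gender <- c("MALE", "male", "Female", "FEMALE")
> gender <- str_replace(gender, regex("^male$", ignore_case = TRUE), "Male")
> str_replace(gender, regex("^female$", ignore_case = TRUE), "Female")
[1] "Male" "Male" "Female" "Female"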
To add a column to a data frame, we can use a command like the one below.
> name <- c("Jhon", "Peter", "Mark")
> start_date <- c("1980-10-10", "1999-12-12", "1990-04-05")
> end_date <- c("1989-03-08", "2004-09-20", "2000-09-25")
> service <- data.frame(name, start_date, end_date)
> service
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
2 Peter 1999-12-12 2004-09-20
3 Mark 1990-04-05 2000-09-25
> service$period <- as.Date(service$end_date) - as.Date(service$start_date)
> service
name start_date end_date period
1 Jhon 1980-10-10 1989-03-08 3071 days
2 Peter 1999-12-12 2004-09-20 1744 days
3 Mark 1990-04-05 2000-09-25 3826 days
Another way of doing the same is using the function within(). Its advantage shows when there are multiple columns to be added to a data frame: within() handles this in a single command, with later columns allowed to refer to earlier ones, whereas the with() function only evaluates an expression and does not modify the data frame.
> service <- within(service,
{
period <- as.Date(end_date) - as.Date(start_date)
highperiod <- period > 2000
})
> service
name start_date end_date period highperiod
1 Jhon 1980-10-10 1989-03-08 3071 days TRUE
2 Peter 1999-12-12 2004-09-20 1744 days FALSE
3 Mark 1990-04-05 2000-09-25 3826 days TRUE
The mutate() function in the plyr package does the same job as within(), but the syntax is slightly different: the new columns are supplied as named arguments.
> library(plyr)
> service <- mutate(service,
period = as.Date(end_date) - as.Date(start_date),
highperiod = period > 2000)
> service
The function complete.cases() returns a logical vector indicating which rows of a data frame are free of missing values. The function na.omit() removes the rows with missing values from a data frame, and the function na.fail() throws an error if the data frame contains any missing values.
> crime.data <- read.csv("F:/Crimes.csv")
> nrow(crime.data)
[1] 65535
> complete <- complete.cases(crime.data)
> nrow(crime.data[complete, ])
[1] 63799
> clean.crime.data <- na.omit(crime.data)
> nrow(clean.crime.data)
[1] 63799
A data frame can be transformed by choosing a few of the columns and ignoring the rest, while keeping all the rows, as in the example below.
> crime.data <- read.csv("F:/Crimes.csv")
> colnames(crime.data)
[1] "CASE." "DATE..OF.OCCURRENCE" "BLOCK"
[4] "IUCR" "PRIMARY.DESCRIPTION" "SECONDARY.DESCRIPTION"
[7] "LOCATION.DESCRIPTION" "ARREST" "DOMESTIC"
[10] "BEAT" "WARD" "FBI.CD"
[13] "X.COORDINATE" "Y.COORDINATE" "LATITUDE"
[16] "LONGITUDE" "LOCATION"
Alternatively, the data frame can be transformed by selecting only the required
rows and retaining all columns of a data frame as in the example below.
> nrow(crime.data)
[1] 65535
> crime.data2 <- crime.data[1:10,]
> nrow(crime.data2)
[1] 10
The function sort() sorts the given vector of numbers or strings. It generally
sorts from smallest to largest, but this can be altered using the argument decreasing
= TRUE.
> x <- c(5, 10, 3, 15, 6, 8)
> sort(x)
[1] 3 5 6 8 10 15
> sort(x, decreasing = TRUE)
[1] 15 10 8 6 5 3
> y <- c("X", "AB", "Deer", "For", "Moon")
> sort(y)
[1] "AB" "Deer" "For" "Moon" "X"
> sort(y, decreasing = TRUE)
[1] "X" "Moon" "For" "Deer" "AB"
The function order() complements the sort() function: it returns the indices of the vector's elements in sorted order, as below. Consequently x[order(x)] is the same as sort(x), which can be verified with the identical() function.
> order(x)
[1] 3 1 5 6 2 4
> x[order(x)]
[1] 3 5 6 8 10 15
> identical(sort(x), x[order(x)])
[1] TRUE
The order() function is more useful than the sort() function as it can be used to
manipulate the data frames easily.
> name <- c("Jhon", "Peter", "Mark")
> start_date <- c("1980-10-10", "1999-12-12", "1990-04-05")
> end_date <- c("1989-03-08", "2004-09-20", "2000-09-25")
> service <- data.frame(name, start_date, end_date)
> service
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
2 Peter 1999-12-12 2004-09-20
3 Mark 1990-04-05 2000-09-25
> startdt <- order(service$start_date)
> service.ordered <- service[startdt, ]
> service.ordered
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
3 Mark 1990-04-05 2000-09-25
2 Peter 1999-12-12 2004-09-20
The arrange() function of the plyr package achieves the same result as above.
> library(plyr)
> arrange(service, start_date)
The rank() function lists the rank of the elements in a vector or a data frame. By specifying the argument ties.method = "first", elements with the same value get distinct ranks in order of appearance instead of sharing an averaged rank.
> x <- c(9, 5, 4, 6, 4, 5)
> rank(x)
[1] 6.0 3.5 1.5 5.0 1.5 3.5
> rank(x, ties.method = "first")
[1] 6 3 1 5 2 4
The SQL statements can be executed from R and the results can be obtained
as in any other database. The package sqldf needs to be installed to manipulate the
data frames or datasets using SQL.
> install.packages("sqldf")
> library(sqldf)
> query <- "SELECT * FROM iris WHERE Species = 'setosa'"
> sqldf(query)
Data reshaping in R is about changing the way data is organized into rows and columns. Most of the time, data processing in R is done by taking the input data as a data frame, from whose rows and columns data can easily be extracted. But there are situations when we need the data frame in a different format than the one in which we received it. R has several functions to split, merge and transpose the rows and columns of a data frame. The cbind() function can be used to join multiple vectors, as columns, into a data frame, and the rbind() function appends the rows of one data frame to another, as sketched below.
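Judging from the surviving output, the original example built a small city/state/zip data frame. A sketch (the added row is hypothetical):
> city <- c("Lowry", "Charlotte")
> state <- c("CO", "FL")
> zip <- c("80230", "33949")
> addresses <- data.frame(cbind(city, state, zip), stringsAsFactors = FALSE)
> new.address <- data.frame(city = "Denver", state = "CO", zip = "80201",
stringsAsFactors = FALSE)
> rbind(addresses, new.address)
       city state   zip
1     Lowry    CO 80230
2 Charlotte    FL 33949
3    Denver    CO 80201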
The merge() function can be used to merge two data frames. The merging requires the data frames to have common column names on which the merging is done. In the example below, we consider the datasets about diabetes in Pima Indian women available in the MASS package. The two datasets are merged based on the values of blood pressure ("bp") and body mass index ("bmi"). On choosing these two columns for merging, the records where the values of these two variables match in both datasets are combined into a single data frame.
> library(MASS)
> head(Pima.te)
npreg glu bp skin bmi ped age type
1 6 148 72 35 33.6 0.627 50 Yes
2 1 85 66 29 26.6 0.351 31 No
3 1 89 66 23 28.1 0.167 21 No
...
> head(Pima.tr)
npreg glu bp skin bmi ped age type
1 5 86 68 28 30.2 0.364 24 No
2 7 195 70 33 25.1 0.163 55 Yes
3 5 77 82 41 35.8 0.156 35 No
...
> nrow(Pima.te)
[1] 332
> nrow(Pima.tr)
[1] 200
> merged.Pima <- merge(x = Pima.te, y = Pima.tr,
+ by.x = c("bp", "bmi"),
+ by.y = c("bp", "bmi")
+ )
> head(merged.Pima)
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20
2 64 29.7 2 75 24 0.370 33 No 2 100 23
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13
...
ped.y age.y type.y
1 0.088 31 No
2 0.368 21 No
3 0.295 24 No
...
> nrow(merged.Pima)
[1] 17
Now we melt the ships dataset (also from the MASS package) using the melt() function in the reshape2 package, converting all columns other than type and year into multiple rows.
> library(reshape2)
> molten.ships <- melt(ships, id = c("type", "year"))
> head(molten.ships)
type year variable value
1 A 60 period 60
2 A 60 period 75
3 A 65 period 60
4 A 65 period 75
5 A 70 period 60
6 A 70 period 75
> nrow(molten.ships)
[1] 120
> nrow(ships)
[1] 40
We can cast the molten data into a new form where the aggregate of each type of ship for each year is created. With reshape2 this is done using the dcast() function (the equivalent in the older reshape package was called cast()).
> recasted.ship <- dcast(molten.ships, type + year ~ variable, sum)
> head(recasted.ship)
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
R has many apply functions, such as apply(), lapply(), sapply(), vapply(), mapply(), rapply(), tapply(), aggregate() and by(). The function lapply() is a list apply, which acts on a list or vector and returns a list. The function sapply() is a simplified lapply() that returns a vector or matrix when possible. The function vapply() is a verified apply that allows the return type to be pre-specified. The function rapply() is a recursive apply for nested lists, i.e. lists within lists. The function tapply() is a tagged apply, where the tags identify the subsets. The function apply() is generic: it applies a function to a matrix's rows or columns or, more generally, to the dimensions of an array.
If we want to apply a function to each element of a list in turn and get a list
back, we use the lapply() function as below.
> x <- list(a = 1, b = 1:3, c = 10:100)
> x
$a
[1] 1
$b
[1] 1 2 3
$c
[1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[18] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[35] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
[52] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
[69] 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
[86] 95 96 97 98 99 100
> lapply(x, length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
When we want to use the function sapply(), but need to squeeze some more speed
out of the code, we use the function vapply() as below. For the function vapply(), we
give R the information on what the function will return, which can save some time
coercing returned values to fit in a single atomic vector. In the example below, we tell
R that everything returned by length() should be an integer of length 1.
> x <- list(a = 1, b = 1:3, c = 10:100)
> vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
When we have several data structures (e.g. vectors or lists) and we want to apply a function to the first elements of each, then the second elements of each, and so on, coercing the result to a vector or array, we use the function mapply(), as below.
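A sketch of mapply() adding the corresponding elements of two vectors:
> mapply(function(a, b) a + b, 1:4, 5:8)
[1] 6 8 10 12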
When we want to apply a function to subsets of a vector and the subsets are
defined by some other vector, usually a factor, we use the function tapply() as below.
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
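A sketch of such a call (the grouping vector here is hypothetical):
> groups <- rep(c("low", "high"), each = 10)
> tapply(x, groups, sum)
high  low
 155   55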
The by() function can be thought of as a wrapper for tapply(). It comes into play when we want to compute a task that tapply() can't handle.
> cta <- tapply(iris$Sepal.Width , iris$Species , summary )
> cba <- by(iris$Sepal.Width , iris$Species , summary )
> cta
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
> cba
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, cta and cba, we get the same results; the only differences are in how they are displayed, owing to their different class attributes. The power of the by() function arises when we can't use tapply(), as in the following code.
> tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R insists that the arguments have the same length. Suppose we want to calculate the summary of all variables in iris along the factor Species: tapply() simply can't do that, because it does not know how to handle a data frame as its first argument. The by() function lets summary() work even though the first argument is not a vector of the same length as the grouping factor.
> bywork <- by(iris, iris$Species, summary )
> bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
Median :5.000 Median :3.400 Median :1.500 Median :0.200
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
Species
setosa :50
versicolor: 0
virginica : 0
------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
Median :5.900 Median :2.800 Median :4.35 Median :1.300
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
Species
setosa :0
versicolor:50
virginica : 0
------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
Median :6.500 Median :3.000 Median :5.550 Median :2.000
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
Species
setosa :0
versicolor: 0
virginica :50
The result is an object of class by that, along Species, computes the summary of each variable.
The aggregate() function can be seen as another way of using tapply(), if used as follows.
> att <- tapply(iris$Sepal.Length , iris$Species , mean)
> agt <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
> att
setosa versicolor virginica
5.006 5.936 6.588
> agt
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
Two immediate differences are that the second argument of aggregate() must be a list, while for tapply() it may (but need not) be a list, and that the output of aggregate() is a data frame while that of tapply() is an array. The power of the aggregate() function is that it handles subsets of the data easily via the subset argument, and that it handles formulas as well. These features make aggregate() easier to work with than tapply() in some situations.
> ag <- aggregate(len ~ ., data = ToothGrowth, mean)
> ag
HIGHLIGHTS
One of the packages in R is datasets which is filled with example datasets.
We can see all the datasets available in the loaded packages using the
data() function.
The read.table() function reads the CSV files and stores the result in a
data frame.
The system.file() function is used to locate files that are inside a package.
Writing data from R into a file is done using the functions write.table()
and write.csv().
If the file is unstructured, it is read using the function readLines().
The writeLines() function takes a text line and the file name as argument
and writes the text to the file.
The XML file can be imported using the xmlParse() function.
The function used to import the JSON file is fromJSON() and the function
used to export the JSON file is toJSON().
Spreadsheets can be imported with the functions read.xlsx() and read.
xlsx2().
To write to an excel file from R we use the function write.xlsx2().
The read.ssd() function is used to read SAS datasets.
The read.spss() function is used to import the SPSS data files.
The MATLAB binary data files can be read and written using the readMat()
and writeMat() functions in the R.matlab package.
CHAPTER 4
Graphics using R
OBJECTIVES
Exploratory Data Analysis (EDA) is an approach to data analysis that employs a variety of (mostly graphical) techniques to:
1) Maximize insight into a data set
2) Uncover underlying structure
3) Extract important variables
4) Detect outliers and anomalies
5) Test underlying assumptions
6) Develop parsimonious models
7) Determine optimal factor settings.
The pie chart in the base graphics system is drawn with the pie() function, whose main parameters are:
x – numeric vector
labels – description of the slices
radius – value between -1 and +1
A 3D Pie Chart can be drawn using the package plotrix which uses the function
pie3D().
> install.packages("plotrix")
> library(plotrix)
> pie3D(x, labels = labels, explode = 0.1, main = "Flowers")
This plot can be made more appealing and readable by adding colour and
changing the plotting character. For this we use the arguments col and pch (can
take the values between 1 and 25) in the plot() function as below. Thus the plot in
Fig. 4.4 shows that there is a strong positive correlation between the speed of a car
and its stopping distance.
> plot(cars$speed, cars$dist, col = "red", pch = 15)
The layout() function is used to control the layout of multiple plots in a matrix. Thus, in the example below, multiple related plots are placed in a single figure, as in Fig. 4.5.
> data(mtcars)
> layout(matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE))
> plot(mtcars$wt, mtcars$mpg, col = "blue", pch = 17)
> plot(mtcars$wt, mtcars$disp, col = "red", pch = 15)
> plot(mtcars$mpg, mtcars$disp, col = "dark green", pch = 10)
> plot(mtcars$mpg, mtcars$hp, col = "violet", pch = 7)
When we have more than two variables and want to see the correlation of each variable against the others, we use a scatter plot matrix. We use the pairs() function to create matrices of scatter plots, as in Fig. 4.6. The basic syntax for creating scatter plot matrices in R is as below, and a sample call follows.
pairs(formula, data)
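A sample call for a few of the mtcars columns:
> pairs(~ mpg + disp + wt, data = mtcars)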
The lattice graphics system has an equivalent of the plot() function: xyplot(). This function uses a formula to specify the x and y variables (yvar ~ xvar) and a data frame argument. To use this function, the lattice package must be loaded.
> library(lattice)
> xyplot(mtcars$mpg ~ mtcars$disp, mtcars, col = "purple", pch = 7)
Axis scales can be specified in xyplot() using the scales argument, which must be a list of name = value pairs. If we mention log = TRUE, log scales are used for the x and y axes, as in Fig. 4.8. The scales list can take other arguments too, such as x and y, which set the x and y axes respectively.
> xyplot(mtcars$mpg ~ mtcars$disp, mtcars, scales = list(log = TRUE),
col = "red", pch = 11)
Figure 4.8 Scatter Plot Matrix with Axis Scales Using xyplot()
The data in the graph can be split based on one of the columns in the dataset, namely mtcars$carb. This is done by appending the pipe symbol (|) along with the column name used for splitting. The argument relation = "same" means that each panel shares the same axes. If the argument alternating = TRUE, axis ticks for each panel are drawn on alternating sides of the plot, as in Fig. 4.9.
> xyplot(mtcars$mpg ~ mtcars$disp | mtcars$carb, mtcars,
scales = list(log = TRUE, relation = "same", alternating = FALSE),
layout = c(3, 2), col = "blue", pch = 14)
Lattice plots can be stored in variables and hence further modified using the update() function, as below.
> graph1 <- xyplot(mtcars$mpg ~ mtcars$disp | mtcars$carb, mtcars,
scales = list(log = TRUE, relation = "same", alternating = FALSE),
layout = c(3, 2), col = "blue", pch = 14)
> graph2 <- update(graph1, col = "yellow", pch = 6)
In the ggplot2 graphics system, each plot is drawn with a call to the ggplot() function, as in Fig. 4.10. This function takes a data frame as its first argument. The mapping of data frame columns to the x and y axes is done with the aes() function, used within the ggplot() function. The other layers of the graph are then added using geom functions, such as geom_point(), appended to the ggplot() call with a "+" symbol.
> library(ggplot2)
> ggplot(mtcars, aes(mpg, disp)) +
geom_point(color = "purple", shape = 16, cex = 2.5)
ggplots can also be split into several panels like lattice plots, as in Fig. 4.11. This is done using the function facet_wrap(), which takes a formula naming the column used for splitting. The function theme() is used to specify the orientation of the axis labels. The functions facet_wrap() and theme() are appended to the ggplot() call using the "+" symbol. ggplots can be stored in a variable like lattice plots, and as usual, wrapping the expression in parentheses makes it auto-print.
> (graph1 <- ggplot(mtcars, aes(mpg, disp)) +
geom_point(color = "dark green", shape = 15, cex = 3))
> (graph2 <- graph1 + facet_wrap(~mtcars$cyl, ncol = 3) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)))
Line charts in the base graphics system are drawn with the plot() function, whose main parameters are:
v – numeric vector
type – takes the value "p" (only points), "l" (only lines) or "o" (both points and lines)
xlab – label of the x-axis
ylab – label of the y-axis
main – title of the chart
col – colour palette
> plot(male, type = "o", col = "red", xlab = "Month", ylab = "Wages",
main = "Monthly Wages", ylim = c(0, 5000))
> lines(female, type = "o", col = "blue")
> lines(child, type = "o", col = "green")
> legend("topleft", wages, cex = 0.8, fill = color)
Line plots in the lattice graphics system use the xyplot() function, as in Fig. 4.13. Multiple lines can be created using the "+" symbol in the formula where the x and y axes are specified. The argument type = "l" indicates a continuous line.
> xyplot(economics$pop + economics$unemploy ~ economics$date, economics, type = "l")
In the ggplot2 graphics system, the same syntax as for scatter plots is used, except that the geom_point() function is replaced by the geom_line() function, as in Fig. 4.14. However, a separate geom_line() call is needed for each line to be drawn in the graph.
> ggplot(economics, aes(economics$date)) + geom_line(aes(y = economics$pop)) +
geom_line(aes(y = economics$unemploy))
The plot in Fig. 4.15 can also be drawn without multiple geom_line() calls, using the function geom_ribbon() as below. This function plots not only the two lines but also shades the area in between them.
> ggplot(economics, aes(economics$date, ymin = economics$unemploy,
ymax = economics$pop)) + geom_ribbon(color = "blue", fill = "white")
4.6. Histograms
Histograms represent the frequencies of a variable's values, split into ranges. They are similar to bar charts, but histograms group values into continuous ranges. In base graphics, histograms are drawn using the hist() function, as in Fig. 4.16, which takes a vector of numbers as input together with a few more parameters listed below.
hist(v, main, xlab, xlim, ylim, breaks, col, border)
v – numeric vector
main – title of the chart
xlab – label of the x-axis
xlim – range of the x-axis
ylim – range of the y-axis
breaks – width of each bar
col – colour palette
border – border colour
The lattice histogram is drawn using the histogram() function, as in Fig. 4.17, and behaves much like the base one, while additionally allowing easy splitting of data into panels and saving plots as variables. The breaks argument behaves the same way as with hist(). Lattice histograms support counts, probability densities and percentage y-axes via the type argument, which takes the string "count", "density" or "percent", as sketched below.
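A sketch of such a call (the column choice is arbitrary):
> histogram(~ mpg, data = mtcars, breaks = 10, type = "percent")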
Box plots in the base graphics system are drawn with the boxplot() function, whose main parameters are:
x – vector or a formula
data – data frame
notch – logical value (TRUE – draw a notch)
varwidth – logical value (TRUE – box width proportionate to sample size)
names – labels printed under the boxes
main – title of the chart
This type of plot is often clearer if we reorder the box plots from smallest to
largest, in some sense. The reorder() function changes the order of a factor’s levels,
based upon some numeric score.
> boxplot(mpg ~ reorder(gear, mpg, median), data = mtcars,
xlab = "Number of Gears", ylab = "Miles Per Gallon",
main = "Car Mileage", varwidth = TRUE,
col = c("red", "blue", "green"), names = c("Low", "Medium", "High"))
In the lattice graphics the box plot is drawn using the function bwplot() as in
Fig. 4.20.
> bwplot(mpg ~ reorder(gear, mpg, median), data = mtcars,
xlab = "Number of Gears", ylab = "Miles Per Gallon",
main = "Car Mileage", varwidth = TRUE,
col = c("red", "blue", "green"), names = c("Low", "Medium", "High"))
In the ggplot2 graphics the box plot is drawn by adding the function geom_
boxplot() to the function ggplot() as in Fig. 4.21.
> ggplot(mtcars, aes(reorder(gear, mpg, median), mpg)) + geom_boxplot()
By default the bars are vertical, but if we want horizontal bars, they can be
generated with horiz = TRUE parameter as in Fig. 4.23. We can also do some
fiddling with the plot parameters, via the par() function. The las parameter controls
whether labels are horizontal, vertical, parallel, or perpendicular to the axes. Plots
are usually more readable if you set las = 1, for horizontal. The mar parameter is a
numeric vector of length 4, giving the width of the plot margins at the bottom/left/
top/right of the plot.
> x <- matrix(c(1000, 900, 1500, 4400, 800, 2100, 1700, 2900, 3800), nrow = 3, ncol = 3)
> years <- c("2011", "2012", "2013")
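A sketch of the barplot() call for the data above (the colours and margin settings are illustrative):
> par(las = 1, mar = c(3, 6, 3, 1))
> barplot(x, names.arg = years, horiz = TRUE, col = c("red", "blue", "green"))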
Extending this to multiple variables just requires a tweak to the formula, and
passing stack = TRUE to make a stacked plot as in Fig. 4.25.
> barchart(mtcars$mpg ~ mtcars$disp + mtcars$qsec + mtcars$hp, mtcars,
stack = TRUE)
In the ggplot2 graphics the bar chart is drawn by adding the function geom_
bar() to the function ggplot() as in Fig. 4.26. Like base, ggplot2 defaults to vertical
bars; adding the function coord_flip() swaps this. We must pass the argument stat
= “identity” to the function geom_bar().
> ggplot(mtcars, aes(mtcars$mpg, mtcars$disp)) + geom_bar(stat = "identity") +
coord_flip()
HIGHLIGHTS
Exploratory Data Analysis (EDA) shows how to use visualisation and
transformation to explore data in a systematic way.
The main graphical packages are base, lattice and ggplot2.
In R the pie chart is created using the pie() function.
A 3D Pie Chart can be drawn using the package plotrix which uses the
function pie3D().
The basic scatter plot in the base graphics system can be obtained by
using the plot() function.
We use the arguments col and pch (values between 1 and 25) in the plot()
function to specify colour and plot pattern.
The layout() function is used to control the layout of multiple plots in the
matrix.
OBJECTIVES
The basic statistical measures of a numeric field are its minimum, maximum, mean and median, computed with the functions min(), max(), mean() and median() respectively. Let us use the dataset named mtcars, available in R by default, to understand these statistical measures.
> data(mtcars)
> colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> min(mtcars$cyl)
[1] 4
> max(mtcars$cyl)
[1] 8
> mean(mtcars$cyl)
[1] 6.1875
> median(mtcars$cyl)
[1] 6
All the above results can also be obtained by one function summary() and this
can also be applied on all the fields of the dataset at one shot. The range() function
gives the minimum and maximum values of a numeric field at one go.
> summary(mtcars$cyl)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.000 4.000 6.000 6.188 8.000 8.000
> range(mtcars$cyl)
[1] 4 8
5.1.1. Mean
The mean is calculated by taking the sum of the values and dividing by the number of values in a data series. The function mean() is used to calculate this in R. The basic syntax for calculating the mean in R is given below along with its parameters.
mean(x, trim = 0, na.rm = FALSE, ...)
x - numeric vector
trim - to drop some observations from both end of the sorted vector
na.rm - to remove the missing values from the input vector
> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> mean(x)
[1] 9.8
When the trim parameter is supplied, the values in the vector are sorted and then the required number of observations is dropped from each end before calculating the mean. With trim = 0.2, two values from each end are dropped from the calculation. In this case the sorted vector is (-91, -45, 1, 3, 12, 15, 24, 45, 56, 78), and the values removed are (-91, -45) from the left and (56, 78) from the right.
> mean(x, trim = 0.2)
[1] 16.66667
If there are missing values, then the mean() function returns NA. To drop the
missing values from the calculation use na.rm = TRUE, which means remove the
NA values.
> x <- c(45, 56, 78, 12, 3, -91, NA, -45, 15, 1, 24, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 9.8
5.1.2. Median
The middle-most value in a data series is called the median. The median() function
is used in R to calculate this value. The basic syntax for calculating median in R is
given below along with its parameters.
median(x, na.rm = FALSE)
x - numeric vector
na.rm - to remove the missing values from the input vector
> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> median(x)
[1] 13.5
5.1.3. Mode
The mode is the value that has highest number of occurrences in a set of data.
Unlike mean and median, mode can have both numeric and character data. R does
not have a standard in-built function to calculate mode. So we create a user function
to calculate mode of a data set in R. This function takes the vector as input and
gives the mode value as output.
Mode <- function(x)
{
y <- unique(x)
y[which.max(tabulate(match(x, y)))]
}
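A quick check of the function (the vector is hypothetical):
> v <- c(2, 1, 2, 3, 1, 2, 4)
> Mode(v)
[1] 2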
The function unique() returns a vector, data frame or array like x but with
duplicate elements/rows removed. The function match() returns a vector of
the positions of (first) matches of its first argument in its second. The function
tabulate() takes the integer-valued vector bin and counts the number of times each
integer occurs in it. The function which.max() determines the location, i.e., index
of the (first) maximum of a numeric (or logical) vector.
The functions to calculate the standard deviation, variance and the mean absolute
deviation are sd(), var() and mad() respectively.
> sd(mtcars$cyl)
[1] 1.785922
> var(mtcars$cyl)
[1] 3.189516
> mad(mtcars$cyl)
[1] 2.9652
The quantile() function provides the quartiles of the numeric values. An alternative
function for quartiles is fivenum(). The IQR() function provides the inter quartile
range of the numeric fields.
> quantile(mtcars$cyl)
0% 25% 50% 75% 100%
4 4 6 8 8
> fivenum(mtcars$cyl)
[1] 4 4 6 8 8
> IQR(mtcars$cyl)
[1] 4
The functions cor() and cov() are used to find the correlation and covariance between two numeric fields respectively. In the example below the value shows that there is a negative correlation between the two numeric fields.
> cor(mtcars$mpg, mtcars$cyl)
[1] -0.852162
There are other statistics functions such as pmin(), pmax() [parallel equivalents
of min() and max() respectively], cummin() [cumulative minimum value], cummax()
[cumulative maximum value], cumsum() [cumulative sum] and cumprod()
[cumulative product].
> nrow(mtcars)
[1] 32
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmin(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmax(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> cummin(mtcars$cyl)
[1] 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
> cummax(mtcars$cyl)
[1] 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
> cumsum(mtcars$cyl)
[1] 6 12 16 22 30 36 44 48 52 58 64 72 80 88 96 104 112 116 120 124
[21] 128 136 144 152 160 164 168 172 180 186 194 198
> cumprod(mtcars$cyl)
[1] 6.000000e+00 3.600000e+01 1.440000e+02 8.640000e+02 6.912000e+03
4.147200e+04
[7] 3.317760e+05 1.327104e+06 5.308416e+06 3.185050e+07 1.911030e+08
1.528824e+09
[13] 1.223059e+10 9.784472e+10 7.827578e+11 6.262062e+12 5.009650e+13
2.003860e+14
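Note that when pmin() and pmax() are given a single vector, as above, they simply return that vector; their parallel behaviour shows up with two or more vectors. A small sketch with made-up vectors:
> a <- c(3, 8, 2, 7)
> b <- c(5, 1, 9, 4)
> pmin(a, b)
[1] 3 1 2 4
> pmax(a, b)
[1] 5 8 9 7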
Applied to the whole data frame, summary(mtcars) reports these measures for every field; the output for the last few fields is:
wt qsec vs am gear
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000
carb
Min. :1.000
1st Qu.:2.000
Median :2.000
Mean :2.812
3rd Qu.:4.000
Max. :8.000
R has built-in functions for the normal distribution, namely dnorm(x, mean, sd), pnorm(x, mean, sd), qnorm(p, mean, sd) and rnorm(n, mean, sd). The parameters used by these functions are given below.
x - vector of numbers
p - vector of probabilities
n - sample size
mean - mean (default value is 0)
sd - standard deviation (default value is 1)
5.3.1. dnorm()
For a given mean and standard deviation, this function gives the height of the
probability distribution. Below is an example in which the result of the dnorm()
function is plotted in a graph in Fig. 5.1.
> x <- seq(-5,5, by = 0.05)
> y <- dnorm(x, mean = 1.5, sd = 0.5)
> plot(x, y)
5.3.2. pnorm()
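The pnorm() function gives the probability that a normally distributed random number is less than a given value, i.e. the cumulative distribution. As a minimal sketch, reusing the mean and standard deviation of the dnorm() example above (values assumed for illustration; the result can be plotted in the same way as Fig. 5.1):
> x <- seq(-5, 5, by = 0.05)
> y <- pnorm(x, mean = 1.5, sd = 0.5)
> plot(x, y)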
5.3.3. qnorm()
The qnorm() function takes the probability value as input and returns a cumulative
value that matches the probability value. Below is an example in which the result of
the qnorm() function is plotted in a graph as in Fig. 5.3.
> x <- seq(0, 1, by = 0.02)
> y <- qnorm(x, mean = 2, sd = 1)
> plot(x, y)
5.3.4. rnorm()
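The rnorm() function generates a vector of random numbers whose distribution is normal. A brief sketch, assuming 50 draws from the standard normal distribution:
> y <- rnorm(50)
> hist(y, main = "Normal Distribution")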
R has four in-built functions for the binomial distribution, namely dbinom(x, size, prob), pbinom(x, size, prob), qbinom(p, size, prob) and rbinom(n, size, prob). The parameters used by these functions are given below.
x - vector of numbers
p - vector of probabilities
n - sample size
size - number of trials
prob - probability of success of each trial
5.4.1. dbinom()
This function gives the probability density distribution at each point. Below is an
example in which the result of the dbinom() function is plotted in a graph as in Fig. 5.5.
> x <- seq(0, 25, by = 1)
> y <- dbinom(x,25,0.5)
> plot(x, y)
5.4.2. pbinom()
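The pbinom() function gives the cumulative probability of an event. A short sketch, computing the probability of getting 26 or fewer heads from 51 tosses of a fair coin (the scenario is chosen for illustration):
> x <- pbinom(26, 51, 0.5)
> x
[1] 0.610116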
5.4.3. qbinom()
The function qbinom() takes a probability value as input and returns a number whose cumulative value matches it. The example below finds the number of heads that corresponds to a cumulative probability of 0.5 when a coin is tossed 50 times.
> x <- qbinom(0.5, 50, 1/2)
> x
[1] 25
5.4.4. rbinom()
The function rbinom() returns the required number of random values of the given
probability from a given sample. The below code is to find 5 random values from a
sample of 50 with a probability of 0.5.
> x <- rbinom(5,50,0.5)
> x
[1] 24 21 22 29 32
The correlation coefficient can be computed with the functions cor() or cor.test(). The basic syntax of cor.test() is as given below.
cor.test(x, y, method)
Consider the data set “mtcars” available in the R environment. Let us first find
the correlation between the horse power (“hp”) and the mileage per gallon (“mpg”)
of the cars and then between the horse power (“hp”) and the cylinder displacement
(“disp”) of the cars. From the test we find that the horse power (“hp”) and the
mileage per gallon (“mpg”) of the cars have negative correlation (-0.7761684) and
the horse power (“hp”) and the cylinder displacement (“disp”) of the cars have
positive correlation (0.7909486).
> cor(mtcars$hp, mtcars$mpg, method = "pearson")
[1] -0.7761684
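The correlation with the cylinder displacement is obtained in the same way, and cor.test() additionally reports the test statistics (output abridged; the correlation values match those quoted above):
> cor(mtcars$hp, mtcars$disp, method = "pearson")
[1] 0.7909486
> cor.test(mtcars$hp, mtcars$mpg, method = "pearson")
Pearson's product-moment correlation
data: mtcars$hp and mtcars$mpg
t = -6.7424, df = 30, p-value = 1.788e-07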
The correlation results can also be viewed graphically as in Fig. 5.6. The corrplot()
function can be used to analyze the correlation between the various columns of a
dataset, say mtcars. After this, the correlation between individual columns can be
compared by plotting it in separate graphs as in Fig. 5.7 and Fig. 5.8.
> library(corrplot)
> M <- cor(mtcars)
> corrplot(M, method = "number")
It can be noted that the graph with negative correlation (Fig. 5.7) has the dots
from top left corner to bottom right corner and the graph with positive correlation
(Fig. 5.8) has the dots from the bottom left corner to the top right corner.
The function lm() creates the relationship model between the predictor and the
response variable. The basic syntax for lm() function in linear regression is as given
below.
lm(formula,data)
> x <- c(1510, 1740, 1380, 1860, 1280, 1360, 1790, 1630, 1520, 1310)
> y <- c(6300, 8100, 5600, 9100, 4700, 5700, 7600, 7200, 6200, 4800)
> model <- lm(y~x)
> model
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-3845.509 6.746
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3845.5087 804.9013 -4.778 0.00139 **
x 6.7461 0.5191 12.997 1.16e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The basic syntax for the function predict() in linear regression is as given below.
predict(object, newdata)
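As a quick illustration with the model fitted above (the input value 1700 is an arbitrary choice):
> newdata <- data.frame(x = 1700)
> predict(model, newdata)   # about 7623, i.e. -3845.509 + 6.746 * 1700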
Multiple regression is an extension of linear regression with more than one predictor variable. The general equation is y = a + b1x1 + b2x2 + ... + bnxn, where y is the response variable, a, b1, b2, ..., bn are the coefficients and x1, x2, ..., xn are the predictor variables.
In R, the lm() function is used to create the regression model. The model
determines the value of the coefficients using the input data. Next we can predict
the value of the response variable for a given set of predictor variables using these
coefficients. The relationship model is built between the predictor variables and
the response variables. The basic syntax for lm() function in multiple regression is
as given below.
lm(y ~ x1+x2+x3..., data)
Consider the data set “mtcars” available in the R environment. This dataset
presents the data of different car models in terms of mileage per gallon (“mpg”),
cylinder displacement (“disp”), horse power (“hp”), weight of the car (“wt”) and
some more parameters. This model establishes the relationship between “mpg” as
a response variable with “disp”, “hp” and “wt” as predictor variables. We create a
subset of these variables from the mtcars data set for this purpose.
> model2 <- lm(mpg~disp+hp+wt, data = mtcars[, c("mpg", "disp", "hp", "wt")])
> model2
Call:
lm(formula = mpg ~ disp + hp + wt, data = mtcars[, c("mpg", "disp",
"hp", "wt")])
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
> a <- coef(model2)[1]
>a
(Intercept)
37.10551
> b1 <- coef(model2)[2]
> b2 <- coef(model2)[3]
> b3 <- coef(model2)[4]
> b1
disp
-0.0009370091
> b2
hp
-0.03115655
> b3
wt
-3.800891
We create the mathematical equation below, from the above intercept and
coefficient values.
Y = a+b1*x1+b2*x2+b3*x3
Y = (37.10551)+(-0.0009370091)*x1+(-0.03115655)*x2+(-3.800891)*x3
We can use the regression equation created above to predict the mileage when
a new set of values for displacement, horse power and weight is provided. For a car
with disp = 160, hp = 110 and wt = 2.620 the predicted mileage is given by:
Y = (37.10551)+(-0.0009370091)*160+(-0.03115655)*110+(-3.800891)*2.620
= 23.57003
The above value can also be calculated using the function predict() for the given
new value.
> newdata <- data.frame(disp = 160, hp = 110, wt = 2.620)
> mileage <- predict(model2, newdata)
> mileage
1
23.57003
In logistic regression the response is binary, and the model takes the form y = 1/(1 + e^-(a + b1x1 + b2x2 + ... + bnxn)), where y is the response variable, a, b1, b2, ..., bn are the coefficients and x1, x2, ..., xn are the predictor variables. The function used to create the logistic regression model is the glm() function. The basic syntax of the glm() function for logistic regression is as given below.
glm(formula,data,family)
The in-built data set “mtcars” describes different models of a car with their
various engine specifications. In “mtcars” data set, the column am describes the
transmission mode with a binary value. A logistic regression model is built between
the columns “am” and 3 other columns - hp, wt and cyl.
> model3 <- glm(am ~ cyl + hp + wt, data = mtcars[, c("am", "cyl", "hp", "wt")],
family = binomial)
> model3
Coefficients:
(Intercept) cyl hp wt
19.70288 0.48760 0.03259 -9.14947
> summary(model3)
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = mtcars[,
c("am", "cyl", "hp", "wt")])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
The p-value in the summary is greater than 0.05 for the variables “cyl” (0.6491) and “hp” (0.0840), so these variables are considered insignificant in contributing to the value of the variable “am”. Only the weight “wt” (p = 0.0276) impacts the “am” value in this regression model.
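The fitted model can still be used to predict the probability of manual transmission for a new car; a minimal sketch with assumed predictor values (which works out to a probability of roughly 0.9 given the coefficients above):
> newcar <- data.frame(cyl = 6, hp = 110, wt = 2.620)
> predict(model3, newcar, type = "response")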
In Poisson regression the response is a count, modelled on the log scale as log(y) = a + b1x1 + b2x2 + ... + bnxn, where y is the response variable, a, b1, b2, ..., bn are the coefficients and x1, x2, ..., xn are the predictor variables. The function used to create the Poisson regression model is again the glm() function. The basic syntax of the glm() function for Poisson regression is as given below.
glm(formula, data, family)
The data set “warpbreaks” describes the effect of wool type and tension on the
number of warp breaks per loom. Let’s consider “breaks” as the response variable
which is a count of number of breaks. The wool “type” and “tension” are taken as
predictor variables. The model so built shows the below results.
> model4 <- glm(formula = breaks ~ wool + tension, data = warpbreaks,
family = poisson)
> summary(model4)
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the summary we look for the p-value in the last column to be less than 0.05 to conclude that a predictor variable has an impact on the response variable. As seen, wool type B and tension types M and H each have a significant impact on the count of breaks.
p-value of woolB = 6.49e-05 = 0.0000649 < 0.05
p-value of tensionM = 9.73e-08 = 0.0000000973 < 0.05
p-value of tensionH = 5.21e-16 = 0.000000000000000521 < 0.05
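The fitted model can be used to predict the expected number of breaks; a small sketch with assumed predictor values (for wool A at tension M the expectation is exp(3.69196 - 0.32132), about 29 breaks):
> newwarp <- data.frame(wool = "A", tension = "M")
> predict(model4, newwarp, type = "response")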
When modelling real world data for regression analysis, we observe that it is rarely the case that the equation of the model is a linear equation giving a linear graph. The equation of a model of real world data often involves mathematical functions of higher degree, and in such a scenario the plot of the model gives a curve rather than a line. Both linear and non-linear regression aim to adjust the values of the model’s parameters to find the line or curve that comes nearest to the data. On finding these values, we will be able to estimate the response variable with good accuracy.
Consider a non-linear model of the form y = b1*x^2 + b2.
Let us assume the initial coefficients to be 1 and 3 and fit these values into nls()
function.
> x <- c(1.6, 2.1, 2, 2.23, 3.71, 3.25, 3.4, 3.86, 1.19, 2.21)
> y <- c(5.19, 7.43, 6.94, 8.11, 18.75, 14.88, 16.06, 19.12, 3.21, 7.58)
> plot(x, y)
> model <- nls(y ~ b1*x^2+b2, start = list(b1 = 1,b2 = 3))
> new <- data.frame(x = seq(min(x), max(x), len = 100))
> lines(new$x, predict(model, newdata = new))
> res1 <- sum(resid(model)^2)
> res1
[1] 1.081935
> res2 <- confint(model)
> res2
2.5% 97.5%
b1 1.137708 1.253135
b2 1.497364 2.496484
From the confidence intervals we can conclude that the value of b1 lies close to 1 (between 1.14 and 1.25), while the value of b2 lies closer to 2 (between 1.50 and 2.50) and not 3.
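For one-way ANOVA, consider the built-in PlantGrowth dataset, which records plant weights under a control and two treatment groups. The model-building call is not shown in the listing; a plausible reconstruction that produces the table below is:
> fit <- aov(PlantGrowth$weight ~ PlantGrowth$group)
> anova(fit)
Analysis of Variance Table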
Response: PlantGrowth$weight
Df Sum Sq Mean Sq F value Pr(>F)
PlantGrowth$group 2 3.7663 1.8832 4.8461 0.01591 *
Residuals 27 10.4921 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows that the F-value is 4.8461 and the p-value is 0.01591, which is less than 0.05 (the 5% level of significance). The null hypothesis is therefore rejected; that is, the group (control / treatment) has an effect on the plant growth, measured as plant weight.
For two-way ANOVA, consider the below example of revenues collected for 5
years in each month. We want to see if the revenue depends on the Year and / or
Month or if they are independent of these two factors.
> revenue = c(15,18,22,23,24, 22,25,15,15,14, 18,22,15,19,21,
+ 23,15,14,17,18, 23,15,26,18,14, 12,15,11,10,8, 26,12,23,15,18,
+ 19,17,15,20,10, 15,14,18,19,20, 14,18,10,12,23, 14,22,19,17,11,
+ 21,23,11,18,14)
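The factors months and years and the fitted model are not shown in the listing; given the degrees of freedom in the table below (11 for months, 4 for years), a plausible reconstruction is:
> months <- gl(12, 5)    # 12 months, 5 yearly observations each (assumed layout)
> years <- gl(5, 1, 60)  # 5 years, cycling within each month (assumed layout)
> fit <- aov(revenue ~ months + years)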
> anova(fit)
Analysis of Variance Table
Response: revenue
Df Sum Sq Mean Sq F value Pr(>F)
months 11 308.45 28.041 1.4998 0.1660
years 4 44.17 11.042 0.5906 0.6712
Residuals 44 822.63 18.696
The F-value for the difference between months is 1.4998. This value is lower than the tabulated critical value, and indeed the p-value (0.1660) is greater than 0.05. So we cannot reject the null hypothesis: the means of revenue across the months are not shown to differ, and we conclude that the variable “months” has no demonstrated effect on revenue.
The F-value for the difference between years is 0.5906. This value is likewise lower than the tabulated critical value, with a p-value (0.6712) greater than 0.05. So we fail to reject the null hypothesis: the means of revenue across the years are not shown to differ, and the variable “years” has no demonstrated effect on revenue.
ANCOVA is a type of ANOVA model that has a general linear model with a
continuous outcome variable and two or more predictor variables. Of these predictor
variables, at least one is continuous and at least one more is categorical. Analysis of
variance (ANOVA) is a collection of statistical models and their procedures which
are used to observe differences between the means of three or more variables in a
population based on the sample presented.
Consider the R built in data set “mtcars”. In this dataset the field “am”
represents the type of transmission and it takes the values 0 or 1. The miles per
gallon value, “mpg” of a car can also depend on it besides the value of horse power,
“hp”. The effect of the value of “am” on the regression between “mpg” and “hp” is
studied. It is done by using the aov() function followed by the anova() function to
compare the multiple regressions.
Consider the fields “mpg”, “hp” and “am” from the data set “mtcars”. The
variable “mpg” is the response variable, and the variable “hp” is chosen as the
predictor variable and “am” as the categorical variable. We create a regression model
taking “hp” as the predictor variable and “mpg” as the response variable taking into
account the interaction between “am” and “hp”.
The model with interaction between categorical variable and predictor variable
is given as below.
> res1 <- aov(mtcars$mpg ~ mtcars$hp * mtcars$am, data = mtcars)
> summary(res1)
Df Sum Sq Mean Sq F value Pr(>F)
As the p-value in both the cases is less than 0.05, the result shows that both
horse power “hp” and transmission type “am” has significant effect on miles per
gallon “mpg”. But the interaction between these two variables is not significant as
the p-value is more than 0.05.
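The second model, without the interaction term, is not shown in the listing; it would be created along these lines:
> res2 <- aov(mtcars$mpg ~ mtcars$hp + mtcars$am, data = mtcars)
> summary(res2)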
Now we can compare the two models to conclude whether the interaction of the variables is truly insignificant. For this we use the anova() function.
> finres <- anova(res1,res2)
> finres
As the p-value is (0.9806) greater than 0.05 we conclude that the interaction
between horse power “hp” and transmission type “am” is not significant. So the
mileage per gallon “mpg” will depend in a similar manner on the horse power of the
car in both auto and manual transmission mode.
The function chisq.test() is used for performing the chi-square test on the given data. The R syntax for the chi-square test is as below.
chisq.test(data)
We will take the Cars93 data in the “MASS” library, which represents the sales of different models of cars in the year 1993. The factor variables in this dataset can be considered as categorical variables. In the model below the variables “AirBags” and “Type” are considered. Here we aim to find out whether there is any significant correlation between the types of cars sold and the type of air bags they have. If a correlation is observed, we can estimate which types of cars sell better with which types of air bags.
> library(MASS)
> cardata = table(Cars93$AirBags, Cars93$Type)
> chi <- chisq.test(cardata)
> chi
Pearson's Chi-squared test
data: cardata
X-squared = 33.001, df = 10, p-value = 0.0002722
The result shows a p-value (0.0002722) of less than 0.05, which indicates a strong correlation between the “AirBags” and the “Type” of the cars sold.
Hypothesis tests are of two types: one-tailed and two-tailed. A one-tailed test checks for an effect in a single direction, so the null hypothesis is rejected only when the sample statistic deviates in that direction. A two-tailed test checks for a deviation in either direction, testing the statistical significance of the hypothesis on both sides of the distribution.
The p-value is 0.5743, which is greater than 0.05. Hence, at the .05 significance level, we do not reject the null hypothesis that the proportion of voters in the population is above 60% this year.
Suppose 12% of the apples picked from an orchard last year were rotten, and this year 30 out of 214 apples turn out to be rotten. At the .05 significance level, is it possible to reject the null hypothesis that the proportion of rotten apples in the harvest stays at or below 12% this year?
> prop.test(30, 214, p=.12, alt="greater", correct=FALSE)
The p-value (approximately 0.18) is greater than 0.05. Hence, at the .05 significance level, we do not reject the null hypothesis that the proportion of rotten apples in the harvest stays at or below 12% this year.
Suppose 12 heads turn up in 20 tosses of a coin. At the .05 significance level, is it possible to reject the null hypothesis that the coin toss is fair?
> prop.test(12, 20, p=0.5, correct=FALSE)
The p-value is 0.3711, which is greater than 0.05. Hence, at the .05 significance level, we do not reject the null hypothesis that the coin toss is fair.
HIGHLIGHTS
Most important functions to get the statistical measures are available in
the packages base and stats.
The basic statistical measures are obtained by the functions min(), max(),
mean() and median().
The basic statistical measures can also be obtained by one function
summary().
R does not have a standard in-built function to calculate mode and hence
we create a user function to calculate mode.
The function unique() returns a vector, data frame or array with duplicate
elements/rows removed.
The function match() returns a vector of the positions of (first) matches
of its first argument in its second.
The function tabulate() takes the integer-valued vector bin and counts
the number of times each integer occurs in it.
The function which.max() determines the location, i.e., index of the (first)
maximum of a numeric (or logical) vector.
The functions to calculate the standard deviation, variance and the mean
absolute deviation are sd(), var() and mad() respectively.
The quantile() function provides the quartiles of the numeric values.
The IQR() function provides the inter quartile range of the numeric fields.
The function cor() and cov() are used to find the correlation and covariance
between two numeric fields respectively.
R has built-in functions for the normal distribution of data, namely dnorm(), pnorm(), qnorm() and rnorm().
R has built-in functions for the binomial distribution of data, namely dbinom(), pbinom(), qbinom() and rbinom().
Correlation coefficient in R can be computed using the functions cor() or
cor.test().
The corrplot() function plots the correlation between the various columns
of a dataset.
The function lm() creates the relationship model between the predictor
and the response variable in linear / multiple regression analysis.
Multiple regression is an extension of linear regression where we have more than one predictor variable and one response variable.
The function used to create the logistic / Poisson regression model is the glm() function.
The functions used for Analysis of Variance (ANOVA) / Analysis of Covariance (ANCOVA) are aov() and anova().
The function used for performing the chi-square test is chisq.test().
In Hypothesis Testing, we apply the prop.test() function to compute the
p-value directly.
CHAPTER 6
Data Mining Using R
OBJECTIVES
R has a default dataset called iris, and this dataset will be used in the example below. The cluster number is set to 3 in the clustering below, as the number of distinct species in the iris dataset is 3 (“setosa”, “versicolor”, “virginica”). For the purpose of initial manipulation, let us copy the iris dataset into another dataset called iris1. Then we remove the column “Species” from the dataset iris1 and apply k-means clustering on it.
The result of the clustering is then compared with the “Species” column of the dataset iris to see if similar objects are grouped together. The result of the clustering shows that the species “setosa” can be clustered separately, while the other two species have some overlapping objects and hence tend to be clustered together. The functions kmeans(), table(), plot() and points() are used below for getting and plotting the results. It can be noted that the plots can be drawn with any two dimensions of the measurements at a particular time (e.g. Sepal.Length vs. Sepal.Width). Also, the results of the k-means clustering can vary from run to run due to the random selection of initial cluster centres.
> iris1 <- iris
> iris1$Species <- NULL
> km <- kmeans(iris1, 3)
> km
K-means clustering with 3 clusters of sizes 50, 38, 62
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 6.850000 3.073684 5.742105 2.071053
3 5.901613 2.748387 4.393548 1.433871
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[41] 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3
[81] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3
[121] 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3
Within cluster sum of squares by cluster:
[1] 15.15100 23.87947 39.82097
(between_SS / total_SS = 88.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
> table(iris$Species, km$cluster)
1 2 3
setosa 50 0 0
versicolor 0 2 48
virginica 0 36 14
> plot(iris1[c("Sepal.Length", "Sepal.Width")], col = km$cluster)
> points(km$centers[, c("Sepal.Length", "Sepal.Width")],
col = 1:3, pch = 8, cex = 2)
R has the pam() function in the cluster package and the pamk() function in the fpc package to do k-medoids clustering. The k-means and k-medoids clustering produce almost the same result; the difference is that in k-means each cluster is represented by the cluster centre, while in k-medoids each cluster is represented by the object closest to the cluster centre. In the presence of outliers, k-medoids is more robust than k-means
clustering. Partitioning Around Medoids (PAM) is the classic algorithm applied for
k-medoids clustering. The PAM algorithm is not efficient in handling large datasets.
CLARA is an enhanced technique of PAM which performs better on large datasets.
For the functions pam() and clara() in the package cluster we need to specify the
number of clusters. But, for the function pamk() in the package fpc, we need not
specify the number of clusters. In this case the number of clusters is estimated
using the silhouette width.
> library(fpc)
> pmk <- pamk(iris1)
> pmk$nc
[1] 2
> table(pmk$pamobject$clustering, iris$Species)
In the left chart of Fig. 6.3, we can see that there are two clusters, one for the species “setosa” and the other for the mixture of the species “versicolor” and “virginica”. The right chart of Fig. 6.3 shows the silhouette width, which decides the number of clusters (2 clusters in this case). The silhouette width is shown to be
between 0.81 and 0.62 (nearing 1), which means that the observations are well clustered. If the silhouette width is around 0, the observation lies between two clusters, and if it is less than 0, the observation is probably placed in the wrong cluster.
Now, let us use the pam() function from the cluster package to cluster the iris
data and plot the results.
> library(cluster)
> pm <- pam(iris1, 3)
> table(pm$clustering, iris$Species)
setosa versicolor virginica
1 50 0 0
2 0 48 14
3 0 2 36
> layout(matrix(c(1, 2), 1, 2))
> plot(pm)
> layout(matrix(1))
In the left chart of Fig. 6.4, we can see three clusters: cluster 1 with only the “setosa” species, cluster 2 with mostly “versicolor” species and a few “virginica” species, and cluster 3 with mostly “virginica” species and a few “versicolor” species. In both the graphs Fig. 6.3 and Fig. 6.4, the lines between the clusters show the distances between the clusters. From the above we can say that the choice of the clustering function used in R depends on the target problem and the domain knowledge available.
Hierarchical clustering in R can be done using the function hclust() in the stats package. As a hierarchical clustering plot will be very crowded if the data is large, we create a sample of the iris data and do the clustering and plotting using this sample data. The function rect.hclust() is used to draw a rectangle around each cluster, and the cutree() function is used to cut the dendrogram into the desired number of clusters, as in Fig. 6.5.
> i <- sample(1:dim(iris)[1], 50)
> iris3 <- iris[i,]
> iris3$Species <- NULL
> hc <- hclust(dist(iris3), method = "ave")
> plot(hc, hang = -1, labels = iris$Species[i])
> rect.hclust(hc, k=3)
> grp <- cutree(hc, k=3)
The resultant graph Fig. 6.5, shows that the first cluster has just the species
“setosa”, the second cluster has the species “virginica” and the third cluster has a
mix of both the species “versicolor” and “virginica”.
The function dbscan() in the package fpc is used for density based clustering. Density based clustering does not force every observation into a cluster; objects in sparse regions are treated as noise, which makes the method useful on noisy data. There are two main parameters in the function dbscan(): eps and MinPts. The parameter MinPts stands for the minimum number of points in a cluster, and the parameter eps defines the reachability distance. Standard values are given for these parameters eps and MinPts.
> library(fpc)
> iris4 <- iris
> iris4$Species <- NULL
> db <- dbscan(iris4, eps = 0.42, MinPts = 5)
> table(db$cluster, iris$Species)
In the result above we can see that there are three clusters, cluster 1, cluster 2
and cluster 3. The cluster 0 corresponds to the outliers in the data.
In the graph as in Fig. 6.6, the black circles represent the outliers. The clustering
results can also be plotted in the scatter plot like in k-means clustering. This can be
done using the function plot() as in Fig. 6.7, or the function plotcluster() as in Fig.
6.8, in the fpc package. The black circles in Fig. 6.7 and the black zeros “0” in Fig. 6.8 show the outliers.
> plot(db, iris4[c(1,2)])
> plotcluster(iris4, db$cluster)
The clustering model can be used to label new data based on the similarity between the new data and the clusters. We take a sample of 10 records from the iris dataset, add some outliers to it and try to label the new data. The noise points are generated using the function runif().
> set.seed(435)
> i <- sample(1:dim(iris)[1], 10)
Thus from the above results and the plot in Fig. 6.9, we can see that out of
the 10 new data, 7 (3 + 2 + 2) are assigned to the correct clusters and there is one
outlier data in it.
6.3. Classification
Classification is a data mining technique that assigns items in a sample to target
labelled classes. The goal of classification is to accurately predict the target class for
each case in the data. For example, a classification model could be used to identify
the age group of the people as children, youth, or old. The various classification
techniques are Decision Tree based Methods, Rule-based Methods, Memory
based reasoning, Neural Networks, Naïve Bayes and Bayesian Belief Networks and
Support Vector Machines. R provides packages and functions that implement many of these classification techniques, such as SVM, kNN, decision trees and Naive Bayes. The table below lists the classification techniques available in R along with their corresponding packages and functions.
In decision tree learning, the dataset is broken into smaller and smaller subsets while the associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree is called the root
node and this corresponds to the best predictor. Decision trees can handle both
categorical and numerical data.
The function ctree() in the package party can be used to build the decision tree
for the given data. We consider the iris dataset available in R for our analysis. This
dataset has four attributes namely, the Sepal.Length, Sepal.Width, Petal.Length and
Petal.Width using which we can predict the Species of the flower. After applying the
function ctree() and getting the decision tree model, we can do the prediction using
the function predict() for the given new data, so that we can categorize it into which
Species the flowers belong to.
Before applying the decision tree function, the iris dataset is first split into
training and test subsets. For training we choose 80% of the data randomly and the
remaining 20% is used for testing. The seed for sampling is randomly set to a fixed
number as below for effective splitting of data. After creating the decision tree using
the training data, prediction is done on the test data. The results of the built tree
can be viewed as text result and as well as a decision tree plot as in Fig. 6.10.
> set.seed(1234)
> i <- sample(2, nrow(iris), replace=TRUE, prob=c(0.8, 0.2))
> train <- iris[i==1,]
> test <- iris[i==2,]
> form <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
> dt <- ctree(form, data=train)
> table(predict(dt), train$Species)
The above decision tree result can also be drawn as a simple decision tree as in
Fig. 6.11.
> plot(dt, type = "simple")
In the first decision tree (Fig. 6.10) the number of training data under each
species is listed as bar graph, but in the second decision tree (Fig. 6.11) the same is
listed using variable y. For example, node 2 is labelled as “n = 42, y(1, 0, 0)”, which
means that it contains 42 training instances and all of them belong to the species
“setosa”.
Now, the predicted model will be tested with the test data to see if the instances
are correctly classified.
> pred <- predict(dt, newdata = test)
> table(pred, test$Species)
pred setosa versicolor virginica
setosa 8 0 0
versicolor 0 7 1
virginica 0 0 11
The function rpart() in the package rpart can be used to build the decision tree for the given data. We consider the bodyfat dataset available in the package TH.data of R for our analysis. After applying the function rpart() and getting the decision tree model, we can do the prediction using the function predict() for the given new data, so that we can predict the body fat (the DEXfat variable) of new observations.
Before applying the decision tree function, the dataset is first split into training
and test subsets. For training we choose 70% of the data randomly and the remaining
30% is used for testing. After creating the decision tree using the training data,
prediction is done on the test data. The decision tree is shown in the Fig. 6.12 and
the details of the split are listed below.
> library(TH.data)
> data(“bodyfat”, package = “TH.data”)
> set.seed(1234)
> i <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
> train <- bodyfat[i==1,]
> test <- bodyfat[i==2,]
> form <- DEXfat ~ age + waistcirc + hipcirc +
elbowbreadth + kneebreadth
> dt2 <- rpart(form, data = train, control = rpart.control(minsplit = 10))
> plot(dt2)
> text(dt2, use.n=T, all = T, cex = 1)
The predicted values are compared with the observed values and the graph in Fig.
6.13 shows that the modelling is good as most points lie close to the diagonal line.
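A sketch of that comparison, assuming pred holds the predictions of dt2 on the test subset:
> pred <- predict(dt2, newdata = test)
> xlim <- range(bodyfat$DEXfat)
> plot(pred ~ DEXfat, data = test, xlab = "Observed", ylab = "Predicted",
+ xlim = xlim, ylim = xlim)
> abline(a = 0, b = 1)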
Similarly, we can also apply the same rpart() function to the iris dataset as before, splitting the data into 80% training and 20% test data. The obtained
model (Fig. 6.14) can then be used for predicting the species of the test data. The
prediction shows that out of the 8 “setosa” species, all are correctly classified, out of
the 8 “versicolor” species, 7 are correctly classified and 1 is incorrectly classified as
“virginica” and out of the 11 “virginica” species, 10 are correctly classified and 1 is
incorrectly classified as “versicolor”.
> set.seed(1234)
> i <- sample(2, nrow(iris), replace=TRUE, prob=c(0.8, 0.2))
> train <- iris[i==1,]
> test <- iris[i==2,]
> form <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
> dt <- rpart(form, data = train, control = rpart.control(minsplit = 10))
> table(predict(dt), train$Species)
setosa versicolor virginica
setosa 42 0 0
versicolor 0 42 4
virginica 0 1 34
> plot(dt)
> text(dt, use.n=TRUE, all=TRUE)
> varImpPlot(rf)
Finally, the built random forest is tested on test data, and the result is checked
with the functions table() and margin(). The margin of a data point is the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. Generally, a positive margin means a correct classification (Fig. 6.17).
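A sketch of that check, assuming rf is a random forest trained with randomForest() on the iris training subset (train) and test is the held-out subset as before:
> pred <- predict(rf, newdata = test)
> table(pred, test$Species)
> plot(margin(rf, train$Species))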
The lift of a rule A => B is given by lift(A => B) = P(A and B) / (P(A) × P(B)), where P(A) is the percentage (or probability) of cases containing A and P(B) is the percentage (or probability) of cases containing B.
In the below section, we do association rule mining using the function apriori().
This function has the default settings as supp=0.1, which is the minimum support
of rules, conf=0.8, which is the minimum confidence of rules and maxlen=10,
which is the maximum length of the rules.
> library(arules)
> rules.all <- apriori(titanic.raw)
> quality(rules.all) <- round(quality(rules.all), digits=3)
> rules.all
set of 27 rules
> inspect(rules.all)
lhs rhs support confidence lift
[1] {} => {Age=Adult} 0.950 0.950 1.000
[2] {Class=2nd} => {Age=Adult} 0.119 0.916 0.964
[3] {Class=1st} => {Age=Adult} 0.145 0.982 1.033
[4] {Sex=Female} => {Age=Adult} 0.193 0.904 0.951
[5] {Class=3rd} => {Age=Adult} 0.285 0.888 0.934
[6] {Survived=Yes}=> {Age=Adult} 0.297 0.920 0.968
[7] {Class=Crew} => {Sex=Male} 0.392 0.974 1.238
[8] {Class=Crew} => {Age=Adult} 0.402 1.000 1.052
[9] {Survived=No} => {Sex=Male} 0.620 0.915 1.164
[10] {Survived=No}=> {Age=Adult} 0.653 0.965 1.015
[11] {Sex=Male} => {Age=Adult} 0.757 0.963 1.013
[12] {Sex=Female,Survived=Yes}
=> {Age=Adult} 0.144 0.919 0.966
[13] {Class=3rd,Sex=Male}
=> {Survived=No} 0.192 0.827 1.222
[14] {Class=3rd,Survived=No}
=> {Age=Adult} 0.216 0.902 0.948
[15] {Class=3rd,Sex=Male}
=> {Age=Adult} 0.210 0.906 0.953
[16] {Sex=Male,Survived=Yes}
=> {Age=Adult} 0.154 0.921 0.969
[17] {Class=Crew,Survived=No}
=> {Sex=Male} 0.304 0.996 1.266
[18] {Class=Crew,Survived=No}
=> {Age=Adult} 0.306 1.000 1.052
[19] {Class=Crew,Sex=Male}
=> {Age=Adult} 0.392 1.000 1.052
[20] {Class=Crew,Age=Adult}
=> {Sex=Male} 0.392 0.974 1.238
[21] {Sex=Male,Survived=No}
=> {Age=Adult} 0.604 0.974 1.025
[22] {Age=Adult,Survived=No}
=> {Sex=Male} 0.604 0.924 1.175
[23] {Class=3rd,Sex=Male,Survived=No}
=> {Age=Adult} 0.176 0.917 0.965
[24] {Class=3rd,Age=Adult,Survived=No}
=> {Sex=Male} 0.176 0.813 1.034
[25] {Class=3rd,Sex=Male,Age=Adult}
=> {Survived=No} 0.176 0.838 1.237
[26] {Class=Crew,Sex=Male,Survived=No}
=> {Age=Adult} 0.304 1.000 1.052
[27] {Class=Crew,Age=Adult,Survived=No}
=> {Sex=Male} 0.304 0.996 1.266
Many rules generated above are uninteresting. If we are interested only in rules with a rhs indicating survival, we can set rhs=c(“Survived=No”, “Survived=Yes”) in the appearance parameter of the apriori() call.
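The call that restricts the rhs and sorts the rules is not shown in the listing; a plausible reconstruction, with thresholds consistent with the rules printed below (the surviving listing picks up at rule 7), is:
> rules <- apriori(titanic.raw, control = list(verbose = FALSE),
+ parameter = list(minlen = 2, supp = 0.005, conf = 0.8),
+ appearance = list(rhs = c("Survived=No", "Survived=Yes"), default = "lhs"))
> quality(rules) <- round(quality(rules), digits = 3)
> rules.sorted <- sort(rules, by = "lift")
> inspect(rules.sorted)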
[7] {Class=Crew,Sex=Female,Age=Adult}
=> {Survived=Yes} 0.009 0.870 2.692
[8] {Class=2nd,Sex=Female,Age=Adult}
=> {Survived=Yes} 0.036 0.860 2.663
[9] {Class=2nd,Sex=Male, Age=Adult}
=> {Survived=No} 0.070 0.917 1.354
[10] {Class=2nd,Sex=Male}
=> {Survived=No} 0.070 0.860 1.271
[11] {Class=3rd,Sex=Male,Age=Adult}
=> {Survived=No} 0.176 0.838 1.237
[12] {Class=3rd,Sex=Male}
=> {Survived=No} 0.192 0.827 1.222
Some rules generated above provide little or no extra information when certain other rules are already in the result. For example, rule 2 provides no extra knowledge in addition to rule 1, since rule 1 tells us that all 2nd-class children survived. A rule
is considered to be redundant when it is a super rule of another rule and it has the
same or a lower lift. Other redundant rules in the above result are rules 4, 7 and 8,
compared with the rules 3, 6 and 5 respectively. We prune redundant rules with
code below.
> subset.matrix <- is.subset(rules.sorted, rules.sorted)
> subset.matrix[lower.tri(subset.matrix, diag=T)] <- FALSE
> redundant <- colSums(subset.matrix, na.rm=T) >= 1
> which(redundant)
{Class=2nd,Sex=Female,Age=Child,Survived=Yes}
2
{Class=1st,Sex=Female,Age=Adult,Survived=Yes}
4
{Class=Crew,Sex=Female,Age=Adult,Survived=Yes}
7
{Class=2nd,Sex=Female,Age=Adult,Survived=Yes}
8
> rules.pruned <- rules.sorted[!redundant]
> inspect(rules.pruned)
The above rules show that only 2nd-class children survived. This cannot be the whole picture; it is an artifact of the higher support and confidence levels we set up earlier. To investigate the issue, we run the code below to find rules whose rhs is “Survived=Yes” and whose lhs contains only the items “Class=1st”, “Class=2nd”, “Class=3rd”, “Age=Child” and “Age=Adult” (default=”none”). We use lower thresholds for both support and confidence than before, to find all rules for children of the different classes.
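A plausible reconstruction of that call (the exact thresholds are assumptions):
> rules <- apriori(titanic.raw, control = list(verbose = FALSE),
+ parameter = list(minlen = 3, supp = 0.002, conf = 0.2),
+ appearance = list(default = "none", rhs = c("Survived=Yes"),
+ lhs = c("Class=1st", "Class=2nd", "Class=3rd", "Age=Child", "Age=Adult")))
> rules.sorted <- sort(rules, by = "confidence")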
> inspect(rules.sorted)
lhs rhs support confidence lift
[1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134 1.0000000 3.0956399
[2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034 1.0000000 3.0956399
[3] {Class=1st,Age=Adult} => {Survived=Yes} 0.089504771 0.6175549 1.9117275
[4] {Class=2nd,Age=Adult} =>{Survived=Yes} 0.042707860 0.3601533 1.1149048
[5] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151 0.3417722 1.0580035
[6] {Class=3rd,Age=Adult} => {Survived=Yes} 0.068605179 0.2408293 0.7455209
In the above result, the first two rules show that children in the 1st class had the same survival rate as children in the 2nd class, and that all of them survived. The rule for 1st-class children did not appear before simply because its support was below the specified threshold. Rule 5 presents the sad fact that children of the 3rd class had a low survival rate of 34%, comparable with that of 2nd-class adults and much lower than that of 1st-class adults.
Now, let us see some ways of visualizing association rules such as Scatter Plot,
Grouped Matrix, Graph and Parallel Coordinates Plot as in Fig. 6.18, Fig. 6.19, Fig.
6.20 and Fig. 6.21 respectively.
> library(arulesViz)
> plot(rules.all)
> plot(rules.all, method="grouped")
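The graph and parallel coordinates visualizations mentioned above can be produced along the same lines (method names as in the arulesViz package):
> plot(rules.all, method="graph")
> plot(rules.all, method="paracoord")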
Univariate outlier detection can also be used to find outliers in multivariate data. First, we create a data frame with two independent variables and detect their outliers separately. Then, we take as multivariate outliers those data points which are outliers for both variables. In the code for Fig. 6.23 the outliers are marked with “+” in red, and the result is displayed as a scatter plot.
> y <- rnorm(100)
> df <- data.frame(x, y)
> rm(x, y)
> head(df)
x y
1 -3.31539150 0.7619774
2 -0.04765067 -0.6404403
3 0.69720806 0.7645655
4 0.35979073 0.3131930
5 0.18644193 0.1709528
6 0.27493834 -0.8441813
Similarly, we can take as multivariate outliers those data points which are outliers in either of the variables (x or y). This is shown as a scatter plot in Fig. 6.24.
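Here a and b are the row indices of the univariate outliers of x and y respectively; they can be obtained with boxplot.stats(), for example (a reconstruction of the omitted step):
> a <- which(df$x %in% boxplot.stats(df$x)$out)
> b <- which(df$y %in% boxplot.stats(df$y)$out)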
> outlierlist2 <- union(a,b)
> outlierlist2
[1] 1 33 64 74 24 25 49
> plot(df)
> points(df[outlierlist2,], col="blue", pch="x", cex=2)
When there are three or more variables in an application, a final list of outliers
might be produced with majority voting of outliers detected from individual
variables. Domain knowledge should be involved when choosing the optimal way to
group the items in real-world applications.
We show outliers with a biplot of the first two principal components in the
Fig. 6.26. In the below code, prcomp() performs a principal component analysis,
and biplot() plots the data with its first two principal components. In the below
graph, Fig. 6.26, x-axis and y-axis are respectively the first and second principal
components, the arrows show the original columns (variables), and the five outliers
are labelled with their row numbers.
> n <- nrow(iris1)
> labels <- 1:n
The outliers can also be shown with a pairs plot using the pairs() function, as in Fig. 6.27, where the outliers are labelled with a red “+” symbol. The package Rlof provides the function lof(), a parallel implementation of the LOF (Local Outlier Factor) algorithm.
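A minimal sketch of its use on the iris measurements (k = 5 is an arbitrary choice of neighbourhood size):
> library(Rlof)
> outlier.scores <- lof(iris1, k = 5)
> order(outlier.scores, decreasing = TRUE)[1:5]   # rows with the largest LOF scores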
One way to detect outliers is clustering. By grouping data into clusters, those
data not assigned to any clusters are taken as outliers. For example, with density-
based clustering such as DBSCAN, objects are grouped into one cluster if they
are connected to one another by densely populated area. Therefore, objects not
assigned to any clusters are isolated from other objects and are taken as outliers.
The function dbscan() in the package fpc is used for the density based clustering.
There are two main parameters in the function dbscan(). They are the eps and
MinPts. The parameter MinPts stands for the minimum points in a cluster and the
parameter eps defines the reachability distance. Standard values are given for these
parameters.
> library(fpc)
> iris1 <- iris
> iris1$Species <- NULL
> db <- dbscan(iris1, eps = 0.42, MinPts = 5)
> table(db$cluster, iris$Species)
In the result above we can see that there are three clusters, cluster 1, cluster 2
and cluster 3. The cluster 0 corresponds to the outliers in the data (marked by black
circles in the plot of Fig. 6.28).
The clustering results can also be plotted in the scatter plot as in Fig. 6.29. This
can be done using the function plot() in the fpc package. Here also the black circles
denote the outliers.
> plot(db, iris1[c(1,2)])
Figure 6.29 Scatter Plot of Outliers Detection Using Density Based Clustering
We can also detect outliers with the k-means algorithm. With k-means, the
data are partitioned into k groups by assigning them to the closest cluster centres.
After that, we can calculate the distance (or dissimilarity) between each object and
its cluster centre, and pick those with largest distances as outliers. An example of
outlier detection with k-means from the iris data is given below. In the graph of Fig.
6.30, the cluster centres are labelled with “*” and outliers with “+”.
> iris2 <- iris[,1:4]
> km <- kmeans(iris2, centers=3)
> centers <- km$centers[km$cluster, ]
> dist <- sqrt(rowSums((iris2 - centers)^2))
> outliers <- order(dist, decreasing=T)[1:5]
> print(outliers)
[1] 99 58 94 61 119
> print(iris2[outliers,])
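The plot of Fig. 6.30 would be produced along these lines (a sketch; plotting the first two measurements only):
> plot(iris2[, c("Sepal.Length", "Sepal.Width")], col = km$cluster, cex = 0.5)
> points(km$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)   # centres drawn as "*"
> points(iris2[outliers, c("Sepal.Length", "Sepal.Width")], pch = "+", col = 4, cex = 2) # outliers drawn as "+"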
A large number of dimensions in a dataset results in lower predictive power of the model. This scenario is often termed the curse of dimensionality. Principal Component Analysis (PCA) is a popular dimensionality reduction technique and is used in many applications that deal with high-dimensional data.
Let us apply the function apply() to the crimtab dataset column wise to calculate the variance and see how each variable varies. The function apply() returns a vector or array or list of values obtained by applying a function to the margins of an array or matrix.
> apply(crimtab,2,var)
142.24 144.78 147.32 149.86 152.4
0.02380952 0.02380952 0.17421603 0.88792102 2.56445993
We can see that the column “165.1” has the maximum variance (270.58536585).
Let us now apply PCA using the function prcomp() on the data set crimtab.
> PCA <- prcomp(crimtab)
> PCA
Standard deviations (1, .., p=22):
[1] 30.07962021 14.61901911 5.45438277 4.65250574 3.21408168 2.77322835
2.30250353
[8] 1.92678188 1.40986049 1.24320894 1.02967875 0.72502776 0.50683548
0.47841947
[15] 0.29167315 0.26636232 0.22462458 0.12793888 0.12483426 0.06548509
0.00000000
[22] 0.00000000
...
...
From the above code, the resultant components of the PCA object are the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the other components. The proportion of each variable along each principal component is given by the rotation matrix. Let us plot all the principal components and see how the variance is accounted for by each component in Fig. 6.31.
> par(mar = rep(2, 4))
> plot(PCA)
Clearly the first principal component accounts for maximum information. The
results of PCA can be represented as a biplot graph. Biplot is used to show the
proportions of each variable along the first two principal components as in Fig. 6.32.
The first two lines of the below code changes the direction of the biplot. If we do
not include the first two lines the plot will be mirror image of the below graph.
> PCA$rotation=-PCA$rotation
> PCA$x=-PCA$x
> biplot (PCA, scale = 0.2)
Fig. 6.32 is known as a biplot. In it, we can see the first two principal components (PC1 and PC2) of the crimtab dataset plotted in the graph. The arrows represent the loading vectors, which specify how the feature space varies along the principal component vectors. From the plot, we can see that the first principal component vector, PC1, places more or less equal weight on three features: 165.1, 167.64 and 170.18. This means that these three features are more correlated with each other. The second principal component, PC2, places more weight on 160.02 and 162.56 than on the other three features.
> library(caret)
> library(corrplot)
> library(plyr)
> dat <- read.csv("Sample.csv")
> set.seed(227)
# Remove variables having high percentage of missing values
> dat1 <- dat[, colMeans(is.na(dat)) <= .5]
> dim(dat1)
[1] 19622 93
> dim(dat)
[1] 19622 160
> nzv <- nearZeroVar(dat1)
# Remove Zero and Near Zero-Variance Predictors
> dat2 <- dat1[, -nzv]
> dim(dat2)
[1] 19622 59
> numericData <- dat2[sapply(dat2, is.numeric)]
# Compute the correlation matrix of the numeric predictors
> descrCor <- cor(numericData)
> summary(descrCor[upper.tri(descrCor)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.992008 -0.101969 0.001729 0.001405 0.084718 0.980924
> corrplot(descrCor, order = "FPC", method = "color", type = "lower",
tl.cex = 0.7, tl.col = rgb(0, 0, 0))
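The step that builds dat3 is not shown in the listing; a plausible reconstruction removes the highly correlated predictors found above and keeps the class label (the 0.75 cutoff is an assumption):
# Remove highly correlated predictors and retain the class label
> highlyCorrelated <- findCorrelation(descrCor, cutoff = 0.75)
> dat3 <- cbind(classe = dat2$classe, numericData[, -highlyCorrelated])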
Random Forest can also be used for feature selection, based on two measures of variable importance. The first one is the Gini gain produced by the variable, averaged over all trees. The second one is the permutation importance, i.e. the mean decrease in classification accuracy after permuting the variable, averaged over all trees. We sort the permutation importance scores in descending order and select the top k variables. The below code is for feature selection with Random Forest.
> library(randomForest)
> rf <- randomForest(classe ~ ., data = dat3, importance = TRUE, ntree = 100)
# Finding importance of variables
> imp = importance(rf, type=1)
> imp <- data.frame(predictors=rownames(imp),imp)
# Sorting the variables in descending order of their MeanDecreaseAccuracy
> imp.sort <- arrange(imp,desc(MeanDecreaseAccuracy))
> imp.sort$predictors <-
factor(imp.sort$predictors,levels=imp.sort$predictors)
> imp.20<- imp.sort[1:20,]
# Printing top 20 variables with high MeanDecreaseAccuracy
> print(imp.20)
predictors MeanDecreaseAccuracy
1 X 36.878224
2 raw_timestamp_part_1 19.939217
3 cvtd_timestamp 19.936367
4 pitch_belt 14.474235
5 roll_dumbbell 12.502391
6 gyros_belt_z 12.429689
7 num_window 11.491461
8 total_accel_dumbbell 11.193014
9 gyros_arm_y 10.509349
10 magnet_forearm_z 10.353922
11 gyros_dumbbell_x 10.245442
12 magnet_belt_z 10.078787
13 pitch_forearm 10.069103
14 roll_arm 10.049374
15 yaw_arm 9.959173
16 gyros_dumbbell_y 9.770771
17 gyros_belt_x 9.602383
18 magnet_forearm_y 9.407758
19 user_name 9.304626
20 gyros_forearm_x 8.954952
> varImpPlot(rf, type=1)
# Retaining only top 20 variables with high MeanDecreaseAccuracy in dat4
> dat4 = cbind(classe = dat3$classe, dat3[,c(imp.20$predictors)])
HIGHLIGHTS
The R packages that are related to data mining are stats, cluster, fpc,
sna, e1071, cba, biclust, clues, kohonen, rpart, party, randomForest, ada,
caret, arules, eclat, arulesViz, DMwR, dprep, Rlof, plyr, corrplot, RWeka,
gausspred, optimsimplex, CCMtools, FactoMineR and nnet.
The R packages and function for clustering are stats - kmeans(), cluster
- pam(), fpc - pamk(), cluster - agnes(), stats - hclust(), cluster - daisy(),
fpc - dbscan(), sna - kcores(), e1071 - cmeans(), cba - rockCluster(), biclust
- biclust(), clues - clues(), kohonen - som(), cba - proximus() and cluster -
clara().
The functions kmeans(), table(), plot() and points() are used for getting and plotting the results of k-means clustering.
R has the pam() function in the cluster package and the pamk() function in the fpc package to do k-medoids clustering.
The silhouette width decides the number of clusters in k-medoids
clustering.
Hierarchical clustering in R can be done using the function hclust() in the stats package.
The function dbscan() in the package fpc is used for the density based
clustering.
The R packages and functions for classification are e1071 - svm(), RWeka - IBk(), party - ctree(), rpart - rpart(), party - cforest(), randomForest - randomForest(), e1071 - naiveBayes(), ada - ada() and caret - train().
The function ctree() in the package party and the function rpart() in the
package rpart can be used to build the decision tree for the given data.
We can do the prediction using the function predict() for the given new
data.
The function randomForest() in the package randomForest can be used to
classify the given data.
The importance of variables in a dataset can be obtained with the functions
importance() and varImpPlot().
The built random forest is tested on test data, and the result is checked
with the functions table() and margin().
Case Studies
OBJECTIVES
This case study on text mining starts with using the twitter feeds from the dataset
“GameReview.csv” for further analysis. The extracted text is then transformed to
build a document-term matrix. Then, frequent words and associations are found
from the matrix. Important words in a document can be presented as a word cloud.
Packages used for text mining are “tm” and “wordcloud”.
> library(tm)
After that, we use stemCompletion() to complete the stems with the unstemmed
corpus corpcopy as a dictionary. With the default setting, it takes the most frequent
match in dictionary as completion.
> stemCompletion2 <- function(x, dictionary) {
+ x <- unlist(strsplit(as.character(x), " "))
+ x <- x[x != ""]
+ x <- stemCompletion(x, dictionary=dictionary)
+ x <- paste(x, sep="", collapse=" ")
+ stripWhitespace(x)
+}
> corp <- lapply(corp, stemCompletion2, dictionary=corpcopy)
> corp <- Corpus(VectorSource(corp))
> for (i in 1:5) {
+ cat(paste("[[", i, "]] ", sep=""))
+ writeLines(strwrap(corp[[i]], width=60))
+ }
[[1]] addicted deff downlaod epic cat lovers will fall love 3
[[2]] great game love game unlike game constant give money play
From the listings of the corpus before stemming, after stemming and after stem completion, we can see the words being changed. For example, in line 4 the word was “addictive” before stemming, it became “addict” after stemming, and after stem completion it became “addicted”.
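The construction of the term-document matrix is not shown in the listing; it would be built from the corpus along these lines (the wordLengths control is an assumption):
> tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf)))
> tdm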
As we can see from the above result, the term-document matrix is composed of
2117 terms and 1000 documents. It is very sparse, with 99% of the entries being zero.
We then have a look at the first six terms starting with “g” and tweets numbered
201 to 210.
> idx = grep(glob2rx("g*"), dimnames(tdm)$Terms)
> inspect(tdm[idx,201:210])
<<TermDocumentMatrix (terms: 81, documents: 10)>>
Many data mining tasks can be done, based on the above matrix. For example,
clustering, classification and association rule mining. When there are too many
terms, the size of a term-document matrix can be reduced by selecting terms that
appear in a minimum number of documents.
We will now have a look at the popular words and the association between
words from the 1000 tweets.
> findFreqTerms(tdm, lowfreq=80)
[1] “addicted” “love” “game” “good” “great” “play” “fun”
[8] “awesome” “get” “time” “like” “just” “update” “app”
[15] “can” “cant”
In the code above, findFreqTerms() finds frequent terms with frequency no less than eighty. Note that they are ordered alphabetically, instead of by frequency or popularity. To show the top frequent words visually, we next make a bar plot for them. From the term-document matrix, we can derive the frequency of terms with rowSums(). Then we select terms that appear in eighty or more documents and show them with a bar plot using the package ggplot2. In the code below, geom_bar(stat="identity") draws a bar plot and coord_flip() swaps the x-axis and y-axis. The bar plot clearly shows that the three most frequent words are “game”, “play” and “great”.
> termFrequency <- rowSums(as.matrix(tdm))
> termFrequency <- subset(termFrequency, termFrequency>=80)
> library(ggplot2)
> df <- data.frame(term=names(termFrequency), freq=termFrequency)
> ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
+ xlab("Terms") + ylab("Count") + coord_flip()
Alternatively, the above plot can also be drawn with barplot() as below, where
the argument las sets the direction of x-axis labels to be vertical.
> barplot(termFrequency, las=2)
It is also possible to find the highly associated words with another word with
the function findAssocs(). Below is the code to find the terms associated with the
words “game” and “play” with correlation no less than 0.20 and 0.25 respectively.
The words are ordered by their correlation with the terms “game” (or “play”).
> findAssocs(tdm, "game", 0.20)
$game
play player dont say new thatd ever
0.29 0.22 0.21 0.21 0.21 0.21 0.20
> findAssocs(tdm, "play", 0.25)
$play
game potential course dont year will meter
0.29 0.27 0.27 0.26 0.26 0.25 0.25
A word cloud can be drawn with the package wordcloud to show the importance of the words visually. Words with frequency below twenty are not plotted, as specified by min.freq=20. By setting random.order=F, frequent words are plotted first, which makes them appear in the centre of the cloud. The colours are taken from a brewer.pal() palette (the grayLevels vector computed below is an alternative grey scale based on frequency).
> library(wordcloud)
> m <- as.matrix(tdm)
> wordFreq <- sort(rowSums(m), decreasing=TRUE)
> pal <- brewer.pal(9, "BuGn")
> pal <- pal[-(1:4)]
> set.seed(375)
> grayLevels <- gray((wordFreq + 10) / (max(wordFreq) + 10))
> wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=20,
random.order=F, colors=pal)
The above word cloud clearly shows again that "game", "play" and "great" are the top three words, which confirms that the tweets are game reviews. Other prominent words are "love", "good" and "fun", which suggests that the reviews of the game are largely positive.
Next, we cluster the frequent terms. Sparse terms are first removed with removeSparseTerms(), and the distances between terms are calculated with dist() after scaling. After that, the terms are clustered with hclust() and the dendrogram is cut into 4 clusters. The agglomeration method is set to "ward.D", which at each step merges the pair of clusters giving the smallest increase in variance.
> tdm2 <- removeSparseTerms(tdm, sparse=0.90)
> m2 <- as.matrix(tdm2)
> distMatrix <- dist(scale(m2))
> fit <- hclust(distMatrix, method="ward.D")
> plot(fit)
> rect.hclust(fit, k=4)
> (groups <- cutree(fit, k=4))
addicted     love     game     good    great     play      fun  awesome      get     time
       1        1        2        3        4        1        4        1        1        1
Figure 7.4 Cluster of Words with High Frequencies Using Hierarchical Clustering
In the above dendrogram, we can see the topics in the tweets grouped into 4 different clusters. The most frequent word "game" is in the second cluster, the next frequent word "good" is in the third cluster, the words "great" and "fun" fall under the fourth cluster, and the remaining words with lower frequency fall under the first cluster.
Next we cluster the tweets using the k-means clustering algorithm, which takes the values in the matrix as numeric. We therefore transpose the term-document matrix into a document-term matrix. The tweets are then clustered with kmeans() with the number of clusters set to eight. After that, we check the popular words in every cluster as well as the cluster centres. A fixed random seed is set with set.seed() before running kmeans(), so that the results are reproducible.
> m3 <- t(m2)
> set.seed(123)
> k <- 8
> kmeansResult <- kmeans(m3, k)
> round(kmeansResult$centers, digits=3)
addicted love game good great play fun awesome get time
1 0.168 0.107 1.267 0.137 1.313 0.069 0.176 0.130 0.145 0.061
2 0.227 0.394 3.652 0.409 0.348 0.803 0.273 0.258 0.409 0.258
3 0.038 0.017 1.174 0.547 0.000 0.127 0.068 0.127 0.081 0.106
4 0.212 0.061 1.091 0.333 0.030 0.606 0.212 0.030 2.242 0.788
5 0.183 0.113 1.507 0.113 0.183 0.887 1.085 0.197 0.197 0.254
6 1.205 0.154 0.397 0.077 0.090 0.154 0.423 0.179 0.026 0.064
7 0.000 0.118 0.000 0.125 0.104 0.146 0.212 0.090 0.066 0.097
8 0.237 1.381 1.216 0.031 0.031 0.216 0.062 0.175 0.134 0.093
To make it easy to find what the clusters are about, we then check the top three
words in every cluster.
> for (i in 1:k) {
+ cat(paste("cluster ", i, ": ", sep=""))
+ s <- sort(kmeansResult$centers[i,], decreasing=T)
+ cat(names(s)[1:3], "\n")
+ }
From the above top words and cluster centres, we can see that the clusters cover different topics. In every cluster except cluster 7, the word "game" appears among the top words, and each of these clusters discusses the game from a different angle.
The dataset has many attributes that define the credibility of the customers seeking several types of loans. The values of these attributes can contain outliers that do not fit into the regular range of the data. Hence, the outliers have to be removed before the dataset is used for further modelling. Outlier detection for the quantitative features is done using the function levels(), while for the numeric features the box-plot technique is used, implemented with the daisy() function of the cluster package. Before this, the numeric data has to be normalized into the domain [0, 1]. The agglomerative hierarchical clustering algorithm is used for outlier ranking, via the outliers.ranking() function of the DMwR package. After ranking the outlier data, the observations that are out of range are disregarded and the remaining outlying values are replaced with null values.
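As an illustration of the first of these steps, a minimal sketch of inspecting a quantitative attribute with levels() is given below; the attribute A1 is used only as an example, on the creditdata dataset loaded later in this section.
> creditdata$A1 <- as.factor(creditdata$A1)
> levels(creditdata$A1)    #any unexpected code indicates an outlying value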
Inconsistencies in the data, such as an unbalanced class distribution, have to be corrected before building the classification model. Many real-world datasets have this problem, and it needs to be rectified for better results. Before this step, the sample dataset is split into training and test datasets in the ratio 4:1 (i.e. the training dataset holds 80% of the data and the test dataset the remaining 20%). The balancing step is then executed on the training dataset using the SMOTE() function of the DMwR package.
Next, using the training dataset, the correlation between the various attributes is checked to see whether any redundant information is represented by two attributes. This is implemented using the plotcorr() function in the ellipse package. The unique features are then ranked, and based on a threshold the most highly ranked features are chosen for model building. For ranking the features, the randomForest() function of the randomForest package is used. The threshold for selecting the number of important features is chosen using the rfcv() function of the same package.
Now the resultant dataset with the reduced number of features is ready for use by the classification algorithms. Classification is one of the data analysis methods that predict class labels. Classification can be done in several ways, and one of the most appropriate for the chosen problem is decision trees. Classification is done in two steps: (i) the class labels of the training dataset are used to build the decision tree model, and (ii) this model is applied to the test dataset to predict its class labels. For the first step the function rpart() of the rpart package is used, and the predict() function executes the second step. The resultant prediction is then evaluated against the original class labels of the test dataset to find the accuracy of the model.
Dataset Attribute Types
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 Def
Q N Q Q N Q Q N Q Q N Q N Q Q N Q N B B B
Q: Quantitative N: Numeric B: Binary
Dataset Selection
The German Credit Scoring dataset in numeric format is used for the implementation of this model; its attributes and their descriptions are given in the table below. After selecting and understanding the dataset, it is loaded into R with the name creditdata using the code below.
> creditdata <- read.csv("UCI German Credit Data Numeric.csv",
header = TRUE, sep = ",")
> nrow(creditdata)
[1] 1000
Data Pre-Processing
1) Outlier Detection: To identify the outliers of the numeric attributes, their values are normalized into the domain [0, 1] and plotted as a box plot to view the outlier values, as in Fig. 7.5. The code and the result for this step are given below.
> normalization <- function(data, x)
+ { for (j in x)
+   { data[!(is.na(data[,j])), j] =
+       (data[!(is.na(data[,j])), j] - min(data[!(is.na(data[,j])), j])) /
+       (max(data[!(is.na(data[,j])), j]) - min(data[!(is.na(data[,j])), j])) }
+   return(data) }
> c <- c(2,5,8,11,13,16,18)
> normdata <- normalization(creditdata, c)
> boxplot(normdata[,c])
2) Outlier Ranking: The outliers are then ranked with the outliers.ranking() function of the DMwR package, based on the distance matrix computed with daisy() as described above.
> outlierdata=outliers.ranking(distance, test.data=NULL, method="sizeDiff",
clus = list(dist="euclidean", alg = "hclust", meth="average"),
power = 1, verb = F)
3) Outliers Removal: The observations which are out of range (based on the rankings) are removed using the code below. After outlier removal the dataset creditdata is renamed as creditdata_noout.
> boxplot(outlierdata$prob.outliers[outlierdata$rank.outliers])
> n=quantile(outlierdata$rank.outliers)
> n1=n[1]
> n4=n[4]
> filler=(outlierdata$rank.outliers > n4*1.3)
> creditdata_noout=creditdata[!filler,]
> nrow(creditdata_noout)
[1] 975
4) Imputations Removal: The method used for removing null values is multiple imputation, in which the k-nearest-neighbours algorithm is used for both numeric and quantitative attributes. The numeric features are normalized before calculating the distances between objects. The following code is used for imputation; afterwards the dataset creditdata_noout is renamed as creditdata_noout_noimp.
> require(DMwR)
> creditdata_noout_noimp=knnImputation(creditdata_noout, k = 5, scale = T,
meth = "weighAvg", distData = NULL)
> nrow(creditdata_noout_noimp)
[1] 975
There were no null values for the attributes in the dataset we have chosen and
hence the number of records remains unchanged after the above step.
5) Splitting Training and Test Datasets: Before proceeding to the further steps, the dataset has to be split into training and test datasets so that the model can be built using the training dataset. The code for splitting the dataset is listed below.
> library(DMwR)
> split<-sample(nrow(creditdata_noout_noimp),
round(nrow(creditdata_noout_noimp)*0.8))
> trainingdata=creditdata_noout_noimp[split,]
> testdata=creditdata_noout_noimp[-split,]
6) Balancing Training Dataset: The SMOTE() function handles unbalanced classification problems by generating a new, balanced dataset. It artificially generates observations of the minority class using the nearest neighbours of elements of this class. The following code is used to balance the training dataset.
> creditdata_noout_noimp_train=trainingdata
> creditdata_noout_noimp_train$default <-
factor(ifelse(creditdata_noout_noimp_train$Def == 1, "def", "nondef"))
> creditdata_noout_noimp_train_smot <-
SMOTE(default ~ ., creditdata_noout_noimp_train, k=5, perc.over = 500)
The data distributions before and after balancing are shown in Fig. 7.6 and Fig. 7.7. The plotting method is based on proximities between objects and produces a spatial representation of them; proximities represent the similarity or dissimilarity between data objects. The code used to plot these objects is shown below.
> library(cluster)
> dist1=daisy(creditdata_noout_noimp_train[,-21], stand=TRUE, metric=c("gower"),
type = list(interval=c(2,5,8,11,13,16,18),
nominal=c(1,3,4,6,7,9,10,12,14,15,17), binary=c(19,20)))
> dist2=daisy(creditdata_noout_noimp_train_smot[,-21],
stand=TRUE, metric=c("gower"),
type = list(interval=c(2,5,8,11,13,16,18),
nominal=c(1,3,4,6,7,9,10,12,14,15,17), binary=c(19,20)))
> loc1=cmdscale(dist1, k=2)
> loc2=cmdscale(dist2, k=2)
> x1=loc1[,1]
> y1=loc1[,2]
> x2=loc2[,1]
> y2=loc2[,2]
> plot(x1, y1, type="n")
> text(x1, y1, labels=creditdata_noout_noimp_train[,22],
col=as.numeric(creditdata_noout_noimp_train[,22])+4)
> plot(x2, y2, type="n")
> text(x2, y2, labels=creditdata_noout_noimp_train_smot[,22],
col=as.numeric(creditdata_noout_noimp_train_smot[,22])+4)
Features Selection
1) Correlation Analysis: Datasets may contain irrelevant or redundant features which make the model more complicated, so removing such features speeds up model building. The function plotcorr() plots a correlation matrix using ellipse-shaped glyphs for each entry, showing the correlation between the features in an easily readable, coloured plot. The following code displays the correlation, checked independently for each data type (numeric and quantitative). From the results in Fig. 7.8 and Fig. 7.9, it is observed that there is no significant correlation between any of the features, whether numeric or quantitative; hence, in this step none of the features are removed.
> library(ellipse)
> c = c(2,5,8,11,13,16,18)
> plotcorr(cor(creditdata_noout_noimp_train[,c]), col=cl<-c(7,6,3))
> c = c(1,3,4,6,7,9,10,12,14,15,17)
> plotcorr(cor(creditdata_noout_noimp_train[,c]), col=cl<-c("green","red","blue"))
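The listing that builds the random forest used for ranking the features is not shown above; a minimal sketch, assuming the balanced training dataset from above (the seed and formula are illustrative), would be:
> library(randomForest)
> set.seed(123)
> randf <- randomForest(default ~ . - Def, data = creditdata_noout_noimp_train_smot,
+     importance = TRUE)    #Def is excluded since it duplicates the class label
> importance(randf, type = 1)    #mean decrease in accuracy per feature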
The function importance() displays the feature importances using the "mean decrease accuracy" measure, as in the table below. The measures can be plotted using the function varImpPlot(), as shown in Fig. 7.10.
A5 6.238347
A6 4.554283
A7 3.316346
A8 0.59622
A9 1.634721
A10 1.383725
A11 0.541585
A12 2.344433
A13 2.621854
A14 4.629331
A15 0.825801
A16 1.225997
A17 0.635881
A18 0.037408
A19 1.117891
A20 1.388876
> varImpPlot(randf)
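The cross-validation code behind Fig. 7.11 is likewise not listed; a minimal sketch using rfcv(), assuming the balanced training data from above (the fold count is illustrative; columns 21 and 22 hold the class labels), would be:
> set.seed(123)
> cv <- rfcv(creditdata_noout_noimp_train_smot[, -c(21, 22)],
+     creditdata_noout_noimp_train_smot$default, cv.fold = 5)
> with(cv, plot(n.var, error.cv, type = "b", xlab = "Number of features",
+     ylab = "Cross-validation error"))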
The result of this cross-validation is shown in Fig. 7.11 and indicates that the best number of features is 15. Hence we select the features A1, A2, A3, A5, A6, A7, A9, A10, A12, A13, A14, A16, A19, A20 and Def to build the model.
Building Model
Classification is one of the data analysis forms that predicts categorical labels. We
used the decision tree model to predict the probability of default. The following
code uses the function rpart() and finds a model from the training dataset.
> library(rpart)
> c = c(4, 8, 11, 15, 17, 18, 22)
> trdata=data.frame(creditdata_noout_noimp_train[,-c])
> tree=rpart(trdata$Def~., data=trdata, method="class")
> printcp(tree)
The result of this code is displayed below.
Classification tree:
rpart(formula = trdata$Def ~ ., data = trdata, method = "class")
Variables actually used in tree construction:
[1] A1 A12 A13 A2 A3 A5 A6 A9
Root node error: 232/780 = 0.29744
n= 780
The commands below plot the classification tree shown in Fig. 7.12.
> plot(tree, uniform=TRUE, main="Classification Tree")
> text(tree, use.n=TRUE, all=TRUE, cex=0.7)
Prediction
The model is tested on the test dataset using the predict() function. The code and the resulting confusion matrix are displayed below.
> predicttest=data.frame(testdata)
> pred=predict(tree, predicttest)
> c=c(21)
> table(predict(tree, testdata, type="class", na.action=na.pass), testdata[, c])
       def nondef
def     30      5
nondef   6    154
Evaluation
Common metrics calculated from the confusion matrix are Precision, Accuracy, True Positive Rate (TP Rate) and False Positive Rate (FP Rate). They are calculated as follows.
Precision = True Defaults / (True Defaults + False Defaults)
Accuracy = (True Defaults + True Non-defaults) / Total Test Set
TP Rate = True Defaults / Total Defaults
FP Rate = False Defaults / Total Non-defaults
From the confusion matrix above we obtain the following values, from which the metrics are derived.
True Defaults = 30
False Defaults = 6
Total Defaults = 35
True Non-defaults = 154
False Non-defaults = 5
Total Non-defaults = 160
Total Test Set = 195
Precision = 30 / (30 + 6) = 0.833
Accuracy = (30 + 154) / 195 = 0.943
TP Rate = 30 / 35 = 0.857
FP Rate = 6 / 160 = 0.037
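These values can also be computed directly from the confusion matrix in R. A minimal sketch follows, assuming the def/nondef labelling shown in the table above (rows of cm are the predictions, columns the actual labels, with "def" first):
> cm <- table(predict(tree, testdata, type = "class"), testdata[, 21])
> precision <- cm[1, 1] / sum(cm[1, ])    #true defaults / predicted defaults
> accuracy <- sum(diag(cm)) / sum(cm)
> tp.rate <- cm[1, 1] / sum(cm[, 1])      #true defaults / total defaults
> fp.rate <- cm[1, 2] / sum(cm[, 2])      #false defaults / total non-defaults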
These results show that the proposed model is performing with high accuracy
and precision and hence can be applied for credit scoring.
The objective of this case study is to support the effective deployment of police resources, to perform temporal analysis of crime data and to capture the trend of crimes happening.
chron - Creates chronological objects which represent dates and times of day.
ggplot2 - A system for creating graphs, based on the data you provide.
read.csv() - Reads a file in table format and creates a data frame from it.
which() - Gives the TRUE indices of a logical object, allowing for array indices.
cut() - Divides the range of x into intervals and codes the values in x according to which interval they fall into. The leftmost interval corresponds to level one, the next leftmost to level two, and so on.
labels() - Finds a suitable set of labels from an object for use in printing or plotting.
length() - Gets or sets the length of vectors (including lists) and factors, and of any other R object for which a method has been defined.
ifelse() - Returns a value with the same shape as test, filled with elements selected from either yes or no depending on whether each element of test is TRUE or FALSE.
aggregate() - Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
qplot() - The basic plotting function of ggplot2, used to create many different kinds of plots.
The dataset was analyzed to get details such as the file size, the number of records, and the fields specified and their meaning. The U.S. crime dataset is loaded into R, and the organization of its data fields and its important dimensions are understood. Each record represents one crime incident, and the various crime types are manually analyzed. The dataset is loaded using the read.csv() function. After this, the required packages for this project are installed and loaded using the commands below.
> install.packages("chron")
> library(chron)
> install.packages("Rcpp", dependencies = TRUE)
> library(Rcpp)
> install.packages("ggplot2")
> library(ggplot2)
Pre-processing deals with removing duplicate records, removing records with missing values, removing records with incorrect values, formatting the timestamp field (splitting the date and time parts of the data), binning the time into intervals (4 intervals of 6 hours each) and grouping similar crimes into one crime type. The functions used for these preprocessing steps are subset(), as.POSIXlt(), weekdays(), months(), chron(), cut(), table(), length(), as.character() and ifelse().
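A minimal sketch of the binning step, assuming the time column created with times() in the listing below (the interval labels are illustrative):
> time.tag <- chron(times = c("00:00:00", "06:00:00", "12:00:00",
+     "18:00:00", "23:59:00"))
> crime.data$time.tag <- cut(crime.data$time, breaks = time.tag,
+     labels = c("00-06", "06-12", "12-18", "18-00"), include.lowest = TRUE)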
Finally, we explore and visualize the data: which crime types lead to arrests; the frequency of the different crime types; the hours of the day, days of the week and months of the year in which more crimes happen; and the occurrence of the various crime types across the hours of the day, the days of the week and the months of the year. All this exploration and visualization is done using the functions qplot() and ggplot() of the package ggplot2, along with factor() and aggregate().
#Loading data
> crime.data <- read.csv("crime.data.csv")
#Date conversion
> crime.data$date <- as.POSIXlt(crime.data$DATE..OF.OCCURRENCE,
format= "%m/%d/%Y %I:%M:%S %p")
> crime.data <- subset(crime.data, !is.na(crime.data$date))
> crime.data$time <- times(format(crime.data$date, "%H:%M:%S"))
#Formatting date
> crime.data$date <- as.POSIXlt(strptime(crime.data$date, format = "%Y-%m-%d"))
> crime.data$day <- weekdays(crime.data$date, abbreviate = TRUE)
> crime.data$month <- months(crime.data$date, abbreviate = TRUE)
> table(crime.data.arrest$crime)
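As an illustration of the exploration step, a minimal sketch of one such plot, the number of crimes in each time interval of the day, is given below; it assumes the time.tag column from the binning sketch above.
> qplot(time.tag, data = crime.data, geom = "bar",
+     xlab = "Time of day", ylab = "Number of crimes")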
In this case study, baseball players' data is analyzed on several aspects, such as the variation of salary over a period of time with respect to league, team, etc. The players' career records are analyzed based on their hits and runs, and their batting averages are calculated. The analysis results are then presented in the form of graphs and histograms using the functions available in R. The objective is to identify the trend of baseball players' salaries over the years, to understand the correlation between players' salaries and their performances, to analyze whether the age, country, height and weight of the players have an impact on their performance, and to capture the details of the top-performing baseball players.
k. Filter the records of the players in the years in which they have not had a chance to bat (AB > 0)
c. Exploratory analysis of the baseball team data
a. Visualize the trend of how salaries change over time
b. Find one player's salary, team and other details
c. Find the relation of the player's salary with his height, weight and birth country
d. Find how each player was batting year-wise
e. Visualize the correlation of salary with the players' performance
f. Visualize each player's career record (e.g. total hits and runs) based on their highest rank
g. Visualize the correlation between the players' hits and runs
h. Visualize the batting averages of the players in a histogram
ggplot2 - A system for creating graphs, based on the data you provide.
read.csv() - Reads a file in table format and creates a data frame from it.
merge() - Merges two data frames by common column or row names, or performs other versions of database join operations.
order() - Returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.
max() - Returns the (parallel) maxima of the input values.
min() - Returns the (parallel) minima of the input values.
ggplot() - Initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
The dataset was analyzed to get details such as the file size, the number of records, and the fields specified and their meaning. The datasets are loaded into R, their field organization is understood and the records are manually analyzed. The datasets are loaded using the read.csv() function of R. After this, the required packages for this project are installed and loaded using the commands below.
> install.packages("data.table")
> library(data.table)
> install.packages("Rcpp", dependencies = TRUE)
> library(Rcpp)
> install.packages("ggplot2")
> library(ggplot2)
Pre-processing deals with merging the datasets (to study player-wise batting performance) and filtering out the records of the players in the years in which they have not had a chance to bat (AB > 0); the three datasets are merged to obtain a player-wise view of batting performance.
Exploration and visualization deals with finding the trend of players' salaries over the years (for the entire dataset, and for years >= 1990 in the American League only); finding the year-wise average salary, as well as the year-and-league-wise and year-and-team-wise average salaries; finding the correlation between the players' hits and runs; finding one player's salary details across the different years and the teams he belonged to; finding the batting averages of the players; and visualizing players' salary against height, weight and birth country. All this exploration and visualization is done using the function ggplot() of the package ggplot2.
#Loading data
> salaries <- read.csv("salaries.csv", header=TRUE)
> master <- read.csv("master.csv", header=TRUE)
> batting <- read.csv("batting.csv", header=TRUE)
> salaries.filtered.sorted = salaries.filtered[order(salary), ]
> summarized.lg = salaries[, list(Average=mean(salary), Maximum=max(salary),
minimum=min(salary)), by="lgID"]
> summarized.year.lg = salaries[, list(Average=mean(salary), Maximum=max(salary),
minimum=min(salary)), by=c("yearID","lgID")]
> summarized.year.play = salaries[, list(Average=mean(salary), Maximum=max(salary),
minimum=min(salary)), by=c("yearID","playerID")]
> summarized.year.team = salaries[, list(Average=mean(salary), Maximum=max(salary),
minimum=min(salary)), by=c("yearID","teamID")]
> batting = as.data.table(batting)
> salaries.reduced.round <- round(as.numeric(y, 1))
> summarized.year.lg$average.reduce.round <- paste(summarized.year.lg$Average/10000)
> averages.reduce.round <- round(as.numeric(summarized.year.lg$average.reduce.round, 1))
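As a minimal sketch of one of the ggplot() visualisations described above, the year-wise and league-wise average salary (Fig. 7.24) can be drawn as below; the aesthetics are illustrative.
> ggplot(summarized.year.lg, aes(x = yearID, y = Average, colour = lgID)) +
+     geom_line() + xlab("Year") + ylab("Average salary")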
Figure 7.22 ggplot of Trend of Salaries for Years > 1990 in the American League
Figure 7.24 ggplot of Year-wise and League-wise Average Salary
Figure 7.25 ggplot of Year-wise and Team-wise Average Salary
Figure 7.26 ggplot of Correlation between the Players' Hits and Runs
This case study uses the Twitter text data from the Text Mining section. The terms in this data can be considered as people and the tweets as groups, as on LinkedIn; the term-document matrix is then a representation of the group memberships of people.
In this case study, we first build the network of terms based on their co-
occurrence in the same tweets and then build a network of tweets based on the
terms shared by them. After this, we also build a two-mode network composed of
both terms and tweets.
As a first step we build the term-document matrix as in the Text Mining section
using the below code.
> library(tm)
#Reading the input file
> dat <- read.csv("GameReview.csv", stringsAsFactors = FALSE)
#Converting it to a corpus
> corp <- Corpus(VectorSource(dat$text))
#Preprocessing - removing stop words, punctuation, whitespace etc.
> corp <- tm_map(corp, content_transformer(tolower))
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, stripWhitespace)
> corp <- tm_map(corp, removeWords, stopwords("english"))
#Converting into a term-document matrix
> tdm <- TermDocumentMatrix(corp)
#Removing sparse terms
> tdm2 <- removeSparseTerms(tdm, sparse=0.96)
#Converting into a matrix
> termDocMatrix <- as.matrix(tdm2)
> termDocMatrix <- termDocMatrix[, 1:150]
We then inspect part of this matrix, build a network showing the relationship between the frequent terms, and make the graph more readable by setting colors, font sizes and transparency of vertices and edges.
> termDocMatrix[5:10,1:20]
Docs
Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
good 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0
great 0 1 0 0 1 0 0 0 1 0 2 0 0 0 1 0 0 0 0 0
money 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
play 0 1 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0
fun 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0
much 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> termDocMatrix[termDocMatrix>=1] <- 1
> termMatrix <- termDocMatrix %*% t(termDocMatrix)
> termMatrix[5:10,5:10]
Terms
Terms good great money play fun much
good 29 6 3 8 4 1
great 6 25 1 5 5 1
money 3 1 9 3 1 2
play 8 5 3 25 10 2
fun 4 5 1 10 35 4
much 1 1 2 2 4 8
In the above code, %*% is the operator for the product of two matrices, and the function t() transposes a matrix. With these we build a term-term adjacency matrix, in which the rows and columns represent terms and every entry is the number of co-occurrences of two terms. Next we can build a graph with the function graph.adjacency() from the package igraph.
> library(igraph)
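The construction of the term graph g used in the remainder of this section is not listed above; a minimal sketch following the text (the seed and layout are illustrative) would be:
> g <- graph.adjacency(termMatrix, weighted=TRUE, mode="undirected")
> g <- simplify(g)    #remove loops and multiple edges
> V(g)$label <- V(g)$name
> V(g)$degree <- degree(g)
> set.seed(3952)
> layout1 <- layout.fruchterman.reingold(g)
> plot(g, layout=layout1)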
Next, we set the label size of vertices based on their degrees, to make important
terms stand out. Similarly, we also set the width and transparency of edges based
on their weights. This is useful in applications where graphs are crowded with many
vertices and edges. The vertices and edges in the below code are accessed with V()
and E(). The function rgb(red, green, blue, alpha) defines the colors. With the same
layout, we plot the graph again as in Fig. 7.32.
> V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2
> V(g)$label.color <- rgb(0, 0, .2, .8)
Figure 7.32 Network of Terms with Label Size of Vertices Based on their Degrees
Next, we try to detect communities in the graph, called cohesive blocks, and then plot the network of terms based on these blocks as in Fig. 7.33.
> blocks <- cohesive.blocks(g)
> blocks
Cohesive block structure:
B-1 c 15, n 31
'- B-2 c 16, n 30 oooooooooo .ooooooooo oooooooooo o
   '- B-3 c 17, n 28 ooo.oooooo .ooooooooo oooooooooo .
> plot(blocks, g, vertex.size=.3, vertex.label.cex=1.5, edge.color=rgb(.4,.4,0,.3))
Next we plot the network of terms based on maximal cliques as in Fig. 7.34.
> cl <- maximal.cliques(g)
> length(cl)
[1] 286
> colbar <- rainbow(length(cl) + 1)
Next we plot the network of terms based on the largest cliques as in Fig. 7.35.
> cl <- largest.cliques(g)
> length(cl)
[1] 41
> colbar <- rainbow(length(cl) + 1)
> for (i in 1:length(cl)) {
+ V(g)[cl[[i]]]$color <- colbar[i+1]
+ }
> plot(g, mark.groups=cl, vertex.size=.3, vertex.label.cex=1.5,
edge.color=rgb(.4,.4,0,.3))
Since the two frequent words "game" and "play" appear in many tweets, most tweets are connected with others and the graph of tweets is very crowded. To simplify the graph and find relationships between tweets beyond these two keywords, we remove the two words before building the graph.
> idx <- which(dimnames(termDocMatrix)$Terms %in% c("game", "play"))
> M <- termDocMatrix[-idx,]
> tweetMatrix <- t(M) %*% M
> g <- graph.adjacency(tweetMatrix, weighted=T, mode = "undirected")
> V(g)$degree <- degree(g)
> g <- simplify(g)
> V(g)$label <- V(g)$name
> V(g)$label.cex <- 1
> V(g)$label.color <- rgb(.4, 0, 0, .7)
> V(g)$size <- 2
> V(g)$frame.color <- NA
Next, we have a look at the distribution of the degrees of vertices; the result is shown in the bar graph in Fig. 7.36. We can see that there are around 20 isolated vertices (with a degree of zero). Note that most of them are caused by the removal of the two keywords, "game" and "play".
> barplot(table(V(g)$degree))
With the code below, the vertex colours are set based on degree, and labels
of isolated vertices are set to tweet IDs and the first 10 characters of every tweet.
The labels of other vertices are set to tweet IDs only, so that the graph will not
be overcrowded with labels. The colour and width of edges are set based on their
weights. The produced graph is shown in Fig. 7.37.
> idx <- V(g)$degree == 0
> V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
> V(g)$label[idx] <- paste(V(g)$name[idx], substr(dat$text[idx], 1, 10), sep=": ")
> egam <- (log(E(g)$weight)+.2) / max(log(E(g)$weight)+.2)
> E(g)$color <- rgb(.5, .5, 0, egam)
> E(g)$width <- egam
> set.seed(3152)
> layout2 <- layout.fruchterman.reingold(g)
> plot(g, layout=layout2)
The vertices in the crescent are isolated from all others; next, they are removed from the graph with the function delete.vertices() and the graph is re-plotted as in Fig. 7.38.
> g2 <- delete.vertices(g, V(g)[degree(g)==0])
> plot(g2, layout=layout.fruchterman.reingold)
Similarly, it is also possible to simplify the graph by removing edges with low weights using the function delete.edges(). After removing edges, some vertices become isolated and are also removed. The resulting graph is shown in Fig. 7.39.
> g3 <- delete.edges(g, E(g)[E(g)$weight <= 1])
> g3 <- delete.vertices(g3, V(g3)[degree(g3) == 0])
> plot(g3, layout=layout.fruchterman.reingold)
In Fig. 7.39, there are some groups (or cliques) of tweets. A few of them are listed below: the group of tweets (25, 35, 112) is about the word "awesome", the group (31, 47, 122) is about the word "good" and the group (57, 58, 67, 75, 103, 146) is about the word "addictive".
> dat$text[c(25,35,112)]
[1] " Awesome! A lot of fun!!"
[2] " Awesome Mysterious Game!! Fun game to play @ night before bed to wind
down!!"
[3] " Miss Awesome fun"
> dat$text[c(31,47,122)]
[1] " Error in patching Every time I try to log it it says error in patching but overall good
game."
[2] " Good For spending time while waiting for an appointment"
[3] " Good It is a good game to play while wasting time"
> dat$text[c(57,58,67,75,103,146)]
[1] " Addictive fun Perfect fun"
[2] " Wonderful Is a great game and addictive. Brilliant"
[3] " Addictive Great looking, fun game"
[4] " ADDICTIVE!!!! This is a fun and easy to play and lose!!"
[5] " Very fun Addictive game, similar to a Tomogotchi. You will want to check in on your
village and clan. Building, building, building and re-arranging you village. Some battles too.
Ver well constructed."
[6] " JD Very addictive fun gaming"
Fig. 7.40 shows that most tweets are around two centres, "game" and "play". Next, let's have a look at which tweets are about "game". In the code below, the function nei() returns all vertices which are neighbours of the vertex "game".
> V(g)[nei("game")]
+ 89/181 vertices, named:
[1] 2 3 4 5 6 8 9 11 13 15 17 18 20 21 27 28 29
[18] 30 31 34 35 37 38 39 40 42 44 51 53 54 55 58 59 60
[35] 61 63 64 66 67 71 72 73 76 80 81 82 83 85 87 90 91
[52] 92 93 94 95 97 98 99 101 102 103 105 107 108 109 110 111 115
[69] 116 117 118 119 120 122 125 127 128 129 131 134 136 138 140 141 143
[86] 144 145 148 149
We can also have a further look at which tweets contain both terms, "game" and "play".
> (rdmVertices <- V(g)[nei("game") & nei("play")])
+ 20/181 vertices, named:
[1] 2 6 34 35 37 42 44 59 61 66 73 82 92 107 122 131 134
[18] 143 144 149
> dat$text[as.numeric(rdmVertices$label)]
[1] " Great game I love this game. Unlike other games they constantly give you money to
play. They are always given you a bone. Keep up the good work."
[2] " Meh Used to be good until World Cup upgrade.\nNow it lags all the time, making
it difficult to play.\nMaybe if you spent more time getting the game to actually work and less
time trying to squeeze advertising into every nook of game play, we could have a winner."
...
Next, we remove "game" and "play" to show the relationships between tweets through other words. Isolated vertices are also deleted from the graph.
> idx <- which(V(g)$name %in% c("game", "play"))
> g2 <- delete.vertices(g, V(g)[idx-1])
> g2 <- delete.vertices(g2, V(g2)[degree(g2)==0])
> set.seed(209)
> plot(g2, layout=layout.fruchterman.reingold)
From Fig. 7.41, we can clearly see groups of tweets and their keywords, such as
“addictive”, “good” and “fun”.
HIGHLIGHTS
Text mining involves preprocessing the input text, deriving patterns within the preprocessed data, and finally evaluating the output.
A word cloud is used to present the important words in documents.
A corpus is a collection of text documents.
The most accurate and widely used credit scoring measure is the Probability of Default (PD).
The function importance() displays the feature importances using the "mean decrease accuracy" measure.
Common metrics calculated from the confusion matrix are Precision, Accuracy, True Positive Rate (TP Rate) and False Positive Rate (FP Rate).
The US crime dataset is used for the EDA of crimes in the US.
The pre-processing done in this case study consists of removing duplicate records, records with missing values and records with incorrect values, formatting the timestamp field, binning the time intervals (4 intervals of 6 hours each) and grouping similar crimes.
The objective of the EDA on the baseball data is to identify the trend of baseball players' salaries over the years, to understand the correlation between players' salaries and their performances, and to analyze whether the age, country, height and weight of the players have an impact on their performance.
Social Network Analysis (SNA) is the process of investigating social
structures through the use of networks and graph theory.
The package used for Text Mining is tm and the package for Social Network
Analysis is igraph.
The function nei() returns all vertices which are neighbours of the given
vertex.
Glossary
Base Environment - The functions and the variables from R's base package are stored in the base environment.
Basic Data Types - The basic data types in R are Numeric, Integer, Complex, Logical and Character.
Local Outlier Factor - The local outlier factor (LOF) is an algorithm for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.
arules - Package for mining association rules and frequent itemsets.
arulesViz - Package for visualizing association rules and frequent itemsets.
DMwR - Package that provides functions and data for "Data Mining with R".
plyr - Package with tools for splitting, applying and combining data.
sqldf - Package that performs SQL selects on R data frames.
aggregate() (stats) - Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
colMeans() (base) - Forms row and column means for numeric arrays.
colSums() (base) - Forms row and column sums for numeric arrays.
read.csv() (utils) - Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.
rpart() (rpart) - Recursive partitioning and regression trees; fits an rpart model.
system.file() (base) - Finds the full file names of files in packages etc.
WEBSITES
1. https://www.tutorialspoint.com/r/
2. http://www.r-tutor.com/
3. https://cran.r-project.org/manuals.html
4. https://www.r-bloggers.com/
5. http://www.rdatamining.com/examples/