R - Overview
R is freely available under the GNU General Public License, and pre-
compiled binary versions are provided for various operating systems like
Linux, Windows and Mac.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the
Department of Statistics of the University of Auckland in Auckland, New
Zealand. R made its first appearance in 1993.
Since mid-1997 there has been a core group (the "R Core Team") who can
modify the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment
for statistical analysis, graphics representation and reporting. The following
are the important features of R −
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either on screen or
printed on paper.
R - Environment Setup
Local Environment Setup
To set up a local environment for R, follow the steps given below.
Windows Installation
You can download the Windows installer version of R from R-3.2.2 for
Windows (32/64 bit) and save it in a local directory.
After installation you can locate the icon to run the Program in a directory
structure "R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files.
Clicking this icon brings up the R-GUI which is the R console to do R
Programming.
Linux Installation
R is available as a binary for many versions of Linux at the location R
Binaries.
The instructions to install R on Linux vary from flavor to flavor. These steps are
listed under each Linux version at the link mentioned above.
However, if you are in a hurry, then you can use yum command to install R
as follows −
$ yum install R
Now you can use install command at R prompt to install the required
package. For example, the following command will install plotrix package
which is required for 3D charts.
> install.packages("plotrix")
R - Basic Syntax
As a convention, we will start learning R programming by writing a "Hello,
World!" program. Depending on the needs, you can program either at R
command prompt or you can use an R script file to write your program.
Let's check both one by one.
R Command Prompt
Once you have R environment setup, then it’s easy to start your R
command prompt by just typing the following command at your command
prompt −
$ R
This will launch the R interpreter and you will get a prompt > where you can
start typing your program as follows −
> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"
R Script File
Usually, you will do your programming by writing your programs in script
files and then you execute those scripts at your command prompt with the
help of the R interpreter called Rscript. So let's start by writing the following
code in a text file called test.R −
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and execute it at Linux command
prompt as given below. Even if you are using Windows or other system,
syntax will remain same.
$ Rscript test.R
Comments
Comments are like helping text in your R program and they are ignored by
the interpreter while executing your actual program. Single comment is
written using # in the beginning of the statement as follows −
# My first program in R Programming
R does not support multi-line comments, but you can use the following
trick −
if(FALSE) {
   "This is a demo for multi-line comments and it should be put inside either a
   single or double quote"
}
myString <- "Hello, World!"
print(myString)
[1] "Hello, World!"
R - Data Types
Generally, while doing programming in any programming language, you
need to use various variables to store various information. Variables are
nothing but reserved memory locations to store values. This means that,
when you create a variable you reserve some space in memory.
You may like to store information of various data types like character, wide
character, integer, floating point, double floating point, Boolean etc. Based
on the data type of a variable, the operating system allocates memory and
decides what can be stored in the reserved memory.
The frequently used R-objects are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data
types of these atomic vectors, also termed as six classes of vectors. The
other R-Objects are built upon the atomic vectors.
v <- TRUE
print(class(v))
[1] "logical"
v <-23.5
print(class(v))
[1] "numeric"
v <-2L
print(class(v))
[1] "integer"
v <-2+5i
print(class(v))
[1] "complex"
v <-"TRUE"
print(class(v))
[1] "character"
v <- charToRaw("Hello")
print(class(v))
[1] "raw"
# Create a vector.
print(apple)
print(class(apple))
Lists
A list is an R-object which can contain many different types of elements
inside it like vectors, functions and even another list inside it.
# Create a list.
print(list1)
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using
a vector input to the matrix function.
# Create a matrix.
print(M)
Arrays
While matrices are confined to two dimensions, arrays can be of any
number of dimensions. The array function takes a dim attribute which
creates the required number of dimension. In the below example we create
an array with two elements which are 3x3 matrices each.
# Create an array.
print(a)
, , 2
Factors
Factors are created using the factor() function. The nlevels function
gives the count of levels.
# Create a vector.
print(factor_apple)
print(nlevels(factor_apple))
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each
column can contain different modes of data. The first column can be
numeric while the second column can be character and third column can be
logical. It is a list of vectors of equal length.
# Create the data frame.
BMI <- data.frame(
   gender = c("Male","Male","Female"),
   height = c(152,171.5,165),
   weight = c(81,93,78),
   Age = c(42,38,26)
)
print(BMI)
R - Variables
A variable provides us with named storage that our programs can
manipulate. A variable in R can store an atomic vector, a group of atomic
vectors or a combination of many R objects. A valid variable name consists
of letters, numbers and the dot or underscore characters. The variable name
starts with a letter, or with a dot that is not followed by a number.
Variable Name          Validity   Reason
var_name%              Invalid    Has the character '%'. Only dot(.) and underscore are allowed.
.var_name, var.name    Valid      Can start with a dot(.) but the dot(.) should not be followed by a number.
.2var_name             Invalid    The starting dot is followed by a number, making it invalid.
Variable Assignment
The variables can be assigned values using the leftward, rightward and equal-to
operators. The values of the variables can be printed
using the print() or cat() function. The cat() function combines multiple items
into a continuous print output.
var.1= c(0,1,2,3)
var.2<- c("learn","R")
c(TRUE,1)->var.3
print(var.1)
Note − The vector c(TRUE,1) has a mix of logical and numeric class. So
logical class is coerced to numeric class making TRUE as 1.
Data Type of a Variable
In R, a variable itself is not declared of any data type, rather it gets the
data type of the R object assigned to it. So R is called a dynamically typed
language, which means that we can change the data type of the
same variable again and again when using it in a program.
var_x <-"Hello"
var_x <-34.5
var_x <-27L
Finding Variables
To know all the variables currently available in the workspace we use
the ls() function. The ls() function can also use patterns to match the
variable names.
print(ls())
print(ls(pattern ="var"))
The variables starting with dot(.) are hidden; they can be listed using the
"all.names = TRUE" argument to the ls() function.
print(ls(all.names = TRUE))
Deleting Variables
Variables can be deleted by using the rm() function. Below we delete the
variable var.3. Printing the variable afterwards throws an error.
rm(var.3)
print(var.3)
All the variables can be deleted by using the rm() and ls() functions
together.
rm(list = ls())
print(ls())
R - Operators
An operator is a symbol that tells the interpreter to perform specific
mathematical or logical manipulations. The R language is rich in built-in
operators and provides the following types of operators.
Types of Operators
We have the following types of operators in R programming −
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language.
The operators act on each element of the vector.
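v <- c(2,5.5,6)
t <- c(8,3,4)
print(v+t)     # +    adds the two vectors
print(v-t)     # -    subtracts the second vector from the first
print(v*t)     # *    multiplies both vectors
print(v/t)     # /    divides the first vector by the second
print(v%%t)    # %%   gives the remainder of the first vector with the second
print(v%/%t)   # %/%  gives the integer quotient of the division: [1] 0 1 1
print(v^t)     # ^    raises the first vector to the exponent of the second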
Relational Operators
Following table shows the relational operators supported by R language.
Each element of the first vector is compared with the corresponding
element of the second vector. The result of comparison is a Boolean value.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
> − Checks if each element of the first vector is greater than the
corresponding element of the second vector.
print(v>t)
< − Checks if each element of the first vector is less than the
corresponding element of the second vector.
print(v < t)
== − Checks if each element of the first vector is equal to the
corresponding element of the second vector.
print(v == t)
<= − Checks if each element of the first vector is less than or equal to the
corresponding element of the second vector.
print(v<=t)
>= − Checks if each element of the first vector is greater than or equal to the
corresponding element of the second vector.
print(v>=t)
!= − Checks if each element of the first vector is unequal to the
corresponding element of the second vector.
print(v!=t)
Logical Operators
Following table shows the logical operators supported by R language. It is
applicable only to vectors of type logical, numeric or complex. All non-zero
numbers are considered as logical value TRUE.
| − Element-wise Logical OR operator. It combines each element of the first
vector with the corresponding element of the second vector and gives
TRUE if one of the elements is TRUE.
print(v|t)
! − Logical NOT operator. Takes each element of the vector and gives the
opposite logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
The logical operators && and || work on single values and give a single
logical value as output. From R 4.3.0 onward they raise an error if either
operand has more than one element, so the first element of each vector is
selected explicitly below.
&& − Logical AND operator. Gives TRUE only if both operands are TRUE.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v[1] && t[1])
it produces the following result −
[1] TRUE
|| − Logical OR operator. Gives TRUE if either operand is TRUE.
print(v[1] || t[1])
Assignment Operators
These operators are used to assign values to vectors.
<-, = or <<- − Left assignment.
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
-> or ->> − Right assignment.
c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
Miscellaneous Operators
These operators are used for specific purposes and not for general
mathematical or logical computation.
: − Colon operator. It creates the series of numbers in sequence for a vector.
v <- 2:8
print(v)
it produces the following result −
[1] 2 3 4 5 6 7 8
%in% − This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE
[,1] [,2]
[1,] 65 82
[2,] 82 117
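The 2x2 matrix printed above is the kind of result the %*% operator gives when a matrix is multiplied by another matrix. The following is a minimal sketch, not the recovered original: the matrix M and its values are an assumption, chosen because M %*% t(M) happens to reproduce the output shown.

```r
# Assumed 2x3 input matrix, filled row by row (illustrative values).
M <- matrix(c(2, 6, 5, 1, 10, 4), nrow = 2, ncol = 3, byrow = TRUE)

# %*% performs matrix multiplication; t(M) is the transpose of M.
product <- M %*% t(M)
print(product)
```

Note that * multiplies element by element, while %*% follows the rules of linear algebra, so the number of columns on the left must equal the number of rows on the right.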
R - Decision making
Decision making structures require the programmer to specify one or more
conditions to be evaluated or tested by the program, along with a
statement or statements to be executed if the condition is determined to
be true, and optionally, other statements to be executed if the condition is
determined to be false.
1 if statement
3 switch statement
A switch statement allows a variable to be tested for equality against a
list of values.
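The statements described above can be sketched as follows. The function names classify and day_type are hypothetical helpers introduced only for this illustration.

```r
# if / else if / else: exactly one branch runs, depending on the condition.
classify <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

# switch() compares its first argument against the named alternatives.
# An empty alternative ("Sat" = ,) falls through to the next one, and the
# final unnamed argument is used when nothing matches.
day_type <- function(day) {
  switch(day,
    "Sat" = ,
    "Sun" = "weekend",
    "weekday"
  )
}

print(classify(-3))
print(day_type("Sun"))
print(day_type("Tue"))
```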
R - Loops
There may be a situation when you need to execute a block of code several
number of times. In general, statements are executed sequentially. The
first statement in a function is executed first, followed by the second, and
so on.
1 repeat loop
2 while loop
Repeats a statement or group of statements while a given condition is
true. It tests the condition before executing the loop body.
3 for loop
Executes the loop body once for each element of a vector or list.
1 break statement
2 next statement
The next statement skips the rest of the current iteration and moves on to
the next iteration of the loop.
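A short sketch of the three loop types and the two control statements follows. The variable names count, total and odds are illustrative only.

```r
# repeat runs its body until a break statement is reached.
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) {
    break
  }
}

# while tests its condition before every pass through the body.
total <- 0
i <- 1
while (i <= 5) {
  total <- total + i
  i <- i + 1
}

# for visits each element of 1:6; next skips the even numbers.
odds <- c()
for (n in 1:6) {
  if (n %% 2 == 0) {
    next
  }
  odds <- c(odds, n)
}

print(count)
print(total)
print(odds)
```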
R - Functions
A function is a set of statements organized together to perform a specific
task. R has a large number of in-built functions and the user can create
their own functions.
The function in turn performs its task and returns control to the interpreter
as well as any result which may be stored in other objects.
Function Definition
An R function is created by using the keyword function. The basic syntax
of an R function definition is as follows −
function_name <- function(arg_1, arg_2, ...) {
Function body
}
Function Components
The different parts of a function are −
Return Value − The return value of a function is the last expression in the
function body to be evaluated.
R has many in-built functions which can be directly called in the program
without defining them first. We can also create and use our own functions
referred as user defined functions.
Built-in Function
Simple examples of in-built functions
are seq(), mean(), max(), sum(x) and paste(...). They are directly
called by user-written programs. You can refer to the list of most widely used R
functions.
print(seq(32,44))
print(mean(25:82))
print(sum(41:68))
User-defined Function
We can create user-defined functions in R. They are specific to what a user
wants and once created they can be used like the built-in functions. Below
is an example of how a function is created and used.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
Calling a Function
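# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
# Call the function new.function supplying 6 as an argument.
new.function(6)
# Create a function without an argument.
new.function <- function() {
   for(i in 1:5) {
      print(i^2)
   }
}
# Call the function without supplying an argument.
new.function()
# Create a function with arguments.
new.function <- function(a,b,c) {
   result <- a * b + c
   print(result)
}
# Call the function by position of arguments.
new.function(5,3,11)
# Create a function with default argument values (3 and 6 are illustrative).
new.function <- function(a = 3, b = 6) {
   result <- a * b
   print(result)
}
# Call the function without giving any argument.
new.function()
# Call the function with new values of the arguments.
new.function(9,5)
# Create a function whose second argument is only needed on the last line.
new.function <- function(a, b) {
   print(a^2)
   print(a)
   print(b)
}
# Evaluate the function without supplying one of the arguments.
new.function(6)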
R - Strings
Any value written within a pair of single quote or double quotes in R is
treated as a string. Internally R stores every string within double quotes,
even when you create them with single quote.
Double quotes can be inserted into a string starting and ending with single
quote.
Single quote can be inserted into a string starting and ending with double
quotes.
Double quotes can not be inserted into a string starting and ending with double
quotes, unless each one is escaped with a backslash.
Single quote can not be inserted into a string starting and ending with single
quote, unless it is escaped with a backslash.
print(a)
print(c)
print(d)
e <-'Mixed quotes"
print(e)
print(f)
print(g)
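The quoting rules can be tried out with a few valid strings. This is a sketch with illustrative variable names and contents; the last line shows the backslash escape, which lets the delimiter itself appear inside the string.

```r
a <- 'Start and end with single quote'
b <- "Start and end with double quotes"
c1 <- "single quote ' in between double quotes"
d <- 'Double quotes " in between single quote'
# A backslash escapes the delimiter so it can appear inside the string.
e <- "escaped \" double quote"

print(a)
print(b)
print(c1)
print(d)
# cat() shows the string content without the escaping that print() displays.
cat(e, "\n")
```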
String Manipulation
Concatenating Strings - paste() function
Many strings in R are combined using the paste() function. It can take any
number of arguments to be combined together.
Syntax
The basic syntax for paste function is −
paste(..., sep = " ", collapse = NULL)
collapse is an optional string used to join the elements of the result into a
single string. It does not affect the spaces within one string.
Example
a <-"Hello"
b <-'How'
print(paste(a,b,c))
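The effect of sep and collapse can be seen in a small sketch. The third string and the variable names joined, dashed and collapsed are assumptions made for this illustration.

```r
a <- "Hello"
b <- "How"
c1 <- "are you?"   # illustrative third string

# Default separator is a single space.
joined <- paste(a, b, c1)

# sep is placed between the arguments.
dashed <- paste(a, b, c1, sep = "-")

# collapse joins the elements of a vector result into one string.
collapsed <- paste(c("x", "y", "z"), 1:3, sep = "", collapse = "+")

print(joined)
print(dashed)
print(collapsed)
```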
Syntax
The basic syntax for format function is −
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
nsmall is the minimum number of digits to the right of the decimal point.
Example
print(result)
print(result)
print(result)
print(result)
# Numbers are padded with blank in the beginning for width.
print(result)
print(result)
print(result)
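The main format() parameters can be exercised one at a time. The sketch below uses illustrative input values and variable names; it is not the original example.

```r
# digits: total number of significant digits displayed.
nine_digits <- format(23.123456789, digits = 9)

# scientific = TRUE: display numbers in scientific notation.
sci <- format(c(6, 13.14521), scientific = TRUE)

# nsmall: minimum number of digits to the right of the decimal point.
fixed <- format(23.47, nsmall = 5)

# width: numbers are padded with blanks in the beginning.
padded <- format(13.7, width = 6)

# justify: strings are padded on the right ("left") or on both sides ("centre").
left <- format("Hello", width = 8, justify = "left")
centred <- format("Hello", width = 8, justify = "centre")

print(nine_digits)
print(sci)
print(fixed)
print(padded)
print(left)
print(centred)
```

Every call returns a character value, so the results are meant for display rather than further arithmetic.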
Syntax
The basic syntax for nchar() function is −
nchar(x)
Example
result <- nchar("Count the number of characters")
print(result)
Syntax
The basic syntax for toupper() & tolower() function is −
toupper(x)
tolower(x)
Example
print(result)
print(result)
Syntax
The basic syntax for substring() function is −
substring(x,first,last)
Example
print(result)
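The string functions covered in this chapter can be combined in one short sketch. The input string and variable names are illustrative.

```r
text <- "Extract"

# nchar() counts the characters, including spaces if there are any.
n <- nchar(text)

# toupper() and tolower() change the case of every character.
upper <- toupper(text)
lower <- tolower(text)

# substring() extracts the characters from position 5 to position 7.
part <- substring(text, 5, 7)

print(n)
print(upper)
print(lower)
print(part)
```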
R - Vectors
Vectors are the most basic R data objects and there are six types of atomic
vectors. They are logical, integer, double, complex, character and raw.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1
and belongs to one of the above vector types.
print("abc");
print(12.5)
# Atomic vector of type integer.
print(63L)
print(TRUE)
print(2+3i)
print(charToRaw('hello'))
v <-5:13
print(v)
v <-6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <-3.8:11.4
print(v)
print(seq(5,9,by=0.4))
s <- c('apple','red',5,TRUE)
print(s)
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
x <- t[c(-2,-5)]
print(x)
y <- t[c(0,0,0,0,0,0,1)]
print(y)
Vector Manipulation
Vector arithmetic
Two vectors of same length can be added, subtracted, multiplied or divided
giving the result as a vector output.
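# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)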
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
add.result <- v1+v2
print(add.result)
sub.result <- v1-v2
print(sub.result)
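v <- c(3,8,4,5,0,11,-9,304)
# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)
# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)
# Sorting character vectors in reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)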
R - Lists
Lists are the R objects which contain elements of different types like −
numbers, strings, vectors and another list inside it. A list can also contain a
matrix or a function as its elements. List is created using list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors
and a logical value.
# values.
print(list_data)
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
list("green",12.3))
print(list_data)
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
list("green",12.3))
print(list_data[1])
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
print(list_data$A_Matrix)
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
list("green",12.3))
list_data[4]<-"New element"
print(list_data[4])
list_data[4]<- NULL
print(list_data[4])
list_data[3]<-"updated element"
print(list_data[3])
$<NA>
NULL
Merging Lists
You can merge many lists into one list by combining them with the c()
function. Placing the lists inside one list() call would nest them instead.
print(merged.list)
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
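The flat, six-element result shown above is what c() produces when it combines two lists. The following sketch assumes two small input lists whose values match that output; list() is included only for contrast, because it nests its arguments rather than flattening them.

```r
# Assumed input lists (values chosen to match the output shown above).
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

# c() merges the elements into one flat list of six elements.
merged.list <- c(list1, list2)

# list() keeps the two lists as two separate elements.
nested.list <- list(list1, list2)

print(length(merged.list))
print(length(nested.list))
```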
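# Create lists.
list1 <- list(1:5)
print(list1)
list2 <- list(10:14)
print(list2)
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Now add the vectors.
result <- v1+v2
print(result)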
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
R - Matrices
Matrices are the R objects in which the elements are arranged in a two-
dimensional rectangular layout. They contain elements of the same atomic
types. Though we can create a matrix containing only characters or only
logical values, they are not of much use. We use matrices containing
numeric elements to be used in mathematical calculations.
Syntax
The basic syntax for creating a matrix in R is −
matrix(data, nrow, ncol, byrow, dimnames)
data is the input vector which becomes the data elements of the matrix.
byrow is a logical clue. If TRUE then the input vector elements are arranged by
row.
Example
Create a matrix taking a vector of numbers as input.
print(M)
print(N)
rownames = c("row1","row2","row3","row4")
colnames = c("col1","col2","col3")
P <- matrix(c(3:14), nrow =4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
rownames = c("row1","row2","row3","row4")
colnames = c("col1","col2","col3")
print(P[1,3])
print(P[2,])
print(P[,3])
Matrix Computations
Various mathematical operations are performed on the matrices using the R
operators. The result of the operation is also a matrix.
The dimensions (number of rows and columns) should be same for the
matrices involved in the operation.
print(matrix1)
print(matrix2)
print(result)
cat("Result of subtraction","\n")
print(result)
print(matrix1)
print(matrix2)
cat("Result of multiplication","\n")
print(result)
cat("Result of division","\n")
print(result)
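Element-wise matrix arithmetic can be sketched as follows. The two 2x3 matrices and their values are assumptions made for this illustration; both must have the same dimensions.

```r
# Two illustrative 2x3 matrices, filled column by column.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 1, 9, 3, 4), nrow = 2)

# Each operator works on the elements at matching positions.
sum.result <- matrix1 + matrix2
diff.result <- matrix1 - matrix2
prod.result <- matrix1 * matrix2
quot.result <- matrix1 / matrix2

cat("Result of addition", "\n")
print(sum.result)
cat("Result of subtraction", "\n")
print(diff.result)
cat("Result of multiplication", "\n")
print(prod.result)
cat("Result of division", "\n")
print(quot.result)
```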
R - Arrays
Arrays are the R data objects which can store data in more than two
dimensions. For example − If we create an array of dimension (2, 3, 4)
then it creates 4 rectangular matrices each with 2 rows and 3 columns.
Arrays can store only one data type.
An array is created using the array() function. It takes vectors as input and
uses the values in the dim parameter to create an array.
Example
The following example creates an array of two matrices, each with 3
rows and 3 columns.
print(result)
, , 2
matrix.names))
print(result)
, , Matrix2
column.names, matrix.names))
print(result[3,,2])
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
# Print the 2nd Matrix.
print(result[,,2])
print(result)
Syntax
apply(x, margin, fun)
x is an array.
Example
We use the apply() function below to calculate the sum of the elements in
the rows of an array across all the matrices.
print(new.array)
# Use apply to calculate the sum of the rows across all the matrices.
print(result)
, , 2
[1] 56 68 60
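A sketch of the apply() call is given below. The two input vectors are an assumption: they are one choice of values for which the row sums across both matrices come out as 56, 68 and 60, matching the last line of output above.

```r
# Assumed input vectors; together they supply 9 values, which array()
# recycles to fill both 3x3 matrices.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(new.array)

# margin = 1 applies sum() to each row, across all the matrices.
result <- apply(new.array, 1, sum)
print(result)
```

Using a margin of 2 would sum over columns instead, and 3 would give one total per matrix.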
R - Factors
Factors are the data objects which are used to categorize the data and store
it as levels. They can store both strings and integers. They are useful in the
columns which have a limited number of unique values, like "Male",
"Female" and True, False etc. They are useful in data analysis for statistical
modeling.
Factors are created using the factor() function by taking a vector as input.
Example
data <-
c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
print(input_data)
print(is.factor(input_data$gender))
"West","West","East","North")
print(factor_data)
print(new_order_data)
Syntax
gl(n, k, labels)
Example
print(v)
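A minimal sketch of gl() follows, where n is the number of levels, k is the number of replications of each level, and labels names the levels. The label values used here are illustrative.

```r
# 3 levels, each repeated 4 times, giving a factor of length 12.
v <- gl(3, 4, labels = c("Tampa", "Seattle", "Boston"))
print(v)
print(nlevels(v))
```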
R - Data Frames
A data frame is a table or a two-dimensional array-like structure in which
each column contains values of one variable and each row contains one set
of values from each column.
The data stored in a data frame can be of numeric, factor or character type.
# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)
# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date =as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
print(result)
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date =as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
print(result)
Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
Add Row
To add more rows permanently to an existing data frame, we need to bring
in the new rows in the same structure as the existing data frame and use
the rbind() function.
In the example below we create a data frame with new rows and merge it
with the existing data frame to create the final data frame.
# Create the first data frame.
emp.data <- data.frame(
   emp_id = c (1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   dept = c("IT","Operations","IT","HR","Finance"),
   stringsAsFactors = FALSE
)
# Create the second data frame.
emp.newdata <- data.frame(
   emp_id = c (6:8),
   emp_name = c("Rasmi","Pranab","Tusar"),
   salary = c(578.0,722.5,632.8),
   start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
   dept = c("IT","Operations","Finance"),
   stringsAsFactors = FALSE
)
# Bind the two data frames.
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
R - Packages
R packages are a collection of R functions, compiled code and sample data.
They are stored under a directory called "library" in the R environment. By
default, R installs a set of packages during installation. More packages are
added later, when they are needed for some specific purpose. When we
start the R console, only the default packages are available by default.
Other packages which are already installed have to be loaded explicitly to
be used by the R program that is going to use them.
.libPaths()
When we execute the above code, it produces the following result. It may
vary depending on the local settings of your pc.
[2] "C:/Program Files/R/R-3.2.2/library"
library()
When we execute the above code, it produces the following result. It may
vary depending on the local settings of your pc.
Packages in library ‘C:/Program Files/R/R-3.2.2/library’:
search()
When we execute the above code, it produces the following result. It may
vary depending on the local settings of your pc.
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
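Checking for a package and loading it can be sketched as follows. The sketch uses the "stats" package only because it ships with every R installation; the variable names are illustrative, and any installed package name could be substituted.

```r
pkg <- "stats"

# requireNamespace() reports whether the package is installed,
# without attaching it to the search path.
is_installed <- requireNamespace(pkg, quietly = TRUE)

# library() attaches the package; character.only = TRUE lets the
# package name come from a variable.
library(pkg, character.only = TRUE)

# An attached package appears in search() as "package:<name>".
is_attached <- paste0("package:", pkg) %in% search()

print(is_installed)
print(is_attached)
```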
Now you can run the following command to install this package in the R
environment.
R - Data Reshaping
Data Reshaping in R is about changing the way data is organized into rows
and columns. Most of the time data processing in R is done by taking the
input data as a data frame. It is easy to extract data from the rows and
columns of a data frame but there are situations when we need the data
frame in a format that is different from format in which we received it. R
has many functions to split, merge and change the rows to columns and
vice-versa in a data frame.
print(addresses)
new.address <- data.frame(
   city = c("Lowry","Charlotte"),
   state = c("CO","FL"),
   zipcode = c("80230","33949"),
   stringsAsFactors = FALSE
)
# Print a header.
print(new.address)
# Print a header.
In the example below, we consider the data sets about Diabetes in Pima
Indian Women available in the library named "MASS". We merge the two
data sets based on the values of blood pressure ("bp") and body mass
index ("bmi"). On choosing these two columns for merging, the records
where values of these two variables match in both data sets are combined
together to form a single data frame.
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
   by.x = c("bp","bmi"),
   by.y = c("bp","bmi")
)
print(merged.Pima)
nrow(merged.Pima)
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
When we execute the above code, it produces the following result −
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
4 A 65 75 1095 4
5 A 70 60 1512 6
.............
.............
8 A 75 75 2244 11
9 B 60 60 44882 39
10 B 60 75 17176 29
11 B 65 60 28609 58
............
............
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
............
............
print(molten.ships)
print(recasted.ship)
R - CSV Files
In R, we can read data from files stored outside the R environment. We can
also write data into files which will be stored and accessed by the operating
system. R can read and write into various file formats like csv, excel, xml
etc.
In this chapter we will learn to read data from a csv file and then write data
into a csv file. The file should be present in current working directory so
that R can read it. Of course we can also set our own directory and read
files from there.
print(getwd())
setwd("/web/com")
print(getwd())
This result depends on your OS and your current directory where you are
working.
You can create this file using windows notepad by copying and pasting this
data. Save the file as input.csv using the save As All files(*.*) option in
notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
data <- read.csv("input.csv")
print(data)
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
Once we read data in a data frame, we can apply all the functions applicable
to data frames as explained in subsequent section.
print(sal)
print(retval)
When we execute the above code, it produces the following result −
id name salary start_date dept
5 NA Gary 843.25 2015-03-27 Finance
print(retval)
print(info)
print(retval)
write.csv(retval,"output.csv")
print(newdata)
Here the column X comes from the data set newper. This can be dropped
using additional parameters while writing the file.
print(newdata)
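The write and read steps can be tried end to end without touching the working directory. This is a sketch under stated assumptions: the three-row data frame, the temporary file path and the variable names are illustrative. Passing row.names = FALSE to write.csv() is the additional parameter that keeps the extra X column out of the file.

```r
# Small illustrative data frame.
employees <- data.frame(
  id = 1:3,
  name = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611),
  stringsAsFactors = FALSE
)

# Write to a temporary file; row.names = FALSE omits the row-number column.
csv_path <- tempfile(fileext = ".csv")
write.csv(employees, csv_path, row.names = FALSE)

# Read the file back into a data frame.
newdata <- read.csv(csv_path, stringsAsFactors = FALSE)
print(newdata)

# Select the row with the highest salary.
top <- subset(newdata, salary == max(salary))
print(top)
```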
R - Excel File
Microsoft Excel is the most widely used spreadsheet program which stores
data in the .xls or .xlsx format. R can read directly from these files using
some Excel-specific packages. A few such packages are XLConnect, xlsx and
gdata. We will be using the xlsx package. R can also write into an Excel file
using this package.
any(grepl("xlsx",installed.packages()))
library("xlsx")
Also copy and paste the following data to another worksheet and rename
this worksheet to "city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx". You should save it in the current working
directory of the R workspace.
print(data)
R - Binary Files
A binary file is a file that contains information stored only in the form of bits
and bytes (0’s and 1’s). They are not human readable, as the bytes in them
translate to characters and symbols which include many non-printable
characters. Attempting to read a binary file using any text editor will show
characters like Ø and ð.
R has two functions writeBin() and readBin() to create and read binary
files.
Syntax
writeBin(object, con)
readBin(con, what, n )
what is the mode like character, integer etc. representing the bytes to be read.
Example
We consider the R inbuilt data "mtcars". First we create a csv file from it
and convert it to a binary file and store it as a OS file. Next we read this
binary file created into R.
# Read the "mtcars" data frame as a csv file and store only the columns "cyl", "am" and "gear".
# Create a connection object to write the binary file using mode "wb".
write.filename = file("/web/com/binmtcars.dat","wb")
# Write the column names of the data frame to the connection object.
writeBin(colnames(new.mtcars), write.filename)
writeBin(c(new.mtcars$cyl,new.mtcars$am,new.mtcars$gear), write.filename)
# Close the file for writing so that it can be read by other program.
close(write.filename)
# Create a connection object to read the file in binary mode using "rb".
# Next read the column values. n = 18 as we have 3 column names and 15 values.
print(bindata)
# Read elements 4 to 8, which represent "cyl".
cyldata = bindata[4:8]
print(cyldata)
# Read elements 9 to 13, which represent "am".
amdata = bindata[9:13]
print(amdata)
# Read elements 14 to 18, which represent "gear".
geardata = bindata[14:18]
print(geardata)
# Combine all the read values.
colnames(finaldata)= column.names
print(finaldata)
When we execute the above code, it produces the following result −
[1] 7108963 1728081249 7496037 6 6 4
[7] 6 8 1 1 1 0
[13] 0 4 4 4 3 3
[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3
cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3
As we can see, we got the original data back by reading the binary file in R.
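The write and read steps can also be tried in isolation. The following is a minimal sketch, assuming a temporary file and a short hard-coded integer vector rather than the mtcars columns; it only illustrates that readBin() returns what writeBin() stored when the same type is requested.

```r
# Five integer values, matching the "cyl" values shown above.
values <- c(6L, 6L, 4L, 6L, 8L)

# Write the values to a temporary file in binary mode.
bin_file <- tempfile(fileext = ".dat")
con <- file(bin_file, "wb")
writeBin(values, con)
close(con)

# Read them back, telling readBin() the type and how many elements to expect.
con <- file(bin_file, "rb")
restored <- readBin(con, what = integer(), n = 5)
close(con)

print(restored)
```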
R - XML Files
XML is a text-based file format used to share both the structure and the
data of a document on the World Wide Web, intranets and elsewhere. It
stands for Extensible Markup Language (XML). Similar to HTML, it contains
markup tags. But unlike HTML, where the markup tags describe the structure
of the page, in XML the markup tags describe the meaning of the data
contained in the file.
You can read an XML file in R using the "XML" package. This package can be
installed using the following command.
install.packages("XML")
Input Data
Create an XML file by copying the data below into a text editor like Notepad.
Save the file with a .xml extension, choosing the file type as all
files (*.*).
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
library("XML")
library("methods")
print(result)
2
Dan
515.2
9/23/2013
Operations
3
Michelle
611
11/15/2014
IT
4
Ryan
729
5/11/2014
HR
5
Gary
843.25
3/27/2015
Finance
6
Nina
578
5/21/2013
IT
7
Simon
632.8
7/30/2013
Operations
8
Guru
722.5
6/17/2014
Finance
library("XML")
library("methods")
# Give the input file name to the function.
print(rootsize)
library("XML")
library("methods")
print(rootnode[1])
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
library("XML")
library("methods")
print(rootnode[[1]][[1]])
print(rootnode[[1]][[5]])
print(rootnode[[3]][[2]])
When we execute the above code, it produces the following result −
1
IT
Michelle
library("XML")
library("methods")
print(xmldataframe)
As the data is now available as a data frame, we can use data frame related
functions to read and manipulate it.
R - JSON Files
A JSON file stores data as text in a human-readable format. JSON stands for
JavaScript Object Notation. R can read JSON files using the rjson package.
Input Data
Create a JSON file by copying the data below into a text editor like Notepad.
Save the file with a .json extension, choosing the file type as all
files (*.*).
{
   "ID":["1","2","3","4","5","6","7","8"],
   "Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru"],
   "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
   "StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
      "7/30/2013","6/17/2014"],
   "Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
library("rjson")
print(result)
$Name
[1] "Rick" "Dan" "Michelle" "Ryan" "Gary" "Nina" "Simon" "Guru"
$Salary
[1] "623.3" "515.2" "611" "729" "843.25" "578" "632.8" "722.5"
$StartDate
[1] "1/1/2012" "9/23/2013" "11/15/2014" "5/11/2014" "3/27/2015" "5/21/2013"
"7/30/2013" "6/17/2014"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT"
"Operations" "Finance"
library("rjson")
json_data_frame <-as.data.frame(result)
print(json_data_frame)
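Because every value in the JSON file is quoted, the parsed list holds character vectors, and the resulting data frame columns are strings. The sketch below assumes a shortened, hand-written list with the same shape as the parsed result (it does not call the rjson package) and shows one way to convert the numeric columns after building the data frame.

```r
# Hypothetical list shaped like the parsed JSON above, shortened to three records.
result <- list(
  ID = c("1", "2", "3"),
  Name = c("Rick", "Dan", "Michelle"),
  Salary = c("623.3", "515.2", "611")
)

json_data_frame <- as.data.frame(result, stringsAsFactors = FALSE)

# The values arrive as strings, so convert the numeric columns explicitly.
json_data_frame$ID <- as.integer(json_data_frame$ID)
json_data_frame$Salary <- as.numeric(json_data_frame$Salary)

str(json_data_frame)
```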
R - Web Data
Many websites provide data for consumption by their users. For example, the
World Health Organization (WHO) provides reports on health and medical
information in the form of CSV, txt and XML files. Using R programs, we can
programmatically extract specific data from such websites. Some R packages
used to scrape data from the web are "RCurl", "XML" and "stringr". They are
used to connect to the URLs, identify the required links for the files and
download them to the local environment.
Install R Packages
The following packages are required for processing the URLs and links to
the files. If they are not available in your R environment, you can install
them using the following commands.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
Input Data
We will visit the URL weather data and download the CSV files using R for
the year 2015.
Example
We will use the function getHTMLLinks() to gather the URLs of the files.
Then we will use the function download.file() to save the files to the local
system. As we will be applying the same code again and again for multiple
files, we will create a function to be called multiple times. The filenames are
passed as parameters in the form of an R list object to this function.
# Load the packages used below.
library(XML)
library(stringr)
library(plyr)
url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"
# Gather the html links present in the webpage.
links <- getHTMLLinks(url)
# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links, "JCMB_2015")]
filenames_list <- as.list(filenames)
# Create a function to download the files by passing the URL and filename list.
downloadcsv <- function(mainurl, filename) {
   filedetails <- str_c(mainurl, filename)
   download.file(filedetails, filename)
}
# Now apply the l_ply function and save the files into the current R working directory.
l_ply(filenames, downloadcsv, mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
R - Databases
The data in relational database systems is stored in a normalized format.
So, to carry out statistical computing we would need very advanced and
complex SQL queries. But R can connect easily to many relational databases
such as MySQL, Oracle and SQL Server, and fetch records from them as a data
frame. Once the data is available in the R environment, it becomes a
normal R data set and can be manipulated or analyzed using all the
powerful packages and functions.
install.packages("RMySQL")
Connecting R to MySql
Once the package is installed we create a connection object in R to connect
to the database. It takes the username, password, database name and host
name as input.
# We will connect to the sample database named "sakila" that comes with the MySql installation.
host ='localhost')
dbListTables(mysqlconnection)
# Store the result in a R data frame object. n = 5 is used to fetch first 5 rows.
print(data.frame)
print(data)
"insert into mtcars(row_names, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear,
carb)
values('New Mazda RX4 Wag', 21, 6, 168.5, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4)"
After executing the above code we can see the row inserted into the table in
the MySql Environment.
# Create the connection object to the database where we want to create the table.
host ='localhost')
After executing the above code we can see the table created in the MySql
Environment.
R - Pie Charts
The R programming language has numerous libraries to create charts and
graphs. A pie chart is a representation of values as slices of a circle with
different colors. The slices are labeled, and the numbers corresponding to
each slice are also represented in the chart.
In R the pie chart is created using the pie() function which takes positive
numbers as a vector input. The additional parameters are used to control
labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
radius indicates the radius of the circle of the pie chart (a value between
−1 and +1).
clockwise is a logical value indicating whether the slices are drawn clockwise
or anticlockwise.
Example
A very simple pie-chart is created using just the input vector and labels.
The below script will create and save the pie chart in the current R working
directory.
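# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.png")
pie(x, labels)
dev.off()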
Example
The below script will create and save the pie chart in the current R working
directory.
x <- c(21,62,10,53)
png(file ="city_title_colours.jpg")
dev.off()
x <- c(21,62,10,53)
piepercent<- round(100*x/sum(x),1)
png(file ="city_percentage_legends.jpg")
fill = rainbow(length(x)))
dev.off()
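The title, colour and percentage-legend variants can be combined in one call sequence. The following is a sketch only: the city labels are assumed to be the same four names used in the 3D chart example, the chart title is made up, and the chart is written to a temporary file instead of the working directory.

```r
x <- c(21, 62, 10, 53)
# Assumed labels; any character vector of the same length as x would do.
labels <- c("London", "New York", "Singapore", "Mumbai")
piepercent <- round(100 * x / sum(x), 1)

out_file <- tempfile(fileext = ".png")
png(file = out_file)
# Slices are labelled with percentages; the legend maps colours to cities.
pie(x, labels = piepercent, main = "City pie chart", col = rainbow(length(x)))
legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))
dev.off()
```

Passing the same rainbow(length(x)) palette to both pie() and legend() keeps the legend colours aligned with the slices.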
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The
package plotrix has a function called pie3D() that is used for this.
library(plotrix)
x <- c(21,62,10,53)
lbl <- c("London","New York","Singapore","Mumbai")
png(file ="3d_pie_chart.jpg")
dev.off()
R - Bar Charts
A bar chart represents data in rectangular bars with the length of each bar
proportional to the value of the variable. R uses the function barplot() to
create bar charts. R can draw both vertical and horizontal bars in the bar
chart, and each of the bars can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
Example
A simple bar chart is created using just the input vector and the name of
each bar.
The below script will create and save the bar chart in the current R working
directory.
H <- c(7,12,28,3,41)
png(file ="barchart.png")
barplot(H)
dev.off()
Example
The below script will create and save the bar chart in the current R working
directory.
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
# Give the chart file a name
png(file ="barchart_months_revenue.png")
barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart",border="red")
dev.off()
More than two variables are represented as a matrix which is used to create
the group bar chart and stacked bar chart.
colors = c("green","orange","brown")
png(file ="barchart_stacked.png")
dev.off()
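A sketch of the matrix-based chart follows. The revenue figures, region names and chart title are hypothetical values chosen for illustration; only the colour vector comes from the text above. Each matrix row becomes one stacked segment per bar, and each column becomes one bar.

```r
# Hypothetical revenue figures: one row per region, one column per month.
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
colors <- c("green", "orange", "brown")
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE,
                 dimnames = list(regions, months))

out_file <- tempfile(fileext = ".png")
png(file = out_file)
# Rows are stacked within each bar; beside = TRUE would group them instead.
barplot(Values, main = "Total revenue", xlab = "Month", ylab = "Revenue",
        col = colors)
legend("topleft", regions, cex = 1.3, fill = colors)
dev.off()
```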
R - Boxplots
Boxplots are a measure of how well distributed the data in a data set is. A
boxplot divides the data set into quartiles. The graph represents the
minimum, maximum, median, first quartile and third quartile of the data
set. It is also useful for comparing the distribution of data across data
sets by drawing a boxplot for each of them.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
x is a vector or a formula.
data is the data frame.
varwidth is a logical value. Set it to TRUE to draw the width of each box in
proportion to the square root of the sample size.
names are the group labels which will be printed under each boxplot.
Example
We use the data set "mtcars" available in the R environment to create a
basic boxplot. Let's look at the columns "mpg" and "cyl" in mtcars.
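input <- mtcars[, c("mpg", "cyl")]
print(head(input))
png(file = "boxplot.png")
# Plot mileage for each cylinder count.
boxplot(mpg ~ cyl, data = mtcars)
dev.off()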
The below script will create a boxplot graph with notch for each of the data
group.
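# The output file name is chosen for illustration.
png(file = "boxplot_with_notch.png")
boxplot(mpg ~ cyl, data = mtcars,
   notch = TRUE,
   varwidth = TRUE,
   col = c("green","yellow","purple"),
   names = c("High","Medium","Low")
)
dev.off()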
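The numbers behind a boxplot can be inspected without drawing it. The sketch below uses the same mtcars columns and relies on the list that boxplot() returns; the row labels are added here for readability and are not part of the returned object.

```r
# plot = FALSE returns the statistics without drawing anything.
stats <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# One column per cylinder group; rows are lower whisker, first quartile
# (lower hinge), median, third quartile (upper hinge) and upper whisker.
summary_table <- stats$stats
colnames(summary_table) <- stats$names
rownames(summary_table) <- c("lower whisker", "Q1", "median", "Q3", "upper whisker")
print(summary_table)

# Number of cars in each group.
print(stats$n)
```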
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Example
A simple histogram is created using input vector, label, col and border
parameters.
The script given below will create and save the histogram in the current R
working directory.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
png(file ="histogram.png")
dev.off()
v <- c(9,13,21,8,36,22,12,41,31,33,19)
png(file ="histogram_lim_breaks.png")
# Create the histogram.
breaks =5)
dev.off()
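When breaks is given as a single number, hist() treats it as a suggestion and picks rounded bin boundaries near that count, so the number of bars may differ from the value passed. A small sketch, using the vector from the examples above, shows how to inspect the bins that were actually chosen:

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# plot = FALSE returns the binning without drawing the chart.
h <- hist(v, breaks = 5, plot = FALSE)

print(h$breaks)   # bin boundaries actually chosen
print(h$counts)   # number of values falling in each bin
```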
R - Line Graphs
A line chart is a graph that connects a series of points by drawing line
segments between them. These points are ordered by one of their
coordinates (usually the x-coordinate). Line charts are usually used for
identifying trends in data.
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab)
type takes the value "p" to draw only the points, "l" to draw only the lines and
"o" to draw both points and lines.
Example
A simple line chart is created using the input vector and the type parameter
as "o". The below script will create and save a line chart in the current R
working directory.
v <- c(7,12,28,3,41)
png(file = "line_chart.png")
plot(v, type = "o")
dev.off()
Example
v <- c(7,12,28,3,41)
dev.off()
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
png(file ="line_chart_2_lines.jpg")
dev.off()
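More than one line can be drawn on the same chart by calling lines() after plot(). The following sketch uses the two vectors defined above; the colours, axis labels, title and temporary file path are assumptions made for illustration.

```r
v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)

out_file <- tempfile(fileext = ".png")
png(file = out_file)
# plot() draws the first series and sets up the axes.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")
# lines() adds the second series to the same chart.
lines(t, type = "o", col = "blue")
dev.off()
```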
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Example
We use the data set "mtcars" available in the R environment to create a
basic scatterplot. Let's use the columns "wt" and "mpg" in mtcars.
input <- mtcars[, c("wt", "mpg")]
print(head(input))
png(file ="scatterplot.png")
# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
   xlab ="Weight",
   ylab ="Mileage",
   xlim = c(2.5,5),
   ylim = c(15,30)
)
dev.off()
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables. A scatterplot
is plotted for each pair.
png(file ="scatterplot_matrices.png")
pairs(~wt+mpg+disp+cyl,data = mtcars)
dev.off()
The functions we are discussing in this chapter are mean, median and
mode.
Mean
It is calculated by taking the sum of the values and dividing it by the
number of values in a data series.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
trim is used to drop some observations from both ends of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
When trim = 0.3, 3 values from each end will be dropped from the
calculations to find mean.
In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18, 54) and
the values removed from the vector for calculating mean are (−21,−5,2)
from left and (12,18,54) from right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x, trim = 0.3)
print(result.mean)
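The trimming described above can be checked by hand. This sketch sorts the vector, drops three values from each end and compares the result with mean(x, trim = 0.3); the variable names are chosen here for illustration.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

trimmed <- mean(x, trim = 0.3)

# Drop floor(10 * 0.3) = 3 values from each end of the sorted vector by hand.
manual <- mean(sort(x)[4:7])

print(trimmed)
print(manual)
```

Both values are the mean of the four middle observations (3, 4.2, 7 and 8), which is 5.55.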
Applying NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which
means remove the NA values.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
result.mean <- mean(x)
print(result.mean)
# Find mean dropping NA values.
result.mean <- mean(x, na.rm = TRUE)
print(result.mean)
Median
The middlemost value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
na.rm is used to remove the missing values from the input vector.
Example
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
median.result <- median(x)
print(median.result)
Mode
The mode is the value that has the highest number of occurrences in a set of
data. Unlike mean and median, mode can have both numeric and character
data.
Example
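# Create the function.
getmode <-function(v){
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
# Calculate the mode using the user function.
result <- getmode(v)
print(result)
# An illustrative character vector; the same function works on it.
charv <- c("o","it","the","it","it")
result <- getmode(charv)
print(result)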
R - Linear Regression
Regression analysis is a very widely used statistical tool to establish a
relationship model between two variables. One of these variables is called
the predictor variable, whose value is gathered through experiments. The
other variable is called the response variable, whose value is derived from
the predictor variable.
Find the coefficients from the model created and create the mathematical
equation using these
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the
response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
x <- c(151,174,138,186,128,136,179,163,152,131)
y <- c(63,81,56,91,47,57,76,72,62,48)
# Apply the lm() function.
relation <- lm(y ~ x)
print(relation)
Coefficients:
(Intercept) x
-38.4551 0.6746
x <- c(151,174,138,186,128,136,179,163,152,131)
y <- c(63,81,56,91,47,57,76,72,62,48)
# Apply the lm() function.
relation <- lm(y ~ x)
print(summary(relation))
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
object is the model which is already created using the lm() function.
newdata is the data frame containing the new value for the predictor variable.
x <- c(151,174,138,186,128,136,179,163,152,131)
y <- c(63,81,56,91,47,57,76,72,62,48)
# Apply the lm() function.
print(result)
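A minimal sketch of the prediction step follows, using the height and weight vectors above. The new height of 170 is an arbitrary value chosen for illustration; the important detail is that newdata is a data frame whose column name matches the predictor used in the formula.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # weight

relation <- lm(y ~ x)

# newdata must be a data frame whose column name matches the predictor.
# The height of 170 is an arbitrary value chosen for illustration.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
```

With the coefficients reported above (intercept -38.4551, slope 0.6746), the predicted weight is roughly 76.2.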
x <- c(151,174,138,186,128,136,179,163,152,131)
y <- c(63,81,56,91,47,57,76,72,62,48)
png(file ="linearregression.png")
dev.off()
When we execute the above code, it produces the following result −
R - Multiple Regression
Multiple regression is an extension of linear regression to the relationship
between more than two variables. In simple linear regression we have one
predictor and one response variable, but in multiple regression we have
more than one predictor variable and one response variable.
lm() Function
This function creates the relationship model between the predictor and the
response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)
formula is a symbol presenting the relation between the response variable and
predictor variables.
Example
Input Data
Consider the data set "mtcars" available in the R environment. It gives a
comparison between different car models in terms of mileage per gallon
(mpg), cylinder displacement("disp"), horse power("hp"), weight of the
car("wt") and some more parameters.
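input <- mtcars[, c("mpg", "disp", "hp", "wt")]
print(head(input))
# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)
# Show the model.
print(model)
# Get the intercept and coefficients as vector elements.
a <- coef(model)[1]
print(a)
Xdisp<- coef(model)[2]
Xhp<- coef(model)[3]
Xwt<- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)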
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is
−
Y = 37.1055+(-0.000937)*221+(-0.031157)*102+(-3.800891)*2.91 = 22.6598
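The same prediction can be obtained from the fitted model instead of typing the coefficients by hand. This sketch assumes the model formula mpg ~ disp + hp + wt, which matches the coefficient names printed above; the data frame name new_car is hypothetical.

```r
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Hypothetical car with the values used in the worked equation.
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
by_predict <- predict(model, new_car)

# The same value computed by hand from the unrounded coefficients.
by_hand <- sum(coef(model) * c(1, 221, 102, 2.91))

print(by_predict)
print(by_hand)
```

Using predict() avoids rounding the coefficients, so it is the safer choice when the result feeds into further calculations.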
R - Logistic Regression
The Logistic Regression is a regression model in which the response variable
(dependent variable) has categorical values such as True/False or 0/1. It
actually measures the probability of a binary response as the value of
response variable based on the mathematical equation relating it with the
predictor variables.
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)
family is an R object to specify the details of the model. Its value is binomial
for logistic regression.
Example
The in-built data set "mtcars" describes different models of a car with their
various engine specifications. In "mtcars" data set, the transmission mode
(automatic or manual) is described by the column am which is a binary
value (0 or 1). We can create a logistic regression model between the
columns "am" and 3 other columns - hp, wt and cyl.
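input <- mtcars[, c("am", "cyl", "hp", "wt")]
print(head(input))
# Create the regression model.
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))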
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion
In the summary as the p-value in the last column is more than 0.05 for the
variables "cyl" and "hp", we consider them to be insignificant in contributing
to the value of the variable "am". Only weight (wt) impacts the "am" value
in this regression model.
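Once the model is fitted, the estimated probabilities can be read back with predict(). The sketch below assumes the formula am ~ cyl + hp + wt implied by the coefficient table; the 0.5 cut-off used to turn probabilities into classes is an assumption for illustration, not something the model itself prescribes.

```r
input <- mtcars[, c("am", "cyl", "hp", "wt")]
am.data <- glm(am ~ cyl + hp + wt, data = input, family = binomial)

# type = "response" returns probabilities rather than log-odds.
prob <- predict(am.data, type = "response")

# Classify as manual (1) when the fitted probability exceeds 0.5;
# the 0.5 cut-off is an assumption made for this illustration.
predicted <- ifelse(prob > 0.5, 1, 0)

# Compare the fitted classes with the recorded transmission type.
print(table(actual = input$am, predicted = predicted))
```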
R - Normal Distribution
In a random collection of data from independent sources, it is generally
observed that the distribution of data is normal. This means that, on
plotting a graph with the value of the variable on the horizontal axis and
the count of the values on the vertical axis, we get a bell-shaped curve. The
center of the curve represents the mean of the data set. In the graph, fifty
percent of the values lie to the left of the mean and the other fifty percent
lie to the right of it. This is referred to as the normal distribution in
statistics.
x is a vector of numbers.
p is a vector of probabilities.
mean is the mean value of the sample data. Its default value is zero.
dnorm()
This function gives height of the probability distribution at each point for a
given mean and standard deviation.
x <- seq(-10,10,by=.1)
png(file ="dnorm.png")
plot(x,y)
dev.off()
pnorm()
This function gives the probability of a normally distributed random number
being less than the value of a given number. It is also called the
"Cumulative Distribution Function".
png(file ="pnorm.png")
plot(x,y)
dev.off()
x <- seq(0,1,by=0.02)
png(file ="qnorm.png")
# Plot the graph.
plot(x,y)
dev.off()
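The relationship between the density, cumulative and quantile functions can be seen with a few single values. This sketch uses the standard normal distribution (mean 0, standard deviation 1, the defaults) and the commonly quoted point 1.96, both chosen here for illustration.

```r
# Height of the standard normal density at its mean.
print(dnorm(0))

# Probability that a standard normal value is below 1.96.
p <- pnorm(1.96)
print(p)

# qnorm() inverts pnorm(): it maps the probability back to the value.
print(qnorm(p))
```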
rnorm()
This function is used to generate random numbers whose distribution is
normal. It takes the sample size as input and generates that many random
numbers. We draw a histogram to show the distribution of the generated
numbers.
y <- rnorm(50)
png(file ="rnorm.png")
# Plot the histogram for this sample.
hist(y)
dev.off()
x is a vector of numbers.
p is a vector of probabilities.
n is number of observations.
dbinom()
This function gives the probability density distribution at each point.
x <- seq(0,50,by=1)
y <- dbinom(x,50,0.5)
png(file ="dbinom.png")
plot(x,y)
dev.off()
x <- pbinom(26,51,0.5)
print(x)
qbinom()
This function takes the probability value and gives a number whose
cumulative value matches the probability value.
# How many heads have a cumulative probability of 0.25 when a coin
# is tossed 51 times?
x <- qbinom(0.25,51,1/2)
print(x)
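The value returned by qbinom() can be checked against pbinom(): it is the smallest count whose cumulative probability reaches the requested level. A short sketch for the coin example above, with the variable name q chosen here:

```r
q <- qbinom(0.25, 51, 0.5)
print(q)

# The cumulative probability first reaches 0.25 at q ...
print(pbinom(q, 51, 0.5))

# ... and is still below 0.25 one step earlier.
print(pbinom(q - 1, 51, 0.5))
```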
rbinom()
This function generates required number of random values of given
probability from a given sample.
x <- rbinom(8,150,.4)
print(x)
R - Poisson Regression
Poisson Regression involves regression models in which the response
variable is in the form of counts and not fractional numbers. For example,
the count of number of births or number of wins in a football match series.
Also the values of the response variables follow a Poisson distribution.
Syntax
The basic syntax for glm() function in Poisson regression is −
glm(formula,data,family)
family is an R object to specify the details of the model. Its value is poisson
for Poisson regression.
Example
We have the in-built data set "warpbreaks" which describes the effect of
wool type (A or B) and tension (low, medium or high) on the number of
warp breaks per loom. Let's consider "breaks" as the response variable
which is a count of number of breaks. The wool "type" and "tension" are
taken as predictor variables.
Input Data
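input <- warpbreaks
print(head(input))
# Create the regression model.
output <- glm(formula = breaks ~ wool + tension, data = warpbreaks,
   family = poisson)
print(summary(output))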
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the summary we look for a p-value in the last column that is less than
0.05 to consider that a predictor variable has an impact on the response
variable. As seen, wool type B and tension types M and H all have an impact
on the count of breaks.
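The coefficients are easier to read after exponentiation. The Poisson family uses a log link by default, so each exponentiated coefficient is the factor by which the expected count changes relative to the baseline (wool A, low tension). A sketch, assuming the formula breaks ~ wool + tension implied by the coefficient table:

```r
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# The Poisson family uses a log link by default, so exponentiated
# coefficients are multiplicative effects on the expected count.
effects <- exp(coef(output))
print(round(effects, 3))
```

For example, the woolB factor of about 0.81 means roughly 19% fewer expected breaks than wool A at the same tension.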
R - Analysis of Covariance
We use regression analysis to create models which describe the effect of
variation in predictor variables on the response variable. Sometimes we
have a categorical variable with values like Yes/No or Male/Female. A
simple regression analysis then gives a separate result for each value of the
categorical variable. In such a scenario, we can study the effect of the
categorical variable by using it along with the predictor variable and
comparing the regression lines for each level of the categorical variable.
Such an analysis is termed Analysis of Covariance, also called
ANCOVA.
Example
Consider the R built-in data set mtcars. In it we observe that the field "am"
represents the type of transmission (auto or manual). It is a categorical
variable with values 0 and 1. The miles per gallon value (mpg) of a car can
also depend on it, besides the value of horse power ("hp").
We study the effect of the value of "am" on the regression between "mpg"
and "hp". It is done by using the aov() function followed by
the anova() function to compare the multiple regressions.
Input Data
Create a data frame containing the fields "mpg", "hp" and "am" from the
data set mtcars. Here we take "mpg" as the response variable, "hp" as the
predictor variable and "am" as the categorical variable.
input <- mtcars[, c("am", "mpg", "hp")]
print(head(input))
ANCOVA Analysis
We create a regression model taking "hp" as the predictor variable and
"mpg" as the response variable taking into account the interaction between
"am" and "hp".
Model with interaction between categorical variable and
predictor variable
# Create the regression model with the interaction term.
result <- aov(mpg ~ hp * am, data = input)
print(summary(result))
This result shows that both horse power and transmission type have a
significant effect on miles per gallon, as the p-value in both cases is less
than 0.05. But the interaction between these two variables is not significant,
as its p-value is more than 0.05.
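# Create the regression model without the interaction term.
result <- aov(mpg ~ hp + am, data = input)
print(summary(result))
This result shows that both horse power and transmission type have a
significant effect on miles per gallon, as the p-value in both cases is less
than 0.05.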
print(anova(result1,result2))
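The comparison step can be sketched end to end as follows. The model formulas are assumptions based on the description above: result1 includes the interaction between hp and am, and result2 does not.

```r
input <- mtcars[, c("am", "mpg", "hp")]

# Model with the interaction between the predictor and the categorical variable.
result1 <- aov(mpg ~ hp * am, data = input)
# Model without the interaction.
result2 <- aov(mpg ~ hp + am, data = input)

# Compare the two nested models with an F test.
comparison <- anova(result1, result2)
print(comparison)

# p-value of the comparison (second row of the table).
p_value <- comparison[["Pr(>F)"]][2]
print(p_value)
```

A p-value above 0.05 here indicates that dropping the interaction term does not make the fit significantly worse, so the simpler model is adequate.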
Syntax
The basic syntax for ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)
data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in time series.
end specifies the end time for the last observation in time series.
Example
Consider the annual rainfall details at a place starting from January 2012.
We create an R time series object for a period of 12 months and plot it.
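rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
# Convert the vector to a monthly time series starting in January 2012.
rainfall.timeseries <- ts(rainfall, start = c(2012,1), frequency = 12)
print(rainfall.timeseries)
png(file ="rainfall.png")
plot(rainfall.timeseries)
dev.off()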
When we execute the above code, it produces the following result and chart
−
Jan Feb Mar Apr May Jun Jul Aug Sep
2012 799.0 1174.8 865.1 1334.6 635.4 918.5 685.5 998.6 784.2
Oct Nov Dec
2012 985.0 882.8 1071.0
frequency = 24*6 pegs the data points for every 10 minutes of a day.
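The frequency argument sets how many observations make up one unit of time. As a sketch with hypothetical quarterly figures, frequency = 4 makes every four values one year; the vector and variable names below are invented for illustration.

```r
# Hypothetical quarterly figures used only to illustrate the frequency argument.
sales <- c(12, 15, 11, 18, 14, 17, 13, 20)
quarterly <- ts(sales, start = c(2012, 1), frequency = 4)
print(quarterly)

# Number of observations per year.
print(frequency(quarterly))

# window() extracts the observations that fall within a time range.
second_year <- window(quarterly, start = c(2013, 1))
print(second_year)
```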
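rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <-
c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)
# Combine both series into a two-column matrix.
combined.rainfall <- matrix(c(rainfall1,rainfall2), nrow = 12)
rainfall.timeseries <- ts(combined.rainfall, start = c(2012,1), frequency = 12)
print(rainfall.timeseries)
png(file ="rainfall_combined.png")
plot(rainfall.timeseries)
dev.off()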
When we execute the above code, it produces the following result and chart
−
Series 1 Series 2
Jan 2012 799.0 655.0
Feb 2012 1174.8 1306.9
Mar 2012 865.1 1323.4
Apr 2012 1334.6 1172.2
May 2012 635.4 562.2
Jun 2012 918.5 824.0
Jul 2012 685.5 822.4
Aug 2012 998.6 1265.5
Sep 2012 784.2 799.6
Oct 2012 985.0 1105.6
Nov 2012 882.8 1106.7
Dec 2012 1071.0 1337.8
Syntax
The basic syntax for creating a nonlinear least square test in R is −
nls(formula, data, start)
Example
We will consider a nonlinear model with assumed initial values for its
coefficients. Next we will look at the confidence intervals of these
assumed values so that we can judge how well these values fit into the
model.
Let's assume the initial coefficients to be 1 and 3 and fit these values into
nls() function.
plot(xvalues,yvalues)
# Plot the chart with new data by fitting it to a prediction from 100 data points.
lines(new.data$xvalues,predict(model,newdata =new.data))
dev.off()
print(sum(resid(model)^2))
print(confint(model))
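A self-contained sketch of the fitting steps is shown below. The data are hypothetical: they are generated from y = 1.2 * x^2 + 2.5 with small fixed deviations, and the model form b1 * x^2 + b2 is an assumption chosen so that the starting values 1 and 3 are reasonable.

```r
# Hypothetical data: roughly y = 1.2 * x^2 + 2.5 with small fixed deviations.
xvalues <- seq(1, 4, by = 0.3)
noise <- c(0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15, 0.1, -0.1, 0.05)
yvalues <- 1.2 * xvalues^2 + 2.5 + noise

# Fit y = b1 * x^2 + b2, starting the search from b1 = 1 and b2 = 3.
model <- nls(yvalues ~ b1 * xvalues^2 + b2, start = list(b1 = 1, b2 = 3))
print(coef(model))

# Residual sum of squares: smaller values mean a closer fit.
print(sum(resid(model)^2))

# Predict along a fine grid of 100 points, as used for drawing a smooth curve.
new.data <- data.frame(xvalues = seq(min(xvalues), max(xvalues), length.out = 100))
fitted_curve <- predict(model, newdata = new.data)
print(head(fitted_curve))
```

The deviations are added deliberately, because nls() is not designed for artificial data with exactly zero residuals.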
R - Decision Tree
Decision tree is a graph to represent choices and their results in form of a
tree. The nodes in the graph represent an event or choice and the edges of
the graph represent the decision rules or conditions. It is mostly used in
Machine Learning and Data Mining applications using R.
Install R Package
Use the below command in R console to install the package. You also have
to install the dependent packages if any.
install.packages("party")
The package "party" has the function ctree() which is used to create and
analyze decision trees.
Syntax
The basic syntax for creating a decision tree in R is −
ctree(formula, data)
Input Data
We will use the R in-built data set named readingSkills to create a
decision tree. It describes the score of someone's readingSkills if we know
the variables "age","shoesize","score" and whether the person is a native
speaker or not.
# dependent packages.
library(party)
print(head(readingSkills))
When we execute the above code, it produces the following result and chart
−
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the ctree() function to create the decision tree and see its
graph.
# dependent packages.
library(party)
png(file ="decision_tree.png")
data = input.dat)
plot(output.tree)
# Save the file.
dev.off()
R - Random Forest
In the random forest approach, a large number of decision trees are
created. Every observation is fed into every decision tree. The most
common outcome for each observation is used as the final output. A new
observation is fed into all the trees, and a majority vote across them
decides its classification.
An error estimate is made for the cases which were not used while building
the tree. That is called an OOB (Out-of-bag) error estimate, which is
reported as a percentage.
The R package "randomForest" is used to create random forests.
Install R Package
Use the below command in R console to install the package. You also have
to install the dependent packages if any.
install.packages("randomForest")
Syntax
The basic syntax for creating a random forest in R is −
randomForest(formula, data)
Input Data
We will use the R in-built data set named readingSkills to create a decision
tree. It describes the score of someone's readingSkills if we know the
variables "age","shoesize","score" and whether the person is a native
speaker.
# required packages.
library(party)
print(head(readingSkills))
When we execute the above code, it produces the following result and chart
−
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
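We will use the randomForest() function to create the random forest and
view its summary.
# Load the packages.
library(party)
library(randomForest)
# Create the forest.
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
   data = readingSkills)
# View the forest results.
print(output.forest)
# Importance of each predictor.
print(importance(output.forest, type = 2))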
Conclusion
From the random forest shown above, we can conclude that shoeSize and
score are the important factors in deciding whether someone is a native
speaker. The model also has an error of only 1%, which means we can
predict with about 99% accuracy.
R - Survival Analysis
Survival analysis deals with predicting the time when a specific event is
going to occur. It is also known as failure time analysis or analysis of time
to death. Examples include predicting the number of days a person with
cancer will survive, or predicting the time when a mechanical system is
going to fail.
The R package named survival is used to carry out survival analysis. This
package contains the function Surv(), which combines the observed times
and event indicators into a survival object. We then use that object in an R
formula passed to the function survfit(), which estimates the survival curve
that we plot for the analysis.
Install Package
install.packages("survival")
Syntax
The basic syntax for creating survival analysis in R is −
Surv(time,event)
survfit(formula)
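As a minimal sketch of how the two calls fit together, the following uses a small made-up data set; the vectors days and event are hypothetical and exist only for this illustration.

```r
library(survival)

# Hypothetical follow-up data: days observed and whether the event occurred
# (1 = event observed, 0 = censored).
days  <- c(5, 8, 12, 12, 20, 25)
event <- c(1, 0, 1, 1, 0, 1)

# Surv() pairs each time with its event indicator; censored times print with "+".
s <- Surv(days, event)
print(s)

# survfit() estimates the Kaplan-Meier survival curve; ~ 1 means no grouping.
fit <- survfit(s ~ 1)
print(summary(fit))
```

The right-hand side ~ 1 fits a single curve for all observations; a grouping variable there would give one curve per group.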
library("survival")
print(head(pbc))
When we execute the above code, it produces the following result −
  id time status trt      age sex ascites hepato spiders edema bili chol
1  1  400      2   1 58.76523   f       1      1       1   1.0 14.5  261
2  2 4500      0   1 56.44627   f       0      1       1   0.0  1.1  302
3  3 1012      2   1 70.07255   m       0      0       0   0.5  1.4  176
4  4 1925      2   1 54.74059   f       0      1       1   0.5  1.8  244
5  5 1504      1   2 38.10541   f       0      1       1   0.0  3.4  279
6  6 2503      2   2 66.25873   f       0      1       0   0.0  0.8  248
  albumin copper alk.phos    ast trig platelet protime stage
1    2.60    156   1718.0 137.95  172      190    12.2     4
2    4.14     54   7394.8 113.52   88      221    10.6     3
3    3.48    210    516.0  96.10   55      151    12.0     4
4    2.54     64   6121.8  60.63   92      183    10.3     4
5    3.53    143    671.0 113.15   72      136    10.9     3
6    3.98     50    944.0  93.00   63       NA    11.0     3
From the above data we are considering time and status for our analysis.
In pbc, a status of 2 marks the death of the patient, which is the event we
model.
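library("survival")
# Create the survival object and fit the survival curve.
survfit(Surv(pbc$time, pbc$status == 2) ~ 1)
# Give the chart file a name.
png(file = "survival.png")
# Plot the graph.
plot(survfit(Surv(pbc$time, pbc$status == 2) ~ 1))
# Save the file.
dev.off()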
When we execute the above code, it produces the following result and chart
−
Call: survfit(formula = Surv(pbc$time, pbc$status == 2) ~ 1)
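Beyond the plot, the fitted object can be queried directly. The sketch below refits the same model using the data argument (an equivalent way of writing the call, chosen here for brevity) and reads off summary figures; the time points 1000 and 2000 days are arbitrary values picked for illustration.

```r
library(survival)

# Same model as above: death (status == 2) is the event of interest.
fit <- survfit(Surv(time, status == 2) ~ 1, data = pbc)

# summary()$table holds headline figures such as the number of events
# and the median survival time.
print(summary(fit)$table)

# Estimated survival probability at 1000 and 2000 days.
surv.at <- summary(fit, times = c(1000, 2000))$surv
print(surv.at)
```

The survival probability can only stay level or fall as time increases, so the second value is never larger than the first.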
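R - Chi Square Test
The chi-square test is a statistical method to determine whether two
categorical variables are significantly associated with each other.
For example, we can build a data set with observations on people's ice-
cream buying pattern and try to correlate the gender of a person with the
flavor of the ice-cream they prefer. If a correlation is found, we can plan an
appropriate stock of flavors by knowing the gender mix of the people
visiting.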
Syntax
The function used for performing the chi-square test is chisq.test().
The basic syntax is −
chisq.test(data)
Here, data is a table containing the counts of the variables in the
observations.
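To make the input format concrete, here is a minimal sketch based on the ice-cream example. The counts are made-up numbers and the object name icecream is hypothetical; only chisq.test() itself comes from the text above.

```r
# Hypothetical counts of flavor preference by gender (made-up numbers).
icecream <- matrix(
  c(30, 10, 20,
    15, 25, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  flavor = c("chocolate", "strawberry", "vanilla"))
)

# chisq.test() accepts a matrix or table of counts.
result <- chisq.test(icecream)
print(result)

# Degrees of freedom: (rows - 1) * (columns - 1) = 2.
print(result$parameter)
```

With real data the table would normally be built with table() from two categorical columns rather than typed in by hand.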
Example
We will take the Cars93 data in the "MASS" library, which describes
different models of car on sale in the year 1993.
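library("MASS")
# Inspect the structure of the data set.
str(Cars93)
# Build a contingency table of air bag type against car type.
car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)
# Perform the chi-square test.
print(chisq.test(car.data))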
data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
Conclusion
The result shows a p-value of less than 0.05, which indicates a statistically
significant association between the type of air bags and the type of car.
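The warning in the output above is raised when some expected cell counts are small, in which case the chi-squared approximation can be unreliable. One possible way to check this, sketched here as an optional follow-up rather than part of the original example, is to inspect the expected counts and to estimate the p-value by simulation; the seed and the number of replicates B are arbitrary choices.

```r
library(MASS)

car.data <- table(Cars93$AirBags, Cars93$Type)

# Expected counts under independence; small values trigger the warning.
expected <- suppressWarnings(chisq.test(car.data))$expected
print(sum(expected < 5))

# Alternative: estimate the p-value by Monte Carlo simulation instead.
set.seed(1)
sim <- chisq.test(car.data, simulate.p.value = TRUE, B = 10000)
print(sim$p.value)
```

If the simulated p-value is also below 0.05, the conclusion does not hinge on the approximation.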