Predictive Business Analytics: Data Analytics in R
Data Analytics in R
▪ R is a statistical programming language that is widely used in the data science industry. The
version we are using is 3.4.0 or later (release nickname ‘You Stupid Darkness’). It is common
practice to run R code in an Integrated Development Environment (IDE) known as
RStudio.
▪ Do not worry about the jargon; we will start from the basics of programming.
▪ RStudio needs to be installed on your system after you have installed R itself. RStudio
connects to R so that we can execute R code from within RStudio. We will mainly concentrate
on two parts of RStudio: the Console and the Script area. The Console is connected directly to your R
session and executes code statement by statement. In the Script area you can write a
number of statements together and execute them at will.
R reports that the object is not found because R is a case-sensitive language: x and X are two
different names.
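A minimal sketch of the situation described above, assuming a value was assigned to lowercase x and then looked up as uppercase X (the value 5 is illustrative, and the sketch assumes no object called X exists in the session):

```r
# Assign a value to lowercase x (the value is illustrative)
x <- 5

# Looking up uppercase X fails because R treats x and X as different names.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(X, error = function(e) conditionMessage(e))
print(msg)

# The lowercase name works as expected
print(x)
```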
R has three primary data types:
1. Numeric;
2. Character and
3. Logical.
We will work with each of these in turn. We will use the class() function to check the data type of a variable:
class(x)
## [1] 'numeric'
class(y)
## [1] 'character'
class(z)
## [1] 'logical'
What is the data type of anything written inside quotes? Let’s check:
a='10'
class(a)
## [1] 'character'
When a character string that does not represent a number is converted to numeric, R does not raise an
error; it only gives a warning message and returns NA (Not Available).
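A short sketch of both cases, using as.numeric() for the conversion. The string "ten" is an illustrative value that is not in the original material:

```r
a <- "10"     # a number stored as text
b <- "ten"    # text that does not represent a number (illustrative)

a_num <- as.numeric(a)   # converts cleanly
class(a_num)             # "numeric"

# The conversion of b produces a warning rather than an error;
# suppressWarnings() hides it here so the script output stays clean.
b_num <- suppressWarnings(as.numeric(b))
is.na(b_num)             # TRUE: the failed conversion returned NA
```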
Substitution
res5 <- sub('-',',',res4)
res5
## [1] 'Edupristine,Business-Analytics'
Substitution (all occurrences)
res6 <- gsub('-',',',res4)
res6
## [1] 'Edupristine,Business,Analytics'
Extracting part of a string is another very common string operation.
The second argument given to the command is the starting position of the extraction.
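The extraction command is not named in the text above; the sketch below assumes it is base R's substr(), whose second argument is the start position and whose third argument is the end position. The sample string is the one implied by the substitution output earlier:

```r
txt <- "Edupristine-Business-Analytics"

# substr(x, start, stop): characters from position start to position stop
first_word  <- substr(txt, 1, 11)
second_word <- substr(txt, 13, 20)

first_word    # "Edupristine"
second_word   # "Business"
```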
Derived data types are the data types that evolve from the primary data types. We will be learning about
the following:
1. Vectors;
2. Matrix;
4. Lists.
x <- c(2,3,5,1,2,4,-5)
x
## [1] 2 3 5 1 2 4 -5
To access multiple vector elements: Use 1:4 as it will create a sequence from 1 to 4 and it will pick up
the values at positions 1 to 4 of x
x[1:4]
## [1] 2 3 5 1
Negative index: A negative index drops the element at that position from the result. For example, x[-2]
returns x without its second element
x[-2]
## [1] 2 5 1 2 4 -5
A negative index can also take a vector of positions: the result here omits the 1st to 3rd, 5th and 7th elements of vector x.
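The call that produced this result is not shown; one way to write it, assuming the same vector x as above, is to negate a vector of positions:

```r
x <- c(2, 3, 5, 1, 2, 4, -5)

# Drop positions 1 to 3, 5 and 7 in one step
kept <- x[-c(1:3, 5, 7)]
kept        # 1 4 (the values at positions 4 and 6)

# x itself is unchanged unless the result is assigned back to it
length(x)   # 7
```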
x
## [1] 1 2 3 4 5
y
## [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0
'by' sets the difference between consecutive elements of the sequence, while 'length' sets how many
elements the vector should have.
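The two vectors printed above can be reproduced with seq(). This is a sketch of one possible set of calls; note that the full name of the 'length' argument is length.out (R accepts the shorter form through partial matching):

```r
x <- seq(1, 5, by = 1)             # step of 1
y <- seq(1, 5, by = 0.4)           # step of 0.4
z <- seq(1, 5, length.out = 11)    # 11 evenly spaced values

x                          # 1 2 3 4 5
length(y)                  # 11
isTRUE(all.equal(y, z))    # TRUE: both calls describe the same sequence
```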
The first, third, fifth, sixth and seventh elements of x are present at the 1st, 5th, 9th, 15th and 21st positions in y.
The second and fourth elements of x are not present in y.
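The vectors behind this result are not shown. The sketch below uses hypothetical vectors chosen to reproduce the same pattern, and assumes the lookup was done with match(), which returns the position of each element of x in y (NA when absent):

```r
# Hypothetical vectors, chosen only to reproduce the pattern described above
y <- seq(1, 11, by = 0.5)            # 21 values: 1.0, 1.5, ..., 11.0
x <- c(1, 1.2, 3, 3.3, 5, 8, 11)

pos <- match(x, y)    # position in y of each element of x
pos                   # 1 NA 5 NA 9 15 21

present <- x %in% y   # TRUE/FALSE version of the same question
present               # TRUE FALSE TRUE FALSE TRUE TRUE TRUE
```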
Random Sample
x=1:50
sample(x,5) #Choosing 5 random numbers
## [1] 5 4 6 19 40
sample(x,5) #Choosing 5 random numbers
## [1] 45 3 13 10 18
Fixing seed for randomizing algorithm
set.seed(10)
sample(x,5)
## [1] 26 16 21 33 4
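With seed 10, the output is the same every time the code is run with the same R version and random number
settings. The exact numbers can differ across R versions: R 3.6.0 changed the default sampling method, so
later versions return a different sample for the same seed.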
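A quick way to check the effect of the seed is to draw twice with the same seed and compare the results. No expected numbers are shown here, since they depend on the R version in use:

```r
x <- 1:50

set.seed(10)
first_draw <- sample(x, 5)

set.seed(10)                  # resetting the same seed restarts the generator
second_draw <- sample(x, 5)

identical(first_draw, second_draw)   # TRUE: same seed, same sample
```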
▪ Vectors are called atomic data types because they cannot contain higher data types themselves.
▪ Lists are recursive data types: they can contain all sorts of objects, and these do not need to be of
the same type either:
x=1:10
y=letters
z=3.14
list1=list(x,y,z)
list1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q'
## [18] 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z'
##
## [[3]]
## [1] 3.14
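As a small follow-up sketch, the elements of such a list can be pulled out with double square brackets, and a list can itself contain another list. The names nested, numbers, inner, flag and label are illustrative:

```r
list1 <- list(1:10, letters, 3.14)

first_letters <- list1[[2]][1:3]   # [[ ]] extracts the element itself
first_letters                      # "a" "b" "c"
class(list1[2])                    # "list": single [ ] returns a sub-list

# A list can hold another list, which is why lists are called recursive
nested <- list(numbers = 1:3, inner = list(flag = TRUE, label = "abc"))
nested$inner$label                 # "abc"
```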
Data frames: A special kind of list whose elements are vectors of equal length
set.seed(15)
x=1:26
y=letters
df=data.frame(x,y)
df
##     x y
## 1   1 a
## 2   2 b
## 3   3 c
## 4   4 d
## 5   5 e
## 6   6 f
## 7   7 g
## 8   8 h
## 9   9 i
## 10 10 j
## 11 11 k
## 12 12 l
## 13 13 m
## 14 14 n
## 15 15 o
## 16 16 p
## 17 17 q
## 18 18 r
## 19 19 s
## 20 20 t
## 21 21 u
## 22 22 v
## 23 23 w
## 24 24 x
## 25 25 y
## 26 26 z
The numbers shown in the leftmost column are the row names of the data frame.
Loading and processing data in R: We will use the Credit Card dataset.
## $ customer_id : int 713782 515901 95166 425557 624581 721691 269858 219196 413020 174424
## $ country_reg : chr W E W E
## $ ad_exp : chr N N Y Y
## $ imp_cscore : int 619 527 606 585 567 560 620 658 519 645
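The loading step itself is not shown, and the file name of the Credit Card dataset is not given. The sketch below therefore writes a tiny stand-in file (the first four values of the four columns listed above) to a temporary location and reads it back with read.csv(); the real file would be read the same way from its own path:

```r
# Write a tiny stand-in CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "customer_id,country_reg,ad_exp,imp_cscore",
  "713782,W,N,619",
  "515901,E,N,527",
  "95166,W,Y,606",
  "425557,E,Y,585"
), csv_path)

# stringsAsFactors = FALSE keeps text columns as character on every R version
CCData <- read.csv(csv_path, stringsAsFactors = FALSE)

str(CCData)   # one line per column: name, type and first values
dim(CCData)   # number of rows and columns
```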
▪ The apply() family belongs to the R base package and is populated with functions to manipulate
slices of data from matrices, arrays, lists and data frames in a repetitive way. These functions allow
traversing the data in a number of ways and avoid the explicit use of loop constructs. They act on an input
list, matrix or array and apply a named function with one or several optional arguments.
▪ The apply() functions form the basis of more complex combinations and help to perform operations
with very few lines of code. More specifically, the family is made up of:
• apply();
• lapply();
• sapply();
• vapply();
• mapply();
• rapply() and
• tapply().
▪ We start with the godfather of the family, apply(), which operates on arrays (for simplicity we
limit ourselves here to 2D arrays, i.e. matrices). The R base manual tells us that it is called as
apply(X, MARGIN, FUN, ...).
▪ Where:
• X is an array (a matrix if the dimension of the array is 2);
• MARGIN is a variable defining how the function is applied: with MARGIN=1 it applies over rows,
whereas with MARGIN=2 it works over columns. Notably, with MARGIN=c(1,2) it
applies to both rows and columns, that is, to every individual element;
• FUN is the function we want to apply and can be any R function, including a User Defined Function
(more on functions in a separate post)
▪ Beginners may have difficulty visualizing what is actually happening, so an example
helps. Let's construct a 5 x 6 matrix and imagine we want to sum the
values of each column. We can write something like:
• X<-matrix(rnorm(30), nrow=5, ncol=6)
• apply(X,2 ,sum)
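To make the effect of MARGIN easy to verify by hand, here is a sketch with a fixed matrix instead of random numbers (the values 1 to 30 are an illustrative choice):

```r
# A fixed 5 x 6 matrix, filled column by column with 1 to 30
X <- matrix(1:30, nrow = 5, ncol = 6)

col_totals <- apply(X, 2, sum)   # MARGIN = 2: one sum per column
row_totals <- apply(X, 1, sum)   # MARGIN = 1: one sum per row

col_totals   # 15 40 65 90 115 140
row_totals   # 81 87 93 99 105
```

For plain sums, colSums(X) and rowSums(X) give the same totals; apply() becomes useful when FUN is a function that has no such shortcut.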
▪ Suppose we wish to apply a given function to every element of a list and obtain a list as the result.
Running ?lapply shows that the syntax looks like that of apply(). The differences are that:
• It can be used for other objects like dataframes, lists or vectors.
• The output returned is a list (thus the l in the function name) which has the same number of elements
as the object passed to it.
▪ To see how this works, let's create a few matrices and extract from each a given column. This
is a quite common operation performed on real data when making comparisons or
aggregations from different dataframes.
• A<-matrix(1:9, 3,3)
• B<-matrix(4:15, 4,3)
• C<-matrix(8:10, 3,2)
• MyList<-list(A,B,C) # combine the three matrices into a list
• # extract the second column from each matrix in the list, using the selection operator '['
• lapply(MyList,'[', , 2)
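Run as a script, the same steps look as follows. The variable name second_cols is illustrative; everything else follows the calls above:

```r
A <- matrix(1:9, 3, 3)
B <- matrix(4:15, 4, 3)
C <- matrix(8:10, 3, 2)    # 8:10 is recycled to fill all six cells
MyList <- list(A, B, C)

# "[" is the subsetting function: the empty argument means "all rows",
# and 2 selects the second column
second_cols <- lapply(MyList, "[", , 2)

class(second_cols)   # "list"
second_cols[[1]]     # 4 5 6
second_cols[[2]]     # 8 9 10 11
second_cols[[3]]     # 8 9 10
```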
▪ sapply() works like lapply(), but it tries to simplify the output to the most elementary data
structure that is possible. In effect, as can be seen in the base manual, sapply is a 'wrapper'
function for lapply.
▪ An example may help. Say we want to repeat the extraction operation from
the last example, now taking a single element, the first element of the second row (indexes 2 and 1), from each
matrix. lapply would give us a list, whereas sapply returns a vector:
• sapply(MyList,'[', 2,1 )
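A side-by-side sketch of the two calls on the same list of matrices (the names as_list and as_vector are illustrative):

```r
A <- matrix(1:9, 3, 3)
B <- matrix(4:15, 4, 3)
C <- matrix(8:10, 3, 2)
MyList <- list(A, B, C)

as_list   <- lapply(MyList, "[", 2, 1)   # a list of three single values
as_vector <- sapply(MyList, "[", 2, 1)   # simplified to a plain vector

class(as_list)     # "list"
as_vector          # 2 5 9
```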
▪ An R package is a collection of R functions, compiled code and sample data. Packages are stored
under a directory called ‘library’ in the R environment. By default, R installs a set of packages
during installation. More packages are added later, when they are needed for some specific
purpose.
▪ There are multiple ways to install a package. The preferred way is to install it directly
from the internet using the command below. We will install ‘dplyr’, which is one
of the most widely used packages for data wrangling, and then load it:
▪ install.packages('dplyr')
▪ library(dplyr)
▪ Filter
• Conditional filtering of data
▪ Select
• Selecting columns
▪ Mutate
• Adding/modifying columns
▪ Arrange
• Sorting rows
▪ Summarise (with adverb group_by)
• Collapsing Data to its summaries
▪ dplyr is a powerful R package to transform and summarize tabular data with rows and
columns. For another explanation of dplyr, see the package vignette ‘Introduction to
dplyr’.
▪ Why is it useful?
The package contains a set of functions (or ‘verbs’) that perform common data manipulation operations
such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and
summarizing data. In addition, dplyr contains a useful function to perform another common task, which is
the ‘split-apply-combine’ concept. We will discuss that in a little bit.
Verb          Details
select()      select columns
filter()      filter rows
arrange()     re-order or arrange rows
mutate()      create new columns
summarise()   summarise values
group_by()    allows for group operations in the 'split-apply-combine' concept
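A minimal sketch of the split-apply-combine idea with group_by() and summarise(), assuming dplyr is installed. The data frame sleep_df and its values are hypothetical and only serve to keep the example self-contained:

```r
library(dplyr)

# Hypothetical mini data set (values are made up for illustration)
sleep_df <- data.frame(
  order       = c("Primates", "Primates", "Rodentia", "Rodentia"),
  sleep_total = c(10, 12, 14, 16)
)

grouped  <- group_by(sleep_df, order)                          # split by order
by_order <- summarise(grouped, avg_sleep = mean(sleep_total))  # apply mean, combine

by_order$order       # "Primates" "Rodentia"
by_order$avg_sleep   # 11 15
```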
▪ Let us now use the msleep (mammals sleep) data set, which contains the sleep times and weights for a set
of mammals and is available in the dagdata repository on GitHub. This data set contains 83 rows and
11 variables.
Let us say Dan was given a task to analyze the sleep data and come up with insights.
Let us keep only the required rows based on one or more conditions.
Filter the rows for mammals that sleep a total of at least 16 hours:
filter(msleep, sleep_total >= 16)
Filter the rows for mammals that sleep a total of at least 16 hours and have a body weight of
at least 1 kilogram:
filter(msleep, sleep_total >= 16, bodywt >= 1)
Filter the rows for mammals in the Perissodactyla and Primates taxonomic orders:
filter(msleep, order %in% c('Perissodactyla', 'Primates'))
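The same three filters can be tried without downloading msleep. The sketch below assumes dplyr is installed and uses a hypothetical four-row data frame with msleep-style column names; the animal labels and values are made up:

```r
library(dplyr)

# Hypothetical stand-in for msleep (values are illustrative)
animals <- data.frame(
  name        = c("A", "B", "C", "D"),
  order       = c("Primates", "Chiroptera", "Perissodactyla", "Rodentia"),
  sleep_total = c(17, 19.7, 2.9, 16),
  bodywt      = c(0.48, 0.01, 521, 1.2)
)

long_sleepers <- filter(animals, sleep_total >= 16)
# Conditions separated by commas are combined with AND
long_and_heavy <- filter(animals, sleep_total >= 16, bodywt >= 1)
two_orders <- filter(animals, order %in% c("Perissodactyla", "Primates"))

nrow(long_sleepers)    # 3 (A, B and D)
nrow(long_and_heavy)   # 1 (D)
nrow(two_orders)       # 2 (A and C)
```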
Let’s introduce the pipe operator: %>%. dplyr imports this operator from another package
(magrittr). The operator pipes the output of one function into the input of the next
function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is
to read the functions from left to right.
Let us take another example where we want to arrange the data based on the ‘order’ variable:
msleep %>% arrange(order) %>% head
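To see that piping only changes how the code reads, the sketch below writes the same steps once as nested calls and once as a pipeline. It assumes dplyr is installed and uses a hypothetical three-row data frame:

```r
library(dplyr)

# Hypothetical stand-in data (values are illustrative)
animals <- data.frame(
  name        = c("A", "B", "C"),
  order       = c("Rodentia", "Primates", "Carnivora"),
  sleep_total = c(14, 10, 12)
)

# Nested calls are read from the inside out
nested <- head(arrange(select(animals, name, order), order), 2)

# The piped version is read from left to right
piped <- animals %>%
  select(name, order) %>%
  arrange(order) %>%
  head(2)

identical(nested, piped)   # TRUE: both forms do the same work
```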
library(dplyr)
filter(CCData, country_reg %in% c('E', 'W'))
Chaining: using the %>% operator (the pipe, also called the chaining operator) instead of nesting calls
CCData %>%
select(customer_id, country_reg, est_income) %>%
filter(est_income > mean(est_income))
## 1 425557 E 67211.59
## 2 721691 E 73896.1
## 3 269858 W 73609.4
## 4 48861 W 72547.14
## 5 791715 E 76087.37
## 6 778438 E 68383.43
## 7 407939 E 72032.97
## 8 182729 W 83992.32
## 9 47021 W 71675.08
## 10 399206 E 69048.21
CCData %>%
select(RiskScore, imp_cscore) %>%
mutate(ScoreRatio = RiskScore/imp_cscore)
## Observations: 14,000
## Variables: 13
## $ customer_id <int> 461914, 664695, 581996, 680566, 523177, 7728...
second(timeval)
## [1] 25
wday(timeval)
## [1] 6
wday(timeval, label = TRUE)
## [1] Fri
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
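You can also do arithmetic with dates. For example, to add a year to a date, dyears(1) adds a fixed-length duration while years(1) moves the calendar forward by one year, so the two results differ for the leap year 2016: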
ymd(20160101) + dyears(1)
## [1] '2016-12-31'
ymd(20160101) + years(1)
## [1] '2017-01-01'
▪ The examples we saw earlier had well-formatted character strings as input for dates.
▪ Dates can also contain month names, written in full or as three-letter abbreviations, and the other date
components can vary in a similar way.
▪ You can handle these cases by specifying your own format with the function parse_date_time and the following format builders:
• b : Abbreviated month name
• B : Full month name
• d : Day of the month as decimal number (01 to 31 or 0 to 31)
• H : Hours as decimal number (00 to 24 or 0 to 24) - 24 hrs format
• I : Hours as decimal number (01 to 12 or 0 to 12) - 12 hrs format
• j : Day of year as decimal number (001 to 366 or 1 to 366)
• m : Month as decimal number (01 to 12 or 1 to 12)
• M : Minute as decimal number (00 to 59 or 0 to 59)
• p : AM/PM indicator in the locale. Used in conjunction with I and not with H
• S : Second as decimal number (00 to 61 or 0 to 61), allowing for up to two leap-seconds (but POSIX-compliant
implementations will ignore leap seconds)
• OS : Fractional second
• y : Year without century (00 to 99 or 0 to 99)
• Y : Year with century
parse_date_time('01-16-Jan','%d-%y-%b')
## [1] '2016-01-01 UTC'
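The format builders can also be passed without the % signs and separators, as a compact orders string. The sketch below assumes lubridate is installed and an English locale (month names are locale dependent); the second input string is illustrative:

```r
library(lubridate)

# "dyb" means: day, two-digit year, abbreviated month name
d1 <- parse_date_time("01-16-Jan", orders = "dyb")

# "dBY" means: day, full month name, four-digit year (illustrative input)
d2 <- parse_date_time("16 January 2016", orders = "dBY")

format(d1, "%Y-%m-%d")   # "2016-01-01"
format(d2, "%Y-%m-%d")   # "2016-01-16"
```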