
Predictive Business Analytics: Data Analytics in R


Predictive Business Analytics

Data Analytics in R

© EduPristine For [PBA – Data Analytics in R ] (Confidential)




Learning R

▪ R is a statistical programming language that is widely used in the data science industry. The
version we are using is 3.4.0+ (release 3.4.0 carries the nickname ‘You Stupid Darkness’). It is
common practice to execute the code in an Integrated Development Environment (IDE) known as
RStudio.

▪ Do not worry about the jargon; we will start from the basics of programming.

▪ RStudio needs to be installed on your system after you have installed R. RStudio
connects to R, and we can then execute R code from RStudio. We will mainly concentrate
on two parts of RStudio: the Console and the Script area. The Console is connected directly to R
and executes code statement by statement. In the Script area, however, you can write a
number of statements together and execute them at will.



Variables and Constants

A variable is something whose value can change, whereas a constant never changes.


Example:
x = 10, is a variable which has been assigned the constant 10
x=10
x
## [1] 10

Please note, R is a case sensitive language and hence, X is not equal to x.


So we see above that when we print x, we get a value of 10. What should happen when we print X?
X
## Error: object 'X' not found

So it gives us an error that the object is not found since R is a case sensitive language and x is not equal
to X.



Data Types

R has three primary data types:
1. Numeric;
2. Character and
3. Logical.

We will use the class command to check the data type of the variable:
class(x)
## [1] 'numeric'
We will work on these eventually.



Data Types

Let us try the other data types:


y='King'
z=TRUE

class(y)
## [1] 'character'
class(z)
## [1] 'logical'

If we write a value in quotes (''), what will the data type be? Let’s check:
a='10'
class(a)
## [1] 'character'

Any value being placed in a variable in quotes is always going to be a character.



Data Types

Convert character data type (numerical data) to numeric data type:


b <- as.numeric(a)
a
## [1] '10'
class(a)
## [1] 'character'
b
## [1] 10
class(b)
## [1] 'numeric'



What happens if the character value contains letters and we convert it to numeric? Let's see.
y
## [1] 'King'
c <- as.numeric(y)
## Warning: NAs introduced by coercion
c
## [1] NA

Converting it to numeric does not give an error; R only issues a warning message and converts the
value to NA (Not Available).
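
Because the coercion only warns, it is worth checking for NA values afterwards. A minimal sketch, assuming a made-up input vector (raw and nums are illustrative names, not part of the course data):

```r
# Hypothetical input mixing numeric-looking strings and plain text
raw <- c('10', 'King', '3.5')

# suppressWarnings() only silences the warning; the NA is still produced
nums <- suppressWarnings(as.numeric(raw))

is.na(nums)            # flags the values that could not be converted
sum(is.na(nums))       # counts the failed conversions
nums[!is.na(nums)]     # keeps only the successfully converted values
```

Counting the NA values right after a conversion is a cheap way to notice unexpected text in a column that was supposed to be numeric.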



Character Operations with R Functions – Concatenate
There are different character operations for different requirements. Let’s discuss a few basic ones:
Concatenation: Joining multiple strings together
x <- 'Edupristine'
y <- 'Business'
z <- 'Analytics'
res <- paste(x,y,z)
res
## [1] 'Edupristine Business Analytics'
Concatenation (without spaces, or with a different separator)
res2 <- paste0(x,y,z)
res3 <- paste(x,y,z, sep = '')
res4 <- paste(x,y,z, sep='-')
res2
## [1] 'EdupristineBusinessAnalytics'
res3
## [1] 'EdupristineBusinessAnalytics'
res4
## [1] 'Edupristine-Business-Analytics'



Substitution

Substituting (first occurrence only)
res5 <- sub('-',',',res4)
res5
## [1] 'Edupristine,Business-Analytics'
Substituting (all occurrences)
res6 <- gsub('-',',',res4)
res6
## [1] 'Edupristine,Business,Analytics'



Text Extraction

Extracting parts of text is another very important function that is performed with the strings.

Extract 'Edupristine' from res6.


res7 <- substr(res6, 1, 11)
res7
## [1] 'Edupristine'

The second parameter given to the command is the starting position of the extraction.

The last parameter is the last position of the extraction.
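
When the length of the piece is not known in advance, the end position can be computed rather than hard-coded. A small sketch under that assumption (first_word and parts are illustrative names):

```r
res6 <- 'Edupristine,Business,Analytics'

# nchar() gives the number of characters, so the end position need not be counted by hand
first_word <- 'Edupristine'
n <- nchar(first_word)
substr(res6, 1, n)

# Alternative: split on the separator and pick a piece by position
parts <- strsplit(res6, ',', fixed = TRUE)[[1]]
parts[2]
```

strsplit() returns a list with one element per input string, which is why the [[1]] is needed before indexing the pieces.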



Logical Operations
Checking logical statements
x=10
y=20
z=30
x>y
## [1] FALSE
z>y
## [1] TRUE
x==z
## [1] FALSE

Checking multiple statements
z>y & y>x
## [1] TRUE
So we see above that z>y AND (denoted by & in R) y>x are both true and hence, the final result is TRUE.
x>y | z>y
## [1] TRUE
Use the pipe symbol (|) for the OR condition.



Derived data types

Derived data types are the data types that evolve from the primary data types. We will be learning about
the following:

1. Vectors;

2. Matrix;

3. Data Frames and

4. Lists.



01. Vectors

▪ Vectors: A collection of scalars of the same data type


▪ Vectors are made up of elements of a primary data type.
▪ A vector can be numeric, character or logical.
▪ Use the command 'c' (combine) to create any vector.

x <- c(2,3,5,1,2,4,-5)
x
## [1] 2 3 5 1 2 4 -5

Checking vector data type


class(x)
## [1] 'numeric'
is.vector(x)
## [1] TRUE
x is a numeric vector



01. Vectors

Accessing vector elements: To access the third element of vector x.


x[3]
## [1] 5

To access multiple vector elements: Use 1:4 as it will create a sequence from 1 to 4 and it will pick up
the values at positions 1 to 4 of x
x[1:4]
## [1] 2 3 5 1

To pick values from 1 to 3, 5th and 7th element


x[c(1:3,5,7)]
## [1] 2 3 5 2 -5



01. Vectors – Negative Index

Negative index: A negative index drops the element at that position. Here, x[-2] returns x without
its second element (x itself is not modified)
x[-2]
## [1] 2 5 1 2 4 -5

Removing multiple indices from the vector


x[-c(1:3,5,7)]
## [1] 1 4

It dropped the 1st to 3rd, 5th and 7th elements of vector x from the result.

Logical Index: Let’s say we want the list of numbers above 3


x>3
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE
x[x>3]
## [1] 5 4
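
Logical indexes can also combine several conditions with the & and | operators covered earlier. A short sketch using the same vector (keep is an illustrative name):

```r
x <- c(2, 3, 5, 1, 2, 4, -5)

# Element-wise AND: values strictly between 1 and 5
keep <- x > 1 & x < 5
x[keep]

# Element-wise OR: values below 0 or above 4
x[x < 0 | x > 4]

# TRUE counts as 1, so sum() counts how many elements satisfy a condition
sum(x > 3)
```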

01. Vector – Creating a Vector

Using seq command to create a vector


x <- seq(1,5,by=1)
y <- seq(1,5,length=11)

x
## [1] 1 2 3 4 5
y
## [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0

'by' sets the difference between consecutive elements of the sequence, whereas 'length' (a partial
match for the length.out argument) sets how many elements the vector should have.
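
The two arguments are two ways of describing the same kind of sequence. A minimal sketch showing that a fixed step and a fixed element count can produce the same vector (a and b are illustrative names):

```r
a <- seq(0, 1, by = 0.25)        # step size fixed, count follows
b <- seq(0, 1, length.out = 5)   # count fixed, step size follows

all.equal(a, b)                  # compares with a numeric tolerance
length(b)

# seq_len(n) is a safe shorthand for the sequence 1, 2, ..., n
seq_len(4)
```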



01. Vectors – Operations

Vector operations are applied element by element (vectorized): below, paste combines x and y position by position


x <- letters #Creating alphabets
y <- 1:26 # A sequence from 1 to 26
z <- paste(x,y, sep='*') #Creating a vector pasting x and y with an asterisk in between
res <- paste(z, collapse =' + ' ) #Combining all elements of the vector using a plus sign in between
x
## [1] 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q'
## [18] 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z'
y
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26
z
## [1] 'a*1' 'b*2' 'c*3' 'd*4' 'e*5' 'f*6' 'g*7' 'h*8' 'i*9' 'j*10'
## [11] 'k*11' 'l*12' 'm*13' 'n*14' 'o*15' 'p*16' 'q*17' 'r*18' 's*19' 't*20'
## [21] 'u*21' 'v*22' 'w*23' 'x*24' 'y*25' 'z*26'
res
## [1] 'a*1 + b*2 + c*3 + d*4 + e*5 + f*6 + g*7 + h*8 + i*9 + j*10 + k*11 + l*12 + m*13 + n*14 + o*15
+ p*16 + q*17 + r*18 + s*19 + t*20 + u*21 + v*22 + w*23 + x*24 + y*25 + z*26'

More Utility Functions and Operators

Matching two Vectors: Match


x=c('a','$','e','1','i','o','u')
y=letters
match(x,y)
## [1] 1 NA 5 NA 9 15 21

The first, third, fifth, sixth and seventh elements of x are present at the 1st, 5th, 9th, 15th and 21st
positions in y. The second and fourth elements of x are not present in y, so match returns NA for them.

Matching two vectors: %in% – Matching logically


x %in% y
## [1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE
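
The two functions are closely related: %in% reports whether a match exists, while match reports where it is. A small sketch, assuming a made-up vector called codes:

```r
codes <- c('a', '$', 'e', '1')     # hypothetical input values

pos <- match(codes, letters)       # position in letters, NA when absent
pos

codes[codes %in% letters]          # keep only the values found in letters
codes[!(codes %in% letters)]       # keep only the values not found

# A value is "in" the table exactly when match() does not return NA
identical(codes %in% letters, !is.na(pos))
```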



More Utility Functions and Operators

Modulo Operator: %% – Finding out remainder


5%%3
## [1] 2
Extract odd numbers from the vector 2:99
which() returns the indices of the elements of a vector that fulfill the specified condition.
x=2:99
which(x%%2!=0)
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
## [24] 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92
## [47] 94 96 98
The returned vector contains the indices, not the values themselves. You can get the values by passing
these indices to the vector for sub-setting.
x[which(x%%2!=0)]
## [1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
## [24] 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93
## [47] 95 97 99



Sampling

Random Sample
x=1:50
sample(x,5) #Choosing 5 random numbers
## [1] 5 4 6 19 40
sample(x,5) #Choosing 5 random numbers
## [1] 45 3 13 10 18
Fixing seed for randomizing algorithm
set.seed(10)
sample(x,5)
## [1] 26 16 21 33 4
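With seed 10, the output will always be the same on a given R version, irrespective of the OS or system
you run the code on. The values shown here come from the R 3.4 setup used in this course; R 3.6.0
changed the default sampling algorithm, so newer versions return a different (but equally
reproducible) sample.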
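
Reproducibility is easy to check without relying on specific values: reset the seed and draw again. A minimal sketch (first and second are illustrative names):

```r
x <- 1:50

set.seed(10)
first <- sample(x, 5)

set.seed(10)               # resetting the seed restarts the random number stream
second <- sample(x, 5)

identical(first, second)   # the two draws match exactly
```

Without the second set.seed() call, the second draw would continue the random stream and generally differ from the first.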



Lists

▪ Vectors are called atomic data types because they cannot contain higher data types themselves.
▪ Lists are recursive data types: they can contain all sorts of objects, and these do not need to be of
the same type either:
x=1:10
y=letters
z=3.14

list1=list(x,y,z)
list1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q'
## [18] 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z'
##
## [[3]]
## [1] 3.14



Lists

Accessing individual list elements
list1[[2]]
## [1] 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q'
## [18] 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z'
Accessing the 10th value of the second list element
list1[[2]][10]
## [1] 'j'
Naming list elements
list2=list(cust_num=x,prod_name=y,marktng_expense=z)
list2
## $cust_num
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $prod_name
## [1] 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q'
## [18] 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z'
##
## $marktng_expense
## [1] 3.14
Accessing the 10th value of the prod_name element of list2
list2$prod_name[10]
## [1] 'j'
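
A common source of confusion is the difference between single and double brackets on a list. A small sketch, assuming a shortened list called lst that is not part of the course data:

```r
lst <- list(cust_num = 1:3, prod_name = c('a', 'b', 'c'))   # hypothetical list

class(lst['prod_name'])     # single brackets return a smaller list
class(lst[['prod_name']])   # double brackets return the element itself
lst$prod_name[2]            # $ is shorthand for [[ ]] with a name

names(lst)                  # the element names
```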



Data Frames

Data frames: A special kind of list which has vectors of equal length
set.seed(15)
x=1:26
y=letters
df=data.frame(x,y)
df
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
## 11 11 k
## 12 12 l
## 13 13 m
## 14 14 n
## 15 15 o
## 16 16 p
## 17 17 q
## 18 18 r
## 19 19 s
## 20 20 t
## 21 21 u
## 22 22 v
## 23 23 w
## 24 24 x
## 25 25 y
## 26 26 z

The numbers in the leftmost column are the row names of the data frame.



Data Frames

View(df) #View the data set


names() returns the variable names of the data set.
names(df) #outputs variable names of the data set
## [1] 'x' 'y'
Changing column names
names(df)=c('Prod_ID','Prod_Name')
df
## Prod_ID Prod_Name
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
----- and so on…



Data Frames

rownames(df) #Returns the row names


## [1] '1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14'
## [15] '15' '16' '17' '18' '19' '20' '21' '22' '23' '24' '25' '26'

rownames(df)=paste0('R',1:26) #Renaming rownames


df
## Prod_ID Prod_Name
## R1 1 a
## R2 2 b
## R3 3 c
## R4 4 d
## R5 5 e
## R6 6 f
----- and so on ...



Data Frames
dim(df) #Gives dimensions
## [1] 26 2
nrow(df) #Number of rows
## [1] 26
ncol(df) #Number of columns
## [1] 2
str(df) #Quick glimpse of the structure of data
## 'data.frame': 26 obs. of 2 variables:
## $ Prod_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Prod_Name: Factor w/ 26 levels 'a','b','c','d',..: 1 2 3 4 5 6 7 8 9 10 ...

▪ In R versions before 4.0.0 (including the 3.4.0 used here), character variables are by default stored
as factors, which are nothing but integers assigned to the different levels/unique values of the
character variable.
▪ This allows R to save on memory because integers take less space than character values.

df$Prod_Name <- as.character(df$Prod_Name) #Forcing the type on a particular column
str(df)
## 'data.frame': 26 obs. of 2 variables:
## $ Prod_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Prod_Name: chr 'a' 'b' 'c' 'd' ...
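
The integer codes behind a factor can be inspected directly. A minimal sketch with a made-up factor f:

```r
f <- factor(c('b', 'a', 'b', 'c'))   # hypothetical character data stored as a factor

levels(f)         # the unique values, sorted alphabetically by default
as.integer(f)     # the integer code stored for each element
as.character(f)   # converts back to the original labels
```

This is also why as.numeric() on a factor of digit strings returns the codes rather than the numbers; converting with as.character() first avoids that surprise.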



Accessing Data Frames

Loading and processing data in R: We will use the Credit Card dataset.

filedata1 <- read.csv('CCData1.csv', stringsAsFactors = FALSE) #Reading the data file
str(filedata1) #Getting a glimpse of the data
## 'data.frame': 10000 obs. of 12 variables:
## $ customer_id : int 713782 515901 95166 425557 624581 721691 269858 219196 413020 174424 ...
## $ demographic_slice : chr AX03efs AX03efs AX03efs AX03efs ...
## $ country_reg : chr W E W E ...
## $ ad_exp : chr N N Y Y ...
## $ est_income : num 33408 19928 51222 67212 20093 ...
## $ hold_bal : num 3 20.3 4 18.7 4 ...
## $ pref_cust_prob : num 0.5311 0.2974 0.0185 0.0893 0.0949 ...
## $ imp_cscore : int 619 527 606 585 567 560 620 658 519 645 ...
## $ RiskScore : num 503 820 587 635 632 ...
## $ imp_crediteval : num 24 23 24.9 24.8 24.7 ...
## $ axio_score : num 0.1373 0.0523 0.452 0.5646 0.9173 ...
## $ card_offer : logi FALSE FALSE FALSE FALSE FALSE TRUE ...



Accessing Data Frames

filedata2 <- read.csv('CCData1.csv', stringsAsFactors = FALSE) #Reading the data set into another data frame
str(filedata2) #Glimpse of the data
## 'data.frame': 10000 obs. of 12 variables:
## $ customer_id : int 713782 515901 95166 425557 624581 721691 269858 219196 413020 174424 ...
## $ demographic_slice : chr AX03efs AX03efs AX03efs AX03efs ...
## $ country_reg : chr W E W E ...
## $ ad_exp : chr N N Y Y ...
## $ est_income : num 33408 19928 51222 67212 20093 ...
## $ hold_bal : num 3 20.3 4 18.7 4 ...
## $ pref_cust_prob : num 0.5311 0.2974 0.0185 0.0893 0.0949 ...
## $ imp_cscore : int 619 527 606 585 567 560 620 658 519 645 ...
## $ RiskScore : num 503 820 587 635 632 ...
## $ imp_crediteval : num 24 23 24.9 24.8 24.7 ...
## $ axio_score : num 0.1373 0.0523 0.452 0.5646 0.9173 ...
## $ card_offer : logi FALSE FALSE FALSE FALSE FALSE TRUE ...



Apply functions details

▪ The apply() family belongs to the R base package and is populated with functions to manipulate
slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions allow
traversing the data in a number of ways and avoid explicit use of loop constructs. They act on an input
list, matrix or array and apply a named function with one or several optional arguments.
▪ The apply() functions form the basis of more complex combinations and help to perform operations
with very few lines of code. More specifically, the family is made up of:
• apply();
• lapply();
• sapply();
• vapply();
• mapply();
• rapply() and
• tapply().



'apply' function details

▪ We start with the godfather of the family, apply, which operates on arrays (for simplicity we
limit ourselves here to 2D arrays, i.e. matrices). The R base manual tells us that it is called as follows:

apply(X, MARGIN, FUN, ...)

▪ Where:
• X is an array (a matrix if the dimension of the array is 2);
• MARGIN defines how the function is applied: with MARGIN=1 it applies over rows, with MARGIN=2
it works over columns, and with MARGIN=c(1,2) it applies to every individual element;
• FUN is the function we want to apply and can be any R function, including a User Defined Function.
▪ Beginners may have difficulty visualizing what is actually happening, so an example helps. Let's
construct a 5 x 6 matrix and sum the values of each column:
X<-matrix(rnorm(30), nrow=5, ncol=6)
apply(X,2,sum)
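
Since rnorm produces different numbers on every run, a small fixed matrix makes the effect of MARGIN easier to verify by hand. A sketch under that assumption (M and the other names are illustrative):

```r
# A 2 x 3 matrix filled column by column:
#   1 3 5
#   2 4 6
M <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(M, 1, sum)   # MARGIN = 1: one result per row
col_sums <- apply(M, 2, sum)   # MARGIN = 2: one result per column

# MARGIN = c(1, 2): the function is applied to each cell separately
scaled <- apply(M, c(1, 2), function(v) v * 10)

row_sums
col_sums
scaled
```

For plain sums, base R also offers rowSums() and colSums(), which give the same totals.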



'lapply' function details

▪ We wish to apply a given function to every element of a list and obtain a list as the result. The help
page (?lapply) shows that the syntax is similar to that of apply. The differences are:
• It can be used for other objects like dataframes, lists or vectors.
• The output returned is a list (thus the l in the function name) which has the same number of elements
as the object passed to it.
▪ To see how this works, let's create a few matrices and extract a given column from each. This
is quite a common operation performed on real data when making comparisons or
aggregations from different dataframes.
A<-matrix(1:9, 3,3)
B<-matrix(4:15, 4,3)
C<-matrix(8:10, 3,2)
MyList<-list(A,B,C)
MyList # display the list

# extract the second column from each matrix in the list, using the selection operator '['
lapply(MyList,'[', , 2)



‘sapply’ function details

▪ sapply works like lapply, but it tries to simplify the output to the most elementary data
structure that is possible. In effect, as can be seen in the base manual, sapply is a 'wrapper'
function for lapply.

▪ An example may help. Say we want to repeat the extraction from the last example, now taking
the first element of the second row (indexes 2 and 1) of each matrix. lapply would give us a list,
whereas sapply returns a vector:
sapply(MyList,'[', 2,1 )
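
The difference in return type can be seen by running both functions on the same input. A minimal sketch that rebuilds the matrices so it runs on its own (as_list, as_vec and checked are illustrative names); vapply, also a member of the family, is included because it lets you state the expected type up front:

```r
A <- matrix(1:9, 3, 3)
B <- matrix(4:15, 4, 3)
C <- matrix(8:10, 3, 2)
MyList <- list(A, B, C)

as_list <- lapply(MyList, '[', 2, 1)   # always a list
as_vec  <- sapply(MyList, '[', 2, 1)   # simplified to a vector when possible

is.list(as_list)
is.list(as_vec)
as_vec

# vapply behaves like sapply but fails loudly if a result is not one integer
checked <- vapply(MyList, function(m) m[2, 1], integer(1))
```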



Concept of Packages in R

▪ R packages are collections of R functions, compiled code and sample data. They are stored
under a directory called ‘library’ in the R environment. By default, R installs a set of packages
during installation. More packages are added later, when they are needed for some specific
purpose.

▪ There are multiple ways to install a package. The preferred way is to install it directly from
the internet. We will install ‘dplyr’, one of the most widely used packages for data wrangling,
and then load it:
install.packages('dplyr')
library(dplyr)



Playing with dplyr

▪ Filter
• Conditional filtering of data
▪ Select
• Selecting columns
▪ Mutate
• Adding/modifying columns
▪ Arrange
• Sorting rows
▪ Summarise (with adverb group_by)
• Collapsing Data to its summaries



‘dplyr’ usage in R

▪ dplyr is a powerful R package to transform and summarize tabular data with rows and
columns. For another explanation of dplyr see the dplyr package vignette: Introduction to
dplyr.

▪ Why is it useful?
The package contains a set of functions (or ‘verbs’) that perform common data manipulation operations
such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and
summarizing data. In addition, dplyr contains a useful function to perform another common task which is
the ‘split-apply-combine’ concept. We will discuss that in a little bit.

▪ How does it compare to using base R functions?


If you are familiar with R, you are probably familiar with base R functions such as split(), subset(), apply(),
sapply(), lapply(), tapply() and aggregate(). Compared to base functions in R, the functions in dplyr are
easier to work with, are more consistent in the syntax and are targeted for data analysis around data
frames instead of just vectors.



Important dplyr Verbs to Remember

Verb          Details
select()      select columns
filter()      filter rows
arrange()     re-order or arrange rows
mutate()      create new columns
summarise()   summarise values
group_by()    allows for group operations in the 'split-apply-combine' concept

▪ Let us now use the msleep (mammals sleep) data set, which contains the sleep times and weights for
a set of mammals and is available in the dagdata repository on GitHub. This data set contains 83 rows
and 11 variables.

▪ Load the mammals data from 'mammals.csv' file as msleep in R.



Using dplyr to Analyze Data

Let us say Dan was given a task to analyze the sleep data and come up with insights.

Let us filter out only the required columns.

Selecting columns using select()


There are multiple ways to use select in dplyr. Here are a few examples using the sleep data:
sleepData <- select(msleep, name, sleep_total)
To select all the columns except a specific column, use the ‘-’ (subtraction) operator (also known
as negative indexing)
head(select(msleep, -name))
To select a range of columns by name, use the ‘:’ (colon) operator
head(select(msleep, name:order))
To select all columns that start with the character string ‘sl’, use the function starts_with()
head(select(msleep, starts_with('sl')))



Using dplyr to filter data

Let us filter out only the required rows based on multiple conditions

Filter the rows for mammals that sleep a total of at least 16 hours.
filter(msleep, sleep_total >= 16)
Filter the rows for mammals that sleep a total of at least 16 hours and have a body weight of
at least 1 kilogram.
filter(msleep, sleep_total >= 16, bodywt >= 1)
Filter the rows for mammals in the Perissodactyla and Primates taxonomic orders.
filter(msleep, order %in% c('Perissodactyla', 'Primates'))



Using pipe (%>%) in dplyr

Let’s introduce the pipe operator: %>%. dplyr imports this operator from another package
(magrittr). This operator allows you to pipe the output from one function to the input of another
function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is
to read the functions from left to right.

Here’s an example you have seen:

head(select(msleep, name, sleep_total))

With the pipe, the msleep data frame is passed to select(), and the result is then passed to head():

msleep %>% select(name, sleep_total) %>% head

Let us take another example where we want to arrange the data based on the ‘order’ variable:
msleep %>% arrange(order) %>% head
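
The nested and piped forms give the same result, which can be checked directly. A minimal sketch, assuming dplyr is installed and using the built-in mtcars data set in place of msleep so that no file is needed (nested and piped are illustrative names):

```r
library(dplyr)

# Nested form: read from the inside out
nested <- head(select(mtcars, mpg, cyl))

# Piped form: read from left to right
piped <- mtcars %>% select(mpg, cyl) %>% head()

all.equal(nested, piped)   # both forms produce the same data frame
```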



Using dplyr with CCdata



Accessing Data Frames

install.packages('dplyr') #Installing the required package, if required
library(dplyr) #Loading the required package
CCData <- as.data.frame(bind_rows(filedata1, filedata2)) #Stacking the rows of both data sets
str(CCData) #Glimpse of the new data set
## 'data.frame': 20000 obs. of 12 variables:
## $ customer_id : int 713782 515901 95166 425557 624581 721691 269858 219196 413020 174424 ...
## $ demographic_slice : chr AX03efs AX03efs AX03efs AX03efs ...
## $ country_reg : chr W E W E ...
## $ ad_exp : chr N N Y Y ...
## $ est_income : num 33408 19928 51222 67212 20093 ...
## $ hold_bal : num 3 20.3 4 18.7 4 ...
## $ pref_cust_prob : num 0.5311 0.2974 0.0185 0.0893 0.0949 ...
## $ imp_cscore : int 619 527 606 585 567 560 620 658 519 645 ...
## $ RiskScore : num 503 820 587 635 632 ...
## $ imp_crediteval : num 24 23 24.9 24.8 24.7 ...
## $ axio_score : num 0.1373 0.0523 0.452 0.5646 0.9173 ...
## $ card_offer : logi FALSE FALSE FALSE FALSE FALSE TRUE ...



Merging Data Together Continued

Looking at the categorical variables in detail.


table(CCData$demographic_slice)
##
## AX03efs BWEsk45 CARDIF2 DERS3w5
## 4990 5138 4858 5014



Merging Data Together Continued

library(dplyr)
filter(CCData, country_reg %in% c('E', 'W'))

## customer_id demographic_slice country_reg ad_exp est_income
## 1 713782 AX03efs W N 33407.90
## 2 515901 AX03efs E N 19927.53
## 3 95166 AX03efs W Y 51222.47
## 4 425557 AX03efs E Y 67211.59
## 5 624581 AX03efs W N 20093.34
## 6 721691 AX03efs E N 73896.10
## 7 269858 AX03efs W N 73609.40
## 8 219196 AX03efs W N 57619.67
## 9 413020 AX03efs W Y 49282.62
## 10 174424 AX03efs W N 57173.06
## 11 664695 AX03efs E Y 1216.41
## 12 498530 AX03efs W N 60261.18
## 13 722754 AX03efs E Y 27898.80
## 14 48861 AX03efs W Y 72547.14



Merging Data Together Continued

select(CCData, customer_id, demographic_slice, est_income)

## customer_id demographic_slice est_income
## 1 713782 AX03efs 33407.90
## 2 515901 AX03efs 19927.53
## 3 95166 AX03efs 51222.47
## 4 425557 AX03efs 67211.59
## 5 624581 AX03efs 20093.34
## 6 721691 AX03efs 73896.10
## 7 269858 AX03efs 73609.40
## 8 219196 AX03efs 57619.67
## 9 413020 AX03efs 49282.62
## 10 174424 AX03efs 57173.06



Merging Data Together Continued

Chaining: Using the %>% operator (called the chaining operator or the pipe) instead of nesting function calls

CCData %>%
select(customer_id, country_reg, est_income) %>%
filter(est_income > mean(est_income))

## customer_id country_reg est_income
## 1 425557 E 67211.59
## 2 721691 E 73896.10
## 3 269858 W 73609.40
## 4 48861 W 72547.14
## 5 791715 E 76087.37
## 6 778438 E 68383.43
## 7 407939 E 72032.97
## 8 182729 W 83992.32
## 9 47021 W 71675.08
## 10 399206 E 69048.21



Merging Data Together Continued

Mutate verb to add/modify variables in the data set

CCData %>%
select(RiskScore, imp_cscore) %>%
mutate(ScoreRatio = RiskScore/imp_cscore)

## RiskScore imp_cscore ScoreRatio
## 1 503.249 619 0.813003
## 2 820.1081 527 1.556182
## 3 586.6058 606 0.967996
## 4 634.702 585 1.084961
## 5 631.95 567 1.114550
## 6 809.334 560 1.445239
## 7 697.3082 620 1.124691
## 8 668.0755 658 1.015312
## 9 656.1116 519 1.264184
## 10 547.7671 645 0.849251
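
mutate returns a new data frame and leaves the original untouched, so the result has to be assigned if the new column should be kept. A minimal sketch, assuming dplyr is installed and using a made-up two-row data frame called scores with the same column names:

```r
library(dplyr)

# Hypothetical stand-in for the credit card data
scores <- data.frame(RiskScore = c(500, 820), imp_cscore = c(625, 500))

# Without assignment the result is only printed; scores is unchanged
scores %>% mutate(ScoreRatio = RiskScore / imp_cscore)
cols_before <- ncol(scores)

# Assigning the result back keeps the new column
scores <- scores %>% mutate(ScoreRatio = RiskScore / imp_cscore)
cols_after <- ncol(scores)

c(cols_before, cols_after)
```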



Sampling Data

▪ Taking a sample: drawing 70% of the rows as a sample


set.seed(11)
s=sample(1:nrow(CCData),0.7*nrow(CCData))
s
## [1] 5545 11 10212 281 1294 19093 1730 5798 17607 2464 3501
## [12] 8811 18133 17010 14670 11466 9628 6607 3151 9594 4071 13594
## [23] 7268 6999 1239 9649 7973 324 2499 7952 10084 6544 8223
## [34] 4042 16226 12819 5528 2068 5111 1155 4940 4292 9853 13041
## [45] 6567 17239 12727 275 10558 16647 10032 11925 8490 6228 4427
## [56] 8845 8228 16268 3244 11905 2687 15002 3471 11561 14979 3645
## [67] 1144 9937 3831 12352 13673 8481 8494 4076 18709 10675 5389
## [78] 6622 7252 4389 1014 9886 10121 4699 16207 8347 3812 12922
## [89] 16195 4490 7277 15223 287 9483 16136 4764 15235 3842 11683
## [100] 4440 15626 13476 8140 8135 4110 5252 13755 14729 11204 2846
## [111] 14049 1790 8707 16349 13316 6021 18525 8920 10418 4876 3935
## [122] 5752 19687 3390 8980 1431 509 17020 14009 12537 18973 13548
## [133] 4262 10422 2804 9579 5396 14085 15388 19729 13404 11092 8954
## [144] 13664 4422 10502 11720 6574 17243 6039 11160 15175 4873 15198
## [155] 13391 12755 2835 3663 13017 11244 9408 5884 12376 5096 5403
## [166] 16042 3611 904 19597 3691 5418 16388 1927 15335 13754 12650
## [177] 3901 15911 1309 7111 17549 2087 3013 7914 16367 5710 5816
## [188] 5593 13791 17343 7176 19967 17980 19685 2518 14398 18244 11423



Sampling Data Continued

sampled.data <- CCData[s,]
glimpse(sampled.data)
## Observations: 14,000
## Variables: 13
## $ customer_id <int> 461914, 664695, 581996, 680566, 523177, 7728...
## $ demographic_slice <chr> 'CARDIF2', 'AX03efs', 'AX03efs', 'AX03efs', ...
## $ country_reg <chr> 'W', 'E', 'E', 'W', 'W', 'E', 'E', 'E', 'W',...
## $ ad_exp <chr> 'N', 'Y', 'N', 'N', 'Y', 'Y', 'N', 'N', 'N',...
## $ est_income <dbl> 77245.010, 1216.414, 44456.504, 38553.075, 4...
## $ hold_bal <dbl> 6.537963, 31.298468, 25.573535, 1.000000, 9....
## $ pref_cust_prob <dbl> 0.11743892, 0.65458819, 0.09804096, 0.546484...
## $ imp_cscore <int> 570, 675, 517, 528, 515, 761, 508, 683, 604,...
## $ RiskScore <dbl> 483.8262, 461.0786, 619.8412, 684.0436, 703....
## $ imp_crediteval <dbl> 24.23085, 25.96869, 23.30721, 23.78964, 21.7...
## $ axio_score <dbl> 0.72962641, 0.03685513, 0.75120118, 0.019777...
## $ card_offer <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ ScoreRatio <dbl> 0.8488179, 0.6830794, 1.1989191, 1.2955371, ...
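
The rows that were not sampled can be recovered with a negative index, which gives a complementary 30% set. A minimal sketch on a made-up ten-row data frame (toy, train and holdout are illustrative names, not objects from the course):

```r
set.seed(11)

toy <- data.frame(id = 1:10, value = (1:10)^2)   # hypothetical data

# floor() makes the sample size an explicit whole number
s <- sample(1:nrow(toy), floor(0.7 * nrow(toy)))

train   <- toy[s, ]     # the sampled 70%
holdout <- toy[-s, ]    # everything that was not sampled

nrow(train)
nrow(holdout)
length(intersect(train$id, holdout$id))   # no row appears in both sets
```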



Working with Date and Time Data in R

▪ Using ‘lubridate’ package to handle dates


▪ Dates usually arrive as character strings, while R stores date-time values internally as numbers
▪ Parsing these character strings into numbers that represent date-time is error-prone, and once
time zones are thrown into the mix the situation becomes pretty difficult to handle

install.packages('lubridate') #installing the required package, if needed
library(lubridate) #loading the required package
ymd('20170804') #Year-Month-Day
## [1] '2017-08-04'
mdy('08-04-2017') #Month-Day-Year
## [1] '2017-08-04'
dmy('04/08/2017') #Day-Month-Year
## [1] '2017-08-04'
▪ Include time components and time zones by simply adding the order of time components hours ('h'), minutes ('m')
and seconds ('s')
timeval <- ymd_hms('2017-08-04 12:00:25', tz = 'Pacific/Auckland')
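
For comparison, base R can parse the same three strings with as.Date, but the format has to be spelled out for each one. A minimal sketch without lubridate (d1, d2 and d3 are illustrative names):

```r
# The format string describes the layout of each input
d1 <- as.Date('20170804',   format = '%Y%m%d')     # year, month, day
d2 <- as.Date('08-04-2017', format = '%m-%d-%Y')   # month, day, year
d3 <- as.Date('04/08/2017', format = '%d/%m/%Y')   # day, month, year

format(d1, '%Y-%m-%d')   # print in ISO layout
d1 == d2                 # all three strings describe the same day
d2 == d3
```

The lubridate helpers save this step by inferring separators from the order of the components in the function name.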



Working with Date and Time Data in R

Extracting individual elements

second(timeval)

## [1] 25

wday(timeval)

## [1] 6

wday(timeval, label = TRUE)

## [1] Fri
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat



Check for Leap Year

Check for leap year


leap_year(2001)
## [1] FALSE
leap_year(2016)
## [1] TRUE

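You can also do arithmetic with dates. For example, to add a year to a date: dyears(1) adds a fixed-length duration, whereas years(1) adds a calendar period, so the two differ in a leap year such as 2016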
ymd(20160101) + dyears(1)
## [1] '2016-12-31'
ymd(20160101) + years(1)
## [1] '2017-01-01'
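
The leap-year check follows the Gregorian calendar rule: a year is a leap year if it is divisible by 4, except century years, which must also be divisible by 400. A minimal base R sketch of that rule (is_leap is a hypothetical helper written for illustration, not the lubridate implementation):

```r
# Vectorized leap-year test based on the Gregorian rule
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

is_leap(c(2001, 2016, 1900, 2000))
```

1900 is divisible by 100 but not by 400, so it is not a leap year, while 2000 is.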



Custom Date Formats

▪ The examples we saw earlier had well-formatted character strings as input for dates
▪ Dates can also have months written as full names or three-letter abbreviations, and other date
components can vary in form as well
▪ You can handle that by specifying your own formats with the function parse_date_time and these format builders:
• b : Abbreviated month name
• B : Full month name
• d : Day of the month as decimal number (01 to 31 or 0 to 31)
• H : Hours as decimal number (00 to 24 or 0 to 24) - 24 hrs format
• I : Hours as decimal number (01 to 12 or 0 to 12) - 12 hrs format
• j : Day of year as decimal number (001 to 366 or 1 to 366)
• m : Month as decimal number (01 to 12 or 1 to 12)
• M : Minute as decimal number (00 to 59 or 0 to 59)
• p : AM/PM indicator in the locale. Used in conjunction with I and not with H
• S : Second as decimal number (00 to 61 or 0 to 61), allowing for up to two leap-seconds (but POSIX-compliant
implementations will ignore leap seconds)
• OS : Fractional second
• y : Year without century (00 to 99 or 0 to 99)
• Y : Year with century



Custom Date Formats

parse_date_time('01-16-Jan','%d-%y-%b')
## [1] '2016-01-01 UTC'
parse_date_time('31-16-Oct 02:05 PM','%d-%y-%b %I:%M %p')
## [1] '2016-10-31 14:05:00 UTC'



Handling Heterogeneous Formats

x = c('17-04-16', '170205', '17-05 05', '29-07-14 15:12')
parse_date_time(x, c('%y%m%d', '%y%m%d %H%M'))
## [1] '2017-04-16 00:00:00 UTC' '2017-02-05 00:00:00 UTC'
## [3] '2017-05-05 00:00:00 UTC' '2029-07-14 15:12:00 UTC'



Thank You!

For queries, write to us at: care@edupristine.com
