R Complete


R Complete Note

Presenting Data in Charts and Tables


Bar Chart
# Function to create bar charts:
barplot()

Parameters for barplot() :

x — Vector or matrix containing numeric values used in bar chart

xlab — Label for x-axis

ylab — Label for y-axis

main — The title of the bar chart

names.arg — Vector of names appearing under each bar

col — Give colors to the bars

border — Give colors to the borders of the bars

density — (A single number or a vector) Gives the density of the shading
lines for the bars. The default is no shading

Example:

x = c(5, 7, 4, 8, 12, 16, 15)

barplot(
x,
main = "Number of customers visited the store",
xlab = "Days",
ylab = "Number of customers",

names.arg = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
col = rainbow(length(x))
)

Exercise
Create a bar plot for the following data using R.

Performance Level    Frequency
Good                 13
Above Average        12
Average              15
Poor                 10
Total                50

x = c(13, 12, 15, 10)


barplot(
x,
main = "Performance Level",
xlab = "Levels",
ylab = "Frequency",
names.arg = c("Good", "Above Average", "Average", "Poor"),
col = rainbow(length(x))
)

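Example with density:

# Re-create the customer data, since 'x' was overwritten in the exercise
x = c(5, 7, 4, 8, 12, 16, 15)

barplot(
x,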
main = "Number of customers visited the store",
xlab = "Days",
ylab = "Number of customers",
names.arg = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
col = rainbow(length(x)),
density = seq(10, 70, 10)
)

Tabular Method
Example:

Rating           Frequency    Relative Frequency    Percent Frequency
Poor             2            2/20 = 0.10           10%
Below Average    3            3/20 = 0.15           15%
Average          5            5/20 = 0.25           25%
Above Average    9            9/20 = 0.45           45%
Excellent        1            1/20 = 0.05           5%
Total            20           1.00                  100%

set = c(1, 3, 8, 4, 2, 3, 6, 5, 5, 8, 4, 2, 4, 1, 5)

# Create a frequency table


table(set)
ta <- table(set)

# Relative frequency
prop.table(ta)

# Cumulative Frequency
cumsum(ta)
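
The relative and percent frequencies in the rating table above can be reproduced in R. The following is a minimal sketch; the names freq, rel_freq, pct_freq and rating_table are illustrative and not part of the original notes.

```r
# Frequencies taken from the rating table above
freq <- c(Poor = 2, "Below Average" = 3, Average = 5,
          "Above Average" = 9, Excellent = 1)

rel_freq <- freq / sum(freq)   # relative frequency = frequency / total
pct_freq <- 100 * rel_freq     # percent frequency = relative frequency * 100

# Combine the three columns into one table (row names come from 'freq')
rating_table <- data.frame(
  Frequency = freq,
  Relative  = rel_freq,
  Percent   = pct_freq
)
rating_table
```

The relative frequencies always sum to 1, which is a quick check that the table is complete.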

Pie Chart
# Function to create pie charts:
pie()

Parameters for pie() :

labels — Descriptions to the slices

radius — Radius of the circle (value between −1 and +1)

main — The title of the pie chart

col — The color of the slice

clockwise — If set to TRUE , slices are drawn clockwise

💡 Slices are drawn counter-clockwise by default

Example:

x <- c(35, 28, 47, 63, 50)


pie(x)

Pie Chart with slice percentages as labels:

x <- c(35, 28, 47, 63, 50)


piepercent <- round(100 * x / sum(x), 1)

lbls <- paste(piepercent, "%", sep = "")
pie(
x,
labels = lbls,
main = "Pie chart with slice percentage",
col = rainbow(length(x)),
radius = 1
)

Pie chart with slice percentage along with characters as labels:

x <- c(35, 28, 47, 63, 50)


districts <- c("Colombo", "Kandy", "Jaffna", "Anuradhapura", "Batticaloa")
piepercent <- round(100 * x / sum(x), 1)
# Put a space between the district name and its percentage
lbls <- paste0(districts, " ", piepercent, "%")
pie(
x,
labels = lbls,
main = "Pie chart with slice percentage",
col = heat.colors(length(x)),

radius = 1
)

Measures of Central Tendency


Mean
datasets::CO2 # Print the built-in CO2 dataset
d <- CO2 # Assign the CO2 dataset to 'd'
u <- d$uptake # Assign the 'uptake' column of 'd' to 'u'

mean(u) # Find the mean of 'u'

Find mean values for each category in a column (similar to group by in SQL):

## Find mean values for each category in a column (similar to GROUP BY in SQL)
# Find the mean value of the 'uptake' variable for each category in the 'Plant' column
tapply(d$uptake, d$Plant, mean)

# Find the maximum value of the 'uptake' variable for each category in the 'Treatment' column
tapply(d$uptake, d$Treatment, max)

Find the mean value for a specific category in a column:

## Find the mean value of a specific category type in a column
# Find the mean value of the 'uptake' variable for the category 'chilled' in the 'Treatment' column
mean(d$uptake[d$Treatment=="chilled"])

Median
median(u) # Find the median of 'u'

Mode
## Find the mode -- Method 1

# Define function to find the mode


getmode <- function(v) {
  uniqv <- unique(v) # Get unique values of 'v'
  # Return the value that occurs most frequently
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Call the function to get the mode of 'u'


getmode(u)

💬 match(v, uniqv) returns a vector of the same length as v, where each
element is the index of the corresponding element of v in uniqv.

tabulate() counts the number of times each integer occurs in the input
vector, up to the maximum value in the input vector.

which.max() returns the index of the maximum value in the input vector.

So, uniqv[which.max(tabulate(match(v, uniqv)))] returns the value in uniqv that
corresponds to the maximum count in v, which is the mode of v.
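
To see what each step of getmode() produces, the pipeline can be run piece by piece. This is a small sketch on a made-up vector (the values in v are illustrative and not taken from the CO2 data).

```r
v <- c(3, 5, 3, 7, 5, 3)      # illustrative vector

uniqv  <- unique(v)           # 3 5 7
idx    <- match(v, uniqv)     # 1 2 1 3 2 1  (position of each value in uniqv)
counts <- tabulate(idx)       # 3 2 1        (how often each position occurs)
uniqv[which.max(counts)]      # 3            (value with the highest count)
```

Because which.max() returns only the first maximum, this method reports a single mode even when several values are tied.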

## Find the mode -- Method 2

# Create a frequency table of 'u' and assign it to 'y'


y <- table(u)

# Find the mode(s) of 'u'


names(y)[which(y==max(y))]

💬 The table() function counts the number of times each unique value
occurs in u.

max(y) finds the maximum frequency in y.

which(y == max(y)) returns the indices of y where the frequency is equal
to the maximum frequency.

names(y)[which(y == max(y))] returns the names (or labels) of y at these
indices. In other words, it finds the values of u that occur most
frequently, which is the mode(s) of u.

Measures of Dispersion
Range and Interquartile Range
# Find the range
Range = max(u) - min(u)
Range

# Find the Interquartile Range


IQR(u)

Quartiles

# Find Quartiles
quantile(u, 0.25) #First Quartile
quantile(u, 0.5) #Second Quartile
quantile(u, 0.75) #Third Quartile
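
The three quartiles can also be requested in one call, and the interquartile range is the difference between the third and first quartiles. A minimal sketch, assuming u is the CO2 uptake column used throughout these notes (the name q is illustrative):

```r
u <- datasets::CO2$uptake

# All three quartiles in one call
q <- quantile(u, probs = c(0.25, 0.5, 0.75))
q

# Third quartile minus first quartile gives the interquartile range
iqr_manual <- unname(q["75%"] - q["25%"])
all.equal(iqr_manual, IQR(u))   # TRUE
```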

Five number summary

# Find the five-number summary (summary() also reports the mean)


summary(u)

Find the five-number summary for a specific category in a column:

## Find the five-number summary of a specific category type in a column
# Find the five-number summary of the 'uptake' variable for the category 'chilled' in the 'Treatment' column
summary(d$uptake[d$Treatment=="chilled"])

Exercise
Find the summary statistics for uptake where plant type is Qn1 and uptake value
is more than 20

# Find the summary statistics for uptake where plant type is "Qn1" and uptake value is more than 20
summary(d$uptake[d$Plant=="Qn1" & d$uptake > 20])

💡 The summary() function automatically ignores missing values, while
other functions such as mean() do not.
To ignore the missing values in other functions you have to
request it explicitly with na.rm = TRUE

Example:

num = c(10, 20, 33, 44, NA, 88, 55)


mean(num) # This will not ignore 'NA', so the result is NA
mean(num, na.rm = TRUE) # This will ignore 'NA'

Deciles

# Find Deciles
quantile(u, 0.4) #Fourth Decile
quantile(u, 0.7) #Seventh Decile

Percentiles

# Find Percentiles
quantile(u, 0.98) # 98th Percentile
quantile(u, 0.37) # 37th Percentile

Variance
# Find the Sample Variance
var(u)

# Find the Standard Deviation


sd(u)
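
As a check on what var() and sd() compute, the sample variance can be worked out from its definition, the sum of squared deviations from the mean divided by n - 1. A minimal sketch, assuming u is the CO2 uptake column (manual_var and manual_sd are illustrative names):

```r
u <- datasets::CO2$uptake

n <- length(u)
manual_var <- sum((u - mean(u))^2) / (n - 1)  # sample variance (divides by n - 1)
manual_sd  <- sqrt(manual_var)                # standard deviation = sqrt(variance)

all.equal(manual_var, var(u))   # TRUE
all.equal(manual_sd, sd(u))     # TRUE
```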

Box-Plot (Box & Whisker Plot)


# Function to create box plots:
boxplot()

Parameters for boxplot() :

[y-axis]~[x-axis] — The axes of the graph

data — The dataset

main — The title of the box plot

xlab — Label for x-axis

ylab — Label for y-axis

col — Give colors to the boxes

border — Give colors to the borders of the boxes

notch — Add a notch to the box at the Median

varwidth — If set to FALSE , all boxes will have the same width regardless of
the size of the group

horizontal — If set to TRUE , the boxes will be horizontal

Examples:

datasets::ToothGrowth
TG<-ToothGrowth

boxplot(
TG$len,
main="Box plot of tooth length",
ylab="Tooth length",
col="hotpink",
border="lightpink",
notch = FALSE,
varwidth = FALSE,
horizontal = TRUE
)

datasets::ToothGrowth
TG<-ToothGrowth

boxplot(

len~supp,
data = TG,
main = "Tooth growth with supplement types",
xlab = "Supplement type",
ylab = "Tooth length",
col = c("hotpink", "lightpink")
)

Tally Table
# Tally Table
datasets::iris
i <- iris
table(i$Species)

Contingency Table
# Contingency Table
table(d$Plant, d$Type)
table(d$Plant, d$Treatment)


Binomial Distribution
dbinom
For binomial distributions, dbinom is used in R to find probabilities of the form P(x = k).

# dbinom Help
help(dbinom)

Example:
Find P(x = 1) when n = 5 and θ = 0.1.

x = 1
n = 5
θ = 0.1

$P(x = 1) = \binom{5}{1} (0.1)^{1} (0.9)^{4} = 5 \times 0.1 \times 0.6561 = 0.32805$

dbinom(x = 1, size = 5, prob = 0.1)

Find P(x ≤ 3) when n = 5 and θ = 0.1.

sum(dbinom(x = 0:3, size = 5, prob = 0.1))

Exercise
A customer receiving service from a customer care center can be classified as
good service or bad service. The probability of getting good service is 0.4.

1. What is the probability that he/she gets good service at least 2 times out of
10 tries?

n = 10
x ≥ 2
θ = 0.4

# Method 1: add up P(x = 2), ..., P(x = 10)
sum(dbinom(x = 2:10, size = 10, prob = 0.4))

# Method 2: take the complement of P(x ≤ 1)
1 - sum(dbinom(x = 0:1, size = 10, prob = 0.4))

2. What is the probability that he/she gets bad service between 3 and 7 times out of
10 tries?

n = 10
3 < x < 7
θ = 0.6 (probability of bad service = 1 − 0.4)

sum(dbinom(x = 4:6, size = 10, prob = 0.6))

pbinom
pbinom is the cumulative distribution function: pbinom(q, size, prob) gives P(x ≤ q).

# pbinom Help
help(pbinom)

Examples:

Find P(x ≤ 3) when n = 5 and θ = 0.1.

pbinom(3, size = 5, prob = 0.1)

The same can be done with dbinom as:

sum(dbinom(x = 0:3, size = 5, prob = 0.1))
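
The customer care exercise from the dbinom section can also be answered with pbinom. This is a sketch of one way to do it; the variable names p_at_least_2 and p_between are illustrative.

```r
# At least 2 good services out of 10 (prob = 0.4):
# P(x >= 2) = 1 - P(x <= 1)
p_at_least_2 <- 1 - pbinom(1, size = 10, prob = 0.4)

# The same upper-tail probability using lower.tail = FALSE, i.e. P(x > 1)
p_at_least_2_alt <- pbinom(1, size = 10, prob = 0.4, lower.tail = FALSE)

# Bad service strictly between 3 and 7 times (prob = 0.6):
# P(4 <= x <= 6) = P(x <= 6) - P(x <= 3)
p_between <- pbinom(6, size = 10, prob = 0.6) - pbinom(3, size = 10, prob = 0.6)

p_at_least_2
p_between
```

Both results should agree with the sums of dbinom values used in the exercise.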

Poisson Distribution
dpois
dpois is used for Poisson distributions in R

# dpois Help
help(dpois)

Examples:
Find P(x = 0) when λ = 0.03.

dpois(x = 0, lambda = 0.03)

Find P(x ≥ 1) when λ = 0.03.

1 - dpois(x = 0, lambda = 0.03)

ppois
ppois is the cumulative distribution function: ppois(q, lambda) gives P(x ≤ q).

# ppois Help
help(ppois)

Example:
Find the value of P(x = 0) + P(x = 1) + P(x = 2) when λ = 2.

ppois(2, lambda = 2)

The same can be done with dpois as:

# Method 1
p1 <- dpois(x = 0, lambda = 2)
p2 <- dpois(x = 1, lambda = 2)
p3 <- dpois(x = 2, lambda = 2)
p <- p1 + p2 + p3
p

# Method 2
sum(dpois(x=0:2, lambda = 2))

Exercise
Suppose it has been observed that, on average, 180 cars per hour pass a
specified point on a particular road in the morning rush hour. Due to impending
road works, it is estimated that congestion will occur closer to the city center if
more than 5 cars pass the point in any one minute. What is the probability of
congestion occurring?

$\lambda = \frac{180}{60} = 3$ cars per minute
x > 5

1 - ppois(5, lambda = 3)
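
The same upper-tail probability can be obtained in two other ways, which makes a useful cross-check. A minimal sketch (the names lambda, p_congestion and p_check are illustrative):

```r
lambda <- 180 / 60   # average number of cars per minute

# P(x > 5) directly from the upper tail
p_congestion <- ppois(5, lambda = lambda, lower.tail = FALSE)

# Cross-check: 1 minus the sum of P(x = 0), ..., P(x = 5)
p_check <- 1 - sum(dpois(0:5, lambda = lambda))

p_congestion
all.equal(p_congestion, p_check)   # TRUE
```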

Exercise
A manufacturer of balloons produces 40% that are oval and 60% that are
round. Packets of 20 balloons may be assumed to contain random samples of
balloons. Determine the probability that such a packet contains:

1. an equal number of oval balloons and round balloons

P(oval) = 0.4
P(round) = 0.6
$P(x = 10) = \binom{20}{10} (0.4)^{10} (0.6)^{10}$

dbinom(x = 10, size = 20, prob = 0.4)

2. fewer oval balloons than round balloons

Fewer oval than round means at most 9 of the 20 balloons are oval: P(x ≤ 9)

pbinom(9, size = 20, prob = 0.4)

A customer selects packets of 20 balloons at random from a large consignment
until she finds a packet with exactly 12 round balloons.

3. Give a reason why a binomial distribution is not an appropriate model for
the number of packets selected.

The number of trials is not fixed, even though the trials are independent
events.

Continuous Uniform Distribution


dunif
dunif(x, min, max) is used to find the PDF at x in R.

# dunif Help
?dunif

Example:
Find the PDF of a uniform distribution between 0 and 5 at the point x = 2.

dunif(2, min = 0, max = 5)

Cumulative Distribution Function (CDF)

💡 Cumulative distribution function (CDF) for a uniform distribution gives
the probability that the random variable X is less than or equal to a
certain value x.

punif
punif is the cumulative distribution function in R.

punif(q, min, max) is used to find the CDF value P(x ≤ q)


Example:
Find the probability that a random variable from a uniform distribution between
0 and 5 is less than or equal to 3.

punif(3, min = 0, max = 5)

qunif
qunif is a quantile function in R.

qunif(p, min, max) is used to find the quantile defined by p

Example:

What is the 90th percentile of a uniform distribution between 0 and 5?

qunif(0.90, min = 0, max = 5)
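
For a uniform distribution between a and b, the three functions have simple closed forms: the density is 1 / (b - a), the CDF at q is (q - a) / (b - a), and the quantile for p is a + p(b - a). The sketch below checks the examples above against these formulas (a and b are illustrative names for min and max).

```r
a <- 0   # min
b <- 5   # max

dunif(2, min = a, max = b)      # 0.2, same as 1 / (b - a)
punif(3, min = a, max = b)      # 0.6, same as (3 - a) / (b - a)
qunif(0.90, min = a, max = b)   # 4.5, same as a + 0.90 * (b - a)

# qunif is the inverse of punif: this returns 3 again
qunif(punif(3, min = a, max = b), min = a, max = b)
```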

The Normal Distribution / Gaussian Distribution

pnorm
pnorm is used for Normal Distribution calculations in R: pnorm(q, mean, sd) gives the cumulative probability P(x ≤ q).

# pnorm Help
?pnorm

Example:
Find P(x < 18) when the mean is 15 and the standard deviation is 2.

$P(x < 18) = P\left(\frac{x - \mu}{\sigma} < \frac{18 - 15}{2}\right) = P(z < 1.5) = 0.9332$

pnorm(q = 18, mean = 15, sd = 2)

pnorm(q = 18, mean = 15, sd = 2, lower.tail = TRUE)

💬 The lower.tail argument specifies whether the probability is calculated for
the lower tail (left-hand side, P(x ≤ q)) or the upper tail (right-hand side,
P(x > q)) of the normal distribution. The default is TRUE.

Example:
Find P(x > 18) when the mean is 15 and the standard deviation is 2.

$P(x > 18) = 1 - P(x < 18) = 1 - 0.9331928 = 0.0668072$

pnorm(q = 18, mean = 15, sd = 2, lower.tail = FALSE)

Example:
Find P(970000 < x < 1060000) when the mean is 1000000 and the standard
deviation is 30000.

# P(x < 1060000)
P1 <- pnorm(q = 1060000, mean = 1000000, sd = 30000, lower.tail = TRUE)
# P(x < 970000)
P2 <- pnorm(q = 970000, mean = 1000000, sd = 30000, lower.tail = TRUE)

# P(970000 < x < 1060000) = P(x < 1060000) - P(x < 970000)
P <- P1 - P2
P
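
The interval probability P(970000 < x < 1060000) can also be found by standardising both limits to z-scores and using the standard normal distribution. A minimal sketch (mu, sigma, z_low, z_high and p_interval are illustrative names):

```r
mu    <- 1000000
sigma <- 30000

z_low  <- (970000 - mu) / sigma    # -1
z_high <- (1060000 - mu) / sigma   #  2

# P(-1 < z < 2) on the standard normal scale (mean 0, sd 1)
p_interval <- pnorm(z_high) - pnorm(z_low)
p_interval
```

The result is a probability between 0 and 1, and it matches the upper limit's cumulative probability minus the lower limit's on the original scale.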

Hypothesis Testing - Examples


H0 : μ ≤ 80

H1 : μ > 80

Sample Mean = 83
Standard Deviation = 8
Sample Size = 25

# Test Statistic Value


Z1 = (83 - 80) / (8 / sqrt(25))
Z1

# Table_value for 95% upper tail test


Table_value <- qnorm(0.95)
Table_value

if (Table_value < Z1) {


print("Reject the H0")
}

H0 : μ = 170 (this specifies a single value for the parameter of interest)

H1 : μ > 170 (this is what we want to determine)


sd = 65
mu_0 = 170

n = 400
x_bar = 178

# Test Statistic Value


z1 <- (x_bar - mu_0) / (sd / sqrt(n))
z1

# Table_value for 95% upper tail test


Table_value <- qnorm(0.95)
Table_value

if (Table_value < z1) {


print("Reject the H0")
}

The owner of a shop wants to examine the shop's annual income rate. He
suspects that, compared to previous years, the annual income rate has declined
to less than 5%, and wants to test this at the 5% significance level. The
standard deviation of the annual income rate over the last 16 years is 0.1%.
The population mean is 5%, and the sample mean is 4.962%.

H0 : μ = 5 (this specifies a single value for the parameter of interest)

H1 : μ < 5 (this is what we want to determine)


sd = 0.1
mu_0 = 5
n = 16
x_bar = 4.962

# Test Statistic Value


z1 <- (x_bar - mu_0) / (sd / sqrt(n))
z1

# Table_value for 5% lower tail test


Table_value <- round(qnorm(0.05), 2)
Table_value

if (Table_value > z1) {

print("Reject the H0")
} else {
print("Failed to reject H0")
}
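
The three examples above follow the same steps: compute the test statistic, look up the critical value, and compare. Those steps could be wrapped in a small helper. The function z_test() below is a hypothetical illustration, not a built-in R function, and it does not round the critical value as the last example does.

```r
# Hypothetical helper for a one-tailed z-test (not part of base R)
z_test <- function(x_bar, mu_0, sd, n, alpha = 0.05,
                   tail = c("upper", "lower")) {
  tail <- match.arg(tail)
  z <- (x_bar - mu_0) / (sd / sqrt(n))   # test statistic
  if (tail == "upper") {
    critical <- qnorm(1 - alpha)         # e.g. qnorm(0.95) for a 5% upper-tail test
    reject <- z > critical
  } else {
    critical <- qnorm(alpha)             # e.g. qnorm(0.05) for a 5% lower-tail test
    reject <- z < critical
  }
  list(z = z, critical = critical, reject = reject)
}

z_test(83, 80, 8, 25)                       # first example: reject is TRUE
z_test(178, 170, 65, 400)                   # second example: reject is TRUE
z_test(4.962, 5, 0.1, 16, tail = "lower")   # third example: reject is FALSE
```

Returning the statistic and critical value alongside the decision keeps the comparison visible, as in the step-by-step versions above.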
