R-Unit 5


R Programming (20MCA2PERR)

UNIT – 5

CORRELATION AND REGRESSION ANALYSIS
Correlation analysis is used to investigate the association between two or more
variables.
Correlation is one of the most common statistics.
Using one single value, it describes the "degree of relationship" between two variables.
Correlation ranges from -1 to +1.
Negative values of correlation indicate that as one variable increases, the other variable decreases.
Positive values of correlation indicate that as one variable increases, the other variable increases as well.
Correlation does not specify which variable is the dependent variable and which is the independent variable.

Pearson Correlation
The most commonly used type of correlation is the Pearson correlation, named after Karl Pearson, who introduced this statistic around the turn of the 20th century. Pearson's r measures the linear relationship between two variables, say X and Y. A correlation of 1 indicates the data points perfectly lie on a line for which Y increases as X increases. A value of -1 also implies the data points lie on a line; however, Y decreases as X increases. The formula for r is

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
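As a quick illustrative sketch (not part of the original slide), r can be computed directly from this formula and checked against R's built-in cor():

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
# numerator: sum of cross-products of deviations from the means;
# denominator: square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual     # 0.7745967
cor(x, y)    # same value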

The Pearson correlation has two assumptions:


• The two variables are normally distributed.
• The relationship between the two variables is linear. We can test this assumption by examining the scatterplot between the two variables.

R method to find correlation coefficient

Syntax:
cor(x, y, method = c("pearson", "kendall", "spearman"))

Note: If the data contain missing values, the following R code can be used to handle
missing values by case-wise deletion.

cor(x, y, method = "pearson", use = "complete.obs")
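For example, a minimal sketch (not from the original slides) with made-up vectors containing a missing value:

x <- c(2.1, 4.5, NA, 7.3, 8.0)
y <- c(1.9, 4.2, 6.0, 7.1, 6.8)
cor(x, y)                                            # returns NA because of the NA in x
cor(x, y, method = "pearson", use = "complete.obs")  # drops the incomplete pair first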

Example:

Suppose a training program was conducted to improve the participants’ knowledge of ICT. Data were
collected from a selected sample of 10 individuals before and after the ICT training program. Find the
association/correlation between the paired samples

Code:
library(stats)
before <- c(12.2, 14.6, 13.4, 11.2, 12.7, 10.4, 15.8, 13.9, 9.5, 14.2)
after <- c(13.5, 15.2, 13.6, 12.8, 13.7, 11.3, 16.5, 13.4, 8.7, 14.6)
cor.test(x = before, y = after, method = "pearson", conf.level = 0.95)

First, create before and after as objects containing the scores of ICT training.

before <- c(12.2, 14.6, 13.4, 11.2, 12.7, 10.4, 15.8, 13.9, 9.5, 14.2)
after <- c(13.5, 15.2, 13.6, 12.8, 13.7, 11.3, 16.5, 13.4, 8.7, 14.6)
data <- data.frame(subject = rep(c(1:10), 2),
time = rep(c("before", "after"), each = 10),
score = c(before, after))

You may be interested in the summary statistics of the difference in scores before and after ICT training. Subtract the ICT score before training from the one after training to get the score differences. Applying the summary() function to the diff object then gives summary statistics for the differences.

diff = after - before


summary(diff)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.800 0.250 0.650 0.540 0.975 1.600

Correlation using Scatter Plot

The basic syntax for creating a scatterplot in R is −


plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used −


• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label of the horizontal axis.
• ylab is the label of the vertical axis.
• xlim is the limits of the values of x used for plotting.
• ylim is the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the plot.

Correlation using Scatter Plot

Write a program to import the built-in dataset of R "mtcars" having the following fields:
mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
Write an R program to perform the following:
i) Plot a scatter plot of wt~mpg with relevant title, x-axis & y-axis labels and a colour of your choice.
ii) Display the correlation value inside the plot
iii) Plot a correlation plot by considering all the fields of mtcars.

input <- mtcars[,c('wt','mpg')]


plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     main = "Weight vs Mileage",
     col = "blue")

text(paste("Correlation:", round(cor(input$wt, input$mpg), 2)), x = 4, y = 30)

Scatterplot matrix
When we have more than two variables and we want to find the correlation between one variable and each of the remaining ones, we use a scatterplot matrix. The pairs() function creates matrices of scatterplots.
The basic syntax for creating scatterplot matrices in R is:
pairs(formula, data)

Example:

# Scatterplot matrix of 4 variables: each variable is plotted against
# the other 3, giving 12 off-diagonal panels.
pairs(~wt+mpg+disp+cyl,data = mtcars,
main = "Scatterplot Matrix",col="red")

Correlation Plot
library(corrplot)
#need to install corrplot
corrplot(cor(mtcars), method="circle")

Correlation Plot
library(corrplot)
#need to install corrplot
corrplot(cor(mtcars), method="number")

library(corrplot)
#need to install corrplot
corrplot(cor(mtcars,method="kendall"), method="pie")

library(corrplot)
#need to install corrplot
corrplot(cor(mtcars,method="kendall"), method="number")

library(corrplot)
#need to install corrplot
input <- mtcars[,c('wt','mpg')]
corrplot(cor(input,method="kendall"), method="square")

library(corrplot)
#need to install corrplot
corrplot(cor(mtcars,method="kendall"), method="pie",type = "upper")

Plot the following for the dataset "mtcars"

corrplot.mixed(cor(mtcars),
upper = "square",
lower = "number")
A company wants to evaluate the cause-effect relationship between several factors (foam, scent, color,
and residue) on the perceived quality of shampoo.

Write an R program to create and display the data frame with the above data. Use the data frame to
create the following plots.
i) Plot a scatter plot matrix. Set the colour of plot as blue and provide a suitable title.
ii) Plot a correlation plot using Kendall correlation method. Show the correlation values of fields in the
lower triangular portion of the plot and piechart on the upper triangular portion of plot.

library(corrplot)
Data_Frame <- data.frame(
Foam=c(6.3,4.4,3.9,5.1,5.6,4.6,4.8,6.5,8.7),
Scent=c(5.3,4.9,5.3,4.2,5.1,5.6,4.6,4.8,4.5),
Colour=c(4.8,3.5,4.8,3.1,5.5,5.1,4.8,4.3,3.9),
Residue=c(3.1,3.9,4.7,3.6,5.1,4.1,3.3,5.2,2.9),
Quality=c(91,87,82,83,83,84,90,84,97))
print(Data_Frame)
pairs(~Foam+Scent+Colour+Residue+Quality,
data = Data_Frame,
main = "Scatterplot Matrix",
col="blue")
#need to install corrplot
corrplot.mixed(cor(Data_Frame,
method="kendall"),
upper = "pie",
lower = "number" )
A soft drink bottler is trying to predict delivery times for a driver. He has collected data
on the delivery time, the number of cartons delivered and the distance the driver walked
and the same is shown below.

Delivery Time 16.68 11.5 12.03 14.88 13.75 18.11 8 17.83 79.24 21.5
X1 (Cartons) 7 3 3 4 6 7 2 7 30 5
X2 (Distance) 560 220 340 80 150 330 110 210 1460 605

Write a program in R to create Scatter plot matrix and Correlation matrix for the above
data.

library(corrplot)
Data_Frame <- data.frame(
delivery_time=c(16.68,11.5,12.03,14.88,13.75,18.11,8,17.83,79.24,21.5),
X1=c(7,3,3,4,6,7,2,7,30,5),
X2=c(560,220,340,80,150,330,110,210,1460,605))
print(Data_Frame)

pairs(~delivery_time+X1+X2,
data = Data_Frame,
main = "Scatterplot Matrix",
col="blue")

#need to install corrplot


corrplot(cor(Data_Frame,
method="spearman"),
method = "pie")
Regression analysis
Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other variable is called the response variable, whose value is derived from the predictor variable.

Mathematically a linear relationship represents a straight line when plotted as a graph. A non-linear
relationship where the exponent of any variable is not equal to 1 creates a curve.

The general mathematical equation for a linear regression is −


y = ax + b

Following is the description of the parameters used −


• y is the response variable.
• x is the predictor variable.
• a (slope) and b (y-intercept) are constants called the coefficients.
• When x increases by 1, y increases by a.
• b is the predicted value of y when x is 0.

lm() Function
This function creates the relationship model between the predictor and the response
variable.

Syntax
The basic syntax for the lm() function in linear regression is:
lm(formula,data)

Following is the description of the parameters used −


• formula is a symbolic description of the relationship between x and y.
• data is the data frame containing the variables used in the formula.

df = data.frame(x=c(1, 3, 3, 4, 5, 5, 6, 8, 9, 12),
y=c(12, 14, 14, 13, 17, 19, 22, 26, 24, 22))

#fit linear regression model using 'x' as predictor and 'y' as response variable
model <- lm(y ~ x, data=df)

#plot the data, then add the fitted regression line
plot(y ~ x, data=df)
abline(model)
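Once fitted, the model object can be inspected and used for prediction. A brief sketch (an addition, reusing the df and model objects above):

coef(model)                          # intercept and slope of the fitted line
predict(model, data.frame(x = 10))   # predicted y when x = 10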

Create a Scatterplot with a Regression Line

#fit regression model/linear model


fit <- lm(mpg ~ wt, data=mtcars)

#create scatterplot
plot(mpg ~ wt, data=mtcars)

#add fitted regression line to scatterplot


abline(fit)

#Change the colour of the regression line
abline(fit, col="red")

Let height be a variable that describes the heights (in cm) of ten people and bodymass be a variable that describes the masses (in kg) of the same ten people.

height 176 154 138 196 132 176 181 169 150 175
bodymass 82 49 53 112 47 69 77 71 62 78

For the above data:


•Perform linear regression and display the result.
•Create a Regression plot with the following specifications.
•Display the title of the graph as “Height Vs. Bodymass”
•Set the color of the plot as blue
•Display the x-axis label as “Height(cm)”
•Display the y-axis label as “Bodymass(kg)”
•Display the regression equation inside the graph

height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
plot(bodymass, height, col = "blue",
     main = "HEIGHT Vs. BODY MASS",
     xlab = "BODY MASS (kg)",
     ylab = "HEIGHT (cm)")
abline(lm(height ~ bodymass))
coeff <- round(coef(lm(height ~ bodymass)), 2)
print(lm(height ~ bodymass))
text(90, 140, paste("Y = ", coeff[1], "+", coeff[2], "x"))

Output:

Call:
lm(formula = height ~ bodymass)

Coefficients:
(Intercept)     bodymass
    98.0054       0.9528
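As an illustrative addition (not in the original slide), the fitted equation can also be used for prediction:

fit <- lm(height ~ bodymass)
predict(fit, data.frame(bodymass = 70))   # 98.0054 + 0.9528 * 70, about 164.7 cm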

The years of Experience and Salary details of a sample of 10 Employees are shown
below.
For the above data:
• Perform linear regression and display the result
• Create a Regression Plot with the following specifications
• Display the title of the graph as “Years of Experience Vs. Salary”
• Set the color of the plot as green
• Display the x-axis label as “Years of Experience”
• Display the y-axis label as “Salary”
• Display the regression equation inside the plot.

Employee 1 2 3 4 5 6 7 8 9 10
Years of Experience 1.1 1.3 1.5 2.0 2.2 2.9 3.0 3.2 3.2 3.7
Salary 39343 46205 37731 43525 39891 56642 60150 54445 64445 57189

YoE<- c(1.1,1.3,1.5,2.0,2.2,2.9,3.0,3.2,3.2,3.7)
Salary <- c(39343,46205,37731,43525,39891,56642,60150,54445,64445,57189)

plot(YoE, Salary,
col = "green",
main = "Years of Experience Vs. Salary",
xlab = "Years of Experience",
ylab = "Salary")

abline(lm(Salary~YoE))
coeff<-round(coef(lm(Salary~YoE)),2)
print(lm(Salary~YoE))
text(2,60000, paste("Y = ", coeff[1], "+", coeff[2], "x"))

The local ice cream shop keeps track of how much ice cream they sell versus the noon temperature on that day.

Ice Cream Sales vs Temperature

Temperature °C: 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1
Ice Cream Sales: $215, $325, $185, $332, $406, $522, $412, $614, $544, $421

For the given data:
• Perform linear regression and display the result
• Create a Regression Plot with the following specifications
• Provide relevant title to the plot
• Set the color of the plot as green
• Provide relevant x-axis label & y-axis label
• Display the regression equation inside the plot.

ICS<- c(215,325,185,332,406,522,412,614,544,421)
Temperature <- c(14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1)

plot(Temperature, ICS,
col = "green",
main = "Temperature Vs. Ice Cream Sales",
xlab = "Temperature",
ylab = "Ice Cream Sales")

abline(lm(ICS~Temperature))
coeff<-round(coef(lm(ICS~Temperature)),2)
print(lm(ICS~Temperature))
text(18,200, paste("Y = ", coeff[1], "+", coeff[2], "x"))

An organization is interested in knowing the relationship between the monthly e-commerce sales and the online advertising costs (in dollars). The survey results for 7 online stores for last year are shown below.
Monthly E-Commerce Sales (in 1000s): 368, 340, 665, 954, 331, 556, 376
Online Advertising Cost (in 1000s): 1.7, 1.5, 2.8, 5, 1.3, 2.2, 1.3

For the above data:


•Perform linear regression and display the result.
•Create a Regression plot with the following specifications.
•Display the title of the graph as “E-commerce Sales Vs. Advertising costs”
•Set the color of the plot as red
•Display the x-axis label as “Advertising costs”
•Display the y-axis label as “E-commerce sales”
•Display the regression equation inside the graph
ecommercesales <- c(368,340,665,954,331,556,376)
advertisingcosts <- c(1.7,1.5,2.8,5,1.3,2.2,1.3)

plot(advertisingcosts, ecommercesales, col = "red",
     main = "E-commerce Sales Vs. Advertising costs",
     xlab = "Advertising costs",
     ylab = "E-commerce sales")

abline(lm(ecommercesales ~ advertisingcosts))
coeff <- round(coef(lm(ecommercesales ~ advertisingcosts)), 2)
print(lm(ecommercesales ~ advertisingcosts))
text(4, 500, paste("Y = ", coeff[1], "+", coeff[2], "x"))

Classification using Logistic Regression

Logistic Regression is a classification algorithm which is used when we want to predict a


categorical variable (Yes/No, Pass/Fail) based on a set of independent variable(s).
Logistic regression is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature.
Logistic regression is also known as binomial logistic regression.
• Logistic Regression is used when the dependent variable(target) is categorical. For
example:
• To predict whether an email is a spam (1) or not spam (0)
• Whether the tumor is malignant (1) or not (0)

Logistic Regression

Syntax:
logistic_model <- glm( formula, family, dataframe )
Parameter:
•formula: determines the symbolic description of the model to be fitted.
•family: determines the description of the error distribution and link function to be
used in the model.
•dataframe: determines the data frame to be used for fitting purposes

The in-built data set "mtcars" describes different models of a car with their various engine specifications.
In "mtcars" data set, the transmission mode (automatic or manual) is described by the column am which
is a binary value (0 or 1). Create a logistic regression model between the columns "am" and 3 other
columns - hp, wt and cyl and display the summary.

input <- mtcars[,c("am","cyl","hp","wt")]

am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

print(summary(am.data))
plot(am.data)
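A short sketch (added here, not in the original slide) showing how the fitted model can be used for classification: predicted probabilities above 0.5 are labelled as manual transmission (am = 1):

probs <- predict(am.data, type = "response")   # fitted probabilities of am = 1
pred <- ifelse(probs > 0.5, 1, 0)              # classify at the 0.5 threshold
table(Predicted = pred, Actual = input$am)     # confusion matrix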

Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
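Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A brief added sketch:

exp(coef(am.data))   # e.g. each unit increase in wt multiplies the odds of am = 1 by exp(-9.149)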

Create a data frame with the following data.
Perform logistic regression for the dataset that shows whether or not college basketball players got drafted into the
NBA (draft: 0 = no, 1 = yes) based on their average points, rebounds, and assists in the previous season.

Draft 0 1 0 1 1 1 1 0 0 1
AvgPts 12 13 13 12 14 14 17 17 21 21
Rebounds 3 4 4 9 4 4 2 6 5 9
Assists 6 4 6 9 5 4 2 5 7 3

input <- data.frame(
  Draft = c(0,1,0,1,1,1,1,0,0,1),
  AvgPts = c(12,13,13,12,14,14,17,17,21,21),
  Rebounds = c(3,4,4,9,4,4,2,6,5,9),
  Assists = c(6,4,6,9,5,4,2,5,7,3))

Draft.data = glm(formula = Draft ~ AvgPts + Rebounds +Assists, data = input, family = binomial)

print(summary(Draft.data))
plot(Draft.data)

Introduction to Statistical Hypothesis Testing in R
A statistical hypothesis is an assumption made by the researcher about the data of the population collected
for any experiment. It is not mandatory for this assumption to be true every time. Hypothesis testing, in a
way, is a formal process of validating the hypothesis made by the researcher.

Ideally, validating a hypothesis would take the entire population into account. However, this is not practically possible. Thus, to validate a hypothesis, random samples are drawn from the population. On the basis of the result of testing over the sample data, the hypothesis is either accepted or rejected.

The hypotheses involved in statistical hypothesis testing can be categorized into two types, as below:

Null Hypothesis – Hypothesis testing is carried out in order to test the validity of a claim or assumption made about the larger population. This claim that is involved in the trial is known as the Null Hypothesis. The null hypothesis is denoted by H0.
Alternative Hypothesis – An alternative hypothesis would be considered valid if the null hypothesis is fallacious. The evidence present in the trial is basically the data and the statistical computations that accompany it. The alternative hypothesis is denoted by H1 or Ha.

Hypothesis Testing in R
Statisticians use hypothesis testing to formally check whether the hypothesis is accepted or rejected.
Hypothesis testing is conducted in the following manner:
State the Hypotheses – Stating the null and alternative hypotheses.
Formulate an Analysis Plan – The formulation of an analysis plan is a crucial step in this stage.
Analyze Sample Data – Calculation and interpretation of the test statistic, as described in the analysis plan.
Interpret Results – Application of the decision rule described in the analysis plan.

Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, or in other words, what the data say about the population. The p-value ranges between 0 and 1. It can be interpreted in the following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.

Two-proportions z-test
The two-proportions z-test is used to compare two observed proportions

Example:
We have two groups of individuals:

Group A with lung cancer: n = 500


Group B, healthy individuals: n = 500
The number of smokers in each group is as follows:

Group A with lung cancer: n = 500, 490 smokers, pA = 490/500 = 98%
Group B, healthy individuals: n = 500, 400 smokers, pB = 400/500 = 80%

In this setting:

The overall proportion of smokers is p = (490 + 400)/(500 + 500) = 89%
The overall proportion of non-smokers is q = 1 − p = 11%

We want to know whether the proportions of smokers are the same in the two groups of individuals.
1. whether the observed proportion of smokers in group A (pA) is equal to the observed proportion of smokers in group B (pB)?
2. whether the observed proportion of smokers in group A (pA) is less than the observed proportion of smokers in group B (pB)?
3. whether the observed proportion of smokers in group A (pA) is greater than the observed proportion of smokers in group B (pB)?

In statistics, we can define the corresponding null hypotheses (H0) as follows:

1. H0: pA = pB (two-tailed test)
2. H0: pA ≤ pB (one-tailed test)
3. H0: pA ≥ pB (one-tailed test)

The corresponding alternative hypotheses (Ha) are as follows:

1. Ha: pA ≠ pB (different) (two-tailed test)
2. Ha: pA > pB (greater) (one-tailed test)
3. Ha: pA < pB (less) (one-tailed test)

Formula of the test statistic
Case of large sample sizes
The test statistic (also known as the z-test) can be calculated as follows:

z = (pA − pB) / √( p · q · (1/nA + 1/nB) )

where,
where,

pA is the proportion observed in group A with size nA


pB is the proportion observed in group B with size nB
p and q are the overall proportions

• if |z| < 1.96, then the difference is not significant at 5%
• if |z| ≥ 1.96, then the difference is significant at 5%
• The significance level (p-value) corresponding to the z-statistic can be read from the z-table.

Note that the formula of the z-statistic is valid only when the sample size (n) is large enough: nAp, nAq, nBp and nBq should all be ≥ 5.
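A minimal sketch (added here) applying this formula to the smokers example that follows:

pA <- 490/500; pB <- 400/500              # observed proportions in groups A and B
p <- (490 + 400)/(500 + 500)              # overall proportion of smokers
q <- 1 - p
z <- (pA - pB) / sqrt(p * q * (1/500 + 1/500))
z                                         # about 9.1, well above 1.96, so significant at 5%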
Compare two-proportions z-test in R

R function: prop.test()
x: a vector of counts of successes
n: a vector of counts of trials
alternative: a character string specifying the alternative hypothesis
correct: a logical indicating whether Yates’ continuity correction should be applied where possible

We want to know whether the proportions of smokers are the same in the two groups of individuals.
res <- prop.test(x = c(490, 400), n = c(500, 500))
# Printing the results
res
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
The function returns:

the value of Pearson’s chi-squared test statistic
a p-value
a 95% confidence interval
an estimated probability of success (the proportion of smokers in the two groups)
Note that:

If you want to test whether the observed proportion of smokers in group A (pA) is less
than the observed proportion of smokers in group B (pB):
prop.test(x = c(490, 400), n = c(500, 500), alternative = "less")

Or, if you want to test whether the observed proportion of smokers in group A (pA) is
greater than the observed proportion of smokers in group B (pB):
prop.test(x = c(490, 400), n = c(500, 500), alternative = "greater")

The result of prop.test() function is a list containing the following components:

statistic: the value of Pearson’s chi-squared test statistic
parameter: the degrees of freedom of the approximate chi-squared distribution of the test statistic
p.value: the p-value of the test
conf.int: a confidence interval for the probability of success.
estimate: the estimated probability of success.
# printing the p-value
res$p.value
[1] 2.363439e-19

# printing the estimated proportions


res$estimate
prop 1 prop 2
0.98 0.80
# printing the confidence interval
res$conf.int
[1] 0.1408536 0.2191464
attr(,"conf.level")
[1] 0.95
Example:

We have a population of mice containing half male and half female (p = 0.5 = 50%). Some of these mice (n = 160) have developed a spontaneous cancer, including 95 male and 65 female.

We want to know whether the cancer affects more males than females.

In this setting:

• the number of successes (male with cancer) is 95
• The observed proportion (po) of male is 95/160
• The observed proportion (q) of female is 1 − po
• The expected proportion (pe) of male is 0.5 (50%)
• The number of observations (n) is 160

1) whether the observed proportion of male (po) is equal to the expected proportion (pe)?
2) whether the observed proportion of male (po) is less than the expected proportion (pe)?
3) whether the observed proportion of male (po) is greater than the expected proportion (pe)?

prop.test(x, n, p = NULL, alternative = "two.sided", correct = TRUE)

#To know whether the cancer affects more male than female
res <- prop.test(x = 95, n = 160, p = 0.5, correct = FALSE)
# Printing the results
res

#to test whether the proportion of male with cancer is less than 0.5
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "less")

#to test whether the proportion of male with cancer is greater than 0.5 (one-tailed test)
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "greater")

Example:
A survey is taken two times over the course of two weeks. The pollsters wish to see if there is a difference in the
results as there has been a new advertising campaign run. Here is the data

Week 1 Week 2
Favorable 45 56
Unfavorable 35 47

prop.test(c(45,56),c(45+35,56+47))

2-sample test for equality of proportions with continuity correction


data: c(45, 56) out of c(45 + 35, 56 + 47)
X-squared = 0.0108, df = 1, p-value = 0.9172
alternative hypothesis: two.sided
95 percent confidence interval:
-0.1374478 0.1750692
sample estimates:
prop 1    prop 2
0.5625000 0.5436893

we observe that the p-value is 0.9172, so we fail to reject the null hypothesis

Example:

ABC company manufactures tablets. For quality control, two sets of tablets were tested. In the first
group, 32 out of 700 were found to contain some sort of defect. In the second group, 30 out of 400 were
found to contain some sort of defect. Is the difference between the two groups significant? Use a 5%
alpha level

#to know the difference between two groups is significant


prop.test(x = c(32, 30), n = c(700, 400))

#to test whether the observed proportion of defects in group one is less than
#the observed proportion of defects in group two
prop.test(x = c(32, 30), n = c(700, 400), alternative = "less")

#to test whether the observed proportion of defects in group one is greater than
#the observed proportion of defects in group two
prop.test(x = c(32, 30), n = c(700, 400), alternative = "greater")

Example:

Researchers want to test the side effects of a new COVID vaccine. In a clinical trial, 62 out of 300 individuals taking the X1 vaccine report side effects, while 48 out of 400 individuals taking the X2 vaccine report side effects. At the 95% confidence level, write a program in R to answer the following questions:
• Is the side effect of X1 the same as X2?
• Is the side effect of X1 less than X2?
• Is the side effect of X1 greater than X2?

prop.test(x = c(62,48), n = c(300,400))
prop.test(x = c(62,48), n = c(300,400), alternative = "less")
prop.test(x = c(62,48), n = c(300,400), alternative = "greater")

Two Sample t-test in R
A two sample t-test is used to test whether or not the means of two populations are
equal.

Example:

Suppose we want to know whether or not the mean weight between two different
species of turtles is equal. To test this, we collect a simple random sample of turtles from
each species with the following weights:

Sample 1: 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303
Sample 2: 335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305

#define vector of turtle weights for each sample
sample1 <- c(300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303)
sample2 <- c(335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305)
#perform two sample t-test
t.test(x = sample1, y = sample2)

Output:

Welch Two Sample t-test

data: sample1 and sample2


t = -2.1009, df = 19.112, p-value = 0.04914
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-14.73862953 -0.03060124
sample estimates:
mean of x mean of y
307.2308 314.6154
From the output we can see:
t-test statistic: -2.1009
degrees of freedom: 19.112
p-value: 0.04914
95% confidence interval for true mean difference: [-14.74, -0.03]
mean of sample 1 weights: 307.2308
mean of sample 2 weights: 314.6154

Since the p-value of the test (0.04914) is less than .05, we reject the null hypothesis.
This means we have sufficient evidence to say that the mean weight between the two
species is not equal.
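Note that t.test() performs Welch's t-test by default, which does not assume equal variances. If equal variances can reasonably be assumed, a pooled (Student's) two sample t-test is obtained by adding var.equal = TRUE; a brief added sketch reusing the samples above:

t.test(x = sample1, y = sample2, var.equal = TRUE)   # pooled two sample t-test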

A company uses two different processes to put sand into bags. The question is
whether each process places the same weight into the bags. Ten bags from each
process were weighed. The data are shown below. We want to see if there is a
difference in the means from the two processes.

Process 1 50.1 49.9 50 49.9 50.2 50 50.4 49.9 49.7 49.7

Process 2 50.1 49.7 50 49.4 49.4 49.4 49.7 49.4 49.4 49.3

Solution:

process1<-c(50.1,49.9,50,49.9,50.2,50,50.4,49.9,49.7,49.7)
process2<-c(50.1,49.7,50,49.4,49.4,49.4,49.7,49.4,49.4,49.3)
t.test(x = process1, y = process2)

Output:
t.test(x = process1, y = process2)

Welch Two Sample t-test

data: process1 and process2


t = 3.5666, df = 16.819, p-value = 0.002409
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1631883 0.6368117
sample estimates:
mean of x mean of y
49.98 49.58
Since the p-value of the test (0.002409) is less than .05, we reject the null hypothesis.
This means we have sufficient evidence to say that the mean weight between the two
processes is not equal.
The EuStockMarkets data set in R provides daily closing prices of four major European stock indices: Germany DAX (Ibis), Switzerland SMI, France CAC, and UK FTSE. Using this data set, test to see if there are differences in the closing prices of the SMI (second column of EuStockMarkets) and CAC (third column of EuStockMarkets) indices. Carry out this test at the 5% significance level and do not assume equal variances.

Solution:
# load the data
data(EuStockMarkets)

# create the SMI variable which is the second column of the EuStockMarkets data
SMI = EuStockMarkets[,2]

# create the CAC variable which is the third column of the EuStockMarkets data
CAC = EuStockMarkets[,3]

# execute the two-sample t-test


t.test(SMI, CAC, alternative="two.sided", mu=0, conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: SMI and CAC
## t = 28, df = 2305, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1068 1228
## sample estimates:
## mean of x mean of y
## 3376 2228
We find a p-value much smaller than α=0.05, so we can reject the null and conclude
that there are differences in the closing prices between the Swiss SMI and French CAC
stock indices.

Paired Samples t-test in R
A paired samples t-test is used to compare the means of two samples when each observation in
one sample can be paired with an observation in the other sample.

Example:
Suppose we want to know whether or not a certain training program is able to increase the max
vertical jump (in inches) of basketball players.
To test this, we may recruit a simple random sample of 12 college basketball players and measure
each of their max vertical jumps. Then, we may have each player use the training program for one
month and then measure their max vertical jump again at the end of the month.
The following data shows the max jump height (in inches) before and after using the training
program for each player:

Before 22 24 20 19 19 20 22 25 24 23 22 21
After 23 25 20 24 18 22 23 28 24 25 24 20

#define before and after max jump heights
before <- c(22, 24, 20, 19, 19, 20, 22, 25, 24, 23, 22, 21)
after <- c(23, 25, 20, 24, 18, 22, 23, 28, 24, 25, 24, 20)

#perform paired samples t-test


t.test(x = before, y = after, paired = TRUE)

Paired t-test

data: before and after
t = -2.5289, df = 11, p-value = 0.02803
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.3379151 -0.1620849
sample estimates:
mean of the differences
-1.25

From the output we can see:
t-test statistic: -2.5289
degrees of freedom: 11
p-value: 0.02803
95% confidence interval for true mean difference: [-2.34, -0.16]
mean difference between before and after: -1.25

Since the p-value of the test (0.02803) is less than .05, we reject the null hypothesis.
This means we have sufficient evidence to say that the mean jump height before and
after using the training program is not equal.

In order to promote fairness in grading, each application was graded twice by different graders. Based on the
grades, can we see if there is a difference between the two graders? The data is
Grader 1: 3 0 5 2 5 5 5 4 4 5
Grader 2: 2 1 4 1 4 3 3 2 3 5
Code:
x <- c(3, 0, 5, 2, 5, 5, 5, 4, 4, 5)
y <- c(2, 1, 4, 1, 4, 3, 3, 2, 3, 5)
t.test(x,y,paired=TRUE)
Output:

Paired t-test

data: x and y
t = 3.3541, df = 9, p-value = 0.008468
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3255550 1.6744450
sample estimates:
mean of the differences
1

Which would lead us to reject the null hypothesis.
Notice, the data are not independent of each other as grader 1 and grader 2 each grade the same papers. We expect that if grader 1 finds a paper good, grader 2 will also, and vice versa.
In a clinical trial of a cholesterol-lowering agent, 10 patients’ cholesterol (in mmol L-1) was
measured before treatment and 3 weeks after starting treatment. Data is listed in the
following table:
Write a program to find whether or not the treatment lowers the cholesterol of patients?

Patient 1 2 3 4 5 6 7 8 9 10
Before 9.1 8.0 7.7 10.0 9.6 7.9 9.0 7.1 8.3 9.6
After 8.2 6.4 6.6 8.5 8.0 5.8 7.8 7.2 6.7 9.8

before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6)
after <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8)
#perform paired samples t-test
t.test(x = before, y = after, paired = TRUE)
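Since the question asks specifically whether the treatment lowers cholesterol, a one-tailed test is arguably more direct. A sketch of this variant (an addition, reusing the vectors above):

#Ha: true mean of (before - after) is greater than 0, i.e. cholesterol is lower after treatment
t.test(x = before, y = after, paired = TRUE, alternative = "greater")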

A medical researcher wants to compare two methods of measuring cardiac output.
Method A is the standard method and is considered very accurate. But this method is
invasive. Method B is less accurate but not invasive. Cardiac output from 10 patients was
measured using both methods.

Method A 6.3 6.3 3.5 5.1 5.5 7.7 6.3 2.8 3.4 5.7
Method B 5.2 6.6 2.3 4.4 4.1 6.4 5.7 2.3 3.2 5.2

Write a program to determine if there is a significant difference between the two


methods.

methodA <- c(6.3,6.3,3.5,5.1,5.5,7.7,6.3,2.8,3.4,5.7)
methodB <- c(5.2,6.6,2.3,4.4,4.1,6.4,5.7,2.3,3.2,5.2)
#perform paired samples t-test
t.test(x = methodA , y = methodB , paired = TRUE)

t.test(x = methodA , y = methodB , paired = TRUE)

Paired t-test

data: methodA and methodB


t = 4.2394, df = 9, p-value = 0.002176
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.3358031 1.1041969
sample estimates:
mean difference
0.72
Chi-Square Goodness of Fit Test in R
A Chi-Square Goodness of Fit Test is used to determine whether or not a categorical
variable follows a hypothesized distribution.

Example: Chi-Square Goodness of Fit Test in R


A shop owner claims that an equal number of customers come into his shop each
weekday. To test this hypothesis, a researcher records the number of customers that
come into the shop in a given week and finds the following:
Monday: 50 customers
Tuesday: 60 customers
Wednesday: 40 customers
Thursday: 47 customers
Friday: 53 customers
Use the following steps to perform a Chi-Square goodness of fit test in R to determine if
the data is consistent with the shop owner’s claim.

observed <- c(50, 60, 40, 47, 53)
expected <- c(.2, .2, .2, .2, .2) #probabilities must sum to 1; an equal share (1/5) for each of the 5 days

#perform Chi-Square Goodness of Fit Test


chisq.test(x=observed, p=expected)

Chi-squared test for given probabilities


data: observed
X-squared = 4.36, df = 4, p-value = 0.3595
NOTE: The Chi-Square test statistic is found to be 4.36 and the corresponding p-value is 0.3595.
Note that the p-value corresponds to a Chi-Square value with n-1 degrees of freedom (dof), where n is the number of different categories. In this case, dof = 5-1 = 4.
Since the p-value (0.3595) is not less than 0.05, we fail to reject the null hypothesis. This means we do not have sufficient evidence to say that the true distribution of customers is different from the distribution that the shop owner claimed.
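The statistic itself is the sum of (observed − expected)² / expected over all categories; a small added sketch reproducing the value by hand:

observed <- c(50, 60, 40, 47, 53)
expected <- sum(observed) * rep(1/5, 5)     # 50 customers per weekday under the owner's claim
sum((observed - expected)^2 / expected)     # 4.36, matching chisq.test()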
We collected wild tulips and found that 81 were red, 50 were yellow and 27 were white.

Question 1:
Are these colors equally common?
If these colors were equally distributed, the expected proportion would be 1/3 for each of the colors.

tulip <- c(81, 50, 27)


res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res

Chi-squared test for given probabilities


data: tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07

The p-value of the test is 8.803 × 10^-7, which is less than the significance level alpha = 0.05. We can conclude that the colors are not equally distributed (p-value = 8.803 × 10^-7).
Question 2:
Suppose that, in the region where you collected the data, the ratio of red, yellow and white tulip is 3:2:1 (3+2+1 = 6).
This means that the expected proportion is:
3/6 (= 1/2) for red
2/6 ( = 1/3) for yellow
1/6 for white
We want to know, if there is any significant difference between the observed proportions and the expected
proportions?

tulip <- c(81, 50, 27)


res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res

Chi-squared test for given probabilities


data: tulip
X-squared = 0.20253, df = 2, p-value = 0.9037

The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can
conclude that the observed proportions are not significantly different from the expected proportions.

Consider a standard package of milk chocolate M&Ms. There are six different colors: red,
orange, yellow, green, blue and brown. Suppose that we are curious about the distribution
of these colors.
Suppose that we have a simple random sample of 600 M&M candies with the following
distribution:
212 of the candies are blue.
147 of the candies are orange.
103 of the candies are green.
50 of the candies are red.
46 of the candies are yellow.
42 of the candies are brown.
Perform a Chi-Square goodness of fit test to find whether all six colors occur in equal
proportion?

candies <- c(212,147,103,50,46,42)
res <- chisq.test(candies, p = c(1/6, 1/6, 1/6,1/6,1/6,1/6))
res

Output:

Chi-squared test for given probabilities

data: candies
X-squared = 235.42, df = 5, p-value < 2.2e-16

The p-value of the test is less than 2.2e-16, which is below the significance level alpha = 0.05. We can conclude that the colors are not equally distributed.

