R-Unit 5
UNIT – 5
CORRELATION AND REGRESSION ANALYSIS
Correlation analysis is used to investigate the association between two or more
variables.
Correlation is one of the most common statistics.
Using one single value, it describes the "degree of relationship" between two variables.
Correlation ranges from -1 to +1.
Negative values of correlation indicate that as one variable increases the other variable
decreases.
Positive values of correlation indicate that as one variable increases, the other variable
increases as well.
It does not specify that one variable is the dependent variable and the other is the
independent variable.
Pearson Correlation
The most commonly used type of correlation is Pearson correlation, named after Karl Pearson,
who introduced this statistic around the turn of the 20th century. Pearson's r measures
the linear relationship between two variables, say X and Y. A correlation of 1 indicates the data points
perfectly lie on a line for which Y increases as X increases. A value of -1 also implies the data points lie
on a line; however, Y decreases as X increases. The formula for r is
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
R method to find correlation coefficient
Syntax:
cor(x, y, method = c("pearson", "kendall", "spearman"))
Note: If the data contain missing values, they can be handled by case-wise deletion, i.e.
dropping every pair of observations in which either value is missing.
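The slide omits the code; case-wise deletion can be requested through the use argument of cor() (the data below are illustrative):

```r
# illustrative vectors with missing values
x <- c(1.2, 2.4, NA, 4.1, 5.3)
y <- c(2.0, 1.5, 3.5, NA, 5.1)
# "complete.obs" drops any pair in which either value is NA before computing r
cor(x, y, method = "pearson", use = "complete.obs")
```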
Example:
Suppose a training program was conducted to improve the participants’ knowledge of ICT. Data were
collected from a selected sample of 10 individuals before and after the ICT training program. Find the
association/correlation between the paired samples
Code:
library(stats)
before <- c(12.2, 14.6, 13.4, 11.2, 12.7, 10.4, 15.8, 13.9, 9.5, 14.2)
after <- c(13.5, 15.2, 13.6, 12.8, 13.7, 11.3, 16.5, 13.4, 8.7, 14.6)
cor.test(x = before, y = after, method = "pearson", conf.level = 0.95)
First, create before and after as objects containing the scores of ICT training.
before <- c(12.2, 14.6, 13.4, 11.2, 12.7, 10.4, 15.8, 13.9, 9.5, 14.2)
after <- c(13.5, 15.2, 13.6, 12.8, 13.7, 11.3, 16.5, 13.4, 8.7, 14.6)
data <- data.frame(subject = rep(c(1:10), 2),
time = rep(c("before", "after"), each = 10),
score = c(before, after))
You may be interested in the summary statistics of the difference in scores before and after ICT
training. Subtract the ICT score before training from the one after training to get the score
differences. Applying the summary() function to the difference object then gives summary
statistics for the differences.
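A self-contained sketch of this step (the variable name diff_score is illustrative):

```r
before <- c(12.2, 14.6, 13.4, 11.2, 12.7, 10.4, 15.8, 13.9, 9.5, 14.2)
after  <- c(13.5, 15.2, 13.6, 12.8, 13.7, 11.3, 16.5, 13.4, 8.7, 14.6)
# difference in scores: after - before
diff_score <- after - before
# summary statistics of the differences
summary(diff_score)
```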
Correlation using Scatter Plot
Write a program to import the built-in dataset of R "mtcars" having the following fields:
mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
Write an R program to perform the following:
i) Plot a scatter plot of wt~mpg with relevant title, x-axis & y-axis labels and a colour of your choice.
ii) Display the correlation value inside the plot
iii) Plot a correlation plot by considering all the fields of mtcars.
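A possible solution sketch for this exercise (the title text, colour, and text position are arbitrary choices):

```r
library(corrplot)  # may need install.packages("corrplot")
# i) scatter plot of wt vs mpg with title, axis labels and colour
plot(mpg ~ wt, data = mtcars,
     main = "Weight vs. Mileage",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     col = "blue")
# ii) display the correlation value inside the plot
r <- round(cor(mtcars$wt, mtcars$mpg), 2)
text(4.5, 30, paste("r =", r))
# iii) correlation plot over all fields of mtcars
corrplot(cor(mtcars), method = "circle")
```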
Scatterplot matrix
When we have more than two variables and we want to find the correlation between one
variable and the remaining ones, we use a scatterplot matrix. We use the pairs() function to
create matrices of scatterplots.
The basic syntax for creating scatterplot matrices in R is:
pairs(formula, data)
Example:
Correlation Plot
library(corrplot)
#need to install corrplot
corrplot(cor(mtcars), method="circle")
library(corrplot)
#need to install corrplot
corrplot(cor(mtcars), method="number")
library(corrplot)
#need to install corrplot
corrplot(cor(mtcars,method="kendall"), method="pie")
library(corrplot)
#need to install corrplot
corrplot(cor(mtcars,method="kendall"), method="number")
library(corrplot)
#need to install corrplot
input <- mtcars[,c('wt','mpg')]
corrplot(cor(input,method="kendall"), method="square")
library(corrplot)
#need to install corrplot
corrplot(cor(mtcars,method="kendall"), method="pie",type = "upper")
Plot the following for the dataset "mtcars"
corrplot.mixed(cor(mtcars),
upper = "square",
lower = "number")
A company wants to evaluate the cause-effect relationship between several factors (foam, scent, color,
and residue) on the perceived quality of shampoo.
Write an R program to create and display the data frame with the above data. Use the data frame to
create the following plots.
i) Plot a scatter plot matrix. Set the colour of plot as blue and provide a suitable title.
ii) Plot a correlation plot using Kendall correlation method. Show the correlation values of fields in the
lower triangular portion of the plot and piechart on the upper triangular portion of plot.
library(corrplot)
Data_Frame <- data.frame(
Foam=c(6.3,4.4,3.9,5.1,5.6,4.6,4.8,6.5,8.7),
Scent=c(5.3,4.9,5.3,4.2,5.1,5.6,4.6,4.8,4.5),
Colour=c(4.8,3.5,4.8,3.1,5.5,5.1,4.8,4.3,3.9),
Residue=c(3.1,3.9,4.7,3.6,5.1,4.1,3.3,5.2,2.9),
Quality=c(91,87,82,83,83,84,90,84,97))
print(Data_Frame)
pairs(~Foam+Scent+Colour+Residue+Quality,
data = Data_Frame,
main = "Scatterplot Matrix",
col="blue")
#need to install corrplot
corrplot.mixed(cor(Data_Frame,
method="kendall"),
upper = "pie",
lower = "number" )
A soft drink bottler is trying to predict delivery times for a driver. He has collected data
on the delivery time, the number of cartons delivered and the distance the driver walked
and the same is shown below.
Delivery Time 16.68 11.5 12.03 14.88 13.75 18.11 8 17.83 79.24 21.5
X1 (Cartons) 7 3 3 4 6 7 2 7 30 5
X2 (Distance) 560 220 340 80 150 330 110 210 1460 605
Write a program in R to create Scatter plot matrix and Correlation matrix for the above
data.
library(corrplot)
Data_Frame <- data.frame(
delivery_time=c(16.68,11.5,12.03,14.88,13.75,18.11,8,17.83,79.24,21.5),
X1=c(7,3,3,4,6,7,2,7,30,5),
X2=c(560,220,340,80,150,330,110,210,1460,605))
print(Data_Frame)
pairs(~delivery_time+X1+X2,
data = Data_Frame,
main = "Scatterplot Matrix",
col="blue")
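The correlation-matrix part of the exercise can be completed with corrplot (the display method chosen here is arbitrary):

```r
library(corrplot)  # may need install.packages("corrplot")
Data_Frame <- data.frame(
  delivery_time = c(16.68, 11.5, 12.03, 14.88, 13.75, 18.11, 8, 17.83, 79.24, 21.5),
  X1 = c(7, 3, 3, 4, 6, 7, 2, 7, 30, 5),
  X2 = c(560, 220, 340, 80, 150, 330, 110, 210, 1460, 605))
# correlation matrix of the three variables
print(cor(Data_Frame))
corrplot(cor(Data_Frame), method = "number")
```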
Mathematically, a linear relationship is represented by a straight line when plotted as a graph. A non-linear
relationship, where the exponent of a variable is not equal to 1, creates a curve.
lm() Function
This function creates the relationship model between the predictor and the response
variable.
Syntax
The basic syntax for lm() function in linear regression is :
lm(formula,data)
df = data.frame(x=c(1, 3, 3, 4, 5, 5, 6, 8, 9, 12),
y=c(12, 14, 14, 13, 17, 19, 22, 26, 24, 22))
#fit linear regression model using 'x' as predictor and 'y' as response variable
model <- lm(y ~ x, data=df)
#plot the data first, then add the fitted regression line
plot(y ~ x, data=df)
abline(model)
Create a Scatterplot with a Regression Line
#create scatterplot
plot(mpg ~ wt, data=mtcars)
#add the fitted regression line
abline(lm(mpg ~ wt, data=mtcars))
Let height be a variable that describes the heights (in cm) of ten people, and bodymass a
variable that describes the masses (in kg) of the same ten people.
height 176 154 138 196 132 176 181 169 150 175
bodymass 82 49 53 112 47 69 77 71 62 78
height <- c(176, 154, 138,196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
plot(bodymass, height, col = "blue", main = "HEIGHT Vs. BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")
abline(lm(height ~ bodymass))
coeff<-round(coef(lm(height~bodymass)),2)
print(lm(height ~ bodymass))
text(90,140, paste("Y = ", coeff[1], "+", coeff[2], "x"))
Output: a scatter plot of height vs. body mass with the fitted regression line and its equation displayed.
The years of Experience and Salary details of a sample of 10 Employees are shown
below.
For the above data:
Perform linear regression and display the result
Create a Regression Plot with the following specifications
Display the title of the graph as “Years of Experience Vs. Salary”
Set the color of the plot as green
Display the x-axis label as “Years of Experience”
Display the y-axis label as “Salary”
Display the regression equation inside the plot.
Employee 1 2 3 4 5 6 7 8 9 10
Years of Experience 1.1 1.3 1.5 2.0 2.2 2.9 3.0 3.2 3.2 3.7
Salary 39343 46205 37731 43525 39891 56642 60150 54445 64445 57189
YoE<- c(1.1,1.3,1.5,2.0,2.2,2.9,3.0,3.2,3.2,3.7)
Salary <- c(39343,46205,37731,43525,39891,56642,60150,54445,64445,57189)
plot(YoE, Salary,
col = "green",
main = "Years of Experience Vs. Salary",
xlab = "Years of Experience",
ylab = "Salary")
abline(lm(Salary~YoE))
coeff<-round(coef(lm(Salary~YoE)),2)
print(lm(Salary~YoE))
text(2,60000, paste("Y = ", coeff[1], "+", coeff[2], "x"))
The local ice cream shop keeps track of how much ice cream they sell versus the noon
temperature on that day.
Ice Cream Sales vs Temperature
ICS<- c(215,325,185,332,406,522,412,614,544,421)
Temperature <- c(14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1)
plot(Temperature, ICS,
col = "green",
main = "Temperature Vs. Ice Cream Sales",
xlab = "Temperature",
ylab = "Ice Cream Sales")
abline(lm(ICS~Temperature))
coeff<-round(coef(lm(ICS~Temperature)),2)
print(lm(ICS~Temperature))
text(18,200, paste("Y = ", coeff[1], "+", coeff[2], "x"))
An organization is interested in knowing the relationship between the monthly e-
commerce sales and the online advertising costs(in Dollars). The survey results for 7
online stores for last year are shown below.
Monthly E-Commerce Sales (in 1000s): 368 340 665 954 331 556 376
Online Advertising Cost (in 1000s): 1.7 1.5 2.8 5 1.3 2.2 1.3
ecommercesales <- c(368, 340, 665, 954, 331, 556, 376)
advertisingcosts <- c(1.7, 1.5, 2.8, 5, 1.3, 2.2, 1.3)
plot(advertisingcosts, ecommercesales)
abline(lm(ecommercesales ~ advertisingcosts))
coeff<-round(coef(lm(ecommercesales ~ advertisingcosts)),2)
print(lm(ecommercesales ~ advertisingcosts))
text(4,500, paste("Y = ", coeff[1], "+", coeff[2], "x"))
Classification using Logistic Regression
Logistic Regression
Syntax:
logistic_model <- glm( formula, family, dataframe )
Parameter:
•formula: determines the symbolic description of the model to be fitted.
•family: determines the description of the error distribution and link function to be
used in the model.
•dataframe: determines the data frame to be used for fitting purpose
The in-built data set "mtcars" describes different models of a car with their various engine specifications.
In "mtcars" data set, the transmission mode (automatic or manual) is described by the column am which
is a binary value (0 or 1). Create a logistic regression model between the columns "am" and 3 other
columns - hp, wt and cyl and display the summary.
input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
plot(am.data)
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
Create a data frame with the following data.
Perform logistic regression for the dataset that shows whether or not college basketball players got drafted into the
NBA (draft: 0 = no, 1 = yes) based on their average points, rebounds, and assists in the previous season.
Draft 0 1 0 1 1 1 1 0 0 1
AvgPts 12 13 13 12 14 14 17 17 21 21
Rebounds 3 4 4 9 4 4 2 6 5 9
Assists 6 4 6 9 5 4 2 5 7 3
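The slide omits the construction of the data frame referenced as input by the code below; assuming the table above, it can be built as:

```r
# build the data frame used by the glm() call (the name 'input' matches the slide code)
input <- data.frame(
  Draft    = c(0, 1, 0, 1, 1, 1, 1, 0, 0, 1),
  AvgPts   = c(12, 13, 13, 12, 14, 14, 17, 17, 21, 21),
  Rebounds = c(3, 4, 4, 9, 4, 4, 2, 6, 5, 9),
  Assists  = c(6, 4, 6, 9, 5, 4, 2, 5, 7, 3))
print(input)
```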
Draft.data = glm(formula = Draft ~ AvgPts + Rebounds +Assists, data = input, family = binomial)
print(summary(Draft.data))
plot(Draft.data)
Introduction to Statistical Hypothesis Testing in R
A statistical hypothesis is an assumption made by the researcher about the data of the population collected
for any experiment. It is not mandatory for this assumption to be true every time. Hypothesis testing, in a
way, is a formal process of validating the hypothesis made by the researcher.
In order to validate a hypothesis, we would ideally need to take the entire population into account. However,
this is not practically possible. Thus, to validate a hypothesis, we use random samples drawn from the
population. On the basis of the results of testing the sample data, the hypothesis is either accepted or rejected.
Hypothesis Testing in R
Statisticians use hypothesis testing to formally check whether the hypothesis is accepted or rejected.
Hypothesis testing is conducted in the following manner:
State the Hypotheses – Stating the null and alternative hypotheses.
Formulate an Analysis Plan – The formulation of an analysis plan is a crucial step in this stage.
Analyze Sample Data – Calculation and interpretation of the test statistic, as described in the analysis plan.
Interpret Results – Application of the decision rule described in the analysis plan.
Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, in other words, what
the data tell us about the population. The p-value ranges between 0 and 1. It can be interpreted in the
following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.
Two-proportions z-test
The two-proportions z-test is used to compare two observed proportions
Example:
We have two groups of individuals, group A and group B.
We want to know whether the proportion of smokers is the same in the two groups of individuals.
1. whether the observed proportion of smokers in group A (pA) is equal to the observed proportion of
smokers in group B (pB)?
2. whether the observed proportion of smokers in group A (pA) is less than the observed proportion of
smokers in group B (pB)?
3. whether the observed proportion of smokers in group A (pA) is greater than the observed proportion of
smokers in group B (pB)?
Formula of the test statistic
Case of large sample sizes
The test statistic (also known as the z-test) can be calculated as follows:
z = (pA − pB) / √( p·q·(1/nA + 1/nB) )
where pA and pB are the observed proportions in groups A and B, nA and nB are the two sample sizes,
p = (xA + xB)/(nA + nB) is the pooled proportion of successes, and q = 1 − p.
Note that, the formula of the z-statistic is valid only when the sample size (n) is large enough: nA·p, nA·q,
nB·p and nB·q should all be ≥ 5.
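The formula can be checked by hand in R; the counts below are illustrative (they reuse the smokers example that appears later). Squared, the pooled z matches the X-squared statistic reported by prop.test() when the continuity correction is turned off:

```r
# two groups: xA successes out of nA trials, xB out of nB (illustrative counts)
xA <- 490; nA <- 500
xB <- 400; nB <- 500
pA <- xA / nA
pB <- xB / nB
p  <- (xA + xB) / (nA + nB)   # pooled proportion
q  <- 1 - p
# test statistic from the formula above
z  <- (pA - pB) / sqrt(p * q * (1/nA + 1/nB))
z
```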
Compare two-proportions z-test in R
R function: prop.test()
x: a vector of counts of successes
n: a vector of counts of trials
alternative: a character string specifying the alternative hypothesis
correct: a logical indicating whether Yates’ continuity correction should be applied where possible
We want to know, whether the proportions of smokers are the same in the two groups of individuals?
res <- prop.test(x = c(490, 400), n = c(500, 500))
# Printing the results
res
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
The function returns, among other things, the test statistic, the p-value, the confidence interval for the
difference in proportions, and the two sample proportion estimates.
If you want to test whether the observed proportion of smokers in group A (pA) is less
than the observed proportion of smokers in group (pB):
prop.test(x = c(490, 400), n = c(500, 500), alternative = "less")
Or, if you want to test whether the observed proportion of smokers in group A (pA) is
greater than the observed proportion of smokers in group (pB):
prop.test(x = c(490, 400), n = c(500, 500), alternative = "greater")
The result of the prop.test() function is a list containing, among other components, the test statistic
(statistic), the p-value (p.value), the confidence interval (conf.int) and the sample estimates (estimate).
One-proportion z-test
The one-proportion z-test is used to compare an observed proportion to a theoretical one.
We have a population of mice containing half male and half female (p = 0.5 = 50%). Some of these mice (n =
160) have developed a spontaneous cancer, including 95 males and 65 females.
We want to know whether the cancer affects males more than females.
In this setting, the observed proportion of males is po = 95/160 and the expected proportion is pe = 0.5.
1) whether the observed proportion of males (po) is equal to the expected proportion (pe)?
2) whether the observed proportion of males (po) is less than the expected proportion (pe)?
3) whether the observed proportion of males (po) is greater than the expected proportion (pe)?
#To know whether the cancer affects more male than female
res <- prop.test(x = 95, n = 160, p = 0.5, correct = FALSE)
# Printing the results
res
#to test whether the proportion of male with cancer is less than 0.5
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "less")
#to test whether the proportion of male with cancer is greater than 0.5 (one-tailed test)
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "greater")
Example:
A survey is taken two times over the course of two weeks. The pollsters wish to see if there is a difference in the
results as there has been a new advertising campaign run. Here is the data
Week 1 Week 2
Favorable 45 56
Unfavorable 35 47
prop.test(c(45,56),c(45+35,56+47))
Example:
ABC company manufactures tablets. For quality control, two sets of tablets were tested. In the first
group, 32 out of 700 were found to contain some sort of defect. In the second group, 30 out of 400 were
found to contain some sort of defect. Is the difference between the two groups significant? Use a 5%
alpha level
#to test whether the difference between the two groups is significant (two-sided, 5% alpha level)
prop.test(x = c(32, 30), n = c(700, 400))
#to test whether the observed proportion of defects in group one is less than in group two
prop.test(x = c(32, 30), n = c(700, 400), alternative = "less")
#to test whether the observed proportion of defects in group one is greater than in group two
prop.test(x = c(32, 30), n = c(700, 400), alternative = "greater")
Example:
Researchers want to test the side effects of a new COVID vaccine. In a clinical trial, 62 out of 300 individuals
taking the X1 vaccine report side effects, while 48 out of 400 individuals taking the X2 vaccine report side
effects. At a 95% confidence level, write a program in R to answer the following questions:
Is the side effect rate of X1 the same as that of X2?
Is the side effect rate of X1 less than that of X2?
Is the side effect rate of X1 greater than that of X2?
prop.test(x = c(62, 48), n = c(300, 400))
prop.test(x = c(62, 48), n = c(300, 400), alternative = "less")
prop.test(x = c(62, 48), n = c(300, 400), alternative = "greater")
Two Sample t-test in R
A two sample t-test is used to test whether or not the means of two populations are
equal.
Example:
Suppose we want to know whether or not the mean weight between two different
species of turtles is equal. To test this, we collect a simple random sample of turtles from
each species with the following weights:
Sample 1: 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303
Sample 2: 335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305
#define vector of turtle weights for each sample
sample1 <- c(300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303)
sample2 <- c(335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305)
#perform two sample t-test
t.test(x = sample1, y = sample2)
Output:
Since the p-value of the test (0.04914) is less than .05, we reject the null hypothesis.
This means we have sufficient evidence to say that the mean weight between the two
species is not equal.
A company uses two different processes to put sand into bags. The question is
whether each process places the same weight into the bags. Ten bags from each
process were weighed. The data are shown below. We want to see if there is a
difference in the means from the two processes.
Process 1 50.1 49.9 50 49.9 50.2 50 50.4 49.9 49.7 49.7
Process 2 50.1 49.7 50 49.4 49.4 49.4 49.7 49.4 49.4 49.3
Solution:
process1<-c(50.1,49.9,50,49.9,50.2,50,50.4,49.9,49.7,49.7)
process2<-c(50.1,49.7,50,49.4,49.4,49.4,49.7,49.4,49.4,49.3)
t.test(x = process1, y = process2)
Solution:
# load the data
data(EuStockMarkets)
# create the SMI variable which is the second column of the EuStockMarkets data
SMI = EuStockMarkets[,2]
# create the CAC variable which is the third column of the EuStockMarkets data
CAC = EuStockMarkets[,3]
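The slide is cut off here; presumably the two index series are then compared with a two-sample t-test. A self-contained version:

```r
# load the built-in European stock market data
data(EuStockMarkets)
SMI <- EuStockMarkets[,2]   # Swiss SMI index
CAC <- EuStockMarkets[,3]   # French CAC index
# two sample t-test comparing the mean levels of the two indices
t.test(x = SMI, y = CAC)
```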
Paired Samples t-test in R
A paired samples t-test is used to compare the means of two samples when each observation in
one sample can be paired with an observation in the other sample.
Example:
Suppose we want to know whether or not a certain training program is able to increase the max
vertical jump (in inches) of basketball players.
To test this, we may recruit a simple random sample of 12 college basketball players and measure
each of their max vertical jumps. Then, we may have each player use the training program for one
month and then measure their max vertical jump again at the end of the month.
The following data shows the max jump height (in inches) before and after using the training
program for each player:
Before 22 24 20 19 19 20 22 25 24 23 22 21
After 23 25 20 24 18 22 23 28 24 25 24 20
#define before and after max jump heights
before <- c(22, 24, 20, 19, 19, 20, 22, 25, 24, 23, 22, 21)
after <- c(23, 25, 20, 24, 18, 22, 23, 28, 24, 25, 24, 20)
#perform paired samples t-test
t.test(x = before, y = after, paired = TRUE)
From the output we can see:
t-test statistic: -2.5289
degrees of freedom: 11
p-value: 0.02803
95% confidence interval for true mean difference: [-2.34, -0.16]
mean difference between before and after: -1.25
Since the p-value of the test (0.02803) is less than .05, we reject the null hypothesis.
This means we have sufficient evidence to say that the mean jump height before and
after using the training program is not equal.
In order to promote fairness in grading, each application was graded twice by different graders. Based on the
grades, can we see if there is a difference between the two graders? The data is
Grader 1: 3 0 5 2 5 5 5 4 4 5
Grader 2: 2 1 4 1 4 3 3 2 3 5
Code:
x <- c(3, 0, 5, 2, 5, 5, 5, 4, 4, 5)
y <- c(2, 1, 4, 1, 4, 3, 3, 2, 3, 5)
t.test(x,y,paired=TRUE)
Output:
Paired t-test
data: x and y
t = 3.3541, df = 9, p-value = 0.008468
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3255550 1.6744450
sample estimates:
mean of the differences
1
Which would lead us to reject the null hypothesis.
Notice, the data are not independent of each other, as grader 1 and grader 2 each grade the same papers. We
expect that if grader 1 finds a paper good, grader 2 will also, and vice versa.
In a clinical trial of a cholesterol-lowering agent, 10 patients’ cholesterol (in mmol L-1) was
measured before treatment and 3 weeks after starting treatment. Data is listed in the
following table:
Write a program to find whether or not the treatment lowers the cholesterol of patients?
Patient 1 2 3 4 5 6 7 8 9 10
Before 9.1 8.0 7.7 10.0 9.6 7.9 9.0 7.1 8.3 9.6
After 8.2 6.4 6.6 8.5 8.0 5.8 7.8 7.2 6.7 9.8
before <- c(9.1, 8.0, 7.7, 10.0, 9.6, 7.9, 9.0, 7.1, 8.3, 9.6)
after <- c(8.2, 6.4, 6.6, 8.5, 8.0, 5.8, 7.8, 7.2, 6.7, 9.8)
#perform a one-sided paired samples t-test (does the treatment lower cholesterol, i.e. is before greater than after?)
t.test(x = before, y = after, paired = TRUE, alternative = "greater")
A medical researcher wants to compare two methods of measuring cardiac output.
Method A is the standard method and is considered very accurate. But this method is
invasive. Method B is less accurate but not invasive. Cardiac output from 26 patients was
measured using both methods.
Method A 6.3 6.3 3.5 5.1 5.5 7.7 6.3 2.8 3.4 5.7
Method B 5.2 6.6 2.3 4.4 4.1 6.4 5.7 2.3 3.2 5.2
methodA <- c(6.3,6.3,3.5,5.1,5.5,7.7,6.3,2.8,3.4,5.7)
methodB <- c(5.2,6.6,2.3,4.4,4.1,6.4,5.7,2.3,3.2,5.2)
#perform paired samples t-test
t.test(x = methodA , y = methodB , paired = TRUE)
Chi-Square Goodness of Fit Test in R
observed <- c(50, 60, 40, 47, 53)
expected <- c(.2, .2, .2, .2, .2) #must add up to 1, 5 days equal number(1/5)
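The test call itself is missing from the slide; a self-contained sketch with chisq.test():

```r
observed <- c(50, 60, 40, 47, 53)
expected <- c(.2, .2, .2, .2, .2)  # probabilities must add up to 1 (5 days, 1/5 each)
# chi-square goodness of fit test against equal daily proportions
res <- chisq.test(x = observed, p = expected)
res
```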
Question 1:
Are these colors equally common?
If these colors were equally distributed, the expected proportion would be 1/3 for each color.
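The slide omits the tulip data and the test call. The counts below (81 red, 50 yellow, 27 white) are an assumption, chosen because they reproduce the reported p-value of 8.803e-07:

```r
# tulip counts (red, yellow, white) -- assumed, consistent with the reported p-value
tulip <- c(81, 50, 27)
# goodness of fit test against equal proportions (1/3 each)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res
```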
The p-value of the test is 8.803 × 10^-7, which is less than the significance level alpha = 0.05.
We can conclude that the colors are not equally distributed.
Question 2:
Suppose that, in the region where you collected the data, the ratio of red, yellow and white tulip is 3:2:1 (3+2+1 = 6).
This means that the expected proportion is:
3/6 (= 1/2) for red
2/6 ( = 1/3) for yellow
1/6 for white
We want to know, if there is any significant difference between the observed proportions and the expected
proportions?
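Using the same assumed tulip counts as in Question 1, the test against the 3:2:1 ratio would be:

```r
tulip <- c(81, 50, 27)  # assumed counts, as in Question 1
# goodness of fit test against expected proportions in the ratio 3:2:1
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res
```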
The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can
conclude that the observed proportions are not significantly different from the expected proportions.
Consider a standard package of milk chocolate M&Ms. There are six different colors: red,
orange, yellow, green, blue and brown. Suppose that we are curious about the distribution
of these colors.
Suppose that we have a simple random sample of 600 M&M candies with the following
distribution:
212 of the candies are blue.
147 of the candies are orange.
103 of the candies are green.
50 of the candies are red.
46 of the candies are yellow.
42 of the candies are brown.
Perform a Chi-Square goodness of fit test to find whether all six colors occur in equal
proportion?
candies <- c(212,147,103,50,46,42)
res <- chisq.test(candies, p = c(1/6, 1/6, 1/6,1/6,1/6,1/6))
res
Output:
Chi-squared test for given probabilities
data: candies
X-squared = 235.42, df = 5, p-value < 2.2e-16
The p-value of the test is less than 2.2e-16, which is below the significance level alpha = 0.05.
We can conclude that the colors are not equally distributed.