Xstklatest Update
CONTENTS
1. INTRODUCTION
1.1. About this subject
1.2. About our tool (RStudio)
1.3. Our problems
2. THEORIES
2.1. t-Test
2.1.1. Definition
2.1.2. Formula
2.2. Analysis of Variance (ANOVA)
2.2.1. One-way ANOVA
2.2.2. Two-way ANOVA
3. DATA AND STATISTICAL METHODS
3.1. Importing data
3.2. Data cleaning
3.3. Data visualization
3.4. Normal distribution and linear regression checking (1) (to know whether it is normally distributed with 2 variables, G2 and G3)
3.4.1. Normal distribution testing
3.4.2. Fitting linear regression model
3.5. Applying t-test
3.5.1. One-sample t-test
3.5.2. Two-sample t-test
3.6. Applying ANOVA
3.6.1. One-way ANOVA
3.6.2. Two-way ANOVA
3.7. Conclusion
4. REFERENCES
5. ACKNOWLEDGEMENT
1. INTRODUCTION
2. THEORIES
2.1. t-Test
2.1.1. Definition
A t-test is a statistical test used to compare two groups' means. It is
frequently used in hypothesis testing to determine whether a process or treatment
has an effect on the population of interest, or whether two groups are different
from one another.
The t statistic is computed from sample data. It depends on the sample mean,
the sample variance (or standard deviation) and the sample size of the data
being compared.
There are 3 types of t-tests that can be performed on a collected sample of
size n:
One-sample t-test.
Independent samples t-test.
Paired samples t-test.
The critical value (CV) is obtained from the t-table using the degrees of
freedom (df = n − 1 for the one-sample and paired tests) and the chosen
significance level α (usually 0.05 or 0.1). If the absolute value of the computed
t statistic is greater than the CV, we reject the null hypothesis and conclude
that the results are significantly different.
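As a minimal sketch of this decision rule in R, the critical value can be read from qt() instead of a printed t-table. The sample size, significance level and t statistic below are made-up illustration values, not results from our data, and a two-sided test is assumed.

```r
# Illustrative values only (assumptions, not taken from the data set).
n      <- 16     # sample size
alpha  <- 0.05   # significance level
t_stat <- 2.5    # a hypothetical computed t statistic

df <- n - 1                    # degrees of freedom
cv <- qt(1 - alpha / 2, df)    # two-sided critical value of the t distribution

# Reject the null hypothesis when |t| exceeds the critical value.
reject <- abs(t_stat) > cv
print(cv)
print(reject)
```

For a one-sided test the same idea applies with qt(1 - alpha, df) and without the absolute value.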
2.1.2. Formula
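2.1.2.1. One-sample t-test
To compare the mean of a population, estimated from a sample of size n, with
a specified theoretical mean μ, we use a one-sample t-test.
$$ t = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
Where,
x̄ is the mean of the sample.
μ is the specified theoretical mean.
σ is the standard deviation, estimated from the sample.
n is the sample size.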
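The one-sample statistic can be checked by hand against R's built-in t.test(). This is only a sketch: the vector x and the value mu are hypothetical, and sd() is used as the sample estimate of the standard deviation.

```r
# Hypothetical sample and hypothesised mean (illustration only).
x  <- c(5, 7, 8, 6, 9, 10, 7, 6)
mu <- 6

n <- length(x)

# t = (sample mean - mu) / (standard deviation / sqrt(n))
t_manual <- (mean(x) - mu) / (sd(x) / sqrt(n))

# Built-in one-sample t-test for comparison.
res <- t.test(x, mu = mu)

print(t_manual)
print(unname(res$statistic))   # same value as t_manual
print(unname(res$parameter))   # degrees of freedom, n - 1
```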
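2.1.2.2. Independent samples t-test
The independent samples t-test is used to compare the means of two independent
groups of samples. It helps determine whether the means of the two sets of data
are statistically significantly different from each other.
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$
Where,
t is the independent samples t statistic.
x̄1 and x̄2 are the means of group 1 and group 2.
s1 is the standard deviation of group 1.
s2 is the standard deviation of group 2.
n1 and n2 are the sizes of group 1 and group 2.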
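A short sketch of the two-group statistic, using two hypothetical vectors g1 and g2. The form with separate variances s1^2/n1 + s2^2/n2 corresponds to what R computes with var.equal = FALSE (the Welch version, which is also R's default).

```r
# Two hypothetical independent groups (illustration only).
g1 <- c(12, 15, 11, 14, 13, 16)
g2 <- c(10, 9, 12, 11, 8)

# t = (mean1 - mean2) / sqrt(s1^2 / n1 + s2^2 / n2)
se       <- sqrt(var(g1) / length(g1) + var(g2) / length(g2))
t_manual <- (mean(g1) - mean(g2)) / se

# Built-in test with unequal variances for comparison.
res <- t.test(g1, g2, var.equal = FALSE)

print(t_manual)
print(unname(res$statistic))   # same value as t_manual
```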
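2.1.2.3. Paired samples t-test
$$ t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad \bar{d} = \frac{1}{n} \sum_{i=1}^{n} (x_{1i} - x_{2i}) $$
Where,
t is the paired samples t statistic.
d̄ is the mean of the pairwise differences x1 − x2.
s_d is the standard deviation of the differences.
n is the number of pairs.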
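For paired data the test works on the pairwise differences. The sketch below uses hypothetical before/after measurements and computes the statistic from the mean of the differences divided by its standard error, then compares it with t.test(..., paired = TRUE).

```r
# Hypothetical paired measurements on the same six subjects (illustration only).
before <- c(10, 12, 9, 14, 11, 13)
after  <- c(12, 13, 9, 16, 12, 15)

d <- before - after   # pairwise differences
n <- length(d)        # number of pairs

# t = mean(d) / (sd(d) / sqrt(n))
t_manual <- mean(d) / (sd(d) / sqrt(n))

# Built-in paired test for comparison.
res <- t.test(before, after, paired = TRUE)

print(t_manual)
print(unname(res$statistic))   # same value as t_manual
```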
2.2. Analysis of Variance (ANOVA)
                          One-Way ANOVA                      Two-Way ANOVA
Definition                A test that allows one to make     A test that allows one to make
                          comparisons between the means      comparisons between the means of
                          of three or more groups of data.   three or more groups of data,
                                                             where two independent variables
                                                             are considered.
Number of Independent     One                                Two
Variables
What is Being Compared?   The means of three or more         The effect of multiple groups of
                          groups of an independent           two independent variables on a
                          variable on a dependent            dependent variable and on each
                          variable.                          other.
Number of Groups of       Three or more.                     Each variable should have
Samples                                                      multiple samples.
2.2.1. One-way ANOVA
The formulas for the sums of squares are:
Total sum of squares (SST)
$$ SST = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (X_{ij} - \bar{X}_{..})^2 = \sum_{i=1}^{I} \sum_{j=1}^{J_i} X_{ij}^2 - \frac{1}{n} X_{..}^2 $$
Treatment sum of squares (SSTr)
$$ SSTr = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (\bar{X}_{i.} - \bar{X}_{..})^2 = \sum_{i=1}^{I} \frac{1}{J_i} X_{i.}^2 - \frac{1}{n} X_{..}^2 $$
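The computational forms of SST and SSTr can be verified numerically. The sketch below uses a small hypothetical data set with unequal group sizes, reading X_i. as the total of group i, X.. as the grand total, J_i as the size of group i and n as the total number of observations. The error sum of squares is taken as SSE = SST − SSTr, which is the standard identity and an assumption beyond the formulas shown here.

```r
# Hypothetical responses in three groups of unequal size (illustration only).
y     <- c(4, 5, 6, 7, 9, 8, 10, 12, 11, 13)
group <- factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c"))

n      <- length(y)                  # total number of observations
grand  <- sum(y)                     # grand total X..
totals <- tapply(y, group, sum)      # group totals X_i.
sizes  <- tapply(y, group, length)   # group sizes J_i

SST  <- sum(y^2) - grand^2 / n                # total sum of squares
SSTr <- sum(totals^2 / sizes) - grand^2 / n   # treatment sum of squares
SSE  <- SST - SSTr                            # error sum of squares (assumed identity)

# Compare with the ANOVA table produced by R.
tab <- anova(aov(y ~ group))
print(c(SST = SST, SSTr = SSTr, SSE = SSE))
print(tab)
```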
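2.2.2. Two-way ANOVA
2.2.2.1. K_ij = 1
Source of Variation   df              Sum of Squares   Mean Square                    f
Factor A              I−1             SSA              MSA = SSA/(I−1)                f_A = MSA/MSE
Factor B              J−1             SSB              MSB = SSB/(J−1)                f_B = MSB/MSE
Error                 (I−1)(J−1)      SSE              MSE = SSE/((I−1)(J−1))
Total                 IJ−1            SST
Rejection Region
$$ f_A \ge F_{\alpha,\, I-1,\, (I-1)(J-1)} $$
$$ f_B \ge F_{\alpha,\, J-1,\, (I-1)(J-1)} $$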
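A minimal sketch of a two-way ANOVA with one observation per cell, fitted with aov() and an additive model. The factors A and B and the responses y are hypothetical, and the critical values are taken from qf() at an assumed level of 0.05.

```r
# Hypothetical design: 3 levels of A, 4 levels of B, one observation per cell.
A <- factor(rep(c("a1", "a2", "a3"), each = 4))
B <- factor(rep(c("b1", "b2", "b3", "b4"), times = 3))
y <- c(10, 12, 11, 14,
       13, 15, 13, 17,
        9, 12, 10, 13)

nI <- nlevels(A)   # I in the ANOVA table
nJ <- nlevels(B)   # J in the ANOVA table

# Additive model: with one observation per cell there is no interaction term.
tab <- anova(aov(y ~ A + B))
print(tab)

alpha  <- 0.05
df_err <- (nI - 1) * (nJ - 1)
crit_A <- qf(1 - alpha, nI - 1, df_err)   # critical value for factor A
crit_B <- qf(1 - alpha, nJ - 1, df_err)   # critical value for factor B

# Rejection region: f >= critical value.
reject_A <- tab["A", "F value"] >= crit_A
reject_B <- tab["B", "F value"] >= crit_B
print(c(reject_A = reject_A, reject_B = reject_B))
```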
2.2.2.2. K_ij > 1
The formulas for the sums of squares are:
Total sum of squares (SST)
$$ SST = \sum_{i} \sum_{j} \sum_{k} (X_{ijk} - \bar{X}_{...})^2 $$
Source of Variation   df              Sum of Squares   Mean Square                    f
Factor A              I−1             SSA              MSA = SSA/(I−1)                f_A = MSA/MSE
Factor B              J−1             SSB              MSB = SSB/(J−1)                f_B = MSB/MSE
Interaction           (I−1)(J−1)      SSAB             MSAB = SSAB/((I−1)(J−1))       f_AB = MSAB/MSE
Error                 IJ(K−1)         SSE              MSE = SSE/(IJ(K−1))
Rejection Region
$$ f_A \ge F_{\alpha,\, I-1,\, IJ(K-1)} $$
$$ f_B \ge F_{\alpha,\, J-1,\, IJ(K-1)} $$
$$ f_{AB} \ge F_{\alpha,\, (I-1)(J-1),\, IJ(K-1)} $$
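When each cell has K > 1 observations, the interaction can be estimated. The sketch below fits y ~ A * B on a hypothetical balanced design (I = 2, J = 3, K = 2) and checks the degrees of freedom against the table above.

```r
# Hypothetical balanced design: 2 levels of A, 3 levels of B, 2 replicates per cell.
A <- factor(rep(c("a1", "a2"), each = 6))
B <- factor(rep(rep(c("b1", "b2", "b3"), each = 2), times = 2))
y <- c(10, 11, 13, 12, 15, 17,
       12, 14, 13, 15, 20, 19)

nI <- nlevels(A)   # I
nJ <- nlevels(B)   # J
nK <- 2            # K, replicates per cell

# A * B expands to A + B + A:B (main effects plus interaction).
tab <- anova(aov(y ~ A * B))
print(tab)

alpha   <- 0.05
df_err  <- nI * nJ * (nK - 1)
crit_AB <- qf(1 - alpha, (nI - 1) * (nJ - 1), df_err)

# Rejection region for the interaction: f_AB >= critical value.
reject_AB <- tab["A:B", "F value"] >= crit_AB
print(reject_AB)
```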
Before and after cleaning:
Output:
Histogram:
Input:
Output:
Boxplot:
Input:
Output:
Graph representing the correlation coefficients between the variables
"traveltime", "studytime" and "failures", under the assumption that they are
normally distributed:
Input:
Output:
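As a rough sketch of how such correlation coefficients can be obtained, the code below computes a Pearson correlation matrix with cor(). The data frame is randomly generated stand-in data with the same column names, not our actual data set, and the plotting function used for the original graph is not shown here.

```r
# Stand-in data with the same column names (randomly generated, illustration only).
set.seed(1)
new.data <- data.frame(
  traveltime = sample(1:4, 50, replace = TRUE),
  studytime  = sample(1:4, 50, replace = TRUE),
  failures   = sample(0:3, 50, replace = TRUE)
)

vars <- c("traveltime", "studytime", "failures")

# Pearson correlation matrix of the three variables.
cm <- cor(new.data[, vars], method = "pearson")
print(round(cm, 3))
```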
3.4. Normal distribution and linear regression checking (1) (to know whether
it is normally distributed with 2 variables, G2 and G3)
3.4.1. Normal distribution testing
Command:
> reg <- lm(G2 ~ G3, data = new.data)   # regress G2 on G3
> summary(reg)                          # coefficients, residuals, R-squared
The mean of the residuals should be 0; here the median is -0.1081, which is
not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also roughly symmetric
around the median, indicating that the residuals of this equation are relatively
well balanced.
The t-test value for the coefficient of G3 is statistically significant,
indicating that β is not zero. In other words, we have evidence to suggest that
there is a relationship between G2 and G3 and that this relationship is
statistically significant.
Here S² = 1.605. The output also includes an F-test, but in simple linear
regression (with one factor) we do not need to consider it separately, because it
is equivalent to the t-test for the slope. We do need to pay attention to the
coefficient of determination R², which is the sum of squared differences between
the fitted values and the mean, divided by the sum of squared differences between
the observations and the mean. The R² value in this example is 0.8161, which
means that the linear equation (with G3 as a factor) explains about 82% of the
variation in G2 between individuals. The R² value ranges from 0 to 1 (0% to
100%). A higher R² value indicates a stronger linear relationship between the
two variables G2 and G3.
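The definition of R² can be reproduced directly from a fitted model. The sketch below uses simulated stand-in values for G2 and G3 (not our data), computes R² as the ratio of the two sums of squares, and also shows that in simple linear regression the F statistic equals the square of the slope's t statistic.

```r
# Simulated stand-in data (illustration only, not the real G2 and G3).
set.seed(42)
G3 <- round(runif(40, min = 0, max = 20))
G2 <- 1 + 0.9 * G3 + rnorm(40, sd = 1.5)

reg <- lm(G2 ~ G3)
s   <- summary(reg)

# R^2 = sum((fitted - mean)^2) / sum((observed - mean)^2)
ss_reg    <- sum((fitted(reg) - mean(G2))^2)
ss_tot    <- sum((G2 - mean(G2))^2)
r2_manual <- ss_reg / ss_tot

print(r2_manual)
print(s$r.squared)   # same value as r2_manual

# With a single predictor, F equals the squared t statistic of the slope.
t_slope <- s$coefficients["G3", "t value"]
f_value <- s$fstatistic[["value"]]
print(c(t_squared = t_slope^2, F = f_value))
```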
Input:
> fitted(reg)   # fitted values of the model
> resid(reg)    # residuals of the model
Output:
Command:
> reg <- lm(freetime ~ G2, data = new.data)   # regress freetime on G2
> summary(reg)
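The relationship between freetime and G2 is not statistically significant.
Command:
> fitted(reg)                  # fitted values
> resid(reg)                   # residuals
> op <- par(mfrow = c(2, 2))   # 2 x 2 grid for the four diagnostic plots
> plot(reg)
> par(op)                      # restore the previous graphics settings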
Using a non-parametric correlation test (cor.test) when the normality check in
(1) fails:
Input:
Output:
Input:
Output:
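One common non-parametric choice is Spearman's rank correlation, available through cor.test(..., method = "spearman"). This is a sketch under that assumption; the method actually used for our data is not recoverable from the text, and the vectors x and y are hypothetical.

```r
# Hypothetical, clearly non-normal data without ties (illustration only).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30, 50)
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 9, 10)

# Spearman's rank correlation test: uses ranks, so no normality assumption is needed.
res <- cor.test(x, y, method = "spearman")

print(res$estimate)   # Spearman's rho
print(res$p.value)

# The same rho from the rank-difference formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
d          <- rank(x) - rank(y)
n          <- length(x)
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
print(rho_manual)
```

Kendall's tau (method = "kendall") is an alternative rank-based option in the same function.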
3.5. Applying t-test
3.5.1. One-sample t-test
Input:
Output:
Input:
Output:
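A sketch of how the one-sample and two-sample tests can be called on a data frame. The data are simulated, the choice of G3 as the tested variable and mu = 10 as the hypothesised mean are assumptions for illustration, and only the column name sex with levels F and M follows our data.

```r
# Simulated stand-in data (illustration only).
set.seed(123)
new.data <- data.frame(
  sex = factor(rep(c("F", "M"), each = 20)),
  G3  = round(rnorm(40, mean = 11, sd = 3))
)

# One-sample t-test of G3 against a hypothesised mean (mu = 10 is an assumption).
one <- t.test(new.data$G3, mu = 10, conf.level = 0.95)

# Two-sample t-test comparing G3 between the groups F and M.
two <- t.test(G3 ~ sex, data = new.data, conf.level = 0.95)

print(one$estimate)   # sample mean
print(one$conf.int)   # 95% confidence interval for the true mean
print(two$estimate)   # group means
print(two$conf.int)   # 95% confidence interval for the difference in means
```

If the confidence interval for the difference contains 0, the test does not reject equality of the two group means at that level.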
3.6. Applying ANOVA
3.6.1. One-way ANOVA
Input:
Output:
Input:
Output:
3.7. Conclusion
From the one-sample t-test, the estimate of the true mean is 7.1875 at the
95% confidence level.
From the two-sample t-test, we cannot reject the hypothesis that the true
difference in means between groups F and M is equal to 0 at the 95% confidence
level.
From the one-way ANOVA and two-way ANOVA, the factors “absences” and “reason”
have no statistically significant effect at the 95% confidence level.
4. REFERENCES
[1] Douglas C. Montgomery, George C. Runger (2010), Applied Statistics and
Probability for Engineers, Fifth Edition, Arizona State University.
[2] J. Devore (2011), Probability and Statistics for Engineering and the Sciences,
Eighth Edition, California Polytechnic State University, San Luis Obispo.
[3] Ruairi J Mackenzie (2018), One-way vs Two-way ANOVA: Differences,
Assumptions and Hypotheses
https://www.technologynetworks.com/informatics/articles/one-way-vs-two-way-anova-definition-differences-assumptions-and-hypotheses-306553
[4] Tran Quang Quy (2021), Introduction about R and Rstudio, Thai Nguyen
University of Information and Communication
https://www.rpubs.com/tranquangquy_ictu/772213
[5] Nguyen Van Tuan, Phân tích bảng số liệu và biểu đồ bằng Rstudio (Analysis of
data tables and charts with RStudio), Garvan Institute of Medical Research,
Sydney, Australia
https://cran.r-project.org/doc/contrib/Intro_to_R_Vietnamese.pdf?fbclid=IwAR3OzNejhbRbp-CLaToED7YlJ4oyo8dqecLlAMFEhCfxcY5GljGDgssEhjU
5. ACKNOWLEDGEMENT
We would especially like to thank Dr. Nguyen Tien Dung, our lecturer from
the Faculty of Applied Science, for providing us with important knowledge as
well as the basic skills needed to complete this assignment. Our specialized
knowledge was still limited while carrying out the project, so errors in
understanding, presenting, and assessing the data may be difficult to avoid.
We therefore welcome any attention, assessment, and suggestions from our
lecturer to make our work more accurate.