
VIETNAM NATIONAL UNIVERSITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY


FACULTY OF APPLIED SCIENCE

SUBJECT: PROBABILITY AND STATISTICS (MT2013)


GROUP ASSIGNMENT: USING R-STUDIO TO SOLVE
A PRACTICAL DATA SET
INSTRUCTOR: Dr. Nguyen Tien Dung
No. Family Name First Name Student ID Contribution

1 Trần Vũ Quỳnh Anh 2052385 20%

2 Tạ Anh Đức 2052453 20%

3 Trần Tiền Hào 2052972 20%

4 Võ Khắc Huy 2053056 20%

5 Trương Quốc Khánh 2053121 20%


Ho Chi Minh City, May, 2022

CONTENTS
1. INTRODUCTION
1.1. About this subject
1.2. About our tool (R-studio)
1.3. Our problems
2. THEORIES
2.1. t-Test
2.1.1. Definition
2.1.2. Formula
2.2. Analysis of Variance (ANOVA)
2.2.1. One-way ANOVA
2.2.2. Two-way ANOVA
3. DATA AND STATISTICAL METHODS
3.1. Importing data
3.2. Data cleaning
3.3. Data visualization
3.4. Normal distribution and linear regression checking (1) (to check whether
the two variables G2 and G3 are normally distributed)
3.4.1. Normal distribution testing
3.4.2. Fitting a linear regression model
3.5. Applying t-test
3.5.1. One-sample t-test
3.5.2. Two-sample t-test
3.6. Applying ANOVA
3.6.1. One-way ANOVA
3.6.2. Two-way ANOVA
3.7. Conclusion
4. REFERENCES
5. ACKNOWLEDGEMENT
1. INTRODUCTION

1.1. About this subject


Probability is the branch of mathematics that gives a numerical description of
how likely an event is to occur, or how likely a proposition is to be true.
The probability of an event is a number between 0 and 1, where 0 denotes
impossibility and 1 denotes certainty. It is applied in fields such as
statistics, economics, gambling, science (particularly physics), artificial
intelligence, machine learning, computer science, and philosophy.
Statistics is the discipline concerned with organizing, analyzing,
interpreting, and presenting data. It plays a critical part in the research
process by giving analysts the quantitative evidence they need to reach sound
conclusions about practical and social problems.
To sum up, Probability and Statistics is increasingly important in modern
life, especially for students majoring in natural science, technology, and
economics.

1.2. About our tool (R-studio)


R is a programming language and environment that is widely used for
statistical computing, data analysis, and scientific research. It is commonly
used for collecting, cleaning, analyzing, graphing, and visualizing data.
R is a successor to the S language. Like S, it lets users, including students
at engineering and technology universities, compute with and modify data.
Because it is a full programming language, R can also be used to develop
specialized software for a particular computational problem.
1.3. Our problems
The data set for this project, provided by Dr. Nguyen Tien Dung, contains
information on factors affecting three variables (G1, G2, G3).
Attribute information:
"school"
"sex"         binary (F = female, M = male)
"age"         in years
"address"     binary (U, R)
"famsize"     family size (GT3 = greater than 3, LE3 = less than or equal to 3)
"Pstatus"
"Medu"        mother's education
"Fedu"        father's education
"Mjob"        mother's job
"Fjob"        father's job
"reason"
"guardian"    binary (father, mother)
"traveltime"  hour
"studytime"   hour
"failures"
"schoolsup"   school support, binary (yes, no)
"famsup"      family support, binary (yes, no)
"paid"        binary (yes, no)
"activities"  binary (yes, no)
"nursery"     binary (yes, no)
"higher"      binary (yes, no)
"internet"    binary (yes, no)
"romantic"    binary (yes, no)
"famrel"
"freetime"
"goout"       going out
"Dalc"
"Walc"
"health"
"absences"
"G1"
"G2"
"G3"

2. THEORIES
2.1. t- Test
2.1.1. Definition
A t-test is a statistical test used to compare means. It is frequently used
in hypothesis testing to determine whether a process or treatment has an
effect on the population of interest, or whether two groups differ from one
another.
The t statistic is computed from sample data. It depends on the mean, the
variance (equivalently, the standard deviation), and the size of the samples
being compared.
Three types of t-test can be performed on the n observations collected:
• One-sample t-test.
• Independent-samples t-test.
• Paired-samples t-test.
The critical value (CV) is read from the t-table using the degrees of freedom
(df = n − 1 for the one-sample and paired tests) and the chosen significance
level α (usually 0.05 or 0.1). If the absolute value of the computed t
statistic exceeds the CV, we reject the null hypothesis and conclude that the
difference is statistically significant.
2.1.2. Formula
2.1.2.1. One-sample t-test
To compare the mean of a population, estimated from n observations, with a
specified theoretical mean μ, we use a one-sample t-test.

$$ t = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

Where,
$\bar{x}$ is the mean of the sample.

$\mu$ is the assumed mean.

$\sigma$ is the standard deviation of the sample.

$n$ is the number of observations.

$\sigma / \sqrt{n}$ is the standard error.
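
As a quick check of the formula above, the following R sketch computes the one-sample t statistic by hand and compares it with the value returned by `t.test`. The simulated sample `x` and the hypothesized mean `mu0` are illustrative assumptions, not data from this assignment.

```r
# Simulated sample; the values and mu0 are illustrative assumptions.
set.seed(1)
x   <- rnorm(25, mean = 10, sd = 2)
mu0 <- 10

n      <- length(x)
se     <- sd(x) / sqrt(n)            # standard error of the mean
t_hand <- (mean(x) - mu0) / se       # t statistic from the formula
t_r    <- unname(t.test(x, mu = mu0)$statistic)

# Two-sided critical value at alpha = 0.05 with df = n - 1
crit   <- qt(1 - 0.05 / 2, df = n - 1)
reject <- abs(t_hand) > crit         # TRUE means "reject the null hypothesis"

print(c(by_hand = t_hand, t.test = t_r, critical = crit))
```

The two statistics should agree up to floating-point error; `reject` applies the critical-value rule described in 2.1.1.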

2.1.2.2. Independent-samples t-test
The independent-samples t-test is used to compare the means of two groups of
samples. It helps determine whether the means of the two sets of data are
statistically significantly different from each other.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

Where,
$t$ is the independent-samples t statistic.

$\bar{x}_1$ is the mean of group 1.

$\bar{x}_2$ is the mean of group 2.

$s_1$ is the standard deviation of group 1.

$s_2$ is the standard deviation of group 2.

$n_1$ is the number of observations in group 1.

$n_2$ is the number of observations in group 2.
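
The formula above can be checked numerically with a short R sketch. The two simulated groups `g1` and `g2` are illustrative assumptions. Note that `t.test` uses `var.equal = FALSE` by default (the Welch test), which is the form whose statistic matches this formula.

```r
# Two simulated groups; sizes, means and SDs are illustrative assumptions.
set.seed(2)
g1 <- rnorm(30, mean = 11, sd = 3)
g2 <- rnorm(35, mean = 10, sd = 4)

# t statistic from the formula: difference in means over its standard error
t_hand <- (mean(g1) - mean(g2)) /
  sqrt(var(g1) / length(g1) + var(g2) / length(g2))

# Built-in test; var.equal = FALSE (the default) gives the same statistic
t_r <- unname(t.test(g1, g2)$statistic)

print(c(by_hand = t_hand, t.test = t_r))
```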

2.1.2.3. Paired-samples t-test

When the two sets of measurements are dependent, for example pre-test and
post-test results from the same people, we use the paired-samples t-test.

$$ t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} (x_{1i} - x_{2i}) $$

Where,
$t$ is the paired-samples t statistic.

$\bar{d}$ is the mean of the pair differences $x_1 - x_2$.

$s_d$ is the standard deviation of the differences.

$n$ is the sample size (the number of pairs).
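
A minimal R sketch of the paired case follows. The `pre` and `post` vectors are simulated for illustration only. The statistic is computed from the per-pair differences and compared with `t.test(..., paired = TRUE)`.

```r
# Simulated pre/post scores for the same 20 people (illustrative assumption).
set.seed(3)
pre  <- rnorm(20, mean = 10, sd = 2)
post <- pre + rnorm(20, mean = 0.5, sd = 1)

d      <- pre - post                          # per-pair differences
t_hand <- mean(d) / (sd(d) / sqrt(length(d))) # mean difference over its SE
t_r    <- unname(t.test(pre, post, paired = TRUE)$statistic)

print(c(by_hand = t_hand, t.test = t_r))
```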


2.2. Analysis of Variance (ANOVA)
ANOVA is a statistical analysis tool that splits the total variability
observed in a data set into two parts: systematic factors and random factors.
Systematic factors have a statistical influence on the data set, whereas
random factors do not. In a regression study, analysts use the ANOVA test to
examine the effect of independent variables on the dependent variable.
There are two types of ANOVA, one-way and two-way, depending on the number of
independent variables.

| | One-Way ANOVA | Two-Way ANOVA |
|---|---|---|
| Definition | A test that allows one to make comparisons between the means of three or more groups of data. | A test that allows one to make comparisons between the means of three or more groups of data, where two independent variables are considered. |
| Number of Independent Variables | One | Two |
| What is Being Compared? | The means of three or more groups of an independent variable on a dependent variable. | The effect of multiple groups of two independent variables on a dependent variable and on each other. |
| Number of Groups of Samples | Three or more. | Each variable should have multiple samples. |
2.2.1. One- way ANOVA
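The formulas for the sums of squares are given below. Here $X_{i\cdot}$ and
$X_{\cdot\cdot}$ denote totals, $\bar{X}_{i\cdot}$ and $\bar{X}_{\cdot\cdot}$
the corresponding means, $J_i$ the number of observations in group $i$, and
$n = \sum_i J_i$.
Total sum of squares (SST)

$$ SST = \sum_{i=1}^{I}\sum_{j=1}^{J_i} (X_{ij} - \bar{X}_{\cdot\cdot})^2 = \sum_{i=1}^{I}\sum_{j=1}^{J_i} X_{ij}^2 - \frac{1}{n} X_{\cdot\cdot}^2 $$

Treatment sum of squares (SSTr)

$$ SSTr = \sum_{i=1}^{I}\sum_{j=1}^{J_i} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 = \sum_{i=1}^{I} \frac{1}{J_i} X_{i\cdot}^2 - \frac{1}{n} X_{\cdot\cdot}^2 $$

Error sum of squares (SSE)

$$ SSE = \sum_{i=1}^{I}\sum_{j=1}^{J_i} (X_{ij} - \bar{X}_{i\cdot})^2 = SST - SSTr $$
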
| Source of Variation | df | Sum of Squares | Mean Square | f |
|---|---|---|---|---|
| Treatments | I − 1 | SSTr | MSTr = SSTr / (I − 1) | MSTr / MSE |
| Error | n − I | SSE | MSE = SSE / (n − I) | |
| Total | n − 1 | SST | | |

One-way ANOVA Table

Rejection region: $f \ge F_{\alpha,\, I-1,\, n-I}$
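
The one-way ANOVA table can be reproduced step by step in R. The sketch below uses the built-in `PlantGrowth` data set as a stand-in (an assumption made for illustration; it is not the assignment data), computes SST, SSTr and SSE from the formulas above, and compares the resulting f with `anova()`.

```r
# Built-in data used as a stand-in: plant weight in 3 groups of 10.
y <- PlantGrowth$weight
g <- PlantGrowth$group

n <- length(y)        # total number of observations
I <- nlevels(g)       # number of groups (treatments)

grand       <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

SST  <- sum((y - grand)^2)
SSTr <- sum(group_sizes * (group_means - grand)^2)
SSE  <- SST - SSTr

MSTr <- SSTr / (I - 1)
MSE  <- SSE / (n - I)
f    <- MSTr / MSE

# Rejection region at alpha = 0.05: f >= F_{alpha, I-1, n-I}
crit   <- qf(0.95, df1 = I - 1, df2 = n - I)
reject <- f >= crit

# Same table from R's linear-model machinery
tab <- anova(lm(weight ~ group, data = PlantGrowth))
print(tab)
print(c(f_by_hand = f, critical = crit))
```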
2.2.2. Two-way ANOVA
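2.2.2.1. $K_{ij} = 1$ (one observation per cell)
The formulas for the sums of squares are:
Total sum of squares (SST)

$$ SST = \sum_{i=1}^{I}\sum_{j=1}^{J} (X_{ij} - \bar{X}_{\cdot\cdot})^2 $$

Sum of squares factor A (SSA)

$$ SSA = \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 = J \sum_{i=1}^{I} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 $$

Sum of squares factor B (SSB)

$$ SSB = \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2 = I \sum_{j=1}^{J} (\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2 $$

Error sum of squares (SSE)

$$ SSE = \sum_{i=1}^{I}\sum_{j=1}^{J} (X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot})^2 $$

The fundamental identity is

$$ SST = SSA + SSB + SSE $$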
| Source of Variation | df | Sum of Squares | Mean Square | f |
|---|---|---|---|---|
| Factor A | I − 1 | SSA | MSA = SSA / (I − 1) | f_A = MSA / MSE |
| Factor B | J − 1 | SSB | MSB = SSB / (J − 1) | f_B = MSB / MSE |
| Error | (I − 1)(J − 1) | SSE | MSE = SSE / ((I − 1)(J − 1)) | |
| Total | IJ − 1 | SST | | |

Rejection Region
$f_A \ge F_{\alpha,\, I-1,\, (I-1)(J-1)}$
$f_B \ge F_{\alpha,\, J-1,\, (I-1)(J-1)}$
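
To make the one-observation-per-cell case concrete, the R sketch below computes SSA, SSB and SSE for a small 3 × 4 table and checks f_A and f_B against `anova()`. The numbers in `X` are invented for illustration, with rows as the levels of factor A and columns as the levels of factor B.

```r
# Invented 3 x 4 table: rows = levels of A, columns = levels of B.
X <- matrix(c(12, 15, 11, 14,
              10, 13,  9, 12,
              14, 18, 13, 16), nrow = 3, byrow = TRUE)
I <- nrow(X)
J <- ncol(X)
grand <- mean(X)

SSA <- J * sum((rowMeans(X) - grand)^2)
SSB <- I * sum((colMeans(X) - grand)^2)
SST <- sum((X - grand)^2)
SSE <- SST - SSA - SSB            # fundamental identity

MSE <- SSE / ((I - 1) * (J - 1))
fA  <- (SSA / (I - 1)) / MSE
fB  <- (SSB / (J - 1)) / MSE

# Same analysis in long format; as.vector() reads X column by column.
d <- data.frame(y = as.vector(X),
                A = factor(rep(seq_len(I), times = J)),
                B = factor(rep(seq_len(J), each = I)))
tab <- anova(lm(y ~ A + B, data = d))
print(tab)
print(c(fA = fA, fB = fB))
```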

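2.2.2.2. $K_{ij} > 1$ (several observations per cell)
The formulas for the sums of squares are:
Total sum of squares (SST)

$$ SST = \sum_{i}\sum_{j}\sum_{k} (X_{ijk} - \bar{X}_{\cdot\cdot\cdot})^2 $$

Error sum of squares (SSE)

$$ SSE = \sum_{i}\sum_{j}\sum_{k} (X_{ijk} - \bar{X}_{ij\cdot})^2 $$

Sum of squares factor A (SSA)

$$ SSA = \sum_{i}\sum_{j}\sum_{k} (\bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot\cdot\cdot})^2 $$

Sum of squares factor B (SSB)

$$ SSB = \sum_{i}\sum_{j}\sum_{k} (\bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot})^2 $$

Interaction sum of squares (SSAB)

$$ SSAB = \sum_{i}\sum_{j}\sum_{k} (\bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot})^2 $$

The fundamental identity is

$$ SST = SSA + SSB + SSAB + SSE $$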

| Source of Variation | df | Sum of Squares | Mean Square | f |
|---|---|---|---|---|
| Factor A | I − 1 | SSA | MSA = SSA / (I − 1) | f_A = MSA / MSE |
| Factor B | J − 1 | SSB | MSB = SSB / (J − 1) | f_B = MSB / MSE |
| Interaction | (I − 1)(J − 1) | SSAB | MSAB = SSAB / ((I − 1)(J − 1)) | f_AB = MSAB / MSE |
| Error | IJ(K − 1) | SSE | MSE = SSE / (IJ(K − 1)) | |
| Total | IJK − 1 | SST | | |

Rejection Region
$f_A \ge F_{\alpha,\, I-1,\, IJ(K-1)}$
$f_B \ge F_{\alpha,\, J-1,\, IJ(K-1)}$
$f_{AB} \ge F_{\alpha,\, (I-1)(J-1),\, IJ(K-1)}$
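
For the replicated case, the following R sketch applies the same formulas to the built-in `ToothGrowth` data set (used only as a balanced stand-in with I = 2, J = 3, K = 10; it is not the assignment data) and compares the three f values with `anova()`.

```r
# Built-in balanced data: 2 supplements x 3 doses x 10 replicates.
d <- transform(ToothGrowth, dose = factor(dose))
I <- nlevels(d$supp)
J <- nlevels(d$dose)
K <- nrow(d) / (I * J)

grand <- mean(d$len)
mA    <- tapply(d$len, d$supp, mean)                  # means per level of A
mB    <- tapply(d$len, d$dose, mean)                  # means per level of B
mAB   <- tapply(d$len, list(d$supp, d$dose), mean)    # I x J cell means

# The triple sums collapse to multiples because the design is balanced.
SSA  <- J * K * sum((mA - grand)^2)
SSB  <- I * K * sum((mB - grand)^2)
SSAB <- K * sum((mAB - outer(mA, mB, "+") + grand)^2)
SST  <- sum((d$len - grand)^2)
SSE  <- SST - SSA - SSB - SSAB                        # fundamental identity

MSE <- SSE / (I * J * (K - 1))
fA  <- (SSA / (I - 1)) / MSE
fB  <- (SSB / (J - 1)) / MSE
fAB <- (SSAB / ((I - 1) * (J - 1))) / MSE

tab <- anova(lm(len ~ supp * dose, data = d))
print(tab)
print(c(fA = fA, fB = fB, fAB = fAB))
```

Because the design is balanced, the order of the terms in the model formula does not change the sums of squares.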

3. DATA AND STATISTICAL METHODS


3.1. Importing data

Then select “grade.csv”.
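
The import command itself is not reproduced in this text. As a hedged illustration, a CSV file such as "grade.csv" is typically read with `read.csv`. To keep the sketch runnable without the real file, it first writes a tiny placeholder CSV to a temporary folder. The three rows are made up and are not the assignment data.

```r
# Stand-in for the real file: the rows below are made-up placeholders.
path <- file.path(tempdir(), "grade.csv")
writeLines(c("sex,age,G1,G2,G3",
             "F,18,5,6,6",
             "M,17,7,8,10",
             "F,15,NA,9,11"), path)

# Read the file; text columns become factors, "NA" becomes a missing value.
grade <- read.csv(path, stringsAsFactors = TRUE)
str(grade)

# Interactively, the file can be picked from a dialog instead:
# grade <- read.csv(file.choose())
```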


3.2. Data cleaning

Before and after cleaning:
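
The cleaning step is not shown in this text, so the sketch below is only one common approach. It assumes that cleaning means dropping rows with missing values. The small data frame is invented, and the name `new.data` is borrowed from the commands used in section 3.4.

```r
# Invented example with two incomplete rows (not the assignment data).
grade <- data.frame(sex = c("F", "M", "F", "M"),
                    age = c(18, 17, NA, 16),
                    G2  = c(6, 8, 9, NA),
                    G3  = c(6, 10, 11, 12))

# Before cleaning: count missing values per column
colSums(is.na(grade))

# Assumed cleaning rule: keep only complete rows
new.data <- na.omit(grade)

# After cleaning: no missing values remain
colSums(is.na(new.data))
```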

3.3. Data visualization


• Defining functions for the mean, SD, and SE:

• Naming the variables:

• Summary, desc, tally, min, max:

- Input:

- Output:
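
The helper functions are not reproduced in this text. A minimal sketch of such a helper, under the assumption that SE means the standard error of the mean, could look as follows. The function name `describe_var` and the sample values are hypothetical.

```r
# Hypothetical helper: n, mean, SD, SE, min and max of a numeric vector.
describe_var <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing values
  c(n    = length(x),
    mean = mean(x),
    sd   = sd(x),
    se   = sd(x) / sqrt(length(x)),       # standard error of the mean
    min  = min(x),
    max  = max(x))
}

G3 <- c(6, 10, 11, 12, 8, 15, 9)          # made-up values for illustration
stats <- describe_var(G3)
print(stats)

summary(G3)                               # built-in five-number summary + mean
```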

• Histogram:
- Input:

- Output:

• Boxplot:
- Input:

- Output:
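
The plotting commands are not reproduced in this text. As a hedged illustration with base R graphics, a histogram and a grouped boxplot can be drawn as below. The simulated `G3` and `sex` vectors are assumptions standing in for the cleaned data.

```r
# Simulated stand-ins for a grade variable and a grouping factor.
set.seed(4)
G3  <- round(rnorm(100, mean = 10, sd = 3))
sex <- factor(sample(c("F", "M"), 100, replace = TRUE))

# Histogram; hist() also returns the bin counts invisibly
h <- hist(G3, main = "Histogram of G3", xlab = "G3")

# Boxplot of G3 split by sex; boxplot() returns the five-number summaries
b <- boxplot(G3 ~ sex, main = "G3 by sex", xlab = "sex", ylab = "G3")
```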

• Age statistics by sex:

- Input:

- Output:
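
One way to obtain statistics of one variable within the levels of another is `tapply`. The sketch below is an illustration on a tiny invented data frame, not the original input.

```r
# Invented ages for six students (not the assignment data).
d <- data.frame(sex = factor(c("F", "M", "F", "M", "F", "M")),
                age = c(15, 16, 17, 18, 16, 17))

mean_age <- tapply(d$age, d$sex, mean)   # mean age per sex
sd_age   <- tapply(d$age, d$sex, sd)     # standard deviation per sex
n_age    <- table(d$sex)                 # group sizes

print(rbind(mean = mean_age, sd = sd_age, n = n_age))
```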

• Graph representing the correlation coefficients between the variables
"traveltime", "studytime", "failures", under the assumption that they are
normally distributed:

- Input:
- Output:
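
The original graph is not reproduced in this text. A minimal sketch of the underlying computation is the Pearson correlation matrix from `cor`, shown here on simulated values. The three columns only reuse the variable names and the data are assumptions.

```r
# Simulated stand-ins for the three variables.
set.seed(5)
d <- data.frame(traveltime = sample(1:4, 50, replace = TRUE),
                studytime  = sample(1:4, 50, replace = TRUE),
                failures   = sample(0:3, 50, replace = TRUE))

# Pearson correlation matrix (the default method of cor)
r <- cor(d)
print(round(r, 3))
```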

3.4. Normal distribution and linear regression checking (1) (to check whether
the two variables G2 and G3 are normally distributed)
3.4.1. Normal distribution testing
Command:
> reg <- lm(G2 ~ G3, data = new.data)
> summary(reg)

We know that the mean of the residuals must be 0; here the median is -0.1081,
which is not far from 0. The first quartile (1Q) and third quartile (3Q) are
also roughly symmetric about the median, indicating that the residuals of this
model are reasonably well balanced.
The t-test for the slope is statistically significant, indicating that β is
not zero. In other words, we have evidence to suggest that there is a
relationship between G2 and G3 and that this relationship is statistically
significant.

The output also gives S² = 1.605 and an F-test, but in simple linear
regression (with one factor) the F-test is equivalent to the t-test on the
slope, so we do not need to consider it separately. Instead, we pay attention
to the coefficient of determination R², which is the sum of squared deviations
of the fitted values from the mean divided by the sum of squared deviations of
the observations from the mean. The R² value in this example is 0.8161, which
means that the linear equation (with G3 as a factor) explains about 82% of the
variation in G2 between individuals. R² ranges from 0 to 1 (0 to 100%); a
higher R² indicates a stronger linear relationship between the two variables
G2 and G3.
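
The definition of R² and the link between the F-test and the slope t-test can be verified numerically. The sketch below uses the built-in `cars` data set as a stand-in (an assumption for illustration, since the assignment data are not reproduced here).

```r
# Simple linear regression on built-in data (stand-in for G2 ~ G3).
reg <- lm(dist ~ speed, data = cars)
y   <- cars$dist

# R^2 = regression sum of squares / total sum of squares
SSR     <- sum((fitted(reg) - mean(y))^2)
SST     <- sum((y - mean(y))^2)
r2_hand <- SSR / SST
r2_r    <- summary(reg)$r.squared

# With one predictor, the F statistic equals the squared slope t statistic
t_slope <- coef(summary(reg))["speed", "t value"]
f_stat  <- unname(summary(reg)$fstatistic["value"])

print(c(r2_by_hand = r2_hand, r2_summary = r2_r))
print(c(t_squared = t_slope^2, F = f_stat))
```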

- Input:
> fitted(reg)

> resid(reg)

- Output:

• Try again with 2 other variables:

Command:

> reg <- lm(freetime ~ G2, data = new.data)

> summary(reg)

• No statistical significance

Command:

> fitted(reg)

> resid(reg)

> op <- par(mfrow = c(2, 2))

> plot(reg)

• Using cor.test for non-normally distributed data when check (1) fails:

- Input:

- Output:
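
The exact commands are not reproduced in this text. One possible workflow, given here as an assumption rather than the method actually used, is to test normality with `shapiro.test` and fall back to a rank-based (Spearman) correlation in `cor.test` when normality is rejected. The data are simulated.

```r
# Simulated, deliberately skewed data (illustrative assumption).
set.seed(6)
x <- rexp(60)
y <- x + rnorm(60, sd = 0.5)

# Shapiro-Wilk normality test on x
sw <- shapiro.test(x)

# Assumed rule: use Spearman's rank correlation if normality is rejected
method <- if (sw$p.value < 0.05) "spearman" else "pearson"
ct     <- cor.test(x, y, method = method)

print(sw)
print(ct)
```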

3.4.2. Fitting a linear regression model

- Input:

- Output:

3.5. Applying t-test
3.5.1. One-sample t-test

- Input:

- Output:

3.5.2. Two-sample t-test

- Input:

- Output:

3.6. Applying ANOVA
3.6.1. One-way ANOVA

- Input:

- Output:

3.6.2. Two-way ANOVA

- Input:

- Output:

3.7. Conclusion

From the “one-sample t-test”, at the 95% confidence level we cannot reject the
hypothesis that the true mean is equal to 7.1875.
From the “two-sample t-test”, at the 95% confidence level we cannot reject the
hypothesis that the true difference in means between group F and group M is
equal to 0.
From the “one-way ANOVA” and “two-way ANOVA”, we can say that “absences” and
“reason” have no statistically significant effect at the 95% confidence level.
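
The decision rule behind these conclusions can be sketched in R: the null hypothesis is rejected only when the p-value falls below α = 0.05, which for a two-sided test is the same as the hypothesized value lying outside the 95% confidence interval. The hypothesized mean 7.1875 is taken from the conclusion above, while the simulated sample `g` is an assumption.

```r
# Simulated sample (assumption); mu0 is the value quoted in the conclusion.
set.seed(7)
g   <- rnorm(40, mean = 7.2, sd = 2)
mu0 <- 7.1875

res   <- t.test(g, mu = mu0, conf.level = 0.95)
alpha <- 0.05

decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"
ci       <- res$conf.int                      # 95% confidence interval
outside  <- mu0 < ci[1] || mu0 > ci[2]        # is mu0 outside the interval?

print(decision)
print(ci)
```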

4. REFERENCES
[1] Douglas C. Montgomery, George C. Runger (2010), Applied Statistics and
Probability for Engineers, Fifth Edition, Arizona State University.
[2] Devore, J. (2011), Probability and Statistics for Engineering and the
Sciences, Eighth Edition, California Polytechnic State University, San Luis
Obispo.
[3] Ruairi J Mackenzie (2018), One-way vs Two-way ANOVA: Differences,
Assumptions and Hypotheses
https://www.technologynetworks.com/informatics/articles/one-way-vs-two-way-anova-definition-differences-assumptions-and-hypotheses-306553
[4] Tran Quang Quy (2021), Introduction about R and Rstudio, Thai Nguyen
University of Information and Communication
https://www.rpubs.com/tranquangquy_ictu/772213
[5] Nguyen Van Tuan, Phân tích bảng số liệu và biểu đồ bằng Rstudio
[Analysis of data tables and charts using RStudio], Garvan Institute of
Medical Research, Sydney, Australia
https://cran.r-project.org/doc/contrib/Intro_to_R_Vietnamese.pdf?fbclid=IwAR3OzNejhbRbp-CLaToED7YlJ4oyo8dqecLlAMFEhCfxcY5GljGDgssEhjU

5. ACKNOWLEDGEMENT
We would like to give special thanks to Dr. Nguyen Tien Dung, our lecturer
from the Faculty of Applied Science, for providing us with the knowledge and
basic skills needed to accomplish this assignment. Our specialized knowledge
was still limited while carrying out the project, so errors in understanding,
presenting, and assessing the data were difficult to avoid. We therefore
welcome any comments, assessment, and suggestions from our lecturer to make
our work more accurate.
