
Sahn Chien A.

Montecastro BSCpE 2-7 1/16/2021


DATA ANALYSIS FINAL REQUIREMENTS

7. Statistical Intervals
7.1. Confidence Intervals: Single Sample
When conducting research, estimates based on samples are, of course, subject to
sampling error, and it is important to evaluate the precision of the estimates. A confidence interval
is a range within which you can expect the population parameter to fall. Confidence intervals are
constructed using statistical methods, such as the t-test. A single-sample confidence interval is built
around a single-number estimate of a population parameter.
Example:
 The sample mean for the fill weights of 100 boxes is 12.050. The population variance of
the fill weights is known to be (0.100)². Find a 95% confidence interval for the
population mean µ fill weight of the boxes.
ANS.:
The 95% confidence interval for µ is x̄ ± 1.96(σ/√n) = 12.050 ± 1.96(0.100/√100) = [12.030, 12.070].
We are 95% confident that the true parameter value lies in this interval.
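A minimal sketch of this known-variance (z-based) interval in Python, using the numbers from the example (variable names are illustrative):

```python
from scipy import stats

# Known-variance (z-based) confidence interval for the mean:
# x_bar +/- z_{alpha/2} * sigma / sqrt(n)
x_bar, sigma, n, conf = 12.050, 0.100, 100, 0.95

z = stats.norm.ppf(1 - (1 - conf) / 2)   # ~1.96 for 95%
margin = z * sigma / n ** 0.5            # 1.96 * 0.100 / 10 ~ 0.0196

print(f"{conf:.0%} CI: [{x_bar - margin:.3f}, {x_bar + margin:.3f}]")
# -> 95% CI: [12.030, 12.070]
```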
7.2. Confidence Intervals: Multiple Samples
A confidence interval for multiple samples compares two or more groups with
respect to their mean scores. This means that there are different people in the groups being
compared.
Example:
 Is a drug’s effectiveness the same in children and adults?
 Compare body mass index (BMI) in alcoholic and non-alcoholic people.
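As a minimal sketch of the BMI comparison, the following Python snippet builds a 95% confidence interval for the difference in group means; the data values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical BMI values for two independent groups (invented numbers).
alcoholic     = np.array([27.1, 29.4, 31.0, 26.5, 30.2, 28.8])
non_alcoholic = np.array([24.3, 25.1, 23.8, 26.0, 24.9, 25.5])

va = alcoholic.var(ddof=1) / len(alcoholic)
vb = non_alcoholic.var(ddof=1) / len(non_alcoholic)
se = np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom for unequal variances
df = (va + vb) ** 2 / (va ** 2 / (len(alcoholic) - 1)
                       + vb ** 2 / (len(non_alcoholic) - 1))

diff = alcoholic.mean() - non_alcoholic.mean()
t_crit = stats.t.ppf(0.975, df)
print(f"95% CI for the difference in mean BMI: "
      f"[{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}]")
```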
7.3. Prediction Intervals
A prediction interval, as the word “prediction” itself suggests, is used to predict future
observations based on a known set of data. It provides a range of values within which a future
observation is likely to fall. It is generally calculated from a statistical model of the known
data, often using linear regression analysis.
Example:
 A 95% prediction interval for battery life of 100 to 110 hours means that a new battery’s
life will fall within 100 to 110 hours 95% of the time.
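A minimal sketch of a prediction interval for a single future observation, assuming the battery lifetimes are approximately normal; the sample values are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical battery lifetimes in hours (invented numbers).
lifetimes = np.array([102, 98, 107, 101, 95, 110, 104, 99])
n = len(lifetimes)
mean, s = lifetimes.mean(), lifetimes.std(ddof=1)

# 95% prediction interval for a single future observation:
# x_bar +/- t_{alpha/2, n-1} * s * sqrt(1 + 1/n)
t_crit = stats.t.ppf(0.975, n - 1)
margin = t_crit * s * np.sqrt(1 + 1 / n)
print(f"95% PI for the next battery: [{mean - margin:.1f}, {mean + margin:.1f}]")
```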
7.4. Tolerance Intervals
Tolerance intervals are similar to prediction intervals, but a tolerance interval
covers a specified proportion of the population at a given confidence level.
Example:
 75% of the time, batteries will fall into the interval 90 to 120 hours, with 95%
confidence.

8. Test of Hypothesis for a single sample


8.1. Hypothesis Testing
In hypothesis testing, an analyst assesses the plausibility of a hypothesis by
measuring and examining a random sample of the population being analyzed.
Example:
 A person wants to test whether a 1-peso coin has exactly a 50% chance of landing on tails.
 If students are given a review of all topics before an exam, then their test scores are likely
to be above average.
8.1.1. One-sided and two-sided Hypothesis
What is a one-sided hypothesis? A one-sided hypothesis is strictly bounded, stating that
the sample mean is higher or lower than the population mean, but not both. A one-sided
hypothesis can be bounded at any value.
Example:
 The null hypothesis states that the mean is less than or equal to 10. The alternative
hypothesis would be that the mean is greater than 10.
What is a two-sided hypothesis? A two-sided hypothesis is not bounded from above or from
below, as opposed to a one-sided hypothesis. A two-sided hypothesis is, in effect, the
union of two one-sided hypotheses.
Example:
 We compare the mean of a sample to a given value x using a t-test. Our null hypothesis is
that the mean is equal to x.
8.1.2. P-value in Hypothesis Tests
In hypothesis testing, the p-value is used to help you decide whether to reject the null
hypothesis; it serves as the evidence against it. P-values are expressed as decimals, and the
smaller the p-value, the stronger the evidence against the null hypothesis.
Example:
 A p-value of 0.0254 is 2.54%. This means there is a 2.54% chance of obtaining a result at
least this extreme if the null hypothesis were true. The smaller the p-value, the more significant your result.
8.1.3. General Procedure for Hypothesis Tests
There are 5 main steps in hypothesis testing (a worked sketch follows this list):
1. State your null (Ho) and alternative (Ha) hypotheses.
Ho: men are not taller than women
Ha: men are taller than women
2. Collect data.
*Measure the heights of a random sample of men and women.
3. Perform a statistical test.
*Your t-test compares the average heights of men and women and gives an estimate of the true
difference, from which the p-value is computed.
4. Decide whether the null hypothesis is supported or rejected.
*If the p-value is below your cutoff, reject the null hypothesis of no difference.
5. Present the findings.
*Report whether the comparison of mean heights between men and women supports or rejects
the null hypothesis.
8.2. Test on the Mean of a Normal Distribution, Variance Known
Example:
A coffee machine is set to fill 100 ml cups. In reality, the machine fills cups with a
process standard deviation of 25 ml. A sample of 20 cups of coffee gives a mean of 99.2 ml.
Test Ho: µ = 100 ml against Ha: µ ≠ 100 ml.
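Since the variance is known, a z-test applies. A minimal sketch of the computation:

```python
from scipy import stats

# Ho: mu = 100 ml vs Ha: mu != 100 ml, with sigma known.
mu0, sigma, n, x_bar = 100, 25, 20, 99.2

z = (x_bar - mu0) / (sigma / n ** 0.5)   # (99.2 - 100) / 5.59 ~ -0.14
p_value = 2 * stats.norm.cdf(-abs(z))    # two-sided p ~ 0.89

print(f"z = {z:.3f}, p = {p_value:.3f}")
# With such a large p-value we fail to reject Ho: the sample gives no
# evidence that the machine's mean fill differs from 100 ml.
```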
8.3. Test on the Mean of a Normal Distribution, Variance Unknown
8.4. Test on the Variance and Standard Deviation of a Normal Distribution
8.5. Test on a Population Proportion
A population proportion is the fraction of the population that has a certain characteristic.
The population proportion is also used to determine the appropriate sample size.
These are the steps of a hypothesis test on a claim about a population proportion:
Step 1: Determine the hypotheses
Step 2: Collect the data
Step 3: Assess the evidence
Step 4: Give the conclusion
Example:
 There are 500 animals in the clinic and 245 of those animals are dogs, so the sample proportion is p̂ = 245/500 = 0.49.
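A sketch of a one-proportion z-test on this sample; the hypothesized value p = 0.50 is assumed here purely for illustration.

```python
from scipy import stats

# Sample: 245 dogs out of 500 animals -> p_hat = 0.49.
x, n = 245, 500
p_hat = x / n

# Hypothetical claim to test (assumed for illustration): Ho: p = 0.50.
p0 = 0.50
se = (p0 * (1 - p0) / n) ** 0.5          # standard error under Ho
z = (p_hat - p0) / se
p_value = 2 * stats.norm.cdf(-abs(z))
print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```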
9. Statistical Inference of two samples
9.1. Inference on the difference in Means of two normal Distributions, Variances
known
9.2. Inference on the difference in Means of two normal Distributions, Variances
Unknown
9.3. Inference on the variance of two normal distributions
9.4. Inference on two population proportions
To draw conclusions about a difference between two population proportions, there are
two inference procedures: first, a confidence interval, when our goal is to estimate the
difference, and second, a hypothesis test, when the goal is to test a claim about the difference.
Both are based on the sampling distribution.
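A minimal sketch of both procedures, using invented counts for the two groups:

```python
from scipy import stats

# Hypothetical counts (invented): x successes out of n in each group.
x1, n1 = 120, 400
x2, n2 = 90, 350
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Confidence interval for p1 - p2 (normal approximation).
se_ci = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
z_crit = stats.norm.ppf(0.975)
print(f"95% CI for p1 - p2: [{diff - z_crit*se_ci:.3f}, {diff + z_crit*se_ci:.3f}]")

# Hypothesis test of Ho: p1 = p2 using the pooled proportion.
p_pool = (x1 + x2) / (n1 + n2)
se_test = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
z = diff / se_test
print(f"z = {z:.2f}, p = {2 * stats.norm.cdf(-abs(z)):.3f}")
```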

10. Simple Linear Regression and Correlation


10.1. Empirical Models
An empirical model is supported only by experimental data. Its objective is to provide an
adequate description of certain types of observable phenomena in the form of statistical models.
Example:
 Investigating the relationship of inflowing nutrients in a lake to algal biomass production.
10.2. Regression: Modeling Linear Relationships - the Least Squares Approach
The least squares approach is a form of mathematical regression analysis
used to determine the line of best fit for a set of data, providing a visual demonstration of the
relationship between the data points.
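A short sketch of the least squares formulas on made-up data:

```python
import numpy as np

# Hypothetical (x, y) data (invented numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates for y = b0 + b1*x:
# b1 = Sxy / Sxx,  b0 = y_bar - b1 * x_bar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")
```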
10.3. Correlation: Estimating the strength of linear Regression
Correlation is a measure of the linear relationship between two quantitative
variables. The relationship between the two variables is considered strong
when |r| is larger than 0.7.
10.4. Hypothesis test in simple linear regression
A linear regression model attempts to explain the relationship between two or more
variables using a straight line. A simple linear regression model has only one independent
variable. t-tests are used to conduct hypothesis tests on the regression coefficients obtained in
simple linear regression.
10.4.1. Use of t-test
A t-test is used to determine whether there is a significant difference between the means of two
groups. It allows you to compare the average values and determine whether they came from the same
population. A t-test takes a sample from each of the two sets and establishes the problem statement
by assuming a null hypothesis that the two means are equal.
Example: Calculate a paired t test by hand for the following data:
Subject #    Score 1    Score 2
1            3          13
2            3          15
3            12         20
4            15         23
5            20         25
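For comparison with the hand calculation, a paired t-test on these scores can be sketched in Python as follows:

```python
from scipy import stats

score1 = [3, 3, 12, 15, 20]
score2 = [13, 15, 20, 23, 25]

# Paired t-test: test whether the mean difference is zero.
t_stat, p_value = stats.ttest_rel(score2, score1)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Differences are 10, 12, 8, 8, 5 (mean 8.6), giving t ~ 7.37 on 4 df,
# so the two sets of scores differ significantly.
```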

10.4.2. Analysis of variance approach to test significance of regression


The analysis of variance is another method to test for the significance of regression. It
examines the observed data to determine whether a regression model can be applied.
10.5. Prediction of new observations
As the word “prediction” suggests, this is the estimation of a new response from the model. It
is one of the important applications of a linear regression model: to predict a future observation Y
corresponding to a given value of x.
10.6. Adequacy of the regression model
There are 4 different checks to find out whether the model is adequate; they are applicable to any
regression model.
Adequacy Check #1: Plot the straight-line regression model against the data to visually inspect how well
the data fit the line.
Adequacy Check #2: Calculate the coefficient of determination, r².
Adequacy Check #3: Determine whether the residuals as a function of x show nonlinearity.
Adequacy Check #4: Determine whether 95% of the scaled residuals are within [-2, 2] (see the
sketch after Section 10.6.1).
10.6.1. Residual analysis
The vertical distance between a data point and the regression line is called a residual. Each
data point has a residual. Residuals are positive if the point is above the regression line and negative if
it is below; if the line passes through the point, the residual is zero.
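A short sketch combining these ideas: fit a line to made-up data, compute the residuals, and apply Adequacy Check #4 to the scaled residuals.

```python
import numpy as np

# Hypothetical data and a least squares line fitted with numpy (invented numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9, 12.1])

b1, b0 = np.polyfit(x, y, 1)                 # slope, intercept
residuals = y - (b0 + b1 * x)                # vertical distances to the line

# Scaled (standardized) residuals; Adequacy Check #4 expects ~95% in [-2, 2].
scaled = residuals / residuals.std(ddof=2)   # ddof=2: two fitted parameters
print("scaled residuals:", np.round(scaled, 2))
print("fraction within [-2, 2]:", np.mean(np.abs(scaled) <= 2))
```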
10.6.2. Coefficient of Determination
The coefficient of determination is used to analyze how differences in one variable can
be explained by differences in a second variable. It is closely related to the correlation coefficient.
Example:
 When a woman became pregnant has a direct relation to when she gives birth.
10.7. Correlation
Correlation is a measure of how things are related to each other. It is used to test
relationships between quantitative variables or categorical variables.
Example:
 The dog’s name and the type of biscuit they prefer
 The amount of time you study and your GPA
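A minimal sketch computing r for the study-time example, with invented data:

```python
import numpy as np

# Hypothetical study time (hours/week) vs GPA data (invented numbers).
study_hours = np.array([2, 5, 8, 10, 12, 15])
gpa         = np.array([2.1, 2.5, 2.9, 3.2, 3.4, 3.8])

r = np.corrcoef(study_hours, gpa)[0, 1]
print(f"r = {r:.3f}")   # |r| > 0.7 would suggest a strong linear relationship
```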
11. Multiple Linear Regression Model
11.1. Multiple Linear Regression
Multiple linear regression is also known as multiple regression. It is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable. The goal
of MLR is to model the linear relationship between the explanatory variables and the response variable.
Example:
 Predict blood pressure (the dependent variable) from independent variables such as
height, weight, age, sex and hours of exercise per week.
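A minimal least squares sketch of this blood pressure example, with invented data:

```python
import numpy as np

# Hypothetical predictors (invented numbers):
# columns = height (cm), weight (kg), age (years), exercise (hours/week)
X = np.array([[170, 70, 45, 2],
              [165, 80, 50, 1],
              [180, 75, 38, 4],
              [175, 90, 60, 0],
              [160, 60, 30, 5],
              [172, 85, 55, 1]], dtype=float)
y = np.array([120, 135, 118, 145, 110, 140], dtype=float)  # systolic BP

# Fit y = b0 + b1*x1 + ... + b4*x4 by least squares.
X1 = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("coefficients (b0, b1..b4):", np.round(beta, 3))
```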
11.2. Hypothesis test in multiple linear regression
11.3. Prediction of new observations
As the word “prediction” suggests, this is the estimation of a new response from the model. It
is one of the important applications of a linear regression model: to predict a future observation Y
corresponding to a given value of x.
11.4. Model adequacy checking
The fitting of the linear regression model, the estimation of its parameters, the testing of
hypotheses, and the properties of the estimators are all based on the following assumptions:
1. The relationship between the study variable and explanatory variables is linear, at
least approximately.
2. The error term has zero mean
3. The error term has a constant variance
4. The errors are uncorrelated
5. The errors are normally distributed
12. Design and Analysis of Single Factor Experiments
12.1. Completely randomized single factor experiments
A completely randomized single-factor experiment is one in which both of the following hold:
one factor with two or more levels is manipulated, and each respondent in the survey is shown one
and only one of the levels of the factor.
12.1.1. Analysis of Variance (ANOVA)
ANOVA is an analysis tool used in statistics that splits the observed aggregate variability
found inside a data set into two parts: systematic factors and random factors. The systematic factors
have a statistical influence on the given data set, while the random factors do not. It is used to
determine the influence that independent variables have on the dependent variable in a regression
study.
Example:
 You have a group of individuals randomly split into smaller groups, each completing a
different task. For example, you might be studying the effects of tea on weight loss and
form three groups: green tea, black tea, and no tea.
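A one-way ANOVA on invented weight-loss data for these three groups can be sketched as:

```python
from scipy import stats

# Hypothetical weight loss (kg) per group (invented numbers).
green_tea = [3.2, 2.8, 4.1, 3.6, 2.9]
black_tea = [2.4, 2.0, 3.1, 2.7, 2.2]
no_tea    = [1.1, 0.8, 1.5, 1.2, 0.9]

# One-way ANOVA: Ho says all three group means are equal.
f_stat, p_value = stats.f_oneway(green_tea, black_tea, no_tea)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value would indicate that at least one tea group differs.
```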
12.1.2. Multiple Comparison following the ANOVA
Tests conducted on subsets of data tested previously in another analysis are called post hoc
tests. A class of post hoc tests that provide this type of detailed information for ANOVA results is
called multiple comparison analysis. The most commonly used multiple comparison analysis
statistics include the following tests: Tukey, Newman-Keuls, Scheffé, Bonferroni, and Dunnett.
12.1.3. Residual analysis and Model Checking
Residuals are differences between the one-step-predicted output from the model and the
measured output from the validation data set. Residual analysis consists of two tests: the
whiteness test and the independence test.
According to the whiteness test criteria, a good model has a residual autocorrelation function
inside the confidence interval of the corresponding estimates, indicating that the residuals are
uncorrelated.
According to the independence test criteria, a good model has residuals that are uncorrelated with
past inputs. Evidence of correlation indicates that the model does not describe how part of the output
relates to the corresponding input.
12.1.4. Determining Sample Size
Sample size is a frequently used term in statistics and market research, and one that
inevitably comes up whenever you are surveying a large population of respondents. It relates to
the way research is conducted on large populations.
The steps that follow are suitable for finding a sample size for continuous data:
Stage 1: Consider your sample size variables
Stage 2: Calculate the sample size
Example:
 A recent report from the Framingham Heart Study indicated that 26% of people free of
cardiovascular disease had elevated LDL cholesterol levels, defined as LDL > 159
mg/dL. An investigator hypothesizes that a higher proportion of patients with a history
of cardiovascular disease will have elevated LDL cholesterol. How many patients should
be studied to ensure that the power of the test is 90% to detect a 5% difference in the
proportion with elevated LDL cholesterol? A two-sided test will be used with a 5% level
of significance.
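A sketch of this calculation using one common textbook sample-size formula for a one-sample test of proportions (texts differ slightly in the formula, so the exact n may vary):

```python
from scipy import stats

# Ho: p = 0.26 vs Ha: p = 0.31 (a 5% increase),
# two-sided alpha = 0.05, power = 0.90. One common formula is
# n = (z_{1-a/2}*sqrt(p0*q0) + z_{1-b}*sqrt(p1*q1))^2 / (p1 - p0)^2.
p0, p1, alpha, power = 0.26, 0.31, 0.05, 0.90

z_a = stats.norm.ppf(1 - alpha / 2)          # ~1.960
z_b = stats.norm.ppf(power)                  # ~1.282

num = (z_a * (p0 * (1 - p0)) ** 0.5 + z_b * (p1 * (1 - p1)) ** 0.5) ** 2
n = num / (p1 - p0) ** 2
print(f"required sample size: n = {n:.0f}")  # ~844 with this formula
```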
12.2. The Random-Effects Model
Random-effects models are statistical models in which some of the parameters (effects)
that define systematic components of the model exhibit some form of random variation.
Statistical models always describe variation in observed variables in terms of systematic and
unsystematic components.
Example:
 Sodium content in beer, with brands chosen at random
12.2.1. Fixed versus Random factors
Fixed effect factor: Data has been gathered from all the levels of the factor that are of
interest.
Example:
 The purpose of an experiment is to compare the effects of three specific dosages of a drug
on the response. "Dosage" is the factor; the three specific dosages in the experiment are
the levels; there is no intent to say anything about other dosages.
Random effect factor: The factor has many possible levels, interest is in all possible
levels, but only a random sample of levels is included in the data.
Example:
 A large manufacturer of widgets is interested in studying the effect of machine operator
on the quality of the final product. The researcher selects a random sample of operators from
the large number of operators at the various facilities that manufacture the widgets. The
factor is "operator." The analysis will not estimate the effect of each of the operators in
the sample, but will instead estimate the variability attributable to the factor "operator".
12.2.2. ANOVA and Variance Components
Variance components are estimates of a part of the total variability accounted for by a
specified source of variability. Estimates of the variance components are extracted from
the ANOVA by equating the mean squares to the expected mean squares.
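A minimal sketch of this method-of-moments estimation for a balanced one-way random-effects design, with invented operator data:

```python
import numpy as np

# Hypothetical balanced one-way random-effects data: a operators, n parts each.
data = np.array([[10.1,  9.8, 10.4, 10.0],    # operator 1
                 [11.2, 11.0, 11.5, 11.3],    # operator 2
                 [ 9.5,  9.9,  9.6,  9.4]])   # operator 3
a, n = data.shape

grand = data.mean()
ms_treat = n * np.sum((data.mean(axis=1) - grand) ** 2) / (a - 1)
ms_error = np.sum((data - data.mean(axis=1, keepdims=True)) ** 2) / (a * (n - 1))

# Equate mean squares to expectations: E[MS_treat] = sigma2 + n * sigma2_tau.
sigma2 = ms_error
sigma2_tau = max((ms_treat - ms_error) / n, 0.0)
print(f"sigma^2 (error) = {sigma2:.3f}, sigma^2_operator = {sigma2_tau:.3f}")
```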
12.3. Randomized Complete Block Design
A Randomized Complete Block Design (RCBD) is an experiment whose
treatment combinations are assigned randomly to the experimental units within a block.
Generally, blocks cannot be randomized, as the blocks represent factors with restrictions on
randomization such as location, place, time, gender, ethnicity, breed, etc.
Example:
 Suppose we used four coupons (each having four experimental units), randomly assigned
the tips to each experimental unit (thus having a completely randomized design with
single factor Tip), and by chance obtained the same results as in the block design
experiment. Analyze the data under this assumption and compare with the results in the
RCBD analysis.
12.3.1. Design and statistical analysis
12.3.2. Multiple Comparisons
Tests conducted on subsets of data tested previously in another analysis are called post hoc
tests. A class of post hoc tests that provide this type of detailed information for ANOVA results is
called multiple comparison analysis. The most commonly used multiple comparison analysis
statistics include the following tests: Tukey, Newman-Keuls, Scheffé, Bonferroni, and Dunnett.
12.3.3. Residual analysis and Model checking
Residuals are differences between the one-step-predicted output from the model and the
measured output from the validation data set. Residual analysis consists of two tests: the
whiteness test and the independence test.
According to the whiteness test criteria, a good model has a residual autocorrelation
function inside the confidence interval of the corresponding estimates, indicating that the
residuals are uncorrelated.
According to the independence test criteria, a good model has residuals that are uncorrelated with
past inputs. Evidence of correlation indicates that the model does not describe how part of the
output relates to the corresponding input.

13. Design of Experiments with Several Factors


13.1. Factorial Experiments
A factorial experiment is an experiment in which several factors (such as fertilizers or
antibiotics) are applied to each experimental unit and each factor is applied at two or more
levels. In statistics, a full factorial experiment is an experiment whose design consists of two or
more factors, each with discrete possible values or "levels", and whose experimental units take
on all possible combinations of these levels across all such factors.
Example:
 The number of different treatment groups in any factorial design can easily
be determined by multiplying through the number notation. For instance, a
2 x 2 design has 2 x 2 = 4 groups, while a 3 x 4 design would need 3 x 4 =
12 groups. We can also depict a factorial design in design notation.
13.2. Two-Factor Factorial Experiments
A two-factor factorial design is an experimental design in which data is collected for all
possible combinations of the levels of the two factors of interest. If equal sample sizes are taken
for each of the possible factor combinations then the design is a balanced two-factor factorial
design.
13.2.1. Statistical Analysis of the Fixed-Effects Model
Fixed-effects models are a class of statistical models in which the levels (i.e., values) of
independent variables are assumed to be fixed (i.e., constant), and only the dependent variable
changes in response to the levels of independent variables. This class of models is fundamental
to the general linear models that underpin fixed-effects regression analysis and fixed-effects
analysis of variance, or ANOVA (fixed-effects ANOVA can be unified with fixed-effects
regression analysis by using dummy variables to represent the levels of independent variables in
a regression model).
13.2.2. Model Adequacy Checking
13.3. 2^k Factorial Design
2^k designs are particularly useful in the early stages of experimental work when there are
many factors to be investigated. They provide the smallest number of runs with which k factors can
be studied in a complete factorial design.
Example:
 The effect of percent carbonation (10% or 12%) and operating pressure (25 psi or 30 psi)
on the fill height of a carbonated beverage (Pepsi).
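A sketch of the effect estimates for this 2² example, with invented coded responses:

```python
import numpy as np

# Hypothetical fill-height deviations for a 2^2 design (invented values):
# factors A = carbonation (10% -> -1, 12% -> +1), B = pressure (25 -> -1, 30 -> +1)
A = np.array([-1, +1, -1, +1])
B = np.array([-1, -1, +1, +1])
y = np.array([-3,   0, -1,  2], dtype=float)   # responses at (1), a, b, ab

# Effect estimates for a single-replicate 2^2 design: contrast / 2^(k-1).
effect_A  = np.sum(A * y) / 2
effect_B  = np.sum(B * y) / 2
effect_AB = np.sum(A * B * y) / 2
print(f"A = {effect_A}, B = {effect_B}, AB = {effect_AB}")
```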
13.3.1. Single Replicate
13.3.2. Addition of Center points
Adding center points to a factorial design is a wise thing to do, as it brings significant
benefits: an improved model and a way to determine whether "curvature" is significant. Factorial
designs without center points assume linear relationships.

13.4. Blocking and Confounding in the 2^k Design


Blocking is a technique for dealing with controllable nuisance variables. Sometimes it is
impossible to perform all 2^k factorial experiments under homogeneous conditions. The blocking
technique is used so that the treatments can be compared on an equal footing across many situations.
Confounding is a design technique for arranging experiments so that high-order interactions are
indistinguishable from (or confounded with) blocks.

13.5. Fractional Replication of the 2^k Design


Experiments with many factors involve a large number of possible treatments, even when
all factors are used at only two levels. Often the available resources are not sufficient for even a
single replication of the complete factorial design. A carefully chosen fraction of the complete
design will give much information, though always less than the complete design.
13.6. Response Surface Methods
Response surface methodology (RSM) is a widely used mathematical and
statistical method for modeling and analyzing a process in which the response of interest is
affected by various variables; the objective of the method is to optimize the response.
