Research 9 Q3


RESEARCH 9: Branches of Statistics

DESCRIPTIVE STATISTICS
- Involves organizing or displaying data sets.

POPULATION
- The collection of all outcomes, responses, measurements, or counts that are of interest.

SAMPLE
- A subset of the population.

PARAMETER
- A number that describes a population characteristic.

STATISTIC
- A number that describes a sample characteristic.

INFERENTIAL STATISTICS
- Involves using sample data to draw conclusions about a population.
- A statistic computed from a sample estimates a parameter, the corresponding value in the population from which the sample is selected.
- A set of methods used to make a generalization, estimate, prediction, or decision.
- The mathematics and logic of how this generalization from sample to population can be made.
- Consists of methods for drawing, and measuring the reliability of, conclusions about a population based on information obtained from a sample of the population.
- Produced through complex mathematical calculations that allow scientists to infer trends about a larger population based on a study of a sample taken from it.
- Examines relationships between variables within a sample and then makes generalizations or predictions about how those variables will relate to a larger population.
PARAMETER vs. STATISTIC

PARAMETER
- Certain if the entire population is measured.
- There is no inference.

STATISTIC
- Samples are chosen in a deliberate way so that the influence of chance or probability can be estimated.
- There is an inference that estimates the parameters.
- These basic ideas comprise the foundation for testing hypotheses using statistical techniques.
- Leads to tenable conclusions about the parameters.

Sampling Techniques

SIMPLE RANDOM SAMPLING
- Every possible sample of the same size has the same chance of being selected.

STRATIFIED SAMPLING
- Divide a population into groups (strata) and select a random sample from each group.

CLUSTER SAMPLING
- Divide the population into groups (clusters) and select all the members in one or more, but not all, of the clusters.
SYSTEMATIC SAMPLING
- Choose a starting value at random, then choose every nth member of the population.

Two Major Divisions of Inferential Statistics

CONFIDENCE INTERVAL
- Gives a range of values for an unknown parameter of the population by measuring a statistical sample.
- It is expressed in terms of an interval and the degree of confidence that the parameter is within the interval.

TESTS OF SIGNIFICANCE OR HYPOTHESIS TESTING
- Scientists make a claim about the population by analyzing a statistical sample.
- There is some uncertainty, which can be expressed in terms of a level of significance.

PROBABILITY EXPERIMENTS
- An action or trial through which specific results (counts, measurements, or responses) are obtained.

OUTCOME
- The result of a single trial in a probability experiment.

SAMPLE SPACE
- The set of all possible outcomes of a probability experiment.

EVENT
- Consists of one or more outcomes and is a subset of the sample space.

Types of Probability

CLASSICAL (THEORETICAL) PROBABILITY
- Each outcome in a sample space is equally likely.
- P(E) = Number of outcomes in event E / Number of outcomes in sample space

EMPIRICAL (STATISTICAL) PROBABILITY
- Based on observations obtained from probability experiments.
- The relative frequency of an event.
- P(E) = Frequency of event E / Total frequency = f/n

LAW OF LARGE NUMBERS
- As an experiment is repeated over and over, the empirical probability of an event approaches the theoretical (actual) probability of the event.

SUBJECTIVE PROBABILITY
- An intuition, educated guess, or estimate.

RANGE OF PROBABILITIES RULE
- The probability of an event E is between 0 and 1, inclusive: 0 ≤ P(E) ≤ 1.

RANDOM VARIABLE
- Represents a numerical value associated with each outcome of a probability experiment.
- Denoted by x.
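The classical and empirical formulas, and the law of large numbers, can be demonstrated with a simulated die roll (the die and the even-number event are made-up examples):

```python
import random

random.seed(0)
sample_space = [1, 2, 3, 4, 5, 6]      # one roll of a fair die
event = [2, 4, 6]                      # E: roll an even number

# Classical probability: outcomes in E / outcomes in the sample space.
p_classical = len(event) / len(sample_space)          # 3/6 = 0.5

# Empirical probability: relative frequency f/n over repeated trials.
n = 100_000
f = sum(1 for _ in range(n) if random.choice(sample_space) in event)
p_empirical = f / n

# Law of large numbers: p_empirical approaches p_classical as n grows.
```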
DISCRETE RANDOM VARIABLE
- Has a finite or countable number of possible outcomes that can be listed.

DISCRETE PROBABILITY DISTRIBUTION
- Lists each possible value the random variable can assume, together with its probability.

BINOMIAL EXPERIMENTS
- The experiment is repeated for a fixed number of trials, where each trial is independent of the other trials.
- There are only two possible outcomes of interest for each trial, which can be classified as (S) Success or (F) Failure.
- The probability of a success, P(S), is the same for each trial.
- The random variable x counts the number of successful trials.
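A binomial probability can be computed directly from the counting formula. The coin-toss numbers below (3 heads in 5 fair tosses) are a hypothetical example:

```python
from math import comb

# P(x successes in n trials), success probability p on each independent trial.
def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical case: exactly 3 heads in 5 tosses of a fair coin.
p3 = binomial_pmf(3, 5, 0.5)                          # 10/32 = 0.3125

# As a discrete probability distribution, the probabilities over all x sum to 1.
total = sum(binomial_pmf(x, 5, 0.5) for x in range(6))
```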

CONTINUOUS RANDOM VARIABLE
- Has an uncountable (infinite) number of possible outcomes/values that can be represented by an interval on the number line.

CONTINUOUS PROBABILITY DISTRIBUTION
- The probability distribution of a continuous random variable.

NORMAL DISTRIBUTION
- A continuous probability distribution for a random variable, x.
- The most important continuous probability distribution in statistics.
- Its graph is called a normal curve.

MEAN
- Gives the location of the line of symmetry.

STANDARD DEVIATION
- Describes the spread of the data.

SAMPLING DISTRIBUTION
- The probability distribution of a sample statistic.
- Formed when samples of size n are repeatedly taken from a population.

CENTRAL LIMIT THEOREM
- If samples of size n greater than or equal to 30 are drawn from any population with a population mean and a population standard deviation, then the sampling distribution of sample means approximates a normal distribution.
- The greater the sample size, the better the approximation.
- If the population itself is normally distributed, then the sampling distribution of sample means is normally distributed for any sample size n.
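The central limit theorem can be seen in a short simulation. The population here is a die roll (uniform, clearly not normal, mean 3.5); the sample size of 36 and the 2000 repetitions are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(42)

# Population: one die roll, uniform on 1..6, population mean 3.5.
def sample_mean(n):
    return statistics.mean(random.randint(1, 6) for _ in range(n))

# Sampling distribution of the mean for samples of size n = 36 (n >= 30).
means = [sample_mean(36) for _ in range(2000)]

# The distribution of `means` is approximately normal, centered near 3.5.
center = statistics.mean(means)
```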
NORMAL APPROXIMATION TO A BINOMIAL
- The normal distribution is used to approximate the binomial distribution when it would be impractical to use the binomial distribution to find a probability.
- Condition: np and nq are both greater than or equal to 5.

CONTINUITY CORRECTION
- When you use a continuous normal distribution to approximate a binomial probability, you need to move 0.5 unit to the left and right of the midpoint to include all possible x-values in the interval.

CONFIDENCE INTERVALS

Point Estimate
- A single value estimate for a population parameter; the most unbiased point estimate of the population mean is the sample mean.

Interval Estimate
- An interval, or range of values, used to estimate a population parameter.

LEVEL OF CONFIDENCE
- The probability that the interval estimate contains the population parameter.
- If the level of confidence is 90%, this means that we are 90% confident that the interval contains the population mean.

SAMPLING ERROR
- The difference between the point estimate and the actual population parameter value.

MARGIN OF ERROR
- The greatest possible distance between the point estimate and the value of the parameter it is estimating, for a given level of confidence c.

CONFIDENCE INTERVAL
- Denoted by c.
- The probability that the confidence interval contains the population mean.

SAMPLE SIZE
- Given a c-confidence level and a margin of error E, the minimum sample size n needed to estimate the population mean is n = (z_c * σ / E)², rounded up to a whole number.
- If the population standard deviation is unknown, you can estimate it using the sample standard deviation, provided you have a preliminary sample with at least 30 members.

SLOVIN'S FORMULA
- Used to calculate the minimum sample size needed to estimate a statistic based on an acceptable margin of error.
- It is calculated as n = N / (1 + Ne²), where n is the sample size needed, N is the population size, and e is the acceptable margin of error.
- The lower the margin of error, the larger the sample size needed.
- Use this formula when nothing is known about the behavior of the population.
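These formulas can be worked through with made-up numbers (the sample mean 72, σ = 8, n = 64, target E = 2, N = 10,000, and e = 0.05 are all hypothetical):

```python
import math

z95 = 1.96                                # critical value for c = 0.95
xbar, sigma, n = 72.0, 8.0, 64            # hypothetical sample results

# Margin of error and 95% confidence interval for the population mean.
E = z95 * sigma / math.sqrt(n)            # E = 1.96
interval = (xbar - E, xbar + E)           # about (70.04, 73.96)

# Minimum sample size for a desired margin of error E = 2 (round up).
n_min = math.ceil((z95 * sigma / 2.0) ** 2)       # 62

# Slovin's formula: n = N / (1 + N * e^2).
N, e = 10_000, 0.05
n_slovin = math.ceil(N / (1 + N * e ** 2))        # 385
```

Halving the margin of error roughly quadruples the required sample size, which is the "lower margin of error, larger sample" point in the text.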
HYPOTHESIS TEST
- A process that uses sample statistics to test a claim about the value of a population parameter.
- To write the null and alternative hypotheses, translate the claim made about the population parameter from a verbal statement to a mathematical statement, then write its complement.
- Regardless of which pair of hypotheses you use, you always assume that the population mean is equal to k and examine the sampling distribution on the basis of this assumption.
STATISTICAL HYPOTHESIS
- A statement, or claim, about a population parameter.
- A hypothesis test needs a pair of hypotheses: one that represents the claim and the other, its complement.
- When one of these hypotheses is false, the other must be true.

STATING A HYPOTHESIS
- No matter which hypothesis represents the claim, always begin the hypothesis test by assuming that the equality condition in the null hypothesis is true.
- At the end of the test, one of two decisions will be made: reject the null hypothesis, or fail to reject the null hypothesis.
- Because the decision is based on a sample, there is the possibility of making the wrong decision.
NATURE OF TESTS

1. Left-tailed: Alternative < null
2. Right-tailed: Alternative > null
3. Two-tailed: Alternative ≠ null

- The type of test depends on the region of the sampling distribution that favors a rejection of H0.
- This region is indicated by the alternative hypothesis.

TYPES OF ERRORS
- Type I: Rejecting the null hypothesis when it is true.
- Type II: Failing to reject the null hypothesis when it is false.

LEFT-TAILED TEST
- The alternative hypothesis HA contains the less-than inequality symbol (<).

LEVEL OF SIGNIFICANCE
- Your maximum allowable probability of making a type I error, denoted by α.
- Commonly used levels of significance: α = 0.10, α = 0.05, and α = 0.01.
- P(type II error) = β.

STATISTICAL TESTS
- After stating the null and alternative hypotheses and specifying the level of significance, a random sample is taken from the population and sample statistics are calculated.
- The statistic that is compared with the parameter in the null hypothesis is called the test statistic.

RIGHT-TAILED TEST
- The alternative hypothesis HA contains the greater-than inequality symbol (>).

P-VALUE (PROBABILITY VALUE)
- A number, calculated from a statistical test, that describes how likely you are to have obtained a particular set of observations if the null hypothesis were true.
- P-values are used in hypothesis testing to help decide whether to reject the null hypothesis.
- The smaller the p-value, the more likely you are to reject the null hypothesis.
- Its use depends on the nature of the test.

TWO-TAILED TEST
- The alternative hypothesis HA contains the not-equal-to symbol (≠). Each tail has an area of (1/2)α.

REJECTION REGION (CRITICAL REGION)
- The range of values for which the null hypothesis is not probable.
- If the test statistic falls in this region, the null hypothesis is rejected.
- A critical value z0 separates the rejection region from the nonrejection region.

MAKING A DECISION
- If the standardized test statistic falls in the rejection region, reject H0; otherwise, fail to reject H0.

T-TEST FOR A POPULATION MEAN
- A statistical test for a population mean.
- It can be used when the population is normal or nearly normal, σ is unknown, and n < 30.
- The degrees of freedom are d.f. = n - 1.

STEPS OF HYPOTHESIS TESTING
1. State the null and alternative hypotheses.
2. Specify the level of significance α.
3. Compute the test statistic from the sample.
4. Find the p-value, or identify the critical value and rejection region.
5. Decide whether to reject H0, and interpret the decision in the context of the claim.
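These steps can be sketched for a one-sample t-test with hypothetical data. The sample values, the claimed mean of 100, and α = 0.05 are all made up; the critical value 2.262 is the standard two-tailed t value for d.f. = 9:

```python
import math
import statistics

# Hypothetical small sample (n < 30), population assumed nearly normal,
# sigma unknown. H0: mu = 100 vs. HA: mu != 100 (two-tailed).
data = [102.3, 98.7, 101.5, 104.2, 99.8, 103.1, 100.9, 102.8, 101.2, 103.5]

n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)                 # sample standard deviation
t = (xbar - 100) / (s / math.sqrt(n))      # standardized test statistic
df = n - 1                                 # d.f. = 9

# Critical value t0 for alpha = 0.05, two-tailed, d.f. = 9 is about 2.262.
reject = abs(t) > 2.262                    # True: the claim mu = 100 is rejected
```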

DEGREE OF FREEDOM
- It refers to the maximum number of logically independent values, which are values that have the freedom to vary, in the data sample.
- These are the number of independent variables that can be estimated in a statistical analysis.

Z-TEST FOR A POPULATION PROPORTION
- A statistical test for a population proportion p.
- It can be used when a binomial distribution is given such that np ≥ 5 and nq ≥ 5.

Z-TEST FOR A POPULATION MEAN
- A statistical test for a population mean.
- It can be used when the population is normal and the population standard deviation is known, or for any population when the sample size n is at least 30.
- When n ≥ 30, the sample standard deviation can be substituted for the population standard deviation.

TWO-SAMPLE HYPOTHESIS TEST
- It compares two parameters from two populations.
- Null Hypothesis H0: a statistical hypothesis that usually states that there is no difference between the parameters of two populations. It always contains the symbol ≤, =, or ≥.
- Alternative Hypothesis HA: a statistical hypothesis that is true when H0 is false. It always contains the symbol >, ≠, or <.

Sampling Methods

INDEPENDENT SAMPLES
- The sample selected from one population is not related to the sample selected from the second population.

DEPENDENT SAMPLES (paired or matched samples)
- Each member of one sample corresponds to a member of the other sample.

TWO-SAMPLE T-TEST FOR THE DIFFERENCE BETWEEN MEANS
- If samples of size less than 30 are taken from normally distributed populations, a t-test may be used to test the difference between the population means.
- The standard error and the degrees of freedom of the sampling distribution depend on whether the population variances are equal.

Three conditions are necessary to use a t-test for small independent samples:
1. The samples must be randomly selected.
2. The samples must be independent.
3. Each population must have a normal distribution.
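Assuming equal population variances (the pooled case, where d.f. = n1 + n2 - 2), the two-sample t statistic can be computed by hand. Both samples below are hypothetical:

```python
import math
import statistics

# Hypothetical small independent samples from two roughly normal populations,
# assuming equal population variances (pooled two-sample t-test).
a = [23.1, 25.4, 24.2, 26.0, 24.8, 25.1, 23.9, 25.7]
b = [21.8, 22.9, 23.5, 22.1, 23.0, 22.6, 21.9, 23.2]

n1, n2 = len(a), len(b)
v1, v2 = statistics.variance(a), statistics.variance(b)

# Pooled standard deviation; d.f. = n1 + n2 - 2 when variances are equal.
sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
t = (statistics.mean(a) - statistics.mean(b)) / (sp * math.sqrt(1/n1 + 1/n2))
df = n1 + n2 - 2                           # 14
```

When the variances cannot be assumed equal, the standard error and degrees of freedom are computed differently, which is the point the section makes.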

NORMAL OR T-DISTRIBUTION?
- Use the normal (z) sampling distribution when σ is known and the population is normal or n ≥ 30; use the t-distribution when σ is unknown and the population is normal or nearly normal.

T-TEST FOR THE DIFFERENCE BETWEEN MEANS OF DEPENDENT SAMPLES
- To perform a two-sample hypothesis test with dependent samples, the difference between each data pair (between the entries of a data pair) is first found, then the mean of these differences.
- NOTE: n is the number of data pairs.

Three conditions are required to conduct the test:
1. The samples must be randomly selected.
2. The samples must be dependent (paired).
3. Both populations must be normally distributed.

TWO-SAMPLE Z-TEST FOR PROPORTIONS
- Used to test the difference between two population proportions.

Three conditions are required to conduct the test:
1. The samples must be randomly selected.
2. The samples must be independent.
3. The samples must be large enough to use a normal sampling distribution (expected counts of at least 5).

CORRELATION
- A relationship between two variables.
- The data can be represented by ordered pairs (x, y), where x is the independent (explanatory) variable and y is the dependent (response) variable.

CORRELATION COEFFICIENT
- A measure of the strength and the direction of a linear relationship between two variables.
- The symbol r represents the sample correlation coefficient.
- The range of the correlation coefficient is -1 to 1.

SCATTER PLOT
- Can be used to determine whether a linear (straight-line) correlation exists between two variables.

REGRESSION LINE (LINE OF BEST FIT)
- The line for which the sum of the squares of the residuals is a minimum.
- The equation of a regression line for an independent variable x and a dependent variable y is ŷ = mx + b, where ŷ is the predicted y-value for a given x-value, m is the slope, and b is the y-intercept.
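The least-squares line can be computed from the deviation sums. The paired data below are hypothetical:

```python
# Least-squares regression line y_hat = m*x + b for hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# The slope minimizes the sum of squared residuals (observed y - predicted y).
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - m * mean_x

y_hat = m * 3 + b                          # predicted y for x = 3
```

Note that the line always passes through the point (mean of x, mean of y), which is why b falls out as mean_y - m * mean_x.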

PEARSON CORRELATION COEFFICIENT (PRODUCT-MOMENT CORRELATION COEFFICIENT)
- The two variables should be measured at the interval or ratio level.
- There should exist a linear relationship between the two variables.
- Both variables should be roughly normally distributed.
- Each observation in the data set should have a pair of values.
- There should be no extreme outliers in the data set.
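The sample coefficient r can be computed from the deviation sums, using the product-moment form. The paired values are hypothetical:

```python
import math

# Sample Pearson correlation coefficient r for hypothetical paired data.
xs = [2, 4, 5, 7, 8]
ys = [3, 5, 4, 8, 9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)             # always in [-1, 1]
```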
REGRESSION LINES
- After verifying that the linear correlation between two variables is significant, we determine the equation of the line that best models the data (the regression line).
- It can be used to predict the value of y for a given value of x.

RESIDUAL
- The difference between the observed y-value and the predicted y-value for a given x-value on the line.

COEFFICIENT OF DETERMINATION
- The ratio of the explained variation to the total variation.
- It is denoted by r².

VARIATION ABOUT A REGRESSION LINE

Three types of variation about a regression line:
1. Total Variation
2. Explained Variation
3. Unexplained Variation

- To find the total variation, you must first calculate the total deviation, the explained deviation, and the unexplained deviation.

MULTINOMIAL EXPERIMENTS
- A probability experiment consisting of a fixed number of independent trials in which there are more than two possible outcomes for each trial.
- The probability for each outcome is fixed, and each outcome is classified into categories.
- Recall that a binomial experiment has only two possible outcomes.
- The distribution of proportions obtained in a multinomial experiment can be compared with a specified (expected) distribution by performing a chi-square goodness-of-fit test.
THE STANDARD ERROR OF ESTIMATE
- The standard deviation of the observed y-values about the predicted y-value for a given x-value.
- The closer the observed y-values are to the predicted y-values, the smaller the standard error of estimate will be.

MULTIPLE REGRESSION EQUATION
- In many instances, a better prediction can be found for a dependent (response) variable by using more than one independent (explanatory) variable.
- Because the mathematics associated with this concept is complicated, technology is generally used to calculate the multiple regression equation.

CHI-SQUARE GOODNESS-OF-FIT TEST
- Used to test whether a frequency distribution fits an expected distribution.
- The null hypothesis states that the frequency distribution fits the specified distribution; the alternative hypothesis states that it does not.
- To calculate the test statistic for the chi-square goodness-of-fit test, the observed frequencies and the expected frequencies are used.
- The OBSERVED FREQUENCY of a category is the frequency for the category observed in the sample data.
- The EXPECTED FREQUENCY of a category is the calculated frequency for the category.

For the chi-square goodness-of-fit test to be used, the following must be true:
1. The observed frequencies must be obtained by using a random sample.
2. Each expected frequency must be greater than or equal to 5.
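The goodness-of-fit statistic sums (observed - expected)²/expected over the categories. The observed counts and the claimed uniform distribution below are hypothetical:

```python
# Chi-square goodness-of-fit statistic for hypothetical observed counts
# against a claimed uniform distribution over four categories.
observed = [30, 25, 20, 25]
n = sum(observed)                          # 100
expected = [n / 4] * 4                     # each expected frequency is 25 (>= 5)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                     # k - 1 = 3 degrees of freedom
# Compare chi2 with the chi-square critical value for d.f. = 3 at the chosen alpha.
```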

- If these conditions are satisfied, then the sampling distribution for the goodness-of-fit test is approximated by a chi-square distribution with k - 1 degrees of freedom, where k is the number of categories.

R × C CONTINGENCY TABLE
- Shows the observed frequencies for two variables, arranged in r rows and c columns.
- The intersection of a row and a column is called a cell.

CHI-SQUARE INDEPENDENCE TEST

For the chi-square independence test to be used, the following must be true:
1. The observed frequencies must be obtained by using a random sample.
2. Each expected frequency must be greater than or equal to 5.

- If these conditions are satisfied, then the sampling distribution for the chi-square independence test is approximated by a chi-square distribution with (r - 1)(c - 1) degrees of freedom, where r and c are the number of rows and columns, respectively, of the contingency table.

F-DISTRIBUTION
- Let s₁² and s₂² represent the sample variances of two different populations.
- If both populations are normal and the population variances are equal, then the sampling distribution of F = s₁² / s₂² is called an F-distribution.

PROPERTIES OF THE F-DISTRIBUTION
1. The F-distribution is a family of curves, each of which is determined by two types of degrees of freedom: the degrees of freedom corresponding to the variance in the numerator, denoted dfN, and the degrees of freedom corresponding to the variance in the denominator, denoted dfD.
2. F-distributions are positively skewed.
3. The total area under each curve of an F-distribution is equal to 1.
4. F-values are always greater than or equal to 0.
5. For all F-distributions, the mean value of F is approximately equal to 1.
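For the independence test, each cell's expected frequency comes from its row and column totals. The 2 × 2 table below is hypothetical:

```python
# Expected frequencies and the chi-square statistic for a hypothetical
# 2 x 2 contingency table; d.f. = (r - 1)(c - 1) = 1.
table = [[20, 30],
         [30, 20]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        # Expected frequency of a cell: (row total * column total) / grand total.
        exp = row_totals[i] * col_totals[j] / total
        chi2 += (obs - exp) ** 2 / exp

df = (len(table) - 1) * (len(table[0]) - 1)
```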

TWO-SAMPLE F-TEST FOR VARIANCES

To use the two-sample F-test for comparing two population variances, the following must be true:
1. The samples must be randomly selected.
2. The samples must be independent.
3. Each population must have a normal distribution.

ASSUMPTIONS OF A CHI-SQUARE TEST
- Both variables are categorical.
- All observations are independent.
- Cells in the contingency table are mutually exclusive.
- The expected value of cells should be 5 or greater in at least 80% of cells.

ONE-WAY ANOVA
- A hypothesis-testing technique that is used to compare means from three or more populations. It is usually abbreviated as ANOVA.

In a one-way ANOVA test, the following must be true:
1. Each sample must be randomly selected from a normal, or approximately normal, population.
2. The samples must be independent of each other.
3. Each population must have the same variance.

ASSUMPTIONS OF THE ONE-WAY ANOVA
- Each sample was drawn from a normally distributed population.
- The variances of the populations that the samples come from are equal.
- The observations in each group are independent of each other, and the observations within groups were obtained by a random sample.
- NOTE: If these assumptions are not met, then the results of the one-way ANOVA could be unreliable.

ANOVA SUMMARY TABLE
- A table is a convenient way to summarize the results of a one-way ANOVA test.

1. The variance between samples, MSB, measures the differences related to the treatment given to each sample and is sometimes called the MEAN SQUARE BETWEEN.

2. The variance within samples, MSW, measures the differences related to entries within the same sample. This variance, sometimes called the MEAN SQUARE WITHIN, is usually due to sampling error.

- If the conditions for a one-way analysis of variance are satisfied, then the sampling distribution for the test is approximated by the F-distribution.

TWO-WAY ANOVA
- A hypothesis-testing technique that is used to test the effect of two independent variables, or factors, on one dependent variable.
- Example: a medical researcher wants to test the effect of gender and type of medication on the mean length of time it takes pain relievers to provide relief.
TWO-WAY ANOVA HYPOTHESES
- MAIN EFFECT: the effect of one independent variable on the dependent variable.
- INTERACTION EFFECT: the effect of both independent variables together on the dependent variable.
- It is possible to reject none, one, two, or all of the null hypotheses.
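The one-way ANOVA quantities described above (MSB, MSW, and the F statistic) can be computed directly. The three groups below are hypothetical:

```python
import statistics

# One-way ANOVA quantities (MSB, MSW, F) for three hypothetical groups.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = statistics.mean(x for g in groups for x in g)

# MSB: variation of the group means around the grand mean (between samples).
ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
msb = ssb / (k - 1)

# MSW: variation of entries around their own group mean (within samples).
ssw = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
msw = ssw / (n - k)

F = msb / msw                              # compare with the F critical value
```

A large F means the between-group variance dominates the within-group (sampling-error) variance, favoring rejection of equal means.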
PAIRED-SAMPLE SIGN TEST
- Used to test the difference between two population medians when the populations are not normally distributed.
- The difference between corresponding data entries is found, and the sign of the difference is recorded.

For the paired-sample sign test to be used, the following must be true:
1. A sample must be randomly selected from each population.
2. The samples must be dependent (paired).

NON-PARAMETRIC TEST
- A hypothesis test that does not require any specific conditions concerning the shape of the population or the value of any population parameters.
- It is generally easier to perform than parametric tests.
- It is usually less efficient than parametric tests (stronger evidence is required to reject the null hypothesis).

SIGN TEST FOR A POPULATION MEDIAN
- A nonparametric test that can be used to test a population median against a hypothesized value k.

WILCOXON SIGNED-RANK TEST
- A nonparametric test that can be used to determine whether two dependent samples were selected from populations having the same distribution.
- Unlike the sign test, it considers the magnitude, or size, of the data entries.

WILCOXON RANK SUM TEST
- A nonparametric test that can be used to determine whether two independent samples were selected from populations having the same distribution.
- A requirement for the Wilcoxon rank sum test is that the sample size of both samples must be at least 10.

KRUSKAL-WALLIS TEST
- A nonparametric test that can be used to determine whether three or more independent samples were selected from populations having the same distribution.

The Kruskal-Wallis test is better to use in two specific scenarios:
1. When working with ranked data (example: ranks in math and science exams).
2. When one or more extreme outliers are present.

Two conditions for using this test are:
1. Each sample must be randomly selected.
2. The size of each sample must be at least 5.

If these conditions are met, the test is approximated by a chi-square distribution with k - 1 degrees of freedom, where k is the number of samples.

ASSUMPTIONS OF THE SPEARMAN RANK CORRELATION
- When extreme outliers are present in a dataset, Pearson's correlation coefficient is highly affected.

ASSUMPTIONS OF THE MANN-WHITNEY U TEST
- The variable being analyzed is ordinal or continuous.
- All the observations from both groups are independent of each other.
- The shapes of the distributions for the two groups are roughly the same.

SPEARMAN RANK CORRELATION COEFFICIENT
- A measure of the strength of the relationship between two variables.
- A nonparametric equivalent to the Pearson correlation coefficient.
- Calculated using the ranks of paired sample data entries, and denoted by rs.
- Its values range from -1 to 1, inclusive:
  > If the ranks of corresponding data pairs are identical, rs is equal to +1.
  > If the ranks are in "reverse" order, rs is equal to -1.
  > If there is no relationship, rs is equal to 0.
- To determine whether the correlation between variables is significant, you can perform a hypothesis test for the population correlation coefficient ρs.
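A common way to compute rs is the difference-of-ranks formula rs = 1 - 6·Σd² / (n(n² - 1)), which holds when there are no tied ranks. The exam scores below, echoing the ranked-data example, are hypothetical:

```python
# Spearman rank correlation via rs = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
# valid when there are no tied ranks.
def ranks(values):
    order = sorted(values)
    return [order.index(v) + 1 for v in values]    # 1 = smallest; assumes no ties

math_scores = [10, 20, 30, 40, 50]                 # hypothetical exam scores
science_scores = [12, 11, 25, 22, 30]

rx, ry = ranks(math_scores), ranks(science_scores)
n = len(rx)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))               # here 0.8
```

Because only ranks enter the formula, an extreme outlier changes rs far less than it would change Pearson's r.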
FUNDAMENTAL LEVELS OF MEASUREMENT SCALES

NOMINAL SCALE
- A categorical variable scale.
- Consists of categories in each of which the number of respective observations is recorded.
- The categories are in no logical order and have no particular relationship.
- The categories are said to be mutually exclusive since an individual, object, or measurement can be included in only one of them.
ORDINAL SCALE
- Contains more information than the nominal scale.
- Consists of distinct categories in which order is implied.
- Values in one category are larger or smaller than values in other categories.
- Example: ratings of excellent, good, fair, poor.

INTERVAL SCALE
- A set of numerical measurements in which the distance between numbers is of a known, constant size.

RATIO SCALE
- Consists of numerical measurements where the distance between numbers is of a known, constant size; in addition, there is a nonarbitrary zero point.
- The zero point is real and nonarbitrary, so a value of zero means there is nothing.

Typical assumptions of parametric tests are:
1. Normality: data have a normal (Gaussian) distribution, or at least a symmetric one.
2. Homogeneity of variances: data from multiple groups have the same variance.
3. Linearity: data have a linear relationship.
4. Independence: data are independent.
- In addition, the values are measured on an interval or ratio scale.

NON-PARAMETRIC TESTS
- Used when either: 1) the sample is not normally distributed, or 2) the sample size is small.
- The variables are measured on a nominal or ordinal scale.

Examples of classifying variables by measurement scale:
- Religion: Nominal
- Address: Nominal
- IQ Scores: Ordinal
- Size of a T-Shirt: Ordinal
- Speed of a car: Interval
- Land area: Interval
- Civil Status: Nominal
- Salary of workers: Interval
- Number of books in the library: Ratio
- Number of hours spent in studying: Ratio

PARAMETRIC STATISTICAL TEST
- Makes assumptions about the parameters (defining properties) of the population distribution(s) from which one's data are drawn.

EXAMPLES
1. T-test (n < 30), which is further classified into 1-sample and 2-sample.
2. ANOVA (Analysis of Variance): One-way ANOVA, Two-way ANOVA.
3. Pearson's r Correlation.
4. Z-test for large samples (n > 30).

T-TEST
- Used to compare two means:
  > the means of two independent samples or two independent groups, or
  > the means of correlated samples before and after treatment.
- Used when there are fewer than 30 samples, but some researchers use the t-test even when there are more than 30 samples.

DEGREE OF FREEDOM
- An estimate: the number of independent pieces of information that went into calculating the estimate.
