Statistics Notes BS


Statistics

Population
The complete collection of individuals, objects, or measurements that have a
characteristic in common or totality of related observations in a given study is described
as a population. The population that is being studied is also called the target population.
1. Population can be finite (limited in its size) or infinite (unrestricted).
i) Population of trees under specified climatic conditions.
ii) Population of animals fed a certain type of diet.
iii) Population of farms having a certain type of natural fertility.
iv) Population of households, etc.
2. The entire group of individuals is the population.
3. For example, a researcher may be interested in the relation between class size (variable 1)
and academic performance (variable 2) for the population of third-grade children.
A parameter is a value, usually a numerical value, that describes a population. It may be obtained
from a single measurement, or it may be derived from a set of measurements from the
population.
Sample
Usually populations are so large that a researcher cannot examine the entire group.
Therefore, a sample is selected to represent the population in a research study. The goal
is to use the results obtained from the sample to help answer questions about the
population.

A statistic is a value, usually a numerical value, that describes a sample. It may be obtained
from a single measurement, or it may be derived from a set of measurements from the
sample.

Sampling Error
Sampling error is the discrepancy, or amount of error, that exists between a sample statistic and
the corresponding population parameter. A statistic always has some margin of error.
Variables
• A variable is a characteristic or condition that can change or take on different values.
• Most research begins with a general question about the relationship between two
variables for a specific group of individuals.
Categorical variables (Qualitative variables)
1. Nominal: A nominal variable is one that describes a name, label, or category without a natural order.
2. Ordinal: An ordinal variable is a variable whose values are defined by an order relation between the different categories.

Numeric variables (Quantifiable characteristics)
1. Continuous variables (such as time or weight) are infinitely divisible into whatever units a researcher may choose. For example, time can be measured to the nearest minute, second, half-second, etc.
2. Discrete variables (such as class size) consist of indivisible categories.

Scales of Measurement

Nominal: an unordered set of categories identified only by name. Nominal measurements only permit you to determine whether two individuals are the same or different.
Ordinal: an ordered set of categories. Ordinal measurements tell you the direction of difference between two individuals.
Interval: an ordered series of equal-sized categories. Interval measurements identify the direction and magnitude of a difference. The zero point is located arbitrarily on an interval scale.
Ratio: an interval scale where a value of zero indicates none of the variable. Ratio measurements identify the direction and magnitude of differences and allow ratio comparisons of measurements.

Data
The measurements obtained in a research study are called the data. The goal of statistics is to
help researchers organize and interpret the data.
(Figure: Population and Sample)

Sampling Distribution/ Types/Nature of Curve


1. Bell-shaped curve
Normal distribution. The normal distribution is the proper term for a probability bell
curve. In the standard normal distribution, the mean is 0 and the standard deviation is 1.

2. Skewness and Kurtosis


i. Skewness is used to denote the horizontal pull on the data. It is the measure of
asymmetry that occurs when our data deviate from the norm. Skewness measures the
symmetry of a variable's distribution. If the distribution stretches toward the right or left
tail, it is skewed. Negative skewness indicates a concentration of larger values (a longer left
tail), while positive skewness indicates a concentration of smaller values (a longer right tail).
A skewness value between -1 and +1 is excellent, while -2 to +2 is generally acceptable.
Values beyond -2 and +2 suggest substantial nonnormality (Hair et al., 2022).
ii. Kurtosis is used to find the vertical pull, or the peak's height, and to detect the presence
of outliers in our data; it gives the total degree of outliers present. Kurtosis indicates
whether the distribution is too peaked or too flat compared to a normal distribution.
Positive kurtosis means a more peaked distribution, while negative kurtosis means a
flatter one. A kurtosis greater than +2 suggests a distribution that is too peaked, while one
less than -2 indicates a distribution that is too flat. When skewness and kurtosis are close to
zero, the distribution is considered approximately normal (Hair et al., 2022); in the rare
scenario where both skewness and kurtosis are exactly zero, the pattern of responses is
considered a normal distribution.
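As a rough illustration, skewness and kurtosis can be computed with SciPy; the sample values below are hypothetical, and scipy.stats.kurtosis reports excess kurtosis (0 for a normal distribution) by default.

import numpy as np
from scipy import stats

# Hypothetical sample of survey scores (illustrative only)
scores = np.array([3, 4, 4, 5, 5, 5, 6, 6, 7, 9, 12])

skewness = stats.skew(scores)             # > 0 here: a longer right tail
excess_kurtosis = stats.kurtosis(scores)  # Fisher's definition: normal curve = 0

print(f"Skewness: {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")

# Screening rule described above (Hair et al., 2022):
# values within -2 to +2 are generally acceptable.
for name, value in [("skewness", skewness), ("kurtosis", excess_kurtosis)]:
    flag = "acceptable" if -2 <= value <= 2 else "substantial nonnormality"
    print(f"{name}: {flag}")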

Level of Significance (Alpha Level or P value)


• The level of significance (alpha level) is a probability value that is used to define the concept of
“very unlikely” in a hypothesis test.
Critical Region

• The critical region is composed of the extreme sample values that are very unlikely (as
defined by the alpha level) to be obtained if the null hypothesis is true. The boundaries
for the critical region are determined by the alpha level. If sample data fall in the critical
region, the null hypothesis is rejected.
• Technically, the critical region is defined by sample outcomes that are very unlikely to
occur if the treatment has no effect (that is, if the null hypothesis is true).

Confidence Interval

A confidence interval, in statistics, is a range of values that is expected to contain a population
parameter with a stated level of confidence. Analysts often use confidence intervals
of 95% or 99%. Thus, if a statistical model produces a point estimate of 10.00 with a 95%
confidence interval of 9.50 to 10.50, one is 95% confident that the true value falls within that
range.
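A minimal sketch of computing a 95% confidence interval for a mean, using hypothetical data and the t distribution from SciPy:

import numpy as np
from scipy import stats

# Hypothetical sample measurements (illustrative only)
sample = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 9.7, 10.3, 10.0])

mean = sample.mean()
sem = stats.sem(sample)      # standard error of the mean
df = len(sample) - 1

# 95% confidence interval based on the t distribution
lower, upper = stats.t.interval(0.95, df, loc=mean, scale=sem)
print(f"Mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")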

Directional and Non Directional hypothesis


Directional Hypothesis
A hypothesis that is built upon a certain directional relationship between two variables and constructed upon an already existing theory is called a directional hypothesis. For example: "Girls perform better than boys" ('better than' shows the direction predicted).

Non-Directional Hypothesis
An open-ended, non-directional hypothesis predicts that the independent variable will influence the dependent variable; however, the nature or direction of the relationship between the two variables is not defined or clear.
One tail and Two tail tests

One-tailed Tests
A one-tailed test may be either left-tailed or right-tailed, and has only one critical region.
A left-tailed test is used when the alternative hypothesis states that the true value of the parameter specified in the null hypothesis is less than the null hypothesis claims.
A right-tailed test is used when the alternative hypothesis states that the true value of the parameter specified in the null hypothesis is greater than the null hypothesis claims.

Two-tailed Tests
A two-tailed test has two critical regions; this is the main difference between one-tailed and two-tailed tests. If we require a 100(1 − α)% confidence interval, we have to make some adjustments when using a two-tailed test.
Type One and Type Two (I & II) Errors

Type I Error
A Type I error occurs when a researcher rejects a null hypothesis that is actually true. In a typical research situation, a Type I error means that the researcher concludes that a treatment does have an effect when, in fact, it has no effect. A Type I error is therefore a false report: Type I errors lead to false reports in the scientific literature. Other researchers may try to build theories or develop other experiments based on the false results, and a lot of precious time and resources may be wasted.

Type II Error
A Type II error occurs when a researcher fails to reject a null hypothesis that is really false. In a typical research situation, a Type II error means that the hypothesis test has failed to detect a real treatment effect. It occurs when the sample mean is not in the critical region even though the treatment has had an effect on the sample. Often this happens when the effect of the treatment is relatively small. The consequences of a Type II error are usually not as serious as those of a Type I error.
Percentage (%)

An amount, such as an allowance or commission, that is a proportion of a larger sum of money.

A rate, number, or amount in each hundred.

Percentile Ranks

The rank or percentile rank of a particular score is defined as the percentage of individuals in the
distribution with scores at or below the particular value. When a score is identified by its
percentile rank, the score is called a percentile. Percentile describes the individual’s exact
position in the population.

Minimum and Maximum Values

The highest value of a function is considered the maximum value of the function, and the lowest
value of the function is considered the minimum value of the function.

Standard Deviation (σ)

Is a measure that shows how much variation (spread, dispersion) from the mean exists. The
standard deviation indicates a “typical” deviation from the mean. It is a popular measure of
variability because it is expressed in the original units of measure of the data set. As with the
variance, if the data points are close to the mean there is little variation, whereas if the data
points are highly spread out from the mean the variation is high. Standard deviation
calculates the extent to which the values differ from the average. Standard deviation, the most
widely used measure of dispersion, is based on all values; therefore a change in even one value
affects the value of the standard deviation.
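A small sketch, using hypothetical values, of the sample standard deviation in NumPy (ddof=1 gives the sample formula with n − 1 in the denominator):

import numpy as np

# Hypothetical exam scores (illustrative only)
scores = np.array([62, 70, 71, 74, 78, 80, 85, 92])

mean = scores.mean()
sd = scores.std(ddof=1)   # sample standard deviation (divides by n - 1)

print(f"Mean = {mean:.2f}, SD = {sd:.2f}")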

Z-scores

Z-Score, also known as the standard score, indicates how many standard deviations an entity is
from the mean.

• A z-score greater than 0 represents an element greater than the mean.


• A z-score of less than 0 represents an element less than the mean.
• A z-score equal to 0 represents an element equal to the mean.
• A z-score equal to 1 represents an element, which is 1 standard deviation greater than the
mean; a z-score equal to 2 signifies 2 standard deviations greater than the mean; etc.
• A z-score equal to -1 represents an element, which is 1 standard deviation less than the
mean; a z-score equal to -2 signifies 2 standard deviations less than the mean; etc.
• If the number of elements in the set is large, about 68% of the elements have a z-score
between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-
score between -3 and 3.
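A minimal sketch of z-scores with SciPy, using hypothetical data; scipy.stats.zscore standardizes each value as (x − mean) / SD:

import numpy as np
from scipy import stats

# Hypothetical test scores (illustrative only)
scores = np.array([55, 60, 65, 70, 75, 80, 85])

z = stats.zscore(scores, ddof=1)   # (x - mean) / sample SD
for x, zx in zip(scores, z):
    print(f"score {x}: z = {zx:+.2f}")

# Roughly 68% of values should fall within one SD of the mean in large samples
within_one_sd = np.mean(np.abs(z) <= 1)
print(f"Proportion with |z| <= 1: {within_one_sd:.2f}")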

Statistical power

Statistical power, or sensitivity, is the likelihood of a significance test detecting an effect when
there actually is one. A true effect is a real, non-zero relationship between variables in a
population. An effect is usually indicated by a real difference between groups or a correlation
between variables. High power in a study indicates a large chance of a test detecting a true effect.
Low power means that your test only has a small chance of detecting a true effect.

Effect size

It tells you how meaningful the relationship between variables or the difference between groups
is. It indicates the practical significance of a research outcome. A large effect size means that a
research finding has practical significance, while a small effect size indicates limited practical
applications.

Effect size    Cohen's d    Pearson's r
Small          0.2          .1 to .3 (or -.1 to -.3)
Medium         0.5          .3 to .5 (or -.3 to -.5)
Large          0.8          .5 or greater (or -.5 or less)
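A sketch of computing Cohen's d for two independent groups from hypothetical data, using the pooled standard deviation; the thresholds in the table above are then used for interpretation:

import numpy as np

# Hypothetical scores for two independent groups (illustrative only)
group1 = np.array([23, 25, 28, 30, 31, 33, 35])
group2 = np.array([20, 22, 24, 25, 27, 28, 30])

n1, n2 = len(group1), len(group2)
s1, s2 = group1.std(ddof=1), group2.std(ddof=1)

# Pooled standard deviation
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
cohens_d = (group1.mean() - group2.mean()) / pooled_sd

print(f"Cohen's d = {cohens_d:.2f}")  # ~0.2 small, ~0.5 medium, ~0.8 large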

Descriptive statistics
Descriptive statistics are methods for organizing and summarizing data. For example, tables or graphs are used to organize data, and descriptive values such as the average score are used to summarize data. A descriptive value for a population is called a parameter and a descriptive value for a sample is called a statistic.

Inferential statistics
Inferential statistics are methods for using sample data to make general conclusions (inferences) about populations. Because a sample is typically only a part of the whole population, sample data provide only limited information about the population. As a result, sample statistics are generally imperfect representatives of the corresponding population parameters.

Parametric tests (assumptions)
1. Normality – Data in each group should be normally distributed.
2. Equal Variance – Data in each group should have approximately equal variance.
3. Independence – Data in each group should be randomly and independently sampled from the population.
4. No Outliers – There should be no extreme outliers.
5. Data measured at the interval or ratio level.

Non-parametric tests
1. Normality not required
2. Equal variance not assumed
3. Independence of observations not required
4. Outliers may be present

Tests of Differences (parametric → non-parametric equivalent)
Independent sample t-test → Mann-Whitney U test
Paired sample t-test → Wilcoxon signed-rank test
One-way ANOVA → Kruskal-Wallis one-way ANOVA
Repeated Measures ANOVA → Friedman test

Tests of Relationships/Prediction
Pearson Product-Moment correlation → Spearman's Rank correlation

Compare and Contrast of Parametric and Non-Parametric

Population: Parametric – proper understanding of the population is available; Non-parametric – no detailed information about the population is available.
Distribution: Parametric – normal; Non-parametric – does not require the population to be normal (can be arbitrary).
Sample size: Parametric – requires a sample size over 30; Non-parametric – can work with small samples.
Interpretability: Parametric – easy to interpret; Non-parametric – more difficult to interpret.
Implementation: Parametric – more difficult to implement; Non-parametric – easy to implement.
Reliability: Parametric – output is more reliable and powerful; Non-parametric – less powerful and reliable.
Types of variables: Parametric – works with continuous/quantitative variables; Non-parametric – works with continuous/quantitative as well as categorical/discrete variables.
Central tendency: Parametric – typically measured using the mean; Non-parametric – typically measured using the median.
Outliers: Parametric – affected by outliers; Non-parametric – less affected by outliers.
One Sample t Test
The One Sample t Test examines whether the mean of a population is statistically different from
a known or hypothesized value. The One Sample t Test is a parametric test.
This test is also known as:
• Single Sample t Test
The variable used in this test is known as:
• Test variable
In a One Sample t Test, the test variable's mean is compared against a "test value", which is a
known or hypothesized value of the mean in the population. Test values may come from a
literature review, a trusted research organization, legal requirements, or industry standards.
Data Requirements:
1. Test variable that is continuous (i.e., interval or ratio level)
2. Scores on the test variable are independent (i.e., independence of observations)
• There is no relationship between scores on the test variable
• Violation of this assumption will yield an inaccurate p value
3. Random sample of data from the population
4. Normal distribution (approximately) of the sample and population on the test variable
• Non-normal population distributions, especially those that are thick-tailed or
heavily skewed, considerably reduce the power of the test
• Among moderate or large samples, a violation of normality may still yield
accurate p values
5. Homogeneity of variances (i.e., variances approximately equal in both the sample and
population)
6. No outliers
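A minimal sketch of a one sample t test in SciPy, assuming hypothetical data and a hypothesized test value of 100:

import numpy as np
from scipy import stats

# Hypothetical scores (illustrative only); hypothesized test value = 100
sample = np.array([98, 102, 105, 99, 101, 104, 97, 103, 106, 100])

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Reject H0 (population mean = 100) if p < alpha (e.g., .05)
alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")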

Independent Samples t Test


The Independent Samples t Test compares the means of two independent groups in order to
determine whether there is statistical evidence that the associated population means are
significantly different. The Independent Samples t Test is a parametric test.
This test is also known as:
• Independent t Test
• Independent Measures t Test
• Independent Two-sample t Test
• Student t Test
• Two-Sample t Test
• Uncorrelated Scores t Test
• Unpaired t Test
• Unrelated t Test
The variables used in this test are known as:
• Dependent variable, or test variable
• Independent variable, or grouping variable
Your data must meet the following requirements:
1. Dependent variable that is continuous (i.e., interval or ratio level)
2. Independent variable that is categorical and has exactly two categories
3. Cases that have values on both the dependent and independent variables
4. Independent samples/groups (i.e., independence of observations)
• There is no relationship between the subjects in each sample. This means that:
• Subjects in the first group cannot also be in the second group
• No subject in either group can influence subjects in the other group
• No group can influence the other group
• Violation of this assumption will yield an inaccurate p value
5. Random sample of data from the population
6. Normal distribution (approximately) of the dependent variable for each group
• Non-normal population distributions, especially those that are thick-tailed or
heavily skewed, considerably reduce the power of the test
• Among moderate or large samples, a violation of normality may still yield
accurate p values
7. Homogeneity of variances (i.e., variances approximately equal across groups)
• When this assumption is violated and the sample sizes for each group differ,
the p value is not trustworthy. However, the Independent Samples t Test output
also includes an approximate t statistic that is not based on assuming equal
population variances. This alternative statistic, called the Welch t Test statistic,
may be used when equal variances among populations cannot be assumed. The
Welch t Test is also known as an Unequal Variance t Test or Separate
Variances t Test.
8. No outliers
When one or more of the assumptions for the Independent Samples t Test are not met, the
nonparametric Mann-Whitney U Test can be used instead (see the sketch below).
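A sketch with hypothetical data showing the independent samples t test in SciPy; equal_var=False requests the Welch version, and mannwhitneyu is the non-parametric fallback mentioned above:

import numpy as np
from scipy import stats

# Hypothetical scores for two independent groups (illustrative only)
group_a = np.array([78, 82, 85, 88, 90, 76, 84])
group_b = np.array([70, 74, 79, 72, 77, 75, 73])

# Levene's test for homogeneity of variances
lev_stat, lev_p = stats.levene(group_a, group_b)

# Student's t test if variances look equal, otherwise Welch's t test
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=(lev_p >= 0.05))
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Non-parametric alternative when assumptions are not met
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.3f}")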
Paired Samples t Test
The Paired Samples t Test is a parametric test.
This test is also known as:
• Dependent t Test
• Paired t Test
• Repeated Measures t Test
The Paired Samples t Test compares the means of two measurements taken from the same
individual, object, or related units. These "paired" measurements can represent things like:
• A measurement taken at two different times (e.g., pre-test and post-test score with an
intervention administered between the two time points)
• A measurement taken under two different conditions (e.g., completing a test under a
"control" condition and an "experimental" condition)
• Measurements taken from two halves or sides of a subject or experimental unit (e.g.,
measuring hearing loss in a subject's left and right ears).
The purpose of the test is to determine whether there is statistical evidence that the mean
difference between paired observations is significantly different from zero.
Data Requirements:
1. Dependent variable that is continuous (i.e., interval or ratio level)
2. Related samples/groups (i.e., dependent observations)
1. The subjects in each sample, or group, are the same. This means that the subjects
in the first group are also in the second group.
3. Random sample of data from the population
4. Normal distribution (approximately) of the difference between the paired values
5. No outliers in the difference between the two related groups
When one or more of the assumptions for the Paired Samples t Test are not met, the nonparametric
Wilcoxon Signed-Rank Test can be used instead (see the sketch below).
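A minimal sketch of a paired samples t test with hypothetical pre-test/post-test scores, plus the Wilcoxon signed-rank fallback:

import numpy as np
from scipy import stats

# Hypothetical pre-test and post-test scores for the same subjects (illustrative only)
pre  = np.array([60, 65, 58, 70, 62, 68, 64])
post = np.array([66, 70, 61, 75, 65, 72, 69])

t_stat, p_value = stats.ttest_rel(pre, post)
print(f"Paired t = {t_stat:.2f}, p = {p_value:.3f}")

# Non-parametric alternative based on the signed ranks of the differences
w_stat, w_p = stats.wilcoxon(pre, post)
print(f"Wilcoxon W = {w_stat:.1f}, p = {w_p:.3f}")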
One-Way ANOVA
One-Way ANOVA is a parametric test.
This test is also known as:
• One-Factor ANOVA
• One-Way Analysis of Variance
• Between Subjects ANOVA
One-Way ANOVA ("analysis of variance") compares the means of two or more independent
groups in order to determine whether there is statistical evidence that the associated population
means are significantly different.
The variables used in this test are known as:
• Dependent variable
• Independent variable (also known as the grouping variable, or factor)
• This variable divides cases into two or more mutually exclusive levels, or groups.
The One-Way ANOVA is often used to analyze data from the following types of studies:
• Field studies
• Experiments
• Quasi-experiments
Data Requirements:
1. Dependent variable that is continuous (i.e., interval or ratio level)
2. Independent variable that is categorical (i.e., two or more groups)
3. Cases that have values on both the dependent and independent variables
4. Independent samples/groups (i.e., independence of observations)
1. There is no relationship between the subjects in each sample. This means that:
1. subjects in the first group cannot also be in the second group
2. no subject in either group can influence subjects in the other group
3. no group can influence the other group
5. Random sample of data from the population
6. Normal distribution (approximately) of the dependent variable for each group (i.e., for
each level of the factor)
1. Non-normal population distributions, especially those that are thick-tailed or
heavily skewed, considerably reduce the power of the test
2. Among moderate or large samples, a violation of normality may yield fairly
accurate p values
7. Homogeneity of variances (i.e., variances approximately equal across groups)
1. When this assumption is violated and the sample sizes differ among groups,
the p value for the overall F test is not trustworthy. These conditions warrant
using alternative statistics that do not assume equal variances among populations,
such as the Brown-Forsythe or Welch statistics (available via Options in the
One-Way ANOVA dialog box).
2. When this assumption is violated, regardless of whether the group sample sizes
are fairly equal, the results may not be trustworthy for post hoc tests. When
variances are unequal, post hoc tests that do not assume equal variances should be
used (e.g., Dunnett’s C).
8. No outliers
When the normality, homogeneity of variances, or outliers assumptions for One-Way ANOVA
are not met, the nonparametric Kruskal-Wallis test can be applied instead (see the sketch below).
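A sketch of a one-way ANOVA on three hypothetical groups with SciPy, along with the Kruskal-Wallis alternative:

import numpy as np
from scipy import stats

# Hypothetical scores for three independent groups (illustrative only)
group1 = np.array([85, 88, 90, 86, 87])
group2 = np.array([78, 80, 82, 79, 81])
group3 = np.array([90, 92, 94, 91, 93])

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# Non-parametric alternative when ANOVA assumptions are not met
h_stat, h_p = stats.kruskal(group1, group2, group3)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {h_p:.3f}")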
Repeated Measure ANOVA
An ANOVA with repeated measures is used to compare three or more group means where the
participants are the same in each group.
This usually occurs in two situations:
(1) When participants are measured multiple times to see changes in response to an intervention; or
(2) When participants are subjected to more than one condition/trial and the responses to these
conditions are to be compared.
Whilst the repeated measures ANOVA is used when you have just "one" independent variable, if
you have "two" independent variables (e.g., you measured time and condition), you will need to
use a two-way repeated measures ANOVA.
Assumptions:
1. Dependent variable should be measured at the continuous level (i.e., they
are interval or ratio variables).
2. Independent variable should consist of at least two categorical, "related
groups" or "matched pairs".
3. There should be no significant outliers in the related groups.
4. The distribution of the dependent variable in the two or more related groups should
be approximately normally distributed.
5. Known as sphericity, the variances of the differences between all combinations of
related groups must be equal. Sphericity is commonly checked with Mauchly's test: if its p-value
is not less than the significance level (e.g. α = .05), we fail to reject the null hypothesis
(H0: the variances of the differences are equal) and conclude that the assumption of
sphericity is met.
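A rough sketch of a one-way repeated measures ANOVA using statsmodels' AnovaRM, with an entirely hypothetical long-format data frame (columns subject, time, score are assumed names):

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 4 subjects measured at 3 time points (illustrative only)
data = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":    ["t1", "t2", "t3"] * 4,
    "score":   [5, 7, 9, 4, 6, 8, 6, 7, 10, 5, 8, 9],
})

# One within-subjects factor: time
result = AnovaRM(data, depvar="score", subject="subject", within=["time"]).fit()
print(result)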

Correlation Analysis
Correlation Analysis is a statistical method that is used to discover whether there is a relationship
between two variables/datasets, and how strong that relationship may be.

Parametric: Pearson Product-Moment Coefficient


• The value of “r” lies between -1 and +1.
• The value of “r” indicates the strength of the correlation and is also called a measure of
effect size.
Weak correlation: 0 to .2
Moderate: .3 to .6
Strong: .7 to 1

• The significance value (less than .05) should also be considered when judging the importance
of a correlation.
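A minimal sketch of Pearson and Spearman correlations in SciPy with hypothetical paired data:

import numpy as np
from scipy import stats

# Hypothetical paired observations (illustrative only)
hours_studied = np.array([2, 3, 4, 5, 6, 7, 8, 9])
exam_score    = np.array([52, 58, 60, 65, 70, 74, 79, 85])

r, r_p = stats.pearsonr(hours_studied, exam_score)        # parametric
rho, rho_p = stats.spearmanr(hours_studied, exam_score)   # non-parametric (rank-based)

print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")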
