Quantitative Research Techniques and Statistics Notes


Cecelia Hof

WK1 Assignment: Quantitative Module Pre-Test and Study Notes


Main focus from pre-test results:
- Analysis of Variance
- Data Collection and Sampling
- Intro to Hypothesis Testing
- Sampling Distributions
Sources:
- Peregrine Quantitative Research Techniques and Statistics Modules
- Chapter 3 of Introduction to Business Analytics

Descriptive Statistics
1. Definition: Descriptive statistics focuses on organizing, summarizing, and presenting
data to reveal its key features in an informative way. It helps in understanding the
characteristics of a dataset.
2. Graphical Techniques: Graphical methods like histograms or bar graphs visually
represent data distributions, showing patterns such as normal distribution, skewness, or
multimodality.
3. Numerical Techniques: Numerical methods summarize data using measures like the
mean (average), median (midpoint), mode (most frequent value), and range (difference
between highest and lowest values). Variance and standard deviation are also used to
describe data variability.
4. Practical Use: Descriptive statistics are useful for summarizing data, such as estimating
annual profits from an exclusivity agreement by analyzing sample data to understand
overall consumption patterns, which then informs decision-making.
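
A quick Python sketch of these numerical measures, using the standard library (the data values are made up for illustration):

# Minimal sketch: numerical descriptive statistics for a small invented sample.
import statistics

data = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

print("mean:", statistics.mean(data))          # average
print("median:", statistics.median(data))      # midpoint
print("mode:", statistics.mode(data))          # most frequent value
print("range:", max(data) - min(data))         # highest minus lowest
print("variance:", statistics.variance(data))  # sample variance (n - 1 denominator)
print("std dev:", statistics.stdev(data))      # sample standard deviation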

Inferential Statistics
1. Purpose: Inferential statistics involves using sample data to make inferences or draw
conclusions about a larger population.
2. Sampling: Instead of surveying an entire population, a smaller sample is used to infer
characteristics of the whole. This is more practical and cost-effective.
3. Accuracy and Uncertainty: Predictions based on samples come with some degree of
uncertainty. The accuracy of these predictions is typically expressed as a confidence
level, often between 90% and 99%.
In summary, inferential statistics helps in making educated guesses about large populations
based on sample data, though these inferences come with inherent uncertainty and are subject
to correction as new information becomes available.

Key Concepts
Statistical analysis rests on three key concepts: population, sample, and statistical inference. A
population encompasses all items of interest, such as the diameters of ball bearings, and its
descriptive measure is a parameter. A sample is a subset of the population, and its descriptive
measure is a statistic. Statistical inference uses sample data to estimate, predict, or make
decisions about the larger population. Since examining the entire population is often
impractical, samples are used instead. The reliability of these inferences is quantified by
confidence levels (the proportion of times an estimate is correct) and significance levels (the
likelihood of incorrect conclusions).
Confidence Level + Significance Level = 1
Data Collection and Sampling
Data collection in statistics involves several methods, including direct observation, experiments,
and surveys. Direct observation is inexpensive but may yield limited insight and is prone to bias. Experiments offer more reliable data but are costlier. Surveys, which can be conducted
through personal interviews, telephone interviews, or self-administered questionnaires, vary in
cost and response accuracy. Key aspects of surveys include ensuring a high response rate and
designing questions clearly to avoid biases.

Sampling methods are used to make inferences about a population based on a smaller sample,
with common techniques including simple random sampling, stratified random sampling, and
cluster sampling. Each method has its advantages and drawbacks in terms of cost, accuracy,
and representativeness.

Sampling errors arise from natural variations between the sample and the population, while
non-sampling errors result from issues like data acquisition mistakes or non-responses. Non-
sampling errors can seriously affect results and are not mitigated by increasing sample size.
Ensuring accurate and representative data collection is crucial to valid statistical analysis.

Sampling Plans
- A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.
- A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum.
- A cluster sample is a simple random sample of groups or clusters of elements versus a simple random sample of individual objects.
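
A small Python sketch contrasting the three sampling plans; the population frame, strata, and cluster sizes below are invented for illustration:

# Illustrative sketch of the three sampling plans using Python's random module.
import random

population = list(range(1, 101))                               # IDs 1..100
strata = {"north": population[:50], "south": population[50:]}  # two made-up strata
clusters = [population[i:i + 10] for i in range(0, 100, 10)]   # 10 clusters of 10

random.seed(1)

# Simple random sample: every subset of size 10 is equally likely.
srs = random.sample(population, k=10)

# Stratified random sample: a simple random sample drawn from each stratum.
stratified = {name: random.sample(units, k=5) for name, units in strata.items()}

# Cluster sample: randomly choose whole clusters, then keep every element in them.
chosen_clusters = random.sample(clusters, k=2)
cluster_sample = [unit for c in chosen_clusters for unit in c]

print(srs, stratified, cluster_sample, sep="\n")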

Probability
To understand probability, we first need to define a random experiment, which is an action or
process leading to one of several possible outcomes.
Probabilities are assigned to outcomes using a sample space, which is a list of all possible
outcomes that is both exhaustive (includes all possibilities) and mutually exclusive (no two
outcomes can occur simultaneously).
Assigning Probabilities involves three approaches:
1. Classical Approach: Used for well-defined scenarios like games of chance. If an
experiment has n possible outcomes, each outcome is assigned a probability of 1/n.
2. Relative Frequency: Defines probability based on the long-run frequency of outcomes.
For instance, if an event occurs a certain number of times in a large number of trials, its
probability is estimated as the ratio of the number of occurrences to the total number of
trials. This estimate becomes more accurate with a larger sample size.
3. Subjective Approach: Used when historical data is unavailable or impractical. This
involves assigning probabilities based on personal judgment or belief, such as
predictions based on experience or analysis.
Interpreting Probability involves understanding it as the long-term relative frequency of an
event occurring. This approach links probability with statistical inference and real-world
applications.
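
A short Python simulation of the relative-frequency idea, using a fair die so the classical probability (1/6) is known; the trial counts are arbitrary:

# Sketch: the observed relative frequency of rolling a 3 approaches the
# classical probability 1/6 as the number of trials grows.
import random

random.seed(0)
for trials in (100, 10_000, 1_000_000):
    hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 3)
    print(f"{trials:>9} rolls: relative frequency of a 3 = {hits / trials:.4f}")
print("classical probability:", 1 / 6)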

Joint, Marginal, and Conditional Probability


To calculate probabilities for more complex events from related simpler events, we use several
key concepts:
1. Intersection of Events: This refers to the occurrence of both events A and B
simultaneously. It is denoted as A∩B and represents the joint probability P(A∩B).
2. Union of Events: This denotes the occurrence of either event A, event B, or both. It is
denoted as A∪B and is calculated as P(A∪B)=P(A)+P(B)−P(A∩B)
3. Marginal Probability: This measures the likelihood of an individual event occurring
regardless of other events. It is obtained by summing the joint probabilities for rows or
columns in a table.
4. Conditional Probability: This determines the probability of an event A given that another
event B has occurred, denoted as P(A∣B). It is calculated using P(A∣B)=P(A∩B)/P(B).
5. Independence of Events: Two events A and B are independent if P(A∣B)=P(A) or
P(B∣A)=P(B). This means the occurrence of one event does not affect the probability of
the other.
Using these concepts, we can analyze probabilities, calculate the likelihood of various events,
and understand their relationships.
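
A small Python sketch applying these definitions to a 2x2 table of joint probabilities; the numbers are invented and sum to 1:

# Sketch: joint, marginal, and conditional probabilities from a joint-probability table.
joint = {
    ("A", "B"): 0.11, ("A", "not B"): 0.29,
    ("not A", "B"): 0.06, ("not A", "not B"): 0.54,
}

p_A = joint[("A", "B")] + joint[("A", "not B")]       # marginal P(A): sum across the row
p_B = joint[("A", "B")] + joint[("not A", "B")]       # marginal P(B): sum down the column
p_A_and_B = joint[("A", "B")]                         # joint probability P(A and B)
p_A_given_B = p_A_and_B / p_B                         # conditional P(A|B) = P(A and B)/P(B)
p_A_or_B = p_A + p_B - p_A_and_B                      # union via the addition rule

print(p_A, p_B, p_A_given_B, p_A_or_B)
print("independent?", abs(p_A_given_B - p_A) < 1e-9)  # independence check: P(A|B) = P(A)?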

Probability Rules and Trees


- Complement Rule: The complement rule states that the probability of an event not
occurring is 1 minus the probability of the event occurring. This rule helps determine the
likelihood of the event not happening by subtracting the probability of the event
happening from 1.
- Multiplication Rule: The multiplication rule calculates the joint probability of two events.
- Addition Rule: The addition rule calculates the probability of either of two events
occurring, including the possibility of both. This rule accounts for the overlap between
events to avoid double-counting.
- Probability Trees: Probability trees are visual representations of the probabilities of
different outcomes in a sequential process. Each branch of the tree represents an event
and its probability. The joint probabilities are found by multiplying the probabilities
along the branches, and the total of all probabilities at the ends of the branches should
sum to 1.
These rules and methods provide various approaches to calculating probabilities and
understanding the likelihood of different outcomes in both simple and complex scenarios.
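
A short Python sketch of a two-stage probability tree; the branch probabilities are made up. Joint probabilities come from multiplying along branches, and the end-of-branch probabilities sum to 1:

# Sketch of a two-stage probability tree.
p_first = {"pass": 0.72, "fail": 0.28}                    # first-stage branches
p_second_given = {"pass": {"pass": 0.90, "fail": 0.10},   # second stage, conditional on first
                  "fail": {"pass": 0.40, "fail": 0.60}}

joint = {}
for first, p1 in p_first.items():
    for second, p2 in p_second_given[first].items():
        joint[(first, second)] = p1 * p2                  # multiplication rule along a branch

print(joint)
print("sum of end-of-branch probabilities:", sum(joint.values()))  # should be 1.0
print("P(at least one fail) =", 1 - joint[("pass", "pass")])       # complement rule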

Sampling Distributions
A sampling distribution is the probability distribution of a given statistic based on a random
sample. It shows how the statistic varies from sample to sample.
1. Distribution of the Sample Mean:
a. If you take multiple samples from a population and calculate the mean of each
sample, the distribution of these sample means is called the sampling
distribution of the sample mean.
b. According to the Central Limit Theorem, if the sample size is sufficiently large,
the sampling distribution of the sample mean will be approximately normally
distributed, regardless of the shape of the population distribution.
2. Standard Error:
a. The standard error measures the dispersion of the sample statistic around the
population parameter. For the sample mean, it is the standard deviation of the
sampling distribution of the sample mean.
3. Central Limit Theorem (CLT):
a. The CLT states that the sampling distribution of the sample mean will tend to be
normal (or approximately normal) if the sample size is large enough, regardless
of the population's distribution.
4. Sampling Distribution of Proportions:
a. When dealing with proportions, the sampling distribution of the sample
proportion (e.g., proportion of successes in a sample) also approaches a normal
distribution as the sample size increases, provided the sample size is large enough for the normal approximation to hold (np and n(1−p) both sufficiently large).
5. Application:
a. Sampling distributions are used to estimate population parameters, construct
confidence intervals, and conduct hypothesis tests.
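
A small Python simulation of the Central Limit Theorem: sample means drawn from a skewed exponential population (parameters chosen arbitrarily) cluster around the population mean, with spread close to the standard error σ/√n:

# Sketch: sampling distribution of the sample mean from a skewed population.
import random, statistics

random.seed(0)
n, reps = 40, 5_000   # sample size and number of repeated samples

# Exponential population with mean 1 and standard deviation 1.
sample_means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
                for _ in range(reps)]

print("mean of sample means:", round(statistics.mean(sample_means), 3))   # close to 1.0
print("sd of sample means:  ", round(statistics.stdev(sample_means), 3))  # close to 1/sqrt(40) ≈ 0.158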

Intro to Hypothesis Testing


Concept: Hypothesis testing is a statistical method used to make decisions about a population
based on sample data. It involves assessing evidence to support or reject a hypothesis about a
population parameter.
Key Components:
1. Hypotheses:
a. Null Hypothesis (H0): The default assumption that there is no effect or
difference.
b. Alternative Hypothesis (H1): The hypothesis that contradicts the null hypothesis,
representing an effect or difference.
2. Testing Procedure:
a. Begin by assuming the null hypothesis is true.
b. Use sample data to calculate a test statistic.
c. Determine if the test statistic provides enough evidence to reject the null
hypothesis in favor of the alternative hypothesis.
3. Decisions:
a. Reject H0: Conclude there is sufficient evidence to support the alternative
hypothesis.
b. Fail to Reject H0: Conclude there is not enough evidence to support the
alternative hypothesis.
4. Errors:
a. Type I Error (α): Occurs when the null hypothesis is rejected when it is actually
true (false positive).
b. Type II Error (β): Occurs when the null hypothesis is not rejected when it is
actually false (false negative).
5. Significance Level:
a. α (Alpha): The probability of making a Type I error, often set at 0.05 or 5%.
b. β (Beta): The probability of making a Type II error.
Hypothesis testing involves comparing sample data against the null hypothesis to determine if
there is enough statistical evidence to support the alternative hypothesis, while managing the
risks of Type I and Type II errors.

Testing the population mean when the population standard deviation is known
1. Formulate Hypotheses: Null Hypothesis and Alternative Hypothesis
2. Determine the Significance Level (α): This is the probability of rejecting the null
hypothesis when it is true, commonly set at 0.05, 0.01, or 0.10.
3. Calculate the Test Statistic: The test statistic for the population mean when the standard deviation is known is z = (x̄ − μ₀) / (σ / √n), where x̄ is the sample mean, μ₀ is the hypothesized population mean, σ is the known population standard deviation, and n is the sample size.
4. Determine the Rejection Region:
a. For a two-tailed test, the rejection regions are in both tails of the normal
distribution, determined by critical z-values corresponding to the significance
level
b. For a one-tailed test, the rejection region is in one tail (either right or left)
depending on whether the alternative hypothesis specifies greater than or less
than
5. Make a Decision:
a. Rejection Region Method: Compare the test statistic to the critical z-value(s). If
the test statistic falls into the rejection region, reject the null hypothesis.
b. p-Value Method: Calculate the p-value, which is the probability of observing a
test statistic as extreme as, or more extreme than, the one computed. If the p-
value is less than α, reject the null hypothesis.
6. Interpret Results:
a. If you reject the null hypothesis, there is statistical evidence suggesting that the
population mean differs from the specified value. If you do not reject the null
hypothesis, there is insufficient evidence to claim a difference.
This process allows you to determine whether observed sample data provide enough evidence
to make inferences about the population mean.
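
A Python sketch of this z-test (two-tailed) using the standard library's NormalDist; the sample mean, hypothesized mean, σ, and n are invented for illustration:

# Sketch of a two-tailed z-test for a mean with sigma known.
from math import sqrt
from statistics import NormalDist

x_bar, mu_0, sigma, n, alpha = 178.0, 170.0, 65.0, 400, 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))           # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # rejection region: |z| > z_crit
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed p-value

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")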

Calculating the probability of a Type II error


1. Determine the Critical Region:
a. Establish the critical region (rejection region) based on the significance level (α).
This is where you would reject the null hypothesis.
2. Calculate the Probability of Type II Error:
a. Compute the probability of not rejecting the null hypothesis when the
alternative hypothesis is true. This involves finding the probability that the test
statistic falls inside the acceptance region under the true parameter value (μ1).
3. Use the Power of the Test:
a. The power of the test is 1−β, which is the probability of correctly rejecting the
null hypothesis when the alternative hypothesis is true. Increasing the power
decreases the probability of a Type II error.
4. Adjusting Parameters:
a. The probability of a Type II error can be influenced by the sample size,
significance level, and the effect size (the difference between μ0 and μ1).
Increasing the sample size or choosing a higher significance level can reduce β.
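
A Python sketch of the β calculation for a right-tailed z-test with σ known; μ0, μ1, σ, n, and α below are illustrative values, not from the module:

# Sketch: probability of a Type II error (beta) and power for a right-tailed z-test.
from math import sqrt
from statistics import NormalDist

mu_0, mu_1, sigma, n, alpha = 170.0, 180.0, 65.0, 400, 0.05
se = sigma / sqrt(n)

# 1. Critical region: reject H0 when the sample mean exceeds this cutoff.
x_crit = mu_0 + NormalDist().inv_cdf(1 - alpha) * se

# 2. Beta: probability the sample mean falls below the cutoff when the true mean is mu_1.
beta = NormalDist(mu=mu_1, sigma=se).cdf(x_crit)

print(f"cutoff = {x_crit:.2f}, beta = {beta:.4f}, power = {1 - beta:.4f}")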

Inference about a population mean when the standard deviation is unknown


1. Objective:
a. To estimate the population mean (μ) and make inferences about it when the
population standard deviation (σ) is not known.
2. Use the Sample Standard Deviation:
a. Instead of the population standard deviation, use the sample standard deviation
(s) to estimate σ.
3. T-Distribution:
a. When σ is unknown, the standardized statistic t = (x̄ − μ) / (s / √n) follows a t-distribution rather than the standard normal distribution. This accounts for the additional uncertainty introduced by estimating σ with s.
4. Degrees of Freedom:
a. The t-distribution is parameterized by degrees of freedom (df), which is n−1 for a
single sample.
5. Hypothesis Testing:
a. For hypothesis tests about the mean, compare the t-statistic to critical t-values
from the t-distribution to determine whether to reject the null hypothesis. Use a
significance level (α) to define the rejection region.
In summary, when the population standard deviation is unknown, the t-distribution is used to
estimate the population mean and to perform hypothesis tests and construct confidence
intervals.
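
A Python sketch of a one-sample t-test, assuming SciPy is available; the data and the hypothesized mean of 50 are made up:

# Sketch: one-sample t-test and t-based confidence interval (sigma unknown).
import math
import statistics
from scipy import stats

sample = [52.1, 48.3, 55.0, 49.7, 51.2, 53.8, 47.9, 50.5, 54.1, 52.6]
n = len(sample)

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)   # df = n - 1 = 9
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# 95% confidence interval built from the t-distribution with n - 1 degrees of freedom.
margin = stats.t.ppf(0.975, df=n - 1) * statistics.stdev(sample) / math.sqrt(n)
print("95% CI:", statistics.mean(sample) - margin, statistics.mean(sample) + margin)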

Inference about a population variance


To make inferences about the population variance (σ^2) based on sample data.
1. Chi-Square Distribution:
a. The sampling distribution of the sample variance (s^2) follows a chi-square
distribution when the population is normally distributed. This distribution is used
for making inferences about the population variance.
2. Test Statistic:
a. Compute the test statistic for the variance using the formula χ² = (n − 1)s² / σ₀², where σ₀² is the hypothesized population variance.
3. Degrees of Freedom:
a. The chi-square distribution is parameterized by degrees of freedom, which is n−1
for a sample variance.
4. Confidence Intervals:
a. To construct a confidence interval for the population variance, use the chi-
square distribution.
5. Hypothesis Testing:
a. For hypothesis tests about the population variance, compare the chi-square
statistic to critical chi-square values to determine whether to reject the null
hypothesis. Use a significance level (α) to define the rejection region.
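
A Python sketch of the chi-square test and confidence interval for a variance, assuming SciPy is available; the data and the hypothesized variance (σ₀² = 1.0) are invented:

# Sketch: chi-square inference about a population variance.
import statistics
from scipy import stats

sample = [10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 10.0, 9.9, 10.3, 9.7]
n = len(sample)
s2 = statistics.variance(sample)        # sample variance
sigma0_sq = 1.0                         # hypothesized population variance

chi2 = (n - 1) * s2 / sigma0_sq         # test statistic with n - 1 degrees of freedom
p_two_sided = 2 * min(stats.chi2.cdf(chi2, df=n - 1),
                      1 - stats.chi2.cdf(chi2, df=n - 1))

# 95% confidence interval for sigma^2.
lower = (n - 1) * s2 / stats.chi2.ppf(0.975, df=n - 1)
upper = (n - 1) * s2 / stats.chi2.ppf(0.025, df=n - 1)
print(chi2, p_two_sided, (lower, upper))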

Inference about a population proportion


1. Estimate or test hypotheses about the proportion (p) of a certain characteristic in a
population based on sample data.
2. Sample Proportion: Calculate the sample proportion p̂ = x/n, where x is the number of successes in a sample of size n.
3. Standard Error: Compute the standard error of the sample proportion as √(p̂(1 − p̂)/n).
4. Confidence Interval: Construct a confidence interval for the population proportion as p̂ ± z_(α/2)·√(p̂(1 − p̂)/n).
5. Hypothesis Testing: Perform hypothesis tests by comparing the test statistic z = (p̂ − p₀)/√(p₀(1 − p₀)/n) (the standardized difference between the sample proportion and the hypothesized proportion p₀) to the standard normal distribution.
In summary, inference about a population proportion involves estimating the proportion,
constructing confidence intervals, and testing hypotheses using sample data and standard
normal distribution.
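
A Python sketch of these proportion calculations using the normal approximation; the counts and the hypothesized p₀ are illustrative:

# Sketch: confidence interval and z-test for a population proportion.
from math import sqrt
from statistics import NormalDist

x, n, p0, alpha = 230, 400, 0.50, 0.05
p_hat = x / n                                        # sample proportion

# Confidence interval uses the standard error based on p_hat.
se_ci = sqrt(p_hat * (1 - p_hat) / n)
z_a2 = NormalDist().inv_cdf(1 - alpha / 2)
ci = (p_hat - z_a2 * se_ci, p_hat + z_a2 * se_ci)

# Test statistic uses the standard error based on the hypothesized p0.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(p_hat, ci, z, p_value)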

One-Way Analysis of Variance


(ANOVA) is a statistical method used to test whether there are significant differences among the means of three or more independent groups.
1. Objective: Determine whether there are statistically significant differences in the means
of different groups.
2. Hypotheses:
o Null Hypothesis (H0): All group means are equal.
o Alternative Hypothesis (HA): At least one group mean is different.
3. Assumptions:
o Independence of observations.
o Normality: Data in each group should be approximately normally distributed.
o Homogeneity of variances: The variances among groups should be approximately
equal.
4. Test Statistic: Calculate the F-statistic as the ratio of the variance between groups to the variance within groups, F = MSB / MSW (the between-group mean square divided by the within-group mean square).
5. Steps:
o Calculate Group Means: Compute the mean for each group.
o Compute Between-Group Variance: Measure how much the group means
deviate from the overall mean.
o Compute Within-Group Variance: Measure how much the individual
observations deviate from their group mean.
o Calculate F-Statistic: Divide the between-group variance by the within-group
variance.
6. Decision Rule:
o Compare the calculated F-statistic to the critical value from the F-distribution
table based on the degrees of freedom.
o If the F-statistic is greater than the critical value, reject the null hypothesis.
7. Post-Hoc Tests: If the null hypothesis is rejected, perform additional tests (like Tukey’s
HSD) to determine which specific group means differ.
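
A Python sketch of a one-way ANOVA, assuming SciPy is available; the three groups of measurements are invented:

# Sketch: one-way ANOVA comparing three independent groups.
from scipy import stats

group_a = [23, 25, 21, 27, 24]
group_b = [30, 28, 33, 29, 31]
group_c = [22, 26, 24, 25, 23]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# If p < alpha, reject H0 (not all group means are equal) and follow up with a
# post-hoc procedure such as Tukey's HSD (see the next section).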

Multiple Comparisons
involve testing multiple hypotheses simultaneously to determine if there are significant
differences among groups or treatments.
1. Identify which specific group means differ when multiple comparisons are made after a
significant result from an overall test (e.g., ANOVA).
2. Problem: Performing multiple statistical tests increases the risk of Type I errors (false
positives), where you incorrectly conclude that a difference exists when it does not.
3. Techniques:
o Tukey's Honestly Significant Difference (HSD): Compares all pairs of means while
controlling for Type I errors. Suitable for equal sample sizes.
o Bonferroni Correction: Adjusts the significance level by dividing it by the number
of comparisons. It is conservative and reduces Type I errors but may increase
Type II errors (false negatives).
o Scheffé’s Test: Flexible and can handle unequal sample sizes. It is less powerful
but controls Type I errors in a broader range of comparisons.
o Dunnett’s Test: Compares each group to a control group, controlling Type I
errors when comparing multiple treatments to a single control.
4. Decision: Choose an appropriate method based on the nature of the comparisons and
the balance between controlling Type I and Type II errors.
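
A Python sketch of Tukey's HSD, assuming the statsmodels package is available; it reuses the invented groups from the ANOVA sketch above:

# Sketch: Tukey's HSD post-hoc comparisons after a significant one-way ANOVA.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = [23, 25, 21, 27, 24, 30, 28, 33, 29, 31, 22, 26, 24, 25, 23]
groups = ["A"] * 5 + ["B"] * 5 + ["C"] * 5

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result)   # table of pairwise mean differences with family-wise Type I error control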

Analysis of Variance (ANOVA) Experimental Designs


involve using ANOVA techniques to analyze data from experiments with different structures.
1. To determine whether there are statistically significant differences in means among
groups or treatments in an experiment.
2. Types of Experimental Designs:
o One-Way ANOVA: Tests differences among means of three or more independent
groups based on a single factor or treatment. It evaluates if at least one group
mean differs significantly from the others.
o Two-Way ANOVA: Evaluates the impact of two independent variables (factors)
on a dependent variable, and also examines the interaction between the factors.
Useful for understanding how two factors simultaneously influence the outcome.
o Repeated Measures ANOVA: Used when the same subjects are measured
multiple times under different conditions. It accounts for the correlation
between repeated measurements on the same subjects.
o Factorial ANOVA: Assesses multiple factors and their interactions. Each factor
can have multiple levels, and the design helps understand how different factors
combine to affect the outcome.
3. Assumptions:
o Normality: Data within each group should be approximately normally
distributed.
o Homogeneity of Variances: The variance among groups should be approximately
equal.
o Independence: Observations should be independent of each other.
4. Procedure:
o Calculate the F-Statistic: Ratio of the variance between group means to the
variance within groups.
o Compare to Critical Value: Determine if the F-statistic is significant by comparing
it to a critical value from the F-distribution.
5. Post-Hoc Tests: If ANOVA indicates significant differences, follow-up tests (e.g., Tukey’s
HSD) identify which specific groups differ from each other.
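
A Python sketch of one of these designs, a two-way (factorial) ANOVA, assuming pandas and statsmodels are available; the factors, levels, and output values are made up:

# Sketch: two-way factorial ANOVA with main effects and an interaction term.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "machine": ["M1"] * 6 + ["M2"] * 6,
    "shift":   (["day"] * 3 + ["night"] * 3) * 2,
    "output":  [52, 55, 53, 48, 47, 50, 60, 58, 61, 51, 49, 52],
})

# C(machine) * C(shift) expands to both main effects plus the machine:shift interaction.
model = smf.ols("output ~ C(machine) * C(shift)", data=df).fit()
print(anova_lm(model, typ=2))   # F-statistics and p-values for each effect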

Randomized Block (Two-Way) Analysis of Variance (ANOVA)


a statistical technique used to examine the effect of a treatment factor on a dependent variable while controlling for variability due to a second variable, the blocking factor.
1. To assess the effect of the factor of interest (the treatments) on a dependent variable, while accounting for variability from a blocking factor that groups similar experimental units together.
2. Design:
a. Treatments: The factor of interest has two or more levels; the design assesses how the treatments influence the outcome once block-to-block variation is removed.
b. Blocking Factor: A blocking factor is included to control for variability by grouping
similar experimental units. This helps reduce error variance and increase the
sensitivity of the test.
3. Procedure:
a. Randomization: Experimental units are randomly assigned to different
treatments within each block.
b. Calculate F-Statistics: Compute F-statistics for the treatment effect and for the blocking factor.
c. Compare to Critical Values: Assess significance by comparing F-statistics to
critical values from the F-distribution.
4. Assumptions:
a. Normality: Data within each group should be approximately normally
distributed.
b. Homogeneity of Variances: The variance among groups should be roughly equal.
c. Independence: Observations should be independent within blocks and across
blocks.
5. Interpretation:
a. Treatment Effect: Determine the impact of the treatment factor on the dependent variable.
b. Blocking Effect: Evaluate whether the blocking factor accounts for significant variability; if it does, blocking has improved the sensitivity of the test. (With one observation per treatment-block combination, a treatment-by-block interaction cannot be tested.)
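
A Python sketch of a randomized block ANOVA fit, assuming pandas and statsmodels are available; the treatment, block, and response values are invented. The block enters the model additively, with no interaction term:

# Sketch: randomized block ANOVA (one treatment factor plus an additive block term).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "treatment": ["T1", "T2", "T3"] * 4,                          # each block sees all treatments
    "block":     ["B1"] * 3 + ["B2"] * 3 + ["B3"] * 3 + ["B4"] * 3,
    "response":  [20, 24, 22, 18, 23, 21, 25, 29, 27, 19, 22, 20],
})

model = smf.ols("response ~ C(treatment) + C(block)", data=df).fit()
print(anova_lm(model, typ=2))   # F tests for the treatment effect and the block effect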

Two-Factor Analysis of Variance (ANOVA)


a statistical method used to examine the effects of two independent variables (factors) on a
dependent variable, as well as any interaction between these factors.
1. To evaluate:
a. The main effects of each factor on the dependent variable.
b. The interaction effect between the two factors.
2. Design:
a. Factors: Two factors, each with multiple levels, are included in the study.
b. Levels: Each factor has different levels (e.g., different treatments or conditions).
3. Procedure:
a. Randomization: Subjects or experimental units are randomly assigned to
different combinations of factor levels.
b. Calculate Sums of Squares: Compute the sums of squares for the main effects of
each factor, their interaction, and error.
c. Calculate F-Statistics: Use these sums of squares to compute F-statistics for each
effect and compare them to critical values from the F-distribution.
4. Assumptions:
a. Normality: The data in each group should be approximately normally distributed.
b. Homogeneity of Variances: The variances across groups should be roughly equal.
c. Independence: Observations should be independent.
5. Interpretation:
a. Main Effects: Determine if each factor independently affects the dependent
variable.
b. Interaction Effect: Assess if the effect of one factor depends on the level of the
other factor.
c. Statistical Significance: Identify which effects are statistically significant.
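
A Python sketch of testing the interaction in a two-factor design by comparing the additive model with the full model, assuming pandas and statsmodels are available; the data are made up:

# Sketch: F test for the interaction term by comparing nested models.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "detergent": ["D1"] * 4 + ["D2"] * 4,
    "temp":      ["cold", "cold", "hot", "hot"] * 2,
    "score":     [45, 47, 55, 58, 50, 49, 52, 53],
})

additive = smf.ols("score ~ C(detergent) + C(temp)", data=df).fit()   # main effects only
full = smf.ols("score ~ C(detergent) * C(temp)", data=df).fit()       # adds the interaction

print(anova_lm(additive, full))   # F test for the interaction term alone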

In general, the difference between the two experimental designs is that, in the randomized
block experiment, blocking is performed specifically to reduce variation, whereas in the two-
factor model, the effect of the factors on the response variable is of interest to the statistics
practitioner. The criteria that define the blocks are always characteristics of the experimental
units. Consequently, factors that are characteristics of the experimental units will be treated
not as factors in a multifactor study, but as blocks in a randomized block experiment.

Hypothesis testing vs. decision analysis:


1. The technique for hypothesis testing concludes with either rejecting or not rejecting some hypothesis concerning a parameter of a population. In decision analysis, we deal with the problem of selecting one alternative from a list of several possible decisions.
2. In hypothesis testing the decision is based on the statistical evidence available. In
decision analysis, there may be no statistical data, or if there are data, the decision may
depend only partly on them.
3. Costs (and profits) are only indirectly considered (in the selection of a significance level
or in interpreting the p-value) in the formulation of a hypothesis test. Decision analysis
directly involves profits and losses.

Google uses "people analytics" to inform its talent management strategies, employing data to
enhance various aspects of its workforce dynamics. This approach includes relational analytics,
which examines how employee interactions impact overall performance. By analyzing what’s
termed "digital exhaust," which encompasses data from emails, chats, and collaboration tools,
Google gains insights into the underlying social networks that contribute to its success. There
are six key elements: ideation, influence, efficiency, innovation, silos, and vulnerability. By
assessing these factors, Google can identify employees who play critical roles in achieving
company objectives. Initiatives like Project Oxygen exemplify this data-driven strategy, as they
identify the traits of effective managers and share these insights to enhance leadership
development across teams.

I believe that leveraging data in this manner is ethical, especially when employees are kept
informed about how their data is utilized. Google’s focus on fostering a positive work
environment while pursuing organizational goals reflects a commitment to employee welfare.
It’s essential for companies to establish clear policies regarding data collection and usage,
ensuring that employees feel comfortable and respected throughout the process. Moreover,
management must remember that data represents real individuals, not just numbers. By
maintaining a personal connection while employing analytics, organizations can create a more
engaged and productive workforce. As long as employees are aware of and consent to the use
of their data, relational analytics can serve as a valuable tool for enhancing both employee
experience and overall performance. In summary, as companies like Google continue to
advance their talent management practices through data, prioritizing ethical considerations and
a human-centered approach is crucial for long-term success.
In evaluating whether Amazon should be "broken up" under antitrust laws, it’s essential to
consider the ethical implications of its business practices, particularly regarding competition.
Reports indicating that Amazon employees have used proprietary data from independent
sellers to develop competing products raise serious concerns about fairness and transparency.
This behavior not only undermines trust among third-party sellers but also threatens the
competitive landscape of the marketplace. If Amazon's actions are shown to significantly stifle
competition and harm independent brands, it may warrant regulatory scrutiny akin to the
enforcement of the Sherman Antitrust Act by President Roosevelt.

To draw the line for intervention, we should assess factors such as Amazon's market power, the
impact of its practices on consumer prices and innovation, and whether these practices mirror
historical monopolistic behaviors. Ultimately, fostering open discussions around these ethical
issues in our data analytics framework can help illuminate the broader implications of such
business practices and guide potential regulatory approaches, ensuring a fair and competitive
marketplace for all stakeholders.

