ECON20003 S1 2024 Sample Exam Solutions
The student analyzed the data to determine if there is a difference in satisfaction levels
between consumers based on the type of packaging. The analysis produced several R
printouts, which are presented below. Use a 1% level of significance throughout this
question.
(4 + 6 + 4 + 2 + 5 = 21 marks)
The R output of histograms for Satisfaction by Type of Packaging
[Histograms: "Distributions of Satisfaction Levels for Box Packaging" and "Distributions of Satisfaction Levels for Bag Packaging"]
              Satisfaction_Bag  Satisfaction_Box
median                  80.000            70.000
mean                    79.490            71.160
SE.mean                  1.021             0.952
CI.mean.0.95             2.026             1.889
var                    104.293            90.600
std.dev                 10.212             9.518
coef.var                 0.128             0.134
skewness                -0.168             0.328
skew.2SE                -0.347             0.680
kurtosis                -0.516            -0.554
kurt.2SE                -0.539            -0.580
normtest.W               0.989             0.976
normtest.p               0.563             0.061
R output for various tests on the variable Satisfaction by Type of Packaging
(a) Briefly explain the aims of the tests shown on the last four printouts. What do they
test and what do they assume about the underlying populations?
(4 marks)
F test to compare two variances:
Explanation:
This is a test of the null hypothesis that two normal populations have the same
variance.
Independent samples t-test (Two Sample t-test on the printout): aims to verify
some statement about the difference between two population means.
Explanation:
It is based on two independent random samples drawn from two normal
populations.
Welch's t-test, also known as the unequal variances t-test: used when you
want to test whether the means of two normal populations are equal.
Explanation:
This test is generally applied when the variances of the two normal populations
differ.
Nonparametric test on the last printout:
Explanation:
Its aim is similar to that of the t-tests, but it compares the medians rather than
the means and assumes that both independent populations have the same ‘shape’
(under the alternative their distributions are the same except for a shift in the
median).
(b) Looking at the “F test to compare two variances” printout, test at the
1% significance level whether the population variance of the level of satisfaction for
cereal in a box differs from the population variance of the satisfaction of cereal in a
bag. Answer by stating the null and alternative hypotheses; providing an appropriate
test statistic and its distribution and describing an appropriate decision rule. You may
use the test statistic provided in the printout. (6 marks)
Answer:
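A minimal sketch of the variance-ratio statistic, computed only from the sample variances in the descriptive statistics table. The hypotheses are H0: the two population variances are equal, against HA: they differ. The group size of 100 is taken from the quote in part (e), and putting the larger variance in the numerator is a choice made here, so the printout may report the reciprocal. Python is used purely for the arithmetic; the exam printouts themselves come from R.

```python
# Sample variances from the descriptive statistics table.
var_bag = 104.293
var_box = 90.600
n_bag = n_box = 100  # assumed group sizes, taken from the quote in part (e)

# F statistic: ratio of the two sample variances (larger on top, by choice here).
f_stat = var_bag / var_box
df1, df2 = n_bag - 1, n_box - 1

print(f"F = {f_stat:.3f} on ({df1}, {df2}) degrees of freedom")
```

Under H0 the statistic follows an F distribution with (99, 99) degrees of freedom. At the 1% level, H0 is rejected if the statistic falls below the lower 0.5% critical value or above the upper 0.5% critical value, or equivalently if the two-sided p-value on the printout is below 0.01.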
(c) Given the research question, the variable of interest and the data, which test
provided in the last three printouts is most appropriate this time? Explain why it is
appropriate and the others are not. What are the null and alternative hypotheses of
your chosen test in the context of this study?
(4 marks)
Answer:
(d) Evaluate the printout of the test you chose in part (c) at the 1% significance level to
decide whether there is a difference in the satisfaction levels of cereal sold in a box
and in a bag. Specify the p-value, make a statistical decision, and draw your
conclusion. (2 marks)
Answer:
P-value:
The p-value of the Two-sample t-test is 1.098e-08.
Statistical decision:
Since the p-value < 0.01, at the 1% significance level we REJECT the
null hypothesis.
Conclusion:
We conclude that mean satisfaction differs between consumers who receive
the two different package types.
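As a cross-check on the decision, the pooled two-sample t statistic can be rebuilt from the descriptive statistics alone. This is a sketch assuming 100 observations per group (from the quote in part (e)) and equal population variances; the variable names are illustrative.

```python
import math

# Summary statistics from the descriptive statistics table.
mean_bag, mean_box = 79.490, 71.160
var_bag, var_box = 104.293, 90.600
n_bag = n_box = 100  # assumed group sizes

# Pooled variance and standard error of the difference in sample means.
df = n_bag + n_box - 2
pooled_var = ((n_bag - 1) * var_bag + (n_box - 1) * var_box) / df
se_diff = math.sqrt(pooled_var * (1 / n_bag + 1 / n_box))

t_stat = (mean_bag - mean_box) / se_diff
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The statistic is far beyond roughly 2.6, the approximate two-sided 1% critical value for 198 degrees of freedom, which agrees with the very small p-value on the printout.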
(e) What assumptions about the population data are required to validate the test you
performed in part (d)? Use as much information as possible to find out whether these
conditions are likely satisfied this time. Briefly explain your answer.
(5 marks)
First requirement:
Explanation:
“The student randomly selected two separate groups of 100 individuals” implying also
that the samples are independent.
Second requirement:
Explanation:
Third requirement:
Explanation:
We can check normality using the histograms, the reported descriptive statistics,
and the Shapiro-Wilk tests.
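The histograms look roughly symmetric and suggest that the two sampled
populations are unimodal and have similar shapes.
In each sample the mean and median are similar, skew.2SE and kurt.2SE are less than
1 in absolute value, and the Shapiro-Wilk p-values (0.563 and 0.061) exceed 0.01, so
normality is not rejected at the 1% level.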
3. A researcher wishes to explore factors that are associated with life expectancy
across different countries to help identify potential disparities in health outcomes
across different regions and provide a basis for developing policies aimed at reducing
these disparities.
To this end, for a sample of 180 countries, they regress the variable Life_expectancy_i
(Life expectancy at birth, total years) against the following explanatory variables:
Call:
lm(formula = Life_expectancy ~ Health_expenditure_PC + GDP_PC +
UHC + Urban_pop, data = data)
Residuals:
Min 1Q Median 3Q Max
-15.6898 -2.4945 0.7976 3.5943 9.7819
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)           63.34741    1.05252  60.187  < 2e-16 ***
Health_expenditure_PC  0.29979    0.41207   0.728   0.4679
GDP_PC                 0.13900    0.03342   4.160 4.99e-05 ***
UHC                    2.51961    1.50867   1.670   0.0967 .
Urban_pop              0.08735    0.02103   4.153 5.11e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The R output of QQ Plot, Histogram, Descriptive Statistics of model residuals and Residuals
vs. Predicted Values:
[Figures: QQ plot of the residuals (Sample Quantiles); histogram of the residuals (Frequency); Residuals vs. predicted values]
Descriptive statistics of the residuals:
mean           1.72E-16
SE.mean        0.345
CI.mean.0.95   0.680
var            21.371
std.dev        4.623
coef.var       2.69E+16
skewness       -0.739
skew.2SE       -2.040
kurtosis       0.170
kurt.2SE       0.235
normtest.W     0.951
normtest.p     6.89E-06
The R output of the White test (studentized Breusch-Pagan test) for homoskedasticity:
studentized Breusch-Pagan test
data: model
BP = 7.344, df = 4, p-value = 0.1188
The R output of the Variance Inflation Factor:
> # calculate the VIF for each independent variable
> vif(model)
Health_expenditure_PC GDP_PC UHC Urban_pop
5.904326 4.484705 2.806123 1.816616
The R output of the Ramsey Regression Equation Specification Error Test (RESET) of the
model:
> # run a RESET test on the model
> resettest(model)
RESET test
data: model
RESET = 31.786, df1 = 2, df2 = 173, p-value = 1.751e-12
Hypothesis:
Health_expenditure_PC = 0
(a) What does the estimated dependent variable measure? Given the dependent variable,
what type of regression model has been estimated? (2 marks)
The estimated regression model is a multiple linear regression using OLS estimation.
(b) Test the overall significance of the model at the 1% significance level. What are the
hypotheses, the statistical decision and the conclusion? (4 marks)
The p-value for the test of overall utility of the model is 2.2e-16 < 0.01. Hence we
reject the null hypothesis that all slope coefficients are zero.
Conclusion:
▪ The model is better at predicting Life Expectancy than one that assumes no
relationship between Life Expectancy and the explanatory variables.
▪ The regression model performs better at predicting Life Expectancy than the
sample mean.
(c) Test appropriate hypotheses concerning the two-sided significance of the slope
coefficients using t-tests at the 1% significance level. What are the hypotheses and
what do you conclude? (3 marks)
𝐻0 : 𝛽𝑗 = 0
𝐻𝐴 : 𝛽𝑗 ≠ 0
Conclusion:
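The decisions can be read straight off the coefficient table: each t value is the estimate divided by its standard error, and the null is rejected when the reported p-value is below 0.01. A short sketch using only the figures on the printout (the dictionary layout and variable names are illustrative):

```python
ALPHA = 0.01

# (estimate, standard error, reported p-value) from the lm() printout.
coefs = {
    "Health_expenditure_PC": (0.29979, 0.41207, 0.4679),
    "GDP_PC": (0.13900, 0.03342, 4.99e-05),
    "UHC": (2.51961, 1.50867, 0.0967),
    "Urban_pop": (0.08735, 0.02103, 5.11e-05),
}

significant = []
for name, (estimate, std_error, p_value) in coefs.items():
    t_value = estimate / std_error  # reproduces the "t value" column
    if p_value < ALPHA:
        significant.append(name)
        decision = "reject H0"
    else:
        decision = "do not reject H0"
    print(f"{name:<22} t = {t_value:6.3f}  p = {p_value:.3g}  -> {decision}")

print("Significant at the 1% level:", significant)
```

On these figures only GDP_PC and Urban_pop are significant at the 1% level; Health_expenditure_PC and UHC are not.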
(d) Carefully explain the meanings of the slope coefficients that you deemed significant
at the 1% level. (4 marks)
GDP_PC_i:
For every $1000 increase in GDP per capita (current international $s), Life Expectancy
increases by 0.139 years on average, holding other variables constant.
Award 1 mark only if the answer does not mention the impact on Life Expectancy.
Urban_pop_i:
For every 1 percentage point increase in Urban Population, Life Expectancy increases
by 0.087 years on average, holding other variables constant.
Award 1 mark only if the answer does not mention the impact on Life Expectancy.
(e) Interpret the coefficient of determination. Is the model likely to be useful in
predicting Life Expectancy? Explain your answer. (2 marks)
The adjusted R² is about 0.61, so about 61% of the total sample variation in Life
Expectancy can be accounted for by the model.
The model has a “reasonably good” fit and is likely useful in predicting life
expectancy.
(f) Does the normality assumption seem to be satisfied? Why or why not? Use as much
information from the R printouts as you can and be specific. (3 marks)
The normality requirement is likely not satisfied, so statistical inference from the F-
and t-tests may be invalid.
Reasons:
• The histogram is left-skewed and on the QQ plot the points deviate
from the straight line.
• The residuals’ median (0.798) deviates from the mean (approximately zero) and
the absolute value of skew.2SE (2.040) is greater than one.
• The Shapiro-Wilk test rejects normality at the 1% level (normtest.p = 6.89E-06).
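The skew.2SE entry is the skewness divided by twice its standard error, so an absolute value above one flags skewness that is significant at roughly the 5% level. The sketch below rebuilds it for the residuals. The standard-error formula is the usual one for sample skewness under normality; that the table was produced with exactly this formula is an assumption, although it reproduces the reported value closely.

```python
import math

def se_skewness(n):
    """Standard error of sample skewness under normality (usual textbook formula)."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

n = 180            # number of countries, hence number of residuals
skewness = -0.739  # from the residual descriptive statistics

skew_2se = skewness / (2 * se_skewness(n))
print(f"skew.2SE = {skew_2se:.3f}")
print("Skewness flagged as significant:", abs(skew_2se) > 1)
```

Small differences in the third decimal place can arise because the skewness on the printout is itself rounded.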
(g) Comment on the “OLS residuals versus Life_expectancy_hat” scatterplot
and on the “studentized Breusch-Pagan test” printout. What do they suggest to you
about the estimated regression model?
(4 marks)
a. On the scatterplot of the residuals against Life_expectancy_hat, there is a clear
“funnel pattern” with decreasing variance, indicating heteroskedasticity.
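For the test printout, the reported p-value can be reproduced by hand, because the upper tail of a chi-square distribution with 4 degrees of freedom has a simple closed form. A sketch using only the BP statistic and degrees of freedom shown on the printout:

```python
import math

ALPHA = 0.01
bp_stat = 7.344  # studentized Breusch-Pagan statistic from the printout
df = 4           # degrees of freedom reported on the printout

# For df = 4 the chi-square upper-tail probability is exp(-x/2) * (1 + x/2).
p_value = math.exp(-bp_stat / 2) * (1 + bp_stat / 2)
reject_homoskedasticity = p_value < ALPHA

print(f"p-value = {p_value:.4f}")
print("Reject homoskedasticity at the 1% level:", reject_homoskedasticity)
```

The p-value of about 0.119 exceeds 0.01, so the formal test does not reject the null of homoskedasticity at the 1% level. That should be weighed against the visual impression from the scatterplot.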
(h) Determine whether imperfect multicollinearity is a concern in the given model. Please
provide as much evidence as possible to support your conclusion. If multicollinearity is
a concern, propose a solution to address the problem. (5 marks)
• While signs are as expected, two explanatory variables (Health Expenditure
Per Capita and Universal Healthcare Coverage) that we expect to be
statistically significantly associated with Life Expectancy appear not to be
at the 1% level (comparatively large standard errors for coefficient estimates).
• The Variance Inflation Factor is higher than the rule of thumb value of 5 for
Health Expenditure PC (5.90) and high (4.48) for GDP Per Capita.
• Increasing the sample size would shrink the standard errors, though this is of
limited use given the upper limit on the number of countries in the world.
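Each VIF equals 1 / (1 - R²_j), where R²_j comes from regressing explanatory variable j on the other explanatory variables. Inverting that identity shows how much of each regressor is explained by the rest. A sketch using the VIF printout (the threshold of 5 is the rule of thumb quoted above; the names are illustrative):

```python
VIF_THRESHOLD = 5  # rule-of-thumb value quoted in the answer

# Variance inflation factors from the vif(model) printout.
vif = {
    "Health_expenditure_PC": 5.904326,
    "GDP_PC": 4.484705,
    "UHC": 2.806123,
    "Urban_pop": 1.816616,
}

# VIF_j = 1 / (1 - R2_j)  =>  R2_j = 1 - 1 / VIF_j
aux_r2 = {name: 1 - 1 / value for name, value in vif.items()}
flagged = [name for name, value in vif.items() if value > VIF_THRESHOLD]

for name in vif:
    print(f"{name:<22} VIF = {vif[name]:.2f}  auxiliary R2 = {aux_r2[name]:.3f}")
print("Above the rule of thumb:", flagged)
```

About 83% of the variation in Health_expenditure_PC and about 78% of the variation in GDP_PC is explained by the other regressors.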
(2 + 4 + 3 + 4 + 2 + 3 + 4 + 5 = 27 marks)
4. A commodities researcher considers monthly data on the world prices of Corn (Corn)
and Soybeans (Soy), both measured in US dollars per metric ton, from Jan. 2018 to
Mar. 2023. Both soybeans and corn are used in the production of vegetable oil and
are widely traded internationally. As a result, their prices are often closely linked and
they may be considered substitutes for each other in certain contexts. For these reasons,
their prices are likely interrelated in the sense that price movements on the Corn market
are probably affected, among other things, by economic developments in the world Soybean
market.
log(Corn_t) = β0 + β1 log(Corn_{t-1}) + β2 log(Soy_t) + ε_t
where,
Call:
lm(formula = logCorn ~ lag(logCorn, 1) + lag(logSoy, 0), data =
data)
Residuals:
Min 1Q Median 3Q Max
-0.073996 -0.029353 -0.005261 0.026902 0.152338
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -0.69120    0.16629  -4.157 0.000106 ***
lag(logCorn, 1)  0.59969    0.05995  10.004 2.54e-14 ***
lag(logSoy, 0)   0.47137    0.07013   6.721 8.01e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[Figures: "OLS residuals" (residuals plotted against Time) and "Residuals vs Lagged Residuals" (e(t) plotted against e(t-1))]
(a) Test appropriate hypotheses concerning overall model utility and the two-sided
significance of the slope coefficients using t-tests at the 1% significance level. You
may use p-values and summarize results. Overall, does the model have good ‘fit’?
Explain your conclusions. (4 marks)
The F-statistic p-value is below 0.01, so we reject the null that all slope coefficients
are zero; the model has utility.
Conclusion:
Both slope coefficients have t-test p-values below 0.01 (2.54e-14 and 8.01e-09). As a
result, both the one-month lagged value of the log of Corn prices and the log of Soy
prices appear to be statistically significant predictors of current logged Corn prices.
(b) Carefully explain the meanings of the slope coefficients that you deemed significant
at the 1% level. Explain if they make sense or not. (4 marks)
Explanations:
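Because both the dependent variable and the regressors are in logs, the slope coefficients are usually read as elasticities: a 1% change in a regressor is associated with approximately a β% change in the Corn price, other things equal. The sketch below compares that approximation with the exact implied change. It assumes this standard log-log reading applies here, and the function name is illustrative.

```python
# Slope estimates from the lm() printout.
b_lag_corn = 0.59969  # coefficient on lag(logCorn, 1)
b_soy = 0.47137       # coefficient on lag(logSoy, 0), i.e. current log(Soy)

def pct_change_in_corn(pct_change_in_regressor, elasticity):
    """Exact percentage change in Corn implied by a log-log slope."""
    return ((1 + pct_change_in_regressor / 100) ** elasticity - 1) * 100

for label, beta in [("Soy price this month", b_soy),
                    ("Corn price last month", b_lag_corn)]:
    exact = pct_change_in_corn(1, beta)
    print(f"1% rise in {label}: Corn price changes by about {exact:.2f}% "
          f"(approximation: {beta:.2f}%)")
```

On this reading, a 1% rise in the Soybean price is associated with a rise of about 0.47% in the Corn price in the same month, and a 1% higher Corn price last month with a Corn price about 0.60% higher this month, holding the other regressor constant. Both signs are positive, which is in line with linked prices and with persistence in monthly prices.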
(c) Determine whether autocorrelation is a concern in the given model. Explain what
autocorrelation is and provide as much evidence as possible to support your
conclusion, using a significance level of 1%. If autocorrelation is a concern, propose a
solution to address the problem.
(5 marks)
Since the population error term cannot be observed, we study the OLS residuals.
i. The plot of e(t) against time (t) does not present a strong pattern of first
order serial correlation, though there do seem to be a few stretches of
residuals with the same sign, indicative of potential first order positive serial
correlation.
ii. The plot of e(t) against e(t-1) does not present a strong pattern across diagonal
quadrants, though there are some more extreme values in the top-right
quadrant alongside a preponderance of observations in the bottom-left
and top-right quadrants, indicative of potential first order positive serial
correlation.
iii. The Durbin-Watson d-test rejects the null of no first order serial correlation
in the residuals (p < 0.01).
iv. The Breusch-Godfrey LM test rejects the null of no serial correlation up to order 2.
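The Durbin-Watson statistic in item iii is computed from the OLS residuals as the sum of squared successive differences divided by the sum of squared residuals. A minimal sketch follows. The residual values are hypothetical and for illustration only, since the exam data are not reproduced here, and the function name is illustrative.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals with runs of the same sign (NOT the exam data).
example_residuals = [0.03, 0.05, 0.02, -0.01, -0.04, -0.03, 0.01, 0.04]

d = durbin_watson(example_residuals)
print(f"d = {d:.3f}")
```

Values of d near 2 are consistent with no first order serial correlation, values well below 2 point to positive serial correlation, and values well above 2 point to negative serial correlation, since d is approximately 2(1 - r), where r is the first order sample autocorrelation of the residuals.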
(4 + 4 + 5 = 13 marks)