ETF Sample Questions
Explain why the correlation is a more useful measure than the covariance. (2 marks)
- The correlation is calculated as: Cov( X, Y ) / ( SD( X ) * SD( Y ) ), SD = standard deviation
- The correlation is a “units-free” measure of the relationship between two variables: dividing by the standard
deviations standardises it so that it always lies between -1 and 1
- The magnitude of the covariance, by contrast, still depends on the units of measurement of the variables
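The unit dependence of the covariance can be checked numerically. The sketch below is only an illustration: the height and weight figures are hypothetical, and the helper names `sample_cov` and `sample_corr` are made up for this example. Changing heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Correlation = Cov(X, Y) / (SD(X) * SD(Y))."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical data, not taken from any exhibit.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 72.0, 80.0, 90.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m = sample_cov(height_m, weight_kg)
cov_cm = sample_cov(height_cm, weight_kg)
corr_m = sample_corr(height_m, weight_kg)
corr_cm = sample_corr(height_cm, weight_kg)

print(f"covariance:  {cov_m:.3f} (metres) vs {cov_cm:.3f} (centimetres)")
print(f"correlation: {corr_m:.4f} (metres) vs {corr_cm:.4f} (centimetres)")
```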
Using the scatter diagram and calculated correlation, explain the relationship between the two variables
- There is a strong (0.5 marks) positive (0.5 marks) linear relationship (0.5 marks) between the young people’s age
and the proportion of each age working in the formal sector.
- As the age increases, the proportion of young people who work in the formal sector also increases (0.5 marks).
MEASURE OF VARIABILITY
Outline in words how the range and the sample standard deviation are calculated. Discuss which is the best
measure of the spread of a distribution.
- The range is the difference between the maximum and minimum values.
- The sample standard deviation is the square root of the average (though we divide by “n-1” not “n”) of the
squared deviations from the mean.
- The standard deviation is generally regarded as a better measure of the spread of the distribution than the
range. It takes account of all values and is less influenced by outliers than the range.
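The two calculations described above can be sketched as follows. The data values are hypothetical and the function names are chosen only for this illustration.

```python
from math import sqrt

def value_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)

def sample_sd(values):
    """Sample standard deviation: square root of the sum of squared
    deviations from the mean, divided by n - 1 (not n)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sqrt(sum(squared_deviations) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

print("range:", value_range(data))
print("sample SD:", round(sample_sd(data), 3))
```

Note that the range uses only two of the observations, while every observation enters the standard deviation.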
What is the appropriate measure of central location for Price and Area? Justify your answer. (3 marks)
- The appropriate measure of central location is the median for both distributions. (1 mark)
- This is because both distributions are asymmetric, with Price being positively skewed, and Area being negatively
skewed.
Report the first and third quartiles of the incomes, and explain these values.
- 1st quartile: $24.35. 25% of children have a monthly family income per person below $24.35
- 3rd quartile: $34.47. 25% of children have a family income per person above $34.47
Describe the shape of the distribution of Price based on the summary statistics. (2 marks)
- The distribution of Price is asymmetric and positively skewed. (1 mark)
- This is because the skewness statistic is positive, at 1.837. Also, mode < median < mean, which indicates that the
mean is pulled upwards by extreme positive values. (1 mark) (Interpretation: there are some very expensive
houses compared to the rest of the properties where the area is taken into consideration)
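A small sketch of how a few very large values pull the mean above the median and produce a positive skewness statistic. The prices are hypothetical, and `sample_skewness` uses one common adjusted sample formula, which is an assumption about how the exhibit's skewness figure was computed.

```python
from math import sqrt
from statistics import mean, median

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    m = mean(values)
    s = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))  # sample SD
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

# Hypothetical prices ($000) with one very expensive property.
prices = [300, 320, 350, 380, 400, 450, 1200]

print("mean:", round(mean(prices), 1))
print("median:", median(prices))
print("skewness:", round(sample_skewness(prices), 3))
```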
HYPOTHESIS TESTING
Hypothesis Test (Normal): Conduct a hypothesis test to assess if the mean area of new houses is smaller than 618
square metres. In your answer, include the null and alternative hypotheses, the test statistic, the p-value and a
conclusion from your test. Use the significance level of 0.05. Report any numerical answers to three decimal places.
(5 marks)
- Null hypothesis H0: Mean area of new houses = 618 (1 mark)
- Alternative hypothesis H1: Mean area of new houses < 618 (1 mark)
- Test statistic = (sample mean – H0 value)/StdErr = (481.467 – 618)/25.125 = -5.434
- p-value = NORM.S.DIST(-5.434, TRUE) = 0.000 (1 mark)
- Using the significance level of 0.05, we reject the null hypothesis.
- There is evidence to suggest that the mean area of new houses is smaller than 618.
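The p-value step can be reproduced outside a spreadsheet. This is a minimal sketch that takes the test statistic of -5.434 from the answer above as given. `normal_cdf` is a helper written for this example that plays the role of NORM.S.DIST(z, TRUE), built on the standard library's `erfc`.

```python
from math import erfc, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * erfc(-z / sqrt(2))

z = -5.434    # test statistic reported in the answer
alpha = 0.05  # significance level from the question

# Lower-tail test (H1: mean < 618), so the p-value is the area to the left of z.
p_value = normal_cdf(z)
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```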
- H0: B1 = 0 (no difference between Catholic and government schools regarding student outcomes)
- H1: B1 > 0 (Catholic school students perform better) -> use /= for “not equal”
- Significance level 5% = alpha
- p-value = (7.382 x 10^-7)/2 (one tail test).
- So, we reject H0 since the p-value is < alpha -> We conclude that there is strong evidence to suggest that Catholic
schools produce better average outcomes in this test.
Undertake a hypothesis test that the proportion of countries recording declines in CO2 emissions in 1990 and 2016 is
different. Use a 5% significance level.
Suppose you estimated a model with the same dependent variable of Formal, but instead of Male, you replaced the
variable with “Female” (Female=1 if Male=0, = 0 if Male=1). What would the intercept and “Female” coefficients be in
this model? Explain intuitively how you obtained these values. (3 marks)
- The intercept will be 0.2677 and the “Female” coefficient will be -0.1043 (1 mark = 0.5 each)
- This is because the new intercept is interpreted as the estimated proportion of males who have a formal
sector job (0.2677)
- The “Female” coefficient is interpreted as the estimated difference between the proportions of females and
males who have a formal sector job, that is 0.1634-0.2677 = -0.1043
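The recoding logic can be written out as a short calculation. The original “Male” model values used below (intercept 0.1634, “Male” coefficient 0.1043) are inferred from the figures in the answer rather than quoted from the exhibit, and `flip_dummy` is a name made up for this sketch.

```python
def flip_dummy(intercept, dummy_coef):
    """Re-express a model after swapping which group a 0/1 dummy marks.

    The old dummy group becomes the new base category, so its fitted
    value (intercept + coefficient) becomes the new intercept, and the
    coefficient changes sign.
    """
    return intercept + dummy_coef, -dummy_coef

# Inferred "Male" model: the intercept is the female proportion.
intercept_male_model = 0.1634
male_coef = 0.1043

intercept_female_model, female_coef = flip_dummy(intercept_male_model, male_coef)

print(f"intercept = {intercept_female_model:.4f}")
print(f"Female coefficient = {female_coef:.4f}")
```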
Which type of error could have been made in the hypothesis test you performed in the previous question? Explain
the error in the context of that hypothesis. [4 marks]
- As we decided to reject the null hypothesis, the type of error we could have made in the test is a Type I error:
rejecting the null when it is true (2 marks).
- In this context, we reject the null statement that the fraud is less likely if the car was reported as driveable after
the accident (or the fraud has nothing to do with the car reported as driveable) while it is actually true. (2 marks)
What level of certainty do you have regarding the conclusion drawn from the hypothesis test in this case? Explain
your answer. (2 marks)
- We have a 95% level of certainty regarding the conclusion drawn from this hypothesis test (0.5 marks), because
we allow a 5% chance (0.5 marks) of making a Type I error.
- That is the error made when the null hypothesis is rejected while it is actually true, that is, when there is in
fact no difference in the likelihood of getting formal sector jobs between males and females. (1 mark)
Define what the p-value represents. Discuss the statistical significance of the coefficients in Exhibit 8 by looking at
each of the p-values.
- The p-value measures the chance of getting a coefficient at least as far from zero (as extreme) as the one actually
obtained, if the true coefficient were zero (that is, if the null hypothesis were true).
- The intercept has a p-value of 0.021. This means it is statistically significantly different from zero at the 5% level
but not the 1% level.
- Population has a p-value which is effectively zero. Hence it is very highly statistically significant. GDP per capita
has a p-value of 0.0016. This is very low and much less than conventional significance levels such as 5% and 1%,
so it is highly statistically significant.
One fraud investigator has said, “In my experience, fraud is about twice as likely to occur when the car is reported
as driveable”. Does this sample confirm the investigator’s view? Explain your reasoning.
- Yes, the regression results are consistent with this statement. The coefficient on Driveable is around 0.1. This is
roughly equal to the intercept. This means that the predicted probability of fraud doubles if the car
is driveable after the accident.
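The doubling argument is simple arithmetic on the fitted values. The sketch below uses the rounded figure of 0.1 for both the intercept and the Driveable coefficient, as in the answer, and assumes any other variables in the model are at their base values (zero). The exact exhibit figures are not reproduced here.

```python
# Rounded values from the answer above, not exact exhibit figures.
intercept = 0.1       # predicted probability of fraud when Driveable = 0
driveable_coef = 0.1  # change in predicted probability when Driveable = 1

p_not_driveable = intercept
p_driveable = intercept + driveable_coef
ratio = p_driveable / p_not_driveable

print(f"P(fraud | not driveable) = {p_not_driveable:.2f}")
print(f"P(fraud | driveable)     = {p_driveable:.2f}")
print(f"ratio = {ratio:.1f}")
```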
The intercept for the model is negative. Discuss what this means.
- The Y variable, CO2 emissions, is very unlikely to be negative. However, the negative intercept implies that the
predicted value of Y is negative when all the X variables are zero.
- However, this is not a sensible way to use the model because it is being asked to predict so far outside the range
of observed X values. There are no countries where population and GDP per capita are zero.
Does the intercept have a meaningful interpretation in this context? Explain the reasoning behind your answer
- The intercept is 539.357. This is the expected Price ($000) when Area = 0. (2 marks)
- In this case, the intercept does not have a meaningful interpretation because a property with Area = 0 cannot
exist.
Explain why the slope coefficients of “Area” in Exhibit 4 (simple regression) and Exhibit 5 (multiple regression) are
different. (4 marks)
- This is because the simple regression model only includes one input variable and suffers from omitted variable bias.
- Larger properties are also expected to hold larger dwellings. The simple regression does not account for this.
After dwelling features are accounted for, the area premium in the multiple regression is reduced.
Compare the regression models from Exhibit 4 (simple regression) and Exhibit 5 (multiple regression). Which of these
two models performs better in explaining the data? Use the reported statistics to support your argument. (4 marks)
- R-square for simple regression = 0.320; R-square for multiple regression = 0.652 (2 marks)
- The multiple regression explains 65.2% of the variation in price, while the simple regression explains 32% of the
price variation. With the multiple regression explaining more variation in Price, it is a superior model. The
standard error for the multiple regression is also lower at 249.155 compared to the simple regression at 345.890.
- A well-chosen combination of independent variables in a multiple regression model may allow for a better
explanation of changes in the dependent variable compared to a simple linear regression model
Discuss the fit of the model. Should discuss the R-Square or the standard error.
- The R-Square is bounded between 0 and 1 and at 0.68 it is reasonably high but not outstandingly so. There still
remains significant unexplained variation in CO2 per capita.
Discuss the meaning of the Standard Error of 2.15 in the regression output. Given that the standard deviation value
of the dependent variable is 2.23, would you say this indicates an accurate model?
- On average, the model's predictions of the total score will differ from the actual total score by about 2.15.
- The standard error of 2.15 is quite close to the standard deviation of the total score (2.23), so the model predicts
only slightly better than simply using the mean score, indicating the model is not very accurate.
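The comparison of the two figures can be made concrete. This sketch assumes the reported Standard Error is the standard error of the regression (computed with the usual degrees-of-freedom adjustment) and that 2.23 is the sample standard deviation of the dependent variable. Under those assumptions, 1 - (SE / SD)^2 equals the adjusted R-square.

```python
standard_error = 2.15  # from the regression output quoted in the question
sd_dependent = 2.23    # standard deviation of the dependent variable

# How large the typical prediction error is relative to the spread of Y.
ratio = standard_error / sd_dependent

# Share of the variance in Y explained, on the adjusted R-square scale.
approx_adj_r2 = 1 - ratio ** 2

print(f"SE / SD = {ratio:.3f}")
print(f"implied adjusted R-square = {approx_adj_r2:.3f}")
```

A ratio close to 1 means the regressors remove very little of the variation in the dependent variable.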
Explain carefully the meaning of the coefficients for the 3 education dummy variables, Primary, Secondary and
HigherEd. Remember that the base education category is a person with no education, or only some primary
education. Comment on the corresponding P-values of the education dummy variable coefficients.
- A person who has completed Primary, Secondary or Higher Education is expected to have a higher chance
of working in the formal sector (1 mark) by 0.0073, 0.0664 and 0.0987, respectively (1 mark) relative to a person
who has no education or has not completed primary education (1 mark), holding all other variables constant
- The three education dummy variables are each statistically significant at the 10% level of significance or better
- Hence the education level has a significant effect on the likelihood of working in the formal sector
Discuss the statistical significance of the coefficients. Are these results supportive of including a quadratic trend in
the model as well as a linear trend?
- All coefficients are highly statistically significant. Their p-values are much lower than conventional significance
levels such as 1% and 5%. Therefore we can conclude that the coefficients are likely to meaningfully influence the
dependent variable.
- Because the coefficient on Year Squared is highly significant, this means this coefficient should be in the model.
This implies the quadratic trend model better summarizes the data than would the linear trend model.
CONFIDENCE INTERVAL
From Exhibit 5, report the 95% confidence interval for the slope of “DNew”. Interpret the confidence interval you
reported. (4 marks)
- 95% confidence interval for the slope of DNew = [113.847, 277.848] (1 mark)
- We can be 95% confident that the expected premium for new houses, relative to old houses, is between $113,847
and $277,848, holding beds, baths, area and DSchoolZone fixed.
- Interpretation: We can be 95% confident that the expected premium for <X-Dummy>, relative to <X-notDummy>,
is between <LB> and <UB>, given <all other features> held fixed.
In this case, the confidence interval you calculated will be very narrow. What does this say about the accuracy of
your estimate? Knowing how confidence intervals are calculated, what is the main reason the interval is so narrow
in this case?
- It is a very accurate/precise estimate. The precision depends on the standard error (standard deviation (s) divided
by the square root of sample size (n)).
- The standard error is very small due to a large sample size.
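The effect of sample size on the width of the interval can be sketched as below. The standard deviation of 10 and the sample sizes are hypothetical, and the multiplier 1.96 is the usual normal critical value for a 95% interval.

```python
from math import sqrt

def ci_half_width(s, n, z=1.96):
    """Half-width of a confidence interval for a mean: z * s / sqrt(n)."""
    return z * s / sqrt(n)

s = 10.0  # hypothetical sample standard deviation

for n in (25, 100, 10_000):
    print(f"n = {n:>6}: standard error = {s / sqrt(n):.3f}, "
          f"half-width = {ci_half_width(s, n):.3f}")
```

Because n enters through its square root, quadrupling the sample size only halves the width of the interval.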
Based on this confidence interval results, what is your conclusion about the relationship between the variable and
probability of fraud? Interpret the relationship in the context of the probability of fraud. [5 marks]
- The 95% confidence interval is [0.257062, 0.284536]. (1 mark)
- As the confidence interval only includes positive values and does not include 0, we can conclude that there is a
positive relationship between a lower claim amount (<$5000) and the probability of fraud. (2 marks)
- That is, when the claim amount is <$5000, the probability of fraud is higher by between 0.26 and 0.28 compared
to when the claim amount is >$5000, after controlling for other variables. (2 marks)
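The reasoning from the interval to the conclusion can be checked in a few lines. The bounds are those reported in the answer. Treating the midpoint as the coefficient estimate assumes the interval is symmetric about the estimate, which is the usual construction for a regression coefficient.

```python
lower, upper = 0.257062, 0.284536  # 95% confidence interval from the answer

# If zero lies outside the interval, the coefficient differs
# significantly from zero at the 5% level (two-sided).
contains_zero = lower <= 0 <= upper

point_estimate = (lower + upper) / 2  # midpoint of a symmetric interval
margin_of_error = (upper - lower) / 2

print("interval contains zero:", contains_zero)
print(f"estimate = {point_estimate:.6f} +/- {margin_of_error:.6f}")
```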
m/m0, beta/b, alpha/a, sqrt(_), R2/R^2, Std Error, Std Dev, =NORM.S.DIST