Department of Statistics: STA4806: Advanced Research Methods in Statistics
Department of Statistics
STA4806: Advanced Research Methods in Statistics
Assignment 2, 2016
Unique Nr.: 828027
Fixed closing date: 08 July 2016
Consider each of the following research problems and answer the following question for each problem.
Which statistical technique is applicable? Clearly state which variables are the dependent and which are
the independent variables. If applicable for a specific problem, what information can be added to improve
the survey and its results?
In a survey of 200 patients who suffered heart attacks, various observations were made on each patient:
age, height, weight, pulse rate, blood pressure, cholesterol level, blood sugar level and red cell count.
The medical researcher is interested in isolating the variables that contribute most to the occurrence of
heart attacks. (4)
A researcher wants to compare apple trees of six different rootstocks. Data on the trunk circumference
at 4 years and at 15 years, the extension growth at 4 years and the weight of the tree at 15 years are
available for 20 trees from each of the six different rootstocks. (3)
Annual financial data are collected for 25 bankrupt firms approximately two years prior to their
bankruptcy and for 25 financially sound firms at about the same time. For each firm, data on the
following four variables were obtained:
1 = (net income)/(total assets); 2 = (cash flow)/(total debt);
An analyst wants to construct a model which can be used to predict the financial soundness of a firm
in terms of bankruptcy or non-bankruptcy. (3)
A provider of IT services would like to test perceptions of the prices that it charges for its services to
customers. A survey was completed with 148 of its customers, which among other things contained 7
questions regarding pricing:
Respondents were asked to rate these questions on a scale of 1 to 5, where 1 = Very high prices, 2 =
High prices, 3 = Moderate prices, 4 = Low prices and 5 = Very low prices. Specifically, the service
provider would like to determine the effect that each service’s price has on the overall price perception
that customers have. They would like to do this by using regression analysis with Q3.2 as the dependent
variable, and Q3.2.1 to Q3.2.6 as the independent variables.
On each of the 7 questions, respondents were allowed to answer “Don’t know” if they felt that they could
not provide a perception of the prices.
Customers could fall into one of 5 segments according to the type of service they receive from the service
The following tables contain results for the missing value analysis on these variables.
Using the above tables, discuss the missing values in the dataset with specific reference to:
(b) The extent of the missing data and the effect it might have on the regression analysis. (10)
(c) The possible reason why 9 cases were deleted between running the first two tables (Tables 2.1 and 2.2)
and running the third table (Table 2.3). (5)
(d) The randomness of the missing data (specifically referring to whether customer segment influences the
missing data process). (5)
(e) Any other analyses that might have added valuable information to the missing value
analysis. (5)
(f) If you can deduce an imputation method to be used from the results provided, which method would
you use? (2)
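As an illustration of the kind of summary that underlies a missing value analysis, the following minimal Python sketch counts missing responses per question, counts the complete cases that listwise deletion would keep, and compares the missing rate across customer segments. It assumes that "Don't know" answers are coded as missing (None). The segment labels, the data values and the function names are hypothetical; this is not the SPSS procedure that produced the tables.

```python
# Hypothetical survey rows: None stands for a "Don't know" answer treated as missing.
rows = [
    {"segment": "A", "Q3.2": 3, "Q3.2.1": 2, "Q3.2.2": None},
    {"segment": "A", "Q3.2": 4, "Q3.2.1": None, "Q3.2.2": None},
    {"segment": "B", "Q3.2": 2, "Q3.2.1": 3, "Q3.2.2": 3},
    {"segment": "B", "Q3.2": None, "Q3.2.1": 1, "Q3.2.2": 2},
    {"segment": "B", "Q3.2": 5, "Q3.2.1": 4, "Q3.2.2": 4},
]
variables = ["Q3.2", "Q3.2.1", "Q3.2.2"]


def missing_summary(rows, variables):
    """Return {variable: (missing count, missing percentage)}."""
    n = len(rows)
    summary = {}
    for var in variables:
        count = sum(1 for row in rows if row[var] is None)
        summary[var] = (count, 100.0 * count / n)
    return summary


def complete_cases(rows, variables):
    """Rows with no missing value on any listed variable (listwise deletion)."""
    return [row for row in rows if all(row[var] is not None for var in variables)]


def missing_rate_by_group(rows, variable, group_key):
    """Proportion of missing values on one variable within each group."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (row[variable] is None)
    return {group: missing[group] / totals[group] for group in totals}


summary = missing_summary(rows, variables)
kept = complete_cases(rows, variables)
by_segment = missing_rate_by_group(rows, "Q3.2.2", "segment")

for var, (count, pct) in summary.items():
    print(f"{var}: {count} missing ({pct:.1f}%)")
print("Complete cases:", len(kept), "of", len(rows))
print("Missing rate on Q3.2.2 by segment:", by_segment)
```

A large difference in missing rates between segments would suggest that the data are not missing completely at random; a formal test would still be needed to support that conclusion.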
Holzinger and Swineford (1939) gave 24 psychological tests to 145 seventh and eighth grade students in
a Chicago suburb. The data are typical of the ability tests that have been used throughout the history of
factor analysis. The factor-analytic problem itself is concerned with the number and kind of dimensions
that can be used to describe the ability. The data is on myunisa under additional resources, sub-directory
assignment 2. Import the data into SPSS and answer the following questions.
(a) Calculate the mean and standard deviation of the 24 variables. (15)
(b) Do a reliability analysis and give the reliability coefficient for each variable. (20)
(c) Do the correlation matrix of the data and comment on key features. (10)
(d) Do exploratory factor analysis of the data and label the factors and ensure that you:
(i) find the number of factors,
(ii) identify poor factor indicators,
(iii) identify poorly measured factors, and
(iv) label your factors and interpret your final factor solution.
(e) Calculate the composite variables of each factor (by taking the average) and test whether the mean
differs by:
The change of water (a liquid) in the soil to water vapour (a gas) is called evaporation. Heat and wind
help water to evaporate more quickly.
Evaporation is, among other things, a function of air and soil temperatures, relative humidity and wind.
Since these factors vary considerably throughout the day, it is not clear which variables are important.
The data set Soil evaporation data set.sav, found on myunisa under additional resources, sub-directory
assignment 2, consists of 46 observations. We want to find the effect of the variables described below on
daily water evaporation in the soil.
(a) Find the descriptive statistics of the variables (minimum, maximum, mean, standard deviation and
coefficient of variation) and discuss the statistics. (10)
(b) Calculate the correlation coefficient matrix and comment on the key features. (8)
(c) Is the sample size adequate for a multiple regression model involving all the independent variables?
Explain. (3)
(d) Consider the correlation matrix you obtain in part (b). Which of the seven independent variables do
you think might be possible predictors of EVAP? Why? (3)
(e) Fit a multiple regression model containing all the seven independent variables. Ensure that you include
the collinearity statistics. Give the statistics which indicate that the regression model containing all
seven independent variables gives a significant fit and explain why. (8)
(f) Looking at the table "coefficients" of the output you generated in (e), comment on the t-values of the
variables. (3)
(g) What can you deduce from the "tolerance values" of the output obtained in (e)? Compare your answer
with the conclusion that can be drawn from the values of the
"condition indexes". (4)
(h) Run a backward elimination model and ensure your output includes the collinearity statistics, collinearity
diagnostics and residual analysis statistics. Which of the two models would you recommend and why?
(i) Write down the estimated regression equation for the chosen model. (1)
(j) Using the backward elimination model, comment on the statistical significance of the parameter
estimates. How do these conclusions compare with the conclusions drawn
in (a)? (3)
(k) Is there any multicollinearity present in the backward elimination regression model? Justify your
answer. (2)
(l) Are there outliers or influential observations in the data set? Justify. (3)
(m) What can be done to improve the results of the regression analysis? (4)
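The collinearity statistics referred to in parts (e), (g) and (k) can be illustrated with a short calculation. For predictor j, the variance inflation factor (VIF) equals the j-th diagonal element of the inverse of the predictor correlation matrix, and the tolerance is its reciprocal, 1 - R_j^2, where R_j^2 comes from regressing predictor j on the other predictors. The minimal Python sketch below uses only the standard library. The predictor names, the data values and the function names are hypothetical; this is not the soil evaporation data set and not SPSS output.

```python
from statistics import mean


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def collinearity(columns):
    """Return {name: (tolerance, VIF)} for a dict of predictor columns."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: (1.0 / inv[i][i], inv[i][i]) for i, name in enumerate(names)}


# Hypothetical predictors: the first two are nearly collinear by construction.
predictors = {
    "max_air_temp": [1, 2, 3, 4, 5, 6],
    "min_air_temp": [2, 4, 6, 8, 10, 13],
    "wind": [3, 1, 4, 1, 5, 9],
}
for name, (tolerance, vif) in collinearity(predictors).items():
    print(f"{name}: tolerance = {tolerance:.4f}, VIF = {vif:.2f}")
```

A common rule of thumb treats a tolerance below about 0.1 (a VIF above about 10) as a sign of serious multicollinearity, although the cut-off is a convention and not a formal test.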