Tutorial 3 Logistic Regression Solutions 1
1. If you are unsure, try repeating the examples shown in the lecture without
looking at the lecture itself.
2. Data: Diabetes data.sav.
The variable “outcome” indicates whether a person has
diabetes (1) or not (0).
a. Explore the data set in tables with “outcome” as the columns and the
independent variables as the rows.
b. Which variables predict the diabetes outcome? Is this as you
would expect? How well does the model perform and what are its
limitations?
Run a binary logistic regression with Outcome as the dependent variable.
Choose either Age or Age group (you cannot have both in the model).
Remove the non-significant variables and re-run the model, repeating until
all remaining variables are significant. Eventually you arrive at the model below.
The model does a good job of predicting those who are not diagnosed as
diabetic, but a much poorer job of predicting those who are.
We can alter the classification threshold, making it harder for respondents to be
classified as 0 (“non diabetes”), so that we correctly predict a
larger share of those with diabetes. The price we pay is that
more respondents without diabetes are now classified as having
diabetes. You need to decide on the balance and on which
“error” you are happier absorbing. Below is the classification with a 0.2
threshold (set in the Options menu).
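The trade-off described above can be sketched in a few lines of Python. This is only an illustration: the outcomes and predicted probabilities are made-up toy values, not output from Diabetes data.sav, and the function name classification_table is an assumption introduced here. A case is classed as 1 when its predicted probability is at or above the threshold.

```python
# Toy values for illustration only (NOT taken from Diabetes data.sav).
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_prob = [0.05, 0.10, 0.15, 0.25, 0.30, 0.55, 0.18, 0.35, 0.60, 0.80]


def classification_table(actual, probs, threshold):
    """Count (true 0, false 1, false 0, true 1) for a given threshold."""
    true_neg = false_pos = false_neg = true_pos = 0
    for y, p in zip(actual, probs):
        predicted = 1 if p >= threshold else 0
        if y == 0 and predicted == 0:
            true_neg += 1
        elif y == 0 and predicted == 1:
            false_pos += 1
        elif y == 1 and predicted == 0:
            false_neg += 1
        else:
            true_pos += 1
    return true_neg, false_pos, false_neg, true_pos


for threshold in (0.5, 0.2):
    tn, fp, fn, tp = classification_table(actual, predicted_prob, threshold)
    print(f"threshold={threshold}: correct 0s={tn}, 0s classed as 1={fp}, "
          f"1s classed as 0={fn}, correct 1s={tp}")
```

With these toy values, lowering the threshold from 0.5 to 0.2 picks up one more true diabetes case but misclassifies two more non-diabetes cases, which is the same balance you have to judge in the SPSS classification table.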
Some expected insulin to be in the model. The issue here is that, if you run
a correlation analysis, insulin is correlated with Glucose, and hence the
model will only choose one of them.
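As a rough sketch of the correlation check mentioned above, the Pearson correlation between two variables can be computed by hand as follows. The glucose and insulin values are invented toy numbers used purely to show the calculation, and pearson_r is a helper name introduced here; in practice you would run the correlation in SPSS on the real data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy values for illustration only (NOT taken from Diabetes data.sav).
glucose = [85, 95, 110, 130, 150, 170]
insulin = [40, 60, 90, 120, 160, 210]

r = pearson_r(glucose, insulin)
print(f"correlation between glucose and insulin (toy data): {r:.2f}")
```

When two predictors are this strongly correlated they carry much the same information, so once one of them is in the model the other tends to drop out as non-significant.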
You can force insulin into the model by starting with it and then
progressively adding other terms (but leaving out Glucose).
This gives the classification below (threshold reset to 0.5). It is not as
good a model as the one with Glucose in, but a model with insulin
instead of glucose may be more logical (depending on the circumstances).
Insulin does not do a good job of predicting diabetes on its own and
needs other variables to help it predict.
3. Data: “candy.xls”.
a. Explore the Candy data in Excel and then import the data to SPSS.
Create appropriate labels in the Variable view from information in the
Excel file. Save as a .sav file.
b. Using the scale variables as independent variables, how well can each
of the candy types be predicted (one model per candy type)? Interpret
the models statistically and also from a lay audience perspective.
i. chocolate
ii. fruity
iii. caramel
iv. peanutyalmondy
v. nougat
vi. crispedricewafer
Use the binary logistic regression function to create a model for each one of the above.
Some models are better than others. Perhaps surprisingly, the sugar coefficient is not statistically significant.
Chocolate
Fruity
Caramel
Peanut almond
A very low level of prediction
Nougat
If we use the first category as the reference category, the model does improve in terms of
interpretation.
The classification does OK at identifying only chocolate or only fruity, but not both or none.
Here are the parameter estimates. They are all versus a base of “brand C”. The idea is
to go through the statements and see whether, for some of them, a brand is statistically higher (+ coef)
than brand C or statistically lower (- coef) than brand C. This gives you a
distinctive positioning you can use to market the brands under those circumstances.
On the other hand, “seductive” is not statistically different for brand A compared to brand C,
but it is for brand B compared to brand C (larger coef and significant p-value).
The same logic applies for checking the other statements.
For example, where both brand A and brand B have statistically lower coefficients than brand C,
brand C would be more appropriate for that statement.
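The reading of the parameter estimates described above can be written as a small rule. This is a sketch only: the coefficients and p-values are hypothetical placeholders chosen to mirror the “seductive” example (they are not the actual SPSS output), and the 0.05 significance level and the helper name compare_to_base are assumptions.

```python
def compare_to_base(coef, p_value, alpha=0.05):
    """Describe a brand relative to the base brand for one statement."""
    if p_value >= alpha:
        return "not statistically different from base"
    return "statistically higher than base" if coef > 0 else "statistically lower than base"


# Hypothetical (coef, p-value) pairs versus base brand C, for illustration only.
estimates = {
    ("seductive", "brand A"): (0.10, 0.62),
    ("seductive", "brand B"): (0.85, 0.01),
}

for (statement, brand), (coef, p_value) in estimates.items():
    print(f"{statement}, {brand}: {compare_to_base(coef, p_value)}")
```

A statement on which one brand comes out higher and another lower than the base is the kind of contrast that gives a distinctive positioning.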
This is essentially the same exercise as above with one added step: you
must select only the data relating to away from home consumption. You
can do this in SPSS by selecting Data/Select Cases. Click on the “If condition
is satisfied” option. Then select the HomeAway variable and move it to the top
box. Then set this to “=1”. Click Continue and then OK.
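For anyone working outside SPSS, the same case selection can be sketched as a simple filter. The rows below are hypothetical stand-ins for the survey file; only the HomeAway variable and the “=1” condition come from the steps above.

```python
# Hypothetical rows for illustration; only HomeAway and the =1 condition
# come from the tutorial text.
rows = [
    {"respondent": 1, "HomeAway": 1},
    {"respondent": 2, "HomeAway": 0},
    {"respondent": 3, "HomeAway": 1},
    {"respondent": 4, "HomeAway": 0},
]

# Keep only the cases where HomeAway = 1, as in Data/Select Cases.
away_only = [row for row in rows if row["HomeAway"] == 1]

print(f"{len(away_only)} of {len(rows)} cases selected")
```

The model is then fitted on the selected cases only, exactly as in the previous exercise.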
5. Load the solver Excel sheet from the lecture resources on Blackboard. Try
using the Diabetes data to build a model using solver as a form of estimation.
How do the results compare to question 2?
See the Excel sheet I have uploaded. You may well have got a different
answer, depending on the constraints and starting values you used.
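To see why the starting values matter, here is a minimal sketch of estimating a logistic regression by directly maximising the log-likelihood, which is the quantity a solver-based sheet would typically be set up to maximise. It uses plain gradient ascent as a stand-in for Solver's own optimiser, one predictor, and invented toy data rather than the Diabetes data; the function names, learning rate and step counts are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Sum of y*ln(p) + (1-y)*ln(1-p) over all cases."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit(xs, ys, b0, b1, steps, rate=0.1):
    """Plain gradient ascent on the log-likelihood from given starting values."""
    n = len(xs)
    for _ in range(steps):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += rate * sum(residuals) / n
        b1 += rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1


# Toy data for illustration only (NOT the Diabetes data): one predictor.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1]

starts = [(0.0, 0.0), (1.0, -1.0)]
short_runs = [fit(xs, ys, b0, b1, steps=20) for b0, b1 in starts]
long_runs = [fit(xs, ys, b0, b1, steps=5000) for b0, b1 in starts]

for label, runs in (("20 steps", short_runs), ("5000 steps", long_runs)):
    for (s0, s1), (b0, b1) in zip(starts, runs):
        print(f"{label}, start=({s0}, {s1}): b0={b0:.3f}, b1={b1:.3f}, "
              f"logL={log_likelihood(b0, b1, xs, ys):.3f}")
```

If the optimiser stops early, the two starting points give visibly different coefficients; run for long enough, they settle on the same estimates. A solver that halts on an iteration limit, a loose tolerance or a constraint can therefore report an answer that differs from the SPSS result in question 2.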