
Tutorial 3 Logistic Regression Solutions 1

This document provides instructions for several logistic regression modeling exercises using various datasets. Key points: 1. Analyze diabetes data to predict diabetes diagnosis, achieving good prediction for non-diabetics but poorer prediction for diabetics. Variables like age and glucose are associated with diabetes. 2. Candy data predicts candy types like chocolate moderately well but others poorly. A multinomial logistic model categorizes candy types better. 3. Spirits drink data advises which brand image statements (e.g. mysterious, social) best fit each brand based on consumption context (home vs away). 4. Excel solver is used to build a diabetes model for comparison to SPSS results. Model accuracy may differ

Uploaded by

springfield12
Copyright © All Rights Reserved

EFIM30051: Data Analytics and Artificial Intelligence

Tutorial 3: Logistic Regression


Instructions

1. If you are unsure, try repeating the examples shown in the lecture without
looking at the lecture itself.
2. Data: Diabetes data.sav.
The variable “outcome” indicates whether a person has
diabetes (1) or not (0).
a. Explore the data set in tables with “outcome” as the column variable and the
independent variables as the rows.

Run a tables analysis to see whether some variables are more strongly
associated with diabetes than others. For example, it looks like older
age and higher glucose levels are associated with diabetes.
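
The tables analysis above can be sketched outside SPSS as a simple cross-tabulation. This is a minimal illustration only: the records and the two age bands are made up and are not taken from Diabetes data.sav. It reports column percentages, i.e. within each outcome, the share of people in each age band.

```python
from collections import Counter

# Made-up records for illustration only: (age_group, outcome)
# outcome: 1 = diabetes, 0 = no diabetes
records = [
    ("under 40", 0), ("under 40", 0), ("under 40", 0), ("under 40", 1),
    ("40 and over", 0), ("40 and over", 1), ("40 and over", 1), ("40 and over", 1),
]

# Cell counts of the table: rows = age group, columns = outcome
counts = Counter(records)

def column_percent(group, outcome):
    """Percentage of one outcome column that falls in a given row group."""
    column_total = sum(n for (g, o), n in counts.items() if o == outcome)
    return 100 * counts[(group, outcome)] / column_total

for group in ("under 40", "40 and over"):
    print(group,
          f"outcome=0: {column_percent(group, 0):.0f}%",
          f"outcome=1: {column_percent(group, 1):.0f}%")
```

A row group that takes up a much larger share of the outcome=1 column than of the outcome=0 column is a candidate predictor.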

b. Which variables predict the diabetes outcome? Is this as you
would expect? How well does the model perform and what are the
limitations?
Create a binary logistic regression model with Outcome as the dependent variable.
Choose either Age or Age group (the model cannot have both).

Look at the parameter estimates (sig levels and B estimates).

Remove the non-significant variables and re-run until all remaining variables
are significant. Eventually you arrive at the below.

All variables are significant and sensible.
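
To read the B estimates, it can help to convert them to odds ratios (the Exp(B) column in SPSS) and to predicted probabilities. The sketch below assumes a one-variable model; the intercept and the glucose coefficient are hypothetical values chosen for illustration and are not the estimates from the tutorial output.

```python
import math

# Hypothetical coefficients for illustration; NOT the tutorial's SPSS estimates.
intercept = -5.0
b_glucose = 0.03   # change in log-odds per one-unit increase in glucose

def predicted_probability(glucose):
    """Logistic model: p = 1 / (1 + exp(-(intercept + B * x)))."""
    log_odds = intercept + b_glucose * glucose
    return 1 / (1 + math.exp(-log_odds))

# exp(B) is the multiplicative change in the odds per one-unit increase
odds_ratio = math.exp(b_glucose)

print(f"Exp(B) = {odds_ratio:.3f}")
print(f"P(diabetes | glucose=100) = {predicted_probability(100):.3f}")
print(f"P(diabetes | glucose=180) = {predicted_probability(180):.3f}")
```

A positive B gives Exp(B) above 1, so the predicted probability rises with the variable; a negative B gives Exp(B) below 1.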


Look at the classification

It’s doing a good job at predicting those who are not diagnosed as
diabetic but not such a good job at predicting those who are.
We can alter the threshold, making it harder for respondents to be
classified as 0 (“non diabetes”), so that we correctly predict a
larger share of those with diabetes. However, the price we pay is that
more respondents without diabetes are now classified as having
diabetes. You need to decide on the balance and on which
“error” you are happier absorbing. Below is a 0.2 threshold (set in the
Options menu).
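
The trade-off described above can be sketched with a small classification table computed at two cut-offs. The predicted probabilities below are made up for illustration and do not come from the diabetes model; the rule "classify as 1 when the probability is at or above the cut-off" is also an assumption of this sketch.

```python
# Made-up (actual outcome, predicted probability) pairs for illustration only.
cases = [
    (0, 0.05), (0, 0.10), (0, 0.15), (0, 0.25), (0, 0.30), (0, 0.55),
    (1, 0.15), (1, 0.35), (1, 0.45), (1, 0.60), (1, 0.80),
]

def classification_table(cases, threshold):
    """Count cases by (actual, predicted) at a given cut-off."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for actual, prob in cases:
        predicted = 1 if prob >= threshold else 0
        table[(actual, predicted)] += 1
    return table

def percent_correct(table, actual):
    """Percentage of one actual group that is classified correctly."""
    row_total = table[(actual, 0)] + table[(actual, 1)]
    return 100 * table[(actual, actual)] / row_total

for threshold in (0.5, 0.2):
    t = classification_table(cases, threshold)
    print(f"threshold {threshold}: "
          f"non-diabetic correct {percent_correct(t, 0):.0f}%, "
          f"diabetic correct {percent_correct(t, 1):.0f}%")
```

Lowering the cut-off moves cases from predicted 0 to predicted 1, so the percentage correct rises for the diabetic group and falls for the non-diabetic group.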

Some expected insulin to be in the model. The issue here is that, if you run
a correlation analysis, insulin is correlated with Glucose and hence the
model will only choose one of them.

You can force insulin into the model by starting with it and then
progressively adding other terms (but leaving out Glucose).
This gives the below classification (threshold reset to 0.5). It is not as
good a model as the one with Glucose in, but it may be a more logical model
with insulin instead of glucose (it depends on the circumstances).
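
The correlation check mentioned above can be sketched as a Pearson correlation between the two predictors. The readings below are made up for illustration and are not taken from the data set.

```python
import math

# Made-up glucose and insulin readings for illustration only.
glucose = [85, 95, 110, 125, 140, 160]
insulin = [40, 60, 75, 110, 130, 170]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(glucose, insulin)
print(f"r = {r:.3f}")
```

When two predictors are strongly correlated, they carry much of the same information, which is why only one of them tends to stay significant in the model.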

c. The variable “Diabetes Pedigree function” is a mathematical concept
based on heredity data. How well does this variable perform in
predicting diabetes in patients?

The variable does not do a good job of predicting diabetes on its own and
needs other variables alongside it to help it predict.

3. Data: “candy.xls”.
a. Explore the Candy data in Excel and then import the data to SPSS.
Create appropriate labels in the Variable view from information in the
Excel file. Save as a .sav file.
b. Using the scale variables as independent variables, how well can each
of the candy types be predicted (one model per candy type). Interpret
the models statistically and also from a lay audience perspective
i. Chocolate
ii. fruity caramel
iii. peanutyalmondy
iv. nougat
v. crispedricewafer

Use the binary logistic regression function to create a model for each one of the above.
Some models are better than others! The sugar coefficient is not statistically significant!?

Chocolate

Fruity
Caramel

This model is not doing anything to predict the candy.

Peanut almond
A very low level of prediction

Nougat

No level of prediction (I may as well just call everything not Nougat)

c. Either in SPSS syntax OR in Excel, create a new variable with the
following properties
i. 0 if the product is NEITHER a chocolate NOR a fruity
ii. 1 if the product is a fruity
iii. 2 if the product is a chocolate
iv. 3 if the product is both chocolate and fruity
Use the remaining variables which you think are a sensible inclusion to
try and predict this multiple category variable. What conclusions do you
draw and how would you report this to management?
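
The coding rule in part c can be sketched as a small function. It assumes the chocolate and fruity indicators are coded 0/1, as the dummy variables in the candy file appear to be; the example products are hypothetical.

```python
def choc_fruit_category(chocolate, fruity):
    """Return 0 = neither, 1 = fruity only, 2 = chocolate only, 3 = both.

    Both inputs are assumed to be 0/1 indicators.
    """
    return 2 * chocolate + fruity

# Hypothetical products for illustration: (name, chocolate, fruity)
products = [
    ("plain mint", 0, 0),
    ("fruit chew", 0, 1),
    ("chocolate bar", 1, 0),
    ("chocolate orange", 1, 1),
]

for name, chocolate, fruity in products:
    print(name, choc_fruit_category(chocolate, fruity))
```

The same arithmetic (2 times chocolate plus fruity) can be typed as a formula in Excel or as a compute statement in SPSS.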
We need to use Multinomial Logistic Regression as we have more than 2 categories to
predict

Not a very good model using category 3 as the base.

If we use the first category as the reference category, the model does improve in terms of
interpretation.
The classification does OK at identifying only chocolate or only fruity, but not both or neither.

4. Data: “spirits drinks.sav”
The data refers to a questionnaire distributed to 8000 respondents to
enquire about their choice of three spirits brands. The drink choice
indicates the drink they had chosen (Brand A, Brand B or Brand C).
The image statements refer to the drink, and the respondent had to
state whether that particular statement was influential in why they
chose that brand (score of 1) or was of no relevance to why they
chose that brand (score of 0). The last variable refers to whether the
occasion was at home/friend’s home/etc. or whether it was away from
home (e.g., in a bar/restaurant/café/etc.). All three brands are
owned by your organisation and the Marketing Director is asking your
advice on how the marketing spend should be directed in terms of
appealing to various aspects of each brand. The aspects in question
are the following.
High opinion
Seductive
Mysterious
I would drink it with friends
The Marketing Director needs direction on which image statements
should be associated with each brand.
a. By creating an appropriate model, advise the Marketing Director which
elements from their list above should be associated with each of the
brands and why. How would you describe the “fit/classification” of your
model statistically and what does that mean for the brand managers of
each of the brands (A, B and C)?

Here are the parameter estimates. They are all versus a base of “brand C”. The idea is
to find the statements in question and see whether some of them are statistically higher (+ coef)
than brand C and others statistically lower (- coef) than brand C (or vice versa). This gives you a
distinctive positioning you can use to market the brands under those circumstances.
“Seductive”, for instance, is not statistically different for brand A compared to brand C,
but it is for brand B compared to brand C (larger coef and significant p-value).
The same logic applies for checking the other statements.
For example, where both brand A and brand B have statistically lower coefs than brand C,
brand C would be more appropriate for that statement.

b. The Marketing Manager of brand A is also working on a pilot project
with some bars in the Bristol area to promote brand A. They are trying
to come up with the most important image statements which appeal to
that brand in this environment and are considering the following to
create the bar environment.
i. High opinion of the brand
ii. Focus on attractive packaging
iii. Make the atmosphere of the bar “daring and mysterious”
iv. Promote the brand being mixed with a Bristol based mixer drink
v. Charge a premium price as the consumers will be happy to pay
a higher price for the experience
Advise the Manager on what you would recommend. How confident
are you, statistically, of the advice and how would you communicate
this to the brand manager?

This is essentially the same exercise as above with one added step: you
must select only the data relating to away-from-home consumption. You
can do this in SPSS by selecting Data/Select Cases. Click on the “If condition
is satisfied” option. Then select the HomeAway variable and move it to the top
box. Then set this “=1”. Click Continue and then OK.
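
The case selection above amounts to keeping only the rows where HomeAway equals 1. A minimal sketch follows; the rows and the "brand" and "Mysterious" column names are made up for illustration, while the HomeAway = 1 coding for away from home follows the steps described above.

```python
# Made-up rows for illustration; only the HomeAway = 1 rule comes from the steps above.
rows = [
    {"brand": "A", "HomeAway": 1, "Mysterious": 1},
    {"brand": "B", "HomeAway": 0, "Mysterious": 0},
    {"brand": "C", "HomeAway": 1, "Mysterious": 0},
    {"brand": "A", "HomeAway": 0, "Mysterious": 1},
]

# Keep only away-from-home occasions before fitting the model
away_rows = [row for row in rows if row["HomeAway"] == 1]

print(f"{len(away_rows)} of {len(rows)} rows kept")
```

The model from part a is then re-run on the filtered rows only.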
5. Load the solver Excel sheet from the lecture resources on Blackboard. Try
using the Diabetes data to build a model using solver as a form of estimation.
How do the results compare to question 2?

See the Excel sheet I have loaded up. You may very well have got a different
answer depending on constraints and initial starting values.
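
The idea of estimating the model numerically can be sketched as follows, assuming the sheet maximises the log-likelihood of the logistic model. The data are made up, and plain gradient ascent stands in for whatever algorithm Solver actually applies, so this illustrates the principle rather than reproducing the Excel sheet.

```python
import math

# Made-up (x, outcome) pairs for illustration only.
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 1), (0.0, 0),
        (0.5, 1), (1.0, 0), (1.5, 1), (2.0, 1), (2.5, 1)]

def log_likelihood(b0, b1):
    """Log-likelihood of a one-variable logistic model on the data."""
    total = 0.0
    for x, y in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(b0, b1, step=0.1, iterations=5000):
    """Maximise the log-likelihood by plain gradient ascent from (b0, b1)."""
    for _ in range(iterations):
        grad0 = grad1 = 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            grad0 += y - p          # derivative with respect to the intercept
            grad1 += (y - p) * x    # derivative with respect to the slope
        b0 += step * grad0
        b1 += step * grad1
    return b0, b1

# Two different starting points, many iterations each
fit_a = fit(0.0, 0.0)
fit_b = fit(2.0, -2.0)
print("start (0, 0): ", fit_a, log_likelihood(*fit_a))
print("start (2, -2):", fit_b, log_likelihood(*fit_b))
```

In this small example both starting points end at the same estimates because the search is run for many iterations; stopping early (for example with iterations=5) leaves the answer dependent on the starting values.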
