- Hand back assignments
I am building on the foundation that I hope I laid on Thursday.
Definition. Logistic regression is a technique for making predictions when the dependent variable is a dichotomy, and the independent variables are continuous and/or discrete.
We are not really restricted to dichotomous dependent variables, because the technique can be modified to handle polytomous logistic regression, where the dependent variable can take on several levels. We have just about exhausted my knowledge of the subject, but students can look in Hosmer and Lemeshow.
I am going to use the example from the text, because I want to have something they have seen before.
This is a more traditional approach, and the student's advisor may well suggest that route first. Here the idea is that we are using one or more independent variables to predict "group membership," and there is no real difference between "group membership" and "survivor/non-survivor."
The problem with discriminant analysis is that it requires certain normality assumptions that logistic regression does not require. In addition, the emphasis there is really on putting people in groups, whereas it is easier to look at the underlying structure of the prediction ("what are the important predictors?") with logistic regression. (Psychologists are rarely interested in the specific prediction, but in the role that different variables play in that prediction.)
We could use plain old linear regression with a dichotomous dependent variable.
We have already seen this back in Chapter 9 when we talked about point-biserial correlation.
In fact, it works pretty well when the probability of survival varies only between .20 and .80. It falls apart at the extremes, though probably not all that badly.
It assumes that the relationship between the independent variable(s) and the dependent variable is linear, whereas logistic regression assumes that the log odds are linear in the predictors--so the probability follows a logistic (sigmoidal) curve. The reason linear regression works in the non-extreme case is that the logistic curve is quite linear in the center. (Illustrate on board.)
Epping-Jordan, Compas, & Howell (1994)
We were interested in looking at cancer outcomes as a function of psychological variables--specifically intrusions and avoidance behavior.
The data are available at logisticreg.sav
The emphasis here was on the variables, rather than on the prediction.
I'm going to start with one predictor, and then move to multiple predictors.

Variables
- Outcome: 1 = Improved, 2 = Worse
- SurvRate: higher scores = better prognosis
I have discussed some of these variables before in other contexts, so I shouldn't need to go over them all.
What we are really interested in are Intrusions and Avoidance, but I need to start with a simple example, so I will start with the Survival Rating as the sole predictor. This also has the advantage of allowing me to ask if those psychological variables have something to contribute after we control for disease variables. (This is another example of what we mean when we speak of hierarchical regression.)
We can plot the relationship between Outcome and Survival Rating, but keep in mind that there are overlapping points. To create this figure I altered the data for the outcome variable to let 1 = success and 0 = failure (no improvement or worse). I don't know why I did that, but I am too lazy to redraw the graphs--especially the second one. This is an important point, because the results vary depending on which end of the dichotomy we predict.
I have used a sunflower plot here. Every line in the "dot" represents a case, so if we have a dot, we have one case; a vertical line = 2 cases; a cross = 3 cases; etc. Notice that as we move from right to left we have most of the cases as Outcome = 0, then cases equally spread between Outcome = 1 and 0, and then most of the cases at Outcome = 1.
Draw the logistic function on this figure. The following was cut and pasted, after much work with an image editor, from the text. It plots a theoretical continuous outcome (Y) as a function of a predictor (X).
(Note that on the left we have only tiny increases in the amount of the curve that is shaded black. In the center we have major differences in the amount of black. On the right we again have only minor differences in the amount of black.)
Explain what censored data are, using the figure above.
Explain how this leads to sigmoidal data.
Ask them when they would expect to see censored data in what they do.
- Plain old boring pass-fail measures
- DWI versus not DWI
- Those admitted to graduate school versus those not admitted.
The way most of us think about data like this is in terms of probabilities. We talk about the probability of survival.
But it is equally possible to think in terms of the odds of survival, and it works much better, statistically, if we talk about odds.
Odds(survival) = Number Survived / Number Not Survived
Odds(survival) = p(survival) / (1 - p(survival))
As an aside, if the odds of survival are given as above, then the probability of survival is p(survival) = odds/(1 + odds).
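The two conversions above are easy to check with a couple of lines of Python (a quick sketch; the .80 is just an illustrative probability of survival):

```python
# Converting between probability and odds, as in the formulas above.
def prob_to_odds(p):
    return p / (1 - p)          # p(survival) / (1 - p(survival))

def odds_to_prob(odds):
    return odds / (1 + odds)    # recover the probability from the odds

odds = prob_to_odds(0.80)       # a .80 chance of survival -> odds of about 4 to 1
print(odds)
print(odds_to_prob(odds))       # back to about .80
```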
One reason why we like working with odds is that odds are an exponential function of X, whereas, as we saw, probabilities are a sigmoidal function of X. One advantage is that odds can increase without a ceiling. A second advantage is that if we plot the log of the odds, the relationship with X will be linear, which is always nice.
If we had an unlimited number of subjects, and therefore lots of subjects at each survival rating, we could calculate these odds. But we don't have an unlimited number of points, and therefore we can't really get them for every point. But that doesn't mean we can't operate as if we could. (Here is where the magic comes in. They are not going to see simple formulae for slope and intercept, like the ones they see in regression.)
Draw figure on the board plotting odds rather than probabilities.
At the very least, we have problems with probabilities at the high (and low) end. Once you get high enough you can't really get much higher in terms of probability. If a score of 70 gives you a probability of .96 of survival, a score of 80 can, at most, move you up .04. That isn't the case with odds, because odds have no theoretical upper limit.
Now we have to go one step further to get log odds
log odds will allow the relationship I discussed just above to become linear.
log odds(survival) = ln(odds) = ln(p/(1 - p))
Notice that, by tradition, we use the natural logarithm rather than log10. (There is no great reason why this couldn't have been worked out in base 10 logs, except that statisticians and mathematicians like natural logs.)
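A minimal sketch of the logit and its inverse (Python's math.log is the natural log, matching the convention just mentioned):

```python
import math

def logit(p):
    # log odds: ln(p / (1 - p))
    return math.log(p / (1 - p))

def inverse_logit(x):
    # back from log odds to a probability between 0 and 1
    return 1 / (1 + math.exp(-x))

print(logit(0.5))                 # 0: even odds
print(logit(0.9), logit(0.1))     # symmetric about 0
print(inverse_logit(logit(0.3)))  # round trip recovers the probability
```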
This is often called the logit, or the logit transform.
We will work with the logit, and will solve for the equation log odds = b0 + b1*SurvRate.
This is just a plain old linear equation, because we are using logs. That's why we switched to logs in the first place. The equation would not be linear in terms of odds or probabilities, as we saw in the graph above.
b0 is the intercept, and we usually don't care about it.
b1 is the slope, and is the change in the log odds for a one-unit change in SurvRate.
We will solve for all of this by magic, since a pencil and paper solution is out of the question. We will use an iterative solution. (Explain)
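To take some of the magic out of it: the iterative solution is maximum likelihood, and a bare-bones Newton-Raphson version can be sketched in a few lines. The data below are simulated (NOT the Epping-Jordan data), with a made-up true slope of -0.08 and intercept of 2.7, just to show the iterations recovering the coefficients:

```python
import numpy as np

# A bare-bones sketch of the iterative (Newton-Raphson) solution for
# logistic regression. The data are simulated -- not the study data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=500)                  # hypothetical predictor
p_true = 1 / (1 + np.exp(-(-0.08 * x + 2.7)))      # true P(outcome = 1)
y = (rng.random(500) < p_true).astype(float)       # simulated 0/1 outcomes

X = np.column_stack([np.ones_like(x), x])          # intercept column + predictor
b = np.zeros(2)                                    # start at b0 = b1 = 0
for _ in range(25):                                # iterate to convergence
    p = 1 / (1 + np.exp(-X @ b))                   # current fitted probabilities
    W = p * (1 - p)                                # weights for the Hessian
    step = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    b = b + step                                   # Newton-Raphson update

print(b)  # recovered (intercept, slope), near the true (2.7, -0.08)
```

Each pass uses the current estimates to compute fitted probabilities, then updates the coefficients; the updates shrink to nothing as the likelihood reaches its maximum, which is what "iterative solution" means here.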
Graphing the relationships
I have talked about the shape of these distributions. I want students to understand why we go to all this work, so I will jump ahead and calculate the predicted probability of success as a function of Survrate.
Emphasize that I am jumping ahead here.
To do this I just calculated the predicted log odds from SurvRate, then took exp(log odds) to get the odds, and then took prob = odds/(1 + odds).
Notice that SPSS has taken it on itself to predict nonsurvival rather than survival. It just bases that on the way the data are coded.
Notice it is sigmoidal in shape. (I could exaggerate it if I put in some cases with even lower SurvRate.)
Now plot as odds against SurvRate
That is very uninteresting, but I did it along the way. (Odds do extreme things at the extremes.)
Now plot ln(odds) against SurvRate
Step 1 with SPSS
Intercorrelation Matrix of Predictors
Remember that these are linear relationships, but it gives us an idea of where we are starting.
I have simplified the output, but the sample size was always 66, and significance is shown by asterisks.
Now we need to run the Logistic Regression itself, with Outcome as the dv and Survrate as the predictor.
SPSS Logistic Regression
I am using SPSS version 10.1 for some of what follows, and version 9 for the rest. You can tell which is which, because 10.1 has prettier tables.
NOTE what they have done. I coded Worse/NoChange = 1, and they converted it to zero. Improved was a 2, and they changed it to 1. But they have kept the order intact.
Block 1: Method = Enter
In the tables above, the 40.022 is a test on the significance of this step--does the model fit better now that we have added one (or perhaps more) variables?
The value of 37.323 is a test on whether there is still variability in the data to be explained. There is, but that doesn't detract from the usefulness of SURVRATE.
The 17.756 is another test on whether Survrate is a significant predictor.
What follows is version 9, so that people can see what is actually happening. I will skip that for class, because there is too much to cover.
Beginning Block Number 0. Initial Log Likelihood Function
-2 Log Likelihood 77.345746
* Constant is included in the model.
Discuss this printout in detail.
2. The next thing that we see is "-2 Log Likelihood"
This is a model with just an intercept included. It is like testing a linear regression model with just Ŷ = b0 in it.
That model is very uninteresting, but it gives us a base to start from.
-2 Log Likelihood = 77.346 is a chi-square statistic on 1 df, which is clearly significant. But we don't care about its significance here. A significant result means that the model does not fit the data adequately, just as a traditional chi-square test is significant when an independence model does not fit adequately.
3. Then SPSS enters SurvRate as an independent variable and reports another chi-square:
-2 Log Likelihood = 37.323
This is the model with SurvRate added. It is like testing a linear regression model with Ŷ = b0 + b1*X1 in it, where X1 = SurvRate.
This is a test on whether the new model, with SurvRate added, fits the data. A significant chi-square would say that it does not fit the data completely, though that certainly doesn't mean that it doesn't fit better than the previous model.
This is a chi-square on df = number of predictors + 1 (the constant) = 2, and the test is significant.
But we aren't so much interested in whether it is a perfect fit as we are in whether the model with SurvRate in it fits better than the model without SurvRate. For that test we just find the amount of improvement in chi-square.
Improvement = 77.346 - 37.323 = 40.023
This is itself a chi-square on 2 - 1 = 1 df, because we have added one predictor, and is certainly significant.
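That improvement can be checked numerically. For 1 df no tables are needed, because a chi-square on 1 df is a squared standard normal, so the tail probability has a closed form:

```python
import math

# Tail probability of a chi-square on 1 df: since chi2(1) = Z^2,
# p = P(chi2 > x) = erfc(sqrt(x / 2)).
def chi2_1df_pvalue(x):
    return math.erfc(math.sqrt(x / 2))

improvement = 77.346 - 37.323          # the change in -2 log likelihood
print(improvement)                     # 40.023
print(chi2_1df_pvalue(improvement))    # vanishingly small p-value
```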
In other words, SurvRate adds significantly to the prediction of Outcome. (This is so much clearer in version 9.0 than in version 10.0.)
4. I deleted the classification table from the output. I think that they are generally quite misleading, because even dreadful data can sometimes have a high correct classification percentage.
5. The Regression Equation
Log(odds Survival) = -.0812*SurvRate + 2.6836
This means that whenever two people differ by one point in SurvRate, the log odds of survival differ by .0812
Notice that this interpretation is the same as for normal regression, except that we are predicting log odds.
Take someone with a SurvRate = 50. Then
log odds = -.0812(50) + 2.6836 = -1.3764
odds = e^-1.3764 = .2525
This means that the odds of dying are .25--they are only a quarter as likely to die as to survive. (It is important to keep in mind whether we are predicting death or survival.)
If we take the inverse we have 1/.25 = 4.0, which means that with a 50 you are 4 times more likely to live than die.
Keep in mind that this is the odds, not the odds ratio. So you are 4 times more likely to live than you are to die--it is not contrasting you with someone else.
Now someone with a 51 would have
log odds = -.0812(51) + 2.6836 = -1.4576
odds = e^-1.4576 = .2328
The difference in log odds is .0812, the size of the coefficient.
But what does that mean?
Notice that as Survrate increased the odds decreased. BUT these are the odds of NOT surviving. In other words, SPSS has chosen its own definition of "survival." You always have to watch out for this in logistic regression, regardless of the program you use.
But if odds = p/(1 - p), then p = odds/(1 + odds)
For someone with a 50, p = .2524/1.2524 = .20
For someone with a 51, p = .2328/1.2328 = .19
If you have a SurvRate = 50, you are not too likely to die: the probability of dying is .20, and the probability of improving is .80. If your survival rating increases to 51, the probability of dying decreases a tiny bit, to .19. Thus higher survival ratings are associated with lower probabilities of dying.
The only way I know of for being sure which direction things are going is to calculate a couple of probabilities and make sure you know what they mean. (You could read the manual, but who does that :-) )
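Following that advice, the pair of probabilities worked out above can be reproduced in a few lines (using the fitted coefficients from this output):

```python
import math

# Reproduce the worked example: predicted log odds (of the outcome SPSS
# is treating as the event), then odds, then probability, for two scores.
b0, b1 = 2.6836, -0.0812

for survrate in (50, 51):
    log_odds = b0 + b1 * survrate
    odds = math.exp(log_odds)
    prob = odds / (1 + odds)
    print(survrate, round(log_odds, 4), round(odds, 4), round(prob, 2))
```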
We could calculate the probability of surviving for every subject using the above equation. In fact, SPSS will do that for us and SAVE all of the predicted values. We can then make a scatterplot of predicted values against SurvRate.
Notice that the probabilities (as calculated from log odds) stay between 0 and 1, and behave in just the ways I've been talking about. This again makes it obvious that we are plotting the probability of getting worse, since it wouldn't make sense for the probability of survival to decrease as the rating of survival increases.
Notice the sigmoidal curve we have been talking about.
More about the coefficients:
There is another way to look at this printout that is not in terms of probabilities. At the extreme right of the output we see 0.922, which is e^-.0812. We can say that a one-point increase in SurvRate reduces the log odds of death by .0812, or, equivalently, multiplies the odds of death by .922.
In other words, the entry to the right is exp(b).
Notice also that we have a test on the significance of the coefficients. This test is labeled "Wald," and it is (sort of) a chi-square test: it isn't exactly distributed as chi-square, but nearly.
Here Wald = 17.7558 on 1 df, which is significant.
Notice that the Wald chi-square (17.7558), which asks if SurvRate is significant, doesn't agree with the change in chi-square (40.022), which asks if adding SurvRate leads to a better fit. Blame this on Wald; it is not a great test--it tends to be conservative. (The comparable tests in linear regression (F and t) are exactly the same, but not in logistic regression.) The change in chi-square is the better test, but if we had added two variables instead of one, we would need Wald to tell us about each individually.
Predicting Group Membership.
We could make a prediction for every subject, and then put each subject with p > .50 in the "non-survival" group, and everyone with p < .50 in the "survival" group. This is shown below.
In this figure we have shown what actually happened, and you can see that a few people with a predicted value less than .50 actually got worse, and a few above .50 actually got better. But there isn't a huge difference between predicted and actual.
SPSS actually gives us a table of outcomes in the printout, and this shows that 86.36% of the predictions were correct.
Classification tables usually have an important "feel good" component, but that can be very misleading. It is easy to come up with data where almost everyone survives, and then all we have to do to get a great %correct is to predict that everyone will survive. We will be pretty accurate, but not particularly astute.
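A hypothetical split makes the point (the 60/6 counts below are made up for illustration, not taken from the study):

```python
# "Predict the majority class for everyone" -- high accuracy, zero insight.
# Hypothetical sample: 60 survivors, 6 non-survivors (made-up counts).
outcomes = [1] * 60 + [0] * 6            # 1 = survived
predictions = [1] * len(outcomes)        # predict survival for everyone
hits = sum(pred == obs for pred, obs in zip(predictions, outcomes))
accuracy = hits / len(outcomes)
print(round(accuracy, 3))                # about 91% "correct"
```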
People shouldn't be impressed when I claim that my Howell Test of Galactic Threat (HTGT) is extremely accurate because I am never wrong. I simply make the same prediction for everyone--that they will not die from being hit on the head by a meteor--and I haven't been wrong yet.
Multiple Independent Variables
Epping-Jordan, Compas, and Howell (1994) were not really interested in the prediction of survival, although that's a good thing. They really wanted to know what role Avoidance and Intrusions played in outcomes.
Here we have another hierarchical regression question.
Get them to see why we want to look at SurvRate first .
The approach we will take is the hierarchical one of first entering SurvRate and then adding one or more other variables, such as Avoid or Intrus. The first part we have already seen.
I could just enter both Survrate and Avoid at the same time to see what I get. But by adding them at separate stages (using "next" in the dialog box) I can get more useful information.
First use SurvRate, and then add Avoid; Outcome = dv.
Block 2: Method = Enter
At the first step, with just SurvRate, -2 Log Likelihood = 37.323. (We saw this several pages back.) With two predictors, -2 Log Likelihood = 32.206. The difference between these is the test of whether Avoid adds something to the prediction over and above SurvRate. This difference is 5.118, which is shown above. It is on 1 df and is significant at p = .0237. Thus Avoidance adds to the prediction (and actually reduces survivability) after we control for the medical variables that are included in SurvRate.
We can see that the optimal regression equation is
log odds(worse) = -0.0823*SurvRate + .1325*Avoid + 1.196
We can also see the Wald test on these coefficients. Note that the test on Avoid gives a p = .035, which is somewhat different from the more accurate p = .024 that we found above.
If we want to go from log odds to odds, we see the result on the right.
e^-0.0823 = .9210
e^0.1325 = 1.1417
Thus a one point difference in SurvRate multiplies the odds of dying by .9210, when we control for Avoid. Likewise, a one point increase in Avoid multiplies the odds of dying by 1.1417 when we control for SurvRate.
This would make sense because we would expect the odds of dying would decrease (multiply by < 1) as Survrate increases, but that the odds of dying would increase (multiply by > 1) if Avoid increases.
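Those two multipliers are just the exponentiated coefficients, which is quick to verify:

```python
import math

# Odds multipliers are exp(b) for each coefficient in the equation.
b_survrate = -0.0823   # coefficient for SurvRate
b_avoid = 0.1325       # coefficient for Avoid

print(round(math.exp(b_survrate), 4))   # 0.921: odds of dying shrink per point
print(round(math.exp(b_avoid), 4))      # 1.1417: odds of dying grow per point
```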
CONCLUSION Even after we control for the degree of illness (Survrate), avoidance is a bad thing.
What if we add Intrus as well?
The following is a greatly abbreviated output, focusing just on our problem. What I have done is to put in SurvRate at step 1, and both Avoid and Intrus at step 2. Thus the significance test is on whether the two variables together add significantly. They don't (p = .059).
Notice that the test of adding Intrus and Avoid after Survrate is not quite significant (p = .059). Why not?
Although Avoid has much to offer, Intrus has almost nothing. We have increased the change in the LR chi-square a bit by adding Intrus as well as Avoid at this stage (from 5.118 to 5.673), but we have spent an extra degree of freedom to do so. Whereas 5.118 on 1 df was significant, 5.673 on 2 df is not. (It would need to exceed 5.99.)
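The 5.99 cutoff is easy to verify without tables: for 2 df the chi-square distribution reduces to an exponential, so its critical value has a closed form:

```python
import math

# For 2 df the chi-square CDF is 1 - exp(-x/2), so the .05 critical
# value solves exp(-x/2) = .05, giving x = -2 * ln(.05).
crit_2df = -2 * math.log(0.05)
print(round(crit_2df, 2))     # 5.99
print(5.673 > crit_2df)       # False: 5.673 on 2 df is not significant
print(5.118 > 3.841)          # True: 5.118 on 1 df was significant
```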
Note that Wald still calls Avoid significant.
We would be better off going back to the one predictor case.
The following is an e-mail exchange that I received last year. I think that it brings up some interesting points. I don't expect you to remember it all, but I would like you to remember that it is here, and refer to it if you need something like r-squared.
>A colleague using multiple logistic regression would like to have:
>(1) an overall measure of the explanatory power of the model, such as
>proportion of variance explained in linear regression, and.
This issue has been considered extensively in the literature. (a) Apparently one approach is the correlation between the predicted probabilities and the observed outcomes.
[For our data, r = .751, r^2 = .564]
Agresti, A. (1996). An introduction to categorical data analysis, Wiley.
(p. 129) discusses this approach.
(b) Use the model deviance (-2 log likelihood) to calculate a reduction in
error statistic. The deviance is analogous to sums of squares in linear
regression, so one measure of proportional reduction in error--I
think--that is similar to adjusted R^2 in linear regression would be:
pseudo R^2 = (DEV(null)-DEV(model))/DEV(null)
where DEV is the deviance, DEV(null) is the deviance for the null model
(intercept only), and DEV(model) is the deviance for the fitted model.
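Plugging in the deviances reported earlier in this handout (77.346 for the intercept-only model, 37.323 with SurvRate) gives the proportional reduction in deviance, (DEV(null) - DEV(model)) / DEV(null):

```python
# Pseudo R^2 as the proportional reduction in deviance, using the
# deviances from this handout's one-predictor model.
dev_null, dev_model = 77.346, 37.323
pseudo_r2 = (dev_null - dev_model) / dev_null
print(round(pseudo_r2, 3))    # about .517
```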
There exists a number of methods for calculating pseudo R^2 values. A good
discussion can be found in Maddala, G.S. (1983), Limited-dependent and
qualitative variables in economics, Cambridge.
There are many published articles on this topic. Here are just a few.
Nagelkerke, N.J.D. (1991). A note on a general definition of the
coefficient of determination. Biometrika, 78, 3, 691-692.
Agresti, A. (1986). Applying R^2 type measures to ordered categorical data.
Technometrics, 28, 2, 133-138.
Laitila, T. (1993). A pseudo-R^2 measure for limited and qualitative
dependent variable models. Journal of Econometrics, 56, 341-356.
Cox, D.R. & Wermuth, N. (1992). A comment on the coefficient of
determination for binary responses. The American statistician, 46, 1, 1-4.
>(2) a way to compare the contributions of two independent variables when
>both (and possibly other variables as well) are in the model, such as
>incremental R square in linear regression.
For effect sizes I understand that the odds-ratio is the measure of choice.
I don't know, however, how to determine an appropriate comparison of
odds-ratios for two continuous predictors on different scales.
Another possibility would be to look at the change in model deviance
attributed to both variables.