Logistic Regression: Binary
Why use logistic regression to do statistical analysis on data?
- it is useful when dealing with a combination of quantitative and categorical data.
- output in MINITAB© gives an odds ratio and a p-value (a measure of statistical significance).
- can be easily used to measure interaction between variables.
But what does it all mean?
What are odds?
Odds(event x) = p(x) / (1 - p(x))
where p(x) is the probability of event x occurring
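The definition above can be sketched in a couple of lines of Python:

```python
def odds(p):
    """Convert a probability p into odds: p / (1 - p)."""
    return p / (1.0 - p)

# A probability of 0.75 corresponds to odds of 3 (3-to-1 in favour).
print(odds(0.75))  # 3.0
```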
Odds ratio - compares the likelihood of an event occurring in two different groups, to see if there is any significant difference between the two groups.
Let the event be denoted by A, and the two groups be denoted by l and m. Then the odds ratio is determined by:
Odds ratio = Odds(A|l) / Odds(A|m)
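As a small sketch, with made-up probabilities for the two groups:

```python
def odds(p):
    return p / (1.0 - p)

def odds_ratio(p_l, p_m):
    """Odds ratio of event A between groups l and m."""
    return odds(p_l) / odds(p_m)

# If A occurs with probability 0.6 in group l and 0.3 in group m:
print(odds_ratio(0.6, 0.3))  # (0.6/0.4) / (0.3/0.7) = 3.5
```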
Why use odds instead of probability as a measure of likelihood?
Odds have no upper bound - this is useful for the following reason:
Some assumptions of linear regression, using a (0,1) variable:
If yi = α + βXi + εi and E[εi] = 0,
then:
E[yi] = E[α + βXi + εi] = E[α] + E[βXi] + E[εi] = α + βXi
so pi = α + βXi is not bounded to [0,1]: the fitted line can predict "probabilities" below 0 or above 1. To get around this problem, we use logistic regression, as opposed to ordinary linear regression.
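A tiny illustration of the problem, with hypothetical 0/1 data: fitting an ordinary least-squares line to a binary response and then predicting slightly beyond the observed range gives an impossible "probability".

```python
# Hypothetical 0/1 response data
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]

# Simple least-squares fit: y = alpha + beta * x
n = len(x)
mx, my = sum(x) / n, sum(y) / n
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
       sum((xi - mx) ** 2 for xi in x)
alpha = my - beta * mx

# Predicting a little beyond the observed x range:
print(alpha + beta * 8)  # greater than 1 - an impossible probability
```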
How does logistic regression get around the upper and lower bounds issue?
By transforming the probability to odds we remove the upper bound, and by taking logs we remove the lower bound.
log(pi/(1-pi)) = α + β1Xi1 + β2Xi2 + ... + βkXik
where the exponentiated coefficients e^β can be interpreted as odds ratios, and
pi = e^(α + β1Xi1 + ... + βkXik) / (1 + e^(α + β1Xi1 + ... + βkXik)) = 1 / (1 + e^(-α - β1Xi1 - ... - βkXik))
and, in the single-predictor case, logit(p) = ln(p/(1-p)) = α + βXi
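The logit and the logistic function are inverses of each other, which is why the transformation works; a minimal sketch:

```python
import math

def logit(p):
    """logit(p) = ln(p / (1 - p))"""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: p = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

# Round trip: the two functions undo each other.
p = 0.8
print(logistic(logit(p)))  # back to 0.8 (up to floating-point error)
```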
Odds ratio is useful for comparing two groups, to see if there is any significant difference between them.
What significance level to attach to the odds ratio?
95% is the standard setting on MINITAB©, and on most other statistical packages, and is widely regarded as an acceptable confidence level.
If the 95% confidence interval contains 1, we cannot say that there is a statistically significant difference between the default group and the group we are comparing against it.
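As a hedged sketch of that check, using a hypothetical 2x2 table of counts and the standard Wald interval for the log odds ratio (software packages compute this for you):

```python
import math

# Hypothetical counts (event / no event in each group):
a, b = 30, 70   # group l
c, d = 20, 80   # group m

or_hat = (a * d) / (b * c)                    # sample odds ratio
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)     # SE of log(odds ratio)
lo = math.exp(math.log(or_hat) - 1.96 * se_log)
hi = math.exp(math.log(or_hat) + 1.96 * se_log)

# If the interval (lo, hi) contains 1, the difference between the
# groups is not statistically significant at the 95% level.
print(or_hat, lo, hi)
```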
How to distinguish between categories?
eg: response variable - Yes or No [1=Yes 0=No], male/female, age - four categories labelled 1 - 4
Default is male in age category 1 (the probability that a male in age category 1 says Yes is 1/(1 + e^(-α)), where α is the constant coefficient given at the top of the logistic regression readout in MINITAB©).
So if we want to determine the probability of a female in age category 2 saying Yes, then we also include the coefficient of female and that of age category 2 (as β's in the above equation).
The Xi's are treated as indicator variables - 1 if the subject is a member of the group, and 0 otherwise.
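Putting the pieces together, with hypothetical coefficient values (the real ones come from the MINITAB© readout):

```python
import math

# Hypothetical fitted coefficients - placeholders, not real output:
alpha = -1.2          # constant
beta_female = 0.5     # coefficient for the female indicator
beta_age2 = 0.3       # coefficient for the age-category-2 indicator

def prob_yes(female, age2):
    """P(Yes) for a subject, using 0/1 indicator variables."""
    z = alpha + beta_female * female + beta_age2 * age2
    return 1.0 / (1.0 + math.exp(-z))

print(prob_yes(0, 0))  # default: male in age category 1
print(prob_yes(1, 1))  # female in age category 2
```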
We can look at the interactions between variables in the same way, by including them in the "model" section of the binary logistic regression interface in (variable_A)*(variable_B) form.
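Behind the scenes, an interaction term is just the product of the two indicator columns, one value per subject - a minimal sketch:

```python
# Hypothetical indicator columns for four subjects:
female = [1, 0, 1, 0]
age2   = [1, 1, 0, 0]

# The interaction column (variable_A)*(variable_B):
female_x_age2 = [f * a for f, a in zip(female, age2)]
print(female_x_age2)  # [1, 0, 0, 0]
```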
Diagnostic tools vary with different software packages, so I'm not going to delve into that area just yet!
For further information on logistic regression, a recommended text is David Hosmer and Stanley Lemeshow's Applied Logistic Regression.