tags: R, statistics, model selection, GLM, AIC, deviance
So, you did some GLMs & compared them with AIC. Congrats!
Here’s what you need to report in a paper about the model comparison:
- residual deviance
- residual df
- delta AIC
- AIC weight
You should also report the null deviance and degrees of freedom, maybe in a table caption.
Thanks to Emilio Bruna for prompting this post and suggesting its title. (Update 2015-12-14: thanks also to Ben Bolker for pointing out some issues in the first version of this post.) Below I’ll explain why you should include at least these numbers, do a worked example, and mention some situations where it’s better to use something other than AIC.
What to report
For model selection, a model’s AIC is only meaningful relative to that of other models, so Akaike and others recommend reporting differences in AIC from the best model, \(\Delta\)AIC, and AIC weight. The latter can be viewed as an estimate of the proportion of the time a model will give the best predictions on new data (conditional on the models considered and assuming the same process generates the data); this heuristic view appears justified by simulations (e.g., Burnham and Anderson 2002, p. 348). Also, differences in AIC of less than 0.1 and differences in AIC weight below 0.01 are not really meaningful, so round the reported numbers appropriately.
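For concreteness, writing \(\mathrm{AIC}_{\min}\) for the smallest AIC among the candidate models, the quantities above are

\[
\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}, \qquad
w_i = \frac{\exp(-\Delta_i / 2)}{\sum_j \exp(-\Delta_j / 2)},
\]

which are exactly the dAIC and weight columns computed in the tables below.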
Finally, even if you are using an information criterion, include the residual deviance and degrees of freedom for each model: together they provide a (rough) measure of goodness of fit.
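As an aside, for grouped binomial data with reasonably large counts, the residual deviance gives a rough lack-of-fit check when compared to a \(\chi^2\) distribution on the residual degrees of freedom. A minimal sketch, where fit stands in for any fitted glm object (the approximation is poor for sparse or ungrouped data):

# rough lack-of-fit check: a small p-value suggests the model fits the grouped data poorly
# (fit is a placeholder for any of the glm fits below)
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)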
Example: UC Berkeley admissions and gender
Let’s look at the built-in UCBAdmissions data. As R will tell you, these data are often used to illustrate Simpson’s paradox (see ?UCBAdmissions and Bickel et al. 1975 or my PPS below).
d <- as.data.frame(UCBAdmissions)
d <- tidyr::spread(d, Admit, Freq) # use Hadley's excellent tidyr to reshape
d[order(d$Dept), ]
## Gender Dept Admitted Rejected
## 1 Male A 512 313
## 7 Female A 89 19
## 2 Male B 353 207
## 8 Female B 17 8
## 3 Male C 120 205
## 9 Female C 202 391
## 4 Male D 138 279
## 10 Female D 131 244
## 5 Male E 53 138
## 11 Female E 94 299
## 6 Male F 22 351
## 12 Female F 24 317
Using logistic regression, we fit three models for the effect of applicant gender, department identity, or both on admission.
m1 <- glm(cbind(Admitted, Rejected) ~ Gender, data = d, family = 'binomial')
m2 <- glm(cbind(Admitted, Rejected) ~ Dept, data = d, family = 'binomial')
m3 <- glm(cbind(Admitted, Rejected) ~ Dept + Gender, data = d, family = 'binomial')
model.names <- c("1 Gender", "2 Dept", "3 Gender + Dept")
(Note: although we might like to allow for an interaction between gender and department, the data here are insufficient to fit such a model.)
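One way to see why (the name m.int is just for illustration): with six departments and two genders, the interaction model has one parameter per row of data, so it would be saturated and leave no residual degrees of freedom.

m.int <- glm(cbind(Admitted, Rejected) ~ Dept * Gender, data = d, family = 'binomial')
df.residual(m.int) # 0: one parameter per observation, i.e., a saturated model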
Running summary on any one of the fits yields a bunch of stats: AIC, residual and null deviance, as well as coefficients, their standard errors, and significance.
How to do it in R
We could type the AIC and other stats in by hand. No fun! There are two other options. The first is to use David Robinson’s broom, which gives tidy summaries of model objects. The second is to use Ben Bolker’s bbmle, which provides methods for generating pretty tables of xIC values.
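Just to show what the by-hand route looks like for a single model (here m3), using base extractor functions:

AIC(m3)          # Akaike information criterion
deviance(m3)     # residual deviance
df.residual(m3)  # residual degrees of freedom
m3$null.deviance # null deviance
m3$df.null       # null degrees of freedom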
Using broom
summ.table <- do.call(rbind, lapply(list(m1, m2, m3), broom::glance))
If we take a look at summ.table, we’ll see it has all the ingredients we might like to report from model selection, whether via AIC, BIC, or just the deviance. These are, in order: null.deviance, df.null, logLik, AIC, BIC, deviance, and df.residual. Creating a table with our own desired columns in an appropriate order is easy.
table.cols <- c("df.residual", "deviance", "AIC")
reported.table <- summ.table[table.cols]
names(reported.table) <- c("Resid. Df", "Resid. Dev", "AIC")
reported.table[['dAIC']] <- with(reported.table, AIC - min(AIC))
reported.table[['weight']] <- with(reported.table, exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC)))
reported.table$AIC <- NULL
reported.table$weight <- round(reported.table$weight, 2)
reported.table$dAIC <- round(reported.table$dAIC, 1)
row.names(reported.table) <- model.names
## Warning: Setting row names on a tibble is deprecated.
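The null deviance and degrees of freedom quoted in the caption below can be pulled from the same broom::glance results, for example:

summ.table$null.deviance[1] # 877.056; identical for all three models
summ.table$df.null[1]       # 11, i.e., 12 observations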
With my advice above in mind, here’s a minimal table for reporting model selection on GLMs using fitting results extracted with broom::glance:
Caption: Model selection for the effect of gender (model 1), department (model 2), and both gender and department (model 3) on admission probability, fit to 12 observations (11 null degrees of freedom) with null deviance 877.056.
reported.table
## # A tibble: 3 x 4
## `Resid. Df` `Resid. Dev` dAIC weight
## * <int> <dbl> <dbl> <dbl>
## 1 10 784. 754. 0
## 2 6 21.7 0 0.56
## 3 5 20.2 0.5 0.44
Using bbmle
reported.table2 <- bbmle::AICtab(m1, m2, m3, weights = TRUE, sort = FALSE, mnames = model.names)
reported.table2[["Resid. Dev"]] <- summ.table[["deviance"]] # get the deviance from broom'd table
Caption: Model selection for the effect of gender (model 1), department (model 2), and both gender and department (model 3) on admission probability, fit to 12 observations (11 null degrees of freedom) with null deviance 877.056.
reported.table2
## dAIC df weight Resid. Dev
## 1 Gender 753.9 2 <0.001 783.6
## 2 Dept 0.0 6 0.56 21.7
## 3 Gender + Dept 0.5 7 0.44 20.2
Of course, model selection is just the beginning of reporting your results. See the PPS below for some thoughts on reporting results of the best model(s).
And, before you even did model selection, you should have asked yourself…
Is AIC the right criterion?
- For small data and frequentist inference, you should use AICc – the small-sample correction, which penalizes each parameter more heavily but approaches AIC as \(n\) becomes large. If it makes a difference, you should use it. (I probably should have used it above; a minimal sketch appears after this list.)
- For large data and frequentist inference, consider BIC, which is asymptotically consistent while AIC is not (see Hastie et al. 2009, which is available online). With large \(n\), AIC typically favors more complex models than BIC does. Note, however, that this matters only for inferring a true model (if one exists; McElreath 2015, Sec. 6.4), not for prediction.
- For Bayesian models, consider WAIC or LOO instead of DIC, which has issues with non-Gaussian posteriors (McElreath 2015).
- Don’t use information criteria for model selection between GLMs with different link functions.
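On the first point, here’s a minimal sketch of the small-sample correction applied to the models above. The helper aicc is just illustrative, and the choice of \(n = 12\) (the number of grouped observations) is itself a debatable assumption for aggregated binomial data:

# AICc = AIC + 2k(k + 1) / (n - k - 1), with k the number of estimated parameters
aicc <- function(fit, n) {
  k <- attr(logLik(fit), "df")
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}
sapply(list(m1, m2, m3), aicc, n = 12)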
Further Reading & References
- Ben Bolker on reporting results in the text for GLMMs; this is a piece of a larger chapter, “GLMMs: worked examples,” that approaches inference as model checking and improvement (as opposed to multi-model inference sensu Burnham and Anderson 2004)
- Burnham, and Anderson. 2004. Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research 33(2):261-304
- Burnham, and Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Second Edition. Springer.
- Elliot and Brook 2007 for a short piece on the philosophical background of multiple working hypotheses
- Hastie, Tibshirani, and Friedman. 2009. Elements of Statistical Learning, Second Edition. Springer, New York, NY, USA.
- McElreath. 2015. Statistical Rethinking. CRC Press.
- Brian O’Meara’s AIC tutorial
PS: Nested models
Reporting the residual deviance and degrees of freedom as above is relatively similar to R’s output for conducting an ANOVA on a GLM (where you can optionally add a statistical test). For nested models, you may as well just do this and report the table:
m3.anova <- anova(m3, test="Chisq")
round(m3.anova, digits = 4)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(Admitted, Rejected)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 11 877
## Dept 5 855 6 22 <2e-16 ***
## Gender 1 2 5 20 0.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
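To compare two particular nested fits directly, rather than adding terms sequentially, anova also accepts several models at once; for example, comparing models 2 and 3 gives the likelihood-ratio test for the Gender term given Dept:

anova(m2, m3, test = "Chisq") # LRT for adding Gender to the department-only model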
PPS: Evaluating the best model(s)
For the best model(s), you should then go on to visualize the fit relative to the data (and report model results in the text).
To visualize the admissions data and mean fits from models 2 and 3 (which have approximately equal AIC weight), we can use ggplot2 and augment from broom (which adds model predictions and statistics to the data).
Providing model-averaged predictions is a bit more work:
m3.pred <- broom::augment(m3)
m2.pred <- broom::augment(m2)
m2.weight <- reported.table[2, "weight"]
m3.weight <- reported.table[3, "weight"]
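# dropping the leading response/predictor columns (2 in m2.pred, 3 in m3.pred) leaves
# only the prediction columns (.fitted, .se.fit, etc.) to be averaged by AIC weight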
mavg.pred <- m2.weight$weight * m2.pred[ , -(1:2)] + m3.weight$weight * m3.pred[ , -(1:3)]
mavg.pred <- cbind(m3.pred[1:3], mavg.pred)
library(ggplot2)
ggplot(d) +
geom_point(aes(Dept, Admitted / (Admitted + Rejected), color=Gender,
size=Admitted + Rejected),
position = position_dodge(width = 0.5)) +
geom_pointrange(aes(Dept, plogis(.fitted),
ymin = plogis(.fitted - 2 * .se.fit),
ymax = plogis(.fitted + 2 * .se.fit),
shape=Gender),
position=position_dodge(width = 0.5), data =mavg.pred, alpha=0.4) +
theme_minimal() + scale_color_manual(values=c("blue", "orange"))
Caption: Admissions (colored dots, size indicates total applicants) versus department from the UCB Admissions data (included in R), and averaged predictions (means \(\pm\) 2 SE) from model 2 (department only) and model 3 (department and gender), with averaging by AIC weight.
Simpson’s paradox
Comparing model 3 with model 1 illustrates Simpson’s paradox. Without accounting for department, the apparent effect of female gender on admission is negative (odds for females relative to males 0.543, \(p \approx 0\)), whereas after accounting for department, the (within-department) effect is positive but weak (odds for females relative to males 1.105, \(p \approx 0.22\)).
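In case it’s useful, a minimal sketch of where those numbers come from, assuming the default treatment contrasts with Male as the reference level of Gender (as in the built-in data):

exp(coef(m1)["GenderFemale"])               # ~0.54: pooled odds ratio, female vs. male
exp(coef(m3)["GenderFemale"])               # ~1.1: within-department odds ratio
summary(m3)$coefficients["GenderFemale", 4] # ~0.22: p-value for the within-department effect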