Not That Model Selection

5 minute read


Model selection? Model selection! Maybe not the 'model selection' in your mind though. Actually, this blog is meant to be a memo on another masterclass us STOR-i students had today after the previous Gaussian Processes masterclass that I also blogged about. This masterclass was given by Prof. Gerda Claeskens from KU Leuven, who introduced us to the main criteria used in model selection.



In regression model selection, $latex R^{2} $ and Residual Sum of Squares (RSS) are commonly used to describe how well a model fits the data. No surprise, the full model with all predictors in it always yield the highest $latex R^{2} $ and lowest RSS. The reason for this is that both $latex R^{2} $ and RSS are measures of in-sample error, which means they only measure how well the model fits the training data. In practice, models that give wonderful predictions on training data can preform very poorly on the test data. What we really care about, or at least equally care about, is how well the model fits our test data. Generally, there are two ways to answer this question. Either we directly estimate the test error by using cross validation, where subsets of the data are repeatedly left out; or we indirectly estimate the test error by making adjustment to the training error to account for overfitting. The latter approach was the main focus of today's masterclass, and will be the focus for the rest of this blog.

There are three most commonly used criteria to assess model performance on training data: Mallow's $latex C_{p} $, Akaike's Information Criterion (AIC), and Bayesian Information Criterion (BIC).  Let's see how they guide us through the process of selecting models with different number of variables.

  • Mallow's $latex C_{p} $

It is a criterion especially used for models fitted with least squares. Mallow's $latex C_{p} $ is defined as follows:

$latex C_{p}=\frac{SSE_{p}}{S^{2}}+2p-n $

where SSE is the model sum of squares in the p-predictor case, $latex S^{2} $ is the mean square error (MSE) in the full model, and n is the sample size. As we can see from the above formulation, $latex C_{p} $ penalises the model for having too many predictors. On the other hand, it takes on small values when the test error is small, which can be easily seen in the following formulation:

$latex C_{p}=\frac{1}{n}\left ( RSS+2d \sigma^{2}  \right ) $

where $latex \sigma^{2} $ is an estimate of the variance of the error associated with the response variable. $latex 2d \sigma^{2} $ is the penalty term adding on top of the training RSS.

  • Akaike's Information Criterion (AIC)

AIC is also applied to models fitted by least squares with Gaussian errors. The calculation of AIC is given as follows (with an additive constant omitted):

$latex AIC=\frac{1}{n \sigma^{2}}(RSS+2d\sigma^{2}) $

which can also be formed as the form of a penalised log likelihood as follows:

$latex AIC=-2logL(M)+2p $

Comparing to Mallow's $latex C_{p} $, we see AIC has an additional $latex \sigma^{2} $ term sits at the denominator, but these two are proportional to each other. The second formulation shows us the balance AIC aims to strike between model fit (represented by log likelihood) and model complexity (represented by the number of predictors used).

  • Bayesian Information Criterion (BIC)

BIC is very much like AIC, but penalises model complexity much more harshly than AIC. As is shown in the following formulation:

$latex BIC=\frac{1}{n \sigma^{2}}(RSS+log(n)d\sigma^{2}) $

Comparing AIC and BIC, we see the parameter for the penalty term in AIC is 2, whereas log(n) in BIC. Using any reasonable sample size larger than 7 ($latex log(7)\approx 2 $) results in a harsher penalty in BIC than in AIC. Thus, by using BIC criterion, we usually end up with a much smaller set of predictors than using AIC.

So far, we've seen all these three criteria penalise model complexity. The difference between them lies in the way they penalise model complexity. Following is a plot comparing the model selection results given by different criteria based on the same training data.


FYI, following is the code for generating the above four plots in R, in case you want to give it a go :D

# read data in R

# remove missing value
Hitters = na.omit(Hitters)

# apply the method
regfit.full = regsubsets(Salary~., data = Hitters, nvmax = 19) # fit up to a 19-variable (full) model
reg.summary = summary(regfit.full)

# plot the outputs

plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l") # plot 1

plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R-square", type = "l") # plot 2
wm = which.max(reg.summary$adjr2) # identify the model with the largest adjusted R square
points(wm, reg.summary$adjr2[wm], col="red", cex=2, pch=20)

plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l") # plot 3
wmin1 = which.min(reg.summary$cp)
points(wmin1, reg.summary$cp[wmin1], col="red", cex=2, pch=20)

plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l") # plot 4
wmin2 = which.min(reg.summary$bic)
points(wmin2, reg.summary$bic[wmin2], col="red", cex=2, pch=20)