What on earth is Statistical Learning?


Everyone is talking about statistical learning or machine learning as if they were the sexiest terms on earth. Does it literally have something to do with statistics? Or with machines? That depends on the area you work in and on the people you are talking to. According to the omniscient Wikipedia, statistical learning deals with the problem of finding a predictive function based on data. As I understand it, statistical learning is no more than a tool that helps people understand their data better and, as a result, make better predictions. Statistical learning is usually divided into two categories: supervised and unsupervised.

Supervised learning is when we have both input and output variables, whereas in unsupervised learning there are only input variables, with no output variable to supervise the learning. It is natural to assume that what we want to predict is usually influenced by multiple factors. In the supervised setting, these factors are the input variables, and the thing we want to predict is the output variable. Take sales data as an example: a company might be interested in understanding the behaviour of its sales, which might be related to (or affected by) factors such as spending on different advertising channels.

Let me use the sales example again to illustrate the unsupervised setting. Say we group people by advertising channel; we might then want to know in what ways these people are similar to each other according to their observed characteristics. This is a typical clustering problem, and we do not intend to predict anything.
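To make the clustering idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the customer features are synthetic (two made-up groups, not real marketing data), k-means is just one of many clustering methods, and the `kmeans` helper with its farthest-point initialisation is a hypothetical name and choice of mine, not something from the original analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical customer features (say, two standardised characteristics).
# Two synthetic groups, with no output variable attached to them.
group_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(50, 2))
customers = np.vstack([group_a, group_b])


def kmeans(points, k=2, n_iter=20):
    """Plain k-means (Lloyd's algorithm) with farthest-point initialisation."""
    # Start from the first observation, then repeatedly add the observation
    # that is farthest from the centres chosen so far.
    centres = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centres], axis=0)
        centres.append(points[np.argmax(d)])
    centres = np.array(centres)

    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centre.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centre moves to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
    return labels, centres


labels, centres = kmeans(customers)
print("cluster sizes:", np.bincount(labels, minlength=2))
print("cluster centres:\n", np.round(centres, 2))
```

Note that no output variable appears anywhere in the sketch: the algorithm only looks at how close the observations are to each other.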

As a big fan of real-world data, let me carry on with the marketing data example in the supervised setting. Say the conversion rate that we want to model (predict) is our output variable $latex Y$. We want our estimate $latex \widehat{Y}$ to be as accurate as possible. In other words, we want to find a function $latex \widehat{f}$ such that $latex Y\approx \widehat{f}(X)$ for any observation $latex (X,Y)$. In general there are two families of methods for this task: parametric and nonparametric.

  • Using a parametric method, we assume a form for f in advance and only estimate its coefficients, so we would end up with a model of the following form

$latex Sales \approx  \beta_{0}+\beta_{1} \times TV + \beta_{2} \times radio + \beta_{3} \times newspaper $

  • Using a nonparametric method, there is no need to impose any pre-specified form on f, but we usually need a large number of observations to get an accurate estimate.
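The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the data are synthetic stand-ins generated from a made-up linear rule (not the real Advertising.csv), the coefficient values in `true_beta` are invented, and k-nearest-neighbours averaging is used as a simple nonparametric method in place of the smoothing spline additive regression discussed in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the advertising data (NOT the real Advertising.csv):
# three budgets and a sales figure from a made-up linear rule plus noise.
n = 200
X = rng.uniform(0, 100, size=(n, 3))          # columns: TV, radio, newspaper
true_beta = np.array([3.0, 0.05, 0.2, 0.0])   # intercept, TV, radio, newspaper
sales = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 1.0, size=n)

# Parametric: assume the linear form and estimate its four coefficients
# by ordinary least squares.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, sales, rcond=None)


def predict_linear(x_new):
    x_new = np.atleast_2d(x_new)
    return np.column_stack([np.ones(len(x_new)), x_new]) @ beta_hat


# Nonparametric: no assumed form for f; predict by averaging the sales of
# the k nearest training observations (k-nearest-neighbours regression).
def predict_knn(x_new, k=5):
    x_new = np.atleast_2d(x_new)
    dists = np.linalg.norm(X[None, :, :] - x_new[:, None, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return sales[nearest].mean(axis=1)


point = np.array([50.0, 20.0, 10.0])          # hypothetical budgets
print("estimated coefficients:", np.round(beta_hat, 3))
print("linear prediction:", predict_linear(point))
print("k-NN prediction:  ", predict_knn(point))
```

The parametric fit is summarised by four numbers that can be read and interpreted directly, while the nonparametric predictor has to keep the whole training set around and relies on having enough observations near the point of interest.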

Below is the sales data set that I’ve been going on about for the whole post. It is a bit messy, with a slightly increasing trend, but it’s not hard to see that fitting the data with a straight line would be a disaster. (You can find the Advertising.csv data set I’m using here)

[Figure: rplot1]

I fitted both a multiple linear regression (parametric) and a smoothing spline additive regression (nonparametric) to the data set. The model diagnostics are shown below:

[Figure: lm]

From the model diagnostic plots above, we can see that the linear model doesn't fit the data well: the residuals are larger at both ends of the range of fitted values. The leverage plot also shows a few points with high leverage. So how can we do better than this? Use a more flexible model!

[Figure: smooth2]

The four plots above show the model diagnostics for the smoothing spline additive regression. The response and the fitted values agree with each other almost everywhere, and most residuals are close to 0. This is so much better, isn't it?

But wait! There is such a thing as the bias-variance trade-off. Obviously the model fits the data much better in the second case, but what if we use another sales data set instead? Ideally we don't want our fitted model to change much; however, the small bias of such flexible methods usually comes with a large variance. As a rule of thumb, people compare models by their mean squared error (MSE) on test data, that is, data that was not used to fit the model.
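A small Python experiment shows why the MSE should be checked on test data rather than on the data used for fitting. It is a sketch under assumptions, not the analysis from this post: the data come from a made-up curve (`true_f`) with added noise, and polynomial degree stands in for model flexibility instead of the smoothing spline used above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)


def true_f(x):
    # Made-up "true" relationship, unknown to the models being fitted.
    return np.sin(2 * x)


def sample(n):
    x = rng.uniform(-2, 2, size=n)
    y = true_f(x) + rng.normal(0, 0.3, size=n)
    return x, y


def mse(y, y_hat):
    # Mean squared error: average squared gap between response and prediction.
    return float(np.mean((y - y_hat) ** 2))


x_train, y_train = sample(30)     # small training set
x_test, y_test = sample(1000)     # fresh data the models have never seen

results = {}
for degree in (1, 3, 12):         # increasing flexibility
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

The training MSE can only go down as the degree grows, because each polynomial contains the lower-degree ones as special cases. The test MSE carries no such guarantee, which is exactly why it is the number to look at when deciding how flexible a model should be.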

All in all, the trade-off between bias and variance goes hand in hand with a trade-off between prediction accuracy and interpretability. So much for now; I will talk more about these appealing statistical learning methods, with their different levels of flexibility, in future posts.

 [1] G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning (with Applications in R), (2014) 15-42.