New Light Shed on Regression



The best way to end my weekend is, well, bragging (no...blogging) about the new stuff I found over the weekend. Equipped with a basic understanding of what a Gaussian Process (GP) is from a previous masterclass, I decided to do some further reading in this fascinating area.

I'm not sure how well I explained the idea of a GP in my previous post, but I'm going to give it another shot, so as to pave the way for explaining how it can be used in both regression and classification settings. First, take a look at the following three plots.


One thing all three plots have in common is that they each contain three draws from a multivariate Gaussian distribution. The first plot shows three draws from a 2D Gaussian, the second shows three draws from a 6D Gaussian, and the third shows three draws from a 25D Gaussian. Now, one way to think about a GP is to view it as a multivariate Gaussian composed of infinitely many Gaussian random variables. Having said that, how can we then apply it to regression problems? Keep reading :)
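
If you want to reproduce the flavour of those plots yourself, here is a minimal NumPy sketch that draws three samples each from Gaussians of dimension 2, 6, and 25. (The identity covariance and the seed are my own simplifying assumptions, not taken from the plots above.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw three samples from a D-dimensional Gaussian, for each of the
# three dimensionalities shown in the plots.
for d in (2, 6, 25):
    mean = np.zeros(d)
    cov = np.eye(d)  # identity covariance, purely for simplicity
    draws = rng.multivariate_normal(mean, cov, size=3)
    print(d, draws.shape)  # each row of `draws` is one sample path
```

Plotting each row of `draws` against its index gives you a picture like the ones above; as the dimension grows, each draw starts to look more like a "function".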

In the classic regression setting, we map our independent variable $latex x $ to the dependent variable $latex y $ using linear regression, quadratic regression, cubic regression, etc. In Gaussian Process Regression (GPR), we still want to do more or less the same thing, but in a less intuitive way. Let's illustrate it in the simplest one-dimensional setting. Say we have training data $latex \left \{ (x_{1},y_{1}),...,(x_{n},y_{n}) \right \} $; then every $latex y $ can be thought of as related to an underlying function as follows:

$latex y=f(x)+\varepsilon, \quad \varepsilon \sim N(0,\sigma_{n}^{2}) $
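
As a concrete sketch of this model, we can generate synthetic training data; the underlying function $latex f(x)=\sin(x) $ and the noise level are hypothetical choices of mine for illustration, not anything implied by the setup above.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(x)  # hypothetical underlying function

sigma_n = 0.1  # assumed noise standard deviation
x_train = np.linspace(0, 2 * np.pi, 10)
# y = f(x) + Gaussian noise, exactly as in the model above
y_train = f(x_train) + rng.normal(0.0, sigma_n, size=x_train.shape)
```

The job of GPR is then to recover something close to $latex f $ from these noisy pairs.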

It looks quite similar to the regression setting we are familiar with, except that now we put our focus on searching for $latex f(x) $. Since we view it as a multivariate Gaussian, all we need to know is the mean function $latex m(x) $ and the covariance function $latex k(x,x') $. Usually, after a simple transformation, the mean function is assumed to be zero everywhere. Thus all our focus goes to the covariance function, which links the observations together. Using the popular "squared exponential" choice, as mentioned in my first GP post, we end up with the following covariance function:

   $latex k(x,x')=\sigma_{f}^{2}\exp\left \{ \frac{-(x-x')^{2}}{2l^{2}} \right \}+\sigma_{n}^{2}\delta(x,x') $

where $latex \sigma_{f} $ is the maximum covariance allowed, and the lengthscale $latex l $ determines how much influence one observation has on another based on their distance. (On a side note, I originally thought distant observations almost always have less influence than nearer ones, but I've since been shown that this is not necessarily true in some forecasting situations that have a periodic element.)
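
In code, this covariance function is only a couple of lines; the parameter values below are placeholders of my own, not recommendations.

```python
import numpy as np

def k(x1, x2, sigma_f=1.0, l=1.0, sigma_n=0.1):
    """Squared-exponential covariance between two scalar inputs,
    with the noise term added only when the inputs coincide
    (the Kronecker delta in the formula above)."""
    se = sigma_f**2 * np.exp(-(x1 - x2)**2 / (2 * l**2))
    return se + (sigma_n**2 if x1 == x2 else 0.0)

# The covariance matrix K over a set of training inputs
x = np.array([0.0, 0.5, 1.0])
K = np.array([[k(xi, xj) for xj in x] for xi in x])
```

Note that $latex K $ comes out symmetric, with $latex \sigma_{f}^{2}+\sigma_{n}^{2} $ on the diagonal, which is exactly what the formula predicts.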

With this covariance function at hand, our next job is to find the best set of parameters $latex \theta= \left \{ l,\sigma_{f},\sigma_{n} \right \} $. The maximum a posteriori (MAP) estimate of $latex \theta $ is the value at which $latex p(\theta|x,y) $ is greatest. Since in most cases we have little prior knowledge about what $latex \theta $ should be, the prior is effectively flat, so our goal reduces to maximizing

$latex \log p(y|x,\theta)=-\frac{1}{2}y^{T}K^{-1}y-\frac{1}{2}\log\left | K \right |-\frac{n}{2}\log(2\pi) $
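
Here is a minimal sketch of my own for evaluating this log marginal likelihood, using a Cholesky factorization so we never form the inverse of K explicitly; it assumes K is already built from the covariance function above.

```python
import numpy as np

def log_marginal_likelihood(K, y):
    """log p(y | x, theta) for a zero-mean GP with covariance matrix K."""
    n = len(y)
    L = np.linalg.cholesky(K)  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))  # log|K| via the Cholesky factor
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

# Sanity check: with K = I and y = 0, the first two terms vanish
# and the value reduces to -(n/2) * log(2*pi).
print(log_marginal_likelihood(np.eye(3), np.zeros(3)))
```

In practice you would hand the negative of this function to an optimizer and search over $latex \theta $; each candidate $latex \theta $ rebuilds $latex K $ and re-evaluates the expression above.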

Now I'm basically done giving you the big picture of GPR. Of course, if you have read my previous GP post, you know for sure that the covariance function is the key to any GP modelling, and you also know that the "squared exponential" covariance function is the most popular choice, but not your only one.