Falling in Love with Gaussian Processes

Today we STOR-i students had our first masterclass of the year, on Gaussian Processes, given by a great speaker, Neil Lawrence, who specialises in Machine Learning. Gaussian process models are extremely flexible in that they allow us to place probability distributions over functions.

The full story of GPs starts with the Gaussian. A 1D Gaussian is just boring, mainly because we've all looked at it millions of times. Let's start with multivariate Gaussians then. No surprise: stack a bunch of independent Gaussians into a vector and we get this: $latex P(y)=\left | 2\pi D \right |^{-\frac{1}{2}}\exp\left \{ -\frac{1}{2}(y-\mu)^{T}D^{-1}(y-\mu) \right \}$. One step further, we can also get correlated Gaussians by rotating the original data space with an orthogonal matrix R, which gives: $latex P(y)=\left | 2\pi D \right |^{-\frac{1}{2}}\exp\left \{ -\frac{1}{2}(R^{T}y-R^{T}\mu)^{T}D^{-1}(R^{T}y-R^{T}\mu) \right \} $. Now we have a covariance structure of the form $latex C^{-1}=RD^{-1}R^{T} $, or equivalently $latex C=RDR^{T} $.
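
As a quick sanity check, here is a minimal sketch (with a made-up rotation angle and made-up variances) that builds a correlated covariance $latex C=RDR^{T} $ from a rotation and a diagonal matrix, then verifies it by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: a 2D rotation R and independent variances on the diagonal of D
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D = np.diag([3.0, 0.5])

C = R @ D @ R.T                     # covariance of the rotated (correlated) Gaussian
print("C =\n", C)

# Sample independent Gaussians with variances from D, then rotate them into y-space
z = np.sqrt(np.diag(D))[:, None] * rng.standard_normal((2, 10000))
y = R @ z
print("empirical covariance ~=\n", np.cov(y))
```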

Remember, this covariance structure, also known as the kernel, is very important. We will look at Gaussian processes with particular covariance structures, because different choices often lead to very different behaviours. Often we want to make predictions based on what we already know, and this requires a conditional distribution, which is also Gaussian, as you'd expect: $latex P(f_{*}|f)=N(f_{*}|\mu,\Sigma ) $.
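
For example, here is a minimal sketch (with a made-up mean and covariance) of conditioning a bivariate Gaussian on one observed component, using the standard Gaussian conditioning formulas:

```python
import numpy as np

# Made-up joint Gaussian over (f, f_*)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

f_observed = 1.5   # suppose we observe f = 1.5

# Standard Gaussian conditioning:
#   mean_* = mu_2 + Sigma_21 Sigma_11^{-1} (f - mu_1)
#   var_*  = Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
mean_star = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (f_observed - mu[0])
var_star = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

print(f"P(f_* | f = {f_observed}) = N({mean_star:.3f}, {var_star:.3f})")
```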

I generated 20 sample paths in a Jupyter notebook from Gaussian processes with the Brownian and RBF covariance functions respectively. Look at the following plots, aren't they just pretty? (You can find the code I used on my GitHub here, which I shamelessly forked from elsewhere.) Okay, falling in love at first sight it is! Wait wait wait, it's more than just a pretty look. It took me countless hours (by 'countless' I mean the whole day and night of our masterclass...) and I'm still not sure I understand everything that lies behind them. So keep reading; hopefully this post will encourage you to explore more about GPs.

[Figure: 20 sample paths from a GP with the Brownian covariance function]

This plot uses the Brownian covariance function: $latex k(x,x^{'}) = \min(x, x^{'}) $

[Figure: 20 sample paths from a GP with the RBF covariance function]

The exponentiated quadratic kernel, a.k.a. the RBF covariance function, is built on the Euclidean distance:

$latex k(x,x^{'})=\alpha \exp\left \{ -\frac{\left \| x-x^{'} \right \|^{2}_{2}}{2l^{2}} \right \}$
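
If you'd like to reproduce something like the plots above without hunting down my notebook, here is a minimal sketch (not the exact forked code; the lengthscale, variance and jitter values are just illustrative) that samples 20 paths from each covariance function:

```python
import numpy as np
import matplotlib.pyplot as plt

def brownian_kernel(x, x2):
    # k(x, x') = min(x, x')
    return np.minimum(x[:, None], x2[None, :])

def rbf_kernel(x, x2, alpha=1.0, lengthscale=0.2):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 l^2))
    return alpha * np.exp(-(x[:, None] - x2[None, :]) ** 2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(1)
x = np.linspace(0.01, 1, 200)          # start above 0 so the Brownian Gram matrix is positive definite

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, K) in zip(axes, [("Brownian", brownian_kernel(x, x)),
                                ("RBF", rbf_kernel(x, x))]):
    K = K + 1e-8 * np.eye(len(x))      # jitter for numerical stability
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=20)
    ax.plot(x, samples.T, lw=0.8)
    ax.set_title(f"20 GP sample paths, {name} covariance")
plt.show()
```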

To put things together, let me introduce another idea here called basis functions. Basis functions are usually used to map data into a "feature space", and we can then build functions by summing a lot of these nonlinear functions together. For example, a radial basis function has the following form:

$latex \phi_{k}(x_{i})=\exp\left ( -\frac{\left | x_{i}-\mu_{k} \right |^{2}}{2l^{2}} \right )$

We can thus represent a function by a linear sum of the above basis functions as the following:

$latex f(x_{i,:},w)=\sum_{k=1}^{m}w_{k}\phi_{k}(x_{i,:})$

where the elements of w are independently sampled from Gaussian densities. Our main interest here is the covariance function, or kernel, because the inner product $latex \phi(x)^{T} \Sigma \phi(x^{'})$ in the feature space can usually be replaced by $latex k(x,x^{'}) $. Also, the covariance function we choose for our model represents our assumptions about the functions we wish to learn or predict.
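
Here is a minimal sketch of that weight-space view (the basis-function centres, lengthscale and weight variance are made up): sample Gaussian weights w, build f by multiplying the matrix of basis functions by w, and check that the empirical covariance of f matches the kernel matrix implied by the basis functions:

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.linspace(-3, 3, 100)
centres = np.linspace(-3, 3, 12)       # made-up basis-function centres mu_k
lengthscale, alpha = 0.7, 1.0

# Radial basis functions: phi_k(x_i) = exp(-|x_i - mu_k|^2 / (2 l^2))
Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * lengthscale ** 2))

# f = Phi w with each w_k ~ N(0, alpha) independently
W = rng.normal(0.0, np.sqrt(alpha), size=(len(centres), 5000))
F = Phi @ W                            # each column is one sampled function

implied_cov = alpha * Phi @ Phi.T      # covariance implied by the basis-function view
empirical_cov = np.cov(F)              # empirical covariance across the sampled functions

print(np.max(np.abs(implied_cov - empirical_cov)))   # small, and shrinks with more samples
```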

Now, with all the important ingredients at hand, we may wish to do something with GPs, be it regression or classification, both of which can be viewed as function approximation problems. In the regression case, we can do Gaussian Process Regression (GPR), which can be obtained by generalising Bayesian linear regression over basis functions. In the classification case, the posterior is no longer Gaussian, so approximations such as the Laplace approximation are applied. The link between these two types of problems in a simple two-class case is quite straightforward: one can view the classification problem as turning the output of a regression problem into a class probability. The idea is to squash everything together, so that the domain $latex (-\infty ,+ \infty) $ is mapped into the range [0,1].
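
Sticking with the regression case for a moment, here is a minimal sketch of the standard GP-regression predictive equations with the RBF kernel (the training data, noise level and lengthscale are made up for illustration):

```python
import numpy as np

def rbf_kernel(x, x2, alpha=1.0, lengthscale=0.5):
    return alpha * np.exp(-(x[:, None] - x2[None, :]) ** 2 / (2 * lengthscale ** 2))

# Made-up training data: noisy observations of sin(x)
rng = np.random.default_rng(3)
x_train = rng.uniform(-3, 3, 8)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(8)
x_test = np.linspace(-4, 4, 200)

noise = 0.1 ** 2
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)      # cross-covariance between training and test points
K_ss = rbf_kernel(x_test, x_test)

# Standard GP posterior predictive:
#   mean = K_s^T K^{-1} y,  cov = K_ss - K_s^T K^{-1} K_s
post_mean = K_s.T @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.diag(post_cov))

print(post_mean[:5], post_std[:5])
```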

A toy example on the classification side is linear logistic regression:

$latex P(C_{1}|x) = \lambda (x^{T}w) $ where $latex \lambda (z) = \frac{1}{1+\exp(-z)} $

which combines the linear model with the logistic function. Of course, there is far more to a comprehensive treatment of such classification problems, such as discriminative Gaussian process classifiers, which I'll probably tap into in my future posts.
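
To illustrate the squashing idea (a minimal sketch, not a full GP classifier), we can push a function sampled from a GP through the logistic function so that its values become class probabilities in [0, 1]:

```python
import numpy as np

def logistic(z):
    # lambda(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def rbf_kernel(x, x2, alpha=1.0, lengthscale=0.5):
    return alpha * np.exp(-(x[:, None] - x2[None, :]) ** 2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))

f = rng.multivariate_normal(np.zeros(len(x)), K)   # a latent function drawn from the GP
p = logistic(f)                                    # squashed into [0, 1]: P(C_1 | x)

print(f.min(), f.max())   # unbounded latent values
print(p.min(), p.max())   # class probabilities between 0 and 1
```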

Great GP Stuff: 

  1. Neil Lawrence: Introduction to Gaussian Processes. https://www.youtube.com/watch?v=ewJ3AxKclOg&t=490s

  2. Carl Edward Rasmussen and Christopher K. I. Williams: Gaussian Processes for Machine Learning.