All Roads Lead to Rome - Image Recognition



Give you a picture and ask you to identify certain features in it: that's pretty easy. Give you two pictures and ask you to identify the features they have in common: that's fairly simple as well. But what if you were given a stream of photos and asked to identify certain features in each one of them? A bit intimidating? It's not that bad with the help of some smart methodology and technology. So far, I've become aware of at least two types of methods for approaching image recognition problems (forgive my ignorance if you know more). Wavelet transforms can be applied to capture information in images, and classification methods are also widely used for image pattern recognition. These two methods are my focus today: I will talk briefly about what they do, how they work, and the differences between them.

  • Wavelet Transforms

In image processing and recognition, wavelet transforms are usually applied in two dimensions, or three when time is included. In the two-dimensional setting, we can get the pixel data from a photo format such as JPEG. The idea is to divide the pixel information for one photo into many smaller blocks x=x[i] and apply a transform to each pixel block. The transformed signal y=y[i] is given by y[i]=A*x[i] for some $latex m \times m$ matrix A.
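To make the block idea concrete, here is a minimal NumPy sketch of cutting a grayscale image into m × m blocks and applying one matrix to every block. The helper names `split_into_blocks` and `transform_blocks`, the toy image, and the row-flipping matrix `A` are all illustrative assumptions of mine; any m × m transform matrix could take the place of `A`.

```python
import numpy as np

def split_into_blocks(image, m):
    """Cut a 2-D grayscale array into non-overlapping m x m blocks."""
    h, w = image.shape
    if h % m or w % m:
        raise ValueError("image sides must be multiples of the block size")
    # (h/m, m, w/m, m) -> (h/m, w/m, m, m) -> (num_blocks, m, m)
    return image.reshape(h // m, m, w // m, m).swapaxes(1, 2).reshape(-1, m, m)

def transform_blocks(blocks, A):
    """Apply y[i] = A @ x[i] to every block at once (matmul broadcasts over i)."""
    return A @ blocks

# Toy stand-in for real pixel data: a 16 x 24 image gives six 8 x 8 blocks.
image = np.arange(16 * 24, dtype=float).reshape(16, 24)
x = split_into_blocks(image, 8)

# Hypothetical transform matrix: it simply reverses the rows of each block.
A = np.eye(8)[::-1]
y = transform_blocks(x, A)
```

Blocks are numbered row by row, so `x[1]` is the block covering rows 0 to 7 and columns 8 to 15 of the image.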

The Discrete Fourier Transform (DFT) is commonly used to transform pixel data after they have been level shifted. Pixel values usually range from 0 to 255; level shifting simply means subtracting 128 from each value, so that the pixel data take values in [-128,128). The Discrete Fourier Transform is defined by the matrix

$latex A=\frac{1}{\sqrt{m}}\left(e^{i\frac{2\pi}{m}pq}\right)_{p,q}$

which is unitary. For image compression, however, the Discrete Cosine Transform (DCT) is the one that is commonly used. It is similar to the DFT but uses only real numbers. DCTs are generally related to the Fourier series coefficients of a periodically and symmetrically extended sequence, whereas DFTs are related to the Fourier series coefficients of a periodically extended sequence. Fasten your seat belt before you scroll down to see the horrible-looking formula for the DCT, which computes the (i, j)th entry of the DCT of an image such as the following one:


$latex D(i,j)=\frac{1}{\sqrt{2N}}C(i)C(j)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1}p(x,y)\cos\left[\frac{(2x+1)i\pi}{2N}\right]\cos\left[\frac{(2y+1)j\pi}{2N}\right]$

Take a closer look and it's not that horrible. p(x,y) is the (x, y)th element of the image represented by the matrix p, and N is the size of the pixel block on which the DCT is performed, as mentioned before. C(u) is a normalization factor, equal to $latex 1/\sqrt{2}$ when u=0 and 1 otherwise. JPEG uses the standard 8 by 8 block, so N equals 8 and x and y range from 0 to 7.
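If the formula still looks scary, a few lines of NumPy may help. This is only a sketch of the DCT exactly as written above, applied to one level-shifted block; the function name `dct2_block` and the flat test block are my own assumptions, and a real JPEG codec would use a faster algorithm.

```python
import numpy as np

def dct2_block(p):
    """Evaluate D(i, j) for every (i, j) of a square N x N block p."""
    p = np.asarray(p, dtype=float)
    N = p.shape[0]
    idx = np.arange(N)
    # cos_table[i, x] = cos((2x + 1) * i * pi / (2N))
    cos_table = np.cos((2 * idx[None, :] + 1) * idx[:, None] * np.pi / (2 * N))
    # C(u) = 1/sqrt(2) for u = 0 and 1 otherwise
    C = np.where(idx == 0, 1.0 / np.sqrt(2.0), 1.0)
    # The double sum over x and y is the matrix product cos_table @ p @ cos_table.T
    return np.outer(C, C) * (cos_table @ p @ cos_table.T) / np.sqrt(2 * N)

# A one-color 8 x 8 block with gray value 100, level shifted by 128.
flat = np.full((8, 8), 100.0) - 128.0
D_flat = dct2_block(flat)
# D_flat[0, 0] is 8 * (-28) = -224; every other entry is zero up to rounding.
```

For a one-color block only the top-left entry is non-zero, so the transform has nothing to report beyond the average gray level. For N = 8 the scaling in the formula also makes the transform orthonormal, so the sum of squared entries is the same before and after.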

Applications of wavelet transforms abound, in areas such as fingerprint compression and matching, or testing hair products by comparing the texture of two photos of hair. An image block with lots of changes in frequency, such as the one above, produces a very random-looking transformed matrix, which makes sense if you think about it: what kind of signal would you expect to detect in a one-color "image"?

  • Classification

In the machine learning world, people use classification methods to tackle image recognition challenges. Facial recognition is one of the most famous vision challenges, where the objective is to predict keypoint positions on face images (eyes, mouth, nose, etc.). It seems natural to use linear regression on all of the pixels as a starting point. Nothing seems wrong at first glance; after all, we can learn a lot from a large set of training data in which we already know the keypoint positions in each image. But the problem is usually that there is a plethora of features (pixels), which makes predictions prone to overfitting. Overfitting? Not an uncommon word in statistical learning. If you happen to have read my first blog post, where I wrote about Katie's conference talk on minimum density hyperplanes, you may recall that I mentioned PCA as a preliminary step in the presence of high-dimensional data. The same idea applies here in facial recognition: people devise all kinds of smart ways to conduct feature selection and dimension reduction.

If you want to get your hands dirty by messing around with some real data, here is a Kaggle challenge on facial keypoint detection, which I tried last year. I implemented PCA to reduce the "dimension of the image" on my test data set. Although I have absolutely no idea how to interpret the features that were filtered out or those that stayed, my intuition is that some features may affect the detection of gray scales, such as glasses or dark circles under the eyes. PCA can, to some extent, filter out such unimportant details. Following are several image patches where I kept different proportions of the principal components through PCA (sorry if you find the images terrifying...):


Basically, PCA treats this image patch as having 21*21 = 441 components, since it has 21*21 pixels. By simply keeping the first 120 components, we retain 99% of the cumulative proportion of the information in this patch, or in other words, of the “image energy”.
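Here is a rough sketch of how one might find the number of components needed to reach a given share of the "image energy". It is not the code I ran on the Kaggle data: the function name `components_for_energy` is hypothetical, and the patches are synthetic random data that only borrow the 21*21 patch size, so the resulting count will not match the 120 quoted above.

```python
import numpy as np

def components_for_energy(X, target=0.99):
    """Smallest number of principal components whose cumulative
    share of the total variance reaches `target`."""
    Xc = X - X.mean(axis=0)                      # center each pixel column
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    energy = np.cumsum(s**2) / np.sum(s**2)      # cumulative variance share
    k = int(np.searchsorted(energy, target)) + 1
    return min(k, len(energy)), energy

# Synthetic stand-in data: 500 flattened 21*21 patches driven by
# 30 hidden factors plus a little noise (an assumption for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 30))
mixing = rng.normal(size=(30, 21 * 21))
patches = latent @ mixing + 0.1 * rng.normal(size=(500, 21 * 21))

k, energy = components_for_energy(patches, target=0.99)
```

`energy[k - 1]` is the share of variance kept by the first `k` components, so plotting `energy` gives the usual cumulative proportion curve.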

Building on PCA, a more advanced method called Principal Component Locally Weighted Linear Regression (LWLR) takes similarities between images into account when making predictions. The idea of this method is fairly similar to PCA. Let 𝑋 denote the matrix that contains all training images as vectors in its rows. The most important information in 𝑋 can be extracted via the singular value decomposition:

$latex X=U\Sigma V^{T}$

We can then work with only the first few singular vectors, which contain a large proportion of the total information.
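The following is a minimal sketch of how locally weighted linear regression on the leading singular vectors could look. I am filling in details the description above leaves open, so treat them as assumptions: the Gaussian weighting of training images by distance to the query, the median-distance bandwidth, the intercept term, the function name `pca_lwlr_predict`, and the synthetic data are all mine.

```python
import numpy as np

def pca_lwlr_predict(X_train, Y_train, x_query, n_components, bandwidth=None):
    """Predict targets for one query image by weighted least squares
    on the leading principal components of the training images."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the right singular vectors of the centered image matrix.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    V_k = Vt[:n_components].T                  # (n_pixels, k)
    Z = (X_train - mean) @ V_k                 # training images in reduced space
    z = (x_query - mean) @ V_k                 # query image in reduced space

    # Assumed similarity measure: Gaussian kernel on squared distance.
    d2 = np.sum((Z - z) ** 2, axis=1)
    if bandwidth is None:
        bandwidth = max(np.sqrt(np.median(d2)), 1e-12)  # heuristic default
    w = np.exp(-d2 / (2.0 * bandwidth**2))

    # Weighted least squares with an intercept column.
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(sw * Zb, sw * Y_train, rcond=None)
    return np.concatenate([[1.0], z]) @ beta

# Synthetic check: images and keypoints both depend linearly on 5 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 21 * 21))
B = rng.normal(size=(5, 2))
X_train, Y_train = latent @ mixing, latent @ B

latent_q = rng.normal(size=5)
x_query, y_true = latent_q @ mixing, latent_q @ B
y_pred = pca_lwlr_predict(X_train, Y_train, x_query, n_components=5)
```

On this noise-free toy data the prediction matches `y_true` up to rounding. On real face images the weights would matter much more, because nearby (similar) training images get a bigger say in the fit.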

A common way to measure prediction performance is k-fold cross validation. We divide our data set into k subsets, train our model on k-1 of them, and test it on the remaining one as the validation set, repeating so that each subset serves as the validation set once. It is a very good way to tune the parameters of the model and thus avoid overfitting. As you may have discerned, I am more attached to the classification way of approaching the image recognition problem, although I am aware of the computational issues that inevitably arise with larger data sets. But still, as the blog title suggests, all roads lead to Rome; choose the one you want.
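For readers who want to try the k-fold procedure described above, here is a small sketch in plain NumPy. The model is an assumption: I use ridge regression with a penalty `lam` as the parameter being tuned, and the data are synthetic. The helper names `k_fold_indices`, `ridge_fit`, and `cv_error` are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Mean validation MSE over k folds for one value of lam."""
    folds = k_fold_indices(len(X), k)
    errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

# Synthetic regression problem standing in for pixels and keypoints.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=100)

lams = [1e-3, 1e-1, 10.0, 1000.0]
scores = {lam: cv_error(X, y, lam) for lam in lams}
best_lam = min(scores, key=scores.get)
```

The value of `lam` with the lowest average validation error is the one to keep; the same loop works for any other model or parameter.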


[1] O. Ryan. Applications of the Wavelet transform in image processing. 2014.

[2] K. Cabeen, P. Gent. Image Compression and the Discrete Cosine Transform.

[3] A. Esmaeili, K. Khosravi, S. Mirjalili. Facial Keypoint Detection.

[4] Wikipedia: Discrete Cosine Transform.