Subspace Clustering of Very Sparse High-Dimensional Data

Published in 2018 IEEE International Conference on Big Data, 2018

Recommended citation: H. Peng, N. Pavlidis, I. Eckley and I. Tsalamanis, "Subspace Clustering of Very Sparse High-Dimensional Data", Proceedings of 2018 IEEE International Conference on Big Data, 2018.

In this paper we study the problem of clustering collections of very short texts using subspace clustering. This problem arises in many application areas such as product categorisation, fraud detection, and sentiment analysis. The main challenge lies in the fact that the vectorial representation of short texts is both high-dimensional, due to the large number of unique terms in the corpus, and extremely sparse, as each text contains a very small number of words with no repetition.

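To make the representation concrete, here is a minimal sketch of a binary bag-of-words encoding in plain Python. The product names are invented for illustration and are not taken from the paper's dataset, and the helper `binary_bag_of_words` is a hypothetical name, not part of the authors' code. Each text is stored as the list of column indices of the terms it contains, which is all a 0/1 vector with no repeated words needs.

```python
# Toy product names, made up for illustration (not the paper's data).
texts = [
    "wireless optical mouse",
    "stainless steel water bottle",
    "usb optical mouse pad",
    "insulated steel travel mug",
]


def binary_bag_of_words(docs):
    """Map each text to the sorted column indices of the terms it contains.

    Returns (vocab, rows): vocab maps term -> column index, and rows[i]
    lists the columns that are 1 in the binary vector of docs[i].
    """
    vocab = {}
    rows = []
    for doc in docs:
        cols = set()
        for term in doc.lower().split():
            # Assign the next free column to a term seen for the first time.
            cols.add(vocab.setdefault(term, len(vocab)))
        rows.append(sorted(cols))
    return vocab, rows


vocab, rows = binary_bag_of_words(texts)
n_docs, n_terms = len(rows), len(vocab)
nonzeros = sum(len(r) for r in rows)
# Fraction of entries in the n_docs x n_terms matrix that are non-zero.
density = nonzeros / (n_docs * n_terms)

print(f"{n_docs} texts, {n_terms} unique terms, {nonzeros} non-zero entries")
print(f"density = {density:.4f}")
```

Even on this toy corpus each row has only three or four non-zero entries. On a real corpus the number of columns grows with the vocabulary while the number of non-zeros per row stays small, so the density falls much further.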

We propose a simple new subspace clustering algorithm that relies on linear algebra to cluster such datasets. Experimental results on identifying product categories from product names obtained from the US Amazon website indicate that the algorithm is competitive with state-of-the-art clustering algorithms. [pdf][code]