Subspace Clustering with Active Learning

Hankui Peng

Supervisors: Nicos Pavlidis, Idris Eckley

Industrial partner: Data Science Campus, Office for National Statistics

3rd May 2019

 

 

Outline

    1. Introduction

    2. Subspace Clustering

    3. Active Learning

    4. Experiments

    5. Conclusions & Future Work

Introduction

Classic Setting

Subspace Setting

Motivation

  • Image segmentation

Motivation

  • Digit recognition

and more...

  • Face recognition
  • Motion segmentation
  • Text analysis

Subspace Clustering

Subspace Clustering

Definition

Subspace clustering is the problem of finding the subspaces, their dimensions, a basis for each subspace, and the segmentation of the data into clusters.

Notation

$X$: the $N\times P$ data matrix.

$N$: number of data objects.

$P$: full data dimensionality.

$K$: number of clusters.

$q$: subspace dimensionality.

K-Subspace Clustering (KSC)

KSC is a $K$-means-like algorithm that alternates between updating the subspace bases and updating the cluster assignments, in order to minimise the overall reconstruction error:

$$L(\textbf{X},\Omega)= \sum_{k=1}^{K}\sum_{\textbf{x}\in\mathcal{X}_{k} } \left\|\textbf{x}-V_{k}V_{k}^{T}\textbf{x} \right\|_{2}^{2},$$

where

$V_{k}$ is the $P\times q$ basis matrix for the $k$th subspace,

$\mathcal{X}_{k}$ is the set of data that belong to the $k$th subspace, and

$\Omega=\left\{\omega_{1},\ldots,\omega_{N} \right\}$ is the set of cluster assignments, where $\omega_{i}\in\left\{1,\ldots,K \right\}$ for $i\in\left\{1,\ldots,N \right\}$.

For data that lie in $q$-dimensional subspaces, a local optimum can be found by iterating between the following:

Subspace formation

Computing the singular value decomposition of each cluster's data matrix:

$$X_{k}=U_{k}\Sigma_{k}V_{k}^{T}, \quad k=1,\ldots,K,$$

gives the basis matrix $V_{k}$ of the $k$th subspace as the leading $q$ right singular vectors.

Assignment update

Assigning each data object to its closest subspace:

$$\omega_{i}=\arg\min_{k=1,\ldots,K} \left\|\textbf{x}_{i}-V_{k}V_{k}^{T}\textbf{x}_{i} \right\|_{2}^{2}, \quad i=1,\ldots,N.$$

Stop the iterative procedure when the objective function value plateaus.
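
A minimal sketch of this iterative procedure in Python/NumPy is given below. It is an illustration, not the implementation used in this work; `K` and `q` are assumed known, initialisation is random, and empty clusters are not handled.

```python
import numpy as np

def ksc(X, K, q, max_iter=100, tol=1e-10, seed=0):
    """K-Subspace Clustering: alternate between fitting a q-dimensional basis
    to each cluster (via SVD) and reassigning each point to the subspace
    with the smallest reconstruction error."""
    rng = np.random.default_rng(seed)
    N, P = X.shape
    labels = rng.integers(K, size=N)               # random initial assignment
    prev_loss = np.inf
    for _ in range(max_iter):
        # Subspace formation: leading q right singular vectors per cluster
        bases = []
        for k in range(K):
            Xk = X[labels == k]
            _, _, Vt = np.linalg.svd(Xk, full_matrices=False)
            bases.append(Vt[:q].T)                 # P x q basis matrix V_k
        # Assignment update: residual of each point w.r.t. each subspace
        resid = np.empty((N, K))
        for k, Vk in enumerate(bases):
            resid[:, k] = np.sum((X - X @ Vk @ Vk.T) ** 2, axis=1)
        labels = resid.argmin(axis=1)
        loss = resid[np.arange(N), labels].sum()   # overall reconstruction error
        if prev_loss - loss < tol:                 # objective has plateaued
            break
        prev_loss = loss
    return labels, bases
```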

Visual Example

 


Advantages

  • Converges in a small number of iterations

Drawbacks

  • Can easily get stuck in local minima

  • Sensitive to initialisations

Active Learning

Active Learning

 

Active learning improves the performance of machine learning algorithms through actively querying data labels.

  • Uncertainty sampling

  • Query by committee

  • Expected model change

Our proposed AL framework borrows ideas from both uncertainty sampling and expected model change.

The Framework

Assumptions:

  • The removal of a misclassified data object from its currently allocated cluster would result in a large decrease in the reconstruction error of that cluster;

  • The addition of a misclassified data object to its true cluster would incur only a small increase in the reconstruction error of that cluster.

Strategy:

  • Query stage: querying the true classes of those data objects that are potentially misclassified;

  • Update stage: updating the subspaces and, correspondingly, the labels of the other potentially misclassified data objects.

Query Stage

Query Stage

Influence of data deletion

The influence of data deletion $U_{1}(\textbf{x}_{s},\omega_{s})$ is measured by the decrease in reconstruction error:

$$U_{1}(\textbf{x}_{s},\omega_{s})=\sum_{\textbf{x}\in\mathcal{X}_{\omega_{s}}}L(\textbf{x},\omega_{s})-\sum_{\textbf{x}\in\mathcal{X}_{\omega_{s}}\backslash\left\{\textbf{x}_{s}\right\}}\tilde{L}_{-}(\textbf{x},\omega_{s}),$$

where

  • $\textbf{x}_{s}$ is the data object to be removed from its allocated cluster $\omega_{s}$;

  • $\tilde{L}_{-}(\textbf{x},\omega_{s})$ denotes the reconstruction error of $\textbf{x}$ under the perturbed subspace (refitted after $\textbf{x}_{s}$ is removed).

Query Stage

Influence of data addition

The influence of data addition $U_{2}(\textbf{x}_{s},k)$ is measured by the increase in reconstruction error:

$$U_{2}(\textbf{x}_{s},k)=\sum_{\textbf{x}\in\mathcal{X}_{k}\cup\left\{\textbf{x}_{s}\right\}}\tilde{L}_{+}(\textbf{x},k)-\sum_{\textbf{x}\in\mathcal{X}_{k}}L(\textbf{x},k),$$

where

  • $\textbf{x}_{s}$ is the data object to be added to cluster $k$ ($k\in\left\{1,\ldots,K \right\}$);

  • $\tilde{L}_{+}(\textbf{x},k)$ denotes the reconstruction error of $\textbf{x}$ under the perturbed subspace (refitted after $\textbf{x}_{s}$ is added).

Query Stage

Combining the influence of data deletion and addition, we select the most informative data object as: $$\textbf{x}_{s}^{*}=\arg\max_{\textbf{x}_{s}\in\mathcal{X}_{U}} \left\{U_{1}(\textbf{x}_{s},\omega_{s})-U_{2}(\textbf{x}_{s},\omega^{*}_{s}) \right\},$$

where

$$\omega^{*}_{s}=\arg\min_{\substack{k\in\left\{1,\ldots,K \right\}\\k\neq\omega_{s}}}L(\textbf{x}_{s},k),$$

is considered as the most likely true label for $\textbf{x}_{s}$.
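
As a brute-force illustration of this selection rule (refitting each subspace by SVD rather than using the perturbation approximations derived next), a sketch under the same notation might look as follows; `fit_basis`, `recon_error`, and `most_informative` are hypothetical helpers, not part of the original method's code.

```python
import numpy as np

def fit_basis(Xk, q):
    """Leading q right singular vectors of the cluster data (P x q basis)."""
    _, _, Vt = np.linalg.svd(Xk, full_matrices=False)
    return Vt[:q].T

def recon_error(Z, V):
    """Total squared reconstruction error of the rows of Z w.r.t. span(V)."""
    return np.sum((Z - Z @ V @ V.T) ** 2)

def most_informative(X, labels, unqueried, K, q):
    """Select the unqueried point maximising U1(x_s, omega_s) - U2(x_s, omega_s*)."""
    N = len(X)
    best_idx, best_score = None, -np.inf
    for s in unqueried:
        w = labels[s]
        Xw = X[labels == w]
        Xw_minus = X[(labels == w) & (np.arange(N) != s)]
        # U1: decrease in reconstruction error when x_s is removed from omega_s
        U1 = recon_error(Xw, fit_basis(Xw, q)) - recon_error(Xw_minus, fit_basis(Xw_minus, q))
        # omega_s*: the most likely alternative cluster for x_s
        errs = np.array([recon_error(X[s:s + 1], fit_basis(X[labels == k], q))
                         for k in range(K)])
        errs[w] = np.inf
        w_star = int(errs.argmin())
        Xk = X[labels == w_star]
        # U2: increase in reconstruction error when x_s is added to omega_s*
        Xk_plus = np.vstack([Xk, X[s]])
        U2 = recon_error(Xk_plus, fit_basis(Xk_plus, q)) - recon_error(Xk, fit_basis(Xk, q))
        if U1 - U2 > best_score:
            best_score, best_idx = U1 - U2, s
    return best_idx
```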

Perturbation Analysis on the Eigenproblem

Approximation through convergent power series, for a cluster of $n$ points perturbed by $l$ points ($n\gg l$):

  • Perturbed covariance:
$$S(\epsilon)=S+S^{(1)}\epsilon+S^{(2)}\epsilon^{2}+\cdots,$$
  • Perturbed eigenvalue:
$$\lambda(\epsilon)=\lambda+\alpha_{1}\epsilon+\alpha_{2}\epsilon^{2}+\cdots,$$
  • Perturbed eigenvector:
$$\textbf{v}(\epsilon)=\textbf{v}+\boldsymbol{\psi}_{1}\epsilon+\boldsymbol{\psi}_{2}\epsilon^{2}+\cdots.$$

We can also deduce the perturbed forms (deletion / addition) of the covariance matrix as:

$$S_{I_{-}}=S+\frac{l}{n-l}\left[(S-S_{I})-(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})^{T} \right]-\frac{l^{2}}{(n-l)^{2}}(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})^{T},$$

$$S_{I_{+}}=S+\frac{l}{n+l}\left[(S_{I}-S)+(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})^{T} \right]-\frac{l^{2}}{(n+l)^{2}}(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})(\bar{\textbf{x}}_{I}-\bar{\textbf{x}})^{T},$$

where $\bar{\textbf{x}}_{I}$ and $S_{I}$ are the mean and covariance of the $l$ removed (or added) data objects.
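
These identities can be checked numerically; the sketch below is my own illustration and assumes the maximum-likelihood normalisation $S=\frac{1}{n}\sum_{i}(\textbf{x}_{i}-\bar{\textbf{x}})(\textbf{x}_{i}-\bar{\textbf{x}})^{T}$, under which the coefficients above hold exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l, P = 50, 5, 4
X = rng.normal(size=(n, P))              # the n points currently in the cluster
Y = rng.normal(size=(l, P))              # l further points to be added

def cov(Z):
    """Covariance with 1/m normalisation."""
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc / len(Z)

# Deletion: remove the first l rows of X and compare with the closed form
S, S_I = cov(X), cov(X[:l])
d = X[:l].mean(axis=0) - X.mean(axis=0)                  # xbar_I - xbar
D = np.outer(d, d)
S_minus = S + l / (n - l) * ((S - S_I) - D) - l**2 / (n - l)**2 * D
assert np.allclose(S_minus, cov(X[l:]))

# Addition: append Y to X and compare with the closed form
S_I = cov(Y)
d = Y.mean(axis=0) - X.mean(axis=0)
D = np.outer(d, d)
S_plus = S + l / (n + l) * ((S_I - S) + D) - l**2 / (n + l)**2 * D
assert np.allclose(S_plus, cov(np.vstack([X, Y])))
```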

Perturbation Analysis on the Eigenproblem

  • Equating coefficients in $S(\epsilon)\textbf{v}(\epsilon)=\lambda(\epsilon)\textbf{v}(\epsilon)$ gives
$$\alpha_{1}=\textbf{v}^{T}S^{(1)}\textbf{v}.$$
  • This is the Rayleigh quotient of $S^{(1)}$ with respect to the normalised eigenvector $\textbf{v}$.

  • For $l=1$, the first-order approximation of the $k$th ($k\in\left\{1,\ldots,P \right\}$) perturbed eigenvalue can be written as

\begin{aligned} \lambda_{k}(\epsilon)&=\lambda_{k}+\epsilon\lambda_{k}^{(1)}+\mathcal{O}(\epsilon^{2})\\ &\approx\lambda_{k}+\epsilon \textbf{v}_{k}^{T}S^{(1)}\textbf{v}_{k}\\ &= \lambda_{k}+\frac{1}{n+1} \textbf{v}_{k}^{T}\left(\left(\bar{\textbf{x}}-\textbf{x}_{i} \right)\left(\bar{\textbf{x}}-\textbf{x}_{i} \right)^{T}-S \right)\textbf{v}_{k}\\ &=\frac{1}{n+1}\alpha_{ki}^2+\frac{n}{n+1} \lambda_{k}, \end{aligned}

where $\alpha_{ki}=\textbf{v}_{k}^{T}(\textbf{x}_{i}-\bar{\textbf{x}})$.
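
As a quick sanity check (my own sketch, not part of the slides), the first-order approximation above can be compared against the eigenvalues of the exactly recomputed covariance after adding a single point; the gap shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
n, P = 500, 5
X = rng.normal(size=(n, P))
x_new = rng.normal(size=P)               # the single point to be added

def cov(Z):
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc / len(Z)

S = cov(X)
lam, V = np.linalg.eigh(S)               # eigenvalues (ascending) and eigenvectors
alpha = V.T @ (x_new - X.mean(axis=0))   # alpha_k = v_k^T (x_i - xbar)

# First-order approximation of the perturbed eigenvalues after adding x_new
lam_approx = alpha**2 / (n + 1) + n / (n + 1) * lam

# Eigenvalues of the exactly recomputed covariance
lam_exact = np.linalg.eigvalsh(cov(np.vstack([X, x_new])))

print(np.max(np.abs(np.sort(lam_approx) - lam_exact)))   # small; shrinks with n
```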

Perturbation Analysis on the Eigenproblem

The perturbed form of the influence of single data deletion:

\begin{equation} \begin{aligned} U_{1}(\textbf{x}_{s},\omega_{s}) &=\sum_{\textbf{x}\in\mathcal{X}_{\omega_{s}}}L(\textbf{x},\omega_{s})-\sum_{\textbf{x}\in\mathcal{X}_{\omega_{s}}\backslash\left\{\textbf{x}_{s}\right\}}\tilde{L}_{-}(\textbf{x},\omega_{s})\\ &=\sum_{k=q+1}^{P}\left(\frac{1}{n-1}\alpha_{ki}^{2}-\frac{1}{n-1}\lambda_{k} \right). \end{aligned} \end{equation}

The perturbed form of the influence of single data addition:

\begin{equation} \begin{aligned} U_{2}(\textbf{x}_{s},\omega) &=\sum_{\textbf{x}\in\mathcal{X}_{\omega}\cup\left\{\textbf{x}_{s} \right\}}\tilde{L}_{+}(\textbf{x},\omega)-\sum_{\textbf{x}\in\mathcal{X}_{\omega}}L(\textbf{x},\omega)\\ &=\sum_{k=q+1}^{P}\left\{\frac{1}{n+1}\alpha_{ki}^2-\frac{1}{n+1}\lambda_{k} \right\}. \end{aligned} \end{equation}
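
Putting the two closed forms to use, a small sketch (my own illustration) evaluates $U_{1}$ and $U_{2}$ directly from the eigen-decomposition of a cluster's covariance, so that no subspace needs to be refitted per candidate query; only the residual directions $k=q+1,\ldots,P$ contribute.

```python
import numpy as np

def residual_spectrum(Xk):
    """Eigenvalues (descending) and eigenvectors of the cluster covariance."""
    xbar = Xk.mean(axis=0)
    S = (Xk - xbar).T @ (Xk - xbar) / len(Xk)
    lam, V = np.linalg.eigh(S)
    return lam[::-1], V[:, ::-1], xbar

def u1_deletion(Xw, s_idx, q):
    """Perturbation-based influence of deleting point s_idx from its cluster."""
    n = len(Xw)
    lam, V, xbar = residual_spectrum(Xw)
    alpha = V.T @ (Xw[s_idx] - xbar)
    return np.sum((alpha[q:] ** 2 - lam[q:]) / (n - 1))

def u2_addition(Xk, x_s, q):
    """Perturbation-based influence of adding x_s to another cluster."""
    n = len(Xk)
    lam, V, xbar = residual_spectrum(Xk)
    alpha = V.T @ (x_s - xbar)
    return np.sum((alpha[q:] ** 2 - lam[q:]) / (n + 1))
```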

Visual Example (Cont'd)

 


Update Stage

$K$-Subspace Clustering with Constraints (KSCC)

Constrained objective:

$$g(\mathcal{V})=\sum_{\textbf{x}_{u}\in\mathcal{X}_{U}}\left\{\min_{m\in\left\{1,\ldots,K \right\}}\left\|\textbf{x}_{u}-V_{m}V_{m}^{T}\textbf{x}_{u} \right\|_{2}^{2} \right\}+\min_{\substack{P_{n}\in\mathcal{P}(K)\\n\in\left\{1,\ldots,K! \right\}}}\left\{\sum_{l=1}^{K} \left\|X_{l}-V_{P_{nl}}V_{P_{nl}}^{T}X_{l} \right\|_{2}^{2} \right\},$$

where

  • $\mathcal{V}$ is the set of all basis matrices;

  • $\mathcal{X}_{U}$ denotes the set of unqueried data objects;

  • $X_{l}$ denotes the queried data objects whose true class is $l$;

  • $\mathcal{P}(K)$ is the set of all possible permutations of $\left\{1,\ldots, K \right\}$.

$K$-Subspace Clustering with Constraints (KSCC)

KSCC algorithm

  • Stage 1: fitting subspaces

  • Stage 2: updating assignment

  • Stage 3: satisfying constraints by solving

$$P_{n}^{*} =\arg \min_{\substack{P_{n}\in\mathcal{P}(K)\\n\in\left\{1,\ldots,K! \right\}}}\left\{\sum_{l=1}^{K} \left\|X_{l}-V_{P_{nl}}V_{P_{nl}}^{T}X_{l} \right\|_{2}^{2} \right\},$$

through exhaustive search ($\mathcal{O}(K!)$) when $K\leq 5$; or

through the Hungarian algorithm ($\mathcal{O}(K^3)$) when $K>5$.

*We can show that the constrained objective decreases monotonically throughout iterations of KSCC.
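
For Stage 3, the optimal matching between queried classes and fitted subspaces can be computed with the Hungarian method, e.g. via `scipy.optimize.linear_sum_assignment`; the sketch below is illustrative, with hypothetical inputs `X_queried`, `true_labels`, and `bases` (the $P\times q$ basis matrices $V_{1},\ldots,V_{K}$).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_classes_to_subspaces(X_queried, true_labels, bases):
    """Stage 3 of KSCC: find the permutation P* minimising the total
    reconstruction error of the queried data, with class l matched to
    subspace P*_l (Hungarian algorithm, O(K^3) instead of O(K!))."""
    K = len(bases)
    cost = np.zeros((K, K))
    for l in range(K):                        # queried class l
        Xl = X_queried[true_labels == l]
        for k, Vk in enumerate(bases):        # candidate subspace k
            cost[l, k] = np.sum((Xl - Xl @ Vk @ Vk.T) ** 2)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cols                               # cols[l]: subspace assigned to class l
```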

Experiments

Strategies for comparison

  • SCAL-A: querying data with the least influence when added to their most likely cluster

  • SCAL-D: querying data with the largest influence when deleted from their allocated cluster

  • SCAL: the combination of SCAL-A and SCAL-D

  • MaxResid: querying data with the maximum projection distance to their allocated cluster

  • MinMargin: querying data with the minimum margin between the projection distances to their two closest subspaces

  • Random: querying data by random sampling

Visual Example (Cont'd)

 

Comparing different strategies (from left to right: Random, MinMargin, SCAL):

Visual Example (Cont'd)


Faces Example

 

  • Clustering faces ($K=3$) under varying lighting conditions

  • 64 images of each face, each in 2016-dimensional space

  • Images of one subject lie close to a single low-dimensional subspace of dimension 9

Faces Example

 

Conclusions & Future Work

Conclusions & Future Work

Conclusions

  • Active query strategies based on changes in reconstruction error improve clustering performance quickly

Future work

  • Explore other approaches to make the KSC algorithm more robust

  • Understand the interplay between the rate of improvement and the noise level / the angles between subspaces

  • Extend the current framework to incorporate other types of subspace algorithms (e.g. spectral methods)

References

  • P. S. Bradley and O. L. Mangasarian, “k-plane clustering,” Journal of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.

  • J. Lipor and L. Balzano, “Margin-based active subspace clustering,” in 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2015, pp. 377–380.

  • J. Lipor and L. Balzano, “Leveraging union of subspace structure to improve constrained clustering,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 2130–2139.

  • H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955.

Questions?