Dimensionality is a serious problem in data analysis today, because the huge data sets we encounter are very sparse and very high-dimensional. Data scientists have long used tools such as principal component analysis (PCA) and independent component analysis (ICA) to project high-dimensional data onto a subspace, but all of those techniques rely on computing the eigenvectors of an $n \times n$ matrix, a very expensive operation (e.g., spectral decomposition) for high dimension $n$. Moreover, even though the eigenspace has many important properties, it does not yield good approximations for many useful measures such as vector norms. Here we discuss another method, random projection, for reducing dimensionality.

In 1984, two mathematicians, William B. Johnson and Joram Lindenstrauss, introduced and proved the following lemma.

Johnson-Lindenstrauss lemma

For any $\epsilon \in (0,\frac{1}{2})$ and any $x_1, x_2, \dots, x_d \in \mathbb{R}^{n}$, there exists a matrix $M \in \mathbb{R}^{m \times n}$ with $m = O(\frac{1}{\epsilon^2} \log{d})$ such that $\forall 1 \leq i,j \leq d$, we have

$$(1-\epsilon)\,||x_i - x_j||_2^2 \;\leq\; ||Mx_i - Mx_j||_2^2 \;\leq\; (1+\epsilon)\,||x_i - x_j||_2^2.$$

Remark: This lemma states that for any pair of vectors $x_i, x_j \in \mathbb{R}^n$, there exists a sketch matrix $M$ mapping $\mathbb{R}^n \rightarrow \mathbb{R}^m$ such that the Euclidean distance between them is preserved up to a $(1 \pm \epsilon)$ factor. The target dimension $m$ does not depend on the original dimension $n$; it depends only on the number of vectors $d$.
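
To get a feel for the numbers, here is a minimal Python sketch computing the target dimension. The constant in front of $\frac{1}{\epsilon^2}\log d$ is an illustrative assumption; the lemma only fixes the asymptotic order.

```python
import numpy as np

def jl_dim(d, eps, c=4.0):
    """Target dimension m = O(log(d) / eps^2).

    `c` is an illustrative constant; the lemma only specifies the asymptotic order.
    """
    return int(np.ceil(c * np.log(d) / eps ** 2))

# One million vectors, 10% distortion: a few thousand dimensions,
# no matter how large the original dimension n is.
print(jl_dim(d=10 ** 6, eps=0.1))   # 5527
```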

For a long time, no one could figure out how to construct such a sketch matrix.

Random Projection

In 2003, researchers pointed out that this sketch matrix can be constructed using the Gaussian distribution.

Consider the following matrix $A \in \mathbb{R}^{m \times n}$, where $A_{ij} \sim \mathcal{N}(0,1)$ and all $A_{ij}$ are independent. We claim that this matrix, after appropriate scaling, satisfies the statement of the JL lemma.
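
As a quick sanity check of the construction (a minimal numpy sketch with arbitrary dimensions, not part of the proof): each coordinate of $Ax$ should behave like a Gaussian with variance $||x||_2^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 400                 # illustrative original / target dimensions

x = rng.standard_normal(n)         # an arbitrary test vector
A = rng.standard_normal((m, n))    # i.i.d. N(0, 1) entries, as above

# Each coordinate (Ax)_i ~ N(0, ||x||_2^2), so the empirical standard
# deviation of the coordinates of Ax should be close to ||x||_2.
Ax = A @ x
print(np.linalg.norm(x), Ax.std())
```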

Proof. Observe first that the sketch has a useful property: $\forall i$, $(Ax)_i = \sum_{j=1}^{n} A_{ij} x_j \sim \mathcal{N}(0, ||x||_2^2)$. In other words, the Gaussian distribution is a 2-stable distribution. Then $||Ax||_2^2 = \sum_{i=1}^{m} y_i^2$, where the $y_i \sim \mathcal{N}(0, ||x||_2^2)$ are independent. That is to say, $||Ax||_2^2 / ||x||_2^2$ follows a $\chi^2$ (chi-squared) distribution with $m$ degrees of freedom. From the tail bound of the $\chi^2$ distribution, we get

$$\Pr\left[\,\left|\tfrac{1}{m}||Ax||_2^2 - ||x||_2^2\right| > \epsilon\, ||x||_2^2\,\right] \;\leq\; 2 e^{-C \epsilon^2 m}$$

for a constant $C > 0$.
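
A quick Monte Carlo estimate of this failure probability (a sketch only; the dimensions, $\epsilon$, and trial count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, eps, trials = 500, 200, 0.25, 1_000

x = rng.standard_normal(n)
norm2 = np.sum(x ** 2)             # ||x||_2^2

# Fraction of independent draws of A for which (1/m)||Ax||^2 deviates from
# ||x||^2 by more than eps * ||x||^2; this decays exponentially in eps^2 * m.
failures = sum(
    abs(np.sum((rng.standard_normal((m, n)) @ x) ** 2) / m - norm2) > eps * norm2
    for _ in range(trials)
)
print(failures / trials)           # on the order of a percent for these values
```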

Fix two indices $i, j$, let $y^{ij} = x_i - x_j$ and $M = \frac{1}{\sqrt{m}} A$, and set $m = \frac{4}{C \epsilon^2} \log{d}$ to get

$$\Pr\left[\,\left|\,||M y^{ij}||_2^2 - ||y^{ij}||_2^2\,\right| > \epsilon\, ||y^{ij}||_2^2\,\right] \;\leq\; 2 e^{-C \epsilon^2 m} = \frac{2}{d^4}.$$

Taking the union bound over all $\binom{d}{2} \leq \frac{d^2}{2}$ pairs, we obtain

$$\Pr\left[\,\exists\, i,j:\ \left|\,||Mx_i - Mx_j||_2^2 - ||x_i - x_j||_2^2\,\right| > \epsilon\, ||x_i - x_j||_2^2\,\right] \;\leq\; \binom{d}{2} \cdot \frac{2}{d^4} \;\leq\; \frac{1}{d^2}.$$

Thus, with probability at least $1 - \frac{1}{d^2}$,

$$(1-\epsilon)\,||x_i - x_j||_2^2 \;\leq\; ||Mx_i - Mx_j||_2^2 \;\leq\; (1+\epsilon)\,||x_i - x_j||_2^2 \qquad \forall\, 1 \leq i,j \leq d,$$

which is the same guarantee as in the Johnson-Lindenstrauss lemma.
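
Putting the pieces together, here is a small empirical check of the resulting guarantee (a sketch under illustrative parameters; `d`, `n`, `eps`, and the constant in `m` are assumptions, not prescribed values):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, n, eps = 50, 5_000, 0.2
m = int(np.ceil(4 * np.log(d) / eps ** 2))     # m = O(log(d) / eps^2)

X = rng.standard_normal((d, n))                # d arbitrary points in R^n
M = rng.standard_normal((m, n)) / np.sqrt(m)   # M = A / sqrt(m)
Y = X @ M.T                                    # sketched points in R^m

# Worst relative distortion of squared pairwise distances over all pairs;
# it should be on the order of eps.
worst = max(
    abs(np.sum((Y[i] - Y[j]) ** 2) / np.sum((X[i] - X[j]) ** 2) - 1.0)
    for i, j in combinations(range(d), 2)
)
print(m, worst)
```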

Application

In this or other forms, the JL lemma has been used for a large variety of computational tasks, especially in streaming algorithms, such as norm estimation over data streams and approximate nearest-neighbor search.
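
One reason the lemma fits streaming so well is that the sketch is linear: $M(x + \Delta) = Mx + M\Delta$, so it can be maintained under incremental coordinate updates without storing $x$. A minimal sketch of this idea (the update format $(i, \delta)$ and all parameters are assumptions for illustration; a real streaming algorithm would generate the columns of $M$ on the fly from a seed rather than store them):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10_000, 300
M = rng.standard_normal((m, n)) / np.sqrt(m)

x = np.zeros(n)        # kept only to verify the sketch at the end
s = np.zeros(m)        # the sketch s = M x, maintained incrementally

# A stream of coordinate updates (i, delta), each meaning x[i] += delta.
stream = [(int(rng.integers(n)), float(rng.standard_normal())) for _ in range(1_000)]
for i, delta in stream:
    x[i] += delta
    s += delta * M[:, i]           # linearity: M(x + delta * e_i) = M x + delta * M e_i

# ||s||_2 approximates ||x||_2 up to a (1 +/- eps) factor.
print(np.linalg.norm(x), np.linalg.norm(s))
```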

References

  • EPFL Topics in Theoretical Computer Science (Sublinear Algorithm for Big Data Analysis), 2017
  • EPFL Advanced Algorithm, 2016
  • Johnson, William B.; Lindenstrauss, Joram (1984). “Extensions of Lipschitz mappings into a Hilbert space”. In Beals, Richard; Beck, Anatole; Bellow, Alexandra; et al. Conference in modern analysis and probability (New Haven, Conn., 1982). Contemporary Mathematics. 26. Providence, RI: American Mathematical Society. pp. 189–206.
  • Kane, Daniel M.; Nelson, Jelani (2012). “Sparser Johnson-Lindenstrauss Transforms”. Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms. New York: Association for Computing Machinery (ACM).

This post is for study purposes only.