
Context

Six months ago @M.R. asked about an implementation of the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm by van der Maaten and Hinton (2008). (@M.R.'s question)

@Alexey Golyshev gave a solid answer utilizing RLink. However, I thought it would be more interesting* to try and implement t-SNE in Mathematica (v11+). I have run into a bit of difficulty, though, and would appreciate the community's help.

*the choice may also be partly due to macOS-related issues with RLink.

Note: I like to code using Mathematica's symbolic notation (and hence it looks crude even with the Mathematica Stack Exchange Plugin). A notebook with the code can be found here (rather than a mess of \[VariousLongSymbolNames] in a post).

Pseudo-Code

Page 2587 of the paper gives the pseudo-code for the t-SNE algorithm:


Algorithm 1

Data: data set $\chi=\{x_1, x_2, \dots, x_n\}$,
cost function parameters: perplexity Perp,
optimization parameters: number of iterations $T$, learning rate $\eta$, momentum $\alpha(t)$.
Result: low-dimensional data representation $Y^{(T)}=\{y_1, y_2, \dots, y_n\}$.

begin
----compute pairwise affinities $p_{j|i}$ with perplexity Perp (using Equation 1)
----set $p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n}$
----sample initial solution $Y^{(0)}=\{y_1,y_2,\dots,y_n\}$ from $N(0,10^{-4}I)$
----for $t=1$ to $T$ do
--------compute low-dimensional affinities $q_{ij}$ (using Equation 4)
--------compute gradient $\frac{\delta C}{\delta Y}$ (using Equation 5)
--------set $Y^{(t)}=Y^{(t-1)}+\eta \frac{\delta C}{\delta Y} + \alpha(t)(Y^{(t-1)}-Y^{(t-2)})$
----end
end


Equations

1 $$ p_{j\vert i}=\frac{\exp(-\| x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k\neq i}\exp(-\| x_i-x_k\|^2/2\sigma_i^2)} $$

4 $$ q_{ij}=\frac{(1+\| y_i-y_j\|^2)^{-1}}{\sum_{k\neq l}(1+\| y_k-y_l\|^2)^{-1}} $$

5 $$ \frac{\delta C}{\delta y_i}=4\sum_j(p_{ij}-q_{ij})(y_i-y_j)(1+\| y_i-y_j\|^2)^{-1} $$

Implementation

So here is what I have been able to implement so far... (see code)
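
For orientation, here is a minimal, self-contained sketch of Algorithm 1 (the function name tSNESketch is mine, not the notebook's). To keep it short it assumes a single fixed $\sigma$ shared by all points instead of calibrating each $\sigma_i$ to the perplexity, a constant momentum $\alpha$, and a 2-dimensional output:

(* a minimal sketch of Algorithm 1; X is a list of n numeric vectors *)
tSNESketch[X_, T_ : 500, eta_ : 100., alpha_ : 0.5, sigma_ : 1.] :=
 Module[{n = Length[X], Pcond, P, Y, Yprev, num, Q, grad},
  (* Equation 1: conditional affinities p_{j|i}, diagonal fixed at 0 *)
  Pcond = Table[If[i == j, 0.,
      Exp[-SquaredEuclideanDistance[X[[i]], X[[j]]]/(2. sigma^2)]],
    {i, n}, {j, n}];
  Pcond = Pcond/Total[Pcond, {2}]; (* normalize each row over k != i *)
  P = (Pcond + Transpose[Pcond])/(2. n); (* symmetrize: p_ij *)
  (* sample Y^(0) from N(0, 10^-4 I); 10^-4 is the variance, so the
     standard deviation passed to NormalDistribution is 10^-2 *)
  Y = RandomVariate[NormalDistribution[0, 10.^-2], {n, 2}];
  Yprev = Y; (* define Y^(-1) := Y^(0), cf. the Update below *)
  Do[
   (* Equation 4: Student-t affinities q_{ij} *)
   num = Table[If[i == j, 0.,
       1./(1. + SquaredEuclideanDistance[Y[[i]], Y[[j]]])],
     {i, n}, {j, n}];
   Q = num/Total[num, 2];
   (* Equation 5: gradient dC/dy_i *)
   grad = Table[4. Sum[(P[[i, j]] - Q[[i, j]]) (Y[[i]] - Y[[j]]) num[[i, j]],
       {j, n}], {i, n}];
   (* the paper's pseudo-code writes "+ eta dC/dY", but minimizing the
      KL cost requires stepping against the gradient *)
   {Y, Yprev} = {Y - eta grad + alpha (Y - Yprev), Y},
   {T}];
  Y]

Usage would be along the lines of tSNESketch[N[data]] followed by ListPlot on the result; the paper's momentum schedule ($\alpha(t)=0.5$ for $t<250$, $0.8$ thereafter) is straightforward to swap in for the constant $\alpha$.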

Questions

  • How should functions that require access to elements of the Dataset be handled? I like assigning the $x_i$ values inside Table (which does work), but I fear this is against convention.
  • How can I write this following Mathematica conventions?
  • How should the learning term be updated?
  • Are there any obvious mistakes?

Git

code

Update

One may notice that the pseudo-code from the original paper has a typo. The update is a recurrence relation that depends on the previous two solutions $Y^{(t-1)}$ and $Y^{(t-2)}$; only $Y^{(0)}$ is initialized, however. Ergo one needs to define $Y^{(-1)}$ to be something, perhaps all zeros (or simply a copy of $Y^{(0)}$, which makes the momentum term vanish at $t=1$).

In addition, as pointed out in the comments, although the variance is repeatedly written as $\sigma_i$ in the paper, that is a recurring typo; it should be $\sigma^2_i$.

SumNeuron
  • I is an IdentityMatrix. Also you don't need to square Variance and there exists SquaredEuclideanDistance. – swish Feb 17 '17 at 09:25
  • @swish how come? That is the formula provided... – SumNeuron Feb 17 '17 at 10:01
  • Variance is already a $\sigma^2$ afaik – swish Feb 17 '17 at 11:11
  • @SumNeuron You can take the Python code here and make the naive translation of it to the Wolfram language + Compile. – Alexey Golyshev Feb 17 '17 at 11:12
  • @SumNeuron https://gist.githubusercontent.com/sw1sh/560fc4c27d9f173bf85e34c88e327a7d/raw/d9ccc04d50ec00dc3e59cea9cabdd70845db15a3/t-SNE.nb – swish Feb 17 '17 at 11:50
  • @swish, yes in traditional statistics, but they define variance as $\sigma_i$: where, $\sigma_i$ is the variance... so I figured that was just a weird terminology thing... – SumNeuron Feb 17 '17 at 13:14
  • @AlexeyGolyshev yes I already have the python code, the question is about implementing it in Mathematica, not necessarily with a naive translation, but from the equations and simple pseudo-code provided. – SumNeuron Feb 17 '17 at 13:15
  • +1 just for the clarity of the question. I was looking for the pseudocode for this algorithm and you came on top on google :) – Yohan Obadia Feb 16 '18 at 09:25
  • I cannot figure out how to use perplexity in Equation 1. – Αλέξανδρος Ζεγγ Oct 04 '18 at 03:47
  • @ΑλέξανδροςΖεγγ in regards to what it is or how it is used? To the former: perplexity is simply $2^{H(P_i)}$, and can be interpreted as a smooth measure of the effective number of neighbors. If you are a visual learner, perhaps https://distill.pub/2016/misread-tsne/ will be of use. If it is the latter, the variance for the Gaussian centered on each high-dimensional data point is chosen to match a fixed perplexity. – SumNeuron Oct 04 '18 at 08:26
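
For concreteness, here is a minimal sketch of the calibration described in the last comment above, assuming a simple bisection on $\sigma_i$ (the function name and tolerances are mine): each $\sigma_i$ is tuned until the entropy $H(P_i)$ of the conditional distribution $p_{\cdot\vert i}$ satisfies $2^{H(P_i)}=$ Perp.

(* bisection on sigma_i: the entropy of p_{.|i}, and hence the perplexity
   2^H, grows monotonically with sigma *)
sigmaForPerplexity[xi_, others_, perp_, iters_ : 50] :=
 Module[{lo = 10.^-10, hi = 10.^10, sigma, p, h},
  Do[
   sigma = Sqrt[lo hi]; (* geometric midpoint of the bracket *)
   p = Exp[-SquaredEuclideanDistance[xi, #]/(2. sigma^2)] & /@ others;
   p = p/Total[p];
   h = -Total[p Log2[p + 10.^-12]]; (* Shannon entropy H(P_i) in bits *)
   If[2.^h < perp, lo = sigma, hi = sigma],
   {iters}];
  sigma]

For point $i$ one would call something like sigmaForPerplexity[X[[i]], Delete[X, i], 30.] and then form the row $p_{j\vert i}$ of Equation 1 with the returned $\sigma_i$.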

1 Answer


In Mathematica 11.1 DimensionReduce supports t-SNE.

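A minimal usage sketch on random data (the "Perplexity" sub-option is confirmed in the comments below):

(* embed 10-dimensional random points in the plane with the built-in t-SNE *)
data = RandomReal[1, {200, 10}];
embedding = DimensionReduce[data, 2, Method -> {"TSNE", "Perplexity" -> 30}];
ListPlot[embedding]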

Alexey Golyshev
  • Which implementation? The Barnes-Hut? Or the traditional? Can one set the Perplexity? This would be important given that Perplexity basically defines the scope on which t-SNE searches for local structure in high dimensions. – SumNeuron Mar 10 '17 at 20:48
  • @SumNeuron Good question. But I don't know. – Alexey Golyshev Mar 11 '17 at 05:10
  • @sumneuron You can set the perplexity like this: Method -> {"TSNE", "Perplexity" -> 100} – user5601 Mar 16 '17 at 21:49
  • @user5601 thanks! Just f.y.i., in the original publication Perplexity was best set between 5 and 50 :) – SumNeuron Mar 17 '17 at 08:26
  • @SumNeuron: we use the original exact-gradient method for very small datasets, and a parallelized C++ implementation of the Barnes-Hut approximation method for larger datasets. – Sebastian Mar 27 '17 at 20:44