ActUp: Analyzing and Consolidating tSNE & UMAP
Andrew Draganov¹, Jakob Jørgensen¹, Katrine Scheel¹, Davide Mottin¹, Ira Assent¹, Tyrus Berry², Cigdem Aslay¹
¹Aarhus University    ²George Mason University
{draganovandrew, jakobrj, scheel, davide, ira, cigdem}@cs.au.dk, [email protected]
Abstract

[...] interpretable low-dimensional embeddings. Despite their popularity, however, little work has been done to study their full span of differences. We theoretically and experimentally evaluate the space of parameters in both tSNE and UMAP and observe that a single one – the normalization – is responsible for switching between them. This, in turn, implies that a majority of the algorithmic differences can be toggled without affecting the embeddings. We dis[...]

[Figure 1: embeddings produced by tSNE, GDRtsne, and UMAP on MNIST, Fashion-MNIST, and Coil-100; only the row and column labels were recovered, the image itself is not included.]
[...] first principles and theoretically showing the effect that the normalization has on the gradient structure. We supplement this by identifying every implementation and hyperparameter difference between the two methods and implementing tSNE and UMAP in a common library. Thus, we study the effect that each choice has on the embeddings and show both quantitatively and qualitatively that, other than the normalization of the pairwise similarity matrices, none of these parameters significantly affect the outputs.

Based on this analysis, we introduce the necessary changes to the UMAP algorithm such that it can produce tSNE embeddings as well. We refer to this algorithm as Gradient Dimensionality Reduction (GDR) to emphasize that it is consistent with the presentations of both tSNE and UMAP. We experimentally validate that GDR can simulate both methods through a thorough quantitative and qualitative evaluation across many datasets and settings. Lastly, our analysis provides insights for further speed improvements and allows GDR to perform gradient descent faster than the standard implementation of UMAP.

In summary, our contributions are as follows:
1. We perform the first comprehensive analysis of the differences between tSNE and UMAP, showing the effect of each algorithmic choice on the embeddings.
2. We theoretically and experimentally show that changing the normalization is a sufficient condition for switching between the two methods.
3. We release simple, plug-and-play implementations of GDR, tSNE and UMAP that can toggle all of the identified hyperparameters. Furthermore, GDR obtains embeddings for both algorithms faster than UMAP.

2 Related Work

When discussing tSNE we refer to [Van Der Maaten, 2014], which established the nearest-neighbor and sampling improvements and is generally accepted as the standard tSNE method. A popular subsequent development was presented in [Linderman et al., 2019], wherein Fast Fourier Transforms were used to accelerate the comparisons between points. Another approach is LargeVis [Tang et al., 2016], which modifies the embedding functions to satisfy a graph-based Bernoulli probabilistic model of the low-dimensional dataset. As the more recent algorithm, UMAP has not yet seen as many variations. One promising direction, however, has extended UMAP's second step as a parametric optimization over neural network weights [Sainburg et al., 2020].

Many of these approaches utilize the same optimization structure, in which they iteratively attract and repel points. While most perform their attractions along nearest neighbors in the high-dimensional space, the repulsions are the slowest operation and each method approaches them differently. tSNE samples repulsions by utilizing Barnes-Hut (BH) trees to sum the forces over distant points. The work in [Linderman et al., 2019] instead calculates repulsive forces with respect to specifically chosen interpolation points, cutting down on the O(n log n) BH tree computations. UMAP and LargeVis, on the other hand, simplify the repulsion sampling by only calculating the gradient with respect to a constant number of points. These repulsion techniques are, on their face, incompatible with one another, i.e., several modifications have to be made to each algorithm before one can interchange the repulsive force calculations.

There is a growing body of work that compares tSNE and UMAP through a more theoretical analysis [Damrich and Hamprecht, 2021; Bohm et al., 2020; Damrich et al., 2022; Kobak and Linderman, 2021]. [Damrich and Hamprecht, 2021] find that UMAP's algorithm does not optimize the presented loss and provide its effective loss function. Similarly, [Bohm et al., 2020] analyze tSNE and UMAP through their attractive and repulsive forces, discovering that UMAP diverges when using O(n) repulsions per epoch. We expand on the aforementioned findings by showing that the forces are solely determined by the choice of normalization, giving a practical treatment to the proposed ideas. [Damrich et al., 2022] provide the interesting realization that tSNE and UMAP can both be described through contrastive learning approaches. Our work differs from theirs in that we analyze the full space of parameters in the algorithms and distill the difference to a single factor, allowing us to connect the algorithms without the added layers of contrastive learning theory. The authors in [Kobak and Linderman, 2021] make the argument that tSNE can perform UMAP's manifold learning if given UMAP's initialization. Namely, tSNE randomly initializes the low-dimensional embedding whereas UMAP starts from a Laplacian Eigenmap [Belkin and Niyogi, 2003] projection. While this may help tSNE preserve the local kNN structure of the manifold, it is not true of the macro-level distribution of the embeddings. Lastly, [Wang et al., 2021] discusses the role that the loss function has on the resulting embedding structure. This is in line with our results, as we show that the normalization's effect on the loss function is fundamental to the output differences between tSNE and UMAP.

3 Comparison of tSNE and UMAP

We begin by formally introducing the tSNE and UMAP algorithms. Let X ∈ R^{n×D} be a high-dimensional dataset of n points and let Y ∈ R^{n×d} be a previously initialized set of n points in lower-dimensional space such that d < D. Our aim is to define similarity measures between the points in each space and then find the embedding Y such that the pairwise similarities in Y match those in X.

To do this, both algorithms define high- and low-dimensional non-linear functions p : X × X → [0, 1] and q : Y × Y → [0, 1]. These form pairwise similarity matrices P(X), Q(Y) ∈ R^{n×n}, where the (i, j)-th matrix entry represents the similarity between points i and j. Formally,

\[
p^{\mathrm{tsne}}_{j|i}(x_i, x_j) = \frac{\exp\!\big(-d(x_i, x_j)^2 / 2\sigma_i^2\big)}{\sum_{k \neq l} \exp\!\big(-d(x_k, x_l)^2 / 2\sigma_k^2\big)},
\qquad
q^{\mathrm{tsne}}_{ij}(y_i, y_j) = \frac{\big(1 + \|y_i - y_j\|_2^2\big)^{-1}}{\sum_{k \neq l} \big(1 + \|y_k - y_l\|_2^2\big)^{-1}}
\tag{1}
\]

\[
p^{\mathrm{umap}}_{j|i}(x_i, x_j) = \exp\!\big(\big(-d(x_i, x_j)^2 + \rho_i\big)/\tau_i\big),
\qquad
q^{\mathrm{umap}}_{ij}(y_i, y_j) = \big(1 + a\,(\|y_i - y_j\|_2^2)^b\big)^{-1},
\tag{2}
\]
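For concreteness, the following is a minimal NumPy sketch of the four similarity functions in Eqs. (1)–(2) exactly as written above. It is illustrative only: it builds dense n × n matrices and assumes the bandwidths σ_i, ρ_i, τ_i and the scalars a, b are already given, whereas the actual tSNE and UMAP implementations restrict these computations to nearest neighbors.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Dense matrix of squared Euclidean distances d(x_i, x_j)^2."""
    sq = (X ** 2).sum(axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def p_tsne(X, sigma):
    """Eq. (1), high dimension: Gaussian affinities normalized over all pairs."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)          # both methods zero the diagonal
    return P / P.sum()

def q_tsne(Y):
    """Eq. (1), low dimension: Student-t affinities normalized over all pairs."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def p_umap(X, rho, tau):
    """Eq. (2), high dimension: unnormalized exponential affinities."""
    P = np.exp((-pairwise_sq_dists(X) + rho[:, None]) / tau[:, None])
    np.fill_diagonal(P, 0.0)
    return P

def q_umap(Y, a=1.0, b=1.0):
    """Eq. (2), low dimension: unnormalized kernel; a = b = 1 recovers the
    Student-t numerator used by tSNE."""
    Q = 1.0 / (1.0 + a * pairwise_sq_dists(Y) ** b)
    np.fill_diagonal(Q, 0.0)
    return Q
```

The only structural difference between the two pairs of functions is the normalization over all pairs in the tSNE case — the toggle that the rest of the paper is about.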
[Figure 2: gradient magnitudes plotted against high-dimensional distance ("High dim distance"); only the axis ticks and labels were recovered, the image itself is not included.]
[...] to the point in question and are subsequently symmetrized. In essence, tSNE minimizes the KL divergence of the entire pairwise similarity matrix since its P and Q matrices sum to 1. UMAP instead defines Bernoulli probability distributions {p_ij, 1 − p_ij}, {q_ij, 1 − q_ij} and sums the KL divergences between the n² pairwise probability distributions².
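For reference, the two objectives just described can be written as below. This is a standard rendering that matches the description above (not a verbatim copy of the paper's own numbered loss equations, which are not reproduced here), with both sums running over pairs i ≠ j:

\[
\mathcal{L}^{\mathrm{tsne}} = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
\mathcal{L}^{\mathrm{umap}} = \sum_{i \neq j} \Big[\, p_{ij} \log \frac{p_{ij}}{q_{ij}} \;+\; (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \Big].
\]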
3.1 Gradient Calculations

We now describe and analyze the gradient descent approaches in tSNE and UMAP. First, notice that the gradients of each algorithm change substantially due to the differing normalizations. In tSNE, the gradient can be written as an attractive A^tsne_i and a repulsive R^tsne_i force acting on point y_i with

\[
\frac{\partial \mathcal{L}^{\mathrm{tsne}}}{\partial y_i}
= -4Z\Big(\sum_{j,\, j \neq i} p_{ij}\, q_{ij}\, (y_i - y_j) \;-\; \sum_{k,\, k \neq i} q_{ik}^2\, (y_i - y_k)\Big)
= 4Z\big(A^{\mathrm{tsne}}_i + R^{\mathrm{tsne}}_i\big)
\tag{5}
\]

where Z is the normalization term in q^tsne_ij. On the other hand, UMAP's attractions and repulsions³ are presented as [McInnes et al., 2018]

\[
A^{\mathrm{umap}}_i = \sum_{j,\, j \neq i} \frac{-2ab\,\|y_i - y_j\|_2^{2(b-1)}}{1 + \|y_i - y_j\|_2^2}\; p_{ij}\, (y_i - y_j)
\tag{6}
\]

\[
R^{\mathrm{umap}}_i = \sum_{k,\, k \neq i} \frac{2b}{\varepsilon + \|y_i - y_k\|_2^2}\; q_{ik}\, (1 - p_{ik})\, (y_i - y_k).
\tag{7}
\]

We remind the reader that we are overloading notation – p and q are normalized when they are in the tSNE setting and are unnormalized in the UMAP setting.

In practice, tSNE and UMAP optimize their loss functions by iteratively applying these attractive and repulsive forces. It is unnecessary to calculate each such force to effectively estimate the gradient, however, as the p_ij term in both the tSNE and UMAP attractive forces decays exponentially. Based on this observation, both methods establish a nearest-neighbor graph in the high-dimensional space, where the edges represent nearest-neighbor relationships between x_i and x_j. It then suffices to only perform attractions between points y_i and y_j if their corresponding x_i and x_j are nearest neighbors.

This logic does not transfer to the repulsions, however, as the Student-t distribution has a heavier tail so repulsions must be calculated evenly across the rest of the points. tSNE does this by fitting a Barnes-Hut tree across Y during every epoch. If y_k and y_l are both in the same tree leaf then we assume q_ik = q_il, allowing us to only calculate O(log(n)) similarities. Thus, tSNE estimates all n − 1 repulsions by performing one such estimate for each cell in Y's Barnes-Hut tree. UMAP, on the other hand, simply obtains repulsions by sampling a constant number of points uniformly and only applying those repulsions. These repulsion schemas are depicted in Figure 3. Note, tSNE collects all of the gradients before a full momentum gradient descent step whereas UMAP moves each point immediately upon calculating a force.

¹ In practice, we can assume that 2σ_i² is functionally equivalent to τ_i, as they are both chosen such that the entropy of the resulting distribution is equivalent.
² Both tSNE and UMAP set the diagonals of P and Q to 0.
³ The ε value is only inserted for numerical stability.
⁴ We derive this in Section A.2 of the supplementary material.
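To make the preceding discussion concrete, the sketch below estimates the two force terms acting on a single point y_i: attractions only along high-dimensional nearest neighbors, repulsions against a handful of uniformly sampled points, and either the normalized terms of Eq. (5) or the unnormalized terms of Eqs. (6)–(7). It is a didactic simplification under stated assumptions — uniform sampling stands in for the Barnes-Hut estimate in the normalized case, the constants are illustrative, and the vectors follow the sign convention of A_i and R_i above (attraction points toward the neighbors, repulsion away).

```python
import numpy as np

def forces_on_point(i, Y, P, neighbors, normalized, rng,
                    a=1.0, b=1.0, eps=1e-3, n_neg=5):
    """Estimated attractive and repulsive forces acting on y_i.

    P          : high-dim similarities (normalized in the tSNE setting,
                 unnormalized in the UMAP setting -- notation is overloaded
                 exactly as in the text).
    neighbors  : indices j whose x_j are nearest neighbors of x_i; these are
                 the only attractions ever applied.
    normalized : True  -> the A_i, R_i terms of Eq. (5),
                 False -> the A_i, R_i terms of Eqs. (6)-(7).
    n_neg      : number of uniformly sampled repulsions (UMAP keeps this a
                 small constant; tSNE instead estimates all n - 1 of them,
                 emulated here by rescaling the sampled sum).
    """
    n = Y.shape[0]
    diff = Y[i] - Y                           # y_i - y_k for every k
    d2 = (diff ** 2).sum(axis=1)              # ||y_i - y_k||^2
    kernel = 1.0 / (1.0 + a * d2 ** b)        # low-dim similarity (Student-t if a = b = 1)
    samples = rng.choice(np.delete(np.arange(n), i), size=n_neg, replace=False)

    if normalized:                            # Eq. (5): q_ik = kernel_ik / Z
        Z = kernel.sum() - kernel[i]
        q = kernel / Z
        attr = sum(-P[i, j] * q[j] * diff[j] for j in neighbors)
        rep = (n - 1) / n_neg * sum(q[k] ** 2 * diff[k] for k in samples)
    else:                                     # Eqs. (6)-(7), applied to the samples only
        attr = sum(-2.0 * a * b * d2[j] ** (b - 1) / (1.0 + d2[j])
                   * P[i, j] * diff[j] for j in neighbors)
        rep = sum(2.0 * b / (eps + d2[k]) * kernel[k] * (1.0 - P[i, k]) * diff[k]
                  for k in samples)
    return attr, rep
```

Note that the p_ij and 1 − p_ik weights are multiplied in directly here rather than realized through how often an edge is sampled; this is the same choice that Section 4 adopts for GDR.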
There are a few differences between the two algorithms' gradient descent loops. First, the tSNE learning rate stays constant over training while UMAP's linearly decreases. Second, tSNE's gradients are strengthened by adding a "gains" term which scales gradients based on whether they point in the same direction from epoch to epoch⁵. We refer to these [...]

⁵ This term has not been mentioned in the literature but is present in common tSNE implementations.
⁶ When possible, we use index k to represent repulsions and j to represent attractions to highlight that p_ik is never calculated in UMAP. See Section A.6 in the supplementary material for details.
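Footnote 5 refers to a heuristic found in common reference implementations of tSNE; a sketch of it is shown below. The constants 0.2, 0.8, 0.01, and the learning-rate and momentum values are the conventional choices in typical code rather than anything mandated by the method, so treat them as assumptions.

```python
import numpy as np

def gains_step(Y, grad, velocity, gains, lr=200.0, momentum=0.8):
    """One tSNE-style update with per-coordinate 'gains'.

    `velocity` stores the previous (already negated) step, so equal signs of
    gradient and velocity mean the gradient has flipped direction since the
    last epoch: those coordinates are damped, the consistent ones amplified.
    """
    flipped = np.sign(grad) == np.sign(velocity)
    gains = np.where(flipped, gains * 0.8, gains + 0.2)
    gains = np.maximum(gains, 0.01)
    velocity = momentum * velocity - lr * gains * grad
    return Y + velocity, velocity, gains
```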
[...] directly changes the ratio of attraction/repulsion magnitudes, inducing the difference between tSNE and UMAP embeddings. Thus, we can toggle the normalization to alternate between their outputs. Furthermore, Theorem 2 shows that the attraction/repulsion ratio in the normalized setting is independent of the number of repulsive samples collected. This second point allows us to accelerate tSNE to UMAP speeds without impacting embedding quality by simply removing the dependency on Barnes-Hut trees and calculating 1 per-point repulsion as in UMAP. We now provide the necessary definitions for the theorems.

Assume that the p_ij terms are given. We now consider the dataset Y probabilistically by defining a set of random variables v_ij = y_i − y_j and assume that all O(n²) v_ij vectors are i.i.d. around a non-zero mean. Let r_ij = (1 + |v_ij|²)^{−1} and define Z = Σ^{n²}_{i,j} r_ij as the sum over n² pairs of points and Z̃ = Σ^{n}_{i,j} r_ij as the sum over n pairs of points. Then applying n per-point repulsions gives us the force acting on point y_i of E[|R^tsne|] = E[Σ^{n}_{j} ‖(r²_ij / Z²) · v_ij‖]. We now define an equivalent force term in the setting where we have 1 per-point repulsion: E[|R̃^tsne|] = E[‖(r²_ij / Z̃²) · v_ij‖]. Note that we have a constant number c of attractive forces acting on each point, giving E[|A^tsne|] = c · p^tsne_ij · E[‖(r_ij / Z) · v_ij‖] and E[|Ã^tsne|] = c · p^tsne_ij · E[‖(r_ij / Z̃) · v_ij‖].

Thus, |A^tsne| and |R^tsne| represent the magnitudes of the forces when we calculate tSNE's O(n) per-point repulsions while |Ã^tsne| and |R̃^tsne| represent the forces when we have UMAP's O(1) per-point repulsions. Given this, we have the following theorems:

Theorem 1. Let p^tsne_ij ∼ 1/(cn) and d(x_i, x_j) > √(log(n² + 1) τ). Then
\[
\frac{E[|A^{\mathrm{umap}}_i|]}{E[|R^{\mathrm{umap}}_i|]} < \frac{E\big[|\tilde{A}^{\mathrm{tsne}}_i|\big]}{E\big[|\tilde{R}^{\mathrm{tsne}}_i|\big]}.
\]

[Figure 4: Average angle in radians between repulsive forces calculated with O(1) and O(n) repulsions. The red line is at 0 radians. Panels: MNIST, Coil-100, and Swiss Roll (angle in radians over epochs) for tSNE and UMAP; image not included.]

Theorem 2.
\[
\frac{E[|A^{\mathrm{tsne}}_i|]}{E[|R^{\mathrm{tsne}}_i|]} = \frac{E\big[|\tilde{A}^{\mathrm{tsne}}_i|\big]}{E\big[|\tilde{R}^{\mathrm{tsne}}_i|\big]}.
\]

The proofs are given in Sections A.3 and A.4 of the supplementary material. We point out that p^tsne_ij is normalized over the sum of all cn attractions that are sampled, giving us the estimate p^tsne_ij ∼ 1/(cn). Theorem 1's result is visualized in the gradient plots in Figure 2. There we see that, for non-negligible values of d(x_i, x_j), the UMAP repulsions can be orders of magnitude larger than the corresponding tSNE ones, even when accounting for the magnitude of the attractions. Furthermore, Section 5 evidences that toggling the normalization is sufficient to switch between the algorithms' embeddings and that no other hyperparameter accounts for the difference in inter-cluster distances between tSNE and UMAP.

4 Unifying tSNE and UMAP

This leads us to GDR – a modification to UMAP that can recreate both tSNE and UMAP embeddings at UMAP speeds. We choose the general name Gradient Dimensionality Reduction to imply that it is both UMAP and tSNE.

Our algorithm follows the UMAP optimization procedure except that we (1) replace the scalar sampling by iteratively processing attractions/repulsions and (2) apply the gradients after having collected all of them, rather than immediately upon processing each one. The first change accommodates the gradients under normalization since the normalized repulsive forces do not have the 1 − p_ik term to which UMAP samples proportionally. The second change allows for performing momentum gradient descent for faster convergence in the normalized setting.
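A toy version of the resulting loop is sketched below. Here `force_fn(i, Y, P, neighbors, rng)` returns the pair of per-point force vectors — for instance a thin wrapper around the `forces_on_point` sketch from Section 3.1 with `normalized` fixed. The two changes above appear directly in the code: gradient scalars are applied inside the force computation rather than through sampling frequency, and every point's gradient is collected before a single momentum step. This is an illustration of the procedure, not the released GDR implementation (whose learning-rate schedule, amplification, and data structures differ).

```python
import numpy as np

def gdr_optimize(Y, P, neighbor_list, force_fn, n_epochs=200,
                 lr=1.0, momentum=0.9, seed=0):
    """Toy GDR-style optimization loop.

    force_fn(i, Y, P, neighbors, rng) -> (attr, rep), with the attraction
    pointing toward the neighbors and the repulsion pointing away from the
    sampled points.
    """
    rng = np.random.default_rng(seed)
    Y = Y.astype(float).copy()
    velocity = np.zeros_like(Y)
    for epoch in range(n_epochs):
        step = np.zeros_like(Y)
        for i in range(Y.shape[0]):
            attr, rep = force_fn(i, Y, P, neighbor_list[i], rng)
            step[i] = attr + rep                 # change (2): collect everything first ...
        eta = lr * (1.0 - epoch / n_epochs)      # UMAP-style linearly decaying rate
        velocity = momentum * velocity + eta * step
        Y += velocity                            # ... then take one momentum step
    return Y
```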
Table 1: List of differences between hyperparameters of tSNE and UMAP. These are analyzed in Figures 3 and 4.
Table 2: Effect of changing the normalization for the original tSNE and UMAP algorithms on the MNIST, Fashion-MNIST, and Swiss Roll
datasets. Each dataset is shown with normalization followed by no normalization. We use Laplacian Eigenmap initializations for consistent
orientation. The normalized UMAP plots were made with the changes described in section 5.2.
Since we follow the UMAP optimization procedure, GDR defaults to producing UMAP embeddings. In the case of replicating tSNE, we simply normalize the P and Q matrices and scale the learning rate. Although we only collect O(1) attractions and repulsions for each point, their magnitudes are balanced due to Theorems 1 and 2. We refer to GDR as GDRumap if it is in the unnormalized setting and as GDRtsne if it is in the normalized setting. We note that changing the normalization necessitates gradient amplification.

By allowing GDR to toggle the normalization, we are free to choose the simplest options across the other parameters. GDR therefore defaults to tSNE's asymmetric attraction and a and b scalars along with UMAP's distance metric, initialization, nearest neighbors, and p_ij symmetrization.
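To make these defaults and the single meaningful toggle explicit, they can be bundled as in the illustrative configuration below. All field names and default values here are hypothetical placeholders chosen to mirror the description in the text; they are not the released library's API.

```python
from dataclasses import dataclass

@dataclass
class GDRConfig:
    """Illustrative bundle of the toggles discussed in this section."""
    normalized: bool = False            # the one switch that matters:
                                        #   False -> GDRumap, True -> GDRtsne
    amplify_grads: bool = False         # required alongside normalized=True
    sym_attraction: bool = False        # tSNE default: asymmetric attraction
    a: float = 1.0                      # tSNE default a, b scalars
    b: float = 1.0
    random_init: bool = False           # UMAP default: Laplacian Eigenmap init
    n_neighbors: int = 15               # UMAP-style kNN graph (value hypothetical)
    umap_pseudo_distance: bool = True   # UMAP default distance metric
    umap_symmetrization: bool = True    # UMAP default p_ij symmetrization

gdr_umap = GDRConfig()
gdr_tsne = GDRConfig(normalized=True, amplify_grads=True)
```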
The supplementary material provides some further information on the flexibility of GDR (A.1), such as an accelerated version of the algorithm where we modify the gradient formulation such that it is quicker to optimize. This change induces a consistent 2× speedup of GDR over UMAP. Despite differing from the true KL divergence gradient, we find that the resulting embeddings are comparable. Our repository also provides a CUDA kernel that calculates GDRumap and GDRtsne embeddings in a distributed manner on a GPU.

4.1 Theoretical Considerations

UMAP's theoretical framework identifies the existence of a locally-connected manifold in the high-dimensional space under the UMAP pseudo-distance metric d̃. This pseudo-distance metric is defined such that the distance from point x_j to x_i is equal to d̃(x_i, x_j) = d(x_i, x_j) − min_{l≠i} d(x_i, x_l). Despite this being a key element of the UMAP foundation, we find that substituting the Euclidean distance for the pseudo-distance metric seems to have no effect on the embeddings, as seen in Tables 3 and 4. It is possible that the algorithm's reliance on highly non-convex gradient descent deviates enough from the theoretical discussion that the pseudo-distance metric loses its applicability. It may also be the case that this pseudo-distance metric, while insightful from a theoretical perspective, is not a necessary calculation in order to achieve the final embeddings.

Furthermore, many of the other differences between tSNE and UMAP are not motivated by the theoretical foundation of either algorithm. The gradient descent methodology is entirely heuristic, so any differences therein do not impact the theory. This applies to the repulsion and attraction sampling and gradient descent methods. Moreover, the high-dimensional symmetrization function, embedding initialization, symmetric attraction, and a, b scalars can all be switched to their alternative options without impacting either method's consistency within its theoretical presentation. Thus, each of these heuristics can be toggled without impacting the embedding's interpretation, as most of them do not interfere with the theory and none affect the output.

We also question whether the choice of normalization is necessitated by either algorithm's presentation. tSNE, for example, treats the normalization of P and Q as an assumption and provides no further justification. In the case of UMAP, it appears that the normalization does not break the assumptions of the original paper [McInnes et al., 2018, Sec. 2,3]. We therefore posit that the interpretation of UMAP as finding the best fit to the high-dimensional data manifold extends to tSNE as well, as long as tSNE's gradients are calculated under the pseudo-distance metric in the high-dimensional space. We additionally theorize that each method can be paired with either normalization without contradicting the foundations [...]
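As a concrete reading of the pseudo-distance d̃ from Section 4.1, the sketch below applies the definition row-wise to a dense distance matrix. It is illustrative only — in practice the quantity is only ever needed for nearest-neighbor candidates rather than for all pairs.

```python
import numpy as np

def umap_pseudo_distance(D):
    """d~(x_i, x_j) = d(x_i, x_j) - min_{l != i} d(x_i, x_l), applied per row.

    D is an n x n matrix of pairwise distances. The result is asymmetric
    because every row subtracts that point's own nearest-neighbor distance;
    diagonal entries are not meaningful and are left as-is.
    """
    D = np.asarray(D, dtype=float)
    masked = D + np.diag(np.full(len(D), np.inf))   # exclude d(x_i, x_i)
    rho = masked.min(axis=1)                        # min_{l != i} d(x_i, x_l)
    return D - rho[:, None]
```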
Table 3 (caption and remaining rows were not recovered): kNN accuracy and K-Means V-score on Fashion-MNIST, Coil-100, Single Cell, and Cifar-10.

                   | Fashion-MNIST | Coil-100  | Single Cell | Cifar-10
kNN Acc.  UMAP     | 78.0; 0.5     | 80.8; 3.3 | 43.4; 1.9   | 24.2; 1.1
          GDRumap  | 77.3; 0.7     | 77.4; 3.4 | 42.8; 2.2   | 23.8; 1.1
          tSNE     | 80.1; 0.7     | 63.2; 4.2 | 43.3; 1.9   | 28.7; 2.5
          GDRtsne  | 78.6; 0.6     | 77.2; 4.4 | 44.8; 1.4   | 25.6; 1.1
V-score   UMAP     | 60.3; 1.4     | 89.2; 0.9 | 60.6; 1.3   |  7.6; 0.4

[...] studying macro-structures loses local information. To account for this, we employ separate metrics to study the embeddings at the micro- and macro-scales. Specifically, we use the kNN accuracy to analyze preservation of local neighborhoods as established in [Van Der Maaten et al., 2009] and the V-measure [Rosenberg and Hirschberg, 2007] to study the embedding's global cluster structures⁷.
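Both measures are available off the shelf; a sketch of the evaluation protocol just described, assuming an embedding Y, ground-truth labels, and scikit-learn, could look as follows (the particular k, split, and random seeds are unspecified choices here, not values taken from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(Y, labels, k=10, seed=0):
    """kNN classification accuracy in the embedding (local neighborhoods)."""
    Y_tr, Y_te, l_tr, l_te = train_test_split(Y, labels, random_state=seed)
    return KNeighborsClassifier(n_neighbors=k).fit(Y_tr, l_tr).score(Y_te, l_te)

def kmeans_v_score(Y, labels, seed=0):
    """V-measure of a K-Means clustering of the embedding (global clusters)."""
    pred = KMeans(n_clusters=len(np.unique(labels)), n_init=10,
                  random_state=seed).fit_predict(Y)
    return v_measure_score(labels, pred)
```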
5.1 Hyperparameter Effects

[...]
          | Default setting | Random init | Pseudo distance | Symmetrization | Sym attraction | a, b scalars
tSNE      | 95.1; 70.9      | 95.2; 70.7  | 96.0; 73.9      | 94.9; 70.8     | 94.8; 80.7     | 95.1; 73.2
GDRtsne   | 96.1; 67.8      | 95.6; 61.3  | 96.1; 63.0      | 96.1; 68.4     | 96.3; 72.7     | 96.1; 68.8
UMAP      | 95.4; 82.5      | 96.6; 84.6  | 94.4; 82.2      | 96.7; 82.5     | 96.6; 83.5     | 96.5; 82.2
GDRumap   | 96.2; 84.0      | 96.4; 82.1  | 96.7; 85.2      | 96.6; 85.1     | 96.5; 83.3     | 95.8; 81.2

Table 4: Effect of the algorithm settings from Table 1 on the MNIST dataset. Each parameter is changed from its default to its alternative setting; e.g., the random init column implies that tSNE was initialized with Laplacian Eigenmaps while UMAP and GDR were initialized randomly. Below each image the kNN accuracy and K-Means V-score show unchanged performance (the embedding images are not included; each entry above lists the kNN accuracy followed by the V-score).
[...] UMAP ones as we are still estimating O(n) repulsions, causing the embedding to fall closer to the mean of the multi-modal datasets. To account for the n× more repulsions, we scale each repulsion by 1/n for the sake of convergence. This is a different effect than normalizing by Σ_{ij} p_ij as we are not affecting the attraction/repulsion ratio in Theorem 1.

The analysis is slightly more involved in the case of UMAP. Recall that the UMAP algorithm approximates the p_ij and 1 − p_ik gradient scalars by sampling the attractions and repulsions proportionally to p_ij and 1 − p_ik, which we referred to as scalar sampling. However, the gradients in the normalized setting (Equation 5) lose the 1 − p_ik scalar on repulsions. The UMAP optimization schema, then, imposes an unnecessary weight on the repulsions in the normalized setting, as the repulsions are still sampled according to the no-longer-necessary 1 − p_ik scalar. Accounting for this requires dividing the repulsive forces by 1 − p_ik, but this (with the momentum gradient descent and stronger learning rate) leads to a highly unstable training regime. We refer the reader to Figure 7 in the supplementary material for details.

This implies that stabilizing UMAP in the normalized setting requires removing the sampling and instead directly multiplying by p_ij and 1 − p_ik. Indeed, this is exactly what we do in GDR. Under this change, GDRumap and GDRtsne obtain effectively identical embeddings to the default UMAP and tSNE ones. This is confirmed in the kNN accuracy and K-Means V-score metrics in Table 3.

5.3 Time Efficiency

We lastly discuss the speeds of UMAP, tSNE, GDR, and our accelerated version of GDR in Section A.1 of the supplementary material due to space concerns. Our implementations of UMAP and GDR perform gradient descent an order of magnitude faster than the standard UMAP library, implying a corresponding speedup over tSNE. We also provide an acceleration by doing GDR with scalar sampling that provides a further 2× speedup. Despite the fact that this imposes a slight modification onto the effective gradients, we show that this is qualitatively insignificant in the resulting embeddings.

6 Conclusion & Future Work

We discussed the set of differences between tSNE and UMAP and identified that only the normalization significantly impacts the outputs. This provides a clear unification of tSNE and UMAP that is both theoretically simple and easy to implement. Beyond this, our analysis has uncovered multiple misunderstandings regarding UMAP and tSNE while hopefully also clarifying how these methods work.

We raised several questions regarding the theory of gradient-based DR algorithms. Is there a setting in which the UMAP pseudo-distance changes the embeddings? Does the KL divergence induce a better optimization criterion than the Frobenius norm? Is it true that UMAP's framework can accommodate tSNE's normalization? We hope that we have facilitated future research into the essence of these algorithms through identifying all of their algorithmic components and consolidating them in a simple-to-use codebase.