
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23)

ActUp: Analyzing and Consolidating tSNE & UMAP

Andrew Draganov¹, Jakob Jørgensen¹, Katrine Scheel¹, Davide Mottin¹, Ira Assent¹, Tyrus Berry², Cigdem Aslay¹
¹Aarhus University
²George Mason University
{draganovandrew, jakobrj, scheel, davide, ira, cigdem}@cs.au.dk, [email protected]

Abstract

tSNE and UMAP are popular dimensionality reduction algorithms due to their speed and interpretable low-dimensional embeddings. Despite their popularity, however, little work has been done to study their full span of differences. We theoretically and experimentally evaluate the space of parameters in both tSNE and UMAP and observe that a single one – the normalization – is responsible for switching between them. This, in turn, implies that a majority of the algorithmic differences can be toggled without affecting the embeddings. We discuss the implications this has on several theoretic claims behind UMAP, as well as how to reconcile them with existing tSNE interpretations. Based on our analysis, we provide a method (GDR) that combines previously incompatible techniques from tSNE and UMAP and can replicate the results of either algorithm. This allows our method to incorporate further improvements, such as an acceleration that obtains either method's outputs faster than UMAP. We release improved versions of tSNE, UMAP, and GDR that are fully plug-and-play with the traditional libraries.

[Figure 1: A single method (GDR) can recreate tSNE and UMAP outputs just by changing the normalization. Rows: tSNE, GDRtsne, UMAP, GDRumap; columns: MNIST, Fashion-MNIST, Coil-100.]
1 Introduction

Dimensionality Reduction (DR) algorithms are invaluable for qualitatively inspecting high-dimensional data and are widely used across scientific disciplines. Broadly speaking, these algorithms transform a high-dimensional input into a faithful lower-dimensional embedding. This embedding aims to preserve similarities among the points, where similarity is often measured by distances in the corresponding spaces.

tSNE [Van der Maaten and Hinton, 2008; Van Der Maaten, 2014] and UMAP [McInnes et al., 2018] are two widely popular DR algorithms due to their efficiency and interpretable embeddings. Both algorithms establish analogous similarity measures, share comparable loss functions, and find an embedding through gradient descent. Despite these similarities, tSNE and UMAP have several key differences. First, although both methods obtain similar results, UMAP prefers large inter-cluster distances while tSNE leans towards large intra-cluster distances. Second, UMAP runs significantly faster as it performs efficient sampling during gradient descent. While attempts have been made to study the gaps between the algorithms [Kobak and Linderman, 2021; Bohm et al., 2020; Damrich and Hamprecht, 2021], there has not yet been a comprehensive analysis of their methodologies nor a method that can obtain both tSNE and UMAP embeddings at UMAP speeds.

We believe that this is partly due to their radically different presentations. While tSNE takes a computational angle, UMAP originates from category theory and topology. Despite this, many algorithmic choices in UMAP and tSNE are presented without theoretical justification, making it difficult to know which algorithmic components are necessary.

In this paper, we make the surprising discovery that the differences in both the embedding structure and computational complexity between tSNE and UMAP can be resolved via a single algorithmic choice – the normalization factor. We come to this conclusion by deriving both algorithms from first principles and theoretically showing the effect that the normalization has on the gradient structure. We supplement this by identifying every implementation and hyperparameter difference between the two methods and implementing tSNE and UMAP in a common library. Thus, we study the effect that each choice has on the embeddings and show both quantitatively and qualitatively that, other than the normalization of the pairwise similarity matrices, none of these parameters significantly affect the outputs.


Based on this analysis, we introduce the necessary changes to the UMAP algorithm such that it can produce tSNE embeddings as well. We refer to this algorithm as Gradient Dimensionality Reduction (GDR) to emphasize that it is consistent with the presentations of both tSNE and UMAP. We experimentally validate that GDR can simulate both methods through a thorough quantitative and qualitative evaluation across many datasets and settings. Lastly, our analysis provides insights for further speed improvements and allows GDR to perform gradient descent faster than the standard implementation of UMAP.

In summary, our contributions are as follows:

1. We perform the first comprehensive analysis of the differences between tSNE and UMAP, showing the effect of each algorithmic choice on the embeddings.

2. We theoretically and experimentally show that changing the normalization is a sufficient condition for switching between the two methods.

3. We release simple, plug-and-play implementations of GDR, tSNE and UMAP that can toggle all of the identified hyperparameters. Furthermore, GDR obtains embeddings for both algorithms faster than UMAP.

2 Related Work

When discussing tSNE we are referring to [Van Der Maaten, 2014], which established the nearest neighbor and sampling improvements and is generally accepted as the standard tSNE method. A popular subsequent development was presented in [Linderman et al., 2019], wherein Fast Fourier Transforms were used to accelerate the comparisons between points. Another approach is LargeVis [Tang et al., 2016], which modifies the embedding functions to satisfy a graph-based Bernoulli probabilistic model of the low-dimensional dataset. As the more recent algorithm, UMAP has not had as many variations yet. One promising direction, however, has extended UMAP's second step as a parametric optimization on neural network weights [Sainburg et al., 2020].

Many of these approaches utilize the same optimization structure where they iteratively attract and repel points. While most perform their attractions along nearest neighbors in the high-dimensional space, the repulsions are the slowest operation and each method approaches them differently. tSNE samples repulsions by utilizing Barnes-Hut (BH) trees to sum the forces over distant points. The work in [Linderman et al., 2019] instead calculates repulsive forces with respect to specifically chosen interpolation points, cutting down on the O(n log n) BH tree computations. UMAP and LargeVis, on the other hand, simplify the repulsion sampling by only calculating the gradient with respect to a constant number of points. These repulsion techniques are, on their face, incompatible with one another, i.e., several modifications have to be made to each algorithm before one can interchange the repulsive force calculations.

There is a growing amount of work that compares tSNE and UMAP through a more theoretical analysis [Damrich and Hamprecht, 2021; Bohm et al., 2020; Damrich et al., 2022; Kobak and Linderman, 2021]. [Damrich and Hamprecht, 2021] find that UMAP's algorithm does not optimize the presented loss and provide its effective loss function. Similarly, [Bohm et al., 2020] analyze tSNE and UMAP through their attractive and repulsive forces, discovering that UMAP diverges when using O(n) repulsions per epoch. We expand on the aforementioned findings by showing that the forces are solely determined by the choice of normalization, giving a practical treatment to the proposed ideas. [Damrich et al., 2022] provide the interesting realization that tSNE and UMAP can both be described through contrastive learning approaches. Our work differs from theirs in that we analyze the full space of parameters in the algorithms and distill the difference to a single factor, allowing us to connect the algorithms without the added layers of contrastive learning theory. The authors in [Kobak and Linderman, 2021] make the argument that tSNE can perform UMAP's manifold learning if given UMAP's initialization. Namely, tSNE randomly initializes the low-dimensional embedding whereas UMAP starts from a Laplacian Eigenmap [Belkin and Niyogi, 2003] projection. While this may help tSNE preserve the local kNN structure of the manifold, it is not true of the macro-level distribution of the embeddings. Lastly, [Wang et al., 2021] discuss the role that the loss function has on the resulting embedding structure. This is in line with our results, as we show that the normalization's effect on the loss function is fundamental in the output differences between tSNE and UMAP.

3 Comparison of tSNE and UMAP

We begin by formally introducing the tSNE and UMAP algorithms. Let X ∈ R^{n×D} be a high-dimensional dataset of n points and let Y ∈ R^{n×d} be a previously initialized set of n points in lower-dimensional space such that d < D. Our aim is to define similarity measures between the points in each space and then find the embedding Y such that the pairwise similarities in Y match those in X.

To do this, both algorithms define high- and low-dimensional non-linear functions p : X × X → [0, 1] and q : Y × Y → [0, 1]. These form pairwise similarity matrices P(X), Q(Y) ∈ R^{n×n}, where the (i, j)-th matrix entry represents the similarity between points i and j. Formally,

    p^{tsne}_{j|i}(x_i, x_j) = \frac{\exp(-d(x_i, x_j)^2 / 2\sigma_i^2)}{\sum_{k \neq l} \exp(-d(x_k, x_l)^2 / 2\sigma_k^2)},  \quad  q^{tsne}_{ij}(y_i, y_j) = \frac{(1 + \|y_i - y_j\|_2^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|_2^2)^{-1}}    (1)

    p^{umap}_{j|i}(x_i, x_j) = \exp\big((-d(x_i, x_j)^2 + \rho_i) / \tau_i\big),  \quad  q^{umap}_{ij}(y_i, y_j) = \big(1 + a(\|y_i - y_j\|_2^2)^b\big)^{-1},    (2)

where d(x_i, x_j) is the high-dimensional distance function, σ and τ are point-specific variance scalars[1], ρ_i = min_{l≠i} d(x_i, x_l), and a and b are constants. Note that the tSNE denominators in Equation 1 are the sums of all the numerators. We thus refer to tSNE's similarity functions as being normalized while UMAP's are unnormalized.


The high-dimensional p values are defined with respect to the point in question and are subsequently symmetrized. WLOG, let p_{ij} = S(p_{j|i}, p_{i|j}) for some symmetrization function S. Going forward, we write p_{ij} and q_{ij} without the superscripts when the normalization setting is clear from the context.

Given these pairwise similarities in the high- and low-dimensional spaces, tSNE and UMAP attempt to find the embedding Y such that Q(Y) is closest to P(X). Since both similarity measures carry a probabilistic interpretation, we find an embedding by minimizing the KL divergence KL(P ∥ Q). This gives us:

    L_{tsne} = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}    (3)

    L_{umap} = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}}    (4)

In essence, tSNE minimizes the KL divergence of the entire pairwise similarity matrix since its P and Q matrices sum to 1. UMAP instead defines Bernoulli probability distributions {p_{ij}, 1 - p_{ij}}, {q_{ij}, 1 - q_{ij}} and sums the KL divergences between the n^2 pairwise probability distributions[2].
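To make the distinction concrete, the following minimal sketch (our own numpy illustration, not the authors' released code; it assumes a = b = 1, a precomputed P, and ignores the bandwidth calibration) computes the low-dimensional similarities in either setting and evaluates the two objectives from Equations 3 and 4:

```python
import numpy as np

def low_dim_similarities(Y, normalized):
    """Pairwise q_ij for an embedding Y (n x d) with the a = b = 1 Student-t kernel."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(q, 0.0)        # both methods zero out the diagonal
    if normalized:                  # tSNE: divide by the sum of all numerators
        q = q / q.sum()
    return q

def loss_tsne(P, Q, eps=1e-12):
    """Equation 3: KL divergence between the full normalized similarity matrices."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

def loss_umap(P, Q, eps=1e-12):
    """Equation 4: sum of KL divergences between n^2 Bernoulli distributions."""
    P = np.clip(P, eps, 1 - eps)
    Q = np.clip(Q, eps, 1 - eps)
    return np.sum(P * np.log(P / Q) + (1 - P) * np.log((1 - P) / (1 - Q)))
```

With normalized=True the matrix Q sums to 1 and Equation 3 applies; with normalized=False each q_ij acts as an independent Bernoulli parameter and Equation 4 applies.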
3.1 Gradient Calculations

We now describe and analyze the gradient descent approaches in tSNE and UMAP. First, notice that the gradients of each algorithm change substantially due to the differing normalizations. In tSNE, the gradient can be written as an attractive A^{tsne}_i and a repulsive R^{tsne}_i force acting on point y_i with

    \frac{\partial L_{tsne}}{\partial y_i} = -4Z \Big( \sum_{j, j \neq i} p_{ij} q_{ij} (y_i - y_j) - \sum_{k, k \neq i} q_{ik}^2 (y_i - y_k) \Big) = 4Z (A^{tsne}_i + R^{tsne}_i)    (5)

where Z is the normalization term in q^{tsne}_{ij}. On the other hand, UMAP's attractions and repulsions[3] are presented as [McInnes et al., 2018]

    A^{umap}_i = \sum_{j, j \neq i} \frac{-2ab \|y_i - y_j\|_2^{2(b-1)}}{1 + \|y_i - y_j\|_2^2} p_{ij} (y_i - y_j)    (6)

    R^{umap}_i = \sum_{k, k \neq i} \frac{2b}{\varepsilon + \|y_i - y_k\|_2^2} q_{ik} (1 - p_{ik}) (y_i - y_k).    (7)

[1] In practice, we can assume that 2σ_i^2 is functionally equivalent to τ_i, as they are both chosen such that the entropy of the resulting distribution is equivalent.
[2] Both tSNE and UMAP set the diagonals of P and Q to 0.
[3] The ε value is only inserted for numerical stability.

[Figure 2: Gradient relationships between high- and low-dimensional distances for tSNE, UMAP, and UMAP under the Frobenius norm. The dotted line represents the locations of magnitude-0 gradients. Higher values correspond to attractions while lower values correspond to repulsions. The left image is a recreation of the original gradient plot in [Van der Maaten and Hinton, 2008]. Panels: tSNE, UMAP, Frob-UMAP; axes: high-dimensional distance vs. low-dimensional distance.]

In the setting where a = b = 1 and ε = 0, Equations 6, 7 can be written as[4]

    A^{umap}_i = -2 \sum_{j, j \neq i} p_{ij} q_{ij} (y_i - y_j),  \quad  R^{umap}_i = 2 \sum_{k, k \neq i} \frac{q_{ik}^2 (1 - p_{ik})}{1 - q_{ik}} (y_i - y_k)    (8)

[4] We derive this in Section A.2 in the supplementary material.

We remind the reader that we are overloading notation – p and q are normalized when they are in the tSNE setting and are unnormalized in the UMAP setting.
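For intuition only, the per-point attractive and repulsive terms in the a = b = 1 setting can be written directly in numpy as below. This is our own unoptimized sketch of Equations 5 and 8 (both methods avoid these dense O(n^2) computations in practice), and the function name is hypothetical:

```python
import numpy as np

def per_point_forces(Y, P, normalized):
    """Attractive/repulsive terms acting on each y_i (Eq. 5 if normalized, Eq. 8 otherwise)."""
    diffs = Y[:, None, :] - Y[None, :, :]                    # diffs[i, j] = y_i - y_j
    q = 1.0 / (1.0 + np.sum(diffs ** 2, axis=-1))            # Student-t kernel, a = b = 1
    np.fill_diagonal(q, 0.0)
    if normalized:                                           # tSNE: q is normalized by Z
        Z = q.sum()
        qn = q / Z
        attr = -np.einsum('ij,ij,ijk->ik', P, qn, diffs)     # -sum_j p_ij q_ij (y_i - y_j)
        rep = np.einsum('ij,ijk->ik', qn ** 2, diffs)        #  sum_k q_ik^2 (y_i - y_k)
        return 4 * Z * attr, 4 * Z * rep                     # scaled as in Equation 5
    attr = -2 * np.einsum('ij,ij,ijk->ik', P, q, diffs)      # Equation 8 attraction
    rep = 2 * np.einsum('ij,ijk->ik', q ** 2 * (1 - P) / (1 - q + 1e-12), diffs)  # Eq. 8 repulsion
    return attr, rep
```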
In practice, tSNE and UMAP optimize their loss functions by iteratively applying these attractive and repulsive forces. It is unnecessary to calculate each such force to effectively estimate the gradient, however, as the p_{ij} term in both the tSNE and UMAP attractive forces decays exponentially. Based on this observation, both methods establish a nearest neighbor graph in the high-dimensional space, where the edges represent nearest neighbor relationships between x_i and x_j. It then suffices to only perform attractions between points y_i and y_j if their corresponding x_i and x_j are nearest neighbors.

This logic does not transfer to the repulsions, however, as the Student-t distribution has a heavier tail, so repulsions must be calculated evenly across the rest of the points. tSNE does this by fitting a Barnes-Hut tree across Y during every epoch. If y_k and y_l are both in the same tree leaf then we assume q_{ik} = q_{il}, allowing us to only calculate O(log(n)) similarities. Thus, tSNE estimates all n - 1 repulsions by performing one such estimate for each cell in Y's Barnes-Hut tree. UMAP, on the other hand, simply obtains repulsions by sampling a constant number of points uniformly and only applying those repulsions. These repulsion schemas are depicted in Figure 3. Note that tSNE collects all of the gradients before a full momentum gradient descent step, whereas UMAP moves each point immediately upon calculating a force.
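A rough sketch of the UMAP-style schema (our own simplified illustration using the kernel from Equation 8 without the (1 - p_ik)/(1 - q_ik) factor; not the library's implementation) estimates the repulsion on a single point from a handful of uniformly sampled targets:

```python
import numpy as np

def sampled_repulsion(Y, i, n_samples=5, rng=None):
    """O(1) estimate of the repulsion on y_i from uniformly sampled points."""
    rng = np.random.default_rng() if rng is None else rng
    others = np.delete(np.arange(Y.shape[0]), i)
    ks = rng.choice(others, size=n_samples, replace=False)
    diffs = Y[i] - Y[ks]                                   # y_i - y_k for the sampled k
    q = 1.0 / (1.0 + np.sum(diffs ** 2, axis=1))           # unnormalized similarities
    return 2.0 * np.sum((q ** 2)[:, None] * diffs, axis=0)
```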
There are a few differences between the two algorithms' gradient descent loops. First, the tSNE learning rate stays constant over training while UMAP's linearly decreases. Second, tSNE's gradients are strengthened by adding a "gains" term which scales gradients based on whether they point in the same direction from epoch to epoch[5]. We refer to these two elements as gradient amplification.


Note that UMAP's repulsive force has a 1 - p_{ik} term that is unavailable at runtime, as x_i and x_k may not have been nearest neighbors. In practice, UMAP estimates these 1 - p_{ik} terms by using the available p_{ij} values[6]. We also note that UMAP does not explicitly multiply by p_{ij} and 1 - p_{ik}. Instead, it samples the forces proportionally to these scalars. For example, if p_{ij} = 0.1 then we apply that force without the p_{ij} multiplier once every ten epochs. We refer to this as scalar sampling.

[5] This term has not been mentioned in the literature but is present in common tSNE implementations.
[6] When possible, we use index k to represent repulsions and j to represent attractions to highlight that p_{ik} is never calculated in UMAP. See Section A.6 in the supplementary material for details.

[Figure 3: Visualization of the repulsive forces in tSNE (left) and UMAP (right). tSNE calculates the repulsion for representative points and uses this as a proxy for nearby points, giving O(n) total repulsions acting on each point. UMAP calculates the repulsion to a pre-defined number of points and ignores the others, giving O(1) per-point repulsions. Bright red points are those for which the gradient is calculated; arrows are the direction of repulsion.]
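As a toy rendering of scalar sampling (our own illustration of the idea, not UMAP's actual epoch scheduling), applying the full force on roughly a p_ij fraction of the epochs matches direct multiplication by p_ij in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
p_ij, force, epochs = 0.1, np.array([0.3, -0.2]), 1000

# Direct multiplication: apply p_ij * force on every epoch.
direct = sum(p_ij * force for _ in range(epochs))

# Scalar sampling: apply the full force, but only on ~p_ij of the epochs.
sampled = sum(force for _ in range(epochs) if rng.random() < p_ij)

print(direct, sampled)  # both concentrate around epochs * p_ij * force
```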
3.2 The Choice of Normalization

We now present a summary of our theoretical results before providing their formal statements. As was shown in [Bohm et al., 2020], the ratio between attractive and repulsive magnitudes determines the structure of the resulting embedding. Given this context, Theorem 1 shows that the normalization directly changes the ratio of attraction/repulsion magnitudes, inducing the difference between tSNE and UMAP embeddings. Thus, we can toggle the normalization to alternate between their outputs. Furthermore, Theorem 2 shows that the attraction/repulsion ratio in the normalized setting is independent of the number of repulsive samples collected. This second point allows us to accelerate tSNE to UMAP speeds without impacting embedding quality by simply removing the dependency on Barnes-Hut trees and calculating 1 per-point repulsion as in UMAP. We now provide the necessary definitions for the theorems.

Assume that the p_{ij} terms are given. We now consider the dataset Y probabilistically by defining a set of random variables v_{ij} = y_i - y_j and assume that all O(n²) v_{ij} vectors are i.i.d. around a non-zero mean. Let r_{ij} = (1 + |v_{ij}|²)^{-1} and define Z = Σ^{n²}_{i,j} r_{ij} as the sum over n² pairs of points and Z̃ = Σ^{n}_{i,j} r_{ij} as the sum over n pairs of points. Then applying n per-point repulsions gives us the force acting on point y_i of E[|R^{tsne}|] = E[Σ^n_j ‖(r²_{ij}/Z²) · v_{ij}‖]. We now define an equivalent force term in the setting where we have 1 per-point repulsion: E[|R̃^{tsne}|] = E[‖(r²_{ij}/Z̃²) · v_{ij}‖]. Note that we have a constant number c of attractive forces acting on each point, giving E[|A^{tsne}|] = c · p^{tsne}_{ij} E[‖(r_{ij}/Z) · v_{ij}‖] and E[|Ã^{tsne}|] = c · p^{tsne}_{ij} E[‖(r_{ij}/Z̃) · v_{ij}‖].

Thus, |A^{tsne}| and |R^{tsne}| represent the magnitudes of the forces when we calculate tSNE's O(n) per-point repulsions while |Ã^{tsne}| and |R̃^{tsne}| represent the forces when we have UMAP's O(1) per-point repulsions. Given this, we have the following theorems:

Theorem 1. Let p^{tsne}_{ij} ∼ 1/(cn) and d(x_i, x_j) > √(log(n² + 1)τ). Then E[|A^{umap}_i|] / E[|R^{umap}_i|] < E[|Ã^{tsne}_i|] / E[|R̃^{tsne}_i|].

Theorem 2. E[|A^{tsne}_i|] / E[|R^{tsne}_i|] = E[|Ã^{tsne}_i|] / E[|R̃^{tsne}_i|].

[Figure 4: Average angle in radians between repulsive forces calculated with O(1) and O(n) repulsions. The red line is at 0 radians. Panels: MNIST, Coil-100, Swiss Roll; x-axes: epochs; y-axis: angle (radians).]

The proofs are given in Sections A.3 and A.4 of the supplementary material. We point out that p^{tsne}_{ij} is normalized over the sum of all cn attractions that are sampled, giving us the estimate p^{tsne}_{ij} ∼ 1/(cn). Theorem 1's result is visualized in the gradient plots in Figure 2. There we see that, for non-negligible values of d(x_i, x_j), the UMAP repulsions can be orders of magnitude larger than the corresponding tSNE ones, even when accounting for the magnitude of the attractions. Furthermore, Section 5 evidences that toggling the normalization is sufficient to switch between the algorithms' embeddings and that no other hyperparameter accounts for the difference in inter-cluster distances between tSNE and UMAP.

4 Unifying tSNE and UMAP

This leads us to GDR – a modification to UMAP that can recreate both tSNE and UMAP embeddings at UMAP speeds. We choose the general name Gradient Dimensionality Reduction to imply that it is both UMAP and tSNE.

Our algorithm follows the UMAP optimization procedure except that we (1) replace the scalar sampling by iteratively processing attractions/repulsions and (2) apply the gradients after having collected all of them, rather than immediately upon processing each one. The first change accommodates the gradients under normalization since the normalized repulsive forces do not have the 1 - p_{ik} term to which UMAP samples proportionally. The second change allows for performing momentum gradient descent for faster convergence in the normalized setting.


Since we follow the UMAP optimization procedure, GDR defaults to producing UMAP embeddings. In the case of replicating tSNE, we simply normalize the P and Q matrices and scale the learning rate. Although we only collect O(1) attractions and repulsions for each point, their magnitudes are balanced due to Theorems 1 and 2. We refer to GDR as GDRumap if it is in the unnormalized setting and as GDRtsne if it is in the normalized setting. We note that changing the normalization necessitates gradient amplification.

By allowing GDR to toggle the normalization, we are free to choose the simplest options across the other parameters. GDR therefore defaults to tSNE's asymmetric attraction and a and b scalars along with UMAP's distance metric, initialization, nearest neighbors, and p_{ij} symmetrization.
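The following is a minimal sketch of this optimization pattern, written by us for illustration rather than taken from the released GDR code: every force is accumulated first, a single momentum step is applied afterwards, and one `normalized` flag toggles between the UMAP-like and tSNE-like settings. The helper names, the uniform repulsion sampling, and the learning-rate compensation are our assumptions.

```python
import numpy as np

def gdr_epoch(Y, neighbors, P_knn, normalized, velocity,
              lr=1.0, momentum=0.9, n_neg=5, rng=None):
    """One sketched epoch: collect all gradients, then take a single momentum step."""
    rng = np.random.default_rng() if rng is None else rng
    n, _ = Y.shape
    grad = np.zeros_like(Y)                       # gradient of the embedding loss w.r.t. Y

    for i in range(n):
        # Attractions along high-dimensional kNN edges, weighted directly by p_ij
        # (no scalar sampling).
        for j, p_ij in zip(neighbors[i], P_knn[i]):
            diff = Y[i] - Y[j]
            q = 1.0 / (1.0 + diff @ diff)         # Student-t kernel, a = b = 1
            grad[i] += p_ij * q * diff            # the -lr step below pulls y_i toward y_j
        # A constant number of uniformly sampled repulsions per point.
        for k in rng.integers(0, n, size=n_neg):
            if k != i:
                diff = Y[i] - Y[k]
                q = 1.0 / (1.0 + diff @ diff)
                grad[i] -= q * q * diff           # the -lr step below pushes y_i away from y_k

    if normalized:
        # tSNE-like setting: the caller has normalized P to sum to 1, so raw gradients are
        # roughly n times smaller; the exact learning-rate scaling used by GDR is not
        # reproduced here -- this factor is only a placeholder.
        grad *= n

    velocity = momentum * velocity - lr * grad    # momentum gradient descent on the full batch
    return Y + velocity, velocity
```

A driver would precompute `neighbors` and `P_knn` from the high-dimensional kNN graph, initialize `velocity` to zeros, and call this once per epoch while decaying `lr`.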
The supplementary material provides some further information on the flexibility of GDR (A.1), such as an accelerated version of the algorithm where we modify the gradient formulation such that it is quicker to optimize. This change induces a consistent 2× speedup of GDR over UMAP. Despite differing from the true KL divergence gradient, we find that the resulting embeddings are comparable. Our repository also provides a CUDA kernel that calculates GDRumap and GDRtsne embeddings in a distributed manner on a GPU.

          | Initialization of Y | High-dim distance function        | Symmetrization (setting p_ij = p_ji)  | Symmetric attraction | a, b scalars
  tSNE    | Random              | d(x_i, x_j)                       | (p_{i|j} + p_{j|i}) / 2               | No                   | a = 1, b = 1
  UMAP    | Laplacian Eigenmap  | d(x_i, x_j) - min_k d(x_i, x_k)   | p_{i|j} + p_{j|i} - p_{i|j} p_{j|i}   | Yes                  | Grid search

Table 1: List of differences between hyperparameters of tSNE and UMAP. These are analyzed in Tables 3 and 4.

[Table 2: Effect of changing the normalization for the original tSNE and UMAP algorithms on the MNIST, Fashion-MNIST, and Swiss Roll datasets. Each dataset is shown with normalization followed by no normalization. We use Laplacian Eigenmap initializations for consistent orientation. The normalized UMAP plots were made with the changes described in Section 5.2. Rows: tSNE, UMAP; embedding images not reproduced here.]
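For reference, the symmetrization choices and the pseudo-distance input transform listed in Table 1 amount to only a few lines each; the sketch below states the definitions in numpy (our own helper names, not either library's API):

```python
import numpy as np

def symmetrize_tsne(P_cond):
    """tSNE: p_ij = (p_{i|j} + p_{j|i}) / 2."""
    return (P_cond + P_cond.T) / 2.0

def symmetrize_umap(P_cond):
    """UMAP: p_ij = p_{i|j} + p_{j|i} - p_{i|j} * p_{j|i} (probabilistic t-conorm)."""
    return P_cond + P_cond.T - P_cond * P_cond.T

def pseudo_distances(D):
    """UMAP pseudo-distance: subtract each point's nearest-neighbor distance from its row."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)                    # exclude d(x_i, x_i) from the minimum
    rho = D.min(axis=1, keepdims=True)             # rho_i = min_{k != i} d(x_i, x_k)
    D_tilde = D - rho                              # note: the result is not symmetric
    np.fill_diagonal(D_tilde, 0.0)
    return D_tilde
```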
4.1 Theoretical Considerations

UMAP's theoretical framework identifies the existence of a locally-connected manifold in the high-dimensional space under the UMAP pseudo-distance metric d̃. This pseudo-distance metric is defined such that the distance from point x_j to x_i is equal to d̃(x_i, x_j) = d(x_i, x_j) - min_{l≠i} d(x_i, x_l). Despite this being a key element of the UMAP foundation, we find that substituting the Euclidean distance for the pseudo-distance metric seems to have no effect on the embeddings, as seen in Tables 3 and 4. It is possible that the algorithm's reliance on highly non-convex gradient descent deviates enough from the theoretical discussion that the pseudo-distance metric loses its applicability. It may also be the case that this pseudo-distance metric, while insightful from a theoretical perspective, is not a necessary calculation in order to achieve the final embeddings.

Furthermore, many of the other differences between tSNE and UMAP are not motivated by the theoretical foundation of either algorithm. The gradient descent methodology is entirely heuristic, so any differences therein do not impact the theory. This applies to the repulsion and attraction sampling and gradient descent methods. Moreover, the high-dimensional symmetrization function, embedding initialization, symmetric attraction, and a, b scalars can all be switched to their alternative options without impacting either method's consistency within its theoretical presentation. Thus, each of these heuristics can be toggled without impacting the embedding's interpretation, as most of them do not interfere with the theory and none affect the output.

We also question whether the choice of normalization is necessitated by either algorithm's presentation. tSNE, for example, treats the normalization of P and Q as an assumption and provides no further justification. In the case of UMAP, it appears that the normalization does not break the assumptions of the original paper [McInnes et al., 2018, Sec. 2,3]. We therefore posit that the interpretation of UMAP as finding the best fit to the high-dimensional data manifold extends to tSNE as well, as long as tSNE's gradients are calculated under the pseudo-distance metric in the high-dimensional space. We additionally theorize that each method can be paired with either normalization without contradicting the foundations laid out in its paper.


We evidence the fact that tSNE can preserve manifold structure at least as well as UMAP in Table 2, where Barnes-Hut tSNE without normalization cleanly maintains the structure of the Swiss Roll dataset. We further discuss these manifold learning claims in the supplementary material (A.5).

For all of these reasons, we make the claim that tSNE and UMAP are computationally consistent with one another. That is, we conjecture that, up to minor changes, one could have presented UMAP's theoretical foundation and implemented it with the tSNE algorithm or vice-versa.
4.2 Frobenius Norm for UMAP

Finally, even some of the standard algorithmic choices can be modified without significantly impacting the embeddings. For example, UMAP and tSNE both optimize the KL divergence, but we see no reason that the Frobenius norm cannot be substituted in its place. Interestingly, the embeddings in Figure 8 in the supplementary material show that optimizing the Frobenius norm in the unnormalized setting produces outputs that are indistinguishable from the ones obtained by minimizing the KL divergence. To provide a possible indication as to why this occurs, Figure 2 shows that the zero-gradient areas between the KL divergence and the Frobenius norm strongly overlap, implying that a local minimum under one objective satisfies the other one as well.

We bring this up for two reasons. First, the Frobenius norm is a significantly simpler loss function to optimize than the KL divergence due to its convexity. We hypothesize that there must be simple algorithmic improvements that can exploit this property. Further detail is given in Section A.7 in the supplementary material. Second, it is interesting to consider that even fundamental assumptions such as the objective function can be changed without significantly affecting the embeddings across datasets.
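A sketch of this alternative objective (our own formulation of a Frobenius-norm fit between the similarity matrices; the paper's exact variant is described in its supplementary material):

```python
import numpy as np

def frobenius_loss(P, Q):
    """Squared Frobenius norm ||P - Q||_F^2 between the similarity matrices."""
    return np.sum((P - Q) ** 2)

def frobenius_grad_wrt_Q(P, Q):
    """d/dQ ||P - Q||_F^2 = 2 (Q - P); convex in Q and simpler to differentiate than KL."""
    return 2.0 * (Q - P)
```

The gradient with respect to each y_i then follows by the chain rule through q_ij, exactly as for the KL objectives above.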
5 Results

Metrics. There is no optimal way to compare embeddings – an analysis at the point-level loses global information while studying macro-structures loses local information. To account for this, we employ separate metrics to study the embeddings at the micro- and macro-scales. Specifically, we use the kNN accuracy to analyze preservation of local neighborhoods as established in [Van Der Maaten et al., 2009] and the V-measure [Rosenberg and Hirschberg, 2007] to study the embedding's global cluster structures[7].

[7] We provide formalization of these metrics in the supplementary material (A.8).

              |          | Fashion-MNIST | Coil-100  | Single-Cell | Cifar-10
  kNN Acc.    | UMAP     | 78.0; 0.5     | 80.8; 3.3 | 43.4; 1.9   | 24.2; 1.1
              | GDRumap  | 77.3; 0.7     | 77.4; 3.4 | 42.8; 2.2   | 23.8; 1.1
              | tSNE     | 80.1; 0.7     | 63.2; 4.2 | 43.3; 1.9   | 28.7; 2.5
              | GDRtsne  | 78.6; 0.6     | 77.2; 4.4 | 44.8; 1.4   | 25.6; 1.1
  V-score     | UMAP     | 60.3; 1.4     | 89.2; 0.9 | 60.6; 1.3   | 7.6; 0.4
              | GDRumap  | 61.7; 0.8     | 91.0; 0.6 | 60.1; 1.6   | 8.1; 0.6
              | tSNE     | 54.2; 4.1     | 82.9; 1.8 | 59.7; 1.1   | 8.5; 0.3
              | GDRtsne  | 51.7; 4.7     | 85.7; 2.6 | 60.5; 0.8   | 8.0; 3.7

Table 3: Row means and std. deviations (entries are "mean; std") for kNN-accuracy and V-score on Fashion-MNIST, Coil-100, Single-Cell, and Cifar-10 datasets. For example, the cell [Fashion-MNIST, kNN accuracy, tSNE] implies that the mean kNN accuracy across the hyperparameters in Table 1 was 80.1 for tSNE on the Fashion-MNIST dataset.
5.1 Hyperparameter Effects

We first show that a majority of the differences between tSNE and UMAP do not significantly affect the embeddings. Specifically, Table 4 shows that we can vary the hyperparameters in Table 1 with negligible change to the embeddings of any discussed algorithm. Equivalent results on other datasets can be found in Tables 8 and 9 in the supplementary material. Furthermore, Table 3 provides quantitative evidence that the hyperparameters do not affect the embeddings across datasets; similarly, Table 9 in the supplementary material confirms this finding across algorithms.

Looking at Table 4, the initialization and the symmetric attraction induce the largest variation in the embeddings. For the initialization, the relative positions of clusters change but the relevant inter-cluster relationships remain consistent[8]. Enabling symmetric attraction attracts y_j to y_i when we attract y_i to y_j. Thus, switching from asymmetric to symmetric attraction functionally scales the attractive force by 2. This leads to tighter tSNE clusters that would otherwise be evenly spread out across the embedding, but does not affect UMAP significantly. We thus choose asymmetric attraction for GDR as it better recreates tSNE embeddings.

We show the effect of single hyperparameter changes, rather than all combinations, for combinatorial reasons. However, we see no significant difference between changing one hyperparameter or any number of them. We also eschew including hyperparameters that have no effect on the embeddings and are the least interesting. These include the exact vs. approximate nearest neighbors, gradient clipping, and the number of epochs.

[8] As such, we employ the Laplacian Eigenmap initialization on small datasets (<100K) due to its predictable output and the random initialization on large datasets (>100K) to avoid slowdowns.

3656
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23)
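A small numpy check in this spirit (a hedged sketch of the measurement, comparing an exact O(n) repulsion sum against a uniform O(1) sample rather than against the Barnes-Hut estimate used for Figure 4):

```python
import numpy as np

def repulsion(Y, i, targets):
    """Simplified (a = b = 1) repulsive force on y_i summed over the given target indices."""
    diffs = Y[i] - Y[targets]
    q = 1.0 / (1.0 + np.sum(diffs ** 2, axis=1))
    return np.sum((q ** 2)[:, None] * diffs, axis=0)

def mean_repulsion_angle(Y, n_samples=5, rng=None):
    """Average angle (radians) between full and uniformly sampled per-point repulsions."""
    rng = np.random.default_rng() if rng is None else rng
    n, angles = Y.shape[0], []
    for i in range(n):
        others = np.delete(np.arange(n), i)
        full = repulsion(Y, i, others)
        approx = repulsion(Y, i, rng.choice(others, size=n_samples, replace=False))
        cos = full @ approx / (np.linalg.norm(full) * np.linalg.norm(approx) + 1e-12)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.mean(angles))
```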

We now show that toggling the normalization allows tSNE to simulate UMAP embeddings and vice versa. Table 2 shows exactly this. First note that tSNE in the unnormalized setting has significantly more separation between clusters in a manner similar to UMAP. The representations are fuzzier than the UMAP ones as we are still estimating O(n) repulsions, causing the embedding to fall closer to the mean of the multi-modal datasets. To account for the n× more repulsions, we scale each repulsion by 1/n for the sake of convergence. This is a different effect than normalizing by Σ p_{ij}, as we are not affecting the attraction/repulsion ratio in Theorem 1.

The analysis is slightly more involved in the case of UMAP. Recall that the UMAP algorithm approximates the p_{ij} and 1 - p_{ik} gradient scalars by sampling the attractions and repulsions proportionally to p_{ij} and 1 - p_{ik}, which we referred to as scalar sampling. However, the gradients in the normalized setting (Equation 5) lose the 1 - p_{ik} scalar on repulsions. The UMAP optimization schema, then, imposes an unnecessary weight on the repulsions in the normalized setting as the repulsions are still sampled according to the no-longer-necessary 1 - p_{ik} scalar. Accounting for this requires dividing the repulsive forces by 1 - p_{ik}, but this (with the momentum gradient descent and stronger learning rate) leads to a highly unstable training regime. We refer the reader to Figure 7 in the supplementary material for details.

This implies that stabilizing UMAP in the normalized setting requires removing the sampling and instead directly multiplying by p_{ij} and 1 - p_{ik}. Indeed, this is exactly what we do in GDR. Under this change, GDRumap and GDRtsne obtain effectively identical embeddings to the default UMAP and tSNE ones. This is confirmed in the kNN accuracy and K-means V-score metrics in Table 3.

            | Default setting | Random init | Pseudo distance | Symmetrization | Sym attraction | a, b scalars
  tSNE      | 95.1; 70.9      | 95.2; 70.7  | 96.0; 73.9      | 94.9; 70.8     | 94.8; 80.7     | 95.1; 73.2
  GDRtsne   | 96.1; 67.8      | 95.6; 61.3  | 96.1; 63.0      | 96.1; 68.4     | 96.3; 72.7     | 96.1; 68.8
  UMAP      | 95.4; 82.5      | 96.6; 84.6  | 94.4; 82.2      | 96.7; 82.5     | 96.6; 83.5     | 96.5; 82.2
  GDRumap   | 96.2; 84.0      | 96.4; 82.1  | 96.7; 85.2      | 96.6; 85.1     | 96.5; 83.3     | 95.8; 81.2

Table 4: Effect of the algorithm settings from Table 1 on the MNIST dataset. Each parameter is changed from its default to its alternative setting; e.g., the random init column implies that tSNE was initialized with Laplacian Eigenmaps while UMAP and GDR were initialized randomly. Below each image, the kNN-accuracy and K-means V-score show unchanged performance (entries are "kNN accuracy; V-score"; embedding images not reproduced here).

5.3 Time Efficiency

We lastly discuss the speeds of UMAP, tSNE, GDR, and our accelerated version of GDR in Section A.1 of the supplementary material due to space concerns. Our implementations of UMAP and GDR perform gradient descent an order of magnitude faster than the standard UMAP library, implying a corresponding speedup over tSNE. We also provide an acceleration by doing GDR with scalar sampling that provides a further 2× speedup. Despite the fact that this imposes a slight modification onto the effective gradients, we show that this is qualitatively insignificant in the resulting embeddings.

6 Conclusion & Future Work

We discussed the set of differences between tSNE and UMAP and identified that only the normalization significantly impacts the outputs. This provides a clear unification of tSNE and UMAP that is both theoretically simple and easy to implement. Beyond this, our analysis has uncovered multiple misunderstandings regarding UMAP and tSNE while hopefully also clarifying how these methods work.

We raised several questions regarding the theory of gradient-based DR algorithms. Is there a setting in which the UMAP pseudo-distance changes the embeddings? Does the KL divergence induce a better optimization criterion than the Frobenius norm? Is it true that UMAP's framework can accommodate tSNE's normalization? We hope that we have facilitated future research into the essence of these algorithms through identifying all of their algorithmic components and consolidating them in a simple-to-use codebase.


References

[Belkin and Niyogi, 2003] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[Bohm et al., 2020] Jan Niklas Bohm, Philipp Berens, and Dmitry Kobak. A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. arXiv preprint arXiv:2007.08902, 2020.

[Damrich and Hamprecht, 2021] Sebastian Damrich and Fred A Hamprecht. On UMAP's true loss function. Advances in Neural Information Processing Systems, 34, 2021.

[Damrich et al., 2022] Sebastian Damrich, Jan Niklas Böhm, Fred A Hamprecht, and Dmitry Kobak. Contrastive learning unifies t-SNE and UMAP. arXiv preprint arXiv:2206.01816, 2022.

[Deng, 2012] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[Dong et al., 2011] Wei Dong, Charikar Moses, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, pages 577–586, 2011.

[Hull, 1994] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

[Kobak and Linderman, 2021] Dmitry Kobak and George C Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 39(2):156–157, 2021.

[Krizhevsky, 2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[Linderman et al., 2019] George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods, 16(3):243–245, 2019.

[McInnes et al., 2018] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[Nene, 1996] S. A. Nene. Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, 1996.

[Rosenberg and Hirschberg, 2007] Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL, pages 410–420, 2007.

[Sainburg et al., 2020] Tim Sainburg, Leland McInnes, and Timothy Q Gentner. Parametric UMAP embeddings for representation and semi-supervised learning. arXiv preprint arXiv:2009.12981, 2020.

[Tang et al., 2016] Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing large-scale and high-dimensional data. In TheWebConf, pages 287–297, 2016.

[Tasic et al., 2018] Bosiljka Tasic, Zizhen Yao, Lucas T Graybuck, Kimberly A Smith, Thuc Nghi Nguyen, Darren Bertagnolli, Jeff Goldy, Emma Garren, Michael N Economo, Sarada Viswanathan, et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature, 563(7729):72–78, 2018.

[Van der Maaten and Hinton, 2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

[Van Der Maaten et al., 2009] Laurens Van Der Maaten, Eric Postma, Jaap Van den Herik, et al. Dimensionality reduction: a comparative review. Journal of Machine Learning Research, 10(66-71):13, 2009.

[Van Der Maaten, 2014] Laurens Van Der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245, 2014.

[Wang et al., 2021] Yingfan Wang, Haiyang Huang, Cynthia Rudin, and Yaron Shaposhnik. Understanding how dimension reduction tools work: an empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization. Journal of Machine Learning Research, 22(1):9129–9201, 2021.

[Xiao et al., 2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
