CO-Optimal Transport
Abstract
Optimal transport (OT) is a powerful geometric and probabilistic tool for finding
correspondences and measuring similarity between two distributions. Yet, its origi-
nal formulation relies on the existence of a cost function between the samples of the
two distributions, which makes it impractical when they are supported on different
spaces. To circumvent this limitation, we propose a novel OT problem, named
COOT for CO-Optimal Transport, that simultaneously optimizes two transport
maps between both samples and features, contrary to other approaches that either
discard the individual features by focusing on pairwise distances between samples
or need to model explicitly the relations between them. We provide a thorough the-
oretical analysis of our problem, establish its rich connections with other OT-based
distances and demonstrate its versatility with two machine learning applications
in heterogeneous domain adaptation and co-clustering/data summarization, where
COOT leads to performance improvements over the state-of-the-art methods.
1 Introduction
The problem of comparing two sets of samples arises in many fields in machine learning, such as
manifold alignment [1], image registration [2], unsupervised word and sentence translation [3] among
others. When correspondences between the sets are known a priori, one can align them with a global
transformation of the features, e.g., with the widely used Procrustes analysis [4, 5]. For unknown
correspondences, other popular alternatives to this method include correspondence-free manifold
alignment procedures [6], soft assignment coupled with Procrustes matching [7], or the Iterative Closest
Point algorithm and its variants for 3D shapes [8, 9].
When one models the considered sets of samples as empirical probability distributions, Optimal
Transport (OT) framework provides a solution to find, without supervision, a soft-correspondence
map between them given by an optimal coupling. OT-based approaches have been used with success
in numerous applications such as embeddings’ alignments [10, 11] and Domain Adaptation (DA)
[12] to name a few. However, one important limit of using OT for such tasks is that the two sets are
assumed to lie in the same space so that the cost between samples across them can be computed.
This major drawback does not allow OT to handle correspondence estimation across heterogeneous
spaces, preventing its application in problems such as, for instance, heterogeneous DA (HDA). To
circumvent this restriction, one may rely on the Gromov-Wasserstein distance (GW) [13]: a non-
convex quadratic OT problem that finds the correspondences between two sets of samples based on
∗ Authors contributed equally.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
their pairwise intra-domain similarity (or distance) matrices. Such an approach was successfully
applied to sets of samples that do not lie in the same Euclidean space, e.g., for shapes [14], word
embeddings [15] and HDA [16] mentioned previously. One important limit of GW is that it finds the
samples’ correspondences but discards the relations between the features by considering pairwise
similarities only.
In this work, we propose a novel OT approach called CO-Optimal transport (COOT) that simulta-
neously infers the correspondences between the samples and the features of two arbitrary sets. Our
new formulation includes GW as a special case, and has the extra advantage of working with raw
data directly without needing to compute, store and choose computationally demanding similarity
measures required for the latter. Moreover, COOT provides a meaningful mapping between both
instances and features across the two datasets thus having the virtue of being interpretable. We thor-
oughly analyze the proposed problem, derive an optimization procedure for it and highlight several
insightful links to other approaches. On the practical side, we provide evidence of its versatility in
machine learning by putting forward two applications in HDA and co-clustering where our approach
achieves state-of-the-art results.
The rest of this paper is organized as follows. We introduce the COOT problem in Section 2 and
give an optimization routine for solving it efficiently. In Section 3, we show how COOT is related to
other OT-based distances and recover efficient solvers for some of them in particular cases. Finally,
in Section 4, we present an experimental study providing highly competitive results in HDA and
co-clustering compared to several baselines.
Figure 1: Illustration of COOT between the MNIST and USPS datasets. (left) Samples from the MNIST and
USPS data sets; (center left) transport matrix π^s between samples sorted by class; (center) USPS
image with pixels colored w.r.t. their 2D position; (center right) transported colors on an MNIST image
using π^v, where black pixels correspond to non-informative MNIST pixels always equal to 0; (right) transported
colors on an MNIST image using π^v with entropic regularization.
where, for ε₁, ε₂ > 0, the regularization term writes as Ω(π^s, π^v) = ε₁ H(π^s|ww'^T) + ε₂ H(π^v|vv'^T),
with H(π^s|ww'^T) = Σ_{i,j} log(π^s_{i,j}/(w_i w'_j)) π^s_{i,j} being the relative entropy. Note that,
similarly to OT [17] and GW [20], adding the regularization term can lead to a more robust estimation
of the transport matrices but prevents them from being sparse.
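For concreteness, the relative-entropy term above can be evaluated in a few lines; the following numpy snippet is a small illustration of ours (not the authors' code), where pi, w and wp stand for π^s, w and w'.

```python
# Small numpy illustration of the relative entropy H(pi | w w'^T) used in Omega;
# pi is a coupling and w, wp its prescribed marginals (names are ours).
import numpy as np

def rel_entropy(pi, w, wp, eps=1e-16):
    ref = np.outer(w, wp)                         # product coupling w w'^T
    return np.sum(pi * np.log((pi + eps) / ref))  # eps guards log(0)
```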
Illustration of COOT In order to illustrate our proposed COOT method and to explain the intuition
behind it, we solve the optimization problem (1), using the algorithm described in Section 2.2, between
two classical digit recognition datasets: MNIST and USPS. We choose these particular datasets
for our illustration as they contain images of different resolutions (USPS is 16×16 and MNIST is
28×28) that belong to the same classes (digits between 0 and 9). Additionally, the digits are also
slightly differently centered, as illustrated on the examples in the left part of Figure 1. Altogether, this
means that without specific pre-processing, the images do not lie in the same topological space and
thus cannot be compared directly using conventional distances. We randomly select 300 images per
class in each dataset, normalize pixel magnitudes to [0, 1] and consider digit images as samples,
while each pixel acts as a feature, leading to 256 and 784 features for USPS and MNIST respectively.
We use uniform weights for w, w' and set v, v' proportional to the average value of each pixel in order
to give zero weight to non-informative pixels that are always equal to 0.
The result of solving problem (1) is reported in Figure 1. In the center-left part, we provide the
coupling π^s between the samples, i.e. the different images, sorted by class, and observe that 67% of
the mappings occur between samples from the same class, as indicated by the block-diagonal structure
of the coupling matrix. The coupling π^v, in its turn, describes the relations between the features,
i.e. the pixels, in both domains. To visualize it, we color-code the pixels of the source USPS image
and use π^v to transport the colors onto a target MNIST image, so that its pixels are defined as convex
combinations of colors from the former with coefficients given by π^v. The corresponding results
are shown in the right part of Figure 1 for both the original COOT and its entropic regularized
counterpart. From these two images, we can observe that colored pixels appear only in the central
areas and exhibit a strong spatial coherency, despite the fact that the geometric structure of the image
is totally unknown to the optimization problem, as each pixel is treated as an independent variable.
COOT has recovered a meaningful spatial transformation between the two datasets in a completely
unsupervised way, different from the trivial rescaling of images that one may expect when aligning
USPS digits occupying the full image space and MNIST digits lying in the middle of it (for further
evidence, other visualizations are given in the supplementary material).
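As an aside, this color-transfer visualization amounts to a single matrix product; a hedged numpy sketch of ours follows, with an assumed orientation of π^v from USPS pixels to MNIST pixels.

```python
# Sketch of the visualization: each MNIST pixel color is a convex combination of
# USPS pixel colors, with weights given by the normalized columns of pi_v.
import numpy as np

def transport_colors(pi_v, usps_colors):
    """pi_v: (d_usps, d_mnist) feature coupling; usps_colors: (d_usps, 3) RGB."""
    weights = pi_v / (pi_v.sum(axis=0, keepdims=True) + 1e-16)  # columns sum to 1
    return weights.T @ usps_colors                               # (d_mnist, 3)
```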
COOT as a bilinear program COOT is an indefinite Bilinear Program (BP) problem [21]: a
special case of a Quadratic Program (QP) with linear constraints for which there exists an optimal
solution lying on extremal points of the polytopes Π(w, w') and Π(v, v') [22, 23]. When n = n',
d = d' and the weights w = w' = 1_n/n, v = v' = 1_d/d are uniform, Birkhoff's theorem [24] states that
the sets of extremal points of Π(1_n/n, 1_n/n) and Π(1_d/d, 1_d/d) are the sets of permutation matrices,
so that there exists an optimal solution (π^s_∗, π^v_∗) whose transport maps are supported on two
permutations σ^s_∗, σ^v_∗ ∈ S_n × S_d.

Algorithm 1 BCD for COOT
1: π^s_(0) ← ww'^T, π^v_(0) ← vv'^T, k ← 0
2: while k < maxIt and err > 0 do
3:   π^v_(k) ← OT(v, v', L(X, X') ⊗ π^s_(k−1))   // OT problem on the features
4:   π^s_(k) ← OT(w, w', L(X, X') ⊗ π^v_(k−1))   // OT problem on the samples
5:   err ← ||π^v_(k−1) − π^v_(k)||_F
6:   k ← k + 1
The BP problem is also related to the Bilinear Assignment Problem (BAP) where π^s and π^v are
searched in the set of permutation matrices. The latter was shown to be NP-hard if d = O(n^(1/r)) for
fixed r and solvable in polynomial time if d = O(√(log n)) [25]. In this case, we look for the best
permutations of the rows and columns of our datasets that lead to the smallest cost. COOT provides a
tight convex relaxation of the BAP by 1) relaxing the constraint set of permutations into the convex
set of doubly stochastic matrices and 2) ensuring that the two problems are equivalent, i.e., one can
always find a pair of permutations that minimizes (1), as explained in the paragraph above.
Finding a meaningful similarity measure between datasets is useful in many machine learning tasks
as pointed out, e.g., in [26]. To this end, COOT induces a distance between datasets X and X' which
vanishes iff they are the same up to a permutation of rows and columns, as established below.
Proposition 1 (COOT is a distance). Suppose L = |·|^p, p ≥ 1, n = n', d = d' and that the
weights w, w', v, v' are uniform. Then COOT(X, X') = 0 iff there exists a permutation of the
samples σ₁ ∈ S_n and of the features σ₂ ∈ S_d such that, for all i, k, X_{i,k} = X'_{σ₁(i),σ₂(k)}. Moreover,
COOT is symmetric and satisfies the triangle inequality as long as L satisfies the triangle inequality, i.e.,
COOT(X, X'') ≤ COOT(X, X') + COOT(X', X'').
Note that in the general case when n ≠ n', d ≠ d', positivity and the triangle inequality still hold but
COOT(X, X') > 0. Interestingly, our result generalizes the metric property proved in [27] for the
election isomorphism problem, with this latter result being valid only for the BAP case (for a discussion
on the connection between COOT and the work of [27], see the supplementary materials). Finally, we
note that this metric property means that COOT can be used as a divergence in a large number of
potential applications, for instance in generative learning [28].
We observe in the numerical experiments that the BCD converges in a few iterations (see e.g. Figure 2). We
refer the interested reader to the supplementary materials for further details. Finally, we can use
the same BCD procedure for the entropic regularized version of COOT (2), where at each iteration
the entropic regularized OT problem can be solved efficiently using Sinkhorn's algorithm [17] with
several possible improvements [18, 30, 31]. Note that this procedure can be easily adapted in the
same way to include unbalanced OT problems [32] as well.
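To make the procedure concrete, below is a minimal numpy sketch of Algorithm 1 for the squared loss L = |·|², using the exact OT solver ot.emd from the POT library as the inner solver. This is our own illustration, not the authors' implementation; the entropic variant would simply replace ot.emd by a Sinkhorn solver.

```python
# Minimal BCD sketch for COOT with L = |·|^2 (assumes the POT library for ot.emd).
import numpy as np
import ot


def coot_bcd(X, Xp, w, wp, v, vp, n_iter=50, tol=1e-9):
    """X: (n, d), Xp: (n', d') data; w, wp sample weights; v, vp feature weights."""
    pi_s, pi_v = np.outer(w, wp), np.outer(v, vp)            # initial couplings
    for _ in range(n_iter):
        # feature cost: M_v[k, l] = sum_ij (X[i, k] - Xp[j, l])^2 * pi_s[i, j]
        M_v = ((X ** 2).T @ w)[:, None] + ((Xp ** 2).T @ wp)[None, :] \
              - 2 * X.T @ pi_s @ Xp
        pi_v = ot.emd(v, vp, M_v)                            # OT on the features
        # sample cost: M_s[i, j] = sum_kl (X[i, k] - Xp[j, l])^2 * pi_v[k, l]
        M_s = ((X ** 2) @ v)[:, None] + ((Xp ** 2) @ vp)[None, :] \
              - 2 * X @ pi_v @ Xp.T
        pi_s_new = ot.emd(w, wp, M_s)                        # OT on the samples
        converged = np.linalg.norm(pi_s_new - pi_s) < tol
        pi_s = pi_s_new
        if converged:
            break
    return pi_s, pi_v, float(np.sum(M_s * pi_s))             # couplings, COOT value
```

In line with Proposition 1, running such a sketch on X and a copy of X with permuted rows and columns should return a value numerically equal to zero.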
Below, we explicit the link between GW and COOT using a reduction of a concave QP to an
associated BP problem established in [35] and show that they are equivalent when working with
squared Euclidean distance matrices C ∈ R^{n×n}, C' ∈ R^{n'×n'}.
Proposition 2. Let L = |·|² and suppose that C ∈ R^{n×n}, C' ∈ R^{n'×n'} are squared Euclidean
distance matrices such that C = x1_n^T + 1_n x^T − 2XX^T, C' = x'1_{n'}^T + 1_{n'}x'^T − 2X'X'^T with
x = diag(XX^T), x' = diag(X'X'^T). Then, the GW problem can be written as a concave quadratic
program (QP) whose Hessian reads Q = −4 XX^T ⊗_K X'X'^T.
When working with arbitrary similarity matrices, COOT provides a lower bound for GW and, using
Proposition 2, we can prove that both problems become equivalent in the Euclidean setting.
Proposition 3. Let C ∈ R^{n×n}, C' ∈ R^{n'×n'} be any symmetric matrices, then:
COOT(C, C', w, w', w, w') ≤ GW(C, C', w, w').
The converse is also true under the hypothesis of Proposition 2. In this case, if (π^s_∗, π^v_∗) is an optimal
solution of (1), then both π^s_∗, π^v_∗ are solutions of (3). Conversely, if π^s_∗ is an optimal solution of (3),
then (π^s_∗, π^s_∗) is an optimal solution for (1).
Under the hypothesis of Proposition 2, we know that there exists an optimal solution for the COOT
problem of the form (π∗, π∗), where π∗ is an optimal solution of the GW problem. This gives a
conceptually very simple fixed-point procedure to compute an optimal solution of GW, where one
optimises over one coupling only and sets π^s_(k) = π^v_(k) at each iteration of Algorithm 1. Interestingly
enough, in the concave setting, these iterations are exactly equivalent to the Frank-Wolfe algorithm
described in [33] for solving GW. It also corresponds to a Difference of Convex Algorithm (DCA)
[36, 37] where the concave function is approximated at each iteration by its linear majorization.
When used for entropic regularized COOT, the resulting algorithm also recovers exactly the projected
gradient iterations proposed in [20] for solving the entropic regularized version of GW. We refer the
reader to the supplementary materials for more details.

Figure 3: GW samples' coupling for the MNIST-USPS task.
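A hedged sketch of these tied iterations (our own, assuming L = |·|² so that L(C, C') ⊗ π = c_{C,C'} − 2CπC', and using ot.emd from the POT library) reads:

```python
# DC / fixed-point iterations for GW obtained from Algorithm 1 by tying
# pi_s = pi_v at every iteration (squared loss; ot.emd is POT's exact OT solver).
import numpy as np
import ot

def gw_fixed_point(C, Cp, w, wp, n_iter=100, tol=1e-9):
    pi = np.outer(w, wp)                                         # initial coupling w w'^T
    const = ((C ** 2) @ w)[:, None] + ((Cp ** 2) @ wp)[None, :]  # pi-independent part
    for _ in range(n_iter):
        M = const - 2 * C @ pi @ Cp.T                            # linearization L(C, C') ⊗ pi
        pi_new = ot.emd(w, wp, M)                                # linearized problem = exact OT
        if np.linalg.norm(pi_new - pi) < tol:
            return pi_new
        pi = pi_new
    return pi
```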
To conclude, we would like to stress that COOT is much more than a generalization of GW, and this
for multiple reasons. First, it can be used on raw data without requiring one to choose or compute
similarity matrices, which can be prohibitively costly, for instance when dealing with shortest-path
distances in graphs, and to store them (O(n² + n'²) overhead). Second, it can take into account
additional information given by the feature weights v, v' and provides an interpretable mapping between
the features across two heterogeneous datasets. Finally, contrary to GW, COOT is invariant neither to
feature rotations nor to changes of sign, leading to a more informative samples' coupling when
compared to GW in some applications. One such example is given in the previous MNIST-USPS
transfer task (Figure 1), for which the coupling matrix obtained via GW (given in Figure 3) exhibits
important flaws in respecting class memberships when aligning samples.

Domains No-adaptation baseline CCA KCCA EGW SGW COOT
C→W 69.12±4.82 11.47±3.78 66.76±4.40 11.35±1.93 78.88±3.90 83.47±2.60
W→C 83.00±3.95 19.59±7.71 76.76±4.70 11.00±1.05 92.41±2.18 93.65±1.80
W→W 82.18±3.63 14.76±3.15 78.94±3.94 10.18±1.64 93.12±3.14 93.94±1.84
W→A 84.29±3.35 17.00±12.41 78.94±6.13 7.24±2.78 93.41±2.18 94.71±1.49
A→C 83.71±1.82 15.29±3.88 76.35±4.07 9.82±1.37 80.53±6.80 89.53±2.34
A→W 81.88±3.69 12.59±2.92 81.41±3.93 12.65±1.21 87.18±5.23 92.06±1.73
A→A 84.18±3.45 13.88±2.88 80.65±3.03 14.29±4.23 82.76±6.63 92.12±1.79
C→C 67.47±3.72 13.59±4.33 60.76±4.38 11.71±1.91 77.59±4.90 83.35±2.31
C→A 66.18±4.47 13.71±6.15 63.35±4.32 11.82±2.58 75.94±5.58 82.41±2.79
Mean 78.00±7.43 14.65±2.29 73.77±7.47 11.12±1.86 84.65±6.62 89.47±4.74
p-value <.001 <.001 <.001 <.001 <.001 -
Table 1: Semi-supervised Heterogeneous Domain Adaptation results (nt = 3) for adaptation from Decaf to GoogleNet representations.
Invariant OT and Hierarchical OT In [10], the authors proposed InvOT algorithm that aligns
samples and learns a transformation between the features of two data matrices given by a linear
map with a bounded Schatten p-norm. The authors further showed in [10, Lemma 4.3] that, under
some mild assumptions, InvOT and GW lead to the same samples’ couplings when cosine similarity
matrices are used. It can be proved that, in this case, COOT is also equivalent to them both (see
supplementary materials). However, note that InvOT is applicable under the strong assumption that
d = d' and provides only linear relations between the features, whereas COOT works when d ≠ d'
and its feature mapping is sparse and more interpretable. InvOT was further used as a building block
for aligning clustered datasets in [38] where the authors applied it as a divergence measure between
the clusters, thus leading to an approach different from ours. Finally, in [39] the authors proposed a
hierarchical OT distance as an OT problem with costs defined based on precomputed Wasserstein
distances but with no global features’ mapping, contrary to COOT that optimises two couplings of
the features and the samples simultaneously.
4 Numerical experiments
In this section, we highlight two possible applications of COOT in a machine learning context:
HDA and co-clustering. We consider these two particular tasks because 1) OT-based methods are
considered as a strong baseline in DA; 2) COOT is a natural match for co-clustering as it allows for
soft assignments of data samples and features to co-clusters.
In classification, the domain adaptation problem arises when a model learned on a (source) domain
X_s = {x_i^s}_{i=1}^{N_s} with associated labels Y_s = {y_i^s}_{i=1}^{N_s} is to be deployed on a related target domain
X_t = {x_i^t}_{i=1}^{N_t} where no or only few labelled data are available. Here, we are interested in the
heterogeneous setting where the source and target data belong to different metric spaces. The most
prominent works in HDA are based on Canonical Correlation Analysis [40] and its kernelized version,
and on a more recent approach based on the Gromov-Wasserstein distance [16]. We investigate here the
use of COOT both for semi-supervised HDA, where one has access to a small number nt of labelled
samples per class in the target domain, and for unsupervised HDA with nt = 0.
In order to solve the HDA problem, we compute COOT(Xs, Xt) between the two domains and use
the π^s matrix, which provides a transport/correspondence between samples (as illustrated in Figure 1), to
estimate the labels in the target domain via label propagation [41]. Assuming uniform sample weights
and one-hot encoded labels, a class prediction Ŷt for the target domain samples can be obtained
by computing Ŷt = π^s Ys. When labelled target samples are available, we further prevent source
samples from being mapped to target samples of a different class by adding a high cost in the cost matrix
for every such source sample, as suggested in [12, Sec. 4.2].
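As an illustration (ours, with an assumed orientation of π^s from source to target samples, which is why a transpose appears), the prediction step boils down to:

```python
# Label propagation through the sample coupling: soft target scores are obtained
# by pushing the one-hot source labels through pi_s, then taking a row-wise argmax.
import numpy as np

def propagate_labels(pi_s, Ys):
    """pi_s: (Ns, Nt) coupling from COOT(Xs, Xt); Ys: (Ns, K) one-hot source labels."""
    Yt_hat = pi_s.T @ Ys            # (Nt, K) soft class scores for target samples
    return Yt_hat.argmax(axis=1)    # predicted target labels
```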
Competing methods and experimental settings We evaluate COOT on Amazon (A), Caltech-256
(C) and Webcam (W) domains from Caltech-Office dataset [42] with 10 overlapping classes between
the domains and two different deep feature representations obtained for images from each domain
using the Decaf [43] and GoogleNet [44] neural network architectures. In both cases, we extract
the image representations as the activations of the last fully-connected layer, yielding respectively
sparse 4096 and 1024 dimensional vectors. The heterogeneity comes from these two very different
representations. We consider 4 baselines: CCA, its kernelized version KCCA [40] with a Gaussian
kernel whose width parameter is set to the inverse of the dimension of the input vectors, EGW
representing the entropic version of GW, and SGW [16] that incorporates labelled target data into two
regularization terms. For EGW and SGW, the entropic regularization term was set to 0.1, and the
two other regularization hyperparameters for the semi-supervised case to λ = 10⁻⁵ and γ = 10⁻², as
done in [16, 45]. We use COOT with entropic regularization on the feature mapping, with parameter
ε₂ = 1 in all experiments. For all OT methods, we use label propagation to obtain target labels as the
maximum entry of Ŷt in each row. For all non-OT methods, classification was conducted with a k-nn
classifier with k = 3. We run the experiment in a semi-supervised setting with nt = 3, i.e., 3 samples
per class were labelled in the target domain. The baseline score is the result of classification by
only considering labelled samples in the target domain as the training set. For each pair of domains,
we selected 20 samples per class to form the learning sets. We run this random selection process
10 times and consider the mean accuracy of the different runs as a performance measure. In the
presented results, we perform adaptation from Decaf to GoogleNet features, and report the results for
nt ∈ {0, 1, 3, 5} in the opposite direction in the supplementary material.
Results We first provide in Table 1 the results for the semi-supervised case. From it, we see that
COOT surpasses all the other state-of-the-art methods in terms of mean accuracy. This result is
confirmed by a p-value lower than 0.001 on a pairwise method comparison with COOT in a Wilcoxon
signed rank test. SGW provides the second best result, while CCA and EGW have a less than average
performance. Finally, KCCA performs better than the two latter methods, but still fails most of
the time to surpass the no-adaptation baseline score given by a classifier learned on the available
labelled target data. Results for the unsupervised case can be found in Table 2. This setting is rarely
considered in the literature as unsupervised HDA is regarded as a very difficult problem. In this
table, we do not provide scores for the no-adaptation baseline and SGW, as they require labelled data.
As one can expect, most of the methods fail in obtaining good classification accuracies in
this setting, despite having access to discriminative feature representations. Yet, COOT succeeds
in providing a meaningful mapping in some cases. The overall superior performance
of COOT highlights its strengths and underlines the limits of other HDA methods. First, COOT
does not depend on approximating empirical quantities from the data, contrary to CCA and
KCCA that rely on the estimation of the cross-covariance matrix, which is known to be flawed for
high-dimensional data with few samples [46]. Second, COOT takes into account the features of the
raw data that are more informative than the pairwise distances used in EGW. Finally, COOT avoids
the sign invariance issue discussed previously that hinders GW's capability to recover classes without
supervision, as illustrated for the MNIST-USPS problem before.

Domains CCA KCCA EGW COOT
C→W 14.20±8.60 21.30±15.64 10.55±1.97 25.50±11.76
W→C 13.35±3.70 18.60±9.44 10.60±0.94 35.40±14.61
W→W 10.95±2.36 13.25±6.34 10.25±2.26 37.10±14.57
W→A 14.25±8.14 23.00±22.95 9.50±2.47 34.25±13.03
A→C 11.40±3.23 11.50±9.23 11.35±1.38 17.40±8.86
A→W 19.65±17.85 28.35±26.13 11.60±1.30 30.95±18.19
A→A 11.75±1.82 14.20±4.78 13.10±2.35 42.85±17.65
C→C 12.00±4.69 14.95±6.79 12.90±1.46 42.85±18.44
C→A 15.35±6.30 23.35±17.61 12.95±2.63 33.25±15.93
Mean 13.66±2.55 18.72±5.33 11.42±1.24 33.28±7.61
p-value <.001 <.001 <.001 -
Table 2: Unsupervised HDA for nt = 0, from Decaf to GoogleNet task.
While traditional clustering methods present an important discovery tool for data analysis, they
discard the relationships that may exist between the features that describe the data samples. This
idea is the cornerstone of co-clustering [47] where, given a data matrix X ∈ R^{n×d} and the numbers of
sample (row) and feature (column) clusters denoted by g ≤ n and m ≤ d, respectively, we seek
to find Xc ∈ R^{g×m} that summarizes X in the best way possible.
COOT-clustering We look for Xc which is as close as possible to the original X w.r.t. COOT by
solving min_{Xc} COOT(X, Xc) = min_{π^s,π^v,Xc} ⟨L(X, Xc) ⊗ π^s, π^v⟩ with entropic regularization.
More precisely, we set w, w', v, v' as uniform, initialize Xc with random values and apply the BCD
algorithm of Section 2.2, updating Xc between the coupling updates, as sketched below.
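A hedged sketch of the extra block (ours, assuming the squared loss, for which the optimal Xc given the couplings has a closed form as a doubly weighted average) is:

```python
# Closed-form update of the summary matrix Xc given the couplings, for L = |·|^2:
# Xc[j, l] = sum_ik X[i, k] * pi_s[i, j] * pi_v[k, l] / (wc[j] * vc[l]).
import numpy as np

def update_Xc(X, pi_s, pi_v, wc, vc):
    """X: (n, d); pi_s: (n, g) sample/row-cluster coupling; pi_v: (d, m)
    feature/column-cluster coupling; wc, vc: cluster weights (second marginals)."""
    return (pi_s.T @ X @ pi_v) / np.outer(wc, vc)

# Co-cluster assignments can then be read off the couplings, e.g.
# row_clusters = pi_s.argmax(axis=1); col_clusters = pi_v.argmax(axis=1).
```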
Figure 4: Co-clustering with COOT on the Olivetti faces dataset. (left) Example images from the
dataset, (center) centroids estimated by COOT (right) clustering of the pixels estimated by COOT
where each color represents a cluster.
Data set K-means NMF DKM Tri-NMF GLBM ITCC RBC CCOT CCOT-GW COOT
D1 .018 ± .003 .042 ± .037 .025 ± .048 .082 ± .063 .021 ± .011 .021 ± .001 .017 ± .045 .018 ± .013 .004 ± .002 0
D2 .072 ± .044 .083 ± .063 .038 ± .000 .052 ± .065 .032 ± .041 .047 ± .042 .039 ± .052 .023 ± .036 .011 ± .056 .009 ± 0.04
D3 – – .310 ± .000 – .262 ± .022 .241 ± .031 – .031 ± .027 .008 ± .001 .04 ± .05
D4 .126 ± .038 – .145 ± .082 – .115 ± .047 .121 ± .075 .102 ± .071 .093 ± .032 .079 ± .031 0.068 ± 0.04
Table 3: Mean (± standard-deviation) of the co-clustering error (CCE) obtained for all configurations.
“-” indicates that the algorithm cannot find a partition with the requested number of co-clusters. All
the baselines results (first 9 columns) are from [48].
Simulated data We follow [48] where four scenarios with different numbers of co-clusters, degrees
of separation and sizes were considered (for details, see the supplementary materials). We choose
to evaluate COOT on simulated data as it provides us with the ground-truth for feature clusters
that are often unavailable for real-world data sets. As in [48], we use the same co-clustering
baselines including ITCC [49], Double K-Means (DKM) [50], Orthogonal Nonnegative Matrix
Tri-Factorizations (ONTMF) [51], the Gaussian Latent Block Models (GLBM) [52] and Residual
Bayesian Co-Clustering (RBC) [53] as well as the K-means and NMF run on both modes of the data
matrix, as clustering baselines. The performance of all methods is measured using the co-clustering
error (CCE) [54]. For all configurations, we generate 100 data sets and present the mean and standard
deviation of the CCE over all sets for all baselines in Table 3. Based on these results, we see that
our algorithm outperforms all the other baselines on D1, D2 and D4 data sets, while being behind
CCOT-GW proposed by [48] on D3. This result is rather strong as our method relies on the original
data matrix, while CCOT-GW relies on its kernel representation and thus benefits from the non-linear
information captured by it. Finally, we note that while both competing methods rely on OT, they
remain very different as CCOT-GW approach is based on detecting the positions and the number of
jumps in the scaling vectors of GW entropic regularized solution, while our method relies on coupling
matrices to obtain the partitions.
Olivetti Face dataset As a first application of COOT for the co-clustering problem on real data,
we propose to run the algorithm on the well known Olivetti faces dataset [55].
We take 400 images normalized between 0 and 1 and run our algorithm with g = 9 image clusters
and m = 40 feature (pixel) clusters. As before, we consider the empirical distributions supported on
images and features, respectively. The resulting centroids of the image clusters are given in Figure 4
and the pixel clusters are illustrated in its rightmost part. We can see that despite the high variability
in the data set, we still manage to recover detailed centroids, whereas ℓ2-based clustering methods, such as
standard NMF or k-means with an ℓ2-norm cost function, are known to provide blurry estimates in
this case. Finally, as in the MNIST-USPS example, COOT recovers spatially localized pixel clusters
with no prior information about the pixel relations.
M1 | M20
Shawshank Redemption (1994) | Police Story 4: Project S (Chao ji ji hua) (1993)
Schindler's List (1993) | Eye of Vichy, The (Oeil de Vichy, L') (1993)
Casablanca (1942) | Promise, The (Versprechen, Das) (1994)
Rear Window (1954) | To Cross the Rubicon (1991)
Usual Suspects, The (1995) | Daens (1992)
Table 4: Top 5 movies in clusters M1 and M20. The average rating of the top 5 rated movies in M1 is
4.42, while for M20 it is 1.
MovieLens We now evaluate our approach on the benchmark MovieLens-100K data set
(https://grouplens.org/datasets/movielens/100k/) that provides 100,000 user-movie ratings, on a scale
of one to five, collected from 943 users on 1682 movies. The main goal of our algorithm here is to
summarize the initial data matrix so that Xc reveals the blocks (co-clusters) of movies and users that
share similar tastes. We set the numbers of user and film clusters to g = 10 and m = 20, respectively,
as in [56].
In the obtained results, the first movie cluster consists of films with high ratings (3.92 on
average), while the last movie cluster includes movies with very low ratings (1.92 on average).
Among those, we show the 5 best/worst rated movies from these two clusters in Table 4. Overall,
our algorithm manages to find a coherent co-clustering structure in MovieLens-100K and obtains
results similar to those provided in [48, 56].
Acknowledgements
We thank Léo Gautheron, Guillaume Metzler and Raphaël Chevasson for proofreading the manuscript
before the submission. This work benefited from the support from OATMIL ANR-17-CE23-0012
project of the French National Research Agency (ANR). This work has been supported by the French
government, through the 3IA Côte d’Azur Investments in the Future project managed by the National
Research Agency (ANR) with the reference number ANR-19-P3IA-0002. This action benefited
from the support of the Chair ”Challenging Technology for Responsible Energy” led by l’X – Ecole
polytechnique and the Fondation de l’Ecole polytechnique, sponsored by TOTAL. We gratefully
acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this
research.
Broader impact
Despite its evident usefulness, the problem of finding the correspondences between two datasets is
rather general and may arise in many fields of machine learning. Consequently, it is quite difficult to
exhaustively state all the potential negative ethical impacts that may occur when using our method.
As described in the paper, it could be used to solve the so-called election isomorphism problem [27],
where one wants to find how similar two elections are based on the knowledge of votes and candidates.
Although having these types of datasets seems unrealistic in modern democracies, using our approach
on this problem runs the risk of breaking some privacy standards by revealing precisely how the
votes have been moved from one election to the other. Generally speaking, and when given access to
two datasets with sensitive data, our method is able to infer correspondences between instances and
features which could possibly lead to privacy issues for a malicious user. From a different perspective,
the Optimal Transport framework is known to be quite computationally expensive and even recent
improvements turn out to be super-linear in terms of computational complexity. It is not an
energy-free tool and in a time when carbon footprints must be drastically reduced, one should have in
mind the potential negative impact that computationally demanding algorithms might have on the
planet.
Supplementary materials
The supplementary is organized as follows. After the MNIST-USPS illustration (Section 2 of the
main paper), Section B presents the proof of Proposition 1 from the main paper and the computational
complexity of calculating the value of the COOT problem as mentioned in Section 2.3 of the
main paper. We provide the proofs for the equivalence of COOT to the Gromov-Wasserstein distance
(Propositions 2 and 3 from the main paper and algorithmic implications discussed after Proposition 3),
InvOT and election isomorphism problem in Section C. Finally, in Section D, we provide additional
experimental results for the heterogeneous domain adaptation problem and give the simulation details
for the co-clustering task.
Figure 5: Comparison between the coupling matrices obtained via GW and COOT on MNIST-USPS.
Figure 6: Linear mapping from USPS to MNIST using π^v. (First row) Original USPS samples,
(Second row) samples resized to the target resolution, (Third row) samples mapped using π^v, (Fourth
row) samples mapped using π^v with entropic regularization.
Figure 7: Linear mapping from MNIST to USPS using π^v. (First row) Original MNIST samples,
(Second row) samples resized to the target resolution, (Third row) samples mapped using π^v, (Fourth
row) samples mapped using π^v with entropic regularization.
Proof. The symmetry follows from the definition of COOT. To prove the triangle inequality of
COOT for arbitrary measures, we will use the gluing lemma (see [57]) which states the existence
of couplings with a prescribed structure. Let X ∈ R^{n×d}, X' ∈ R^{n'×d'}, X'' ∈ R^{n''×d''} be associated
with w ∈ Δ_n, v ∈ Δ_d, w' ∈ Δ_{n'}, v' ∈ Δ_{d'}, w'' ∈ Δ_{n''}, v'' ∈ Δ_{d''}. Without loss of generality, we can
suppose in the proof that all weights are different from zero (otherwise we can consider w̃_i = w_i if
w_i > 0 and w̃_i = 1 if w_i = 0; see the proof of Proposition 2.2 in [58]).
Let (π_1^s, π_1^v) and (π_2^s, π_2^v) be two couples of optimal solutions for the COOT problems associated
with COOT(X, X', w, w', v, v') and COOT(X', X'', w', w'', v', v'') respectively.
We define:
S_1 = π_1^s diag(1/w') π_2^s,   S_2 = π_1^v diag(1/v') π_2^v.
Then, it is easy to check that S_1 ∈ Π(w, w'') and S_2 ∈ Π(v, v'') (see e.g. Proposition 2.2 in [58]).
We now show the following:
COOT(X, X'', w, w'', v, v'') ≤(*) ⟨L(X, X'') ⊗ S_1, S_2⟩ = ⟨L(X, X'') ⊗ [π_1^s diag(1/w') π_2^s], [π_1^v diag(1/v') π_2^v]⟩
≤(**) ⟨[L(X, X') + L(X', X'')] ⊗ [π_1^s diag(1/w') π_2^s], [π_1^v diag(1/v') π_2^v]⟩
= ⟨L(X, X') ⊗ [π_1^s diag(1/w') π_2^s], [π_1^v diag(1/v') π_2^v]⟩ + ⟨L(X', X'') ⊗ [π_1^s diag(1/w') π_2^s], [π_1^v diag(1/v') π_2^v]⟩,
where in (*) we used the suboptimality of S_1, S_2 and in (**) the fact that L satisfies the triangle
inequality.
Now note that:
⟨L(X, X') ⊗ [π_1^s diag(1/w') π_2^s], [π_1^v diag(1/v') π_2^v]⟩ + ⟨L(X', X'') ⊗ [π_1^s diag(1/w') π_2^s], [π_1^v diag(1/v') π_2^v]⟩
= Σ_{i,j,k,l,e,o} L(X_{i,k}, X'_{e,o}) (π_1^s)_{i,e}(π_2^s)_{e,j}(π_1^v)_{k,o}(π_2^v)_{o,l} / (w'_e v'_o)
+ Σ_{i,j,k,l,e,o} L(X'_{e,o}, X''_{j,l}) (π_1^s)_{i,e}(π_2^s)_{e,j}(π_1^v)_{k,o}(π_2^v)_{o,l} / (w'_e v'_o)
=(*) Σ_{i,k,e,o} L(X_{i,k}, X'_{e,o}) (π_1^s)_{i,e}(π_1^v)_{k,o} + Σ_{l,j,e,o} L(X'_{e,o}, X''_{j,l}) (π_2^s)_{e,j}(π_2^v)_{o,l},
where (*) follows from summing over the marginals of π_2^s, π_2^v in the first term and over those of π_1^s, π_1^v in the second.
Overall, from the definition of π_1^s, π_1^v and π_2^s, π_2^v we have:
COOT(X, X'', w, w'', v, v'') ≤ COOT(X, X', w, w', v, v') + COOT(X', X'', w', w'', v', v'').
For the identity of indiscernibles, suppose that n = n', d = d' and that the weights w, w', v, v' are
uniform. Suppose that there exists a permutation of the samples σ₁ ∈ S_n and of the features σ₂ ∈ S_d
such that, for all (i, k) ∈ [[n]] × [[d]], X_{i,k} = X'_{σ₁(i),σ₂(k)}. We define the couplings π^s, π^v supported on the graphs
of the permutations σ₁, σ₂ respectively, i.e. π^s = (Id × σ₁) and π^v = (Id × σ₂). These couplings
have the prescribed marginals and lead to a zero cost, hence are optimal.
Conversely, as described in the paper, there always exists an optimal solution of (1) which lies
on extremal points of the polytopes Π(w, w') and Π(v, v'). When n = n', d = d' and uniform
weights are used, Birkhoff's theorem [24] states that the sets of extremal points of Π(1_n/n, 1_n/n) and
Π(1_d/d, 1_d/d) are the sets of permutation matrices, so there exists an optimal solution (π^s_∗, π^v_∗) supported
on σ^s_∗, σ^v_∗ respectively, with σ^s_∗, σ^v_∗ ∈ S_n × S_d. Then, if COOT(X, X') = 0, it implies that
Σ_{i,k} L(X_{i,k}, X'_{σ^s_∗(i),σ^v_∗(k)}) = 0. If L = |·|^p, then X_{i,k} = X'_{σ^s_∗(i),σ^v_∗(k)}, which gives the desired result.
If n ≠ n', d ≠ d', the COOT cost is always strictly positive as there exists a strictly positive element
outside the diagonal.
As mentioned in [20], if L can be written as L(a, b) = f_1(a) + f_2(b) − h_1(a)h_2(b), then we have that
L(X, X') ⊗ π^s = C_{X,X'} − h_1(X)^T π^s h_2(X'),
where C_{X,X'} = f_1(X)^T w 1_{d'}^T + 1_d w'^T f_2(X') depends only on the data and the marginals, so that the latter can be computed in O(ndd' + n'dd') =
O((n + n')dd'). To compute the final cost, we must also calculate the scalar product with π^v, which can
be done in O(n'²n), making the complexity of ⟨L(X, X') ⊗ π^s, π^v⟩ equal to O((n + n')dd' + n'²n).
Finally, as the cost is symmetric w.r.t. π^s, π^v, we obtain the overall complexity of O(min{(n +
n')dd' + n'²n, (d + d')nn' + d'²d}).
As pointed out in [35], we can relate the solutions of a QAP and a BAP using the following theorem:
Theorem 1. If Q is a positive semi-definite matrix, then the problems
max_x f(x) = c^T x + ½ x^T Q x   s.t. Ax = b, x ≥ 0    (4)
max_{x,y} g(x, y) = ½ c^T x + ½ c^T y + ½ x^T Q y   s.t. Ax = b, Ay = b, x, y ≥ 0    (5)
are equivalent. More precisely, if x∗ is an optimal solution for (4), then (x∗, x∗) is a solution for (5),
and if (x∗, y∗) is optimal for (5), then both x∗ and y∗ are optimal for (4).
Proof. This proof follows the proof of Theorem 2.2 in [35]. Let z∗ be optimal for (4) and (x∗, y∗)
be optimal for (5). Then, by definition, for all x satisfying the constraints of (4), f(z∗) ≥ f(x).
In particular, f(z∗) ≥ f(x∗) = g(x∗, x∗) and f(z∗) ≥ f(y∗) = g(y∗, y∗). Also, g(x∗, y∗) ≥
max_{x : Ax=b, x≥0} g(x, x) = f(z∗).
To prove the theorem, it suffices to prove that
f(y∗) = f(x∗) = g(x∗, y∗)    (6)
since, in this case, g(x∗, y∗) = f(x∗) ≥ f(z∗) and g(x∗, y∗) = f(y∗) ≥ f(z∗).
Let us prove (6). Since (x∗, y∗) is optimal, we have:
0 ≤ g(x∗, y∗) − g(x∗, x∗) = ½ c^T(y∗ − x∗) + ½ x∗^T Q(y∗ − x∗),
0 ≤ g(x∗, y∗) − g(y∗, y∗) = ½ c^T(x∗ − y∗) + ½ y∗^T Q(x∗ − y∗).
By adding these inequalities we obtain:
(x∗ − y∗)^T Q(x∗ − y∗) ≤ 0.
Since Q is positive semi-definite, this implies that Q(x∗ − y∗ ) = 0. So, using previous inequalities,
we have cT (x∗ − y∗ ) = 0, hence g(x∗ , y∗ ) = g(x∗ , x∗ ) = g(y∗ , y∗ ) as required.
Note also that this result holds when we add a constant term to the cost function.
We now prove all the theorems from Section 3 of the main paper. We first recall the GW problem
for two matrices C, C':
GW(C, C', w, w') = min_{π^s ∈ Π(w,w')} ⟨L(C, C') ⊗ π^s, π^s⟩.    (7)
We will now prove Proposition 2 of the main paper, stated as follows.
Proposition 2. Let L = |·|² and suppose that C ∈ R^{n×n}, C' ∈ R^{n'×n'} are squared Euclidean
distance matrices such that C = x1_n^T + 1_n x^T − 2XX^T, C' = x'1_{n'}^T + 1_{n'}x'^T − 2X'X'^T with
x = diag(XX^T), x' = diag(X'X'^T). Then, the GW problem can be written as a concave quadratic
program (QP) whose Hessian reads Q = −4 XX^T ⊗_K X'X'^T.
This result is a consequence of the following lemma.
Lemma 1. With previous notations and hypotheses, the GW problem can be formulated as:
GW(C, C', w, w') = min_{π^s ∈ Π(w,w')} −4 vec(M)^T vec(π^s) − 8 vec(π^s)^T Q vec(π^s) + Cte
with
M = xx'^T − 2xw'^T X'X'^T − 2XX^T wx'^T  and  Q = XX^T ⊗_K X'X'^T,
Cte = Σ_{i,j} ||x_i − x_j||₂⁴ w_i w_j + Σ_{i,j} ||x'_i − x'_j||₂⁴ w'_i w'_j − 4 w^T x w'^T x'.
Proof. Using the results in [20] for L = |·|², we have L(C, C') ⊗ π^s = c_{C,C'} − 2Cπ^sC' with
c_{C,C'} = (C)² w 1_{n'}^T + 1_n w'^T (C')², where (C)² = (C²_{i,j}) is applied element-wise.
We now have that
⟨Cπ^sC', π^s⟩ = tr(π^{sT}(x1_n^T + 1_n x^T − 2XX^T) π^s (x'1_{n'}^T + 1_{n'}x'^T − 2X'X'^T)).
Moreover, since
tr(π^{sT} XX^T π^s x'1_{n'}^T) = tr(1_{n'}^T π^{sT} XX^T π^s x') = tr(w^T XX^T π^s x') = tr(π^{sT} XX^T w x'^T)
and tr(w' x^T π^s X'X'^T) = tr(π^{sT} x w'^T X'X'^T), expanding the product and using these identities we can simplify the expression to obtain:
⟨Cπ^sC', π^s⟩ = 2 w^T x w'^T x' + 2⟨xx'^T − 2xw'^T X'X'^T − 2XX^T w x'^T, π^s⟩ + 4 tr(π^{sT} XX^T π^s X'X'^T).
The term 2w^T x w'^T x' is constant since it does not depend on the coupling. Also, we can verify that
⟨c_{C,C'}, π^s⟩ does not depend on π^s as follows:
⟨c_{C,C'}, π^s⟩ = Σ_{i,j} ||x_i − x_j||₂⁴ w_i w_j + Σ_{i,j} ||x'_i − x'_j||₂⁴ w'_i w'_j,
implying that:
⟨c_{C,C'} − 2Cπ^sC', π^s⟩ = Cte − 4⟨xx'^T − 2xw'^T X'X'^T − 2XX^T w x'^T, π^s⟩ − 8 tr(π^{sT} XX^T π^s X'X'^T).
We can rewrite this equation as stated in the proposition using the vec operator.
Using the standard QP form c^T x + ½ x^T Q' x with c = −4 vec(M) and Q' = −4 XX^T ⊗_K X'X'^T,
we see that the Hessian is negative semi-definite as the opposite of a Kronecker product of the positive
semi-definite matrices XX^T and X'X'^T.
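The decomposition L(C, C') ⊗ π^s = c_{C,C'} − 2Cπ^sC' used at the beginning of this proof can also be checked numerically; the following small numpy script (ours, with illustrative names) compares it against a brute-force evaluation of the tensor product.

```python
# Numerical sanity check of L(C, C') ⊗ pi = c_{C,C'} - 2 C pi C'^T for L = |·|^2.
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4
C, Cp = rng.random((n, n)), rng.random((m, m))
w, wp = np.full(n, 1 / n), np.full(m, 1 / m)
pi = np.outer(w, wp)                                   # a coupling with marginals w, wp

# brute force: T[i, j] = sum_kl (C[i, k] - Cp[j, l])^2 * pi[k, l]
diff = C[:, None, :, None] - Cp[None, :, None, :]
brute = np.einsum('ijkl,kl->ij', diff ** 2, pi)

# factored form: c_{C,C'} - 2 C pi C'^T, with c depending only on the marginals
c_const = np.outer((C ** 2) @ w, np.ones(m)) + np.outer(np.ones(n), (Cp ** 2) @ wp)
fast = c_const - 2 * C @ pi @ Cp.T

assert np.allclose(brute, fast)
```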
Using the previous propositions, we are able to prove Proposition 3 of the paper.
Proposition 3. Let C ∈ R^{n×n}, C' ∈ R^{n'×n'} be any symmetric matrices, then:
COOT(C, C', w, w', w, w') ≤ GW(C, C', w, w').
The converse is also true under the hypothesis of Proposition 2. In this case, if (π^s_∗, π^v_∗) is an optimal
solution of (1), then both π^s_∗, π^v_∗ are solutions of (7). Conversely, if π^s_∗ is an optimal solution of (7),
then (π^s_∗, π^s_∗) is an optimal solution for (1).
Proof. The first inequality follows from the fact that any optimal solution of the GW problem is
an admissible solution for the COOT problem, hence the inequality is true by suboptimality of this
optimal solution.
For the equality part, by following the same calculus as in the proof of Proposition 1, we can verify
that:
COOT(C, C', w, w', w, w') = min_{π^s ∈ Π(w,w')} −2 vec(M)^T vec(π^s)
Let us first recall the general algorithm used for solving COOT for arbitrary datasets.
Using Proposition 3, we know that when X = C, X' = C' are squared Euclidean distance matrices,
there is an optimal solution of the form (π∗, π∗). In this case, we can set π^s_(k) = π^v_(k) during the
iterations of Algorithm 2 to obtain an optimal solution for both COOT and GW. This reduces to
Algorithm 3, which corresponds to a DC algorithm where the quadratic form is replaced by its linear
upper bound.
Below, we prove that this DC algorithm for solving GW problems is equivalent to the Frank-Wolfe
(FW) based algorithm presented in [33] and recalled in Algorithm 4, when L = |·|² and for squared
Euclidean distance matrices C, C'.

Algorithm 3 DC Algorithm for COOT and GW with squared Euclidean distance matrices
1: Input: maxIt, thd
2: π^s_(0) ← ww'^T
3: while k < maxIt and err > thd do
4:   π^s_(k) ← OT(w, w', L(C, C') ⊗ π^s_(k−1))
5:   err ← ||π^s_(k−1) − π^s_(k)||_F
6:   k ← k + 1
The case when L = |·|² and C, C' are squared Euclidean distance matrices has interesting implications
in practice, since in this case the resulting GW problem is a concave QP (as explained in the
paper and shown in Lemma 1 of this supplementary). In [59], the authors investigated the solution
of QPs with conditionally concave energies using a FW algorithm and showed that in this case the
line-search step of FW is always 1. Moreover, as shown in Lemma 1, the GW problem can be
written as a concave QP with concave energy and is thus a fortiori minimizing a conditionally concave
energy. Consequently, the line-search step of the FW algorithm proposed in [33] and described
in Algorithm 4 always leads to an optimal line-search step of 1. In this case, Algorithm 4 is
equivalent to Algorithm 5 given below, since τ(k) = 1 for all k.
Finally, by noticing that in step 3 of Algorithm 5 the gradient of (7) w.r.t. π^s_(k−1) is 2L(C, C') ⊗
π^s_(k−1), which gives the same OT solution as for the OT problem in Algorithm 3, we can
conclude that the iterations of both algorithms are equivalent.
C.4 Relation with Invariant OT
The objective of this part is to prove the connections between GW, COOT and InvOT [10], defined as
follows:
InvOT_p^L(X, X') := min_{π ∈ Π(w,w')} min_{f ∈ F_p} ⟨M_f, π⟩_F,
where (M_f)_{ij} = L(x_i, f(x'_j)) and F_p is a space of matrices with bounded Schatten p-norms, i.e.,
F_p = {P ∈ R^{d×d} : ||P||_p ≤ k_p}.
We prove the following result.
Proposition 4. Using previous notations, let L = |·|², p = 2 (i.e. F₂ = {P ∈ R^{d×d} : ||P||_F = √d})
and consider cosine similarities C = XX^T, C' = X'X'^T. Suppose that X' is w'-whitened, i.e.
X'^T diag(w') X' = I. Then InvOT_2^L(X, X'), COOT(C, C') and GW(C, C') are equivalent, namely
any optimal coupling of one of these problems is a solution to the others.
Proof. For GW, we refer the reader to [33, Equation 6]. For COOT we have:
COOT(C, C', w, w') = min_{π^s ∈ Π(w,w'), π^v ∈ Π(w,w')} ⟨L(C, C') ⊗ π^s, π^v⟩
= min_{π^s, π^v} ½⟨L(C, C') ⊗ π^s, π^v⟩ + ½⟨L(C, C') ⊗ π^s, π^v⟩
= min_{π^s, π^v} ½⟨L(C, C') ⊗ π^s, π^v⟩ + ½⟨L(C, C') ⊗ π^v, π^s⟩
= min_{π^s, π^v} ½⟨c_{C,C'}, π^s⟩ + ½⟨c_{C,C'}, π^v⟩ − 2⟨Cπ^sC', π^v⟩.
The last equality gives the desired result.
This section shows that the COOT approach can be used to solve the election isomorphism problem
defined in [27] as follows: let E = (C, V) and E' = (C', V') be two elections, where C =
{c₁, ..., c_m} (resp. C') denotes a set of candidates and V = (v₁, ..., v_n) (resp. V') denotes a set of
voters, where each voter v_i has a preference order, also denoted by v_i. The two elections E = (C, V)
and E' = (C', V'), where |C| = |C'|, V = (v₁, ..., v_n), and V' = (v'₁, ..., v'_n), are said to be
isomorphic if there exists a bijection σ : C → C' and a permutation ν ∈ S_n such that σ(v_i) = v'_{ν(i)}
for all i ∈ [n]. The authors further propose a distance underlying this problem defined as follows:
d-ID(E, E') = min_{ν ∈ S_n} min_{σ ∈ Π(C,C')} Σ_{i=1}^n d(σ(v_i), v'_{ν(i)}),
where S_n denotes the set of all permutations over {1, ..., n}, Π(C, C') is a set of bijections and
d is an arbitrary distance between preference orders. The authors of [27] compute d-ID(E, E')
in practice by expressing it as the following Integer Linear Programming problem over the tensor
P_{ijkl} = M_{ij}N_{kl}, where M ∈ R^{m×m}, N ∈ R^{n×n}:
min_{P,N,M} Σ_{i,j,k,l} P_{k,l,i,j} |pos_{v_i}(c_k) − pos_{v'_j}(c'_l)|    (8)
where pos_{v_i}(c_k) denotes the position of candidate c_k in the preference order of voter v_i. Let us
now define two matrices X and X' such that X_{i,k} = pos_{v_i}(c_k) and X'_{j,l} = pos_{v'_j}(c'_l), and denote
by π^s_∗, π^v_∗ a minimizer of COOT(X, X', 1_n/n, 1_n/n, 1_m/m, 1_m/m) with L = |·| and by N_∗, M_∗
the minimizers of problem (8), respectively.
As shown in the main paper, there exists an optimal solution for COOT(X, X') given by permutation
matrices, as solutions of the Monge-Kantorovich problems for uniform distributions supported on
the same number of elements. Then, one may show that the solutions of the two problems coincide
modulo a multiplicative factor, i.e., π^s_∗ = (1/n)N_∗ and π^v_∗ = (1/m)M_∗ are optimal since |C| = |C'| and
|V| = |V'|. For π^s_∗ (the same reasoning holds for π^v_∗ as well), we have that
(π^s_∗)_{ij} = 1/n if j = ν∗(i), and 0 otherwise,
where ν∗ is a permutation of voters in the two sets. The only difference between the two solutions π^s_∗
and N_∗ thus stems from the marginal constraints in (8). To conclude, we note that COOT is a more general
approach as it is applicable for general loss functions L, contrary to the Spearman distance used in
[27], and generalizes to the cases where n ≠ n' and m ≠ m'.
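As a small illustration of this reduction (our own sketch; the voter orders and indexing conventions are hypothetical), the matrices X and X' can be built directly from the preference orders and then passed to a COOT solver with L = |·|.

```python
# Build the position matrix of an election: X[i, k] = position (rank, starting at 1)
# of candidate k in voter i's preference order.
import numpy as np

def position_matrix(votes):
    """votes: list of preference orders, each a list of candidate indices
    ordered from most to least preferred."""
    n, m = len(votes), len(votes[0])
    X = np.empty((n, m))
    for i, order in enumerate(votes):
        for rank, cand in enumerate(order):
            X[i, cand] = rank + 1
    return X
```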
Here, we present the results for the heterogeneous domain adaptation experiment not included in
the main paper due to the lack of space. Table 5 follows the same experimental protocol as in the
paper but shows the two cases where nt = 1 and nt = 5. Table 6 and Table 7 contain the results for
the adaptation from GoogleNet to Decaf features, in the semi-supervised and unsupervised scenarios,
respectively. Overall, the results are coherent with those from the main paper: in both settings, when
nt = 5, one can see that the performance differences between SGW and COOT are rather significant.
Table 8 below summarizes the characteristics of the simulated data sets used in our experiment.
E Initialization’s impact
We conducted a study regarding the convergence properties of COOT in the co-clustering application
when π^s, π^v and Xc are initialized randomly, over 100 trials. This leads to a certain variance in the
obtained value of the COOT distance, as expected when solving a non-convex problem. The obtained
CCEs remain largely in line with the results reported above, even for different random initializations.
References
[1] Zhen Cui, Hong Chang, Shiguang Shan, and Xilin Chen. Generalized unsupervised manifold
alignment. In NIPS, pages 2429–2437, 2014.
Decaf → GoogleNet
Domains Baseline CCA KCCA EGW SGW COOT
nt = 1
C→W 30.47±6.90 13.37±7.23 29.21±13.14 10.21±1.31 66.95±7.61 77.74±4.80
W→C 26.53±7.75 16.26±5.18 40.68±12.02 10.11±0.84 80.16±4.78 87.89±2.65
W→W 30.63±7.78 13.42±1.38 36.74±8.38 8.68±2.36 78.32±5.86 89.11±2.78
W→A 30.21±7.51 12.47±2.99 39.11±6.85 9.42±2.90 80.00±3.24 89.05±2.84
A→C 41.89±6.59 12.79±2.95 28.84±6.24 9.89±1.17 72.00±8.91 84.21±3.92
A→W 39.84±4.27 19.95±23.40 38.16±19.30 12.32±1.56 75.84±7.37 89.42±4.24
A→A 42.68±8.36 15.21±7.36 38.26±16.99 13.63±2.93 75.53±6.25 91.84±2.48
C→C 28.58±7.40 18.37±17.81 35.11±17.96 11.05±1.63 61.21±8.43 78.11±5.77
C→A 31.63±4.25 15.11±5.10 33.84±9.10 11.84±1.67 66.26±7.95 82.11±2.58
Mean 33.61±5.77 15.22±2.44 35.55±3.98 10.80±1.47 72.92±6.37 85.50±4.89
nt = 5
C→W 74.27±5.53 14.53±7.37 73.27±4.99 11.40±1.13 84.00±3.99 85.53±2.67
W→C 90.27±2.67 21.13±6.85 85.00±3.44 10.60±1.05 95.20±2.84 94.53±1.83
W→W 90.93±2.50 15.80±3.27 90.67±2.95 9.80±2.60 95.40±2.47 94.93±2.70
W→A 90.47±2.92 16.67±4.85 87.93±2.47 9.80±2.68 95.40±1.53 95.80±2.15
A→C 88.33±2.33 15.73±4.64 83.13±2.84 10.40±1.89 84.47±5.81 91.47±1.45
A→W 88.40±3.17 13.60±6.25 87.27±2.82 11.87±2.40 87.87±4.66 93.00±1.96
A→A 86.20±3.08 14.07±2.93 87.00±3.48 14.07±1.65 89.80±2.58 92.20±1.69
C→C 75.93±4.83 13.13±2.98 70.47±3.45 11.13±1.52 85.73±3.54 84.60±2.32
C→A 73.47±3.62 15.47±6.50 74.13±5.42 11.20±2.47 85.07±3.26 87.20±1.78
Mean 84.25±7.01 15.57±2.25 82.10±7.03 11.14±1.23 89.21±4.64 91.03±3.97
Table 5: Semi-supervised Heterogeneous Domain Adaptation results for adaptation from Decaf
to GoogleNet representations with different values of nt. Note that the case nt = 3 is provided in the
main paper (Table 1).
[2] S. Haker and A. Tannenbaum. Optimal mass transport and image registration. In Proceedings
IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 29–36, 2001.
[3] Reinhard Rapp. Identifying word translations in non-parallel texts. In ACL, pages 320–322,
1995.
[4] John C. Gower and Garmt B. Dijksterhuis. Procrustes problems, volume 30 of Oxford Statistical
Science Series. Oxford University Press, 2004.
[5] Colin Goodall. Procrustes methods in the statistical analysis of shape. Journal of the Royal
Statistical Society: Series B (Methodological), 53(2):285–321, 1991.
[6] Chang Wang and Sridhar Mahadevan. Manifold alignment without correspondence. In IJCAI,
page 1273–1278, 2009.
[7] Anand Rangarajan, Haili Chui, and Fred L. Bookstein. The softassign procrustes matching
algorithm. In Information Processing in Medical Imaging, pages 29–42, 1997.
[8] Paul J. Besl and Neil D. McKay. A method for registration of 3-d shapes. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 14:239–256, 1992.
[9] Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point cloud registration,
2020.
[10] David Alvarez-Melis, Stefanie Jegelka, and Tommi S. Jaakkola. Towards optimal transport with
global invariances. In AISTATS, volume 89, pages 1870–1879, 2019.
[11] Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddings
with wasserstein procrustes. In AISTATS, pages 1880–1890, 2019.
[12] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport
for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
39(9):1853–1865, 2017.
GoogleNet → Decaf
Domains Baseline CCA KCCA EGW SGW COOT
nt = 1
C→A 31.16±6.87 12.16±2.78 33.32±2.47 7.00±2.11 77.16±8.00 83.26±5.00
C→C 30.42±3.73 13.74±5.29 32.58±9.98 12.47±2.81 76.63±8.31 86.21±3.26
W→A 37.68±4.04 15.79±3.71 34.58±5.71 14.32±1.77 86.68±1.90 89.95±3.43
A→C 35.95±3.89 15.32±8.18 40.16±17.54 13.21±3.49 87.89±4.03 90.68±7.54
A→A 36.89±4.73 13.84±2.47 34.84±10.44 13.16±1.56 89.79±3.93 94.68±2.21
W→W 32.05±4.63 19.89±11.82 36.26±21.98 10.00±2.59 84.21±4.55 90.42±2.66
W→C 32.68±5.56 21.53±21.01 33.79±22.72 11.47±3.03 86.26±3.41 89.53±1.92
A→W 33.84±4.75 16.00±7.74 39.32±18.94 11.00±4.01 87.21±3.67 91.53±5.85
C→W 32.32±7.76 15.58±7.72 34.05±15.96 12.89±2.52 81.84±3.51 84.84±5.71
Mean 33.67±2.45 15.98±2.81 35.43±2.50 11.73±2.08 84.19±4.43 89.01±3.38
nt = 3
C→A 76.35±4.15 17.47±3.45 73.94±4.53 7.41±2.27 88.24±2.23 89.88±0.94
C→C 78.94±3.61 18.18±3.44 69.94±3.51 14.18±3.16 89.71±2.25 91.06±1.91
W→A 85.41±3.25 19.29±3.10 80.59±3.82 14.24±2.72 94.76±1.45 95.29±2.35
A→C 89.53±4.05 23.18±7.17 80.59±6.30 13.88±2.69 93.76±2.72 94.76±1.83
A→A 89.76±1.92 17.00±3.11 83.71±3.30 14.41±2.28 93.29±2.09 95.53±1.45
W→W 86.65±5.07 21.88±4.78 84.65±3.67 9.94±2.37 94.88±1.79 94.53±1.66
W→C 88.94±5.02 22.59±9.23 80.06±5.65 13.65±3.15 96.18±1.15 95.29±2.91
A→W 90.29±1.35 22.35±7.00 87.88±2.53 13.88±3.60 94.53±1.54 95.35±1.59
C→W 78.59±3.44 22.53±13.42 80.12±2.95 11.59±3.25 89.29±1.86 89.59±2.22
Mean 84.94±5.19 20.50±2.34 80.16±5.12 12.58±2.31 92.74±2.72 93.48±2.38
nt = 5
C→A 84.20±2.65 18.60±3.75 84.33±2.33 6.40±1.27 92.13±2.61 91.93±2.05
C→C 85.33±2.76 21.80±5.91 78.60±2.74 13.47±2.00 91.33±2.48 92.27±2.67
W→A 95.13±2.29 31.00±9.67 91.93±2.82 14.67±1.40 96.13±2.04 96.40±1.84
A→C 91.67±2.60 21.80±4.35 85.33±3.27 13.40±3.63 95.47±1.51 94.87±1.27
A→A 93.20±1.57 23.33±4.66 89.67±1.98 13.27±2.10 95.33±1.07 95.00±1.37
W→W 95.00±2.33 23.80±5.48 92.13±1.78 11.20±2.58 96.47±1.93 96.67±1.37
W→C 95.67±1.50 28.27±9.71 87.67±3.79 14.27±3.19 97.67±1.31 96.93±2.25
A→W 92.13±2.36 22.67±3.94 89.20±3.14 11.67±2.50 93.60±1.40 94.27±2.11
C→W 84.00±3.45 20.40±4.31 82.53±3.56 11.07±3.70 90.20±2.23 92.40±1.69
Mean 90.70±4.57 23.52±3.64 86.82±4.26 12.16±2.37 94.26±2.42 94.53±1.85
Table 6: Semi-supervised Heterogeneous Domain Adaptation results for adaptation from GoogleNet
to Decaf representations with different values of nt.
GoogleNet → Decaf
Domains CCA KCCA EGW COOT
C→A 11.30±4.04 14.60±8.12 8.20±2.69 25.10±11.52
C→C 13.35±4.32 17.75±10.16 11.90±2.99 37.20±14.07
W→A 14.55±10.68 25.05±24.73 14.55±2.05 39.75±17.29
A→C 13.80±6.51 20.70±17.94 16.00±2.44 30.25±18.71
A→A 16.90±10.45 28.95±30.62 12.70±1.79 41.65±16.66
W→W 14.50±6.72 24.05±19.35 9.55±1.77 36.85±9.20
W→C 13.15±4.98 14.80±8.79 11.40±2.65 30.95±17.18
A→W 10.85±4.62 14.40±12.36 12.70±2.99 40.85±16.21
C→W 18.25±14.02 25.90±25.40 11.30±3.87 34.05±13.82
Mean 14.07±2.25 20.69±5.22 12.03±2.23 35.18±5.24
Table 7: Unsupervised Heterogeneous Domain Adaptation results for adaptation from GoogleNet
to Decaf representations.
Data set n×d g×m Overlapping Proportions
D1 600 × 300 3×3 [+] Equal
D2 600 × 300 3×3 [+] Unequal
D3 300 × 200 2×4 [++] Equal
D4 300 × 300 5×4 [++] Unequal
Table 8: Size (n × d), number of co-clusters (g × m), degree of overlapping ([+] for well-separated
and [++] for ill-separated co-clusters) and the proportions of co-clusters for simulated data sets.
Data set Runtime (s) BCD #iter. (COOT+Xc) BCD #iter. (COOT) COOT value
D1 4.72±6 21.5±24.57 3.16±0.37 0.46±0.25
D2 0.64±0.81 9.77±11.53 3.4±0.58 1.35±0.16
D3 0.95±1.55 8.47±11.11 3.01±0.1 2.52±0.24
D4 6.27±5.13 33.15±23.75 4.21±0.41 0.06±0.005
Table 9: Mean (± standard-deviation) of different runtime characteristics of COOT.
[13] Facundo Memoli. Gromov wasserstein distances and the metric approach to object matching.
Foundations of Computational Mathematics, pages 1–71, 2011.
[14] Justin Solomon, Gabriel Peyré, Vladimir G. Kim, and Suvrit Sra. Entropic metric alignment for
correspondence problems. ACM Transactions on Graphics, 35(4):1–13, 2016.
[15] David Alvarez-Melis and Tommi S. Jaakkola. Gromov-Wasserstein Alignment of Word Embed-
ding Spaces. In EMNLP, pages 1881–1890, 2018.
[16] Yuguang Yan, Wen Li, Hanrui Wu, Huaqing Min, Mingkui Tan, and Qingyao Wu. Semi-
supervised optimal transport for heterogeneous domain adaptation. In IJCAI, pages 2969–2975,
2018.
[17] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS,
pages 2292–2300, 2013.
[18] Jason Altschuler, Jonathan Niles-Weed, and Philippe Rigollet. Near-linear time approximation
algorithms for optimal transport via sinkhorn iteration. In NeurIPS, pages 1964–1974, 2017.
[19] Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample
complexity of sinkhorn divergences. In ICML, pages 1574–1583, 2019.
[20] Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and
distance matrices. In ICML, pages 2664–2672, 2016.
[21] Giorgio Gallo and Aydin Ülkücü. Bilinear programming: An exact algorithm. Mathematical
Programming, 12:173–194, 1977.
[22] Panos M. Pardalos and J. Ben Rosen, editors. Bilinear programming methods for nonconvex
quadratic problems, pages 75–83. Springer Berlin Heidelberg, 1987.
[23] R. Horst and H. Tuy. Global Optimization: Deterministic Approaches. Springer Berlin
Heidelberg, 1996.
[24] Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucumán Rev. Ser. A,
1946.
[25] Ante Custic, Vladyslav Sokol, Abraham Punnen, and Binay Bhattacharya. The bilinear assign-
ment problem: Complexity and polynomially solvable special cases. Mathematical Program-
ming, 166, 2016.
[26] David Alvarez-Melis and Nicolò Fusi. Geometric dataset distances via optimal transport, 2020.
[27] P. Faliszewski, P. Skowron, A. Slinko, S. Szufa, and N. Talmon. How similar are two elections?
In AAAI, pages 1909–1916, 2019.
[28] Charlotte Bunne, David Alvarez-Melis, Andreas Krause, and Stefanie Jegelka. Learning
generative models across incomparable spaces. arXiv preprint arXiv:1905.05461, 2019.
[29] Hiroshi Konno. A cutting plane algorithm for solving bilinear programs. Math. Program.,
11(1):14–27, 1976.
[30] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. Massively scalable
sinkhorn distances via the nyström method. In NeurIPS, pages 4429–4439, 2019.
[31] Mokhtar Z. Alaya, Maxime Berar, Gilles Gasso, and Alain Rakotomamonjy. Screening Sinkhorn
Algorithm for Regularized Optimal Transport. In NeurIPS, pages 12169–12179, 2019.
[32] Lénaïc Chizat. Unbalanced Optimal Transport: Models, Numerical Methods, Applications.
PhD thesis, PSL Research University, November 2017.
[33] Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Optimal
transport for structured data with application on graphs. In ICML, pages 6275–6284, 2019.
[34] Danielle Ezuz, Justin Solomon, Vladimir G. Kim, and Mirela Ben-Chen. GWCNN: A Metric
Alignment Layer for Deep Shape Analysis. Computer Graphics Forum, 36(5):49–57, 2017.
[35] Hiroshi Konno. Maximization of a convex quadratic function under linear constraints. Mathe-
matical Programming, 11(1):117–127, 1976.
[36] Pham Dinh Tao et al. The dc (difference of convex functions) programming and dca revisited
with dc models of real world nonconvex optimization problems. Annals of operations research,
133(1-4):23–46, 2005.
[37] Alan L Yuille and Anand Rangarajan. The concave-convex procedure. Neural computation,
15(4):915–936, 2003.
[38] John Lee, Max Dabagia, Eva Dyer, and Christopher Rozell. Hierarchical optimal transport for
multimodal distribution alignment. In NeurIPS, pages 13474–13484, 2019.
[39] Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M.
Solomon. Hierarchical optimal transport for document representation. In NeurIPS, pages
1599–1609, 2019.
[40] Yi-Ren Yeh, Chun-Hao Huang, and Yu-Chiang Frank Wang. Heterogeneous domain adapta-
tion and classification by exploiting the correlation subspace. IEEE Transactions on Image
Processing, 23(5):2009–2018, 2014.
[41] I. Redko, N. Courty, R. Flamary, and D. Tuia. Optimal transport for multi-source domain
adaptation under target shift. In International Conference on Artificial Intelligence and Statistics
(AISTAT), 2019.
[42] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains.
In ECCV, LNCS, pages 213–226, 2010.
[43] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep
convolutional activation feature for generic visual recognition. In ICML, 2014.
[44] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
In CVPR, pages 1–9, 2015.
[45] Yuguang Yan, Wen Li, Michael Ng, Mingkui Tan, Hanrui Wu, Huaqing Min, and Qingyao Wu.
Learning discriminative correlation subspace for heterogeneous domain adaptation. In IJCAI,
pages 3252–3258, 2017.
[46] Yang Song, Peter J. Schreier, David Ramírez, and Tanuj Hasija. Canonical correlation analysis
of high-dimensional data with very small sample support. Signal Process., 128:449–458, 2016.
[47] J. A. Hartigan. Direct Clustering of a Data Matrix. Journal of the American Statistical
Association, 67(337):123–129, 1972.
[48] Charlotte Laclau, Ievgen Redko, Basarab Matei, Younès Bennani, and Vincent Brault. Co-
clustering through optimal transport. In ICML, pages 1955–1964, 2017.
[49] Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S. Modha. Information-theoretic
co-clustering. In SIGKDD, pages 89–98, 2003.
[50] R. Rocci and M. Vichi. Two-mode multi-partitioning. Computational Statistics and Data
Analysis, 52(4):1984–2003, 2008.
[51] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for
clustering. In Proceedings ACM SIGKDD, pages 126–135, 2006.
[52] M. Nadif and G. Govaert. Algorithms for model-based block gaussian clustering. In DMIN’08,
the 2008 International Conference on Data Mining, 2008.
[53] Hanhuai Shan and Arindam Banerjee. Residual bayesian co-clustering for matrix approximation.
In SDM, pages 223–234, 2010.
[54] A. Patrikainen and M. Meila. Comparing subspace clusterings. IEEE Transactions on Knowl-
edge and Data Engineering, 18(7):902–916, 2006.
[55] Ferdinando S Samaria and Andy C Harter. Parameterisation of a stochastic model for human
face identification. In Proceedings of 1994 IEEE workshop on applications of computer vision,
pages 138–142. IEEE, 1994.
[56] Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, and Dharmendra S.
Modha. A generalized maximum entropy approach to bregman co-clustering and matrix
approximation. Journal of Machine Learning Research, 8:1919–1986, 2007.
[57] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wis-
senschaften. Springer, 2009 edition, September 2008.
[58] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends®
in Machine Learning, 11:355–607, 2019.
[59] Haggai Maron and Yaron Lipman. (probably) concave graph matching. In NeurIPS, pages
408–418, 2018.