Isometry Pursuit
1 Introduction
Many real-world problems may be abstracted as selecting a subset of the columns of a matrix representing stochastic observations or analytically exact data. This paper focuses on a simple such problem that appears in interpretable learning and diversification. Given a rank-D matrix X ∈ R^{D×P} with P > D, select a square submatrix X_{·S}, where the subset S ⊂ [P] satisfies |S| = D, that is as orthonormal as possible.
This problem arises in interpretable learning specifically because while the coordinate functions of a
given feature space may have no intrinsic meaning, it is sometimes possible to generate a dictionary
of interpretable features which may be considered as potential parametrizing coordinates. When
this is the case, selection of candidate interpretable features as coordinates can take the above form.
While implementations vary across data and algorithmic domains, identification of such coordinates
generally aids mechanistic understanding, generative control, and statistical efficiency.
This paper shows that an adapted version of the algorithm in Koelle et al. [1] leads to a convex
procedure that can improve upon greedy approaches such as those in Cai and Wang [2], Chen and
Meila [3], Kohli et al. [4], Jones et al. [5] for finding isometries. The insight leading to isometry
pursuit is that multitask basis pursuit applied to an appropriately normalized X selects orthonormal
submatrices. Given vectors in RD , the normalization log-symmetrizes length and favors those closer
to unit length, while basis pursuit favors those which are orthogonal. Our results formalize this
intuition within a limited setting, and show the usefulness of isometry pursuit as a trimming procedure
prior to brute force search for diversification and interpretable coordinate selection. We also introduce
a novel ground truth objective function against which we measure the success of our algorithm, and
discuss the reasonableness of the trimming procedure.
2 Background
Our algorithm is motivated by spectral and convex analysis.
¹ Work conducted outside of Amazon.
² Code is available at https://github.com/sjkoelle/isometry-pursuit.
Our goal is, given a matrix X ∈ R^{D×P}, to select a subset S ⊂ [P] with |S| = D such that X_{·S} is as orthonormal as possible in a computationally efficient way. To this end, we define a ground truth loss function that measures orthonormality, and then introduce a surrogate loss function that convexifies the problem so that it may be solved efficiently.
Our motivating example is the selection of data representations from within sets of putative coor-
dinates: the columns of a provided wide matrix. Compared with Sparse PCA [6, 7, 8], we seek a
low-dimensional representation from the set of these column vectors rather than their span.
This method applies to interpretability, for which parsimony is at a premium. Interpretability arises
through comparison of data with what is known to be important in the domain of the problem. This
knowledge often takes the form of a functional dictionary. Evaluation of independence of dictionary
features arises in numerous scenarios [9, 10, 11]. The requirement that dictionary features be full
rank has been called functional independence [10] or feature decomposability [12], with connection
between dictionary rank and independence via the implicit function theorem. Besides independence,
the metric properties of such dictionary elements are of natural interest. This is formalized through the notion of the differential.
It is not always necessary to explicitly estimate tangent spaces when applying this definition. The
most commonly encountered manifolds are vector spaces for which the tangent spaces are trivial.
This is the case for full-rank tabular data, for which isometry has a natural interpretation as a type of
diversification, and often for the latent spaces of deep learning models. In this case, B = D.
The applications of pointwise isometry are themselves manifold. Pointwise isometric embeddings
faithfully preserve high-dimensional geometry. For example, Local Tangent Space Alignment [13],
Multidimensional Scaling [14] and Isomap [15] non-parametrically estimate embeddings that are
as isometric as possible. Another approach stitches together pointwise isometries selected from a
dictionary to form global embeddings [4]. The method is particularly relevant since it constructs such
isometries through greedy search, with putative dictionary features added one at a time.
That Dϕ is orthonormal has several equivalent formulations. The one motivating our ground truth
loss function comes from spectral analysis.
Proposition 1. The singular values σ_1, …, σ_D of U ∈ R^{D×D} are all equal to 1 if and only if U is orthonormal.
On the other hand, the formulation that motivates our convex approach is that orthonormal matrices
consist of D coordinate features whose gradients are orthogonal and of unit length.
Proposition 2. The component vectors u_1, …, u_D ∈ R^B form an orthonormal matrix if and only if, for all d_1, d_2 ∈ [D],
$$\langle u_{d_1}, u_{d_2} \rangle = \begin{cases} 1 & d_1 = d_2 \\ 0 & d_1 \neq d_2. \end{cases}$$
2.3 Subset selection
Given a matrix X ∈ R^{D×P}, we compare algorithmic paradigms for solving problems of the form
$$\arg\min_{S \in \binom{[P]}{D}} l(X_{\cdot S}) \tag{3}$$
where $\binom{[P]}{D} = \{A \subseteq [P] : |A| = D\}$. Brute force algorithms consider all possible solutions. These algorithms are conceptually simple, but have the often prohibitive time complexity O(C_l P^D), where C_l is the cost of evaluating l. Greedy algorithms iteratively add one element at a time to S. These algorithms have time complexity O(C_l P D) and so are computationally more efficient than brute force algorithms, but can get stuck in local minima. Formal definitions are given in Section 6.1.
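To make the two paradigms concrete, the following is a minimal Python sketch of brute and greedy subset search for a generic loss l; the function names and the use of NumPy are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations
import numpy as np

def brute_search(X, loss, D):
    """Evaluate the loss on every D-subset of columns (O(C_l P^D) work)."""
    P = X.shape[1]
    return min(combinations(range(P), D), key=lambda S: loss(X[:, list(S)]))

def greedy_search(X, loss, D):
    """Add one column at a time, keeping the choice that most reduces the loss."""
    P, S = X.shape[1], []
    for _ in range(D):
        candidates = [p for p in range(P) if p not in S]
        S.append(min(candidates, key=lambda p: loss(X[:, S + [p]])))
    return tuple(S)
```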
Sometimes, it is possible to introduce an objective which convexifies problems of the above form.
Solutions
$$\arg\min_{\beta} f(\beta) : Y = X\beta \tag{4}$$
to the overcomplete regression problem Y = Xβ are a classic example [16]. When f (β) = ∥β∥0 , this
problem is non-convex, and is thus suitable for greedy or brute algorithms, but when f (β) = ∥β∥1 ,
the problem is convex, and may be solved efficiently via interior-point methods. When the equality
constraint is relaxed, Lagrangian duality may be used to reformulate as a so-called Lasso problem,
which leads to an even richer set of optimization algorithms.
The form of basis pursuit that we apply is inspired by the group basis pursuit approach in Koelle et al.
[10]. In group basis pursuit (which we call multitask basis pursuit when the grouping depends only on the structure of the matrix-valued response variable Y) the objective function is $f(\beta) = \|\beta\|_{1,2} := \sum_{p=1}^{P} \|\beta_{p\cdot}\|_2$ [17, 18, 19]. This objective creates joint sparsity across entire rows β_{p·} and was used in Koelle et al. [10] to select between sets of interpretable features.
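As a concrete illustration, the following is a minimal sketch of the multitask basis pursuit program solved with CVXPY and SCS (the solver stack named in Section 4); the function name and problem setup are illustrative assumptions rather than the paper's code.

```python
import cvxpy as cp
import numpy as np

def multitask_basis_pursuit(X, Y):
    """Solve min ||beta||_{1,2} subject to Y = X @ beta,
    where the (1,2)-norm is the sum of row-wise Euclidean norms of beta."""
    D, P = X.shape
    beta = cp.Variable((P, Y.shape[1]))
    objective = cp.Minimize(cp.sum(cp.norm(beta, 2, axis=1)))  # sum of row norms
    problem = cp.Problem(objective, [X @ beta == Y])
    problem.solve(solver=cp.SCS)
    return beta.value
```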
3 Method
We adapt the group lasso paradigm used to select independent dictionary elements in Koelle et al.
[10, 1] to select pointwise isometries from a dictionary. We first define a ground truth objective
computable via brute and greedy algorithms that is uniquely minimized by orthonormal matrices.
We then define the combination of normalization and multitask basis pursuit that approximates this
ground truth loss function. We finally give a brute post-processing method for ensuring that the
solution is D sparse.
We would like a ground truth objective that is minimized uniquely by orthonormal matrices, is invariant under rotation, and depends on all changes in the matrix. Deformation [4] uses only a subset of the differential's information, while the nuclear norm [20] is not uniquely minimized at unitarity. We therefore introduce an alternative ground truth objective that satisfies the above desiderata and has convenient connections to isometry pursuit.
This objective is
$$l_c : \mathbb{R}^{D\times P} \to \mathbb{R}_+ \tag{5}$$
$$X \mapsto \sum_{d=1}^{D} g(\sigma_d(X), c) \tag{6}$$
Our ground truth program is therefore
$$\arg\min_{S \in \binom{[P]}{D}} l_c(X_{\cdot S}). \tag{9}$$
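A minimal sketch of this objective, assuming only the structure of (5) and (6); the per-singular-value penalty g is defined in the paper and is passed in here as a callable.

```python
import numpy as np

def ground_truth_loss(X_S, g, c):
    """l_c(X_S) = sum_d g(sigma_d(X_S), c), computed from the singular values of X_S."""
    sigmas = np.linalg.svd(X_S, compute_uv=False)
    return sum(g(sigma, c) for sigma in sigmas)
```

Program (9) can then be run with the `brute_search` sketch above, e.g. `brute_search(X, lambda X_S: ground_truth_loss(X_S, g, c), D)`.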
3.2 Normalization
Since basis pursuit methods tend to select longer vectors, selection of orthonormal submatrices
requires normalization such that both long and short candidate basis vectors are penalized in the
subsequent regression. We introduce the following definition.
We use such functions to normalize vector length in such a way that vectors of length 1 prior to
normalization have longest length after normalization and vectors are shrunk proportionately to their
deviation from 1. That is, we normalize vectors by
$$n : \mathbb{R}^{D} \to \mathbb{R}^{D} \tag{13}$$
$$v \mapsto q(v)\,v \tag{14}$$
and matrices by
$$w : \mathbb{R}^{D\times P} \to \mathbb{R}^{D\times P} \tag{15}$$
$$X_{\cdot p} \mapsto n(X_{\cdot p}) \quad \forall\, p \in [P]. \tag{16}$$
Isometry pursuit is the application of multitask basis pursuit to the normalized design matrix w(X, c)
to identify submatrices of X that are as orthonormal as possible. Define the multitask basis pursuit
penalty
$$\| \cdot \|_{1,2} : \mathbb{R}^{P\times D} \to \mathbb{R}_+ \tag{19}$$
$$\beta \mapsto \sum_{p=1}^{P} \|\beta_{p\cdot}\|_2. \tag{20}$$
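Putting the normalization and the penalty together, here is a minimal sketch of isometry pursuit reusing `multitask_basis_pursuit` from the earlier sketch; the normalizer q comes from the paper's Definition 3 and is passed in as a callable, and the support threshold `tol` is an illustrative assumption.

```python
import numpy as np

def isometry_pursuit(X, q, tol=1e-6):
    """Normalize the columns of X with q, run multitask basis pursuit against I_D,
    and return the support (rows of beta with non-negligible norm)."""
    D, P = X.shape
    W = np.column_stack([q(X[:, p]) * X[:, p] for p in range(P)])  # w(X)
    beta = multitask_basis_pursuit(W, np.eye(D))
    return np.flatnonzero(np.linalg.norm(beta, axis=1) > tol)
```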
Figure 1: Plots of (a) ground truth loss, (b) normalized length as a function of unnormalized length, and (c) basis pursuit loss for different values of c in the one-dimensional case D = 1. The two losses are equivalent in the one-dimensional case.
3.4 Theory
The intuition behind our application of multitask basis pursuit is that submatrices consisting of vectors which are closer to 1 in length and more orthogonal will have smaller loss. A key initial theoretical assertion is that IsometryPursuit is invariant to the choice of basis for X.
A proof is given in Section 6.2.1. This has as an immediate corollary that we may replace I_D in the constraint by any orthonormal D × D matrix.
We also claim that the conditions of the consequent of Proposition 2 are satisfied by minimizers of
the multitask basis pursuit objective applied to suitably normalized matrices in the special case where
a rank D orthonormal submatrix exists and |S| = D.
While this Proposition falls short of showing that an orthonormal submatrix will be selected should one be present, it provides intuition justifying the preferential efficacy of IsometryPursuit on real data. A proof is given in Section 6.2.2.
TwoStageIsometryPursuit(Matrix X ∈ R^{D×P}, scaling constant c)
1: Ŝ_IP = IsometryPursuit(X, c)
2: Ŝ = BruteSearch(X_{·Ŝ_IP}, l_c)
3: Output Ŝ
Similar two-stage approaches are standard in the Lasso literature [21]. This method forms our
practical isometry estimator, and is discussed further in Sections 5 and 6.4.
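A corresponding Python sketch, chaining the earlier illustrative sketches (`isometry_pursuit`, `brute_search`, and `ground_truth_loss`); as before, the penalty g and normalizer q are taken as given.

```python
import numpy as np

def two_stage_isometry_pursuit(X, q, g, c):
    """Stage 1: isometry pursuit proposes a support S_IP.
    Stage 2: brute search over D-subsets of S_IP under the ground truth loss l_c."""
    D = X.shape[0]
    S_ip = isometry_pursuit(X, q)                      # first-stage support
    loss = lambda X_S: ground_truth_loss(X_S, g, c)
    local = brute_search(X[:, S_ip], loss, D)          # indices within S_ip
    return tuple(S_ip[list(local)])                    # map back to columns of X
```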
4 Experiments
Say you are hosting an elegant dinner party and wish to select a balanced set of wines for drinking and flowers for decoration. We demonstrate TwoStageIsometryPursuit and GreedySearch on the Iris and Wine datasets [22, 23, 24]. This has an intuitive interpretation as selecting diverse elements, reflecting the particular structure of the diversification problem. Features like petal width are rows in X; they are the features on the basis of which we select, among the flowers, those which are most distinct from one another. Thus, in diversification, P = n.
We also analyze the Ethanol dataset from Chmiela et al. [25], Koelle et al. [10], but rather than selecting between bourbon and scotch we evaluate a dictionary of interpretable features (bond torsions) for their ability to parameterize the molecular configuration space. In this interpretability use case, columns denote gradients of informative features. We compute Jacobian matrices of putative parametrization functions and project them onto estimated tangent spaces (see Koelle et al. [10] for preprocessing details). Rather than selecting between data points, we are selecting between functions which parameterize the data.
For basis pursuit, we use the SCS interior point solver [26] from CVXPY [27, 28], which is able to push sparse values arbitrarily close to 0 [29]. Statistical replicas for Wine and Iris are created by resampling across [P]. Due to differences in scales between rows, these are first standardized. For the Wine dataset, even BruteSearch on Ŝ_IP is prohibitive in D = 13, and so we truncate our inputs to D = 6. For Ethanol, replicas are created by sampling from data points, and their corresponding tangent spaces are estimated in B = 252.
Figure 2 and Table 1 show that the l_1 loss accrued by the subset Ŝ_G estimated using GreedySearch with objective l_1 is higher than that for the subset estimated by TwoStageIsometryPursuit. This effect is statistically significant, but varies across datapoints and datasets. Figure 3 details intermediate support recovery cardinalities from IsometryPursuit. We also evaluated second-stage BruteSearch selection after random selection of Ŝ_IP but do not report it since it often led to catastrophic failure to satisfy the basis pursuit constraint. Wall-clock runtimes are given in Section 6.5.
5 Discussion
We have shown that multitask basis pursuit can help select isometric submatrices from appropriately normalized wide matrices. This approach, isometry pursuit, is a convex alternative to greedy methods for selecting orthonormalized features from within a dictionary. Isometry pursuit can
Figure 2: Isometry losses l_1 for the (a) Wine, (b) Iris, and (c) Ethanol datasets across R replicates. Lower brute losses are shown in turquoise, while lower two-stage losses are shown in pink. Equal losses are shown with black lines. As detailed in Table 1, losses are generally lower for two-stage isometry pursuit solutions.
References
[1] Samson J Koelle, Hanyu Zhang, Octavian-Vlad Murad, and Marina Meila. Consistency
of dictionary-based manifold learning. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen
Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and
Statistics, volume 238 of Proceedings of Machine Learning Research, pages 4348–4356. PMLR,
2024.
[2] T. Tony Cai and Lie Wang. Orthogonal matching pursuit for sparse signal recovery with noise.
IEEE Transactions on Information Theory, 57(7):4680–4688, 2011. doi: 10.1109/TIT.2011.
2146090.
[3] Yu-Chia Chen and Marina Meila. Selecting the independent coordinates of manifolds with
large aspect ratios. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/
2019/file/6a10bbd480e4c5573d8f3af73ae0454b-Paper.pdf.
[4] Dhruv Kohli, Alexander Cloninger, and Gal Mishne. LDLE: Low distortion local eigenmaps. J.
Mach. Learn. Res., 22, 2021.
[5] Peter W Jones, Mauro Maggioni, and Raanan Schul. Universal local parametrizations via heat
kernels and eigenfunctions of the laplacian. September 2007.
[6] Santanu S Dey, R Mazumder, M Molinaro, and Guanyi Wang. Sparse principal component
analysis and its l1 -relaxation. arXiv: Optimization and Control, December 2017.
[7] D Bertsimas and Driss Lahlou Kitane. Sparse PCA: A geometric approach. J. Mach. Learn.
Res., 24:32:1–32:33, October 2022.
[8] Dimitris Bertsimas, Ryan Cory-Wright, and Jean Pauphilet. Solving Large-Scale sparse PCA to
certifiable (near) optimality. J. Mach. Learn. Res., 23(13):1–35, 2022.
[9] Yu-Chia Chen and M Meilă. Selecting the independent coordinates of manifolds with large
aspect ratios. Adv. Neural Inf. Process. Syst., abs/1907.01651, July 2019.
[10] Samson J Koelle, Hanyu Zhang, Marina Meila, and Yu-Chia Chen. Manifold coordinates with
physical meaning. J. Mach. Learn. Res., 23(133):1–57, 2022.
[11] Jesse He, Tristan Brugère, and Gal Mishne. Product manifold learning with independent
coordinate selection. In Proceedings of the 2nd Annual Workshop on Topology, Algebra, and
Geometry in Machine Learning (TAG-ML) at ICML, June 2023.
[12] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian
Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham,
Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R.
Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom
Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet.
Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/
scaling-monosemanticity/index.html.
[13] Zhenyue Zhang and Hongyuan Zha. Principal manifolds and nonlinear dimensionality reduction
via tangent space alignment. SIAM J. Scientific Computing, 26(1):313–338, 2004.
[14] Lisha Chen and Andreas Buja. Local Multidimensional Scaling for nonlinear dimension reduc-
tion, graph drawing and proximity analysis. Journal of the American Statistical Association,
104(485):209–219, March 2009.
[15] J.B. Tenenbaum, V. Silva, and J.C. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[16] Scott Shaobing Chen and David L. Donoho and Michael A. Saunders. Atomic decomposition
by basis pursuit. SIAM REVIEW, 43(1):129, February 2001.
[17] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. J.
R. Stat. Soc. Series B Stat. Methodol., 68(1):49–67, February 2006.
[18] G Obozinski, B Taskar, and Michael I Jordan. Multi-task feature selection. 2006.
[19] Dit-Yan Yeung and Yu Zhang. A probabilistic framework for learning task relationships in
multi-task learning. 2011.
[20] Stephen P Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press,
March 2004.
[21] Tim Hesterberg, Nam Hee Choi, Lukas Meier, and Chris Fraley. Least angle and ℓ1 penalized
regression: A review. February 2008.
[23] Stefan Aeberhard and M. Forina. Wine. UCI Machine Learning Repository, 1991. DOI:
https://doi.org/10.24432/C5PC7J.
[25] Stefan Chmiela, Huziel E Sauceda, Klaus-Robert Müller, and Alexandre Tkatchenko. Towards
exact molecular dynamics simulations with machine-learned force fields. Nat. Commun., 9(1):
3887, September 2018.
[26] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via
operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and
Applications, 169(3):1042–1068, June 2016. URL http://stanford.edu/~boyd/papers/
scs.html.
[27] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for
convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
[28] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system
for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
[30] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering
documents and producing summaries. SIGIR Forum, 51(2):209–210, August 1998.
[31] Qiong Wu, Yong Liu, Chunyan Miao, Yin Zhao, Lu Guan, and Haihong Tang. Recent advances
in diversified recommendation. May 2019.
[33] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun,
Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A
survey. arXiv [cs.CL], December 2023.
[34] Marc Pickett, Jeremy Hartman, Ayan Kumar Bhowmick, Raquib-Ul Alam, and Aditya Vempaty.
Better RAG using relevant information gain. arXiv [cs.CL], July 2024.
[35] Yeonjun In, Sungchul Kim, Ryan A Rossi, Md Mehrab Tanjim, Tong Yu, Ritwik Sinha, and
Chanyoung Park. Diversify-verify-adapt: Efficient and robust retrieval-augmented ambiguous
question answering. arXiv [cs.CL], September 2024.
[36] Sam Weiss. Enhancing diversity in RAG document retrieval using projection-based techniques. https://medium.com/@samcarlos_14058/enhancing-diversity-in-rag-document-retrieval-using-projection-based-techniques-9fef5422e043, August 2024. Accessed: 2024-11-22.
[37] Get diverse results and comprehensive summaries with vec-
tara’s MMR reranker. URL https://www.vectara.com/blog/
get-diverse-results-and-comprehensive-summaries-with-vectaras-mmr-reranker.
Accessed: 2024-11-22.
[38] Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse
autoencoders for interpretability and control. arXiv [cs.LG], May 2024.
[39] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering
documents and producing summaries. In Proceedings of the 21st annual international ACM
SIGIR conference on Research and development in information retrieval, New York, NY, USA,
August 1998. ACM.
[40] Maria C. N. Barioni, Marios Hadjieleftheriou, Marcos R. Vieira, Caetano Traina, Vassilis J.
Tsotras, Humberto L. Razente, and Divesh Srivastava. On query result diversification . In 2011
27th IEEE International Conference on Data Engineering (ICDE 2011), pages 1163–1174, Los
Alamitos, CA, USA, April 2011. IEEE Computer Society. doi: 10.1109/ICDE.2011.5767846.
URL https://doi.ieeecomputersociety.org/10.1109/ICDE.2011.5767846.
[41] Marina Drosou and Evaggelia Pitoura. Search result diversification. SIGMOD Rec., 39(1):
41–47, September 2010. ISSN 0163-5808. doi: 10.1145/1860702.1860709. URL https:
//doi.org/10.1145/1860702.1860709.
[42] Lu Qin, Jeffrey Xu Yu, and Lijun Chang. Diversifying top-k results. Proceedings VLDB
Endowment, 5(11):1124–1135, July 2012.
[43] Matevž Kunaver and Tomaž Požrl. Diversity in recommender systems – a survey. Knowledge-
Based Systems, 123:154–162, 2017. ISSN 0950-7051. doi: https://doi.org/10.1016/j.
knosys.2017.02.009. URL https://www.sciencedirect.com/science/article/pii/
S0950705117300680.
[44] Shengbo Guo and Scott Sanner. Probabilistic latent maximal marginal relevance. In Proceedings
of the 33rd International ACM SIGIR Conference on Research and Development in Information
Retrieval, SIGIR ’10, page 833–834, New York, NY, USA, 2010. Association for Computing
Machinery. ISBN 9781450301534. doi: 10.1145/1835449.1835639. URL https://doi.org/
10.1145/1835449.1835639.
[45] Mustafa Abdool, Malay Haldar, Prashant Ramanathan, Tyler Sax, Lanbo Zhang, Aamir Man-
aswala, Lynn Yang, Bradley Turnbull, Qing Zhang, and Thomas Legrand. Managing diversity
in airbnb search. In Proceedings of the 26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, KDD ’20, page 2952–2960, New York, NY, USA, 2020.
Association for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3403345.
URL https://doi.org/10.1145/3394486.3403345.
[46] Hsiang-Fu Yu, Cho-Jui Hsieh, Qi Lei, and Inderjit S. Dhillon. A greedy approach for budgeted
maximum inner product search. In Neural Information Processing Systems, 2016. URL
https://api.semanticscholar.org/CorpusID:7076785.
[47] Qiang Huang, Yanhao Wang, Yiqun Sun, and Anthony K H Tung. Diversity-aware k-maximum
inner product search revisited. arXiv [cs.IR], February 2024.
[48] S.G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE
Transactions on Signal Processing, 41(12):3397–3415, 1993. doi: 10.1109/78.258082.
[49] S. Mallat and Z. Zhang. Adaptive time-frequency decomposition with matching pursuits. In
[1992] Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-
Scale Analysis, pages 7–10, 1992. doi: 10.1109/TFTSA.1992.274245.
[50] Y.C. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal matching pursuit: recursive
function approximation with applications to wavelet decomposition. In Proceedings of 27th
Asilomar Conference on Signals, Systems and Computers, pages 40–44 vol.1, 1993. doi:
10.1109/ACSSC.1993.342465.
[51] J.A. Tropp, A.C. Gilbert, and M.J. Strauss. Simultaneous sparse approximation via greedy
pursuit. In Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech,
and Signal Processing, 2005., volume 5, pages v/721–v/724 Vol. 5, 2005. doi: 10.1109/ICASSP.
2005.1416405.
[52] Joel A. Tropp. Algorithms for simultaneous sparse approximation. part ii: Convex relaxation.
Signal Processing, 86(3):589–602, 2006. ISSN 0165-1684. doi: https://doi.org/10.1016/
j.sigpro.2005.05.031. URL https://www.sciencedirect.com/science/article/pii/
S0165168405002239. Sparse Approximations in Signal and Image Processing.
[53] Jie Chen and Xiaoming Huo. Theoretical results on sparse representations of multiple-
measurement vectors. IEEE Transactions on Signal Processing, 54:4634–4643, 2006. URL
https://api.semanticscholar.org/CorpusID:17333301.
[54] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the lasso and its dual.
Journal of Computational and Graphical Statistics, 9:319 – 337, 2000. URL https:
//api.semanticscholar.org/CorpusID:14422381.
[55] Charles Dossal. A necessary and sufficient condition for exact sparse recovery by l1 min-
imization. Comptes Rendus Mathematique, 350(1):117–120, 2012. ISSN 1631-073X.
doi: https://doi.org/10.1016/j.crma.2011.12.014. URL https://www.sciencedirect.com/
science/article/pii/S1631073X11003694.
[56] Stéphane Chrétien and Sébastien Darses. On the generic uniform uniqueness of the lasso
estimator. arXiv: Statistics Theory, 2011. URL https://api.semanticscholar.org/
CorpusID:88518316.
[57] Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:
1456–1490, 2012. URL https://api.semanticscholar.org/CorpusID:5849668.
[58] Karl Ewald and Ulrike Schneider. On the distribution, model selection properties and uniqueness
of the lasso estimator in low and high dimensions. Electronic Journal of Statistics, 2017. URL
https://api.semanticscholar.org/CorpusID:54044415.
[59] Alnur Ali and Ryan J. Tibshirani. The generalized lasso problem and uniqueness. Elec-
tronic Journal of Statistics, 2018. URL https://api.semanticscholar.org/CorpusID:
51755233.
[60] Ulrike Schneider and Patrick Tardivel. The geometry of uniqueness, sparsity and clustering in
penalized estimation. arXiv [math.ST], April 2020.
[61] Aaron Mishkin and Mert Pilanci. The solution path of the group lasso. 2022. URL https:
//api.semanticscholar.org/CorpusID:259504228.
[62] Xavier Dupuis and Samuel Vaiter. The geometry of sparse analysis regularization. SIAM
J. Optim., 33:842–867, 2019. URL https://api.semanticscholar.org/CorpusID:
195791526.
[63] Thomas Debarre, Quentin Denoyelle, and Julien Fageot. On the uniqueness of solutions
for the basis pursuit in the continuum. Inverse Problems, 38, 2020. URL https://api.
semanticscholar.org/CorpusID:246473440.
[64] Jasper Marijn Everink, Yiqiu Dong, and Martin Skovgaard Andersen. The geometry
and well-posedness of sparse regularized linear regression. 2024. URL https://api.
semanticscholar.org/CorpusID:272424099.
[65] David L. Donoho. For most large underdetermined systems of linear equations the minimal l1-
norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics,
59, 2006. URL https://api.semanticscholar.org/CorpusID:8510060.
[66] T Hastie, R Tibshirani, and M Wainwright. Statistical learning with sparsity: The lasso and
generalizations. May 2015.
[67] Samson Jonathan Koelle. Geometric algorithms for interpretable manifold learning. Phd thesis,
University of Washington, 2022. URL http://hdl.handle.net/1773/48559. Statistics
[108].
[68] Emmanuel Candes and Terence Tao. Decoding by linear programming. February 2005.
[69] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to
Algorithms, Third Edition. The MIT Press, 3rd edition, 2009. ISBN 0262033844.
[70] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall
Press, USA, 3rd edition, 2009. ISBN 0136042597.
[71] Travis Dick, Eric Wong, and Christoph Dann. How many random restarts are enough. 2014.
URL https://api.semanticscholar.org/CorpusID:9473630.
[72] E Anderson, Z Bai, and J Dongarra. Generalized qr factorization and its applications. Linear
Algebra Appl., 162-164:243–271, February 1992.
6 Supplement
This section contains algorithms, proofs, and experiments in support of the main text.
6.1 Algorithms
We give definitions of the brute and greedy algorithms for the combinatorial problem studied in this paper. The brute force algorithm is computationally intractable for all but the smallest problems, but always finds the global minimum.
Greedy algorithms are computationally expedient but can get stuck in local optima [69, 70], even
with randomized restarts [71].
6.2 Proofs
Proof:
$$\|\beta U\|_{1,2} = \sum_{p=1}^{P} \|\beta_{p\cdot} U\| \tag{25}$$
$$= \sum_{p=1}^{P} \|\beta_{p\cdot}\| \tag{26}$$
$$= \|\beta\|_{1,2} \tag{27}$$
□
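A quick numerical check of this rotational invariance of the (1,2)-norm, included purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = rng.standard_normal((7, 4))
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))       # random orthonormal U
norm_12 = lambda B: np.linalg.norm(B, axis=1).sum()    # ||.||_{1,2}: sum of row norms
assert np.isclose(norm_12(beta @ U), norm_12(beta))
```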
We then show that this implies that the resultant loss is unchanged by unitary transformation of X.
Proof:
$$\hat\beta(UX) = \arg\min_{\beta \in \mathbb{R}^{P\times D}} \|\beta\|_{1,2} : I_D = UX\beta \tag{28}$$
$$\beta_{dd'} = \begin{cases} 1 & d = d' \in \{1, \dots, D\} \\ 0 & \text{otherwise.} \end{cases} \tag{33}$$
Thus, we need to show that this is a lower bound on the obtained loss.
From the conditions in Definition 3, normalized matrices will consist of vectors of maximum length (i.e. 1) if and only if the original matrix also consists of vectors of length 1. Such vectors will clearly result in lower basis pursuit loss, since longer vectors in X require smaller corresponding covectors in β to produce the same result.
Therefore, it remains to show that an X consisting of orthogonal vectors of length 1 has lower loss than an X consisting of non-orthogonal vectors. Invertible matrices X_{·S} admit QR decompositions X_{·S} = QR, where Q and R are orthonormal and upper-triangular matrices, respectively [72].
Denoting Q to be composed of basis vectors [e_1 … e_D], the matrix R has the form
$$R = \begin{pmatrix} \langle e_1, X_{\cdot S_1}\rangle & \langle e_1, X_{\cdot S_2}\rangle & \dots & \langle e_1, X_{\cdot S_D}\rangle \\ 0 & \langle e_2, X_{\cdot S_2}\rangle & \dots & \langle e_2, X_{\cdot S_D}\rangle \\ 0 & 0 & \ddots & \vdots \\ 0 & 0 & \dots & \langle e_D, X_{\cdot S_D}\rangle \end{pmatrix}. \tag{34}$$
Thus, |R_dd| ≤ ∥X_{·S_d}∥_2 for all d ∈ [D], with equality obtained only by orthonormal matrices. On the other hand, by Proposition 3, l_c(X) = l_c(R), and so ∥β∥_{1,2} = ∥R^{-1}∥_{1,2}. Since R is upper triangular, it has diagonal elements β_dd = R_dd^{-1}, and so ∥β_{d·}∥ ≥ ∥X_{·S_d}∥^{-1} ≥ 1. That is, the penalty accrued by a particular covector in β is bounded from below by the inverse of the length of the corresponding vector in X_{·S}, which is at least 1, with equality occurring only when X_{·S} is orthonormal. □
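A quick numerical check of the diagonal bound |R_dd| ≤ ∥X_{·S_d}∥_2 via NumPy's QR factorization; this is an illustrative verification on a random example, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
X_S = rng.standard_normal((5, 5))
X_S /= np.maximum(np.linalg.norm(X_S, axis=0), 1.0)   # cap column lengths at 1
Q, R = np.linalg.qr(X_S)                               # X_S = Q R, R upper triangular
# The diagonal of R is bounded in magnitude by the corresponding column norms.
assert np.all(np.abs(np.diag(R)) <= np.linalg.norm(X_S, axis=0) + 1e-12)
```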
Figure 3: Support cardinalities for the (a) Wine, (b) Iris, and (c) Ethanol datasets.
Figure 3 plots the distribution of |Ŝ_IP| from Table 1 in order to contextualize the reported means. While typically |Ŝ_IP| ≪ P, there are cases for Ethanol where this does not hold, and these drive up the means.
Figure 4: (a) Iris isometry losses; (b) Iris multitask losses.
As mentioned in Section 5, the conditions under which the restriction P = D in Proposition 4 may be relaxed are of theoretical and practical interest. The results in Section 4 show that there are circumstances in which GreedySearch performs better than TwoStageIsometryPursuit, so clearly TwoStageIsometryPursuit does not always achieve a global optimum. Figure 4 investigates why this is the case, following the reasoning presented in Section 5. In these results, a two-stage algorithm achieves the global optimum of a slightly different brute problem, namely brute optimization of the multitask basis pursuit penalty ∥·∥_{1,2}. That is, brute search on ∥·∥_{1,2} gives the same result as the two-stage algorithm with brute search on ∥·∥_{1,2} subsequent to isometry pursuit. This suggests that failure of TwoStageIsometryPursuit to select the global optimum is in fact only due to the mismatch, on certain data, between the global optima of brute optimization of the multitask penalty and of the isometry loss. Theoretical formalization, as well as investigation of the data configurations for which this equivalence holds, is a logical follow-up.
6.5 Timing