Efficient Factorization of the Joint-Space Inertia Matrix for Branched Kinematic Trees
Abstract
This paper describes new factorization algorithms that exploit branch-induced
sparsity in the joint-space inertia matrix (JSIM) of a kinematic tree. It also presents
new formulas that show how the cost of calculating and factorizing the JSIM varies
with the topology of the tree. These formulas show that the cost of calculating
forward dynamics for a branched tree can be considerably less than the cost for an
unbranched tree of the same size. Branches can also reduce complexity: some
examples are presented of kinematic trees for which the complexities of calculating
and factorizing the JSIM are less than O(n^2) and O(n^3), respectively. Finally, a cost
comparison is made between an O(n) algorithm and an O(n^3) algorithm, the latter
incorporating one of the new factorization algorithms. It is shown that the O(n^3)
algorithm is only 15% slower than the O(n) algorithm when applied to a 30-DoF
humanoid, but is 2.6 times slower when applied to an equivalent unbranched chain.
This is due mainly to the O(n^3) algorithm running about 2.2 times faster on the
humanoid than on the chain.
1 Introduction
Forward dynamics algorithms for kinematic trees can be classified broadly into two main
types: propagation algorithms and inertia-matrix algorithms. Many examples of these
can be found in the literature, e.g. [1, 2, 4, 6, 8, 9, 13, 17, 18, 19, 21, 22, 23, 24]. There
are also a variety of algorithms that do not fit into either category.
Propagation algorithms work by propagating constraints between bodies in such a
way that the joint accelerations can be calculated one at a time. They typically have a
computational complexity that is stated either as O(N ) or O(n), where N is the number of
bodies and n is the number of joint variables; but there is no real difference between them,
as O(N ) = O(n) for a kinematic tree. Due to their complexity, propagation algorithms
are often simply called O(n) algorithms.
This paper has been accepted for publication in the International Journal of Robotics Research, and
the final (edited, revised and typeset) version of this paper will be published in the International Journal
of Robotics Research, vol. 24, no. 6, pp. 487–500, June 2005 by Sage Publications Ltd. All rights reserved.
© Sage Publications Ltd.
Inertia-matrix algorithms work by formulating and solving an equation of the form
H q̈ = τ − C , (1)
where H is the joint-space inertia matrix (JSIM), q̈ is the vector of joint accelerations, τ
is a vector of joint forces, and C is a bias vector containing gravity terms, Coriolis and
centrifugal terms, and so on. This equation is a set of n linear equations in n unknowns,
where n is the number of joint variables. In general, the costs of calculating H and C are
O(n^2) and O(n), respectively, and the cost of solving Eq. 1 is O(n^3). Thus, inertia-matrix
algorithms have an overall complexity of O(n^3), and are often referred to simply as O(n^3)
algorithms.
If a kinematic tree contains branches, then certain elements of the JSIM will auto-
matically be zero. Indeed, the number of such zeros can be a large fraction of the total.
This phenomenon is called branch-induced sparsity. The exact number of branch-induced
zeros in a JSIM depends only on the topology of the kinematic tree, and their locations
depend only on the topology and the numbering scheme (which determines the order in
which joint variables appear in q̈).
Branch-induced sparsity has a profound effect on the efficiency of inertia-matrix algo-
rithms. If a significant fraction of the JSIM’s elements are zero, then fewer calculations
are required to calculate the nonzero elements, and fewer calculations to factorize the
matrix. It is therefore possible for an inertia-matrix algorithm to run significantly faster
on a branched kinematic tree than on an unbranched tree having the same number of
joints and joint variables.
This paper therefore makes the following contributions:
1. it describes new factorization algorithms that exploit branch-induced sparsity in the
factorization process; and
2. it presents new cost formulas that show how the cost of calculating and factorizing
the JSIM depends on the topology of the kinematic tree.
Two factorization algorithms are presented: one that performs an LTL factorization
(H = LT L), and one that performs an LTDL factorization (H = LT D L), where L
is a lower-triangular matrix and D is diagonal. These factorizations have the following
special property: if they are applied to a matrix with branch-induced sparsity, then the
factorization proceeds without filling in any of the zeros. Such factorizations are described
as optimal in [7], and they produce factors that are maximally sparse. In contrast, the
factors produced by the standard Cholesky and LDLT factorizations (H = L LT and
H = L D LT ) are dense.
The cost formulas show that the complexity of an inertia-matrix algorithm can range
between O(n) and O(n^3), depending on the topology of the tree. This result is illustrated
with a few examples of trees for which the complexity is less than O(n^3). They also show
that calculation costs are lower for branched trees generally, even when the complexity
remains at O(n^3).
This paper also presents a detailed cost comparison between an inertia-matrix algo-
rithm and a propagation algorithm, applied to two test cases: a 30-DoF humanoid (or
quadruped) mechanism and an equivalent 30-DoF unbranched chain. The inertia-matrix
algorithm comprises the fastest published algorithm for calculating C in Eq. 1 [3], the
fastest version of the composite rigid body algorithm (CRBA) to calculate H, and the
LTDL algorithm described in this paper. The propagation algorithm is the fastest pub-
lished implementation of the articulated body algorithm [14, 15]. The results of this
comparison can be summarized as follows: the O(n) algorithm beats the O(n^3) algorithm
by a factor of 2.6 on the unbranched chain, but is only about 15% faster on the humanoid.
This is mainly due to the O(n^3) algorithm running approximately 2.2 times faster on the
humanoid than on the unbranched chain.
This is not the first paper to advocate the use of sparse matrix algorithms for robot
dynamics. In particular, Baraff has described an O(n) algorithm based on sparse matrix
techniques [5]. However, his algorithm exploits a different pattern of sparsity in a matrix
that is very different from the JSIM. Sparse matrix techniques are also used in multibody
dynamics simulators, e.g. [16].
This is also not the first paper to advocate the use of an LTDL factorization on
the JSIM, since this has already been proposed by Saha [20] (who calls it a UDUT
factorization). He describes various interesting properties of the factorization, but does
not consider branch-induced sparsity.
The rest of this paper is organized as follows. Section 2 describes the sparse factor-
ization algorithm, and Section 3 explains how it relates to existing sparse matrix theory.
Sections 4 and 5 present computational cost and complexity analyses for the sparse factor-
ization and the CRBA, respectively, and Section 6 compares costs for a 30-DoF humanoid
and an equivalent unbranched chain.
Figure 1: A binary kinematic tree (a) and its connectivity graph (b).

2 The Sparse Factorization Algorithm

Figure 1(a) shows a binary kinematic tree, which we shall call Tree 1. It consists of a
fixed body and seven mobile bodies connected together by seven joints. The fixed body
provides a fixed base, or fixed reference frame, for the rest of the mechanism.
In the connectivity graph of this mechanism, the bodies are represented by nodes, the
joints by arcs, and the base body is the root node. The bodies are numbered according
to the following rule: the base is numbered 0, and the other bodies are numbered con-
secutively from 1 in any order such that each has a higher number than its parent. This
is called a regular numbering scheme. The joints are then numbered such that joint i
connects body i to its parent.
The connectivity of a kinematic tree can be described by an array of integers called
the parent array. It has one entry for each mobile body, which identifies the body number
of its parent. Thus, if λ is the parent array for Tree 1, then λ = (0, 1, 1, 2, 2, 3, 3), and
λ(i) is the parent of body i for any i ∈ {1 . . . 7}. Regular numbering ensures that λ has
the following important property:
∀i, 0 ≤ λ(i) < i . (2)
In the special case of an unbranched kinematic tree, λ(i) = i − 1.
If there are N mobile bodies in a kinematic tree then there are also N joints. However,
the total number of joint variables can be larger than this, because individual joints can
have more than one degree of freedom (DoF), hence more than one joint variable. Let n
be the number of joint variables for the tree, and let ni be the number of variables for
joint i. n is then given by the formula
    n = Σ_{i=1}^{N} ni .    (3)
The joint acceleration and force vectors for the whole system will be n-dimensional vectors,
and the JSIM will be an n × n matrix.
If a kinematic tree contains joints with more than one DoF then it is necessary to
construct an expanded parent array for use by the factorization algorithm. This is because
the parent array is one of the inputs to the algorithm, and it must have the same dimension
as the matrix to be factorized. The expanded parent array is obtained from an expanded
connectivity graph, which is constructed as follows:
For each joint in turn, if ni > 1 then replace joint i with a serial chain of ni − 1
bodies and ni joints. Number these extra bodies and joints consecutively, and
then add ni −1 to each of the remaining body and joint numbers in the system.
Figure 2: Expanding a connectivity graph to account for multi-DoF joints in the original
mechanism.
The important property of the expanded graph is that the variable at index position i in
the joint acceleration or force vector can be associated with joint i in the expanded graph.
Thus, joint i in the expanded graph pertains to row and column i in the matrix.
The procedure is illustrated in Figure 2. This figure shows the connectivity graph
and numbering scheme that would result if joints 1, 2 and 6 in Tree 1 were to have 2,
2 and 3 DoF, respectively. In this case, the JSIM would be an 11 × 11 matrix, and the
factorization algorithm would need an 11-element parent array. The expanded parent
array is λ = (0, 1, 2, 3, 2, 4, 4, 5, 8, 9, 5).
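To make the procedure concrete, here is a minimal Python sketch (ours, not part of the
original text) that builds the expanded parent array; it reproduces the example above:

def expand_parents(lam, dofs):
    # lam[i]  : parent of body i, for i = 1..N (lam[0] is a dummy 0)
    # dofs[i] : number of DoF of joint i (dofs[0] is a dummy)
    # Applies the replacement rule above: joint i becomes a serial chain
    # of dofs[i]-1 extra bodies and dofs[i] single-DoF joints.
    lam, dofs = list(lam), list(dofs)
    i = 1
    while i < len(lam):
        extra = dofs[i] - 1
        if extra > 0:
            parent = lam[i]
            # add 'extra' to every body number from i upwards
            shifted = [p + extra if p >= i else p for p in lam]
            # splice in the chain: parent -> i -> ... -> old body i
            lam = shifted[:i] + [parent] + list(range(i, i + extra)) + shifted[i+1:]
            dofs = dofs[:i] + [1] * (extra + 1) + dofs[i+1:]
        i += 1
    return lam

lam  = [0, 0, 1, 1, 2, 2, 3, 3]   # Tree 1; lam[i] is the parent of body i
dofs = [0, 2, 2, 1, 1, 1, 3, 1]   # joints 1, 2 and 6 have 2, 2 and 3 DoF
print(expand_parents(lam, dofs)[1:])   # [0, 1, 2, 3, 2, 4, 4, 5, 8, 9, 5]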
Returning to Tree 1, let us assume that every joint in this mechanism is a 1-DoF joint,
so that n = N = 7. The JSIM for this mechanism will then be a 7 × 7 matrix with the
following sparsity pattern:
H =
× × × × × × ×
× ×   × ×
×   ×     × ×
× ×   ×
× ×     ×
×   ×     ×
×   ×       ×
(4)
where ‘×’ denotes a nonzero entry in the matrix, and the zeros have been left blank. This
pattern follows directly from the definition of the JSIM for a branched kinematic tree
[9, 10]:

    Hij = siT Ici sj   if j ∈ ν(i)
          siT Icj sj   if i ∈ ν(j)        (5)
          0            otherwise
where si is the motion axis of joint i, ν(i) is the set of bodies descended from body i,
including body i itself, and Ici is the composite rigid-body inertia of all the bodies in
ν(i). For Tree 1, ν(1) = {1, 2, 3, 4, 5, 6, 7}, ν(2) = {2, 4, 5}, and so on. Eq. 5 gives us the
following general formula for the pattern of branch-induced sparsity in a JSIM:
    i ∉ ν(j) ∧ j ∉ ν(i)  ⇒  Hij = 0 .    (6)
Another way to say this is that Hij = 0 whenever i and j are on different branches.
If we factorize a JSIM into either H = L LT or H = L D LT (standard Cholesky and
LDLT factorizations, respectively), then the resulting triangular factors will be dense.
However, if instead we factorize the JSIM into either H = LT L or H = LT D L (LTL and
LTDL factorizations, respectively), then the factorization proceeds without filling in any
of the zeros in the JSIM, and produces a triangular factor that is maximally sparse for
the given matrix. A factorization that accomplishes this is considered optimal [7]. The
sparsity pattern for an optimally-sparse L is
    i ∉ ν(j)  ⇒  Lij = 0 ,

which for Tree 1 gives a factor with the pattern

L =
×
× ×
×   ×
× ×   ×
× ×     ×
×   ×     ×
×   ×       ×

The following algorithm performs the sparse LTL factorization in situ, overwriting the
lower triangle of H with L:
for k = n to 1 do
    Hkk = √Hkk
    j = λ(k)
    while j ≠ 0 do
        Hkj = Hkj /Hkk
        j = λ(j)
    end
    i = λ(k)
    while i ≠ 0 do
        j = i
        while j ≠ 0 do
            Hij = Hij − Hki Hkj
            j = λ(j)
        end
        i = λ(i)
    end
end
Figure 3: The regions of the matrix accessed while processing row k: (1) never accessed;
(2) no longer accessed; (3) Hkk := √Hkk; (4) Hkj := Hkj /Hkk; (5) the region updated by
the outer-product step.
The sparse LTDL algorithm differs only slightly from this, and is described in Appendix A.
This appendix also describes algorithms for calculating the expressions L x, LT x, L−1 x
and L−T x, for arbitrary vectors x, in a way that exploits the sparsity in L.
The sparse LTL algorithm differs from a standard Cholesky algorithm in only two
respects: the outer loop runs backwards from n to 1; and the inner loops iterate over the
ancestors of k, i.e., λ(k), λ(λ(k)), and so on, back to the root. The reversal of the outer
loop is what makes this algorithm perform an LTL factorization instead of Cholesky, and
the behaviour of the inner loops is what exploits the sparsity.
Figure 3 illustrates the factorization process. At the point where the algorithm begins
to process row k, it has already completed the processing of rows k + 1 to n, and these
rows now contain rows k + 1 to n of L. The processing of row k involves three steps:
1. Replace element Hkk with its square root.
2. Divide the elements in region 4 (the portion of row k below the diagonal) by Hkk .
3. Subtract from region 5 the outer product of region 4 with itself. Thus, each element
Hij within region 5 is replaced by Hij − Hki Hkj .
Furthermore, in the processing of row k, the inner loops iterate only over the ancestors
of body k. This has the effect of visiting only the nonzero elements in region 4, and a
subset of the nonzero elements in region 5. This is where the cost savings come from—the
algorithm performs the minimum possible amount of work to accomplish the factorization,
given the sparsity pattern of the matrix.
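For concreteness, here is a direct Python/NumPy transcription of the algorithm (ours,
not from the original text); it keeps the 1-indexed parent array λ (lam, with lam[0]
unused) while indexing the matrix from zero:

import numpy as np

def sparse_ltl(H, lam):
    # In-situ sparse LTL factorization, H = LT L; on exit the lower
    # triangle of H holds L.  lam[k] is the parent of body k.
    n = H.shape[0]
    for k in range(n, 0, -1):              # outer loop runs backwards
        H[k-1, k-1] = np.sqrt(H[k-1, k-1]) # step 1
        j = lam[k]
        while j != 0:                      # step 2: divide row k
            H[k-1, j-1] /= H[k-1, k-1]
            j = lam[j]
        i = lam[k]
        while i != 0:                      # step 3: outer-product update,
            j = i                          # visiting ancestors of k only
            while j != 0:
                H[i-1, j-1] -= H[k-1, i-1] * H[k-1, j-1]
                j = lam[j]
            i = lam[i]
    return H

# Quick check on a dense case (unbranched chain, lam[i] = i - 1):
A = np.random.rand(4, 4); H = A @ A.T + 4 * np.eye(4)
L = np.tril(sparse_ltl(H.copy(), [0, 0, 1, 2, 3]))
assert np.allclose(L.T @ L, H)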
Consider what happens when the algorithm is applied to the JSIM of Tree 1, which
has the sparsity pattern shown in Eq. 4. Starting at k = 7, the inner loops iterate over
only the two values 3 and 1, because λ(7) = 3, λ(λ(7)) = 1 and λ(λ(λ(7))) = 0 (the exit
condition for the inner loops). Thus, the algorithm updates only the two elements H73
and H71 at step 2, and the three elements H33 , H31 and H11 at step 3. In contrast, a
dense matrix factorization would update 6 elements at step 2, and a further 21 elements
at step 3.
In effect, the algorithm performs a stripped-down version of the LTL factorization of
a dense matrix, in which it simply skips over all the entries that are known to be zero in
the original matrix, and thereby omits every operation that involves a multiplication by
zero. This strategy works because the factorization process preserves the sparsity pattern
of the matrix: any element that starts out zero, remains zero throughout the factorization
process.
We can prove this property by induction. First, assume that rows k + 1 to n have
already been processed, and that no fill-in has yet occurred as a result of this processing;
so the pattern of zeros is still the same as in the original matrix. Now, it is impossible for
fill-in to occur as a result of executing steps 1 and 2 on row k, so we focus on step 3. This
step affects every element Hij within region 5 for which Hki Hkj ≠ 0; but this expression
can only be nonzero if both i and j are ancestors of k, which in turn implies that either
i ∈ ν(j) or j ∈ ν(i), which is the condition for Hij to be a nonzero element of H. Thus,
no fill-in occurs during the processing of row k.
The LTL and LTDL algorithms have been tested for numerical accuracy on a variety
of JSIMs, with the following results. On matrices with a substantial amount of sparsity,
they tend to be slightly more accurate than Matlab’s native Cholesky factorization; but
on matrices with little or no sparsity, they tend to be slightly less accurate.
Compact storage schemes for sparse matrices have not been investigated, on the
grounds that there is little to be gained from them unless the JSIM contains thousands
of zeros. After all, even a thousand zeros is still only eight kilobytes.
3 Context
This section puts the new algorithm into its correct context within existing sparse matrix
theory as described in [11]. It is shown that the new algorithm is equivalent to a reordered
Cholesky factorization, and that the JSIM of a kinematic tree belongs to a class of matrices
which are known to be factorizable without fill-in. Thus, the new algorithm does not
accomplish anything that could not have been done using existing methods. The novelty
therefore lies only in the details of its principle of operation, which results in a particularly
simple and easy factorization process for JSIMs.
Let A be a symmetric, positive-definite matrix, and let P be a permutation matrix.
The matrix à = PT A P is then a symmetric permutation of A, and is also a symmetric,
positive-definite matrix. Permutation matrices are orthogonal (i.e., PT = P−1 ), so it
follows that A = P Ã PT .
A Cholesky factorization of à into à = G GT is called a reordered Cholesky factor-
ization of A, the factors being P G and (P G)T :
A = P Ã PT = P G GT PT = (P G) (P G)T .
To obtain the LTL factorization, we introduce a second permutation, Q, and insert a
factor Q QT (which is the identity matrix) as follows:
    A = P G Q QT GT PT
      = (P G Q) (P G Q)T .
We then set P = Q = R, where R is the special permutation that reverses the order of
the rows of a matrix. R has ones along its anti-diagonal (top right to bottom left) and
zeros elsewhere. With this substitution, the factorization becomes
A = (R G R) (R G R)T .
Now, R G R is an upper-triangular matrix, so we may equate it with a lower-triangular
matrix, L, as follows:
LT = R G R .
Having established the correspondence, it follows that the LTL factorization has the same
general mathematical and numerical properties as the Cholesky factorization.
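This correspondence is easy to check numerically. The sketch below (ours) factorizes a
small random SPD matrix by a reversed Cholesky factorization and confirms that the
result is an LTL factorization:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5)); H = B @ B.T + 5 * np.eye(5)
R = np.eye(5)[::-1]                  # row-reversal permutation, R = R^T
G = np.linalg.cholesky(R @ H @ R)    # H~ = G G^T, with G lower triangular
LT = R @ G @ R                       # upper triangular, so L is lower
assert np.allclose(LT @ LT.T, H)     # H = (R G R)(R G R)^T = LT L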
Branch-induced sparsity has a pattern that is a symmetric permutation of a standard
pattern known as nested, doubly-bordered block-diagonal. This pattern is defined recur-
sively as follows: the matrix as a whole consists of a block-diagonal sub-matrix bordered
below and to the right by one or more rows and columns of nonzero entries; and zero or
more of the blocks within the block-diagonal sub-matrix have the same structure.
Any JSIM can be brought into this form by the following procedure. First, construct
a new regular numbering (if necessary) that traverses the tree in depth-first order; then
reorder the rows and columns of H according to the new regular numbering; then reverse
the order of the rows and columns. If this procedure is applied to H in Eq. 4, to produce
a permuted matrix H̃, then

H̃ =
×   ×       ×
  × ×       ×
× × ×       ×
      ×   × ×
        × × ×
      × × × ×
× × × × × × ×
(7)
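The following sketch (ours) reproduces this permutation numerically: it builds the
sparsity pattern of H for Tree 1, renumbers the bodies in depth-first order, reverses the
order, and prints the pattern of Eq. 7:

import numpy as np

def ancestors(lam, i):
    # bodies on the path from body i to the root, including i itself
    s = set()
    while i != 0:
        s.add(i)
        i = lam[i]
    return s

def dfs_reversed_perm(lam):
    # depth-first renumbering followed by a reversal of the order
    n = len(lam) - 1
    children = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        children[lam[i]].append(i)
    order, stack = [], [0]
    while stack:
        v = stack.pop()
        if v != 0:
            order.append(v)
        stack.extend(reversed(children[v]))
    return order[::-1]   # entry p is the old body number at new position p+1

lam = [0, 0, 1, 1, 2, 2, 3, 3]        # Tree 1
nz = lambda i, j: j in ancestors(lam, i) or i in ancestors(lam, j)
H = np.array([[int(nz(i, j)) for j in range(1, 8)] for i in range(1, 8)])
p = [b - 1 for b in dfs_reversed_perm(lam)]
print(H[np.ix_(p, p)])                 # pattern of Eq. 7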
In the sparse matrix literature, it is usually the case that a nested, doubly-bordered
block-diagonal matrix has a substantial amount of additional sparsity in its borders and
atomic blocks (the ones that are not themselves nested, doubly-bordered block-diagonal).
The challenge is then to devise algorithms that exploit all of the sparsity. The JSIM is a
special case in which there is no additional sparsity, which makes it amenable to simpler
algorithms.
One important property of H̃ is that it contains no zeros inside its envelope. The
envelope of a sparse matrix is the set of elements below the main diagonal that are either
nonzero elements themselves, or are located somewhere to the right of a nonzero element.
The envelope of H̃ consists of the elements H̃31 , H̃32 , H̃64 , H̃65 and H̃71 . . . H̃76 . The
importance of the envelope is that a standard Cholesky factorization preserves every zero
outside the envelope. As H̃ contains no zeros inside its envelope, it follows that a standard
Cholesky factorization will proceed without filling in any of the zeros in this matrix.
This is sufficient to demonstrate that an optimal factorization of a JSIM with branch-
induced sparsity can be accomplished using standard techniques from sparse matrix the-
ory. In comparison, the only advantages offered by the LTL algorithm are matters of
convenience: it is easy to implement and retro-fit into existing code; it works directly
on the original JSIM; and it produces factors that are literally triangular, rather than
permutations of a triangular matrix.
4 Cost of the Sparse Factorization

The costs of the sparse LTL and LTDL algorithms can be expressed in terms of two
quantities, D1 and D2, defined by

    D1 = Σ_{i=1}^{n} (di − 1)    (8)

and

    D2 = Σ_{i=1}^{n} di (di − 1)/2 ,    (9)

where di is the number of bodies on the path between body i and the root, including
body i itself (so di = i for an unbranched chain).

Table 1: Costs of the sparse LTL and LTDL algorithms.

                  LTL                          LTDL
factorize         n √ + D1 div + D2 (m + a)    D1 div + D2 (m + a)
back-subst        2n div + 2 D1 (m + a)        n div + 2 D1 (m + a)
L x, LT x         n m + D1 (m + a)             D1 (m + a)
L−1 x, L−T x      n div + D1 (m + a)           D1 (m + a)
D1 is the total number of step-2 operations performed by the factorization algorithm. It
is also the total number of nonzero elements below the diagonal in the JSIM. D2 is the
total number of step-3 operations performed by the algorithm, each such operation involving
one multiplication and one subtraction. The total cost of the sparse LTL factorization is
therefore

    n √ + D1 div + D2 (m + a) ,

where the symbols √, div, m and a stand for square-root calculations, divisions, mul-
tiplications and additions, respectively, with subtractions counting as additions for cost
purposes.
Table 1 presents a summary of the computational costs of factorization, back-substi-
tution, and multiplying a vector by a sparse triangular factor or its inverse, for both the
LTL and LTDL factorizations. As the table shows, the LTDL algorithm comes out slightly
ahead, beating the LTL algorithm by n square-root operations in the factorization process
and n divisions in the back-substitution process.
The quantities D1 and D2 are bounded by

    0 ≤ D1 ≤ (n^2 − n)/2    (10)

and

    0 ≤ D2 ≤ (n^3 − n)/6 .    (11)

Thus, the asymptotic complexity of factorization can vary between O(1) and O(n^3), but
the asymptotic complexity of back-substitution can vary only between O(n) and O(n^2).
Figure 4: A kinematic chain with short side branches (a), a balanced binary tree (b) and
a spanning tree for a square grid (c).
The lower limit occurs when every body is connected directly to the root (i.e., di = 1
for all i). In this case, the matrix is diagonal. Although the LTDL factorization can, in
theory, be performed without any cost, the algorithm listed in Appendix A contains a
loop that will iterate over all n rows, and will therefore execute O(n) instructions.
The upper limit occurs when there are no branches in the kinematic tree (di = i for
all i). In this case, the matrix is dense, and the factorization costs for LTL and LTDL
are identical to the costs for standard Cholesky and LDLT factorizations, respectively.
However, there is a slight overhead in the inner loops of the sparse algorithms, in that
assignment statements like i = λ(i) typically take slightly longer to execute than incre-
menting (or decrementing) a variable. Nevertheless, the overhead is sufficiently small that
there is almost nothing to lose, and potentially a lot to gain, by simply replacing Cholesky
and LDLT factorizations with LTL and LTDL wherever JSIMs get factorized.
The depth of the tree has a major influence on complexity. For example, if di is subject
to an upper limit, dmax , such that di ≤ dmax for all i, then D1 and D2 are bounded by
0 ≤ D1 ≤ n (dmax − 1) (12)
and
0 ≤ D2 ≤ n dmax (dmax − 1)/2 . (13)
If dmax is a constant, then both D1 and D2 are O(n). Systems with this property do occur
in practice; for example, a swarm of identical mobile robots.
The exact cost of a sparse factorization or back-substitution must be calculated via D1
and D2; but a reasonable estimate can be obtained via the following rule of thumb. Let
α be a number between 0 and 1 representing the density of the matrix to be factorized,
i.e., the ratio of nonzero elements to the total number of elements in the matrix. The
costs of sparse factorization and back-substitution will then be approximately α^2 and α
times the costs of dense factorization and back-substitution, respectively. Thus, if 50% of
the elements of a JSIM are nonzero then the sparse factorization will be about four times
faster than a dense factorization, and so on.
Table 2: Formulas for D1 and D2, and the resulting order of factorization cost, for four
tree topologies.

Topology     D1                        D2                              where         Order
unbranched   (n^2 − n)/2               (n^3 − n)/6                     —             n^3
short s.b.   m^2                       (2m^3 + 3m^2 + m)/6             n = 2m        n^3
bal. tree    Σ_{i=1}^{m−1} i·2^i       Σ_{i=1}^{m−1} i(i+1)·2^(i−1)    n = 2^m − 1   n(log(n))^2
span grid    m^3 − m^2                 (7m^4 − 6m^3 − m^2)/12          n = m^2       n^2
Figure 5: Comparison of factorization cost (operations count) versus n for the tree topolo-
gies in Table 2.
Four tree topologies are considered here: an unbranched chain, a chain with short
side branches, a balanced binary tree and a spanning tree for a square grid. The latter
three are illustrated in Figure 4.
Table 2 presents formulas for D1 and D2 that allow their values to be calculated as
functions of n. The formulas for the unbranched tree are valid for all n, and are therefore
expressed directly in terms of n. The other formulas are not valid for all n, so they are
expressed in terms of a second integer, m, and an expression is given in the ‘where’ column
to indicate how n is related to m. The expression n = 2m, for example, implies that n
must be an even number.
Figure 5 plots the factorization cost of each topology against n. The cost figures in
this graph, and those quoted in the rest of this section, are based on a simple operations
count; i.e., the cost of an LTDL factorization is simply the number D1 + 2D2 . Cost figures
in Section 6 are computed differently.
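The quantities D1 and D2, and hence the operations count, are easily computed from a
parent array. The sketch below (ours) reproduces, for example, the factor of roughly 7.8
between the balanced-tree and dense factorization costs at n = 15 that is quoted later in
this section:

def depths(lam):
    # d[i]: number of bodies on the path from body i to the root (Eqs. 8-9)
    d = [0] * len(lam)
    for i in range(1, len(lam)):
        d[i] = d[lam[i]] + 1
    return d[1:]

def ltdl_ops(lam):
    # operations count D1 + 2*D2 for a sparse LTDL factorization
    d = depths(lam)
    D1 = sum(x - 1 for x in d)
    D2 = sum(x * (x - 1) // 2 for x in d)
    return D1 + 2 * D2

# Balanced binary tree with n = 2^4 - 1 = 15 bodies: parent of i is i // 2
print(ltdl_ops([i // 2 for i in range(16)]))   # 158
# Unbranched chain with n = 15: lam[i] = i - 1
print(ltdl_ops([0] + list(range(15))))         # 1225, about 7.8 times more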
For the chain with short side branches, D1 and D2 converge to one half and one
quarter, respectively, of their values for the unbranched case. As D2 dominates, the cost
of factorization converges to one quarter of the cost for the unbranched chain. This is a
good example of the rule of thumb mentioned previously: about half the elements of the
JSIM are zeros, and the factorization cost is four times less than the unbranched case.
If we increase the number of bodies in each side branch from one to two, then D1 and
D2 converge to one third and one ninth, respectively, of their values for the unbranched
case. The same would be true if we kept the branches at their present size, but had two
branches per node on the main chain instead of only one. More generally, if β is the ratio
of the length of the main chain to the total number of bodies, then D1 and D2 converge
to β and β^2 times their values for the unbranched case.
The balanced binary tree shows a very different picture. In a tree containing n = 2^m − 1
nodes, excluding the root, there are 2^(k−1) nodes for which di = k, for k = 1 . . . m. So D1
and D2 are bounded by

    (m − 1) · 2^(m−1) ≤ D1 ≤ (m − 1) · 2^m

and

    m(m − 1) · 2^(m−2) ≤ D2 ≤ m(m − 1) · 2^(m−1) .

Thus, D1 is O(mn) and D2 is O(m^2 n), where m ≈ log2 n, and the cost of factorization is
therefore O(n(log(n))^2). Compared with the cost of factorizing a dense matrix, the cost
of factorizing the JSIM of a balanced binary tree is 7.8 times less at n = 15, and 430
times less at n = 255.
Figure 4(c) shows one of several possible designs of spanning tree for a square grid of
nodes. Any spanning tree is acceptable, provided it connects every node in the grid to
the root via a minimum-length path. The formulas in Table 2 apply to all such spanning
trees. For this kind of tree, D1 is O(n^1.5) and D2 is O(n^2). Compared with the cost of
factorizing a dense matrix, the cost of factorizing a JSIM for this kind of tree is 5.3 times
less at n = 16, and 74 times less at n = 256.
5 Cost of the CRBA

The following implementation of the composite rigid body algorithm (CRBA) calculates
the JSIM of a kinematic tree, exploiting branch-induced sparsity:

for i = 1 to n do
    Ici = Ii
end
for i = n to 1 do
    f = Ici si
    Hii = siT f
    if λ(i) ≠ 0 then
        Icλ(i) = Icλ(i) + λ(i)XFi Ici iXMλ(i)
    end
    j = i
    while λ(j) ≠ 0 do
        f = λ(j)XFj f
        j = λ(j)
        Hij = Hji = sjT f
    end
end
This algorithm calculates every nonzero element of the JSIM. It does not initialize
or access the zeros. If the JSIM is to be accessed by other software that is not aware
of its sparsity structure, then the zero elements must be initialized to zero at least once.
This can be done when the matrix is created or allocated, or before its first use. If other
software fills in the zeros (e.g. by using a dense factorization algorithm) then the zero
elements must be initialized each time the JSIM is calculated.
The quantities appearing in this algorithm are as follows. All are expressed in link
coordinates. si is a 6-D vector representing the axis of joint i; and Ii and Ici are 6 × 6
matrices representing the rigid-body inertias of link i and body Ci , respectively, where Ci is
the composite rigid body formed by the rigid assembly of all the links in ν(i). si and Ii are
constants in link-i coordinates. λ(i)XFi and iXMλ(i) are coordinate transformation matrices
that transform a force vector or a motion vector, respectively, from the coordinate system
indicated in the subscript to the coordinate system indicated in the leading superscript.
They are related by (λ(i)XFi)T = iXMλ(i). Finally, f is a vector representing the force
required to impart an acceleration of si to Ci. This force is first calculated in link-i
coordinates, and is then transformed successively to the coordinate systems of link λ(i),
link λ(λ(i)), and so on.
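As an illustration only, here is a Python sketch (ours) of the algorithm above using
explicit 6 × 6 matrices; the variable names are placeholders, the spatial quantities are
assumed to be supplied by the caller, and all joints are assumed to have one DoF:

import numpy as np

def crba_sparse(lam, S, I, XF):
    # lam[i]: parent of body i; S[i]: 6-D joint axis; I[i]: 6x6 link inertia;
    # XF[i]: 6x6 force transform from link-i to link-lam(i) coordinates.
    # Index 0 of each list is unused.
    n = len(lam) - 1
    H = np.zeros((n, n))
    Ic = [None] + [I[i].copy() for i in range(1, n + 1)]  # composite inertias
    for i in range(n, 0, -1):
        f = Ic[i] @ S[i]                 # force for unit acceleration of si
        H[i-1, i-1] = S[i] @ f
        if lam[i] != 0:
            # (XF)^T is the corresponding motion transform, as noted above
            Ic[lam[i]] += XF[i] @ Ic[i] @ XF[i].T
        j = i
        while lam[j] != 0:               # walk up the ancestors of body i
            f = XF[j] @ f
            j = lam[j]
            H[i-1, j-1] = H[j-1, i-1] = S[j] @ f
    return H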
For cost calculation purposes, it is assumed that the coordinates of si consist of five
zeros and a one, so that the multiplications Ici si and siT f simplify to selecting a column
from Ici and an element from f , respectively. Given these assumptions, the computational
cost of the above algorithm can be expressed as

    D0 (ra + rx) + D1 vx ,    (14)
where the symbols ra, rx and vx stand for the operations ‘rigid-body add’, ‘rigid-body
transform’ and ‘vector transform’, respectively. D0 is the number of mobile bodies in the
system that are not connected directly to the base; i.e., the number of bodies for which
λ(i) ≠ 0, or the number of bodies for which di > 1. D0 can be expressed as

    D0 = Σ_{i=1}^{n} min(1, di − 1) = Σ_{i=1}^{n} min(1, λ(i)) ,    (15)
If the kinematic tree contains branches, then D1 (at least) will be smaller, and the
computational cost correspondingly less. From the data in Table 2, the cost of the CRBA
for a chain with short side branches should converge to half the cost of the unbranched
case for the same number of bodies; and the asymptotic complexity of the CRBA is
O(n log(n)) for a balanced binary tree and O(n^1.5) for the spanning tree of a square grid.
Much effort has gone into finding minimum-cost implementations for operations like
ra, rx and vx. The minimum cost for ra is 10a, but the minimum costs for the other
two depend on how the link coordinate frames are defined. This is where the situation
becomes more complicated for a branched tree than for an unbranched tree.
The most efficient implementations of rx and vx require that the coordinate frames
be located in accordance with a set of DH parameters [9, 12], in which case the coordinate
transformations implied by λ(i)XFi and iXMλ(i) can be accomplished via the successive
application of two axial screw transforms: one aligned with the x axis, and one aligned
with the z axis [9, 14]. The current best figures for these operations are 32m + 33a for
rx, and 20m + 12a for vx (Table II in [14]).¹
However, as explained in [12], if a node has two or more children, then only one child
can have the benefit of DH parameters, while the others must use a general coordinate
transformation instead (unless the mechanism has a special geometry). We shall use the
terms ‘DH node’ and ‘non-DH node’ to refer to those nodes that do have the benefit of
DH parameters, and those that do not, respectively. The root node has no parent, and is
therefore excluded from this classification.
The number of non-DH nodes is determined by the connectivity graph. If c(i) is the
number of children of node i, then the number of non-DH nodes is given by the formula

    non-DH = Σ_{i=0}^{N} max(0, c(i) − 1) ,    (17)
where N is the number of mobile bodies in the tree. In an unbranched chain, c(i) ≤ 1
for all i, so every node is a DH-node; but Tree 1 has three nodes with two children each,
and therefore has three non-DH nodes. Referring back to Figure 1(b), one child of node
1, one child of node 2 and one child of node 3 must be non-DH nodes, but we are free to
choose which child in each case.
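Eq. 17 is simple to evaluate from a parent array, as this sketch (ours) shows for Tree 1:

from collections import Counter

def non_dh_count(lam):
    # number of non-DH nodes (Eq. 17): every child beyond the first
    c = Counter(lam[1:])   # c[i] = number of children of node i
    return sum(max(0, k - 1) for k in c.values())

print(non_dh_count([0, 0, 1, 1, 2, 2, 3, 3]))   # Tree 1 -> 3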
The best figures for rx and vx for a non-DH node are 47m + 48a and 24m + 18a,
respectively. The former is the cost of three successive axial screws, according to the
figures in [14], and the latter comes from Table 8-3 in [9].
To account for these differences in cost, it is necessary to separate D0 and D1 each
into two components: one to count how many operations are performed at the DH nodes,
and the other to count operations at the non-DH nodes. Thus, we seek an expanded cost
formula of the form

    D0a (ra + rxa) + D1a vxa + D0b (ra + rxb) + D1b vxb ,    (18)
where the subscripts a and b refer to the DH and non-DH nodes, respectively. We already
know the cost figures for ra, rx and vx, so we just need expressions for D0a . . . D1b .
¹ After discussing the matter with Prof. Orin, I have used 15m + 15a as the base cost of a screw
transform, instead of the 15m + 16a that appears in this table.
D0a counts how many times ra + rx is performed at a DH node; D0b counts how many
times ra + rx is performed at a non-DH node; and so on. An inspection of the CRBA
implementation listed above reveals that it performs ra + rx + |ν(i)| vx at each node i
satisfying λ(i) 6= 0, where |ν(i)| is the number of elements in ν(i). So let us define the
sets πa and πb to be the set of all DH nodes, and the set of all non-DH nodes, respectively,
that are not directly connected to the root. It follows immediately from these definitions
that
    D0a = |πa| ,   D1a = Σ_{i∈πa} |ν(i)| ,
    D0b = |πb| ,   D1b = Σ_{i∈πb} |ν(i)| .    (19)
Before moving on, let us tie up one loose end. It is clearly necessary that D0a + D0b =
D0 and D1a + D1b = D1 . The first equation follows directly from the definitions of D0 , πa
and πb . The second equation can be proved as follows. Starting from Eq. 8,
    D1 = Σ_{i=1}^{n} (di − 1)
       = Σ_{i=1}^{n} (|ν(i)| − 1)
       = Σ_{i=1}^{n} |ν(i)| − n
       = Σ_{i∈πa} |ν(i)| + Σ_{i∈πb} |ν(i)| + Σ_{i∉(πa∪πb)} |ν(i)| − n
       = Σ_{i∈πa} |ν(i)| + Σ_{i∈πb} |ν(i)| .
To follow the first step, observe that di − 1 and |ν(i)| − 1 are the numbers of nonzero
elements on row i below and above the main diagonal, respectively. Thus, the summations
on the first and second lines count the total number of nonzero elements below and above
the main diagonal, respectively. As the matrix is symmetrical, the two are the same. The
final step uses the fact that Σ_{i∉(πa∪πb)} |ν(i)| is just a count of all the descendants of the
bodies that are connected directly to the root, each body being counted exactly once,
which comes to n.

Substituting the known cost figures into Eq. 18 gives the cost of the CRBA as

    D0a (32m + 43a) + D1a (20m + 12a) + D0b (47m + 58a) + D1b (24m + 18a) .    (20)

When choosing the DH nodes, any choice that maximizes D1a is optimal.
Equation 20 gives the cost of the CRBA for a fixed-base system; but there is one more
optimization one can make for a floating-base system, which exploits the fact that three
of the DH parameters between links i and λ(i) can be set to zero if link λ(i) happens to
be a floating base [15]. This allows a saving of 18m + 21a on each affected rx operation,
and 12m + 8a on each affected vx operation. To incorporate this optimization into the
cost formula, we define a third set, πc , which is the set of DH nodes that are children of
a floating base. We then define the numbers D0c = |πc| and D1c = Σ_{i∈πc} |ν(i)|, which
count the number of occurrences of the rx and vx savings, respectively. The final cost
formula is then

    D0a (32m + 43a) + D1a (20m + 12a)
    + D0b (47m + 58a) + D1b (24m + 18a)    (21)
    − D0c (18m + 21a) − D1c (12m + 8a) .

Figure 6: Connectivity graph of the humanoid mechanism: a torso connected to the root
by a 6-DoF joint, with four 6-DoF limbs; the inset shows the expansion of the 6-DoF joint.
6 Practical Example
The purpose of this section is to illustrate the effect of branches on the cost of forward
dynamics calculations. To this end, it presents cost figures for two rigid-body systems: a
simple humanoid (or quadruped) consisting of a single rigid torso and four 6-DoF limbs,
and an unbranched chain with a floating base at one end. Both systems have 25 bodies
connected together by 24 revolute joints; both have a floating base, hence 30 DoF in total;
and both have general geometry and general inertias. Thus, the only relevant difference
between them is that one is branched and the other is not.
The figures presented here support the following two statements:
1. an O(n^3) algorithm based on the CRBA and the LTDL factorization can calculate
the dynamics of the humanoid more than twice as quickly as it can calculate the
dynamics of the equivalent unbranched chain; and
2. this O(n^3) algorithm can calculate the dynamics of the humanoid almost as quickly
as the fastest O(n) algorithm.
The obvious conclusion is that O(n^3) algorithms are still competitive with O(n) algo-
rithms, even at relatively high values of n like n = 30, provided there is a sufficient
amount of branching in the kinematic tree.
In the paragraphs that follow, we will be comparing the costs of various calculations
by quoting cost ratios. Unfortunately, these numbers depend on the relative cost of an
addition compared to a multiplication, and there is no agreed value for this ratio. We
can overcome this difficulty by quoting two figures: one based on the assumption that
additions are free (i.e., a = 0), and the other based on the assumption that an addition
costs the same as a multiplication (i.e., a = m). These represent two extreme cases, and
any realistic assumption about the cost of an addition will lie somewhere in between.
Thus, if we say that calculation X is 2.5 times faster than calculation Y if a = 0, and 2.6
times faster if a = m, then the realistic cost ratio will lie somewhere in between. Divisions
will be counted as multiplications for cost calculation purposes.
The first step is to calculate the various D quantities for the humanoid. These are ob-
tained from the connectivity graph, which is shown in Figure 6. As the torso is connected
to the fixed base via a 6-DoF joint, we will actually be using two connectivity graphs:
the original graph and the expanded graph, where the latter is obtained from the former
by replacing the 6-DoF joint with the expansion shown inset in the diagram. To avoid
possible confusion, we shall apply a superscript e to the D quantities obtained from the
expanded graph.
The factorization and back-substitution costs depend on D1e and D2e, which are ob-
tained from the expanded graph as follows. From Figure 6, one can see that there is one
body in the expanded graph for which di = 1, one body for which di = 2, and so on, up
to di = 6, this body being the torso. Thereafter, there are four bodies for which di = 7,
four for which di = 8, and so on, up to di = 12. Thus, from Eqs. 8 and 9,

    D1e = Σ_{i=1}^{6} (i − 1) + 4 × Σ_{i=7}^{12} (i − 1) = 219

and

    D2e = Σ_{i=1}^{6} i (i − 1)/2 + 4 × Σ_{i=7}^{12} i (i − 1)/2 = 1039 .
These numbers imply that 48% of the elements in the humanoid's JSIM (which is
a 30 × 30 matrix) are branch-induced zeros. To see this, recall that D1e is the number
of nonzero elements below the main diagonal, as mentioned in Section 4. The total
number of nonzero elements is therefore 2 D1e + 30 = 468, hence the number of zeros is
900 − 468 = 432.
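These figures can be reproduced from the expanded parent array; the sketch below (ours)
constructs the expanded graph of Figure 6 and evaluates Eqs. 8 and 9 using the depths
helper from the sketch in Section 4:

# Expanded humanoid graph: bodies 1-6 are the serial chain replacing the
# 6-DoF torso joint; four 6-body limbs hang from body 6 (the torso).
lam = [0, 0, 1, 2, 3, 4, 5]
for limb in range(4):
    first = len(lam)
    lam += [6] + list(range(first, first + 5))
d = depths(lam)
print(sum(x - 1 for x in d))             # D1e = 219
print(sum(x * (x - 1) // 2 for x in d))  # D2e = 1039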
Using the formulas in Table 1, the cost of factorizing the humanoid’s JSIM (setting
div = m for cost purposes) is 1258m + 1039a, and the cost of back-substitution is 468m +
438a per vector. For comparison, the cost of factorizing a dense 30 × 30 matrix is 4930m +
4495a, and the cost of back-substitution is 900m + 870a per vector.
To calculate the cost of the CRBA, we need the quantities D0a . . . D1c pertaining to
the original connectivity graph. In this graph, the torso is the only body for which di = 1;
then there are four bodies for which di = 2, four for which di = 3, and so on, up to di = 7.
Thus, from Eqs. 15 and 8,

    D0 = Σ_{i=1}^{25} min(1, di − 1) = 24

and

    D1 = 0 + 4 × Σ_{i=2}^{7} (i − 1) = 84 .
The torso is the only body with more than one child. Only one child can be a DH node, so
the other three are non-DH nodes. These are the only three non-DH nodes in the graph.
Each non-DH node is the root of a subtree containing six nodes, so |ν(i)| = 6 for each
one. Thus, from Eq. 19, we have
D0b = 3 , D1b = 3 × 6 = 18 ,
D0a = 21 , D1a = D1 − D1b = 66 .
And finally, there is only one DH node that is the child of a floating base, and this node
also has |ν(i)| = 6, so
D0c = 1 , D1c = 6 .
For comparison, the corresponding figures for the equivalent unbranched chain are
D0a = 24, D1a = 300, D0b = D1b = 0, D0c = 1 and D1c = 24.
Plugging these values into Eq. 21 gives the cost of calculating the JSIM of the humanoid
as 2475m + 2124a, and the cost of calculating the JSIM of the equivalent unbranched chain
as 6462m + 4419a. Thus, the CRBA runs more than twice as quickly on the humanoid
as on the equivalent unbranched chain. The cost ratio is 2.37 if we assume a = m, and
2.61 if we assume a = 0.
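Both cost figures can be reproduced by evaluating Eq. 21 directly, as in this sketch (ours):

def crba_cost(D0a, D1a, D0b, D1b, D0c, D1c):
    # multiplications and additions given by Eq. 21
    m = 32*D0a + 20*D1a + 47*D0b + 24*D1b - 18*D0c - 12*D1c
    a = 43*D0a + 12*D1a + 58*D0b + 18*D1b - 21*D0c - 8*D1c
    return m, a

print(crba_cost(21, 66, 3, 18, 1, 6))    # humanoid: (2475, 2124)
print(crba_cost(24, 300, 0, 0, 1, 24))   # unbranched chain: (6462, 4419)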
Let us now look at the cost of the complete forward dynamics calculation for both
the humanoid mechanism and its equivalent unbranched chain; and let us compare these
figures with the cost of calculating the forward dynamics via an efficient O(n) algorithm.
For this comparison, we need a set of cost figures for an inverse dynamics algorithm (to
calculate C in Eq. 1), and for an O(n) forward dynamics algorithm.
For the inverse dynamics, we shall use the efficient implementation of the recursive
Newton-Euler algorithm (RNEA) described in [3] as Algorithm 3. The cost formula for
this algorithm is (93n−69)m+(81n−66)a. However, these figures refer to an unbranched
kinematic chain with a fixed base. To account for a floating base, we first increase n to
25, which accounts for the extra joint, but costs it as revolute; then we add a correction
term, 7m + 13a, which is the difference between the cost of a 6-DoF joint and a revolute
joint if both are connected to the root node. This gives us a cost figure of 2263m + 1972a
for the equivalent chain. To get a figure for the humanoid, we add a further correction
of 3 × (4m + 8a), which accounts for the extra transformation costs at the three non-DH
nodes, giving a cost figure of 2275m + 1996a. These correction terms are specific to this
algorithm.
For the O(n) forward dynamics, we shall use the figures in Table II of [15], which
pertain to a very efficient implementation of the articulated body algorithm (ABA). Ac-
cording to this table, the cost formula for an unbranched chain with a floating base is
(224n − 30)m + (205n − 37)a. This immediately gives us a figure of 5346m + 4883a for the
cost of the ABA on the equivalent unbranched chain. To get a figure for the humanoid,
we must add a correction of 3 × (66m + 57a) to account for the additional transformation
costs incurred at the three non-DH nodes, resulting in a total cost of 5544m + 5054a. This
correction term is specific to the algorithm described in [15], and it was obtained with the
aid of data in Table II of [14].
Based on these figures, the cost of the complete forward dynamics calculation via
the CRBA is 6476m + 5597a for the humanoid, and 14555m + 11756a for the equivalent
unbranched chain. As a result of the branches in the kinematic tree, the dynamics cal-
culation for the humanoid is 2.18 times faster than for the unbranched chain assuming
a = m, or 2.25 times faster assuming a = 0. It is also only about 14% slower than the
ABA assuming a = m, or 17% slower assuming a = 0. These results are summarized in
Figure 7. Observe that the ABA’s speed advantage is almost completely wiped out by
Figure 7: Computational cost of forward dynamics, via CRBA and ABA, for the humanoid
mechanism shown in Figure 6 and an equivalent unbranched chain.
the cost reduction due to branch-induced sparsity. In round numbers, the ABA is faster
by a factor of 2.6 on the unbranched chain, but only 15% faster on the humanoid. One
may therefore conclude that O(n^3) algorithms are still competitive with O(n) algorithms,
even at quite high values of n like n = 30, provided there is sufficient branching in the
kinematic tree.
7 Conclusion
This paper has presented a new factorization algorithm that fully exploits branch-induced
sparsity in the joint-space inertia matrix (JSIM). It is simple to implement and use, and it
incurs almost no overhead compared with standard algorithms; yet it can deliver large re-
ductions in the cost of factorizing a JSIM, and in the cost of using the resulting sparse fac-
tors. The complexity of factorization depends on the number of nonzero elements, rather
than the size of the matrix, and the theoretical lower limit is O(1). Some examples are
presented of kinematic trees with factorization complexities ranging from O(n(log(n))^2)
to O(n^3).
This paper also presented a cost and complexity analysis for the composite rigid body
algorithm (CRBA) for the case of a branched kinematic tree. It is shown that the cost
of this algorithm can be considerably less for a branched kinematic tree than for an
equivalent unbranched chain, and that the theoretical lower limit on the complexity of
the CRBA is O(1).
Finally, this paper presented the results of a detailed costing of the forward dynamics
of a 30-DoF branched kinematic tree that could represent either a simple humanoid robot
or a quadruped, and an equivalent 30-DoF unbranched chain. It was shown that an
O(n^3) algorithm incorporating the CRBA and a sparse factorization algorithm runs 2.6
times slower than an efficient O(n) algorithm (the articulated-body algorithm) on the
unbranched chain, but only 15% slower on the humanoid. This is mainly due to the
O(n^3) algorithm running 2.2 times faster on the humanoid than on the unbranched chain.
Thus, an O(n^3) algorithm can be competitive with an O(n) algorithm, even at n = 30, if
there are enough branches in the tree, and if the branch-induced sparsity is fully exploited.
This paper did not consider systems with kinematic loops; but the results are relevant
to any dynamics algorithm that solves closed-loop dynamics via the JSIM of the spanning
tree.
References
[1] Angeles, J., and Ma, O. 1988. Dynamic Simulation of n-Axis Serial Robotic Manip-
ulators Using a Natural Orthogonal Complement. Int. J. Robotics Research, vol. 7,
no. 5, pp. 32–47.
[2] Bae, D. S., and Haug, E. J. 1987. A Recursive Formulation for Constrained Mechan-
ical System Dynamics: Part I: Open Loop Systems. Mechanics of Structures and
Machines, vol. 15, no. 3, pp. 359–382.
[3] Balafoutis, C. A., Patel, R. V., and Misra, P. 1988. Efficient Modeling and Com-
putation of Manipulator Dynamics Using Orthogonal Cartesian Tensors. IEEE J.
Robotics & Automation, vol. 4, no. 6, pp. 665–676.
[4] Balafoutis, C. A., and Patel, R. V. 1989. Efficient Computation of Manipulator In-
ertia Matrices and the Direct Dynamics Problem. IEEE Trans. Systems, Man &
Cybernetics, vol. 19, no. 5, pp. 1313–1321.
[5] Baraff, D. 1996. Linear-Time Dynamics using Lagrange Multipliers. Proc. SIG-
GRAPH ’96, New Orleans, August 4–9, pp. 137–146.
[6] Brandl, H., Johanni, R., and Otter, M. 1988. A Very Efficient Algorithm for the
Simulation of Robots and Similar Multibody Systems Without Inversion of the Mass
Matrix. Theory of Robots, P. Kopacek, I. Troch & K. Desoyer (eds.), Oxford: Perg-
amon Press, pp. 95–100.
[7] Duff, I.S., Erisman, A. M., and Reid, J. K. 1986. Direct Methods for Sparse Matrices.
Oxford: Clarendon Press.
[9] Featherstone, R. 1987. Robot Dynamics Algorithms. Boston: Kluwer Academic Pub-
lishers.
[10] Featherstone, R., and Orin, D. E. 2000. Robot Dynamics: Equations and Algorithms.
Proc. IEEE Int. Conf. Robotics & Automation, San Francisco, April, pp. 826–834.
[11] George, A., and Liu, J. W. H. 1981. Computer Solution of Large Sparse Positive
Definite Systems. Englewood Cliffs, N.J.: Prentice Hall.
[12] Khalil, W., and Dombre, E. 2002. Modeling, Identification and Control of Robots.
New York: Taylor & Francis Books.
[13] Lilly, K. W., and Orin, D. E. 1991. Alternate Formulations for the Manipulator
Inertia Matrix. Int. J. Robotics Research, vol. 10, no. 1, pp. 64–74.
[14] McMillan, S., and Orin, D. E. 1995. Efficient Computation of Articulated-Body In-
ertias Using Successive Axial Screws. IEEE Trans. Robotics & Automation, vol. 11,
no. 4, pp. 606–611.
[15] McMillan, S., and Orin, D. E. 1995. Efficient Dynamic Simulation of an Underwater
Vehicle with a Robotic Manipulator. IEEE Trans. Systems, Man & Cybernetics, vol.
25, no. 8, pp. 1194–1206.
[16] Orlandea, N., Chace, M. A., and Calahan, D. A. 1977. A Sparsity-Oriented Approach
to the Dynamic Analysis and Design of Mechanical Systems—Part 1. Trans. ASME
J. Engineering for Industry, vol. 99, no. 3, pp. 773–779.
[17] Rodriguez, G. 1987. Kalman Filtering, Smoothing, and Recursive Robot Arm For-
ward and Inverse Dynamics. IEEE J. Robotics & Automation, vol. RA-3, no. 6, pp.
624–639.
[18] Rodriguez, G., Jain, A., and Kreutz-Delgado, K. 1991. A Spatial Operator Algebra
for Manipulator Modelling and Control. Int. J. Robotics Research, vol. 10, no. 4, pp.
371–381.
[20] Saha, S. K. 1997. A Decomposition of the Manipulator Inertia Matrix. IEEE Trans.
Robotics & Automation, vol. 13, no. 2, pp. 301–304.
[21] Saha, S. K. 1999. Dynamics of Serial Multibody Systems Using the Decoupled Natu-
ral Orthogonal Complement Matrices. Trans. ASME, J. Applied Mechanics, vol. 66,
no. 4, pp. 986–996.
[22] Stejskal, V., and Valášek, M. 1996. Kinematics and Dynamics of Machinery. New
York: Marcel Dekker.
[24] Walker, M. W., and Orin, D. E. 1982. Efficient Dynamic Computer Simulation of
Robotic Mechanisms. Trans. ASME, J. Dynamic Systems, Measurement & Control,
vol. 104, no. 3, pp. 205–211.
A Algorithms
This appendix lists an algorithm for the sparse LTDL factorization, and algorithms for
the four multiplications L x, LT x, L−1 x and L−T x, where L is a sparse triangular factor
and x is an arbitrary vector or rectangular matrix. If x is a vector then the symbols xi
and yi refer to element i of vectors x and y; otherwise they refer to row i of matrices
x and y. The multiplication algorithms assume that L is a non-unit triangular factor,
as produced by the LTL factorization, in which Lii ≠ 1. They can be modified to work
with the unit triangular factors produced by the LTDL factorization simply by replacing
occurrences of ‘Lii ’ with ‘1’ and making any appropriate simplifications.
LTDL factorization
The algorithm below is the LTDL equivalent of the LTL algorithm described in Section 2.
It expects the same inputs as the LTL algorithm, and requires them to meet the same
conditions; i.e., an n × n symmetric, positive-definite matrix A, and an n-element array
λ, such that A satisfies Eq. 6 and λ satisfies Eq. 2. This algorithm works in situ, and it
accesses only the lower triangle of its matrix argument. The computed factors D and L
are returned in this triangle. As the diagonal elements of L are known to be 1, only the
off-diagonal elements are returned.
for k = n to 1 do
    i = λ(k)
    while i ≠ 0 do
        a = Aki /Akk
        j = i
        while j ≠ 0 do
            Aij = Aij − Akj a
            j = λ(j)
        end
        Aki = a
        i = λ(i)
    end
end
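For reference, here is a direct Python/NumPy transcription of this pseudocode (ours,
not part of the original text); like the pseudocode, it works in situ on the lower triangle:

import numpy as np

def sparse_ltdl(A, lam):
    # Sparse LTDL factorization, A = LT D L.  On exit the diagonal holds D
    # and the strictly lower triangle holds the off-diagonal elements of L.
    # lam[k] is the parent of body k (lam[0] unused).
    n = A.shape[0]
    for k in range(n, 0, -1):
        i = lam[k]
        while i != 0:
            a = A[k-1, i-1] / A[k-1, k-1]
            j = i
            while j != 0:
                A[i-1, j-1] -= A[k-1, j-1] * a
                j = lam[j]
            A[k-1, i-1] = a
            i = lam[i]
    return A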
Algorithm for y = L x
for i = n to 1 do
yi = Lii xi
j = λ(i)
while j ≠ 0 do
yi = yi + Lij xj
j = λ(j)
end
end
If y and x are different vectors then this algorithm assigns the value L x to y, leaving x
unaltered. If y and x refer to the same vector then this algorithm performs an in-situ
multiplication on x, overwriting it with L x.
Algorithm for y = LT x
for i = 1 to n do
yi = Lii xi
j = λ(i)
while j ≠ 0 do
yj = yj + Lij xi
j = λ(j)
end
end
This algorithm assigns the value LT x to y, leaving x unaltered. It does not work in situ,
so x and y must be different. An in-situ version would be inefficient.

Algorithm for x = L−1 x

for i = 1 to n do
    j = λ(i)
    while j ≠ 0 do
        xi = xi − Lij xj
        j = λ(j)
    end
    xi = xi /Lii
end

This algorithm works in situ on the vector x, replacing it with L−1 x. To implement
y = L−1 x, the algorithm can be modified by inserting the line 'yi = xi' between lines 1
and 2, and replacing lines 4 and 7 with 'yi = yi − Lij yj' and 'yi = yi /Lii', respectively.
Alternatively, one can simply copy x to y and apply the above algorithm to y.
Algorithm for x = L−T x

for i = n to 1 do
    xi = xi /Lii
    j = λ(i)
    while j ≠ 0 do
        xj = xj − Lij xi
        j = λ(j)
    end
end

This algorithm works in situ on the vector x, replacing it with L−T x. To implement
y = L−T x, one must copy x to y and apply the algorithm to y.