Some Combinatorial Properties of Certain Trees With Applications To Searching and Sorting
THOMAS N. HIBBARD
System Development Corporation, Santa Monica, California
INTRODUCTION
This paper introduces an abstract entity, the binary search tree, and exhibits
some of its properties. The properties exhibited are relevant to processes occur-
ring in stored program computers--in particular, to search processes. The dis-
cussion of this relevance is deferred until Section 2.
Section 1 constitutes the body of the paper. Section 1.1 consists of some mathe-
matical formulations which arise in a natural way from the somewhat less formal
considerations of Section 2.1. The main results are Theorem 1 (Section 1.2) and
Theorem 2 (Section 1.3).
The initial motivation of the paper was an actual computer programming
problem. This problem was the need for a list which could be searched efficiently
and also changed efficiently. Section 2.1 contains a description of this problem
and explains the relevance to its solution of the results of Section 1.
Section 2.2 contains an application to sorting.
The reader who is interested in the programming applications of the results
but not in their mathematical content can profit by reading Section 2 and making
only those few references to Section 1 which he finds necessary.
(i) There is one and only one node, called the root, such that for any node
p there exists one and only one path which begins with the root and
ends with p.
(ii) For each node p, the number of links beginning with p is either two or
zero. If the number is two, then p is said to be a proper node. If the
number is zero, then p is said to be a blank node.
(iii) The set of links is partitioned into two sets L and R. Each link belonging
to L is called a left link. Each link belonging to R is called a right link.
(iv) For each proper node p, there is exactly one left link beginning with p
and exactly one right link beginning with p.
For brevity, a search tree will here be defined as a binary search tree.
The following notational devices will be employed for search trees.
The length of a path is the number of proper nodes in the path.
To each node of any search tree we attach a label $p_\sigma$, where $\sigma$ is a sequence
of 1's and 0's uniquely defined by the following. Let the root be called $p_\Lambda$ (where
$\Lambda$ denotes the null sequence).² If, for any $\sigma$, $p_\sigma$ is a proper node, then $p_{\sigma 0}$ and
$p_{\sigma 1}$ are such that $(p_\sigma, p_{\sigma 0})$ is a left link and $(p_\sigma, p_{\sigma 1})$ is a right link.
For a search tree $T$, the search subtree $T_\sigma$ is defined for each $\sigma$ as follows.
Every node of $T_\sigma$ belongs to $T$. A node $q$ of $T$ belongs to $T_\sigma$ if and only if $q$ belongs
to a path in $T$ beginning with $p_\sigma$. Every left (right) link of $T_\sigma$ is a left (right)
link of $T$. A left (right) link $(q, r)$ of $T$ is a left (right) link of $T_\sigma$ if and only
if both $q$ and $r$ belong to $T_\sigma$. (Hence, if $\sigma$ is such that $T$ contains no node $p_\sigma$,
then $T_\sigma$ is defined to be empty.)
Throughout this paper, the symbols $\sigma$, $\rho$, and $\tau$ will denote sequences of 1's
and 0's. Further, any symbol having $\sigma$, $\rho$, or $\tau$ as a subscript is to be understood
to be a function of $\sigma$, $\rho$, or $\tau$.
In the following definitions, let $T$ be a search tree of $n$ proper nodes and let
$S$ be a set of $n$ numbers.
The list function $f_{ST}$ is a one-one mapping of the proper nodes of $T$ onto the
elements of $S$, with the following property. If $q$ is a proper node of $T_{\sigma 0}$ then
$f_{ST}(q) < f_{ST}(p_\sigma)$, and if $q$ is a proper node of $T_{\sigma 1}$ then $f_{ST}(q) > f_{ST}(p_\sigma)$. (Thus,
given any two of $(T, S, f_{ST})$, the third is uniquely defined.) When $S$ and $T$ are
understood, $f$ will be used to denote $f_{ST}$.
For each $\sigma$, the subset $S_\sigma$ of $S$ is defined as follows: $y$ is in $S_\sigma$ if and only if
there exists a node $q$ of $T_\sigma$ such that $y = f_{ST}(q)$.
A path in $T_\sigma$ which begins with $p_\sigma$ is said to be a search path in $T_\sigma$. If a search
path ends with a proper node then it is said to be an internal search path. If a
search path ends with a blank node then it is said to be an open search path.
(Note that if any search tree $T$ has $n$ proper nodes then there exist $n+1$ open
search paths in $T$, one for each blank node. This is easily shown by an induction
on $n$.) The lengths of the search paths in $T_\Lambda$ are the subject of the next two
sections.
2 The expressions $p_\Lambda$ and $p$ are equivalent. While the use of $\Lambda$ is never essential, it sometimes
increases clarity. Both notations will be used here.
$$U'(S) = \sum_{i=1}^{n!} l'(T^i).$$
Since the sequence search tree for the sequence of ranks $r(R^i)$ is identical to
the sequence search tree for $R^i$, it is clear that $U(S)$ and $U'(S)$ depend only
on $n$. Hence, let $S = \{1, 2, \ldots, n\}$ and let
$$V(n) = U(S); \qquad V'(n) = U'(S).$$
Let the mean open search length $\bar l(n)$ and the mean internal search length $\bar l'(n)$
now be defined as follows.
$$\bar l(n) = \frac{V(n)}{(n+1)!}, \qquad \bar l'(n) = \frac{V'(n)}{n \cdot n!}.$$
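THEOREM 1. It is true for each $n$ that
$$\bar l(n) = 2\left(\frac12 + \frac13 + \cdots + \frac1{n+1}\right) \tag{1}$$
and
$$\bar l'(n) = \frac{n+1}{n}\,\bar l(n) - 1. \tag{2}$$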
PROOF OF (1). Let each tree $T^i$ have search subtrees $T^i_\sigma$. For each integer
$t$, $0 \le t < n$, let $\mathcal{P}_t$ be the set of all sequences $R^i$ such that $R^i_0$ has length $t$. For
each sequence $R^i$ in $\mathcal{P}_t$, the first element of $R^i$ is $t+1$; and the set of elements
in $R^i_0$ is the set of the first $t$ positive integers. Therefore, $R^i_0$ is one of $t!$ sequences.
Let $P$ and $Q$ be any two of these sequences. Let $m_P$ and $m_Q$ denote the number
of $R^i$ in $\mathcal{P}_t$ such that $R^i_0 = P$ and $R^i_0 = Q$, respectively. It is clear that $m_P = m_Q$.
Since $\mathcal{P}_t$ has $(n-1)!$ members, it follows that $m_P = (n-1)!/t!$. Now
$T^i_0$ is the sequence search tree for $R^i_0$. Hence,
$$\sum_{R^i \in \mathcal{P}_t} l(T^i_0) = \frac{(n-1)!}{t!}\, V(t).$$
Hence,
$$\sum_{i=1}^{n!} l(T^i_0) = \sum_{t=0}^{n-1} \frac{(n-1)!}{t!}\, V(t).$$
Symmetry requires that
$$\sum_{i=1}^{n!} l(T^i_0) = \sum_{i=1}^{n!} l(T^i_1).$$
For each $T^i$,
$$l(T^i) = l(T^i_0) + l(T^i_1) + n + 1.$$
It now follows that
$$V(n) = (n+1)! + 2(n-1)! \sum_{t=0}^{n-1} \frac{V(t)}{t!}.$$
$$\bar l(n) = \frac{2}{n+1} + \bar l(n-1),$$
from which (1) is easily obtained.
PROOF OF (2). To prove this, we need only note that if a proper node of
$T^i$ belongs to $m$ open search paths in $T^i$, then it belongs to $m - 1$ internal search
paths of $T^i$. Hence, for all $i$, $l'(T^i) = l(T^i) - n$. From this (2) is easily obtained.
Q.E.D.
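Theorem 1 can be checked by brute force for small n. The sketch below is an illustration, not part of the original argument. It assumes that the sequence search tree of a sequence is the tree obtained by inserting its terms one at a time, so that the first term is the root, and that the length of a path is the number of proper nodes on it. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def path_length_sums(seq):
    """Return (l, l_prime) for the sequence search tree of seq.

    l is the summed length of all open search paths and l_prime the summed
    length of all internal search paths, where the length of a path is the
    number of proper nodes on it.
    """
    if not seq:
        return 0, 0  # a lone blank node: one open path of length 0
    root = seq[0]
    left = [v for v in seq[1:] if v < root]
    right = [v for v in seq[1:] if v > root]
    l0, lp0 = path_length_sums(left)
    l1, lp1 = path_length_sums(right)
    n = len(seq)
    # The root lies on all n + 1 open paths and on all n internal paths.
    return l0 + l1 + n + 1, lp0 + lp1 + n


def mean_lengths(n):
    """Exact mean open and internal search lengths over all n! sequences."""
    total_open = total_internal = 0
    for perm in permutations(range(1, n + 1)):
        l, lp = path_length_sums(list(perm))
        total_open += l
        total_internal += lp
    return (Fraction(total_open, factorial(n + 1)),
            Fraction(total_internal, n * factorial(n)))


def theorem1_open(n):
    """Right-hand side of (1): 2(1/2 + 1/3 + ... + 1/(n+1))."""
    return 2 * sum(Fraction(1, k) for k in range(2, n + 2))
```

Exact rational arithmetic is used so that the comparison with (1) is an equality test rather than a floating-point approximation; the enumeration is only practical for small n.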
Note that with the use of $\int (dx/x)$, bounds can be put on $\bar l(n)$ thus:
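One way such bounds can be obtained is sketched here. This is an illustrative derivation that assumes only equation (1); it is not necessarily the form of the bounds given in the original. Each term $1/k$ is compared with the integral of $1/x$ over the unit interval on either side of $k$.

```latex
% Since 1/x is decreasing, for each integer k >= 2:
%   \int_k^{k+1} dx/x  <  1/k  <  \int_{k-1}^{k} dx/x .
% Summing over k = 2, ..., n+1 gives
\ln\frac{n+2}{2} = \int_{2}^{n+2}\frac{dx}{x}
  \;<\; \sum_{k=2}^{n+1}\frac{1}{k}
  \;<\; \int_{1}^{n+1}\frac{dx}{x} = \ln(n+1),
% and therefore, by (1),
2\ln\frac{n+2}{2} \;<\; \bar l(n) \;<\; 2\ln(n+1).
```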
If $s_0^i = q_1, q_2, \ldots, q_m$ then $(q_j, q_{j+1})$ is a left link of $T^i$ for $1 \le j < m$. Now
it can be easily verified that the search paths $t^{iy}$ are all the open search paths
in the trees $T^i$ excepting the search paths $s_0^i$. Let $V_0(n)$ denote the sum of the
search lengths of the paths $s_0^i$. Let $\bar l_j(n)$ be defined by
$$\bar l_j(n) = \frac{V(n) - V_0(n)}{n \cdot n!}.$$
Since we are considering $n \cdot n!$ deletion trees, $\bar l_j(n)$ represents the mean search
length of $t^{iy}$. $V_0(n)$ is found by a method exactly similar to that used for $V(n)$
in the previous section. The result is $V_0(n) = n!\,(1 + \frac12 + \cdots + \frac1n)$.
(Thus $V_0(n)/n!$, which represents the mean search length of $s_0$, is only about
half as large as $\bar l(n)$.) Hence,
$$\bar l_j(n) = \bar l(n) + \frac1n\left(\frac12 + \frac13 + \cdots + \frac1n - \frac{n-1}{n+1}\right). \tag{4}$$
Hence, $\bar l_j(n)$ is approximately equal to $\bar l(n)$.
We now consider the search properties of the search trees $D^{iy}$.
There are $n \cdot n!$ pairs $iy$, and each corresponding $D^{iy}$ has $n$ open search paths
and $n - 1$ internal search paths. Hence, for the trees $D^{iy}$ we define the mean
open search length $\bar l_D(n)$ and the mean internal search length $\bar l'_D(n)$ as follows.
$$\bar l_D(n) = \frac{1}{n^2 \cdot n!} \sum_{i=1}^{n!} \sum_{y \in S} l(D^{iy})$$
and
n!
Z(D = X: z(J)'),
iyinA i=1 yvS
n[
z'(z/") = 2z'(D").
Substitution of these values into the defining equations for $\bar l_D(n)$ and $\bar l'_D(n)$
will then yield (5) and (6).
Therefore, to prove the theorem, it only remains to find the sequences $R^{iy}$.
Let $P$ be any sequence of length $m$, $m > 0$, and containing $m$ distinct numbers.
Let $y$ be any term of $P$. For each such pair $(P, y)$ we define the sequences
$d(P, y)$ and $g(P, y)$ as follows.
Let $P = w_1, w_2, \ldots, w_m$. Let $y = w_j$ where $1 \le j \le m$. If there exists a
number in $P$ which is greater than $y$, let $w_k$ be equal to the smallest such number.
Let $d(P, y)$ be a sequence of length $m - 1$ defined as follows.
(i) If $y$ is the greatest number in $P$, or if $y$ is not the greatest number in $P$ and
$j > k$, then $d(P, y)$ is a subsequence of $P$ and contains every term of
$P$ except $w_j$.
(ii) If $y$ is not the greatest number in $P$ and $j < k$, then let $Q'$ be a subsequence
of $P$ containing every term of $P$ except $w_k$. Replace the $j$th term of
$Q'$ with $w_k$. The sequence so obtained is $d(P, y)$.
It can be verified easily that if $S$ is the set of numbers in $P$, and if $T$ is the
sequence search tree for $P$, then $D(T, S, y)$ is the sequence search tree for
$d(P, y)$.
$g(P, y)$ is defined to be the sequence of ranks for $d(P, y)$. That is, $g(P, y) = r[d(P, y)]$.
Clearly, $D(T, S, y)$ is the sequence search tree for $g(P, y)$.
Now, let $R^{iy} = g(R^i, y)$. It follows from the foregoing that $D^{iy}$ is the sequence
search tree for $R^{iy}$.
Now, let $Q$ be any sequence equal to at least one of the sequences $R^{iy}$. That
is, $Q$ is any sequence of the first $n - 1$ positive integers. That $Q = R^{iy}$ for $n^2$
distinct pairs $iy$ will be proved by induction on $n$. The case $n = 1$ is trivial.
The case $n = 3$ shows the nature of the problem and is exhibited in Table I.
TABLE I
Sequences $R^{iy}$ for $n = 3$
 i    R^i    R^{i1}   R^{i2}   R^{i3}
 1    123    12       12       12
 2    132    12       12       12
 3    213    12       21       21
 4    231    12       21       21
 5    312    21       21       12
 6    321    21       21       21
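The definitions of d(P, y) and g(P, y) translate directly into code, which makes it possible to regenerate Table I and to count how often each sequence Q occurs for small n. The following is a minimal sketch; the helper names `ranks`, `table` and `multiplicities` are hypothetical, and indices are 0-based where the text uses 1-based positions.

```python
from collections import Counter
from itertools import permutations


def d(P, y):
    """The sequence d(P, y) for a sequence P of distinct numbers containing y."""
    P = list(P)
    j = P.index(y)
    larger = [w for w in P if w > y]
    if not larger:                    # case (i): y is the greatest number in P
        return P[:j] + P[j + 1:]
    wk = min(larger)                  # smallest number in P greater than y
    k = P.index(wk)
    if j > k:                         # case (i): w_k precedes y
        return P[:j] + P[j + 1:]
    Q = P[:k] + P[k + 1:]             # case (ii): drop w_k ...
    Q[j] = wk                         # ... and put it in y's position
    return Q


def ranks(seq):
    """Sequence of ranks: replace each term by its 1-based rank in seq."""
    order = sorted(seq)
    return [order.index(v) + 1 for v in seq]


def g(P, y):
    return ranks(d(P, y))


def table(n):
    """Map each sequence R^i to the list [R^{i1}, ..., R^{in}]."""
    return {perm: [tuple(g(perm, y)) for y in range(1, n + 1)]
            for perm in permutations(range(1, n + 1))}


def multiplicities(n):
    """Count, for each Q, the number of pairs iy with R^{iy} = Q."""
    return Counter(q for row in table(n).values() for q in row)
```

For n = 3 the rows of `table(3)` agree with Table I, and `multiplicities(3)` assigns the count 9 = n² to each of the two sequences of length 2.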
2. APPLICATIONS
2.1 Searching
Computer programmers are familiar with the fact that lists which can be
searched efficiently tend to be difficult to change efficiently, and vice versa.
Some lists are never changed, and some are never searched. If a list needs to be
both searched and changed, the conflict can often be resolved by devoting more
storage to the list. But when not enough storage is available to resolve the conflict,
a problem arises. This problem is to effect, with the limited storage, a
workable compromise between a search-oriented list design and a change-oriented
list design. In the next two sections we consider two well-known kinds of lists
which illustrate this conflict. In Section 2.1.3 we consider a list which offers, at
a reasonable cost in storage, a compromise between searching and changing;
and we apply to this list the main results of Section 1.
2.1.1 Sequence Lists
The classic example of a search-oriented list is the following. Let $z_1, z_2, \ldots, z_n$
be an ascending sequence of length $n$. For each $i$, $1 \le i \le n$, let $z_i$ occupy
location $A + bi$ in a computer. The well-known binary search algorithm is applicable
to the list. This algorithm makes its first comparison with the number
in location $A + b[n/2]$ (or thereabouts), thus eliminating from the search one of
the sequences $\{z_i\}$, $b \le i < b[n/2]$, or $\{z_i\}$, $b[n/2] < i \le bn$. The algorithm
then applies itself recursively to the remaining sequence.
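The binary search described above can be sketched as follows. This is an illustrative version that works on a Python list rather than on machine locations $A + bi$, and it counts one comparison per probe (treating the three-way test against the probed number as a single comparison, which is an assumption about how the search length is counted).

```python
def binary_search(z, y):
    """Search the ascending list z for y.

    Returns (found, comparisons), where comparisons is the number of
    elements of z that y was compared with.
    """
    lo, hi = 0, len(z) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2          # the middle location, "or thereabouts"
        comparisons += 1
        if y == z[mid]:
            return True, comparisons
        if y < z[mid]:
            hi = mid - 1              # discard the upper subsequence
        else:
            lo = mid + 1              # discard the lower subsequence
    return False, comparisons
```

With a list of 1000 numbers no search makes more than 10 probes, in line with a search length of about $\log_2 n$.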
The search length, defined as the number of comparisons in a search, has an
expected value of approximately $\log_2 n$. However, to make an insertion or a
deletion in the list entails a lot of work. For, on the average, assuming random
choice of the number to be inserted or deleted, $n/2$ numbers must be given new
memory locations.
Hence, a sequence list is search-oriented but is not change-oriented.
Note that there is a binary search tree associated with the binary search
algorithm. Each location corresponds to a proper node of the search tree. The
tree has the property that it is balanced. That is, letting $m_\sigma$ denote for each $\sigma$
the number of nodes in $T_\sigma$, it is true for all $\sigma$ that $|m_{\sigma 0} - m_{\sigma 1}| \le 1$.
2.1.2 Singly Linked Lists
The leading example of a change-oriented list is the singly linked or "pushdown"
list [3, 4]. Let $z_1, z_2, \ldots, z_n$ be a sequence. Let a singly linked list consist
of a number $l_0$ together with a set $\{d_i \mid 1 \le i \le n\}$ of ordered pairs $d_i = (z_i, l_i)$.
$l_i$, $0 \le i < n$, is the location³ of $d_{i+1}$. $l_n$ is any number not belonging to the
set of possible locations.
Clearly, an insertion or deletion is accomplished in a singly linked list simply
by changing one of the $l_i$ and, for an insertion, adding a new pair to the set of
pairs. Note, however, that searching in a singly linked list is a long process.
Expected search time is proportional to the number $n$ of entries in the list.
3 The term location as used here is applicable to any set of memory components to which
the program has ready access. Thus $z_i$ and $l_i$, as well as $d_i$, have locations.
TABLE II
A Singly Linked List
 Location i    z_i    l_i
 5             200    6
 10            600    28
 14            100    10
 17            400    23
 23            500    5
 28            300    17
 l_0 = 14
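The behaviour of a singly linked list can be illustrated with the entries of Table II. In this sketch the list is held in a Python dictionary mapping each location to its pair; the link value 6 in the last pair is taken as printed in this copy and serves only as a number that is not a location. The function names and the new location 40 used in the test are hypothetical.

```python
# Location -> (number stored there, location of the next pair), from Table II.
TABLE_II = {5: (200, 6), 10: (600, 28), 14: (100, 10),
            17: (400, 23), 23: (500, 5), 28: (300, 17)}
L0 = 14  # location of the first pair


def traverse(cells, head):
    """Return the stored numbers in list order."""
    out, loc = [], head
    while loc in cells:               # stop at any number that is not a location
        value, nxt = cells[loc]
        out.append(value)
        loc = nxt
    return out


def search(cells, head, y):
    """Linear search; returns (found, number of comparisons)."""
    comparisons, loc = 0, head
    while loc in cells:
        value, nxt = cells[loc]
        comparisons += 1
        if value == y:
            return True, comparisons
        loc = nxt
    return False, comparisons


def insert_after(cells, loc, new_loc, value):
    """Insert value at new_loc after the pair at loc: one existing link changes."""
    old_value, old_next = cells[loc]
    cells[new_loc] = (value, old_next)
    cells[loc] = (old_value, new_loc)


def delete_after(cells, loc):
    """Delete the pair following the one at loc: one existing link changes."""
    value, victim = cells[loc]
    _, after = cells.pop(victim)
    cells[loc] = (value, after)
```

Insertion and deletion each touch a single existing link regardless of the length of the list, while `search` may have to follow every link.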
TABLE III
 Location i    x_i    l_i    r_i
 5             600    28     φ
 10            300    φ      φ
 14            500    φ      φ
 17            200    23     5
 23            100    φ      φ
 28            400    10     14
 r_0 = 17
FIG. 1. The binary search tree corresponding to the list of Table III. Circles denote
proper nodes, rectangles denote blank nodes. The numbers are the $f_{ST}(p_\sigma)$. The tree is
the sequence search tree for the sequence (200, 600, 400, 100, 500, 300).
procedure search (y) doubly linked list: (x, l, r) main output: (z) secondary outputs: (i, j, w);
real array x; integer array l, r; real y;
boolean z; integer w, i, j;
comment z is "true" if and only if y is in x. The secondary outputs are for the use of
the "insert" and "delete" procedures defined below;
begin i := r[0]; w := 1; j := φ;
S1: if i = φ then z := false
    else if y = x[i] then z := true
    else begin j := i; if y > x[i] then begin i := r[i]; w := 1 end
                       else begin i := l[i]; w := 0 end;
         go to S1 end
end search
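The search procedure can be exercised on a doubly linked list such as the one tabulated before Fig. 1. The sketch below is a Python transliteration under stated assumptions: the arrays x, l, r are dictionaries keyed by location, the blank marker φ is represented by 0 (the constant `PHI` is a choice made here, not something the text specifies), and the outputs z, i, j, w are returned as a tuple. The table values are those of the list whose tree is the sequence search tree for (200, 600, 400, 100, 500, 300).

```python
PHI = 0  # assumed representation of the blank marker

# Location -> number, left link, right link; r[0] holds the root location.
x = {5: 600, 10: 300, 14: 500, 17: 200, 23: 100, 28: 400}
l = {5: 28, 10: PHI, 14: PHI, 17: 23, 23: PHI, 28: 10}
r = {0: 17, 5: PHI, 10: PHI, 14: PHI, 17: 5, 23: PHI, 28: 14}


def search(y, x, l, r):
    """Return (z, i, j, w) as in the ALGOL procedure.

    z: True if and only if y is in the list.
    i: location of y if found, else PHI.
    j: location of the last node visited before i (PHI if none).
    w: 1 if the last link followed was a right link, 0 if a left link.
    """
    i, w, j = r[0], 1, PHI
    while i != PHI:
        if y == x[i]:
            return True, i, j, w
        j = i
        if y > x[i]:
            i, w = r[i], 1
        else:
            i, w = l[i], 0
    return False, i, j, w
```

For example, `search(300, x, l, r)` visits locations 17, 5, 28 and 10 in turn, and an unsuccessful search for 450 ends at the blank left link of location 14.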
Note that if the arrays x, l, r constitute a doubly linked list at the beginning
of the execution of "insert" then they will do so at the end. Observe that the
insertion is accomplished by changing four numbers in the list proper and one
in the available storage list. Thus the only portion of the process which depends
on the size of the list is that performed by the search algorithm.
Let $y_1, \ldots, y_n$ be a sequence of $n$ distinct numbers chosen and ordered randomly.
Let $r_0 = φ$ initially (that is, let the list be essentially empty), and let
"insert ($y_i$, x, l, r, k)" be performed for $i = 1, 2, \ldots, n$. The resulting list
will have associated with it the sequence search tree for $y_1, \ldots, y_n$. Therefore,
Theorem 1 gives the expected number of comparisons in a search. That is, the
expected number $t$ of comparisons when $y$ is not found is $t = \bar l(n)$, and the expected
number $t'$ when $y$ is found is $t' = \bar l'(n)$. The overall expected number of
comparisons depends on the probability that $y$ will be found.
Consider now the following deletion algorithm.
procedure delete (y) doubly linked list: (x, l, r) next available location: (k);
begin integer h, i, j, w; boolean z;
  search (y, x, l, r, z, i, j, w);
  if z then
  begin if r[i] = φ then
    begin h := j; j := i; if w = 1 then r[h] := l[i] else l[h] := l[i] end else
    begin j := r[i]; if l[j] = φ then
      begin x[i] := x[j]; r[i] := r[j]; go to C end;
      B: h := j; j := l[j]; if l[j] ≠ φ then go to B; x[i] := x[j]; l[h] := r[j]
    end;
    C: r[j] := k; comment see "insert" comment; k := j
  end
end delete
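The deletion procedure can be transliterated and checked against the sequence operation d(P, y) of Section 1.3. The sketch below makes several assumptions that are not in the text: φ is represented by 0, the arrays are Python dictionaries, the hypothetical helper `build` stands in for repeated calls of "insert" (placing the t-th inserted number at location t), and the available-storage bookkeeping (k) is replaced by returning the freed location. The helper `shape` is likewise hypothetical and returns the tree as nested tuples.

```python
PHI = 0  # assumed representation of the blank marker


def search(y, x, l, r):
    i, w, j = r[0], 1, PHI
    while i != PHI:
        if y == x[i]:
            return True, i, j, w
        j = i
        i, w = (r[i], 1) if y > x[i] else (l[i], 0)
    return False, i, j, w


def build(seq):
    """Insert the terms of seq in order; location t holds the t-th term."""
    x, l, r = {}, {}, {0: PHI}
    for loc, y in enumerate(seq, start=1):
        _, _, j, w = search(y, x, l, r)
        x[loc], l[loc], r[loc] = y, PHI, PHI
        if w == 1:
            r[j] = loc        # also sets the root r[0] when the list is empty
        else:
            l[j] = loc
    return x, l, r


def delete(y, x, l, r):
    """Delete y if present; return the freed location (or None)."""
    z, i, j, w = search(y, x, l, r)
    if not z:
        return None
    if r[i] == PHI:                   # no right subtree: splice in the left one
        h, j = j, i
        if w == 1:
            r[h] = l[i]
        else:
            l[h] = l[i]
    else:
        j = r[i]
        if l[j] == PHI:               # right child is the successor
            x[i], r[i] = x[j], r[j]
        else:
            while True:               # label B: walk down left links
                h, j = j, l[j]
                if l[j] == PHI:
                    break
            x[i], l[h] = x[j], r[j]
    return j


def shape(x, l, r, i=None):
    """Nested-tuple view (number, left, right) of the tree rooted at i."""
    i = r[0] if i is None else i
    if i == PHI:
        return None
    return (x[i], shape(x, l, r, l[i]), shape(x, l, r, r[i]))


def d(P, y):
    """The sequence d(P, y) of Section 1.3 (0-based indices)."""
    P = list(P)
    j = P.index(y)
    larger = [w for w in P if w > y]
    if not larger or P.index(min(larger)) < j:
        return P[:j] + P[j + 1:]
    k = P.index(min(larger))
    Q = P[:k] + P[k + 1:]
    Q[j] = P[k]
    return Q
```

Under these assumptions, the tree left by `delete` has the same shape and contents as the tree built from d(P, y), which is the property the text asserts for D(T, S, y).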
It is directly verifiable that the list resulting from "delete" has associated
with it the tree D(T, S, y) of Section 1.3. Therefore, if y is selected at random
from among the xi then Theorem 2 is applicable. That is, the functional rela-
tionship between the expected search lengths t, t' and the number n of entries
in the list, is preserved.
In general, therefore, we assert that, starting with an empty list and performing
$m_1$ random insertions and $m_2$ random deletions in any order, if $m_1 - m_2 = n$
then
$$t = 2\left(\frac12 + \frac13 + \cdots + \frac1{n+1}\right)$$
and
$$t' = \frac{n+1}{n}\,t - 1.$$
below. First let us define the present algorithm more precisely by means of
ALGOL 60 as follows.
procedure P(a, n); value n; array a; integer n;
begin integer e, h, i, j, k; integer array f, g[1 : log2 n]; real d;
   e := 1; h := n; k := 1;
B: if e ≥ h then go to C; d := a[e]; i := e; j := h;
E: if a[j] < d then
      begin a[i] := a[j]; i := i + 1; if i = j then go to D end
   else begin j := j - 1; if i = j then go to D; go to E end;
F: if a[i] > d then
      begin a[j] := a[i]; j := j - 1; if i = j then go to D; go to E end
   else begin i := i + 1; if i = j then go to D; go to F end;
D: a[i] := d;
   if i - e < h - i then
      begin f[k] := i + 1; g[k] := h; h := i - 1 end
   else begin f[k] := e; g[k] := i - 1; e := i + 1 end;
   k := k + 1; go to B;
C: k := k - 1; if k > 0 then begin e := f[k]; h := g[k]; go to B end
end
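A structured rendering of procedure P may help in following the labels. The sketch below is a Python transliteration under stated assumptions: indices are 0-based, the arrays f and g become a single list used as a stack, the labels E and F become a two-state loop, and the function returns the number of comparisons and the largest number of stacked ranges so that the claims about them can be checked. The name `sort_p` is hypothetical.

```python
def sort_p(a):
    """Sort the list a in place, following the structure of procedure P.

    Returns (comparisons, max_depth): the number of comparisons of an element
    with the pivot d, and the largest number of ranges held on the stack.
    """
    stack = []                        # plays the role of the arrays f, g
    comparisons = max_depth = 0
    e, h = 0, len(a) - 1
    while True:
        while e < h:                  # label B
            d, i, j = a[e], e, h      # d is the pivot; a[i] is a free slot
            from_right = True         # True: at label E, False: at label F
            while i != j:
                comparisons += 1
                if from_right:
                    if a[j] < d:
                        a[i] = a[j]
                        i += 1
                        from_right = False
                    else:
                        j -= 1
                else:
                    if a[i] > d:
                        a[j] = a[i]
                        j -= 1
                        from_right = True
                    else:
                        i += 1
            a[i] = d                  # label D: the pivot reaches its place
            if i - e < h - i:         # stack the larger part, continue with
                stack.append((i + 1, h))   # the smaller one
                h = i - 1
            else:
                stack.append((e, i - 1))
                e = i + 1
            max_depth = max(max_depth, len(stack))
        if not stack:                 # label C
            break
        e, h = stack.pop()
    return comparisons, max_depth
```

Each partition of m elements uses m - 1 comparisons, so an input that is already ascending or descending costs n(n - 1)/2 comparisons, while the stack never holds more than about $\log_2 n$ ranges because the smaller part is always processed first.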
From the following considerations, it can easily be verified that each execution
of this algorithm has a search tree associated with it. Let the execution of
the algorithm start at time 0. Let $A(t) = a_1(t), \ldots, a_n(t)$ denote the value of
$A$ at any time $t$. Let $\alpha$ denote the time of the completion of the first execution of
statement D. At time $\alpha$, $a_1(0)$ has been compared with every other number in
$A$, and $a_i(\alpha) = a_1(0)$. Moreover, $a_m(\alpha) < a_1(0)$ for all $m < i$, and $a_m(\alpha) > a_1(0)$
for all $m > i$. That is, $a_1(0)$ occupies its proper position in the sequence at
time $\alpha$, and therefore undergoes no more comparisons. Clearly, $a_1(0)$ thus corresponds
to the root of the search tree. The algorithm is then applied recursively
to the sequences $a_1(\alpha), \ldots, a_{i-1}(\alpha)$ and $a_{i+1}(\alpha), \ldots, a_n(\alpha)$ (not necessarily
in that order). $a_1(\alpha)$ will therefore correspond to the node $p_0$ of the search tree
and $a_{i+1}(\alpha)$ will correspond to the node $p_1$. Or, in terms of the list function $f$
for the search tree and the set of elements in $A$, we have $f(p) = a_1(0)$, $f(p_0) = a_1(\alpha)$
and $f(p_1) = a_{i+1}(\alpha)$. Continuing in this way will, clearly, define a list
function, and therefore a search tree.
It can also be verified that if $A$ is a sequence of the first $n$ integers, then the
$n!$ possible values of $A(0)$ correspond in a 1-to-1 way with the sequences $R^i$
of Section 1.3. That is, if $T$ is the search tree corresponding to the execution of
the algorithm for $A(0)$, then the sequence $R^i$ corresponding to $A(0)$ is such that
$T$ is the sequence search tree for $R^i$.
It follows that (7) gives the expected number of comparisons in an execution
of the algorithm for a randomly chosen $A(0)$.
Similar considerations show that (7) is also applicable to Quicksort, thus
verifying Hoare's reported average of $2n \log_e n$ comparisons.
The only storage requirement, beyond that necessary for the program itself
and for the sequence $A$, is the storage necessary for the numbers $f_k$, $g_k$. The
maximum value of $k$ during the algorithm does not exceed $\log_2 n$. This is ensured
because, at each completion of statement D, the algorithm chooses the
smaller of the two subsequences open to it.
The numbers $f_k$, $g_k$ do not appear in Quicksort because of the recursive way
in which that algorithm is defined. But it is clear that at each new level of recursion
some information about the previous level must be retained; that is, the
$f_k$, $g_k$ are needed by Quicksort too. Moreover, $k$ is there allowed to attain the
value $n$; thus Quicksort does not have property (ii) above. It can, of course, be
given that property by the same means used in the present algorithm.
Equation (7) holds for the present algorithm if the sequence of ranks is
random. Since this implies nothing about the distribution of the numbers in
the sequence, property (iii) holds. If the sequence of ranks is not random then
the algorithm might be slowed down considerably. The worst case, $n(n-1)/2$
comparisons, occurs with either an ascending or a descending sequence.
Quicksort's worst case is also $n(n-1)/2$ comparisons, but the question
of which sequence gives rise to it depends on the nature of the "random" choice
of an element from the subsequence $a_e, \ldots, a_h$. Although this choice requires
an amount of extra work proportional to $n$, it has an advantage if the sequence
of ranks is biased. For, even though it is possible for a bias toward Quicksort's
worst case to exist (since nothing is really random on a computer), this kind of
bias seems much less likely to exist than a bias toward ascending or descending
order.
If the present algorithm were preceded by a "random" scrambling of the sequence
(a process with duration proportional to $n$) it would be equivalent to
Quicksort.⁵
A search of the literature shows that Shell's method [5, 6] is the fastest sorting
algorithm having the property of being insensitive to distribution and having a
storage requirement comparable to that of the present algorithm. The expected
number of comparisons for Shell's method, given a randomly ordered sequence,
is given approximately⁶ by $n(.194(\log_2 n)^2 - .77 \log_2 n + 8.4)$. Hence the
present algorithm is, for large enough $n$, superior to Shell's method. It will be
superior to the radix exchange method [7] only in cases of uneven distribution.
REFERENCES
1. HUFFMAN, D. A. A method for the construction of minimum redundancy codes. Proc.
I.R.E. 40 (1952), 1098-1101.
2. BURGE, W. H. Sorting, trees and measures of order. Informat. Contr. 1 (1958), 181-197.
3. CARR, JOHN W., III. Recursive subscripting compilers and list-type memories. Comm.
ACM 2 (1959), 4-6.
4. NEWELL, A., AND SHAW, J. C. Programming the logic theory machine. Proc. Western
Joint Comput. Conf. (1957), 230-240.
5. SHELL, D. L. A high speed sorting procedure. Comm. ACM 2, No. 7 (1959), 30-32.
6. FRANK, R. M., AND LAZARUS, R. B. A high speed sorting procedure. Comm. ACM 3
(1960), 20-22.
7. HILDEBRANDT, P., AND ISBITZ, H. Radix exchange--an internal sorting method for
digital computers. J. ACM 6 (1959), 156-163.
8. NAUR, P. (ed.); BACKUS, J. W., ET AL. Report on the algorithmic language ALGOL
60. Comm. ACM 3 (1960), 299-314.
9. HOARE, C. A. R. Algorithms 63 and 64. Comm. ACM 4 (1961), 321.
5 Except when equal elements are allowed, in which case a slight modification of the
present algorithm is desirable.
6 Experimentally determined by the author on an IBM 709 computer; holds for n >= 200.