Some Combinatorial Properties of Certain Trees With Applications to Searching and Sorting*

THOMAS N. HIBBARD
System Development Corporation, Santa Monica, California

INTRODUCTION
This paper introduces an abstract entity, the binary search tree, and exhibits
some of its properties. The properties exhibited are relevant to processes occur-
ring in stored program computers--in particular, to search processes. The dis-
cussion of this relevance is deferred until Section 2.
Section 1 constitutes the body of the paper. Section 1.1 consists of some mathe-
matical formulations which arise in a natural way from the somewhat less formal
considerations of Section 2.1. The main results are Theorem 1 (Section 1.2) and
Theorem 2 (Section 1.3).
The initial motivation of the paper was an actual computer programming
problem. This problem was the need for a list which could be searched efficiently
and also changed efficiently. Section 2.1 contains a description of this problem
and explains the relevance to its solution of the results of Section 1.
Section 2.2 contains an application to sorting.
The reader who is interested in the programming applications of the results
but not in their mathematical content can profit by reading Section 2 and making
only those few references to Section 1 which he finds necessary.

1. COMBINATORIAL PROPERTIES OF BINARY SEARCH TREES

1.1 Preliminary Definitions


The central notion of this paper is that of a certain type of directed graph
undergoing "random" operations. The present section is given to defining the
graph and certain quantities associated with the graph. "Expected" values of
these quantities will be defined combinatorially in Section 1.2; these "expected"
values will then be calculated assuming "random insertions" (Section 1.2) and
"random deletions" (Section 1.3).
DEFINITION. A binary search tree is a directed graph¹ having the following
properties.
(i) There is one and only one node, called the root, such that for any node
$p$ there exists one and only one path which begins with the root and
ends with $p$.
(ii) For each node $p$, the number of links beginning with $p$ is either two or
zero. If the number is two, then $p$ is said to be a proper node. If the
number is zero, then $p$ is said to be a blank node.
(iii) The set of links is partitioned into two sets $L$ and $R$. Each link belonging
to $L$ is called a left link. Each link belonging to $R$ is called a right link.
(iv) For each proper node $p$, there is exactly one left link beginning with $p$
and exactly one right link beginning with $p$.
* Received March, 1961; revised July, 1961.
¹ A directed graph is a set of points together with a set of ordered pairs of these points.
The points are called nodes and the pairs are called links. A path in a directed graph is a
sequence $(p_1, p_2, \ldots, p_n)$ of nodes such that, for $1 \le i < n$, $(p_i, p_{i+1})$ is a link belonging
to the directed graph. (Thus each link is a path of length 2.) Any path $(p, \ldots, q)$ is said
to begin with $p$ and end with $q$.
For brevity, a search tree will here be defined as a binary search tree.
The following notational devices will be employed for search trees.
The length of a path is the number of proper nodes in the path.
To each node of any search tree we attach a label $p_\sigma$, where $\sigma$ is a sequence
of 1's and 0's uniquely defined by the following. Let the root be called $p_\Lambda$ (where
$\Lambda$ denotes the null sequence).² If, for any $\sigma$, $p_\sigma$ is a proper node, then $p_{\sigma 0}$ and
$p_{\sigma 1}$ are such that $(p_\sigma, p_{\sigma 0})$ is a left link and $(p_\sigma, p_{\sigma 1})$ is a right link.
For a search tree $T$, the search subtree $T_\sigma$ is defined for each $\sigma$ as follows.
Every node of $T_\sigma$ belongs to $T$. A node $q$ of $T$ belongs to $T_\sigma$ if and only if $q$ belongs
to a path in $T$ beginning with $p_\sigma$. Every left (right) link of $T_\sigma$ is a left (right)
link of $T$. A left (right) link $(q, r)$ of $T$ is a left (right) link of $T_\sigma$ if and only
if both $q$ and $r$ belong to $T_\sigma$. (Hence, if $\sigma$ is such that $T$ contains no node $p_\sigma$,
then $T_\sigma$ is defined to be empty.)
Throughout this paper, the symbols $\sigma$, $\rho$, and $\tau$ will denote sequences of 1's
and 0's. Further, any symbol having $\sigma$, $\rho$, or $\tau$ as a subscript is to be understood
to be a function of $\sigma$, $\rho$, or $\tau$.
In the following definitions, let $T$ be a search tree of $n$ proper nodes and let
$S$ be a set of $n$ numbers.
The list function $f_{ST}$ is a one-one mapping of the proper nodes of $T$ onto the
elements of $S$, with the following property. If $q$ is a proper node of $T_{\sigma 0}$ then
$f_{ST}(q) < f_{ST}(p_\sigma)$, and if $q$ is a proper node of $T_{\sigma 1}$ then $f_{ST}(q) > f_{ST}(p_\sigma)$. (Thus,
given any two of $(T, S, f_{ST})$, the third is uniquely defined.) When $S$ and $T$ are
understood, $f$ will be used to denote $f_{ST}$.
For each $\sigma$, the subset $S_\sigma$ of $S$ is defined as follows. $y$ is in $S_\sigma$ if and only if
there exists a node $q$ of $T_\sigma$ such that $y = f_{ST}(q)$.
A path in $T_\sigma$ which begins with $p_\sigma$ is said to be a search path in $T_\sigma$. If a search
path ends with a proper node then it is said to be an internal search path. If a
search path ends with a blank node then it is said to be an open search path.
(Note that if any search tree $T$ has $n$ proper nodes then there exist $n+1$ open
search paths in $T$, one for each blank node. This is easily shown by an induction
on $n$.) The lengths of the search paths in $T_\Lambda$ are the subject of the next two
sections.

² The expressions $p_\Lambda$ and $p$ are equivalent. While the use of $\Lambda$ is never essential, it
sometimes increases clarity. Both notations will be used here.

1.2 Sequence Binary Search Trees


We will now formalize, in combinatorial terms, the notion of a "randomly
constructed" search tree and of the "expected length" of a search path in such
a search tree.
Notation. Throughout this and the following sections, the following notation
will be applicable to any sequence $R$. For each $\sigma$, a unique subsequence $R_\sigma$ of
$R$ is defined as follows. For each non-null $R_\sigma$, let $x_\sigma$ be the first term of $R_\sigma$.
If $\sigma$ is null then $R_\sigma = R$. Given $R_\sigma$ for any $\sigma$, then $R_{\sigma 0}$ and $R_{\sigma 1}$ are subsequences
of $R$ and are obtained as follows. Each term of $R_\sigma$, with the exception of $x_\sigma$,
belongs either to $R_{\sigma 0}$ or to $R_{\sigma 1}$. Every term of $R_{\sigma 0}$ is less than $x_\sigma$, and every term
of $R_{\sigma 1}$ is greater than $x_\sigma$.
DEFINITION. Let $R$ be a sequence of distinct numbers. Let $S$ be the set of
numbers in $R$. For each non-null $R_\sigma$, let $x_\sigma$ denote the first term of $R_\sigma$. The
sequence (binary) search tree for $R$ is the search tree $T$ such that $f_{ST}(p_\sigma) = x_\sigma$.
Observe that $S_\sigma$ is the set of numbers in $R_\sigma$, and that $T_\sigma$ is the sequence
search tree for $R_\sigma$.
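The definition above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Node`, `sequence_search_tree`, `open_lengths` and `internal_lengths` are hypothetical, and a blank node is represented by `None`. The example sequence is the one used later for Figure 1.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """A proper node; a missing child (None) stands for a blank node."""
    value: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def sequence_search_tree(seq: Sequence[float]) -> Optional[Node]:
    """Build the sequence search tree for a sequence of distinct numbers."""
    if not seq:
        return None  # empty R_sigma: p_sigma is a blank node
    first = seq[0]  # x_sigma, the first term of R_sigma
    smaller = [v for v in seq[1:] if v < first]  # R_sigma0
    larger = [v for v in seq[1:] if v > first]   # R_sigma1
    return Node(first, sequence_search_tree(smaller), sequence_search_tree(larger))


def open_lengths(tree: Optional[Node], depth: int = 0) -> list:
    """Lengths (numbers of proper nodes) of all open search paths."""
    if tree is None:
        return [depth]
    return open_lengths(tree.left, depth + 1) + open_lengths(tree.right, depth + 1)


def internal_lengths(tree: Optional[Node], depth: int = 0) -> list:
    """Lengths of all internal search paths, one per proper node."""
    if tree is None:
        return []
    return ([depth + 1]
            + internal_lengths(tree.left, depth + 1)
            + internal_lengths(tree.right, depth + 1))


tree = sequence_search_tree([200, 600, 400, 100, 500, 300])
print(len(open_lengths(tree)), sum(open_lengths(tree)), sum(internal_lengths(tree)))
```

With six proper nodes the tree has seven open search paths, in line with the remark that a tree of $n$ proper nodes has $n+1$ of them.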
Now the notion of a "randomly constructed" search tree is that of the
sequence search tree for a "random" sequence. The notion will be formalized
combinatorially by means of the family of sequence search trees associated
with the $n!$ permutations of a given set of $n$ numbers.
Notation. For each sequence $R$ the sequence of ranks for $R$, written $r(R)$, is
defined as follows. Let $R = y_1, y_2, \ldots, y_n$. For each $i$, $1 \le i \le n$, let $r_i$ be
the number of integers $k$, $1 \le k \le n$, such that $y_k \le y_i$. Now $r(R) = r_1, r_2, \ldots, r_n$.
Also, $r_i$ is said to be the rank of $y_i$ in $R$.
Clearly, for any two sequences $P$ and $Q$ such that $r(P) = r(Q)$, $P$ and $Q$
have the same sequence search tree.
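A minimal Python sketch of the sequence of ranks, with a check that a sequence and its rank sequence give search trees of the same shape. The function names `ranks` and `shape` are illustrative assumptions; `shape` records only the link structure as nested tuples.

```python
def ranks(seq):
    """r(R): the i-th entry is the number of terms y_k with y_k <= y_i."""
    return [sum(1 for other in seq if other <= value) for value in seq]


def shape(seq):
    """Shape of the sequence search tree for seq, as nested (left, right) tuples."""
    if not seq:
        return None  # blank node
    first = seq[0]
    return (shape([v for v in seq[1:] if v < first]),
            shape([v for v in seq[1:] if v > first]))


R = [200, 600, 400, 100, 500, 300]
print(ranks(R))
print(shape(R) == shape(ranks(R)))
```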
For a search tree $T$ having $n$ proper nodes, let $\{s_i \mid i = 1, 2, \ldots, n+1\}$ be
the set of $n+1$ distinct open search paths in $T$. Let $l_i$ denote the length of $s_i$.
The function $l(T)$ is now defined by
$$l(T) = \sum_{i=1}^{n+1} l_i .$$
Next let $\{s_i' \mid i = 1, 2, \ldots, n\}$ be the set of $n$ distinct internal search paths
in $T$, and let $l_i'$ be the length of $s_i'$. Let $l'(T)$ be defined by
$$l'(T) = \sum_{i=1}^{n} l_i' .$$
For a set $S$ of $n$ numbers, let $\Phi = \{R^i \mid i = 1, 2, \ldots, n!\}$ be the set of $n!$
sequences such that each sequence has length $n$ and contains every member of
$S$. For each $R^i$ in $\Phi$, let $T^i$ be the sequence search tree for $R^i$. Let
$$U(S) = \sum_{i=1}^{n!} l(T^i); \qquad U'(S) = \sum_{i=1}^{n!} l'(T^i).$$

Since the sequence search tree for the sequence of ranks $r(R^i)$ is identical to
the sequence search tree for $R^i$, it is clear that $U(S)$ and $U'(S)$ depend only
on $n$. Hence, let $S = \{1, 2, \ldots, n\}$ and let
$$V(n) = U(S); \qquad V'(n) = U'(S).$$
Let the mean open search length $\bar{l}(n)$ and the mean internal search length $\bar{l}'(n)$
now be defined as follows.
$$\bar{l}(n) = \frac{V(n)}{(n+1)!}, \qquad \bar{l}'(n) = \frac{V'(n)}{n \cdot n!}.$$
THEOREM 1. It is true for each $n$ that
$$\bar{l}(n) = 2\left(\frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n+1}\right) \tag{1}$$
and
$$\bar{l}'(n) = \frac{n+1}{n}\,\bar{l}(n) - 1. \tag{2}$$
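For small $n$ the theorem can be checked by brute force over all $n!$ permutations. The following Python sketch does this with exact rational arithmetic; the function names are illustrative assumptions and the enumeration is practical only for small $n$.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def open_path_sum(seq, depth=0):
    """l(T) for the sequence search tree of seq."""
    if not seq:
        return depth  # one open search path ends at this blank node
    first = seq[0]
    return (open_path_sum([v for v in seq[1:] if v < first], depth + 1)
            + open_path_sum([v for v in seq[1:] if v > first], depth + 1))


def internal_path_sum(seq, depth=0):
    """l'(T) for the sequence search tree of seq."""
    if not seq:
        return 0
    first = seq[0]
    return (depth + 1
            + internal_path_sum([v for v in seq[1:] if v < first], depth + 1)
            + internal_path_sum([v for v in seq[1:] if v > first], depth + 1))


def mean_lengths(n):
    """(mean open length, mean internal length) over all n! sequences."""
    perms = list(permutations(range(1, n + 1)))
    v = sum(open_path_sum(p) for p in perms)            # V(n)
    v_prime = sum(internal_path_sum(p) for p in perms)  # V'(n)
    return Fraction(v, factorial(n + 1)), Fraction(v_prime, n * factorial(n))


def theorem1_open(n):
    """Right-hand side of (1)."""
    return 2 * sum(Fraction(1, k) for k in range(2, n + 2))


for n in range(1, 6):
    print(n, mean_lengths(n), theorem1_open(n))
```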

PROOF OF (1). Let each tree $T^i$ have search subtrees $T^i_\sigma$. For each integer
$t$, $0 \le t < n$, let $\Phi_t$ be the set of all sequences $R^i$ such that $R^i_0$ has length $t$. For
each sequence $R^i$ in $\Phi_t$, the first element of $R^i$ is $t+1$; and the set of elements
in $R^i_0$ is the set of the first $t$ positive integers. Therefore, $R^i_0$ is one of $t!$ sequences.
Let $P$ and $Q$ be any two of these sequences. Let $m_P$ and $m_Q$ denote the number
of $R^i$ in $\Phi_t$ such that $R^i_0 = P$ and $R^i_0 = Q$, respectively. It is clear that $m_P = m_Q$.
Since $\Phi_t$ has $(n-1)!$ members, it follows that $m_P = (n-1)!/t!$. Now
$T^i_0$ is the sequence search tree for $R^i_0$. Hence,
$$\sum_{R^i \in \Phi_t} l(T^i_0) = \frac{(n-1)!}{t!}\, V(t).$$
Hence,
$$\sum_{i=1}^{n!} l(T^i_0) = \sum_{t=0}^{n-1} \frac{(n-1)!}{t!}\, V(t).$$
Symmetry requires that
$$\sum_{i=1}^{n!} l(T^i_0) = \sum_{i=1}^{n!} l(T^i_1).$$
For each $T^i$,
$$l(T^i) = l(T^i_0) + l(T^i_1) + n + 1.$$
It now follows that
$$V(n) = (n+1)! + 2(n-1)! \sum_{t=0}^{n-1} \frac{V(t)}{t!}.$$

Writing $V(n-1)$, solving for $\sum_{t=0}^{n-2} (V(t)/t!)$, substituting in $V(n)$ and
collecting terms gives
$$V(n) = 2\,n! + (n+1)\,V(n-1).$$
Since $\bar{l}(n) = V(n)/(n+1)!$ and $\bar{l}(n-1) = V(n-1)/n!$, this gives
$$\bar{l}(n) = \frac{2}{n+1} + \bar{l}(n-1),$$
from which (1) is easily obtained.
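The step from the summation form of $V(n)$ to the two-term recurrence can be checked numerically. The sketch below is illustrative only; it assumes $V(0) = 0$ (the empty tree has a single open search path of length 0), which the text does not state explicitly, and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def v_by_sum(limit):
    """V(0..limit) from V(n) = (n+1)! + 2(n-1)! * sum_{t<n} V(t)/t!."""
    v = [0]  # assumption: V(0) = 0
    for n in range(1, limit + 1):
        s = sum(Fraction(v[t], factorial(t)) for t in range(n))
        v.append(factorial(n + 1) + 2 * factorial(n - 1) * s)
    return v


def v_by_recurrence(limit):
    """V(0..limit) from V(n) = 2 n! + (n+1) V(n-1)."""
    v = [0]  # same assumption: V(0) = 0
    for n in range(1, limit + 1):
        v.append(2 * factorial(n) + (n + 1) * v[n - 1])
    return v


def mean_open_length(n):
    """V(n)/(n+1)!, computed from the two-term recurrence."""
    return Fraction(v_by_recurrence(n)[n], factorial(n + 1))


print(v_by_recurrence(5))
print(mean_open_length(5))
```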
PROOF OF (2). To prove this, we need only note that if a proper node of
$T^i$ belongs to $m$ open search paths in $T^i$, then it belongs to $m - 1$ internal search
paths of $T^i$. Hence, for all $i$, $l'(T^i) = l(T^i) - n$. From this (2) is easily
obtained. Q.E.D.
Note that with the use of $\int (dx/x)$, bounds can be put on $\bar{l}(n)$ thus:
$$2 \log_e\left(\frac{n}{2} + 1\right) < \bar{l}(n) < 2 \log_e (n+1) \approx 1.4 \log_2 (n+1). \tag{3}$$
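The bounds in (3) are easy to check numerically. In this sketch the lower bound is read as $2 \log_e(n/2 + 1)$, which is what comparing the sum in (1) with $\int_2^{n+2} dx/x$ gives; that reading, like the function name, is an assumption of the example.

```python
from math import log


def mean_open_length(n):
    """The sum in (1): 2 (1/2 + 1/3 + ... + 1/(n+1))."""
    return 2 * sum(1 / k for k in range(2, n + 2))


for n in (1, 10, 100, 1000, 10000):
    lower = 2 * log(n / 2 + 1)   # assumed reading of the lower bound in (3)
    upper = 2 * log(n + 1)
    print(n, round(lower, 3), round(mean_open_length(n), 3), round(upper, 3))

# 2 log_e 2 is about 1.386, hence the factor 1.4 in front of log_2 (n+1).
print(round(2 * log(2), 3))
```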

1.3 Deletion Trees


In this section the notion of a "random deletion" will be considered.
DEFINITION. Let $T$ be a search tree of $n$ proper nodes, $S$ a set of $n$ numbers,
and $y$ a member of $S$. The deletion search tree $D(T, S, y)$ is defined as follows.
Let $\tau$ be such that $f_{ST}(p_\tau) = y$. Let $S' = S - \{y\}$. Let the nodes of $D(T, S, y)$
be denoted by $p^D_\sigma$. For each $\sigma$, let $D_\sigma$ denote the search subtree of $D(T, S, y)$
having $p^D_\sigma$ as its root. Now $D(T, S, y)$ is given by (i) and (ii) below, viz:
(i) If $S_{\tau 1}$ is empty, then:
$D_\tau = T_{\tau 0}$;
for each node $p_\sigma$ not belonging to $T_\tau$, $p^D_\sigma = p_\sigma$.
(ii) If $S_{\tau 1}$ is not empty, then let $\rho$ be such that $f_{ST}(p_\rho)$ is the smallest member
of $S_{\tau 1}$. (Observe that $S_{\rho 0}$ must then be empty.) Now:
$D_\rho = T_{\rho 1}$;
for each node $p_\sigma$ not belonging to $T_\rho$, $p^D_\sigma = p_\sigma$.
Note the following. In case (i), for every proper node $q$ of $T$ such that $q \ne p_\tau$,
it is true that $q$ belongs to $D(T, S, y)$ and $f_{S'D}(q) = f_{ST}(q)$. In case (ii): for
every proper node $q$ of $T$ such that $q \ne p_\rho$, $q$ belongs to $D(T, S, y)$; and for
every proper node $q$ of $D(T, S, y)$ such that $q \ne p_\tau$, it is true that $f_{S'D}(q) =
f_{ST}(q)$; finally, $f_{S'D}(p_\tau) = f_{ST}(p_\rho)$.
For the trees $T^i$ of the previous section, let $D^{iy}$ denote, for each $R^i$ in $\Phi$ and
for each $y$ in $S$, the deletion tree $D(T^i, S, y)$.
Before considering the search properties of the trees $D^{iy}$, we consider briefly
the nature of the search paths $t^{iy}$, where $t^{iy}$ is defined as follows. Let $\tau$ and $\rho$ be
as in the definition of $D$. For case (i) of this definition (where $p_{\tau 1}$ is a blank
node of $T^i$), $t^{iy}$ is the search path $p_\Lambda \ldots p_{\tau 1}$. For case (ii) ($p_{\tau 1}$ a proper node),
$t^{iy}$ is $p_\Lambda \ldots p_{\rho 0}$. Let $s_0^i$ be an open search path, defined for each $R^i$ in $\Phi$ as follows.
If $s_0^i = q_1, q_2, \ldots, q_m$ then $(q_j, q_{j+1})$ is a left link of $T^i$ for $1 \le j < m$. Now
it can be easily verified that the search paths $t^{iy}$ are all the open search paths
in the trees $T^i$ excepting the search paths $s_0^i$. Let $V_0(n)$ denote the sum of the
search lengths of the paths $s_0^i$. Let $\bar{l}_d(n)$ be defined by
$$\bar{l}_d(n) = \frac{V(n) - V_0(n)}{n \cdot n!}.$$
Since we are considering $n \cdot n!$ deletion trees, $\bar{l}_d(n)$ represents the mean search
length of $t^{iy}$. $V_0(n)$ is found by a method exactly similar to that used for $V(n)$
in the previous section. The result is $V_0(n) = n!\,(1 + \tfrac{1}{2} + \cdots + (1/n))$.
(Thus $V_0(n)/n!$, which represents the mean search length of $s_0$, is only about
half as large as $\bar{l}(n)$.) Hence,
$$\bar{l}_d(n) = \bar{l}(n) + \frac{1}{n}\left(\frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n+1} - \frac{n}{n+1}\right). \tag{4}$$
Hence, $\bar{l}_d(n)$ is approximately equal to $\bar{l}(n)$.
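The quantities $V_0(n)$ and $\bar{l}_d(n)$ can be checked by enumeration for small $n$. In the sketch below, the closed form for $\bar{l}_d(n)$ is obtained by substituting (1) and the stated value of $V_0(n)$ into the definition of $\bar{l}_d(n)$; the function names are illustrative assumptions. The all-left-links path $s_0$ passes through exactly the terms of the sequence that are smaller than every earlier term.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def open_path_sum(seq, depth=0):
    """l(T) for the sequence search tree of seq."""
    if not seq:
        return depth
    first = seq[0]
    return (open_path_sum([v for v in seq[1:] if v < first], depth + 1)
            + open_path_sum([v for v in seq[1:] if v > first], depth + 1))


def leftmost_open_length(seq):
    """Length of s_0: count the terms smaller than all terms before them."""
    length, smallest = 0, None
    for v in seq:
        if smallest is None or v < smallest:
            smallest, length = v, length + 1
    return length


def harmonic(n):
    return sum(Fraction(1, k) for k in range(1, n + 1))


def v_total(n):
    return sum(open_path_sum(p) for p in permutations(range(1, n + 1)))


def v_zero(n):
    return sum(leftmost_open_length(p) for p in permutations(range(1, n + 1)))


def mean_deletion_search(n):
    """(V(n) - V_0(n)) / (n * n!), straight from the definition."""
    return Fraction(v_total(n) - v_zero(n), n * factorial(n))


def closed_form(n):
    """l(n) + (1/n)(1/2 + ... + 1/(n+1) - n/(n+1))."""
    tail = harmonic(n + 1) - 1  # 1/2 + 1/3 + ... + 1/(n+1)
    return 2 * tail + Fraction(1, n) * (tail - Fraction(n, n + 1))


for n in range(1, 6):
    print(n, v_zero(n), mean_deletion_search(n), closed_form(n))
```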
We now consider the search properties of the search trees $D^{iy}$.
There are $n \cdot n!$ pairs $iy$, and each corresponding $D^{iy}$ has $n$ open search paths
and $n - 1$ internal search paths. Hence, for the trees $D^{iy}$ we define the mean
open search length $\bar{l}_D(n)$ and the mean internal search length $\bar{l}'_D(n)$ as follows.
$$\bar{l}_D(n) = \frac{1}{n \cdot n \cdot n!} \sum_{i=1}^{n!} \sum_{y \in S} l(D^{iy})$$
and
$$\bar{l}'_D(n) = \frac{1}{(n-1)\, n \cdot n!} \sum_{i=1}^{n!} \sum_{y \in S} l'(D^{iy}).$$

Observe that, as before, the assertion that these quantities depend only on
$n$ is justified because two sequences having the same sequence of ranks have
the same sequence search tree. Therefore, we take $S$ to be the set of the first
$n$ positive integers.
THEOREM 2. It is true for each $n$ that
$$\bar{l}_D(n) = \bar{l}(n-1) \tag{5}$$
and
$$\bar{l}'_D(n) = \bar{l}'(n-1). \tag{6}$$
PROOF. To show this, we will construct for each pair $iy$ a sequence $R^{iy}$.
$R^{iy}$ will be such that $D^{iy}$ is the sequence search tree for $R^{iy}$. Then, letting $Q$
be any sequence equal to at least one of these sequences $R^{iy}$, we will show that
there are exactly $n^2$ distinct pairs $iy$ such that $R^{iy} = Q$. It will then follow that
the set of pairs $iy$ can be partitioned into $n^2$ sets such that Theorem 1 is directly
applicable to each set. That is, if $A$ is any one of these sets, then
$$\sum_{iy \in A} l(D^{iy}) = n!\; \bar{l}(n-1),$$
$$\sum_{iy \in A} l'(D^{iy}) = (n-1)(n-1)!\; \bar{l}'(n-1).$$
It will also follow that
$$n^2 \sum_{iy \in A} l(D^{iy}) = \sum_{i=1}^{n!} \sum_{y \in S} l(D^{iy}),$$
$$n^2 \sum_{iy \in A} l'(D^{iy}) = \sum_{i=1}^{n!} \sum_{y \in S} l'(D^{iy}).$$
Substitution of these values into the defining equations for $\bar{l}_D(n)$ and $\bar{l}'_D(n)$
will then yield (5) and (6).
Therefore, to prove the theorem, it only remains to find the sequences $R^{iy}$.
Let $P$ be any sequence of length $m$, $m > 0$, and containing $m$ distinct numbers.
Let $y$ be any term of $P$. For each such pair $(P, y)$ we define the sequences
$d(P, y)$ and $g(P, y)$ as follows.
Let $P = w_1, w_2, \ldots, w_m$. Let $y = w_j$ where $1 \le j \le m$. If there exists a
number in $P$ which is greater than $y$, let $w_k$ be equal to the smallest such number.
Let $d(P, y)$ be a sequence of length $m - 1$ defined as follows.
(i) If $y$ is the greatest number in $P$, or if $y$ is not the greatest number in $P$ and
$j > k$, then $d(P, y)$ is a subsequence of $P$ and contains every term of
$P$ except $w_j$.
(ii) If $y$ is not the greatest number in $P$ and $j < k$, then let $Q'$ be a subsequence
of $P$ containing every term of $P$ except $w_k$. Replace the $j$th term of
$Q'$ with $w_k$. The sequence so obtained is $d(P, y)$.
It can be verified easily that if $S$ is the set of numbers in $P$, and if $T$ is the
sequence search tree for $P$, then $D(T, S, y)$ is the sequence search tree for
$d(P, y)$.
$g(P, y)$ is defined to be the sequence of ranks for $d(P, y)$. That is, $g(P, y) =
r[d(P, y)]$. Clearly, $D(T, S, y)$ is the sequence search tree for $g(P, y)$.
Now, let $R^{iy} = g(R^i, y)$. It follows from the foregoing that $D^{iy}$ is the sequence
search tree for $R^{iy}$.
Now, let $Q$ be any sequence equal to at least one of the sequences $R^{iy}$. That
is, $Q$ is any sequence of the first $n - 1$ positive integers. That $Q = R^{iy}$ for $n^2$
distinct pairs $iy$ will be proved by induction on $n$. The case $n = 1$ is trivial.
The case $n = 3$ shows the nature of the problem and is exhibited in Table I.

TABLE I
Sequences $R^{iy}$ for $n = 3$

i    R^i    R^{i1}   R^{i2}   R^{i3}

1    123    12       12       12
2    132    12       12       12
3    213    12       21       21
4    231    12       21       21
5    312    21       21       12
6    321    21       21       21
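The sequences $d(P, y)$ and $g(P, y)$ translate directly into code, and Table I can be regenerated from them. The Python sketch below follows cases (i) and (ii) of the definition of $d$; the helper `image_counts`, which tallies how often each sequence $Q$ occurs among the $R^{iy}$, is an illustrative addition, as are all function names.

```python
from collections import Counter
from itertools import permutations


def ranks(seq):
    """Sequence of ranks r(R)."""
    return [sum(1 for other in seq if other <= value) for value in seq]


def d(P, y):
    """The sequence d(P, y) for a term y of a sequence P of distinct numbers."""
    P = list(P)
    j = P.index(y)
    greater = [w for w in P if w > y]
    if not greater:
        return P[:j] + P[j + 1:]   # case (i): y is the greatest number in P
    w_k = min(greater)
    k = P.index(w_k)
    if j > k:
        return P[:j] + P[j + 1:]   # case (i): w_k comes before y
    Q = P[:k] + P[k + 1:]          # case (ii): drop w_k ...
    Q[j] = w_k                     # ... and put it in the place of y
    return Q


def g(P, y):
    """g(P, y) = r[d(P, y)]."""
    return ranks(d(P, y))


def image_counts(n):
    """How many pairs iy give each sequence Q = R^{iy}."""
    S = range(1, n + 1)
    return Counter(tuple(g(p, y)) for p in permutations(S) for y in S)


for i, p in enumerate(permutations((1, 2, 3)), start=1):
    print(i, p, [g(p, y) for y in (1, 2, 3)])
print(image_counts(3))
```

For $n = 3$ each of the two sequences 12 and 21 occurs for $9 = n^2$ pairs, as the table shows.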

Consider arbitrary $n$. Let $Q = r_1, r_2, \ldots, r_{n-1}$. We define and consider in
turn the subsets $A_1$, $A_2$, $A_3$ of the set of pairs $iy$.
Set $A_1$. $iy$ is in $A_1$ if and only if $y = n$ is the first term of $R^i$. It is clear that
there is exactly one pair $iy$ in $A_1$ such that $R^{iy} = Q$.
Set $A_2$. $A_2$ is that set of pairs $iy$ such that $y = r_1$. This condition implies
$y \ne n$, thus excluding members of $A_1$. It also implies that the first element of
$R^i$ is either $y$ or $y+1$. For each $j$, $2 \le j \le n$, there are exactly two pairs $iy$
belonging to $A_2$ and satisfying both of the following.
(i) The $j$th element of $R^i$ is $y$ or $y+1$.
(ii) $R^{iy} = Q$. Thus, $A_2$ contains $2n - 2$ pairs $iy$ for which $R^{iy} = Q$.
Set $A_3$. $A_3$ is the set of all pairs $iy$ which are not in $A_1$ or $A_2$ and for which
the first term of $R^{iy}$ is $r_1$. Clearly, each pair $iy$ which we have yet to count
belongs to $A_3$. A necessary and sufficient condition for a pair $iy$ to belong to $A_3$ is
given by the following (i) and (ii).
(i) Neither $y$ nor $y+1$ is the first element of $R^i$ (or else $iy$ belongs to $A_1$ or
$A_2$).
(ii) Let $x^i$ be the first element of $R^i$. If $x^i > y$ then $x^i = r_1 + 1$. If $x^i < y$,
then $x^i = r_1$.
Therefore, there are $n - 1$ allowable values of $y$ in $A_3$. These are all of the
integers $1, 2, \ldots, n$ excepting $r_1$. $y < r_1$ if and only if $x^i = r_1 + 1$. $y > r_1$
if and only if $x^i = r_1$.
Now for each $i$ let $P^i$ be defined by $R^i = x^i P^i$. That is, $P^i$ is the subsequence
of $R^i$ containing all but the first term. It can be verified that for $iy$ in $A_3$, $R^{iy}$ can
be constructed as follows. Let $t_y$ be the rank of $y$ in $P^i$. Denote by $w_j^{iy}$ the $j$th
term of the sequence $g[r(P^i), t_y]$. Denote by $v_j^{iy}$ the $j$th term of $R^{iy}$. By the
definition of $A_3$, $v_1^{iy} = r_1$. $v_j^{iy}$ for $1 < j \le n - 1$ is as follows. If $w_{j-1}^{iy} < r_1$ then
$v_j^{iy} = w_{j-1}^{iy}$. If $w_{j-1}^{iy} \ge r_1$ then $v_j^{iy} = w_{j-1}^{iy} + 1$. Thus the sequence $g[r(P^i), t_y]$ uniquely
defines $R^{iy}$ for $iy$ in $A_3$. Thus also if $iy$ and $kz$ are in $A_3$ and $R^{iy} = R^{kz}$ then
$g[r(P^i), t_y] = g[r(P^k), t_z]$.
The induction assumption can be applied to the sequence $g[r(P^i), t_y]$. For,
let $Q'$ be any sequence of length $n - 1$ containing each of the first $n - 1$ positive
integers. For each $y$ it is true that there is exactly one pair $iy$ in $A_3$ such that
$r(P^i) = Q'$. This is verified as follows. It is clear that for each $z$, $1 \le z \le n$,
there is one and only one pair $iy$ such that $x^i = z$ and $r(P^i) = Q'$. But, as
observed above, $y$ uniquely defines $x^i$ when $iy$ is in $A_3$. Also, if $y < r_1$ then $x^i =
r_1 + 1$ and hence $t_y = y$; and if $y > r_1$ then $x^i = r_1$ and hence $t_y = y - 1$. $y \ne
r_1$ for any $iy$ in $A_3$. It follows that $t_y$ uniquely defines $x^i$ for $iy$ in $A_3$. Hence
for a given $t_y$ there is exactly one pair $iy$ in $A_3$ such that $r(P^i) = Q'$. The
induction assumption therefore applies to the sequences $g[r(P^i), t_y]$.
Now, let $Q_1$ be the sequence such that if $g[r(P^i), t_y] = Q_1$ then $R^{iy} = Q$.
The induction assumption asserts that there are $(n - 1)^2$ pairs $iy$ in $A_3$ such
that $g[r(P^i), t_y] = Q_1$. This would imply exactly $(n - 1)^2$ pairs $iy$ in $A_3$ such
that $R^{iy} = Q$. These, together with the one found in $A_1$ and the $2n - 2$ found
in $A_2$, complete the induction. Q.E.D.

2. APPLICATIONS
2.1 Searching
Computer programmers are familiar with the fact that lists which can be
searched efficiently tend to be difficult to change efficiently, and vice versa.
Some lists are never changed, and some are never searched. If a list needs to be
both searched and changed, the conflict can often be resolved by devoting more
storage to the list. But when not enough storage is available to resolve the
conflict, a problem arises. This problem is to effect, with the limited storage, a
workable compromise between a search-oriented list design and a change-oriented
list design. In the next two sections we consider two well-known kinds of lists
which illustrate this conflict. In Section 2.1.3 we consider a list which offers, at
a reasonable cost in storage, a compromise between searching and changing;
and we apply to this list the main results of Section 1.
2.1.1 Sequence Lists
The classic example of a search-oriented list is the following. Let $z_1, z_2, \ldots,
z_n$ be an ascending sequence of length $n$. For each $i$, $1 \le i \le n$, let $z_i$ occupy
location $A + bi$ in a computer. The well-known binary search algorithm is
applicable to the list. This algorithm makes its first comparison with the number
in location $A + b[n/2]$ (or thereabouts), thus eliminating from the search one of
the sequences $\{z_i\}$, $1 \le i < [n/2]$, or $\{z_i\}$, $[n/2] < i \le n$. The algorithm
then applies itself recursively to the remaining sequence.
The search length, defined as the number of comparisons in a search, has an
expected value of approximately $\log_2 n$. However, to make an insertion or a
deletion in the list entails a lot of work. For, on the average, assuming random
choice of the number to be inserted or deleted, $n/2$ numbers must be given new
memory locations.
Hence, a sequence list is search-oriented but is not change-oriented.
Note that there is a binary search tree associated with the binary search
algorithm. Each location corresponds to a proper node of the search tree. The
tree has the property that it is balanced. That is, letting $m_\sigma$ denote for each
$\sigma$ the number of nodes in $T_\sigma$, it is true for all $\sigma$ that $|m_{\sigma 0} - m_{\sigma 1}| \le 1$.
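As an illustration of the search length of a sequence list, here is a minimal Python sketch of binary search over an ascending list that also counts comparisons. It uses zero-based positions in a Python list rather than the storage locations $A + bi$ of the text, and the function name is an assumption.

```python
def binary_search(z, y):
    """Search the ascending list z for y; return (found, comparisons)."""
    low, high = 0, len(z) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1          # one three-way comparison of y with z[mid]
        if y == z[mid]:
            return True, comparisons
        if y < z[mid]:
            high = mid - 1        # discard z[mid] and everything after it
        else:
            low = mid + 1         # discard z[mid] and everything before it
    return False, comparisons


z = [100, 200, 300, 400, 500, 600]
print(binary_search(z, 400))
print(binary_search(z, 250))
```

An insertion into such a list, by contrast, has to shift on average about half of the entries, which is the cost the text refers to.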
2.1.2 Singly Linked Lists
The leading example of a change-oriented list is the singly linked or
"pushdown" list [3, 4]. Let $z_1, z_2, \ldots, z_n$ be a sequence. Let a singly linked list consist
of a number $l_0$ together with a set $\{d_i \mid 1 \le i \le n\}$ of ordered pairs $d_i = (z_i, l_i)$.
$l_i$, $0 \le i < n$, is the location³ of $d_{i+1}$. $l_n$ is any number not belonging to the
set of possible locations.
Clearly, an insertion or deletion is accomplished in a singly linked list simply
by changing one of the $l_i$ and, for an insertion, adding a new pair to the set of
pairs. Note, however, that searching in a singly linked list is a long process.
Expected search time is proportional to the number $n$ of entries in the list.
³ The term location as used here is applicable to any set of memory components to which
the program has ready access. Thus $z_i$ and $l_i$, as well as $d_i$, have locations.

TABLE II
A Singly Linked List

Location    z_i    l_i

5           200    6
10          600    28
14          100    10
17          400    23
23          500    5
28          300    17

l_0 = 14

2.1.3 A Doubly Linked List


The results of Section 1 imply that a doubly linked list can be designed to
give, under random conditions, expected search time and expected insertion/
deletion time both proportional to $\log n$.
Notation. Let $I$ be a set of integers and let $L = \{(x_i, l_i, r_i) \mid i \text{ in } I\}$ be a
set of triples. The successor relation is defined among the numbers $x_i$ by the
following two statements.
(1) If $l_i$ (respectively $r_i$) is in $I$, then $x_{l_i}$ (respectively $x_{r_i}$) is a successor
of $x_i$.
(2) If $x_j$ is a successor of $x_i$, then every successor of $x_j$ is a successor of $x_i$.
DEFINITION. Let $I$ be a set of storage locations. Let $\phi$ be a number which
is not a storage location. A doubly linked list is a number $E$ together with a
set $L = \{(x_i, l_i, r_i) \mid i \text{ in } I\}$ of triples, having the following properties.
(1) Each triple $(x_i, l_i, r_i)$ is in location $i$.
(2) If $L$ is empty, then $E = \phi$; otherwise, $E$ is in $I$.
(3) If $E \ne \phi$, then $x_E$ together with the successors of $x_E$ is equal to $\{x_i \mid i \text{ in } I\}$.
(4) For each $i$ in $I$, if $l_i \ne \phi$ then $x_i$ is greater than $x_{l_i}$ and every successor
of $x_{l_i}$.
(5) For each $i$ in $I$, if $r_i \ne \phi$ then $x_i$ is less than $x_{r_i}$ and every successor of $x_{r_i}$.
An example of a doubly linked list is given in Table III. The same list is shown
schematically in Figure 1.

TABLE III
A Doubly Linked List

i (location)    x_i    l_i    r_i

5               600    28     φ
10              300    φ      φ
14              500    φ      φ
17              200    23     5
23              100    φ      φ
28              400    10     14

E = 17

FIG. 1. The binary search tree corresponding to the list of Table III. Circles denote
proper nodes, rectangles denote blank nodes. The numbers are the $f_{ST}(p_\sigma)$. The tree is
the sequence search tree for the sequence (200, 600, 400, 100, 500, 300).

Observe that Figure 1 is also a schematic diagram
of a search tree. The set of proper nodes of the tree is the set $I$ of locations. The
location $E$ is the root. For two locations (i.e. nodes) $i$ and $j$, if $j = l_i$ (respectively
$r_i$) then $(i, j)$ is a left (respectively right) link of the tree. Let $T$ denote
the tree and let $S = \{x_i \mid i \text{ in } I\}$. Then the list function is given by $f_{ST}(i) = x_i$
for each $i$ in $I$. The root $p$ of the tree is $p = E$. For any node $p_\sigma = i$, if $l_i \ne \phi$
then $p_{\sigma 0} = l_i$ and if $r_i \ne \phi$ then $p_{\sigma 1} = r_i$. If $l_i$ (respectively $r_i$) $= \phi$ then $p_{\sigma 0}$
(respectively $p_{\sigma 1}$) is a blank node.
Let $E$ be denoted by $r_\phi$. A doubly linked list exists in storage in the form of
three arrays, one array for each of the sets $\{x_i \mid i \text{ in } I\}$, $\{l_i \mid i \text{ in } I\}$, $\{r_i \mid i \text{ in }
I \cup \{\phi\}\}$. The order within these arrays is unimportant.
Consider the following search algorithm. (The notation used in this and in
the algorithms which follow it is that of ALGOL 60 [8].)

procedure search (y) doubly linked list: (x, l, r) main output: (z) secondary outputs: (i, j, w);
  real array x; integer array l, r; real y;
  boolean z; integer w, i, j;
  comment z is "true" if and only if y is in x. The secondary outputs are for the use of
    the "insert" and "delete" procedures defined below;
  begin i := r[φ]; w := 1; j := φ;
    S1: if i = φ then z := false
        else if y = x[i] then z := true
        else begin j := i; if y > x[i] then begin i := r[i]; w := 1 end
                           else begin i := l[i]; w := 0 end;
                   go to S1 end
  end search

A measure of the length of a search process is the number of comparisons
performed. In the above algorithm, $y$ is compared with each number in a sequence
$x_{i_1} \ldots x_{i_m}$. The sequence $i_1 \ldots i_m$ is a search path in the search tree associated
with the list. If $y$ is found by the search, the path is an internal one, otherwise it
is open. The number of comparisons in a search is equal to the length of the
associated search path. The expected value of this quantity depends upon how
the list is constructed. Consider the following "insert" algorithm, which uses
the search algorithm as a subroutine.
procedure insert (y) doubly linked list: (x, l, r) next available location: (k);
  comment the array r contains not only the right links but a pushdown list of available
    storage beginning with r[k];
  begin integer i, j, w; boolean z;
    search (y, x, l, r, z, i, j, w);
    if ¬ z then
      begin if w = 1 then r[j] := k else l[j] := k; j := k; k := r[k]; x[j] := y;
        l[j] := r[j] := φ end
  end insert

Note that if the arrays $x$, $l$, $r$ constitute a doubly linked list at the beginning
of the execution of "insert" then they will do so at the end. Observe that the
insertion is accomplished by changing four numbers in the list proper and one
in the available storage list. Thus the only portion of the process which depends
on the size of the list is that performed by the search algorithm.
Let $y_1 \ldots y_n$ be a sequence of $n$ distinct numbers chosen and ordered
randomly. Let $r_\phi = \phi$ initially (that is, let the list be essentially empty), and let
"insert ($y_i$, x, l, r, k)" be performed for $i = 1, 2, \ldots, n$. The resulting list
will have associated with it the sequence search tree for $y_1 \ldots y_n$. Therefore,
Theorem 1 gives the expected number of comparisons in a search. That is, the
expected number $t$ of comparisons when $y$ is not found is $t = \bar{l}(n)$, and the
expected number $t'$ when $y$ is found is $t' = \bar{l}'(n)$. The overall expected number of
comparisons depends on the probability that $y$ will be found.
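For readers without an ALGOL 60 system, the "search" and "insert" procedures can be transcribed into Python roughly as follows. This is a sketch under several assumptions not in the original: the class name `DoublyLinkedList`, the choice of 0 for $\phi$, locations numbered 1 to `capacity`, a comparison counter returned by `search`, and an error raised when no free location is left.

```python
PHI = 0  # stands for the non-location phi; the value 0 is an assumption


class DoublyLinkedList:
    def __init__(self, capacity):
        """Empty list over locations 1..capacity (capacity >= 1)."""
        self.x = [None] * (capacity + 1)
        self.l = [PHI] * (capacity + 1)
        # r[PHI] plays the role of E; the other entries of r initially chain
        # the free locations 1 -> 2 -> ... -> capacity -> PHI.
        self.r = [PHI] + [i + 1 for i in range(1, capacity)] + [PHI]
        self.k = 1  # next available location

    def search(self, y):
        """Return (z, i, j, w, comparisons), following the ALGOL procedure."""
        i, w, j = self.r[PHI], 1, PHI
        comparisons = 0
        while i != PHI:
            comparisons += 1
            if y == self.x[i]:
                return True, i, j, w, comparisons
            j = i
            if y > self.x[i]:
                i, w = self.r[i], 1
            else:
                i, w = self.l[i], 0
        return False, i, j, w, comparisons

    def insert(self, y):
        z, i, j, w, _ = self.search(y)
        if z:
            return  # y is already in the list
        k = self.k
        if k == PHI:
            raise MemoryError("no free location")  # not in the original
        if w == 1:
            self.r[j] = k
        else:
            self.l[j] = k
        self.k = self.r[k]  # pop location k from the available storage list
        self.x[k] = y
        self.l[k] = self.r[k] = PHI


lst = DoublyLinkedList(6)
for y in (200, 600, 400, 100, 500, 300):
    lst.insert(y)
print(lst.search(300))
print(lst.search(250))
```

The number of comparisons reported for a successful search equals the length of the corresponding internal search path (4 for 300 in this example).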
Consider now the following deletion algorithm.
procedure delete (y) doubly linked list: (x, l, r) next available location: (k);
  begin integer h, i, j, w; boolean z;
    search (y, x, l, r, z, i, j, w);
    if z then
      begin if r[i] = φ then
          begin h := j; j := i; if w = 1 then r[h] := l[i] else l[h] := l[i] end else
          begin j := r[i]; if l[j] = φ then
              begin x[i] := x[j]; r[i] := r[j]; go to C end;
            B: h := j; j := l[j]; if l[j] ≠ φ then go to B; x[i] := x[j]; l[h] := r[j]
          end;
        C: r[j] := k; comment see "insert" comment; k := j
      end
  end delete

It is directly verifiable that the list resulting from "delete" has associated
with it the tree $D(T, S, y)$ of Section 1.3. Therefore, if $y$ is selected at random
from among the $x_i$ then Theorem 2 is applicable. That is, the functional
relationship between the expected search lengths $t$, $t'$ and the number $n$ of entries
in the list, is preserved.
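The "delete" procedure can be sketched in Python in the same spirit, here applied to the list of Table III held in three dictionaries keyed by location. The value 0 for $\phi$, the function names, and the convention that `delete` returns the new head of the available storage list are assumptions of this example; the link updates follow the two cases of Section 1.3.

```python
PHI = 0  # stands for phi; the value 0 is an assumption

# The list of Table III; r[PHI] holds E, the location of the root.
x = {5: 600, 10: 300, 14: 500, 17: 200, 23: 100, 28: 400}
l = {5: 28, 10: PHI, 14: PHI, 17: 23, 23: PHI, 28: 10}
r = {PHI: 17, 5: PHI, 10: PHI, 14: PHI, 17: 5, 23: PHI, 28: 14}


def search(y):
    """Return (z, i, j, w) as in the ALGOL 'search' procedure."""
    i, w, j = r[PHI], 1, PHI
    while i != PHI and x[i] != y:
        j = i
        if y > x[i]:
            i, w = r[i], 1
        else:
            i, w = l[i], 0
    return i != PHI, i, j, w


def delete(y, k):
    """Delete y; k heads the free list. Returns the new head of the free list."""
    z, i, j, w = search(y)
    if not z:
        return k
    if r[i] == PHI:
        # Case (i): the left subtree of i replaces i under its parent h.
        h, j = j, i
        if w == 1:
            r[h] = l[i]
        else:
            l[h] = l[i]
    else:
        j = r[i]
        if l[j] == PHI:
            # The right child of i holds the smallest larger number.
            x[i], r[i] = x[j], r[j]
        else:
            # Case (ii): walk left links to the smallest number j below r[i].
            while True:
                h, j = j, l[j]
                if l[j] == PHI:
                    break
            x[i], l[h] = x[j], r[j]
    r[j] = k  # push the freed location j onto the available storage list
    return j


def inorder(i):
    """Numbers of the subtree at location i, in ascending order."""
    return [] if i == PHI else inorder(l[i]) + [x[i]] + inorder(r[i])


k = delete(600, PHI)   # case (i): location 5 has no right link
k = delete(200, k)     # case (ii): the root; 300 moves into location 17
print(inorder(r[PHI]), k)
```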
In general, therefore, we assert that, starting with an empty list and
performing $m_1$ random insertions and $m_2$ random deletions in any order, if $m_1 - m_2 =
n$ then
$$t = 2\left(\frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n+1}\right) \approx 1.4 \log_2 n$$
and
$$t' = \frac{n+1}{n}\, t - 1 \approx t - 1.$$
The first approximation is by (3), Section 1.2.
Consider the efficiency of the deletion algorithm. The algorithm first locates
the number $y$ in the list. It then finds the smallest successor of $y$. These two
processes together correspond to the paths $t^{iy}$ discussed in Section 1.3. The
expected number of list elements handled is the average length $\bar{l}_d(n)$ of $t^{iy}$ given
in (4). Thus, the searching portion of the deletion algorithm has an expected
length only slightly larger than $t$.
The deletion process modifies three to five numbers in the list, regardless of
the size of the list.
The storage required is, in a typical application, about double the required
storage of a sequenced list and about half again that of a singly linked list.
In return for this moderate amount of storage, both search time and insertion/
deletion time are proportional to $\log n$.
2.1.4 An Unsolved Problem
It is a consequence of the results of [1] t h a t the condition for optimal searching
in a binary search tree 7' is as follows. We assume random choice of the search
operand. Let, m~ denote the number of proper nodes in T~. Let r~ denote the
integral power of 2 such t h a t r, _-_ max (m°0, m~) < 2r~. The condition for
optimal searching is that, for each ~, rain (m~0, m~) _-> r, -- 1.
Now if the numbers $m_p$ are adjoined to the doubly linked list of the previous
section, then an insertion algorithm which ensures the above condition can be
designed. While such an algorithm would tend generally to be inefficient, it
would clearly be more efficient than the insertion algorithm for the sequenced
list (Section 2.1.1). Note that the complexity of the list transformation would
depend on the proximity of $m_\Lambda$ to a power of 2. A quantitative analysis of this
insertion algorithm is still being sought.
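The optimality condition can be tested mechanically on a given tree shape. The sketch below assumes the following reading of the condition: at every node, with m0 and m1 the numbers of proper nodes in the two subtrees and r the power of 2 satisfying r <= max(m0, m1) < 2r, the requirement is min(m0, m1) >= r - 1. The tuple representation, the names size and is_optimal, and the convention that a node with two empty subtrees trivially passes are assumptions made for illustration.

```python
def size(tree):
    """Number of proper nodes; a tree is None (empty) or a pair (left, right)."""
    if tree is None:
        return 0
    left, right = tree
    return 1 + size(left) + size(right)


def is_optimal(tree):
    """Check min(m0, m1) >= r - 1 at every node, with r <= max(m0, m1) < 2r."""
    if tree is None:
        return True
    left, right = tree
    m0, m1 = size(left), size(right)
    big, small = max(m0, m1), min(m0, m1)
    if big > 0:
        r = 1 << (big.bit_length() - 1)   # largest power of 2 not exceeding big
        if small < r - 1:
            return False
    # When big == 0 both subtrees are empty and the node passes (assumed convention).
    return is_optimal(left) and is_optimal(right)


leaf = (None, None)
balanced_three = (leaf, leaf)             # root with two children
chain_three = ((leaf, None), None)        # three nodes in a single descending path
print(is_optimal(balanced_three), is_optimal(chain_three))
```

Recomputing size at every node keeps the sketch short; an insertion algorithm of the kind discussed above would instead store the counts with the nodes and update them as the tree changes.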

2.2 A Sorting Algorithm


Constructing the list of Section 2.1.3 is, in a sense, a sorting operation. How-
ever, it is slower than other known sorting methods which use a comparable
amount of storage. But when sorting is to be done with a minimum of storage,
the results of Section 1.2 suggest a method which does have some advantages.
The method turns out to be very similar to Hoare's "Quicksort" [9] but with
two important differences which are discussed below. The advantages are
(i) an expected number of comparisons approximately equal to $1.4\,n \log_2 n$
(in close agreement with Hoare's reported average of $2n \log_e n$);
(ii) auxiliary storage proportional to $\log_2 n$;
(iii) insensitivity to distribution.
Any pair of these properties can be obtained, or improved upon, in other sorting
schemes; but the scheme defined below is the only one known by the author to
have all three properties. Quicksort has properties (i) and (iii) but not (ii),
as is shown below.
The expected number of comparisons required to construct the sequence
search tree for a given sequence is calculated as follows. Let
$\{l_i' \mid i = 1, 2, \ldots, n\}$ be the lengths of the internal search paths, as in
Section 3. Then $\sum_{i=1}^{n} (l_i' - 1)$ is the number of comparisons required to
construct the tree. Hence the expected number $\delta$ of comparisons to construct a
sequence search tree is $[V'(n) - n \cdot n!]/n!$. Since $V'(n)/n! = n\,l'(n)$, we have

$$\delta = n\,l'(n) - n \approx 1.4(n + 1) \log_2 n - 2n. \tag{7}$$

Note that the derivation of (7) requires the assumption that the sequence
contains n distinct numbers.
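For small n the expected construction cost can be confirmed by brute force. The sketch below averages the number of comparisons over all n! orderings of n distinct numbers and compares the result with the closed form 2(n+1)(1/2 + 1/3 + ... + 1/(n+1)) - 2n. That closed form is the standard exact value for random insertion order and is used here as an assumption; it is not quoted from the paper, which gives only the approximation in (7). All function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def construction_comparisons(seq):
    """Comparisons made while inserting seq, in order, into an empty search tree."""
    root = None                      # a node is a list [key, left, right]
    total = 0
    for key in seq:
        if root is None:
            root = [key, None, None]
            continue
        node = root
        while True:
            total += 1               # one comparison per node on the search path
            side = 1 if key < node[0] else 2
            if node[side] is None:
                node[side] = [key, None, None]
                break
            node = node[side]
    return total


def delta_exact(n):
    """Assumed closed form: 2(n+1)(1/2 + ... + 1/(n+1)) - 2n."""
    tail = sum(Fraction(1, k) for k in range(2, n + 2))
    return 2 * (n + 1) * tail - 2 * n


def delta_brute_force(n):
    """Average of construction_comparisons over all n! orderings of 1..n."""
    total = sum(construction_comparisons(p) for p in permutations(range(1, n + 1)))
    return Fraction(total, factorial(n))


for n in range(1, 7):
    print(n, delta_brute_force(n), delta_exact(n))
```

Exact fractions are used so that the two columns can be compared for equality rather than to within rounding error.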
The sorting algorithm which we wish to consider, while it does not construct
an explicit search tree,⁴ performs a sequence of comparisons which is equivalent
to constructing a search tree. The method first compares $a_1$ with each of
$a_2, \ldots, a_n$, thus calculating the rank of $a_1$. The number $a_1$ is then stored in
the location corresponding to its rank. In the process, all numbers less than $a_1$
get stored in locations below that of $a_1$, and the rest of the numbers get stored
above. As will be seen in the algorithm, this is accomplished with just one extra
storage location. At this point, the task is reduced to that of sorting two
sequences: the sequence stored below $a_1$, and the sequence stored above $a_1$. Each
of the sequences is sorted by a recursive application of the algorithm, one of the
two sequences remaining untouched until the other is completely sorted. It is
necessary to store the terminal locations of the sequence which is not sorted first;
thus there will be a table of such locations. By sorting first the smaller of the two
sequences flanking $a_1$, the length of the table is forced not to exceed $\log_2 n$.
The method bears a strong resemblance to the radix exchange method [7],
the principal difference being that the present method employs a comparison,
instead of a digit inspection, as its basic operation.
⁴ (Added in proof.) It has been brought to my attention that Windley [WINDLEY, P. F.
Trees, forests and rearranging, Comput. J. 3 (1960), 84-88] has examined the explicit
construction of the tree as a sorting technique and has derived an expression
equivalent to (7). Windley has also calculated a mean deviation associated with (7).
Quicksort, instead of calculating the rank of $a_1$, calculates the rank of a
"randomly" chosen member of the sequence. Quicksort appears at first glance
to have no table like the one mentioned above. These differences are evaluated
below. First let us define the present algorithm more precisely by means of
ALGOL 60 as follows.
procedure P(a, n); value n; array a; integer n;
comment sorts a[1], ..., a[n] into ascending order;
begin integer e, h, i, j, k; real d;
    comment f and g hold the end points of the deferred subsequences; the bound
        below is the integer part of log2 n, plus one spare entry;
    integer array f, g[1 : entier(ln(n)/ln(2)) + 1];
    e := 1; h := n; k := 1;
B:  if e ≥ h then go to C; d := a[e]; i := e; j := h;
E:  if a[j] < d then
        begin a[i] := a[j]; i := i + 1; if i = j then go to D end
    else begin j := j - 1; if i = j then go to D; go to E end;
F:  if a[i] > d then
        begin a[j] := a[i]; j := j - 1; if i = j then go to D; go to E end
    else begin i := i + 1; if i = j then go to D; go to F end;
D:  a[i] := d;
    if i - e < h - i then
        begin f[k] := i + 1; g[k] := h; h := i - 1 end
    else begin f[k] := e; g[k] := i - 1; e := i + 1 end;
    k := k + 1; go to B;
C:  k := k - 1; if k > 0 then begin e := f[k]; h := g[k]; go to B end
end
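For readers more at home with structured loops, the procedure can be transliterated as follows. This is a minimal sketch in Python and not part of the original paper: indices are 0-based, the go-to structure is replaced by loops, the list table stands in for the arrays f and g, and the names hibbard_sort and average_comparisons, together with the two counters, are hypothetical additions used only to observe the behaviour.

```python
from fractions import Fraction
from itertools import permutations


def hibbard_sort(a):
    """Sort list a in place; return (comparisons, peak length of the table)."""
    comparisons = 0
    peak = 0
    table = []                        # deferred subsequences, as in f and g
    e, h = 0, len(a) - 1              # bounds of the current subsequence
    while True:
        if e < h:
            d = a[e]                  # the number being ranked; a[e] becomes a hole
            i, j = e, h
            scanning_down = True      # True corresponds to label E, False to label F
            while i < j:
                comparisons += 1
                if scanning_down:
                    if a[j] < d:
                        a[i] = a[j]   # fill the hole at i; the hole moves to j
                        i += 1
                        scanning_down = False
                    else:
                        j -= 1
                else:
                    if a[i] > d:
                        a[j] = a[i]   # fill the hole at j; the hole moves to i
                        j -= 1
                        scanning_down = True
                    else:
                        i += 1
            a[i] = d                  # statement D: d is now in its final place
            if i - e < h - i:         # lower part is smaller: defer the upper part
                table.append((i + 1, h))
                h = i - 1
            else:                     # otherwise defer the lower part
                table.append((e, i - 1))
                e = i + 1
            peak = max(peak, len(table))
        elif table:
            e, h = table.pop()        # statement C: resume a deferred subsequence
        else:
            return comparisons, peak


def average_comparisons(n):
    """Exact mean comparison count over all n! orderings of n distinct numbers."""
    counts = [hibbard_sort(list(p))[0] for p in permutations(range(n))]
    return Fraction(sum(counts), len(counts))


data = [5, 3, 8, 1, 9, 2, 7]
print(hibbard_sort(data), data)
```

The returned pair gives the number of comparisons and the largest number of table entries held at once, so both the comparison counts and the storage bound discussed in the text can be observed on small inputs.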

From the following considerations, it can easily be verified that each execution
of this algorithm has a search tree associated with it. Let the execution of
the algorithm start at time 0. Let $A(t) = a_1(t), \ldots, a_n(t)$ denote the value of
A at any time t. Let $\alpha$ denote the time of the completion of the first execution of
statement D. At time $\alpha$, $a_1(0)$ has been compared with every other number in
A, and $a_i(\alpha) = a_1(0)$. Moreover, $a_m(\alpha) < a_1(0)$ for all $m < i$, and
$a_m(\alpha) > a_1(0)$ for all $m > i$. That is, $a_1(0)$ occupies its proper position in the
sequence at time $\alpha$, and therefore undergoes no more comparisons. Clearly,
$a_1(0)$ thus corresponds to the root of the search tree. The algorithm is then
applied recursively to the sequences $a_1(\alpha), \ldots, a_{i-1}(\alpha)$ and
$a_{i+1}(\alpha), \ldots, a_n(\alpha)$ (not necessarily in that order). $a_1(\alpha)$ will therefore
correspond to the node $p0$ of the search tree and $a_{i+1}(\alpha)$ will correspond to the
node $p1$. Or, in terms of the list function f for the search tree and the set of
elements in A, we have $f(p) = a_1(0)$, $f(p0) = a_1(\alpha)$ and $f(p1) = a_{i+1}(\alpha)$.
Continuing in this way will, clearly, define a list function, and therefore a
search tree.
It can also be verified that if A is a sequence of the first n integers, then the
n! possible values of A(0) correspond in a 1-to-1 way with the sequences R
of Section 1.3. That is, if T is the search tree corresponding to the execution of
the algorithm for A(0), then the sequence R corresponding to A(0) is such that
T is the sequence search tree for R.
It follows that (7) gives the expected number of comparisons in an execution
of the algorithm for a randomly chosen A(0).
Similar considerations show that (7) is also applicable to Quicksort, thus
verifying Hoare's reported average of $2n \log_e n$ comparisons.
The only storage requirement, beyond that necessary for the program itself
and for the sequence A, is the storage necessary for the numbers $f_k$, $g_k$. The
maximum value of k during the algorithm does not exceed $\log_2 n$. This is
ensured because, at each completion of statement D, the algorithm chooses the
smaller of the two subsequences open to it.
The numbers $f_k$, $g_k$ do not appear in Quicksort because of the recursive way
in which that algorithm is defined. But it is clear that at each new level of
recursion some information about the previous level must be retained; that is, the
$f_k$, $g_k$ are needed by Quicksort too. Moreover, k is there allowed to attain the
value n; thus Quicksort does not have property (ii) above. It can, of course, be
given that property by the same means used in the present algorithm.
Equation (7) holds for the present algorithm if the sequence of ranks is
random. Since this implies nothing about the distribution of the numbers in
the sequence, property (iii) holds. If the sequence of ranks is not random then
the algorithm might be slowed down considerably. The worst case, $n(n - 1)/2$
comparisons, occurs with either an ascending or a descending sequence.
Quicksort's worst case is also $n(n - 1)/2$ comparisons, but the question
of which sequence gives rise to it depends on the nature of the "random" choice
of an element from the subsequence $a_e, \ldots, a_h$. Although this choice requires
an amount of extra work proportional to n, it has an advantage if the sequence
of ranks is biased. For, even though it is possible for a bias toward Quicksort's
worst case to exist (since nothing is really random on a computer), this kind of
bias seems much less likely to exist than a bias toward ascending or descending
order.
If the present algorithm were preceded by a "random" scrambling of the
sequence (a process with duration proportional to n) it would be equivalent to
Quicksort.⁵
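The worst-case figure of n(n - 1)/2 comparisons quoted above for the present algorithm can be written out as a short count. Each comparison in statements E and F brings i and j one step closer together, so ranking the first number of a subsequence of m numbers costs exactly m - 1 comparisons. For an ascending sequence that number is always the smallest, so each step leaves a single subsequence of m - 1 numbers still to be sorted. The summation index m is introduced here for illustration only.

```latex
\[
  \sum_{m=2}^{n} (m-1) \;=\; 1 + 2 + \cdots + (n-1) \;=\; \frac{n(n-1)}{2}.
\]
```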
A search of the literature shows that Shell's method [5, 6] is the fastest sorting
algorithm having the property of being insensitive to distribution and having a
storage requirement comparable to that of the present algorithm. The expected
number of comparisons for Shell's method, given a randomly ordered sequence,
is given approximately⁶ by $n(0.194 (\log_2 n)^2 - 0.77 \log_2 n + 8.4)$. Hence the
present algorithm is, for large enough n, superior to Shell's method. It will be
superior to the radix exchange method [7] only in cases of uneven distribution.
REFERENCES
1. HUFFMAN, D. A. A method for the construction of minimum redundancy codes. Proc.
I.R.E. 40 (1952), 1098-1101.
2. BURGE, W. H. Sorting, trees and measures of order. Informat. Contr. 1 (1958), 181-197.
3. CARR, JOHN W., III. Recursive subscripting compilers and list-type memories. Comm.
ACM 2 (1959), 4-6.
4. NEWELL, A., AND SHAW, J. C. Programming the logic theory machine. Proc. Western
Joint Comput. Conf. (1957), 230-240.
5. SHELL, D. L. A high speed sorting procedure. Comm. ACM 2, No. 7 (1959), 30-32.
6. FRANK, R. M., AND LAZARUS, R. B. A high speed sorting procedure. Comm. ACM 3
(1960), 20-22.
7. HILDEBRANDT, P., AND ISBITZ, H. Radix exchange--an internal sorting method for
digital computers. J. ACM 6 (1959), 156-163.
8. NAUR, P. (ed.); BACKUS, J. W., ET AL. Report on the algorithmic language ALGOL
60. Comm. ACM 3 (1960), 299-314.
9. HOARE, C. A. R. Algorithms 63 and 64. Comm. ACM 4 (1961), 321.
⁵ Except when equal elements are allowed, in which case a slight modification of the
present algorithm is desirable.
⁶ Experimentally determined by the author on an IBM 709 computer; holds for $n \ge 200$.
