Machine Musicianship
Robert Rowe
Musicians begin formal training by acquiring a body of musical concepts commonly known as musicianship. These concepts underlie the musical skills of listening, performance, and composition. Like humans, computer music programs can benefit from a systematic foundation of musical knowledge. This book explores the technology of implementing musical processes such as segmentation, pattern processing, and interactive improvisation in computer programs. It shows how the resulting applications can be used to accomplish tasks ranging from the solution of simple musical problems to the live performance of interactive compositions and the design of musically responsive installations and Web sites.
Machine Musicianship is both a programming tutorial and an exploration
of the foundational concepts of musical analysis, performance, and
composition. The theoretical foundations are derived from the fields of
music theory, computer music, music cognition, and artificial intelligence.
The book will be of interest to practitioners in those fields, as well as to
performers and composers.
The concepts are programmed using C++ and Max. The accompanying
CD-ROM includes working versions of the examples, as well as source
code and a hypertext document showing how the code leads to the
program’s musical functionality.
“Rowe’s book is written for the DIY music hacker. . . . Machine Musicianship could help you turn your old computer into your new bandmate — able to do anything but help you carry your gear to the next gig.”
— Douglas Geers, Electronic Musician
All rights reserved. No part of this book may be reproduced in any form by any electronic
or mechanical means (including photocopying, recording, or information storage and re-
trieval) without permission in writing from the publisher.
This book was set in Melior by Achorn Graphic Services, Inc., and was printed and bound
in the United States of America.
Rowe, Robert.
Machine musicianship / Robert Rowe.
p. cm.
Includes bibliographical references and index.
Contents: Machine musicianship—Symbolic processes—Sub-symbolic processes—
Segments and patterns—Compositional techniques—Algorithmic expression and music
cognition—Interactive improvisation—Interactive multimedia—Installations—Direc-
tions.
ISBN 0-262-18206-8 (hc. : alk. paper)
1. Artificial intelligence—Musical applications. 2. Music—Computer programs.
3. Real-time programming. 4. Computer composition. I. Title.
ML74.R68 2001
780′.285—dc21
00-038699
Contents
Acknowledgments ix
1 Machine Musicianship 1
1.1 The Motivation for Machine Musicianship 3
1.2 Algorithmic Composition 6
1.3 Algorithmic Analysis 8
1.4 Structure of the Text 12
1.5 Machine Musicianship Library 14
2 Symbolic Processes 17
2.1 Chord Theory 18
2.1.1 Triad Classifier 19
2.1.2 Representations 28
2.1.3 MIDI Chord Recognizer 33
2.1.4 Chord Spelling 42
2.2 Context Sensitivity 46
2.2.1 Virtual Pitch 47
2.2.2 Determination of Chord Type 58
2.3 Key Induction 60
2.3.1 Interaction with Key Induction 60
2.3.2 Parallel Processing Model 66
2.4 C++ and Object Orientation 77
2.4.1 The Note Class 77
2.4.2 The Event Class 80
2.4.3 The Listener Class 85
2.4.4 The ListenProp Class 89
3 Sub-symbolic Processes 93
3.1 Neural Networks 93
3.1.1 Neural Network Key Induction 98
3.1.2 Sequential Neural Networks 101
3.2 Time Structures 110
3.2.1 Quantization 112
3.3 Beat Tracking 122
3.3.1 Multiple Attractors 125
3.3.2 Adaptive Oscillators 130
3.3.3 Meter Induction 135
3.4 Max Externals 139
9 Installations 355
9.1 Multimedia Installations 355
9.1.1 Audio and Imagery 356
9.1.2 Large-Scale Interaction 360
9.2 Animated Improvisation 362
9.2.1 Improvisation with Scales 365
9.2.2 Improvising Melodic Lines 370
9.3 Multimodal Environments 372
10 Directions 377
10.1 Research Synergy 377
10.2 Research Directions 378
References 381
Index 393
Acknowledgments
I should make clear that this is not a psychology text, though the
techniques I describe could be used to implement music cognition
models or experiments. Psychological theories must address the
question of how the processes they propose are realized in humans.
My measure of success, however, is not whether these programs
match empirical data from research with human subjects, but
whether they output structures that make musical sense. I will gauge
their performance in those terms by comparing their output with the
answers expected from students studying introductory texts in music
theory. The software may produce an acceptable answer by using
processes similar to those of humans, or by using others that are
wildly different. All else being equal, I would prefer that the machine
processes resemble the human ones. Whether or not they do is a side
effect, however. Ultimately I am concerned with machine musician-
ship and not a strict emulation of human music cognition.
PITCH CLASSES   NOTE NAMES    ROOT   TYPE         INTERVALS
0 3 8           C  E♭ A♭      2      major        3 8
0 4 7           C  E  G       0      major        4 7
0 5 9           C  F  A       1      major        5 9
1 4 9           C♯ E  A       2      major        3 8
1 5 8           D♭ F  A♭      0      major        4 7
1 6 10          D♭ G♭ B♭      1      major        5 9
2 5 10          D  F  B♭      2      major        3 8
2 6 9           D  F♯ A       0      major        4 7
2 7 11          D  G  B       1      major        5 9
3 6 11          D♯ F♯ B       2      major        3 8
3 7 10          E♭ G  B♭      0      major        4 7
4 8 11          E  G♯ B       0      major        4 7
0 3 7           C  E♭ G       0      minor        3 7
0 4 9           C  E  A       2      minor        4 9
0 5 8           C  F  A♭      1      minor        5 8
1 4 8           C♯ E  G♯      0      minor        3 7
1 5 10          D♭ F  B♭      2      minor        4 9
1 6 9           C♯ F♯ A       1      minor        5 8
2 5 9           D  F  A       0      minor        3 7
2 6 11          D  F♯ B       2      minor        4 9
2 7 10          D  G  B♭      1      minor        5 8
3 6 10          E♭ G♭ B♭      0      minor        3 7
3 8 11          D♯ G♯ B       1      minor        5 8
4 7 11          E  G  B       0      minor        3 7
0 4 8           C  E  G♯      na     augmented    4 8
1 5 9           D♭ F  A       na     augmented    4 8
2 6 10          D  F♯ A♯      na     augmented    4 8
3 7 11          E♭ G  B       na     augmented    4 8
0 3 6           C  E♭ G♭      0      diminished   3 6
0 3 9           C  E♭ A       2      diminished   3 9
0 6 9           C  F♯ A       1      diminished   6 9
1 4 7           C♯ E  G       0      diminished   3 6
1 4 10          C♯ E  A♯      2      diminished   3 9
1 7 10          D♭ G  B♭      1      diminished   6 9
2 5 8           D  F  A♭      0      diminished   3 6
2 5 11          D  F  B       2      diminished   3 9
2 8 11          D  G♯ B       1      diminished   6 9
3 6 9           D♯ F♯ A       0      diminished   3 6
4 7 10          E  G  B♭      0      diminished   3 6
5 8 11          F  A♭ C♭      0      diminished   3 6
A further rule determines the best normal order for two sets that
have the same difference between the first and last integers: ‘‘If the
least difference of the first and last integers is the same for any two
permutations, select the permutation with the least difference be-
tween first and second integers. If this is the same, select the permu-
tation with the least difference between the first and third integers,
and so on, until the difference between the first and the next to last
integers has been checked. If the differences are the same each time,
select one ordering arbitrarily as the normal order’’ (Forte 1973, 4).
Forte’s ordering scheme is not entirely algorithmic due to the last
instruction of the second rule, but as such cases are very rare, it can
certainly be implemented in a functional computer program. The ad-
vantage of Forte’s classification is that it yields a great reduction in
the number of unique pitch-class sets. There are 220 three-note
chords delimited only by the requirement that no pc be repeated.
This number is reduced to 55 when three-note chords are repre-
sented as a set of two intervals rather than three pitch-classes, as I
will establish momentarily. Forte’s normal ordering yields a list of
only 12 three-note pitch-class sets. For a table-lookup algorithm
(such as the triad identifier) smaller set lists mean smaller and more
manageable tables.
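As a concrete illustration (my own sketch, not code from the book's library), the quoted procedure can be implemented by comparing the circular rotations of a set:

#include <algorithm>
#include <vector>

// Sketch of Forte's normal-order selection. Each circular rotation of the
// set is described by its intervals above the first element; the rule
// prefers the smallest first-to-last difference, breaking ties with the
// first-to-second difference, then first-to-third, and so on.
std::vector<int> NormalOrder(std::vector<int> pcs)
{
    std::sort(pcs.begin(), pcs.end());
    pcs.erase(std::unique(pcs.begin(), pcs.end()), pcs.end());
    const int n = static_cast<int>(pcs.size());

    int bestRotation = 0;
    std::vector<int> bestKey;
    for (int r = 0; r < n; ++r) {
        std::vector<int> iv(n);                       // intervals above the first pc
        for (int i = 0; i < n; ++i)
            iv[i] = (pcs[(r + i) % n] - pcs[r] + 12) % 12;
        std::vector<int> key;
        key.push_back(iv[n - 1]);                     // first-to-last difference
        for (int i = 1; i < n - 1; ++i)               // then first-to-second, etc.
            key.push_back(iv[i]);
        if (r == 0 || key < bestKey) {                // lexicographic comparison
            bestKey = key;
            bestRotation = r;
        }
    }
    std::vector<int> order(n);                        // the winning rotation itself
    for (int i = 0; i < n; ++i)
        order[i] = pcs[(bestRotation + i) % n];
    return order;
}

When every comparison ties, the first rotation encountered is kept, which corresponds to Forte's instruction to select one ordering arbitrarily.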
Forte defines pc sets to be equivalent if they are related by transpo-
sition or inversion followed by transposition. This equivalence rela-
tion means that his set names cannot be easily adapted to the
classification of tonal music. For example, the major triad and minor
triad are inversions of each other and so are represented by one name.
Inversion here means intervallic inversion: if we duplicate the inter-
vals of a C major triad going down from C instead of up we get C–A♭–F, an F minor triad. Since we are interested in tonal distinctions
we will continue with a larger classification table. Note, however,
that the mechanism introduced here for chord identification could
be adapted quite directly to a real-time Forte set recognizer.
Whether using the Forte set list or table 2.1, we can easily write
a program that looks for certain pitch class sets and outputs the cor-
responding label. If we adopt the set list from table 2.1, the same
In other words, the root of the set is ambiguous. Moreover, the type
is also ambiguous. Figure 2.3 shows the three pitch classes in a domi-
nant seventh chord plus a flat thirteenth with D as the root. Though
less likely than the other two, this interpretation is also correct in
some situations.
Because this collection of pitch classes is ambiguous with respect
to both type and root, it is clear that these attributes cannot be
uniquely determined from pitch class alone. The only way to decide
between rival interpretations is to appeal to the surrounding context.
A context-dependent identifier might consider the prevailing key, for
example, or voice-leading from and to chords immediately sur-
rounding the one to be identified. Even a consideration of the voicing
of notes within the chord, though not involving the surrounding con-
text, would require more information than we have allowed our-
selves thus far.
Though analysis of the context can be useful for correct chord iden-
tification, it also introduces complexity that may affect the speed and
consistency of computation. A context-independent identifier will
work faster and always produce the same result for the same collec-
tion of pitch classes. Moreover, assigning a label based only on pitch
classes is not insensitive to compositional norms. Table 2.2, for ex-
ample, assumes a context of Western tertian harmony that appears
regularly in some kinds of jazz and classical music, but does not hold
in other styles that do not follow those conventions. Table 2.2, then,
is not style-insensitive, but rather encodes a set of assumptions about
the contexts in which it will be used. I begin with a table-based ap-
proach for its simplicity, and because the technique is commonly
used in commercial chord identifiers, which constitute a kind of
ADDRESS   0  1  2  3  4  5  6  7  8  9  10  11
CONTENTS  1  0  0  0  1  0  0  1  0  0   0   0
addresses 0 (corresponding to the pitch class C), 4 (E), and 7 (G) are
set to one while the others are set to zero.
The third step of the triad identification algorithm counts the num-
ber of distinct pitch classes in the chord simply by adding all of the
array elements set to one. Another routine, CallChordFinder(),
is invoked with the number of distinct pitch classes as an argument.
CallChordFinder() computes the two intervals above the first ele-
ment in the pcs array that is set to one. These two intervals form an
address into a table that contains the type and root for every three-
note chord. The identification table consists of the 55 type/root pairs
shown in table 2.1. Finally the pitch classes found together with the
root and type from the table are printed on the interface (figure 2.5).
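A compressed sketch of those steps, using hypothetical names (the working classes appear in section 2.4 and on the CD-ROM), might read:

#include <cstdio>

// Sketch of the triad identification steps described above. Only the triad
// rows of table 2.1 are included here; the full application uses a
// 55-entry table covering every three-note chord.
struct ChordID { int i1, i2; const char* type; int rootMember; };

static const ChordID kTriads[] = {
    {4, 7, "major", 0},      {3, 8, "major", 2},      {5, 9, "major", 1},
    {3, 7, "minor", 0},      {4, 9, "minor", 2},      {5, 8, "minor", 1},
    {3, 6, "diminished", 0}, {3, 9, "diminished", 2}, {6, 9, "diminished", 1},
    {4, 8, "augmented", 0},  // symmetric: no unique root, so member 0 is reported
};

void IdentifyTriad(const int pcs[12])
{
    int members[12];
    int count = 0;                            // count the distinct pitch classes
    for (int pc = 0; pc < 12; ++pc)
        if (pcs[pc]) members[count++] = pc;
    if (count != 3) return;                   // this sketch handles triads only

    int i1 = members[1] - members[0];         // two intervals above the first pc
    int i2 = members[2] - members[0];

    for (const ChordID& c : kTriads)          // table lookup on the interval pair
        if (c.i1 == i1 && c.i2 == i2) {
            std::printf("root %d, type %s\n", members[c.rootMember], c.type);
            return;
        }
}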
The CD-ROM contains the triad application and all of the source
code necessary to compile it. This example program generates ran-
dom three-note chords when the Make Triad button is clicked or
the space bar is depressed. Next we will build a version that rec-
ognizes chords played on a MIDI keyboard. Before doing so, let us
consider some of the problems surrounding music representation
generally, and the MIDI standard in particular.
2.1.2 Representations
The issue of music representation is a complex and unsettled one.
Several books (Selfridge-Field 1997a; Marsden and Pople 1992; De
Poli, Piccialli, and Roads 1991; Howell, West, and Cross 1991) and
journal issues (Computer Music Journal 17[1–2]) have been devoted
to the question. Carol Krumhansl considers representation from the
standpoint of music cognition:
only to draw or erase a line on the screen. The result is a little bar
graph of incoming MIDI activity (figure 2.7). Simple though it is, such
an application can be very useful in making sure that MIDI is being
transmitted properly through the serial port and operating system to
the application level.
With the addition of MIDI input, we can now modify the triad
identifier of section 2.1.1 to analyze chords arriving from a MIDI de-
vice, such as a keyboard or sequencer, as well as those generated
randomly by the program itself. The triad program discussed in sec-
tion 2.1.1 is aimed at a very specific kind of harmony: the triadic
chord construction usually found in beginning jazz piano texts. Even
in that narrow context, the triad identifier treats a very limited range
NUMBER OF PCs    CHORDS    INTERVAL REPRESENTATION
 1                  12         1
 2                  66        11
 3                 220        55
 4                 495       165
 5                 792       330
 6                 924       462
 7                 792       462
 8                 495       330
 9                 220       165
10                  66        55
11                  12        11
12                   1         1
2.1. Similarly, there exist 165 four-note chords. Note that the greatest
variety is for six- and seven-note chords, which have 462 intervallic
possibilities. Above seven notes the possibilities decrease again,
since the increasing number of pitch classes reduces the number of
possible intervals between them. Just as there is only one chord with
one pitch class (the unison), there is only one twelve-note chord (the
chord of all pitch classes). The symmetry continues from both ends
towards the middle, peaking at 462 varieties for sextads and septads.
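These counts can be verified combinatorially: there are C(12, n) chords of n distinct pitch classes, and an n-note chord reduces to n − 1 increasing intervals above its first pitch class, giving C(11, n − 1) interval representations. The short sketch below (my own, not part of the book's code) enumerates every pitch-class subset and reproduces the table:

#include <cstdio>
#include <set>
#include <vector>

// Enumerate every subset of the twelve pitch classes, reduce each to its
// intervals above the lowest pc, and count the distinct interval
// representations for each chord size.
int main()
{
    std::vector<long> chords(13, 0);
    std::vector<std::set<std::vector<int>>> reps(13);

    for (int mask = 1; mask < (1 << 12); ++mask) {
        std::vector<int> pcs;
        for (int pc = 0; pc < 12; ++pc)
            if (mask & (1 << pc)) pcs.push_back(pc);

        std::vector<int> intervals;                   // intervals above the first pc
        for (size_t i = 1; i < pcs.size(); ++i)
            intervals.push_back(pcs[i] - pcs[0]);

        int n = static_cast<int>(pcs.size());
        ++chords[n];
        reps[n].insert(intervals);
    }
    for (int n = 1; n <= 12; ++n)                     // reproduces the table above
        std::printf("%2d pcs: %4ld chords, %3zu interval representations\n",
                    n, chords[n], reps[n].size());
    return 0;
}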
The MIDI chord application generates an identification of the root
and type of any collection of pitches, not only three-note chords. The
basic algorithm remains the same, however. Chords, reduced to pitch
classes and then to intervals above the first pc in the array, are used
as an address to look up stored roots and types in a table. The applica-
tion and all of its associated source code is found on the CD-ROM.
The first test of the program is to examine its analysis of the four-
note chords listed in Dobbins’s Jazz Piano Harmony (Dobbins 1994,
13). All five basic seventh chords are correctly identified in all inversions. These include major sevenths, dominant sevenths, minor sevenths, half-diminished sevenths, and diminished sevenths. The basic seventh chords account for 17 of the possible 165 chords (not 5 × 4 = 20, because the diminished seventh has the same set of intervals regardless of inversion).
Later examples in the Dobbins text show seven common altered seventh chords: the major seventh ♯5, the major seventh ♭5, the dominant seventh ♯5, the dominant seventh ♭5, the dominant seventh suspended fourth, the minor/major seventh, and the diminished/major seventh (1994, 54). Of these, the identifier again consistently labels
every chord in every inversion. There are two deviations from the
text identifications: the chord Dobbins labels a dominant seventh
with a suspended fourth I call a dominant eleventh, and the chord
he labels a diminished major seventh I call a dominant with a flat
ninth. Only the surrounding context or the voicing of the chord itself
could lend weight to one interpretation over the other, and these are
both sources of information that the application currently ignores.
the root, and the chord type. Chord members that were eliminated
by KickOutMember() are listed below in parentheses. The chord
shown in figure 2.9 is a B-dominant eleventh chord with a missing
fifth. The pitch class thrown out by KickOutMember() was a C natu-
ral, in this case a good choice for elimination.
The KickOutMember() algorithm is not an ideal solution, how-
ever, and often throws out members that are best retained. In figure
2.10, for example, the process kicks out the F located a major ninth above the root (with a score of 9), instead of the F♯, which is the sharp ninth (with a score of 10). One way to deal with particularly
egregious errors caused by KickOutMember() is to simply add a
new table entry for any chord that is being incorrectly reduced. The
algorithm works well enough to fashion a reliable chord identifier
in any case and is not even invoked until the chord has reached a
fairly exotic state.
We may use the OMS input port to route chords from a keyboard
to the chord identifier. One may also generate chords in some other
application and send them to the program through one of the OMS
inter-application communication (IAC) busses. Shown in figure 2.11
is a Max patch that randomly generates major triads. Every two sec-
onds (the interval specified in the metro object) the patch generates a
random number that becomes the root of a triad. The plus objects add
a major third (+4) and a perfect fifth (+7) above that root. (The modulo [%] objects and number boxes are only there for us to see which pitch
classes result from the operation). All three pitch numbers are then
routed to makenote objects and sent to noteout. When noteout is
set to transmit to IAC Bus #1, and the chord identifier is directed to read from that same bus, the chord application receives MIDI messages from the Max patch and identifies every triad it bangs out.
Of course an even better way to integrate the chord identification
process with Max would be to encode the identifier as a Max exter-
nal. Then it could be used as simply another Max object within the
environment as a whole. Section 4.4 demonstrates how such C++
applications can be recast as Max externals.
both. It first looks for a major third by checking the pcs array for a
positive entry four semitones above the root. If one is found, that
place in the pcs array is set to zero. This ensures that the same pitch
will later not be spelled as something else. It is difficult to imagine
for what other interval a major third might be mistaken, but a minor
third could be taken for a sharp ninth, for example, were it not erased
from the array.
It turns out that the spelling of stacked thirds can be largely, but
not wholly, determined by simple rules keyed to whether the root
and/or upper member are on ‘‘black keys,’’ i.e., must be spelled with
an accidental (D♭, E♭, G♭, A♭, and B♭). For a minor third, all members landing on a black key should have a flat appended to the name. This covers B♭–D♭, C–E♭, E♭–G♭, F–A♭, and G–B♭. If the third does not fall on a black key but the root does, the minor third name should also have a flat appended to it (as in D♭–F♭). Finally, if the root is G♭, the minor third needs two flats after the letter name (B♭♭). Such root-specific rules are clumsy but necessary, and these three rules produce correctly spelled minor thirds in every case.
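Written out, the three rules amount to a small amount of table-driven logic; the sketch below (my own names, assuming black-key roots are spelled in their flatted form) returns the spelled minor third for any root pitch class:

#include <string>

// Sketch of the three minor-third spelling rules described above.
std::string MinorThirdAbove(int rootPc)
{
    static const char* rootName[12] =
        { "C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B" };
    static const bool blackKey[12] =
        { false, true, false, true, false, false, true, false, true, false, true, false };
    static const char letters[] = "CDEFGAB";

    int thirdPc = (rootPc + 3) % 12;
    // letter a third above the root's letter (C->E, D->F, ..., B->D)
    int rootLetter = static_cast<int>(std::string(letters).find(rootName[rootPc][0]));
    char thirdLetter = letters[(rootLetter + 2) % 7];

    std::string spelling(1, thirdLetter);
    if (rootPc == 6)                    // rule 3: Gb takes a doubly flatted third (Bbb)
        spelling += "bb";
    else if (blackKey[thirdPc])         // rule 1: thirds landing on a black key are flat
        spelling += "b";
    else if (blackKey[rootPc])          // rule 2: a white-key third over a black-key root
        spelling += "b";                //         still takes a flat (Db-Fb, Ab-Cb)
    return spelling;
}

Running the function over the twelve roots yields E♭, F♭, F, G♭, G, A♭, B♭♭, B♭, C♭, C, D♭, and D, matching the spellings given in the text.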
Remember that this only works if we assume that roots falling on
an accidental will always be spelled in their flatted form. Considered
in isolation, it is better to spell the triad G♭–B♭♭–D♭ as F♯–A–C♯, since A is easier to read than B♭♭. Moreover, the choice between a G♭ and F♯
minor chord should be made based on the surrounding chords and,
ultimately, the key in which the sonority is embedded. David Tem-
perley proposes a more principled way of dealing with the spelling
of both chord members and their roots (Temperley 1997) as imple-
mented in the Serioso Music Analyzer (Temperley and Sleator 1999).
Before passing to a consideration of Serioso, let us consider an-
other attribute of the spelling process that is perhaps more valuable
than the spelling itself: SpellChord() comes up with the same
identification of a chord’s type as does the table entry, but by an
independent process. Though the chord identifier as it stands looks
in precompiled tables to determine the type, it could also build the
type definition algorithmically as it spells the chord. Sequentially
extracting thirds stacked above a defined root is a clear and quite
Serioso not only labels pitch classes but identifies and spells the
roots of harmonic areas as well. As indicated earlier, this too depends
on the line of fifths: ‘‘Before beginning the process of harmonic analy-
sis, the algorithm chooses a TPC label for each pitch event; in so
doing, it maps each event onto a point on the line of fifths. This is
the TPC level of the algorithm. The algorithm then proceeds to the
harmonic level, where it divides the piece into segments labeled with
roots. At this stage, too, it maps roots onto the line of fifths, at-
tempting to choose roots so that the roots of nearby segments are
close together on the line’’ (Temperley 1997, 45).
The full Serioso model is stated in a group of five preference rules.
Preference rules, as established in the Generative Theory of Tonal
Music (Lerdahl and Jackendoff 1983), indicate which of a number of
legal structures will correspond most closely to the experience of
human observers. There are two that concern line of fifths distance,
called the pitch variance and harmonic variance rules. The full set
is listed in figure 2.14 (this is the version published by Temperley
and Sleator in 1999, which is somewhat different from the one pub-
lished by Temperley in 1997).
(at the pitch locations of the figure) as evidence that the fundamental
is C, since this is the pitch that the ear would supply. (For an influen-
tial version of this theory, see Terhardt et al. [1982]).
Richard Parncutt has continued the tradition with a series of publi-
cations (1988; 1989), and he extended the model to account for the
contextual effects of voicing and tonality (1997). The virtual pitch
component of the model uses a concept of root-support intervals.
These intervals are derived from the first members of the harmonic
series when repeated pitch classes are eliminated and appear in de-
creasing order of importance: the perfect unison, perfect fifth, major
third, minor seventh, and major second. Note in figure 2.15 that the
first pitches of the series are C, G, E, B♭, and D when repetitions are omitted.
The vector of weights attached to root-support intervals is w = [10, 0, 1, 0, 3, 0, 0, 5, 0, 0, 2, 0]. To calculate a chord root, the vector w is multiplied by a vector representing the notes of a chord, where a 1 indicates the presence of a pitch class and a 0 indicates its absence (like the pcs array in the Chord application). A pitch class vector representing a C major triad, for example, would be [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]. Multiplying a C major triad by the vector w yields 18: 1×10 + 0×0 + 0×1 + 0×0 + 1×3 + 0×0 + 0×0 + 1×5 + 0×0 + 0×0 + 0×2 + 0×0.
Note that this result is obtained by lining up the root-support
weight vector so that the unison weight (10) is multiplied by the
pitch class C. Because we want to compute the effect of all possible
alignments, the next step is to rotate the weight vector such that the
unison weight is aligned with C , then D, and so on. The calculated
salience of the chord multiplied by the rotated weight vector is then
stored with the pitch class of each potential root in turn. For exam-
ple, a root position C major triad contains the pitch classes C, E, and
G. After multiplying the chord by the root-support weights for all
rotations of the set, the root saliencies calculated are:
C    C♯   D    D♯   E    F    F♯   G    A♭   A    B♭   B
18   0    3    3    10   6    2    10   3    7    1    0
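The calculation is easy to write out; this sketch (my own names) rotates the weight vector against each candidate root and reproduces the saliencies listed above:

// Root-salience sketch: the root-support weight vector is rotated so that
// its unison weight lines up with each candidate root in turn, and
// multiplied by the chord's pitch-class vector.
void RootSaliences(const int chord[12], int salience[12])
{
    static const int w[12] = { 10, 0, 1, 0, 3, 0, 0, 5, 0, 0, 2, 0 };

    for (int root = 0; root < 12; ++root) {
        salience[root] = 0;
        for (int pc = 0; pc < 12; ++pc)              // weight of pc as seen from root
            salience[root] += chord[pc] * w[(pc - root + 12) % 12];
    }
}

// For chord[] = { 1,0,0,0,1,0,0,1,0,0,0,0 } (a C major triad) this yields
// 18, 0, 3, 3, 10, 6, 2, 10, 3, 7, 1, 0, the values tabulated above.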
C    C♯   D    D♯   E    F    F♯   G    A♭   A    B♭   B
34   0    6    6    19   11   4    19   6    13   2    0
C    C♯   D    D♯   E    F    F♯   G    A♭   A    B♭   B
54   0    6    6    19   11   4    19   6    13   2    0
        0   1   2   3   4   5   6   7   8   9   10  11
major   33  0   10  1   17  15  2   24  1   11  0   5
minor   28  3   9   21  3   9   2   17  12  3   8   6
a chord is rotated around the pc-cycle until its first element corre-
sponds to the prevailing tonic, and then added to the stability pro-
file of the prevailing tonality. The resultant profile is the predicted
goodness-of-fit tone profile of the chord in context. The peak of the
resultant profile is the predicted root of the chord in context’’ (1997,
189). To complete the example of our root-position C major triad,
the final set of saliencies after addition of the stability profile for C
major (assuming, therefore, that the chord is the tonic of the key) is:
C    C♯   D    D♯   E    F    F♯   G    A♭   A    B♭   B
87   0    16   7    36   26   6    43   7    24   2    5
or without the voicing or context effects so that the user can see the
impact of these rules on the calculation.
Figure 2.18 shows the roots output by the Parncutt algorithm for
chords in the first eight bars of the Largo con Gran Espressione of
Beethoven’s E♭ Major Piano Sonata, op. 7 (the slow movement is in C major, not E♭). There are four possible context combinations:
(1) no context, (2) voicing active, (3) tonality active, and (4) voicing
and tonality active. For each chord, the root shown in figure 2.18 is
the one calculated with both the voicing and tonal contexts active.
Most of the analyses are relatively stable no matter which rules are
on: of the 15 chords shown, 12 (80%) give the same root in either
all of the combinations or in three out of four combinations. In the
cases where one of the four does not agree, it is usually because
the voicing effect without considering tonal context has elevated the
lowest pitch class to the highest score.
The remaining three chords are interesting as an illustration of the
various rules’ contribution to the analysis. The first, marked ‘‘X’’ in
figure 2.18, consists only of the two pitch classes D and F♯. The most salient pitch is found to be either D (when voicing is off) or F♯ (when
it is on). Let the strength of the salience rating be the percentage of
the highest rating relative to the sum of the two highest ratings. When
seen in tonal context but without voicing. The complete model, then,
calls chord Y an F chord, but with a strength of only 50.8%. Though
it is recognized as ambiguous, the ambiguity is between interpreta-
tions of F and C, not the dominant G. The difficulty for the algorithm
comes from the suspension, particularly since it is the tonic pitch
that is suspended. Whenever the tonic pitch of the key is present, it
will tend to have a large activation due to the contribution of the
Krumhansl and Kessler stability profile. Note that Parncutt an-
nounces the intention to add a consideration of voice-leading to the
context-sensitivity of the model, which presumably would treat the
suspension of the C, but this extension was not implemented in
the version described (1997).
The last ambiguous chord, marked ‘‘Z’’ in figure 2.18, illustrates
the same difficulty. Traditional music theory would regard it as a
double suspension above a D minor chord in first inversion. The Par-
ncutt algorithm analyzes it as an A, F, or C chord according to the
rules that are active. The full model (with voicing and context) calls
it an F chord with a strength of 55%. None of the cases finds chord
Z to have a root of D.
The Parncutt root salience algorithm is an important contribution
to machine recognition of harmony. For our purposes, it is also par-
ticularly interesting because it functions well in real time. How may
we incorporate it in an analysis system for use in live performance?
There are two areas of extension I wish to address here: (1) determi-
nation of chord type, and (2) interaction with key induction. Use of
the Parncutt algorithm in performance has a third limitation as
well—it assumes that all members of a chord will be presented si-
multaneously (which accounts for the mislabeling of chord X in fig-
ure 2.18). Any kind of arpeggiation must be reconciled elsewhere
before a chord can be presented to the algorithm for analysis. Other
harmonic analysis systems (Winograd 1968; Maxwell 1992) have the
same requirement: ‘‘The most important limitation is that they are
unable to handle cases where the notes of a chord are not stated fully
and simultaneously, such as arpeggiations, incomplete chords, and
unaccompanied melodies’’ (Temperley 1999, 10). Because that kind
Knowing the root and type of a chord is useful information for jazz
analysis and improvisation, but becomes much more powerful when
processed in the context of functional harmony. To move from raw
identification to harmonic analysis, we must be able to relate a
chord’s root and type to a prevailing tonic. Identifying the tonic of
a set of chords is the task of key induction.
[Table: running key-induction scores across the 24 major and minor keys (column headings C, c, C♯, c♯, D, d, E♭, e♭, E, e, F, f, F♯, f♯, G, g, A♭, a♭, A, a, B♭, b♭, B, b) for a sequence of inputs; each row gives the input's label followed by the nonzero key scores at that step.]
SCALES      POINTS
C Major     8
C minor     8
C♯ Major    8
C♯ minor    8
D Major     0
D minor     8
E♭ Major    8
E♭ minor    8
E Major     0
E minor     8
F Major     8
F minor     8
F♯ Major    0
F♯ minor    0
G Major     8
G minor     8
A♭ Major    8
A♭ minor    0
A Major     0
A minor     8
B♭ Major    8
B♭ minor    8
B Major     0
B minor     0
only minor scales that do not get points for a pitch class of C are B minor (relative to which, C is the minor second), A♭ minor (relative to which, C is the major third), and F♯ minor (relative to which, C is the tritone). Every pc, then, contributes to 16 different scale theories:
seven major scales and nine minor.
Each of the 24 keys has a melodic (scalar) score associated with it
that is updated by the process just described, and there are additional
calculations due to primacy effects that we will review momentarily.
Each key also maintains a harmonic score derived from an analysis
of the membership of incoming pcs in certain functional chords
within the key. The functions tracked by the algorithm are the tonic,
subdominant, and dominant seventh chords of each key. As in the
case of the scale analysis, a weight equal to the duration of the note
is added to the score of each of the functional chords to which a
pitch class belongs. When a pc belongs to two chord functions, the
activation is divided between the two scores. Table 2.8 shows the
contribution made by the pitch class 0 (C) to three possible key inter-
pretations. Keep in mind that these are not the only key theories
affected.
The total contribution of the incoming pitch class is the same in
each case: 8 points, or the duration of an eighth note, equal to 8 64ths.
Notice the difference in distribution, however, among the three
cases. Because pitch class C is a member of both the tonic and sub-
dominant triads in C major, the activation is split between those two
scores. In F major, the pitch class C is a member of the tonic triad
Tonic 4 4 8
Subdominant 4 0 0
Dominant7 0 4 0
Total 8 8 8
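A sketch of that bookkeeping for a single key (hypothetical names, following the description above) adds the note's duration to the key's scalar score when the pitch class belongs to the scale, and divides the same duration among whichever chord functions contain it:

// Per-key score update (hypothetical names). Durations are expressed in
// 64th notes, as in the examples in the text.
struct KeyTheory {
    bool scale[12];          // pcs belonging to the key's scale collection
    bool tonic[12];          // pcs of the tonic triad
    bool subdominant[12];    // pcs of the subdominant triad
    bool dominant7[12];      // pcs of the dominant seventh chord
    double scalarScore;
    double tonicScore, subdominantScore, dominant7Score;
};

void AddNote(KeyTheory& key, int pc, int duration)
{
    if (key.scale[pc])                             // melodic (scalar) score
        key.scalarScore += duration;

    int functions = 0;                             // how many chord functions hold this pc
    if (key.tonic[pc]) ++functions;
    if (key.subdominant[pc]) ++functions;
    if (key.dominant7[pc]) ++functions;
    if (functions == 0) return;

    double share = static_cast<double>(duration) / functions;   // split the activation
    if (key.tonic[pc]) key.tonicScore += share;
    if (key.subdominant[pc]) key.subdominantScore += share;
    if (key.dominant7[pc]) key.dominant7Score += share;
}

// In C major a pitch class of C with duration 8 adds 4 to the tonic and 4
// to the subdominant score; in F major it adds 4 to the tonic and 4 to the
// dominant-seventh score, matching the columns of table 2.8.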
the work. It is interesting to see what the PPM makes of it, however,
as a guide to devising strategies for combining it with concurrent
analyses.
The model begins with a scale theory of F (major or minor) because
of the initial weight accorded to the first pitch class heard. The
chordal profile agrees once the second member of an F major triad
arrives with the A in m. 3 (simultaneously eliminating F minor as a
scalar candidate). F major is the best answer for the soprano alone—
it is the additional context in the piano accompaniment that tells us
we are in D minor. What is interesting is that the PPM, following
the soprano alone, becomes confused between F major and D minor
as soon as the primacy span has passed with step 6 (the downbeat
of m. 4).
The reason for this confusion is that beyond the primacy span, the
tonic, subdominant, and dominant seventh scores for each key are
added together to determine the chordal profile, while within the
primacy span only contributions from the tonic triad are considered.
The repeated A pcs in the first three bars of the soprano line contrib-
ute to the chordal profile of F major as part of the tonic triad, but to
the profile of D minor as part of both the tonic and dominant. Once
the dominant score is considered, and particularly when the large
weight stemming from the long A at the beginning of bar 4 is added
in, the D minor chordal profile runs ahead. The PPM continues to
find the key ambiguous until step 12, the arrival of the soprano C
on the downbeat of m. 7. Since C contributes to both the tonic and
dominant chord scores of F major, but not to the harmonic weight
of D minor, F major is established as the unambiguous tonal center
once again.
The B natural on the downbeat of m. 8 (step 15) throws us quite
suddenly into a confirmed C minor. As Palisca notes, the score dem-
onstrates a D minor/C major ambiguity throughout this first section.
We only know that it is C major from the accompaniment, so the
PPM’s estimate of C minor is quite well taken. The only reason the
PPM does not consider C major, in fact, is the B♭ of the soprano in m. 3. As the B♭ falls within the primacy span and outside the C major
scale, all of the scalar points C major had garnered to that point are
erased. Through these bars we can see the value of the parallel tracks
in the PPM: D minor has a melodic weight equal to that of C minor,
but is not output as the most salient tonality because it is not con-
firmed by the harmonic analysis.
The Parallel Processing Model is very well suited to real-time ap-
plications because it is efficient and causal. A recent extension to
the algorithm (not implemented in my version) uses dominant-tonic
leaps at the outset of a melody to further focus the identification of
key (Vos 1999). The scope of the PPM’s application, however, is lim-
ited by two characteristics: first, it relies on quantized durations to
calculate the salience scores; and second, it only works with mono-
phonic music.
The PPM adds weights to the scalar and harmonic scores that are
calculated as integer multiples of 64th note durations. For example,
as we have seen, an eighth note will generate a score of 8, since the
duration of one eighth note is equal to the duration of eight 64th
notes. This is troublesome for real-time processing because, first, we
must wait until the release of a note to calculate its effect. That means
that any interaction based on key input cannot take place at the at-
tack, but only after a delay equal to the duration of the input. Since
key is a relatively high-level percept, processing based on it can often
generate a response quickly enough to be musically convincing even
after such a delay. We are forced to accept it in any case if we wish
to use the algorithm.
We then must face the other complication, which requires that we
also run a real-time quantizer, without which we will have no quan-
tized durations available. The function of the algorithm, however,
does not depend on quantization as such because it is quite sensitive
to the relative values of the durations, no matter what the notated
values might be. In other words, if we were to multiply or divide all
durations in a melody by two, the algorithm would produce precisely
the same results. We can use this observation to develop a method
if (((localTime - lastAttack) > 100L) || (chordSize >= e->MaxNotes()))
determines when a new Event should begin. The first part of the
conditional says that when 100 milliseconds have elapsed since the
onset of the last Event, any new input should form the onset of a
new Event. The second part of the conditional ensures that when
an Event has been filled with the maximum number of notes, any
additional inputs cause the allocation of a new Event.
Figure 2.29 Code listing for Hear()
Even when a new Event has been created, however, the line
scheduler->ScheduleTask(Now+50L, 0, 2, 0, NotifyInput);
changes in the output (i.e., the output can switch from off to on in-
stead of simply changing in direct proportion to the input)’’ (Dolson
1991, 4).
Initially the connection weights are set to random values. One of
the great attractions of neural networks is that they are able to learn
weight sets that will reliably associate input patterns with output
patterns. To accomplish this, a training set of input examples with
correct answers (configurations of output activations) attached is pre-
sented to the network. Over the course of a training session, the con-
nection weights are gradually adjusted by the neural network itself
until they converge on a set that correctly relates outputs with the
corresponding inputs of the training set. If the training set captures
the regularities of a wider class of inputs, the trained network will
then be able to correctly classify inputs not found in the training set
as well. Such a process is an example of supervised learning, in
which a teacher (the training set) is used to guide the network in
acquiring the necessary knowledge (connection weights).
The adjustment of the weights is accomplished through a learning
rule. An example is the delta rule: first, an error is computed by sub-
tracting the output of a node from the desired output encoded in the
training set. The delta rule uses the error to calculate a new link
weight as shown in figure 3.2.
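In code the rule is one line per link; the sketch below (my own, with a hypothetical learning-rate parameter) adjusts the incoming weights of a single output node:

// Delta rule sketch: each link weight is nudged in proportion to the
// node's error and to the input that arrived over that link.
void DeltaRuleUpdate(double weights[], const double inputs[], int numInputs,
                     double desired, double output, double learningRate)
{
    double error = desired - output;          // error = target minus actual output
    for (int i = 0; i < numInputs; ++i)
        weights[i] += learningRate * error * inputs[i];
}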
One of the simplest neural network types is the ADALINE, devel-
oped in 1963 by Bernard Widrow (1963). An ADALINE has some
number of input nodes and one output node. The output can be ei-
ther +1 or -1, which means that an ADALINE is a simple classifier that can sort input sets into one of two classes. Since the possible outputs are +1 and -1, and the desired outputs in the training set
will be restricted to these two values as well, the error can be
gestures in real time and use its recognition to control the algorithmic
generation of MIDI output (Sawada, Onoe, and Hashimoto 1997). Our
strategy here, then, will be to develop performance networks that
have already undergone a training phase before being used onstage.
us present our newly trained network with the input pattern shown
in figure 3.7.
In this example, we present the network with the seven pitches of
the C-major scale, rather than just the tonic, subdominant, and domi-
nant. What key does the network think this represents? In fact the
network outputs C as the most likely tonic of this input set (as shown
in figure 3.8), although with a much lower score than is produced by
running it with the original C major training example (0.37 instead
of 1.0). It is interesting, however, that no other candidate achieves a
significant score; even though the C major output score is low, the net-
work has still clearly identified C as the tonic of the input set.
Using an input that differs from the training set, we have obtained
an identification that is nonetheless consistent with it. Other studies
with neural networks have explored their ability to complete partial
patterns after training. If a network is trained to recognize a C-major
scale, for example, presentation of an input with some members
missing will still cause the network to recognize C major.
This difference would have an effect while training the network. For
example, using the values just given, if [01] is produced as output
instead of [11], this would be a lesser mistake (since they are more
similar) than producing [00] for [11]. As it learned, the network’s
knowledge of musical structure would begin to reflect this (prob-
ably) erroneous difference. Thus this distributed coding imposes a
similarity-measure on the network’s outputs that we probably do not
want—there is no a priori reason to designate [01] and [11] as more
similar than [00] and [11]. The localist pitch representation, which
does not impose this differential similarity on the outputs, works
better. (1991, 179)
Let us allocate a neural network with 12 input, hidden, and output
nodes, as before. The nodes encode the twelve pitch classes in a lo-
calist representation. The network is now sequential because de-
cayed inputs and outputs of the previous step are added to the input
before each pass through the network, as illustrated in figure 3.9.
The sequential neural network object is virtually identical to the
one described in section 3.1. The main difference is in the operation
of the ForwardPass() method (figure 3.10), which makes use of
the two data members inputDecay and outputDecay. Because the
decay constants are variables, we may easily experiment with differ-
ent settings to watch their effect on learning. We may even change
the architecture of the network by setting one or both of them to zero:
if both are zero, this network is identical to the one in section 3.1,
since no activation is fed back. Similarly, either the input or output
feedback paths may be individually eliminated by setting the corre-
sponding decay parameter to zero.
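The feedback arrangement itself is compact; the following sketch (my own simplification, not the ForwardPass() listing in figure 3.10) mixes decayed copies of the previous input and output into the current input before handing it to an ordinary feedforward pass:

#include <functional>
#include <vector>

// Sequential feedback sketch: decayed copies of the previous input and
// output are added to the current input, then the result is run through
// the normal feedforward pass supplied as 'net'.
struct SequentialWrapper {
    double inputDecay, outputDecay;
    std::vector<double> prevInput, prevOutput;
    std::function<std::vector<double>(const std::vector<double>&)> net;

    std::vector<double> ForwardPass(std::vector<double> input)
    {
        if (prevInput.size() == input.size())
            for (size_t i = 0; i < input.size(); ++i)
                input[i] += inputDecay * prevInput[i]     // decayed previous input
                          + outputDecay * prevOutput[i];  // decayed previous output

        std::vector<double> output = net(input);          // ordinary feedforward pass
        prevInput = input;                                // remembered for the next step
        prevOutput = output;
        return output;
    }
};

Setting both decay factors to zero removes the feedback entirely and reduces the wrapper to the plain network of section 3.1, as noted above.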
Once the network has been established, the next step is to develop
a training set of examples from which the network can learn. For the
[Table: activations across the twelve pitch classes (column headings C, C♯, D, E♭, E, F, F♯, G, A♭, A, B♭, B) for the input sequence C, G, G, C, C, D, F, F, C, G, C, F, C, G, D, C, C; at each step one activation is at or near 1.00, with at most a small secondary activation elsewhere.]
For the moment we will consider the middle levels of the rhythmic
hierarchy: the formation of a regular pulse, and the differentiation
of those pulses into the strong and weak beats of a meter.
3.2.1 Quantization
Standard notation of Western music assumes a temporal grid of bars,
beats, and subdivisions of beats. This notation reflects an essentially
automatic cognitive process in human listeners whereby a pulse is
extracted from a regular sequence of musical events. The phenome-
non of pulse is manifested in the tapping of a foot in time to the
music, for example, or in the beat of a conductor’s baton. Rhythm
perception builds on the foundation of pulse to form hierarchies in
which some pulses (beats) are more important than others. Meter is
the notational device used to indicate such hierarchies: a meter of
4/4, for example, represents the occurrence of a strong pulse once
every four beats.
Western rhythmic notation, then, is a hierarchical system that mul-
tiplies and divides simple pulse durations by small integer values.
Multiplication of beats produces measures; division of beats pro-
duces subdivisions. Within a measure some beats are stronger than
others, and within a beat some subdivisions are stronger than others.
This economy of notation is directly related to the cognition of musi-
cal time—we experience music with even minimal temporal regular-
ities as conforming to a metrical grid.
The correspondence of notation and perception does not extend
to the duration of events as they occur in performance, however:
In performed music there are large deviations from the time intervals
as they appear in the score (Clarke 1987). Quantization is the process
by which the time intervals in the score are recovered from the dura-
tions in a performed temporal sequence; to put it in another way, it
is the process by which performed time intervals are factorized into
abstract integer durations representing the notes in the score and
local tempo factors. These tempo factors are aggregates of intended
timing deviations like rubato and unintended timing deviations like
noise of the motor system. (Desain 1993, 240)
The ‘‘noise of the motor system’’ refers to the fact that humans
are physically incapable of producing movements that are exactly
equally spaced in time. This deficiency is of no great consequence
while we listen to music because the variations are small, and our
perceptual system effectively filters out the discrepancies, anyway.
It does mean, however, that a computer looking at a series of human
finger motions (e.g., from a performance on a piano keyboard) will
not see a sequence of numbers that can be directly measured as a
series of simple integer multiples of an underlying pulse.
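To see the problem concretely, here is a naive rounding of each performed inter-onset interval to the nearest multiple of an assumed pulse (a sketch of the grid approach discussed below, not code from the book):

#include <cmath>

// Naive quantization sketch: snap a performed inter-onset interval to the
// nearest multiple of a known subdivision. This only works when a grid
// already exists and deviations are small.
int QuantizeIOI(double performedMs, double subdivisionMs)
{
    return static_cast<int>(std::lround(performedMs / subdivisionMs));
}

// e.g., with a 125 ms sixteenth-note grid, performed IOIs of 132, 239, and
// 506 ms come back as 1, 2, and 4 subdivisions.

Rounding of this kind holds up only while deviations stay well under half a subdivision and the pulse itself is already known, which is precisely what cannot be assumed in real time.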
The inaccuracies caused by muscle jitter are a small but significant
obstacle to quantizing performance information. They become much
more formidable, however, when added to the purposeful deviations
introduced by musicians in the expressive performance of a work of
music. As we shall see in chapter 6, players use a number of temporal
manipulations to impart cues about the structure and content of a
composition in performance. We as listeners are able to distinguish
the structural and expressive factors activating the resulting percept
from the underlying meter. The problem of quantization is to perform
the algorithmic analog in a machine musician: separate structural
rhythms from expressive variations. Ideally we would like to pre-
serve and use both, but for the moment we will concern ourselves
with deriving a series of integrally related time points from undiffer-
entiated performance data.
Commercial sequencers perform quantization by rounding tempo-
ral measurements to the nearest quantized grid point. The grid used
is computed from a global tempo setting and a specification of the
smallest permissible duration. Both of these values are entered by
the user. In many systems, the user must also indicate whether
‘‘tuplet’’ (i.e., triplet, quintuplet, etc.) divisions of the beat may be
mixed with simple power-of-two divisions. Anyone who has used a
sequencer knows that this method yields results that require a lot of
editing for any but the most simple of rhythmic styles. In any event,
such a technique is useless in a real-time situation because there is
no pre-existing grid. The only way to get one would be to require the
musicians to play along with a metronome, something that would
sured. Then the bang sent to the left inlet of the timer starts another
measurement that will be terminated by a subsequent note on.
Winkler’s inter-onset interval patch performs in Max the same cal-
culation that is part of the Listener::Hear() procedure listed in
figure 2.16. Once IOIs are calculated, they can be used as input to
a number of temporal analyses including beat tracking and density
estimation. See chapter 6 of Composing Interactive Music (Winkler
1998) for a beat tracker written entirely in Max.
been awarded for the input on the left. For example, after the first
input (409) the leading beat period theories are 409 (the IOI itself ),
818 (double the IOI length), 204 (half the IOI length), 1227 (triple the
IOI), and 613 (1.5 times the IOI). The final two columns, labeled R
and E, show the real time in milliseconds at which the event arrived
and the predicted time of a subsequent event according to the leading
beat theory. Therefore R + period[0] = E.
We can see how well the algorithm is doing by comparing the E
listings with subsequent R times. For example, after the first IOI the
beat theory is 409 milliseconds. 409 added to the arrival time of the
event (819) yields a predicted arrival time of 1228 for the next pulse
of this period. In fact, the next event arrives at time 1231 (as shown
in the R column of the second row), 3 milliseconds late according
to the 409 milliseconds theory. Therefore the activation for 409 is added to the weight accorded an incoming event and collected into the theory for 410 milliseconds, somewhat slower than the prior theory, to match the delay in the incoming event relative to the expectation.
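The trace suggests bookkeeping along these lines; the sketch below is a schematic illustration of that idea (hypothetical names and weights), not the code of the Attractors application itself:

#include <cmath>
#include <map>

// Each beat-period theory predicts the next arrival time (E = R + period).
// An event near that prediction moves activation toward the period the
// event actually implies; theories it misses are simply carried forward
// (a fuller version would also advance their stale expectations).
struct BeatTheory { double activation; double expected; };

void OnEvent(std::map<int, BeatTheory>& theories, long now, double toleranceMs)
{
    std::map<int, BeatTheory> updated;
    for (const auto& entry : theories) {
        int period = entry.first;
        const BeatTheory& t = entry.second;
        if (std::fabs(now - t.expected) <= toleranceMs) {
            // the event confirms this theory; credit the period it implies
            // (e.g., a 409 ms theory whose event arrives 3 ms late feeds 410)
            int implied = period + static_cast<int>(now - t.expected);
            updated[implied].activation += t.activation + 1.0;
            updated[implied].expected = now + implied;   // next prediction: R + period
        } else {
            updated[period] = t;                         // carry the theory forward
        }
    }
    theories = updated;
}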
The input that generated the trace in figure 3.20 was a very simple
sequence of four quarter notes followed by two eighth notes and two
more quarter notes. The leading theory remained stable throughout,
with a period moving between 407 and 410 milliseconds. The great-
est difference between an expected beat and an actual arrival time
was 13 milliseconds. This example indicates how the algorithm does
with simple rhythms. The user is encouraged to experiment with the
Attractors application on the CD-ROM to experience its behavior
with more complex inputs. All of the source code for the application
is found in the same folder for any who wish to modify the process.
tor to bring it within one period of the current time. Then the variable
phi is computed as an indication of the phase of the oscillator. The
phase is important because it delineates the temporal receptive field
of the unit—that part of its period within which adaptation is max-
imized. ‘‘Each output pulse instantiates a temporal receptive field
for the oscillatory unit—a window of time during which the unit
‘expects’ to see a stimulus pulse. The unit responds to stimulus
pulses that occur within this field by adjusting its phase and period,
and ignores stimulus pulses that occur outside this field’’ (Large
1999, 81). The part of Large() that is executed when a pulse arrives
computes the adaptation for the unit. The adaptation strength is
modified by the value of phi, keeping it high within the receptive
field and attenuating it everywhere else.
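A schematic version of such a unit might look like the following sketch (my own stand-in with a Gaussian window, not Large and Kolen's published adaptation functions):

#include <cmath>

// Adaptive oscillator sketch: the unit's phase at the moment a pulse
// arrives determines how strongly that pulse may adjust the period and
// phase, so only pulses inside the temporal receptive field matter.
struct AdaptiveOscillator { double period; double nextBeat; };

void OnPulse(AdaptiveOscillator& osc, double now, double eta)   // eta: adaptation rate
{
    while (osc.nextBeat + osc.period / 2 < now)             // bring the expected beat
        osc.nextBeat += osc.period;                         //   within half a period of now

    double phi = (now - osc.nextBeat) / osc.period;         // phase of the pulse
    double field = std::exp(-(phi * phi) / 0.02);           // temporal receptive field
                                                            //   (narrow window around phi = 0)
    osc.period   += eta * field * (now - osc.nextBeat);     // period adapts toward the pulse
    osc.nextBeat += field * (now - osc.nextBeat);           // phase adapts toward the pulse
}

Pulses that fall outside the window leave the period and phase essentially untouched, which is the behavior the quoted passage describes.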
Petri Toiviainen’s Interactive MIDI Accompanist (IMA) uses a
modification of the Large and Kolen oscillators to perform beat-
tracking in real time (Toiviainen 1998). Toiviainen’s application is
an accompanist in that it can recognize a number of jazz standards
and play the parts of a rhythm section, following the tempo of a solo-
ist. The IMA introduces some significant departures from the Large
and Kolen adaptation functions as a necessary consequence of the
nature of the application (figure 3.22).
The first change has to do with the discontinuities introduced by
the Large and Kolen adaptations. ‘‘To be able to synchronize the ac-
companiment with a live musical performance, the system must pro-
duce a continuous, monotonically increasing mapping from absolute
time (expressed in seconds) to relative time (expressed in the number
of beats elapsed since the beginning of the performance). In oscillator
terminology, it must produce a continuous mapping from time to
phase. . . . This is not the case with Large and Kolen’s oscillator, as
it adapts its phase abruptly and discontinuously at the time of each
note onset’’ (Toiviainen 1998, 65).
The other problem from the point of view of Toiviainen’s IMA is
that the Large and Kolen adaptations consider every input impulse
to be equally important. This works well enough for regularly spaced
performances, but can go dramatically awry when ornaments such as
a trill are added. Toiviainen’s response was to design new adaptation
functions for the Large and Kolen oscillator that take into account,
among other things, the duration of events associated with an input
impulse. ‘‘The main idea behind this approach is that all adaptation
takes place gradually and a posteriori, instead of occurring at the
time of the note onset. Consequently, notes of short duration do not
give rise to any significant adaptation, even if they occur within the
temporal receptive field. As a result, the field can be set rather wide
even if the rhythm of the performance is complex, thus making it
possible to follow fast tempo changes or intense rubato’’ (Toiviainen
1998, 66). That the oscillator does not adapt at the onset of an event, but rather some time after it has passed, might seem to indicate that the response of the IMA adaptation would be slower. In practice
the oscillator converges to a new beat quickly enough to keep up
with a live performance and avoids sudden jumps of tempo with
every update of the period.
Much of the strength of the IMA adaptation arises from the fact
that it is the product of two components, called long- and short-term
adaptation. ‘‘Both types of adaptation are necessary for the system
to follow tempo changes and other timing deviations. A single timing
deviation does not give rise to any permanently significant change
in phase velocity. If, on the other hand, the oscillator finds it is, say,
behind the beat at several successive note onsets, the cumulation of
long-term adaptation gives rise to a permanent change in phase ve-
locity’’ (Toiviainen 1998, 68).
The combination of long- and short-term adaptation means that
the IMA oscillator retains a relatively stable firing frequency even
through trills and other highly ametrical inputs. The program expects
an initial estimate of the beat period to start off the oscillator: such
an estimate can be obtained from the IOI between the first two events,
for example, or from a ‘‘count-off’’ given by the performer on a foot
pedal.
The Oscillator application on the CD-ROM implements a single
adaptive oscillator that changes period and phase with the arrival of
incoming MIDI events. The user can select either the Large and Kolen
adaptation function or that of Toiviainen using radio buttons on the
interface (figure 3.23). The interface also indicates both the length in
milliseconds of the most recent inter-onset interval (IOI) and the pe-
riod of the oscillator. In figure 3.23 the last performed event corre-
sponded quite closely to the period predicted by the oscillator—the
two values are within 11 milliseconds of one another.
difficult (as the Marsden article establishes) and will not be at-
tempted here. Temperley adopts the simple heuristic of taking any
note onset within nine semitones of a prior one to establish the regis-
tral IOI for that prior event. This breaks down, obviously, when two
voices are operating within that span. ‘‘Taking all this into account,
we propose a measure of an event’s length that is used for the pur-
pose of the length rule: the length of a note is the maximum of its
duration and its registral IOI’’ (Temperley and Sleator 1999, 13).
Serioso’s metric analysis is the structure that best satisfies all three
rules after the evaluation of a composition as a whole. Though a com-
plete analysis is required to arrive at a final parsing, Temperley’s
approach moves through the score from left to right and keeps track
of the best solution at each step along the way (resembling Jacken-
doff’s beam-search-like proposal [ Jackendoff 1992]). At any given
point the program is able to identify a maximal metric interpretation
of the work to that moment, though the interpretation may change
in light of further evidence later in the work. The process can there-
fore be used in real time as it only requires the information available
as the piece is performed. It also accounts for ‘‘garden path’’ phenom-
ena in which one way of hearing a passage is modified by the audi-
tion of subsequent events.
Serioso generates not only a tactus for the composition under anal-
ysis (beat tracking), but two metrical levels above the tactus and two
below it. The upper and lower levels are found by evaluating tactus
points much as events were evaluated to find the tactus itself. The
tactus is called level 2, those above it are levels 3 and 4, and those
below it levels 0 and 1. ‘‘Level 3 is generated in exactly the same
way as level 2, with the added stipulation that every beat at level 3
must also be a beat at level 2, and exactly one or two level-2 beats
must elapse between each pair of level-3 beats’’ (Temperley and Slea-
tor 1999, 15).
There are two notable aspects of their method for generating addi-
tional metric levels: first, the method is essentially recursive, using
the same rules for the tactus, its meter, and its subdivision. Second,
their method searches for multiplications and divisions by two or
input list can be either integer or floating point values—if they are
floating, the routine truncates them to integer values before sending
them on to bang.
A user may wish to send note numbers sequentially, instead of
using a list, and have these incrementally build up a pitch-class set.
To accommodate this usage, we need a method to respond to integer
inputs and another that will set the pcs array to zero whenever a
reset message is received. The integer method is even simpler than
the list method, as all we need to do is call the bang method with a
list of one item (figure 3.29).
Finally a simple method that resets all of the members of the
pcs array to zero can be attached to a reset string sent to the inlet
of the object. The pcset clear method is already called within
pcset list to reset the array to zero after each incoming list has
been processed and can simply be called by reset as well.
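For reference, here is what such an external can look like when written against the current Max SDK entry points (a sketch under that assumption; the object layout and method names are stand-ins for the pcset example described above, not the book's listing, which uses the SDK of its era):

#include "ext.h"
#include "ext_obex.h"

typedef struct _pcset {
    t_object ob;
    long pcs[12];            // one slot per pitch class
    void *out;               // outlet for the assembled set
} t_pcset;

static t_class *pcset_class;

void pcset_bang(t_pcset *x)                  // send the current pitch-class set out
{
    t_atom out[12];
    long n = 0;
    for (long pc = 0; pc < 12; pc++)
        if (x->pcs[pc]) atom_setlong(out + n++, pc);
    if (n) outlet_list(x->out, NULL, (short)n, out);
}

void pcset_clear(t_pcset *x)                 // "reset" (and internal reuse) empties the set
{
    for (long pc = 0; pc < 12; pc++) x->pcs[pc] = 0;
}

void pcset_list(t_pcset *x, t_symbol *s, long argc, t_atom *argv)
{
    for (long i = 0; i < argc; i++) {        // atom_getlong() coerces floats to integers
        long pc = ((long)atom_getlong(argv + i) % 12 + 12) % 12;
        x->pcs[pc] = 1;
    }
    pcset_bang(x);                           // report the set ...
    pcset_clear(x);                          // ... then start fresh for the next list
}

void pcset_int(t_pcset *x, long n)           // single ints accumulate into the set
{
    x->pcs[(n % 12 + 12) % 12] = 1;
    pcset_bang(x);                           // report the growing set after each note
}

void *pcset_new(void)
{
    t_pcset *x = (t_pcset *)object_alloc(pcset_class);
    pcset_clear(x);
    x->out = outlet_new((t_object *)x, NULL);
    return x;
}

void ext_main(void *r)
{
    t_class *c = class_new("pcset", (method)pcset_new, (method)NULL,
                           sizeof(t_pcset), 0L, 0);
    class_addmethod(c, (method)pcset_bang,  "bang",  0);
    class_addmethod(c, (method)pcset_list,  "list",  A_GIMME, 0);
    class_addmethod(c, (method)pcset_int,   "int",   A_LONG,  0);
    class_addmethod(c, (method)pcset_clear, "reset", 0);
    class_register(CLASS_BOX, c);
    pcset_class = c;
}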
If one is accustomed to the decomposition of processes into inter-
locking methods, writing Max externals is a matter of learning some
terminology and function calls unique to Max. As this simple exam-
ple has demonstrated, C++ classes are particularly well suited to
such implementation. The data members of the class need to be trans-
ferred to a Max object struct and member functions translated to Max
methods. Though this simple demonstration has but limited utility,
the technique of its construction can be applied to virtually all of the
C++ classes written in this text.
4 Segments and Patterns
4.1 Segmentation
They pointed out that rhythm in the tonal/metric music of the West-
ern tradition consists of two independent elements: grouping—
which is the manner in which music is segmented at a whole variety
of levels, from groups of a few notes up to the large-scale form of
the work—and meter—which is the regular alternation of strong and
weak elements in the music. Two important points were made in this
definition: first, although the two elements are theoretically indepen-
dent of one another, the most stable arrangement involves a congru-
ence between them such that strong points in the meter coincide with
group boundaries. Second, the two domains deal respectively with
time spans (grouping) and time points (meter): grouping structure
is concerned with phenomena that extend over specified durations,
whereas meter is concerned with theoretically durationless moments
in time. (Clarke 1999, 478)
We can see the operation of the minimum size rule inside the
AssertBoundary() method shown in figure 4.2, which is called
whenever one of the other rules has detected a segment boundary.
The argument number contains the length in events of the newly
Figure 4.1 Segment() class
asserted from these rules until two notes have passed beyond the
location of the boundary. For the analysis and processing of music
during performance, we would like to be able to segment and treat
material more quickly. The solution developed by Stammen and Pen-
nycook (1993) is to notice immediately the distinctive transitions
that make up the first part of the preference rules and posit a provi-
sional segment boundary when they are found. Once all of the evi-
dence has arrived, two notes later, the provisional boundary may be
confirmed or eliminated. Time-critical processing is executed on the
basis of provisional boundaries, and analyses that occur over a longer
span can wait for firmer segmentation.
Figure 4.3 demonstrates an incomplete occurrence of GPR2b
(Attack-Point). We may consider that n1 and n2 correspond to the
events so marked in the figure. To fully decide GPR2b we need
three durations: the length of time between the attacks of n1 and n2,
the length of time between the attacks of n2 and n3, and the length
of time between the attacks of n3 and n4. When (n3-n2) > (n2-n1) and
(n3-n2) > (n4-n3), GPR2b is true. Once the half note marked n2 in figure
4.4 has sounded for longer than a quarter note with no subsequent
attack, we already know that the first part of the conjunction is true
because (n3-n2) will necessarily be longer than (n2-n1). We may then
assert a provisional boundary at the attack of the next event n3. When
the attack of n4 arrives, we will know whether the second part of the
conjunction is also true ((n3-n2) > (n4-n3)).
Figure 4.4 shows these two possibilities: 4.4 (left) is the case where
the second half of the conjunction is true, leading to a segment
boundary between n2 and n3 (indicated in the figure by a bar line).
Figure 4.4 (right) shows an example where (n3-n2) = (n4-n3), mean-
ing that the second half of the conjunction is false and GPR2b
does not apply. The dotted line before n3 represents the provi-
sional boundary generated when the first part of the conjunction was
true.
Positing provisional occurrences of a grouping rule makes it possi-
ble to recognize that an event is the beginning of a new segment at
the moment it arrives. It even becomes possible to recognize that an
event is the end of an ongoing segment while that event is still sound-
ing. The price of such speed is, of course, that some events will be
treated as segment boundaries when in fact they are not. Whether or
not the trade-off is worthwhile depends entirely on what will be done
with the provisional segments.
What happens if we wait until all the evidence is in? We still
would not be in a terrible position with respect to real-time pro-
cessing because we will know the outcome of the rule with the attack
of the event following the first event in the segment. For example,
we know in figure 4.4 whether GPR2b is true or false with the onset
of n4. In fact, we know the negation of the rule even without n4: as
soon as the duration (n3-n2) has passed without finding the onset of
n4, we know GPR2b to be false.
Figure 4.5 lists the code for the full attack-point function. Notice
the calculation of the search duration for long notes: the variable
longOnToOn is set to the most recent duration plus 10%. Because
we are dealing with actual performed durations and not quantized
durations in a score, the 10% margin helps us avoid triggering the
rule because of expressive variations. When we know we are looking
at the duration n2-n3, the routine LongNoteFound() is scheduled to
execute once longOnToOn has elapsed.
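A compact sketch of the two tests just described might look like the
following. Attack times are treated as plain numbers in milliseconds, and
the 10% margin follows the description of figure 4.5; this is an
illustration of the logic rather than a reproduction of that code.

#include <cstdio>

// Full rule: the IOI from n2 to n3 must exceed both of its neighbors.
bool GPR2bApplies(double n1, double n2, double n3, double n4) {
    return (n3 - n2) > (n2 - n1) && (n3 - n2) > (n4 - n3);
}

// Provisional test: once the time since n2 exceeds the previous IOI plus
// a 10% margin (longOnToOn), the first half of the conjunction is already
// guaranteed and a provisional boundary can be posited at the next attack.
bool ProvisionalBoundary(double n1, double n2, double now) {
    double longOnToOn = (n2 - n1) * 1.10;
    return (now - n2) > longOnToOn;
}

int main() {
    // attacks of n1 through n4 at 500, 1000, 2000, and 2500 ms
    std::printf("provisional: %d\n", ProvisionalBoundary(500.0, 1000.0, 1600.0));
    std::printf("confirmed:   %d\n", GPR2bApplies(500.0, 1000.0, 2000.0, 2500.0));
    return 0;
}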
rules come into conflict. Beyond that, GPRs 4–7 are more problem-
atic both in terms of algorithmic formulation and real-time perfor-
mance. Consider GPR6 (figure 4.7).
There are two problems here from an algorithmic point of view: first,
how to determine that two musical segments are parallel, and second
how to ensure that they form parallel parts of groups. The parallelisms
that Lerdahl and Jackendoff use to illustrate the rule are similarities of
melodic and rhythmic contour. Exactly which similarities apply and
whether they must be present both melodically and rhythmically are
among the issues that are left to the judgment of the analyzer. The next
section is devoted to noticing such similarities algorithmically.
Even assuming we could develop a parallelism recognizer, how-
ever, let us note in passing a circular relationship that would become
more acute in combination with GPR6: pattern recognition, particu-
larly in real time, relies on having consistent groups to compare. If
grouping depends on patterns that depend on groups, we find again
the kinds of control structure complications that were encountered
with the interplay between chords and keys. Here is another instance
of interacting processes that must collaborate to converge on a con-
vincing structure.
The relevance of grouping preference rules for real-time segmenta-
tion depends, as usual, on the ends to which the discovered grouping
boundaries will be put. Similarly, the issues of control and arbitration
between competing rules can only be decided within the framework
of a particular application. GTTM itself does not offer any recommen-
dations, but Ray Jackendoff suggested some ideas in his 1992 article,
‘‘Musical Processing and Musical Affect.’’ In particular he outlined a
parallel multiple-analysis model in which several structural candi-
labeled along the y axis, and time advancing from left to right along
the x axis. The input to the application that generated figure 4.8 was
a performance of the opening of Mozart’s G Minor Symphony. This
work was used because it is also the musical material for the demon-
stration of GPRs 2 and 3 found on page 47 of GTTM. Since I have
here implemented only GPRs 2a, 2b, and 3a, only those rules appear
on the interface on the final event of the segment they define. Their
placement in figure 4.8 corresponds to the illustration in figure 3.19
of GTTM (Lerdahl and Jackendoff 1983, 47).
The only differences arise from the way the rules are imple-
mented—rule 3a fires on the 12th note, for example, because the pro-
gram considers the unison to be an interval. Since the leap between
notes 12 and 13 is larger than the half-step between 11 and 12, and
larger than the unison between 13 and 14, rule 3a fires. Rules 2a and
2b do not fire on note 10, on the other hand, because of the prohibi-
tion against short segments. Since a boundary was just attached to
note 9, a new one cannot be generated on note 10. Already in this
short example we see that arbitration between conflicting rules is the
primary hurdle to a fully developed GTTM segmenter.
that close elements are grouped together; similarity, that like ele-
ments give rise to groups; common fate, that elements changing the
same way should be grouped; and the principle that we tend to group
elements so as to form familiar configurations.
In Tenney and Polansky’s work, the Gestalt principles of proximity
and similarity are used as the basis for rules that govern grouping of
elements, clangs, and sequences. ‘‘An element may be defined more
precisely as a TG [temporal gestalt] which is not temporally divisible,
in perception, into smaller TGs. A clang is a TG at the next higher
level, consisting of a succession of two or more elements, and a suc-
cession of two or more clangs—heard as a TG at the next higher
level—constitutes a sequence’’ (Tenney and Polansky 1980, 206–
207). Essentially, an element corresponds to an Event in the hierar-
chical representation outlined in Chapter 2. A clang is then a group
of Events, or a Segment. A sequence in Tenney and Polansky’s
work would be a collection of Segments in ours. Consequently, we
are concerned for the moment primarily with the application of the
rules to form clangs, or groups of elements.
The 1980 article formalized concepts that Tenney had been work-
ing with since the early 1960s, particularly the idea that Gestalt
principles of proximity and similarity are two primary factors con-
tributing to group formation in music perception. The rule related to
proximity is defined as follows: ‘‘In a monophonic succession of ele-
ments, a clang will tend to be initiated in perception by any element
which begins after a time-interval (from the beginning of the previous
element, i.e., after a delay-time) which is greater than those immedi-
ately preceding and following it, ‘other factors being equal’ ’’ (Tenney
and Polansky 1980, 208 [italics in original]). The rule in this form,
expressed with reference to time, is in fact identical to Lerdahl and
Jackendoff’s Grouping Preference Rule 2b (Attack-Point).
Lerdahl and Jackendoff note the resemblance in GTTM while con-
sidering the feasibility of implementing their rule system in a com-
puter program. ‘‘Tenney and Polansky . . . state quantified rules of
local detail, which are used by a computer program to predict group-
ing judgments. They point out, however, that their system does not
‘‘In the Euclidean metric, the distance between two points is al-
ways the square root of the sum of the squares of the distances (or
intervals) between them in each individual dimension (in two di-
mensions, this is equivalent to the familiar Pythagorean formula for
the hypotenuse of a right triangle). In the city-block metric, on the
other hand, the distance is simply the sum of the absolute values of
the distances (or intervals) in each dimension’’ (Tenney and Polan-
sky 1980, 212).
The interval between two elements in any individual dimension
is quantified by some measure appropriate to that parameter: pitch
intervals are measured in semitones; durations are measured as a
multiple of some quantization value (e.g., eighth-notes); and inten-
sity in terms of dynamic-level differences printed in a score. The
city-block distance between two elements is calculated by adding
these parameter-specific intervals for all features under consider-
ation. Figure 4.9 demonstrates the segmentation of the opening me-
lodic line of Beethoven’s Fifth Symphony arising from its intervals of
pitch and duration (example taken from Tenney and Polansky [1980,
215]).
The segmentation rules used in figure 4.9 are quite simple and can
be computed using a mechanism identical to the one introduced
in the previous section. The ‘‘delay-time’’ and pitch intervals be-
tween consecutive Events can be found using the code shown in
figure 4.10. The delay-time is simply the IOI between one Event and
the next, and the pitch interval is found by taking the absolute value
of the first MIDI pitch number minus the second.
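The delay-time and pitch-interval computation, together with the
clang-boundary test, can be sketched as follows. The Event fields are
simplified here to an onset time and a MIDI pitch, and the conversion of
delay-times to ratios of a base duration is the assumption discussed
above rather than the code of figure 4.10.

#include <cstdio>
#include <cstdlib>
#include <vector>

struct SimpleEvent {
    double onset;   // onset time in milliseconds
    int    pitch;   // MIDI note number
};

// city-block interval between consecutive events: delay-time (as a ratio
// of a base duration) plus the pitch interval in semitones
double CityBlock(const SimpleEvent& a, const SimpleEvent& b, double baseDur) {
    double delay = (b.onset - a.onset) / baseDur;
    double pitchInterval = std::abs(b.pitch - a.pitch);
    return delay + pitchInterval;
}

// an element begins a new clang when the interval before it is greater
// than both the preceding and the following intervals
std::vector<int> ClangBoundaries(const std::vector<SimpleEvent>& e, double base) {
    std::vector<int> starts;
    for (size_t i = 2; i + 1 < e.size(); ++i) {
        double before = CityBlock(e[i - 2], e[i - 1], base);
        double here   = CityBlock(e[i - 1], e[i], base);
        double after  = CityBlock(e[i], e[i + 1], base);
        if (here > before && here > after) starts.push_back((int)i);
    }
    return starts;
}

int main() {
    // G G G E-flat | F F ..., with a long note before the boundary
    std::vector<SimpleEvent> line = {
        { 0, 67 }, { 250, 67 }, { 500, 67 }, { 750, 63 }, { 1750, 65 }, { 2000, 65 }
    };
    for (int i : ClangBoundaries(line, 250.0))
        std::printf("clang begins at event %d\n", i);   // event 4, the first F
    return 0;
}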
The problems with this from a real-time perspective are the ones
we have come to expect: first, the delays in the Tenney and Polansky
article are expressed in terms of some underlying quantized values,
e.g., eighth notes. We will need to substitute ratios of some common
base duration since the onset times of performed events have no
quantized values attached (see a description of this process in section
2.4.2). The other problem is that segment boundaries are not found
until one event beyond their occurrence, because we need to see that
a given interval is larger than the ones both before and after it. This
delay simply must be accepted if we wish to use the rules their article
suggests.
The gestalt segmentation idea uses, in effect, the inverse of prox-
imity and similarity to identify boundary lines between groups. In
other words, when elements are not proximate and/or dissim-
ilar, they tend to form the beginning and end of neighboring clangs.
This concept is echoed in the literature of music theory, particularly
in the work of Wallace Berry. In his text Structural Functions in
Music, Berry states that musical structure ‘‘can be regarded as
the confluence of shaped lines of element-succession which either
structure can be derived from the grouping structure and the reverse’’
(Cambouropoulos 1997, 285).
Berry’s system puts particular emphasis on transitions to superior
values, and the proximity rules of both GTTM and Tenney and
Polansky similarly favor boundaries on large intervals. As Cam-
bouropoulos points out, this bias excludes the recognition of a
boundary in a rhythmic sequence such as that shown in figure 4.12.
The ICR and PR rules are written in such a way that diminishing
intervals (i.e., changes to shorter, smaller, quieter values) will form
segment boundaries as well as intensifying ones. This orientation
produces a segment boundary in figure 4.12 at the location indicated
by the dashed line.
We have seen the Gestalt principles of proximity and similarity
cited by several researchers as the operative processes behind group
formation in music. The low-level distance metrics adopted by these
systems differ in detail but also show striking resemblances. My own
program Cypher noticed simultaneous discontinuities between sev-
eral features of neighboring events as a way to detect segment bound-
aries, another way of expressing the same idea (Rowe 1993). The
Segmenter application on the CD-ROM can be used as the framework
for segmentation schemes along these lines, combining difference de-
tectors across several parameters simultaneously to arrive at low-
level grouping structures in real time.
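A hedged sketch of that strategy follows: a boundary is reported when
enough feature differences between neighboring events exceed their
thresholds at the same time. The particular features, thresholds, and the
requirement of two simultaneous discontinuities are illustrative choices,
not the Cypher or Segmenter code.

#include <cstdlib>

struct FeaturedEvent {
    double onset;      // onset time in milliseconds
    int    pitch;      // MIDI note number
    int    velocity;   // MIDI velocity
};

bool SegmentBoundary(const FeaturedEvent& prev, const FeaturedEvent& cur) {
    int discontinuities = 0;
    if (cur.onset - prev.onset > 400.0) ++discontinuities;             // long IOI
    if (std::abs(cur.pitch - prev.pitch) > 7) ++discontinuities;       // large leap
    if (std::abs(cur.velocity - prev.velocity) > 30) ++discontinuities; // dynamic jump
    return discontinuities >= 2;   // several features must change at once
}

Calling SegmentBoundary() on each successive pair of incoming events
yields low-level group onsets as the performance unfolds.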
through a known score (see section 5.2). The tempo of the soloist’s
rendition can be calculated from the durations between successive
matches, and that tempo drives the computer’s performance of an
accompaniment.
Imagine a representation of a solo musical line that consists of a
sequence of pitch numbers, one for each note in the solo. Now imag-
ine that same musical line being performed on a MIDI keyboard. With
each note that is played, a new pitch number arrives at the computer.
If the line is performed without error, the sequence of pitch numbers
stored in the machine and the sequence arriving from the performer
match exactly. Because errors often do occur, however, the goal of
the Bloch and Dannenberg algorithm is to find the association be-
tween a soloist’s performance and the score in memory that at any
given moment has the greatest number of matched events.
A variable called the rating shows at any given moment the number
of matches found between the performance (or candidate) and the
stored score (or template). The rating is calculated using an integer
matrix in which rows are associated with the template and columns
with the candidate (table 4.1). When an item is matched, the rating
for the corresponding matrix location is set to the maximum of two
values: either the previous rating for that template location, or the
rating of the previous candidate location plus one.
Table 4.1

                       candidate
                   9   1   8   2   7
template       9   1   1   1   1   1
               1   1   2   2   2   2
               8   1   2   3   3   3
               2   1   2   3   4   4
               7   1   2   3   4   5

The matrix shown in table 4.1 illustrates how the values are updated
as candidate elements are introduced. Time increases from left
to right across the figure. Labels at the top of the matrix indicate new
candidate elements as they are introduced and the labels down the
leftmost column represent the elements of the template pattern. The
underlined values show which matches cause the rating value to be
updated.
For example, when the candidate element 1 arrives, it matches the
template element 1 at the second location in the pattern. At the loca-
tion (1, 1) in the matrix, the rating is increased from one to two since
incrementing the previous rating by one yields a value greater than
the rating stored in the second location of the candidate prior to the
match.
The MusicPattern class implements the Bloch and Dannenberg
matcher. Figure 4.13 lists the heart of the algorithm. The matrix
maxRating keeps track of the maximum number of matches found
at any point in the performance, and the array matched shows
which of the template pattern members have been found in the in-
coming material. Let us use the class to test some simple match con-
ditions. We compare a template pattern of five elements { 9,1,8,2,7 }
to four candidate patterns representing some common types of devia-
tion: insertion, deletion, repetition, and transposition.
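Before looking at those tests, here is a minimal sketch of a rating
matrix built with a standard longest-common-subsequence style recurrence.
It reproduces the values of table 4.1 but is not the MusicPattern code of
figure 4.13; the maxRating and matched structures of that class are
omitted.

#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<std::vector<int>> RatingMatrix(const std::vector<int>& templ,
                                           const std::vector<int>& cand) {
    std::vector<std::vector<int>> rating(templ.size(),
                                         std::vector<int>(cand.size(), 0));
    for (size_t c = 0; c < cand.size(); ++c)
        for (size_t r = 0; r < templ.size(); ++r) {
            int above = (r > 0) ? rating[r - 1][c] : 0;
            int left  = (c > 0) ? rating[r][c - 1] : 0;
            int diag  = (r > 0 && c > 0) ? rating[r - 1][c - 1] : 0;
            int best  = std::max(above, left);
            if (templ[r] == cand[c]) best = std::max(best, diag + 1);
            rating[r][c] = best;   // each cell holds the best match count so far
        }
    return rating;
}

int main() {
    std::vector<int> templ = { 9, 1, 8, 2, 7 };
    std::vector<int> cand  = { 9, 1, 8, 2, 7 };                 // an exact performance
    std::printf("rating: %d\n", RatingMatrix(templ, cand).back().back());   // 5
    return 0;
}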
In the insertion test, the intruding element is not matched. Note
in table 4.2 that the matrix entries under the candidate member 6
(the inserted element) are the same as those under the previous can-
didate. All other members are correctly identified as shown. In the
implementation on the CD-ROM, the variable newMatch is set to ⫺1
when the routine finds no match for an element under consideration,
as happens with the insertion in this example. A negative value of
newMatch, then, signals the insertion of an element that does not
appear in the template at all.
The deletion test yields four matches (table 4.3). An array of inte-
gers, called matched, shows which elements of the template pattern
have been found and which have not. At the end of this test, matched
contains { 1, 1, 0, 1, 1 }, indicating that the third element was not
found.
Figure 4.13 PatternMatcher() listing
                       candidate
                   9   1   6   8   2   7
template       9   1   1   1   1   1   1
               1   1   2   2   2   2   2
               8   1   2   2   3   3   3
               2   1   2   2   3   4   4
               7   1   2   2   3   4   5
                       candidate
                   9   1   2   7
template       9   1   1   1   1
               1   1   2   2   2
               8   1   2   2   2
               2   1   2   3   3
               7   1   2   3   4
                       candidate
                   9   1   6   2   7
template       9   1   1   1   1   1
               1   1   2   2   2   2
               8   1   2   2   2   2
               2   1   2   2   3   3
               7   1   2   2   3   4
                       candidate
                   9   1   8   8   2   7
template       9   1   1   1   1   1   1
               1   1   2   2   2   2   2
               8   1   2   3   3   3   3
               2   1   2   3   3   4   4
               7   1   2   3   3   4   5
detects one element in advance because both the deletion and substi-
tution tests compare input with the highest unmatched template ele-
ment and its successor. The IntervalMatch application and its
associated source code may be found on the CD-ROM.
Matching algorithms, like the other processes we have reviewed,
must be considered with reference to some application. The Inter-
valMatch algorithm is one of a family of procedures that can be ap-
plied according to the nature of the task at hand. Adding the
matched array and eliminating the matrix calculation from the Dan-
nenberg algorithm speeds the work even as it discards information
that may be useful for recognizing re-ordered patterns. For restricted
kinds of material even exhaustive search may be tractable: traversing
a tree that encodes each successive interval of a pattern at a deeper
branching level could be accomplished in real time with a tree
Pattern            ABCDEF
Full Overlap       ABC  BCD  CDE  DEF
Partial Overlap    ABC  CDE  DEF
No Overlap         ABC  DEF
does not require that all possible partitions be produced, as full over-
lap does.
David Cope’s matching engine uses full overlap and therefore is
exhaustive in that scores are broken into patterns of motive-size
length such that all possible contiguous patterns are generated. If
motive-size were three, for example, each interval in the score would
appear in three patterns—once as the first interval of a pattern, once
as the second, and once as the last. All possible patterns of a given
size in the compared scores are examined, using tolerance and inter-
polation to tune the degree of similarity allowed. The multiple-view-
point music prediction system of Conklin and Witten similarly uses
a full-overlap, exhaustive search paradigm (Witten, Manzara, and
Conklin 1994).
This clearly constitutes pattern induction in our parlance, and yet
is difficult to use in real time for the simple reason that it requires
exhaustive search. For our purposes, matching all possible patterns
of a given size is unacceptable because it will take too much time.
Further, we wish to find patterns which presumably will be of several
different sizes, meaning that we would need to exhaustively search
many different pattern lengths simultaneously, making the algorithm
that much further removed from real-time performance.
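A small sketch makes the cost concrete: generating every contiguous
pattern of a given size (full overlap) versus stepping ahead by the full
pattern length (no overlap). Partial overlap, which shares only boundary
events between neighboring segments as in the table above, depends on
where the segment boundaries fall and is not produced by a fixed hop size.

#include <cstdio>
#include <string>
#include <vector>

// contiguous patterns of the given size, advancing by 'hop' symbols each time
std::vector<std::string> Patterns(const std::string& seq, size_t size, size_t hop) {
    std::vector<std::string> out;
    for (size_t i = 0; i + size <= seq.size(); i += hop)
        out.push_back(seq.substr(i, size));
    return out;
}

int main() {
    std::string seq = "ABCDEF";
    for (const std::string& p : Patterns(seq, 3, 1)) std::printf("%s ", p.c_str());
    std::printf("   (full overlap)\n");
    for (const std::string& p : Patterns(seq, 3, 3)) std::printf("%s ", p.c_str());
    std::printf("   (no overlap)\n");
    return 0;
}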
The great advantage of full overlap is that it will never miss a pat-
tern due to faulty segmentation. A no-overlap system critically de-
pends on the consistent grouping of events such that subsequent
pattern matching will compare sequences that start at the same point
in the figure. The other possibility is partial overlap: that is, the same
event may be used in more than one segment when their boundaries
a shift of one note can sometimes correct for this, and prediction
performance degradation is less drastic’’ (Reis 1999).
Ben Reis’s segmentation shifting work establishes empirically two
important principles for real-time pattern processing: first, that a par-
tial or no overlap regime can produce the most relevant segments for
context modeling; and second, that the use of perceptual cues de-
rived from music cognition research is effective in establishing the
proper segmentation.
those documented in 1995 (Rowe and Li). That work itself relied
heavily on the Timewarp application described by Pennycook et al.
(1993).
Timewarp was built on the Dynamic TimeWarp (DTW) algorithm
first developed for discrete word recognition (Sankoff and Kruskal
1983). The DTW can be visualized as a graph, where the horizontal
axis represents members of the candidate pattern and the vertical
axis members of the template. A local distance measure is computed
for each grid point based on feature classifications of the two pat-
terns. Pennycook and Stammen used an intervallic representation of
pitch content and suggested duration ratios for rhythm, defined as
the ratio of a note’s duration divided by the previous note’s duration.
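A hedged sketch of a dynamic time warp over such note features follows:
pitch is represented as the interval from the previous note and rhythm as
the ratio of a note's duration to the previous note's duration. The local
distance measure and the absence of path weights are illustrative
simplifications, not the choices of Pennycook and Stammen.

#include <algorithm>
#include <cmath>
#include <vector>

struct NoteFeature {
    double interval;        // semitones from the previous note
    double durationRatio;   // duration divided by the previous duration
};

double LocalDistance(const NoteFeature& a, const NoteFeature& b) {
    return std::fabs(a.interval - b.interval) +
           std::fabs(a.durationRatio - b.durationRatio);
}

// classic DTW: cost of the best warping path through the grid
double DynamicTimeWarp(const std::vector<NoteFeature>& cand,
                       const std::vector<NoteFeature>& templ) {
    size_t n = cand.size(), m = templ.size();
    const double INF = 1e30;
    std::vector<std::vector<double>> d(n + 1, std::vector<double>(m + 1, INF));
    d[0][0] = 0.0;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j)
            d[i][j] = LocalDistance(cand[i - 1], templ[j - 1]) +
                      std::min({ d[i - 1][j], d[i][j - 1], d[i - 1][j - 1] });
    return d[n][m];
}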
in texts and in legal proceedings, yet it does not represent what the
cochlea reports: it has linearly spaced filter bins (y-axis) with the
same bandwidth at all frequencies, while the cochlea has near loga-
rithmic spacing of hair cell frequencies with roughly proportional
bandwidths (constant ratio to the center frequency). The FFT gives
poor frequency resolution in the lower octaves, and too much in the
upper, and since its bandwidths are constant it entirely misses the
‘beating’ that can make up for a missing fundamental’’ (Vercoe 1997,
313).
To skirt the problems of the FFT and model much more closely
the function of the ear itself, Vercoe developed an auditory model
based on constant-Q filters and auto-correlation that is able to per-
form beat tracking from an audio signal. The first step divides the
audio signal into 96 filter bands, with 12 bands per octave spread
across 8 octaves. These bands are scaled according to the Fletcher-
Munson curves to approximate human loudness sensitivity (Handel
1989). Once scaled, the process detects note onsets by tracking posi-
tive changes in each filter channel. These positive difference spectra
are elongated with recursive filters to extend their influence over
time and added together to provide a single energy estimation. A
version of narrowed auto-correlation is then performed on the
summed energy to recognize regularities in the signal and predict
their continuation: ‘‘As the expectations move into current time, they
are confirmed by the arrival of new peaks in the auditory analysis;
if the acoustic source fails to inject new energy, the expectations will
atrophy over the same short-term memory interval’’ (Vercoe 1997).
This algorithm, realized with Vercoe’s widespread Csound audio
processing language, can perform beat tracking on an acoustic signal
without recourse to any intervening note-level representation.
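A greatly simplified sketch can suggest how the two later stages fit
together: positive changes in band energies are summed into a single
onset-strength signal, and a plain autocorrelation of that signal exposes
its periodicities. The constant-Q filter bank, loudness scaling, and
narrowed auto-correlation of Vercoe's system are not reproduced here.

#include <vector>

// bands[t][b] holds the energy of band b at analysis frame t
std::vector<double> OnsetStrength(const std::vector<std::vector<double>>& bands) {
    std::vector<double> strength(bands.size(), 0.0);
    for (size_t t = 1; t < bands.size(); ++t)
        for (size_t b = 0; b < bands[t].size(); ++b) {
            double diff = bands[t][b] - bands[t - 1][b];
            if (diff > 0.0) strength[t] += diff;   // keep only positive changes
        }
    return strength;
}

// plain autocorrelation: peaks appear at lags corresponding to regular spacing
std::vector<double> AutoCorrelate(const std::vector<double>& x, size_t maxLag) {
    std::vector<double> r(maxLag + 1, 0.0);
    for (size_t lag = 0; lag <= maxLag; ++lag)
        for (size_t t = lag; t < x.size(); ++t)
            r[lag] += x[t] * x[t - lag];
    return r;
}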
Marc Leman has developed a system that can learn to identify tonal
centers from an analysis of acoustic input (1995). It is a central prem-
ise of his system that it proceeds from an auditory model rather than
from some predefined representational abstraction: ‘‘Many computer
models of music cognition . . . have thus far been based on symbol
representations. They point to the objects in the world without re-
flecting any of the physical properties of the object’’ (Leman and Car-
reras 1997, 162).
Leman’s analysis systems send an audio signal through a number
of steps that produce increasingly specific cognitive ‘‘images’’ of the
input. They are divided into two parts: a perception module and a
cognition module. The perception module is an auditory model of
the output of the human ear. Leman has used three kinds of auditory
models in his work, including (1) a simple acoustical representation;
(2) one based on the work of Ernst Terhardt (Terhardt, Stoll, and
Seewann 1982) that computes virtual pitches from a summation of
subharmonics; and (3) another derived from work described by Van
Immerseel and Martens (1992) that models the temporal aspects of
auditory nerve cell firings. Shepard tones (Shepard 1964) are used
as input to the models, eliminating the effect of tone height.
The input part of the simple acoustical model (SAM) is organized
in essentially the same way as the pitch class sets we introduced
in chapter 2. Because the representation models Shepard tones, all
pitches are reduced to a single octave in which height has been elimi-
nated from the percept and only the tone chroma (pitch class) re-
mains. The analytic part of SAM, then, outputs vectors of integers
where a one indicates the presence of a pitch class and a zero its
absence. The synthetic part calculates tone completion images from
input patterns of this type. Tone completion images are computed
from the sum of subharmonics of the pitches in the chord, following
the tradition of virtual pitch extractors discussed in section 2.3. Sub-
harmonics are weighted according to the table in table 4.7.
A C-major triad, for example, would assign a weight of 1.83 to the
pitch class C, since the chord contains intervals of an octave (1.00),
fifth (0.50), and major third (0.33) relative to that pitch class. The C♯
pitch class, in comparison, would get a weight of 0.1 since only the
minor third above C♯ (E) is present in the input chord.
The heart of the Van Immerseel and Martens perception module
(VAM) is a set of 20 asymmetric bandpass filters distributed through
the range 220–7075 Hz at distances of one critical band. After some
additional processing, the ‘‘auditory nerve image’’ is output, a
SUBHARMONIC        WEIGHT
Octave              1.00
Perfect Fifth       0.50
Major Third         0.33
Minor Seventh       0.25
Major Second        0.20
Minor Third         0.10
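A sketch of the tone completion computation, using the weights of table
4.7, follows. Each pitch class of the input chord contributes the weight
of the interval it forms above every candidate pitch class; the function
names are illustrative, and Leman's SAM of course performs this on
Shepard-tone input rather than on a bare pitch-class array.

#include <cstdio>

double SubharmonicWeight(int interval) {   // interval in semitones, 0-11
    switch (interval) {
        case 0:  return 1.00;   // octave/unison
        case 7:  return 0.50;   // perfect fifth
        case 4:  return 0.33;   // major third
        case 10: return 0.25;   // minor seventh
        case 2:  return 0.20;   // major second
        case 3:  return 0.10;   // minor third
        default: return 0.0;
    }
}

void ToneCompletion(const int pcs[12], double image[12]) {
    for (int candidate = 0; candidate < 12; ++candidate) {
        image[candidate] = 0.0;
        for (int pc = 0; pc < 12; ++pc)
            if (pcs[pc])
                image[candidate] += SubharmonicWeight((pc - candidate + 12) % 12);
    }
}

int main() {
    int cMajor[12] = { 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0 };   // C, E, G
    double image[12];
    ToneCompletion(cMajor, image);
    std::printf("C: %.2f  C-sharp: %.2f\n", image[0], image[1]);   // 1.83, 0.10
    return 0;
}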
NUMBER    TYPE
12        major triads
12        minor triads
12        diminished triads
 4        augmented triads
12        major seventh chords
12        minor seventh chords
12        dominant seventh chords
12        half dim. seventh chords
12        augmented seventh chords
12        minor/major seventh chords
 3        diminished seventh chords
of two layers, an input layer and the Kohonen layer. The input layer
contains one node for each component of the input patterns. In
SAMSOM, then, the input layer contains 12 nodes, one for each pitch
class. Each unit in the input layer is connected to all units in the
Kohonen layer, and an initial random weight is associated with each
connection. When an input pattern is presented, it is multiplied by
the weights on each connection and fed to the Kohonen layer. There
the unit that is most activated by the input is able to participate in
learning, hence the name ‘‘competitive learning.’’ In fact, the most
highly activated node is referred to as the ‘‘winning’’ node.
The Kohonen layer is organized as a two-dimensional array. Fol-
lowing the example of the cerebral cortex, the winning node and
units in the neighborhood benefit from learning. A neighborhood
comprises all the nodes within a certain number of rows or columns
around the winner in the grid. ‘‘As the training process progresses,
the neighborhood size decreases until its size is zero, and only the
winning node is modified each time an input pattern is presented to
the network. Also, the learning rate or the amount each link value
can be modified continuously decreases during training. Training
stops after the training set has been presented to the network a pre-
determined number of times’’ (Rogers 1997, 136).
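A compact sketch of one training step, in the usual Kohonen formulation,
appears below. It selects the winner as the node whose weight vector lies
closest to the input (equivalent to the most activated node when the
weights are normalized), then moves the winner and its neighborhood
toward the input. The grid size, learning rate, and neighborhood radius
are placeholders that would shrink over training as described above.

#include <cstdlib>
#include <vector>

const int GRID = 10;      // a 10-by-10 Kohonen layer (placeholder size)
const int INPUTS = 12;    // one input node per pitch class, as in SAMSOM

struct KohonenMap {
    double weights[GRID][GRID][INPUTS];   // one weight per connection

    void InitRandom() {                   // initial random weights
        for (int r = 0; r < GRID; ++r)
            for (int c = 0; c < GRID; ++c)
                for (int i = 0; i < INPUTS; ++i)
                    weights[r][c][i] = (double)std::rand() / RAND_MAX;
    }

    void TrainStep(const std::vector<double>& input, double rate, int radius) {
        // 1. find the winning node: the unit whose weights are closest to the input
        int winRow = 0, winCol = 0;
        double best = 1e30;
        for (int r = 0; r < GRID; ++r)
            for (int c = 0; c < GRID; ++c) {
                double d = 0.0;
                for (int i = 0; i < INPUTS; ++i) {
                    double diff = input[i] - weights[r][c][i];
                    d += diff * diff;
                }
                if (d < best) { best = d; winRow = r; winCol = c; }
            }
        // 2. move the winner and its neighborhood toward the input pattern
        for (int r = 0; r < GRID; ++r)
            for (int c = 0; c < GRID; ++c)
                if (std::abs(r - winRow) <= radius && std::abs(c - winCol) <= radius)
                    for (int i = 0; i < INPUTS; ++i)
                        weights[r][c][i] += rate * (input[i] - weights[r][c][i]);
    }
};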
At the end of training, a topology of characteristic neurons has
emerged as a classification of the input set. The characteristic neuron
(CN) for an input is the unit in the Kohonen layer most activated
by that input. The CNs for the training set become clustered during
learning according to the similarities of their represented inputs. In
the CN-map produced by SAMSOM, for example, the characteristic
neurons for C major and A-minor 7th are near one another. These
two chords are closely related from the standpoint of music theory,
as well, because all of the pitch classes of C major are subsumed in
A-minor 7th. All other major/relative minor 7th combinations pre-
sented to the model become similarly clustered.
Aside from the identification of closely related chords (which Le-
man shows goes well beyond the simple example just cited [1995]),
the SAMSOM output map shows an association between tonal cen-
ters that are related by the cycle of fifths. The response region of
an input is that part of the Kohonen layer activated by the input.
Considering the set of major chord inputs, the response regions of
the SAMSOM map most highly correlated to C major (for example)
are G major and F major—those a perfect fifth away from C. The
correlation between G major and F major is much lower.
Leman refers to the product of the SAMSOM model as ‘‘images
out of time,’’ meaning that the inputs make no reference to the timing
of their presentation. We may also consider it ‘‘out of time’’ because
the calculation it performs is complex enough to take it well out of
the realm of real time. When such a Kohonen network has been
trained, however, subsequent inputs are classified according to the
learned associations of the training set. Thus a trained network can
be used, even in real time, as a classifier of novel inputs.
The main reason that auditory input to and output from interactive
systems have been used less than MIDI I/O is that they required more
sophisticated (and more expensive) hardware. Even personal com-
puters have now become so powerful, however, that extensive treat-
ment and synthesis of digital audio can be performed with no
additional hardware at all. With the physical and financial limita-
tions to the technology largely eliminated, the focus shifts to deriving
structure from an audio stream that will multiply the possibilities
for interaction. Though the techniques of unsupervised learning are
not appropriate for use onstage, the work of Leman and others has
demonstrated that such learning can be applied before performance to derive
structural categories for the interpretation of audio signals in real
time.
5 Compositional Techniques
composition simply does not exist. Part of the reason for that may
stem from the phenomenon just discussed—that composers may be
engaging in quite different activities when they work. Another reason
is that it is much harder to test what is going on cognitively during
composition. Analysis, like listening, is easier to model compu-
tationally in a general sense than is composition. The result of an
analysis is a written, rational document that may be examined for
formal constructions amenable to implementation in a computer pro-
gram. It is not, however, the inverse transform of an act of composi-
tion. At the end of an analysis we are not back at the composer’s
thoughts.
In his novel Amsterdam, Ian McEwan describes the thoughts of a
fictional composer as he walks, looking for ideas for a large-scale
orchestral work: ‘‘It came as a gift; a large grey bird flew up with a
loud alarm call as he approached. As it gained height and wheeled
away over the valley it gave out a piping sound on three notes which
he recognised as the inversion of a line he had already scored for a
piccolo. How elegant, how simple. Turning the sequence round
opened up the idea of a plain and beautiful song in common time
which he could almost hear. But not quite. An image came to him
of a set of unfolding steps, sliding and descending—from the trap
door of a loft, or from the door of a light plane. One note lay over
and suggested the next. He heard it, he had it, then it was gone’’
(McEwan 1998, 84).
Such a stream of thought seems familiar to me, at least, as a com-
poser, even though I rarely compose songs in common time. Because
of the fleeting and unconscious quality of much compositional work,
I cannot reduce the entirety of composition to a set of rules or even
to a list of parameters and a training set. I can, however, imagine
using processes to make music and even imagine how different pro-
cesses might be appropriate to different musical situations. It is in
that spirit, then, that we will explore algorithmic composition—as
a collection of possibilities that can be employed within the aesthetic
context of a composition or improvisation.
e(n)], we see that each element of the alphabet is given an index that
corresponds to its place in the sequence (here 1, 2, 3, . . . n). Then
the operators are ways of selecting a new element from the alphabet.
For example, the operator same is defined as: s[e(k)] = e(k). In other
words, the same operation applied to any element of an alphabet
produces that same element again. Figure 5.1 shows the complete
list of elementary operators.
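The machinery is small enough to sketch directly. The code below
implements an indexed alphabet with the same operator defined above and a
next operator, which belongs to the standard formalism but is included
here only as an illustration; the zero-based indices and the wrap at the
end of the alphabet are conveniences of the sketch.

#include <cstdio>
#include <vector>

struct Alphabet {
    std::vector<int> elements;                    // e(1) ... e(n), stored from index 0
    int Same(int k) const { return elements[k]; }              // s[e(k)] = e(k)
    int Next(int k) const {                                     // n[e(k)] = e(k+1)
        return elements[(k + 1) % elements.size()];             // wraps at the end
    }
};

int main() {
    Alphabet cMajorScale = { { 60, 62, 64, 65, 67, 69, 71 } };   // C major, MIDI
    std::printf("same(2) = %d, next(2) = %d\n",
                cMajorScale.Same(2), cMajorScale.Next(2));       // 64, 65
    return 0;
}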
Similarly, Eugene Narmour’s implication-realization model de-
fines a number of operations for the elaboration of structural pitches
on several hierarchical levels (Narmour 1999).
expression in figure 5.2, then, will generate the following list of val-
ues: (60 61 62 63 64 65 66 67 68 69 61 62 63 64 65). Multiple-band-
widths is used to produce a collection of stored elements that Berg
calls stockpiles. Stockpiles are sets of values from which selections
can be made to generate sequences. They are similar to the alphabets
of Simon and Sumner, and Deutsch and Feroe, except that elements
of a stockpile may be repeated while those in an alphabet typically
are not. Stockpiles might be interpreted as pitches, durations, or dy-
namics. They can be formed initially by listing the elements of the
set, by combining existing stockpiles, or by applying a generation
rule (such as multiple-bandwidths).
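A hedged sketch of such a generation rule is given below, assuming that
each (low, high) pair simply expands to the consecutive integers between
its bounds; that assumption reproduces the list shown for figure 5.2, but
the sketch is not the AC Toolbox code of figure 5.3.

#include <cstdio>
#include <utility>
#include <vector>

std::vector<int> MultipleBandwidths(const std::vector<std::pair<int, int>>& bands) {
    std::vector<int> stockpile;
    for (const auto& band : bands)
        for (int value = band.first; value <= band.second; ++value)
            stockpile.push_back(value);   // repeated values are allowed in a stockpile
    return stockpile;
}

int main() {
    std::vector<std::pair<int, int>> bands = { { 60, 69 }, { 61, 65 } };
    for (int v : MultipleBandwidths(bands)) std::printf("%d ", v);
    std::printf("\n");   // 60 61 62 63 64 65 66 67 68 69 61 62 63 64 65
    return 0;
}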
Figure 5.3 lists a C++ version of the AC Toolbox multiple-band-
widths generator. Much of the code in it is necessary purely because
C++ is not Lisp. (Notice that even the name must change because
C++ thinks the hyphen is a minus sign). The arguments to the Lisp
version of multiple-bandwidths are simply pairs of low and high
bounds. This works because Lisp can see when the list of pairs comes
Make button of figure 5.4. Though it does not occur in this example,
there is in fact no restriction on the relationship between upper and
lower boundary values. In particular, the upper bounds need not be
greater than the lower. A tendency mask could well have the top and
bottom boundary lines cross several times throughout the duration
of the mask.
The use of a tendency mask, once defined, depends on the number
of events that are to be generated. If, for example, we were to request
100 events from the tendency mask defined in figure 5.4, the top and
bottom boundary values would be distributed evenly across the gen-
erated events, i.e., a new value would be encountered every 10
events. Tendency masks are actually a kind of breakpoint envelope,
in which the values given are points of articulation within a con-
stantly changing line.
Figure 5.6 lists the member function GetValue(), which returns
a number from the tendency mask at the current index position. First
the actual upper and lower bounds corresponding to the index argu-
ment are calculated. Then a random number is chosen that falls
within the absolute value of the difference between the upper and
lower bounds. (We may treat this random number as positive because
the modulo operation simply returns the remainder after division by
the argument, and that remainder will be positive whether the divisor
was positive or negative.)
If the range is equal to zero we simply return the lower value (since
upper and lower are the same), thereby avoiding a division by zero
in the modulo operation. If the range is in fact negative, the random
value is made negative as well. Finally the lower boundary value is
added to the random number and the result is returned.
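A few lines of C++ are enough to follow that description, although they
are not the code of figure 5.6. The interpolation of the boundary
breakpoints is assumed to have been done already, so the function simply
receives the lower and upper boundary values for the current index.

#include <cstdlib>

// returns a value between the interpolated lower and upper boundaries for
// the current index; the boundaries may be equal or even inverted
int TendencyMaskValue(int lower, int upper) {
    int range = upper - lower;
    if (range == 0) return lower;                 // avoid a modulo by zero
    int offset = std::rand() % std::abs(range);   // random offset within the range
    if (range < 0) offset = -offset;              // boundary lines may have crossed
    return lower + offset;
}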
Figure 5.7 shows the output of the tendency mask defined in figure
5.4 applied to a stockpile of pitch values covering all semitones be-
tween C2 and C6 (where C4 = middle C). Though there is random
movement on a note-to-note basis, the overall shape of the pitch se-
quence follows the outline of the mask shown in figure 5.5. The ten-
dency mask selection principle offers a way to produce variants of
a constant structural gesture. Each realization of the mask is differ-
ent due to the random selection, but they all will express the same
5.2.2 Jupiter
Jupiter was the first of four interactive works Philippe Manoury and
Miller Puckette realized at IRCAM that together form the cycle Sonus
Ex Machina: Jupiter for flute and 4X (1987); Pluton for piano and 4X
(1988); La Partition du Ciel et de l’Enfer for flute, two pianos, 4X and
orchestra (1989); and Neptune for three percussion and 4X (1991).
The titles of the compositions reveal the technological platform of
was written but who died before it could be completed. There are
eight pitches in the cell that correspond to the name as shown in
figure 5.9 (Odiard 1995, 54). Figure 5.10 shows how the cell is ani-
mated rhythmically during the flute opening of the piece.
The detection/interpolation sectional pairs are based on the recog-
nition and elaboration of sequences of durations. The processing is
typical of Manoury’s disposition toward algorithms as a composi-
tional device: ‘‘A process which can be perceived as such, that is to
say whose future course may be anticipated, destroys itself of its own
accord. It should reveal without being revealed, as if it were the se-
cret artisan of a universe whose forms we perceive, but whose mecha-
nisms we fail to grasp’’ (1984, 149–150).
The Max process that implements the required interpolation is di-
vided into several sub-patches. The recording of the interpolated
material itself is accomplished using explode, Miller Puckette’s
multi-channel MIDI recording/playback object.
Figure 5.11 demonstrates one sub-process from the interpolation
patch, PlaybackLoop (note that this example, like all of the exam-
ples in this book, has been modified to focus on the algorithms under
discussion and is therefore somewhat different from the one actually
used in Jupiter).
All of the Jupiter patches make extensive use of the Max send/
receive mechanism, and PlaybackLoop is no exception. There are
two receivers present, one which resets the note index to zero, and
another that resets the number of notes to be played in a loop. The
main function of the patch is to increment a note count with each
incoming pitch. The inlet receives the pitches from an explode ob-
ject that has recorded the material to be repeated. Whenever a new
pitch arrives at the inlet of PlaybackLoop, it bangs out the number
stored in the int object in the center of the patch. Initially the int
holds zero, but each time a value is output it adds one to itself, incre-
menting the pitch count. The select object at the bottom of the
patch performs the rest of the processing. When the note index from
the int matches the number of notes in the loop, select sends a
bang to the outlet of PlaybackLoop.
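The counting logic has a trivial C++ counterpart, shown below only to
make the behavior explicit; the class interface is an invention of the
sketch, not a translation of the Max patch.

class PlaybackLoopCounter {
public:
    PlaybackLoopCounter(int loopLength) : length(loopLength), index(0) {}
    void SetLength(int loopLength) { length = loopLength; }   // second receiver
    void Reset() { index = 0; }                               // first receiver
    // returns true (a "bang") when the incoming note completes the loop
    bool NoteIn() {
        ++index;
        if (index >= length) { index = 0; return true; }
        return false;
    }
private:
    int length;
    int index;
};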
Figure 5.12 demonstrates another sub-patch from the interpolation
process: NumberOfPlaybacks. At the top of the patch is a receive
object that starts the patch when a playback message is received. Note
tor, unless all requested loops have been produced. Then the interpo-
lation is set to 1.0 (the end of interpolation) and a bang is sent
through the rightmost outlet of NumberOfLoops, signaling the end
of the process.
Figure 5.13 shows how the PlaybackLoop and NumberOfPlay-
backs patches are tied together. Two explode objects are used to
record the two sequences. As they play back, the interpolation factor
generated by NumberOfPlaybacks is sent into the right inlet of an
expression that calculates a duration interpolated between the out-
put of the two sequences.
These patch fragments are taken from the program written by Cort
Lippe for his composition Music for Piano and Computer (1996), one
of a number of his works for soloist and interactive digital signal
processing. As indicated in the citation, jack⬃ is used in this compo-
sition not only for pitch tracking, but to control a bank of twenty
oscillators as well. The input signal to the analysis is the sound of
the live piano. The traditional approach to analysis/resynthesis
uses sine-wave oscillators to recreate individual partials of the ana-
lyzed sound. Lippe follows tradition for part of the piece, but re-
places the sine waves in the oscillator lookup tables with other
sounds during other sections. In fact, at some points the resynthesis
sounds are determined algorithmically by the piano performance—
for example, by switching oscillator tables according to the per-
former’s dynamic.
Zack Settel has written a large number of analysis processes con-
tained in the Jimmies library distributed by IRCAM. In keeping with
Max design philosophy, these processes are small units that can be
plugged together to create more individual and sophisticated forms
of audio tracking. Figure 5.16 shows the patch for calculating the
root-mean-square (RMS) amplitude of a digital audio signal. The
RMS amplitude is computed just as the name suggests: values from
the signal are squared and their mean is taken. The square root of
the mean provides a measure of the energy in the signal.
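In C++ terms the computation over one block of samples is simply the
following; the Jimmies patch of figure 5.16 works on MSP signals rather
than sample arrays, so this is only an illustration of the arithmetic.

#include <cmath>
#include <cstddef>

// root-mean-square amplitude of one block of samples
double RMSAmplitude(const float* samples, std::size_t count) {
    if (count == 0) return 0.0;
    double sumOfSquares = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        sumOfSquares += (double)samples[i] * samples[i];
    return std::sqrt(sumOfSquares / count);   // square root of the mean square
}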
The object zerocross~, another building block, returns a rough
estimation of the amount of noise in a signal by counting the
Clicktime and other controls allow her to quickly change and manip-
ulate such effects in completely improvised performances.
The Center for New Music and Audio Technologies (CNMAT) has
developed many systems for interactive improvisation, often ex-
pressed through algorithmic signal processing or synthesis. Their
work has an extensive rhythmic component based on research first
carried out by Jeff Bilmes at the MIT Media Laboratory. Bilmes built
an engine for the analysis and synthesis of expressive timing varia-
tions on the observation that musicians maintain independent, very
fine-grained subdivisions of the beat pulse as a reference for the
placement of events in time. ‘‘When we listen to or perform music,
we often perceive a high frequency pulse, frequently a binary, tri-
nary, or quaternary subdivision of the musical tactus. What does it
mean to perceive this pulse, or as I will call it, tatum? The tatum is
the high frequency pulse or clock that we keep in mind when per-
ceiving or performing music. The tatum is the lowest level of the
metric musical hierarchy. We use it to judge the placement of all
musical events’’ (Bilmes 1993, 21–22).
Tatums are very small units of time that can be used to measure
the amount of temporal deviation present or desired in any per-
formed event. The name tatum has multiple connotations: first, it is
an abbreviation of ‘‘temporal atom,’’ referring to its function as an
indivisible unit of time. In addition, it honors the great improviser
Art Tatum.
A central tenet of the tatum approach is that expressive timing vari-
ation in many musical styles (including jazz, African, and Latin mu-
sic) is not convincingly modeled by tempo variation alone. Rather
than defining expression as a change in the tempo of an underlying
pulse, the deviation of a given event from a fixed pulse is used. When
modeling the performance of an ensemble, each member of the en-
semble has her own deviation profile. This means that some perform-
ers might be ahead of or behind the beat, while others play more
strictly in time. Such a conceptualization corresponds more closely
to the way musicians think about their temporal relationships during
performance than does the idea that they all are varying their tempi
independently.
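The model can be stated in a few lines: an event's position is a tatum
index plus a per-event deviation, and each performer carries a deviation
profile of her own. The field names below are illustrative.

struct TatumEvent {
    int    tatumIndex;    // position on the high-frequency grid
    double deviation;     // offset in tatums (ahead < 0, behind > 0)
};

// absolute onset time in seconds for one performer's event
double OnsetTime(const TatumEvent& e, double tatumDuration, double startTime) {
    return startTime + (e.tatumIndex + e.deviation) * tatumDuration;
}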
In his thesis, Bilmes demonstrated tools for deriving, analyzing,
and synthesizing temporal deviations present in the multiple layers
of drums he recorded in a performance by the Cuban percussion en-
semble, Los Muñequitos de Matanzas (1993). His work has become
a foundation of the CNMAT Rhythm Engine (CRE), used to organize
the rhythmic activity in several interactive environments.
One such environment used the CRE to store, transform, and com-
bine rhythmic patterns related to those of North Indian classical mu-
sic. Rhythm in this tradition is based on tal theory, a way of
organizing drum patterns within beat cycles of a certain length, such
as the 16-beat tin tal or 12-beat jap tal. ‘‘A particular tal is character-
ized not only by its number of beats, but also by traditional thekas,
fixed patterns that would normally be played on a tabla drum to de-
lineate the rhythmic structure of the tal in the most straightforward
way’’ (Wright and Wessel 1998).
The system organizes a collection of rhythmic subsequences in a
database. Each subsequence identifies one ‘‘reference tatum,’’ the
point that is anchored to a rhythmic grid when the subsequence is
timescaled and scheduled to be played. Normally the reference tatum
is the first event of the subsequence, but could come sometime later
if the pattern includes pickup notes. Subsequences can undergo sev-
eral forms of transformation before they are played, the most impor-
tant of which is timescaling. That is, a subsequence can be sped up
or slowed down by some multiple, as well as adjusted to match the
tempo of an ongoing performance.
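The timescaling transformation can be sketched as follows: the reference
tatum is anchored to a point on the rhythmic grid and every other event
time is scaled around it. The data layout is an assumption made for
illustration, not the CNMAT Rhythm Engine interface.

#include <vector>

struct TimedEvent {
    double time;   // seconds, relative to the start of the subsequence
    int    note;
};

void ScheduleSubsequence(std::vector<TimedEvent>& seq, int referenceIndex,
                         double anchorTime, double timescale) {
    double referenceTime = seq[referenceIndex].time;
    for (TimedEvent& e : seq)
        e.time = anchorTime + (e.time - referenceTime) * timescale;
}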
Subsequences record the placement of events in time. Usually an
event is a note or drum stroke, but it can represent some other type
of action as well: for example, the system can position ‘‘start record’’
and ‘‘stop record’’ messages within a rhythmic pattern. In this way,
material from the live performance can be sampled at precisely con-
trolled moments within the rhythmic structure: ‘‘We think of the
notes before the ‘start record’ as a musical stimulus, and then use
6 Algorithmic Expression and Music Cognition
The traditional creation of Western music moves from the score writ-
ten by a composer through a performer’s interpretation to the under-
standing of a listener. We have considered algorithmic tools for
composition and analysis, informed by the cognition of listening. In
this chapter we will review proposals for algorithmic performance,
particularly those that deal with expression. Before addressing that
work, let us pause to think more closely about the cross-fertilization
between music cognition and machine musicianship.
people do, but plays better than almost all of them. Certainly no one
would suggest that Deep Blue should be made to play worse if it
then would better match the data from experiments on human chess
players.
Machine musicianship as I consider it in these pages is a form of
weak AI: weak in that I claim no necessary isomorphism to the hu-
man cognitive processes behind the competencies being emulated.
It would be wonderful, certainly, if these programs could shed any
light on the psychology of music and some of them may serve as
platforms from which to launch such inquiry. It certainly works in
the other direction, however: because of their meticulous attention
to the details of human musicianship, music cognition and music
theory are the most abundant sources of ideas I know for program-
ming computers to be more musical.
Artificial intelligence has been an important partner in the devel-
opment of musically aware computer programs as well. Every new
wave of AI technology, in fact, seems to engender an application of-
fering possibilities for some aspect of music composition, analysis,
or performance. Ebcioglu’s expert system for harmonizing chorales
in the style of J.S. Bach (1992), Bharucha and Todd’s modeling of
tonal perception with neural networks (1989), and Cope’s use of aug-
mented transition networks for composition (1991) are only some of
the more celebrated instances of this phenomenon. From the point
of view of artificial intelligence research, such applications are at-
tractive because of music’s rich capacity to support a wide variety
of models. Musical problems can be stated with the rigidity and pre-
cision of a building-blocks world, or may be approached with more
tolerance for ambiguity, multiple perspectives, and learning. From
the point of view of algorithmic composition and performance, arti-
ficial intelligence is of central importance because it directly ad-
dresses the modeling of human cognition.
6.2.1 Schemata
An important representation for all of these fields, and for music cog-
nition in particular, is the schema, or frame. Albert Bregman defines
schema as ‘‘a control structure in the human brain that is sensitive
to some frequently occurring pattern, either in the environment, in
ourselves, or in how the two interact’’ (Bregman 1990, 401). Programs
modeling some aspect of cognition frequently make use of a formal
version of the schema, typically interpreted as a collection of inter-
related fields that is activated when a certain pattern of activity is
encountered. For example, Marc Leman describes the output of the
self-organizing maps used in his system (section 4.3) as schemata.
The title of his book, Music and Schema Theory, reflects the impor-
tance of the concept to his model (1995).
Schemata are important both as a way of organizing the response to
a situation and as a compact representation of knowledge in memory.
Many of the structures and processes we have already developed
may be considered from a schematic viewpoint. Irene Deliège and
her colleagues suggest that the structures of the Generative Theory
of Tonal Music, for example, are schemata for the organization of
tonal material in memory: ‘‘The same underlying structure could be
related by a listener to many different surface structures. The capac-
ity to abstract an appropriate underlying structure in listening to a
given piece could thus be held to represent the most highly devel-
oped mode of musical listening’’ (Deliège et al. 1996, 121). The appli-
cation of such schemata to different compositions could indeed
reference to the style of the music in which the cues are embedded. In
section 8.1 we will review a proposal for automatic style recognition.
Certain cue attributes could be tied to the recognition of styles such
that when a given style is active, the presence of the associated attri-
butes would mark a pattern as a cue.
None of the segmentation techniques we built in section 4.1 make
use of the recognition of distinctive patterns as suggested by the cue-
abstraction proposal. Using cues in segmentation would lead to a
circularity that must be dealt with in control structures governing
the two. That is, we use calculated segments as the starting point of
the pattern recognition procedures outlined in section 4.2. Ongoing
segments are matched against stored patterns that have been assigned
boundaries by the same grouping process. Once we use the recogni-
tion of patterns as boundary markers, we encounter a feedback prob-
lem in which cues determine groups that determine cues. Though
cue abstraction may be a very powerful addition to the segmentation
techniques currently in use, the resonance between groups and cues
would need to be closely controlled in the real-time case.
Another kind of schema models progressions of tension and relax-
ation through time:
Research on musical expressivity and on musical semantics, carried
out by Francès (1958) and Imberty (1979, 1981) showed the essential
part played by musical tension and relaxation schemas; these sche-
mas are extracted from the musical piece and then assimilated to
kinetic and emotional schemas of tension and relaxation, which ac-
cumulate all of the affective experience of the listener. Therefore, it
seems reasonable to consider that the most important part of musical
expressivity might be determined firstly by the specific way each
composer organises the musical tension and relaxation in time, and
secondly by the kinds of musical tension and relaxation the listener
manages to abstract. (Bigand 1993, 123–124)
Bigand proposes algorithmic means of determining the tension
and relaxation of musical segments, particularly as these are mani-
fested by melodic, harmonic, and rhythmic processes. For example,
Tonic        7
Dominant     6
Third        5
Other        4
uating the affective course of a piece of music. I will not pursue the
algorithmic analysis of emotional responses in this text, though
many of the analytic tools we have developed could help describe
the affective dynamic of a work when cast in those terms. (The work
that has been published in this area is still developing. For one of
the most refined examples, see Antonio Camurri’s EyesWeb project;
see also section 9.3.) For the moment let us consider the other crucial
part of Bigand’s tension/relaxation schemata, which is that they are
formulated with explicit reference to temporal development.
6.3 Learning
with which they play. Once the program appears onstage, or even in
the rehearsal hall, it has largely reached a level of performance that
will change little until the concert is over. Exceptions to this general-
ization are programs such as Barry Vercoe’s Synthetic Performer,
which learns in rehearsal the interpretation that a particular human
player brings to a piece of chamber music (Vercoe 1984).
Steven Tanimoto defines learning as ‘‘an improvement in informa-
tion-processing ability that results from information-processing ac-
tivity’’ (Tanimoto 1990, 312). Learning therefore involves change in
the functioning of a system due to the operation of the system itself.
Tanimoto’s definition makes learning a positive change—the system
is not held to have learned if its performance degrades. Critical to
identifying learning, then, is a specification of what ‘‘improvement’’
means for a particular system. The field of machine learning is a
broad and active one, and I will not undertake any overview here.
Rather, let us consider the issues involved in using machine learning
techniques in an interactive system.
‘‘Each learning model specifies the learner, the learning domain,
the source of information, the hypothesis space, what background
knowledge is available and how it can be used, and finally, the crite-
rion of success’’ (Richter et al. 1998, 1). Here again we find reference
to an improvement metric as a way to quantify learning. One of the
most widely used learning models is the neural network, which we
have already considered in some detail. The key induction network
introduced in section 3.1, for example, used information from music
theory both as the source of knowledge about chord functions and as
the criterion of success, and the network was considered successful
if it produced key identifications in agreement with music theory.
Another criterion of success might be agreement with the results of
music cognition experiments. As in the case of representation, suc-
cessful learning can only be judged within the context of a specific
problem.
We can distinguish two basic learning phases that we might wish
a machine musician to undergo: first, learning as preparation for
performance; and second, learning during performance. Several
mosomes are mated by taking parts of the bit strings of each and
combining them to make a new string. The methods of combining
chromosomes follow biology: two of the most important mating func-
tions are crossover and mutation. In crossover, a chromosome locus
is chosen at random. Bits before the locus in parent A are concate-
nated with bits after the locus in parent B to form one offspring, and
the post-locus bits of A are concatenated with the pre-locus bits of
B to form another. Figure 6.1 shows the offspring created from two
four-bit parents with crossover at locus 2.
Mutation randomly flips bits in one chromosome. That is, a single
bit position in a chromosome is chosen at random and the value there
is inverted (a zero becomes a one, and vice versa).
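As a rough illustration (not the code from the CD-ROM), single-point crossover and mutation on bit-string chromosomes might be written in C++ as follows; Chromosome, Crossover(), and Mutate() are placeholder names of my own.

#include <cstdlib>
#include <utility>
#include <vector>

typedef std::vector<int> Chromosome;   // each element holds a 0 or a 1

// Single-point crossover: a locus is chosen at random; bits before the locus
// come from one parent and bits from the locus onward come from the other,
// producing two complementary offspring.
std::pair<Chromosome, Chromosome> Crossover(const Chromosome& a,
                                            const Chromosome& b)
{
    int locus = std::rand() % (int)a.size();
    Chromosome childOne, childTwo;
    for (int i = 0; i < (int)a.size(); i++) {
        childOne.push_back(i < locus ? a[i] : b[i]);
        childTwo.push_back(i < locus ? b[i] : a[i]);
    }
    return std::make_pair(childOne, childTwo);
}

// Mutation: invert the value at one randomly chosen bit position.
void Mutate(Chromosome& c)
{
    int locus = std::rand() % (int)c.size();
    c[locus] = 1 - c[locus];
}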
A fitness function gauges the success of a solution relative to other
solutions (chromosomes). A famous genetic algorithm was devel-
oped by W. Daniel Hillis to evolve sorting algorithms (1992). In that
case the fitness of a chromosome was a measure of whether or not
it could correctly sort a list of 16 items. The sorting algorithm is a
well-known optimization problem in computer science, described
extensively by Donald Knuth in the third volume of his classic text
The Art of Computer Programming (1973). One of Knuth’s solutions
is a sorting network, a parallel process in which pairs of elements
in the list are compared and swapped if they are found to be out of
order. Such networks are correct if they are guaranteed to produce
a sorted list at the end of all comparisons. They are minimal if they
produce a correct result with the fewest number of comparisons.
INDEX   1   2   3   4   5   6   7   8   9   10  11  12  13  14
PITCH   C   E♭  F   G   A♭  B♭  C   E♭  F   G   A♭  B♭  C   E♭
for a tune being learned is read from a file prior to running the GA.
A preprocessor then constructs scale tables for each measure ac-
cording to the given harmonic structure. The scale used to accom-
pany an Fm7 chord might consist of 14 pitches arranged as shown
in table 6.2.
Note that these are not pitch classes but rather pitches in a particu-
lar octave. A GenJam chromosome with the sequence of symbols
{ 9 10 11 12 13 11 10 9 }, then, would perform the musical fragment
shown in figure 6.2 when played against an Fm7 chord (this example
is taken from Biles [1998, 233]).
Indices into the table begin with one because the locus symbol zero
has a special meaning: whenever the program encounters a zero in
a chromosome it generates a rest in the output by sending a note
off command to the MIDI pitch most recently sounded. A locus
symbol of 15 also has a special meaning, in that when 15 is encoun-
tered no MIDI messages at all are sent, effectively holding the most
recently sounded pitch (or continuing a rest).
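GenJam's own data structures are Biles's; purely to illustrate the decoding convention just described (0 produces a rest, 15 holds, 1 through 14 index the scale table), a minimal sketch with invented names such as PlayLocus() might read:

#include <cstdio>

const int kRest = 0;
const int kHold = 15;

// Interpret one locus of a measure chromosome against a 14-entry scale table.
void PlayLocus(int locus, const int scaleTable[], int& lastPitch)
{
    if (locus == kRest) {                 // turn off the most recently sounded pitch
        if (lastPitch >= 0)
            std::printf("note off %d\n", lastPitch);
        lastPitch = -1;
    } else if (locus == kHold) {          // send no MIDI messages at all
        // the previous pitch (or rest) simply continues
    } else {                              // 1-based index into the scale table
        int pitch = scaleTable[locus - 1];
        std::printf("note on %d\n", pitch);
        lastPitch = pitch;
    }
}

// One possible octave placement of the Fm7 table (an assumption on my part):
// int fm7[14] = { 60, 63, 65, 67, 68, 70, 72, 75, 77, 79, 80, 82, 84, 87 };
// Feeding the chromosome { 9 10 11 12 13 11 10 9 } through PlayLocus()
// then yields the fragment of figure 6.2.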
GenJam makes four kinds of mutations to measure chromosomes:
This version of the algorithm in fact does not evolve phrases based
on a fitness evaluation. The human player has no foot pedal or other
virtual gong with which to reduce the fitness of phrases judged to
be poor. Other implementations of GenJam have used a fitness esti-
mation, however, either issued by a human observer as in Karl Sims’s
drawing GA, or even derived from the reaction of an audience of
listeners.
John Biles’s GenJam is interesting both because of the ingenuity
with which it adapts genetic algorithms for jazz improvisation and
because it uses a learning algorithm during the performance itself.
The applications to which GAs could be put as part of an interactive
system are legion—GenJam is to some extent an aural equivalent of
Sims’ drawing program, and there are many other ways to interpret
chromosome strings as control parameters of compositional algo-
rithms. GAs might be used, for example, to evolve interpolation func-
tions such as those found in Philippe Manoury’s Jupiter or tendency
mask settings for the AC Toolbox. Learning could be carried out be-
fore the performance allowing the composer to exert pressure on the
GA according to the nature of the composition.
The CD-ROM includes a genetic algorithm application that simply
evolves bit strings with as many bits as possible set to one. I use such
a simple example because the fitness function involved is particu-
larly straightforward. The use of this process for other applications
involves a redesign of the interpretation of the bit strings and an asso-
ciated fitness function—much of the evolutionary mechanism (cross-
over, mutation) can remain as written. Beyond this example, genetic
algorithms (like neural networks) are well documented on the in-
ternet and many working environments can be found with a simple
keyword search.
Figure 6.3 shows the definition of the genetic algorithm class. Most
of the member functions follow the preceding discussion: the func-
tion Fitness(), for example, is a routine that determines the fitness
of a particular chromosome. SelectParent() will choose a parent
for the next generation of offspring according to how highly it is rated
by Fitness().
Figure 6.3 Genetic Algorithm class definition
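Figure 6.3 reproduces the class itself; as a sketch of how two of its members might be realized for the bit-counting example, and not necessarily as they appear on the CD-ROM, a fitness function that counts ones and a fitness-proportional SelectParent() could look like this:

#include <cstdlib>
#include <vector>

typedef std::vector<int> Chromosome;

// Fitness for the bit-string example: the number of bits set to one.
int Fitness(const Chromosome& c)
{
    int ones = 0;
    for (int i = 0; i < (int)c.size(); i++)
        ones += c[i];
    return ones;
}

// Fitness-proportional ("roulette wheel") selection: a chromosome's chance
// of parenting the next generation is proportional to its fitness rating.
int SelectParent(const std::vector<Chromosome>& population)
{
    int total = 0;
    for (int i = 0; i < (int)population.size(); i++)
        total += Fitness(population[i]);
    int target = std::rand() % (total > 0 ? total : 1);
    int sum = 0;
    for (int i = 0; i < (int)population.size(); i++) {
        sum += Fitness(population[i]);
        if (sum > target)
            return i;                        // index of the selected parent
    }
    return (int)population.size() - 1;       // fallback when all fitnesses are zero
}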
values of the plan units, the network will predict another output vec-
tor. This process continues sequentially until the melodies are com-
pleted. (Goldman et al. 1999, 78)
the detected bass and snare onsets to stored drum patterns and re-
ports beat information from the pattern that best matches the input.
BTS tracks the beat of its target music style remarkably well, to
the point that animated dancers can synchronize with the beat of
arbitrary compact disc recordings of popular music in real time
(figure 6.6).
The BTS in 1994 could locate the tactus, or quarter-note beat level.
An extension reported in 1997 finds not only the quarter-note level
but also higher-level pulses including the half-bar and the bar (as-
suming 4 beats to a bar). Detecting the higher levels involves recog-
nizing strong and weak beats of a meter, not simply a multiplication
of the quarter-note pulse by two or four. Besides metric structure,
the 1997 program employs music knowledge to find the pulse of
drumless performances as well. Though this brings the work closer
to MIDI-based beat trackers, the authors note the difficulty in
durations connecting two longer ones, and the ‘‘triadic melodic con-
tinuation’’ outlines a triad within a larger gesture. Because all condi-
tions of the rule are fulfilled here, a performance by Widmer’s learner
will play this figure with a decelerando.
The system can learn nested structures as well. That is, the phrase
level figure shown in figure 6.11 might be subsumed in a larger mo-
tion that is associated with a different expressive shape. When the
learner performs such hierarchically related structures it averages
the effect of the two to arrive at one local variation.
Widmer tested the impact of structure-level descriptions on the
efficacy of the resulting rules by comparing performances generated
by three different strategies: (1) note-level simple descriptions (pitch,
metric position, etc.); (2) note-level descriptions with structural
backgrounds; and (3) structure-level descriptions. The training ex-
amples were taken from performances of Schumann’s Träumerei by
Claudio Arrau, Vladimir Ashkenazy, and Alfred Brendel, as col-
lected by Bruno Repp (Repp 1992).
Rules learned by analyzing the second half of the pianists’ rendi-
tions using the three strategies were subsequently applied to the first
half of the composition, after which the experts’ and the machine’s
performances were compared. A significant improvement in the
agreement between the machine and expert performances emerged
as more structural knowledge was incorporated. Weighted for metri-
cal strength (a rough measure of salience), the first strategy yielded
agreement of 52.19%, the second 57.1%, and the third 66.67%
(Widmer 1996, 200).
Gerhard Widmer’s expressive learning system relates to the goals
of machine musicianship in several ways: its symbolic orientation
produces rules that can be executed algorithmically as long as the
relevant structural characteristics can be recognized. The application
of expressive shapes to timing and dynamics can similarly be accom-
plished in an interactive system if a generated phrase is available for
modification before it is performed.
Generally the application of expression rules requires a significant
degree of planning. A program must analyze at least one full phrase
to three possible principles: (1) The first event may be part of a hierar-
chical structure, to some extent worked out in advance, and to some
extent constructed in the course of the improvisation. . . . (2) The
first event may be part of an associative chain of events, each new
event derived from the previous sequence by the forward transfer of
information. . . . (3) The first event may be selected from a number
of events contained within the performer’s repertoire, the rest of the
improvisation consisting of further selections from this same reper-
toire, with a varying degree of relatedness between selections. (1988,
8–9)
Clarke goes on to theorize that improvisation styles can be char-
acterized by their combination of these three strategies. ‘‘The im-
provising style known as free jazz is principally characterized by
associative structure, since it eschews the constraints of a pre-
planned structure, and attempts to avoid the use of recognizable
‘riffs’ ’’ (1988, 10). He similarly characterizes traditional jazz as being
more hierarchical because of the importance of the harmonic struc-
ture, and bebop as more selective ‘‘in the way in which a performer
may try to construct an improvisation so as to include as many
‘quotes’ from other sources as possible (ranging from other jazz
pieces to national anthems)’’ (1988, 10).
There is no general machine improviser yet. Though improvisation
systems do not know what will happen in performance, they gener-
ally are programmed to participate in music of a particular style. Be-
bop improvisers have been implemented that would sound out of
place in a performance of free jazz, and new-music-style improvisers
cannot play the blues. Ultimately the ongoing research in style
analysis/synthesis may make it possible to write a machine impro-
viser that could recognize the stylistic characteristics of the music
being played and adapt its contribution accordingly. Style-specific
improvisers have already proven their artistic merit, however, and
still benefit from an analysis of the musical context even when that
context is essentially restricted a priori.
In this chapter I will discuss a number of case studies, systems
developed by composer/improvisers for improvising interactively
7.1.1 SeqSelect
Teitelbaum’s improvisation systems have long included techniques
for recording, transforming, and playing back musical material dur-
ing the course of a performance. The Max patch developed for SEQ
TRANSMIT PARAMMERS has a number of objects devoted to these
operations. We will examine SeqSelect, a subpatch that provides
compact management of a group of up to nine sequences. Obviously,
nine is an arbitrary number and could be adjusted up or down simply
by changing the number of seq objects in the patch.
SeqSelect has two inlets at the top of the patch (figure 7.1). The
first accepts messages for the embedded seq objects, such as play
and stop. (Seq is a Max object that records, reads, writes, and plays
back MIDI files.) Note that SeqSelect uses the seq objects for play-
back only—all of the material to be read out has been prepared previ-
ously. The second inlet to SeqSelect gets a sequence number,
selecting the corresponding internal sequence to receive subsequent
seq messages through the other inlet until it is overridden by another
selection.
The gate and switch objects route messages through the patch.
The sequence selector arriving at the second inlet is sent to the con-
trol inlets of both gate and switch. In the case of gate, this routes
all messages arriving at the right inlet to the corresponding outlet of
the gate. Similarly, switch will send anything arriving at the se-
lected input through to its outlet (switch looks and behaves rather
like an upside-down gate). The switch routes the bangs that are
sent when a seq object completes playback of a file.
point the overflow flag will be sent from the third outlet of the
counter and change the switch to direct its input to the right outlet,
leading nowhere. Now when the next bang comes from finding the
end of sequence nine, it will not feed back around to SeqSelect but
rather fall into empty space, ending the sequential playback. Note
that at the end of this process the machine will no longer work—for
one thing, the graphic switch will be set to the wrong position. The
bang button at the top of the patch will reinitialize everything so
that the process could be run through again.
In Teitelbaum’s own use of SeqSelect, the object is wrapped with
pack/unpack and midiformat/midiparse pairs as shown in fig-
ure 7.4. These surrounding pairs are used to make MIDI messages of
incoming pitch and velocity values, and then to unwrap the resulting
MIDI messages to retrieve the original pitch and velocity. The point
of the exercise is both to make available the individual components
of the MIDI channel voice messages handled by the seq objects,
and to allow new pitch and velocity values to be recorded inside
SeqSelect.
The + object appended after unpack in figure 7.5, for example,
provides for the input of a transposition value. Any new integer sent
to the right inlet of the + is added to the pitch numbers of all MIDI
messages coming from the sequences managed in SeqSelect. The
flush object ensures that all sounding notes can be sent a corre-
sponding note off when necessary. This is to address a common
problem with transposition operators: that the transposition level
may be changed between a note on and a note off, causing the
transposed notes to be turned off when the time comes and the origi-
nal pitches to stay stuck on.
SEQ TRANSMIT PARAMMERS takes pitch values from MIDI in-
put to control the transposition of sequence playback as shown in
figure 7.6. Under certain conditions, a MIDI pitch number will be
read from the stripnote object and 60 subtracted from the value.
The trigger object (here abbreviated t) then carefully times the
delivery of two bangs and the calculated transposition amount. The
The CEMS system and the SalMar Construction were the first interac-
tive composing instruments, which is to say that they made musical
decisions, or at least seemed to make musical decisions, as they pro-
duced sound and as they responded to a performer. These instru-
ments were interactive in the sense that performer and instrument
were mutually influential. The performer was influenced by the mu-
sic produced by the instrument, and the instrument was influenced
by the performer’s controls. These instruments introduced the con-
cept of shared, symbiotic control of a musical process wherein
the instrument’s generation of ideas and the performer’s musical
judgment worked together to shape the overall flow of the music.
(1997, 291)
Edmund Campion, in his composition/improvisation environ-
ment Natural Selection, similarly uses ideas of influence to organize
the behavior of a large-scale algorithmic improviser written in Max.
Natural Selection for MIDI piano and Max was composed at IRCAM
in 1996, with programming by Tom Mays and Richard Dudas. Cam-
pion designed the first version of the work as an improvisation envi-
ronment for himself performing on a MIDI grand piano and
interacting with a Max patch called NatSel. The rules governing the
operation of the patch are the same rules imparted to the performer
who interacts with it. The construction and operation of NatSel ex-
press the fundamental compositional ideas of both the improvisation
and the later piece: in many respects the patch is the piece. Richard
Povall expresses a similar orientation to his own work: ‘‘The most
difficult task at hand is to allow the system to be the composition—
to allow both performer and system to interact with a degree of free-
dom that makes for a compelling, often surprisingly controlled out-
come’’ (1995, 116).
+ object before they are overwritten with the new ones. We arrange
the correct order of processing with the trigger object. Messages
are transmitted from a trigger in right-to-left order. The bang issu-
ing from the rightmost outlet will first make int send its current
value into the right inlet of +. Then the integer output of the trigger
(from the left outlet) will overwrite the memory location in int and
add itself to +. The / object divides the sum by two and sends the
resulting average to the outlet of avg2.
Of course this patch will perform averaging on any sequence of
numbers: in Natural Selection it is used on velocities but also to com-
pute the average duration between successive note attacks, the sec-
ond form of performer influence. Using avg2 on these durations
provides a running average of the inter-onset-intervals (IOIs) arriving
from the performer. To avoid skewing the average when the perfor-
mer is leaving a long rest, the program does not send IOIs beyond a
certain threshold (about 1.5 seconds) to the averager.
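The patch itself is written in Max; restated in C++ the same behavior amounts to something like the following sketch, in which the class name and method names are mine and the 1.5-second threshold comes from the text.

// Running two-value average of inter-onset intervals, ignoring long rests.
const long kIOIThreshold = 1500;       // milliseconds, following the text above

class Averager {
public:
    Averager() : previous(0), lastAverage(0), haveValue(false) {}
    // Average of the new value and the one before it, as avg2 computes it.
    long Average(long value) {
        lastAverage = haveValue ? (previous + value) / 2 : value;
        previous = value;
        haveValue = true;
        return lastAverage;
    }
    // IOIs longer than the threshold are not sent to the averager,
    // so a long rest does not skew the running average.
    long AverageIOI(long ioi) {
        if (ioi > kIOIThreshold)
            return lastAverage;
        return Average(ioi);
    }
private:
    long previous;
    long lastAverage;
    bool haveValue;
};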
The third and last form of influence derives from a comparison of
incoming pitch combinations to a matrix. The matrix comprises 64
three-note combinations that are recognized when played either in
sequence or as a chord, and 24 six-note combinations that are recog-
nized only when played as chords. When a combination is found,
the input processor issues an ‘‘influence’’ variable. Influence vari-
ables are used to change the presentation of sequences as well as to
trigger and modify several independent note generation processes.
The performer can record sequences at will. These can be played
back as recorded when triggered, or after transformation by one of
the influence processes. One such process is called ‘‘3-exchange,’’
which maps all of the notes in a recorded sequence to the pitches of
the last identified tri-chord. The sub-patch order-route (figure
7.10) performs the mapping: MIDI note numbers enter through the
inlet at the top. These are changed to pitch classes, and sent through
a cascade of select objects. When an incoming pitch class matches
one of the select arguments, the corresponding member of the tri-
chord is passed to the outlet. C♯, F♯, G, and C are mapped to the first
sages by sending a bang to its outlet. The bang fires once after the last
input has been received. (Note that activity will not bang until it
has received at least one message at its left inlet).
Campion uses activity to reset certain processes to a default
state during the performance. For example, one part of the program
scales incoming velocity values using a floating point multiplier. If
no new velocities arrive within a five second window, the multiplier
is reset to the default value of one—that is, all scaling is eliminated
until more input arrives.
Figure 7.13 implements this idea in a subpatch. The left inlet
receives MIDI velocity values. The right inlet takes floating point
scalers. All velocities arriving on the left side are multiplied by the
scaler and clipped to a range of 0 to 120 before being sent back out.
Whenever there has been an absence of velocity inputs for five sec-
onds or more, the activity object will bang a one back into the
multiplication object, effectively eliminating the amplitude scaling
until a new value is received at the right inlet.
The clip object restricts input to a specified range (figure 7.14).
Any input that falls between the upper and lower bounds of the range
is passed through unchanged. If input falls above or below the
bounds, it is pinned to the limit it has passed. So, if a clip object
has bounds of 0 and 120 (as in figure 7.13), any input above 120 will
be output as 120, and any input below zero will be changed to zero.
Clip accomplishes this behavior by saving the lower and upper lim-
its in int objects. A split handles the first test, passing any input
between the bounds directly to the output. Inputs that do not fall
within the bounds are sent to the right outlet of split, where they are
tested by two conditionals. The first checks to see if the input is less
than the lower bound, in which case the int containing the lower
bound is banged, sending the lower bound to the outlet. The second
conditional does the same operation for the upper bound, checking
to see if the input is higher than the upper bound and banging out
the upper bound if it is. Only one of these conditionals can be true
for any given input.
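The same behavior is easy to express outside Max; a C++ equivalent of clip, with the bounds used in figure 7.13, might read:

// Pin a value to the range [low, high], as the clip object does.
int Clip(int value, int low, int high)
{
    if (value < low)  return low;
    if (value > high) return high;
    return value;
}
// e.g. Clip(velocity, 0, 120) reproduces the bounds of figure 7.13.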
NatSel’s independent generation processes produce material with-
out reference to the sequences and can operate simultaneously with
sequence playback. The sub-patch repeat-notes, for example, repeats
any note played four times as long as the inter-onset interval between
any successive two of the four does not exceed 400 milliseconds.
The patch in figure 7.15 demonstrates the control structure that de-
termines when the condition for repeat-notes has been met. In-
coming note-ons are first gathered into a list by the quickthresh
object if they arrive within a 50 millisecond window. The change
object bangs a one into the counter whenever an incoming note is
different from the previous one. Therefore the counter will only ad-
vance beyond two if a pitch is repeated. The activity sub-patch
(figure 7.12) resets the counter to one if nothing happens for 400
milliseconds. When the counter advances beyond a limit set in the
greater-than object (>), a bang is sent to the process that produces the
repeats. The inter-onset-interval between the last two of the repeated
notes is used to control the speed of the generated repetitions, as
shown in figure 7.16.
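A rough C++ restatement of this control structure, with invented names and without the quickthresh chord-gathering step, counts consecutive repetitions of the same pitch and reports when four have arrived with no gap longer than 400 milliseconds:

const long kMaxIOI = 400;              // milliseconds between successive repetitions
const int  kRepeatsNeeded = 4;

class RepeatDetector {
public:
    RepeatDetector() : lastPitch(-1), lastTime(0), count(0) {}
    // Returns true when the repeat condition has just been met.
    bool NoteOn(int pitch, long timeInMs) {
        bool sameNote = (pitch == lastPitch);
        bool inTime   = (lastPitch < 0) || (timeInMs - lastTime <= kMaxIOI);
        count = (sameNote && inTime) ? count + 1 : 1;   // reset on change or late arrival
        lastPitch = pitch;
        lastTime  = timeInMs;
        return count >= kRepeatsNeeded;
    }
private:
    int  lastPitch;
    long lastTime;
    int  count;
};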
The repeat-notes independent process outputs a number of rep-
etitions of the input note when the condition determined by figure
7.15 is met. The expression shown in figure 7.16 determines the tim-
ing of these repetitions as well as, indirectly, their number. (The com-
plete repeat-notes patch can be found on the CD-ROM.) The
expression calculates a quasi-exponential curve whose shape is de-
termined by the middle inlet. In figure 7.16 two demonstration val-
ues are supplied: lower values (such as 0.25) cause the rise to be
sharper at the end and higher ones (such as 0.6) make a more gradual
ascent. In Natural Selection, this parameter is set by the velocity with
which the final repetition of the input note was played, such that
harder attacks produce smoother curves and vice versa.
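The exact expression can be read from the patch on the CD-ROM; one plausible approximation of a curve with the properties described here (an assumption of mine, not Campion's formula) is a power function whose exponent is the reciprocal of the shape parameter:

#include <cmath>

// Quasi-exponential curve returning values between 0 and 100.
// A small shape value (e.g. 0.25) makes the rise sharper at the end;
// a larger one (e.g. 0.6) gives a more gradual ascent. When input equals
// the maximum the function outputs 100, as described for figure 7.16.
double Curve(double input, double shape, double maximum)
{
    if (maximum <= 0.0 || shape <= 0.0)
        return 0.0;
    return 100.0 * std::pow(input / maximum, 1.0 / shape);
}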
The maximum value, determined by the third inlet, is the value
that when entered in the leftmost inlet will cause the expression to
output 100. Notice in figure 7.16 that the maximum has been set to
300. The value input on the left is 30, multiplied by 10 (300), which
yields an output from the expression of 100. The expression outlet
The preceding two sections detailed the use of sequenced and gener-
ation techniques in improvisation environments. A third algorithmic
composition style concerns transformation, in which material arriv-
ing from the outside (from the performance of a human partner, or
from another program) is transformed by processes that vary the ma-
terial, usually based on an analysis of its properties. Amnon Wol-
man’s composition New York for two player pianos was written for
and premiered by pianist Ursula Oppens in 1998. The two player
pianos are MIDI-equipped—one is played by the human pianist and
the other by a computer program. The performance from the human
is sent to a Max patch that combines pre-recorded sequences with
live transformations of the pianist’s material to form a counterpoint
that is output on the second player piano.
has been taken out. In this example we use the bang to send a clear
message back to urn, refilling it with the values 0 through 7 for an-
other round of selection.
The metro at the top of the patch sends bangs to both the urn
object and a counter. The output of urn is adjusted to yield eight
different values spaced evenly between 20 and 83. The counter tags
these with ID numbers ranging from one to three, which control
transmission of the urn values through the route object attached to
pack. The counter sends a zero to the toggle controlling the metro
when it has produced all three identifiers. This halts the selection
of items from the urn once the three identifiers have all been associ-
ated with new values.
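The urn object's behavior, random selection without repetition until the collection is refilled, can be sketched in C++ as follows; Urn, Refill(), and Draw() are my names rather than Max internals.

#include <cstdlib>
#include <vector>

// Draws each value 0 .. size-1 exactly once in random order, like Max's urn.
class Urn {
public:
    explicit Urn(int size) : range(size) { Refill(); }
    void Refill() {                        // equivalent to sending urn a clear message
        values.clear();
        for (int i = 0; i < range; i++)
            values.push_back(i);
    }
    bool Empty() const { return values.empty(); }
    int Draw() {                           // call Empty() first; removes one value at random
        int index = std::rand() % (int)values.size();
        int value = values[index];
        values.erase(values.begin() + index);
        return value;
    }
private:
    int range;
    std::vector<int> values;
};

// Following the patch: Urn urn(8); int point = 20 + 9 * urn.Draw();
// yields one of { 20, 29, 38, 47, 56, 65, 74, 83 } without repetition.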
The values spaced evenly between 20 and 83 finally reach a set
of three equality tests, as shown at the bottom of figure 7.18. These
operators compare the urn outputs against the values issuing from
a second counter (shown on the left-hand side of figure 7.18). The
big (left-hand) counter goes from 1 to 108 over a span of 54 seconds
(here I have used a metro with a pulse rate of 500 milliseconds—
in the piece a tempo object gives the performer control over the speed
of the big counter).
Whenever the big counter strikes one, the select object un-
derneath it sends a bang to the metro controlling the urn. This re-
commences the urn process, filling the equality operators with three
new values. When the big counter is between 20 and 83, its output
may match one of the three urn values. That span of 63 values occu-
pies a little more than half of the full range output by the big
counter, which will match one of them at three different points
within that span. Because the values are distributed among the
points { 20, 29, 38, 47, 56, 65, 74, 83 }, the hits of the counter will
fall randomly on one of eight possible pulses spaced 4500 millisec-
onds apart. Figure 7.19 expands the patch to show what happens
when a hit occurs.
ber is produced that is less than 6, it will fall into empty space from
the right outlet. This means that 15% of the time, no limit change
will be made, interrupting the regular 2 second rate of variation.
Figure 7.21 demonstrates one way in which Kimura makes the
chromatic patch interactive. MIDI pitch numbers coming from the
pitch-to-MIDI converter of her Zeta violin are mapped to start num-
bers for the chromatic scale generator. Two conditions affect the in-
teraction: first, the patch is only sensitive to violin pitches within
the range 54–70 as determined by the split object. This means that
notes from the bottom of the violin range through the B♭ above middle
C will affect the scales. Second, the insertion of the table object
provides a layer of indirection between her own melodic output and
the base notes of the scales. Rather than a slavish imitation of her
performance, the table gives the machine’s output a relationship to
the melodic material of the violin that nonetheless remains distinct.
The patch in figure 7.22 demonstrates another of the texture gener-
ators from Izquierda e Derecha. This one produces variations on a
number of triads based on the C-major scale that are saved in a coll
file. There are five stored triads, which are accessed in random order
by the urn object. The urn first outputs the five addresses for the
coll file in random order, at a rate of four per second (determined
by the metro object at the top of the patch). Once all of the addresses
have been generated, as in Amnon Wolman’s New York patch, the
urn refills its collection and turns off the metro. While the addresses
are being produced, they retrieve new triads from the coll file and
set these into a note list. The notes are associated with velocities
through a table and played out, along with a copy that is transposed
by 0–6 semitones. The transposition is not random but linear, gener-
ated by the counter object at the left of the patch. The counter is
banged only every six seconds, meaning that the transposition level
changes slowly relative to the quick successions of triads.
Figure 7.22 is included on the CD-ROM under the name ‘‘C
Chords’’ and is written to operate independently. In her composition,
Kimura interacts with the C Chord generator by turning the metros
on and off and changing their speeds interactively based on certain
pitch triggers. Here again we see her technique of using quasi-
random texture generators that are activated and influenced by her
onstage improvisation.
cesses, is that it does not repeat any value until all possible values
have been output. The RTC-lib includes a small filter called anti-bis
that can add this functionality to other kinds of random objects as
well (figure 7.24).
The anti-bis subpatch protects against repetition of the integers
presented to its inlet. When an integer arrives, it is compared against
the number last input to the right inlet of the expression object. If
the two numbers are different, the new input is fed through to the
left outlet of anti-bis and stored in the int object above the right
inlet of the expression. When the number reaches the int, it is both
stored there and transmitted through to the expression. Because the
expression is only evaluated when a number reaches its left inlet,
the number from the int simply waits to be compared to a subse-
quent new input whenever it arrives. If the expression finds the two
values to be the same, it outputs a bang from its right outlet. This
bang can be used to elicit another value from the process sending
inputs to anti-bis, thereby running it continually until it does not
repeat.
Figure 7.25 demonstrates how anti-bis can be used in conjunc-
tion with Brownian to provide random walk outputs without repeti-
tions. The metro continually bangs out values from Brownian.
Whenever one of its outputs proves to be a repetition, the bang from
the right outlet of anti-bis is fed back into Brownian, eliciting
other values until one is generated that is not a repetition of the last.
The same technique can be used with any random generator that pro-
duces output on receiving a bang.
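Outside of Max the anti-repetition idea reduces to a small loop: keep asking the generator for values until one differs from the last output. The sketch below assumes a caller-supplied generator function and is not a transcription of Essl's patch.

// Query a random generator until it produces a value different from the
// previous output, mirroring the anti-bis / Brownian combination.
// Assumes the generator can produce more than one distinct value.
int NextWithoutRepetition(int (*generator)(), int& lastOutput)
{
    int value = generator();
    while (value == lastOutput)        // the "bang back into Brownian" path
        value = generator();
    lastOutput = value;
    return value;
}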
The CD-ROM includes Brownian, anti-bis, and their combina-
tion. The reader is referred to Karlheinz Essl’s clear and instructive
RTC-lib and the Lexicon-Sonate itself for more examples of their use.
scribes some of his early work: ‘‘In the live electronic pieces I com-
posed in the 1960s and early 1970s . . . it seemed to make sense to
double, triple or quadruple the amount of hardware for use in perfor-
mances. The sound textures were enriched and several performers
could play together. The ‘scores’ consisted of general instructions,
rather than of specific commands governing moment-to-moment ac-
tions. Inevitably a kind of counterpoint would result as the perform-
ers pursued their individual paths while listening to one another’’
(Kuivila and Behrman 1998, 15).
One of the oldest and best-known collectives is The Hub, a
group of six programmer/improvisers who have developed various
strategies for performing interactively with composing computers
(Gresham-Lancaster 1998). The development of MIDI allowed them
to formulate a fully interconnected network arrangement in which
the computer of any member could communicate MIDI messages to
the computer of any other. Tim Perkis devised the piece Waxlips
(1991) for this configuration: ‘‘Waxlips . . . was an attempt to find
the simplest Hub piece possible, to minimize the amount of musical
structure planned in advance, in order to allow any emergent struc-
ture arising out of the group interaction to be revealed clearly. The
rule is simple: each player sends and receives requests to play one
note. Upon receiving the request, each should play the note re-
quested, then transform the note message in some fixed way to a dif-
ferent message, and send it out to someone else’’ (Perkis 1995). The
rule applied by each member remained constant during each section
of the piece. One lead player could change sections by sending a
message to the other players, and simultaneously kick the new sec-
tion into motion by ‘‘spraying the network with a burst of requests’’
(Perkis 1995).
Another collective is Sensorband, three improvisers (Edwin van
der Heide, Zbigniew Karkowski, and Atau Tanaka) who have special-
ized in the use of new controllers, some of which require the manipu-
lations of all three. One of these is the Soundnet, a very large web
of ropes in which sensors are embedded. Sensorband performs on
the instrument by actually climbing on it—the resulting sound is
produced from their combined movements on the net.
The rope, the metal, and the humans climbing on it require intense
physicality, and focus attention on the human side of the human-
machine interaction, not on mechanistic aspects such as interrupts,
mouse clicks, and screen redraws. Sensorband has chosen to work
with digital recordings of natural sounds. The signals from Soundnet
control DSPs that process the sound through filtering, convolution,
and waveshaping. Natural, organic elements are thus put in direct
confrontation with technology. The physical nature of movement
meeting the virtual nature of the signal processing creates a dynamic
situation that directly addresses sound as the fundamental musi-
cal material. Through gesture and pure exertion, the performers
sculpt raw samples to create sonorities emanating from the huge net.
(Bongers 1998, 15)
The problem with ensemble improvisation, beyond the basic tech-
nical difficulties mentioned by Manoury, is one of designing an ap-
propriate control structure. As with many of the analytical systems
discussed in this text, arbitration between competing sources of in-
formation becomes harder as the number of discrete sources in-
creases and their interaction becomes more complex. In the case of
Soundnet, the integration of three ‘‘information sources’’ is accom-
plished by the interface itself. The input to the sound-producing al-
gorithm is simply the output of the web as a whole.
its reactions to each part and is likely to be far too dense to be useful
in an ensemble setting. Given the problems inherent in simply prolif-
erating copies of the program with each additional player, the second
strategy was to employ a critic. The Cypher critic is a separate copy
of the listener that reviews output from the program before it is
played. If the critic finds certain conditions are true of the incipient
output, it can change the material before the actual performance. In-
structing the critic to reduce the density of response when several
players are active improves the contribution of the program notice-
ably. Figure 7.27 illustrates this revision to the architecture.
The computer is a better member of an ensemble when it has a
sense of what the group as a whole is doing. The next strategy of
Multiple Cypher, then, was to develop a meta-listener that compares
the analyses of individual players within an ensemble to arrive at a
global characterization of the performance.
Figure 7.28 shows how the meta-listener fits in the architecture of
the program as a whole. As before, each player is tracked by a sepa-
rate copy of the Cypher listener. Now, these individual listener
reports are passed to the meta-listener, which compares and
summarizes the individual reports before passing the information on
to a single player. The user of the program determines how the player
if (Speed(Player 2) == kSlow)
Accelerate(Input(Player 2));
Along with characterizations of each player individually, the
meta-listener sends a global characterization of the ensemble perfor-
mance as a whole.
Let us consider how feature analyses from individual players can
be combined to arrive at an ensemble classification. An obvious pos-
sibility is what I will call the mean-value strategy. To arrive at a mean
loudness classification, for example, one would sum the loudness
values of all members of the ensemble and divide by the number of
players. While the mean loudness is certainly useful, it can also mis-
lead the program as to the musical nature of the performance. For
example, if one member of a three-player ensemble is playing an ex-
tremely loud solo, the mean-value strategy would still find that the
group as a whole was playing softly (because the other two perform-
ers are not playing at all).
To compensate for this type of confusion, the meta-listener sends
as part of its global report a continually updated analysis of the play-
ers’ relative levels of activity. A player who is not performing is cer-
tainly considered inactive. Moreover, the program takes players who
are at significantly lower levels of loudness, speed, and density to
be less active as well. Now a mean loudness for all active players can
be computed and sent as a message distinct from the overall mean
loudness report. A rule using some of these messages might be:
if ((Loudness(activePlayers) == kLoud) &&
    (Number(activePlayers) > 1))
Silence();
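Restated in C++, the active-player computation and the two loudness reports might be sketched like this; the structure, field names, and thresholds are invented for illustration rather than taken from Multiple Cypher.

#include <vector>

struct PlayerReport {                  // one listener's analysis of one player
    bool playing;
    int  loudness;                     // e.g. 0-127
    int  speed;
    int  density;
};

// A player counts as active when playing and not markedly quieter,
// slower, and sparser than the given floors.
bool IsActive(const PlayerReport& p, int loudFloor, int speedFloor, int densityFloor)
{
    if (!p.playing)
        return false;
    return !(p.loudness < loudFloor &&
             p.speed    < speedFloor &&
             p.density  < densityFloor);
}

// Mean loudness over all players, or over active players only.
int MeanLoudness(const std::vector<PlayerReport>& players, bool activeOnly,
                 int loudFloor, int speedFloor, int densityFloor)
{
    int sum = 0, count = 0;
    for (int i = 0; i < (int)players.size(); i++) {
        if (activeOnly && !IsActive(players[i], loudFloor, speedFloor, densityFloor))
            continue;
        sum += players[i].loudness;
        count++;
    }
    return count ? sum / count : 0;
}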
8.1.1 Interactor
The entire application of Intimate Immensity was written in Inter-
actor, a graphic interaction programming environment for Apple
Macintosh computers. Interactor was designed by Mark Coniglio and
Morton Subotnick and implemented by Coniglio. Interactor pro-
grams process several types of events, including most notably MIDI,
timing, and Macintosh events. Statements in Interactor generally fol-
low an if-then logic flow, where attributes of incoming events are
evaluated to determine whether they match some conditions listed
in the if part of a statement. When a condition is found to be true,
additional operators (the then part) are executed. Such if-then se-
quences can occur several times in a single statement; whenever a
conditional operator returns a false value, the current statement ends
and the next one begins.
Figure 8.2 shows a scene edit window. Three statements, made up
of operators, form the basic building blocks of Interactor. Operators
evaluate conditions or execute actions. One operator in the first state-
ment will evaluate true when the scene (collection of statements) is
opened, but not again until the scene is restarted. Indicated by the
small x at the side, this kind of operator is used to initialize various
conditions in the scene. In this case the statement simply informs
the user that the rest of the scene is now active.
In the other two statements, the first operator is a conditional look-
ing for a MIDI note on message within a particular velocity range.
The box below the operator is a comment showing the values of the
operator’s parameters. These parameters can be modified simply by
double-clicking on the operator, which brings up a dialog box
wherein parameter values can be modified textually. In the figure
we see that statement 2 will react to a middle C played on any MIDI
channel with a velocity between 0 and 63. The first operator of state-
ment 3 evaluates true when the same pitch is played but with a veloc-
ity between 64 and 127. In effect, statement 2 is the ‘‘soft C’’ handler,
while statement 3 is the ‘‘loud C’’ handler.
In both cases, an action operator (send note) that outputs a triad
engages next. Action operators are always true—that is, any addi-
tional operators after an action will be executed until a conditional
operator is encountered that tests false, or the end of the statement
is found. Notice that the ‘‘soft C’’ send note operator will trigger an
A♭-major triad just below middle C played with a velocity of 64 and
a duration of one beat. The ‘‘loud C’’ handler triggers a C-major triad
with a velocity of 120.
Interactor supports eight simultaneous multi-channel sequencers.
There are also eight independent timebases so that each sequencer
can have its own temporal behavior. Scheduling operators use the
timebases as well, e.g., the Delay Timer which postpones the execu-
tion of subsequent operators in a statement by some number of ticks.
Time is expressed in measures, beats, and ticks, where 480 ticks
equal one quarter note. This allows tempo to be varied while individ-
ual events maintain their relationship to an underlying metric struc-
ture. We see a timebase in operation in the send note operators of
statements 2 and 3 in figure 8.2: the duration of the triads is specified
relative to the first timebase (T1) and given a value of 1.0 beats. The
actual duration in milliseconds of the triads, then, depends on the
speed of T1 and will change as T1 changes.
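The arithmetic behind this is straightforward. Assuming a timebase expressed as a quarter-note tempo in beats per minute, a duration given in beats (or in ticks, with 480 ticks to the quarter note) converts to milliseconds as follows; the function names are mine.

// Convert durations expressed relative to a timebase into milliseconds.
const int kTicksPerQuarter = 480;

double BeatsToMilliseconds(double beats, double tempoBpm)
{
    return beats * (60000.0 / tempoBpm);       // one beat lasts 60000/tempo ms
}

double TicksToMilliseconds(long ticks, double tempoBpm)
{
    return BeatsToMilliseconds((double)ticks / kTicksPerQuarter, tempoBpm);
}
// At 120 BPM the 1.0-beat triads of figure 8.2 last 500 ms;
// at 60 BPM the same triads last 1000 ms.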
Interactor shows some obvious similarities to Max, in both its
graphic orientation and the kinds of interactive applications it was
written to support. There are important differences as well: (1) Inter-
actor’s left-to-right statement construction enforces a particular
graphic convention that is consistent and clear. What graphic regu-
larities exist in Max arise from the execution order, namely that con-
trol flows from top to bottom and from right to left. Any consistency
beyond that is left up to the user. (2) The use of registers and lists
to store input information and pass data between operators is a sig-
nificant departure from Max’s patchcords. (3) The combination of
sequencers and timebases in Interactor provides particularly power-
ful support for the interactive performance of sequenced material.
Max’s seq and mt can be used to accomplish the same things that
Interactor does, but the environment itself is not organized around
them to the same extent that Interactor is based on its sequencers.
Figure 8.3 shows a characteristic Interactor statement from the
opening sections of Intimate Immensity. This is the 26th statement
in a collection of 63 contained in one scene, all simultaneously com-
paring their start conditions with the state of the onstage perfor-
mance. In figure 8.3, the first operator tests the contents of the r100
register. If r100 contains the value 2, execution advances to the fol-
lowing operator. This looks for any Note On event coming from the
‘‘infrared’’ channel, corresponding to one of the motion sensors that
transmits information to the computer by way of MIDI messages. The
infrared sensing is so acute that even the blinking of the Cyber-
Angel’s eye (figure 8.4) can be used to trigger events.
When such an event from the infrared channel is found, it is sent
through a ‘‘time filter’’ (the third operator). The time filter performs
the same function as Max’s speedlim; in other words, it filters out
repetitions of events that occur before a certain duration has passed.
When a note passes the time filter, operators 4 through 10 per-
form the following actions: (4) continuous controller 64 is changed
from 127 to 0 over a duration of 4 beats; (5) a random number be-
tween 1 and 6 is stored in register 22; (6) the first sequencer (S1) is
audio excerpts with some slide shows that demonstrate visual as-
pects of the composition.
8.2.1 Woids
The CD-ROM includes two applications with source code, one called
Woids and the other Flock. Woids animates word sets from the Ca-
netti text using parameters set in a dialog box, while Flock performs
animation and launches video clips from an analysis of a perfor-
mance arriving over a MIDI line. In this section I will briefly intro-
duce some of the calculations that produce the flocking behavior
demonstrated by the Woids. The code is my C++ port of Eric
Singer’s Woids program, itself adapted from Simon Fraser’s imple-
mentation of the Boids algorithm by Craig Reynolds.
The AvoidWalls function is typical of the routines that collec-
tively compute the direction and speed of a Woid’s motion (figure
8.8). The Velocity struct records how much horizontal and how
much vertical movement each Woid should make at the end of a
Figure 8.8 AvoidWalls() function
8.3 In Transit
bass note, and a list of intervals (figure 8.12). The root is a pitch class
and forms the lower member of all of the intervals recorded in the
intervals list. That is, if root is 5 it represents the pitch class F (5
semitones above C). An interval of 10 in the intervals list, then,
would be mapped to E♭, 10 semitones above F.
The intervals list is zero-terminated. There may be up to kInter-
valMax-1 intervals in any IntervalChord, but following the final
interval in the list there must be a zero.
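Figure 8.12 gives the actual declaration; the following is only a schematic rendering of that description, with kIntervalMax shown as a placeholder constant, to make the zero-terminated convention concrete.

const int kIntervalMax = 16;           // placeholder size; the real constant is in figure 8.12

struct IntervalChord {
    int root;                           // pitch class: 0 = C, 5 = F, and so on
    int intervals[kIntervalMax];        // semitones above the root, zero-terminated
};

// Expand the interval list into pitch classes.
// With root = 5 (F), an interval of 10 yields pitch class 3 (E-flat).
int ExpandChord(const IntervalChord& chord, int pitchClasses[], int maxOut)
{
    int count = 0;
    for (int i = 0; chord.intervals[i] != 0 && count < maxOut; i++)
        pitchClasses[count++] = (chord.root + chord.intervals[i]) % 12;
    return count;                       // number of pitch classes written
}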
The frequency values in a chord collection are normalized such
that the sum of frequencies for each chord type is equal to the con-
stant value kCertain. The function Select() uses these frequency
values to determine which chord of a given type to choose (figure
8.13). The first operation is to generate a random number between 0
and kCertain-1. Select() then scans through all of the chords of
the requested type, adding the frequency associated with each chord
to sum. Once sum exceeds the random number, the chord currently
reached in the search is chosen and returned. In this way, the fre-
quency associated with a chord corresponds to the likelihood that it
will be chosen with Select(). Finally the root of the output chord
is determined by adding inRoot to the root of the chord selected
from the collection.
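In outline, and not as Dannenberg's code, the selection scheme just described looks like this; the value of kCertain and the ChordEntry fields are illustrative assumptions.

#include <cstdlib>

const int kCertain = 1000;             // illustrative: frequencies per chord type sum to this

struct ChordEntry {
    int root;                           // root of the stored chord
    int frequency;                      // likelihood weight, normalized within its type
};

// Choose one chord of the requested type with probability proportional to
// its frequency, then transpose its root by inRoot.
int SelectRoot(const ChordEntry chords[], int numChords, int inRoot)
{
    int target = std::rand() % kCertain;
    int sum = 0;
    for (int i = 0; i < numChords; i++) {
        sum += chords[i].frequency;     // accumulate weights until the target is passed
        if (sum > target)
            return (chords[i].root + inRoot) % 12;
    }
    return inRoot;                      // fallback if the weights do not cover kCertain
}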
The full selection process used for In Transit combines the Markov
selection with a constraint that tries to harmonize predicted pitches
against a chord to be played. When the program is about to complete
a chord transition, learned from a series of jazz standards, it assumes
that the performer will move up a half-step when the second chord
of the transition is played. Therefore it tries to select a chord that
contains or is consonant with the pitch one halfstep above where the
performer currently is. In this way the actual chord selected will be
affected by both the probabilities and the influence of the player,
who can direct the harmonization to a certain zone by leading to the
desired pitch.
The chord identification processes developed in chapter 2 reduce
a chord to a collection of pitch classes in which the relative and abso-
lute locations of the component notes are discarded. Certainly one of
the most critical aspects of chord performance in jazz piano playing,
however, is voicing. Voicing refers to the distribution of the constit-
uent pitches of a chord across the keyboard. When we move from
chord analysis to chord synthesis, therefore, we quickly encounter
the necessity of describing the voicing that will be used to perform
any particular abstract type.
Roger Dannenberg represents chords in two different ways in the
In Transit software. The first is an IntervalChord, shown in figure
8.12, and is used for the calculation in Select(). The second repre-
sentation is a VoicingChord (figure 8.14). A VoicingChord main-
tains the pitches of the chord as MIDI note numbers, not as intervals
above a root. There are two arrays of note numbers, one for the ‘‘left
hand’’ (lh) and one for the ‘‘right’’ (rh). Both lists are zero-terminated
and each is assumed to record note numbers in ascending order. The
8.4.1 Controllers
The most striking aspect of performances by this group is their use
of several original and highly expressive controllers. Tarabella’s very
definition of the term interaction suggests the degree to which their
Figure 8.18 shows Nota, a Max object implementing the duty cy-
cle parameter. It is simply makenote with a multiplication added to
the duration inlet that changes the length of the note according to
the duty value. The duty is initialized to 1.0, so that it will have no
effect until it is changed. Nota retains the most recent duty value
whenever one is sent. The outlets of Nota are normally sent to a
noteout object.
Figure 8.19 shows Nota being used to vary the articulation of ran-
dom notes. With every bang from the metro object, the counter
will increase the duty parameter linearly from 0.7 to 1.0. (The
counter uses integer values, so we divide integers ranging from 70
(comparing input n to input n-1), Peak will only be true when the
current input is higher than the last. In that case the gate on the left-
hand side of the patch is opened and the new zone value allowed to
pass through to the DoorZoneDom transmitter.
The Liner patch (figure 9.2) performs an operation similar to but
more general than that of Dominant Zone. The input to Liner is
information from the iCube device merging all of the sensors trained
on the visitors. Any sensor that changes sends its ID number and new
value to the zones send/receive pair. The expression object then
calculates a new position in a line between the two zones tracked
by a given instance of Liner (which two IDs are tracked depends on
the arguments #1 and #2 used to instantiate the object). This position
is sent through the left outlet together with a trend that tracks as-
cending or descending motion of the expression value, similar to the
ascending sensitivity of Dominant Zone. The trend is determined
by Bucket, this time connected to a Trough object. Trough com-
pares input n to n-1 and outputs a zero if n is greater than n-1 and
a one if it is not.
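The trend logic reduces to comparing each input with its predecessor. A compact C++ restatement, with the class and method names my own, follows.

// Trend detection in the spirit of Bucket feeding Peak or Trough:
// Peak is true when input n is greater than input n-1;
// Trough outputs 0 when n is greater than n-1 and 1 otherwise.
class Trend {
public:
    Trend() : previous(0), havePrevious(false) {}
    bool Peak(int value) {
        bool rising = havePrevious && (value > previous);
        previous = value;               // remember input n for the next comparison
        havePrevious = true;
        return rising;
    }
    int Trough(int value) {
        return Peak(value) ? 0 : 1;
    }
private:
    int  previous;
    bool havePrevious;
};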
Thecla Schiphorst’s installation work not only encourages the visi-
tor to explore the relationship between their actions and the techni-
The purely audio relationship between the visitors and their dark-
ened environment often provokes a quite visceral reaction. The
threatened violence of the aural situation even makes it impossible
for some to cross the space.
Fit produces another kind of physical reaction: the work consists
of a video image of an aerobics instructor. ‘‘When a viewer moves
in front of the image, music begins and the instructor starts exercis-
ing. When a viewer stops moving, the instructor also stops exercising
and the music becomes silence. Each time a viewer begins moving
his or her body, the instructor begins a new exercise with music. If
a viewer moves non-stop, over time the instructor will exercise faster
and change every 6 seconds to increasingly quicker routines. If a
viewer exercises for 30 seconds non-stop, the instructor and music
several groups, each typified by its own interface and outputs. For
example, a device called the Singing Tree used a microphone to sam-
ple the voice of a visitor. A dedicated PC analyzed 10 features of the
singing voice: as these indicated an increasingly stable tone at a sin-
gle pitch, a resynthesis of the visitor’s voice became more ‘‘pleasing,’’
and an animated image appeared on a monitor before them. ‘‘When
the voice falters, the animation rewinds into a set of simpler images.
The audio and video feedback on the singing voice has proven quite
effective; the tonal and visual rewards encourage even poor amateurs
to try for a reasonable tone’’ (Paradiso 1999, 133).
Another large component was the Rhythm Tree, a collection of 320
drumpads grouped into 10 strings of 32 pads each. Each drumpad
detected when it was struck by a visitor’s hand and identified the
type of stroke used. (See the CD-ROM for a video clip of visitors inter-
acting with the Rhythm Tree.) The information generated by visitors’
input controlled percussion sounds and illumination of the pads
struck.
Material sampled from the interactive environment was woven
into the staged composition, itself performed using a group of three
interactive hyperinstruments. The Brain Opera was a groundbreak-
ing experience in the organization and performance of very large-
scale interactive works: beyond its impressive technical prowess, the
piece explored several aesthetic issues surrounding the integration
of input from non-musicians and the synthesis of unrelated amateur
performances.
The Brain Opera’s musical mappings and parametric sequences ran
independently on each Lobby instrument. Although this satisfied in-
dividual players (many of whom were acoustically isolated by wear-
ing headphones or were near appropriate speakers), the overall
sound of The Brain Opera Lobby quickly dropped to the familiar,
stochastic level of an arcade. . . . In general, future research is needed
to address the balance between overall and local experiences, e.g.,
selecting and coordinating the audio responses over a network
to enable large installations like The Brain Opera to sound more
selects intervals from one of a number of stored scales and adds these
to the found chord root. Rhythmic constants specified by the user
determine how many notes may be played per beat and what percent-
age of the possible beats will actually be articulated by performed
notes.
As the musical improvisation is being generated, messages are sent
to IMPROV that influence the animated behavior of Willy on the
screen. Typical messages include instructions to turn right, turn left,
lean back, tap foot, and so on. The system currently runs on two
computers: music analysis and generation is written in C++ and per-
formed on an Apple Macintosh, while IMPROV is implemented in
Java and VRML and runs under any VRML2-compliant browser (cur-
rently, the CosmoPlayer on an SGI). Communication from the analy-
each note from the scale will be taken in order. No matter which
member of the path is selected, the scale index will always be incre-
mented by one. If, instead, the first path were chosen (1,1,⫺2,1), the
scale would be read in order except once every four notes, when the
scale reader would jump back two places instead of ahead one. Taken
with the scale C, D, E, F, G, A, B, the first path would produce C, D, E, C,
D, E, F, D, E, F, G, E, etc.
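The traversal can be sketched directly; the small program below (my own, not the installation code) cycles through the path, wrapping the scale index past either end, and prints the sequence given above.

#include <cstdio>

int main()
{
    const char* scale[] = { "C", "D", "E", "F", "G", "A", "B" };
    const int scaleSize = 7;
    const int path[] = { 1, 1, -2, 1 };
    const int pathSize = 4;

    int index = 0;                                     // start on the first scale degree
    for (int i = 0; i < 12; i++) {
        std::printf("%s ", scale[index]);
        int step = path[i % pathSize];                 // next move along the repeating path
        index = (index + step + scaleSize) % scaleSize;   // wrap past either end of the scale
    }
    std::printf("\n");                                 // prints: C D E C D E F D E F G E
    return 0;
}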
The FindIndex() routine is the first step in finding the address
of the scale member to be used at any given moment (figure 9.5).
It begins by calculating the distance between the last scale degree
produced and the first degrees past the end or before the beginning
of the scale. This is to cover the case in which a scale should be
Antonio Camurri and his colleagues have spent several years re-
searching and building interactive multimodal environments (MEs)
‘‘conceived of as an audio-visual environment which can be used to
Allen, P., and R. Dannenberg. 1990. Tracking musical beats in real time. In Proceedings
of the 1990 International Computer Music Conference. San Francisco: International
Computer Music Association, 140–43.
Arcos, J., L. Mántaras, and X. Serra. 1998. Saxex: A case-based reasoning system for generat-
ing expressive musical performances. Journal of New Music Research 27(3):194–210.
Ashley, R. 1989. Modeling music listening: General considerations. Contemporary Music
Review 4:295–310.
Baisnée, P., J.-B. Barrière, O. Koechlin, and R. Rowe. 1986. Real-time interactions between
musicians and computer: Live performance utilisations of the 4X musical workstation.
In Proceedings of the 1986 International Computer Music Conference. San Francisco:
International Computer Music Association.
Baker, D. 1983. Jazz improvisation: A comprehensive method for all musicians. Van Nuys,
CA: Alfred Publishing Company.
Bent, I. 1980. Analysis. New York: W.W. Norton & Company.
Berg, P. 1998. Using the AC Toolbox. Distributed by the author at http://www.koncon.nl/
ACToolbox/
Berliner, P. 1994. Thinking in jazz: The infinite art of improvisation. Chicago: The Univer-
sity of Chicago Press.
Berry, W. 1987. Structural functions in music. New York: Dover Publications.
Bharucha, J. 1987. Music cognition and perceptual facilitation: A connectionist framework.
Music Perception 5:1–30.
Bharucha, J. 1999. Neural nets, temporal composites, and tonality. In The psychology of
music. 2nd Ed. Ed. D. Deutsch. London: Academic Press.
Bharucha, J., and P. Todd. 1989. Modeling the perception of tonal structure with neural
nets. Computer Music Journal 13(4):44–53.
Bigand, E. 1993. The influence of implicit harmony, rhythm, and musical training on the
abstraction of ‘‘tension-relaxation schemas’’ in tonal musical phrases. Contemporary
Music Review 9(1,2):123–137.
Biles, J. 1994. GenJam: a genetic algorithm for generating jazz solos. In Proceedings of the
1994 International Computer Music Conference. San Francisco: International Com-
puter Music Association.
Biles, J. 1998. Interactive GenJam: Integrating real-time performance with a genetic algo-
rithm. In Proceedings of the 1998 International Computer Music Conference. San Fran-
cisco: International Computer Music Association.
Bilmes, J. 1993. Timing is of the essence: Perceptual and computational techniques for
representing, learning, and reproducing expressive timing in percussive rhythm. Mas-
ter’s thesis, Massachusetts Institute of Technology.
Bishop, C. 1995. Neural networks for pattern recognition. Oxford: Clarendon
Press.
Bloch, J., and R. Dannenberg. 1985. Real-time computer accompaniment of keyboard per-
formances. In Proceedings of the 1985 International Computer Music Conference. San
Francisco: International Computer Music Association.
Bobrow, D., and T. Winograd. 1977. An overview of KRL, a knowledge representation lan-
guage. Cognitive Science 1(1):3–46.
Boden, M. 1994. What is creativity? In Dimensions of creativity. Ed. Margaret A. Boden.
Cambridge, MA: The MIT Press.
Bongers, B. 1998. An interview with Sensorband. Computer Music Journal 22(1):13–24.
Bregman, A. 1990. Auditory scene analysis. Cambridge, MA: The MIT Press.
Bresin, R. 1998. Artificial neural networks based models for automatic performance of mu-
sical scores. Journal of New Music Research 27(3):239–270.
Brown, H., D. Butler, and M. Riess Jones. 1994. Musical and temporal influences on key
discovery. Music Perception 11:371–407.
Butler, D. 1989. Describing the perception of tonality in music: A critique of the tonal
hierarchy theory and a proposal for a theory of intervallic rivalry. Music Perception
6:219–242.
Butler, D., and W. Ward. 1988. Effacing the memory of musical pitch. Music Perception 5:
251–260.
Cambouropoulos, E. 1997. Musical rhythm: A formal model for determining local bound-
aries, accents and metre in a melodic surface. In Music, gestalt, and computing: Stud-
ies in cognitive and systematic musicology. Ed. M. Leman. Berlin: Springer.
Camurri, A., A. Catorcini, C. Innocenti, and A. Massari. 1995. Music and multimedia
knowledge representation and reasoning: The HARP system. Computer Music Journal
19(2):34–58.
Camurri, A., and M. Leman. 1997. Gestalt-based composition and performance in
multimodal environments. In Music, gestalt, and computing: Studies in cognitive and
systematic musicology. Ed. M. Leman. Berlin: Springer.
Camurri, A., and P. Ferrentino. 1999. Interactive environments for music and multimedia.
Multimedia Systems 7:32–47.
Camurri, A., M. Ricchetti, and R. Trocca. 1999. EyesWeb: Toward gesture and affect recog-
nition in dance/music interactive systems. In Proceedings of IEEE multimedia systems
1999, Florence.
Canazza, S., G. De Poli, G. Di Sanzo, and A. Vidolin. 1998. A model to add expressiveness
to automatic musical performance. In Proceedings of the 1998 International Computer
Music Conference. San Francisco: International Computer Music Association.
Canazza, S., G. De Poli, A. Rodà, and A. Vidolin. 1997. Analysis by synthesis of the expres-
sive intentions in musical performance. In Proceedings of the 1997 International Com-
puter Music Conference. San Francisco: International Computer Music Association.
Canetti, E. 1984. Crowds and power. Trans. Carol Stewart. New York: Farrar Straus Giroux.
Carreras, F., M. Leman, and M. Lesaffre. 1999. Automatic description of musical signals
using schema-based chord decomposition. Journal of New Music Research.
Chadabe, J. 1989. Interactive composing: An overview. In The music machine. Ed. C.
Roads. Cambridge, MA: The MIT Press.
Chadabe, J. 1997. Electric sound: The past and promise of electronic music. Upper Saddle
River, NJ: Prentice-Hall.
Chafe, C., B. Mont-Reynaud, and L. Rush. 1989. Toward an intelligent editor of digital
audio: Recognition of musical constructs. In The music machine. Ed. C. Roads. Cam-
bridge, MA: The MIT Press.
Desain, P., and H. Honing. 1989. The quantization of musical time: A connectionist ap-
proach. Computer Music Journal 13(3).
Desain, P., and H. Honing. 1994a. Foot-tapping: A brief introduction to beat induction. In
Proceedings of the 1994 International Computer Music Conference. San Francisco:
International Computer Music Association, 78–79.
Desain, P., and H. Honing. 1994b. Rule-based models of initial-beat induction and an analy-
sis of their behavior. In Proceedings of the 1994 International Computer Music Confer-
ence. San Francisco: International Computer Music Association.
Desain, P., and H. Honing. 1994c. Advanced issues in beat induction modeling: Syncopa-
tion, tempo and timing. In Proceedings of the 1994 International Computer Music Con-
ference. San Francisco: International Computer Music Association.
Desain, P., and H. Honing. 1999. Computational models of beat induction: The rule-based
approach. Journal of New Music Research 28(1):29–42.
Deutsch, D. 1999a. Grouping mechanisms in music. In The psychology of music. 2nd Ed.
Ed. D. Deutsch. London: Academic Press.
Deutsch, D. 1999b. The processing of pitch combinations. In The psychology of music. 2nd
Ed. Ed. D. Deutsch. London: Academic Press.
Deutsch, D., and J. Feroe. 1981. The internal representation of pitch sequences in tonal
music. Psychological Review 88:503–522.
Dobbins, B. 1994. A creative approach to jazz piano harmony. Advance Music.
Dolson, M. 1991. Machine tongues XII: Neural networks. In Music and connectionism. Ed.
P. Todd and G. Loy. Cambridge, MA: The MIT Press.
Dowling, W., and D. Harwood. 1986. Music cognition. New York: Academic Press.
Duckworth, W. 1998. A creative approach to music fundamentals. 6th Ed. Belmont, CA:
Wadsworth Publishing Company.
Ebcioglu, K. 1992. An expert system for harmonizing chorales in the style of J.S. Bach. In
Understanding music with AI: Perspectives on music cognition. Ed. O. Laske and M.
Balaban. Cambridge, MA: The AAAI Press/The MIT Press.
Forte, A. 1973. The structure of atonal music. New Haven, CT: Yale University Press.
Friberg, A. 1991. Generative rules for music performance: A formal description of a rule
system. Computer Music Journal 15(2):56–71.
Gabrielsson, A. 1988. Timing in music performance and its relations to music experience.
In Generative processes in music: The psychology of performance, improvisation, and
composition. Ed. J. Sloboda. Oxford: Clarendon Press.
Gabrielsson, A. 1995. Expressive intention and performance. In Music and the mind ma-
chine. Ed. R. Steinberg. Berlin: Springer-Verlag.
Garton, B., and M. Suttor. 1998. A sense of style. In Proceedings of the 1998 International
Computer Music Conference. San Francisco: International Computer Music Associa-
tion.
Garton, B., and D. Topper. 1997. RTcmix: Using CMIX in real time. In Proceedings of the
1997 International Computer Music Conference. San Francisco: International Com-
puter Music Association.
Gjerdingen, R. 1988. A classic turn of phrase: Music and the psychology of convention.
Philadelphia: University of Pennsylvania Press.
Gjerdingen, R. 1990. Categorization of musical patterns by self-organizing neuronlike net-
works. Music Perception 8:339–370.
Gjerdingen, R. 1999. Apparent motion in music? In Musical networks: Parallel distributed
perception and performance. Ed. N. Griffith and P. Todd. Cambridge, MA: The MIT
Press.
Machover, T. 1999. Technology and the future of music. Interview by F. Oteri, August 18,
1999. NewMusicBox 6. http://www.newmusicbox.org.
Manoury, P. 1984. The role of the conscious. Contemporary Music Review 1:147–156.
Manoury, P. 1991. Jupiter. Paris: Éditions Musicales Amphion.
Marsden, A. 1992. Modelling the perception of musical voices. In Computer representa-
tions and models in music. Ed. A. Marsden and A. Pople. London: Academic Press.
Marsden, A., and A. Pople, eds. 1992. Computer representations and models in music.
London: Academic Press.
Marslen-Wilson, W., and L. Tyler. 1984. The temporal structure of spoken language under-
standing. Cognition 8:1–71.
Martin, K., E. Scheirer, and B. Vercoe. 1998. Music content analysis through models of
audition. ACM multimedia 1998 workshop on content processing of music for multi-
media applications. San Francisco: Association for Computing Machinery.
Maxwell, H. 1992. An expert system for harmonic analysis of tonal music. In Understand-
ing music with AI. Ed. M. Balaban, K. Ebcioglu, and O. Laske. Cambridge, MA: The
MIT Press.
McAdams, S. 1987. Music: A science of the mind? Contemporary Music Review 2(1):1–62.
McClelland, J., and D. Rumelhart. 1988. Explorations in parallel distributed processing.
Cambridge, MA: The MIT Press.
McEwan, I. 1998. Amsterdam. London: Jonathan Cape.
Meehan, J. 1980. An artificial intelligence approach to tonal music theory. Computer Music
Journal 4(2):64.
Messiaen, O. 1997. Traité de Rythme, de Couleur, et d’Ornithologie (Tome IV). Paris: Al-
phonse Leduc.
Minsky, M. 1985. A framework for representing knowledge. In Readings in knowledge
representation. Ed. R. Brachman and H. Levesque. Los Altos, CA: Morgan Kaufmann
Publishers.
Miranda, E., ed. 1999. Readings in music and artificial intelligence. The Netherlands: Har-
wood Academic Publishers.
Mitchell, M. 1996. An introduction to genetic algorithms. Cambridge, MA: The MIT Press.
Mozer, M. 1991. Connectionist music composition based on melodic, stylistic, and psycho-
physical constraints. In Music and connectionism. Ed. P. Todd and D.G. Loy. Cam-
bridge, MA: The MIT Press.
Mozer, M. 1993. Neural net architectures for temporal sequence processing. In Predicting
the future and understanding the past. Ed. A. Weigend and N. Gershenfeld. Reading,
MA: Addison-Wesley.
Myers, C., and L. Rabiner. 1981. A level building dynamic time warping algorithm for
connected word recognition. IEEE Transactions on Acoustics, Speech, and Signal Pro-
cessing ASSP-29(2):284–297.
Narmour, E. 1977. Beyond Schenkerism: The need for alternatives in music analysis. Chi-
cago: University of Chicago Press.
Narmour, E. 1990. The analysis and cognition of basic melodic structures: The implication-
realization model. Chicago: University of Chicago Press.
Narmour, E. 1999. Hierarchical expectation and musical style. In The psychology of music.
2nd ed. Ed. D. Deutsch. London: Academic Press.
Nigrin, A. 1993. Neural networks for pattern recognition. Cambridge, MA: The MIT Press.
Nordli, K. 1997. MIDI extensions for musical notation (1): NoTAMIDI meta-events. In Be-
yond MIDI: The handbook of musical codes. Ed. E. Selfridge-Field. Cambridge, MA:
The MIT Press, 73–79.
Selfridge-Field, E., ed. 1997a. Beyond MIDI: The handbook of musical codes. Cambridge,
MA: The MIT Press.
Selfridge-Field, E. 1997b. Describing musical information. In Beyond MIDI: The handbook
of musical codes. Ed. E. Selfridge-Field. Cambridge, MA: The MIT Press, 3–38.
Selfridge-Field, E. 1998. Conceptual and representational issues in melodic comparison.
Computing in Musicology 11:3–64.
Settel, Z. 1993. Hok Pwah. Program note.
Shannon, C., and W. Weaver. 1949. The mathematical theory of communication. Urbana:
The University of Illinois Press.
Shepard, R. 1964. Circularity in judgments of relative pitch. Journal of the Acoustical Soci-
ety of America 36:2346–2353.
Shepard, R. 1999. Cognitive psychology and music. In Music, cognition, and computerized
sound: An introduction to psychoacoustics. Ed. P. Cook. Cambridge, MA: The MIT
Press.
Siegel, W., and J. Jacobsen. 1998. The challenges of interactive dance: An overview and
case study. Computer Music Journal 22(4):29–43.
Simon, H., and K. Kotovsky. 1963. Human acquisition of concepts for sequential patterns.
Psychological Review 70:534–546.
Simon, H., and R. Sumner. 1993. Pattern in music. In Machine models of music. Ed. S.
Schwanauer and D. Levitt. Cambridge, MA: The MIT Press.
Sims, K. 1991. Artificial evolution for computer graphics. Computer Graphics 25(4):319–
328.
Singer, E., K. Perlin, and C. Castiglia. 1996. Real-time responsive synthetic dancers and
musicians. In Visual proceedings, SIGGRAPH 96. New York: ACM SIGGRAPH.
Singer, E., A. Goldberg, K. Perlin, C. Castiglia, and S. Liao. 1997. Improv: Interactive impro-
visational animation and music. ISEA 96 proceedings: Seventh international sympo-
sium on electronic art. Rotterdam, Netherlands: ISEA96 Foundation.
Sloboda, J. 1985. The musical mind: The cognitive psychology of music. Oxford: Clarendon
Press.
Smoliar, S. 1992. Representing listening behavior: Problems and prospects. In Understand-
ing music with AI. Ed. M. Balaban, K. Ebcioglu, and O. Laske. Cambridge, MA: The
MIT Press.
Stammen, D. 1999. Timewarp: A computer model of real-time segmentation and recogni-
tion of melodic fragments. Ph.D. diss., McGill University.
Stammen, D., and B. Pennycook. 1993. Real-time recognition of melodic fragments using
the dynamic timewarp algorithm. In Proceedings of the 1993 International Computer
Music Conference. San Francisco: International Computer Music Association, 232–
235.
Subotnick, M. 1997. Intimate Immensity. Program note.
Sundberg, J. 1988. Computer synthesis of music performance. In Generative processes in
music: The psychology of performance, improvisation, and composition. Ed. J. Slo-
boda. Oxford: Clarendon Press.
Sundberg, J., A. Askenfelt, and L. Frydén. 1983. Music performance: A synthesis-by-rule
approach. Computer Music Journal 7:37–43.
Sundberg, J., A. Friberg, and L. Frydén. 1991. Common secrets of musicians and listeners—
an analysis-by-synthesis study of musical performance. In Representing musical struc-
ture. Ed. P. Howell, R. West, and I. Cross. London: Academic Press.
Tanimoto, S. 1990. The elements of artificial intelligence: Using Common LISP. New York:
W.H. Freeman and Company.
West, R., P. Howell, and I. Cross. 1985. Modelling perceived musical structure. In Musical
structure and cognition. Ed. P. Howell, I. Cross, and R. West. London: Academic Press.
West, R., P. Howell, and I. Cross. 1991. Musical structure and knowledge representation.
In Representing musical structure. Ed. P. Howell, R. West, and I. Cross. London: Aca-
demic Press.
Widmer, G. 1992. Qualitative perception modeling and intelligent musical learning. Com-
puter Music Journal 16(2):51–68.
Widmer, G. 1995. Modeling the rational basis of musical expression. Computer Music Jour-
nal 19(2):76–96.
Widmer, G. 1996. Learning expressive performance: The structure-level approach. Journal
of New Music Research 25:179–205.
Widrow, B. 1963. ADALINE and MADALINE. IEEE–ICNN 1(I):143–158.
Wiggins, G., E. Miranda, A. Smaill, and M. Harris. 1993. A framework for the evaluation
of music representation systems. Computer Music Journal 17(3):31–42.
Winkler, T. 1998. Composing interactive music: Techniques and ideas using Max. Cam-
bridge, MA: The MIT Press.
Winograd, T. 1968. Linguistics and the computer analysis of tonal harmony. Journal of
Music Theory 12(1):2–49.
Winsor, P. 1989. Automated music composition. Denton: University of North Texas Press.
Winston, P. 1984. Artificial intelligence. 2nd ed. Reading, MA: Addison-Wesley Publishing
Company.
Witten, I., L. Manzara, and D. Conklin. 1994. Comparing human and computational models
of music prediction. Computer Music Journal 18(1):70–80.
Woideck, C. 1996. Charlie Parker: His music and life. Ann Arbor: The University of Michi-
gan Press.
Wright, M., and A. Freed. 1997. Open Sound Control: A new protocol for communicating
with sound synthesizers. In Proceedings of the 1997 International Computer Music
Conference. San Francisco: International Computer Music Association.
Wright, M., A. Chaudhary, A. Freed, D. Wessel, X. Rodet, D. Virolle, R. Woehrmann, and
X. Serra. 1998. New applications of the Sound Description Interchange Format. In
Proceedings of the 1998 International Computer Music Conference. San Francisco:
International Computer Music Association.
Wright, M., and D. Wessel. 1998. An improvisation environment for generating rhythmic
structures based on North Indian ‘‘Tal’’ patterns. In Proceedings of the 1998 Interna-
tional Computer Music Conference. San Francisco: International Computer Music As-
sociation.
Zicarelli, D. 1996. Writing external objects for Max. Palo Alto: Opcode Systems.
Index
Chord identification, 19, 38, 40–42, 341
  KickOutMember, 40
Chord spelling, 42–46, 58
Chord type identification, 58–60
City-block metric, 163–165
Clarke, Eric, 146, 264, 277–278
CNMAT Rhythm Engine, 231–232
Coltrane, John, 372
Columbia Computer Music Center, 336
Combination tunes, 335
Compatibility Rule, 47
Competitive learning, 197–198
Completion image, 196
Computer music, 1, 4, 377
Coniglio, Mark, 321, 346
Conklin, Darrell, 184
Connectionist Quantizer, 114–121
  basic cells, 114, 118
  interaction cells, 114, 121
  sum cells, 114, 118
Context dependence, 46–47
Context image, 196
Context independence, 25
Context modeling, 185
Continuous controls, 328, 351–353
Convolution Brothers, The, 221
Cook, Nicholas, 110, 318
  Analyzing Multimedia, 318
Coordinated Electronic Music Studio, 288
Cope, David, 182–184, 237
  Computers and Musical Style, 182–183
Cross-disciplinary research, 377
Csound, 194
Cue-abstraction mechanism, 242–243
Cypher, 60, 239–240, 258, 279
  critic, 240, 312–313
  key finder, 60–66
  meta-listener, 312–315
  Multiple Cypher, 310–315
  segmentation, 168

Dannenberg, Roger, 30–31, 170–172, 176–178, 180–182, 212, 336–338, 341–342
  In Transit, 334, 338, 341
Deliège, Irene, 241–242
Derived class, 14
Desain, Peter, 112, 114, 118–119, 122–124
Deutsch, Diana, 110–111, 145, 203–204, 206
Differentiation rules, 266–267
di Giugno, Giuseppe, 212, 216
Dobbins, Bill, 36–38
  Jazz Piano Harmony, 36–38
Dolson, Mark, 94–95, 103
Dominant Zone, 357–358
Dowling, W. Jay, 246
Draves, Scott, 334
Dudas, Richard, 288
Duty cycle, 348–350

Ebcioglu, Kemal, 237–240
Eckel, Gerhard, 306
  Brownian, 306
Emotion in music, 244–245
Empirical Inquiry Hypothesis, 11
Ensemble improvisation, 308–310
Essl, Karlheinz, 306, 308
  anti-bis, 307–308
  Lexicon-Sonate, 306, 308
  RTC-lib, 306–308
Euclidean metric, 163–164, 337
Experiments in Musical Intelligence, 182
Expert systems, 237–240
Explode, 217
Expressive performance, 113, 264–276

Fast Fourier Transform (FFT), 193–194, 262–263
Feigenbaum, Edward, 11
FlExPat, 187–188
Forte, Allen, 19–21
  The Structure of Atonal Music, 19
4X machine, 212–216
Fraser, Simon, 327, 330
Friberg, Anders, 265
Frydén, Lars, 265

GALileo (Graphic ALgorithmic music language), 346
Garton, Brad, 335–336
Generate and test, 238–240
Generation techniques, 203–204
Genetic algorithms, 248–257
  crossover, 249
  mutation, 249
Waseda University, 97
Watson, David, 336
Well-formedness rules, 146
Wessel, David, 8, 231–232
Widmer, Gerhard, 273–276
Widrow, Bernard, 95
Willy, 363–364
Winkler, Todd, 124–125
  Composing Interactive Music, 124–125
Witten, Ian, 184
Woideck, Carl, 372
Woids, 327–334
  AvoidWalls, 330–332
  MatchHeavy, 333
Wolman, Amnon, 205, 297–298, 304
  New York, 297–298, 304
Wright, Matthew, 33, 231–232
Xenakis, Iannis, 6
zerocross, 223–224
Zicarelli, David, 2, 13, 213
  Writing External Objects for Max, 139–140