Speech Processing
http://www.speechminded.com
ISBN-13:. . .
Contents

I. Introduction to Praat

1. Introduction
   1.1.
   1.2.
   1.3.
   1.4.

   2.1. Introduction
   2.2. Acoustics
        2.2.1. The nasty details of pure tone specifications
   2.3. The sound editor
   2.4. How do we analyse a sound?
   2.5. How to make sure a sound is played correctly?
   2.6. Removing an offset
   2.7. Special sound signals
        2.7.1. Creating tones
        2.7.2. Creating a damped sine (formant)
        2.7.3. Creating noise signals
        2.7.4. Creating linear sweep tones
        2.7.5. Creating a gammatone
        2.7.6. Creating a gammachirp
        2.7.7. Creating a sound with only one pulse

        3.5.3. The FLAC format
        3.5.4. A-law format
        3.5.5. μ-law format
        3.5.6. Raw format
        3.5.7. The mp3 format
        3.5.8. The ogg vorbis format
   3.6. Equipment
        3.6.1. The microphone
        3.6.2. The sound card
               3.6.2.1. Oversteering and clipping (do it yourself)
               3.6.2.2. Sound card electronic circuitry
        3.6.3. The mixer
        3.6.4. Analog to Digital Conversion
               3.6.4.1. Aliasing
        3.6.5. Digital to Analog Conversion
        3.6.6. The Digital Signal Processor

4. Praat scripting
   4.1.
   4.2.
   4.3.
   4.4.
        4.7.2. Repeat until loops
        4.7.3. While loops
   4.8. Functions
        4.8.1. Mathematical functions in Praat
        4.8.2. String functions in Praat
   4.9. The layout of a script
   4.10. Mistakes to avoid in scripting

5. Pitch analysis

6. Intensity analysis

7. The Spectrum
   7.1. The spectrum of elementary signals
        7.1.1. The spectrum of pure tones of varying frequency
        7.1.2. The spectrum of pure tones of varying amplitude and decibels
        7.1.3. The spectrum of pure tones of varying phase
        7.1.4. The spectrum of a simple mixture of tones
        7.1.5. The spectrum of a tone complex
        7.1.6. The spectrum of pure tones that don't fit
        7.1.7. Spectral resolution
        7.1.8. Why do we also need cosines?
        7.1.9. Is the phase of a sound important?
   7.2. Fourier analysis
   7.3. The spectrum of pulses
   7.4. Praat's internal representation of the Spectrum object
   7.5. Filtering with the spectrum
        7.5.1. The spectrum editor
        7.5.2. Examples of scripts that filter
        7.5.3. Shifting frequencies: modulation and demodulation
   7.6. The spectrum of a finite sound
        7.6.1. The spectrum of a rectangular block function
        7.6.2. The spectrum of a short tone
   7.7. Technical intermezzo: the Discrete Fourier Transform (DFT)
        7.7.1. The Fast Fourier Transform (FFT)
   7.8. Sound: To Spectrum...

8. The Spectrogram

9. Annotating sounds
   9.1.
   9.2.
   9.3.
   9.4.

11. Digital filters
   11.1. Non-recursive filters
   11.2. The impulse response
   11.3. Recursive filters
   11.4. The formant filter
   11.5. The antiformant filter
   11.6. Applying a formant and an antiformant filter

   12.1. Introduction
        12.1.1. How to create an empty KlattGrid
        12.1.2. How to create an /a/ and an /au/ sound
   12.2. The phonation part
   12.3. The vocal tract part
   12.4. The coupling between phonation and vocal tract
   12.5. The frication part
   12.6. Differences between KlattGrid and the Klatt synthesizer

        13.3.5. Practical guidelines for measuring formant frequencies
               13.3.5.1. File naming
               13.3.5.2. Annotating segments
               13.3.5.3. Scripting example
   13.4. Why are formant frequencies still so difficult to measure?

14. Useful objects
   14.1. Introduction
   14.2. TableOfReal
        14.2.1. Drawing data from a TableOfReal
               14.2.1.1. Draw as numbers...
               14.2.1.2. Draw as numbers if...
               14.2.1.3. Draw scatter plot...
               14.2.1.4. Draw box plots...
               14.2.1.5. Draw column as distribution...
   14.3. Table
   14.4. Permutation
   14.5. Strings

16. LPC analysis
   16.1. Introduction
   16.2. Linear prediction
   16.3. Linear prediction applied to speech
   16.4. Intermezzo: Z-transform
        16.4.1. Stability of the response in terms of poles
        16.4.2. Frequency response
   16.5. LPC interpretation
   16.6. Performing LPC analysis
        16.6.1. Pre-emphasis
        16.6.2. The parameters of the LPC analysis
        16.6.3. The LPC object

18. Scripting simulations

A. Mathematical Introduction
   A.1. The sin and cos function
        A.1.1. The symmetry of functions
        A.1.2. The sine and cosine and frequency notation
        A.1.3. The phase of a sine
        A.1.4. Average value of products of sines and cosines
        A.1.5. Fade-in and fade-out: the raised cosine window
   A.2. The tan function
   A.3. The sinc(x) function
   A.4. The log function and the decibel
        A.4.1. Some rules for logarithms
        A.4.2. The decibel (dB)
        A.4.3. Other logarithms
   A.5. The exponential function
   A.6. The damped sinusoid
   A.7. The 1/x function
   A.8. Division and modulo
   A.9. Integration of sampled functions
   A.10. Interpolation and extrapolation
   A.11. Random numbers
        A.11.1. How are random numbers used in Praat?
   A.12. Correlations between Sounds
        A.12.1. Applying a time lag to a function
        A.12.2. The cross-correlation function of two sounds
               A.12.2.1. Cross-correlating sines
               A.12.2.2. Praat's cross-correlation
        A.12.3. The autocorrelation
               A.12.3.1. The autocorrelation of a periodic sound
               A.12.3.2. Praat's autocorrelation
   A.13. The summation sign
   A.14. About bits and bytes
        A.14.1. The Roman number system
        A.14.2. The decimal system
        A.14.3. The general number system
        A.14.4. Number systems in the computer
   A.15. Matrices
   A.16. Complex numbers
        A.16.1. Some rules of sines and cosines
        A.16.2. Complex spectrum of simple functions
               A.16.2.1. The block function
               A.16.2.2. The time-limited tone

B. Advanced scripting
   B.2.
   B.3.
   B.4.
   B.5.

C. Scripting syntax
   C.1. Variables
        C.1.1. Predefined variables
   C.2. Conditional expressions
   C.3. Loops
        C.3.1. Repeat until loop
        C.3.2. While loop
        C.3.3. For loop
        C.3.4. Procedures
        C.3.5. Executing Praat commands

D. Terminology

Bibliography
Part I.
Introduction to Praat
1. Introduction
The aim of this book is to give the non-mathematically oriented reader insight into the speech
processing facilities of the computer program Praat. This program is freely available from
Praats website at http://www.praat.org and versions of the program exist for all major
platforms: Linux, Windows and Mac OS. Versions for mobile platforms are not supported yet.
The Praat computer program has been developed by Paul Boersma and David Weenink,
both at the Phonetics Institute of the University of Amsterdam. The program is used world
wide by phoneticians, phonologists and speech researchers. Besides the analysis of speech
sounds, it is also used to analyze singing voices, music and even the vocalizations of animals
such as birds, dolphins, apes and elephants.
Although the program is still under heavy development, the user interface has been relatively
stable for a long period of time now, and the way of working that Praat enforces has not
changed much over the years either. Development mainly concentrates on adding new
functionality, extensions and under-the-hood improvements in stability. The current version of
the program is 5.3.64.
The interface of Praat facilitates the manipulation of objects that model the world of speech
signal analysis or the world of phonology. These objects, such as a Sound, only exist
in the memory of the computer and disappear when you leave the program.
Besides the analysis part, you may also make drawings of high quality that can be sent to a
printer or to a file and that can also be included in your documents.
For automating your work you can extend the program with scripts. These scripts can be
used either interactively or in batch mode, i.e. you can direct the program what to do from a
text file. This comes in handy if you want to analyze a large number of files in a standard
way. Besides automating your work, scripting is more powerful since it also facilitates live
simulations by using a special demo window that can only be accessed via scripting.
Of course there is also a help facility in the program and Praat also comes with a number of
tutorials.
Hopefully this book will give the reader more insight into the program and will also
clarify some of the underlying signal processing theory.
Figure 1.1.: The initial Praat object and picture window. On a Macintosh computer the top left
Praat menu is not in the object window but always appears at the top left in the menu
bar at the top of the display.
If this is your first try with Praat, you could now try to create a new sound signal by choosing
New > Sound > Create Sound as pure tone.... Figure 1.2 shows the form that appears on the screen.
Figure 1.2.: The New > Sound > Create Sound as pure tone... command.
We write down the command as it appears on the button and replace the three dots (...) with a colon
(:), followed by the arguments of the command separated with commas (,). The order of
the arguments on the line is from left to right as how they appear in the form from top to
bottom. The first argument of this Praat command is the name you want the new sound to
have. Because this name is a text it has double quotes around it. Then follows a number (1)
because the form shows that it expects a number in this field. The following arguments are
also numbers (0.0, 0.4, 44100, 100.0, 0.2, 0.01, 0.01) as they correspond, respectively, to the
Start time (s), the End time (s), the Sampling frequency (Hz), The tone frequency
(Hz), the Amplitude (Pa), the Fade-in duration (s) and the Fade-out duration (s) fields
of the form. Because this particular command is too long to fit on one line we had to break
it up into two lines. The three dots that start the second line mean that the rest of this line is
a continuation of the previous line. For other commands we follow the same procedure: we
always start with the name of the command and the rest of the arguments follow the fields of
the form from top to bottom as we showed above. If a command does not have an associated
form, as happens with all commands that do not end with three dots ..., only the command
itself is displayed. For example a Sound object has a Play command which is immediately
executed if clicked. We can mimic this one in a script as:
Play
In the sequel we will mostly use this unambiguous notation to represent Praat commands.
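Putting this together, the complete pure tone command discussed above would look as follows in a script. The sound name "tone" and the place where the line is broken are our own arbitrary choices:

Create Sound as pure tone: "tone", 1, 0.0, 0.4, 44100,
... 100.0, 0.2, 0.01, 0.01

The first line ends after an argument and the second line starts with three dots, so Praat reads both lines as one command.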
In the first example we start by selecting Font size... 10 from the Font menu. Drag the
mouse in the picture window to select a viewport drawing area. Choose the Draw inner box
command that you can find in the Margins menu of the picture window. This command draws
a rectangle at the border between the inner and outer part of the viewport. Change the font size
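These menu choices can also be scripted. A sketch, assuming a recent Praat version that accepts the colon command syntax; the viewport coordinates below are an arbitrary example:

Font size: 10
Select inner viewport: 1, 5, 1, 4
Draw inner box

Select inner viewport: takes the left, right, top and bottom positions of the drawing area, after which Draw inner box draws the rectangle just described.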
Figure 1.3.: On the left a contiguous selection of the objects numbered 3, 4 and 5. On the right a
discontiguous selection of the objects numbered 1, 3 and 5.
1. A contiguous selection. This is exemplified by the left display in figure 1.3 where the
consecutive objects numbered 3, 4 and 5 have been selected. If you have several objects
in the list of objects then dragging the mouse over the list may result in a contiguous
selection. Another way to obtain a contiguous selection of objects is to select the first
object with a mouse click, release the mouse, move the mouse to the last object in the
list that you want to include in your selection, push the Shift button on the keyboard and
click with the mouse on the object. This is called a Shift-click. It results in a contiguous
selection that includes all the objects from the first selected one up to and including the
last Shift-clicked one.
2. A discontiguous selection as shown in the right display of figure 1.3 where the objects
numbered 1, 3 and 5 have been selected. The selection is interrupted by one or more
gaps of deselected objects. The way to achieve a discontiguous selection is by Ctrl-click
extension. Holding the Ctrl button down while clicking on an object in the list of objects
toggles a selection, i.e. if the object was not selected then it will be selected, if the object
was already selected it will be deselected.
With the Shift-click and the Ctrl-click options there are many ways to make a discontiguous
selection. For example, the selection in figure 1.3 could be accomplished in several ways. We
mention two of them. Firstly, we could have started by selecting the object with number 1
and subsequently, with two Ctrl-clicks, extend the selection with the objects numbered 3 and
5. Secondly, we might have started from a contiguous selection by dragging the mouse from
object 1 to 5 and subsequently deselecting the objects numbered 2 and 4 by Ctrl-clicking them.
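Scripts make selections in a similar way. A sketch of the discontiguous selection of figure 1.3, assuming the three objects happen to have the IDs 1, 3 and 5:

selectObject: 1
plusObject: 3
plusObject: 5

selectObject: replaces the current selection, and each plusObject: extends it, just like a Ctrl-click does.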
2.2. Acoustics
From acoustics we learn that sound is defined as a mechanical disturbance from a state of
equilibrium that propagates through an elastic medium. An elastic medium is capable of
resuming its original shape after stretching or compressing. A disturbance may be produced
in several ways like plucking a guitar string, pinching a tuning fork or by the opening and
rapid closing of the vocal folds. This disturbance produces a sudden local increase in pressure,
i.e. a local compression of air. Since the medium is elastic, the compression is not permanent
and therefore the compressed region will rebound. In doing so it will compress an adjacent
region, which will then again rebound and so on. The result of this cycle of repetition is a
compression wave followed by a rarefaction wave as each region rebounds. The waves thus
generated propagate through the medium with a speed that depends on several factors like the
elasticity, density, and the temperature of the material. The propagation velocity of a sound
pressure wave in air at room temperature is approximately 340 m/s.¹ It is important to realize
that it is not the air particles in the air that carry the sound to your ear but it is the compression
wave that propagates and carries the sound. Therefore in order to study the physical aspects
of sounds we dont have to study particle physics but instead we have to know some things
about waves, in particular sound waves.
In many textbooks sound waves are illustrated with the sounding of a tuning fork and we
will continue in this tradition. A tuning fork is an acoustic resonator in the form of a
two-pronged U-shaped fork of elastic metal (usually steel). It resonates at a specific constant pitch
¹ This corresponds to 1224 km/h, which may appear to be a very large velocity but it is nothing compared to the speed
of light, which is nearly 300,000 km/s. This means that after you see a lightning flash at one kilometer distance, it
will take the sound another three seconds to reach your ear.
Figure 2.1.: Upper panel: seventeen phases of the displacement of the prongs of a tuning fork after
it has been made to sound by pinching the two prongs together. Lower panel: the
displacement of the right-hand prong as a function of time. Inward displacement is
negative, outward displacement positive. The baseline corresponds to the rest position.
In figure 2.1 we have tried to visualize the movement of the prongs of a tuning fork at seventeen
regularly spaced time points. The tuning fork was made to sound by pinching the two prongs
together at time point 1. Now the two prongs of the fork are most close together and because
of the elasticity of the material, they will immediately start to move back to their neutral
position. However, when they reach the neutral position, at time point 3, they have so much
velocity that they overshoot, moving further outward. The prongs therefore pass the
neutral position at time point 3 and reach a maximum outward displacement, at time point 5,
and from there they start moving inwards again. At time point 7 they pass the neutral position
again, overshoot, and move further inward. At time point 9 they have completed one cycle and
are back at approximately the same displacement as at time point 1. They now move outward
again and a new cycle of outward and inward movement starts. This goes on until eventually
the movement dies out. The movement of each prong is called a periodic movement because the
movement repeats itself at regular intervals or periods.
In the lower panel of figure 2.1 we have plotted the displacement of one of the prongs, in
this case the right prong, as a function of time. Relative to the neutral position, a displacement
inward has been given a negative sign while a displacement outward has been given a positive
sign. The baseline at 0 therefore corresponds to the neutral position. The dots correspond
to the time points displayed in the upper panel. A smooth curve has been drawn through the
displacement points to show the intermediate values of the displacements. The curve starts
at the most negative value for the displacement since the prongs are most closely together
here. It then moves upwards towards less negative values because the displacement with respect to
the neutral position becomes less, crossing the neutral position at time point 3. It then reaches
the most positive value at time point 5 because the prongs are at their maximum distance apart
which is at their maximum outward position. The curve goes down again because the prong
starts moving inward again. At time point 7 the prong passes the neutral position and the value
of the displacement curve is zero. At time point 9 the value of the curve is equal again to the
value at time point 1. Now the cycle starts all over again.
This curve, which describes the motion of the prong, is called a sinusoid, which means like-a-sine
or sine-like. It is one of the most important curves that exist in science. The displacement
of the prong of the tuning fork follows a sinusoidal curve and the prong is said to move in a
simple harmonic motion. The sound produced by a tuning fork is called a pure tone.
A sinusoidal curve describes the motion of many oscillating bodies like a spring or a pendulum. As the lower panel visualizes, the sinusoid curve is periodic: a sinusoid repeats itself.
The curve that starts at time point 1 repeats starting at time point 9. A closer look at the curve
reveals that the periodicity of the curve could start at any place. For example, if we had drawn
an 18th and a 19th point, the shape of the curve from time points 10 to 18 would equal the
shape of the curve from time points 2 to 10 and the shape from time points 3 to 11 would
equal the shape from time points 11 to 19. In fact we could pick any random point on the time
axis, follow the curve, and discover the next point from which the shape of the curve equals the
traced curve. From all the possible starting points, two of these curves have gotten a special
name: the curve that starts at time point 3 is called a sine curve, and the curve that starts at
time point 5 is called a cosine curve. As the figure shows, both sine and cosine are periodic
functions that have a lot of similarities. In fact, the only difference between the two is that the
sine curve starts with a zero value and the cosine curve starts at the positive extreme. In words,
one period of a sine starts at zero, goes up to an extreme value and then goes down below zero
to a negative extreme and then goes up again to zero. In appendix A, the mathematical
introduction, we give a lot more information about the sine and cosine functions. Because these
functions are so important, a special branch of mathematics called trigonometry deals with them.
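The quarter-period shift between sine and cosine can be tried out directly in Praat. A sketch with an arbitrary frequency of 100 Hz and a duration of 0.1 s:

Create Sound from formula: "sine", 1, 0.0, 0.1, 44100, "sin(2*pi*100*x)"
Create Sound from formula: "cosine", 1, 0.0, 0.1, 44100, "cos(2*pi*100*x)"

Drawing both sounds in the same viewport shows that the cosine is just a sine shifted by a quarter of a period.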
What is very nice about the sinusoid functions is that they also describe the motion of air
particles immediately in the neighborhood of the prongs of the tuning fork. Let us focus on
the right-hand prong as we have before. In the left panel of figure 2.2, the changing positions
of air particles are shown during the transmission of a simple sound from the right-hand
prong. The rows represents, at different times, the positions of the air particles along a certain
distance on a virtual straight line that extends from the sound source origin outwards. We must
take particle not too literally because we do not mean the individual molecules; we mean
something like the average position of a small volume of air. At the bottom of the left panel,
a wave is started by a local compression of air particles by some external disturbance, like for
Figure 2.2.: Left panel: Propagation of sound in air visualized by the changing position of air
particles during the transmission of a simple periodic sound wave. The rows represent the position of the air particles at successive points in time along a distance on a
straight line that extends from the sound source. The wave front is indicated by a particle marked with green color. The displacement of one single air particle is displayed
in red color. The changing position of the right-hand prong of the tuning fork is shown
on the left with small vertical bars. Upper right panel: the pressure at distance d as
a function of time. Bottom right panel: pressure variation at time t as a function of
distance.
example the right-hand prong of a tuning fork. The tuning fork sets air particles in motion
and these particles set other particles in motion and so on resulting in a local compression of
air particles. Because of this compression the air particles at the left in the first row are closer
to each other than in the rest of the row where the particles are still in their normal position.
At this time point there is only a compression at the start of the row; the rest of the air particles
farther away from the sound source are still in their neutral position. The green colored air
particles mark what is called the wavefront. The picture makes clear that in the course of time
the wavefront continues to travel to the right, further away from the source. The velocity of
travel is the velocity of sound. Near the wavefront the particles are closer together due to the
compression and therefore the pressure is larger than average. At positions where there are
fewer particles than normal the pressure is less than average. We emphasize again that although
the wave travels to the right the air particles do not. Each particle simply follows a harmonic
motion; to show this we have given one particle a red color which enables us to track its
position as time develops. It shows that the particle's displacement oscillates nicely about the
neutral position. This illustrates that the sound does not travel with the particles but with the wave
that creates the harmonic movement of these particles.
As the left panel shows, the number of air particles at a certain position varies with time. We know that the number of air particles within a certain space is directly related to the physical measure called pressure: the more particles are present, the higher the pressure. Pressure is
Figure 2.3.: Left panel: the pressure variation during an interval of 1 s at a certain position due to a 10 Hz sound. Right panel: the pressure variation over a distance of 340 m for a 10 Hz sound.
expressed in units of pascal, abbreviated as Pa. The pressure in a small volume at a distance d from the sound source has been plotted in the top right panel of figure 2.2. The pressure as a function of time shows a sinusoidal curve and its period has been marked with T. The number of wave periods that fit into one second is called the wave's frequency; the unit of frequency is called the hertz, abbreviated as Hz. The hertz unit is equivalent to cycles per second and has dimension one over time (1/s). If a period lasts one second, its frequency is one hertz. There is an inverse relation between the frequency and the period of a wave:
f = 1/T.    (2.1)
In this formula f is the frequency in hertz and T is the period in seconds. Note that the units
on both sides are the same because the unit of the period T is in seconds and therefore the
unit of 1/T is 1/s which is the same unit as the hertz. The left panel of figure 2.3 shows
the pressure variation during an interval of 1 s at a certain position due to a 10 Hz sound.
A frequency of 10 Hz means that during this 1 s the sound pressure completes 10 cycles of
pressure variations, i.e. 10 periods will fit in this time interval. Consequently each period will
last 0.1 s (= 1/10). In the figure the starting point was chosen to be at zero pressure variation but this does not matter: had we chosen another starting position, exactly 10 cycles would still have been completed. A frequency of 20 Hz has a period of 0.05 s (= 1/20) because
exactly 20 periods fit in one second. The sound frequencies that we are normally concerned
with in speech sounds are higher than these. Although the human ear is sensitive to sounds
with frequencies between 20 Hz and approximately 20000 Hz (=20 kHz), the most important
information for speech sounds is limited to the range from say 200 Hz to some 5000 Hz.
For analog telephone conversations the frequency range is deliberately limited to frequencies
between 300 and 3300 Hz.
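The inverse relation between frequency and period is easy to check numerically; the sketch below is in Python rather than Praat script, purely for illustration.

```python
# Numerical sketch of equation (2.1), f = 1/T (illustrative, not from the text).
def period_from_frequency(f_hz):
    """Return the period T in seconds of a tone with frequency f_hz."""
    return 1.0 / f_hz

# A 10 Hz tone completes 10 cycles per second, so each period lasts 0.1 s;
# a 20 Hz tone has a period of 0.05 s.
print(period_from_frequency(10))   # 0.1
print(period_from_frequency(20))   # 0.05
```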
The right bottom panel of figure 2.2 shows the pressure as a function of distance at the
time point t (bounded on the vertical axis with horizontal dotted lines). Again a sinusoidal
curve appears whose period now is called the wavelength. Because the horizontal scale is now
distance instead of time, the dimension of the wavelength is also expressed in the unit of distance, i.e. meters. Wavelength is often indicated with the Greek character λ (lambda). There
is a relation between the wavelength of a wave and its frequency which says that wavelength
times frequency is a constant. The constant happens to be the velocity of wave propagation
in the medium, in our case the velocity of sound in air. For frequency f, wavelength λ and velocity v the formula reads:

λf = v.    (2.2)
This formula shows, just like equation 2.1, an inverse relationship, but now between wavelength and frequency (we can rewrite the equation above as λ = v/f). Note that the units are correct since the left side has the unit of the wavelength (m) multiplied by the unit of frequency (1/s), which results in a combined unit of m/s, the unit of velocity. We can easily see that this relation is true: if we assume that the velocity v of a wave does not depend on frequency but is a constant, then we know that in one second the wave will have traveled a distance of v meters. Because the frequency equals f, the wave has oscillated f times during this second and so the distance traveled equals f wavelengths. Therefore f wavelengths should equal the distance covered, and therefore λf = v. If we were to measure during two seconds, the distance covered would be twice as large and equal 2v meters, and twice as many wavelengths would fit in, so 2λf = 2v, which reduces to λf = v again. We could repeat this argument
for any length of time. If the frequency of a tone increases, its wavelength decreases; if the frequency decreases, its wavelength increases. In the right panel of figure 2.3 the pressure variation of a 10 Hz sound has been plotted over a distance of 340 m, i.e. the distance a sound would travel in a one-second interval. As the figure makes clear, the 10 periods of the sound cover a distance of 340 m and therefore the wavelength of a 10 Hz sound will be one tenth of the distance traveled in this one second and equals 34 m. If we increase the sound's frequency to 100 Hz then a hundred periods have to fit into the same distance and the wavelength decreases to 3.4 m. Given the formula above we can now easily calculate that a wavelength of 0.17 m, which happens to be the average male's vocal tract length, corresponds to a frequency of 2000 Hz.
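The wavelength calculations above can be verified with a few lines of code; Python is used here instead of Praat script, and the speed of sound is taken as the 340 m/s value used in the text.

```python
# Sketch of the relation lambda * f = v (equation 2.2), assuming v = 340 m/s.
V_SOUND = 340.0  # speed of sound in air, m/s (the value used in the text)

def wavelength(f_hz, v=V_SOUND):
    """Wavelength in meters of a tone with frequency f_hz."""
    return v / f_hz

def frequency(lam_m, v=V_SOUND):
    """Frequency in Hz of a tone with wavelength lam_m."""
    return v / lam_m

print(wavelength(10))    # 34.0 m
print(wavelength(100))   # 3.4 m
print(frequency(0.17))   # ≈ 2000 Hz, an average male vocal tract length
```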
In figure 2.4 we show again the propagation of sound in air. All panels have exactly the same scales as in figure 2.2; however, the frequency of the source signal has been doubled. This results in a doubling of the number of compressions in the same amount of time, as can be judged by comparing the left panels in both figures. Because of the doubling of the frequency, the period has been halved, as a comparison of the upper-right panels in figures 2.2 and 2.4 shows. As for the wavelength, the corresponding lower panels show that because of the frequency doubling the wavelength has been halved, in accordance with the formula above.
We have now covered how a sound is transmitted in a medium. Two further things have to
be said. The first is that without a medium there will be no sound, in other words, in a vacuum
there can be no sound because sound, in contrast to light, needs a transport medium. The
second is that besides the frequency also the loudness of a simple sound can vary. The only
way loudness information can be transmitted is by a variation of the pressure, i.e. the density
of the air particles.
The sound of a tuning fork is one of the simplest sounds possible because it is a sound wave with a fixed frequency; it is called a pure tone. Most sounds we encounter, however, are not pure tones but more complex sounds. But however complex they may be, they are still transmitted in the same way as a pure tone: by a wave that propagates as a compression and rarefaction of air particles. These compressions and rarefactions can be recorded as local changes in pressure by a microphone and subsequently transduced into an electrical signal. The electrical signal can be stored in a computer as numbers. In chapter 3 we will go into more detail about how this can be accomplished. For now it suffices to say that when we state that the sound object in Praat represents the acoustical pressure variations of a sound in air, you will hopefully know from the preceding discussion what we are talking about.
2.2.1. The nasty details of pure tone specifications
In the preceding sections we have seen that the pressure variation of the sound wave from a tuning fork can be modeled with a sinusoid. However, we did not specify the sinusoid in mathematical terms. In this section we will explain how the mathematical sine function can specify a tone in terms of frequency and time.
Let us first find out more about the sine function. From the mathematical introduction we
know that the sine function can be written as sin(x) where the argument of the sine, x, is
some dimensionless variable. In the top left panel of figure 2.5 we show how the value of the function sin(x) varies with its argument x in the interval from 0 to 6.283. The end of the interval could also have been displayed as 2π but instead we have displayed it as the number 6.283 to emphasize that 2π really is a number and not something magical. The period of sin(x) is 2π. This means that the values of sin(x) for some value x and for another value x + 2π are always equal, whatever the value of x may be. As a formula this reads:

sin(x) = sin(x + 2π).    (2.3)
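Equation (2.3) can be checked numerically; a Python sketch, not Praat script, and the equality of course only holds up to floating-point rounding.

```python
import math

# Numerical sketch of sin(x) = sin(x + 2*pi*m) for integer m.
x = 1.2345  # an arbitrary argument
for m in (1, 2, 5):
    assert abs(math.sin(x) - math.sin(x + 2 * math.pi * m)) < 1e-9
print("sin is periodic with period 2*pi (up to rounding)")
```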
We can reformulate this formula as sin(x) = sin(x + 2πm) for all integer values of m.4

4 It is easy to see that this must be true by noting that if sin(y) = sin(y + 2π) for all y, then for y = x + 2π we get sin(x + 2π) = sin(x + 2π + 2π) = sin(x + 2·2π). With the help of equation (2.3) this results in sin(x) = sin(x + 2·2π). In the same way we can show that every increase of the argument x by 2π results in the same value.

Figure 2.5.: Different representations of sinusoidal functions. In the left column we display the function sin(kx), where the variable x runs from 0 to approximately 6.283 (2π). From top to bottom the values for the parameter k are 1, 2 and 3, respectively. In the right column the displayed function is sin(2πkx), where x runs from 0 to 1.

The
legend of the figure says that we display functions sin(kx), which equals sin(x) if k equals 1. The most important thing to note from the top left panel for the function sin(x) is that exactly one period fits in the interval from 0 to 2π and that the length of one period therefore equals 2π. The amplitude of the sine varies between +1 and −1. The next function in the left column is sin(2x). This function has two periods on the same interval because when x runs from 0 to 2π, the argument 2x runs from 0 to 4π, which means that when x is halfway at the value π, the argument 2x has already reached the value 2π and we know that one period has been traced. Since the sine is a periodic function, we know that when x runs through the second half, i.e. the interval from π to 2π, another period of the sine will be traced. In the third panel in the left column, where the function sin(3x) is displayed, there are three periods on the interval 0 to 2π because the argument of sin(3x), i.e. 3x, now runs from 0 to 6π. From the arguments above we can make the following generalization: the function sin(kx) shows k periods when x runs from 0 to 2π. In the figure k was an integer number but this was just to get a nice display; k may be any real number. For example, if the value of k were 2.788 then the function sin(kx) would not show an integral number of periods when x runs from 0 to 2π but only show 2.788 periods.
From the preceding paragraph we have learned that the function sin(kx) shows k periods when x runs from 0 to 2π. We further know that if the x variable were time, then k periods in a segment of duration 2π seconds would correspond to a frequency f = k/(2π), since the frequency is the number of periods per second. This shows that k can almost be interpreted as frequency, apart from a factor of 2π. So why not include this factor in the argument? To show the effect, we display in the right column of figure 2.5 how the function sin(2πkx) varies as k varies from 1 to 3. In the display we have reduced the interval for x to run from 0 to 1. For k = 1 we see 1 period,
for k = 2 we see 2 periods and for k = 3 there are 3 periods during the 1 second interval. This is very nice, because if x were time then k would correspond to frequency! So why not change the notation and write sin(2πft) instead, where f is the frequency in Hz and t is the time in seconds? Although frequency and time have different dimensions or units, the product ft is dimensionless since the dimensions of the product, 1/s for the frequency and s for the time, cancel out. The formula sin(2πft) expresses nicely how the amplitude of a wave with frequency f varies as a function of time.5 Let us use this formula to express the pressure variation as a function of time at a certain point due to a sounding tuning fork of frequency f. We write:

p(t) = sin(2πft).

Would this be the correct formula? Well, . . . almost. Two things still have to be arranged in the formula: amplitude and phase.
As figure 2.5 shows, the amplitudes of all the sines in all the panels vary between the values +1 and −1. Or, put another way: no matter what the argument of a sine function is, the result is always a number between +1 and −1. The unit of pressure is the pascal, abbreviated as Pa. The formula above therefore only describes pressure variations between +1 and −1 Pa.6 We want to be more flexible than this and be able to describe variations between, say, +0.001 and −0.001 Pa. This can easily be fixed by extending the formula above with a scale factor. We now write:

p(t) = A sin(2πft).

The pressure now varies between the numbers +A and −A and by choosing an appropriate value for A we can allow for any range of pressure variations. This shows how we can vary the amplitude of a sound. The amplitude correlates with the loudness of a sound: the larger
the amplitude, the louder the sound. We are almost there; one tiny step remains. Take a look at the upper right panel of figure 2.2 again, where the pressure variation as a function of time was displayed. The curve certainly looks sinusoidal, but the initial amplitude is definitely not zero, unlike the sine functions in figure 2.5. Clearly we have to be able to manipulate the amplitude value at the start, i.e. the phase in the cycle of the sine where it should start. We know that for t = 0 the argument of A sin(2πft), i.e. the term 2πft, is also zero. The only way to guarantee that for t = 0 the argument is unequal to zero is by adding a constant number to it. This constant is called the phase and the most used symbol for it is the Greek letter φ. Finally we can now write the complete general description of the pressure variation at a certain point in space due to the sound of a tuning fork:

p(t) = A sin(2πft + φ).    (2.4)
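As a sketch of equation (2.4), the following Python lines sample p(t) for one illustrative choice of A, f and φ; these parameter values are assumptions for the example, not taken from the text.

```python
import math

# Sketch of equation (2.4): p(t) = A * sin(2*pi*f*t + phi).
A, f, phi = 0.5, 100.0, math.pi / 2   # amplitude (Pa), frequency (Hz), phase

def p(t):
    """Pressure at time t (s) for the pure tone of equation (2.4)."""
    return A * math.sin(2 * math.pi * f * t + phi)

# With phi = pi/2 the pressure starts at its maximum +A instead of at zero,
# analogous to the non-zero initial amplitude in the upper right panel of
# figure 2.2.
print(p(0.0))   # 0.5
```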
5 An alternative formulation is based on the following. First we make a purely cosmetic change and move the 2π in sin(2πkx) closer to the x to get sin(k2πx). Then we make the obvious conclusion that if x varies from 0 to 1, then 2πx varies from 0 to 2π. Therefore the curve of the function sin(k2πx) when x runs from 0 to 1 is identical to the curve of sin(kx) when x runs from 0 to 2π.
6 The pascal is used as the unit of air pressure. It is a derived unit of measurement and defined as newtons per square meter (N/m²). The ambient air pressure is about 100,000 Pa. Our ear can detect air pressure variations as small as approximately 0.00002 Pa for a sine wave with a frequency of 1000 Hz. The Praat help page on sound pressure level shows additional information on the subject.
Figure 2.6.: The function sin(x + φ) for x = 0, where φ runs from −2π to +2π.
Figure 2.7.: The basic sound editor. The numbered parts are explained in the text.
The spectrogram is shown as grey values in the drawing area just below the sound amplitude area. The horizontal dimension of a spectrogram represents time. The vertical dimension represents frequency in hertz. The time-frequency area is divided into cells. The strength of a frequency in a certain cell is indicated by its blackness. Black cells have a strong frequency presence while white cells have a very weak presence. The maximum and minimum frequencies represented in the spectrogram are displayed on the left. Here they are 0 Hz and 6000 Hz, respectively. The characteristics of the spectrogram can be modified with options from the Spectrum menu. In chapter 8 you will find more information on the spectrogram.

The pitch is drawn as a blue dotted line in the spectrogram area. The minimum and maximum pitch are drawn on the right side in blue color. Here they are 70 Hz and 300 Hz, respectively. The specifics of the pitch analysis can be varied with options from the Pitch menu. More on pitch analysis in chapter 5.
The intensity is drawn with a solid yellow line in the spectrogram area too. The peakedness of the intensity curve is influenced by the Pitch settings... from the Pitch menu, however. The Intensity menu settings only influence the display of the intensity, not its measurements. The minimum and maximum values of the scale are in dB and shown with a green color on the right side inside the display area (the exact location depends on whether the pitch or the spectrogram is present too). More on intensity in chapter 6.

Formant frequency values are displayed with red dots in the spectrogram area. The Formant menu has options about how to measure the formants and about how many formants to display. Formant analysis is treated in chapter 13.

Finally, the Pulses menu enables the glottal pulse moments to be displayed.
Figure 2.8.: The sound editor with display of spectrogram, pitch, intensity, formants and pulses.
[Figure: the general analysis scheme. The sound is cut into successive analysis frames; the Window length is the duration of each frame and the Time step is the shift between successive frames. Each frame passes through an Analysis block.]
the cutting up of the sound into successive segments. Each segment is analysed in the rectangular block labeled Analysis, the results of the analysis are stored in an analysis frame, and the analysis frame is stored in the output object. An analysis frame is also called a feature vector in the literature. What happens in the Analysis block depends of course on the particular type of analysis, and the contents of the analysis frames therefore also depend on the analysis. For example, a pitch analysis will store pitch candidates and a formant analysis will store formants. Before the analysis can start, at least the following three parameters have to be specified:
1. The window length. As was said before, the signal is cut up into small segments that will be individually analysed. The window length is the duration of such a segment. This duration will hold for all the segments in the sound. In many analyses Window length is one of the parameters of the form that appears if you click on the specific analysis. For example, if you have selected a sound and click on the To Formant (burg)... action, a form appears where you can choose the window length (see figure 13.6). Sometimes you don't have to give the window length explicitly; instead it can be derived from other information that you need to supply. For pitch measurements the window length is derived from the lowest pitch you are interested in.

There is no one optimal window length that fits all circumstances, as it depends on the type of analysis and the type of signal being analysed. For example, to make spectrograms one often chooses either 5 ms for a wideband spectrogram or 40 ms for a narrowband spectrogram. For pitch analysis one often chooses a lowest frequency of 75 Hz and this implies a window length of 40 ms. If you want to measure lower pitches the window
anticipate here on section 4.5.3 on scripting. There is some special syntax to execute Praat commands from within a script. The first line of this script queries a selected sound object for its mean value and assigns it to a variable with the name mean. In the next line we subtract the mean from all samples of the sound. The Get mean... command in this script is the same command that you can find in the sound's dynamic menu Query - list. Because the Get mean... command needs three arguments, we have to supply these on the script line.
For a stereo sound means are determined for each channel separately.
Figure 2.10.: A sound with an offset of 0.4 before (solid line) and after (dotted line) a correction with Subtract mean.
in the list of objects. The Number of channels argument specifies the number of channels of the sound. This argument may be any natural number like 1 or 2 or 3. You may also specify either Mono or Stereo, which will be translated to the numbers 1 or 2, respectively. The Start time and End time specify the domain of the sound. Most of the time you will specify
The following line creates a mono sound with a tone of 800 Hz with a sampling frequency of
44100 Hz and a duration of 0.5 s.
Create Sound from formula: "s", 1, 0, 0.5, 44100, "0.9*sin(2*pi*800*x)"
This sound is displayed in the following figure 2.12 with a dotted line.
Figure 2.12.: The solid line shows a damped sine with a frequency of 800 Hz and a bandwidth of 80 Hz. The dotted line shows the (undamped) tone of 800 Hz.
An alternative way to create a tone is with the specialized command Create Sound as pure tone... from the Sound menu. You don't need a sine formula to specify the tone; you only specify the frequency and the amplitude and, additionally, you can specify fade-in and fade-out times to guarantee that no clicks can be heard at the start and the end of the sound. The following command will create a tone with an amplitude of 0.9 Pa, a duration of 0.5 s and a frequency of 800 Hz.
Create Sound as pure tone: "tone", 1, 0, 0.5, 44100, 800.0, 0.9, 0.01, 0.01
A damped sine can be created from the formula s(t) = e^(−αt) sin(2πFt), where F is called the frequency and α is a positive number called the damping constant. As one can see, damping reduces the amplitude of s(t) as time t increases. If α is very small there is hardly any damping at all and if α is large the amplitude goes to zero very fast. The damping constant is often
In many experiments noise sounds are required. Noise sounds can be made by generating a sequence of random amplitude values. Different kinds of noise exist; the two most important ones are called white noise and pink noise. In white noise the power spectrum of the noise is flat on a linear frequency scale: all frequencies have approximately the same strength. In pink noise the power spectrum is flat on a logarithmic frequency scale. This means that on a linear frequency scale the power varies as a 1/f function (see section A.7). Both types of noise can be made easily with Praat.
To create white noise we can use two functions that generate random numbers from a random number distribution.

The function randomGauss(mu, sigma) generates random numbers from a Gaussian or normal distribution with mean mu and standard deviation sigma.

Create Sound from formula: "g", 1, 0, 0.5, 44100, "randomGauss(0, 0.2)"

A mono sound labeled g will appear in the list of objects. Its duration will be 0.5 s and it is filled with random Gaussian noise of zero mean and 0.2 standard deviation.

The function randomUniform(lower, upper) generates random numbers between lower and upper. It has the advantage, as compared with the randomGauss function, that all amplitudes are always limited to lie within the predefined interval.

Create Sound from formula: "u", 1, 0, 0.5, 44100, "randomUniform(-0.99, 0.99)"

A mono sound labeled u will appear in the list of objects. Its duration will be 0.5 s and it is filled with uniform random noise. All amplitudes will be smaller than or equal to 0.99.
In the spectra of both types of noise all frequencies are equally present, and no audible spectral differences can be heard between the two sounds. For practical use we often prefer the randomUniform noise because it gives better control over the sound amplitude: for random Gaussian noise some extreme amplitudes outside the (−1, 1) interval may always occur. In section 4.7.1.3 we will learn how to create pink noise.
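The difference in amplitude behaviour between the two noise types can be illustrated outside Praat as well; this Python sketch uses the standard random module in place of Praat's randomGauss and randomUniform, which is an assumption for illustration only.

```python
import random

# Sketch of the two white-noise recipes above, in Python instead of Praat.
random.seed(1)
n = 44100 // 2   # half a second at a 44100 Hz sampling frequency

gauss = [random.gauss(0.0, 0.2) for _ in range(n)]        # randomGauss(0, 0.2)
uniform = [random.uniform(-0.99, 0.99) for _ in range(n)]  # randomUniform(-0.99, 0.99)

# Uniform noise is guaranteed to stay inside [-0.99, 0.99]; Gaussian noise
# has no hard bound, so extreme samples outside (-1, 1) may occasionally occur.
print(max(abs(s) for s in uniform) <= 0.99)   # True
print(max(abs(s) for s in gauss))             # unbounded in principle
```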
Up till now we have only created tones with constant frequencies. Suppose we want a tone whose frequency changes linearly from a frequency f1 at time t1 to a frequency f2 at time t2. We can write such a function as sin(φ(t)), where the phase function φ(t) should regulate the frequency at any time t. What should this phase function φ(t) look like? We already know that for a tone with a constant frequency F it can be written as φ(t) = 2πFt, because sin(2πFt) is the formula for such a tone. Here the frequency does not depend on time. Now, a frequency f that increases or decreases linearly as a function of time can in general be written as f(t) = at + b, where the coefficient a determines the slope and b the offset. With the boundary conditions, i.e. the start and end frequencies, we determine these coefficients as a = (f2 − f1)/(t2 − t1) and b = f1 − at1. It can be shown9 that the corresponding phase φ(t) for this case is φ(t) = πat² + 2πbt + φ0, where φ0 is the phase at time t = 0. The following script creates a sweep tone of 1 s duration that starts at 500 Hz and ends at 1500 Hz.
f1 = 500
f2 = 1500
t1 = 0
t2 = 1
a = (f2 - f1) / (t2 - t1)
b = f1 - a * t1
Create Sound from formula: "sweep", 1, t1, t2, 44100,
... "0.99*sin(pi*a*x^2 + 2*pi*b*x)"
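We can check numerically that the phase used in the script, φ(t) = πat² + 2πbt, indeed yields an instantaneous frequency that runs linearly from f1 to f2; a Python sketch using a central-difference derivative, for illustration only.

```python
import math

# Check of the sweep phase phi(t) = pi*a*t^2 + 2*pi*b*t: the instantaneous
# frequency (1/(2*pi)) * dphi/dt should run linearly from f1 at t1 to f2 at t2.
f1, f2, t1, t2 = 500.0, 1500.0, 0.0, 1.0
a = (f2 - f1) / (t2 - t1)
b = f1 - a * t1

def phi(t):
    return math.pi * a * t * t + 2 * math.pi * b * t

def inst_freq(t, h=1e-6):
    # central-difference approximation of (1/(2*pi)) * dphi/dt
    return (phi(t + h) - phi(t - h)) / (2 * h) / (2 * math.pi)

print(round(inst_freq(t1)))             # 500
print(round(inst_freq((t1 + t2) / 2)))  # 1000
print(round(inst_freq(t2)))             # 1500
```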
A gammatone is a sound that can be described as the product of a gamma distribution with a sinusoidal tone. It is the impulse response of the gammatone filter. It is no problem if you don't know what a gamma distribution is, just continue. The gammatone is important because it is often used as a model of the auditory filter. We create a gammatone as follows:
f = 500
bp = 150
gamma = 4
phase = 0
Create Sound from formula: "gammatone", 1, 0, 1, 44100,
... "x^(gamma - 1) * exp(-2 * pi * bp * x) * sin(2 * pi * f * x + phase)"
Scale peak: 0.99
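The gammatone's envelope, t^(γ−1) e^(−2πbt), rises polynomially and then decays exponentially; setting its derivative to zero gives a peak at t = (γ − 1)/(2πb). A Python sketch, with parameter values following the script above:

```python
import math

# Envelope of the gammatone g(t) = t^(gamma-1) * exp(-2*pi*b*t) * sin(...).
# Parameter values follow the Praat script above.
b, gamma = 150.0, 4

def envelope(t):
    """Polynomial-times-exponential envelope of the gammatone."""
    return t ** (gamma - 1) * math.exp(-2 * math.pi * b * t)

# The rising polynomial dominates at first, then the exponential decay wins:
# the envelope peaks at t_peak = (gamma - 1) / (2*pi*b) and vanishes afterwards.
t_peak = (gamma - 1) / (2 * math.pi * b)
print(envelope(t_peak) > envelope(t_peak / 10))   # True: still rising before the peak
print(envelope(t_peak) > envelope(10 * t_peak))   # True: decayed after the peak
```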
In the left panel of figure 2.13 we show the gammatone that results from the above script. The formula for a gammatone is g(t) = t^(γ−1) e^(−2πbt) sin(2πFt + φ0), where γ determines the order of the polynomial part of the gamma distribution. The parameters F and b are the carrier frequency and the bandwidth parameters and φ0 is the starting phase. The figure shows that at the start of the tone the polynomial rising part of the gammatone function is stronger than the exponentially decaying part, but eventually the exponential decay takes over and makes the amplitude vanish. Compared to the formant of section 2.7.2 we note that the gammatone
9 For a function sin(φ(t)) the instantaneous frequency f(t) is defined as f(t) = (1/(2π)) dφ(t)/dt. From this we deduce that the phase φ(t) = 2π ∫ f(t) dt. Given a linear function for the frequency, f(t) = at + b, the phase then follows as φ(t) = 2π(½at² + bt + c), where the integration constant c can be adapted to account for the phase at t = 0.
Figure 2.13.: On the left a gammatone with γ = 4, F = 500 Hz and b = 50 Hz. On the right a gammachirp with γ = 4, F = 500 Hz, b = 50 Hz and an additional factor c = 50.
is like a formant with the starting part modified by the rising polynomial. The parameter b results in an actual bandwidth that is twice this value, i.e. B = 2b. Besides auditory modeling, many other physical phenomena, such as knocking on wood, can be modeled with gammatones.
2.7.6. Creating a gammachirp
The gammachirp introduces a frequency modulation term to the gammatone auditory filter
to produce a filter with an asymmetric amplitude spectrum [Irino and Patterson, 1997]. It
can be used in an asymmetric, level-dependent auditory filterbank in time-domain models of
auditory processing [Irino and Patterson, 1997]. The formula for a gammachirp is gc(t) = t^(γ−1) e^(−2πbt) cos(2πFt + c ln(t) + φ0), which shows that for c = 0 it reduces to the ordinary gammatone. From the gammachirp's formula it follows that its frequency varies as a function of time like f(t) = F + c/(2πt). So, theoretically, the frequency at the start of the sound (t = 0) is infinite and subsequently approaches the frequency F at a 1/t rate. In contrast with a linear sweep tone the frequency change is not constant in time but is largest in the first part of the sound. In order to avoid aliasing, the first part of the gammachirp, where the frequency is larger than the Nyquist frequency, has to be suppressed.
f = 500
bp = 50
gamma = 4
c = 50
phase = 0
Create Sound from formula: "gammachirp", 1, 0, 0.1, 44100,
... "if (f + c / (2 * pi * x)) < 22050
... then x^(gamma - 1) * exp(-2 * pi * bp * x) * cos(2 * pi * f * x + c * ln(x) + phase)
... else 0 fi"
Scale peak: 0.99
A sound with only one pulse can be used to study the impulse response of a digital filter. To create such a sound, where for example the first sample value is one and the rest are all zeros, the following script suffices:

Create Sound from formula: "1", 1, 0, 0.1, 44100, "if col = 1 then 1 else 0 fi"

This script line can easily be modified to put the 1-pulse at any sample number.
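Why a one-pulse sound reveals a filter's impulse response can be sketched in Python; the three-tap averaging filter below is an illustrative assumption, not a Praat filter.

```python
# A unit pulse fed through an FIR filter returns the filter's coefficients,
# which is exactly why a one-pulse sound exposes the impulse response.
def fir_filter(x, coeffs):
    """Convolve signal x with FIR coefficients (zero initial state)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * x[n - k]
        y.append(acc)
    return y

pulse = [1.0] + [0.0] * 7             # first sample 1, the rest 0
coeffs = [0.25, 0.5, 0.25]            # an illustrative three-tap average
print(fir_filter(pulse, coeffs)[:3])  # [0.25, 0.5, 0.25]
```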
If we want a pulse at a time that does not exactly match a sample time, things get a bit more complicated. Because the time is not at a sample point, a sound like this is not really band-limited, i.e. its spectrum will have frequencies above the Nyquist frequency. To correctly represent such a sound with a certain sampling frequency, we first have to band-limit it by low-pass filtering. Praat takes care of this if we start by creating a PointProcess object, then add the point at the desired time (s) to the PointProcess and then create the final sound from this PointProcess. The final part, the creation of a sound from a point process, takes care of the necessary filtering.
Create empty PointProcess: "pp", 0, 0.1
Add point: 0.01234567
To Sound (pulse train): 44100, 1, 0.05, 2000
Figure 2.14 shows the low-pass filtered pulse as (part of) the sound generated by the script
above.
Figure 2.14.: A pulse at time 0.01234567 represented in a sound of 44100 Hz sampling frequency.
A speaker produces a speech signal, the acoustic wave is picked up by a microphone and
transduced into an electrical signal. This electrical signal is amplified and the amplified signal
is recorded either in analog or in digital form. Nowadays the recording will be in digital form
only.
The conditions at a recording session can vary widely, to name some:
We are recording at home with a microphone directly attached to the computer. This is often the simplest environment for making a recording. All we need is a (good) microphone and a computer with a sound card. However, we don't control the environment, which means that noises could start at any time. For example, somebody enters our room, a car honks outside or one of the neighbors starts drilling a hole. Another source of noise is the computer itself: a fan might start to blow, a disk might spin up, or the electrical circuit of the computer might interfere with the circuitry of the input of the sound card.
We are recording in a studio. This is clearly the preferred way because we have almost
complete control over the recording environment. Environmental noise, the main disturbing factor in making a recording, can be minimized here. We are also not limited by
the complexity, weight or size of all the necessary recording equipment.
We are recording in the field. This might be at extreme places like somewhere in the Amazon basin, where you have to record the sounds of a very uncommon Indian language, or in the tundra of Siberia, where the last sounds of a dying language have to be recorded, or in a Chinese city, making recordings for a language change project. These kinds of field recordings clearly limit the amount of control that we can exercise on the recording environment and they also limit the size and weight of our recording equipment.
digitize a collection of wax rolls one needs specialized equipment. The old machines to play these rolls are hardly available these days. And even if they were, one runs the risk of damaging the old rolls because physical contact between a needle and the roll is needed. Preferably one uses the reflections of a laser beam to read out this old material.
AIFF stands for Audio Interchange File Format. The sound is stored uncompressed in big-endian format. It is the native audio sound file format of Apple Macintosh computers. The AIFF-C or AIFC format has the same structure as the AIFF format but may store the sound data in a compressed form. Because the concept of these file types is so elegant, we will

2 See section D on terminology.
3.5.2. The wav format
The wav format is the native audio file format for the Microsoft Windows operating system. Its file structure is derived from, and very similar to, the container format of AIFF; however, it stores the data in little-endian format. One weak spot of this format is that the sampling frequency has to be an integer number, which poses problems in the conversion of some older formats with non-integer sampling frequencies.
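The little-endian versus big-endian byte orders mentioned for the wav and AIFF formats can be illustrated with Python's standard struct module. This is a sketch for illustration only; it is not something Praat itself does.

```python
import struct

def sample_bytes(value):
    """Return the little-endian (WAV-style) and big-endian (AIFF-style)
    byte sequences of one signed 16-bit sample value."""
    little = struct.pack("<h", value)  # least significant byte first
    big = struct.pack(">h", value)     # most significant byte first
    return little, big

little, big = sample_bytes(0x1234)
# Same two bytes, opposite order:
assert little == b"\x34\x12"
assert big == b"\x12\x34"
# Reading little-endian data as if it were big-endian garbles the value,
# which is why a raw file opened with the wrong byte order sounds wrong:
assert struct.unpack(">h", little)[0] == 0x3412
```

Note that 0x3412 is 13330, a completely different amplitude from the 4660 (0x1234) that was stored; with every sample scrambled this way, the sound becomes incomprehensible.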
3.5.3. The FLAC format
The Free Lossless Audio Codec is an open source lossless audio codec. A codec is a piece
of software that can code and/or decode a piece of data. It supports streaming, seeking and
archiving. The compression is approximately 50%.
3 The cross-platform sound exchange program SoX opens and saves audio files in most popular formats and can optionally apply effects to them; it can combine multiple input sources, synthesise audio and, on many systems, act as a general-purpose audio player or a multi-track audio recorder. It also has limited ability to split the input into multiple output files.
3.5.4. The alaw format
This is a sound compression format used in European telephony. The amplitudes in a file are compressed with a logarithmic transform and quantized with 8 bits. This ensures that the lower-amplitude signals (where most of the information in speech takes place) get the highest bit resolution while still allowing enough dynamic range to encode high-amplitude signals. The sampling frequency is 8 kHz.
3.5.5. The µlaw format
This is also an 8-bit format like alaw, but for American and Japanese telephony. It uses a similar logarithmic transform for quantization and also an 8 kHz sampling frequency.
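Both telephony formats rest on a logarithmic companding curve applied before the 8-bit quantization. The following Python sketch of the ITU-T G.711 compression curves (an illustration; the constants A = 87.6 and µ = 255 are the standard values) shows why quiet speech sounds get the finest resolution.

```python
import math

A = 87.6    # A-law compression parameter (G.711)
MU = 255.0  # mu-law compression parameter (G.711)

def alaw_compress(x):
    """Map an amplitude in [-1, 1] to [-1, 1] with the A-law curve."""
    sign, x = math.copysign(1.0, x), abs(x)
    if x < 1.0 / A:
        y = A * x / (1.0 + math.log(A))
    else:
        y = (1.0 + math.log(A * x)) / (1.0 + math.log(A))
    return sign * y

def ulaw_compress(x):
    """Map an amplitude in [-1, 1] to [-1, 1] with the mu-law curve."""
    sign, x = math.copysign(1.0, x), abs(x)
    return sign * math.log(1.0 + MU * x) / math.log(1.0 + MU)

# A small amplitude is boosted strongly, a large one hardly at all, so
# the 8-bit quantizer spends more of its levels on quiet speech sounds:
assert alaw_compress(0.01) > 5 * 0.01
assert ulaw_compress(0.01) > 5 * 0.01
assert alaw_compress(1.0) == 1.0
```

The compressed value is what gets quantized to 8 bits; on playback the inverse curve restores the original dynamic range.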
3.5.6. Raw files
Sometimes sound is available in a format with incomplete data about the sound. Maybe only the number of samples is known, or it can be guessed from the file size. These sound files are called headerless files or raw files. Because the information is not in the file, we have to supply it afterwards. This may turn into a game of trial and error. If we know that each speech sample is a one-byte number, then we might try the Open > Read from special sound file > Read Sound from raw Alaw file... command. If this does not help, or we know that the speech data is in a two-byte format, we might try one of the two commands Read Sound from raw 16-bit Little Endian file... or Read Sound from raw 16-bit Big Endian file.... The little endian and big endian terminology is explained in appendix D and refers to the way a number is stored in the computer. If we have no idea which one to choose, then simply
try both ways. If you have chosen the wrong endianness, the sound will always sound terrible and be completely incomprehensible when played. The one that sounds best is the one you want. Once you have the endianness right, however, the sound's sampling frequency still might be wrong. If it sounds as if the speaker speaks very slowly and with a very low pitch, then the sampling frequency is too low and you have to increase it. By increasing the sampling frequency you speed up the sound: since the number of samples in a sound is fixed and you play more of them in each unit of time, the duration of the sound will decrease. This means that by changing the sampling frequency of a sound you implicitly change its duration. Note that we are not recording a sound here; we already have all the sample values and we know how many samples we have, so the only thing we change is the sound's playing characteristics. If we know the correct sampling frequency we can use the Modify > Override sampling frequency... command to impose the correct frequency on the sound. If we really don't know what the sampling frequency of the raw file was, we have to guess it by setting a frequency and then listening to whether it sounds right. Most of the time raw sound files date from long ago, because nobody would nowadays save data in this way. Popular sampling frequencies in those days were 8 kHz, 10 kHz and 16 kHz.
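The duration argument above is simple arithmetic: the sample count of a raw file is fixed, so the playing duration equals the number of samples divided by the assumed sampling frequency. A small Python check (illustrative only):

```python
def duration(number_of_samples, sampling_frequency):
    """Playing duration in seconds of a raw file with a fixed number
    of samples, as a function of the assumed sampling frequency."""
    return number_of_samples / sampling_frequency

n = 80000  # a sample count, e.g. guessed from the raw file size
assert duration(n, 8000) == 10.0   # played at 8 kHz: 10 seconds
assert duration(n, 16000) == 5.0   # doubling the frequency halves it
```

This is exactly why overriding the sampling frequency changes the perceived speed and pitch: the samples themselves are untouched, only the playback rate changes.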
3.5.7. The mp3 format
A lossy file format developed mainly at the Fraunhofer institute. The mp3 file format stores a compressed, lossy version of the original. Data-compression rates of 10 to 1 are easily obtained with only small audible artifacts. The mp3 format is not a free standard; there are licences that limit availability.
3.5.8. The ogg vorbis format
A better lossy compression format than mp3 and a completely free standard. With a lossy compression format you will never be able to restore the original sound wave, while a lossless format allows exact reconstruction. In other words: with a lossy compression format we lose some information about the original signal. It is a pity that this standard does not have enough followers yet.
3.6. Equipment
3.6.1. The microphone
The microphone transduces the acoustical sound pressure waves into an electrical signal.
Sound pressure variations cause a mechanical movement of a thin membrane, the diaphragm,
and this mechanical movement is transduced to an electrical signal. Various types of microphones exist; they can be distinguished by the type of physical effect they use to generate the electrical signal.
Electro-magnetic induction microphones, also called dynamic microphones, use the well-known fact that if a magnet moves past a wire, the magnet induces a current to flow in
the wire. There are two basic dynamic microphone types: the moving coil microphone
and the ribbon microphone. The moving coil microphone uses the same principle as is
used in the loudspeaker but instead of generating a movement of the coil in response
to a changing current, it generates a current in response to a moving coil. The ribbon
microphone uses a ribbon placed between the poles of a magnet to generate voltages by
electro-magnetic induction.
Piezoelectric microphones. Certain crystals change their electrical properties as they
change shape. By attaching the diaphragm to such a crystal, the crystal will create a
signal when sound waves hit the diaphragm.
Carbon microphones. These are the oldest and simplest microphones. They were used
in the first telephones and are still in use today. They use carbon powder in a container
that has a thin metal or plastic diaphragm on one side. Sound pressure variations change the degree of compression of the carbon powder, and this changes the powder's electrical resistance. By running a current through the carbon, the changing resistance changes the amount of current that flows.
3.6.2. The sound card
The sound card is the sound-processing unit, the audio workhorse of the computer. The card is used, among other things, for audio recording and audio generation. It is the interface between
the analog world outside the computer and the digital world inside. If you play an audio-CD
or a DVD in your computer, the sound of the player is sent in analog or digital form to the
sound card where it will be processed further. If you attach a microphone to the computer,
the microphone signal is processed by the sound card. If you order Praat to play a sound the
digital representation of the sound is sent to the sound card where it will be put in analog form
and sent to an external headphone or speaker. In figure 3.3 a typical year 2000 sound card
is shown. There is a lot of electronics on this card! In many modern computers the sound
processing unit is not on a separate card anymore but is integrated on the motherboard of the
computer. However, for a description of the sound processing that goes on in this unit, this
sound card gives a better view. This card sits inside the computer in a so-called PCI slot. The PCI slot offers a way for the card to communicate with the CPU of the computer. It is a two-way communication channel: digital signals can travel to and from the card to the CPU. The metal-colored plate on the left connects to the outside of the computer. In this picture you cannot see how it looks from the outside, but we can deduce how it would look. At the top left we see four colored blocks: orange, light blue, pink and lime green. These blocks are connectors to the outside world. The colors have been standardized:
lime green is for the line-out jack. It delivers an analog sound signal that can be made audible with headphones or a loudspeaker. The maximum amplitude of a line-out signal varies from card to card but approximates 1 volt (V).
light blue is for the line-in jack. It accepts an analog sound signal from devices like a CD player, a cassette deck or a tape recorder. The maximum line-in sensitivity is not standardized but is normally around 200 mV. There is controlling circuitry right after the line-in jack on the sound card to be able to accept a wider range of input amplitudes, sometimes up to 2 V.
pink/red is for the microphone jack. The sensitivity of the microphone input is not standardized and may vary between approximately 20 mV and 120 mV, depending on the type of sound card. Sensitivity is defined here as the minimum voltage that produces a full-scale level at the analog-to-digital converter if the microphone volume control is set to its maximum (i.e. the slider in the input mixer at maximum). Immediately after the microphone jack a special microphone pre-amplifier amplifies the microphone signal to bring it on par with the other line-level signals. This pre-amplifier is generally not
externally controllable and delivers a fixed amplification like 20 or 30 dB. The actual signal presented at the microphone jack may have voltages much larger than 20 or 120 mV. For these cases the volume control slider of the microphone can be used to attenuate these signals to stay within bounds acceptable to the ADC. Now the only place where things still might go wrong is in the microphone pre-amplifier, if the microphone signal presented at the jack overloads the pre-amplifier.
orange is the S/PDIF jack for digital sound output. Sometimes this jack is also used as an analog line output for another loudspeaker.
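The decibel figures above translate to voltage factors via the standard relation factor = 10^(dB/20). The Python sketch below (illustrative only, not part of Praat) checks the numbers mentioned for the microphone pre-amplifier.

```python
import math

def db_to_voltage_gain(db):
    """Voltage amplification factor corresponding to a gain in dB."""
    return 10 ** (db / 20)

def voltage_gain_to_db(factor):
    """Gain in dB corresponding to a voltage amplification factor."""
    return 20 * math.log10(factor)

# The fixed pre-amplifier gains mentioned in the text:
assert db_to_voltage_gain(20) == 10.0               # 20 dB = 10x voltage
assert abs(db_to_voltage_gain(30) - 31.62) < 0.01   # 30 dB is about 31.6x
# A 20 mV microphone signal amplified by 30 dB lands near line level:
assert abs(0.020 * db_to_voltage_gain(30) - 0.632) < 0.001
```

So a fixed 20 to 30 dB pre-amplifier boosts a millivolt-range microphone signal into the same few-hundred-millivolt range as the line inputs.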
In theory, the line-in and line-out are standardized in such a way that one should be able
to connect a signal from the line-out of a device to the accepting line-in of the sound card
without causing any signal loss or degradation. In practice this is not always the case and
one should carefully check whether the connection behaves as it should. Sometimes a kind of adaptor device is necessary in between. For example, if the amplitude of a signal coming out of an external device is too large for the input of the sound card, the signal will be clipped.
An example is shown in figure 3.4 where the input sound on the left is severely clipped. We
clearly see that tops and valleys are flattened as if cut off by a razor blade.
To avoid this clipping, some electronic circuitry is needed between the line-in jack and the ADC, which has a fixed maximum allowed input amplitude.
Figure 3.4.: An example of clipping. The signal at the left has amplitudes too large for the electronic device in the middle. The output signal on the right will be clipped (clipped parts shown in red).
If you want to listen to a clipped sound then run the following script in Praat and listen to
the two sounds. Both sounds are 500 Hz tones.
Create Sound from formula: "s500", 1, 0, 1, 44100, "sin(2*pi*500*x)"
Create Sound from formula: "s500o", 1, 0, 1, 44100, "2*sin(2*pi*500*x)"
The sound labeled s500 has an amplitude of 1 while the sound labeled s500o has an amplitude
of 2. If you look at both sounds in the sound editor, they show as perfect sines. However, s500
sounds like a perfect 500 Hz tone, while s500o sounds harsh. For the explanation we have to
know how a sound is played in Praat, i.e. what happens if you push the Play button when a
sound is selected. Praat then translates the selected sound to a series of numbers that are sent
to the sound card and then the DAC will use these numbers and convert them to an analog
signal that is used to drive the loudspeaker. If the amplitude of the sound varies between −1 and +1, the range of numbers that Praat sends to the DAC guarantees that the analog signal of the DAC varies between its minimum and its maximum amplitude. That is, the analog signal range that the DAC produces is optimally used if the amplitude of a sound is in the (−1, +1) interval. When Praat sends the numbers to the DAC it has to make all sound amplitudes that are larger than +1 equal to +1 and all amplitudes that are smaller than −1 equal to −1. This is clipping, and it produces the same effect as if we had generated the sound s500o with the following script.5
Create Sound from formula: "s500o2", 1, 0, 1, 44100, "2*sin(2*pi*500*x)"
Formula: "if self > 1 then 1 else if self < -1 then -1 else self fi fi"
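For readers who prefer to see the clamping rule outside Praat, here is a minimal Python sketch of the same hard-clipping operation (illustrative only):

```python
def clip(sample, lowest=-1.0, highest=1.0):
    """Hard clipping: the same rule as the Praat Formula above,
    applied to one sample value at a time."""
    if sample > highest:
        return highest
    if sample < lowest:
        return lowest
    return sample

# Samples of the 2*sin(...) signal that exceed the (-1, +1) range
# are flattened, which produces the harsh sound of s500o:
assert clip(2.0) == 1.0
assert clip(-1.7) == -1.0
assert clip(0.25) == 0.25
```

The flattened tops and valleys introduce frequency components that were not in the original pure tone; that is what you hear as harshness.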
Figure 3.5.: The layout of the 48 pins of the STAC9750 sound processing chip.
As an example of what might take place in such a chip, the pin layout of a much simpler one, the mid-1990s STAC9750, is shown in figure 3.5. The pin layout shows what kind of
5 This is the formula we have used to produce the clipped sound in figure 3.4.
In the diagram we see the different functional units and how they are interconnected. The
interface to the computer is displayed on the left while the connections with the outside world
are displayed on the right. Arrows show the direction of signal travel. From the direction of
the arrows we deduce that the mixing takes place while all signals are analog. In modern chips
the mixing takes place in a very powerful Digital Sound Processor (DSP).
3.6.3. The mixer
As was shown before, a sound card contains a signal mixer. This signal mixer is controlled by a
piece of software which is also called a mixer. This (software) mixer pops up an interface that
lets you control the various inputs and outputs of the hardware. If you click on the volume
control icon in the Windows tool bar, i.e. the small loudspeaker icon, the Windows output
mixer pops up as is displayed in figure 3.7.6 You raise the Windows input mixer, i.e. that part
of the mixer that controls the inputs to the computer, by selecting the Properties button in
the Options menu followed by choosing Recording: figure 3.8 shows up.
6 If there is no little loudspeaker icon in the toolbar, you can also get to the mixers via the Windows start menu at the bottom-left corner of the screen. The exact sequence to reach the mixers may vary somewhat, but the following choices have to be made: Control panel, followed by Sounds and audio, where you choose the Audio tab. Now you can reach the output and input mixers via the Volume... buttons in Sound playback or Sound recording, respectively.
For making a recording on the computer we use The Golden Mixer Rule: deselect all input devices except the one you want to use.7
7 The standard mixer software in Windows does not allow for mixing INPUT sources, only for PLAYING! Selection of an input source deselects all other sources automatically. So, it's not really a mixer but an input selector!
[Figure 3.9: block diagram of the analog to digital conversion process: signal (1) → low-pass filter, LPF (2) → filtered signal (3) → analog to digital converter, ADC (4) → numbers (5).]
In figure 3.9 the analog to digital conversion process is shown. The signal 1 enters the low-pass
filter 2 where frequencies that are above the Nyquist frequency are filtered out.8 The low-pass
filtered signal 3 then enters 4, the analog to digital converter (ADC). The ADC converts the
analog signal 3 into a series of numbers 5.
The ADC uses a clock that ticks at a regular time interval. This time interval is the inverse
of the sampling frequency. For example, if the sampling frequency is 44100 Hz, then we have
clock ticks at intervals of 1/44100 s. For each small time interval the ADC first measures
the average amplitude of the signal 3. This is called sampling. In the next phase, called
quantization, this analog average value is converted to a binary number and immediately sent
to the output of the ADC. In assigning this number, the ADC cannot exactly represent all possible analog values. Given the precision of its internal circuitry, it quantizes the value with a certain precision. This precision is expressed as the number of bits that the ADC uses for
the representation of the value. In figure 3.10 we show how this number of bits influences the
precision of the representation.
In the top left in panel (a) we see the analog signal. Panel (b) shows the crude quantization
with the two possible levels allowed in a one bit quantizer. Panel (c) shows the four possible
levels of a 2-bit quantizer. As the number of bits (n) increases, the number of levels increases as 2^n. In this way we can make the digital representation approximate the analog representation as closely as we want. Modern ADCs use at least 16 bits for the representation of analog values. This 16-bit precision is also used on an audio CD, on which every second of audio in each of the two channels is represented by 44100 numbers of 16-bit precision.
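The 2^n rule and the CD figures above can be checked with a few lines of Python (illustrative only):

```python
def levels(bits):
    """Number of quantization levels available to an n-bit quantizer."""
    return 2 ** bits

assert levels(1) == 2        # two levels, as in panel (b) of figure 3.10
assert levels(2) == 4        # four levels, as in panel (c)
assert levels(16) == 65536   # CD audio: levels from -32768 to +32767

# CD audio data rate: 44100 samples/s, 16 bits/sample, 2 channels
bits_per_second = 44100 * 16 * 2
assert bits_per_second == 1411200
```

So one second of stereo CD audio costs about 1.4 megabits, which is why the compression formats discussed earlier in this chapter matter.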
3.6.4.1. Aliasing
The low-pass filtering step 2 in figure 3.9 is essential for the digitized signal to be a faithful
representation of the original (in the frequency interval we are interested in). Shannon and
Nyquist proved in the 1930s that for the digital signal to be a faithful representation of the
analog signal, a relation between the sampling frequency and the bandwidth of the signal had
8 The frequency at half the sampling frequency is called the Nyquist frequency. For example, if the sampling frequency is 44100 Hz, the Nyquist frequency is 22050 Hz.
[Figure 3.10 shows panels (a) the analog signal, (b) 1 bit with levels +0 and −1, (c) 2 bit with levels +1 to −2, (d) 4 bit with levels +7 to −8, (e) 8 bit with levels +127 to −128, and (f) 16 bit with levels +32767 to −32768.]
Figure 3.10.: The influence of the number of bits in the quantization step on the faithfulness of the digital representation.
to be maintained. For speech and audio sounds this relation is expressed by the following
theorem.
The Nyquist-Shannon sampling theorem: A sound s(t) that contains no frequencies higher than F hertz is completely determined by giving its sample values at a series of points spaced 1/(2F) seconds apart.
The number of sample values per second corresponds to the term sampling frequency. Sample values at intervals of 1/(2F) s translate to a sampling frequency of 2F hertz. A variant of the above theorem is: if the highest frequency in a sound is F hertz, then the minimal sampling frequency we need is 2F hertz. We know that the highest frequency human beings can hear is nearly 20 kHz. To faithfully represent frequencies that high, we have to use a sampling frequency that is at least twice as high. Hence the 44100 Hz sampling frequency used in CD audio. All ADCs have a fixed highest sampling frequency, and to guarantee that the input contains no frequencies higher than half this frequency we have to filter them out. If we don't filter out these frequencies, they get aliased and would also contribute to the digitized representation. A famous non-audio example of aliasing occurs in cowboy movies, where the spokes in the wheels of the stage coach sometimes seem to turn backwards; the aliasing here is caused by not having taken enough pictures per second. In figure 3.11 we see an example of aliasing. The figure shows with black solid poles the result of sampling a sine of 100 Hz with
[Figure 3.11: amplitude from −1 to 1 against Time (s) from 0 to 0.01.]
Figure 3.11.: Aliasing example. The red dotted analog 900 Hz tone gets aliased to the black dotted 100 Hz tone after analog to digital conversion with a 1000 Hz sampling frequency.
a sampling frequency of 1000 Hz. This sampling frequency can easily be checked from the figure: we have 10 sample values in 0.01 s, which makes 1000 sample values in one second. As a reference, the analog sine signal is also drawn with a black dotted line. Therefore, the black dotted line could represent the analog signal before it is converted to a digital signal, while the black poles are the output of the ADC. The red dotted line shows nine periods of an analog sine in this same 0.01 s interval and accordingly has a frequency of 900 Hz. The figure makes clear that if the red dotted 900 Hz signal were offered to the ADC instead of the black dotted 100 Hz signal, the analog to digital conversion process would have resulted in the same black poles. This means that from the output of the ADC we can no longer reconstruct whether a 900 Hz or a 100 Hz sine was digitized: if we have a signal that contains, besides a sine of 100 Hz, also a sine of 900 Hz then, after the analog to digital conversion with a sampling frequency of 1000 Hz, only one frequency is left, namely 100 Hz. From this frequency we cannot reconstruct how much the 100 Hz component contributed and how much was aliased from the 900 Hz frequency. This is a very undesirable situation and therefore we have to take care to avoid it. It could happen because the 900 Hz frequency is above the 500 Hz Nyquist frequency. The solution is called low-pass filtering. Before the ADC we install a filter that lets frequencies lower than the Nyquist frequency pass and blocks frequencies higher than the Nyquist frequency.
The following script makes aliasing audible.
Create Sound from formula: "s1", 1, 0, 1, 11025, "0.5*sin(2*pi*500*x)"
Create Sound from formula: "s2", 1, 0, 1, 11025, "0.5*sin(2*pi*44600*x)"
Create Sound from formula: "s3", 1, 0, 1, 11025, "0.5*sin(2*pi*500*x)
... + 0.5*sin(2*pi*44600*x)"
The script creates three sounds of 1 second duration, all with a sampling frequency of 11025 Hz.
For the first sound, s1, the formula says that a tone of frequency 500 Hz is to be generated.
The second sound is generated from a frequency of 44600 Hz. The third sound is the sum of
the two previous sounds. The 44600 Hz frequency is so high that a human being cannot hear it; you would have to be a dolphin to be able to hear these frequencies. However, if you listen to the three sounds, all are very audible pure tones of one frequency. What is going on? Why do we hear sounds we are not supposed to hear? The functions that we use in the formula in the Create Sound from formula... command are defined for all values of x. It is the fifth argument of this command, the value for the sampling frequency, that specifies that the formula is to be evaluated at 11025 different values of x in the interval from 0 to 1 s, i.e. the analog formula is sampled with a sampling frequency of 11025 Hz. There is an analog to digital converter working in the Create Sound from formula... command! It is not a real hardware converter that you can touch: it works in software. This command therefore simulates an analog to digital converter.9 The 44600 Hz tone has not been represented faithfully because the sampling frequency, 11025 Hz, was too low. To faithfully represent a 44600 Hz tone we need a sampling frequency of at least 89200 (= 2 × 44600) Hz. What happened to the 44600 Hz signal is the same as what happened to the 900 Hz signal in figure 3.11: aliasing. The 44600 (= 4 × 11025 + 500) Hz tone was aliased to a 500 Hz frequency.
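The folding of a too-high frequency into the Nyquist range can be sketched in a few lines of Python (an illustration, not a Praat facility); it reproduces both the figure 3.11 example and the script example above.

```python
def aliased_frequency(f, fs):
    """Frequency heard after sampling a pure tone of f hertz at fs hertz:
    fold f into the range [0, fs/2] (up to the Nyquist frequency)."""
    f = f % fs        # remove whole multiples of the sampling frequency
    if f > fs / 2:    # frequencies above the Nyquist frequency fold back
        f = fs - f
    return f

assert aliased_frequency(900, 1000) == 100      # the figure 3.11 example
assert aliased_frequency(44600, 11025) == 500   # the script example
assert aliased_frequency(500, 11025) == 500     # below Nyquist: unchanged
```

A tone below the Nyquist frequency passes through unchanged, which is why s1 and s2 in the script above end up sounding identical.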
[Figure 3.12: block diagram of the digital to analog conversion process: numbers (1) → digital to analog converter, DAC (2) → step-like signal (3) → low-pass filter, LPF (4).]
In figure 3.12 the digital to analog conversion process is shown. This process is almost the
reverse of the analog to digital conversion process. We start at 1 with a series of numbers as
input to the digital to analog converter 2. At each clock tick, the DAC converts a number to
an analog voltage and maintains that voltage on its output until the next clock tick. Then a
new number is processed. This results in a not-so-smooth, step-like signal 3. If made audible
this signal would sound harsh. In step 4, this step-like signal is low-pass filtered to remove
frequencies above the Nyquist frequency.
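The step-like DAC output (signal 3) is a zero-order hold: each number is held constant until the next clock tick. A minimal Python sketch (illustrative only) simulates this by repeating each sample value on a finer time grid:

```python
def zero_order_hold(samples, factor):
    """Simulate the step-like DAC output: each sample value is held
    until the next clock tick, here modelled by repeating each value
    'factor' times on a finer time grid."""
    return [value for value in samples for _ in range(factor)]

# Two clock ticks, held on a 4x finer grid, give a staircase:
assert zero_order_hold([0.5, -0.25], 4) == [0.5, 0.5, 0.5, 0.5,
                                            -0.25, -0.25, -0.25, -0.25]
```

The sharp corners of this staircase contain high-frequency energy, which is exactly what the low-pass filter in step 4 removes.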
9 In fact the simulation of the analog to digital conversion is much better than any hardware device now or in the foreseeable future can deliver. The quantization in Praat is only limited by the precision of the floating-point arithmetic units. Sounds are represented with double-precision numbers: this roughly corresponds to 52-bit precision. The best hardware nowadays quantizes with 24 bits of precision.
Many sound chips nowadays contain a special purpose processor for digital sounds, the Digital
Signal Processor (DSP). Without going into details, these processors are specialised for digital
signal processing.
4. Praat scripting
A script is a text that controls the actions of one (or more) programs, here Praat. The format of this script text is not completely free but must conform to certain syntax rules. For example, a script text may have Praat menu and Praat action commands. When the actions in the script text are performed by Praat, we speak of running a script. When you run a script, the text of the script will be interpreted and the corresponding actions will be performed. The part of Praat that reads and interprets the script text and then initiates these actions is called an interpreter. In short: a script is run by the interpreter.
A script can be useful for various reasons.
To automate repetitive actions. You have to do the same series of analyses on a large corpus and you don't want to sit for months at the computer clicking away to do your analyses on those thousands of files. Instead, you write a script that performs all necessary steps, for example, reading a sound file from disk, performing a pitch analysis and saving the results. For these cases you first test the script thoroughly on a small number of files and then order Praat to run the script on all the files in the corpus. You sit back and relax while all the analyses are carried out automatically.
To log actions. If you want to know later what you are doing now to achieve a certain
result, you can save your actions in a script and save that script to a file. If you want to
repeat the actions at a later moment you open the script file in Praat and run it.
To communicate unambiguously to other people what you have done and how they may
achieve the same results. We have already seen many examples of this use of scripting
in previous chapters. In this book many examples will be accompanied by a script that
you can download.
To make drawings in the picture window. Especially for articles you want to add all
kinds of additional info in a drawing. A script lets you successively add more and more
to a picture in the drawing window. Nearly all drawings in this book were produced
with a script.
To add a new button in the menu. For example, you have a series of actions on a selected
sound that have to be performed in a prescribed order. You may script these actions and
define a new button in the dynamic menu so every time you have a sound selected
and you click that button, the actions in the associated script will be carried out in the
prescribed order.
Examples of the various uses of scripting will be given in the sequel. We will start by showing
you how simple it is to add functionality to Praat once you know how to script. To start
scripting, in its most elementary form, you do not have to learn any new commands to address
the functionality of Praat: you just use the commands that you already know, i.e. a simple
script is a sequence of one or more lines that are copies of the text that is on a command
button. For example if you have created a sound and want a command in the script to play this
sound, a single script line with only the text Play suffices.
If we want to use the outcome of a Praat command in, for example, a complicated numerical expression, we have to use a somewhat more extensive formulation (10 characters more). Say we want to calculate half the duration of a selected sound. The following scriptlet does so:
duration = Get total duration
mean2 = 0.5 * duration
The do function accepts as its first argument the command as it appears on the button, between double quotes, followed by the comma-separated arguments. For example, the following two
queries of a sound have the same effect:
minimum = Get minimum: 0, 0, "Sinc70"
minimum = do ("Get minimum...", 0, 0, "Sinc70")
The first line saves you 10 characters of typing for a command that has three dots, and 7 characters for a command without dots. In this book we will mostly use the first notation, without the explicit do. Besides do there is do$, which returns a string instead of a number.
two lines, the window bar is titled untitled script (modified). We save the script as a file
on the computer disk, with the name playTwice.praat, by choosing the Save as... command
from the File menu. After saving, the title of the script editor will reflect this name change as
the right panel in figure 4.1 shows.
Now try out the script: first select a sound from the Object window. Then choose the Run
command from the Run menu in the script editor. If you have typed the Play commands
correctly you will hear the sound played twice. The actions you just performed are the basics
of scripting: opening the script editor, typing in some script lines and trying them out by
using the Run command. Next to the Run command you may have noticed another option, Run selection, whose function might be obvious now: it only runs the part of the script that you have selected. For example, if you select one of the Play commands and next click on the Run selection command, you will hear the sound played only once. Clicking on the Run command
will always result in the execution of the complete contents of the script editor (even if there
is a selection). The Run selection command gives you the opportunity to only execute part
of the script.
1 This is a shorthand notation for the path to find the Create Sound from formula... command: click on the New menu; this opens a new list of possible commands. Click on the Sound option in this list. This opens a new list from which you can choose the given command. This notation is sometimes necessary so you can find a command; Praat knows where to find its commands.
We have the script working and now it is the time to add the new button with the text Play
twice. Defining a new button in Praat is done by associating a script with the new button;
clicking the button will then run the script. This is easy to accomplish: from the script editor's File menu, as shown in figure 4.2, choose the Add to dynamic menu... command, and then a form like the one displayed in figure 4.3 pops up. We can now add the new functionality
to the dynamic menu. In the figure we have already modified two fields. The first modified
field labeled Command originally showed Do it... and now Play twice. The text in the
Command field will be displayed on the new button. The contents of the second modified
field, After command, directs Praat to place the new button after the command you give
here. By filling out Play here, the new Play twice button will appear in the list after the
existing Play button. If you leave the After command field empty, or you give a non-existing command name, Praat will place the new button at the bottom of the dynamic menu list. The last field, Script file, contains the complete file name of your script. Its content is automatically placed there by Praat. Be aware that file naming is different on Linux, Macintosh and Windows. For example, on Linux, the folders in the path are separated by / symbols. On Windows, folders are separated by the \ symbol.
After you have clicked the OK button, the dynamic menu changes and a new button with
the Play twice text appears below the Play button. The right pane in figure 4.3 shows what the upper part of the sound's dynamic menu looks like with the new button added. The newly
defined button has the functionality of the script that you associated with it. You may close
the script editor now.
Figure 4.3.: The script editor's Add to dynamic menu... form and its effect on the dynamic menu of a sound.
want to see or never want to use, and also to remove buttons that you have added yourself.2 In this case you want to remove the Play twice button. The buttons editor cannot show you all buttons at once because there are thousands of defined buttons in Praat. Therefore, buttons are grouped into categories, shown in the editor just above the scrollable part; they are labeled Objects, Picture, Editors, Actions A-M and finally Actions N-Z. The last two categories concern, among others, the dynamic menu buttons. Selecting one
of them displays all the buttons available for object types that start with a character in the
range A-M or N-Z, respectively. Because the Play twice button only works if a Sound
object is selected, and Sound starts with an S, we have to choose the category Actions
N-Z, as displayed in figure 4.4. Lots of lines are shown in the buttons editor. Most lines start with a toggle word displayed in blue, followed by the object type and, again in blue, an action that can be performed with this object. The text of the action is the text you will also find on the corresponding button. We have to scroll down until we see the part that
resembles the part shown in figure 4.4. The bottom line shows the characteristics of the Play
twice button. If we click on the blue ADDED, the text in the buttons editor immediately
changes to REMOVED and the Play twice button instantly disappears from the dynamic
menu. Clicking that same line again will change the text back to ADDED and make the button reappear.
Some final remarks on adding, removing and hiding buttons:
2 And the maintainers of Praat can guarantee that old scripts keep working by hiding, for example, buttons whose names have changed.
4. Praat scripting
Figure 4.4.: Part of the buttons editor after first choosing Actions N-Z followed by scrolling to
the actions for a Sound object.
The first line in figure 4.4 shows a command that is normally hidden from the dynamic part of the Save menu. If you click on the word hidden, this word will change to SHOWN and the action will become available in the Save menu of a sound. The first word on a line is a toggle that switches either between hidden and shown or between ADDED and REMOVED. Actually, there is semantics in the capitalization of these words too: a capitalized word indicates that you changed a setting, while normal characters indicate the setting was made by the authors of Praat.
You can use all the actions in the buttons editor as if they were real buttons, i.e. clicking on Sound help in the second line of the editor in figure 4.4 will display the help
window.
A number between parentheses after the object type indicates the exact number of objects that have to be selected to make the command available. The figure shows that to invoke the View & Edit command only one sound may be selected, and to invoke the (hidden) Save as stereo FLAC file... command exactly two sounds have to be selected together.
We have chosen to add a new button to the dynamic menu and not to a fixed menu
because we only want our new command to be available if a sound is selected. Adding
the new button to one of the fixed menus would always make it visible, no matter which object was selected. The Play twice command does not make sense for
all types of objects defined in Praat and, therefore, does not belong in any of the fixed
menus of Praat.
Our script was very simple and only contained the command Play; however, we can
use any other Praat command in the script because pushing the new button simply runs
the associated script. If the script conforms to the syntax and semantics of Praat it will
be run. All actions in Praat, i.e. all the buttons and all the forms, can be addressed in a
script, and much more.
You don't need to define a button if you want to run a script; the Run command in the script editor can take care of that. We defined the new button just to show you how easy it is to do so.
3 This is how we could have started in the first place, but then you would have missed the experience of a script editor full of previous commands.
If we run this script it repeats the three actions we have just carried out.
Beware: the first two lines in this script were actually one long script line; we have split this line to make it fit on the paper. This is something we are always allowed to do with long script lines that don't fit in the normal width of the editor (or the paper). There are two things to keep in mind when splitting lines: (1) we can only split at positions where adding extra white space wouldn't matter, and (2) we have to start the continuation part with three consecutive dots (...). These three dots signal that this line is a continuation of the previous line.4 We may use as many continuation lines as we wish, and white space before the three dots is allowed, as the above script shows.
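For instance, the Create command used earlier in this chapter could be split over two lines like this (a sketch; the split position is arbitrary as long as extra white space would not matter there):

```praat
# one logical command, split with continuation dots
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 ,
... 440.0 , 0.2 , 0.01 , 0.01
```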
Note that the fields that show up from top to bottom in a form are shown from left to right in a script line. We must always maintain this correspondence between the position of a field
in a form and the position of the field in a line of text in a script.
We will edit these lines until they have the form that we want. We only have to change the
End time to 0.2 and we add a line after the Play command that writes the frequency to the
info window. The script is now:
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , 440.0 , 0.2 , 0.01 , 0.01
Play
writeInfoLine : "The frequency of the tone was 440 Hz"
Remove
The writeInfoLine function first clears the info window and then writes text to it. If you run the new script, the duration of the sound is now shorter than in the previous version. The info window will be as in figure 4.5.
After running this script a couple of times we are getting bored: the script plays a tone, but it is the same tone every time. We would like to vary the tone's frequency. We can do so by typing another number instead of the 440.0 in the Tone frequency field and changing the number in the writeInfoLine function. Let's say we change both numbers 440 in the script above to 1000. If we run the script now, you will hear a higher tone, one of 1000 Hz, and the info window will also show the new frequency.
By using this script we have actually saved some time. Instead of the three actions: creating
a tone, playing the tone and removing the tone, we only have one action now: running the
script.
4 All commands in Praat that put a form on the screen also end with three dots.
To change the frequency of the tone we had to change the number 440 at two places: in the formula and in the writeInfoLine line. This, however, is error prone. For instance, we could make a typing error or forget to change the second occurrence. If a typing error makes the syntax incorrect then Praat, of course, generates an error message with detailed information about the line number and the contents of the line where things went wrong, but nevertheless, we have to carefully check at two different positions. We can achieve a simpler version by introducing a variable for the frequency, as the following script shows.
frequency = 1000
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , frequency , 0.2 , 0.01 , 0.01
Play
writeInfoLine : "The frequency of the tone was " , frequency , " Hz"
Remove
The first line introduces a variable with the name frequency and assigns the value 1000 to it. A variable is something like a labeled box in the memory of the computer: the label on the box is the variable's name and the contents of the box is the variable's value. Every time that you use the name of a variable in a script, the computer will use its value. When the second line is executed, the value of the frequency variable is used. The writeInfoLine now also writes the value of the frequency variable to the info window. If you run the modified script, the results will be exactly as they were in the previous section. However, by using the variable frequency we have achieved something important: if we want to change the frequency of the tone and simultaneously the information displayed about this frequency in the info window, we now only need to change the frequency value at one place in the new script instead of at the two occurrences in the previous script.
Keep in mind that variable names always have to start with a lowercase character and that only the characters a-z, A-Z, 0-9 and the underscore _ are allowed in the rest of the name. Therefore frequency, a34 and bcX3 are valid variable names while Frequency, $a and _frequency are not. The characters . and $ have special meanings in a variable name. Only Praat menu commands (have to) start with an uppercase character.
4.4.2. Improvement 2, defining a minimum form
Now we would like to skip editing the script each time we want a tone with a different frequency. We would like the script to raise a form in which we can type the desired frequency. The following script improves on what we had.
form Play tone
positive frequency
endform
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , frequency , 0.2 , 0.01 , 0.01
Play
writeInfoLine : "The frequency of the tone was " , frequency , " Hz"
Remove
When run, the script raises the form displayed in figure 4.6. This form is defined in the
first three lines of the script. The first line defines the title for the form, i.e. the text Play
tone at the top. You can choose your own text; you can even have no text at all.5 The text of the window should describe or summarize the actions of the script in a compact way. The
second line in the script defines a numeric field named frequency that allows only positive
numbers. If you run the script and type a number in the frequency field that is less than or
equal to zero, a message is generated informing you that you made an error. The second line of the script serves two purposes: it defines the field of the form labeled frequency, and at the same time it guarantees that a new variable is created that also has the name frequency. This new frequency variable will receive the value of the frequency field once OK is clicked. In this way the script and its form communicate: the field name in the form corresponds to a variable that bears the same name in the script.6 The endform closes
the definition of the form.
Note that, again, all these form definitions (form, endform and positive) start with a lowercase character. Only Praat commands and actions start with an uppercase character.
4.4.3. Improvement 3, default value in the form
A minor annoyance of the previous script is that when the form pops up you have no idea what you should type in the frequency field. If you click OK without typing anything, an error message pops up. It would be nice if the script supplied a default value that guarantees that the script runs if the user just clicks OK. Error messages should pop up only if you do something wrong; there is nothing wrong in just accepting defaults. The following script preloads a default value into the frequency field. This number happens to be 440.0.7
form Play tone
positive frequency 440.0
endform
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , frequency , 0.2 , 0.01 , 0.01
Play
writeInfoLine : "The frequency of the tone was " , frequency , " Hz"
Remove
The form that pops up is like the form in figure 4.6 but now shows the number 440.0 in the
frequency field.
The basic elements for constructing a form are now in place: each user-supplied argument needs at least a line in the script that starts with <argument_type> followed by <argument_name>.

5 In the latter case there has to be at least one white space after the form text in the script.
6 Use the underscore _ to create white space. For example, the field name frequency_value with associated variable frequency_value shows as frequency value in the form.
7 To inform the user that real numbers are allowed, it is better to preload with a number that makes this explicit. We therefore used 440.0 instead of 440 for the default value.
The next improvements are only cosmetic, but nevertheless important. First, we want to see Frequency as the title of the field instead of frequency, as all field names in a Praat form start with an uppercase character. As we saw earlier, variables can only start with lowercase characters; therefore, to avoid a conflict, Praat automatically converts the first character of the associated variable to lowercase. In this way the field name Frequency can start with an uppercase character while the associated variable frequency starts with a lowercase character.
The other cosmetic change is that the second line is indented now, to let the form and the
endform stand out. The first three lines of the script now read as follows.
form Play tone
    positive Frequency 440.0
endform
The final improvement is cosmetic again. We want to communicate that the unit for the Frequency field is hertz. In this script the name of the field has been changed to Frequency_(Hz).

Script 4.1: The final Play tone example.

form Play tone
    positive Frequency_(Hz) 440.0
endform
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , frequency , 0.2 , 0.01 , 0.01
Play
writeInfoLine : "The frequency of the tone was " , frequency , " Hz"
Remove

Despite this change, the associated variable is still named frequency. During
the creation of the form, Praat chops off the last part, _(Hz), to create the variable name. Actually, Praat chops off _( and everything that follows it from the field name.
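As a sketch of this chopping rule, with a hypothetical Duration field that is not part of the running example: the field name Duration_(s) yields the script variable duration.

```praat
form Tone settings
    positive Duration_(s) 0.4
endform
# Praat strips "_(s)" and lowercases the first character, so the variable is "duration"
writeInfoLine : "duration = " , duration , " s"
```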
4.4.6. Variation
Suppose you want to keep the Sounds that were created by the script. You remove the last line
in the script and all the newly created Sounds will be kept in the list of objects. However, they
all carry the same name. You want them to have a meaningful name that enables you to easily
identify the Sounds. The new script:
form Play tone
positive Frequency_(Hz) 440.0
endform
Create Sound as pure tone : "s_" + string$ (frequency) , 1 , 0 , 0.4 , 44100 ,
... frequency , 0.2 , 0.01 , 0.01
Play
writeInfoLine : "The frequency of the tone was " , frequency , " Hz"
There are two new things in this script: the string$ (frequency) part converts the frequency number to a string, and the + operator concatenates, i.e. joins, the two strings. All your sounds will now have names that start with s_ and have the frequency attached. For example, if you run this script and type 1000 in the Frequency field, the sound appears with the name s_1000. Note that only the integer part of a number will end up in the name, because dots are not allowed in object names.
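The next paragraph discusses three assignment lines; from the description they must look like this sketch:

```praat
duration = 0.2
start = 0.1
end = start + duration
```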
In the first line the variable duration is given the numeric value 0.2 and in the second line
variable start is given the value 0.1. In other words we put the numeric values 0.2 and 0.1
in the boxes labeled duration and start. Finally, in the third line we use the contents of the
boxes start and duration, add these values and put the result, which is the numeric value
0.3, in a box labeled end.
The name of a variable has to start with a lowercase character. Although nothing forces you to do so, in general you want to give a variable a meaningful name like, for example, time or frequency or duration, and not something like xDS24_hY. Variable names are case sensitive, i.e. the variables with names xyz, xYz, xyZ and xYZ are all different variables. There are two kinds of variables in Praat: variables that hold numbers and variables that hold text. The variables that hold text must end with a dollar sign $ and are called text variables or string variables.
Sometimes we just have to. If we want the user to modify a field in a form, the user input has to be a variable. For example, the input form used in the create and play a tone example of section 4.7 necessarily needs a variable to store the user input for the Frequency field.

It makes scripting a lot easier, more powerful and more fun to do. Without variables, the only thing we could do would be the repetition of a number of Praat commands.
There are two predefined mathematical constants in Praat, pi and e, which are used as approximations for the mathematical numbers π and e.8 Because they are constants, pi and e can only appear on the right-hand side of an equal sign in your script, i.e. their values cannot be changed; Praat will issue an error message if you try to. The number π is defined as the ratio between the circumference and the diameter of a circle. The number e is Euler's number and is commonly defined as the base of the natural logarithm; several other definitions for e exist. These numbers are used so often in scripting that they deserve special treatment. Table 4.1 shows the relation between the mathematical symbol, the name used in Praat and its numerical approximation. This means that if you use pi in a numerical context, for example in the calculation 2*pi, the interpreter will substitute for pi the value in the third column, resulting in the number 6.2831853071795864769252867665590057683944.
8 The numbers π and e have an infinite number of decimals. Therefore, they cannot be represented exactly on a digital computer whose computing elements only allow calculations with a finite number of digits.
Mathematical symbol    Variable    Approximation in Praat
π                      pi          3.1415926535897932384626433832795028841972
e                      e           2.7182818284590452353602874713526624977572
A predefined variable is a variable that has already been given a value by Praat. You do not need to assign a value to it, although that is not forbidden and the interpreter will therefore not complain most of the time.9 We consider three types of predefined variables in Praat: predefined string variables, predefined numerical variables associated with matrix types like Sound and Spectrum, and finally booleans associated with operating system identification.
9 However, […] complain.
Because of the newline$ at the end of the third line there is an extra blank line in the info
window. Three of the predefined string variables happen to identify the same directory on my
system.
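A sketch of the kind of scriptlet being described, using Praat's predefined string variables (the selection of variables here is our own; the exact original lines may differ):

```praat
writeInfoLine : "home: " , homeDirectory$
appendInfoLine : "preferences: " , preferencesDirectory$
appendInfoLine : "temporary: " , temporaryDirectory$ , newline$
```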
xmin, xmax: the start value and the end value of the row domain. For a sound they represent the start time and the end time; for a spectrum, the start frequency and the end frequency, for example.
Suppose you want to define a sound signal whose amplitude starts at a value of zero and then runs linearly to 1 at the end of the sound. If the sound starts at 0 s and ends at 1 s, this is simple. The Formula field of Create Sound from formula..., which as we know specifies the amplitudes of the sound, can be as simple as: x. Because x stands for time, if it runs from 0 to 1 second the amplitude will also run from 0 to 1. However, if we don't want a 1 s duration but, say, 0.5 s, we change the End time to 0.5 and now the Formula field must read 2*x, because we have to reach an amplitude of 1 when x reaches 0.5. For a duration of 0.2 s the Formula field has to read 5*x. Each time we change the End time, i.e. the duration, we have to change the formula too. This is very frustrating, and we have not even changed the starting time! Luckily there is an elegant way to solve this problem. We can use the time domain information, i.e. xmin and xmax, to construct a formula that will always give the correct amplitude behaviour, irrespective of the duration, and, what's even better, it is also independent of the starting time. If the Formula field reads (x-xmin)/(xmax-xmin), then we are done. Check: at the start of the sound x equals xmin, so the numerator will be zero, as will be the amplitude; at the end x equals xmax, so the numerator and denominator are equal and the amplitude will equal one. For values of x in between these extremes the increase in amplitude is linear, as the formula says.
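A sketch that puts this formula to work with an arbitrary non-zero start time (the object name and times are made up):

```praat
# a linear ramp from 0 to 1 over 0.5..0.75 s; the formula is independent of the domain
Create Sound from formula : "ramp" , 1 , 0.5 , 0.75 , 44100 , "(x - xmin) / (xmax - xmin)"
```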
For a mono sound of 1 second duration that starts at 0 s with a 16000 Hz sampling frequency, the values of these variables are

xmin = 0; ymin = 1; xmax = 1; ymax = 1; nx = 16000; ny = 1;
x1 = 3.125·10⁻⁵; y1 = 1; dx = 6.25·10⁻⁵; dy = 1.
For a stereo sound of 2 second duration that starts at 1 s with a 16000 Hz sampling frequency
the values of these variables are
xmin = 1; ymin = 1; xmax = 3; ymax = 2; nx = 32000; ny = 2;
x1 = 1 + 3.125·10⁻⁵; y1 = 1; dx = 6.25·10⁻⁵; dy = 1.
For a more complex signal like the spectrum derived from the first sound by means of the
FFT algorithm, the values are
xmin = 0; ymin = 1; xmax = 8000; ymax = 2; nx = 8193; ny = 2;
x1 = 0; y1 = 1; dx = 8000/8192; dy = 1.
For a Matrix object created by the command Create simple Matrix... xy 10 10 row*col, the
values are
xmin = 0.5; ymin = 0.5; xmax = 10.5; ymax = 10.5; nx = 10; ny = 10;
x1 = 1; y1 = 1; dx = 1; dy = 1.
Warning: Try to avoid the variable names described in this section in an assignment, i.e. don't use them on the left-hand side of an equal sign, because of possible ambiguities. The following scriptlet is illegal in Praat and the interpreter will issue a warning that the variable dx is ambiguous: the variable is assigned the value 5, but in the formula context the value of dx equals 1/16000.

Create Sound from formula : "s" , 1 , 0 , 1 , 16000 , "1/2 * sin(2*pi*377*x)"
# illegal assignment of dx before using it in array context!
dx = 5
Formula : "self*dx"
The following script, although not recommended, is legal because there is no possible conflict.

Create Sound from formula : "s" , 1 , 0 , 1 , 16000 , "1/2 * sin(2*pi*377*x)"
Formula : "self*dx"
dx = 5
Figure 4.8.: Left: the query menus of a Sound. Right: the Get minimum query form.
Most objects in Praat can be queried. In the left part of figure 4.8 we show the query menus of a sound. Needless to say, a sound must have been selected for these queries to appear. You can see in the figure that some of the queries have been grouped, like Query time domain. This query expands to three separate queries: Get start time, Get end time and Get total duration. Now if you click, for example, on the Get total duration query, the result of the query will show itself in the info window: a new line appears that shows the total duration of the sound. If we want access to the total duration from within a script, we can simply assign the output of the query command to a variable, as in the following scriptlet:
duration = Get total duration
writeInfoLine : "The duration is " , duration , " s."
The first line queries a selected sound for its total duration and assigns the output of this query
to the variable duration. The last line then writes the information to the info window.
For queries that show a form, like for example the Get minimum... query whose form is shown in the right part of figure 4.8, you fill out the parameters as in the following scriptlet, where we query for the minimum value in a sound.
minimum = Get minimum : 0 , 0 , "Sinc70"
For a time interval, the default zero values generally mean that the whole domain is used (the Sinc70 method uses a very precise interpolation to attain the real minimum of the sound).
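A sketch that combines both kinds of query on a selected Sound (the report text is our own):

```praat
# both queries act on the currently selected Sound
duration = Get total duration
minimum = Get minimum : 0 , 0 , "Sinc70"
writeInfoLine : "duration: " , duration , " s; minimum: " , minimum
```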
In this way the generated tone will always have a correct frequency. A notational variant that
would have the same effect is
if frequency >= 22050
exitScript : "The frequency must be lower than 22050 Hz."
endif
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , frequency , 0.2 , 0.01 , 0.01
Frequencies too low for us to hear, say lower than 30 Hz, are called infrasonic frequencies. Elephants use infrasound to communicate. You could extend the script with an extra test for infrasound:11
if frequency >= 22050
exitScript : "The frequency must be lower than 22050 Hz."
elsif frequency <= 30
exitScript : "The frequency must be higher than 30 Hz."
endif
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , frequency , 0.2 , 0.01 , 0.01
10 According to […] deafness/HearingRange.html).
11 Note that the loudspeakers of PCs, and especially of laptops, are most of the time very bad at reproducing low frequencies (say, lower than 100 Hz).
or in another variant
if frequency > 30 and frequency < 22050
Create Sound as pure tone : "tone" , 1 , 0 , 0.4 , 44100 , frequency , 0.2 , 0.01 , 0.01
else
exitScript : "The frequency must be higher than 30 Hz and lower than 22050 Hz."
endif
For a conditional expression in a formula, such as occurs in the Create Sound from formula... command, we have to use a syntactical variant of the if. Because a formula is essentially a one-liner, we have to use the form

if <test> then <something> else <something else> endif

in which the <test> and <something> parts are expressions and the else part is not optional. Instead of the closing endif, we may also use the shorter fi, as in

if <test> then <something> else <something else> fi
For example, the following one-liner creates a noise with a gap in it (or two noises, if you like).
Create Sound from formula : "gap" , 1 , 0 , 0.3 , 44100 ,
... "if x > 0.1 and x < 0.2 then 0 else randomGauss (0 , 0.1) fi"
To select an interval you can combine the lower limit and the upper limit with and, as the previous script does, or with or, as in the following one. Both scripts result in exactly
the same sound.
Create Sound from formula : "gap" , 1 , 0 , 0.3 , 44100 ,
... "if x <= 0.1 or x >= 0.2 then randomGauss (0 , 0.1) else 0 fi"
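The two-step variant discussed next, in which the noise is created first and then modified in place, would look something like this sketch:

```praat
# create the noise first, then silence the interval between 0.1 and 0.2 s
Create Sound from formula : "gap" , 1 , 0 , 0.3 , 44100 , "randomGauss (0 , 0.1)"
Formula : "if x > 0.1 and x < 0.2 then 0 else self fi"
```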
We can also create the noise first and then modify the existing sound with a formula. In the else part the expression self essentially means "leave me alone": the self after the else indicates that the else part applies no changes to the sound.
4.6.2. Create a stereo sound from a Formula
We can create stereo sounds in Praat in various ways. For example, we could read a stereo sound from a file with the New → Read from file... command. Or we could select two mono sound objects together and make a stereo sound from them by using the Combine → Combine to stereo command. In this section, however, we use the Create Sound from formula... command to create a stereo sound, because we want to show conditional expressions in formulas. We start by creating a stereo sound with identical sounds in the left and right channels. After this we will learn how to use a conditional expression to make different sounds in the left and the right channel. We will learn something about beats too.12
We start by creating a stereo sound in which each channel is the same combination of two
tones that differ only slightly in frequency:
Create Sound from formula : "s" , "Stereo" , 0 , 2 , 44100 ,
... "1/2 * sin(2*pi*500*x) + 1/2 * sin(2*pi*505*x)"
Play
This command needs the Stereo option in the Number of channels field. Of course, you could also have supplied the number 2, because a stereo sound has two channels by definition. The formula part now contains an expression that is the sum of two tones. To check that the sound is indeed stereo, click the Edit button in the dynamic menu. The two channels will appear in the sound window, one above the other. (Listening to a sound in order to decide whether it is mono or stereo generally does not give you any clue, because many sound systems play a mono sound on both channels, i.e. as if it were a stereo sound with identical left and right channels.)
Another way to check for stereo is to click the Info button at the bottom of the Object window while the sound is selected. A new Info window pops up, showing information about the selected sound. The info window starts with general information about the sound: its number in the list of objects, its type, its name, and the date and time the info was requested. If you click the info button repeatedly, this line will be the only line that changes. On line six the number of channels shows 2, which means that it is a stereo sound. The next lines show information on the time domain: the start time and end time values you supplied; the duration is calculated as the difference between the end time and the start time. The digital representation is given next. The number of samples is calculated as the product of the duration and the sampling frequency. The sampling period is the inverse of the sampling frequency, i.e. 1/44100 s for the given example, and the first sample is located in the middle of the first sampling period.
When you listen to this sound, you will hear beats: the sound increases and decreases
in intensity. The sounds in both channels are equal and we can hear the beats in each ear
separately.
Next create a new stereo sound according to the following script:
Create Sound from formula : "s" , "Stereo" , 0 , 2 , 44100 ,
... "if row = 1 then 1/2 * sin(2*pi*500*x) else 1/2 * sin(2*pi*505*x) endif"
12 A beat is an interference between two sounds of slightly different frequencies; you perceive beats as periodic variations in the intensity of the sound. The frequency of the beat is half the difference between the two frequencies.
Figure 4.9.: The first part of the text in the Info window for a stereo sound.
These formulas make use of the internal representation of a stereo sound in Praat as two
rows of numbers: the first row of numbers is for the first channel, the second row is for the
second channel. The conditional expression in the formula part of the script above directs the
first row (channel 1) to contain a frequency of 500 Hz and the other row (channel 2) to contain
a frequency of 505 Hz.13
Listen to this sound, but don't use your headphones yet. Instead use the stereo speaker(s) of the computer. If everything works out all right, you will hear beats again.
Now use headphones, play the sound several times but listen to it only with the left ear and
then only with the right ear. You will hear tones that differ slightly in frequency. Finally, listen
with both ears and you will hear beats. In contrast to the beats in the previous examples which
were present in the audio signals entering your ears, these beats are constructed only in your
head. This is a nice demonstration that information from both ears is integrated in the brain.
Figure 4.10 illustrates the difference between the two stereo sounds we have created in this section. In the upper part (a) you see the separate channels of the first stereo sound. It contains the same frequencies and beats in both channels. The beats are visible in the amplitude envelope: the envelope starts at an extreme, falls to zero at a time of 0.1 s, rises again to the extreme value at time 0.2 s, falls again to zero at time 0.3 s and finally rises again to the extreme at the end of the sound. This is a relatively slow variation of the amplitude of the underlying higher-frequency sound, and you will hear this amplitude variation as a beat. Another way of describing this sound starts by realizing that if we trace the envelope from its maximum at the start, going through zero at time 0.1 s, then to the minimum at time 0.2 s, and up again through time 0.3 s to the end, this envelope curve traces exactly one period of a cosine. The duration of the sound in the figure is 0.4 s; the beat frequency will therefore be 1/0.4 = 2.5 Hz. In contrast with this, the channels
13 The part if row=1 then tests whether the predefined variable row equals 1. The equal sign = after the if expression is an equality test and not an assignment. For more predefined variables see section C.1.1.
of the last sound, as displayed in part (b), only show two slightly different frequencies in the
two channels. No sign of beats shows up here!
[Figure: two waveform panels (a) and (b); vertical axis amplitude from -1 to 1, horizontal axis Time (s) from 0 to 0.4.]
Figure 4.10.: The stereo channels for the sounds that (a) have beats in the signal and (b) generate beats in your brain.
4.7. Loops
With a conditional expression you can change the execution path in the script only once.
Sometimes you need to repeat an action several times. In this section we will introduce a
number of constructs that enable repetitive series of actions by reusing script lines.
4.7.1. For loops
Suppose you have to generate a lot of tones, all with frequencies that are related to each other.
After generation of a tone we have to print the frequency value in the info window. Suppose
the frequencies are the first five multiples of 100 Hz, i.e. 100, 200, 300, 400 and 500 Hz. To
differentiate between the sounds in the object window we also name them with a text that
shows their frequency. Given our knowledge so far, the first possibility that comes to mind
will probably be:
Create Sound as pure tone : "100" , 1 , 0 , 0.5 , 44100 , 100 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was 100 Hz."
Create Sound as pure tone : "200" , 1 , 0 , 0.5 , 44100 , 200 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was 200 Hz."
Create Sound as pure tone : "300" , 1 , 0 , 0.5 , 44100 , 300 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was 300 Hz."
Create Sound as pure tone : "400" , 1 , 0 , 0.5 , 44100 , 400 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was 400 Hz."
Create Sound as pure tone : "500" , 1 , 0 , 0.5 , 44100 , 500 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was 500 Hz."
Note that the number in the name field is between double quotes since this field needs text and not numbers. This seems like a perfect way to do it. However, we will show that it is possible to improve upon this little script. We will proceed in a number of small steps. First we note that each pair of lines differs systematically from the previous pair at three places, and the difference always involves the frequency number. Let us first remove two of these differences by introducing a new variable f for the frequency:
f = 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
f = 200
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
f = 300
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
f = 400
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
f = 500
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
You might say: this is not an improvement, the number of lines in the script has increased! Yes, it has more lines; however, the complexity of each line has been reduced. We also note that there is a lot of repetition in this script now. The lines that start with Create Sound as pure tone and with appendInfoLine are repeated five times; only the frequency assignment differs each time. In fact, the structure of the script can be viewed as a repetition of three actions: (1) assign a value to f, (2) use f to create a sound and (3) print the value of f. We now rewrite the frequency assignment a little bit and then we are ready for the shorthand notation: the for-loop.
The for-loop repeats a number of statements a fixed number of times. Before explaining the
syntax of the for-loop we make one extension to the script above by making the regularity in
the frequency value f more explicit.
i = 1
f = i * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
i = 2
f = i * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
i = 3
f = i * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
i = 4
f = i * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
i = 5
f = i * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
The script states that each frequency f is a multiple of 100 Hz. However, the script has
expanded now to twenty lines, four lines per tone. Let us first discuss this script in somewhat
more detail and then get rid of it. Because of the five extra lines we now have a regular pattern
of four lines of code that act in a similar way. The script starts with assigning a value 1 to
the variable i; lets call this variable the index variable. In the following line the frequency
variable f is calculated that depends on the value of the index variable i. The sound is created
and its frequency value printed. In the next group of four lines, the only difference from the
previous group is that the index variable i has increased by one. Consequently a new frequency
value f is calculated, a new sound created and another frequency is printed. And so on for the
next groups of four lines.
Now we are ready for the for-loop magic. The twenty-line script above can be simplified to
the following five-liner:
for i from 1 to 5
f = i * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
This final script is a substantial reduction in the number of lines even compared with the first
script in this section. The groups of lines that were repeated in the previous script have been
reduced to one instance only. The script expresses the repetition explicitly with a syntactical
construct called the for-loop. The first and the last line in this script mark the for-loop: a
for-loop starts with for and ends with endfor, both on separate lines. The code that has to
be repeated is in the lines between the for and the endfor. The text, on the first line, that
follows for specifies two things: (1) the name of the index variable and (2) the successive
values that will be assigned to the index variable. What goes on in a for-loop?
INITIALIZATION: The first item after the for expresses that the name of the index
variable will be i;
i will be assigned the value 1 because the from 1 part says so;
CHECK: the interpreter checks if the value of i is smaller than or equal to 5 (this
is the number specified by the to 5 part). If i is larger than 5 the execution will
proceed with the statement just after the endfor; in other words, the execution
jumps out of the loop. If i <= 5, the next statement will be executed;
f will get the value 100 (= 1*100); the sound will be created and the frequency
value printed in the info window;
at the endfor statement, the value of the index will be increased by 1, i.e. i will
be 2 now;
the execution now jumps to the CHECK label and the new value of i will be
checked;
f will get the value 200 (= 2*100); the sound will be created and the frequency
value printed in the info window;
at the endfor statement, the value of the index will be increased by 1, i.e. i will
be 3 now;
the execution now jumps to the CHECK label and the new value of i will be
checked;
f will get the value 300 (= 3*100); the sound will be created and the frequency
value printed in the info window;
i = 4; jump to CHECK; etc.
i = 5; jump to CHECK; etc.
i = 6; jump to CHECK;
jump out of the loop because i is larger than 5 now;
Continue execution after the endfor.
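The INITIALIZATION/CHECK/increment cycle described above can be made explicit by rewriting the for-loop as a while-loop. The following Python sketch (illustrative only, this is not Praat syntax) mimics the mechanics step by step:

```python
# Sketch of the for-loop semantics: INITIALIZATION, CHECK, body, increment.
frequencies = []           # stands in for the created sounds / info lines
i = 1                      # INITIALIZATION: "for i from 1"
while i <= 5:              # CHECK: the "to 5" part
    f = i * 100            # body: compute the frequency
    frequencies.append(f)  # body: stands in for Create Sound / appendInfoLine
    i = i + 1              # at "endfor" the index is increased by 1
# After the loop i equals 6 and execution continues here.
print(frequencies)  # [100, 200, 300, 400, 500]
```

Note that after the loop the index variable holds the first value that failed the CHECK, exactly as described above.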
By now, the semantics of a for-loop should be fairly clear. We note that the name of the index
variable in a for-loop can be freely chosen, as long as it conforms to the variable naming rules.
A mere change in the name of the index variable will not change the functionality of a script,
as the following script shows.
for k from 1 to 5
f = k * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
Note that changing the index variable from i to k has to be done at two places here: (1) in the
line that starts with for and (2) inside the loop where we use the index variable. The index
variable's name doesn't have to be short either, as the following equivalent script shows.
for a_very_long_index_variable_name from 1 to 5
f = a_very_long_index_variable_name * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
A note on extendability: as the script shows, it is now very easy to increase the number
of sounds by simply increasing the number after the to. The following script will generate 10
sounds, all harmonically related.
for ifreq from 1 to 10
f = ifreq * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
And, maybe unnecessary to add, the from and to numbers don't have to be fixed; they can be
variables too.
ifrom = 1
ito = 10
for ifreq from ifrom to ito
f = ifreq * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
By the way, a for-loop doesn't have to start at 1, as in the following example where 8 different
sounds will be generated.
ifrom = 3
ito = 10
for ifreq from ifrom to ito
f = ifreq * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
If the index starts at value 1, you can leave out the from part.
ito = 5
for ifreq to ito
f = ifreq * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
The script above generates five sounds where the loop index ifreq starts at 1. You may guess
by now why the following loop will never be executed.
ifrom = 6
ito = 5
for ifreq from ifrom to ito
f = ifreq * 100
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
Now suppose the five frequencies are not so nicely related to each other; say we want tones at the frequencies 111, 601, 277, 512 and 213 Hz. With only the techniques covered so far we would write:
Create Sound as pure tone : "111" , 1 , 0 , 0.5 , 44100 , 111 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , 111 , " Hz."
Create Sound as pure tone : "601" , 1 , 0 , 0.5 , 44100 , 601 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , 601 , " Hz."
Create Sound as pure tone : "277" , 1 , 0 , 0.5 , 44100 , 277 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , 277 , " Hz."
Create Sound as pure tone : "512" , 1 , 0 , 0.5 , 44100 , 512 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , 512 , " Hz."
Create Sound as pure tone : "213" , 1 , 0 , 0.5 , 44100 , 213 , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , 213 , " Hz."
As before, the extendability of this script is poor but since there is definitely a repetition
going on, it would be nice to have something like the following.
for i from 1 to 5
f = <a number that depends on the index i >
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
This script repeats 5 times the sequence of (1) an assignment, (2) a sound creation and
(3) printing a line. The only thing not specified yet is how to get the right-hand side of the
assignment for f as the loop index variable i varies from 1 to 5. A not very elegant first try
might be:
for i to 5
if i = 1
f = 111
elsif i = 2
f = 601
elsif i = 3
f = 277
elsif i = 4
f = 512
elsif i = 5
f = 213
endif
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
This is not elegant because the code is not shorter and, most importantly, not easily extendable.
For example if we would have 6 frequencies we have to change the script at two places: the
number 5 at line 1 has to become a 6 and we have to include 2 extra lines, one for the test elsif
i=6, and one for the assignment. There is a better way. We can achieve the functionality of
the above conditional expression with only one line of scripting by using an array variable.
Lets first present the complete script and later explain what is going on.
freq [1] = 111
freq [2] = 601
freq [3] = 277
freq [4] = 512
freq [5] = 213
for i to 5
f = freq [i]
Create Sound as pure tone : string$ (f) , 1 , 0 , 0.5 , 44100 , f , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , f , " Hz."
endfor
The first five lines in the script assign values to consecutive elements of the array variable
freq. These lines are inevitable: no regularity can be exploited, so nothing remains but to
assign each of these values individually.14 The interpretation of the assignment in
the loop f = freq[i] is hopefully clear: at the start of the loop i=1 and the first line in the
loop translates to f = freq[1], i.e. f will get the value of the first element in the array freq,
which is 111. The corresponding sound will be generated in the next line. The next time in
the loop, i=2 and f = freq[2] will result in f=601, another sound will be created, and so
on. Although the number of lines in the script has not decreased as compared to the first script
in this section, the total number of characters has. Besides this, the structure of the script has
improved. All frequencies are displayed clearly at consecutive lines and the sound creation
command occurs only once, inside the loop. The extendability has improved too: for example,
if we want to generate 10 sounds instead of 5 we only have to specify freq[6] to freq[10]
and change the number 5 in the for-loop to a 10. Note that it would be o.k. to write the loop
as:
for i to 5
Create Sound as pure tone : string$ (freq [i]) , 1 , 0 , 0.5 , 44100 , freq [i] , 1 , 0.01 , 0.01
appendInfoLine : "The frequency was " , freq [i], " Hz."
endfor
where the variable f is not needed anymore and freq[i] has been substituted everywhere.
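Praat's freq[] array with its 1-based indices behaves much like a list in general-purpose languages. For comparison, an equivalent Python sketch (note the i - 1, because Python lists are 0-based while Praat arrays start at 1):

```python
# The five arbitrary frequencies, as in the Praat array freq[1]..freq[5].
freq = [111, 601, 277, 512, 213]

lines = []
for i in range(1, 6):       # i runs 1..5, like the Praat loop index
    f = freq[i - 1]         # Python lists are 0-based, Praat arrays 1-based
    lines.append("The frequency was %d Hz." % f)
print(lines[0])  # The frequency was 111 Hz.
```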
An array variable is not limited to numeric values. String arrays can also be defined, as the
following example shows:
word$ [1] = "This"
word$ [2] = "is"
word$ [3] = "a"
word$ [4] = "sentence"
appendInfo : word$ [1], " " , word$ [2] , " " , word$ [3 ], " " , word$ [4], "."
This is a sentence.
14 Later on we will learn that these values don't have to be assigned in this particular way but could also be extracted from, for example, a supplied table of frequency values.
Before we go on, we repeat that a sound is represented in Praat as a matrix which means that
sounds are stored as rows of numbers. A mono sound is a matrix with only one row and many
columns. A stereo sound is a sound with two channels, each channel is represented in one row
of the matrix. A stereo sound is therefore a matrix with two rows and both rows have the same
number of columns. Each matrix cell contains one sample value. Whenever we want to use a
formula on a sound we can think about a sound as a matrix. The position of any sample value
in a sound can be indexed with a pair of numbers indicated as [row,col], where row is the
row number (i.e. channel number) and col is the column number (i.e. sample number). Row
numbers and column numbers always start at 1. For example the element indexed by [1,2] is
the second number from the first row. Each sample value in a row is indexed with a column
number and represents the (average) value of the amplitude of the analog sound in a very small
time interval, the sampling period. In general, the total duration of a sound is the sampling
period multiplied by the number of samples in a row. To be able to calculate this duration
for a sound, Praat keeps the necessary extra information, together with the rows with the
sample values, in the Sound object itself. In a script you have access to this extra information
because some predefined variables exist to access this information. The predefined variables
for a sound have already been discussed in section 4.5.2.2. To refer to the interpretation of
the column values of a matrix we have xmin, xmax, x1, nx, and dx. For the rows we
have ymin, ymax, y1, ny, and dy, while the predefined variables row and col refer to the
current row and column. Finally the variable self refers to the current element. Figure 4.11
gives an overview.
Figure 4.11.: Overview of a sound as a matrix: the columns are indexed by col = 1..nx and correspond to times x1, x1+dx, x1+2*dx, ..., x1+(nx-1)*dx between xmin and xmax on the time axis; the rows are indexed by row = 1..ny (the channels), between ymin and ymax.
The first sample in a sound is at time x1. The second sample will be at a time that lies dx
from the first sample, i.e. at x1+dx, the third sample will be another dx further away at x1+2*dx,
et cetera. The last sample of the sound, this is also the last sample in a row, will be at time
x1+(nx-1)*dx. The general equation to calculate the time that corresponds to the sample in
column number col is therefore x1+(col-1)*dx.
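This index-to-time relation can be captured in a small helper function. A Python sketch (the function name is mine, not Praat's; the half-period offset of the first sample is how Praat typically lays out a newly created sound):

```python
def sample_time(x1, dx, col):
    """Time of the sample in 1-based column col: x1 + (col - 1) * dx."""
    return x1 + (col - 1) * dx

fs = 44100.0          # sampling frequency in Hz
dx = 1.0 / fs         # sampling period
x1 = 0.5 * dx         # time of the first sample (typically half a sampling
                      # period after xmin in Praat)
t100 = sample_time(x1, dx, 100)   # time of the 100th sample
duration = dx * 44100             # sampling period times number of samples
print(duration)                   # one second, up to rounding
```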
The big picture now is that, for example for a mono sound, the Formula... sin(2*pi*100*x)
command is expanded by Praat like this:
for col to nx
xt = x1 + (col - 1) * dx
self [1 , col ] = sin (2 * pi * 100 * xt)
endfor
For a stereo sound the same formula is applied to each channel, as if the column loop were nested inside a loop over the rows:
for row to ny
for col to nx
xt = x1 + (col - 1) * dx
self [row , col ] = sin (2 * pi * 100 * xt)
endfor
endfor
Here the inner loop will be executed twice, first for row equal to 1 and then for row equal to 2.
Therefore, for a one-second stereo sound with a 44100 Hz sampling frequency the formula will be
evaluated 88200 = 2*44100 times.
The Formula... command is very powerful and can do a lot more things than we have
shown here. The next sections will show you some more.
If we multiply a sound by a number larger than 1 we always have to be careful that the
amplitudes stay within the range from -1 to +1.
Rectify a sound, i.e. make negative amplitudes positive.
Formula : "if self < 0 then - self else self fi"
Chop off peaks and valleys, i.e. simulate the effect of oversteering (see figure 3.4). The
following script simulates clipping of amplitudes whose absolute value exceeds 0.5.
Formula : "if self < - 0.5 then - 0.5 else self fi"
Formula : "if self > 0.5 then 0.5 else self fi"
Here two formulas are applied in succession. First all amplitudes smaller than -0.5 will
be set to -0.5 by the first formula. Next the already modified sound will be modified
again by the second formula, and all amplitudes larger than 0.5 will be set equal to 0.5.
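The same two-pass clipping can be written out in any language. A Python sketch of the effect of the two formulas on a list of sample values:

```python
def clip(samples, limit=0.5):
    """Simulate the two successive Praat formulas: clip at -limit, then at +limit."""
    # First pass: every amplitude below -limit becomes -limit.
    samples = [max(s, -limit) for s in samples]
    # Second pass: every amplitude above +limit becomes +limit.
    samples = [min(s, limit) for s in samples]
    return samples

print(clip([-0.9, -0.3, 0.0, 0.4, 0.8]))  # [-0.5, -0.3, 0.0, 0.4, 0.5]
```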
Create a white noise sound.
Create Sound from formula : "white_noise" , "Mono" , 0 , 1 , 44100 , "0"
Formula : "randomGauss(0 , 1)"
White noise has equal spectral power per frequency bin on a linear frequency scale.
Create a pink noise sound.
Create Sound from formula : "pink_noise" , "Mono" , 0 , 1 , 44100 , "0"
Formula : "randomGauss(0 , 1)"
To Spectrum : "no"
Formula : "if x > 100 then self*sqrt(100 / x) else 0 fi"
To Sound
Pink noise has equal spectral power per frequency bin on a logarithmic frequency scale.
A formula may also refer to sounds other than the selected one. As an example we create two one-second tones of slightly different frequencies and then average them in two fundamentally different ways (the 500 and 505 Hz tone frequencies are chosen to produce audible beats):
Create Sound from formula : "s1" , 1 , 0 , 1 , 44100 , "0"
Formula : "sin(2*pi*500*x)"
Create Sound from formula : "s2" , 1 , 2 , 3 , 44100 , "0"
Formula : "sin(2*pi*505*x)"
Create Sound from formula : "s3" , 1 , 0 , 3 , 44100 , "0"
Formula : "(Sound_s1 [1 , col ] + Sound_s2 [1 , col ]) / 2"
Create Sound from formula : "s4" , 1 , 0 , 3 , 44100 , "0"
Formula : "(Sound_s1 (x) + Sound_s2 (x)) / 2"
The script first creates the two sound objects s1 and s2. Line 1 creates the empty sound named
s1 which starts at time 0 and lasts for 1 second. Line 2 modifies the s1 sound by changing it
into a tone of 500 Hz. In line 3 sound s2 with a duration of 1 s is created, but now the starting
time is at 2 s. In lines 5 and 7 we create two silent sounds s3 and s4, both start at 0 s and end
at 3 s. These latter two sounds will receive the results of the averaging operations. The two
fundamentally different ways to do the averaging are in lines 6 and 8 and involve using either
[] or () for indexing.
1. Let us first magnify what happens in line 6 where the formula works on the selected
sound s3:
for col to 3*44100
self [1 , col ] = (Sound_s1 [1 , col ] + Sound_s2 [1 , col ] ) / 2
endfor
The sound s3 lasts 3 seconds, therefore the last value for col is 3*44100. The assignment
in the loop to self[1,col] refers to the element at position col in the first row of the
selected sound s3. The value assigned is the sum of two terms divided by 2. Each term
involves new syntax and shows how to refer to data that is not in the current selected
object! The first term, Sound_s1[1,col], refers to the element at position col in the first
row of an object of type Sound with name s1. The second term refers to an element
at the same position but now in an object of type Sound with name s2. This is a very
powerful extension within the formula context, because it gives us possibilities to use
information from other matrix type objects besides the selected one.
Therefore, in a Formula..., the syntax Sound_s1[1,col] and Sound_s2[1,col] refer
to the element at position col in row 1 from a sound named s1 and a sound named s2,
respectively.
The loop in more detail now. The first time, when col=1, the value from column 1 from
sound s1 is added to the value from column 1 from the sound s2, averaged and assigned
to the first column from the selected sound s3. Then, for col=2, the second numbers in
the rows are averaged and assigned to the second position in the row of s3. This can
go on until col reaches 1*44100+1 because then the numbers in the s1 and the s2 sound
are finished, (they were each just one second of duration). Praat then assigns to the
Sounds s1 and s2 zero amplitude outside their domains. Thus indexes that are out of
their domain for a sound, like index 44101 is for s1 and s2, will be valid indexes but
Praat will assign a zero amplitude. In this way, the final second and third seconds of
s3 are filled with zeros, i.e. silence. When you listen to the outcome of the formula,
i.e. sound s3, you will hear frequency beats just like you did in section 4.6.2 (but they
last only one second now).
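The "zero amplitude outside the domain" rule can be made explicit in code. A Python sketch of the column-wise averaging, with plain lists standing in for the sound rows (tiny toy lengths instead of 44100 samples):

```python
def sample_at(sound, col):
    """Sample at 1-based index col, or 0.0 outside the row, mimicking how
    Praat treats out-of-domain indexes inside a formula."""
    return sound[col - 1] if 1 <= col <= len(sound) else 0.0

s1 = [0.25, 0.5, 0.75]   # a tiny 'sound' of 3 samples
s2 = [1.0, 1.0]          # a shorter one of 2 samples
n = 4                    # length of the target sound s3
s3 = [(sample_at(s1, col) + sample_at(s2, col)) / 2 for col in range(1, n + 1)]
print(s3)  # [0.625, 0.75, 0.375, 0.0] -- the tail is silence
```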
2. Now we proceed with the other summation. In line 8 the sounds are also summed, but
now we use parentheses instead of square brackets. An expression between the parentheses is interpreted as a real time. We magnify what happens.
for col to 3*44100
x = x1 + (col - 1)*dx
self [1 , col ]= (Sound_s1(x) + Sound_s2(x)) /2
endfor
In the third line of this script, the two sounds are queried for their values at a certain
time. Now the time domains of the corresponding sounds are used in the calculation.
The domains of s1 and s2 are not the same, the domains dont even overlap. Just like in
the previous case Praat accepts the non-overlapping domains and assumes the sounds to
be zero amplitude outside their domains. The resulting sound s4 is now very different
from sound s3.
The difference between indexing the sounds with [] versus () is very important. When
indexing with [], the sounds are treated as rows of amplitude values: amplitude values at
the same index are blindly added, irrespective of differences in domains or differences in
sampling frequencies.15 When indexing with (), the sounds are treated as functions of time:
the amplitude values of the sounds at the same time are added and averaged.
Only if the sampling frequencies are equal and the sounds start at the same time do the two
methods give the same output.
15 Create sound s2 with a sampling frequency of 22050 Hz instead of 44100 Hz and investigate the difference between the behavior of [] and ().
The Sound_s1[1,col] syntax can also be used outside a formula, in ordinary script lines, as the following two-liner shows (the object name a is arbitrary):
Create Sound from formula : "a" , 1 , 0 , 1 , 44100 , "randomGauss(0, 1)"
mean = (Sound_a [1 , 1] + Sound_a [1 , 2]) / 2
The first line is only cosmetic because we need a sound object to refer to by name. In the
second line we use the values of the first two cells in the first row of the sound matrix: we
add them and divide by two. Note that we cannot make an assignment this way; the construct
Sound_a[i,j] may only occur on the right-hand side of an equal sign. For assignments we need
to use the Modify > Set value at sample number... command. Probably you will never have
to change individual samples of a sound in this way. We will learn better ways to refer to
objects later on in advanced scripting (you may already have wondered: what happens if two
objects bear the same name?).
4.7.2. Repeat until loops
The following script simulates how many times we have to throw a pair of dice before we
reach a sum of twelve eyes.
throws = 0
repeat
eyes = randomInteger(1 , 6) + randomInteger(1 , 6)
throws = throws + 1
until eyes = 12
writeInfoLine : "It took " , throws , " trials to reach " , eyes , " with two dice."
The randomInteger function generates a new random integer value in the range from 1 to
6 each time it is called and can therefore simulate the throwing of a die. A repeat loop is used
when the number of times a loop should be executed is difficult to calculate or is not certain
beforehand. This means that the stop condition has to be evaluated during the execution of the
loop.
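Python has no repeat-until construct; a `while True` loop with a `break` on the until-condition plays the same role, with randomInteger(1, 6) becoming random.randint(1, 6). A sketch of the same simulation:

```python
import random

throws = 0
while True:                                          # 'repeat'
    eyes = random.randint(1, 6) + random.randint(1, 6)
    throws += 1
    if eyes == 12:                                   # 'until eyes = 12'
        break
print("It took", throws, "trials to reach", eyes, "with two dice.")
```

As in the Praat version, the loop body always runs at least once, and the stop condition is evaluated only after each throw.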
4.7.3. While loops
Given a number m, find n, the nearest power of two such that m <= n. For example, if m = 7
then n = 8, since the nearest power of 2 is 8 (= 2^3). If m = 9 then n = 16 = 2^4. This little
algorithm is used to find the size of the buffer we need to perform a fast Fourier transform on
m sample values (we use the shorthand notation n *= 2 for n = n * 2).
n = 1
while n < m
n *= 2
endwhile
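The same while loop reads almost identically in Python; here it is wrapped in a function (the function name is mine) so it can be tried on a few values:

```python
def next_power_of_two(m):
    """Smallest n = 2**k with m <= n, e.g. the FFT buffer size for m samples."""
    n = 1
    while n < m:
        n *= 2
    return n

print(next_power_of_two(7), next_power_of_two(9))  # 8 16
```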
Just like the repeat loop, this construction is used if we don't know beforehand how often the
loop has to be executed. The difference between a while and a repeat loop is that in the while
loop the exit condition is tested before the execution of the statements in the loop: a repeat
loop always executes the statements in the loop at least once, while a while loop may execute
them zero times.
4.8. Functions
For scripting you can use specialized units of code that process an input and then return
a value. In Praat you have a number of these so-called built-in functions at your disposal.
Functions can be categorized into two groups: functions that operate on numbers and functions
that operate on a text or string value. The built-in functions that operate on numbers are called
mathematical functions, while the others are called string functions. Functions may return a
number or a string. Those that return a string value have a name that ends with a $-sign.
In some of the functions the input domain is limited, for example if we ask for the square root
of a number the number must be positive otherwise the result is undefined. If we happen to
call the square root function with a negative argument the outcome will be assigned a special
undefined value which will print as -- undefined --.
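Praat signals a bad argument with the special value undefined; Python's math.sqrt raises an exception instead. A sketch of comparable behaviour (the function name is mine) maps the error to a NaN, the closest Python analogue of undefined:

```python
import math

def praat_like_sqrt(x):
    # Praat yields 'undefined' for the square root of a negative number;
    # here we return NaN ('not a number') in that case.
    return math.sqrt(x) if x >= 0 else float("nan")

print(praat_like_sqrt(4.0))   # 2.0
print(praat_like_sqrt(-2.0))  # nan -- Praat would print --undefined--
```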
4.8.1. Mathematical functions in Praat
round (x) returns the nearest integer; half-way cases are rounded up.
a = round (1.5)
b = round ( - 1.5)
c = round(1.1)
writeInfoLine : "*round : " , a , " " , b , " " , c , "*"
*round : 2 -1 1*
floor (x) returns the highest integer value not greater than x.
a = floor (1.5)
b = floor ( - 1.5)
c = floor(1.1)
writeInfoLine : "*floor : " , a , " " , b , " " , c , "*"
*floor : 1 -2 1*
ceiling (x) returns the lowest integer value not less than x.
a = ceiling (1.5)
b = ceiling ( - 1.5)
c = ceiling(1.1)
writeInfoLine : "*ceiling : " , a , " " , b , " " , c , "*"
*ceiling : 2 -1 2*
sqrt (x) returns the square root of x for x >= 0; for a negative argument the result is undefined.
a = sqrt (4)
b = sqrt (1 / 2)
c = sqrt ( - 2)
writeInfoLine : "*sqrt : " , a , " " , b , " " , c , "*"
*sqrt : 2 0.7071067811865476 --undefined--*
min (x, ...) returns the minimum in a series of numbers.
a = min(1 , -2 , 4 , 6 , 7.6)
writeInfoLine : "*min : " , a , "*"
*min : -2*
max (x, ...) returns the maximum in a series of numbers.
a = max(1 , -2 , 4 , 6 , 7.6)
writeInfoLine : "*max : " , a , "*"
*max : 7.6*
imin (x, ...) returns the location of the minimum in a series of numbers.
a = imin(1 , -2 , 4 , 6 , 7.6)
writeInfoLine : "*imin : " , a , "*"
*imin : 2*
imax (x, ...) returns the location of the maximum in a series of numbers.
a = imax(1 , -2 , 4 , 6 , 7.6)
writeInfoLine : "*imax : " , a , "*"
*imax : 5*
tan (x) returns the tangent of x.
a = tan(0)
b = tan(pi / 4)
writeInfoLine : "*tan : " , a , " " , b , "*"
arcsin (x) returns the inverse of the sine; arcsin(x)=y means x=sin(y) or more directly
arcsin(sin(y))=y.
The last form makes clear that the input x must be in the domain [-1, 1]. The output
will always be in the range [-pi/2, pi/2].
a = arcsin(1)
b = arcsin( - 1)
writeInfoLine : "*arcsin : " , a , " " , b , "*"
*arcsin : 1.5707963267948966 -1.5707963267948966*
arccos (x) returns the inverse of the cosine; arccos(x)=y means x=cos(y). The input must be in the domain [-1, 1] and the output will be in the range [0, pi].
a = arccos(1)
b = arccos( - 1)
writeInfoLine : "*arccos : " , a , " " , b , "*"
*arccos : 0 3.141592653589793*
arctan2 (y, x) returns the angle (in radians) between the positive x-axis and the line to the
point given by the coordinates (x, y). The angle is positive for counter-clockwise angles (upper
half-plane, y > 0), and negative for clockwise angles (lower half-plane, y < 0). The
output will be in the range [-pi, pi].
a = arctan2(1 , 1)
b = arctan2(1 ,- 1)
c = arctan2( -1 ,- 1)
d = arctan2( -1 , 1)
writeInfoLine : "*arctan2 : " , a , " " , b , " " , c , " " , d , "*"
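The four quadrants can be checked against Python's math.atan2, which follows the same (y, x) argument convention:

```python
import math

a = math.atan2(1, 1)     # upper right quadrant:   pi/4
b = math.atan2(1, -1)    # upper left quadrant:  3*pi/4
c = math.atan2(-1, -1)   # lower left quadrant: -3*pi/4
d = math.atan2(-1, 1)    # lower right quadrant:  -pi/4
print(a, b, c, d)
```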
sinc (x) returns the sinus cardinalis sin(pi*x)/(pi*x).
a = sinc(0)
b = sinc(1 / 2)
writeInfoLine : "*sinc : " , a , " " , b , "*"
*sinc : 1 0.6366197723675814*
exp (x) returns the exponential e^x.
a = exp(0)
b = exp( - 1)
c = exp(1)
writeInfoLine : "*exp : " , a , " " , b , " " , c , "*"
*exp : 1 0.36787944117144233 2.718281828459045*
ln (x) returns the natural logarithm of x, for x > 0.
a = ln(exp(1))
b = ln(2)
c = ln(10)
writeInfoLine : "*ln : " , a , " " , b , " " , c , "*"
*ln : 1 0.6931471805599453 2.302585092994046*
log10 (x) returns the base-10 logarithm of x, for x > 0.
a = log10(exp(1))
b = log10(2)
writeInfoLine : "*log10 : " , a , " " , b , "*"
*log10 : 0.4342944819032518 0.3010299956639812*
log2 (x) returns the base-2 logarithm of x, for x > 0.
a = log2(exp(1))
b = log2(2)
c = log2(10)
writeInfoLine : "*log2 : " , a , " " , b , " " , c , "*"
*log2 : 1.4426950408889634 1 3.3219280948873626*
sinh (x) returns the hyperbolic sine of x, (e^x - e^(-x))/2.
a = sinh(0)
b = sinh(1)
writeInfoLine : "*sinh : " , a , " " , b , "*"
*sinh : 0 1.1752011936438014*
cosh (x) returns the hyperbolic cosine of x, (e^x + e^(-x))/2.
a = cosh(0)
b = cosh(1)
writeInfoLine : "*cosh : " , a , " " , b , "*"
*cosh : 1 1.5430806348152437*
tanh (x) returns the hyperbolic tangent of x, sinh(x)/cosh(x).
a = tanh(0)
b = tanh(1)
writeInfoLine : "*tanh : " , a , " " , b , "*"
*tanh : 0 0.7615941559557649*
arcsinh (x) returns the inverse hyperbolic sine ln(x + sqrt(1 + x^2)).
a = arcsinh(0)
b = arcsinh(1)
c = arcsinh( - 10)
writeInfoLine : "*arcsinh : " , a , " " , b , " " , c , "*"
*arcsinh : 0 0.8813735870195429 -2.998222950297976*
arccosh (x) returns the inverse hyperbolic cosine ln(x + sqrt(x^2 - 1)), where the input range is
limited to x >= 1.
a = arccosh(1)
b = arccosh(10)
writeInfoLine : "*arccosh : " , a , " " , b , "*"
*arccosh : 0 2.993222846126381*
arctanh (x) returns the inverse hyperbolic tangent (1/2) ln((1 + x)/(1 - x)), where the input range is limited
to -1 < x < 1.
a = arctanh(0)
b = arctanh(1 / 2)
writeInfoLine : "*arctanh : " , a , " " , b , "*"
*arctanh : 0 0.5493061443340549*
sigmoid (x) returns the value of 1/(1+exp(-x)), which will be a number between 0 and 1.
a = sigmoid(0)
b = sigmoid(1)
c = sigmoid( - 1)
writeInfoLine : "*sigmoid : " , a , " " , b , " " , c , "*"
The info window will show: *sigmoid : 0.5 0.7310585786300049 0.2689414213699951*
erf (x) returns the error function, (2/sqrt(pi)) times the integral of e^(-t^2) dt from 0 to x.
*erf : 0.8427007929497149 0.9953222650189527*
erfc (x) returns the complementary error function 1 - erf(x), i.e. (2/sqrt(pi)) times the integral of
e^(-t^2) dt from x to infinity.
a = erfc(1)
b = erfc(2)
c = erfc(3)
writeInfoLine : "*erfc : " , a , " " , b , " " , c , "*"
*erfc : 0.1572992070502851 0.0046777349810472645 2.209049699858544e-05*
hertzToBark (x) transforms acoustic frequency to the psychoacoustical Bark scale as 7 ln(x/650 + sqrt(1 + (x/650)^2)), where x is frequency in hertz. The solid line in figure 4.12 displays this function on the frequency interval from 0 to 10 kHz.
barkToHertz (x) returns the inverse of the previous function, 650 sinh(x/7).
hertzToMel (x) transforms acoustic frequency to perceptual pitch as 550 ln(1 + x/550), where
x is frequency in hertz. The dotted line in figure 4.12 displays this function on the
frequency interval from 0 to 10 kHz.
melToHertz (x) returns the inverse of the previous function: transforms mel to acoustic frequency as 550 (e^(x/550) - 1), where x is frequency in mels.
hertzToSemitones (x) from acoustic frequency to a logarithmic musical scale, relative to
100 Hz: 12 ln(x/100)/ ln 2.
Examples: hertzToSemitones(100)=0, hertzToSemitones(200)=12, hertzToSemitones(400)=24.
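These frequency-scale formulas are easy to check numerically. A Python sketch of hertzToMel, melToHertz and hertzToSemitones exactly as given above (function names are mine):

```python
import math

def hertz_to_mel(x):
    return 550.0 * math.log(1.0 + x / 550.0)

def mel_to_hertz(x):
    return 550.0 * (math.exp(x / 550.0) - 1.0)

def hertz_to_semitones(x):
    # logarithmic musical scale relative to 100 Hz
    return 12.0 * math.log(x / 100.0) / math.log(2.0)

# melToHertz undoes hertzToMel, and each doubling in frequency adds 12 semitones.
print(round(mel_to_hertz(hertz_to_mel(1000.0)), 6))  # 1000.0
print(hertz_to_semitones(200.0))                     # 12.0
```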
Figure 4.12.: The solid line displays the function hertzToBark while the dotted line displays the
function hertzToMel.
erb (x) returns the perceptual equivalent rectangular bandwidth (ERB) in hertz, for a specified acoustic frequency x in hertz, as 6.23*10^(-6) x^2 + 0.09339 x + 28.52.
hertzToErb (x) transforms acoustic frequency to ERB-rate as 11.17 ln((x + 312)/(x + 14680)) + 43.
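A quick numerical check of both ERB formulas; the values below follow directly from the expressions above (function names are mine):

```python
import math

def erb(x):
    # equivalent rectangular bandwidth in hertz at frequency x (hertz)
    return 6.23e-6 * x * x + 0.09339 * x + 28.52

def hertz_to_erb_rate(x):
    return 11.17 * math.log((x + 312.0) / (x + 14680.0)) + 43.0

# Around 1 kHz the auditory filter is roughly 128 Hz wide:
print(round(erb(1000.0), 2))  # 128.14
```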
4.8.2. String functions in Praat
In the previous section a large number of Praat's mathematical functions were enumerated. In
this section we concentrate on Praat's string functions, which accept strings as arguments.
We start counting characters in a string at 1, i.e. the first character has index 1.
left$ (a$, n) returns a string with the first n characters of the string a$.
b$ = left$ ("This is a string with spaces !" , 4)
c$ = left$ (b$ , 10)
writeInfoLine : "*left$ : " , b$ , "*" , c$ , "*"
*left$ : This*This*
right$ (a$, n) returns a string with the rightmost n characters of the string a$.
b$ = right$ ("This is a string with spaces !" , 7)
c$ = left$ (b$ , 10)
d$ = right$ (b$ , 10)
writeInfoLine : "*right$ : " , b$ , "*" , c$ , "*" , d$ , "*"
*right$ : spaces!*spaces!*spaces!*
mid$ (a$, start, number) returns the substring starting at position start counting number
characters.
b$ = mid$ ("This is a string with spaces !" , 6 , 2)
c$ = mid$ ("This is a string with spaces !" , 1 , 4)
d$ = mid$ (c$ , 10 , 2)
writeInfoLine : "*mid$ : " , b$ , "*" , c$ , "*" , d$ , "*"
*mid$ : is*This**
index (a$, b$) returns the index of the first occurrence of string b$ in a$.
b = index ("This is a string with spaces !" , "is")
c = index ("This is a string with spaces !" , "piet")
writeInfoLine : "*index : " , b , " " , c , "*"
*index : 3 0*
rindex (a$, b$) returns the index of the last occurrence of string b$ in a$.
b = rindex ("This is a string with spaces !" , "is")
c = rindex ("This is a string with spaces !" , "s")
writeInfoLine : "*rindex : " , b , " " , c , "*"
*rindex : 6 28*
replace$ (a$, fs$, rs$, n) returns a modified version of the string a$ in which at most n occurrences of the string fs$ are replaced with the string rs$. If n equals zero, all occurrences
are replaced.
b$ = replace$ ("This is a string with spaces !" , " " , "" , 0)
c$ = replace$ (b$ , "with" , "without" , 1)
writeInfoLine : "*replace$ : " , c$ , "*"
*replace$ : Thisisastringwithoutspaces!*
index_regex (a$, re$) returns the index where the regular expression re$ first matches the
string a$. Regular expressions offer the possibility to search for patterns instead of
literal occurrences as in the previous functions. The Help > Regular expressions page in
Praat shows the syntax and many possible uses of regular expressions. The
first line in the following script searches for uppercase characters while the second line
searches for white space.
b = index_regex ("This is a string with spaces !" , "[A-Z]")
c = index_regex ("This is a string with spaces !" , "\s")
writeInfoLine : "*index_regex : " , b , " " , c , "*"
*index_regex : 1 5*
rindex_regex (a$, re$) returns the index of the last match of the regular expression re$ within
the string a$.
replace_regex$ (a$, fs$, rs$, n) returns a modified version of the string a$ in which at most n
matches of the regular expression fs$ are replaced by the pattern rs$. If n equals zero,
all occurrences are replaced. In the following example the first line doubles each character
of the a$ string (the & in the replacement pattern stands for the matched text), while the
second line replaces double characters in the b$ string by single ones.
b$ = replace_regex$ ("hello" , "." , "&&" , 0)
c$ = replace_regex$ (b$ , "(.)\1" , "\1" , 0)
writeInfoLine : "*replace_regex$ : " , b$ , "*" , c$ , "*"
*replace_regex$ : hheelllloo*hello*
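Python's re module offers the same operations. A sketch of the two replace_regex$ calls above; in Python the whole-match reference & is written as a group backreference:

```python
import re

b = re.sub(r"(.)", r"\1\1", "hello")   # double every character
c = re.sub(r"(.)\1", r"\1", b)         # collapse doubled characters again
print(b, c)  # hheelllloo hello
```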
fixed$ (number, precision) returns a string with number formatted with precision digits after the decimal point.
b$ = fixed$ (5.1236 , 3)
c$ = fixed$ (0.0001234 , 2)
writeInfoLine : "*fixed$ : " , b$ , "*" , c$ , "*"
*fixed$ : 5.124*0.0001*
percent$ (number, precision) returns a string with number formatted as a percentage (i.e. multiplied by 100) with precision digits after the decimal point.
b$ = percent$ (0.123456 , 3)
c$ = percent$ (0.0001 , 2)
writeInfoLine : "*percent$ : " , b$ , "*" , c$ , "*"
*percent$ : 12.346%*0.01%*
extractNumber (sentence$, precursor$) returns the number in sentence$ after the precursor$
string.
b = extractNumber ("Lucky number 7" , "number")
c = extractNumber ("Lucky number" , "number")
writeInfoLine : "*extractNumber : " , b , " " , c , "*"
*extractNumber : 7 --undefined--*
extractWord$ (sentence$, precursor$) returns the word following the precursor$ string in
sentence$. If no word can be found, an empty string is returned.
b$ = extractWord$ ("Lucky number 7" , "Lucky")
c$ = extractWord$ ("Lucky number" , "number")
writeInfoLine : "*extractWord$ : " , b$ , "*" , c$ , "*"
*extractWord$ : number**
Continuation lines start with three dots (...). Use continuation lines to split a long line
into several shorter lines.
Split off sections of code into procedures.
16 For masters of the C programming language there is a yearly contest to accomplish exactly the opposite: programmers try to write the most obscure code possible. For some really magnificent examples, see the website of The International Obfuscated C Code Contest at http://www.ioccc.org.
17 There are other computer languages, like Python, in which white space is part of the syntax of the language.
4. Praat scripting
two are syntax errors while the third one is a semantic error. We anticipate a little on things not yet explained; they will become clearer later on.
Praat commands in scripts have to be spelled exactly right. The texts on menu options and buttons are Praat commands and, as you can easily check, they always start with an uppercase character. For example, if you want to play a sound from a script and type play instead of Play, you will receive the message Command play not available for the current selection and the script will stop running.
Sometimes, during copy-paste or other editing actions, extra white space may accidentally creep in, mostly in the form of blank spaces. For example, if you write do ("Play "), with a blank space after the y, instead of the correct Play, you receive a message like Command Play not available for current selection. Another frequently occurring error is a blank space before the final colon of a command.
If you run the following script an error message is displayed that starts with Command
"Create Sound from formula :" not available for current selection.
Create Sound from formula : "s" , 1 , 0 , 1 , 44100 , "sin(2*pi*300*x)"
This error is easily corrected by removing the blank space before the colon.
A command may exist but is not valid for the selected object. For example, in your
script you wanted to Play a sound but accidentally selected a Spectrum object. In the
following script, Praat will show the error message Command Play not available for
current selection, Script line 3 not performed or completed: Play .
Create Sound from formula : "s" , 1 , 0 , 1 , 44100 , "sin(2*pi*300*x)"
To Spectrum : "no"
Play
One way to get this script right is to move the Play command one line up to the second
line.
Be careful when copying scripts via programs like Word or Acrobat Reader, because characters that seem visually correct in the script editor may use a different underlying representation that confuses the script interpreter.
5. Pitch analysis
In Praat a Pitch object represents periodicity candidates as a function of time. The periodicity
may refer to acoustics, perception or vocal fold vibrations. The standard pitch algorithm in
Praat tries to detect and to measure this periodicity and the algorithm to do so is described in
Boersma [1993]. In this chapter we will elaborate on this algorithm and give somewhat more
background information on the steps involved.
The concept of pitch, however, is not as simple as we stated above, because pitch is a subjective psycho-physical property of a sound. The ANSI definition of pitch is as follows:
Pitch is that auditory attribute of sound according to which sounds can be ordered
on a scale from low to high.
Pitch is a sensation and the fact that pitch is formed in our brain already hints that it will not
always be simple to calculate it. The definition implies that essentially the calculation of pitch
has to boil down to one number; numbers can be ordered from low to high. Actually, the only
simple case in measuring pitch is for the pitch associated with a pure tone: a pure tone always
evokes the same pitch sensation within a normal-hearing listener. This was experimentally
verified by letting subjects adjust the frequency of a tone to make its pitch sound equal to
the pitch of a test tone. After many repetitions of the experiment and after averaging over many listeners, the distribution of all the subjects' measured pitches shows only one peak, centered at the test frequency. For more complex sounds, distributions with more than one peak may occur. Various theories about pitch and pitch perception exist; a nice introduction is supplied, for example, by Terhardt's website.
The topic of this chapter, however, is not pitch perception but pitch measurement. A large
number of pitch measurement algorithms exist and new ones are still being developed every
year. We will describe the pitch detector implemented in Praat because it is one of the best
available.
scheme discussed in section 2.4 as shown in figure 2.9, by analyzing overlapping windowed segments from the sound. The analysis of each segment results in an analysis frame with a number of pitch candidates together with an indication of each candidate's strength. The pitch candidates are determined with a signal analysis technique called autocorrelation; this part of the algorithm is explained in section 5.1.1.
2. Find the best pitch candidate for each analysis frame. At each time step we have several candidates; finding the best candidate at each time step is equivalent, as we will see later, to finding a best path through the candidate space. Finding the best path is performed with a technique called Viterbi and will be explained in section 5.1.4.
Before we can explore the details of the pitch algorithm we first need to be familiar with a new signal analysis concept: the autocorrelation function. If you are not familiar with autocorrelation, check out section A.12, where we introduce functions that correlate one or more sounds.
5.1.1. Finding pitch candidates by autocorrelation
Figure 5.1.: Pitch determination using the autocorrelation of a windowed speech segment. The panels show the segment s(t), the window function w(t), the windowed segment a(t) = s(t)w(t), the autocorrelation ra(τ) of the windowed segment, the autocorrelation rw(τ) of the window, and the normalized autocorrelation r(τ) = ra(τ)/rw(τ), whose maximum lies near a lag time of 7.14 ms.
In this section we will explain how the autocorrelation is used for finding pitch candidates in each analysis frame. As we already know, pitch analysis conforms to the general analysis scheme laid out in section 5.1, and therefore the same analysis is applied to each consecutive analysis frame of the sound. If we explain how one (analysis) frame is analysed, we know how the complete sound will be analysed.
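Outside Praat, the core of this step can be sketched in a few lines of Python. The sketch below is a simplification under stated assumptions: it uses the biased autocorrelation normalized only by the lag-zero value (the window correction ra(τ)/rw(τ) described above is omitted), and the frame, sampling frequency and pitch floor are made-up example values.

```python
import math

def autocorr_pitch_candidates(frame, fs, n_candidates=3, pitch_floor=75.0):
    """Return (lag_time, strength) pairs for the largest local maxima of the
    normalized autocorrelation r(m) = r(m)/r(0) of one analysis frame."""
    n = len(frame)
    r0 = sum(v * v for v in frame)
    max_lag = min(n - 1, int(fs / pitch_floor))   # no candidates below the pitch floor
    r = [sum(frame[i] * frame[i + m] for i in range(n - m)) / r0
         for m in range(max_lag + 1)]
    # local maxima, skipping the trivial maximum at lag 0
    peaks = [(m, r[m]) for m in range(2, max_lag) if r[m - 1] < r[m] >= r[m + 1]]
    peaks.sort(key=lambda p: p[1], reverse=True)
    return [(m / fs, s) for m, s in peaks[:n_candidates]]

fs = 10000
frame = [math.sin(2 * math.pi * 200.0 * i / fs) for i in range(400)]  # 40 ms of a 200 Hz tone
candidates = autocorr_pitch_candidates(frame, fs)
print(1.0 / candidates[0][0])  # strongest candidate: 200.0 Hz
```

Note how the undertone at twice the lag also appears as a (weaker) candidate, exactly as the octave cost discussion below anticipates.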
By definition the best candidate for the pitch can be found at the maximum peak of the
autocorrelation. However, windowing and sampling may cause accuracy problems during de-
distances of 0.0001 s apart. This will also be the distance between the lag values in the autocorrelation of this sound. The pitch value that corresponds to a lag at position m = 32 will be 1/(32 · 0.0001) = 312.5 Hz. For positions m = 33 and m = 34 the corresponding pitches will be 303.0 and 294.1 Hz, respectively. These three pitches differ by approximately 9 Hz from each other. Therefore, near 300 Hz the only possible pitches are limited to the values 312.5, 303.0 and 294.1 Hz, and this is not precise enough. In general we would like a better estimate of the pitch, one that is not limited by the sampling. This is possible by using interpolation. Interpolation is the estimation of values in between sample points, based on some assumption about how the amplitude varies. To get a sufficiently accurate interpolation for normal pitch ranges, parabolic interpolation would suffice. However, with parabolic interpolation the autocorrelation amplitudes could still turn out to be very wrong. As the normalized autocorrelation amplitude corresponds to the pitch strength of a candidate, parabolic interpolation cannot be used. Because the autocorrelation is a sampled signal, and the correct interpolation for a sampled signal is a sinc interpolation, the pitch algorithm uses sinc interpolation to find both the correct position and the correct amplitude of the local maxima of the autocorrelation. The precision achieved with sinc interpolation is phenomenal. With an analysis width
of 40 ms, the frequency of a sine tone of 3777 Hz, sampled at 10 kHz sampling frequency, can
be determined as accurately as 3777.0001 Hz. If you take a look at the sampled representation of this tone, this will impress you even more.
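To make the interpolation idea concrete, here is a small Python sketch of parabolic peak refinement: fitting a parabola through a sampled maximum and its two neighbours gives a sub-sample estimate of the peak position. As explained above, Praat itself uses sinc interpolation to get the amplitude right as well; this sketch shows only the simpler parabolic variant.

```python
def parabolic_peak(ym1, y0, yp1):
    """Fit a parabola through three samples around a local maximum.
    Returns (offset, height): the peak position in samples relative to the
    centre sample, and the interpolated peak value."""
    denom = ym1 - 2.0 * y0 + yp1
    offset = 0.5 * (ym1 - yp1) / denom
    height = y0 - 0.25 * (ym1 - yp1) * offset
    return offset, height

# sample the known parabola y = 1 - (t - 0.3)^2 at t = -1, 0, 1
samples = [1.0 - (t - 0.3) ** 2 for t in (-1, 0, 1)]
offset, height = parabolic_peak(*samples)
print(round(offset, 6), round(height, 6))  # → 0.3 1.0
```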
In this section we have explained how the pitch candidates in each analysis frame are calculated. In the following section we will explain what parameters are involved in the pitch algorithm.
5.1.2. Parameters of the pitch algorithm
Figure 5.2 shows the form that appears if you select the Sound: To Pitch (ac)... button.
With the parameters in the form you can fine tune the two steps involved in calculating pitch:
finding the candidates in each frame and finding the best global pitch value in each frame. The
parameters for finding the candidates are:
Time step (standard value: 0.0 s) The time between different measurements (see also section 2.4). If you supply the standard value of 0.0 Praat will choose an appropriate time
step which will be 0.75 / (pitch floor). A pitch floor of 75 Hz then results in a time step
of 0.01 s and for each second of the sound 100 pitch values will be calculated.
Pitch floor (standard value: 75 Hz) The lowest candidate frequency to consider. The pitch floor parameter directly determines the effective length of the analysis window. For periodicity detection we need a minimum of three periods in an analysis window. The lower the pitch, the longer a period will be and the longer the analysis window needs to be. Three periods of a 75 Hz periodic signal last 3/75 = 0.04 s. To summarize,
Figure 5.2.: The Sound: To Pitch (ac)... form with parameter defaults.
if the algorithm needs to detect pitches as low as 75 Hz, we need an analysis window of at least 40 ms duration. If you want to go lower, for example to measure creaky voice, you could lower the floor to, say, 50 or 60 Hz; for a 60 Hz parameter value, the length of the analysis window will be 3/60 = 50 ms. For female voices you could probably increase this value to, say, 100 Hz.
If the time step parameter is 0.0, the pitch floor also determines the time step. At 75 Hz
an analysis window of 40 ms and a time step of 10 ms amount to four times oversampling.
Max. number of candidates (standard value: 15) determines the number of local maxima
in the autocorrelation that have to be remembered.
Very accurate determines the window function. If off, a Hanning window is used with the
same duration as the analysis window. If on, a Gaussian window is selected of twice the
effective window length duration. Although the analysis window length is doubled, the
effective width of the Gaussian window is half this width.
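The relations quoted above between pitch floor, window length and time step can be captured in a tiny Python helper. This is a sketch of the stated rules only (three periods per window, a default time step of 0.75/floor), not of Praat's internal code.

```python
def pitch_analysis_layout(pitch_floor, time_step=0.0):
    """Analysis window length and time step implied by the pitch floor:
    three periods per window; a time step of 0.0 means 0.75/pitch_floor."""
    window = 3.0 / pitch_floor
    if time_step == 0.0:
        time_step = 0.75 / pitch_floor
    return window, time_step, window / time_step

window, step, oversampling = pitch_analysis_layout(75.0)
print(window, step, oversampling)  # → 0.04 0.01 4.0
```

The last value confirms the four-times oversampling mentioned in the text.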
After the candidates have been calculated, a post-processing algorithm seeks the best candidates for the global pitch assignment. The post-processing tries to find the cheapest path to
connect the best pitch value in a frame with the best value in the next frame. The following
parameters determine the cheapest path.
Silence threshold (standard value: 0.03). Sound frames in which the largest amplitudes do
not exceed this value, relative to the global maximum peak, are considered as silent and
therefore voiceless.
Voicing threshold (standard value: 0.45) determines whether a frame is considered voiceless or not. If the strengths of all candidates in a pitch frame do not exceed this value,
the frame is marked as voiceless. If you increase the voicing threshold more frames will
be marked voiceless.
Octave cost (standard value: 0.01 per octave) determines how much high frequencies are favored above low frequencies. This parameter is necessary to force a decision for perfectly periodic signals. A sine of frequency F will show maxima in the autocorrelation at lag times 1/F, 2/F, 3/F, ..., which correspond to the pitches F, F/2, F/3, .... Besides the correct pitch F, all the undertones are candidates too, because all these maxima have the same autocorrelation amplitude. The octave cost parameter gives the highest candidate an advantage over the others.
Octave-jump cost (standard value: 0.35) determines the degree of disfavoring pitch changes. This parameter affects the choice going from one frame to the next. By giving pitch changes a penalty, large frequency jumps are suppressed.
Voiced / unvoiced cost (standard value: 0.14) determines the degree of disfavoring voiced/unvoiced transitions. Increasing this value decreases the number of voiced/unvoiced transitions. This parameter is necessary to suppress an accidental strong voiceless candidate within an otherwise voiced part or, the opposite, to suppress an accidental strong voiced candidate within an otherwise voiceless part of the speech sound.
Pitch ceiling (standard value: 600 Hz). Candidates above this frequency will be ignored. For male voices you could lower the ceiling to, say, 300 Hz.
5.1.3. How are the pitch strengths calculated?
In this section we explain in somewhat more detail how some of the parameters above are used in the determination of the candidate strengths. The candidate that is always present is the voiceless one, whose strength is determined as

R = voicingThreshold + max (0, 2 - (localPeak/globalPeak) / (silenceThreshold/(1 + voicingThreshold))).
The strength of the voiceless candidate depends on the voicing threshold, the silence threshold and the quotient of a frame's local peak and the global peak. Mind you, the peaks we are talking about now refer to the peaks in the oscillogram (and not to the peaks in the autocorrelation). By choosing the silence threshold as zero, the local peak will no longer influence the pitch strength. For this special case the strength of the voiceless candidate will be R = voicingThreshold, and R is now only determined by the voicing threshold.
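A Python sketch of this rule, with the standard parameter values from the form as defaults, makes the special cases easy to check. The function name and the guard for a zero silence threshold are my own additions for illustration.

```python
def unvoiced_strength(local_peak, global_peak,
                      voicing_threshold=0.45, silence_threshold=0.03):
    """Strength R of the ever-present voiceless candidate."""
    if silence_threshold <= 0.0:
        return voicing_threshold  # the local peak plays no role anymore
    ratio = (local_peak / global_peak) / (silence_threshold / (1.0 + voicing_threshold))
    return voicing_threshold + max(0.0, 2.0 - ratio)

print(unvoiced_strength(0.0, 1.0))   # a silent frame: R = 0.45 + 2 = 2.45
print(unvoiced_strength(1.0, 1.0))   # a loud frame: R = 0.45
print(unvoiced_strength(0.5, 1.0, silence_threshold=0.0))  # → 0.45
```

A silent frame thus gets a very strong voiceless candidate, while in a loud frame the voiceless candidate falls back to the voicing threshold.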
After the autocorrelation step we have pitch frames that store one unvoiced pitch candidate
and one or more voiced candidates and their strengths. For each pitch frame we now have to
decide what the best candidate is. Simply picking the one with the largest pitch strength in
each frame will not always lead to a correctly assigned global pitch because no considerations
about continuity are involved since in this case all pitch decisions are local. This may lead
to a very discontinuous global pitch. It is also not yet clear what to do if two candidate
strengths turn out to be equal. To solve these problems, and some more, the pitch algorithm associates a cost function with each pitch candidate. Assigning cost functions poses the problem in the domain of path search, for which excellent algorithms to find the optimal path exist; the optimal path is the path with minimum global cost. In the previous section we have introduced a within-frame cost function that modifies the candidates' strengths and only depends on the candidates within a frame. To guarantee a smooth curve we also have to associate costs that inhibit large pitch changes between two successive frames. We call these costs between-frame costs. If F1 is a pitch candidate in a frame and F2 is a pitch candidate in the next frame, then a transition cost can be defined as
Script 5.1 Script to test your perceived pitch as a function of modulation depth.
form Modulation depth
positive Depth 0.1
positive F 150
endform
dp = 100*depth
f1 = Create Sound from formula : "sf1" , "Mono" , 0 , 0.5 , 44100 ,
... "0.5*sin(2*pi*f*x)"
@fade_in_out
s = Create Sound from formula : "s'dp'" , "Mono" , 0 , 0.5 , 44100 ,
... "0.5*(1 + depth*sin(2*pi*f*x))*sin(2*pi*2*f*x)"
@fade_in_out
f2 = Create Sound from formula : "sf2" , "Mono" , 0 , 0.5 , 44100 ,
... "0.5*sin(2*pi*2*f*x)"
@fade_in_out
select all
Play
Remove
procedure fade_in_out
Fade in : "All" , -1 , 0.005 , "no"
Fade out : "All" , 100 , -0.005 , "no"
endproc
transitionCost (F1, F2) = 0, if F1 = F2;
transitionCost (F1, F2) = voicedUnvoicedCost, if F1 = 0 xor F2 = 0;
transitionCost (F1, F2) = octaveJumpCost · |log2 (F1/F2)|, otherwise.
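A Python sketch of such a transition cost function, using the standard values of the voiced/unvoiced cost and octave-jump cost parameters as defaults (the exact cost expressions inside Praat are more refined than this illustration):

```python
import math

def transition_cost(f1, f2, voiced_unvoiced_cost=0.14, octave_jump_cost=0.35):
    """Cost of going from candidate f1 in one frame to candidate f2 in the
    next; a frequency of 0 encodes the voiceless candidate."""
    if f1 == f2:
        return 0.0
    if (f1 == 0.0) != (f2 == 0.0):   # exactly one of the two is voiceless
        return voiced_unvoiced_cost
    return octave_jump_cost * abs(math.log2(f1 / f2))

print(transition_cost(200.0, 200.0))  # → 0.0
print(transition_cost(200.0, 0.0))    # → 0.14
print(transition_cost(200.0, 100.0))  # an octave jump: 0.35
```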
Figure 5.3.: Pitch candidates (frequencies in Hz) in successive analysis frames k+1 to k+6, laid out as a trellis; each frame contains several candidates (e.g. 465, 293, 155 and 77 Hz in frame k+1).
search time. Luckily the Viterbi algorithm can; it reduces the exponential time complexity O(m^n) to a complexity O(n · m^2). For the example above, this amounts to a reduction in the number of operations from the order of 4 · 10^117 to some 22,500. This number is very manageable. Let us now show how Viterbi works.
The underlying model in the Viterbi algorithm is that the most likely path (which leads to a particular state) up to a certain point t must depend only on the observed pitches at point t and the most likely sequence of states which leads to that state at point t − 1. In other words, we only use neighbouring frames in the calculation and there is no explicit dependence on frames that are more than one time step in the past. A trivial, but necessary, condition for the pitch calculation is that the times associated with successive states in the path are strictly increasing, i.e. a path always goes from left to right. The following description of the Viterbi algorithm is rephrased from the Wikipedia article on the Viterbi algorithm. The Viterbi algorithm operates on the state machine assumption. That is, at any time the pitch being modeled is in one of a finite number of states. Each state is characterized by a pitch candidate's frequency and strength. While multiple sequences of states (i.e. paths) can lead to a given state, at least one of them is the most likely path to that state, called the "winning path". This is a fundamental assumption of the algorithm, because the algorithm will examine all possible paths leading to a state and only keep the most likely one. This way the algorithm does not have to keep track of all possible paths, but only of one per state. If we unfold our pitch analysis as a trellis, like we did in figure 5.3, the idea of a path becomes clear: a connection of states at successive time points.
A second key assumption is that a transition from a state to the next state is accompanied by transition costs. The transition costs are computed from the candidates' frequencies and strengths. The third key assumption is that we can add all the state-to-state transition costs to some cumulative cost. So the crux of the algorithm is to store the cumulative cost in each state. When moving forward to a new state, the algorithm combines the cumulative costs of all possible previous states with the local transition costs and chooses the transition that results in the smallest cumulative cost. The local transition cost, i.e. the cost involved in going from one state to the next, depends on the pitch candidates' frequencies and strengths in the old state and the new state. After computing the combinations of local costs and accumulated costs, only the best transition survives and all other paths are discarded. If we also store a pointer to the optimal previous state, we can, after reaching the final state, trace back and so find the winning path, i.e. the path with the smallest accumulated cost.
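The procedure described above can be sketched in Python as follows. This is a generic Viterbi search over pitch candidates, not Praat's actual implementation: as a simplification, the within-frame cost of a candidate is taken as minus its strength, and the example transition cost is a bare octave-jump penalty with made-up candidate data.

```python
import math

def cheapest_path(freqs, strengths, transition_cost):
    """Viterbi search for the cheapest sequence of pitch candidates.
    freqs[t][i] and strengths[t][i] describe candidate i of frame t.
    Runs in O(n * m^2) for n frames of m candidates, instead of
    enumerating all m^n possible paths."""
    n = len(freqs)
    cost = [-s for s in strengths[0]]   # cumulative cost per state
    back = []                           # back[t-1][j]: best predecessor of state j
    for t in range(1, n):
        new_cost, pointers = [], []
        for j, fj in enumerate(freqs[t]):
            best = min(range(len(freqs[t - 1])),
                       key=lambda i: cost[i] + transition_cost(freqs[t - 1][i], fj))
            pointers.append(best)
            new_cost.append(cost[best]
                            + transition_cost(freqs[t - 1][best], fj)
                            - strengths[t][j])
        cost = new_cost
        back.append(pointers)
    # trace back from the cheapest final state
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path

def cost_fn(f1, f2):
    # octave jumps are penalized; staying on the same frequency is free
    return 0.0 if f1 == f2 else 0.35 * abs(math.log2(f1 / f2))

freqs = [[200.0, 100.0], [205.0, 102.0], [200.0, 100.0]]
strengths = [[0.9, 0.8], [0.5, 0.9], [0.9, 0.8]]
print(cheapest_path(freqs, strengths, cost_fn))  # → [1, 1, 1]
```

Note how the strong 102 Hz candidate in the middle frame pulls the whole path to the lower octave, even though the outer frames locally favor 200 Hz: exactly the continuity behaviour the cost functions are designed to produce.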
6. Intensity analysis
In chapter 2 the propagation of sound waves was treated in a rudimentary way. The most important conclusion was that sound is a wave phenomenon that results from air pressure variations. The sound pressure level is a measure of this air pressure variation. Air pressure is measured in pascal units (Pa), which are newtons per square metre (N/m²). The ambient pressure is about 100,000 Pa. The amount by which the lungs can vary this air pressure is only some 200 to 1000 Pa. Outside your body, the air pressure caused by your speech is much smaller again, namely some 0.01 to 1 Pa at one metre from your lips. These values are comparable to the values that you see for a typical speech recording in Praat's sound editor. Although the amplitude of a sound in the sound editor is expressed in Pa, these amplitudes can only be interpreted as true air pressures after applying a sound pressure calibration. Because this scaling affects all sound amplitudes in the same way, we are normally not interested in it and forget about it.
A normative human ear can detect a root-mean-square air pressure as small as 0.00002 Pa
for a 1000 Hz pure tone. The sound pressure level (SPL) is generally expressed in decibel
(dB) relative to this normative threshold:
SPL = 10 log10 [ (1/(t2 − t1)) ∫ from t1 to t2 of x²(t) dt / (2 · 10⁻⁵)² ],
where x(t) is the sound pressure in Pa as a function of time, and t1 and t2 are the times between which the energy is averaged. Essentially this formula says: first average the squared amplitude values over the time interval, then divide by the squared normative value, take the logarithm of the outcome and multiply it by ten.
The Intensity object in Praat represents an intensity contour at linearly spaced time points, with values in dB SPL, i.e. dB values relative to 2 · 10⁻⁵ Pa. This intensity analysis is performed according to the general analysis scheme in figure 2.9, i.e. the sound is divided into overlapping segments and the intensity of each separate windowed segment is determined by the formula above.
calculation of the intensity values. If you set this parameter too high you will end up with pitch-synchronous intensity modulations. If you set it too low, the intensity contour will look smeared. If you want a sharp contour you should set it as high as possible. The effective window length that Praat calculates from this parameter is 3.2/minimumPitch. This guarantees that for a periodic signal with fundamental frequency equal to F0, the intensity contour will hardly show any ripple if the minimum pitch is chosen equal to the fundamental frequency value F0.
To show the influence of the minimum pitch smoothing parameter we have displayed in
figure 6.2 the result of an intensity analysis on a short word segment for various values of the
minimum pitch parameter.
In the top panel you see the waveform of the Dutch word /vrou/ as spoken by a male speaker. As can be seen, the initial voiced fricative /v/ is realised as unvoiced (/f/) and the vowel part is clearly voiced. In the bottom panel, from top to bottom, the results of the intensity analyses are displayed for minimum pitch values of 100, 200, 400 and 800 Hz. To be able to
better visually compare these contours, each following contour was vertically shifted by an
extra 5 dB. The intensity curve at the top, i.e. where the minimum pitch parameter was chosen
as 100 Hz, is very smooth and seems to nicely follow the envelope of the sound. The 200 Hz
contour that lies just below the previous one, shows almost the same features although it is
a little bit more ragged. Increasing the minimum pitch value to 400 Hz makes the contour
definitely ragged and in the voiced part of the word a ripple that varies with the individual
pitch periods becomes clearly visible. Because of this large value of the minimum pitch the
unvoiced parts in the contour also become rippled, although more irregularly because of lack
of periodicity. When the minimum pitch is increased even further to 800 Hz, the ripples grow
larger and the number of ripples keeps increasing, as the bottom curve shows. At the same time, while the ripples increase with the minimum pitch, the intensity better follows local amplitude changes. So there is a tradeoff between the smoothness of the curve and the possibility to follow local amplitude changes: the higher the minimum pitch, the better we can follow the intensity contour details. From figure 6.2 it seems that a minimum pitch value between 100 and 200 Hz is adequate for this sound under normal circumstances. In this way the average pitch of the sounding interval, which is around 120 Hz, lies in the 100 to 200 Hz interval for the minimum pitch. We further note from the bottom panel that the lower the minimum pitch, the further to the right the curve starts. This is because the lower the minimum pitch, the longer the duration of each analysis window/segment is, and the further away the midpoint of the first window is from the start of the signal. At the end we have the same effect: the larger window duration means that the midpoint of the last analysed segment is now
Get maximum: returns the maximum value of the intensity contour in the chosen time range. In general the default parabolic interpolation technique is more than sufficient; the minimum pitch parameter used in the determination of the intensity contour has a much larger influence on this value than the interpolation technique used.
Get minimum: returns the minimum value of the intensity contour in the chosen time range.
Get time of maximum: returns the time at which the maximum of the intensity contour
occurs in the chosen time range.
Silence threshold (dB) is a relative threshold. It determines the maximum silence intensity with respect to the maximum intensity. Intervals with intensities this number of dB below the maximum intensity are considered voiceless. For example, if the maximum intensity value happens to be 78.2 dB and the silence threshold is −25 dB, then all intensity values below 53.2 dB (= 78.2 − 25) are considered silent.
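As a quick check of this arithmetic, a small Python sketch that marks silent frames given an intensity contour in dB (the function name and example values are mine):

```python
def silent_frames(intensities_db, silence_threshold_db=-25.0):
    """Indices of frames more than |threshold| dB below the contour maximum."""
    cutoff = max(intensities_db) + silence_threshold_db
    return [i for i, v in enumerate(intensities_db) if v < cutoff]

print(silent_frames([78.2, 60.0, 50.0, 40.0]))  # → [2, 3]
```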
3 The only way to achieve this would be to artificially make parts of a sound zero.
Figure 6.2.: The relation between the intensity curve and the minimum pitch parameter. In the top
panel we show the oscillogram of the word /vrou/ as spoken by a male speaker. The
bottom panel shows the intensity curves for different values of the minimum pitch
parameter. From top to bottom the minimum pitch was 100, 200, 400 and 800 Hz,
respectively. For displaying purposes only, intensity curves were vertically shifted by
5 dB.
7. The Spectrum
One of the types of objects in Praat is the Spectrum. The spectrum is an invaluable aid in studying differences between speech sounds. Almost all analyses that compare sounds are based on spectra. The spectrum is a frequency-domain representation of a sound signal; the spectrum gives information about frequencies and their relative strengths. The other representation of a sound, the one we are already familiar with from the Sound object in Praat, is the time-domain representation, i.e. the representation of sound amplitude versus time in an oscillogram.
A spectrum and a sound are different things. A sound you can hear; a spectrum you cannot. The spectrum is a (mathematical) construct to represent a sound for easier analysis. One makes calculations with a spectrum and one visualizes aspects of a spectrum, but you cannot hear it or touch it. Only after you have synthesized the sound from the spectrum can you listen to the sound. The reason for the popularity of the spectrum is that it is often easier to work with than the sound.
When the spectrum is calculated from a sound, a mathematical technique called Fourier analysis is used. A Fourier analysis finds all the frequencies in the sound and their amplitudes,
i.e. their strengths. There is no information loss in the spectrum: we can get the original sound
back from it by Fourier synthesis. These two transformations, analysis and synthesis, which are each other's inverse, are visualized in figure 7.1. On the left we see a very small part of a sound as a function of time, and on the right the sound as a function of frequency. The top arrow going from the sound to the spectrum, labeled To Spectrum, visualizes the Fourier analysis. The bottom arrow, labeled To Sound, visualizes the Fourier synthesis. Although intuitively the spectrum is a simple object, i.e. a representation of the frequency content of a signal, the mathematics to calculate the spectrum from a sound is not simple. The main causes for mathematical complications are, first of all, the finite duration of the sound and, secondly, the fact that sounds are sampled in the time domain.
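The analysis-synthesis round trip can be demonstrated with a naive discrete Fourier transform in Python. This is a sketch for illustration only; real implementations, Praat included, use much faster FFT-based algorithms.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform ('To Spectrum')."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(spectrum):
    """Inverse transform ('To Sound')."""
    n = len(spectrum)
    return [sum(spectrum[j] * cmath.exp(2j * cmath.pi * j * k / n)
                for j in range(n)).real / n
            for k in range(n)]

x = [math.sin(2 * math.pi * 3 * k / 32) for k in range(32)]  # 3 cycles in 32 samples
y = idft(dft(x))
print(max(abs(a - b) for a, b in zip(x, y)) < 1e-9)  # → True
```

The round trip recovers the samples up to rounding error, which is the "no information loss" property mentioned above.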
Figure 7.1.: A Sound (left, a function of time) and its Spectrum (right, a function of frequency), connected by Fourier analysis (To Spectrum) and Fourier synthesis (To Sound).
Some terminology: instead of Fourier analysis one often talks about applying a Fourier
transform and instead of Fourier synthesis one often says applying an inverse Fourier transform.
The spectrum is not a simple object like a mono sound but a complex one. Complex has a double meaning in this respect. The first meaning of complex is "composed of two or more parts". There are two parts in a spectrum: one part represents the amplitudes of all the frequencies, and the other part the phases of the frequencies. The other meaning of complex is the mathematical one, from "complex number". This is about how the two aspects of a frequency, its amplitude and its phase, are represented. To visualize a complete spectrum we would need three dimensions: one for frequency, one for amplitude and one for phase. Three-dimensional representations are difficult; we therefore limit ourselves to the most popular two-dimensional representation: the amplitude spectrum, where amplitude is displayed vertically in decibel and frequency horizontally in hertz. Often the amplitude spectrum is visualized in textbooks in
two different ways: as a line spectrum with vertical lines, or as an amplitude spectrum where
instead of showing the vertical lines, the tips of the lines are connected. In the sequel we will show that what is visualized as a line spectrum only occurs for very special sound signals. In Praat the amplitude spectrum is always drawn, although for special combinations of tone frequencies and tone durations the amplitude spectrum may have the appearance of a line spectrum. The most important reason for the popularity of the amplitude spectrum is that the human ear is not very sensitive to the relative phases of the components of a sound; the relative amplitudes of the components are of far more importance, as an example in section 7.1.9 will show.
In the following sections we will first try to explain qualitatively the relation between a sound and its spectrum. We start by varying elementary signals and noting the effects in the spectrum. Then, of course, more complex signals will follow.
s(t) = a sin(2πf t),    (7.1)
where a is the tone's amplitude, f is its frequency and t is the time. Section A.1 gives a mathematical introduction to sine and cosine functions, and in section 2.2.1 we showed how we can write a pure tone in the above form. Here we will not be concerned with how the
spectrum is actually calculated from the sound; this is saved for a later section. Neither will we go into the details of how a spectrum is represented in Praat. We will start by just studying plots of the amplitude spectrum. The amplitude spectrum gives a graphical display of the most important part of the spectrum: the (relative) strengths of the frequency components in the spectrum. The amplitude spectrum shows frequency on the horizontal axis and a measure of the amplitude of that frequency on the vertical axis. We talk about two different meanings of amplitude here: the amplitude of a pure tone and the strength of a spectral component. Note that the sound amplitude and the amplitudes in the spectrum in general bear no relation.
In this section we investigate what the spectrum of pure tones looks like when we only vary the frequency and leave the amplitude constant. The following script (7.1) creates a pure tone, calculates the tone's spectrum and plots the tone and the spectrum next to each other in the same row.2
Script 7.1 Create a pure tone and spectrum and draw both next to each other.
a = 1
f = 100
Create Sound from formula : "100" , "Mono" , 0 , 1 , 44100 , "a*sin(2*pi*f*x)"
Select outer viewport : 0 , 3 , 0 , 3
Draw : 0 , 0.01 , -1 , 1 , "yes" , "Curve"
To Spectrum : "no"
Select outer viewport : 3 , 6 , 0 , 3
Draw : 0 , 500 , 0 , 100 , "yes"
# Draw the marker on the right...
The figures in the three rows of figure 7.2 were all made with amplitude a = 1, while we chose three different values for the frequency f. (Note that we use the term amplitude for two things: the amplitude of the tone and the strength of a spectral component.) The first plot in the top row on the left shows the first ten milliseconds of a pure tone with a
frequency f of 100 Hz. The figure shows exactly one period of this tone within this interval
because period and frequency are inversely related: for a frequency of f = 100 Hz, one period
of the tone lasts T = 1/f = 0.01 seconds. The plot on the right shows the tone's amplitude
spectrum. On the horizontal axis, the frequency range has been limited from 0 to 500 Hz for
a better overview. The vertical scale is in dB/Hz (see section A.4.2 on decibel). There is only
one vertical line in the amplitude spectrum. The line starts at the horizontal axis at position
100 Hz and rises to a value of 91 dB/Hz.3 This line signals that in the amplitude spectrum
there is only one frequency component present at a frequency of 100 Hz with an amplitude of
91 dB/Hz.4
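The value of 91 dB/Hz can be made plausible with a small calculation. The sketch below rests on two assumptions that are our reading, not spelled out here: Praat expresses the spectrum as power spectral density relative to a reference pressure of 2·10⁻⁵ Pa, and a 1 s sine of amplitude a has mean power a²/2.

```python
import math

p_ref = 2e-5            # assumed reference pressure in Pa
power = 1.0 ** 2 / 2    # mean power of a sine with amplitude a = 1
duration = 1.0          # s; footnote 3 notes the value depends on duration too
psd_db = 10 * math.log10(power * duration / p_ref ** 2)
print(round(psd_db))    # 91
```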
The next row in the figure shows on the left the first ten milliseconds of a 200 Hz pure
tone, also with an amplitude of one. We can now distinguish two periods because a tone of
frequency f = 200 Hz has a period of T = 1/200 = 0.005 seconds and two of these periods
fit in the plot interval of 0.01 seconds. The amplitude spectrum of this tone, again on the right,
shows only one vertical line. This line signals that in the amplitude spectrum there is only
one frequency component present at a frequency of 200 Hz with an amplitude of 91 dB/Hz.
2 In section 7.1.7 we will explain why we calculate the spectrum from a signal with a duration of one second while we only draw 10 milliseconds.
3 The spectral amplitude of 91 dB/Hz occurs because the sound amplitude is 1 and the duration of the tone is 1 second. Had we chosen another fixed sound amplitude and/or duration, the spectral amplitude would have been a different number.
4 Although, on the frequency scale presented, the spectrum looks like a line spectrum, zooming in will reveal that it actually is a very thin triangle. Nevertheless we will still call it a line.
7. The Spectrum
Figure 7.2.: In the left column from top to bottom the first 10 ms of 1 s duration pure tones with
frequencies 100, 200 and 400 Hz. The right column shows the amplitude spectrum of
each tone.
Because the frequency scale of the amplitude spectrum is a linear scale, a frequency of 200 Hz
is twice as far from the origin at 0 Hz as a frequency of 100 Hz.
The last row shows, on the left, the first 0.01 s interval of the pure tone with frequency 400 Hz, and like the previous tones, with an amplitude of one. The period now equals T = 1/400 = 0.0025 s, hence four periods fit into the plot interval of 0.01 seconds. The line in the
amplitude spectrum on the right shows there is only one frequency component at a frequency
of 400 Hz and again with an amplitude of 91 dB/Hz.
We could have continued figure 7.2 with more rows, showing periods of other pure tones of
amplitude one and their spectra. This would always have resulted, for a tone with frequency
f, in a left plot that shows 0.01·f periods and in a right amplitude spectrum with a line at
frequency f with the same amplitude as before. Our conclusion is that the amplitude spectra
of pure tones with equal amplitudes show peaks of equal heights.
Now that we know that different tones have different positions in the amplitude spectrum, we want to investigate how differences in the tone's amplitude translate to the amplitude spectrum. In the left column of figure 7.3 the first 10 milliseconds of tones with the same 200 Hz frequency but different amplitudes are plotted. The amplitude varies in steps of 10. The top figure has amplitude a = 1, the middle one has a = 0.1 and the bottom one is barely noticeable because of its small amplitude of a = 0.01. The figures in the first row are equal to the
Figure 7.3.: In the left column from top to bottom the first 10 ms of 1 s duration pure tones with
frequency 200 Hz and amplitudes of 1, 0.1 and 0.01. The right column shows the
amplitude spectrum of each tone.
figures in the second row of the previous figure since these tones are equal. As the figures in the right column make clear, going from the first row to the second, an amplitude reduction by a factor of 10 results in a spectral amplitude reduction of 20 dB. We further note that reducing the amplitude has no effect on the position of the peak in the amplitude spectrum, only on its height. As was shown in section A.4.2, the difference between amplitudes a1 and
a2 in dBs can be calculated as 20 log(a1 /a2 ). If we want to compare the tone from the top
row with the one in the middle row, then with the values a1 = 1.0 and a2 = 0.1 we obtain
20 log(1.0/0.1) = 20 log(10) = 20 dB, i.e. the first tone is 20 dB louder than the second.
Had we performed the calculation the other way around and taken 20 log(a2 /a1 ), the result
would have been 20 log(0.1/1.0) = −20 dB, i.e. the second tone is 20 dB weaker than the first
tone. Both calculations result in the same 20 dB difference in spectral amplitude between the
two tones. Only the signs of the numbers differ: a negative sign indicates that the denominator value is larger than the numerator. The value at the horizontal line in the middle row's spectrum confirms our calculation because it reads 71 dB/Hz, which is 20 dB less than the 91 dB/Hz of the top row.

Figure 7.4.: In the left column from top to bottom the first 10 ms of 1 s duration pure tones with frequency 200 Hz and amplitudes of 1, 0.5 and 0.25. The right column shows the amplitude spectrum of each tone.
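These decibel manipulations are easy to check with a few lines of Python:

```python
import math

# amplitude ratios translate to decibels as 20*log10(a1/a2)
up = 20 * math.log10(1.0 / 0.1)    # a factor of 10 stronger: +20 dB
down = 20 * math.log10(0.1 / 1.0)  # the same ratio the other way: -20 dB
print(round(up), round(down))      # 20 -20
```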
When we compare the second row with the third, then again we have an amplitude reduction
by the same factor 10. This results, again, in a 20 dB spectral amplitude reduction. The
amplitude of the tone in the third row and the first row differ by a factor of 100. The calculation
gives a difference of 20 log(100/1) = 20 · 2 = 40 dB. This is confirmed by the 51 dB/Hz value
in the amplitude spectrum of the third row.
As a confirmation we show in figure 7.4 the effect of reducing the amplitude of pure tones
of 200 Hz by factors of 2. The first row is identical to the first row of the previous figure.
The amplitudes of the tones in the left column, from top to bottom, are 1.0, 0.5 and 0.25. For
the expected differences of the spectral amplitudes in decibel, we expect 20 log(1.0/0.5) = 20 log 2 ≈ 20 · 0.3 = 6 dB. Because a1/a2 = a2/a3, the difference between the first and
the second and the difference between the second and the third spectral amplitudes should
be equal to 6 dB. The numbers at the horizontal lines in the amplitude spectrum confirm our
calculations.
This shows that the amplitude and frequency of pure tones are displayed independently of
each other in the amplitude spectrum. The frequency determines the position of the line on the
frequency axis and the amplitude the height of the line, i.e. its spectral amplitude. Amplitude
and frequency are two independent aspects of a pure tone.
In figures 7.2, 7.3 and 7.4 the sounds all start with a zero amplitude value. What would happen
to the amplitude spectrum if the pure tones didn't start at a time where the amplitude is zero? To model this, the sine function of equation (7.1) is not sufficient, because it always starts with an amplitude of zero at time t = 0. However, an extra parameter in the argument of the sine can change this behaviour. This parameter is called the phase. Section A.1.3 contains more information on phase. We write the new pure tone function as

s(t) = a sin(2πft + φ),   (7.2)

where φ denotes the phase. At time t = 0 the amplitude of the tone will be s(0) = a sin(φ). By choosing the right value for φ, we can make s(0) equal to any value in the interval from −a to +a. In the left column of figure 7.5 we show, from top to bottom, the pure tones with constant
Figure 7.5.: In the left column from top to bottom the first 10 ms of 1 s duration pure tones with
frequency 200 Hz and phases of 0, π/2 and π. The right column shows the amplitude spectrum of each tone.
frequency 200 Hz and amplitude 1, for three different phases: 0, π/2 and π. In the right
column, the corresponding amplitude spectra are plotted. The amplitude spectra all show only
one line with the same frequency and spectral amplitude in the three spectra. We conclude
from this figure that the phase of the tone has no influence on the amplitude spectrum, only
sound amplitude and sound frequency matter.
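This phase invariance can also be verified numerically with numpy's FFT (a sketch outside Praat, using the same 200 Hz, 1 s, amplitude-1 tones):

```python
import numpy as np

fs, f = 44100, 200
t = np.arange(fs) / fs
mags = []
for phase in (0.0, np.pi / 2, np.pi):
    x = np.sin(2 * np.pi * f * t + phase)
    mags.append(np.abs(np.fft.rfft(x)))   # the magnitude spectrum discards phase
# all three magnitude spectra coincide
print(np.allclose(mags[0], mags[1], atol=1e-3), np.allclose(mags[0], mags[2], atol=1e-3))
```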
Warning: although the frequencies and amplitudes of the three sounds in the figure are the same, only the ones with phase 0 and phase π will sound the same. In the sound with phase π/2 you will hear two clicks, one near the start and the other near the end of the sound. These clicks are caused by the abrupt change in amplitude at the start and end of this sound. Imagine the loudspeaker cone at rest: at the start of the sound it has to move to full amplitude immediately. Because it has to reach its maximum in no time, this fast movement causes a click; the loudspeaker will overshoot. The opposite happens at the end: the cone is at its maximum and immediately has to return to its rest position. Abrupt changes in the amplitude of a sound are called discontinuities, and they cause clicks when you listen to them.
7.1.4. The spectrum of a simple mixture of tones
The pure tones we have worked with in the previous sections are elementary sounds, but in no way the sounds of daily life. We now investigate more complex sounds and show what happens in the amplitude spectrum when we combine a number of these elementary sounds.
Figure 7.6.: In the left column from top to bottom the first 20 ms of 1 s duration mixtures of three
pure tones of frequencies 100, 200 and 400 Hz. The right column shows the amplitude
spectrum of each mixture.
In figure 7.6 we show on the left, from top to bottom, three different mixtures of three sines.
The mixtures were created with a formula of the form

s(t) = 1/3 · (a1 sin(2π·100·t) + a2 sin(2π·200·t) + a3 sin(2π·400·t)).

The formula shows that we add three tones with frequencies 100, 200 and 400 Hz. By choosing
values for the coefficients a1 , a2 and a3 , we can mix these tones in any way we like. The
factor 1/3 guarantees that the sum of these three sines never exceeds one.5 The mixture in
the top row has all three coefficients equal to one, i.e. a1 = a2 = a3 = 1. The amplitude
spectrum on the right shows that if we add tones of equal sound amplitudes, they show an
equal spectral amplitude. This spectral amplitude is 81.4 dB. If we had left out the scale factor
of 1/3, the spectral amplitudes would all have been equal to 91 dB, just like they were in figure
7.2. We now know how to do the math to account for the 1/3: 20 log(1/3) = −20 log 3 ≈ −20 · 0.477 = −9.54 dB. The scale factor of 1/3 lowers the spectral amplitude by 9.54 dB, from 91.0 to 81.46, which was rounded down to 81.4 dB.
The mixture in the middle row has a1 = 1, a2 = 0.1 and a3 = 0.01, just like the single
tones in figure 7.3 had. Here they show the same 20 dB spectral amplitude difference between
successive values but now in the same plot.
The last row shows the mixture with a1 = 1, a2 = 0.5 and a3 = 0.25. In the amplitude
spectrum the peaks are 6 dB apart.
All the relations between sound amplitudes and spectral amplitudes that were established
for pure tones also seem to work for mixtures or combinations of pure tones.
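The 81.4 dB value follows from the same decibel arithmetic; the sketch below again assumes the ≈91 dB/Hz baseline of a full-scale 1 s tone (reference pressure 2·10⁻⁵ Pa):

```python
import math

base = 10 * math.log10(0.5 / (2e-5) ** 2)  # ~91.0 dB/Hz baseline (figure 7.2)
drop = 20 * math.log10(1 / 3)              # the 1/3 scale factor: ~ -9.54 dB
print(round(base + drop, 1))               # 81.4
```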
7.1.5. The spectrum of a tone complex
From the previous figure 7.6 it seems that by only varying the values of the three amplitude
coefficients, a great variety of sounds can be made. What do the sounds look like when we allow for more frequencies? Because we have many frequencies and many ways to mix
them, we start with some popular examples with prescribed amplitudes and frequencies. The
following script generates the sounds in the left column of figure 7.7.
Script 7.2 The synthesis of a sawtooth function
f0 = 200
n = 20
Create Sound from formula : "s" , "Mono" , 0 , 1 , 44100 , "0"
for k from 1 to n
Formula : "self + (-1)^(k-1)/k * sin(2*pi*k*f0*x)"
endfor
The script implements the formula s(t) = Σ_{k=1}^{n} (−1)^(k−1)/k · sin(2πkf0 t), the approximation of a so-called sawtooth function, for n = 20 and f0 = 200. The formula sums tones whose frequencies are multiples of a fundamental frequency f0. We have a special name for frequencies that are integer multiples of a fundamental frequency: they are called harmonic frequencies. The fundamental frequency itself is called the first harmonic, the frequency f = 2f0 is called the second harmonic, etc.6
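A numpy sketch of the same synthesis (not Praat itself) shows that harmonic k indeed ends up with amplitude 1/k; for a 1 s sound the FFT bins are 1 Hz apart, so harmonic k sits in bin k·f0:

```python
import numpy as np

fs, f0, n = 44100, 200, 20
t = np.arange(fs) / fs                         # exactly 1 s of samples
s = np.zeros_like(t)
for k in range(1, n + 1):
    s += (-1) ** (k - 1) / k * np.sin(2 * np.pi * k * f0 * t)
amp = np.abs(np.fft.rfft(s)) / (fs / 2)        # per-component amplitude
print(round(amp[200], 3), round(amp[400], 3), round(amp[600], 3))  # 1.0 0.5 0.333
```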
5 In the multiplication by 1/3 we implicitly assume that the three coefficients are not larger than one.
6 In music terminology one refers to the term overtone. An overtone is harmonically related to the fundamental
Figure 7.7.: The left column shows the sawtooth function synthesis with 1, 5 and 20 terms. The
right column shows the spectrum of each synthesis.
The amplitude of each sine is (−1)^(k−1)/k. There are two parts in this term: because of the 1/k, the amplitude of each sine is inversely proportional to its index number. The (−1)^(k−1) part equals −1 for all even k, and equals +1 for all odd k. This results in a sum with alternating positive and negative terms: sines with odd k are added, while sines with even k are subtracted. The first three terms of this sum are s(t) = sin(2πf0 t) − 1/2 sin(2π·2f0 t) + 1/3 sin(2π·3f0 t) + ⋯.
The script goes as follows: in script line 3 a silent sound object is created. For each new value of the loop variable k, a new tone is added to the existing sound. After 20 additions the script terminates. In the left column of figure 7.7 we show, from top to bottom, the syntheses with 1, 5 and 20 terms.
frequency but the numbering is shifted by one, i.e. the first overtone equals the second harmonic, the second
overtone equals the third harmonic, etc. A special technique called overtone singing is used by many peoples
around the globe. By careful articulation they amplify a specific overtone and suppress others. If you want to
learn these techniques, see the book by Rachelle [1995].
The mathematical formula for a block is b(t) = Σ_{k=1}^{n} sin(2π(2k−1)f0 t)/(2k−1). In the synthesis, because of the 2k−1 term, only odd multiples of f0 are used. The first three terms are b(t) = sin(2πf0 t) + sin(2π·3f0 t)/3 + sin(2π·5f0 t)/5 + ⋯. In figure 7.8 we show the
results for three different values of the parameter n.
Isn't it amazing that if we combine sines whose frequencies are harmonics of some frequency f0, these block and sawtooth functions appear, with a period T that equals 1/f0?
Figure 7.8.: The left column shows the block function synthesis with 1, 5 and 20 terms. The right
column shows the corresponding amplitude spectrum.
Does this mean that any combination of harmonics leads to a periodic function? With the
following script you can try it.
f0 = 200
n = 20
Create Sound from formula : "random" , "Mono" , 0 , 1 , 44100 , "0"
for k from 1 to n
ak = randomUniform ( - 0.9 , 0.9) / k
Formula : "self + ak * sin(2*pi*k*f0*x)"
endfor
The script synthesizes a sound object with 20 harmonics. The function randomUniform(-0.9, 0.9) generates a new uniform random number between −0.9 and 0.9 every time it is called. Each number is completely independent of the previous one(s). If we run the script a number of times in succession, then each time a different series
Figure 7.9.: The left column shows the synthesis with 5, 10 and 20 sine terms with random uniform
amplitudes. The right column shows the amplitude spectrum of each synthesis.
of random uniform numbers will result.7 The amplitude of each harmonic will be the product
of this uniform random number and the scale factor 1/k. In figure 7.9 the left column shows
the synthesis of three sounds from the script, where the number of components n from top to
bottom was chosen as 5, 10 and 20. The scale factor 1/k for each amplitude is for displaying
reasons only: it makes the periodicity in the synthesized signal easier to see. All three signals
above are periodic with period 1/f0 .
7 In a uniform distribution of numbers, all numbers in the distribution have equal probability of being chosen. When we write about a random uniform number, we use it as shorthand for a number randomly drawn from a uniform distribution of numbers.

If we run the script over and over again, each time with randomly assigned harmonic amplitudes, the resulting signals all share the same period 1/f0 but the sound amplitudes as a function of time vary.8 It is however very unlikely that we would ever reproduce
exactly the sound as shown in figure 7.9.
If we changed the scale factor in the script to any function of the index, the resulting signals would still share the same period 1/f0.
We conclude: The sum of harmonically related sines synthesizes a periodic sound.
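The conclusion is easy to test outside Praat. A numpy sketch (we pick fs = 44000 Hz here, an assumption of the sketch, so that one period of f0 = 200 Hz is a whole number of samples):

```python
import numpy as np

rng = np.random.default_rng()
fs, f0, n = 44000, 200, 20
t = np.arange(fs) / fs
s = np.zeros_like(t)
for k in range(1, n + 1):
    s += rng.uniform(-0.9, 0.9) / k * np.sin(2 * np.pi * k * f0 * t)
period = fs // f0                  # 220 samples = 1/f0 seconds
print(np.allclose(s[:period], s[period:2 * period]))  # True: the period is 1/f0
```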
7.1.6. The spectrum of pure tones that don't fit
Now we are going to complicate things a little bit. In the examples we showed before, the durations of the sounds and the frequencies were not picked at random. For all the frequencies we have used, an integral number of periods fitted in the sound's duration. We did this on
purpose to show you that a sine with frequency f corresponds to one line in the amplitude
spectrum. For example, for a sine with a frequency of 200 Hz, exactly 200 periods fit in a
sound of 1 s duration. The corresponding amplitude spectrum shows a line at 200 Hz as is
Figure 7.10.: The amplitude spectrum of a 200 Hz tone. On the left the tone was of 1 s duration
and on the right of 0.9975 s duration.
shown in the left part of figure 7.10. If instead we create the tone with a duration of 0.9975 s,
the amplitude spectrum looks like the right part of the figure. This does not look like a line at
all, more like a very peaky mountain. To explain the difference we have to delve into the way
the spectrum is calculated from the sound and how the duration of the sound comes into play.
Let us call the duration of the sound that is being analysed T. The spectrum is calculated from the sound by a technique called Fourier analysis. A Fourier analysis tries to find the amplitudes (and phases) of all harmonics of the frequency 1/T that do not exceed the Nyquist frequency. If you sum these harmonics again with the calculated amplitudes and phases, you get the signal back. Besides these harmonics, the strength at a frequency of zero hertz is also calculated; this value equals the average value of the sound.9 Therefore, a Fourier analysis decomposes a sound into separate frequency components.
These component frequencies are all multiples of a fundamental analysis frequency 1/T and are given by k/T, where k = 0, 1, . . . numbers the positions on the horizontal frequency axis of the spectrum.
8 Of course within certain limits: they are all synthesized with sine functions only, which dictates that they all start with amplitude zero.
9 This is often called the DC component. The analogy is from the electronics domain, where alternating current (AC) has direct current (DC) as its static counterpart.
Figure 7.11.: Periodic extensions of a 200 Hz tone in a sound of duration 1 s (left) and 0.9975 s
(right).
Another way to describe the difference between these two spectra is shown in figure 7.11.
The Fourier analysis works with the underlying assumption that the sound being analysed is periodic with period T equal to its duration, i.e. as if its duration were infinite. The analysis then occurs on this infinitely long signal that can be constructed from the original one by concatenating copies of itself. It is therefore important what happens at the borders where two copies meet. For the 1 s duration tone, the 200 Hz sine has an integral number of periods within its 1 s duration and therefore joins smoothly at the end of each interval with the sine at the start of the next interval. The left plot in the figure shows this. The infinitely extended signal just looks like a sine of 200 Hz of infinite duration. For the 0.9975 s duration sound, the 200 Hz sine does not join smoothly: it ends halfway through a period, as the right plot in the figure shows. This no longer looks like a sine, because there is a discontinuity in the sound. It is clear that no single analysis frequency can match this discontinuity.
Conclusion: in Fourier analysis, a sound of duration T is decomposed into frequencies that
are harmonics of the fundamental analysis frequency 1/T . This decomposition is unique:
from the decomposition we can get our original sound back with a technique called Fourier
synthesis.
7.1.7. Spectral resolution
We repeat the finding of the previous section: a Fourier analysis decomposes a sound of duration T seconds into harmonics of the fundamental analysis frequency 1/T.
We will now investigate the consequences of this. For a 1 s duration sound the fundamental
analysis frequency is 1 Hz and this results in frequency components that are 1 Hz apart in the
spectrum. This means that for a tone complex with two tones whose frequencies f1 and f2
differ by 1 Hz, like for example f1 = 500 Hz and f2 = 501 Hz, each tone can be represented
in the spectrum as a separate value. If the tone frequencies differ by less, like for example the
frequencies f1 = 501.2 Hz and f2 = 501.9 Hz then these two frequencies have to merge into
one line in the spectrum and will not be separately detectable anymore. For a sound composed
of two tones with duration 1 s, the frequencies of the two tones have to be at least 1 Hz apart
to be separately detectable. The term associated with this is spectral resolution.12 We say
that the spectral resolution is equal to the frequency 1/T. The lower this frequency, the better the spectral resolution: a longer duration results in a better resolution, a shorter duration in a worse one.
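The 1/T spacing can be illustrated with numpy (a sketch, not Praat's own analysis): for a 1 s sound the analysis frequencies are 1 Hz apart, so 500 Hz and 501 Hz tones each get their own line.

```python
import numpy as np

fs, dur = 44100, 1.0
t = np.arange(int(fs * dur)) / fs
x = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 501 * t)
mag = np.abs(np.fft.rfft(x))
# bin k corresponds to k/dur = k Hz; both tones peak in separate bins
print(mag[500] > 1000, mag[501] > 1000, mag[499] < 1.0)
```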
Figure 7.12 shows the effect of signal duration on the amplitude spectrum. In the left column
we show sounds with an ever increasing number of periods of a 1000 Hz tone. The right
column shows the amplitude spectrum. The sound in the first row was created with a duration
of 0.001 s and shows one period of a 1000 Hz tone. Because of its very short duration, in the
spectrum the analysis frequency components are multiples of 1/0.001 = 1000 Hz. In the next
row the duration is doubled to 0.002 s and the spectral resolution therefore halves to 500 Hz.13
12 In a case where we know that these two frequencies are in the signal, we can use advanced techniques to measure their amplitudes and phases, because we have extra information. In general we only have the information in the spectrum.
13 Because the spectrum shows power spectral density, the amplitude spectrum for a pure tone shows an increase of 3 dB for each doubling of the duration. In the power spectral density spectrum that Praat shows, both the amplitude of the basis frequency components and the duration of the sound are intertwined: in the amplitude spectra of figure 7.12, doubling the sound's duration increases the value at the 1000 Hz frequency component by 3 dB while the amplitude of the sine in the sounds did not change. This intertwining makes it impossible to directly estimate the strength of each frequency component. However, nothing is lost: the effect of duration is the same for all values in the amplitude spectrum. The whole amplitude spectrum is shifted up or down a number of decibels, depending on whether the duration increases or decreases. All relations between the amplitudes within each spectrum therefore do not depend on duration, so you can also compare the relative values of components between two amplitude spectra.
[Figure 7.12: pure 1000 Hz tones with durations of 0.001, 0.002, 0.004, 0.008 and 0.016 s (left column) and their amplitude spectra (right column); the spectral peaks read 61, 64, 67, 70 and 73 dB/Hz, respectively.]
We see the inverse relation between the spectral resolution and the duration of the sound. This relation is not an artifact of Fourier analysis; it is the result of the limits nature imposes upon us. In physics this is stated by the Heisenberg uncertainty principle; here it simply says that the more precisely you want to determine a signal's frequency, the longer you have to measure: precision takes time.
7.1.8. Why do we also need cosines?
Many examples in the previous sections used sines as the building block for Fourier analysis and synthesis. These sines were all prototypical sines that start at zero, then increase to their maximum value and then decrease to their minimum, etc. We can create an infinite number of sounds with the sine as a building block. However, there is also an infinite number of sounds we cannot create with sines. For example, all the sounds that start with an amplitude different from zero cannot be modeled by a combination of sines only. The cosine function is a natural candidate to model such functions: where the sine starts with amplitude zero, the cosine starts with amplitude one. By mixing a sine and a cosine function we can obtain any start value we want. In section A.1.3 we show that a mixture of a sine and a cosine function with the same argument is equivalent to a sine with a phase. We translate equation (A.2) to frequencies and write
a cos(2πft) + b sin(2πft) = c sin(2πft + φ),   (7.3)

where the new amplitude is c = √(a² + b²).
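Identity (7.3) is easy to verify numerically. In this sketch the amplitude is c = √(a² + b²) and the phase is φ = arctan(a/b) (computed with atan2 to get the quadrant right):

```python
import numpy as np

a, b, f = 0.6, 0.8, 200.0
c = np.hypot(a, b)                  # sqrt(a**2 + b**2) = 1.0 here
phi = np.arctan2(a, b)              # tan(phi) = a/b
t = np.linspace(0.0, 0.01, 1000)
lhs = a * np.cos(2 * np.pi * f * t) + b * np.sin(2 * np.pi * f * t)
rhs = c * np.sin(2 * np.pi * f * t + phi)
print(np.allclose(lhs, rhs))        # True
```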
We know from experiment that the ear is more sensitive to the amplitudes of the frequency components than to their phases. With the following simple script you can check this for yourself. Every time you run this script, a sound is played that has the same frequency content, but the phases of the individual components differ. The only thing these sounds visually have in common is their periodicity. Despite the fact that they all look different, they all sound the same. This shows that the amplitude spectrum captures essential perceptual elements of a sound and is therefore a more stable representation than the time signal.
f0 = 150
Create Sound from formula : "sines" , "Mono" , 0 , 1 , 44100 , "0"
for k to 5
phase = randomUniform ( - pi /2 , pi / 2)
Formula : "self + sin(2*pi*k*f0*x + phase)"
endfor
Scale peak : 0.99
# avoid clicks at start and end
Fade in : 1 , 0 , 0.005 , "no"
Fade out : 1 , 1 , - 0.005 , "no"
Play
s(t) = Σ_{k=0}^{N/2−1} (ak cos(2πkf0 t) + bk sin(2πkf0 t)),   (7.4)
where f0 = 1/T is the fundamental analysis frequency and N is the number of samples in the
sound. The Fourier analysis determines for each harmonic of the analyzing frequency f0 the
coefficients ak and bk . We can rewrite the above equation in terms of phases as
s(t) = Σ_{k=0}^{N/2−1} ck sin(2πkf0 t + φk)   (7.5)
In figure 7.13 we present a detailed example of a Fourier analysis of a short sound. In the top
row on the left is the sound of duration T that will be analyzed. The actual value of T is not
important now. The duration T specifies that the analyzing cosine and sine frequencies have
to be harmonics of f0 = 1/T . The following pseudo script shows the analysis structure. For
Script 7.4 The Fourier analysis pseudo script
1 s = selected ("Sound")
2 duration = Get total duration
3 f0 = 1 / duration
4 for k from 0 to ncomponents - 1
5     <calculate a[k] and b[k] from sound s>
6     select s
7     Formula : "self - a[k]*cos(2*pi*k*f0*x) - b[k]*sin(2*pi*k*f0*x)"
8 endfor
the first frequency, when k = 0, the Fourier analysis only determines the coefficient a0 which
equals the average value of the sound. In the script this corresponds to line 5, the first line
in the for loop. For each frequency component the coefficients ak and bk can be determined
by the technique of section A.1.4: multiply the sound s with cos(2πkf0 t) to calculate ak, and multiply s with sin(2πkf0 t) to calculate bk. The coefficient b0 is zero, because a sine of frequency zero is always zero. The analysis cosine of frequency zero with amplitude a0 is
shown in the second column of the figure, it happens to be a straight line because a cosine of
frequency zero equals one. The dotted line shows the zero level. This function has exactly the
same duration as the sound and is next subtracted from the sound. In the script this corresponds
to the last line in the for loop, line 7. This results in the sound displayed in the third column.
In the amplitude spectrum in the fourth column the value a0 is shown with a vertical line at a
frequency of 0 Hz. For displaying purposes only, the vertical scale of the amplitude spectrum
is a linear one instead of the usual logarithmic one.
In the next step for k = 1, this new sound is now analyzed with a cosine of frequency 1/T
and the number a1 is determined, and then analyzed with a sine of the same frequency and the
number b1 is determined. In the figure this is shown in the second row: the left most figure is
the sound corrected for the 0 Hz frequency. The second column shows the cosine and the sine
components with amplitudes a1 and b1 , respectively. In the third column the cosine and the
sine component are subtracted from the sound. In the script we are after line 7 now. The sound
does not contain any component of frequency 1/T any more. In the amplitude spectrum the value √(a1² + b1²) is shown at distance 1 from the previous value, because the size of one unit on this axis was chosen to be 1/T Hz.
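The analysis loop of script 7.4 can be fleshed out in numpy (an illustration, not Praat's actual implementation; the projections follow section A.1.4, and the test signal is our own choice):

```python
import numpy as np

fs, dur = 1000, 1.0
t = np.arange(int(fs * dur)) / fs
f0 = 1.0 / dur
# a test sound: a DC offset + a 3 Hz sine + a 4 Hz cosine
s = 0.3 + 0.5 * np.sin(2 * np.pi * 3 * f0 * t) + 0.2 * np.cos(2 * np.pi * 4 * f0 * t)
residual = s.copy()
for k in range(5):
    cos_k = np.cos(2 * np.pi * k * f0 * t)
    sin_k = np.sin(2 * np.pi * k * f0 * t)
    if k == 0:
        a, b = residual.mean(), 0.0          # a0 is the average (DC) value
    else:
        a = 2 * np.mean(residual * cos_k)    # project on the analysis cosine
        b = 2 * np.mean(residual * sin_k)    # project on the analysis sine
    residual -= a * cos_k + b * sin_k        # subtract the component (line 7)
print(np.max(np.abs(residual)) < 1e-9)       # True: nothing left after k = 4
```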
For k = 2, the sound from which the lower frequency components have been removed is analyzed with a cosine and a sine of frequency 2/T and the coefficients a2 and b2 are determined. These frequency components are removed from the sound, and in the amplitude spectrum the value √(a2² + b2²) is drawn at unit 2. The third row in the figure shows these steps.
The same process continues for k = 3 as the fourth row shows. At the end of the next step,
k = 4, we are done because after subtraction of the components of frequency 4/T there is
nothing left in the sound: all sample values are zero. All amplitude values ak and bk for k > 4
will be zero. The amplitude spectrum will look like the one in the fifth row. This completes
the Fourier analysis part.
In the last row of the figure we show the sum of all the components in the second column, i.e. all the cosines and sines from the first five rows. This is the Fourier synthesis, and we get back the sound we started from, the one at the top row on the left.
[Figure 7.13: the step-by-step Fourier analysis and synthesis described in the text. The columns show, per step k = 0 … 4, the remaining sound, the analysis components, the sound after subtraction, and the growing amplitude spectrum (frequency axis in units of 1/T Hz); the last row shows the Fourier synthesis, the sum of all components.]
Looking back at figure 7.12, we note that the spectrum gets broader as the duration of the signal decreases. In
the limit when the sound is reduced to only one sample the spectrum is at its broadest. The
spectrum of a pulse has all frequencies present at equal strengths, i.e. from zero hertz to the
Nyquist frequency. The spectrum of a pulse is a constant amplitude straight spectrum.
The following script creates a pulse and draws the spectrum of figure 7.15.
Create Sound from formula : "pulse" , "Mono" , 0 , 0.1 , 44100 , "col = 100"
To Spectrum : "no"
Draw : 0 , 0 , 0 , 40 , "no"
[Figure 7.15: the pulse sound (amplitude −1 to 1, time 0 to 0.1 s, left) and its flat amplitude spectrum (0 to 40 dB, frequencies up to 5000 Hz, right).]
Because the numbers in a Spectrum object have, by themselves, no meaning, the information about what these numbers represent must also be stored in the Spectrum object. The extra data is:
xmin the lowest frequency in the spectrum. This will equal 0 Hz for a spectrum that was
calculated from a sound.
xmax the highest frequency in the spectrum. This will be equal to half the sampling frequency
if the spectrum was calculated from a sound.
x1 the first frequency in the spectrum. This will equal 0 Hz for a spectrum that was calculated
from a sound.
This kind of filtering is acausal because the spectrum is processed as a whole, in one batch, without regard to time ordering. There is no time ordering in the spectrum; time ordering will only reappear when the sound is created from the spectral components by Fourier synthesis. By applying this acausal frequency domain filtering we create more possibilities than any causal technique can realise.14 If we
are not bound to any real-time application we prefer this method. We will illustrate filtering
with the spectrum editor.
7.5.1. The spectrum editor
Figure 7.16.: The spectrum editor with the spectrum of a random Gaussian noise sound. The pink
area shows a selected frequency interval between 5295.96 and 10198.28 Hz.
The spectrum editor appears after clicking View & Edit for a selected spectrum object. In
figure 7.16 you see an example of the spectrum of a random Gaussian noise sound.15 The
layout of the spectrum editor is very similar to the layout of the sound editor (see figure 2.7).
However, the horizontal axis here displays frequency instead of time and runs from 0 Hz to the
Nyquist frequency, which, for this spectrum is 22050 Hz. The vertical scale shows only the
amplitude of the spectrum at a logarithmic scale in dBs. In the figure the scale shows a view
range of 70 dB and runs from a maximum, which happens to be at 40.6 dB, to a lower value at
-39.4 dB. Theoretically, the spectrum of a random Gaussian noise is a flat spectrum and the
figure shows that the spectrum displayed is indeed (almost) flat. The figure displayed also
shows that we have made a selection in the spectrum. The pink area represents the selected
part and rectangles show the widths in hertz of the corresponding intervals. The rectangle
labeled with the marker 1 happens to be 5295.96 Hz wide and starts at a frequency of 0 Hz
14. In general the frequency domain technique also results in less dispersion of phases, the filters can have sharper filter edges, and the filter responses are less asymmetric. We will try to explain these terms in the next parts of this chapter.
15. A spectrum like this can be created as follows:
Create Sound from formula : "noise" , "Mono" , 0 , 1 , 44100 , "randomGauss (0 ,1) "
To Spectrum : "no"
7.5.2. Examples of scripts that filter
Filtering in Praat is very easy to do. Script 7.5 shows the skeleton filter setup. By substituting a different formula, different filters can be realized. Note that the formula is applied to the spectrum!
Using the filter terminology we introduced in the previous section:
Low-pass filter: low frequencies pass the filter, high frequencies do not. For example, a low-pass filter that only passes frequencies lower than 3000 Hz can be defined by using the
formula:
Formula : "if x < 3000 then self else 0 fi"
The frequency from which the suppression starts is also called the cut-off frequency.
High-pass filter: high frequencies pass, low frequencies are suppressed. The following filter
formula only passes frequencies above 300 Hz.
Formula : "if x > 300 then self else 0 fi"
Band-pass filter: frequencies within an interval pass, frequencies outside the interval are suppressed. The following filter formula only passes frequencies between 300 and 3000 Hz,
approximately the bandwidth of a plain old telephony signal.
Formula : "if x > 300 and x < 3000 then self else 0 fi"
Band-stop filter: frequencies outside the interval pass, frequencies inside the interval are
suppressed.
Formula : "if x > 300 and x < 3000 then 0 else self fi"
All-pass filter: all frequencies pass but phases are modified. The following filter makes the
sound completely unintelligible (time reversal).
Formula : "if row = 2 then - self else self fi"
The formulas above were very simple and show abrupt changes in the filter near the cut-off
frequencies. Smoother transitions can be created by modifying the formulas above.
In order to filter a sound we can also skip the To Spectrum step and apply the filter formula directly to a selected sound with the Filter (formula)... command. Praat then first calculates the spectrum, then applies your formula to the spectrum and transforms the modified spectrum back to a sound.
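The To Spectrum → Formula → To Sound route can be imitated outside Praat as well. The following pure-Python sketch (a naive DFT with our own function names, not Praat's API) applies the band-pass idea by zeroing spectral bins outside the pass band:

```python
import cmath, math

def dft(s):
    # naive discrete Fourier transform
    N = len(s)
    return [sum(s[m] * cmath.exp(-2j * cmath.pi * k * m / N) for m in range(N))
            for k in range(N)]

def idft(S):
    # inverse transform; we keep only the real part of the synthesized samples
    N = len(S)
    return [sum(S[k] * cmath.exp(2j * cmath.pi * k * m / N) for k in range(N)).real / N
            for m in range(N)]

N = 128                                  # 1 "second" of sound: bin k corresponds to k Hz
s = [math.cos(2 * math.pi * 5 * m / N) + math.cos(2 * math.pi * 40 * m / N)
     for m in range(N)]                  # components at 5 Hz and 40 Hz

S = dft(s)
for k in range(N):
    f = min(k, N - k)                    # frequency that bin k represents
    if not (3 < f < 20):                 # analogue of: if x > 3 and x < 20 then self else 0 fi
        S[k] = 0

y = idft(S)                              # back to a sound: only the 5 Hz tone survives
err = max(abs(y[m] - math.cos(2 * math.pi * 5 * m / N)) for m in range(N))
print(err < 1e-9)                        # True
```

The condition on `f` mirrors the Praat band-pass formula; mirror bins above N/2 must be treated symmetrically, otherwise the reconstructed sound would not be real-valued.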
We can use sounds and their spectra to simulate a technique used in radio transmission or
in telephone transmission. Both transmissions can only function if the frequencies of the
information source, i.e. speech or music, are shifted to higher frequencies. In this section we
investigate what happens if we shift the frequencies of a band-limited signal. We will use a
form of so-called amplitude modulation. The technique we will explore is used, for example,
by telephone companies to transport many conversations in parallel over one telephone cable.
It is also used in radio transmission in the AM band. Although in the real world this technique
is implemented in electronic circuits we can get a feeling for it by working with sounds.
We will demonstrate this shifting by first combining two sounds into one sound. This is the
modulation step. In the demodulation step we show how to get the separate sounds back from
the combined sound. We start from the fact that a sound can be decomposed as a sum of
sines and cosines of harmonically related frequencies. Let the highest frequency in the sound
be FN Hz. Suppose we multiply the sound with a tone of frequency f1 Hz. We only have to
investigate what happens with one component of the sound, say at frequency f2 to be able to
calculate what happens to the whole sound. This is the power of the decomposition method.
We start from equation (A.28) which shows that if we multiply two sines with frequencies f1 and f2 we can write the product as a sum of two terms:

sin(2πf1·t) sin(2πf2·t) = 1/2 cos(2π(f1 − f2)t) − 1/2 cos(2π(f1 + f2)t)    (7.6)
The right-hand side shows two components with frequencies that are the sum and the difference of the frequencies f1 and f2 . Suppose that the frequency f1 is higher than the highest
frequency in the sound. The single frequency f2 in the interval from 0 to FN is now split into
two frequency components one at f2 Hz above f1 and the other at f2 Hz below the frequency
f1 . This argument goes on for all components of the sound. The result is a spectrum that runs from f1 − FN to f1 + FN , i.e. the bandwidth of the spectrum has doubled. However, this spectrum is symmetric about the frequency f1 . This means that half of the spectrum is redundant.
If we high-pass filter the part above f1 then we have a copy of the original spectrum but all
frequencies are shifted up by f1 Hz. The frequency f1 is called the carrier frequency.16 No
information is lost in the shifted spectrum. We have found a technique to shift a spectrum
upwards to any frequency we like by choosing the appropriate carrier frequency f1 followed
by high-pass filtering.
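Equation (7.6) is easy to verify numerically. The sketch below (with example frequencies of our own choosing) checks the identity on a sample grid and prints the two sideband frequencies a single component ends up at:

```python
import math

f1, f2 = 10000.0, 700.0                  # carrier and one sound component, in Hz
fs = 44100.0                             # sampling frequency

for n in range(200):
    t = n / fs
    lhs = math.sin(2 * math.pi * f1 * t) * math.sin(2 * math.pi * f2 * t)
    rhs = 0.5 * math.cos(2 * math.pi * (f1 - f2) * t) \
        - 0.5 * math.cos(2 * math.pi * (f1 + f2) * t)
    assert abs(lhs - rhs) < 1e-9         # equation (7.6) holds at every sample

# the component at f2 is moved to the difference and sum frequencies:
print(f1 - f2, f1 + f2)                  # 9300.0 10700.0
```

The same check can be repeated for any f2 between 0 and FN, which is exactly why the whole spectrum shifts as one block.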
In the demodulation step, applying the same technique, i.e. multiplication with the carrier frequency and filtering, returns the original sounds. You can easily check this by working out the product sin(2πf1·t) cos(2π(f1 + f2)t) with the help of equation (A.30). In figure 7.17 the
process is visualized. The top row shows the modulation of the carrier amplitude. The information source is on the left, the carrier frequency is in the middle panel and the modulated
amplitude is on the right. The amplitude of the information source is displayed with a dotted
16. For the technical purist: this technique is called single sideband suppressed carrier modulation (SSSCM).
[Figure 7.17: top row: the information source (left), the carrier (middle) and the modulated amplitude (right), time in s; bottom row: the corresponding amplitude spectra, frequencies in Hz.]
line in this panel (after multiplication by a factor of 1.1 to separate it from the carrier). The second row shows the amplitude spectra of the signals in the upper row. The symmetry of the spectrum with respect to the (absent) carrier frequency is very clear.
The rectangular block function is a very important function because it is used as a windowing
function. If we want to select any finite sound or part of a sound then this can be modeled as
the multiplication of a sound of infinite duration with the finite block function as is depicted
in figure 7.18. If the duration of the 1-part of the block is T0 , the function that describes the
spectrum of the block varies like sinc(f T0 ).17 In section A.3 a mathematical introduction
17. The spectrum of the block function that equals 1 for x between 0 and T0 and zero elsewhere is given by T0·e^(−iπf·T0)·sin(πf·T0)/(πf·T0). The factor e^(−iπf·T0) is a phase that does not influence the amplitude of a frequency component; the factor T0 is a scale factor that influences all frequency components by the same amount. The derivation of the spectrum is described in section A.16.2.1.
of the sinc function is given. In the formula sinc(f·T0), f is a continuous frequency and the formula describes the continuous spectrum. The spectrum is zero for those values of f where the product f·T0 is a non-zero integer, because the sine part of the sinc function is zero for these values.
If we want to calculate the spectrum from a sampled block function of duration T , where
sample values in the first T0 seconds equal one and zero elsewhere, we can do this by sampling the continuous spectrum at frequencies fk = k/T . The values of the spectrum at the k
frequency points fk are sinc(kT0 /T ). Let us investigate how this spectrum looks for various
values of T0 . In figure 7.19 the amplitude spectra of block functions of varying durations are
shown. The left column shows the sounds of a fixed duration of one second where the samples
that have times below T0 have value one and samples above T0 are zero. The right column
shows the corresponding amplitude spectra limited to a frequency range from 0 to 250 Hz.
The lobe-valley form of the sinc function is clearly visible in all these spectra and the distance between the valleys decreases as the block duration T0 increases. If we look more carefully at these valleys we see that they are not equal. Some are very deep but most are not. Why don't they all go very deep, like for example figure A.11 shows?
The answer lies in the argument of the sinc function, kT0 /T . If the argument is an integer
value, then the sinc function will be zero and only then will the amplitude spectrum show
a deep valley, for all other values of the argument the sinc function will not reach zero and
consequently the valleys will not be as deep. With this knowledge in mind we will now show
the numbers behind these plots. For all sounds we have T = 1 and therefore all frequency points are at multiples of 1 Hz and k values correspond to hertz values. For the plot in the first row we have T0 = 0.01 and the argument of the sinc equals k × 0.01. For k equal to 100, 200, . . . this product is an integer value. The zeros in the spectrum therefore start at 100 Hz and are separated from each other by 100 Hz.
In the second row the block is of 0.11 s duration. The amplitude spectrum now corresponds to sinc(k × 0.11). Integer values of the argument occur for k equal to multiples of 100 and the zeros are again at multiples of 100 Hz.
The third row has T0 = 0.22. Now the zeros are at multiples of 50 Hz because k × 0.22 is an integer value for k equal to any multiple of 50.
For a T0 equal to 0.33, 0.44 or 0.55 as happens in rows four, five and six, the kT0 argument
is integer for multiples of 100, 25 and 20 Hz, respectively.
From the figure it is also clear that the absolute level of the peak at f = 0 increases. This
increase corresponds directly to the increase in duration. For example, the difference between
the last and the first sound is 20·log(0.55/0.01) = 20·log 55 ≈ 34.8 dB. This corresponds nicely with the numbers indicated in the plots: the peak for the T0 = 0.55 block function in the bottom row is at 91.8 dB, the peak for the T0 = 0.01 block function in the top row is at 57.0 dB. The difference between the two is 34.8 dB.
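The numbers above are easy to reproduce. A small pure-Python check (our own sketch; exact fractions are used to find the smallest k that makes k·T0/T an integer, which would loop forever with binary floats):

```python
import math
from fractions import Fraction

T = Fraction(1)                          # sound duration: frequency points at multiples of 1 Hz
durations = [("0.01", 100), ("0.11", 100), ("0.22", 50),
             ("0.33", 100), ("0.44", 25), ("0.55", 20)]

for T0_str, expected in durations:
    T0 = Fraction(T0_str)
    k = 1
    while (k * T0 / T).denominator != 1:  # deep valleys occur where k*T0/T is an integer
        k += 1
    assert k == expected                  # spacing of the deep valleys in Hz

diff = 20 * math.log10(0.55 / 0.01)       # level difference, longest vs shortest block
print(round(diff, 1))                     # 34.8, i.e. 91.8 dB - 57.0 dB
```

The loop finds exactly the valley spacings quoted in the text: 100, 100, 50, 100, 25 and 20 Hz.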
7.6.2. The spectrum of a short tone
We have a tone of a certain frequency, say f1 Hz; it lasts for T0 seconds and then suddenly stops and is followed by silence. We want the spectrum of this signal. In section 7.1.6 we discussed the situation for tones whose duration didn't fit an integer number of periods. Because in Fourier analysis the sine and cosine analysis frequencies are continuous and last forever, the only sensible thing we could do was to analyze as if our tone also lasted forever. The analysis pretends that the analyzed sound is a sequence of repeated versions of itself.
The actual derivation of the continuous spectrum of a short tone is mathematically too
involved to be shown here.18 Instead we show part of the amplitude spectrum of a 1000 Hz
tone that lasts 0.1117 seconds. The following script generates the sound and the spectrum.
f1 = 1000
t0 = 0.1117
Create Sound from formula : "st" , "Mono" , 0 , 1 , 44100 , "0"
Formula : "if x < t0 then sin(2*pi*f1*x) else 0 fi"
To Spectrum : "no"
In figure 7.20 a short part of the sound is shown around the time 0.1117 s where the tone
abruptly changes to silence. In the right plot the spectrum around the 1000 Hz frequency is
shown. The appearance is the already familiar sinc-like spectrum of the previous section.
We are now ready for the full truth: the spectrum of any finite tone is not a single line in the
spectrum, it is more like a sinc function. Sometimes it may appear in the amplitude spectrum
as one line but this is only in exceptional cases where the frequency of the tone, the duration
of the tone and the duration of the analysis window all cooperate.
In general we have a true underlying continuous spectrum which is sampled at discrete frequency points. As we saw in the previous section for the block function, these sample points of the sinc function, for a sampled sound of duration T, are at arguments kT0/T. For
only one line to appear in the spectrum two conditions have to be fulfilled:
1. The frequency of the tone must be equal to one of the frequency points in the spectrum
(at k/T )
2. The zeros of the sinc function are at frequency points of the spectrum.
18. If you know a little bit about complex numbers and integrals, you can have a look at the full derivation in section A.16.2.2.
For a sampled sound with N samples the discrete Fourier transform pair reads

Sk = Σ_{m=0}^{N−1} sm·e^(−2πikm/N)    (analysis)
sk = (1/N) Σ_{m=0}^{N−1} Sm·e^(2πikm/N)    (synthesis)    (7.7)
The coefficients Sk in general are complex numbers and define the amplitudes of the complex
sinusoids.
7.7.1. The Fast Fourier Transform (FFT)
A very fast computation of the discrete Fourier transform (DFT) is possible if the number of data points, N, is a power of two, i.e. N = 2^p for some natural number p. In the FFT the computing time increases as N·log₂ N, whereas the computing time of the algorithm of equation (7.7) goes like N². Whenever the FFT technique is used, the sound is extended with silence until the number of samples equals the next power of two. For example, in the calculation of the spectrum of a 0.1 second sound with sampling frequency 44100 Hz the number of samples involved is 0.1 × 44100 = 4410. This number is not a power of two; the nearest powers of two are 2^12 = 4096 and 2^13 = 8192. Therefore the FFT is calculated from 8192 values: the first 4410 values equal the sound, the next 3782 values are filled with zeros.
The sound is virtually extended with silence until the number of samples reaches the next power of two. Of course, the sound itself will not be changed; all this happens inside the Fourier transform algorithm. The following script calculates the number of samples for the FFT.
nfft = 1
while nfft < numberOfSamples
    nfft = nfft * 2
endwhile
Because of the extension with silence, the number of samples in the sound has increased and
the analysis frequencies will be at a smaller frequency distance.
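The same doubling loop reads almost identically in Python; the sketch below checks the numbers used in the example above:

```python
numberOfSamples = 4410        # a 0.1 s sound sampled at 44100 Hz
fs = 44100.0

# double nfft until it reaches the next power of two at or above numberOfSamples
nfft = 1
while nfft < numberOfSamples:
    nfft *= 2

print(nfft)                                   # 8192 = 2**13
print(nfft - numberOfSamples)                 # 3782 zero samples are appended
# because of the zero padding the analysis frequencies move closer together:
print(fs / nfft < fs / numberOfSamples)       # True
```

With 8192 points the spectrum has frequency points every 44100/8192 ≈ 5.38 Hz instead of every 10 Hz.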
[Figure 7.19 shows, per row, the peak level and the block duration: 57.0 dB for T0 = 0.01 s, 77.8 dB for 0.11 s, 83.8 dB for 0.22 s, 87.4 dB for 0.33 s, 89.9 dB for 0.44 s and 91.8 dB for 0.55 s; the spectra run up to 250 Hz.]
Figure 7.19.: In the left column are sampled sounds of block functions with variable durations. The block's duration T0 is indicated below the sounds. In the right column are the amplitude spectra.
Figure 7.20.: On the left a selection from a 1000 Hz short tone. On the right the amplitude spectrum around 1000 Hz.
8. The Spectrogram
In the spectrum we have a perfect overview of all the frequencies in a sound. However, all information with respect to time has been lost. The spectrum is ideal for sounds that don't change too much during their lifetime, like a vowel. For sounds that change in the course of time, like real speech, the spectrum does not provide us with the information we want. We would like to have an overview of spectral change, i.e. how the frequency content changes as a function of time. The spectrogram represents an acoustical time-frequency representation of a sound: the power spectral density. It is expressed in units of Pa²/Hz.
Because the notion of frequency doesn't make sense at too small a time scale, spectro-temporal
representations always involve some averaging over a time interval. When we assume that the
speech signal is reasonably constant during time intervals of some 10 to 30 ms we may take
spectra from these short slices of the sound and display these slices as a spectrogram. We have then obtained a spectro-temporal representation of the speech sound. The horizontal dimension of a spectrogram represents time. The vertical dimension represents frequency in hertz. The time-frequency strip is divided into cells. The strength of a frequency in a certain cell is indicated by its blackness. Black cells have a strong frequency presence while white cells have a very weak presence.
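A toy version of this slicing can be written down directly. The sketch below is entirely our own (a naive DFT, a 0.1 s frame, a signal that changes frequency halfway); it finds the dominant frequency per slice, which is the skeleton of what a spectrogram displays:

```python
import cmath, math

fs = 1000.0                      # sampling frequency in Hz (toy value)
frame = 100                      # 0.1 s slices

# a "sound" whose frequency content changes over time: 50 Hz first, then 200 Hz
s = [math.sin(2 * math.pi * (50 if n < 500 else 200) * n / fs) for n in range(1000)]

def dominant_frequency(x):
    # naive DFT of one slice; return the frequency of the strongest bin
    N = len(x)
    mags = [abs(sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / N) for m in range(N)))
            for k in range(N // 2 + 1)]
    return max(range(len(mags)), key=mags.__getitem__) * fs / N

track = [dominant_frequency(s[i:i + frame]) for i in range(0, len(s), frame)]
print(track)                     # 50 Hz in the first five slices, 200 Hz in the last five
```

A real spectrogram keeps the whole per-slice spectrum (and applies a window function first) instead of only the strongest frequency, but the time-slicing principle is the same.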
Figure 8.1.: Narrow-band versus broad-band spectrogram.
9. Annotating sounds
Annotation of sounds means adding metadata to a sound. The metadata in Praat can be any text you like. This metadata is either coupled to a time interval or to a point in time and is stored in a TextGrid object. To annotate, you select a sound and a textgrid together and choose the Edit option. A textgrid editor appears on the screen and you can start annotating.
We start with a simple example from scratch. Select the sound that you want to annotate and then choose Annotate > To TextGrid... from the dynamic menu. Form 9.1 pops up and you can choose your tiers.1 Praat distinguishes two types of tiers:
1. Interval tier. An interval tier represents a series of contiguous intervals in time. An interval is characterized by a start time and an end time, where the end time is always later than the start time. You typically use an interval tier when you want to mark phenomena that extend in time, like words or phonemes in a speech sound.
2. Text tier (or point tier). A text tier represents a series of marked points (sorted) in time.2
You will typically use a text tier for labeling points in time where something interesting
happens, like for example the moments of glottal closure.
The textgrid is the container for any number of interval tiers and text tiers. In the top field in form 9.1 you fill out names for all tiers you intend to use. In the bottom field you fill out which of the names in the top field are text tiers. The label of this field explicitly mentions point tier. If you do not intend to use text tiers or just don't know yet, you can leave the field blank. By default all tiers will be interval tiers unless you select one or more point tiers. Praat
1. The meaning of tier that is used here is layer. A tier can be seen as an extra layer to a sound. Therefore, in Praat a tier is a function on a domain.
2. A text tier is a marked point process (see chapter 15).
isn't too picky about how you fill out this form; the only thing that matters is that at least one name in the top field exists. In the textgrid editor you can always add, remove or rename tiers.
Figure 9.2.: The Sound:To TextGrid... form for two interval tiers.
Because the example in the next section will show you how to annotate a sound at the word and at the phoneme level, we will choose a textgrid that consists of two interval tiers named phoneme and word. Of course, you are free to choose the names. Figure 9.2 shows how to fill out the form. Once the textgrid exists, you select it together with the sound, choose Edit and a textgrid editor appears. Figure 9.3 shows a textgrid editor.
Figure 9.3.: The textgrid editor for an existing sound with a newly created textgrid that has two interval tiers.
The top of the editor window shows the identification number of the textgrid (here 15) followed by the text TextGrid de-vrouw-loopt-met-noisy, which happens to be the name given to the textgrid object. The next line shows all the menu choices of the textgrid editor. More
on these menus will follow in the rest of this chapter. Below these menus is a white field
extending over the complete width of the editor. This text window will show the text from
is the default behavior. With the File>Preferences... form you can modify this default.
Dutch vowel Y is missing in the sentence. However, the first @ in the first word d@ can be taken as a substitute.
Before we can use this additional information we have to make some adjustments to the textgrid editor. Choose Spectrum>Spectrogram settings... and form 9.4
pops up. The settings in the figure are the default settings for a broad band spectrogram. The
window length of 0.005 s guarantees enough time resolution to be helpful during annotation.
To be able to better visualize the first and second formant in the spectrogram we modify the
upper view range to 3000 Hz and then click OK. The spectrogram will become visible in a
newly created pane below the sound once you have selected Spectrum>Show Spectrogram.
With the spectrogram now visible in the editor we can start with marking the words in the
sound. We zoom in to the first part of the sound as is shown in figure 9.5.
computer-independent way by three-character representations that always start with a backslash. For example the representations for V, O and E are \vs, \ct and \ep, respectively. For more information about possible representations see TextGridEditor>Help>About special symbols and Phonetic symbols.
The significant parts of the vowel editor have been numbered as follows:
1. The title bar. It only shows the text VowelEditor and no object ID because there is no
corresponding VowelEditor object in the list of objects.
2. The menu bar shows the available menu commands. Some of the available menu commands will be described later on.
3. The formant plane. This plane is spanned by two axes that represent the first and the
second formant frequencies. The first formant axis runs vertically and it starts at 200 Hz
and goes down to 1200 Hz. The axis has a logarithmic frequency scale. The horizontal
dotted lines are at frequency steps of 200 Hz. The second formant frequency axis which
runs horizontally from right to left also has a logarithmic scale. It starts at 500 Hz and
pointer to the (F1, F2) = (656.3, 1030.1) position, then clicking the left mouse button and moving the mouse pointer to the (366.0, 856.8) position and releasing the mouse button. The small bars
that cross the trajectory are located 50 ms apart and give you both an impression of the speed
of the mouse movement and of the total duration of the trajectory. We count six cross bars in this trajectory, which corresponds to a duration between 0.3 and 0.35 s. The duration field shows a duration of 0.303481 s. You can turn the bars off by using a large number for the distance value in the Show trajectory time marks every... command in the View menu.
Instead of the notation with subscripts we often find alternative notations like y[n] = 0.5·x[n−1] + 0.5·x[n].
Figure 11.1.: The effect of applying the digital filter yn = 0.5·xn + 0.5·xn−1 on a noise sound with a sampling period of 1/44100 s. Top panes: the first 2 ms of the noise sound (left) and the filtered noise (right). At the bottom the spectra of the 1 second duration sounds.
The randomGauss function to create noise was explained in the previous section. Next we
need another sound where the output of the filter can be stored. We name it y and create it
with zero values inside as follows:
Create Sound from formula : "y" , 1 , 0 , 1 , 44100 , "0"
We perform the digital filter operation by applying the following formula on the selected sound
object y:
Formula : "0.5*Sound_x [ col ]+ 0.5*Sound_x [ col -1] "
In section 4.7.1.2 we explained what goes on in a formula. The formula above essentially
says: the value at the current sample number col in the selected sound y is obtained by adding
together the contents at the corresponding sample number from the sound named x (after
multiplication by 0.5) with the contents at the previous sample number from the sound x
(after multiplication by 0.5). We use a special trick here in the formula because we use in the
calculation sample values from an object that is not the selected object, i.e. from the sound
named x (remember the sound y is selected and a formula applies to the selected sound). To
refer to an object of type Sound named x we use the special <Object type>_<object name>
syntax. In the formula the item Sound_x[col] refers to the contents of the sound object named
x at sample number col. The formula is applied to every sample of sound y.
After having performed the three actions described above we have two sounds x and y. In figure 11.1 the results of the filtering are shown. In the top left figure the first 2 ms of the noise sound x are shown, while the top right figure shows the first 2 ms of the filtered noise sound y.
Although both sounds are spiky the one on the right does not fluctuate as wildly as the one
on the left and seems somewhat smoother. The effect of the filter is a smoother appearance
of the resulting filtered sound y. The smoother appearance was the result of averaging pairs
of consecutive sample values. This type of filter, where the output is a linear combination of
input values, is called a moving average filter.
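The same moving average filter is a one-liner outside Praat as well. In this pure-Python sketch (our own construction; the sample before the start of the sound is treated as 0, mirroring the out-of-range behavior assumed for the formula above), a simple "wildness" measure confirms the smoothing effect:

```python
import random

random.seed(1)
x = [random.gauss(0, 1) for _ in range(1000)]        # the noise sound x

# y_n = 0.5*x_n + 0.5*x_{n-1}; the sample before the start counts as 0 here
y = [0.5 * x[n] + 0.5 * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def mean_squared_jump(s):
    # average squared difference between neighbouring samples
    return sum((s[n] - s[n - 1]) ** 2 for n in range(1, len(s))) / (len(s) - 1)

print(mean_squared_jump(y) < mean_squared_jump(x))   # True: y fluctuates less than x
```

For white noise the expected squared jump of y is a quarter of that of x, which is the numerical face of the "smoother appearance" seen in figure 11.1.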
In figure 11.2 we show the spectrum of the impulse response with a solid line. Although this
Figure 11.2.: Spectrum of the impulse response (0.5, 0.5) of the digital filter yn = 0.5·xn−1 + 0.5·xn for sampling periods of 1/44100 s, 1/22050 s and 1/11025 s.
spectrum and the spectrum in the bottom right panel of figure 11.1 at first sight do look very
different, we note that both spectra show the same amplitude fall-off as a function of frequency.
This is no coincidence. In fact the spectrum of the filtered sound could also be obtained
by multiplying the spectrum of the input sound with the spectrum of the impulse response.
The two ways that we have described to determine the output of the filter are equivalent and
it can be mathematically proven that the convolution of two functions is equivalent to the
multiplication of their spectra. Or more generally that convolution in one domain is equivalent
to multiplication in the other domain.
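This equivalence can be checked directly with a naive DFT. In the sketch below (our own helper functions; circular convolution is used so that the theorem holds exactly for finite sequences) the spectrum of the convolution equals the product of the spectra:

```python
import cmath

def dft(s):
    # naive discrete Fourier transform
    N = len(s)
    return [sum(s[m] * cmath.exp(-2j * cmath.pi * k * m / N) for m in range(N))
            for k in range(N)]

N = 16
s = [1.0, 2.0, -1.0, 0.5] + [0.0] * (N - 4)     # a short input sound, zero padded
h = [0.5, 0.5] + [0.0] * (N - 2)                # impulse response of y_n = 0.5 x_n + 0.5 x_{n-1}

# time domain: (circular) convolution of s and h
y = [sum(s[m] * h[(n - m) % N] for m in range(N)) for n in range(N)]

# frequency domain: multiply the spectra and compare
S, H, Y = dft(s), dft(h), dft(y)
err = max(abs(Y[k] - S[k] * H[k]) for k in range(N))
print(err < 1e-9)                                # True: DFT(s*h) = S.H bin by bin
```

Zero padding keeps the circular convolution identical to the ordinary one here, since the nonzero parts of s and h together fit inside N samples.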
As a conclusion of this section we summarize the very important result that we have found: the filtering process can be described in two equivalent ways. In the time domain
filtering can be described as the convolution of the impulse response of the filter with the input
sound. In the frequency domain filtering can be described as the multiplication of the spectrum
of the impulse response by the spectrum of the input sound. The spectrum of the impulse
response of the filter is also called the spectrum of the filter.2 In the description of the filtering
process the sampling period never entered explicitly in the formulation of filtering. Results
were always obtained by the mathematical operation of multiplying numbers. The sampling
2. In mathematical notation one writes that if s and S are the input sound and the spectrum of the input sound, respectively, and if h and H are the impulse response of the filter and its spectrum, respectively, then s∗h ↔ S·H. This expresses that the spectrum of the convolution of s and h can be obtained by multiplying the spectra of s and h. Or: if s ↔ S and h ↔ H then s∗h ↔ S·H. The ↔ means something like "form a transform pair", which translates here to "form a Fourier transform pair" since a spectrum is the Fourier transform of the sound.
n    xn    yn−1          xn − 0.5·yn−1    yn
1    1     0.0           x1 − 0.5·y0      1.0
2    0     1.0           x2 − 0.5·y1      −0.5
3    0     −0.5          x3 − 0.5·y2      0.25
4    0     0.25          x4 − 0.5·y3      −0.125
5    0     −0.125        x5 − 0.5·y4      0.0625
6    0     0.0625        x6 − 0.5·y5      −0.03125
...
k    0     (−0.5)^(k−2)  xk − 0.5·yk−1    (−0.5)^(k−1)
The input of the filter is the impulse in the column labeled xn : it is a 1 followed by zeros, i.e. x1 = 1, x2 = 0, x3 = 0, . . . For n is 1 the output is equal to x1 since we assume that the previous output y0 equals zero. For n is 2 the input x2 equals zero and the previous output equals y1 . The output y2 will therefore be equal to −0.5·y1 , which is −0.5. For n is 3 the x3 equals 0, the previous output y2 equals −0.5 and therefore the output y3 will equal −0.5 × −0.5, which equals 0.25. For n equals 4 the result will be negative again, for n equals 5 positive, and so on. Although the absolute value of the output becomes smaller and smaller for increasing
n, it will never be zero. This is why this filter is called an infinite impulse response filter or IIR
filter. In fact all recursive filters are IIR filters. Because of this recursion, the calculation of the filter's impulse response becomes tedious and so we will let the computer do the calculations from now on.
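Letting the computer do the recursion, the impulse response of yn = xn − 0.5·yn−1 can be generated like this (plain Python instead of a Praat formula):

```python
x = [1.0] + [0.0] * 19                 # an impulse as input
y = []
for n in range(len(x)):
    y_prev = y[n - 1] if n > 0 else 0.0
    y.append(x[n] - 0.5 * y_prev)      # y_n = x_n - 0.5 * y_{n-1}

print([round(v, 5) for v in y[:6]])    # [1.0, -0.5, 0.25, -0.125, 0.0625, -0.03125]
print(all(v != 0.0 for v in y))        # True: the response never becomes exactly zero
```

The values reproduce the yn column of the table, and however far the recursion is continued the output keeps alternating and halving without ever reaching zero: the infinite impulse response.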
The filters that have only one recursive term are not so interesting from our point of view
and we move on to the filters that have two recursive terms.
yn = a·xn + b·yn−1 + c·yn−2 ,    (11.1)
where a, b and c are real numbers.3 However, not all combinations of a, b, and c are equally
interesting. It is easy to see that the coefficients b and c must be of different sign because
otherwise each output would be larger than the previous output and hence the outputs could
grow with every time step and eventually become infinite.4 Only when b² + 4c < 0 do interesting phenomena occur, which implies that c is always a negative number. A filter where b and c obey this constraint is called a formant filter or resonator. One can show that a resonance at
frequency F with bandwidth B occurs if the coefficients b and c satisfy
b = 2r·cos(2πF·T)    and    c = −r² ,    (11.2)
where r = e^(−πBT) and T is the sampling period; the resonance characteristics only depend on b and c and do not depend on the coefficient a. The a only functions as an overall scale factor and can thus be chosen to satisfy some constraint. If we choose a = 1 − b − c then the
filter response is normalized such that at zero frequency the filter response H(0) = 1. Given equations (11.2), where coefficients b and c are calculated from formant frequency F and bandwidth B, the reverse calculation is also possible, i.e. given coefficients b and c, calculate F and B:

B = −ln(−c) / (2πT)    and    F = (1/(2πT)) · arccos( b / (2√(−c)) ).    (11.3)

These formulas already make clear that the coefficient c has to be negative, otherwise ln(−c) would be undefined.
Without input the filter of course does not generate any output. To investigate how the filter transforms its input, we have to calculate its impulse response. The following script
does so. The first two lines define the formant frequency, 500 Hz, and the bandwidth of the
filter, 50 Hz. In line three the sampling period is defined which is the inverse of the sampling
frequency. Given these values the filter coefficients can now be calculated as happens in lines
four to seven. The coefficients a, b and c in the script, to 3 digits of precision, yield 0.00507,
1.988 and −0.993, respectively, from which we can easily calculate that b² + 4c is negative.
The penultimate line creates the output sound. The recursive filter of equation (11.1) is then
implemented in the last line of the script. The first part of the formula, i.e. the part between
parentheses, implements the impulse input function. In this part col=1 refers to the very first
sample in the sound. Clearly, only for the first sample will the complete expression
between parentheses result in the value 1; for all other sample numbers it will result in 0.
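The quoted coefficient values can be reproduced outside Praat as well. The following Python sketch assumes a sampling frequency of 44100 Hz, which is the value that reproduces the numbers in the text, and then runs the recursion of equation (11.1) on a single pulse.

```python
import math

F, Bw = 500.0, 50.0               # formant frequency and bandwidth
fs = 44100.0                      # assumed sampling frequency of the script
T = 1.0 / fs
r = math.exp(-math.pi * Bw * T)
c = -r * r
b = 2.0 * r * math.cos(2.0 * math.pi * F * T)
a = 1.0 - b - c                   # normalizes the response at 0 Hz to 1

print(round(a, 5), round(b, 3), round(c, 3))   # 0.00507 1.988 -0.993

# impulse response of y[n] = a x[n] + b y[n-1] + c y[n-2]
n = int(0.015 * fs)
y = [0.0] * n
for i in range(n):
    y[i] = a * (1.0 if i == 0 else 0.0)
    if i >= 1:
        y[i] += b * y[i - 1]
    if i >= 2:
        y[i] += c * y[i - 2]
```

The list y holds 0.015 s of the impulse response; its sign alternates, as the damped sinusoid of the next section predicts.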
3 One often sees the alternative notation where the recursive coefficients have a minus sign: yₙ = a·xₙ − p·yₙ₋₁ − q·yₙ₋₂.
4 It is easy to see that for a stable filter the numbers a, b, and c cannot be chosen as all positive numbers. For
example, if we set a = b = c = 1, and we use a pulse as input, i.e. only x₁ = 1 and all the other inputs are zero,
then the output yₙ would become: y₁ = 1, y₂ = 1, y₃ = 2, y₄ = 3, y₅ = 5, y₆ = 8, . . . , i.e. the output would grow
without bounds. Actually this particular growth sequence 1, 1, 2, 3, 5, 8, . . . , where each new term is the sum of
the two previous ones, is called a Fibonacci sequence.
The recursive part of the filter is implemented by the last two expressions and there is explicit
reference to two previous sample numbers.5 In figure 11.3 we show 0.015 s of the impulse
Figure 11.3.: The impulse response from the digital formant filter (11.1) with formant frequency
500 Hz and bandwidths 50 Hz (black), 100 Hz (red) and 200 Hz (blue).
response of a formant filter with constant frequency of 500 Hz for three different bandwidths,
50 Hz in black, 100 Hz in red and 200 Hz in blue. Again, like the example in the previous
section, we see that despite the fact that the input was limited in duration, i.e. there was only
one sample in the input that differed from zero, the output of the filter is not limited in duration.
The output of the formant filters, as displayed in the figure, looks like a sine function whose
amplitude gradually decreases. We count approximately five periods in the first 0.01 s and
conclude that this sine component has a frequency of approximately 500 Hz. In fact the output
of this filter is a damped sinusoid that can be described by the following function
s(t) = A·e^(−πBt)·sin(2πFt + φ),   (11.4)
where F is the formant frequency, B is the bandwidth, A is the amplitude and φ is the phase
at t = 0. The function shows that the amplitude of the sine wave with frequency F decreases
exponentially as e^(−πBt).
Figure 11.4.: The effect of formant frequency and bandwidth on the filter amplitude of equation
(11.5). On the left: keeping bandwidth constant at 50 Hz and doubling the formant
frequency, starting at 250 Hz. On the right: keeping the formant frequency constant
at 1000 Hz and doubling the bandwidth, starting at 25 Hz.
It can be shown that the filter response of a digital filter with formant frequency F and
bandwidth B is:
|H(f)| = |a| / √(2r²·cos(4πfT) − (4r·cos θ + 4r³·cos θ)·cos(2πfT) + r⁴ + 4r²·cos²θ + 1),   (11.5)
where θ = 2πFT. In terms of the coefficients a, b, and c, the filter response would read:
|H(f)| = |a| / √(1 + b² + c² + 2b·(c − 1)·cos(2πfT) − 2c·cos(4πfT)).   (11.6)
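Since equations (11.5) and (11.6) describe the same filter, they must agree numerically for every frequency. This can be checked with a short Python sketch; the 10000 Hz sampling frequency here is an assumption for illustration only.

```python
import math

fs = 10000.0                      # an assumed sampling frequency
T = 1.0 / fs
F, Bw = 500.0, 50.0
r = math.exp(-math.pi * Bw * T)
theta = 2.0 * math.pi * F * T
b = 2.0 * r * math.cos(theta)
c = -r * r
a = 1.0 - b - c                   # normalizes H(0) to 1

def H_11_5(f):
    # equation (11.5), written in r and theta
    w = 2.0 * math.pi * f * T
    den = (2.0 * r * r * math.cos(2.0 * w)
           - (4.0 * r * math.cos(theta) + 4.0 * r**3 * math.cos(theta)) * math.cos(w)
           + r**4 + 4.0 * r * r * math.cos(theta)**2 + 1.0)
    return abs(a) / math.sqrt(den)

def H_11_6(f):
    # equation (11.6), written in the coefficients a, b and c
    w = 2.0 * math.pi * f * T
    den = 1.0 + b*b + c*c + 2.0*b*(c - 1.0)*math.cos(w) - 2.0*c*math.cos(2.0*w)
    return abs(a) / math.sqrt(den)
```

Evaluating both functions at the same frequencies gives identical values up to rounding error, and H(0) = 1 confirms the normalization by a = 1 − b − c.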
In figure 11.4 we show the filter responses of equation (11.5) for varying formant frequencies
and bandwidths. In the left display we show the effect of doubling the formant frequency,
starting at 250 Hz, while the bandwidth is kept constant at 50 Hz.
6 Of course equal in the sense of equal modulo 2π. At times t₁ and t₂ we choose the phases to be equal:
2πFt₁ + φ = 2πFt₂ + φ + k·2π. This can be simplified to t₁ = t₂ + k/F. This boils down to choosing times that
lie an integer number of periods apart.
yₙ = a′·xₙ + b′·xₙ₋₁ + c′·xₙ₋₂,   (11.7)
In contrast to the digital formant filter (11.1) this filter is non-recursive and therefore always
stable for all values of a′, b′, and c′. Because the filter has no output recursion terms it
cannot feed itself once started, and therefore the impulse response is finite and only three samples
long. The reason we did not mention this filter at the start of this section where we treated
non-recursive filters is the special relation it bears to formant filters.
For special combinations of the coefficients this filter removes the frequencies in an interval.
However, if we want an antiformant with frequency F and bandwidth B, we can easily get the
values for the coefficients in the following way. We first calculate coefficients a, b, and c as if
it were a formant filter:
c
r2
b =
2r cos 2F T
a =
1bc
where r = e^(−πBT) and T is the sampling time. The second, and last, step is a simple transformation of these coefficients:
a′ = 1/a
b′ = −b/a
c′ = −c/a.   (11.8)
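With these coefficients the antiformant response of equation (11.7) is exactly the inverse of the formant response, which is easy to verify numerically. A Python sketch, with a 10000 Hz sampling frequency assumed for illustration:

```python
import math

fs = 10000.0                      # an assumed sampling frequency
T = 1.0 / fs
F, Bw = 1000.0, 50.0
r = math.exp(-math.pi * Bw * T)
c = -r * r
b = 2.0 * r * math.cos(2.0 * math.pi * F * T)
a = 1.0 - b - c
# the transformation of equation (11.8)
ap, bp, cp = 1.0 / a, -b / a, -c / a

def H_formant(f):
    # |H(f)| of the recursive formant filter, equation (11.6)
    w = 2.0 * math.pi * f * T
    den = 1.0 + b*b + c*c + 2.0*b*(c - 1.0)*math.cos(w) - 2.0*c*math.cos(2.0*w)
    return abs(a) / math.sqrt(den)

def H_antiformant(f):
    # |H(f)| of the non-recursive filter y[n] = a'x[n] + b'x[n-1] + c'x[n-2]
    w = 2.0 * math.pi * f * T
    re = ap + bp * math.cos(w) + cp * math.cos(2.0 * w)
    im = -(bp * math.sin(w) + cp * math.sin(2.0 * w))
    return math.hypot(re, im)
```

At every frequency the product of the two responses equals one: the antiformant filter removes exactly what the formant filter adds.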
In figure 11.5 we show the antiformant filters with the same frequencies and bandwidths as
the formant filters of figure 11.4. It is clear that the antiformant filters look very much like the
inverse of the formant filters. However, some of the side effects of these filters are undesirable.
Figure 11.5.: The effect of antiformant frequency and bandwidth on the filter amplitude of equation
(11.8). On the left: keeping bandwidth constant at 50 Hz and doubling the antiformant frequency, starting at 250 Hz. On the right: keeping the antiformant frequency
constant at 1000 Hz and doubling the bandwidth, starting at 25 Hz.
The left display of figure 11.5 shows the frequency response of an antiformant filter for varying
antiformant frequencies. We note a considerable boost of the high frequencies: the lower the
antiformant the more the high frequencies are boosted; the antiformant filter almost acts like
a high-pass filter. This is undesirable: we want an antiformant filter to remove frequencies
within a certain frequency range and not to boost frequencies outside that region.
Figure 11.6.: The effect of combining a formant filter with an antiformant filter. Top left: constant resonator at f = 0 Hz with antiresonators at several frequencies. Bottom left:
formant at somewhat lower frequency than antiformant. Bottom right: formant at
somewhat higher frequency than antiformant. Top right: antiformant and formant
with same frequency but different bandwidths.
is a good thing. However, below this frequency a sharp low-pass filter effect results, which
is very undesirable. For the lowest frequency, 250 Hz, the flat part is already 40 dB below the
response at zero frequency. For the 4000 Hz antiformant this level is almost 100 dB! This is
not very useful for synthesis.
In the two displays at the bottom of figure 11.6 we show what happens if we combine a formant and an antiformant whose frequencies differ only by a small amount, both with a small 25 Hz
bandwidth. In the bottom left display the antiformant frequency is 10% higher than the formant frequency and in the bottom right display it is 10% lower. The effect is the combination
of a peak and a valley, or a valley and a peak, respectively, on a rather flat spectrum. This is
useful if we want to add a peak and a valley without this having any influence on the rest of the
spectrum.
In the display on the top right we combined a formant and an antiformant of the same frequency
but with different bandwidths. The frequency is constant at 1000 Hz. The bandwidth of
the antiformant filter was fixed at 25 Hz and the bandwidth of the formant filter started at
50 Hz and was doubled in each step to 100 Hz, 200 Hz, 400 Hz and 800 Hz. A perfect valley
results without any other effect on the spectral amplitude. The best result is obtained when the
bandwidth is approximately 500 Hz. If the bandwidth is larger, the valley will not become any
deeper but instead the flatness of the spectrum disappears and especially the higher frequencies
[Block diagram: Voicing, Tilt and Aspiration feed a cascade of TF, TAF, NF, NAF and oral formant filters (frequencies F1 . . . F6 with bandwidths B1 . . . B6); the Frication noise source feeds parallel filters A2 F2 B2 . . . A6 F6 B6 and a Bypass.]
Figure 12.1.: A KlattGrid with vocal tract part as filters in cascade and frication part as filters in
parallel.
The KlattGrid synthesizer consists of four parts. Each part is controlled by a number of
parameters.
1. The phonation part. This part generates the source signal that drives the vocal tract filter.
In the figure this part is situated at the top left and shows a part responsible for voicing
and a part responsible for aspiration.
A KlattGrid object will be created from values supplied by the form that appears if you click
the Create KlattGrid... option in the Acoustic synthesis (Klatt) cascading menu from the
New menu. There are quite a lot of options that can be chosen in this form. The most obvious
ones are: the name you want to assign to the KlattGrid and the start and end time. The defaults
for the number of formants of a specific type are in conformance with the maximum numbers
that were possible in the original Klatt editor as described in the Klatt and Klatt [1990] article.
However, in a KlattGrid these numbers are not limited in any way.1 The choices you make
1 See section 12.6 for more differences between KlattGrid and the Klatt synthesizer.
here are neither definitive nor mandatory; you just specify what you think you might need. For
example, if you accept these defaults and click OK and you do not set any tracheal formant
frequency value and bandwidth, then the tracheal formant frequency filters will not be used
during synthesis. You might also increase the number of formants at a later time if you need
to.
Once the KlattGrid is created, the dynamic menu changes, as is depicted in the left display
of Fig. 12.3. Notice that some menus have been split to avoid menus with too many choices popping up.
The Edit menu has been split into two parts like the query part. The first split is always for
the phonation part. Because of the many options to modify the KlattGrid this menu has been
split into four parts, one for each of the four parts of the synthesizer. The KlattGrid as defined
Figure 12.3.: On the left the KlattGrid's dynamic menu, on the right the cascading menu options available from the Modify phonation - button.
now is empty: none of the tiers has any points defined in it. If you would try to play the empty
KlattGrid, you would receive a warning that Praat cannot execute this command. The right
display of figure 12.3 shows the available options from the cascading Modify phonation -
menu button. The options in this menu are divided into two parts. On top are the options
to add points to one of the phonation tiers like the pitch tier and the voicing amplitude tier.
The bottom part shows options to remove points from the phonation tiers. Before we explain
the KlattGrid in detail we will first make a small excursion on how to synthesize some basic
sounds.
Now that we know how to create a KlattGrid, we first show you with a simple script what you can
do with it and how to synthesize a basic sound.
Script 12.1 The script to create the /a/ and the /au/ sounds.
1 Create KlattGrid : "kg" , 0 , 0.3 , 6 , 1 , 1 , 6 , 0 , 0 , 0
2 Add pitch point : 0.1 , 120
3 Add voicing amplitude point : 0.1 , 90
4 Play
5 Add oral formant frequency point : 1 , 0.1 , 800
6 Add oral formant bandwidth point : 1 , 0.1 , 50
7 Add oral formant frequency point : 2 , 0.1 , 1200
8 Add oral formant bandwidth point : 2 , 0.1 , 50
9 Play
10 Add oral formant frequency point : 1 , 0.3 , 300
11 Add oral formant frequency point : 2 , 0.3 , 600
12 Play
In the first line of script 12.1 we define an empty KlattGrid with a duration of 0.3 s. The
second line starts filling the KlattGrid with information, it defines a pitch point of 120 Hz at
time 0.1 s. The next line adds a voicing amplitude point of 90 dB at the same time. With the
pitch and voicing amplitude defined, there is enough information in the KlattGrid to produce
a sound and we can now Play the KlattGrid in line 4. During 300 ms you will hear some
unidentifiable sound. It is the sound as produced by the glottal source alone. This sound
normally would be filtered by a vocal tract filter. But we have not defined the vocal tract filter
yet and in this case the vocal tract part will just behave like an all-pass filter for the phonation
sound.
In lines 5 and 6 we add a first oral formant with a frequency of 800 Hz at time 0.1 s, and a
bandwidth of 50 Hz also at time 0.1 s. The next two lines add a second oral formant at 1200 Hz
with a bandwidth of 50 Hz.
If you now play the KlattGrid, at line 9, it will sound like the vowel /a/, with a constant
pitch of 120 Hz.2
Lines 10 and 11 add some dynamics to this sound; the first and second oral formant frequencies are set to the values 300 and 600 Hz of the vowel /u/; the bandwidths do not change
and stay constant as they were defined in lines 6 and 8. In the interval between times 0.1 and
0.3 s, the oral formant frequencies will be interpolated. The result will now sound approximately like an /au/ diphthong.3 The script shows that with only a few commands we already
can create interesting sounds. In the next sections we will discuss all the tiers that we can use
to modify the characteristics of a sound. We will start with the phonation part.
2 Because
of the constant extrapolation in a tier, we need to specify only one value in a tier if we want to keep the
parameter constant over the domain of the tier.
3 As you see in section 12.2, the specification in script 12.1 is not sufficient because the tiers that specify the glottal
flow function have not been defined. However, in the absence of any points in these tiers a default open phase of 0.7
is used and the power1 and power2 values default to 3 and 4, respectively.
Pitch tier For voiced sounds the pitch tier models the global fundamental frequency as a
function of time. Pitch equals the number of glottal open/closing cycles within a time interval.
In the absence of flutter and double pulsing the pitch tier is the only determiner for the instants
of glottal closure. The Extract>Extract PointProcess (glottal closures) command will create
a PointProcess with the glottal closure times in it. A PointProcess is a sequence of times on a
domain (see section 15).
Voicing amplitude tier The voicing amplitude regulates the maximum amplitude of the
glottal flow in dB SPL. The reference amplitude at 0 dB is 2·10⁻⁵ Pa. This means that a flow
with amplitude 1 corresponds to 20·log(1/(2·10⁻⁵)) ≈ 94 dB SPL. To produce a voiced sound
the voicing amplitude tier may not be empty.
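The 94 dB figure follows directly from the reference amplitude, as a one-line Python computation shows:

```python
import math

p_ref = 2e-5                      # reference amplitude for 0 dB SPL
spl = 20.0 * math.log10(1.0 / p_ref)
print(round(spl))                 # 94
```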
Flutter tier The flutter models a kind of random variation of the pitch and is input as
a number between zero and one. This random variation can be introduced to avoid the mechanical monotonous sound whenever the pitch remains constant during a longer time interval. The fundamental frequency is modified by a flutter component according to the following semi-periodic function that we adapted from the Klatt and Klatt [1990] article: F0′(t) =
0.01·flutter·F0·(sin(2π·12.7t) + sin(2π·7.1t) + sin(2π·4.7t)). In Fig. 12.4 we display the relative
variation of the fundamental frequency r(t) = 0.01·(sin(2π·12.7t) + sin(2π·7.1t) + sin(2π·4.7t)).
The graph clearly shows the semi-periodic character of the pitch variation. It also shows
that the maximum pitch change is around three percent if the flutter variable is at its maximum.
For a constant pitch of say 100 Hz and flutter = 1, the synthesized pitch may vary between
approximately 97 and 103 Hz. Script 12.2 synthesizes a glottal source sound with a duration
of 0.5 s, a fundamental frequency of 100 Hz and maximum flutter.
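The quoted three-percent range can be checked by sampling the flutter component on a 1 ms grid over one second. This is a Python sketch, not a Praat script:

```python
import math

flutter, f0 = 1.0, 100.0          # maximum flutter, constant 100 Hz pitch

def delta_f0(t):
    # the flutter component added to the fundamental frequency
    return 0.01 * flutter * f0 * (math.sin(2.0 * math.pi * 12.7 * t)
                                  + math.sin(2.0 * math.pi * 7.1 * t)
                                  + math.sin(2.0 * math.pi * 4.7 * t))

pitches = [f0 + delta_f0(i / 1000.0) for i in range(1000)]
print(round(min(pitches), 2), round(max(pitches), 2))
```

Since each of the three sines is bounded by one, the modified pitch can never leave the interval from 97 to 103 Hz.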
[Figure 12.4: the relative variation r(t) of the fundamental frequency over one second; vertical scale from −0.03 to 0.03.]
Script 12.2 Script to hear the effect of adding flutter to a monotone sound.
Create KlattGrid : "kg" , 0 , 0.5 , 6 , 1 , 1 , 6 , 0 , 0 , 0
Add pitch point : 0.25 , 100
Add voicing amplitude point : 0.25 , 90
Add flutter point : 0.25 , 1
Open phase tier The open phase tier models the open phase of the glottis with a number
between zero and one. The open phase is the fraction of one glottal period that the glottis
is open. The open phase tier is an optional tier, i.e. if no points are defined then a sensible
default for the open phase is taken (0.7). In figure 12.5 in the top panel we show a short series
Figure 12.5.: On top a series of glottal flow pulses, at the bottom the derivative of the flow. The
moments of glottal closures have been indicated with a vertical dotted line. The
open phase was held at the constant value of 0.7, the pitch was fixed at 100 Hz and
the amplitude of voicing was fixed at 90 dB.
of glottal pulses with an open phase of 0.7. The moments of glottal closure are indicated with
dotted lines. A glottal period is the time between two successive glottal closure markers. In
the figure each glottal period starts with an interval where the flow amplitude is zero; this part
is called the closed phase. After the closed phase we enter the open phase, where the flow
slowly starts to increase and then, after reaching a maximum, decreases rapidly. The exact moment
Power1 and power2 tiers The power1 and power2 tiers model the form of the glottal flow
function during one open phase of the glottis as
flow(t) = t^power1 − t^power2   (12.1)
for 0 ≤ t ≤ 1. Here t = 0 is the time the glottis opens and at t = 1 the glottis is closed.
The flow is the result of a rising polynomial, t^power1, counteracted by a falling polynomial,
t^power2. To have a positive flow, i.e. air going out of the lungs, power2 has to be larger than
power1. For synthesis, these tiers are optional and if absent, default values power1 = 3 and
power2 = 4 will be used. Figure 12.6 on the left shows the flow function from equation (12.1)
Figure 12.6.: The effect of the exponents in equation (12.1) on the form of the flow function.
Left: On top, nine glottal pulses synthesized with different (power1, power2) combinations. Power1 increases linearly from 1, and always power2 = power1 + 1.
Consequently, the first pulse on the left has power1 = 1 and power2 = 2, while the
last one on the right has power1 = 9 and power2 = 10. The bottom panel on the left
shows the derivatives of these flow functions. Moments of glottal closure have been
indicated with a vertical dotted line. The open phase was held at the constant value
of 0.7, the pitch was fixed at 100 Hz and the amplitude of voicing was fixed at 90 dB.
Right: the flow function for constant power1 = 3 but varying power2. Power2 values
are: 4: thick solid line; 6: thin solid line; 8: dashed line; 10: dotted line.
for different combinations of the exponents. Power1 increases linearly starting with one, and
always has power2 = power1 + 1. Consequently, the first pulse on the left has power1 = 1 and
power2 = 2, the last one on the right has power1 = 9 and power2 = 10. One easily notices the
t₀ = (power1/power2)^(1/(power2−power1))   and   f(t₀) = (power1/power2)^(power1/(power2−power1)) · (1 − power1/power2).
For the default values power1 = 3 and power2 = 4, we get t₀ = 0.75 and f(t₀) ≈ 0.105.
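A quick numerical check of these values, and of their consistency with the flow function itself:

```python
p1, p2 = 3.0, 4.0                 # the default power1 and power2
t0 = (p1 / p2) ** (1.0 / (p2 - p1))
f_t0 = (p1 / p2) ** (p1 / (p2 - p1)) * (1.0 - p1 / p2)
print(round(t0, 2), round(f_t0, 3))   # 0.75 0.105
```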
Collision phase tier The collision phase parameter models the last part of the flow function
with an exponential decay function instead of a polynomial one. A value of 0.04, for example,
means that the amplitude will decay by a factor of e (≈ 2.7183) every 4 percent of a period.4
We define the moment in time where polynomial decay changes to exponential decay, i.e. the
junction point, as the instant of glottal closure. This instant has to be determined from
the glottal flow function by two continuity constraints. The first constraint implies that at
the junction the polynomial flow function and the exponential flow function must have equal
amplitudes, i.e. there is no sudden jump at that point. The second constraint implies that at the
junction the derivatives of both functions must also be equal. In figure 12.7 we show the effect
of a collision phase of 0.04 on the flow function in the upper part and on the derivative of
the flow function in the lower part of the figure. The glottal closure points, where polynomial
decay changes to exponential decay, are shown by dotted lines. It is clear from this figure that at
the glottal closure points the two constraints defined above are met. In figure 12.5, where the
flow function had no collision phase, the derivative instantly jumps from the minimum value
to zero at the glottal closure point. In figure 12.7 the derivative doesn't jump instantly back to
zero but gradually approaches zero.
Spectral tilt tier Spectral tilt represents the extra number of dB the voicing spectrum
should be down at 3000 hertz. According to Klatt and Klatt [1990] this parameter is necessary to model corner rounding, i.e. when glottal closure is non-simultaneous along the
length of the vocal folds. If no points are defined in this tier, spectral tilt defaults to 0 dB
and no spectral modifications are made. Spectral tilt makes it possible to attenuate the higher
frequencies. The attenuation is implemented by a first order recursive filter
yₙ = a·xₙ + b·yₙ₋₁.
Its frequency response is given by
|H(f)| = a / √(1 − 2b·cos(2πfT) + b²),   (12.2)
4 The exponential flow function is f(t) = A·e^(−(1/collisionPhase)·t/T), where A is the amplitude at the glottal closure
point and T is the period. For a collisionPhase of 0.04 this function is f(t) = A·e^(−25t/T).
Figure 12.7.: Effect of collision phase on the glottal flow signal (top) and its derivative (bottom).
The vertical dotted lines mark the glottal closure points where the polynomial flow
function changes into an exponentially decaying one.
where T is the sampling time. By choosing a = 1 − b we have H(0) = 1. The response of this
filter then depends only on b; different values for b give different frequency responses. For
a prescribed spectral tilt, we can obtain the value for b by requiring that at F = 3000 Hz the
following equivalence holds:
20·log|H(F)| = −spectralTilt.
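This requirement can be solved for b numerically, for example by bisection. The following Python sketch is an illustration of the idea, not Praat's actual implementation, and the 44100 Hz sampling frequency is an assumption:

```python
import math

fs = 44100.0                      # an assumed sampling frequency

def tilt_db(b, f=3000.0):
    # 20 log10 |H(f)| of y[n] = a x[n] + b y[n-1] with a = 1 - b
    T = 1.0 / fs
    a = 1.0 - b
    h = a / math.sqrt(1.0 - 2.0 * b * math.cos(2.0 * math.pi * f * T) + b * b)
    return 20.0 * math.log10(h)

def b_for_tilt(tilt):
    # bisection for the b that gives 'tilt' dB of attenuation at 3000 Hz
    lo, hi = 0.0, 0.999999
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tilt_db(mid) > -tilt:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b = b_for_tilt(10.0)
```

Bisection works here because the response at 3000 Hz is 0 dB for b = 0 and far below any prescribed tilt as b approaches 1, so a root always lies in between.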
In the left part of figure 12.8 we show the decibel values of the frequency response of equation (12.2) from 0 Hz
to 5000 Hz. The values of spectralTilt were varied from 0 to 50 dB in steps
of 5 dB. The reference at 3000 Hz is shown by a vertical dotted line.
The right display of figure 12.8 shows the result of spectral tilt measurements on the glottal
source sounds. The horizontal axis shows the prescribed spectral tilt and the vertical axis the
measured spectral tilt. The skeleton script shows how the point for spectralTilt = 10 was
obtained. We start by defining a standard KlattGrid and add a pitch point of 100 Hz and a
voicing amplitude point. We start a loop to increase the value of spectralTilt by 5 in each
iteration. For the first iteration spectralTilt = 0. We select the KlattGrid and add the spectral
Figure 12.8.: On the left the theoretical filter curves for spectral tilt values that vary in steps of
5 dB from 0 dB to 50 dB. On the right a plot of the actual spectral tilt as measured
from synthesized sounds versus the specified spectral tilt.
Script 12.3 A skeleton to measure the actual spectral tilt values of figure 12.8.
kg = Create KlattGrid : "kg" , 0 , 1 , 6 , 1 , 1 , 6 , 0 , 0 , 0
Add pitch point : 0.5 , 100
Add voicing amplitude point : 0.5 , 90
for i to 11
spectralTilt = (i - 1) * 5
selectObject : kg
Add spectral tilt point : 0.5 , spectralTilt
To Sound
To Spectrum : "no"
dif [i] = Get band density difference : 98 , 102 , 2998 , 3002
dif = 0
if i > 1
dif = dif [1] - dif [i]
endif
# ...
Remove spectral tilt points : 0 , 1
# ...
endfor
tilt value to its tier. Next the glottal source sound will be synthesized and its spectrum will
be calculated. In the next step we calculate the difference in band spectral density between
the frequency bands around 100 Hz and the frequency band around 3000 Hz. These values
were chosen because exactly one harmonic of the pitch is in each band. We subtract this value
from the reference value for spectralTilt = 0 to obtain the values that were displayed in the
figure. Then we clear the spectral tilt tier by removing all the points and before we start the
next iteration, we can do some bookkeeping, like saving the density difference numbers for
later graphical display and deleting the sound and the spectrum.
We note that the maximum spectral tilt that can actually be realised with this filter is approximately 30 dB. The filter honours spectral tilt values up to 20 dB and then flattens off. More
than 30 dB of spectral tilt cannot be accomplished.
Aspiration amplitude tier The aspiration amplitude tier models the (maximum) amplitude
of noise generated at the glottis. The noise amplitude is given in dB SPL. The aspiration noise
Figure 12.9.: The glottal flow for a 100 Hz pitch with 70 dB breathiness. The voicing amplitude
was 90 dB SPL.
flow for a KlattGrid with a 100 Hz pitch point, a voicing amplitude point at 90 dB SPL and
breathiness amplitude point at 70 dB SPL. The breathiness amplitude nicely follows the flow
amplitude.
Double pulsing tier The double pulsing tier models diplophonia (by a fraction between
zero and one). Whenever this parameter is greater than zero, alternate pulses are modified. For
the time being, a pulse is modified with this single parameter tier in two ways: it is delayed in
time and its amplitude is attenuated. If the double pulsing value is maximum (= 1), the time
of closure of the first peak coincides with the opening time of the second one. Figure 12.10
shows an example of the flow signal for double pulsing. The KlattGrid is defined by script
12.4. The pitch is set to 100 Hz with an open phase of 0.5. The double pulsing tier has two
values: at time 0.05 s it is set to 0, at time 0.15 it is set to the maximum value 1. The figure
shows that the flow signal before time 0.05 s is a normal 100 Hz flow signal, showing 5 peaks
in the 0.05 s time interval. From time 0.05 s on we see a gradual decrease of the first pulse of
every pair and a displacement towards the second pulse of the pair until the first one totally
disappears.
Figure 12.10.: Double pulsing for a 100 Hz pitch; the double pulsing value increases linearly from 0 to 1 between
times 0.05 s and 0.15 s. The vertical dotted lines are the moments of glottal closure.
Script 12.4 The script that generates the flow signal of figure 12.10.
Add pitch point : 0.1 , 100
Add voicing amplitude point : 0.1 , 90
Add open phase point : 0.1 , 0.5
Add double pulsing point : 0.05 , 0
Add double pulsing point : 0.15 , 1
filter boosts frequencies and an antiformant filter attenuates frequencies in a certain frequency
region. If these filters are applied one after the other, this type of filtering is called cascade
or series filtering. A cascade type filter set is shown in the vocal tract part of figure 12.1. In
the figure the sound signal flow goes from left to right. The sound source signal is generated
in the upper left part of the figure where the three boxes symbolize the specifications of the
phonation. The sound then enters the box labeled TF F1 B1, this happens to be a tracheal
formant filter. If some tracheal formant frequency and bandwidth has been defined, the source
signal will be filtered in this box, otherwise it will leave this box unchanged. The signal then
enters the next box labeled TAF F1 B1, which happens to be a tracheal antiformant. It will
leave this box filtered if some frequency and bandwidth values were defined, or unmodified otherwise.
After it leaves this box the next one follows, and this goes on until the last box.
In contrast to cascade filtering there is parallel filtering. In parallel filtering the input sound
is applied to all the parallel filters at the same time, all filters process the same signal and then
the output of all these parallel filters is summed so that only one output signal results. The
frication section, for example, is always modeled with parallel filters as can be seen in figure
12.1. The vocal tract filters can also work in parallel, as is shown by figure 12.11. In fact,
vocal tract filtering can be done in either of two ways: cascade or parallel. When sound is
synthesized a choice has to be made for one of the two. The default is cascade filtering. Klatt
[1980] mentions some reasons in favor of the cascade type of vocal tract filtering:
The relative amplitudes for vowels are synthesized correctly. There is no need for individual formant amplitude regulation.
It is a better model of the vocal tract transfer function during the production of non-nasal sonorants.
[Figure 12.11: block diagram of the KlattGrid with the vocal tract part as filters in parallel; labels include Voicing, Tilt, Aspiration, Pre-emphasis, Nasal A1 F1 B1, oral A1 F1 B1 . . . A6 F6 B6, Tracheal, Frication noise and Bypass.]
However, for sounds that do not follow the amplitude relations of vowels, for example fricatives
and plosives, we have to use parallel filtering.
For the cascade version of the vocal tract filter, each formant filter is modeled by two tiers,
a frequency tier and a bandwidth tier. For the parallel version each formant filter has to be
modeled by three tiers: besides the frequency and bandwidth tiers we need an additional
amplitude tier. The tiers that model each formant are completely decoupled. This means that
for a formant its frequency as a function of time, its bandwidth as a function of time and
its amplitude as a function of time can be set completely independently of each other.6 In the
cascade filtering of the vocal tract part the amplitude relations between formants automatically
resemble the relations found in natural speech. For parallel filtering we have to explicitly
define each formant's amplitude, which may be difficult to determine.
6 For
example, in the KlattGrid you could specify only one point in the bandwidth tier, i.e. effectively keeping the
bandwidth of the formant constant in time. In the formant frequency tier you can vary the formant frequency
by defining multiple points. For frame-based synthesizers you would have to specify frequency and bandwidth
values in synchrony for each frame. Besides, the only way to simulate in a frame-based way that formants and
bandwidths change independently is by increasing the number of frames per second.
In section 11.4 we have shown the effect of frequency and bandwidth on the amplitude
of one formant. We will now show the effect on the amplitude spectrum of more than one
formant filter if they are applied, one after the other, in cascade. In figure 12.12 we show
what happens to the spectrum of two formants if these formants approach each other. We first
show what happens for a first formant at a constant frequency of 500 Hz and a second formant
whose frequency changes in steps from 3000 Hz to 2000 Hz and then to 1000 Hz and finally to
700 Hz. These four spectra are in the lower part of the figure. First of all we see that the peak
of the second formant is always lower than the peak of the first formant, the farther the second
is from the first the larger the difference in amplitude. The amplitude of the second formant
falls off at approximately 6 dB/oct. It is only when the two formants are near to each other
that the amplitude of the first formant increases. This effect is already noticeable when the
second formant is at 1000 Hz and the formants are at a distance of 500 Hz. When the formants
are at 200 Hz distance the first formant has been raised by an extra 6 dB. We see the same behaviour
when the first formant starts at another frequency like for the upper part of the figure where it
is kept constant at 1000 Hz. Now the second formant changes from 3000 Hz to 2000 Hz and
then to 1200 Hz. The doubling of the first formant frequency results in a 6 dB increase in its
amplitude. The amplitude of the second formant increases by approximately 12 dB. This is
also in line with the observation in Klatt [1980] who states that if one of the formants in a
cascade model is changed all higher formant amplitudes change by a factor proportional to
frequency squared.
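These cascade effects are easy to reproduce numerically. The sketch below is plain Python with NumPy, not a Praat script; the sampling frequency, the 50 Hz bandwidths and the DC-normalized digital resonator form are assumptions of this sketch, not values taken from the text. It cascades two formant filters and reads off the spectrum levels at the formant frequencies:

```python
import numpy as np

def resonator_coeffs(F, B, fs):
    # pole radius and angle from formant frequency F and bandwidth B
    r = np.exp(-np.pi * B / fs)
    b = 2 * r * np.cos(2 * np.pi * F / fs)
    c = -r * r
    a = 1 - b - c          # normalize the gain at 0 Hz to 1
    return a, b, c

def cascade_spectrum(formants, fs=16000, n=8192):
    # spectrum of the impulse response of formant filters applied in cascade
    x = np.zeros(n)
    x[0] = 1.0
    for F, B in formants:
        a, b, c = resonator_coeffs(F, B, fs)
        y = np.zeros(n)
        for i in range(n):
            y[i] = a * x[i] + b * y[i - 1] + c * y[i - 2]
        x = y
    freqs = np.fft.rfftfreq(n, 1 / fs)
    spec = 20 * np.log10(np.abs(np.fft.rfft(x)) + 1e-300)
    return freqs, spec

def level_at(freqs, spec, F):
    return spec[np.argmin(np.abs(freqs - F))]

# the second-formant level falls as F2 moves away from a fixed F1 = 500 Hz ...
f, s = cascade_spectrum([(500, 50), (1000, 50)]); l1000 = level_at(f, s, 1000)
f, s = cascade_spectrum([(500, 50), (2000, 50)]); l2000 = level_at(f, s, 2000)
f, s = cascade_spectrum([(500, 50), (3000, 50)]); l3000 = level_at(f, s, 3000)

# ... and doubling F1 raises the first-formant peak by roughly 6 dB
f, sa = cascade_spectrum([(500, 50), (3000, 50)])
f, sb = cascade_spectrum([(1000, 50), (3000, 50)])
rise = level_at(f, sb, 1000) - level_at(f, sa, 500)
print(l1000, l2000, l3000, rise)
```

The three levels decrease monotonically, and the rise on doubling F1 comes out near 6 dB, in line with the behaviour described above.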
Figure 12.13.: Example of coupling between the vocal tract and the glottis, with some extreme coupling values. For all signals the pitch was held constant at 100 Hz, the open phase of the glottis was set to 0.5, and only one oral formant and bandwidth are defined. Top left: the oral formant bandwidth is held constant at 50 Hz while the oral formant frequency is 500 Hz during the closed phase and 1000 Hz during the open phase. Top right: the first oral formant frequency is constant at 500 Hz but the oral bandwidth varies from 50 Hz during the closed phase to 450 Hz during the open phase. At the bottom the resulting sounds. The glottal closure times are indicated with dotted lines.
Extreme coupling parameter values have been taken for a clear visual effect. In both examples we generate a voiced sound with a 100 Hz pitch, an open phase of 0.5 to make the durations of the open and closed phase equal, and one oral formant.[7] In the figure on the left the oral formant bandwidth is held constant at 50 Hz and the frequency is modified during the open phase. The oral formant frequency is set to 500 Hz. By setting a delta formant point to a value of 500 Hz we

[7] Otherwise we would have to add the lungs as another source of sound besides the glottis, which would complicate things needlessly.
Figure 13.1.: Left pane: average formant frequencies for twelve Dutch vowels from fifty male
speakers as measured by Pols et al. [1973]. Right pane: the Dutch vowel system.
As an illustration we have plotted, in the left pane of figure 13.1, the first two formant
frequencies of the twelve Dutch monophthong vowels, averaged over fifty male speakers, as
measured by Pols et al. [1973]. The figure agrees quite nicely with the phonological Dutch
vowel system which is plotted in the right pane of the same figure.
Another advantage of formant frequencies shows itself in the coding domain: a representation of vowels with formant frequencies (and bandwidths) is extremely robust against noisy

[1] If we restrict ourselves to vowel synthesis, Praat's vowel editor, described in chapter 10, might convince you that this is indeed the case.
if the fundamental frequency is higher than a resonance frequency of the vocal tract there will be no
formant peak in the spectrum.
Historically seen, this is the oldest way to measure formant frequency values. To show how formant frequencies can be measured from the oscillogram we start with a very simple example, a one-formant vowel. In figure 13.2 we have displayed the oscillogram of three periods of an artificial vowel-like sound with only one formant in it and a fundamental frequency F0 of 100 Hz. We can clearly see the three periods of the fundamental frequency, each with a duration of 0.01 s. In each fundamental period we count five oscillations; these oscillations become smaller in amplitude towards the end of a period. This gradually dying oscillation is called a formant and its form as a function of time can be written as f(t) = e^(−αt) sin(2πFt), where the parameter α is called the damping and the parameter F is called the formant frequency. Essentially this function f(t) describes a pure tone whose amplitude fades away as time goes on. Because we count five oscillations in one period of the fundamental, each oscillation lasts 0.002 (= 0.01/5) s and its frequency will therefore be 1/0.002, which equals 500 Hz. The formant function we used to generate this signal was s(t) = e^(−π50t) sin(2π500t); it has a 500 Hz frequency and a 50 Hz bandwidth, the bandwidth B being related to the damping as α = πB. The formant frequency derived from the oscillogram agrees nicely with this value.
The formant's bandwidth for this vowel can also be deduced from the figure. This is done by relating the amplitudes of the curve at two different points in time. We start by writing down the amplitudes at times t1 and t2 from the formula above as

s(t1) = e^(−πBt1) sin(2πFt1)
s(t2) = e^(−πBt2) sin(2πFt2).

If we choose the second time a whole number of formant periods after the first, i.e. t2 = t1 + kT, where T = 1/F is the period of the formant and k a natural number, the two sine terms are equal and the quotient of the amplitudes reduces to s(t1)/s(t1 + kT) = e^(πBkT). Solving for B gives

B = 1/(πkT) · ln( s(t1) / s(t1 + kT) ).

Therefore we can calculate the bandwidth by taking the logarithm of the quotient of two amplitudes at times that lie, for example, one period T of the formant apart, and dividing by πkT.
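As a quick numerical check, the following sketch (plain Python, not a Praat script) applies this formula to the two amplitudes 0.93 and 0.68 that lie one formant period T = 1/500 s apart in the top panel of figure 13.2:

```python
import math

def bandwidth_from_amplitudes(a1, a2, k, T):
    # B = ln(a1 / a2) / (pi * k * T), where the two amplitudes
    # lie k formant periods T apart
    return math.log(a1 / a2) / (math.pi * k * T)

# amplitudes one period (k = 1) of the 500 Hz formant apart
B = bandwidth_from_amplitudes(0.93, 0.68, k=1, T=0.002)
print(B)  # close to the 50 Hz bandwidth used to generate the signal
```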
Figure 13.2.: Top: the oscillogram of a sound segment with one formant (F = 500 Hz, B = 50 Hz); the two marked amplitudes 0.93 and 0.68 lie one formant period apart. Bottom: a segment with two formants (F1 = 300 Hz, B1 = 50 Hz and F2 = 2100 Hz, B2 = 100 Hz). For both sounds F0 equals 100 Hz.
With two formants the estimation of formant frequencies and bandwidths becomes more difficult. An example of an artificial two-formant signal can be found in the bottom part of figure 13.2. The first formant, with a frequency of 300 Hz, is easily detectable. The second formant is somewhat smaller in amplitude than the first and superimposed on it. You can see this second formant as a high-frequency ripple on the first formant's amplitude. Approximately seven periods of this formant fit into one period of the first one, and we conclude from this that the second formant's frequency is near 2100 Hz. It is not too difficult to estimate the bandwidth
Figure 13.3.: Two formants, F1 = 530 Hz, B1 = 50 Hz, and F2 = 940 Hz, B2 = 50 Hz.
Even for these artificial signals it is tedious to get the formant frequencies right, let alone the bandwidths. In more difficult situations, like in figure 13.3 where two formants are close together, it becomes even harder to measure frequencies and bandwidths in this way. We have to use all kinds of knowledge and rules about how formants interfere with each other in the sound signal. This is why the invention of the spectrogram was such a great advance in speech signal processing.[3]
13.3.2. Formant frequencies from the spectrogram
From a spectrogram you can get a good impression about the formant structure of a sound
segment. In figure 13.4 we show part of the Dutch speech fragment /...lo:pt mEt ha:r.../
(Eng: ...walks with her...). The bottom part shows the interval tier with the segmentation of
this sound. The top part shows the spectrogram with frequencies up to 5000 Hz represented.
[3] The spectrograph, a device to analyze a sound into its frequency components and to paint the spectrogram on paper, was invented during World War II to visualize speech in order to break enemy speech-scrambling methods.
[Figure 13.4: wideband spectrogram (0–5000 Hz) of the fragment, with the segmentation tier below; the times 0.8, 1.1 and 1.23 s are marked and the time axis runs from 0.7 to 1.3 s.]
Because we are now mainly interested in the voiced parts of the sound this upper range is sufficient. It seems the formant structure can easily be followed by eye, especially in the three vowel segments /o:/, /E/ and /a:/, where the darker parts are nicely separated from each other. We also note, for the two /t/'s, the relatively high burst of energy at frequencies above 2000 Hz that extends to the top of the displayed frequency range.

For the first vowel in the display, the /o:/, we see a number of formants in the spectrogram. The lower four can easily be identified by eye as these traces are well separated from each other: a first formant in the 500 Hz region, a second formant just above 1000 Hz, a third formant in the 2500 Hz region and a fourth rising from 3000 to 3500 Hz. The dark area between 4000 and 5000 Hz looks like a merger of a fifth and a sixth formant, the fifth formant starting at 4500 Hz in the /l/ region and then moving into the 4300 Hz range, the sixth staying constant at a frequency of approximately 4500 Hz. Although we are dealing with a monophthong /o:/ vowel and the formants can be traced by eye relatively easily, the formants are not static at all: they change during the time course of the vowel. The most stable part seems to be in the middle section of this vowel. If we had to represent the /o:/ vowel with only one set of formant frequency values, then the formant frequencies from the mid part of this vowel would probably be the best choice.
Figure 13.5.: Three spectral slices taken from the wideband spectrogram of figure 13.4. The slices were taken at 0.8 s, 1.1 s and 1.23 s and belong to the vowels /o:/, /E/ and /a:/, respectively.
For the vowel /E/ there also seem to be five or six formants present. The first formant is well below 1000 Hz, but according to the figure it is a very wide trace. The second formant starts at approximately 1200 Hz and increases to 1700 Hz. The third formant also increases, but now from 2000 to 2500 Hz, approximately. After the third formant things start to get fuzzy: probably three other formants are visible, but they are not very clearly traceable from the spectrogram. For the /a:/ vowel the first two formants are clearly visible, but formants three and four are very blurred while formant five is more pronounced.

If only we had a way to measure these formant traces automatically!
13.3.3. Formant frequencies from band filter analysis
13.3.4. Formant frequencies from linear prediction
Formant frequency analysis by linear predictive coding (LPC) differs from the previous analysis methods in a fundamental way. Linear prediction is a parametric model, which means that the analysis is based on a model for speech production, and the analysis therefore fits model parameters. Parametric models are a mixed blessing: they offer more accurate spectral estimates than nonparametric ones in the case where the data satisfy the model. However, when the data do not satisfy the model, the results may be very wrong. LPC analysis is based on the source-filter theory of speech production [Fant, 1960]. Source-filter theory hypothesizes that an acoustic speech signal is the result of a source signal (the glottal source, or noise generated at a constriction in the vocal tract) filtered with the resonances in the cavities of the vocal tract downstream from the glottis or the constriction. LPC analysis separates the source and the filter from the acoustic signal. In principle this separation of a sound into a source and a filter can be done in infinitely many ways.[4] This can be seen most easily in the frequency domain, where the (simplified) source-filter model translates to the multiplicative model

O(f) = H(f)S(f).    (13.1)

This formula relates the spectrum O(f) of the sound to the spectrum S(f) of the source and the spectrum H(f) of the filter. According to the source-filter model applied to speech production, the spectrum of the synthesized sound is obtained by multiplying the spectrum of the source by the spectrum of the filter.
Time step determines the time between the centres of two consecutive analysis frames. We already discussed this parameter in section 2.4, where the general analysis scheme was explained. If this value is 0.01 and the sound has a duration of 2 s, then approximately 200 analysis frames will result. If you leave the value at 0.0, Praat will use a value that is one fourth of the analysis window length.

Maximum number of formants determines the number of formants that the algorithm tries to calculate from each analysis frame. For most analyses of human speech you will want to calculate five formants.

Maximum formant determines the ceiling of the frequency interval in which formants will be calculated. On average we expect one formant in every 1000 Hz interval for
[4] An analogy with real numbers is the following. Say you want to decompose a real number, say 1, as the product of two other real numbers a and b such that 1 = a·b. Then any number a satisfies the equation if b is chosen as 1/a.
Figure 13.7.: Formant frequency analysis with Praat's default five formants.
Let us try a new analysis where we allow more formants than the default five to fit the signal. We will try to fit one formant more and use the command Sound: To Formant (burg)... 0.0 6 5000 0.025 50. In figure 13.8 we have displayed the result of the new analysis. The complaints we had have now vanished and, at least for the vowel parts, the formants nicely trace the darker areas in the spectrogram. Only in vowel segments where the dark areas in the spectrogram are not nicely contiguous do the formants follow that trend and show irregularities. For example, the third formant of the /a:/ shows irregularities near the transition
Figure 13.8.: Formant frequency analysis with six formants on the sound from figure 13.7.
However, increasing the number of formants in the analysis even further, to seven, results in a degradation of the formant contours. We do not show the figure here, but in between the smooth contours of the analysis with six formants spurious formant points turn up, disrupting whatever continuity there was.

Instead of looking for six formants in the frequency range from 0 to 5000 Hz we could also search for five formants in a more restricted frequency range. This can be arranged by setting the Maximum formant parameter to a lower value. This is the case in figure 13.9, where the ceiling was set to 4500 Hz. This results in formant tracks that, at least for the lower frequency range, fit at least as well as those in figure 13.8, where six formants were measured. This measurement is also clearly better than the one with the ceiling set at 5000 Hz, as in figure 13.7. In figure 13.10 we give a summary of the three discussed LPC formant analyses. From left to right the three panes present spectral slices of the vowels /o:/, /E/ and /a:/, respectively.
Figure 13.9.: Formant frequency analysis on the same sound as figure 13.7 with five formants and
Maximum formant parameter set to 4500 Hz.
These spectral slices were determined from the formant frequency analysis by transforming the Formant object to an LPC object and then drawing a spectral slice at the three different time values 0.8, 1.1 and 1.23 s, which are reasonably representative of these vowels.[5] For example, in the left-most pane the spectrum drawn with a solid line was produced by such a conversion.
Figure 13.10.: LPC spectra of the vowels /o:/, /E/ and /a:/, respectively.
Ceiling    F1    B1     F2     B2     F3    B3     F4    B4     F5     B5
4000      493    48    930    122   2290    53   3355    80   3650   1377
4100      493    50    932    127   2291    50   3350    82   3784   1057
4200      496    60    939    152   2297    49   3348    55   4244    207
4300      498    66    943    171   2300    49   3349    51   4294    102
4400      502    79    949    209   2305    53   3353    43   4306     62
4500      511   109    954    306   2314    68   3363    42   4317     49
4600      528   150    948    481   2324    96   3374    47   4326     47
4700      563   190    902    823   2330   143   3384    60      -      -
4800      604   182    730   1537   2325   205   3392    82      -      -
4900      565   602   2006    233   3213   132   4166    58      -      -
5000      636   138   2278    259   3390   130   4335    57      -      -
avg       535   152   1137    402   2489    93   3524    60  4132*   415*
stdev      47   150    481    409    385    50    344    14   266*   517*

Table 13.1.: The first five formant frequencies and bandwidths for the vowel /o:/ measured with different values of the ceiling parameter. The last two rows give the average and the standard deviation of each column; starred values are computed over the seven ceilings for which a fifth formant was measured.
only the indication from acoustic models that approximately every 1000 Hz there might be a formant. Therefore, if we want to say anything about the accuracy of formant frequency measurements, we have to use sounds for which we know the number of formants beforehand. This automatically forces us to use artificial vowel sounds. Let us start by defining three artificial vowels /u/, /a/ and /i/ whose formant frequencies are given in table 13.2. These three vowels are at the extremes of the vowel triangle and they will be used in the automatic formant frequency measurements. The third, fourth and fifth formants were not varied at all and were the same for all three vowels, as we know that vowel identity is mainly determined by the first two formants. The fundamental frequency was held constant at 125 Hz. The artificial vowels were produced according to the source-filter model, i.e. a sound source was filtered by a number of consecutive formant filters. The details of these sounds can be found in the following section.
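A minimal sketch of such a source-filter synthesis in Python follows; the sampling frequency, the duration and the crude pulse-train source are assumptions of this sketch, not the text's exact recipe. It filters a 125 Hz pulse train through the five formant filters of /u/ from table 13.2 and checks where the strongest spectral component ends up:

```python
import numpy as np

# formant frequencies and bandwidths for /u/ from table 13.2
fs, f0 = 10000, 125
formants = [(300, 30), (600, 60), (2500, 250), (3500, 350), (4500, 100)]

n = fs                       # one second of sound
y = np.zeros(n)
y[::fs // f0] = 1.0          # crude glottal source: pulse train at 125 Hz

for F, B in formants:        # apply the formant filters in cascade
    r = np.exp(-np.pi * B / fs)
    b, c = 2 * r * np.cos(2 * np.pi * F / fs), -r * r
    a = 1 - b - c            # unit gain at 0 Hz
    out = np.zeros(n)
    for i in range(n):
        out[i] = a * y[i] + b * out[i - 1] + c * out[i - 2]
    y = out

spec = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(n, 1 / fs)
peak = float(freqs[np.argmax(spec)])
print(peak)   # the strongest harmonic lies near F1 = 300 Hz
```

The spectrum consists of harmonics of 125 Hz shaped by the formant envelope, so the strongest component is the harmonic closest to the first formant.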
Vowel    F1     F2     F3     F4     F5    B1    B2    B3    B4    B5
u       300    600   2500   3500   4500    30    60   250   350   100
a       800   1200   2500   3500   4500    80   120   250   350   100
i       300   2300   2500   3500   4500    30   230   250   350   100

Table 13.2.: The formant frequency values and bandwidths in Hz of the three artificial vowels /u/, /a/ and /i/.
Figure 13.11.: Spectral slices determined from LPC analysis of three different artificial vowel
sounds /u/, /a/ and /i/.
Most of the time formant frequency values have to be measured for vowels. In order to make the measurements reproducible it is desirable to automate the measuring process as much as possible. This means that already at the level of the sound file names enough structure should be present to deduce as much information from a file name as is needed. The next step is annotating the segments we are interested in. The last step is probably to write a script to perform the automatic measurements.
14.2. TableOfReal

A TableOfReal contains cells that are ordered like a matrix, i.e. each cell can be identified by a row index and a column index. As the name indicates, the cells can only contain real numbers. The rows and the columns of the TableOfReal may have a label text for identification. If we only need one-dimensional information as row information, for example vowel labels, a TableOfReal object suffices for holding the data. A TableOfReal can be created from the New > Tables menu with the Create TableOfReal... command. The form that appears allows you to fill in a name for the new table as well as the number of rows and columns. After clicking the OK

Figure 14.1.: The forms of the Create TableOfReal... and the Set value... commands.

button a new TableOfReal object appears in the list of objects. The cell values will be filled with zeros and both row and column labels will be empty. Column labels can be assigned with the Modify > Set column label (index)... command, and row labels with the Modify > Set row label (index)... command. Assigning values to the cells can be done in two ways. For assigning the cell values one by one we can use the Set value... command. As the right-hand panel in figure 14.1 shows, only one cell value, at a particular row and column index, can be changed at a time. Filling a table in this way by hand is a lot of work. The other way is to fill the table directly with the Formula... command. The following script fills a selected table with random Gaussian numbers whose standard deviation differs per column:

Formula : "randomGauss (0, col * 0.1)"
The data in the first column will have a standard deviation of 0.1, the data in the second column 0.2, and so forth. We note that if we generate data to a given specification and afterwards measure the values, they will almost never be exactly equal to that specification. For example, the following script will generate 1000 random Gaussian numbers with zero mean and standard deviation one, and then measure them:

Create TableOfReal : "rg" , 1000 , 1
Formula : "randomGauss (0, 1)"
mean = Get column mean (index) : 1
stdev = Get column stdev (index) : 1
writeInfoLine : "mean = " , mean , " , stdev = " , stdev
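The same experiment is easily repeated outside Praat; a short Python sketch with the standard library's random module (the seed is arbitrary, chosen only for reproducibility):

```python
import random

# 1000 Gaussian numbers with mean 0 and standard deviation 1,
# then measure both quantities from the sample
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(1000)]
mean = sum(xs) / len(xs)
stdev = (sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
print(mean, stdev)   # close to, but not exactly, 0 and 1
```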
The TableOfReal has a number of drawing options that may turn out useful. Let us illustrate these options with a well-known data set, the Pols et al. [1973] data set of the first three formant frequencies of the twelve Dutch monophthongs as spoken by fifty Dutch male speakers. This data set is included in Praat and the following command will make it available as a TableOfReal object in the list of objects.

Create TableOfReal (Pols 1973) : "yes"

The resulting object will have 600 rows and 6 columns. The columns will be labeled F1, F2, F3, L1, L2 and L3, respectively. The rows will be labeled with the twelve different vowel labels u, a, o, \as, \o/, i, y, e, \yc, \ep, \ct, and \ic. The vowel labels which have a backslash in them are special and use an encoding that guarantees they are always displayed in the same way, irrespective of the computer platform you happen to work on. The twelve vowel labels will display as u, a, o, ɑ, ø, i, y, e, ʏ, ɛ, ɔ, and ɪ.
14.3. Table

A Table, just like the TableOfReal, has cells that are ordered like a matrix. However, in the TableOfReal the cells can only contain numbers, while in the Table the cells may also contain text. The columns of a Table may have a label. Because cells may contain text, a Table doesn't have row labels.
14.4. Permutation
A Permutation object represents one of the possible permutations of n things.
14.5. Strings
A Strings object represents an ordered list of strings. For example, one of the uses of a Strings
object in Praat is to hold a list of files.
Part II.
Advanced features
ŝ_n = − Σ_{k=1}^{p} a_k s_{n−k},    (16.1)

where ŝ_n indicates that it is the predicted value for s_n, and the p parameters a_k are the linear prediction coefficients. Convention says that we start with a minus sign. Equation (16.1) shows that we need at least p + 1 sample values, s_1, s_2, ..., s_{p+1}, to calculate the prediction coefficients. We will show how this calculation works and assume that sample values that occur before s_1 are all zero. In each sample the prediction leaves an error, the difference between the real value and the predicted value:

e_n = s_n − ŝ_n.    (16.2)
As the measure E for the total error made in the prediction, we accumulate the squares of the sample-wise errors to obtain

E = Σ_{n=1}^{N} e_n².    (16.3)
The sum of squares is used because this makes the further handling mathematically more tractable.[1] The error E is a function of the parameters a_k; this could be indicated by writing it as E(a_1, a_2, ..., a_p), but nobody does so. We can find the extrema of the E function by taking partial derivatives with respect to the a_i and setting the result to zero. These partial derivatives are

∂E/∂a_i = Σ_{n=1}^{N} 2 e_n (∂e_n/∂a_i) = 0,    for 1 ≤ i ≤ p.    (16.4)
To solve these p equations we rewrite the e_n using eq. (16.1) as

e_n = s_n − ŝ_n = s_n + Σ_{k=1}^{p} a_k s_{n−k}.    (16.5)

The partial derivatives then follow as

∂e_n/∂a_i = ∂/∂a_i ( s_n + Σ_{k=1}^{p} a_k s_{n−k} ) = s_{n−i}.

[1] We could also accumulate the absolute values of the differences between the real value and the predicted value as E = Σ_{n=1}^{N} |e_n|. However, the absolute signs make it more difficult to use standard minimization techniques.
Substituting these results in eq. (16.4) gives

Σ_{n=1}^{N} ( s_n + Σ_{k=1}^{p} a_k s_{n−k} ) s_{n−i} = 0,

which can be rearranged as

Σ_{n=1}^{N} s_n s_{n−i} + Σ_{k=1}^{p} a_k ( Σ_{n=1}^{N} s_{n−k} s_{n−i} ) = 0.

Bringing the first term to the right-hand side yields

Σ_{k=1}^{p} a_k Σ_{n=1}^{N} s_{n−k} s_{n−i} = − Σ_{n=1}^{N} s_n s_{n−i},    for 1 ≤ i ≤ p.    (16.6)
There are many ways to solve these equations and they depend on what we assume the sample values are outside the range s_1 ... s_N. In the equation above there are products of sample values like s_{n−k} s_{n−i} where for certain combinations of n and k and/or n and i the resulting index is smaller than 1. In what follows our assumption will be that sample values with an index smaller than one are all zero. This corresponds to what in the literature has become known as the autocorrelation method. Given that only s_1 to s_N are different from zero, we can define the (autocorrelation) coefficients R_i as

R_i = Σ_n s_n s_{n+|i|}.    (16.7)
We then rewrite eq. (16.6) as

Σ_{k=1}^{p} a_k R_{|i−k|} = −R_i,    for 1 ≤ i ≤ p,

or, written out in matrix form,

| R_0      R_1      ...  R_{p-1} | | a_1 |     | R_1 |
| R_1      R_0      ...  R_{p-2} | | a_2 | = - | R_2 |
| ...      ...      ...  ...     | | ... |     | ... |
| R_{p-1}  R_{p-2}  ...  R_0     | | a_p |     | R_p |
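This Toeplitz system is easy to solve numerically. In the Python sketch below the AR(2) coefficients, the signal length and the random seed are made up for the example; we generate a signal with known prediction coefficients, compute the autocorrelation coefficients, and solve the system above:

```python
import numpy as np

# a made-up stable AR(2) model: s_n = -a1*s_{n-1} - a2*s_{n-2} + u_n
a_true = np.array([-1.6, 0.64])      # double pole at z = 0.8, stable
N, p = 20000, 2
rng = np.random.default_rng(42)
u = rng.standard_normal(N)
s = np.zeros(N)
for n in range(N):
    # negative indices read trailing zeros of s at the start, matching
    # the assumption that samples before s_1 are zero
    s[n] = -a_true[0] * s[n - 1] - a_true[1] * s[n - 2] + u[n]

# autocorrelation coefficients R_0 .. R_p
R = np.array([s[: N - i] @ s[i:] for i in range(p + 1)])

# the Toeplitz system: sum_k a_k R_|i-k| = -R_i
A = np.array([[R[abs(i - k)] for k in range(p)] for i in range(p)])
a_est = np.linalg.solve(A, -R[1 : p + 1])
print(a_est)    # close to a_true
```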
s_n = − Σ_{k=1}^{p} a_k s_{n−k} + u_n.    (16.8)

Here u_n is the source and the a_k represent the characteristics of the vocal tract filter. This equation says that the output s_n is predicted from p previous outputs and a source u_n. Because we do not know the source signal u_n, we proceed by first neglecting it and stating that our best estimate of s_n would be a linear combination of the p previous values only, i.e.

ŝ_n = − Σ_{k=1}^{p} a_k s_{n−k}.    (16.9)

In doing so we make an error e_n = s_n − ŝ_n for each sample. Using eq. (16.9) we write

e_n = s_n + Σ_{k=1}^{p} a_k s_{n−k},    (16.10)

or, equivalently,

s_n = − Σ_{k=1}^{p} a_k s_{n−k} + e_n.    (16.11)
Now the error e_n has appeared at the position of the source u_n in eq. (16.8). This is nice: we can use the linear prediction method outlined in the previous section to estimate the filter coefficients a_k and then use eq. (16.10) to give us an estimate of the source. It seems that we get two for the price of one: we start with the sound and get the filter as well as the source. Of course this cannot be the whole truth. What happens is that part of the sound is modeled by the filter, and whatever has not been modeled ends up in the source. So in the end we still don't know exactly what the source was; it is simply everything in the sound that is not modeled by the filter. The better our model fits the sound, the better our estimates of the vocal tract filter and the source signal.
Typically a filter equation like the ones above is evaluated at discrete times nT which are multiples of the sampling period, i.e. the index n refers to units of the sampling period T. The Z-transform offers a convenient tool to analyze such sampled systems.[2] For a sampled signal y_n it is defined as

Y(z) = Σ_{k=0}^{∞} y_k z^{−k},

where z is a complex variable. Applying this transformation to the sequence y_n results in the Z-transform Y(z), a function in the complex z-plane. The Z-transform is important because it is a generalization of the discrete Fourier transform (DFT). If we evaluate the Z-transform at discrete points on the unit circle, i.e. for values of z = e^{2πim/N} where 0 ≤ m ≤ N−1, the Z-transform for a particular value of m is Σ_{k=0}^{N−1} y_k e^{−2πikm/N}, which equals the term in the DFT formula in section (7.7). This means that we can get the frequency response of any system described by a Z-transform by evaluating the Z-transform for values of z on the upper half of the unit circle, i.e. for values of z = e^{2πifT}, where the frequency f runs from zero to the Nyquist frequency 1/2T.
A more mathematically refined introduction to the Z-transform can be found in, for example, Papoulis [1988]; here we will only use some of the properties of the Z-transform that are convenient for us. If we define the operation ZT(y_n) as applying the Z-transform to all the samples of a signal y_n, we can write ZT(y_n) = Y(z), i.e. the result of applying the Z-transform to the signal y_n is Y(z). The inverse operation can then be written as ZT^{−1}(Y(z)) = y_n. Although we will not use the inverse Z-transform, we will use the notation y_n ⇔ Y(z) to denote a Z-transform pair. The most important properties of the Z-transform are:

1. a·y_n ⇔ a·Y(z). If Y(z) is the Z-transform of y_n then, to obtain the Z-transform of a·y_n, i.e. the signal multiplied by a common scale factor a, multiply the Z-transform of y_n by this factor, and vice versa. With the notation developed above we can also write this as ZT(a·y_n) = a·ZT(y_n) and ZT^{−1}(a·Y(z)) = a·ZT^{−1}(Y(z)).

2. a·x_n + b·y_n ⇔ a·X(z) + b·Y(z). The Z-transform is a linear operation.

3. y_{n−1} ⇔ z^{−1}·Y(z). The notation y_{n−1} means a time shift of the signal by one sample back, i.e. the signal is shifted one step to the left. Repeating this shift gives the more general result y_{n−k} ⇔ z^{−k}·Y(z): the Z-transform of a signal shifted backward in time over k sample times is obtained by multiplying its Z-transform by a factor z^{−k}.

4. x_n ∗ y_n ⇔ X(z)·Y(z). Convolution in the sampled domain is like multiplication in the z-domain.
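Property 4 is easy to verify numerically: the coefficient sequence of the product of two polynomials (here in powers of z^{-1}) is exactly the convolution of the two coefficient sequences. A small Python sketch with made-up coefficients:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # coefficients of X(z) in powers of z^-1
y = np.array([4.0, 5.0])           # coefficients of Y(z)
conv = np.convolve(x, y)           # x_n convolved with y_n
prod = np.polymul(x, y)            # coefficients of the product X(z)Y(z)
print(conv, prod)                  # identical sequences
```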
These properties are all we need to describe the frequency characteristics of a digital filter. For example, applying the Z-transform to the filter equation y_n = a·x_n + b·y_{n−1} + c·y_{n−2} results in

ZT(y_n) = ZT(a·x_n) + ZT(b·y_{n−1}) + ZT(c·y_{n−2}).

This translates to

Y(z) = a·X(z) + b·z^{−1}·Y(z) + c·z^{−2}·Y(z).
[2] Analog systems, for example electrical circuits, are typically described by differential equations, and there the Laplace transform is the preferred tool.
Collecting the Y(z) terms and dividing defines the transfer function

H(z) = Y(z) / X(z).

This definition states that if you divide the output (representation) of a filter by its input (representation) you get the characteristics of the filter. For the formant filter this turns out to be

H(z) = Y(z)/X(z) = a / (1 − b·z^{−1} − c·z^{−2}).    (16.12)
This equation can be put in another form by extracting the z^{−2} from the denominator, which results in

H(z) = a·z² / (z² − b·z − c).

The denominator of this quotient is a second degree polynomial in z with real coefficients, i.e. 1, −b and −c. Its zeros are located at

z_{1,2} = ( b ± √(b² + 4c) ) / 2.    (16.13)

This shows that the zeros are either real, or complex if b² + 4c < 0. If they are complex then always z_2 = z_1*, i.e. they form a complex conjugate pair. For now we only consider the complex zeros because they will turn out to be more interesting than the real ones. H(z) can now be expressed as

H(z) = a·z² / ( (z − z_1)(z − z_1*) ).    (16.14)
16.4.1. Stability of the response in terms of poles

We first express the poles in polar coordinates: z_{1,2} = r·e^{±iθ}. A Z-transform of the form of (16.12), i.e. with two complex conjugate poles, comes from a sampled signal that can be written as y_n ∝ e^{n·ln r} sin(nθ). This is an oscillating sine function multiplied by an exponential function. If ln r < 0 the exponent becomes negative and we get an exponentially damped sine; for large enough sample indices n the sample values go to zero. This only happens if r is less than one, which is the case if the pole is located within the unit circle. For values of r greater than one, the n·ln r term in the exponent becomes positive and the sample values keep growing towards infinity: such a system is unstable. If r equals one there is no damping at all and an oscillating signal of constant amplitude results. It is possible to generalize this to a system with more poles than the two considered here: stable systems have their poles located within the unit circle.
16.4.2. The frequency response

The frequency response of H(z) can be obtained by evaluating H(z) for values of z on the upper half of the unit circle. This means evaluation for values of z = e^{2πifT}, where f runs from 0 to the Nyquist frequency 1/2T. As an example let us evaluate the frequency response of the filter in equation (16.14). We are only interested in the amplitude response and leave phase out of the picture. We write

|H(z)| = |a·z²| / |(z − z_1)(z − z_1*)| = |a| / ( |z − z_1| · |z − z_1*| ),    (16.15)

since |z| = 1 on the unit circle.
This lends itself to a geometrical interpretation, as is shown in figure (16.1).

Figure 16.1.: The positions of the poles of a formant filter in the z-plane (left) determine the frequency response (right).

On the left the z-plane is drawn. The unit circle is shown and the two conjugate poles are indicated with a cross. On the right the frequency response is shown as the point z = e^{2πifT} moves on the
unit circle. The point starts at z = 0 which corresponds to a frequency of 0 Hz, then travels
on the top half of the unit circle, through z = i which corresponds to a frequency of one
half of the Nyquist and then ends at z = 1 which corresponds to the Nyquist frequency.
The frequency response at any of these positions, for example, at the position labeled as f1 ,
is the inverse of the product of the distances of the point to both poles as equation (16.15)
says. For a point f1 these two distances are indicated with red dotted lines. In the right figure
the (logarithm of the) inverse product of these distances is shown with a red dotted line. For
another point f2 these distances are shown in blue. The resulting response is smaller than for
the point f1 because both distances are larger and their product is therefore also larger and the
response being the inverse of this product will be smaller. If the point z is near the pole z1 one
of distances becomes very small and the product of the two distances will be small too. The
inverse will be large then, hence a peak in the response. We can deuce from this that the closer
a pole lies to the unit circle, the higher the peak will be and smaller its bandwidth.
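As a numerical illustration (not from the book), the geometric recipe of equation (16.15) can be evaluated directly from the pole positions; the sampling frequency, formant frequency and bandwidth below are arbitrary assumed values:

```python
# Sketch: evaluate |H(z)| = 1 / (|z - z1| * |z - conj(z1)|) on the upper half
# of the unit circle, z = exp(2*pi*1j*f*T).  All numeric settings are assumptions.
import cmath
import math

fs = 10000.0                              # assumed sampling frequency, T = 1/fs
T = 1.0 / fs
f_pole, bandwidth = 500.0, 50.0           # assumed formant frequency and bandwidth
r = math.exp(-math.pi * bandwidth * T)    # pole radius; nearer to 1 -> sharper peak
z1 = r * cmath.exp(2j * math.pi * f_pole * T)

def response(f):
    # distance-product interpretation of equation (16.15), with |a| taken as 1
    z = cmath.exp(2j * math.pi * f * T)
    return 1.0 / (abs(z - z1) * abs(z - z1.conjugate()))
```

The response is largest where the moving point z passes closest to the pole, i.e. near the pole's own frequency.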
What we showed here for a two-pole system can be generalized to more poles. If z moves
over the unit circle, then every time it comes close to a pole the response goes up. In general
the frequency response, i.e. the spectrum, will show as many peaks as there are poles in
the upper half of the z-plane.

In the LPC model each sample is written as a linear combination of the p previous samples
plus a source term:

s_n = -Σ_{k=1}^{p} a_k s_{n-k} + u_n,

and taking the Z-transform gives the transfer function

H(z) = S(z)/U(z) = 1 / (1 + Σ_{k=1}^{p} a_k z^{-k}).
This is the standard formulation if we write it as S(z) = H(z)U(z), i.e. it says that in order
to get the output S(z) you apply the filter H(z) to the input U(z). In the z-domain the
Z-transform of the output can simply be obtained by a multiplication of the Z-transforms of
the filter and the input. However, another description exists, which is called the inverse filter
formulation. Taking the Z-transform of equation (16.11) results in

E(z) = A(z)S(z),    (16.16)

where

A(z) = 1 + Σ_{k=1}^{p} a_k z^{-k}.    (16.17)

Because we don't know what the source signal U(z) is, we assumed U(z) = E(z). The
analysis filter is A(z) = 1/H(z), as can easily be verified.
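This inverse-filter relation can be sketched numerically (an illustration, not Praat's implementation): synthesize a signal through a two-pole H(z), then run it through the analysis filter A(z); the residual should reproduce the source exactly. The pole radius and angle are arbitrary assumed values.

```python
# Sketch: s_n = -a1*s_{n-1} - a2*s_{n-2} + u_n (synthesis with H(z) = 1/A(z)),
# then e_n = s_n + a1*s_{n-1} + a2*s_{n-2} (analysis with A(z)); e should equal u.
import math
import random

r, theta = 0.95, 2 * math.pi * 0.1        # assumed pole radius and angle
a1, a2 = -2 * r * math.cos(theta), r * r  # A(z) = 1 + a1*z^-1 + a2*z^-2

random.seed(1)
u = [random.uniform(-1, 1) for _ in range(200)]   # white-noise source

# forward (synthesis) filter
s = []
for n, un in enumerate(u):
    sn = un
    if n >= 1:
        sn -= a1 * s[n - 1]
    if n >= 2:
        sn -= a2 * s[n - 2]
    s.append(sn)

# inverse (analysis) filter: the residual recovers the source
e = []
for n, sn in enumerate(s):
    en = sn
    if n >= 1:
        en += a1 * s[n - 1]
    if n >= 2:
        en += a2 * s[n - 2]
    e.append(en)
```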
The denominator of H(z) is a polynomial of degree p in z whose roots, the poles of H(z), are
either real or form complex conjugate pairs. If the system H(z) is stable and all poles are
complex, then p/2 poles are located in the upper half plane. The spectrum will then show
maximally p/2 peaks, as was explained in section (16.4.2).
As we explained in the previous section, the spectrum of a vowel sound has a slope of approximately
-6 dB/oct. In general this makes the amplitude of peaks at higher frequencies
weaker than those at lower frequencies. This has a negative effect on the LPC analysis, because
these peaks would not get enough weight in the analysis procedure: the analysis automatically
matches stronger peaks better than weaker ones. Perceptually, the peaks at higher
frequencies are important because the power at higher frequencies is more concentrated on
the basilar membrane, owing to the membrane's logarithmic frequency-to-place relationship. It
would be nice if we could compensate for the negative slope of the spectrum, to give the spectral
peaks at higher frequencies the same impact as the ones at lower frequencies. By high-pass
filtering the sound before performing the actual analysis we can achieve this compensation. The
simplest high-pass filter that can accomplish this task is of the form y_n = x_n - a x_{n-1}.
The effect of this filter is a +6 dB/oct correction applied to the spectrum. The coefficient a is
always near the value one and is calculated as

a = e^(-2πFT),

where T is the sampling period and F is chosen as the frequency from which we apply the
+6 dB/oct correction. The default value for F is 50 Hz.
Figure 16.2.: The effect of pre-emphasis on the spectrum of the vowel /o:/ from a male speaker.
The spectrum on the right was pre-emphasized.

In figure 16.2 the effect of applying default pre-emphasis is shown on the spectrum of a vowel /o:/ from a male speaker.
The left spectrum is without pre-emphasis, the one on the right was pre-emphasized above
50 Hz. It is clear that the high frequency peaks have been emphasized. This will result in a
better spectral modeling from a following LPC analysis.
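The +6 dB/oct behavior of this pre-emphasis filter can be checked numerically (an illustrative sketch; the 10 kHz sampling frequency is an assumed value, F = 50 Hz is the default mentioned above):

```python
# Sketch: the pre-emphasis filter y_n = x_n - a*x_{n-1} with a = exp(-2*pi*F*T)
# boosts the spectrum by roughly +6 dB per octave for frequencies well above F.
import cmath
import math

fs, F = 10000.0, 50.0        # assumed sampling frequency; default F = 50 Hz
T = 1.0 / fs
a = math.exp(-2 * math.pi * F * T)

def gain_db(f):
    # magnitude of H(z) = 1 - a*z^-1 evaluated at z = exp(2*pi*1j*f*T)
    z = cmath.exp(2j * math.pi * f * T)
    return 20 * math.log10(abs(1 - a / z))

# gain difference over one octave, well above F: close to +6 dB
octave_step = gain_db(1000.0) - gain_db(500.0)
```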
16.6.2. The parameters of the LPC analysis
The minimum parameters that all LPC analyses need are shown in the form that appears after
choosing To LPC (burg)..., as shown in figure 16.3.

Figure 16.3.: The Sound: To LPC (burg)... form with parameter settings.

The form shows the settings for a prediction order of 10, which means that we want to match 5 formant peaks in the spectrum.
The result of applying an LPC analysis on a sound will be an object of type LPC. This object
incorporates the (local) filter characteristics of the vocal tract. The filter characteristics as a
function of time are represented in the LPC object as an array of LPC frames. Each LPC frame
contains at most p filter coefficients a_i that resulted from the analysis. Equation
(16.16) shows that if we invert the filter and apply it to the sound we get the source signal.
Figure 17.1.: The dynamic time warp of two versions of the Dutch sentence Er was eens een oud
kasteel (Eng. Once upon a time there was an old castle).

The warping path shows how time on the horizontal axis corresponds best with time on the vertical
axis. For example, if we start at time tx = 0.532 s in the horizontal sound and follow the
dotted line to where it crosses the warping path and then follow the dotted horizontal line to
the left, we arrive at time ty = 0.4081 s in the vertical sound. This means that time 0.532 s at
the horizontal time axis is warped to time 0.4081 s on the vertical time axis. And, of course,
this goes the other way around too: the time 0.4081 s on the vertical axis is warped to the
time 0.532 s on the horizontal axis. As the figure shows the warping path is not a straight
line running from bottom left to top right. If it were, both consonants and vowels in both
sounds would be stretched by an equal amount, and this almost never happens if both sounds
are naturally produced utterances. If people speak the same sentence twice, first slowly and
then fast, not all parts in the fast sound will be shortened in the same proportion.1 The DTW
algorithm tries to calculate a warping path that is optimal in some sense.
1 Speech rate is normally expressed as words per minute. However, several studies have indicated that if speakers
change their speech rate, this has effects on various levels of the temporal structure of speech.
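The core of such an algorithm can be sketched as a small dynamic program (an illustrative textbook version, not Praat's actual implementation; the absolute-difference local distance is an assumption):

```python
# Sketch of the DTW idea: fill a cumulative-distance matrix D and return the
# cost of the cheapest warping path from (0, 0) to (n, m).
def dtw(x, y):
    INF = float("inf")
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])          # local distance (assumed)
            D[i][j] = cost + min(D[i - 1][j],        # step up
                                 D[i][j - 1],        # step right
                                 D[i - 1][j - 1])    # diagonal step
    return D[n][m]
```

A "slow" sequence that repeats every value can be warped onto its "fast" counterpart at zero cost, which is exactly the non-uniform stretching the text describes.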
A frequency f in hertz is converted to a frequency in mels as

mel(f) = 2595 log10(1 + f/700),    (17.1)

and back as

f = 700 (10^(mel/2595) - 1).    (17.2)
The first formula expresses that a frequency in mels is a logarithmic function of the
frequency in hertz. The numbers in the formula were chosen such that a frequency
of 1000 Hz corresponds to a value of 1000 mels. Several other formulas exist in the
literature that relate mels to hertz.2
2 See for example the Wikipedia entry on the Mel scale for more definitions.
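A sketch of this conversion, using the common O'Shaughnessy constants (an assumption; other formulations in the literature use different constants), which indeed maps 1000 Hz to approximately 1000 mel:

```python
# Sketch: hertz <-> mel conversion with the 2595/700 constants (assumed).
import math

def hertz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hertz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```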
Figure 17.2.: From sound to mel filter coefficients. (a) Selected part of the sound in red colour,
(b) after windowing, (c) the spectrum, (d) in red the part of the spectrum with overlaid
triangular filters in black, (e) the hertz to mel function, (f) the filter values.
A. Mathematical Introduction
A.1. The sin and cos function
These functions were originally invented to characterize triangles in trigonometry, the
branch of mathematics that studies the relations of the sides and angles of triangles, the methods
of deducing from certain given parts other required parts, and the general relations between
the trigonometrical functions of arcs or angles. [1913 Webster]
[Figure A.1: a right triangle ABC with the angle θ at vertex A, hypotenuse c (side AB), opposite side a (side BC) and adjacent side b (side AC).]
The sine of the angle BAC (sin θ), for example, is defined as BC/AB, the ratio between the
opposite side and the hypotenuse of a right triangle. The cosine (cos θ) of this angle is defined
as the ratio between the adjacent side and the hypotenuse (AC/AB). It is clear that, since the
hypotenuse of a triangle is always the longest side, the maximum value these two functions
can attain is one. Note that if we make the length of AB equal to one, the sine and cosine
reduce to the length of the sides BC and AC, respectively.
Table A.1.: The triangle relations

        Denomination          Value
sin θ   opposite/hypotenuse   a/c
cos θ   adjacent/hypotenuse   b/c
tan θ   opposite/adjacent     a/b
In general we are not interested in the sine of one particular angle; we want to know how
the value of the sine varies as the angle varies. We like to consider the sine as a function
that varies with its one variable: its angle. As a consequence the definitions above need to
be extended: the triangle is placed in a circle with the point A at the centre and B on the
circumference, i.e. we position the triangle in the unit circle, a circle with a radius of length
one. The generalization is then simple: we now allow the point B to travel along the circle and
define the sine as the length of BC and the cosine as the length of AC.
[Figure: the unit circle with centre A; for a point B on the circle, cos θ is the length AC along the x-axis and sin θ the length BC along the y-axis.]
Another way of viewing this is to see the cosine as the projection of the point B on the horizontal
x-axis and the sine as its projection on the vertical y-axis. With respect to the centre A, we
define the lengths to the right as positive, the lengths to the left as negative, the lengths above
A as positive and the lengths below A as negative. We now allow the point B to move on the
circle, and draw the cosine and the sine as a function of the distance that the point B travels.
We start when B lies on the point D and make a left turn, i.e. we go upwards.
[Figure: the cosine and sine drawn as functions of the distance traveled along the unit circle, for distances between -2π and 2π.]
If we start at D, where AC=1 and BC=0, and make a quarter left turn, we have AC=0 and
BC=1, and B has traveled a distance of one fourth of the circumference of the circle, 2π/4.1

1 To travel around a square with sides of length l, we have to travel a distance of 4l. This means that the relation
between the circumference and the diameter is exactly 4 for a square. The relation between the circumference and
the diameter of a circle can not be calculated that easily. It is not an integer number, and it is not a rational number
(i.e. it can not be expressed as p/q, the quotient of two integer numbers p and q). The irrational number
The symmetry of a function is defined with respect to its behavior for positive and negative
values of its argument.

For a symmetric function: f(-x) = f(x). The function values at equal positive and negative
distances from the origin are the same. Figure A.3 shows that the cosine function
is symmetric.

For an antisymmetric function: f(-x) = -f(x). The function values at equal positive
and negative distances from the origin are each other's opposite. The sine function is
antisymmetric.

For two functions f1(x) and f2(x), table A.2 shows the symmetry of the product function
f1(x)f2(x) and the quotient function f1(x)/f2(x). The first line at the third column reads
that if f1(x) is a symmetric function (S) and f2(x) is an antisymmetric function, then the
product function f1(x)f2(x) is an antisymmetric function (A). As we can see from the table,
the relations with respect to symmetry for the quotient function equal the relations for the
product function. This is because the functions f2(x) and 1/f2(x) have the same symmetry.
For example, the tangent function (see section A.2) is antisymmetric because it is the quotient
of the antisymmetric sine function and the symmetric cosine function.
π is involved: the circumference of a circle equals π times its diameter, or 2π times its radius. The number π is
approximately 3.1415...
Table A.2.: Symmetry of products and quotients of symmetric (S) and antisymmetric (A) functions

f1(x)   f2(x)   f1(x)f2(x)   f1(x)/f2(x)
S       A       A            A
S       S       S            S
A       S       A            A
A       A       S            S
Every function f(x) can be split into a symmetric part fs(x) and an antisymmetric part fa(x),
such that f(x) = fs(x) + fa(x), by:

fs(x) = (f(x) + f(-x))/2
fa(x) = (f(x) - f(-x))/2

You can try it for example with the function f(x) = x + x^2, which has the symmetric part x^2
and the antisymmetric part x.
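This decomposition is easy to check numerically; a minimal sketch using the example f(x) = x + x^2 from the text:

```python
# Sketch: split a function into its symmetric and antisymmetric parts.
def symmetric_part(f, x):
    return (f(x) + f(-x)) / 2.0

def antisymmetric_part(f, x):
    return (f(x) - f(-x)) / 2.0

# the example from the text: symmetric part x**2, antisymmetric part x
f = lambda x: x + x ** 2
```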
A.1.2. The sine and cosine and frequency notation
In the previous section we have shown that sine and cosine are periodic functions with period
2π. To make more explicit how many periods we want to travel, we could write sin(2πx)
instead of sin θ. This notation makes it explicit that when x varies between 0 and 1, the
argument of the sine varies between 0 and 2π. Especially in physics, where the notion of
frequency is coupled to sines and cosines, we want a notation in which it is immediately obvious
how many periods occur within a certain time interval.

Consider the following notation for a signal s(t): s(t) = sin(2πft) for 0 ≤ t ≤ 1. This
function represents one second of a tone with frequency f. If we want to know how this
function behaves in this interval we have to know what the value of f is.
Figure A.4.: The function sin(2πt), drawn with a solid line, and the function sin(2π2t), drawn
with a dotted line.
In the figure we have plotted with a solid line the function s(t) for f = 1 and with a dotted
line for f = 2. We see that the solid line traces exactly one period of the sine and the dotted
line exactly two periods.
We will first show that if we add a cosine and a sine with the same argument the resulting
function can be written in an alternative way by using only a sine function. We start with
s(x) = a cos(x) + b sin(x), where initially we assume that a and b are some positive numbers.
The function s(x) is a mixture of a sine and a cosine, the outcome of the mixture is determined
by the coefficients a and b. Two extreme cases are for b = 0, then s(x) reduces to a cosine
function s(x) = a cos(x), and for a = 0, then s(x) reduces to the sine function s(x) =
b sin(x). In figure A.5 we show s(x) for three different mixtures (a, b). In the top row a = 1
and b = 1/10 and we show: on the left three periods of the function cos(x), in the middle
three periods of the function 1/10 sin(x) and on the right the result of summing these two
functions s(x) = cos(x) + 1/10 sin(x). In the middle row both a and b are equal to one and
the function on the right is the sum of three periods of cos(x) and three periods of sin(x). For
the bottom row a = 1/10 and b = 1.
A careful look at the functions in the right column shows that they all show exactly three
periods. Each period is marked with a vertical dotted line and shows a complete sine function.
Compared to the normal sin(x) function they are displaced and start with a value as if x were
not zero at the start. Well, this is exactly how we can describe this function: as a displaced sine.
We write c sin(x + θ), where the displacement θ is called the phase and c is the amplitude.
As the figure makes clear, both c and θ have a relation with the coefficients a and b of the
mixture.
We will now derive what this relation is. We start with the mixture

a cos(x) + b sin(x).    (A.1)

The trick we will use is to rewrite the a and the b. Let us assume first that a and b correspond
to the sides of a right triangle as in figure A.1. From this triangle it follows that sin θ = a/c,
cos θ = b/c and tan θ = a/b. We rewrite the first two as a = c sin θ and b = c cos θ, and the
third as θ = tan^-1(a/b) = arctan(a/b). Since c is the hypotenuse of the triangle, its length is
√(a^2 + b^2). We substitute these values for a and b in equation (A.1) to obtain

a cos(x) + b sin(x) = c sin θ cos(x) + c cos θ sin(x) = c sin(x + θ),    (A.2)

where we used one of the familiar trigonometric relations between sines and cosines of
different arguments. This equation says that any linear combination of a sine and a cosine
results in a displaced sine function. Given a displaced sine function, we can decompose it into
a mixture of a sine and a cosine. If we start from the pair a and b, we calculate c = √(a^2 + b^2)
and θ = arctan(a/b). If we start from c and θ, we calculate a = c sin θ and b = c cos θ.
These insights will be used in the next section, where we will explore how to determine from
a mixture the mixture coefficients a and b, or the alternative representation with c and θ.
A.1.4. Average value of products of sines and cosines
In this section we show a way to calculate the average value of products of sine and cosine
functions. We need a way to calculate these averages because they turn up in many problems
like for example Fourier analysis. We want to avoid the mathematics of integrals, instead we
use the related concept of the average value of a function on an interval. The average values of
the functions we use are easy to calculate: just sum the function values at regularly spaced
x-values. To be able to calculate the averages of these compounds, we start with simple sines
and cosines. In figure A.3 the sine and cosine functions are shown. Both functions have a
period of 2π. It is easy to see that, loosely speaking, the average value of the sine curve over a
stretch of one period equals zero. The sine curve in the interval from 0 to π is exactly equal
but of opposite sign to the curve in the interval from π to 2π. It doesn't even matter where
the period starts: for any stretch of one period length, the average of the sine curve will be
zero. Since the cosine curve equals a sine curve displaced over a distance of π/2, these
results also apply to the cosine. As a consequence, if an integer number of sines or cosines fits
in an interval, then the average values of these functions equal zero. This result is important
because the rest of this section will depend on it.
We will start with the investigation of the simple products as displayed in figure A.6. In
this figure all functions are displayed in the interval from 0 to 2. The first column shows one
period of a sine or cosine function that is multiplied with the sine or cosine function from the
second column. The third column shows the product function. For example in the middle row
the first column shows one period of sin(x), the second column shows one period of cos(x)
and the third shows the product sin(x) cos(x). For the product of two harmonically related
sines the product-to-sum rule gives

p_ss(x) = sin(mx) sin(nx) = (1/2) cos((m - n)x) - (1/2) cos((m + n)x).
The equation above shows that if m is not equal to n, the average of p_ss(x) is zero, because it
is the sum of two cosine functions each of which has average value zero. If m equals n, the
situation is different and p_ss(x) = 1/2 - (1/2) cos(2mx). The average of this function will be
1/2, and the result does not depend on the value of m. We conclude from this: on the interval
[0, 2π], the average of the product of two harmonically related sines is only different from
zero when the sines are equal.

For the product of a harmonically related sine and cosine we use equation (A.30) to get

p_sc(x) = sin(mx) cos(nx) = (1/2) sin((m + n)x) + (1/2) sin((m - n)x).

The average of this equation is always zero. If m equals n, the sin((m - n)x) term equals zero
and the average of the other term equals zero too.
For the other possibility, p_cc(x) = cos(mx) cos(nx), we use equation (A.31), and the results
are like the results obtained for two harmonically related sines: on the interval [0, 2π], the
average of the product of two harmonically related cosines is only different from zero when
the cosines are equal.

The summary conclusion from these products is: on the interval [0, 2π], the average of a
product of harmonically related sines and/or cosines is only different from zero when both
terms in the product are the same function.
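These averages can be checked by the sampling recipe the text suggests: sum the function values at regularly spaced points and divide by their number (a sketch; N = 1000 is an arbitrary choice):

```python
# Sketch: approximate the average of a function on [0, 2*pi] by sampling it at
# N regularly spaced points.
import math

def average(f, N=1000):
    return sum(f(2 * math.pi * k / N) for k in range(N)) / N

avg_ss_diff = average(lambda x: math.sin(2 * x) * math.sin(3 * x))  # m != n
avg_ss_same = average(lambda x: math.sin(2 * x) * math.sin(2 * x))  # m == n
avg_sc      = average(lambda x: math.sin(2 * x) * math.cos(2 * x))  # sine times cosine
```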
The results above can be used to show that when we have a mixture of a sine and a cosine
function, we can get the strength of each component by only determining specially
formed averages. Suppose we have a function s(x) = a cos(x) + b sin(x) on the interval from
0 to 2π, and the numbers a and b are unknown. How can we determine them? Let us multiply
s(x) by the function cos(x). This results in a cos(x) cos(x) + b sin(x) cos(x). If we calculate
the average of this function the second term will not contribute because of the results obtained
above. The first term will contribute the value a/2. Therefore the value for a will be twice
the calculated average of s(x) cos(x). If we multiply s(x) with a sine function, we obtain
a cos(x) sin(x) + b sin(x) sin(x) and now only the second term contributes to the average. The
value of b is twice the calculated average of s(x) sin(x).
These results are so important that we will rephrase them again: if we have a function s(x) =
a cos(x) + b sin(x) that is a mixture of a cosine and a sine component whose strengths a and
b are not known, we can calculate the strengths a and b by any of the following procedures:
1. Multiply s(x) by the function cos(x) and determine the average value of the product
function on the interval from 0 to 2π. The strength a will be two times this value. To
determine b, we multiply s(x) by the function sin(x) and determine the average value
of this product function on the interval from 0 to 2π. The value of the strength b is two times
this average.
2. The alternative procedure starts from the alternative formulation of the mixture, the
function s(x) = c sin(x + θ). If we multiply with cos(x) and apply equation (A.30) we
obtain

s(x) cos(x) = c sin(x + θ) cos(x) = (c/2) sin(θ) + (c/2) sin(2x + θ).

Now if we calculate the average of this function, only the first term contributes and the
result will be (c/2) sin(θ). Multiplication of s(x) with sin(x) and calculating the average
results in a value that equals (c/2) cos(θ). This can be shown by applying equation
(A.28). The two averages can be used to solve for c and θ.
The results above are used in Fourier analysis where the individual frequency components are
all harmonic frequencies. By multiplying with cosines and sines of the right frequency their
strengths can be determined.
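The first procedure can be sketched numerically: build a mixture with known strengths, then recover them by averaging (the strengths and N = 1000 sample points are arbitrary assumed values):

```python
# Sketch: recover the strengths a and b of s(x) = a*cos(x) + b*sin(x)
# by averaging s(x)*cos(x) and s(x)*sin(x) and doubling the results.
import math

a_true, b_true = 0.7, -1.3      # "unknown" strengths used to build the mixture

def s(x):
    return a_true * math.cos(x) + b_true * math.sin(x)

def average(f, N=1000):
    return sum(f(2 * math.pi * k / N) for k in range(N)) / N

a_est = 2 * average(lambda x: s(x) * math.cos(x))
b_est = 2 * average(lambda x: s(x) * math.sin(x))
```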
A.1.5. Fade-in and fade-out: the raised cosine window
When the amplitude at the start or the end of a sound changes abruptly, we often hear a click
sound. Most of the time these click sounds can be avoided by selecting parts of the sound
that start and end at zero crossings of the sound signal. Sometimes however, these clicks
can not be avoided. For example if we want to create fixed duration stimuli for a listening
experiment, it is almost impossible to guarantee that all these sounds start and end nicely on
zero crossings. We can arrange that the signal amplitude, instead of abruptly rising, slowly
rises at the start of the sound. This is called fade in. By multiplying the first part of the sound
with a function whose value gradually increases from zero to one over a predefined interval
we can accomplish the fade in. A five millisecond rise time is sufficient to avoid clicks. The
simplest function that accomplishes a linear rise time for a sound that starts at zero seconds
is the function x/0.005. The following script modifies the first five milliseconds of a selected
sound to fade in smoothly:
238
To fade out the last five milliseconds of a sound the function needs to fall from one to zero
within the interval. The following script modifies the last five milliseconds of a selected sound
to fade out smoothly:
Formula : "if x >= xmax - 0.005 then self*(xmax - x) / 0.005 else self fi"
The disadvantage of the linear function defined above for the fading is that it is not continuous
at the start and the end of the fade, because the slope changes instantly at these points.
We would like the slope to change gradually at the start and end of the fade. Better functions to
accomplish the fading can, for example, be based on a raised cosine. In the left plot of figure
A.7 we show the first five milliseconds of the function wo(x) = (1 + cos(2π·100x))/2 with a
solid line and of wi(x) = (1 - cos(2π·100x))/2 with a dotted line. The dotted line shows a
curve smoothly rising from zero to one, while the solid line shows a smooth transition from
one to zero. Moreover, the slopes of these curves start and end horizontally. This behavior
makes these two functions very suitable as a fade-in and a fade-out function, respectively. The
following script fades in a selected sound:

Formula : "if x <= 0.005 then self*(1 - cos(2*pi*100*x)) / 2 else self fi"
To have the fade-in or the fade-out start at any defined point x0, we have to translate these
functions, which show the desired behavior in the interval from 0 to 0.005, i.e. for x0 = 0. For
example, if we like to start the fade-in or the fade-out at say x0 = 0.13775, the following script
to fade in
x1 = x0 + 0.005
Formula : "if x >= x0 and x <= x1 then self*(1 - cos(2*pi*100*x)) /2 else self fi"
would produce the wrong result since the function wi (x) evaluated for values of x in the
interval from x0 to x0 + 0.005 shows the values in the middle plot of figure A.7 indicated with
a dotted line. The solid line shows the wo (x) function evaluated in the same interval. Both
curves show that the fade functions used in this way do not produce the right results. There
is an easy way out: arrange for the arguments of these functions that when they start at value
x = x0 they behave as if they start at x = 0. If we replace the x in both function definitions
by (x - x0), then we have accomplished our goal: the evaluation for x running from x0 to
x0 + 0.005 returns the same values as the evaluations for x between 0 and 0.005.
With the adaptation for x0, our fade-in and fade-out functions become wi(x) = (1 - cos(2π·100(x -
x0)))/2 and wo(x) = (1 + cos(2π·100(x - x0)))/2. The correct fade-in that reproduces the
right plot in figure A.7 at time x0 now reads
x1 = x0 + 0.005
Formula :
... "if x >= x0 and x <= x1 then self*(1 - cos(2*pi*100*(x - x0))) /2 else self fi"
As one might guess, there is a relation between the 0.005 s duration of the fade and the
number 100 that appears in the functions wi and wo . The duration of the fade corresponds
exactly to one half period of the cosine in these functions. We then calculate a full period
duration as 0.01 s. This corresponds to 100 periods in a second, i.e. a frequency of 100 Hz for
the cosine. In the following script a noise signal of 50 ms duration is faded in over a 10 ms
interval and faded out over an interval of the same duration. The script shows the fade-in and the
fade-out rather explicitly:
xmax = 0.05
d = 0.01
f = 1 / (2*d)
Create Sound from formula : "noise" , "Mono" , 0 , xmax , 44100 , "0"
Formula : "randomUniform( -1 , 1)"
x0 = 0
x1 = x0 + d
Formula :
... "if x >= x0 and x <= x1 then self*(1 - cos(2*pi*f*(x - x0))) /2 else self fi"
x0 = xmax - d
x1 = x0 + d
Formula :
... "if x >= x0 and x <= x1 then self*(1 + cos(2*pi*f*(x - x0))) /2 else self fi"
In figure A.8 we see the result of applying a 10 ms fade-in at the start of a noise and a 10 ms
fade-out at the end of the noise. The combined effect of the fade-in and the fade-out on this
50 ms long noise can also be visualized as the multiplication of the noise with a window signal.
This window signal has value one everywhere except at the start and the end, where the faders
work. The middle panel shows the window signal, the right panel the result of multiplying the
left noise with the middle window signal.
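The same windowing can be sketched in Python (an illustrative re-implementation, not the Praat script itself; note that (1 - cos(2π f t))/2 with f = 1/(2d) equals (1 - cos(π t/d))/2):

```python
# Sketch: a raised-cosine fade-in and fade-out applied to a 50 ms noise
# sampled at 44100 Hz, with 10 ms fades, mirroring the script above.
import math
import random

fs, xmax, d = 44100, 0.05, 0.01
random.seed(0)                   # arbitrary seed for the illustrative noise
sound = [random.uniform(-1, 1) for _ in range(int(fs * xmax))]

def window(t):
    if t < d:                    # fade in: (1 - cos)/2 rises from 0 to 1
        return (1 - math.cos(math.pi * t / d)) / 2
    if t > xmax - d:             # fade out: mirror image, falls from 1 to 0
        return (1 - math.cos(math.pi * (xmax - t) / d)) / 2
    return 1.0                   # flat part of the window

faded = [x * window(n / fs) for n, x in enumerate(sound)]
```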
In figure A.9 we show the tan(x) function for values of x between 0 and 2π, with amplitudes
limited between -15 and +15. At x = π/2 and x = 3π/2 there are two discontinuities.
The sinc function is important in digital signal analysis because it is used, among other things,
for the interpolation of band-limited signals. For example, if we want to represent a sound with a
different sampling frequency, the new sample values have to be interpolated from the old
values, and the sinc function is used to do this.

The sinc function, where sinc stands for sinus cardinalis, is defined as

sinc(x) = sin(πx) / (πx).    (A.3)
In figure A.10 we show this function. The function is symmetric because it is the product
of two antisymmetric functions: the sine function sin(πx) and the slowly decaying function
1/(πx). Because sin(πx) equals zero if x has an integer value, the sinc function equals zero at
these points too. A shorthand notation for the former notion is

sin(kπ) = 0, for k = 0, ±1, ±2, . . .

At x = 0 both the numerator and the denominator in equation (A.3) are zero, and this signals a
potential problem because the result of such a division is not defined. However, it turns out that
the sinc function behaves nicely at x = 0 and is equal to one. The reason for the nice behavior
of equation (A.3) for x = 0 stems from the fact that for very small values of x, the sin(x) is
approximately equal to its argument x, i.e. sin(x) ≈ x for small x. The smaller x is, the better
this approximation will be. The consequence is that near x = 0 the functions f(x) = x and
f(x) = sin(x) become nearly indistinguishable. Therefore, for very small x we may substitute
for the sin(πx) function its argument πx, and we may write sinc(x) ≈ πx/(πx) = 1.
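A minimal sketch of the function, with the limiting value at zero handled explicitly:

```python
# Sketch: sinc(x) = sin(pi*x)/(pi*x), defined as 1 at x = 0 (its limiting value).
import math

def sinc(x):
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)
```

For small x the quotient is already indistinguishable from 1, and at the nonzero integers the function vanishes.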
The square of this function turns up in power spectra where amplitudes are expressed in
dB. In figure A.11 we show the function 10 log(sin(πx)/(πx))^2. It shows the typical main
lobe and side lobes. The main lobe of this function is at 0 dB for x = 0, since the argument
of the log equals one there. The amplitude of the first side lobe is indicated in the figure
with a horizontal dotted line at 13.3 dB below the amplitude of the main lobe. Because
the amplitude of the sinc function decreases as 1/x, the amplitudes of the side lobes in the
squared function decrease at a rate of 6 dB/oct.
To make calculations with logarithms, we first recapitulate some elementary rules of arithmetic.
We will not be mathematically rigorous in the formulations, but try to state these rules
in popular terms. The first rule we need concerns multiplication: in the multiplication of numbers,
the order of the terms is not important, i.e. a · b = b · a. For example, 2 · 3 = 3 · 2 = 6.
Another elementary rule of arithmetic states that if we calculate a quotient like a/b, with
numerator a and denominator b, this expression can also be calculated as the product of
the numerator with the inverse of the denominator, i.e. a/b = a · (1/b). Some examples:
4/2 = 4 · (1/2) = 2 and (7/4)/(7/8) = (7/4) · (8/7) = 2. With these rules we can master
logarithms, giving the following rules for logarithms.

log 1 = 0, the logarithm of the number one equals zero.

log 10 = 1, the logarithm of 10 equals 1.

log 2 ≈ 0.3, the logarithm of the number two is approximately equal to the number 0.3.
This is probably the most often used logarithm in this book.

log(a · b) = log a + log b, the multiplication rule: the logarithm of a product of two
terms equals the sum of the logarithms of each term. Some examples: log 4 = log(2 · 2) =
log 2 + log 2 ≈ 0.3 + 0.3 = 0.6.
The decibel is a logarithmic unit of measurement that expresses the magnitude of a physical
quantity (for example power or intensity) relative to a specified or implied reference level.
Since it expresses a ratio of two quantities, it is a dimensionless unit:
Power(dB) = 10 log(P/P_ref)    (A.5)
Because power in a signal is related to the square of the amplitude, dBs can also be stated in
terms of amplitude:
Amplitude(dB) = 20 log(A/A_ref)    (A.6)
Human perception experiments have shown (refs??) that the smallest just noticeable difference
(jnd) in sound intensity is approximately 1 dB. This jnd is the smallest difference in intensity
that is detectable by a human being. The jnd is a statistical, rather than an exact quantity: from
trial to trial, the difference that a given person notices will vary somewhat, and it is therefore
necessary to conduct many trials in order to determine the threshold. The jnd usually reported
is the difference that a person notices on 50% of trials. We can now simply calculate how much
two tones have to differ in amplitude to have a 1 dB difference in intensity. If we have two
sounds with intensities I1 and I2, their intensity difference, call it y, in dB is 10 log(I1/I2).
We want to use amplitudes instead of intensities, so we use the scale factor 20 for amplitudes to
get y = 20 log(A1/A2). In the last equation we divide both sides by 20 to get log(A1/A2) =
y/20, from which we get, by using equation (A.4), A1/A2 = 10^(y/20). For two pure tones, where
the first is 1 dB less intense than the second, we calculate A1/A2 = 10^(-1/20) ≈ 0.89, and for two
tones where the first is 1 dB louder we calculate A1/A2 = 10^(+1/20) ≈ 1.12.
We can create a series of 1000 Hz pure tones that differ by a specified amount of dBs with
the following command:
Create Sound from formula : "s" , 1 , 0 , 0.05 , 44100 , "scale*0.5*sin(2*pi*1000*x)"
The following table (A.3) gives some values for scale.

Table A.3.: The scale factor for intensity differences in dB

  dB     formula         scale
  0.0    10^(0/20)       1.00
 +0.5    10^(0.5/20)     1.06
 +1.0    10^(1/20)       1.12
 -0.5    10^(-0.5/20)    0.94
 -1.0    10^(-1/20)      0.89
 -3.0    10^(-3/20)      0.71
 -6.0    10^(-6/20)      0.50

The table shows that the following
two commands result in two tones with a frequency of 1000 Hz that differ by 0.5 dB in intensity.
Create Sound from formula: "s", "Mono", 0, 0.05, 44100,
... "0.5*sin(2*pi*1000*x)"
Create Sound from formula: "s", "Mono", 0, 0.05, 44100,
... "0.94*0.5*sin(2*pi*1000*x)"
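These scale factors all follow from the single relation scale = 10^(dB/20); a quick numerical check, sketched here in Python for illustration:

```python
# An amplitude change of y dB corresponds to a scale factor of 10**(y/20);
# this reproduces the scale column of table A.3.
for db in [0.0, 0.5, 1.0, -0.5, -1.0, -3.0, -6.0]:
    scale = 10 ** (db / 20)
    print(f"{db:+5.1f} dB -> scale {scale:.2f}")
```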
We have so far worked with the simplified definition of the logarithm of a given number as the power to which you need to raise 10 in order to get that number. The same definition can be given for any base b:

y = b log x ⇔ x = b^y    (A.7)
In this notation the base b is made explicit. If the base is 10 we simply write log x and the base is implicit. The multiplication rule, the quotient rule and the power rule apply to all logarithms and do not depend on the base b.
In working with the computer we often encounter the base 2 logarithm: the power to which we need to raise 2 in order to get the number:
y = 2 log x ⇔ x = 2^y    (A.8)

Some numeric examples for base 2: 2 log 1 = 0, 2 log 2 = 1, 2 log 4 = 2 log 2^2 = 2, or equivalently 2 log 4 = 2 log(2 × 2) = 2 log 2 + 2 log 2 = 1 + 1 = 2, 2 log 8 = 2 log 2^3 = 3, . . ., 2 log 2^n = n, 2 log(1/2) = -1, 2 log(1/8) = -3.
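A quick check of these base-2 values, sketched in Python:

```python
import math

# The base-2 logarithm: log2(x) is the power to which 2 must be raised to get x.
print(math.log2(1))      # 0.0
print(math.log2(4))      # 2.0
print(math.log2(8))      # 3.0
print(math.log2(1 / 2))  # -1.0
print(math.log2(1 / 8))  # -3.0
```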
Another base that deserves special attention is the one with base e.2 The logarithm with this
base has a special notation and a special name. We write the natural logarithm of x is y as
ln x = y instead of e log x = y:
y = ln x ⇔ x = e^y    (A.9)
2 The number e ≈ 2.71828 . . . is an irrational number. Jacob Bernoulli discovered the number while working on the following financial problem of compound interest. Suppose you start with a sum of 1 euro and you receive an interest of p percent per period. To calculate how much money you own after one period, you can do the following. The interest paid equals the start sum multiplied by p/100. You add this value to the money you started with and now you know how much you own. If you had multiplied your start sum by (1 + p/100) you would have obtained the same result: the money you have after any interest period can be obtained from the money you had at the start of that period multiplied by (1 + p/100). If you start with 1 euro, then after the first interest period you will own (1 + p/100). After the second interest period you will own (1 + p/100) times what you owned at the start of the second interest period, i.e. (1 + p/100) × (1 + p/100) = (1 + p/100)^2. After interest period three you will own (1 + p/100) times what you owned at the start of the third interest period, i.e. (1 + p/100) × (1 + p/100)^2 = (1 + p/100)^3. After n periods you own (1 + p/100)^n. Now we start calculating with real numbers. Start with 1 euro and an interest rate of 100%. If the interest is paid yearly, then after one year you own 1 × (1 + 100/100) = 2 euros. If you had received the interest half-yearly, i.e. in two 50% terms, then after a year you would have owned 1 × (1 + 50/100)^2 = (1 + 1/2)^2 = 2.25. After a year, receiving interest quarterly at 25%, you would own 1 × (1 + 25/100)^4 = (1 + 1/4)^4 ≈ 2.44. We continue: monthly interest at 8.33% after one year would yield 1 × (1 + 8.33/100)^12 = (1 + 1/12)^12 ≈ 2.61. Interest periods of one week would result in 1 × (1 + 1/52)^52 ≈ 2.69. Daily interest results in 1 × (1 + 1/365)^365 ≈ 2.714. Bernoulli asked what number would result if the interest period went infinitely small, i.e. what is the value of (1 + 1/n)^n for n going to infinity. Well, this number is e.
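Bernoulli's limit is easy to verify numerically; a small Python sketch of the interest calculation above:

```python
import math

# Bernoulli's compound-interest limit: with n interest periods per year at a
# rate of 100/n percent each, one euro grows to (1 + 1/n)**n euros in a year.
for n in [1, 2, 4, 12, 52, 365]:
    print(n, round((1 + 1 / n) ** n, 5))
print(math.e)  # the limit for n -> infinity: 2.718281828...
```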
  x      e^x          e^(-x)
  0      1            1
  1      2.71828      0.36788
  2      7.38906      0.13534
  3      20.0855      0.04979
  4      54.5982      0.01832
  5      148.413      0.00674
 10      22026.5      0.00005
 20      4.8 × 10^8   2.1 × 10^-9
100      2.7 × 10^43  3.7 × 10^-44

The table shows that for positive numbers x the size of exp(x) grows very fast indeed. The exponential rise of the function for x greater than zero is mirrored by the exponential decline of the function exp(-x), i.e. exp(x) for arguments smaller than zero. If x = 0 both the exponential rising and declining functions are equal to one.
A damped sine can be written as

s(t) = e^(-αt) sin(2πF t).    (A.10)

Here the parameter α is called the damping constant and the parameter F is called the frequency. Another possible way to write this function is by using the damping time τ as

s(t) = e^(-t/τ) sin(2πF t).    (A.11)

The damping time τ is the time in which the amplitude of the sine wave falls by a factor of e (≈ 2.71828). Another description is in terms of formant frequency and formant bandwidth:

s(t) = e^(-πBt) sin(2πF t).    (A.12)

Now F and B denote the formant frequency and the formant bandwidth, respectively. Each formant is modeled as a tone, the sin(2πF t) part, whose amplitude falls off exponentially, the e^(-πBt) part. The frequency of the sine is called the formant frequency, and the damping, or fall-off, is related to the bandwidth. For example, the time function of a formant with frequency 500 Hz and bandwidth 50 Hz is s(t) = exp(-π50t) sin(2π500t). What does this function look
like? The following script creates figure A.14, where we see the first 10 ms of the formant.
The first line in the script creates the formant as a sound from a formula. Its duration was chosen as 0.01 s, i.e. 10 ms. In the fourth line we draw the first 0.01 s of the formant. The sixth line draws the exponentially decaying part of the formant with a dotted line. Finally we draw a marker with the value of the exponential function exp(-π50t) at time 0.01 at the right of the figure.
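The same computation is easy to check outside Praat; a Python sketch of the formant of equation (A.12), including the envelope value 0.2079 at t = 0.01 s (the helper name `formant` is an illustrative choice):

```python
import math

# The formant of equation (A.12): s(t) = exp(-pi*B*t) * sin(2*pi*F*t),
# with formant frequency F = 500 Hz and bandwidth B = 50 Hz.
B, F, fs = 50.0, 500.0, 44100

def formant(t):
    return math.exp(-math.pi * B * t) * math.sin(2 * math.pi * F * t)

# the first 10 ms of the formant, sampled at 44100 Hz
samples = [formant(k / fs) for k in range(int(0.01 * fs))]
# the decaying envelope exp(-pi*B*t) at t = 0.01 s, the marker in figure A.14
print(round(math.exp(-math.pi * B * 0.01), 4))  # 0.2079
```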
the spectrum falls off as 10 dB/decade. In the left figure we can check that the five octaves
result in a 15 dB drop.
Because the power P of a tone and the amplitude A of a tone are related as P ∝ A^2, the amplitude is related to the power as A ∝ √P. This means that if the power spectrum falls off as 1/f, the amplitude spectrum falls off as 1/√f. The amplitude spectrum in dBs then equals 20 log(1/√f) = -10 log f.
If you run the script a form pops up in which you fill out a field labeled Interval (s). After
pushing the OK-button, a test signal is time-reversed and figure A.18 shows up in the Picture
Window. This figure shows with a solid line the amplitude of the test signal and with a dotted
line its time-reversed version for an interval duration of 0.17 seconds. As we can see, from the
dotted line, time reversion was successful.6
Let us step slowly through this example. The first three lines define the form that pops up.
In line 5, a sound named test with a duration of 0.777 seconds is created with a linearly
rising amplitude (the x in the formula part guarantees this). This is a good sound to test the
5 A script is a text that consists of Praat commands. If you run the script, the commands are executed. In appendix 4 you will find more information about scripting Praat.
6 We have deliberately chosen an interval duration that is not an integer divisor of the test signal's length, to test the correctness of the time reversal code for the last part of the sound.
time-reversal algorithm in lines 10, 11 and 12. To be able to refer to the name of this sound
in a more general way, we assign the name of the created sound to a variable s$ in line 6.
Because the code is simpler and because we like to keep the original signal intact, we create
a copy of the original sound in line 8. To give the new sound a meaningful name we copy the
name from the original sound and append the interval duration, rounded to an integer number
of milliseconds to it. Line 7 assigns the number of milliseconds to a variable ims while the
ims:0 part in the next line substitutes this number rounded down to an integer. Therefore, if
a 0.17 s interval duration is chosen, the newly created sound will be named test_170. All
the magic of this script is in the next three lines. In fact, it is only one line that, for reasons of
readability, has been laid out into three lines (in a Praat script a line that starts with three dots
signals a continuation of the previous line). This powerful one-liner starts with Formula...
which always implies that the code in the rest of the line is interpreted for each sample in the
sound afresh. See section 4.7.1.2 for more info on this command. Line 11 is the meat of the
formula, line 10 tests if the duration of the sound is an integer multiple of the chosen interval
length and if not then line 12 will treat the last incomplete interval of the sound. The value
of x is evaluated to a number between the start time of the sound, xmin, and the end time,
xmax. In the formula the div operator and the mod operator are used in specific ways:
(xxmin) div interval and (xxmin) mod interval . The former equation essentially counts in which
interval the x value lies. If we multiply this number with the duration of an interval we arrive
at the starting position of the interval. First adding the number 1, before the multiplication,
returns the end position of the interval. The latter equation calculates the relative position of
x in the interval. In line 11, a sound sample at a certain offset from the end position of an interval in the originating sound is copied to the same offset, but now with respect to the start position of the interval in the new sound.
To use this script on your own sounds, you modify the script as follows: remove line 5, where the test sound is created, and remove the last part starting at line 14. The resulting script will time-reverse the intervals in any sound selected in Praat's Objects window. If the interval length is chosen larger than the duration of the sound, the sound will be completely
time reversed! Applying the script again, now on the time-reversed sound, will result in another sound that is the time-reversed version of the time-reversed version, i.e. a sound that is
identical to the sound you started with.
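The div/mod bookkeeping of the script can be mimicked on a plain list of sample values; the following Python sketch (the helper name `reverse_intervals` is illustrative, this is not the Praat script itself) reverses successive intervals, including a possibly incomplete last one:

```python
def reverse_intervals(samples, fs, interval):
    # time-reverse successive intervals of `interval` seconds in a list of
    # sample values; the last, possibly incomplete, interval is reversed too
    n = max(1, round(interval * fs))   # samples per interval
    out = []
    for start in range(0, len(samples), n):
        out.extend(reversed(samples[start:start + n]))
    return out

print(reverse_intervals([1, 2, 3, 4, 5, 6, 7, 8], fs=1, interval=3))
# -> [3, 2, 1, 6, 5, 4, 8, 7]
```

Choosing an interval longer than the whole signal reverses the signal completely, just as described above.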
Figure A.5.: The mixture a cos(x) + b sin(x). From top row to bottom row the values for the
coefficients (a, b) are (1, 1/10), (1, 1) and (1/10, 1), respectively. For each row the
function at the right is the result of adding the two functions on the left.
Figure A.12.: The function log x.
Figure A.13.: The exponential functions exp(x), drawn with a solid line, and exp(x), drawn with
a dotted line.
Figure A.14.: A 500 Hz formant with a bandwidth of 50 Hz drawn with a solid line, showing the
exponentially declining part with a dotted line.
Figure A.17.: Time-reverse a sound: copying of a sound value, located a distance r from the end
of the interval in the original sound (O), to a position at distance r from the start of
the interval in the new sound (N).
Figure A.18.: The test signal (solid line) and its time-reversed intervals (dotted line); time axis from 0 to 0.777 s with marks at multiples of 0.17 s.
∫_a^b f(t) dt.    (A.13)
Figure A.19.: Integrating a function from t = a to t = b is the determination of the areas above
and below the horizontal axis. The top panel shows the area between the function
curve and the horizontal axis with shades of grey. The bottom panel shows an approximation of these areas with a sum of rectangular areas.
We can determine this area with a mathematical technique called integration. In figure A.19
we show an example. If the function of equation (A.13) is given by the blue curve then the
result of this equation is equivalent to the determination of the area enclosed by the horizontal
axis and the function on the interval between the points a and b. The areas above and below
the horizontal axis are shown with different shades of grey. For the calculation of the integral
the area below the horizontal axis is subtracted from the area above this axis. We will not go
into the mathematical techniques that are available to solve equations like (A.13) for different
functions f (t), but instead show a simple numerical approximation of this integral for sampled signals or sounds. The bottom panel shows the sampled representation of the function in
the top panel, however instead of a representation of the sample values as speckles or poles
it shows each sample value as a rectangle. The height of the rectangle equals the amplitude and the width equals the sampling time Δt. The area of a rectangle equals its height times its width and is therefore f(t_i)Δt, where t_i is the time at the middle of the rectangle and f(t_i) the amplitude at t_i. As we can see now, the sum of the areas of these rectangles approximates the grey area in the top panel quite well. If there are N samples between a and b, then the sum of the areas of the rectangles will be Σ_{i=1}^{N} f(t_i)Δt = Δt Σ_{i=1}^{N} f(t_i) = Δt N f̄. Here f̄ is the average value of f(t) over the interval and Δt N is the duration of the interval. This is a nice result because it shows that

for a sampled sound the integral equals the sound's average value times its duration.
In Praat we have the Sound > Query > Get mean... command to access a sound's mean value, and it is therefore easy to derive the value of the area under a sound curve.
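This average-times-duration recipe is easy to try numerically; a Python sketch that integrates sin(t) on [0, π], whose exact integral is 2:

```python
import math

# Numerical integration as "average value times duration": approximate the
# integral of sin(t) on [0, pi] from midpoint samples of the interval.
a, b, N = 0.0, math.pi, 100_000
dt = (b - a) / N
mean = sum(math.sin(a + (i + 0.5) * dt) for i in range(N)) / N
area = mean * (b - a)
print(area)  # very close to the exact value 2
```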
Figure A.20.: Interpolation examples. (a) constant, (b) linear, (c) quadratic or parabolic.
Constant interpolation The amplitude is assumed to be constant. In pane (a) of figure A.20
we show an example. The y-value at a point x that lies between x1 and x2 can be found
by extending the y-value of x1 . Therefore y = y1 .
Linear interpolation Given two points (x1, y1) and (x2, y2), we assume a straight line through the two points. The equation of a straight line is of the form y = ax + b, where a and b follow from the two points as

a = (y1 - y2)/(x1 - x2),   b = y1 - a x1.    (A.14)
Quadratic or parabolic interpolation Given three points (x1, y1), (x2, y2) and (x3, y3), we assume a parabola of the form

y = ax^2 + bx + c    (A.15)

through these points; substituting the three points gives three equations from which the coefficients a, b and c can be solved.
In pane (c) of figure A.20 we show an example. The figure already makes clear that an
interpolated y-value may be larger (or smaller) than any of the yi of the three points.
Sinc interpolation For sampled signals we can use sinc interpolation, as it perfectly reconstructs the original signal from the sample values. This interpolation is used very often in Praat because of the famous sampling theorem, which states that for a correctly sampled signal7 s(t) the original can be reconstructed from the sample values s_k as

s(t) = Σ_{k=-∞}^{+∞} s_k sinc(t - t_k),    (A.16)

where s_k represents the sample value at time t_k = t_0 + kT and T is the sampling time whose inverse 1/T is called the sampling frequency. The formula states that, given sample values s_k at regularly spaced intervals, we can calculate the value at any value of the time t by first weighing the sample values according to a sinc function and then adding them. In contrast with the interpolations defined above, the sinc interpolation is laborious because the determination of the amplitude at any one instant in time involves all sample values.
7 A correctly sampled signal can be derived from a band-limited analog signal by sampling with a sampling frequency of at least twice the bandwidth. For speech sounds that have frequencies that start at 0 Hz, the bandwidth equals the highest frequency present. The sampling frequency then needs to be at least twice the highest frequency in the sound.
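A minimal Python sketch of equation (A.16), assuming the normalized sinc and a finite number of samples (a real reconstruction would need the infinite sum, so a small truncation error remains):

```python
import math

def sinc(x):
    # normalized sinc: sin(pi*x)/(pi*x), with sinc(0) = 1
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_interpolate(samples, T, t):
    # equation (A.16): each sample s_k at time t_k = k*T contributes
    # s_k * sinc((t - t_k)/T) to the value at an arbitrary time t
    return sum(s * sinc((t - k * T) / T) for k, s in enumerate(samples))

# sample a 100 Hz sine at 1000 Hz and reconstruct a value between two samples
fs, f = 1000.0, 100.0
samples = [math.sin(2 * math.pi * f * k / fs) for k in range(1000)]
t = 500.25 / fs  # a quarter of the way between samples 500 and 501
print(sinc_interpolate(samples, 1 / fs, t), math.sin(2 * math.pi * f * t))
```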
Simulation. When a computer is being used to simulate human learning, random numbers
are required to make things realistic.
Sampling. When the number of possible cases is impractically large, a smaller random sample of cases can be investigated.
Numerical analysis. Sometimes very complicated numerical problems can be solved by using random numbers.
Computer programming. Testing a computer program with all kinds of possible and impossible inputs drawn randomly.
Decision making. In a football game, the coin flipped by the referee decides which team plays on which side.
Recreation. Rolling dice, shuffling cards and playing roulette are favorite pastimes. The
roulette wheel inspired the term Monte Carlo method to describe algorithms that use
random numbers.
The definition of what exactly a random number is, will not be treated here. We will speak
of a sequence of independent random numbers with a specified distribution: each number
was obtained by chance alone, each number in the sequence has nothing to do with the other
numbers in the sequence, and each number has a specified chance of falling in any given range
of values.
In a uniform distribution of numbers on a finite set, each number is equally probable of being drawn. With a fair die, each of the faces one to six is equally probable. Despite the fact that in the long run the numbers of ones, twos, etc. will be approximately equal, the probability of finding an exactly equal number of occurrences for each of the six possibilities is small. The probability that in a sequence of six throws you will find exactly one one, one two, one three, etc. is only 1.5%.8
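This 1.5% follows from the count of favourable orderings, 6!, out of 6^6 equally likely sequences; a one-line check in Python:

```python
import math

# Probability that six throws of a fair die show each of the six faces
# exactly once: 6!/6**6 favourable sequences out of all sequences.
p = math.factorial(6) / 6 ** 6
print(f"{100 * p:.1f}%")  # 1.5%
```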
On a computer we can generate random sequences of numbers by an algorithm called a
random number generator. The generator that is used in Praat is based on Knuth [1998,
p.187, 602] and described as a lagged Fibonacci on real numbers in [0, 1). Based on this one
random number generator all other sequences of random numbers with a specified distribution
can be generated.
One of the most frequent uses of random numbers in Praat is to create noise sounds. For
example, if we create a sound with amplitude values drawn from a random uniform distribution
we have created a white noise variant. The following line shows how to create a white noise
whose amplitude is randomly distributed in the interval (-0.5, +0.5).
Create Sound from formula: "wn", 2, 0, 1, 44100, "randomUniform(-0.5, 0.5)"
The spectrum of this white noise sound is flat as figure A.21 shows.9
Figure A.21.: Part of a random uniform noise sound and its spectrum.
On the left we show the first 5 ms of the noise sound. The amplitude varies widely within
this interval. The spectrum on the right shows that the amplitude spectrum is flat. If you run the script above again, the resulting noise sound will be different from the one in
the figure. You can execute the script above as often as you like and never will there be two
noises exactly equal! However, they will all have approximately the same flat spectrum and
you cannot hear any difference between all these noise sounds.
In the next example we will again generate random numbers but now they will be distributed
according to a normal distribution with mean zero and standard deviation one. The script is
somewhat more explicit than strictly necessary to make the relation clear between the total
number of random numbers, the number of bins in the drawing interval of the distribution and
the scaling of the normal curve.
npoints = 10000
nbins = 100
xleft = -5
xright = 5
mu = 0
s = 1
Create simple Matrix: "rg", 1, npoints, "randomGauss(mu, s)"
Erase all
9 The randomUniform function that is used in the formula part will be re-evaluated for every sample in the sound, and
each time it will return another random value from the (-0.5, 0.5) interval. More information about the Create
Sound from formula... command is given in section 4.7.1.2.
Select outer viewport: 0, 5, 0, 5
Draw distribution: 0, 0, 0, 0, xleft, xright, nbins, 0, 500, "yes"
Line width: 2
Draw function: xleft, xright, 1000, "npoints / (nbins / (xright - xleft)) *
... 1 / (s*sqrt(2*pi)) * exp(-x^2 / (2*s^2))"
Figure A.22 is the plot that results from the script above. The distribution of the generated numbers follows the normal curve rather nicely.
Figure A.22.: The distribution of random numbers generated according to a Gaussian distribution
with mean zero and standard deviation one.
Applying a time lag τ means shifting the function in time. In figure A.23 we show, for three different values of the time lag τ, how g(t + τ) and g(t - τ) behave if we know the function g(t). For this example we have chosen a function g(t) that increases linearly from zero at time t = 0 to 0.5 at time t = T0. Outside this interval the function equals zero. In the figure the following lag times were used: τ = 0, τ = 0.3T0 and τ = 0.6T0, as indicated in the first column. In the middle and right column we consider the effect of the sign of a lag on g(t): the middle column shows g(t + τ) and the right column shows g(t - τ). In the top panel, where there is no lag, i.e. τ = 0, g(t + τ) and g(t - τ) show no difference: both equal g(t). The second and third row show that g(t + τ) has the same form as g(t) but displaced to the left over a distance τ. In an analog way, g(t - τ) is also a displaced version of g(t), but now to the right. In the figure we have only considered values of τ that were greater than or equal to zero. It is not difficult to see that if the lag time becomes negative, g(t + τ) behaves as g(t - τ) for positive lags. It will therefore shift to the right. To summarize:

For positive values of the lag τ, we can get the function g(t + τ)/g(t - τ) by displacing g(t) to the left/right; for negative values of τ we can get the function g(t + τ)/g(t - τ) by displacing g(t) to the right/left.
Figure A.23.: Time-lagged versions of g(t): the first column shows the lag (0, 0.3 T0 and 0.6 T0), the middle column g(t + τ) and the right column g(t - τ).
Informally spoken, the cross-correlation function is a measure of the similarity of two wave forms as a function of the time lag of one of them. Given two functions f(t) and g(t), the cross-correlation R_fg(τ) is defined as

R_fg(τ) = ∫_{-∞}^{+∞} f(t) g(t + τ) dt,    (A.17)

where τ is a parameter called the lag time. This looks like a complicated formula: it says that to determine the cross-correlation of f(t) and g(t) for one particular value of the lag time τ you have to

1. displace the function g(t) over the distance τ to obtain g(t + τ),
2. multiply g(t + τ) with the function f(t) to obtain the product function h(t) = f(t)g(t + τ),
3. determine the integral of the product function h(t).

To obtain the cross-correlation function you have to repeat the steps above for many different values of the lag time τ. We are interested in the outcome of equation (A.17) when the two
functions f(t) and g(t) are sounds, and in the sequel we will use the words function and sound interchangeably. Now let us show how to perform the calculation above in somewhat more detail.
First of all, the integral sign means that we have to determine an area. In section A.9 we showed that the integral of a sampled function can easily be calculated by multiplying the function's average value by its duration. Therefore, once we know the sample values of a function, we can easily determine its area. Further, we have to calculate the product of a function f(t) with a lagged version of another function g(t), i.e. f(t)g(t + τ), for different values of the lag τ. Applying a time lag is explained in section A.12.1. In figure A.24 we show step by step how to calculate the cross-correlation of two simple functions f(t) and g(t).
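The three steps can also be sketched numerically; the following Python function (the name `cross_correlate` is an illustrative choice) evaluates the sampled version of equation (A.17) for a rising ramp f(t) and a constant block g(t), much like the functions in figure A.24:

```python
def cross_correlate(f, g, dt):
    # discrete version of equation (A.17): R_fg(tau) is the sum of
    # f[i]*g[i+l] over the overlapping samples, times dt; the lag
    # tau = l*dt runs from -(len(f)-1)*dt to (len(g)-1)*dt
    n, m = len(f), len(g)
    out = []
    for l in range(-(n - 1), m):
        s = sum(f[i] * g[i + l] for i in range(n) if 0 <= i + l < m)
        out.append(s * dt)
    return out

f = [0.0, 0.25, 0.5, 0.75, 1.0]  # a slowly rising ramp, like f(t) in figure A.24
g = [1.0] * 5                    # a constant block, like g(t)
print(cross_correlate(f, g, dt=1.0))
# -> [1.0, 1.75, 2.25, 2.5, 2.5, 1.5, 0.75, 0.25, 0.0]
```

The result is largest where the two functions maximally overlap, as the step-by-step description below shows.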
Figure A.24.: Calculation of the cross-correlation of two functions f(t) and g(t) for different values of the lag τ. Column 1: lag time τ. Column 2: f(t) in black and g(t + τ) in red. Column 3: product function h(t) = f(t)g(t + τ), with a red dotted line showing the average value. Column 4: the cross-correlation value at the lag position. The bottom panel shows the accumulation of the seven cross-correlation values in red together with the complete cross-correlation function in black.
The first function f (t) shows an amplitude that slowly rises from 0 to 1 between times 0 and
T1 and which is zero otherwise. The second function g(t) has a constant value of 1 between
times 0 and T2 and is zero outside this domain. We have chosen these very simple functions
because the steps in the algorithm can now easily be checked by eye. Figure A.24 shows in
each row, from left to right, the steps involved in the calculation of one value of R_fg(τ), i.e. for one particular value of the lag time τ.
1. The first column, labeled Lag, shows values for τ expressed as a fraction of the duration of g(t). We start with a somewhat arbitrary negative lag of -1.6T2 at the top panel, increase the lag by a fixed amount of 0.4T2 going to the next row, and end with a 0.8T2 positive lag at the bottom panel. In this way the seven lags almost cover the complete lag domain.
2. The second column, labeled f(t) & g(t+τ), shows the two functions. The f(t) is drawn with a black solid line and g(t + τ) is drawn with a red line; the part of g(t + τ) that does not overlap with the domain of f(t) is drawn with a dotted red line to emphasize that this part does not contribute to the cross-correlation. This column shows that the function g moves in fixed steps from the right of the time axis in the top panel to the left of the time axis in the bottom panel. It is easy to see that the maximum interval where f(t) and g(t + τ) show overlap runs from -T2 to T1 + T2. The canonical forms of both functions f(t) and g(t) are shown in the fifth row, where the lag time happens to be zero.
3. The third column, labeled f(t)g(t+τ), shows the product function h(t) of the two functions f(t) and g(t + τ) for the corresponding value of τ. The vertical scale here is the same as in the previous column. As equation (A.17) shows, the cross-correlation at a lag τ is given by the area of the product function. This area is indicated with a grey colour. As we already know, this area relates to the average value of the product function. The red dotted line in this column marks this average value over the domain from -T2 to T1 + T2. The time axis here extends, like in the previous column, from -T2 to T1 + T2. Outside this interval the functions do not overlap, so the product function will always equal zero and therefore gives no contribution to the cross-correlation. The larger the area of the product function h(t), the larger the cross-correlation. The area is largest when the two functions maximally overlap.
4. Finally, the column labeled R_fg(τ) shows the value of the cross-correlation at the corresponding lag time. This value was obtained by multiplying the average value of h(t), as shown in the previous column, by the duration of the domain of h(t), i.e. T1 + 2T2. The area enclosed by h(t) and the horizontal axis therefore equals the area of the rectangle between the red dotted line and the horizontal axis. Because positive lags in g(t + τ) result in a shift of g(t) to the left, i.e. to negative times, and negative lags result in shifts of g(t) to positive times, as we also demonstrate in section A.12.1, it is now easy to see that the domain of the cross-correlation function is from lag -T1 to lag T2. For displaying purposes only, the vertical scale in this column differs from the previous ones.
In the bottom panel the cross-correlation data of the last column are accumulated. For comparison the complete cross-correlation function has also been drawn in the panel which shows
that our calculations of the cross-correlations at the lag times were correct.10
11 If f(t) and F(ω), and g(t) and G(ω), are Fourier transform pairs, then R_fg(τ) and F*(ω)G(ω) form a Fourier transform pair too.
Figure A.25.: The cross-correlation of the sound in the column labeled f(t) with the sound in the column labeled g(t) results in the sound in the column labeled R_fg(τ). The sounds f(t) have exact duration T1, while the sounds g(t) actually are of longer duration than the T2 seconds shown.
A cross-correlation is performed by first selecting the two sounds that you want to cross-correlate and then choosing the Cross-correlate... command. Next the form of figure A.26 appears. The amplitude scaling options of the resulting sound represent:
normalize. The cross-correlation is evaluated as

R_fg(τ) = ∫_{-∞}^{+∞} f(t) g(t + τ) dt / √( ∫ f²(t) dt · ∫ g²(t) dt ).

Effectively we divide the cross-correlation by the square root of the product of the energies of the two sounds, which results in a dimensionless object.
peak 0.99. This is a pragmatic scaling to be able to play the resulting sound without audible
distortion.
The result of a cross-correlation is a new sound object in the list of objects. The dimensions of the result of the correlation with each of the first three scalings differ, and none really conforms to the dimension of a sound. A sound's amplitude is expressed in Pa, and equation (A.17) makes clear that the dimension of the cross-correlation is Pa²·s, since it is obtained by multiplying two sounds together followed by a multiplication with a time unit. Therefore, the dimension of Pa² for the "sum" scaling and the dimensionless object from the "normalize" scaling are also not in compliance with the dimension of a real sound. Nevertheless, once we have a sound object, its history may turn out to be completely irrelevant.
A.12.3. The autocorrelation
Informally spoken, the autocorrelation function of a sound shows the similarity of the sound with displaced versions of itself. The autocorrelation function is often written as r(τ), where τ is the displacement time or lag time. The autocorrelation of a sound equals the cross-correlation of the sound with itself. Given a function f(t), it is defined as

r(τ) = ∫_{-∞}^{+∞} f(t) f(t + τ) dt.    (A.18)
It can be used as a tool to find repeating patterns in a sound. The pitch algorithm in Praat uses
the autocorrelation function. For a periodic sound with period T , the autocorrelation function
will show maxima at lag times that are multiples of the period T , as we will demonstrate
shortly.
Some properties of the autocorrelation function:

The autocorrelation function is a symmetric function, i.e. r(-τ) = r(τ).

The maximum peak is always at the origin, i.e. for τ = 0. We can state this mathematically as |r(τ)| ≤ r(0). Sometimes the autocorrelation function is divided by the maximum r(0) to obtain the normalized autocorrelation function r′(τ) = r(τ)/r(0). The maximum of the normalized autocorrelation function always equals one.

The autocorrelation function of a periodic function has peaks at intervals of the period.

The autocorrelation of a white noise sound will show a peak at τ = 0 and will be almost zero for all other values of τ. This can be used to reveal hidden periodicity of a sound buried in noise.

The autocorrelation of the sum of two uncorrelated functions is the sum of the autocorrelations of these functions.
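The first two properties, and the peak at the period, can be demonstrated with a short Python sketch (tone and noise parameters loosely inspired by figure A.27; all names are illustrative):

```python
import math
import random

def autocorrelate(s, maxlag):
    # sampled version of equation (A.18), evaluated as a plain sum
    # for the lags 0 .. maxlag (in samples)
    return [sum(s[i] * s[i + lag] for i in range(len(s) - lag))
            for lag in range(maxlag + 1)]

fs, f0 = 8000, 200          # sampling frequency and tone frequency (Hz)
period = fs // f0           # 40 samples per period of the tone
random.seed(1)
noisy = [0.1 * math.sin(2 * math.pi * f0 * i / fs) + random.uniform(-0.4, 0.4)
         for i in range(2000)]

r = autocorrelate(noisy, 2 * period)
print(r[0] == max(r))               # the maximum is always at lag 0
print(r[period] > r[period // 2])   # peak at the period, dip at half a period
```

Even though the tone is buried in much stronger noise, its period still shows up as a peak in the autocorrelation, because the noise contributions are almost zero away from lag 0.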
Figure A.27.: The left column shows three different sounds of 40 ms duration. In the top panel the sound is a tone of 300 Hz with amplitude 0.1, in the middle panel it is random uniform noise with amplitudes between -0.4 and 0.4, and in the bottom panel it is the sum of these two sounds. The right column shows the autocorrelation of the corresponding sound in the left column, for positive lag times only.
We now show how the noise cancellation by the autocorrelation function can help find the period of a sound buried in noise.
An autocorrelation is performed by first selecting a sound and then choosing the Autocorrelate... command. Next the form of figure A.29 appears. The amplitude scaling options of the resulting sound represent:

integral. Equation (A.18) is numerically evaluated as Σ_{i=1}^{N} f(tᵢ)f(tᵢ + τ) Δt, where Δt is the sampling period and the tᵢ are the sampling times. The resulting object has dimensions of Pa²s.

sum. Equation (A.18) is now evaluated as Σ_{i=1}^{N} f(tᵢ)f(tᵢ + τ), which differs from the integral evaluation only by a factor of Δt. The resulting object now has dimensions of Pa².

normalize. We divide the autocorrelation by the energy of the sound,

r_n(τ) = ∫_{−∞}^{+∞} f(t)f(t + τ) dt / ∫_{−∞}^{+∞} f²(t) dt,

which results in a dimensionless object. The maximum amplitude will occur at τ = 0 and be equal to one.

peak 0.99. This is a pragmatic scaling to be able to play the resulting sound without audible distortion.

The result of the autocorrelation is a new sound object in the list of objects. Because the autocorrelation of a sound is a symmetric function, only the values for lag times greater than or equal to zero are given.

The arguments about scaling used at the end of section A.12.2.2 apply here too: the dimensions of the result of the autocorrelation with these scalings differ, and none really conforms to the dimensions of a sound. Nevertheless, once we have a sound object, its history may turn out to be completely irrelevant.
The Romans used the symbols I (value 1), V (value 5), X (value 10), L (value 50), C (value 100), D (value 500) and M (value 1000) to represent numbers. The rules to attain the value represented by a sequence of these symbols are fairly complicated. If the symbols, from left to right, are such that the values of the symbols are not increasing, these values can be added directly. For example, in the number CXII the ordering is non-increasing: the C on the left represents the highest value, 100, and the C is followed by the X with the lower value 10. The X is followed by the I with the lower value 1 and this I is followed by another I. This makes the sequence non-increasing. Its value can be calculated as C+X+I+I = 100+10+1+1 = 112.
Because the Romans did not allow writing more than three symbols with the same value next to each other, they needed a way to write, for example, 4 (≠ IIII), 9 (≠ VIIII), 14 (≠ XIIII) or 40 (≠ XXXX). They allowed a lower symbol to be on the left of the next higher symbol, and the value of the lower symbol then had to be subtracted from that of the higher symbol. In the representation MCMLXXIX there are two lower symbols before a higher symbol: the C (= 100) is before the second M (= 1000) and the I (= 1) is before the third X. The value of this number is therefore 1000 + (1000 − 100) + 50 + 10 + 10 + (10 − 1) = 1000 + 900 + 50 + 10 + 10 + 9 = 1979.
The first 15 numbers in the Roman system are I=1, II=2, III=3, IV=4, V=5, VI=6, VII=7, VIII=8, IX=9, X=10, XI=11, XII=12, XIII=13, XIV=14, XV=15. In this system performing arithmetic, like adding numbers or multiplying them, is a horror.
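The addition and subtraction rules above can be captured in a few lines of code. The following Python sketch is an illustration, not a validator: it assumes its input is a well-formed Roman numeral and simply subtracts a symbol whenever a higher-valued symbol follows it:

```python
def roman_value(s):
    # evaluate a Roman numeral: a symbol smaller than its right
    # neighbour is subtracted, otherwise it is added
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    for i, ch in enumerate(s):
        v = values[ch]
        if i + 1 < len(s) and v < values[s[i + 1]]:
            total -= v
        else:
            total += v
    return total

assert roman_value("CXII") == 112
assert roman_value("MCMLXXIX") == 1979
assert roman_value("XIV") == 14
```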
A.14.2. The decimal system
In the decimal system that we are used to, the rules to obtain the value of written numbers are
much simpler than in the Roman system discussed above. The Romans had a system where the
value of a symbol was fixed. We use a system where the value of a symbol also depends on its
position in the sequence. For example, the numbers 12 (twelve) and 1234 (one thousand two hundred and thirty-four) both have the digit 1 in the first position from the left. However, the value of this digit in both numbers is very different. In 12 the 1 represents the value ten while
it represents the value thousand in 1234. The position of the digit in the sequence determines
its value: the more to the left in the sequence the higher its value, the more to the right the
lower its value. This is a very elegant and efficient representation of numbers. It allows us to
represent an infinite number of values with only 10 symbols, the digits from 0 to 9.
Now we will try to make the way in which the value of a sequence of symbols is constructed,
more explicit. The rule to obtain the value of a sequence of digits in a positional system like
our decimal system, is as follows: multiply the value of each symbol in the sequence with
the value corresponding with its position in the sequence and sum the results. For example,
the value of a sequence of four digits, the number 1234, can be calculated as 1 × 1000 + 2 × 100 + 3 × 10 + 4 × 1. From this representation we can see that the first position on the right is worth one, the second position from the right is worth ten, the third hundred, and the fourth position from the right thousand. We rewrite the values associated with position as powers of 10: 1234 = 1 × 10³ + 2 × 10² + 3 × 10¹ + 4 × 10⁰. The system becomes clear: the values of neighboring positions differ by a factor of 10. Mathematicians like to generalize this and write that the value of position p is 10^(p−1), if we count the positions going from right to left with the rightmost position at p = 1. Check: in the number 456712 the value of the fifth position from the right is 10000 = 10^(5−1) = 10⁴, and the symbol 5 at this position adds 50000 to the value of this number. In these numbers the value 10 plays a crucial role; it is called the base of the decimal system.
A.14.3. The general number system
Mathematicians like to formalize the rules to obtain the value of a sequence of n symbols. They write such a sequence as s_n s_{n−1} … s_3 s_2 s_1, where each of the s_p is one of the symbols from the number system. We start with the decimal system, where we know that the value of this sequence of digits can be calculated as s_n × 10^(n−1) + s_{n−1} × 10^(n−2) + … + s_4 × 10³ + s_3 × 10² + s_2 × 10¹ + s_1 × 10⁰, where each of the symbols appearing before a power of ten represents one of the digits 0, 1, …, 9. Let's define a symbol b that has the value 10. We can then write the value of the sequence as

s_n s_{n−1} … s_3 s_2 s_1 = s_n × b^(n−1) + s_{n−1} × b^(n−2) + … + s_3 × b² + s_2 × b¹ + s_1 × b⁰,    (A.19)

in which the value of a digit s_p at position p from the right is s_p × b^(p−1). This is the desired result. Check: in the number 143298 the digit at p = 5, i.e. 4, represents the value 4 × 10^(5−1) = 40000.
The last step. Up till now we have used the decimal system: base b = 10 with 10 symbols, the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. We now leave the familiar decimal system and allow the base b to be any number. A number system with base b needs b different symbols to write down its numbers.
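Equation (A.19) translates directly into code. The sketch below evaluates a digit sequence (most significant digit first) for an arbitrary base b, accumulating in Horner form so that no explicit powers are needed:

```python
def positional_value(digits, b):
    # value of s_n ... s_1 in base b: the sum of s_p * b**(p - 1),
    # accumulated in Horner form (equation (A.19))
    value = 0
    for d in digits:  # most significant digit first
        value = value * b + d
    return value

assert positional_value([1, 2, 3, 4], 10) == 1234
assert positional_value([1, 0], 16) == 16           # 10 in hexadecimal
assert positional_value([15, 15], 16) == 255        # FF in hexadecimal
assert positional_value([1, 1, 1, 0, 1], 2) == 29   # 11101 in binary
```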
In the computer world three number systems are dominant: the hexadecimal system, the octal system and the binary system. Of these three the octal system is used less frequently than the other two. The hexadecimal system uses base b = 16 and its symbols are the digits 0 to 9, which make up the first 10 symbols; the additional six are A (= 10₁₀), B (= 11₁₀), C (= 12₁₀), D (= 13₁₀), E (= 14₁₀) and F (= 15₁₀). Examples of hexadecimal numbers are 10₁₆ = 1 × 16¹ + 0 × 16⁰ = 16₁₀ and FF₁₆ = 15 × 16¹ + 15 × 16⁰ = 255₁₀.
In the binary system the base b = 2 and the two symbols used are 0 and 1. Some examples of binary numbers are 10₂ = 1 × 2¹ + 0 × 2⁰ = 2₁₀, 11101₂ = 1 × 2⁴ + 1 × 2³ + 1 × 2² + 0 × 2¹ + 1 × 2⁰ = 16 + 8 + 4 + 0 + 1 = 29₁₀, and 111₂ = 2² + 2¹ + 2⁰ = 7₁₀.
Now that the basics of number systems have been explained, we know how to calculate the decimal value of a number in a certain number system with equation (A.19). We have not yet shown how to do simple arithmetic, like addition and subtraction, in a particular number system, or how to transform numbers from one number system to any other.
Table A.5.: Conversion from decimal to binary by long division.

    213₁₀   Remainder
    106     1
    53      0
    26      1
    13      0
    6       1
    3       0
    1       1
    0       1
An easy algorithm to convert a number from decimal to binary notation is by long division
¹²For numbers between 60 and 80 they mix base 10 and base 20. They say 71 as soixante et onze, 60 + 11.
as exemplified by the first three columns in table A.5. We start with 213₁₀ and divide by 2. The result, 106, is written on the next line in the same column, the remainder, 1, in the next column. Check: 213 = 2 × 106 + 1; 213 is an odd number and odd numbers always have a remainder of 1 after division by 2. We continue the process and now divide the 106 by 2 and write the result and remainder on the next line. We continue: 53 = 2 × 26 + 1, 26 = 2 × 13 + 0, 13 = 2 × 6 + 1, 6 = 2 × 3 + 0, 3 = 2 × 1 + 1, 1 = 2 × 0 + 1. The binary representation can now be formed by arranging the numbers in the Remainder column in the following way. Get the top number, 1, and write it down; get the number in the next row, 0, and write it to the left of the previous number; continue this procedure until there are no numbers left. You now have the sequence 11010101, which forms the binary representation of 213₁₀. Check: 11010101₂ = 1 × 2⁷ + 1 × 2⁶ + 1 × 2⁴ + 1 × 2² + 1 × 2⁰ = 128 + 64 + 16 + 4 + 1 = 213₁₀. This procedure by long division works for a conversion from decimal to any other number system, as the following table shows.
The same long division converts 213₁₀ to hexadecimal (left table) and, trivially, back to base 10 (right table):

    base 16             base 10
    213₁₀  Remainder    213₁₀  Remainder
    13     5            21     3
    0      D            2      1
                        0      2

Reading the remainders from bottom to top gives 213₁₀ = D5₁₆.
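The long-division procedure works for any base, as the tables above suggest. A Python sketch (the symbol set is an assumption sufficient for bases up to 16):

```python
def to_base(n, b, symbols="0123456789ABCDEF"):
    # convert a non-negative decimal n to base b (b <= 16 with this symbol
    # set) by repeated division, writing each new remainder to the left
    if n == 0:
        return "0"
    out = ""
    while n > 0:
        n, remainder = divmod(n, b)
        out = symbols[remainder] + out
    return out

assert to_base(213, 2) == "11010101"   # table A.5
assert to_base(213, 16) == "D5"
assert to_base(213, 10) == "213"
```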
A.15. Matrices
A matrix is a rectangular table of elements. Most of the time these elements are numbers.
The horizontal and vertical lines in a matrix are called rows and columns. For example, the
following matrix A has 3 rows and 5 columns.

        (  1   2   3   4   5 )
    A = (  6   7   8   9  10 )
        ( 11  12  13  14  15 )
Any element in a matrix can be indexed by a pair of numbers as (row number, column number). For example, the element with value 8 in the matrix A has row and column indices (2,
3). The convention is that a matrix is indicated with an alphabetic character in uppercase, like
A in our example. There are various ways to indicate a matrix element. The only thing they
have in common is that the matrix element is written in lowercase. For example, the element
of matrix A with value 8 can be referred to as a2,3 or a(2, 3) or a[2][3] or a[2, 3].
A matrix with only one row is called a row vector; a matrix with only one column is called a column vector. A matrix with an equal number of rows and columns is called a square matrix.
Matrices are used in Praat all over the place. For example, to represent sounds we use matrices: the samples of a mono sound form a matrix with only one row, and the samples of a stereo sound form a matrix with two rows.
(a + bi)/(c + di) = ((a + bi)(c − di)) / ((c + di)(c − di)) = (ac + bd + (bc − ad)i) / (c² + d²).
All the trigonometric relations between sums and products of sines and cosines can easily be derived once we know about complex numbers. We start from the famous relation of Euler:

e^(iθ) = cos θ + i sin θ.    (A.20)

Substituting −θ for θ gives

e^(−iθ) = cos θ − i sin θ.    (A.21)

In the latter formula we have used the fact that the cosine is a symmetric function, i.e. cos(−θ) = cos θ, and that the sine is antisymmetric, i.e. sin(−θ) = −sin θ. Subtraction and addition of these two equations results in expressions for the sine and cosine:

sin θ = (e^(iθ) − e^(−iθ)) / (2i),    (A.22)
cos θ = (e^(iθ) + e^(−iθ)) / 2.    (A.23)
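Relations (A.20), (A.22) and (A.23) are easy to verify numerically with complex arithmetic; the angle 0.7 below is an arbitrary test value:

```python
import cmath
import math

theta = 0.7  # arbitrary test angle in radians

# equation (A.20): e^{i theta} = cos theta + i sin theta
assert cmath.isclose(cmath.exp(1j * theta),
                     complex(math.cos(theta), math.sin(theta)))

# equations (A.22) and (A.23)
sin_t = (cmath.exp(1j * theta) - cmath.exp(-1j * theta)) / 2j
cos_t = (cmath.exp(1j * theta) + cmath.exp(-1j * theta)) / 2
assert math.isclose(sin_t.real, math.sin(theta)) and abs(sin_t.imag) < 1e-12
assert math.isclose(cos_t.real, math.cos(theta)) and abs(cos_t.imag) < 1e-12
```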
¹³To complicate things: the Matrix type in Praat also consists of a table of numbers. However, these numbers represent sampled values on a domain.
These equations are all we need to derive rules for arguments of the form α ± β. We substitute θ = α + β in formula (A.20) above and obtain

e^(i(α+β)) = cos(α + β) + i sin(α + β).    (A.24)

On the other hand, we can also write

e^(i(α+β)) = e^(iα) e^(iβ) = (cos α + i sin α)(cos β + i sin β) = cos α cos β − sin α sin β + i(sin α cos β + cos α sin β).    (A.25)

For the right-hand sides of equations (A.24) and (A.25) to be equal, the real parts have to be equal and the imaginary parts have to be equal. We obtain

cos(α + β) = cos α cos β − sin α sin β,    (A.26)
sin(α + β) = sin α cos β + cos α sin β.    (A.27)

Replacing β by −β gives the corresponding expressions for cos(α − β) and sin(α − β). Combining the sum and difference formulas yields the product-to-sum relations

sin α sin β = (cos(α − β) − cos(α + β)) / 2,    (A.28)
cos α sin β = (−sin(α − β) + sin(α + β)) / 2,    (A.29)
sin α cos β = (sin(α − β) + sin(α + β)) / 2,    (A.30)
cos α cos β = (cos(α − β) + cos(α + β)) / 2.    (A.31)

For α = β these reduce to the double-angle results

sin² α = (1 − cos 2α) / 2,    (A.32)
sin α cos α = ½ sin 2α,    (A.33)
cos² α = (1 + cos 2α) / 2,    (A.34)
cos 2α = cos² α − sin² α.    (A.35)

Finally, sums of cosines can be turned into products:

cos α + cos β = (e^(iα) + e^(−iα) + e^(iβ) + e^(−iβ)) / 2
             = (e^(i(α+β)/2) (e^(i(α−β)/2) + e^(−i(α−β)/2)) + e^(−i(α+β)/2) (e^(i(α−β)/2) + e^(−i(α−β)/2))) / 2
             = (e^(i(α+β)/2) + e^(−i(α+β)/2)) (e^(i(α−β)/2) + e^(−i(α−β)/2)) / 2
             = 2 cos((α + β)/2) cos((α − β)/2).    (A.36)
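A numerical spot check of the product-to-sum relation (A.28) and the sum-to-product relation (A.36), over random test angles:

```python
import math
import random

random.seed(1)
for _ in range(100):
    a, b = random.uniform(-10, 10), random.uniform(-10, 10)
    # product-to-sum, equation (A.28)
    assert math.isclose(math.sin(a) * math.sin(b),
                        (math.cos(a - b) - math.cos(a + b)) / 2,
                        abs_tol=1e-12)
    # sum-to-product, equation (A.36)
    assert math.isclose(math.cos(a) + math.cos(b),
                        2 * math.cos((a + b) / 2) * math.cos((a - b) / 2),
                        abs_tol=1e-12)
```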
The block function b(x) equals 1 for 0 ≤ x ≤ T₀ and 0 elsewhere.    (A.38)

Its Fourier transform is

B(f) = ∫_{−∞}^{+∞} b(x) e^(−2πifx) dx
     = ∫₀^{T₀} 1 · e^(−2πifx) dx
     = [e^(−2πifx) / (−2πif)]₀^{T₀}
     = (e^(−2πifT₀) − 1) / (−2πif)
     = e^(−πifT₀) (e^(−πifT₀) − e^(πifT₀)) / (−2πif)
     = T₀ e^(−πifT₀) sin(πfT₀) / (πfT₀)
     = T₀ e^(−πifT₀) sinc(fT₀).

In the derivation of the transform we used identity (A.22) for the sine and (A.3) for the definition of the sinc function.
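The result B(f) = T₀ e^(−iπfT₀) sinc(fT₀) can be checked against a direct numerical evaluation of the integral. The block length and test frequency below are arbitrary choices:

```python
import cmath
import math

T0 = 0.01   # block length in seconds (an assumed test value)
f = 170.0   # test frequency in Hz (arbitrary)

# midpoint-rule evaluation of the transform integral over [0, T0]
n = 20000
dt = T0 / n
numeric = sum(cmath.exp(-2j * math.pi * f * (k + 0.5) * dt) * dt
              for k in range(n))

# analytic result: B(f) = T0 e^{-i pi f T0} sinc(f T0), sinc(x) = sin(pi x)/(pi x)
x = f * T0
analytic = T0 * cmath.exp(-1j * math.pi * x) * math.sin(math.pi * x) / (math.pi * x)

assert abs(numeric - analytic) < 1e-9
```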
Similarly, the time-limited tone t(x) equals sin(2πf₁x) for 0 ≤ x ≤ T₀ and 0 elsewhere.    (A.39)
Its Fourier transform is

T(f) = ∫_{−∞}^{+∞} t(x) e^(−i2πfx) dx
     = ∫₀^{T₀} sin(2πf₁x) e^(−i2πfx) dx
     = 1/(2i) ∫₀^{T₀} (e^(i2πf₁x) − e^(−i2πf₁x)) e^(−i2πfx) dx
     = 1/(2i) ∫₀^{T₀} (e^(−i2π(f−f₁)x) − e^(−i2π(f+f₁)x)) dx
     = 1/(2i) [ e^(−i2π(f−f₁)x) / (−i2π(f−f₁)) − e^(−i2π(f+f₁)x) / (−i2π(f+f₁)) ]₀^{T₀}
     = 1/(2i) ( (e^(−i2π(f−f₁)T₀) − 1) / (−i2π(f−f₁)) − (e^(−i2π(f+f₁)T₀) − 1) / (−i2π(f+f₁)) )
     = 1/(2i) ( e^(−iπ(f−f₁)T₀) (e^(−iπ(f−f₁)T₀) − e^(iπ(f−f₁)T₀)) / (−i2π(f−f₁)) − e^(−iπ(f+f₁)T₀) (e^(−iπ(f+f₁)T₀) − e^(iπ(f+f₁)T₀)) / (−i2π(f+f₁)) )
     = T₀/(2i) ( e^(−iπ(f−f₁)T₀) sinc((f − f₁)T₀) − e^(−iπ(f+f₁)T₀) sinc((f + f₁)T₀) ).
Figure A.30.: The spectral parts of a time-limited tone. For details see text.
B. Advanced scripting
B.1. Procedures
In writing larger scripts you will notice that certain blocks of lines are repeated many times at different locations in the script. For example, you may want to make a series of tones whose frequencies are not related in such a way that a simple for loop could generate them directly. One way to handle this in a for loop is by defining arrays of variables, as we did in the last part of section 4.7.1. In this section we describe another way, procedures, and we introduce local variables.
A procedure is a reusable part of a script. Unlike loops, which also contain reusable code,
the place where a procedure is defined and the place from which a procedure is called differ.
A simple example will explain. If you run the following script you will first hear a 500 Hz
tone being played, followed by a 600 Hz tone and finally a 750 Hz tone. The first line in the
Script B.1 Use of procedures in scripting.
1  @playTone : 500
2  @playTone : 600
3  @playTone : 750
4
5  procedure playTone : frequency
6      Create Sound as pure tone : "tone" , 1 , 0 , 0.2 , 44100 , frequency , 0.5 , 0.01 , 0.01
7      Play
8      Remove
9  endproc
script calls the procedure named playTone with an argument of 500.¹ This results in a tone of 500 Hz being played.
In detail: line 1 directs that the code be continued at line 5, where the procedure playTone starts. The variable frequency will be assigned the value 500 and the first line of the procedure body, line 6, will be executed. This results in a tone of 500 Hz being created. Lines 7 and 8 will
first play and after playing remove the sound. When the endproc line is reached, the execution
of the script will return to the end of line 1. The execution of the following line 2 will result
in the same sequence of code: the variable frequency will be assigned the value 600 and
execution of the procedure continues until the endproc line is reached. Then execution will
continue at line 3, and the whole cycle starts again. The effect of this script is identical to the
following script.
¹Indeed, before Praat version 5.3.46 calling a procedure had to be preceded by call, i.e. the first line of the script had to be like: call playTone 500. From Praat version 5.3.46 this way of calling is deprecated, as the new calling syntax guarantees superior argument handling.
Create Sound as pure tone : "tone" , 1 , 0 , 0.2 , 44100 , 500 , 0.5 , 0.01 , 0.01
Play
Remove
Create Sound as pure tone : "tone" , 1 , 0 , 0.2 , 44100 , 600 , 0.5 , 0.01 , 0.01
Play
Remove
Create Sound as pure tone : "tone" , 1 , 0 , 0.2 , 44100 , 750 , 0.5 , 0.01 , 0.01
Play
Remove
Although this is only a small script, it may be clear that defining a procedure can save a lot of typing: less typing means less possibility for errors to creep in. A procedure is also a nice way to isolate certain portions of the code. You can then test that part more easily and thoroughly. In script B.1 you can test the playTone procedure² for a couple of frequencies, and then it will work for all other argument frequencies.
In order to use procedures safely we have to make them somewhat more self-contained than they are now. In the next section we show how we can do that by introducing local variables.
B.1.1. Local variables
The procedure play_octave plays a tone twice as high as its argument. However, in doing
so it also changes the variable frequency. The consequence is that in line 4 of the script the
²For example, one way of testing script B.1 could be that you don't remove the created sound immediately after playing it. Instead, remove line 8, or even better: put a # sign in front of line 8 to make it a comment line. Next you call the procedure with two or three different frequency arguments. Each time the procedure is called, a new sound is added to the list of objects. Open the sounds in the sound editor and measure the duration of one period. If you zoom in you will notice that the sound editor shows both the duration of a selected segment as well as its inverse, i.e. a frequency. For example, the rectangle above the sound that indicates your selection might show 0.002734 (365.784 / s), which indicates that the selection duration is 0.002734 s; if this were one period of a sound it would correspond to 365.784 periods per second, i.e. a frequency of 365.784 Hz. You can improve the accuracy of your measurement by taking not one period but, say, ten periods. The advantage of taking ten periods is that you don't have to zoom in as often and that your measurement is more accurate. To obtain the frequency of one period you have to multiply the frequency that is shown in the selection window by ten. Check if this is in agreement with the given frequencies.
A local variable is, as the naming already suggests, only locally known within the procedure. If you don't use the dot in front of the name, the variable's scope is global and its value may be changed outside the script, or your script may modify an outside variable. This can create very undesired side effects, as we have seen.³ The modified script will now correctly print Frequency = 440 because the global variable frequency will not interfere with the local variable .frequency.
³As a consequence the loop would never stop, because the loop variable would never reach the value 10…
⁴These new functions exist since Praat 5.3.48.
Script B.4 Examples of object selection in a script.
The directory tree traversal script example can be used for the TIMIT acoustic-phonetic continuous speech corpus [Lamel et al., 1986]. TIMIT is a speech corpus in which all speech sounds have been labeled with text: all spoken sentences have been labeled, all the words in a sentence have been labeled, and all phonemes in all the words have been labeled. This makes it a very valuable research tool because we have access to all the label information. The TIMIT corpus resulted from the joint efforts of several American speech research sites. The text corpus design was done by the Massachusetts Institute of Technology (MIT), Stanford Research Institute and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and has been maintained, verified and prepared for CD-ROM production by the American National Institute of Standards and Technology (NIST), and was made available via the Linguistic Data Consortium.
The TIMIT corpus contains a total of 6300 sentences. Each of 630 speakers from eight major dialect regions of the United States of America spoke ten sentences. Approximately 70% of the speakers were male and 30% were female. A speaker's dialect region was defined as that geographical area where s/he had lived during childhood years. Dialect number 8 (Army Brat) was assigned to people who had moved around a lot during their childhood and to whom no particular dialect could be attributed.
The ten sentences produced by each speaker consisted of two so-called SA-type, five SX-type and three SI-type sentences.
where <timitroot> is the root of the TIMIT directory tree and the other symbols in the path may obtain the following values:

<USAGE>    := (train | test)
<DIALECT>  := dr(1|2|3|4|5|6|7|8)
<SEX>      := (m | f), male or female
<SPEAKER>  := xyzd (three characters followed by a digit)
<SENTENCE> := (sa | si | sx)n (sentence number n)
<FILETYPE> := (wav | txt | wrd | phn)
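Given this grammar, a full file path can be assembled from its components. The following Python helper is a sketch; the exact directory layout is an assumption based on the grammar above:

```python
import os.path

def timit_path(root, usage, dialect, sex, speaker, sentence, filetype):
    # <root>/<usage>/<dialect>/<sex><speaker>/<sentence>.<filetype>
    # (this layout is an assumption based on the grammar in the text)
    return os.path.join(root, usage, dialect, sex + speaker,
                        sentence + "." + filetype)

# e.g. the file of figure B.1
print(timit_path("/data/TIMIT", "test", "dr1", "m", "jw0", "si1640", "wav"))
```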
Figure B.1.: The audio and phoneme labels for sentence si1640 from speaker mjw0 of dialect
region dr1 in the test part of TIMIT.
In case you don't know the number of entries beforehand, you have to create a table that is large enough. When you're done filling the table you can scale it down to the right size. For example, if you had created a table with 10000 entries, you can use the command Extract rows where column (text)..., "usage", "matches (regex)", ".+". This command creates a new table by only copying those rows where the element in the column labeled usage is non-empty.
We want to use the table also to be able to read all the label files. In this case we only need to append the specific extension to basepath. This is a lot simpler than first removing the .wav extension and then appending the new extension.
Script B.6 Using a system command to traverse a directory tree.
We start this script by defining the file separator as the slash (/) character. For Windows the file separator is the backslash (\). Next we use the find system command to get a file with all the file names. We read this file as a Strings object: each line in the file becomes a separate item in the Strings, i.e. a file name. We strip off the first part of each string, the root part. In the inner loop we successively strip the parts before a file separator character. Finally, we strip the last four characters, .wav, from the remaining part.
The script explicitly uses the fact that audio files only exist at the lowest level of the tree.
Once we have a list with file names we can proceed with the second step: the analysis of these files. Maybe we do not want to analyse all the files at once but make a selection. For example, we might be interested in the data from the male speakers only and use do ("Extract rows where column (text)..." , "speaker" , "starts with" , "m") to get a new, smaller table. Or we might be interested in the shibboleth sentences only: do ("Extract rows where column (text)..." , "sentence" , "starts with" , "sa"). The most elementary analysis script will process the items in the table successively. For example, suppose your TIMIT database lives in /data/TIMIT/; then you may use the directory-traversal procedure as shown in one of the skeleton scripts B.7 and B.6.
The first line calls the directory-traversal procedure in script B.6. This results in the well-known table with 6300 file name entries. Its number of rows is queried and then we loop over all rows of the table. We extract a string with the base part of the file name from the table cell at the row with index ifile and the column with label basepath, and append the .wav extension to this string. The complete file name is now known and the sound file can be read from disk. We then can do some processing on this sound and, finally, when we are done with the sound, we remove it.
Now, suppose we want to measure formant frequencies for all the vowels in the TIMIT database. In section B.5.1.1 we created a script to traverse the directory tree. We will use it here and provide a skeleton script to perform a rudimentary formant frequency analysis on all the vowels in the database. We will measure the first three formant frequencies at three positions in the vowel: at 20%, 50% and 80% of its duration. In this way we will have information for monophthongs as well as diphthongs. The following script will perform the analysis.
Script B.8 Formant analysis in TIMIT.
This is in general a good idea: analyze everything and select afterwards. Reading the file and the formant frequency analysis itself will, in general, use the major part of the analysis time. Storing somewhat more than we need only costs some memory, and we have gigabytes of it…
The regular expression .+ matches one or more characters. Therefore, we only extract those rows from the table where there is at least one character in the usage column.
.usage$ [1] = "train"
.usage$ [2] = "test"
.irow = 1
for .idir1 to 2
    # test | train level
    .dir1$ = .usage$ [.idir1]
    .path1$ = .timitroot$ + "/" + .dir1$
    .dirList1 = Create Strings as directory list :
    ... "dirList1" , .path1$
    .dirList1_n = Get number of strings
    for .idir2 to .dirList1_n
        # dr1 .. dr8
        selectObject : .dirList1
        .dir2$ = Get string : .idir2
        .path2$ = .path1$ + "/" + .dir2$
        .dirList2 = Create Strings as directory list :
        ... "dirList2" , .path2$
        .dirList2_n = Get number of strings
        for .idir3 to .dirList2_n
            # speakers
            selectObject : .dirList2
            .dir3$ = Get string : .idir3
            .path3$ = .path2$ + "/" + .dir3$
            .fileList = Create Strings as file list :
            ... "fileList" , .path3$ + "/*.wav"
            .fileList_n = Get number of strings
            for .ifile to .fileList_n
                selectObject : .fileList
                .file$ = Get string : .ifile
                .base$ = .file$ - ".wav"
                .pathc$ = .path3$ + "/" + .base$
                selectObject : .table
                Set string value : .irow , "usage" , .dir1$
                Set string value : .irow , "dialect" , .dir2$
                Set string value : .irow , "speaker" , .dir3$
                Set string value : .irow , "sentence" , .base$
                Set string value : .irow , "basepath" , .pathc$
                .irow += 1
            endfor
            removeObject : .fileList
        endfor
        removeObject : .dirList2
    endfor
    removeObject : .dirList1
endfor
selectObject : .table
endproc
Script B.7 Example skeleton script that uses the procedure in script B.5 to process all audio
files in TIMIT.
C. Scripting syntax
C.1. Variables
Variable names start with a lowercase letter and are case sensitive, i.e. aBc and abc are not the same variable. String variables end with a $; numeric variables don't.
Examples: length = 17.0, text$ = "some words"
C.1.1. Predefined variables
Examples:
if age < 3
    writeInfoLine : "toddler"
elsif age < 12
    writeInfoLine : "child"
elsif age < 20
    writeInfoLine : "teenager"
else
    writeInfoLine : "adult"
endif
C.3. Loops
C.3.1. Repeat until loop
repeat
< statements >
until expression
Repeats executing the statements between repeat and the matching until line as long as the evaluation of expression returns zero or false.
C.3.2. While loop
while expression
< statements >
endwhile
Repeats executing the statements between the while and the matching endwhile as long as
the evaluation of expression does not return zero or false.
C.3.3. For loop
for variable [ from expression_1 ] to expression_2
< statements >
endfor
If expression_1 evaluates to 1, the part between the [ and the ] can be left out as in:
for variable to expression_2
< statements >
endfor
The semantics of the first for loop are equivalent to the following while loop:
variable = expression_1
while variable <= expression_2
< statements >
variable = variable + 1
endwhile
C.3.4. Procedures
Define a procedure:
procedure nameOfTheProcedure : .argument1 , ... , .argument_n
<do something with the arguments >
endproc
# call the procedure :
@nameOfTheProcedure : arg1 , ... , arg_n
# modern syntax :
Create Sound from formula : "name" , 1 , 0 , 1 , 44100 , "sin(2*pi*500*x)"
Play

# old syntax (deprecated , very wordy) :
do ("Create Sound from formula..." , "name" , 1 , 0 , 1 , 44100 , "sin(2*pi*500*x)")
do ("Play")

# even older (deprecated) : no commas between arguments
Create Sound from formula... name 1 0 1 44100 sin(2*pi*500*x)
Play
D. Terminology
ADC An Analog to Digital Converter converts an analog electrical signal into a series of
numbers.
ADPCM A variant of DPCM that varies the size of the quantization step.
Aliasing the ambiguity of a sampled signal. See section 3.6.4 on analog to digital conversion.
Bandwidth The bandwidth of a sound is the difference between the highest frequency in the sound and the lowest frequency in the sound. The bandwidth of a filter is the difference between the two frequencies where the power of the filter has dropped to half of its maximum value (the −3 dB points).
DPCM A variant of PCM, encodes PCM values as differences between the current and the
predicted value.
Endianness Refers to the way things are ordered in computer memory. An entity that consists of 4 bytes, say the number 0x0A0B0C0D in hexadecimal notation, is stored in the memory of big-endian hardware in four consecutive bytes with contents 0x0A, 0x0B, 0x0C, and 0x0D, respectively. In little-endian hardware this 4-byte entity will be stored in four consecutive bytes as 0x0D, 0x0C, 0x0B and 0x0A. A vague analogy would be in the representation of dates: yyyy-mm-dd would be big-endian, while dd-mm-yyyy would be little-endian.
PCM Pulse Code Modulation, a digital representation of a sound. The amplitude is sampled
at a constant rate and always quantized with the same number of bits.
Sensitivity of an electronic device is the minimum magnitude of the input signal required to produce a specified output signal. For the microphone input of a sound card, for example, it is the input voltage that produces the maximum voltage the ADC accepts, if the input volume control is set to its maximum. Generally, sensitivity levels are mentioned in the specifications of all audio equipment that accepts voltage input.
S/PDIF Sony/Philips Digital Interconnect Format, a version of the AES/EBU format for
digital audio connections for consumer sound cards.
Transducer A device that converts one type of energy to another. A microphone converts acoustic energy to electric energy, while the reverse process is accomplished by a speaker. A light bulb is another transducer: it converts electrical energy into light.
Bibliography
Patti Adank, Roeland Van Hout, and Roel Smits. An acoustic description of the vowels of Northern and Southern Standard Dutch. J. Acoust. Soc. Am., 116:1729–1738, 2004.
Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc. Institute of Phonetic Sciences University of Amsterdam, 17:97–110, 1993.
D.G. Childers. Modern spectrum analysis. IEEE Press, 1978.
G. Fant. The Acoustic Theory of Speech Production. Mouton, The Hague, 1960.
Jonathan Harrington and Steve Cassidy. Techniques in Speech Acoustics. Kluwer Academic
Publishers, 1999.
Toshio Irino and Roy D. Patterson. A time-domain, level-dependent auditory filter: the gammachirp. J. Acoust. Soc. Am., 101(1):412–419, 1997.
Keith Johnson. Acoustic and Auditory Phonetics. Blackwell, 1997. ISBN 0-631-20095-9.
Dennis H. Klatt. Software for a cascade/parallel formant synthesizer. J. Acoust. Soc. Am., 67:971–995, 1980.
Dennis H. Klatt and Laura C. Klatt. Analysis, synthesis, and perception of voice quality variations among female and male talkers. J. Acoust. Soc. Am., 87:820–857, 1990.
Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming.
Addison-Wesley, third edition, 1998.
W. Koenig, H.K. Dunn, and L.Y. Lacey. The sound spectrograph. J. Acoust. Soc. Am., 18:19–49, 1946.
L. F. Lamel, R. H. Kassel, and S. Seneff. Speech database development: Design and analysis of the acoustic-phonetic corpus. In Proc. DARPA Speech Recognition Workshop, pages 100–109, 1986.
Chin-Hui Lee. On robust linear prediction of speech. IEEE Trans. on Acoustics, Speech, and Signal Processing, 36:642–649, 1988.
J. Makhoul. Linear prediction: A tutorial review. Proc. IEEE, 63:561–580, 1975.
J. D. Markel and A. H. Gray, Jr. Linear prediction of speech. Springer Verlag, Berlin, 1976.
Athanasios Papoulis. Signal analysis. McGraw-Hill, 1988.
G. E. Peterson and H. L. Barney. Control methods used in a study of the vowels. J. Acoust. Soc. Am., 24:175–184, 1952.
Louis C. W. Pols, H. R. C. Tromp, and Reinier Plomp. Frequency analysis of Dutch vowels from 50 male speakers. J. Acoust. Soc. Am., 53:1093–1101, 1973.
Rollin Rachelle. Overtone Singing Study Guide. Cryptic Voices Productions, Amsterdam,
1995.
K. Saberi and D. R. Perrot. Cognitive restoration of reversed speech. Nature, 398:760, 1999.
Hiroaki Sakoe and Seibi Chiba. Dynamic programming optimization for spoken word recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing, 26:43–49, 1978.
Kenneth N. Stevens. Acoustic phonetics. MIT Press, 2nd edition, 2000.
Stanley Smith Stevens, John Volkman, and Edwin B. Newman. A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am., 8:185–190, 1937.
306