THE PRENTICE HALL SIGNAL PROCESSING SERIES
Alan V. Oppenheim, Series Editor

BRACEWELL  Two Dimensional Imaging
BRIGHAM  The Fast Fourier Transform and Its Applications (AOD)
BUCK, DANIEL & SINGER  Computer Explorations in Signals and Systems Using MATLAB
CASTLEMAN  Digital Image Processing
COHEN  Time-Frequency Analysis
CROCHIERE & RABINER  Multirate Digital Signal Processing (AOD)
JOHNSON & DUDGEON  Array Signal Processing (AOD)
KAY  Fundamentals of Statistical Signal Processing, Vols. I & II
KAY  Modern Spectral Estimation (AOD)
LIM  Two-Dimensional Signal and Image Processing
MCCLELLAN, BURRUS, OPPENHEIM, PARKS, SCHAFER & SCHUESSLER  Computer-Based Exercises for Signal Processing Using MATLAB Ver. 5
MENDEL  Lessons in Estimation Theory for Signal Processing, Communications, and Control, 2/e
NIKIAS & PETROPULU  Higher-Order Spectra Analysis
OPPENHEIM & SCHAFER  Digital Signal Processing
OPPENHEIM & SCHAFER  Discrete-Time Signal Processing
OPPENHEIM & WILLSKY, WITH NAWAB  Signals and Systems, 2/e
ORFANIDIS  Introduction to Signal Processing
PHILLIPS & NAGLE  Digital Control Systems Analysis and Design, 3/e
QUATIERI  Discrete-Time Speech Signal Processing: Principles and Practice
RABINER & JUANG  Fundamentals of Speech Recognition
RABINER & SCHAFER  Digital Processing of Speech Signals
STEARNS & DAVID  Signal Processing Algorithms in MATLAB
TEKALP  Digital Video Processing
VAIDYANATHAN  Multirate Systems and Filter Banks
VETTERLI & KOVACEVIC  Wavelets and Subband Coding
WANG, OSTERMANN & ZHANG  Video Processing and Communications
WIDROW & STEARNS  Adaptive Signal Processing

Discrete-Time Speech Signal Processing
Principles and Practice

Thomas F. Quatieri
Massachusetts Institute of Technology
Lincoln Laboratory

Prentice Hall PTR
Upper Saddle River, NJ 07458
www.phptr.com

Library of Congress Cataloging-in-Publication Data

Quatieri, T. F. (Thomas F.)
  Discrete-time speech signal processing: principles and practice / Thomas F. Quatieri.
    p. cm. -- (Prentice-Hall signal processing series)
  Includes bibliographical references and index.
  ISBN 0-13-242942-X
  1. Speech processing systems. 2. Discrete-time systems. I. Title. II. Series.
  TK7882.S65 Q38 2001
  2001021821

Editorial/production supervision: Faye Gemmellaro
Production assistant: Jodi Sherr
Acquisitions editor: Bernard Goodwin
Editorial assistant: Michelle Vincenti
Marketing manager: Dan DePasquale
Manufacturing manager: Alexis Heydt
Cover design director: Jerry Vetta
Cover designers: Talar Agasyan, Nina Scuderi
Composition: PreTEX, Inc.

© 2002 Prentice Hall PTR
Prentice-Hall, Inc.
Upper Saddle River, NJ 07458

Prentice Hall books are widely used by corporations and government agencies for training, marketing, and resale. The publisher offers discounts on this book when ordered in bulk quantities. For more information, contact: Corporate Sales Department, Phone: 800-382-3419, Fax: 201-236-7141, Email: [email protected], or write: Prentice Hall PTR, Corporate Sales Department, One Lake Street, Upper Saddle River, NJ 07458.

MATLAB is a registered trademark of The MathWorks, Inc. All product names mentioned herein are the trademarks of their respective owners.

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 0-13-242942-X

Pearson Education LTD.
Pearson Education Australia PTY, Limited
Pearson Education Singapore, Pte. Ltd.
Pearson Education North Asia Ltd.
Pearson Education Canada, Ltd.
Pearson Education de Mexico, S.A. de C.V.
Pearson Education—Japan
Pearson Education Malaysia, Pte. Ltd.

This book is dedicated to my wife Linda and to our parents and family.
Contents

Foreword xv
Preface xvii

1 Introduction 1
  1.1 Discrete-Time Speech Signal Processing 1
  1.2 The Speech Communication Pathway 2
  1.3 Analysis/Synthesis Based on Speech Production and Perception 3
  1.4 Applications 5
  1.5 Outline of Book 7
  1.6 Summary 9
  Bibliography 9

2 A Discrete-Time Signal Processing Framework 11
  2.1 Introduction 11
  2.2 Discrete-Time Signals 11
  2.3 Discrete-Time Systems 14
  2.4 Discrete-Time Fourier Transform 15
  2.5 Uncertainty Principle 20
  2.6 z-Transform 23
  2.7 LTI Systems in the Frequency Domain 28
  2.8 Properties of LTI Systems 33
    2.8.1 Difference Equation Realization 33
    2.8.2 Magnitude-Phase Relationships 34
    2.8.3 FIR Filters 37
    2.8.4 IIR Filters 37
  2.9 Time-Varying Systems 38
  2.10 Discrete Fourier Transform 41
  2.11 Conversion of Continuous Signals and Systems to Discrete Time 43
    2.11.1 Sampling Theorem 43
    2.11.2 Sampling a System Response 45
    2.11.3 Numerical Simulation of Differential Equations 47
  2.12 Summary 47
  Exercises 48
  Bibliography 54

3 Production and Classification of Speech Sounds 55
  3.1 Introduction 55
  3.2 Anatomy and Physiology of Speech Production 57
    3.2.1 Lungs 57
    3.2.2 Larynx 58
    3.2.3 Vocal Tract 66
    3.2.4 Categorization of Sound by Source 71
  3.3 Spectrographic Analysis of Speech 72
  3.4 Categorization of Speech Sounds 77
    3.4.1 Elements of a Language 79
    3.4.2 Vowels 81
    3.4.3 Nasals 84
    3.4.4 Fricatives 85
    3.4.5 Plosives 88
    3.4.6 Transitional Speech Sounds 92
  3.5 Prosody: The Melody of Speech 95
  3.6 Speech Perception 99
    3.6.1 Acoustic Cues 99
    3.6.2 Models of Speech Perception 100
  3.7 Summary 101
  Exercises 102
  Bibliography 108

4 Acoustics of Speech Production 111
  4.1 Introduction 111
  4.2 Physics of Sound 112
    4.2.1 Basics 112
    4.2.2 The Wave Equation 115
  4.3 Uniform Tube Model 119
    4.3.1 Lossless Case 119
    4.3.2 Effect of Energy Loss 127
    4.3.3 Boundary Effects 130
    4.3.4 A Complete Model 134
  4.4 A Discrete-Time Model Based on Tube Concatenation 136
    4.4.1 Sound Propagation in the Concatenated Tube Model 137
    4.4.2 A Discrete-Time Realization 143
    4.4.3 Complete Discrete-Time Model 148
  4.5 Vocal Fold/Vocal Tract Interaction 153
    4.5.1 A Model for Source/Tract Interaction 154
    4.5.2 Formant Frequency and Bandwidth Modulation 158
  4.6 Summary 162
  Exercises 162
  Bibliography 173

5 Analysis and Synthesis of Pole-Zero Speech Models 175
  5.1 Introduction 175
  5.2 Time-Dependent Processing 176
  5.3 All-Pole Modeling of Deterministic Signals 177
    5.3.1 Formulation 177
    5.3.2 Error Minimization 181
    5.3.3 Autocorrelation Method 185
    5.3.4 The Levinson Recursion and Its Associated Properties 194
    5.3.5 Lattice Filter Formulation of the Inverse Filter 200
    5.3.6 Frequency-Domain Interpretation 205
  5.4 Linear Prediction Analysis of Stochastic Speech Sounds 207
    5.4.1 Formulation 207
    5.4.2 Error Minimization 209
    5.4.3 Autocorrelation Method 210
  5.5 Criterion of “Goodness” 210
    5.5.1 Time Domain 210
    5.5.2 Frequency Domain 212
  5.6 Synthesis Based on All-Pole Modeling 216
  5.7 Pole-Zero Estimation 220
    5.7.1 Linearization 221
    5.7.2 Application to Speech 222
    5.7.3 High-Pitched Speakers: Using Two Analysis Windows 227
  5.8 Decomposition of the Glottal Flow Derivative 228
    5.8.1 Model 228
    5.8.2 Estimation 230
  5.9 Summary 232
  Appendix 5.A: Properties of Stochastic Processes 233
    Random Processes 233
    Ensemble Averages 235
    Stationary Random Process 236
    Time Averages 236
    Power Density Spectrum 237
  Appendix 5.B: Derivation of the Lattice Filter in Linear Prediction Analysis 238
  Exercises 240
  Bibliography 251

6 Homomorphic Signal Processing 253
  6.1 Introduction 253
  6.2 Concept 254
  6.3 Homomorphic Systems for Convolution 257
  6.4 Complex Cepstrum of Speech-Like Sequences 261
    6.4.1 Sequences with Rational z-Transforms 261
    6.4.2 Impulse Trains Convolved with Rational z-Transform Sequences 265
    6.4.3 Homomorphic Filtering 266
    6.4.4 Discrete Complex Cepstrum 269
  6.5 Spectral Root Homomorphic Filtering 272
  6.6 Short-Time Homomorphic Analysis of Periodic Sequences 276
    6.6.1 Quefrency-Domain Perspective 277
    6.6.2 Frequency-Domain Perspective 279
  6.7 Short-Time Speech Analysis 281
    6.7.1 Complex Cepstrum of Voiced Speech 281
    6.7.2 Complex Cepstrum of Unvoiced Speech 286
  6.8 Analysis/Synthesis Structures 287
    6.8.1 Zero- and Minimum-Phase Synthesis 288
    6.8.2 Mixed-Phase Synthesis 290
    6.8.3 Spectral Root Deconvolution 292
  6.9 Contrasting Linear Prediction and Homomorphic Filtering 293
    6.9.1 Properties 293
    6.9.2 Homomorphic Prediction 293
  6.10 Summary 296
  Exercises 297
  Bibliography 306

7 Short-Time Fourier Transform Analysis and Synthesis 309
  7.1 Introduction 309
  7.2 Short-Time Analysis 310
    7.2.1 Fourier Transform View 310
    7.2.2 Filtering View 313
    7.2.3 Time-Frequency Resolution Tradeoffs 318
  7.3 Short-Time Synthesis 320
    7.3.1 Formulation 320
    7.3.2 Filter Bank Summation (FBS) Method 321
    7.3.3 Overlap-Add (OLA) Method 325
    7.3.4 Time-Frequency Sampling 328
  7.4 Short-Time Fourier Transform Magnitude 330
    7.4.1 Signal Representation 331
    7.4.2 Reconstruction from Time-Frequency Samples 334
  7.5 Signal Estimation from the Modified STFT or STFTM 335
    7.5.1 Heuristic Application of STFT Synthesis Methods 337
    7.5.2 Least-Squared-Error Signal Estimation from the Modified STFT 340
    7.5.3 LSE Signal Estimation from Modified STFTM 342
  7.6 Time-Scale Modification and Enhancement of Speech 343
    7.6.1 Time-Scale Modification 343
    7.6.2 Noise Reduction 349
  7.7 Summary 350
  Appendix 7.A: FBS Method with Multiplicative Modification 351
  Exercises 352
  Bibliography 361

8 Filter-Bank Analysis/Synthesis 363
  8.1 Introduction 363
  8.2 Revisiting the FBS Method 364
  8.3 Phase Vocoder 367
    8.3.1 Analysis/Synthesis of Quasi-Periodic Signals 367
    8.3.2 Applications 375
    8.3.3 Motivation for a Sinewave Analysis/Synthesis 380
  8.4 Phase Coherence in the Phase Vocoder 381
    8.4.1 Preservation of Temporal Envelope 381
    8.4.2 Phase Coherence of Quasi-Periodic Signals 385
  8.5 Constant-Q Analysis/Synthesis 386
    8.5.1 Motivation 387
    8.5.2 Wavelet Transform 388
    8.5.3 Discrete Wavelet Transform 392
    8.5.4 Applications 397
  8.6 Auditory Modeling 401
    8.6.1 AM-FM Model of Auditory Processing 403
    8.6.2 Auditory Spectral Model 406
    8.6.3 Phasic/Tonic View of Auditory Neural Processing 408
  8.7 Summary 412
  Exercises 412
  Bibliography 422

9 Sinusoidal Analysis/Synthesis 427
  9.1 Introduction 427
  9.2 Sinusoidal Speech Model 429
  9.3 Estimation of Sinewave Parameters 432
    9.3.1 Voiced Speech 435
    9.3.2 Unvoiced Speech 439
    9.3.3 Analysis System 440
    9.3.4 Frame-to-Frame Peak Matching 442
  9.4 Synthesis 445
    9.4.1 Cubic Phase Interpolation 446
    9.4.2 Overlap-Add Interpolation 450
    9.4.3 Examples 452
    9.4.4 Applications 456
    9.4.5 Time-Frequency Resolution 457
  9.5 Source/Filter Phase Model 460
    9.5.1 Signal Model 460
    9.5.2 Applications 461
  9.6 Additive Deterministic-Stochastic Model 474
    9.6.1 Signal Model 474
    9.6.2 Analysis/Synthesis 475
    9.6.3 Application to Signal Modification 477
  9.7 Summary 478
  Appendix 9.A: Derivation of the Sinewave Model 479
  Appendix 9.B: Derivation of Optimal Cubic Phase Parameters 482
  Exercises 484
  Bibliography 499

10 Frequency-Domain Pitch Estimation 503
  10.1 Introduction 503
  10.2 A Correlation-Based Pitch Estimator 504
  10.3 Pitch Estimation Based on a “Comb Filter” 505
  10.4 Pitch Estimation Based on a Harmonic Sinewave Model 509
    10.4.1 Parameter Estimation for the Harmonic Sinewave Model 510
    10.4.2 Parameter Estimation for the Harmonic Sinewave Model with A Priori Amplitude 511
    10.4.3 Voicing Detection 516
    10.4.4 Time-Frequency Resolution Perspective 519
    10.4.5 Evaluation by Harmonic Sinewave Reconstruction 522
  10.5 Glottal Pulse Onset Estimation 523
    10.5.1 A Phase Model Based on Onset Time 523
    10.5.2 Onset Estimation 525
    10.5.3 Sinewave Amplitude Envelope Estimation 527
    10.5.4 Minimum-Phase Sinewave Reconstruction 530
  10.6 Multi-Band Pitch and Voicing Estimation 531
    10.6.1 Harmonic Sinewave Model 531
    10.6.2 Multi-Band Voicing 533
  10.7 Summary 534
  Exercises 535
  Bibliography 540

11 Nonlinear Measurement and Modeling Techniques 541
  11.1 Introduction 541
  11.2 The STFT and Wavelet Transform Revisited 542
    11.2.1 Basis Representations 543
    11.2.2 Minimum Uncertainty 543
    11.2.3 Tracking Instantaneous Frequency 546
  11.3 Bilinear Time-Frequency Distributions 549
    11.3.1 Properties of a Proper Time-Frequency Distribution 549
    11.3.2 Spectrogram as a Time-Frequency Distribution 552
    11.3.3 Wigner Distribution 553
    11.3.4 Variations on the Wigner Distribution 558
    11.3.5 Application to Speech Analysis 558
  11.4 Aeroacoustic Flow in the Vocal Tract 562
    11.4.1 Preliminaries 563
    11.4.2 Early Measurements and Hypotheses of Aeroacoustic Flow in the Vocal Tract 564
    11.4.3 Aeroacoustic Mechanical Model 567
    11.4.4 Aeroacoustic Computational Model 570
  11.5 Instantaneous Teager Energy Operator 571
    11.5.1 Motivation 571
    11.5.2 Energy Measurement 572
    11.5.3 Energy Separation 577
  11.6 Summary 582
  Exercises 583
  Bibliography 592

12 Speech Coding 595
  12.1 Introduction 595
  12.2 Statistical Models 598
  12.3 Scalar Quantization 598
    12.3.1 Fundamentals 599
    12.3.2 Quantization Noise 602
    12.3.3 Derivation of the Max Quantizer 606
    12.3.4 Companding 609
    12.3.5 Adaptive Quantization 610
    12.3.6 Differential and Residual Quantization 613
  12.4 Vector Quantization (VQ) 616
    12.4.1 Approach 616
    12.4.2 VQ Distortion Measure 618
    12.4.3 Use of VQ in Speech Transmission 620
  12.5 Frequency-Domain Coding 621
    12.5.1 Subband Coding 621
    12.5.2 Sinusoidal Coding 625
  12.6 Model-Based Coding 635
    12.6.1 Basic Linear Prediction Coder (LPC) 635
    12.6.2 A VQ LPC Coder 637
    12.6.3 Mixed Excitation LPC (MELP) 638
  12.7 LPC Residual Coding 640
    12.7.1 Multi-Pulse Linear Prediction 641
    12.7.2 Multi-Pulse Modeling with Long-Term Prediction 645
    12.7.3 Code-Excited Linear Prediction (CELP) 649
  12.8 Summary 652
  Exercises 653
  Bibliography 660

13 Speech Enhancement 665
  13.1 Introduction 665
  13.2 Preliminaries 666
    13.2.1 Problem Formulation 666
    13.2.2 Spectral Subtraction 668
    13.2.3 Cepstral Mean Subtraction 671
  13.3 Wiener Filtering 672
    13.3.1 Basic Approaches to Estimating the Object Spectrum 673
    13.3.2 Adaptive Smoothing Based on Spectral Change 675
    13.3.3 Application to Speech 678
    13.3.4 Optimal Spectral Magnitude Estimation 680
    13.3.5 Binaural Representations 682
  13.4 Model-Based Processing 682
  13.5 Enhancement Based on Auditory Masking 684
    13.5.1 Frequency-Domain Masking Principles 685
    13.5.2 Calculation of the Masking Threshold 687
    13.5.3 Exploiting Frequency Masking in Noise Reduction 687
  13.6 Temporal Processing in a Time-Frequency Space 690
    13.6.1 Formulation 690
    13.6.2 Temporal Filtering 691
    13.6.3 Nonlinear Transformations of Time-Trajectories 694
  13.7 Summary 698
  Appendix 13.A: Stochastic-Theoretic Parameter Estimation 699
  Exercises 700
  Bibliography 705

14 Speaker Recognition 709
  14.1 Introduction 709
  14.2 Spectral Features for Speaker Recognition 711
    14.2.1 Formulation 711
    14.2.2 Mel-Cepstrum 712
    14.2.3 Sub-Cepstrum 715
  14.3 Speaker Recognition Algorithms 717
    14.3.1 Minimum-Distance Classifier 717
    14.3.2 Vector Quantization 718
    14.3.3 Gaussian Mixture Model (GMM) 719
  14.4 Non-Spectral Features in Speaker Recognition 725
    14.4.1 Glottal Flow Derivative 725
    14.4.2 Source Onset Timing 729
    14.4.3 Relative Influence of Source, Spectrum, and Prosody 729
  14.5 Signal Enhancement for the Mismatched Condition 733
    14.5.1 Linear Channel Distortion 734
    14.5.2 Nonlinear Channel Distortion 737
    14.5.3 Other Approaches 746
  14.6 Speaker Recognition from Coded Speech 748
    14.6.1 Synthesized Coded Speech 748
    14.6.2 Experiments with Coder Parameters 749
  14.7 Summary 751
  Appendix 14.A: Expectation-Maximization (EM) Estimation 752
  Exercises 754
  Bibliography 762

Glossary 767
  Speech Signal Processing 767
  Units 768
  Databases 768
Index 769
About the Author 781

Foreword

Speech and hearing, man’s most used means of communication, have been the objects of intense study for more than 150 years—from the time of von Kempelen’s speaking machine to the present day.
With the advent of the telephone and the explosive growth of its dissemination and use, the engineering and design of ever more bandwidth-efficient and higher-quality transmission systems has been the objective and province of both engineers and scientists for more than seventy years. This work and these investigations have been largely driven by real-world applications, which now have broadened to include not only speech synthesizers but also automatic speech recognition systems, speaker verification systems, speech enhancement systems, efficient speech coding systems, and speech and voice modification systems. The objectives of the engineers have been to design and build real, workable, and economically affordable systems that can be used over the broad range of existing and newly installed communication channels. Following the development of the integrated circuit in the 1960s, the communication channels and the end speech signal processing systems changed from analog to purely digital systems. The early laboratories involved in this major shift in implementation technology included Bell Telephone Laboratories, MIT Lincoln Laboratory, IBM Thomas Watson Research Laboratories, the BBN Speech Group, and the Texas Instruments Company, along with numerous excellent university research groups. The introduction by Texas Instruments in the 1970s of its Speak-and-Spell product, which employed extensive digital integrated circuit technology, caused the entire technical, business, and marketing communities to awaken to the endless system and product possibilities becoming viable through application of the rapidly developing integrated circuit technologies. As more powerful integrated circuits became available, the engineers would take their existing working systems and try to improve them. This meant going back and studying their existing models of speech production and analysis in order to gain a more complete understanding of the physical processes involved.
It also meant devising and bringing to bear more powerful mathematical tools and algorithms to handle the added complexity of the more detailed analysis. Certain methodologies became widely used partly because of their initial success, their viability, and their ease of analysis and implementation. It then became increasingly difficult to change an individual part of the system without affecting the other parts of the system. This logical design procedure was complicated and compromised by the ever-present reducing cost and increasing power of the digital integrated circuits used. In the midst of all this activity lay Lincoln Laboratory with its many and broad projects in the speech area. The author of this timely book has been very actively involved in both the engineering and the scientific aspects of many of those projects and has been a major contributor to their success. In addition, he has developed over the course of many years the graduate course in speech analysis and processing at MIT, the outgrowth of which is this text on the subject. In this book you will gain a thorough understanding of the basic scientific principles of speech production and hearing and the basic mathematical tools needed for speech signal representation, analysis, and manipulation. Then, through a plethora of applications, the author illustrates the design considerations, the system performance, and the careful analysis and critique of the results. You will view these many systems through the eyes of one who has been there, and one with vision and keen insight into figuring out why the systems behave the way they do and where the limitations still exist. Read carefully, think continually, question always, try out the ideas, listen to the results, and check out the extensive references. Enjoy the magic and fascination of this broad area of the application of digital technology to voice communication through the experiences of an active researcher in the field. You will
be richly rewarded.

James F. Kaiser
Visiting Professor, Department of Electrical and Computer Engineering
Duke University
Durham, NC

Preface

This text is in part an outgrowth of my MIT graduate course Digital Speech Signal Processing, which I have taught since the Fall of 1990, and in part a result of my research at MIT Lincoln Laboratory. As such, principles are never too distant from practice; theory is often followed by applications, both past and present. This text is also an outgrowth of my childhood wonder in the blending of signal and symbol processing, sound, and technology. I first felt this fascination in communicating with two cans coupled by twine, in playing with a toy Morse code, and in adventuring through old ham radio equipment in my family’s basement. My goals in this book are to provide an intensive tutorial on the principles of discrete-time speech signal processing, to describe the state of the art in speech signal processing research and its applications, and to pass on to the reader my continued wonder for this rapidly evolving field. The text consists of fourteen chapters that are outlined in detail in Chapter 1. The “theory” component of the book falls within Chapters 2–11, while Chapters 12–14 consist primarily of the application areas of speech coding and enhancement, and speaker recognition. Other applications are introduced throughout Chapters 2–11, such as speech modification, noise reduction, signal restoration, and dynamic range compression. A broader range of topics that include speech and language recognition is not covered; to do so would result in a survey book that does not fill the current need in this field. The style of the text is to show not only when speech modeling and processing methods succeed, but also to describe limitations of the methods. This style makes the reader question established ideas and reveals where advancement is needed.
An important tenet in this book is that anomaly in observation is crucial for advancement, as reflected by the late philosopher Thomas Kuhn: “Discovery commences with the awareness of anomaly, i.e., with the recognition that nature has somehow violated the paradigm-induced expectations that govern normal science.”¹

¹ T. Kuhn, The Structure of Scientific Revolutions, University of Chicago Press, 1970.

The text body is strongly supplemented with examples and exercises. Each exercise set contains a number of MATLAB problems that provide hands-on experience with speech signals and processing methods. Scripts, workspaces, and signals required for the MATLAB exercises are located on the Prentice Hall companion website (http://www.phptr.com/quatieri/). Also on this website are audio demonstrations that illustrate a variety of principles and applications from each chapter, including time-scale modification of the phrase “as time goes by” shown on the front cover of this book. The book is structured so that application areas that are not covered as separate topics are either presented as examples or exercises, e.g., speaker separation by sinusoidal modeling and restoration of old acoustic recordings by homomorphic processing. In my MIT speech processing course, I found this approach to be very effective, especially since such examples and exercises are fascinating demonstrations of the theory and can provide a glimpse of state-of-the-art applications. The book is also structured so that topics can be covered on different levels of depth and breadth. For example, a one-semester course on discrete-time speech signal processing could be taught with an emphasis on fundamentals using Chapters 2–9. To focus on the speech coding application, one can include Chapter 12, but also other applications as examples and exercises. In a two-semester course, greater depth could be given to fundamentals in the first semester, using Chapters 2–9.
In the second semester, a focus could then be given to advanced theories and applications of Chapters 10–14, with supplementary material on speech recognition. I wish to express my thanks to the many colleagues, friends, and students who provided review of different chapters of this manuscript, as well as discussions on various chapter topics and style. These include Walt Andrews, Carlos Avendano, Joe Campbell, Mark Clements, Jody and Michael Crocetta, Ron Danisewicz, Bob Dunn, Carol Espy-Wilson, Allen Gersho, Terry Gleason, Ben Gold, Mike Goodwin, Siddhartan Govindasamy, Charles Jankowski, Mark Kahrs, Jim Kemerling, Gernot Kubin, Petros Maragos, Rich McGowen, Michael Padilla, Jim Pitton, Mike Plumpe, Larry Rabiner, Doug Reynolds, Dan Sinder, Elliot Singer, Doug Sturim, Charlie Therrien, and Lisa Yanguas. In addition, I thank my MIT course students for the many constructive comments on my speech processing notes, and my teaching assistants: Babak Azifar, Ibrahim Hajjahmad, Tim Hazen, Hanfeng Yuan, and Xiaochun Yang for help in developing class exercise solutions and for feedback on my course notes. Also, in memory of Gary Kopec and Tom Hanna, who were both colleagues and friends, I acknowledge their inspiration and influence that live on in the pages of this book. A particular thanks goes to Jim Kaiser, who reviewed nearly the entire book in his characteristic meticulous and uncompromising detail and has provided continued motivation throughout the writing of this text, as well as throughout my career, by his model of excellence and creativity. I also acknowledge Bob McAulay for the many fruitful and highly motivational years we have worked together; our collaborative effort provides the basis for Chapters 9, 10, and parts of Chapter 12 on sinusoidal analysis/synthesis and its applications.
Likewise, I thank Hamid Nawab for our productive work together in the early 1980s that helped shape Chapter 7, and Rob Baxter for our stimulating discussions that helped to develop the time-frequency distribution tutorials for Chapter 11. In addition, I thank the following MIT Lincoln Laboratory management for the flexibility given me to both lecture at MIT and perform research at Lincoln Laboratory, and for providing a stimulating and open research environment: Cliff Weinstein, Marc Zissman, Jerry O’Leary, Al McLaughlin, and Peter Blankenship. I have also been very fortunate to have the support of Al Oppenheim, who opened the door for me to teach in the MIT Electrical Engineering and Computer Science Department, planted the seed for writing this book, and provided the initial and continued inspiration for my career in digital signal processing. Thanks also goes to Faye Gemmellaro, production editor; Bernard Goodwin, publisher; and others at Prentice Hall for their great care and dedication that helped determine the quality of the finished book product. Finally, I express my deepest gratitude to my wife Linda, who provided the love, support, and encouragement that was essential in a project of this magnitude and who has made it all meaningful. Linda’s voice example on the front cover of this book symbolizes my gratitude now and “as time goes by.”

Thomas F. Quatieri
MIT Lincoln Laboratory²

² This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and not necessarily endorsed by the United States Air Force.

Chapter 1
Introduction

1.1 Discrete-Time Speech Signal Processing

Speech has evolved as a primary form of communication between humans. Nevertheless, there often occur conditions under which we measure and then transform the speech signal to another form in order to enhance our ability to communicate.
One early case of this is the transduction by a telephone handset of the continuously-varying speech pressure signal at the lips output to a continuously-varying (analog) electric voltage signal. The resulting signal can be transmitted and processed electrically with analog circuitry and then transduced back by the receiving handset to a speech pressure signal. With the advent of the wonders of digital technology, the analog-to-digital (A/D) converter has entered as a further “transduction” that samples the electrical speech signal, e.g., at 8000 samples per second for telephone speech, so that the speech signal can be digitally transmitted and processed. Digital processors, with their fast speed, low cost and power, and tremendous versatility, have replaced a large part of analog-based technology. The topic of this text, discrete-time speech signal processing, can be loosely defined as the manipulation of sampled speech signals by a digital processor to obtain a new signal with some desired properties. Consider, for example, changing a speaker’s rate of articulation with the use of a digital computer. In the modification of articulation rate, sometimes referred to as time-scale modification of speech, the objective is a new speech waveform that corresponds to a person talking faster or slower than the original rate, but that maintains the character of the speaker’s voice, i.e., there should be little change in the pitch (or rate of vocal cord vibration) and spectrum of the original utterance. This operation may be useful, for example, in fast scanning of a long recording in a message playback system or slowing down difficult-to-understand speech.

[Figure 1.1 Time-scale modification as an example of discrete-time speech signal processing: speech at a normal articulation rate, from an analog cassette, is digitized, processed by a digital computer, and returned to an analog cassette at a reduced articulation rate.]

In this
application, we might begin with an analog recording of a speech utterance (Figure 1.1). This continuous-time waveform is passed through an A/D waveform converter to obtain a sequence of numbers, referred to as a discrete-time signal, which is entered into the digital computer. Discrete-time signal processing is then applied to obtain the required speech modification, which is performed based on a model of speech production and a model of how articulation rate change occurs. These speech-generation models may themselves be designed as analog models that are transformed into discrete time. The modified discrete-time signal is converted back to analog form with a digital-to-analog (D/A) converter, and then finally perhaps stored as an analog waveform or played directly through an amplifier and speakers. Although the signal processing required for a high-quality modification could conceivably be performed by analog circuitry built into a redesigned tape recorder,¹ current digital processors allow far greater design flexibility. Time-scale modification is one of many applications of discrete-time speech signal processing that we explore throughout the text.

¹ Observe that time-scale modification cannot be performed simply by changing the speed of a tape recorder because this changes the pitch and spectrum of the speech.

1.2 The Speech Communication Pathway

In the processing of speech signals, it is important to understand the pathway of communication from speaker to listener [2]. At the linguistic level of communication, an idea is first formed in the mind of the speaker. The idea is then transformed to words, phrases, and sentences according to the grammatical rules of the language. At the physiological level of communication, the brain creates electric signals that move along the motor nerves; these electric signals activate muscles in the vocal tract and vocal cords.
This vocal tract and vocal cord movement results in pressure changes within the vocal tract, and, in particular, at the lips, initiating a sound wave that propagates in space. The sound wave propagates through space as a chain reaction among the air particles, resulting in a pressure change at the ear canal and thus vibrating the ear drum. The pressure change at the lips, the sound propagation, and the resulting pressure change at the ear drum of the listener are considered the acoustic level in the speech communication pathway. The vibration at the ear drum induces electric signals that move along the sensory nerves to the brain; we are now back to the physiological level. Finally, at the linguistic level of the listener, the brain performs speech recognition and understanding.

The linguistic and physiological activity of the speaker and listener can be thought of as the “transmitter” and “receiver,” respectively, in the speech communication pathway. The transmitter and receiver of the system, however, have other functions besides basic communications. In the transmitter there is feedback through the ear which allows monitoring and correction of one’s own speech (the importance of this feedback has been seen in studies of the speech of the deaf). Examples of the use of this feedback are in controlling articulation rate and in the adaptation of speech production to mimic voices. The receiver also has additional functions. It performs voice recognition and it is robust in noise and other interferences; in a room of multiple speakers, for example, the listener can focus on a single low-volume speaker in spite of louder interfering speakers. Although we have made great strides in reproducing parts of this communication system by synthetic means, we are far from emulating the human communication system.

1.3 Analysis/Synthesis Based on Speech Production and Perception

In this text, we do not cover the entire speech communication pathway.
We break into the pathway and make an analog-to-digital measurement of the acoustic waveform. From these measurements and our understanding of speech production, we build engineering models of how the vocal tract and vocal cords produce sound waves, beginning with analog representations which are then transformed to discrete time. We also consider the receiver, i.e., the signal processing of the ear and higher auditory levels, although to a lesser extent than the transmitter, because it is imperative to account for the effect of speech signal processing on perception. To preview the building of a speech model, consider Figure 1.2, which shows a model of vowel production. In vowel production, air is forced from the lungs by contraction of the muscles around the lung cavity. Air then flows past the vocal cords, which are two masses of flesh, causing periodic vibration of the cords whose rate gives the pitch of the sound; the resulting periodic puffs of air act as an excitation input, or source, to the vocal tract. The vocal tract is the cavity between the vocal cords and the lips, and acts as a resonator that spectrally shapes the periodic input, much like the cavity of a musical wind instrument. From this basic understanding of the speech production mechanism, we can build a simple engineering model, referred to as the source/filter model. Specifically, if we assume that the vocal tract is a linear time-invariant system, or filter, with a periodic impulse-like input, then the pressure output at the lips is the
convolution of the impulse-like train with the vocal tract impulse response, and therefore is itself periodic.

[Figure 1.2: Speech production mechanism and model of a steady-state vowel. The acoustic waveform is modeled as the output of a linear time-invariant system with a periodic impulse-like input. In the frequency domain, the vocal tract system function spectrally shapes the harmonic input.]

This is a simple model of a steady-state vowel. A particular vowel, as, for example, "a" in the word "father," is one of many basic sounds of a language that are called phonemes and for which we build different production models. A typical speech utterance consists of a string of vowel and consonant phonemes whose temporal and spectral characteristics change with time, corresponding to a changing excitation source and vocal tract system. In addition, the time-varying source and system can also nonlinearly interact in a complex way. Therefore, although our simple model for a steady vowel seems plausible, the sounds of speech are not always well represented by linear time-invariant systems.

[Figure 1.3: Discrete-time speech signal processing overview. Applications within the text include speech modification, coding, enhancement, and speaker recognition.]

Based on discrete-time models of speech production, we embark on the design of speech analysis/synthesis systems (Figure 1.3). In analysis, we take apart the speech waveform to extract underlying parameters of the time-varying model. The analysis is performed with temporal and spectral resolution that is adequate for the measurement of the speech model parameters. In synthesis, based on these parameter estimates and models, we then put the waveform back together.
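As a rough numerical sketch of the source/filter idea in Figure 1.2 (not the text's own formulation; the sampling rate, pitch period, and resonance values below are illustrative assumptions), a periodic impulse train can be passed through a single damped resonance standing in for the vocal tract:

```python
import numpy as np

fs = 8000                      # sampling rate in Hz (assumed value)
P = 80                         # pitch period in samples (100-Hz pitch, assumed)
N = 800

# Source: periodic impulse-like excitation u[n]
u = np.zeros(N)
u[::P] = 1.0

# Filter: vocal tract modeled as one damped resonance near 500 Hz (assumed)
n = np.arange(N)
r, theta = 0.97, 2 * np.pi * 500 / fs
h = (r ** n) * np.cos(theta * n)

# Output: convolution of source and filter; after the initial transient it
# settles into the same period as the source, as argued above
s = np.convolve(u, h)[:N]
```

Because the output is the convolution of a periodic excitation with a fixed impulse response, it inherits the period of the source, which is the periodicity argument made in the text.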
An objective in this development is to achieve an identity system for which the output equals the input when no speech manipulation is performed. We also investigate waveform and spectral representations that do not involve models, but rather various useful mathematical representations in time or in frequency from which other analysis/synthesis methods can be derived. These analysis/synthesis methods are the backbone for applications that transform the speech waveform into some desirable form.

1.4 Applications

This text deals with applications of discrete-time speech analysis/synthesis primarily in the following areas: (1) speech modification, (2) speech coding, (3) speech enhancement, and (4) speaker recognition (Figure 1.3). Other important application areas for discrete-time signal processing, including speech recognition, language recognition, and speech synthesis from text, are not given; to do so would require a deeper study of statistical discrete-time signal processing and linguistics than can be satisfactorily covered within the boundaries of this text. Tutorials in these areas can be found in [1],[3],[4],[5],[6],[7].

Modification — The goal in speech modification is to alter the speech signal to have some desired property. Modifications of interest include time-scale, pitch, and spectral changes. Applications of time-scale modification are fitting radio and TV commercials into an allocated time slot and the synchronization of audio and video presentations. In addition, speeding up speech has use in message playback, voice mail, and reading machines and books for the blind, while slowing down speech has application to learning a foreign language. Voice transformations using pitch and spectral modification have application in voice disguise, entertainment, and concatenative speech synthesis. The spectral change of frequency compression and expansion may be useful in transforming speech as an aid to the partially deaf.
Many of the techniques we develop also have applicability to music and special effects. In music modification, a goal is to create new and exotic sounds and enhance electronic musical instruments. Cross synthesis, used for special effects, combines different source and system components of sounds, such as blending the human excitation with the resonances of a musical instrument. We will see that separation of the source and system components of a sound is also important in a variety of other speech application areas.

Coding — In the application of speech coding, the goal is to reduce the information rate, measured in bits per second, while maintaining the quality of the original speech waveform.² We study three broad classes of speech coders. Waveform coders, which represent the speech waveform directly and do not rely on a speech production model, operate in the high range of 16-64 kbps (bps, denoting bits per second). Vocoders are largely speech model-based and rely on a small set of model parameters; they operate at the low bit rate range of 1.2-4.8 kbps, and tend to be of lower quality than waveform coders. Hybrid coders are partly waveform-based and partly speech model-based and operate in the 4.8-16 kbps range with a quality between waveform coders and vocoders. Applications of speech coders include digital telephony over constrained-bandwidth channels, such as cellular, satellite, and Internet communications. Other applications are video phones where bits are traded off between speech and image data, secure speech links for government and military communications, and voice storage as with computer voice mail where storage capacity is limited. This last application can also benefit from time-scale compression where both information reduction and voice speed-up are desirable.

Enhancement — In the third application—speech enhancement—the goal is to improve the quality of degraded speech.
One approach is to preprocess the speech waveform before it is degraded. Another is postprocessing enhancement after signal degradation. Applications of preprocessing include increasing the broadcast range of transmitters constrained by a peak power transmission limit, as, for example, in AM radio and TV transmission. Applications of postprocessing include reduction of additive noise in digital telephony and vehicle and aircraft communications, reduction of interfering backgrounds and speakers for the hearing-impaired, removal of unwanted convolutional channel distortion and reverberation, and restoration of old phonograph recordings degraded, for example, by acoustic horns and impulse-like scratches from age and wear.

² The term quality refers to speech attributes such as naturalness, intelligibility, and speaker recognizability.

Speaker Recognition — This area of speech signal processing exploits the variability of speech model parameters across speakers. Applications include verifying a person's identity for entrance to a secure facility or personal account, and voice identification in forensic investigation. An understanding of the speech model features that cue a person's identity is also important in speech modification where we can transform model parameters for the study of specific voice characteristics; thus, speech modification and speaker recognition can be developed synergistically.

1.5 Outline of Book

The goal of this book is to provide an understanding of discrete-time speech signal processing techniques that are motivated by speech model building, as well as by the above applications. We will see how signal processing algorithms are driven by both time- and frequency-domain representations of speech production, as well as by aspects of speech perception.
In addition, we investigate the capability of these algorithms to analyze the speech signal with appropriate time-frequency resolution, as well as the capability to synthesize a desired waveform. Chapter 2 reviews the foundation of discrete-time signal processing which serves as the framework for the remainder of the text. We investigate some essential discrete-time tools and touch upon limitations of these techniques, as manifested through the uncertainty principle and the theory of time-varying linear systems that arise in a speech signal processing context. Chapter 3 describes qualitatively the main functions of the speech production mechanism and the associated anatomy. Acoustic and articulatory descriptors of speech sounds are given, some simple linear and time-invariant models are proposed, and, based on these features and models, the study of phonetics is introduced. Implications of sound production mechanisms for signal processing algorithms are discussed. In Chapter 4, we develop a more quantitative description of the acoustics of speech production, showing how the heuristics of Chapter 3 are approximately supported with linear and time-invariant mathematical models, as well as predicting other effects not seen by a qualitative perspective, such as a nonlinear acoustic coupling between the source and system functions. Based on the acoustic models of Chapters 3 and 4, in Chapter 5 we investigate pole-zero transfer function representations of the three broad speech sound classes of periodic (e.g., vowels), noise-like (e.g., fricative consonants), and impulsive (e.g., plosive consonants), loosely categorized as "deterministic," i.e., with a periodic or impulsive source, and "stochastic," i.e., with a noise source. There also exist many speech sounds having a combination of these sound elements. In this chapter, methodologies are developed for estimating all-pole system parameters for each sound class, an approach referred to as linear prediction analysis.
Extension of these methods is made to pole-zero system models. For both all-pole and pole-zero analysis, corresponding synthesis methods are developed. Linear prediction analysis first extracts the system component and then, by inverse filtering, extracts the source component. We can think of the source extraction as a method of deconvolution. Focus is given to estimating the source function during periodic sounds, particularly a "pitch synchronous" technique, based on the closed phase of the glottis, i.e., the slit between the vocal cords. This method of glottal flow waveform estimation reveals a nonlinear coupling between the source and the system. Chapter 6 describes an alternate means of deconvolution of the source and system components, referred to as homomorphic filtering. In this approach, convolutionally combined signals are mapped to additively combined signals on which linear filtering is applied for signal separation. Unlike linear prediction, which is a "parametric" (all-pole) approach to deconvolution, homomorphic filtering is "nonparametric" in that a specific model need not be imposed on the system transfer function in analysis. Corresponding synthesis methods are also developed, and special attention is given to the importance of phase in speech synthesis. In Chapter 7, we introduce the short-time Fourier transform (STFT) and its magnitude for analyzing the spectral evolution of time-varying speech waveforms. Synthesis techniques are developed from both the STFT and the STFT magnitude. Time-frequency resolution properties of the STFT are studied and application to time-scale modification is made. In this chapter, the STFT is viewed in terms of a filter-bank analysis of speech which leads to an extension to constant-Q analysis and the wavelet transform described in Chapter 8. The wavelet transform represents one approach to addressing time-frequency resolution limitations of the STFT as revealed through the uncertainty principle.
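The central identity behind homomorphic filtering mentioned above, that signals combined by convolution become additively combined under the logarithm of the spectrum, can be checked numerically on short made-up sequences (the sequences and FFT length here are arbitrary choices, not from the text):

```python
import numpy as np

x = np.array([1.0, 0.5, 0.25])         # hypothetical "source" sequence
h = np.array([1.0, -0.4])              # hypothetical "system" sequence
N = 8                                  # FFT length >= len(x) + len(h) - 1

X, H = np.fft.fft(x, N), np.fft.fft(h, N)
Y = np.fft.fft(np.convolve(x, h), N)   # spectrum of the convolved signal

# Convolution in time is multiplication in frequency, so the log
# magnitude spectrum of the convolved signal is the sum of log spectra.
assert np.allclose(np.log(np.abs(Y)), np.log(np.abs(X)) + np.log(np.abs(H)))
```

These two short sequences were chosen so that neither spectrum has a zero on the unit circle, keeping the logarithm well-defined; Chapter 6 develops the full machinery built on this identity.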
The filter-bank perspective of the STFT also leads to an analysis/synthesis method in Chapter 8 referred to as the phase vocoder, as well as other filter-bank structures. Also in Chapter 8, certain principles of auditory processing are introduced, beginning with a filter-bank representation of the auditory front-end. These principles, as well as others described as needed in later chapters, are used throughout the text to help motivate various signal processing techniques, as, for example, signal phase preservation. The analysis stage of the phase vocoder views the output of a bank of bandpass filters as sinewave signal components. Rather than relying on a filter bank to extract the underlying sinewave components and their parameters, an alternate approach is to explicitly model and estimate time-varying parameters of sinewave components by way of spectral peaks in the short-time Fourier transform. The resulting sinewave analysis/synthesis scheme, described in Chapter 9, resolves many of the problems encountered by the phase vocoder, e.g., a characteristic phase distortion problem, and provides a useful framework for a large range of speech applications, including speech modification, coding, and speech enhancement by speaker separation. Pitch and a voicing decision, i.e., whether the vocal tract source is periodic or noisy, play a major role in the application of speech analysis/synthesis to speech modification, coding, and enhancement. Time-domain methods of pitch and voicing estimation follow from specific analysis techniques developed throughout the text, e.g., linear prediction or homomorphic analysis. The purpose of Chapter 10 is to describe pitch and voicing estimation, on the other hand, from a frequency-domain perspective, based primarily on the sinewave modeling approach of Chapter 9. Chapter 11 then deviates from the main trend of the text and investigates advanced topics in nonlinear estimation and modeling techniques.
Here we first go beyond the STFT and wavelet transforms of the previous chapters to time-frequency analysis methods including the Wigner distribution and its variations referred to as bilinear time-frequency distributions. These distributions, aimed at undermining the uncertainty principle, attempt to estimate important fine-structure speech events not revealed by the STFT and wavelet transform, such as events that occur within a glottal cycle. In the latter half of this chapter, we introduce a second approach to analysis of fine structure whose original development was motivated by nonlinear aeroacoustic models for spatially distributed sound sources and modulations induced by nonacoustic fluid motion. For example, a "vortex ring," generated by a fast-moving air jet from the glottis and traveling along the vocal tract, can be converted to a secondary acoustic sound source when interacting with vocal tract boundaries such as the epiglottis (false vocal folds), teeth, or inclusions in the vocal tract. During periodic sounds, in this model secondary sources occur within a glottal cycle and can exist simultaneously with the primary glottal source. Such aeroacoustic models follow from complex nonlinear behavior of fluid flow, quite different from the small compression and rarefaction perturbations associated with acoustic sound waves in the vocal tract that are given in Chapter 4. This aeroacoustic modeling approach provides the impetus for the high-resolution Teager energy operator developed in the final section of Chapter 11. This operator is characterized by a time resolution that can track rapid signal energy changes within a glottal cycle. Based on the foundational Chapters 2-11, Chapters 12, 13, and 14 then address the three application areas of speech coding, speech enhancement, and speaker recognition, respectively.
We do not devote a separate chapter to the speech modification application, but rather use this application to illustrate principles throughout the text. Certain other applications not covered in Chapters 12, 13, and 14 are addressed sporadically for this same purpose, including restoration of old acoustic recordings, and dynamic range compression and signal separation for signal enhancement.

1.6 Summary

In this chapter, we first defined discrete-time speech signal processing as the manipulation of sampled speech signals by a digital processor to obtain a new signal with some desired properties. The application of time-scale modification, where a speaker's articulation rate is altered, was used to illustrate this definition and to indicate the design flexibility of discrete-time processing. We saw that the goal of this book is to provide an understanding of discrete-time speech signal processing techniques driven by both time- and frequency-domain models of speech production, as well as by aspects of speech perception. The speech signal processing algorithms are also motivated by applications that include speech modification, coding, enhancement, and speaker recognition. Finally, we gave a brief outline of the text.

BIBLIOGRAPHY

[1] J.R. Deller, J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech, Macmillan Publishing Co., New York, NY, 1993.
[2] P.B. Denes and E.N. Pinson, The Speech Chain: The Physics and Biology of Spoken Language, Anchor Press-Doubleday, Garden City, NY, 1973.
[3] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, Cambridge, MA, 1998.
[4] W.B. Kleijn and K.K. Paliwal, eds., Speech Coding and Synthesis, Elsevier, 1995.
[5] D. O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, Reading, MA, 1987.
[6] L.R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.
[7] M.A.
Zissman, "Comparison of Four Approaches to Automatic Language Identification of Telephone Speech," IEEE Trans. on Speech and Audio Processing, vol. 4, no. 1, pp. 31-44, Jan. 1996.

CHAPTER 2

A Discrete-Time Signal Processing Framework

2.1 Introduction

In this chapter we review the foundation of discrete-time signal processing which serves as a framework for the discrete-time speech processing approaches in the remainder of the text. We investigate some essential discrete-time tools and touch upon the limitations of these techniques, as manifested through the time-frequency uncertainty principle and the theory of time-varying linear systems, that will arise in a speech processing context. We do not cover all relevant background material in detail, given that we assume the familiarity of the reader with the basics of discrete-time signal processing and given that certain topics are appropriately cited, reviewed, or extended throughout the text as they are needed.

2.2 Discrete-Time Signals

The first component of a speech processing system is the measurement of the speech signal, which is a continuously varying acoustic pressure wave. We can transduce the pressure wave from an acoustic to an electrical signal with a microphone, amplify the microphone voltage, and view the voltage on an oscilloscope. The resulting analog waveform is denoted by xa(t), which is referred to as a continuous-time signal or, alternately, as an analog waveform. To process the changing voltage xa(t) with a digital computer, we sample xa(t) at uniformly spaced time instants, an operation that is represented in Figure 2.1 by a sampler with T being the sampling interval. The sampler is sometimes called a continuous-to-discrete (C/D) converter [7] and its output is a series of numbers xa(nT) whose representation is simplified as x[n] = xa(nT).
[Figure 2.1: Measurement (a) and sampling (b) of an analog speech waveform.]

The series x[n] is referred to as a discrete-time signal or sequence. Unless otherwise specified, when working with samples from an analog signal, we henceforth use the terms discrete-time signal, sequence, and (for simplicity) signal interchangeably. We have assumed that the analog signal xa(t) is sampled "fast enough" to be recoverable from the sequence x[n], a condition that is called the Nyquist criterion to which we return at the end of this chapter. The C/D converter that generates the discrete-time signal is characterized by infinite amplitude precision. Therefore, although the signal x[n] is discrete in time, it is continuous in amplitude. In practice, however, a physical device does not achieve this infinite precision. As an approximation to a C/D converter, an analog-to-digital (A/D) converter quantizes each amplitude to a finite set of values closest to the actual analog signal amplitude [7]. The resulting digital signal is thus discrete in time and in amplitude. Associated with discrete-time signals are discrete-time systems whose input and output are sequences. Likewise, digital systems are characterized by digital inputs and outputs. The principal focus of this text is discrete-time signals and systems, except where deliberate amplitude quantization is imparted, as in speech coding for bit rate and bandwidth reduction, which is introduced in Chapter 12.

Some special sequences serve as the building blocks for a general class of discrete-time signals [7]. The unit sample or "impulse" is denoted by

    δ[n] = 1,  n = 0
         = 0,  n ≠ 0.

The unit step is given by

    u[n] = 1,  n ≥ 0
         = 0,  n < 0

and can be obtained by summing the unit sample: u[n] = Σ_{k=-∞}^{n} δ[k].
Likewise, the unit sample can be obtained by differencing the unit step with itself shifted one sample to the right, i.e., forming the first backward difference: δ[n] = u[n] - u[n-1]. The exponential sequence is given by x[n] = A aⁿ where if A and a are real, then x[n] is real. Moreover, if 0 < a < 1 and A > 0, then the sequence x[n] is positive and decreasing with increasing n. If -1 < a < 0, then the sequence values alternate in sign. The sinusoidal sequence is given by x[n] = A cos(ωn + φ) with frequency ω, amplitude A, and phase offset φ. Observe that the discrete-time sinusoidal signal is periodic¹ in the time variable n with period N only if N = 2πk/ω is an integer. The complex exponential sequence with complex gain A = |A|e^{jφ} is written as

    x[n] = A e^{jωn}
         = |A| e^{jφ} e^{jωn}
         = |A| cos(ωn + φ) + j|A| sin(ωn + φ).

An interesting property, which is a consequence of being discrete, is that the complex exponential sequence is periodic in the frequency variable ω with period 2π, i.e., A e^{j(ω+2π)n} = A e^{jωn}. This periodicity in ω also holds for the sinusoidal sequence. Therefore, in discrete time we need to consider frequencies only in the range 0 ≤ ω < 2π. The complex exponential and the above four real sequences serve as building blocks to discrete-time speech signals throughout the text.

¹ This is in contrast to its continuous-time counterpart xa(t) = A cos(Ωt + φ) that is always periodic with period 2π/Ω. Here the uppercase frequency variable Ω is used for continuous time rather than the lowercase ω for discrete time. This notation will be used throughout the text.

2.3 Discrete-Time Systems

Discrete-time signals are often associated with discrete-time systems. A discrete-time system can be thought of as a transformation T(x[n]) of an input sequence to an output sequence, i.e., y[n] = T(x[n]).
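The relations above among the unit sample, unit step, and complex exponential can be verified directly over a finite index range (a small NumPy sketch; the index range and frequency are arbitrary choices):

```python
import numpy as np

n = np.arange(-5, 6)                 # small index range around n = 0

delta = (n == 0).astype(float)       # unit sample
u = (n >= 0).astype(float)           # unit step

# Unit step as the running sum of the unit sample
assert np.array_equal(np.cumsum(delta), u)

# Unit sample as the first backward difference of the unit step
u_shift = np.concatenate(([0.0], u[:-1]))   # u[n-1] on this index range
assert np.array_equal(u - u_shift, delta)

# Complex exponential: periodic in the frequency variable with period 2*pi
w = 0.3
x1 = np.exp(1j * w * n)
x2 = np.exp(1j * (w + 2 * np.pi) * n)
assert np.allclose(x1, x2)
```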
When we restrict this transformation to have the properties of linearity and time invariance, we form a class of linear time-invariant (LTI) systems. Let x₁[n] and x₂[n] be the inputs to a discrete-time system. Then for arbitrary constants a and b, the system is linear if and only if

    T(a x₁[n] + b x₂[n]) = a T(x₁[n]) + b T(x₂[n])

which is sometimes referred to as the "principle of superposition" [7]. A system is time-invariant if a time shift in the input by n₀ samples gives a shift in the output by n₀ samples, i.e., if y[n] = T(x[n]), then y[n - n₀] = T(x[n - n₀]). An important property of an LTI system is that it is completely characterized by its impulse response, which is defined as the system's response to a unit sample (or impulse). Given an LTI system, the output y[n] for an input x[n] is given by a sum of weighted and delayed impulse responses, i.e.,

    y[n] = Σ_{k=-∞}^{∞} x[k] h[n - k]
         = x[n] * h[n]                                        (2.1)

which is referred to as the convolution of x[n] with h[n], where "*" denotes the convolution operator. We can visualize convolution either by the weighted sum in Equation (2.1) or by flipping h[n] in time and shifting the flipped h[n] past x[n]. For finite-length sequences, with x[n] of length L and h[n] of length M, the length of the resulting sequence y[n] = x[n] * h[n] is M + L - 1. Since convolution is commutative, i.e., x[n] * h[n] = h[n] * x[n] (Exercise 2.1), we can also flip x[n] and run it past the response h[n]. The convolution operation with h[n] is sometimes referred to as filtering the input x[n] by the system h[n], an operation useful in our modeling of speech production and in almost all speech processing systems. Two useful properties of LTI systems are stability and causality, which give a more restricted class of systems (Example 2.1). In a stable system, every bounded input produces a bounded output, i.e., if |x[n]| < ∞, then |y[n]| < ∞ for all n. A necessary and sufficient condition for stability is that h[n] be absolutely summable, i.e.,

    Σ_{n=-∞}^{∞} |h[n]| < ∞.
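Equation (2.1) and the properties just stated, an output of length M + L - 1 and commutativity, can be illustrated with a toy example (arbitrary short sequences, using NumPy's convolve):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # input of length L = 3
h = np.array([1.0, -1.0])            # impulse response of length M = 2

y = np.convolve(x, h)                # y[n] = sum_k x[k] h[n - k]

# Length of the result is L + M - 1
assert len(y) == len(x) + len(h) - 1

# Convolution is commutative: x * h = h * x
assert np.array_equal(y, np.convolve(h, x))

print(y)                             # [ 1.  1.  1. -3.]
```

Here h[n] is a first backward difference, so the output is the differenced input followed by a trailing term, consistent with flipping h[n] and sliding it past x[n].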
A causal system is one where for any time instant n₀, y[n₀] does not depend on x[n] for n > n₀, i.e., the output does not depend on the future of the input. A necessary and sufficient condition for causality is that h[n] = 0 for n < 0. One can argue this necessary and sufficient condition by exploring the signal-flip interpretation of convolution (Exercise 2.2). A consequence of causality is that if x₁[n] = x₂[n] for n < n₀, then the corresponding outputs are equal for n < n₀.

EXAMPLE 2.1 Consider an LTI system with the decaying exponential impulse response h[n] = A aⁿ u[n], which is causal since h[n] = 0 for n < 0. For |a| < 1, the impulse response is absolutely summable:

    Σ_{n=-∞}^{∞} |h[n]| = |A| Σ_{n=0}^{∞} |a|ⁿ = |A| / (1 - |a|)

where we have used the geometric series relation Σ_{n=0}^{∞} bⁿ = 1/(1 - b) for |b| < 1. If, on the other hand, |a| ≥ 1, then the geometric series does not converge, the response is not absolutely summable, and the system is unstable. Note that, according to this condition, a system whose impulse response is the unit step function, i.e., h[n] = u[n], is unstable. ▲

The terminology formulated in this section for systems is also used for sequences, although its physical meaning for sequences is lacking. A stable sequence is defined as an absolutely summable sequence, and a causal sequence is zero for n < 0. A causal sequence will also be referred to as a right-sided sequence.

2.4 Discrete-Time Fourier Transform

The previous section focused on time-domain representations of signals and systems. Frequency-domain representations, the topic of this section, are useful for the analysis of signals and the design of systems for processing signals. We begin with a review of the Fourier transform. A large class of sequences can be represented as a linear combination of complex exponentials whose frequencies lie in the range² [-π, π]. Specifically, we write the following pair of equations:

    x[n] = (1/2π) ∫_{-π}^{π} X(ω) e^{jωn} dω

    X(ω) = Σ_{n=-∞}^{∞} x[n] e^{-jωn}.                        (2.2)

This pair of equations is known as the discrete-time Fourier transform pair representation of a sequence.
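The pair in Equation (2.2) can be verified numerically for a finite-length sequence: evaluate the analysis sum on a dense frequency grid, then approximate the synthesis integral by a Riemann sum (the test sequence and grid size below are arbitrary choices):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5, 3.0])       # arbitrary finite-length sequence
n = np.arange(len(x))

# Analysis equation: X(w) = sum_n x[n] e^{-jwn}, on M grid points in [-pi, pi)
M = 4096
w = -np.pi + 2.0 * np.pi * np.arange(M) / M
X = np.array([np.sum(x * np.exp(-1j * wk * n)) for wk in w])

# Synthesis equation: x[n] = (1/2pi) * integral of X(w) e^{jwn} dw,
# approximated by a Riemann sum with spacing dw = 2pi/M
x_rec = np.array([np.sum(X * np.exp(1j * w * m)) / M for m in n])

assert np.allclose(x_rec.real, x)          # synthesis recovers the sequence
assert np.allclose(x_rec.imag, 0.0, atol=1e-9)
```

For a finite-length sequence and a uniform grid, this Riemann sum recovers x[n] essentially exactly, since the grid exponentials sum to zero except at zero lag.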
For convenience, the phrase Fourier transform will often be used in place of discrete-time Fourier transform. Equation (2.2) represents x[n] as a superposition of infinitesimally small complex exponentials dω X(ω) e^{jωn}, where X(ω) determines the relative weight of each exponential. X(ω) is the Fourier transform of the sequence x[n], and is also referred to as the "analysis equation" because it analyzes x[n] to determine its relative weights. The first equation in the pair is the inverse Fourier transform, also referred to as the "synthesis equation" because it puts the signal back together again from its (complex exponential) components. In this text, we often use the terminology "analysis" and "synthesis" of a signal. We have not yet explicitly shown for what class of sequences such a Fourier transform pair exists. Existence means that (1) X(ω) does not diverge, i.e., the Fourier transform sum converges, and (2) x[n] can be obtained from X(ω). It can be shown that a sufficient condition for the existence of the pair is that x[n] be absolutely summable, i.e., that x[n] is stable [7]. Therefore, all stable sequences and stable system impulse responses have Fourier transforms.

² Recall that e^{j(ω+2π)n} = e^{jωn}.

Some useful properties of the Fourier transform are as follows (Exercise 2.3):

P1: Since the Fourier transform is complex, it can be written in polar form as

    X(ω) = X_r(ω) + jX_i(ω) = |X(ω)| e^{j∠X(ω)}

where the subscripts r and i denote real and imaginary parts, respectively.

P2: The Fourier transform is periodic with period 2π:

    X(ω + 2π) = X(ω)

which is consistent with the statement that the frequency range [-π, π] is sufficient for representing a discrete-time signal.

P3: For a real-valued sequence x[n], the Fourier transform is conjugate-symmetric:

    X(ω) = X*(-ω)

where * denotes complex conjugate.
Conjugate symmetry implies that the magnitude and real part of X(ω) are even, i.e., |X(ω)| = |X(-ω)| and X_r(ω) = X_r(-ω), while its phase and imaginary parts are odd, i.e., ∠X(ω) = -∠X(-ω) and X_i(ω) = -X_i(-ω). It follows that if a sequence is not conjugate-symmetric, then it must be a complex-valued sequence.

P4: The energy of a signal can be expressed by Parseval's Theorem as

    Σ_{n=-∞}^{∞} |x[n]|² = (1/2π) ∫_{-π}^{π} |X(ω)|² dω        (2.3)

which states that the total energy of a signal can be given in either the time or frequency domain. The functions |x[n]|² and |X(ω)|² are thought of as energy densities, i.e., the energy per unit time and the energy per unit frequency, because they describe the distribution of energy in time and frequency, respectively. Energy density is also referred to as power at a particular time or frequency.

EXAMPLE 2.2 Consider the shifted unit sample x[n] = δ[n - n₀]. The Fourier transform of x[n] is given by

    X(ω) = Σ_{n=-∞}^{∞} δ[n - n₀] e^{-jωn} = e^{-jωn₀}

since x[n] is nonzero for only n = n₀. This complex function has unity magnitude and a linear phase of slope -n₀. In time, the energy in this sequence is unity and concentrated at n = n₀, but in frequency the energy is uniformly distributed over the interval [-π, π] and, as seen from Parseval's Theorem, averages to unity. ▲

More generally, it can be shown that the Fourier transform of a displaced sequence x[n - n₀] is given by X(ω) e^{-jωn₀}. Likewise, it can be shown, consistent with the similar forms of the Fourier transform and its inverse, that the Fourier transform of e^{jω₀n} x[n] is given by X(ω - ω₀). This latter property is exploited in the following example:

EXAMPLE 2.3 Consider the decaying exponential sequence multiplied by the unit step: x[n] = aⁿ u[n] with a generally complex. Then the Fourier transform of x[n] is given by

    X(ω) = Σ_{n=0}^{∞} aⁿ e^{-jωn} = Σ_{n=0}^{∞} (a e^{-jω})ⁿ = 1 / (1 - a e^{-jω}),   |a e^{-jω}| = |a| < 1

so that the convergence condition on a becomes |a| < 1.
If we multiply the sequence by the complex exponential e^{jω_o n}, then we have the following Fourier transform pair:

e^{jω_o n} a^n u[n] ↔ 1/(1 − a e^{−j(ω−ω_o)}),  |a| < 1.

An example of this latter transform pair is shown in Figure 2.2a,b, where it is seen that in frequency the energy is concentrated around ω = ω_o = π/2. The two different values of a show a broadening of the Fourier transform magnitude with decreasing a, corresponding to a faster decay of the exponential. From the linearity of the Fourier transform, and using the above relation, we can write the Fourier transform pair for a real decaying sinewave as

2a^n cos(ω_o n) u[n] ↔ 1/(1 − a e^{−j(ω−ω_o)}) + 1/(1 − a e^{−j(ω+ω_o)}),  |a| < 1

where we have used the identity cos(α) = (1/2)(e^{jα} + e^{−jα}).

Figure 2.2 Frequency response of decaying complex and real exponentials of Example 2.3: (a) magnitude and (b) phase for complex exponential; (c) magnitude and (d) phase for decaying sinewave (solid for slow decay [a = 0.9] and dashed for fast decay [a = 0.7]). Frequency ω_o = π/2.

Figure 2.2c,d illustrates the implications of conjugate symmetry on the Fourier transform magnitude and phase of this real sequence, i.e., the magnitude function is even, while the phase function is odd. In this case, decreasing the value of a broadens the positive- and negative-frequency components of the signal around the frequencies ω_o and −ω_o, respectively. ▲

Example 2.3 illustrates a fundamental property of the Fourier transform pair representation: A signal cannot be arbitrarily narrow in time and in frequency. We return to this property in the following section.
The next example derives the Fourier transform of the complex exponential, requiring in frequency the unit impulse, also called the Dirac delta function.

EXAMPLE 2.4 In this case we begin in the frequency domain and perform the inverse Fourier transform. Consider a train of scaled unit impulses in frequency:

X(ω) = Σ_{r=−∞}^{∞} 2πA δ(ω − ω_o + r2π)

where 2π periodicity is enforced by adding delta-function replicas at multiples of 2π (Figure 2.3a). The inverse Fourier transform is given by³

x[n] = (1/2π) ∫_{−π}^{π} 2πA δ(ω − ω_o) e^{jωn} dω = A e^{jω_o n}

which is our familiar complex exponential. Observe that this Fourier transform pair represents the time-frequency dual of the shifted unit sample δ[n − n_o] and its transform e^{−jωn_o}. More generally, a shifted Fourier transform X(ω − ω_o) corresponds to the sequence x[n]e^{jω_o n}, a property alluded to earlier. ▲

Figure 2.3 Dirac delta Fourier transforms of (a) complex exponential sequence, (b) sinusoidal sequence, (c) sum of complex exponentials, and (d) sum of sinusoids. For simplicity, π and 2π factors are not shown in the amplitudes.

³ Although this sequence is not absolutely summable, use of the Fourier transform pair can rigorously be justified using the theory of generalized functions [7].

Using the linearity of the Fourier transform, we can generalize the previous result to a sinusoidal sequence as well as to multiple complex exponentials and sines. Figure 2.3b–d illustrates the Fourier transforms of the following three classes of sequences:

Sinusoidal sequence:
A cos(ω_o n + φ) ↔ πA e^{jφ} δ(ω − ω_o) + πA e^{−jφ} δ(ω + ω_o)

Multiple complex exponentials:
Σ_{k=0}^{N} A_k e^{jω_k n + jφ_k} ↔ Σ_{k=0}^{N} 2πA_k e^{jφ_k} δ(ω − ω_k)

Multiple sinusoids:
Σ_{k=0}^{N} A_k cos(ω_k n + φ_k) ↔ Σ_{k=0}^{N} [πA_k e^{jφ_k} δ(ω − ω_k) + πA_k e^{−jφ_k} δ(ω + ω_k)]

For simplicity, each transform is represented over only one period; for generality, phase offsets are included.
2.5 Uncertainty Principle

We saw in Example 2.3 a fundamental property of the Fourier transform pair: A signal cannot be arbitrarily narrow in time and in frequency. We saw in Figure 2.2 that the Fourier transform increased in spread as the time sequence decreased in width. This property is stated more precisely in the uncertainty principle. To do so requires a formal definition of the width of the signal and its Fourier transform. We refer to these signal characteristics as duration, denoted by D(x), and bandwidth,⁴ denoted by B(x), and define them respectively as

D(x) = [ Σ_{n=−∞}^{∞} (n − n̄)² |x[n]|² ]^{1/2}

B(x) = [ (1/2π) ∫_{−π}^{π} (ω − ω̄)² |X(ω)|² dω ]^{1/2}   (2.4)

where n̄ is the average time of the signal, i.e., n̄ = Σ_{n=−∞}^{∞} n |x[n]|², and ω̄ is its average frequency, i.e., ω̄ = (1/2π) ∫_{−π}^{π} ω |X(ω)|² dω [2]. In these definitions, in order that the time and frequency averages be meaningful, we assume that the signal energy is unity, i.e., Σ_{n=−∞}^{∞} |x[n]|² = (1/2π) ∫_{−π}^{π} |X(ω)|² dω = 1, or that the signal has been normalized by its energy. These duration and bandwidth values give us a sense of the concentration of a signal, or of its Fourier transform, about its average location. The definitions of signal or transform width are motivated by the definition of the variance, or "spread," of a random variable. In fact, |x[n]|² and |X(ω)|², viewed as energy densities, are analogous, as we will see later in the text, to the probability density function used in defining the variance of a random variable [2].

⁴ A more traditional definition of bandwidth, not necessarily giving the same value as that in Equation (2.4), is the distance between the 3-dB attenuation points around the average frequency.
It follows that normalizing the magnitude-squared functions in Equation (2.4) by the total signal energy ensures probability-density-like functions that integrate to unity.

The uncertainty principle states that the product of signal duration and bandwidth cannot be less than a fixed limit, i.e.,

D(x)B(x) ≥ 1/2.   (2.5)

A proof of the uncertainty principle begins by applying Parseval's Theorem to obtain

D(x)B(x) ≥ | (1/2π) ∫_{−π}^{π} ω X*(ω) (dX(ω)/dω) dω |   (2.6)

from which Equation (2.5) follows. The reader is led through the derivation in Exercise 2.5. The principle implies that a wide signal gives a narrow Fourier transform, and a narrow signal gives a wide Fourier transform.⁵ The uncertainty principle will play a major role in spectrographic and, more generally, time-frequency analysis of speech, especially when the speech waveform consists of dynamically changing events or closely spaced time or frequency components.

It is important to look more carefully at our definition of bandwidth in Equation (2.4). Observe that for a real sequence, from the conjugate-symmetry property, the Fourier transform magnitude is even. Thus the average frequency is zero, and the bandwidth is determined by the distance between the spectral energy concentrations in positive and negative frequency. The resulting bandwidth, therefore, is not necessarily indicative of the distribution of energy of physically meaningful quantities such as system resonances (Exercise 2.4). The bandwidth of the signals with discrete-time Fourier transform magnitudes in Figure 2.2c is such a case. As a consequence, complex sequences, such as those corresponding to the transform magnitudes in Figure 2.2a, or only the positive frequencies of a real sequence, are used in computing bandwidth. In the latter case, we form a sequence s[n] with Fourier transform

S(ω) = 2X(ω), 0 ≤ ω < π
     = 0,     −π ≤ ω < 0

⁵ The nomenclature "uncertainty" is somewhat misleading because there is no uncertainty in the measurement of the signal or its Fourier transform.
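As a numerical sketch (not from the text), the duration-bandwidth product of Equations (2.4) and (2.5) can be evaluated for a sampled Gaussian pulse, which nearly attains the lower bound of 1/2; the window length and width used below are arbitrary choices.

```python
import numpy as np

# Evaluate D(x)*B(x) of Eq. (2.4) for a unit-energy sampled Gaussian pulse.
n = np.arange(-200, 201)
x = np.exp(-(n / 20.0) ** 2)          # broad in time, narrow in frequency
x = x / np.sqrt(np.sum(x ** 2))       # unit energy, as Eq. (2.4) assumes

# Duration about the average time n_bar
n_bar = np.sum(n * np.abs(x) ** 2)
D = np.sqrt(np.sum((n - n_bar) ** 2 * np.abs(x) ** 2))

# Bandwidth about the average frequency, via a dense DTFT sample
w = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
X = np.exp(-1j * np.outer(w, n)) @ x
Xe = np.abs(X) ** 2 / (2 * np.pi)     # frequency-domain energy density
w_bar = np.trapz(w * Xe, w)
B = np.sqrt(np.trapz((w - w_bar) ** 2 * Xe, w))

assert 0.499 <= D * B < 0.51          # uncertainty principle, Eq. (2.5)
```

The Gaussian is the extremal case; any other pulse shape gives a strictly larger product.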
This terminology evolved from Heisenberg's uncertainty principle in the probabilistic context of quantum mechanics, where it was discovered that the position and the momentum of a particle cannot be measured simultaneously with absolute certainty. There is a mathematical similarity, but not a similarity in physical interpretation, between Heisenberg's uncertainty principle and Equation (2.5), because the position and momentum functions are related through the Fourier transform. The squared magnitudes of the Heisenberg functions represent the probability of measuring a particle with a certain position and momentum, respectively, unlike the deterministic magnitude of a signal and its Fourier transform [2].

The sequence s[n] has real and imaginary components

s[n] = s_r[n] + j s_i[n].   (2.7)

Equation (2.7), the inverse Fourier transform of S(ω), is called the analytic signal representation of x[n] and is used occasionally later in the text. It can be shown that the real part s_r[n] = x[n] and that the imaginary part s_i[n] can be obtained from s_r[n] through the frequency-domain operation (Exercise 2.6)

S_i(ω) = H(ω)S_r(ω)

where H(ω) is called the Hilbert transformer and is given by

H(ω) = −j, 0 < ω < π
     =  j, −π < ω < 0.

2.6 z-Transform

Figure 2.4 Region of convergence (ROC) of the z-transform for Examples 2.5–2.7: (a) x[n] = δ[n] − aδ[n − 1]; (b) x[n] = a^n u[n]; (c) x[n] = −b^n u[−n − 1]; (d) x[n] = a^n u[n] − b^n u[−n − 1]. Poles are indicated by small crosses and zeros by small circles, and the ROC is indicated by shaded regions.

Consider now the difference of unit samples x[n] = δ[n] − aδ[n − 1] with a generally complex. From linearity of the z-transform operator and the previous result, the z-transform is given by

X(z) = 1 − az^{-1}

where the ROC again includes the entire z-plane, not including z = 0.
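As a sketch (not from the text), the analytic signal of Equation (2.7) can be approximated on a finite-length sequence by zeroing the negative-frequency half of the DFT and doubling the positive half, a discrete stand-in for S(ω) = 2X(ω), 0 ≤ ω < π. For a cosine, the result should be the corresponding complex exponential.

```python
import numpy as np

# FFT-based analytic signal: keep DC and Nyquist bins, double positive
# frequencies, zero negative frequencies, then invert.
def analytic(x):
    N = len(x)                       # assume N even for simplicity
    X = np.fft.fft(x)
    gain = np.zeros(N)
    gain[0] = gain[N // 2] = 1.0     # DC and Nyquist bins kept as-is
    gain[1:N // 2] = 2.0             # positive frequencies doubled
    return np.fft.ifft(X * gain)     # negative frequencies removed

n = np.arange(1024)
w0 = 2 * np.pi * 50 / 1024           # a frequency exactly on a DFT bin
s = analytic(np.cos(w0 * n))
assert np.allclose(s.real, np.cos(w0 * n), atol=1e-9)   # s_r[n] = x[n]
assert np.allclose(s.imag, np.sin(w0 * n), atol=1e-9)   # Hilbert of cos is sin
```

The imaginary part realizes the Hilbert transformer H(ω) of the text: the Hilbert transform of cos(ω₀n) is sin(ω₀n).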
For this case, we say that X(z) has a pole at z = 0, i.e., the transform goes to infinity there, and a zero at z = a, i.e., the transform takes on the value zero. ▲

EXAMPLE 2.6 Consider the decaying exponential

x[n] = a^n u[n]

where a is generally complex. From the convergence property of a geometric series (Example 2.1), the z-transform is given by

X(z) = 1/(1 − az^{-1}),  |z| > |a|

where the ROC includes the z-plane outside of the radius |a|, i.e., |z| > |a|. For this case, X(z) has a pole at z = a and a zero at z = 0. For |a| < 1 the ROC includes the unit circle.

A similar case is the time-reversed version of the decaying exponential, x[n] = −b^n u[−n − 1]. Again, from the convergence property of a geometric series, the z-transform is given by

X(z) = 1/(1 − bz^{-1}),  |z| < |b|

but now the ROC includes the z-plane inside of the radius |b|, i.e., |z| < |b|. For this case, X(z) has a pole at z = b and a zero at z = 0. For |b| > 1, the ROC includes the unit circle. ▲

From the above examples, we see that both the z-transform and an ROC are required to specify a sequence. When a sequence consists of a sum of components, such as the sequences of the previous examples, then, in general, the ROC is the intersection of the ROCs for the individual terms, as we show in the following example:

EXAMPLE 2.7 We now combine the elemental sequences of the previous example. Let x₁[n] and x₂[n] denote the decaying exponential and its time-reversal, respectively. Then the sum of the two sequences is given by

x[n] = x₁[n] + x₂[n] = a^n u[n] − b^n u[−n − 1].

From the linearity of the z-transform, X(z) is given by

X(z) = X₁(z) + X₂(z) = 1/(1 − az^{-1}) + 1/(1 − bz^{-1}),  |a| < |z| < |b|

where, because both component sums must converge, the ROC is an annulus in the z-plane defined by |a| < |z| < |b|. Rewriting X(z) as

X(z) = 2(1 − [(a + b)/2] z^{-1}) / [(1 − az^{-1})(1 − bz^{-1})]

we see that X(z) is characterized by two poles at z = a and z = b, one zero at z = (a + b)/2, and a zero at z = 0.
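A small numerical sketch (not from the text) confirms the algebra of Example 2.7: the sum of the two partial fractions equals the single rational form, and its poles and zeros fall where the text says.

```python
import numpy as np

# Check Example 2.7: 1/(1-a z^-1) + 1/(1-b z^-1)
#   = 2(1 - ((a+b)/2) z^-1) / ((1 - a z^-1)(1 - b z^-1))
a, b = 0.5, 1.5
num = [2.0, -(a + b)]                      # 2 - (a+b) z^{-1}
den = [1.0, -(a + b), a * b]               # (1 - a z^{-1})(1 - b z^{-1})

def poly_zinv(c, z):
    """Evaluate a polynomial in z^{-1} with coefficients c."""
    return sum(ck * z ** -k for k, ck in enumerate(c))

for z in [0.8 * np.exp(1j * 0.3), 1.2, -1.0]:   # points in the ROC |a|<|z|<|b|
    lhs = 1 / (1 - a / z) + 1 / (1 - b / z)
    assert abs(lhs - poly_zinv(num, z) / poly_zinv(den, z)) < 1e-12

# Viewed as polynomials in z, poles sit at a and b, the finite zero at (a+b)/2
assert np.allclose(np.sort(np.roots(den)), [a, b])
assert np.isclose(np.roots(num)[0], (a + b) / 2)
```

The same coefficient lists serve both purposes because coefficients ascending in z^{-1} are descending in z after clearing the z^{-1} powers.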
The z-transforms of the previous examples can be generalized to complex a and b; the poles and zeros are not constrained to the real axis, but can lie anywhere in the z-plane. For a general rational X(z), factoring P(z) and Q(z) gives the z-transform in the form

X(z) = A [ Π_{k=1}^{M_i} (1 − a_k z^{-1}) Π_{k=1}^{M_o} (1 − b_k z) ] / [ Π_{k=1}^{N_i} (1 − c_k z^{-1}) Π_{k=1}^{N_o} (1 − d_k z) ]   (2.17)

where |a_k|, |b_k|, |c_k|, |d_k| < 1, so that the factors (1 − a_k z^{-1}) and (1 − c_k z^{-1}) contribute zeros and poles inside the unit circle, while (1 − b_k z) and (1 − d_k z) contribute zeros and poles outside. Sequences may be right-sided (zero before some time n₀), left-sided (zero after some time n₀), or two-sided (neither right-sided nor left-sided). The ROC may or may not include the unit circle. For a right-sided sequence, the ROC extends outward from the outermost pole, while for a left-sided sequence, the ROC extends inward from the innermost pole. For a two-sided sequence, the ROC is an annulus in the z-plane bounded by poles and, given convergence, not containing any poles. These different configurations are illustrated in Figure 2.5.

Figure 2.5 Poles and zeros, and region of convergence (ROC), of rational z-transforms: (a) left-sided, (b) right-sided, and (c) two-sided. The ROC is indicated by shaded regions.

A specific case, important in the modeling of speech production, is when x[n] is stable and causal; in this case, because the unit circle must be included in the ROC, all poles are inside the unit circle and the ROC is outside the outermost pole. This configuration is shown later in Figure 2.8 and will be further described in Section 2.8.

It was stated earlier that a rational z-transform can be represented by a partial fraction expansion. The details of this partial fraction expansion are given in [7]; here it suffices to describe a particular signal class. Consider the case where the number of poles is greater than the number of zeros. Then, for the case of no poles outside the unit circle, we can write Equation (2.17) as

X(z) = Σ_{k=1}^{N_i} A_k / (1 − c_k z^{-1}).   (2.18)

This additive decomposition, in contrast to the earlier multiplicative representation of component poles, will be useful in modeling of speech signals, as in the representation of resonances of a vocal tract system impulse response.
2.7 LTI Systems in the Frequency Domain

We have completed our brief review of frequency-domain representations of sequences; we now look at similar representations for systems. Consider the complex exponential x[n] = e^{jωn} as the input to an LTI system. From the convolution property,

y[n] = x[n] * h[n] = Σ_{k=−∞}^{∞} h[k] e^{jω(n−k)} = e^{jωn} Σ_{k=−∞}^{∞} h[k] e^{−jωk}.   (2.19)

The second factor in Equation (2.19) is the Fourier transform of the system impulse response, which we denote by H(ω):

H(ω) = Σ_{k=−∞}^{∞} h[k] e^{−jωk}   (2.20)

so that y[n] = H(ω)e^{jωn}. Therefore, a complex exponential input to an LTI system results in the same complex exponential at the output, but modified by H(ω). It follows that the complex exponential is an eigenfunction of an LTI system, and H(ω) is the associated eigenvalue [7].⁶ H(ω) is often referred to as the system frequency response because it describes the change in e^{jωn} with frequency; its z-domain generalization, H(z), is referred to as the system function or sometimes the transfer function. The following example exploits the eigenfunction property of the complex exponential:

⁶ This eigenfunction/eigenvalue terminology is also often used in mathematics. In linear algebra, for example, for a matrix A and a vector x, Ax = λx, where x is the eigenvector and λ the eigenvalue.

EXAMPLE 2.8 A sinusoidal sequence can be expressed as

x[n] = A cos(ω_o n + φ) = (A/2) e^{jφ} e^{jω_o n} + (A/2) e^{−jφ} e^{−jω_o n}.

Then, by superposition, the output of an LTI system H(ω) to the input x[n] is given by

y[n] = H(ω_o)(A/2) e^{jφ} e^{jω_o n} + H(−ω_o)(A/2) e^{−jφ} e^{−jω_o n}
     = (A/2) [H(ω_o) e^{jφ} e^{jω_o n} + H*(ω_o) e^{−jφ} e^{−jω_o n}]

where we have used the conjugate-symmetry property of the Fourier transform, H(−ω_o) = H*(ω_o).
Using the relation a + a* = 2Re[a] = 2|a|cos(θ), where a = |a|e^{jθ}, the output can be expressed as

y[n] = A|H(ω_o)| cos[ω_o n + φ + ∠H(ω_o)]

where we have invoked the polar form H(ω) = |H(ω)|e^{j∠H(ω)}. Because the system is linear, this result can be generalized to a sum of sinewaves, i.e., for an input of the form

x[n] = Σ_{k=0}^{N} A_k cos(ω_k n + φ_k)

the output is given by

y[n] = Σ_{k=0}^{N} A_k |H(ω_k)| cos[ω_k n + φ_k + ∠H(ω_k)].

A similar expression is obtained for an input consisting of a sum of complex exponentials. ▲

Two important consequences of the eigenfunction/eigenvalue property of complex exponentials for LTI systems, to be proven in Exercise 2.12, are stated below.

Convolution Theorem — This theorem states that convolution of sequences corresponds to multiplication of their corresponding Fourier transforms. Specifically, if

x[n] ↔ X(ω)
h[n] ↔ H(ω)

and if

y[n] = x[n] * h[n]

then

Y(ω) = X(ω)H(ω).

Windowing (Modulation) Theorem — The following theorem is the dual of the Convolution Theorem. Let

x[n] ↔ X(ω)
w[n] ↔ W(ω)

and if

y[n] = x[n]w[n]

then

Y(ω) = (1/2π) ∫_{−π}^{π} X(θ) W(ω − θ) dθ = (1/2π) X(ω) ⊛ W(ω)

where ⊛ denotes circular convolution, corresponding to one function being circularly shifted relative to the other with period 2π. We can also think of each function being defined only in the interval [−π, π] and being shifted modulo 2π in the convolution. The "windowing" terminology comes about because when the duration of w[n] is short relative to that of x[n], we think of w[n] as viewing (extracting) a short piece of the sequence x[n].

EXAMPLE 2.9 Consider a sequence consisting of a periodic train of unit samples

x[n] = Σ_{k=−∞}^{∞} δ[n − kP]

with Fourier transform (Exercise 2.13)

X(ω) = (2π/P) Σ_{k=−∞}^{∞} δ(ω − (2π/P)k).

Suppose that x[n] is the input to an LTI system with impulse response given by

h[n] = a^n u[n]

with Fourier transform

H(ω) = 1/(1 − a e^{−jω}),  |a| < 1.

The output Fourier transform is then

Y(ω) = X(ω)H(ω) = (2π/P) Σ_{k=−∞}^{∞} [1/(1 − a e^{−j(2π/P)k})] δ(ω − (2π/P)k).
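A short sketch (not from the text) makes the eigenfunction relation of Example 2.8 concrete: driving the first-order recursion that realizes H(ω) = 1/(1 − a e^{−jω}) with a sinusoid yields, after the transient, exactly A|H(ω₀)|cos(ω₀n + φ + ∠H(ω₀)). The parameter values below are arbitrary.

```python
import numpy as np

# Sinusoid through the LTI system h[n] = a^n u[n], realized recursively.
a, w0, A, phi = 0.7, np.pi / 5, 1.5, 0.3
n = np.arange(4000)
x = A * np.cos(w0 * n + phi)

# y[n] = x[n] + a y[n-1] realizes H(w) = 1/(1 - a e^{-jw})
y = np.zeros_like(x)
y[0] = x[0]
for i in range(1, len(x)):
    y[i] = x[i] + a * y[i - 1]

H = 1.0 / (1.0 - a * np.exp(-1j * w0))
expected = A * np.abs(H) * np.cos(w0 * n + phi + np.angle(H))

# Ignore the start-up transient; steady state matches the eigenvalue relation
assert np.max(np.abs(y[200:] - expected[200:])) < 1e-8
```

The transient decays like a^n, so by n = 200 it is far below the tolerance.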
We think of the magnitude of the system function as the spectral envelope of the train of Dirac delta pulses (Figure 2.6). This representation will be particularly important in modeling voiced speech sounds such as vowels. ▲

The importance of Example 2.10, the dual of Example 2.9, will become apparent later in the text when we perform short-time analysis of speech.

EXAMPLE 2.10 Consider again the sequence x[n] of Example 2.9, and suppose that the sequence is multiplied by a Hamming window of the form [7]

w[n] = 0.54 − 0.46 cos(2πn/(N_w − 1)),  0 ≤ n ≤ N_w − 1

and zero otherwise.

Σ_{n=0}^{m} |h_mp[n]|² ≥ Σ_{n=0}^{m} |h[n]|²,  m ≥ 0   (2.23)

where h[n] is a causal sequence with Fourier transform magnitude equal to that of the reference minimum-phase sequence h_mp[n]. As zeros are flipped outside the unit circle, the energy of the sequence is delayed in time, the maximum-phase counterpart having maximum energy delay (or phase lag) [7]. Similar energy-localization properties are found with respect to poles. However, because causality strictly cannot be made to hold when a z-transform contains maximum-phase poles, it is more useful to investigate how the energy of the sequence shifts with respect to the time origin. As illustrated in Example 2.11, flipping poles from inside to outside the unit circle to their conjugate reciprocal locations moves energy to the left of the time origin, transforming the fast attack of the minimum-phase sequence into a more gradual onset. We will see throughout the text that numerous speech analysis schemes result in a minimum-phase vocal tract impulse response estimate. Because the vocal tract is not necessarily minimum phase, synthesized speech may be characterized in these cases by an unnaturally abrupt vocal tract impulse response.

⁸ Because we assume causality and stability, the poles lie inside the unit circle. Different phase functions, for a specified magnitude, therefore are not contributed by the poles.
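A small sketch (not from the text) illustrates the energy-delay property behind Equation (2.23): reflecting a zero of a minimum-phase FIR sequence about the unit circle preserves the Fourier transform magnitude but delays the energy build-up. The zero locations are arbitrary choices.

```python
import numpy as np

# Minimum-phase sequence with zeros at 0.5 and 0.8, and a counterpart
# with the zero at 0.8 flipped to its reciprocal (same |H(w)|).
h_min = np.convolve([1.0, -0.5], [1.0, -0.8])
h_mix = np.convolve([1.0, -0.5], [-0.8, 1.0])

# Same magnitude response, checked on a dense frequency grid
w = np.linspace(0, np.pi, 512)
E = np.exp(-1j * np.outer(w, np.arange(3)))
assert np.allclose(np.abs(E @ h_min), np.abs(E @ h_mix))

# Minimum phase concentrates energy earliest: partial sums dominate, Eq. (2.23)
assert np.all(np.cumsum(h_min ** 2) >= np.cumsum(h_mix ** 2) - 1e-12)
assert np.isclose(np.sum(h_min ** 2), np.sum(h_mix ** 2))   # equal total energy
```

The total energies agree (Parseval), but the minimum-phase partial sums lead at every m, i.e., the fastest "attack."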
⁹ It is of interest to note that a sufficient, but not necessary, condition for a causal sequence to be minimum phase is that |h[0]| > Σ_{n=1}^{∞} |h[n]| [9].

EXAMPLE 2.11 An example comparing a mixed-phase impulse response h[n], having poles inside and outside the unit circle, with its minimum-phase reference h_mp[n] is given in Figure 2.9. The minimum-phase sequence has pole pairs at 0.95e^{±jω₁} and 0.95e^{±jω₂}. The mixed-phase sequence has pole pairs at 0.95e^{±jω₁} and 0.95^{-1}e^{±jω₂}. The minimum-phase sequence (a) is concentrated to the right of the origin and in this case is less "dispersed" than its non-minimum-phase counterpart (c). Panels (b) and (d) show that the frequency response magnitudes of the two sequences are identical. As we will see later in the text, there are perceptual differences in speech synthesis between the fast and gradual "attack" of the minimum-phase and mixed-phase sequences, respectively. ▲

Figure 2.9 Comparison in Example 2.11 of (a) a minimum-phase sequence h_mp[n] with (c) a mixed-phase sequence h[n] obtained by flipping one pole pair of h_mp[n] outside the unit circle to its conjugate reciprocal location. Panels (b) and (d) show the frequency response magnitudes of the minimum- and mixed-phase sequences, respectively.

2.8.3 FIR Filters

There are two classes of digital filters: finite impulse response (FIR) and infinite impulse response (IIR) filters [7],[10]. The impulse response of an FIR filter has finite duration and corresponds to having no denominator in the rational function H(z), i.e., there is no feedback in the difference equation (2.21). This results in the reduced form

y[n] = Σ_{r=0}^{M} b_r x[n − r].   (2.24)

Implementing such a filter thus requires simply a train of delay, multiply, and add operations.
By applying the unit sample input and interpreting the output as the sum of weighted, delayed unit samples, we obtain the impulse response

h[n] = b_n, 0 ≤ n ≤ M
     = 0, otherwise.

Because h[n] is bounded over the duration 0 ≤ n ≤ M, it is causal and stable. The corresponding rational transfer function in Equation (2.22) reduces to the form

H(z) = A Π_{k=1}^{M_i} (1 − a_k z^{-1}) Π_{k=1}^{M_o} (1 − b_k z)

with M_i + M_o = M and with zeros inside and outside the unit circle; the ROC is the entire z-plane except at the only possible poles, z = 0 or z = ∞.

FIR filters can be designed to have exactly linear phase. For example, if we impose on the impulse response a symmetry of the form h[n] = h[M − n], then under the simplifying assumption that M is even (Exercise 2.14),

H(ω) = A(ω) e^{−jω(M/2)}

where A(ω) is purely real, implying that phase distortion will not occur due to filtering [10], an important property in speech processing.¹⁰

¹⁰ This does not mean that A(ω) is positive. However, if the filter h[n] has most of its spectral energy where A(ω) > 0, then little speech phase distortion will occur.

2.8.4 IIR Filters

IIR filters include the denominator term in H(z) and thus have feedback in the difference equation representation of Equation (2.21). Because symmetry is required for linear phase, most¹¹ IIR filters will not have linear phase, since they are right-sided and infinite in duration. Generally, IIR filters have both poles and zeros. As we noted earlier for the special case where the number of zeros is less than the number of poles, the system function H(z) can be expressed in a partial fraction expansion as in Equation (2.18). Under this condition, for causal systems, the impulse response can be written in the form

h[n] = Σ_{k=1}^{N_i} A_k c_k^n u[n]

¹¹ A class of linear-phase IIR filters has been shown to exist [1]. The transfer function for this filter class, however, is not rational and thus does not have an associated difference equation.
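A brief sketch (not from the text) verifies the linear-phase property: for a symmetric FIR filter with M even, removing the factor e^{−jω(M/2)} from H(ω) leaves a purely real A(ω). The filter taps below are an arbitrary symmetric choice.

```python
import numpy as np

# Symmetric FIR filter h[n] = h[M - n], M = 4 (even)
h = np.array([1.0, -2.0, 3.5, -2.0, 1.0])
M = len(h) - 1

w = np.linspace(-np.pi, np.pi, 1001)
H = np.exp(-1j * np.outer(w, np.arange(M + 1))) @ h   # DTFT of h
A = H * np.exp(1j * w * M / 2)        # remove the linear-phase factor
assert np.max(np.abs(A.imag)) < 1e-12 # A(w) is real (it may still go negative)
```

Here A(ω) = 3.5 − 4cos(ω) + 2cos(2ω), a real function, so the phase of H(ω) is exactly −ω(M/2) wherever A(ω) > 0.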
where the c_k are generally complex, so that the impulse response is a sum of decaying complex exponentials. Equivalently, because h[n] is real, it can be written by combining complex-conjugate pairs as a set of decaying sinusoids of the form

h[n] = Σ_{k=1}^{N_i/2} B_k |c_k|^n cos(ω_k n + φ_k) u[n]

where we have assumed no real poles and thus that N_i is even. Given a desired spectral magnitude and phase response, there exist numerous IIR filter design methods [7],[10].

In the implementation of IIR filters, there exists more flexibility than with FIR filters. A "direct-form" method is seen in the recursive difference equation itself [7],[10]. The partial fraction expansion of Equation (2.18) gives another implementation that, as we will see later in the text, is particularly useful in a parallel resonance realization of a vocal tract transfer function. Suppose, for example, that the number of poles in H(z) is even and that all poles occur in complex-conjugate pairs. Then we can alter the partial fraction expansion in Equation (2.18) to take the form

X(z) = Σ_{k=1}^{N_i/2} Ã_k (1 − ρ_k z^{-1}) / [(1 − c_k z^{-1})(1 − c_k* z^{-1})]
     = Σ_{k=1}^{N_i/2} Ã_k (1 − ρ_k z^{-1}) / (1 − u_k z^{-1} + v_k z^{-2})   (2.25)

which represents N_i/2 second-order IIR filters in parallel. Other digital filter implementation structures are introduced as needed in speech analysis/synthesis schemes throughout the text.

2.9 Time-Varying Systems

Up to now we have studied linear systems that are time-invariant, i.e., if x[n] results in y[n], then a shifted input x[n − n_o] results in a shifted output y[n − n_o]. In the speech production mechanism, however, we often encounter time-varying linear systems. Although superposition holds in such systems, time-invariance does not. A simple illustrative example is a "system" which multiplies the input by a sequence h[n].
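As a sketch (not from the text) of the second-order sections in Equation (2.25): combining a complex-conjugate pole pair c = re^{jω₀}, c* gives the real denominator 1 − u z^{-1} + v z^{-2} with u = 2r cos(ω₀) and v = r², whose impulse response is a decaying sinusoid, i.e., a single "resonance."

```python
import numpy as np

# One parallel second-order section: H(z) = 1 / (1 - u z^-1 + v z^-2)
r, w0 = 0.95, 0.4                       # pole at c = r e^{j w0} and conjugate
u, v = 2 * r * np.cos(w0), r ** 2

# Impulse response via the recursion h[n] = delta[n] + u h[n-1] - v h[n-2]
N = 200
h = np.zeros(N)
for n in range(N):
    h[n] = (1.0 if n == 0 else 0.0) \
         + (u * h[n - 1] if n >= 1 else 0.0) \
         - (v * h[n - 2] if n >= 2 else 0.0)

# Closed form: h[n] = r^n sin((n+1) w0) / sin(w0), a decaying sinusoid
n = np.arange(N)
closed = r ** n * np.sin((n + 1) * w0) / np.sin(w0)
assert np.max(np.abs(h - closed)) < 1e-9
```

A vocal-tract-like transfer function would sum several such sections, one per formant.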
The system is linear because (αx₁[n] + βx₂[n])h[n] = αx₁[n]h[n] + βx₂[n]h[n], but it is not time-invariant because, in general, x[n − n_o]h[n] ≠ x[n − n_o]h[n − n_o].

A time-varying linear system is characterized by an impulse response that changes for each time n. This system can be represented by a two-dimensional function g[n, m], which is the impulse response at time n to a unit sample applied at time m. The response, for example, to a unit sample at time m = 0 is g[n, 0], while the response to δ[n − n_o] is g[n, n_o]. The two-dimensional function g[n, m] is sometimes referred to as Green's function [8]. Because the system is linear, and because an input x[n] consists of a sum of weighted and delayed unit samples, i.e., x[n] = Σ_{m=−∞}^{∞} x[m]δ[n − m], the output to the input x[n] is given by

y[n] = Σ_{m=−∞}^{∞} g[n, m]x[m]   (2.26)

which is a superposition sum, but not a convolution. We can see how Equation (2.26) differs from a convolution by invoking an alternate two-dimensional representation, which is the response of the system at time n to a unit sample applied m samples earlier, at time n − m. This new function, called the time-varying unit sample response and denoted by h[n, m], is related to Green's function by h[n, m] = g[n, n − m] or, equivalently, h[n, n − m] = g[n, m]. We can then write the time-varying system output as (Exercise 2.16)

y[n] = Σ_{m=−∞}^{∞} h[n, n − m]x[m] = Σ_{m=−∞}^{∞} h[n, m]x[n − m]   (2.27)

where we have invoked a change of variables and where the weight on each impulse response corresponds to the input sequence m samples in the past. When the system is time-invariant, Equation (2.27) reduces to (Exercise 2.16)

y[n] = Σ_{m=−∞}^{∞} h[m]x[n − m]   (2.28)

which is the convolution of the input with the impulse response of the resulting linear time-invariant system.
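The relation between Equations (2.26) and (2.27) can be sketched numerically (not from the text); the example system below, whose decay rate changes at an arbitrary switching time, is hypothetical.

```python
import numpy as np

N = 30
def g(n, m):
    """Green's function: response at time n to a unit sample applied at m."""
    if n < m:
        return 0.0                    # causal system
    a = 0.9 if m < 10 else 0.5        # the system changes character at m = 10
    return a ** (n - m)

x = np.random.default_rng(0).standard_normal(N)

# Eq. (2.26): y[n] = sum_m g[n, m] x[m]
y_green = np.array([sum(g(n, m) * x[m] for m in range(N)) for n in range(N)])

# Eq. (2.27): y[n] = sum_m h[n, m] x[n - m] with h[n, m] = g[n, n - m]
y_h = np.array([sum(g(n, n - m) * x[n - m] for m in range(n + 1))
                for n in range(N)])
assert np.allclose(y_green, y_h)
```

The two sums are the same superposition indexed two different ways; neither is a convolution, since the kernel depends on n.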
It is of interest to determine whether we can devise Fourier and z-transform pairs, i.e., a frequency response and transfer function, for linear time-varying systems, as we had done with linear time-invariant systems. Let us return to the familiar complex exponential as an input to the linear time-varying system with impulse response h[n, m]. Then, from Equation (2.27), the output y[n] is expressed as

y[n] = Σ_{m=−∞}^{∞} h[n, m] e^{jω(n−m)} = e^{jωn} Σ_{m=−∞}^{∞} h[n, m] e^{−jωm} = e^{jωn} H(n, ω)   (2.29)

where

H(n, ω) = Σ_{m=−∞}^{∞} h[n, m] e^{−jωm}

which is the Fourier transform of h[n, m] at time n, evaluated with respect to the variable m, and referred to as the time-varying frequency response. Equivalently, we can write the time-varying frequency response in terms of Green's function as (Exercise 2.17)

H(n, ω) = e^{−jωn} Σ_{m=−∞}^{∞} g[n, m] e^{jωm}   (2.30)

which, except for the linear phase factor e^{−jωn}, is the Fourier transform of Green's function at time n. Because the system of interest is linear, its output to an arbitrary input x[n] is given by the following superposition [8] (Exercise 2.15):

y[n] = Σ_{m=−∞}^{∞} h[n, m]x[n − m] = (1/2π) ∫_{−π}^{π} H(n, ω) X(ω) e^{jωn} dω   (2.31)

so that the output y[n] of h[n, m] at time n is the inverse Fourier transform of the product X(ω)H(n, ω), which can be thought of as a generalization of the Convolution Theorem for linear time-invariant systems. This generalization, however, can be taken only so far. For example, the elements of a cascade of two time-varying linear systems, i.e., H₁(n, ω) followed by H₂(n, ω), do not generally combine in the frequency domain by multiplication, and the elements cannot generally be interchanged, as illustrated in the following example. Consequently, care must be taken in interchanging the order of time-varying systems in the context of speech modeling and processing.
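Equation (2.31) can be checked numerically (not from the text) on a finite-length input, using a dense DFT grid as a sampled stand-in for the DTFT integral; the random 8-tap time-varying system is hypothetical.

```python
import numpy as np

# Verify y[n] = (1/2pi) * int H(n,w) X(w) e^{jwn} dw for a finite input.
N = 64
rng = np.random.default_rng(1)
x = rng.standard_normal(N)
h = rng.standard_normal((N, 8))           # h[n, m]: an 8-tap response at each n

# Direct superposition sum, Eq. (2.27), with x[k] = 0 for k < 0
xp = np.concatenate([np.zeros(8), x])
y = np.array([sum(h[n, m] * xp[8 + n - m] for m in range(8))
              for n in range(N)])

# Frequency-domain form, with the integral as a mean over M grid points
M = 4 * N                                  # dense grid avoids wrap-around
w = 2 * np.pi * np.arange(M) / M
X = np.fft.fft(x, M)                       # DTFT of x sampled on the grid
for n in range(0, N, 7):                   # spot-check several times n
    Hn = sum(h[n, m] * np.exp(-1j * w * m) for m in range(8))
    yn = np.mean(Hn * X * np.exp(1j * w * n)).real
    assert abs(yn - y[n]) < 1e-9
```

Because the grid has more points than the effective length of the product sequence, the sampled integral is exact up to rounding.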
EXAMPLE 2.12 Consider the linear time-varying multiplier operation y[n] = x[n]e^{jω_o n} cascaded with a linear time-invariant ideal lowpass filter h[n], as illustrated in Figure 2.10. Then, in general,

(x[n]e^{jω_o n}) * h[n] ≠ (x[n] * h[n]) e^{jω_o n}.

For example, let x[n] = e^{jω₁n} with ω₁ below the lowpass cutoff but with ω₁ + ω_o above it. When the lowpass filter follows the multiplier, the output is zero; when the order is interchanged, the output is nonzero. ▲

We will see in following chapters that under certain "slowly varying" conditions, linear time-varying systems can be approximated by linear time-invariant systems. The accuracy of this approximation will depend on the time duration over which we view the system and its input, as well as the rate at which the system changes. More formal conditions have been derived by Matz and Hlawatsch [5] under which a "transfer function calculus" is allowed for time-varying systems.

Figure 2.10 Cascade configurations of linear time-invariant and time-varying systems.

2.10 Discrete Fourier Transform

The Fourier transform of a discrete-time sequence is a continuous function of frequency [Equation (2.2)]. Because, in practice, when using digital computers we cannot work with continuous frequency, we need to sample the Fourier transform, and, in particular, we want to sample finely enough to be able to recover the sequence. For sequences of finite length N, sampling yields a new transform referred to as the discrete Fourier transform, or DFT. The DFT pair representation of x[n] is given by

X(k) = Σ_{n=0}^{N−1} x[n] e^{−j(2π/N)kn},  0 ≤ k ≤ N − 1

x[n] = (1/N) Σ_{k=0}^{N−1} X(k) e^{j(2π/N)kn},  0 ≤ n ≤ N − 1   (2.32)

Ω_N = 2πF_N. Then x_a(t) can be uniquely determined from its uniformly spaced samples x[n] = x_a(nT) if the sampling frequency F_s is greater than twice the largest frequency of the signal, i.e., F_s > 2F_N.
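As a small sketch (not from the text), the DFT pair (2.32) can be verified directly: the synthesis sum inverts the analysis sum exactly for a finite-length sequence, and the analysis sum agrees with a library FFT.

```python
import numpy as np

# DFT pair of Eq. (2.32) written out as matrix-vector products.
N = 16
x = np.cos(2 * np.pi * 3 * np.arange(N) / N) + 0.5

k = n = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(k, n) / N)     # DFT matrix
X = W @ x                                        # analysis equation
x_rec = (W.conj() @ X).real / N                  # synthesis equation
assert np.allclose(x_rec, x)
assert np.allclose(X, np.fft.fft(x))             # matches the library FFT
```

The conjugate of the DFT matrix, scaled by 1/N, is its inverse, which is exactly the synthesis equation.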
The largest frequency in the signal, F_N, is called the Nyquist frequency, and 2F_N, which must be attained in sampling for reconstruction, is called the Nyquist rate. For example, in speech we might assume a 5000-Hz bandwidth. Therefore, for signal recovery we must sample at F_s = 10000 samples/s, corresponding to a T = 100 μs sampling interval. The basis for the Sampling Theorem is that sampling x_a(t) at a rate of 1/T results in spectral duplicates spaced by 1/T, so that sampling at the Nyquist rate avoids aliasing, thus preserving the spectral integrity of the signal.

The sampling can be performed with a periodic impulse train with spacing T and unity weights, i.e., p(t) = Σ_{k=−∞}^{∞} δ(t − kT). The impulse train resulting from multiplication with the signal x_a(t), denoted by x_p(t), has weights equal to the signal values at the sampling instants, i.e.,

x_p(t) = x_a(t)p(t) = Σ_{k=−∞}^{∞} x_a(kT) δ(t − kT).   (2.33)

The impulse weights are values of the discrete-time signal, i.e., x[n] = x_a(nT), and therefore, as illustrated in Figure 2.11, the cascade of sampling with the impulse train p(t) followed by conversion of the resulting impulse weights to a sequence is thought of as an ideal A/D converter (or C/D converter). In the frequency domain, the impulse train p(t) maps to another impulse train with spacing Ω_s = 2πF_s, i.e., the Fourier transform of p(t) is P(Ω) = (2π/T) Σ_{k=−∞}^{∞} δ(Ω − kΩ_s). Using the continuous-time version of the Windowing Theorem, it follows

Figure 2.11 Path from sampling to reconstruction (2Ω_N = Ω_s).

that P(Ω) convolves with the Fourier transform of the signal, X_a(Ω), thus resulting in a continuous-time Fourier transform with spectral duplicates

X_p(Ω) = (1/T) Σ_{k=−∞}^{∞} X_a(Ω − kΩ_s)   (2.34)

where Ω_s = 2πF_s. Therefore,
the original continuous-time signal x_a(t) can be recovered by applying a lowpass analog filter with gain T in the passband [−Ω_s/2, Ω_s/2] and zero outside this band. This perspective also leads to a reconstruction formula which interpolates the signal samples with a sin(x)/x function. Using the continuous-time version of the Convolution Theorem, application of an ideal lowpass filter with cutoff Ω_s/2 corresponds to the convolution of the filter impulse response with the signal-weighted impulse train x_p(t). Thus, we have a reconstruction formula given by

x_a(t) = Σ_{n=−∞}^{∞} x[n] sin(π(t − nT)/T) / (π(t − nT)/T)

because the function sin(πt/T)/(πt/T) is the inverse Fourier transform of the ideal lowpass filter. As illustrated in Figure 2.11, the cascade of the conversion of the sequence y[n] = x[n] to a continuous-time impulse train (with weights x_a(nT)) followed by lowpass filtering is thought of as a discrete-to-continuous (D/C) converter. In practice, however, a digital-to-analog (D/A) converter is used. Unlike the D/C converter, because of quantization error and other forms of distortion, D/A converters do not achieve perfect reconstruction.

The relation between the Fourier transform of x_a(t), X_a(Ω), and the discrete-time Fourier transform of x[n] = x_a(nT), X(ω), can now be deduced. When the Sampling Theorem holds, over the frequency interval [−π, π], X(ω) is a frequency-scaled (or frequency-normalized) version of X_a(Ω). Specifically, over the interval [−π, π] we have

X(ω) = (1/T) X_a(ω/T),  |ω| ≤ π.

This relation can be obtained by first observing that X_p(Ω) can be written as

X_p(Ω) = Σ_{n=−∞}^{∞} x_a(nT) e^{−jΩTn}    (2.35)

by applying the continuous-time Fourier transform to Equation (2.33), and then comparing this result with the expression for the discrete-time Fourier transform in Equation (2.2) [7]. We see that if the sampling is performed exactly at the Nyquist rate, then the normalized frequency π corresponds to the highest frequency in the signal.
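The sin(x)/x reconstruction formula can be checked directly by interpolating the samples of a bandlimited signal at off-grid times. In this sketch, the 1 kHz tone, 8 kHz sampling rate, and truncation of the infinite sum to a finite number of terms are all illustrative assumptions:

```python
import numpy as np

# Sketch of the sin(x)/x reconstruction formula,
#   x_a(t) = sum_n x[n] * sin(pi(t - nT)/T) / (pi(t - nT)/T),
# applied to samples of a bandlimited signal.  The 1 kHz tone, 8 kHz
# sampling rate, and truncated sum are illustrative assumptions.
Fs = 8000.0
T = 1 / Fs
n = np.arange(-400, 400)
x = np.sin(2 * np.pi * 1000 * n * T)           # samples of a 1 kHz tone

t = np.array([1.3e-4, 2.71e-4, 5.5e-4])        # off-grid evaluation times
# np.sinc(u) = sin(pi*u)/(pi*u), exactly the interpolation kernel above
x_rec = np.array([np.sum(x * np.sinc((ti - n * T) / T)) for ti in t])
x_true = np.sin(2 * np.pi * 1000 * t)
print(np.max(np.abs(x_rec - x_true)))          # small truncation error
```

The residual error comes only from truncating the sum; with the full infinite sum, recovery of a bandlimited signal is exact.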
For example, when F_N = 5000 Hz, then π corresponds to 5000 Hz. The entire path, including sampling, application of a discrete-time system, and reconstruction, as well as the frequency relation between signals, is illustrated in Figure 2.11. In this illustration, the sampling frequency equals the Nyquist rate, i.e., 2Ω_N = Ω_s. The discrete-time system shown in the figure may be, for example, a digital filter which has been designed in discrete time or derived from sampling an analog filter with some desired properties.

A topic related to the Sampling Theorem is the decrease and increase of the sampling rate, referred to as decimation and interpolation or, alternatively, as downsampling and upsampling, respectively. Changing the sampling rate can be important in speech processing where one parameter may be deemed to be slowly varying relative to another; for example, the state of the vocal tract may vary more slowly, and thus have a smaller bandwidth, than the state of the vocal cords. Therefore, different sampling rates may be applied in their estimation, requiring a change in sampling rate in waveform reconstruction. For example, although the speech waveform is sampled at, say, 10000 samples/s, the vocal cord parameters may be sampled at 100 times/s, while the vocal tract parameters are sampled at 50 times/s. The reader should briefly review one of the numerous tutorials on decimation and interpolation [3],[7],[10].

2.11.2 Sampling a System Response

In the previous section, we sampled a continuous-time waveform to obtain discrete-time samples for processing by a digital computer or other discrete-time-based system. We will also have occasion to transform analog systems to discrete-time systems, as, for example, in sampling a continuous-time representation of the vocal tract impulse response, or in the replication of the spectral shape of an analog filter.
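As a minimal illustration of the downsampling step of decimation, keeping every M-th sample of a tone at normalized frequency ω rescales it to Mω. The frequency below is an assumption chosen so that no aliasing occurs and no lowpass prefilter is needed:

```python
import numpy as np

# Minimal sketch of downsampling: keeping every M-th sample of a tone
# at normalized frequency w rescales it to M*w.  Here M = 2 and
# w = pi/8 (an assumed value chosen so 2*w = pi/4 stays below pi and
# no lowpass prefilter is needed to avoid aliasing).
n = np.arange(128)
x = np.cos(np.pi / 8 * n)                     # tone at w = pi/8
y = x[::2]                                    # downsample by M = 2
m = np.arange(64)
print(np.allclose(y, np.cos(np.pi / 4 * m)))  # prints True: w doubled
```

For signals occupying more than a fraction 1/M of the band, a lowpass filter must precede the sample selection, which is what distinguishes decimation from bare downsampling.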
One approach to this transformation is to simply sample the continuous-time impulse response of the analog system; i.e., we perform the continuous-to-discrete-time mapping

h[n] = h_a(nT)

where h_a(t) is the analog system impulse response and T is the sampling interval. This method of discrete-time filter design is referred to as the impulse invariance method [7]. Similar to sampling of continuous-time waveforms, the discrete-time Fourier transform of the sequence h[n], H(ω), is related to the continuous-time Fourier transform of h_a(t), H_a(Ω), by the relation

H(ω) = (1/T) H_a(ω/T),  |ω| ≤ π

where we assume h_a(t) is bandlimited and the sampling rate is such as to satisfy the Nyquist criterion [7]. The shape of the frequency response of the analog system is therefore preserved.

It is also of interest to determine how poles and zeros are transformed in going from the continuous- to the discrete-time filter domains as, for example, in transforming a continuous-time vocal tract impulse response. To obtain a flavor for this style of conversion, consider the continuous-time rendition of the IIR filter in Equation (2.25), i.e.,

h_a(t) = Σ_{k=1}^{N} A_k e^{s_k t} u(t)

whose Laplace transform is given in partial fraction expansion form (the continuous counterpart to Equation (2.18)) [7]:

H_a(s) = Σ_{k=1}^{N} A_k / (s − s_k).

Then the impulse invariance method results in the discrete-time impulse response

h[n] = h_a(nT) = Σ_{k=1}^{N} A_k e^{s_k nT} u[n]

whose z-transform is given by

H(z) = Σ_{k=1}^{N} A_k / (1 − e^{s_k T} z^{−1})

with poles at z = e^{s_k T} inside the unit circle in the z-plane (|e^{s_k T}| = e^{Re[s_k]T} < 1 when Re[s_k] < 0) mapped from poles in the s-plane at s = s_k located to the left of the jΩ axis. Poles being to the left of the jΩ axis is a stability condition for causal continuous systems. Although the poles are mapped inside the unit circle, the mapping of the zeros depends on both the resulting poles and the coefficients A_k in the partial fraction expansion.
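The pole mapping z = e^{s_k T} can be verified numerically for a single real pole; the values of s_0 and T below are illustrative assumptions:

```python
import numpy as np

# Sketch of the impulse invariance pole mapping for one real pole:
# sampling h_a(t) = exp(s0*t)u(t) at interval T gives h[n] = a^n u[n]
# with a = exp(s0*T), i.e. the s-plane pole s0 maps to the z-plane
# pole exp(s0*T).  The values of s0 and T are illustrative assumptions.
s0 = -500.0                       # stable pole: Re[s0] < 0
T = 1e-4                          # Fs = 10 kHz
n = np.arange(200)
h = np.exp(s0 * n * T)            # h[n] = h_a(nT)

a = np.exp(s0 * T)                # predicted z-plane pole location
print(np.allclose(h, a ** n))     # h[n] is geometric with ratio a
print(abs(a) < 1)                 # pole inside the unit circle: stable
```

Because Re[s_0] < 0 forces |e^{s_0 T}| < 1 for any T > 0, a stable causal analog pole always lands inside the unit circle under this mapping.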
It is conceivable, therefore, that a minimum-phase response may be mapped to a mixed-phase response with zeros outside the unit circle, a consideration that can be particularly important in modeling the vocal tract impulse response. Other continuous-to-discrete-time conversion methods, e.g., the bilinear transformation [7], will be described as needed throughout the text.

2.11.3 Numerical Simulation of Differential Equations

In a loose sense, the impulse invariance method can be thought of as a numerical simulation of a continuous-time system by a discrete-time system. Suppose that the continuous-time system is represented by a differential equation. Then a discrete-time simulation of this analog system could alternatively be obtained by approximating the derivatives by finite differences; e.g., the derivative of x(t) at t = nT is approximated by

dx(t)/dt |_{t=nT} ≈ [x(nT) − x((n − 1)T)] / T.

In mapping the frequency response of the continuous-time system to the unit circle, however, such an approach has been shown to be undesirable due to the need for an exceedingly fast sampling rate, as well as due to the restriction on the nature of the resulting frequency response [6]. Nevertheless, in this text we will have occasion to revisit this approach in a number of contexts, such as in realizing analog models of speech production or analog signal processing tools in discrete time. This will become especially important when considering differential equations that are not necessarily time-invariant, are possibly coupled, and/or which may contain a nonlinear element. In these scenarios, approximating derivatives by differences is one solution option, where digital signal processing techniques are applied synergistically with more conventional numerical analysis methods. Other solution options exist, such as the use of a wave digital filter methodology to solve coupled, time-varying, nonlinear differential equations [4].
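A minimal sketch of the finite-difference approach: replacing the derivative in a first-order system τ·dy/dt + y(t) = x(t) with a backward difference turns the differential equation into a recursion. The system, time constant, and step size here are illustrative assumptions, with T chosen much smaller than τ so the approximation holds:

```python
import numpy as np

# Sketch of simulating a differential equation by finite differences.
# The first-order system tau*dy/dt + y = x, the time constant, and the
# step size are illustrative assumptions.  Replacing the derivative by
# the backward difference (y[n] - y[n-1])/T gives the recursion below.
tau = 1e-3                         # time constant: 1 ms
T = 1e-5                           # step size: 10 us (T << tau)
N = 500
x = np.ones(N)                     # unit step input
y = np.zeros(N)
for k in range(1, N):
    # (tau/T)*(y[k] - y[k-1]) + y[k] = x[k], solved for y[k]:
    y[k] = (tau / T * y[k - 1] + x[k]) / (1 + tau / T)

t = np.arange(N) * T
y_exact = 1 - np.exp(-t / tau)     # analytic step response
print(np.max(np.abs(y - y_exact))) # small discretization error
```

Shrinking T reduces the discrepancy, which illustrates the text's point: the difference approximation demands a sampling rate well above what the signal bandwidth alone would require.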
2.12 Summary

In this chapter we have reviewed the foundation of discrete-time signal processing which will serve as a framework for the remainder of the text. We reviewed discrete-time signals and systems and their Fourier and z-transform representations. A fundamental property of the Fourier transform is the uncertainty principle, which imposes a constraint between the duration and bandwidth of a sequence and limits our ability to simultaneously resolve in time and frequency dynamically changing events or events closely spaced in time and frequency. We will investigate this limitation in a speech context more fully in Chapters 7 and 11. In this chapter, we introduced the concepts of minimum- and mixed-phase sequences and looked at important relationships between the magnitude and phase of their Fourier transforms. A property of these sequences, which we will see influences the perception of synthesized speech, is that a minimum-phase sequence is often characterized by a sharper "attack" than that of a mixed-phase counterpart with the same Fourier transform magnitude. Also in this chapter, we briefly reviewed some considerations in obtaining a discrete-time sequence by sampling a continuous-time signal, and also reviewed constraints for representing a sequence from samples of its discrete-time Fourier transform, i.e., from its DFT. Finally, we introduced the notion of time-varying linear systems whose output is not represented by a convolution sum, but rather by a more general superposition of delayed and weighted input values. An important property of time-varying systems is that they do not commute, implying that care must be taken when interchanging their order. The importance of time-varying systems in a speech processing context will become evident as we proceed deeper into the text.
