Digital Speech Processing
Speech Coding
• Speech Coding is the process of transforming a speech signal into a representation for efficient transmission and storage of speech
– narrowband and broadband wired telephony
– cellular communications
– Voice over IP (VoIP) to utilize the Internet as a real-time communications medium
– secure voice for privacy and encryption for national security applications
– extremely narrowband communications channels, e.g., battlefield applications using HF radio
– storage of speech for telephone answering machines, IVR systems, prerecorded messages

Demo of Speech Coding
• Narrowband Speech Coding:
– 64 kbps PCM
– 32 kbps ADPCM
– 16 kbps LDCELP
– 8 kbps CELP
– 4.8 kbps FS1016
– 2.4 kbps LPC10E
• Wideband Speech Coding (Male talker / Female talker):
– 3.2 kHz – uncoded
– 7 kHz – uncoded
– 7 kHz – 64 kbps
– 7 kHz – 32 kbps
– 7 kHz – 16 kbps
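The narrowband rates in the coding demo can be related to storage cost with a few lines of arithmetic. The sketch below is illustrative only: the bit rates come from the slide, while the helper names `bytes_per_minute` and `compression_ratio`, and the assumption that 64 kbps PCM corresponds to 8000 samples per second at 8 bits per sample (standard telephone PCM), are not stated in the slides.

```python
# Bit rates (kbps) of the narrowband coders listed in the demo.
NARROWBAND_KBPS = {
    "PCM": 64.0,
    "ADPCM": 32.0,
    "LDCELP": 16.0,
    "CELP": 8.0,
    "FS1016": 4.8,
    "LPC10E": 2.4,
}

# Assumption: telephone PCM samples at 8 kHz with 8 bits per sample.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
pcm_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000


def bytes_per_minute(kbps):
    """Storage in bytes for one minute of speech at the given bit rate."""
    return kbps * 1000 * 60 / 8


def compression_ratio(kbps, reference_kbps=64.0):
    """How many times smaller the coded stream is than the reference rate."""
    return reference_kbps / kbps


for name, kbps in NARROWBAND_KBPS.items():
    print(
        f"{name:7s} {kbps:5.1f} kbps  "
        f"{bytes_per_minute(kbps) / 1000:6.1f} kB/min  "
        f"{compression_ratio(kbps):5.1f}x vs 64 kbps PCM"
    )
```

Under these assumptions, the lowest-rate coder on the list needs a small fraction of the storage of 64 kbps PCM, which is the trade-off the listening demo is meant to expose: bit rate falls, and audible quality falls with it.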
Speech Synthesis
text → Linguistic Rules → DSP Computer → D-to-A Converter → speech
• Synthesis of Speech is the process of generating a speech signal using computational means for effective human-machine interactions
– machine reading of text or email messages
– telematics feedback in automobiles
– talking agents for automatic transactions
– automatic agent in customer care call center
– handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
– announcement machines that provide information such as stock quotes, airline schedules, weather reports, etc.
Speech Synthesis Examples
• Soliloquy from Hamlet
• Gettysburg Address
• Third Grade Story: 1964-lrr, 2002-tts

Pattern Matching Problems
• speech recognition
• speaker recognition
• speaker verification
• word spotting
• automatic indexing of speech recordings
[Figure label: Reference Patterns]
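The pattern matching problems listed above share one idea: an unknown utterance is compared against stored reference patterns. The following is a minimal sketch of that idea, assuming each utterance has already been reduced to a fixed-length feature vector. Real systems use time-varying features and more elaborate distance measures. The function names `recognize` and `verify`, the toy vectors, and the threshold are all hypothetical and do not come from the slides.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(features, reference_patterns):
    """Recognition: return the label of the closest reference pattern."""
    return min(
        reference_patterns,
        key=lambda label: distance(features, reference_patterns[label]),
    )


def verify(features, claimed_reference, threshold):
    """Verification: accept only if the input is close to the claimed pattern."""
    return distance(features, claimed_reference) <= threshold


# Toy reference patterns (hypothetical 3-dimensional feature vectors).
references = {"yes": [1.0, 0.0, 0.5], "no": [0.0, 1.0, 0.5]}
unknown = [0.9, 0.1, 0.4]

print(recognize(unknown, references))
print(verify(unknown, references["yes"], threshold=0.3))
```

Recognition picks the best match among all references, whereas verification tests a single claimed identity against a threshold. Word spotting and indexing can be viewed as applying the same comparison repeatedly along a recording.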
Other Speech Applications
• Speaker Verification for secure access to premises, information, virtual spaces
• Speaker Recognition for legal and forensic purposes (national security); also for personalized services
• Speech Enhancement for use in noisy environments, to eliminate echo, to align voices with video segments, to change voice qualities, to speed up or slow down prerecorded speech (e.g., talking books, rapid review of material, careful scrutinizing of spoken material, etc.) => potentially to improve intelligibility and naturalness of speech
• Language Translation to convert spoken words in one language to another to facilitate natural language dialogues between people speaking different languages, e.g., tourists, business people

DSP/Speech Enabled Devices
• Internet Audio
• Digital Cameras
• PDAs & Streaming Audio/Video
• Hearing Aids
• Cell Phones
Speech Production/Generation Model
• Message Formulation → desire to communicate an idea, a wish, a request, … => express the message as a sequence of words
– Desire to Communicate → Message Formulation → Text String (Discrete Symbols)
– example text strings: "I need some string", "Please get me some string", "Where can I buy some string"
• Language Code → need to convert chosen text string to a sequence of sounds in the language that can be understood by others; need to give some form of emphasis, prosody (tune, melody) to the spoken sounds so as to impart non-speech information such as sense of urgency, importance, psychological state of talker, environmental factors (noise, echo)
– Text String → Language Code Generator → Phoneme string with prosody (Discrete Symbols)
• Neuro-Muscular Controls → need to direct the neuro-muscular system to move the articulators (tongue, lips, teeth, jaws, velum) so as to produce the desired spoken message in the desired manner
– Phoneme String with Prosody → Neuro-Muscular Controls → Articulatory motions (Continuous control)
• Vocal Tract System → need to shape the human vocal tract system and provide the appropriate sound sources to create an acoustic waveform (speech) that is understandable in the environment in which it is spoken
– Articulatory Motions → Vocal Tract System → Acoustic Waveform (Speech) (Continuous control)
Speech Sciences
• Linguistics: science of language, including phonetics, phonology, morphology, and syntax
• Phonemes: smallest set of units considered to be the basic set of distinctive sounds of a language (20-60 units for most languages)

The Speech Circle
• Customer voice request → ASR (Automatic Speech Recognition)
• TTS (Text-to-Speech Synthesis) → Voice reply to customer: “What number did you want to call?”

• 1000-2000 times change in information rate from discrete message symbols to waveform encoding => can we achieve this three orders of magnitude reduction in information rate on real speech waveforms?
[Figure labels: Extraction and Utilization of Information; Human listeners, machines]
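The 1000-2000 times figure quoted above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is hedged: the phoneme inventory size (20-60) and the 64 kbps PCM rate appear in these slides, but the speaking rate of about 10 symbols per second, the 16 kHz wideband sampling rate, and the 8 bits per sample are assumptions introduced here for illustration, as are the function names.

```python
import math

# From the slides: most languages have 20-60 phonemes.
PHONEME_COUNTS = (20, 60)

# Assumptions (not stated in the slides):
SYMBOLS_PER_SECOND = 10          # rough normal speaking rate
SAMPLE_RATES_HZ = (8000, 16000)  # telephone and wideband sampling rates
BITS_PER_SAMPLE = 8              # log-encoded PCM


def symbol_rate_bps(num_symbols, symbols_per_second=SYMBOLS_PER_SECOND):
    """Bit rate of a fixed-length binary code for an alphabet of num_symbols."""
    bits_per_symbol = math.ceil(math.log2(num_symbols))
    return bits_per_symbol * symbols_per_second


def waveform_rate_bps(sample_rate_hz, bits_per_sample=BITS_PER_SAMPLE):
    """Bit rate of straightforward sample-by-sample waveform encoding."""
    return sample_rate_hz * bits_per_sample


# Use the largest phoneme inventory as the discrete-symbol rate.
message_bps = symbol_rate_bps(max(PHONEME_COUNTS))

for fs in SAMPLE_RATES_HZ:
    wave_bps = waveform_rate_bps(fs)
    print(
        f"{fs} Hz sampling: {wave_bps} bps waveform vs "
        f"{message_bps} bps symbols, ratio about {wave_bps / message_bps:.0f}"
    )
```

With these assumed values the ratio falls roughly in the 1000-2000 range, i.e., about three orders of magnitude between the discrete message and its waveform encoding. The symbol rate ignores phoneme probabilities and correlations, so it is an upper bound on the message information rate rather than a measured value.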
Information Rate of Speech
[Figure: Data Rate (Bits Per Second)]

Speech Processing Applications
[Figure: The Speech Stack]
Intelligent Robot? http://www.youtube.com/watch?v=uvcQCJpZJH8