01 Introduction to Digital Speech Processing (數位語音處理概論)
Lin-shan Lee (李琳山)
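Speech Signal Processing
[Figure: the continuous-time signal x(t), plotted against t, passes through a low-pass filter (LPF) and is sampled into the discrete-time signal x[n], plotted against n; Processing Algorithms then produce the output.]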
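The step from x(t) to x[n] can be illustrated with a minimal sketch. The 100 Hz test tone, the 8 kHz sampling rate, and the function names are assumptions chosen for illustration, and the low-pass filter stage is left out for brevity.

```python
import math

def sample(x, fs, duration):
    """Return x[n] = x(n / fs) for all sample instants within `duration` seconds."""
    n_samples = int(round(fs * duration))
    return [x(n / fs) for n in range(n_samples)]

def tone(t):
    """Hypothetical continuous-time signal x(t): a 100 Hz sine wave."""
    return math.sin(2 * math.pi * 100.0 * t)

# Assumed sampling rate of 8000 samples per second over 10 ms.
x_n = sample(tone, fs=8000.0, duration=0.01)
```

Each entry of `x_n` is one sample of the tone, so any later processing algorithm works on a plain list of numbers rather than on the continuous waveform.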
Double Levels of Information
Character (字)
Word (詞)
Example: 人人用電腦 ("Everyone uses computers")
Sentence (句)
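Speech Signal Processing – Processing of Double-Level Information
• Speech Signal
• Sampling
• Processing
– Algorithm
– Chips or Computers
• Linguistic Structure
• Linguistic Knowledge
– Lexicon
– Grammar
[Figure: the characters 今 天 的 天 氣 非 常 好 ("The weather is very nice today") are grouped into 今天的 / 天氣 / 非常 / 好, and 今天的 is further split into 今天 / 的.]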
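How a lexicon can turn a character string into words may be sketched with forward maximum matching. This technique is an assumption made for illustration, not a method named on the slide, and the five-entry lexicon is hypothetical.

```python
# Hypothetical lexicon covering only the example sentence.
LEXICON = {"今天", "的", "天氣", "非常", "好"}

def segment(text, lexicon):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so that the loop advances.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

words = segment("今天的天氣非常好", LEXICON)
# words == ["今天", "的", "天氣", "非常", "好"]
```

Greedy matching is only a baseline: it uses the lexicon but no grammar, so it cannot resolve ambiguous segmentations.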
Well-Known Application Examples of Speech and Language Technologies – Speaking Personal Assistant
• Examples
– Weather in New York next week?
– Who is the president of the US? What did he say today?
– How can I go to National Taiwan University?
– Short messaging, personal scheduling, etc.
• Special Questions:
– Tang poems and Song lyrics (唐詩宋詞), Chu Shi Biao (出師表)…
– Tell a joke (說個笑話)…
[Figure: Language Generation → Speech Synthesis → Output Speech Signals]
• Examples:
– Siri (Apple), Google Now (Google), Cortana (Microsoft)
Voice-based Network Access
[Figure labels: Internet; User-Content Interaction]
• User Interface
– when keyboards/mice are inadequate
• Content Analysis
– help in browsing/retrieval of multimedia content
• User-Content Interaction
– all text-based interaction can be accomplished by spoken language
User Interface – Wireless Communications Technologies have Created a Whole Variety of User Terminals
[Figure: user terminals reach Text Content and Multimedia Content on the Internet at Any Time, from Anywhere]
• Smart phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…
• Small in Size, Light in Weight, Ubiquitous, Invisible…
• Post-PC Era
• Keyboard/Mouse, Most Convenient for PCs, are not Convenient any longer
– human fingers never shrink, and the application environment has changed
• Service Requirements Growing Exponentially
• Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere, and to the Point in One Utterance
• Speech Processing is the Only Less Mature Part in the Technology Chain
Content Analysis – Multimedia Technologies have Created a World of Multimedia Content
[Figure: content and services on the Internet]
• Real-time Information
– weather, traffic
– flight schedule
– stock price
– sports scores
• Knowledge Archives
– digital libraries
– virtual museums
• Special Services
– Google
– Facebook
– YouTube
– Amazon
• Private Services
– personal notebook
– business databases
– home appliances
– network entertainments
• Intelligent Working Environment
– e-mail processors
– intelligent agents
– teleconferencing
– distance learning
– electronic commerce
• The Most Attractive Form of Network Content is Multimedia, which usually Includes Speech Information (but Probably not Text)
• Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse
• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of
the Multimedia Content, thus Becomes the Key for Browsing and Retrieval
• Multimedia Content Analysis based on Speech Information
User-Content Interaction – Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing
[Figure: users exchange text information and voice information with Text Content and Multimedia Content on the Internet through Text-to-Speech Synthesis, Spoken and Multi-modal Dialogue, Voice-based Information Retrieval, Text Information Retrieval, and Multimedia Content Analysis.]
[Figure: Voice Instructions or Text Instructions, e.g. 請問鼎泰豐的地址? ("What is the address of Din Tai Fung?"), are matched against Text Information (d1, d2, d3) and Voice Information (d1, d2, d3), e.g. 鼎泰豐台北101分店在… ("The Din Tai Fung Taipei 101 branch is at…").]
[Figure: Users and the Internet; Sentence Generation and Speech Synthesis produce the Output Speech as the Response to the user.]
1.0 Introduction – A Brief Summary of Core Technologies and Example Application Scenarios
[Figure: Feature Extraction converts the unknown speech signal x(t) into a feature vector sequence X; Pattern Matching compares X with the Reference Patterns Y, obtained by Feature Extraction from the training speech y(t); Decision Making produces the output word W.]
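The chain of feature extraction, pattern matching, and decision making can be sketched as template matching. Dynamic time warping (DTW) is used here as one classical matching method; that choice, the one-dimensional toy features, and the function names are all assumptions, since the figure does not name a specific algorithm.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(features, reference_patterns):
    """Decision making: return the word whose reference pattern is closest."""
    return min(reference_patterns,
               key=lambda word: dtw_distance(features, reference_patterns[word]))

# Hypothetical reference patterns Y (from training speech) and unknown input X.
reference_patterns = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(5.0,), (5.0,), (4.0,)],
}
unknown = [(0.1,), (0.9,), (1.1,), (2.1,)]
word = recognize(unknown, reference_patterns)
```

DTW lets the unknown sequence and the reference differ in length, which matters because the same word is rarely spoken at the same speed twice.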
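Basic Approach for Large Vocabulary Speech Recognition
[Figure: Acoustic Model Training on Speech Corpora yields the Acoustic Models; Language Model Construction on Text Corpora yields the Language Model; the Acoustic Models, the Lexicon, and the Language Model are used together for recognition.]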
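The slide does not spell out a decision rule. A commonly used formulation, given here as an assumption about how the three knowledge sources fit together, chooses the word sequence $W^{*}$ that is most probable given the feature vector sequence $X$:

```latex
W^{*} = \arg\max_{W} P(W \mid X)
      = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
      = \arg\max_{W} P(X \mid W)\, P(W)
```

Under this reading, $P(X \mid W)$ would come from the acoustic models, with the lexicon mapping each word to its acoustic units, and $P(W)$ from the language model. $P(X)$ is dropped because it does not depend on $W$.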
[Figure: text-to-speech synthesis. Input Text → Text Analysis and Letter-to-sound Conversion → Prosody Generation → Signal Processing and Concatenation → Output Speech Signal]
Speech Understanding
• Understanding Speaker’s Intention rather than Transcribing into
Word Strings
• Limited Domains/Finite Tasks
[Figure: input utterance → Syllable Recognition → syllable lattice → Key Phrase Matching → phrase graph / concept graph → Semantic Decoding → understanding results]
• An Example
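utterance: 請幫我查一下 台灣銀行 的 電話號碼 是幾號? ("Please look up the phone number of the Bank of Taiwan for me.")
key phrases: (查一下, "look up") - (台灣銀行, "Bank of Taiwan") - (電話號碼, "phone number")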
concept: (inquiry) - (target) - (phone number)
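The example above can be sketched as key phrase spotting followed by a phrase-to-concept lookup. This is a simplification under stated assumptions: it works on the text of the utterance rather than on a syllable lattice, and the `KEY_PHRASES` table is hypothetical, built only from the three phrases in the example.

```python
# Hypothetical key phrase table: phrase -> concept, taken from the example.
KEY_PHRASES = {
    "查一下": "inquiry",
    "台灣銀行": "target",
    "電話號碼": "phone number",
}

def spot_key_phrases(utterance, key_phrases):
    """Scan left to right and collect (phrase, concept) pairs, longest phrase first."""
    by_length = sorted(key_phrases, key=len, reverse=True)
    found = []
    i = 0
    while i < len(utterance):
        match = next((p for p in by_length if utterance.startswith(p, i)), None)
        if match is None:
            i += 1  # filler character, not part of any key phrase
        else:
            found.append((match, key_phrases[match]))
            i += len(match)
    return found

result = spot_key_phrases("請幫我查一下台灣銀行的電話號碼是幾號?", KEY_PHRASES)
concepts = [concept for _, concept in result]
# concepts == ["inquiry", "target", "phone number"]
```

Filler words such as 請幫我 and 是幾號 are skipped, which reflects the point of the slide: the goal is the speaker's intention, not a full transcription.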
Speaker Verification
• Verifying the speaker as claimed
• Applications requiring verification
• Text dependent/independent
• Integrated with other verification schemes
[Figure: input speech → Feature Extraction → Verification, which consults the Speaker Models → yes/no]
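The yes/no decision can be sketched as a similarity score compared with a threshold. Everything specific here is an assumption made for illustration: each speaker model is reduced to a single feature vector, the score is cosine similarity, and the threshold of 0.9 is arbitrary.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def verify(features, claimed_id, speaker_models, threshold=0.9):
    """Accept the claim only if the input is close enough to the claimed speaker's model."""
    return cosine_similarity(features, speaker_models[claimed_id]) >= threshold

# Hypothetical speaker models and one test utterance's feature vector.
speaker_models = {"alice": [1.0, 0.0, 0.5], "bob": [0.0, 1.0, 0.2]}
test_features = [0.9, 0.1, 0.5]

accepted = verify(test_features, "alice", speaker_models)
rejected = verify(test_features, "bob", speaker_models)
```

Unlike speaker identification, verification compares the input only with the model of the claimed speaker, so the threshold sets the trade-off between false acceptance and false rejection.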
Voice-based Information Retrieval
• Speech Instructions
• Speech Documents (or Multi-media Documents including Speech
Information)
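[Figure: a speech instruction or text instruction, e.g. 請問鼎泰豐的地址? ("What is the address of Din Tai Fung?"), is matched against text documents (d1, d2, d3) and speech documents (d1, d2, d3), e.g. 鼎泰豐台北101分店在… ("The Din Tai Fung Taipei 101 branch is at…").]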
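Matching an instruction against documents can be sketched with character-bigram overlap. The sketch assumes that speech instructions and speech documents have already been transcribed into text, which the figure does not state, and the contents of d1 and d2 are hypothetical.

```python
def char_bigrams(text):
    """Set of overlapping two-character units in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def retrieve(query, documents):
    """Rank document ids by the number of bigrams shared with the query."""
    query_units = char_bigrams(query)
    scores = {doc_id: len(query_units & char_bigrams(text))
              for doc_id, text in documents.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# d3 reuses the fragment shown in the figure; d1 and d2 are made-up examples.
documents = {
    "d1": "今天的天氣非常好",
    "d2": "台灣銀行的電話號碼",
    "d3": "鼎泰豐台北101分店在…",
}
ranking = retrieve("請問鼎泰豐的地址?", documents)
# ranking[0] == "d3"
```

Sub-word units such as character bigrams are one way to tolerate recognition errors and out-of-vocabulary words, although a practical system would weight the units rather than simply count them.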
[Figure: Input Speech → Speech Recognition and Understanding → User’s Intention → Dialogue Server]
Spoken Document Understanding and Organization
• Unlike Written Documents, which are easily shown on the screen for the user to browse and select, Spoken Documents are just Audio Signals
– the user can't listen to each one from beginning to end during browsing
– better approaches for understanding/organization of spoken documents become necessary
• Spoken Document Segmentation
— automatically segmenting a spoken document into short paragraphs, each with
a central topic
• Spoken Document Summarization
— automatically generating a summary (in text or speech form) for each short
paragraph
• Title Generation for Spoken Documents
— automatically generating a title (in text or speech form) for each short paragraph
• Key Term Extraction and Key Term Graph Construction for
Spoken Documents
— automatically extracting a set of key terms for each spoken document, and
constructing key term graphs for a collection of spoken documents
• Semantic Structuring of Spoken Documents
— construction of semantic structure of spoken documents into graphical hierarchies
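Of the tasks listed above, key term extraction is the easiest to sketch. The version below ranks words by TF-IDF, which is a stand-in technique chosen for illustration and not a method named on the slide. It assumes the spoken documents have already been transcribed and split into words, and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def key_terms(documents, top_k=2):
    """Return the top_k words of each document ranked by TF-IDF score."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each word once per document
    result = {}
    for doc_id, words in documents.items():
        term_freq = Counter(words)
        scores = {w: term_freq[w] * math.log(n_docs / doc_freq[w]) for w in term_freq}
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[doc_id] = ranked[:top_k]
    return result

# Hypothetical transcripts of three spoken documents.
transcripts = {
    "d1": ["weather", "today", "weather", "report"],
    "d2": ["stock", "price", "today", "report"],
    "d3": ["stock", "market", "today", "price"],
}
terms = key_terms(transcripts)
```

A word such as "today" that occurs in every document scores zero, while a word concentrated in one document rises to the top. Terms shared between documents could then serve as the links of a key term graph.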
Multi-lingual Functionalities
• Code-Switching Problem
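– English words/phrases inserted in spoken Chinese sentences as an example
人人都用Computers,家家都上Internet ("Everyone uses computers, every household goes on the Internet")
OK不OK?OK啦! ("OK or not? OK!")
– the whole sentence switched from Chinese to English as an example
準備好了嗎?Let’s go! ("Are you ready? Let’s go!")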
• Cross-language Information Processing
– globalized network with multi-lingual content/users
– cross-language network information processing with a certain input language
• Dialects/Accents
– hundreds of Chinese dialects as an example
– code-switching problem: Chinese dialects mixed with Mandarin (or plus English) as an example
– Mandarin with a variety of strong accents as an example
• Global/Local Languages
• Language Dependent/Independent Technologies
• Code-Switching Speech Processing, Speech-to-speech Translation,
Computer-assisted Language Learning
Computer-Assisted Language Learning
• Globalized World
– everyone needs to learn one or more languages in addition to the native language
• Language Learning
– one-to-one tutoring most effective but with high cost
• Computers not as good as Human Tutors, but
– software reproduced easily
– used repeatedly any time, anywhere
– never get tired or bored
• Learning of
– pronunciation, vocabulary, grammar, sentences, dialogues, etc.
– sometimes in form of games