chapter-2
Introduction
Basic Sound Concepts
Representation and formats
Basic Music (MIDI) Concepts
Devices
Messages
Standards and softwares
Speech:
Generation
Analysis and transmission
Basic Sound Concepts
Acoustics
the study of sound: the generation, transmission and reception of
sound waves.
Sound is produced by the vibration of matter.
During vibration, pressure variations are created in the
surrounding air molecules.
The pattern of oscillation creates a waveform;
the wave is made up of pressure differences.
Fig: sound waveform (air pressure vs. time), showing the period and the amplitude.
Basic Sound Concepts
Wavelength is the distance the wave travels in one cycle.
Frequency is the number of periods in a
second (measured in hertz, cycles/second).
Frequency is the reciprocal of the period.
Human hearing frequency range: 20 Hz - 20 kHz; the human voice
occupies only a narrower part of this range.
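These relationships can be sketched numerically. A minimal example, assuming a speed of sound in air of about 343 m/s (a value not given in the slides):

```python
# Period, frequency, and wavelength of a sound wave.
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature (assumed value)

def frequency(period_s):
    """Frequency is the reciprocal of the period."""
    return 1.0 / period_s

def wavelength(freq_hz):
    """Wavelength is the distance the wave travels in one cycle."""
    return SPEED_OF_SOUND / freq_hz

print(frequency(0.001))   # a 1 ms period corresponds to 1000 Hz
print(wavelength(440.0))  # wavelength of the note A4, roughly 0.78 m
```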
Basic Sound Concepts
The amplitude of a sound is the measure of the
displacement of the air pressure wave from its
mean or quiescent state.
It is subjectively heard as loudness and measured in
decibels:
0 dB: essentially no sound heard
35 dB: quiet home
70 dB: noisy street
120 dB: discomfort
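The decibel values above are sound pressure levels, computed as 20·log10 of the ratio of a pressure to a reference pressure. A small sketch; the 20 µPa reference is the conventional threshold of hearing, an assumption not stated in the slides:

```python
import math

P0 = 20e-6  # reference sound pressure in pascals (threshold of hearing)

def spl_db(pressure_pa):
    """Sound pressure level in decibels relative to the hearing threshold."""
    return 20.0 * math.log10(pressure_pa / P0)

print(spl_db(20e-6))  # the reference pressure itself: 0 dB
print(spl_db(0.02))   # a pressure 1000x the reference: 60 dB
```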
Computer Representation of Audio
A transducer converts pressure to voltage levels.
The analog signal is converted into a digital stream by
discrete sampling.
Discretization occurs both in time (sampling) and in amplitude (quantization).
In a computer, we sample these values at intervals
to get a vector of values:
the computer measures the amplitude of the
waveform at regular time intervals to produce a
series of numbers (samples).
Computer Representation of Audio
Sampling Rate:
the rate at which a continuous wave is sampled (measured in
Hertz).
CD standard: 44100 Hz; telephone quality: 8000 Hz.
Question: how often do we need to sample a signal to avoid losing
information?
Answer: to decide on a sampling rate, one must be aware of how much
the signal can change between samples.
Quantization and Sampling
Fig: samples of a waveform quantized to discrete sample heights (0.25, 0.5, 0.75, ...).
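The sampling-and-quantization process can be sketched as follows; the sine input, the 3-bit resolution, and the mapping of amplitudes to levels are illustrative choices, not from the slides:

```python
import math

def sample_and_quantize(freq_hz, sample_rate, n_bits, n_samples):
    """Sample a sine wave at regular time intervals and quantize each
    amplitude to one of 2**n_bits discrete levels."""
    levels = 2 ** n_bits
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                              # time of the n-th sample
        amplitude = math.sin(2 * math.pi * freq_hz * t)  # continuous value in [-1, 1]
        # map [-1, 1] onto integer levels 0 .. levels-1
        q = round((amplitude + 1) / 2 * (levels - 1))
        samples.append(q)
    return samples

# a 1 kHz tone sampled at telephone rate with 3-bit quantization
print(sample_and_quantize(1000, 8000, 3, 8))
```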
Audio Formats
Audio formats are characterized by four parameters
Sample rate: Sampling frequency
Encoding: audio data representation
-law encoding corresponds to CCITT G.711 - standard for
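The µ-law companding curve behind G.711 can be sketched as follows. This is the continuous formula only; the µ = 255 value is the North American/Japanese convention, and real G.711 additionally maps the result to 8-bit codewords:

```python
import math

MU = 255  # µ-law parameter used by G.711 in North America and Japan

def mu_law_encode(x):
    """Compress a sample x in [-1, 1] with the µ-law companding curve.
    Small amplitudes are expanded, large ones compressed, which matches
    the ear's roughly logarithmic loudness perception."""
    sign = -1.0 if x < 0 else 1.0
    return sign * math.log(1 + MU * abs(x)) / math.log(1 + MU)

print(round(mu_law_encode(0.5), 3))  # a mid-range sample is pushed toward 1
```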
MIDI
MIDI (Musical Instrument Digital Interface) is a standard interface
between electronic musical instruments.
A MIDI port is built into an instrument so that a MIDI cable can
connect it to other MIDI devices.
MIDI Contd…
Data Format
Encodes the information passed through the hardware.
The data format does not include an encoding of individual sound
waves; it encodes musical instructions instead.
MIDI Devices
Any musical instrument that satisfies both
components of the MIDI specification.
MIDI hardware includes:
Sound generator: to produce the audio signal.
Microprocessor: for processing the produced sound.
Keyboard: to have direct control over the synthesizer.
Control panel: for controlling functions that are not directly
concerned with notes and duration, e.g. menu selection and volume.
Auxiliary controllers: to give more control over the notes
played on the keyboard; very common are pitch bend and
modulation.
Memory: for storing data.
Sequencer: a computer application that can store MIDI data.
Synthesizer: looks like a simple piano keyboard with a panel
full of buttons.
MIDI Modes
There are two categories of MIDI modes, OMNI and
POLY, which can be combined four different ways:
Mode 1 -- Omni On / Poly
Mode 2 -- Omni On / Mono
Mode 3 -- Omni Off / Poly
Mode 4 -- Omni Off / Mono
The Omni (meaning "all") modes determine whether a
synthesizer will respond to incoming data on an
individual MIDI channel or to data on any channel.
In Omni On mode, a receiving instrument will play all
incoming MIDI information, regardless of the MIDI
channel.
In Omni Off mode, an instrument responds only to
information on the single channel to which it is set,
which is called an instrument's basic channel.
MIDI Messages
MIDI messages transmit information between
MIDI devices and determine what kind of musical
events can be passed between different devices
MIDI messages consist of a status byte and data
bytes:
the status byte describes the kind of message;
the data bytes describe the message itself.
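The status-byte/data-byte split can be illustrated with a small parser; the list-of-bytes input and the function name here are invented for illustration:

```python
def parse_channel_message(msg):
    """Decode a 3-byte MIDI channel voice message (status byte + 2 data
    bytes). The status byte's high nibble is the message type and its
    low nibble is the MIDI channel."""
    status, data1, data2 = msg
    kind = status & 0xF0      # message type, e.g. 0x90 = Note On
    channel = status & 0x0F   # MIDI channel 0-15
    if kind == 0x90:
        return ("note_on", channel, data1, data2)   # note number, velocity
    if kind == 0x80:
        return ("note_off", channel, data1, data2)
    return ("other", channel, data1, data2)

# Note On, channel 0, middle C (note 60), velocity 100
print(parse_channel_message([0x90, 60, 100]))
```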
MIDI Messages Contd…
There are two types of MIDI messages
Channel Message
Goes only to specified devices
Channel Voice Messages
Send the actual performance data between MIDI devices.
Channel Mode Messages
Set the channel reception mode, stop the playing of spurious
notes, and carry other channel-level messages.
System Common Message
Commands that prepare sequencers and synthesizers to
play music
These messages are system generic
MIDI Standards
MIDI reproduces traditional note length using a
MIDI clock. Using the MIDI clock, a receiver can
synchronize with the clock cycle of the sender.
As an alternative, the SMPTE timing standard
(Society of Motion Picture and Television
Engineers) can be used: a set of cooperating
standards to label individual frames of video or
film with a defined time code.
It was originally developed for NASA and is
very precise.
MIDI Software
There are 4 major categories:
Music recording and performance applications
Music notation and printing applications
Synthesizer patch editors and librarians
Music education applications
Speech
Any sound that can be generated, perceived
and understood naturally by humans and
artificially by machines.
Speech bears the following properties:
During certain intervals of time, speech signals are
periodic in nature.
The spectrum of speech signals shows characteristic
maxima, which usually lie in 3-5 frequency bands (the formants).
Speech Processing
Involves the following processes.
Speech Generation
Helmholtz built a mechanical vocal tract, coupling together several
mechanical resonators to generate sound.
Dudley produced the first speech synthesizer by imitating
mechanical vibration using electrical oscillation.
Speech generation has the following basic requirements:
Generation of a real-time speech signal
Generation of a natural and understandable speech signal
Speech Processing
Vowels
Created by the free passage of air through the larynx and oral cavity:
a, e, i, o, u
Consonants
Created by partial or complete obstruction of air through the larynx
and oral cavity:
b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z
Methodology
Time-dependent concatenation
Frequency-dependent concatenation
Fig:-Speech recognition and synthesis front ends
Time-dependent Concatenation
Individual speech units are composed like building blocks,
where the composition can occur at different levels
In the simplest case, the individual phones are understood as
speech units
It is possible with just a few phones to create an unlimited
vocabulary.
However, transitions between individual phones prove to be
extremely problematic.
Therefore, at a second level, phones are considered together
with their surrounding context.
To make the transition problem easier, syllables can be used as
speech units; the speech is then generated from the set of syllables.
The best pronunciation is achieved through storage of the
whole word.
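The building-block idea above can be sketched as a toy concatenation; the unit names and waveform fragments below are invented placeholders, not real speech data:

```python
# Toy sketch of time-dependent concatenation: each stored speech unit
# maps to a waveform fragment, and a word is generated by concatenating
# the fragments for its units. The fragment data is purely illustrative.
PHONES = {"k": [0.1, 0.2], "r": [0.3], "^": [0.4, 0.5], "m": [0.6]}

def synthesize(units):
    """Concatenate stored fragments for the given sequence of speech units."""
    out = []
    for u in units:
        out.extend(PHONES[u])
    return out

print(synthesize(["k", "r", "^", "m"]))  # the word "crumb" from four phones
```

The same structure works at higher levels: replacing the phone table with a syllable or whole-word table trades storage for smoother transitions, which is exactly the progression the slide describes.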
Speech Generation (Time dependent)
Fig: phone sound concatenation of the word "crumb" from the phones k, r, ^, m.
Frequency-dependent Concatenation
Speech generation can also be based on frequency-dependent
sound concatenation, e.g. formant synthesis.
Formants are frequency maxima in the spectrum of the
speech signal.
Formant synthesis simulates the vocal tract through a filter;
the characteristic values are the filter's middle frequencies
and their bandwidths.
A pulse signal with a certain frequency is chosen as the
simulation of voiced sounds.
Unvoiced sounds, on the other hand, are created through a
noise generator.
Newer sound-specific methods provide sound concatenation
with combined time and frequency dependencies.
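A filter of the kind described, characterized by a middle frequency and a bandwidth, can be sketched as a standard two-pole resonator driven by a pulse train for voiced sound. The specific frequencies and the normalization below are illustrative assumptions:

```python
import math

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: a simple recursive filter whose middle
    frequency and bandwidth model one formant. Coefficients follow the
    usual two-pole resonator design."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2 * math.pi * center_hz / sample_rate         # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    gain = 1 - r  # rough amplitude normalization (assumed, not exact)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# voiced excitation: a 100 Hz pulse train; unvoiced sounds would use a
# noise source instead
sr = 8000
pulses = [1.0 if n % (sr // 100) == 0 else 0.0 for n in range(400)]
formant = resonator(pulses, center_hz=700, bandwidth_hz=130, sample_rate=sr)
print(len(formant))
```

A full formant synthesizer would run several such resonators (one per formant) in series or parallel; this sketch shows only one.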
Speech Analysis
Human speech has certain characteristics
determined by a speaker.
Speech analysis can then serve to analyze who is
speaking.
The computer identifies and verifies the speaker
using an acoustic fingerprint.
An acoustic fingerprint is a digitally stored
speech sample of a person.
Speech Analysis
Another main task of speech analysis is to analyze
what has been said.
Based on speech sequence the corresponding text is
generated.
This can lead to a speech controlled typewriter, a
translation system or part of a workplace for the
handicapped.
Another area of speech analysis researches speech
patterns with respect to how a certain
statement was said.
E.g. a spoken sentence sounds different if a person
is angry or calm.
An application of this research could be a lie
detector.
Speech Analysis
Speech analysis is of strong interest for multimedia
systems.
Together with speech synthesis, different media
transformations can be implemented.
The primary goal of speech analysis is to determine
individual words correctly, but this cannot be done with
probability 1.
A word is recognized only with a certain probability.
Here environmental noise, room acoustics and
speaker’s physical and psychological conditions play
an important role.
Speech Recognition
Fig: speech recognition pipeline: acoustic and phonetic analysis (using sound patterns and word models), syntactical analysis (using syntax), and semantic analysis (using semantics) lead from speech to recognized and understood speech.
Speech Recognition
In the first step, the recognition principle is applied to a sound
pattern and/or word model.
An acoustical and phonetical analysis is performed.
In the second step, certain speech units go through syntactical
analysis: thereby, the errors of the previous step can be
recognized.
Very often during the first step, no unambiguous decisions can be
made.
In this case, syntactical analysis provides additional decision help
and result is a recognized speech.
The third step deals with the semantics of the previously
recognized language.
Here, the decision errors of the previous steps can be
recognized and corrected with other analysis methods.
Even today, this step is non-trivial to implement with current
methods from artificial intelligence and neural network research.
The result of this step is understood speech.
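The three steps can be caricatured in code; the candidate words, probabilities, and grammar below are entirely invented for illustration:

```python
# Toy sketch of the three recognition steps described above.
def acoustic_analysis(sound_pattern):
    """Step 1: match against word models; the result is often ambiguous,
    so several candidates with probabilities are returned."""
    return {"hello": 0.4, "yellow": 0.35, "mellow": 0.25}

def syntactic_analysis(candidates, grammar):
    """Step 2: keep only candidates the grammar allows, which can resolve
    ambiguity left over from step 1."""
    return {w: p for w, p in candidates.items() if w in grammar}

def semantic_analysis(candidates):
    """Step 3: pick the candidate that makes sense; here, simply the most
    probable remaining one."""
    return max(candidates, key=candidates.get)

grammar = {"hello", "goodbye"}          # invented grammar
cands = acoustic_analysis(None)          # step 1
print(semantic_analysis(syntactic_analysis(cands, grammar)))
```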
Speech Transmission
Fig: speech transmission chain: analog speech signal -> A/D converter -> speech analysis -> coded speech -> reconstruction (synthesis) -> D/A converter -> analog speech signal.
Speech Transmission
Source Coding
Parameterized systems work with source coding algorithms:
the specific characteristics of speech are used for data rate
reduction.
Recognition/Synthesis Methods
There have been attempts to reduce the transmission rate
using pure recognition/synthesis methods:
speech analysis (recognition) is performed on the sender side of a
speech transmission system and speech synthesis
(generation) on the receiver side.
Achieved Quality
The question is how to achieve the minimal data rate for a given
transmission quality.
One can assume that for telephone quality, a data rate of 8
kbit/s is sufficient.
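The rates above imply a compression factor that is easy to check; the 8000 samples/s x 8 bits breakdown for uncompressed telephone PCM is standard G.711 knowledge, not stated in the slides:

```python
# Telephone-quality PCM vs. source-coded speech.
pcm_rate = 8000 * 8   # 8000 samples/s x 8 bits/sample = 64000 bit/s PCM
coded_rate = 8000     # source-coded speech, ~8 kbit/s per the slides

print(pcm_rate // 1000, "kbit/s PCM vs", coded_rate // 1000, "kbit/s coded")
print(pcm_rate // coded_rate)  # compression factor
```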