
Chapter 2

Sound/Audio Systems

▪ Concepts of sound systems
▪ Music and Speech
▪ Speech Generation
▪ Speech Analysis
▪ Speech Transmission
Nature of Sound
• Sound is a physical phenomenon produced by the vibration of matter, such as a violin string or a block of wood.
• Perception of sound by human beings is a very complex process. It
involves three systems:
▪ The source which emits sound
▪ The medium through which the sound propagates
▪ The detector which receives and interprets the sound
Nature of Sound
• As the matter vibrates, pressure variations are created in the air surrounding it. This alternation of high and low pressure propagates through the air in a wave-like motion. When a wave reaches the human ear, a sound is heard.
Sound vs Audio
• The key difference between sound and audio is their form of energy.
• Sound is mechanical wave energy (longitudinal sound waves) that propagates through a medium, causing pressure variations within the medium.
• Audio is electrical energy (analog or digital signals) that represents sound electrically.
• Example: When a bell rings, it sets the air around it vibrating. These
vibrations travel as sound waves, reaching our ears and allowing us to hear
the ringing sound.
• Example: When you record a song using a microphone, the sound waves
created by the vocalist's voice are converted into electrical signals. These
signals are then processed and stored digitally, allowing you to play back
the song exactly as it was recorded.
Sound vs Audio
Sound:
1. Can travel through various media, including air, water, solids, etc.
2. Natural and environmental occurrences.
3. Not recorded by default; exists in the environment.
4. Travels through the medium in real time.
5. Limited manipulation; changes with the source and environment.
6. Examples: birds chirping, a door creaking, thunder rumbling.

Audio:
1. Typically refers to sound that has been electronically recorded, processed, or transmitted.
2. Can be natural or artificially generated.
3. Recorded using microphones and stored as digital files.
4. Can be transmitted over the internet or radio waves.
5. Can be edited, mixed, enhanced, or modified digitally.
6. Examples: music tracks, podcast episodes, recorded speeches.
Basic Sound Concepts
• Sound waves can be characterized by the following attributes:
• Period
• Frequency
• Pitch
• Amplitude and loudness
• Dynamic range and bandwidth
Basic Sound Concepts
• Period:
→ It is the interval at which a periodic signal repeats regularly.
• Pitch:
→It is the perception of sound by human beings. It measures how high the sound is as perceived by a listener.
• Time period:
→The time taken by the wave for one complete oscillation is called
time period.
→SI unit is second.
Basic Sound Concepts
• Amplitude:
→It is the maximum displacement of a vibrating object from its central position.
Explanation:
→We know that sound is produced by the vibration of particles (to-and-fro motion).
→In to-and-fro motion, an object moves away from its central position.
→The more it moves from the central position, the greater the amplitude.
→The less it moves from the central position, the smaller the amplitude.
Basic Sound Concepts
• Loudness:
→Greater amplitude means greater loudness of sound.
→Loudness is directly proportional to the square of the amplitude.
→If the amplitude doubles, the loudness becomes four times as great.
→The roar of a lion can be heard over long distances because it has a high amplitude.
Basic Sound Concepts
• Typical sound levels generated by various sources (figure omitted).
Basic Sound Concepts
• Frequency:
→The number of oscillations made in one second is the frequency.
→Measured in hertz (Hz).
→The more times the source vibrates in one second, the higher the frequency.
→For example, if a sound source makes 50 vibrations per second, its frequency is 50 Hz.
Basic Sound Concepts
• Frequency determines the shrillness or pitch of a sound.
• The higher the frequency, the higher the pitch of the sound.

Example 1:
• A baby's voice has a high pitch.
• A grown man's voice has a low pitch.

Example 2:
• The chirping of a bird has a high pitch.
• The roar of a lion has a low pitch.
Basic Sound Concepts
• Dynamic range and bandwidth:
→Dynamic range is the difference between the loudest and softest sound levels.
→For example, a large orchestra can reach 130 dB at its climax and drop to as low as 30 dB at its softest, giving a dynamic range of 100 dB.
→Bandwidth is the range of frequencies a device can produce or a human can hear.
Fun facts about Sound
1. Sound cannot travel through space since there are no molecules to
travel through. Here on earth, we have air molecules that vibrate in and
around our ears.
2. Do you know what is louder than a car horn? The cry of a human
baby, which is about 115 decibels.
3. The loudest natural sound on earth is caused by an erupting volcano.
4. Dogs are capable of hearing sounds at a much higher frequency than
humans can. They can hear sounds or noises humans cannot.
5. Flies cannot hear any kind of sound, not even their own buzzing.
Fun facts about Sound
6. Since particles are closer together in water than in air, sound travels about four times faster in water than in air.
7. Sound travels through air at a speed of around 767 miles per hour.
8. The majority of cows that listen to music end up producing more
milk than those who do not.
9. Horror films like to use infrasound, which is below the range of
human hearing. It creates shivering, anxiety, and even heart
palpitations in humans when it is being played.
Computer Representation of Sound
• Sound waves are continuous while computers are good at handling
discrete numbers. In order to store a sound wave in a computer,
samples of the wave are taken. Each sample is represented by a
number, the ‘code’. This process is known as digitization.
• Digitization is the process of converting an analog signal into a digital signal. There are three steps in the digitization of sound. These are:
a. Sampling
b. Quantization
c. Sound Hardware
Sampling

• The sampling rate is the number of samples of the analog sound taken per second.
• A higher sampling rate implies that more samples are taken during the given time interval and, ultimately, the quality of reconstruction is better.
• CDs, for example, contain audio that was sampled at 44.1 kHz, which means the original analogue recording was sampled 44,100 times for every second of music.
Sampling
• Sound is made up of waves of different frequencies. The human ear
can hear frequencies up to about 20,000 hertz (Hz).
• The Nyquist sampling theorem says that, in order to accurately record a sound, we need to sample it at a rate of at least twice the highest frequency we want to capture. So, to record sound up to 20,000 Hz, we need to sample it at least 40,000 times per second.
• This is why one of the most popular sampling rates for high quality
sound is 44,100 samples per second.
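• As a small illustration (not part of the original slides), the Python sketch below samples a 1 kHz sine wave at the CD rate of 44.1 kHz and checks the Nyquist condition; the function name is our own, not from any audio library.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Take discrete samples of a continuous sine wave (a toy model of an ADC)."""
    if sample_rate_hz < 2 * freq_hz:
        # Nyquist condition violated: the sampled signal would alias.
        raise ValueError("sample rate must be at least twice the signal frequency")
    n_samples = int(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

# A 1 kHz tone sampled at 44.1 kHz for 10 ms gives 441 samples.
samples = sample_sine(1000, 44100, 0.010)
print(len(samples))  # 441
```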
Quantization
• Quantization is the process of representing the amplitude of each sample as an integer.
• The number of bits used to represent each sample is called the
sample size or bit depth. A higher sample size means that more detail
can be captured in the recording.
• Commonly used sample sizes are 8 bits and 16 bits.
Quantization
• An 8-bit sample size provides 256 different levels of amplitude, while
a 16-bit sample size provides 65,536 different levels of amplitude.
• The value of each sample is rounded off to the nearest integer.
• If the amplitude of the signal is greater than the maximum value that
can be represented by the sample size, then clipping occurs.
• Clipping is when the top or bottom of the waveform is cut off, which
can cause distortion.
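• A minimal sketch of quantization and clipping, assuming samples are normalized floating-point values in the range -1.0 to 1.0 (an assumption of this example, not stated in the slides):

```python
def quantize(samples, bit_depth=16):
    """Map float samples in [-1.0, 1.0] to signed integers, clipping values that overflow."""
    max_level = 2 ** (bit_depth - 1) - 1     # e.g. 32767 for 16-bit, 127 for 8-bit
    min_level = -(2 ** (bit_depth - 1))      # e.g. -32768 for 16-bit, -128 for 8-bit
    quantized = []
    for s in samples:
        value = round(s * max_level)                    # round to the nearest level
        value = max(min_level, min(max_level, value))   # clipping
        quantized.append(value)
    return quantized

# 1.2 is outside the representable range, so it is clipped to the maximum level.
print(quantize([0.0, 0.5, 1.0, 1.2], bit_depth=8))  # [0, 64, 127, 127]
```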
Sound hardware
• Before sound can be processed, a computer needs input/output devices.
• Microphone jacks and built-in speakers are connected to an ADC and a DAC, respectively, for audio input and output.
Quality versus File Size
• The size of a digital recording depends on the sampling rate,
resolution and number of channels.
• S = R * (b/8) * C * D
• S → file size (bytes)
• R → sampling rate (samples per second)
• b → resolution (bits per sample)
• C → number of channels (1 = mono, 2 = stereo)
• D → recording duration (seconds)
• A higher sampling rate and higher resolution give higher quality but a bigger file size.
• For example, if we record 10 seconds of stereo music at a sampling rate of 44.1 kHz with 16-bit resolution, the size will be:
• S = 44100 * (16/8) * 2 * 10 = 1,764,000 bytes = 1722.7 KB = 1.68 MB
• High-quality sound files are very big; however, the file size can be reduced by compression.
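• As a quick check of the formula above, a minimal Python sketch (the function name is ours, not from any library):

```python
def audio_file_size_bytes(sample_rate, bits_per_sample, channels, duration_s):
    """S = R * (b/8) * C * D, size of uncompressed PCM audio in bytes."""
    return sample_rate * (bits_per_sample / 8) * channels * duration_s

size = audio_file_size_bytes(44100, 16, 2, 10)   # 10 s of 16-bit stereo at 44.1 kHz
print(size)                  # 1764000.0 bytes
print(size / 1024)           # about 1722.7 KB
print(size / (1024 * 1024))  # about 1.68 MB
```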
File size for common sampling rates and resolutions (table omitted)
Audio file format
• The most commonly used digital sound format on Windows systems is the .wav file.
• Sound is stored in .wav files as digital samples known as Pulse Code Modulation (PCM).
• Each .wav file has a header containing information about the file:
❑type of format, e.g., PCM or other modulations
❑size of the data
❑number of channels
❑samples per second
❑bytes per sample
• There is usually no compression in .wav files.
• Other formats may use different compression techniques to reduce file size:
• .vox uses Adaptive Delta Pulse Code Modulation (ADPCM).
• .mp3 uses MPEG-1 Layer 3 audio compression.
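• As an illustration of the header fields listed above, Python's standard wave module can read them from an uncompressed PCM .wav file; "example.wav" is a placeholder filename.

```python
import wave

# "example.wav" is a placeholder; any uncompressed PCM .wav file will do.
with wave.open("example.wav", "rb") as wav_file:
    print("channels:          ", wav_file.getnchannels())
    print("samples per second:", wav_file.getframerate())
    print("bytes per sample:  ", wav_file.getsampwidth())
    print("number of frames:  ", wav_file.getnframes())
```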
Types of Digital Audio file formats
1. WAV (Waveform Audio File Format): A standard audio format used
primarily on Windows systems. WAV files can contain
uncompressed audio data and are known for their high audio
quality. They are often used for professional audio recording and
editing.
2. MP3 (MPEG Audio Layer III): One of the most popular and widely
used audio formats. MP3 files use lossy compression to reduce file
size while maintaining reasonable audio quality. They are commonly
used for music distribution and playback.
Types of Digital Audio file formats
3. AIFF (Audio Interchange File Format): Similar to WAV, AIFF is a high-
quality audio format commonly used on Apple systems. It supports
uncompressed audio data and is often used for professional audio
applications.
4. AAC (Advanced Audio Coding): Another widely used audio format
that offers better sound quality at lower bit rates compared to MP3.
AAC is commonly used for music streaming and is the default format
for Apple's iTunes and iOS devices.
Types of Digital Audio file formats
5. FLAC (Free Lossless Audio Codec): A lossless compression format
that retains the original audio quality while reducing file size. FLAC files
are often used by audiophiles and for archiving high-quality audio.
6. Opus: A relatively newer audio format designed for efficient
compression and high audio quality at low bit rates. Opus is suitable for
both voice and music and is often used in real-time communication and
streaming applications.
Audio Hardware
• Recording and digitizing sound:
❑An analog-to-digital converter (ADC) converts the analog sound signal into digital samples.
❑A digital signal processor (DSP) processes the samples, e.g., filtering, modulation, compression, and so on.
• Playing back sound:
❑A digital signal processor processes the samples, e.g., decompression, demodulation, and so on.
❑A digital-to-analog converter (DAC) converts the digital samples back into an analog sound signal.
• All these hardware devices are integrated into a few chips on a sound
card.
Audio Hardware
• Different sound cards have different capabilities for processing digital sounds.
• When buying a sound card, you should look at:
❑maximum sampling rate
❑stereo or mono
❑duplex or simplex.
Audio Software
• Windows device driver: controls the hardware device.
• Device manager: the user interface to the hardware for configuring
the devices.
❑You can choose which audio device you want to use.
❑You can set the audio volume.
Audio Software
• Mixer: its functions are:
❑To combine sound from different sources.
❑To adjust the playback volume of sound sources.
❑To adjust the recording volume of sound sources.
• Recording: Windows has a simple Sound Recorder.
• Editing: The Windows Sound Recorder has limited editing functions, such as changing the volume and speed or deleting part of the sound.
• There are many freeware and shareware programs for sound
recording, editing and processing.
Computer Music
• Sounds, whether they come from nature or are created by people,
can be complicated because they contain many different pitches.
• It's relatively easy to record complicated sounds using digital
technology. But making these kinds of sounds from scratch, known as
synthesis, is harder.
• A more efficient way to create music on a computer is MIDI (Musical Instrument Digital Interface).
Computer MIDI
• Musical Instrument Digital Interface.
• It is a communication standard developed in the early 1980s for
electronic instruments and computers.
• It specifies the hardware connection between equipment as well as
the format in which the data are transferred between the equipment.
• Common MIDI devices include electronic music synthesizers and sound modules.
General MIDI
• General MIDI is a standard specified by the MIDI Manufacturers Association.
• For a device to comply with General MIDI, it needs to follow some rules:
❑It should be able to play at least 24 notes at the same time (24-voice polyphony).
❑It can play sounds on 16 different channels.
❑It must be able to play 16 different instrument sounds at the same time.
❑It should provide at least 128 preset instrument sounds.
❑It has to respond to certain standard controls.
MIDI Hardware

• An electronic musical instrument or a computer with a MIDI interface should have one or more MIDI ports.
• The MIDI ports on musical instruments are usually labeled with:
❑IN — for receiving MIDI data;
❑OUT — for outputting MIDI data that are generated by the instrument;
❑THRU — for passing MIDI data coming from IN to the next instrument.
• MIDI devices can be daisy-chained together.
MIDI Daisy-Chain Network (diagram omitted)
Multi-port MIDI Interface with 8 IN/OUT pairs (diagram omitted)
MIDI Software
• MIDI players for playing MIDI music. These include:
▪ Windows Media Player, which can play MIDI files.
▪ Players that come with the sound card, e.g., the Creative MIDI player.
▪ Freeware and shareware players and plugins, e.g., Midigate.
• MIDI sequencers for recording, editing and playing MIDI:
▪ Cakewalk Express, Home Studio, Professional
▪ Cubasis
▪ Encore
• Configuration: like audio devices, MIDI devices require a driver. Select and configure MIDI devices from the Control Panel.
MIDI Messages
• MIDI messages are used by MIDI devices to communicate with each other
and to determine what kinds of musical events can be passed from device
to device.
Structure of MIDI messages:
• A MIDI message consists of a status byte and up to two data bytes.
• Status byte:
▪ The most significant bit of the status byte is set to 1.
▪ The 4 low-order bits identify the channel the message belongs to (four bits give 16 possible channels).
▪ The remaining 3 bits identify the message type.
• Data byte: the most significant bit of a data byte is set to 0.
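• A small Python sketch of how the status byte packs the message type and channel, using the standard Note On message as an example; the helper function names are our own.

```python
def make_note_on(channel, note, velocity):
    """Build a 3-byte MIDI Note On message: one status byte plus two data bytes."""
    status = 0x90 | (channel & 0x0F)   # MSB = 1, type bits = Note On, low 4 bits = channel
    return bytes([status, note & 0x7F, velocity & 0x7F])   # data bytes keep their MSB = 0

def parse_status(status):
    """Split a status byte into its message-type bits and channel bits."""
    message_type = (status >> 4) & 0x07   # the 3 bits that identify the message
    channel = status & 0x0F               # the 4 low-order bits: channel 0-15
    return message_type, channel

msg = make_note_on(channel=0, note=60, velocity=100)   # middle C on channel 0
print(msg.hex())             # '903c64'
print(parse_status(msg[0]))  # (1, 0): Note On has type bits 001 once the MSB is stripped
```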
Classification of MIDI Messages
• Channel messages: since a channel number is specified, these messages are addressed to specific devices. There are two types of channel messages:
• Channel voice messages: These carry real performance info between
MIDI devices, like key and controller actions. Examples are note on,
note off, and control changes.
• Channel mode messages: these determine how a device responds to the voice messages. Examples include Local Control and All Notes Off.
Classification of MIDI Messages
• System Message: System messages go to all devices in a MIDI system
because no channel numbers are specified. There are three types of system
messages:
• System real-time messages: short, one-byte messages used to keep MIDI devices playing in sync. They can be sent in between other messages, e.g., System Reset and Timing Clock (the MIDI clock).
• System common messages: commands that prepare sequencers and synthesizers to play music. Examples include Song Select and Tune Request.
• System exclusive messages: manufacturer-specific messages used for communication between a manufacturer's own MIDI devices.
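• The classification above can be read straight off the status byte: values 0x80 to 0xEF carry a channel number, while 0xF0 to 0xFF are system messages with no channel. A simplified Python sketch:

```python
def classify_midi_status(status):
    """Classify a MIDI status byte as a channel or system message (simplified view)."""
    if status < 0x80:
        return "data byte (MSB is 0), not a status byte"
    if status < 0xF0:
        return f"channel message on channel {status & 0x0F}"
    if status >= 0xF8:
        return "system real-time message"   # e.g. 0xF8 Timing Clock, 0xFF System Reset
    if status == 0xF0:
        return "system exclusive message"
    return "system common message"          # e.g. 0xF3 Song Select

print(classify_midi_status(0x92))  # channel message on channel 2
print(classify_midi_status(0xF8))  # system real-time message (Timing Clock)
```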
Speech Generation
• Speech can be perceived, understood and generated by humans and by
machines.
• Generated speech must be understandable and must sound natural: being understandable is the basic requirement, and the more natural the speech sounds, the better listeners accept it.
• Speech signals have two properties which can be used in speech
processing:
▪ Voiced speech signals are nearly periodic over short time frames of about 30 milliseconds, which makes them easier to analyze.
▪ When we analyze voiced speech, we notice strong peaks in its frequency spectrum. These peaks are known as formants. They occur because the vocal tract (the space inside the mouth and throat) resonates at and enhances certain frequencies.
Basic Notations
• The lowest periodic spectral component of the speech signal is called the
fundamental frequency. It is present in a voiced sound.
• A phone is the smallest speech unit, such as the m of mat and the b of bat in English, that distinguishes one utterance or word from another in a given language.
• Allophones mark the variants of a phone. For example, the aspirated p of pit and
the unaspirated p of spit are allophones of the English phoneme p.
• The morph marks the smallest speech unit which carries a meaning itself.
Therefore, consider is a morph, but reconsideration is not.
• A voiced sound is generated through the vocal cords. m, v and l are examples of
voiced sounds. The pronunciation of a voiced sound depends strongly on each
speaker.
• During the generation of an unvoiced sound, the vocal cords are open. f and s are examples of unvoiced sounds. Unvoiced sounds are relatively independent of the speaker.
Reproduced speech output
• The easiest method of speech generation/output is to use pre-
recorded speech and play it back in a timely fashion. The speech can
be stored as PCM (Pulse Code Modulation) samples.
• Further data compression methods, which do not exploit language-typical properties, can be applied to recorded speech.
• There are two ways of performing speech generation/output: time-dependent sound concatenation and frequency-dependent sound concatenation.
Time-dependent Sound Concatenation
• Speech is made up of individual units called phones. These units are
like building blocks that can be combined to create different sounds.
• When phones are combined, they can sometimes change shape to fit
together better. This is called coarticulation.
• Syllables are made up of phones and are the basic units of sound in a
language. Words and sentences are made up of syllables.
• Prosody is the rhythm, stress, and melody of speech. It can be used to
convey meaning and emotion.
• Prosody is often context dependent, meaning that it can change
depending on the surrounding words and phrases.
Frequency‐dependent Sound Concatenation
• Speech can also be created by combining sounds in the frequency domain, i.e., from characteristic frequency patterns rather than from recorded time segments.
• Formants are the peaks in the frequency spectrum of speech. Speech synthesis using formants simulates the shape of the vocal tract with a filter.
• Different speech parts (like individual sounds) are made by adjusting the
values of formants. This faces challenges when it comes to smoothly
transitioning between sounds, known as co-articulation, and getting the
right rhythm and pitch.
• Human-like speech can be produced using a type of filter called a multi-
pole lattice filter. This filter works well for mimicking the first four or five
formants found in human speech.
• Using speech synthesis, an existing text can be transformed into an acoustic signal. A speech synthesis system with time-dependent concatenation typically has the following components:

Step 1: Generation of a sound script. The text is transcribed into a sound script using a library of (language-specific) letter-to-phone rules. A dictionary of exceptions is used for words with a non-standard pronunciation.
Step 2: Generation of speech. The sound script is used to drive the time- or frequency-dependent sound concatenation process.
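• As an illustration of these two steps, here is a toy Python sketch; the exception dictionary and letter-to-phone rules below are invented for illustration only, and a real system would use a full pronunciation lexicon and a language-specific rule set.

```python
# Step 1: text -> sound script (letter-to-phone transcription).
# Both tables are tiny, made-up examples, not a real rule set.
EXCEPTIONS = {"one": ["W", "AH", "N"]}          # words with non-standard pronunciation
LETTER_TO_PHONE = {"c": ["K"], "a": ["AE"], "t": ["T"]}

def text_to_sound_script(text):
    """Transcribe text into a sequence of phones."""
    script = []
    for word in text.lower().split():
        if word in EXCEPTIONS:                  # check the dictionary of exceptions first
            script.extend(EXCEPTIONS[word])
        else:                                   # fall back to letter-to-phone rules
            for letter in word:
                script.extend(LETTER_TO_PHONE.get(letter, []))
    return script

# Step 2: the sound script drives the (time-dependent) concatenation.
def concatenate(sound_script, phone_recordings):
    """Join pre-recorded phone samples in script order (stand-in for real concatenation)."""
    return b"".join(phone_recordings.get(phone, b"") for phone in sound_script)

print(text_to_sound_script("one cat"))  # ['W', 'AH', 'N', 'K', 'AE', 'T']
```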
Problem of speech synthesis
• Ambiguous pronunciation. In many languages, the pronunciation of
certain words depends on the context.
• Example: ‘lead’
• This is not so much of a problem for the German language
• It is a problem for the English language
• Anecdote by G. B. Shaw:
▪ if we pronounce “gh” as “f” (example: “laugh”),
▪ if we pronounce “o” as “i” (example: “women”),
▪ and if we pronounce “ti” as “sh” (example: “nation”), then why don’t we write “ghoti” instead of “fish”?
Speech Analysis
• Purpose of Speech Analysis:
▪ Who is speaking: speaker identification for security purposes
▪ What is being said: automatic transcription of speech into text
▪ How was a statement said: understanding psychological factors of a speech
pattern (was the speaker angry or calm, is he lying, etc).
• The primary goal of speech analysis in multimedia systems is to
correctly determine individual words (speech recognition).
Speech Recognition System
• Speech analysis is of strong interest for multimedia systems.
• By analyzing speech and generating speech, many kinds of media transformations become possible.
• The main aim of speech analysis is to recognize individual words correctly, but recognition is never 100% certain.
• Factors such as background noise, room acoustics, and the speaker's physical and emotional state can affect the result.
Speech Recognition System

• The speech recognition system applies the following steps, possibly several times:
• First, it analyzes the sound pattern and the way words are pronounced (acoustical and phonetic analysis).
• Then, it examines how the recognized units fit together in sentences (syntactic analysis). This helps to find errors made in the first step, although clear decisions are not always possible at this stage.
• Next, it considers the meaning of the recognized language (semantic analysis) and corrects remaining errors. This step is still very difficult with current methods from artificial intelligence and neural network research.
• In short, the system listens, determines how the sounds were produced, checks whether the words fit together correctly, and finally interprets what is being said.
Speech Recognition System
• There are still many challenges in speech recognition research:
▪ One issue is caused by the noise in a room. Sounds can bounce off
walls and objects, mixing with the main sound.
▪ Figuring out where one word ends and another begins can be
tricky. Sometimes, words seem to blend together.
▪ Comparing a spoken sound to a stored pattern requires time adjustment: the same word may be spoken quickly or slowly. Simply rescaling the time axis is not enough, because individual sounds (such as 's' and 'sh') are stretched or compressed by different amounts and need different durations to be recognized.
Speech Recognition System
• There are two types of speech recognition systems:
• Speaker-Independent: These can understand only a limited number
of words well. Imagine a phone that responds to "Hey, Siri." It doesn't
care who says it; it just knows this specific phrase.
• Speaker-Dependent: These are better at understanding more words
because they've been taught in advance. Think of a computer that
understands your voice commands after you've used it for a while.
Speech Recognition System
• "Training in advance" means preparing the system to recognize your voice. It's like
teaching a dog tricks before it performs them perfectly.
• Speaker-dependent systems can understand about 25,000 words. Imagine a
system that helps doctors with medical terms. It knows many specific words
related to medicine.
• Speaker-independent systems can only handle about 500 words, but they're not
as good at it. Think of a basic voice command system that can just understand a
few simple words like "yes," "no," and "stop."
• These numbers are just rough estimates; they give you a general idea, not exact
values.
• In real situations, you need to consider details like where the measurements
were taken. For example, if the measurements were done in a quiet room, the
results might be more accurate. Also, sometimes the speaker needs to adjust to
the system to make things like timing easier to understand.
Speech Transmission
