Unit 5 Speech Processing
Diphthongs and Glides: Apart from pure vowels, there are also diphthongs
(vowel sounds formed by the combination of two vowel sounds within the same
syllable, like "oi" in "coin") and glides (semi-vowel sounds that act as transitional
sounds between vowels, like "y" in "yes" and "w" in "we").
1. Sound Waves: When we talk, our vocal cords vibrate, creating sound
waves. These waves travel through the air and reach our ears, allowing us
and others to hear the sounds.
2. Frequency and Amplitude: Speech sounds have different frequencies (how often the waves repeat) and amplitudes (how large the waves are, which we perceive as loudness). For example, high-frequency waves create high-pitched sounds, while low-frequency waves create low-pitched sounds.
3. Formants: The vocal tract emphasizes certain frequency bands, creating resonance peaks called formants. These formants help differentiate between vowels and contribute to the unique characteristics of speech sounds.
4. Spectrogram: A spectrogram is a visual representation of how the frequency content of a sound changes over time. In NLP, spectrograms are used to analyze and visualize speech, showing the intensity of different frequencies at different points in time; a minimal sketch of computing one follows this list.
5. Coarticulation: When we speak, the sounds we make can influence each
other. This phenomenon is called coarticulation. For example, the
pronunciation of a vowel may change slightly depending on the
surrounding consonants.
6. Speech Recognition: Understanding the acoustics of speech production
is essential for speech recognition systems to accurately interpret and
transcribe spoken words into text.
7. Speaker Variability: Everyone's voice is unique, and factors like accent,
pitch, and speed of speech contribute to speaker variability. Acoustic
analysis helps account for these differences in NLP systems.
8. Noise Reduction: Noise reduction and robust acoustic modeling techniques are used to suppress background noise and improve the accuracy of speech processing systems, such as speech recognition and synthesis.
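As a concrete illustration of item 4, the sketch below computes a spectrogram with scipy.signal.spectrogram. The 16 kHz sample rate, the 25 ms window with a 10 ms hop, and the synthetic two-tone signal are illustrative assumptions, not values taken from this unit.

```python
# Minimal spectrogram sketch: intensity of each frequency band over time.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                   # assumed sample rate in Hz
t = np.arange(0, 1.0, 1 / fs)                # 1 second of samples
# Synthetic two-tone signal standing in for real speech
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

# Short-time analysis: 25 ms windows (400 samples) with 10 ms hop (160 samples)
f, times, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=240)

print(Sxx.shape)   # (frequency bins, time frames): intensity per frequency over time
```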
Pattern comparison techniques involve comparing patterns extracted from speech signals to
determine similarities, differences, or matches.
Here are some commonly used pattern comparison techniques in speech processing:
Euclidean Distance:
Euclidean distance is a simple and widely used measure of the similarity between two feature vectors or patterns: it is the straight-line (geometric) distance between the two vectors in the feature space, with smaller distances indicating more similar patterns.
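A minimal sketch of Euclidean distance between two feature vectors, assuming NumPy is available; the three-dimensional vectors below are placeholders, not real speech features.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

v1 = [1.2, 0.7, 3.1]   # placeholder feature vector
v2 = [0.9, 1.1, 2.8]   # placeholder feature vector
print(euclidean_distance(v1, v2))   # smaller value = more similar patterns
```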
Cosine Similarity:
Cosine similarity measures the cosine of the angle between two vectors and is commonly used to compare feature vectors. In speech processing, cosine similarity is often employed to compare speech vectors in tasks like speaker verification or speaker recognition.
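A minimal sketch of cosine similarity, again assuming NumPy; the short vectors below are placeholders for real speaker embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb1 = [0.20, 0.80, 0.10, 0.50]   # placeholder speaker embedding
emb2 = [0.25, 0.75, 0.05, 0.60]   # placeholder speaker embedding
print(cosine_similarity(emb1, emb2))   # value near 1.0 suggests the same speaker
```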
Neural Networks:
Neural networks, such as Convolutional Neural Networks (CNNs) or Siamese networks, can
be trained to learn similarity metrics directly from speech data. These networks can map
speech patterns into high-dimensional embeddings and measure similarity based on the
distances or similarities in the embedding space.
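A minimal Siamese-style sketch, assuming PyTorch is available; the SpeechEmbedder module, its 40-dimensional input, and the 128-dimensional embedding size are illustrative choices. The similarity score only becomes meaningful after training, for example with a contrastive or triplet loss.

```python
import torch
import torch.nn as nn

class SpeechEmbedder(nn.Module):
    """Maps a fixed-size speech feature vector to an embedding."""
    def __init__(self, in_dim=40, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

embedder = SpeechEmbedder()
x1 = torch.randn(1, 40)          # placeholder features for utterance 1
x2 = torch.randn(1, 40)          # placeholder features for utterance 2
e1, e2 = embedder(x1), embedder(x2)

# Similarity measured in the learned embedding space (here, cosine similarity)
print(torch.nn.functional.cosine_similarity(e1, e2))
```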
Phonetic-based approaches:
Phonetic-based approaches compare utterances by their phonetic similarity, for example by matching the phoneme sequences they contain rather than raw acoustic features, and they are used alongside machine learning techniques in speech recognition.
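A minimal sketch of one phonetic-based comparison: edit distance between phoneme sequences. The ARPAbet-like phoneme strings for "coin" and "cone" are illustrative.

```python
def edit_distance(seq1, seq2):
    """Minimum insertions, deletions, and substitutions to turn seq1 into seq2."""
    dp = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
    for i in range(len(seq1) + 1):
        dp[i][0] = i
    for j in range(len(seq2) + 1):
        dp[0][j] = j
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            cost = 0 if seq1[i - 1] == seq2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[-1][-1]

# "coin" vs "cone" as phoneme sequences: one substitution apart
print(edit_distance(["K", "OY", "N"], ["K", "OW", "N"]))   # 1
```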
Transformer Models:
Transformer models, originally introduced for natural language processing, have also been adapted
for speech processing. Transformer-based architectures, such as Conformer, allow for capturing
global dependencies in speech signals and have shown promising results in speech recognition and
speech synthesis tasks.
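A minimal sketch of applying a standard Transformer encoder to a sequence of speech frames, assuming PyTorch; this is not a Conformer, and the 80-dimensional mel features and random input are placeholders.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 200, 80)   # (batch, time frames, mel features), placeholder input

layer = nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Self-attention lets every frame attend to every other frame,
# which is how global dependencies across the utterance are captured.
out = encoder(frames)
print(out.shape)   # torch.Size([1, 200, 80])
```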
Probabilistic Models:
Various probabilistic models, such as Hidden Semi-Markov Models (HSMMs), Factorial Hidden
Markov Models (FHMMs), or Probabilistic Context-Free Grammars (PCFGs), are used in speech
processing. These models provide a probabilistic framework for capturing complex patterns in
speech data and have applications in speech recognition, parsing, and synthesis.
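As a simple illustration of this probabilistic framework, the sketch below runs the forward algorithm of a plain HMM, the building block that HSMMs and FHMMs extend; the two hidden states, the probabilities, and the observation sequence are all made up for illustration.

```python
import numpy as np

pi = np.array([0.6, 0.4])                    # initial state probabilities
A = np.array([[0.7, 0.3],                    # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                    # emission probabilities per state
              [0.2, 0.8]])
obs = [0, 1, 1, 0]                           # an observed symbol sequence

# Forward recursion: alpha[s] = P(observations so far, current state = s)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

print(alpha.sum())   # total likelihood of the observation sequence under the model
```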