
CCS369 - TEXT AND SPEECH ANALYSIS

LAB RECORD

NAME: ….......................................................

REGISTER NUMBER: …...................................................…

DEGREE & BRANCH: ..................................................….

YEAR/SEMESTER: ….................................................……
BONAFIDE CERTIFICATE

Certified that this is a bonafide record of work done by
Selvan/Selvi ...................................................................................... with
Register No. ………………………… studying in III year, VI semester, in the
Computer Science and Engineering branch of this Institution during the
academic year 2023-2024 [EVEN SEM].

Staff in-charge Head of the Department

Submitted for the Anna University practical examination held at SCAD
College of Engineering and Technology, Cheranmahadevi on
………………………

Internal Examiner External Examiner


1. Create Regular expressions in Python for detecting word patterns and tokenizing text

AIM:
To create regular expressions in Python for detecting word patterns and tokenizing text. Regular
expressions (regex) are powerful tools for detecting patterns in text.
ALGORITHM:
1. Import the re module.
2. Define your text.
3. Create a regular expression for detecting word patterns.
4. Use re.findall(pattern, text) to find all matches of the pattern in the text.
5. Print or process the matches.
6. Create a regular expression for tokenizing text.
7. Print or process the tokens.
8. Repeat as necessary.

(A) WORD PATTERNS
PROGRAM:
import re

text = "The quick brown fox jumps over the lazy dog"
pattern = r'\b[a-zA-Z]*[qQ][a-zA-Z]*\b' # Words containing the letter 'q' or 'Q'
matches = re.findall(pattern, text)
print(matches) # Output: ['quick']

OUTPUT:
['quick']

=== Code Execution Successful ===

(B) TOKENIZING TEXT
PROGRAM:
import re
text = "The quick brown fox"
tokens = re.split(r'\s+', text)
print(tokens) # Output: ['The', 'quick', 'brown', 'fox']

OUTPUT:
['The', 'quick', 'brown', 'fox']

=== Code Execution Successful ===

RESULT:
Thus, regular expressions in Python to detect word patterns and tokenize text were written and verified.
Depending on your specific requirements, you can customize the regex patterns accordingly.
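For instance, a minimal sketch of two such customized patterns is given below (the sample sentence and patterns are illustrative additions, not part of the exercise above): one extracts standalone numbers and the other extracts capitalized words.

import re

text = "Python 3 was released in 2008 by Guido van Rossum."
print(re.findall(r'\b\d+\b', text))          # ['3', '2008'] -- standalone numbers
print(re.findall(r'\b[A-Z][a-z]+\b', text))  # ['Python', 'Guido', 'Rossum'] -- capitalized words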
2. Getting started with Python and NLTK - Searching Text, Counting Vocabulary,
Frequency Distribution, Collocations, Bigrams
AIM:
To get started with Python and NLTK - Searching Text, Counting Vocabulary, Frequency Distribution,
Collocations, Bigrams.
ALGORITHM:
1. Install NLTK.
2. Import NLTK and download resources.
3. Load text data.
4. Search text.
5. Count vocabulary.
6. Compute the frequency distribution.
7. Find collocations.
8. Generate bigrams.
PROGRAM:
(A) SEARCHING TEXT
import nltk
from nltk.book import *

# Search for occurrences of a word
def search_word(text, word):
    print("Concordance for word:", word)
    text.concordance(word)

# Search for similar words
def search_similar(text, word):
    print("Similar words for:", word)
    text.similar(word)

# Load text data (you can choose from built-in texts)
text = text1  # Moby Dick

# Perform searches
search_word(text, "whale")
search_similar(text, "whale")
OUTPUT:
Concordance for word: whale
Displaying 25 of 1226 matches:
the Sperm Whale. CHAPTER 32. Cetology. CHAPTER 33. The Spe
rise of the Leviathan. CHAPTER 105. Does the Whale '

(B) COUNTING VOCABULARY

import nltk
from nltk.book import *

# Count unique words
def count_vocabulary(text):
    vocabulary = set(text)
    num_unique_words = len(vocabulary)
    return num_unique_words

# Load text data (you can choose from built-in texts)
text = text1  # Moby Dick

# Count vocabulary
num_unique_words = count_vocabulary(text)
print("Number of unique words:", num_unique_words)

OUTPUT:
Number of unique words: 19317

(C) FREQUENCY DISTRIBUTION
import nltk
from nltk.book import *
from nltk.probability import FreqDist

# Calculate frequency distribution of words
def calculate_frequency_distribution(text):
    fdist = FreqDist(text)
    return fdist

# Load text data (you can choose from built-in texts)
text = text1  # Moby Dick

# Calculate frequency distribution
fdist = calculate_frequency_distribution(text)

# Print the most common words
print("Most common words:", fdist.most_common(10))

OUTPUT:
Most common words: [('the', 13721), (',', 7301), ('.', 6483), ('of', 4748), ('and', 4515), ('a',
3090), ('to', 2824), (';', 2552), ('in', 2185), ('that', 1659)]

(D) COLLOCATIONS
import nltk
from nltk.book import *

# Find collocations
def find_collocations(text):
    print("Collocations:")
    text.collocations()

# Load text data (you can choose from built-in texts)
text = text1  # Moby Dick

# Find collocations
find_collocations(text)

OUTPUT:
Collocations:
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand

(E) BIGRAMS
import nltk
from nltk.book import *

# Generate bigrams
def generate_bigrams(text):
    print("Example bigrams:")
    bigrams = list(nltk.bigrams(text))
    for bigram in bigrams[:10]:  # Displaying the first 10 bigrams as an example
        print(bigram)

# Load text data (you can choose from built-in texts)
text = text1  # Moby Dick

# Generate bigrams
generate_bigrams(text)

OUTPUT:
Example bigrams:
('CHAPTER', '1')
('1', 'Loomings')
('Loomings', '.')
('.', 'Call')
('Call', 'me')
('me', 'Ishmael')
('Ishmael', '.')
('.', 'Some')
('Some', 'years')
('years', 'ago')

RESULT:
These are some basic steps to get started with NLTK for text analysis tasks. You can explore further
functionalities and modules within NLTK for more advanced text processing and analysis.
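As one example of such further exploration, the sketch below (an illustrative addition using the same text1 object, not part of the original record) lists hapax legomena and unusually long words with FreqDist:

import nltk
from nltk.book import *
from nltk.probability import FreqDist

# Words that occur exactly once in Moby Dick (hapaxes)
fdist = FreqDist(text1)
print("Number of hapaxes:", len(fdist.hapaxes()))

# Unusually long words in the vocabulary
long_words = [w for w in set(text1) if len(w) > 15]
print("Some long words:", sorted(long_words)[:5])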
3. Accessing Text Corpora using NLTK in Python
AIM:
To access text corpora using NLTK in Python.
ALGORITHM:
1. Install NLTK.
2. Import NLTK.
3. Download corpora or models.
4. Access text corpora.
5. Explore further.
PROGRAM:
import nltk
from nltk.corpus import gutenberg, brown, wordnet

# Accessing the Gutenberg Corpus
print("=== Gutenberg Corpus ===")
print("Available files:", gutenberg.fileids())
emma_text = gutenberg.raw('austen-emma.txt')[:500]  # Extracting first 500 characters
print("Sample text from 'Emma' by Jane Austen:")
print(emma_text)

# Accessing the Brown Corpus
print("\n=== Brown Corpus ===")
print("Available categories (genres):", brown.categories())
news_text = brown.raw(categories='news')[:500]  # Extracting first 500 characters
print("Sample text from the 'news' category:")
print(news_text)

# Accessing the WordNet Corpus
print("\n=== WordNet Corpus ===")
car_synsets = wordnet.synsets('car')  # Synsets for the word 'car'
print("Synsets for the word 'car':", car_synsets)
print("Definitions of the synsets:")
for synset in car_synsets:
    print(synset.definition())

OUTPUT:
=== Gutenberg Corpus ===
Available files: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
Sample text from 'Emma' by Jane Austen:
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I

Emma Woodhouse, handsome, clever, and rich, with a comfortable home


and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

=== Brown Corpus ===


Available categories (genres): ['adventure', 'belles_lettres', 'editorial', ...]
Sample text from the 'news' category:
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election
produced "no evidence" that any irregularities took place.

=== WordNet Corpus ===


Synsets for the word 'car': [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'),
Synset('cable_car.n.01')]
Definitions of the synsets:
a motor vehicle with four wheels; usually propelled by an internal combustion engine
a wheeled vehicle adapted to the rails of railroad
the compartment that is suspended from an airship and that carries personnel and the cargo and the
power plant
where passengers ride up and down
a conveyance for passengers or freight on a cable railway

RESULT:
This code snippet demonstrates how to access text corpora from the Gutenberg, Brown, and WordNet
corpora using NLTK in Python, along with sample output from each corpus.
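A common next step with a categorized corpus such as Brown is a conditional frequency distribution across genres. A minimal sketch is given below; the chosen genres and modal verbs are illustrative assumptions, not part of the original record.

import nltk
from nltk.corpus import brown

# Compare how often a few modal verbs occur in two Brown genres
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre))
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=['news', 'romance'], samples=modals)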

4. Write a function that finds the 50 most frequently occurring words of a text that are not stop
words.
AIM:
To write a function that finds the 50 most frequently occurring words of a text that are not stop
words.
ALGORITHM:
1. Accept a piece of text as input.
2. Tokenize the input text into words.
3. Remove stopwords from the list of tokens. Stopwords are commonly occurring words in a
language (e.g., "the", "is", "and") that do not carry significant meaning.
4. Count the frequency of each word in the filtered list of tokens.
5. Sort the words based on their frequencies in descending order.
6. Select the top 50 words with the highest frequencies.
7. Return the list of the 50 most frequently occurring words.
PROGRAM:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

def most_frequent_words(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Filter out stop words (keeping alphabetic tokens only, which also drops punctuation)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens
                       if word.isalpha() and word.lower() not in stop_words]

    # Calculate frequency distribution of filtered tokens
    fdist = FreqDist(filtered_tokens)

    # Get the 50 most frequent words
    most_frequent = fdist.most_common(50)
    return most_frequent

# Example usage:
text = ("This is a sample text. It contains some words that will be counted. "
        "The words in this text will be analyzed to find the most frequent ones.")
result = most_frequent_words(text)
print("50 most frequently occurring words (excluding stop words):")
print(result)

OUTPUT:
50 most frequently occurring words (excluding stop words):
[('words', 2), ('text', 2), ('sample', 1), ('contains', 1), ('counted', 1), ('analyzed', 1), ('find', 1), ('frequent',
1), ('ones', 1)]

RESULT:
This function first tokenizes the input text, then filters out stop words using NLTK's English stop
words list. After that, it calculates the frequency distribution of the filtered tokens and returns the 50
most frequent words along with their frequencies. You can replace the example text with any text you
want to analyze.
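As a further usage sketch (assuming the NLTK punkt and stopwords data have already been downloaded), the same function can be run over a full Gutenberg text instead of the short sample sentence; the exact counts depend on the tokenization.

from nltk.corpus import gutenberg

emma_raw = gutenberg.raw('austen-emma.txt')
# Ten most frequent non-stopword tokens in 'Emma'
print(most_frequent_words(emma_raw)[:10])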
5. Implement the Word2Vec model.
AIM:
To implement the Word2Vec model.
ALGORITHM:
1. Accept a corpus of text data as input.
2. Tokenize the text into individual words or phrases.
3. Remove stopwords, punctuation, and other noise if necessary.
4. Initialize a Word2Vec model with parameters such as vector size, window size, and minimum word
count.
5. Train the model; the learned vectors capture the semantic meaning of words based on their context in the training data.
6. Evaluate the performance of the Word2Vec model using tasks such as word similarity and analogy.
7. Return the trained Word2Vec model with its word embeddings.

PROGRAM:
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Sample text data
text = ("Word2Vec is a technique for natural language processing. It is used for converting words "
        "into vectors. Word vectors can be used for various NLP tasks such as sentiment analysis, "
        "named entity recognition, and machine translation.")

# Tokenize the text
tokens = word_tokenize(text.lower())

# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Create Word2Vec model
model = Word2Vec([filtered_tokens], vector_size=100, window=5, min_count=1, workers=4)

# Test the model
word = 'word2vec'
most_similar_words = model.wv.most_similar(word)

# Output
print("Words most similar to '{}':".format(word))
for similar_word, similarity in most_similar_words:
    print(similar_word, similarity)

OUTPUT:
Words most similar to 'word2vec':
nlp 0.24648666310310364
machine 0.24033652210235596
recognition 0.2289284769296646
used 0.22583018231391907
vectors 0.22268365383148193
natural 0.20813092589378357
language 0.19862347853183746
technique 0.17929100954532623
sentiment 0.1550669378042221
converting 0.1451803447008133

RESULT:
This code snippet demonstrates how to implement the Word2Vec model using Gensim in Python.
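Beyond most_similar, a short follow-up sketch (using the same trained model; the words chosen here are assumptions taken from the sample text) shows how to read one word's vector and compare two specific words:

# Inspect the learned embedding for one word
vector = model.wv['word2vec']
print("Vector dimensionality:", len(vector))  # 100, as set by vector_size

# Cosine similarity between two words from the training text
print("similarity(word2vec, vectors):", model.wv.similarity('word2vec', 'vectors'))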

6. Use a transformer for implementing classification


AIM:
To use a transformer for implementing classification in Python.

ALGORITHM:
1. Data preparation.
2. Model architecture.
3. Training loop.
4. Evaluation.
5. Testing.
6. Inference.
7. Deployment.
PROGRAM:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

# Define the Transformer model
class TransformerClassifier(nn.Module):
    def __init__(self, input_dim, output_dim, max_seq_length, num_heads, num_layers):
        super(TransformerClassifier, self).__init__()
        self.embedding = nn.Embedding(input_dim, 256)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=256, nhead=num_heads),
            num_layers=num_layers)
        self.fc = nn.Linear(256 * max_seq_length, output_dim)

    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(1, 0, 2)        # Shape: (seq_len, batch_size, embed_size)
        x = self.transformer(x)
        x = x.permute(1, 0, 2)        # Reshape to (batch_size, seq_len, embed_size)
        x = x.reshape(x.size(0), -1)  # Flatten
        x = self.fc(x)
        return F.log_softmax(x, dim=1)

# Dummy dataset
class DummyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Example usage
input_dim = 1000      # Size of vocabulary
output_dim = 10       # Number of classes
max_seq_length = 20   # Maximum sequence length
num_heads = 8         # Number of attention heads
num_layers = 6        # Number of transformer layers

# Instantiate the model
model = TransformerClassifier(input_dim, output_dim, max_seq_length, num_heads, num_layers)

# Dummy data: 1000 samples of sequences of length 20
data = torch.randint(0, input_dim, (1000, max_seq_length))
labels = torch.randint(0, output_dim, (1000,))

# Create dataset and dataloader
dataset = DummyDataset(data, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Define loss function and optimizer
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):
    for batch_idx, (inputs, targets) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        if batch_idx % 10 == 0:
            print('Epoch {} Batch {} Loss: {:.4f}'.format(epoch, batch_idx, loss.item()))

# Example inference
test_input = torch.randint(0, input_dim, (1, max_seq_length))  # Test input sequence
with torch.no_grad():
    output_probs = torch.exp(model(test_input))
    predicted_class = torch.argmax(output_probs)
print("Predicted class:", predicted_class.item())

OUTPUT:
Epoch 0 Batch 0 Loss: 2.3421
Epoch 0 Batch 10 Loss: 2.2925
Epoch 0 Batch 20 Loss: 2.3014
...
Epoch 9 Batch 0 Loss: 0.0543
Epoch 9 Batch 10 Loss: 0.0402
Epoch 9 Batch 20 Loss: 0.0221
Predicted class: 3
RESULT:
This is a basic implementation of a Transformer-based classifier using PyTorch. You can adjust
hyperparameters, model architecture, and dataset accordingly to fit your specific task.
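For completeness, a minimal evaluation sketch is given below. It is an illustrative addition, not part of the original record, and it simply measures accuracy on the same random-label dummy data to show the shape of an evaluation loop.

# Evaluate on the dummy training data (illustrative only)
model.eval()
correct = 0
with torch.no_grad():
    for inputs, targets in dataloader:
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum().item()
print("Accuracy on dummy data:", correct / len(dataset))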
7. Design a chatbot with a simple dialog system

AIM:
To design a chatbot with a simple dialog system
ALGORITHM:
1. Define objectives.
2. Select platform.
3. Data collection.
4. Preprocessing.
5. Training data creation.
6. Model design.
7. Model training
8. Evaluation.
9. Integration.
10. Testing.
11. Deployment.
12. Maintenance.
PROGRAM:
import random

# Define responses for different intents
responses = {
    "greeting": ["Hello!", "Hi there!", "Hey! How can I help you?"],
    "farewell": ["Goodbye!", "See you later!", "Have a great day!"],
    "thanks": ["You're welcome!", "No problem!", "Anytime!"],
    "default": ["Sorry, I didn't understand that.", "Could you please repeat that?",
                "I'm not sure how to respond to that."]
}

# Define rules for mapping user inputs to intents
rules = {
    "greeting": ["hello", "hi", "hey", "howdy"],
    "farewell": ["bye", "goodbye", "see you later", "take care"],
    "thanks": ["thank you", "thanks", "thanks a lot"]
}

# Function to classify user input into intents
def classify_intent(user_input):
    user_input = user_input.lower()
    for intent, patterns in rules.items():
        for pattern in patterns:
            if pattern in user_input:
                return intent
    return "default"

# Function to generate response based on intent
def generate_response(intent):
    return random.choice(responses[intent])

# Main function to run the chatbot
def chatbot():
    print("Chatbot: Hi! How can I help you today?")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            print("Chatbot: Goodbye!")
            break
        intent = classify_intent(user_input)
        response = generate_response(intent)
        print("Chatbot:", response)

# Run the chatbot
if __name__ == "__main__":
    chatbot()

OUTPUT:
Chatbot: Hi! How can I help you today?
You: Hi
Chatbot: Hello!
You: Can you help me with a problem?
Chatbot: Sorry, I didn't understand that.
You: Thank you
Chatbot: You're welcome!
You: Bye
Chatbot: Goodbye!
RESULT:
Thus, we have designed a chatbot with a simple dialog system.
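Because the intents and responses are plain dictionaries, the dialog system is easy to extend. A hedged sketch of adding a hypothetical "help" intent (so that the earlier input "Can you help me with a problem?" no longer falls through to the default reply) is:

# Add a new intent by extending the two tables (hypothetical example)
responses["help"] = ["Sure, tell me more about the problem.", "I can try to help with that."]
rules["help"] = ["help", "problem", "issue"]

print(generate_response(classify_intent("Can you help me with a problem?")))
# Now maps to the "help" intent instead of "default"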
8. Convert text to speech and find accuracy

AIM:
To convert text to speech and find accuracy

ALGORITHM:
Input: Text data (source), Speech data (target), Ground truth text
Output: Synthesized speech data, Accuracy metrics

1. Convert text data into speech data using a text-to-speech library or API.
- Utilize the provided source text data.
- Generate synthesized speech data.

2. Evaluate the accuracy of the synthesized speech.


- Utilize the synthesized speech data and ground truth text.
- Transcribe the synthesized speech into text using a speech recognition library or API.
- Calculate accuracy metrics (e.g., Word Error Rate, Character Error Rate) to quantify the accuracy.
- Output the accuracy metrics.

3. Return synthesized speech data and accuracy metrics.

PROGRAM:
import pyttsx3
import speech_recognition as sr

# Function to convert text to speech
def text_to_speech(text):
    engine = pyttsx3.init()
    engine.setProperty('rate', 150)  # Speed of speech
    engine.say(text)
    engine.runAndWait()

# Function to convert speech to text
def speech_to_text():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Say something:")
        audio = recognizer.listen(source)

    try:
        text = recognizer.recognize_google(audio)
        return text.lower()
    except sr.UnknownValueError:
        print("Could not understand audio")
        return ""
    except sr.RequestError as e:
        print("Could not request results: {0}".format(e))
        return ""

# Function to calculate accuracy
def calculate_accuracy(ground_truth, synthesized_text):
    words_ground_truth = ground_truth.split()
    words_synthesized = synthesized_text.split()
    num_correct = sum(1 for x, y in zip(words_ground_truth, words_synthesized) if x == y)
    accuracy = num_correct / len(words_ground_truth)
    return accuracy

# Main function
def main():
    # Input text
    text = "Hello, how are you?"

    # Convert text to speech
    print("Synthesizing speech from text...")
    text_to_speech(text)

    # Convert synthesized speech to text
    print("Transcribing synthesized speech...")
    synthesized_text = speech_to_text()

    # Ground truth
    ground_truth = "hello how are you"

    # Calculate accuracy
    accuracy = calculate_accuracy(ground_truth, synthesized_text)
    print("Accuracy:", accuracy)

if __name__ == "__main__":
    main()

OUTPUT:
RESULT:
This code first converts the input text into speech using the text_to_speech() function. Then, it records
speech from the microphone, transcribes it into text using the speech_to_text() function, and compares
it with the ground truth text. Finally, it calculates the accuracy of the synthesized speech using the
calculate_accuracy() function.
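Because calculate_accuracy() compares the two transcripts position by position, a small worked example (illustrative strings, no microphone needed) makes its behaviour concrete:

# Position-wise comparison: only the first two words line up here
print(calculate_accuracy("hello how are you", "hello how you are"))  # 0.5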
9. Design a speech recognition system and find the error rate
AIM:
To design a speech recognition system and find the error rate

ALGORITHM:
Input: Speech recordings, Ground truth transcriptions
Output: Recognized transcriptions, Error rate (e.g., WER or CER)

1. Preprocess the speech recordings:


- Segment into smaller units (e.g., frames).
- Extract features (e.g., MFCCs) from each segment.
- Normalize the feature vectors.

2. Train the speech recognition model:


- Select a suitable model architecture.
- Split the dataset into training, validation, and test sets.
- Train the model on the training set.
- Tune hyperparameters using the validation set.

3. Recognize speech:
- Use the trained model to transcribe speech recordings.

4. Calculate the error rate:


- Compare the recognized transcriptions with the ground truth transcriptions.
- Compute the Word Error Rate (WER) or Character Error Rate (CER).
- WER = (Number of substitutions + Number of deletions + Number of insertions) / Total number of
words in the ground truth transcription.

5. Output recognized transcriptions and error rate.


PROGRAM:
import speech_recognition as sr

# Function to recognize speech
def recognize_speech(audio_file):
    recognizer = sr.Recognizer()

    with sr.AudioFile(audio_file) as source:
        audio_data = recognizer.record(source)  # Read the entire audio file

    try:
        recognized_text = recognizer.recognize_google(audio_data)
        return recognized_text.lower()
    except sr.UnknownValueError:
        print("Speech recognition could not understand audio")
        return ""
    except sr.RequestError as e:
        print("Could not request results from Google Speech Recognition service; {0}".format(e))
        return ""

# Function to calculate Word Error Rate (WER)
# Note: this is a simplified estimate -- it does not align the two word sequences,
# treats every leftover recognized word as a substitution, and never counts insertions.
def calculate_wer(ground_truth, recognized_text):
    # Split ground truth and recognized text into words
    ground_truth_words = ground_truth.split()
    recognized_words = recognized_text.split()

    # Initialize counters for substitutions, deletions, and insertions
    substitutions = 0
    deletions = 0
    insertions = 0

    # Count ground-truth words missing from the recognized text as deletions
    for word in ground_truth_words:
        if word in recognized_words:
            recognized_words.remove(word)
        else:
            deletions += 1

    # Remaining recognized words are treated as substitutions
    substitutions = len(recognized_words)
    total_words = len(ground_truth_words)
    wer = (substitutions + deletions + insertions) / total_words
    return wer

# Main function
def main():
    # Ground truth transcription
    ground_truth = "hello how are you"

    # Recognize speech from audio file
    audio_file = "sample_audio.wav"  # Provide the path to your audio file
    recognized_text = recognize_speech(audio_file)

    # Calculate Word Error Rate (WER)
    wer = calculate_wer(ground_truth, recognized_text)
    print("Recognized text:", recognized_text)
    print("Word Error Rate (WER):", wer)

if __name__ == "__main__":
    main()
OUTPUT:
RESULT:
This code takes an audio file path as input, recognizes speech using the Google Speech Recognition
service, and then calculates the Word Error Rate (WER) between the recognized text and the ground
truth transcription. Make sure to replace "sample_audio.wav" with the path to your audio file.
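Since calculate_wer() above only approximates WER (it never aligns the two word sequences and never counts insertions), a minimal alignment-based sketch using word-level edit distance is shown below; the function name wer_edit_distance is illustrative and not part of the record.

def wer_edit_distance(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer_edit_distance("hello how are you", "hello are you"))  # 0.25: one deletion out of four words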
