CCS369 - Text and Speech Analysis
LAB RECORD
NAME: ….......................................................
DEGREE&BRANCH: ..................................................….
YEAR/SEMESTER: ….................................................……
BONAFIDE CERTIFICATE
1. Create regular expressions in Python for detecting word patterns and tokenizing text
AIM:
To create regular expressions in Python for detecting word patterns and tokenizing text. Regular expressions (regex) are powerful tools for detecting patterns in text.
ALGORITHM:
1. Import the module.
2. Define your text.
3. Create regular expressions for detecting word patterns.
4. Use re.findall(pattern, text) to find all matches of the pattern.
5. Print or process the matches.
6. Create regular expressions for tokenizing text.
7. Print or process the tokens.
8. Repeat as necessary.
(A) WORD PATTERNS
PROGRAM:
import re
text = "The quick brown fox jumps over the lazy dog"
pattern = r'\b[a-zA-Z]*[qQ][a-zA-Z]*\b' # Words containing the letter 'q' or 'Q'
matches = re.findall(pattern, text)
print(matches) # Output: ['quick']
OUTPUT:
['quick']
(B) TOKENIZING TEXT
PROGRAM:
import re
text = "The quick brown fox"
tokens = re.split(r'\s+', text)
print(tokens) # Output: ['The', 'quick', 'brown', 'fox']
OUTPUT:
['The', 'quick', 'brown', 'fox']
RESULT:
Thus, regular expressions in Python to detect word patterns and tokenize text were written and verified.
Depending on your specific requirements, you can customize the regex patterns accordingly.
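For instance, the tokenizer can be adapted so that punctuation marks come out as separate tokens instead of being discarded; a minimal sketch (the sample sentence and pattern below are only one possible choice):

import re

text = "The quick brown fox, it jumps over the lazy dog!"
# \w+ matches runs of word characters; [^\w\s] matches single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['The', 'quick', 'brown', 'fox', ',', 'it', 'jumps', ...]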
2. Getting started with Python and NLTK - Searching Text, Counting Vocabulary,
Frequency Distribution, Collocations, Bigrams
AIM:
To get started with Python and NLTK - Searching Text, Counting Vocabulary, Frequency Distribution,
Collocations, Bigrams.
ALGORITHM:
1. Install NLTK.
2. Import NLTK and download the required resources.
3. Load text data.
4. Search the text.
5. Count the vocabulary.
6. Build a frequency distribution.
7. Find collocations.
8. Generate bigrams.
PROGRAM:
(A) SEARCHING TEXT
import nltk
from nltk.book import *

# Search for occurrences of a word
def search_word(text, word):
    print("Concordance for word:", word)
    text.concordance(word)

# Search for similar words
def search_similar(text, word):
    print("Similar words for:", word)
    text.similar(word)

# Load text data (Moby Dick) and perform searches
text = text1
search_word(text, "whale")
search_similar(text, "whale")
OUTPUT:
Concordance for word: whale
Displaying 25 of 1226 matches:
the Sperm Whale. CHAPTER 32. Cetology. CHAPTER 33. The Spe
rise of the Leviathan. CHAPTER 105. Does the Whale '
(B) COUNTING VOCABULARY
import nltk
from nltk.book import *

# Count unique words
def count_vocabulary(text):
    vocabulary = set(text)
    num_unique_words = len(vocabulary)
    return num_unique_words

# Load text data (you can choose from built-in texts)
text = text1  # Moby Dick
# Count vocabulary
num_unique_words = count_vocabulary(text)
print("Number of unique words:", num_unique_words)
OUTPUT:
Number of unique words: 19317
(C) FREQUENCY DISTRIBUTION
import nltk
from nltk.book import *
# Frequency distribution of tokens in Moby Dick (text1)
fdist = nltk.FreqDist(text1)
print("Most common words:", fdist.most_common(10))
OUTPUT:
Most common words: [('the', 13721), (',', 7301), ('.', 6483), ('of', 4748), ('and', 4515), ('a',
3090), ('to', 2824), (';', 2552), ('in', 2185), ('that', 1659)]
(D) COLLOCATIONS
import nltk
from nltk.book import *

# Find collocations
def find_collocations(text):
    print("Collocations:")
    text.collocations()

# Find collocations in Moby Dick
find_collocations(text1)
OUTPUT:
Collocations:
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
(E) BIGRAMS
import nltk
from nltk.book import *

# Generate bigrams
def generate_bigrams(text):
    print("Example bigrams:")
    bigrams = list(nltk.bigrams(text))
    for bigram in bigrams[:10]:  # Displaying the first 10 bigrams as an example
        print(bigram)

# Generate bigrams from Moby Dick
generate_bigrams(text1)
OUTPUT:
Example bigrams:
('CHAPTER', '1')
('1', 'Loomings')
('Loomings', '.')
('.', 'Call')
('Call', 'me')
('me', 'Ishmael')
('Ishmael', '.')
('.', 'Some')
('Some', 'years')
('years', 'ago')
RESULT:
These are some basic steps to get started with NLTK for text analysis tasks. You can explore further
functionalities and modules within NLTK for more advanced text processing and analysis.
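The same searching and counting utilities can also be applied to your own text by wrapping a token list in nltk.Text; a minimal sketch (the sentence below is an arbitrary illustration):

import nltk

# Wrap a custom token list so concordance, collocations, etc. become available
tokens = nltk.word_tokenize("The whale surfaced. The whale dived. The crew watched the whale.")
my_text = nltk.Text(tokens)
my_text.concordance("whale")
print("Vocabulary size:", len(set(my_text)))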
3. Accessing Text Corpora using NLTK in Python
AIM:
To access text corpora using NLTK in Python.
ALGORITHM:
1. Install NLTK.
2. Import NLTK.
3. Download corpora or models.
4. Access text corpora.
5. Explore further.
PROGRAM:
import nltk
from nltk.corpus import gutenberg, brown, wordnet
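# NOTE: the record lists only the imports; a minimal sketch of the access code is
# given below, assuming the corpora have been downloaded via nltk.download() and
# using a 200-character slice of Emma purely for display.
print("=== Gutenberg Corpus ===")
print("Available files:", gutenberg.fileids())
print("Sample text from 'Emma' by Jane Austen:")
print(gutenberg.raw('austen-emma.txt')[:200])

print("=== Brown Corpus ===")
print("Categories:", brown.categories()[:5])

print("=== WordNet ===")
print("Synsets for 'dog':", wordnet.synsets('dog')[:3])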
OUTPUT:
=== Gutenberg Corpus ===
Available files: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
Sample text from 'Emma' by Jane Austen:
[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
RESULT:
This code snippet demonstrates how to access text corpora from the Gutenberg, Brown, and WordNet
corpora using NLTK in Python, along with sample output from each corpus.
4. Write a function that finds the 50 most frequently occurring words of a text that are not stop
words.
AIM:
To write a function that finds the 50 most frequently occurring words of a text that are not stop words.
ALGORITHM:
1. Accept a piece of text as input.
2. Tokenize the input text into words.
3. Remove stopwords from the list of tokens. Stopwords are commonly occurring words in a
language (e.g., "the", "is", "and") that do not carry significant meaning.
4. Count the frequency of each word in the filtered list of tokens.
5. Sort the words based on their frequencies in descending order.
6. Select the top 50 words with the highest frequencies.
7. Return the list of the 50 most frequently occurring words.
PROGRAM:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
def most_frequent_words(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words and non-alphabetic tokens
    stop_words = set(stopwords.words('english'))
    filtered = [w.lower() for w in tokens if w.isalpha() and w.lower() not in stop_words]
    # Count word frequencies and return the 50 most common words
    fdist = FreqDist(filtered)
    return fdist.most_common(50)

# Example usage:
text = ("This is a sample text. It contains some words that will be counted. "
        "The words in this text will be analyzed to find the most frequent ones.")
result = most_frequent_words(text)
print("50 most frequently occurring words (excluding stop words):")
print(result)
OUTPUT:
50 most frequently occurring words (excluding stop words):
[('words', 2), ('text', 2), ('sample', 1), ('contains', 1), ('counted', 1), ('analyzed', 1), ('find', 1), ('frequent',
1), ('ones', 1)]
RESULT:
This function first tokenizes the input text, then filters out stop words using NLTK's English stop
words list. After that, it calculates the frequency distribution of the filtered tokens and returns the 50
most frequent words along with their frequencies. You can replace the example text with any text you
want to analyze.
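To run the same function over a larger text, the example string could be swapped for a corpus file, assuming the Gutenberg corpus has been downloaded:

from nltk.corpus import gutenberg

# Top 50 non-stop-words in Jane Austen's Emma
emma_text = gutenberg.raw('austen-emma.txt')
print(most_frequent_words(emma_text))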
5. Implement the Word2Vec model.
AIM:
To implement the Word2Vec model.
ALGORITHM:
1. Accept a corpus of text data as input.
2. Tokenize the text into individual words or phrases.
3. Remove stopwords, punctuation, and other noise if necessary.
4. Initialize a Word2Vec model with parameters like vector size, window size, minimum word
count.
5. Train the model on the tokenized corpus; the resulting vectors capture semantic meanings of words based on their context in the training data.
6. Evaluate the performance of the Word2Vec model using tasks like word similarity and analogy.
7. Output the trained Word2Vec model with its word embeddings.
PROGRAM:
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
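# NOTE: the record omits the training steps; a minimal sketch is given below, assuming
# a small illustrative corpus and Gensim's standard Word2Vec API.
corpus = [
    "Word2Vec is a technique for converting words into vectors",
    "It is widely used in NLP, natural language processing and machine learning",
    "Word vectors capture semantic meaning from context",
    "Applications include sentiment analysis and speech recognition",
]
stop_words = set(stopwords.words('english'))
sentences = [[w for w in word_tokenize(doc.lower()) if w.isalnum() and w not in stop_words]
             for doc in corpus]

# Train the Word2Vec model (small vectors and window for a toy corpus)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

# Query the words most similar to a target word
word = "word2vec"
most_similar_words = model.wv.most_similar(word)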
# Output
print("Words most similar to '{}':".format(word))
for similar_word, similarity in most_similar_words:
    print(similar_word, similarity)
OUTPUT:
Words most similar to 'word2vec':
nlp 0.24648666310310364
machine 0.24033652210235596
recognition 0.2289284769296646
used 0.22583018231391907
vectors 0.22268365383148193
natural 0.20813092589378357
language 0.19862347853183746
technique 0.17929100954532623
sentiment 0.1550669378042221
converting 0.1451803447008133
RESULT:
This code snippet demonstrates how to implement the Word2Vec model using Gensim in Python.
6. Implement a Transformer-based classifier
AIM:
To implement a Transformer-based classifier using PyTorch.
ALGORITHM:
1. Data preparation.
2. Model architecture.
3. Training loop.
4. Evaluation.
5. Testing.
6. Inference.
7. Deployment.
PROGRAM:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
# Dummy dataset
class DummyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return one (sequence, label) pair
        return self.data[idx], self.labels[idx]
# Example usage
input_dim = 1000 # Size of vocabulary
output_dim = 10 # Number of classes
max_seq_length = 20 # Maximum sequence length
num_heads = 8 # Number of attention heads
num_layers = 6 # Number of transformer layers
# Dummy data
data = torch.randint(0, input_dim, (1000, max_seq_length))  # 1000 samples of sequences of length 20
labels = torch.randint(0, output_dim, (1000,))
# Create dataset and dataloader
dataset = DummyDataset(data, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
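# NOTE: the record omits the model definition; a minimal sketch of a Transformer
# encoder classifier is given here, assuming the hyperparameters declared above
# and an embedding size (d_model) of 64 chosen only for illustration.
class TransformerClassifier(nn.Module):
    def __init__(self, input_dim, output_dim, max_seq_length, num_heads, num_layers, d_model=64):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, d_model)
        self.pos_embedding = nn.Embedding(max_seq_length, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(d_model, output_dim)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        h = self.embedding(x) + self.pos_embedding(positions)
        h = self.encoder(h)
        h = h.mean(dim=1)  # average-pool over the sequence
        return F.log_softmax(self.fc(h), dim=-1)  # log-probabilities (see torch.exp below)

model = TransformerClassifier(input_dim, output_dim, max_seq_length, num_heads, num_layers)
criterion = nn.NLLLoss()  # matches the log-softmax output
optimizer = optim.Adam(model.parameters(), lr=1e-3)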
# Training loop
for epoch in range(10):
    for batch_idx, (inputs, targets) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        if batch_idx % 10 == 0:
            print('Epoch {} Batch {} Loss: {:.4f}'.format(epoch, batch_idx, loss.item()))

# Example inference
test_input = torch.randint(0, input_dim, (1, max_seq_length))  # Test input sequence
with torch.no_grad():
    output_probs = torch.exp(model(test_input))
    predicted_class = torch.argmax(output_probs)
print("Predicted class:", predicted_class.item())
OUTPUT:
Epoch 0 Batch 0 Loss: 2.3421
Epoch 0 Batch 10 Loss: 2.2925
Epoch 0 Batch 20 Loss: 2.3014
...
Epoch 9 Batch 0 Loss: 0.0543
Epoch 9 Batch 10 Loss: 0.0402
Epoch 9 Batch 20 Loss: 0.0221
Predicted class: 3
RESULT:
This is a basic implementation of a Transformer-based classifier using PyTorch. You can adjust
hyperparameters, model architecture, and dataset accordingly to fit your specific task.
7. Design a chatbot with a simple dialog system
AIM:
To design a chatbot with a simple dialog system
ALGORITHM:
1. Define objectives.
2. Select platform.
3. Data collection.
4. Preprocessing.
5. Training data creation.
6. Model design.
7. Model training
8. Evaluation.
9. Integration.
10. Testing.
11. Deployment.
12. Maintenance.
PROGRAM:
import random
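# NOTE: the record shows only the import; what follows is a minimal sketch of a
# rule-based dialog loop that would produce a conversation like the sample output.
responses = {
    "hi": ["Hello!", "Hi there!"],
    "hello": ["Hello!", "Hi there!"],
    "thank you": ["You're welcome!"],
    "thanks": ["You're welcome!"],
    "bye": ["Goodbye!"],
}

def get_response(user_input):
    # Look up a canned reply; fall back to a default message
    key = user_input.lower().strip()
    if key in responses:
        return random.choice(responses[key])
    return "Sorry, I didn't understand that."

def chat():
    print("Chatbot: Hi! How can I help you today?")
    while True:
        user_input = input("You: ")
        print("Chatbot:", get_response(user_input))
        if user_input.lower().strip() == "bye":
            break

chat()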
OUTPUT:
Chatbot: Hi! How can I help you today?
You: Hi
Chatbot: Hello!
You: Can you help me with a problem?
Chatbot: Sorry, I didn't understand that.
You: Thank you
Chatbot: You're welcome!
You: Bye
Chatbot: Goodbye!
RESULT:
Thus, we have designed a chatbot with a simple dialog system.
8. Convert text to speech and find accuracy
AIM:
To convert text to speech and find accuracy
ALGORITHM:
Input: Text data (source), Speech data (target), Ground truth text
Output: Synthesized speech data, Accuracy metrics
1. Convert text data into speech data using a text-to-speech library or API.
- Utilize the provided source text data.
- Generate synthesized speech data.
2. Transcribe the synthesized speech back into text using a speech recognition library.
3. Compare the transcribed text with the ground truth text and compute the accuracy metric.
PROGRAM:
import pyttsx3
import speech_recognition as sr
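# NOTE: the record omits the helper functions; they are sketched here assuming the
# standard pyttsx3 and SpeechRecognition APIs.
def text_to_speech(text):
    # Speak the given text aloud
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

def calculate_accuracy(ground_truth, recognized):
    # Simple word-level accuracy: fraction of ground-truth words present in the recognized text
    gt_words = ground_truth.split()
    rec_words = recognized.split()
    matches = sum(1 for w in gt_words if w in rec_words)
    return matches / len(gt_words) if gt_words else 0.0

def speech_to_text():
    # Record audio from the default microphone and transcribe it
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)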
    try:
        text = recognizer.recognize_google(audio)
        return text.lower()
    except sr.UnknownValueError:
        print("Could not understand audio")
        return ""
    except sr.RequestError as e:
        print("Could not request results: {0}".format(e))
        return ""
# Main function
def main():
    # Input text
    text = "Hello, how are you?"
    # Convert text to speech
    print("Synthesizing speech from text...")
    text_to_speech(text)
    # Transcribe the spoken audio back into text
    synthesized_text = speech_to_text()
    # Ground truth
    ground_truth = "hello how are you"
    # Calculate accuracy
    accuracy = calculate_accuracy(ground_truth, synthesized_text)
    print("Accuracy:", accuracy)

if __name__ == "__main__":
    main()
OUTPUT:
RESULT:
This code first converts the input text into speech using the text_to_speech() function. Then, it records speech from the microphone, transcribes it into text using the speech_to_text() function, and compares it with the ground truth text. Finally, it calculates the accuracy of the synthesized speech using the calculate_accuracy() function.
9. Design a speech recognition system and find the error rate
AIM:
To design a speech recognition system and find the error rate
ALGORITHM:
Input: Speech recordings, Ground truth transcriptions
Output: Recognized transcriptions, Error rate (e.g., WER or CER)
1. Collect speech recordings and the corresponding ground truth transcriptions.
2. Choose or train a speech recognition model.
3. Recognize speech:
- Use the trained model to transcribe speech recordings.
4. Compute the error rate (e.g., WER) by comparing the recognized transcriptions with the ground truth.
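PROGRAM:
import speech_recognition as sr

# NOTE: the record omits the surrounding function definitions; they are sketched here
# assuming the SpeechRecognition library's standard file-based API.
def recognize_speech(audio_file_path):
    # Transcribe a recorded audio file with the Google Speech Recognition service
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_file_path) as source:
        audio_data = recognizer.record(source)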
    try:
        recognized_text = recognizer.recognize_google(audio_data)
        return recognized_text.lower()
    except sr.UnknownValueError:
        print("Speech recognition could not understand audio")
        return ""
    except sr.RequestError as e:
        print("Could not request results from Google Speech Recognition service; {0}".format(e))
        return ""
    # Calculate WER
    for word in ground_truth_words:
        if word in recognized_words:
            recognized_words.remove(word)
        else:
            deletions += 1
    substitutions = len(recognized_words)
    total_words = len(ground_truth_words)
    wer = (substitutions + deletions + insertions) / total_words
    return wer
# Main function
def main():
    # Path to a recorded audio file (replace with your own recording)
    audio_file = "sample_audio.wav"
    # Ground truth transcription
    ground_truth = "hello how are you"
    recognized_text = recognize_speech(audio_file)
    wer = calculate_wer(ground_truth, recognized_text)
    print("Word Error Rate (WER):", wer)

if __name__ == "__main__":
    main()
OUTPUT:
RESULT:
This code takes an audio file path as input, recognizes speech using the Google Speech Recognition
service, and then calculates the Word Error Rate (WER) between the recognized text and the ground
truth transcription. Make sure to replace "sample_audio.wav" with the path to your audio file.