Assignment-3 NLP
1. Data Preprocessing
Tokenization: Use spaCy to tokenize the text data. Ensure that only words appearing five or
more times are included in your vocabulary. Words that appear less frequently should be
assigned an "UNK" token.
Vocabulary Management: Construct a vocabulary from the tokenized data, mapping each
unique word to a unique index. Include a special "PAD" token for padding shorter sentences.
One-hot Encoding: Convert tokens in your sentences to one-hot encoded vectors based on the
vocabulary index.
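For reference, a minimal sketch of this step, assuming a vocab dictionary like the one built in the starter code below (note that the LSTM path in that code instead feeds integer indices into an embedding layer):
import torch
def one_hot_encode(token_ids, vocab_size):
    # token_ids: list of vocabulary indices for one padded sentence
    # returns a [seq_len, vocab_size] float tensor with a 1 at each token's index
    ids = torch.tensor(token_ids, dtype=torch.long)
    return torch.nn.functional.one_hot(ids, num_classes=vocab_size).float()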
2. Model Architecture
Input Layer: Takes in one-hot encoded vectors over the vocabulary; the input size is fixed by
padding or truncating every sentence to the maximum length in your data.
Hidden Layers: Two hidden layers with 256 and 128 neurons respectively.
Activation Functions: You can use ReLU or Tanh between layers to introduce non-linearity.
Output Layer: Size of this layer depends on the number of classes (2 for binary, more for multi-
class).
Model Type: Use LSTM due to its effectiveness in handling long sequences.
Structure: Similar to FFNN in terms of input, but the sequential data is processed through LSTM
units. The final state of the LSTM is used to predict the sentiment.
Output Layer: Configured similarly to the FFNN's output layer, adjusted for the number of
classes.
3. Loss Functions
Use cross-entropy loss for both models (nn.CrossEntropyLoss in the starter code below).
4. Training and Evaluation
Metrics: Track accuracy, precision, recall, and F1-score. Utilize validation sets to tune your
models and save the best-performing models.
5. Submission
Package your code, logs, and a detailed report into a .zip file named appropriately with your roll
number and assignment number.
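One possible way to package everything using only the Python standard library (the 'submission' folder and the ROLLNO placeholder are illustrative, not prescribed by the assignment):
import shutil
# Assumes code, logs, and the report are collected in a local 'submission/' folder;
# produces ROLLNO_assignment3.zip in the current directory.
shutil.make_archive('ROLLNO_assignment3', 'zip', 'submission')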
6. Additional Notes
The starter code below walks through one possible end-to-end implementation of the pipeline described above.
import os
import tarfile
import urllib.request
import torch
import torch.nn as nn
import torch.optim as optim
import spacy
from collections import Counter
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
# Load English tokenizer from spaCy
nlp = spacy.load("en_core_web_sm")
# Download and extract IMDB dataset if not already available
def download_extract(url, download_path, extract_path):
    if not os.path.exists(extract_path):
        print("Downloading and extracting dataset...")
        urllib.request.urlretrieve(url, download_path)
        with tarfile.open(download_path, "r:gz") as tar:
            tar.extractall()
        print("Dataset ready.")
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
download_path = "./aclImdb_v1.tar.gz"
extract_path = "./aclImdb"
download_extract(url, download_path, extract_path)
# Load data from files
def load_data(path):
    data = []
    for label in ["pos", "neg"]:
        directory = os.path.join(path, label)
        for filename in os.listdir(directory):
            if filename.endswith(".txt"):
                with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                    content = file.read().strip()
                data.append((content, 1 if label == "pos" else 0))
    return data
train_data = load_data(os.path.join(extract_path, "train"))
test_data = load_data(os.path.join(extract_path, "test"))
# Tokenization and Vocabulary Management
def tokenize(text):
    return [token.text for token in nlp.tokenizer(text)]

def build_vocab(data, min_freq=5):
    counter = Counter()
    for text, _ in data:
        counter.update(tokenize(text))
    # Reserve index 0 for unknown words, then assign contiguous indices to every
    # word appearing at least min_freq times; infrequent words fall back to 'UNK'
    # at encoding time.
    vocab = {'UNK': 0}
    for word, freq in counter.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    vocab['PAD'] = len(vocab)  # padding token gets the last index
    return vocab

vocab = build_vocab(train_data)

def encode_sentence(sentence, vocab):
    return [vocab.get(token, vocab['UNK']) for token in tokenize(sentence)]

def pad_encoded(encoded_sentence, max_length):
    # Truncate to max_length, then pad with the module-level vocab's PAD index
    return encoded_sentence[:max_length] + [vocab['PAD']] * (max_length - len(encoded_sentence))
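# A quick usage example (the sentence and length are only illustrations):
sample = "This movie was surprisingly good!"
encoded = encode_sentence(sample, vocab)   # list of vocabulary indices
padded = pad_encoded(encoded, 10)          # truncated/padded to length 10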
class SentimentDataset(Dataset):
    def __init__(self, data, vocab, max_length=512):
        self.data = data
        self.vocab = vocab
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        encoded_text = encode_sentence(text, self.vocab)
        padded_text = pad_encoded(encoded_text, self.max_length)
        return padded_text, label
# Collate function for DataLoader
def collate_fn(batch):
    texts, labels = zip(*batch)
    texts = torch.tensor(texts, dtype=torch.long)
    labels = torch.tensor(labels)
    return texts, labels
# DataLoaders
train_dataset = SentimentDataset(train_data, vocab, 512)
test_dataset = SentimentDataset(test_data, vocab, 512)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)
# Model Definitions
class FFNN(nn.Module):
    def __init__(self, vocab_size, hidden_dim1=256, hidden_dim2=128, num_classes=2):
        super(FFNN, self).__init__()
        self.fc1 = nn.Linear(vocab_size, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x is a LongTensor of token indices [batch, seq_len]; turn it into a
        # multi-hot bag-of-words vector over the vocabulary so it matches fc1's
        # input size (the PAD index becomes a constant feature).
        bow = torch.zeros(x.size(0), self.fc1.in_features, device=x.device)
        bow.scatter_(1, x, 1.0)
        x = self.relu(self.fc1(bow))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=256, num_classes=2):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        _, (hn, _) = self.lstm(x)
        # Use the final hidden state of the (single) LSTM layer for classification
        return self.fc(hn[-1])
# Instantiate models
model_ffnn = FFNN(len(vocab))
model_lstm = LSTMModel(len(vocab))
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer_ffnn = optim.Adam(model_ffnn.parameters(), lr=0.001)
optimizer_lstm = optim.Adam(model_lstm.parameters(), lr=0.001)
# Training and evaluation functions
def train_model(model, loader, optimizer):
    model.train()
    total_loss = 0
    for inputs, labels in loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
def evaluate_model(model, loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total
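# The evaluation above reports only accuracy; the assignment also asks for
# precision, recall, and F1. A minimal sketch, assuming scikit-learn is
# installed (it is not imported elsewhere in this starter code):
def evaluate_metrics(model, loader):
    from sklearn.metrics import precision_recall_fscore_support
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for inputs, labels in loader:
            preds = model(inputs).argmax(dim=1)
            y_true.extend(labels.tolist())
            y_pred.extend(preds.tolist())
    # average='binary' treats label 1 ("pos") as the positive class
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
    return precision, recall, f1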
# Training and evaluation loop
num_epochs = 10
for epoch in range(num_epochs):
    train_loss_ffnn = train_model(model_ffnn, train_loader, optimizer_ffnn)
    train_loss_lstm = train_model(model_lstm, train_loader, optimizer_lstm)
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss FFNN: {train_loss_ffnn}, Train Loss LSTM: {train_loss_lstm}')
accuracy_ffnn = evaluate_model(model_ffnn, test_loader)
accuracy_lstm = evaluate_model(model_lstm, test_loader)
print(f'Test Accuracy FFNN: {accuracy_ffnn*100:.2f}%, Test Accuracy LSTM: {accuracy_lstm*100:.2f}%')
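# The assignment also asks to use a validation set for tuning and to save the
# best-performing models; the loop above trains on the full training set and
# never checkpoints. A minimal sketch of one way to do this (the 90/10 split
# and the file name are arbitrary choices, not specified by the assignment):
from torch.utils.data import random_split
val_size = len(train_dataset) // 10
train_subset, val_subset = random_split(train_dataset, [len(train_dataset) - val_size, val_size])
train_loader_sub = DataLoader(train_subset, batch_size=32, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_subset, batch_size=32, shuffle=False, collate_fn=collate_fn)
best_val_acc = 0.0
for epoch in range(num_epochs):
    train_model(model_lstm, train_loader_sub, optimizer_lstm)
    val_acc = evaluate_model(model_lstm, val_loader)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model_lstm.state_dict(), 'best_lstm.pt')  # keep the best checkpoint so far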