
Exercise 10: Develop a Python program to fine-tune a BERT model for a text classification task.

pip install fsspec==2024.10.0
pip install --upgrade gcsfs fsspec
pip install datasets evaluate

import torch
import evaluate
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset("imdb")

# Limit the dataset to a small subset for quick testing.
# Note: the IMDB train split is sorted by label, so shuffle before selecting
# to avoid a subset containing only negative reviews.
subset_size = 100  # adjust the number of rows as needed
small_train_data = dataset['train'].shuffle(seed=42).select(range(subset_size))
small_val_data = dataset['test'].shuffle(seed=42).select(range(subset_size))

# Load pretrained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

train_encodings = small_train_data.map(tokenize_function, batched=True)
val_encodings = small_val_data.map(tokenize_function, batched=True)

# Format the datasets to return PyTorch tensors for the model inputs and labels
train_encodings.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val_encodings.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Accuracy metric and the function the Trainer calls during evaluation
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Set up Trainer with training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encodings,
    eval_dataset=val_encodings,
    compute_metrics=compute_metrics,
)

# Fine-tune the model
trainer.train()

# Validate the model on test data
results = trainer.evaluate()

# Show accuracy
print(f"Validation accuracy: {results['eval_accuracy']:.4f}")

# Function to make a prediction on new input text
def classify_text(text):
    tokens = tokenizer(
        text,
        max_length=128,          # maximum length of the input sequence
        padding='max_length',    # pad to max length
        truncation=True,         # truncate if the text is too long
        return_tensors="pt"      # return PyTorch tensors
    )

    device = model.device        # device where the model is located
    tokens = tokens.to(device)   # move input tokens to the same device as the model

    with torch.no_grad():        # no gradients needed for inference
        outputs = model(**tokens)

    prediction = torch.argmax(outputs.logits, dim=1).item()
    label = "Positive" if prediction == 1 else "Negative"
    return label

# Example usage
new_text = "The food was awful and the service was great!"
print(f"Text: '{new_text}'")
print("Classification:", classify_text(new_text))
Exercise 10: Develop a Python program to fine-tune a BERT model for a text classification task.

Step 1: Install Required Libraries

Install the Python libraries needed for loading datasets, computing evaluation metrics, and using the BERT model:

pip install fsspec==2024.10.0
pip install --upgrade gcsfs fsspec
pip install datasets evaluate

Step 2: Import Required Modules

import torch
import evaluate
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

- `torch`: A machine learning library for building and training models.
- `evaluate`: Provides functions to evaluate model performance.
- `transformers`: Contains pretrained BERT models and tokenizers.
- `datasets`: Enables loading and processing datasets.
- `numpy`: Useful for numerical operations.
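
As a quick standalone illustration (not part of the exercise program), the `evaluate` library can be used on its own; accuracy is simply the fraction of predictions that match the references:

# Standalone accuracy check (illustrative)
acc = evaluate.load("accuracy")
print(acc.compute(predictions=[0, 1, 1], references=[0, 1, 0]))  # {'accuracy': 0.666...}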

Step 3: Load IMDB Dataset

dataset = load_dataset("imdb")

subset_size = 100
small_train_data = dataset['train'].shuffle(seed=42).select(range(subset_size))
small_val_data = dataset['test'].shuffle(seed=42).select(range(subset_size))

Explanation:
- `load_dataset("imdb")`: Downloads the IMDB movie-review dataset of 25,000 training and 25,000 test examples, each labelled positive or negative.
- `shuffle(...).select(...)`: Takes a small random subset so the exercise runs quickly. Shuffling first matters because the train split is sorted by label.
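
A quick peek at one example shows the dataset schema; the `text` and `label` fields are what the tokenizer and model consume (illustrative):

print(dataset['train'][0]['text'][:100])  # first 100 characters of a review
print(dataset['train'][0]['label'])       # 0 = negative, 1 = positive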

Step 4: Model Loading
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
Explanation:
- `BertTokenizer.from_pretrained`: Loads the tokenizer for the BERT model.
- `BertForSequenceClassification`: Loads the BERT model for classification tasks
with two labels (positive and negative).
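
To make the tokenizer's role concrete, here is what it produces for one sentence (illustrative; the exact IDs depend on the vocabulary):

sample = tokenizer("A surprisingly good movie!", max_length=16,
                   padding="max_length", truncation=True)
print(sample['input_ids'][:8])       # token IDs; 101 is the [CLS] token
print(sample['attention_mask'][:8])  # 1 = real token, 0 = padding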
Step 5: Training and Evaluation
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

train_encodings = small_train_data.map(tokenize_function, batched=True)
val_encodings = small_val_data.map(tokenize_function, batched=True)
Explanation:
- `tokenize_function`: Converts text into token IDs with padding and truncation for
uniform input length.
- `map`: Applies the tokenization function to the dataset.
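
After `map`, each example keeps its original fields and gains the tokenizer outputs (illustrative):

print(train_encodings.column_names)
# e.g. ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']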
Training arguments are set to control the training process:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)
Explanation:
- `output_dir`: Directory to save the model outputs.
- `evaluation_strategy`: Evaluates the model at the end of each epoch.
- `learning_rate`: The learning rate for the optimizer.
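
These arguments are then handed to a `Trainer` together with the model, the tokenized datasets, and the metric function, exactly as in the full program above:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encodings,
    eval_dataset=val_encodings,
    compute_metrics=compute_metrics,
)
trainer.train()               # fine-tune BERT on the training subset
results = trainer.evaluate()  # compute accuracy on the validation subset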
Step 6: Making Predictions
def classify_text(text):
    tokens = tokenizer(
        text, max_length=128, padding='max_length', truncation=True,
        return_tensors="pt"
    )
    device = model.device
    tokens = tokens.to(device)
    with torch.no_grad():
        outputs = model(**tokens)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    label = "Positive" if prediction == 1 else "Negative"
    return label
This function takes new text as input and classifies it as Positive or Negative:
- Tokenizes the input text.
- Moves tokens to the same device as the model.
- Gets the logits from the model and converts them into predictions.
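
Example usage, as in the full program above:

new_text = "The food was awful and the service was great!"
print(f"Text: '{new_text}'")
print("Classification:", classify_text(new_text))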
