NLP Exercise 10
Fine-tune a BERT model for a text classification task. The main pieces of the program are listed first; a step-by-step walkthrough with explanations follows.
import torch
import evaluate
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
# Load the accuracy metric from the evaluate library
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to predicted class indices and score them against the true labels
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_encodings,
eval_dataset=val_encodings,
compute_metrics=compute_metrics,
)
trainer.train()
results = trainer.evaluate()
# Show accuracy
print(f"Validation accuracy: {results['eval_accuracy']:.4f}")
# Example usage
new_text = "The food was awful and the service was great!"
print(f"Text: '{new_text}'")
print("Classification:", classify_text(new_text))
Exercise 10: Develop a Python program to fine-tune a BERT model for a text classification task.
Install the necessary Python libraries for managing datasets, evaluating metrics, and working with the BERT model.
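If the packages are not already available, a typical install command looks like this (package names are inferred from the imports below; pinned versions are not specified in the exercise):

pip install torch transformers datasets evaluate scikit-learn numpy

With the packages in place, import them: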
import torch
import evaluate
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
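The excerpt does not show the dataset itself being loaded. A minimal sketch, assuming the IMDB movie-review dataset (which matches the binary positive/negative labels and the 'train'/'test' splits used below; any dataset with 'text' and 'label' columns would work the same way):

dataset = load_dataset("imdb")  # assumed dataset, not named in the exercise

A small subset keeps the fine-tuning run short: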
subset_size = 100
small_train_data = dataset['train'].select(range(subset_size))
small_val_data = dataset['test'].select(range(subset_size))
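Note that `select(range(subset_size))` simply takes the first `subset_size` rows. If the split is stored sorted by label (the IMDB training split is), the subset would contain only one class, so shuffling first is a safer variant (the seed is arbitrary):

small_train_data = dataset['train'].shuffle(seed=42).select(range(subset_size))
small_val_data = dataset['test'].shuffle(seed=42).select(range(subset_size))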
3. Model Loading
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
Explanation:
- `BertTokenizer.from_pretrained`: Loads the tokenizer for the BERT model.
- `BertForSequenceClassification`: Loads the BERT model for classification tasks
with two labels (positive and negative).
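The classification head outputs two logits, indexed 0 and 1; step 5 below treats index 1 as Positive. Optionally, that mapping can be recorded on the model config so saved checkpoints report label names instead of indices (the mapping itself is an assumption consistent with the prediction code in step 5):

model.config.id2label = {0: "Negative", 1: "Positive"}
model.config.label2id = {"Negative": 0, "Positive": 1}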
4. Training and Evaluation
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

train_encodings = small_train_data.map(tokenize_function, batched=True)
val_encodings = small_val_data.map(tokenize_function, batched=True)
Explanation:
- `tokenize_function`: Converts text into token IDs with padding and truncation for
uniform input length.
- `map`: Applies the tokenization function to the dataset.
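A quick, illustrative sanity check shows the columns the tokenizer adds alongside the original text and label. Because no `max_length` is passed here, sequences are padded to the tokenizer's `model_max_length` (512 for `bert-base-uncased`):

print(train_encodings.column_names)
# e.g. ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
print(len(train_encodings[0]['input_ids']))  # padded length, 512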
Training arguments are set to control the training process:
training_args = TrainingArguments(
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
num_train_epochs=1,
weight_decay=0.01,
)
Explanation:
- `output_dir`: Directory to save the model outputs.
- `evaluation_strategy`: Evaluates the model at the end of each epoch.
- `learning_rate`: The learning rate for the optimizer.
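These arguments feed the `Trainer` shown in the listing at the top of the exercise. Running it is then a short sequence (a minimal sketch, reusing the `trainer` object constructed there):

trainer.train()               # fine-tune BERT on the tokenized training subset
results = trainer.evaluate()  # evaluate on the validation encodings
print(f"Validation accuracy: {results['eval_accuracy']:.4f}")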
5. Making Predictions
def classify_text(text):
    # Tokenize the input and move it to the same device as the model
    tokens = tokenizer(
        text, max_length=128, padding='max_length', truncation=True,
        return_tensors="pt"
    )
    tokens = tokens.to(model.device)
    # Forward pass without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**tokens)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    return "Positive" if prediction == 1 else "Negative"
This function takes new text as input and classifies it as Positive or Negative:
- Tokenizes the input text.
- Moves tokens to the same device as the model.
- Gets the logits from the model and converts them into predictions.
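The same pattern extends to several texts at once, since the tokenizer accepts a list of strings. A small sketch (the `classify_batch` name is illustrative, not part of the exercise):

def classify_batch(texts):
    tokens = tokenizer(texts, max_length=128, padding='max_length',
                       truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**tokens).logits
    predictions = torch.argmax(logits, dim=1).tolist()
    return ["Positive" if p == 1 else "Negative" for p in predictions]

print(classify_batch(["Loved every minute of it.", "Completely boring."]))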