Social media users frequently encounter abuse, harassment, and insults from other users on most online communication platforms, such as Facebook, Instagram and YouTube. As a result, many users stop expressing their ideas and opinions.
What is the solution?
The solution to this problem is to create an effective model that can identify the types of toxicity in comments, such as threats, obscenity, insults, and racism, thereby promoting a peaceful environment for online dialogue.
In this article, we will learn more about multi-label toxic comment classification and build a model that classifies comments into various toxicity labels.
What is Toxic comment classification?
A toxic comment is any comment or text containing offensive or hurtful words. This can involve insults, slurs or other offensive language.
Supervised classification problems can be divided into three groups, based on how many categories exist and how many of them can apply to a single sample:
1. Binary classification:
It is a type of supervised machine-learning problem that classifies data into two mutually exclusive groups or categories. The two categories can be classified as true and false, 0 and 1, positive and negative, etc.
In toxic comment classification, the model is trained to predict whether a comment is toxic (class 1) or non-toxic (class 0).
Example:
"I hate you!" Predicted class: Toxic (class 1)
"I like you!" Predicted class: Non-toxic (class 0)
2. Multiclass classification:
It is a type of supervised machine-learning problem that classifies data into three or more groups/categories.
A multiclass classifier for Toxic comment classification is trained to detect various degrees of toxicity in comments, such as mild toxicity, severe toxicity, and non-toxic comments, as opposed to just differentiating between toxic and non-toxic comments (binary classification).
Example:
"I want to kill you!" Predicted class: Severe toxicity
"You are so ugly and unconfident" Predicted class: Mild toxicity
"You are a good person" Predicted class: Non-toxic
3. Multilabel classification:
It is a type of supervised machine-learning problem in which a single instance can be associated with multiple labels simultaneously. The model can assign zero, one, or more labels to each data sample based on its characteristics.
In the context of toxic comment classification, a comment or text can be labelled with multiple toxicity categories if it contains various forms of harmful language.
Example:
"You're an idiot person, and I hope someone hits you!"
Multiple labels: Offensive language (class 1), Threats (class 1), Hatred (class 1), Non-toxic (class 0)
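The difference between the three target types can be sketched in a few lines of plain Python. This is only an illustration: the helper name to_multi_hot is hypothetical, and the label names are borrowed from the dataset columns used later in this article.

```python
# Label names follow the dataset columns used later in the article.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_multi_hot(active_labels, label_names=LABELS):
    """Return a 0/1 vector with a 1 for every label that applies (hypothetical helper)."""
    active = set(active_labels)
    unknown = active - set(label_names)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in label_names]


binary_target = 1                 # binary: toxic (1) or non-toxic (0)
multiclass_target = "severe"      # multiclass: exactly one of several classes
multilabel_target = to_multi_hot(["toxic", "threat", "insult"])  # multilabel: any subset

print(multilabel_target)  # [1, 0, 0, 1, 1, 0]
```

A clean comment is simply the all-zero vector, which is why no separate "non-toxic" column is needed in the multilabel setting.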
Toxic Comment Classification using BERT
Let's get started!
About the dataset:
We have a large number of Wikipedia comments that have been labelled by human raters for toxic behaviour. The label columns in the dataset are:
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
Access the dataset: Toxic Comments dataset
Now, the coding part begins!
Prerequisite
We use PyTorch, which offers a flexible and intuitive interface for building and training deep learning models:
!pip install torch
The transformers library provides BERT (Bidirectional Encoder Representations from Transformers):
!pip install transformers
Importing necessary libraries
Python3
import numpy as np
import pandas as pd
#data visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
import torch
from torch.utils.data import DataLoader, TensorDataset
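# AdamW is imported from torch.optim; the copy in transformers is deprecated
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification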
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
#to avoid warnings
import warnings
warnings.filterwarnings('ignore')
Load the datasets
Python3
data = pd.read_csv("toxicity.csv")
print(data.head())
Output:
id comment_text toxic \
0 0000997932d777bf Explanation\nWhy the edits made under my usern... 0
1 000103f0d9cfb60f D'aww! He matches this background colour I'm s... 0
2 000113f07ec002fd Hey man, I'm really not trying to edit war. It... 0
3 0001b41b1c6bb37e "\nMore\nI can't make any real suggestions on ... 0
4 0001d958c54c6e35 You, sir, are my hero. Any chance you remember... 0
severe_toxic obscene threat insult identity_hate
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
Data Visualization to Understand Class Distribution
Python3
# Visualize how often each label occurs
column_labels = data.columns.tolist()[2:]
label_counts = data[column_labels].sum().sort_values()
# Create the figure
plt.figure(figsize=(7, 5))
# Create a horizontal bar plot using Seaborn
ax = sns.barplot(x=label_counts.values,
                 y=label_counts.index, palette='viridis')
# Add labels and title to the plot
plt.xlabel('Number of Occurrences')
plt.ylabel('Labels')
plt.title('Distribution of Label Occurrences')
# Show the plot
plt.show()
Output:
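[Figure: horizontal bar plot "Distribution of Label Occurrences"; x-axis: Number of Occurrences, y-axis: Labels]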
Checking exact values for each class
Python3
data[column_labels].sum().sort_values()
Output:
threat 478
identity_hate 1405
severe_toxic 1595
insult 7877
obscene 8449
toxic 15294
dtype: int64
Toxic and Non-Toxic Data
Let's check whether the data is balanced by splitting it into toxic and clean subsets, and then build a small data frame to visualize how the comments are distributed between the two.
Python3
# Create subsets based on toxic and clean comments
train_toxic = data[data[column_labels].sum(axis=1) > 0]
train_clean = data[data[column_labels].sum(axis=1) == 0]
# Number of toxic and clean comments
num_toxic = len(train_toxic)
num_clean = len(train_clean)
# Create a DataFrame for visualization
plot_data = pd.DataFrame(
{'Category': ['Toxic', 'Clean'], 'Count': [num_toxic, num_clean]})
# Create the figure
plt.figure(figsize=(7, 5))
# Horizontal bar plot
ax = sns.barplot(x='Count', y='Category', data=plot_data, palette='viridis')
# Add labels and title to the plot
plt.xlabel('Number of Comments')
plt.ylabel('Category')
plt.title('Distribution of Toxic and Clean Comments')
# Show the plot
plt.show()
Output:
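[Figure: horizontal bar plot "Distribution of Toxic and Clean Comments"; x-axis: Number of Comments, y-axis: Category]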
We can observe that our dataset is severely imbalanced.
Let's look at the exact number of toxic and clean comments so that we can balance the data accordingly.
Python3
print(train_toxic.shape)
print(train_clean.shape)
Output:
(16225, 8)
(143346, 8)
There is a huge difference in the dataset between toxic and clean comments.
Handling class imbalance
To handle the imbalanced data, we can create a new training set in which the number of toxic comments remains the same, and to match that, we will randomly sample 16,225 clean comments and include them in the training set.
The new balanced data frame
Python3
# Randomly sample 16,225 clean comments to match the number of toxic comments
train_clean_sampled = train_clean.sample(n=16225, random_state=42)
# Combine the toxic and sampled clean comments
dataframe = pd.concat([train_toxic, train_clean_sampled], axis=0)
# Shuffle the data to avoid any order bias during training
dataframe = dataframe.sample(frac=1, random_state=42)
Let's verify with actual figures
Python3
print(train_toxic.shape)
print(train_clean_sampled.shape)
print(dataframe.shape)
Output:
(16225, 8)
(16225, 8)
(32450, 8)
Now the dataset is balanced, with exactly equal numbers of toxic and clean comments. We can proceed to splitting the data and then tokenizing and encoding the comments using BertTokenizer.
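Balancing toxic against clean comments does not balance the six individual labels. Because every toxic comment is kept, the per-label counts printed earlier stay the same, while the total drops to 32,450 rows. The short sketch below (the function name label_shares is our own) computes each label's share of the balanced data from those figures.

```python
# Per-label counts and the balanced row count, copied from the outputs above.
label_counts = {
    "threat": 478,
    "identity_hate": 1405,
    "severe_toxic": 1595,
    "insult": 7877,
    "obscene": 8449,
    "toxic": 15294,
}
balanced_rows = 32450


def label_shares(counts, total_rows):
    """Fraction of rows that carry each label (hypothetical helper)."""
    return {name: count / total_rows for name, count in counts.items()}


shares = label_shares(label_counts, balanced_rows)
for name, share in shares.items():
    print(f"{name:<14}{share:7.2%}")
```

Rare labels such as threat still cover under 2% of the balanced rows, so per-label results for them should be read with some caution.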
Split Data into Training, Validation, and Testing Sets
In this step, we split the data into training, validation, and testing sets. The data is divided into training and testing sets first, and then the testing set is further split into validation and testing sets.
Python3
# Split data into training and held-out sets
# (test_size=0.3 reproduces the 22,715 training rows shown in the output further down)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    dataframe['comment_text'], dataframe.iloc[:, 2:], test_size=0.3, random_state=42)
Now, we split off the validation set
Python3
# Split the held-out data evenly into test and validation sets
test_texts, val_texts, test_labels, val_labels = train_test_split(
    test_texts, test_labels, test_size=0.5, random_state=42)
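The two-stage idea (hold out a slice first, then halve it) can be sketched without any library. This is not how train_test_split is implemented; the function three_way_split and the fractions used here are purely illustrative.

```python
import random


def three_way_split(items, holdout_fraction, val_fraction_of_holdout, seed=42):
    """Shuffle, hold out a fraction, then split the holdout into val and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Stage 1: separate the held-out slice from the training data
    n_holdout = round(len(shuffled) * holdout_fraction)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    # Stage 2: divide the held-out slice into validation and test parts
    n_val = round(len(holdout) * val_fraction_of_holdout)
    val, test = holdout[:n_val], holdout[n_val:]
    return train, val, test


train, val, test = three_way_split(range(1000), 0.3, 0.5)
print(len(train), len(val), len(test))  # 700 150 150
```

Every item ends up in exactly one of the three sets, which is the property that keeps validation and test scores honest.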
Now, we will tokenize and encode the comments and labels for the training, testing, and validation sets.
Tokenization and Encoding
Define the 'tokenize_and_encode' function to perform this task
Python3
# Tokenize and Encode Function
def tokenize_and_encode(tokenizer, comments, labels, max_length=128):
    # Initialize empty lists to store tokenized inputs and attention masks
    input_ids = []
    attention_masks = []
    # Iterate through each comment in the 'comments' list
    for comment in comments:
        # Tokenize and encode the comment using the BERT tokenizer
        encoded_dict = tokenizer.encode_plus(
            comment,
            # Add special tokens like [CLS] and [SEP]
            add_special_tokens=True,
            # Truncate comments longer than 'max_length'
            max_length=max_length,
            truncation=True,
            # Pad shorter comments to 'max_length' with zeros
            # (replaces the deprecated pad_to_max_length=True)
            padding='max_length',
            # Return attention mask to mask padded tokens
            return_attention_mask=True,
            # Return PyTorch tensors
            return_tensors='pt'
        )
        # Append the tokenized input and attention mask to their respective lists
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])
    # Concatenate the tokenized inputs and attention masks into tensors
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    # Convert the labels to a PyTorch tensor with the data type float32
    labels = torch.tensor(labels, dtype=torch.float32)
    # Return the tokenized inputs, attention masks, and labels as PyTorch tensors
    return input_ids, attention_masks, labels
Initialize Tokenizer and Model
Now, we will initialize the BERT tokenizer with the 'bert-base-uncased' model
Python3
# Token Initialization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
do_lower_case=True)
Initialize BERT classification Model
After this step, we will initialize the BERT model for sequence classification
Python3
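# Model Initialization
# problem_type makes the multi-label setup explicit (one independent yes/no per label)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=6,
                                                      problem_type='multi_label_classification')
As an additional step for faster processing, move the model to the GPU if one is available; otherwise it stays on the CPU.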
Python3
# Move model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)
Apply Tokenization and Encoding
Tokenize and Encode the comments and labels of the train, test and validation set
Python3
# Tokenize and Encode the comments and labels for the training set
input_ids, attention_masks, labels = tokenize_and_encode(
tokenizer,
train_texts,
train_labels.values
)
# Tokenize and Encode the comments and labels for the test set
test_input_ids, test_attention_masks, test_labels = tokenize_and_encode(
tokenizer,
test_texts,
test_labels.values
)
# Tokenize and Encode the comments and labels for the validation set
val_input_ids, val_attention_masks, val_labels = tokenize_and_encode(
tokenizer,
val_texts,
val_labels.values
)
print('Training Comments :',train_texts.shape)
print('Input Ids :',input_ids.shape)
print('Attention Mask :',attention_masks.shape)
print('Labels :',labels.shape)
Output:
Training Comments : (22715,)
Input Ids : torch.Size([22715, 128])
Attention Mask : torch.Size([22715, 128])
Labels : torch.Size([22715, 6])
Let's check an encoded text with the corresponding text and labels
Python3
k = 53
print('Training Comments -->>',train_texts.values[k])
print('\nInput Ids -->>\n',input_ids[k])
print('\nDecoded Ids -->>\n',tokenizer.decode(input_ids[k]))
print('\nAttention Mask -->>\n',attention_masks[k])
print('\nLabels -->>',labels[k])
Output:
Training Comments -->> I have edited the text and wrote with neutral information. Please suggest what went wrong.
Input Ids -->>
tensor([ 101, 1045, 2031, 5493, 1996, 3793, 1998, 2626, 2007, 8699, 2592, 1012,
3531, 6592, 2054, 2253, 3308, 1012, 102, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
Decoded Ids -->>
[CLS] i have edited the text and wrote with neutral information. please suggest what went wrong. [SEP]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD]
Attention Mask -->>
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
Labels -->> tensor([0., 0., 0., 0., 0., 0.])
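The relationship between the input ids and the attention mask in this output can be reproduced by hand. The sketch below is a simplified stand-in for what the tokenizer does after it has produced token ids. The helper pad_and_mask is hypothetical, and unlike the real tokenizer it does not preserve the closing [SEP] token when it truncates.

```python
def pad_and_mask(token_ids, max_length=128, pad_id=0):
    """Pad (or naively truncate) token ids and build the matching attention mask."""
    ids = list(token_ids)[:max_length]
    n_pad = max_length - len(ids)
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


# The 19 non-padding ids from the example above ([CLS] ... [SEP])
example_ids = [101, 1045, 2031, 5493, 1996, 3793, 1998, 2626, 2007, 8699,
               2592, 1012, 3531, 6592, 2054, 2253, 3308, 1012, 102]

padded_ids, mask = pad_and_mask(example_ids)
print(len(padded_ids), sum(mask))  # 128 19
```

The mask has as many ones as there are real tokens, so the 109 trailing [PAD] positions do not influence the model's attention.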
Creating PyTorch Data Loaders
Now, we will create data loaders to efficiently load the data during training, testing, and validation. The data loaders batch the input data and handle shuffling for the training data.
Python3
# Creating DataLoader for the balanced dataset
batch_size = 32
train_dataset = TensorDataset(input_ids, attention_masks, labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# testing set
test_dataset = TensorDataset(test_input_ids, test_attention_masks, test_labels)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
# validation set
val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_labels)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
Let's check the train_loader data
Python3
print('Batch Size :',train_loader.batch_size)
Batch = next(iter(train_loader))
print('Each Input ids shape :',Batch[0].shape)
print('Input ids :\n',Batch[0][0])
print('Corresponding Decoded text:\n',tokenizer.decode(Batch[0][0]))
print('Corresponding Attention Mask :\n',Batch[1][0])
print('Corresponding Label:',Batch[2][0])
Output:
Batch Size : 32
Each Input ids shape : torch.Size([32, 128])
Input ids :
tensor([ 101, 2175, 3280, 1999, 1037, 2543, 1012, 1045, 2123, 2102,
2228, 3087, 2106, 2062, 4053, 2000, 16948, 2059, 2017, 1999,
1996, 2197, 2048, 2086, 1012, 9119, 1010, 3246, 2017, 2123,
2102, 2272, 2067, 2007, 1037, 28407, 13997, 1006, 2029, 2017,
2471, 5121, 2097, 999, 999, 999, 1007, 6109, 1012, 6564,
1012, 2382, 1012, 19955, 102, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
Corresponding Decoded text:
[CLS] go die in a fire. i dont think anyone did more damage to wikipedia then you in the last two years. goodbye,
hope you dont come back with a sock puppet ( which you almost certainly will!!! ) 93. 86. 30. 194 [SEP]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD]
Corresponding Attention Mask :
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
Corresponding Label: tensor([1., 0., 1., 0., 1., 0.])
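It can help to know how many batches one epoch contains. With the 22,715 training comments and the batch size of 32 shown above, and DataLoader's default of keeping the final, smaller batch, the arithmetic looks like this (batch_sizes is a hypothetical helper, not part of PyTorch):

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of the batches produced when the last partial batch is kept."""
    full_batches, remainder = divmod(n_samples, batch_size)
    return [batch_size] * full_batches + ([remainder] if remainder else [])


sizes = batch_sizes(22715, 32)
print(len(sizes), sizes[-1])  # 710 27
```

So each epoch performs 710 optimizer steps, and the last batch holds only 27 comments.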
Initialize the Optimizer
AdamW optimizer: We are using the AdamW optimizer, a variant of Adam (Adaptive Moment Estimation) that applies weight decay separately from the gradient-based update. Adam combines the advantages of two other optimization strategies, RMSprop (Root Mean Square Propagation) and AdaGrad (Adaptive Gradient Algorithm).
For each model parameter, it keeps moving averages of the gradient and of the squared gradient, which are used to adapt the learning rate of that parameter during training.
Python3
# Optimizer setup
optimizer = AdamW(model.parameters(), lr=2e-5)
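To make the moving averages concrete, here is a single AdamW update for one scalar parameter, written in plain Python. It is a simplified sketch of the update rule, not PyTorch's implementation; the function name adamw_step is ours, and the default hyperparameters are assumed to mirror the usual AdamW defaults apart from the learning rate used above.

```python
import math


def adamw_step(param, grad, m, v, step, lr=2e-5,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    # Moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay: shrink the parameter directly
    param = param - lr * weight_decay * param
    # Adaptive step: scaled by the running gradient magnitude
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1, lr=0.1)
print(round(p, 6))  # 0.899
```

On the first step the bias-corrected averages equal the gradient itself, so the parameter moves by roughly lr, plus the small weight-decay shrinkage.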
Model Training
Python3
# Function to Train the Model
def train_model(model, train_loader, val_loader, optimizer, device, num_epochs):
    # Loop through the specified number of epochs
    for epoch in range(num_epochs):
        # Set the model to training mode
        model.train()
        # Initialize total loss for the current epoch
        total_loss = 0
        # Loop through the batches in the training data
        for batch in train_loader:
            input_ids, attention_mask, labels = [t.to(device) for t in batch]
            # Reset gradients left over from the previous batch
            optimizer.zero_grad()
            # Forward pass; passing labels makes the model return the loss
            outputs = model(
                input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
            # Backward pass and parameter update
            loss.backward()
            optimizer.step()
        model.eval()  # Set the model to evaluation mode
        val_loss = 0
        # Disable gradient computation during validation
        with torch.no_grad():
            for batch in val_loader:
                input_ids, attention_mask, labels = [
                    t.to(device) for t in batch]
                outputs = model(
                    input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                val_loss += loss.item()
        # Print the average loss for the current epoch
        print(
            f'Epoch {epoch+1}, Training Loss: {total_loss/len(train_loader)},Validation loss:{val_loss/len(val_loader)}')
# Call the function to train the model (val_loader is now passed explicitly)
train_model(model, train_loader, val_loader, optimizer, device, num_epochs=3)
Output:
Epoch 1, Training Loss: 0.20543626952968852,Validation loss:0.1643741050479459
Epoch 2, Training Loss: 0.13793433358971502,Validation loss:0.14861836971021167
Epoch 3, Training Loss: 0.11418234390587034,Validation loss:0.1539663544862099
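The loss values above come from the model itself. With six outputs and float multi-hot labels, BertForSequenceClassification is expected to use a binary cross-entropy loss on the logits, averaged over all labels. The sketch below is our own plain-Python version of that formula for a single comment, assuming this is indeed the loss in use; it is not the library's code.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


targets = [1, 0, 1, 0, 1, 0]          # e.g. toxic, obscene and insult
undecided = bce_with_logits([0.0] * 6, targets)
confident = bce_with_logits([6.0, -6.0, 6.0, -6.0, 6.0, -6.0], targets)
print(undecided, confident)
```

A model that outputs a logit of 0 for every label scores ln 2 (about 0.693), which gives a reference point: the reported losses of 0.11 to 0.21 are well below that. The validation loss rising slightly in epoch 3 while the training loss keeps falling may indicate the start of overfitting.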
Model Evaluation
Let's evaluate the model now
Python3
# Evaluate the Model
def evaluate_model(model, test_loader, device):
    model.eval()  # Set the model to evaluation mode
    true_labels = []
    predicted_probs = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = [t.to(device) for t in batch]
            # Get model's predictions
            outputs = model(input_ids, attention_mask=attention_mask)
            # Use sigmoid for multilabel classification
            predicted_probs_batch = torch.sigmoid(outputs.logits)
            predicted_probs.append(predicted_probs_batch.cpu().numpy())
            true_labels_batch = labels.cpu().numpy()
            true_labels.append(true_labels_batch)
    # Combine predictions and labels for evaluation
    true_labels = np.concatenate(true_labels, axis=0)
    predicted_probs = np.concatenate(predicted_probs, axis=0)
    # Threshold each label's probability independently
    predicted_labels = (predicted_probs > 0.5).astype(int)
    # Calculate evaluation metrics
    # (for multilabel targets, accuracy_score is the exact-match ratio)
    accuracy = accuracy_score(true_labels, predicted_labels)
    precision = precision_score(true_labels, predicted_labels, average='micro')
    recall = recall_score(true_labels, predicted_labels, average='micro')
    # Print the evaluation metrics
    print(f'Accuracy: {accuracy:.4f}')
    print(f'Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}')
# Call the function to evaluate the model on the test data
evaluate_model(model, test_loader, device)
Output:
Accuracy: 0.7099
Precision: 0.8059
Recall: 0.8691
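For multilabel targets, accuracy_score reports subset accuracy: a comment counts as correct only if all six of its labels are predicted correctly. This is why the accuracy (0.71) is lower than the micro-averaged precision (0.81) and recall (0.87), which are computed over individual label decisions.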
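To see how these metrics can diverge, the following sketch recomputes them by hand on three made-up comments. The functions are our own simplified versions, not scikit-learn's, and the toy labels are invented for illustration.

```python
def subset_accuracy(y_true, y_pred):
    """Share of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def micro_precision_recall(y_true, y_pred):
    """Precision and recall pooled over every individual label decision."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: three comments, six labels each (invented for illustration)
y_true = [[1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
y_pred = [[1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0]]

print(subset_accuracy(y_true, y_pred))         # 0.3333333333333333
print(micro_precision_recall(y_true, y_pred))  # (0.75, 0.75)
```

A single wrong label makes the whole comment count as an error for subset accuracy, while micro precision and recall still give credit for the labels that were right.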
Save the Model
Python3
# Save the tokenizer and model in the same directory
output_dir = "Saved_model"
# Save model's state dictionary and configuration
model.save_pretrained(output_dir)
# Save tokenizer's configuration and vocabulary
tokenizer.save_pretrained(output_dir)
Load the Model
Now, load the saved model and tokenizer
Python3
# Load the tokenizer and model from the saved directory
model_name = "Saved_model"
Bert_Tokenizer = BertTokenizer.from_pretrained(model_name)
Bert_Model = BertForSequenceClassification.from_pretrained(
model_name).to(device)
Now comes the interesting part!
Prediction
Let's predict the labels for user input
Python3
def predict_user_input(input_text, model=Bert_Model, tokenizer=Bert_Tokenizer, device=device):
    user_input = [input_text]
    # Tokenize the text into input ids and an attention mask
    user_encodings = tokenizer(
        user_input, truncation=True, padding=True, return_tensors="pt")
    user_dataset = TensorDataset(
        user_encodings['input_ids'], user_encodings['attention_mask'])
    user_loader = DataLoader(user_dataset, batch_size=1, shuffle=False)
    model.eval()
    with torch.no_grad():
        for batch in user_loader:
            input_ids, attention_mask = [t.to(device) for t in batch]
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            # Convert each logit to an independent probability
            predictions = torch.sigmoid(logits)
    # Mark a label as present when its probability exceeds 0.5
    predicted_labels = (predictions.cpu().numpy() > 0.5).astype(int)
    labels_list = ['toxic', 'severe_toxic', 'obscene',
                   'threat', 'insult', 'identity_hate']
    result = dict(zip(labels_list, predicted_labels[0]))
    return result
text = 'Are you insane!'
predict_user_input(input_text=text)
Output:
{'toxic': 1,
'severe_toxic': 0,
'obscene': 0,
'threat': 0,
'insult': 0,
'identity_hate': 0}
We can observe that the model labels the comment 'Are you insane!' as toxic.
Let's check more inputs
Python3
predict_user_input(input_text='How are you?')
Output:
{'toxic': 0,
'severe_toxic': 0,
'obscene': 0,
'threat': 0,
'insult': 0,
'identity_hate': 0}
As expected, the comment 'How are you?' is not toxic, so every label value is 0.
Python3
text = "Such an Idiot person"
predict_user_input(model=Bert_Model,
tokenizer=Bert_Tokenizer,
input_text=text,
device=device)
Output:
{'toxic': 1,
'severe_toxic': 0,
'obscene': 1,
'threat': 0,
'insult': 1,
'identity_hate': 0}
As we can see, the comment "Such an Idiot person" is flagged as toxic, obscene and insult, which seems reasonable. It is not a threat or identity hate, so those values come out as 0.
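The last step of the prediction, turning logits into a label dictionary, can be isolated in plain Python. The logits below are made up for illustration and are not actual model outputs; decode_logits is a hypothetical helper that mirrors the sigmoid-and-threshold logic used above.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def decode_logits(logits, threshold=0.5, label_names=LABELS):
    """Apply a sigmoid to each logit and flag labels above the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return {name: int(p > threshold) for name, p in zip(label_names, probs)}


# Made-up logits, one per label
example_logits = [2.1, -3.0, 0.4, -4.2, 1.3, -2.5]

print(decode_logits(example_logits))
print(decode_logits(example_logits, threshold=0.7))
```

Because each label gets its own sigmoid, raising the threshold (for example to 0.7) drops borderline labels such as obscene here without affecting the confident ones. The threshold is therefore a natural knob for trading recall against precision.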