
Machine Learning Lab5

The document provides a Python code implementation for classifying breast cancer data using the Gaussian Naive Bayes algorithm. It includes data loading, feature selection, model training, evaluation metrics calculation, and example predictions. Justifications for using Gaussian Naive Bayes are also discussed, highlighting its suitability for the dataset's characteristics.

Uploaded by

madhurigade000
Copyright
© All Rights Reserved

MACHINE LEARNING LAB5

BREAST CANCER DATA SET

CODE (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, recall_score
import matplotlib.pyplot as plt

# Load the breast cancer dataset
data = load_breast_cancer()

# Select only the required features (b, d, g, h, i from the problem):
# mean texture, mean area, mean concavity, mean concave points, mean symmetry
selected_features = [1, 3, 6, 7, 8]
X = data.data[:, selected_features]
y = data.target  # 0 = malignant, 1 = benign

# Split the data into training and testing sets (using 18% as test data as specified)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.18, random_state=42)

# Initialize and train the Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)
# Column 1 of predict_proba is the probability of class 1 (benign)
y_pred_prob = gnb.predict_proba(X_test)[:, 1]

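# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# recall_score treats class 1 (benign) as the positive class by default;
# pass pos_label=0 to get the recall for malignant cases instead
recall = recall_score(y_test, y_pred)
error_rate = 1 - accuracy

# Calculate ROC curve and AUC (positive class: benign)
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)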

# Print results
print("Performance Metrics:")
print(f"Accuracy: {accuracy:.3f}")
print(f"Error Rate: {error_rate:.3f}")
print(f"Recall: {recall:.3f}")
print(f"ROC AUC: {roc_auc:.3f}")

print("\nConfusion Matrix:")
print(conf_matrix)

# Example predictions for two instances
print("\nExample Predictions:")
example_instances = X_test[:2]
predictions = gnb.predict(example_instances)
probabilities = gnb.predict_proba(example_instances)

for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
    print(f"\nInstance {i+1}:")
    print(f"Prediction: {'Malignant' if pred == 0 else 'Benign'}")
    print(f"Probability Distribution: Malignant: {prob[0]:.3f}, Benign: {prob[1]:.3f}")

The points below explain why Gaussian Naive Bayes (GNB) is a reasonable classifier for this dataset and justify its use:

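1. *Gaussian Naive Bayes Justification:*

- The selected features (texture, area, concavity, concave points, symmetry) are continuous numerical values, which is the kind of input GNB models with a per-class normal distribution. The fit is only approximate: features such as area and concavity are right-skewed
- The naive assumption of feature independence holds only loosely here: concavity and concave points, for example, are strongly correlated. GNB often still classifies well when the assumption is violated, but its probability estimates can become overconfident
- GNB estimates only a mean and a variance per feature and class, so it works with small training sets, which are common in medical data
- It's computationally efficient and provides probabilistic predictions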
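
The distributional and independence assumptions can be checked directly rather than taken on trust. The following is a minimal sketch, assuming the same five feature indices as the lab code; the variable names `corr` and `skewness` are illustrative and not part of the original lab. It reports the sample skewness of each feature (0 for a symmetric distribution) and the pairwise Pearson correlations (0 for uncorrelated features).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
selected_features = [1, 3, 6, 7, 8]  # same indices as in the lab code
X = data.data[:, selected_features]
names = data.feature_names[selected_features]

# Pearson correlation between every pair of selected features
corr = np.corrcoef(X, rowvar=False)

# Sample skewness per feature: third central moment divided by std cubed
centered = X - X.mean(axis=0)
skewness = (centered ** 3).mean(axis=0) / X.std(axis=0) ** 3

for name, skew in zip(names, skewness):
    print(f"{name:22s} skewness = {skew:.2f}")

print("\nCorrelation matrix (rows/columns in the order above):")
print(np.round(corr, 2))
```

Large off-diagonal correlations or skewness values far from 0 do not rule out GNB, but they indicate that its probability outputs should be read with some caution.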

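2. *Advantages for this specific case:*

- Fast training and prediction times, suitable for real-time medical applications
- No missing-value handling is needed, because this dataset has no missing values. (scikit-learn's GaussianNB does not accept NaN inputs, so data with gaps would have to be imputed first)
- Less prone to overfitting than more complex models, since it has few parameters to estimate
- Provides probability estimates for classifications, which is useful in medical diagnosis, although these estimates may be poorly calibrated when features are correlated
- Works reasonably well with the selected features (texture, area, concavity, concave points, and symmetry)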
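
On the subject of missing values: in scikit-learn, `GaussianNB` raises an error when the input contains NaN, so gaps are usually filled before training. The sketch below is one possible approach, not part of the original lab. The missing entries are artificial (the dataset itself is complete), and the 5% rate, the median strategy, and the names `mask` and `raw_fit_failed` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

data = load_breast_cancer()
X = data.data[:, [1, 3, 6, 7, 8]].copy()
y = data.target

# Hypothetical scenario: blank out roughly 5% of the entries
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.05
X[mask] = np.nan

# Fitting GaussianNB directly on data with NaN fails
try:
    GaussianNB().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True
print("Direct fit on NaN data failed:", raw_fit_failed)

# Impute each feature's median first, then fit the classifier
model = make_pipeline(SimpleImputer(strategy="median"), GaussianNB())
model.fit(X, y)
probs = model.predict_proba(X[:2])
print(np.round(probs, 3))
```

Putting the imputer inside the pipeline means the medians are learned from the training data only, which matters once a train-test split is used.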

3. *Other Types of Naive Bayes Classifiers:*

- Multinomial NB: designed for discrete counts (not suitable here)
- Bernoulli NB: designed for binary/boolean features (not suitable here)
- Complement NB: a variant of Multinomial NB designed for imbalanced datasets; it also expects count-like, non-negative features, so it is at best a rough alternative for continuous measurements
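
A quick way to compare these variants is to cross-validate each one on the same five features. This is a rough sketch rather than a rigorous benchmark: the 5-fold setting and the `models` and `scores` names are assumptions, and the count-based variants run here only because all selected features are non-negative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

data = load_breast_cancer()
X = data.data[:, [1, 3, 6, 7, 8]]
y = data.target

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),  # binarizes features at 0.0 by default
    "ComplementNB": ComplementNB(),
}

# Mean accuracy over 5 stratified folds for each variant
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:14s} mean accuracy = {scores[name]:.3f}")
```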

The code implements:

1. Data loading and feature selection
2. Train-test split with 18% test size as specified
3. Model training and evaluation
4. Calculation of required metrics (accuracy, error rate, ROC AUC, recall)
5. Example predictions for two instances