ML Python Exercises UOM BDS Classification

CHAPTER 3: CLASSIFICATION

1. DECISION TREE CLASSIFIER

A decision tree classifier predicts a label by repeatedly splitting the data on feature values; here it learns from the salaries.csv examples whether a salary exceeds 100k given the company, job, and degree.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import tree

# Read the CSV file
df = pd.read_csv("salaries.csv")
print(df)

# Prepare inputs and target
inputs = df.drop('salary_more_then_100k', axis='columns')
target = df['salary_more_then_100k']

# Label encode categorical features
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
print(inputs)
inputs_n = inputs.drop(['company', 'job', 'degree'], axis='columns')

# Create and fit the decision tree model
model = tree.DecisionTreeClassifier()
model.fit(inputs_n, target)

# Print model score and make predictions
print("Model Score:", model.score(inputs_n, target))

# Is salary of Google, Computer Engineer, Bachelors degree > 100k?
print("Prediction for [2, 1, 0]:", model.predict([[2, 1, 0]]))

# Is salary of Google, Computer Engineer, Masters degree > 100k?
print("Prediction for [2, 1, 1]:", model.predict([[2, 1, 1]]))

Output:
Model Score: 1.0

Prediction for [2, 1, 0]: [0]

Prediction for [2, 1, 1]: [1]
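To see which categories the integer inputs such as [2, 1, 0] stand for, the fitted encoders can be inspected. A short follow-on sketch, reusing le_company, le_job, and le_degree from the program above (LabelEncoder numbers the classes alphabetically):

# Print the integer assigned to each category by each fitted encoder
for name, le in [('company', le_company), ('job', le_job), ('degree', le_degree)]:
    print(name, dict(zip(le.classes_, le.transform(le.classes_))))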

2. NAIVE BAYES CLASSIFICATION

The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. It is mainly used in text classification, which typically involves a high-dimensional training dataset.

The Bayesian method of calculating conditional probabilities is widely used in machine learning classification tasks. Naive Bayes classification simplifies Bayes' theorem by assuming the features are conditionally independent given the class, so the posterior reduces to P(C | x1, ..., xn) ∝ P(C) · P(x1 | C) · ... · P(xn | C); this assumption is what reduces computation time and cost.

Applications of the Naïve Bayes classifier:

• It is used for credit scoring.
• It is used in medical data classification.
• It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.

PROGRAM:

from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

age = ['youth', 'youth', 'middle-aged', 'senior', 'senior',
       'senior', 'middle-aged', 'youth', 'youth', 'senior', 'youth',
       'middle-aged', 'middle-aged', 'senior']
income = ['high', 'high', 'high', 'medium', 'low', 'low',
          'low', 'medium', 'low', 'medium', 'medium', 'medium',
          'high', 'medium']
student = ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes',
           'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no']
credit_rating = ['fair', 'excellent', 'fair', 'fair', 'fair',
                 'excellent', 'excellent', 'fair', 'fair', 'fair',
                 'excellent', 'excellent', 'fair', 'excellent']
buys_computer = ['no', 'no', 'yes', 'yes', 'yes', 'no',
                 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# Create a LabelEncoder object (each fit_transform call below refits it,
# so one object can encode every column in turn)
le = preprocessing.LabelEncoder()

# Convert string labels into numbers (classes are numbered alphabetically)
age_encoded = le.fit_transform(age)
print(age_encoded)
income_encoded = le.fit_transform(income)
print(income_encoded)
student_encoded = le.fit_transform(student)
print(student_encoded)
credit_encoded = le.fit_transform(credit_rating)
print(credit_encoded)

# Convert the target labels into numbers
label = le.fit_transform(buys_computer)
print(label)

# Combine age, income, student, and credit rating into a single list of tuples
features = list(zip(age_encoded, income_encoded, student_encoded, credit_encoded))

# Create a Gaussian Naive Bayes model
model = GaussianNB()

# Train the model using the training sets
model.fit(features, label)

# Predict output
predicted = model.predict([[2, 2, 1, 1]])  # 2: youth, 2: medium, 1: yes, 1: fair
print("Predicted Value:", predicted)

Output:
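Beyond the hard class prediction, the fitted model can also report posterior class probabilities; a brief follow-on sketch (an addition to the listing above, reusing its model object):

# Posterior probability of each class (0: 'no', 1: 'yes') for the same sample
print("Class probabilities:", model.predict_proba([[2, 2, 1, 1]]))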

3. MULTINOMIAL NAIVE BAYES CLASSIFICATION

Multinomial Naive Bayes is one of the most popular supervised learning classifiers for analyzing categorical text data. Text classification is gaining popularity because there is an enormous amount of information available in emails, documents, websites, etc. that needs to be analyzed.

Examples of categorical variables are race, sex, age group, and educational level. While the latter two could also be treated numerically, using exact values for age and highest grade completed, it is often more informative to group such variables into a relatively small number of categories.

This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different plant varieties. The dataset comprises 13 features (alcohol, malic_acid, ash, alcalinity_of_ash, magnesium, total_phenols, flavanoids, nonflavanoid_phenols, proanthocyanins, color_intensity, hue, od280/od315_of_diluted_wines, proline) and the type of wine plant variety as the target. The data has three types of wine: class_0, class_1, and class_2.
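Because the wine features are continuous measurements rather than counts, the listing below uses Gaussian Naive Bayes. For contrast, a minimal sketch of a true multinomial Naive Bayes on word-count features (the tiny corpus and labels here are illustrative, not from the manual):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: MultinomialNB expects count features such as word counts
docs = ["free prize money now", "meeting agenda attached",
        "win money free entry", "project schedule and agenda"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)  # sparse document-term count matrix

mnb = MultinomialNB()
mnb.fit(X_counts, labels)
print(mnb.predict(vec.transform(["free money"])))  # expected: ['spam']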

PROGRAM:

# Import scikit-learn dataset library
from sklearn import datasets

# Load dataset
wine = datasets.load_wine()

# Print the names of the 13 features
print("Features:", wine.feature_names)

# Print the label types of wine (class_0, class_1, class_2)
print("Labels:", wine.target_names)

# Print data (feature) shape
print(wine.data.shape)

# Print the wine data features (top 5 rows)
print(wine.data[:5])
print(wine.target)

# Import train/test split function from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set (70% training and 30% test)
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target,
                                                    test_size=0.3, random_state=109)

# Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian classifier
gnb = GaussianNB()

# Train the model using the training sets
gnb.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = gnb.predict(X_test)
print("Predicted Labels:", y_pred)

# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model accuracy: how accurate is the classifier?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred) * 100)

Output:

Classify 1: According to color.

Classify 2: According to the carbon dioxide pressure.

Classify 3: According to the sugar content.

4. LINEAR KERNEL

Kernels, also known as kernel methods or kernel functions, are a family of pattern-analysis algorithms: by implicitly mapping the data into a higher-dimensional space, they allow a linear classifier to solve a non-linear problem. SVM (Support Vector Machines) uses kernel methods in ML to solve classification and regression problems.

The linear kernel is used when the data is linearly separable, that is, when it can be separated by a single line (or hyperplane). It is one of the most common kernels and is mostly used when a dataset has a large number of features.
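To make concrete what the linear kernel computes, a minimal sketch (an illustration, not part of the manual's listing) comparing a plain dot product with scikit-learn's linear_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 4.0]])

# The linear kernel is simply the dot product: K(x, z) = <x, z>
print(np.dot(x, z.T))       # [[11.]]
print(linear_kernel(x, z))  # [[11.]]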
PROGRAM:

# Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.svm import SVC

# Assign column names to the dataset
column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Load dataset
df = pd.read_csv("iris.csv", names=column_names)

# Split dataset into features and target
X = df.drop('Class', axis=1)  # Features
y = df['Class']               # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create a support vector classifier with a linear kernel
clf = SVC(kernel='linear')

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Predict the classes on the test set
y_pred = clf.predict(X_test)
print(y_pred)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy * 100)

# Print classification report
print(classification_report(y_test, y_pred))

# Generate and display confusion matrix heatmap
cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
ax = sn.heatmap(cm, annot=True)
plt.show()

Output:
5. POLYNOMIAL KERNEL

The polynomial kernel, K(x, z) = (γ·⟨x, z⟩ + r)^d, lets the SVM fit curved (polynomial) decision boundaries; the degree d controls how flexible the boundary is.
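As a quick numeric illustration (an addition, not from the manual), the kernel value can be computed directly with scikit-learn's polynomial_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 4.0]])

# K(x, z) = (gamma * <x, z> + coef0) ** degree; here (1*11 + 1)**2 = 144
print(polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=1.0))  # [[144.]]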

PROGRAM:

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn

# Assign column names to the dataset
colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
df = pd.read_csv("iris.csv", names=colnames)

# Split dataset into features and target variable
X = df.drop('Class', axis=1)  # Features
y = df['Class']               # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Import support vector classifier from sklearn
from sklearn.svm import SVC
clf = SVC(kernel='poly', degree=8)  # Polynomial kernel with degree 8

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Print the predicted labels
print(y_pred)

# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy * 100)

# Print the classification report
print(classification_report(y_test, y_pred))

# Generate and display the confusion matrix as a heatmap
# (named cm to avoid shadowing the imported confusion_matrix function)
cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
ax = sn.heatmap(cm, annot=True)
plt.show()

Output:
6. RADIAL BASIS FUNCTION KERNEL

The radial basis function (RBF) kernel, K(x, z) = exp(-γ‖x − z‖²), measures similarity by distance: nearby points get kernel values close to 1 and distant points values close to 0. The gamma parameter controls how quickly that similarity falls off.
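As a quick numeric illustration (an addition, not from the manual), the kernel value can be computed with scikit-learn's rbf_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 4.0]])

# K(x, z) = exp(-gamma * ||x - z||^2); here ||x - z||^2 = 8, so exp(-0.8) ~ 0.449
print(rbf_kernel(x, z, gamma=0.1))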

PROGRAM:

# Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.svm import SVC

# Assign column names to the dataset
col_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Load the dataset
dataset = pd.read_csv("iris.csv", names=col_names)

# Separate features and target variable
X = dataset.drop('Class', axis=1)
y = dataset['Class']

# Split dataset into training set and test set (70% training and 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create a support vector classifier with an RBF kernel
clf = SVC(kernel='rbf', gamma=0.1)

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = clf.predict(X_test)

# Print the predicted labels and accuracy
print("Predicted labels:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Print the classification report
print(classification_report(y_test, y_pred))

# Generate and display confusion matrix as a heatmap
# (named cm to avoid shadowing the imported confusion_matrix function)
cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(cm, annot=True)
plt.show()
7. K-NEAREST NEIGHBOURS

The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point.

The make_blobs function from sklearn.datasets is used to generate a synthetic dataset with 500 samples, 2 features, 4 centers, and a cluster standard deviation of 1.5. The X variable contains the feature vectors, and the y variable contains the corresponding labels.

PROGRAM:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset of four blobs
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)

# Plot the raw data ('seaborn-v0_8' replaces the 'seaborn' style name removed in matplotlib 3.8)
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=y, marker='*', s=100, edgecolors='black')
plt.show()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit KNN classifiers with k=5 and k=1
knn5 = KNeighborsClassifier(n_neighbors=5)
knn1 = KNeighborsClassifier(n_neighbors=1)
knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)

# Predict on the test set with each model
y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)

from sklearn.metrics import accuracy_score
print("Accuracy with k=5", accuracy_score(y_test, y_pred_5) * 100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1) * 100)

# Plot the two sets of predictions side by side
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_5, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=5", fontsize=20)
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_1, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=1", fontsize=20)
plt.show()

Output:

Accuracy with k=5 93.60000000000001

Accuracy with k=1 90.4
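The gap between k=5 and k=1 suggests that k is worth tuning rather than guessing; a short sketch (an addition to the manual's listing) that selects k by 5-fold cross-validation on the same blobs data:

from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)

# Mean cross-validated accuracy for each candidate k, keeping the best
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print("Best k:", best_k, "mean CV accuracy:", scores[best_k])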


8. RANDOM FOREST

Random forest is an ensemble method: it builds decision trees on different samples of the data and takes their majority vote for classification (or their average in the case of regression).

PROGRAM:

# Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn

# Assign column names to the dataset
colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Load dataset (passing names= so the first data row is not consumed as a header,
# matching the earlier iris exercises)
df = pd.read_csv("iris.csv", names=colnames)

# Split dataset into features and target
X = df.drop('Class', axis=1)  # Features
y = df['Class']               # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create the random forest classifier and fit the model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Print predictions and accuracy score
print("Predictions:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred) * 100)

# Print classification report and confusion matrix
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Generate heatmap and display it
plt.figure(figsize=(8, 6))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
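Because a random forest aggregates many trees, the fitted model can also report how much each feature contributed to its splits; a short follow-on sketch (reusing clf and X from the program above) via the feature_importances_ attribute:

# Rank the iris features by their importance in the fitted forest
importances = sorted(zip(X.columns, clf.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print(f"{name}: {score:.3f}")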
