Lab Manual
II yr/IV Sem
EX.NO:1 IMPLEMENTATION OF FIND-S ALGORITHM
AIM:
To Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis
based on a given set of training data samples. Read the training data from a .CSV file.
ALGORITHM:
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
       For each attribute constraint ai in h
           If the constraint ai is satisfied by x, then do nothing
           Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
Training Examples:
sky airtemp humidity wind water forecast enjoysport
sunny warm normal strong warm same yes
sunny warm high strong warm same yes
rainy cold high strong warm change no
sunny warm high strong cool change yes
Program:
import csv

num_attributes = 6
a = []
print("\n The Given Training Data Set \n")

with open('C:\\New folder\\enjoysport.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        a.append(row)
        print(row)

print("\n The initial value of hypothesis: ")
hypothesis = ['0'] * num_attributes
print(hypothesis)

for j in range(0, num_attributes):
    hypothesis[j] = a[1][j]

print("\n Find S: Finding a Maximally Specific Hypothesis\n")
for i in range(0, len(a)):
    if a[i][num_attributes] == 'yes':
        for j in range(0, num_attributes):
            if a[i][j] != hypothesis[j]:
                hypothesis[j] = '?'
            else:
                hypothesis[j] = a[i][j]
    print(" For Training instance No:{0} the hypothesis is ".format(i), hypothesis)

print("\n The Maximally Specific Hypothesis for a given Training Examples :\n")
print(hypothesis)
OUTPUT:
The Given Training Data Set
['sunny', 'warm', 'normal', 'strong', 'warm', 'same', 'yes']
['sunny', 'warm', 'high', 'strong', 'warm', 'same', 'yes']
['rainy', 'cold', 'high', 'strong', 'warm', 'change', 'no']
['sunny', 'warm', 'high', 'strong', 'cool', 'change', 'yes']
The initial value of hypothesis:
['0', '0', '0', '0', '0', '0']
Find S: Finding a Maximally Specific Hypothesis
For Training Example No:0 the hypothesis is : ['sunny', 'warm', 'normal', 'strong', 'warm', 'same']
For Training Example No:1 the hypothesis is : ['sunny', 'warm', '?', 'strong', 'warm', 'same']
For Training Example No:2 the hypothesis is : ['sunny', 'warm', '?', 'strong', 'warm', 'same']
For Training Example No:3 the hypothesis is : ['sunny', 'warm', '?', 'strong', '?', '?']
The Maximally Specific Hypothesis for a given Training Examples:
['sunny', 'warm', '?', 'strong', '?', '?']
RESULT:
Thus the Python program to implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis from a given set of training data samples has been written and executed successfully.
EX.NO:2 IMPLEMENTATION OF CANDIDATE ELIMINATION ALGORITHM
AIM:
To implement and demonstrate the Candidate-Elimination algorithm for a given set of training data
examples stored in a .CSV file and to output a description of the set of all hypotheses consistent with the
training examples.
ALGORITHM:
1. Initialize G to the set of maximally general hypotheses in H, and S to the set of maximally specific hypotheses in H.
2. For each training example d:
   • If d is a positive example: remove from G any hypothesis inconsistent with d; for each hypothesis s in S that is not consistent with d, remove s from S and add to S all minimal generalizations of s that are consistent with d and have some member of G more general than them; remove from S any hypothesis that is more general than another hypothesis in S.
   • If d is a negative example: remove from S any hypothesis inconsistent with d; for each hypothesis g in G that is not consistent with d, remove g from G and add to G all minimal specializations of g that are consistent with d and have some member of S more specific than them; remove from G any hypothesis that is less general than another hypothesis in G.
3. Output the sets S and G, which bound the version space of all hypotheses consistent with the training examples.
PROGRAM:
import numpy as np
import pandas as pd

data = pd.read_csv("C:\\Users\\sride\\OneDrive\\Desktop\\dataset.csv")
concepts = np.array(data.iloc[:, 0:-1])
print("\nInstances are:\n", concepts)
target = np.array(data.iloc[:, -1])
print("\nTarget Values are: ", target)

def learn(concepts, target):
    # start from the first instance as the specific boundary and the
    # all-'?' hypotheses as the general boundary
    specific_h = concepts[0].copy()
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    for i, h in enumerate(concepts):
        print("\nInstance", i + 1, "is ", h)
        if target[i] == "yes":
            print("Instance is Positive ")
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        if target[i] == "no":
            print("Instance is Negative ")
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print("Specific Boundary: ", specific_h)
        print("Generic Boundary: ", general_h)
    indices = [i for i, val in enumerate(general_h) if val == ['?', '?', '?', '?', '?', '?']]
    for i in indices:
        general_h.remove(['?', '?', '?', '?', '?', '?'])
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:", s_final, sep="\n")
print("Final General_h:", g_final, sep="\n")
OUTPUT:
Generic Boundary: [['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'],
['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'],
['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?']]
Final Specific_h:
['sunny' 'warm' '?' 'Strong' '?' '?']
Final General_h:
[['sunny', '?', '?', '?', '?', '?'], ['?', 'warm', '?', '?', '?', '?']]
RESULT:
Thus the Candidate-Elimination algorithm has been implemented for a given set of training data examples stored in a .CSV file, and a description of the set of all hypotheses consistent with the training examples has been obtained.
EX.NO:3 ID3 ALGORITHM USING DECISION TREE
AIM:
To write a python program to demonstrate the working of the decision tree based ID3
algorithm, using an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
ALGORITHM:
ID3(Examples, Target_attribute, Attributes)
Examples are the training examples. Target_attribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision tree that correctly classifies the given Examples.
Create a Root node for the tree
If all Examples are positive, return the single-node tree Root with label = +
If all Examples are negative, return the single-node tree Root with label = −
If Attributes is empty, return the single-node tree Root with label = most common value of Target_attribute in Examples
Otherwise Begin
    A ← the attribute from Attributes that best* classifies Examples
    The decision attribute for Root ← A
    For each possible value, vi, of A:
        Add a new tree branch below Root, corresponding to the test A = vi
        Let Examples_vi be the subset of Examples that have value vi for A
        If Examples_vi is empty
            Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
        Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes − {A})
End
Return Root
DESCRIPTION:
The best attribute is the one with highest information gain
ENTROPY:
Entropy measures the impurity of a collection of examples.
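For a Boolean-valued target, the entropy of a collection S containing a proportion p⊕ of positive and p⊖ of negative examples, and the information gain of an attribute A, are the standard quantities the program below computes:

Entropy(S) = − p⊕ log₂ p⊕ − p⊖ log₂ p⊖
Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) ( |Sv| / |S| ) · Entropy(Sv)

where Sv is the subset of S for which attribute A has the value v.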
Dataset:
The PlayTennis dataset is saved as a .csv (comma separated values) file in the current working directory; otherwise use the complete path of the dataset in the program:
TRAINING DATASET:
Outlook Temperature Humidity Wind PlayTennis
Program:
import math

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

def subtables(data, col, delete):
    dic = {}
    coldata = [row[col] for row in data]
    attr = list(set(coldata))
    counts = [0] * len(attr)
    r = len(data)
    c = len(data[0])
    for x in range(len(attr)):
        for y in range(r):
            if data[y][col] == attr[x]:
                counts[x] += 1
    for x in range(len(attr)):
        dic[attr[x]] = [[0 for i in range(c)] for j in range(counts[x])]
        pos = 0
        for y in range(r):
            if data[y][col] == attr[x]:
                if delete:
                    del data[y][col]
                dic[attr[x]][pos] = data[y]
                pos += 1
    return attr, dic

def entropy(S):
    attr = list(set(S))
    if len(attr) == 1:          # the collection is pure
        return 0
    counts = [0, 0]
    for i in range(2):
        counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
    sums = 0
    for cnt in counts:
        sums += -1 * cnt * math.log(cnt, 2)
    return sums

def compute_gain(data, col):
    attr, dic = subtables(data, col, delete=False)
    total_size = len(data)
    entropies = [0] * len(attr)
    ratio = [0] * len(attr)
    total_entropy = entropy([row[-1] for row in data])
    for x in range(len(attr)):
        ratio[x] = len(dic[attr[x]]) / (total_size * 1.0)
        entropies[x] = entropy([row[-1] for row in dic[attr[x]]])
        total_entropy -= ratio[x] * entropies[x]
    return total_entropy

def build_tree(data, features):
    lastcol = [row[-1] for row in data]
    if len(set(lastcol)) == 1:          # all examples carry the same label: leaf node
        node = Node("")
        node.answer = lastcol[0]
        return node
    n = len(data[0]) - 1
    gains = [0] * n
    for col in range(n):
        gains[col] = compute_gain(data, col)
    split = gains.index(max(gains))     # attribute with the highest information gain
    node = Node(features[split])
    fea = features[:split] + features[split + 1:]
    attr, dic = subtables(data, split, delete=True)
    for x in range(len(attr)):
        child = build_tree(dic[attr[x]], fea)
        node.children.append((attr[x], child))
    return node

def print_tree(node, level):
    if node.answer != "":
        print("    " * level, node.answer)
        return
    print("    " * level, node.attribute)
    for value, child in node.children:
        print("    " * (level + 1), value)
        print_tree(child, level + 2)

# load_csv(), the construction of node1 from the training CSV and the classify()
# routine used to label the test instances belong to the full program.
print("The decision tree for the dataset using ID3 algorithm is")
print_tree(node1, 0)
testdata, features = load_csv("c:\\New folder\\id3_test_1.csv")
OUTPUT:
The decision tree for the dataset using ID3 algorithm is
 Outlook
     rain
         Wind
             weak
                 yes
             strong
                 no
     sunny
         Humidity
             high
                 no
             normal
                 yes
     overcast
         yes

The test instance: ['rain', 'cool', 'normal', 'strong']
The label for test instance: no
The test instance: ['sunny', 'mild', 'normal', 'strong']
The label for test instance: yes
RESULT:
Thus the Python program to demonstrate the working of the decision tree based ID3 algorithm, using an appropriate data set for building the decision tree, has been implemented and executed successfully.
EX.NO.4 ARTIFICIAL NEURAL NETWORK USING BACKPROPAGATION ALGORITHM
AIM:
To Build an Artificial Neural Network by implementing the Backpropagation algorithm and
test the same using appropriate data sets.
ALGORITHM:
BACKPROPAGATION Algorithm
Each training example is a pair of the form (x⃗, t⃗), where x⃗ is the vector of network input values and t⃗ is the vector of target network output values. ƞ is the learning rate, nin the number of network inputs, nhidden the number of hidden units and nout the number of output units.
BACKPROPAGATION (training_examples, ƞ, nin, nout, nhidden)
For each (x⃗, t⃗) in training_examples, do:
1. Input the instance x⃗ to the network and compute the output ou of every unit u in the network.
2. For each output unit k, calculate its error term δk ← ok (1 − ok) (tk − ok)
3. For each hidden unit h, calculate its error term δh ← oh (1 − oh) Σ (k ∈ outputs) wkh δk
4. Update each network weight wji ← wji + Δwji, where Δwji = ƞ δj xji
Training Examples:

Example   Sleep   Study   Expected % in Exams
1         2       9       92
2         1       5       86
3         3       6       89

Normalize the input:

Example   Sleep              Study              Expected % in Exams
1         2/3 = 0.66666667   9/9 = 1            0.92
2         1/3 = 0.33333333   5/9 = 0.55555556   0.86
3         3/3 = 1            6/9 = 0.66666667   0.89
PROGRAM:
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)   # two inputs [sleep, study]
y = np.array(([92], [86], [89]), dtype=float)         # one output [Expected % in Exams]
X = X / np.amax(X, axis=0)   # normalise each input column by its maximum
y = y / 100

# Sigmoid Function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Variable initialization
epoch = 5000                 # number of training iterations
lr = 0.1                     # learning rate
inputlayer_neurons = 2       # number of features in the data set
hiddenlayer_neurons = 3      # number of hidden layer neurons
output_neurons = 1           # number of neurons in the output layer
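# The listing omits the derivative of the sigmoid and the random initialisation of the
# weights and biases (wh, bh, wout, bout) used in the forward pass below; a typical
# reconstruction (a sketch, not necessarily the manual's exact code) is:
def derivatives_sigmoid(x):
    return x * (1 - x)

wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))   # input -> hidden weights
bh = np.random.uniform(size=(1, hiddenlayer_neurons))                    # hidden-layer bias
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))     # hidden -> output weights
bout = np.random.uniform(size=(1, output_neurons))                       # output-layer bias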
# Forward Propagation
hinp1 = np.dot(X, wh)
hinp = hinp1 + bh
hlayer_act = sigmoid(hinp)
outinp1 = np.dot(hlayer_act, wout)
outinp = outinp1 + bout
output = sigmoid(outinp)

# Backpropagation
EO = y - output
outgrad = derivatives_sigmoid(output)
d_output = EO * outgrad
EH = d_output.dot(wout.T)
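The listing ends here; in the full program the forward and backward passes above run inside a loop of epoch iterations, followed by the weight updates and the final prints. A minimal sketch of that remainder, using the names already in the listing (indent the forward and backward passes under the loop), is:

for i in range(epoch):
    # ... forward propagation and backpropagation steps shown above ...
    hiddengrad = derivatives_sigmoid(hlayer_act)   # gradient at the hidden layer
    d_hiddenlayer = EH * hiddengrad
    wout += hlayer_act.T.dot(d_output) * lr        # update hidden -> output weights
    wh += X.T.dot(d_hiddenlayer) * lr              # update input -> hidden weights

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)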
OUTPUT:
Input:
[[0.66666667 1.        ]
 [0.33333333 0.55555556]
 [1.         0.66666667]]
Predicted Output:
[[0.9064192 ]
 [0.8920576 ]
 [0.91030512]]
RESULT:
Thus an Artificial Neural Network using the Backpropagation algorithm has been built in Python and tested using an appropriate data set.
EX.NO:5 LOCALLY WEIGHTED REGRESSION ALGORITHM
DATE:
AIM:
To implement the non-parametric Locally Weighted Regression algorithm in order to fit
datapoints. Select appropriate data set for your experiment and draw graphs.
Regression:
Regression is a technique from statistics that is used to predict values of a desired target quantity
when the target quantity is continuous.
In regression, we seek to identify (or estimate) a continuous variable y associated with
a given input vector x.
y is called the dependent variable.
x is called the independent variable.
Loess/Lowess Regression:
Loess regression is a nonparametric technique that uses local weighted regression to fit a
smooth curve through points in a scatter plot.
Lowess Algorithm:
Locally weighted regression is a very powerful nonparametric model used in statistical
learning.
Given a dataset X, y, we attempt to find a model parameter β(x) that minimizes
residual sum of weighted squared errors.
The weights are given by a kernel function (k or w) which can be chosen arbitrarily
Algorithm
1. Read the Given data Sample to X and the curve (linear or non linear) to Y
2. Set the value for Smoothening parameter or Free parameter say τ
3. Set the bias /Point of interest set x0 which is a subset of X
4. Determine the weight matrix using: wᵢ = exp( −(xᵢ − x₀)² / (2τ²) ), W = diag(wᵢ)
5. Determine the model parameter β using: β = (XᵀWX)⁻¹ XᵀWy
6. Prediction = x₀ · β
DATASET (5a):
YearsExperience   Salary
1.1 3
1.3 4
2.9 5
3 6
3.2 5
4 5
4.1 5
5.1 6
6 9
6.8 9
7.1 9
8.7 1
9.5 1
10.5 1
PROGRAM (5A):
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

dataset = pd.read_csv('C:\\New folder\\salary_data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.scatter(x_test, y_test, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Salary vs Experience (Test data)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
OUTPUT:
PROGRAM(5B) :
import numpy as np
from bokeh.plotting import figure, show, output_notebook
from bokeh.layouts import gridplot
from bokeh.io import push_notebook

n = 1000
# generate dataset
X = np.linspace(-3, 3, num=n)
print("The Data Set (10 Samples) X :\n", X[1:10])
Y = np.log(np.abs(X ** 2 - 1) + .5)
print("The Fitting Curve Data Set (10 Samples) Y :\n", Y[1:10])
# jitter X
X += np.random.normal(scale=.1, size=n)
print("Normalised (10 Samples) X :\n", X[1:10])
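# The listing calls plot_lwr() below without showing it or the local_regression()
# routine it relies on. A minimal sketch of those two functions, assuming the usual
# kernel-weighted least-squares formulation (the bokeh figure sizing arguments are
# illustrative and may need adjusting to your bokeh version), is:
def radial_kernel(x0, X, tau):
    # Gaussian weights centred on the query point x0
    return np.exp(np.sum((X - x0) ** 2, axis=1) / (-2 * tau * tau))

def local_regression(x0, X, Y, tau):
    # add the bias term, solve the weighted least-squares problem, predict at x0
    x0 = np.r_[1, x0]
    X1 = np.c_[np.ones(len(X)), X]
    xw = X1.T * radial_kernel(x0, X1, tau)
    beta = np.linalg.pinv(xw @ X1) @ xw @ Y
    return x0 @ beta

def plot_lwr(tau):
    # fit the curve over the domain for one value of the smoothing parameter tau
    domain = np.linspace(-3, 3, num=300)
    prediction = [local_regression(x0, X, Y, tau) for x0 in domain]
    plot = figure(width=400, height=400, title='tau=%g' % tau)
    plot.scatter(X, Y, alpha=.3)
    plot.line(domain, prediction, line_width=2, color='red')
    return plot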
show(gridplot([[plot_lwr(10.), plot_lwr(1.)],
               [plot_lwr(0.1), plot_lwr(0.01)]]))
OUTPUT:
The Data Set (10 Samples) X :
[-2.99399399 -2.98798799 -2.98198198 -2.97597598 -2.96996997 -2.96396396
-2.95795796 -2.95195195 -2.94594595]
EX.NO:6 IMPLEMENTATION OF THE NAÏVE BAYESIAN CLASSIFIER
AIM:
To write a Python program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file and to compute the accuracy of the classifier, considering a few test data sets.
DESCRIPTION:
Bayes' Theorem is stated as:

    P(h|D) = P(D|h) · P(h) / P(D)

Where,
P(h|D) is the probability of hypothesis h given the data D. This is called the posterior
probability.
P(D|h) is the probability of the data D given that the hypothesis h was true.
P(h) is the probability of hypothesis h being true. This is called the prior probability of h.
P(D) is the probability of the data. This is called the prior probability of D
After calculating the posterior probability for a number of different hypotheses h, we are interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
DATASET:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age diabetes
6 148 72 35 0 33.6 0.627 50 1
1 85 66 29 0 26.6 0.351 31 0
8 183 64 0 0 23.3 0.672 32 1
1 89 66 23 94 28.1 0.167 21 0
0 137 40 35 168 43.1 2.288 33 1
5 116 74 0 0 25.6 0.201 30 0
3 78 50 32 88 31 0.248 26 1
10 115 0 0 0 35.3 0.134 29 0
2 197 70 45 543 30.5 0.158 53 1
PROGRAM:
import csv
import random
import math
def loadcsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        # converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def separatebyclass(dataset):
    separated = {}  # dictionary of classes 1 and 0
    # creates a dictionary of classes 1 and 0 where the values are
    # the instances belonging to each class
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarizebyclass(dataset):
    separated = separatebyclass(dataset)
    # print(separated)
    summaries = {}
    for classvalue, instances in separated.items():
        # summaries is a dict of (mean, std) tuples for each class value
        summaries[classvalue] = summarize(instances)  # summarize is used to calculate mean and std
    return summaries

def main():
    filename = 'C:\\New folder\\naivedata1.csv'
    splitratio = 0.67
    dataset = loadcsv(filename)

main()
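The listing above stops before the helpers that main() would need to finish the experiment (the train/test split, the per-class attribute summaries, the Gaussian likelihood, prediction and accuracy). A minimal sketch of those missing pieces in the same naming style is given below; the exact bodies in the original program may differ.

def splitdataset(dataset, splitratio):
    # split the dataset into a training set and a test set at the given ratio
    trainsize = int(len(dataset) * splitratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    return [trainset, copy]

def summarize(instances):
    # (mean, stdev) of every attribute column, excluding the class column
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*instances)]
    del summaries[-1]
    return summaries

def calculateprobability(x, mean_, stdev_):
    # Gaussian probability density of x under N(mean_, stdev_^2)
    exponent = math.exp(-(pow(x - mean_, 2) / (2 * pow(stdev_, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev_)) * exponent

def calculateclassprobabilities(summaries, inputvector):
    probabilities = {}
    for classvalue, classsummaries in summaries.items():
        probabilities[classvalue] = 1
        for i in range(len(classsummaries)):
            m, s = classsummaries[i]
            probabilities[classvalue] *= calculateprobability(inputvector[i], m, s)
    return probabilities

def predict(summaries, inputvector):
    probabilities = calculateclassprobabilities(summaries, inputvector)
    return max(probabilities, key=probabilities.get)

def getpredictions(summaries, testset):
    return [predict(summaries, testset[i]) for i in range(len(testset))]

def getaccuracy(testset, predictions):
    correct = sum(1 for i in range(len(testset)) if testset[i][-1] == predictions[i])
    return (correct / float(len(testset))) * 100.0

With these in place, main() would continue after loading the data with something like: trainingset, testset = splitdataset(dataset, splitratio); summaries = summarizebyclass(trainingset); predictions = getpredictions(summaries, testset); and finally print getaccuracy(testset, predictions).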
OUTPUT:
EX.NO:7 DOCUMENT CLASSIFICATION USING THE NAÏVE BAYESIAN CLASSIFIER
AIM:
To classify a set of documents using the naïve Bayesian Classifier model and to calculate the accuracy, precision, and recall for the sample data set.
ALGORITHM:
Naive Bayes algorithms for learning and classifying text
LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible
target values. This function learns the probability terms P(wk |vj,), describing the probability
that a randomly drawn word from a document in class vj will be the English word wk. It
also learns the class prior probabilities P(vj).
1. Collect all words, punctuation, and other tokens that occur in Examples
Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms
For each target value vj in V do
• docsj ← the subset of documents from Examples for which the target value is vj
• P(vj) ← | docsj | / |Examples|
• Textj ← a single document created by concatenating all members of docsj
• n ← total number of distinct word positions in Textj
• for each word wk in Vocabulary
nk ← number of times word wk occurs in Textj
P(wk|vj) ← ( nk + 1) / (n + | Vocabulary| )
CLASSIFY_NAIVE_BAYES_TEXT (Doc)
Return the estimated target value for the document Doc. ai denotes the word found in the
ith position within Doc.
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where
    vNB = argmax (vj ∈ V) P(vj) Π (i ∈ positions) P(ai|vj)
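As a concrete illustration of these two procedures, a compact sketch in Python (the function names here are illustrative, not from the manual) trained on the movie-review documents used later in this section is given below. Because it counts word positions directly from the raw text, individual probability values can differ slightly from the hand-worked figures in the DESCRIPTION section, but the predicted class for the test document is the same.

import math
from collections import Counter

def learn_naive_bayes_text(examples, classes):
    # examples is a list of (document, label) pairs; classes is the set of labels V
    vocabulary = set(w for doc, _ in examples for w in doc.lower().split())
    priors, word_probs = {}, {}
    for vj in classes:
        docs_j = [doc for doc, label in examples if label == vj]
        priors[vj] = len(docs_j) / len(examples)                      # P(vj)
        words_j = " ".join(docs_j).lower().split()                    # Text_j
        n = len(words_j)
        counts = Counter(words_j)
        word_probs[vj] = {w: (counts[w] + 1) / (n + len(vocabulary))  # (nk + 1) / (n + |Vocabulary|)
                          for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    # keep only the positions whose tokens appear in the vocabulary
    words = [w for w in doc.lower().split() if w in vocabulary]
    scores = {}
    for vj in priors:
        scores[vj] = math.log(priors[vj]) + sum(math.log(word_probs[vj][w]) for w in words)
    return max(scores, key=scores.get)

examples = [("I loved the movie", "pos"), ("I hated the movie", "neg"),
            ("a great movie good movie", "pos"), ("poor acting", "neg"),
            ("great acting good movie", "pos")]
vocab, priors, probs = learn_naive_bayes_text(examples, ["pos", "neg"])
print(classify_naive_bayes_text("I hated the poor acting", vocab, priors, probs))  # prints 'neg'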
DATASET:
Text documents Label
I love this sandwich pos
This is an amazing place pos
I feel very good about these beers pos
This is my best work pos
What an awesome view pos
I do not like this restaurant neg
I am tired of this stuff neg
I can't deal with this neg
He is my sworn enemy neg
My boss is horrible neg
This is an awesome place pos
I do not like the taste of this juice neg
I love to dance pos
I am sick and tired of this place neg
What a great holiday pos
That is a bad locality to stay neg
We will have good fun tomorrow pos
I went to my enemy's house today neg
PROGRAM:
import pandas as pd

msg = pd.read_csv('naivetext.csv', names=['message', 'label'])
print('The dimensions of the dataset', msg.shape)
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum
print(X)
print(y)
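The program continues beyond this point in the full manual; a typical continuation that produces the accuracy, precision and recall asked for in the aim (a sketch assuming scikit-learn) is:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

xtrain, xtest, ytrain, ytest = train_test_split(X, y)
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)   # document-term matrix for the training text
xtest_dtm = count_vect.transform(xtest)

clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)

print('Accuracy of the classifier is', metrics.accuracy_score(ytest, predicted))
print('Confusion matrix:\n', metrics.confusion_matrix(ytest, predicted))
print('Recall:', metrics.recall_score(ytest, predicted))
print('Precision:', metrics.precision_score(ytest, predicted))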
OUTPUT:
The dimensions of the dataset (18, 2)
0     I love this sandwich
1     This is an amazing place
2     I feel very good about these beers
3     This is my best work
4     What an awesome view
5     I do not like this restaurant
6     I am tired of this stuff
7     I can't deal with this
8     He is my sworn enemy
9     My boss is horrible
10    This is an awesome place
11    I do not like the taste of this juice
12    I love to dance
13    I am sick and tired of this place
14    What a great holiday
15    That is a bad locality to stay
16    We will have good fun tomorrow
17    I went to my enemy's house today
Name: message, dtype: object
0     1
1     1
2     1
3     1
4     1
5     0
6     0
7     0
8     0
9     0
10    1
11    0
12    1
13    0
14    1
15    0
16    1
17    0
Name: labelnum, dtype: int64
DESCRIPTION:
Confusion Matrix
True positives: data points labelled as positive that are actually positive
False positives: data points labelled as positive that are actually negative
True negatives: data points labelled as negative that are actually negative
False negatives: data points labelled as negative that are actually positive
Accuracy: how often is the classifier correct?
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)      Recall = TP / (TP + FN)
Example: Movie Review
Doc   Text                          Class
1     I loved the movie             +
2     I hated the movie             -
3     a great movie. good movie     +
4     poor acting                   -
5     great acting. good movie      +

Unique words:
< I, loved, the, movie, hated, a, great, good, poor, acting >
Doc   I   loved   the   movie   hated   a   great   good   poor   acting   Class
1     1     1      1      1                                                  +
2     1             1      1      1                                          -
3                          2              1     1      1                     +
4                                                              1       1     -
5                          1                    1      1               1     +
Documents with class (+):

Doc   I   loved   the   movie   hated   a   great   good   poor   acting   Class
1     1     1      1      1                                                  +
3                          2              1     1      1                     +
5                          1                    1      1               1     +

P(+) = 3 / 5 = 0.6

Using P(wk|vj) = (nk + 1) / (n + |Vocabulary|) with n = 14 and |Vocabulary| = 10:

P(I | +)      = (1 + 1) / (14 + 10) = 0.0833
P(loved | +)  = (1 + 1) / (14 + 10) = 0.0833
P(the | +)    = (1 + 1) / (14 + 10) = 0.0833
P(movie | +)  = (4 + 1) / (14 + 10) = 0.2083
P(hated | +)  = (0 + 1) / (14 + 10) = 0.0416
P(a | +)      = (1 + 1) / (14 + 10) = 0.0833
P(great | +)  = (2 + 1) / (14 + 10) = 0.125
P(good | +)   = (2 + 1) / (14 + 10) = 0.125
P(poor | +)   = (0 + 1) / (14 + 10) = 0.0416
P(acting | +) = (1 + 1) / (14 + 10) = 0.0833
Documents with class (−):

Doc   I   loved   the   movie   hated   a   great   good   poor   acting   Class
2     1             1      1      1                                          -
4                                                              1       1     -

P(−) = 2 / 5 = 0.4

Using P(wk|vj) = (nk + 1) / (n + |Vocabulary|) with n = 6 and |Vocabulary| = 10:

P(I | −)      = (1 + 1) / (6 + 10) = 0.125
P(loved | −)  = (0 + 1) / (6 + 10) = 0.0625
P(the | −)    = (1 + 1) / (6 + 10) = 0.125
P(movie | −)  = (1 + 1) / (6 + 10) = 0.125
P(hated | −)  = (1 + 1) / (6 + 10) = 0.125
P(a | −)      = (0 + 1) / (6 + 10) = 0.0625
P(great | −)  = (0 + 1) / (6 + 10) = 0.0625
P(good | −)   = (0 + 1) / (6 + 10) = 0.0625
P(poor | −)   = (1 + 1) / (6 + 10) = 0.125
P(acting | −) = (1 + 1) / (6 + 10) = 0.125
Let us classify the new document
    I hated the poor acting
If vj = + then:
    P(+) P(I | +) P(hated | +) P(the | +) P(poor | +) P(acting | +)
    = 0.6 * 0.0833 * 0.0416 * 0.0833 * 0.0416 * 0.0833
    = 6.03 × 10⁻⁷
If vj = − then:
    P(−) P(I | −) P(hated | −) P(the | −) P(poor | −) P(acting | −)
    = 0.4 * 0.125 * 0.125 * 0.125 * 0.125 * 0.125
    = 1.22 × 10⁻⁵
Since 1.22 × 10⁻⁵ > 6.03 × 10⁻⁷, the new document belongs to the ( − ) class.
RESULT:
Thus the set of documents has been classified using the naïve Bayesian Classifier model and the
accuracy, precision, and recall for the sample data set has been calculated.
EX.No:8 CONSTRUCTION OF BAYESIAN NETWORK
AIM:
To Write a Python program to construct a Bayesian network considering medical data and use
this model to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set.
DESCRIPTION:
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions
• The directed acyclic graph is a set of random variables represented by nodes.
• The conditional probability distribution of a node (random variable) is defined for every
possible outcome of the preceding causal node(s).
For illustration, consider the following example. Suppose we attempt to turn on our computer, but the
computer does not start (observation/evidence). We would like to know which of the possible causes of
computer failure is more likely. In this simplified illustration, we assume only two possible causes of
this misfortune: electricity failure and computer malfunction.
The corresponding directed acyclic graph is depicted in below figure.
Fig: Directed acyclic graph representing two independent possible causes of a computer
failure. The goal is to calculate the posterior conditional probability distribution of each of the
possible unobserved causes given the observed evidence, i.e. P [Cause | Evidence].
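A small sketch of this computer-failure illustration using pgmpy (the same library used in the program below) is given here; the node names and all probability values are made up purely for illustration, and newer pgmpy releases name the model class BayesianNetwork rather than BayesianModel.

from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# two independent causes, one observed effect
model = BayesianModel([('Electricity_Failure', 'Computer_Does_Not_Start'),
                       ('Computer_Malfunction', 'Computer_Does_Not_Start')])

# prior probabilities of the two causes (illustrative values)
cpd_elec = TabularCPD('Electricity_Failure', 2, [[0.9], [0.1]])
cpd_malf = TabularCPD('Computer_Malfunction', 2, [[0.95], [0.05]])

# P(Computer_Does_Not_Start | Electricity_Failure, Computer_Malfunction)
cpd_fail = TabularCPD('Computer_Does_Not_Start', 2,
                      [[0.99, 0.10, 0.20, 0.01],    # row: computer starts normally
                       [0.01, 0.90, 0.80, 0.99]],   # row: computer does not start
                      evidence=['Electricity_Failure', 'Computer_Malfunction'],
                      evidence_card=[2, 2])

model.add_cpds(cpd_elec, cpd_malf, cpd_fail)
infer = VariableElimination(model)

# P[Cause | Evidence]: probability of each cause given that the computer did not start
print(infer.query(variables=['Electricity_Failure'],
                  evidence={'Computer_Does_Not_Start': 1}))
print(infer.query(variables=['Computer_Malfunction'],
                  evidence={'Computer_Does_Not_Start': 1}))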
DATA SET:
age sex cp trestbps chol fbs restecg Thalach exang oldpeak slope ca thal heartdisease
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
62 0 4 140 268 0 2 160 0 3.6 3 2 3 3
57 0 4 120 354 0 0 163 1 0.6 1 0 3 0
63 1 4 130 254 0 2 147 0 1.4 2 1 7 2
Attribute Information:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
• Value 1: typical angina
• Value 2: atypical angina
• Value 3: non-anginal pain
• Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholestoral in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
• Value 0: normal
• Value 1: having ST-T wave abnormality (T wave inversions and/or ST
elevation or depression of > 0.05 mV)
• Value 2: showing probable or definite left ventricular hypertrophy by Estes'
criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
• Value 1: upsloping
• Value 2: flat
• Value 3: downsloping
12. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
13. heartdisease: integer valued from 0 (no presence) to 4
PROGRAM:
import numpy as np
import pandas as pd
import csv
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination

# read Cleveland Heart Disease data
heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?', np.nan)

# display the data
print('Sample instances from the dataset are given below')
print(heartDisease.head())

# display the attribute names and datatypes
print('\n Attributes and datatypes')
print(heartDisease.dtypes)

# Create Model - Bayesian Network
model = BayesianModel([('age', 'heartdisease'), ('sex', 'heartdisease'),
                       ('exang', 'heartdisease'), ('cp', 'heartdisease'),
                       ('heartdisease', 'restecg'), ('heartdisease', 'chol')])

# Learning CPDs using Maximum Likelihood Estimators
print('\n Learning CPD using Maximum likelihood estimators')
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

# Inferencing with Bayesian Network
print('\n Inferencing with Bayesian Network:')
HeartDiseasetest_infer = VariableElimination(model)

# computing the Probability of HeartDisease given restecg
print('\n 1.Probability of HeartDisease given evidence=restecg :2')
q1 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'restecg': 2})
print(q1)

# computing the Probability of HeartDisease given cp
print('\n 2.Probability of HeartDisease given evidence= cp:2')
q2 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'cp': 2})
print(q2)
OUTPUT:
Attributes and datatypes
age int64
sex int64
cp int64
trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca int64
thal int64
heartdisease int64
dtype: object
EX.NO:9 CLUSTERING USING THE EM ALGORITHM AND THE K-MEANS ALGORITHM
AIM:
To apply the EM algorithm to cluster a set of data stored in a .CSV file, to cluster the same data set using the k-Means algorithm, and to compare the results of the two algorithms, commenting on the quality of clustering.
Description:
Clustering is an important means of data mining and of algorithms that separate data of similar
nature. Unlike the classification algorithm, clustering belongs to the unsupervised type of
algorithms.
Two representatives of the clustering algorithms are the K-means algorithm and the expectation
maximization (EM) algorithm.
EM and K-means are similar in the sense that both refine a model iteratively to find the best fit for the data.
However, K-means assigns items to clusters using the Euclidean distance between data items, whereas EM uses statistical (probabilistic) methods.
The EM algorithm is therefore often able to model the data more flexibly.
Clustering splits a large data set into several clusters whose members share some common trait; it is carried out by calculating similarity or proximity with a distance measure.
Clustering methods can be divided into partitional clustering and hierarchical clustering.
Hierarchical clustering can be agglomerative or divisive, i.e. bottom-up or top-down, respectively: it starts from individual elements and builds a hierarchical cluster structure, with the elements forming a tree whose root is a single cluster containing all the elements.
Algorithm: K-means clustering
The cluster analysis procedure is analyzed to determine the properties of the data set and the target
variable. It is typically used to determine how to measure similarity distance. Basically, it functions as
follows:
Input: The number of k and a database containing n objects.
Output: A set of k-clusters that minimize the squared-error criterion.
Method:
1. arbitrarily choose k objects as the initial cluster centres;
2. repeat;
3. (re)assign each object to the cluster to which the object is the most similar based on
the mean value of the objects in the cluster;
4. update the cluster mean, i.e. calculate the mean value of the object for each cluster;
5. until no change.
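As an illustration of these steps, a minimal k-means loop written directly in NumPy is sketched below (the sample points and the value of k are arbitrary; the program later in this experiment uses scikit-learn's KMeans instead):

import numpy as np

def kmeans(data, k, iterations=100):
    # step 1: arbitrarily choose k objects as the initial cluster centres
    centres = data[np.random.choice(len(data), k, replace=False)]
    for _ in range(iterations):
        # step 3: (re)assign each object to the nearest centre (Euclidean distance)
        distances = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: update each cluster mean
        new_centres = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # step 5: stop when the centres no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centres = kmeans(points, k=2)
print(labels, centres, sep="\n")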
Algorithm: EM clustering
The concept of the EM algorithm stems from the Gaussian mixture model (GMM). The GMM approach models the probability density of a given set of sample data as a mixture of several Gaussian probability density functions rather than as a single density. The parameters of each Gaussian component are estimated by maximizing the log-likelihood of the data, and the EM algorithm is used to find this maximum and hence the optimal model. Principally, the EM clustering method uses the following algorithm:
Input: Cluster number k, a database, stopping tolerance.
Output: A set of k-clusters with weight that maximize log-likelihood function.
1. Expectation step: For each database record x, compute the membership probability of x in each
cluster h = 1,…, k.
2. Maximization step: Update mixture model parameter (probability weight).
3. Stopping criteria: If stopping criteria are satisfied stop, else set j = j +1 and go to (1).
Analytical methods for estimating the parameters of a probability distribution assume that the values of all variables are given. The iterative EM algorithm instead treats some variables as hidden, which makes it a general method for finding the optimal parameters of the underlying distribution from the given data when the data are incomplete or have missing values.
Dataset: (iris_dataset.csv)
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn import metrics
# column names (reconstructed; they must match the column order of iris_dataset.csv)
names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']
dataset = pd.read_csv('iris_dataset.csv', names=names)
X = dataset.iloc[:, :-1]
label = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
y = [label[c] for c in dataset.iloc[:, -1]]
plt.figure(figsize=(14, 7))
colormap = np.array(['red', 'lime', 'black'])
# REAL PLOT
plt.subplot(1, 3, 1)
plt.title('Real')
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y])
# K-MEANS PLOT
model = KMeans(n_clusters=3, random_state=0).fit(X)
plt.subplot(1, 3, 2)
plt.title('KMeans')
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[model.labels_])
print('The accuracy score of K-Mean: ', metrics.accuracy_score(y, model.labels_))
print('The Confusion matrix of K-Mean:\n ', metrics.confusion_matrix(y, model.labels_))
# GMM PLOT
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
y_cluster_gmm = gmm.predict(X)
plt.subplot(1, 3, 3)
plt.title('GMM Classification')
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y_cluster_gmm])
print('The accuracy score of EM: ', metrics.accuracy_score(y, y_cluster_gmm))
print('The Confusion matrix of EM:\n ', metrics.confusion_matrix(y, y_cluster_gmm))
OUTPUT:
The accuracy score of K-Mean:  0.0
The Confusion matrix of K-Mean:
 [[0 5 0]
 [4 0 1]
 [5 0 0]]
The accuracy score of EM:  0.0
The Confusion matrix of EM:
 [[0 5 0]
 [0 0 5]
 [5 0 0]]
RESULT:
Thus the EM algorithm and the k-Means algorithm have been applied to cluster the data set stored in a .CSV file, and the results of the two algorithms have been compared to comment on the quality of clustering.
EX.NO:10 K-NEAREST NEIGHBOUR ALGORITHM
AIM:
To write a Python program to implement the k-Nearest Neighbour classification algorithm to classify the iris data set and to print both correct and wrong predictions.
ALGORITHM:
K-Nearest Neighbor Algorithm
Training algorithm:
For each training example (x, f (x)), add the example to the list training examples
Classification algorithm:
Given a query instance xq to be classified,
Let x1 … xk denote the k instances from the training examples that are nearest to xq
Return f̂(xq) ← argmax (v ∈ V) Σ (i = 1..k) δ(v, f(xi)), where δ(a, b) = 1 if a = b and 0 otherwise.
(For a real-valued target function, f̂(xq) is instead the mean value of f(xi) over the k nearest training examples.)
DESCRIPTION
Confusion Matrix
True positives: data points labelled as positive that are actually positive
False positives: data points labelled as positive that are actually negative
True negatives: data points labelled as negative that are actually negative
False negatives: data points labelled as negative that are actually positive
Accuracy: how often is the classifier correct?
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1-Score:
F1 = 2 × (Precision × Recall) / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN)
Support = TP + FN
Example:
Support _ A = TP_A + FN_A
= 30 + (20 + 10)
= 60
DATASET:
5.1 3.5 1.4 0.2 Iris−setosa
4.9 3 1.4 0.2 Iris−setosa
4.7 3.2 1.3 0.2 Iris−setosa
4.6 3.1 1.5 0.2 Iris−setosa
5 3.6 1.4 0.2 Iris−setosa
7 3.2 4.7 1.4 Iris−versicolor
6.4 3.2 4.5 1.5 Iris−versicolor
6.9 3.1 4.9 1.5 Iris−versicolor
5.5 2.3 4 1.3 Iris−versicolor
6.5 2.8 4.6 1.5 Iris−versicolor
6.7 3 5.2 2.3 Iris−virginica
6.3 2.5 5 1.9 Iris−virginica
6.5 3 5.2 2 Iris−virginica
6.2 3.4 5.4 2.3 Iris−virginica
5.9 3 5.1 1.8 Iris−virginica
Program:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
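# The listing jumps from the imports to printing the results; the data-loading and
# training steps in between are reconstructed below as a sketch (the file name
# 'iris.csv', the column names and the split fraction are assumptions).
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv('iris.csv', names=names)
X = dataset.iloc[:, :-1]      # the four measurement columns
y = dataset.iloc[:, -1]       # the species label
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.10)
classifier = KNeighborsClassifier(n_neighbors=5).fit(xtrain, ytrain)
ypred = classifier.predict(xtest)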
i = 0
print("\n-------------------------------------------------------------------------")
print('%-25s %-25s %-25s' % ('Original Label', 'Predicted Label', 'Correct/Wrong'))
print("-------------------------------------------------------------------------")
for label in ytest:
    print('%-25s %-25s' % (label, ypred[i]), end="")
    if (label == ypred[i]):
        print(' %-25s' % ('Correct'))
    else:
        print(' %-25s' % ('Wrong'))
    i = i + 1
print("-------------------------------------------------------------------------")
print("\nConfusion Matrix:\n", metrics.confusion_matrix(ytest, ypred))
print("-------------------------------------------------------------------------")
print("\nClassification Report:\n", metrics.classification_report(ytest, ypred))
print("-------------------------------------------------------------------------")
print('Accuracy of the classifier is %0.2f' % metrics.accuracy_score(ytest, ypred))
print("-------------------------------------------------------------------------")
OUTPUT:
--
Original Label Predicted Label Correct/Wrong
--
Iris-versicolor Iris-versicolor Correct
Iris-virginica Iris-versicolor Wrong
Iris-virginica Iris-virginica Correct
Iris-versicolor Iris-versicolor Correct
Iris-setosa Iris-setosa Correct
Iris-versicolor Iris-versicolor Correct
Iris-setosa Iris-setosa Correct
Iris-setosa Iris-setosa Correct
Iris-virginica Iris-virginica Correct
Iris-virginica Iris-versicolor Wrong
Iris-virginica Iris-virginica Correct
Iris-setosa Iris-setosa Correct
Iris-virginica Iris-virginica Correct
Iris-virginica Iris-virginica Correct
Iris-versicolor Iris-versicolor Correct
--
Confusion Matrix:
[[4 0 0]
 [0 4 0]
 [0 2 5]]
--
Classification Report:
precision recall f1-score support
--
Accuracy of the classifier is 0.87
--
RESULT:
Thus the Python program to implement the k-Nearest Neighbour algorithm to classify the iris data set has been written and executed successfully.