ML LAB
S.No. List of Experiments
1 Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based
on a given set of training data samples. Read the training data from a .CSV file.
2 For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Candidate- Elimination algorithm to output a description of the set of all hypotheses consistent
with the training examples.
3 Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
4 Exercises to solve the real-world problems using the following machine learning methods: a)
Linear Regression b) Logistic Regression c) Binary Classifier
5 Develop a program for Bias, Variance, Remove Duplicates, Cross Validation
7 Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same using appropriate data sets.
8 Write a program to implement k-Nearest Neighbor algorithm to classify the iris data set. Print both correct and wrong predictions.
9 Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
10 Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.
11 Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering using k-Means algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can add Java/Python ML library classes/API in the program.
12 Exploratory Data Analysis for Classification using Pandas or Matplotlib.
13 Write a Python program to construct a Bayesian network considering medical data. Use this model to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set.
14 Write a program to implement Support Vector Machines and Principal Component Analysis
Experiment – 1:
Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis
based on a given set of training data samples. Read the training data from a .CSV file.
Aim: Demonstration of FIND-S algorithm for finding the most specific hypothesis
import csv

with open('enjoysport.csv') as f:   # training data file (name assumed)
    reader = csv.reader(f)
    your_list = list(reader)

# h holds the most specific hypothesis; '0' means "no value seen yet"
h = [['0', '0', '0', '0', '0', '0']]

for i in your_list:
    print(i)
    if i[-1] == "True":                 # FIND-S looks only at positive examples
        j = 0
        for x in i:
            if x != "True":
                if x != h[0][j] and h[0][j] == '0':
                    h[0][j] = x         # first positive example: copy the value
                elif x != h[0][j] and h[0][j] != '0':
                    h[0][j] = '?'       # conflicting value: generalize
                else:
                    pass                # value already matches the hypothesis
                j = j + 1

print('Most specific hypothesis is:')
print(h)
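The program reads its training examples from the CSV file opened as f. A small hypothetical enjoysport.csv in the expected format (the same weather examples used in Experiment 2, with the last column holding the True/False class label):

sunny,warm,normal,strong,warm,same,True
sunny,warm,high,strong,warm,same,True
rainy,cold,high,strong,warm,change,False
sunny,warm,high,strong,cool,change,True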
Output
Experiment – 2:
For a given set of training data examples stored in a .CSV file, implement and demonstrate the Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent with the training examples.
Aim: Demonstration of the Candidate-Elimination algorithm
Program code
class CandidateElimination:
    ''' Candidate-Elimination maintains the specific boundary S and the
    general boundary G of the version space while scanning the data set '''

    def __init__(self, dataset):
        self.dataset = dataset
        self.num_attributes = len(dataset[0][0])

    def initializeS(self):
        ''' Initialize the specific boundary to the most specific hypothesis '''
        return ['0'] * self.num_attributes

    def initializeG(self):
        ''' Initialize the general boundary to the most general hypothesis '''
        return [['?'] * self.num_attributes]

    def consistent(self, hypothesis, instance):
        ''' Check whether the hypothesis covers the instance '''
        for h, x in zip(hypothesis, instance):
            if h != '?' and h != x:
                return False
        return True

    def run_algorithm(self):
        ''' Initialize the specific and general boundaries, and loop the
        dataset against the algorithm: generalize S on positive examples,
        specialize G on negative examples '''
        S = self.initializeS()
        G = self.initializeG()
        for attrs, label in self.dataset:
            if label == 'Y':   # positive trial_set
                # remove from G the hypotheses that do not cover the example
                G = [g for g in G if self.consistent(g, attrs)]
                # minimally generalize S so that it covers the example
                for j in range(self.num_attributes):
                    if S[j] == '0':
                        S[j] = attrs[j]
                    elif S[j] != attrs[j]:
                        S[j] = '?'
            else:              # negative trial_set
                # replace every g in G that covers the example with its
                # minimal specializations that are still more general than S
                G_new = []
                for g in G:
                    if not self.consistent(g, attrs):
                        G_new.append(g)
                        continue
                    for j in range(self.num_attributes):
                        if g[j] == '?' and S[j] != '?' and S[j] != attrs[j]:
                            specialization = g[:]
                            specialization[j] = S[j]
                            G_new.append(specialization)
                G = G_new
            print('S =', S)
            print('G =', [tuple(g) for g in G])
        return S, G


dataset = [(('sunny','warm','normal','strong','warm','same'),'Y'),
           (('sunny','warm','high','strong','warm','same'),'Y'),
           (('rainy','cold','high','strong','warm','change'),'N'),
           (('sunny','warm','high','strong','cool','change'),'Y')]

a = CandidateElimination(dataset)  # pass the dataset to the algorithm class
a.run_algorithm()                  # and call the run_algorithm method
Output
[('sunny', '?', '?', '?', '?', '?'), ('?', 'warm', '?', '?', '?', '?'), ('?', '?', '?', '?', '?', 'same')]
[('sunny', '?', '?', '?', '?', '?'), ('?', 'warm', '?', '?', '?', '?')]
Experiment-3:
Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
Program code:
import math
import csv
import numpy as np

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []   # list of (attribute value, subtree) pairs
        self.answer = ""     # class label, set only on leaf nodes

    def __str__(self):
        return self.attribute

# Data loader (originally a separate Data_loader.py)
def read_data(filename):
    '''Read the CSV file; the first row holds the attribute names.'''
    with open(filename) as f:
        rows = list(csv.reader(f))
    metadata = rows[0]
    traindata = np.array(rows[1:])
    return metadata, traindata

def subtables(data, col, delete):
    '''Partition the data on the values of the given column.'''
    items = np.unique(data[:, col])
    dict = {}
    for item in items:
        subset = data[data[:, col] == item]
        if delete:
            subset = np.delete(subset, col, axis=1)
        dict[item] = subset
    return items, dict

def entropy(S):
    '''Entropy of a vector of class labels.'''
    items = np.unique(S)
    if items.size == 1:
        return 0
    counts = [np.sum(S == item) / (S.size * 1.0) for item in items]
    return sum(-c * math.log(c, 2) for c in counts)

def gain_ratio(data, col):
    '''Information gain of splitting on col, divided by the intrinsic value.'''
    items, dict = subtables(data, col, delete=False)
    total_size = data.shape[0]
    total_entropy = entropy(data[:, -1])
    iv = 0
    for item in items:
        ratio = dict[item].shape[0] / (total_size * 1.0)
        total_entropy -= ratio * entropy(dict[item][:, -1])
        iv -= ratio * math.log(ratio, 2)
    return total_entropy / iv if iv != 0 else 0

def create_node(data, metadata):
    '''Recursively build the decision tree, splitting on the attribute
    with the highest gain ratio at each node.'''
    if np.unique(data[:, -1]).shape[0] == 1:   # pure subset: make a leaf
        node = Node("")
        node.answer = np.unique(data[:, -1])[0]
        return node
    gains = [gain_ratio(data, col) for col in range(data.shape[1] - 1)]
    split = int(np.argmax(gains))
    node = Node(metadata[split])
    metadata = np.delete(metadata, split)
    items, dict = subtables(data, split, delete=True)
    for item in items:
        node.children.append((item, create_node(dict[item], metadata)))
    return node

def empty(size):
    return "  " * size

def print_tree(node, level):
    if node.answer != "":
        print(empty(level), node.answer)
        return
    print(empty(level), node.attribute)
    for value, n in node.children:
        print(empty(level + 1), value)
        print_tree(n, level + 2)

metadata, traindata = read_data("Tennis.csv")
node = create_node(traindata, metadata)
print_tree(node, 0)
Input:
Tennis.csv
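The loader treats the first row as the attribute names (metadata). A plausible Tennis.csv, assumed here to be the standard play-tennis data:

outlook,temperature,humidity,wind,answer
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no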
Output
outlook
  overcast
    yes
  rain
    wind
      strong
        no
      weak
        yes
  sunny
    humidity
      high
        no
      normal
        yes
Experiment – 4:
Exercises to solve real-world problems using the following machine learning methods: a) Linear Regression b) Logistic Regression c) Binary Classifier
Aim: To solve real-world problems using the machine learning methods Linear Regression and Logistic Regression
Dataset: std_marks.csv, constructed from students' lab internal and external marks.
Program code:
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
data=pd.read_csv(r"E:\sudhakar\std_marks.csv")
print('First 5 rows of the data set are:')
print(data.head())
dim=data.shape
print('Dimensions of the data set are',dim)
print('Statistics of the data are:')
print(data.describe())
print('Correlation matrix of the data set is:')
print(data.corr())
x_set=data[['internal']]
print('First 5 rows of features set are:')
print(x_set.head())
y_set=data[['external']]
print('First 5 rows of the target set are:')
print(y_set.head())
x_train,x_test, y_train, y_test = train_test_split(x_set,y_set, test_size = 0.3)
model=linear_model.LinearRegression()
model.fit(x_train,y_train)
print('Regression coefficient is',float(model.coef_))
print('Regression intercept is',float(model.intercept_))
y_pred=model.predict(x_test)
y_preds=[]
for i in y_pred:
    y_preds.append(float(i))
print('Predicted values for test data are:')
print(y_preds)
print('mean squared error is ',mean_squared_error(y_test,y_pred))
plt.scatter(x_test,y_test,color='blue',label='actual y values')
plt.plot(x_test,y_pred,color='red',linewidth=3,label='predicted regression line')
plt.ylabel('y value')
plt.xlabel('x value')
plt.title('simple linear regression')
plt.legend(loc='best')
plt.show()
Output screen shots :
Experiment – 4b:
Aim: To solve the real-world problems using the machine learning methods. Logistic Regression
Program code:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.preprocessing import StandardScaler
data=pd.read_csv(r"E:\sudhakar\heart.csv")
print('The first 5 rows of the data set are:')
print(data.head())
dim=data.shape
print('Dimensions of the data set are',dim)
print('Statistics of the data are:')
print(data.describe())
print('Correlation matrix of the data set is:')
print(data.corr())
class_lbls=data['target'].unique()
class_labels=[]
for x in class_lbls:
class_labels.append(str(x))
print('Class labels are:')
print(class_labels)
sns.countplot(data['target'])
col_names=data.columns
feature_names=col_names[:-1]
feature_names=list(feature_names)
print('Feature names are:')
print(feature_names)
x_set = data.drop(['target'], axis=1)
print('First 5 rows of features set are:')
print(x_set.head())
y_set=data[['target']]
print('First 5 rows of the target set are:')
print(y_set.head())
scaler=StandardScaler()
x_train,x_test, y_train, y_test = train_test_split(x_set,y_set, test_size = 0.3)
scaler.fit(x_train)
x_train=scaler.transform(x_train)
model = LogisticRegression()
model.fit(x_train, y_train)
x_test=scaler.transform(x_test)
y_pred=model.predict(x_test)
print('Predicted class labels for test data are:')
print(y_pred)
print("Accuracy:",accuracy_score(y_test, y_pred))
print("Precision:",precision_score(y_test, y_pred))
print("Recall:",recall_score(y_test, y_pred))
print(classification_report(y_test,y_pred,target_names=class_labels))
cm=confusion_matrix(y_test,y_pred)
df_cm = pd.DataFrame(cm, columns=class_labels, index = class_labels)
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.set(font_scale=1.5)
sns.heatmap(df_cm, annot=True,cmap="Blues",fmt='d')
Output screen shots:
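Experiment – 4c:
The exercise list also names a binary classifier under part (c); the logistic regression above is already one. A second minimal sketch, assuming the scaled x_train/x_test and y_train/y_test from the listing in 4b, uses scikit-learn's SGDClassifier (hinge loss, i.e. a linear SVM trained by stochastic gradient descent):

from sklearn.linear_model import SGDClassifier
# linear binary classifier trained by stochastic gradient descent
bin_clf = SGDClassifier(loss='hinge', max_iter=1000, random_state=0)
bin_clf.fit(x_train, y_train.values.ravel())
y_pred_bin = bin_clf.predict(x_test)
print('Binary classifier accuracy:', accuracy_score(y_test, y_pred_bin))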
Experiment – 5:
Develop a program for Bias, Variance, Remove Duplicates, Cross Validation.
Aim: Estimation of the bias and variance of a regression model using k-fold cross-validation
Program code:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
import matplotlib.pyplot as plt
from statistics import mean,stdev
data=pd.read_csv(r"E:\machine learning\datasets\winequality.csv")
dim=data.shape
print('Dimensions of the data set are',dim)
print('First 5 rows of the data set are:')
print(data.head())
col_names=data.columns
col_names=list(col_names)
print('Attribute names are:')
print(col_names)
feature_names=col_names[:-1]
print('Feature names are:',feature_names)
x_set=data.drop('quality',axis=1)
y_set=data['quality']
model=linear_model.LinearRegression()
scores=cross_val_score(model, x_set, y_set, cv=10)
k_list=range(2,200)
bias=[]
variance=[]
for k in k_list:
model=linear_model.LinearRegression()
scores=cross_val_score(model, x_set, y_set, cv=k)
bias.append(mean(scores))
variance.append(stdev(scores))
plt.plot(k_list, bias, 'b', label='bias of model')
plt.plot(k_list, variance, 'r', label='Variance of model')
plt.xlabel('k value')
plt.title('bias-variance trade off')
plt.legend(loc='best')
plt.show()
# From the graph, the best cv value is about 85
model=linear_model.LinearRegression()
scores=cross_val_score(model, x_set, y_set, cv=85)
bias=mean(scores)
variance=stdev(scores)
print('Bias of the model is',bias)
print('Variance of the model is',variance)
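The exercise also asks for duplicate removal. A minimal pandas sketch that would slot in right after read_csv, before x_set and y_set are built:

print('Number of duplicate rows:', data.duplicated().sum())
data = data.drop_duplicates()        # drop exact duplicate rows in place
print('Dimensions after removing duplicates:', data.shape)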
Output screen shots:
Experiment-7
Aim: Demonstration of an Artificial Neural Network using the Backpropagation algorithm
Program Code:
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X / np.amax(X, axis=0)   # normalize the input features
y = y / 100                  # normalize the target to [0, 1]

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivatives_sigmoid(x):
    return x * (1 - x)

#Variable initialization
epoch = 5000                 # number of training iterations
lr = 0.1                     # learning rate
inputlayer_neurons = 2       # number of features in the data set
hiddenlayer_neurons = 3      # number of hidden layer neurons
output_neurons = 1           # number of neurons at output layer
#weight and bias initialization
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    #Forward propagation
    hlayer_act = sigmoid(np.dot(X, wh) + bh)
    output = sigmoid(np.dot(hlayer_act, wout) + bout)
    #Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)
    d_hiddenlayer = EH * hiddengrad
    #Weight updates
    wout += hlayer_act.T.dot(d_output) * lr
    wh += X.T.dot(d_hiddenlayer) * lr

print("Input:\n" + str(X))
print("Actual Output:\n" + str(y))
print("Predicted Output:\n" + str(output))
Input:
[[0.66666667 1.        ]
 [0.33333333 0.55555556]
 [1.         0.66666667]]
Actual Output:
[[0.92]
 [0.86]
 [0.89]]
Predicted Output:
[[0.89559591]
 [0.88142069]
 [0.8928407 ]]
Experiment-8:
Write a program to implement k-Nearest Neighbor algorithm to classify the iris data set. Print both correct and wrong predictions.
Program Code:
import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename) as csvfile:
        dataset = list(csv.reader(csvfile))
    for x in range(len(dataset)):
        for y in range(4):
            dataset[x][y] = float(dataset[x][y])
        if random.random() < split:
            trainingSet.append(dataset[x])
        else:
            testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('iris.csv', split, trainingSet, testSet)
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        tag = 'correct' if result == testSet[x][-1] else 'wrong'
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]) + ' (' + tag + ')')
    print('Accuracy: ' + repr(getAccuracy(testSet, predictions)) + '%')

main()
OUTPUT
[[11 0 0]
[0 9 1]
[0 1 8]]
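A confusion matrix like the one above is most easily produced with scikit-learn; a minimal sketch (using the built-in iris data rather than iris.csv, so the exact counts may differ from run to run):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris()
x_tr, x_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(x_tr, y_tr)
print(confusion_matrix(y_te, knn.predict(x_te)))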
Experiment – 9:
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select appropriate data set for your experiment and draw graphs.
Aim: Implementation of the non-parametric Locally Weighted Regression algorithm
Program code:
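Locally Weighted Regression fits a separate weighted least-squares model at every query point x, so no single global parameter vector is learned. The standard formulation followed by the listing below weights each training point x_j with a Gaussian kernel and solves the weighted normal equations:

w_j = exp(-(x - x_j)^2 / (2*k^2))
beta(x) = (X^T W X)^(-1) X^T W y

where W is the diagonal matrix of the weights w_j and k is the bandwidth that controls how local the fit is.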
import numpy as np1
import matplotlib.pyplot as plt
import pandas as pd

def kernel(point, xmat, k):
    m, n = np1.shape(xmat)
    weights = np1.mat(np1.eye(m))    # diagonal weight matrix
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np1.exp(diff * diff.T / (-2.0 * k ** 2))
    return weights

def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = np1.shape(xmat)
    ypred = np1.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred

# load the data points (total bill vs. tip; file name assumed)
data = pd.read_csv('tips.csv')
bill = np1.array(data.total_bill)
tip = np1.array(data.tip)
mbill = np1.mat(bill)
mtip = np1.mat(tip)
m = np1.shape(mbill)[1]
one = np1.mat(np1.ones(m))
X = np1.hstack((one.T, mbill.T))
# set k (the kernel bandwidth) here
ypred = localWeightRegression(X, mtip, 2)

SortIndex = X[:, 1].argsort(0)
xsort = X[SortIndex][:, 0]
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(bill, tip, color='green')
ax.plot(xsort[:, 1], ypred[SortIndex], color='red', linewidth=3)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()
Output
Experiment-10:
Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.
Aim: Classification of a set of documents using the naïve Bayesian classifier
Program code
import pandas as pd
msg=pd.read_csv('naivetext1.csv',names=['message','label'])
print('The dimensions of the dataset',msg.shape)
msg['labelnum']=msg.label.map({'pos':1,'neg':0})
X=msg.message
y=msg.labelnum
print(X)
print(y)
#splitting the dataset into train and test data
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(X,y)
print(xtest.shape)
print(xtrain.shape)
print(ytest.shape)
print(ytrain.shape)
#output of count vectoriser is a sparse matrix
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm=count_vect.transform(xtest)
print(count_vect.get_feature_names())
df=pd.DataFrame(xtrain_dtm.toarray(),columns=count_vect.get_feature_names())
print(df)
#tabular representation
print(xtrain_dtm)
#sparse matrix representation
# Training Naive Bayes (NB) classifier on training data
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(xtrain_dtm,ytrain)
predicted = clf.predict(xtest_dtm)
#printing accuracy metrics
from sklearn import metrics
print('Accuracy metrics')
print('Accuracy of the classifer is',metrics.accuracy_score(ytest,predicted))
print('Confusion matrix')
print(metrics.confusion_matrix(ytest,predicted))
print('Recall and Precision')
print(metrics.recall_score(ytest,predicted))
print(metrics.precision_score(ytest,predicted))
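The listing assumes naivetext1.csv holds one document per row followed by a pos/neg label, with no header row. A few illustrative rows (hypothetical, but consistent with the vocabulary printed in the output below):

I love this sandwich,pos
This is an amazing place,pos
I feel very good about these beers,pos
I do not like this restaurant,neg
I am tired of this stuff,neg
My boss is horrible,neg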
OUTPUT:-
['about', 'am', 'amazing', 'an', 'and', 'awesome', 'beers', 'best', 'boss', 'can', 'deal',
'do', 'enemy', 'feel', 'fun', 'good', 'have', 'horrible', 'house', 'is', 'like', 'love', 'my',
'not', 'of', 'place', 'restaurant', 'sandwich', 'sick', 'stuff', 'these', 'this', 'tired', 'to',
'today', 'tomorrow', 'very', 'view', 'we', 'went', 'what', 'will', 'with', 'work']
(followed by the document-term count matrix: one row per training document, one column per vocabulary term)
Experiment-11:
Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering using k-Means algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can add Java/Python ML library classes/API in the program.
Aim: Implementation of EM algorithm to cluster a Heart Disease Data Set
Program Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from matplotlib.patches import Ellipse

X, y_true = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)
X = X[:, ::-1]   # flip axes for better plotting

gmm = GaussianMixture(n_components=4).fit(X)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
probs = gmm.predict_proba(X)
print(probs[:5].round(3))
size = 50 * probs.max(1) ** 2   # square emphasizes differences
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=size)

def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse with a given position and covariance"""
    ax = ax or plt.gca()
    # convert the covariance to principal axes
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    # draw the 1-sigma to 3-sigma ellipses
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle=angle, **kwargs))

def plot_gmm(gmm, X, label=True, ax=None):
    ax = ax or plt.gca()
    labels = gmm.fit(X).predict(X)
    if label:
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    w_factor = 0.2 / gmm.weights_.max()
    for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        draw_ellipse(pos, covar, alpha=w * w_factor)

gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
plot_gmm(gmm, X)
plt.show()
Output
[[1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]
K MEANS :
from sklearn.cluster import KMeans
#from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("kmeansdata.csv")
df1 = pd.DataFrame(data)
print(df1)
f1 = df1['Distance_Feature'].values
f2 = df1['Speeding_Feature'].values
X = np.matrix(list(zip(f1, f2)))
plt.plot()
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('Dataset')
plt.ylabel('speeding_feature')
plt.xlabel('Distance_Feature')
plt.scatter(f1, f2)
plt.show()
# create new plot and data
plt.plot()
colors = ['b', 'g', 'r']
markers = ['o', 'v', 's']
# KMeans algorithm, K = 3
kmeans_model = KMeans(n_clusters=3).fit(X)
plt.plot()
for i, l in enumerate(kmeans_model.labels_):
    plt.plot(f1[i], f2[i], color=colors[l], marker=markers[l], ls='None')
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.show()
Input: kmeansdata.csv
Driver_ID,Distance_Feature,Speeding_Feature
3423311935,71.24,28
3423313212,52.53,25
3423313724,64.54,27
3423311373,55.69,22
3423310999,54.58,25
3423313857,41.91,10
3423312432,58.64,20
3423311434,52.02,8
3423311328,31.25,34
3423312488,44.31,19
3423311254,49.35,40
3423312943,58.07,45
3423312536,44.22,22
3423311542,55.73,19
3423312176,46.63,43
3423314176,52.97,32
3423314202,46.25,35
3423311346,51.55,27
3423310666,57.05,26
3423313527,58.45,30
3423312182,43.42,23
3423313590,55.68,37
3423312268,55.15,18
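To compare the quality of the EM and k-Means clusterings quantitatively, one option (a sketch, assuming X and kmeans_model from the listing above) is the silhouette score, which is higher for tight, well-separated clusters:

from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
Xa = np.asarray(X)   # silhouette_score expects an array, not np.matrix
em_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(Xa)
print('k-Means silhouette score:', silhouette_score(Xa, kmeans_model.labels_))
print('EM (GMM) silhouette score:', silhouette_score(Xa, em_labels))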
Experiment -12
Aim: Exploratory data analysis for classification using pandas and Matplotlib
Dataset: tae.csv- The data consist of evaluations of teaching performance over three regular
semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics
Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-
sized categories ("low", "medium", and "high") to form the class variable. The data set is collected
from https://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation
Program code:
import pandas as pd
import matplotlib.pyplot as plt
print('pandas version is', pd.__version__)
data = pd.read_csv(r"E:\sudhakar\tae.csv",header=None)
col_names=['native_speaker','instructor','course','semester','class_size','score']
data.columns=col_names
print('Data type of target variable is:',data['score'].dtype)
print('Converting target variable data type to categorical')
data['score']=data['score'].astype('category')
print('After conversion, data type of target variable is:',data['score'].dtype)
print('Dimensions of the data set:')
print(data.shape)
print('The first 5 rows of the data set are:')
print(data.head())
print('The last 5 rows of the data set are:')
print(data.tail())
print('Randomly selected 5 rows of the data set are:')
print(data.sample(5))
print('The columns of the data set are:')
print(data.columns.tolist())
print('Names and data types of attributes are:')
print(data.dtypes)
print('Converting native_speaker data type to categorical')
data['native_speaker']=data['native_speaker'].astype('category')
print('After conversion,Names and data types of attributes are:')
print(data.dtypes)
print('Information of the data set attributes:')
print(data.info())
print('Statistics of the numerical attributes of the data set are:')
print(data.describe())
print('Statistics of the all attributes of the data set are:')
print(data.describe(include='all'))
print('Corelation matrix of the numerical attributes of the data set is:')
corr=data.corr()
print(corr)
print('Distribution of the target variable is:')
print(data['score'].value_counts())
print('Target class distribution w.r.t \'native_speaker\' attribute')
print(pd.crosstab(data.native_speaker,data.score))
print('Target class distribution w.r.t \'native_speaker\' attribute (normalized)')
print(pd.crosstab(data.native_speaker,data.score,normalize='index'))
print('Target class distribution w.r.t \'native_speaker\' attribute using groupby')
print(data.groupby('native_speaker').score.value_counts())
print('Checking for null values:')
print(data.isnull().sum())
data.dropna(subset=['instructor'],axis=0,inplace=True)
print('After removal rows with null values in column \'instructor\'')
print(data.isnull().sum())
print('Unique values in the column named \'score\'')
print(data['score'].unique())
data.plot(kind='scatter',x='semester',y='class_size',color='red')
print('Number of distinct courses semester wise')
data.groupby('semester')['course'].nunique().plot(kind='bar')
print('Frequency of values in column \'semester\'')
data[['semester']].plot(kind='hist')
data.plot(kind='bar',x='semester',y='course',color='red')
ax = plt.gca()#gca means get current axes
data.plot(kind='line',x='semester',y='class_size',ax=ax)
Output screen shots:
Experiment -13:
Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set.
Program code:
import bayespy as bp
import numpy as np
import csv
from colorama import init
from colorama import Fore, Back, Style
init()

# Attribute values are encoded as integers before training
ageEnum = {'SuperSeniorCitizen': 0, 'SeniorCitizen': 1, 'MiddleAged': 2, 'Youth': 3, 'Teen': 4}
genderEnum = {'Male': 0, 'Female': 1}
familyHistoryEnum = {'Yes': 0, 'No': 1}
dietEnum = {'High': 0, 'Medium': 1, 'Low': 2}
lifeStyleEnum = {'Athlete': 0, 'Active': 1, 'Moderate': 2, 'Sedetary': 3}
cholesterolEnum = {'High': 0, 'BorderLine': 1, 'Normal': 2}
heartDiseaseEnum = {'Yes': 0, 'No': 1}

# Training data for machine learning, imported from the csv file (name assumed)
data = []
with open('heart_disease_data.csv') as csvfile:
    for x in csv.reader(csvfile):
        data.append([ageEnum[x[0]], genderEnum[x[1]], familyHistoryEnum[x[2]], dietEnum[x[3]],
                     lifeStyleEnum[x[4]], cholesterolEnum[x[5]], heartDiseaseEnum[x[6]]])
data = np.array(data)
N = len(data)

p_age = bp.nodes.Dirichlet(1.0*np.ones(5))
age = bp.nodes.Categorical(p_age, plates=(N,))
age.observe(data[:,0])
p_gender = bp.nodes.Dirichlet(1.0*np.ones(2))
gender = bp.nodes.Categorical(p_gender, plates=(N,))
gender.observe(data[:,1])
p_familyhistory = bp.nodes.Dirichlet(1.0*np.ones(2))
familyhistory = bp.nodes.Categorical(p_familyhistory, plates=(N,))
familyhistory.observe(data[:,2])
p_diet = bp.nodes.Dirichlet(1.0*np.ones(3))
diet = bp.nodes.Categorical(p_diet, plates=(N,))
diet.observe(data[:,3])
p_lifestyle = bp.nodes.Dirichlet(1.0*np.ones(4))
lifestyle = bp.nodes.Categorical(p_lifestyle, plates=(N,))
lifestyle.observe(data[:,4])
p_cholesterol = bp.nodes.Dirichlet(1.0*np.ones(3))
cholesterol = bp.nodes.Categorical(p_cholesterol, plates=(N,))
cholesterol.observe(data[:,5])

# Heart disease is modelled as a mixture conditioned on all the parent variables
p_heartdisease = bp.nodes.Dirichlet(np.ones(2), plates=(5, 2, 2, 3, 4, 3))
heartdisease = bp.nodes.MultiMixture([age, gender, familyhistory, diet, lifestyle, cholesterol],
                                     bp.nodes.Categorical, p_heartdisease)
heartdisease.observe(data[:,6])
p_heartdisease.update()

# Interactive Test
m = 0
while m == 0:
    print("\n")
    res = bp.nodes.MultiMixture([int(input('Enter Age: ' + str(ageEnum))),
                                 int(input('Enter Gender: ' + str(genderEnum))),
                                 int(input('Enter FamilyHistory: ' + str(familyHistoryEnum))),
                                 int(input('Enter Diet: ' + str(dietEnum))),
                                 int(input('Enter LifeStyle: ' + str(lifeStyleEnum))),
                                 int(input('Enter Cholesterol: ' + str(cholesterolEnum)))],
                                bp.nodes.Categorical, p_heartdisease).get_moments()[0][heartDiseaseEnum['Yes']]
    print("Probability(HeartDisease) = " + str(res))
    #print(Style.RESET_ALL)
    m = int(input("Enter for Continue:0, Exit :1 "))
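The program expects heart_disease_data.csv (file name assumed) to contain one record per row using exactly the attribute values defined in the enums above, with no header row, e.g. these hypothetical rows:

SuperSeniorCitizen,Male,Yes,Medium,Sedetary,High,Yes
Teen,Male,Yes,High,Athlete,BorderLine,No
Youth,Female,No,Low,Moderate,Normal,No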
OUTPUT:
Experiment -14:
Write a program to implement Support Vector Machines.
Aim: Implementation of Support Vector Machines on the Haberman survival data set
Program code:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
data = pd.read_csv(r"E:\sudhakar\haberman.csv", header=None)
#age=age of the patient
#year=Patient's year of operation (year - 1900)
#pos_axil_nodes=Number of positive axillary nodes detected
#survival_status: 1 - the patient survived 5 years or longer
#                 2 - the patient died within 5 years
col_names=['age','year','pos_axil_nodes','survival_status']
data.columns=col_names
#we removed the attribute year of operation
data=data.drop(['year'], axis=1)
print('The first 5 rows of the data set are:')
print(data.head())
dim=data.shape
print('Dimensions of the data set are',dim)
print('Statistics of the data are:')
print(data.describe())
print('Correlation matrix of the dataset is:')
print(data.corr())
class_lbls=data['survival_status'].unique()
class_labels=[]
for x in class_lbls:
    class_labels.append(str(x))
print('Class labels are:')
print(class_labels)
sns.countplot(data['survival_status'])
col_names=data.columns
feature_names=col_names[:-1]
feature_names=list(feature_names)
x_set = data.drop(['survival_status'], axis=1)
y_set = data[['survival_status']]
scaler = StandardScaler()
x_train, x_test, y_train, y_test = train_test_split(x_set, y_set, test_size=0.3)
scaler.fit(x_train)
x_train = scaler.transform(x_train)
model = SVC(kernel='linear')
model.fit(x_train, y_train.values.ravel())
x_test = scaler.transform(x_test)
y_pred = model.predict(x_test)
print('Predicted class labels for test data are:')
print(y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=class_labels))
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, columns=class_labels, index=class_labels)
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, annot=True, cmap="Blues", fmt='d')
plt.show()
Output screen shots:
Experiment -15:
Write a program to implement Principal Component Analysis.
Aim: Implementation of Principal Component Analysis
Program code:
import pandas as pnd
DS = pnd.read_csv('Wine.csv')
# Now, we will distribute the dataset into two components "X" and "Y"
X = DS.iloc[:, 0:13].values
Y = DS.iloc[:, 13].values
from sklearn.model_selection import train_test_split as tts
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size=0.2, random_state=0)
# Feature scaling before PCA
from sklearn.preprocessing import StandardScaler as SS
SC = SS()
X_train = SC.fit_transform(X_train)
X_test = SC.transform(X_test)
# Reduce the 13 features to 2 principal components
from sklearn.decomposition import PCA
PCa = PCA(n_components=2)
X_train = PCa.fit_transform(X_train)
X_test = PCa.transform(X_test)
explained_variance = PCa.explained_variance_ratio_
# Fit a logistic regression classifier on the reduced data
from sklearn.linear_model import LogisticRegression as LR
classifier_1 = LR(random_state=0)
classifier_1.fit(X_train, Y_train)
Output:
LogisticRegression(random_state=0)
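An illustrative follow-up (not part of the original listing) to inspect how much variance the two components retain and to evaluate the classifier on the reduced test set:

print('Explained variance ratio of the 2 components:', explained_variance)
from sklearn.metrics import accuracy_score
Y_pred = classifier_1.predict(X_test)
print('Test accuracy after PCA:', accuracy_score(Y_test, Y_pred))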