MLT - Lab - Manual FINAL
MLT - Lab - Manual FINAL
LABORATORY MANUAL
(19AI24502)
V-SEMESTER - B.TECH - AI&DS
1
MAHENDRA ENGINEERING COLLEGE
(Autonomous)
Syllabus
Artificial Intelligence & Programme
Department
Data Science Code & Name
V Semester
Periods/ Cred
Course Maximum
Course Name Week it
Code marks
L T P C
MACHINE LEARNING
TECHNIQUES 0 0 4 2 100
LABORATORY
Upon completion of this course, the student should be able to get
an idea on:
To apply the concepts of Machine Learning to solve real-world
problems.
To implement basic algorithms in clustering &
Objective(s) classification applied to text & numeric data.
To implement algorithms emphasizing the importance of
bagging & boosting in classification & regression.
To implement algorithms related to dimensionality reduction.
To apply machine learning algorithms for Natural Language
Processing applications.
On completion of this course, students will be able to
To learn to use Weka tool for implementing machine
learning algorithms related to numeric data.
To learn the application of machine learning algorithms for text
data.
Outcome(s)
To use dimensionality reduction algorithms for image
processing applications.
To apply CRFs in text processing applications.
To use fundamental and advanced neural network algorithms
for solving real-world data.
LISTOF EXPERIMENTS
2
2. Root Node Attribute Selection for Decision Trees using Information Gain
3. Bayesian Inference in Gene Expression Analysis
4. Pattern Recognition Application using Bayesian Inference
5. Bagging in Classification
6. Bagging, Boosting applications using Regression Trees
7. Data & Text Classification using Neural Networks
8. Using Weka tool for SVM classification for chosen domain application
9. Data & Text Clustering using K-means algorithm
10 Data & Text Clustering using Gaussian Mixture Models
.
TOTAL PERIODS 45
INDEX
3
SOLVING REGRESSION & CLASSIFICATION USING 5
1. DECISION TREES
BAGGING IN CLASSIFICATION 19
5.
BAGGING, BOOSTING APPLICATIONS USING 24
6.
REGRESSION TREES
DATA & TEXT CLASSIFICATION USING NEURAL 28
7.
NETWORKS
USING WEKA TOOL FOR SVM CLASSIFICATION 30
8. FOR CHOSEN DOMAIN APPLICATION
4
Aim:
To solve Regression & Classification using Decision Trees.
Algorithm:
Step 1: Import the required libraries.
Step 3: Select all the rows and column 1 from the dataset to “X”.
Step 4: Select all of the rows and column 2 from the dataset to “y”.
Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = np.array(
[['Asset Flip', 100, 1000],
['Text Based', 500, 3000],
['Visual Novel', 1500, 5000],
['2D Pixel Art', 3500, 8000],
['2D Vector Art', 5000, 6500],
['Strategy', 6000, 7000],
['First Person Shooter', 8000, 15000],
['Simulator', 9500, 20000],
5
['Racing', 12000, 21000],
['RPG', 14000, 25000],
['Sandbox', 15500, 27000],
['Open-World', 16500, 30000],
['MMOFPS', 25000, 52000],
['MMORPG', 30000, 80000]
])
Output:
[['Asset Flip' '100' '1000']
['Text Based' '500' '3000']
['Visual Novel' '1500' '5000']
['2D Pixel Art' '3500' '8000']
['2D Vector Art' '5000' '6500']
['Strategy' '6000' '7000']
['First Person Shooter' '8000' '15000']
['Simulator' '9500' '20000']
['Racing' '12000' '21000']
['RPG' '14000' '25000']
['Sandbox' '15500' '27000']
['Open-World' '16500' '30000']
['MMOFPS' '25000' '52000']
['MMORPG' '30000' '80000']]
X = dataset[:, 1:2].astype(int)
# print X
print(X)
6
Output:
[[ 100]
[ 500]
[ 1500]
[ 3500]
[ 5000]
[ 6000]
[ 8000]
[ 9500]
[12000]
[14000]
[15500]
[16500]
[25000]
[30000]]
y = dataset[:, 2].astype(int)
print(y)
Output:
[1000 3000 5000 8000 6500 7000 15000 20000 21000 25000 27000 30000
52000 80000]
Output:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
7
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=0, splitter='best')
y_pred = regressor.predict([[3750]])
print("Predicted price: % d\n"% y_pred)
Output:
Predicted price: 8000
plt.ylabel('Profit')
plt.show()
8
Output:
9
Result:
The above experiment of Solving Regression & Classification using
Decision Trees is executed and output verified successfully.
10
Exp No 2a: Bagging Applications Using Regression Trees.
Aim:
To Solve Bagging, applications using Regression Trees.
Algorithm:
Step 1: Start
Step 6: stop
Program:
Bagging Examples
11
1 model = BaggingClassifier()
1 # evaluate the model
1 cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
2 n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
1 error_score='raise')
3 # report performance
1 print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
4
1
5
1
6
OUTPUT:
Result:
12
The above experiment of Solving Bagging applications using Regression
Trees. is executed and output verified successfully.
Aim:
To Solve Boosting applications using Regression Trees.
Algorithm:
Step 1: Start
Step 6: stop
Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import style
style.use('fivethirtyeight')
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
import scipy.stats as sps
13
# Load in the data and define the column labels
dataset = pd.read_csv('data\mushroom.csv',header=None)
dataset = dataset.sample(frac=1)
dataset.columns = ['target','cap-shape','cap-surface','cap-color','bruises','odor','gill-
attachment','gill-spacing',
'gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-
ring','stalk-surface-below-ring','stalk-color-above-ring',
'stalk-color-below-ring','veil-type','veil-color','ring-number','ring-
type','spore-print-color','population',
'habitat']
# Encode the feature values from strings to integers since the sklearn
DecisionTreeClassifier only takes numerical values
for label in dataset.columns:
dataset[label] = LabelEncoder().fit(dataset[label]).transform(dataset[label])
Tree_model = DecisionTreeClassifier(criterion="entropy",max_depth=1)
X = dataset.drop('target',axis=1)
Y = dataset['target'].where(dataset['target']==1,-1)
predictions = np.mean(cross_validate(Tree_model,X,Y,cv=100)['test_score'])
OUTPUT:
14
Result:
The above experiment of Solving Boosting applications using Regression
Trees. is executed and output verified successfully.
Aim:
To solve Data & Text Classification using Neural Networks
Algorithm:
5. Abstract
Program:
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
# initialize our bag of words
bag = []
15
# list of tokenized words for the pattern
pattern_words = doc[0]
# stem each word
pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
# create our bag of words array
for w in words:
bag.append(1) if w in pattern_words else bag.append(0)
training.append(bag)
# output is a '0' for each tag and '1' for current tag
output_row = list(output_empty)
output_row[classes.index(doc[1])] = 1
output.append(output_row)
# sample training/output
i=0
w = documents[i][0]
print ([stemmer.stem(word.lower()) for word in w])
print (training[i])
print (output[i])
Output:
['how', 'ar', 'you', '?']
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0]
16
Result:
The above experiment of Solving Data & Text Classification using Neural
Networks Trees .is executed and output verified successfully.
Aim:
To Solve Data & Text Clustering using K-means algorithm
Algoithm:
Program:
def
elbow_method(Y_sklearn)
:
"""
This is the function used to get optimal number
of clusters in order to feed to the k-means
clustering algorithm.
"""
17
number_clusters = range(1, 7) # Range of
possible clusters that can be generated
kmeans = [KMeans(n_clusters=i, max_iter =
600) for i in number_clusters] # Getting no. of
clusters
score =
[kmeans[i].fit(Y_sklearn).score(Y_sklearn) for i
in range(len(kmeans))] # Getting score
corresponding to each cluster.
score = [i*-1 for i in score] # Getting list of
positive scores.
plt.plot(number_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Method')
plt.show()
elbow_method(Y_sklearn)
# Optimal Clusters = 2
OUTPUT:
18
Result:
The above experiment of Solving Data & Text Clustering using K-means
algorithm Trees .is executed and output verified successfully.
Exp No: 5 Data & Text Clustering Using Gaussian Mixture Models
Aim:
To solve Data & Text Clustering using Gaussian Mixture Models
Algorithm:
Initialize the mean μkμk, the covariance matrix ΣkΣk and the mixing
coefficients πkπk by some random values(or other values).
Compute the CkCk values for all k.
Again Estimate all the parameters using the current \C_k values.
Compute log-likelihood function.
Put some convergence criterion
If the log-likelihood value converges to some value (or if all the parameters
converge to some values) then stop, else return to Step 2.
This algorithm only guarantee that we land to a local optimal point, but it do not
guarantee that this local optima is also the global one. And so, if the algorithm
19
starts from different initialization points, in general it lands into different
configurations.
Program:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
/kaggle/input/ccdata/CC GENERAL.csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn import metrics
In [3]:
raw_df = pd.read_csv('../input/ccdata/CC GENERAL.csv')
raw_df = raw_df.drop('CUST_ID', axis = 1)
raw_df.fillna(method ='ffill', inplace = True)
raw_df.head(2)
Out[3]:
20
P
B
O U C C M P
A IN
N R PU AS A I R
L ST ON P
E C C RC H_ S C N C
A A EO U
O A H HA A H R I _
N P LL FF R
F S A SES D _ E P M F
B C U M _P C
F H S _IN V A D A U U T
A E R E UR H
_ _ E ST A D I Y M L E
L _ C N CH A
P A S_ AL N V T M _ L N
A F H TS AS S
U D F LM CE A _ E P _ U
N R A _P ES E
R V R EN _F N L N A P R
C E S U _F S
C A E TS_ RE C I T Y A E
E Q E R RE _
H N Q FRE Q E M S M Y
U S C QU T
A C U QU UE _ I E M
E H EN R
S E E EN N T T N E
N AS CY X
E N CY C R T N
C ES
S C Y X S T
Y
Y
2
4 1
0 0
0 3 0.
. 1 1
. 9. 0
0. 9 0 0. 0 .
9 0.08 5 0
81 5 0. 95. 0 16 0.0 0 8 1
0 0 0.0 333 0 2 0 0
81 . 0 4 0 66 0 0 0 2
0 3 9 0
82 4 0 67 . 2
7 7 0
0 0 0
4 8 0
0 8
9 7
4
21
P
B
O U C C M P
A IN
N R PU AS A I R
L ST ON P
E C C RC H_ S C N C
A A EO U
O A H HA A H R I _
N P LL FF R
F S A SES D _ E P M F
B C U M _P C
F H S _IN V A D A U U T
A E R E UR H
_ _ E ST A D I Y M L E
L _ C N CH A
P A S_ AL N V T M _ L N
A F H TS AS S
U D F LM CE A _ E P _ U
N R A _P ES E
R V R EN _F N L N A P R
C E S U _F S
C A E TS_ RE C I T Y A E
E Q E R RE _
H N Q FRE Q E M S M Y
U S C QU T
A C U QU UE _ I E M
E H EN R
S E E EN N T T N E
N AS CY X
E N CY C R T N
C ES
S C Y X S T
Y
Y
4 9 0
4
6 4 3
0
7 5 2 2
0 2
4 4 5 2
1
1 8 9
7
6 3 7
In [4]:
# Standardize data
scaler = StandardScaler()
scaled_df = scaler.fit_transform(raw_df)
22
# Normalizing the Data
normalized_df = normalize(scaled_df)
# Converting the numpy array into a pandas DataFrame
normalized_df = pd.DataFrame(normalized_df)
# Reducing the dimensions of the data
pca = PCA(n_components = 2)
X_principal = pca.fit_transform(normalized_df)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
X_principal.head(2)
Out[4]:
P1 P2
0 -0.489949 -0.679976
1 -0.519098 0.544828
In [5]:
gmm = GaussianMixture(n_components = 3)
gmm.fit(X_principal)
Out[5]:
GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
23
means_init=None, n_components=3, n_init=1, precisions_init=None,
random_state=None, reg_covar=1e-06, tol=0.001, verbose=0,
verbose_interval=10, warm_start=False, weights_init=None)
In [6]:
# Visualizing the clustering
plt.scatter(X_principal['P1'], X_principal['P2'],
c = GaussianMixture(n_components = 3).fit_predict(X_principal), cmap
=plt.cm.winter, alpha = 0.6)
plt.show()
Result:
The above experiment of Solving Data & Text Clustering using Gaussian
Mixture Models is executed and output verified successfully.
Aim:
To solve Data & Text Clustering using Gaussian Mixture Models
24
Algorithm:
Root Node: It represents the entire population or sample and this further
Leaf / Terminal Node: Nodes do not split is called Leaf or Terminal node.
tree.
Parent and Child Node: A node, which is divided into sub-nodes is called a
parent node of sub-nodes whereas sub-nodes are the child of a parent node.
Program:
Now, lets draw a Decision Tree for the following data using Information
gain.
Training set: 3 features and 2 classes
25
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Output:
Split on feature X
26
Split on feature Y
Split on feature Z
27
Result:
The above experiment of root node attribute selection for decision trees
using information gain is executed and output verified successfully.
28
Exp No: 7 Bayesian Inference In Gene Expression Analysis
Aim:
To solve Bayesian Inference in Gene Expression Analysis.
Algorithm:
Start
Omics techniques have changed the way we depict the molecular features of
a cell.
The integrative and quantitative analysis of omics data raises unprecedented
expectations for understanding biological systems on a global scale.
However, its inherently noisy nature, together with limited knowledge of
potential sources of variation impacting health and disease, require the use
of proper mathematical and computational methods for its analysis and
integration.
Bayesian inference of probabilistic models allows propagation of the
uncertainty from the experimental data
Stop
Program:
29
Output:
P(y|θ)~N(θ,σ)
Posterior: P(θ = t|y) is the probability that the parameter θ takes a value t,
knowing now that we have observed the data y. Given the data and the a
priori distribution of the parameter of interest, we want to infer the probability of
the different possible values of the parameters of interest. Using the Bayes
Theorem:
Output:
In complex models with many parameters, the relationships between the data and
the parameters, and among parameters, can be represented in Directed Acyclic
Graphs (DAGs), which are useful representations of complex probabilistic models
and show their modular structure (10). Figure 2 shows the DAG for the Bayesian
inference problem of trying to infer differences in gene expression across two
conditions. In particular, DAGs help us to decompose the joint priors and posterior
distributions through the chain rule, paying attention only to the parents of each
parameter:
Output:
P(θ1,…,θn)=∏i=1nP(θi|parents(θi))
Result:
The above experiment of solving Bayesian Inference in Gene Expression
Analysis.
30
Exp No: 8 Pattern Recognition Application Using Bayesian Inference.
Aim:
To solve Pattern Recognition Application using Bayesian
Inference.
Algorithm:
Start
Electromyogram (EMG) has been utilized to interface signals for prosthetic
hands and information devices
A scale mixture model is a stochastic EMG model in which the EMG
variance is considered as a random variable, enabling the representation of
uncertainty in the variance.
his model is extended in this study and utilized for EMG pattern
classification.
The proposed method is trained by variational Bayesian learning, thereby
allowing the automatic determination of the model complexity.
Stop
Procedure:
31
PROGRAM:
32
OUTPUT:
[1 1 0 2 2 0 1 2 1 0 2 1 2 0 2]
[2 1 0 2 2 0 1 2 1 0 2 1 2 0 2]
Accuracy in percent: 93.33
Result:
The above experiment of solving pattern recognition application using
bayesian inference.
33
Exp No: 9 Bagging In Classification
Aim:
To solve Bagging in classification.
Algorithm:
Start
A Bagging classifier is an ensemble meta-estimator that fits base classifiers
each on random subsets of the original dataset
Such a meta-estimator can typically be used as a way to reduce the variance
of a black-box estimator
Each base classifier is trained in parallel with a training set which is
generated by randomly drawing.
Stop
Program:
Since Bagging resamples the original training dataset with replacement, some
instance(or data) may be present multiple times while others are left out.
Original training dataset: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Resampled training set 1: 2, 3, 3, 5, 6, 1, 8, 10, 9, 1
Resampled training set 2: 1, 1, 5, 6, 3, 8, 9, 10, 2, 7
Resampled training set 3: 1, 5, 8, 9, 2, 10, 9, 7, 5, 4
Algorithm for the Bagging classifier:
Classifier generation:
Classification:
for each of the t classifiers:
predict class of instance using classifier.
return class that was predicted most often.
34
Below is the Python implementation of the above algorithm:
seed = 8
kfold = model_selection.KFold(n_splits = 3,
random_state = seed)
# bagging classifier
model = BaggingClassifier(base_estimator = base_cls,
n_estimators = num_trees,
random_state = seed)
35
Output:
accuracy :
0.8372093023255814
Result:
The above experiment of solving Bagging in classification is executed and
output verified successfully.
36
Using Weka Tool For SVM Classification For
Exp No: 10
Chosen Domain Application.
Aim:
To solve using weka tool for SVM classification for chosen
domain application.
Algorithm:
PROGRAM:
OUTPUT:
RESULT:
Thus by using Weka tool for SVM classification for chosen domain
application is successfully executed and verified successfully.
38