DWM Lab Manual 3yr 1
(21CS3052R)
S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M), Data and Results (10M), Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature
1. Basic Statistical Descriptions
2. To implement data pre-processing techniques
3. To implement principal component analysis
4. Classification using Decision Trees
5. Classification using K Nearest Neighbor
6. Classification using Bayesian Classifiers
7. Classification using Backpropagation
8. Association Rule Mining - Apriori
9. Implementation of K-Means Clustering
10. Classification: Support Vector Machine (SVM)
11. Rule-Based Classification
Pre-lab
1. What are the various ways to measure the central tendency of data?
4. Observe the following diagrams and identify the quantile plot and the q-q plot. How is a q-q plot different from a quantile plot?
6. Identify the symmetric data, positively skewed data and negatively skewed data from the graphs below.
In-lab
1. Given a dataset "cars" for analysis; it includes the variables speed and distance. (Download the dataset from LMS.)
a) What are the average speed and distance of the cars?
b) What are the median and midrange of the data?
c) Find the mode of the data and comment on the data modality (i.e., unimodal or bimodal).
d) What are the variance and the standard deviation of the data?
e) Find the five-number summary of the data.
f) Show the histogram and box plot of the data.
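A minimal Python sketch for these sub-questions is given below; it assumes the LMS file has been saved as cars.csv with numeric columns named speed and dist (adjust the names to match the downloaded file):

import pandas as pd
import matplotlib.pyplot as plt

cars = pd.read_csv("cars.csv")

for col in ["speed", "dist"]:
    s = cars[col]
    print(f"--- {col} ---")
    print("mean     :", s.mean())
    print("median   :", s.median())
    print("midrange :", (s.min() + s.max()) / 2)     # midrange = (min + max) / 2
    print("mode(s)  :", list(s.mode()))              # more than one value => multimodal
    print("variance :", s.var())
    print("std dev  :", s.std())
    # five-number summary: min, Q1, median, Q3, max
    print("five-number summary:",
          [s.min(), s.quantile(0.25), s.median(), s.quantile(0.75), s.max()])

# histogram and box plot of both variables
cars.hist(column=["speed", "dist"])
plt.show()
cars.boxplot(column=["speed", "dist"])
plt.show()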
Writing space of the Problem: (For Student's use only)
Post-lab
1. Suppose that a hospital tested the age and body-fat data for 18 randomly selected adults with the following results. (Download the dataset from LMS.)
a) Find the maximum and minimum percentage of body fat and the maximum and minimum age of the adults who visited the hospital.
b) Calculate the mean, median and midrange of the age.
c) Find the first quartile and third quartile of the data.
d) Draw a scatter plot and a q-q plot based on these two variables.
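One possible Python sketch, assuming the LMS file is saved as age_bodyfat.csv with numeric columns named age and fat (both names are assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("age_bodyfat.csv")
age, fat = df["age"], df["fat"]

print("max/min fat:", fat.max(), fat.min())
print("max/min age:", age.max(), age.min())
print("mean, median, midrange of age:",
      age.mean(), age.median(), (age.min() + age.max()) / 2)
print("Q1, Q3 of age:", age.quantile(0.25), age.quantile(0.75))
print("Q1, Q3 of fat:", fat.quantile(0.25), fat.quantile(0.75))

# scatter plot of age vs. % body fat
plt.scatter(age, fat)
plt.xlabel("age"); plt.ylabel("% body fat"); plt.title("Scatter plot")
plt.show()

# q-q plot: quantiles of age plotted against the corresponding quantiles of % fat
q = np.linspace(0, 1, 100)
plt.plot(np.quantile(age, q), np.quantile(fat, q), "o")
plt.xlabel("age quantiles"); plt.ylabel("% fat quantiles"); plt.title("q-q plot")
plt.show()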
Viva Voce:-
1. Difference between symmetric data and skewed data.
2. What are the most widely used forms of quartiles?
3. Variance and Standard deviation fall under what category of measuring data?
4. What do low and high standard deviations indicate?
5. Based on what condition, two variables are said to be correlated?
(For Evaluator's use only)
DATA PREPROCESSING:
Databases are highly susceptible to noisy, missing, and inconsistent data because of their typically enormous size (frequently several gigabytes or more). Low-quality data will lead to low-quality mining results; pre-processing helps to obtain quality data. The steps involved in data pre-processing are data cleaning, data integration, data reduction, and data transformation.
Match the following:
i. Data cleaning                 a. Reduced representation of data
ii. Data integration             b. x_old / x_max
iii. Data reduction              c. Deals with missing values and noisy data
iv. Normalization                d. Works to remove noisy data
v. Data transformation           e. (x_old - x_min) / (x_max - x_min)
vi. Decimal scaling              f. Merging of data from multiple data stores
vii. Min-max normalization       g. Scales the data values into a specified range
viii. Z-score normalization      h. Converts data into appropriate forms
ix. Smoothing                    i. (x_old - mean) / standard deviation
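The three normalization formulas from the matching exercise can be tried out on a small made-up list of values, for example:

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 850.0])

# min-max normalization to [0, 1]: (x_old - x_min) / (x_max - x_min)
min_max = (x - x.min()) / (x.max() - x.min())

# z-score normalization: (x_old - mean) / standard deviation
z_score = (x - x.mean()) / x.std()

# decimal scaling: divide by 10^j, where j is the smallest integer with max(|x'|) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")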
Writing space of the Problem: (For Student's use only)
Post-lab
1. Data: (13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data. Also plot a histogram.
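A short Python sketch of smoothing by bin means with a bin depth of 3, followed by a histogram of the data:

import matplotlib.pyplot as plt

data = sorted([13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
               33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])
depth = 3

smoothed = []
for i in range(0, len(data), depth):
    bin_values = data[i:i + depth]                  # one bin of (up to) `depth` sorted values
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))
    print(f"Bin {i // depth + 1}: {bin_values} -> mean {bin_mean:.2f}")

print("Smoothed data:", smoothed)

plt.hist(data, bins=10)
plt.title("Histogram of the original data")
plt.show()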
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. What are the factors that comprise data quality?
2. What do you mean by noise in the dataset?
3. What are outliers in the dataset?
4. What is discretization?
5. What is the difference between lossy and lossless data reduction?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Pre-lab:-
Principal Component Analysis:
Principal Component Analysis is a method of extracting important variables from a large set of variables available in a dataset. Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
1. What are principal components?
2. Mention the steps to construct principal components.
In-lab:-
2. Calculate the principal components for the matrix given in Q1 using PCA.
Post-lab:-
1. Pollution has been a concern since industrialization due to its effects on human lives and the planet. According to WHO, air pollution contributes to 7 million premature deaths per annum. A report is generated on the quality of air over 5 months. It is found that the data within the reported dataset are correlated. So, perform a strategic method to reconstruct the dataset with 2 components. Also visualize a graph between the two components. (Download the dataset from LMS.)
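A hedged sketch using scikit-learn, assuming the LMS file is saved as air_quality.csv (file name assumed) and contains only numeric, correlated measurements:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("air_quality.csv")

# standardize so that attributes measured on different scales contribute equally
X = StandardScaler().fit_transform(df.select_dtypes("number"))

# project the correlated attributes onto 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(components[:, 0], components[:, 1])
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title("Data projected onto 2 principal components")
plt.show()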
Writing space of the Problem: (For Student's use only)
Viva Voce:-
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #4: Classification using Decision Trees
Date of the Session: / /    Time of the Session: to
Pre-lab:-
1. What are the attribute selection measures in modeling a decision tree? Write the respective equations for each of them.
2. What do you mean by entropy in a decision tree? How is it calculated?
3. What is information gain and how does it matter in a decision tree?
4. List out the parameters involved in DecisionTreeClassifier and export_graphviz and try to understand the role of each parameter.
5. Match the following:
   1. ID3      a. Gain Ratio
   2. CART     b. Information Gain
   3. C4.5     c. Gini Index
1. Implement the decision tree algorithm on the given data, which has weight and smoothness as the segregating criteria for the fruits apple and orange. Apple is represented by the number '1' and orange by '0'. Construct a decision tree and apply the prediction measures on the given data to obtain the types of fruits.

Weight   Smooth   Fruit
180      7        ?
140      8        ?
150      5        ?

Fruit dataset:
https://drive.google.com/file/d/1qoMDjozHHELVn5tFAJxp8mMw0Ggt-BVX/view?usp=sharing

Convert the trained decision tree classifier into a graphviz object. Later, we use the converted graphviz object for visualization. To visualize the decision tree, you just need to open the .txt file, copy its contents and paste them in the graphviz web portal. Graphviz web portal address: http://webgraphviz.com
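A minimal sketch of the idea, using a small hypothetical training set (weight in grams, smoothness on a 1-10 scale, 1 = apple, 0 = orange) in place of the linked Fruit dataset:

from sklearn.tree import DecisionTreeClassifier, export_graphviz

# assumed training values; replace with the rows from the downloaded Fruit dataset
train_X = [[170, 9], [175, 8], [160, 9], [130, 3], [135, 4], [145, 5]]
train_y = [1, 1, 1, 0, 0, 0]

clf = DecisionTreeClassifier().fit(train_X, train_y)

# predict the three unknown fruits from the in-lab table (1 = apple, 0 = orange)
print(clf.predict([[180, 7], [140, 8], [150, 5]]))

# dump the trained tree as a graphviz .txt file; paste its contents at http://webgraphviz.com
export_graphviz(clf, out_file="fruit_tree.txt",
                feature_names=["weight", "smooth"], class_names=["orange", "apple"])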
Writing space of the Problem: (For Student's use only)
2. Below given is the diabetes dataset.
(Ref: https://drive.google.com/file/d/1PJizP39JPh_T-5dQVcUVCfswrPSxT734/view?usp=sharing)
Make sure to install the scikit-learn package and other required packages.
1. Find the correlation matrix for the diabetes dataset.
2. Split the dataset into train_set and test_set for modeling and prediction. Divide the dataset in such a way that the training dataset constitutes 70 percent of the original dataset and the rest belongs to the test dataset.
3. Produce a decision tree model using
   a. the Gini index metric
   b. the Entropy and Information gain metric
   on the training dataset using the DecisionTreeClassifier function.
4. Apply the prediction measures on the test dataset.
5. Define a function named accuracy_score by interpreting the difference between the predicted values and the test set values. Display the accuracy in terms of
   a. a fraction, using the accuracy_score function
   b. the number of correct predictions.
6. Print the confusion matrix of the test dataset.
7. Calculate the following values manually after obtaining the confusion matrix:
   a. Accuracy
   b. Error rate
   c. Precision
   d. Recall (sensitivity)
   e. F1 Score
   f. Specificity
8. Compare the two results (obtained from the two kinds of metrics) and state which method is more accurate for this dataset. Convert the trained decision tree classifier into a graphviz object. Later, we use the converted graphviz object for visualization.
9. Plot the ROC curve and calculate the AUC.
10. Plot the recall vs precision curve.
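One possible sketch of these steps, assuming the downloaded file is saved as diabetes.csv with the class label in the last column (named Outcome in the common Pima file):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc, precision_recall_curve

df = pd.read_csv("diabetes.csv")
print(df.corr())                                   # step 1: correlation matrix

X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)   # step 2: 70/30 split

def accuracy_score(actual, predicted):             # step 5: hand-rolled accuracy
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual), correct

for criterion in ["gini", "entropy"]:              # step 3: the two splitting metrics
    clf = DecisionTreeClassifier(criterion=criterion, random_state=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)                   # step 4: prediction on the test set
    frac, correct = accuracy_score(list(y_test), list(y_pred))
    print(criterion, "accuracy:", frac, "correct predictions:", correct)
    print(confusion_matrix(y_test, y_pred))        # step 6: confusion matrix

# steps 9-10: ROC curve with AUC, and recall vs precision, for the last fitted model
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate"); plt.legend(); plt.show()

precision, recall, _ = precision_recall_curve(y_test, probs)
plt.plot(recall, precision)
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.show()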
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. What is the C4.5 algorithm and how does it work? State the differences between ID3 and C4.5.
2. Differentiate between over-fitting and under-fitting. Why does over-fitting occur during classification?
3. Explain the concept of pruning and why it is important. Differentiate between pre-pruning and post-pruning.
Viva Voce:-
1. What is the difference between supervised and unsupervised machine learning?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #5: Classification using K Nearest Neighbour
Pre-requisite:
In LMS: Find the file named "Concept of k-Nearest-Neighbor.doc". Read the specified document and answer the below questions.
Pre-lab:-
2. List the industrial uses of the k-nearest-neighbor algorithm in the real world.
3. Write an algorithm for k-nearest-neighbor classification given k, the number of nearest neighbors, and n, the number of attributes describing each tuple.
In-lab:-
Perform the following analysis:
The step-by-step process to compute the k-nearest-neighbor algorithm is:
1. Determine parameter k = number of nearest neighbors.
2. Calculate the distance between the test sample and the training samples.
3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance.
4. Gather the categories of the nearest neighbors.
5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the testing sample.
Dataset:
Suppose we have the following "StudentDataSet" dataset which consists of 1st year CGPA, 2nd year CGPA and Category (C: CRT, NC: Non-CRT) as parameters.
When a new student comes only with 1st year CGPA and 2nd year CGPA as information, predict the category of that new student (whether he belongs to CRT or Non-CRT) by the Euclidean distance measure, where the Euclidean distance between 2 points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
d(X1, X2) = √( ∑i (x1i − x2i)² )
Test sample:
The 1st year CGPA and 2nd year CGPA of the new student are 8.4 and 7.1 respectively. (Consider k = 3.)
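A minimal k-NN sketch (k = 3, Euclidean distance); the StudentDataSet rows are not reproduced here, so the list below uses made-up (CGPA1, CGPA2, Category) tuples that should be replaced by the values from the manual's table:

import math
from collections import Counter

# assumed training tuples; replace with the StudentDataSet rows
students = [(9.1, 8.8, "C"), (8.6, 8.2, "C"), (7.9, 8.0, "C"),
            (6.2, 6.5, "NC"), (5.8, 6.1, "NC"), (6.9, 6.4, "NC")]
test = (8.4, 7.1)
k = 3

def euclidean(a, b):
    # d(X1, X2) = sqrt(sum_i (x1i - x2i)^2)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# steps 2-5: compute distances, sort, take the k nearest, then majority vote
by_distance = sorted(students, key=lambda row: euclidean(row[:2], test))
nearest = by_distance[:k]
prediction = Counter(cat for *_, cat in nearest).most_common(1)[0][0]
print("Nearest neighbours:", nearest)
print("Predicted category:", prediction)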
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Predict the category of the student with 1st year CGPA and 2nd year CGPA as 7.3 and 7.1 respectively using the Manhattan measuring technique with k = 3 (manually). Note: The Manhattan distance between two tuples (or points) a and b is defined as ∑i |ai − bi|.
2. By considering the above StudentDataSet, predict the category of the new student having 1st year CGPA and 2nd year CGPA as 8.4 and 7.1 respectively, by implementing Python code using the Manhattan distance measure in order to find the nearest neighbors for k = 3, and check whether the output is the same for both measuring techniques or not.
Viva Voce:-
Refer to pages 423-425 in Han J & Kamber M, "Data Mining: Concepts and Techniques", Third Edition, Elsevier, 2011.
1. k-nearest-neighbor is a lazy learning algorithm. Justify.
2. How can the distance be computed for attributes that are not numeric, but nominal (or categorical), such as color?
3. List some techniques used to speed up the classification time.
4. If the value of a given attribute A is missing in tuple X1 and/or in tuple X2, the difference is always taken to be ______?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #6: Classification using Bayesian Classifiers
Pre-lab:-
1. Match the following:
   Column A                            Column B
   a. Naive Bayesian Classification    a. Values are continuous
   b. Bayesian belief network          b. Attributes conditionally dependent
   c. Gaussian distribution            c. To avoid zero probability
   d. Laplace estimator                d. Attributes conditionally independent
2. Explain Bayes' theorem and write its derived formula.
3. Suppose we have continuous values for an attribute in a dataset; then how do we calculate the probability?
4. Let us assume
   P(age = youth | buys_car = yes) = 0.222,
   P(income = medium | buys_car = yes) = 0.444 and
   P(buys_car = yes) = 0.643.
   Find the probability P(X | buys_car = yes), where X = (income = medium, age = youth).
In-lab:-
1. Consider the given table named "Weather_cond.csv" consisting of the attributes Temperature, Humidity, Windy and a class label named "Outcome". Depending on the weather conditions, you have to choose whether to play cricket or not.
a. Unlike the conventional function, write a Python function to split the dataset into a training set and a test set. Assume the test size length as 0.33.
b. Write a Python function to calculate the mean and standard deviation for each numerical attribute in the dataset.
c. Calculate the number of priors for the given dataset after splitting into training and test sets using Python.
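A possible sketch for parts (a)-(c), assuming Weather_cond.csv has numeric Temperature and Humidity columns, a Windy column and the class label Outcome:

import random
import pandas as pd

def train_test_split_custom(df, test_size=0.33, seed=1):
    """(a) shuffle the row indices and cut them into train and test parts."""
    idx = list(df.index)
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * test_size)
    return df.loc[idx[cut:]], df.loc[idx[:cut]]      # train, test

def mean_std(df):
    """(b) mean and standard deviation of every numerical attribute."""
    return {col: (df[col].mean(), df[col].std())
            for col in df.select_dtypes("number").columns}

df = pd.read_csv("Weather_cond.csv")
train, test = train_test_split_custom(df, test_size=0.33)
print(mean_std(train))

# (c) priors: relative frequency of each class label in the training set
priors = train["Outcome"].value_counts(normalize=True)
print(priors)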
Writing space of the Problem: (For Student's use only)
2. The problem is comprised of 100 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute. Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0). This is a standard dataset that has been studied a lot in the machine learning literature. A good prediction accuracy is 70%-76%.
Implement Python code to find the accuracy for the given dataset named "Diabetes.csv" based on a train set and test set. Take the test size length as 0.4.
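A minimal sketch, assuming Diabetes.csv holds only numeric attributes with the class value (1/0) in the last column:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

df = pd.read_csv("Diabetes.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

# 60/40 split as asked (test size length = 0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

model = GaussianNB().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))   # expect roughly 0.70-0.76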
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Consider the given table that specifies a loan classification problem.
Viva Voce:-
1. Explain the difference between a validation set and a test set.
2. What are the three types of Naive Bayes classifier?
3. How many terms are required for building a Bayes model?
4. What are a training set and a testing set?
5. What are the advantages of Naive Bayes?
(For Evaluator's use only)
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #7: Classification using Backpropagation
Date of the Session: / /    Time of the Session: to
Pre-lab:-
In LMS: Find the file named "Han J & Kamber M, Data Mining Concepts and Techniques.doc". Read the specified document from Pg. No: 398-404 and answer the below questions.
1. State whether the given statement is True/False.
   a. Backpropagation is a neural network learning algorithm.
3. Explain about the Multilayer Feed-Forward Neural Network with a diagram.
5. Consider the following table.
Input   Desired Output   Model Output   Absolute Error   Square Error
0       0
1       2
2       4
Predict the Model Output by considering the initial value of the weight as 3. Find the Absolute Error and Square Error. Use the Backpropagation algorithm to update the weight and try to minimize the square error as much as possible.
Hint:
i. Model Output = W * I(x) (W = weight, I = Input, x = index that iterates from 0 to length(Input))
ii. Absolute Error = |Model Output − Desired Output|
iii. Square Error = (Absolute Error)^2
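A small sketch of the weight-update idea for this table: the model output is W * x, W starts at 3, and W is nudged in the direction that reduces the squared error (the exact answer is W = 2):

inputs = [0, 1, 2]
desired = [0, 2, 4]
W, lr = 3.0, 0.1

for epoch in range(20):
    total_sq_error = 0.0
    for x, d in zip(inputs, desired):
        out = W * x                        # model output
        error = out - d                    # signed error; |error| is the absolute error
        total_sq_error += error ** 2       # square error
        W -= lr * 2 * error * x            # gradient of (W*x - d)^2 w.r.t. W is 2*error*x
    print(f"epoch {epoch + 1}: W = {W:.4f}, total square error = {total_sq_error:.4f}")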
Writing space of the Problem: (For Student's use only)
In-lab:-
Analysis:
The following steps will provide the foundation that you need to implement the Backpropagation algorithm and apply it to your own predictive modelling problems:
1. Initialize Network.
2. Forward Propagate.
   i. Neuron Activation.
   ii. Neuron Transfer.
   iii. Forward Propagation.
3. Back Propagate Error.
   i. Transfer Derivative.
   ii. Error Backpropagation.
4. Train Network.
   i. Update Weights.
   ii. Train Network.
5. Test Network.
Dataset:
Suppose we have the following "Results Dataset" which consists of the GPAs that some students scored in two internal tests. It also consists of another attribute named 'Qualified', which holds a character (Q/NQ) representing the student's qualification for the final examination.
S.No   Test-1   Test-2   Qualified
1 8.5 8.5 Q
2 8.2 9.0 Q
3 3.5 5.0 NQ
4 5.5 4.5 NQ
5 9.2 9.0 Q
6 7.8 7.3 Q
7 8.0 3.1 NQ
8 10 7.0 Q
9 4.5 6.0 NQ
10 6.8 7.1 Q
11 5.1 4.1 NQ
12 4.2 5.3 NQ
Problem: Train a network on the above "Results Dataset" by applying the Backpropagation algorithm.
a. Initialize a network with all weights and biases. (Consider weights in the range -0.5 to +0.5, biases = 1, Learning Rate = {0.5, 0.7, 1}.)
b. Train the network according to the dataset. (Consider both activation functions: the Sigmoid function and the Tanh function.)
c. Backpropagate the errors.
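A compact NumPy sketch of a 2-2-1 network trained with backpropagation on the Results Dataset (sigmoid activation, learning rate 0.5, biases initialised to 1, weights drawn from -0.5 to +0.5, Q encoded as 1 and NQ as 0); treat it as one possible starting point rather than the required solution:

import numpy as np

X = np.array([[8.5, 8.5], [8.2, 9.0], [3.5, 5.0], [5.5, 4.5], [9.2, 9.0], [7.8, 7.3],
              [8.0, 3.1], [10, 7.0], [4.5, 6.0], [6.8, 7.1], [5.1, 4.1], [4.2, 5.3]]) / 10.0
y = np.array([[1], [1], [0], [0], [1], [1], [0], [1], [0], [1], [0], [0]])

rng = np.random.default_rng(1)
W1 = rng.uniform(-0.5, 0.5, (2, 2)); b1 = np.ones((1, 2))    # input -> hidden
W2 = rng.uniform(-0.5, 0.5, (2, 1)); b2 = np.ones((1, 1))    # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    for x_row, target in zip(X, y):
        x_row = x_row.reshape(1, -1)
        # forward propagate
        hidden = sigmoid(x_row @ W1 + b1)
        output = sigmoid(hidden @ W2 + b2)
        # backpropagate the errors (transfer derivative of the sigmoid is out * (1 - out))
        out_delta = (output - target) * output * (1 - output)
        hid_delta = (out_delta @ W2.T) * hidden * (1 - hidden)
        # update weights and biases
        W2 -= lr * hidden.T @ out_delta;  b2 -= lr * out_delta
        W1 -= lr * x_row.T @ hid_delta;   b1 -= lr * hid_delta

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print("Training predictions:", (pred > 0.5).astype(int).ravel())
print("Targets             :", y.ravel())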
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Use the network which is trained on the above "Results Dataset" and test whether it is trained with 100% accuracy or not. Then predict the result (qualified for the final examination or not) of a new entry which contains 5.9 and 5.9 as the GPAs of Test-1 and Test-2 respectively.
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. What are the general tasks that are performed with the backpropagation algorithm?
2. What kind of real-world problems can neural networks solve?
3. What is gradient descent?
4. Why is zero initialization not a recommended weight initialization technique?
5. How are artificial neural networks different from normal networks?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #8: Association Rule Mining - Apriori
Date of the Session: / /    Time of the Session: to
Pre-lab:-
1. Define the Apriori algorithm.
2. What is association rule mining?
4. What are minimum support and minimum confidence?
   a. Find all the frequent itemsets using the Apriori algorithm.
   b. Obtain significant decision rules.
In-lab:- For the following given transaction dataset, perform the following operations:
a. Generate rules using the Apriori algorithm on the below dataset (one transaction per line):

shrimp, almonds, avocado, vegetables mix, green grapes, whole wheat flour, yams, cottage cheese
burgers, meatballs, eggs
chutney
turkey, avocado
mineral water, milk, energy bar, whole wheat rice, green tea, eggs
low fat yogurt
whole wheat pasta, french fries
soup, light cream, shallot
frozen vegetables, spaghetti, green tea
french fries
eggs, pet food
cookies
turkey, burgers, mineral water, eggs, cooking oil
spaghetti, champagne, cookies
mineral water, salmon, eggs
mineral water
shrimp, chocolate, chicken, honey, oil, cooking oil, low fat yogurt
turkey, eggs
turkey, fresh tuna, tomatoes, spaghetti, mineral water, black tea, salmon, eggs
french fries
meatballs, milk, honey, protein bar, shampoo
red wine, shrimp, pasta, pepper, eggs, chocolate
rice, sparkling water
spaghetti, mineral water, ham, body spray, pancakes, green tea
burgers, grated cheese, eggs, pasta, avocado, honey, white wine, toothpaste
eggs
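One possible way to generate the rules is the apyori package (pip install apyori); the snippet below shows the pattern with only the first few baskets typed in, and the remaining baskets from the table should be appended in the same way:

from apyori import apriori

transactions = [
    ["shrimp", "almonds", "avocado", "vegetables mix", "green grapes",
     "whole wheat flour", "yams", "cottage cheese"],
    ["burgers", "meatballs", "eggs"],
    ["chutney"],
    ["turkey", "avocado"],
    # ... add the remaining baskets from the table above in the same way ...
]

# illustrative thresholds for support, confidence and lift
rules = apriori(transactions, min_support=0.1, min_confidence=0.5, min_lift=1.0)
for rule in rules:
    print(list(rule.items), "support:", round(rule.support, 3))
    for stat in rule.ordered_statistics:
        print("  ", list(stat.items_base), "->", list(stat.items_add),
              "conf:", round(stat.confidence, 2), "lift:", round(stat.lift, 2))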
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Same as the In-lab question: generate rules on the below dataset (one transaction per line):

citrus fruit, semi-finished bread, margarine, ready soups
tropical fruit, yogurt, coffee
whole milk
pip fruit, yogurt, cream cheese, meat spreads
other vegetables, whole milk, condensed milk, long life bakery product
whole milk, butter, yogurt, rice, abrasive cleaner
rolls/buns
other vegetables, UHT-milk, rolls/buns, bottled beer, liquor (appetizer)
pot plants
whole milk, cereals
tropical fruit, other vegetables, white bread, bottled water, chocolate
citrus fruit, tropical fruit, whole milk, butter, curd, yogurt, flour, bottled water, dishes
beef
frankfurter, rolls/buns, soda
chicken, tropical fruit
butter, sugar, fruit/vegetable juice, newspapers
fruit/vegetable juice
packaged fruit/vegetables
chocolate
specialty bar
other vegetables
butter milk, pastry
whole milk
tropical fruit, cream cheese, processed cheese, detergent, newspapers
tropical fruit, root vegetables, other vegetables, frozen dessert, rolls/buns, flour, sweet spread, salty snack, waffles, candy, bathroom cleaner
bottled water, canned beer
yogurt
sausage, rolls/buns, soda, chocolate
other vegetables
brown bread, soda, fruit/vegetable juice, canned beer, newspapers, shopping bags
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. Who proposed the Apriori algorithm and in which year?
2. What is a frequent itemset?
3. Why do we convert the dataset into a list?
4. What is the formula for support, confidence and lift?
5. How did the algorithm get the name Apriori?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #9: Implementation of K-Means Clustering
Pre-Requisites:
Data pre-processing
Basics of plotting techniques
Various clustering techniques
Pre-lab:-
1. Match the following.
   Parameters     Application
   1. pch         a. To set orientation of axis labels
   2. col         b. No. of plots per row and column
   3. mfrow       c. To set plot color
   4. lwd         d. Plotting symbol
   5. las         e. To set line width
2. List out various parameters and attributes in K-Means clustering.
4. List out various applications of clustering.
5. Describe Euclidean distance and Manhattan distance in brief with their derived formulae.
6. List out the basic steps involved in K-Means clustering.
In-lab:-
1. The given dataset comprises 150 data entries of different countries around the world. It is a report on world happiness, a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes. The records contain various attributes of each country, including positive_effect, negative_effect, corruption, freedom, healthy life expectancy etc. The data frame includes categorical variables and numerical values, and their values vary from country to country.
Implement Python code using scikit-learn to display a K-means clustering plot for the given data frame named "world_happiness_report.csv".
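A hedged scikit-learn sketch, assuming world_happiness_report.csv is in the working directory; only the numeric columns are clustered and two of them (chosen here arbitrarily as the first two) are plotted with the cluster labels as colours:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("world_happiness_report.csv")
numeric = df.select_dtypes("number").dropna()

X = StandardScaler().fit_transform(numeric)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

plt.scatter(numeric.iloc[:, 0], numeric.iloc[:, 1], c=labels)
plt.xlabel(numeric.columns[0]); plt.ylabel(numeric.columns[1])
plt.title("K-means clusters on the world happiness data")
plt.show()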
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. This lab module aims to build an analysis of the customers of a shopping mall. It consists of 150 observations of customers with details that include gender, age, annual_income, spending_score etc. Based on the two parameters annual_income and spending_score, try to build an analysis of the customers through cluster graphs.
Apply K-means clustering on the given dataset named "Mall_customers", marking the number of clusters based on the mean and standard deviation of any two attributes of your choice, and implement K-means iteratively till the centroids get stabilized.
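A sketch of an iterative K-means on Mall_customers.csv, assuming columns named annual_income and spending_score (adjust to the downloaded file); the centroids are recomputed until they stop moving:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Mall_customers.csv")
X = df[["annual_income", "spending_score"]].to_numpy(dtype=float)

k = 3
rng = np.random.default_rng(1)
centroids = X[rng.choice(len(X), k, replace=False)]      # pick k random points to start

while True:
    # assign every customer to its nearest centroid (Euclidean distance)
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
    # recompute each centroid as the mean of its assigned points (keep old one if empty)
    new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    if np.allclose(new_centroids, centroids):             # centroids have stabilised
        break
    centroids = new_centroids

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=200)
plt.xlabel("annual_income"); plt.ylabel("spending_score")
plt.show()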
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. K-means is which type of algorithm?
2. In the K-means clustering algorithm, what is the criterion used by the data points to get separated from one cluster to another?
3. What are the basic steps in K-Means clustering?
4. What does K refer to in the K-means algorithm? (K refers to the number of clusters.)
5. How is the K-means algorithm different from the KNN algorithm?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #10: Classification: Support Vector Machine (SVM)
Pre-lab:-
1. What is SVM?
2. When do we use SVM?
3. What is the maximum marginal hyperplane and what is the equation of the separating hyperplane?
4. What are the two cases of SVM?
5. What are the equations for a point that lies above the separating hyperplane and for a point that lies below the separating hyperplane?
In-lab:-
1. Below is the data of the employees in the company. The data shows whether an employee purchased the software or not. Take the x coordinate as age and the y coordinate as estimated_salary. Now, consider the following dataset and perform the below operations:
UserID Gender Age EstimatedSalary Purchased
15624510 Male 19 19000 0
15810944 Male 35 20000 0
15668575 Female 26 43000 0
15603246 Female 27 57000 0
15804002 Male 19 76000 0
15728773 Male 27 58000 0
15598044 Female 27 84000 0
15694829 Female 32 150000 1
15600575 Male 25 33000 0
15727311 Female 35 65000 0
15570769 Female 26 80000 0
15606274 Female 26 52000 0
15746139 Male 20 86000 0
15704987 Male 32 18000 0
15628972 Male 18 82000 0
15697686 Male 29 80000 0
15733883 Male 47 25000 1
15617482 Male 45 26000 1
15704583 Male 46 28000 1
15621083 Female 48 29000 1
15649487 Male 45 22000 1
15736760 Female 47 49000 1
15714658 Male 48 41000 1
15599081 Female 45 22000 1
15705113 Male 46 23000 1
15631159 Male 47 20000 1
15792818 Male 49 28000 1
15633531 Female 47 30000 1
15744529 Male 29 43000 0
a. Import the dataset into Python
b. Split the dataset into training and testing sets
c. Apply feature scaling on the training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.
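A possible sketch of steps (a)-(f), assuming the table above has been saved as employees.csv (file name assumed) with the same column headings:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("employees.csv")                       # (a) import the dataset
X = df[["Age", "EstimatedSalary"]].to_numpy(dtype=float)
y = df["Purchased"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)   # (b) split

sc = StandardScaler()                                   # (c) feature scaling
X_train, X_test = sc.fit_transform(X_train), sc.transform(X_test)

clf = SVC(kernel="linear").fit(X_train, y_train)        # (d) fit SVM

def plot_region(X_set, y_set, title):                   # (e)/(f) visualise the results
    xx, yy = np.meshgrid(np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
                         np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)                  # decision regions
    plt.scatter(X_set[:, 0], X_set[:, 1], c=y_set)      # the actual points
    plt.title(title); plt.xlabel("Age (scaled)"); plt.ylabel("EstimatedSalary (scaled)")
    plt.show()

plot_region(X_train, y_train, "SVM - training set")
plot_region(X_test, y_test, "SVM - test set")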
Post-lab:-
1. The below dataset represents the bank transactions of KVB bank for an hour. Consider the x coordinate as Balance and the y coordinate as Trtn_amt. Perform the following operations on the given dataset:
S.No transaction_ID Balance Trtn_amt sucornot
1 3467 98687.36 500 0
2 4801 8510.47 100 0
3 2093 2475.3 200 1
4 9933 37743.25 1000 0
5 7178 2705.95 600 0
6 1093 60314 750 1
7 3708 812129.5 280 1
8 3804 8076.25 140 0
9 3192 42323.14 310 1
10 3666 47045.25 2500 0
11 8598 96171.25 6900 0
12 8743 608581.8 8520 1
13 9302 586057.3 410 1
14 6127 4587.5 750 0
15 7502 43597.75 250 0
a. Import the dataset into Python
b. Split the dataset into training and testing sets
c. Apply feature scaling on the training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. What are the advantages of SVM?
2. How many types of machine learning are there, and under which type does SVM fall?
3. What are the tuning parameters in SVM?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #11: Rule-Based Classification
Date of the Session: / /    Time of the Session: to
Pre-requisite:
Refer to pages 355-363 in Han J & Kamber M, "Data Mining: Concepts and Techniques", Third Edition, Elsevier, 2011.
Pre-lab:-
2. Briefly explain building classification rules.
4. List some aspects of sequential covering.
5. What are the characteristics of a rule-based classifier?
6. Define coverage and accuracy.
In-lab:-
1. Implement a simple Python code for rule-based classification on the "AllElectronicsCustomer" database. (Download the dataset from LMS.)
RID   age   income   student   credit_rating   Class: buys_computer
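A minimal illustration of rule-based classification on the AllElectronics schema; the IF-THEN rules below are the hand-made kind usually derived from this example, and the sample tuples are stand-ins for the LMS file:

def classify(record):
    age, income, student, credit = record
    # R1: IF age = youth AND student = yes THEN buys_computer = yes
    if age == "youth" and student == "yes":
        return "yes"
    # R2: IF age = middle_aged THEN buys_computer = yes
    if age == "middle_aged":
        return "yes"
    # R3: IF age = senior AND credit_rating = fair THEN buys_computer = yes
    if age == "senior" and credit == "fair":
        return "yes"
    # R4: IF age = senior AND credit_rating = excellent THEN buys_computer = no
    if age == "senior" and credit == "excellent":
        return "no"
    # default rule when no other rule fires
    return "no"

samples = [("youth", "high", "no", "fair"),
           ("middle_aged", "low", "yes", "excellent"),
           ("senior", "medium", "no", "excellent")]
for s in samples:
    print(s, "->", classify(s))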
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Extract possible classification rules from the given decision tree.
3. Difference between decision tree and rule-based classification.
Viva Voce:-
1. Rule-based classifiers classify records by using a collection of rules.
2. Most rule-based classification systems use which strategy?
3. Difference between class-based ordering and rule-based ordering.
4. Briefly explain the below terms in your own words:
   a. Mutually exclusive
   b. Exhaustive
5. Name the terms that define the following statements:
   a. Fraction of records that satisfy only the antecedent of a rule.
   b. Fraction of records that satisfy both the antecedent and consequent of a rule.
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #12: Outlier Detection
Date of the Session: / /    Time of the Session: to
Pre-lab:-
1. What do you mean by an outlier? What are the main causes of outliers?
3. Why is outlier detection necessary in data analysis?
4. How do we calculate the z-score?
5. Consider the below dataset which comprises the income (in thousands) of 15 people in an organisation.
[45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
What do you observe from the above data? Is there any significant difference between the incomes of a few employees? If so, what could be the reason for it?
In-lab:-
1. The dataset Boston house prices consists of 9 attributes: CRIM, ZN, INDUS, LSTAT, NOX, RM, DIS, RAD, TAX. The description of each attribute:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
Boston dataset: https://drive.google.com/file/d/1YVYWQWPKsLX1UM-0XCnGCwD1NIi7_uIv/view?usp=sharing
Writing space of the Problem: (For Student's use only)
2. Consider the iris dataset. It includes three iris species with 50 samples each as well as some properties of each flower.
https://drive.google.com/file/d/1HEEMrAQqAynHdM5TmK0G-mD5Qr0OW2J8/view?usp=sharing
Import the csv file and use the boxplot method to visualise the outliers considering the 4 properties of a flower. You will notice that one of the properties has outliers.
1. Considering the range of the outliers from the visualisation, display the observations which have outliers.
2. Implement a DBSCAN model fitting on the dataset, taking the epsilon value as 0.8 and the minimum samples value as 19.
3. Print the counter values using the Counter function on the model labels.
4. Considering the values obtained from the model labels, print the outliers of the data.
5. Draw a scatter plot between petal length and sepal width to visualise the outliers.
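A hedged sketch of these steps, assuming the downloaded csv is saved as iris.csv with the four numeric measurement columns in the usual order (sepal length, sepal width, petal length, petal width):

from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

iris = pd.read_csv("iris.csv")
features = iris.select_dtypes("number")

features.boxplot()                     # box plots of the four properties
plt.show()

model = DBSCAN(eps=0.8, min_samples=19).fit(features)   # DBSCAN model
print(Counter(model.labels_))          # counts per label; -1 marks noise/outliers

outliers = iris[model.labels_ == -1]   # observations flagged as outliers
print(outliers)

# scatter plot of petal length vs sepal width, outliers highlighted by colour
plt.scatter(features.iloc[:, 2], features.iloc[:, 1], c=(model.labels_ == -1))
plt.xlabel("petal length"); plt.ylabel("sepal width")
plt.show()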
Writing space of the Problem: (For Student's use only)
Post-lab:-
Consider the following student dataset:
https://drive.google.com/file/d/1edmKnHjXkTyHT6gSYhwLw9rTpzoy1Cig/view?usp=sharing
a. Find the anomalous weights by plotting a histogram.
b. In the range 0 to 1, consider the lower_bound = 0.1 and upper_bound = 0.9 and find the outliers using the quantile method.
c. Segregate the outliers from the inliers using the "loc" method to get the values of "true_index". Also obtain the values of "false_index".
d. Now find the median from the values obtained in "true_index".
e. Replace all the outliers with the median.
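A possible sketch, assuming the student file is saved as students.csv with a numeric column named weight (both names are assumptions); true_index is taken to mean the rows inside the quantile bounds and false_index the rows outside them:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("students.csv")

df["weight"].hist()                                   # a. spot anomalous weights
plt.show()

lower = df["weight"].quantile(0.1)                    # b. quantile method bounds
upper = df["weight"].quantile(0.9)
within = (df["weight"] >= lower) & (df["weight"] <= upper)

true_index = df.loc[within]                           # c. rows inside the bounds
false_index = df.loc[~within]                         #    rows outside the bounds (outliers)

median = true_index["weight"].median()                # d. median of the inlier weights
df.loc[~within, "weight"] = median                    # e. replace the outliers with it
print(df["weight"].describe())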
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. Is it good to remove an outlier from the dataset all the time?
2. What are the applications of outlier detection?
3. What are the different types of outliers?
4. Are outliers just side products of some clustering algorithms?
5. What is the difference between noise and an anomaly?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation: