DWM Lab Manual 3yr 1
(21CS3052R)
S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M), Data and Results (10M), Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature
1. Basic Statistical Descriptions
2. To implement data pre-processing techniques
3. To implement principal component analysis
4. Classification using Decision Trees
5. Classification using K Nearest Neighbor
6. Classification using Bayesian Classifiers
7. Classification using Backpropagation
8. Association Rule Mining - Apriori
9. Implementation of K-Means Clustering
10. Classification: Support Vector Machine (SVM)
11. Rule-Based Classification
Pre-lab
1. What are the various ways to measure the central tendency of data?
4. Observe the following diagrams and identify the quantile plot and the q-q plot. How is a q-q plot different from a quantile plot?
6. Identify the symmetric data, positively skewed data and negatively skewed data from the graphs below.
In-lab
1. Given a dataset "cars" for analysis; it includes the variables speed and distance. (Download the dataset from LMS.)
a) What are the average speed and distance of the cars?
b) What are the median and midrange of the data?
c) Find the mode of the data and comment on the data modality (i.e., unimodal or bimodal).
d) What are the variance and the standard deviation of the data?
e) Find the five-number summary of the data.
f) Show the histogram and box plot of the data.
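A minimal Python sketch for these sub-questions is given below; it assumes the LMS file has been saved as cars.csv with numeric columns named speed and dist (adjust the names to match the downloaded file):

import pandas as pd
import matplotlib.pyplot as plt

cars = pd.read_csv("cars.csv")

for col in ["speed", "dist"]:
    s = cars[col]
    print(f"--- {col} ---")
    print("mean     :", s.mean())
    print("median   :", s.median())
    print("midrange :", (s.min() + s.max()) / 2)     # midrange = (min + max) / 2
    print("mode(s)  :", list(s.mode()))              # more than one value => multimodal
    print("variance :", s.var())
    print("std dev  :", s.std())
    # five-number summary: min, Q1, median, Q3, max
    print("five-number summary:",
          [s.min(), s.quantile(0.25), s.median(), s.quantile(0.75), s.max()])

# histogram and box plot of both variables
cars.hist(column=["speed", "dist"])
plt.show()
cars.boxplot(column=["speed", "dist"])
plt.show()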
Writing space of the Problem: (For Student's use only)
Post-lab
1. Suppose that a hospital tested the age and body-fat data for 18 randomly selected adults with the following results. (Download the dataset from LMS.)
a) Find the maximum and minimum percentage of body fat and the maximum and minimum age of the adults who visited the hospital.
b) Calculate the mean, median and midrange of the age.
c) Find the first quartile and third quartile of the data.
d) Draw a scatter plot and a q-q plot based on these two variables.
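One possible Python sketch, assuming the LMS file is saved as age_bodyfat.csv with numeric columns named age and fat (both names are assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("age_bodyfat.csv")
age, fat = df["age"], df["fat"]

print("max/min fat:", fat.max(), fat.min())
print("max/min age:", age.max(), age.min())
print("mean, median, midrange of age:",
      age.mean(), age.median(), (age.min() + age.max()) / 2)
print("Q1, Q3 of age:", age.quantile(0.25), age.quantile(0.75))
print("Q1, Q3 of fat:", fat.quantile(0.25), fat.quantile(0.75))

# scatter plot of age vs. % body fat
plt.scatter(age, fat)
plt.xlabel("age"); plt.ylabel("% body fat"); plt.title("Scatter plot")
plt.show()

# q-q plot: quantiles of age plotted against the corresponding quantiles of % fat
q = np.linspace(0, 1, 100)
plt.plot(np.quantile(age, q), np.quantile(fat, q), "o")
plt.xlabel("age quantiles"); plt.ylabel("% fat quantiles"); plt.title("q-q plot")
plt.show()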
Viva Voce:-
1. Difference between symmetric data and skewed data.
2. What are the most widely used forms of quartiles?
3. Variance and Standard deviation fall under what category of measuring data?
4. What do low and high standard deviations indicate?
5. Based on what condition, two variables are said to be correlated?
(For Evaluator's use only)
DATA PREPROCESSING:
Databases are highly susceptible to noisy, missing, and inconsistent data because of their typically enormous size (frequently several gigabytes or more). Low-quality data will lead to low-quality mining results; pre-processing helps to obtain quality data. The steps involved in data pre-processing are data cleaning, data integration, data reduction, and data transformation.
Match the following:
i. Data cleaning                 a. Reduced representation of data
ii. Data integration             b. x_old / x_max
iii. Data reduction              c. Deals with missing values and noisy data
iv. Normalization                d. Works to remove noisy data
v. Data transformation           e. (x_old - x_min) / (x_max - x_min)
vi. Decimal scaling              f. Merging of data from multiple data stores
vii. Min-max normalization       g. Scales the data values into a specified range
viii. Z-score normalization      h. Converts data into appropriate forms
ix. Smoothing                    i. (x_old - mean) / standard deviation
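The three normalization formulas from the matching exercise can be tried out on a small made-up list of values, for example:

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 850.0])

# min-max normalization to [0, 1]: (x_old - x_min) / (x_max - x_min)
min_max = (x - x.min()) / (x.max() - x.min())

# z-score normalization: (x_old - mean) / standard deviation
z_score = (x - x.mean()) / x.std()

# decimal scaling: divide by 10^j, where j is the smallest integer with max(|x'|) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")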
Writing space of the Problem: (For Student's use only)
Post-lab
1. Data: (13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data. Also plot a histogram.
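A short Python sketch of smoothing by bin means with a bin depth of 3, followed by a histogram of the data:

import matplotlib.pyplot as plt

data = sorted([13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
               33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])
depth = 3

smoothed = []
for i in range(0, len(data), depth):
    bin_values = data[i:i + depth]                  # one bin of (up to) `depth` sorted values
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))
    print(f"Bin {i // depth + 1}: {bin_values} -> mean {bin_mean:.2f}")

print("Smoothed data:", smoothed)

plt.hist(data, bins=10)
plt.title("Histogram of the original data")
plt.show()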
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. What are the factors that comprise data quality?
2. What do you mean by noise in the dataset?
3. What are outliers in the dataset?
4. What is discretization?
5. What is the difference between lossy and lossless data reduction?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Pre-lab:-
Principal Component Analysis:
Principal Component Analysis is a method of extracting important variables from a large set of variables available in a dataset. Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
1. What are principal components?
2. Mention the steps to construct principal components.
In-lab:-
2. Calculate the principal components for the matrix given in Q1 using PCA.
Post-lab:-
1. Pollution has been a concern since industrialization due to its effects on human lives and the planet. According to WHO, air pollution contributes to 7 million premature deaths per annum. A report is generated on the quality of air over 5 months. It is found that the data within the reported dataset are correlated. So, perform a strategic method to reconstruct the dataset with 2 components. Also visualize a graph between the two components. (Download the dataset from LMS.)
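A hedged sketch using scikit-learn, assuming the LMS file is saved as air_quality.csv (file name assumed) and contains only numeric, correlated measurements:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("air_quality.csv")

# standardize so that attributes measured on different scales contribute equally
X = StandardScaler().fit_transform(df.select_dtypes("number"))

# project the correlated attributes onto 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(components[:, 0], components[:, 1])
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title("Data projected onto 2 principal components")
plt.show()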
Writing space of the Problem: (For Student's use only)
Viva Voce:-
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #4: Classification using Decision Trees
Date of the Session: / /    Time of the Session: to
Pre-lab:-
1. What are the attribute selection measures in modeling a decision tree? Write the respective equations for each of them.
2. What do you mean by entropy in a decision tree? How is it calculated?
3. What is information gain and how does it matter in a decision tree?
4. List out the parameters involved in DecisionTreeClassifier and export_graphviz and try to understand the role of each parameter.
5. Match the following:
   1. ID3      a. Gain Ratio
   2. CART     b. Information Gain
   3. C4.5     c. Gini Index
1. Implement the decision tree algorithm on the given data, which has weight and smoothness as the segregating criteria for the fruits apple and orange. Apple is represented by the number '1' and orange by '0'. Construct a decision tree and apply the prediction measures on the given data to obtain the types of fruits.

Weight   Smooth   Fruit
180      7        ?
140      8        ?
150      5        ?

Fruit dataset:
https://drive.google.com/file/d/1qoMDjozHHELVn5tFAJxp8mMw0Ggt-BVX/view?usp=sharing

Convert the trained decision tree classifier into a graphviz object. Later, we use the converted graphviz object for visualization. To visualize the decision tree, you just need to open the .txt file, copy its contents and paste them in the graphviz web portal. Graphviz web portal address: http://webgraphviz.com
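A minimal sketch of the idea, using a small hypothetical training set (weight in grams, smoothness on a 1-10 scale, 1 = apple, 0 = orange) in place of the linked Fruit dataset:

from sklearn.tree import DecisionTreeClassifier, export_graphviz

# assumed training values; replace with the rows from the downloaded Fruit dataset
train_X = [[170, 9], [175, 8], [160, 9], [130, 3], [135, 4], [145, 5]]
train_y = [1, 1, 1, 0, 0, 0]

clf = DecisionTreeClassifier().fit(train_X, train_y)

# predict the three unknown fruits from the in-lab table (1 = apple, 0 = orange)
print(clf.predict([[180, 7], [140, 8], [150, 5]]))

# dump the trained tree as a graphviz .txt file; paste its contents at http://webgraphviz.com
export_graphviz(clf, out_file="fruit_tree.txt",
                feature_names=["weight", "smooth"], class_names=["orange", "apple"])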
Writing space of the Problem: (For Student's use only)
2. Below given is the diabetes dataset.
(Ref: https://drive.google.com/file/d/1PJizP39JPh_T-5dQVcUVCfswrPSxT734/view?usp=sharing)
Make sure to install the scikit-learn package and other required packages.
1. Find the correlation matrix for the diabetes dataset.
2. Split the dataset into train_set and test_set for modeling and prediction. Divide the dataset in such a way that the training dataset constitutes 70 percent of the original dataset and the rest belongs to the test dataset.
3. Produce a decision tree model using
   a. the Gini index metric
   b. the Entropy and Information gain metric
   on the training dataset using the DecisionTreeClassifier function.
4. Apply the prediction measures on the test dataset.
5. Define a function named accuracy_score by interpreting the difference between the predicted values and the test set values. Display the accuracy in terms of
   a. a fraction, using the accuracy_score function
   b. the number of correct predictions.
6. Print the confusion matrix of the test dataset.
7. Calculate the following values manually after obtaining the confusion matrix:
   a. Accuracy
   b. Error rate
   c. Precision
   d. Recall (sensitivity)
   e. F1 Score
   f. Specificity
8. Compare the two results (obtained from the two kinds of metrics) and state which method is more accurate for this dataset. Convert the trained decision tree classifier into a graphviz object. Later, we use the converted graphviz object for visualization.
9. Plot the ROC curve and calculate the AUC.
10. Plot the recall vs precision curve.
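One possible sketch of these steps, assuming the downloaded file is saved as diabetes.csv with the class label in the last column (named Outcome in the common Pima file):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc, precision_recall_curve

df = pd.read_csv("diabetes.csv")
print(df.corr())                                   # step 1: correlation matrix

X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)   # step 2: 70/30 split

def accuracy_score(actual, predicted):             # step 5: hand-rolled accuracy
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual), correct

for criterion in ["gini", "entropy"]:              # step 3: the two splitting metrics
    clf = DecisionTreeClassifier(criterion=criterion, random_state=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)                   # step 4: prediction on the test set
    frac, correct = accuracy_score(list(y_test), list(y_pred))
    print(criterion, "accuracy:", frac, "correct predictions:", correct)
    print(confusion_matrix(y_test, y_pred))        # step 6: confusion matrix

# steps 9-10: ROC curve with AUC, and recall vs precision, for the last fitted model
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate"); plt.legend(); plt.show()

precision, recall, _ = precision_recall_curve(y_test, probs)
plt.plot(recall, precision)
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.show()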
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. What is the C4.5 algorithm and how does it work? State the differences between ID3 and C4.5.
2. Differentiate between over-fitting and under-fitting. Why does over-fitting occur during classification?
3. Explain the concept of pruning and why it is important. Differentiate between pre-pruning and post-pruning.
Viva Voce:-
1. What is the difference between supervised and unsupervised machine learning?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #5: Classification using K Nearest Neighbour
Pre-requisite:
In LMS: Find the file named "Concept of k-Nearest-Neighbor.doc". Read the specified document and answer the below questions.
Pre-lab:-
2. List the industrial uses of the k-nearest-neighbor algorithm in the real world.
3. Write an algorithm for k-nearest-neighbor classification given k, the number of nearest neighbors, and n, the number of attributes describing each tuple.
In-lab:-
Perform the following analysis:
The step-by-step process to compute the k-nearest-neighbor algorithm is:
1. Determine parameter k = number of nearest neighbors.
2. Calculate the distance between the test sample and the training samples.
3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance.
4. Gather the categories of the nearest neighbors.
5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the testing sample.
Dataset:
Suppose we have the following "StudentDataSet" dataset which consists of 1st year CGPA, 2nd year CGPA and Category (C: CRT, NC: Non-CRT) as parameters.
When a new student comes only with 1st year CGPA and 2nd year CGPA as information, predict the category of that new student (whether he belongs to CRT or Non-CRT) by the Euclidean distance measure, where the Euclidean distance between 2 points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
d(X1, X2) = √( ∑i (x1i − x2i)² )
Test sample:
The 1st year CGPA and 2nd year CGPA of the new student are 8.4 and 7.1 respectively. (Consider k = 3.)
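A minimal k-NN sketch (k = 3, Euclidean distance); the StudentDataSet rows are not reproduced here, so the list below uses made-up (CGPA1, CGPA2, Category) tuples that should be replaced by the values from the manual's table:

import math
from collections import Counter

# assumed training tuples; replace with the StudentDataSet rows
students = [(9.1, 8.8, "C"), (8.6, 8.2, "C"), (7.9, 8.0, "C"),
            (6.2, 6.5, "NC"), (5.8, 6.1, "NC"), (6.9, 6.4, "NC")]
test = (8.4, 7.1)
k = 3

def euclidean(a, b):
    # d(X1, X2) = sqrt(sum_i (x1i - x2i)^2)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# steps 2-5: compute distances, sort, take the k nearest, then majority vote
by_distance = sorted(students, key=lambda row: euclidean(row[:2], test))
nearest = by_distance[:k]
prediction = Counter(cat for *_, cat in nearest).most_common(1)[0][0]
print("Nearest neighbours:", nearest)
print("Predicted category:", prediction)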
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Predict the category of the student with 1st year CGPA and 2nd year CGPA as 7.3 and 7.1 respectively using the Manhattan measuring technique with k = 3 (manually). Note: The Manhattan distance between two tuples (or points) a and b is defined as ∑i |ai − bi|.
2. By considering the above StudentDataSet, predict the category of the new student having 1st year CGPA and 2nd year CGPA as 8.4 and 7.1 respectively, by implementing Python code using the Manhattan distance measure in order to find the nearest neighbors for k = 3, and check whether the output is the same for both measuring techniques or not.
Viva Voce:-
Refer to pages 423-425 in Han J & Kamber M, "Data Mining: Concepts and Techniques", Third Edition, Elsevier, 2011.
1. k-nearest-neighbor is a lazy learning algorithm. Justify.
2. How can the distance be computed for attributes that are not numeric, but nominal (or categorical), such as color?
3. List some techniques used to speed up the classification time.
4. If the value of a given attribute A is missing in tuple X1 and/or in tuple X2, the difference is always taken to be ______?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #6: Classification using Bayesian Classifiers
Pre-lab:-
1. Match the following:
   Column A                            Column B
   a. Naive Bayesian Classification    a. Values are continuous
   b. Bayesian belief network          b. Attributes conditionally dependent
   c. Gaussian distribution            c. To avoid zero probability
   d. Laplace estimator                d. Attributes conditionally independent
2. Explain Bayes' theorem and write its derived formula.
3. Suppose we have continuous values for an attribute in a dataset; then how do we calculate the probability?
4. Let us assume
   P(age = youth | buys_car = yes) = 0.222,
   P(income = medium | buys_car = yes) = 0.444 and
   P(buys_car = yes) = 0.643.
   Find the probability P(X | buys_car = yes), where X = (income = medium, age = youth).
In-lab:-
1. Consider the given table named "Weather_cond.csv" consisting of the attributes Temperature, Humidity, Windy and a class label named "Outcome". Depending on the weather conditions, you have to choose whether to play cricket or not.
a. Unlike the conventional function, write a Python function to split the dataset into a training set and a test set. Assume the test size length as 0.33.
b. Write a Python function to calculate the mean and standard deviation for each numerical attribute in the dataset.
c. Calculate the number of priors for the given dataset after splitting into training and test sets using Python.
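A possible sketch for parts (a)-(c), assuming Weather_cond.csv has numeric Temperature and Humidity columns, a Windy column and the class label Outcome:

import random
import pandas as pd

def train_test_split_custom(df, test_size=0.33, seed=1):
    """(a) shuffle the row indices and cut them into train and test parts."""
    idx = list(df.index)
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * test_size)
    return df.loc[idx[cut:]], df.loc[idx[:cut]]      # train, test

def mean_std(df):
    """(b) mean and standard deviation of every numerical attribute."""
    return {col: (df[col].mean(), df[col].std())
            for col in df.select_dtypes("number").columns}

df = pd.read_csv("Weather_cond.csv")
train, test = train_test_split_custom(df, test_size=0.33)
print(mean_std(train))

# (c) priors: relative frequency of each class label in the training set
priors = train["Outcome"].value_counts(normalize=True)
print(priors)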
Writing space of the Problem: (For Student's use only)
2. The problem is comprised of 100 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute. Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0). This is a standard dataset that has been studied a lot in the machine learning literature. A good prediction accuracy is 70%-76%.
Implement Python code to find the accuracy for the given dataset named "Diabetes.csv" based on a train set and test set. Take the test size length as 0.4.
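A minimal sketch, assuming Diabetes.csv holds only numeric attributes with the class value (1/0) in the last column:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

df = pd.read_csv("Diabetes.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

# 60/40 split as asked (test size length = 0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

model = GaussianNB().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))   # expect roughly 0.70-0.76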
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Consider the given table that specifies a loan classification problem.
Viva Voce:-
1. Explain the difference between a validation set and a test set.
2. What are the three types of Naive Bayes classifier?
3. How many terms are required for building a Bayes model?
4. What are a training set and a testing set?
5. What are the advantages of Naive Bayes?
(For Evaluator's use only)
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #7: Classification using Backpropagation
Date of the Session: / /    Time of the Session: to
Pre-lab:-
In LMS: Find the file named "Han J & Kamber M, Data Mining Concepts and Techniques.doc". Read the specified document from Pg. No: 398-404 and answer the below questions.
1. State whether the given statement is True/False.
   a. Backpropagation is a neural network learning algorithm.
3. Explain about the Multilayer Feed-Forward Neural Network with a diagram.
5. Consider the following table.
Input   Desired Output   Model Output   Absolute Error   Square Error
0       0
1       2
2       4
Predict the Model Output by considering the initial value of the weight as 3. Find the Absolute Error and Square Error. Use the Backpropagation algorithm to update the weight and try to minimize the square error as much as possible.
Hint:
i. Model Output = W * I(x) (W = weight, I = Input, x = index that iterates from 0 to length(Input))
ii. Absolute Error = |Model Output − Desired Output|
iii. Square Error = (Absolute Error)^2
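A small sketch of the weight-update idea for this table: the model output is W * x, W starts at 3, and W is nudged in the direction that reduces the squared error (the exact answer is W = 2):

inputs = [0, 1, 2]
desired = [0, 2, 4]
W, lr = 3.0, 0.1

for epoch in range(20):
    total_sq_error = 0.0
    for x, d in zip(inputs, desired):
        out = W * x                        # model output
        error = out - d                    # signed error; |error| is the absolute error
        total_sq_error += error ** 2       # square error
        W -= lr * 2 * error * x            # gradient of (W*x - d)^2 w.r.t. W is 2*error*x
    print(f"epoch {epoch + 1}: W = {W:.4f}, total square error = {total_sq_error:.4f}")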
Writing space of the Problem: (For Student's use only)
In-lab:-
Analysis:
The following steps will provide the foundation that you need to implement the Backpropagation algorithm and apply it to your own predictive modelling problems:
1. Initialize Network.
2. Forward Propagate.
   i. Neuron Activation.
   ii. Neuron Transfer.
   iii. Forward Propagation.
3. Back Propagate Error.
   i. Transfer Derivative.
   ii. Error Backpropagation.
4. Train Network.
   i. Update Weights.
   ii. Train Network.
5. Test Network.
Dataset:
Suppose we have the following "Results Dataset" which consists of the GPAs that some students scored in two internal tests. It also consists of another attribute named 'Qualified', which holds a character (Q/NQ) representing the student's qualification for the final examination.
S.No   Test-1   Test-2   Qualified
1 8.5 8.5 Q
2 8.2 9.0 Q
3 3.5 5.0 NQ
4 5.5 4.5 NQ
5 9.2 9.0 Q
6 7.8 7.3 Q
7 8.0 3.1 NQ
8 10 7.0 Q
9 4.5 6.0 NQ
10 6.8 7.1 Q
11 5.1 4.1 NQ
12 4.2 5.3 NQ
Problem: Train a network on the above "Results Dataset" by applying the Backpropagation algorithm.
a. Initialize a network with all weights and biases. (Consider weights in the range -0.5 to +0.5, biases = 1, Learning Rate = {0.5, 0.7, 1}.)
b. Train the network according to the dataset. (Consider both activation functions: the Sigmoid function and the Tanh function.)
c. Backpropagate the errors.
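A compact NumPy sketch of a 2-2-1 network trained with backpropagation on the Results Dataset (sigmoid activation, learning rate 0.5, biases initialised to 1, weights drawn from -0.5 to +0.5, Q encoded as 1 and NQ as 0); treat it as one possible starting point rather than the required solution:

import numpy as np

X = np.array([[8.5, 8.5], [8.2, 9.0], [3.5, 5.0], [5.5, 4.5], [9.2, 9.0], [7.8, 7.3],
              [8.0, 3.1], [10, 7.0], [4.5, 6.0], [6.8, 7.1], [5.1, 4.1], [4.2, 5.3]]) / 10.0
y = np.array([[1], [1], [0], [0], [1], [1], [0], [1], [0], [1], [0], [0]])

rng = np.random.default_rng(1)
W1 = rng.uniform(-0.5, 0.5, (2, 2)); b1 = np.ones((1, 2))    # input -> hidden
W2 = rng.uniform(-0.5, 0.5, (2, 1)); b2 = np.ones((1, 1))    # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    for x_row, target in zip(X, y):
        x_row = x_row.reshape(1, -1)
        # forward propagate
        hidden = sigmoid(x_row @ W1 + b1)
        output = sigmoid(hidden @ W2 + b2)
        # backpropagate the errors (transfer derivative of the sigmoid is out * (1 - out))
        out_delta = (output - target) * output * (1 - output)
        hid_delta = (out_delta @ W2.T) * hidden * (1 - hidden)
        # update weights and biases
        W2 -= lr * hidden.T @ out_delta;  b2 -= lr * out_delta
        W1 -= lr * x_row.T @ hid_delta;   b1 -= lr * hid_delta

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print("Training predictions:", (pred > 0.5).astype(int).ravel())
print("Targets             :", y.ravel())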
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Use the network which is trained on the above "Results Dataset" and test whether it is trained with 100% accuracy or not. Then predict the result (qualified for the final examination or not) of a new entry which contains 5.9 and 5.9 as the GPAs of Test-1 and Test-2 respectively.
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. What are the general tasks that are performed with the backpropagation algorithm?
2. What kind of real-world problems can neural networks solve?
3. What is gradient descent?
4. Why is zero initialization not a recommended weight initialization technique?
5. How are artificial neural networks different from normal networks?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #8: Association Rule Mining - Apriori
Date of the Session: / /    Time of the Session: to
Pre-lab:-
1. Define the Apriori algorithm.
2. What is association rule mining?
4. What are minimum support and minimum confidence?
   a. Find all the frequent itemsets using the Apriori algorithm.
   b. Obtain significant decision rules.
In-lab:- For the following given transaction dataset, perform the following operations:
a. Generate rules using the Apriori algorithm on the below dataset (one transaction per line):

shrimp, almonds, avocado, vegetables mix, green grapes, whole wheat flour, yams, cottage cheese
burgers, meatballs, eggs
chutney
turkey, avocado
mineral water, milk, energy bar, whole wheat rice, green tea, eggs
low fat yogurt
whole wheat pasta, french fries
soup, light cream, shallot
frozen vegetables, spaghetti, green tea
french fries
eggs, pet food
cookies
turkey, burgers, mineral water, eggs, cooking oil
spaghetti, champagne, cookies
mineral water, salmon, eggs
mineral water
shrimp, chocolate, chicken, honey, oil, cooking oil, low fat yogurt
turkey, eggs
turkey, fresh tuna, tomatoes, spaghetti, mineral water, black tea, salmon, eggs
french fries
meatballs, milk, honey, protein bar, shampoo
red wine, shrimp, pasta, pepper, eggs, chocolate
rice, sparkling water
spaghetti, mineral water, ham, body spray, pancakes, green tea
burgers, grated cheese, eggs, pasta, avocado, honey, white wine, toothpaste
eggs
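One possible way to generate the rules is the apyori package (pip install apyori); the snippet below shows the pattern with only the first few baskets typed in, and the remaining baskets from the table should be appended in the same way:

from apyori import apriori

transactions = [
    ["shrimp", "almonds", "avocado", "vegetables mix", "green grapes",
     "whole wheat flour", "yams", "cottage cheese"],
    ["burgers", "meatballs", "eggs"],
    ["chutney"],
    ["turkey", "avocado"],
    # ... add the remaining baskets from the table above in the same way ...
]

# illustrative thresholds for support, confidence and lift
rules = apriori(transactions, min_support=0.1, min_confidence=0.5, min_lift=1.0)
for rule in rules:
    print(list(rule.items), "support:", round(rule.support, 3))
    for stat in rule.ordered_statistics:
        print("  ", list(stat.items_base), "->", list(stat.items_add),
              "conf:", round(stat.confidence, 2), "lift:", round(stat.lift, 2))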
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Same as the In-lab question: generate rules on the below dataset (one transaction per line):

citrus fruit, semi-finished bread, margarine, ready soups
tropical fruit, yogurt, coffee
whole milk
pip fruit, yogurt, cream cheese, meat spreads
other vegetables, whole milk, condensed milk, long life bakery product
whole milk, butter, yogurt, rice, abrasive cleaner
rolls/buns
other vegetables, UHT-milk, rolls/buns, bottled beer, liquor (appetizer)
pot plants
whole milk, cereals
tropical fruit, other vegetables, white bread, bottled water, chocolate
citrus fruit, tropical fruit, whole milk, butter, curd, yogurt, flour, bottled water, dishes
beef
frankfurter, rolls/buns, soda
chicken, tropical fruit
butter, sugar, fruit/vegetable juice, newspapers
fruit/vegetable juice
packaged fruit/vegetables
chocolate
specialty bar
other vegetables
butter milk, pastry
whole milk
tropical fruit, cream cheese, processed cheese, detergent, newspapers
tropical fruit, root vegetables, other vegetables, frozen dessert, rolls/buns, flour, sweet spread, salty snack, waffles, candy, bathroom cleaner
bottled water, canned beer
yogurt
sausage, rolls/buns, soda, chocolate
other vegetables
brown bread, soda, fruit/vegetable juice, canned beer, newspapers, shopping bags
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. Who proposed the Apriori algorithm and in which year?
2. What is a frequent itemset?
3. Why do we convert the dataset into a list?
4. What is the formula for support, confidence and lift?
5. How did the algorithm get the name Apriori?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #9: Implementation of K-Means Clustering
Pre-Requisites:
Data pre-processing
Basics of plotting techniques
Various clustering techniques
Pre-lab:-
1. Match the following.
   Parameters     Application
   1. pch         a. To set orientation of axis labels
   2. col         b. No. of plots per row and column
   3. mfrow       c. To set plot color
   4. lwd         d. Plotting symbol
   5. las         e. To set line width
2. List out various parameters and attributes in K-Means clustering.
4. List out various applications of clustering.
5. Describe Euclidean distance and Manhattan distance in brief with their derived formulae.
6. List out the basic steps involved in K-Means clustering.
In-lab:-
1. The given dataset comprises 150 data entries of different countries around the world. It is a report on world happiness, a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes. The records contain various attributes of each country, including positive_effect, negative_effect, corruption, freedom, healthy life expectancy etc. The data frame includes categorical variables and numerical values, and their values vary from country to country.
Implement Python code using scikit-learn to display a K-means clustering plot for the given data frame named "world_happiness_report.csv".
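A hedged scikit-learn sketch, assuming world_happiness_report.csv is in the working directory; only the numeric columns are clustered and two of them (chosen here arbitrarily as the first two) are plotted with the cluster labels as colours:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("world_happiness_report.csv")
numeric = df.select_dtypes("number").dropna()

X = StandardScaler().fit_transform(numeric)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

plt.scatter(numeric.iloc[:, 0], numeric.iloc[:, 1], c=labels)
plt.xlabel(numeric.columns[0]); plt.ylabel(numeric.columns[1])
plt.title("K-means clusters on the world happiness data")
plt.show()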
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. This lab module aims to build an analysis of the customers of a shopping mall. It consists of 150 observations of customers with details that include gender, age, annual_income, spending_score etc. Based on the two parameters annual_income and spending_score, try to build an analysis of the customers through cluster graphs.
Apply K-means clustering on the given dataset named "Mall_customers", marking the number of clusters based on the mean and standard deviation of any two attributes of your choice, and implement K-means iteratively till the centroids get stabilized.
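A sketch of an iterative K-means on Mall_customers.csv, assuming columns named annual_income and spending_score (adjust to the downloaded file); the centroids are recomputed until they stop moving:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Mall_customers.csv")
X = df[["annual_income", "spending_score"]].to_numpy(dtype=float)

k = 3
rng = np.random.default_rng(1)
centroids = X[rng.choice(len(X), k, replace=False)]      # pick k random points to start

while True:
    # assign every customer to its nearest centroid (Euclidean distance)
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
    # recompute each centroid as the mean of its assigned points (keep old one if empty)
    new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    if np.allclose(new_centroids, centroids):             # centroids have stabilised
        break
    centroids = new_centroids

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=200)
plt.xlabel("annual_income"); plt.ylabel("spending_score")
plt.show()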
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. K-means is which type of algorithm?
2. In the K-means clustering algorithm, what is the criterion used by the data points to get separated from one cluster to another?
3. What are the basic steps in K-Means clustering?
4. What does K refer to in the K-means algorithm? (K refers to the number of clusters.)
5. How is the K-means algorithm different from the KNN algorithm?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #10: Classification: Support Vector Machine (SVM)
Pre-lab:-
1. What is SVM?
2. When do we use SVM?
3. What is the maximum marginal hyperplane and what is the equation of the separating hyperplane?
4. What are the two cases of SVM?
5. What are the equations for a point that lies above the separating hyperplane and for a point that lies below the separating hyperplane?
In-lab:-
1. Below is the data of the employees in the company. The data shows whether an employee purchased the software or not. Take the x coordinate as age and the y coordinate as estimated_salary. Now, consider the following dataset and perform the below operations:
UserID Gender Age EstimatedSalary Purchased
15624510 Male 19 19000 0
15810944 Male 35 20000 0
15668575 Female 26 43000 0
15603246 Female 27 57000 0
15804002 Male 19 76000 0
15728773 Male 27 58000 0
15598044 Female 27 84000 0
15694829 Female 32 150000 1
15600575 Male 25 33000 0
15727311 Female 35 65000 0
15570769 Female 26 80000 0
15606274 Female 26 52000 0
15746139 Male 20 86000 0
15704987 Male 32 18000 0
15628972 Male 18 82000 0
15697686 Male 29 80000 0
15733883 Male 47 25000 1
15617482 Male 45 26000 1
15704583 Male 46 28000 1
15621083 Female 48 29000 1
15649487 Male 45 22000 1
15736760 Female 47 49000 1
15714658 Male 48 41000 1
15599081 Female 45 22000 1
15705113 Male 46 23000 1
15631159 Male 47 20000 1
15792818 Male 49 28000 1
15633531 Female 47 30000 1
15744529 Male 29 43000 0
a. Import the dataset into Python
b. Split the dataset into training and testing sets
c. Apply feature scaling on the training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.
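A possible sketch of steps (a)-(f), assuming the table above has been saved as employees.csv (file name assumed) with the same column headings:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("employees.csv")                       # (a) import the dataset
X = df[["Age", "EstimatedSalary"]].to_numpy(dtype=float)
y = df["Purchased"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)   # (b) split

sc = StandardScaler()                                   # (c) feature scaling
X_train, X_test = sc.fit_transform(X_train), sc.transform(X_test)

clf = SVC(kernel="linear").fit(X_train, y_train)        # (d) fit SVM

def plot_region(X_set, y_set, title):                   # (e)/(f) visualise the results
    xx, yy = np.meshgrid(np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
                         np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)                  # decision regions
    plt.scatter(X_set[:, 0], X_set[:, 1], c=y_set)      # the actual points
    plt.title(title); plt.xlabel("Age (scaled)"); plt.ylabel("EstimatedSalary (scaled)")
    plt.show()

plot_region(X_train, y_train, "SVM - training set")
plot_region(X_test, y_test, "SVM - test set")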
Post-lab:-
1. The below dataset represents the bank transactions of KVB bank for an hour. Consider the x coordinate as Balance and the y coordinate as Trtn_amt. Perform the following operations on the given dataset:
S.No transaction_ID Balance Trtn_amt sucornot
1 3467 98687.36 500 0
2 4801 8510.47 100 0
3 2093 2475.3 200 1
4 9933 37743.25 1000 0
5 7178 2705.95 600 0
6 1093 60314 750 1
7 3708 812129.5 280 1
8 3804 8076.25 140 0
9 3192 42323.14 310 1
10 3666 47045.25 2500 0
11 8598 96171.25 6900 0
12 8743 608581.8 8520 1
13 9302 586057.3 410 1
14 6127 4587.5 750 0
15 7502 43597.75 250 0
a. Import the dataset into Python
b. Split the dataset into training and testing sets
c. Apply feature scaling on the training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. What are the advantages of SVM?
2. How many types of machine learning are there, and under which type does SVM fall?
3. What are the tuning parameters in SVM?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #11: Rule-Based Classification
Date of the Session: / /    Time of the Session: to
Pre-requisite:
Refer to pages 355-363 in Han J & Kamber M, "Data Mining: Concepts and Techniques", Third Edition, Elsevier, 2011.
Pre-lab:-
2. Briefly explain building classification rules.
4. List some aspects of sequential covering.
5. What are the characteristics of a rule-based classifier?
6. Define coverage and accuracy.
In-lab:-
1. Implement a simple Python code for rule-based classification on the "AllElectronicsCustomer" database. (Download the dataset from LMS.)
RID   age   income   student   credit_rating   Class: buys_computer
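A minimal illustration of rule-based classification on the AllElectronics schema; the IF-THEN rules below are the hand-made kind usually derived from this example, and the sample tuples are stand-ins for the LMS file:

def classify(record):
    age, income, student, credit = record
    # R1: IF age = youth AND student = yes THEN buys_computer = yes
    if age == "youth" and student == "yes":
        return "yes"
    # R2: IF age = middle_aged THEN buys_computer = yes
    if age == "middle_aged":
        return "yes"
    # R3: IF age = senior AND credit_rating = fair THEN buys_computer = yes
    if age == "senior" and credit == "fair":
        return "yes"
    # R4: IF age = senior AND credit_rating = excellent THEN buys_computer = no
    if age == "senior" and credit == "excellent":
        return "no"
    # default rule when no other rule fires
    return "no"

samples = [("youth", "high", "no", "fair"),
           ("middle_aged", "low", "yes", "excellent"),
           ("senior", "medium", "no", "excellent")]
for s in samples:
    print(s, "->", classify(s))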
Writing space of the Problem: (For Student's use only)
Post-lab:-
1. Extract possible classification rules from the given decision tree.
3. Difference between decision tree and rule-based classification.
Viva Voce:-
1. Rule-based classifiers classify records by using a collection of rules.
2. Most rule-based classification systems use which strategy?
3. Difference between class-based ordering and rule-based ordering.
4. Briefly explain the below terms in your own words:
   a. Mutually exclusive
   b. Exhaustive
5. Name the terms that define the following statements:
   a. Fraction of records that satisfy only the antecedent of a rule.
   b. Fraction of records that satisfy both the antecedent and consequent of a rule.
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation:
Lab #12: Outlier Detection
Date of the Session: / /    Time of the Session: to
Pre-lab:-
1. What do you mean by an outlier? What are the main causes of outliers?
3. Why is outlier detection necessary in data analysis?
4. How do we calculate the z-score?
5. Consider the below dataset which comprises the income (in thousands) of 15 people in an organisation.
[45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
What do you observe from the above data? Is there any significant difference between the incomes of a few employees? If so, what could be the reason for it?
In-lab:-
1. The dataset Boston house prices consists of 9 attributes: CRIM, ZN, INDUS, LSTAT, NOX, RM, DIS, RAD, TAX. The description of each attribute:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
Boston dataset: https://drive.google.com/file/d/1YVYWQWPKsLX1UM-0XCnGCwD1NIi7_uIv/view?usp=sharing
Writing space of the Problem: (For Student's use only)
2. Consider the iris dataset. It includes three iris species with 50 samples each as well as some properties of each flower.
https://drive.google.com/file/d/1HEEMrAQqAynHdM5TmK0G-mD5Qr0OW2J8/view?usp=sharing
Import the csv file and use the boxplot method to visualise the outliers considering the 4 properties of a flower. You will notice that one of the properties has outliers.
1. Considering the range of the outliers from the visualisation, display the observations which have outliers.
2. Implement a DBSCAN model fitting on the dataset, taking the epsilon value as 0.8 and the minimum samples value as 19.
3. Print the counter values using the Counter function on the model labels.
4. Considering the values obtained from the model labels, print the outliers of the data.
5. Draw a scatter plot between petal length and sepal width to visualise the outliers.
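A hedged sketch of these steps, assuming the downloaded csv is saved as iris.csv with the four numeric measurement columns in the usual order (sepal length, sepal width, petal length, petal width):

from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

iris = pd.read_csv("iris.csv")
features = iris.select_dtypes("number")

features.boxplot()                     # box plots of the four properties
plt.show()

model = DBSCAN(eps=0.8, min_samples=19).fit(features)   # DBSCAN model
print(Counter(model.labels_))          # counts per label; -1 marks noise/outliers

outliers = iris[model.labels_ == -1]   # observations flagged as outliers
print(outliers)

# scatter plot of petal length vs sepal width, outliers highlighted by colour
plt.scatter(features.iloc[:, 2], features.iloc[:, 1], c=(model.labels_ == -1))
plt.xlabel("petal length"); plt.ylabel("sepal width")
plt.show()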
Writing space of the Problem: (For Student's use only)
Post-lab:-
Consider the following student dataset:
https://drive.google.com/file/d/1edmKnHjXkTyHT6gSYhwLw9rTpzoy1Cig/view?usp=sharing
a. Find the anomalous weights by plotting a histogram.
b. In the range 0 to 1, consider the lower_bound = 0.1 and upper_bound = 0.9 and find the outliers using the quantile method.
c. Segregate the outliers from the inliers using the "loc" method to get the values of "true_index". Also obtain the values of "false_index".
d. Now find the median from the values obtained in "true_index".
e. Replace all the outliers with the median.
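A possible sketch, assuming the student file is saved as students.csv with a numeric column named weight (both names are assumptions); true_index is taken to mean the rows inside the quantile bounds and false_index the rows outside them:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("students.csv")

df["weight"].hist()                                   # a. spot anomalous weights
plt.show()

lower = df["weight"].quantile(0.1)                    # b. quantile method bounds
upper = df["weight"].quantile(0.9)
within = (df["weight"] >= lower) & (df["weight"] <= upper)

true_index = df.loc[within]                           # c. rows inside the bounds
false_index = df.loc[~within]                         #    rows outside the bounds (outliers)

median = true_index["weight"].median()                # d. median of the inlier weights
df.loc[~within, "weight"] = median                    # e. replace the outliers with it
print(df["weight"].describe())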
Writing space of the Problem: (For Student's use only)
Viva Voce:-
1. Is it good to remove an outlier from the dataset all the time?
2. What are the applications of outlier detection?
3. What are the different types of outliers?
4. Are outliers just side products of some clustering algorithms?
5. What is the difference between noise and an anomaly?
(For Evaluator's use only)
Comment of the Evaluator (if any):
Evaluator's Observation:
Marks Secured: ____ out of ____
Full Name of the Evaluator:
Signature of the Evaluator:
Date of Evaluation: