
R22 Machine Learning Lecture Notes

UNIT-III
Learning with Trees: Decision Trees, Constructing Decision Trees, Classification and Regression Trees.
Ensemble Learning: Boosting, Bagging, Different ways to combine classifiers, Basic Statistics, Gaussian Mixture Models, Nearest Neighbour Methods.
Unsupervised Learning: K-Means Algorithm.
Decision Trees:
 Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems.
 It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf Node.
 Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the outputs of those decisions and do not contain any further branches.
 The decisions or the tests are performed on the basis of the features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.


Example:

 One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if ... then rules) that then go into program code very simply.
Ex: if there is a party then go to it
if there is not a party and you have an urgent deadline then study
Constructing Decision Trees:
Types of Decision Tree Algorithms:
 ID3: This algorithm measures how mixed up the data is at a node using something called entropy. It then chooses the feature that helps to clarify the data the most.
 C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes.
 CART: This algorithm uses a different measure called Gini impurity to decide how to split the data. It can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks.

ID3 Algorithm:

Entropy in Information Theory:
 Entropy measures the amount of impurity in a set of features.
 The entropy H of a set of probabilities pᵢ is:

H(p) = − Σᵢ pᵢ log₂ pᵢ

 where the logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define 0 log 0 = 0.


 If all of the examples are positive, then we don’t get any extra information from knowing the value of the feature for any particular example, since whatever the value of the feature, the example will be positive. Thus, the entropy of that feature is 0.
 However, if the feature separates the examples into 50% positive and 50% negative, then the amount of entropy is at a maximum, and knowing about that feature is very useful to us.
 For our decision tree, the best feature to pick as the one to classify on now is the one that gives us the most information about the class, i.e., the one with the highest information gain (defined next).
Information Gain:
 It is defined as the entropy of the whole set minus the weighted entropy when a particular feature F is chosen:

Gain(S, F) = Entropy(S) − Σ_{f ∈ values(F)} (|S_f| / |S|) Entropy(S_f)

 The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value.
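The two formulas above can be sketched in a few lines of Python; the toy labels and feature values below are made up purely for illustration.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    total = len(labels)
    return -sum((c / total) * np.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the whole set minus the weighted entropy after splitting
    on the given feature (the quantity ID3 maximizes)."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy example: a feature that perfectly separates the two classes.
labels  = ['yes', 'yes', 'yes', 'no', 'no', 'no']
feature = ['a',   'a',   'a',   'b',  'b',  'b']
print(entropy(labels))                    # 1.0 bit for a 50/50 split
print(information_gain(feature, labels))  # 1.0: the split removes all impurity
```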


C4.5 Algorithm:
 It is an improved version of ID3.
 Pruning is another method that can help us avoid overfitting.
 It helps in improving the performance of the Decision tree by cutting the nodes or sub-nodes which are not significant.
 Additionally, it removes the branches which have very low importance.
 There are mainly 2 ways of pruning (a small sketch of both follows this list):
 Pre-pruning – we can stop growing the tree earlier, which means we can prune/remove/cut a node if it has low importance while growing the tree.
 Post-pruning – once our tree is built to its depth, we can start pruning the nodes based on their significance.
 C4.5 uses a different method called rule post-pruning.
 This consists of taking the tree generated by ID3, converting it to a set of if-then rules, and then pruning each rule by removing preconditions if the accuracy of the rule increases without them.
 The rules are then sorted according to their accuracy on the training set and applied in order.
 The advantages of dealing with rules are that they are easier to read and their order in the tree does not matter, just their accuracy in the classification.
 For continuous variables, the simplest solution is to discretise the continuous variable.
 The computational complexity of a decision tree is O(dn log n), where n is the number of data points and d is the number of dimensions.
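As a hedged illustration only: scikit-learn's DecisionTreeClassifier implements CART rather than C4.5's rule post-pruning, but it lets us sketch the pre-pruning vs. post-pruning distinction described above (max_depth/min_samples_leaf stop growth early; ccp_alpha prunes a fully grown tree by cost-complexity).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing the tree early.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow fully, then cut back weak branches (cost-complexity pruning).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```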


Classification Example: construct the decision tree to decide what to do in the evening.

We start with: which feature should be selected as the root node?

First, compute the entropy of S:

Then, find which feature has the maximal information gain:


 Therefore, the root node will be the party feature, which has two feature values (‘yes’ and ‘no’), so it will have two branches coming out of it.


 When we look at the ‘yes’ branch, we see that in all five cases where there was a party we went to it, so we just put a leaf node there, saying ‘party’.
 For the ‘no’ branch, out of the five cases there are three different outcomes, so now we need to choose another feature.
 The five cases we are looking at are:

 We’ve used the party feature, so we just need to calculate the information gain of the other two over these five examples:


 Here, the Deadline feature has the maximum information gain. Hence, we select the Deadline feature for splitting the data.

 Finally, we will get the following decision tree.


Classification and Regression Trees (CART):
 It is another well-known tree-based algorithm, CART, whose name indicates that it can be used for both classification and regression.

Gini Impurity:

 It is the probability of misclassifying a randomly chosen element in a set.

 The ‘impurity’ in the name suggests that the aim of the decision tree is to have each leaf node represent a set of data points that are in the same class, so that there are no mismatches. This is known as purity.
 If a leaf is pure then all of the training data within it have just one class.
 Consider a dataset D that contains samples from k classes.
 The probability of samples belonging to class i at a given node can be denoted as pᵢ. Then the Gini Impurity of D is defined as:

Gini(D) = 1 − Σᵢ pᵢ²

 The node with a uniform class distribution has the highest impurity (0.5 for two classes).
 The minimum impurity (0) is obtained when all records belong to the same class.
 The attribute with the smallest Gini Impurity is selected for splitting the node.
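A short sketch of the Gini formula above, with made-up label lists:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini(D) = 1 - sum(p_i^2) over the classes present in `labels`."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini_impurity(['a', 'a', 'a', 'a']))   # 0.0: pure node, minimum impurity
print(gini_impurity(['a', 'a', 'b', 'b']))   # 0.5: uniform two-class split, maximum impurity
```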


Regression in Trees:

 A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict its value.

 A Regression Tree works by splitting the training data recursively into smaller subsets based on specific criteria.
 The objective is to split the data in a way that minimizes the residual error (Sum of Squared Errors) in each subset.
 Residual Reduction – Residual reduction is a measure of how much the average squared difference between the predicted values and the actual values for the target variable is reduced by splitting the subset. The greater the residual reduction, the better the model fits the data.
 Splitting Criteria – CART evaluates every possible split at each node and selects the one that results in the greatest reduction of residual error in the resulting subsets. This process is repeated until a stopping criterion is met, such as reaching the maximum tree depth or having too few instances in a leaf node.
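A minimal sketch of the splitting criterion described above, for a single feature; the helper names (sse, best_split) and the toy data are made up for illustration.

```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its mean (the residual of a node)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Scan every threshold on one feature and return the split that gives
    the greatest reduction in SSE (the CART regression criterion)."""
    parent = sse(y)
    best_thr, best_reduction = None, 0.0
    for thr in np.unique(x)[:-1]:          # no point splitting above the largest value
        left, right = y[x <= thr], y[x > thr]
        reduction = parent - (sse(left) + sse(right))
        if reduction > best_reduction:
            best_thr, best_reduction = thr, reduction
    return best_thr, best_reduction

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # splits at x <= 3.0, separating the two flat regions
```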


Ensemble Learning:
 Ensemble learning refers to the approach of combining multiple ML models to produce a more accurate and robust prediction compared to any individual model.
 The conventional ensemble methods include bagging, boosting, and stacking-based methods.

Boosting:
 Boosting is an ensemble technique that combines multiple weak learners to create a strong learner.


 The ensemble of weak models is trained in series such that each model that comes next tries to correct the errors of the previous model, until the entire training dataset is predicted correctly.
 One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).
AdaBoost:
 AdaBoost, short for Adaptive Boosting, is an ensemble learning method used in machine learning for classification and regression problems.
 The main idea behind AdaBoost is to iteratively train the weak classifier on the training dataset, with each successive classifier giving more weightage to the data points that are misclassified.
 The final AdaBoost model is decided by combining all the weak classifiers that have been used for training, with the weightage given to the models according to their accuracies.
 The model which has the highest accuracy is given the highest weightage, while the model which has the lowest accuracy is given a lower weightage.
Steps in AdaBoost:

1. Weight Initialization

At the start, every instance is assigned an identical weight. These weights determine the importance of every example.

2. Model Training

A weak learner is trained on the dataset, with the aim of minimizing classification errors.

3. Weighted Error Calculation

The weighted error is then calculated by summing up the weights of the misclassified instances. This step emphasizes the importance of the samples which are tough to classify.

4. Model Weight Calculation

The weight of the weak learner is calculated based on its performance in classifying the training data. Models that perform well are assigned higher weights, indicating that they are more reliable.

5. Update Instance Weights

The instance weights are updated to give more weight to the misclassified samples from the previous step.

6. Repeat

Steps 2 through 5 are repeated for a predefined number of iterations or until a performance threshold is met.


7. Final Model Creation

The final strong model (also referred to as the ensemble) is created by combining the weighted outputs of all weak learners.

8. Classification

To make predictions on new records, AdaBoost uses the final ensemble model.
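A minimal sketch of the whole loop using scikit-learn's AdaBoostClassifier (its default weak learner is a depth-1 decision tree, a "stump"); the synthetic dataset is made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round re-weights the training instances (steps 3-5 above) and
# fits another weak learner; the weighted learners form the final model (step 7).
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy of the combined ensemble (step 8)
```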

Bagging:
 Bagging is a supervised learning technique that can be used for both regression and classification tasks.

 Here is an overview of the steps in the Bagging classifier algorithm:


 Bootstrap Sampling: Creates ‘N’ subsets of the original training data by randomly sampling rows with replacement. This step ensures that the base models are trained on diverse subsets of the data.
 Base Model Training: For each bootstrapped sample, train a base model independently on that subset of data. These weak models are trained in parallel to increase computational efficiency and reduce time consumption.


 Prediction Aggregation: To make a prediction on testing data, combine the predictions of all base models. For classification tasks, this can be majority voting or weighted majority voting, while for regression it involves averaging the predictions.
 Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of particular base models during the bootstrapping method. These “out-of-bag” samples can be used to estimate the model’s performance without the need for cross-validation.
 Final Prediction: After aggregating the predictions from all the base models, Bagging produces a final prediction for each instance.
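A minimal sketch of these steps with scikit-learn's BaggingClassifier (its default base model is a decision tree), including the OOB estimate; the synthetic data is made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

model = BaggingClassifier(
    n_estimators=25,    # number of bootstrap samples / base models trained in parallel
    oob_score=True,     # evaluate each model on its out-of-bag samples
    random_state=0,
)
model.fit(X, y)
print(model.oob_score_)        # OOB accuracy estimate, no separate validation split needed
print(model.predict(X[:5]))    # final prediction = majority vote of the base models
```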
Random Forest:
 The idea is largely that if one tree is good, then many trees (a forest) should be better, provided that there is enough variety between them.
 It works by creating a number of Decision Trees during the training phase.
 Each tree is constructed using a random subset of the dataset to measure a random subset of features in each partition.
 This randomness introduces variability among individual trees, reducing the risk of overfitting and improving overall prediction performance.
 In prediction, the algorithm aggregates the results of all trees, either by voting (for classification tasks) or by averaging (for regression tasks).
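A minimal Random Forest sketch with scikit-learn; the parameters shown just mirror the two sources of randomness described above (bootstrap samples of the data, random feature subsets at each split).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))   # aggregated (majority-vote) prediction of all trees
```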


Stacking:
 Stacking combines many ensemble methods in order to build a meta-learner.
 Stacking has two levels of learning: 1) base learning and 2) meta-learning.
 In the first one, the base learners are trained with the training data set.
 Once trained, the base learners create a new data set for a meta-learner.
 The meta-learner is then trained with that new training data set.
 Finally, the trained meta-learner is used to classify new instances.
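A minimal two-level stacking sketch with scikit-learn's StackingClassifier; the choice of base learners and meta-learner here is arbitrary, for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),      # level-1 base learners
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),                # level-2 meta-learner
)
stack.fit(X, y)                 # base learners' predictions become the meta-learner's inputs
print(stack.predict(X[:5]))
```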

Different ways to combine classifiers:
 If the number of classifiers is odd and the classifiers are each independent of each other, then majority voting will return the correct label if more than half of the classifiers agree.
 For regression problems, rather than taking the majority vote, it is common to take the mean of the outputs.
 However, the mean is heavily affected by outliers, with the result that the median is a more common average to use.
 It is the use of the median that produces the ‘bragging’ algorithm, a name which is meant to imply ‘robust bagging’.
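A tiny illustration of the combination rules above, with made-up predictions:

```python
import numpy as np
from collections import Counter

clf_preds = ['cat', 'dog', 'cat']                  # three independent classifiers
print(Counter(clf_preds).most_common(1)[0][0])     # majority vote -> 'cat'

reg_preds = np.array([2.1, 2.0, 9.5])              # one regressor produces an outlier
print(reg_preds.mean())                            # ~4.53, pulled up by the outlier
print(np.median(reg_preds))                        # 2.1, robust to the outlier
```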
Basic Statistics:
Mean:
 The "mean" is the average value of a dataset.
 It is calculated by adding up all the values in the dataset and dividing by the number of observations.
 The mean is a useful measure of central tendency, but it is sensitive to outliers, meaning that extreme values can significantly affect its value.
Median:

 The "median" is the middle value in a dataset.


 It is calculated by arranging the values in the dataset in order and finding the value that lies in the middle.
 If there are an even number of values in the dataset, the median is the average of the two middle values.


 The median is a useful measure of central tendency because it is not affected by outliers, meaning that extreme values do not significantly affect the value of the median.

Mode:
 The "mode" is the most common value in a dataset.
 It is calculated by finding the value that occurs most frequently in the dataset.
 If there are multiple values that occur with the same frequency, the dataset is said to be bimodal, trimodal, or multimodal.
 The mode is a useful measure of central tendency because it can identify the most common value in a dataset.
 However, it is not a good measure of central tendency for datasets with a wide range of values or datasets with no repeating values.
Variance:
 Variance is a measure of how much the data for a variable varies from its mean.

Covariance:
 Covariance is a measure of the relationship between two variables that is scale dependent, i.e. how much will a variable change when another variable changes.

Standard Deviation:
 The square root of the variance is known as the standard deviation.
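The statistics above in a few lines of NumPy, using a made-up dataset:

```python
import numpy as np
from collections import Counter

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(data))                          # mean = 5.0
print(np.median(data))                        # median = 4.5
print(Counter(data.tolist()).most_common(1))  # mode: [(4, 3)] -> 4 occurs most often
print(np.var(data))                           # variance around the mean
print(np.std(data))                           # standard deviation = sqrt(variance)
print(np.cov(data, 2 * data))                 # 2x2 covariance matrix of two variables
```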
Mahalanobis Distance:
 Mahalanobis Distance is a statistical tool used to measure the distance between a point and a distribution.
 It is a powerful technique that considers the correlations between variables in a dataset, making it a valuable tool in various applications such as outlier detection, clustering, and classification.
D²=(x-μ)ᵀΣ⁻¹(x-μ)


Where D² is the squared Mahalanobis Distance, x is the point in question, μ is the mean vector of the distribution, Σ is the covariance matrix of the distribution, and ᵀ denotes the transpose of a matrix.
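A small sketch of the formula above; the sample distribution is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # samples from the "distribution"
mu = X.mean(axis=0)                                 # mean vector
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix

def mahalanobis_sq(x):
    """D^2 = (x - mu)^T Sigma^-1 (x - mu)."""
    diff = x - mu
    return float(diff @ sigma_inv @ diff)

print(mahalanobis_sq(np.array([0.0, 0.0])))   # small: close to the distribution's centre
print(mahalanobis_sq(np.array([5.0, 5.0])))   # large: a likely outlier
```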
The Gaussian/Normal Distribution:
 Normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric about the mean, depicting that data near the mean are more frequent in occurrence than data far from the mean.
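For reference, the probability density of a Gaussian with mean μ and standard deviation σ is f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)).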

The Bias and Variance Trade-off:
 Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
 A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
 Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data.


 A model with high variance pays a lot of attention to training data and does not generalize on data which it hasn’t seen before.
 As a result, such models perform very well on training data but have high error rates on test data.
 If our model is too simple and has very few parameters then it may have high bias and low variance.
 On the other hand, if our model has a large number of parameters then it is going to have high variance and low bias.
 So we need to find the right/good balance without overfitting and underfitting the data.

Gaussian Mixture Models:
 GMM is blending multiple Gaussian distributions to form a single model.
 A Gaussian mixture model (GMM) is a machine learning method used to determine the probability each data point belongs to a given cluster. The model is a soft clustering method used in unsupervised learning.


 In soft clustering, instead of forcefully assigning a data point to a single cluster, GMM assigns probabilities that indicate the likelihood of that data point belonging to each of the Gaussian components.

Notation:

 K: Number of Gaussian components
 N: Number of data points
 D: Dimensionality of the data

GMM Parameters:

 Means (μ): Center locations of Gaussian components.
 Covariance Matrices (Σ): Define the shape and spread of each component.
 Weights (π): Probability of selecting each component.

Model Training

 Training a GMM involves setting the parameters using available data.


 The Expectation-Maximization (EM) technique is often employed, alternating between the Expectation (E) and Maximization (M) steps until convergence.

Expectation-Maximization:

 During the E step, the model calculates the probability of each data point belonging to each Gaussian component.
 The M step then adjusts the model’s parameters based on these probabilities.

Clustering and Density Estimation:

 Post-training, GMMs cluster data points based on the highest posterior probability.
 They are also used for density estimation, assessing the probability density at any point in the feature space.
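A minimal GMM sketch with scikit-learn's GaussianMixture, which fits the means, covariances and weights by EM; the two-blob synthetic data is made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # component 1
               rng.normal(5, 1, (100, 2))])   # component 2

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM training

print(gmm.means_)                 # means (mu) of the fitted components
print(gmm.weights_)               # mixing weights (pi)
print(gmm.predict_proba(X[:3]))   # soft assignment: P(component | data point)
print(gmm.score_samples(X[:3]))   # log probability density at each point
```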

Nearest Neighbour Methods:
K-Nearest Neighbors Algorithm:
 The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method employed to tackle classification and regression problems.


Step 1: Selecting the optimal value of K

 K represents the number of nearest neighbors that need to be considered while making a prediction.
Step 2: Calculating distance
 To measure the similarity between the target and training data points, Euclidean distance is used. The distance is calculated between each of the data points in the dataset and the target point.
Step 3: Finding Nearest Neighbors

 The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
 In the classification problem, the class label is determined by performing majority voting over the K nearest neighbors. The class with the most occurrences among the neighbors becomes the predicted class for the target data point.
 In the regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors. The calculated average value becomes the predicted output for the target data point.
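The four steps above in a small, self-contained sketch (the helper name knn_predict and the toy points are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Step 2: Euclidean distance from the target point to every training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Step 3: indices of the k nearest neighbours.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbours' class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])
print(knn_predict(X_train, y_train, np.array([2, 2])))   # 'red'
print(knn_predict(X_train, y_train, np.array([8, 7])))   # 'blue'
```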
K-Dimensional Tree (KD Tree):
 A KD Tree is a space-partitioning data structure for organizing points in K-dimensional space.
 It speeds up nearest-neighbour search compared with a brute-force KNN search.
 It is useful for representing data efficiently.
 In a KD Tree the data points are organized and partitioned on the basis of some specific conditions.
 The purpose of the tree is to store spatial data with the goal of accomplishing:

1. Nearest neighbor search.
2. Range queries.
3. Fast look-up.

Example:
As a simple example to showcase insertion into a K-Dimensional Tree, we will use k = 2. The points we will be adding are: (7,8), (12,3), (14,1), (4,12), (9,1), (2,7), and (10,19).
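A sketch of the example above using SciPy's KDTree, which supports nearest-neighbour search and range queries:

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([(7, 8), (12, 3), (14, 1), (4, 12), (9, 1), (2, 7), (10, 19)])
tree = KDTree(points)               # builds the 2-d space-partitioning tree

dist, idx = tree.query((9, 2))      # nearest-neighbour search
print(points[idx], dist)            # [9 1] at distance 1.0

print(tree.query_ball_point((7, 8), r=5))   # range query: indices of points within radius 5
```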


Unsupervised Learning:
 Unsupervised learning is a type of machine learning that learns from unlabeled data.
 This means that the data does not have any pre-existing labels or categories.
 The goal of unsupervised learning is to discover patterns and relationships in the data without any explicit guidance.
Types of Unsupervised Learning:
Unsupervised learning is classified into two categories of algorithms:

 Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior. Clustering is a type of unsupervised learning that is used to group similar data points together.
 Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Applications of Unsupervised learning:

 Anomaly detection: Unsupervised learning can identify unusual patterns or deviations from normal behavior in data, enabling the detection of fraud, intrusion, or system failures.
 Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns in scientific data, leading to new hypotheses and insights in various scientific fields.
 Recommendation systems: Unsupervised learning can identify patterns and similarities in user behavior and preferences to recommend products, movies, or music that align with their interests.
 Customer segmentation: Unsupervised learning can identify groups of customers with similar characteristics, allowing businesses to target marketing campaigns and improve customer service more effectively.
 Image analysis: Unsupervised learning can group images based on their content, facilitating tasks such as image classification, object detection, and image retrieval.

K-Means Algorithm:
 K-Means Clustering is an Unsupervised Learning algorithm which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters.


The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for the K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. Those data points which are near to the particular k-center create a cluster.
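A minimal K-Means sketch with scikit-learn, run on made-up two-blob data; fit() performs the iterative centroid update and assignment described above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # cluster around (0, 0)
               rng.normal(6, 1, (100, 2))])   # cluster around (6, 6)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the K centroids found by the iterative process
print(kmeans.labels_[:10])       # cluster assignment of the first 10 data points
```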


Applications of K-Means Clustering:
 Customer Segmentation
 Document Clustering
 Image Segmentation
 Recommendation Engines
 Image Compression

Advantages of K-Means Clustering:

 Simple and Easy to implement: The K-means algorithm is easy to understand and implement.
 Fast and Efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
 Scalability: K-means can handle large datasets with many data points and can be easily scaled to handle even larger datasets.
 Flexibility: K-means can be easily adapted to different applications and can be used with varying metrics of distance and initialization methods.

Disadvantages of K-Means Clustering:

 Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.


 Requires specifying the number of clusters: The number of clusters k needs to be specified before running the algorithm, which can be challenging in some applications.
 Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.

*****
