Module-02 Machine Learning (BCS602) Notes
Module-2
Chapter 01: Understanding Data - 2
Bivariate Data and Multivariate Data
Bivariate Data
Bivariate data involves two variables, and the goal of bivariate analysis is to explore the relationship between them.
This relationship can help in comparisons, identifying causes, and further exploration of the data.
In short, bivariate analysis aims to find relationships between the two variables and, where possible, what causes them.
Consider Table 2.3, with data on the temperature in a shop and the sales of sweaters.
Scatter Plot
A scatter plot is a useful graphical method for visualizing bivariate data.
It is particularly effective for illustrating the relationship between two variables.
The key features of a scatter plot are:
Strength: Indicates how closely the data points fit a pattern or trend.
Shape: Helps in identifying the type of relationship (linear, quadratic, etc.).
Direction: Shows whether the relationship is positive, negative, or neutral.
Outliers: Helps identify any points that deviate significantly from the trend.
Scatter plots are often used in the exploratory phase of data analysis before calculating correlation coefficients or fitting regression models.
Bivariate Statistics
There are various statistical measures to describe the relationship between two variables.
Two important bivariate statistics are covariance and correlation.
Covariance
Covariance measures the joint variability of two random variables. It tells you whether an increase in one variable tends to go with an increase or a decrease in the other variable.
Mathematically, the covariance between two variables X and Y is defined as:
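A standard way to write this definition, assuming the population convention of dividing by N (the sample version divides by N - 1 instead), with N paired observations (x_i, y_i) and means \bar{x} and \bar{y}, is:

```latex
\operatorname{cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
```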
Covariance values:
Positive covariance: As one variable increases, the other variable also increases.
Negative covariance: As one variable increases, the other variable decreases.
Zero covariance: No linear relationship between the variables.
Correlation
While covariance measures the direction of the relationship, correlation quantifies the strength of the relationship between two variables.
The most common measure of correlation is the Pearson correlation coefficient:
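In its usual form, with \sigma_X and \sigma_Y denoting the standard deviations of X and Y, the coefficient is the covariance rescaled by both spreads, so that it always lies between -1 and +1:

```latex
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}, \qquad -1 \le r \le 1
```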
Unlike covariance, correlation is dimensionless, meaning it is not affected by the units of the variables.
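Both statistics can be computed directly from their definitions. The following is a minimal sketch in plain Python. The temperature and sales figures are hypothetical stand-ins (not the values of Table 2.3), and dividing by N (population convention) is an assumption. Converting the temperatures to Fahrenheit changes the covariance but leaves the correlation unchanged, which illustrates the point about units.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: shop temperature (Celsius) and sweaters sold.
temperature_c = [5, 10, 15, 20, 25]
sweater_sales = [200, 150, 140, 75, 45]

cov_c = covariance(temperature_c, sweater_sales)
r_c = pearson(temperature_c, sweater_sales)

# Same temperatures expressed in Fahrenheit.
temperature_f = [c * 9 / 5 + 32 for c in temperature_c]
cov_f = covariance(temperature_f, sweater_sales)
r_f = pearson(temperature_f, sweater_sales)

print(f"Celsius:    cov = {cov_c:.1f}, r = {r_c:.4f}")
print(f"Fahrenheit: cov = {cov_f:.1f}, r = {r_f:.4f}")
```

The negative sign of both values reflects that, in this made-up data, sales fall as the temperature rises.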
Multivariate Statistics
Multivariate data refers to data that involves more than two variables, and in machine learning, most datasets are multivariate.
The goal of multivariate analysis is to understand relationships among multiple variables simultaneously.
This can involve multiple dependent (response) variables, and is often used for analyzing more complex data scenarios.
Multivariate analysis techniques include:
Regression Analysis
Principal Component Analysis (PCA)
Path Analysis
The mean vector is used to represent the mean of multiple variables, and the covariance matrix represents the variance and relationships among all variables.
The mean vector is also known as the centroid, while the covariance matrix is also referred to as the dispersion matrix.
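As an illustration, the mean vector and covariance matrix of a small dataset can be computed as follows. This is a minimal sketch with hypothetical data (rows are observations, columns are variables), and it assumes the population convention of dividing by the number of rows.

```python
def mean_vector(rows):
    """Column-wise means of the data: the centroid."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def covariance_matrix(rows):
    """Entry (i, j) is the covariance between variable i and variable j."""
    n = len(rows)
    mu = mean_vector(rows)
    centered = [[v - m for v, m in zip(row, mu)] for row in rows]
    d = len(mu)
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]

# Hypothetical dataset: 3 observations of 3 variables.
data = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
]

centroid = mean_vector(data)
dispersion = covariance_matrix(data)

print("mean vector:", centroid)
for row in dispersion:
    print([round(v, 3) for v in row])
```

The diagonal of the matrix holds the variance of each variable, and the matrix is symmetric because cov(X, Y) equals cov(Y, X).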
Multivariate Analysis Techniques
Regression Analysis:
Used to model the relationship between multiple independent variables and a dependent variable.
Factor Analysis:
A statistical method used to identify underlying relationships between observed variables.
Multivariate Analysis of Variance (MANOVA):
Extends ANOVA to analyze multiple dependent variables simultaneously.
Visualization Techniques for Multivariate Data
Heatmap
A heatmap is a graphical representation of a 2D matrix where values are represented by colors. In a heatmap:
Darker colors indicate larger values.
Lighter colors indicate smaller values.
Applications:
Heatmaps are useful for visualizing complex data like traffic patterns or patient health data, where you can easily identify regions of higher or lower values.
Example:
In vehicle traffic data, regions with heavy traffic are highlighted with dark colors, making it easy to spot problem areas.
Pairplot (or Scatter Matrix)
A pairplot (or scatter matrix) is a matrix of scatter plots that shows relationships between every pair of variables in a multivariate dataset.
This method allows you to visually examine correlations or relationships between variables.
A random matrix of three columns is chosen, and the relationships between the columns are plotted as a pairplot (or scatter matrix), as shown in Figure 2.14.
Visual Layout: Each scatter plot in the matrix shows the relationship between two variables.
Usefulness: By examining the pairplot, you can easily identify patterns, correlations, or clusters among the variables.
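The numbers behind such a plot can be sketched without any plotting library. The example below is an illustration only: it builds a hypothetical random matrix with three columns (the names A, B, C and the seed are assumptions) and computes one Pearson correlation per pair of columns, which is the quantity each off-diagonal panel of a pairplot lets you judge by eye.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the hypothetical data is reproducible
columns = {name: [random.random() for _ in range(50)] for name in ("A", "B", "C")}

# One entry per pair of columns, i.e. per off-diagonal panel of the pairplot.
pairs = list(combinations(columns, 2))
correlations = {(a, b): pearson(columns[a], columns[b]) for a, b in pairs}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:+.3f}")
```

Because the columns are independent random numbers, the correlations should be close to zero; with real data, large positive or negative values would show up as clear trends in the corresponding panels.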
Essential Mathematics for Multivariate Data
In the realm of machine learning and multivariate data analysis, several mathematical concepts are foundational.
These include concepts from Linear Algebra, Statistics, Probability, and Optimization. Below is an overview of essential mathematical tools that are necessary for understanding and working with multivariate data.
Linear Algebra
Linear algebra is crucial in machine learning as it provides the tools for dealing with data in the form of vectors and matrices. Here's a breakdown of important topics:
Vectors: A vector is an ordered list of numbers. It can represent data points or features of an observation in a multivariate dataset.
o Dot product and cross product are used to compute projections and angles between vectors.
Matrices: A matrix is a 2D array of numbers. In machine learning, matrices often represent data where rows are instances and columns are features.
o Matrix multiplication allows the transformation of data and is used in various algorithms like linear regression, neural networks, and more.
Eigenvalues and Eigenvectors: These are important for dimensionality reduction techniques such as Principal Component Analysis (PCA). They are used to transform data into a new basis that captures the most variance.
Determinants and Inverses: The determinant of a square matrix tells us whether the matrix is invertible (non-singular): it is invertible exactly when the determinant is nonzero. The inverse of a matrix is used to solve linear systems of equations.
Singular Value Decomposition (SVD): This is a factorization method used in PCA and other dimensionality reduction techniques to decompose a matrix into singular values and vectors.
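A few of these operations can be sketched in plain Python. The helper names and the numbers below are illustrative assumptions, and the determinant and inverse are written for the 2x2 case only to keep the example short.

```python
import math

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def angle_degrees(u, v):
    """Angle between two vectors, from cos(theta) = u.v / (|u| |v|)."""
    cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return math.degrees(math.acos(cosine))

def matmul(a, b):
    """Matrix product of a (m x n) and b (n x p), both lists of rows."""
    return [[dot(row, col) for col in zip(*b)] for row in a]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inverse2(m):
    """Inverse of a 2x2 matrix; only exists when the determinant is nonzero."""
    d = det2(m)
    if d == 0:
        raise ValueError("matrix is singular (determinant is zero)")
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]

theta = angle_degrees([1.0, 0.0], [1.0, 1.0])

# Solve the linear system A x = b as x = A^-1 b.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [[3.0], [5.0]]
x = matmul(inverse2(A), b)

print(f"angle = {theta:.1f} degrees")
print("solution x =", [round(row[0], 3) for row in x])
```

Multiplying A by the solution x reproduces b, which is a quick way to check the inverse.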
Statistics
Statistics is key to understanding the relationships between different variables in multivariate data. Key concepts include:
Mean and Variance: Measures of central tendency (mean) and spread (variance) are essential to understanding the distribution of each variable.
Covariance: Covariance measures the relationship between two variables. A positive covariance indicates that as one variable increases, the other tends to increase.
Correlation: Correlation is a normalized measure of covariance that indicates the strength and direction of the relationship between two variables.
Multivariate Normal Distribution: Many machine learning algorithms assume that the data follows a multivariate normal distribution, which extends the idea of normal distribution to more than one variable.
Principal Component Analysis (PCA): PCA is used to reduce the dimensionality of the dataset while retaining as much variance as possible. It uses eigenvectors and eigenvalues to identify the principal components.
Probability
Probability theory underpins the concept of uncertainty, which is inherent in real-world data:
omes.
Common distributions in machine learning include the normal distribution and the multinomial distribution.
Optimization
Optimization is key to finding the best model for multivariate data. Many machine learning algorithms are formulated as optimization problems.
Gradient Descent: An iterative optimization algorithm used to minimize a cost function (such as in linear regression or neural networks).
Convex Optimization: Involves minimizing convex functions, and plays a significant role in machine learning, as many cost functions are convex.
Lagrange Multipliers: Used for optimizing functions subject to constraints, which is often seen in constrained optimization problems in machine learning.
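To make gradient descent concrete, here is a minimal sketch that fits a straight line y = w*x + b by minimizing the mean squared error, which is a convex cost function. The data, the learning rate, and the number of steps are illustrative assumptions, not recommended settings.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data generated exactly from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

If the learning rate is chosen too large the updates overshoot and diverge; if it is too small, many more steps are needed to reach the minimum.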
Multivariate Analysis
Multivariate Regression: This is the extension of linear regression to predict multiple dependent variables using a set of independent variables.
Multivariate Analysis of Variance (MANOVA): An extension of ANOVA used when there are two or more dependent variables. It tests for differences between groups.
Factor Analysis: A method for identifying the underlying relationships between observed variables. It's often used in exploratory data analysis.
Graphical Techniques for Multivariate Data
Scatter Plots: A scatter plot can be used to visualize the relationship between two variables. For multivariate data, pairplots or scatter matrices are used to examine the relationships between all pairs of variables.
Heatmaps: Used to visualize correlation matrices or covariance matrices, where color intensity represents the strength of the relationship.
Multivariate Data Models
Multivariate Normal Distribution: A generalization of the univariate normal distribution to multiple variables, frequently assumed in multivariate statistical analysis.
Multivariate Linear Models: Models such as multiple regression, where multiple independent variables are used to predict a set of dependent variables.
Dimensionality Reduction
Dimensionality reduction is used to reduce the number of variables in a dataset while maintaining the essential information:
Principal Component Analysis (PCA): A technique that reduces the dimensionality of the dataset by projecting the data onto a set of orthogonal axes (principal components) that explain the most variance.
t-SNE: A technique for dimensionality reduction that is well-suited for visualizing high-dimensional data in 2D or 3D space.
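The idea behind PCA can be sketched in a few lines. The example below is a simplified illustration, not a full PCA implementation: it finds only the first principal component of a hypothetical 2D dataset, using power iteration on the covariance matrix (an assumed choice made to avoid external libraries; numerical libraries normally use an eigendecomposition or SVD instead).

```python
import math

def covariance_matrix(rows):
    """Return the population covariance matrix and the mean-centered data."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    d = len(means)
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    return cov, centered

def first_principal_component(cov, iterations=100):
    """Power iteration: repeatedly apply cov to a vector and renormalize.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along v (the leading eigenvalue).
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

# Hypothetical 2D data that mostly varies along the diagonal.
points = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]

cov, centered = covariance_matrix(points)
direction, explained = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))

# Reduce each 2D point to a single coordinate along the principal axis.
projected = [sum(c * u for c, u in zip(row, direction)) for row in centered]

print("principal direction:", [round(u, 3) for u in direction])
print(f"share of variance explained: {explained / total_variance:.2f}")
```

Here one axis already carries most of the variance, so replacing each point by its single projected coordinate loses comparatively little information.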
Feature Engineering and Dimensionality Reduction Techniques
Feature engineering and dimensionality reduction are critical steps in machine learning workflows.
They ensure that models are not only accurate but also efficient, interpretable, and scalable.
1. Feature Engineering
Feature engineering involves creating, modifying, or selecting features (variables) from raw data to improve the performance of machine learning models.
Techniques in Feature Engineering
1. Feature Creation
2. Feature Transformation
o Normalization: Scaling values to a specific range, typically [0, 1].
o Standardization: Transforming features to have a mean of 0 and a standard deviation of 1.
o Log Transformation: Reducing the impact of large values by applying the log function.
o Power Transformation: Stabilizing variance by applying functions like square root or exponential transformations.
3. Handling Missing Values
o Imputation: Filling missing values with statistical measures (mean, median, mode) or predictions from models.
o Dropping Features or Rows: Removing features or samples with excessive missing data.
4. Encoding Categorical Features
o Label Encoding: Assigning numerical values to categories.
o One-Hot Encoding: Creating binary columns for each category.
o Target Encoding: Replacing categories with the mean of the target variable.
5. Feature Selection
o Filter Methods: Using statistical tests (e.g., correlation, chi-square) to select features.
o Wrapper Methods: Selecting features based on the performance of a model (e.g., recursive feature elimination).
o Embedded Methods: Feature selection integrated into model training (e.g., regularization methods like LASSO).
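Several of the techniques listed above can be sketched in plain Python. The function names and the sample values are illustrative assumptions; in practice a library would normally be used for these steps.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(categories):
    """One binary column per distinct category (columns in sorted order)."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

# Hypothetical raw feature with one missing value.
ages = [20, None, 40, 60]
ages_filled = impute_mean(ages)
ages_normalized = min_max_normalize(ages_filled)
ages_standardized = standardize(ages_filled)

colors = ["red", "blue", "red"]
colors_encoded = one_hot(colors)

print(ages_filled)
print(ages_normalized)
print([round(z, 3) for z in ages_standardized])
print(colors_encoded)
```

Note that normalization and standardization are applied after imputation here, since both need a complete column of numbers.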
Dimensionality Reduction
Dimensionality reduction aims to reduce the number of features while preserving as much relevant information as possible.
It helps combat issues like overfitting, high computational costs, and the curse of dimensionality.
Techniques for Dimensionality Reduction
1. Principal Component Analysis (PCA)
o Purpose: Identifies directions (principal components) in the data that explain the maximum variance.
o Projects data onto a new coordinate system where each axis represents a principal component.
o Captures the most variance in the first few components.
Applications: Commonly used in image compression, gene expression analysis, and exploratory data analysis.
2. Linear Discriminant Analysis (LDA)
o Purpose: Similar to PCA but focuses on maximizing class separability in supervised learning tasks.
o Projects data onto a lower-dimensional space while maintaining class distinction.
Applications: Often used in classification problems.
3. t-Distributed Stochastic Neighbor Embedding (t-SNE)
o Purpose: Reduces high-dimensional data to 2D or 3D for visualization.
o Preserves the local structure of the data while sacrificing global structure.
Applications: Useful for visualizing clusters in high-dimensional data like embeddings.
4. Autoencoders (Deep Learning-Based Reduction)
o Purpose: Learns a compressed representation of the data using neural networks.
o The encoder compresses the data, and the decoder reconstructs it.
o The bottleneck layer represents the reduced dimensions.
Applications: Image compression, anomaly detection, and generative models.
5. Feature Agglomeration
o Purpose: Groups features with similar characteristics (hierarchical clustering for features).
o Combines redundant features into a single representative feature.
Applications: Useful for datasets with many correlated features.
6. Independent Component Analysis (ICA)
o Purpose: Decomposes data into statistically independent components.
o Useful for signals with non-Gaussian distributions.
Applications: Signal processing, such as separating audio signals in the "cocktail party problem."
7. Factor Analysis
o Purpose: Identifies underlying latent variables (factors) that explain observed variables.
o Assumes that observed data is influenced by a smaller number of unobservable factors.
Applications: Psychometrics, finance, and social sciences.
8. Backward Feature Elimination
o Purpose: Iteratively removes features that have the least impact on the target variable.
o Uses a trained model's performance as the criterion.
Applications: Effective for small datasets where computational cost isn't a concern.
Combining Feature Engineering and Dimensionality Reduction
Many machine learning frameworks (e.g., scikit-learn) support building pipelines where feature engineering and dimensionality reduction steps are automated.
Hybrid Methods: For example:
o Combine PCA with feature selection to reduce noise and retain relevant features.
o Use autoencoders to generate compact features, then apply supervised learning techniques.
Applications
Text Data:
o Use TF-IDF for feature creation and Latent Semantic Analysis (LSA) for dimensionality reduction.
Image Data:
o Apply Convolutional Autoencoders or PCA for reducing pixel-based data dimensions.
Genomic Data:
o Use PCA or t-SNE to visualize high-dimensional gene expression data.
Sensor Data:
o Combine Fourier transforms for feature extraction and PCA for dimensionality reduction.
Best Practices
Understand Data: Always begin with exploratory data analysis (EDA) to understand feature importance and relationships.
Domain Knowledge: Incorporate domain expertise to create meaningful features.
Avoid Over-Reduction: Ensure that dimensionality reduction techniques retain sufficient information to build an accurate model.
Evaluate: Continuously evaluate feature engineering and dimensionality reduction using cross-validation.
Chapter 02
Basic Learning Theory
Design of Learning System
A learning system is a computational system that uses algorithms to learn from data or experiences to improve its performance over time.
The design of such systems focuses on the following essential steps:
The first step in building a learning system is selecting the type of training experience it will use to learn. This involves determining the source of data and how it will be used.
Types of Training Experience:
Direct Experience:
The system is explicitly provided with examples of board states and their correct moves.
Example: In a chess game, the system is given specific board states and the optimal moves for those states.
Indirect Experience:
Instead of explicit guidance, the system is provided with sequences of moves and their results.
Example: The system observes the outcome (win or loss) of different move sequences and learns to optimize its strategy.
Supervised vs. Unsupervised Training:
In supervised training, a supervisor labels all valid moves for a given board state.
In the absence of a supervisor, the system uses self-play or exploration to learn. For example, a chess agent can play games against itself and identify successful moves.
Training Data Distribution:
o For reliable performance, training samples must cover a wide range of scenarios.
o If the training data and testing data have similar distributions, the system's performance will be better.
Determining the Target Function
The target function represents the knowledge the system needs to learn.
It specifies the goal of the learning system and what it is trying to predict or optimize.
Representation of the Target Function
Once the target function is defined, the next step is deciding how to represent it. The representation depends on the complexity of the problem and the available computational resources.
Common Representations:
Lookup Tables:
Used for simple problems where all possible states and actions can be enumerated.
Example: A small chessboard with a limited number of moves.
Mathematical Functions:
Represented using equations or models (e.g., linear regression or polynomial equations).
Machine Learning Models:
For complex systems, models like neural networks, decision trees, or support vector machines are used to approximate the target function.
Example: Using a neural network to predict the best chess moves based on board states.
Function Approximation
In most real-world problems, the target function is too complex to be represented exactly. Instead, an approximation of the target function is learned.
Approaches to Approximation:
Parametric Models:
Models with a fixed number of parameters (e.g., linear regression, neural networks).
Non-Parametric Models:
Models that adapt their complexity to the amount of data (e.g., k-nearest neighbors, decision trees).
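The contrast between the two approaches can be illustrated with a small sketch. The data and the function names are hypothetical. The parametric model summarizes the training data in two numbers (slope and intercept), while the k-nearest-neighbors predictor keeps every training point and consults them at prediction time.

```python
def fit_line(xs, ys):
    """Parametric: least-squares line, fully described by two parameters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

def knn_predict(xs, ys, query, k=2):
    """Non-parametric: average the targets of the k closest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical training data lying on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_line(xs, ys)
line_prediction = slope * 2.5 + intercept
knn_prediction = knn_predict(xs, ys, 2.5)

print(f"line: {line_prediction}, k-NN: {knn_prediction}")
```

Adding more training points leaves the line with the same two parameters, whereas the k-NN predictor grows with the data because it stores all of it.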
Learning Algorithms:
Practical Example: Designing a Chess Learning System
Training Experience:
Use a combination of self-play (indirect experience) and historical game data (direct experience).
Target Function:
Define the target function as selecting the best move M given the board state B:
Representation of the Target Function:
Use a deep neural network to represent the target function, where inputs are board states and outputs are move probabilities.
Function Approximation:
Train the neural network using reinforcement learning, with rewards based on the outcome of games played by the system.
Introduction to Concept of Learning
Concept learning is a strategy in machine learning that involves acquiring abstract knowledge or inferring general concepts from the given training data.
It enables the learner to generalize from specific training examples and classify objects or instances based on common, relevant features.
What is Concept Learning?
Concept learning is the process of abstraction and generalization from data, where:
The learner identifies common features shared by positive examples.
It uses these features to classify new instances into categories.
It involves:
Comparing and contrasting categories by analyzing positive and negative examples.
Simplifying observations from training data into a model or hypothesis.
Applying this model to classify future data.
This process is also known as learning from experience.
Categorization:
o Concept learning enables classification of objects based on a set of relevant features.
o For example, humans classify animals like elephants, cats, or dogs based on specific distinguishing features.
Boolean-Valued Function:
o Each concept or category learned is represented as a Boolean function that returns true or false:
True for positive examples that belong to the category.
False for negative examples that do not belong to the category.
Example:
Formal Definition of Concept Learning
Concept learning is the process of inferring a Boolean-valued function by processing training examples.
The goal is to:
1. Identify a set of specific or common features.
2. Use these features to define a target concept for classifying objects.
Components of Concept Learning
Input:
o A labeled training dataset consisting of:
Positive examples: Instances that belong to the target concept.
Negative examples: Instances that do not belong to the target concept.
o The learner uses this past experience to train the model.
Output:
o The Target Concept or Target Function f(x):
A function f(x) maps input x to output y.
The output is used to determine the relevant features for classification.
o Example: Identifying an elephant requires a specific set of features such as "has a trunk" and "has tusks."
Testing:
o New instances are provided to test the learned model.
o The system classifies these new instances based on the hypothesis derived during training.
Process of Concept Learning
Training:
o The learner observes a set of labeled examples (positive and negative instances).
o It identifies common, relevant features from the positive examples and contrasts them with negative examples.
Hypothesis Formation:
o The system generates a hypothesis to represent the target concept.
o Example: "An elephant has a trunk and tusks" could be the hypothesis to classify an elephant.
Generalization:
o The hypothesis is generalized to classify new instances correctly.
Testing and Validation:
o The learned model is tested on unseen data to evaluate its performance.
Example: Concept Learning for Animals
Input: Training dataset of animals with labeled features.
o Positive examples: Animals labeled as "elephants."
o Negative examples: Animals not labeled as "elephants."
Output: Target concept for an elephant, e.g., "has a trunk," "has tusks," and "large size."
concept.
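The elephant example can be turned into a small sketch. This is a deliberately simple illustration, not a complete concept-learning algorithm: the hypothesis is assumed to be the set of features shared by all positive examples, and the feature names and training animals are hypothetical.

```python
def learn_concept(examples):
    """Hypothesis = features common to every positive example."""
    positives = [set(features) for features, label in examples if label]
    hypothesis = set(positives[0])
    for features in positives[1:]:
        hypothesis &= features
    return hypothesis

def classify(hypothesis, features):
    """Boolean-valued function: True if the instance has all required features."""
    return hypothesis <= set(features)

# Hypothetical labeled training data: (features, is_elephant).
training = [
    ({"has trunk", "has tusks", "large size", "grey"}, True),
    ({"has trunk", "has tusks", "large size"}, True),
    ({"has tail", "small size"}, False),
    ({"large size", "grey"}, False),
]

hypothesis = learn_concept(training)

# Contrast with the negative examples: none of them should satisfy the hypothesis.
consistent = all(classify(hypothesis, f) == label for f, label in training)

new_animal = {"has trunk", "has tusks", "large size", "big ears"}
prediction = classify(hypothesis, new_animal)

print(sorted(hypothesis), consistent, prediction)
```

The feature "grey" is dropped from the hypothesis because it is not shared by every positive example, which is the generalization step described above.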
Applications of Concept Learning
1. Natural Language Processing: Categorizing words or sentences based on grammatical or semantic features.
2. Image Recognition: Identifying objects or patterns in images.
3. Recommendation Systems: Classifying products or services to provide personalized recommendations.
4. Medical Diagnosis: Identifying diseases based on symptoms and medical test results.
Modelling in Machine Learning
A machine learning model abstracts a training dataset and makes predictions on unseen data.
Training: Involves feeding training data into a machine learning algorithm, tuning parameters, and generating a predictive model.
Goals: Selecting the right model, training effectively, reducing training time, and achieving high performance on unseen data.
Types of Parameters:
Model Parameters: Learnable directly from training data (e.g., regression coefficients, decision tree splits, neural network weights).
Hyperparameters: Cannot be learned directly and must be set (e.g., regularization strength, number of trees in random forests).
Evaluation and Error Metrics
Dataset Splitting:
o Training dataset: Used to train the model.
o Test dataset: Used to evaluate the model's ability to generalize.
Error Types:
o Training Error (In-sample Error): Error when the model is tested on training data.
o Test Error (Out-of-sample Error): Error when predicting on unseen test data.
Loss Function: Measures prediction error. Example: Mean Squared Error (MSE), where a smaller value indicates higher accuracy.
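The difference between training error and test error can be shown with a short sketch. The model y = 2x and both datasets are hypothetical; the model is treated as already trained so that only the evaluation step is shown.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def predict(x):
    # Hypothetical, already-trained model: y = 2x.
    return 2.0 * x

# Hypothetical split of the data into training and test sets.
train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
test_x, test_y = [4.0, 5.0], [8.5, 9.0]

train_error = mean_squared_error(train_y, [predict(x) for x in train_x])
test_error = mean_squared_error(test_y, [predict(x) for x in test_x])

print(f"training (in-sample) MSE: {train_error:.3f}")
print(f"test (out-of-sample) MSE: {test_error:.3f}")
```

A test error that is much larger than the training error, as in this made-up case, is the usual sign that the model does not generalize well to unseen data.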
Steps in Machine Learning Process
Algorithm Selection: Choose a model suitable for the problem and data.
Tuning: Adjust parameters to improve accuracy.
Challenges:
Balancing performance (accuracy) and complexity (overfitting or underfitting).
Approaches:
1. Resampling methods like splitting datasets or cross-validation.
2. Calculating accuracy or error metrics.
3. Probabilistic frameworks for scoring model performance.
Resampling Methods
Random Train/Test Splits: Randomly split the data for training and testing.
Cross-Validation: Tune models by splitting data into folds:
o K-fold Cross-Validation: Split data into k parts, train on k-1 folds, and test on the remaining fold, repeating so that each fold serves once as the test set.
o Stratified K-fold: Ensures each fold contains a proportionate distribution of class labels.
o Leave-One-Out Cross-Validation (LOOCV): Train on all data except one instance; repeat for every instance.
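The splitting logic behind k-fold cross-validation can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it assigns consecutive indices to folds and does no shuffling or stratification, which a real evaluation would usually add.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

# 3-fold split of 6 samples.
folds = k_fold_indices(6, 3)
for train, test in folds:
    print("train:", train, "test:", test)

# Leave-one-out is the special case k = number of samples.
loocv = k_fold_indices(4, 4)
```

Each sample appears in exactly one test fold, so every instance is used for testing once and for training k-1 times.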
Visualizing Model Performance
ROC Curve (Receiver Operating Characteristic):
o Plots True Positive Rate vs. False Positive Rate.
o Area Under the Curve (AUC): Measures classifier performance (1.0 = perfect; a curve close to the diagonal, with AUC near 0.5, is no better than random guessing).
Precision-Recall Curve:
o Useful for imbalanced datasets to evaluate precision and recall.
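The ROC curve and its AUC can be computed by sweeping a decision threshold over the classifier scores. The sketch below uses hypothetical labels and scores and assumes all scores are distinct (tied scores would need to be grouped into a single step).

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the threshold one score at a time."""
    ranked = sorted(zip(scores, labels), reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as consecutive (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical true labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)

print(points)
print(f"AUC = {auc:.3f}")
```

The AUC equals the fraction of positive-negative pairs in which the positive example receives the higher score, which is why a perfect ranking gives 1.0.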
Scoring and Complexity Methods
Scoring Models: Combine model performance and complexity into a single score.
Selects the simplest model with the fewest bits to represent both data and predictions.
Page37
MACHINE LEARNING [BCS602]
Page38