DV Module-3 Notes
Extracting Meaning from Data

How do companies extract meaning from the data they have?
Example: William Cukierski and David Huffaker

Feature Extraction / Generation
Feature extraction refers to taking the raw dump of data and handling it carefully, to avoid the "garbage in, garbage out" scenario that occurs when you feed raw data into an algorithm without enough forethought.
-> Take the raw data and clean it.
-> Use the cleaned data as input features (attributes) to an algorithm or model.

Feature Selection
Feature selection is the process of constructing a subset of the data or transformed data to be the predictors (or variables) for models and algorithms.

Background: Data Science Competitions
Competitors submit their predictions and get ranked; whoever gets closest wins.
ML competitions are:
1) The Knowledge Discovery and Data Mining (KDD) competition
2) The one-time million-dollar Netflix Prize
3) Kaggle model competitions

Remarks: There are 2 remarks about data science competitions. They are:
1) Want ...
2) Creating these competitions puts one in a position to codify data science, or define its scope.

Advantages: Competitions cut out the messy things before you start building models, i.e.
1) Asking good questions
2) Collecting and cleaning data
3) What happens once you have your model, including visualization and communication
The aim is to retain users, and through them to generate revenue.
A simple model for this is logistic regression. This model is based on probability: it returns the probability that a user returns for the second month, based on the behavior of the user in the first month.
-> Record each and every user action.
-> Once the actions are recorded and tracked, they are added to the data set.
-> All the actions are stored in a log file with a timestamp.
-> This log file is converted into rows and columns and stored in the dataset.
   Each row belongs to a user.
   Each column represents a feature.

-> We try to capture features, but ultimately we may lack some features. This depends on:
1) Whether or not it is possible to even capture the information
2) Whether or not it even occurs to you at all to try to capture it
-> There are certain buckets (scenarios) in which a feature can be generated:
1) The feature is relevant and useful, but it is impossible to capture it.
2) The feature is relevant and useful, it is possible to log it, and you did.
3) The feature is relevant and useful, it is possible to log it, but you didn't.
4) Not relevant or useful, but you don't know that and you log it.
5) Not relevant or useful, but you either can't capture it or it didn't occur to you.
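The logistic regression model described earlier in this section turns first-month behavior into a probability of returning. The sketch below only illustrates that mapping: the feature names, the weights and the bias are made up for illustration and are not fitted to any data.

```python
import math


def return_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen by hand for illustration only.
WEIGHTS = {"num_days_visited": 0.3, "total_points": 0.01}
BIAS = -2.0

# First-month behavior of one hypothetical user.
user = {"num_days_visited": 5, "total_points": 50}
p_return = return_probability(user, WEIGHTS, BIAS)
```

In practice the weights would be estimated from the logged data rather than written by hand; the point here is only that the output always lies between 0 and 1.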
1) Relevant and useful, but it is impossible to capture it
A lot of information about the user exists, but we are not capturing all of it.
Ex:
i) How much free time do users actually have?
ii) What other apps has the user downloaded?
iii) Is the user unemployed or not?
iv) Do they suffer from any diseases?
v) Do they have an addictive personality?
But we cannot capture these.

2) Relevant and useful, possible to log it, and you did
You decided to log it during your brainstorming session. But just because you chose to log it doesn't by itself mean that it is a relevant or useful feature.

3) Relevant and useful, possible to log it, but you didn't
Suppose you didn't record a user action such as uploading a photo to their profile, even though it may predict whether the user returns.
One of the key ways to avoid missing useful features is by doing usability studies.

4) Not relevant or useful, but you don't know that and you log it
You have logged it, but you don't actually need it.

Feature Selection Algorithms
Feature selection is the process of identifying the subset of data or transformed data that is used as input to the model.
Ex: Raw data may have many redundant or correlated variables, but you don't want to include all those variables in your model.
-> You might want to construct new variables by transforming the variables with a logarithm, or by transforming a continuous variable into a binary one, before feeding them into the model.

Terminology: Statisticians and computer scientists use different names for features.
-> Statisticians say "explanatory variables" or "predictors" when they are describing the subset of data that is the input to a model.
-> Computer scientists say "features".

Example: User Retention for the Chasing Dragons Application
For user retention in the Chasing Dragons application:
1) Record user actions (behavior)
2) Store user actions in time-stamped event logs
3) Process these logs into a dataset with rows and columns (feature generation)
   Row = user record
   Column = feature
4) Feature selection
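Steps 1) to 3) of the example, together with the logarithm and binary transformations mentioned under Feature Selection Algorithms, can be sketched as follows. This is a minimal illustration: the event log, the action names ("visit", "points"), the feature names and the threshold of 100 points are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical time-stamped event log: (user_id, timestamp, action, value).
EVENT_LOG = [
    ("u1", "2024-01-01T10:00", "visit", 1),
    ("u1", "2024-01-01T10:05", "points", 120),
    ("u1", "2024-01-03T09:00", "visit", 1),
    ("u1", "2024-01-03T09:10", "points", 30),
    ("u2", "2024-01-02T18:00", "visit", 1),
    ("u2", "2024-01-02T18:20", "points", 5),
]


def build_feature_rows(event_log, high_score_threshold=100):
    """Turn raw events into one row per user and one column per feature."""
    days = defaultdict(set)    # user -> set of distinct visit dates
    points = defaultdict(int)  # user -> total points
    for user_id, timestamp, action, value in event_log:
        if action == "visit":
            days[user_id].add(timestamp[:10])  # keep only the date part
        elif action == "points":
            points[user_id] += value

    rows = {}
    for user_id in sorted(set(days) | set(points)):
        total = points[user_id]
        rows[user_id] = {
            "num_days_visited": len(days[user_id]),
            "total_points": total,
            # New variable built with a logarithm (log1p handles total = 0).
            "log_total_points": math.log1p(total),
            # Continuous variable turned into a binary one.
            "is_high_scorer": 1 if total >= high_score_threshold else 0,
        }
    return rows


feature_rows = build_feature_rows(EVENT_LOG)
```

Each key of `feature_rows` is a user record (a row) and each inner key is a feature (a column), which is the shape the feature selection step then works on.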
Consider the user actions for the Chasing Dragons app:
1) Number of days the user visited in the first month
2) Amount of time until the second visit
3) Number of points on ...
4) Total number of points in the first month
5) Did the user fill out the Chasing Dragons profile?
6) Age and gender of the user
7) Screen size of the device
8) Number of dragons killed
Use a data set of the above actions, with one column per feature:
Num. days played | Profile | Age | Gender | Num. dragons killed | Num. friends invited

Feature Selection Methods
Feature selection selects the most relevant subset of features from the original set, dropping redundant and irrelevant features.
"An Introduction to Variable and Feature Selection" - Isabelle Guyon, paper published in 2003.
-> This paper mainly focuses on constructing and selecting subsets of features that are useful to build a good predictor.
There are 3 types of feature selection methods:
1) Filters
2) Wrappers
3) Embedded methods
Wrapper Method
Wrapper feature selection tries to find subsets of features, of some fixed size, to use.
-> A subset has k features chosen out of n things.
[Figure: set of features -> feature selection (generate subsets)]
-> For each feature subset, we estimate the algorithm's performance, evaluating it using a selection criterion.
-> Then we add or remove features from the subset based on the estimate of the selection criterion.
-> This is an iterative process. So if there are n features, then we build n ML models in the first iteration.
-> We need to consider 2 aspects:
1) Selecting an algorithm to use to select features
2) Deciding on a selection criterion to decide that your set of features is good

Stepwise Regression
A method for feature selection that involves selecting features according to some selection criterion, by adding or removing features to a regression model.
Stepwise regression methods (algorithms): 3 methods
1) Forward selection method
2) Backward elimination method
3) Combined approach (forward and backward)

1) Forward Selection Method
Step 1: Start with a regression model with no features. Gradually add one feature at a time, according to which feature improves the model the most based on a selection criterion.
Step 2: Build all possible regression models with a single predictor. Pick the best.
Step 3: Try all possible models that include the best predictor and a second predictor. Pick the best. Keep adding one feature at a time.
Step 4: Stop when the selection criterion no longer improves, but instead gets worse.

2) Backward Elimination Method
Step 1: Start with a regression model that includes all features. Gradually remove one feature at a time, according to the feature whose removal makes the biggest improvement in the selection criterion.
Step 2: Stop removing features when removing a feature makes the selection criterion get worse.

3) Combined Approach
-> Most subset methods capture minimum redundancy and maximum relevance.
Ex: A greedy algorithm starts with the best feature, takes a few more highly ranked features, removes the worst, and so on.
-> This is a hybrid approach with a filter model.

Selection criteria:
1) R-squared
2) p-values
3) AIC (Akaike Information Criterion): The goal is to minimize AIC.
   AIC = 2k - 2 ln(L)
   where k = the number of parameters in the model
         ln(L) = the maximized value of the log likelihood
4) BIC (Bayesian Information Criterion): The goal is to minimize BIC.
   BIC = k * ln(n) - 2 ln(L)
   where n = the number of observations
5) Entropy: It is a measure of impurity in a dataset. It is used to build a decision tree.
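The AIC and BIC formulas above, and the forward selection steps, can be tried out with a small sketch. It rests on several assumptions that are not in the notes: the regression is ordinary least squares with Gaussian errors, k counts the regression coefficients including the intercept, the "no features" model is an intercept-only model, and the data is synthetic with hypothetical feature names.

```python
import numpy as np


def gaussian_log_likelihood(y, X):
    """Maximized log likelihood ln(L) of a least-squares fit (Gaussian errors)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    rss = max(rss, 1e-12)  # avoid log(0) on a perfect fit
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)


def aic(k, log_l):
    return 2 * k - 2 * log_l


def bic(k, n, log_l):
    return k * np.log(n) - 2 * log_l


def forward_selection(features, y):
    """features: dict of name -> 1-D array. Adds features while AIC improves."""
    n = len(y)
    selected = []
    design = np.ones((n, 1))  # assumption: "no features" = intercept only
    best = aic(1, gaussian_log_likelihood(y, design))
    while len(selected) < len(features):
        scores = {}
        for name, column in features.items():
            if name in selected:
                continue
            X = np.column_stack([design, column])
            scores[name] = aic(X.shape[1], gaussian_log_likelihood(y, X))
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:
            break  # criterion no longer improves: stop
        best = scores[candidate]
        selected.append(candidate)
        design = np.column_stack([design, features[candidate]])
    return selected


# Synthetic data: the target depends on days_visited only.
rng = np.random.default_rng(0)
n_users = 200
days_visited = rng.normal(size=n_users)
screen_size = rng.normal(size=n_users)
target = 2.0 * days_visited + 0.1 * rng.normal(size=n_users)
chosen = forward_selection(
    {"days_visited": days_visited, "screen_size": screen_size}, target
)
```

Backward elimination would follow the same pattern in reverse: start from the full design matrix and drop the column whose removal lowers the criterion the most. Passing `bic` instead of `aic` changes only the penalty term.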
Embedded Method
-> During training, some steps of the ML algorithm do feature selection.
[Figure: decision tree; recoverable labels: Study, Pub, No, Yes]
Decision Tree
Definition:
-> A decision tree is a type of ML model used for classification or regression.
-> In a decision tree, each node represents a feature and each leaf node represents a category.
-> The best feature (attribute) to split on is selected at each node in the decision tree.
* For a binary-class data set, entropy lies between 0 and 1.

Fig: Decision Tree for College Students
Explanation:
-> College students spend their time attending parties, studying, or watching TV.
-> If students have a party and no deadline, then they will go to the party.

Information Gain (IG)
Formula:
IG(X, a) = H(X) - H(X|a)
where H(X) = entropy of X
      H(X|a) = conditional entropy of X given attribute a
      X = the given random variable, which is either true or false

Entropy formula (p = p(X = true)):
H(X) = -p log2(p) - (1 - p) log2(1 - p)
[Figure: entropy graph, H(X) plotted against p(X = true)]

Conditional entropy formula:
H(X|a) = sum over the values v of a of p(a = v) * H(X | a = v)
where p(a = v) is the proportion of samples for which the feature a takes the value v
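Entropy and information gain can be computed directly from counts. The sketch below uses the standard definitions with base-2 logarithms; the tiny "party" data set and its attribute names are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """H(X) in bits, estimated from the label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """IG(X, a) = H(X) - H(X|a) for one attribute a."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # H(X|a): entropy of each group, weighted by its share of the samples.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical samples: did the student go to the party?
went_to_party = [True, True, False, False]
has_deadline = [False, False, True, True]   # separates the classes perfectly
owns_tv = [True, False, True, False]        # tells us nothing about the class

gain_deadline = information_gain(went_to_party, has_deadline)
gain_tv = information_gain(went_to_party, owns_tv)
```

A decision tree would split on `has_deadline` first here, because it has the larger information gain.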