
DV Module-3 Notes


Module-3

Feature Generation and Feature Selection: Extracting Meaning from Data

1) Motivating Application: User (Customer) Retention
2) Feature Generation (role of domain expertise and place for imagination)
3) Feature Selection Algorithms: Filters, Wrappers, Decision Trees, Random Forests

Recommendation Systems

4) Building a User-Facing Data Product
5) Algorithmic Ingredients of a Recommendation Engine
6) Dimensionality Reduction
7) Singular Value Decomposition
8) Principal Component Analysis
9) Exercise: Build Your Own Recommendation System

Extracting Meaning from Data

-> How do companies extract meaning from the data they have?
Example: William Cukierski and David Huffaker

Feature Extraction & Generation
-> Feature extraction refers to taking a raw dump of data and preparing it for a model:
i) Handle and avoid "garbage in, garbage out".
ii) Do not feed raw data into an algorithm without enough forethought.
iii) Take the raw data and clean it.
iv) Use the cleaned data as input features (attributes) to an algorithm or a model.

Feature Selection
-> Feature selection is the process of constructing a subset of the data, or of transformed data, to be the predictors (or variables) for models and algorithms.

Background: Data Science Competitions
-> In data science competitions, individuals or teams compete over a period of several weeks or months to design a prediction algorithm based on a particular dataset.
-> The organisation or the people who conduct a data science competition provide the following to the competitors:
1) A dataset
2) An evaluation metric
3) A set of rules, e.g. how often competitors can submit their predictions and whether or not teams can merge into larger teams
-> The competitors in the competition get ranked.

Examples of ML competitions:
1) The Knowledge Discovery and Data Mining (KDD) competition
2) The one-time million-dollar Netflix Prize
3) Kaggle model competitions

Remarks: there are 2 remarks about data science competitions.
1) Want ...
2) Creating these competitions puts one in a position to codify data science, or to define its scope.

Note: competitions cut out the messy things that come before you start building models:
1) Asking good questions
2) Collecting and cleaning data
They also cut out what happens once you have your model, including visualization and communication.

Crowdsourcing
-> Crowdsourcing means using many people to solve a problem independently.
-> In a challenge, competitors (contestants) compete to find the best solution.
Example: Amazon Mechanical Turk, an online crowdsourcing service.

Amazon Mechanical Turk
-> It gives tasks to humans in order to help machines.
1) Suppose there is a set of images that need to be labeled "happy" or "sad".
2) These labels could be used as the basis of a training set for a supervised learning problem.
3) An algorithm could then be trained on these human-labeled images to label new images automatically.
-> Mechanical Turk is described as "artificial artificial intelligence":
   the humans are helping the machines, and the machines are then helping the humans.

The Kaggle Model
-> "We are making data science a sport."
-> Kaggle forms relationships with companies and with data scientists.
-> Kaggle hosts competitions for businesses that essentially want to crowdsource the solution to their data problems.
-> Kaggle provides the infrastructure and attracts the data science talent.
-> The companies are the paying customers.
-> The companies provide the datasets and the data problems that they want solved.
-> Kaggle crowdsources these problems to data scientists around the world.
The model has two sides:
1) The data scientist (a single contestant)
2) Their customers

1) A single contestant
-> The data scientist (contestant) receives a training set and a test set (where the y's are hidden).
-> The contestant builds a model based on the training set.
-> The data scientist then uses the model to get predicted y's for the test set and uploads them into the Kaggle system to see the evaluation score.
-> The data scientist does not need to share the actual code.
-> Participants can submit their predictions up to 5 times a day during the competition.
-> As contestants submit their predictions, the Kaggle leaderboard updates immediately to display each contestant's current evaluation metric on the hold-out test set.

2) Their customers
-> Kaggle's customers are companies.
-> There is a mismatch between those who need analysis and those with the skills.
-> Kaggle solves companies' large data problems by crowdsourcing, i.e. by hosting Kaggle competitions.

Example: Kaggle's essay scoring competition
-> Kaggle provides hand-scored essays to the data scientist.
-> The data scientist then uses them to build an automatic essay scoring engine.

-> For this competition, there are 5 essay sets.
-> Selected essays range from an average length of 150 to 550 words per response.
-> All responses were written by students.
-> All essays were hand graded and double scored.
The dataset has the following columns:
1) A unique identifier for each individual student essay
2) 1-5: an id for each set of essays
3) Essay: the ASCII text of a student's response
4) Rater 1: Rater 1's grade
5) Rater 2: Rater 2's grade
6) Grade: resolved score between the raters

Thought experiment: what are the ethical implications of a Robo-Grader?
* Human graders aren't always fair.
* Are machines making things more structured, and is this inhibiting creativity?
* Is the goal of a test to write a good essay or to do well in a standardized test?

Motivating Application: User (Customer) Retention

-> User retention refers to the measure of how long users continue to use a product, service or application.
Ex: Suppose we consider Netflix, Amazon, Hotstar, YouTube etc.
-> All the above apps are subscription based: there is a predefined duration for which we pay and subscribe to the services. Once the time expires, we have to pay again to continue the subscription.
-> Our subscriptions generate the revenue for those applications.
-> If a user continuously renews the subscription, it is called user (customer) retention.
-> The more subscribers we have, the more money the app gets.
To increase revenue there are 2 ways:
1) Find a way to increase the retention rate.
2) Acquire new users.
-> It is easier to retain an existing user than to acquire a new one.
-> So we predict the user retention situation.


1) Build a model that predicts whether a user will return, based on the behaviour that the user has exhibited in the current month.
-> With such a predictive model we can decide on treatments. For example, we may use the model to give a free month to users in order to improve user retention.
Ex: Video streaming app
Consider the following user action behaviours:
1) How many movies a particular user has watched in the first month
2) What the duration is
3) Whether the user opened the app on all 30 days
4) Whether they downloaded content in the app, etc.
-> By using all these we can predict the likelihood that the user will continue the subscription or not.
-> Once we predict the user's behaviour, it is easier to retain the users and thereby generate revenue.
-> A simple such model is "Logistic Regression".
-> This model is based on probability: it returns the probability that a user returns for the second month, based on the behaviour of the user in the first month.
-> To capture the behaviour, record each and every user action.
-> Once the actions are recorded and tracked, they are added to the dataset.
-> All the actions are stored in a log file with a timestamp.
-> This log file is converted into rows and columns and stored in the dataset.
   Each row belongs to a user.
   Each column represents a feature.

Feature Generation / Feature Extraction

-> The process of recording each user action while the user is using an application belongs to feature generation (extraction).
-> It is also good to have imagination as a data scientist.
-> In today's technology environment one can generate petabytes of information per day.
-> Among them, how many of these features are noise? Some features cannot add meaning to the models.
-> Ultimately we are limited in the features which we may capture:
1) Whether or not it is possible to even capture the information
2) Whether or not it even occurs to you at all to try to capture it
-> We have certain buckets (scenarios) into which a feature can fall:
1) The feature is relevant and useful, but it is impossible to capture it.
2) The feature is relevant and useful, possible to log, and you did log it.
3) The feature is relevant and useful, possible to log, but you didn't.
4) The feature is not relevant or useful, but you don't know that and you log it.
5) The feature is not relevant or useful, and you either can't capture it or it didn't occur to you.
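The log-to-dataset step described in these notes (each row a user, each column a feature) can be sketched as follows. This is only a minimal illustration: the event tuples, action names and feature names are hypothetical and not taken from any real application.

```python
from collections import defaultdict

# Hypothetical time-stamped event log: (timestamp, user_id, action)
events = [
    ("2024-01-01T10:00", "u1", "visit"),
    ("2024-01-01T10:05", "u1", "kill_dragon"),
    ("2024-01-02T09:00", "u1", "visit"),
    ("2024-01-03T12:00", "u2", "visit"),
    ("2024-01-03T12:10", "u2", "fill_profile"),
    ("2024-01-03T12:20", "u2", "invite_friend"),
]

def build_dataset(events):
    """Turn raw events into one row per user and one column per feature."""
    days = defaultdict(set)                         # user -> distinct active days
    counts = defaultdict(lambda: defaultdict(int))  # user -> action -> count
    for timestamp, user, action in events:
        days[user].add(timestamp[:10])              # keep only the date part
        counts[user][action] += 1
    rows = []
    for user in sorted(days):
        rows.append({
            "user_id": user,
            "num_days_visited": len(days[user]),
            "num_dragons_killed": counts[user]["kill_dragon"],
            "profile_filled": int(counts[user]["fill_profile"] > 0),
            "num_friends_invited": counts[user]["invite_friend"],
        })
    return rows

dataset = build_dataset(events)
for row in dataset:
    print(row)
```

An action that was never logged cannot become a column later, which is why the buckets above matter.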

1) Relevant and useful, but it is impossible to capture it
-> A lot of information exists about the user, but we cannot capture everything about the user's life.
Ex: i) How much free time does the user actually have?
ii) What are the other apps that the user has downloaded?
iii) Is the user unemployed or not?
iv) Does the user suffer from any diseases?
v) Does the user have an addictive personality?
-> But we cannot capture these.

2) Relevant and useful, possible to log it, and you did
-> You decided to log it in the app during your brainstorming session. But just because you chose to log it doesn't mean you know that it is relevant or useful; feature selection has to discover that.

3) Relevant and useful, possible to log it, but you didn't
-> Suppose you didn't record a user action such as uploading the user's photo to their profile.
-> One of the key ways to avoid missing useful features is doing usability studies.

4) Not relevant or useful, but you don't know that and you log it
-> You have logged it but you don't actually need it.

5) Not relevant or useful, and you either can't capture it or it didn't occur to you
-> You don't need it.

Feature Selection Algorithms

-> Feature selection is the process of identifying the subset of data, or of transformed data, that is used as input to the model.
Ex: Raw data may have many redundant or correlated variables, but you don't want to include all those variables in your model.
-> You might want to construct new variables by transforming the variables with a logarithm, or by transforming a continuous variable into a binary one, before feeding them into the model.

Terminology: different names for features used by statisticians and computer scientists
-> Statisticians say "explanatory variables" or "predictors" when they are describing the subset of data that is the input to a model.
-> Computer scientists say "features".

Example: User Retention for the Chasing Dragons Application
To increase user retention for the Chasing Dragons application:
1) Record user actions (behaviour).
2) Store user actions in time-stamped event logs.
3) Process these logs into a dataset with rows and columns (feature generation).
   Row = user record
   Column = feature
4) Select a subset of features and give that as input to the model (logistic regression) (feature selection).
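The last step of this pipeline, feeding selected features into a logistic regression model that returns the probability of a user coming back, can be sketched as below. This is a minimal sketch under stated assumptions: the toy data, the feature choice and the plain gradient-descent fitting routine are illustrative, not the method of any particular library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(alpha + beta . x) by batch gradient descent."""
    n, d = len(X), len(X[0])
    alpha, beta = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_alpha, grad_beta = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(alpha + sum(b * v for b, v in zip(beta, xi))) - yi
            grad_alpha += err
            for j, v in enumerate(xi):
                grad_beta[j] += err * v
        alpha -= lr * grad_alpha / n
        beta = [b - lr * g / n for b, g in zip(beta, grad_beta)]
    return alpha, beta

def predict_proba(alpha, beta, x):
    return sigmoid(alpha + sum(b * v for b, v in zip(beta, x)))

# Hypothetical first-month features: [fraction of days active, profile filled (0/1)]
X = [[0.1, 0], [0.2, 0], [0.3, 1], [0.5, 0], [0.6, 1], [0.8, 1], [0.9, 0], [1.0, 1]]
# 1 = the user came back in the second month
y = [0, 0, 0, 0, 1, 1, 1, 1]

alpha, beta = fit_logistic(X, y)
p_low = predict_proba(alpha, beta, [0.1, 0])   # rarely active, no profile
p_high = predict_proba(alpha, beta, [0.9, 1])  # very active, profile filled
print(round(p_low, 3), round(p_high, 3))
```

Users with a low predicted probability are the natural candidates for a retention treatment such as a free month.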
Consider the user actions for the Chasing Dragons app:
1) Number of days the user visited in the first month
2) Amount of time until the second visit
3) Number of points on ...
4) Total number of points in the first month
5) Did the user fill out the Chasing Dragons profile?
6) Screen size of device
7) Number of dragons killed

Dataset of the above actions (columns):
Num. days | Profile filled | Age | Gender | Num. dragons killed | Num. friends invited

Feature selection
-> Any subset of features is taken from the above dataset as input to the logistic regression model.
-> The subscription for the next month is predicted by using the regression model.
The logistic regression form is:

$$\operatorname{logit}\big(P(c_i = 1 \mid x_i)\big) = \alpha + \beta^{T} x_i$$

where
c_i = 1 if user i uses the Chasing Dragons app at any time in the subsequent month
x_i = feature / independent variable / attribute / parameter etc.

Feature Selection Methods

-> Feature selection selects the most relevant subset of features from the original set by dropping redundant and irrelevant features.
-> "An Introduction to Variable and Feature Selection": Isabelle Guyon published this paper in 2003.
-> The paper mainly focuses on constructing and selecting subsets of features that are useful to build a good predictor.
There are 3 types of feature selection methods:
1) Filters
2) Wrappers
3) Embedded methods

1) Filters
-> The filter method uses statistical tools to select a feature subset based on the features' relationship with the target variable.
-> This method removes features with low correlation with the target variable before training the final ML model.
-> It estimates the strength of the relationship using statistical tools:
1) Chi-square test
2) Information gain
3) Fisher's score
4) Pearson correlation
5) ANOVA etc.
[Figure: set of features -> feature selection -> performance]
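A filter based on Pearson correlation, one of the statistical tools listed above, might look like the following. It is a sketch only: the column names, the toy values and the 0.5 threshold are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no information
    return cov / (sd_x * sd_y)

def filter_features(columns, target, threshold=0.5):
    """Keep features whose absolute correlation with the target reaches the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

# Hypothetical Chasing Dragons style columns (one value per user)
columns = {
    "num_days_visited": [1, 2, 3, 10, 12, 15],
    "screen_size": [5, 6, 5, 6, 5, 6],
    "profile_filled": [0, 0, 1, 1, 1, 1],
}
returned = [0, 0, 0, 1, 1, 1]  # target: did the user come back?

kept, scores = filter_features(columns, returned)
print(kept)
```

The filter runs before any model is trained, so it is cheap, but it scores each feature on its own and cannot notice redundancy between features.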

2) Wrappers

-> Wrapper feature selection tries to find subsets of features, of some fixed size, that work well: subsets of k features out of n things (features).
[Figure: set of features -> generate subset -> algorithm to use -> performance]
-> For each feature subset, we estimate the algorithm's performance by evaluating it using a selection criterion.
-> Then we add or remove features from the subset based on the estimate.
-> This is an iterative process. If there are n features, we build n ML models in the first iteration.
-> We need to consider 2 aspects:
1) Selecting an algorithm to use to select features
2) Deciding on a selection criterion or filter to decide that your set of features is "good"

Method 1: Stepwise Regression
-> A method for feature selection that involves selecting features according to some selection criterion by adding features to, or removing them from, a regression model.
Stepwise regression methods (algorithms): there are 3 methods to use.
1) Forward selection method
2) Backward elimination method
3) Combined approach (forward and backward)

1) Forward selection method
Step 1: Start with a regression model with no features. Gradually add one feature at a time, according to which feature improves the model the most based on a selection criterion.
Step 2: Build all possible regression models with a single predictor. Pick the best.
Step 3: Try all possible models that include the best predictor and a second predictor. Pick the best.
Step 4: Keep adding one feature at a time. Stop when the selection criterion no longer improves but instead gets worse.

2) Backward elimination method
Step 1: Start with a regression model that includes all features. Gradually remove one feature at a time, namely the feature whose removal makes the biggest improvement in the selection criterion.
Step 2: Stop removing features when removing a feature makes the selection criterion get worse.

3) Combined approach
-> Most subset methods capture minimum redundancy and maximum relevance.
Ex: A greedy algorithm starts with the best feature, takes a few more highly ranked features, removes the worst, and so on.
-> This is a hybrid approach with a filter method.

Disadvantage: the wrapper method is prone to overfitting.

Selection Criterion

Different selection criteria are:
1) R-squared
2) p-values
3) AIC (Akaike Information Criterion)
4) BIC (Bayesian Information Criterion)
5) Entropy

1) R-squared: the proportion of variance explained by the model.

2) p-values: when estimating the coefficients (the βs), to think in terms of p-values we assume a null hypothesis that the coefficients are zero. The p-value is the probability of obtaining the observed test statistic under the null hypothesis; a low p-value suggests that the coefficient is non-zero.

3) AIC: the goal is to minimize AIC.

$$\mathrm{AIC} = 2k - 2\ln(L)$$

where k = the number of parameters in the model
ln(L) = maximized value of the log likelihood

4) BIC: the goal is to minimize BIC.

$$\mathrm{BIC} = k \ln(n) - 2\ln(L)$$

where n = number of observations

5) Entropy: it is a measure of impurity in a dataset.
-> It is used to build a decision tree.
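Forward selection with AIC as the selection criterion can be sketched as below. Several things here are assumptions made for illustration: the model is ordinary least squares with Gaussian errors (so the maximized log likelihood has a closed form), k counts the intercept plus one coefficient per feature, and the small coded dataset and its column names are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination (assumes A is non-singular)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_rss(columns, names, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    X = [[1.0] + [columns[f][i] for f in names] for i in range(len(y))]
    p = len(names) + 1
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    w = solve(XtX, Xty)
    return sum((yi - sum(wj * xj for wj, xj in zip(w, row))) ** 2 for row, yi in zip(X, y))

def aic(k, log_l):
    return 2 * k - 2 * log_l

def bic(k, n, log_l):
    return k * math.log(n) - 2 * log_l

def gaussian_log_likelihood(rss, n):
    # Maximized log likelihood of a linear model with Gaussian errors
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

def score(columns, names, y):
    k = len(names) + 1  # assumption: intercept plus one coefficient per feature
    return aic(k, gaussian_log_likelihood(fit_rss(columns, names, y), len(y)))

def forward_selection(columns, y):
    """Add the feature that lowers AIC the most; stop when AIC stops improving."""
    selected, remaining = [], list(columns)
    best = score(columns, selected, y)
    while remaining:
        cand_score, cand = min((score(columns, selected + [f], y), f) for f in remaining)
        if cand_score >= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best

# Hypothetical features coded as -1 / +1, and a numeric outcome
columns = {
    "days_visited": [-1, -1, -1, -1, 1, 1, 1, 1],
    "profile_filled": [-1, -1, 1, 1, -1, -1, 1, 1],
    "screen_size": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [-3.5, -3.5, -2.5, -2.5, 1.5, 1.5, 4.5, 4.5]

selected, best_aic = forward_selection(columns, y)
print(selected, round(best_aic, 3))
```

Swapping `aic` for `bic` inside `score` gives the BIC variant; backward elimination follows the same loop but starts from all features and removes one at a time.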

3) Embedded Methods

a) Decision Trees
b) Random Forests

a) Decision Trees
-> During training, the ML algorithm itself performs feature selection in some of its steps.
-> At each node split, it chooses the best feature to split the data by.
-> It breaks big decisions down into a series of questions.

Definition: A decision tree is a type of ML model used for classification or regression.
-> In a decision tree each node represents a feature and each leaf node represents a category.
-> The best feature (attribute) to split on is selected at each node in the decision tree.

Example: a college student facing the decision of how to spend their time.
The decision depends on a bunch of factors:
-> Parties or deadlines
-> What the student cares about most

[Figure: Decision tree for college students. Internal nodes: Party?, Deadline?, Lazy? (Yes/No branches); leaf nodes: Party, Study, Pub, TV.]

Explanation
-> College students spend their time attending parties, studying or watching TV.
-> If students have a party and no deadline, they will go to the party.
-> Otherwise they follow the tree down to study, go to the pub or watch TV.

Chasing Dragons example
-> To classify the user as:
   Yes: going to come back next month
   No: not going to come back next month
-> The class of any given user depends on many factors.
-> How do you construct a decision tree, and what mathematical properties are used?

[Figure: Decision tree for Chasing Dragons. Questions at the nodes: "Did user fill out their profile?", "Did user come back more than once in the first week?", "Did user kill (slay) a dragon?", "Did user invite at least one friend to join?"; each has Yes/No branches. Leaves: "Prediction: user will come back" / "Prediction: user will not come back".]

Give an explanation of the above diagram.

-> Assume that we break compound questions into multiple Yes/No questions.
-> Denote: for a given random variable X, we denote by p(X = 1) and p(X = 0) the probabilities that X is true and false.
1) Find the entropy to identify the informative feature.

Entropy: it is used to quantify the most informative feature.
* Entropy is used to build decision trees.
* Entropy helps us to decide which feature is the best splitting feature, i.e. what is the most informative question to ask.
* Pure node: in a subset, if all the values belong to the same class it is called a pure node.
  Ex: user subscription: Yes, Yes, Yes
  Entropy for a pure node = 0
* Impure node: in a subset, if the values are not all of the same class it is called an impure node.
  Ex: user subscription: Yes, No, Yes, No
  Entropy for an impure node is greater than 0 and reaches 1 for an even split of two classes.
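The pure and impure node cases above can be checked numerically with a small entropy function. This is a minimal sketch; the label lists are made-up examples.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / n
        h -= p * math.log2(p)
    return h

pure = entropy(["yes", "yes", "yes"])            # all one class
even = entropy(["yes", "no", "yes", "no"])       # 50/50 split
skewed = entropy(["yes", "yes", "yes", "no"])    # impure, but not evenly split

print(pure, even, round(skewed, 4))
```

A skewed node falls strictly between the two extremes, which is why entropy works as a graded measure of impurity rather than a yes/no test.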

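* For a binary class dataset, entropy lies between 0 and 1.

Formula

$$H(X) = -p(X=1)\log_2 p(X=1) - p(X=0)\log_2 p(X=0)$$

X = given random variable
p = probability that X is true or false

[Figure: entropy graph]

Information Gain (IG) of an attribute a

$$IG(X, a) = H(X) - H(X \mid a)$$

H(X) = entropy
H(X | a) = conditional entropy

Conditional entropy formula
Let a_i be the values taken by the feature a. Then

$$H(X \mid a) = \sum_{i} p(a = a_i)\, H(X \mid a = a_i)$$

where the probabilities are estimated from the samples.

At each step in decision tree construction, select the best splitting feature as follows:
-> Use entropy and information gain to find the best splitting feature.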
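Choosing the best splitting feature by information gain, as described above, can be sketched like this. The rows and feature names are hypothetical, and probabilities are estimated as simple sample proportions.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / n
        h -= p * math.log2(p)
    return h

def information_gain(rows, feature, target):
    """IG(X, a) = H(X) - H(X | a), using sample proportions as probabilities."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - conditional

def best_split(rows, features, target):
    return max(features, key=lambda f: information_gain(rows, f, target))

# Hypothetical Chasing Dragons users
rows = [
    {"profile_filled": 1, "invited_friend": 1, "came_back": 1},
    {"profile_filled": 1, "invited_friend": 0, "came_back": 1},
    {"profile_filled": 0, "invited_friend": 1, "came_back": 0},
    {"profile_filled": 0, "invited_friend": 0, "came_back": 0},
]
features = ["profile_filled", "invited_friend"]

gains = {f: information_gain(rows, f, "came_back") for f in features}
root = best_split(rows, features, "came_back")
print(gains, root)
```

A full tree builder would apply `best_split` recursively to each resulting subset until the nodes are pure or no features remain.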
