Data Science - A Kaggle Walkthrough - Introduction - 1 PDF
Data Science: A Kaggle Walkthrough - Introduction
I have spent a lot of time working with spreadsheets, databases, and data more generally. This
work has led to me having a very particular set of skills, skills I have acquired over a very long
career. Skills that make me a nightmare for people like you. If you let my daughter go now,
that'll be the end of it. I will not look for you, I will not pursue you. But if you don't, I will look for
you, I will find you, and I will kill you.
The badassery of Liam Neeson aside, although I have spent years working with data in a range
of capacities, the skills and techniques required for data science are a very specific subset that
do not tend to come up in too many jobs. What is more, data science tends to involve a lot
more programming than most other data-related work, and this can be intimidating for people
who are not coming from a computer science background. The problem is, people who work
with data in other contexts (e.g. economics and statistics), as well as those with industry-specific
experience and knowledge, can often bring different and important perspectives to data
science problems. Yet, these people often feel unable to contribute because they do not
understand programming or the "black box" models being used.
Something that has nothing to do with data science
Therefore, in a probably futile attempt to shed some light on this field, this will be the first part in
a multi-part series looking at what data science involves and some of the techniques most
commonly used. This series is not intended to make everyone experts on data science, rather it
is intended to simply try and remove some of the fear and mystery surrounding the field. In
order to be as practical as possible, this series will be structured as a walkthrough of the
process of entering a Kaggle competition and the steps taken to arrive at the final submission.
http://brettromero.com/wordpress/datascienceakagglewalkthroughintroduction/
6/7/2016
What is Kaggle?
For those that do not know, Kaggle is a website that hosts data science problems for an online
community of data science enthusiasts to solve. These problems can be anything from
predicting cancer based on patient data, to sentiment analysis of movie reviews and
handwriting recognition; the only thing they all have in common is that they are problems
requiring the application of data science to be solved.
The problems on Kaggle come from a range of sources. Some are provided just for fun and/or
educational purposes, but many are provided by companies that have genuine problems they
are trying to solve. As an incentive for Kaggle users to compete, prizes are often awarded for
winning these competitions, or finishing in the top x positions. Sometimes the prize is a job or
products from the company, but there can also be substantial monetary prizes. Home Depot, for
example, is currently offering $40,000 for the algorithm that returns the most relevant search
results on homedepot.com.
Despite the large prizes on offer though, many people on Kaggle compete simply for practice
and the experience. The competitions involve interesting problems and there are plenty of
users who submit their scripts publicly, providing an excellent opportunity for learning for
those just trying to break into the field. There are also active discussion forums full of people
willing to provide advice and assistance to other users.
What is not spelled out on the website, but is assumed knowledge, is that to make accurate
predictions, you will have to use machine learning.
Machine Learning
When it comes to machine learning, there is a lot of general misunderstanding about what this
actually involves. While there are different forms of machine learning, the one that I will focus
on here is known as classification, which is a form of "supervised learning". Classification is the
process of assigning records or instances (think rows in a dataset) to a specific category in a
predetermined set of categories. Think about a problem like predicting which passengers on
the Titanic survived (i.e. there are two categories, "survived" and "did not survive") based on
their age, class and gender [1].
Titanic Classification Problem
Passenger  Age  Class     Gender  Survived?
0001       32   First     Female
0002       12   Second    Male
0003       64   Steerage  Male
0004       23   Steerage  Male
0005       11   Steerage  Male
0006       42   Steerage  Male
0007            Second    Female
0008            Steerage  Female
0009       19   Steerage  Male
0010       55   First     Male
0011       53   First     Female
0012       27   Second    Male
Referring specifically to supervised learning algorithms, the way these predictions are made is
by providing the algorithm with a dataset (typically the larger the better) of "training data". This
training data contains all the information available to make the prediction as well as the
categories each record corresponds to. This data is then used to "train" the algorithm to find the
most accurate way to classify those records for which we do not know the category.
Training Data
Passenger  Age  Class     Gender  Survived?
0013       23   Second    Female
0014       21   Steerage  Female
0015       46   Steerage  Male
0016       32   First     Male
0017       13   First     Female
0018       24   Second    Male
0019       29   First     Male
0020       80   Second    Male
0021            Steerage  Female
0022       44   Steerage  Male
0023       35   Steerage  Female
0024       10   Steerage  Male
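
As a rough illustration of the train-then-predict workflow described above, here is a minimal sketch using pandas and sklearn. The rows and their labels are invented for illustration only (they are not the passengers listed in the tables), the column names are my own, and the decision tree is just one convenient choice of algorithm, not necessarily the one used later in this series.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented toy training data: every value, including the "survived" labels,
# is made up for illustration and does not come from the real Titanic data.
train = pd.DataFrame({
    "age":       [22, 38, 26, 35, 54, 2, 27, 14],
    "pclass":    [3, 1, 3, 1, 1, 3, 3, 2],   # 1 = First, 2 = Second, 3 = Steerage
    "is_female": [0, 1, 1, 1, 0, 0, 1, 1],
    "survived":  [0, 1, 1, 1, 0, 0, 1, 1],   # the known category for each record
})
features = ["age", "pclass", "is_female"]

# "Training": the algorithm sees both the features and the known categories.
model = DecisionTreeClassifier(random_state=0)
model.fit(train[features], train["survived"])

# "Prediction": records for which we do not know the category.
unlabeled = pd.DataFrame({"age": [30, 40], "pclass": [3, 1], "is_female": [0, 1]})
predictions = model.predict(unlabeled[features])
print(predictions)
```

With only eight made-up rows the result means nothing statistically; the point is only the shape of the process: fit on labeled records, then predict for unlabeled ones.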
Although that seems relatively straightforward, part of what makes data science such a
complex field is the limitless number of ways that a predictive model can be built. There are a
huge number of different algorithms that can be trained, mostly with weird-sounding names like
Neural Network, Random Forest and Support Vector Machine (we will look at some of these in
more detail in future installments). These algorithms can also be combined to create a single
model. In fact, the people/teams that end up winning Kaggle competitions often combine the
predictions of a number of different algorithms.
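
One simple way to combine algorithms, sketched below, is majority voting: each model makes its own prediction and the most common answer wins. This is only an illustrative assumption about how a combination might look, using sklearn's VotingClassifier on synthetic data from make_classification. The choice of the three models (including logistic regression, which is not mentioned above) is mine, and winning Kaggle entries typically use more elaborate schemes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, generated purely for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different algorithms, combined into a single model by majority vote.
ensemble = VotingClassifier(
    estimators=[
        ("random_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(random_state=0)),
        ("logistic", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # each model gets one vote per record
)
ensemble.fit(X_train, y_train)

# Fraction of held-out records classified correctly by the combined model.
ensemble_accuracy = ensemble.score(X_test, y_test)
print(ensemble_accuracy)
```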
To make things more complicated, within each algorithm, there is a range of parameters that
can be adjusted to significantly alter the prediction accuracy, and these parameters will vary for
each classification problem. Finding the optimal set of parameters to maximize accuracy is
often an art in itself.
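
To make the idea of adjusting parameters concrete, the sketch below tries a small grid of settings for a Random Forest and keeps the combination with the best cross-validated accuracy. The data is synthetic, and the particular parameters and values in the grid are arbitrary assumptions chosen to keep the example quick; a real search would usually cover more options.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data, generated purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate parameter values to try (an arbitrary, deliberately small grid).
param_grid = {
    "n_estimators": [10, 50],     # number of trees in the forest
    "max_depth": [2, 5, None],    # how deep each tree may grow
}

# Every combination is scored with 3-fold cross-validation on accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_   # the winning combination
best_score = search.best_score_     # its mean cross-validated accuracy
print(best_params, best_score)
```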
Finally, just feeding the training data into an algorithm and hoping for the best is typically a fast
track to poor performance (if it works at all). Significant time is needed to clean the data, correct
formats and add additional "features" to maximize the predictive capability of the algorithm. We
will go into more detail on both of these requirements in future installments.
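
As a small taste of what cleaning and adding features can mean in practice, here is a hedged sketch using four rows copied from the Titanic table earlier in this post (passengers 0001, 0002, 0007 and 0004). The column names, the median fill for missing ages, and the "is_child" feature with its cutoff of 16 are all assumptions made for illustration, not necessarily the steps taken later in this series.

```python
import numpy as np
import pandas as pd

# Four rows from the Titanic table above; passenger 0007 has no recorded age.
raw = pd.DataFrame({
    "passenger":    ["0001", "0002", "0007", "0004"],
    "age":          [32.0, 12.0, np.nan, 23.0],
    "travel_class": ["First", "Second", "Second", "Steerage"],
    "gender":       ["Female", "Male", "Female", "Male"],
})

clean = raw.copy()

# Cleaning: remember which ages were missing, then fill the gaps with the median.
clean["age_missing"] = clean["age"].isna()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Adding a feature: a flag for children (the cutoff of 16 is an arbitrary assumption).
clean["is_child"] = clean["age"] < 16

# Correcting formats: turn text categories into numeric indicator columns,
# since most sklearn algorithms expect numbers rather than strings.
clean = pd.get_dummies(clean, columns=["travel_class", "gender"])
print(clean)
```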
OK, so now let's put all this into context by looking at the competition I entered, provided
by Airbnb. The aim of the competition was to predict the country that users will make their first
booking in, based on some basic user profile data [2]. In this case, the categories were the
different country options and an additional category for users that had not made a previous
booking through Airbnb. The training data was a set of users for whom we were provided with
the correct category (i.e. what country they made their first booking in). Using the training data,
I was required to train the model to accurately predict the country of first booking, and then
submit my predictions for a set of users for whom we did not know the outcome.
How?
The aim of this series is to walk through the process of assessing and analyzing data, cleaning,
transforming and adding new features, constructing and testing a model, and finally creating
final predictions. The primary technology I will be using as I walk through this is Python, in
combination with Excel/Google Sheets to analyze some of the outputs. Why Python? There are
several reasons:
1. It is free and open source.
2. It has a great range of libraries (also free) that provide access to a large number of
machine learning algorithms and other useful tools. The libraries I will primarily use are
numpy, pandas and sklearn.
3. It is very popular, meaning when I get stuck on a problem, there is usually plenty of
material and documentation to be found online for help.
4. It is very fast (primarily the reason I have chosen Python over R).
For those that are interested in following this series but do not have a programming
background, do not panic: although I will show code snippets as we go, being able to read
the code is not vital to understanding what is happening.
Next Time
In the next piece, we will start looking at the data in more detail and discuss how we can clean
and transform it, to help optimize the model performance.