Data Science - A Kaggle Walkthrough - Introduction - 1 PDF

Uploaded by

Teodor von Burg
Copyright
© © All Rights Reserved

6/7/2016

Data Science: A Kaggle Walkthrough - Introduction

I have spent a lot of time working with spreadsheets, databases, and data more generally. This
work has led to me having a very particular set of skills, skills I have acquired over a very long
career. Skills that make me a nightmare for people like you. If you let my daughter go now,
that'll be the end of it. I will not look for you, I will not pursue you. But if you don't, I will look for
you, I will find you, and I will kill you.

The badassery of Liam Neeson aside, although I have spent years working with data in a range
of capacities, the skills and techniques required for data science are a very specific subset that
do not tend to come up in too many jobs. What is more, data science tends to involve a lot
more programming than most other data-related work, and this can be intimidating for people
who are not coming from a computer science background. The problem is, people who work
with data in other contexts (e.g. economics and statistics), as well as those with industry-specific
experience and knowledge, can often bring different and important perspectives to data
science problems. Yet, these people often feel unable to contribute because they do not
understand programming or the black box models being used.

Something that has nothing to do with data science

Therefore, in a probably futile attempt to shed some light on this field, this will be the first part in
a multi-part series looking at what data science involves and some of the techniques most
commonly used. This series is not intended to make everyone experts on data science, rather it
is intended to simply try and remove some of the fear and mystery surrounding the field. In
order to be as practical as possible, this series will be structured as a walkthrough of the
process of entering a Kaggle competition and the steps taken to arrive at the final submission.

http://brettromero.com/wordpress/datascienceakagglewalkthroughintroduction/

What is Kaggle?
For those that do not know, Kaggle is a website that hosts data science problems for an online
community of data science enthusiasts to solve. These problems can be anything from
predicting cancer based on patient data, to sentiment analysis of movie reviews and
handwriting recognition. The only thing they all have in common is that they are problems
requiring the application of data science to be solved.
The problems on Kaggle come from a range of sources. Some are provided just for fun and/or
educational purposes, but many are provided by companies that have genuine problems they
are trying to solve. As an incentive for Kaggle users to compete, prizes are often awarded for
winning these competitions, or finishing in the top x positions. Sometimes the prize is a job or
products from the company, but there can also be substantial monetary prizes. Home Depot, for
example, is currently offering $40,000 for the algorithm that returns the most relevant search
results on homedepot.com.
Despite the large prizes on offer though, many people on Kaggle compete simply for practice
and the experience. The competitions involve interesting problems and there are plenty of
users who submit their scripts publicly, providing an excellent opportunity for learning for
those just trying to break into the field. There are also active discussion forums full of people
willing to provide advice and assistance to other users.
What is not spelled out on the website, but is assumed knowledge, is that to make accurate
predictions, you will have to use machine learning.

Machine Learning
When it comes to machine learning, there is a lot of general misunderstanding about what this
actually involves. While there are different forms of machine learning, the one that I will focus
on here is known as classification, which is a form of supervised learning. Classification is the
process of assigning records or instances (think rows in a dataset) to a specific category in a
pre-determined set of categories. Think about a problem like predicting which passengers on
the Titanic survived (i.e. there are two categories, survived and did not survive) based on
their age, class and gender [1].

Titanic Classification Problem

Passenger  Age  Class     Gender  Survived?
0001       32   First     Female
0002       12   Second    Male
0003       64   Steerage  Male
0004       23   Steerage  Male
0005       11   Steerage  Male
0006       42   Steerage  Male
0007            Second    Female
0008            Steerage  Female
0009       19   Steerage  Male
0010       55   First     Male
0011       53   First     Female
0012       27   Second    Male

Referring specifically to supervised learning algorithms, the way these predictions are made is
by providing the algorithm with a dataset (typically the larger the better) of training data. This
training data contains all the information available to make the prediction as well as the
categories each record corresponds to. This data is then used to train the algorithm to find the
most accurate way to classify those records for which we do not know the category.

Training Data

Passenger  Age  Class     Gender  Survived?
0013       23   Second    Female
0014       21   Steerage  Female
0015       46   Steerage  Male
0016       32   First     Male
0017       13   First     Female
0018       24   Second    Male
0019       29   First     Male
0020       80   Second    Male
0021            Steerage  Female
0022       44   Steerage  Male
0023       35   Steerage  Female
0024       10   Steerage  Male
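To make the idea of training a little more concrete, here is a minimal sketch in Python using pandas and sklearn. It is an illustration only, not the method used later in this series. The ages, classes and genders are taken from the tables above, but the "survived" values are invented purely so the example runs, and the column names and the choice of a decision tree are my own assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training data: the information available for each passenger plus the
# known category. NOTE: the "survived" values are INVENTED for illustration.
train = pd.DataFrame({
    "age":      [23, 21, 46, 32, 13, 24, 29, 80, 44, 35, 10],
    "pclass":   [2, 3, 3, 1, 1, 2, 1, 2, 3, 3, 3],  # 1 = First, 2 = Second, 3 = Steerage
    "female":   [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    "survived": [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],  # hypothetical labels
})

# Records we want to classify: same columns, but the category is unknown.
unknown = pd.DataFrame({
    "age":    [32, 12, 64],
    "pclass": [1, 2, 3],
    "female": [1, 0, 0],
})

features = ["age", "pclass", "female"]

# "Train" the algorithm on the records where the category is known...
model = DecisionTreeClassifier(random_state=0)
model.fit(train[features], train["survived"])

# ...then ask it to assign a category to the records where it is not.
predictions = model.predict(unknown[features])
print(predictions)
```

The important part is the shape of the process rather than the specific algorithm: fit on rows where the answer is known, then predict on rows where it is not.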

Although that seems relatively straightforward, part of what makes data science such a
complex field is the limitless number of ways that a predictive model can be built. There are a
huge number of different algorithms that can be trained, mostly with weird-sounding names like
Neural Network, Random Forest and Support Vector Machine (we will look at some of these in
more detail in future installments). These algorithms can also be combined to create a single
model. In fact, the people/teams that end up winning Kaggle competitions often combine the
predictions of a number of different algorithms.
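As a rough sketch of what combining algorithms can look like, the snippet below uses sklearn's VotingClassifier to let three models vote on each prediction. This is one simple way of combining models, not necessarily what winning teams do. The data is synthetic (generated by make_classification), and the choice of models, including the logistic regression, is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-category dataset, used only so the example is runnable.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different algorithms; each one casts a vote for every record
# and the majority category wins ("hard" voting).
ensemble = VotingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

# Share of held-out records the combined model classifies correctly.
ensemble_accuracy = ensemble.score(X_test, y_test)
print(f"ensemble accuracy: {ensemble_accuracy:.2f}")
```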
To make things more complicated, within each algorithm, there is a range of parameters that
can be adjusted to significantly alter the prediction accuracy, and these parameters will vary for
each classification problem. Finding the optimal set of parameters to maximize accuracy is
often an art in itself.
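One common (if brute force) way to search for good parameters is to try every combination from a small grid and keep the one that scores best under cross-validation. The sketch below does this with sklearn's GridSearchCV. The data is synthetic and the parameter values in the grid are arbitrary assumptions, chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic two-category dataset, used only so the example is runnable.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values for two Random Forest parameters (2 x 3 = 6 combinations).
param_grid = {
    "n_estimators": [10, 50],    # number of trees in the forest
    "max_depth": [2, 5, None],   # how deep each tree may grow
}

# Each combination is scored with 3-fold cross-validation on accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The best combination on one dataset will usually not be the best on another, which is why this search has to be repeated for each problem.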
Finally, just feeding the training data into an algorithm and hoping for the best is typically a fast
track to poor performance (if it works at all). Significant time is needed to clean the data, correct
formats and add additional features to maximize the predictive capability of the algorithm. We
will go into more detail on both of these requirements in future installments.
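As a small taste of what cleaning and adding features can mean in practice, here is a sketch using pandas on four rows from the Titanic table above, two of which have no age. The specific choices (filling gaps with the median, the "is_child" cutoff of 16, the new column names) are assumptions for illustration, not recommendations.

```python
import pandas as pd

# Four passengers from the example table; two have a missing age.
raw = pd.DataFrame({
    "passenger": ["0007", "0008", "0009", "0010"],
    "age":       [float("nan"), float("nan"), 19, 55],
    "class":     ["Second", "Steerage", "Steerage", "First"],
    "gender":    ["Female", "Female", "Male", "Male"],
})

clean = raw.copy()

# Clean: remember which ages were missing, then fill the gaps with the median.
clean["age_missing"] = clean["age"].isna().astype(int)
clean["age"] = clean["age"].fillna(clean["age"].median())

# Correct formats: most algorithms need numbers, not text.
clean["female"] = (clean["gender"] == "Female").astype(int)
clean = pd.get_dummies(clean.drop(columns=["gender"]), columns=["class"])

# Add a feature: a flag that may be more informative than raw age alone.
clean["is_child"] = (clean["age"] < 16).astype(int)

print(clean)
```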
OK, so now let's put all this into context by looking at the competition I entered, provided
by Airbnb. The aim of the competition was to predict the country that users will make their first
booking in, based on some basic user profile data [2]. In this case, the categories were the
different country options and an additional category for users that had not made a previous
booking through Airbnb. The training data was a set of users for whom we were provided with
the correct category (i.e. what country they made their first booking in). Using the training data,
I was required to train the model to accurately predict the country of first booking, and then
submit my predictions for a set of users for whom we did not know the outcome.
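The overall shape of that task looks roughly like the sketch below. Everything in it is hypothetical: the users, the two feature columns, the country labels and the layout of the submission table are all made up for illustration and are not the real competition data or format.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training users: profile data plus the known first booking.
# "none" stands in for the extra category of users who did not book.
train_users = pd.DataFrame({
    "age":              [25, 41, 33, 52, 29, 38, 46, 23],
    "signup_on_mobile": [1, 0, 0, 1, 1, 0, 0, 1],
    "first_booking":    ["US", "FR", "none", "US", "none", "FR", "US", "none"],
})

# Hypothetical users for whom the outcome is not known.
test_users = pd.DataFrame({
    "user_id":          ["u1", "u2"],
    "age":              [30, 50],
    "signup_on_mobile": [1, 0],
})

features = ["age", "signup_on_mobile"]

# Train on the users with known outcomes (more than two categories is fine).
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_users[features], train_users["first_booking"])

# Predict a category for each unknown user and collect them for submission.
submission = pd.DataFrame({
    "user_id": test_users["user_id"],
    "country": model.predict(test_users[features]),
})
print(submission)
```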

How?
The aim of this series is to walk through the process of assessing and analyzing data, cleaning,
transforming and adding new features, constructing and testing a model, and finally creating
final predictions. The primary technology I will be using as I walk through this is Python, in
combination with Excel/Google Sheets to analyze some of the outputs. Why Python? There are
several reasons:
1. It is free and open source.
2. It has a great range of libraries (also free) that provide access to a large number of
machine learning algorithms and other useful tools. The libraries I will primarily use are
numpy, pandas and sklearn.
3. It is very popular, meaning when I get stuck on a problem, there is usually plenty of
material and documentation to be found online for help.
4. It is very fast (primarily the reason I have chosen Python over R).
For those that are interested in following this series but do not have a programming
background, do not panic. Although I will show code snippets as we go, being able to read
the code is not vital to understanding what is happening.

Next Time
In the next piece, we will start looking at the data in more detail and discuss how we can clean
and transform it, to help optimize the model performance.