Math1005 Notes
Controlled Experiment:
● We want to study whether the treatment causes the response
● The response that we want may be caused by other factors/variables
● Hence, optimally, we conduct 2 parallel experiments which only differ in whether the treatment is administered or not
● This is called a controlled experiment, i.e. we control the effects of the other variables on the response
Confounding
● Confounding occurs when the effect of one variable (X) on another variable (Y) is clouded by the influence of another variable (Z)
● A diagram of these relationships between X, Y and Z is called a “causal graph”
Bias
● Means that the quantity of interest is systematically under or over estimated
● Bias is often caused by a confounding variable, but it can also have other causes and sometimes it can even be desired
● In this module we will only consider bias due to confounding variables. This is bias we want to avoid
Types of Bias:
● Selection Bias: If the treatment group is not comparable to the control group, then the differences between the two groups can confound the effect of the treatment
● Observer Bias:
○ If the subjects or investigators are aware of the identity of the two groups, we can get bias in either the responses or evaluations, as they may deliberately or subconsciously report more or less favourable results
○ In fact, the subject may even respond to the idea of the treatment. This is called the placebo effect
● The placebo is a pretend treatment. It is designed to be neutral and indistinguishable from the treatment
● The placebo effect is an effect which occurs from the subject thinking they have had the treatment
● Consent Bias: Can occur when subjects choose whether or not they take part in the experiment
● This quickly raises many ethical questions
● How can we avoid consent bias?
● Who determines who is part of each group?
● It may be unethical to withhold treatment for those in the control group or enforce treatment for those in the treatment group
Solution for Selection and Observer Bias:
● We need to conduct a Randomised Controlled Double-Blind Trial where both the subjects (“single blind”) and investigators (“double blind”) are not aware of the identity of the groups
● In addition, this controls for the patient's expectations (i.e. their response) and the investigator's observations (evaluation of response)
● To do so we usually:
● Have a 3rd party administrator of the treatment and placebo
● Design the placebo to mimic the treatment as much as possible
Summary
● The design of a statistical study is critical in order to obtain results that can be generalised. The best method for comparison is a controlled randomised double-blind trial, but this is often not possible
Lecture 2
The Need for Observational Studies
● In observational studies, the assignment of subjects into treatment and control groups is outside the control of the investigator
● Many research questions require an observational study, rather than a controlled experiment
● The conclusions of observational studies require great care
● An observational study is one in which the investigator has no control over the subjects or qualities of interest; she is just an observer. In particular, the investigator cannot use randomisation for allocation into groups
Precautions
● It is very difficult to establish causation
○ It is rather easy to establish association (that one thing is linked to another)
■ Association may suggest causation
■ But association does not prove causation
○ Observational studies can have misleading hidden confounders
■ Confounders can be hard to find, and can mislead about a cause and effect relationship
● Observational studies with a confounding variable can lead to Simpson's Paradox
○ Simpson's Paradox (or the reversing paradox) was first mentioned by British statistician Udny Yule in 1903. It was named after Edward H. Simpson
○ Sometimes there is a clear trend in individual groups of data that reverses when groups are pooled together
■ It occurs when relationships between percentages in subgroups are reversed when the subgroups are combined, because of a confounding or lurking variable
■ The association between a pair of variables (X, Y) reverses sign upon conditioning on a third variable Z, regardless of the value taken by Z
● Historical control
○ Some studies present themselves as a controlled experiment, but on further examination, there is a historical control and time is a confounding variable. (Note: This is partly observational and partly an experiment)
○ Investigators might compare the effect of a new medication on current patients with an old medication on past patients. The Treatment Group (new drug) and the historical Control Group (old drug) may differ in aspects besides the treatment
○ Controlled experiments need to be performed in the same time period (contemporaneously)
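The reversal described under Simpson's Paradox above can be checked with a few lines of arithmetic. The sketch below uses illustrative counts that are not from these notes; the treatment names "A" and "B" and the confounder "severity" are assumptions made only for the example.

```python
# Illustrative counts (not from these notes): (successes, patients)
# for two treatments, split by a confounder Z = severity of illness.
data = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    """Success percentage within one group."""
    return 100 * successes / total

# Success rate of each treatment within each subgroup
within = {
    z: {t: rate(*counts) for t, counts in groups.items()}
    for z, groups in data.items()
}

# Success rate of each treatment after pooling the subgroups
pooled = {}
for t in ("A", "B"):
    successes = sum(data[z][t][0] for z in data)
    patients = sum(data[z][t][1] for z in data)
    pooled[t] = rate(successes, patients)

a_better_in_each_group = all(within[z]["A"] > within[z]["B"] for z in data)
b_better_when_pooled = pooled["B"] > pooled["A"]

print(within)
print(pooled)
print(a_better_in_each_group, b_better_when_pooled)
```

In this example treatment A is mostly given to severe cases, so severity is mixed up with the treatment and the pooled comparison points the opposite way to both subgroup comparisons.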
Uses for the word ‘Control’
● A control = a subject who did not get the treatment
● A controlled experiment = a study/experiment where the investigators allocate subjects into different groups
● Controlling for confounders = trying to reduce the influence of confounding variables
Summary
● Many statistical studies involve observational data, and so we need to be very careful with interpretation errors such as confusing association for causation, misleading confounders, Simpson's Paradox and historical controls
Lecture 3
What is data?
● Data is information about the set of subjects being studied (like road fatalities)
○ Most commonly, data refers to the sample, not the population
Different types of data
● There are different types of data, in different formats, for example:
○ Survey data
○ Spreadsheet type data
○ MRI image data
Big data
● Big data refers to the massive amounts of data being collected
● Big data is commonly high dimensional, which means that there are more variables p than subjects n
○ For example, genomics data can have 3 billion variables, as a person's DNA sequence is 3 billion base pairs long
○ Measurements every millisecond
○ Image data or video data
● Big data requires more complex visualisations
Initial Data Analysis (IDA)
● Initial Data Analysis is a first general look at the data, without formally answering the research questions
○ IDA helps you to see whether the data can answer your research questions
○ IDA may pose other research questions
○ IDA can
■ Identify the data's main qualities;
■ Suggest the populations from which a sample derives
What's involved in IDA?
● Initial Data Analysis commonly involves:
○ Data background: checking the quality and integrity of the data
○ Data structure: what information has been collected?
○ Data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining
○ Data summaries: graphical and numerical
● Here we focus on structure & graphical summaries for qualitative and quantitative data
Variables
● A variable measures or describes some attribute of the subjects
○ Data with p (explanatory) variables is said to have dimension p
● Number of variables
○ Univariate (1 [explanatory] variable)
○ Bivariate (2 [explanatory] variables)
○ Multivariate (above 2 [explanatory] variables)
● Types of variables
○ Qualitative or Categorical (Categories) R: Factor
■ Ordinal (Ordered)
● Binary (2 categories)
● 3+ categories
■ Nominal (Non-ordered)
● Binary (2 categories)
● 3+ categories
○ Quantitative or Numerical (Measurements) R: Numeric
■ Discrete (Separated) R: Integer (int)
■ Continuous (Continuum) R: Double
Choosing a graphical summary
● The aim of a graphical summary is to best highlight features of the data
○ To some extent we use trial and error
○ While the pie chart may be popular, it is usually not informative
Summary
● The type of variables determines what type of graphical summary is most appropriate
Lecture 4
Overview of histogram
● We use a histogram for quantitative data
● A histogram highlights the percentage of data in one class interval compared to another
○ It consists of a set of blocks which represent the percentages by area
○ The area of the histogram is 100%
○ The horizontal scale is divided into class intervals
○ The area of each block represents the percentage of subjects in that particular class interval
● Density Scale:
○ Height of each block = $\frac{\%\ \text{in the block}}{\text{length of the class interval}}$
○ Height of each block = average percentage per horizontal unit
● For continuous data, we need an endpoint convention for data points that fall on the border of two class intervals
○ If an interval contains the left endpoint but excludes the right endpoint, then an 18 year old would be counted in [18, 25) not [0, 18)
○ We call this left-closed and right-open
● Number of class intervals
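The density scale and the left-closed, right-open convention above can be sketched in a few lines. The class intervals reuse the age groups mentioned in these notes, but the percentages are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Class intervals [left, right); the percentages are hypothetical.
intervals = [(0, 18), (18, 25), (25, 70), (70, 105)]
percentages = [10.0, 20.0, 50.0, 20.0]

def class_index(x, intervals):
    """Index of the class interval containing x (left-closed, right-open)."""
    for i, (left, right) in enumerate(intervals):
        if left <= x < right:
            return i
    raise ValueError(f"{x} is outside every class interval")

# Density scale: height = % in the block / length of the class interval
heights = [
    pct / (right - left)
    for (left, right), pct in zip(intervals, percentages)
]

# Area of each block = height x width, so the areas add back up to 100%
total_area = sum(
    h * (right - left) for h, (left, right) in zip(heights, intervals)
)

print(class_index(18, intervals))      # an 18 year old falls in [18, 25)
print([round(h, 3) for h in heights])
print(total_area)
```

Note that the [25, 70) block holds the largest percentage but is not the tallest, because its percentage is spread over a much wider interval.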
Common Mistakes with Histograms
● Setting the block heights equal to the percentage or total numbers
○ Here we wrongly use the total numbers (or percentage) as the heights
○ Unless the class intervals are the same size, in both cases this will make larger class intervals look like a larger overall %
○ Solution: Use density as the height, especially if class intervals are not the same size. Don't use percentage or total numbers
● Using too many or too few class intervals
○ This can hide the true pattern in the data. As a rule of thumb, use between 10-15 class intervals
Strategy
● Only count those deaths where the person is driving
● Find the number of registered driving licences with age information
● Combine the information and derive a death rate per driving licence for different age groups
● Conclusion: Death rate per licence is approximately the same for age groups [18, 25) and [70, 105). Both rates are approximately three times higher than the death rate for age group [25, 70)
Simple boxplot
● The boxplot plots the median (‘middle’ data point), the middle 50% of the data in a box, the maximum and minimum, and determines any outliers
● We will consider how to draw the boxplot when we learn about the interquartile range (IQR) in a later lecture
Comparative boxplots
● A comparative boxplot splits up a quantitative variable by a qualitative variable
Heatmap
● A heatmap might be a good choice here. A heatmap is especially useful when a contingency table is not practical due to too many different values
Summary
● The histogram is a graphical summary for quantitative data which shows the percentage of subjects per class interval. The boxplot shows the middle 50% of the data and its spread. The scatterplot shows the relationship between two variables. A heatmap is a ‘contingency table’ for numerical/continuous data
Lecture 5
Advantages of numerical summaries
● A numerical summary reduces all the data to one simple number (“statistic”)
○ This loses a lot of information
○ However it allows easy communication and comparisons
● Major features that we can summarise numerically are:
○ Maximum
○ Minimum
○ Centre [sample mean, median]
○ Spread [standard deviation, range, IQR]
Which might be useful for talking about Newtown house prices?
● It depends
● Reporting the centre without the spread can be misleading
Useful notation for data (Ext)
● In this course, we intentionally focus on statistical concepts in words. This is vital for collaborating with people from different fields. The mathematics is introduced in 2nd year. However, here some simple mathematical notation is helpful.
● Observations of a single variable of size n can be represented by:
○ $x_1, x_2, \ldots, x_n$
● The ranked observations (ordered from smallest to largest) are:
○ $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$
● The sum of the observations is:
○ $\sum_{i=1}^{n} x_i$
Sample Mean
● The sample mean is the average of the data:
○ Sample Mean = $\frac{\text{Sum of data}}{\text{Size of data}}$, or
○ $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Sample mean as a balancing point
● The sample mean is the unique point at which the data is balanced, i.e. the gaps of the higher readings and the lower readings all cancel each other out. For example: When the mean is 1407.143 (thousands)
○ 19 Watkin St sold for $1950 (thousands)
■ This gives a gap of (1950 - 1407.143)
■ This is $542.857 (thousands) above the sample mean price
○ 30 Pearl St sold for $1250 (thousands)
■ This gives a gap of (1250 - 1407.143)
■ This is $157.143 (thousands) below the sample mean price
Sample Median
● The sample median $\tilde{x}$ is the middle data point, when the observations are ordered from smallest to largest
○ For an odd sized number of observations:
■ Sample Median = the unique middle point = $x_{\left(\frac{n+1}{2}\right)}$
○ For an even sized number of observations:
■ Sample Median = average of the two middle points = $\frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$
Statistical Thinking
● If you had to choose between reporting the sample mean or sample median for Newtown properties, which would you choose and why?
■ For the full property portfolio, the sample mean and the sample median are fairly similar
■ For the 4 bedroom houses, the sample mean is higher than the sample median because it is being “pulled up” by some very expensive houses
○ For the average buyer, the sample median would be more useful as an indication of the sort of price needed to get into the market
○ For any agent selling houses in the area, the sample mean might be more useful in order to predict their average commissions
○ In practice, we can report both
Robustness
● The sample median is said to be robust and is a good summary for skewed data as it is not affected by outliers
● Suppose there was a data entry mistake, and the lowest property recorded as 370 was in fact the highest sold at 3700. How would the sample mean change? How would the sample median change?
○ The sample mean would be higher, as we have replaced the smallest reading by the new maximum
○ The median would shift up, from the average of $x_{(28)}$ and $x_{(29)}$ to the average of $x_{(29)}$ and $x_{(30)}$
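The data entry scenario above can be tried out on a small list. The seven prices below are hypothetical (in thousands) and are not the Newtown data from the lecture; only the 370 to 3700 correction mirrors the notes.

```python
from statistics import mean, median

# Hypothetical sale prices in thousands (not the Newtown data)
prices = [370, 900, 1200, 1400, 1500, 1900, 2400]

# The data entry mistake from the notes: 370 should have been 3700
corrected = [3700 if p == 370 else p for p in prices]

mean_shift = mean(corrected) - mean(prices)
median_shift = median(corrected) - median(prices)

print(mean(prices), median(prices))
print(mean(corrected), median(corrected))
print(mean_shift, median_shift)
```

Here the median only moves to the next ordered value, while the mean moves by the full size of the correction divided by the number of observations, which is why the median is described as robust.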
Comparing the sample mean and the median
● The difference between the sample mean and the median can be an indication of the shape of the data
○ For symmetric data, we expect the sample mean and sample median to be the same: $\bar{x} = \tilde{x}$
○ For left skewed data, we expect the sample mean to be smaller than the sample median: $\bar{x} < \tilde{x}$
○ For right skewed data, we expect the sample mean to be larger than the sample median: $\bar{x} > \tilde{x}$
Which is optimal for describing the centre?
● Both have strengths and weaknesses depending on the nature of the data
● Sometimes neither gives a sensible sense of location, for example if the data is bimodal
● As the sample median is robust, it is preferable for data which is skewed or has many outliers, like Sydney house prices
● The sample mean is helpful for data which is basically symmetric, with not too many outliers, and for theoretical analysis
Limitations of both?
● Both the sample mean and sample median allow very easy comparisons, and are easily understandable
● However, they need to be paired with a measure of spread
● Note that two data sets can have the same sample mean while the data are very different
Summary
● Both the sample mean and sample median summarise the centre of the data. The sample median is robust, making it a better choice for skewed data or where there are outliers. Both need to be paired with a measure of spread.
Lecture 6
1st attempt: The mean gap
● Mean gap = sample mean(data - sample mean(data))
● Note: It will always be 0
○ From the definition, the mean gap must be 0, as the mean is the balancing point of the gaps
○ Or for those who like algebra, the mean gap is
$\frac{\sum_{i=1}^{n}(x_i - \bar{x})}{n} = \frac{\sum_{i=1}^{n} x_i}{n} - \frac{n\bar{x}}{n} = \bar{x} - \bar{x} = 0$
Better option: Standard deviation
● First, define the Root Mean Square (RMS)
○ The RMS measures the average size of a set of numbers, regardless of the signs
○ The steps are: Square the numbers, then Mean the results, then Root the result
■ RMS(numbers) = $\sqrt{\text{sample mean}(\text{numbers}^2)}$
○ So effectively, the Square and the Root operations “reverse” each other
● Applying RMS to the gaps: RMS of gaps = $\sqrt{\text{sample mean}(\text{gaps}^2)} = \sqrt{\frac{\sum_{i=1}^{n} (\text{gap}_i)^2}{n}}$
● To avoid the cancellation of the gaps, another possible method is to consider the average of the absolute values of the gaps: $\frac{\sum_{i=1}^{n} |\text{gap}_i|}{n}$. However, this is harder algebraically
Standard deviation in terms of RMS
● The standard deviation measures the spread of the data
○ $SD_{pop}$ = RMS of (gaps from the mean) = $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
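The three ideas above (the mean gap is always 0, the RMS of the gaps gives the population SD, and the mean absolute gap is an alternative) can be checked numerically. This is a minimal sketch; the data values and the function names are chosen for illustration only.

```python
from math import sqrt

def sample_mean(xs):
    return sum(xs) / len(xs)

def rms(xs):
    """Root of the Mean of the Squares."""
    return sqrt(sample_mean([x * x for x in xs]))

def sd_pop(xs):
    """Population SD = RMS of the gaps from the mean."""
    m = sample_mean(xs)
    return rms([x - m for x in xs])

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, mean 5
gaps = [x - sample_mean(data) for x in data]

mean_gap = sample_mean(gaps)                        # always 0
mean_abs_gap = sample_mean([abs(g) for g in gaps])  # the alternative measure
sd = sd_pop(data)

print(mean_gap, mean_abs_gap, sd)
```

The positive and negative gaps cancel exactly, which is why the gaps are squared (or made absolute) before averaging.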
How to tell the difference when the data is a population or a sample?
● It can be tricky to work out whether your data is a population or sample
● Look at the information about the data story and the research questions
Standard Units (“Z score”)
● Standard units of a data point = how many standard deviations it is below or above the mean
○ Standard units = $\frac{\text{data point} - \text{mean}}{SD}$
● This means data point = mean + SD × standard units
IQR
● IQR = Range of the middle 50% of the data
○ More formally IQR = $Q_3 - Q_1$, where
■ $Q_1$ is the 25% percentile (1st quartile) and $Q_3$ is the 75% percentile (3rd quartile)
■ The median is the 50% percentile or 2nd quartile: $\tilde{x} = Q_2$
Quantile, quartile, percentile
● The set of q-quantiles divides the data into q equal sets (in terms of percentage of data)
● The percentiles are the 100-quantiles
● The set of quartiles divides the data into 4 quarters
● So the range of the middle 50% of properties sold is almost a million dollars
IQR on the boxplot
● The IQR is the length of the box in the boxplot. It represents the span of the middle 50% of the houses sold
● The lower and upper thresholds are a distance of 1.5 IQR from the quartiles (by convention)
○ LT = $Q_1$ - 1.5 IQR
○ UT = $Q_3$ + 1.5 IQR
● Data outside these thresholds is considered an outlier (“extreme reading”)
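The quartiles, IQR and outlier thresholds above can be computed as follows. This is a sketch on made-up data; it assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, which is only one of several quartile conventions, so other software may report slightly different quartiles.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]   # made-up data with one extreme reading

# Quartiles: the three cut points that divide the data into 4 quarters
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_threshold or x > upper_threshold]

print(q1, q2, q3)
print(iqr, lower_threshold, upper_threshold)
print(outliers)
```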
Coefficient of Variation
● The Coefficient of Variation (CV) combines the mean and standard deviation into one summary: CV = $\frac{SD}{\text{mean}}$
● The CV is used in:
○ Analytical chemistry to express the precision and repeatability of an assay
○ Engineering and physics for quality assurance studies
○ Economics for determining the volatility of a security
Lecture 7
Normal Curve: Origins
● The normal curve was discovered around 1720 by Abraham de Moivre, also famous for the beautiful de Moivre's formula
Why is the Normal curve famous?
● The Normal curve approximates many natural phenomena
● The Normal curve can model data caused by combining a large number of independent observations.
General & Standard Normal curves
● The Standard Normal Curve (Z) has mean = 0 and SD = 1. Short: N(0, 1)
● The General Normal Curve (X) has any mean and SD. Caution: It is denoted by N(mean, $SD^2$)
The Normal curve formula
● It turns out the Normal curve has a simple formula, although you won't need to use it directly
● The formula for the General Normal Curve is $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $x \in (-\infty, \infty)$, where μ and σ are the (population) mean and SD respectively
Finding the area under the Standard Normal curve
● Method 1: Integration
○ Mathematically, we could use integration: area = $\int_{-\infty}^{0.7} \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}} \, dy$
○ But this does not have a closed form
● Method 2: Normal Tables
● Method 3: Use R
○ The pnorm(x) command works out the lower tail area
○ The pnorm(x, lower.tail = F) command works out the upper tail area
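For readers without R to hand, the same lower and upper tail areas can be sketched with Python's standard library. `statistics.NormalDist` is used here as a stand-in for the R commands named above; the cut-off 0.7 is the one from the integral in Method 1.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # the Standard Normal curve N(0, 1)

lower_tail = Z.cdf(0.7)         # area to the left of 0.7
upper_tail = 1 - Z.cdf(0.7)     # area to the right of 0.7

print(round(lower_tail, 4), round(upper_tail, 4))
```

The two tails always add to 1, so either one can be found from the other.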
Properties of the Normal curve
● All Normal curves satisfy the “68%-95%-99.7% Rule”
○ The area 1 SD out from the mean in both directions is 0.68 (68%)
○ The area 2 SDs out from the mean in both directions is 0.95 (95%)
○ The area 3 SDs out from the mean in both directions is 0.997 (99.7%)
● Any General Normal can be rescaled into the Standard Normal
○ For any point on a Normal curve, the standard units (or z score) is how many standard deviations that point is above (+) or below (-) the mean
○ standard units = $\frac{\text{data point} - \text{sample mean}}{\text{sample SD}}$
● The Normal curve is symmetric
○ If X follows a normal curve with mean 0, then
■ P(X < -0.5) = P(X > 0.5)
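The three properties above can be checked numerically. In this sketch the General Normal curve with mean 170 and SD 10, and the data point 185, are hypothetical values picked only for illustration.

```python
from statistics import NormalDist

def area_within(k, curve):
    """Area under the curve within k SDs of the mean, in both directions."""
    return (curve.cdf(curve.mean + k * curve.stdev)
            - curve.cdf(curve.mean - k * curve.stdev))

X = NormalDist(mu=170, sigma=10)   # hypothetical General Normal curve
Z = NormalDist()                   # Standard Normal curve

# 68%-95%-99.7% rule
areas = [area_within(k, X) for k in (1, 2, 3)]

# Rescaling: the area below a point on X equals the area below
# its standard units on Z
point = 185
z_score = (point - X.mean) / X.stdev
area_general = X.cdf(point)
area_standard = Z.cdf(z_score)

# Symmetry about a mean of 0
left = Z.cdf(-0.5)
right = 1 - Z.cdf(0.5)

print([round(a, 4) for a in areas])
print(z_score, round(area_general, 4), round(area_standard, 4))
print(round(left, 4), round(right, 4))
```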
Summary
● The Normal curve naturally describes many histograms, and so can be used in modelling data. It has many useful properties, including the 68/95/99.7% rule. Any General Normal can be rescaled into a Standard Normal
Lecture 8
Reproducible Research
● Increasingly, journals are requiring reproducible research, which requires “data set and software to be made available for verifying published findings and conducting alternative analyses”.
○ A study by Begley and Ellis (2012) found that 47 out of 53 medical research papers focused on cancer research were irreproducible
○ A follow-up study by Begley (2013) identified “6 flags for suspect work”: studies were not performed by investigators blinded to the experimental versus the control arms, there was a failure to repeat experiments, a lack of positive and negative controls, failure to show all data, inappropriate use of statistical tests, and use of reagents that were not appropriately validated
What can go wrong?
● Without reproducible research:
○ Data versions can change (e.g. people edit an Excel file without documenting what has changed and why);
○ Graphical summaries can change (e.g. people can photoshop images without keeping a record of what changed and why)
● Reproducible research is about being responsible with possible human errors, or worse, detecting intentionally changed results
Lecture 9
Bivariate Data
● Bivariate data involves a pair of variables. We are interested in the relationship between the two variables. Can one variable be used to predict the other?
○ Formally, we have $(x_i, y_i)$ for $i = 1, 2, \ldots, n$
○ X is called the independent variable (or explanatory variable, predictor or regressor)
○ Y is called the dependent variable (or response variable).
Scatter Plot
● A scatter plot is a graphical summary of two quantitative variables on the same 2D plane, resulting in a cloud of points
How can we summarise a scatter plot?
● The scatter plot can be summarised by the following five numerical summaries
○ Sample mean and sample SD of X: $(\bar{x}, SD_x)$
○ Sample mean and sample SD of Y: $(\bar{y}, SD_y)$
○ Correlation coefficient (r)
The Correlation coefficient
● The correlation coefficient r is a numerical summary that measures the clustering around the line
● It indicates both the sign and strength of the linear association
● The correlation coefficient is between -1 and 1
○ If r is positive: the cloud slopes up
○ If r is negative: the cloud slopes down
○ As r gets closer to ±1: the points cluster more tightly around the line
Why does r measure association?
● It divides the scatter plot into 4 quadrants, at the point of averages (centre)
○ A majority of points in the upper right (+) and lower left (+) quadrants will be indicated by a positive r
○ A majority of points in the upper left (-) and lower right (-) quadrants will be indicated by a negative r
Symmetry
● The correlation coefficient is not affected by interchanging the variables
Scaling
● The correlation coefficient is shift and scale invariant
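The symmetry and scaling properties above follow from computing r as the mean of the products of the two variables in standard units. The sketch below uses made-up data and population SDs; the function names are illustrative and this is not how R's cor() is implemented.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sd_pop(xs):
    m = mean(xs)
    return sqrt(mean([(x - m) ** 2 for x in xs]))

def standard_units(xs):
    m, s = mean(xs), sd_pop(xs)
    return [(x - m) / s for x in xs]

def correlation(xs, ys):
    """r = mean of the products of the variables in standard units."""
    zx, zy = standard_units(xs), standard_units(ys)
    return mean([a * b for a, b in zip(zx, zy)])

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)                       # symmetry
r_rescaled = correlation([10 * a + 3 for a in x],   # shift and (positive) scale
                         [0.5 * b - 7 for b in y])

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```

Shifting and positive rescaling leave the standard units unchanged, which is why r does not move. Multiplying one variable by a negative number would flip the sign of r.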
Warning
1. The correlation coefficient is unitless
○ Mistake: thinking that r = 0.8 means that 80% of the points are tightly clustered around the line, or that the data is twice as clustered as with r = 0.4
2. Outliers can overly influence the correlation coefficient
3. Non-linear association can't be detected by the correlation coefficient
4. The same correlation coefficient can arise from very different data
5. Rates or averages tend to inflate the correlation coefficient
○ An ecological correlation (or spatial correlation) is the correlation between two variables that are group means or rates
○ For example, if we recorded the heights of fathers and sons in many communities and then calculated the average for each community, the correlation between those community averages would be an ecological correlation
○ Ecological correlations tend to overestimate the strength of association between the two variables
6. Association is not causation
○ Correlation measures association
○ But as discussed, association does not necessarily mean causation
○ Both variables may be simultaneously influenced by a 3rd variable (confounder)
Summary
● The scatter plot is a cloud of points which represents bivariate quantitative data (a pair of variables). Useful summaries are the point of averages (the two sample means), the two sample SDs of the variables and one correlation coefficient. The correlation coefficient is the mean of the product of the variables in standard units and can be found using cor() in R
Lecture 10
Regression Line
1. SD Line (Not great)
○ The SD line might look like a good candidate as it connects the point of averages $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} + SD_y)$ (for this data with positive correlation)
○ However, it does not use the correlation coefficient, so it is insensitive to the amount of clustering around the line
○ Note how it underestimates (LHS) and overestimates (RHS) at the extremes
2. Regression Line
○ To describe the scatter plot, we need to use all five summaries: $\bar{x}, \bar{y}, SD_x, SD_y, r$
○ The Regression line connects $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} + r\,SD_y)$
Summary Regression Line
● We can derive the (least-squares) regression line using calculus, by minimizing the squared residuals (extension)
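The two lines above differ only in their slope: the SD line rises $SD_y$ for every $SD_x$, while the regression line rises $r\,SD_y$. The sketch below builds both from the five summaries, using made-up data and population SDs; all names are illustrative.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sd_pop(xs):
    m = mean(xs)
    return sqrt(mean([(x - m) ** 2 for x in xs]))

def correlation(xs, ys):
    mx, my, sx, sy = mean(xs), mean(ys), sd_pop(xs), sd_pop(ys)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)])

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
sd_x, sd_y = sd_pop(x), sd_pop(y)
r = correlation(x, y)

# Both lines pass through the point of averages (x_bar, y_bar)
sd_line_slope = sd_y / sd_x          # SD line (positive correlation)
slope = r * sd_y / sd_x              # regression line
intercept = y_bar - slope * x_bar

def predict(x0):
    return intercept + slope * x0

# Gaps between the actual values and the regression predictions
residuals = [b - predict(a) for a, b in zip(x, y)]

print(round(slope, 4), round(intercept, 4), round(sd_line_slope, 4))
print([round(e, 4) for e in residuals])
```

Because |r| is below 1 here, the regression line is flatter than the SD line, and the residuals about it average to zero.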
Predictions
1. Baseline prediction
○ If you don't use x as an information source at all, a basic prediction of y would be the average of y over all the x values in the data
○ So for any CE reading, we could predict the NW air quality to be 56.13
2. Prediction in a strip
○ Given a certain value $x_0$, a more careful prediction of y would be the average of all the y in the data corresponding to a neighbourhood of x values around $x_0$.
3. The Regression line
○ The best prediction is based on the Regression line
○ For AQI, we have y = 19.8874 + 0.7138x
Residuals
● A residual is the vertical distance (or ‘gap’) of a point above or below the regression line
● A residual represents the error between the actual value and the prediction
● More formally, a residual is $e_i = y_i - \hat{y}_i$, given the actual value ($y_i$) and the prediction ($\hat{y}_i$)
Residual plot
● A residual plot graphs the residuals vs x
● If the linear fit is appropriate for the data, it should show no pattern (random about y = 0)
● The residual plot is a diagnostic plot to check the appropriateness of a linear model
Vertical Strips
● If the vertical strips on the scatter plot show equal spread in the y direction, then the data is homoscedastic
○ The regression line could be used for predictions
● If the vertical strips don't show equal spread in the y direction, then the data is heteroscedastic
○ The regression line should not be used for predictions
Common mistakes when predicting
1. Extrapolating
○ If we make a prediction from an x value that is not within the range of the data, then that prediction can be completely unreliable
2. Not checking the scatter plot
○ We can have a high correlation coefficient and then fit a regression line, but the data may not even be linear
○ So always check the scatter plot
3. Not checking the residual plot
○ You should also check the residual plot
○ This detects any pattern that has not been captured by fitting a linear model
○ If the linear model is appropriate, the residual plot should be a random scatter of points (about the horizontal line y = 0)
Summary
● For prediction, the regression line is better than the SD line as it uses all five numerical summaries for the scatter plot
● For the Regression line, the residuals are the gaps between the actual value and the prediction
● The residual plot is a diagnostic for seeing whether a linear model is appropriate: if it is random, then a linear model seems appropriate
● If the vertical strips on the scatter plot show equal spread in the y-direction, then the data is homoscedastic, otherwise, the data is heteroscedastic
Lecture 11
Probability
● The frequentist definition of probability (or chance) is the percentage of time a certain event is expected to happen if the same process is repeated long-term (infinitely often)
● This differs from the Bayesian definition of probability, which relates to the degree of belief that an event will occur (extension)
Basic properties of Probability
1. Probabilities are between 0% (impossible) and 100% (certain)
○ P(Impossible event) = 0
○ P(Certain event) = 1
2. The probability of something equals 100% minus the probability of its opposite (complement)
○ P(Event) = 1 - P(Complement event)
Conditional probability
● Conditional probability is the chance that a certain event (1) occurs, given another event (2) has occurred
○ P(Event 1 | Event 2)
Multiplication Rule
● The probability that two events occur is the chance of the 1st event multiplied by the chance of the 2nd event, given the 1st has occurred
○ P(Event 1 and Event 2) = P(Event 1) × P(Event 2 | Event 1)
Addition Rule
● The probability that at least one of two events occurs is the chance of the 1st event plus the chance of the 2nd event minus the probability that both events occur
○ P(Event 1 or Event 2) = P(Event 1) + P(Event 2) - P(Event 1 and Event 2)
Mutually exclusive
● Two events are mutually exclusive when the occurrence of one event prevents the other
Independence
● Two events are independent if the chance of the 1st given the 2nd is the same as the chance of the 1st, i.e. P(Event 1 | Event 2) = P(Event 1)
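The rules above can be verified exactly by listing every outcome of a small chance process. The sketch below uses two fair dice; the three events are chosen for illustration and are not from the lecture.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event, by counting outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event1, event2):
    """P(event1 | event2): restrict attention to outcomes where event2 occurs."""
    return prob(lambda o: event1(o) and event2(o)) / prob(event2)

def first_is_six(o):
    return o[0] == 6

def second_is_six(o):
    return o[1] == 6

def total_at_least_10(o):
    return sum(o) >= 10

p_both = prob(lambda o: first_is_six(o) and total_at_least_10(o))
p_either = prob(lambda o: first_is_six(o) or total_at_least_10(o))

# Multiplication rule: P(1 and 2) = P(1) x P(2 | 1)
multiplication_ok = (
    p_both == prob(first_is_six) * cond_prob(total_at_least_10, first_is_six)
)

# Addition rule: P(1 or 2) = P(1) + P(2) - P(1 and 2)
addition_ok = (
    p_either == prob(first_is_six) + prob(total_at_least_10) - p_both
)

# Independence: the second die tells us nothing about the first
independent = cond_prob(first_is_six, second_is_six) == prob(first_is_six)

print(p_both, p_either, multiplication_ok, addition_ok, independent)
```

Using `Fraction` keeps every probability exact, so the equalities are checked without rounding error.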
The Prosecutor's fallacy
● The prosecutor's fallacy is a mistake in statistical thinking, whereby it is assumed that the probability of a random match is equal to the probability that the defendant is innocent
○ It has been used by the prosecution to argue for the guilt of a defendant during famous criminal trials
○ It can also be used by defense lawyers to argue for the innocence of their client
Summary
● Addition Rule
○ Two events are mutually exclusive when the occurrence of one event prevents the other
○ If two events are mutually exclusive then the chance of at least one event occurring is the sum of the individual chances
● Multiplication Rule
○ Two events are independent if the occurrence of the first event does not change the chance of the second event
○ If the two events are independent then the chance of both events occurring is the multiplication of the individual chances
Lecture 12
Counting and drawing trees (The old way)
● For simple chance problems, a good way to start is:
a. Method 1: Write a full list of outcomes and count the outcomes of interest
■ Write a list of all outcomes
■ Count which outcomes belong to the event of interest
b. Method 2: Summarise in a tree diagram
■ Draw a tree
Running a simulation (The new way)
1. Method 3: Simulate
○ Use software (e.g. R) to simulate throwing dice x times and record the findings
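Method 3 can be sketched as follows. The event (a total of at least 10 on two dice), the number of throws and the seed are all assumptions made for the example; the exact answer by counting outcomes is 6/36.

```python
import random

def simulate_two_dice(n_throws, seed=1):
    """Estimate P(total of two dice >= 10) by repeated simulated throws."""
    rng = random.Random(seed)   # fixed seed so the run can be repeated
    hits = 0
    for _ in range(n_throws):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total >= 10:
            hits += 1
    return hits / n_throws

estimate = simulate_two_dice(100_000)
exact = 6 / 36

print(round(estimate, 4), round(exact, 4))
```

The estimate will not equal the exact chance, but with many throws it should land close to it.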
Summary
● Counting outcomes or drawing a tree to derive probabilities of outcomes can quickly become tedious. One solution is to use simulations
Lecture 13
Chance error
● Every time you toss a fair coin, there is chance variability
○ Number of heads (observed value) = half the number of tosses (expected value) + chance error
Law of Averages
● The Law of Averages states that the proportion of heads becomes more stable as the length of the simulation increases and approaches a fixed number called the relative frequency
● The chance error in the number of heads is likely to be large in absolute size, but small relative to the number of tosses.
Important Facts
● For a fair coin:
○ Even if we observe 100 heads in a row, still P(Tail) = 0.5. Misunderstanding this leads to the Gambler's Fallacy
○ As the number of tosses increases
■ The absolute size of the chance error increases
■ The absolute percentage (i.e. ‘relative’) size of the chance error decreases
■ The proportion of the event will converge to the theoretical or expected proportion
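The behaviour of the chance error can be explored with a short simulation. This is a sketch: the toss counts and the seed are arbitrary choices, and because the results are random only the general tendency (not any particular number) should be read from the output.

```python
import random

def chance_errors(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times.

    Returns (absolute chance error, chance error relative to n_tosses).
    """
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    absolute = heads - n_tosses / 2     # observed value - expected value
    relative = absolute / n_tosses
    return absolute, relative

results = {n: chance_errors(n) for n in (100, 10_000, 1_000_000)}

for n, (absolute, relative) in results.items():
    print(n, absolute, round(relative, 5))
```

Across repeated runs with different seeds, the absolute error tends to grow with the number of tosses while the relative error shrinks towards 0.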
Summary
● For independent events, it is a mistake to assume that the chance of observing a particular event changes over time, even if the event has not occurred for a long time. This is the Gambler's fallacy and downfall
● Rather, the Law of Large Numbers states that the observed proportion of occurrences of the event, in the long run, approaches the expected proportion
Lecture 14
Box model
● The box model is a simple way to describe many chance processes
● The box represents the population, containing different types of tickets
● We need to know:
○ The number or proportion of each kind of ticket in the box
○ The number of draws from the box
○ For now, we only consider drawing with replacement
Modelling the Sum of a sample
● For the Sum of random draws from a box model with replacement,
○ observed value = expected value + chance error
■ Expected value (EV) = number of draws × mean of the box
■ Standard error (SE) = $\sqrt{\text{number of draws}}$ × SD of the box
■ SE is the expected magnitude of the chance error.
How to calculate the SD of the box
● As the box represents the population, the SD of the box is the population SD
● We could call it $SD_{pop}$, but in this context, we will simply use SD
● 3 ways to calculate the SD of the box
○ Formula: RMS(gaps) = Root of the Mean of the Squared gaps
○ R: popsd() with package multicon
○ Shortcut (for simple binary (two ticket) boxes)
■ If a box only contains 2 different numbers (“big” and “small”), then
● SD = (big - small) × $\sqrt{\text{proportion of big} \times \text{proportion of small}}$
How does chance error relate to standard error?
● An observed value is likely to be around its expected value, with a chance error similar to the SE
● Observed values usually lie within 2 SEs of the expected value
Modelling the Mean of the Sample
● As the Mean of the sample is just the Sum of the sample divided by the number of draws, we get an equivalent result as follows
● For the Mean of the random draws from a box model with replacement
○ observed value = expected value + chance error
■ Expected value (EV) = mean of the box
■ Standard error (SE) = $\frac{\text{SD of the box}}{\sqrt{\text{number of draws}}}$
Comparison
● Notice that there are two sets of formulas, depending on whether we are modelling the sum or mean of a sample
● The question being asked will dictate whether the sum or mean of a sample is more appropriate
● Given the mean and SD of the population
○ Sum of the Sample
■ Expected value (EV) = n × mean
■ Standard error (SE) = $\sqrt{n}$ × SD
○ Mean of the Sample
■ Expected value (EV) = mean
■ Standard error (SE) = $\frac{SD}{\sqrt{n}}$
● Notice that as the sample size (n) increases, the SE for the sum increases, but the SE for the mean decreases
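The two sets of formulas above can be computed side by side for a simple box. The box below (one "1" ticket and three "0" tickets) and the 100 draws are hypothetical; the SD is found both by the RMS formula and by the shortcut for two-ticket boxes.

```python
from math import sqrt

def box_mean(box):
    return sum(box) / len(box)

def box_sd(box):
    """Population SD of the box = RMS of the gaps from the mean."""
    m = box_mean(box)
    return sqrt(sum((t - m) ** 2 for t in box) / len(box))

box = [1, 0, 0, 0]   # hypothetical box: one "1" ticket, three "0" tickets
n_draws = 100        # draws with replacement

mean = box_mean(box)
sd = box_sd(box)

# Shortcut for a box with only two different numbers
big, small = max(box), min(box)
prop_big = box.count(big) / len(box)
sd_shortcut = (big - small) * sqrt(prop_big * (1 - prop_big))

# Sum of the sample
ev_sum = n_draws * mean
se_sum = sqrt(n_draws) * sd

# Mean of the sample
ev_mean = mean
se_mean = sd / sqrt(n_draws)

print(round(sd, 4), round(sd_shortcut, 4))
print(ev_sum, round(se_sum, 4))
print(ev_mean, round(se_mean, 4))
```

Dividing the EV and SE of the sum by the number of draws gives the EV and SE of the mean, which is why the two sets of formulas always agree.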
Summary
● The box model describes a simple chance process involving drawing tickets from a fixed box (population).
● We can describe the behaviour of the Sum and the Mean of the sample in terms of the expected value (EV) and the standard error (SE), and compare to the observed value (OV)
● We can find $SD_{box}$ by using the shortcut formula or popsd()
● Given the mean and SD of the population
○ When there is one desired outcome: make the desired tickets a “1” and all other tickets “0”
Lecture 15
The Central Limit Theorem
● If draws are independent and random with replacement and the sample size for the sum (or average) is sufficiently large, then
○ The distribution for the sum (or average) will closely follow the normal curve, even if the contents of the box do not
● “The Normal curve becomes a good model for the chance error of a sum (or average) in sufficiently large samples”
● “As the sample size increases, the distribution for a sum (or average) tends towards the Normal distribution”
Conditions for the CLT
● The draws must be random and independent from a fixed population
● The number of draws must be reasonably large (especially if the histogram of the box differs from the normal curve)
● How large? This depends on the shape of the histogram
● A common convention is a number of draws larger than 30 (assuming a basically symmetric distribution with no obvious outliers)
○ However, this is NOT a rule
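One way to see the theorem at work is to simulate sums of draws from a box that looks nothing like the normal curve and check how many sums fall within 2 SEs of the expected value. The box, the number of draws, the number of repeats and the seed below are all assumptions made for this sketch.

```python
import random
from math import sqrt

box = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]   # hypothetical, heavily skewed box
n_draws = 100
n_repeats = 5000

box_mean = sum(box) / len(box)
box_sd = sqrt(sum((t - box_mean) ** 2 for t in box) / len(box))

ev = n_draws * box_mean        # expected value of the sum
se = sqrt(n_draws) * box_sd    # standard error of the sum

rng = random.Random(42)        # fixed seed so the run can be repeated
sums = [
    sum(rng.choice(box) for _ in range(n_draws))   # draws with replacement
    for _ in range(n_repeats)
]

# If the normal curve is a good model, roughly 95% of the simulated
# sums should lie within 2 SEs of the expected value
within_2_se = sum(abs(s - ev) <= 2 * se for s in sums) / n_repeats

print(ev, se, round(within_2_se, 3))
```

With a box this skewed the agreement is only approximate at 100 draws; it improves as the number of draws grows, consistent with the warning that 30 is a convention rather than a rule.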
Summary
● The Central Limit Theorem states that for repeated simulations of a chance process resulting in a sum or average, the simulation histogram of the observed values converges to the Normal distribution
Lecture 16
Parameter & Estimate
● A parameter is a numerical fact about the population which we are interested in. For example the population mean μ or population standard deviation σ
● An estimate (or statistic) is a calculation from sample values which best predicts the parameter. For example the sample mean $\hat{\mu}$ (sometimes also denoted $\bar{x}$) or sample standard deviation $\hat{\sigma}$
● Observed value (OV) = expected value (EV) + chance error
● In the Sample Mean case:
○ $\hat{\mu}$ = μ + chance error
■ The chance error is random by nature (noise). We can quantify the chance error by estimating the spread (= expected magnitude) of the chance error. This spread is called the standard error (SE) and it is the standard deviation of the chance error. It is often denoted by σ (as well)
■ Note that we have two different σs now. We can call the population SD $\sigma_{pop}$, and the standard error (= standard deviation of the chance error) remains denoted by σ. Note that $\sigma_{pop}$ = SD(Box)
■ The central limit theorem tells us that for a sample mean, the chance error behaves approximately like $N(0, \sigma^2)$, where $\sigma = \frac{1}{\sqrt{\text{sample size}}} \times SD(\text{Box})$, and in practice we can estimate σ by $\hat{\sigma} = \frac{1}{\sqrt{\text{sample size}}} \times SD(\text{Sample})$. In particular SE = SD(chance error) = σ ≈ $\widehat{SE} = \hat{\sigma}$
● Mean and standard deviation describe a set of data. They are numerical summaries. Expected value and standard error describe the sum/mean of a random sample. The standard error is the standard deviation of the chance error
The Correction factor
● When sampling with replacement, the SE is determined by the absolute sample
size
● When sampling without replacement, the SE decreases as the ratio of sample
size to population size increases: when a higher proportion of the
population is sampled, the variability decreases
● When the sample is only a small part of the population, the size of the population
has almost no effect on the SE of the estimate
● SE without replacement = correction factor × SE with replacement, where
$\text{correction factor} = \sqrt{\frac{\text{population size} - \text{sample size}}{\text{population size} - 1}}$
Summary
● If we draw without replacement, then strictly the SE should be adjusted by the
correction factor
○ $\text{correction factor} = \sqrt{\frac{\text{population size} - \text{sample size}}{\text{population size} - 1}}$
● However, for large population size compared to sample size, the correction factor
is almost 1
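The correction factor is easy to evaluate directly. The sketch below uses made-up population and sample sizes and hypothetical function names; it shows the factor shrinking the SE when the sample is a sizeable share of the population and staying close to 1 otherwise.

```python
import math

def correction_factor(population_size, sample_size):
    """Correction factor for the SE when drawing without replacement."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

def se_without_replacement(se_with_replacement, population_size, sample_size):
    """Adjust an SE computed as if drawing with replacement."""
    return correction_factor(population_size, sample_size) * se_with_replacement

# Hypothetical sizes: the same sample of 100 from a small and a large population.
small_population = correction_factor(population_size=1_000, sample_size=100)
large_population = correction_factor(population_size=1_000_000, sample_size=100)
print(round(small_population, 4), round(large_population, 6))
```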
Lecture 17
Population vs Sample
● A sample is a part of the population
Limitation of a census
● Collecting every unit of a population:
○ Is hard
○ Takes lots of time
○ Costs a lot of money
○ Requires lots of resources
Finding the best estimate of the parameter
● Much Statistical theory is concerned with how to find the best estimate of a
parameter
● 2 critical issues are:
○ How was the sample chosen? Is it representative of the population?
○ What estimate is closest to the parameter?
Examples of how bias can occur
● If there is a systematic tendency to exclude or include one type of person from
the sample
○ E.g. Convenience sampling (or “grab sampling”): A non-probability
sampling technique where subjects are selected because of their
convenient accessibility. It is definitely not recommended, except possibly
to test a survey (pilot)
● If some participants fail to complete surveys
○ What was the response rate?
○ Non-respondents can be very different to respondents
● If characteristics of the interview have an effect on the answer given by
participants
● If the form of the question in the survey affects the response to the question
● Because people may forget details
● Because of sensitive questions: people may not tell the truth
● Because of lack of clarity in the question
● Because attributes of the interview process may cause bias
Warning about bias and sample size
● When a selection process is biased, taking a larger sample does not reduce
bias; rather, it can amplify the bias. It repeats the mistake on a larger scale
● In the famous 1936 US election, the Literary Digest magazine predicted an
overwhelming victory for Alfred Landon over Franklin Roosevelt, based on a poll
of 2.4 million people. However, Roosevelt won 62% to 38%. The Digest went
bankrupt soon after
● The problem was that their sampling procedure involved mailing questionnaires
to 10 million people, with names and addresses from sources that were biased
against the poor
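The point that a bigger biased sample is still biased can be illustrated with a toy simulation. Everything here is an assumption made for illustration: the 62% true support echoes the election result quoted above, but the two "reach" rates (how likely a supporter or non-supporter is to appear on the mailing list) are invented and are not the Digest's actual figures.

```python
import random

def biased_poll(sample_size, support=0.62, reach_supporter=0.3, reach_other=0.7, seed=1):
    """Estimate support from a list that under-represents supporters (hypothetical rates)."""
    rng = random.Random(seed)
    responses = []
    while len(responses) < sample_size:
        is_supporter = rng.random() < support
        reach = reach_supporter if is_supporter else reach_other
        if rng.random() < reach:        # only people on the biased list are polled
            responses.append(is_supporter)
    return sum(responses) / sample_size

small_poll = biased_poll(1_000)
large_poll = biased_poll(100_000)
print(round(small_poll, 3), round(large_poll, 3))
```

Under these invented rates the estimate settles near 0.186 / 0.452 ≈ 0.41, not 0.62. The larger poll only pins down the wrong number more precisely.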
How to pick a good sample?
● A sampling procedure should give a representative cross section of the
population
● We use a probability method to pick the sample, so that
○ The interviewer is not involved in the selection. The method of selection is
impartial
○ The interviewer can compute the chance of any particular individual
being chosen, i.e. there is a defined procedure for selecting the sample,
which uses chance. It is objective.
● For example, Simple random sampling involves drawing at random without
replacement
● Multi-stage cluster sampling
○ As simple random sampling is often not practical, organisations may use
multi-stage cluster sampling. This is a probability sampling technique
which takes samples in stages, and individuals or clusters are chosen at
random at each stage.
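Both probability methods can be sketched with the standard library. The population, the cluster names and the sizes below are hypothetical, and the two-stage function is just one simple way to stage the selection, not a prescribed procedure.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Draw n units at random without replacement; each unit has chance n/N."""
    return random.Random(seed).sample(population, n)

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=0):
    """Stage 1: pick clusters at random. Stage 2: pick individuals at random within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

population = list(range(1, 101))        # hypothetical ID numbers 1..100
clusters = {f"suburb_{i}": list(range(i * 10, i * 10 + 10)) for i in range(10)}

srs = simple_random_sample(population, 10)
staged = two_stage_cluster_sample(clusters, n_clusters=3, n_per_cluster=4)
print(srs)
print(staged)
```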
Unavoidable Bias
● Even with a probability method determining the sample, bias can easily come in
● In addition, because the sample is only part of the population, we always have
chance error
○ Parameter estimate = true parameter + bias + chance error
Summary
● Unless a census is possible, information about a population comes from an
estimate from a sample. The reliability of such an estimate depends on how the
sample was chosen. Hence, we usually have:
○ Observed value = true parameter + bias + chance error; or
○ Parameter estimate = true parameter + bias + chance error
Lecture 18
Confidence Interval
● A confidence interval quantifies the uncertainty of our estimates.
● A q% confidence interval covers the true parameter with q% probability. More
precisely, if you calculated intervals for many samples under the same setting,
q% of them would cover the true parameter
● If the chance error follows a symmetric distribution, then a q% confidence interval
is given by:
○ Observed Value ± the $\left(1 - \frac{1-q}{2}\right)$ quantile of the chance error, with q
written as a proportion (q = 0.95 gives the 97.5th percentile)
○ For the 95% confidence interval we thus have
■ [OV − 97.5th percentile(CE), OV + 97.5th percentile(CE)]
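When the chance error is approximately normal, the percentile in the interval above is a normal quantile times the SE. The sketch below assumes that normal approximation. The observed value and SE are made-up numbers and the function name is hypothetical.

```python
from statistics import NormalDist

def normal_confidence_interval(observed_value, standard_error, q=0.95):
    """Interval OV +/- z * SE, assuming the chance error is roughly N(0, SE^2)."""
    z = NormalDist().inv_cdf(1 - (1 - q) / 2)   # the 0.975 quantile when q = 0.95
    return observed_value - z * standard_error, observed_value + z * standard_error

# Hypothetical sample proportion 0.55 with estimated SE 0.05.
low, high = normal_confidence_interval(observed_value=0.55, standard_error=0.05)
print(round(low, 3), round(high, 3))
```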
Hypothesis Testing
● In Hypothesis Testing, we start with a hypothesis about our population. For
example:
○ “The coin is fair (so the population mean is 0.5)”
● We then calculate the chance error and decide whether:
○ The chance error fell within an interval to be expected → Our data is
consistent with the hypothesis
○ The chance error was extremely big → Either we observed a very rare
event or our hypothesis is wrong
3 Main Steps
● Set up research question
○ Hypothesis H0 vs H1
● Weigh up evidence
○ Assumptions
○ Test Statistic
○ P-value
● Explain conclusion
○ Conclusion
Why hypothesis testing?
● To make evidence based decisions, we need to weigh up evidence
● Hypothesis Testing is a scientific method for weighing up the evidence given in
the data against a given hypothesis (model)
○ We say that the data is not consistent with the hypothesis if the difference
between the observed value (in our case sample mean or sample sum)
and the expected value (assuming the hypothesis) is too big
○ Alternative formulation: If the chance error is too big we should consider
rejecting the hypothesis
Summary
● A confidence interval quantifies the uncertainty of our estimates. A q%
confidence interval covers the true parameter with q% probability
● Hypothesis testing is a scientific method for weighing up the evidence in the
data against a given hypothesis (model)
Lecture 19
The Z Test
● This test is used to test a hypothesis about a proportion in a population
● Some examples could be:
○ Is the proportion of the coin flips that are heads equal to 50%
○ Is the proportion of CEOs that are female less than 50%
○ Is the proportion of students that drop out of school greater than 25%
H: Hypothesis
● The null hypothesis H0 postulates a certain expected value
● The alternative hypothesis H1 is that the underlying expected value is actually
different
● When performing a statistical test, we calculate the chance error under H0 and
weigh up whether its size is plausible or not
A: Assumption (for Z Test)
● Observations are independent of each other
● Sample mean (sample sum) follows a normal distribution, or the sample size is big
enough that normality is approximately satisfied (from the Central Limit
Theorem)
○ We do NOT need the data to be normal
○ But if the sample size is not big enough, we can use the fact that if the data is
approximately normal, the sample mean and sample sum will be
approximately normal as well
T: Test statistic (for Z test)
● A test statistic measures the difference between what is observed in the data
and what is expected from the null hypothesis
● It takes the form
○ $\text{Test statistic} = \frac{\text{observed value (OV)} - \text{expected value (EV)}}{\text{standard error (SE)}} = \frac{\text{chance error (CE)}}{\text{standard error (SE)}}$
● NOTE: If the null hypothesis is true, then the test statistic follows a standard
normal curve: N(0, 1)
P: p-value (for Z test)
● The p-value is the chance of observing the test statistic (or something more
extreme), assuming H0 is true:
○ In a Z test, the test statistic follows a standard normal curve, hence the
p-value is given by
■ $p = P(Z \ge |\text{test statistic}|)$
○ Where Z is a standard normal: $Z \sim N(0, 1)$
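The H, A, T, P steps for a proportion can be traced in code. This is a sketch under assumptions: the coin-flip counts are invented, the function name is hypothetical, and the SE uses the SD of a 0/1 box under H0, i.e. √(p0(1 − p0)) divided by √n.

```python
import math
from statistics import NormalDist

def z_test_proportion(successes, n, p0):
    """Z statistic and tail p-value P(Z >= |z|) for H0: proportion = p0."""
    observed = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)        # SD(Box) / sqrt(n) under H0
    z = (observed - p0) / se                 # (OV - EV) / SE
    p_value = 1 - NormalDist().cdf(abs(z))   # P(Z >= |test statistic|)
    return z, p_value

# Hypothetical data: 60 heads in 100 flips, H0: the coin is fair.
z, p_value = z_test_proportion(successes=60, n=100, p0=0.5)
print(round(z, 2), round(p_value, 4))
```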
P: p-value (In general)
● In general (for all tests), the smaller the p-value, the less likely it is to observe a
test statistic of the magnitude observed if H0 is true. If the p-value is small enough,
it provides evidence to reject the null hypothesis, H0.
● The convention is to reject the null hypothesis if p < α, where α is a
predetermined significance level, often chosen as α = 0.05
● However, you don’t need to follow this strictly, and you shouldn’t
Summary of the hypothesis test
● H: If p = proportion of patients who responded to the treatment, we test H0: p =
0.8 vs H1: p > 0.8
● A: We assume that the participants in the treatment group are independent of
each other and, given a sample size of 29, the sample mean is approximately
normal
● T: The test statistic for the observed sum is 1.3
● P: The p-value for this test statistic is 0.097
● C: As the p-value is greater than 0.05, we do not have enough evidence to reject
the null hypothesis, and so the data do not provide strong evidence that p > 0.8
One-sided and Two-sided Tests
● 1 sided:
○ Specifies the direction of the alternative hypothesis. E.g. H1: p > 0.8
● 2 sided:
○ Does not specify the direction of the alternative hypothesis.
H1: p ≠ 0.8
○ In this case the p-value doubles
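The doubling can be checked numerically with the test statistic of 1.3 from the treatment example. The sketch below assumes the standard normal curve for the test statistic. The function name z_p_value is hypothetical.

```python
from statistics import NormalDist

def z_p_value(test_statistic, two_sided=False):
    """Tail area beyond |test statistic| under N(0, 1), doubled for a 2-sided test."""
    tail = 1 - NormalDist().cdf(abs(test_statistic))
    return 2 * tail if two_sided else tail

one_sided = z_p_value(1.3)                   # H1: p > 0.8
two_sided = z_p_value(1.3, two_sided=True)   # H1: p != 0.8
print(round(one_sided, 3), round(two_sided, 3))
```

The one-sided value rounds to 0.097, matching the p-value quoted in the summary of the hypothesis test.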
Summary
● Hypothesis testing is a scientific method for weighing up the evidence given in
the data against a given hypothesis (model). It involves the following parts:
○ H: Hypothesis H0 vs H1
○ A: Assumptions
○ T: Test Statistic
○ P: p-value
○ C: Conclusion
● The Z test is used to test a hypothesis about a proportion in a population
Lecture 20
When to use the Z-test
● To use the Z-test, we need to know the population SD
○ One plausible case: Under H0 in a binary case assuming a proportion
(and using Box SD)
● Can we just estimate the population SD using the sample SD?
○ Yes
○ But this estimation will add extra variability to the test statistic as the
sample SD varies from sample to sample
○ For large samples, the difference between the population and sample SD
should be small, and so the Z-test may be appropriate
○ For small samples, the difference will be more noticeable. Hence, we
should use the t-Test instead
The t-Test
● W. S. Gosset (1876-1937) invented a similar test to the Z-test, which uses the
sample SD and the t-distribution
● The t-distribution varies in shape according to the sample size. The smaller the
sample size is, the more variable the sample is, and hence the distribution of the
test statistic will be “wider”. The degree of “wideness” (also called degrees of
freedom) depends on the sample size and here it is n − 1. We write such a
distribution as $t_{n-1}$
The t-distribution
● The t-distribution with ν degrees of freedom ($t_\nu$)
● ν = ∞ results in the standard normal distribution
● The standardised chance error (if standardised with the sample SD) follows a
t-distribution with ν = n − 1 = sample size − 1
Summary: T-Test
● H: H0: population mean = μ vs H1: population mean <, >, ≠ μ
● A: Individuals are independent; sample size is large enough for the CLT (or
population is normal)
● T: $\text{Test statistic} = \frac{\text{observed mean} - \text{population mean}}{\widehat{SE}}$, where $\widehat{SE} = \frac{\text{sample SD}}{\sqrt{n}}$
● P: Use the $t_{n-1}$ curve to find the tail area for the observed test statistic
○ 1-sided: $P(t_{n-1} > |\text{Test statistic}|)$
○ 2-sided: $2 \times P(t_{n-1} > |\text{Test statistic}|)$
● C: Retain or Reject H0
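The T and P steps of the one-sample t-test can be sketched without extra libraries. The data and the hypothesised mean below are hypothetical. Since Python's standard library has no t-distribution, the tail area is approximated here by integrating the t density with Simpson's rule. In practice a statistics package or a t-table would normally be used instead.

```python
import math
import statistics

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > |t|) for a t-distribution with df degrees of freedom."""
    t = abs(t)
    # Density constant, computed on the log scale so large df does not overflow.
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def density(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    # Simpson's rule for the area between 0 and |t| (steps must be even).
    h = t / steps
    area = density(0) + density(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * density(i * h)
    return 0.5 - area * h / 3

def one_sample_t_test(sample, mu0, two_sided=True):
    """Return (test statistic, degrees of freedom, p-value) for H0: population mean = mu0."""
    n = len(sample)
    se_hat = statistics.stdev(sample) / math.sqrt(n)      # sample SD / sqrt(n)
    t_stat = (statistics.fmean(sample) - mu0) / se_hat
    tail = t_upper_tail(t_stat, df=n - 1)
    return t_stat, n - 1, (2 * tail if two_sided else tail)

sample = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical measurements
t_stat, df, p_value = one_sample_t_test(sample, mu0=4)
print(round(t_stat, 3), df, round(p_value, 3))
```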
Summary
● The t-test is used to decide whether an observed difference between data and
expected value is just due to chance error alone (the null hypothesis) or another
reason (alternative hypothesis)
● If the population SD is unknown, we use the t-test, especially in the case of small
samples
● The test statistic is:
○ $\frac{\text{observed value} - \text{expected value}}{\widehat{SE}}$
● We can also use the t-distribution to construct confidence intervals
Lecture 21
Inference
● While visualisation of the data gives us an initial glimpse at the possible
relationship between the two populations (those who have drunk a Red Bull and
those who have not), we often want to make a decision on whether the mean of
the two populations is the same or different.
● Inference is making a decision about population parameter(s) based on a
sample
2-Sample T-Test
● H: Hypothesis
○ μ1 = mean heart rate of the control group
○ μ2 = mean heart rate of our treatment group
○ H0: There is no difference: μ1 = μ2, or μ1 − μ2 = 0
○ H1: There is a difference: μ1 ≠ μ2, or μ1 − μ2 ≠ 0
● A: Assumption
○ A1) All observed individuals are independent (within groups and between
different groups)
■ The two samples (Red Bull and Control) contain different people
● Note: This design differs from the caffeine one, in which the
same person is tested at both the 0 and 13 mg levels of caffeine
and we consider the sample of differences from each pair
● The paired differences can eliminate personal effects on the
experimental result, but it is also harder to find the same
person to undergo both treatment and control for some
experiments
○ A2) The sample means follow a Normal distribution
■ Our samples are quite small, so the Central Limit Theorem might not
fully kick in. Hence, the 2 sample t-Test is questionable
○ A3) The 2 populations have equal spread (SD/variance)
■ We assume that the 2 populations have the same variation in
heart rate
■ Check: Box Plots, Histograms, Variance Test
● Box plots show that RB seems to have a smaller SD. But the
difference might not be significant
■ Better: This assumption can be relaxed by using the Welch 2
sample T-Test
● T: Test Statistic
○ Equal variance
■ We compare 2 populations. Our observed value is the difference in
sample means. Our null hypothesis is that there is no difference in
population means
● $\text{test statistic} = \frac{OV - EV}{\widehat{SE}} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{\widehat{SE}}$, where
$\widehat{SE} = \sqrt{SD_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$ and $df = n_1 + n_2 - 2$, based on the
pooled sample SD, where $SD_p^2 = \frac{(n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2}{n_1 + n_2 - 2}$
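The pooled formula can be traced step by step on a small data set. The heart-rate values below are invented for illustration (they are not the Red Bull data from the lecture), and the function name is hypothetical. SD_1 and SD_2 are taken to be sample SDs with the n − 1 divisor.

```python
import math
import statistics

def pooled_two_sample_t(sample1, sample2):
    """Equal-variance 2-sample t statistic and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    # Pooled variance SD_p^2: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se_hat = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # OV = difference in sample means, EV = 0 under H0.
    t_stat = (statistics.fmean(sample1) - statistics.fmean(sample2) - 0) / se_hat
    return t_stat, n1 + n2 - 2

control = [70, 72, 68, 74, 71]          # hypothetical heart rates
red_bull = [75, 78, 74, 80, 73]         # hypothetical heart rates
t_stat, df = pooled_two_sample_t(control, red_bull)
print(round(t_stat, 3), df)
```

The p-value would then come from the tail area of the t curve with df = n1 + n2 − 2 degrees of freedom.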
Summary
● The 2 Sample T-Test is used to test for the difference in means of two
populations
● We need to assume that:
○ All observed individuals are independent
○ The sample means follow a Normal distribution
○ The 2 populations have equal spread (SD/variance)
● We can relax the final assumption by using a Welch Two-Sample T-Test
Lecture 22
Welch 2-Sample T-test
● The Welch 2-Sample T-test has a different SE and df formula compared to the
standard two sample t-test
● Standard error and df formula for the difference with unequal variance:
○ $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
○ $df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1} + \frac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}}$
■ Where $s_k = \widehat{SD}(\text{Sample}_k)$, $k = 1, 2$
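The Welch SE and df formulas translate directly into code. The two samples below are hypothetical and the function name is made up. s_k is taken to be the sample SD with the n − 1 divisor.

```python
import math
import statistics

def welch_se_and_df(sample1, sample2):
    """Standard error and degrees of freedom for the Welch 2-sample t-test."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, df

control = [70, 72, 68, 74, 71]          # hypothetical heart rates
red_bull = [75, 78, 74, 80, 73]         # hypothetical heart rates
se, df = welch_se_and_df(control, red_bull)
t_stat = (statistics.fmean(control) - statistics.fmean(red_bull)) / se
print(round(se, 3), round(df, 2), round(t_stat, 3))
```

The df is usually not a whole number. It lies between the smaller of n1 − 1 and n2 − 1 and the pooled value n1 + n2 − 2.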
Non-Independent Data (Paired T-test)
● Sometimes it is desirable to analyse dependent data. We often design an
experiment to take advantage of this dependency in order to control variation
between experimental groups
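A paired analysis reduces to a one-sample t-test on the within-pair differences, with H0 that the mean difference is 0. The before/after values in this sketch are hypothetical, and so is the function name.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """One-sample t statistic on the paired differences (after - before), H0: mean difference = 0."""
    if len(before) != len(after):
        raise ValueError("paired data need the same number of observations")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se_hat = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.fmean(diffs) / se_hat, n - 1

before = [70, 72, 68, 74, 71]           # hypothetical heart rates, same five people
after = [73, 74, 72, 75, 76]
t_stat, df = paired_t_statistic(before, after)
print(round(t_stat, 3), df)
```

Each person serves as their own control, so person-to-person variation cancels in the differences.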
Summary
● We can perform a Welch Two-sample t-test to compare two populations with
different variances
● We can perform a Paired t-test (one sample t-test) if we have paired data
● We can perform a t-test for the slope of a regression line