Math1005 Notes

Lecture‌‌1‌ ‌

Controlled‌‌Experiment:‌ ‌
● We‌‌want‌‌to‌‌study‌‌whether‌‌the‌t‌reatment‌‌‌causes‌‌the‌r‌ esponse‌ ‌
● The‌‌response‌‌that‌‌we‌‌want‌‌may‌‌be‌‌caused‌‌by‌‌other‌‌factors/variable‌ ‌
● Hence,‌‌optimally,‌‌we‌‌conduct‌‌2‌‌parallel‌‌experiments‌‌which‌o ‌ nly‌‌‌differ‌‌in‌‌whether‌
the‌‌treatment‌‌is‌‌administered‌ ‌or‌‌not‌ ‌
● This‌‌is‌‌called‌‌controlled‌‌experiment‌‌i.e.‌‌we‌c ‌ ontrol‌‌‌the‌‌effects‌‌of‌‌the‌‌other‌‌
variables‌‌on‌‌the‌‌treatment‌ ‌
Confounding‌ ‌
● Confounding‌‌occurs‌‌when‌‌the‌‌effect‌‌of‌‌one‌‌variable‌‌(X)‌‌on‌‌another‌‌variable‌‌(Y)‌‌is‌‌
clouded‌‌by‌‌the‌‌influence‌‌of‌‌another‌‌variable‌‌(Z)‌ ‌









Such a diagram is called a "causal graph"

Bias‌ ‌
● Means‌‌that‌‌the‌‌quantity‌‌of‌‌interest‌‌is‌‌systematically‌‌under‌‌or‌‌overestimated‌ ‌
● Bias‌‌is‌‌often‌‌caused‌‌by‌‌a‌‌confounding‌‌variable‌‌but‌‌it‌‌can‌‌also‌‌have‌‌other‌‌causes‌‌
and‌‌sometimes‌‌it‌‌can‌‌even‌‌be‌‌desired‌ ‌
● In‌‌this‌‌module‌‌we‌‌will‌‌only‌‌consider‌‌bias‌‌due‌‌to‌‌confounding‌‌variables.‌‌This‌‌is‌‌
bias‌‌we‌‌want‌‌to‌‌avoid‌ ‌

Types‌‌of‌‌Bias:‌ ‌
● Selection Bias: If the treatment group is not comparable to the control group, then the differences between the two groups can confound the effect of the treatment
● Observer‌‌Bias:‌ ‌
○ If‌‌the‌‌subjects‌‌or‌‌investigators‌‌are‌‌aware‌‌of‌‌the‌‌identity‌‌of‌‌the‌‌two‌‌groups,‌‌
we‌‌can‌‌get‌‌bias‌‌in‌‌either‌‌the‌‌responses‌‌or‌‌evaluations,‌‌as‌‌they‌‌may‌‌
deliberately‌‌or‌‌subconsciously‌‌report‌‌more‌‌or‌‌less‌‌favourable‌‌results‌ ‌
○ In‌‌fact,‌‌the‌‌subject‌‌may‌‌even‌‌respond‌‌to‌‌the‌‌idea‌‌of‌‌the‌‌treatment‌‌-‌‌this‌‌is‌‌
called‌p‌ lacebo‌‌effect‌ ‌


● The‌‌placebo‌‌‌is‌‌a‌‌pretend‌‌treatment.‌‌It‌‌is‌‌designed‌‌to‌‌be‌‌neutral‌‌and‌‌
indistinguishable‌‌from‌‌the‌‌treatment‌ ‌
The‌‌placebo‌‌effect‌‌‌is‌‌an‌‌effect‌‌which‌‌occurs‌‌from‌‌the‌‌subject‌‌thinking‌‌they‌‌have‌‌
had‌‌the‌‌treatment‌
● Consent‌‌Bias:‌‌‌Can‌‌occur‌‌when‌‌subjects‌‌choose‌‌whether‌‌or‌‌not‌‌they‌‌take‌‌part‌‌in‌‌
the‌‌experiment‌ ‌
● This‌‌quickly‌‌raises‌‌many‌‌ethical‌‌questions‌ ‌
● How‌‌can‌‌we‌‌avoid‌‌consent‌‌bias?‌ ‌
● Who‌‌determines‌‌who‌‌is‌‌part‌‌of‌‌each‌‌group‌ ‌
● It‌‌may‌‌be‌‌unethical‌‌to‌‌withhold‌‌treatment‌‌for‌‌those‌‌in‌‌the‌‌control‌‌group‌‌or‌‌enforce‌‌
treatment‌‌for‌‌those‌‌in‌‌the‌‌treatment‌‌group‌ ‌

Solution‌‌for‌‌Selection‌‌and‌‌Observer‌‌Bias:‌ ‌
● We‌‌need‌‌to‌‌conduct‌‌a‌R ‌ andomised‌‌Controlled‌‌Double-Blind‌‌‌Trial‌‌where‌‌both‌‌
the‌‌subjects‌‌(“single‌‌blind”)‌‌and‌‌investigators‌‌(“double‌‌blind”)‌‌are‌‌not‌‌aware‌‌of‌‌the‌‌
identity‌‌of‌‌the‌‌groups‌ ‌
● In addition, we need to control the patient's expectations (i.e. their response) and the investigator's observations (evaluation of the response).
● To‌‌do‌‌so‌‌we‌‌usually:‌ ‌
● Have‌‌a‌‌3rd‌‌party‌‌administrator‌‌of‌‌the‌‌treatment‌‌and‌‌placebo‌ ‌
● Design‌‌the‌‌placebo‌‌to‌‌mimic‌‌the‌‌treatment‌‌as‌‌much‌‌as‌‌possible‌ ‌

Summary‌ ‌
● The‌‌design‌‌of‌‌a‌‌statistical‌‌study‌‌is‌‌critical‌‌in‌‌order‌‌to‌‌obtain‌‌results‌‌that‌‌can‌‌be‌‌
generalised.‌‌The‌‌best‌‌method‌‌for‌‌comparison‌‌is‌‌a‌‌controlled‌‌randomised‌‌
double-blind‌‌trial,‌‌but‌‌this‌‌is‌‌often‌‌not‌‌possible‌ ‌

Lecture‌‌2‌ ‌
The‌‌Need‌‌for‌‌Observational‌‌Studies‌ ‌
● In observational studies, the assignment of subjects into treatment and control groups is outside the control of the investigator
● Many‌‌research‌‌questions‌‌require‌‌an‌o ‌ bservational‌‌study‌,‌‌rather‌‌than‌‌a‌‌
controlled‌‌experiment‌ ‌
● The‌‌conclusions‌‌of‌‌observational‌‌studies‌‌require‌‌great‌‌care‌ ‌
● An‌‌observational‌‌study‌‌is‌‌one‌‌in‌‌which‌‌the‌‌investigator‌‌has‌‌no‌‌control‌‌over‌‌the‌‌
subjects‌‌or‌‌qualities‌‌of‌‌interest;‌‌she‌‌is‌‌just‌‌an‌‌observer.‌‌In‌‌particular‌‌the‌‌
investigator‌‌cannot‌‌use‌‌randomisation‌‌for‌‌allocation‌‌into‌‌groups‌ ‌

Precautions‌ ‌


● It‌‌is‌‌very‌‌difficult‌‌to‌‌establish‌‌causation‌ ‌
○ It‌‌is‌‌rather‌‌easy‌‌to‌‌establish‌‌association‌‌(that‌‌one‌‌thing‌‌is‌‌linked‌‌to‌‌
another)‌ ‌
■ Association‌‌may‌s ‌ uggest‌‌‌causation‌ ‌
■ But‌‌association‌‌does‌‌not‌p ‌ rove‌‌‌causation‌ ‌
○ Observational‌‌Studies‌‌can‌‌have‌‌misleading‌‌hidden‌‌confounders‌ ‌
■ Confounders‌‌can‌‌be‌‌hard‌‌to‌‌find,‌‌and‌‌can‌‌mislead‌‌about‌‌a‌‌cause‌‌
and‌‌effect‌‌relationship‌ ‌
● Observational‌‌studies‌‌with‌‌a‌‌confounding‌‌variable‌‌can‌‌lead‌‌to‌‌Simpson’s‌‌
Paradox‌ ‌
○ Simpson’s‌‌Paradox‌‌‌(or‌‌the‌‌reversing‌‌paradox)‌‌was‌‌first‌‌mentioned‌‌by‌‌
British‌‌statistician‌‌Udny‌‌Yule‌‌in‌‌1903.‌‌It‌‌was‌‌named‌‌after‌‌Edward‌‌H.‌‌
Simpson‌ ‌
○ Sometimes‌‌there‌‌is‌‌a‌‌clear‌‌trend‌‌in‌i‌ndividual‌‌‌groups‌‌of‌‌data‌‌that‌‌
reverses‌‌when‌‌groups‌‌are‌p ‌ ooled‌‌‌together‌ ‌
■ It‌‌occurs‌‌when‌‌relationships‌‌between‌‌percentages‌‌in‌‌subgroups‌‌are‌‌
reversed‌‌when‌‌the‌‌subgroups‌‌are‌‌combined,‌‌because‌‌of‌‌a‌‌
confounding‌‌or‌‌lurking‌‌variable‌ ‌
■ The‌‌association‌‌between‌‌a‌‌pair‌‌of‌‌variables‌‌(X,‌‌Y)‌‌reverses‌‌sign‌‌
upon‌‌conditioning‌‌of‌‌a‌‌third‌‌variable‌‌Z,‌‌regardless‌‌of‌‌the‌‌value‌‌
taken‌‌by‌‌Z.‌‌ ‌
● Historical‌‌control‌ ‌
○ Some‌‌studies‌‌present‌‌themselves‌‌as‌‌a‌‌controlled‌‌experiment,‌‌but‌‌on‌‌
further‌‌examination,‌‌there‌‌is‌‌a‌‌historical‌‌control‌‌‌and‌t‌ ime‌‌‌is‌‌a‌‌
confounding‌‌variable.‌‌(Note:‌‌This‌‌is‌‌partly‌‌observational‌‌and‌‌partly‌‌an‌‌
experiment)‌ ‌
○ Investigators‌‌might‌‌compare‌‌the‌‌effect‌‌of‌‌a‌‌new‌‌medication‌‌on‌‌current‌‌
patients,‌‌with‌‌an‌‌old‌‌medication‌‌on‌p ‌ ast‌‌‌patients.‌‌The‌‌Treatment‌‌Group‌‌
(new‌‌drug)‌‌and‌‌the‌‌historical‌‌Control‌‌Group‌‌(old‌‌drug)‌‌may‌‌differ‌‌in‌‌
aspects‌‌beside‌‌the‌‌treatment‌ ‌
○ Controlled‌‌experiments‌‌need‌‌to‌‌be‌‌performed‌‌in‌‌the‌‌same‌‌time‌‌period‌‌
(contemporaneously)‌ ‌
Uses‌‌for‌‌the‌‌word‌‌‘Control’‌ ‌
● A‌‌control‌‌=‌‌a‌‌subject‌‌who‌‌did‌‌not‌‌get‌‌the‌‌treatment‌ ‌
● A‌‌controlled‌‌experiment‌‌=‌‌a‌‌study/experiment‌‌where‌‌the‌‌investigators‌‌allocate‌‌
subjects‌‌into‌‌different‌‌groups‌ ‌
● Controlling‌‌for‌‌confounders‌‌=‌‌trying‌‌to‌‌reduce‌‌the‌‌influence‌‌of‌‌confounding‌‌
variables‌ ‌
Summary‌ ‌


● Many statistical studies involve observational data, and so we need to be very careful with interpretation errors such as confusing association with causation, misleading confounders, Simpson's Paradox and historical controls

Lecture‌‌3‌ ‌
What‌‌is‌‌data?‌ ‌
● Data‌‌is‌‌information‌‌about‌‌the‌‌set‌‌of‌s ‌ ubjects‌‌‌being‌‌studied‌‌(like‌‌road‌‌fatalities)‌ ‌
○ Most‌‌commonly,‌‌data‌‌refers‌‌to‌‌the‌‌sample‌‌not‌‌the‌‌population‌ ‌
Different‌‌types‌‌of‌‌data‌ ‌
● There‌‌are‌‌different‌‌types‌‌of‌‌data,‌‌in‌‌different‌‌formats,‌‌for‌‌example:‌ ‌
○ Survey‌‌data‌ ‌
○ Spreadsheet‌‌type‌‌data‌ ‌
○ MRI‌‌image‌‌data‌ ‌
Big‌‌data‌ ‌
● Big‌‌data‌‌refers‌‌to‌‌the‌‌massive‌‌amounts‌‌of‌‌data‌‌being‌‌collected‌ ‌
● Big data is commonly high dimensional, which means that there are more variables p than subjects n
○ For example, genomics data can have 3 billion variables, as a person's DNA sequence is 3 billion base pairs long
○ Measurements taken every millisecond
○ Image‌‌data‌‌or‌‌video‌‌data‌ ‌
● Big‌‌data‌‌requires‌‌more‌‌complex‌‌visualisations‌ ‌
Initial‌‌Data‌‌Analysis‌‌(IDA)‌ ‌
● Initial‌‌Data‌‌Analysis‌‌is‌‌a‌‌first‌‌general‌‌look‌‌at‌‌the‌‌data,‌‌without‌‌formally‌‌answering‌‌
the‌‌research‌‌questions‌ ‌
○ IDA helps you to see whether the data can answer your research questions
○ IDA‌‌may‌‌pose‌‌other‌‌research‌‌questions‌ ‌
○ IDA‌‌can‌ ‌
■ Identify the data's main qualities;
■ Suggest‌‌the‌‌populations‌‌from‌‌which‌‌a‌‌sample‌‌derives‌ ‌
What’s‌‌involved‌‌in‌‌IDA?‌ ‌
● Initial‌‌Data‌‌Analysis‌‌commonly‌‌involves:‌ ‌
○ Data‌‌background:‌‌checking‌‌the‌‌quality‌‌and‌‌integrity‌‌of‌‌the‌‌data‌ ‌
○ Data‌‌structure:‌‌what‌‌information‌‌has‌‌been‌‌collected?‌ ‌
○ Data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining
○ Data‌‌summaries:‌‌graphical‌‌and‌‌numerical‌ ‌
● Here‌‌we‌‌focus‌‌on‌‌structure‌‌‌&‌‌graphical‌‌summaries‌‌‌for‌‌qualitative‌‌and‌‌
quantitative‌‌data‌ ‌


Variables‌ ‌
● A‌v‌ ariable‌‌‌measures‌‌or‌‌describes‌‌some‌‌attribute‌‌of‌‌the‌‌subjects‌ ‌
○ Data‌‌with‌‌p‌‌(explanatory)‌‌variables‌‌is‌‌said‌‌to‌‌have‌d ‌ imension‌‌p ‌‌‌
● Number‌‌of‌‌variables‌ ‌
○ Univariate‌‌(1‌‌[explanatory]‌‌variable)‌ ‌
○ Bivariate (2 [explanatory] variables)
○ Multivariate‌‌(above‌‌2‌‌[explanatory]‌‌variables)‌ ‌
● Types‌‌of‌‌variables‌ ‌
○ Qualitative‌‌or‌‌Categorical‌‌(Categories)‌‌R:Factor‌ ‌
■ Ordinal‌‌(Ordered)‌ ‌
● Binary‌‌(2‌‌categories)‌ ‌
● 3+‌‌categories‌ ‌
■ Nominal‌‌(Non-ordered)‌ ‌
● Binary‌‌(2‌‌categories)‌ ‌
● 3+‌‌categories‌ ‌
○ Quantitative‌‌or‌‌Numerical‌‌(Measurements)‌‌R:Numeric‌ ‌
■ Discrete‌‌(Separated)‌‌R:Integer‌‌(int)‌ ‌
■ Continuous‌‌(Continuum)‌‌R:Double‌ ‌
Choosing‌‌a‌‌graphical‌‌summary‌ ‌
● The‌‌aim‌‌of‌‌a‌‌graphical‌‌summary‌‌is‌‌to‌‌best‌‌highlight‌‌features‌‌of‌‌this‌‌data‌ ‌
○ To‌‌some‌‌extent‌‌we‌‌use‌‌trial‌‌and‌‌error‌ ‌
○ While‌‌the‌‌pie‌‌chart‌‌may‌‌be‌‌popular,‌‌it‌‌is‌‌usually‌‌not‌‌informative‌ ‌
Summary‌ ‌
● The‌‌type‌‌of‌‌variables‌‌determines‌‌what‌‌type‌‌of‌‌graphical‌‌summary‌‌is‌‌most‌
appropriate‌ ‌

Lecture‌‌4‌ ‌
Overview‌‌of‌‌histogram‌ ‌
● We‌‌use‌‌a‌‌histogram‌‌for‌‌quantitative‌‌data‌ ‌
● A‌‌histogram‌‌highlights‌‌the‌‌percentage‌‌of‌‌data‌‌in‌‌one‌‌class‌‌interval‌‌compared‌‌to‌‌
another‌ ‌
○ It‌‌consists‌‌of‌‌a‌‌set‌‌of‌‌blocks‌‌which‌‌represent‌‌the‌‌percentages‌‌by‌‌area‌ ‌
○ The‌‌area‌‌of‌‌the‌‌histogram‌‌is‌‌100%‌ ‌
○ The‌‌horizontal‌‌scale‌‌is‌‌divided‌‌into‌c ‌ lass‌‌intervals‌ ‌
○ The‌‌area‌‌of‌‌each‌‌block‌‌‌represents‌‌the‌‌percentage‌‌of‌‌subjects‌‌in‌‌that‌‌
particular‌‌class‌‌interval‌ ‌
● Density Scale:
○ Height of each block = (% in the block) ÷ (length of the class interval)
○ Height of each block = average percentage per horizontal unit
● For‌‌continuous‌‌data,‌‌we‌‌need‌‌an‌e ‌ ndpoint‌‌convention‌‌‌for‌‌data‌‌points‌‌that‌‌fall‌‌
on‌‌the‌‌border‌‌of‌‌two‌‌class‌‌intervals‌ ‌
○ If‌‌an‌‌interval‌‌contains‌‌the‌‌left‌‌endpoint‌‌but‌‌excludes‌‌the‌‌right‌‌endpoint,‌‌
then‌‌an‌‌18‌‌year‌‌old‌‌would‌‌be‌‌counted‌‌in‌‌[18,25)‌‌not‌‌[0,18)‌ ‌
○ We‌‌call‌‌this‌‌left-closed‌‌and‌‌right-opened‌ ‌
● Number‌‌of‌‌class‌‌intervals‌ ‌
Common Mistakes with Histograms
● The block heights are set equal to the percentages or total numbers
○ Here we wrongly use the total numbers (or percentages) as the heights
○ Unless the class intervals are the same size, in both cases this makes larger class intervals look like a larger overall %
○ Solution: Use density as the height, especially if class intervals are not the same size. Don't use percentages or total numbers
● Use‌‌too‌‌many‌‌or‌‌too‌‌few‌‌class‌‌intervals‌ ‌
○ This‌‌can‌‌hide‌‌the‌‌true‌‌pattern‌‌in‌‌the‌‌data.‌‌As‌‌a‌‌rule‌‌of‌‌thumb,‌‌use‌‌between‌‌
10-15‌‌class‌‌intervals‌ ‌
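● A minimal R sketch of a density-scale histogram with unequal, left-closed class intervals (the ages and break points are invented for illustration, not from the lecture):

    ages <- c(18, 19, 22, 24, 30, 35, 41, 47, 52, 60, 68, 74, 80, 91)
    # unequal class intervals, left-closed right-open, density on the vertical axis
    hist(ages, breaks = c(0, 18, 25, 70, 105), right = FALSE, freq = FALSE,
         main = "Density-scale histogram", xlab = "Age")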
Strategy‌ ‌
● Only‌‌count‌‌those‌‌deaths‌‌where‌‌person‌‌is‌‌driving‌ ‌
● Find data on registered driving licences with age information
● Combine‌‌information‌‌and‌‌derive‌‌a‌‌death‌‌rate‌‌per‌‌driving‌‌licence‌‌for‌‌different‌‌age‌‌
groups‌ ‌
● Conclusion:‌‌Death‌‌rate‌‌per‌‌licence‌‌is‌‌approximately‌‌the‌‌same‌‌for‌‌age‌‌group‌‌[18,‌‌
25)‌‌and‌‌[70,‌‌105).‌‌Both‌‌rates‌‌are‌‌approximately‌‌three‌‌times‌‌higher‌‌than‌‌the‌‌death‌‌
rate‌‌for‌‌age‌‌groups‌‌[25,‌‌70)‌ ‌
Simple‌‌box‌‌plot‌ ‌
● The‌‌boxplot‌‌plots‌‌the‌‌median‌‌(‘middle’‌‌data‌‌point),‌‌the‌‌middle‌‌50%‌‌of‌‌the‌‌data‌‌in‌‌
a‌‌box,‌‌the‌‌maximum‌‌and‌‌minimum,‌‌and‌‌determines‌‌any‌‌outliers‌ ‌
● We‌‌will‌‌consider‌‌how‌‌to‌‌draw‌‌the‌‌box‌‌plot‌‌when‌‌we‌‌learn‌‌about‌‌the‌‌interquartile‌‌
range‌‌(IQR)‌‌in‌‌a‌‌later‌‌lecture‌ ‌
Comparative‌‌box‌‌plots‌ ‌
● A‌‌comparative‌‌boxplots‌‌splits‌‌up‌‌a‌‌quantitative‌‌variable‌‌by‌‌a‌‌qualitative‌‌variable‌ ‌
Heatmap‌ ‌
● A‌‌heatmap‌‌might‌‌be‌‌a‌‌good‌‌choice‌‌here.‌‌A‌‌heatmap‌‌is‌‌especially‌‌useful‌‌when‌‌a‌‌
contingency‌‌table‌‌is‌‌not‌‌practical‌‌due‌‌to‌‌too‌‌many‌‌different‌‌values‌ ‌
Summary‌ ‌
● The‌‌histogram‌‌is‌‌a‌‌graphical‌‌summary‌‌for‌‌quantitative‌‌data‌‌which‌‌shows‌‌the‌‌
percentage‌‌of‌‌subjects‌‌per‌‌class‌‌interval.‌‌The‌‌boxplot‌‌shows‌‌the‌‌middle‌‌50%‌‌of‌‌
the‌‌data‌‌and‌‌it’s‌‌spread.‌‌The‌‌scatterplot‌‌shows‌‌the‌‌relationship‌‌between‌‌two‌‌
variables.‌‌A‌‌heatmap‌‌is‌‌a‌‌‘contingency‌‌table’‌‌for‌‌numerical/continuous‌‌data‌ ‌


Lecture‌‌5‌ ‌
Advantages‌‌of‌‌numerical‌‌summaries‌ ‌
● A numerical summary reduces all the data to one simple number (a "statistic")
○ This loses a lot of information
○ However it allows easy communication and comparisons
● Major‌‌features‌‌that‌‌we‌‌can‌‌summarise‌‌numerically‌‌are:‌ ‌
○ Maximum‌ ‌
○ Minimum‌ ‌
○ Centre‌‌‌[sample‌‌mean,‌‌median]‌ ‌
○ Spread‌‌‌[standard‌‌deviation,‌‌range,‌‌IQR]‌ ‌
Which might be useful for talking about Newtown house prices?
● It‌‌depends‌ ‌
● Reporting‌‌the‌‌centre‌‌without‌‌the‌‌spread‌‌can‌‌be‌‌misleading‌ ‌
Useful‌‌notation‌‌for‌‌data‌‌(Ext)‌ ‌
● In this course, we intentionally focus on statistical concepts in words. This is vital for collaborating with people from different fields. The mathematics is introduced in 2nd year. However, here some simple mathematical notation is helpful.
● Observations of a single variable of size n can be represented by:
○ x_1, x_2, ..., x_n
● The ranked observations (ordered from smallest to largest) are:
○ x_(1), x_(2), ..., x_(n)
● The sum of the observations is:
○ Σ x_i (summed over i = 1, ..., n)
Sample Mean
● The sample mean is the average of the data:
○ Sample Mean = (Sum of data) ÷ (Size of data), or
○ x̄ = (1/n) Σ x_i

Sample‌‌mean‌‌as‌‌a‌‌balancing‌‌point‌
● The sample mean is the unique point at which the data is balanced, i.e. the gaps to the higher readings and the gaps to the lower readings all cancel each other out. For example, when the mean is 1407.143 (thousands):
○ 19‌‌Watkin‌‌St‌‌sold‌‌for‌‌$1950‌‌(thousands)‌ ‌
■ This‌‌gives‌‌a‌‌gap‌‌of‌‌(1950-1407.143)‌ ‌
■ This‌‌is‌‌$542.857‌‌(thousands)‌a ‌ bove‌‌‌the‌‌sample‌‌mean‌‌price‌ ‌


○ 30‌‌Pearl‌‌St‌‌sold‌‌$1250‌‌(thousands)‌ ‌
■ This‌‌gives‌‌a‌‌gap‌‌of‌‌(1250-1407.143)‌ ‌
■ This‌‌is‌‌$157.143‌b ‌ elow‌‌‌the‌‌sample‌‌mean‌‌price‌ ‌
Sample‌‌Median‌ ‌
● The sample median x̃ is the middle data point, when the observations are ordered from smallest to largest
○ For an odd number of observations:
■ Sample Median = the unique middle point = x_((n+1)/2)
○ For an even number of observations:
■ Sample Median = average of the two middle points = ( x_(n/2) + x_(n/2 + 1) ) ÷ 2
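● A small R sketch of the two centre summaries (the price vector, in thousands, is invented for illustration):

    prices <- c(370, 690, 830, 905, 1100, 1250, 1407, 1600, 1950, 3100)
    mean(prices)    # sample mean, pulled up by the expensive properties
    median(prices)  # sample median, robust to the extreme values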
Statistical‌‌Thinking‌ ‌
● If‌‌you‌‌had‌‌to‌‌choose‌‌between‌‌reporting‌‌the‌‌sample‌‌mean‌‌or‌‌sample‌‌median‌‌for‌
Newtown‌‌properties,‌‌which‌‌would‌‌you‌‌choose‌‌and‌‌why?‌ ‌
■ For‌‌the‌‌full‌‌property‌‌portfolio,‌‌the‌‌sample‌‌mean‌‌and‌‌the‌‌sample‌‌
median‌‌are‌‌fairly‌‌similar‌ ‌
■ For‌‌the‌‌4‌‌bedroom‌‌houses,‌‌the‌‌sample‌‌mean‌‌is‌‌higher‌‌than‌‌the‌‌
sample‌‌median‌‌because‌‌it‌‌is‌‌being‌‌“pulled‌‌up”‌‌by‌‌some‌‌very‌‌
expensive‌‌houses‌ ‌
○ For‌‌the‌‌average‌‌buyer,‌‌the‌‌sample‌‌median‌‌would‌‌be‌‌more‌‌useful‌‌as‌‌an‌‌
indication‌‌of‌‌the‌‌sort‌‌of‌‌price‌‌needed‌‌to‌‌get‌‌into‌‌the‌‌market‌ ‌
○ For‌‌any‌‌agent‌‌selling‌‌houses‌‌in‌‌the‌‌area,‌‌the‌‌sample‌‌mean‌‌might‌‌be‌‌more‌‌
useful‌‌in‌‌order‌‌to‌‌predict‌‌their‌‌average‌‌commissions‌ ‌
○ In‌‌practise,‌‌we‌‌can‌‌report‌‌both‌ ‌

Robustness‌ ‌
● The‌‌sample‌‌median‌‌is‌‌said‌‌to‌‌be‌r‌ obust‌‌‌and‌‌is‌‌a‌‌good‌‌summary‌‌for‌‌skewed‌‌data‌‌
as‌‌it‌‌is‌‌not‌‌affected‌‌by‌o‌ utliers‌ ‌
● Suppose‌‌there‌‌was‌‌a‌‌data‌‌entry‌‌mistake,‌‌and‌‌the‌‌lowest‌‌property‌‌recorded‌‌as‌‌
370‌‌was‌‌in‌‌fact‌‌the‌‌highest‌‌sold‌‌at‌‌3700.‌‌How‌‌would‌‌the‌‌sample‌‌mean‌‌change?‌‌
How‌‌would‌‌the‌‌sample‌‌median‌‌change?‌ ‌
○ The‌‌sample‌‌mean‌‌would‌‌be‌‌higher,‌‌as‌‌we‌‌have‌‌replaced‌‌the‌‌smallest‌‌
reading‌‌by‌‌now‌‌maximum‌ ‌
○ The‌‌median‌‌would‌‌shift‌‌up,‌‌from‌‌the‌‌average‌‌x(28) ‌and‌‌x(29) ‌to‌‌the‌‌

average‌‌of‌‌x(29) ‌and‌‌x(30) ‌
Comparing‌‌the‌‌sample‌‌mean‌‌and‌‌the‌‌median‌ ‌


● The difference between the sample mean and the median can be an indication of the shape of the data
○ For symmetric data, we expect the sample mean and sample median to be about the same: x̄ = x̃
○ For left skewed data, we expect the sample mean to be smaller than the sample median: x̄ < x̃
○ For right skewed data, we expect the sample mean to be larger than the sample median: x̄ > x̃
Which‌‌is‌‌optimal‌‌for‌‌describing‌‌the‌‌centre?‌ ‌
● Both have strengths and weaknesses depending on the nature of the data
● Sometimes neither gives a sensible sense of location, for example if the data is bimodal
● As‌‌the‌s ‌ ample‌‌median‌‌is‌‌robust‌,‌‌it‌‌is‌‌preferable‌‌for‌‌data‌‌which‌‌is‌‌skewed‌‌or‌‌has‌‌
many‌‌outliers,‌‌like‌‌Sydney‌‌house‌‌prices‌ ‌
● The sample mean is helpful for data which is basically symmetric, with not too many outliers, and for theoretical analysis
Limitations‌‌of‌‌both?‌ ‌
● Both the sample mean and sample median allow very easy comparisons, and are easily understandable
● However,‌‌they‌‌need‌‌to‌‌be‌‌paired‌‌with‌‌a‌‌measure‌‌of‌‌spread‌ ‌
● Note‌‌in‌‌the‌‌following‌‌example,‌‌the‌‌sample‌‌means‌‌are‌‌the‌‌same,‌‌but‌‌the‌‌data‌‌are‌‌
very‌‌different‌ ‌
Summary‌ ‌
● Both the sample mean and sample median summarise the centre of the data. The sample median is robust, making it a better choice for skewed data or where there are outliers. Both need to be paired with a measure of spread.

Lecture‌‌6‌ ‌
1st attempt: The mean gap
● Mean gap = sample mean(data − sample mean(data))
● Note: It will always be 0
○ From the definition, the mean gap must be 0, as the mean is the balancing point of the gaps
○ Or for those who like algebra, the mean gap is (1/n) Σ (x_i − x̄) = (1/n) Σ x_i − (n x̄)/n = x̄ − x̄ = 0
Better‌‌option:‌‌Standard‌‌deviation‌ ‌


● First, define the Root Mean Square (RMS)
○ The RMS measures the average size of a set of numbers, regardless of the signs
○ The steps are: Square the numbers, then take the Mean of the results, then take the Root of the result
■ RMS(numbers) = sqrt( mean(numbers²) )
○ So effectively, the Square and the Root operations "reverse" each other
● Applying the RMS to the gaps: RMS(gaps) = sqrt( mean(gaps²) ) = sqrt( (1/n) Σ (gap_i)² )
● To avoid the cancellation of the gaps, another possible method is to consider the average of the absolute values of the gaps: (1/n) Σ |gap_i|. However, this is harder to handle algebraically
Standard deviation in terms of RMS
● The standard deviation measures the spread of the data
○ SD_pop = RMS of (gaps from the mean)
○ Formally, SD_pop = sqrt( mean of (gaps from the mean)² ) = sqrt( (1/n) Σ (x_i − x̄)² )
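● A short R sketch of the RMS-of-gaps idea, using base R only (so the population SD is computed directly rather than with popsd() from the multicon package mentioned later):

    x <- c(2, 4, 4, 4, 5, 5, 7, 9)
    gaps <- x - mean(x)
    sqrt(mean(gaps^2))   # RMS of the gaps = population SD
    sd(x)                # note: R's sd() divides by n - 1 (sample SD), so it differs slightly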
How‌‌to‌‌tell‌‌the‌‌difference‌‌when‌‌the‌‌data‌‌is‌‌a‌‌population‌‌or‌‌a‌‌sample?‌ ‌
● It‌‌can‌‌be‌‌tricky‌‌to‌‌work‌‌out‌‌whether‌‌your‌‌data‌‌is‌‌a‌‌population‌‌or‌‌sample‌ ‌
● Look‌‌a‌‌the‌‌information‌‌about‌‌the‌‌data‌‌story‌‌and‌‌the‌‌research‌‌questions‌ ‌
Standard‌‌Units‌‌(“Z‌‌score”)‌ ‌
● Standard units of a data point = how many standard deviations it is below or above the mean
○ Standard units = (data point − mean) ÷ SD
● This means: data point = mean + SD × standard units
IQR‌ ‌
● IQR = Range of the middle 50% of the data
○ More formally IQR = Q3 − Q1, where
■ Q1 is the 25th percentile (1st quartile) and Q3 is the 75th percentile (3rd quartile)
■ The median is the 50th percentile or 2nd quartile: x̃ = Q2
Quantile,‌‌quartile,‌‌percentile‌ ‌
● The‌‌set‌‌of‌‌q-‌quantiles‌‌‌divides‌‌the‌‌data‌‌into‌‌q‌‌equal‌‌sets‌‌(in‌‌terms‌‌of‌‌percentage‌‌
of‌‌data)‌ ‌
● Percentile‌‌‌is‌‌100-quantile‌ ‌
● The‌‌set‌‌of‌q ‌ uartiles‌‌‌divides‌‌the‌‌data‌‌into‌‌4‌‌quarters‌ ‌
● So the range of the middle 50% of properties sold is almost a million dollars
IQR‌‌on‌‌the‌‌boxplot‌ ‌
● The‌‌IQR‌‌is‌‌the‌‌length‌‌of‌‌the‌‌box‌‌in‌‌the‌‌box‌‌plot.‌‌It‌‌represents‌‌the‌‌span‌‌of‌‌the‌‌
middle‌‌50%‌‌of‌‌the‌‌houses‌‌sold‌ ‌
● The lower and upper thresholds are a distance of 1.5 × IQR from the quartiles (by convention)
○ LT = Q1 − 1.5 × IQR
○ UT = Q3 + 1.5 × IQR
● Data‌‌outside‌‌these‌‌thresholds‌‌is‌‌considered‌‌an‌o ‌ utlier‌‌‌(“extreme‌‌reading”)‌ ‌
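● A quick R sketch of the quartiles, IQR and outlier thresholds (the price vector is the same invented example as above):

    prices <- c(370, 690, 830, 905, 1100, 1250, 1407, 1600, 1950, 3100)
    q <- quantile(prices, c(0.25, 0.75))
    iqr <- IQR(prices)
    lower <- q[1] - 1.5 * iqr                 # lower threshold
    upper <- q[2] + 1.5 * iqr                 # upper threshold
    prices[prices < lower | prices > upper]   # readings flagged as outliers
    boxplot(prices)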
Coefficient‌‌of‌‌Variation‌ ‌
● The Coefficient of Variation (CV) combines the mean and standard deviation into one summary: CV = SD ÷ mean
● The CV is used in:
○ Analytical chemistry, to express the precision and repeatability of an assay
○ Engineering and the physical sciences, for quality assurance studies
○ Economics, for describing the volatility of a security

Lecture‌‌7‌ ‌
Normal‌‌Curve:‌‌Origins‌ ‌
● The‌‌normal‌‌curve‌‌was‌‌discovered‌‌around‌‌1720‌‌by‌‌Abraham‌‌de‌‌Moivre,‌‌also‌‌
famous‌‌for‌‌the‌‌beautiful‌‌de‌‌Moivre’s‌‌formula‌ ‌
Why‌‌is‌‌the‌‌Normal‌‌curve‌‌famous?‌ ‌
● The‌‌Normal‌‌curve‌‌approximates‌‌many‌‌natural‌‌phenomena‌ ‌
● The‌‌Normal‌‌curve‌‌can‌‌model‌‌data‌‌caused‌‌by‌‌combining‌‌a‌‌large‌‌number‌‌of‌‌
independent‌‌observations.‌ ‌
General‌‌&‌‌Standard‌‌Normal‌‌curves‌ ‌
● The‌‌Standard‌‌‌Normal‌‌Curve‌‌(‌Z )‌‌has‌‌mean‌‌=‌‌0‌‌and‌‌SD‌‌=‌‌1.‌‌Short:‌‌N(0,‌‌1)‌ ‌
● The‌‌General‌‌‌Normal‌‌Curve‌‌(‌X )‌‌has‌‌any‌‌mean‌‌and‌‌SD.‌‌Caution:‌‌It‌‌is‌‌denoted‌‌by‌‌
N(mean,‌‌SD2 )‌‌ ‌
The‌‌Normal‌‌curve‌‌formula‌ ‌
● It turns out the Normal curve has a simple formula, although you won't need to use it directly
● The formula for the General Normal curve is
f(x) = 1/sqrt(2πσ²) × e^( −(x − μ)² / (2σ²) ) for x ∈ (−∞, ∞)
where μ and σ are the (population) mean and SD respectively
Finding‌‌the‌‌area‌‌under‌‌the‌‌Standard‌‌Normal‌‌curve‌
● Method 1: Integration
○ Mathematically, we could use integration: area = ∫ from −∞ to 0.7 of (1/sqrt(2π)) e^(−y²/2) dy
○ But this does not have a closed form
● Method‌‌2:‌‌Normal‌‌Tables‌ ‌
● Method‌‌3:‌‌Use‌‌R‌ ‌
○ The‌‌pnorm‌‌command‌‌works‌‌out‌‌the‌‌lower‌‌tail‌‌area‌ ‌
○ The‌‌pnorm(x,‌‌lower.tail‌ ‌=‌‌F)‌‌works‌‌out‌‌the‌‌upper‌‌tail‌‌area‌ ‌
Finding‌‌the‌‌area‌‌under‌‌the‌‌Standard‌‌Normal‌‌curve‌
● In‌‌R‌ ‌
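○ For example (a minimal sketch; the value 0.7 matches the area example above):

    pnorm(0.7)                      # lower tail area below z = 0.7, about 0.758
    pnorm(0.7, lower.tail = FALSE)  # upper tail area above z = 0.7
    pnorm(1) - pnorm(-1)            # area within 1 SD of the mean, about 0.68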
Properties‌‌of‌‌the‌‌Normal‌‌curve‌ ‌
● All‌‌Normal‌‌curves‌‌satisfy‌‌the‌‌“68%‌‌-‌‌95%‌‌-‌‌99.7%‌‌Rule”‌ ‌
○ The‌‌area‌‌1‌‌SD‌‌out‌‌from‌‌the‌‌mean‌‌in‌‌both‌‌directions‌‌is‌‌0.68‌‌(68%)‌ ‌
○ The‌‌area‌‌2‌‌SD‌‌out‌‌from‌‌the‌‌mean‌‌in‌‌both‌‌directions‌‌is‌‌0.95‌‌(95%)‌ ‌
○ The‌‌area‌‌3‌‌SD‌‌out‌‌from‌‌the‌‌mean‌‌in‌‌both‌‌directions‌‌is‌‌0.997‌‌(99.7%)‌ ‌
● Any‌‌General‌‌Normal‌‌can‌‌be‌‌rescaled‌‌into‌‌the‌‌Standard‌‌Normal‌ ‌
○ For‌‌any‌‌point‌‌on‌‌a‌‌Normal‌‌curve,‌‌the‌‌standard‌‌units‌‌(or‌‌z‌‌score)‌‌is‌‌how‌‌
many‌‌standard‌‌deviations‌‌that‌‌point‌‌is‌‌above‌‌(+)‌‌or‌‌below‌‌(-)‌‌the‌‌mean‌ ‌
○ standard units = (data point − sample mean) ÷ (sample SD)
● The‌‌Normal‌‌curve‌‌is‌‌symmetric‌ ‌
○ If X follows a normal curve with mean 0, then
■ P(X < −0.5) = P(X > 0.5)
Summary‌ ‌
● The‌‌Normal‌‌curve‌‌naturally‌‌describes‌‌many‌‌histograms,‌‌and‌‌so‌‌can‌‌be‌‌used‌‌in‌‌
modelling‌‌data.‌‌It‌‌has‌‌many‌‌useful‌‌properties,‌‌including‌‌the‌‌68/95/99.7%‌‌rule.‌‌
Any‌‌General‌‌Normal‌‌can‌‌be‌‌rescaled‌‌into‌‌a‌‌Standard‌‌Normal‌ ‌

Lecture‌‌8‌ ‌
Reproducible‌‌Research‌ ‌


● Increasingly,‌‌journals‌‌are‌‌requiring‌‌reproducible‌‌research,‌‌which‌‌requires‌‌“data‌‌
set‌‌and‌‌software‌‌to‌‌be‌‌made‌‌available‌‌for‌‌verifying‌‌published‌‌findings‌‌and‌‌
conducting‌‌alternative‌‌analyses”.‌ ‌
○ A study by Begley and Ellis (2012) found that the findings of 47 out of 53 medical research papers focused on cancer research were irreproducible
○ A follow-up study by Begley (2013) identified "6 flags for suspect work": studies were not performed by investigators blinded to the experimental versus the control arms, there was a failure to repeat experiments, a lack of positive and negative controls, failure to show all data, inappropriate use of statistical tests, and use of reagents that were not appropriately validated
What‌‌can‌‌go‌‌wrong?‌ ‌
● Without‌‌reproducible‌‌research:‌ ‌
○ Data‌‌version‌‌can‌‌change‌‌(eg‌‌people‌‌edit‌‌an‌‌Excel‌‌file‌‌without‌‌
documenting‌‌what‌‌has‌‌changed‌‌and‌‌why);‌ ‌
○ Graphical‌‌summaries‌‌can‌‌change‌‌(eg‌‌people‌‌can‌‌photoshop‌‌images‌‌
without‌‌keeping‌‌record‌‌of‌‌what‌‌changed‌‌and‌‌why)‌ ‌
● Reproducible‌‌research‌‌is‌‌about‌‌being‌‌responsible‌‌with‌‌possible‌‌human‌‌errors,‌‌or‌‌
worse,‌‌detecting‌‌intentionally‌‌changed‌‌results‌‌ ‌

Lecture‌‌9‌ ‌
Bivariate‌‌Data‌ ‌
● Bivariate data involves a pair of variables. We are interested in the relationship between the two variables. Can one variable be used to predict the other?
○ Formally,‌‌we‌‌have‌‌(xi , y i ) ‌for‌‌i = 1, 2, ..., n ‌
○ X ‌is‌‌called‌‌the‌i‌ndependent‌‌‌variable‌‌(or‌‌explanatory‌‌variable,‌‌predictor‌‌
or‌‌regressor)‌ ‌
○ Y ‌is‌‌called‌‌the‌d‌ ependent‌‌‌variable‌‌(or‌‌response‌‌variable).‌ ‌
Scatter‌‌Plot‌ ‌
● A scatter plot is a graphical summary of two quantitative variables on the same 2D plane, resulting in a cloud of points
How‌‌can‌‌we‌‌summarise‌‌a‌‌scatter‌‌plot?‌ ‌
● The‌‌scatter‌‌plot‌‌can‌‌be‌‌summarised‌‌by‌‌the‌‌following‌‌five‌‌‌numerical‌‌summaries‌ ‌
○ Sample‌‌mean‌‌and‌‌sample‌‌SD‌‌of‌‌X (x, SDx ) ‌
○ Sample‌‌mean‌‌and‌‌sample‌‌SD‌‌of‌‌Y (y, SD y ) ‌
○ Correlation‌‌coefficient‌‌(r) ‌
The‌‌Correlation‌‌coefficient‌ ‌


● The‌‌correlation‌‌coefficient‌‌‌r ‌is‌‌a‌‌numerical‌‌summary‌‌that‌‌measures‌‌the‌‌
clustering‌‌around‌‌the‌‌line‌ ‌
● It‌‌indicates‌‌both‌‌the‌‌sign‌‌and‌‌strength‌‌of‌‌the‌‌linear‌‌association‌ ‌
● The‌‌correlation‌‌coefficient‌‌is‌‌between‌‌-1‌‌and‌‌1‌ ‌
○ If‌‌r ‌is‌‌positive:‌‌the‌‌cloud‌‌sloped‌‌up‌ ‌
○ If‌‌r ‌is‌‌negative:‌‌the‌‌cloud‌‌slopes‌‌down‌ ‌
○ As‌r ‌gets‌‌closer‌‌to‌‌± 1 ‌:‌‌the‌‌points‌‌cluster‌‌more‌‌tightly‌‌around‌‌the‌‌line‌ ‌
Why‌‌does‌‌r‌‌measure‌‌association‌ ‌
● It‌‌divides‌‌the‌‌scatter‌‌plot‌‌into‌‌4‌‌quadrants,‌‌at‌‌the‌‌point‌‌of‌‌averages‌‌(centre)‌
○ A‌‌majority‌‌of‌‌points‌‌in‌‌the‌‌upper‌‌right‌‌(+)‌‌and‌‌lower‌‌left‌‌quadrants‌‌(+)‌‌will‌‌
be‌‌indicated‌‌by‌‌a‌‌positive‌‌r‌ ‌
○ A‌‌majority‌‌of‌‌points‌‌in‌‌the‌‌upper‌‌left‌‌(-)‌‌and‌‌the‌‌lower‌‌right‌‌(-)‌‌will‌‌be‌‌
indicated‌‌by‌‌a‌‌negative‌‌r‌ ‌
Symmetry‌ ‌
● The‌‌correlation‌‌coefficient‌‌is‌‌not‌‌affected‌‌by‌‌interchanging‌‌the‌‌variables‌ ‌
Scaling‌ ‌
● The‌‌correlation‌‌coefficient‌‌is‌‌shift‌‌and‌‌scale‌‌invariant‌ ‌
Warning‌ ‌
1. The‌‌correlation‌‌coefficient‌‌is‌‌unitless‌ ‌
○ Mistake‌:‌‌r‌‌=‌‌0.8‌‌means‌‌that‌‌80%‌‌of‌‌the‌‌points‌‌are‌‌tightly‌‌clustered‌‌around‌‌
the‌‌line‌‌or‌‌is‌‌twice‌‌as‌‌clustered‌‌as‌‌r‌‌=‌‌0.4‌ ‌
2. Outliers‌‌can‌‌overly‌‌influence‌‌the‌‌correlation‌‌coefficient‌ ‌
3. Non-linear‌‌association‌‌can’t‌‌be‌‌detected‌‌by‌‌the‌‌correlation‌‌coefficient‌ ‌
4. The‌‌same‌‌correlation‌‌coefficient‌‌can‌‌arise‌‌from‌‌very‌‌different‌‌data‌ ‌
5. Rates or averages tend to inflate the correlation coefficient
○ An‌‌ecological‌‌correlation‌‌‌(or‌‌spatial‌‌correlation)‌‌is‌‌the‌‌correlation‌‌
between‌‌two‌‌variables‌‌that‌‌are‌‌group‌‌means‌‌or‌‌rates‌ ‌
○ For‌‌example,‌‌if‌‌we‌‌recorded‌‌the‌‌heights‌‌of‌‌fathers‌‌and‌‌sons‌‌in‌‌many‌‌
communities‌‌and‌‌then‌‌calculated‌‌the‌‌average‌‌for‌‌each‌‌community‌ ‌
○ Ecological‌‌correlations‌‌tend‌‌to‌‌overestimate‌‌the‌‌strength‌‌of‌‌association‌‌
between‌‌the‌‌two‌‌variables‌ ‌
6. Association‌‌is‌‌not‌‌causation‌ ‌
○ Correlation‌‌measures‌‌association‌ ‌
○ But‌‌as‌‌discussed,‌‌association‌‌does‌‌not‌‌necessarily‌‌mean‌‌causation‌ ‌
○ Both‌‌variables‌‌may‌‌be‌‌simultaneously‌‌influenced‌‌by‌‌a‌‌3rd‌‌variable‌‌
(confounder)‌ ‌
Summary‌ ‌
● The scatter plot is a cloud of points which represents bivariate quantitative data (a pair of variables). Useful summaries are the point of averages (the two sample means), the two sample SDs of the variables and the correlation coefficient. The correlation coefficient is the mean of the product of the variables in standard units and can be found using cor() in R
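● A brief sketch with cor() (the vectors are made up for illustration):

    x <- c(1, 2, 3, 4, 5, 6)
    y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 7.0)
    cor(x, y)   # correlation coefficient r
    # same value from the "mean of products in standard units" definition (population SDs)
    zx <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
    zy <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))
    mean(zx * zy)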

Lecture‌‌10‌ ‌
Regression‌‌Line‌ ‌
1. SD Line (Not great)
○ The SD line might look like a good candidate as it connects the point of averages (x̄, ȳ) to (x̄ + SD_x, ȳ + SD_y) (for this data with positive correlation)
○ However, it does not use the correlation coefficient, so it is insensitive to the amount of clustering around the line
○ Note how it underestimates (LHS) and overestimates (RHS) at the extremes
2. Regression Line
○ To describe the scatter plot, we need to use all five summaries: x̄, ȳ, SD_x, SD_y, r
○ The Regression line connects (x̄, ȳ) to (x̄ + SD_x, ȳ + r·SD_y)
Summary‌‌Regression‌‌Line‌ ‌
● We‌‌can‌‌derive‌‌the‌‌(least-squares)‌‌regression‌‌line‌‌using‌‌calculus,‌‌by‌‌minimizing‌‌
the‌‌squared‌‌residuals‌‌‌(extension)‌ ‌
Predictions‌ ‌
1. Baseline‌‌prediction‌ ‌
○ If‌‌you‌‌don’t‌‌use‌‌x ‌as‌‌an‌‌information‌‌source‌‌at‌‌all,‌‌a‌‌basic‌‌prediction‌‌of‌‌y
would‌‌be‌‌the‌a ‌ verage‌‌‌of‌‌y ‌over‌a
‌ ll‌‌‌the‌‌x ‌values‌‌in‌‌the‌‌data‌ ‌
○ So‌‌for‌‌any‌‌CE‌‌reading,‌‌we‌‌could‌‌predict‌‌the‌‌NW‌‌air‌‌quality‌‌to‌‌be‌‌56.13‌ ‌
2. Prediction‌‌in‌‌a‌‌strip‌ ‌
○ Given‌‌a‌‌certain‌‌value‌‌x0 ‌,‌‌a‌‌more‌‌careful‌‌prediction‌‌of‌‌y ‌‌would‌‌be‌‌the‌‌
average‌‌of‌‌all‌‌the‌‌y ‌in‌‌the‌‌data‌‌corresponding‌‌to‌‌a‌‌neighbourhood‌‌of‌‌x ‌
value‌‌around‌‌x0 ‌.‌ ‌
3. The‌‌Regression‌‌line‌ ‌
○ The‌‌best‌‌prediction‌‌is‌‌based‌‌on‌‌the‌‌Regression‌‌line‌ ‌
○ For‌‌AQI,‌‌we‌‌have‌‌y = 19.8874 + 0.7138x ‌
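○ A minimal R sketch of fitting and using a regression line; the data frame here is invented to stand in for the lecture's air-quality data, keeping the CE and NW variable names:

    aqi <- data.frame(CE = c(30, 45, 50, 62, 70, 85),
                      NW = c(40, 52, 55, 66, 69, 80))
    fit <- lm(NW ~ CE, data = aqi)
    coef(fit)                                    # intercept and slope (lecture example: 19.8874 and 0.7138)
    predict(fit, newdata = data.frame(CE = 60))  # predicted NW air quality for a CE reading of 60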
Residuals‌ ‌


● A‌r‌ esidual‌‌‌is‌‌the‌‌vertical‌‌distance‌‌(or‌‌‘gap’)‌‌of‌‌a‌‌point‌‌above‌‌or‌‌below‌‌the‌‌
regression‌‌line‌ ‌
● A‌‌residual‌‌represents‌‌the‌‌error‌‌between‌‌the‌‌actual‌‌value‌‌and‌‌the‌‌prediction‌ ‌
● More formally, a residual is e_i = y_i − ŷ_i, given the actual value (y_i) and the prediction (ŷ_i)
Residual‌‌plot‌ ‌
● A residual plot graphs the residuals vs x
● If the linear fit is appropriate for the data, it should show no pattern (random scatter about y = 0)
● The residual plot is a diagnostic plot to check the appropriateness of a linear model
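● Continuing the sketch above (same invented aqi data and fit object):

    plot(aqi$CE, resid(fit), xlab = "CE", ylab = "Residual")
    abline(h = 0, lty = 2)   # residuals should scatter randomly about this line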
Vertical Strips
● If the vertical strips on the scatter plot show equal spread in the y direction, then the data is homoscedastic
○ The regression line could be used for predictions
● If the vertical strips don't show equal spread in the y direction, then the data is heteroscedastic
○ The regression line should not be used for predictions
Common‌‌mistakes‌‌when‌‌predicting‌ ‌
1. Extrapolating‌ ‌
○ If‌‌we‌‌make‌‌a‌‌prediction‌‌from‌‌an‌‌x ‌‌value‌‌that‌‌is‌‌not‌‌within‌‌the‌‌range‌‌of‌‌the‌‌
data,‌‌then‌‌that‌‌prediction‌‌can‌‌be‌‌completely‌u‌ nreliable‌ ‌
2. Not‌‌checking‌‌the‌‌scatter‌‌plot‌ ‌
○ We‌‌can‌‌have‌‌a‌‌high‌‌correlation‌‌coefficient‌‌and‌‌then‌‌fit‌‌a‌‌regression‌‌line,‌‌
but‌‌the‌‌data‌‌may‌‌not‌‌even‌‌be‌‌linear‌ ‌
○ So‌‌always‌‌check‌‌the‌‌scatter‌‌plot‌ ‌
3. Not checking the residual plot
○ You should also check the residual plot
○ This detects any pattern that has not been captured by fitting a linear model
○ If the linear model is appropriate, the residual plot should be a random scatter of points (about the horizontal line y = 0)
Summary‌ ‌
● For prediction, the regression line is better than the SD line as it uses all five numerical summaries for the scatter plot
● For the Regression line, the residuals are the gaps between the actual value and the prediction


● The‌‌residual‌‌plot‌‌is‌‌a‌‌diagnostic‌‌for‌‌seeing‌‌whether‌‌a‌‌linear‌‌model‌‌is‌‌appropriate‌‌
-‌‌if‌‌it‌‌is‌‌random,‌‌then‌‌a‌‌linear‌‌model‌‌seems‌‌appropriate‌ ‌
● If the vertical strips on the scatter plot show equal spread in the y-direction, then the data is homoscedastic; otherwise, the data is heteroscedastic

Lecture‌‌11‌ ‌
Probability‌ ‌
● The‌‌frequentist‌‌definition‌‌of‌‌probability‌‌‌(or‌‌chance)‌‌is‌‌the‌‌percentage‌‌of‌‌time‌‌a‌‌
certain‌‌event‌‌is‌‌expected‌‌to‌‌happen‌‌if‌‌the‌‌same‌‌process‌‌is‌‌repeated‌‌long-term‌‌
(infinitely‌‌often)‌ ‌
● This differs from the Bayesian definition of probability, which relates to the degree of belief that an event will occur (extension)
Basic‌‌properties‌‌of‌‌Probability‌ ‌
1. Probabilities‌‌are‌‌between‌‌0%‌‌(impossible)‌‌and‌‌100%‌‌(certain)‌ ‌
○ P(Impossible‌‌event)‌‌=‌‌0‌ ‌
○ P(Certain‌‌event)‌‌=‌‌1‌ ‌
2. The probability of something equals 100% minus the probability of its opposite (complement)
○ P(Event) = 1 − P(Complement event)
Conditional‌‌probability‌ ‌
● Conditional‌‌probability‌‌‌is‌‌the‌‌chance‌‌that‌‌a‌‌certain‌‌event‌‌(1)‌‌occurs,‌g ‌ iven‌‌
another‌‌event‌‌(2)‌‌has‌‌occurred‌ ‌
○ P(Event‌‌1|Event‌‌2)‌ ‌
Multiplication‌‌Rule‌ ‌
● The‌‌probability‌‌that‌‌two‌‌events‌‌occur‌‌is‌‌the‌‌chance‌‌of‌‌the‌‌1st‌‌event‌m ‌ ultiplied‌‌‌by‌‌
that‌‌chance‌‌of‌‌2nd‌‌event,‌‌given‌‌the‌‌1st‌‌has‌‌occurred‌ ‌
○ P(Event‌‌1‌‌and‌‌Event‌‌2)‌‌=‌‌P(event‌‌1)‌‌✕‌‌P(Event‌‌2|Event‌‌1)‌ ‌
Addition‌‌Rule‌ ‌
● The‌‌probability‌‌at‌‌least‌‌one‌‌of‌‌two‌‌events‌‌occurs‌‌is‌‌the‌‌chance‌‌of‌‌the‌‌1st‌‌event‌‌
plus‌‌‌the‌‌chance‌‌of‌‌2nd‌‌event‌m ‌ inus‌‌‌the‌‌probability‌‌that‌‌both‌‌events‌‌occur‌ ‌
○ P(Event‌‌1‌‌or‌‌Event‌‌2)‌‌=‌‌P(Event‌‌1)‌‌+‌‌P(Event‌‌2)‌‌-‌‌P(Event‌‌1‌‌and‌‌Event‌‌2)‌ ‌
Mutually‌‌exclusive‌ ‌
● Two‌‌events‌‌are‌m ‌ utually‌‌exclusive‌‌‌when‌‌the‌‌occurrence‌‌of‌‌one‌‌event‌‌prevents‌‌
the‌‌other‌ ‌
Independence‌ ‌
● Two‌‌events‌‌are‌i‌ndependent‌‌‌if‌‌the‌‌chance‌‌of‌‌1st‌‌given‌‌the‌‌2nd‌‌is‌‌the‌‌same‌‌as‌‌
the‌‌1st,‌‌ie.‌‌P(Event‌‌1|Event‌‌2)‌‌=‌‌P(Event‌‌1)‌ ‌
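● A small worked example (two fair dice, not from the lecture): P(first die shows 6 and second die shows 6) = 1/6 × 1/6 = 1/36 by the multiplication rule, since the rolls are independent; P(at least one die shows 6) = 1/6 + 1/6 − 1/36 = 11/36 by the addition rule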
The‌‌Prosecutor’s‌‌fallacy‌ ‌


● The‌‌prosecutor’s‌‌fallacy‌‌‌is‌‌a‌‌mistake‌‌in‌‌statistical‌‌thinking,‌‌whereby‌‌it‌‌is‌‌
assumed‌‌that‌‌the‌‌probability‌‌of‌‌a‌‌random‌‌match‌‌is‌‌equal‌‌to‌‌the‌‌probability‌‌that‌‌
the‌‌defendant‌‌is‌‌innocent‌ ‌
○ It‌‌has‌‌been‌‌used‌‌by‌‌the‌‌prosecution‌‌to‌‌argue‌‌for‌‌the‌‌guilt‌‌of‌‌a‌‌defendant‌‌
during‌‌famous‌‌criminal‌‌trials‌ ‌
○ It‌‌can‌‌also‌‌be‌‌used‌‌by‌‌defense‌‌lawyers‌‌to‌‌argue‌‌for‌‌the‌‌innocence‌‌of‌‌their‌‌
client‌ ‌
Summary‌ ‌
● Addition‌‌Rule‌ ‌
○ Two‌‌events‌‌are‌‌mutually‌‌exclusive‌‌when‌‌the‌‌occurrence‌‌of‌‌one‌‌event‌‌
prevents‌‌the‌‌other‌ ‌
○ If‌‌two‌‌events‌‌are‌‌mutually‌‌exclusive‌‌then‌‌the‌‌chance‌‌of‌a ‌ t‌‌least‌‌one‌‌event‌‌
occurring‌‌is‌‌the‌s
‌ um‌‌‌of‌‌the‌‌individual‌‌chances‌ ‌
● Multiplication‌‌Rule‌ ‌
○ Two‌‌events‌‌are‌‌independent‌‌if‌‌the‌‌occurrence‌‌of‌‌the‌‌first‌‌event‌‌does‌‌not‌‌
change‌‌the‌‌chance‌‌of‌‌the‌‌second‌‌event‌ ‌
○ If‌‌the‌‌two‌‌events‌‌are‌‌independent‌‌then‌‌the‌‌chance‌‌of‌b ‌ oth‌‌‌events‌‌
occurring‌‌is‌‌the‌m
‌ ultiplication‌‌‌of‌‌the‌‌individual‌‌chances‌ ‌

Lecture‌‌12‌ ‌
Counting‌‌and‌‌drawing‌‌trees‌‌(The‌‌old‌‌way)‌ ‌
● For‌‌simple‌‌chance‌‌problems,‌‌a‌‌good‌‌way‌‌to‌‌start‌‌is:‌ ‌
a. Method‌‌1:‌‌Write‌‌a‌‌full‌‌list‌‌of‌‌outcomes‌‌and‌‌count‌‌the‌‌outcomes‌‌of‌‌interest‌
■ Write‌‌a‌‌list‌‌of‌‌all‌‌outcomes‌ ‌
■ Count‌‌which‌‌outcomes‌‌belong‌‌to‌‌the‌‌event‌‌of‌‌interest‌ ‌
b. Method‌‌2:‌‌Summarise‌‌in‌‌a‌‌tree‌‌diagram‌ ‌
■ Draw‌‌a‌‌tree‌ ‌
Running‌‌a‌‌simulation‌‌(The‌‌new‌‌way)‌ ‌
1. Method 3: Simulate
○ Use R to simulate throwing the dice x times and record the findings (see the sketch below)
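● A minimal simulation sketch in R (the number of rolls is arbitrary):

    set.seed(1)
    rolls <- sample(1:6, size = 10000, replace = TRUE)  # simulate 10,000 die throws
    mean(rolls == 6)       # proportion of sixes, close to 1/6
    table(rolls) / 10000   # simulated probability of each face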
Summary‌ ‌
● Counting‌‌outcomes‌‌or‌‌drawing‌‌a‌‌tree‌‌to‌‌derive‌‌probabilities‌‌of‌‌outcomes‌‌can‌‌
quickly‌‌become‌‌tedious.‌‌One‌‌solution‌‌is‌‌to‌‌use‌‌simulations‌ ‌

Lecture‌‌13‌ ‌
Chance‌‌error‌ ‌
● Every‌‌time‌‌you‌‌toss‌‌a‌‌fair‌‌coin,‌‌there‌‌is‌‌chance‌‌variability‌ ‌


○ Number‌‌of‌‌heads‌‌(observed‌‌value)‌‌=‌‌half‌‌the‌‌number‌‌of‌‌tosses‌‌(expected‌‌
value)‌‌+‌‌chance‌‌error‌ ‌
Law‌‌of‌‌Averages‌ ‌
● The‌‌Law‌‌of‌‌Averages‌‌‌states‌‌that‌‌the‌p ‌ roportion‌‌‌of‌‌heads‌‌becomes‌‌more‌‌stable‌‌
as‌‌the‌‌length‌‌of‌‌the‌‌simulation‌‌increases‌‌and‌‌approaches‌‌a‌‌fixed‌‌number‌‌called‌‌
the‌r‌ elative‌‌frequency‌ ‌
● The‌‌chance‌‌error‌‌in‌‌the‌‌number‌‌of‌‌heads‌‌is‌‌likely‌‌to‌‌be‌l‌arge‌‌‌in‌‌absolute‌‌size,‌‌but‌‌
small‌‌‌relative‌‌to‌‌the‌‌number‌‌of‌‌tosses.‌ ‌
Important‌‌Facts‌ ‌
● For‌‌a‌‌fair‌‌coin:‌ ‌
○ Even‌‌if‌‌we‌‌observe‌‌100‌‌heads‌‌in‌‌a‌‌row,‌‌still‌‌P(Tail)‌‌=‌‌0.5.‌‌
Misunderstanding‌‌this‌‌leads‌‌to‌‌the‌‌Gambler’s‌‌Fallacy‌ ‌
○ As‌‌the‌‌number‌‌of‌‌tosses‌i‌ncreases‌ ‌
■ The‌‌absolute‌‌size‌‌of‌‌the‌‌chance‌‌error‌i‌ncreases‌ ‌
■ The percentage (i.e. relative) size of the chance error decreases
■ The‌‌proportion‌‌of‌‌the‌‌event‌‌will‌‌converge‌‌‌to‌‌the‌‌theoretical‌‌or‌‌
expected‌‌proportion‌ ‌
Summary‌ ‌
● For‌‌independent‌‌events,‌‌it‌‌is‌‌a‌‌mistake‌‌to‌‌assume‌‌that‌‌the‌‌chance‌‌of‌‌observing‌‌a‌‌
particular‌‌event‌‌changes‌‌over‌‌time,‌‌even‌‌if‌‌the‌‌event‌‌has‌‌not‌‌occurred‌‌for‌‌a‌‌long‌‌
time.‌‌This‌‌is‌‌the‌‌Gambler’s‌‌fallacy‌‌and‌‌downfall‌ ‌
● Rather‌‌The‌‌Law‌‌of‌‌Large‌‌Numbers‌‌states‌‌that‌‌the‌o ‌ bserved‌‌proportion‌‌‌of‌‌
occurrences‌‌of‌‌the‌‌event,‌‌in‌‌the‌‌long‌‌run,‌‌approaches‌‌the‌e ‌ xpected‌‌proportion‌ ‌

Lecture‌‌14‌ ‌
Box‌‌model‌ ‌
● The‌‌box‌‌model‌‌‌is‌‌a‌‌simple‌‌way‌‌to‌‌describe‌‌many‌‌chance‌‌processes‌ ‌
● The‌‌box‌‌represents‌‌the‌‌population‌,‌‌containing‌‌different‌‌types‌‌of‌t‌ ickets‌ ‌
● We‌‌need‌‌to‌‌know:‌ ‌
○ The‌‌number‌‌‌or‌p ‌ roportion‌‌‌of‌‌each‌‌kind‌‌of‌‌ticket‌‌in‌‌the‌‌box‌ ‌
○ The‌‌number‌‌of‌d ‌ raws‌‌‌from‌‌the‌‌box‌ ‌
○ For‌‌now,‌‌we‌‌only‌‌consider‌‌drawing‌‌with‌‌replacement‌ ‌
Modelling‌‌the‌‌Sum‌‌of‌‌a‌‌sample‌ ‌
● For‌‌the‌S‌ um‌‌‌of‌‌random‌‌draws‌‌from‌‌a‌‌box‌‌model‌‌with‌‌replacement,‌‌ ‌
○ observed‌‌value‌‌=‌‌expected‌‌value‌‌+‌‌chance‌‌error‌ ‌
■ Expected‌‌value‌‌(EV)‌‌=‌‌number‌‌of‌‌draws‌‌×‌‌mean‌‌of‌‌the‌‌box‌ ‌


■ Standard error (SE) = sqrt(number of draws) × SD of the box
■ SE‌‌is‌‌the‌‌expected‌‌magnitude‌‌of‌‌the‌‌chance‌‌error.‌ ‌


How‌‌to‌‌calculate‌‌the‌‌SD‌‌of‌‌the‌‌box‌ ‌
● As‌‌the‌b ‌ ox‌‌‌represents‌‌the‌‌population,‌‌the‌S
‌ D‌‌of‌‌the‌‌box‌‌‌is‌‌the‌p
‌ opulation‌‌SD‌ ‌
● We‌‌could‌‌call‌‌it‌‌SD pop ‌,‌‌but‌‌in‌‌this‌‌context,‌‌we‌‌will‌‌simply‌‌use‌‌SD‌ ‌
● 3‌‌ways‌‌to‌‌calculate‌‌the‌‌SD‌‌of‌‌the‌‌box‌ ‌
○ Formula:‌‌RMS(gaps)‌‌=‌‌Root‌‌of‌‌the‌‌Mean‌‌of‌‌the‌‌Squared‌‌gaps‌ ‌
○ R:‌‌popsd()‌‌with‌‌package‌‌multicon‌ ‌
○ Short‌‌cut‌‌(for‌‌simply‌‌binary‌‌(two‌‌tickets)‌‌boxes)‌ ‌
■ If a box only contains 2 different numbers ("big" and "small"), then
● SD = (big − small) × sqrt(proportion of big × proportion of small)
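● A small R sketch of the box model for the sum of draws (a 0–1 box with an invented proportion):

    box <- c(1, 0, 0, 0)                     # one "1" ticket and three "0" tickets
    n_draws <- 100
    mean_box <- mean(box)
    sd_box <- (1 - 0) * sqrt(0.25 * 0.75)    # shortcut formula for a binary box
    ev <- n_draws * mean_box                 # expected value of the sum
    se <- sqrt(n_draws) * sd_box             # standard error of the sum
    c(EV = ev, SE = se)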
How‌‌does‌‌chance‌‌error‌‌relate‌‌to‌‌standard‌‌error‌ ‌
● An‌‌observed‌‌value‌‌is‌‌likely‌‌to‌‌be‌‌around‌‌its‌‌expected‌‌value,‌‌with‌‌a‌c ‌ hance‌‌error‌‌
similar‌‌to‌‌SE‌ ‌
● Observed‌‌values‌‌usually‌‌lie‌‌within‌‌2‌‌SEs‌‌away‌‌from‌‌the‌‌expected‌‌value‌ ‌
Modelling‌‌the‌‌Mean‌‌of‌‌the‌‌Sample‌ ‌
● As‌‌the‌M ‌ ean‌‌‌of‌‌the‌‌sample‌‌is‌‌just‌‌the‌S
‌ um‌‌‌of‌‌the‌‌sample‌‌divided‌‌by‌‌the‌‌number‌‌
of‌‌the‌‌draws,‌‌we‌‌get‌‌an‌‌equivalent‌‌result‌‌as‌‌follows‌ ‌
● For‌‌the‌M ‌ ean‌‌‌of‌‌the‌‌random‌‌draws‌‌from‌‌a‌‌box‌‌model‌‌with‌‌replacement‌ ‌
○ observed‌‌value‌‌=‌‌expected‌‌value‌‌+‌‌chance‌‌error‌ ‌
■ Expected‌‌value‌‌(EV)‌‌=‌‌mean‌‌of‌‌the‌‌box‌ ‌
■ Standard error (SE) = SD of the box ÷ sqrt(number of draws)
Comparison‌ ‌
● Notice that there are two sets of formulas, depending on whether we are modelling the sum or mean of a sample
● The research question will dictate whether the sum or mean of a sample is more appropriate
● Given‌‌the‌‌mean‌‌and‌‌SD‌‌of‌‌the‌‌population‌ ‌
○ Sum of the Sample
■ Expected value (EV) = n × mean
■ Standard error (SE) = sqrt(n) × SD
○ Mean of the Sample
■ Expected value (EV) = mean
■ Standard error (SE) = SD ÷ sqrt(n)
● Notice‌‌that‌‌as‌‌the‌‌sample‌‌size‌‌(n)‌‌increases,‌‌the‌‌SE‌‌for‌‌the‌‌sum‌‌increases,‌‌but‌‌
the‌‌SE‌‌for‌‌the‌‌mean‌‌decreases‌ ‌
Summary‌ ‌


● The box model is a simple chance process involving drawing tickets from a fixed box (population).
● We can describe the behaviour of the Sum and the Mean of the sample in terms of the expected value (EV) and the standard error (SE), and compare them to the observed value (OV)
● We can find SD_box by using the shortcut formula or popsd()
● Given the mean and SD of the population
○ When there is one desired outcome: make the desired tickets a "1" and all other tickets "0"

Lecture‌‌15‌ ‌
The‌‌Central‌‌Limit‌‌Theorem‌ ‌
● If‌‌draws‌‌are‌‌independent‌‌and‌‌random‌‌with‌‌replacement‌‌and‌‌the‌‌sample‌‌size‌‌for‌‌
the‌‌sum‌‌(or‌‌average)‌‌is‌‌sufficiently‌‌large,‌‌then‌
○ The‌‌distribution‌‌‌for‌‌the‌‌sum‌‌(or‌‌average)‌‌will‌‌closely‌‌follow‌‌the‌n ‌ ormal‌‌
curve‌,‌‌even‌‌if‌‌the‌‌contents‌‌of‌‌the‌‌box‌‌do‌‌not‌ ‌
● “The‌‌Normal‌‌curve‌‌becomes‌‌a‌‌good‌‌model‌‌for‌‌the‌‌chance‌‌error‌‌of‌‌a‌‌sum‌‌(or‌‌
average)‌‌in‌‌sufficiently‌‌large‌‌samples”‌ ‌
● “As‌‌the‌‌sample‌‌size‌‌increases,‌‌the‌‌distribution‌‌for‌‌a‌‌sum‌‌(or‌‌average)‌‌tends‌‌
towards‌‌the‌‌Normal‌‌distribution”‌ ‌
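● A small simulation sketch of the CLT in R (box contents and sample size chosen arbitrarily):

    set.seed(1)
    box <- c(0, 0, 0, 1)   # a skewed box: 25% ones
    sums <- replicate(5000, sum(sample(box, 100, replace = TRUE)))
    hist(sums, freq = FALSE)   # the histogram of simulated sums looks close to a Normal curve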
Conditions‌‌for‌‌the‌‌CLT‌ ‌
● The‌‌draws‌‌must‌‌be‌‌random‌‌and‌‌independent‌‌from‌‌a‌‌fixed‌‌population‌ ‌
● The‌‌number‌‌of‌‌draws‌‌must‌‌be‌‌reasonably‌‌large‌‌‌(especially‌‌if‌‌the‌‌histogram‌‌of‌‌the‌‌
box‌‌differs‌‌from‌‌the‌‌normal‌‌curve)‌ ‌
● How‌‌large?‌‌This‌‌depends‌‌on‌‌the‌‌shape‌‌of‌‌the‌‌histogram‌ ‌
● A‌‌common‌‌convention‌‌is‌‌the‌‌number‌‌of‌‌draws‌‌larger‌‌than‌‌30‌‌(assuming‌‌a‌‌
basically‌‌symmetric‌‌distribution‌‌with‌‌no‌‌obvious‌‌outliers)‌ ‌
○ However,‌‌this‌‌is‌‌NOT‌‌‌a‌‌rule‌ ‌
Summary‌ ‌
● The Central Limit Theorem states that for repeated simulations of a chance process resulting in a sum or average, the simulation histogram of the observed values converges to the Normal distribution

Lecture‌‌16‌ ‌
Parameter‌‌&‌‌Estimate‌
● A‌p
‌ arameter‌‌‌is‌‌a‌‌numerical‌‌fact‌‌about‌‌the‌‌population‌‌which‌‌we‌‌are‌‌interested‌‌in.‌‌
For‌‌example‌‌the‌‌population‌‌mean‌‌μ‌‌or‌‌population‌‌standard‌‌deviation‌‌σ‌ ‌


● An estimate (or statistic) is a calculation from sample values which best predicts the parameter. For example the sample mean μ̂ (sometimes also denoted x̄) or the sample standard deviation σ̂
● Observed value (OV) = expected value (EV) + chance error
● In the Sample Mean case:
○ μ̂ = μ + chance error
■ The chance error is random by nature (noise). We can quantify the chance error by estimating the spread (= expected magnitude) of the chance error. This spread is called the standard error (SE) and it is the standard deviation of the chance error. It is often denoted by σ (as well)
■ Note that we now have two different σs. We can call the population SD σ_pop, and the standard error (= standard deviation of the chance error) remains denoted by σ. Note that σ_pop = SD(Box)
■ The central limit theorem tells us that for a sample mean, the chance error behaves approximately like N(0, σ²), with σ = (1 ÷ sqrt(sample size)) × SD(Box), and in practice we can estimate σ by σ̂ = (1 ÷ sqrt(sample size)) × SD(Sample). In particular SE = SD(chance error) = σ ≈ σ̂
● Mean and standard deviation describe a set of data. They are numerical summaries. Expected value and standard error describe the sum/mean of a random sample. The standard error is the standard deviation of the chance error
The‌‌Correction‌‌factor‌ ‌
● When sampling with replacement, the SE is determined by the absolute sample size
● When sampling without replacement, the SE will be decreased by increasing the ratio of sample size to population size, as when a higher proportion of the population is sampled, the variability will decrease
● When the sample is only a small part of the population, the size of the population has almost no effect on the SE of the estimate
● SE(without replacement) = correction factor × SE(with replacement), where
correction factor = sqrt( (population size − sample size) ÷ (population size − 1) )
Summary
● If we draw without replacement, then strictly the SE should be adjusted by the correction factor
○ correction factor = sqrt( (population size − sample size) ÷ (population size − 1) )
● However, when the population size is large compared to the sample size, the correction factor is almost 1
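● A one-line R sketch of the correction factor (the population and sample sizes are made up):

    correction_factor <- function(N, n) sqrt((N - n) / (N - 1))
    correction_factor(N = 10000, n = 100)  # close to 1 when the sample is a small part of the population
    correction_factor(N = 200, n = 100)    # noticeably below 1 when half the population is sampled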

Lecture‌‌17‌ ‌
Population‌‌vs‌‌Sample‌ ‌
● A‌s ‌ ample‌‌‌is‌‌a‌‌part‌‌of‌‌the‌p
‌ opulation‌ ‌
Limitation‌‌of‌‌a‌‌census‌ ‌
● Collecting‌‌every‌‌‌unit‌‌of‌‌a‌‌population:‌ ‌
○ Is‌‌hard‌ ‌
○ Takes‌‌lots‌‌of‌‌time‌ ‌
○ Costs‌‌a‌‌lot‌‌of‌‌money‌ ‌
○ Requires‌‌lots‌‌of‌‌resources‌ ‌
Finding‌‌the‌‌best‌‌estimate‌‌of‌‌the‌‌parameter‌ ‌
● Much‌‌Statistical‌‌theory‌‌is‌‌concerned‌‌with‌‌how‌‌to‌‌find‌‌the‌‌best‌‌estimate‌‌of‌‌a‌‌
parameter‌ ‌
● 2‌‌critical‌‌issues‌‌are:‌ ‌
○ How‌‌was‌‌the‌‌sample‌‌chosen?‌‌Is‌‌it‌‌representative‌‌of‌‌the‌‌population?‌ ‌
○ What‌‌estimate‌‌is‌‌closest‌‌to‌‌the‌‌parameter?‌ ‌
Examples‌‌of‌‌how‌‌bias‌‌can‌‌occur‌ ‌
● If there is a systematic tendency to exclude or include certain types of people in the sample
○ E.g.‌‌Convenience‌‌sampling‌‌(or‌‌“grab‌‌sampling”):‌‌A‌‌non-probability‌‌
sampling‌‌technique‌‌where‌‌subjects‌‌are‌‌selected‌‌because‌‌of‌‌their‌‌
convenient‌‌accessibility.‌‌It‌‌is‌‌definitely‌‌not‌‌recommended,‌‌except‌‌possibly‌‌
to‌‌test‌‌a‌‌survey‌‌(pilot)‌ ‌
● If‌‌some‌‌participants‌‌fail‌‌to‌‌complete‌‌surveys‌ ‌
○ What‌‌was‌‌the‌‌response‌‌rate?‌ ‌
○ Non-respondents‌‌can‌‌be‌‌very‌‌different‌‌to‌‌respondents‌ ‌
● If‌‌characteristics‌‌of‌‌the‌‌interview‌‌have‌‌an‌‌effect‌‌on‌‌the‌‌answer‌‌given‌‌by‌‌
participants‌ ‌
● If‌‌the‌‌form‌‌of‌‌the‌‌question‌‌in‌‌the‌‌survey‌‌affects‌‌the‌‌response‌‌to‌‌the‌‌question‌ ‌
● Because‌‌people‌‌may‌‌forget‌‌details‌ ‌
● Because‌‌of‌‌sensitive‌‌questions:‌‌people‌‌may‌‌not‌‌tell‌‌the‌‌truth‌ ‌
● Because‌‌of‌‌lack‌‌of‌‌clarity‌‌in‌‌the‌‌question‌ ‌
● Because‌‌attributes‌‌of‌‌the‌‌interview‌‌process‌‌may‌‌cause‌‌bias‌ ‌


Warning‌‌about‌‌bias‌‌and‌‌sample‌‌size‌ ‌
● When‌‌a‌‌selection‌‌process‌‌is‌‌biased,‌‌taking‌‌a‌‌larger‌‌sample‌‌‌does‌‌not‌‌reduce‌‌
bias,‌‌rather‌‌it‌‌can‌‌amplify‌‌‌the‌‌bias.‌‌It‌‌repeats‌‌the‌‌mistake‌‌on‌‌a‌‌larger‌‌scale‌ ‌
● In‌‌the‌‌famous‌‌1936‌‌US‌‌elections,‌‌the‌‌Literary‌‌digest‌‌‌magazine‌‌predicted‌‌an‌‌
overwhelming‌‌victory‌‌for‌‌Alfred‌‌Landon‌‌over‌‌Franklin‌‌Roosevelt,‌‌based‌‌on‌‌a‌‌poll‌‌
of‌‌2.4‌‌million‌‌people.‌‌However,‌‌Roosevelt‌‌won‌‌62%‌‌to‌‌38%.‌‌The‌‌Digest‌‌went‌‌
bankrupt‌‌soon‌‌after‌ ‌
● The‌‌problem‌‌was‌‌that‌‌their‌‌sampling‌‌procedure‌‌involved‌‌mailing‌‌questionnaires‌‌
to‌‌10‌‌million‌‌people,‌‌with‌‌names‌‌and‌‌addresses‌‌from‌‌sources‌‌that‌‌were‌‌biased‌‌
against‌‌the‌‌poor‌ ‌
How‌‌to‌‌pick‌‌a‌‌good‌‌sample?‌ ‌
● A‌‌sampling‌‌procedure‌‌should‌‌give‌‌a‌‌representative‌‌cross‌‌section‌‌of‌‌the‌‌
population‌ ‌
● We‌‌use‌‌a‌p ‌ robability‌‌method‌‌‌to‌‌pick‌‌the‌‌sample,‌‌so‌‌that‌ ‌
○ The‌‌interviewer‌‌is‌‌not‌‌involved‌‌in‌‌the‌‌selection.‌‌The‌‌method‌‌of‌‌selection‌‌is‌‌
impartial‌ ‌
○ The‌‌interviewer‌‌can‌‌compute‌‌the‌‌chance‌‌of‌‌any‌‌particular‌‌individuals‌‌
being‌‌chosen‌‌i.e.‌‌There‌‌is‌‌a‌‌defined‌‌procedure‌‌for‌‌selecting‌‌the‌‌sample,‌‌
which‌‌uses‌‌chance.‌‌It‌‌is‌‌objective.‌ ‌
● For‌‌example,‌‌Simple‌‌random‌‌sampling‌‌‌involves‌‌drawing‌‌at‌‌random‌‌without‌‌
replacement‌ ‌
● Multi-stage‌‌cluster‌‌sampling‌ ‌
○ As‌‌simple‌‌random‌‌sampling‌‌is‌‌often‌‌not‌‌practical,‌‌organisations‌‌may‌‌use‌‌
multi-stage‌‌cluster‌‌sampling.‌‌This‌‌is‌‌a‌‌probability‌‌sampling‌‌technique‌‌
which‌‌takes‌‌samples‌‌in‌‌stages,‌‌and‌‌individuals‌‌or‌‌clusters‌‌are‌‌chosen‌‌at‌‌
random‌‌at‌‌each‌‌stage.‌ ‌
Unavoidable‌‌Bias‌ ‌
● Even‌‌with‌‌a‌‌probability‌‌method‌‌determining‌‌the‌‌sample,‌‌bias‌‌can‌‌easily‌‌come‌‌in‌ ‌
● In‌‌addition,‌‌because‌‌the‌‌sample‌‌is‌‌only‌‌part‌‌of‌‌the‌‌population,‌‌we‌‌always‌‌have‌‌
chance‌‌error‌ ‌
○ Parameter‌‌estimate‌‌=‌‌true‌‌parameter‌‌+‌‌bias‌‌+‌‌chance‌‌error‌ ‌
Summary‌ ‌
● Unless‌‌a‌‌census‌‌is‌‌possible,‌‌information‌‌about‌‌a‌‌population‌‌comes‌‌from‌‌an‌‌
estimate‌‌from‌‌a‌‌sample.‌‌The‌‌reliability‌‌of‌‌such‌‌an‌‌estimate‌‌depends‌‌on‌‌how‌‌the‌‌
sample‌‌was‌‌chosen.‌‌Hence,‌‌we‌‌usually‌‌have:‌ ‌
○ Observed‌‌value‌‌=‌‌true‌‌parameter‌‌+‌‌bias‌‌+‌‌chance‌‌error;‌‌or‌ ‌
○ Parameter‌‌estimate‌‌=‌‌true‌‌parameter‌‌+‌‌bias‌‌+‌‌chance‌‌error‌ ‌


Lecture‌‌18‌ ‌
Confidence‌‌Interval‌ ‌
● A‌‌confidence‌‌interval‌‌quantifies‌‌the‌‌uncertainty‌‌of‌‌our‌‌estimates.‌ ‌
● A‌q ‌ %‌‌‌confidence‌‌interval‌‌covers‌‌the‌‌true‌‌parameter‌‌with‌q ‌ %‌‌‌probability.‌‌More‌‌
precisely,‌‌if‌‌you‌‌calculated‌‌intervals‌‌for‌‌many‌‌samples‌‌under‌‌the‌‌same‌‌setting,‌‌
q%‌‌‌of‌‌them‌‌would‌‌cover‌‌the‌‌true‌‌parameter‌ ‌
● If the chance error follows a symmetric distribution, then a q% confidence interval is given by:
○ Observed Value ± the (1 − (1 − q)/2)-th percentile of the chance error
○ For the 95% confidence interval we thus have
■ [OV − 97.5th percentile (CE), OV + 97.5th percentile (CE)]
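● A minimal sketch of a 95% confidence interval for a mean in R (the data are invented; the Normal 97.5th percentile is used for the chance error):

    x <- c(12, 15, 9, 11, 14, 13, 10, 16, 12, 13)
    se <- sd(x) / sqrt(length(x))            # estimated standard error of the sample mean
    mean(x) + c(-1, 1) * qnorm(0.975) * se   # approximate 95% confidence interval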
Hypothesis‌‌Testing‌ ‌
● In‌H ‌ ypothesis‌‌Testing‌,‌‌we‌‌start‌‌with‌‌a‌h
‌ ypothesis‌‌‌about‌‌our‌‌population.‌‌For‌‌
example:‌ ‌
○ “The‌‌coin‌‌is‌‌fair‌‌(so‌‌the‌‌population‌‌mean‌‌is‌‌0.5)”‌ ‌
● We‌‌then‌‌calculate‌‌the‌c‌ hance‌‌error‌‌‌and‌d ‌ ecide‌‌‌whether:‌ ‌
○ The‌‌chance‌‌error‌‌fell‌‌within‌‌an‌‌interval‌‌to‌‌be‌‌expected‌‌→‌‌Our‌‌data‌‌is‌‌
consistent‌‌with‌‌the‌‌hypothesis‌ ‌
○ The‌‌chance‌‌error‌‌was‌‌extremely‌‌big‌‌→‌‌Either‌‌we‌‌observed‌‌a‌‌very‌‌rare‌‌
event‌‌or‌‌our‌‌hypothesis‌‌is‌‌wrong‌ ‌
3‌‌Main‌‌Steps‌ ‌
● Set‌‌up‌‌research‌‌question‌ ‌
○ Hypothesis‌‌H 0 vs H 1 ‌
● Weigh‌‌up‌‌evidence‌ ‌
○ Assumptions‌ ‌
○ Test‌‌Statistic‌ ‌
○ P-value‌ ‌
● Explain‌‌conclusion‌ ‌
○ Conclusion‌ ‌
Why‌‌hypothesis‌‌testing?‌ ‌
● To‌‌make‌‌evidence‌‌based‌‌decisions,‌‌we‌‌need‌‌to‌‌weigh‌‌up‌‌‌evidence,‌ ‌
● Hypothesis‌‌Testing‌‌is‌‌a‌‌scientific‌‌method‌‌for‌‌weighing‌‌up‌‌the‌‌evidence‌‌given‌‌in‌‌
the‌‌data‌‌against‌‌a‌‌given‌‌hypothesis‌‌(model)‌ ‌
○ We‌‌say‌‌that‌‌the‌‌data‌‌is‌‌not‌‌consistent‌‌with‌‌the‌‌hypothesis‌‌if‌‌the‌‌difference‌‌
between‌‌the‌‌observed‌‌value‌‌(in‌‌our‌‌case‌‌sample‌‌mean‌‌or‌‌sample‌‌sum)‌‌
and‌‌the‌‌expected‌‌value‌‌(assuming‌‌the‌‌hypothesis)‌‌is‌‌too‌‌big‌ ‌
○ Alternative‌‌formulation:‌‌If‌‌the‌‌chance‌‌error‌‌is‌‌too‌‌big‌‌we‌‌should‌‌consider‌‌to‌‌
reject‌‌the‌‌hypothesis‌ ‌


Summary‌ ‌
● A confidence interval quantifies the uncertainty of our estimates. A q% confidence interval covers the true parameter with q% probability
● Hypothesis testing is a scientific method for weighing up the evidence in the data against a given hypothesis (model)

Lecture‌‌19‌ ‌
The‌‌Z‌‌Test‌ ‌
● This‌‌test‌‌is‌‌used‌‌to‌‌test‌‌a‌‌hypothesis‌‌about‌‌a‌p‌ roportion‌‌‌in‌‌a‌‌population‌ ‌
● Some‌‌examples‌‌could‌‌be:‌ ‌
○ Is the proportion of coin flips that are heads equal to 50%?
○ Is the proportion of CEOs that are female less than 50%?
○ Is the proportion of students that drop out of school greater than 25%?
H:‌‌Hypothesis‌ ‌
● The‌‌null‌‌hypothesis‌‌‌H 0 ‌postulates‌‌a‌‌certain‌‌expected‌‌value‌ ‌
● The‌‌alternative‌‌hypothesis‌‌‌H 1 ‌is‌‌that‌‌the‌‌underlying‌‌expected‌‌value‌‌is‌‌actually‌‌
different‌ ‌
● When‌‌performing‌‌a‌‌statistical‌‌test,‌‌we‌‌calculate‌‌the‌‌chance‌‌error‌‌under‌‌H 0 ‌and‌‌
weigh‌‌up‌‌whether‌‌its‌‌size‌‌is‌‌plausible‌‌or‌‌not‌ ‌
A:‌‌Assumption‌‌(for‌‌Z‌‌Test)‌ ‌
● Observation‌‌are‌‌independent‌‌‌of‌‌each‌‌other‌ ‌
● Sample‌‌mean‌‌(sample‌‌sum)‌‌follows‌‌a‌n ‌ ormal‌‌distribution‌‌‌or‌‌sample‌‌size‌‌is‌‌big‌‌
enough‌‌such‌‌that‌‌normality‌‌is‌‌approximately‌‌satisfied‌‌(from‌‌Central‌‌Limit‌‌
Theorem)‌ ‌
○ We‌‌do‌N ‌ OT‌‌‌need‌‌the‌‌data‌‌to‌‌be‌‌normal‌ ‌
○ But‌‌if‌‌the‌‌sample‌‌size‌‌is‌‌not‌‌big‌‌enough,‌‌then‌‌we‌‌can‌‌use‌‌that‌‌if‌‌the‌‌data‌‌is‌‌
approximately‌‌normal,‌‌the‌‌sample‌‌mean‌‌and‌‌sample‌‌sum‌‌will‌‌be‌‌
approximately‌‌normal‌‌as‌‌well‌ ‌
T:‌‌Test‌‌statistic‌‌(for‌‌Z‌‌test)‌ ‌
● A test statistic measures the difference between what is observed in the data and what is expected under the null hypothesis
● It takes the form
○ Test statistic = (observed value (OV) − expected value (EV)) ÷ standard error (SE) = chance error (CE) ÷ standard error (SE)
● NOTE:‌‌If‌‌the‌‌null‌‌hypothesis‌‌is‌‌true,‌‌then‌‌the‌‌test‌‌statistic‌‌follows‌‌a‌‌standard‌‌
normal‌‌curve:‌‌N(0,1)‌ ‌
P:‌‌p-value‌‌(for‌‌Z‌‌test)‌ ‌


● The p-value is the chance of observing the test statistic (or something more extreme), assuming H 0 is true:
○ In‌‌a‌‌Z‌‌test,‌‌the‌‌test‌‌statistic‌‌follows‌‌a‌‌standard‌‌normal‌‌curve,‌‌hence‌‌the‌‌
p-value‌‌is‌‌given‌‌by‌ ‌
■ p = P (Z ≥ ∣test statistic∣) ‌
○ Where‌‌Z ‌‌is‌‌a‌‌standard‌‌normal:‌‌Z ~N (0, 1) ‌
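● A small R sketch of a one-sided Z test for a proportion (the counts are invented for illustration):

    n <- 100; x <- 60; p0 <- 0.5     # 60 heads in 100 tosses, testing p = 0.5
    se <- sqrt(p0 * (1 - p0) / n)    # SE of the sample proportion under H0
    z <- (x / n - p0) / se           # test statistic
    pnorm(z, lower.tail = FALSE)     # one-sided p-value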
P:‌‌p-value‌‌(In‌‌general)‌ ‌
● In general (for all tests), the smaller the p-value, the less likely it is to observe a test statistic of the magnitude observed. If the p-value is small enough, it provides evidence to reject the null hypothesis, H 0.
● The convention is to reject the null hypothesis if p < α, where α is a predetermined significance level, often chosen as α = 0.05
● However, you don't need to follow this convention strictly, and arguably you shouldn't
Summary‌‌of‌‌the‌‌hypothesis‌‌test‌ ‌
● H: If p = proportion of patients who responded to the treatment, we test H 0: p = 0.8 vs H 1: p > 0.8
● A: We assume that the participants in the treatment group are independent of each other and, given a sample size of 29, the sample mean is approximately normal
● T: The test statistic for the observed sum is 1.3
● P: The p-value for this test statistic is 0.097
● C: As the p-value is greater than 0.05, we do not have enough evidence to reject the null hypothesis, and so the data does not provide strong evidence that p > 0.8
One-sided‌‌and‌‌Two-sided‌‌Tests‌ ‌
● 1‌‌sided:‌ ‌
○ Specifies‌‌the‌‌direction‌‌of‌‌the‌‌alternative‌‌hypothesis.‌‌Eg‌‌H 1 : p > 0.8 ‌
● 2‌‌sided:‌ ‌
○ Does not specify the direction of the alternative hypothesis. Eg H 1: p ≠ 0.8
○ In‌‌this‌‌case‌‌the‌‌p-value‌‌doubles‌ ‌
Summary‌ ‌
● Hypothesis‌‌testing‌‌‌is‌‌a‌‌scientific‌‌method‌‌for‌‌weighing‌‌up‌‌the‌‌evidence‌‌given‌‌in‌‌
the‌‌data‌‌against‌‌a‌‌given‌‌hypothesis‌‌(model).‌‌It‌‌involves‌‌the‌‌following‌‌parts:‌
○ H:‌‌Hypothesis‌‌H 0 ‌vs‌‌H 1 ‌
○ A:‌‌Assumptions‌ ‌
○ T:‌‌Test‌‌Statistic‌ ‌
○ P:‌‌p-value‌ ‌


○ C:‌‌Conclusion‌ ‌
● The‌‌Z‌‌test‌‌is‌‌used‌‌to‌‌test‌‌a‌‌hypothesis‌‌about‌‌a‌p
‌ roportion‌‌‌in‌‌a‌‌population‌ ‌

Lecture‌‌20‌ ‌
When‌‌to‌‌use‌‌the‌‌Z-test‌ ‌
● ‌To‌‌use‌‌the‌‌Z-test,‌‌we‌‌need‌‌to‌k
‌ now‌‌‌the‌‌population‌‌SD‌ ‌
○ One‌‌plausible‌‌case:‌‌Under‌‌H 0 ‌in‌‌a‌‌binary‌‌case‌‌assuming‌‌a‌‌proportion‌‌
(and‌‌using‌‌Box‌‌SD)‌ ‌
● Can‌‌we‌‌just‌e ‌ stimate‌‌‌the‌‌population‌‌SD‌‌using‌‌the‌‌sample‌‌SD?‌ ‌
○ Yes‌ ‌
○ But‌‌this‌‌estimation‌‌will‌‌add‌e ‌ xtra‌‌variability‌‌‌to‌‌the‌‌test‌‌statistic‌‌as‌‌the‌‌
sample‌‌SD‌‌varies‌‌from‌‌sample‌‌to‌‌sample‌ ‌
○ For‌‌large‌‌samples,‌‌the‌‌difference‌‌between‌‌the‌‌population‌‌and‌‌sample‌‌SD‌‌
should‌‌be‌‌small,‌‌and‌‌so‌‌the‌‌Z-test‌‌may‌‌be‌‌appropriate‌ ‌
○ For‌‌small‌‌samples,‌‌the‌‌difference‌‌will‌‌be‌‌more‌‌noticeable.‌‌Hence,‌‌we‌‌
should‌‌use‌‌the‌t‌ -Test‌‌‌instead‌ ‌
The‌‌t-Test‌ ‌
● W.S‌‌Gosset‌‌(1876-1936)‌‌invented‌‌a‌‌similar‌‌test‌‌to‌‌the‌‌Z-test,‌‌which‌‌uses‌‌the‌‌
sample‌‌SD‌‌and‌‌the‌‌t-distribution‌ ‌
● The t-distribution varies in shape according to the sample size. The smaller the sample size, the more variable the sample is, and hence the distribution of the test statistic will be "wider". The degree of "wideness" (also called degrees of freedom) depends on the sample size, and here it is n − 1. We write such a distribution as t_(n−1)
The t-distribution
● The t-distribution with ν degrees of freedom is written t_ν
● ν = ∞ results in the standard normal distribution
● The standardised chance error (if standardised with the sample SD) follows a t-distribution with ν = n − 1 = sample size − 1
Summary:‌‌T-Test‌ ‌
● H: H 0: population mean = μ vs H 1: population mean <, >, or ≠ μ
● A: Individuals are independent; sample size is large enough for the CLT (or population is normal)
● T: Test statistic = (observed mean − population mean) ÷ SE_hat, where SE_hat = sample SD ÷ sqrt(n)
● P:Use‌‌tn−1 ‌curve‌‌to‌‌find‌‌tail‌‌area‌‌for‌‌observed‌‌test‌‌statistic‌ ‌


○ 1-sided:‌P (tn−1 > ∣T est statistic∣) ‌
○ 2-sided:‌2 × P (tn−1 > ∣T est statistic∣) ‌
● C:‌‌Retain‌‌or‌‌Reject‌‌H 0 ‌
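● A minimal one-sample t-test sketch in R (invented data, testing a hypothesised mean of 10):

    x <- c(9.1, 10.4, 11.2, 8.7, 10.9, 9.8, 10.5, 11.0)
    t.test(x, mu = 10)                           # two-sided by default
    t.test(x, mu = 10, alternative = "greater")  # one-sided version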
Summary‌ ‌
● The‌‌t-test‌‌is‌‌used‌‌to‌‌decide‌‌whether‌‌an‌‌observed‌‌difference‌‌between‌‌data‌‌and‌‌
expected‌‌value‌‌is‌‌just‌‌due‌‌to‌‌chance‌‌error‌‌alone‌‌(the‌‌null‌‌hypothesis)‌‌or‌‌another‌‌
reason‌‌(alternative‌‌hypothesis)‌ ‌
● If‌‌the‌‌population‌‌SD‌‌is‌‌unknown,‌‌we‌‌use‌‌the‌‌t-test,especially‌‌in‌‌the‌‌case‌‌of‌‌small‌‌
samples‌ ‌
● The test statistic is:
○ (observed value − expected value) ÷ SE_hat
● We‌‌can‌‌also‌‌use‌‌the‌‌t-distribution‌‌to‌‌construct‌‌confidence‌‌intervals‌ ‌

Lecture‌‌21‌ ‌
Inference‌ ‌
● While visualisation of the data gives us an initial glimpse at the possible relationship between the two populations (those who have drunk a Red Bull and those who have not), we often want to make a decision on whether the mean of the two populations is the same or different.
● Inference‌‌‌is‌‌making‌‌a‌‌decision‌‌about‌‌population‌‌parameter(s)‌‌based‌‌on‌‌a‌‌
sample‌ ‌
2-Sample‌‌T-Test‌ ‌
● H:‌‌Hypothesis‌ ‌
○ μ1 ‌=‌‌mean‌‌heart‌‌rate‌‌of‌‌the‌‌control‌‌group‌ ‌
○ μ2 ‌=‌‌mean‌‌heart‌‌rate‌‌our‌‌treatment‌‌group‌ ‌
○ H 0 ‌:‌‌There‌‌is‌‌no‌‌difference:‌μ1 = μ2 ‌,‌‌or‌‌μ1 − μ2 = 0 ‌
○ H 1 ‌:‌‌There‌‌is‌‌a‌‌difference:‌μ1 =/ μ2 ‌,‌‌or‌‌μ1 − μ2 =/ 0 ‌
● A:‌‌Assumption‌ ‌
○ A1)‌‌All‌‌observed‌‌individuals‌‌are‌‌independent‌‌(within‌‌groups‌‌and‌‌between‌‌
different‌‌groups)‌ ‌
■ The‌‌two‌‌sample‌‌(Red‌‌Bull‌‌and‌‌Control)‌‌contain‌‌different‌‌people‌ ‌
● Note:‌‌This‌‌design‌‌differs‌‌from‌‌the‌‌caffeine‌‌one‌‌in‌‌which‌‌the‌‌
same‌‌person‌‌is‌‌tested‌‌at‌‌both‌‌0‌‌and‌‌13‌‌mg‌‌level‌‌of‌‌caffeine‌‌
and‌‌we‌‌consider‌‌the‌‌sample‌‌of‌‌differences‌‌from‌‌each‌‌pair‌ ‌


● The‌‌paired‌‌differences‌‌can‌‌eliminate‌‌personal‌‌effect‌‌on‌‌the‌‌
experimental‌‌result‌‌but‌‌it‌‌is‌‌also‌‌harder‌‌to‌‌find‌‌the‌‌same‌‌
person‌‌to‌‌undergo‌‌both‌‌treatment‌‌and‌‌control‌‌for‌‌some‌‌
experiments‌ ‌
○ A2)‌‌The‌‌sample‌‌means‌‌follow‌‌a‌‌Normal‌‌distribution‌ ‌
■ Our‌‌samples‌‌are‌‌quite‌‌small,‌‌so‌‌the‌‌Central‌‌limit‌‌Theorem‌‌might‌‌not‌‌
fully‌‌kick‌‌in.‌‌Hence,‌‌2‌‌sample‌‌t-Test‌‌is‌‌questionable‌ ‌
○ A3)‌‌The‌‌2‌‌populations‌‌have‌‌equal‌‌spread‌‌(SD/variance)‌ ‌
■ We‌‌assume‌‌that‌‌the‌‌2‌‌populations‌‌have‌‌the‌s ‌ ame‌‌variation‌‌‌in‌‌
heart‌‌rate‌ ‌
■ Check:‌‌Box‌‌Plots,‌‌Histograms,‌‌Variance‌‌Test‌ ‌
● Box‌‌plots‌‌show‌‌that‌‌RB‌‌seems‌‌to‌‌have‌‌smaller‌‌sd.‌‌But‌‌
difference‌‌might‌‌not‌‌be‌‌significant‌ ‌
■ Better:‌‌This‌‌assumption‌‌can‌‌be‌‌relaxed‌‌by‌‌using‌‌the‌W ‌ elch‌‌2‌‌
sample‌‌T-Test‌ ‌
● T: Test Statistic
○ Equal variance
■ We compare 2 populations. Our observed value is the difference in sample means. Our null hypothesis is that there is no difference in population means
● test statistic = (OV − EV) ÷ SE_hat = (x̄1 − x̄2 − 0) ÷ SE_hat, where
SE_hat = sqrt( SD²_p × (1/n1 + 1/n2) ), df = n1 + n2 − 2, based on the pooled sample SD, where SD²_p = ( (n1 − 1)SD²_1 + (n2 − 1)SD²_2 ) ÷ (n1 + n2 − 2)
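● A short R sketch of the equal-variance two-sample t-test (the heart-rate vectors are invented for illustration):

    control <- c(68, 72, 75, 70, 74, 69, 71)
    redbull <- c(74, 78, 76, 80, 77, 75, 79)
    t.test(redbull, control, var.equal = TRUE)   # pooled-SD two-sample t-test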
Summary‌ ‌
● The‌‌2‌‌Sample‌‌T-Test‌‌‌is‌‌used‌‌to‌‌test‌‌for‌‌the‌d
‌ ifference‌‌in‌‌means‌‌‌of‌‌two‌‌
populations‌ ‌
● We‌‌need‌‌to‌‌assume‌‌that:‌ ‌
○ All‌‌observed‌‌individuals‌‌are‌i‌ndependent‌ ‌
○ The‌‌sample‌‌means‌‌follow‌‌a‌‌Normal‌‌distribution‌ ‌
○ The‌‌2‌‌populations‌‌have‌e ‌ qual‌‌spread‌‌‌(SD/variance)‌ ‌
● We‌‌can‌‌relax‌‌the‌‌final‌‌assumption‌‌by‌‌using‌‌a‌W ‌ elch‌‌Two-Sample‌‌T-Test‌ ‌

Lecture‌‌22‌ ‌
Welch‌‌2-Sample‌‌T-test‌ ‌
● Welch‌‌2-Sample‌‌T-test‌‌has‌‌a‌‌different‌‌SE‌‌and‌‌df‌‌formula‌‌compared‌‌to‌‌the‌‌
standard‌‌two‌‌sample‌‌t-test‌ ‌


● Standard error and df formula for the difference with unequal variance:
○ SE = sqrt( s1²/n1 + s2²/n2 )
○ df = ( s1²/n1 + s2²/n2 )² ÷ [ (s1²/n1)² ÷ (n1 − 1) + (s2²/n2)² ÷ (n2 − 1) ]
■ where s_k = SD_hat(Sample_k), k = 1, 2
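● In R, t.test() performs the Welch version by default (same invented heart-rate vectors as in the sketch above):

    control <- c(68, 72, 75, 70, 74, 69, 71)
    redbull <- c(74, 78, 76, 80, 77, 75, 79)
    t.test(redbull, control)   # Welch two-sample t-test; unequal variances is R's default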
Non-Independent‌‌Data‌‌(Paired‌‌T-test)‌ ‌
● Sometimes‌‌it‌‌is‌‌desirable‌‌to‌‌analyse‌‌dependent‌‌‌data.‌‌We‌‌often‌‌design‌‌an‌‌
experiment‌‌to‌‌take‌‌advantage‌‌of‌‌this‌‌dependency‌‌in‌‌order‌‌to‌‌control‌‌variation‌‌
between‌‌experimental‌‌groups‌‌ ‌
Summary‌ ‌
● We can perform a Welch Two-Sample t-test to compare two populations with different variances
● We can perform a Paired t-test (one sample t-test on the differences) if we have paired data
● We can perform a t-test for the slope of a regression line

You might also like